Sparse Graph Learning from Spatiotemporal Time Series
Cini, Zambon, and Alippi (2023)
Abstract
Outstanding achievements of graph neural networks for spatiotemporal time series analysis
show that relational constraints introduce an effective inductive bias into neural forecasting
architectures. Often, however, the relational information characterizing the underlying data-
generating process is unavailable and the practitioner is left with the problem of inferring
from data which relational graph to use in the subsequent processing stages. We propose
novel, principled—yet practical—probabilistic score-based methods that learn the relational
dependencies as distributions over graphs while maximizing, end to end, the performance on
the task at hand. The proposed graph learning framework is based on consolidated variance reduction
techniques for Monte Carlo score-based gradient estimation, is theoretically grounded, and,
as we show, effective in practice. In this paper, we focus on the time series forecasting
problem and show that, by tailoring the gradient estimators to the graph learning problem,
we are able to achieve state-of-the-art performance while controlling the sparsity of the
learned graph and the computational scalability. We empirically assess the effectiveness of
the proposed method on synthetic and real-world benchmarks, showing that the proposed
solution can be used as a stand-alone graph identification procedure as well as a graph
learning component of an end-to-end forecasting architecture.
Keywords: graph learning, spatiotemporal data, graph-based forecasting, time series
forecasting, score-based learning, graph neural networks
1 Introduction
Traditional statistical and signal processing methods for time series analysis leverage
temporal dependencies to model data generating processes (Harvey et al., 1990). Graph
signal processing methods extend these approaches to dependencies observed both in time
and space, i.e., to the setting where temporal signals are observed over the nodes of a
graph (Ortega et al., 2018; Stanković et al., 2020; Di Lorenzo et al., 2018; Isufi et al., 2019).
The key ingredient here is the use of graph shift operators, constructed from the graph
adjacency matrix, which localize the learned filters on the graph structure. The same holds true
for graph deep learning methods that have revolutionized the landscape of machine learning
for graphs (Bruna et al., 2014; Bronstein et al., 2017; Bacciu et al., 2020; Bronstein et al.,
2021). However, it is often the case that no prior topological information about the reference
graph is available, or that dependencies in the dynamics observed at different locations are
not well modeled by the available spatial information (e.g., the physical proximity of the
sensors). Examples can be found in social networks, smart grids, and brain networks, just to
name a few relevant application domains.
The interest in the graph learning problem, in the context of spatiotemporal time series
processing, indeed arises from many practical and theoretical concerns. In the first place,
learning existing relationships among time series that better explain an observed phenomenon
is worth the investigation on its own; as a matter of fact, graph identification is a well-known
problem in graph signal processing (Mei and Moura, 2016; Variddhisai and Mandic, 2020). In
the deep learning setting, several methods train, end-to-end, a graph learning module with a
neural forecasting architecture to maximize performance on the downstream task (Shang and
Chen, 2021; Wu et al., 2020). A typical deep learning approach consists in exploiting spatial
attention mechanisms to discover the reciprocal salience of different spatial locations at each
layer (Satorras et al., 2022; Rampášek et al., 2022). Graph learning, in this context, can then
be seen as a regularization of Transformer-like models (Vaswani et al., 2017); regularization
that comes in the form of the relational inductive biases typical of graph processing methods:
namely, the sparsity of the pairwise relationships between nodes and the locality of the
learned representations. In fact, despite their effectiveness, pure attention-based approaches
impair two major benefits of graph-based learning: they (1) do not allow for the sparse
computation enabled by the discrete nature of graphs and (2) do not take advantage of the
structure, introduced by the graph topology, as an inductive bias for the learning system.
Indeed, sparse computation allows graph neural networks (GNNs; Scarselli et al. 2008, Bacciu
et al. 2020) with message-passing architectures (Gilmer et al., 2017) to scale in terms of
network depth and the size of the graphs that can be processed. At the same time,
sparse graphs constrain learned representations to be localized in node space and mitigate
over-fitting spurious correlations in the training data. Graph learning approaches that do
attempt to learn relational structures from time series exist, but often rely on continuous
relaxations of the binary adjacency matrix and, as a consequence, on dense computations to
enable automatic reverse-mode differentiation through any subsequent processing (Shang and
Chen, 2021; Kipf et al., 2018). Conversely, other solutions make the computation sparse (Wu
et al., 2020; Deng and Hooi, 2021) at the expense of the quality of the gradient estimates
as shown by Zügner et al. (2021). The challenge is, then, to provide accurate gradients
while, at the same time, allowing for sparse computations in the downstream message-passing
operations, typical of modern GNNs.
In this paper, we address the graph learning problem and model it from a probabilistic
perspective which, besides naturally accounting for uncertainty and the embedding of priors,
enables the learning of sparse graphs as realizations of a discrete probability distribution.
In particular, given a set of time series, we seek to learn a parametric distribution pθ such
that graphs sampled from pθ maximize the performance on the given downstream task,
e.g., multistep-ahead forecasting. As an example, consider a cost function δt(·) (e.g., the
forecasting accuracy) associated with each time step t and dependent on the inferred graph.
The core challenge in learning pθ to minimize the expected cost lies in estimating the gradient
$$\nabla_\theta\, \mathbb{E}_{A \sim p_\theta}\!\left[\delta_t(A)\right] \qquad (1)$$
of the expected value of the cost function δt(A) w.r.t. the distributional parameters θ, the
sampling of a random graph (adjacency matrix A) from pθ, and a given batch of input-output
data pairs corresponding to observations at time step t. Previous works proposing probabilistic
methods (Shang and Chen, 2021; Kipf et al., 2018) learn pθ with path-wise gradient estima-
tors (Glasserman and Ho, 1991; Kingma and Welling, 2013), i.e., by reparametrizing A ∼ pθ
as A = g(ε, θ), with deterministic function g decoupling parameters θ from the (parameter-
free) random component ε ∼ p0 . However, these approaches imply approximating discrete
distributions with a softmax continuous relaxation (Paulus et al., 2020) which makes all the
downstream computations dense and quadratic in the number of nodes. In contrast, here we
adopt the framework of score-function (SF) gradient estimators (Rubinstein, 1969; Williams,
1992; Mohamed et al., 2020) by relying on the rewriting of Equation (1) as
$$\nabla_\theta\, \mathbb{E}_{A \sim p_\theta}\!\left[\delta_t(A)\right] = \mathbb{E}_{A \sim p_\theta}\!\left[\delta_t(A)\, \nabla_\theta \log p_\theta(A)\right],$$
which, as we detail in Section 5.1, allows us to preserve the sparsity of the sampled graphs
and the scalability of the subsequent processing steps (e.g., the forward and backward passes
of a message-passing network). In particular, our contributions are as follows.
• We propose a novel and effective, yet simple to implement, variance reduction method
for the estimators [Section 6] based on the Fréchet mean graph w.r.t. the proposed
distributions, for which we provide closed-form solutions [Propositions 1 and 3]. Our
method does not require the estimation of additional parameters and, differently from
more general-purpose approaches (e.g., see Mnih and Gregor (2014)), is as expensive
as taking a sample from the considered distributions and evaluating the corresponding
cost function.
Empirical results demonstrate that the techniques introduced here enable the use of score-
based estimators to learn graphs from spatiotemporal time series; furthermore, experiments
on time series forecasting benchmarks show that our approach compares favorably w.r.t.
the state of the art. We strongly believe that our approach constitutes an effective addition
to the practitioner's toolbox for designing novel, even more effective, graph-based time
series processing architectures.
The paper is organized as follows. Section 2 discusses related works. Section 3 introduces
relevant background material; Section 4 provides the formulation of the problem. We present
the proposed parametrizations of pθ and related gradient estimators in Section 5 and the
associated variance reduction techniques in Section 6. The proposed rewriting of the gradient
and approximated objective are derived and discussed in Section 7. Finally, the empirical
evaluation of the proposed method is given in Section 8 and conclusions are presented in
Section 9.
2 Related Works
Graph neural networks have become increasingly popular in spatiotemporal time series
processing (Seo et al., 2018; Li et al., 2018; Yu et al., 2018; Wu et al., 2019; Deng and Hooi,
2021; Cini et al., 2022; Marisca et al., 2022; Wu et al., 2022) and the graph learning problem
is well-known within this context. Wu et al. (2019) propose Graph WaveNet, an architecture
for time series forecasting that learns a weighted adjacency matrix A = σ(E1 E2⊤) from
the factorization of node embedding matrices E1, E2. Several other methods follow
this direction (Bai et al., 2020; Oreshkin et al., 2021). Satorras et al. (2022) showed that
hierarchical attention-based architectures are effective in accounting for dependencies among
spatiotemporal time series to obtain accurate predictions in the downstream task. However,
all the aforementioned approaches generally lead to dense graphs and cannot, therefore,
exploit the sparsity and locality priors—and computational scalability—typical of graph-
based machine learning. To address this issue, MTGNN (Wu et al., 2020) and GDN (Deng
and Hooi, 2021) sparsify the learned factorized adjacency by selecting, for each node, the K
edges associated with the largest weights. Using hard top-k operators, however, results in
sparse gradients and has differentiability issues that can undermine the effectiveness of the
learning procedure. More recently, Zhang et al. (2022) proposed a different approach based
on the idea of sparsifying the learned graph by thresholding the average of learned attention
scores across time steps.
Among probabilistic models, Franceschi et al. (2019) tackle the graph learning problem for
non-temporal data by using a bi-level optimization routine and a straight-through gradient
trick (Bengio et al., 2013) which, nonetheless, requires dense computations. The NRI
approach, introduced by Kipf et al. (2018), learns a latent variable model predicting the
interactions of physical objects by learning edge attributes of a fully connected (dense)
graph. GTS (Shang and Chen, 2021) simplifies the NRI module by considering binary
relationships only and integrates graph inference in a spatiotemporal recurrent graph neural
network (Li et al., 2018). Both NRI and GTS exploit path-wise gradient estimators based
on the categorical Gumbel trick (Maddison et al., 2017; Jang et al., 2017) and, as such,
rely on continuous relaxations of discrete distributions and suffer from the computational
setbacks anticipated in the introduction. Finally, the graph learning module proposed by
Kazi et al. (2022) uses the Gumbel-Top-K trick (Kool et al., 2019) to sample a K-nearest
neighbors (K-NN) graph, where node scores are learned by using a heuristic for increasing
the likelihood of sampling edges that contribute to correct classifications.
Besides applications in graph-based processing, the problem of learning discrete structures
has been widely studied in deep learning and general machine learning (Niculae et al., 2023).
As alternatives to methods relying on continuous relaxations and path-wise estimators (Jang
et al., 2017; Maddison et al., 2017; Paulus et al., 2020), several approaches tackled the problem
by exploiting score-based estimators and variance reduction techniques, e.g., based on control
variates derived from continuous relaxations (Tucker et al., 2017; Grathwohl et al., 2018) and
data-driven baselines (Mnih and Gregor, 2014). In particular, related to our method, Rennie
et al. (2017) use a greedy baseline based on the mode of the distribution being learned,
while Kool et al. (2020) construct a variance-reduced estimator based on sampling without
replacement from the discrete distribution. Beyond score-based and path-wise methods,
Correia et al. (2020) take a different approach by considering sparse distributions where
analytically computing the gradient becomes tractable. Niepert et al. (2021) introduce a
class of (biased) estimators, based on maximum-likelihood estimation, that generalize the
straight-through estimator (Bengio et al., 2013) to more complex distributions; Minervini
et al. (2023) make such estimators adaptive to balance the bias of the estimator and the
sparsity of the gradients. We refer to Mohamed et al. (2020) and Niculae et al. (2023) for an
in-depth discussion of the topic. None of these methods specifically targets graph distributions,
nor considers the sparsity of the downstream computations as a requirement.
To the best of our knowledge, we are the first to propose a spatiotemporal graph learning
module that relies on variance-reduced score-based gradient estimators specifically tailored for
graph-based processing, and allowing for sparse computation in both training and inference
phases of message-passing neural networks.
3 Preliminaries
The section introduces some preliminary concepts and provides the reference models and the
notions regarding distributions over graphs needed to support the theoretical and technical
derivations presented in the next sections.
where $z_t^{i,(l)}$ indicates the representation of the i-th node at layer l; N(i) is the set of its
neighboring nodes, and $e_{i,j}$ are the features associated with the edge connecting the j-th to
the i-th node. Update and message functions, ρ and γ, respectively, can be implemented
by any differentiable function—e.g., a multilayer perceptron—while Aggr{·} indicates a
generic permutation invariant aggregation function. By considering a graph-wise operator,
the l-th message-passing neural network layer (MPNN) of the—possibly deep—architecture
where the notation is consistent with that of Equation (3). Examples of spatiotemporal
graph processing models that fall into the time-then-space category are NRI (Kipf et al.,
2018) and the encoder-decoder architecture introduced by Satorras et al. (2022).
Time-and-space models. Time-and-space (T&S) models are a general class of STGNNs where
representations are processed by operators acting jointly along the time and space dimensions.
A large subset of this family of models can be seen as performing the following operations
$$Z^{(0)}_{t-W:t} = \left[ X_{t-W:t} \,\|\, U_{t-W:t} \,\|\, V \right], \qquad (8)$$
finally followed by
$$\widehat{Y}_{t:t+H} = \operatorname{Readout}\!\left( \operatorname{Aggr}\!\left\{ Z^{(L)}_{t-W}, \ldots, Z^{(L)}_{t-1} \right\} \right), \qquad (11)$$
associated with distribution p and the squared Euclidean distance $\|\cdot\|_2^2$. Following Equations 12 and 13, we can derive a generalized definition of mean applicable to non-Euclidean data, like graphs and sparse adjacency matrices. We comment that, following this line, we can extend these results also to the sample mean $\frac{1}{M}\sum_{m=1}^{M} x_m$ of a finite sample D = {x₁, . . . , x_M}, and define accordingly the Fréchet sample mean of a sample of non-Euclidean data.
Consider, then, the space $\mathcal{A} \subseteq \{0,1\}^{N \times N}$ of adjacency matrices A over the node (sensor) set S, each of which representing a graph topology over S; for instance, for undirected graphs, $\mathcal{A}$ is the subset of $\{0,1\}^{N \times N}$ of symmetric matrices, whereas for directed k-NN graphs
$$\mathcal{A} = \left\{ A \in \{0,1\}^{N \times N} : \sum_{j=1}^{N} A_{i,j} = k, \ \forall\, i \right\}. \qquad (14)$$
The Hamming distance between two adjacency matrices is defined as
$$H(A, A') \triangleq \sum_{i,j} \mathbb{I}\!\left( A_{i,j} \neq A'_{i,j} \right), \qquad (15)$$
where $A, A' \in \mathcal{A}$ and $\mathbb{I}$ is the indicator function such that $\mathbb{I}(a) = 1$ if a is true, and 0 otherwise.
The Hamming distance counts the number of mismatches between the entries of A and A′,
and is then a natural choice to measure the dissimilarity between two graphs.
We define the Fréchet function over the space (A, H), and the random adjacency matrix A ∼ p, for all A′ ∈ A as
$$F_H(A') \triangleq \mathbb{E}_{A \sim p}\!\left[ H(A', A) \right]. \qquad (16)$$
According to Equation 12, we then define as Fréchet mean adjacency matrix any matrix
$$A^{\mu} \in \operatorname*{arg\,min}_{A' \in \mathcal{A}} F_H(A'). \qquad (17)$$
A matrix Aµ always exists in A, as A is a finite set, but, in general, is not unique. Conditions
for the uniqueness of the Fréchet mean in the context of graph-structured data have been
studied in the literature, e.g., by Jain (2016). Throughout the paper, we use the term
“Fréchet mean” referring to any Fréchet mean of a given distribution.
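As a concrete illustration (not code from the paper), the Fréchet sample mean of a finite sample of binary adjacency matrices under the Hamming distance reduces to an element-wise majority vote; the short sketch below makes this explicit.

```python
import torch

def frechet_sample_mean(adjs: torch.Tensor) -> torch.Tensor:
    """Element-wise majority vote over a sample of binary adjacency matrices.

    adjs: tensor of shape (M, N, N) with entries in {0, 1}.
    Returns a binary (N, N) matrix minimizing the average Hamming distance
    to the sample (ties broken towards 1).
    """
    mu = adjs.float().mean(dim=0)          # empirical edge probabilities
    return (mu >= 0.5).to(adjs.dtype)      # round to the closest binary matrix

# Tiny usage example with M = 3 sampled graphs over N = 2 nodes.
sample = torch.tensor([[[0, 1], [1, 0]],
                       [[0, 1], [0, 0]],
                       [[1, 1], [1, 0]]])
print(frechet_sample_mean(sample))  # tensor([[0, 1], [1, 0]])
```

This element-wise rounding is the same mechanism exploited later in Section 6 to compute Fréchet means in closed form.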
4 Problem Formulation
This section provides a probabilistic formulation of the graph learning problem in spatiotem-
poral time series and defines the operational framework in which we operate.
where Lt (ψ, θ) is the optimization objective at time step t expressed as the expectation, over
the graph distribution pθ , of a cost—loss—function δt (At ; ψ), typically a p-norm
with, e.g., p = 1 or 2. Note that in Equation (18) the distribution of At at time step t is
conditioned on the most recent observations Xt−W :t , hence modeling a scenario associated
with a dynamic graph distribution [Section 3.1]. A static graph scenario follows by simply
removing the conditioning on Xt−W :t . We consider a generic family of predictive models Fψ
implemented by STGNNs based on the message-passing framework and following either the
TTS or the T&S paradigm to process information along space and time. Other architectures
can be considered. Notably, Fψ can be suitably designed in order to exchange messages w.r.t.
a different graph A(l) at each MP layer. Section 7 provides a thorough discussion of this
setup.
In this setting, the model family and the downstream task impact the type of
relationships being learned. For example, linear and nonlinear models will yield different
results that depend also on the number of layers and the choice of MP operators, e.g.,
standard graph convolutions against anisotropic message-passing layers such as those used in
graph attention networks (Veličković et al., 2018). Ultimately, the learned graph distribution
is the one that best explains the observed data given the architecture of the predictive
model and the family of graph distributions. Different parametrizations of pθ allow the practitioner to embed different inductive biases (such as sparsity) as structural priors into the processing.
In this section, we present our approach to probabilistic graph learning. After introducing
score-based gradient estimators [Section 5.1], we propose two graph distribution models [Sec-
tion 5.2] and comment on their practical implementations [Section 5.3]. The problem of
controlling the variance of the estimators is discussed together with novel and principled
variance reduction techniques tailored to graph-based architectures [Section 6]. Finally, we
provide a convenient rewriting of the gradient for L-layered MP architectures leading to a
novel surrogate loss [Section 7]. Figure 1 provides a schematic overview of the framework.
In particular, the block on the left shows the graph learning module, where A is sampled
from pθ ; as the figure suggests, depending on the parametrization of pθ , some components of
A can be sampled independently. The bottom of the figure, instead, shows the predictive
model Fψ that, given the sampled graph and the input window, outputs the predictions used
to estimate Lt (ψ, θ), whose gradient provides the learning signals.
Figure 1: Overview of the learning architecture. The graph learning module samples a graph
used to propagate information along the spatial dimension in Fψ ; predictions and
samples are used to compute costs and log-likelihoods. Gradient estimates are
propagated back to the respective modules.
$$\nabla_\theta\, \mathbb{E}_{p_\theta}\!\left[ f(x) \right] = \mathbb{E}_{p_\theta}\!\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right], \qquad (22)$$
which holds—under mild assumptions¹—for generic cost functions f and distributions pθ.
The rewriting of ∇θ Epθ [f (x)] in terms of the gradient of the score function log pθ ( · ) allows
for estimating the gradients easily by MC sampling and backpropagating them through the
computation of the score function. SF estimators are black-box optimization methods, i.e.,
they only require pointwise evaluations of the cost f(x), which does not necessarily need to be
differentiable w.r.t. parameters θ. In our setup, by assuming disjoint ψ and θ, Equation (22)
becomes
$$\nabla_\theta L_t(\psi, \theta) = \nabla_\theta\, \mathbb{E}_{p_\theta}\!\left[ \delta_t(A; \psi) \right] = \mathbb{E}_{p_\theta}\!\left[ \delta_t(A; \psi)\, \nabla_\theta \log p_\theta(A) \right], \qquad (23)$$
allowing for computing gradients w.r.t. the graph generative process without requiring a full
evaluation of all the stochastic nodes in the CG.
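To make the estimator concrete, the following sketch (an illustrative PyTorch fragment under our assumptions, not the paper's implementation) shows how Equation (23) translates into a Monte Carlo surrogate: the cost is treated as a black box, and gradients w.r.t. the distributional parameters flow only through the log-likelihood of the sampled graphs. The callables `sample_graph`, `log_prob`, and `cost_fn` are placeholders for the components introduced in the following sections.

```python
import torch

def sf_gradient_estimate(theta, sample_graph, log_prob, cost_fn, n_samples=4):
    """Monte Carlo score-function estimate of the gradient in Equation (23).

    theta: tensor of distributional parameters with requires_grad=True.
    sample_graph(theta): draws a binary adjacency matrix A (no gradient needed).
    log_prob(A, theta):  log p_theta(A), differentiable w.r.t. theta.
    cost_fn(A):          scalar cost delta_t(A; psi), treated as a black box.
    """
    surrogate = 0.0
    for _ in range(n_samples):
        A = sample_graph(theta).detach()      # sparse, non-differentiable sample
        with torch.no_grad():
            cost = cost_fn(A)                 # pointwise evaluation of the cost
        surrogate = surrogate + cost * log_prob(A, theta) / n_samples
    # Differentiating the surrogate w.r.t. theta yields the SF gradient estimate.
    return torch.autograd.grad(surrogate, theta)[0]
```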
Sparse computation. Path-wise gradient estimators tackle the problem of estimating
the gradient ∇θ Epθ [δt (A; ψ)] by exploiting continuous relaxations of the discrete pθ , thus
estimating the gradient by differentiating through all nodes of the stochastic CG. Defined
1. The identity is valid as long as pθ and f allow for the interchange of differentiation and integration in
Equation (21); see L’Ecuyer (1995); Mohamed et al. (2020).
The distribution pθ should be chosen so that one can (i) efficiently sample graphs and evaluate their likelihood and (ii) backpropagate the errors through the computation of the score [Equation (23)] to parameters θ. In the following, we consider graph distributions s.t. each stochastic edge
j → i is associated with a weight Φi,j . The considered distributional parameters Φ ∈ RN ×N
can then be learned as a function of the learnable parameters θ. In the case of static graphs,
we can directly consider Φ = θ; however, to account for the dynamic case, more complex
parametrizations are possible, e.g., by exploiting amortized inference to condition distribution
pθ on the observed values. Further discussion is deferred to the end of the section.
$$\log p_\theta(A) = \sum_{i,j=1}^{N} \Big[ A_{i,j} \log\!\big(\sigma(\Phi_{i,j})\big) + (1 - A_{i,j}) \log\!\big(1 - \sigma(\Phi_{i,j})\big) \Big]. \qquad (24)$$
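For reference, a possible implementation of sampling from this distribution and evaluating the log-likelihood of Equation (24) is sketched below (a minimal PyTorch illustration, not the authors' code).

```python
import torch

class BES(torch.nn.Module):
    """Binary edge sampler: independent Bernoulli edges with logits Phi."""

    def __init__(self, num_nodes: int):
        super().__init__()
        self.phi = torch.nn.Parameter(torch.zeros(num_nodes, num_nodes))

    def sample(self) -> torch.Tensor:
        # Each edge j -> i is drawn independently with probability sigma(Phi_ij).
        return torch.bernoulli(torch.sigmoid(self.phi)).detach()

    def log_prob(self, adj: torch.Tensor) -> torch.Tensor:
        # Equation (24): sum of Bernoulli log-likelihoods over all node pairs.
        return -torch.nn.functional.binary_cross_entropy_with_logits(
            self.phi, adj.float(), reduction="sum")
```

Note that each row of a BES sample can contain an arbitrary number of edges, whereas the SNS distribution discussed next constrains it to K by construction.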
where $\vec{S}_K$ denotes an ordered sample without replacement and $\mathcal{P}(S_K)$ is the set of all the permutations of $S_K$.
Sampling. Sampling can be done efficiently by exploiting the Gumbel-top-k trick (Kool
et al., 2019). Accordingly, we consider the parameter vector ϕ = Φn,: and denote with
[Gϕ1 , . . . , GϕN ] the associated random vector of independent Gumbel random variables
Gϕj ∼ Gumbel(ϕj ); given a realization thereof [g1 , . . . , gN ], it is possible to show that
SK = arg top-K{gi : i ∈ S} follows the desired distribution (Kool et al., 2019).
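A minimal sketch of this sampling step is given below (illustrative; function and variable names are ours).

```python
import torch

def sample_sns_neighborhoods(phi: torch.Tensor, k: int) -> torch.Tensor:
    """Sample K neighbors per node with the Gumbel-top-k trick.

    phi: score matrix of shape (N, N); row n holds the scores of node n's
         candidate neighbors. Returns a binary (N, N) adjacency matrix with
         exactly K ones per row.
    """
    # G_phi ~ Gumbel(phi): add standard Gumbel noise to the scores.
    gumbel = -torch.log(-torch.log(torch.rand_like(phi)))
    topk = torch.topk(phi + gumbel, k, dim=-1).indices
    adj = torch.zeros_like(phi)
    adj.scatter_(-1, topk, 1.0)
    return adj
```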
Log-likelihood evaluation. Evaluating the score function is more challenging; in fact,
Equation (25) shows that directly computing pθ (SK |n) requires marginalizing over all the
possible K! orderings of SK. While exploiting the Gumbel-max trick can bring the computation
down to $O(2^K)$ (Huijben et al., 2022; Kool et al., 2020), exact computation remains
intractable for any practical application. Luckily, pθ(SK|n) can be approximated efficiently
using numerical integration. Following the notation of Kool et al. (2019, 2020), for a subset B ⊆ S we define
$$\operatorname{LogSumExp}_{i \in B}(\phi_i) \triangleq \log\!\left( \sum_{i \in B} \exp \phi_i \right), \qquad (26)$$
we use the notation $\phi_B = \operatorname{LogSumExp}_{i \in B} \phi_i$, and indicate with $f_u$ and $F_u$ the p.d.f. and c.d.f., respectively, of a Gumbel random variable Gumbel(u) with location parameter u. Recall that $F_u(z) = \exp(-\exp(-z + u))$ and the following property of Gumbel random variables:
$$G_{\phi_B} \triangleq \max_{i \in B} G_{\phi_i} \sim \operatorname{Gumbel}(\phi_B). \qquad (27)$$
With a derivation analogous to that of Kool et al. (2020), Equation (25) can be conveniently
rewritten by exploiting the property shown in Equation (27) as:
$$p_\theta(S_K \mid n) = P\!\left( \min_{i \in S_K} G_{\phi_i} > \max_{i \in S \setminus S_K} G_{\phi_i} \right) \qquad (28)$$
$$= P\!\left( G_{\phi_i} > G_{\phi_{S \setminus S_K}},\ \forall\, i \in S_K \right) \qquad (29)$$
$$= \int_{-\infty}^{\infty} \prod_{i \in S_K} \left( 1 - F_{\phi_i}(g) \right) f_{\phi_{S \setminus S_K}}(g)\, dg. \qquad (30)$$
With an appropriate change of variables (details in Appendix B), the integral can be rewritten as
$$p_\theta(S_K \mid n) = \exp\!\left( \phi_{S \setminus S_K} + c \right) \int_0^1 u^{\exp(\phi_{S \setminus S_K} + c) - 1} \prod_{i \in S_K} \left( 1 - u^{\exp(\phi_i + c)} \right) du, \qquad (31)$$
with M trapezoids and equally spaced intervals of length ∆u; the integrands are computed
in log-space—with a computational complexity of O(M K)—for numeric stability. The
expression in Equation (32) provides, then, a differentiable numeric approximation of the
SNS log-likelihood which can be used for backpropagation.
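The sketch below illustrates one possible implementation of this numerical approximation (with c = 0 and a plain Riemann sum over an interior grid instead of the exact trapezoidal rule of Equation (32)); it is an illustration of the idea, not the authors' code, and gradients flow to the scores through the integrand.

```python
import torch

def sns_log_prob(phi_row: torch.Tensor, neighbors: torch.Tensor,
                 num_points: int = 100) -> torch.Tensor:
    """Differentiable approximation of log p_theta(S_K | n), Eq. (31) with c = 0.

    phi_row: scores of all candidate neighbors of node n, shape (N,).
    neighbors: indices of the sampled subset S_K, shape (K,), with K < N.
    """
    mask = torch.zeros_like(phi_row, dtype=torch.bool)
    mask[neighbors] = True
    phi_sel = phi_row[mask]                             # phi_i, i in S_K
    phi_rest = torch.logsumexp(phi_row[~mask], dim=0)   # phi_{S \ S_K}
    # Interior grid on (0, 1): avoids the singular endpoints of the integrand.
    u = torch.linspace(0.0, 1.0, num_points + 2, dtype=phi_row.dtype)[1:-1]
    log_u = torch.log(u)
    # log of  u^{exp(phi_rest) - 1} * prod_i (1 - u^{exp(phi_i)})
    log_integrand = (torch.exp(phi_rest) - 1) * log_u
    log_integrand = log_integrand + torch.log1p(
        -torch.exp(torch.exp(phi_sel)[:, None] * log_u[None, :])).sum(0)
    # Riemann sum over the interior grid, computed in log-space for stability.
    log_integral = torch.logsumexp(log_integrand, dim=0) + torch.log(u[1] - u[0])
    return phi_rest + log_integral
```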
As previously discussed, the proposed SNS method allows for embedding structural priors
on the sparsity of the latent graph directly into the generative model. Fixing the number K
of neighbors might, however, introduce an irreducible approximation error when learning
graphs with nodes characterized by a variable number of neighbors. We solve this problem
by adding dummy nodes.
Adaptive number of neighbors. Given K, we add up to K − 1 dummy nodes to set
S (i.e. the set of candidate neighbors) and expand matrix Φ accordingly. At this point, a
neighborhood of exactly K nodes can be sampled and the log-likelihood evaluated according
to the procedure described above; however, dummy nodes are discarded to obtain the N × N
adjacency matrix A. By doing so, hyperparameter K can also be used to cap the maximum
number of edges and set a minimum sparsity threshold. The resulting computational
complexity in the subsequent MP layers is at most O(N K).
where Encoder(·) indicates a generic encoding function for the input window (e.g., an
MLP or an RNN), σ a nonlinear activation function, $W \in \mathbb{R}^{d \times 2d_h}$ is a learnable weight
matrix, $b \in \mathbb{R}^{d}$ a learnable bias, and $a \in \mathbb{R}^{d}$ the learnable parameters of the output linear
transformation.
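As an example of such an amortized parametrization, the sketch below scores each candidate edge j → i from the concatenated window encodings of the two nodes; it is one plausible reading of the description above (the corresponding equation is omitted here), and all module choices and dimensions should be read as assumptions.

```python
import torch
from torch import nn

class AmortizedEdgeScores(nn.Module):
    """Edge scores Phi computed from window encodings (illustrative sketch;
    encoder type, activation, and dimensions are assumptions)."""

    def __init__(self, in_size: int, hidden_size: int, score_size: int):
        super().__init__()
        # Generic encoder of the input window; an MLP would also fit.
        self.encoder = nn.GRU(in_size, hidden_size, batch_first=True)
        self.W = nn.Linear(2 * hidden_size, score_size)   # W, b
        self.a = nn.Linear(score_size, 1, bias=False)     # output projection a
        self.activation = nn.SiLU()                       # sigma (Swish)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, window, in_size) -> h: (num_nodes, hidden_size)
        _, h = self.encoder(x)
        h = h.squeeze(0)
        n = h.size(0)
        # Concatenate the encodings of every (receiver i, sender j) pair.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.a(self.activation(self.W(pairs))).squeeze(-1)  # Phi: (n, n)
```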
$$\beta_* \triangleq \frac{\operatorname{Cov}\!\left[\delta_t(A;\psi)\nabla_\theta \log p_\theta(A),\, \nabla_\theta \log p_\theta(A)\right]}{\operatorname{Var}_{p_\theta}\!\left[\nabla_\theta \log p_\theta(A)\right]} = \frac{\mathbb{E}_{p_\theta}\!\left[\delta_t(A;\psi)\left(\nabla_\theta \log p_\theta(A)\right)^2\right]}{\mathbb{E}_{p_\theta}\!\left[\left(\nabla_\theta \log p_\theta(A)\right)^2\right]}. \qquad (35)$$
Unfortunately, finding the optimal β∗ can be as hard as estimating the desired gradient in
Equation (23); moreover, note also that β∗ = β∗ (Xt ), as δt depends on the observations Xt .
Therefore, we opt for the approximation
$$\mathbb{E}_{p_\theta}\!\left[ \delta_t(A;\psi) \left( \nabla_\theta \log p_\theta(A) \right)^2 \right] \approx \mathbb{E}_{p_\theta}\!\left[ \delta_t(A;\psi) \right] \mathbb{E}_{p_\theta}\!\left[ \left( \nabla_\theta \log p_\theta(A) \right)^2 \right], \qquad (36)$$
and obtain β∗ ≈ Epθ [δt (A; ψ)]. Note that a similar choice of baseline is very popular, for
instance, in reinforcement learning applications (e.g., see advantage actor-critic estima-
tors, Sutton et al. 1999; Mnih et al. 2016). However, since approximating Epθ [δt (A; ψ)] would
require the introduction of an additional estimator, we rely on a different approximation
by moving the expectation inside the cost function and obtaining β∗ ≈ δt (µ; ψ), where
µ = Epθ [A].
We recall that, in general, µ is dense and its components are real numbers, therefore
computing δt (µ; ψ) would require evaluating the output of the model w.r.t. a dense adjacency
matrix, potentially outside the well-behaved region of the input space, and computing messages w.r.t. each node pair, thus negating any computational complexity benefit. Accordingly,
we substitute µ with the Fréchet mean adjacency matrix Aµ , relying on the generalized
notion of mean for binary adjacency matrices introduced in Section 3.3. We then define
$$\hat{\beta} \triangleq \delta_t(A^{\mu}; \psi). \qquad (37)$$
The computational cost of evaluating β̂ then corresponds to that of a single evaluation of
the cost function δt w.r.t. the binary and possibly sparse adjacency matrix Aµ.
Finally, we point out that, even though β̂ may differ from β∗ , the variance is reduced
as long as 0 < β̂ < 2β∗. We denote the modified cost, i.e., the cost minus the baseline,
as δ̃t(A; ψ) = δt(A; ψ) − δt(Aµ; ψ); the modified cost is computed after each forward pass
and used to update the parameters of pθ. In Sections 6.2 and 6.3 we derive analytic
solutions for finding Aµ for BES and SNS, respectively.
Proof As each component of A ∼ pθ is independent of the others, µi,j can be computed
element-wise as Epθ[Ai,j] = σ(Φi,j), for all i, j = 1, . . . , N. Similarly, each component of Aµ
can be computed independently as well, by relying on Lemma 2.
Lemma 2 The minimum of the Fréchet function FH can be expressed as
$$\min_{A \in \mathcal{A}} F_H(A) = \min_{A \in \mathcal{A}} \sum_{i,j=1}^{N} \left( \mu_{i,j} - A_{i,j} \right)^2. \qquad (39)$$
To conclude the proof of Proposition 1, we observe that the minimum of Equation (39) is
attained at $A^{\mu} = \lfloor \mu \rceil$, that is, $A^{\mu}_{i,j} = 1$ for all $\mu_{i,j} > 1/2$ (equivalently, $\Phi_{i,j} > 0$), and 0 elsewhere. The
proof of Lemma 2 is deferred to Appendix A.
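In code, the BES Fréchet mean and the resulting baseline-corrected cost of Section 6.1 reduce to a simple thresholding of the logits; the sketch below is illustrative.

```python
import torch

def bes_frechet_mean(phi: torch.Tensor) -> torch.Tensor:
    """A^mu for BES (Proposition 1): round the edge probabilities, i.e., Phi > 0."""
    return (phi > 0).float()

def baseline_corrected_cost(phi: torch.Tensor, adj_sample: torch.Tensor, cost_fn):
    """Modified cost delta~_t(A) = delta_t(A) - delta_t(A^mu), Section 6.1."""
    with torch.no_grad():
        baseline = cost_fn(bes_frechet_mean(phi))   # beta_hat = delta_t(A^mu)
        return cost_fn(adj_sample) - baseline       # used to weight the score
```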
The proof that Aµ is indeed the Fréchet mean for SNS follows Proposition 3. Recall that,
for SNS, the support of pθ is that of directed K-NN graphs in Equation (42), where the
neighborhood of each node is sampled independently. Equation (40) is derived by considering
a neighborhood of fixed size K; however, the analysis remains valid for the adaptive case
discussed in Section 5.2.2.
In the SNS case, each entry µn,i of µ is
$$\mu_{n,i} = p_\theta(i \in S_K \mid n) = \sum_{S'_K :\, i \in S'_K} p_\theta(S'_K \mid n), \qquad (41)$$
where the sum is taken over all subsets $S'_K$ of S with K elements containing node i. Even if
where µn,i = pθ (i ∈ SK |n) = pθ (An,i = 1) and c is a constant. The proof follows from
Lemma 4.
Figure 2: Overview of the learning architecture with layer-wise sampling and surrogate
objective. The graph module samples a graph for each MP layer of predictor Fψ .
The proof of Lemma 4 is provided in Appendix A. Following Equation (47), the optimization
problem in Equation (44) becomes the linear program
$$\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} A_{i,j} \\
\text{s.t.} \quad & \sum_{j=1}^{N} A_{i,j} = K, \qquad A_{i,j} \in \{0,1\}, \qquad \forall\, i = 1, \ldots, N,
\end{aligned} \qquad (49)$$
where $w_{i,j} = 1 - 2 p_\theta(A_{i,j} = 1)$. Since Lemma 4 guarantees that, for each i, the K smallest weights $w_{i,j}$
correspond row-wise to the top-K scores $\Phi_{i,j}$, the solution Aµ to the linear program
is given by $A^{\mu}_{i,j} = \mathbb{I}\left( \Phi_{i,j} \in \text{top-}K\{\Phi_{i,:}\} \right)$ and, hence, the thesis.
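In practice, the SNS Fréchet mean can thus be obtained with a row-wise top-K selection over the scores, e.g., as in the following sketch.

```python
import torch

def sns_frechet_mean(phi: torch.Tensor, k: int) -> torch.Tensor:
    """A^mu for SNS: for each node, keep the K neighbors with the largest scores."""
    topk = torch.topk(phi, k, dim=-1).indices
    adj = torch.zeros_like(phi)
    adj.scatter_(-1, topk, 1.0)
    return adj
```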
The following provides a proof of Proposition 5 and presents a surrogate objective function
inspired by Equation (50).
Proof A proof can be derived by noticing the independence of $\delta_t^i(A^{:L}; \psi)$ and $p_\theta(A^{(L)}_{j,:})$ for
i ≠ j, and by exploiting the fact that, with both BES and SNS, the rows of each $A^{(l)}$ are sampled
independently. For the sake of readability, we omit the dependency of δt and $\delta_t^i$ on $A^{:L}$
and ψ. The proof follows:
$$\nabla_\theta L_t(\psi, \theta) = \mathbb{E}_{p_\theta}\!\left[ \delta_t\, \nabla_\theta \log p_\theta(A^{:L}) \right] \qquad (51)$$
$$= \sum_{l=1}^{L-1} \mathbb{E}_{p_\theta}\!\left[ \delta_t\, \nabla_\theta \log p_\theta(A^{(l)}) \right] + \underbrace{\mathbb{E}_{p_\theta}\!\left[ \delta_t\, \nabla_\theta \log p_\theta(A^{(L)}) \right]}_{(*)}. \qquad (52)$$
The two factors in (∗∗) are independent since $\delta_t^i$ depends only on $A^{:L-1}$ and $A^{(L)}_{i,:}$, hence
$$(**) = \sum_{i=1}^{N} \sum_{j \neq i} \mathbb{E}_{p_\theta}\!\left[ \delta_t^i \right] \underbrace{\mathbb{E}_{p_\theta}\!\left[ \nabla_\theta \log p_\theta(A^{(L)}_{j,:}) \right]}_{=0} = 0. \qquad (56)$$
Putting everything together, we get Equation (50) and the proof is completed.
8 Experiments
To validate the effectiveness of the proposed framework, we carried out experiments in several
settings on both synthetic and real-world datasets. In particular, a set of experiments focuses
on the task of graph identification where the objective is that of retrieving graphs that
better explain a set of observations given a (fixed) predictive model. The second collection
of experiments shows instead how the proposed approach can be used as a graph-learning
module in an end-to-end forecasting architecture.
8.1 Datasets
[Figure 3: validation MAE vs. training epoch; panels (a)–(c) compare BES, SNS, and SNS with dummies, with and without the baseline, against the optimal value.]
Figure 3: Experiments on GPVAR. All the curves show the validation MAE after each
training epoch.
Impact of the Baseline The first striking outcome is the effect of the baseline β̂ which, in
both the considered configurations, dramatically accelerates the learning process.
Graph distribution The second notable result is that, although both SNS and BES are
able to retrieve the underlying graph, the sparsity prior in SNS yields faster convergence
w.r.t. the number of samples seen during training, as the validation curves are steeper
for SNS; note that the approximation error induced by having a fixed number of
neighbors is effectively removed with the dummy nodes.
Surrogate objective Figure 3b shows that the surrogate objective contributes to acceler-
ating learning even further for all considered methods.
Joint training Finally, Figure 3c reports the results for the joint training of the predictor
and graph module with the surrogate objective. The curves, in this case, were obtained
by initializing the parameters of the filter randomly and specifying an order of the
filter higher than the real one; nonetheless, the learning procedure was able to quickly
converge to the optimum when using as baseline the cost evaluated w.r.t. Aµ .
[Figure: validation MAE vs. training epoch for SNS with K = 5, 10, 20, 30, compared against the optimal value.]
[Figure: validation MAE vs. training epoch on GPVAR(3, 4) (a) and GPVAR(4, 6) (b) for the score-based, path-wise, and straight-through estimators.]
a number of dummy nodes equal to K − 1. Results in Figure 5 show that while the use of
dummy nodes reduces the impact of a wrong assessment of K, overestimating the maximum
number of neighbors can nonetheless lead to slower convergence. In particular, given these
settings and hyperparameters, SNS fails to converge to the optimal solution for K = 30, i.e.,
a number of neighbors equal to the number of nodes. As a general recommendation, we
argue that using SNS can be beneficial as long as K < N/2, while for larger values of K a
BES parametrization is preferable due to the reduced overhead in sampling and likelihood
evaluation.
                 METR-LA                                      PEMS-BAY
Model            MAE @ 15      MAE @ 30      MAE @ 60        MAE @ 15      MAE @ 30      MAE @ 60
Full attention   2.727 ± .005  3.049 ± .009  3.411 ± .007    1.335 ± .003  1.655 ± .007  1.929 ± .007
GTS              2.750 ± .005  3.174 ± .013  3.653 ± .048    1.360 ± .011  1.715 ± .032  2.054 ± .061
MTGNN            2.690 ± .012  3.057 ± .016  3.520 ± .019    1.328 ± .005  1.655 ± .010  1.951 ± .012
Our (SNS)        2.725 ± .005  3.051 ± .009  3.412 ± .013    1.317 ± .002  1.620 ± .003  1.873 ± .005
Adjacency
  – Truth        2.720 ± .004  3.106 ± .008  3.556 ± .011    1.335 ± .001  1.676 ± .004  1.993 ± .008
  – Random       2.801 ± .006  3.160 ± .008  3.517 ± .009    1.327 ± .001  1.636 ± .002  1.897 ± .003
  – Identity     2.842 ± .002  3.264 ± .002  3.740 ± .004    1.341 ± .001  1.684 ± .001  2.013 ± .003
[Figure: GPU memory usage and training time per epoch (s/epoch) as a function of the number of nodes, for the score-based and straight-through approaches.]
complex approaches which suggests that, in some datasets, having access to the ground-truth
graph is not decisive for achieving high performance. That being said, our graph learning
methods consistently improve performance w.r.t. the naïve baselines.
8.4 Scalability
To assess the scalability of the proposed method, we consider a T&S model consisting of a
message-passing GRU (MPGRU, Cini et al. 2022), i.e., a GRU with gates implemented by
MPNNs. In particular, we consider a simple MP scheme s.t.
$$z_t^{i,(l)} = \sum_{j \in \mathcal{N}(i)} \operatorname{MLP}\!\left( z_t^{i,(l-1)}, z_t^{j,(l-1)} \right). \qquad (59)$$
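A sketch of a sparse, edge-list-based implementation of this message-passing scheme is reported below (illustrative; the actual MPGRU wraps operators of this kind inside its gates), which makes explicit why the cost per layer scales with the number of edges E rather than N².

```python
import torch
from torch import nn

class SparseMPLayer(nn.Module):
    """Message passing as in Eq. (59): sum of MLP messages over the neighbors."""

    def __init__(self, size: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * size, size), nn.ReLU(),
                                 nn.Linear(size, size))

    def forward(self, z: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # z: (N, size); edge_index: (2, E) with rows (senders j, receivers i).
        senders, receivers = edge_index
        messages = self.mlp(torch.cat([z[receivers], z[senders]], dim=-1))
        out = torch.zeros_like(z)
        # Sparse aggregation: cost scales with the number of edges E, not N^2.
        out.index_add_(0, receivers, messages)
        return out
```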
The resulting model has a space and time complexity that scales as O(LT E). By considering
the same controlled environment of the experiments in Section 8.2 and varying the number
of nodes in the graph underlying the generated data, we empirically assessed the time and
memory cost of learning a graph distribution with our SNS approach against the straight-
through approach. Note that, while the straight-through estimator allows for a sparse
forward pass at inference, the processing is nonetheless dense at training time—thus requiring
O(LT N 2 ) time and space, instead of O(LT E).
The resulting models are trained on mini-batches of 4 samples with a window size of 8
steps for 50 epochs, each consisting of 5 mini-batches. The empirical results in Figure 8.4
show measured GPU usage and latency for the above settings. The computational advantages
of the sparse message-passing operations of our method are evident.
9 Conclusions
In this paper, we propose a methodological framework for learning graph distributions from
spatiotemporal data. Our novel probabilistic framework relies upon score-function gradient
estimators that allow us to keep the computation sparse throughout both the training and
inference phases. We then develop variance-reduction techniques for our method to obtain
accurate estimates for the training gradient. The proposed graph learning modules are applied
to the time series forecasting task where they can be used for both graph identification and
as components of an end-to-end architecture. Empirical results support our claims, showing
the effectiveness of the framework. Notably, we achieve forecasting performance on par
with state-of-the-art alternatives, while maintaining the benefits of graph-based processing.
Possible directions for future research include the assessment of the proposed method w.r.t.
inference of dynamic adjacency matrices, distribution agnostic variance reduction methods,
and, in particular, the design of advanced forecasting architectures to achieve accurate
predictions at scale. Furthermore, it would be interesting to assess the combination of the
proposed estimators with orthogonal variance reduction techniques (e.g., Kool et al. 2020)
and data-driven baselines. Finally, future works might investigate the application of the
recently proposed implicit maximum likelihood estimators (Niepert et al., 2021; Minervini
et al., 2023) to the settings explored in this paper.
Acknowledgements
This work was supported by the Swiss National Science Foundation project FNS 204061:
Higher-Order Relations and Dynamics in Graph Neural Networks. The authors wish to
thank the Institute of Computational Science at USI for granting access to computational
resources.
Appendix
Moreover, as the first term does not depend on A′, the minimum of $F_F(A')$ is achieved at the minimum of
$$\|\mu - A'\|_F^2 = \sum_{i,j=1}^{N} \left( \mu_{i,j} - A'_{i,j} \right)^2. \qquad (65)$$
with G being the random variable associated with the K-th largest realization in $\{G_{\phi_l} : l \in S\}$ and $f_G$ its p.d.f., we obtain
$$P(A_{n,i} = 1) \geq P(A_{n,j} = 1) \overset{\text{(Eq. 71)}}{\iff} P(G_{\phi_i} \geq g) \geq P(G_{\phi_j} \geq g) \overset{\text{(Eq. 69)}}{\iff} \phi_i \geq \phi_j, \qquad (72)$$
$$\begin{aligned}
p_\theta(S_K \mid n) &= P\!\left( \min_{i \in S_K} G_{\phi_i} > \max_{i \in S \setminus S_K} G_{\phi_i} \right) \\
&= P\!\left( \min_{i \in S_K} G_{\phi_i} > G_{\phi_{S \setminus S_K}} \right) \\
&= P\!\left( G_{\phi_i} > G_{\phi_{S \setminus S_K}},\ \forall\, i \in S_K \right) \\
&= \int_{-\infty}^{\infty} f_{\phi_{S \setminus S_K}}(g)\, P\!\left( G_{\phi_i} > g,\ \forall\, i \in S_K \right) dg \\
&= \int_{-\infty}^{\infty} \prod_{i \in S_K} \left( 1 - F_{\phi_i}(g) \right) f_{\phi_{S \setminus S_K}}(g)\, dg \\
&= \int_0^1 \prod_{i \in S_K} \left( 1 - F_{\phi_i}\!\left( F^{-1}_{\phi_{S \setminus S_K}}(v) \right) \right) dv && \left\{ v = F_{\phi_{S \setminus S_K}}(g) \right\} \\
&= \int_0^1 \prod_{i \in S_K} \left( 1 - v^{\exp(\phi_i - \phi_{S \setminus S_K})} \right) dv \\
&= \int_0^1 \exp(b)\, u^{\exp(b) - 1} \prod_{i \in S_K} \left( 1 - u^{\exp(\phi_i - \phi_{S \setminus S_K} + b)} \right) du && \left\{ u = v^{\exp(-b)} \right\} \\
&= \exp\!\left( \phi_{S \setminus S_K} + c \right) \int_0^1 u^{\exp(\phi_{S \setminus S_K} + c) - 1} \prod_{i \in S_K} \left( 1 - u^{\exp(\phi_i + c)} \right) du && \left\{ c = b - \phi_{S \setminus S_K} \right\},
\end{aligned}$$
furthermore, we relied on Neptune2 (neptune.ai, 2021) for logging experiments. For GTS, we
used the code provided by the authors3 to obtain the results shown in the table; however, we
fixed a bug in the performance evaluation present in the official implementation4 .
Experiments were run on a cluster equipped with Nvidia Titan V and GTX 1080 GPUs.
The code to reproduce the experiments of the paper is available online5 .
where $W, V \in \mathbb{R}^{d_z \times d_z}$ are learnable weight matrices and σ is a nonlinear activation function
(in particular, we use Swish (Ramachandran et al., 2017)). All layers have a hidden size of 64
units. We use an input window size of 24 steps and train the models for 100 epochs with the
Adam optimizer, with an initial learning rate of 0.005 and a multi-step learning rate scheduler.
For the GRU baseline, we use a single recurrent layer of size 64 followed by an MLP decoder
with 1 hidden layer with 32 units. For the graph module, we use SNS with K = 5 and 4
dummy nodes and train with Adam with a learning rate of 0.01 for 200 epochs. At test time,
we used models with weights corresponding to the lowest validation error across epochs.
used a temperature τ = 0.5 to make the sampler more deterministic. During evaluation, we
used Aµ to obtain test-time predictions.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore,
D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems,
2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
D. Bacciu, F. Errica, A. Micheli, and M. Podda. A gentle introduction to deep learning for
graphs. Neural Networks, 129:203–221, 2020.
L. Bai, L. Yao, C. Li, X. Wang, and C. Wang. Adaptive graph convolutional recurrent
network for traffic forecasting. Advances in Neural Information Processing Systems, 33:
17804–17815, 2020.
J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and deep locally connected
networks on graphs. In 2nd International Conference on Learning Representations, ICLR
2014, 2014.
A. Cini, I. Marisca, and C. Alippi. Filling the g_ap_s: Multivariate time series imputation
by graph neural networks. In International Conference on Learning Representations, 2022.
A. Deng and B. Hooi. Graph neural network-based anomaly detection in multivariate time
series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages
4027–4035, 2021.
W. Falcon and The PyTorch Lightning team. PyTorch Lightning, 3 2019. URL https:
//github.com/PyTorchLightning/pytorch-lightning.
M. Fey and J. E. Lenssen. Fast graph representation learning with pytorch geometric. arXiv
preprint arXiv:1903.02428, 2019.
L. Franceschi, M. Niepert, M. Pontil, and X. He. Learning discrete structures for graph neural
networks. In International conference on machine learning, pages 1972–1982. PMLR, 2019.
J. Gao and B. Ribeiro. On the equivalence between temporal and static equivariant graph
representations. In International Conference on Machine Learning, pages 7052–7076.
PMLR, 2022.
P. Glasserman and Y.-C. Ho. Gradient estimation via perturbation analysis, volume 116.
Springer Science & Business Media, 1991.
A. C. Harvey et al. Forecasting, structural time series models and the Kalman filter.
Cambridge Books, 1990.
E. Isufi, A. Loukas, N. Perraudin, and G. Leus. Forecasting time series with VARMA
recursions on graphs. IEEE Transactions on Signal Processing, 67(18):4870–4885, 2019.
A. Kazi, L. Cosmo, S.-A. Ahmadi, N. Navab, and M. Bronstein. Differentiable graph module
(dgm) for graph convolutional networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2022.
T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for
interacting systems. In International Conference on Machine Learning, pages 2688–2697.
PMLR, 2018.
W. Kool, H. Van Hoof, and M. Welling. Stochastic beams and where to find them: The
gumbel-top-k trick for sampling sequences without replacement. In International Conference
on Machine Learning, pages 3499–3508. PMLR, 2019.
W. Kool, H. van Hoof, and M. Welling. Estimating gradients for discrete random variables by
sampling without replacement. In International Conference on Learning Representations,
2020. URL https://openreview.net/forum?id=rklEj2EFvB.
P. L’Ecuyer. Note: On the interchange of derivative and expectation for likelihood ratio
derivative estimators. Management Science, 41(4):738–747, 1995.
Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network:
Data-driven traffic forecasting. In International Conference on Learning Representations,
2018.
I. Marisca, A. Cini, and C. Alippi. Learning to reconstruct missing data from spatiotemporal
graphs with sparse observations. Advances in Neural Information Processing Systems, 35:
32069–32082, 2022.
J. Mei and J. M. Moura. Signal processing on graphs: Causal modeling of unstructured data.
IEEE Transactions on Signal Processing, 65(8):2077–2092, 2016.
A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In
International Conference on Machine Learning, pages 1791–1799. PMLR, 2014.
neptune.ai. Neptune: Metadata store for mlops, built for research and production teams
that run a lot of experiments, 2021. URL https://neptune.ai.
P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint
arXiv:1710.05941, 2017.
C. Shang and J. Chen. Discrete graph structure learning for forecasting multiple time series.
In Proceedings of International Conference on Learning Representations, 2021.
Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang. Graph wavenet for deep spatial-temporal
graph modeling. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence, pages 1907–1913, 2019.
Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang. Connecting the dots: Multivariate
time series forecasting with graph neural networks. In Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 753–
763, 2020.
Z. Wu, D. Zheng, S. Pan, Q. Gan, G. Long, and G. Karypis. Traversenet: Unifying space
and time in message passing for traffic forecasting. IEEE Transactions on Neural Networks
and Learning Systems, 2022.
X. Yi, Y. Zheng, J. Zhang, and T. Li. St-mvl: filling missing values in geo-sensory time series
data. In Proceedings of the 25th International Joint Conference on Artificial Intelligence,
2016.
B. Yu, H. Yin, and Z. Zhu. Spatio-temporal graph convolutional networks: a deep learning
framework for traffic forecasting. In Proceedings of the 27th International Joint Conference
on Artificial Intelligence, pages 3634–3640, 2018.
D. Zambon and C. Alippi. Az-whiteness test: a test for uncorrelated noise on spatio-temporal
graphs. To appear in Advances in Neural Information Processing Systems, 2022.