On A Linear Fused Gromov-Wasserstein Distance For Graph Structured Data

O N A LINEAR FUSED G ROMOV-WASSERSTEIN DISTANCE FOR
GRAPH STRUCTURED DATA
A P REPRINT
Dai Hai Nguyen Koji Tsuda

Graduate School of Frontier Sciences Graduate School of Frontier Sciences
arXiv:2203.04711v1 [cs.LG] 9 Mar 2022
The University of Tokyo The University of Tokyo

5-1-5 Kashiwa-no-ha, Kashiwa, Chiba 277-8561, Japan 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba 277-8561, Japan
[email protected] [email protected]
March 10, 2022
A BSTRACT
We present a framework for embedding graph structured data into a vector space, taking into account
node features and topology of a graph into the optimal transport (OT) problem. Then we propose a
novel distance between two graphs, named linearFGW, defined as the Euclidean distance between
their embeddings. The advantages of the proposed distance are twofold: 1) it can take into account
node feature and structure of graphs for measuring the similarity between graphs in a kernel-based
framework, 2) it can be much faster for computing kernel matrix than pairwise OT-based distances,
particularly fused Gromov-Wasserstein, making it possible to deal with large-scale data sets. After
discussing theoretical properties of linearFGW, we demonstrate experimental results on classification
and clustering tasks, showing the effectiveness of the proposed linearFGW.
Keywords linear optimal transport · graph structured data · graph kernel
1 Introduction
Many applications of machine learning involve learning with graph structured data such as bioinformatics [17], social
networks [16], chemoinformatics [23], and so on. To deal with graph structured data, many graph kernels have been
proposed in literature for measuring the similarity between graphs in a kernel-based framework. Most of them are based
on R-framework, which focuses on comparing graphs based on their substructures such as subtree [18], shortest path
[2], random walk [7], and so on. However, these methods have several limitations: 1) they do not consider feature and
structure distributions of graphs, 2) they require to define substructures based on the domain knowledge, which might
not be available in many practical applications.
Optimal Transport (OT) [24] has received much attention in the machine learning community and has been shown
to be an effective tool for comparing probability measures in many applications. In recent years, several studies
have attempted to use the OT distance for learning graph structured data by considering the problem of measuring
the similarity of graphs as an instance of computing OT distance for graphs. Togninalli et al. [22] introduced the
Wasserstein distance to compare graphs based on their node embeddings obtained by Weisfeiler-Lehmann labeling
framework [18]. Titouan et al. [21] proposed fused Gromov-Wasserstein (FGW) which combines Wasserstein and
Gromov-Wasserstein [10, 14] distances in order to jointly take into account features and structures of graphs. These
OT-based distances have achieved great performance for graph classification. However, they have several limitations:
1) kernel matrices converted from the OT-based distances are generally not valid, so they are not ready to use for the
kernel-based framework, 2) calculating the similarity between each pair of graphs is computationally expensive, so the
need for computing the kernel matrix of all pairwise similarity can be a burden for dealing with large-scale graph data
sets.
In order to overcome the aforementioned limitations, inspired by the linear optimal transport framework introduced
by Wang et al. [25], we propose an OT-based distance, named linearFGW, for learning with graph structured data.
linearFGW: a novel optimal transport metric for graph structured data A P REPRINT
As the name suggests, our distance is a generalization of the linear optimal transport framework and FGW distance.
The basic idea is to embed the node features and topology of a graph into a linear tangent space through a fixed
reference measure graph. Then the linearFGW distance between two graphs is defined as the Euclidean distance
between their two embeddings, which approximates their FGW distance. Therefore, the linearFGW distance has the
following advantages: 1) it can take into account node features and topologiesb of graphs for the OT problem in order
to calculate the dissimilarity between graphs, 2) we can derive a valid graph kernel from the embeddings of graphs
for the downstream tasks such as graph classification and clustering, 3) by using the linearFGW as an approximate of
the FGW, we can avoid expensive computation of the pairwise FGW distance for large-scale graph data sets. Finally, we
conduct experiments on graph data sets to show the effectiveness of the proposed distance in terms of classification and
clustering accuracies.
The remainder of the paper is organised as follows: in Section 2, we present some related work. In Section 3, we present
the idea of our proposed distance for learning graph with structured data and its theoretical properties. In Section 4,
experimental results on benchmark graph data sets are provided. Finally, we conclude by summarizing this work and
discussing possible extensions in Section 5.
2 Related Work
2.1 Kernels for Graphs
Graph is a standard representation for relational data, which appear in various domains such as bioinformatics [17],
chemoinformatics [23], social network analysis [16]. Making use of graph kernels is a popular approach to learning
with graph structured data. Essentially, a graph kernel is a measure of the similarity between two graphs and must
satisfy two fundamental requirements to be a valid kernel: 1) symmetric and 2) positive semi-definite (PSD). There are
a number of kernels for graphs with discrete attributes such as random walk [7], shortest path [7], Weisfeiler-Lehman
(WL) subtree [18] kernels, just to name a few. There are several kernels for graphs with continuous attributes such as
GraphHopper [6], Hash Graph [12] kernels.
2.2 Optimal Transport Frameworks for Graphs
Optimal Transport (OT) [24] has received much attention from the machine learning community as it provide an
effective way to measure the distance between two probability measures. Some OT-based graph kernels have been
proposed and achieved great performance in comparison with traditional graph kernels. Wasserstein Weisfeiler-Lehman
(WWL) [22] used OT for measuring distance between two graphs based on their WL embeddings (discrete feature
vectors of subtree patterns). Nguyen et al. [13] extends WWL by proposing an efficient algorithm for learning subtree
pattern importance, leading to higher classification accuracy on graph data sets. However, they are not valid kernels
for graphs with continuous attributes. Following the work in [10], Peyre et al. [14] proposed a Gromow-Wasserstein
distance to compare pairwise similarity matrices from different spaces. Then, Titouan et al. [21] proposed a fused
Gromov-Wasserstein distance which combine Wasserstein and Gromov-Wasserstein distances in order to jointly
leverage feature and structure information of graphs. To reduce computational complexity, OT-based distances are often
computed using a Sinkhorn algorithm [19, 4]. Due to the nature of optimal assignment problem, these OT-based graph
kernels are indefinite similarity matrices so they are invalid kernels, leading to the use of support vector machine (SVM)
with indefinite kernels introduced in [9].
2.3 Linear Optimal Transport
Wang et al. [25] proposed a simplified version of OT in 2-Wasserstein space, called linear optimal transport. In the sense
of geometry, the basic idea is to transfer probability measures from the geodesic 2-Wasserstein space to the tangent
space with respect to some fixed base or reference measure. One advantage is that we can work on a linear tangent
space instead of the complex 2-Wasserstein space so that the downstream tasks such as classification, clustering can be
done in the linear space. Another advantage is the fast approximation of pairwise Wasserstein distance for large-scale
data sets. In the context of graph learning, Kolouri et al. [8] leveraged the above framework and introduced the concept
of linear Wasserstein embedding for learning graph embeddings. In a concurrent work, Mialon et al. [11] proposed a
similar idea for learning set of features. In this paper, we extend the idea of linear optimal transport framework from the
2-Wasserstein distance to the Fused Gromov-Wasserstein distance (FGW), and define a valid graph kernel for learning
with graph structured data. Furthermore, we derive theoretical understandings of the proposed distance.
2
3 Proposed Distance for Graphs: Linear Fused Gromov-Wasserstein

We denote a measure graph as G(X, A, µ), where X = {xi }m i=1 ∈ R
m×d
is the set of m node features with dimen-
sionality of d, A = [a]ij ∈ Rm×m is a square matrix to encode the topology of the given graph such as the adjacency
matrix or the matrix of pairwise distances between nodes, µ = [µi ] ∈ ∆m (probability simplex) is a Borel probability
measure defined on the nodes (note that when no additional information is provided, all probability measures can be set
as uniform).
3.1 FGW: A Distance for Matching Node Features and Structures
In [21], a graph distance, named Fused Gromov-Wasserstein (FGW), is proposed to take into account both node feature
and topology information into the OT problem for measuring the dissimilarity between two graphs. Formally, given
two graphs G1 (X, A, µ) and G2 (Y, B, ν), the FGW distance between G1 and G2 is defined for a trade-off parameter
α ∈ [0, 1] as:
X
FGWq,α (G1 , G2 ) = min (1 − α)kxi − yj kq + α|Ai,k − Bj,l |q πi,j πk,l (1)
π∈Π(µ,ν)
i,j,k,l
m×n Pm Pn
where Π(µ, ν) = {π ∈ R+ s.t. i=1 πi,j = νj , j=1 πi,j = µi } is the set of all admissible couplings between µ
and ν. The FGW distance acts as a generalization of the Wasserstein [24] and Gromov-Wasserstein [10], which allows
balancing the importance of matching the node features and topologies between two graphs. However, similar to the
existing OT-based graph distances, it is challenging to define a valid kernel from the FGW for the graph-related prediction
task, due to the nature of optimal assignment problem. In the following, we restrict our attention to the OT with q = 2
and for the ease of presentation, we use the notation FGWα instead of FGWq,α .
3.2 linearFGW: A New Distance for Comparing Graphs
In order to overcome the limitations of the FGW distance, we propose to approximate it by a linear optimal transport
framework, which we call Linear Fused Gromov-Wasserstein (LinearFGW). Computing the LinearFGW distance
requires a reference, which we choose to be also a measure graph G(Z, C, σ). How the reference measure graph is
chosen is described later in Subsection 3.3. To precisely define the LinearFGW distance, we first define the barycentric
projections for node features and structures of graphs as follows:
Definition 1. (Barycentric projectionsPfor nodes and edges of graphs). For a reference measure graph G(Z, C, σ) with
K nodes and a transport plan π = k,i πk,i δ(zk ,xi ) . Then the barycentric projections for nodes and edges of the
measure graph G using the transport plan π are defined as follows:
1 X 1 X
Tn,π (zk ) = πki xi and Te,π (Ck,l ) = πk,i πl,j Cij , where k, l = 1, K (2)
σk i σk σl i,j
The definitions of these projections are extended from [25, 1]. Furthermore, we derive their properties in the following
lemma.
Lemma 1. Given two measure graphs G(X, A, µ) and G(Z, C, σ), we denote π ∗ as the optimal transport plan from G
to G with respect to the FGW distance, and G̃(Z̃, C̃, σ) as the probability measure graph obtained by applying barycentric
projections for nodes and edges Tn,π∗ (.) and Te,π∗ (.), respectively (see Definition 1). Then, we have the following
claims:
 
σ1 0 0
1. diag(σ) =  0 . . . 0  is the optimal transport plan from G to G̃ in the sense of the FGW distance.
 
0 0 σK
2. FGWα (G, G̃) ≤ FGWα (G, G).
The proof is given in the Appendix section. An important implication of the above lemma is that G̃ can be considered as
a surrogate measure graph for G with respect to the reference G. Thus we propose to define the LinearFGW distance
between two measure graphs G1 and G2 with respect to the reference measure graph G as follows:
X X
linearFGWα (G1 , G2 ) = (1 − α) kTn,π1 (zk ) − Tn,π2 (zk )k2 + α |Te,π1 (Ck,l ) − Te,π2 (Ck,l )|2 (3)
k k,l
3
Figure 1: Illustration of the computation of the linearFGW distance between G1 (X, A, µ) and G2 (Y, B, ν), given
the fixed reference measure graph G(Z, C, σ). First, we find the optimal transport plans π1 and π2 from G to G1
and G2 , respectively, in the sense of the FGW. Then we transport G with the barycentric projections for nodes and
edges (see Definition 1) using the optimal plans π1 and π2 to get the surrogate measure graphs G̃1 (Z̃(1) , C̃(1) , σ)
and G̃2 (Z̃(2) , C̃(2) , σ) for G1 and G2 , respectively. Finally, the Euclidean distance between G̃1 and G̃2 can be directly
calculated using Equation (3).
where π1 and π2 denote the optimal transport plans from G to G1 and G2 , respectively, in the sense of the FGW
distance. We call this distance linearFGW as it acts as a generalization of linear optimal transport [25] and FGW [21].
Furthermore, the proposed distance also suggests a Euclidean
√ embedding of√the measure graph G1√with respect to the
reference measure graph G becomes: ΦG,α (G1 ) = 1 − αTn,π1 (z1 ), ..., 1 − αTn,π1 (zK ), ..., αTe,π1 (Ck,l ), ...
with dimension of K + K 2 . So we can derive a valid kernel for graph-related prediction tasks. The computation of the
linearFGW can be illustrated in Figure 1.
3.3 Selection of Reference Measure Graph
Selecting the reference measure graph in Subsection 3.2 is important. We empirically observe that if the reference is
randomly selected or distant from all measure graphs, the approximation error between FGW and linearFGW is likely
to increase. In the lemma presented below, we show the relation between FGW and linearFGW with respect to the
reference measure graph.
Lemma 2. We denote the mixing diameter of a graph G(X, A, µ) by diamα (G) = α maxi,j kxi − xj k2 + (1 −
α) maxi,j,i0 ,j 0 |Ai,j − Ai0 ,j 0 |2 . Then, given a fixed reference measure graph G(Z, C, σ), for two input measure graphs
G1 (X, A, µ) and G2 (Y, B, ν), we have the following inequality:
|FGWα (G1 , G2 ) − linearFGWα (G1 , G2 )| ≤ 4 min{FGWα (G1 , G), FGWα (G2 , G)} + 2diamα (G1 ) + 2diamα (G2 ) (4)
The proof is given in the Appendix section.

A corollary of the above lemma suggests how to select a good reference measure graph G: given N graphs (G1 , ..., GN ),
the total approximation error is upper bounded by:
N X
X N N
X N
X
|FGWα (Gi , Gj ) − linearFGWα (Gi , Gj )| ≤ 4 FGWα (Gi , G) + 4 diamα (Gi ) (5)
i=1 j=i+1 i=1 i=1
4
Table 1: Statistics of data sets used in experiments
Dataset #graphs #classes Ave. #odes Ave. #edges #attributes

COX2 467 2 41.22 43.45 3
BZR 405 2 35.75 38.36 3
ENZYMES 600 6 32.63 62.14 18
PROTEINS 1113 2 39.06 72.82 1
PROTEINS-F 1113 2 39.06 72.82 29
AIDS 2000 2 15.69 16.20 4
IMDB-B 1000 2 19.77 96.53 -
where the right-hand side has two terms: the first term is the objective of the fused Gromov-Wasserstein barycenter
problem [21] while the second term is constant with respect to the reference measure graph G, suggesting that we can
use the fused Gromov-Wasserstein barycenter of N given measure graphs as the reference.
3.4 Implementation Details
The FGW is the main component of our method. We use the proximal point algorithm (PPA) [26] to implement the FGW.
Specifically, given two graphs G1 and G2 , we solve the problem (1) iteratively ( with maximum T iterations) as follows:
π (t+1) = arg min h(1 − α)D12 + α(C12 − 2Aπ (t) B), πi + ηKL(π||π (t) ) (6)
π∈Π(µ,ν)
where h·, ·i denote the inner product of matrices, D12 = (X X)1d 1> >
n + 1m 1d (Y Y)> , C12 = (A A)µ1> n +
> > (t)
1m ν (B B) and denotes the Hadamard product of matrices. KL(π||π ) is the Kullback-Leibler divergence
between the optimal transport plan and the previous estimation. We can approximately solve the above problem by
Sinkhorn-Knopp update (see [26] for the algorithmic details).
4 Experimental Results
We now show the effectiveness of our proposed graph distance on real world data sets in terms of graph classification and
clustering. Our code can be accessed via the following link: https://github.com/haidnguyen0909/linearFGW.
4.1 Data sets
In this work, we focus on graph kernels/distances for graphs with continuous attributes. So we con-
sider the following seven widely used benchmark data sets: BZR [20], COX2 [20], ENZYMES [5], PRO-
TEINS [3], PROTEINS-F [3], AIDS [15] contain graphs with continous attributes, while IMDB-B [27]
contains unlabeled graphs obtained from social networks. All these data sets can be downloaded from
https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets. The details of the used data
sets are shown in Table 1.
4.2 Experimental settings
To compute numerical features for nodes of graphs, we consider two main settings: 1) we keep the original attributes of
nodes (denoted by suffix RAW), 2) we consider Weisfeiler-Lehman (WL) mechanism by concatenating numerical vectors
of neighboring nodes (denoted by the suffix WL−H where H means we repeat the procedure H times to take neighboring
vertices within H hops into account to obtain the features, see [18] for more detail). For the matrix A, we restrict our
attention to the adjacency matrices of the input graphs. For solving the optimization problem (6), we fix η as 0.1 and
the number of iterations T as 5. We carry out our experiments on a 2.4 GHz 8-Core Intel Core i9 with 64GB RAM.
For the classification task, we convert a distance into a kernel matrix through the exponential function, i.e. K =
exp(−γD) (Gaussian kernel). We compare the classification accuracy with the following state-of-the-art graph kernels
(or distances): GraphHopper kernel (GH, [6]), HGK-WL [12], HGK-SP [12], RBF-WL [22], Wasserstein Weisferler-
Lehman kernel (WWL, [22]), FGW [21], GWF [26]. We divide them into two groups: OT-based graph kernels including
WWL, FGW, GFW and linearFGW (ours) and non-OT graph kernels including GH, HGK-WL, HGK-SP, RBF-WL.
Note that our proposed graph kernel converted from the linearFGW is the only (valid) positive definite kernel among
OT-based graph kernels.
5
Table 2: Average classification accuracy on the graph data sets with vector attributes. The best result for each column
(data set) is highlighted in bold and the standard deviation is reported with the symbol ±.
Kernels/Data sets COX2 BZR ENZYMES PROTEINS IMDB-B

GH 76.41± 1.39 76.49± 0.9 65.65± 0.8 74.48± 0.3 -
HGK-WL 78.13± 0.45 78.59± 0.63 63.04± 0.65 75.93± 0.17 -
Non OT
HGK-SP 72.57± 1.18 76.42± 0.72 66.36± 0.37 75.78± 0.17 -
RBF-WL 75.45± 1.53 80.96± 1.67 68.43± 1.47 75.43± 0.28 -
WWL 78.29±0.47 84.42± 2.03 73.25± 0.87 77.91± 0.8 -
FGW 77.2±4.7 84.1 ±4.1 71.0± 6.7 75.1± 2.9 64.2± 3.3
OT-based
GWF - - - 73.7±2.0 63.9±2.7
linearFGW-RAW (Ours) 79.74 ± 1.99 86.07±1.64 83.25 ± 2.44 82.49± 1.75 63.62±1.9
linearFGW-WL1 (Ours) 79.98 ± 3.21 84.80±2.95 85.28±1.64 83.29 ± 1.63 -
linearFGW-WL2 (Ours) 79.50±3.29 84.37±2.75 83.13± 1.56 83.95 ± 1.12 -
Table 3: Average clustering accuracy on the graph data sets with continuous attributes. The best result for each column
(data set) is highlighted in bold and the standard deviation is reported with the symbol ±.
Methods/Data sets AIDS PROTEINS PROTEINS-F IMDB-B

FGW 91.0 ±0.7 66.4±0.8 66.0±0.9 56.7±1.5
GWB-KM 95.2±0.9 64.7±1.1 62.9±1.3 53.5±2.3
GWF-BADMM 97.6±0.8 69.2±1.0 68.1±1.1 55.9±1.8
GWF-PPA 99.5±0.4 70.7±0.7 69.3±0.8 60.2±1.6
linearFGW-Kmeans (Ours) 98.7±1.2 70.58±0.57 71.46±1.03 54.49±0.3
linearFGW-SC (Ours) 98.2 ±0.83 72.70±0.03 73.33±0.82 58.3±0.8
We perform 10-fold cross validation and report the average accuracy of the experiment repeated 10 times. The
accuracies of other graph kernels are taken from the original papers. We use SVM for classification and cross-validate
the parameters C = {2−5 , 2−4 , ..., 210 }, γ = {10−2 , 10−1 , ..., 102 }. The range of the WL parameter H = 1, 2. For
our proposed linearFGW, α is cross-validated via a search in {0.0, 0.3, 0.5, 0.7, 0.9, 1.0}. Note that the linear optimal
transport [25] is a special case of the linearFGW with α = 0.
We also compare the clustering accuracy with OT-based graph distances: FGW, GWB-KM, GWF on four real world data
sets: AIDS, PROTEINS, PROTEINS-F, IMDB-B. For fair comparison, we use K-means and spectral clustering on the
Euclidean embedding and Gaussian kernel of the proposed linearFGW distance (denoted by linearFGW-Kmeans and
linearFGW-SC, respectively). We fix the parameters H = 1, α = 0.5 for data sets of graphs with continuous attributes
and γ = 0.01 for the Gaussian kernel.
4.3 Results
Classification: The average classification accuracies shown in Table 2 indicate that the linearFGW is a clear state-of-
the-art method for graph classification. It achieved the best performances on 4 out of 6 data sets. In particular, on two
data sets ENZYMES and PROTEINS, the linearFGW outperformed all the rest by large margins (around 12% and 6%,
respectively) in comparison with the second best ones. On COX2 and BZR, the linearFGW achieved improvements of
around 2% and 1.5%, respectively, over WWL which is the second best one. Note that the Gaussian kernel derived from
WWL is not valid for the data sets of graphs with continuous attributes (see [22]). On IMDB-B, the average accuracies
of the compared methods are comparable. Interestingly, despite that our linearFGW is an approximate of the FGW
distance, the linearFGW consistently achieved significantly higher performance than FGW. This can be explained by the
fact that the kernel derived from the linearFGW distance is valid.
Clustering: The average clustering accuracies shown in Table 3 also indicate that the linearFGW could achieve high
performance on clustering. On PROTEINS and PROTEINS-F, the linearFGW achieved the highest accuracies by
margins of around 2% and 3%, respectively, over the second best one. On AIDS and IMDB-B, the linearFGW achieved
comparable performances with GWF-PPA which is the best performer.
Runtime Analysis: By using the linearFGW, we can reduce the computational complexity of calculating the pairwise
FGW distance for a data set of N graphs from a quadratic complexity in N i.e. N (N − 1)/2) to linear complexity
i.e. N calculation of the FGW distances from graphs to the reference measure graph. We compare the running time of
linearFGW and FGW with the same setting as in the classification task with fixed α of 0.5 and 0.0 for labeled graph
6
Table 4: The total training time and inference time (in seconds) averaged over 10-folds of cross-validation (with fixed
α) for different data sets. The standard deviation is reported with the symbol ±.
Methods/Data sets COX2 BZR ENZYMES PROTEINS IMDB-B

FGW 520.21±21.15 347.78±5.21 817.31±7.49 3224.36±125.02 1235.33±83.28
linearFGW 72.43 ± 0.16 53.81 ± 0.2 146.26 ±1.64 431.25±9.25 358.92±10.41
data sets and IMDB-B (unlabeled), respectively. In Table 4, we report the total running time of methods (both training
time and inference time) on 5 data sets used for classification experiments . It is shown that the linearFGW is much
faster than FGW on all considered data sets (roughly 7 times faster on COX2, BZR, ENZYMES, PROTEINS and 3 times
faster on IMDB-B). These numbers confirm the computational efficiency of linearFGW, making it possible to analyze
large-scale graph data sets.
5 CONCLUSION AND FUTURE WORK

We have developed an OT-based distance for learning with graph structured data. The key idea of this method is to
embed node feature and topology of a graph into a linear tangent space, where the Euclidean distance between two
embeddings of two graphs approximates their FGW distance. In fact the proposed distance is a generalization of the
linear optimal transport [25] and the FGW distance. Thus it has the following advantages: 1) as the FGW distance, the
proposed distance allows to take into account node features and topologies of graphs into the OT problem for computing
the dissimilarity between two graphs, 2) we can derive a valid kernel from the proposed distance for graphs while the
existing OT-based graph kernels are invalid, and 3) it provides the fast approximation of pairwise FGW distance, making
it more efficient to deal with large-scale graph data sets. We conducted experiments on some benchmark graph data sets
on both classification and clustering tasks, demonstrating the effectiveness of the proposed distance.
In this work, we suggested to use the fused Gromov-Waserstein barycenter [21] as the reference measure graph. Thanks
to the differentiablity of OT frameworks using techniques such as entropic regularization [4], one possibility for
future work is to learn the reference measure graph by updating the reference to minimize the supervision loss. The
classification performance will be improved with the label information of graphs used in the training process. Another
possibility would be to incorporate the linearFGW into graph-based deep learning models for learning with graph
structured data.
References
[1] F. Beier, R. Beinert, and G. Steidl. On a linear gromov-wasserstein distance. arXiv preprint arXiv:2112.11964,
2021.
[2] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Fifth IEEE international conference on
data mining (ICDM’05), pages 8–pp. IEEE, 2005.
[3] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function
prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
[4] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information
processing systems, 26, 2013.
[5] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of
molecular biology, 330(4):771–783, 2003.
[6] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt. Scalable kernels for graphs with
continuous attributes. Advances in neural information processing systems, 26, 2013.
[7] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th
international conference on machine learning (ICML-03), pages 321–328, 2003.
[8] S. Kolouri, N. Naderializadeh, G. K. Rohde, and H. Hoffmann. Wasserstein embedding for graph learning. arXiv
preprint arXiv:2006.09430, 2020.
[9] R. Luss and A. d’Aspremont. Support vector machine classification with indefinite kernels. Advances in neural
information processing systems, 20, 2007.
[10] F. Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foundations of computa-
tional mathematics, 11(4):417–487, 2011.
7
[11] G. Mialon, D. Chen, A. d’Aspremont, and J. Mairal. A trainable optimal transport embedding for feature
aggregation and its relationship to attention. arXiv preprint arXiv:2006.12065, 2020.
[12] C. Morris, N. M. Kriege, K. Kersting, and P. Mutzel. Faster kernels for graphs with continuous attributes via
hashing. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1095–1100. IEEE, 2016.
[13] D. H. Nguyen, C. H. Nguyen, and H. Mamitsuka. Learning subtree pattern importance for weisfeiler-lehman
based graph kernels. Machine Learning, 110(7):1585–1607, 2021.
[14] G. Peyré, M. Cuturi, and J. Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In
International Conference on Machine Learning, pages 2664–2672. PMLR, 2016.
[15] K. Riesen and H. Bunke. Iam graph database repository for graph based pattern recognition and machine learning.
In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and
Syntactic Pattern Recognition (SSPR), pages 287–297. Springer, 2008.
[16] J. Scott. Social network analysis: developments, advances, and prospects. Social network analysis and mining, 1
(1):21–26, 2011.
[17] R. Sharan and T. Ideker. Modeling cellular machinery through biological network comparison. Nature biotechnol-
ogy, 24(4):427–433, 2006.
[18] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph
kernels. Journal of Machine Learning Research, 12(9), 2011.
[19] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical
Monthly, 74(4):402–405, 1967.
[20] J. J. Sutherland, L. A. O’brien, and D. F. Weaver. Spline-fitting with a genetic algorithm: A method for developing
classification structure- activity relationships. Journal of chemical information and computer sciences, 43(6):
1906–1915, 2003.
[21] V. Titouan, N. Courty, R. Tavenard, and R. Flamary. Optimal transport for structured data with application on
graphs. In International Conference on Machine Learning, pages 6275–6284. PMLR, 2019.
[22] M. Togninalli, E. Ghisu, F. Llinares-López, B. Rieck, and K. Borgwardt. Wasserstein weisfeiler-lehman graph
kernels. Advances in Neural Information Processing Systems, 32, 2019.
[23] N. Trinajstic. Chemical graph theory. Routledge, 2018.
[24] C. Villani. The wasserstein distances. In Optimal transport, pages 93–111. Springer, 2009.
[25] W. Wang, D. Slepčev, S. Basu, J. A. Ozolek, and G. K. Rohde. A linear optimal transportation framework for
quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269,
2013.
[26] H. Xu. Gromov-wasserstein factorization models for graph clustering. In Proceedings of the AAAI conference on
artificial intelligence, volume 34, pages 6478–6485, 2020.
[27] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international
conference on knowledge discovery and data mining, pages 1365–1374, 2015.
A Appendix
A.1 Proof of Lemma 1
Proof. By the definition of barycentric projections for nodes and edges (see Definition 1), we denote z̃k = Tn,π∗ (zk )
and C̃k,l = Te,π∗ (Ck,l ) with k, l = 1, K. In order to prove the first claim, in contrary we assume that π (6= diag(σ)) is
the optimal transport plan with respect to FGWα (G, G̃). Then, we have the following inequality:
X X
(1 − α)σk kzk − z̃k k2 + ασk σl |Ck,l − C̃k,l |2 > (1 − α)kzk − z̃k0 k2 + α|Ck,l − C̃k0 ,l0 |2 πk,k0 πl,l0 (7)
k,l k,l,k0 ,l0
8
We rewrite the FGW distance between G(Z, C, σ) and G(X, A, µ) as follows:

X 2 ∗ ∗
FGWα (G, G) = (1 − α)kxi − zk k2 + α (Ai,j − Ck,l ) πk,i πl,j
i,j,k,l
X X X
∗
=(1 − α) µi kxi k2 + (1 − α) σk kzk k2 − 2(1 − α) πk,i hxi , zk i
i k i,k
X X X
∗ ∗
+α µi µj A2i,j +α σk σl C2k,l − 2α Ck,l Ai,j πk,i πl,j
i,j k,l i,j,k,l
Using the definition of barycentric projections for nodes and edges in Definition 1, we have:
X X X
FGWα (G, G) =(1 − α) µi kxi k2 + (1 − α) σk kzk k2 − 2(1 − α) σk hzk , z̃k i
i k k
X X X
+α µi µj A2ij +α σk σl C2kl − 2α σk σl Ckl C̃kl
i,j k,l k,l
X X X
=(1 − α) µi kxi k2 − (1 − α) σk kz̃k k2 + (1 − α) σk kzk − z̃k k2
i k k
X X X
+α µi µj A2ij −α σk σl C̃2kl +α σk σl |Ckl − C̃kl |2
i,j k,l k,l
By using the inequality (7), we have:

X X X X
FGWα (G, G) >(1 − α) µi kxi k2 − (1 − α) σk kz̃k k2 + α µi µj A2i,j − α σk σl C̃2k,l
i k i,j k,l
X
2 2
+ (1 − α)kzk − z̃k0 k + α|Ck,l − C̃k0 l0 | πk,k0 πl,l0
k,l,k0 ,l0
!
X X X
2 2 2
= (1 − α) µi kxi k − (1 − α) σk0 kz̃k0 k − kzk − z̃k0 k πkk0
i k0 k
| {z }
(a)
 
X X X
+α µi µj A2ij − α σk0 σl0 C̃2k0 l0 − |Ck,l − C̃k0 l0 |2 πkk0 πll0 
i,j k0 ,l0 k,l
| {z }
(b)
We first process the part (a) by using Jensen inequality as follows:

!
X X π ∗0 X
2 2
(a) ≥(1 − α) µi kxi k − (1 − α) k i
σk0 kxi k − kzk − xi k2 πk,k0
σk 0
i k0 ,i k
X πkk0 πk∗0 i X
∗∗
=(1 − α) kzk − xi k2 = (1 − α) kzk − xi k2 πk,i
σk0
k,k0 ,i k,i
∗
∗∗ πk,k0 πk 0 ,i
and π ∗∗ is an admissible transport map from G to G.
P
where πk,i = k0 σk 0
We then process the part (b) in a similar way and obtain:

X
∗∗ ∗∗
(b) ≥α |Ai,j − Ck,l |2 πk,i πl,j
i,j,k,l
Combining parts (a) and (b), we have:

X 2
∗∗ ∗∗
FGWα (G, G) > (1 − α)kxi − zk k2 + α (Ai,j − Ck,l ) πk,i πl,j
i,j,k,l
9
which contradicts the optimality of π ∗ . So we can conclude here that:

X
FGWα (G, G̃) = (1 − α)σk kzk − z̃k k2 + ασk σl |Ck,l − C̃k,l |2 (8)
k,l
Proving the second claim is straight-forward by applying Jensen inequality for Equation (8). Indeed, we have:
∗
X πk,i X πk,i ∗ ∗
X πl,j
FGWα (G, G̃) ≤ (1 − α)σk kzk − xi k2 + ασk σl |Ck,l − Ai,j |2
i
σ k i,j
σ k σl
k,l
X
2 2 ∗ ∗
= (1 − α)kzk − xi k + α|Ck,l − Ai,j | πk,i πl,j = FGWα (G, G)
k,l,i,j
A.2 Proof of Lemma 2
Proof. Let denote π1 and π2 be the optimal transport plans from G to G1 and G2 , respectively, in the sense of the
FGW distance. We also denote G̃1 and G̃2 be the measure graphs which are transported from G using the barycentric
projections {Tn,π1 , Te,π1 } and {Tn,π2 , Te,π2 }, respectively.
By using triangle inequality, we have:
|FGWα (G1 , G2 )−linearFGWα (G1 , G2 )| ≤ |FGWα (G1 , G2 ) − 2FGWα (G̃1 , G2 )| + |2FGWα (G̃1 , G2 ) − FGWα (G̃1 , G̃2 )|
+|FGWα (G̃1 , G̃2 ) − linearFGWα (G1 , G2 )|
≤2FGWα (G1 , G̃1 ) + 2FGWα (G2 , G̃2 ) + |FGWα (G̃1 , G̃2 ) − linearFGWα (G1 , G2 )|
| {z }
(c)
The last inequality is obtained by using the relaxed triangle inequality of the FGW with q = 2 (see [21]). It is obvious to
see that FGWα (G1 , G̃1 ) ≤ diam(G1 ) and FGWα (G2 , G̃2 ) ≤ diam(G2 ). In order to process the part (c), we notice that:
X X
|2FGWα (G̃1 , G)−linearFGWα (G1 , G2 )| = |(1 − α) 2σk kzk − Tn,π1 (zk )k2 + α 2σk σl |Ck,l − Te,π1 (Ck,l )|2
k k,l
X X
2
−(1 − α) σk kTn,π1 (zk ) − Tn,π2 (zk )k − α σk σl |Te,π1 (Ck,l ) − Te,π2 (Ck,l )|2 |
k k,l
X
≤(1 − α) σk |2kzk − Tn,π1 (zk )k2 − kTn,π1 (zk ) − Tn,π2 (zk )k2 |
k
X
+α σk σl |2|Ck,l − Te,π1 (Ck,l )|2 − |Te,π1 (Ck,l ) − Te,π2 (Ck,l )|2 |
k,l
X X
≤(1 − α) 2σk kzk − Tn,π2 (zk )k2 + α 2σk σl |Ckl − Te,π2 (Ck,l )|2
k k,l
=2FGWα (G̃2 , G)
(9)
2 2 2
The last inequality is obtained by applying the following inequality: |2(a − b) − (b − c) | ≤ (a − c) , for all a, b, c ∈ R.
Finally, we have:
(c) = |FGWα (G̃1 , G̃2 ) − linearFGWα (G1 , G2 )| ≤|FGWα (G̃1 , G̃2 ) − 2FGWα (G̃1 , G)| + |2FGWα (G̃1 , G) − linearFGWα (G1 , G2 )|
≤2FGWα (G̃2 , G) + 2FGWα (G̃2 , G) = 4FGWα (G̃2 , G)
≤4FGWα (G2 , G)
The second inequality is obtained by applying the relaxed triangle inequality of the FGW with q = 2 (see [21]) and the
inequality (9) while the last inequality is obtained by the second claim of Lemma 1. We also have (c) ≤ 4FGWα (G1 , G)
by the symmetry of G1 and G2 with respect to G, which concludes the proof.
10

On A Linear Fused Gromov-Wasserstein Distance For Graph Structured Data

Uploaded by

Copyright:

Available Formats

On A Linear Fused Gromov-Wasserstein Distance For Graph Structured Data

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

On A Linear Fused Gromov-Wasserstein Distance For Graph Structured Data

Uploaded by

Copyright:

Available Formats

O N A LINEAR FUSED G ROMOV-WASSERSTEIN DISTANCE FOR

GRAPH STRUCTURED DATA

Dai Hai Nguyen Koji Tsuda

The University of Tokyo The University of Tokyo

March 10, 2022

Keywords linear optimal transport · graph structured data · graph kernel

2.1 Kernels for Graphs

2.2 Optimal Transport Frameworks for Graphs

2.3 Linear Optimal Transport

3 Proposed Distance for Graphs: Linear Fused Gromov-Wasserstein

3.1 FGW: A Distance for Matching Node Features and Structures

3.2 linearFGW: A New Distance for Comparing Graphs

2. FGWα (G, G̃) ≤ FGWα (G, G).

3.3 Selection of Reference Measure Graph

The proof is given in the Appendix section.

Table 1: Statistics of data sets used in experiments

Dataset #graphs #classes Ave. #odes Ave. #edges #attributes

3.4 Implementation Details

4.1 Data sets

4.2 Experimental settings

Kernels/Data sets COX2 BZR ENZYMES PROTEINS IMDB-B

Methods/Data sets AIDS PROTEINS PROTEINS-F IMDB-B

Methods/Data sets COX2 BZR ENZYMES PROTEINS IMDB-B

5 CONCLUSION AND FUTURE WORK

A.1 Proof of Lemma 1

We rewrite the FGW distance between G(Z, C, σ) and G(X, A, µ) as follows:

By using the inequality (7), we have:

We first process the part (a) by using Jensen inequality as follows:

We then process the part (b) in a similar way and obtain:

Combining parts (a) and (b), we have:

which contradicts the optimality of π ∗ . So we can conclude here that:

A.2 Proof of Lemma 2

You might also like