Sub2vec Pakdd18
1 Introduction
Graphs are a natural abstraction for representing relational data from multiple domains
such as social networks, protein-protein interaction networks, the World Wide Web, and
so on. Analysis of such networks includes classification [1], community detection [2,
3], and so on. Many of these tasks can be solved using machine learning algorithms.
Unfortunately, since most machine learning algorithms require data to be represented
as features, applying them to graphs is challenging due to their high dimensionality and
structure. In this context, learning discriminative feature representations of subgraphs
can help in leveraging existing machine learning algorithms more widely on graph data.
Apart from classical dimensionality reduction techniques (see related work), recent
works [4–7] have explored various ways of learning feature representations of nodes in
networks, exploiting parallels with vector representations in NLP (like word2vec [8]).
However, the application of such methods is limited to binary and multi-class node
classification and edge prediction. It is not clear how one can exploit these methods for
tasks like community detection, which are inherently based on subgraphs; node embeddings
lose information about the subgraph structure. Embeddings of subgraphs or neighborhoods
themselves seem better suited for these tasks. Surprisingly, learning feature
representations of networks themselves (subgraphs and graphs) has not gained much
attention. Here, we address this gap by studying the problem of learning distributed
representations of subgraphs in a low-dimensional continuous vector space.
Figure 1(a-b) gives an illustration of our framework. Given a set of subgraphs (Figure 1
(a)), we learn a low-dimensional feature representation of each subgraph (Figure 1(b)).
[Figure 1 panels: (a) Input subgraphs, (b) Embeddings, (c) Ground Truth, (d) Node2Vec, (e) Sub2Vec]
Fig. 1: (a) and (b) An overview of Sub2Vec. Our input is a set of subgraphs S. Sub2Vec
learns a d-dimensional feature embedding of each subgraph. (c)-(e) Leveraging embeddings
learned by Sub2Vec for community detection. (c) Communities in the School network
(different colors represent different communities). (d) Communities discovered via
Node2Vec deviate from the ground truth, (e) while those discovered via Sub2Vec closely
match the ground truth.
2 Problem Formulation
We begin with the setting of our problem. Let G(V, E) be a graph where V is the vertex
set and E is the associated edge-set (we assume unweighted undirected graphs here, but
our framework can be easily extended to weighted and/or directed graphs as well). A
graph gi (vi , ei ) is said to be a subgraph of a larger graph G(V, E) if vi ⊆ V and
ei ⊆ E. For simplicity, we write gi (vi , ei ) as gi . As input, we require a set of subgraphs
S = {g1 , g2 , . . . , gn }, typically extracted from the same graph G(V, E). Our goal is to
embed each subgraph gi ∈ S into a d-dimensional feature space R^d, where d << |V|.
Main Idea: Intuitively, our goal is to learn a feature representation of each subgraph
gi ∈ S such that the likelihood of preserving certain properties of each subgraph, de-
fined in the network setting, is maximized in the latent feature space. In this work, we
provide a framework to preserve two different properties of subgraphs, namely the
Neighborhood and Structural properties.
Neighborhood Property: Intuitively, the Neighborhood property of a subgraph cap-
tures the neighborhood information within a subgraph itself for each node in it. For
illustration, consider the following example. In the figure below, let g1 be the subgraph
induced by nodes {a, e, c, d}. The Neighborhood property of g1 should be able to capture
the information that nodes a, c are in the neighborhood of node e, and that nodes d, e
are in the neighborhood of node c. To capture the neighborhood information of all the
nodes in a given subgraph, we consider paths annotated by ids of the nodes. We refer
to such paths as the Id-paths and define the Neighborhood property of a subgraph gi as
the set of all Id-paths in gi .
The Id-paths capture the neighborhood information in
subgraphs and each succession of nodes in Id-paths re-
veals how the neighborhood in the subgraph is evolving.
For example, in g1 described above, the Id-path a → c → d
shows that nodes a and c are neighbors of each other. Moreover, this path, along with
a → e → d, indicates that nodes a and d are in the neighborhood of each other (despite not
being direct neighbors). Hence, the set of all Id-paths captures important connectivity
information of the subgraph.
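
As a rough illustration of how Id-paths might be sampled, the sketch below runs a simple uniform random walk over g1. The adjacency list is an assumption inferred from the paths mentioned in the text (the original figure is not reproduced here), and the helper name `sample_id_path` is hypothetical rather than part of the authors' code.

```python
import random

# Assumed adjacency of g1 (nodes a, e, c, d); the edges are inferred from
# the Id-paths a -> c -> d and a -> e -> d discussed in the text.
G1 = {
    "a": ["c", "e"],
    "c": ["a", "d", "e"],
    "d": ["c", "e"],
    "e": ["a", "c", "d"],
}


def sample_id_path(adj, start, length, rng):
    """Return a uniform random walk of up to `length` node ids from `start`."""
    path = [start]
    while len(path) < length:
        neighbors = adj[path[-1]]
        if not neighbors:  # isolated node: the walk cannot continue
            break
        path.append(rng.choice(neighbors))
    return path


rng = random.Random(0)
id_path = sample_id_path(G1, "a", 5, rng)
print(" -> ".join(id_path))
```

Every consecutive pair in the sampled path is an edge of g1, which is what lets a set of such paths reflect the connectivity of the subgraph.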
Structural Property: The Structural property of a subgraph captures the overall struc-
ture of the subgraph as opposed to just the local connectivity information as captured
by the Neighborhood property. Several prior works have leveraged degree of nodes and
their neighbors to capture structural information in network representation learning [9,
10]. While degree of a node captures its local structural information within a subgraph,
it fails in characterizing the similarity between the structures of two nodes in different
subgraphs. Note that the nodes of two subgraphs with the same structure but of different
sizes will have different degrees. For example, nodes in a clique of size 10 have
degree 9, whereas nodes in a clique of size 6 have degree 5. This suggests that the ratio
of the degree of a node (and of its neighbors) to the size of the subgraph better
identifies subgraph structure. Hence we rely on paths in gi annotated by
the ratio of node degrees to the subgraph size. We refer to the set of all such paths as
Degree-paths. Degree-paths capture the structure by tracking how the density of edges
changes in a neighborhood. While our method is simple, it is effective as shown by the
results. One can build upon our framework by leveraging other techniques like rooted
subgraphs [11] and predefined motifs [12].
As an example, consider the subgraph g2 induced by nodes {a, b, c, e} and subgraph
g3 induced by nodes {f, h, i, g} in the graph shown above. As g2 is a clique, the ratio of
degree to the size of the subgraph for each node in g2 is 0.75. Hence any Degree-path
of length 3 in g2 is 0.75 → 0.75 → 0.75. Similarly, g3 is a star and a Degree-path in g3
from i to g is 0.25 → 0.75 → 0.25. The consistently high values in the paths in cliques
show that each node in the path is densely connected to the rest of the graph, while the
fluctuation in values in stars shows that the two spokes in the path are sparsely connected
to the rest of the network while the center is densely connected. In practice, since we
cannot treat each real value distinctly, we generate labels for each node from a fixed
alphabet (see Section 3).
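
The clique and star examples above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the centre of the star g3 is assumed to be node h, since the text only states that the path runs from i to g.

```python
def degree_ratios(adj):
    """Map each node to degree / (number of nodes in the subgraph)."""
    n = len(adj)
    return {v: len(nbrs) / n for v, nbrs in adj.items()}


def degree_path(adj, id_path):
    """Annotate an Id-path with the degree-to-size ratio of each node."""
    ratios = degree_ratios(adj)
    return [ratios[v] for v in id_path]


# g2: clique on {a, b, c, e}
G2 = {v: [u for u in "abce" if u != v] for v in "abce"}
# g3: star on {f, h, i, g}; choosing "h" as the centre is an assumption
G3 = {"h": ["f", "i", "g"], "f": ["h"], "i": ["h"], "g": ["h"]}

print(degree_path(G2, ["a", "b", "c"]))  # [0.75, 0.75, 0.75]
print(degree_path(G3, ["i", "h", "g"]))  # [0.25, 0.75, 0.25]
```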
Our Problems: Having defined the Neighborhood and Structural properties of sub-
graphs, we want to learn vector representations in Rd , such that the likelihood of pre-
serving these properties in the feature space is maximized. Formally the two versions
of our Subgraph Embedding problem are:
Problem 1. Given a graph G(V, E), a dimension d, and a set of subgraphs (of G) S = {g1, g2, ..., gn},
learn an embedding function f : gi → yi ∈ Rd such that the Neighborhood property of
each gi ∈ S is preserved.
Problem 2. Given a graph G(V, E), a dimension d, and a set of subgraphs (of G) S = {g1, g2, ..., gn},
learn an embedding function f : gi → yi ∈ Rd such that the Structural property of each
gi ∈ S is preserved.
3 Our Methods
3.1 Overview
A major challenge in solving our problems is to design an architecture which has a global
view of subgraphs and is able to capture similarities and differences between the
properties of entire subgraphs. Our idea to overcome this challenge is to leverage the
Paragraph2vec models for our subgraph embedding problems. Paragraph2vec [13] models
learn latent representations of entire paragraphs while maximizing similarity between
paragraphs which have similar word co-occurrences. Note that these models have a
global view of entire paragraphs. Intuitively, such a model is suitable for solving
Problems 1 and 2. Thus, we extend Paragraph2vec to learn subgraph embeddings while
preserving distance between subgraphs that have similar ‘node co-occurrences’. We extend
both Paragraph2vec models (PV-DBOW and PV-DM). We call our models Distributed
Bag of Nodes version of Subgraph Vector (Sub2Vec-DBON) and Distributed Memory
version of Subgraph Vector (Sub2Vec-DM) respectively. We discuss Sub2Vec-DM
and Sub2Vec-DBON in detail in Subsections 3.3 and 3.4.
In addition, another challenge is to generate meaningful context of ‘node co-occurrences’
which preserve Neighborhood and Structural properties of subgraphs. We tackle this
challenge by using our Id-paths and Degree-paths. As discussed earlier, Id-paths and
Degree-paths capture the Neighborhood and Structural properties respectively. We discuss
how to efficiently generate samples of Id-paths and Degree-paths next.
3.3 Sub2Vec-DM
where Pr(a|W2 (θa ), W1 (i)) is the probability of predicting node a given the vector
representations of θa and gi . It is defined using the softmax function:
3.4 Sub2Vec-DBON
In the Sub2Vec-DBON model, we want to predict a set θ of co-occurring node-ids in
an Id-path sampled from subgraph gi, given only the subgraph gi. Note that Sub2Vec-DBON
does not explicitly rely on the embeddings of node-ids as in Sub2Vec-DM.
As shown in Section 3.3, the ‘co-occurrence’ means that two ids co-appear in a sliding
window of a fixed length w. For example, consider the same example as in Section 3.3:
the subgraph g1 and the node sequence a → b → c generated by random walks. Now
in the Sub2Vec-DBON model, for w = 3, the goal is to predict the set {a, b, c} given
the subgraph g1 . This model is parallel to the popular skip-gram model. The matrices
W1 and W2 are the same as in Section 3.3.
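
To make the notion of a co-occurrence set concrete, the following sketch enumerates the sets of node ids that fall inside each sliding window of length w over a walk. The function name `cooccurrence_sets` is hypothetical, and how the actual implementation samples windows may differ; this only illustrates the definition used above.

```python
def cooccurrence_sets(walk, w):
    """Return the set of node ids in every length-w window of the walk."""
    if not walk:
        return []
    if len(walk) < w:  # walk shorter than the window: one set with all ids
        return [set(walk)]
    return [set(walk[i:i + w]) for i in range(len(walk) - w + 1)]


# The walk a -> b -> c with w = 3 yields the single target set {a, b, c}.
targets = cooccurrence_sets(["a", "b", "c"], 3)
print(targets)
```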
Formally, given a subgraph gi , and the θ drawn from gi , the objective of Sub2Vec-
DBON is the following:
\[ \max \sum_{g_i \in S} \sum_{\theta \in g_i} \log \Pr(\theta \mid W_1(i)) \tag{2} \]
Since computing Equation 2 involves summation over all possible sets of co-occurring
nodes, we use approximation techniques such as negative sampling [8].
3.5 Algorithm
Our algorithm Sub2Vec works as follows: we first generate the samples of
Id-paths/Degree-paths in each subgraph by running random walks. Then we optimize the
Sub2Vec-DBON/Sub2Vec-DM objectives using the stochastic gradient descent (SGD) method [15]
by leveraging the random walks. We used the Gensim package for implementation [16].
The complete pseudocode is presented in Algorithm 1.
Algorithm 1 Sub2Vec
Require: Graph G, subgraph set S = {g1 , g2 , . . . , gn }, length of the context window w, dimen-
sion d
1: walkSet = {}
2: for each gi in S do
3: walk = RandomWalk (gi )
4: walkSet[gi ] = walk
5: end for
6: f = StochasticGradientDescent(walkSet, d, w)
7: return f
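
A minimal Python sketch of Algorithm 1 follows. The text only states that Gensim was used, so mapping the SGD step onto Gensim's `Doc2Vec` class is an assumption, as are the helper names, the toy subgraphs, and the `epochs` default. The Gensim import is deferred so that the walk-generation part runs without the package installed.

```python
import random


def random_walk(adj, length, rng):
    """One uniform random walk over a subgraph given as an adjacency dict."""
    node = rng.choice(sorted(adj))
    walk = [node]
    while len(walk) < length and adj[node]:
        node = rng.choice(adj[node])
        walk.append(node)
    return walk


def build_walk_set(subgraphs, length, seed=0):
    """Steps 1-5 of Algorithm 1: one walk per subgraph, keyed by its index."""
    rng = random.Random(seed)
    return {i: random_walk(adj, length, rng) for i, adj in enumerate(subgraphs)}


def sub2vec_embed(walk_set, d, w, dm=1, epochs=20):
    """Step 6, sketched with Gensim Doc2Vec (dm=1: PV-DM, dm=0: PV-DBOW)."""
    # Deferred import: only needed when embeddings are actually trained.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [TaggedDocument(words=[str(v) for v in walk], tags=[i])
            for i, walk in walk_set.items()]
    model = Doc2Vec(docs, vector_size=d, window=w, dm=dm,
                    min_count=1, epochs=epochs)
    return {i: model.dv[i] for i in walk_set}  # `dv` is the Gensim 4.x name


# Toy subgraphs (a triangle and a path), made up for illustration.
subgraphs = [
    {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]},
    {"x": ["y"], "y": ["x", "z"], "z": ["y"]},
]
walk_set = build_walk_set(subgraphs, length=10)
```

With Gensim available, `sub2vec_embed(walk_set, d=128, w=3)` would return one vector per subgraph index.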
4 Experiments
We leverage Sub2Vec (see footnote 1) for two applications, namely Community Detection and
Graph Classification, and perform a case study on real subgraph data. All experiments are
conducted on a machine with 4 Xeon E7-4850 CPUs and 512GB of 1066MHz RAM. Unless
mentioned otherwise, we set the length of the random walk to 1000 and, following the
literature [5], the dimension of the embedding to 128.
Setup. Here we show how to leverage Sub2Vec for the well-known community detec-
tion problem [2, 3]. A community in a network is a coherent group of nodes which are
densely connected among themselves and sparsely connected with the rest of the net-
work. One expects many nodes within the same community to have similar neighbor-
hoods. Hence, we can use Sub2Vec to embed subgraphs while preserving the Neigh-
borhood property and cluster the embeddings to detect communities.
Approach. We propose to use Sub2Vec for community detection by embedding the
surrounding neighborhood of each node. First, we extract the neighborhood Cv of each
node v ∈ V from the input graph G(V, E). For each node v, we extract its neighborhood
Cv only once. Hence, we get a set S = {Cv | v ∈ V} of |V| neighborhoods
extracted from G. Since each Cv is a subgraph, we can leverage Sub2Vec to embed
each Cv ∈ S. The idea is that similar Cv s will be embedded together, which can then
be clustered to detect communities. We use a clustering algorithm (K-Means) to clus-
ter the feature vectors f (Cv ) of each Cv . For datasets with overlapping communities
(like Youtube), we use the Neo-Kmeans algorithm [17] to obtain overlapping clus-
ters. Cluster membership of f (Cv ) determines the community membership of node v.
The complete pseudocode is in Algorithm 2.
In Algorithm 2, we define the neighborhood of each node to be its ego-network
for dense networks (School and Work) and its 2-hop ego-network for others. The
ego-network of a node is the subgraph induced by the node and its neighbors. The 2-
hop ego-network is the subgraph induced by the node, its neighbors, and neighbors’
neighbors.
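
The two neighborhood definitions can be sketched as a single helper that takes the number of hops as a parameter. The function name and the toy path graph are illustrative assumptions.

```python
def ego_network(adj, v, hops=1):
    """Subgraph induced by v and all nodes within `hops` steps of v."""
    nodes = {v}
    frontier = {v}
    for _ in range(hops):
        # Expand by one hop, keeping only nodes not seen before.
        frontier = {u for x in frontier for u in adj[x]} - nodes
        nodes |= frontier
    # Induced subgraph: keep only edges with both endpoints in `nodes`.
    return {x: [u for u in adj[x] if u in nodes] for x in sorted(nodes)}


# Hypothetical path graph 0 - 1 - 2 - 3, used only for illustration.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(ego_network(PATH, 0))          # {0: [1], 1: [0]}
print(ego_network(PATH, 0, hops=2))  # {0: [1], 1: [0, 2], 2: [1]}
```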
1 Code in Python available at: https://goo.gl/Ef4q8g
Algorithm 2 Community Detection using Sub2Vec
Require: A network G(V, E), Sub2Vec parameters, k number of communities
1: neighborhoodSet = {}
2: for each v in V do
3: neighborhoodSet = neighborhoodSet ∪ neighborhood of v in G
4: end for
5: vecs = Sub2Vec (neighborhoodSet, w, d)
6: clusters = Clustering(vecs, k)
7: return clusters
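
The clustering step of Algorithm 2 (steps 6-7) can be sketched as below. This is a plain Lloyd's k-means written in pure Python with a deterministic farthest-point initialisation, standing in for whichever K-Means implementation was actually used; it does not cover the overlapping Neo-Kmeans variant. The toy vectors are made up and merely stand in for Sub2Vec embeddings f(Cv).

```python
def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    # Deterministic farthest-point initialisation (assumes k <= len(vectors)).
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda x: min(_sqdist(x, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centre for every vector.
        labels = [min(range(k), key=lambda j: _sqdist(x, centers[j]))
                  for x in vectors]
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def detect_communities(node_vecs, k):
    """node_vecs maps node v to its embedding f(Cv); returns node -> cluster."""
    nodes = sorted(node_vecs)
    labels = kmeans([node_vecs[v] for v in nodes], k)
    return dict(zip(nodes, labels))


# Toy 2-d vectors standing in for Sub2Vec output (made up for illustration).
toy = {"u1": [0.0, 0.1], "u2": [0.1, 0.0], "v1": [5.0, 5.1], "v2": [5.1, 5.0]}
communities = detect_communities(toy, k=2)
print(communities)
```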
Datasets. We use real-world datasets of varying sizes from multiple domains, such as social
interactions, co-authorship, and social networks. See Table 1.
Table 1: Information on datasets for Community Detection (first table) and Graph
Classification (second table). # com denotes the number of ground truth communities in
each dataset. # nodes denotes the average number of nodes in each graph classification
dataset.

Dataset    |V|     |E|     # com   Domain
Work       92      757     5       contact
Cornell    195     304     5       web
School     182     2221    5       contact
Texas      187     328     5       web
Wash.      230     446     5       web
Wisc.      265     530     5       web
PolBlogs   1490    16783   2       web
Youtube    1.13M   2.97M   5000    social

Dataset    # graphs   # classes   # nodes   # labels
MUT        188        2           17.9      7
PTC        344        2           25.5      19
ENZ        600        6           32.6      3
PRT        1113       2           39.1      3
NC1        4110       2           29.8      37
NC109      4127       2           29.6      38
Results. We measure the performance of all the algorithms by computing the Average
F1 score [18] against the ground truth. See Table 2. Both versions of Sub2Vec
significantly and consistently outperform all the baselines. We achieve a significant gain
of 123.5% over the closest competitor (Node2Vec) for Youtube. We do better than
Node2Vec and DeepWalk because, intuitively, we learn the feature vector of the
neighborhood of each node for the community detection task, while they just do random
probes of the neighborhood. The performance of Newman and Louvain is considerably
poor on Youtube, as these methods output non-overlapping communities. Performance
of Node2Vec is satisfactory in sparse networks like Wash. and Texas. Node2Vec
does slightly better (∼ 1%) than Sub2Vec in PolBlogs—the network consists of ho-
mogeneous neighborhoods, which favors it. However, the performance of Node2Vec
is significantly worse for dense networks like Work and School. On the other hand,
performance of Sub2Vec is even more impressive in these dense networks (where the
task is more challenging).
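
For readers unfamiliar with the metric, the sketch below computes an Average F1 score between ground-truth and detected communities. It assumes the commonly used best-match definition (the mean of the best-match F1 in both directions); whether this matches the exact formulation in [18] is an assumption, and the toy communities are made up.

```python
def f1(a, b):
    """F1 score between two node sets, treating `a` as the reference."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def average_f1(truth, found):
    """Mean of best-match F1 in both directions (truth->found, found->truth)."""
    truth_to_found = sum(max(f1(t, c) for c in found) for t in truth) / len(truth)
    found_to_truth = sum(max(f1(c, t) for t in truth) for c in found) / len(found)
    return 0.5 * (truth_to_found + found_to_truth)


# Made-up communities for illustration only.
truth = [{1, 2, 3}, {4, 5, 6}]
found = [{1, 2, 3}, {4, 5}, {6}]
score = average_f1(truth, found)
print(round(score, 4))
```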
Setup. Here, we show an application of our method in the Graph Classification task
[10, 19]. In the graph classification problem, the data consists of multiple (gi , Yi ) tu-
ples, where each gi is a graph and Yi is its class-label. Moreover, the nodes in each
graph gi are labeled. The goal is to predict the class Yi for a given graph gi. Since
we can generate a discriminative feature representation of each graph (while preserving
either Neighborhood or Structural properties), we can train any off-the-shelf classifier
to classify the graphs. In this experiment, we set the dimension of embedding as 300
and set the length of the random walk as 100000.
Approach. Our approach is to learn the embedding of each graph by treating each of them as
a subgraph of the union of all the graphs. First we learn the feature representation of the
graphs such that either the Neighborhood (Sub2Vec-N) or Structural (Sub2Vec-S)
property is preserved. We then leverage four off-the-shelf classifiers: Decision Tree,
Random Forest, SVM, and Multi-Layered Perceptron, to solve the classification task.
Datasets. We test on classic graph classification benchmark datasets. All the datasets
are publicly available (see footnote 2). The list of datasets is presented in Table 1.
Baselines. We used two state-of-the-art methods as our competitors. WL-Kernel [10]:
This is a graph kernel method based on the Weisfeiler-Lehman test of graph isomorphism.
DG-Kernel [19]: This is a deep-learning version of the WL-Kernel, which relies on latent
representations of sub-structures of the graphs. It uses the popular skip-gram model.
2 http://mlcb.is.tuebingen.mpg.de/Mitarbeiter/Nino/Graphkernels/
Results. We report the testing accuracy of a 5-fold cross validation. For both Sub2Vec-
N and Sub2Vec-S, we run both of our models Sub2Vec-DM and Sub2Vec-DBON.
We then train all four classifiers and show the best of them. See Table 3. The results
show that both of our methods consistently outperform competitors. The gain of
Sub2Vec over the state-of-the-art DG-Kernel is up to a significant 67.9%. The better
performance of Sub2Vec-S over Sub2Vec-N indicates that structural properties
in these datasets are more discriminative. This is intuitive, as different bonds between
the elements result in different structures and also determine the chemical properties of
a compound [20]. Interestingly, the Neighborhood property outperforms the Structural
property on the ENZ dataset. This suggests that, in the ENZ dataset, which also has a
higher number of classes, the Neighborhood property is more important than the Structural
property in determining the graph class.
4.3 Scalability
[Figure panels: (a) Politics, (b) Religion, (c) Entertainment, (d) Spanish]
We perform case studies on the MemeTracker dataset (see footnote 3) to investigate if the
embeddings returned by Sub2Vec are interpretable. Here, we run Sub2Vec to preserve the
Neighborhood property. The
MemeTracker dataset consists of a series of cascades caused by memes spreading on the
network of linked web pages. Each meme-cascade induces a subgraph in the underly-
ing network. We first embed these subgraphs in a continuous vector space by leveraging
Sub2Vec. We then cluster these vectors to explore what kind of meme cascade-graphs
are grouped together and what characteristics of memes determine their similarity and
distance to each other. For this case-study, we pick the top 1000 memes by volume, and
cluster them into 10 clusters.
3 snap.stanford.edu
We find coherent clusters which are meaningful groupings of memes based on topics.
For example, we find clusters of memes related to different topics such as entertainment,
politics, religion, technology and so on. Visualization of these clusters is presented
above. In the entertainment cluster, we find memes which are names of popular
songs and movies such as “sweet home alabama”, “Madagascar 2” and so on. Similarly,
we also find a cluster of religious memes. These memes are quotes from the Bible. In
the politics cluster, we find popular quotes from the 2008 presidential election season,
e.g. Barack Obama’s popular slogan “yes we can” along with his controversial quotes
like “you can put lipstick on a pig”. Interestingly, we find that all the
memes in the Spanish language were clustered together. This indicates that memes in
different languages travel through separate websites, which matches reality, as most
webpages use one primary language.
5 Related Work
The network embedding problem, which seeks to generate low dimensional feature
representation of nodes, has been well studied. Early work includes [21–23]. However,
these methods are slow and do not scale to large networks. Recently, several
deep-learning-based network embedding algorithms were proposed. DeepWalk [4]
and Node2Vec [5] learn feature representations based on contexts generated by random
walks. SDNE [6] and LINE [7] learn feature representations of nodes while preserving
first and second order proximity. Other recent works include [24, 9, 25]. However, all of
them learn node embeddings, while our goal is to embed subgraphs.
The most similar network embedding literature includes [12, 19, 11]. Riesen and
Bunke [12] propose to learn vector representations of graphs based on edit distance
to a set of pre-defined prototype graphs. Yanardag et al. [19] and Narayanan et al. [11]
learn vector representations of subgraphs using Word2Vec by generating a “corpus”
of subgraphs where each subgraph is treated as a word. The above works focus
on some specific subgraphs like graphlets and rooted subgraphs. None of them embed
subgraphs with arbitrary structure.