Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling
Takuya Akiba†, Takanori Hayashi†, Nozomi Nori‡, Yoichi Iwata†, and Yuichi Yoshida§‖
† The University of Tokyo, 113-0033, Tokyo, Japan
‡ Kyoto University, 606-8501, Kyoto, Japan
§ National Institute of Informatics, 101-8430, Tokyo, Japan
‖ Preferred Infrastructure, Inc., 113-0033, Tokyo, Japan
{t.akiba,thayashi,y.iwata}@is.s.u-tokyo.ac.jp, nozomi@ml.ist.i.kyoto-u.ac.jp, yyoshida@nii.ac.jp
Abstract
We propose an indexing scheme for top-k shortest-path distance queries on graphs, which is useful in a wide range of important applications such as network-aware searches and link prediction. While many efficient methods for answering standard (top-1) distance queries have been developed, none of these methods are directly extensible to top-k distance queries. We develop a new framework for top-k distance queries based on 2-hop cover and then present an efficient indexing algorithm based on the recently proposed pruned landmark labeling scheme. The scalability, efficiency and robustness of our method are demonstrated in extensive experimental results. Moreover, we demonstrate the usefulness of top-k distance queries by applying them to link prediction, the most fundamental graph problem in the AI and Web communities.
Introduction
The shortest-path distance between vertices in a network is a fundamental concept in graph theory and is widely applied in the AI and Web communities. For example, because the distances between vertices indicate the relevance among the vertices, they can identify other users or contents that best match a user’s intent in socially-sensitive searches (Vieira et al. 2007; Yahia et al. 2008; Maniu and Cautis 2013). In context-aware searches, they are used to assign higher ranks to web pages more related to the currently visited web page (Ukkonen et al. 2008; Potamias et al. 2009).
However, there is a fundamental drawback of basing relevance on distance alone. Specifically, distances should be integers and the diameters of real-world networks are typically small (Watts and Strogatz 1998). Such small diameters greatly reduce the number of possible distances and preclude the full use of the underlying structure.
This problem is clearly depicted in Figure 1. In each graph in the figure, the distance between the pair of black vertices is four. Hence, based on distance alone, the black pairs in all three graphs have the same similarity. However, the pair in graph (c) seems more tightly connected than the pairs in graphs (a) and (b), since this pair is connected by a greater number of shortest paths.

Figure 1: Examples of connection between two vertices. Panels: (a), (b), (c).

Table 1: Distances and top-k distances between the two black vertices in Figure 1.
Graph   (Top-1) Distance   Top-k Distances
(a)     4                  [4, 6, 6, 6, 6, 8, 8, ...]
(b)     4                  [4, 4, 4, 6, 6, 6, 6, ...]
(c)     4                  [4, 4, 4, 4, 4, 4, 4, ...]

This intuitive concept can be naturally implemented by adopting the top-k shortest paths and top-k distances (formally defined later). Table 1 presents the top-k distances between the pair of black vertices in each graph of Figure 1. Although the pairs in each graph are separated by the same distance, their top-k distances markedly vary, providing a potential means of distinguishing these three graph structures.
However, determining the top-k distances between vertices is computationally expensive. The naive approach is to apply a variant of Dijkstra’s algorithm that visits the same vertex $k$ times. This approach consumes $O((n + m)k)$ and $O((n \log n + m)k)$ time on unweighted and weighted graphs, respectively, where $n$ and $m$ are the numbers of vertices and edges, respectively. In the above-mentioned applications, the top-k distances must be interactively computed for many vertex pairs on large social and web graphs, requiring a much faster algorithm. Eppstein (Eppstein 1998) improved the time complexity to $O(n + m + k)$ and $O(n \log n + m + k)$ on unweighted and weighted graphs, respectively, but his algorithm remains prohibitively slow.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
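For reference, the naive approach mentioned in the introduction can be sketched in Python for unweighted graphs. This is an illustrative sketch and not the authors' implementation: the function name `naive_top_k_distances` and the adjacency-list input (`adj[u]` is the list of neighbors of vertex `u`) are assumptions made here, and paths are treated as not necessarily simple.

```python
from collections import deque

INF = float("inf")

def naive_top_k_distances(adj, s, t, k):
    """Lengths of the k shortest (not necessarily simple) paths from s to t.

    BFS variant in which every vertex may be expanded up to k times;
    each expansion corresponds to one distinct path reaching that vertex.
    """
    visits = [0] * len(adj)      # how many times each vertex was expanded
    found = []                   # path lengths recorded at t, in order
    queue = deque([(s, 0)])      # entries are (vertex, path length)
    while queue and len(found) < k:
        u, d = queue.popleft()
        if visits[u] >= k:       # u already has its k shortest paths
            continue
        visits[u] += 1
        if u == t:
            found.append(d)
        for w in adj[u]:
            queue.append((w, d + 1))
    # Pad with infinity when fewer than k paths exist.
    return found + [INF] * (k - len(found))

# A path graph with five vertices: 0 - 1 - 2 - 3 - 4.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(naive_top_k_distances(path, 0, 4, 7))  # [4, 6, 6, 6, 6, 8, 8]
```

For the two endpoints of a five-vertex path, the sketch yields the same sequence as row (a) of Table 1. Every query repeats the whole search, which is the cost the indexing method is designed to avoid.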
Contribution
To resolve this issue, we propose an indexing method for answering top-k distance queries. That is, the method first constructs a data structure called an index from a graph, and then top-k distances between arbitrary pairs of vertices are rapidly obtained using the index. To our knowledge, this is the first indexing method for top-k distance queries.
Our method is built on the recently proposed pruned landmark labeling, an indexing scheme that answers shortest-path distances (Akiba, Iwata, and Yoshida 2013). However, modifying this method to answer top-k distances is non-trivial because the number of paths becomes crucial, bringing the new challenge of carefully avoiding double counts. Moreover, it requires several interesting ideas in order to keep the scalability.
As shown later in our experiments, our method can construct indices from large graphs comprising millions of vertices and tens of millions of edges within a reasonable running time. Having obtained the indices, we can compute the top-k distances within a few microseconds, six orders of magnitude faster than existing methods, which require a few seconds to compute these distances.
Moreover, to illustrate the importance of the top-k distances, we apply our method to the link prediction problem (Liben-Nowell and Kleinberg 2003), a well-studied problem in the AI and Web communities. We empirically show that the support vector machine (SVM) with the top-k distances as its feature outperforms a number of baseline methods including singular value decomposition and random walk with restart. We emphasize that our indexing method enables the first use of the top-k distances for such tasks. The results also indicate the feasibility of top-k distances in other tasks, such as network-aware searching.
Our implementation of the proposed indexing method is publicly available from the first author’s web page. We hope that our public code will enable further exploration of top-k distances in various applications.
Related Work
Distance Indices. Although numerous indexing methods for computing shortest-path distances have been proposed (Cheng and Yu 2009; Xiao et al. 2009; Wei 2010; Akiba, Sommer, and Kawarabayashi 2012; Jin et al. 2012; Fu et al. 2013; Akiba, Iwata, and Yoshida 2013), none of these methods can directly answer top-k distance queries.
Pruned Labeling Algorithms. Pruned labeling was first proposed for distance queries on complex networks (Akiba, Iwata, and Yoshida 2013). Then, specializations and extensions have been proposed for reachability queries on directed acyclic graphs (Yano et al. 2013), distance queries on road networks (Akiba et al. 2014), and distance queries on dynamic graphs (Akiba, Iwata, and Yoshida 2014).
Other Vertex Similarities. The performance of applications related to graph mining can also be enhanced by features other than top-k distances, such as random walk with restart (RWR). As straightforward iterative algorithms are precluded by their high computational cost, several approximation methods have been proposed (Jeh and Widom 2003; McSherry 2005; Sun et al. 2005; Tong, Faloutsos, and Pan 2006). Despite sacrificing accuracy for efficiency, these methods remain prohibitively time-expensive for computing the RWR scores for many vertex pairs on large networks in real-time applications, such as network-aware searches.
Similar arguments apply to other walk-based similarities such as SimRank (Jeh and Widom 2002) and commuting time (Lovász 1996). In contrast, as we shall experimentally demonstrate, such large networks are efficiently handled by our method for answering top-k distances. We also believe that top-k distances provide features with properties different from those of the walk-based similarities, and can therefore complement them.
Some graph kernels (Smola and Kondor 2003), in particular those based on the graph Laplacian such as the regularized Laplacian kernel, can also be used to assign relevance scores for vertex pairs (Ito et al. 2005). However, the computational cost of graph kernels is even more infeasible for large graphs.
Preliminaries
The current study focuses on networks that are modeled as graphs. To simplify our discussion, we consider only undirected and unweighted graphs first. However, as discussed later, our method is easily extended to directed and weighted graphs.
Let $G = (V, E)$ be a graph with a vertex set $V$ and an edge set $E$. We denote the number of vertices $|V|$ and the number of edges $|E|$ by $n$ and $m$, respectively. We assume that vertices are uniquely represented by integers, enabling natural comparisons of two vertices $u, v \in V$ by expressions such as $u < v$ or $u \leq v$.
An internal vertex of a path refers to a vertex in the path that is not an endpoint of it. Let $\mathcal{P}$ be a set of paths. The $i$-th shortest path in $\mathcal{P}$ refers to the $i$-th path in $\mathcal{P}$, ordered by length, where ties are broken arbitrarily.
For a pair of vertices $(s, t)$, let $\mathcal{P}_{st}$ be the set of all (not necessarily simple) paths between $s$ and $t$. For a vertex $v$, let $\mathcal{P}^{>v}_{st}$ be the set of paths in $\mathcal{P}_{st}$ whose internal vertices are all larger than $v$. Similarly, let $\mathcal{P}^{\not> v}_{st}$ be the set of paths in $\mathcal{P}_{st}$ such that at least one internal vertex is smaller than or equal to $v$. Then for two vertices $s$ and $t$, the $i$-th shortest path between $s$ and $t$ is the $i$-th shortest path in $\mathcal{P}_{st}$. Let $d_{i\text{-th}}(s, t)$, $d^{>v}_{i\text{-th}}(s, t)$, and $d^{\not> v}_{i\text{-th}}(s, t)$ denote the length of the $i$-th shortest path in $\mathcal{P}_{st}$, $\mathcal{P}^{>v}_{st}$, and $\mathcal{P}^{\not> v}_{st}$, respectively. If the size of the corresponding set is less than $i$, then we set them to $\infty$. We define $d^{\geq v}_{i\text{-th}}(s, t)$ and $d^{\not\geq v}_{i\text{-th}}(s, t)$ similarly.
Problem Definition
In this paper, we propose an indexing method that, given a graph $G$ and a positive integer $k$, constructs an index to quickly answer the following query.
Problem 1 (Top-k Distance Query).
Given: A pair of vertices $(s, t)$.
Answer: An array $(d_{1\text{st}}(s, t), d_{2\text{nd}}(s, t), \ldots, d_{k\text{-th}}(s, t))$.
Proposed Method
This section describes our proposed method and shows its correctness. We also suggest several important techniques for practical performance enhancement.
Data Structure
The data structure and query algorithm of the proposed method are based on the general framework of 2-hop cover (Cohen et al. 2002), which is designed for standard (top-1) distance queries. However, as normal distance queries do not consider the number of paths, the main challenge in processing top-k distance queries is preventing multiple counts of the same path. To this end, we require a more involved framework.
For each vertex $v$, our method precomputes and stores the following two labels:
• Distance label $L(v)$, comprising a set of pairs $(u, \delta)$ of a vertex and a path length. If we gather lengths in $L(v)$ associated with a vertex $u$, they should form the sequence $(d^{>v}_{1\text{st}}(v, u), d^{>v}_{2\text{nd}}(v, u), \ldots, d^{>v}_{\ell\text{-th}}(v, u))$ for some $1 \leq \ell \leq k$.
• Loop label $C(v)$, constituting a sequence of $k$ integers $(\delta_1, \delta_2, \ldots, \delta_k)$. This sequence should equal $(d^{\geq v}_{1\text{st}}(v, v), d^{\geq v}_{2\text{nd}}(v, v), \ldots, d^{\geq v}_{k\text{-th}}(v, v))$.
An index is a pair $I = (L, C)$, where $L$ and $C$ are the sets of distance labels $\{L(v)\}_{v \in V}$ and loop labels $\{C(v)\}_{v \in V}$, respectively.
Query Algorithm
Given an index $I = (L, C)$ and a pair of vertices $(s, t)$, we compute the top-k distances between $s$ and $t$ as follows. First, we compute the following multiset.

\[ \Delta(I, s, t) = \{\, \delta_{sv} + \delta_{vv} + \delta_{vt} \mid (v, \delta_{sv}) \in L(s),\ \delta_{vv} \in C(v),\ (v, \delta_{vt}) \in L(t) \,\}. \]

Intuitively, we first move from $s$ to $v$, then loop back to $v$ several steps later, and finally move from $v$ to $t$. Note that from the definition of distance labels and loop labels, every internal vertex in the path from $s$ to $t$ (except $v$ itself) is larger than $v$.
Let QUERY$(I, s, t)$ denote the smallest $k$ elements in the multiset $\Delta(I, s, t)$. If $|\Delta(I, s, t)| < k$, the remaining entries are filled with $\infty$. Our answer to the query $(s, t)$ is QUERY$(I, s, t)$.
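The query step can be sketched in Python as follows. The representation is an assumption made for illustration: `dist_labels[v]` is a list of `(hub, length)` pairs playing the role of $L(v)$, `loop_labels[v]` is a list of loop lengths playing the role of $C(v)$, and the function name `query` is hypothetical. The small index at the end was worked out by hand for a graph with the single edge 0 - 1 and $k = 3$.

```python
import heapq
from collections import defaultdict

INF = float("inf")

def query(dist_labels, loop_labels, s, t, k):
    """Smallest k sums d_sv + d_vv + d_vt over hubs v shared by L(s) and L(t)."""
    # Group the entries of L(t) by hub so that matching hubs are found quickly.
    by_hub_t = defaultdict(list)
    for hub, d_t in dist_labels[t]:
        by_hub_t[hub].append(d_t)

    candidates = []
    for hub, d_s in dist_labels[s]:
        for d_t in by_hub_t.get(hub, ()):
            for d_loop in loop_labels[hub]:
                candidates.append(d_s + d_loop + d_t)

    top = heapq.nsmallest(k, candidates)
    # Fill the remaining entries with infinity when fewer than k sums exist.
    return top + [INF] * (k - len(top))

# Hand-built index for the graph with one edge 0 - 1 and k = 3 (an assumption
# of this example): vertex 0 is the hub of every path that touches it.
L = {0: [(0, 0)], 1: [(0, 1), (1, 0)]}
C = {0: [0, 2, 4], 1: [0]}
print(query(L, C, 0, 1, 3))  # [1, 3, 5]
print(query(L, C, 1, 1, 3))  # [0, 2, 4]
```

Materializing every candidate is the simplest choice for a sketch; a tuned implementation would likely keep labels sorted by hub and stop early once $k$ small sums are known.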
Indexing Algorithm
Our index construction algorithm is summarized in Algorithm 1. We first compute the loop label $C(v)$ for every vertex $v$. We then construct the distance labels $L$ by conducting a pruned BFS from each vertex.
Algorithm for Computing Loop Labels. We construct the loop labels as follows. For each vertex $v$, using vertices larger than or equal to $v$, we perform a modified version of breadth-first search (BFS). In the BFS, each vertex may be visited up to $k$ times. The first $k$ visits to the vertex $v$ give the distance sequence $d^{\geq v}_{1\text{st}}(v, v), d^{\geq v}_{2\text{nd}}(v, v), \ldots, d^{\geq v}_{k\text{-th}}(v, v)$.
Algorithm 1 Indexing Algorithm
1: procedure CONSTRUCTINDEX($G$)
2:     for $i = 1$ to $n$ do Compute $C(v_i)$ using the modified BFS.
3:     $L(v) \leftarrow \emptyset$ for all $v \in V$.
4:     for $i = 1$ to $n$ do PRUNEDBFS($G$, $v_i$).
5:     return $(C, L)$.

Algorithm 2 Pruned Top-k BFS from $v \in V$.
1: procedure PRUNEDBFS($G$, $v$)
2:     $Q \leftarrow$ a queue with only one element $(v, 0)$.
3:     while $Q$ is not empty do
4:         Dequeue $(u, \delta)$ from $Q$.
5:         if $\delta < \max(\text{QUERY}((L, C), v, u))$ then
6:             Add $(v, \delta)$ to $L(u)$.
7:             for all $w \in V$ such that $(u, w) \in E$, $w > v$ do
8:                 Enqueue $(w, \delta + 1)$ onto $Q$.
The modified BFS returns to the starting vertex long before all vertices in the graph have been visited. Consequently, the running time is very small in practice and empirically estimated as $O(nk)$ in total from experiments.
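The modified BFS for loop labels might look as follows in Python. This is a sketch under stated assumptions: vertices are the integers `0..n-1`, `adj[u]` lists the neighbors of `u`, and the name `loop_label` is hypothetical. The returned list may hold fewer than $k$ entries when fewer loops exist; the missing entries correspond to $\infty$.

```python
from collections import deque

def loop_label(adj, v, k):
    """Lengths of the k shortest closed paths from v back to v that use only
    vertices greater than or equal to v (the trivial path of length 0 counts)."""
    visits = {}                  # expansions per vertex, capped at k
    label = []
    queue = deque([(v, 0)])
    while queue and len(label) < k:
        u, d = queue.popleft()
        if visits.get(u, 0) >= k:
            continue
        visits[u] = visits.get(u, 0) + 1
        if u == v:
            label.append(d)      # one more way to be back at v
        for w in adj[u]:
            if w >= v:           # never step to a vertex smaller than v
                queue.append((w, d + 1))
    return label

# Path graph 0 - 1 - 2 - 3 - 4.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(loop_label(path, 0, 3))  # [0, 2, 4]
print(loop_label(path, 4, 3))  # [0]
```

The search stops as soon as $k$ returns to `v` have been recorded, which is the early termination the paragraph above refers to.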
Algorithm for Computing Distance Labels. We assume that vertices in $V$ are ordered as $v_1, v_2, \ldots, v_n$. Then for each $1 \leq i \leq n$, we perform a pruned BFS from $v_i$ (Algorithm 2). The pruned BFS is essentially a modified version of the BFS from $v$ that visits the same vertex at most $k$ times. The crucial difference is the non-trivial pruning; that is, when visiting a vertex $u$ at distance $\delta$, the process is discontinued if $\delta$ is larger than or equal to the $k$-th shortest distance computable by the current index $(L, C)$ (Line 5).
We roughly estimate the time complexity. Let $l$ be the average size of labels. We visit $O(nl)$ vertices in total, traversing $O(m/n)$ edges on average and evaluating a query in $O(l)$ time (by using the fast pruning technique introduced later). Thus, the total time complexity of this part is $O(ml + nl^2)$. In our experiments, $l$ was a few hundred.
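Algorithms 1 and 2 can be put together in a compact Python sketch. It is meant only to make the control flow concrete and is not the authors' C++ implementation: it omits fast pruning and merged entries, evaluates every pruning query from scratch, and assumes that vertices are the integers `0..n-1` already numbered in processing order. All function names are hypothetical.

```python
from collections import deque
import heapq

INF = float("inf")

def top_k_walks(adj, s, t, k, lowest=0):
    """Top-k path lengths from s to t by BFS, stepping only to vertices >= lowest.
    Used for loop labels (s == t == lowest) and as a brute-force reference."""
    visits, found, queue = {}, [], deque([(s, 0)])
    while queue and len(found) < k:
        u, d = queue.popleft()
        if visits.get(u, 0) >= k:
            continue
        visits[u] = visits.get(u, 0) + 1
        if u == t:
            found.append(d)
        queue.extend((w, d + 1) for w in adj[u] if w >= lowest)
    return found

def query(L, C, s, t, k):
    """Smallest k values of d_sv + d_vv + d_vt over common hubs v."""
    cands = [ds + dc + dt
             for hub, ds in L[s]
             for hub_t, dt in L[t] if hub_t == hub
             for dc in C[hub]]
    top = heapq.nsmallest(k, cands)
    return top + [INF] * (k - len(top))

def pruned_bfs(adj, v, L, C, k):
    """Algorithm 2: BFS from v that is cut wherever the current index
    already certifies k paths that are no longer than the one at hand."""
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d < max(query(L, C, v, u, k)):      # Line 5: not pruned
            L[u].append((v, d))
            queue.extend((w, d + 1) for w in adj[u] if w > v)

def construct_index(adj, k):
    """Algorithm 1: loop labels first, then one pruned BFS per vertex."""
    n = len(adj)
    C = [top_k_walks(adj, v, v, k, lowest=v) for v in range(n)]
    L = [[] for _ in range(n)]
    for v in range(n):
        pruned_bfs(adj, v, L, C, k)
    return L, C

# Triangle 0 - 1 - 2 with k = 2.
triangle = [[1, 2], [0, 2], [0, 1]]
L, C = construct_index(triangle, 2)
print(query(L, C, 0, 1, 2))  # [1, 2]
```

A convenient sanity check on small graphs is to compare `query(L, C, s, t, k)` with `top_k_walks(adj, s, t, k)` for every pair of vertices.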
Proof of Correctness
The correctness of our method is shown as follows. Let $L_i$ denote the set of distance labels $L$ after the $i$-th pruned BFS from $v_i$. We define $L_0(v) = \emptyset$ for any $v$. Let $I_i$ denote the pair $(L_i, C)$ of the partially constructed set of distance labels and the set of loop labels. We prove the following lemma.
Lemma 1. For every integer $i$ where $0 \leq i \leq n$, and every pair of vertices $(s, t)$, QUERY$(I_i, s, t) = (d^{\not> v_i}_{1\text{st}}(s, t), d^{\not> v_i}_{2\text{nd}}(s, t), \ldots, d^{\not> v_i}_{k\text{-th}}(s, t))$ holds.
Proof. We prove the claim by induction on $i$. When $i = 0$, we have QUERY$(I_i, s, t) = (\infty, \infty, \ldots, \infty)$ and the claim clearly holds. Suppose that the claim holds for every $i' < i$. For a fixed pair of vertices $(s, t)$ where $s \neq t$, we validate the claim for $i$ and the pair $(s, t)$.
Note that we can already compute QUERY$(I_{i-1}, s, t) = (d^{\not> v_{i-1}}_{1\text{st}}(s, t), d^{\not> v_{i-1}}_{2\text{nd}}(s, t), \ldots, d^{\not> v_{i-1}}_{k\text{-th}}(s, t))$. Let $\mathcal{P}$ denote the set of paths $P$ such that (i) $P$ is in $\mathcal{P}^{> v_{i-1}}_{st}$, (ii) $P$ passes through $v_i$, and (iii) the length of $P$ is smaller than $d^{\not> v_{i-1}}_{k\text{-th}}(s, t)$. Let $\mathcal{P}'$ be the first $k$ elements in $\mathcal{P}$. It suffices to show that, after the $i$-th pruned BFS, we can also compute the distances of paths in $\mathcal{P}'$.
Let $P \in \mathcal{P}'$. We can split $P$ into three parts $P_{s v_i}$, $P_{v_i v_i}$, and $P_{v_i t}$. Here, $P_{s v_i}$ denotes the subsequence of $P$ from $s$ to the first appearance of $v_i$ in $P$, $P_{v_i v_i}$ denotes the subsequence of $P$ from the first appearance of $v_i$ to the final appearance of $v_i$ in $P$, and $P_{v_i t}$ denotes the subsequence of $P$ from the last appearance of $v_i$ in $P$ to $t$. Note that $P_{v_i v_i}$ must be among the first $k$ elements in $\mathcal{P}^{> v_i}_{v_i v_i}$; otherwise $k$ shorter paths are possible and $P \in \mathcal{P}'$ is contradicted. Hence, $C(v_i)$ must include the length of $P_{v_i v_i}$.
Now we observe that the BFS from $v_i$ along path $P_{v_i t}$ is not pruned in the $i$-th pruned BFS (and similarly for $P_{s v_i}$). To illustrate by contradiction, suppose that the BFS is pruned at some vertex $u$ on path $P_{v_i t}$. In this case, there exist at least $k$ paths in $\mathcal{P}^{\not> v_{i-1}}_{v_i u}$ shorter than $\delta$, where $\delta$ is the distance from $v_i$ to $u$ in the BFS. For each of these $k$ paths, we concatenate $P_{s v_i}$, $P_{v_i v_i}$, and the suffix of $P_{v_i t}$ from $u$ to $t$. Then, we obtain $k$ paths in $\mathcal{P}^{\not> v_{i-1}}_{st}$ that are shorter than $P$, and therefore shorter than $d^{\not> v_{i-1}}_{k\text{-th}}(s, t)$ from condition (iii). Hence, we reach a contradiction.
Corollary 1. At the end of Algorithm 1, we can correctly answer top-k distance queries using the constructed index.
Techniques for Efficient Implementation
We introduce several key techniques for practical performance improvement.
Vertex Ordering Strategy. By properly selecting the order of vertices from which we conduct pruned BFSs, our pruning can drastically reduce the search space and label sizes by exploiting the structure of real-world networks, greatly enhancing the efficiency of the proposed method. This is possible because the real networks contain highly centralized vertices (sometimes called hubs). As a heuristic vertex ordering strategy, vertices are selected in order of decreasing degrees. Further discussion is provided in (Akiba, Iwata, and Yoshida 2013).
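One simple way to realize the degree-based ordering is to renumber the vertices before indexing, so that smaller identifiers correspond to higher degrees. The helper below is a hypothetical sketch (the name `relabel_by_degree` and the adjacency-list input are assumptions); ties between equal degrees are broken by the original identifier because Python's sort is stable.

```python
def relabel_by_degree(adj):
    """Renumber vertices so that vertex 0 has the largest degree.

    Returns the relabeled adjacency lists and `order`, where order[new_id]
    is the original identifier of the vertex now called new_id.
    """
    order = sorted(range(len(adj)), key=lambda v: -len(adj[v]))
    rank = {old: new for new, old in enumerate(order)}
    new_adj = [[] for _ in adj]
    for old, neighbors in enumerate(adj):
        new_adj[rank[old]] = sorted(rank[w] for w in neighbors)
    return new_adj, order

# A star whose center is vertex 3: the center becomes vertex 0.
star = [[3], [3], [3], [0, 1, 2]]
new_adj, order = relabel_by_degree(star)
print(order)    # [3, 0, 1, 2]
print(new_adj)  # [[1, 2, 3], [0], [0], [0]]
```

With this renumbering, comparisons such as $w > v$ in the pruned BFS automatically follow the chosen order, and query endpoints only need to be translated through the same mapping.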
Fast Pruning. When constructing distance labels, many queries are evaluated for pruning. However, when conducting a pruned BFS from a vertex $v$, queries are limited to “Are there more than $k$ paths of length less than $\delta$ between $v$ and $u$?” Given this restriction, we can reduce the query time. For each vertex $w$ in the distance label of $v$, we can precompute the number $c_{w,\delta'}$ of paths between $v$ and $w$ of length not exceeding $\delta'$ using the loop label $C(w)$. Suppose that we have reached vertex $u$ in the pruned BFS conducted from $v$. We can then compute the number of paths between $v$ and $u$ of length less than $\delta$ as $\sum_{(w, \delta', c) \in L(u)} c \cdot c_{w, \delta - \delta'}$.
Merged Queue Entries. When a (pruned) BFS is performed from a vertex $v$, rather than pairs $(u, \delta)$, which denote the existence of a path of length $\delta$ between $v$ and $u$, triplets $(u, \delta, c)$ are pushed onto the queue. These triplets specify that $c$ paths of length $\delta$ exist between $v$ and $u$. This technique enables the simultaneous handling of many paths, and significantly reduces the number of pushes onto the queue. Hence, it significantly reduces the running time.
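The effect of merging can be illustrated on the plain (unpruned) top-k BFS, processed level by level. In this Python sketch, which is an illustration and not the authors' code, each level is a dictionary mapping a vertex `u` to the number `c` of paths of the current length that reach it, so one entry stands for the triplet $(u, \delta, c)$. The function name and the adjacency-list input are assumptions.

```python
INF = float("inf")

def top_k_distances_merged(adj, s, t, k):
    """Top-k path lengths from s to t, handling all paths that reach the same
    vertex with the same length as a single merged entry (u, d, c)."""
    visits = [0] * len(adj)      # paths accepted per vertex, capped at k
    found = []
    level = {s: 1}               # vertex -> number of paths of length d
    d = 0
    while level and len(found) < k:
        next_level = {}
        for u, c in level.items():
            c = min(c, k - visits[u])   # accept at most k paths per vertex
            if c == 0:
                continue
            visits[u] += c
            if u == t:
                found.extend([d] * c)
            for w in adj[u]:
                # Merge: accumulate counts instead of pushing c separate entries.
                next_level[w] = next_level.get(w, 0) + c
        level, d = next_level, d + 1
    return found + [INF] * (k - len(found))

# Path graph 0 - 1 - 2 - 3 - 4.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(top_k_distances_merged(path, 0, 4, 7))  # [4, 6, 6, 6, 6, 8, 8]
```

Each level holds at most one entry per vertex, whereas a queue of individual pairs can hold up to one entry per path.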
Merged Label Entries. Related to the above technique, instead of pairs $(u, \delta)$, which denote that there is a path of length $\delta$ between $v$ and $u$, triplets $(u, \delta, c)$ are stored in distance labels. These triplets indicate that $c$ paths of length $\delta$ exist between $v$ and $u$. A similar technique is applicable to loop labels.
Extensions
Directed graphs. If the input graph is a directed graph, we compute and store two distance labels $L_{\mathrm{IN}}(v)$ and $L_{\mathrm{OUT}}(v)$ for each vertex $v$, where $L_{\mathrm{IN}}(v)$ and $L_{\mathrm{OUT}}(v)$ contain the distances from and to $v$, respectively.
Weighted graphs. For weighted graphs, we can replace the pruned BFS by pruned Dijkstra’s algorithm. In this scheme, the queue used in Algorithm 2 is replaced by a priority queue. The time complexity becomes $O(ml + nl(\log n + l))$.
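The switch from a FIFO queue to a priority queue can be seen in isolation in the following Python sketch of an unpruned top-k search on a weighted graph. It is not the pruned Dijkstra variant itself: pruning and labels are left out, edge weights are assumed to be positive, and the name `top_k_dijkstra` and the input format (`adj[u]` is a list of `(neighbor, weight)` pairs) are assumptions.

```python
import heapq

INF = float("inf")

def top_k_dijkstra(adj, s, t, k):
    """Top-k path lengths from s to t on a graph with positive edge weights.
    Same scheme as the top-k BFS, with a binary heap instead of a FIFO queue."""
    visits = {}                  # expansions per vertex, capped at k
    found = []
    heap = [(0, s)]              # entries are (path length, vertex)
    while heap and len(found) < k:
        d, u = heapq.heappop(heap)
        if visits.get(u, 0) >= k:
            continue
        visits[u] = visits.get(u, 0) + 1
        if u == t:
            found.append(d)
        for w, weight in adj[u]:
            heapq.heappush(heap, (d + weight, w))
    return found + [INF] * (k - len(found))

# Weighted path 0 -(2)- 1 -(3)- 2.
weighted = {0: [(1, 2)], 1: [(0, 2), (2, 3)], 2: [(1, 3)]}
print(top_k_dijkstra(weighted, 0, 2, 3))  # [5, 9, 11]
```

The heap pops entries in nondecreasing order of length, which is the property the FIFO queue provides for free when all edges have unit weight.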
Experimental Evaluation
In this section, we show the scalability, efficiency and robustness of the proposed method by experimental results using real-world networks.
Setup
Environment. All experiments were conducted on a Linux server with Intel Xeon X5670 (2.93 GHz) and 48 GB of main memory. The proposed method was implemented in C++. The implementation will be made publicly available online.
Datasets. The target applications of the proposed method are graph mining tasks such as network-aware searching and link prediction. Therefore, our experiments were conducted on publicly available real-world social and web graphs [1, 2, 3, 4, 5]. The sizes and types of these graphs are listed in Table 2. We treated all the graphs as unweighted undirected graphs.
Algorithms. As there are no previous indexing methods for top-k distances, the proposed method was evaluated against the following two algorithms without precomputation.
• The first is the BFS-based naive approach, which uses a FIFO queue in the graph search, but which allows at most $k$ visits to each vertex. This algorithm was also implemented in C++ by the authors.
• The second is Eppstein’s algorithm (Eppstein 1998), which theoretically attains near-optimal time complexity. We adopted the C++ implementation of Jon Graehl [6].

[1] http://lovro.lpt.fri.uni-lj.si/support.jsp
[2] http://grouplens.org/datasets/hetrec-2011/
[3] http://snap.stanford.edu/
[4] http://socialnetworks.mpi-sws.org/datasets.html
[5] http://law.di.unimi.it/datasets.php (Boldi and Vigna 2004)
[6] http://www.ics.uci.edu/~eppstein/pubs/p-kpath.html