Akiba KShortest 2015

Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling

Takuya Akiba†, Takanori Hayashi†, Nozomi Nori‡, Yoichi Iwata†, and Yuichi Yoshida§‖

† The University of Tokyo, 113-0033, Tokyo, Japan
‡ Kyoto University, 606-8501, Kyoto, Japan
§ National Institute of Informatics, 101-8430, Tokyo, Japan
‖ Preferred Infrastructure, Inc., 113-0033, Tokyo, Japan

{t.akiba,thayashi,y.iwata}@is.s.u-tokyo.ac.jp, nozomi@ml.ist.i.kyoto-u.ac.jp, yyoshida@nii.ac.jp

Abstract

We propose an indexing scheme for top-$k$ shortest-path distance queries on graphs, which is useful in a wide range of important applications such as network-aware searches and link prediction. While many efficient methods for answering standard (top-1) distance queries have been developed, none of these methods are directly extensible to top-$k$ distance queries. We develop a new framework for top-$k$ distance queries based on 2-hop cover and then present an efficient indexing algorithm based on the recently proposed pruned landmark labeling scheme. The scalability, efficiency and robustness of our method are demonstrated in extensive experimental results. Moreover, we demonstrate the usefulness of top-$k$ distance queries by applying them to link prediction, the most fundamental graph problem in the AI and Web communities.

Introduction

The shortest-path distance between vertices in a network is a fundamental concept in graph theory and is widely applied in the AI and Web communities. For example, because the distances between vertices indicate the relevance among the vertices, they can identify other users or contents that best match a user's intent in socially-sensitive searches (Vieira et al. 2007; Yahia et al. 2008; Maniu and Cautis 2013). In context-aware searches, they are used to assign higher ranks to web pages more related to the currently visited web page (Ukkonen et al. 2008; Potamias et al. 2009).

However, there is a fundamental drawback of basing relevance on distance alone. Specifically, distances are integers and the diameters of real-world networks are typically small (Watts and Strogatz 1998). Such small diameters greatly reduce the number of possible distances and preclude the full use of the underlying structure.

This problem is clearly depicted in Figure 1. In each graph in the figure, the distance between the pair of black vertices is four. Hence, based on distance alone, the black pairs in all three graphs have the same similarity. However, the pair in graph (c) seems more tightly connected than the pairs in graphs (a) and (b), since this pair is connected by a greater number of shortest paths.

Figure 1: Examples of connection between two vertices, panels (a), (b), and (c).

Table 1: Distances and top-$k$ distances between the two black vertices in Figure 1.

Graph   (Top-1) Distance   Top-$k$ Distances
(a)     4                  [4, 6, 6, 6, 6, 8, 8, ...]
(b)     4                  [4, 4, 4, 6, 6, 6, 6, ...]
(c)     4                  [4, 4, 4, 4, 4, 4, 4, ...]

This intuitive concept can be naturally implemented by adopting the top-$k$ shortest paths and top-$k$ distances (formally defined later). Table 1 presents the top-$k$ distances between the pair of black vertices in each graph of Figure 1. Although the pairs in each graph are separated by the same distance, their top-$k$ distances vary markedly, providing a potential means of distinguishing these three graph structures.

However, determining the top-$k$ distances between vertices is computationally expensive. The naive approach is to apply a variant of Dijkstra's algorithm that visits the same vertex $k$ times. This approach consumes $O((n + m)k)$ and $O((n \log n + m)k)$ time on unweighted and weighted graphs, respectively, where $n$ and $m$ are the numbers of vertices and edges, respectively. In the above-mentioned applications, the top-$k$ distances must be interactively computed for many vertex pairs on large social and web graphs, requiring a much faster algorithm. Eppstein (Eppstein 1998) improved the time complexity to $O(n + m + k)$ and $O(n \log n + m + k)$ on unweighted and weighted graphs, respectively, but his algorithm remains prohibitively slow for such interactive use.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
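The naive approach mentioned above can be sketched for unweighted graphs as follows. This is only an illustrative sketch, not the authors' implementation: the function name `naive_top_k_distances` and the adjacency-dictionary input format are assumptions made for this example. Paths are counted as in the paper, that is, they are not necessarily simple.

```python
from collections import deque

def naive_top_k_distances(adj, s, t, k):
    """Return the k smallest lengths of (not necessarily simple) s-t paths.

    adj is an assumed input format: a dict mapping each vertex to a list
    of its neighbours in an unweighted, undirected graph.
    """
    visits = {v: 0 for v in adj}      # how many times each vertex was expanded
    found = []
    queue = deque([(s, 0)])           # FIFO order keeps lengths non-decreasing
    while queue and len(found) < k:
        u, d = queue.popleft()
        if visits[u] >= k:            # each vertex is visited at most k times
            continue
        visits[u] += 1
        if u == t:
            found.append(d)
        for w in adj[u]:
            queue.append((w, d + 1))
    # Pad with infinity when fewer than k paths exist.
    return found + [float("inf")] * (k - len(found))

path_graph = {0: [1], 1: [0, 2], 2: [1]}
print(naive_top_k_distances(path_graph, 0, 2, 3))  # [2, 4, 4]
```

On the three-vertex path graph, the only path of length 2 is 0-1-2, and the two paths of length 4 are 0-1-0-1-2 and 0-1-2-1-2. Every query repeats the whole search, which is what the indexing method is meant to avoid.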

Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence


Contribution

To resolve this issue, we propose an indexing method for answering top-$k$ distance queries. That is, the method first constructs a data structure called an index from a graph, and the top-$k$ distances between arbitrary pairs of vertices are then rapidly obtained using the index. To our knowledge, this is the first indexing method for top-$k$ distance queries.

Our method is built on the recently proposed pruned landmark labeling, an indexing scheme that answers shortest-path distance queries (Akiba, Iwata, and Yoshida 2013). However, modifying this method to answer top-$k$ distances is nontrivial because the number of paths becomes crucial, bringing the new challenge of carefully avoiding double counts. Moreover, several additional ideas are required to maintain scalability.

As shown later in our experiments, our method can construct indices from large graphs comprising millions of vertices and tens of millions of edges within a reasonable running time. Having obtained the indices, we can compute the top-$k$ distances within a few microseconds, six orders of magnitude faster than existing methods, which require a few seconds to compute these distances.

Moreover, to illustrate the importance of the top-$k$ distances, we apply our method to the link prediction problem (Liben-Nowell and Kleinberg 2003), a well-studied problem in the AI and Web communities. We empirically show that a support vector machine (SVM) with the top-$k$ distances as its features outperforms a number of baseline methods, including singular value decomposition and random walk with restart. We emphasize that our indexing method enables the first use of the top-$k$ distances for such tasks. The results also indicate the feasibility of top-$k$ distances in other tasks, such as network-aware searching.

Our implementation of the proposed indexing method is publicly available from the first author's web page. We hope that our public code will enable further exploration of top-$k$ distances in various applications.

Related Work

Distance Indices. Although numerous indexing methods for computing shortest-path distances have been proposed (Cheng and Yu 2009; Xiao et al. 2009; Wei 2010; Akiba, Sommer, and Kawarabayashi 2012; Jin et al. 2012; Fu et al. 2013; Akiba, Iwata, and Yoshida 2013), none of these methods can directly answer top-$k$ distance queries.

Pruned Labeling Algorithms. Pruned labeling was first proposed for distance queries on complex networks (Akiba, Iwata, and Yoshida 2013). Subsequently, specializations and extensions have been proposed for reachability queries on directed acyclic graphs (Yano et al. 2013), distance queries on road networks (Akiba et al. 2014), and distance queries on dynamic graphs (Akiba, Iwata, and Yoshida 2014).

Other Vertex Similarities. The performance of applications related to graph mining can also be enhanced by features other than top-$k$ distances, such as random walk with restart (RWR). As straightforward iterative algorithms are precluded by their high computational cost, several approximation methods have been proposed (Jeh and Widom 2003; McSherry 2005; Sun et al. 2005; Tong, Faloutsos, and Pan 2006). Despite sacrificing accuracy for efficiency, these methods remain prohibitively time-expensive for computing the RWR scores for many vertex pairs on large networks in real-time applications, such as network-aware searches.

Similar arguments apply to other walk-based similarities such as SimRank (Jeh and Widom 2002) and commuting time (Lovász 1996). In contrast, as we shall experimentally demonstrate, such large networks are efficiently handled by our method for answering top-$k$ distances. We also believe that top-$k$ distances provide features with properties different from those of walk-based similarities, so that the two can be used as complementary features.

Some graph kernels (Smola and Kondor 2003), in particular those based on the graph Laplacian such as the regularized Laplacian kernel, can also be used to assign relevance scores to vertex pairs (Ito et al. 2005). However, the computational cost of graph kernels makes them even less feasible for large graphs.

Preliminaries

The current study focuses on networks that are modeled as graphs. To simplify our discussion, we first consider only undirected and unweighted graphs. However, as discussed later, our method is easily extendible to directed and weighted graphs.

Let $G = (V, E)$ be a graph with a vertex set $V$ and an edge set $E$. We denote the number of vertices $|V|$ and the number of edges $|E|$ by $n$ and $m$, respectively. We assume that vertices are uniquely represented by integers, enabling natural comparisons of two vertices $u, v \in V$ by expressions such as $u < v$ or $u \le v$.

An internal vertex of a path refers to a vertex in the path that is not an endpoint of it. Let $P$ be a set of paths. The $i$-th shortest path in $P$ refers to the $i$-th path in $P$, ordered by length, where ties are broken arbitrarily.

For a pair of vertices $(s, t)$, let $P_{st}$ be the set of all (not necessarily simple) paths between $s$ and $t$. For a vertex $v$, let $P^{>v}_{st}$ be the set of paths in $P_{st}$ whose internal vertices are all larger than $v$. Similarly, let $P^{\not> v}_{st}$ be the set of paths in $P_{st}$ such that at least one internal vertex is smaller than or equal to $v$. Then for two vertices $s$ and $t$, the $i$-th shortest path between $s$ and $t$ is the $i$-th shortest path in $P_{st}$. Let $d_{i\text{-th}}(s, t)$, $d^{>v}_{i\text{-th}}(s, t)$, and $d^{\not> v}_{i\text{-th}}(s, t)$ denote the length of the $i$-th shortest path in $P_{st}$, $P^{>v}_{st}$, and $P^{\not> v}_{st}$, respectively. If the size of the corresponding set is less than $i$, then we set the value to $\infty$. We define $d^{\ge v}_{i\text{-th}}(s, t)$ and $d^{\not\ge v}_{i\text{-th}}(s, t)$ similarly.
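As a small worked illustration of these definitions (the toy graph is an assumption chosen for this example and does not appear in the paper), consider the path graph with $V = \{1, 2, 3\}$ and $E = \{\{1,2\}, \{2,3\}\}$, the pair $(s, t) = (1, 3)$, and $v = 1$. The block below assumes the `amsmath` package for `align*` and `\text`.

```latex
% Assumed toy graph: V = {1, 2, 3}, E = {{1,2}, {2,3}}, pair (s, t) = (1, 3), v = 1.
\begin{align*}
P_{13} &\ni (1,2,3),\ (1,2,1,2,3),\ (1,2,3,2,3),\ \dots
  & d_{1\text{st}}(1,3) &= 2, \quad d_{2\text{nd}}(1,3) = 4, \quad d_{3\text{rd}}(1,3) = 4 \\
P^{>1}_{13} &\ni (1,2,3),\ (1,2,3,2,3),\ \dots
  & d^{>1}_{1\text{st}}(1,3) &= 2, \quad d^{>1}_{2\text{nd}}(1,3) = 4 \\
P^{\not> 1}_{13} &\ni (1,2,1,2,3),\ \dots
  & d^{\not> 1}_{1\text{st}}(1,3) &= 4
\end{align*}
```

The path $(1,2,1,2,3)$ revisits vertex 1 as an internal vertex, so it belongs to $P^{\not> 1}_{13}$ rather than $P^{>1}_{13}$. Every path in $P_{13}$ falls into exactly one of the two sets, which is what allows path counts to be combined without double counting.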

Problem Definition

In this paper, we propose an indexing method that, given a graph $G$ and a positive integer $k$, constructs an index to quickly answer the following query.

Problem 1 (Top-$k$ Distance Query).
Given: A pair of vertices $(s, t)$.
Answer: An array $(d_{1\text{st}}(s, t), d_{2\text{nd}}(s, t), \ldots, d_{k\text{-th}}(s, t))$.


Proposed Method

This section describes our proposed method and shows its correctness. We also present several important techniques for practical performance enhancement.

Data Structure

The data structure and query algorithm of the proposed method are based on the general framework of 2-hop cover (Cohen et al. 2002), which is designed for standard (top-1) distance queries. However, as normal distance queries do not consider the number of paths, the main challenge in processing top-$k$ distance queries is preventing multiple counts of the same path. To this end, we require a more involved framework.

For each vertex $v$, our method precomputes and stores the following two labels:

• Distance label $L(v)$, comprising a set of pairs $(u, \delta)$ of a vertex and a path length. If we gather lengths in $L(v)$ associated with a vertex $u$, they should form the sequence $(d^{>v}_{1\text{st}}(v, u), d^{>v}_{2\text{nd}}(v, u), \ldots, d^{>v}_{\ell\text{-th}}(v, u))$ for some $1 \le \ell \le k$.

• Loop label $C(v)$, constituting a sequence of $k$ integers $(\delta_1, \delta_2, \ldots, \delta_k)$. This sequence should equal $(d^{\ge v}_{1\text{st}}(v, v), d^{\ge v}_{2\text{nd}}(v, v), \ldots, d^{\ge v}_{k\text{-th}}(v, v))$.

An index is a pair $I = (L, C)$, where $L$ and $C$ are the sets of distance labels $\{L(v)\}_{v \in V}$ and loop labels $\{C(v)\}_{v \in V}$, respectively.

Query Algorithm

Given an index $I = (L, C)$ and a pair of vertices $(s, t)$, we compute the top-$k$ distances between $s$ and $t$ as follows. First, we compute the following multiset.

$$\Delta(I, s, t) = \{\, \delta_{sv} + \delta_{vv} + \delta_{vt} \mid (v, \delta_{sv}) \in L(s),\ \delta_{vv} \in C(v),\ (v, \delta_{vt}) \in L(t) \,\}.$$

Intuitively, we first move from $s$ to $v$, then loop back to $v$ several steps later, and finally move from $v$ to $t$. Note that from the definition of distance labels and loop labels, every internal vertex in the path from $s$ to $t$ (except $v$ itself) is larger than $v$.

Let Query$(I, s, t)$ denote the smallest $k$ elements in the multiset $\Delta(I, s, t)$. If $|\Delta(I, s, t)| < k$, the remaining entries are filled with $\infty$. Our answer to the query $(s, t)$ is Query$(I, s, t)$.
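The query procedure above can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function name `query`, the dictionary layout of the labels, and the convention of storing only the finite entries of each loop label are assumptions made for this example.

```python
import heapq

def query(L, C, s, t, k):
    """Smallest k elements of the multiset Delta(I, s, t), padded with inf.

    Assumed layout: L[v] is a list of (hub, length) pairs and C[v] is a
    list holding the finite loop lengths of v.
    """
    # Group the entries of L[t] by hub so that shared hubs are found quickly.
    lengths_to_t = {}
    for hub, d in L[t]:
        lengths_to_t.setdefault(hub, []).append(d)

    candidates = []
    for hub, d_s in L[s]:
        for d_t in lengths_to_t.get(hub, ()):
            for d_loop in C[hub]:
                # s -> hub, loop at hub, hub -> t
                candidates.append(d_s + d_loop + d_t)

    best = heapq.nsmallest(k, candidates)
    return best + [float("inf")] * (k - len(best))

# Hand-built labels for the single-edge graph 0 - 1 with k = 3 (an assumed toy case).
L = {0: [(0, 0)], 1: [(0, 1), (1, 0)]}
C = {0: [0, 2, 4], 1: [0]}
print(query(L, C, 0, 1, 3))  # [1, 3, 5]
```

In the toy case, the only paths between 0 and 1 bounce along the single edge, giving lengths 1, 3, and 5. All of them are accounted for by hub 0: the distance entries contribute 0 and 1, and the loop label of vertex 0 contributes 0, 2, or 4.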

Indexing Algorithm

Our index construction algorithm is summarized in Algorithm 1. We first compute the loop label $C(v)$ for every vertex $v$. We then construct the distance labels $L$ by conducting a pruned BFS from each vertex.

Algorithm for Computing Loop Labels. We construct the loop labels as follows. For each vertex $v$, using only vertices larger than or equal to $v$, we perform a modified version of breadth-first search (BFS). In this BFS, each vertex may be visited up to $k$ times. The first $k$ visits to the vertex $v$ give the distance sequence $d^{\ge v}_{1\text{st}}(v, v), d^{\ge v}_{2\text{nd}}(v, v), \ldots, d^{\ge v}_{k\text{-th}}(v, v)$.

Algorithm 1 Indexing Algorithm
1: procedure ConstructIndex($G$)
2:     for $i = 1$ to $n$ do Compute $C(v_i)$ using the modified BFS.
3:     $L(v) \leftarrow \emptyset$ for all $v \in V$.
4:     for $i = 1$ to $n$ do PrunedBFS($G$, $v_i$).
5:     return $(L, C)$.

Algorithm 2 Pruned Top-$k$ BFS from $v \in V$.
1: procedure PrunedBFS($G$, $v$)
2:     $Q \leftarrow$ a queue with only one element $(v, 0)$.
3:     while $Q$ is not empty do
4:         Dequeue $(u, \delta)$ from $Q$.
5:         if $\delta < \max(\text{Query}((L, C), v, u))$ then
6:             Add $(v, \delta)$ to $L(u)$.
7:             for all $w \in V$ such that $(u, w) \in E$, $w > v$ do
8:                 Enqueue $(w, \delta + 1)$ onto $Q$.

The modified BFS returns to the starting vertex long before all vertices in the graph have been visited. Consequently, the running time is very small in practice; in our experiments, it was empirically estimated as $O(nk)$ in total.
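The modified BFS for a single loop label can be sketched as follows. The function name `loop_label` and the adjacency-dictionary input are assumptions for illustration, and the sketch returns only the finite entries of $C(v)$ rather than padding the sequence with $\infty$.

```python
from collections import deque

def loop_label(adj, v, k):
    """Finite entries of C(v): the k shortest loop lengths at v that use
    only vertices larger than or equal to v (the empty loop has length 0).

    adj is an assumed input format: dict mapping vertex -> list of neighbours.
    """
    visits = {}                       # number of expansions per vertex
    label = []
    queue = deque([(v, 0)])
    while queue and len(label) < k:
        u, d = queue.popleft()
        if visits.get(u, 0) >= k:     # each vertex is visited at most k times
            continue
        visits[u] = visits.get(u, 0) + 1
        if u == v:
            label.append(d)           # one more return to the start vertex
        for w in adj[u]:
            if w >= v:                # never step onto a smaller vertex
                queue.append((w, d + 1))
    return label

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(loop_label(triangle, 0, 3))  # [0, 2, 2]
print(loop_label(triangle, 1, 3))  # [0, 2, 4]
```

For vertex 0 of the triangle, the two loops of length 2 are 0-1-0 and 0-2-0. For vertex 1, vertex 0 is excluded, so the only loops bounce along the edge between 1 and 2.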

Algorithm for Computing Distance Labels. We assume that vertices in $V$ are ordered as $v_1, v_2, \ldots, v_n$. Then for each $1 \le i \le n$, we perform a pruned BFS from $v_i$ (Algorithm 2). The pruned BFS is essentially a modified version of the BFS from the source vertex $v$ that visits the same vertex at most $k$ times. The crucial difference is the nontrivial pruning; that is, when visiting a vertex $u$ at distance $\delta$, the search is not continued from $u$ if $\delta$ is larger than or equal to the $k$-th shortest distance computable by the current index $(L, C)$ (Line 5).

We roughly estimate the time complexity. Let $l$ be the average size of labels. We visit $O(nl)$ vertices in total, traversing $O(m/n)$ edges on average per visit and evaluating a query in $O(l)$ time (by using the fast pruning technique introduced later). Thus, the total time complexity of this part is $O(ml + nl^2)$. In our experiments, $l$ was a few hundred.
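Algorithms 1 and 2 can be put together in a short end-to-end sketch. It is an illustration under stated assumptions rather than the authors' implementation: the names `build_index`, `loop_label`, and `query` are hypothetical, vertex identifiers are assumed to already encode the processing order, queries are evaluated directly instead of with the fast pruning technique, and none of the merging optimizations are applied.

```python
from collections import deque
import heapq

INF = float("inf")

def loop_label(adj, v, k):
    """k shortest loop lengths at v using only vertices >= v (finite entries)."""
    visits, label, queue = {}, [], deque([(v, 0)])
    while queue and len(label) < k:
        u, d = queue.popleft()
        if visits.get(u, 0) >= k:
            continue
        visits[u] = visits.get(u, 0) + 1
        if u == v:
            label.append(d)
        queue.extend((w, d + 1) for w in adj[u] if w >= v)
    return label

def query(L, C, s, t, k):
    """Smallest k values of d(s,hub) + loop(hub) + d(hub,t), padded with inf."""
    to_t = {}
    for hub, d in L[t]:
        to_t.setdefault(hub, []).append(d)
    cands = [ds + dl + dt
             for hub, ds in L[s]
             for dt in to_t.get(hub, ())
             for dl in C[hub]]
    best = heapq.nsmallest(k, cands)
    return best + [INF] * (k - len(best))

def build_index(adj, k):
    """Sketch of Algorithm 1; the inner loop follows Algorithm 2."""
    order = sorted(adj)               # assumption: ids already give the vertex order
    C = {v: loop_label(adj, v, k) for v in order}
    L = {v: [] for v in order}
    for v in order:
        queue = deque([(v, 0)])
        while queue:
            u, d = queue.popleft()
            # Line 5: continue only if d beats the current k-th distance.
            if d < query(L, C, v, u, k)[-1]:
                L[u].append((v, d))
                for w in adj[u]:
                    if w > v:         # Line 7: only vertices larger than v
                        queue.append((w, d + 1))
    return L, C

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
L, C = build_index(triangle, 3)
print(query(L, C, 0, 1, 3))  # [1, 2, 3]
```

On the triangle, the three shortest paths between vertices 0 and 1 are 0-1, 0-2-1, and one of the three paths of length 3, so the answer is [1, 2, 3]. Tracing the construction by hand shows the pruning at work: the BFS from vertex 0 stops extending once a length-3 entry would no longer improve the third-best distance.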

Proof of Correctness

The correctness of our method is shown as follows. Let $L_i$ denote the set of distance labels $L$ after the $i$-th pruned BFS from $v_i$. We define $L_0(v) = \emptyset$ for any $v$. Let $I_i$ denote the pair $(L_i, C)$ of the partially constructed set of distance labels and the set of loop labels. We prove the following lemma.

Lemma 1. For every integer $i$ where $0 \le i \le n$, and every pair of vertices $(s, t)$, Query$(I_i, s, t) = (d^{\not> v_i}_{1\text{st}}(s, t), d^{\not> v_i}_{2\text{nd}}(s, t), \ldots, d^{\not> v_i}_{k\text{-th}}(s, t))$ holds.

Proof. We prove the claim by induction on $i$. When $i = 0$, we have Query$(I_i, s, t) = (\infty, \infty, \ldots, \infty)$ and the claim clearly holds. Suppose that the claim holds for every $i' < i$. For a fixed pair of vertices $(s, t)$ where $s \ne t$, we validate the claim for $i$ and the pair $(s, t)$.

Note that we can already compute Query$(I_{i-1}, s, t) = (d^{\not> v_{i-1}}_{1\text{st}}(s, t), d^{\not> v_{i-1}}_{2\text{nd}}(s, t), \ldots, d^{\not> v_{i-1}}_{k\text{-th}}(s, t))$. Let $\mathcal{P}$ denote the set of paths $P$ such that (i) $P$ is in $P^{>v_{i-1}}_{st}$, (ii) $P$ passes through $v_i$, and (iii) the length of $P$ is smaller than $d^{\not> v_{i-1}}_{k\text{-th}}(s, t)$. Let $\mathcal{P}'$ be the first $k$ elements in $\mathcal{P}$. It suffices to show that, after the $i$-th pruned BFS, we can also compute the distances of paths in $\mathcal{P}'$.

Let $P \in \mathcal{P}'$. We can split $P$ into three parts $P_{sv_i}$, $P_{v_iv_i}$, and $P_{v_it}$. Here, $P_{sv_i}$ denotes the subsequence of $P$ from $s$ to the first appearance of $v_i$ in $P$, $P_{v_iv_i}$ denotes the subsequence of $P$ from the first appearance of $v_i$ to the final appearance of $v_i$ in $P$, and $P_{v_it}$ denotes the subsequence of $P$ from the last appearance of $v_i$ in $P$ to $t$. Note that $P_{v_iv_i}$ must be among the first $k$ elements in $P^{>v_i}_{v_iv_i}$; otherwise $k$ shorter paths are possible and $P \in \mathcal{P}'$ is contradicted. Hence, $C(v_i)$ must include the length of $P_{v_iv_i}$.

Now we observe that the BFS from $v_i$ along path $P_{v_it}$ is not pruned in the $i$-th pruned BFS (and similarly for $P_{sv_i}$). To show this by contradiction, suppose that the BFS is pruned at some vertex $u$ on path $P_{v_it}$. In this case, there exist at least $k$ paths in $P^{\not> v_{i-1}}_{v_iu}$ shorter than $\delta$, where $\delta$ is the distance from $v_i$ to $u$ in the BFS. For each of these $k$ paths, we concatenate $P_{sv_i}$, $P_{v_iv_i}$, that path, and the suffix of $P_{v_it}$ from $u$ to $t$. Then, we obtain $k$ paths in $P^{\not> v_{i-1}}_{st}$ that are shorter than $P$, and therefore shorter than $d^{\not> v_{i-1}}_{k\text{-th}}(s, t)$ from condition (iii). Hence, we reach a contradiction.

Corollary 1. At the end of Algorithm 1, we can correctly answer top-$k$ distance queries using the constructed index.

Techniques for Efficient Implementation

We introduce several key techniques for practical performance improvement.

Vertex Ordering Strategy. By properly selecting the order of vertices from which we conduct pruned BFSs, our pruning can drastically reduce the search space and label sizes by exploiting the structure of real-world networks, greatly enhancing the efficiency of the proposed method. This is possible because real networks contain highly central vertices (sometimes called hubs). As a heuristic vertex ordering strategy, vertices are selected in order of decreasing degree. Further discussion is provided in (Akiba, Iwata, and Yoshida 2013).

Fast Pruning. When constructing distance labels, many queries are evaluated for pruning. However, when conducting a pruned BFS from a vertex $v$, queries are limited to "Are there more than $k$ paths of length less than $\delta$ between $v$ and $u$?" Given this restriction, we can reduce the query time. For each vertex $w$ in the distance label of $v$, we can precompute the number $c_{w,\delta'}$ of paths between $v$ and $w$ of length not exceeding $\delta'$ using the loop label $C(w)$. Suppose that we have reached vertex $u$ in the pruned BFS conducted from $v$. We can then compute the number of paths between $v$ and $u$ of length less than $\delta$ as

$$\sum_{(w, \delta', c) \in L(u)} c \cdot c_{w, \delta - \delta'}.$$
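The counting scheme can be sketched as follows. Several details are assumptions made for illustration: the names `prefix_counts` and `should_prune`, the merged label format `(hub, length, count)`, loop labels stored as `(length, count)` pairs, and the bound `max_len` on the BFS depth. The sketch counts paths of length at most $\delta$, which corresponds to the pruning test in Line 5 of Algorithm 2 (prune when $\delta$ is at least the $k$-th shortest distance).

```python
def prefix_counts(L_v, C, max_len):
    """cnt[w][x] = number of ways to go from v to hub w and loop at w
    with total length <= x, for every hub w in the label of v.

    L_v: merged distance label of the BFS source v, entries (w, d, c).
    C:   merged loop labels, C[w] is a list of (loop_length, count).
    max_len: assumed upper bound on the distances tested during the BFS.
    """
    cnt = {}
    for w, d, c in L_v:
        row = cnt.setdefault(w, [0] * (max_len + 1))
        for loop_len, loop_c in C[w]:
            total = d + loop_len
            if total <= max_len:
                row[total] += c * loop_c
    for row in cnt.values():          # turn exact-length counts into prefix sums
        for x in range(1, max_len + 1):
            row[x] += row[x - 1]
    return cnt

def should_prune(L_u, cnt, delta, k, max_len):
    """True if the current index already yields k paths of length <= delta."""
    paths = 0
    for w, d, c in L_u:
        if w in cnt and d <= delta:
            paths += c * cnt[w][min(delta - d, max_len)]
            if paths >= k:
                return True
    return False

# Triangle graph, k = 3, during the pruned BFS from v = 1 (labels traced by hand).
C = {0: [(0, 1), (2, 2)], 1: [(0, 1), (2, 1), (4, 1)]}
L_v = [(0, 1, 1), (0, 2, 1), (1, 0, 1)]   # label of the source vertex 1
L_u = [(0, 1, 1), (0, 2, 1)]              # label of the reached vertex 2
cnt = prefix_counts(L_v, C, max_len=4)
print(should_prune(L_u, cnt, 1, 3, 4))    # False
print(should_prune(L_u, cnt, 3, 3, 4))    # True
```

The prefix sums are computed once per BFS source, so each pruning test costs time proportional to the size of $L(u)$ instead of enumerating all combinations of distance and loop entries.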

Merged Queue Entries. When a (pruned) BFS is performed from a vertex $v$, rather than pairs $(u, \delta)$, each of which denotes the existence of a path of length $\delta$ between $v$ and $u$, triplets $(u, \delta, c)$ are pushed onto the queue. These triplets specify that $c$ paths of length $\delta$ exist between $v$ and $u$. This technique enables the simultaneous handling of many paths and significantly reduces the number of pushes onto the queue, and hence the running time.

Merged Label Entries. Related to the above technique, instead of pairs $(u, \delta)$, each of which denotes that there is a path of length $\delta$ between $v$ and $u$, triplets $(u, \delta, c)$ are stored in distance labels. These triplets indicate that $c$ paths of length $\delta$ exist between $v$ and $u$. A similar technique is applicable to loop labels.

Extensions

Directed graphs. If the input graph is a directed graph, we compute and store two distance labels $L_{\mathrm{IN}}(v)$ and $L_{\mathrm{OUT}}(v)$ for each vertex $v$, where $L_{\mathrm{IN}}(v)$ and $L_{\mathrm{OUT}}(v)$ contain the distances from and to $v$, respectively.

Weighted graphs. For weighted graphs, we can replace the pruned BFS by a pruned Dijkstra's algorithm. In this scheme, the queue used in Algorithm 2 is replaced by a priority queue. The time complexity becomes $O(ml + nl(\log n + l))$.

Experimental Evaluation

In this section, we show the scalability, efficiency and robustness of the proposed method through experiments on real-world networks.

Setup

Environment. All experiments were conducted on a Linux server with an Intel Xeon X5670 (2.93 GHz) and 48 GB of main memory. The proposed method was implemented in C++. The implementation will be made publicly available online.

Datasets. The target applications of the proposed method are graph mining tasks such as network-aware searching and link prediction. Therefore, our experiments were conducted on publicly available real-world social and web graphs (see footnotes 1 to 5). The sizes and types of these graphs are listed in Table 2. We treated all the graphs as unweighted undirected graphs.

Algorithms. As there are no previous indexing methods for top-$k$ distances, the proposed method was evaluated against the following two algorithms without precomputation.

• The first is the BFS-based naive approach, which uses a FIFO queue in the graph search but allows at most $k$ visits to each vertex. This algorithm was also implemented in C++ by the authors.

• The second is Eppstein's algorithm (Eppstein 1998), which theoretically attains near-optimal time complexity. We adopted the C++ implementation of Jon Graehl (see footnote 6).

1. http://lovro.lpt.fri.uni-lj.si/support.jsp
2. http://grouplens.org/datasets/hetrec-2011/
3. http://snap.stanford.edu/
4. http://socialnetworks.mpi-sws.org/datasets.html
5. http://law.di.unimi.it/datasets.php (Boldi and Vigna 2004)
6. http://www.ics.uci.edu/~eppstein/pubs/p-kpath.html
