P2P-IR Architecture PDF
Retrieval
Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu
Abstract
Peer-to-Peer (P2P) systems are very large computer networks, where peers
collaborate to provide a common service. Providing large-scale Information
Retrieval (IR), e.g. for searching the World Wide Web, is an appealing
application for P2P systems. The research community has presented several
proposals for P2P-IR. However, so far the concepts of P2P and of IR have been
intermingled. In this paper, we propose an architecture to structure P2P-IR systems.
We differentiate between concepts belonging to the construction and main-
tenance of a P2P overlay network, and those belonging to IR. Furthermore,
we distinguish between basic P2P-IR concepts, which are likely to be needed in
all P2P-IR systems, and advanced P2P-IR concepts, which depend on the
flavor of the system. This decomposition of the P2P retrieval process is an
important step towards a structured implementation of such systems. Fur-
thermore, it allows a systematic sharing of methods and resources needed to
perform retrieval. The next generation of global information retrieval systems
will combine these distributed resources in new ways to provide more efficient
web search.
Keywords: Peer-to-Peer Information Retrieval (P2P-IR), architecture,
key-based routing (KBR), P2P web search
1 Introduction
Peer-to-Peer (P2P) systems are decentralized, large-scale computer networks,
where peers operate as clients and servers at the same time. Peers can join
and leave the system at any time. The power of P2P systems lies in their ca-
pability to provide services with practically unlimited scalability based on the
principle of resource sharing. In the context of information retrieval (IR) the
principle of sharing also applies to the knowledge on document collections and
retrieval models. Existing P2P systems already form very large networks with
hundreds of thousands or even millions of participating computers. By combining
the resources of a very large number of peers, we can expect a qualitative shift
in information retrieval systems of the future.
The World Wide Web is growing at such a pace that even the biggest
centralized search engines are able to index only a small part of the avail-
able documents. Federated, decentralized search engines, where the effort is
shared among a very large number of computers, seem to have the potential
of building IR systems with almost unlimited capacity.
The research community has presented several approaches to perform P2P-
IR. The proposed systems can be categorized into IR over unstructured P2P
systems [4, 6] and IR over structured P2P systems [10, 11]. These approaches
focus, however, mostly on the use of P2P overlay networks for distributed
indexing of document collections.
While planning the implementation of a P2P-IR system, we noticed that
in the proposed solutions, different concepts are not clearly separated into
levels of abstraction. We therefore started to develop a generic architecture
for a P2P-IR system. In this architecture we propose to decompose the IR
process in a P2P environment into four different layers: 1) Transport Layer
Communications, 2) Structured Overlay Networks, 3) Document and Content
Management, and 4) Retrieval Models. In this way, we separate different con-
cerns and allow different solutions at the higher layers to take advantage of the
same infrastructure provided at the lower layers. This separation of concepts
increases modularity of design and thus reusability of components offering ba-
sic services. Furthermore, it provides a step towards making different P2P-IR
solutions interoperable, as it is unlikely that a single approach will prevail.
In this paper we give an overview of the key concepts of our architecture.
From these concepts we derive prototypical interfaces between the different
architectural layers. We concentrate particularly on layer 3 (document and
content management), which plays an important role with respect to perfor-
mance and flexibility when implementing a specific retrieval model in a P2P
environment. We also identify key-based routing (KBR) of (structured) P2P
overlay networks as the key contribution of P2P systems to support P2P-IR
efficiently, and clarify how P2P-IR solutions can take advantage of this func-
tionality. Last but not least, as a proof-of-concept, we discuss a case study
based on [10], a P2P-IR proposal from outside our group, to demonstrate that
our architecture has the potential to accommodate a wide variety of systems.
Our architecture will be the basis to explore a wide range of potential solutions
for efficient and scalable P2P-IR. We explicitly turn our attention to text-based
retrieval in this paper. We do not explicitly address, for example, link-based
ranking. However, with only slight extensions, it will also be possible to fit
such advanced retrieval concepts into our model.
This paper is structured as follows: Section 2 gives an overview of our P2P-
IR architecture. In Section 3, we provide the detailed concepts and interfaces
between the layers. Section 4 describes how document indexing and query
processing are carried out within the architecture. Section 5 introduces the
case study. We finish with our conclusions and future work in Section 6.
requests, which any peer can submit, to responsible peers in the network. In
the literature this service is referred to as key-based routing (KBR) [2]. By
mapping keys of layer 3 to identifiers of layer 2, we associate certain document
management tasks to specific peers. In this paper, we concentrate on struc-
tured overlay networks for layer 2. An adaptation to unstructured overlays is
part of future work.
The purpose of layer 2 is to provide a logical network that is independent
of the inherent dynamics (e.g. because of network failures, dynamic IP addresses,
or mobility of peers) of the physical network (layer 1). The reason for choosing
different spaces for identifiers on layer 2 and keys on layer 3 is that methods
for managing a structured overlay network impose certain properties on the
identifier space that are not necessarily satisfied by a key space, which has to
fulfill IR requirements. Nevertheless, mappings with distance-preserving prop-
erties are an important tool for optimizations in P2P-IR. Similarly, distance-
preserving mappings from identifiers on layer 2 to physical addresses on layer
1 support optimizations of the physical access to peers.
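
As an illustration of the mapping from layer-3 keys to layer-2 identifiers, the following minimal Python sketch hashes a term-set key into a small ring-shaped identifier space and picks a responsible peer. The 16-bit space, the SHA-1 hash, the "next peer clockwise" rule, and all function names are assumptions made for this example. A uniform hash like this one is not distance-preserving, so the sketch shows only the basic association of keys with peers, not the optimizations mentioned above.

```python
import hashlib

ID_BITS = 16  # assumed size of the layer-2 identifier space (illustrative only)


def key_to_identifier(terms, bits=ID_BITS):
    """Map a layer-3 key (a set of terms) to a layer-2 identifier."""
    # Canonical form: lower-cased, sorted terms, so term order does not matter.
    canonical = " ".join(sorted(t.lower() for t in terms))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def ring_distance(a, b, bits=ID_BITS):
    """Clockwise distance from identifier a to identifier b on the ring."""
    return (b - a) % (2 ** bits)


def responsible_peer(identifier, peer_ids, bits=ID_BITS):
    """Assumed rule: the first peer at or after the identifier is responsible."""
    return min(peer_ids, key=lambda p: ring_distance(identifier, p, bits))


if __name__ == "__main__":
    ident = key_to_identifier({"peer", "retrieval"})
    print(ident, responsible_peer(ident, [5, 20, 40000]))
```

Any peer that knows the key can compute the same identifier locally and hand it to the layer-2 routing service, which is what makes key-based routing usable from layer 3.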
In the following section, we will introduce the four different layers in more
detail, with a particular emphasis on layers 2 and 3 and their relationship.
[Figure: layered P2P-IR architecture. Layer 4 (documents): Retrieval Models; relevance, ranking, rank aggregation. Layer 3 (keys): Document and Content Management; document repository, vocabulary, crawling, indexing, clustering. Layer 1 (physical addresses): Transport Layer Communications; TCP/IP, UDP/IP.]
an application first opens a connection with the destination using a three-way
handshake protocol. TCP packets are acknowledged and lost packets are
automatically retransmitted. A TCP connection is closed by a FIN/ACK
exchange in each direction.
Layer 1 provides TCP and UDP sockets as interface to the upper layers.
Network Maintenance: A peer can join and leave the group of peers
forming the overlay network. Maintenance ensures the integrity of the struc-
tured overlay network under such changes. It also performs load balancing.
To join a group of peers, a new peer has to be able to communicate with at least
one peer in the group. The joining peer therefore uses an outside bootstrap
mechanism to obtain the physical address of a peer in the group. Cooperative
peers may announce when they leave the network to support maintenance.
Routing: Given an identifier id_i ∈ Id, a peer can route a message with
arbitrary payload to a peer responsible for the identifier (method route()).
The interpretation of the payload is left to applications on higher layers. More
refined versions of the routing method might return the physical address of
the target peer to allow the implementation of complex protocols directly with
this peer. To route a message with identifier id_i, each peer maintains a routing
table, which contains links to peers that are closer to a peer responsible for
Class Identifier
equal(id: Identifier): Boolean;
distance(id: Identifier): Real;
Class Peer
identifier(): Identifier;
address(): IPAddress;
neighbors(): {Peer};
join(addr: IPAddress);
leave();
route(id: Identifier, payload: ByteString);
broadcast(payload: ByteString);
history(id: Identifier, query: HistoryQuery): HistoryData;
id_i. The construction and maintenance of these routing tables, as well as the
routing strategy itself, depend on the implementation of the structured overlay
network and are hidden from upper layers. In a network of N_P peers, the
average cost of routing a message to a responsible peer is usually O(log N_P)
overlay hops.
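
To make the role of the routing table concrete, here is a small Python simulation of greedy routing on a ring-shaped identifier space. It is only a sketch under stated assumptions: the 8-bit identifier space, the Chord-like table (one entry per power-of-two offset), and the names `successor`, `build_routing_table`, and `route` are chosen for this example and are not prescribed by the architecture, which deliberately hides such details behind route().

```python
import bisect

M = 8            # assumed number of bits in the identifier space
SPACE = 2 ** M   # identifiers are 0 .. SPACE - 1, arranged on a ring


def clockwise(a, b):
    """Clockwise distance from identifier a to identifier b."""
    return (b - a) % SPACE


def successor(peers, ident):
    """First peer at or after ident (clockwise); peers must be sorted."""
    i = bisect.bisect_left(peers, ident % SPACE)
    return peers[i % len(peers)]


def build_routing_table(peers, p):
    """Entry i points to the peer responsible for p + 2^i."""
    return [successor(peers, p + 2 ** i) for i in range(M)]


def route(tables, start, ident):
    """Return the peers visited while routing ident, beginning at start."""
    ident %= SPACE
    path = [start]
    current = start
    while True:
        succ = tables[current][0]
        to_target = clockwise(current, ident)
        to_succ = clockwise(current, succ)
        if to_target == 0 or succ == current:
            return path              # the current peer is responsible
        if to_target <= to_succ:
            path.append(succ)        # the direct successor is responsible
            return path
        # Forward to the table entry closest to ident without passing it.
        current = max(
            (f for f in tables[current] if 0 < clockwise(current, f) <= to_target),
            key=lambda f: clockwise(current, f),
        )
        path.append(current)


if __name__ == "__main__":
    peers = sorted({(i * 37 + 11) % SPACE for i in range(32)})
    tables = {p: build_routing_table(peers, p) for p in peers}
    print(route(tables, peers[0], 200))
```

Each forwarding step at least halves the remaining distance to the last peer before the target, which is where the logarithmic hop count comes from in this particular table layout.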
3.3.1 Concepts Supported on Layer 3
Most of the concepts we introduce on layer 3 correspond to standard concepts
from text-based information retrieval.
One not so obvious decision was to identify clusters as the main abstrac-
tion for document management. The reason is that documents in very large
collections are often clustered to allow access on higher levels of granularity,
e.g. for indexing or query answering. Furthermore, we separate cluster hier-
archies from vocabularies because retrieval models frequently apply different
document indexing methods to the same hierarchically organized document
collection.
From a distributed data management perspective, we observed an important
requirement for distributed data access: the need to differentiate between query
shipping and data shipping. We also point out some limitations we deliberately
introduced, such as the restriction of keys to term sets, which is a common
approach but not a universal principle in text retrieval.
Posting Lists: A posting list manages the association of a key with a set
of clusters and statistics of the key (e.g. the frequency of the key in a cluster).
We call this information the cluster digest. Like keys, posting lists are always
unambiguously associated with a defined vocabulary.
Posting lists provide key-based retrieval of documents, which is an elemen-
tary function in IR. An important aspect in a distributed environment is the
possibility to send a key-based query together with a filter, which we call digest
query. Such a digest query can already filter the cluster digests of a posting
list in place. Possible criteria passed with digest queries are limitations on
the result size or containment in a specific (sub-)cluster. Digest queries thus
allow a choice between different query-shipping and data-shipping strategies.
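
The following Python sketch illustrates how a digest query can filter a posting list in place. It is a minimal example, not the interface of any existing system: the fields of `ClusterDigest` and `DigestQuery`, the convention that cluster identifiers are slash-separated paths, and the ordering by frequency are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterDigest:
    cluster_id: str   # assumed convention: path-like ids such as "news/sports"
    frequency: int    # frequency of the key in the cluster


@dataclass
class DigestQuery:
    max_results: Optional[int] = None      # limit on the result size
    within_cluster: Optional[str] = None   # containment in a (sub-)cluster


class PostingList:
    def __init__(self):
        self._digests = {}  # key (frozenset of terms) -> list of ClusterDigest

    def insert_digest(self, key, digest):
        self._digests.setdefault(frozenset(key), []).append(digest)

    def retrieve(self, key, query=None):
        digests = list(self._digests.get(frozenset(key), []))
        if query is None:
            return digests  # data shipping: return the full posting list
        # Query shipping: filter at the peer that stores the posting list.
        if query.within_cluster is not None:
            prefix = query.within_cluster
            digests = [
                d for d in digests
                if d.cluster_id == prefix or d.cluster_id.startswith(prefix + "/")
            ]
        digests.sort(key=lambda d: d.frequency, reverse=True)
        if query.max_results is not None:
            digests = digests[: query.max_results]
        return digests


if __name__ == "__main__":
    pl = PostingList()
    pl.insert_digest({"p2p"}, ClusterDigest("news/sports", 5))
    pl.insert_digest({"p2p"}, ClusterDigest("news/politics", 9))
    pl.insert_digest({"p2p"}, ClusterDigest("blogs", 7))
    print(pl.retrieve({"p2p"}, DigestQuery(max_results=1, within_cluster="news")))
```

Calling retrieve() without a digest query corresponds to data shipping, since the whole list travels to the requester; passing a digest query moves the filtering to the storing peer and reduces what is sent back.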
addition, vocabularies maintain histories of queries, which can be exploited
by retrieval models, e.g. for usage-based ranking or keyword selection (as
suggested in [5]).
Class Collection
insert(doc: Document, id: DocId);
retrieve(id: DocId): Document;
Class Document
identifier(): DocId;
equal(id: DocId): Boolean;
content(): String;
delete();
Class Cluster
identifier(): ClustId;
root(): Boolean;
createChildCluster(id: ClustId);
children(): {Cluster};
containedIn(id: ClustId): Boolean;
associate(doc: DocId);
documents(): {Document};
Class PostingList
retrieve(k: Key, q: DigestQuery): {ClustDigest};
insertDigest(key: Key, id: ClustId, d: ClustDigest);
Class ClustDigest
(* data structure together with query language
to represent indexing information *)
Class Key
terms(): Terms;
distance(key: Key): Real;
vocabulary(): Vocabulary;
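
As a hedged illustration of the Cluster interface listed above, the following Python sketch implements a small in-memory cluster hierarchy. It is one possible reading of the interface, not a reference implementation: method names are converted to Python style, `contained_in()` is assumed to be true for the cluster itself and for all of its ancestors, and `documents()` is assumed to return document identifiers from the cluster and its descendants rather than Document objects.

```python
class Cluster:
    """In-memory sketch of a cluster hierarchy (local to one peer)."""

    def __init__(self, ident, parent=None):
        self._id = ident
        self._parent = parent
        self._children = []
        self._doc_ids = set()

    def identifier(self):
        return self._id

    def root(self):
        return self._parent is None

    def create_child_cluster(self, ident):
        child = Cluster(ident, parent=self)
        self._children.append(child)
        return child

    def children(self):
        return list(self._children)

    def contained_in(self, ident):
        # Assumption: a cluster is contained in itself and in every ancestor.
        node = self
        while node is not None:
            if node._id == ident:
                return True
            node = node._parent
        return False

    def associate(self, doc_id):
        self._doc_ids.add(doc_id)

    def documents(self):
        # Assumption: collect identifiers from this cluster and its descendants.
        found = set(self._doc_ids)
        for child in self._children:
            found |= child.documents()
        return found


if __name__ == "__main__":
    root = Cluster("root")
    news = root.create_child_cluster("news")
    sports = news.create_child_cluster("sports")
    sports.associate("d1")
    news.associate("d2")
    print(sorted(root.documents()), sports.contained_in("news"))
```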
3.4 Layer 4: Retrieval Models
Different retrieval models use the same document and content management
functions of layer 3. Thus, it is important to have a common framework, in
which we can capture the model we would like to use. Such a framework char-
acterizes notions like ranking, relevance, rank aggregation, etc. and should be
probabilistic (in some sense).
Basic concepts on layer 4 are the representation of user queries, the provi-
sion of ranking and clustering functions, and interfaces for basic information
retrieval tasks. Layer 4 provides functions for constructing vocabularies and
document indexing, such as extracting keys from documents. The implemen-
tation of these functions relies on the services provided by layer 3. Due to
space limitations we omit a detailed specification of this layer.
4 Implementation
In this section we sketch how typical retrieval tasks, namely document index-
ing and document retrieval, are implemented using our layered architecture.
3. p generates a set of query keys from the application query.
4. The set of query keys is used to retrieve the posting lists using the layer-3
function retrieve(), which in turn is executed by using layer-2 routing
functionality.
5. After having retrieved the posting lists, which are possibly pre-filtered, p
(on layer 4) computes the ranked result according to the specific retrieval
model.
6. For producing a more compact or better structured representation of
the result, p might exploit the structure of the cluster hierarchy that is
associated with the retrieval model.
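
Steps 3 to 5 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: query keys are single lower-cased terms, the `retrieve` argument stands in for the layer-3 function (which in a real system would be executed via layer-2 routing), posting-list entries are (cluster identifier, frequency) pairs, and the ranking simply sums frequencies. None of these choices is mandated by the architecture, and step 6 is not covered.

```python
from collections import defaultdict


def query_keys(application_query):
    """Step 3: derive query keys; here, one single-term key per distinct word."""
    return sorted({word.lower() for word in application_query.split()})


def process_query(application_query, retrieve, top_k=10):
    """Steps 4 and 5: fetch posting lists per key, then rank the clusters."""
    scores = defaultdict(float)
    for key in query_keys(application_query):
        # Step 4: `retrieve` is a stand-in for the layer-3 retrieve() call.
        for cluster_id, frequency in retrieve(key):
            scores[cluster_id] += frequency
    # Step 5: assumed toy ranking, highest summed frequency first.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    # Local dictionary used in place of a distributed index for the example.
    index = {"peer": [("c1", 3), ("c2", 1)], "retrieval": [("c2", 4)]}
    print(process_query("Peer retrieval", lambda key: index.get(key, [])))
```

Because the retrieval function is passed in as a parameter, the same ranking code works against a local dictionary, as in the example, or against a distributed posting-list service.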
5 Case Study
We are currently exploring a number of retrieval models that are particularly
suited for use in a P2P environment. However, to verify the
general applicability of the proposed architecture, we also map
other approaches from the literature onto our architecture.
In the following we provide, for illustration purposes, one example of how a
concrete system proposed independently by C. Tang and S. Dwarkadas [10]
can be mapped onto the framework we have introduced.
Layer 4: The vector space model (VSM) is used for mapping documents
onto keys. To this end, all terms in a document are weighted using a special
algorithm. This algorithm relies on global statistics that are aggregated and
maintained by layer 3. Furthermore, the authors use a stemmer and exclude
stop words. This is represented in our model by the use of a retrieval-model
specific document indexing method.
Layer 3: The authors use keys, which are terms or phrases of terms. Document
identifiers and posting lists, as well as their insertion and retrieval, are
implicitly introduced. Posting lists also contain document digests instead of
only document references. A document digest consists of a complete list of
terms in the document. Tang and Dwarkadas also use digest queries to select
which document references of the posting lists are returned.
Layers 1 and 2: The authors use Chord [9] as the underlying Distributed Hash
Table (DHT). Chord builds and maintains a structured overlay network and
provides key-based routing (KBR). It can be extended to also provide broad-
cast and history functionality on layer 2.
Thus, in summary, we have no difficulties in mapping all concepts intro-
duced in this specific approach into our architectural framework.
7 Acknowledgments
The work presented in this paper was carried out in the framework of the
EPFL Center for Global Computing and supported by the Swiss National
Funding Agency OFES as part of the European FP 6 STREP project ALVIS
(002068).
References
[1] K. Aberer. P-Grid: A Self-Organizing Access Structure for P2P Information
Systems. Sixth International Conference on Cooperative Information
Systems (CoopIS 2001), Trento, Italy, 2001.
[2] F. Dabek, B. Zhao, P. Druschel, and I. Stoica: Towards a common API
for structured Peer-to-Peer overlays. In 2nd International Workshop on
Peer-to-Peer Systems (IPTPS’03), Feb. 2003
[3] S. El-Ansary, L. O. Alima, P. Brand, S. Haridi: Efficient Broadcast in
Structured P2P Networks. 2nd International Workshop on Peer-to-Peer
Systems (IPTPS ’03), Berkeley, CA, USA, 2003
[4] A. Löser, W. Nejdl, M. Wolpers, and W. Siberski: Information integra-
tion in schema-based peer-to-peer networks. In Proceedings of the 15th
Conference On Advanced Information Systems Engineering (CAISE 03)
(Klagenfurt/Velden, Austria, June 2003).
[5] F. Klemm, A. Datta, K. Aberer: A Query-Adaptive Partial Distributed
Hash Table for Peer-to-Peer Systems. International Workshop on Peer-
to-Peer Computing & DataBases (P2P&DB 2004), Crete, Greece, March
2004
[6] J. Lu, and J. Callan: Content-based retrieval in hybrid peer-to-peer net-
works. In Proceedings of the Twelfth International Conference on Infor-
mation and Knowledge Management (CIKM’03). New Orleans, 2003
[7] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scal-
able content-addressable network. In SIGCOMM, Aug. 2001.
[8] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object loca-
tion and routing for large-scale peer-to-peer systems. In IFIP/ACM In-
ternational Conference on Distributed Systems Platforms (Middleware),
November 2001.
[9] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.
In Proceedings of ACM SIGCOMM 2001.
[10] C. Tang, S. Dwarkadas: Hybrid Global-Local Indexing for Efficient Peer-
to-Peer Information Retrieval. First Symposium on Networked Systems
Design and Implementation (NSDI’04), San Francisco, March 2004.
[11] C. Tang, Z. Xu, S. Dwarkadas: Peer-to-Peer Information Retrieval Using
Self-Organizing Semantic Overlay Networks. SIGCOMM 2003