Heyuan Li1, 2, Yuanhai Xue1, 2, Xu Chen1, 2, Xiaoming Yu1, Feng Guan1, Yue Liu1, Xueqi Cheng1
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190
2. Graduate School of Chinese Academy of Sciences, Beijing, 100190
1. Introduction
The ad hoc task in TREC investigates the performance of systems that search a static set of documents
using previously-unseen topics. This year, the ClueWeb09 dataset [1] was again used as the document collection,
but the topics developed for this year were less common and more ambiguous than in previous years.
The rest of this paper is organized as follows. In Section 2, we discuss the processing of ClueWeb09,
the derived data and the external resources. In Section 3, we introduce the BM25 model with term proximity, searching with
anchor text, query expansion and the promotion of authoritative sites. We report experimental
results in Section 4 and conclude our work in Section 5.
2. Data Processing
2.1 Parsing the documents
The ClueWeb09 dataset consists of 500 million English pages, each identified by a TREC_ID. We parse
these pages and split each of them into six parts: TREC_ID, Title, Keywords, Content, URL and Anchors. The parsed
documents are expressed as XML documents for indexing. During our experiments, it is very common to
request the content or URL of a document by its TREC_ID. Therefore, we use Tokyo Tyrant to build three large maps:
TREC_ID – URL, TREC_ID – Content and URL – TREC_ID. These maps are provided as an RPC service
and are also used by the diversity task, the session track and the entity track.
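The following is a minimal sketch of how such lookup maps could be populated and queried from Python, assuming the pytyrant client and one Tokyo Tyrant instance per map; the host/port values and the parsed-record fields are illustrative assumptions, not our actual deployment.

```python
# Sketch: building the TREC_ID <-> URL / Content lookup maps on Tokyo Tyrant.
# Assumption: the pytyrant client and three Tokyo Tyrant instances (one per map);
# ports and record layout are illustrative only.
import pytyrant

id2url = pytyrant.PyTyrant.open("127.0.0.1", 1978)      # TREC_ID -> URL
id2content = pytyrant.PyTyrant.open("127.0.0.1", 1979)   # TREC_ID -> Content
url2id = pytyrant.PyTyrant.open("127.0.0.1", 1980)       # URL -> TREC_ID

def store(record):
    """Insert one parsed page, given as a dict with TREC_ID, URL, Content, ..."""
    trec_id, url = record["TREC_ID"], record["URL"]
    id2url[trec_id] = url
    id2content[trec_id] = record["Content"]
    url2id[url] = trec_id

# Typical lookup during the experiments: fetch a page's content by TREC_ID.
content = id2content["clueweb09-en0000-00-00000"]
```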
2.2 System
This year, we use Golaxy [2], a high-performance distributed search platform. Golaxy was deployed
on eleven servers: one for merging and ten for indexing and search. Each server has 16 CPU cores, 32 GB of memory
and a 2 TB hard disk. It takes about ten hours to index all of the 500 million English documents. A C++
search client was developed for the TREC experiments. To retrieve N documents, the client sends the query to the
merge server. The merge server dispatches the query to the search servers, merges the results from each of them and returns
the top N documents to the client.
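As a rough illustration of the merge step, the sketch below combines the ranked lists returned by the search servers into a single top-N list by score. The result format (doc_id, score) and the query_server helper are assumptions for illustration, not Golaxy's actual API.

```python
# Sketch of the merge server's job: each search server returns its own ranked
# top-n list; the merge server interleaves them by score and keeps the global top n.
import heapq

def search(query, n, servers, query_server):
    """query_server(server, query, n) -> list of (doc_id, score), sorted by score desc."""
    per_server = [query_server(s, query, n) for s in servers]
    merged = heapq.merge(*per_server, key=lambda hit: hit[1], reverse=True)
    return list(merged)[:n]
```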
2.3 Spam filtering
As we found last year, spam detection and removal are very important for improving retrieval
performance. This year we use the Waterloo Spam Rankings [3] as a spam filter, with the Fusion score. We
filter out the bottom 50% of the documents, which are the most likely to be spam. To speed up this procedure, a
TREC_ID – Spam-Score map was set up using Tokyo Tyrant, as described in Section 2.1.
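A minimal sketch of this filtering step follows, assuming the Fusion scores are percentiles in [0, 99] with lower values more likely to be spam, and that they have already been loaded into the TREC_ID – Spam-Score map; the threshold of 50 reflects the bottom-50% rule above.

```python
# Sketch of spam filtering applied to a ranked result list. Documents below the
# 50th spam percentile are dropped; spam_score is the TREC_ID -> score map
# (any dict-like object works here). Documents with no score are treated as spam.
SPAM_THRESHOLD = 50

def filter_spam(results, spam_score):
    """results: list of (trec_id, retrieval_score); keep docs at or above the threshold."""
    return [(trec_id, score) for trec_id, score in results
            if int(spam_score.get(trec_id, 0)) >= SPAM_THRESHOLD]
```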
2.4 Using Open Directory Project
The Open Directory Project (ODP), also known as DMOZ, is a multilingual open-content directory of
World Wide Web links [5]. Navigational queries seek a single website or web page of a single entity.
Traditional retrieval models handle informational queries well, but may fail to boost the
navigational website. ODP, on the other hand, can help in this case by looking up the directory and returning the
corresponding URL directly. We use the search function of ODP, driven by a Python script, to get the top result for
each of this year's topics. The title of the result is also checked, as sketched below; websites whose titles fail to contain all the
words in the query are dropped and not used.
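The sketch below shows the title check: the top ODP result is kept only if its title contains every word of the query (case-insensitive). The fetch_top_odp_result helper is a hypothetical stand-in for the script that queries the ODP search function.

```python
# Sketch of the ODP title check. fetch_top_odp_result(query) is assumed to return
# the top result as a (title, url) pair, or None if ODP returns nothing.
def title_matches(query, title):
    title_words = set(title.lower().split())
    return all(word in title_words for word in query.lower().split())

def odp_candidate(query, fetch_top_odp_result):
    result = fetch_top_odp_result(query)
    if result is None:
        return None
    title, url = result
    return url if title_matches(query, title) else None
```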