Search Engines
Information Retrieval
in Practice
W. BRUCE CROFT
University of Massachusetts, Amherst

DONALD METZLER
Yahoo! Research

TREVOR STROHMAN
Google Inc.

Addison-Wesley

Boston Columbus Indianapolis New York San Francisco Upper Saddle River
Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Editor-in-Chief Michael Hirsch
Acquisitions Editor Matt Goldstein
Editorial Assistant Sarah Milmore
Managing Editor Jeff Holcomb
Online Product Manager Bethany Tidd
Director of Marketing Margaret Waples
Marketing Manager Erin Davis
Marketing Coordinator Kathryn Ferranti
Senior Manufacturing Buyer Carol Melville
Text Design, Composition, and Illustrations W. Bruce Croft, Donald Metzler, and Trevor Strohman
Art Direction Linda Knowles
Cover Design Elena Sidorova
Cover Image © Peter Gudella / Shutterstock

Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and Addison-
Wesley was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

The programs and applications presented in this book have been included for their
instructional value. They have been tested with care, but are not guaranteed for any
particular purpose. The publisher does not offer any warranties or representations, nor
does it accept any liabilities with respect to the programs or applications.

Library of Congress Cataloging-in-Publication Data available upon request

Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication
may be reproduced, stored in a retrieval system, or transmitted, in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the prior
written permission of the publisher. Printed in the United States of America. For
information on obtaining permission for use of material in this work, please submit a
written request to Pearson Education, Inc., Rights and Contracts Department, 501
Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online at
http://www.pearsoned.com/legal/permissions.htm.

ISBN-13: 978-0-13-607224-9
ISBN-10: 0-13-607224-0

1 2 3 4 5 6 7 8 9 10-HP-13 12 11 10 09
Preface

This book provides an overview of the important issues in information retrieval,
and how those issues affect the design and implementation of search engines. Not
every topic is covered at the same level of detail. We focus instead on what we
consider to be the most important alternatives to implementing search engine
components and the information retrieval models underlying them. Web search
engines are obviously a major topic, and we base our coverage primarily on the
technology we all use on the Web,¹ but search engines are also used in many other
applications. That is the reason for the strong emphasis on the information
retrieval theories and concepts that underlie all search engines.

The target audience for the book is primarily undergraduates in computer
science or computer engineering, but graduate students should also find this useful.
We also consider the book to be suitable for most students in information
science programs. Finally, practicing search engineers should benefit from the book,
whatever their background. There is mathematics in the book, but nothing too
esoteric. There are also code and programming exercises in the book, but nothing
beyond the capabilities of someone who has taken some basic computer science
and programming classes.

The exercises at the end of each chapter make extensive use of a Java™-based
open source search engine called Galago. Galago was designed both for this book
and to incorporate lessons learned from experience with the Lemur and Indri
projects. In other words, this is a fully functional search engine that can be used
to support real applications. Many of the programming exercises require the use,
modification, and extension of Galago components.
¹ In keeping with common usage, most uses of the word "web" in this book are not
capitalized, except when we refer to the World Wide Web as a separate entity.

Contents

In the first chapter, we provide a high-level review of the field of information
retrieval and its relationship to search engines. In the second chapter, we describe
the architecture of a search engine. This is done to introduce the entire range of
search engine components without getting stuck in the details of any particular
aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques
for acquiring the information that will be searched. Chapter 4 describes the
statistical nature of text and the techniques that are used to process it, recognize
important features, and prepare it for indexing. Chapter 5 describes how to create
indexes for efficient search and how those indexes are used to process queries. In
Chapter 6, we describe the techniques that are used to process queries and
transform them into better representations of the user's information need.

Ranking algorithms and the retrieval models they are based on are covered
in Chapter 7. This chapter also includes an overview of machine learning
techniques and how they relate to information retrieval and search engines. Chapter
8 describes the evaluation and performance metrics that are used to compare and
tune search engines. Chapter 9 covers the important classes of techniques used for
classification, filtering, clustering, and dealing with spam. Social search is a term
used to describe search applications that involve communities of people in
tagging content or answering questions. Search techniques for these applications and
peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an
overview of advanced techniques that capture more of the content of documents
than simple word-based approaches. This includes techniques that use linguistic
features, the document structure, and the content of nontextual media, such as
images or music.

Information retrieval theory and the design, implementation, evaluation, and
use of search engines cover too many topics to describe them all in depth in one
book. We have tried to focus on the most important topics while giving some
coverage to all aspects of this challenging and rewarding subject.

Supplements

A range of supplementary material is provided for the book. This material is
designed both for those taking a course based on the book and for those giving the
course. Specifically, this includes:
• Extensive lecture slides (in PDF and PPT format)
• Solutions to selected end-of-chapter problems (instructors only)
• Test collections for exercises
• Galago search engine
The supplements are available at www.search-engines-book.com, or at www.aw.com.

Acknowledgments

First and foremost, this book would not have happened without the tremendous
support and encouragement from our wives, Pam Aselton, Anne-Marie
Strohman, and Shelley Wang. The University of Massachusetts Amherst provided
material support for the preparation of the book and awarded a Conti Faculty
Fellowship to Croft, which sped up our progress significantly. The staff at the Center
for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell,
and Andre Gauthier) made our lives easier in many ways, and our colleagues and
students in the Center provided the stimulating environment that makes working
in this area so rewarding. A number of people reviewed parts of the book and
we appreciated their comments. Finally, we have to mention our children, Doug,
Eric, Evan, and Natalie, or they would never forgive us.

BRUCE CROFT
DON METZLER
TREVOR STROHMAN
Contents

1 Search Engines and Information Retrieval 1


1.1 What Is Information Retrieval? 1
1.2 The Big Issues 4
1.3 Search Engines 6
1.4 Search Engineers 9

2 Architecture of a Search Engine 13


2.1 What Is an Architecture? 13
2.2 Basic Building Blocks 14
2.3 Breaking It Down 17
2.3.1 Text Acquisition 17
2.3.2 Text Transformation 19
2.3.3 Index Creation 22
2.3.4 User Interaction 23
2.3.5 Ranking 25
2.3.6 Evaluation 27
2.4 How Does It Really Work? 28

3 Crawls and Feeds 31


3.1 Deciding What to Search 31
3.2 Crawling the Web 32
3.2.1 Retrieving Web Pages 33
3.2.2 The Web Crawler 35
3.2.3 Freshness 37
3.2.4 Focused Crawling 41
3.2.5 Deep Web 41
3.2.6 Sitemaps 43
3.2.7 Distributed Crawling 44
3.3 Crawling Documents and Email 46
3.4 Document Feeds 47
3.5 The Conversion Problem 49
3.5.1 Character Encodings 50
3.6 Storing the Documents 52
3.6.1 Using a Database System 53
3.6.2 Random Access 53
3.6.3 Compression and Large Files 54
3.6.4 Update 56
3.6.5 BigTable 57
3.7 Detecting Duplicates 60
3.8 Removing Noise 63

4 Processing Text 73
4.1 From Words to Terms 73
4.2 Text Statistics 75
4.2.1 Vocabulary Growth 80
4.2.2 Estimating Collection and Result Set Sizes 83
4.3 Document Parsing 86
4.3.1 Overview 86
4.3.2 Tokenizing 87
4.3.3 Stopping 90
4.3.4 Stemming 91
4.3.5 Phrases and N-grams 97
4.4 Document Structure and Markup 101
4.5 Link Analysis 104
4.5.1 Anchor Text 105
4.5.2 PageRank 105
4.5.3 Link Quality 111
4.6 Information Extraction 113
4.6.1 Hidden Markov Models for Extraction 115
4.7 Internationalization 118

5 Ranking with Indexes 125


5.1 Overview 125
5.2 Abstract Model of Ranking 126
5.3 Inverted Indexes 129
5.3.1 Documents 131
5.3.2 Counts 133
5.3.3 Positions 134
5.3.4 Fields and Extents 136
5.3.5 Scores 138
5.3.6 Ordering 139
5.4 Compression 140
5.4.1 Entropy and Ambiguity 142
5.4.2 Delta Encoding 144
5.4.3 Bit-Aligned Codes 145
5.4.4 Byte-Aligned Codes 148
5.4.5 Compression in Practice 149
5.4.6 Looking Ahead 151
5.4.7 Skipping and Skip Pointers 151
5.5 Auxiliary Structures 154
5.6 Index Construction 156
5.6.1 Simple Construction 156
5.6.2 Merging 157
5.6.3 Parallelism and Distribution 158
5.6.4 Update 164
5.7 Query Processing 165
5.7.1 Document-at-a-time Evaluation 166
5.7.2 Term-at-a-time Evaluation 168
5.7.3 Optimization Techniques 170
5.7.4 Structured Queries 178
5.7.5 Distributed Evaluation 180
5.7.6 Caching 181

6 Queries and Interfaces 187


6.1 Information Needs and Queries 187
6.2 Query Transformation and Refinement 190
6.2.1 Stopping and Stemming Revisited 190
6.2.2 Spell Checking and Suggestions 193
6.2.3 Query Expansion 199
6.2.4 Relevance Feedback 208
6.2.5 Context and Personalization 211
6.3 Showing the Results 215
6.3.1 Result Pages and Snippets 215
6.3.2 Advertising and Search 218
6.3.3 Clustering the Results 221
6.4 Cross-Language Search 226

7 Retrieval Models 233


7.1 Overview of Retrieval Models 233
7.1.1 Boolean Retrieval 235
7.1.2 The Vector Space Model 237
7.2 Probabilistic Models 243
7.2.1 Information Retrieval as Classification 244
7.2.2 The BM25 Ranking Algorithm 250
7.3 Ranking Based on Language Models 252
7.3.1 Query Likelihood Ranking 254
7.3.2 Relevance Models and Pseudo-Relevance Feedback 261
7.4 Complex Queries and Combining Evidence 267
7.4.1 The Inference Network Model 268
7.4.2 The Galago Query Language 273
7.5 Web Search 279
7.6 Machine Learning and Information Retrieval 283
7.6.1 Learning to Rank 284
7.6.2 Topic Models and Vocabulary Mismatch 288
7.7 Application-Based Models 291

8 Evaluating Search Engines 297


8.1 Why Evaluate? 297
8.2 The Evaluation Corpus 299
8.3 Logging 305
8.4 Effectiveness Metrics 308
8.4.1 Recall and Precision 308
8.4.2 Averaging and Interpolation 313
8.4.3 Focusing on the Top Documents 318
8.4.4 Using Preferences 321
8.5 Efficiency Metrics 322
8.6 Training, Testing, and Statistics 325
8.6.1 Significance Tests 325
8.6.2 Setting Parameter Values 330
8.6.3 Online Testing 332
8.7 The Bottom Line 333

9 Classification and Clustering 339


9.1 Classification and Categorization 340
9.1.1 Naive Bayes 342
9.1.2 Support Vector Machines 351
9.1.3 Evaluation 359
9.1.4 Classifier and Feature Selection 359
9.1.5 Spam, Sentiment, and Online Advertising 364
9.2 Clustering 373
9.2.1 Hierarchical and K-Means Clustering 375
9.2.2 K Nearest Neighbor Clustering 384
9.2.3 Evaluation 386
9.2.4 How to Choose K 387
9.2.5 Clustering and Search 389

10 Social Search 397


10.1 What Is Social Search? 397
10.2 User Tags and Manual Indexing 400
10.2.1 Searching Tags 402
10.2.2 Inferring Missing Tags 404
10.2.3 Browsing and Tag Clouds 406
10.3 Searching with Communities 408
10.3.1 What Is a Community? 408
10.3.2 Finding Communities 409
10.3.3 Community-Based Question Answering 415
10.3.4 Collaborative Searching 420
10.4 Filtering and Recommending 423
10.4.1 Document Filtering 423
10.4.2 Collaborative Filtering 432
10.5 Peer-to-Peer and Metasearch 438
10.5.1 Distributed Search 438
10.5.2 P2P Networks 442

11 Beyond Bag of Words 451


11.1 Overview 451
11.2 Feature-Based Retrieval Models 452
11.3 Term Dependence Models 454
11.4 Structure Revisited 459
11.4.1 XML Retrieval 461
11.4.2 Entity Search 464
11.5 Longer Questions, Better Answers 466
11.6 Words, Pictures, and Music 470
11.7 One Search Fits All? 479

References 487

Index 513
List of Figures

1.1 Search engine design and the core information retrieval issues 9

2.1 The indexing process 15
2.2 The query process 16

3.1 A uniform resource locator (URL), split into three parts 33
3.2 Crawling the Web. The web crawler connects to web servers to
find pages. Pages may link to other pages on the same server or
on different servers 34
3.3 An example robots.txt file 36
3.4 A simple crawling thread implementation 37
3.5 An HTTP HEAD request and server response 38
3.6 Age and freshness of a single page over time 39
3.7 Expected age of a page with mean change frequency λ = 1/7
(one week) 40
3.8 An example sitemap file 43
3.9 An example RSS 2.0 feed 48
3.10 An example of text in the TREC Web compound document
format 55
3.11 An example link with anchor text 56
3.12 BigTable stores data in a single logical table, which is split into
many smaller tablets 57
3.13 A BigTable row 58
3.14 Example of fingerprinting process 62
3.15 Example of simhash fingerprinting process 64
3.16 Main content block in a web page 65
3.17 Tag counts used to identify text blocks in a web page 66
3.18 Part of the DOM structure for the example web page 67

4.1 Rank versus probability of occurrence for words assuming
Zipf's law (rank × probability = 0.1) 76
4.2 A log-log plot of Zipf's law compared to real data from AP89.
The predicted relationship between probability of occurrence
and rank breaks down badly at high ranks 79
4.3 Vocabulary growth for the TREC AP89 collection compared
to Heaps' law 81
4.4 Vocabulary growth for the TREC GOV2 collection compared
to Heaps' law 82
4.5 Result size estimate for web search 83
4.6 Comparison of stemmer output for a TREC query. Stopwords
have also been removed 95
4.7 Output of a POS tagger for a TREC query 98
4.8 Part of a web page from Wikipedia 102
4.9 HTML source for example Wikipedia page 103
4.10 A sample "Internet" consisting of just three web pages. The
arrows denote links between the pages 108
4.11 Pseudocode for the iterative PageRank algorithm 110
4.12 Trackback links in blog postings 112
4.13 Text tagged by information extraction 114
4.14 Sentence model for statistical entity extractor 116
4.15 Chinese segmentation and bigrams 119

5.1 The components of the abstract model of ranking: documents,
features, queries, the retrieval function, and document scores 127
5.2 A more concrete model of ranking. Notice how both the query
and the document have feature functions in this model 128
5.3 An inverted index for the documents (sentences) in Table 5.1 132
5.4 An inverted index, with word counts, for the documents in
Table 5.1 134
5.5 An inverted index, with word positions, for the documents in
Table 5.1 135
5.6 Aligning posting lists for "tropical" and "fish" to find the phrase
"tropical fish" 136
5.7 Aligning posting lists for "fish" and title to find matches of the
word "fish" in the title field of a document 138
5.8 Pseudocode for a simple indexer 157
5.9 An example of index merging. The first and second indexes are
merged together to produce the combined index 158
5.10 MapReduce 161
5.11 Mapper for a credit card summing algorithm 162
5.12 Reducer for a credit card summing algorithm 162
5.13 Mapper for documents 163
5.14 Reducer for word postings 164
5.15 Document-at-a-time query evaluation. The numbers (x:y)
represent a document number (x) and a word count (y) 166
5.16 A simple document-at-a-time retrieval algorithm 167
5.17 Term-at-a-time query evaluation 168
5.18 A simple term-at-a-time retrieval algorithm 169
5.19 Skip pointers in an inverted list. The gray boxes show skip
pointers, which point into the white boxes, which are inverted
list postings 170
5.20 A term-at-a-time retrieval algorithm with conjunctive processing 173
5.21 A document-at-a-time retrieval algorithm with conjunctive
processing 174
5.22 MaxScore retrieval with the query "eucalyptus tree". The gray
boxes indicate postings that can be safely ignored during scoring. 176
5.23 Evaluation tree for the structured query #combine(#od:1(tropical
fish) #od:1(aquarium fish) fish) 179

6.1 Top ten results for the query "tropical fish" 209
6.2 Geographic representation of Cape Cod using bounding
rectangles 214
6.3 Typical document summary for a web search 215
6.4 An example of a text span of words (w) bracketed by significant
words (s) using Luhn's algorithm 216
6.5 Advertisements displayed by a search engine for the query "fish
tanks" 221
6.6 Clusters formed by a search engine from top-ranked documents
for the query "tropical fish". Numbers in brackets are the
number of documents in the cluster. 222
6.7 Categories returned for the query "tropical fish" in a popular
online retailer 225
6.8 Subcategories and facets for the "Home & Garden" category 225
6.9 Cross-language search 226
6.10 A French web page in the results list for the query "pecheur
france" 228

7.1 Term-document matrix for a collection of four documents 239
7.2 Vector representation of documents and queries 240
7.3 Classifying a document as relevant or non-relevant 245
7.4 Example inference network model 269
7.5 Inference network with three nodes 271
7.6 Galago query for the dependence model 282
7.7 Galago query for web data 282

8.1 Example of a TREC topic 302
8.2 Recall and precision values for two rankings of six relevant
documents 311
8.3 Recall and precision values for rankings from two different queries 314
8.4 Recall-precision graphs for two queries 315
8.5 Interpolated recall-precision graphs for two queries 316
8.6 Average recall-precision graph using standard recall levels 317
8.7 Typical recall-precision graph for 50 queries from TREC 318
8.8 Probability distribution for test statistic values assuming the
null hypothesis. The shaded area is the region of rejection for a
one-sided test 327
8.9 Example distribution of query effectiveness improvements 335

9.1 Illustration of how documents are represented in the
multiple-Bernoulli event space. In this example, there are 10 documents
(each with a unique id), two classes (spam and not spam), and a
vocabulary that consists of the terms "cheap", "buy", "banking",
"dinner", and "the". 346
9.2 Illustration of how documents are represented in the
multinomial event space. In this example, there are 10
documents (each with a unique id), two classes (spam and not
spam), and a vocabulary that consists of the terms "cheap",
"buy", "banking", "dinner", and "the". 349
9.3 Data set that consists of two classes (pluses and minuses). The
data set on the left is linearly separable, whereas the one on the
right is not. 352
9.4 Graphical illustration of Support Vector Machines for the
linearly separable case. Here, the hyperplane defined by w is
shown, as well as the margin, the decision regions, and the
support vectors, which are indicated by circles. 353
9.5 Generative process used by the Naive Bayes model. First, a class
is chosen according to P(c), and then a document is chosen
according to P(d|c). 360
9.6 Example data set where non-parametric learning algorithms,
such as a nearest neighbor classifier, may outperform parametric
algorithms. The pluses and minuses indicate positive and
negative training examples, respectively. The solid gray line
shows the actual decision boundary, which is highly non-linear. 361
9.7 Example output of SpamAssassin email spam filter 365
9.8 Example of web page spam, showing the main page and some
of the associated term and link spam 367
9.9 Example product review incorporating sentiment 370
9.10 Example semantic class match between a web page about
rainbow fish (a type of tropical fish) and an advertisement
for tropical fish food. The nodes "Aquariums", "Fish", and
"Supplies" are example nodes within a semantic hierarchy.
The web page is classified as "Aquariums - Fish" and the ad is
classified as "Supplies - Fish". Here, "Aquariums" is the least
common ancestor. Although the web page and ad do not share
any terms in common, they can be matched because of their
semantic similarity. 372
9.11 Example of divisive clustering with K = 4. The clustering
proceeds from left to right and top to bottom, resulting in four
clusters 376
9.12 Example of agglomerative clustering with K = 4. The
clustering proceeds from left to right and top to bottom,
resulting in four clusters 377
9.13 Dendrogram that illustrates the agglomerative clustering of the
points from Figure 9.12 377
9.14 Examples of clusters in a graph formed by connecting nodes
representing instances. A link represents a distance between the
two instances that is less than some threshold value 379
9.15 Illustration of how various clustering cost functions are computed 381
9.16 Example of overlapping clustering using nearest neighbor
clustering with K = 5. The overlapping clusters for the black
points (A, B, C, and D) are shown. The five nearest neighbors
for each black point are shaded gray and labeled accordingly. 385
9.17 Example of overlapping clustering using Parzen windows. The
clusters for the black points (A, B, C, and D) are shown. The
shaded circles indicate the windows used to determine cluster
membership. The neighbors for each black point are shaded
gray and labeled accordingly. 388
9.18 Cluster hypothesis tests on two TREC collections. The top
two compare the distributions of similarity values between
relevant-relevant and relevant-nonrelevant pairs (light gray) of
documents. The bottom two show the local precision of the
relevant documents 390

10.1 Search results used to enrich a tag representation. In this
example, the tag being expanded is "tropical fish". The query
"tropical fish" is run against a search engine, and the snippets
returned are then used to generate a distribution over related
terms 403
10.2 Example of a tag cloud in the form of a weighted list. The
tags are in alphabetical order and weighted according to some
criteria, such as popularity. 407
10.3 Illustration of the HITS algorithm. Each row corresponds to a
single iteration of the algorithm and each column corresponds
to a specific step of the algorithm 412
10.4 Example of how nodes within a directed graph can be
represented as vectors. For a given node p, its vector
representation has component q set to 1 if p → q 413
10.5 Overview of the two common collaborative search scenarios.
On the left is co-located collaborative search, which involves
multiple participants in the same location at the same time.
On the right is remote collaborative search, where participants
are in different locations and not necessarily all online and
searching at the same time 421
10.6 Example of a static filtering system. Documents arrive over time
and are compared against each profile. Arrows from documents
to profiles indicate the document matches the profile and is
retrieved 425
10.7 Example of an adaptive filtering system. Documents arrive
over time and are compared against each profile. Arrows from
documents to profiles indicate the document matches the
profile and is retrieved. Unlike static filtering, where profiles are
static over time, profiles are updated dynamically (e.g., when a
new match occurs) 428
10.8 A set of users within a recommender system. Users and their
ratings for some item are given. Users with question marks
above their heads have not yet rated the item. It is the goal of
the recommender system to fill in these question marks 434
10.9 Illustration of collaborative filtering using clustering. Groups
of similar users are outlined with dashed lines. Users and their
ratings for some item are given. In each group, there is a single
user who has not judged the item. For these users, the unjudged
item is assigned an automatic rating based on the ratings of
similar users 435
10.10 Metasearch engine architecture. The query is broadcast to
multiple web search engines and result lists are merged 439
10.11 Network architectures for distributed search: (a) central hub;
(b) pure P2P; and (c) hierarchical P2P. Dark circles are hub
or superpeer nodes, gray circles are provider nodes, and white
circles are consumer nodes 443
10.12 Neighborhoods (N_i) of a hub node (H) in a hierarchical P2P
network 445

11.1 Example Markov Random Field model assumptions, including
full independence (top left), sequential dependence (top
right), full dependence (bottom left), and general dependence
(bottom right) 455
11.2 Graphical model representations of the relevance model
technique (top) and latent concept expansion (bottom) used
for pseudo-relevance feedback with the query "hubble telescope
achievements" 459
11.3 Functions provided by a search engine interacting with a simple
database system 461
11.4 Example of an entity search for organizations using the TREC
Wall Street Journal 1987 Collection 464
11.5 Question answering system architecture 467
11.6 Examples of OCR errors 472
11.7 Examples of speech recognizer errors 473
11.8 Two images (a fish and a flower bed) with color histograms.
The horizontal axis is hue value 474
11.9 Three examples of content-based image retrieval. The collection
for the first two consists of 1,560 images of cars, faces, apes,
and other miscellaneous subjects. The last example is from a
collection of 2,048 trademark images. In each case, the leftmost
image is the query. 475
11.10 Key frames extracted from a TREC video clip 476
11.11 Examples of automatic text annotation of images 477
11.12 Three representations of Bach's "Fugue #10": audio, MIDI, and
conventional music notation 478
List of Tables

1.1 Some dimensions of information retrieval 4

3.1 UTF-8 encoding 51

4.1 Statistics for the AP89 collection 77
4.2 Most frequent 50 words from AP89 78
4.3 Low-frequency words from AP89 78
4.4 Example word frequency ranking 79
4.5 Proportions of words occurring n times in 336,310 documents
from the TREC Volume 3 corpus. The total vocabulary size
(number of unique words) is 508,209 80
4.6 Document frequencies and estimated frequencies for word
combinations (assuming independence) in the GOV2 Web
collection. Collection size (N) is 25,205,179 84
4.7 Examples of errors made by the original Porter stemmer. False
positives are pairs of words that have the same stem. False
negatives are pairs that have different stems 93
4.8 Examples of words with the Arabic root ktb 96
4.9 High-frequency noun phrases from a TREC collection and
U.S. patents from 1996 99
4.10 Statistics for the Google n-gram sample 101

5.1 Four sentences from the Wikipedia entry for tropical fish 13?
5.2 Elias-γ code examples 146
5.3 Elias-δ code examples 147
5.4 Space requirements for numbers encoded in v-byte 149
5.5 Sample encodings for v-byte 149
5.6 Skip lengths (k) and expected processing steps 152

6.1 Partial entry for the Medical Subject (MeSH) Heading "Neck
Pain" 200
6.2 Term association measures 203
6.3 Most strongly associated words for "tropical" in a collection of
TREC news stories. Co-occurrence counts are measured at the
document level 204
6.4 Most strongly associated words for "fish" in a collection of
TREC news stories. Co-occurrence counts are measured at the
document level 205
6.5 Most strongly associated words for "fish" in a collection of
TREC news stories. Co-occurrence counts are measured in
windows of five words 205

7.1 Contingency table of term occurrences for a particular query 248
7.2 BM25 scores for an example document 252
7.3 Query likelihood scores for an example document 260
7.4 Highest-probability terms from relevance model for four
example queries (estimated using top 10 documents) 266
7.5 Highest-probability terms from relevance model for four
example queries (estimated using top 50 documents) 267
7.6 Conditional probabilities for example network 272
7.7 Highest-probability terms from four topics in LDA model 290

8.1 Statistics for three example text collections. The average number
of words per document is calculated without stemming 301
8.2 Statistics for queries from example text collections 301
8.3 Sets of documents defined by a simple search with binary
relevance 309
8.4 Precision values at standard recall levels calculated using
interpolation 317
8.5 Definitions of some important efficiency metrics 323
8.6 Artificial effectiveness data for two retrieval algorithms (A and
B) over 10 queries. The column B - A gives the difference in
effectiveness 328

9.1 A list of kernels that are typically used with SVMs. For each
kernel, the name, value, and implicit dimensionality are given. . . 357

10 1 Example questions submitted to Yahoo! Answers 416


10.2 Translations automatically learned from a set of question and
answer pairs. The 10 most likely translations for the terms
"everest", "xp", and "search" are given 419
10.3 Summary of static and adaptive filtering models. For each, the
profile representation and profile updating algorithm are given 430
10.4 Contingency table for the possible outcomes of a filtering
system. Here, TP (true positive) is the number of relevant
documents retrieved, FN (false negative) is the number of
relevant documents not retrieved, FP (false positive) is the
number of non-relevant documents retrieved, and TN (true
negative) is the number of non-relevant documents not retrieved. 431

11.1 Most likely one- and two-word concepts produced using latent
concept expansion with the top 25 documents retrieved for
the query "hubble telescope achievements" on the TREC
ROBUST collection 460
11.2 Example TREC QA questions and their corresponding
Question categories 469
1
Search Engines and Information
Retrieval
"Mr. Helpmann, I'm keen to get into
Information Retrieval."
Sam Lowry, Brazil

1.1 What Is Information Retrieval?


This book is designed to help people understand search engines, evaluate and
compare them, and modify them for specific applications. Searching for infor-
mation on the Web is, for most people, a daily activity. Search and communi-
cation are by far the most popular uses of the computer. Not surprisingly, many
people in companies and universities are trying to improve search by coming up
with easier and faster ways to find the right information. These people, whether
they call themselves computer scientists, software engineers, information scien-
tists, search engine optimizers, or something else, are working in the field of In-
formation Retrieval.1 So, before we launch into a detailed journey through the
internals of search engines, we will take a few pages to provide a context for the
rest of the book.
Gerard Salton, a pioneer in information retrieval and one of the leading figures
from the 1960s to the 1990s, proposed the following definition in his classic 1968
textbook (Salton, 1968):
Information retrieval is a field concerned with the structure, analysis, or-
ganization, storage, searching, and retrieval of information.
Despite the huge advances in the understanding and technology of search in the
past 40 years, this definition is still appropriate and accurate. The term "information"
is very general, and information retrieval includes work on a wide range of
types of information and a variety of applications related to search.
1 Information retrieval is often abbreviated as IR. In this book, we mostly use the full
term. This has nothing to do with the fact that many people think IR means "infrared"
or something else.
The primary focus of the field since the 1950s has been on text and text docu-
ments. Web pages, email, scholarly papers, books, and news stories are just a few
of the many examples of documents. All of these documents have some amount
of structure, such as the title, author, date, and abstract information associated
with the content of papers that appear in scientific journals. The elements of this
structure are called attributes, or fields, when referring to database records. The
important distinction between a document and a typical database record, such as
a bank account record or a flight reservation, is that most of the information in
the document is in the form of text, which is relatively unstructured.
To illustrate this difference, consider the information contained in two typical
attributes of an account record, the account number and current balance. Both are
very well defined, both in terms of their format (for example, a six-digit integer
for an account number and a real number with two decimal places for balance)
and their meaning. It is very easy to compare values of these attributes, and consequently
it is straightforward to implement algorithms to identify the records that
satisfy queries such as "Find account number 321456" or "Find accounts with
balances greater than $50,000.00".
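The two example queries can be sketched in a few lines of Python. This is only an illustration: the Account class, the sample records, and the function names are assumptions made for the example, not part of any database system. The point is that well-defined attributes reduce each query to a simple comparison.

```python
from dataclasses import dataclass


@dataclass
class Account:
    number: int     # six-digit account number
    balance: float  # current balance in dollars


# Hypothetical sample records, invented for illustration.
accounts = [
    Account(321456, 1200.50),
    Account(654321, 75000.00),
    Account(111222, 50000.00),
]


def find_by_number(records, number):
    """Return the records whose account number equals `number`."""
    return [r for r in records if r.number == number]


def find_balance_over(records, threshold):
    """Return the records whose balance is strictly greater than `threshold`."""
    return [r for r in records if r.balance > threshold]


print(find_by_number(accounts, 321456))
print(find_balance_over(accounts, 50_000.00))
```

Both queries are answered by comparing one attribute value with a constant, which is why no notion of ranking or relevance is needed here.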
Now consider a news story about the merger of two banks. The story will have
some attributes, such as the headline and source of the story, but the primary con-
tent is the story itself. In a database system, this critical piece of information would
typically be stored as a single large attribute with no internal structure. Most of
the queries submitted to a web search engine such as Google2 that relate to this
story will be of the form "bank merger" or "bank takeover". To do this search,
we must design algorithms that can compare the text of the queries with the text
of the story and decide whether the story contains the information that is being
sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news
story is much more difficult than defining an account number, and consequently
comparing text is not easy. Understanding and modeling how people compare
texts, and designing computer algorithms to accurately perform this comparison,
is at the core of information retrieval.
Increasingly, applications of information retrieval involve multimedia docu-
ments with structure, significant text content, and other media. Popular infor-
mation media include pictures, video, and audio, including music and speech. In
some applications, such as in legal support, scanned document images are also
important. These media have content that, like text, is difficult to describe and
compare. The current technology for searching non-text documents relies on text
descriptions of their content rather than the contents themselves, but progress is
being made on techniques for direct comparison of images, for example.
2 http://www.google.com
In addition to a range of media, information retrieval involves a range of tasks
and applications. The usual search scenario involves someone typing in a query to
a search engine and receiving answers in the form of a list of documents in ranked
order. Although searching the World Wide Web (web search) is by far the most
common application involving information retrieval, search is also a crucial part
of applications in corporations, government, and many other domains. Vertical
search is a specialized form of web search where the domain of the search is re-
stricted to a particular topic. Enterprise search involves finding the required infor-
mation in the huge variety of computer files scattered across a corporate intranet.
Web pages are certainly a part of that distributed information store, but most
information will be found in sources such as email, reports, presentations, spread-
sheets, and structured data in corporate databases. Desktop search is the personal
version of enterprise search, where the information sources are the files stored
on an individual computer, including email messages and web pages that have re-
cently been browsed. Peer-to-peer search involves finding information in networks
of nodes or computers without any centralized control. This type of search began
as a file sharing tool for music but can be used in any community based on shared
interests, or even shared locality in the case of mobile devices. Search and related
information retrieval techniques are used for advertising, for intelligence analy-
sis, for scientific discovery, for health care, for customer support, for real estate,
and so on. Any application that involves a collection3 of text or other unstructured
information will need to organize and search that information.
Search based on a user query (sometimes called ad hoc search because the range
of possible queries is huge and not prespecified) is not the only text-based task
that is studied in information retrieval. Other tasks include filtering, classification,
and question answering. Filtering or tracking involves detecting stories of interest
based on a person's interests and providing an alert using email or some other
mechanism. Classification or categorization uses a defined set of labels or classes
(such as the categories listed in the Yahoo! Directory4) and automatically assigns
those labels to documents. Question answering is similar to search but is aimed
at more specific questions, such as "What is the height of Mt. Everest?". The goal
of question answering is to return a specific answer found in the text, rather than
a list of documents. Table 1.1 summarizes some of these aspects or dimensions of
the field of information retrieval.
3 The term database is often used to refer to a collection of either structured or unstructured
data. To avoid confusion, we mostly use the term document collection (or just
collection) for text. However, the terms web database and search engine database are so
common that we occasionally use them in this book.

Examples of Content    Examples of Applications    Examples of Tasks
Text                   Web search                  Ad hoc search
Images                 Vertical search             Filtering
Video                  Enterprise search           Classification
Scanned documents      Desktop search              Question answering
Audio                  Peer-to-peer search
Music

Table 1.1. Some dimensions of information retrieval

1.2 The Big Issues

Information retrieval researchers have focused on a few key issues that remain just
as important in the era of commercial web search engines working with billions
of web pages as they were when tests were done in the 1960s on document col-
lections containing about 1.5 megabytes of text. One of these issues is relevance.
Relevance is a fundamental concept in information retrieval. Loosely speaking, a
relevant document contains the information that a person was looking for when
she submitted a query to the search engine. Although this sounds simple, there are
many factors that go into a person's decision as to whether a particular document
is relevant. These factors must be taken into account when designing algorithms
for comparing text and ranking documents. Simply comparing the text of a query
with the text of a document and looking for an exact match, as might be done in
a database system or using the grep utility in Unix, produces very poor results in
terms of relevance. One obvious reason for this is that language can be used to express
the same concepts in many different ways, often with very different words.
This is referred to as the vocabulary mismatch problem in information retrieval.
4 http://dir.yahoo.com/
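A small sketch makes the weakness of exact matching concrete. The three documents and the exact_match function below are invented for illustration; the function mimics a case-insensitive search for a fixed string, which is roughly what grep would do, and is not how any real search engine matches text.

```python
# Hypothetical documents, invented for illustration.
documents = {
    "d1": "Two regional banks announced a merger on Monday.",
    "d2": "The lender agreed to a takeover by its larger rival.",
    "d3": "Heavy rain is expected across the region this week.",
}


def exact_match(query, docs):
    """Return ids of documents that contain the query as an exact substring,
    ignoring case. No ranking and no notion of topic is involved."""
    q = query.lower()
    return [doc_id for doc_id, text in docs.items() if q in text.lower()]


# The phrase never appears verbatim, although d1 and d2 are both on topic.
print(exact_match("bank merger", documents))
# A single word finds d1 but still misses d2, which says "takeover" instead.
print(exact_match("merger", documents))
```

Document d2 describes the same kind of event as d1 but shares no words with the query "merger", which is the vocabulary mismatch problem in miniature.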
It is also important to distinguish between topical relevance and user relevance.
A text document is topically relevant to a query if it is on the same topic. For ex-
ample, a news story about a tornado in Kansas would be topically relevant to the
query "severe weather events". The person who asked the question (often called
the user) may not consider the story relevant, however, if she has seen that story
before, or if the story is five years old, or if the story is in Chinese from a Chi-
nese news agency. User relevance takes these additional features of the story into
account.
To address the issue of relevance, researchers propose retrieval models and test
how well they work. A retrieval model is a formal representation of the process of
matching a query and a document. It is the basis of the ranking algorithm that is
used in a search engine to produce the ranked list of documents. A good retrieval
model will find documents that are likely to be considered relevant by the person
who submitted the query. Some retrieval models focus on topical relevance, but
a search engine deployed in a real environment must use ranking algorithms that
incorporate user relevance.
An interesting feature of the retrieval models used in information retrieval is
that they typically model the statistical properties of text rather than the linguistic
structure. This means, for example, that the ranking algorithms are typically far
more concerned with the counts of word occurrences than whether the word is a
noun or an adjective. More advanced models do incorporate linguistic features,
but they tend to be of secondary importance. The use of word frequency infor-
mation to represent text started with another information retrieval pioneer, H.P.
Luhn, in the 1950s. This view of text did not become popular in other fields of
computer science, such as natural language processing, until the 1990s.
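As a rough illustration of this statistical view, the sketch below reduces a text to counts of word occurrences and discards everything else, including word order and part of speech. The term_counts function and the sample sentence are assumptions made for the example; real retrieval models use these counts in more elaborate formulas.

```python
import re
from collections import Counter


def term_counts(text):
    """Count occurrences of each lowercased word.
    Nothing about grammar or word order is kept."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


text = "Tropical fish live in tropical water. Fish need clean water."
counts = term_counts(text)

# Print each word with its frequency, most frequent first.
for word, count in counts.most_common():
    print(word, count)
```

Whether "fish" is used as a noun or a verb makes no difference to this representation; only the fact that it occurs twice is recorded.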
Another core issue for information retrieval is evaluation. Since the quality of
a document ranking depends on how well it matches a person's expectations, it
was necessary early on to develop evaluation measures and experimental proce-
dures for acquiring this data and using it to compare ranking algorithms. Cyril
Cleverdon led the way in developing evaluation methods in the early 1960s, and
two of the measures he used, precision and recall, are still popular. Precision is a
very intuitive measure, and is the proportion of retrieved documents that are rel-
evant. Recall is the proportion of relevant documents that are retrieved. When
the recall measure is used, there is an assumption that all the relevant documents
for a given query are known. Such an assumption is clearly problematic in a web
search environment, but with smaller test collections of documents, this measure
can be useful. A test collection5 for information retrieval experiments consists of
a collection of text documents, a sample of typical queries, and a list of relevant
documents for each query (the relevance judgments). The best-known test collections
are those associated with the TREC6 evaluation forum.
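The two definitions can be written down directly as set operations. In the sketch below, the document identifiers and the precision_recall function are hypothetical, and the relevance judgments are assumed to be complete for the query, which is the assumption the recall measure depends on.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from lists of document ids.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were retrieved, so the two measures give different pictures of the same result list.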
Evaluation of retrieval models and search engines is a very active area, with
much of the current focus on using large volumes of log data from user interac-
tions, such as clickthrough data, which records the documents that were clicked
on during a search session. Clickthrough and other log data is strongly correlated
with relevance so it can be used to evaluate search, but search engine companies
still use relevance judgments in addition to log data to ensure the validity of their
results.
The third core issue for information retrieval is the emphasis on users and their
information needs. This should be clear given that the evaluation of search is user-
centered. That is, the users of a search engine are the ultimate judges of quality.
This has led to numerous studies on how people interact with search engines and,
in particular, to the development of techniques to help people express their in-
formation needs. An information need is the underlying cause of the query that
a person submits to a search engine. In contrast to a request to a database system,
such as for the balance of a bank account, text queries are often poor descriptions
of what the user actually wants. A one-word query such as "cats" could be a request
for information on where to buy cats or for a description of the Broadway musi-
cal. Despite their lack of specificity, however, one-word queries are very common
in web search. Techniques such as query suggestion, query expansion, and relevance
feedback use interaction and context to refine the initial query in order to produce
better ranked lists.
These issues will come up throughout this book, and will be discussed in con-
siderably more detail. We now have sufficient background to start talking about
the main product of research in information retrieval—namely, search engines.

1.3 Search Engines

A search engine is the practical application of information retrieval techniques
to large-scale text collections. A web search engine is the obvious example, but as
has been mentioned, search engines can be found in many different applications,
such as desktop search or enterprise search. Search engines have been around for
many years. For example, MEDLINE, the online medical literature search sys-
tem, started in the 1970s. The term "search engine" was originally used to refer
to specialized hardware for text search. From the mid-1980s onward, however, it
gradually came to be used in preference to "information retrieval system" as the
name for the software system that compares queries to documents and produces
ranked result lists of documents. There is much more to a search engine than the
ranking algorithm, of course, and we will discuss the general architecture of these
systems in the next chapter.
5 Also known as an evaluation corpus (plural corpora).
6 Text REtrieval Conference: http://trec.nist.gov/
Search engines come in a number of configurations that reflect the applica-
tions they are designed for. Web search engines, such as Google and Yahoo!,7 must
be able to capture, or crawl, many terabytes of data, and then provide subsecond
response times to millions of queries submitted every day from around the world.
Enterprise search engines—for example, Autonomy8—must be able to process
the large variety of information sources in a company and use company-specific
knowledge as part of search and related tasks, such as data mining. Data mining
refers to the automatic discovery of interesting structure in data and includes tech-
niques such as clustering. Desktop search engines, such as the Microsoft Vista™
search feature, must be able to rapidly incorporate new documents, web pages,
and email as the person creates or looks at them, as well as provide an intuitive
interface for searching this very heterogeneous mix of information. There is over-
lap between these categories with systems such as Google, for example, which is
available in configurations for enterprise and desktop search.
Open source search engines are another important class of systems that have
somewhat different design goals than the commercial search engines. There are a
number of these systems, and the Wikipedia page for information retrieval9 pro-
vides links to many of them. Three systems of particular interest are Lucene,10
Lemur,11 and the system provided with this book, Galago.12 Lucene is a popular
Java-based search engine that has been used for a wide range of commercial ap-
plications. The information retrieval techniques that it uses are relatively simple.
Lemur is an open source toolkit that includes the Indri C++-based search engine.
Lemur has primarily been used by information retrieval researchers to compare
advanced search techniques. Galago is a Java-based search engine that is based on
the Lemur and Indri projects. The assignments in this book make extensive use of
Galago. It is designed to be fast, adaptable, and easy to understand, and incorpo-
rates very effective information retrieval techniques.
7 http://www.yahoo.com
8 http://www.autonomy.com
9 http://en.wikipedia.org/wiki/Information_retrieval
10 http://lucene.apache.org
11 http://www.lemurproject.org
12 http://www.search-engines-book.com
The "big issues" in the design of search engines include the ones identified for
information retrieval: effective ranking algorithms, evaluation, and user interac-
tion. There are, however, a number of additional critical features of search engines
that result from their deployment in large-scale, operational environments. Fore-
most among these features is the performance of the search engine in terms of mea-
sures such as response time, query throughput, and indexing speed. Response time
is the delay between submitting a query and receiving the result list, throughput
measures the number of queries that can be processed in a given time, and index-
ing speed is the rate at which text documents can be transformed into indexes
for searching. An index is a data structure that improves the speed of search. The
design of indexes for search engines is one of the major topics in this book.
Another important performance measure is how fast new data can be incorpo-
rated into the indexes. Search applications typically deal with dynamic, constantly
changing information. Coverage measures how much of the existing information
in, say, a corporate information environment has been indexed and stored in the
search engine, and recency or freshness measures the "age" of the stored informa-
tion.
Search engines can be used with small collections, such as a few hundred emails
and documents on a desktop, or extremely large collections, such as the entire
Web. There may be only a few users of a given application, or many thousands.
Scalability is clearly an important issue for search engine design. Designs that
work for a given application should continue to work as the amount of data and
the number of users grow. In section 1.1, we described how search engines are used
in many applications and for many tasks. To do this, they have to be customizable
or adaptable. This means that many different aspects of the search engine, such as
the ranking algorithm, the interface, or the indexing strategy, must be able to be
tuned and adapted to the requirements of the application.
Practical issues that impact search engine design also occur for specific appli-
cations. The best example of this is spam in web search. Spam is generally thought
of as unwanted email, but more generally it could be defined as misleading, inap-
propriate, or non-relevant information in a document that is designed for some
commercial benefit. There are many kinds of spam, but one type that search en-
gines must deal with is spam words put into a document to cause it to be retrieved
in response to popular queries. The practice of "spamdexing" can significantly de-
grade the quality of a search engine's ranking, and web search engine designers
have to develop techniques to identify the spam and remove those documents.
Figure 1.1 summarizes the major issues involved in search engine design.

Information Retrieval          Search Engines

Relevance                      Performance
  - Effective ranking            - Efficient search and indexing
Evaluation                     Incorporating new data
  - Testing and measuring        - Coverage and freshness
Information needs              Scalability
  - User interaction             - Growing with data and users
                               Adaptability
                                 - Tuning for applications
                               Specific problems
                                 - E.g., spam

Fig. 1.1. Search engine design and the core information retrieval issues

Based on this discussion of the relationship between information retrieval and
search engines, we now consider what roles computer scientists and others play in
the design and use of search engines.

1.4 Search Engineers

Information retrieval research involves the development of mathematical models
of text and language, large-scale experiments with test collections or users, and
a lot of scholarly paper writing. For these reasons, it tends to be done by aca-
demics or people in research laboratories. These people are primarily trained in
computer science, although information science, mathematics, and, occasionally,
social science and computational linguistics are also represented. So who works
with search engines? To a large extent, it is the same sort of people but with a more
practical emphasis. The computing industry has started to use the term search
engineer to describe this type of person. Search engineers are primarily people
trained in computer science, mostly with a systems or database background. Sur-
prisingly few of them have training in information retrieval, which is one of the
major motivations for this book.
What is the role of a search engineer? Certainly the people who work in the
major web search companies designing and implementing new search engines are
search engineers, but the majority of search engineers are the people who modify,
extend, maintain, or tune existing search engines for a wide range of commercial
applications. People who design or "optimize" content for search engines are also
search engineers, as are people who implement techniques to deal with spam. The
search engines that search engineers work with cover the entire range mentioned
in the last section: they primarily use open source and enterprise search engines
for application development, but also get the most out of desktop and web search
engines.
The importance and pervasiveness of search in modern computer applications
has meant that search engineering has become a crucial profession in the com-
puter industry. There are, however, very few courses being taught in computer
science departments that give students an appreciation of the variety of issues that
are involved, especially from the information retrieval perspective. This book is in-
tended to give potential search engineers the understanding and tools they need.

References and Further Reading

In each chapter, we provide some pointers to papers and books that give more
detail on the topics that have been covered. This additional reading should not
be necessary to understand material that has been presented, but instead will give
more background, more depth in some cases, and, for advanced topics, will de-
scribe techniques and research results that are not covered in this book.
The classic references on information retrieval, in our opinion, are the books
by Salton (1968; 1983) and van Rijsbergen (1979). Van Rijsbergen's book remains
popular, since it is available on the Web.13 All three books provide excellent de-
scriptions of the research done in the early years of information retrieval, up to
the late 1970s. Salton's early book was particularly important in terms of defining
the field of information retrieval for computer science. More recent books include
Baeza-Yates and Ribeiro-Neto (1999) and Manning et al. (2008).
13 http://www.dcs.gla.ac.uk/Keith/Preface.html
Research papers on all the topics covered in this book can be found in the
Proceedings of the Association for Computing Machinery (ACM) Special In-
terest Group on Information Retrieval (SIGIR) Conference. These proceedings
are available on the Web as part of the ACM Digital Library.14 Good papers
on information retrieval and search also appear in the European Conference
on Information Retrieval (ECIR), the Conference on Information and Knowl-
edge Management (CIKM), and the Web Search and Data Mining Conference
(WSDM). The WSDM conference is a spin-off of the World Wide Web Confer-
ence (WWW), which has included some important papers on web search. The
proceedings from the TREC workshops are available online and contain useful
descriptions of new research techniques from many different academic and indus-
try groups. An overview of the TREC experiments can be found in Voorhees and
Harman (2005). An increasing number of search-related papers are beginning to
appear in database conferences, such as VLDB and SIGMOD. Occasional papers
also show up in language technology conferences, such as ACL and HLT (As-
sociation for Computational Linguistics and Human Language Technologies),
machine learning conferences, and others.

Exercises

1.1. Think up and write down a small number of queries for a web search engine.
Make sure that the queries vary in length (i.e., they are not all one word). Try
to specify exactly what information you are looking for in some of the queries.
Run these queries on two commercial web search engines and compare the top
10 results for each query by doing relevance judgments. Write a report that an-
swers at least the following questions: What is the precision of the results? What
is the overlap between the results for the two search engines? Is one search engine
clearly better than the other? If so, by how much? How do short queries perform
compared to long queries?

1.2. Site search is another common application of search engines. In this case,
search is restricted to the web pages at a given website. Compare site search to
web search, vertical search, and enterprise search.
14 http://www.acm.org/dl

1.3. Use the Web to find as many examples as you can of open source search en-
gines, information retrieval systems, or related technology. Give a brief descrip-
tion of each search engine and summarize the similarities and differences between
them.

1.4. List five web services or sites that you use that appear to use search, not includ-
ing web search engines. Describe the role of search for that service. Also describe
whether the search is based on a database or grep style of matching, or if the search
is using some type of ranking.
2
Architecture of a Search Engine
"While your first question may be the most per-
tinent, you may or may not realize it is also the
most irrelevant."
The Architect, Matrix Reloaded

2.1 What Is an Architecture?

In this chapter, we describe the basic software architecture of a search engine. Al-
though there is no universal agreement on the definition, a software architecture
generally consists of software components, the interfaces provided by those com-
ponents, and the relationships between them. An architecture is used to describe
a system at a particular level of abstraction. An example of an architecture used to
provide a standard for integrating search and related language technology compo-
nents is UIMA (Unstructured Information Management Architecture).1 UIMA
defines interfaces for components in order to simplify the addition of new tech-
nologies into systems that handle text and other unstructured data.
Our search engine architecture is used to present high-level descriptions of
the important components of the system and the relationships between them. It
is not a code-level description, although some of the components do correspond
to software modules in the Galago search engine and other systems. We use this
architecture in this chapter and throughout the book to provide context to the
discussion of specific techniques.
An architecture is designed to ensure that a system will satisfy the application
requirements or goals. The two primary goals of a search engine are:
• Effectiveness (quality): We want to be able to retrieve the most relevant set of
documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possi-
ble.
1 http://www.research.ibm.com/UIMA

We may have more specific goals, too, but usually these fall into the categories
of effectiveness or efficiency (or both). For instance, the collection of documents
we want to search may be changing; making sure that the search engine immedi-
ately reacts to changes in documents is both an effectiveness issue and an efficiency
issue.
The architecture of a search engine is determined by these two requirements.
Because we want an efficient system, search engines employ specialized data struc-
tures that are optimized for fast retrieval. Because we want high-quality results,
search engines carefully process text and store text statistics that help improve the
relevance of results.
Many of the components we discuss in the following sections have been used
for decades, and this general design has been shown to be a useful compromise
between the competing goals of effective and efficient retrieval. In later chapters,
we will discuss these components in more detail.

2.2 Basic Building Blocks

Search engine components support two major functions, which we call the index-
ing process and the query process. The indexing process builds the structures that
enable searching, and the query process uses those structures and a person's query
to produce a ranked list of documents. Figure 2.1 shows the high-level "building
blocks" of the indexing process. These major components are text acquisition, text
transformation, and index creation.
The task of the text acquisition component is to identify and make available
the documents that will be searched. Although in some cases this will involve sim-
ply using an existing collection, text acquisition will more often require building
a collection by crawling or scanning the Web, a corporate intranet, a desktop, or
other sources of information. In addition to passing documents to the next com-
ponent in the indexing process, the text acquisition component creates a docu-
ment data store, which contains the text and metadata for all the documents.
Metadata is information about a document that is not part of the text content,
such as the document type (e.g., email or web page), document structure, and other
features, such as document length.
The text transformation component transforms documents into index terms
or features. Index terms, as the name implies, are the parts of a document that
are stored in the index and used in searching. The simplest index term is a word,
but not every word may be used for searching. A "feature" is more often used in
the field of machine learning to refer to a part of a text document that is used to
represent its content, which also describes an index term. Examples of other types
of index terms or features are phrases, names of people, dates, and links in a web
page. Index terms are sometimes simply referred to as "terms." The set of all the
terms that are indexed for a document collection is called the index vocabulary.

Fig. 2.1. The indexing process

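The step from document text to index terms can be sketched in a few lines. This is only an illustration under simple assumptions: terms are lowercase alphanumeric words, and the small `STOPWORDS` set and the function names `index_terms` and `index_vocabulary` are hypothetical, not part of any system described here.

```python
import re

# Hypothetical stopword list for illustration; real lists are longer and tuned.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def index_terms(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def index_vocabulary(documents):
    """Return the set of all index terms found in a collection of texts."""
    vocab = set()
    for text in documents:
        vocab.update(index_terms(text))
    return vocab

docs = ["The architecture of a search engine", "Search engines rank documents"]
terms = index_terms(docs[0])
vocab = index_vocabulary(docs)
```

Note that "engine" and "engines" end up as separate terms here; whether they should be merged is a text transformation decision this sketch does not make.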
The index creation component takes the output of the text transformation
component and creates the indexes or data structures that enable fast searching.
Given the large number of documents in many search applications, index creation
must be efficient, both in terms of time and space. Indexes must also be able to be
efficiently updated when new documents are acquired. Inverted indexes, or some-
times inverted files, are by far the most common form of index used by search
engines. An inverted index, very simply, contains a list for every index term of the
documents that contain that index term. It is inverted in the sense of being the
opposite of a document file that lists, for every document, the index terms it
contains. There are many variations of inverted indexes, and the particular form of
index used is one of the most important aspects of a search engine.
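A minimal in-memory sketch of the inversion described above may help. It assumes each document has already been reduced to a list of index terms; the function name `build_inverted_index` and the sample data are illustrative, and real indexes add compression, term positions, and on-disk layouts that are omitted here.

```python
from collections import defaultdict

def build_inverted_index(doc_terms):
    """Map each index term to the sorted list of documents that contain it.

    doc_terms maps a document id to that document's list of index terms,
    i.e. the "document file" view that the inverted index turns around.
    """
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    # Sorted lists keep the output deterministic and easy to merge.
    return {term: sorted(ids) for term, ids in index.items()}

doc_terms = {
    1: ["search", "engine", "architecture"],
    2: ["search", "ranking"],
    3: ["index", "engine"],
}
inverted = build_inverted_index(doc_terms)
```

Looking up a term is then a single dictionary access, which is why this structure suits fast retrieval.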
Figure 2.2 shows the building blocks of the query process. The major compo-
nents are user interaction, ranking, and evaluation.
The user interaction component provides the interface between the person
doing the searching and the search engine. One task for this component is accept-
ing the user's query and transforming it into index terms. Another task is to take
the ranked list of documents from the search engine and organize it into the
results shown to the user. This includes, for example, generating the snippets used to
summarize documents. The document data store is one of the sources of informa-
tion used in generating the results. Finally, this component also provides a range
of techniques for refining the query so that it better represents the information
need.

Fig. 2.2. The query process

The ranking component is the core of the search engine. It takes the trans-
formed query from the user interaction component and generates a ranked list of
documents using scores based on a retrieval model. Ranking must be both effi-
cient, since many queries may need to be processed in a short time, and effective,
since the quality of the ranking determines whether the search engine accom-
plishes the goal of finding relevant information. The efficiency of ranking depends
on the indexes, and the effectiveness depends on the retrieval model.
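To make the idea of score-based ranking concrete, the following sketch scores each document by how often the query terms occur in it. This raw term-count score is an assumption chosen for brevity, not one of the retrieval models discussed in this book, and the function name `rank` is hypothetical.

```python
from collections import Counter

def rank(query_terms, doc_terms):
    """Return (doc_id, score) pairs, best first, for documents matching the query.

    The score is the total number of occurrences of the query terms in the
    document: a deliberately naive stand-in for a real retrieval model.
    """
    scores = {}
    for doc_id, terms in doc_terms.items():
        counts = Counter(terms)
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    1: ["search", "engine", "search"],
    2: ["search"],
    3: ["index"],
}
ranked = rank(["search", "engine"], docs)
```

This version scans every document for every query, which is exactly the inefficiency that an index is meant to remove.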
The task of the evaluation component is to measure and monitor effectiveness
and efficiency. An important part of that is to record and analyze user behavior
using log data. The results of evaluation are used to tune and improve the ranking
component. Most of the evaluation component is not part of the online search
engine, apart from logging user and system data. Evaluation is primarily an offline
activity, but it is a critical part of any search application.
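As a rough illustration of offline analysis of log data, the sketch below summarizes a query log. The record layout (`latency_ms`, `clicked_rank`) and the function name `summarize_log` are assumptions made for this example; mean latency stands in for an efficiency measure and click-through rate for a crude effectiveness signal.

```python
def summarize_log(records):
    """Compute simple efficiency and effectiveness summaries from a query log.

    Each record is a dict with a 'latency_ms' number and a 'clicked_rank'
    that is None when the user clicked on no result (hypothetical format).
    """
    n = len(records)
    if n == 0:
        return {"queries": 0, "mean_latency_ms": 0.0, "click_through_rate": 0.0}
    mean_latency = sum(r["latency_ms"] for r in records) / n
    clicked = sum(1 for r in records if r["clicked_rank"] is not None)
    return {
        "queries": n,
        "mean_latency_ms": mean_latency,
        "click_through_rate": clicked / n,
    }

log = [
    {"query": "tropical fish", "latency_ms": 100, "clicked_rank": 1},
    {"query": "search engine", "latency_ms": 200, "clicked_rank": None},
    {"query": "inverted index", "latency_ms": 300, "clicked_rank": 3},
]
summary = summarize_log(log)
```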

2.3 Breaking It Down

We now look in more detail at the components of each of the basic building
blocks. Not all of these components will be part of every search engine, but to-
gether they cover what we consider to be the most important functions for a broad
range of search applications.

2.3.1 Text Acquisition

Crawler

In many applications, the crawler component has the primary responsibility for
identifying and acquiring documents for the search engine. There are a number of
different types of crawlers, but the most common is the general web crawler. A web
crawler is designed to follow the links on web pages to discover and download new
pages. Although this sounds deceptively simple, there are significant challenges in
designing a web crawler that can efficiently handle the huge volume of new pages
on the Web, while at the same time ensuring that pages that may have changed
since the last time a crawler visited a site are kept "fresh" for the search engine. A
web crawler can be restricted to a single site, such as a university, as the basis for
site search. Focused, or topical, web crawlers use classification techniques to restrict
the pages that are visited to those that are likely to be about a specific topic. This
type of crawler may be used by a vertical or topical search application, such as a
search engine that provides access to medical information on web pages.
For enterprise search, the crawler is adapted to discover and update all docu-
ments and web pages related to a company's operation. An enterprise document
crawler follows links to discover both external and internal (i.e., restricted to the
corporate intranet) pages, but also must scan both corporate and personal di-
rectories to identify email, word processing documents, presentations, database
records, and other company information. Document crawlers are also used for
desktop search, although in this case only the user's personal directories need to
be scanned.
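The link-following behavior of a crawler can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "Web" here is a small dictionary and `fetch_links` is a caller-supplied function; both are assumptions for illustration. Freshness checks, politeness delays, and topical filtering are left out.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from a seed, following links to new pages.

    fetch_links(url) returns the list of links found on that page.
    """
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # pages already discovered, to avoid revisits
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# A toy link graph standing in for real pages (hypothetical data).
WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl("a", lambda url: WEB.get(url, []))
```

Restricting the crawl to a single site would amount to filtering the links returned by `fetch_links` before they enter the frontier.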

Feeds

Document feeds are a mechanism for accessing a real-time stream of documents.
For example, a news feed is a constant stream of news stories and updates. In con-
trast to a crawler, which must discover new documents, a search engine acquires
new documents from a feed simply by monitoring it. RSS² is a common standard
used for web feeds for content such as news, blogs, or video. An RSS "reader"
is used to subscribe to RSS feeds, which are formatted using XML.³ XML is a
language for describing data formats, similar to HTML.⁴ The reader monitors
those feeds and provides new content when it arrives. Radio and television feeds
are also used in some search applications, where the "documents" contain auto-
matically segmented audio and video streams, together with associated text from
closed captions or speech recognition.
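Monitoring a feed can be sketched as repeatedly parsing it and keeping only the items not seen before. The example uses Python's standard `xml.etree.ElementTree` on an inline sample; the feed content, the function name `new_items`, and the choice to identify items by `guid` (falling back to `link`) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document standing in for a fetched feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Story one</title><link>http://example.com/1</link><guid>1</guid></item>
    <item><title>Story two</title><link>http://example.com/2</link><guid>2</guid></item>
  </channel>
</rss>"""

def new_items(feed_xml, seen_guids):
    """Return feed items not seen before and record them in seen_guids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append({
                "title": item.findtext("title"),
                "link": item.findtext("link"),
            })
    return fresh

seen = set()
first_poll = new_items(SAMPLE_FEED, seen)   # both stories are new
second_poll = new_items(SAMPLE_FEED, seen)  # nothing new on the second poll
```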

Conversion

The documents found by a crawler or provided by a feed are rarely in plain text.
Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF,
Microsoft Word™, Microsoft PowerPoint®, and so on. Most search engines require
that these documents be converted into a consistent text plus metadata format.
In this conversion, the control sequences and non-content data associated with
a particular format are either removed or recorded as metadata. In the case of
HTML and XML, much of this process can be described as part of the text trans-
formation component. For other formats, the conversion process is a basic step
that prepares the document for further processing. PDF documents, for example,
must be converted to text. Various utilities are available that perform this conver-
sion, with varying degrees of accuracy. Similarly, utilities are available to convert
the various Microsoft Office® formats into text.
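As a small illustration of producing a "text plus metadata" form, the sketch below converts HTML using Python's standard `html.parser`. HTML is used only because it needs no third-party utilities; the class name `TextExtractor`, the function `convert_html`, and the choice of title and links as metadata are assumptions for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, and record the title and link targets as metadata."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.metadata = {"title": "", "links": []}
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.metadata["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # non-content data is dropped
        if self._in_title:
            self.metadata["title"] += data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())

def convert_html(html):
    """Return (plain_text, metadata) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.metadata

page = ("<html><head><title>Home</title><style>p{}</style></head>"
        "<body><p>Hello <a href='/x'>world</a></p></body></html>")
text, metadata = convert_html(page)
```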
Another common conversion problem comes from the way text is encoded in a
document. ASCII⁵ is a common standard single-byte character encoding scheme
used for text. ASCII uses either 7 or 8 bits (extended ASCII) to represent either
128 or 256 possible characters. Some languages, however, such as Chinese, have
many more characters than English and use a number of other encoding schemes.
Unicode is a standard encoding scheme that uses 16 bits (typically) to represent
most of the world's languages. Any application that deals with documents in dif-
ferent languages has to ensure that they are converted into a consistent encoding
scheme before further processing.
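Converting documents to one consistent encoding can be sketched as decoding with each document's declared encoding and re-encoding in a common target. The example assumes the source encoding is known in advance (detecting it is a separate problem) and picks UTF-8, one of the Unicode encodings, as the target; the function name `to_utf8` and the sample strings are illustrative.

```python
def to_utf8(raw, declared_encoding):
    """Decode bytes using the declared encoding and re-encode them as UTF-8."""
    text = raw.decode(declared_encoding)
    return text.encode("utf-8")

# The same kind of content arriving in three different encodings.
samples = [
    ("search".encode("ascii"), "ascii"),
    ("检索".encode("gb2312"), "gb2312"),       # simplified Chinese
    ("検索".encode("shift_jis"), "shift_jis"),  # Japanese
]
normalized = [to_utf8(raw, enc) for raw, enc in samples]

# One ASCII character fits in a single byte, while this Chinese character
# needs two bytes in UTF-16 and three in UTF-8.
ascii_len = len("s".encode("ascii"))
utf16_len = len("检".encode("utf-16-le"))
utf8_len = len("检".encode("utf-8"))
```

After this step every document can be handled by the same downstream text processing, regardless of how it was originally stored.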
2 RSS actually refers to a family of standards with similar names (and the same initials),
such as Really Simple Syndication or Rich Site Summary.
3 Extensible Markup Language
4 HyperText Markup Language
5 American Standard Code for Information Interchange
With an exquisite dexterity of address the Countess contrived to
introduce an allusion to the situation of Penelope Primrose; and as
neither the young lady nor her aunt was in full possession of the
circumstances in which Mr Primrose was at that time, they both had
the impression on their minds that there was no other immediate
prospect for his daughter than the exertion of her own talents and
acquirements to provide her with the means of support. The worthy
rector had not as yet been long enough in the grave to give
Penelope an opportunity of feeling the difference of Mrs Greendale’s
manner towards her; but she had penetration enough to foresee
what must be her situation so long as she remained under the same
roof as her aunt. With the utmost readiness did she therefore listen
to the Countess, when speaking of the various employments to
which a young person situated as she was might turn her attention.
“Lord Smatterton,” said the Countess, “has frequently mentioned
the subject to me, and he recommends a situation in a private family.
There are certainly some advantages and some disadvantages in
such a situation: very much depends upon the temper and
disposition of almost every individual in the family. It is possible that
you may meet with a family consisting of reasonable beings, but it is
more than probable that you may have to encounter arrogance or
ignorance; these are not excluded from any rank.”
This language seemed to Penelope as an intimation that a school
would be a more desirable sphere in which to make profitable use of
her acquisitions. It was not for her to oppose any objections to the
implied recommendations of so good and so great a friend as her
ladyship; but she felt considerable reluctance to that kind of
employment, which she fancied had been suggested. Her reply was
embarrassed but respectful, intimating that she was ready to adopt
any mode of employment which the Countess might be pleased to
suggest. Her ladyship gave a smile of approbation to the
acquiescent disposition which the young lady manifested, and
added:
“If Miss Primrose could conquer a little feeling of timidity, which
might naturally enough be experienced by one so retired in her
habits, it would be possible for her, with her great vocal powers and
musical talent, not only to find means of maintenance, but to arrive at
a competent independence, by adopting the musical profession.
Then she would also enjoy the pleasure of good society. If such
arrangement be agreeable, I will most willingly charge myself with
providing the preparatory instruction under a distinguished professor.
What does my young friend think of such occupation?”
Had sincerity been the readiest road to the patronage and
friendship of the great, this question might have been very readily
and easily answered. But Penelope knew better than to suppose that
any advantage could arise from a direct opposition to the wishes of a
patron. Repugnant as she was to the proposal, she dared not to
whisper the least syllable of contradiction, on the ground of dislike, to
the profession; but after a blush of mortification, which the Countess
mistook for a symptom of diffidence, she replied:
“I fear that your ladyship is disposed to estimate rather too highly
the humble talents I may possess, and that I shall not answer the
expectations which so distinguished patronage might raise.”
The Countess was not altogether pleased with this shadow of an
objection; for it seemed to call in question her own discernment. She
therefore replied with some quickness:
“I beg your pardon, Miss Primrose: I have usually been considered
as something of a judge in these matters; and, if I do not greatly
mistake, you are peculiarly qualified for the profession; and, if you
would condescend to adopt my recommendation, I will be
answerable for its success.”
The Countess, with all her kindness and considerateness, had not
the slightest idea that there could be in a young person, situated as
Penelope, any feeling of pride or thought of degradation. But pride
was in being before titles were invented; and even republics, which,
in the arrogance of equality, may repel from their political vocabulary
all distinctions of fellow citizens, cannot eradicate pride from the
human heart. In a civilized country there is not perhaps an individual
to be found who is incapable of the sensation of degradation. Miss
Primrose thought it degrading to become a public singer; she felt that
it would be publishing to the world that she was not independent.
The world cares little about such matters. Right or wrong, however,
this feeling took possession of the young lady’s mind; and as pride
does not enter the mind by means of reasoning, it will not be
expelled by any process of ratiocination. For all this, however, the
worthy Countess could make no allowance; and it appeared to her
that if a young person were under the necessity of serving her
superiors in rank for the sake of maintenance, it signified very little
what mode of servitude were applied to.
There was also another consideration which weighed not a little
with the Countess, in almost insisting upon Miss Primrose’s adopting
the musical profession. Her ladyship was a distinguished patroness,
and a most excellent judge of musical talent; and there was a rival
patroness who had never yet been able to produce, under her
auspices, anything at all equal to Penelope Primrose. The
mortification or defeat of a rival is a matter of great moment to minds
of every description. Whenever there is the weakness of rivalry there
must be of necessity also the vanity of triumph, and to that
occasionally much will be sacrificed.
Mrs Greendale, who was present at this discussion, sided most
cordially with the Countess; but had the proposal come from any
other quarter, in all human probability it would have been resented
as an indignity. Penelope was also well aware that it was absolutely
necessary that she should leave the asylum in which so many of her
few days had been spent, and she therefore, with as good a grace
as her feelings permitted, gave assent to the proposal which the
Countess had made. And thereby her generous patroness was
softened.
The discussion of this question occupied no inconsiderable portion
of time, though we have not thought it necessary to repeat at length
the very common-place dialogue which passed on the subject. Our
readers must have very languid imaginations if they cannot supply
the omission for themselves. Suffice it to say, that the arguments
used by the Countess of Smatterton were much stronger than the
objections which arose in the mind of Penelope Primrose; and the
consideration of these arguments, backed by the reflection that she
had no other immediately available resource, determined the
dependent one to acquiesce in that which her soul abhorred. It was
all very true, as the amiable Countess observed, that an occupation
which introduced the person so employed to the notice and into the
saloons of the nobility, could not be essentially degrading; it was also
very true that there could be no moral objection to a profession
which had been ornamented by some of the purest and most
virtuous characters. All this was very true; but notwithstanding this
and much more than this which was urged by the Countess, still
Penelope did not like it. There is no accounting for tastes.
Some young ladies there are who think that, if they should be
situated as Penelope was, they would not suffer any inducement to
lead them to a compliance with such a proposal. They imagine that
no earthly consideration whatever should compel them to that which
they abhorred or disapproved. They cannot think that Penelope
deserved the title of heroine, if she could thus easily surrender her
judgment and bend her will to the dictation of a patroness. But let
these young ladies be informed, that in this compliance lay no small
portion of the heroism of Penelope’s character. She gained a victory
over herself; she did not gratify a pert self-will at the expense of
propriety and decorum, and she had no inclination to play the part of
a Quixote.
It is an easy thing for a young man to set himself up as
independent. The world with all its various occupations is before him.
He may engage in as many freaks as suit his fancy; he may dwell
and live where and how he pleases; but the case is widely different
with a young woman delicately brought up, respectably connected,
and desirous of retaining a respectable condition and the
countenance of her friends. She is truly dependent, and must
oftentimes sacrifice her judgment and feelings to avoid more serious
and important sacrifices.
Penelope used to talk about dependence while under the roof of
her benevolent and kind-hearted relative, now no more. But she felt
it not then, as she felt it when her uncle had departed from life. Then
it was merely a name, now it became a reality.
When the Countess had prevailed upon Penelope to give her
assent to the proposal of publicly displaying her musical talents, her
ladyship was in exceeding good humour; and when a lady of high
rank is in good humour, her condescension, her affability, her wit, her
wisdom, and whatever she pleases to assume or affect of the
agreeable and praiseworthy, are infinitely above all language of
commendation to such a person as Mrs Greendale. The widow
therefore was quite charmed with the exquisitely lady-like manners
of the Countess, astonished at her great good sense; and, had the
Countess requested it, Mrs Greendale herself would have become a
public singer.
While this negociation was going on at the castle at Smatterton,
another discussion concerning Penelope was passing at the rectory
at Neverden.
“Well, papa,” said Miss Darnley, “I took particular notice of
Penelope Primrose yesterday, and purposely mentioned the name of
Lord Spoonbill, to see whether it would produce any emotion, and I
did not observe anything that led me to suppose what you suspect.”
“Very likely, my child, you could not discern it. That was not a time
for the expression of any such feelings. Her thoughts were then
otherwise engaged. But I can say that, from what I have observed, I
have no reason whatever to doubt that her affections are not as they
were with respect to your brother. You know that Robert wrote to her
by the same conveyance which brought us a letter, and although I
gave every opportunity and hint I could to that purpose, Miss
Primrose did not mention having heard.”
“But, my dear papa,” replied Miss Darnley, still unwilling to think
unfavourably of so valued a friend as Penelope, “might not her
thoughts be otherwise engaged at the time, when you visited her; for
you recollect that your call was much sooner after Dr Greendale’s
death than our’s was.”
Mr Darnley smiled with a look of incredulity, and said, “You are
very charitable in your judgment, my dear, but I think in this instance
you extend your candour rather too far. I did not only observe
symptoms of alienation, but had, I tell you, almost a proof of the fact.
I went so far as to allude to her engagement and to offer our house
as an asylum; and her reply was, that she would be at the direction
of Lady Smatterton. Whether she be vain and conceited enough to
aspire to Lord Spoonbill’s hand, I will not pretend to say, but I am
abundantly convinced that she does not regard your brother with the
same affection that she did some time ago; and there certainly have
been symptoms to that effect in the course of her correspondence,
or Robert would never have used such language, or made such
enquiries as he has in his last letter. And I think it would be but an act
of kindness, or even of justice, to let your brother know what are our
suspicions.”
Now Mary Darnley, who was rather inclined to be blue-stockingish,
and had of course, a mighty admiration for wisdom, and learning,
and science, thought it not unlikely that if Penelope had changed her
mind, and transferred her affections to another, that other was more
likely to be Mr Kipperson than Lord Spoonbill. For, she reasoned, it
was not probable that a young woman so brought up as Penelope
had been, should be at all pleased with a character so profligate as
Lord Spoonbill was generally supposed to be. Then Mr Kipperson,
though he was double Penelope’s age, yet was a very agreeable
man, and far superior to the common run of farmers; and he was a
man of very extensive information and of great reading. The
reasoning then went on very consequentially to prove, that as
Penelope loved reading, and as Mr Kipperson loved reading,
therefore Penelope must love Mr Kipperson. This perhaps was not
the best kind of reasoning in the world, yet it might do in default of a
better to support a theory.
The truth of the matter is, that Miss Mary Darnley herself was a
little disposed to admire Mr Kipperson, in virtue of his literary and
scientific character; and the truth also is, that Mr Kipperson had
really manifested symptoms of admiration towards Penelope
Primrose; and last, but not least, is the truth, that Miss Mary Darnley
was somewhat inclined to be jealous of the attention which the
literary and scientific Mr Kipperson had recently paid to Miss
Primrose.
This theory of Miss Mary Darnley seemed the most plausible, and
it was therefore adopted by her mother and sisters, and by them it
was unanimously concluded that Penelope was not unfavourable to
the suit of Mr Kipperson; and then they thought that the young lady
had behaved, or was behaving very ill to their brother; and then they
thought that their brother might do much better for himself; and then
they thought that Mr Kipperson was at least fifty, though till then it
had been the common opinion that he was but forty; and then they
thought that no dependence could be placed on any one; and then
they made many wise remarks on the unexpectedness of human
events, not considering that the experience of millions, and the
events of centuries, have conspired to shew that events take any
other direction than that which is expected. Ann Darnley was sorry
for it, Martha laughed at it, and Mary was angry with it.
As for Mr Darnley himself, he was not much moved; but he could
not admit of the idea that he was wrong in his conjecture that Miss
Primrose was partial to Lord Spoonbill, therefore he could not see
the force of the reasoning which went to prove, that the transfer of
Penelope’s affections was not from Robert Darnley to Lord
Spoonbill, but to Mr Kipperson.
“Beside,” said Mr Darnley, “is it likely that a young woman of such
high notions as Miss Primrose should think of accepting an offer
from Mr Kipperson, who, though he is a man of property and of
literary taste, is still but a farmer, or agriculturist. It is far more likely
that the vanity of the young lady should fix her hopes on Lord
Spoonbill, especially if his lordship has paid her, as is not unlikely,
very marked attentions.”
Although in the family at the rectory of Neverden there was
diversity of opinion as to the person on whom Miss Primrose had
placed her affections, there was at least unanimity in the feeling and
expression of disapprobation. And, in pursuance of this feeling, there
was a diminution, and indeed nearly a cessation of intercourse
between the parties. Many days passed away, and no message and
no visitor from Neverden arrived at Smatterton.
This was deeply and painfully felt by Penelope, and the more so
as it was absolutely impossible for her to ask an explanation. Indeed,
she concluded that no explanation was wanting; the fact that no
letter had been received for so long time, and the circumstance of
the coldness and change in the manners of the young ladies at
Neverden, were sufficient manifestations to Penelope that, for some
cause or other, there was a change in the mind of Robert Darnley
towards her. Then in addition to these things was the reflection, that
she had allowed herself to be persuaded contrary to her own
judgment to adopt the profession of music as a public singer, or at
least as a hired performer. Thus, in a very short time, she was
plunged from the height of hope to the depth of despair. A little while
ago she had been taught to entertain expectations of her father’s
return to England in a state of independence; she had also reason to
hope that, in the lapse of a few months, there might come from a
distant land one for whom she did entertain a high esteem, and who
should become her guardian, and guide, and companion through life.
A little while ago also, she had in the society and sympathy of her
worthy and benevolent uncle, Dr Greendale, a refuge from the
storms of life, and some consolation to enable her to bear up aright
under the pressure of life’s evils, its doubts and its fears. All these
hopes were now vanished and dispersed, and she left to the mercy
of a rude world. Her best benefactor was in his grave, and those very
agreeable and pleasant companions in whom he confided as in
relatives, and more than sisters, they also had deserted her. It
required a great effort of mind to bear up under these calamities. Her
mind however had been habituated to exertion, and it had gained
strength from the efforts which it had formerly made; but still her
constitution was not stoical; she had strong and deep feelings. It was
with some considerable effort that she did not yield so far to the
pressure of present circumstances as to lose all elasticity of mind
and to relinquish all love of life. And pity itself need not seek and
cannot find an object more worthy of its tears than one living, who
has lost all relish for life, and ceased to enjoy its brightness or to
dread its darkness.
CHAPTER X.
Some few weeks after Penelope had given her consent to the
arrangement suggested by the Countess of Smatterton, the family at
the castle took their departure for London. Her ladyship did not forget
her promise of providing Miss Primrose with the means of cultivating
and improving her natural talents; but, in a very few days after
arriving in town, negociations were entered into and concluded with
an eminent professor to take under his tuition a young lady
patronized by the Countess of Smatterton.
Great compliments of course were paid to the judgment of the
Countess, and high expectations were raised of the skill and power
of this new vocal prodigy; for countesses never patronize anything
but prodigies, and if the objects of their patronage be not prodigies
by nature, they are very soon made so by art and fashion.
Now the Countess of Smatterton was really a good judge of
musical excellence; her taste was natural, not acquired or affected
as a medium of notoriety, or a stimulus for languid interest in life’s
movements. And when her ladyship had a musical party, which was
indeed not unfrequently, there was not one individual of the whole
assemblage more really and truly delighted with the performances
than herself, and few perhaps were better able to appreciate their
excellence.
At this time but few families were in town, and the winter
assortment of lions, and prodigies, and rages, was not formed or
arranged. Lady Smatterton would have been best pleased to have
burst upon the assembled and astonished world at once with her
new human toy. But the good lady was impatient. She wished to
enjoy as soon as possible the pleasure of exhibiting to her friends
and neighbours and rivals the wonderful talents of Penelope
Primrose. As soon therefore as arrangements could be made with
the professor who was destined to be the instructor of Miss
Primrose, a letter was despatched to Smatterton, desiring the young
lady to make as much haste as possible to town.
This was indeed a sad and painful trial to Penelope. Little did she
think that the plan was so soon to be put in force to which she had
given her reluctant assent. It seemed inconsiderate in her ladyship to
remove Penelope from Mrs Greendale so very soon; not that the
young lady had any very great reluctance to part from Mrs
Greendale; but as she had some reluctance to make the journey to
London for the object which was in view, she felt rather more than
otherwise she would have done the inconvenience to which it
necessarily put her aunt. Having therefore shewn Lady Smatterton’s
letter to the widow, she expressed her concern that the Countess
should be so very hasty in removing her, and said, that if her aunt
wished it she would take the liberty of writing to her ladyship,
requesting a little longer indulgence, that she might render any
assistance which might be needed under present circumstances.
Some persons there are who never will and who never can be
pleased: Mrs Greendale was one of them. Instead of thanking
Penelope for her considerate and kind proposal, her answer was:
“Indeed, Miss Primrose, I think you would be acting very
improperly to question Lady Smatterton’s commands. I know not
who is to provide for you, if you thus turn your back upon your best
friends. I can assure you I have no great need of any of your
assistance, which I dare say you would not be so ready to offer if it
did not suit your own convenience.”
To repeat much such language as this would be wearisome.
Suffice it to say, that there was no form of expression which
Penelope could use, nor any line of conduct which she could
propose, which Mrs Greendale was not ingenious enough to carp at
and object to. It may then be easily imagined that the situation of our
heroine was not much to be envied; nor will it be supposed that she
felt any great reluctance to leave such a companion and friend as
this. With the best grace imaginable, therefore, did Penelope prepare
for yielding obedience to Lady Smatterton’s commands; but it was
still with a heavy heart that she made preparation for her journey.
Before her departure it was absolutely and indispensably
necessary that she should go through the ceremony of taking leave
of her friends. Of several persons, whose names are not here
recorded, Penelope Primrose took leave, with expressions of mutual
regret. There was however no embarrassment and no difficulty in
these cases. When, however, she prepared to take leave of her
friends at Neverden, the case was widely different. Then arose much
perplexity, and then her heart felt such a bitter pang. It was probable
that this would be a final leave. The Darnleys never visited London,
or at least not above once in twenty years. They had recently looked
coldly upon her, and had partially neglected her. It was contrary to
their general practice to act capriciously; there certainly must be a
motive for their behaviour, and what could that motive be but a
change in the intentions of Robert Darnley with respect to herself.
The ground of that change she was at a loss to determine. At all
events she must call and take leave of them.
In pursuance of this determination, Penelope Primrose took, not
the earliest, but the latest opportunity of calling upon Mr Darnley and
the family at Neverden rectory; for it would not be very pleasant to
remain any time in the neighbourhood after a cool and unfriendly
separation from those with whom so many of her pleasantest hours
had been spent, and with whose idea so many of her hopes had
been blended. When she called, the whole family was at home. Her
reception was by no means decidedly unkind, or artificially polite.
There was always indeed a degree of stateliness in the manner of
Mr Darnley, and that stateliness did not appear any less than usual,
nor did it appear quite so tolerable as on former days and on former
occasions.
In the young ladies, notwithstanding their general good sense and
most excellent education, there was towards Penelope that kind of
look, tone, and address, which is so frequently adopted towards
those who once were equals, and whom misfortune has made
inferiors. Those of our readers who cannot understand us here we
sincerely congratulate.
It had been made known to Mr Darnley for what purpose Miss
Primrose was making preparations for a journey to London. But,
though the fact had been communicated, the reason for that step
had not been mentioned; not a word had been said concerning the
pressing importunity of the Countess; nor was there any notice taken
to him of the reluctance with which Penelope had consented to this
arrangement. It appeared therefore to Mr Darnley that the measure
was quite in unison with the young lady’s own wishes; nor did he see
how incongruous such a movement as this must be with his
suspicions of the aspiring views of his late friend’s niece. At all
events, this proceeding on the part of Miss Primrose appeared to
him, and very naturally so, as a tacit relinquishment of the
engagement with his son: as it was impossible for her not to know
how repugnant it must be to the feelings and taste of Mr Robert
Darnley. But as the elder Mr Darnley held the clerical office, of the
sanctity and dignity of which he had very high ideas, he thought it but
part of his duty to administer a word or two of exhortation to the
young lady about to embark in a concern of such a peculiar nature.
Now to render exhortation palatable, or even tolerable, requires a
very considerable share of address and dexterity, more indeed than
usually falls to the lot of clerical or of laical gentry. It is easy enough
to utter most majestically and authoritatively a mass of common
places concerning the dangers to which young people are exposed
in the world. It is easy to say, “Now let me advise you always to be
upon your guard against the allurements of the world, and to conduct
yourself circumspectly, and be very, very attentive to all the proper
decorums and duties of your station.” Such talk as this anybody may
utter; and when young people commence life, they expect to hear
such talk; and for the most part, to say the best of it, it produces no
effect, good, bad, or indifferent. It is also easy to render exhortation
painful and distressing, by making it assume the form of something
humiliating and reproachful; and when it has also a reference to
some departed friend, or to circumstances once bright, but now
gloomy, and when these references are founded on injustice, and
when this injustice cannot be refuted or rectified without some
explanation or explanations more painful still, then it is that
exhortation is doubly painful and distressing. So fell upon the ear
and heart of poor Penelope the exhorting language of Mr Darnley.
When Penelope had first entered the apartment she had
announced the purpose of her call, and had, by the assistance of the
Darnleys, stated the views with which she was going to London: for
so reluctant was she to mention the fact, that its annunciation was
almost extorted from her by those who knew beforehand what were
her intentions. After a very little and very cold common-place talk,
uttered merely from a feeling of the necessity of saying something,
the conversation dropped, and the parties looked awkwardly at one
another. Then did Mr Darnley, assuming a right reverend look,
address himself to Miss Primrose.
“Now, Miss Primrose, before we part, let me as your friend, and as
a friend of your late uncle, give you a little parting advice. I am sorry
that you have determined on taking this step, and had you
condescended to consult me on the subject, I certainly should have
dissuaded you from the undertaking. But, however, that is past.
Though I rather am surprised, I must acknowledge that, recollecting
as you must, how strongly your late worthy uncle used to speak
against this pursuit, you should so soon after his decease resolve to
engage in it. But, however, you are perfectly independent, and have
a right to do as you please. I do not say that in this pursuit there is
anything inconsistent with religion and morality. I would by no means
be so uncharitable. But I should have thought, Miss Primrose, that,
considering your high spirit, you would hardly have condescended to
such an employment; for I may call it condescension, when I
consider the prospects to which you were born: but those, I am sorry
to say, are gone. As you have then fully resolved upon thus making a
public display of your musical talents, which, for anything I know to
the contrary, may be of the highest order—for I do not understand
music myself—you will perhaps excuse me if, as a friend of your late
uncle, and really a well-wisher to yourself, I just take the liberty to
caution you against the snares by which you are surrounded.
Beware of the intoxications of flattery, and do not be unduly
distressed if you should occasionally in the public journals be made
the subject of ill-natured criticism. For I understand there are many
young and inexperienced writers who almost regularly assail by
severe criticism public performers of every kind; and they make use
of very authoritative language. Now this kind of criticism would be
very offensive to a person who was not aware that it is the
production of ignorant, conceited boys. I was once acquainted with a
young man who made acknowledgments to me that have given me a
very different view of the critical art from that which I formerly
entertained. But, my good young lady, there are severer trials which
await you than these: you will be very much exposed to the society
of the vicious and dissipated. You will have need of all your caution
and circumspection to take care that your religious and moral
principles be not weakened or impaired. I do not say, indeed, that
your profession is to be esteemed irreligious or immoral; but it
certainly is exposed to many snares, and does require an unusual
share of attention. I hope you will not neglect to attend church
regularly and punctually. It will assuredly be noticed if you neglect
this duty. Many will keep you in countenance should you be disposed
to slight the public ordinances of religion; but there are also not a few
who patronize public musical performances, and who also attend on
religious worship: it is desirable therefore to let these persons see
that you are also attentive to the duties of religion. I must add, Miss
Primrose, that I am concerned to find you so bent upon this scheme.
It would have given me great pleasure, had all things proceeded
rightly, to afford you an asylum in this house till the return of your
father, or till any other change had rendered such accommodation no
longer necessary. But, as circumstances now are, this cannot be.”
It is easy to conceive what effect such language as this must have
had on the sensitive mind and almost broken heart of Penelope
Primrose. It is very true that, in this address to her, Mr Darnley had
no malicious or cruel intention, though every sentence which he
uttered grieved her to the very soul. Well was it for Penelope that
she was partly prepared for something of this kind, and that her
sorrows had crept upon her gradually. Therefore she bore all this
with a most enduring patience, and never attempted to make any
explanation or apology otherwise than by meekly and calmly replying
to the elaborate harangue of Mr Darnley:
“I thank you, sir, for your advice; I hope and trust I shall attend to it;
but I wish you to understand that I am not acting purely according to
my own inclinations in adopting this employment. I am sorry that I
am under the necessity—”
The sentence was unfinished, and the tone in which it was uttered
excited Mr Darnley’s compassion: but he thought it very strange that
Miss Primrose should express any reluctance to engage in a pursuit
which, according to all appearance, she had voluntarily and
unnecessarily adopted. The young ladies also were very sorry for
her, but still they could not help blaming her mentally for her
fickleness towards their brother; for they were sure that he was
attached to her, and they plainly saw, or at least thought they saw,
that she had withdrawn her affections from him. Penelope also was
very well convinced, by this interview with the family, that all her
hopes of Robert Darnley were gone.
To avoid any farther unpleasantness, she then took leave of her
late friends, and, with a very heavy heart, returned to Smatterton to
make immediate preparation for her journey to London. Alas! poor
girl, she was not in a frame of mind favourable to the purposes of
festivity or the notes of gladness. She, in whose heart was no
gladness, was expected to be the means of delighting others. Thus
does it happen, that the tears of one are the smiles of another, and
the pleasures of mankind are founded in each other's pains. Never
do the burning words and breathing thoughts of poetry spring with
such powerful energy and sympathy-commanding force, as when
they come from a heart that has felt the bitterness of grief, and that
has been agitated even unto bursting.
Our heroine would then have appeared to the greatest advantage,
and would then have commanded the deepest sympathy in those
moments of solitude, which intervened between the last leave-taking
and her departure for a metropolis of which she had seen nothing,
heard much, and thought little. But now her mind was on the rack of
thought, and so deeply and painfully was it impressed, that her
feeling was of the absolute impossibility of effectually answering the
designs and intentions of her friend the Countess. She could not
bear to look back to the days that were past—she felt an
indescribable reluctance to look forward, but her mind was of
necessity forced in that direction. All that spirit of independence and
feeling of almost pride, which formed no small part of her character,
seemed now to have taken flight, and to have left her a humble,
destitute, helpless creature. It was a pretty conceit that came into her
head, and though it was sorrowful she smiled at it; for she thought
that her end would be swanlike, and that her first song would be her
last, with which she should expire while its notes were trembling on
her lips.

CHAPTER XI.

It was not very considerate of the Countess of Smatterton to let a
young lady like Penelope Primrose take a long and solitary journey
of two hundred miles in a stage-coach without any guide,
companion, or protector. The Earl had a very ample supply of
travelling apparatus, and it would have been quite as easy to have
found room for Penelope in one of the carriages when the family
travelled up to town. But they who do not suffer inconveniences
themselves, can hardly be brought to think that others may.
Penelope felt rather mortified at this neglect, and it was well for her
that she did, as it was the means of taking away her attention from
more serious but remoter evil. It was also productive of another
advantage; for it gave Mr Kipperson an opportunity of exhibiting his
gallantry and politeness. For, the very morning before Penelope was
to leave Smatterton, Mr Kipperson called in person on the young
lady, and stated that imperious business would compel him to visit
the metropolis, and he should have infinite pleasure in
accompanying Miss Primrose on her journey, and perhaps that might
be more agreeable to her than travelling alone or with total
strangers. Penelope could not but acknowledge herself highly
obliged by Mr Kipperson’s politeness, nor did she, with any
affectation or foolery, decline what she might perhaps be compelled
to accept. On the following morning, therefore, Miss Primrose,
escorted by Mr Kipperson, left the sweet village of Smatterton. That
place had been a home to Penelope from almost her earliest
recollections, and all her associations and thoughts were connected
with that place, and with its little neighbour Neverden. Two hundred
miles travelling in a stage-coach is a serious business to one who
has hardly ever travelled but about as many yards. It is also a very
tedious affair even to those who are accustomed to long journies by
such conveyance. In the present instance, however, the journey did
not appear too long to either of our travellers. For Penelope had
looked forward to the commencement of her journey with too much
repugnance to have any very great desire for its completion, and Mr
Kipperson was too happy in the company of Miss Primrose to wish
the wheels of time, or of the coach, to put themselves to the
inconvenience of rolling more rapidly than usual on his account. It
was also an additional happiness to Mr Kipperson that there were in
the coach with him two fellow travellers who had long heard of his
fame, but had never before seen his person; and when they
discovered that they were in company with the great agriculturist,
and the great universal knowledge promoter, Mr Kipperson, they
manifested no small symptoms of satisfaction and admiration.
Now the mind of the scientific agriculturist was so constructed as
to experience peculiar pleasure and delight at aught which came to
his ear in the form of compliment and admiration. And, when Mr
Kipperson was pleased, he was in general very eloquent and
communicative; and he informed his fellow travellers that he was
now hastening up to London on business of the utmost importance.
He had received despatches from town, calling him up to attend the
House of Commons, and to consult with, or rather to advise, certain
committees connected with the agricultural interest. And he, the said
Mr Kipperson, certainly could not decline any call which the deeply
vital interests of agriculture might make upon him. Thereupon he
proceeded to shew that there was no one individual in the kingdom
uniting in himself those rare combinations of talent, which were the
blessing and distinction of the celebrated Mr Kipperson of
Smatterton; and that if he should not pay attention to the bill then
before the House, or at least likely to be before the House, by the
time he should arrive in London, the agricultural interest must be
completely ruined; there could be no remunerating price, and then
the farmers would throw up their farms and leave the country, taking
with them all their implements, skill, forethought, and penetration;
and then all the land would be out of cultivation, and the kingdom
would be but one vast common, only maintaining, and that very
scantily, donkeys and geese.
When the safety of a nation depends upon one individual, that
individual feels himself very naturally of great importance. But
perhaps this is a circumstance not happening quite so often as is
imagined. Strange indeed must it be if, out of a population of ten
or twelve millions, only one or two can be found on whose wisdom
the state can rely, or from whose counsels it can receive benefit. But
as the pleasure of imagining one’s self to be of importance is very
great, that pleasure is very liberally indulged in. And thus the number
of those rarities, called “the only men in the world,” is considerably
increased. Now Mr Kipperson was the only man in the world who
had sagacity and penetration enough to know wherein consisted the
true interest of agriculture; and he was most happy in giving his time
and talents to the sacred cause of high prices. Enough of this: we do
not like to be panegyrical, and it is very probable that our readers will
not be much disappointed if we protest that it is not our intention to
enter very deeply into the subject of political economy. Indeed were
we to enter very deeply into the subject with which Mr Kipperson was
intimate, we should be under the necessity of making an
encyclopedia, or of plundering those already made, beyond the
forbearance of their proprietors.
That must be an exceedingly pleasant mode of travelling which
does not once, during a very long journey, provoke the traveller to
wish himself at his journey’s end. Pleased as was Mr Kipperson at
the opportunity afforded him of behaving politely to Miss Primrose,
and gratified as he was by the respectful veneration with which his
two other fellow travellers received the enunciations of his oracular
wisdom; fearful as was Penelope that her new life would be the
death of her, and mourning as she was under the actual loss of one
most excellent friend, and contemplating the possible loss of others,
still both were pleased to be at their journey’s end.
It would have given Mr Kipperson great pleasure to accompany
Miss Primrose to the Earl of Smatterton’s town residence; but it gave
him much greater pleasure to be able to apologize for this apparent
neglect, by saying that business of a most important nature
demanded his immediate attendance in the city, and from thence to
the House of Commons; but that he should have great pleasure in
calling on the following morning to make enquiries after his fellow
traveller, and to pay his respects to his worthy and right honorable
neighbour, Lord Smatterton. For although my Lord Smatterton was
what the world calls a proud man, yet he did admit of freedom and a
species of familiarity from some sort of people; and a little freedom
with a great man goes a great way with a little man. Now Mr
Kipperson was one of those persons to whom the Earl of Smatterton
was most graciously condescending, and with good reason was he
condescending; for this said Mr Kipperson, wishing to keep up the
respectability of the farming profession, and though being much of a
tenant, and a little of a landlord, but hoping in due time to be more of
a landlord through an anticipated inheritance, he gave all his mind to
impress upon his agricultural neighbours the importance of keeping
up prices, and he paid no small sum for the farm which he tenanted
under the Earl of Smatterton. It may be indeed said with some
degree of truth, that he paid Lord Smatterton exceedingly well for his
condescension; and as his lordship was not much exposed to Mr
Kipperson’s invasions in London, he bore them with great
resignation and address when they did happen. The Countess also
was condescending to Mr Kipperson, being very sensible of his
value to the Smatterton estate; so that the great and scientific
agriculturist appeared to visit this noble family on terms of equality;
and it is a fact that he thought himself quite equal, if not rather
superior, to the Smatterton nobleman. It was a pleasure to Mr
Kipperson to enjoy this conceit; and it did no one any injury, and it is
a pity that he should be disturbed in the possession of the fancy.
The nobility do not act judiciously when they admit of any other
token of distinction than actual rank. When once they adopt any
fanciful distinction from fashion, or ton, or impudence, for they are
nearly the same, the benefit of the civil distinction is at once
renounced, and there is no established immoveable barrier against
innovation. A merchant, or the son of a merchant, may by means of
an imperturbable self-conceit, or by force of commanding
impudence, push himself up into the highest walks of life, and look
down upon nobility. Though the biographer of a deceased statesman
may express his lament that nobility does not admit talent ad
eundem, yet there is danger lest nobility should hold its hereditary
honors with too light a hand. Lord Smatterton indeed was not guilty
of neglecting to preserve upon his own mind, or endeavouring to
