Lecture 10


Web search basics (Recap)

[Architecture diagram: the User issues a Search handled by the Query Engine; the Web Crawler feeds the Indexer, which builds the Indexes over the Web that the Query Engine consults.]

Query Engine
- Process the query
- Look up the index
- Retrieve the list of matching documents
- Order the documents (content relevance, link analysis, popularity)
- Prepare the results page

Today's question: given a large list of documents that match a query, how do we order them according to their relevance?
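The pipeline above can be sketched as a toy query engine. The documents, index layout, and matching-term scoring below are illustrative assumptions, not an actual engine:

```python
from collections import defaultdict

# Toy query engine illustrating the pipeline above.
# Documents, index layout, and scoring are illustrative assumptions.
docs = {1: "duran duran sang wild boys",
        2: "who wrote wild boys"}

index = defaultdict(list)              # inverted index: term -> doc ids
for doc_id, text in docs.items():      # built offline by the indexer
    for term in set(text.split()):
        index[term].append(doc_id)

def serve(query):
    terms = query.lower().split()                    # 1. process the query
    candidates = set()
    for t in terms:                                  # 2. look up the index
        candidates.update(index.get(t, []))
    ranked = sorted(candidates,                      # 3./4. retrieve and order
                    key=lambda d: -sum(t in docs[d].split() for t in terms))
    return [docs[d] for d in ranked]                 # 5. prepare the results
```

The rest of the lecture is about step 4: how to order the retrieved documents.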

Answer: Scoring Documents


- Given a document d and a query q, calculate score(q, d)
- Rank documents in decreasing order of score(q, d)
- Generic model: a document is a bag of (unordered) words; in set theory a bag is a multiset
- A document is composed of terms; a query is composed of terms
- score(q, d) will depend on those terms

Method 1: Assign weights to terms


Assign to each term t a weight tf_{t,d} (term frequency): how often term t occurs in document d.

score(q, d) = Σ_{t ∈ q} tf_{t,d}

query = who wrote wild boys
doc1 = Duran Duran sang Wild Boys in 1984.
doc2 = Wild boys dont remain forever wild.
doc3 = Who brought wild flowers?
doc4 = It was John Krakauer who wrote In to the wild.

As bags of words:
query = {boys: 1, who: 1, wild: 1, wrote: 1}
doc1 = {1984: 1, boys: 1, duran: 2, in: 1, sang: 1, wild: 1}
doc2 = {boys: 1, dont: 1, forever: 1, remain: 1, wild: 2}

score(q, doc1) = 1 + 1 = 2
score(q, doc2) = 1 + 2 = 3
score(q, doc3) = 1 + 1 = 2
score(q, doc4) = 1 + 1 + 1 = 3
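Method 1 can be sketched in a few lines. The tokenizer here is a simplifying assumption (lowercase, punctuation stripped):

```python
from collections import Counter

# Method 1 sketch: score(q, d) = sum over query terms t of tf_{t,d}.
# The tokenizer is a simplifying assumption: lowercase, punctuation stripped.
def tokens(text):
    return text.lower().replace(".", "").replace("?", "").replace("'", "").split()

def tf_score(query, doc):
    tf = Counter(tokens(doc))                   # document as a bag of words
    return sum(tf[t] for t in tokens(query))    # add up the term frequencies

query = "who wrote wild boys"
print(tf_score(query, "Duran Duran sang Wild Boys in 1984."))             # 2
print(tf_score(query, "Wild boys dont remain forever wild."))             # 3
print(tf_score(query, "It was John Krakauer who wrote In to the wild."))  # 3
```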



Why is Method 1 not good?


- All terms have equal importance.
- Bigger documents have more terms, so their scores are larger.
- It ignores term order.

Postulate: if a word appears in every document, it is probably not that important (it has no discriminatory power).

Method 2: New weights


- df_t: document frequency for term t (the number of documents containing t)
- idf_t: inverse document frequency for term t:

  idf_t = log(N / df_t),  where N is the total number of documents

- tf-idf_{t,d}: a combined weight for term t in document d:

  tf-idf_{t,d} = tf_{t,d} × idf_t

- score(q, d) = Σ_{t ∈ q} tf-idf_{t,d}

The tf-idf weight increases with the number of occurrences within a document, and increases with the rarity of the term across the whole corpus.
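A minimal sketch of Method 2 on the four-document example, assuming base-10 logarithms (which match the idf values on the next slide):

```python
import math
from collections import Counter

# Method 2 sketch on the 4-document example: idf_t = log10(N / df_t),
# tf-idf_{t,d} = tf_{t,d} * idf_t, score(q, d) = sum over query terms.
docs = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought wild flowers",
        "it was john krakauer who wrote in to the wild"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))    # document frequency
idf = {t: math.log10(N / df[t]) for t in df}

def tfidf_score(query, doc):
    tf = Counter(doc.split())
    return sum(tf[t] * idf.get(t, 0.0) for t in query.split())

print(round(idf["boys"], 3))    # 0.301: "boys" is in 2 of 4 documents
print(round(idf["wild"], 3))    # 0.0:   "wild" is in every document
print(round(tfidf_score("who wrote wild boys", docs[3]), 3))   # 0.903
```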

Example: idf values


(N = 4, idf_t = log(4 / df_t))

term      df   idf        term       df   idf
1984      1    0.602      krakauer   1    0.602
boys      2    0.301      remain     1    0.602
brought   1    0.602      sang       1    0.602
dont      1    0.602      the        1    0.602
duran     1    0.602      to         1    0.602
flowers   1    0.602      was        1    0.602
forever   1    0.602      who        2    0.301
in        2    0.301      wild       4    0.0
it        1    0.602      wrote      1    0.602
john      1    0.602

Example: calculating scores (1)


query = who wrote wild boys

documents                                        S: tf-idf   S: tf
duran duran sang wild boys in 1984               0.301       2
wild boys dont remain forever wild               0.301       3
who brought wild flowers                         0.301       2
it was john krakauer who wrote in to the wild    0.903       3

If doc3 is shortened to "who brought flowers", "wild" occurs in only 3 of the 4 documents (idf_wild = log(4/3) = 0.125) and the scores change:

documents                                        S: tf-idf   S: tf
duran duran sang wild boys in 1984               0.426       2
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.301       1
it was john krakauer who wrote in to the wild    1.028       3

Example: calculating scores (2)


query = who wrote wild boys

With "who" added to doc1 (so df_who = 3 and idf_who = log(4/3) = 0.125):

documents                                        S: tf-idf   S: tf
duran duran who sang wild boys in 1984           0.551       3
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.125       1
it was john krakauer who wrote in to the wild    0.852       3

With "wrote" added to doc1 instead (so df_wrote = 2 and idf_wrote = 0.301):

documents                                        S: tf-idf   S: tf
duran duran sang wrote wild boys in 1984         0.727       3
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.301       1
it was john krakauer who wrote in to the wild    0.727       3
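The effect shown above, where adding a rare query term to doc1 lifts its score, can be reproduced in a few lines. One assumption: doc3 is taken as "who brought flowers", since its tf score of 1 implies it no longer contains "wild"; idf is log10(N/df) as before:

```python
import math
from collections import Counter

# Reproducing the score tables: adding the rare query term "wrote" to doc1
# changes both its tf counts and the idf of "wrote", reshuffling the ranking.
def scores(docs, query):
    N = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    idf = {t: math.log10(N / df[t]) for t in df}
    return [round(sum(Counter(d.split())[t] * idf.get(t, 0.0)
                      for t in query.split()), 3)
            for d in docs]

query = "who wrote wild boys"
base = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought flowers",
        "it was john krakauer who wrote in to the wild"]
spam = ["duran duran sang wrote wild boys in 1984"] + base[1:]

print(scores(base, query))   # [0.426, 0.551, 0.301, 1.028]
print(scores(spam, query))   # [0.727, 0.551, 0.301, 0.727] -- doc1 now ties doc4
```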

The Vector Space Model


- Formalizes the bag-of-words model.
- Each term from the collection becomes a dimension in an n-dimensional space.
- A document is a vector in this space, where term weights serve as coordinates.
- It is important for:
  - scoring documents for answering queries
  - query by example
  - document classification
  - document clustering

Term-document matrix (revision)


              Anthony & Cleopatra   Julius Caesar   Hamlet   Othello
Anthony       167                   76              0        0
Brutus        4                     161             1        0
Caesar        235                   228             2        1
Calphurnia    0                     10              0        0
Cleopatra     48                    0               0        0

Each cell records the term frequency (tf): how often the term occurs in that play.

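A term-document matrix like the one above can be built directly from token counts. The two mini "plays" below are stand-ins for illustration, not the actual texts:

```python
from collections import Counter

# Building a small term-document count matrix; the two "plays" are stand-ins.
plays = {"PlayA": "caesar brutus caesar cleopatra",
         "PlayB": "brutus calphurnia brutus"}
counts = {name: Counter(text.split()) for name, text in plays.items()}
vocab = sorted({t for c in counts.values() for t in c})

# rows: terms, columns: documents (in insertion order PlayA, PlayB)
matrix = {term: [counts[name][term] for name in plays] for term in vocab}
print(matrix["brutus"])    # [1, 2]: tf of "brutus" in PlayA and PlayB
```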

Documents as vectors
            combat   courage  enemy   fierce  peace   war
HenryVI-1   3.5147   1.4731   1.1288  0.6425  0.9507  3.8548
HenryVI-2   0        0.491    0.7525  0       1.2881  7.7096
HenryVI-3   0.4393   2.2096   0.8278  0.3212  0.3374  16.0617
Othello     0        0.2455   0.2258  0       0.2454  0
Rom.&Jul.   0        0.2455   0.602   0.3212  0.5827  0
Taming      0        0        0       0       0.184   0

Calculation example:
N = 44 (works in the Shakespeare collection)
war: df = 21, idf = log(44/21) = 0.32123338
HenryVI-1: tf-idf_war = tf_war × idf_war = 12 × 0.3212 = 3.8548
HenryVI-3: tf-idf_war = 50 × 0.3212 = 16.0617
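The slide's arithmetic checks out with base-10 logarithms, which this sketch verifies:

```python
import math

# Checking the slide's arithmetic: N = 44 works, "war" occurs in 21 of them.
N, df_war = 44, 21
idf_war = math.log10(N / df_war)
print(round(idf_war, 8))         # 0.32123338
print(round(12 * idf_war, 4))    # HenryVI-1: tf = 12 -> 3.8548
print(round(50 * idf_war, 4))    # HenryVI-3: tf = 50 -> 16.0617
```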

Why turn docs into vectors?


Query-by-example: given a document D, find others like it.

Since D is now a vector: given a document, find the vectors (documents) near it.

[Figure: documents d1–d5 as points in a 3-dimensional space with term axes t1, t2, t3]

Postulate: documents that are close together in vector space talk about the same things.

Some geometry
cos(π/2) = 0        cos(π/8) ≈ 0.92

The cosine can be used as a measure of similarity between two vectors. Given two vectors x and y:

cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1..n} x_i y_i / (√(Σ_{i=1..n} x_i²) × √(Σ_{i=1..n} y_i²))
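The formula above translates directly into code; the example vectors are arbitrary:

```python
import math

# Cosine of the angle between two vectors as a similarity measure.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine([1, 0], [0, 1]))              # orthogonal (angle pi/2): 0.0
print(round(cosine([1, 1], [1, 0]), 2))    # angle pi/4: 0.71
print(round(math.cos(math.pi / 8), 2))     # angle pi/8: 0.92, as on the slide
```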


Cosine Similarity
For any two given documents d_j and d_k, their similarity is:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1..n} w_{i,j} w_{i,k} / (√(Σ_{i=1..n} w_{i,j}²) × √(Σ_{i=1..n} w_{i,k}²))

where w_{i,j} is the weight of term i in document j, e.g., its tf-idf.

We can regard a query q as a document d_q and use the same formula:

sim(d_j, d_q) = (d_j · d_q) / (|d_j| |d_q|) = Σ_{i=1..n} w_{i,j} w_{i,q} / (√(Σ_{i=1..n} w_{i,j}²) × √(Σ_{i=1..n} w_{i,q}²))
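Putting the pieces together, a minimal ranker over the earlier four-document example, assuming tf-idf weights with base-10 idf and treating the query as a short document:

```python
import math
from collections import Counter

# Ranking the example documents by cosine similarity of tf-idf vectors,
# treating the query as a short document.
docs = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought wild flowers",
        "it was john krakauer who wrote in to the wild"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log10(N / df[t]) for t in df}

def vec(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def sim(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0

q = vec("who wrote wild boys")
ranked = sorted(docs, key=lambda d: sim(vec(d), q), reverse=True)
print(ranked[0])   # the Krakauer document matches the query best
```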


Example
Given the Shakespeare play Hamlet, find the plays most similar to it:

1. Taming of the Shrew
2. Winter's Tale
3. Richard III

                        hor               haue
                        tf    tf-idf      tf    tf-idf
Hamlet                  95    127.5302    175   19.5954
Taming of the Shrew     58    77.8605     163   18.2517

The word "hor" appears only in these two plays. It is an abbreviation (Hor.) for the names Horatio and Hortensio. The product of the tf-idf values for this word amounts to 82% of the similarity value between the two documents.

Digression: spamming indices


- This method was invented before the days when people were in the business of spamming web search engines.
- Consider indexing a sensible, passive document collection vs. an active document collection, where people (and indeed, service companies) are shaping documents in order to maximize scores.
- Vector space similarity may not be as useful in this context.

Issues to consider
How would you augment the inverted index to support cosine ranking computations? Walk through the steps of serving a query.

The math of the vector space model is quite straightforward, but doing cosine ranking efficiently at query time is nontrivial.
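One possible answer, sketched under assumptions of my own: postings carry precomputed tf-idf weights, document vector lengths are stored separately, and query-time scoring uses one accumulator per candidate document so only the query terms' postings lists are walked:

```python
import math
from collections import Counter, defaultdict

# Sketch of an augmented inverted index: postings carry precomputed tf-idf
# weights, document vector norms are precomputed, and scoring at query time
# uses per-document accumulators.
docs = ["who wrote wild boys",
        "wild boys dont remain forever wild",
        "who brought wild flowers"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log10(N / df[t]) for t in df}

postings = defaultdict(list)            # term -> [(doc_id, tf-idf weight)]
length = [0.0] * N                      # precomputed document vector norms
for i, d in enumerate(docs):
    for t, tf in Counter(d.split()).items():
        w = tf * idf[t]
        postings[t].append((i, w))
        length[i] += w * w
length = [math.sqrt(x) for x in length]

def rank(query):
    qtf = Counter(query.split())
    acc = defaultdict(float)            # accumulator per candidate document
    for t in qtf:                       # walk only the query terms' postings
        for i, w in postings.get(t, []):
            acc[i] += w * qtf[t] * idf.get(t, 0.0)
    # divide by document length; the query's own length does not change the order
    return sorted(acc, key=lambda i: -(acc[i] / length[i] if length[i] else 0.0))

print(rank("who wild"))   # [0, 2, 1]
```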

