Lecture 10


Web search basics (Recap)

[Architecture diagram: the User issues a Search handled by the Query Engine; the Web Crawler feeds the Indexer, which builds the Indexes over the Web that the Query Engine consults.]

Query Engine
- Process the query
- Look up the index
- Retrieve the list of matching documents
- Order the documents (content relevance, link analysis, popularity)
- Prepare the results page

Today's question: given a large list of documents that match a query, how do we order them according to their relevance?
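The pipeline above can be sketched as a toy query engine. The documents, index layout, and matching-term scoring below are illustrative assumptions, not an actual engine:

```python
from collections import defaultdict

# Toy query engine illustrating the pipeline above.
# Documents, index layout, and scoring are illustrative assumptions.
docs = {1: "duran duran sang wild boys",
        2: "who wrote wild boys"}

index = defaultdict(list)              # inverted index: term -> doc ids
for doc_id, text in docs.items():      # built offline by the indexer
    for term in set(text.split()):
        index[term].append(doc_id)

def serve(query):
    terms = query.lower().split()                    # 1. process the query
    candidates = set()
    for t in terms:                                  # 2. look up the index
        candidates.update(index.get(t, []))
    ranked = sorted(candidates,                      # 3./4. retrieve and order
                    key=lambda d: -sum(t in docs[d].split() for t in terms))
    return [docs[d] for d in ranked]                 # 5. prepare the results
```

The rest of the lecture is about step 4: how to order the retrieved documents.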

Answer: Scoring Documents


- Given a document d and a query q, calculate score(q, d)
- Rank documents in decreasing order of score(q, d)
- Generic model: a document is a bag of (unordered) words; in set theory a bag is a multiset
- A document is composed of terms; a query is composed of terms
- score(q, d) will depend on those terms

Method 1: Assign weights to terms


Assign to each term t a weight tf_{t,d} (term frequency): how often term t occurs in document d.

score(q, d) = Σ_{t ∈ q} tf_{t,d}

query = who wrote wild boys
doc1 = Duran Duran sang Wild Boys in 1984.
doc2 = Wild boys dont remain forever wild.
doc3 = Who brought wild flowers?
doc4 = It was John Krakauer who wrote In to the wild.

As bags of words:
query = {boys: 1, who: 1, wild: 1, wrote: 1}
doc1 = {1984: 1, boys: 1, duran: 2, in: 1, sang: 1, wild: 1}
doc2 = {boys: 1, dont: 1, forever: 1, remain: 1, wild: 2}

score(q, doc1) = 1 + 1 = 2
score(q, doc2) = 1 + 2 = 3
score(q, doc3) = 1 + 1 = 2
score(q, doc4) = 1 + 1 + 1 = 3
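Method 1 can be sketched in a few lines. The tokenizer here is a simplifying assumption (lowercase, punctuation stripped):

```python
from collections import Counter

# Method 1 sketch: score(q, d) = sum over query terms t of tf_{t,d}.
# The tokenizer is a simplifying assumption: lowercase, punctuation stripped.
def tokens(text):
    return text.lower().replace(".", "").replace("?", "").replace("'", "").split()

def tf_score(query, doc):
    tf = Counter(tokens(doc))                   # document as a bag of words
    return sum(tf[t] for t in tokens(query))    # add up the term frequencies

query = "who wrote wild boys"
print(tf_score(query, "Duran Duran sang Wild Boys in 1984."))             # 2
print(tf_score(query, "Wild boys dont remain forever wild."))             # 3
print(tf_score(query, "It was John Krakauer who wrote In to the wild."))  # 3
```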



Why is Method 1 not good?


- All terms have equal importance.
- Bigger documents have more terms, so their scores are larger.
- It ignores term order.

Postulate: if a word appears in every document, it is probably not that important (it has no discriminatory power).

Method 2: New weights


- df_t: document frequency for term t (the number of documents containing t)
- idf_t: inverse document frequency for term t:

  idf_t = log(N / df_t),  where N is the total number of documents

- tf-idf_{t,d}: a combined weight for term t in document d:

  tf-idf_{t,d} = tf_{t,d} × idf_t

- score(q, d) = Σ_{t ∈ q} tf-idf_{t,d}

The tf-idf weight increases with the number of occurrences within a document, and increases with the rarity of the term across the whole corpus.
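A minimal sketch of Method 2 on the four-document example, assuming base-10 logarithms (which match the idf values on the next slide):

```python
import math
from collections import Counter

# Method 2 sketch on the 4-document example: idf_t = log10(N / df_t),
# tf-idf_{t,d} = tf_{t,d} * idf_t, score(q, d) = sum over query terms.
docs = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought wild flowers",
        "it was john krakauer who wrote in to the wild"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))    # document frequency
idf = {t: math.log10(N / df[t]) for t in df}

def tfidf_score(query, doc):
    tf = Counter(doc.split())
    return sum(tf[t] * idf.get(t, 0.0) for t in query.split())

print(round(idf["boys"], 3))    # 0.301: "boys" is in 2 of 4 documents
print(round(idf["wild"], 3))    # 0.0:   "wild" is in every document
print(round(tfidf_score("who wrote wild boys", docs[3]), 3))   # 0.903
```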

Example: idf values


(N = 4, idf_t = log(4 / df_t))

term      df   idf        term       df   idf
1984      1    0.602      krakauer   1    0.602
boys      2    0.301      remain     1    0.602
brought   1    0.602      sang       1    0.602
dont      1    0.602      the        1    0.602
duran     1    0.602      to         1    0.602
flowers   1    0.602      was        1    0.602
forever   1    0.602      who        2    0.301
in        2    0.301      wild       4    0.0
it        1    0.602      wrote      1    0.602
john      1    0.602

Example: calculating scores (1)


query = who wrote wild boys

documents                                        S: tf-idf   S: tf
duran duran sang wild boys in 1984               0.301       2
wild boys dont remain forever wild               0.301       3
who brought wild flowers                         0.301       2
it was john krakauer who wrote in to the wild    0.903       3

If doc3 is shortened to "who brought flowers", "wild" occurs in only 3 of the 4 documents (idf_wild = log(4/3) = 0.125) and the scores change:

documents                                        S: tf-idf   S: tf
duran duran sang wild boys in 1984               0.426       2
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.301       1
it was john krakauer who wrote in to the wild    1.028       3

Example: calculating scores (2)


query = who wrote wild boys

With "who" added to doc1 (so df_who = 3 and idf_who = log(4/3) = 0.125):

documents                                        S: tf-idf   S: tf
duran duran who sang wild boys in 1984           0.551       3
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.125       1
it was john krakauer who wrote in to the wild    0.852       3

With "wrote" added to doc1 instead (so df_wrote = 2 and idf_wrote = 0.301):

documents                                        S: tf-idf   S: tf
duran duran sang wrote wild boys in 1984         0.727       3
wild boys dont remain forever wild               0.551       3
who brought flowers                              0.301       1
it was john krakauer who wrote in to the wild    0.727       3
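The effect shown above, where adding a rare query term to doc1 lifts its score, can be reproduced in a few lines. One assumption: doc3 is taken as "who brought flowers", since its tf score of 1 implies it no longer contains "wild"; idf is log10(N/df) as before:

```python
import math
from collections import Counter

# Reproducing the score tables: adding the rare query term "wrote" to doc1
# changes both its tf counts and the idf of "wrote", reshuffling the ranking.
def scores(docs, query):
    N = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    idf = {t: math.log10(N / df[t]) for t in df}
    return [round(sum(Counter(d.split())[t] * idf.get(t, 0.0)
                      for t in query.split()), 3)
            for d in docs]

query = "who wrote wild boys"
base = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought flowers",
        "it was john krakauer who wrote in to the wild"]
spam = ["duran duran sang wrote wild boys in 1984"] + base[1:]

print(scores(base, query))   # [0.426, 0.551, 0.301, 1.028]
print(scores(spam, query))   # [0.727, 0.551, 0.301, 0.727] -- doc1 now ties doc4
```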

The Vector Space Model


- Formalizes the bag-of-words model.
- Each term from the collection becomes a dimension in an n-dimensional space.
- A document is a vector in this space, where term weights serve as coordinates.
- It is important for:
  - scoring documents for answering queries
  - query by example
  - document classification
  - document clustering

Term-document matrix (revision)


              Anthony & Cleopatra   Julius Caesar   Hamlet   Othello
Anthony       167                   76              0        0
Brutus        4                     161             1        0
Caesar        235                   228             2        1
Calphurnia    0                     10              0        0
Cleopatra     48                    0               0        0

Each cell records the term frequency (tf): how often the term occurs in that play.

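A term-document matrix like the one above can be built directly from token counts. The two mini "plays" below are stand-ins for illustration, not the actual texts:

```python
from collections import Counter

# Building a small term-document count matrix; the two "plays" are stand-ins.
plays = {"PlayA": "caesar brutus caesar cleopatra",
         "PlayB": "brutus calphurnia brutus"}
counts = {name: Counter(text.split()) for name, text in plays.items()}
vocab = sorted({t for c in counts.values() for t in c})

# rows: terms, columns: documents (in insertion order PlayA, PlayB)
matrix = {term: [counts[name][term] for name in plays] for term in vocab}
print(matrix["brutus"])    # [1, 2]: tf of "brutus" in PlayA and PlayB
```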

Documents as vectors
            combat   courage  enemy   fierce  peace   war
HenryVI-1   3.5147   1.4731   1.1288  0.6425  0.9507  3.8548
HenryVI-2   0        0.491    0.7525  0       1.2881  7.7096
HenryVI-3   0.4393   2.2096   0.8278  0.3212  0.3374  16.0617
Othello     0        0.2455   0.2258  0       0.2454  0
Rom.&Jul.   0        0.2455   0.602   0.3212  0.5827  0
Taming      0        0        0       0       0.184   0

Calculation example:
N = 44 (works in the Shakespeare collection)
war: df = 21, idf = log(44/21) = 0.32123338
HenryVI-1: tf-idf_war = tf_war × idf_war = 12 × 0.3212 = 3.8548
HenryVI-3: tf-idf_war = 50 × 0.3212 = 16.0617
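The slide's arithmetic checks out with base-10 logarithms, which this sketch verifies:

```python
import math

# Checking the slide's arithmetic: N = 44 works, "war" occurs in 21 of them.
N, df_war = 44, 21
idf_war = math.log10(N / df_war)
print(round(idf_war, 8))         # 0.32123338
print(round(12 * idf_war, 4))    # HenryVI-1: tf = 12 -> 3.8548
print(round(50 * idf_war, 4))    # HenryVI-3: tf = 50 -> 16.0617
```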

Why turn docs into vectors?


Query-by-example: given a document D, find others like it.

Since D is now a vector: given a document, find the vectors (documents) near it.

[Figure: documents d1–d5 as points in a 3-dimensional space with term axes t1, t2, t3]

Postulate: documents that are close together in vector space talk about the same things.

Some geometry
cos(π/2) = 0        cos(π/8) ≈ 0.92

The cosine can be used as a measure of similarity between two vectors. Given two vectors x and y:

cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1..n} x_i y_i / (√(Σ_{i=1..n} x_i²) × √(Σ_{i=1..n} y_i²))
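The formula above translates directly into code; the example vectors are arbitrary:

```python
import math

# Cosine of the angle between two vectors as a similarity measure.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine([1, 0], [0, 1]))              # orthogonal (angle pi/2): 0.0
print(round(cosine([1, 1], [1, 0]), 2))    # angle pi/4: 0.71
print(round(math.cos(math.pi / 8), 2))     # angle pi/8: 0.92, as on the slide
```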


Cosine Similarity
For any two given documents d_j and d_k, their similarity is:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1..n} w_{i,j} w_{i,k} / (√(Σ_{i=1..n} w_{i,j}²) × √(Σ_{i=1..n} w_{i,k}²))

where w_{i,j} is the weight of term i in document j, e.g., its tf-idf.

We can regard a query q as a document d_q and use the same formula:

sim(d_j, d_q) = (d_j · d_q) / (|d_j| |d_q|) = Σ_{i=1..n} w_{i,j} w_{i,q} / (√(Σ_{i=1..n} w_{i,j}²) × √(Σ_{i=1..n} w_{i,q}²))
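Putting the pieces together, a minimal ranker over the earlier four-document example, assuming tf-idf weights with base-10 idf and treating the query as a short document:

```python
import math
from collections import Counter

# Ranking the example documents by cosine similarity of tf-idf vectors,
# treating the query as a short document.
docs = ["duran duran sang wild boys in 1984",
        "wild boys dont remain forever wild",
        "who brought wild flowers",
        "it was john krakauer who wrote in to the wild"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log10(N / df[t]) for t in df}

def vec(text):
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def sim(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0

q = vec("who wrote wild boys")
ranked = sorted(docs, key=lambda d: sim(vec(d), q), reverse=True)
print(ranked[0])   # the Krakauer document matches the query best
```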


Example
Given the Shakespeare play Hamlet, find the plays most similar to it:

1. Taming of the Shrew
2. Winter's Tale
3. Richard III

                        hor               haue
                        tf    tf-idf      tf    tf-idf
Hamlet                  95    127.5302    175   19.5954
Taming of the Shrew     58    77.8605     163   18.2517

The word "hor" appears only in these two plays. It is an abbreviation (Hor.) for the names Horatio and Hortensio. The product of the tf-idf values for this word amounts to 82% of the similarity value between the two documents.

Digression: spamming indices


- This method was invented before the days when people were in the business of spamming web search engines.
- Consider indexing a sensible, passive document collection vs. an active document collection, where people (and indeed, service companies) are shaping documents in order to maximize scores.
- Vector space similarity may not be as useful in this context.

Issues to consider
How would you augment the inverted index to support cosine ranking computations? Walk through the steps of serving a query.

The math of the vector space model is quite straightforward, but doing cosine ranking efficiently at query time is nontrivial.
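One possible answer, sketched under assumptions of my own: postings carry precomputed tf-idf weights, document vector lengths are stored separately, and query-time scoring uses one accumulator per candidate document so only the query terms' postings lists are walked:

```python
import math
from collections import Counter, defaultdict

# Sketch of an augmented inverted index: postings carry precomputed tf-idf
# weights, document vector norms are precomputed, and scoring at query time
# uses per-document accumulators.
docs = ["who wrote wild boys",
        "wild boys dont remain forever wild",
        "who brought wild flowers"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log10(N / df[t]) for t in df}

postings = defaultdict(list)            # term -> [(doc_id, tf-idf weight)]
length = [0.0] * N                      # precomputed document vector norms
for i, d in enumerate(docs):
    for t, tf in Counter(d.split()).items():
        w = tf * idf[t]
        postings[t].append((i, w))
        length[i] += w * w
length = [math.sqrt(x) for x in length]

def rank(query):
    qtf = Counter(query.split())
    acc = defaultdict(float)            # accumulator per candidate document
    for t in qtf:                       # walk only the query terms' postings
        for i, w in postings.get(t, []):
            acc[i] += w * qtf[t] * idf.get(t, 0.0)
    # divide by document length; the query's own length does not change the order
    return sorted(acc, key=lambda i: -(acc[i] / length[i] if length[i] else 0.0))

print(rank("who wild"))   # [0, 2, 1]
```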

