Lecture 10
[Figure: search engine architecture showing the User, Web crawler, Search, Indexer, and Query Engine components]
Query Engine
Process query
Look up the index
Retrieve list of documents
Order documents
Prepare results page

Today's question: Given a large list of documents that match a query, how do we order them according to their relevance?
$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tf}_{t,d}$$
query = {boys: 1, who: 1, wild: 1, wrote: 1}
doc1 = {1984: 1, boys: 1, duran: 2, in: 1, sang: 1, wild: 1}
doc2 = {boys: 1, dont: 1, forever: 1, remain: 1, wild: 2}
score(q, doc1) = 1 + 1 = 2
score(q, doc3) = 1 + 1 = 2
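The raw term-frequency score can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `tf_score` is an assumption, and the dictionaries copy the query, doc1, and doc2 listed above.

```python
def tf_score(query, doc):
    """Sum the document's term frequencies over the terms that occur in the query."""
    return sum(doc.get(term, 0) for term in query)

query = {"boys": 1, "who": 1, "wild": 1, "wrote": 1}
doc1 = {"1984": 1, "boys": 1, "duran": 2, "in": 1, "sang": 1, "wild": 1}
doc2 = {"boys": 1, "dont": 1, "forever": 1, "remain": 1, "wild": 2}

print(tf_score(query, doc1))  # boys (1) + wild (1)
print(tf_score(query, doc2))  # boys (1) + wild (2)
```

Because the score only counts occurrences, a common word that appears many times in a document can dominate the ranking, which motivates weighting terms by their rarity.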
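$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

$$\mathrm{score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}, \qquad \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$

The tf-idf weight of a term in a document:
Increases with the number of occurrences within a doc
Increases with the rarity of the term across the whole corpus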
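A minimal sketch of idf and the tf-idf score follows. The function names `idf` and `tfidf_score` are assumptions made for illustration, the logarithm is assumed to be base 10, and the document frequencies are illustrative values for a four-document corpus.

```python
import math

def idf(term, doc_freq, n_docs):
    """Inverse document frequency log10(N / df_t); 0.0 for terms absent from the corpus."""
    df = doc_freq.get(term, 0)
    return math.log10(n_docs / df) if df else 0.0

def tfidf_score(query, doc, doc_freq, n_docs):
    """Sum tf * idf over the query terms."""
    return sum(doc.get(t, 0) * idf(t, doc_freq, n_docs) for t in query)

# Illustrative document frequencies for a corpus of N = 4 documents.
doc_freq = {"wild": 4, "in": 2, "wrote": 1}
for term in ("wild", "in", "wrote"):
    print(term, round(idf(term, doc_freq, 4), 3))
```

A term that occurs in every document gets idf 0 and therefore contributes nothing to the score, however often it occurs in a document.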
term    df   idf
in      2    0.301
it      1    0.602
john    1    0.602
wild    4    0.0
wrote   1    0.602
0.426   0.551   0.301   1.028
2       3       1       3

0.727   0.551   0.301   0.727
3       3       1       3
Scoring documents for answering queries
Query by example
Document classification
Document clustering
Term-document count matrix (one column per document):

Brutus        4     161   1   0
Caesar        235   228   2   1
Calphurnia    0     10    0   0
Cleopatra     48    0     0   0
Documents as vectors
term      HenryVI-1   HenryVI-2   HenryVI-3   Othello   Rom.&Jul.   Taming
combat    3.5147      0           0.4393      0         0           0
courage   1.4731      0.491       2.2096      0.2455    0.2455      0
enemy     1.1288      0.7525      0.8278      0.2258    0.602       0
fierce    0.6425      0           0.3212      0         0.3212      0
peace     0.9507      1.2881      0.3374      0.2454    0.5827      0.184
war       3.8548      7.7096      16.0617     0         0           0
Calculation example:
N = 44 (works in the Shakespeare collection)
war: df = 21, idf = log(44/21) = 0.32123338
HenryVI-1: tf-idf(war) = tf(war) * idf(war) = 12 * 0.32123338 = 3.8548
HenryVI-3: tf-idf(war) = 50 * 0.32123338 = 16.0617
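The calculation above can be reproduced with a short Python check. This is only a sketch: it assumes the logarithm is base 10 (which matches the quoted idf value), and the variable names are illustrative.

```python
import math

N = 44        # works in the Shakespeare collection
df_war = 21   # works that contain "war"
idf_war = math.log10(N / df_war)

# Term frequencies of "war" quoted in the calculation example.
tf_war = {"HenryVI-1": 12, "HenryVI-3": 50}
tfidf_war = {play: tf * idf_war for play, tf in tf_war.items()}

print(f"idf(war) = {idf_war:.4f}")
for play, weight in tfidf_war.items():
    print(f"{play}: {weight:.4f}")
```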
Now that a document D is a vector: given a doc, find the vectors (docs) near it.
Intuition:
[Figure: documents d1 to d5 drawn as vectors in a space with term axes t1, t2, and t3]
Postulate: Documents that are close together in vector space talk about the same things.
Some geometry
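$$\cos(\pi/2) = 0$$
$$\cos(\pi/8) \approx 0.92$$

[Figure: document vectors d1 and d2 drawn against term axes t1 and t2]

The cosine can be used as a measure of similarity between two vectors. Given two vectors $x$ and $y$:

$$\cos(x, y) = \frac{x \cdot y}{|x|\,|y|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$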
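As a quick numerical check of the geometry, the cosine formula can be written directly in Python. The function name `cosine` and the example vectors are assumptions chosen for illustration.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Orthogonal vectors: the angle between them is pi/2.
print(cosine([1.0, 0.0], [0.0, 1.0]))

# Two unit vectors separated by an angle of pi/8.
a = [1.0, 0.0]
b = [math.cos(math.pi / 8), math.sin(math.pi / 8)]
print(round(cosine(a, b), 2))
```

Because the formula divides by both vector lengths, scaling either vector leaves the result unchanged; only the direction matters.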
Cosine Similarity
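For any two given documents $d_j$ and $d_k$, their similarity is:

$$\mathrm{sim}(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j|\,|d_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$

where $w_{i,j}$ is the weight (for example, tf-idf) of term $i$ in document $d_j$. Treating the query as a document $d_q$ gives:

$$\mathrm{sim}(d_j, d_q) = \frac{d_j \cdot d_q}{|d_j|\,|d_q|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$$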
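A minimal sketch of ranking documents against a query with this measure, using sparse `{term: weight}` dictionaries. The document weights are the six-term tf-idf values given earlier for three of the plays, so they cover only a slice of the vocabulary; the query vector and the names `cosine_sim` and `rank` are hypothetical.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_sim(vec, query)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "HenryVI-1": {"combat": 3.5147, "courage": 1.4731, "enemy": 1.1288,
                  "fierce": 0.6425, "peace": 0.9507, "war": 3.8548},
    "HenryVI-3": {"combat": 0.4393, "courage": 2.2096, "enemy": 0.8278,
                  "fierce": 0.3212, "peace": 0.3374, "war": 16.0617},
    "Othello": {"courage": 0.2455, "enemy": 0.2258, "peace": 0.2454},
}
query = {"war": 1.0, "combat": 1.0}  # hypothetical query vector

for name, sim in rank(query, docs):
    print(f"{name}: {sim:.3f}")
```

HenryVI-3 has by far the largest weight for "war", yet HenryVI-1 ranks first for this query because its weight is spread evenly over both query terms: cosine similarity rewards matching direction, not vector length.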
Example
Given the Shakespeare play Hamlet, find the plays most similar to it.
1. Taming of the Shrew
2. Winter's Tale
3. Richard III

term   play                  tf    tf-idf
hor    Hamlet                95    127.5302
hor    Taming of the Shrew   58    77.8605
haue   Hamlet                175   19.5954
haue   Taming of the Shrew   163   18.2517

The word "hor" appears only in these two plays. It is an abbreviation (Hor.) for the names Horatio and Hortentio. The product of the tf-idf values for this word amounts to 82% of the similarity value between the two documents.
Indexing a sensible, passive document collection vs. an active document collection, where people (and indeed, service companies) shape documents in order to maximize scores.
Issues to consider
How would you augment the inverted index to support cosine ranking computations? Walk through the steps of serving a query.
The math of the vector space model is quite straightforward, but doing cosine ranking efficiently at runtime is nontrivial.
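One possible way to approach the question is sketched below; it is an illustration under stated assumptions, not the lecture's answer. The inverted index stores term frequencies in its postings, keeps an idf value per term, and precomputes each document's vector length at index time. A query is then served term-at-a-time with score accumulators. The names `build_index` and `search`, the base-10 logarithm, and the toy corpus are all assumptions.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: {doc_id: list of tokens}. Returns (postings, idf, norms)."""
    n_docs = len(docs)
    postings = defaultdict(list)  # term -> [(doc_id, tf), ...]
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    idf = {t: math.log10(n_docs / len(p)) for t, p in postings.items()}
    # Precompute each document's tf-idf vector length once, at index time.
    squares = defaultdict(float)
    for term, plist in postings.items():
        for doc_id, tf in plist:
            squares[doc_id] += (tf * idf[term]) ** 2
    norms = {d: math.sqrt(s) for d, s in squares.items()}
    return postings, idf, norms

def search(query_tokens, postings, idf, norms, k=10):
    """Term-at-a-time cosine ranking using score accumulators."""
    acc = defaultdict(float)  # doc_id -> partial dot product
    for term, q_tf in Counter(query_tokens).items():
        if term not in postings:
            continue
        w_q = q_tf * idf[term]
        for doc_id, tf in postings[term]:
            acc[doc_id] += w_q * tf * idf[term]
    # The query length is the same for every document, so dividing by it
    # would not change the order; only the document length is applied.
    ranked = [(d, s / norms[d]) for d, s in acc.items() if norms[d] > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:k]

# Toy corpus, purely illustrative.
docs = {
    "d1": "wild boys sang".split(),
    "d2": "wild horses run".split(),
    "d3": "boys and girls".split(),
}
postings, idf, norms = build_index(docs)
print(search("wild boys".split(), postings, idf, norms))
```

Only documents that share at least one term with the query are ever touched, which is what makes the inverted index pay off; the remaining runtime cost is in traversing long postings lists and sorting the accumulators.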