P 01 Intro
2019-02-21
Take-away
Overview
1 Introduction
3 Boolean model
4 Inverted index
5 Processing queries
6 Query optimization
Prerequisites
Teachers
Evaluation of students
Questions?
Presentation style? Warm ups? Personal cards.
Erasmus? Bc. or Mgr.? Discussion forum in IS!
1998: google.stanford.edu
Boolean retrieval
Incidence vectors
Answers to query
Bigger collections
Inverted index
Calpurnia −→ 2 → 31 → 54 → 101
...
(dictionary: the terms on the left; postings: the docID lists on the right)
Generate postings
Doc 1. i did enact julius caesar i was killed i’ the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious
=⇒
term docID
i 1
did 1
enact 1
julius 1
caesar 1
i 1
was 1
killed 1
i’ 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
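A minimal sketch of the step above, assuming whitespace tokenization and already-lowercased text. The names docs and generate_postings are illustrative, not from the slides.

```python
# Two toy documents, keyed by docID (text taken from the slide above).
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}


def generate_postings(docs):
    """Return one (term, docID) pair per token, in document order."""
    pairs = []
    for doc_id in sorted(docs):
        # Assumption: tokens are separated by whitespace only.
        for term in docs[doc_id].split():
            pairs.append((term, doc_id))
    return pairs


pairs = generate_postings(docs)
print(len(pairs), "pairs; first:", pairs[0], "last:", pairs[-1])
```

Duplicates such as (caesar, 2) are kept at this stage; they are merged only after sorting.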
Sort postings
term docID        term docID
i 1               ambitious 2
did 1             be 2
enact 1           brutus 1
julius 1          brutus 2
caesar 1          caesar 1
i 1               caesar 2
was 1             caesar 2
killed 1          capitol 1
i’ 1              did 1
the 1             enact 1
capitol 1         hath 2
brutus 1          i 1
killed 1          i 1
me 1              i’ 1
so 2          =⇒  it 2
let 2             julius 1
it 2              killed 1
be 2              killed 1
with 2            let 2
caesar 2          me 1
the 2             noble 2
noble 2           so 2
brutus 2          the 1
hath 2            the 2
told 2            told 2
you 2             you 2
caesar 2          was 1
was 2             was 2
ambitious 2       with 2
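A possible sketch of sorting the pairs and grouping them into a dictionary with postings lists. The function name build_index and the short sample input are assumptions made for illustration.

```python
from itertools import groupby


def build_index(pairs):
    """Sort (term, docID) pairs, then merge them into term -> sorted docID list."""
    index = {}
    # Sorting tuples orders by term first, then by docID.
    for term, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        # A set drops repeated docIDs, e.g. a term occurring twice in one document.
        index[term] = sorted({doc_id for _, doc_id in group})
    return index


# Hypothetical sample: a few of the pairs from the two documents above.
pairs = [("caesar", 1), ("brutus", 1), ("caesar", 2),
         ("brutus", 2), ("caesar", 2), ("capitol", 1)]
index = build_index(pairs)
for term, postings in index.items():
    print(term, "->", postings)
```

Keeping each postings list sorted by docID is what later allows merging two lists in a single pass.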
Calpurnia −→ 2 → 31 → 54 → 101
...
(dictionary: the terms on the left; postings file: the docID lists on the right)
Intersection =⇒ 2 → 31
This is linear in the combined length of the two postings lists.
Note: This only works if both postings lists are sorted by docID.
Intersect(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ nil and p2 ≠ nil
  do if docID(p1) = docID(p2)
       then Add(answer, docID(p1))
            p1 ← next(p1)
            p2 ← next(p2)
       else if docID(p1) < docID(p2)
              then p1 ← next(p1)
              else p2 ← next(p2)
  return answer
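The pseudocode above could be written in Python roughly as follows, assuming postings lists are plain sorted lists of integer docIDs rather than linked lists. The brutus list is an assumption taken from the IIR textbook example; only the Calpurnia list and the result appear in these slides.

```python
def intersect(p1, p2):
    """Merge two docID lists sorted in increasing order; return the common docIDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]  # assumed list, not shown in these slides
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each loop iteration advances at least one pointer, so the number of steps is bounded by the sum of the two list lengths.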
france −→ 1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15
paris −→ 2 → 6 → 10 → 12 → 14
lear −→ 12 → 15
Boolean queries
Westlaw: Comments
Query optimization
Intersect(⟨t1, . . . , tn⟩)
  terms ← SortByIncreasingFrequency(⟨t1, . . . , tn⟩)
  result ← postings(first(terms))
  terms ← rest(terms)
  while terms ≠ nil and result ≠ nil
  do result ← Intersect(result, postings(first(terms)))
     terms ← rest(terms)
  return result
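A sketch of the optimized multi-term intersection above, assuming the index is a dict from term to sorted docID list and that a term's frequency is the length of its postings list. The names intersect_terms and index are illustrative; the france, paris and lear lists are the ones shown earlier.

```python
def intersect(p1, p2):
    """Linear merge of two sorted docID lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together all terms, starting with the shortest postings lists."""
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    if not ordered:
        return []
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result


index = {
    "france": [1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 15],
    "paris": [2, 6, 10, 12, 14],
    "lear": [12, 15],
}
print(intersect_terms(["france", "paris", "lear"], index))
```

Processing lear first keeps every intermediate result no longer than the shortest list, so the long france list is scanned against at most two candidates.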
Exercise
[Figure: map phase, segment files, reduce phase]
[Figure: Zipf’s law; axis label: log10 rank]
[Figure: components: Metadata in zone and field indexes; Inexact top K retrieval; Tiered inverted positional index; k-gram; Scoring parameters; MLR; training set; Indexes]
http://news.google.com
[Figure: crawler components: www; DNS; Fetch; Parse; Content Seen?; Doc FP’s; URL Filter; Host splitter (to and from other nodes); Dup URL Elim; URL set; URL Frontier]
Take-away
Resources
Chapter 1 of IIR
Resources at https://www.fi.muni.cz/~sojka/PV211/ and http://cislmu.org, materials in MU IS and FI MU library
course schedule and overview
information retrieval links
Shakespeare search engine
https://www.rhymezone.com/shakespeare/