Chapters 3, 4, 5 and 6
Chapter 3:
Indexing Structures
Indexing structure
• What is an index structure? A data structure that speeds up search operations on a database.
An inverted file is the sorted list (or index) of keywords (attributes), with each keyword having links to the documents containing that keyword.
Inverted Index
“Inverted index is generated to index words in files.”
Each stemmed term is listed with the character position(s) at which it occurs in that sentence:

file      47
generate  19
in        44
index     10, 32
invert    1
is        16
to        29
word      38
Index and inverted index

Index:
• In your cell phone: the list of contacts and which phone numbers (cell, home, work) are associated with those contacts.
• DNS lookup: takes a host name and returns an IP address.
• Mapping documents to their content.

Inverted index:
• What allows you to manually enter a phone number and, when you hit ‘dial’, see the person’s name rather than the number: your phone has taken the phone number and found the contact associated with it.
• DNS reverse lookup: takes an IP address and returns the host name.
• Storing a mapping from content, such as words or numbers, to its location in a database file; a file or method of file organization in which labels indicating the locations of all documents of a given type are placed in a single record.
The index consists of a vocabulary (word list) and postings (inverted lists) that point into the documents:

Term   No. of docs   Total freq   Pointer to postings
Act    3             3            → inverted list
Bus    3             4            → inverted list
pen    1             1            → inverted list
total  2             3            → inverted list
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been parsed, the inverted file is sorted by term.

Before sorting (term, doc #, in order of appearance):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (alphabetical, then by doc #):
ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove duplicate terms & add frequency
• Multiple term entries in a single document are merged, and the frequency is added.
• Counting the number of occurrences of terms in the collection helps to compute TF (term frequency).

Term       Doc #  Freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
caesar     1      1
caesar     2      2
capitol    1      1
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
was        1      1
was        2      1
with       2      1
you        2      1
Vocabulary and postings file
The inverted file is commonly split into a dictionary (vocabulary) file and a postings file, with pointers from each dictionary entry to its postings list.
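A minimal sketch of this construction in Python (function and variable names are illustrative, not from the slides): it tokenizes each document, merges duplicate (term, docID) entries while counting frequencies, and prints the sorted vocabulary with its postings.

from collections import defaultdict

def build_inverted_index(docs):
    """Build {term: {doc_id: freq}} from {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        tokens = text.lower().replace(".", " ").replace(":", " ").replace(";", " ").split()
        for term in tokens:
            index[term][doc_id] += 1   # merge duplicates, add frequency
    return index

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_inverted_index(docs)
for term in sorted(index):                    # sorted vocabulary (dictionary)
    print(term, sorted(index[term].items()))  # postings: (doc, freq) pairs
# e.g. brutus [(1, 1), (2, 1)]   caesar [(1, 1), (2, 2)]   killed [(1, 2)]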
Tries
A trie is a tree data structure, similar in spirit to a binary tree.
A trie stores strings in a particular fashion so that retrieval becomes much faster, which helps performance.
Searching in a Trie
– To search for a string q, start at the root and follow the branch (edge) for each successive character of q.
– If a branch does not exist in the trie, then q cannot be one of the stored set of strings.
Trie Example:
• Example 1: search for the string GOOD.
– We start at the root and follow the G edge, then the O edge, another O edge, and finally the D edge.
Drawback:
• The above structure is rather wasteful of memory, because each edge represents a single symbol. For huge texts this design is an enormous waste of space.
Compact tries
• A compact trie trims (collapses) unary nodes that lead to leaves.
• Example: a compressed trie for {bear, bell, bid, bull, buy, sell, stock, stop}.
Why do we use tries?
• To do a fast search in a large text, for example searching for an item in a dictionary that contains several gigabytes of text.
–E.g. the Oxford English dictionary.
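A minimal trie sketch in Python (illustrative, not from the slides): each edge holds a single symbol, and a search fails as soon as a required branch is missing, exactly as described above.

class TrieNode:
    def __init__(self):
        self.children = {}    # one outgoing edge per symbol
        self.is_word = False  # marks that a stored string ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:  # branch does not exist:
                return False             # word cannot be in the set
            node = node.children[ch]
        return node.is_word

t = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(t.search("bull"))  # True
print(t.search("good"))  # False: no 'g' edge at the root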
Chapter 4:
IR Models

Introduction to IR Models
At the end of this chapter every student must be able to describe the main IR models, including:
– Boolean model
– Term weighting (tf*idf)
– Probabilistic models
1. Boolean model
Consider a set of five documents, and assume that they contain the terms shown in the table:
Doc. Terms
D1 Algorithm, information, retrieval
D2 Retrieval, science
D3 Algorithm, information, science
D4 Pattern, retrieval, science
D5 Science, algorithm
Then a Boolean query over these documents can be answered by set operations on the term–document lists:
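A minimal sketch in Python using set operations over the table above; the queries shown are illustrative examples, since the slide does not specify one.

# Term -> set of documents containing it (from the table above)
postings = {
    "algorithm":   {"D1", "D3", "D5"},
    "information": {"D1", "D3"},
    "retrieval":   {"D1", "D2", "D4"},
    "science":     {"D2", "D3", "D4", "D5"},
    "pattern":     {"D4"},
}

# information AND retrieval -> intersect the two postings sets
print(postings["information"] & postings["retrieval"])  # {'D1'}

# (retrieval OR science) AND NOT algorithm -> union, then set difference
all_docs = {"D1", "D2", "D3", "D4", "D5"}
not_algorithm = all_docs - postings["algorithm"]         # NOT algorithm
print((postings["retrieval"] | postings["science"]) & not_algorithm)  # {'D2', 'D4'}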
3. Term weighting (tf*idf)
Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.
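The slides do not fix a particular formula, but a common variant weights term t in document d as tf(t,d) × log(N / df(t)), where N is the collection size and df(t) is the number of documents containing t. A minimal sketch:

import math

def tf_idf(tf, df, n_docs):
    """tf*idf with the common logarithmic idf variant (an assumption;
    the slides do not specify a particular formula)."""
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a doc, found in 10 of 1000 docs, scores higher
# than one occurring 3 times but found in 500 of 1000 docs:
print(tf_idf(3, 10, 1000))   # ~13.8  (rare term: discriminative)
print(tf_idf(3, 500, 1000))  # ~2.1   (common term: less informative)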
[Figure: a user's information need is translated into a query representation; understanding of the user's need is uncertain. Documents are translated into document representations; whether a document has relevant content is an uncertain guess. How to match these two uncertain representations?]
• Probabilistic IR models are among the oldest, but also among the best-performing and most widely used IR models.
Probability ranking principle
– Let R_d,q = 1 if document d is relevant to query q, and R_d,q = 0 otherwise.
b) A retrieved document is either relevant (R) or non-relevant (NR), so P(R|Y) + P(NR|Y) = 1. Therefore:
P(R|Y) = 1 − P(NR|Y) = 1 − 0.3 = 0.7
The document counts for a term tk can be arranged in a contingency table:

                          Relevant   Non-relevant   Total
Docs containing term tk      r          n − r         n
All docs                     R          N − R         N

Based on this, if tk were distributed independently of relevance we would expect:
r / (R − r) = (n − r) / (N − n − R + r)

Now we can calculate the relevance function (adding 0.5 to each count for smoothing) as:
RF(w) = ((r + 0.5)(N − n − R + r + 0.5)) / ((R − r + 0.5)(n − r + 0.5))

With r = 5, n = 10, R = 15 and N = 20:
RF(w) = ((5 + 0.5)(20 − 10 − 15 + 5 + 0.5)) / ((15 − 5 + 0.5)(10 − 5 + 0.5))
      = (5.5 × 0.5) / (10.5 × 5.5)
      ≈ 0.048
(the relevance weight of the term: the stronger the evidence that documents containing it are relevant, the higher it is)
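A minimal sketch of this calculation in Python, so the arithmetic can be checked:

def relevance_weight(r, n, R, N):
    """Term relevance weight with 0.5 smoothing, as in the formula above."""
    return ((r + 0.5) * (N - n - R + r + 0.5)) / ((R - r + 0.5) * (n - r + 0.5))

# r=5 relevant docs contain the term, n=10 docs contain it in total,
# R=15 relevant docs, N=20 docs in the collection:
print(relevance_weight(5, 10, 15, 20))  # ~0.048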
Information Retrieval
Chapter 5:
Retrieval Evaluation
IR Evaluation
Why System Evaluation?
• It provides the ability to measure the difference between IR systems.
– How well do our search engines work?
– Is system A better than system B?
– Under what conditions?
• It identifies techniques that work and techniques that do not.
• There are many retrieval models/algorithms/systems; which one is the best?
Types of Evaluation Strategies
•System-centered studies
– Given documents, queries, and relevance judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
•User-centered studies
– Given several users, and at least two retrieval systems
• Have each user try the same task on both systems
• Measure which system works “best” for the users' information needs
Evaluation Criteria
What are some main measures for evaluating an IR system’s performance?
• Effectiveness
• User satisfaction: how “good” are the documents that are returned in response to the user's query?
Example 1
• R = # of relevant docs = 6; of the top R = 6 retrieved documents, 4 are relevant.
Answer:
R-Precision = 4/6 = 0.67
Example 2
• Let the total number of relevant documents = 6. Compute recall and precision at each cutoff point n:

n   doc #  relevant  Recall  Precision
1   588    x         0.167   1.00
2   589    x         0.333   1.00
3   576
4   590    x         0.500   0.75
5   986
6   592    x         0.667   0.667
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x         0.833   0.385
14  990

(The sixth relevant document is never retrieved, so recall never reaches 1.0.)
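A minimal sketch in Python that reproduces the table: walk down the ranking and report recall and precision at each relevant document.

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}  # the 6th relevant doc is never retrieved
total_relevant = 6

hits = 0
for n, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        recall = hits / total_relevant   # fraction of all relevant docs found
        precision = hits / n             # fraction of retrieved docs that are relevant
        print(f"n={n:2d} doc={doc} recall={recall:.3f} precision={precision:.3f}")
# n= 1 doc=588 recall=0.167 precision=1.000
# ...
# n=13 doc=772 recall=0.833 precision=0.385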
Problems with both precision and recall
• The number of irrelevant documents in the collection is not taken into account.
• Recall is undefined when there is no relevant document in the collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / total retrieved docs
• Silence (miss) = non-retrieved relevant docs / total relevant docs
• Relevance is typically determined by an external judge.
Information Retrieval
Chapter 6:
Query Languages and Operations
Definitions
Multiple-word queries
•A query is a set of words (or phrases).
•Two options: A document is retrieved if it includes
–any of the query words, or
–each of the query words.
•Documents are ranked by the number of query words they contain:
– A document containing n query words is ranked higher than a
document containing m < n query words.
– Documents are ranked in decreasing order: those containing all the query words are ranked at the top, those containing only one query word at the bottom.
–Frequency counts may be used to break ties among documents that
contain the same query words.
–Example: what is the result for the query “Red Bird”? (see the sketch below)
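A minimal sketch of this ranking in Python (the document contents are illustrative): documents are scored by the number of distinct query words they contain, with total frequency as the tie-breaker.

def rank(docs, query_words):
    """Rank doc_ids by # of distinct query words contained, then by total frequency."""
    scored = []
    for doc_id, words in docs.items():
        distinct = sum(1 for w in query_words if w in words)
        freq = sum(words.count(w) for w in query_words)
        if distinct:
            scored.append((distinct, freq, doc_id))
    return [d for _, _, d in sorted(scored, reverse=True)]

docs = {
    "D1": ["red", "bird", "red"],       # both query words, 'red' twice
    "D2": ["red", "car"],               # one query word
    "D3": ["bird", "watching", "red"],  # both query words
}
print(rank(docs, ["red", "bird"]))  # ['D1', 'D3', 'D2']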
Proximity queries
• Restrict the distance within a document between two search terms.
• Important for large documents in which the two search words may appear in different
contexts.
• Proximity specifications limit the acceptable occurrences and hence increase the
precision of the search.
• General Format: Word1 within m units of Word2.
– Unit may be character, word, paragraph, etc.
• Examples:
– Information within 5 words of Retrieval:
Finds documents that discuss “Information Processing for Document Retrieval” but
not “Information Processing and Searching for Relevant Document Retrieval”.
– Nuclear within 0 paragraphs of science:
Finds documents that discuss “Nuclear” and “science” in the same paragraph.
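A minimal sketch of a word-level proximity check in Python, assuming a positional index that stores word offsets per document:

def within(positions1, positions2, m):
    """True if some occurrence of word1 is within m words of some occurrence of word2."""
    return any(abs(p1 - p2) <= m for p1 in positions1 for p2 in positions2)

# Word offsets in "Information Processing for Document Retrieval":
#   information -> 0, retrieval -> 4
print(within([0], [4], 5))  # True: 4 words apart
# In "Information Processing and Searching for Relevant Document Retrieval":
#   information -> 0, retrieval -> 7
print(within([0], [7], 5))  # False: 7 words apart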
Boolean queries
• Based on concepts from logic: AND, OR, NOT
Relevance feedback
[Figure: the user issues a query to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks some results as relevant or non-relevant (feedback), the query is reformulated, and the system returns re-ranked documents (e.g. 1. Doc2, 2. Doc4, 3. Doc5, ...).]
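The slides do not name a reformulation method, but Rocchio is the classic one: move the query vector toward the relevant documents and away from the non-relevant ones. A minimal sketch (alpha, beta, gamma are conventional tuning weights; the vectors are illustrative):

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query reformulation over term->weight dicts (a sketch, not the slides' method)."""
    terms = set(query) | {t for d in relevant for t in d} | {t for d in nonrelevant for t in d}
    new_q = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
             - gamma * sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1))
        new_q[t] = max(w, 0.0)  # negative weights are usually clipped to zero
    return new_q

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "safari": 0.6}]  # user marked this doc relevant
nonrel = [{"jaguar": 0.7, "car": 0.9}]  # user marked this doc non-relevant
print(rocchio(q, rel, nonrel))          # 'safari' gains weight, 'car' is suppressed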
Query expansion
• Expanding the user's query with additional related terms (e.g. synonyms, or terms drawn from relevant documents) in order to retrieve more relevant documents.
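A minimal sketch of thesaurus-based expansion in Python; the synonym table here is a hypothetical stand-in for a real resource such as WordNet:

# Hypothetical synonym table; a real system might use WordNet or co-occurrence data.
thesaurus = {"car": ["automobile", "vehicle"], "fast": ["quick", "rapid"]}

def expand(query_words):
    expanded = list(query_words)
    for w in query_words:
        expanded.extend(thesaurus.get(w, []))  # add related terms, if any
    return expanded

print(expand(["fast", "car"]))  # ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']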
Research areas
1. Text Annotation Techniques
2. Cross-lingual IR
3. Web search engine for local languages
7. Document summarization
8. Multimedia Retrieval
END OF THE COURSE
GOODBYE!