Chapters 3, 4, 5 and 6
Chapter 3:
Indexing Structures
Indexing structure
• What is an index structure? A data structure that speeds up search operations on a database.
An inverted file is the sorted list (or index) of keywords (attributes), with each keyword having links to the documents containing that keyword.
Inverted Index
“Inverted index is generated to index words in files.”
Each stemmed term is listed with the character position(s) at which it occurs in that sentence:

file      47
generate  19
in        44
index     10, 32
invert    1
is        16
to        29
word      38
Index and inverted index

Index:
• In your cell phone: the list of contacts and which phone numbers (cell, home, work) are associated with those contacts.
• DNS lookup: takes a host name and returns an IP address.
• Mapping documents to their content.

Inverted index:
• What allows you to manually enter a phone number and, when you hit ‘dial’, see the person’s name rather than the number: your phone has taken the phone number and found the contact associated with it.
• DNS reverse lookup: takes an IP address and returns the host name.
• Storing a mapping from content, such as words or numbers, to its location in a database file; a file or method of file organization in which labels indicating the locations of all documents of a given type are placed in a single record.
The index consists of a vocabulary (word list) and postings (inverted lists) that point into the documents:

Term   No. of docs   Total freq   Pointer to postings
Act    3             3            → inverted list
Bus    3             4            → inverted list
pen    1             1            → inverted list
total  2             3            → inverted list
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been parsed, the inverted file is sorted by term.

Before sorting (term, doc #, in order of appearance):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (alphabetical, then by doc #):
ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove duplicate terms & add frequency
• Multiple term entries in a single document are merged, and the frequency is added.
• Counting the number of occurrences of terms in the collection helps to compute TF (term frequency).

Term       Doc #  Freq
ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
caesar     1      1
caesar     2      2
capitol    1      1
did        1      1
enact      1      1
hath       2      1
I          1      2
i'         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
so         2      1
the        1      1
the        2      1
told       2      1
was        1      1
was        2      1
with       2      1
you        2      1
Vocabulary and postings file
The inverted file is commonly split into a dictionary (vocabulary) file and a postings file, with pointers from each dictionary entry to its postings list.
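A minimal sketch of this construction in Python (function and variable names are illustrative, not from the slides): it tokenizes each document, merges duplicate (term, docID) entries while counting frequencies, and prints the sorted vocabulary with its postings.

from collections import defaultdict

def build_inverted_index(docs):
    """Build {term: {doc_id: freq}} from {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        tokens = text.lower().replace(".", " ").replace(":", " ").replace(";", " ").split()
        for term in tokens:
            index[term][doc_id] += 1   # merge duplicates, add frequency
    return index

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_inverted_index(docs)
for term in sorted(index):                    # sorted vocabulary (dictionary)
    print(term, sorted(index[term].items()))  # postings: (doc, freq) pairs
# e.g. brutus [(1, 1), (2, 1)]   caesar [(1, 1), (2, 2)]   killed [(1, 2)]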
Tries
A trie is a tree data structure, similar in spirit to a binary tree.
A trie stores strings in a particular fashion so that retrieval becomes much faster, which helps performance.
Searching in a Trie
– To search for a string q, start at the root and follow the branch (edge) for each successive character of q.
– If a branch does not exist in the trie, then q cannot be one of the stored set of strings.
Trie Example:
• Example 1: search for the string GOOD.
– We start at the root and follow the G edge, then the O edge, another O edge, and finally the D edge.
Drawback:
• The above structure is rather wasteful of memory, because each edge represents a single symbol. For huge texts this design is an enormous waste of space.
Compact tries
• A compact trie trims (collapses) unary nodes that lead to leaves.
• Example: a compressed trie for {bear, bell, bid, bull, buy, sell, stock, stop}.
Why do we use tries?
• To do a fast search in a large text, for example searching for an item in a dictionary that contains several gigabytes of text.
–E.g. the Oxford English dictionary.
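A minimal trie sketch in Python (illustrative, not from the slides): each edge holds a single symbol, and a search fails as soon as a required branch is missing, exactly as described above.

class TrieNode:
    def __init__(self):
        self.children = {}    # one outgoing edge per symbol
        self.is_word = False  # marks that a stored string ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:  # branch does not exist:
                return False             # word cannot be in the set
            node = node.children[ch]
        return node.is_word

t = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(t.search("bull"))  # True
print(t.search("good"))  # False: no 'g' edge at the root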
Chapter 4:
IR Models

Introduction to IR Models
At the end of this chapter every student must be able to describe the main IR models, including:
– Boolean model
– Term weighting (tf*idf)
– Probabilistic models
1. Boolean model
Consider a set of five documents, and assume that they contain the terms shown in the table:
Doc. Terms
D1 Algorithm, information, retrieval
D2 Retrieval, science
D3 Algorithm, information, science
D4 Pattern, retrieval, science
D5 Science, algorithm
Then a Boolean query over these documents can be answered by set operations on the term–document lists:
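A minimal sketch in Python using set operations over the table above; the queries shown are illustrative examples, since the slide does not specify one.

# Term -> set of documents containing it (from the table above)
postings = {
    "algorithm":   {"D1", "D3", "D5"},
    "information": {"D1", "D3"},
    "retrieval":   {"D1", "D2", "D4"},
    "science":     {"D2", "D3", "D4", "D5"},
    "pattern":     {"D4"},
}

# information AND retrieval -> intersect the two postings sets
print(postings["information"] & postings["retrieval"])  # {'D1'}

# (retrieval OR science) AND NOT algorithm -> union, then set difference
all_docs = {"D1", "D2", "D3", "D4", "D5"}
not_algorithm = all_docs - postings["algorithm"]         # NOT algorithm
print((postings["retrieval"] | postings["science"]) & not_algorithm)  # {'D2', 'D4'}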
3. Term weighting (tf*idf)
Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.
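The slides do not fix a particular formula, but a common variant weights term t in document d as tf(t,d) × log(N / df(t)), where N is the collection size and df(t) is the number of documents containing t. A minimal sketch:

import math

def tf_idf(tf, df, n_docs):
    """tf*idf with the common logarithmic idf variant (an assumption;
    the slides do not specify a particular formula)."""
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a doc, found in 10 of 1000 docs, scores higher
# than one occurring 3 times but found in 500 of 1000 docs:
print(tf_idf(3, 10, 1000))   # ~13.8  (rare term: discriminative)
print(tf_idf(3, 500, 1000))  # ~2.1   (common term: less informative)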
[Figure: a user's information need is translated into a query representation; understanding of the user's need is uncertain. Documents are translated into document representations; whether a document has relevant content is an uncertain guess. How to match these two uncertain representations?]
• Probabilistic IR models are among the oldest, but also among the best-performing and most widely used IR models.
Probability ranking principle
– Let R_d,q = 1 if document d is relevant to query q, and R_d,q = 0 otherwise.
b) A retrieved document is either relevant (R) or non-relevant (NR), so P(R|Y) + P(NR|Y) = 1. Therefore:
P(R|Y) = 1 − P(NR|Y) = 1 − 0.3 = 0.7
The document counts for a term tk can be arranged in a contingency table:

                          Relevant   Non-relevant   Total
Docs containing term tk      r          n − r         n
All docs                     R          N − R         N

Based on this, if tk were distributed independently of relevance we would expect:
r / (R − r) = (n − r) / (N − n − R + r)

Now we can calculate the relevance function (adding 0.5 to each count for smoothing) as:
RF(w) = ((r + 0.5)(N − n − R + r + 0.5)) / ((R − r + 0.5)(n − r + 0.5))

With r = 5, n = 10, R = 15 and N = 20:
RF(w) = ((5 + 0.5)(20 − 10 − 15 + 5 + 0.5)) / ((15 − 5 + 0.5)(10 − 5 + 0.5))
      = (5.5 × 0.5) / (10.5 × 5.5)
      ≈ 0.048
(the relevance weight of the term: the stronger the evidence that documents containing it are relevant, the higher it is)
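A minimal sketch of this calculation in Python, so the arithmetic can be checked:

def relevance_weight(r, n, R, N):
    """Term relevance weight with 0.5 smoothing, as in the formula above."""
    return ((r + 0.5) * (N - n - R + r + 0.5)) / ((R - r + 0.5) * (n - r + 0.5))

# r=5 relevant docs contain the term, n=10 docs contain it in total,
# R=15 relevant docs, N=20 docs in the collection:
print(relevance_weight(5, 10, 15, 20))  # ~0.048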
Information Retrieval
Chapter 5:
Retrieval Evaluation
IR Evaluation
Why System Evaluation?
• It provides the ability to measure the difference between IR systems.
– How well do our search engines work?
– Is system A better than system B?
– Under what conditions?
• It identifies techniques that work and techniques that do not.
• There are many retrieval models/algorithms/systems; which one is the best?
Types of Evaluation Strategies
•System-centered studies
– Given documents, queries, and relevance judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
•User-centered studies
– Given several users, and at least two retrieval systems
• Have each user try the same task on both systems
• Measure which system works “best” for the users' information needs
Evaluation Criteria
What are some main measures for evaluating an IR system’s performance?
• Effectiveness
• User satisfaction: how “good” are the documents that are returned in response to the user's query?
Example 1
• R = # of relevant docs = 6; of the top R = 6 retrieved documents, 4 are relevant.
Answer:
R-Precision = 4/6 = 0.67
Example 2
• Let the total number of relevant documents = 6. Compute recall and precision at each cutoff point n:

n   doc #  relevant  Recall  Precision
1   588    x         0.167   1.00
2   589    x         0.333   1.00
3   576
4   590    x         0.500   0.75
5   986
6   592    x         0.667   0.667
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x         0.833   0.385
14  990

(The sixth relevant document is never retrieved, so recall never reaches 1.0.)
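A minimal sketch in Python that reproduces the table: walk down the ranking and report recall and precision at each relevant document.

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}  # the 6th relevant doc is never retrieved
total_relevant = 6

hits = 0
for n, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        recall = hits / total_relevant   # fraction of all relevant docs found
        precision = hits / n             # fraction of retrieved docs that are relevant
        print(f"n={n:2d} doc={doc} recall={recall:.3f} precision={precision:.3f}")
# n= 1 doc=588 recall=0.167 precision=1.000
# ...
# n=13 doc=772 recall=0.833 precision=0.385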
Problems with both precision and recall
• The number of irrelevant documents in the collection is not taken into account.
• Recall is undefined when there is no relevant document in the collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / total retrieved docs
• Silence (miss) = non-retrieved relevant docs / total relevant docs
• Relevance is typically determined by an external judge.
Information Retrieval
Chapter 6:
Query Languages and Operations
Definitions
Multiple-word queries
•A query is a set of words (or phrases).
•Two options: A document is retrieved if it includes
–any of the query words, or
–each of the query words.
•Documents are ranked by the number of query words they contain:
– A document containing n query words is ranked higher than a
document containing m < n query words.
– Documents are ranked in decreasing order: those containing all the query words are ranked at the top, those containing only one query word at the bottom.
–Frequency counts may be used to break ties among documents that
contain the same query words.
–Example: what is the result for the query “Red Bird”? (see the sketch below)
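A minimal sketch of this ranking in Python (the document contents are illustrative): documents are scored by the number of distinct query words they contain, with total frequency as the tie-breaker.

def rank(docs, query_words):
    """Rank doc_ids by # of distinct query words contained, then by total frequency."""
    scored = []
    for doc_id, words in docs.items():
        distinct = sum(1 for w in query_words if w in words)
        freq = sum(words.count(w) for w in query_words)
        if distinct:
            scored.append((distinct, freq, doc_id))
    return [d for _, _, d in sorted(scored, reverse=True)]

docs = {
    "D1": ["red", "bird", "red"],       # both query words, 'red' twice
    "D2": ["red", "car"],               # one query word
    "D3": ["bird", "watching", "red"],  # both query words
}
print(rank(docs, ["red", "bird"]))  # ['D1', 'D3', 'D2']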
Proximity queries
• Restrict the distance within a document between two search terms.
• Important for large documents in which the two search words may appear in different
contexts.
• Proximity specifications limit the acceptable occurrences and hence increase the
precision of the search.
• General Format: Word1 within m units of Word2.
– Unit may be character, word, paragraph, etc.
• Examples:
– Information within 5 words of Retrieval:
Finds documents that discuss “Information Processing for Document Retrieval” but
not “Information Processing and Searching for Relevant Document Retrieval”.
– Nuclear within 0 paragraphs of science:
Finds documents that discuss “Nuclear” and “science” in the same paragraph.
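A minimal sketch of a word-level proximity check in Python, assuming a positional index that stores word offsets per document:

def within(positions1, positions2, m):
    """True if some occurrence of word1 is within m words of some occurrence of word2."""
    return any(abs(p1 - p2) <= m for p1 in positions1 for p2 in positions2)

# Word offsets in "Information Processing for Document Retrieval":
#   information -> 0, retrieval -> 4
print(within([0], [4], 5))  # True: 4 words apart
# In "Information Processing and Searching for Relevant Document Retrieval":
#   information -> 0, retrieval -> 7
print(within([0], [7], 5))  # False: 7 words apart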
Boolean queries
• Based on concepts from logic: AND, OR, NOT
Relevance feedback
[Figure: the user issues a query to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks some results as relevant or non-relevant (feedback), the query is reformulated, and the system returns re-ranked documents (e.g. 1. Doc2, 2. Doc4, 3. Doc5, ...).]
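The slides do not name a reformulation method, but Rocchio is the classic one: move the query vector toward the relevant documents and away from the non-relevant ones. A minimal sketch (alpha, beta, gamma are conventional tuning weights; the vectors are illustrative):

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query reformulation over term->weight dicts (a sketch, not the slides' method)."""
    terms = set(query) | {t for d in relevant for t in d} | {t for d in nonrelevant for t in d}
    new_q = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
             - gamma * sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1))
        new_q[t] = max(w, 0.0)  # negative weights are usually clipped to zero
    return new_q

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.8, "safari": 0.6}]  # user marked this doc relevant
nonrel = [{"jaguar": 0.7, "car": 0.9}]  # user marked this doc non-relevant
print(rocchio(q, rel, nonrel))          # 'safari' gains weight, 'car' is suppressed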
Query expansion
• Expanding the user's query with additional related terms (e.g. synonyms, or terms drawn from relevant documents) in order to retrieve more relevant documents.
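A minimal sketch of thesaurus-based expansion in Python; the synonym table here is a hypothetical stand-in for a real resource such as WordNet:

# Hypothetical synonym table; a real system might use WordNet or co-occurrence data.
thesaurus = {"car": ["automobile", "vehicle"], "fast": ["quick", "rapid"]}

def expand(query_words):
    expanded = list(query_words)
    for w in query_words:
        expanded.extend(thesaurus.get(w, []))  # add related terms, if any
    return expanded

print(expand(["fast", "car"]))  # ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']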
Research areas
1. Text Annotation Techniques
2. Cross-lingual IR
3. Web search engine for local languages
7. Document summarization
8. Multimedia Retrieval
END OF THE COURSE
GOODBYE!