Model of information retrieval (3)

BY
N. SUMANJALI
DPT OF LIS
PONDICHERRY UNIVERSITY

INFORMATION RETRIEVAL
 Information retrieval is the activity of obtaining
information resources relevant to an information need
from a collection of information resources.
 Searches can be based on metadata or on full-text (or
other content-based) indexing.
 Goal: Find the documents most relevant to a certain Query
 Dealing with notions of:
 Collection of documents
 Query (User’s information need)
 Notion of Relevancy

MODEL
 A model is a construct designed to help us understand a
complex system
 A particular way of “looking at things”
 Models inevitably make simplifying assumptions
 What are the limitations of the model?
 Different types of models:
 Conceptual models
 Physical analog models
 Mathematical models

Retrieval Models
A retrieval model specifies the details
of:
 Document representation
 Query representation
 Retrieval function
Determines a notion of relevance.
Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
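
To make the three parts of a retrieval model concrete, here is a minimal sketch. It assumes a set-of-terms representation for both documents and queries; the function names and sample texts are hypothetical, chosen only for illustration. The two retrieval functions show a binary and a continuous notion of relevance side by side.

```python
from typing import Set


def represent(text: str) -> Set[str]:
    """Document and query representation: a set of lowercase terms."""
    return set(text.lower().split())


def binary_relevance(doc: Set[str], query: Set[str]) -> int:
    """Binary retrieval function: 1 only if every query term is in the document."""
    return 1 if query <= doc else 0


def graded_relevance(doc: Set[str], query: Set[str]) -> float:
    """Continuous retrieval function: fraction of query terms found in the document."""
    return len(doc & query) / len(query) if query else 0.0


doc = represent("Cheap hotel in Rio")
query = represent("rio hotel beach")

binary = binary_relevance(doc, query)   # "beach" is missing, so no match
graded = graded_relevance(doc, query)   # two of three query terms match
```

With the same representations, the binary function rejects the document outright, while the graded function gives it partial credit that could be used for ranking.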

CLASSES OF RETRIEVAL MODELS
Boolean models (set theoretic)
 Extended Boolean
Vector space models
(statistical/algebraic)
 Generalized VS
 Latent Semantic Indexing
Probabilistic models

MODELS OF IR
 Boolean model
 Based on the notion of sets
 Documents are retrieved only if they satisfy Boolean
conditions specified in the query
 Does not impose a ranking on retrieved documents
 Exact match
 Vector space model
 Based on geometry, the notion of vectors in high dimensional
space
 Documents are ranked based on their similarity to the query
(ranked retrieval)
 Best/partial match
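
Ranking by similarity in the vector space model can be sketched as follows. This is a minimal illustration that assumes raw term counts as vector components and cosine similarity as the similarity measure; the slides do not fix a weighting scheme, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent text as a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = {
    "d1": "hotel in rio brazil",
    "d2": "hilton hotel hawaii",
    "d3": "weather report",
}
ranking = rank("rio hotel", docs)
```

Here d2 shares only one term with the query yet still appears in the ranking below d1, which is what best/partial match means in practice.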

 Language models
 Based on the notion of probabilities and processes for
generating text
 Documents are ranked based on the probability that
they generated the query
 Best/partial match
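
The idea of ranking documents by the probability that they generated the query can be sketched as below. This assumes a unigram language model per document, mixed with a collection model using a fixed weight (often called Jelinek-Mercer smoothing); the smoothing method, the weight of 0.5, and the sample data are assumptions made for illustration, not something the slides specify.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log probability that a smoothed unigram model of `doc` generates `query`.

    query, doc, collection are lists of terms; lam weights the document model.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection) if collection else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


d1 = ["rio", "hotel", "brazil"]
d2 = ["hawaii", "hotel", "hilo"]
collection = d1 + d2
query = ["rio", "hotel"]

score_d1 = query_log_likelihood(query, d1, collection)
score_d2 = query_log_likelihood(query, d2, collection)
```

Because d2 never mentions "rio", its score relies on the collection model for that term and ends up lower than the score for d1, but it is still ranked rather than excluded.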

BOOLEAN MODEL
 Named after George Boole (1815-1864)
 He devised a system of symbolic logic in which he used
three operators (+, ×, −) to combine statements in
symbolic form.
 John Venn later expressed these relationships in diagrams.
 The three operators of Boolean logic are the logical sum (+),
logical product (×), and logical difference (−).
 IR systems allow users to express their queries by
using these operators.

BOOLEAN MODEL
 Each index term is either present or absent
 Documents are either Relevant or Not Relevant (no
ranking)
 A document is represented as a set of keywords.
 Queries are Boolean expressions of
keywords, connected by AND, OR, and
NOT, including the use of brackets to indicate scope.
 [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
 Output: Document is relevant or not. No partial
matches or ranking.
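
The example query above can be evaluated directly against documents represented as keyword sets. The sketch below uses a small hypothetical collection; the document ids and keywords are invented for illustration.

```python
# Each document is a set of keywords (hypothetical sample data).
docs = {
    "d1": {"rio", "brazil", "hotel", "beach"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "museum"},
}


def matches(keywords):
    """[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton"""
    place = {"rio", "brazil"} <= keywords or {"hilo", "hawaii"} <= keywords
    return place and "hotel" in keywords and "hilton" not in keywords


# Output is a plain set of matching documents: no scores, no ordering.
result = sorted(doc_id for doc_id, kw in docs.items() if matches(kw))
```

Document d4 satisfies the place condition but lacks "hotel", so it is rejected just as firmly as a document that matches nothing at all.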

BOOLEAN RETRIEVAL MODEL
 Popular retrieval model because:
 Easy to understand for simple queries.
 Clean formalism.
 Boolean models can be extended to include ranking.
 Reasonably efficient implementations possible for
normal queries.

BOOLEAN MODEL
 Weights assigned to terms are either “0” or “1”
 “0” represents “absence”: term isn’t in the document
 “1” represents “presence”: term is in the document
 Build queries by combining terms with Boolean
operators
 AND, OR, NOT
 The system returns all documents that satisfy the
query
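
One common way to implement this is an inverted index in which each term maps to the set of documents where its weight is 1. The sketch below assumes whitespace tokenization and a tiny hypothetical collection; AND, OR, and NOT then become set intersection, union, and complement.

```python
def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = {
    1: "good party tonight",
    2: "democratic party convention",
    3: "excellent party",
}
index = build_index(docs)
all_ids = set(docs)


def postings(term):
    """Documents where the term has weight 1; every other document has weight 0."""
    return index.get(term, set())


def weight(term, doc_id):
    return 1 if doc_id in postings(term) else 0


# (good OR excellent) AND party AND NOT democratic
result = (
    (postings("good") | postings("excellent"))
    & postings("party")
    & (all_ids - postings("democratic"))
)
```

The system returns every document in `result`, here documents 1 and 3, with nothing to distinguish one from the other.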

Why Boolean Retrieval Works
 Boolean operators approximate natural language
 Find documents about a good party that is not over
 AND can discover relationships between concepts
 good party
 OR can discover alternate terminology
 excellent party, wild party, etc.
 NOT can discover alternate meanings
 Democratic party

The Perfect Query Paradox
 Every information need has a perfect set of documents
 If not, there would be no sense doing retrieval
 Every document set has a perfect query
 AND every word in a document to get a query for it
 Repeat for each document in the set
 OR every document query to get the set query
 But can users realistically be expected to formulate this
perfect query?
 Boolean query formulation is hard!
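
The construction described above is mechanical once the target documents are known, which is exactly what a user does not have in advance. A minimal sketch, assuming whitespace tokenization and a hypothetical helper name:

```python
def perfect_query(target_docs):
    """AND every word within a document, then OR the per-document queries."""
    clauses = []
    for text in target_docs:
        terms = sorted(set(text.lower().split()))
        clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses)


query = perfect_query(["good party", "wild party tonight"])
```

Even for two very short documents the query is already unwieldy, and it would also match any longer document that happens to contain all the words of one target.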

Why Boolean Retrieval Fails
• Natural language is way more complex
• AND “discovers” nonexistent relationships
– Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
– good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
– Democratic party, party to a lawsuit, …
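
The first failure is easy to reproduce. In the sketch below (hypothetical sample text, simple regex tokenization), the query good AND party matches a document even though the two terms sit in different sentences and are unrelated.

```python
import re


def tokens(text):
    """Lowercase alphabetic tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


doc = "The food was good. The party ended early."
query_terms = {"good", "party"}

# Document-level AND: both terms occur somewhere, so the document matches.
document_match = query_terms <= tokens(doc)

# Sentence-level check: no single sentence contains both terms.
sentences = re.split(r"[.!?]", doc)
same_sentence = any(query_terms <= tokens(s) for s in sentences)
```

The Boolean model only sees `document_match`, so it reports a "good party" that the text never describes.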

BOOLEAN MODEL
 Strengths
 Precise, if you know the right strategies
 Precise, if you have an idea of what you’re looking for
 Efficient for the computer
 Simple
 Weaknesses
 Users must learn Boolean logic
 Boolean logic insufficient to capture the richness of language
 No control over size of result set: either too many documents or none
 When do you stop reading? All documents in the result set are
considered “equally good”
 What about partial matches? Documents that “don’t quite match” the
query may be useful also
 No notion of ranking (exact matching only)
 All index terms have equal weight

PROBLEMS
 Very rigid: AND means all; OR means any.
 Difficult to express complex user requests.
 Difficult to control the number of documents retrieved.
 All matched documents will be returned.
 Difficult to rank output.
 All matched documents logically satisfy the query.
 Difficult to perform relevance feedback.
 If a document is identified by the user as relevant or
irrelevant, how should the query be modified?

ADVANTAGES & DISADVANTAGES
 Advantages
 Results are predictable, relatively easy to explain
 Many different features can be incorporated
 Efficient processing since many documents can be
eliminated from search
 Disadvantages
 Effectiveness depends entirely on user
 Simple queries usually don’t work well
 Complex queries are difficult.

LIMITATIONS
 The first relates to the formulation of search statements.
 It has been noted that users are not able to formulate an exact search
statement by the combination of AND, OR and NOT operators,
especially when several query terms are involved.
 In such cases the search statement becomes either too narrow or too
broad.
 The second limitation relates to the number of retrieved items.
 It has been noted that users cannot predict a priori exactly how many
items are to be retrieved to satisfy a given query.
 If the search statement is broad, the number of retrieved items may
sometimes be several hundreds and thus it may be quite difficult to
find out the exact information required.
 The third limitation is that it identifies an item as relevant by finding
out whether a given query term is present or not in a given record in the
database.
