BY
N. SUMANJALI
DPT OF LIS
PONDICHERRY UNIVERSITY
INFORMATION RETRIEVAL
 Information retrieval is the activity of obtaining
information resources relevant to an information need
from a collection of information resources.
 Searches can be based on metadata or on full-text (or
other content-based) indexing.
 Goal: Find the documents most relevant to a certain Query
 Dealing with notions of:
 Collection of documents
 Query (User’s information need)
 Notion of Relevancy
MODEL
 A model is a construct designed to help us understand a
complex system
 A particular way of “looking at things”
 Models inevitably make simplifying assumptions
 What are the limitations of the model?
 Different types of models:
 Conceptual models
 Physical analog models
 Mathematical models
Retrieval Models
A retrieval model specifies the details
of:
 Document representation
 Query representation
 Retrieval function
Determines a notion of relevance.
Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
CLASSES OF RETRIEVAL MODELS
Boolean models (set theoretic)
 Extended Boolean
Vector space models
(statistical/algebraic)
 Generalized VS
 Latent Semantic Indexing
Probabilistic models
MODELS OF IR
 Boolean model
 Based on the notion of sets
 Documents are retrieved only if they satisfy Boolean
conditions specified in the query
 Does not impose a ranking on retrieved documents
 Exact match
 Vector space model
 Based on geometry, the notion of vectors in high dimensional
space
 Documents are ranked based on their similarity to the query
(ranked retrieval)
 Best/partial match
 Language models
 Based on the notion of probabilities and processes for
generating text
 Documents are ranked based on the probability that
they generated the query
 Best/partial match
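To make the contrast with exact matching concrete, here is a minimal sketch of ranked retrieval in the vector space model. It assumes raw term counts as vector weights and cosine similarity as the retrieval function; the slides do not specify a weighting scheme, and the sample documents and the `cosine` helper are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-document collection.
docs = {
    "d1": "good party good music",
    "d2": "democratic party convention",
    "d3": "good food",
}
query = Counter("good party".split())

# Every document gets a score, so partial matches are ranked, not discarded.
scores = {d: cosine(query, Counter(text.split())) for d, text in docs.items()}
ranked = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranked)
```

Note that d2 and d3 each contain only one of the two query terms, yet both still appear in the ranking; a Boolean AND query would have dropped them.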
BOOLEAN MODEL
 Invented by George Boole (1815-1864)
 He devised a system of symbolic logic in which he used
three operators (+, ×, -) to combine statements in
symbolic form.
 The three operators of Boolean logic, which John Venn later
illustrated with diagrams, are the logical sum (+), logical
product (×), and logical difference (-).
 IR systems allow users to express their queries by
using these operators.
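As a small illustration, the three operators map onto set operations over document identifiers. The sketch below assumes that each term has already been resolved to the set of documents containing it; the document IDs and variable names are made up for the example.

```python
# Hypothetical sets of document IDs indexed under two terms.
docs_with_rio = {1, 2, 5}
docs_with_brazil = {2, 3, 5}

logical_sum = docs_with_rio | docs_with_brazil          # OR: union
logical_product = docs_with_rio & docs_with_brazil      # AND: intersection
logical_difference = docs_with_rio - docs_with_brazil   # NOT: set difference

print(sorted(logical_sum), sorted(logical_product), sorted(logical_difference))
```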
BOOLEAN MODEL
 Each index term is either present or absent
 Documents are either Relevant or Not Relevant (no
ranking)
 A document is represented as a set of keywords.
 Queries are Boolean expressions of
keywords, connected by AND, OR, and
NOT, including the use of brackets to indicate scope.
 [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
 Output: Document is relevant or not. No partial
matches or ranking.
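The example query can be evaluated directly against documents represented as keyword sets. This is a minimal sketch, not a real query parser: the query is hard-coded in a hypothetical `matches` function, and the four documents are invented for illustration.

```python
def matches(keywords: set[str]) -> bool:
    """Evaluate [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton."""
    place = {"rio", "brazil"} <= keywords or {"hilo", "hawaii"} <= keywords
    return place and "hotel" in keywords and "hilton" not in keywords


# Hypothetical documents, each reduced to a set of lowercase keywords.
documents = {
    "d1": {"rio", "brazil", "hotel", "beach"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"rio", "hotel"},
    "d4": {"hilo", "hawaii", "hotel"},
}

# The output is an unranked set: each document is either in or out.
relevant = sorted(doc_id for doc_id, kw in documents.items() if matches(kw))
print(relevant)
```

Here d3 is rejected even though it mentions Rio and a hotel, because the model has no notion of a partial match.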
BOOLEAN RETRIEVAL MODEL
 Popular retrieval model because:
 Easy to understand for simple queries.
 Clean formalism.
 Boolean models can be extended to include ranking.
 Reasonably efficient implementations possible for
normal queries.
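One common way such efficiency is obtained is an inverted index with sorted postings lists, where AND becomes a linear-time merge. The slides do not name this technique, so the sketch below is an assumption about what an implementation might look like; the `intersect` helper and the index contents are hypothetical.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document ID
        else:
            j += 1
    return out


# Hypothetical inverted index: term -> sorted list of document IDs.
index = {"good": [1, 4, 7, 9], "party": [2, 4, 9, 12]}
both = intersect(index["good"], index["party"])
print(both)
```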
BOOLEAN MODEL
 Weights assigned to terms are either “0” or “1”
 “0” represents “absence”: term isn’t in the document
 “1” represents “presence”: term is in the document
 Build queries by combining terms with Boolean
operators
 AND, OR, NOT
 The system returns all documents that satisfy the
query
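The 0/1 weights can be pictured as a term-document incidence table, with each Boolean operator applied position by position. The following is a minimal sketch under that assumption; the documents, the whitespace tokenization, and the helper names `and_`, `or_`, and `not_` are all hypothetical.

```python
documents = ["good party tonight", "democratic party convention", "good food"]
vocabulary = sorted({term for doc in documents for term in doc.split()})

# incidence[term][j] is 1 if the term occurs in document j, otherwise 0.
incidence = {
    term: [1 if term in doc.split() else 0 for doc in documents]
    for term in vocabulary
}


def and_(a, b):
    return [x & y for x, y in zip(a, b)]


def or_(a, b):
    return [x | y for x, y in zip(a, b)]


def not_(a):
    return [1 - x for x in a]


# Query: good AND party AND NOT democratic
result = and_(and_(incidence["good"], incidence["party"]),
              not_(incidence["democratic"]))
hits = [j for j, bit in enumerate(result) if bit]
print(result, hits)
```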
AND/OR/NOT
[Figure: diagram of sets A, B, and C illustrating the AND, OR, and NOT operators]
Why Boolean Retrieval Works
 Boolean operators approximate natural language
 Find documents about a good party that is not over
 AND can discover relationships between concepts
 good party
 OR can discover alternate terminology
 excellent party, wild party, etc.
 NOT can discover alternate meanings
 Democratic party
The Perfect Query Paradox
 Every information need has a perfect set of documents
 If not, there would be no sense doing retrieval
 Every document set has a perfect query
 AND every word in a document to get a query for it
 Repeat for each document in the set
 OR every document query to get the set query
 But can users realistically be expected to formulate this
perfect query?
 Boolean query formulation is hard!
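The construction described above can be sketched mechanically, which shows why it is of little practical use: the query is built from the very documents the user is trying to find. The functions `perfect_query` and `satisfies` and the small collection are hypothetical names introduced for this illustration.

```python
def perfect_query(target_docs: list[set[str]]) -> list[frozenset[str]]:
    """One AND-clause per target document (all its words); clauses are ORed."""
    return [frozenset(doc) for doc in target_docs]


def satisfies(doc: set[str], query: list[frozenset[str]]) -> bool:
    """A document matches if it contains every word of at least one clause."""
    return any(clause <= doc for clause in query)


# Hypothetical collection of documents as word sets.
collection = {
    "d1": {"good", "party"},
    "d2": {"wild", "party"},
    "d3": {"democratic", "party"},
    "d4": {"good", "food"},
}
wanted = ["d1", "d2"]

query = perfect_query([collection[d] for d in wanted])
retrieved = sorted(d for d, words in collection.items() if satisfies(words, query))
print(retrieved)
```

In this sketch a longer document that contains all the words of a target document would also match, so a strictly exact query would additionally need NOT clauses for the remaining vocabulary.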
Why Boolean Retrieval Fails
• Natural language is way more complex
• AND “discovers” nonexistent relationships
– Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
– good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
– Democratic party, party to a lawsuit, …
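The first failure mode is easy to reproduce. In the sketch below, a document satisfies good AND party even though the two terms sit in different sentences and "good" describes the food; the sample text and the `terms` tokenizer are assumptions made for the example.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


# "good" and "party" co-occur in the document but are unrelated.
doc = "The food was good. The party, however, was a disaster."
query_terms = {"good", "party"}

# AND only checks co-occurrence anywhere in the document.
matched = query_terms <= terms(doc)
print(matched)
```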
BOOLEAN MODEL
 Strengths
 Precise, if you know the right strategies
 Precise, if you have an idea of what you’re looking for
 Efficient for the computer
 Simple
 Weaknesses
 Users must learn Boolean logic
 Boolean logic insufficient to capture the richness of language
 No control over size of result set: either too many documents or none
 When do you stop reading? All documents in the result set are
considered “equally good”
 What about partial matches? Documents that “don’t quite match” the
query may be useful also
 No notion of ranking (exact matching only)
 All index terms have equal weight
PROBLEMS
 Very rigid: AND means all; OR means any.
 Difficult to express complex user requests.
 Difficult to control the number of documents retrieved.
 All matched documents will be returned.
 Difficult to rank output.
 All matched documents logically satisfy the query.
 Difficult to perform relevance feedback.
 If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
ADVANTAGES & DISADVANTAGES
 Advantages
 Results are predictable, relatively easy to explain
 Many different features can be incorporated
 Efficient processing since many documents can be
eliminated from search
 Disadvantages
 Effectiveness depends entirely on user
 Simple queries usually don’t work well
 Complex queries are difficult.
LIMITATIONS
 The first relates to the formulation of search statements.
 It has been noted that users are not able to formulate an exact search
statement by the combination of AND, OR and NOT operators,
especially when several query terms are involved.
 In such cases either the search statement becomes too narrow or too
broad.
 The second limitation relates to the number of retrieved items.
 It has been noted that users cannot predict a priori exactly how many
items are to be retrieved to satisfy a given query.
 If the search statement is broad, the number of retrieved items may
sometimes be several hundreds and thus it may be quite difficult to
find out the exact information required.
 The third limitation is that it identifies an item as relevant by finding
out whether a given query term is present or not in a given record in the
database.
Model  of information retrieval (3)
