2. INFORMATION RETRIEVAL
Information retrieval is the activity of obtaining
information resources relevant to an information need
from a collection of information resources.
Searches can be based on metadata or on full-text (or
other content-based) indexing.
Goal: Find the documents most relevant to a certain Query
Dealing with notions of:
Collection of documents
Query (User’s information need)
Notion of Relevancy
3. MODEL
A model is a construct designed to help us understand a
complex system
A particular way of “looking at things”
Models inevitably make simplifying assumptions
What are the limitations of the model?
Different types of models:
Conceptual models
Physical analog models
Mathematical models
4. Retrieval Models
A retrieval model specifies the details
of:
Document representation
Query representation
Retrieval function
Determines a notion of relevance.
Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
5. CLASSES OF RETRIEVAL MODELS
Boolean models (set theoretic)
Extended Boolean
Vector space models
(statistical/algebraic)
Generalized VS
Latent Semantic Indexing
Probabilistic models
6. MODELS OF IR
Boolean model
Based on the notion of sets
Documents are retrieved only if they satisfy Boolean
conditions specified in the query
Does not impose a ranking on retrieved documents
Exact match
Vector space model
Based on geometry, the notion of vectors in high dimensional
space
Documents are ranked based on their similarity to the query
(ranked retrieval)
Best/partial match
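As a minimal sketch of ranked retrieval in the vector space model, the example below scores documents by cosine similarity between raw term-count vectors. The toy documents, the `cosine` helper, and the use of unweighted counts are assumptions made for illustration; real systems usually apply a weighting scheme such as tf-idf.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy collection (not from the slides).
docs = {
    "d1": "good party good music",
    "d2": "democratic party convention",
    "d3": "tax law",
}
vectors = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
query = Counter("good party".split())

# Rank every document by similarity to the query, best match first.
ranked = sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranked)
```

Unlike exact match, d2 still receives a non-zero score even though it contains only one of the two query terms.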
7. Language models
Based on the notion of probabilities and processes for
generating text
Documents are ranked based on the probability that
they generated the query
Best/partial match
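The idea of ranking by the probability that a document generated the query can be sketched with a unigram model. The add-one smoothing, the toy documents, and the function name `log_query_likelihood` are assumptions chosen to keep the example short; they are not taken from the slides.

```python
import math
from collections import Counter

# Hypothetical toy collection (not from the slides).
docs = {
    "d1": "good party good music".split(),
    "d2": "democratic party convention".split(),
}
vocab = {term for words in docs.values() for term in words}

def log_query_likelihood(query_terms, doc_terms):
    """Log P(query | document) under a unigram model with add-one smoothing."""
    counts = Counter(doc_terms)
    total = len(doc_terms) + len(vocab)
    return sum(math.log((counts[t] + 1) / total) for t in query_terms)

query = ["good", "party"]

# Rank documents by how likely each one is to have generated the query.
ranked = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]),
                reverse=True)
print(ranked)
```

Smoothing keeps the score finite for d2, which never mentions "good", so it is ranked lower rather than excluded.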
8. BOOLEAN MODEL
Invented by George Boole (1815-1864)
He devised a system of symbolic logic in which he used
three operators (+, ×, -) to combine statements in
symbolic form.
John Venn named these operators of Boolean logic:
the logical sum (+), logical product (×), and logical
difference (-).
IR systems allow users to express their queries by
using these operators.
9. BOOLEAN MODEL
Each index term is either present or absent
Documents are either Relevant or Not Relevant (no
ranking)
A document is represented as a set of keywords.
Queries are Boolean expressions of
keywords, connected by AND, OR, and
NOT, including the use of brackets to indicate scope.
[[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
Output: Document is relevant or not. No partial
matches or ranking.
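A query of this kind can be evaluated with plain set operations: AND is intersection, OR is union, and NOT is the complement against the whole collection. The sketch below assumes a hypothetical four-document collection and a helper named `postings`; neither comes from the slides.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"rio", "brazil", "hotel"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "beach"},
}
all_docs = set(docs)

def postings(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
result = (
    ((postings("rio") & postings("brazil"))
     | (postings("hilo") & postings("hawaii")))
    & postings("hotel")
    & (all_docs - postings("hilton"))
)
print(sorted(result))  # ['d1', 'd3']
```

The result is an unordered set: d1 and d3 both satisfy the query, and nothing says which one is the better answer.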
10. BOOLEAN RETRIEVAL MODEL
Popular retrieval model because:
Easy to understand for simple queries.
Clean formalism.
Boolean models can be extended to include ranking.
Reasonably efficient implementations possible for
normal queries.
11. BOOLEAN MODEL
Weights assigned to terms are either “0” or “1”
“0” represents “absence”: term isn’t in the document
“1” represents “presence”: term is in the document
Build queries by combining terms with Boolean
operators
AND, OR, NOT
The system returns all documents that satisfy the
query
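The 0/1 weights can be pictured as one binary vector per document over a fixed vocabulary. The following sketch builds such vectors for a hypothetical three-document collection and evaluates the query good AND party AND NOT over; the vocabulary, the documents, and the helper names are illustrative assumptions.

```python
# Hypothetical vocabulary and documents (not from the slides).
vocabulary = ["good", "party", "democratic", "over"]
documents = {
    "d1": "a good party last night",
    "d2": "the democratic party convention",
    "d3": "the good party is over",
}

def incidence(text):
    """Binary weights: 1 if the term occurs in the text, 0 otherwise."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

matrix = {doc_id: incidence(text) for doc_id, text in documents.items()}

def weight(doc_id, term):
    """Look up the 0/1 weight of a term in a document."""
    return matrix[doc_id][vocabulary.index(term)]

# good AND party AND NOT over
matches = [d for d in matrix
           if weight(d, "good") and weight(d, "party") and not weight(d, "over")]
print(matrix["d1"], matches)  # [1, 1, 0, 0] ['d1']
```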
13. Why Boolean Retrieval Works
Boolean operators approximate natural language
Find documents about a good party that is not over
AND can discover relationships between concepts
good party
OR can discover alternate terminology
excellent party, wild party, etc.
NOT can discover alternate meanings
Democratic party
14. The Perfect Query Paradox
Every information need has a perfect set of documents
If not, there would be no sense doing retrieval
Every document set has a perfect query
AND every word in a document to get a query for it
Repeat for each document in the set
OR every document query to get the set query
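The construction above can be sketched mechanically: AND the terms of each wanted document, then OR the per-document queries. The toy collection and the function names are assumptions for illustration. Note that a query built this way would also match any other document containing all the terms of a wanted one, so it is only "perfect" for collections where that does not happen.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"good", "party"},
    "d2": {"wild", "party"},
    "d3": {"democratic", "party"},
}
wanted = ["d1", "d2"]  # the "perfect set" for some information need

def perfect_query(doc_ids):
    """AND every term within a document, then OR the per-document queries."""
    per_doc = ["(" + " AND ".join(sorted(docs[d])) + ")" for d in doc_ids]
    return " OR ".join(per_doc)

def evaluate(doc_ids):
    """Documents that contain every term of at least one wanted document."""
    return sorted(d for d, terms in docs.items()
                  if any(docs[w] <= terms for w in doc_ids))

print(perfect_query(wanted))  # (good AND party) OR (party AND wild)
print(evaluate(wanted))       # ['d1', 'd2']
```

Even for two tiny documents the query is already long, which is the point of the paradox: the query exists, but a user would need to know the documents in advance to write it.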
But can users realistically be expected to formulate this
perfect query?
Boolean query formulation is hard!
15. Why Boolean Retrieval Fails
• Natural language is way more complex
• AND “discovers” nonexistent relationships
– Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
– good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
– Democratic party, party to a lawsuit, …
16. BOOLEAN MODEL
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Efficient for the computer
Simple
Weaknesses
Users must learn Boolean logic
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many documents or none
When do you stop reading? All documents in the result set are
considered “equally good”
What about partial matches? Documents that “don’t quite match” the
query may be useful also
No notion of ranking (exact matching only)
All index terms have equal weight
17. PROBLEMS
Very rigid: AND means all; OR means any.
Difficult to express complex user requests.
Difficult to control the number of documents retrieved.
All matched documents will be returned.
Difficult to rank output.
All matched documents logically satisfy the query.
Difficult to perform relevance feedback.
If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
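The rigidity of AND and OR, and the resulting lack of control over result size, shows up even on a tiny collection. In the sketch below (a hypothetical collection and helper, not from the slides), the same three terms return nothing when joined with AND and the entire collection when joined with OR.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"good", "party"},
    "d2": {"wild", "party"},
    "d3": {"democratic", "party"},
    "d4": {"good", "music"},
}

def postings(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# AND means all: no document contains every term, so nothing is returned.
narrow = postings("good") & postings("wild") & postings("party")

# OR means any: every document contains at least one term.
broad = postings("good") | postings("wild") | postings("party")

print(len(narrow), len(broad))  # 0 4
```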
18. ADVANTAGES & DISADVANTAGES
Advantages
Results are predictable, relatively easy to explain
Many different features can be incorporated
Efficient processing since many documents can be
eliminated from search
Disadvantages
Effectiveness depends entirely on user
Simple queries usually don’t work well
Complex queries are difficult.
19. LIMITATIONS
The first relates to the formulation of search statements.
It has been noted that users are not able to formulate an exact search
statement by the combination of AND, OR and NOT operators,
especially when several query terms are involved.
In such cases either the search statement becomes too narrow or too
broad.
The second limitation relates to the number of retrieved items.
It has been noted that users cannot predict a priori exactly how many
items are to be retrieved to satisfy a given query.
If the search statement is broad, the number of retrieved items may
sometimes be several hundreds and thus it may be quite difficult to
find out the exact information required.
The third limitation is that the Boolean model identifies an item as
relevant solely by checking whether a given query term is present or
absent in a given record in the database.