IRWS Lecture 04 - Boolean Model and Transcrib


Information Retrieval and Web Search

Self Study – Boolean Model

Information Retrieval and Web Search - IRWS - Griffith College 1


Today
 Term Weights
 Boolean IR Model
 Extended Boolean Model



Term Weights
 We have seen how we perform pre-processing to divide a document into terms
that are stored in an index.
 However, not all terms are equally useful for describing document contents.
 For this reason, terms are generally weighted in some way, to reflect this
usefulness.
 Deciding on the importance of a term (i.e. the weight) is not a trivial issue.



Term Weights
 Consider a document collection of 100,000 documents.
 A word that appears in every document is not very useful as an index term, as
it tells us very little about which documents a user might be interested in.
 A word that appears in 5 documents considerably narrows the space of
documents that might be of interest to the user.
 This is the basic rationale behind stop words, but it has wider consequences
also.



Term Weights
 We can also take the number of times a term occurs in a document into
account, for example:
 The term “Obama” will appear in a biography of Barack Obama, for obvious reasons.
 The term “Obama” will also appear in a biography of John McCain, as he once lost an
election to Obama. It will not be as common, however.
 The first document is more relevant to a search for “Obama” but without using
weights, both would be treated equally.
 The greater number of occurrences of the term in the first document should
result in a greater weight.
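
The idea of counting occurrences can be sketched in a few lines of Python. This is only an illustration: the two mini-documents are invented stand-ins for the biographies, and `term_counts` is a hypothetical helper that splits on whitespace rather than performing real pre-processing.

```python
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


# Invented mini-documents standing in for the two biographies.
obama_bio = "obama was born in hawaii obama became president obama served two terms"
mccain_bio = "mccain served in the senate mccain lost the 2008 election to obama"

# Raw occurrence counts of the term "obama" in each document.
tf_obama_bio = term_counts(obama_bio)["obama"]
tf_mccain_bio = term_counts(mccain_bio)["obama"]

# The first document mentions the term more often, so a weighting scheme
# based on these counts would favour it for the query "obama".
print(tf_obama_bio, tf_mccain_bio)
```

Without weights, both documents simply "contain the term"; the counts are what allow the first to be preferred.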



Term Weights
 When talking about term weights, we use the following notation:
 A weight wij > 0 is associated with every index term ki that appears in document dj.
 For an index term that does not appear in the document, wij = 0.
 This weight quantifies the importance of the index term for describing the
document's semantic contents.
 Any time we examine a new IR model, we must firstly consider how its
weighting scheme works.



Term Weights
 Index terms are usually assumed to be independent, though this is a
simplification.
 Consider a document about computer networks. Here, the terms computer and
network are clearly related, as the appearance of one attracts the appearance of
the other.
 You may argue that their weights should be altered to take this dependency into
account.



Term Weights
 Assuming independence simplifies the task of computing the term weights and
allows for fast ranking computation.
 Taking advantage of term correlation is a difficult task.
 Furthermore, nobody has yet demonstrated that taking this into account even
improves document ranking.
 For these reasons, we continue to assume that terms are independent.



Boolean Model



Introduction
 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
 Its greatest advantage is its relative simplicity, and most people can understand
the concepts involved.
 It was adopted by many early commercial bibliographic systems.
 See page 25 of Modern Information Retrieval



Weighting
 The Boolean model only considers whether index terms are present or absent
in a document:
 For a term ki that is contained in a document dj , wij = 1
 For a term ki that is not contained in a document dj , wij = 0
 Thus we can see that the Boolean model does not make use of index term
weighting to differentiate between the usefulness of terms
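
A minimal sketch of this binary weighting, assuming a tiny hypothetical vocabulary and a document already split into tokens (`binary_weights` is an illustrative name, not part of any library):

```python
def binary_weights(vocabulary, doc_tokens):
    """Return w_ij for each vocabulary term: 1 if present in the document, else 0."""
    present = set(doc_tokens)  # repeated occurrences collapse to one
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabulary and document.
vocabulary = ["college", "crime", "poverty"]
doc_tokens = "crime crime crime poverty".split()

w_d1 = binary_weights(vocabulary, doc_tokens)
print(w_d1)
```

Note that "crime" occurs three times and "poverty" once, yet both receive the same weight of 1: the model records presence only.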



Queries
 Queries consist of index terms linked by combinations of the three core
Boolean operators: and, or, not.
 Some systems that look like Boolean retrieval systems use other operators too
(e.g. NEAR) but we will be looking at the traditional Boolean IR model.



Boolean Model Operators: OR
 OR is used to broaden a search by retrieving documents that contain any, some
or all of the keywords used in the search statement.
 It is commonly used to search for synonymous terms or concepts, for example:
 college: 17,320,770 results
 university: 33,685,205 results
 college OR university: 38,702,660 results
 Note that the final result is not the same as the sum of the individual results:
documents containing both terms (e.g. “University College Dublin”) will only
be counted once.
 The effect of using OR is that documents containing at least one of the
specified terms will be returned.
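
As a worked check of the figures above, assuming all three counts are exact and come from the same collection, the inclusion-exclusion principle gives the number of documents that contain both terms:

```latex
|C \cup U| = |C| + |U| - |C \cap U|
\quad\Longrightarrow\quad
|C \cap U| = 17{,}320{,}770 + 33{,}685{,}205 - 38{,}702{,}660 = 12{,}303{,}315
```

Here C and U denote the sets of documents containing "college" and "university" respectively. Roughly 12.3 million documents would therefore contain both terms, which is why the OR count falls short of the simple sum.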



Set theoretic representation of OR
 If C is the set of documents that contain
the term “college” and U is the set of
documents that contain the term
“university”, the query “college OR
university” can be calculated by C ∪ U
(i.e. the area in yellow)



Boolean Model Operators: AND
 AND is used to narrow a search by requiring that every search term appears
in each returned document.
 It is commonly used to search for relationships between two concepts or terms,
for example:
 poverty: 783,447 results
 crime: 2,962,165 results
 poverty AND crime: 1,677 results (at most 783,447 are possible, the smaller of the two counts)
 The more terms or concepts combined in a search with AND, the fewer records
will be retrieved.



Set theoretic representation of AND
 If P is the set of documents that contain the
term “poverty” and C is the set of
documents that contain the term “crime”,
the query “poverty AND crime” can be
calculated by P ∩ C (i.e. the area in yellow).



Boolean Model Operators: NOT
 NOT is used to specifically exclude a term from your search, for example:
 pets: 4,556,515 results
 cats: 3,651,252 results
 pets NOT cats: 81,497 results
 One difficulty with using NOT is that a document that is highly relevant to
what you're searching for may also contain the term you had attempted to
avoid.
 Again, the more terms or concepts excluded from a search with NOT, the fewer
records will be retrieved.



Set theoretic representation of NOT
 If P is the set of documents that contain the
term “pets” and C is the set of documents
that contain the term “cats”, the query “pets
NOT cats” can be calculated by P \ C (i.e.
the area in yellow).
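
The three set-theoretic views (union for OR, intersection for AND, difference for NOT) map directly onto Python's built-in set operators. The sketch below assumes a hypothetical toy collection in which each term is mapped to the set of IDs of the documents that contain it; the IDs are invented for illustration.

```python
# Hypothetical postings: term -> set of document IDs containing that term.
postings = {
    "college": {1, 2, 5},
    "university": {2, 3},
    "poverty": {1, 4},
    "crime": {4, 5, 6},
    "pets": {2, 6, 7},
    "cats": {6},
}

# college OR university: union of the two sets.
or_result = postings["college"] | postings["university"]

# poverty AND crime: intersection of the two sets.
and_result = postings["poverty"] & postings["crime"]

# pets NOT cats: set difference.
not_result = postings["pets"] - postings["cats"]

print(sorted(or_result), sorted(and_result), sorted(not_result))
```

Document 2 contains both "college" and "university" but appears only once in the OR result, and the AND result can never be larger than the smaller of its two input sets.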



Boolean Model: Queries
 A query is essentially a Boolean expression that can be represented in
disjunctive normal form (DNF).
 This is a standard way of representing Boolean expressions.
 The advantage is that expressions in this form are easy for a computer to
process.
 In Boolean expressions, a disjunction is a series of expressions that are
combined using the OR (∨) operator, e.g. (A ∨ B ∨ C).
 A conjunction is a series of expressions combined using the AND (∧) operator,
e.g. (A ∧ B ∧ C).



Boolean Model: Queries
 Disjunctive Normal Form (DNF) is a disjunction of conjunctions. Examples
include:

 These are not in DNF, although they can be modified to be expressed that way:
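
The expressions below are illustrative assumptions chosen to show the distinction; they are not necessarily the examples from the original slide. The first group is already in DNF (an OR of ANDs over terms or negated terms). The second group is not, and each expression is shown with an equivalent DNF form obtained by distributing AND over OR or by applying De Morgan's law.

```latex
% In DNF:
(A \land B) \lor C
\qquad
(A \land \lnot B) \lor (\lnot A \land C)
\qquad
A \lor B

% Not in DNF, with an equivalent DNF form:
A \land (B \lor C) \;\equiv\; (A \land B) \lor (A \land C)
\qquad
\lnot (A \lor B) \;\equiv\; \lnot A \land \lnot B
```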



Boolean Model: Queries
 One advantage of DNF is that it allows us to represent queries (and documents)
as bit vectors, which computers are extremely fast at processing.
 For example, suppose we have a very simple document collection with only
three terms in it: ka, kb and kc.
 Suppose we have the following query:
 This can be represented as bit vectors in DNF like so (one bit per term, in the order ka, kb, kc): 110 ∨ 100



Boolean Model: Queries
 Documents can also be expressed as bit vectors:
 A document containing all three terms: 111
 A document containing ka and kb but not kc : 110
 A document containing just kb: 010
 We can now compare these document representations with the components of
the query. Documents matching any of them can be considered relevant.
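
The comparison can be sketched with Python integers used as bit vectors. The sketch assumes that the leftmost bit stands for ka, the middle bit for kb and the rightmost for kc, and that each query component fixes all three bits, so a document matches a component only when the two vectors are identical. The names `query_dnf`, `documents` and `matches` are hypothetical.

```python
# Query components from the example: 110 OR 100 (bit order assumed: ka, kb, kc).
query_dnf = [0b110, 0b100]

# The three example documents expressed as bit vectors.
documents = {
    "d1": 0b111,  # contains ka, kb and kc
    "d2": 0b110,  # contains ka and kb but not kc
    "d3": 0b010,  # contains only kb
}


def matches(doc_vector: int, components: list) -> bool:
    """A document is relevant if it equals at least one query component."""
    return any(doc_vector == component for component in components)


relevant = [name for name, vector in documents.items() if matches(vector, query_dnf)]
print(relevant)
```

Under these assumptions, the two components together accept exactly the documents that contain ka and do not contain kc, whatever the value of kb.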



Disadvantages
 The model predicts that each document is either relevant or non-relevant to the
query.
 There is no notion of a partial match to the query conditions and so this can
lead to too few documents being retrieved.
 Every document that is considered to be relevant is treated the same, so no
ranking occurs.
 It is known that index term weighting can lead to substantial improvements in
retrieval performance.



Extended Boolean Model



Introduction
 To address the significant difficulties with the traditional Boolean model, the
Extended Boolean Model was proposed by Salton, Fox and Wu in 1983.
 This allows partial matching and caters for the use of term weights to facilitate
ranking results.
 For more, see page 38 in Modern Information Retrieval



Motivation
 Consider the query kx ∧ ky ∧ kz
 Using the traditional Boolean Model, only documents containing all three terms will be
returned.
 However, there is an argument that a document containing two of the terms would be
more relevant than a document containing none.
 Using the traditional Boolean model, neither of these is returned, so the user has to
modify the query.
 Obviously, a document containing all the terms should still rank higher
than one that contains fewer query terms.



Motivation
 Similarly, consider the query kx ∨ ky ∨ kz
 Under the Boolean Model, any document containing any of the terms will be returned
and will be treated in the same way.
 Again, it is logical that a document containing all three terms may have more relevance
than one that only contains a single query term.
 Some form of document ranking would be desirable in this instance also.



Term Weights
 The Extended Boolean Model makes use of term weights, where 0 ≤ wij ≤ 1
 We will go into more detail on exactly how term weights are calculated later
in the module, but for now we just need to know that these weights lie between
0 (for a term that does not appear in the document at all) and 1 (for a very
useful term that appears in the document).
 Unlike the traditional Boolean model, the weights can lie somewhere between
0 and 1 also.



Partial Matching
 The Extended Boolean Model is the first we have seen that facilitates partial
matching by using non-binary term weights.
 In models like this, each document must have a similarity score calculated for
it, which measures how similar it is to the given query.
 This is usually shown as sim(q, d) (i.e. the similarity between a query q and a
document d).
 These models return a ranked list of documents, where the documents with the
highest similarity scores are at the top of the list.
 In this way, it is hoped that the most relevant documents are at the top of the
result set, so that the user can find relevant information more easily.



Illustration
 To illustrate how the Extended Boolean Model works, we will consider a very
simple system where there are only two terms: kx and ky .
 The principles that apply to this type of simple system also apply in more
realistic situations where more terms are involved.
 We display these two dimensions in the plane, with the weights wxj (the
weight of term kx in document dj ) and wyj (the weight of term ky in
document dj ) on the x and y axes, respectively.



Illustration
 Every document can be positioned
somewhere on this graph.
 A document where neither term is
important will be near the bottom-
left (0,0).
 A document where term ky is
important will be close to the top.
 A document where term kx is
important will be toward the right.



Illustration
 Here, document d1 is a document
where term kx is very important (it
is very close to the right-hand side).
 Term ky is moderately important in
this document (it is not as close to
the top as it is to the right).



Illustration: OR operator
 Using the traditional Boolean
Model, given the query “kx OR
ky”, only documents located at
(0,0) would not be returned.
 Therefore the documents preferred
are those that are furthest from this
point.



Illustration: OR operator
 We can measure this distance by using the following:

    $$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{xj}^2 + w_{yj}^2}{2}}$$

 Note: this is based on the formula for the distance between two points in
coordinate geometry, where one of the points is (0,0).
 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1



Illustration: AND operator
 Using the traditional Boolean
Model, given the query “kx AND
ky”, only documents located at
(1,1) would be returned.
 Therefore the documents preferred
are those that are closest to this
point.



Illustration: AND operator
 We can measure this distance by using the following:

    $$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{xj})^2 + (1 - w_{yj})^2}{2}}$$

 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1.
 The distance from position (1,1) is subtracted from 1 so as to give a higher
similarity score to those documents that minimise this distance.
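
A minimal sketch of both two-term similarity measures, assuming the normalised Euclidean forms described above: the distance from (0,0) for OR, and one minus the distance from (1,1) for AND, each with a division by 2 inside the square root. The function names and the three example documents with their weights are invented for illustration.

```python
import math


def sim_or(w_x: float, w_y: float) -> float:
    """Normalised distance of the document from (0, 0)."""
    return math.sqrt((w_x ** 2 + w_y ** 2) / 2)


def sim_and(w_x: float, w_y: float) -> float:
    """One minus the normalised distance of the document from (1, 1)."""
    return 1 - math.sqrt(((1 - w_x) ** 2 + (1 - w_y) ** 2) / 2)


# Hypothetical documents given as (weight of kx, weight of ky).
docs = {"d1": (0.9, 0.5), "d2": (0.2, 0.1), "d3": (1.0, 1.0)}

# Rank documents for each query type, highest similarity first.
ranked_or = sorted(docs, key=lambda d: sim_or(*docs[d]), reverse=True)
ranked_and = sorted(docs, key=lambda d: sim_and(*docs[d]), reverse=True)

for name, (w_x, w_y) in docs.items():
    print(name, round(sim_or(w_x, w_y), 3), round(sim_and(w_x, w_y), 3))
```

Both measures stay within the range 0 to 1, and a document such as d1 that only partially satisfies the AND query still receives a non-zero score instead of being rejected outright.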



Illustration: Combining AND and OR
 Processing of more general queries is done by grouping the operations.
 For example, consider the query
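
As a hedged illustration of grouping, take the assumed query (kx AND ky) OR kz; this particular query is chosen here for demonstration and is not necessarily the one used on the slide. The inner AND group is scored first, and its score is then treated as if it were a single term weight in the outer OR. The sketch assumes the normalised Euclidean formulae for two operands, and `sim_query` is a hypothetical name.

```python
import math


def sim_or(a: float, b: float) -> float:
    """Normalised distance from (0, 0) for two operands."""
    return math.sqrt((a ** 2 + b ** 2) / 2)


def sim_and(a: float, b: float) -> float:
    """One minus the normalised distance from (1, 1) for two operands."""
    return 1 - math.sqrt(((1 - a) ** 2 + (1 - b) ** 2) / 2)


def sim_query(w_x: float, w_y: float, w_z: float) -> float:
    """Score the assumed query (kx AND ky) OR kz by evaluating the inner group first."""
    inner = sim_and(w_x, w_y)   # score of the grouped sub-query (kx AND ky)
    return sim_or(inner, w_z)   # the sub-query score acts as an operand of OR


print(round(sim_query(1.0, 1.0, 0.0), 4))  # satisfies only the AND group
print(round(sim_query(1.0, 0.0, 0.0), 4))  # satisfies half of the AND group
```

A document that contains only kx still receives a small positive score, which the traditional Boolean model would not give it.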



Extended Boolean Model in m Dimensions
 These formulae are not limited to only two terms. Both extend to cater for
multiple terms in the query.
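 For queries with m terms, the following formulae apply:

    $$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{1j}^2 + w_{2j}^2 + \dots + w_{mj}^2}{m}}$$

    $$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{1j})^2 + (1 - w_{2j})^2 + \dots + (1 - w_{mj})^2}{m}}$$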



The p-norm Model
 An interesting variation of the Extended Boolean Model allows users to affect
the behaviour of the system.
 This involves the introduction of an additional p-norm variable.
 For p = ∞, the ranking will be similar to that of the basic Boolean Model.
 Setting p = 1 results in behaviour more similar to the Vector Space Model,
which we will see later in the module.
 Thus the user can vary the effects of things such as partial matches, depending
on the nature of the specific query.
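
A sketch of the p-norm generalisation for m terms, assuming the standard form in which the exponent 2 and the square root of the Euclidean formulae are replaced by p and 1/p. The function names and the example weights are hypothetical.

```python
def sim_or_p(weights, p: float) -> float:
    """p-norm OR similarity over m term weights (each between 0 and 1)."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)


def sim_and_p(weights, p: float) -> float:
    """p-norm AND similarity over m term weights (each between 0 and 1)."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)


weights = (0.9, 0.5)  # hypothetical weights of two query terms in one document

for p in (1, 2, 100):
    print(p, round(sim_or_p(weights, p), 3), round(sim_and_p(weights, p), 3))
```

With p = 1 both formulae reduce to the plain average of the weights, so AND and OR are no longer distinguished. As p grows, the OR score approaches the largest weight and the AND score approaches the smallest, which for binary weights reproduces strict Boolean behaviour. The value p = 2 gives the Euclidean case.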



Extended Boolean Model
 Advantages
 Extends the Boolean model to allow for term weights and partial matching
 Expert users may alter the system's behaviour as appropriate (using the p-norm
variation).
 Disadvantages
 Assumption of term independence
 Not widely used in practice
 Users must be familiar with specialist query language
