IRWS Lecture 04 - Boolean Model and Transcrib

Information Retrieval and Web Search

Self Study – Boolean Model

 Term Weights
 Boolean IR Model
 Extended Boolean Model

Term Weights
 We have seen how we perform pre-processing to divide a document into terms
that are stored in an index.
 However, not all terms are equally useful for describing document contents.
 For this reason, terms are generally weighted in some way to reflect their usefulness.
 Deciding on the importance of a term (i.e. the weight) is not a trivial issue.

Term Weights
 Consider a document collection of 100,000 documents.
 A word that appears in every document is not very useful as an index term, as
it tells us very little about which documents a user might be interested in.
 A word that appears in 5 documents considerably narrows the space of
documents that might be of interest to the user.
 This is the basic rationale behind stop words, but it has wider consequences

Term Weights
 We can also take the number of times a term occurs in a document into
account, for example:
 The term “Obama” will appear in a biography of Barack Obama, for obvious reasons.
 The term “Obama” will also appear in a biography of John McCain, as he once lost an
election to Obama. It will not be as common, however.
 The first document is more relevant to a search for “Obama” but without using
weights, both would be treated equally.
 The greater number of occurrences of the term in the first document should
result in a greater weight.
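
As a rough illustration of counting occurrences, the sketch below tallies a term in two toy documents. The document texts and the helper name term_counts are made up for this example, and a raw count is only the crudest possible weight, not a weighting scheme the lecture prescribes.

```python
from collections import Counter

def term_counts(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Two toy "documents" (hypothetical text, far shorter than real biographies).
obama_bio = "obama was born in hawaii obama studied law obama became president"
mccain_bio = "mccain served in the senate and lost the 2008 election to obama"

count_in_obama_bio = term_counts(obama_bio)["obama"]
count_in_mccain_bio = term_counts(mccain_bio)["obama"]

# The first document mentions the term more often, so a count-based
# weight would rank it higher for the query "obama".
print(count_in_obama_bio, count_in_mccain_bio)
```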

Term Weights
 When talking about term weights, we use the following notation:
 A weight wij > 0 is associated with every index term ki of document dj.
 For an index term that does not appear in the document, wij = 0.
 This weight quantifies the importance of the index term for describing the
document's semantic contents.
 Any time we examine a new IR model, we must firstly consider how its
weighting scheme works.

Term Weights
 Index terms are usually assumed to be independent, though this is a simplification.
 Consider a document about computer networks. Here, the terms computer and
network are clearly related, as the appearance of one attracts the appearance of
the other.
 You may argue that their weights should be altered to take this dependency into account.

Term Weights
 Assuming independence simplifies the task of computing the term weights and
allows for fast ranking computation.
 Taking advantage of term correlation is a difficult task.
 Furthermore, nobody has yet demonstrated that taking this into account even
improves document ranking.
 For these reasons, we continue to assume that terms are independent.

Boolean Model

 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
 Its greatest advantage is its relative simplicity, and most people can understand
the concepts involved.
 It was adopted by many early commercial bibliographic systems.
 See page 25 of Modern Information Retrieval

 The Boolean model only considers whether index terms are present or absent
in a document:
 For a term ki that is contained in a document dj, wij = 1
 For a term ki that is not contained in a document dj, wij = 0
 Thus we can see that the Boolean model does not make use of index term
weighting to differentiate between the usefulness of terms.
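
A minimal sketch of these binary weights, assuming a document is already a list of terms. The function name binary_weights and the sample vocabulary are hypothetical.

```python
def binary_weights(vocabulary, document_terms):
    """Return w_ij = 1 if term k_i occurs in the document, else 0."""
    present = set(document_terms)
    return {term: 1 if term in present else 0 for term in vocabulary}

vocabulary = ["college", "university", "poverty"]
document = ["university", "college", "university", "dublin"]

weights = binary_weights(vocabulary, document)
print(weights)
```

Note that "university" occurs twice in the sample document but still receives a weight of 1: the Boolean model records presence only.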

 Queries consist of index terms linked by combinations of the three core
Boolean operators: and, or, not.
 Some systems that look like Boolean retrieval systems use other operators too
(e.g. NEAR) but we will be looking at the traditional Boolean IR model.

Boolean Model Operators: OR
 OR is used to broaden a search by retrieving any, some or all of the keywords
used in the search statement.
 It is commonly used to search for synonymous terms or concepts, for example:
 college: 17,320,770 results
 university: 33,685,205 results
 college OR university: 38,702,660 results
 Note that the final result is not the same as the sum of the individual results:
documents containing both terms (e.g. “University College Dublin”) will only
be counted once.
 The effect of using OR is that documents containing any number of the terms
specified will be returned.
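
If the three result counts above are taken as exact, the number of documents containing both terms follows from the inclusion-exclusion rule for two sets. Here C and U denote the sets of documents containing "college" and "university" respectively.

```latex
\begin{align*}
|C \cup U| &= |C| + |U| - |C \cap U| \\
|C \cap U| &= |C| + |U| - |C \cup U| \\
           &= 17{,}320{,}770 + 33{,}685{,}205 - 38{,}702{,}660 \\
           &= 12{,}303{,}315
\end{align*}
```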

Set theoretic representation of OR
 If C is the set of documents that contain the term “college” and U is the set of
documents that contain the term “university”, the query “college OR university” can be
calculated by C ∪ U (i.e. the area in yellow).

Boolean Model Operators: AND
 AND is used to narrow a search by ensuring that all the search terms appear in
the results.
 It is commonly used to search for relationships between two concepts or terms,
for example:
 poverty: 783,447 results
 crime: 2,962,165 results
 poverty AND crime: 1,677 results (at most 783,447 are possible, the count for “poverty”)
 The more terms or concepts combined in a search with AND, the fewer records
will be retrieved.

Set theoretic representation of AND
 If P is the set of documents that contain the term “poverty” and C is the set of
documents that contain the term “crime”, the query “poverty AND crime” can be
calculated by P ∩ C (i.e. the area in yellow).

Boolean Model Operators: NOT
 NOT is used to specifically exclude a term from your search, for example:
 pets: 4,556,515 results
 cats: 3,651,252 results
 pets NOT cats: 81,497 results
 One difficulty with using NOT is that a document that is highly relevant to
what you're searching for may also contain the term you had attempted to exclude.
 Again, the more terms or concepts combined in a search with NOT, the fewer
records will be retrieved.

Set theoretic representation of NOT
 If P is the set of documents that contain the
term “pets” and C is the set of documents
that contain the term “cats”, the query “pets
NOT cats” can be calculated by P \ C (i.e.
the area in yellow).
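
The three set-theoretic representations map directly onto set operations. The sketch below assumes a toy inverted index from terms to sets of document ids; the index contents and the helper name docs are hypothetical.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "pets": {1, 2, 3, 4},
    "cats": {2, 3, 5},
    "dogs": {1, 5},
}

def docs(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

or_result = docs("cats") | docs("dogs")    # cats OR dogs: set union
and_result = docs("pets") & docs("cats")   # pets AND cats: set intersection
not_result = docs("pets") - docs("cats")   # pets NOT cats: set difference

print(or_result, and_result, not_result)
```

Document 5 contains both "cats" and "dogs", so the OR result has four documents rather than five: a document matching both terms is counted once.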

Boolean Model: Queries
 A query is essentially a Boolean expression that can be represented in
disjunctive normal form (DNF).
 This is a standard way of representing Boolean expressions.
 The advantage is that expressions in this form are easy for a computer to process.
 In Boolean expressions, a disjunction is a series of expressions that are
combined using the OR (∨) operator, e.g. (A ∨ B ∨ C).
 A conjunction is a series of expressions combined using the AND (∧) operator,
e.g. (A ∧ B ∧ C).

Boolean Model: Queries
 Disjunctive Normal Form (DNF) is a disjunction of conjunctions. Examples

 These are not in DNF, although they can be modified to be expressed that way:
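
As an illustration over hypothetical terms A, B and C (these expressions are chosen here and are not necessarily the lecture's own), the first two lines below are in DNF. The last two are not, and each is shown with an equivalent DNF form obtained by distributing AND over OR and by De Morgan's law.

```latex
\begin{align*}
&(A \land B) \lor C \\
&(A \land \lnot B) \lor (B \land C) \lor (\lnot A \land \lnot C) \\[6pt]
&A \land (B \lor C) \;\equiv\; (A \land B) \lor (A \land C) \\
&\lnot (A \lor B) \;\equiv\; \lnot A \land \lnot B
\end{align*}
```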

Boolean Model: Queries
 One advantage of DNF is that it allows us to represent queries (and documents)
as bit vectors, which computers are extremely fast at processing.
 For example, suppose we have a very simple document collection with only
three terms in it: ka, kb and kc.
 Suppose we have the following query:
 This can be represented as bit vectors in DNF like so: 110 ∨ 100

Boolean Model: Queries
 Documents can also be expressed as bit vectors:
 A document containing all three terms: 111
 A document containing ka and kb but not kc: 110
 A document containing just kb: 010
 We can now compare these document representations with the components of
the query. Documents matching any of them can be considered relevant.
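
A minimal sketch of this comparison, using the query components 110 and 100 and the three example documents. It assumes each DNF component specifies all three terms, so a match is exact equality of bit vectors; the labels d1 to d3 and the function name are hypothetical.

```python
# Bit order is (ka, kb, kc), so 0b110 means "ka and kb present, kc absent".
query_components = {0b110, 0b100}

documents = {
    "d1": 0b111,  # all three terms
    "d2": 0b110,  # ka and kb but not kc
    "d3": 0b010,  # just kb
}

def is_relevant(doc_vector, components):
    """A document is relevant if its vector equals any query component."""
    return doc_vector in components

relevant = sorted(name for name, vector in documents.items()
                  if is_relevant(vector, query_components))
print(relevant)
```

Only the document with vector 110 matches. The other two are rejected outright, however close they are to the query.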

 The model predicts that each document is either relevant or non-relevant to the query.
 There is no notion of a partial match to the query conditions and so this can
lead to too few documents being retrieved.
 Every document that is considered to be relevant is treated the same, so no
ranking occurs.
 It is known that index term weighting can lead to substantial improvements in
retrieval performance.

Extended Boolean Model

 To address the significant difficulties with the traditional Boolean model, the
Extended Boolean Model was proposed by Salton, Fox and Wu in 1983.
 This allows partial matching and caters for the use of term weights to facilitate
ranking results.
 For more, see page 38 in Modern Information Retrieval

 Consider the query kx ∧ ky ∧ kz
 Using the traditional Boolean Model, only documents containing all three terms will be
returned.
 However, there is an argument that a document containing two of the terms would be
more relevant than a document containing none.
 Using the traditional Boolean model, neither of these is returned, so the user has to
modify the query.
 Obviously, it is preferable that a document containing all the terms would still be better
than one that contains fewer query terms.

 Similarly, consider the query kx ∨ ky ∨ kz
 Under the Boolean Model, any document containing any of the terms will be returned
and will be treated in the same way.
 Again, it is logical that a document containing all three terms may have more relevance
than one that only contains a single query term.
 Some form of document ranking would be desirable in this instance also.

Term Weights
 The Extended Boolean Model makes use of term weights, where 0 ≤ wij ≤ 1
 We will go into more detail on how exactly term weights are calculated later
in the module but for now we just need to know that these weights lie between
0 (for a term that does not appear in the document at all) and 1 (for a very
useful term that appears in the document).
 Unlike the traditional Boolean model, the weights can lie somewhere between
0 and 1 also.

Partial Matching
 The Extended Boolean Model is the first we have seen that facilitates partial
matching by using non-binary term weights.
 In models like this, each document must have a similarity score calculated for
it, which measures how similar it is to the given query.
 This is usually shown as sim(q, d) (i.e. the similarity between a query q and a
document d).
 These models return a ranked list of documents, where the documents with the
highest similarity scores are at the top of the list.
 In this way, it is hoped that the most relevant documents are at the top of the
result set, so that the user can find relevant information more easily.

 To illustrate how the Extended Boolean Model works, we will consider a very
simple system where there are only two terms: kx and ky.
 The principles that apply to this type of simple system also apply in more
realistic situations where more terms are involved.
 We display these two dimensions in the plane, with the weights wxj (the
weight of term kx in document dj) and wyj (the weight of term ky in
document dj) on the x and y axes, respectively.

 Every document can be positioned
somewhere on this graph.
 A document where neither term is
important will be near the bottom-
left (0,0).
 A document where term ky is
important will be close to the top.
 A document where term kx is
important will be toward the right.

 Here, document d1 is a document
where term kx is very important (it
is very close to the right-hand side).
 Term ky is moderately important in
this document (it is not as close to
the top as it is to the right).

Illustration: OR operator
 Using the traditional Boolean
Model, given the query “kx OR
ky”, only documents located at
(0,0) would not be returned.
 Therefore the documents preferred are those that are furthest from this point.

Illustration: OR operator
 We can measure this distance by using the following:

$$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{xj}^2 + w_{yj}^2}{2}}$$

 Note: this is based on the formula for the distance between two points in
coordinate geometry, where one of the points is (0,0).
 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1

Illustration: AND operator
 Using the traditional Boolean Model, given the query “kx AND ky”, only documents
located at (1,1) would be returned.
 Therefore the documents preferred are those that are closest to this point.

Illustration: AND operator
 We can measure this distance by using the following:

$$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{xj})^2 + (1 - w_{yj})^2}{2}}$$

 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1.
 The distance from position (1,1) is subtracted from 1 so as to give a higher
similarity score to those documents that minimise this distance.
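
The two-term similarity scores described above can be sketched as follows: the OR score is the normalised distance from (0,0), and the AND score is one minus the normalised distance from (1,1). The function names and the document weights are hypothetical.

```python
import math

def sim_or(w_x, w_y):
    """Normalised distance from (0, 0): higher is better for 'kx OR ky'."""
    return math.sqrt((w_x ** 2 + w_y ** 2) / 2)

def sim_and(w_x, w_y):
    """One minus the normalised distance from (1, 1): higher is better for 'kx AND ky'."""
    return 1 - math.sqrt(((1 - w_x) ** 2 + (1 - w_y) ** 2) / 2)

# Hypothetical weights (w_x, w_y) for three documents.
docs = {"d1": (0.9, 0.5), "d2": (1.0, 1.0), "d3": (0.0, 0.0)}

# Rank documents for "kx AND ky", best first.
ranked_and = sorted(docs, key=lambda name: sim_and(*docs[name]), reverse=True)
print(ranked_and)
```

A document at (1,1) scores 1 under both operators and a document at (0,0) scores 0. Everything else falls in between, which is what makes ranking possible.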

Illustration: Combining AND and OR
 Processing of more general queries is done by grouping the operations.
 For example, consider the query
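
As a sketch of grouping, assume the hypothetical query (kx AND ky) OR kz, chosen here for illustration. The inner AND is scored first, and its score is then treated as one of the operands of the outer OR. The tuple-based query format, the function names and the document weights are all assumptions.

```python
import math

def sim_or(scores):
    """Normalised Euclidean distance from the all-zeros point."""
    return math.sqrt(sum(s ** 2 for s in scores) / len(scores))

def sim_and(scores):
    """One minus the normalised Euclidean distance from the all-ones point."""
    return 1 - math.sqrt(sum((1 - s) ** 2 for s in scores) / len(scores))

def evaluate(query, weights):
    """Score a nested query against one document's term weights.

    `query` is either a term name (str) or a tuple such as
    ("or", ("and", "kx", "ky"), "kz").
    """
    if isinstance(query, str):
        return weights.get(query, 0.0)   # absent terms have weight 0
    operator, *operands = query
    scores = [evaluate(operand, weights) for operand in operands]
    if operator == "and":
        return sim_and(scores)
    if operator == "or":
        return sim_or(scores)
    raise ValueError(f"unknown operator: {operator}")

doc_weights = {"kx": 0.8, "ky": 0.6, "kz": 0.0}
query = ("or", ("and", "kx", "ky"), "kz")
score = evaluate(query, doc_weights)
print(round(score, 4))
```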

Extended Boolean Model in m Dimensions
 These formulae are not limited to only two terms. Both extend to cater for
multiple terms in the query.
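 For queries with m terms, the following formulae apply:

$$\mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{1j}^2 + w_{2j}^2 + \dots + w_{mj}^2}{m}}$$

$$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{1j})^2 + (1 - w_{2j})^2 + \dots + (1 - w_{mj})^2}{m}}$$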

The p-norm Model
 An interesting variation of the Extended Boolean Model allows users to affect
the behaviour of the system.
 This involves the introduction of an additional p-norm variable.
 For p = ∞, the ranking will be similar to that of the basic Boolean Model.
 Setting p = 1 results in behaviour more similar to the Vector Space Model,
which we will see later in the module.
 Thus the user can vary the effects of things such as partial matches, depending
on the nature of the specific query.
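
A minimal sketch of how p changes the scores, assuming the usual p-norm generalisation in which the exponent 2 and the square root of the Euclidean formulae are replaced by p and the p-th root. The function names and the weights are hypothetical.

```python
def pnorm_or(weights, p):
    """OR similarity under the p-norm: ((w_1^p + ... + w_m^p) / m)^(1/p)."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def pnorm_and(weights, p):
    """AND similarity under the p-norm: 1 - (((1-w_1)^p + ... + (1-w_m)^p) / m)^(1/p)."""
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)

weights = [0.9, 0.5]  # hypothetical term weights for one document

for p in (1, 2, 200):
    print(p, round(pnorm_or(weights, p), 3), round(pnorm_and(weights, p), 3))
```

With p = 1 both scores reduce to the average of the weights, so AND and OR rank documents identically. As p grows, the OR score approaches the largest weight and the AND score approaches the smallest, which is closer to strict Boolean behaviour.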

Extended Boolean Model
 Advantages
 Extends the Boolean model to allow for term weights and partial matching
 Expert users may alter the system's behaviour as appropriate (using the p-norm variable)
 Disadvantages
 Assumption of term independence
 Not widely used in practice
 Users must be familiar with a specialist query language

