IRWS Lecture 04 - Boolean Model and Transcrib

Information Retrieval and Web Search
Self Study – Boolean Model
Information Retrieval and Web Search - IRWS - Griffith College 1

Today
 Term Weights
 Boolean IR Model
 Extended Boolean Model

Term Weights
 We have seen how we perform pre-processing to divide a document into terms
that are stored in an index.
 However, not all terms are equally useful for describing document contents.
 For this reason, terms are generally weighted in some way, to reflect this
usefulness.
 Deciding on the importance of a term (i.e. the weight) is not a trivial issue.

Term Weights
 Consider a document collection of 100,000 documents.
 A word that appears in every document is not very useful as an index term, as
it tells us very little about which documents a user might be interested in.
 A word that appears in 5 documents considerably narrows the space of
documents that might be of interest to the user.
 This is the basic rationale behind stop words, but it has wider consequences
also.
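A minimal sketch of this idea, assuming a hypothetical three-document toy collection in which each document is reduced to its set of terms. The names docs, document_frequency and fraction_remaining are illustrative only.

```python
# Hypothetical toy collection: each document is represented by its set of terms.
docs = {
    "d1": {"the", "boolean", "model"},
    "d2": {"the", "vector", "model"},
    "d3": {"the", "index"},
}

def document_frequency(term, collection):
    """Count how many documents contain the term at least once."""
    return sum(1 for terms in collection.values() if term in terms)

def fraction_remaining(term, collection):
    """Fraction of the collection still in play after filtering on the term."""
    return document_frequency(term, collection) / len(collection)
```

Here "the" leaves the whole collection in play, while "index" narrows it to a single document, which is why the rarer term is the more useful index term.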

Term Weights
 We can also take the number of times a term occurs in a document into
account, for example:
 The term “Obama” will appear in a biography of Barack Obama, for obvious reasons.
 The term “Obama” will also appear in a biography of John McCain, as he once lost an
election to Obama. It will not be as common, however.
 The first document is more relevant to a search for “Obama” but without using
weights, both would be treated equally.
 The greater number of occurrences of the term in the first document should
result in a greater weight.
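The occurrence counts described above can be sketched as follows. The two strings are short hypothetical stand-ins for the biographies, and the whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter

def term_frequency(term, text):
    """Number of times `term` occurs in `text` (lowercased, split on whitespace)."""
    return Counter(text.lower().split())[term.lower()]

# Hypothetical stand-ins for the two biographies.
biography_a = "obama was born in hawaii and obama studied law before obama became president"
biography_b = "mccain served in the senate and lost the 2008 election to obama"

tf_a = term_frequency("Obama", biography_a)
tf_b = term_frequency("Obama", biography_b)
```

Both texts contain the term, but the higher count for the first is the signal a weighting scheme can use to prefer it.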

Term Weights
 When talking about term weights, we use the following notation:
 A weight wij > 0 is associated with every index term ki that appears in document dj.
 For an index term that does not appear in the document, wij = 0.
 This weight quantifies the importance of the index term for describing the
document's semantic contents.
 Any time we examine a new IR model, we must firstly consider how its
weighting scheme works.

Term Weights
 Index terms are usually assumed to be independent, though this is a
simplification.
 Consider a document about computer networks. Here, the terms computer and
network are clearly related, as the appearance of one attracts the appearance of
the other.
 You may argue that their weights should be altered to take this dependency into
account.

Term Weights
 Assuming independence simplifies the task of computing the term weights and
allows for fast ranking computation.
 Taking advantage of term correlation is a difficult task.
 Furthermore, nobody has yet demonstrated that taking this into account even
improves document ranking.
 For these reasons, we continue to assume that terms are independent.

Boolean Model

Introduction
 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
 Its greatest advantage is its relative simplicity, and most people can understand
the concepts involved.
 It was adopted by many early commercial bibliographic systems.
 See page 25 of Modern Information Retrieval

Weighting
 The Boolean model only considers whether index terms are present or absent
in a document:
 For a term ki that is contained in a document dj , wij = 1
 For a term ki that is not contained in a document dj , wij = 0
 Thus we can see that the Boolean model does not make use of index term
weighting to differentiate between the usefulness of terms
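A small sketch of this binary weighting, assuming a hypothetical token list for one document. Note how the repeated occurrences of a term make no difference to its weight.

```python
def binary_weight(term, doc_tokens):
    """Boolean-model weight: 1 if the term occurs in the document, otherwise 0."""
    return 1 if term in doc_tokens else 0

# Hypothetical document in which "obama" occurs three times.
doc_tokens = ["obama", "obama", "election", "obama"]

weights = {t: binary_weight(t, doc_tokens) for t in ["obama", "election", "senate"]}
```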

Queries
 Queries consist of index terms linked by combinations of the three core
Boolean operators: AND, OR and NOT.
 Some systems that look like Boolean retrieval systems use other operators too
(e.g. NEAR) but we will be looking at the traditional Boolean IR model.

Boolean Model Operators: OR
 OR is used to broaden a search by retrieving documents that contain one or
more of the keywords used in the search statement.
 It is commonly used to search for synonymous terms or concepts, for example:
 college: 17,320,770 results
 university: 33,685,205 results
 college OR university: 38,702,660 results
 Note that the final result is not the same as the sum of the individual results:
documents containing both terms (e.g. “University College Dublin”) will only
be counted once.
 The effect of using OR is that documents containing any number of the terms
specified will be returned.

Set theoretic representation of OR
 If C is the set of documents that contain the term “college” and U is the set of
documents that contain the term “university”, the query “college OR university”
can be calculated by C ∪ U (i.e. the area in yellow).
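The union can be sketched with Python sets. The document IDs below are hypothetical. The sketch also shows why the OR count is smaller than the sum of the individual counts.

```python
# Hypothetical document IDs containing each term.
college = {1, 2, 3, 5}
university = {3, 5, 8}

either = college | university   # union: documents containing at least one term
both = college & university     # overlap: documents containing both terms

# Inclusion-exclusion: documents in the overlap are counted once, not twice.
expected_size = len(college) + len(university) - len(both)
```

Applying the same identity to the counts quoted above gives 17,320,770 + 33,685,205 - 38,702,660 = 12,303,315 documents containing both terms.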

Boolean Model Operators: AND
 AND is used to narrow a search by ensuring that all the search terms
appear in the results.
 It is commonly used to search for relationships between two concepts or terms,
for example:
 poverty: 783,447 results
 crime: 2,962,165 results
 poverty AND crime: 1,677 results (at most 783,447 are possible, the size of the smaller set)
 The more terms or concepts combined in a search with AND, the fewer records
will be retrieved.

Set theoretic representation of AND
 If P is the set of documents that contain the term “poverty” and C is the set of
documents that contain the term “crime”, the query “poverty AND crime” can be
calculated by P ∩ C (i.e. the area in yellow).
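A sketch of the intersection with hypothetical document IDs. The third term, inequality, is added only to show that each extra AND term can shrink the result but never grow it.

```python
# Hypothetical document IDs containing each term.
poverty = {1, 4, 6, 9}
crime = {2, 4, 7, 9, 10, 12}
inequality = {4, 12}

two_terms = poverty & crime                  # documents containing both terms
three_terms = poverty & crime & inequality   # one more AND term

# The result can never be larger than the smallest input set.
upper_bound = min(len(poverty), len(crime))
```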

Boolean Model Operators: NOT
 NOT is used to specifically exclude a term from your search, for example:
 pets: 4,556,515 results
 cats: 3,651,252 results
 pets NOT cats: 81,497 results
 One difficulty with using NOT is that a document that is highly relevant to
what you're searching for may also contain the term you had attempted to
avoid.
 Again, the more terms or concepts excluded from a search with NOT, the fewer
records will be retrieved.

Set theoretic representation of NOT
 If P is the set of documents that contain the
term “pets” and C is the set of documents
that contain the term “cats”, the query “pets
NOT cats” can be calculated by P \ C (i.e.
the area in yellow).
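The set difference can be sketched in the same way. The IDs are hypothetical, and document 2 stands for a document about pets in general that happens to mention cats once.

```python
# Hypothetical document IDs containing each term.
pets = {1, 2, 3, 4}
cats = {2, 3, 4, 7}

pets_not_cats = pets - cats   # set difference: in pets but not in cats

# Document 2 mentions cats only in passing, yet it is excluded all the same.
lost_document = 2 in pets and 2 not in pets_not_cats
```

This is the difficulty noted above: NOT cannot tell a passing mention from a document that is really about the excluded term.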

Boolean Model: Queries
 A query is essentially a Boolean expression that can be represented in
disjunctive normal form (DNF).
 This is a standard way of representing Boolean expressions.
 The advantage is that expressions in this form are easy for a computer to
process.
 In Boolean expressions, a disjunction is a series of expressions that are
combined using the OR (∨) operator, e.g. (A ∨ B ∨ C).
 A conjunction is a series of expressions combined using the AND (∧) operator,
e.g. (A ∧ B ∧ C).

 Disjunctive Normal Form (DNF) is a disjunction of conjunctions. Examples
include:
 These are not in DNF, although they can be modified to be expressed that way:
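As an illustration, using hypothetical propositions A, B, C and D chosen for this sketch, the first two expressions below are in DNF. The last two are not, and each is shown with an equivalent DNF form obtained by distributing AND over OR or by applying De Morgan's law.

```latex
\begin{align*}
&(A \land B) \lor (C \land D) \\
&(A \land \lnot B) \lor C \\
&A \land (B \lor C) \;\equiv\; (A \land B) \lor (A \land C) \\
&\lnot (A \lor B) \;\equiv\; \lnot A \land \lnot B
\end{align*}
```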

 One advantage of DNF is that it allows us to represent queries (and documents)
as bit vectors, which computers are extremely fast at processing.
 For example, suppose we have a very simple document collection with only
three terms in it: ka, kb and kc.
 Suppose we have the following query:
 This can be represented as bit vectors in DNF like so: 110 ∨ 100

 Documents can also be expressed as bit vectors:
 A document containing all three terms: 111
 A document containing ka and kb but not kc : 110
 A document containing just kb: 010
 We can now compare these document representations with the components of
the query. Documents matching any of them can be considered relevant.
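The comparison can be sketched as below. The query ka AND NOT kc is an assumption made for this sketch: it is one query whose DNF components are 110 and 100. The function names are illustrative.

```python
from itertools import product

TERMS = ("ka", "kb", "kc")

def query(ka, kb, kc):
    """Hypothetical query: ka AND NOT kc."""
    return ka and not kc

def dnf_components(predicate, n_terms):
    """Try every 0/1 assignment of the terms and keep those satisfying the query."""
    return {
        "".join(str(int(bit)) for bit in bits)
        for bits in product([True, False], repeat=n_terms)
        if predicate(*bits)
    }

def doc_vector(doc_terms):
    """Encode a document as a bit string over TERMS, in order."""
    return "".join("1" if term in doc_terms else "0" for term in TERMS)

components = dnf_components(query, len(TERMS))

def matches(doc_terms):
    """A document is retrieved if its bit vector equals any DNF component."""
    return doc_vector(doc_terms) in components
```

Of the three documents listed above, only 110 is retrieved under this query. Both 111 and 010 fail to match either component.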

Disadvantages
 The model predicts that each document is either relevant or non-relevant to the
query.
 There is no notion of a partial match to the query conditions and so this can
lead to too few documents being retrieved.
 Every document that is considered to be relevant is treated the same, so no
ranking occurs.
 It is known that index term weighting can lead to substantial improvements in
retrieval performance, but the binary weights of the Boolean model cannot take
advantage of this.

Extended Boolean Model

Introduction
 To address the significant difficulties with the traditional Boolean model, the
Extended Boolean Model was proposed by Salton, Fox and Wu in 1983.
 This allows partial matching and caters for the use of term weights to facilitate
ranking results.
 For more, see page 38 in Modern Information Retrieval

Motivation
 Consider the query kx ∧ ky ∧ kz
 Using the traditional Boolean Model, only documents containing all three terms will be
returned.
 However, there is an argument that a document containing two of the terms would be
more relevant than a document containing none.
 Using the traditional Boolean model, neither of these is returned, so the user has to
modify the query.
 Obviously, it is preferable that a document containing all the terms would still be better
than one that contains fewer query terms.

Motivation
 Similarly, consider the query kx ∨ ky ∨ kz
 Under the Boolean Model, any document containing any of the terms will be returned
and will be treated in the same way.
 Again, it is logical that a document containing all three terms may have more relevance
than one that only contains a single query term.
 Some form of document ranking would be desirable in this instance also.

Term Weights
 The Extended Boolean Model makes use of term weights, where 0 ≤ wij ≤ 1.
 We will go into more detail on exactly how term weights are calculated later
in the module, but for now we just need to know that these weights lie between
0 (for a term that does not appear in the document at all) and 1 (for a very
useful term that appears in the document).
 Unlike the traditional Boolean model, the weights can lie somewhere between
0 and 1 also.

Partial Matching
 The Extended Boolean Model is the first we have seen that facilitates partial
matching by using non-binary term weights.
 In models like this, each document must have a similarity score calculated for
it, which measures how similar it is to the given query.
 This is usually shown as sim(q, d) (i.e. the similarity between a query q and a
document d).
 These models return a ranked list of documents, where the documents with the
highest similarity scores are at the top of the list.
 In this way, it is hoped that the most relevant documents are at the top of the
result set, so that the user can find relevant information more easily.

Illustration
 To illustrate how the Extended Boolean Model works, we will consider a very
simple system where there are only two terms: kx and ky .
 The principles that apply to this type of simple system also apply in more
realistic situations where more terms are involved.
 We display these two dimensions in the plane, with the weights wxj (the
weight of term kx in document dj ) and wyj (the weight of term ky in
document dj ) on the x and y axes, respectively.

Illustration
 Every document can be positioned
somewhere on this graph.
 A document where neither term is
important will be near the bottom-
left (0,0).
 A document where term ky is
important will be close to the top.
 A document where term kx is
important will be toward the right.

Illustration
 Here, document d1 is a document
where term kx is very important (it
is very close to the right-hand side).
 Term ky is moderately important in
this document (it is not as close to
the top as it is to the right).

Illustration: OR operator
 Using the traditional Boolean
Model, given the query “kx OR
ky”, only documents located at
(0,0) would not be returned.
 Therefore the documents preferred
are those that are furthest from this
point.

Illustration: OR operator
 We can measure this distance by using the following:
\[ \mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{xj}^{2} + w_{yj}^{2}}{2}} \]
 Note: this is based on the formula for the distance between two points in
coordinate geometry, where one of the points is (0,0).
 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1
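A minimal sketch of the OR similarity, assuming the usual Extended Boolean form: the square root of (wxj² + wyj²) / 2. The function name sim_or and the sample weights are illustrative.

```python
import math

def sim_or(w_x, w_y):
    """Normalised distance of the point (w_x, w_y) from the origin (0, 0)."""
    return math.sqrt((w_x ** 2 + w_y ** 2) / 2)

neither_term = sim_or(0.0, 0.0)   # the point to avoid for an OR query
one_term = sim_or(1.0, 0.0)       # only kx is present, with full weight
both_terms = sim_or(1.0, 1.0)     # as far from (0, 0) as possible
```

Without the division by 2, the document at (1,1) would score the square root of 2, which is outside the range from 0 to 1.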

Illustration: AND operator
 Using the traditional Boolean
Model, given the query “kx AND
ky", only documents located at
(1,1) would be returned.
 Therefore the documents preferred
are those that are closest to this
point.

Illustration: AND operator
 We can measure this distance by using the following:
\[ \mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{xj})^{2} + (1 - w_{yj})^{2}}{2}} \]
 The division by 2 within the square root is an adjustment that normalises the
similarity scores, so that they must lie between 0 and 1.
 The distance from position (1,1) is subtracted from 1 so as to give a higher
similarity score to those documents that minimise this distance.
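A minimal sketch of the AND similarity, assuming the usual Extended Boolean form: 1 minus the square root of ((1 - wxj)² + (1 - wyj)²) / 2. The function name sim_and and the sample weights are illustrative.

```python
import math

def sim_and(w_x, w_y):
    """One minus the normalised distance of (w_x, w_y) from the ideal point (1, 1)."""
    return 1 - math.sqrt(((1 - w_x) ** 2 + (1 - w_y) ** 2) / 2)

neither_term = sim_and(0.0, 0.0)   # as far from (1, 1) as possible
one_term = sim_and(1.0, 0.0)       # a partial match still earns some credit
both_terms = sim_and(1.0, 1.0)     # the ideal document for an AND query
```

A document with only one of the two terms scores above zero, which is the partial matching that the traditional Boolean model lacks.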

Illustration: Combining AND and OR
 Processing of more general queries is done by grouping the operations.
 For example, consider the query
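As a sketch of grouping, take the hypothetical query (kx AND ky) OR kz, which is chosen here for illustration. The inner AND group is scored first and its score is then treated as one of the two inputs to the outer OR. The two-term formulas are assumed to be the usual Extended Boolean ones.

```python
import math

def sim_or(a, b):
    """Normalised distance of the point (a, b) from (0, 0)."""
    return math.sqrt((a ** 2 + b ** 2) / 2)

def sim_and(a, b):
    """One minus the normalised distance of the point (a, b) from (1, 1)."""
    return 1 - math.sqrt(((1 - a) ** 2 + (1 - b) ** 2) / 2)

def sim_grouped(w_x, w_y, w_z):
    """Score (kx AND ky) OR kz by evaluating the innermost group first."""
    inner = sim_and(w_x, w_y)
    return sim_or(inner, w_z)
```

A document that fully satisfies only the AND group, or only kz, receives the same intermediate score, and one that satisfies both receives the maximum.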

Extended Boolean Model in m Dimensions
 These formulae are not limited to only two terms. Both extend to cater for
multiple terms in the query.
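 For queries with m terms, the following formulae apply:
\[ \mathrm{sim}(q_{or}, d_j) = \sqrt{\frac{w_{1j}^{2} + w_{2j}^{2} + \dots + w_{mj}^{2}}{m}} \]
\[ \mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\frac{(1 - w_{1j})^{2} + (1 - w_{2j})^{2} + \dots + (1 - w_{mj})^{2}}{m}} \]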
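A sketch of the m-term case, assuming the two-term formulas generalise by summing over all m query-term weights and dividing by m instead of 2. The document names and weights are hypothetical.

```python
import math

def sim_or(weights):
    """Normalised distance from the origin in m dimensions."""
    m = len(weights)
    return math.sqrt(sum(w ** 2 for w in weights) / m)

def sim_and(weights):
    """One minus the normalised distance from the all-ones point in m dimensions."""
    m = len(weights)
    return 1 - math.sqrt(sum((1 - w) ** 2 for w in weights) / m)

# Hypothetical weights of kx, ky and kz in three documents.
docs = {
    "all_three": [1.0, 1.0, 1.0],
    "two_of_three": [1.0, 1.0, 0.0],
    "none": [0.0, 0.0, 0.0],
}

# Rank for the query kx AND ky AND kz, highest similarity first.
ranking = sorted(docs, key=lambda name: sim_and(docs[name]), reverse=True)
```

The document with two of the three terms now ranks between the full match and the document with none.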

The p-norm Model
 An interesting variation of the Extended Boolean Model allows users to affect
the behaviour of the system.
 This involves the introduction of an additional p-norm variable.
 For p = ∞, the ranking will be similar to that of the basic Boolean Model.
 Setting p = 1 results in behaviour more similar to the Vector Space Model,
which we will see later in the module.
 Thus the user can vary the effects of things such as partial matches, depending
on the nature of the specific query.
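A sketch of the p-norm variation, assuming the standard form in which the exponent 2 and the square root of the earlier formulas are replaced by p and a p-th root, with p at least 1. The limiting case of infinite p is handled as max for OR and min for AND. Function names are illustrative.

```python
import math

def pnorm_or(weights, p):
    """p-norm OR similarity; as p grows it approaches max(weights)."""
    if math.isinf(p):
        return max(weights)
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1 / p)

def pnorm_and(weights, p):
    """p-norm AND similarity; as p grows it approaches min(weights)."""
    if math.isinf(p):
        return min(weights)
    m = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / m) ** (1 / p)
```

With p = 1 both operators reduce to the average of the weights, so AND and OR are no longer distinguished. With infinite p and binary weights, max and min reproduce the traditional Boolean OR and AND.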

Extended Boolean Model
 Advantages
 Extends the Boolean model to allow for term weights and partial matching
 Expert users may alter the system's behaviour as appropriate (using the p-norm
variation).
 Disadvantages
 Assumption of term independence
 Not widely used in practice
 Users must be familiar with specialist query language
