CHAPTER ONE

INTRODUCTION TO INFORMATION STORAGE AND RETRIEVAL

This chapter covers the following key terms and questions. By the end of the chapter, readers of this module should understand these concepts and be able to answer the listed questions.

• What is information?

• What is the main difference between data and information?

• What is storage?

• What is retrieval?

• What is information retrieval (IR)?

• What is information storage and retrieval (ISR)?

• What are IR systems?

• What are the basic structures of IR systems?

Discussion questions:

• What is an information crisis?

• What do we mean when we say national information programs?
• What are the effects of mismanagement of information on the development of a country?
• What is the future of information retrieval?

1.1. Data, Information, Knowledge and Wisdom

Data: raw facts, unprocessed material. It simply exists and has no meaning of itself. It can exist in any form, usable or not, and can be a symbol.

Example: “2009” could be a year, an amount, a plain number, etc. “It is raining” is data because it carries no suggestion about conditions such as the temperature.


Data can be defined as a representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or electronic machines.

It can be described as unprocessed facts and figures. It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, =, etc.). Information, by contrast, is the processed data on which decisions and actions are based. It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the current or prospective action or decision of the recipient. Further, information is interpreted data: created from organized, structured, and processed data in a particular context.

1.1.1. Types of data

Data can be categorized by its content into three types: structured, semi-structured, and unstructured.

• Structured data

Structured data is data that adheres to a pre-defined data model and is therefore straightforward
to analyze. Structured data conforms to a tabular format with a relationship between the different
rows and columns. Common examples of structured data are Excel files or SQL databases. Each
of these has structured rows and columns that can be sorted.


Table 1. Example of structured data

• Unstructured data

Unstructured data is information that either does not have a predefined data model or is not
organized in a pre-defined manner. Unstructured information is typically text-heavy but may
contain data such as dates, numbers, and facts as well. This results in irregularities and
ambiguities that make it difficult to understand using traditional programs as compared to data
stored in structured databases.

Unstructured data typically refers to free-text data. It allows keyword queries, including operators, and more sophisticated “concept” queries, e.g., find all web pages dealing with drug abuse. Common examples of unstructured data include audio files, video files, and NoSQL databases.

• Semi-structured data

Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure. JSON and XML are common forms of semi-structured data.


Information: data that has been processed and has meaning of its own; that meaning is often useful, though it does not have to be. Information embodies the understanding of a relationship of some sort, such as cause and effect, and provides answers to who, what, where, and when questions. Information is a critical business resource and, like any other critical resource, must be properly managed. Constantly evolving technology, however, is changing the way even very small businesses manage vital business information. An information or records management system, most often electronic, designed to capture, process, store and retrieve information is the glue that holds a business together.

Example: - The temperature dropped 15 degrees and then it started raining.

2009 is a year I will never forget because of its events, especially in Ethiopia.

Knowledge: the appropriate collection of information, such that its intent is to be useful. Knowledge is a deterministic process. When someone memorizes information (as less-aspiring, test-bound students often do), they have amassed knowledge. This knowledge has useful meaning to them, but it does not, in and of itself, provide understanding.

For example, elementary school children memorize, or amass knowledge of, the times table. They can simply tell us that 2*2=4 because they have amassed that knowledge. But when asked what 2340*700 is, they cannot respond correctly, because that entry is not in their times table. To answer such a question correctly requires a true cognitive and analytical ability that is only encompassed in the next level: understanding.

Wisdom: an extrapolative, non-deterministic, non-probabilistic process that goes far beyond understanding itself. Wisdom is, therefore, the process by which we judge what is right and what is wrong, or what is good and what is bad.

Storage: the action or method of storing something. The place where data is held in an electromagnetic or optical form for access by a computer processor.


Retrieval: the process of getting something back from somewhere; the action of obtaining or consulting material stored in a computer system. Example: find ‘BRUTUS AND CAESAR AND NOT CALPURNIA’ in the big book of Shakespeare.
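To make this concrete, here is a minimal sketch in Python of answering such a Boolean query with set operations. The three-document “collection” and its term sets are hypothetical, chosen only for illustration:

    # A minimal sketch of Boolean retrieval over per-document term sets.
    docs = {
        "Antony and Cleopatra": {"antony", "brutus", "caesar", "calpurnia"},
        "Julius Caesar":        {"antony", "brutus", "caesar"},
        "Hamlet":               {"brutus", "caesar"},
    }

    # Evaluate: BRUTUS AND CAESAR AND NOT CALPURNIA
    hits = [name for name, terms in docs.items()
            if "brutus" in terms and "caesar" in terms
            and "calpurnia" not in terms]
    print(hits)  # ['Julius Caesar', 'Hamlet']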

Information storage: computers can store different types of information in different ways, depending on what the information is, how much storage it requires, and how quickly it needs to be accessed.

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Information retrieval technology has been central to the success of the Web. Information retrieval is the process of obtaining relevant information from a collection of informational resources. It does not return information restricted to a single object; it matches several objects, which vary in their degree of relevance to the query. So we have to think about what concepts IR systems use to model this data so that they can return all the documents that are relevant to the query term, ranked by some importance measure. These concepts include dimensionality reduction, data modeling, ranking measures, clustering, etc., and they help return results faster. While computing the results and their relevance, programmers use these concepts to design their systems and to decide which data structures and procedures will increase the speed of searches and improve the handling of data.

1.2. What is Information Storage and Retrieval (ISR)?

Information storage & retrieval (IS&R) is the process of searching for relevant documents from a large unstructured corpus to satisfy the user’s information need. It deals with the representation, storage, organization of, and access to information items. In other terms, it is a tool that finds and selects, from a collection of items, a subset that serves the user’s purpose.


Generally, Information storage & retrieval is the science of searching for documents, for
information within documents, as well as that of searching relational databases and the World
Wide Web.

Information storage & retrieval is interdisciplinary, based on computer science, mathematics, library science, information science and technology, information architecture, cognitive psychology, linguistics, statistics, and physics. Automated information retrieval systems are used
to reduce information overload. In information retrieval systems the emphasis is on the retrieval
of information (not data). This information can be in the form of audio, video or text and it is
devoted to finding relevant documents, not finding simple matches to patterns. The effective
retrieval of relevant information is directly affected both by the user task and the logical view of
the documents. The user task may be information retrieval or information browsing. Classic
information retrieval systems (such as web interfaces) normally allow information retrieval alone,
while hypertext systems provide quick browsing. Modern digital libraries and Web interfaces
might attempt to combine these tasks to provide improved retrieval capabilities.

1.3. Information Retrieval System (IRS)

The concept of Information Retrieval System (IRS) is self-explanatory from the terminological
point of view and refers to a ‘system which retrieves information’. IRS is concerned with two
basic aspects: (i) How to store information, and (ii) How to retrieve information. One may simply
denote such a system as one that stores and retrieves information. IRS is comprised of a set of
interacting components, each of which is designed to serve a specific function for a specific
purpose. All these components are interrelated to achieve a goal. The concept of IR thus is based
on the fact that there are some items of information which have been organized in a suitable order
for easy retrieval.


An information retrieval system is designed to analyze, process and store sources of information
and retrieve those that match a particular user’s requirements. Modern information retrieval
systems can either retrieve bibliographic items or the exact text that matches a user’s search
criteria from a stored database of documents. IRS originally meant text retrieval systems as they
were dealing with textual documents. Modern information retrieval systems deal not only with
textual information but also with multimedia information comprising text, audio, images and
video. Thus, modern information retrieval systems deal with storage, organization and access to
text, as well as multimedia information resources. Thus, an IR system is a set of rules and procedures for performing some or all of the following operations:
a) Indexing (or constructing of representations of documents);
b) Search formulation (or constructing of representations of information needs);
c) Searching (or matching representations of documents against representations of needs); and
d) Index language construction (or generation of rules of representation)
So information retrieval is collectively defined as a “science of search”: a process, method and procedure used to select or recall recorded and/or indexed information from files of data.

1.4. Data versus Information Retrieval

Data retrieval systems directly retrieve data from database management systems by identifying keywords in the queries provided by users and matching them with the documents in the database.

Data retrieval: the task of determining which documents of a collection contain the keywords in the user query.

• A data retrieval system is most like:

– a relational database;

– it deals with data that has a well-defined structure and semantics;

– a single mistaken object among a thousand retrieved objects means total failure.

• Data retrieval does not solve the problem of retrieving information about a subject or topic.


The main reason for this difference is that information retrieval usually deals with natural language text, which is not always well structured and can be semantically ambiguous. On the other hand, a data retrieval system (such as a relational database) deals with data that has a well-defined structure and semantics.

S.No | Information Retrieval | Data Retrieval
1 | Retrieves information based on the similarity between the query and the document. | Retrieves data based on the keywords in the query entered by the user.
2 | Small errors are tolerated and will likely go unnoticed. | There is no room for errors, since an error results in complete system failure.
3 | It is ambiguous and doesn’t have a defined structure. | It has a defined structure with respect to semantics.
4 | Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
5 | Produces approximate results. | Produces exact results.
6 | Displayed results are sorted by relevance. | Displayed results are not sorted by relevance.
7 | The IR model is probabilistic by nature. | The data retrieval model is deterministic by nature.

Table 2. Data versus information retrieval

1.5. Information Retrieval Serves as a Bridge

An information retrieval system aims at collecting and organizing all the documents available in one or more subject areas in order to provide them to users as soon as required. The following scenario reflects the purpose of an information retrieval system:


1. A writer presents a set of ideas in a document using a set of concepts.

2. Somewhere there are users who require those ideas but may not be able to identify them.

3. The information retrieval system serves to match the writer’s ideas expressed in the document with the users’ requirements for them.

Thus, an information retrieval system serves as a bridge between the world of authors and the world of readers/users: writers present a set of ideas in a document using a set of concepts, and users turn to the IR system for relevant documents that satisfy their information needs.

Fig 1.1. View of information retrieval

1.6. Information Retrieval System Architecture

Before an information retrieval system can actually operate to retrieve some information, the information must have already been stored inside the system. This is true both for manual and computerized systems.

Originally it will usually have been in the form of documents. The retrieval system is not likely to have stored the complete text of each document in the natural language in which it was written. It will have, instead, a document representative, which may have been produced from the documents either manually or automatically.


1.6.1. Typical Information Retrieval System Architecture

Information retrieval (IR) means finding a set of documents that is relevant to the query. The set of documents is usually ranked according to their relevance scores to the query. A user with an information need issues a query to the retrieval system through the query operational module.

Fig 1.6. A typical information retrieval system

We use search engines to retrieve information from different web sites. A search engine is a tool designed to search for information; when this occurs on the World Wide Web, it is called a web search engine. The search results are usually presented in a list and are commonly called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or by a mixture of algorithmic and human input. Common search engines include Google, Yahoo, AOL Search, Ask.com, Bing and LookSmart.

1.7. Information retrieval System vs. Web Search System

With the fast growth of the Internet, more and more information is available on the web, and as a
result, web information retrieval has become a fact of life for most Internet users.
The input of classic information retrieval is mainly a document collection, and the goal is to retrieve documents or text with information content that is relevant to the user’s information need.
Classic information retrieval involves two main aspects:

• Processing the document collection.

• Processing queries (searching).

To determine the query results (i.e., which documents to return), information retrieval employs different models, such as the Boolean and vector models.
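As a rough illustration of the vector model (a sketch only, using plain term-frequency vectors over a hypothetical mini-collection, not a prescription from this module), documents can be ranked by the cosine similarity between query and document vectors:

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two term-frequency vectors (dicts).
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    docs = ["information retrieval on the web",
            "database systems and data retrieval",
            "web search engines"]
    query = "web information retrieval"

    qv = Counter(query.split())
    # Rank the documents by decreasing similarity to the query.
    for d in sorted(docs, key=lambda d: -cosine(qv, Counter(d.split()))):
        print(round(cosine(qv, Counter(d.split())), 3), "|", d)

Unlike the Boolean model, which only returns exact matches, this ranking admits partial matches with graded scores.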

On the other hand, as we have seen in figure 1.2, the input of web information retrieval is the
publicly accessible web while the goal is to retrieve high-quality pages that are relevant to the
user’s information needs.


Web information retrieval can be static, in which files like text, audio, and videos are retrieved, or dynamic, which is mainly database access generated on request. The two aspects of web information retrieval are processing and representation of the document collection and processing queries.

Processing and representation of document collection involve either gathering the static pages or
learning about the dynamic pages.

Web information retrieval has the following advantages over classic information retrieval:

1. User

• Many tools are available to the user

• Personalization of information results given a query is better
• Interactivity: for instance, the query can be refined or expanded as desired

2. Collection/System

• Hyperlinks are available to link one document to another

• Statistics are easy to gather, even with large sample sizes
• Interactivity: the system can have users clarify what they want

Fig 1.3. Example of web information retrieval

1.8. Information retrieval Process

Information retrieval is the process of searching for relevant documents from a large unstructured corpus to satisfy the user’s information need. IR is simply about finding relevant information, and it focuses on providing the user with easy access to information of interest.


Fig 1.7 Information retrieval process

Information retrieval is the process of matching the query against the indexed information
objects.

An index is an optimized data structure that is built on top of the information objects, allowing faster access for the search process.

– The indexer:
• tokenizes the text (tokenization)
• removes words with little semantic value (stop words)
• unifies word families (stemming)
– The same is done for the query as well
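A minimal sketch of these indexer steps in Python follows. The tiny stop list and the naive trailing-“s” suffix rule are illustrative assumptions only; real systems use fuller stop lists and stemmers such as Porter’s:

    import re
    from collections import defaultdict

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative subset

    def analyze(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
        # Naive stemming: strip a trailing "s" (a stand-in for a real stemmer).
        return [t[:-1] if t.endswith("s") else t for t in tokens]

    index = defaultdict(set)  # term -> set of document ids (an inverted index)
    collection = {1: "Storage of documents.", 2: "Document retrieval systems."}
    for doc_id, text in collection.items():
        for term in analyze(text):
            index[term].add(doc_id)

    # The query is passed through the same analysis before matching.
    print(index[analyze("documents")[0]])  # {1, 2}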

• Components of the Retrieval Process

As discussed above, before any of the retrieval processes are initiated, it is necessary to define the text database. This is usually done by the manager of the database and includes specifying the documents to be used, the text operations to be performed on the text, and the text model to be used (the text structure and what elements can be retrieved). The text operations transform the original documents and the information needs, and generate a logical view of them. Once the logical view of the documents is defined, the database module builds an index of the text. An index is a critical data structure: it allows fast searching over large volumes of data.

Different index structures might be used, but the most popular one is the inverted file. Once the document database is indexed, the retrieval process can be initiated. The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text. Next, the query operations are applied before the actual query, which provides a system representation of the user need, is generated. The query is then processed to retrieve documents. Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.

The user then examines the set of ranked documents in search of useful information. The user has two choices:

(i) reformulate the query and run it on the entire collection, or

(ii) reformulate the query and run it on the result set.

At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle. In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need.

1.9. Issues that arise in Information Retrieval

The main issues of information retrieval (IR) are:

• Document and query indexing

• Query evaluation and system evaluation
• How can interactive query formulation and refinement be supported?
• Comparing representations
• What is a “good” similarity measure and retrieval model?
• How is uncertainty represented?


• Text representation
– What makes a “good” representation?
– How is a representation generated from the text?
– What are the retrievable objects and how are they organized?
• Information needs representation
– What is an appropriate query language?
• Evaluating the effectiveness of retrieval
– What are good metrics?

1.10. Information Retrieval Research areas

Information retrieval is the science of searching for information in a document, searching for
documents themselves, and also searching for the metadata that describes data, and for databases
of texts, images or sounds. Much of IR research focuses more specifically on text retrieval.

But there are many other interesting areas:

– Audio retrieval, which deals with searching for speech or music files.

– Cross-language retrieval, which uses a query in one language (say English) and finds documents in other languages (say Amharic and Russian).

– Question-answering IR systems, which retrieve answers from a body of text. For example, the question “Who won the 1997 World Series?” finds a 1997 headline “World Series: Marlins are champions”.

– Image retrieval, which finds images on a given topic or images that contain a given shape or color.

– Video retrieval, which searches for the video files that the user is looking for.

1.11. Some Application Areas of Information Retrieval

• Graphical interfaces to support information search

• Information Retrieval & Extraction


• XML retrieval
• Geographic Information Retrieval
• Multimedia information retrieval
• Cross-Language & Multilingual Information Retrieval
• Agent-based Information Retrieval (information filtering, tracking, routing)
• Adversarial Information Retrieval
• Question answering
• Document Summarization
• Text classification
• Multi-database searching
• Document provenance
• Recommender systems
• Information Retrieval & Machine Learning
• Text Mining & Web Mining
• N-Grams in Information Retrieval

Basic assumptions of Information Retrieval

There is a collection or set of documents; assume for the moment that it is a static collection. The main goal of IR is to retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.


CHAPTER TWO

TEXT/DOCUMENT OPERATIONS AND AUTOMATIC INDEXING

2.1. Introduction

A search engine or information retrieval system won't look through every document to check whether it matches the query; it employs an index to find the pertinent documents quickly. Information must first be stored inside the computer before a computerized information retrieval system can actually work to retrieve it. Often, the information was first presented in the form of documents. But it is unlikely that the computer has stored every document's whole text in the natural language in which it was written. Instead, it will hold a document representation that could have been created manually or automatically from the documents. The full text of the document, an abstract, only the title, or even a summary could serve as the starting point for text analysis.

An index is a list of concepts with pointers to the documents that discuss (represent) them. What goes into the index is very important: document representation is the task of deciding which concepts should go into the index. Before a computerized information retrieval system can actually operate to retrieve some information, that information must have already been stored inside the computer. Originally it will usually have been in the form of documents. The computer, however, is not likely to have stored the complete text of each document in the natural language in which it was written. It will have, instead, a document representative, which may have been produced from the documents either manually or automatically. The starting point of the text analysis process may be the complete document text, an abstract, the title only, or perhaps a list of words only. From it the process must produce a document representative in a form which the computer can handle.

An index language is the language used to describe documents and requests. The elements of the index language are index terms, which may be derived from the text of the document to be described or may be arrived at independently. Index languages may be described as pre-coordinate or post-coordinate; the former indicates that terms are coordinated at the time of indexing, the latter at the time of searching. More specifically, in pre-coordinate indexing a logical combination of any index terms may be used as a label to identify a class of documents,


whereas in post-coordinate indexing the same class would be identified at search time by
combining the classes of documents labeled with the individual index terms.

One last distinction: the vocabulary of an index language may be controlled or uncontrolled. The former refers to a list of approved index terms that an indexer may use, such as that used
by MEDLARS. The controls on the language may also include hierarchic relationships between
the index terms. Or, one may insist that certain terms can only be used as adjectives (or
qualifiers). There is really no limit to the kind of syntactic controls one may put on a language.

The index language which comes out of the conflation algorithm may be described as
uncontrolled, post-coordinate and derived. The vocabulary of index terms at any stage in the
evolution of the document collection is just the set of all conflation class names.

There is much controversy about the kind of index language which is best for document retrieval.
The main debate is really about whether automatic indexing is as good as or better than manual
indexing. Each can be done to various levels of complexity. However, there seems to be
mounting evidence that in both cases, manual and automatic indexing, adding complexity in the
form of controls more elaborate than index term weighting does not pay dividends. The message is
that uncontrolled vocabularies based on natural language achieve retrieval effectiveness
comparable to vocabularies with elaborate controls. This is extremely encouraging, since the
simple index language is the easiest to automate.

Probably the most substantial evidence for automatic indexing has come out of the SMART
Project (1966). Gerard Salton recently summarized its conclusions: ' ... on the average the
simplest indexing procedures which identify a given document or query by a set of terms,
weighted or unweighted, obtained from document or query text are also the most effective'. Its
recommendations are clear, automatic text analysis should use weighted terms derived from
document excerpts whose length is at least that of a document abstract.

The document representatives used by the SMART project are more sophisticated than just the
lists of stems extracted by conflation. There is no doubt that stems rather than ordinary word
forms are more effective (Carroll and Debruyn). On top of this, the SMART project adds index
term weighting, where an index term may be a stem or some concept class arrived at through the
use of various dictionaries.


Documents are the primary objects in IR systems and there are many operations for them. In
many types of IR systems, documents added to a database must be given unique identifiers,
parsed into their constituent fields, and those fields broken into field identifiers and terms. Once
in the database, one sometimes wishes to mask off certain fields for searching and display. For
example, the searcher may wish to search only the title and abstract fields of documents for a
given query, or may wish to see only the title and author of retrieved documents.

2.2. Index term selection

Some words are not good for representing documents. Using all words has a computational cost, increases searching time and storage requirements, and using the set of all words in a collection to index documents generates too much noise for the retrieval task. Term selection is therefore very important. The main objectives of term selection are:

• To represent textual documents by a set of keywords called index terms or simply terms
• To increase efficiency by extracting from each document a selected set of terms to be used for indexing it
• If a full-text representation is adopted, then all words are used for indexing (not as efficient, as it carries a time and space overhead)

An index term, also called a keyword, is a word (single word) or phrase (multiword) in a document whose semantics gives an indication of the document’s theme (main idea). Index terms:

• Capture the subject discussed
• Help in remembering the document’s main theme

Index terms are mainly nouns (because nouns have meanings by themselves). The user’s ability to find documents on a particular subject is limited by the indexing process used to create index terms for the subject. Some definitions of indexing are the following.

Indexing:

– Is the art of organizing information

– Is an association of descriptors (keywords, concepts) to documents in view of future retrieval

– Is a process of constructing document surrogates by assigning identifiers to text items

– Is the process of storing data in a particular way in order to locate and retrieve the data


– Is the process of analyzing the information content in the language of the indexing system

The main, well-known purposes/objectives of indexing are:

– To give access point to a collection that are expected to be most useful to the users of information

– To allow easy identification of documents (e.g., find documents by topic)

– To relate documents to each other

– To allow prediction of document relevance to a particular information need

There are two ways of indexing:

1. Manual indexing

Indexers decide which keywords to assign to documents based on a controlled vocabulary (human indexers assign index terms to documents). The indexers try to summarize the content (aboutness) of the whole document in a few keywords; that is, indexers analyze and represent the content of a document through keywords based on their intellectual judgment and semantic interpretation of its concepts and themes. In manual indexing, the indexers' prior knowledge of the following is important to come up with good keywords or index terms:

– Terms that will be used by the user

– Indexing vocabulary

– Collection characteristics

Indexers are normally provided with guidelines (input sheets, manuals and instructions,
printed thesaurus) to determine the contents of a given document and are usually done in the
library environment.


2. Automatic indexing

What is an automatic indexing system?

Automatic indexing is the computerized process of scanning large volumes of documents against
a controlled vocabulary, taxonomy, thesaurus or ontology and using those controlled terms to
quickly and effectively index large electronic document repositories.
Automatic indexing is the process of assigning documents search terms for search and retrieval purposes. It is widely used today to lessen search time. It uses a computer to scan a large volume of documents against a dictionary, rather than relying on the manual labor of human indexers.

Who uses Automatic Indexing?

There are plenty of companies that use automatic indexing today. Even online catalogues, periodical databases, and internet search engines use automatic indexing. With automatic indexing, searching is faster and more reliable; one good example is searching for something via the internet and getting fast and reliable results.

What are the possible advantages of Automatic Indexing?

There are numerous advantages to using automatic indexing. One very obvious advantage is that it relieves the user of scanning and searching documents, which a computer can do far faster. Beyond that, the computer can also categorize each search it has made. Through this innovation, users are no longer obliged to do the tedious work of scanning, searching and categorizing. Users may still have to check for errors, but this is still far easier than doing everything manually.

Advantages of Automatic Indexing

• Predictability
• More sophisticated than manual indexing
• Great for similar material
• Less expensive
• Can extract terms and cluster them as well
• Can help users find information faster and more thoroughly


• It can be applied to a great number of texts without any hassle.

• Faster, more reliable and cost-effective compared to manual indexing.
• In practice, it compensates for differences between the terms used in searches and the indexing terms.

While there are advantages to automatic indexing, it also has some disadvantages. For instance, one disadvantage is that it is not flexible; however, the more it is used, the more the system learns from the entries made by users. In finding information, both human and automatic indexing can help users, but compared to automatic indexing, human or manual indexing takes up a lot of time and is more tedious and expensive. In terms of technique, automatic indexing has more methods than human indexing. On top of it all, automatic indexing is well suited to the online environment, where masses of documents are stored and companies hold massive amounts of data.

Automatic indexing is the assignment of content identifiers with the help of modern computing technology. A computer system is used to record the descriptors generated by a human, and the system extracts “typical”/“significant” terms. The human may then contribute by setting parameters or thresholds, or by choosing components or algorithms. The original texts of information items are used as the basis of indexing.

Automatic indexing is necessary for the following reasons:

– Information overload

• Enormous amount of information is being generated from day to day activities

– Explosion of machine-readable text

• Massive information available in electronic format and on Internet.

– Cost effectiveness

• Human indexing is expensive and labor intensive.


2.3. Procedures for automatic indexing

Generating document representatives through automatic indexing involves:

– Lexical analysis

– Use of stop list

– Noun identification (optional)

– Phrase formation (optional)

– Use of conflation procedures (stemming, optional)

– Selection of index terms

– Weighting the resulting terms (optional)

Figure 2.1. Procedures of automatic indexing

Thus, automatic indexing consists of two processes:

– Assigning terms or concepts capable of representing document content

23

Downloaded by shegaw mulat ([email protected])


lOMoARcPSD|32779613

– Assigning a weight or value to each term, reflecting its presumed importance for the purpose of content identification:

• Important words are assigned higher weights

• Less important words are assigned lower weights
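One standard way to realize such weighting is tf-idf (term frequency × inverse document frequency). The following is a minimal sketch over a hypothetical three-document collection (an illustration, not the module's prescribed scheme); note how a rare term earns a higher weight per occurrence than a common one:

    import math
    from collections import Counter

    docs = [["information", "retrieval", "systems"],
            ["database", "systems"],
            ["information", "storage", "and", "information", "retrieval"]]

    N = len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency

    def tf_idf(term, doc):
        tf = doc.count(term)          # raw term frequency in this document
        idf = math.log(N / df[term])  # rarer terms get a higher idf
        return tf * idf

    print(round(tf_idf("information", docs[2]), 3))  # frequent, but common -> 0.811
    print(round(tf_idf("storage", docs[2]), 3))      # rare -> 1.099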

Note that not all words in a text are good index terms: some are good, some are bad and some are indifferent. So how do we know whether a term is good, bad or indifferent for indexing? Luhn's ideas give us an answer to this question.

2.4. Statistical Properties of Text

The chapter, therefore, starts with the original ideas of Luhn on which much of automatic text
analysis has been built and then goes on to describe a concrete way of generating document
representatives. Furthermore, ways of exploiting and improving document representatives
through weighting or classifying keywords are discussed. In passing, some of the evidence for
automatic indexing is presented.

• Luhn’s ideas

In one of Luhn’s early papers, he states: ‘It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.’ This quote fairly summarizes Luhn’s contribution to automatic text analysis. His assumption is that frequency data can be used to extract words and sentences to represent a document.

Let f be the frequency of occurrence of various word types in a given piece of text and r their rank order, that is, the order of their frequency of occurrence; then a plot relating f and r yields a curve similar to the hyperbolic curve in Figure 2.1. This is in fact a curve demonstrating Zipf’s Law.

• Zipf’s Law


Zipf’s Law states that the product of the frequency of use of words and their rank order is approximately constant. Zipf verified his law on American newspaper English. Luhn used it as a null hypothesis to enable him to specify two cut-offs, an upper and a lower, thus excluding non-significant words. The words exceeding the upper cut-off were considered to be common, and those below the lower cut-off rare; neither therefore contribute significantly to the content of the article. He thus devised a counting technique for finding significant words. Consistent with this, he assumed that the resolving power of significant words, by which he meant the ability of words to discriminate content, reached a peak at a rank order position halfway between the two cut-offs and fell off in either direction, reducing to almost zero at the cut-off points. A certain arbitrariness is involved in determining the cut-offs; there is no oracle that gives their values. They have to be established by trial and error.
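Zipf’s Law can be written as f × r ≈ C for some constant C. As a quick sketch (the filename is a placeholder for any large plain-text file you have at hand), one can observe this on real text:

    # Sketch: check that frequency * rank stays roughly constant for top words.
    import re
    from collections import Counter

    text = open("sample.txt", encoding="utf-8").read().lower()  # placeholder file
    counts = Counter(re.findall(r"[a-z]+", text))

    for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
        print(rank, word, freq, freq * rank)  # last column is roughly constant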

It is interesting that these ideas are really basic to much of the later work in IR. Luhn himself
used them to devise a method of automatic abstracting. He went on to develop a numerical
measure of significance for sentences based on the number of significant and non-significant
words in each portion of the sentence. Sentences were ranked according to their numerical score
and the highest-ranking was included in the abstract (extract really).

Edmundson and Wyllys have gone on to generalize some of Luhn’s work by normalizing his
measurements with respect to the frequency of occurrence of words in general text. There is no
reason why such an analysis should be restricted to just words. It could equally well be applied to
stems of words (or phrases) and in fact, this has often been done.


Fig 2.1. A plot of the hyperbolic curve relating f, the frequency of occurrence, and r, the rank order

2.5. Text Operations

Text operations are the process of transforming text into logical representations. The main operations for selecting index terms, i.e. for choosing words/stems (or groups of words) to be used as indexing terms, are:

• Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
• Elimination of stop words: filtering out words that are not useful in the retrieval process
• Stemming words: removing affixes (prefixes and suffixes)
• Construction of term categorization structures, such as a thesaurus, to capture relationships that allow expanding the original query with related terms


Not all words in a document are equally significant for representing its contents/meaning; some words carry more meaning than others. For instance, nouns are the most representative of document content. Therefore, the text of the documents in a collection needs to be preprocessed before being used for index terms, because using the set of all words in a collection to index documents creates too much noise for the retrieval task. Reducing noise means reducing the set of words that can be used to refer to the document.

Preprocessing is the process of controlling the size of the vocabulary, i.e., the number of distinct words used as index terms. Preprocessing leads to an improvement in information retrieval performance. However, some web search engines omit preprocessing: every word in the document is an index term.

The logical view of the document is provided by representative keywords or index terms, which historically were used to represent documents in a collection. With modern computers, retrieval systems can adopt a full-text logical view of the document. However, with very large collections, the set of representative keywords may have to be reduced. This process of reduction or compression of the set of representative keywords is called text operations (or transformations).

• Generating Document Representatives

Ultimately one would like to develop a text processing system that, by means of computable methods and with a minimum of human intervention, will generate from the input text (full text, abstract, or title) a document representative adequate for use in an automatic retrieval system. This is a tall order and can only be partially met. The document representative we are aiming for is one consisting simply of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.

Such a system will usually consist of three parts:

(1) Removal of high-frequency words,

(2) Suffix stripping,

(3) Detecting equivalent stems.

The removal of high-frequency words, ‘stop’ words, or ‘fluff’ words is one way of implementing
Luhn’s upper cut-off. This is normally done by comparing the input text with a ‘stop list’ of
words that are to be removed.

Figure 2.1 gives a portion of such a list, and demonstrates the kind of words that are involved.
The advantages of the process are not only that non-significant words are removed and will
therefore not interfere during retrieval, but also that the size of the total document file can be
reduced by between 30 and 50 percent.

The second stage, suffix stripping, is more complicated. A standard approach is to have a
complete list of suffixes and to remove the longest possible one.

2.5.1. Lexical Analysis/Tokenization of Text

Lexical analysis or tokenization is a fundamental operation in both query processing and automatic indexing. It is the process of converting an input stream of characters into a stream of words or tokens, where tokens are groups of characters with collective significance. In other words, it is the step that converts the text of the documents into the sequence of words, w1, w2, ... wn, to be adopted as index terms. It is the process of demarcating, and possibly classifying, sections of a string of input characters into words.

Generally, lexical analysis is the first stage of automatic indexing and of query processing. Automatic indexing is the process of algorithmically examining information items to generate lists of index terms. The lexical analysis phase produces candidate index terms that may be further processed and eventually added to indexes. Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of a query produces tokens that are parsed and turned into an internal representation suitable for comparison with indexes.

• Issues in Tokenization


The main objective of tokenization is the identification of words in the text document. Tokenization depends greatly on how the concept of a word is defined.

The first decision that must be made in designing a lexical analyzer for an automatic indexing system is: what counts as a word or token in the indexing scheme? Is it a sequence of characters, of numbers, or an alpha-numeric one? A word is a sequence of letters terminated by a separator (period, comma, space, etc.). The definition of letter and separator is flexible; e.g., a hyphen could be defined as a letter or as a separator. Usually, common words (such as “a”, “the”, “of”, ...) are ignored.

The standard tokenization approach is single-word tokenization, where input is split into words using white-space characters as delimiters, ignoring characters other than words. This approach introduces errors at an early stage because it ignores multi-word units, numbers, hyphens, punctuation marks, and apostrophes.

How should special cases involving hyphens, apostrophes, punctuation marks, etc. be handled? Consider C++, C#, URLs, e-mail addresses, and so on. Sometimes punctuation (e-mail addresses), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token; frequently they are not. Some of the many issues in tokenization are listed below.

• Two words may be connected by hyphens.

Can two words connected by hyphens be taken as one word or as two words? Should a hyphenated sequence be broken up into two tokens? In most cases hyphens break up the words (e.g. state-of-the-art, state of the art), but some words, e.g. MS-DOS or B-49, are unique words that require hyphens.

• Two words may be connected by punctuation marks.

Punctuation marks are removed totally unless significant, e.g. in program code: x.exe vs. xexe. What about Kebede’s, or www.command.com?

• Two words (phrases) may be separated by a space.

E.g. Addis Ababa, San Francisco, Los Angeles

• Two words may be written in different ways: lowercase, lower-case, lower case? database, data base, data-base?

• Numbers: are numbers/digits used as index terms?

– dates (3/12/91 vs. Mar. 12, 1991)
– phone numbers (+251923415005)
– IP addresses (100.2.86.144)

Numbers alone (like 1910, 1999) are not good index terms, but 510 B.C. is unique. Generally, don’t index numbers as text, though they are often very useful.

• What about the case of letters (e.g. Data vs. data vs. DATA)?

Case is usually not important, so there is a need to convert everything to upper or lower case. Which one is mostly followed by human beings?

The simplest approach is to ignore all numbers and punctuation marks (period, colon, comma, brackets, semi-colon, apostrophe, ...) and use only case-insensitive unbroken strings of alphabetic characters as words.
Systems will often index “meta-data”, including creation date, format, etc., separately. Issues of tokenization are language-specific and require the language to be known. The following is an example of how standard tokenization is performed.

Analyze text into a sequence of discrete tokens (words):

• Input: “Friends, Romans, and Countrymen”

• Output: tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing):

Friends
Romans
and
Countrymen

Each such token is now a candidate for an index entry, after further processing. But what are valid tokens to emit?
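A minimal sketch of such a tokenizer, splitting on non-alphanumeric characters (just one of the possible policies, with the limitations discussed above):

    import re

    def tokenize(text):
        # Split on anything that is not a letter or digit; drop empty strings.
        return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

    print(tokenize("Friends, Romans, and Countrymen"))
    # ['Friends', 'Romans', 'and', 'Countrymen']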

2.5.2. Elimination of Stopwords

Stopwords are extremely common words across document collections that have no discriminatory power; they may occur in 80% of the documents in a collection. They appear to be of little value in helping select documents matching a user's need, and need to be filtered out as potential index terms. Examples of stopwords are articles, prepositions, conjunctions, etc.:

articles (a, an, the); pronouns (I, he, she, it, their, his); some prepositions (on, of, in, about, besides, against); conjunctions/connectors (and, but, for, nor, or, so, yet); verbs (is, are, was, were); adverbs (here, there, out, because, soon, after); and adjectives (all, any, each, every, few, many, some) can also be treated as stopwords. Stopwords are language-dependent.

Why Stopword Removal?

Intuition:

• Stopwords have little semantic content, so it is typical to remove such high-frequency words.
• Stopwords take up about 50% of the text, so removing them reduces document size by 30-50% and yields smaller indices for information retrieval.
• Good compression techniques exist for indices: the 30 most common words account for about 30% of the tokens in written text.

With the removal of stopwords, we can obtain a better approximation of term importance for text classification, text categorization, text summarization, etc.

How to detect a stopword?


• One method: sort terms by document frequency (DF) in decreasing order and take the most frequent ones, above some cutoff point.
• Another method: build a stopword list that contains a set of articles, pronouns, etc.
– Why do we need stop lists? With a stop list, we can compare the input against it and exclude the commonest words from the index terms entirely.

Stopword elimination used to be standard in older IR systems, but the trend nowadays is away from doing it: most web search engines index stopwords, since good query optimization techniques mean you pay little at query time for including them. You need stopwords for:

• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”

Elimination of stopwords might reduce recall (e.g. for “To be or not to be”, all terms are eliminated except “be”, leading to no retrieval or irrelevant retrieval).
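A minimal sketch of stop-list filtering (the stop list here is a tiny illustrative subset, not a real system's list); note how it destroys the phrase query discussed above:

    STOPLIST = {"a", "an", "the", "of", "in", "to", "and", "or", "not", "be"}

    def remove_stopwords(tokens):
        return [t for t in tokens if t.lower() not in STOPLIST]

    print(remove_stopwords(["To", "be", "or", "not", "to", "be"]))
    # [] -- every token was eliminated, illustrating the recall problem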
2.5.3. Normalization
Normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. We need to “normalize” terms in the indexed text, as well as query terms, into the same form.

• Example: we want U.S.A. and USA to match, by deleting the periods in a term.

Case folding: it is often best to lowercase everything, since users will use lowercase regardless of ‘correct’ capitalization:

– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– Car vs. Automobile?

Normalization issues

• Good for:
– allowing instances of Automobile at the beginning of a sentence to match a query of automobile
– helping a search engine when most users type Ferrari while they are interested in a Ferrari car
• Not advisable for:

– proper names vs. common nouns, e.g. General Motors, Associated Press, Kebede…
• Solution: lowercase only words at the beginning of a sentence.

In IR, lowercasing everything is most practical because of the way users issue their queries.
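A minimal sketch of the two normalizations mentioned above, period deletion and case folding:

    def normalize(token):
        token = token.replace(".", "")  # period deletion: U.S.A. -> USA
        return token.lower()            # case folding: Republican -> republican

    print(normalize("U.S.A."), normalize("USA"))  # usa usa -> the two now match
    print(normalize("Anti-discriminatory"))       # anti-discriminatory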

2.5.4. Stemming/Morphological analysis

Morphology is the study of the structure of a word and how it is built. Martin and Jurafsky [20] describe words as built from small building blocks called morphemes, which can be divided into two groups: stems and affixes. A stem is the main part of a word, and affixes are morphemes added to a stem to give it different meanings. In the word dogs, for example, dog is the stem and -s is the affix. Affixes allow a word to occur in different forms, giving it different inflections and derivations. How common these variations are differs between languages: according to Hedlund et al. [17], a language can be considered simple or complex with regard to morphology. English is considered simple, while Swedish is considered morphologically complex.
Stemming reduces tokens to their root forms in order to recognize morphological variation. The process involves the removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem. There are two types of morphology: inflectional and derivational.

Inflectional morphology varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.

Derivational morphology makes new words from old ones. E.g. creation is formed from create, but they are two separate words; similarly, destruction → destroy. Correct stemming is language-specific and can be complex.

Stemming is one technique for finding morphological variants of search terms. It is used to improve retrieval effectiveness and to reduce the size of indexing files.
a. Taxonomy for stemming algorithms


Fig 2.3. Taxonomy for stemming algorithms
Criteria for judging stemmers

a. Correctness
o Over-stemming: too much of a term is removed.
o Under-stemming: too little of a term is removed.
b. Retrieval effectiveness: measured with recall and precision, and with speed, size, and so on.
c. Compression performance: how greatly the stemmer reduces the cost of storing common words.
Types of stemming algorithms

There are four basic types of stemming algorithms:

• Table lookup approach
• Successor variety
• N-gram stemmers
• Affix removal stemmers

1. Table lookup approach

Stemming is done via lookups in a table. A table of all index terms and their stems is stored, so terms from queries and indexes can be stemmed very fast.

Problems:

– There is no such data for English, and some terms are domain-dependent.
– There is a storage overhead for such a table, though trading size for time is sometimes warranted.
2. Successor variety approach

This approach determines word and morpheme boundaries based on the distribution of phonemes in a large body of utterances. The successor variety of a string is the number of different characters that follow it in words in some body of text. The successor variety of substrings of a term will decrease as more characters are added, until a segment boundary is reached. See the following example.

Table 2. Example of the successor variety approach

Cutoff method

Some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.

Peak and plateau method


A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.

A criterion used to evaluate various segmentation methods is the number of correct segment cuts divided by the total number of cuts. After segmenting, if the first segment occurs in more than 12 words in the corpus, it is probably a prefix. The successor variety stemming process has three parts:

1. determine the successor varieties for a word;

2. segment the word using one of the methods;

3. select one of the segments as the stem.
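A sketch of step 1, computing successor varieties for the test word "apple" against a small, purely illustrative body of text:

    corpus = ["able", "axle", "accident", "ape", "about"]  # illustrative corpus

    def successor_variety(prefix):
        # Number of distinct characters that follow `prefix` in corpus words.
        return len({w[len(prefix)] for w in corpus
                    if w.startswith(prefix) and len(w) > len(prefix)})

    word = "apple"
    for i in range(1, len(word) + 1):
        print(word[:i], successor_variety(word[:i]))
    # a 4, ap 1, app 0, ... -- the sharp drop after "a" suggests a boundary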

3. n-gram stemmers
Association measures are calculated between pairs of terms based on shared unique digrams.

statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti

Dice's coefficient (similarity):

S = 2C / (A + B)

where A and B are the numbers of unique digrams in the first and the second words, and C is the
number of unique digrams shared by the two. For the pair above, A = 7, B = 8 and C = 6
(at, ic, is, st, ta, ti), so S = 2 × 6 / (7 + 8) = 0.80. Similarity measures are determined for all
pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is
available, terms are clustered using a single link clustering method.
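A short sketch that reproduces the digram computation and Dice's coefficient for the pair above:

    def unique_digrams(word):
        # All distinct adjacent character pairs in the word.
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def dice(w1, w2):
        # Dice's coefficient S = 2C / (A + B) over unique digrams.
        a, b = unique_digrams(w1), unique_digrams(w2)
        return 2 * len(a & b) / (len(a) + len(b))

    print(len(unique_digrams("statistics")))    # A = 7
    print(len(unique_digrams("statistical")))   # B = 8
    print(dice("statistics", "statistical"))    # 0.8, as computed above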
4. Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. A simple
example is the following set of rules (Harman 1991):

– If a word ends in "ies" but not "eies" or "aies", then "ies" → "y"
– If a word ends in "es" but not "aes", "ees" or "oes", then "es" → "e"
– If a word ends in "s" but not "us" or "ss", then "s" → NULL
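The three rules above translate almost directly into code. A minimal sketch (only these three
rules, applied in order, first match wins):

    def s_stem(word):
        # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"
        # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-1]
        # Rule 3: "s" -> NULL, unless the word ends in "us" or "ss".
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]
        return word

    print(s_stem("queries"))  # query
    print(s_stem("stems"))    # stem
    print(s_stem("glass"))    # glass (ends in "ss", so it is left untouched)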


The Porter algorithm

The most common stemmer is the Porter stemmer (Porter, 1980). It is designed to fit the
characteristics of the English language.

Idea: suffixes in the English language are mostly made up of a combination of smaller and
simpler suffixes.

• How does it work?
– The algorithm runs through five steps, one by one.
– In each step, several rules are applied that change the word's suffix.

Some (simplified!) examples of rules used in the Porter stemmer are given below.

Table 3. Examples of rules used in the Porter algorithm, and some transformations made by the
Porter stemmer:
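A widely used implementation of the Porter algorithm ships with the NLTK library; a minimal
sketch (assuming nltk is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "relational", "generalization"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni,
    # relational -> relat, generalization -> gener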

Stemming Studies: Conclusion

 The majority of reported effects of stemming on retrieval performance have been positive
 Stemming is as effective as manual conflation
 The effect of stemming depends on the nature of the vocabulary used
 There appears to be little difference between the retrieval effectiveness of different full
stemmers


2.5.5. Selecting Index Term


Index language is the language used to describe documents and requests. Elements of the index
language are index terms which may be derived from the text of the document to be described, or
may be arrived at independently.
If a full-text representation of the text is adopted, then all words in the text are used as index
terms (full-text indexing). Otherwise, we need to select the words to be used as index terms,
reducing the size of the index file, which is basic to designing an efficient searching IR system.


CHAPTER THREE

INDEXING STRUCTURES

3.1. Introduction

Indexing is an important process in Information Retrieval (IR) systems. It forms the core
functionality of the IR process since it is the first step in IR and assists in efficient information
retrieval. Indexing reduces the documents to the informative terms contained in them.

File (Index) structures: A fundamental decision in the design of IR systems is which type of file
structure to use for the underlying document database. The file structures used in IR systems are
flat files, inverted files, signature files, PAT trees, and graphs. Though it is possible to keep file
structures in main memory, in practice IR databases are usually stored on disk because of their
size. Using a flat file approach, one or more documents are stored in a file, usually as ASCII or
EBCDIC text. Flat file searching is usually done via pattern matching. On UNIX, for example,
one can store a document collection with one document per file in a UNIX directory, and search
it using pattern-matching tools such as grep (Earhart 1986) or awk (Aho, Kernighan, and
Weinberger 1988).

Information is organized into a large number of documents: large collections of documents
drawn from various sources, such as books, journal articles, conference papers, newspapers,
magazines, digital libraries, Web pages, etc.
o Sample Statistics of Text Collections
• Dialog:
– claims to have more than 15 terabytes of data in >600 Databases, > 800 million unique records
• LEXIS/NEXIS:
– claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000
searches per day; 9 mainframes, 300 Unix servers, 200 NT servers

• Web Search Engines:


– Google claims to index over 1.5 billion pages


• TREC collections:
– a total of about 5 gigabytes of text

o The Text Retrieval Conf. (TREC)


TREC was started in 1992. Its goal is to develop an evaluation methodology for terabyte-scale
document collections. The size of the test data has reached several GBs of text and millions of
documents. The TREC test collections and evaluation software are available to all researchers in
IR, so that they can evaluate their own retrieval systems at any time.
For each TREC, a test set of documents and questions is provided. The participants run their
own IR systems on the data and return to TREC a list of the retrieved top-ranked documents.
TREC pools the individual results, judges the retrieved documents for correctness, and evaluates
the results.
The number of participating systems & tasks in TREC has grown each year. For instance, 93
groups representing 22 countries participated in TREC 2003. TREC has also introduced
evaluations for open-domain question answering and content-based retrieval of digital video.
o Document corpus
The corpus may be:
• Primary documents: e.g., books, journal articles, or Web pages.
•Surrogates: a representation of a document such as a title, author, subject, and a short summary.
e.g., catalog records or abstracts, which refer to the primary documents. Surrogates are common
to display the answers to a user query.
The storage of the documents may be:
• Central (monolithic) – all documents stored together on a single server (e.g., library catalog)
• Distributed database – all documents stored on several servers. The database may be managed
together (e.g., Medline) or managed independently (e.g., the Web).

Each document has a unique identifier: a document ID that can be used by the search system to
refer to the actual document.
o Storage of text: Image vs. ASCII
Document images: Scanned image of the text document. Not searchable as text: Texts
(characters, words, etc.) are represented as patterns of pixels. Retrieval from Document Images:
Two options


•Recognition-based retrieval: OCR is required to convert document images to ASCII (may be


error-prone). Then Apply text retrieval systems to the recognized documents
•Document image retrieval: retrieval without explicit recognition. Search relevant documents
directly from image collections.
(How do present search engines, like Google, search for relevant document images?) Textual
documents: searchable as text; characters and words are represented as ASCII codes.

3.2. Tries, Suffix Trees and Suffix Arrays

3.2.1. Tries
Trie: in computer science, a trie, also called a digital tree and sometimes a radix tree or prefix
tree (as they can be searched by prefixes), is a kind of search tree: an ordered tree data structure
that is used to store a dynamic set or associative array where the keys are usually strings.

The trie is a tree data structure similar to the binary tree. It stores data in a particular fashion, so
that retrieval of the data becomes much faster and helps performance. The name "TRIE" is
coined from the word retrieve.

What are the usages or applications of the TRIE data structure?

1. Dictionary suggestions or auto-complete

Auto-suggestion of words while searching for anything in a dictionary is very common. If we
search for the word "tiny", it auto-suggests words starting with the same characters, like "tine",
"tin", "tinny", etc. Auto-suggestion is very useful, and the trie plays a nice role there: if a person
doesn't know the complete spelling of some word but knows a few characters, then the rest of
the words starting with those few characters can be auto-suggested using the TRIE data
structure.

2. Searching a contact from a mobile contact list or phone directory

Retrieving data stored in a trie data structure is very fast, so it is well suited to applications
where retrieval is performed frequently, like a phone directory, where the contact-searching
operation is used frequently.


A prefix tree, or trie (often pronounced "try"), is a tree whose nodes don't hold keys, but
rather, hold partial keys. For example, if you have a prefix tree that stores strings, then each node
would be a character of a string. If you have a prefix tree that stores arrays, each node would be
an element of that array. The elements are ordered from the root. So if you had a prefix tree with
the word "hello" in it, then the root node would have a child "h," and the "h" node would have a
child, "e," and the "e" node would have a child node "l," etc. The deepest node of a key would
have some sort of boolean flag on it indicating that it is the terminal node of some key.

(This is important because the last node of a key isn't always a leaf node... consider a prefix tree
with "dog" and "doggy" in it). Prefix trees are good for looking up keys with a particular prefix.

A trie (pronounced “try”) is a tree representing a collection of strings with one node per common
prefix. Smallest tree such that:

 Each edge is labeled with a character c ∈ Σ


 A node has at most one outgoing edge labeled c, for c ∈ Σ


 Each key is “spelled out” along some path starting at the root

Tries: example

Represent the following map with a Trie:
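Since the figure for this example is not reproduced here, the following sketch shows the same
idea: a trie built as nested dictionaries over a small, hypothetical string-to-value map, with a
terminal marker playing the role of the boolean end-of-key flag described above:

    data = {"he": 1, "hello": 2, "help": 3, "sell": 4}   # hypothetical map

    def build_trie(pairs):
        root = {}
        for key, value in pairs.items():
            node = root
            for ch in key:                 # one node per character; prefixes shared
                node = node.setdefault(ch, {})
            node["$"] = value              # terminal marker: end of a stored key
        return root

    def lookup(trie, key):
        node = trie
        for ch in key:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$")

    trie = build_trie(data)
    print(lookup(trie, "help"))   # 3
    print(lookup(trie, "hel"))    # None: "hel" is only a prefix, not a stored key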

3.2.2. Suffix Trees

Definition. Let T = T[1..n] be a text of length n over a fixed alphabet Σ. A suffix tree for T is a
tree with n leaves and the following properties:

1. Every internal node other than the root has at least two children.

2. Every edge is labeled with a nonempty substring of T.

3. The edges leaving a given node have labels starting with different letters.

4. The concatenation of the labels of the path from the root to leaf i spells out the i-th suffix
T[i..n] of T. We denote T[i..n] by Ti.


Example:-

A Suffix Tree for a given text is a compressed trie for all suffixes of the given text.

{bear, bell, bid, bull, buy, sell, stock, stop}

Following is the standard trie for the above input set of words.

Following is the compressed trie. The compressed trie is obtained from the standard trie by
joining chains of single nodes; the nodes of a compressed trie can be stored compactly by
keeping index ranges at the nodes.


Let us consider an example text "banana\0", where '\0' is the string termination character.
Following are all the suffixes of "banana\0":

banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a trie, we get the
following.


If we join chains of single nodes, we get the following compressed trie, which is the suffix tree
for the given text "banana\0".
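A compact sketch of the uncompressed construction: every suffix of "banana$" is inserted into a
nested-dict trie (here "$" plays the role of the "\0" terminator; joining single-child chains into a
compressed trie is left out for brevity):

    def suffix_trie(text):
        # Insert every suffix of text into a nested-dictionary trie.
        root = {}
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node.setdefault(ch, {})
        return root

    t = suffix_trie("banana$")
    # The root has one child per distinct first character of a suffix.
    print(sorted(t.keys()))   # ['$', 'a', 'b', 'n']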

3.2.3. Suffix Arrays

An array is an aggregate data structure designed to store a group of objects of the same or
different types. A suffix array for a text T is simply a sorted (lexicographically ordered) array of
the starting positions of all suffixes of T; it offers much of the functionality of a suffix tree in a
more space-economical form.
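A suffix array is easy to sketch: sort the starting positions of all suffixes by the suffixes
themselves (a simple quadratic construction; real systems use faster algorithms):

    def suffix_array(text):
        # Positions of all suffixes, ordered lexicographically by suffix.
        return sorted(range(len(text)), key=lambda i: text[i:])

    t = "banana$"
    for i in suffix_array(t):
        print(i, t[i:])
    # 6 $ | 5 a$ | 3 ana$ | 1 anana$ | 0 banana$ | 4 na$ | 2 nana$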

3.3. Signature Files

In signature file indexes [Faloutsos 1992], each record is allocated a fixed-width signature, or
bitstring, of w bits. Each word that appears in the record is hashed a number of times to
determine the bits in the signature that should be set, with no remedial action taken if two or
more distinct words should happen (as is inevitable) to set the same bit. Conjunctive queries are
similarly hashed, then evaluated by comparing the query signature to each record signature;
disjunctive queries are turned into a series of signatures, one per disjunct. Any record whose
signature has a 1-bit corresponding to every 1-bit in the query signature is a potential answer.

Each such record must be fetched and checked directly against the query to determine whether it
is a false match (a record which the signature indicates may be an answer, but in fact is not) or a
true match. Again, an address table is used to convert record numbers to addresses.
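A toy sketch of signature filtering (the width W, the number of hash functions K, and the
records are all illustrative choices, not prescribed values):

    import hashlib

    W = 64   # signature width in bits (illustrative)
    K = 3    # bits set per word (illustrative)

    def word_bits(word):
        # Hash the word K times to choose the bit positions it sets.
        return {int(hashlib.md5((word + str(k)).encode()).hexdigest(), 16) % W
                for k in range(K)}

    def signature(words):
        sig = 0
        for w in words:
            for b in word_bits(w):
                sig |= 1 << b
        return sig

    records = {1: "the cat sat on the mat", 2: "dogs chase cats in the yard"}
    sigs = {rid: signature(text.split()) for rid, text in records.items()}
    query = signature(["cat", "mat"])            # conjunctive query

    for rid, sig in sigs.items():
        if sig & query == query:                 # every query bit present?
            # Potential answer only: the record must still be fetched and
            # checked against the query, since it may be a false match.
            print("candidate record:", rid)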


3.4. Designing an IR System

Our focus during IR system design is on two concerns:

 Improving the effectiveness of the system, measured in terms of precision and recall; this is
influenced by stemming, stopwords, weighting schemes, and matching algorithms.
 Improving the efficiency of the system; the concern here is storage space usage and access
time, addressed through compression, data/file structures, and space-time tradeoffs.

3.4.1. Subsystems of an IR System

The two subsystems of an IR system are indexing and searching.
• Indexing:

Indexing is an offline process of organizing documents using keywords extracted from the
collection. It is used to speed up access to the desired information in the document collection, as
per the users' queries.

fig 3.1. indexing process


Searching

Searching is an online process that scans the document corpus to find relevant documents that
match the user's query.

fig 3.2. searching process


Indexing and searching are inextricably connected. You cannot search what was not first
indexed in some manner or other. Documents or objects are indexed in order to make them
searchable. There are many ways to do indexing; to index, one needs an indexing language, and
there are many indexing languages. Even taking every word in a document is an indexing
language. Knowing searching is knowing indexing.

3.4.2. Indexing Basic Concepts

An indexing language is a language used to describe documents and requests. The elements of
the index language are index terms, which may be derived from the text of the document to be
described, or may be arrived at independently. Index languages may be described as
pre-coordinate or post-coordinate: the former indicates that terms are coordinated at the time of
indexing, and the latter at the time of searching. More specifically, in pre-coordinate indexing a
logical combination of any index terms may be used as a label to identify a class of documents,
whereas in post-coordinate indexing the same class would be identified at search time by

combining the classes of documents labeled with the individual index terms.
One last distinction: the vocabulary of an index language may be controlled or uncontrolled. The
former refers to a list of approved index terms that an indexer may use, such as, for example,
that used by MEDLARS. The controls on the language may also include hierarchic relationships
between the index terms, or one may insist that certain terms can only be used as adjectives (or
qualifiers).
There is really no limit to the kind of syntactic controls one may put on a language. There is
much controversy about the kind of index language which is best for document retrieval. The
recommendations range from the complicated relational languages of Farradane et al. [12] and
the Syntol group to the simple index terms extracted by the text processing systems just
described. The main debate is really about whether automatic indexing is as good as or better
than manual indexing. Each can be done to various levels of complexity. However, there seems
to be mounting evidence that in both cases, manual and automatic indexing, adding complexity
in the form of controls more elaborate than index term weighting does not pay dividends. This
has been demonstrated by the results obtained by Cleverdon et al. [14], Aitchison et al. [15], the
Comparative Systems Laboratory [16] and more recently Keen and Digger [17]. The message is
that uncontrolled vocabularies based on natural language achieve retrieval effectiveness
comparable to vocabularies with elaborate controls. This is extremely encouraging, since the
simple index language is the easiest to automate.
Probably the most substantial evidence for automatic indexing has come out of the SMART
project (1966). Salton [18] summarized its conclusions: '... on the average the simplest indexing
procedures which identify a given document or query by a set of terms, weighted or unweighted,
obtained from document or query text are also the most effective'. The recommendations are
clear: automatic text analysis should use weighted terms derived from document excerpts whose
length is at least that of a document abstract.

The document representatives used by the SMART project are more sophisticated than just the
lists of stems extracted by conflation. There is no doubt that stems rather than ordinary word
forms are more effective (Carroll and Debruyn [19]). On top of this the SMART project adds
index term weighting, where an index term may be a stem or some concept class arrived at
through the use of various dictionaries; the details of how SMART elaborates its document
representatives are beyond the scope of this module.

An index file consists of records, called index entries. The usual unit for indexing is the word.
Index terms are used to look up records in a file.

o Issues in Indexing

Creating the index (with the objective of reducing storage space and searching time):
o what index terms to use: words, sentences, paragraphs, etc.
o what indexing structure to use: inverted file, suffix array, etc.

Storing the index:
o the need for compression
o where to store the index

Processing the index:
o access the index file directly from disk or load it into RAM
o select a file/data structure that speeds up execution and reduces memory usage

Updating the index:
o is the update performed in batch or as documents arrive one by one?
o are we updating incrementally or completely re-indexing?
o how to synchronize changes to the index and the documents?


3.4.3. Basic Indexing Process

Major steps in index construction:

Source file: the collection of text documents. A document can be described by a set of
representative keywords called index terms.

Index term selection:

o Tokenize: identify the words in a document, so that each document is represented by a list of
keywords or attributes.
o Stop words: removal of high-frequency words; a stop list of words is used for comparison
against the input text.
o Stemming and normalization: reduce words with similar meaning to their stem/root word.
Suffix stripping is the common method.
o Weighting terms: different index terms have varying importance when used to describe
document contents. This effect is captured through the assignment of numerical weights to each
index term of a document. There are different index term weighting methods (TF, DF, CF)


based on which TF*IDF weight can be calculated during searching.


Output: a set of index terms (the vocabulary) to be used for indexing the documents in which
each term occurs.
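A condensed sketch of these steps (the stop list is a tiny, hypothetical sample, and the crude
suffix-stripping rule stands in for a real stemmer):

    import re
    from collections import Counter

    STOP_WORDS = {"the", "of", "in", "a", "is", "to", "and"}  # hypothetical stop list

    def index_terms(document):
        tokens = re.findall(r"[a-z]+", document.lower())       # tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
        stems = [t[:-1] if t.endswith("s") and not t.endswith(("ss", "us"))
                 else t for t in tokens]                       # crude stemming
        return Counter(stems)                                  # term -> TF

    print(index_terms("The shipment of gold arrived in a truck"))
    # Counter({'shipment': 1, 'gold': 1, 'arrived': 1, 'truck': 1})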

3.4.4. Building the Index File

An index file of a document collection is a file consisting of a list of index terms and, for each
term, a link to one or more documents containing that term. A good index file maps each
keyword Ki to the set of documents Di that contain the keyword. An index file usually keeps the
index terms in sorted order; the sort order of the terms in the index file provides an order on the
physical file.

An index file is a list of search terms that are organized for associative look-up, i.e., to answer
the user's queries:

In which documents does a specified search term appear?
Where within each document does each term appear? (There may be several occurrences.)

For organizing the index file for a collection of documents, various options are available: decide
what data structure and/or file structure to use. Is it a sequential file, inverted file, suffix array,
signature file, etc.?

3.4.5. Index File Evaluation Metrics

Running time:
o Indexing time
o Access/search time: does the structure allow sequential or random searching/access?
o Update time (insertion, deletion, modification): can the indexing structure support re-indexing
or incremental indexing?

Space overhead:
o Computer storage space consumed.

Access types supported efficiently:
o Does the indexing structure allow access to records with a specified term, or to records with
terms falling in a specified range of values?


3.4.6. Types of an Index File

Most surveys of file structures address themselves to applications in data management, which is
reflected in the terminology used to describe the basic concepts.
There is one important distinction that must be made at the outset when discussing file
structures, and that is the difference between the logical and physical organization of the data.
On the whole a file structure will specify the logical structure of the data, that is, the
relationships that will exist between data items independently of the way in which these
relationships may actually be realized within any computer. It is this logical aspect that we will
concentrate on.

The physical organization is much more concerned with optimizing the use of the storage
medium when a particular logical structure is stored on, or in it. Typically for every unit of
physical store there will be a number of units of the logical structure (probably records) to be
stored in it. For example, if we were to store a tree structure on a magnetic disk, the physical
organization would be concerned with the best way of packing the nodes of the tree on the disk
given the access characteristics of the disk.

o Sequential File
A sequential file is the most primitive of all file structures. It has no directory and no linking
pointers. The records are generally organized in lexicographic order on the value of some key. In
other words, a particular attribute is chosen whose value will determine the order of the records.
Sometimes when the attribute value is constant for a large number of records a second key is
chosen to give an order when the first key fails to discriminate.
The implementation of this file structure requires the use of a sorting routine.
Its main advantages are:
1. it is easy to implement;
2. it provides fast access to the next record using lexicographic order.
Its disadvantages are:
1. it is difficult to update – inserting a new record may require moving a large proportion of the
file;
2. random access is extremely slow.


Sometimes a file is considered to be sequentially organized despite the fact that it is not ordered
according to any key. Perhaps the date of acquisition is considered to be the key value, the
newest entries are added to the end of the file and therefore pose no difficulty to updating.
Example:
• Given a collection of documents, they are parsed to extract words and these are saved with the
Document ID.

Sorting the Vocabulary


o Inverted file
The importance of this file structure will become more apparent when Boolean Searches are
discussed in the next chapter. For the moment we limit ourselves to describing its structure. An
inverted file is a file structure in which every list contains only one record. Remember that a list
is defined with respect to a keyword K, so every K-list contains only one record. This implies
that the directory will be such that ni = hi for all i, that is, the number of records containing Ki
will equal the number of Ki -lists. So the directory will have an address for each record
containing Ki . For document retrieval this means that given a keyword we can immediately
locate the addresses of all the documents containing that keyword.

An inverted file is a technique that indexes based on a sorted list of terms, with each term having
links to the documents containing it. Building and maintaining an inverted index is a relatively
low-cost, low-risk task: on a text of n words, an inverted index can be built in O(n) time.

Content of the inverted file. The data to be held in the inverted file includes:
• The vocabulary (the list of terms)
• The occurrences (the location and frequency of terms in the document collection)


The occurrences: one record per term, listing the frequency of each term in each document:
• TFij, the number of occurrences of term tj in document di
• DFj, the number of documents containing tj
• maxi, the maximum frequency of any term in di
• N, the total number of documents in the collection
• CFj, the collection frequency of tj (its total number of occurrences in the collection)
together with the locations/positions of words in the text.

Term Weighting: Term Frequency (TF)

• TF (term frequency): count the number of times a term occurs in a document; fij = frequency
of term i in document j.
• The more times a term t occurs in document d, the more likely it is that t is relevant to the
document, i.e., more indicative of the topic. If used alone, TF favors common words and long
documents; it gives too much credit to words that appear frequently. There is therefore a need to
normalize term frequency (tf).


Document Frequency

It is defined to be the number of documents in the collection that contain a term

DF = document frequency


Count the frequency considering the whole collection of documents. The less frequently a term
appears in the whole collection, the more discriminating it is. dfi (document frequency of term i)
= number of documents containing term i.

Why vocabulary?
– Having information about the vocabulary (list of terms) speeds searching for relevant
documents.
Why location?
– Having information about the location of each term within the document helps for: user
interface design (highlighting the location of search terms) and proximity-based ranking
(adjacency and NEAR operators in Boolean searching).
Why frequencies?
– Having information about frequency is used for calculating term weights (like IDF, TF*IDF,
...) and for optimizing query processing.

In an inverted file, documents are organized by the terms/words they contain.

Table 4 Example of inverted file structure


This is called an index file


Text operations are performed before building the index.


Organization of the Index File. An inverted index consists of two files:
• a vocabulary file
• a posting file

Vocabulary file
The vocabulary file (word list) stores all of the distinct terms (keywords) that appear in any of
the documents (in lexicographical order) and, for each word, a pointer into the posting file. In
the vocabulary file, the record kept for each term j contains: term j, DFj, CFj and a pointer to the
posting file.

Postings file (inverted list): for each distinct term in the vocabulary, it stores a list of pointers to
the documents that contain that term. Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document. A separate inverted list is stored for each term in the index
file. Each list consists of one or many individual postings holding the Document ID, TF and
location information for a given term i.

Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the posting file allows:
o the vocabulary to be kept in memory at search time, even for a large text collection, while the
posting file is kept on disk for access to documents.

Separation of the inverted file into a vocabulary and a posting file is a good idea.


Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be
kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the
size of the vocabulary index is on the order of 100 MB, which can easily be held in the memory
of a dedicated computer.
– The posting file requires much more space. For each word appearing in the text we keep
statistical information related to its occurrences in documents, and each of the postings pointers
to the documents requires extra space of O(n).
Example:

– Given a collection of documents, they are parsed to extract words and these are saved with the
Document ID.

Step 1: Sorting the Vocabulary


Step 2: Remove stopwords, apply stemming, and compute term frequency


Step 3: The file is commonly split into a Dictionary and a Posting file
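Putting the three steps together, a minimal in-memory sketch of the dictionary/postings split
(the documents are assumed to be already tokenized and normalized):

    from collections import defaultdict, Counter

    docs = {
        1: ["gold", "shipment", "damaged", "fire"],
        2: ["silver", "delivery", "arrived", "silver", "truck"],
        3: ["gold", "shipment", "arrived", "truck"],
    }

    postings = defaultdict(list)     # term -> [(doc_id, tf), ...]
    for doc_id, terms in sorted(docs.items()):
        for term, tf in Counter(terms).items():
            postings[term].append((doc_id, tf))

    # The dictionary holds each sorted term with its DF and (conceptually)
    # a pointer to its posting list; the postings live in a separate file.
    dictionary = {term: len(plist) for term, plist in sorted(postings.items())}

    print(dictionary["gold"])       # DF = 2
    print(postings["silver"])      # [(2, 2)]: the term appears twice in document 2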


Complexity Analysis
• The inverted index can be built in O(n) + O(n log n) time, where n is the number of vocabulary
terms.
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index it is possible to apply incremental indexing, which requires O(k)
time, where k is the number of new index terms.


CHAPTER FOUR

INFORMATION RETRIEVAL (IR) MODELS

At the end of this chapter every student must be able to:

 Define what model is

 Describe why model is needed in information retrieval

 Differentiate different types of information retrieval models

 Boolean Model

 Vector space model

 Probabilistic model

 Know how to calculate and find the similarity of some documents to the given query

 Identify term frequency, document frequency, inverse document frequency, term weight and
similarity measurements

What is a model? A model is an ideal abstraction of something that works in the real world.
There are two good reasons for having models of information retrieval. The first is that models
guide research and provide the means for academic discussion. The second is that models can
serve as a blueprint to implement an actual retrieval system. In IR, mathematical models are used
to understand and reason about some behavior or phenomena in the real world. A model of
information retrieval predicts and explains what a user will find relevant given the user query.

4.1. IR MODELS – BASIC CONCEPTS

Traditional information retrieval systems usually adopt index terms to index and retrieve
documents. In a restricted sense, an index term is a keyword (or group of related words) that has
some meaning of its own (i.e., which usually has the semantics of a noun). In its more general


form, an index term is simply any word that appears in the text of a document in the collection.
Retrieval based on index terms is simple but raises key questions regarding the information
retrieval task.

For instance, retrieval using index terms adopts as a fundamental foundation the idea that the
semantics of the documents and of the user information need can be naturally expressed through
sets of index terms. Clearly, this is a considerable oversimplification of the problem because a lot
of the semantics in a document or user request is lost when we replace its text with a set of
words. Furthermore, matching between each document and the user request is attempted in this
very imprecise space of index terms. Thus, it is no surprise that the documents retrieved in
response to a user request expressed as a set of keywords are frequently irrelevant. If one also
considers that most users have no training in properly forming their queries, the problem is
worsened with potentially disastrous results. The frequent dissatisfaction of Web users with the
answers they normally obtain is just one good example of this fact.

Clearly, one central problem regarding information retrieval systems is the issue of predicting
which documents are relevant and which are not. Such a decision is usually dependent on a
ranking algorithm that attempts to establish a simple ordering of the documents retrieved.
Documents appearing at the top of this order are considered to be more likely to be relevant.
Thus, ranking algorithms are at the core of information retrieval systems. A ranking algorithm
operates according to basic premises regarding the notion of document relevance. Distinct sets
of premises (regarding document relevance) yield distinct information retrieval models. The IR
model adopted determines the predictions of what is relevant and what is not (i.e., the notion of
relevance implemented by the system).

4.3. Classification of Information Retrieval Models

4.3.1. Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since
the concept of a set is quite intuitive, the Boolean model provides a framework that is easy to
grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean
expressions which have precise semantics. Given its inherent simplicity and neat formalism, the


Boolean model received great attention in past years and was adopted by many of the early
commercial bibliographic systems.

fig 4.1. The three conjunctive components for the query q = ka ∧ (kb ∨ ¬kc)
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is
based on a binary decision criterion (i.e., a document is predicted to be either relevant or
non-relevant) without any notion of a grading scale, which prevents good retrieval performance.
Thus, the Boolean model is in reality much more a data (instead of information) retrieval model.
Second, while Boolean expressions have precise semantics, frequently it is not simple to
translate an information need into a Boolean expression. In fact, most users find it difficult and
awkward to express their query requests in terms of Boolean expressions. The Boolean
expressions actually formulated by users often are quite simple.

Despite these drawbacks, the Boolean model is still the dominant model in commercial
document database systems and provides a good starting point for those new to the field. The
Boolean model considers that index terms are present or absent in a document. As a result, the
index term weights are assumed to be all binary, i.e., wi,j ∈ {0, 1}. A query q is composed of
index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional
Boolean expression, which can be represented as a disjunction of conjunctive vectors (i.e., in
disjunctive normal form, DNF).

For instance, the query q = ka ∧ (kb ∨ ¬kc) can be written in disjunctive normal form as
qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each of the components is a binary weighted vector
associated with the tuple (ka, kb, kc). These binary weighted vectors are called the conjunctive
components of qdnf.


Definition. For the Boolean model, the index term weight variables are all binary, i.e.,
wi,j ∈ {0, 1}. A query q is a conventional Boolean expression. Let qdnf be the disjunctive
normal form of the query q, and let qcc be any of the conjunctive components of qdnf. The
similarity of a document dj to the query q is defined as

sim(dj, q) = 1 if there exists a conjunctive component qcc of qdnf that matches the binary term
vector of dj on the query terms, and 0 otherwise.

If sim(dj, q) = 1, then the Boolean model predicts that the document dj is relevant to the query q
(it might not be). Otherwise, the prediction is that the document is not relevant. The Boolean
model predicts that each document is either relevant or non-relevant.
There is no notion of a partial match to the query conditions. For instance, let dj be a document
for which dj = (0,1,0). Document dj includes the index term kb but is considered non-relevant to
the query q = ka ∧ (kb ∨ ¬kc). The main advantages of the Boolean model are the clean
formalism behind the model and its simplicity.
The main disadvantages are that exact matching may lead to retrieval of too few or too many
documents. Today, it is well known that index term weighting can lead to a substantial
improvement in retrieval performance. Index term weighting brings us to the vector model.

– Note that no weights are assigned in between 0 and 1; only the values 0 or 1 are used.

Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector. The binary formula
gives every word that appears in a document equal relevance. It can be useful when frequency is
not important.
Binary weights formula:

wij = 1 if term i appears in document j, and wij = 0 otherwise


The Boolean Model: Example


Given the following three documents, construct the term-document matrix and apply the
Boolean model for the query "gold silver truck":
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"

The table below shows the document-term (ti) matrix.

The Boolean Model: Further Example

Assume there are four documents and we want to retrieve some documents for the query
'Information AND Retrieval'. If the following are our documents, which documents can be
retrieved for the given query? The documents are:

Doc1: Computer Information Retrieval
Doc2: Computer Retrieval
Doc3: Information
Doc4: Computer Information

Query: Information AND Retrieval

Solution: put '1' in the table if the term exists in the document, and '0' if not. This identifies the
presence and absence of each term in the given documents.


Now, since our query is Information AND Retrieval, list the documents which contain each
term:

Information: Doc1, Doc3, Doc4
Retrieval: Doc1, Doc2

Then take the intersection of the above sets (i.e. the documents which contain both the word
information and the word retrieval). That is Doc1.
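The same solution expressed as a sketch, with each term's posting list represented as a set of
document identifiers:

    docs = {
        "Doc1": {"computer", "information", "retrieval"},
        "Doc2": {"computer", "retrieval"},
        "Doc3": {"information"},
        "Doc4": {"computer", "information"},
    }

    def containing(term):
        # Set of documents whose term set contains the (lowercased) term.
        return {d for d, terms in docs.items() if term.lower() in terms}

    # Information AND Retrieval = intersection of the two document sets.
    print(containing("Information") & containing("Retrieval"))   # {'Doc1'}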

4.3.2. Vector-Space Model

The vector model recognizes that the use of binary weights is too limiting and proposes a
framework in which partial matching is possible. This is accomplished by assigning non-binary
weights to index terms in queries and in documents. These term weights are ultimately used to
compute the degree of similarity between each document stored in the system and the user query.
By sorting the retrieved documents in decreasing order of this degree of similarity, the vector
model takes into consideration documents that match the query terms only partially.

The main resultant effect is that the ranked document answer set is a lot more precise (in the
sense that it better matches the user information need) than the document answer set retrieved by
the Boolean model.
Definition. For the vector model, the weight wi,j associated with a pair (ki, dj) is positive and
non-binary. Further, the index terms in the query are also weighted. Let wi,q be the weight
associated with the pair (ki, q), where wi,q ≥ 0. Then, the query vector q is defined as
q = (w1,q, w2,q, ..., wt,q), where t is the total number of index terms in the system. As before,
the vector for a document dj is represented by dj = (w1,j, w2,j, ..., wt,j). Therefore, a document dj and a


user query q are represented as t-dimensional vectors.

The vector model proposes to evaluate the degree of similarity of the document dj with regard to
the query q as the correlation between the vectors dj and q. This correlation can be quantified,
for instance, by the cosine of the angle between these two vectors. That is,

sim(dj, q) = (dj • q) / (|dj| × |q|)

fig 4.2. The cosine angle


where |dj| and |q| are the norms of the document and query vectors. The factor |q| does not
affect the ranking (i.e., the ordering of the documents) because it is the same for all documents.
The factor |dj| provides a normalization in the space of the documents. Since wi,j ≥ 0 and
wi,q ≥ 0, sim(dj, q) varies from 0 to +1. Thus, instead of attempting to predict whether a
document is relevant or not, the vector model ranks the documents according to their degree of
similarity to the query. A document might be retrieved even if it matches the query only
partially. For instance, one can establish a threshold on sim(dj, q) and retrieve the documents
with a degree of similarity above that threshold. But to compute rankings we need first to
specify how index term weights are obtained. Index term weights can be calculated in many
different ways. The work by Salton and McGill reviews various term-weighting techniques.
Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the basic principles
which support clustering techniques, as follows. Given a collection C of objects and a vague
description of a set A, the goal of a simple clustering algorithm might be to separate the
collection C of objects into two sets: a first one which is composed of objects related to the
set A, and a second one which is composed of objects not related to the set A. A vague
description here means that we do not have


complete information for deciding precisely which objects are and which are not in set A. For
instance, one might be looking for a set A of cars that have a price comparable to that of a Lexus
400. Since it is not clear what the term comparable means exactly, there is not a precise (and
unique) description of set A. More sophisticated clustering algorithms might attempt to separate
the objects of a collection into various clusters (or classes) according to their properties. For
instance, patients of a doctor specializing in cancer could be classified into five classes: terminal,
advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be
imprecise (and not unique) and the problem is one of deciding to which of these classes a new
patient should be assigned. In what follows, however, we only discuss the simpler version of the
clustering problem [i.e., the one which considers only two classes) because all that is required is
a decision on which documents are predicted to be relevant and which ones are predicted to be
not relevant (with regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the
documents as a collection C of objects and think of the user query as a (vague) specification of a
set A of objects. In this scenario, the IR problem can be reduced to the problem of determining
which documents are in set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be resolved. First, one
needs to determine which features better describe the objects in set A. Second, one needs to
determine which features better distinguish the objects in set A from the remaining objects in
the collection C. The first set of features provides for the quantification of intra-cluster
similarity, while the second set of features provides for the quantification of inter-cluster
dissimilarity.
The most successful clustering algorithms try to balance these two effects. In the vector model,
the intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a
document dj. Such term frequency is usually referred to as the tf factor and provides one
measure of how well that term describes the document contents (i.e., intra-document
characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term ki among the documents in the collection. This factor is usually
referred to as the inverse document frequency or the idf factor.


The motivation for the usage of an idf factor is that terms which appear in many documents are
not very useful for distinguishing a relevant document from a non-relevant one. As with good
clustering algorithms, the most effective term-weighting schemes for IR try to balance these two
effects. Definition: let N be the total number of documents in the system and ni be the number
of documents in which the index term ki appears.
Let freqi,j be the raw frequency of term ki in the document dj (i.e., the number of times the term
ki is mentioned in the text of the document dj). Then, the normalized frequency fi,j of term ki in
document dj is given by

fi,j = freqi,j / maxl freql,j

where the maximum is computed over all terms mentioned in the text of the document dj. The
best-known term-weighting scheme combines this with the inverse document frequency:

wi,j = fi,j × log2(N / ni)

or a variation of this formula. Such term-weighting strategies are called tf-idf schemes. Several
variations of the above expression for the weight wi,j are described in an interesting paper by
Salton and Buckley which appeared in 1988. However, in general, the above expression should
provide a good weighting scheme for many collections. For the query term weights, Salton and
Buckley suggest

wi,q = (0.5 + 0.5 × freqi,q / maxl freql,q) × log2(N / ni)

where freqi,q is the raw frequency of the term ki in the text of the information request q. The
main


advantages of the vector model are:

(1) its term-weighting scheme improves retrieval performance;

(2) its partial matching strategy allows retrieval of documents that approximate the query
conditions; and

(3) its cosine ranking formula sorts the documents according to their degree of similarity to the
query.

Theoretically, the vector model has the disadvantage that index terms are assumed to be
mutually independent (the cosine formula above does not account for index term dependencies).
However, due to the locality of many term dependencies, their indiscriminate application to all
the documents in the collection might in fact hurt the overall performance. Despite its
simplicity, the vector model is a resilient ranking strategy for general collections. It yields
ranked answer sets which are difficult to improve upon without query expansion or relevance
feedback within the framework of the vector model. A large variety of alternative ranking
methods have been compared to the vector model, but the consensus seems to be that, in
general, the vector model is either superior or almost as good as the known alternatives.
Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval
model nowadays.

Computing TF-IDF: An Example

Assume the collection contains 10,000 documents, and statistical analysis shows that the
document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies
(TF) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term.

A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774

• The query is also treated as a short document and is also tf-idf weighted, e.g. with
wiq = (0.5 + 0.5 × tfiq) × log2(N/dfi).

More Examples


Consider a document containing 100 words wherein the word computer appears 3 times. Now,
assume we have 10,000,000 documents and computer appears in 1,000 of these.
– The term frequency (TF) for computer: 3/100 = 0.03
– The inverse document frequency: idf = log2(10,000,000 / 1,000) = log2(10,000) = 13.288
– The TF*IDF score is the product of these frequencies: 0.03 * 13.288 = 0.399
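Both worked examples can be reproduced with a few lines (math.log2 is the base-2 logarithm
used in the text):

    import math

    def tf_idf(tf, df, n_docs):
        # tf is the already-normalized term frequency; df the document frequency.
        return tf * math.log2(n_docs / df)

    # First example: N = 10,000; TF normalized by the maximum TF (3).
    for term, tf, df in [("A", 3 / 3, 50), ("B", 2 / 3, 1300), ("C", 1 / 3, 250)]:
        print(term, round(tf_idf(tf, df, 10_000), 3))
    # A 7.644, B 1.962, C 1.774

    # Second example: TF normalized by document length; N = 10,000,000.
    print(round(tf_idf(3 / 100, 1_000, 10_000_000), 3))   # 0.399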

Example: Computing weights

A collection includes 10,000 documents.

 The term A appears 20 times in a particular document j
 The maximum frequency of any term in document j is 50
 The term A appears in 2,000 of the collection documents

Compute the TF*IDF weight of term A:
tf(A,j) = freq(A,j) / max(freq(k,j)) = 20/50 = 0.4
idf(A) = log2(N/DFA) = log2(10,000/2,000) = log2(5) = 2.32
wAj = tf(A,j) * idf(A) = 0.4 * 2.32 = 0.928
Similarity Measure

We now have vectors for all documents in the collection and a vector for the query. How do we
compute their similarity? A similarity measure is a function that computes the degree of
similarity or distance between the document vector and the query vector. Using a similarity
measure between the query and each document, it is possible to rank the retrieved documents in
order of presumed relevance, and also to enforce a certain threshold so that the size of the
retrieved set can be controlled.

Similarity/Dissimilarity Measures

• Euclidean distance

A common distance measure. Euclidean distance examines the root of the squared differences
between the coordinates of a pair of document and query vectors.

• Dot product

The dot product is also known as the scalar product or inner product. It is defined as the sum of
the products of the corresponding components of the query and document vectors.

• Cosine similarity (or normalized inner product)

It projects the document and query vectors into the term space and calculates the cosine of the
angle between them.


Euclidean distance:
• The distance between the vectors for document dj and query q can be computed as:

dis(dj, q) = sqrt( Σi (wij − wiq)² )

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

Dissimilarity Measures
• Euclidean distance is generalized to the popular dissimilarity measure called the Minkowski
distance:

dis(X,Y) = ( Σi |xi − yi|^q )^(1/q)

where X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) are two n-dimensional data objects; n is the
size of the vector of attributes of the data objects; q = 1, 2, 3, ...
• If q = 1, dis(X,Y) is the Manhattan distance; if q = 2, it is the Euclidean distance.

Inner Product
• The similarity between the vectors for document dj and query q can be computed as the vector
inner product:

sim(dj, q) = dj • q = Σi wij × wiq

where wij is the weight of term i in document j and wiq is the weight of term i in the query q.
• For binary vectors, the inner product is the number of matched query terms in the document


(size of intersection).
•For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Inner Product: Examples
• Given the following term-document matrix, using the inner product, which document is more
relevant for the query Q?

        t1  t2  t3
D1:      2   3   5
D2:      3   7   1
Q:       1   0   2

• sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12

• sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5

So D1 is the more relevant document for Q.

Cosine similarity
• Measures the similarity between d1 and d2 as the cosine of the angle θ between them.
• The denominator involves the lengths of the vectors, so the cosine measure is also known as
the normalized inner product.

Example 1: Computing Cosine Similarity

Let's say we have the query vector Q = (0.4, 0.8) and the document D1 = (0.2, 0.7). Compute
their similarity using the cosine measure.


Example 2: Computing Cosine Similarity

Let's say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the
query vector Q = (0.4, 0.8), determine which document is more relevant for the query.
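A sketch computing all three measures for the vectors of Examples 1 and 2:

    import math

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def cosine(x, y):
        return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

    Q = (0.4, 0.8)
    for name, d in [("D1", (0.8, 0.3)), ("D2", (0.2, 0.7))]:
        print(name, round(euclidean(Q, d), 3),
              round(dot(Q, d), 3), round(cosine(Q, d), 3))
    # D2 has the higher cosine (about 0.98 vs 0.73 for D1): its direction is
    # closer to Q's, so D2 is the more relevant document for this query.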

Example
Given three documents D1, D2 and D3 with their corresponding TF-IDF weights, which
documents are more similar according to the three similarity measures?
Vector Space with Term Weights and Cosine Matching


Vector-Space Model: Example

Suppose a user queries for: Q = "gold silver truck". The database collection consists of three
documents with the following content.
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
Show the retrieval results in ranked order:
1. Assume that full-text terms are used during indexing, without removing common terms or
stop words, and that no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing.
3. Also compare your results with and without normalizing term frequency.
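A compact sketch of case 1 (full-text indexing with raw term frequencies and idf = log2(N/df);
the cosine is computed over the shared vocabulary):

    import math
    from collections import Counter

    docs = {
        "D1": "shipment of gold damaged in a fire",
        "D2": "delivery of silver arrived in a silver truck",
        "D3": "shipment of gold arrived in a truck",
    }
    query = "gold silver truck"

    tfs = {d: Counter(text.split()) for d, text in docs.items()}
    vocab = sorted(set().union(*tfs.values()))
    N = len(docs)
    df = {t: sum(1 for c in tfs.values() if t in c) for t in vocab}

    def weights(counts):
        # tf * idf for every vocabulary term (idf = log2(N/df)).
        return [counts.get(t, 0) * math.log2(N / df[t]) for t in vocab]

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return dot / (nx * ny) if nx and ny else 0.0

    q = weights(Counter(query.split()))
    for d in sorted(docs, key=lambda d: cosine(weights(tfs[d]), q), reverse=True):
        print(d, round(cosine(weights(tfs[d]), q), 3))
    # D2 0.825, D3 0.327, D1 0.08: D2 ranks first, since it alone
    # contains "silver" (and contains it twice).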


Compute the similarity using the cosine measure, sim(q, d1):


First, for each document and query, compute all vector lengths (zero terms ignored)

Next, compute dot products (zero products ignored)


Now, compute the similarity score

Finally, we sort and rank documents in descending order according to the similarity scores

4.3.3. Probabilistic Model

The fundamental idea is as follows. Given a user query, there is a set of documents that contains
exactly the relevant documents and no other. Let us refer to this set of documents as the ideal
answer set.

Given the description of this ideal answer set, we would have no problems in retrieving its
documents. Thus, we can think of the querying process as a process of specifying the properties
of an ideal answer set (which is analogous to interpreting the IR problem as a problem of
clustering).
The problem is that we do not know exactly what these properties are. All we know is that there
are index terms whose semantics should be used to characterize these properties. Since these
properties are not known at query time, an effort has to be made initially to guess what they
could be.

This initial guess allows us to generate a preliminary probabilistic description of the ideal answer
set which is used to retrieve the first set of documents. Interaction with the user is then initiated
with the purpose of improving the probabilistic description of the ideal answer set. Such
interaction could proceed as follows.


The user takes a look at the retrieved documents and decides which ones are relevant and which
ones are not (in truth, only the first top documents need to be examined). The system then uses
this information to refine the description of the ideal answer set. By repeating this process many
times, it is expected that such a description will evolve and become closer to the real description
of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning
the description of the ideal answer set.

Furthermore, a conscious effort is made to model this description in probabilistic terms. The
probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle). Given a user query q and a document dj in the collection,
the probabilistic model tries to estimate the probability that the user will find the document dj
interesting (i.e., relevant).

The model assumes that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all documents which
the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and
should maximize the overall probability of relevance to the user. Documents in the set R are
predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.
This assumption is quite troublesome because it does not state explicitly how to compute the
probabilities of relevance. In fact, not even the sample space which is to be used for defining
such probabilities is given. Given a query q, the probabilistic model assigns to each document
dj, as a measure of its similarity to the query, the ratio
P(dj relevant to q) / P(dj non-relevant to q), which computes the odds of the document dj being
relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an
erroneous judgment.

Definition. For the probabilistic model, the index term weight variables are all binary, i.e.,
wi,j ∈ {0, 1} and wi,q ∈ {0, 1}. A query q is a subset of index terms. Let R be the set of
documents known (or initially guessed) to be relevant, and let R̄ be the complement of R (i.e.,
the set of non-relevant documents). Let P(R|dj) be the probability that the document dj is
relevant to the query q and P(R̄|dj) be the probability that dj is non-relevant to q. The similarity
sim(dj, q) of the document dj to the query q is defined as the ratio

sim(dj, q) = P(R|dj) / P(R̄|dj)


Applying Bayes' rule, this becomes

sim(dj, q) = ( P(dj|R) × P(R) ) / ( P(dj|R̄) × P(R̄) )

P(dj|R) stands for the probability of randomly selecting the document dj from the set R of
relevant documents. Further, P(R) stands for the probability that a document randomly selected
from the entire collection is relevant. The meanings attached to P(dj|R̄) and P(R̄) are analogous
and complementary. Since P(R) and P(R̄) are the same for all the documents in the collection,
we write

sim(dj, q) ~ P(dj|R) / P(dj|R̄)

Assuming independence of index terms and expanding over the query terms, P(ki|R) stands for
the probability that the index term ki is present in a document randomly selected from the set R,
and P(k̄i|R) stands for the probability that the index term ki is not present in a document
randomly selected from the set R. The probabilities associated with the set R̄ have meanings
that are analogous to the ones just described. Taking logarithms, recalling that
P(ki|R) + P(k̄i|R) = 1, and ignoring factors that are constant for all documents in the context of
the same query, we can finally write

sim(dj, q) ~ Σi wi,q × wi,j × ( log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) )

which is a key expression for ranking computation in the probabilistic model. Since we do not know the set R at the beginning, it is necessary to devise a method for initially computing the probabilities P(ki|R) and P(ki|R̄). There are many alternatives for such a computation. We discuss a couple of them below. In the very beginning (i.e., immediately after the query specification),
there are no retrieved documents. Thus, one has to make simplifying assumptions such as: (a) assume that P(ki|R) is constant for all index terms ki (typically, equal to 0.5) and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

P(ki|R) = 0.5        P(ki|R̄) = ni / N
where, as already defined, ni is the number of documents that contain the index term ki and N is the total number of documents in the collection. Given this initial guess, we can then retrieve documents that contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows. Let V be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top r ranked documents, where r is a previously defined threshold. Further, let Vi be the subset of V composed of the documents in V which contain the index term ki. For simplicity, we also use V and Vi to refer to the number of elements in these sets (it should always be clear whether the variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for P(ki|R) and P(ki|R̄). This can be accomplished with the following assumptions: (a) we can approximate P(ki|R) by the distribution of the index term ki among the documents retrieved so far, and (b) we can approximate P(ki|R̄) by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write

P(ki|R) = Vi / V        P(ki|R̄) = (ni − Vi) / (N − V)
This process can then be repeated recursively. By doing so, we are able to improve our guesses for the probabilities P(ki|R) and P(ki|R̄) without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for the definition of the subset V, as originally conceived. The last formulas for P(ki|R) and P(ki|R̄) pose
problems for small values of V and Vi which arise in practice (such as V = 1 and Vi = 0). To circumvent these problems, an adjustment factor is often added, which yields

P(ki|R) = (Vi + 0.5) / (V + 1)        P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)

An adjustment factor that is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction ni/N as the adjustment factor, which yields

P(ki|R) = (Vi + ni/N) / (V + 1)        P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
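To make the procedure concrete, here is a minimal Python sketch of this ranking and feedback loop (the toy collection, the function names, and the use of the 0.5 adjustment factor are our own illustrative choices, not part of any standard library):

```python
import math

def term_weight(n_i, N, V, V_i):
    # Adjusted estimates of P(ki|R) and P(ki|~R); with V = Vi = 0 this
    # reduces to the initial guesses P(ki|R) = 0.5 and P(ki|~R) ~ ni/N.
    p = (V_i + 0.5) / (V + 1)
    u = (n_i - V_i + 0.5) / (N - V + 1)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def rank(docs, query, relevant_ids=()):
    """docs: dict doc_id -> set of index terms; relevant_ids: the subset V
    of documents judged (or assumed) relevant so far."""
    N, V = len(docs), len(relevant_ids)
    scores = {}
    for k in query:
        n_i = sum(1 for terms in docs.values() if k in terms)
        if n_i == 0:
            continue
        V_i = sum(1 for d in relevant_ids if k in docs[d])
        w = term_weight(n_i, N, V, V_i)
        for d, terms in docs.items():
            if k in terms:
                scores[d] = scores.get(d, 0.0) + w
    return sorted(scores.items(), key=lambda item: -item[1])

# Hypothetical toy collection
docs = {1: {"nuclear", "cleanup", "plant"}, 2: {"nuclear", "power"},
        3: {"river", "cleanup"}, 4: {"sports", "news"}, 5: {"weather", "news"}}
print(rank(docs, {"nuclear", "cleanup"}))                    # initial ranking
print(rank(docs, {"nuclear", "cleanup"}, relevant_ids=[1]))  # after feedback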

This completes our discussion of the probabilistic model. The main advantage of the
probabilistic model, in theory, is that documents are ranked in decreasing order of their
probability of being relevant.
The disadvantages include:

(1) the need to guess the initial separation of documents into relevant and non-relevant sets;

(2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and

(3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear that independence of index terms is a bad assumption in practical situations.
CHAPTER FIVE

RETRIEVAL EVALUATION
5.1. Evaluation of IR systems
Why system evaluation? Any system needs validation and verification:
– Check whether the system was built right (verification)
– Check whether the right system was built (validation)
– It provides the ability to measure the difference between IR systems
– How well do our search engines work?
– Is system A better than B, and under what conditions?
– Identify techniques that work well and those that do not
– There are many retrieval models/algorithms, so which one is the best?
– What is the best component for:
• Similarity measures (dot-product, cosine, …?)
• Index term selection (tokenization, stop-word removal, stemming, …)
• Term weighting (TF, TF-IDF, …)

Before the final implementation of an information retrieval system, an evaluation of the system is
usually carried out. The type of evaluation to be considered depends on the objectives of the
retrieval system. Clearly, any software system has to provide the functionality it was conceived
for. Thus, the first type of evaluation which should be considered is a functional analysis in which
the specified system functionalities are tested one by one. Such an analysis should also include
an error analysis phase in which, instead of looking for functionalities, one behaves erratically
trying to make the system fail. It is a simple procedure that can be quite useful for catching
programming errors. Given that the system has passed the functional analysis phase, one should
proceed to evaluate the performance of the system.

The most common measures of system performance are time and space. The shorter the response time and the smaller the space used, the better the system is considered to be. In a system designed for providing data retrieval, the response time and the space required are usually the metrics of most interest and the ones normally adopted for evaluating the system. These metrics are affected by, for instance, the time spent in the search, the interaction with the operating system, the delays in communication channels, and the overheads
introduced by the many software layers which are usually present. We refer to such a form of
evaluation simply as performance evaluation.

In a system designed for providing information retrieval, other metrics, besides time and space,
are also of interest. In fact, since the user query request is inherently vague, the retrieved
documents are not exact answers and have to be ranked according to their relevance to the query.
Such relevance ranking introduces a component that is not present in data retrieval systems and
which plays a central role in information retrieval. Thus, information retrieval systems require the evaluation of how precise the answer set is. This type of evaluation is referred to as retrieval performance evaluation.

In this chapter, we discuss retrieval performance evaluation for information retrieval systems.
Such an evaluation is usually based on a test reference collection and on an evaluation measure.
The test reference collection consists of a collection of documents, a set of example information
requests, and a set of relevant documents (provided by specialists) for each example information
request. Given a retrieval strategy S, the evaluation measure quantifies (for each example
information request) the similarity between the set of documents retrieved by S and the set of
relevant documents provided by the specialists. This provides an estimation of the goodness of
the retrieval strategy S. In our discussion, we first cover the two most used retrieval evaluation measures: recall and precision. We also cover alternative evaluation measures such as the E measure.

When considering retrieval performance evaluation, we should first consider the retrieval task
that is to be evaluated. For instance, the retrieval task could consist simply of a query processed
in batch mode (i.e., the user submits a query and receives an answer back) or of a whole
interactive session (i.e., the user specifies his information need through a series of interactive
steps with the system). Further, the retrieval task could also comprise a combination of these two
strategies. Batch and interactive query tasks are quite distinct processes and thus their evaluations
are also distinct. In fact, in an interactive session, user effort, characteristics of the interface
design, guidance provided by the system, and duration of the session are critical aspects that
should be observed and measured. In a batch session, none of these aspects is nearly as important
as the quality of the answer set generated.
Besides the nature of the query request, one has also to consider the setting where the evaluation
will take place and the type of interface used. Regarding the setting, evaluation of experiments
performed in a laboratory might be quite distinct from evaluation of experiments carried out in a
real-life situation. Retrieval performance evaluation in the early days of computer-based information retrieval systems focused primarily on laboratory experiments designed for batch interfaces. In the 1990s, a lot more attention was paid to the evaluation of real-life experiments. Despite this tendency, laboratory experimentation is still dominant. Two main reasons are the repeatability and the scalability provided by the closed setting of a laboratory.

5.1.1. Types of Evaluation measures

What are the main evaluation measures to check the performance of an IR system?

• Efficiency

When we talk about the efficiency of the system, the following issues must be considered: time and space complexity; speed in terms of retrieval time and indexing time; speed of query processing; and the space taken by the corpus versus the index file (questions like index size: determine the index/corpus size ratio, and decide whether there is a need for compression).
• Effectiveness

When we talk about the effectiveness of the system, the following issues must be considered: How capable is the system of retrieving relevant documents from the collection? Is system X better than other systems? User satisfaction: how “good” are the documents that are returned as a response to a user query? How relevant are the results to the information need of users? Relevance is a measure of the correspondence existing between a document and a query.
The standard approach to information retrieval system evaluation revolves around the notion of relevant and non-relevant documents. With respect to a user information need, a document in the test collection is given a binary classification as either relevant or non-relevant. The test document collection and suite of information needs have to be of a reasonable size: you need to average performance over fairly large test sets, as results are highly variable over different documents and information needs. As a rule of thumb, 50 information needs have usually been found to be a sufficient minimum.
Relevance is assessed relative to an information need, not a query. For example, an information
need might be:

Information on whether drinking red wine is more effective at reducing your risk of heart attacks
than white wine.

This might be translated into a query such as:

Wine AND red AND white AND heart AND attack AND effective

A document is relevant if it addresses the stated information need, not because it just happens to
contain all the words in the query. This distinction is often misunderstood in practice, because the
information need is not overt. But, nevertheless, an information need is present. If a user types
python into a web search engine, they might want to know where they can purchase a pet python.
Or they might want information on the programming language Python. From a one word query, it
is very difficult for a system to know what the information need is. But, nevertheless, the user has
one, and can judge the returned results on the basis of their relevance to it. To evaluate a system, we require an overt expression of an information need, which can be used for judging returned documents as relevant or non-relevant. At this point we make a simplification: we treat relevance as binary, even though it can reasonably be thought of as a scale, with some documents highly relevant and others only marginally so. Many systems contain various weights (often known as parameters) that can be adjusted to
tune system performance. It is wrong to report results on test collections which were obtained by
tuning these parameters to maximize performance on that collection. That is because such tuning
overstates the expected performance of the system, because the weights will be set to maximize
performance on one particular set of queries rather than for a random sample of queries. In such
cases, the correct procedure is to have one or more development test collections, and to tune the
parameters on the development test collection. The tester then runs the system with those weights
on the test collection and reports the results on that collection as an unbiased estimate of
performance.
5.2. Evaluation of unranked retrieval sets

The two most frequent and basic measures for information retrieval effectiveness are precision and recall.
Precision (P):

– The fraction of retrieved documents that are relevant.
– The ability to retrieve top-ranked documents that are mostly relevant.
– Precision is the percentage of retrieved documents that are relevant to the query.

Recall (R):

– The fraction of relevant documents that are retrieved.
– The ability of the search to find all of the relevant items in the corpus.
– Recall is the percentage of relevant documents retrieved from the database in response to a user's query.
These notions can be made clear by examining the following contingency table:

                    Relevant              Non-relevant
  Retrieved         true positive (tp)    false positive (fp)
  Not retrieved     false negative (fn)   true negative (tn)

Then:

Precision = tp / (tp + fp)

Recall = tp / (tp + fn)
An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy, that is, the fraction of its classifications that are correct. In terms of the contingency table above, accuracy = (tp + tn) / (tp + fp + fn + tn). This seems plausible, since there are two actual classes, relevant and non-relevant, and an information retrieval system can be thought of as a two-class classifier which attempts to label them as such (it retrieves the subset of documents which it believes to be relevant). This is precisely the effectiveness measure often used for evaluating machine learning classification problems.
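As a small illustration, precision and recall follow directly from this set view; a sketch in Python (the function name and the toy document ids are hypothetical):

```python
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)                         # relevant docs retrieved
    precision = tp / len(retrieved) if retrieved else 0.0  # tp / (tp + fp)
    recall = tp / len(relevant) if relevant else 0.0       # tp / (tp + fn)
    return precision, recall

# Example 1 below, recast as sets: 22 retrieved documents, of which
# ids 1..12 are among the 25 relevant ones.
retrieved = set(range(1, 23))
relevant = set(range(1, 13)) | set(range(101, 114))
print(precision_recall(retrieved, relevant))   # (0.5454..., 0.48)
```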
Example 1

An IR system returns 12 relevant documents and 10 irrelevant documents. There are a total of 25 relevant documents in the collection. What are the precision and the recall of the system on this search?

Solution:

Precision = (Number of relevant items retrieved) / (Total number of retrieved items)

Precision = 12 / 22 ≈ 0.545

Recall = (Number of relevant items retrieved) / (Total number of relevant items)

Recall = 12 / 25 = 0.48
Example 2

For a given query there are 20 relevant documents in the collection. The precision of the query is 0.50 and the recall for the query is 0.35. How many documents are there in the result set?

Solution:

P = 0.50, R = 0.35

Let x be the number of relevant items in the result set and y the number of irrelevant items.

From recall: 0.35 = x / 20, so x = 20 × 0.35 = 7

From precision: 0.50 = x / (x + y) = 7 / (x + y), so x + y = 7 / 0.50 = 14, and hence y = 7

Total number of documents in the result set = x + y = 14
F-measure

A single measure that trades off precision versus recall is the F measure, the weighted harmonic mean of precision and recall. It takes both recall and precision into account; with even weighting it is the plain harmonic mean:

F = 2PR / (P + R) = 2 / (1/P + 1/R)

However, using an even weighting is not the only choice. The general weighted form is

Fβ = (β² + 1)PR / (β²P + R)

Values of β < 1 emphasize precision, while values of β > 1 emphasize recall. For example, a value of β = 3 or β = 5 might be used if recall is to be emphasized. Recall, precision, and the F measure are inherently measures between 0 and 1, but they are also very commonly written as percentages, on a scale between 0 and 100.
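A small sketch of the weighted F measure (the function name is ours; the example values echo Example 1 above):

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean; beta < 1 favours precision, beta > 1 recall.
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(12 / 22, 12 / 25))          # balanced F1, about 0.51
print(f_measure(12 / 22, 12 / 25, beta=3))  # recall-heavy, closer to R
```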
To summarize: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and the F measure combines the two through their weighted harmonic mean.

5.3. Types of Evaluation Strategies


• System-centered evaluation

– Given documents, queries, and relevance judgments
– Try several variations of the system
– Measure which system returns the “best” hit list
• User-centered evaluation

– Given several users, and at least two retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best” for the user's information need
– How to measure user satisfaction?

5.4. Problems with both precision and recall


• Number of irrelevant documents in the collection is not taken into account.

• Recall is undefined when there is no relevant document in the collection.

• Precision is undefined when no document is retrieved.

Other measures

• Noise = retrieved irrelevant docs / retrieved docs

• Silence/Miss = non-retrieved relevant docs / relevant docs

– Noise = 1 – Precision; Silence = 1 – Recall

Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|

Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
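Using the same set representation as in the precision/recall sketch above (names are ours; the function assumes non-empty sets, per the caveats just listed):

```python
def noise_miss_fallout(retrieved, relevant, all_docs):
    noise = len(retrieved - relevant) / len(retrieved)    # 1 - precision
    miss = len(relevant - retrieved) / len(relevant)      # 1 - recall
    nonrelevant = all_docs - relevant
    fallout = len(retrieved & nonrelevant) / len(nonrelevant)
    return noise, miss, fallout
```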
5.5. Difficulties in Evaluating IR Systems
• IR systems essentially facilitate communication between a user and document collections.
• Relevance is a measure of the effectiveness of this communication.
o Effectiveness is related to the relevancy of retrieved items.
o Relevance relates to a problem, an information need, a query, and a document or surrogate.
• Relevancy is not typically binary but continuous.
o Even if relevancy is binary, it is a difficult judgment to make.
• Relevance judgments are made by:
o The user who posed the retrieval problem
o An external judge
o Is the relevance judgment made by the user and by an external judge the same?
• Relevance judgment is usually:
o Subjective: depends upon a specific user's judgment.
o Situational: relates to the user's current needs.
o Cognitive: depends on human perception and behavior.
o Dynamic: changes over time.
CHAPTER SIX

QUERY LANGUAGES AND OPERATIONS

6.1. Introduction
Information is the main asset of the information society. The recent developments in computing power and telecommunications, along with the constant drop in the costs of Internet access, data management, and storage, created the right conditions for the global diffusion of the Web and, more generally, of new research tools able to analyze information and its contents. Depending on the particular application scenario and on the type of information that has to be managed and searched, different techniques need to be devised. The dictionary definition of a query is a set of instructions passed to a database to retrieve particular data. A query is the formulation of a user information need. In its simplest form, a query is composed of keywords, and the documents containing such keywords are searched for; such queries are popular because they are intuitive, easy to express, and allow fast ranking. Query language: a source language consisting of procedural operators that invoke functions to be executed.

Moreover, some words which are very frequent and do not carry meaning (such as ‘the’), called stopwords, may be removed. Here we assume that all the query preprocessing has already been done. Although these operations are usually done for information retrieval, many of them can also be useful in a data retrieval context. When we want to emphasize the difference between words that can be retrieved by a query and those which cannot, we call the former ‘keywords.’
Orthogonal to the kind of queries that can be asked is the question of which retrieval unit the information system adopts. The retrieval unit is the basic element which can be retrieved as an
answer to a query (normally a set of such basic elements is retrieved, sometimes ranked by
relevance or other criterion). The retrieval unit can be a file, a document, a Web page, a
paragraph, or some other structural unit which contains an answer to the search query. From this
point on, we will simply call those retrieval units ‘documents,’ although as explained this can
have different meanings.
6.2. Types of query languages

6.2.1. Keyword-based queries

Queries are combinations of words. The document collection is searched for documents that
contain these words. Word queries are intuitive, easy to express and provide fast ranking. The
concept of word must be defined as a sequence of letters terminated by a separator (period,
comma, blank, etc). Definition of letter and separator is flexible; e.g., hyphen could be defined as
a letter or as a separator. Usually, “trivial words” (such as “a”, “the”, or “of”) are ignored.

A query is the formulation of a user information need. In its simplest form, a query is composed
of keywords and the documents containing such keywords are searched for. Keyword-based
queries are popular because they are intuitive, easy to express, and allow for fast ranking. Thus, a
query can be (and in many cases is) simply a word, although it can in general be a more complex
combination of operations involving several words. In the rest of this chapter we will refer to
single-word and multiple-word queries as basic queries. Patterns are also considered as basic
queries.

Popular Keyword-based queries are:-

Single-word queries:
• A query is a single word
• Simplest form of query.
• All documents that include this word are retrieved.
• Documents may be ranked by the frequency of this word in the document.

The most elementary query that can be formulated in a text retrieval system is a word. Text
documents are assumed to be essentially long sequences of words. Although some models
present a more general view, virtually all models allow us to see the text in this perspective and
to search words. Some models are also able to see the internal division of words into letters.
These latter models permit the searching of other types of patterns. The set of words retrieved by
these extended queries can then be fed into the word treating machinery, say to perform thesaurus
expansion or for ranking purposes.

A word is normally defined in a rather simple way. The alphabet is split into ‘letters’ and
‘separators,’ and a word is a sequence of letters surrounded by separators. More complex models
allow us to specify that some characters are not letters but do not split a word, e.g. the hyphen in
‘on-line.’ It is good practice to leave the choice of what is a letter and what is a separator to the
manager of the text database.

Phrase queries: A query is a sequence of words treated as a single unit, also called a “literal string” or “exact phrase” query. The phrase is usually surrounded by quotation marks, and all documents that include it are retrieved. Usually, separators (commas, colons, etc.) and “trivial words” (e.g., “a”, “the”, or “of”) in the phrase are ignored. In effect, this query asks for a set of words that must appear in sequence; it allows users to specify a context and thus gain precision.

Example: “United States of America”.


Multiple-word queries: A query is a set of words (or phrases). Two interpretations:
• A document is retrieved if it includes any of the query words.
• A document is retrieved if it includes each of the query words.

Documents may be ranked by the number of query words they contain: a document containing n query words is ranked higher than a document containing m < n query words. Documents containing all the query words are ranked at the top; documents containing only one query word are ranked at the bottom. Frequency counts may still be used to break ties among documents that contain the same query words.

Example:
• The phrase “Venetian blind” finds documents that discuss Venetian blinds.
• The set (Venetian, blind) finds in addition documents that discuss blind Venetians.

Proximity queries: Restrict the distance within a document between two search terms.
Important for large documents in which the two search words may appear in different contexts.
Proximity specifications limit the acceptable occurrences and hence increase the precision of the
search.

•General Format: Word1 within m units of Word2. Unit may be character, word,
paragraph, etc.

Examples: united within 5 words of american: Finds documents that discuss “United Airlines and
American Airlines” but not “United States of America and the American dream”. Another
example is: nuclear within 0 paragraphs of cleanup: Finds documents that discuss “nuclear” and
“cleanup” in the same paragraph.
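A sketch of such a word-level proximity check over a positional index (the index layout and names are our own simplification):

```python
def within(doc_positions, w1, w2, m):
    """True if some occurrence of w1 is within m words of some occurrence
    of w2. doc_positions maps a word to the list of its word offsets."""
    pos1 = doc_positions.get(w1, [])
    pos2 = doc_positions.get(w2, [])
    return any(abs(i - j) <= m for i in pos1 for j in pos2)

# "united within 5 words of american" on a toy document
doc = "united airlines and american airlines announced new routes".split()
positions = {}
for offset, word in enumerate(doc):
    positions.setdefault(word, []).append(offset)
print(within(positions, "united", "american", 5))  # True
```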

Boolean queries: Describe the information needed by relating multiple words with Boolean
operators.

• Operators: and, or, But


• But corresponds to and not
• Semantics: For each query word w a corresponding set Dw is constructed that includes
the documents that contain w.
• The Boolean expression is then interpreted as an expression on the corresponding
document sets with corresponding set operators:
o and = intersection
o or = union
o But = difference

The use of But prevents creation of very large answers: not B computes all the documents that do
not include B (complement), whereas A But B limits the universe to the documents that include
A. Precedence: But, and, or; use parentheses to override; process left-to-right among operators
with the same precedence.

Examples:

a. Computer or server But mainframe: Select all documents that discuss computers, or documents
that discuss servers but do not discuss mainframes.
b. (Computer or server) But mainframe: Select all documents that discuss computers or servers, but do not select any documents that discuss mainframes.
c. Computer But (server or mainframe): Select all documents that discuss computers, and do not discuss either servers or mainframes.
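This set semantics translates directly into code; a sketch over a hypothetical toy inverted index:

```python
index = {  # word -> set of ids of documents containing it
    "computer": {1, 2, 5},
    "server": {2, 3},
    "mainframe": {3, 5},
}

def docs(word):
    return index.get(word, set())

# a. computer or (server But mainframe)  -- But binds tighter than or
print(docs("computer") | (docs("server") - docs("mainframe")))   # {1, 2, 5}
# b. (computer or server) But mainframe
print((docs("computer") | docs("server")) - docs("mainframe"))   # {1, 2}
# c. computer But (server or mainframe)
print(docs("computer") - (docs("server") | docs("mainframe")))   # {1}
```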

6.2.2. Pattern Matching

A pattern is a set of syntactic features that must occur in a text segment. Those segments
satisfying the pattern specifications are said to ‘match’ the pattern. We are interested in
documents containing segments which match a given search pattern. Each system allows the
specification of some types of patterns, which range from very simple (for example, words) to
rather complex (such as regular expressions).

In general, the more powerful the set of patterns allowed, the more involved are the queries that the user can formulate, and the more complex is the implementation of the search. The most used types of patterns are:

• Words: A string (sequence of characters) which must be a word in the text. This is the most
basic pattern.

• Prefixes: A string which must form the beginning of a text word. For instance, given the prefix
‘comput’ all the documents containing words such as ‘computer,’ ‘computation,’ ‘computing,’
etc. are retrieved.

• Suffixes: A string which must form the termination of a text word. For instance, given the suffix
‘ters’ all the documents containing words such as ‘computers,’ ‘testers,’ ‘painters,’ etc. are
retrieved.
• Substrings: A string which can appear within a text word. For instance, given the substring
‘tal’ all the documents containing words such as ‘coastal,’ ‘talk,’ ‘metallic,’ etc. are retrieved.
This query can be restricted to find the substrings inside words, or it can go further and search for the substring anywhere in the text (in this case the query is not restricted to be a sequence of letters but can contain word separators). For instance, a search for ‘any flow’ will match inside the phrase ‘…many flowers…’.
• Ranges: A pair of strings which matches any word lying between them in lexicographical order.
Alphabets are normally sorted, and this induces an order into the strings which is called
lexicographical order (this is indeed the order in which words in a dictionary are listed). For
instance, the range between words ‘held’ and ‘hold’ will retrieve strings such as ‘hoax’ and
‘hissing.’
• Allowing errors: A word together with an error threshold. This search pattern retrieves all text
words which are ‘similar’ to the given word. The concept of similarity can be defined in many
ways. The general concept is that the pattern or the text may have errors (coming from typing,
spelling, or from optical character recognition software, among others), and the query should try
to retrieve the given word and what are likely to be its erroneous variants. Although there are
many models for similarity among words, the most generally accepted in text retrieval is the
Levenshtein distance, or simply edit distance.

The edit distance between two strings is the minimum number of character insertions, deletions,
and replacements needed to make them equal. Therefore, the query specifies the maximum
number of allowed errors for a word to match the pattern (i.e., the maximum allowed edit
distance). This model can also be extended to search substrings (not only words), retrieving any
text segment which is at the allowed edit distance from the search pattern.

Under this extended model, if a typing error splits ‘flower’ into ‘flo wer’ it could still be found
with one error, while in the restricted case of words it could not (since neither ‘flo’ nor ‘wer’ are
at edit distance 1 from ‘flower’). Variations on this distance model are of use in computational
biology for searching on DNA or protein sequences as well as in signal processing.
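The edit distance itself is computed with the classical dynamic-programming algorithm; a sketch (this is the textbook algorithm, not code from any particular retrieval system):

```python
def edit_distance(a, b):
    # dp[j] holds the distance between a[:i] and b[:j] as rows are filled.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # replacement
            prev = cur
    return dp[-1]

print(edit_distance("flower", "flo wer"))  # 1: one inserted character
print(edit_distance("flower", "flo"))      # 3: three deletions away
```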
• Regular expressions: Some text retrieval systems allow searching for regular expressions. A regular expression is a rather general pattern built up by simple strings (which are meant to be matched as substrings) and the following operators:

– Union: if e1 and e2 are regular expressions, then (e1 | e2) matches what e1 or e2 matches.

– Concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2 (therefore simple strings can be thought of as a concatenation of their individual letters).
– Repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e. For instance, consider a query like ‘pro(blem | tein)(s | ε)(0 | 1 | 2)*’ (where ε denotes the empty string). It will match words such as ‘problem02’ and ‘proteins.’ As in previous cases, the matches can be restricted to comprise a whole word, to occur inside a word, or to match an arbitrary text segment. This can also be combined with the previous type of patterns to search a regular expression allowing errors.
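For illustration, the example query above can be expressed directly with Python's re module (a sketch; the empty alternative ε becomes an optional ‘s’):

```python
import re

# pro(blem|tein)(s|ε)(0|1|2)*  --  (s|ε) is written as the optional s?
pattern = re.compile(r"pro(?:blem|tein)s?[012]*")

for word in ["problem02", "proteins", "protein", "prose"]:
    print(word, bool(pattern.fullmatch(word)))
# problem02 True, proteins True, protein True, prose False
```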

• Extended patterns: It is common to use a more user-friendly query language to represent some frequent cases of regular expressions. Extended patterns are subsets of the regular expressions which are expressed with a simpler syntax. The retrieval system can internally convert extended patterns into regular expressions, or search for them with specific algorithms. Each system supports its own set of extended patterns, and therefore no formal definition exists. Some examples found in many new systems are:

– Classes of characters, i.e., one or more positions within the pattern are matched by any character from a pre-defined set. This involves features such as case-insensitive matching, use of ranges of characters (e.g., specifying that some character must be a digit), complements (e.g., some character must not be a letter), enumeration (e.g., a character must be a vowel), and wild cards (i.e., a position within the pattern matches anything), among others.
– Conditional expressions, i.e., a part of the pattern may or may not appear.
– Wild characters which match any sequence in the text, e.g., any word which starts with ‘flo’ and ends with ‘ers,’ which matches ‘flowers’ as well as ‘flounders.’
– Combinations that allow some parts of the pattern to match exactly and other parts with errors.

6.2.3. Structural Queries

Up to now we have considered the text collection as a set of documents which can be queried
with regard to their text content. This model is unable to take advantage of novel text features
which are becoming commonplace, such as the text structure. The text collections tend to have
some structure built into them, and allowing the user to query those texts based on their structure
(and not only their content) is becoming attractive.
Fig. 2. The three main structures: (a) form-like fixed structure, (b) hypertext structure, and (c) hierarchical structure.
As discussed above, mixing contents and structure in queries allows us to pose very powerful queries, which are much more expressive than each query mechanism by itself. By using a query language that integrates both types of queries, the retrieval quality of textual databases can be improved.
This mechanism is built on top of the basic queries so that they select a set of documents that
satisfy certain constraints on their content (expressed using words, phrases, or patterns that the
documents must contain). On top of this, some structural constraints can be expressed using
containment, proximity, or other restrictions on the structural elements (e.g., chapters, sections,
etc.) present in the documents.

The Boolean queries can be built on top of the structural queries, so that they combine the sets of
documents delivered by those queries. In the Boolean syntax tree (recall the example of Figure 2)
the structural queries form the leaves of the tree. On the other hand, structural queries can
themselves have a complex syntax. We divide this section according to the type of structures
found in text databases. Figure 2 illustrates them. Although structured query languages should be
amenable for ranking, this is still an open problem. In what follows it is important to distinguish
the difference between the structure that a text may have and what can be queried about that
structure. In general, natural language texts may have any desired structure. However, different
models allow the querying of only some aspects of the real structure. When we say that the
structure allowed is restricted in some way, we mean that only the aspects which follow this
restriction can be queried, albeit the text may have more structural information. For instance, it is
possible that an article has a nested structure of sections and subsections, but
the query model does not accept recursive structures. In this case, we will not be able to query for
sections included in others, although this may be the case in the text documents under
consideration.

6.3. Query Operations

So far, we have assumed documents that are entirely free of structure. Structured documents would allow more powerful queries, which combine text queries with structural queries: queries that relate to the structure of the document. Mixing contents and structure in queries means combining content conditions (words, phrases, or patterns) with structural constraints (containment, proximity, or other restrictions on structural elements).

Example: Retrieve documents that contain a page in which the phrase “terrorist attack” appears in the text and a photo whose caption contains the phrase “World Trade Center”. The corresponding query could be: same page (“terrorist attack”, photo (caption (“World Trade Center”))).

There are three main structures:

• Fixed structure
• Hypertext structure
• Hierarchical structure

Fixed Structure: The document is divided into a fixed set of fields, much like a filled form. Fields
may be associated with types, such as date. Each field has text and fields cannot nest or overlap.
Queries (multiple-words, Boolean, proximity, patterns, etc.) are targeted at particular fields.
Suitable for documents such as mail messages, with fields for: Sender, Receiver, Date, Subject,
Message body.

Example: a mail message has a sender, a receiver, a date, a subject, and a body field. Search for the mails sent to a given person with “football” in the Subject field.
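A sketch of such a fielded query over fixed-structure records (the record layout and values are a hypothetical toy example):

```python
mails = [
    {"sender": "alice", "receiver": "bob", "date": "2020-01-05",
     "subject": "football tickets", "body": "Fancy the match on Saturday?"},
    {"sender": "carol", "receiver": "bob", "date": "2020-01-06",
     "subject": "meeting agenda", "body": "Agenda attached."},
]

# Mails sent to a given person with "football" in the Subject field
hits = [m for m in mails
        if m["receiver"] == "bob" and "football" in m["subject"].lower()]
print([m["sender"] for m in hits])  # ['alice']
```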

Hypertext Structure: The most general document structure. The term hypertext was coined
by American computer scientist Ted Nelson in 1965 to describe textual information that could be
accessed in a nonlinear way. The prefix hyper describes the speed and facility with which users
could jump to and from related areas of text. Each document is divided into regions (nodes),
where a region could be a section, a paragraph, or an entire document; regions may be nested.
The nodes are connected with directed links. A link is anchored in a phrase or a word in one node
and leads to another node. Result is a network of document parts. Hypertext lends itself more to
browsing than to querying.

WebGlimpse, for example, combines browsing and searching on the Web.

Hierarchical structure: An intermediate model between fixed structure and hypertext. The “anarchic” hypertext network is restricted to a hierarchical structure. The model allows recursive decomposition of documents. Queries may combine regular text queries, which are targeted at particular areas (the target area is defined by a “path expression”), and queries on the structure itself; for example, “retrieve documents with at least 5 sections.”

6.4. Relevance feedback

After initial retrieval results are presented, the user is allowed to provide feedback on the relevance of one or more of the retrieved documents. The system uses this feedback information to reformulate the query and produce new results based on the reformulated query. This allows a more interactive, multi-pass process.

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or non-relevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort. The process exploits the idea that it may be difficult to formulate a good query when you don't know the collection well,
but it is easy to judge particular documents, and so it makes sense to engage in iterative query
refinement of this sort.

Fig. 6.4. Relevance feedback architecture
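A common concrete realization of this feedback loop in the vector model is the Rocchio reformulation; a minimal sketch (the α = 1, β = 0.75, γ = 0.15 constants are typical choices, and all names are ours):

```python
def rocchio(query_vec, relevant_vecs, nonrelevant_vecs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant documents and
    away from non-relevant ones. All vectors are dicts term -> weight."""
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for vecs, sign, coeff in ((relevant_vecs, 1, beta),
                              (nonrelevant_vecs, -1, gamma)):
        if not vecs:
            continue
        for vec in vecs:
            for t, w in vec.items():
                new_q[t] = new_q.get(t, 0.0) + sign * coeff * w / len(vecs)
    # Negative weights are usually clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}

q = {"nuclear": 1.0, "cleanup": 1.0}
rel = [{"nuclear": 0.8, "waste": 0.6}]
nonrel = [{"nuclear": 0.5, "power": 0.9}]
print(rocchio(q, rel, nonrel))  # boosts 'nuclear', adds 'waste', drops 'power'
```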

6.5. Query Expansion

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. For example, the interface of the Yahoo! web search engine in 2006 displayed expanded query suggestions just below the “Search Results” bar.

In relevance feedback, users give additional input on documents (by marking documents in the
results set as relevant or not), and this input is used to reweight the terms in the query for
documents. In query expansion on the other hand, users give additional input on query words or
phrases, possibly suggesting additional query terms. Some search engines (especially on the
web) suggest related queries in response to a query; the users then opt to use one of these
alternative query suggestions. The central question in this form of query expansion is how to
generate alternative or expanded queries for the user. The most common form of query expansion
is global analysis, using some form of thesaurus. For each term t in a query, the query can be
automatically expanded with synonyms and related words of t from the thesaurus. Use of
a thesaurus can be combined with ideas of term weighting: for instance, one might weight added
terms less than original query terms.
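A sketch of such thesaurus-based expansion with down-weighted added terms (the tiny thesaurus and the 0.5 weight are hypothetical choices):

```python
thesaurus = {  # term -> synonyms/related words (hypothetical entries)
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand(query_terms, added_weight=0.5):
    """Return term -> weight, weighting added terms less than originals."""
    weights = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for syn in thesaurus.get(t, []):
            weights.setdefault(syn, added_weight)
    return weights

print(expand(["fast", "car"]))
# {'fast': 1.0, 'car': 1.0, 'quick': 0.5, 'rapid': 0.5,
#  'automobile': 0.5, 'vehicle': 0.5}
```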

Methods for building a thesaurus for query expansion include:

• Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept. The subject headings of traditional library subject indexes, such as the Library of Congress Subject Headings, or the Dewey Decimal system are examples of a controlled vocabulary. Use of a controlled vocabulary is quite common for well-resourced domains. A well-known example is the Unified Medical Language System (UMLS) used with MedLine for querying the biomedical research literature.

• A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term. The UMLS metathesaurus is one example of a thesaurus. Statistics Canada maintains a thesaurus of preferred terms, synonyms, broader terms, and narrower terms for matters on which the government collects statistics, such as goods and services. This thesaurus is also bilingual, English and French.

• An automatically derived thesaurus. Here, word co-occurrence statistics over a collection of documents in a domain are used to automatically induce a thesaurus.

• Query reformulations based on query log mining. Here, we exploit the manual query reformulations of other users to make suggestions to a new user. This requires a huge query volume, and is thus particularly appropriate to web search.

Thesaurus-based query expansion has the advantage of not requiring any user input. Use of
query expansion generally increases recall and is widely used in many science and engineering
fields. As well as such global analysis techniques, it is also possible to do query expansion by
local analysis, for instance, by analyzing the documents in the result set. User input is now
usually required, but a distinction remains as to whether the user is giving feedback on
documents or on query terms.