Part I IR VTU M Tech SSE
Information Retrieval
• The user seeks information about a subject or topic; the semantics of the request is frequently loose, and small errors are tolerated.
• An IR system must:
– interpret the contents of information items
– generate a ranking which reflects relevance
• The notion of relevance is the most important one.
Motivation
• IR is now at the center of the stage.
• IR in the last 20 years:
– classification and categorization
– systems and languages
• Still, the area was seen as being of narrow interest.
• The advent of the Web changed this perception once and for all:
– a universal repository of knowledge
– free (low-cost) universal access
Retrieval vs. Browsing
• Retrieval (of information or data, e.g. from a database) is purposeful.
• Browsing is glancing around with no fixed goal: e.g., F1 → cars → Le Mans → France → tourism.
Basic Concepts
• Logical view of the documents: a pipeline of text operations maps the full text to a set of index terms:
Docs → accents, spacing → stopword removal → noun groups → stemming → manual/automatic indexing
• Document structure may also be recognized along the way.
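The text-operations pipeline above can be sketched in a few lines. This is a toy illustration: the stopword list and the suffix-stripping rules are made up, where a real system would use a full stopword list and a proper stemmer such as Porter's.

```python
import re

# Toy stopword list; real systems use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are"}

def normalize(text):
    # Stand-in for accent/spacing handling: just lowercase here.
    return text.lower()

def tokenize(text):
    return re.findall(r"[a-z]+", text)

def stem(word):
    # Toy suffix stripping in the spirit of Porter stemming.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    """Full text -> index terms, via the pipeline described above."""
    tokens = tokenize(normalize(document))
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(index_terms("The retrieved documents are ranked"))
# -> ['retriev', 'document', 'rank']
```

Note that the output terms (e.g. “retriev”) need not be real words; they only need to be consistent between indexing and querying.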
IR History
• 1960-70’s:
– Initial exploration of text retrieval systems for
“small” corpora of scientific abstracts, and law
and business documents.
– Development of the basic Boolean and vector-space
models of retrieval.
– Prof. Salton and his students at Cornell
University were the leading researchers in the
area.
IR History Continued
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE
IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
The Seven Ages of
Information Retrieval
Vannevar Bush's 1945 article set a
goal of fast access to the contents of
the world's libraries which looks like
it will be achieved by 2010, sixty-five
years later.
Bush’s Prediction
Modern History
The “information overload” problem is much older
than you may think
Origins in period immediately after World War II
Tremendous scientific progress during the war
Rapid growth in amount of scientific publications
available
The “Memex Machine”
Conceived by Vannevar Bush, President Roosevelt's
science advisor
Outlined in 1945 Atlantic Monthly article titled “As We
May Think”
Foreshadows the development of hypertext (the Web)
and information retrieval systems
The Memex Machine
Historical aspects
“As We May Think”, by Vannevar Bush
Article was originally published in 1945.
He imagined that machines would read in visual form
His assertion that logic is suitable for mechanical computation is
not yet appreciated
Documents are accessible & viewable from the memex
system of Bush
Documents may exist on many media: text, pictures, audio.
The memex can keep the “trail” of documents you read while you
follow your curiosity (basically, a persistent history of URLs as
you surf the web).
You can create associations between documents
You can enter original material
Time-sharing systems (e.g., DIALOG)
(figure) The retrieval process: text operations produce the logical view of the documents and of the query; the DB Manager module, via indexing, builds an inverted file over the text database; the query, shaped by query operations and user feedback, is run by searching against the index; ranking orders the retrieved docs into ranked docs.
Information Retrieval – PART I
INTRODUCTION, RETRIEVAL STRATEGIES – I:
Introduction:
Motivation
Basic Concepts
Past, Present and the Future
The Retrieval Process
Other Related Slides – not part of the book
Information Retrieval (IR)
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide
Web is the most recent “killer app.”
• Concerned firstly with retrieving relevant
documents to a query.
• Concerned secondly with retrieving from
large sets of documents efficiently.
Typical IR Task
• Given:
– A corpus of textual natural-language
documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to
the query.
IR System
(figure) A query string and the document corpus are input to the IR system, which outputs a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …
Relevance
Keyword Search
Problems with Keywords
IR System Architecture
(figure) The user states a text information need through the User Interface; text operations produce the logical view of the documents and the query; query operations (refined by user feedback) form the query; the Database Manager, via indexing, builds the inverted file; searching consults the index; and ranking orders the retrieved docs from the text database into ranked docs.
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word to document pointers.
• Searching retrieves documents that contain a
given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
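A minimal sketch of the Indexing and Searching components above, assuming simple whitespace tokenization; the document texts and IDs are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, token):
    """Retrieve (sorted) IDs of documents containing the query token."""
    return sorted(index.get(token.lower(), set()))

docs = {
    1: "hippos in the zoo",
    2: "the zoo opens at nine",
    3: "hippos eat grass",
}
index = build_inverted_index(docs)
print(search(index, "zoo"))     # -> [1, 2]
print(search(index, "hippos"))  # -> [1, 3]
```

In a real system the text would first pass through the stopword-removal and stemming operations listed above, and the index would store postings lists with positions and weights rather than bare document IDs.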
IR System Components (continued)
• User Interface manages interaction with the
user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
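Thesaurus-based query expansion can be sketched as below; the thesaurus entries are hypothetical stand-ins for a real resource such as WordNet.

```python
# Hand-written thesaurus for illustration only.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "picture": ["photo", "image"],
}

def expand_query(query_tokens):
    """Append thesaurus synonyms of each query token to the query."""
    expanded = list(query_tokens)
    for token in query_tokens:
        expanded.extend(THESAURUS.get(token, []))
    return expanded

print(expand_query(["car", "rental"]))
# -> ['car', 'rental', 'automobile', 'vehicle']
```

Relevance-feedback expansion works similarly, except that the added terms are drawn from documents the user has judged relevant rather than from a fixed thesaurus.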
Web Search
Web Search System
(figure) A query string is input to the IR system, which outputs a ranked list of pages: 1. Page1, 2. Page2, 3. Page3, …
Other IR-Related Tasks
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management
Library and Information Science
Natural Language Processing:
IR Directions
• Methods for determining the sense of an
ambiguous word based on context (word
sense disambiguation).
• Methods for identifying specific pieces of
information in a document (information
extraction).
• Methods for answering specific NL
questions from document corpora.
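A simplified, Lesk-style sketch of word sense disambiguation by gloss–context overlap: pick the sense whose gloss shares the most words with the surrounding context. The senses and glosses for “bank” are hand-written illustrations, not from a real dictionary.

```python
# Hypothetical sense inventory for illustration only.
SENSES = {
    "bank": {
        "finance": {"money", "deposit", "account", "loan"},
        "river": {"water", "shore", "slope", "stream"},
    }
}

def disambiguate(word, context_tokens):
    """Choose the sense whose gloss overlaps most with the context."""
    context = set(context_tokens)
    senses = SENSES[word]
    return max(senses, key=lambda s: len(senses[s] & context))

print(disambiguate("bank", ["take", "money", "to", "the", "bank"]))
# -> 'finance'
```

Real disambiguators use machine-readable dictionaries or supervised learning, but the overlap idea is the same.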
Machine Learning
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
IR research
System prototyping
User
Top Ten Research Issues
10. Relevance Feedback.
9. Information Extraction.
8. Multimedia Retrieval.
7. Effective Retrieval.
6. Routing and Filtering.
Top Ten Research Issues
5. Interfaces and Browsing.
4. “Magic” (Vocabulary Mapping).
3. Efficient, Flexible Indexing and
Retrieval.
2. Distributed IR.
1. Integrated Solutions.
A new Industry – Content
Management
Introduction to Information Retrieval
Definitions
• An Information Retrieval (IR) System attempts to find
relevant documents to respond to a user’s request.
• The real problem boils down to matching
the language of the query to the language of
the document.
What is Information?
What do you think?
There is no “correct” definition
Cookie Monster’s definition:
“news or facts about something”
Different approaches:
Philosophy
Psychology
Linguistics
Electrical engineering
Physics
Computer science
Information science
Dictionary says…
Oxford English Dictionary
information: informing, telling; thing told, knowledge,
items of knowledge, news
knowledge: knowing, familiarity gained by experience;
person’s range of information; a theoretical or practical
understanding of; the sum of what is known
Random House Dictionary
information: knowledge communicated or received
concerning a particular fact or circumstance; news
Intuitive Notions
Information must
Be something, although the exact nature (substance,
energy, or abstract concept) is not clear;
Be “new”: repetition of previously received messages is
not informative
Be “true”: false or counterfactual information is “misinformation”
Be “about” something
Where’s the human?
If a tree falls in the forest, and no one is around to
hear it, is information transmitted?
In the “information as process” sense: yes, but that’s
not very interesting to us
We’re concerned about information for human
consumption
Transmission of information from one person to another
Recording of information
Reconstruction of stored information
Another View
Information science is characterized by “the
deliberate (purposeful) structure of the message
by the sender in order to affect the image
structure of the recipient”
This implies that the sender has knowledge of the
recipient's structure
Text = “a collection of signs purposefully
structured by a sender with the intention of
changing image-structure of a recipient”
Information = “the structure of any text which is
capable of changing the image-structure of a
recipient”
Nicholas J. Belkin and Stephen E. Robertson. (1976) Information Science and the Phenomenon of
Information. Journal of the American Society for Information Science, 27(4), 197-204.
Transfer of Information
Communication = transmission of information
(figure) Thoughts are encoded into words, and words into sounds (speech) or writing; the recipient decodes back from sounds to words to thoughts. Direct thought-to-thought transfer would be telepathy.
Information Hierarchy
A pyramid, from top to bottom: Wisdom, Knowledge, Information, Data.
• Simply matching on words is a very brittle approach.
• One word can have a zillion different semantic meanings
– Consider: Take
– “take a place at the table”
– “take money to the bank”
– “take a picture”
– “take a lot of time”
– “take drugs”
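The brittleness can be seen in a few lines of plain keyword matching: every sense of “take” matches equally, while a relevant document using a synonym does not match at all. The documents below are invented for illustration.

```python
docs = {
    1: "take a picture of the hippo",
    2: "take money to the bank",
    3: "photograph the hippo",
}

def keyword_match(query, docs):
    """Return IDs of documents sharing at least one word with the query."""
    q = set(query.lower().split())
    return [d for d, text in docs.items() if q & set(text.lower().split())]

print(keyword_match("take", docs))     # -> [1, 2]  (different senses mixed)
print(keyword_match("picture", docs))  # -> [1]     (misses the synonym in doc 3)
```
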
What is Different about IR from the Rest of Computer Science?
Most algorithms in computer science have a “right” answer:
Consider the two problems:
– Sort the following ten integers
– Find the highest integer
Now consider:
– Find the document most relevant to “hippos in the zoo”
Measuring Effectiveness
• An algorithm is deemed incorrect if it does not produce the “right” answer.
• A heuristic tries to guess something close to the right answer;
heuristics are measured on “how close” they come to it.
• IR techniques are essentially heuristics, because we do not know the
right answer.
• So we have to measure how close to the right answer we can come.
DOCUMENT RETRIEVAL
Document Routing
(figure) Incoming documents enter the document routing system and are matched against predetermined queries or user profiles.
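Routing can be sketched as profile matching over each incoming document, the reverse of ad hoc retrieval: the queries (profiles) stand still while the documents stream past. The profile names and terms are invented for illustration.

```python
# Standing user profiles (hypothetical examples).
profiles = {
    "motorsport": {"f1", "lemans", "race"},
    "finance": {"bank", "money", "loan"},
}

def route(document):
    """Return the profiles whose terms overlap the incoming document."""
    tokens = set(document.lower().split())
    return [name for name, terms in profiles.items() if terms & tokens]

print(route("the race at lemans starts today"))  # -> ['motorsport']
```
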
Precision = |Relevant ∩ Retrieved| / |Retrieved|

Recall = |Relevant ∩ Retrieved| / |Relevant|
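The two definitions translate directly into code over sets; the relevant and retrieved sets below are an invented example.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

relevant = {"d2", "d5"}
retrieved = {"d1", "d2", "d3", "d4"}
print(precision(relevant, retrieved))  # -> 0.25  (1 of 4 retrieved is relevant)
print(recall(relevant, retrieved))     # -> 0.5   (1 of 2 relevant was retrieved)
```
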
Precision and Two points of Recall
The answer set, in order of similarity coefficient: d1, d2, d3, d4, d5, d6, d7, d8, d9, d10. The relevant documents are d2 and d5.
• At 50% recall, d2 is found at rank 2, so precision = 1/2 = 0.5: the point (0.5, 0.5) on the recall–precision curve.
• At 100% recall, d5 is found at rank 5, so precision = 2/5 = 0.4: the point (1.0, 0.4).
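The same walk down the ranked answer list can be reproduced in a few lines, recording precision at each recall level reached.

```python
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d5"}

def recall_precision_points(ranking, relevant):
    """(recall, precision) at each rank where a relevant doc appears."""
    points, found = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

print(recall_precision_points(ranking, relevant))
# -> [(0.5, 0.5), (1.0, 0.4)]
```
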