Information Retrieval: Text Processing
Text Processing
Slide 2
IR System Architecture
[Diagram: the user's information need passes through the user interface and text operations to form a logical query view; query operations and user feedback refine the query; the database manager and indexing module build an inverted file over the text database; searching uses the index to retrieve documents, and ranking returns ranked docs to the user.]
Slide 3
Tokenization: Text → {words}
Slide 4
Simple Tokenization: Text → {words}
• Analyze text into a sequence of discrete tokens (words).
• Most languages
– Word = string of characters separated by white space and/or punctuation
• Difficulties:
– Abbreviations (etc., ...)
• Can be expanded to their original form using an MRD (Machine-Readable Dictionary)
– Hyphenated terms (_, -)
– Apostrophes
– Numbers
Slide 5
Simple Tokenization: Text → {words}
• Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation
and use only case-insensitive unbroken strings of alphabetic
characters as tokens.
• More careful approach (a tokenizer sketch follows below):
– Separate ? ! ; : “ ‘ [ ] ( ) < >
– Take care with . (sentence end vs. abbreviation)
– Take care with … (ellipsis)
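A minimal sketch of the "simplest approach" above, using Python's re module (the function name is illustrative): numbers and punctuation are discarded and only case-insensitive unbroken alphabetic strings survive.

import re

def simple_tokenize(text):
    # Lowercase, then keep only unbroken runs of alphabetic characters;
    # numbers and punctuation are discarded entirely.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("E-mail me: it costs $100.2 in 1999!"))
# ['e', 'mail', 'me', 'it', 'costs', 'in']

Note how "E-mail" splits in two and all digits vanish, illustrating the losses this simple approach accepts.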
Slide 6
Punctuation
Slide 7
Numbers
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t index as text
– Creation dates for docs
Slide 8
Case folding
Slide 9
Tokenizing HTML
Slide 10
Stop Word Removal
• Many of the most frequently used words in English are worthless for indexing; such words are called stop words.
– the, of, and, to, …
– Typically about 400 to 500 such words
Slide 11
Stopwords
• Stopwords are language-dependent
• For efficiency, store stopword strings in a hash table so they can be recognized in constant time (see the sketch below).
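A minimal sketch using Python's built-in set, which is a hash table with expected constant-time membership tests; the stopword list here is a tiny illustrative sample, not a real 400-500 word list.

STOPWORDS = {"a", "about", "above", "the", "of", "and", "to"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword table.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "sat", "above", "a", "mat"]))
# ['cat', 'sat', 'mat']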
Slide 12
Stop Word Removal
Slide 13
Some English Stop words
a, about, above, across, after, afterwards, again, against, all, almost, alone, along, already, also, although, always, am, among, amongst, amount, an, and, another, any, anyhow
Slide 14
Lemmatization
• Reduce inflectional/variant forms to base form
• Direct impact on VOCABULARY size
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• How to do this?
– Need a list of grammatical rules + a list of irregular words (a toy sketch follows below)
– children → child, spoken → speak, …
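A toy sketch of this rules-plus-exceptions design; both lists are tiny illustrative samples, and a real lemmatizer needs a full grammar and part-of-speech information.

IRREGULAR = {"am": "be", "are": "be", "is": "be",
             "children": "child", "spoken": "speak"}

def lemmatize(word):
    w = word.lower()
    if w in IRREGULAR:                 # irregular words: table lookup first
        return IRREGULAR[w]
    # a few regular grammatical rules, longest suffix first
    if w.endswith("'s"):               # car's -> car
        return w[:-2]
    if w.endswith("s'"):               # cars' -> cars (next rule finishes)
        w = w[:-1]
    if w.endswith("s") and not w.endswith("ss"):   # cars -> car
        return w[:-1]
    return w

print([lemmatize(w) for w in ["am", "car's", "cars'", "children"]])
# ['be', 'car', 'car', 'child']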
Slide 15
Stemming
Morphological variants of a word (morphemes). Similar
terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in Information Retrieval. Words with a common
stem are mapped into the same term.
For example, read, reads, reading, and readable are mapped onto
the term read.
Slide 16
Advantages of stemming
• Improving retrieval effectiveness
• Matching similar words to the same stem
Slide 17
Categories of Stemmer
[Diagram: categories of stemmer include table lookup, affix removal (longest match, e.g. Porter, and simple removal), successor variety, and n-gram methods.]
Slide 18
Table Lookup
• Store a table of all index terms and their stems
• Terms from queries and indexes could then be stemmed
via table lookup
• Problems
– No such complete table exists for English
– Domain-dependent vocabulary may not use standard English
– Storage overhead
Term         Stem
Engineering  Engineer
Engineered   Engineer
Engineer     Engineer
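A sketch of table-lookup stemming as a plain dictionary: stemming is one hash lookup per term, and terms absent from the table fall through unchanged, which is exactly the coverage problem listed above.

STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def stem(term):
    # One lookup; unseen terms are returned as-is.
    return STEM_TABLE.get(term.lower(), term.lower())

print(stem("Engineered"))   # engineer
print(stem("computation"))  # computation (not covered by the table)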
Slide 19
Affix Removal Stemmers
Slide 20
Porter Stemmer
• Simple procedure for removing known affixes in
English without using a dictionary.
• Can produce unusual stems that are not English words:
– “computer”, “computational”, “computation” all reduced to
same token “comput”
Slide 21
Basic stemming methods
Use tables and rules
Slide 22
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
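A sketch of longest-match suffix rewriting with just these four rules; the real Porter stemmer also checks measure and vowel conditions on the stem before a rule may fire.

RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # Try longer suffixes first, so 'ational' wins over 'tional'.
    for suffix, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(apply_rules("caresses"))     # caress
print(apply_rules("relational"))   # relate
print(apply_rules("conditional"))  # condition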
Slide 23
Porter Stemmer
A multi-step, longest-match stemmer.
M. F. Porter, An algorithm for suffix stripping. (Originally
published in Program, 14 no. 3, pp 130-137, July 1980.)
http://www.tartarus.org/~martin/PorterStemmer/def.txt
Notation
v = one or more vowels
c = one or more consonants
(vc)^m = vowels followed by consonants, repeated m times
Any word can be written: [c](vc)^m[v]
m is called the measure of the word (a sketch of computing m follows below)
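A sketch of computing the measure m by collapsing a word to its consonant/vowel pattern and counting vc pairs; for simplicity only a, e, i, o, u count as vowels here, while Porter's full definition also treats y as a vowel in some positions.

def measure(word):
    pattern = ""
    for ch in word.lower():
        kind = "v" if ch in "aeiou" else "c"
        if not pattern or pattern[-1] != kind:
            pattern += kind          # collapse runs: 'tree' -> 'cv'
    return pattern.count("vc")       # each vc pair contributes one to m

for w in ["tr", "tree", "trouble", "oats", "oaten", "troubles"]:
    print(w, measure(w))
# tr 0, tree 0, trouble 1, oats 1, oaten 2, troubles 2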
Slide 24
Porter's Stemmer
Multi-Step Stemming Algorithm
Complex suffixes
Complex suffixes are removed bit by bit in the different
steps. Thus:
GENERALIZATIONS
becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)
[In this example, note that Steps 3 and 4 appear to be
unhelpful for information retrieval.]
Slide 25
Porter Stemmer: Step 1a
Suffix  Replacement  Examples
sses    ss           caresses → caress
ies     i            ponies → poni, ties → ti
ss      ss           caress → caress
s       null         cats → cat
Slide 26
Porter Stemmer: Step 1b
Conditions  Suffix  Replacement  Examples
(m > 0)     eed     ee           feed → feed, agreed → agree
(*v*)       ed      null         plastered → plaster, bled → bled
(*v*)       ing     null         motoring → motor, sing → sing
Notation
m = the measure of the stem
*v* = the stem contains a vowel
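A sketch of just the three Step 1b rules in the table, with small helpers standing in for Porter's full condition machinery; the complete algorithm also applies follow-up fix-up rules after removing ed/ing, which are omitted here.

def measure(stem):
    # consonant/vowel pattern collapsed to runs; m = number of vc pairs
    p = ""
    for ch in stem:
        k = "v" if ch in "aeiou" else "c"
        if not p or p[-1] != k:
            p += k
    return p.count("vc")

def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        # (m > 0) eed -> ee: agreed -> agree, but feed stays feed
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if any(ch in "aeiou" for ch in stem):   # (*v*) condition
                return stem                         # plastered -> plaster
    return word

for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b(w))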
Slide 27
Porter Stemmer: Step 5a
Conditions          Suffix  Replacement  Examples
(m > 1)             e       null         probate → probat, rate → rate
(m = 1 and not *o)  e       null         cease → ceas
(*o = the stem ends cvc, where the second c is not w, x, or y)
Slide 28
Porter Stemmer: Results
Suffix stripping of a vocabulary of 10,000 words
Number of words reduced in step 1: 3597
step 2: 766
step 3: 327
step 4: 2424
step 5: 1373
Number of words not reduced: 3650
The resulting vocabulary of stems contained 6370
distinct entries. Thus the suffix stripping process
reduced the size of the vocabulary by about one third.
Slide 29
Successor Variety
• Example
A body of text: able, axle, accident, ape, about
Successor variety of "apple":
1st (a): 4 (b, x, c, p)
2nd (ap): 1 (e)
Slide 30
Successor Variety (Continued)
• Idea
The successor variety of a term's substrings decreases as more characters are added, until a segment boundary is reached; at the boundary the successor variety sharply increases.
• Example
Test word: READABLE
Corpus: ABLE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE
Prefix    Successor Variety  Letters
R         3                  E, O, I
RE        2                  A, D
REA       1                  D
READ      3                  A, I, S
READA     1                  B
READAB    1                  L
READABL   1                  E
READABLE  1                  (blank)
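A sketch reproducing the table above: for each prefix of the test word, count the distinct letters that follow it in the corpus. Following the table's convention, the end-of-word "blank" is counted only when no letter ever follows the prefix.

CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_varieties(test_word, corpus):
    for i in range(1, len(test_word) + 1):
        prefix = test_word[:i]
        following = {w[i] for w in corpus
                     if w.startswith(prefix) and len(w) > i}
        variety = len(following) if following else 1   # lone exact match = blank
        print(prefix, variety, sorted(following) or ["blank"])

successor_varieties("READABLE", CORPUS)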
Slide 31
The successor variety stemming process
Slide 32
Basic stemming methods
N-gram stemmers
• An n-gram is a sequence of n consecutive letters
– Conflates terms based on the number of n-grams (sequences of n consecutive letters) that they share
– Bigrams or trigrams are most often used
– Terms that are strongly related by the number of shared n-grams are clustered
– Heuristics help in detecting the root form
– Language-independent technique
– Example:
All digrams of the word “statistics” are
st ta at ti is st ti ic cs
All digrams of “statistical” are
st ta at ti is st ti ic ca al
Slide 33
n-gram stemmers
• Digram
a pair of consecutive letters
• Shared digram method (Adamson and Boreham, 1974)
association measures are calculated between pairs of terms:
S = 2C / (A + B)
• where A: the number of unique digrams in the first word,
B: the number of unique digrams in the second word,
C: the number of unique digrams shared by the two words
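A sketch of the shared-digram measure (Dice's coefficient) over the sets of unique digrams of the two words:

def digrams(word):
    # all unique pairs of consecutive letters
    return {word[i:i + 2] for i in range(len(word) - 1)}

def similarity(w1, w2):
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))   # S = 2C / (A + B)

print(sorted(digrams("statistics")))
# ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(round(similarity("statistics", "statistical"), 2))   # 0.8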
Slide 34
n-gram stemmers (Continued)
• Example
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti
S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80
Slide 35
n-gram stemmers (Continued)
• Similarity matrix
determine the similarity measures for all pairs of terms in the database

        word1  word2  word3  ...  wordn-1
word2   S21
word3   S31    S32
...
wordn   Sn1    Sn2    Sn3    ...  Sn(n-1)

• Terms are clustered using a single-link clustering method
– most pairwise similarity measures were 0
– using a cutoff similarity value of 0.6 (see the sketch below)
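A sketch tying the pieces together: compute S for every pair of terms, keep the lower triangle of the matrix, and apply the 0.6 cutoff; the surviving pairs are what a single-link clusterer would merge first. The term list is an illustrative sample.

def digram_sim(w1, w2):
    d = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    a, b = d(w1), d(w2)
    return 2 * len(a & b) / (len(a) + len(b))

TERMS = ["statistics", "statistical", "policy", "police"]

for i in range(len(TERMS)):
    for j in range(i):                 # lower triangle: S_ij with i > j
        s = digram_sim(TERMS[i], TERMS[j])
        if s >= 0.6:                   # cutoff from the slide
            print(TERMS[i], TERMS[j], round(s, 2))
# statistical statistics 0.8
# police policy 0.8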
Slide 36
Stemmers are not perfect:
organization → organ
university → universe
policy → police
Slide 37
Identification of phrases and collocation
• Collocations
Expressions consisting of two or more words that correspond to some conventional way of saying things
– Good indicators of a text’s content (especially noun and prepositional phrases)
– Important concepts in the subject domain:
• e.g. joint venture, make up
– Less ambiguous than the single words they are composed of
Slide 38
Identification of phrases and collocation
Recognition of phrases
• Use of a dictionary of phrases:
– Only practical in restricted subject domains
• Statistical approach:
– Assumption: words that often co-occur might denote a phrase
– For phrases: not always correct and meaningful
Slide 39
Identification of phrases and collocation
• Normalization of phrases
Mapping of equivalent phrases to standard single phrases
Slide 40
Identification of phrases and collocation
Slide 41
On Metadata
• Often included in Web pages
• Hidden from the browser, but useful for indexing
• Information about a document that may not be part of the document itself (data about data).
• Descriptive metadata is external to the meaning of the document:
– Author
– Title
– Source (book, magazine, newspaper, journal)
– Date
– ISBN
– Publisher
– Length
Slide 42
Web Metadata
Slide 43