
Module 5

Representing and Mining Text


Dealing with Text
• Data are represented in ways natural to the problems from which
  they were derived.

• A vast amount of data exists as free-form text.

• If we want to apply the many data mining tools that we have at
  our disposal, we must
  • either engineer the data representation to match the tools
    (representation engineering), or
  • build new tools to match the data.


Why Text is Difficult
• Text is “unstructured” data.

• Linguistic structure is intended for human communication, not
  for computers.

• Word order matters sometimes.

• Text can be dirty.
  • People write ungrammatically, misspell words, abbreviate
    unpredictably, and punctuate randomly.
  • It may contain synonyms, homographs, abbreviations, etc.

• Context matters.
Text Representation

• Goal: take a set of documents (each of which is a relatively
  free-form sequence of words) and turn it into our familiar
  feature-vector form.

• A collection of documents is called a corpus.

• A document is composed of individual tokens or terms.

• Each document is one instance,
  • but we don’t know in advance what the features will be.
“Bag of Words”
• Treat every document as just a collection of individual words.

• Ignore grammar, word order, sentence structure, and (usually)
  punctuation.

• Treat every word in a document as a potentially important keyword
  of the document.

• What will be the feature’s value in a given document?
  • Each word is represented by a one (if the token is present in the
    document) or a zero (if the token is not present), as sketched below.

• Straightforward representation:
  • Inexpensive to generate.
  • Tends to work well for many tasks.
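A minimal sketch of this 0/1 representation in Python (the tiny corpus
and the whitespace tokenization below are illustrative assumptions, not
taken from the slides):

# Binary bag-of-words: one feature per unique word, valued 1 or 0.
docs = ["the quick brown fox", "the lazy dog", "quick brown dog"]

# The vocabulary: every unique token across the corpus becomes a feature.
vocab = sorted({tok for d in docs for tok in d.split()})

# Each document becomes a vector of ones (token present) and
# zeros (token absent).
vectors = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

print(vocab)       # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
print(vectors[0])  # [1, 0, 1, 0, 1, 1]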
Pre-processing of Text
The following steps should be performed (see the sketch below):

• The case should be normalized.
  • Every term is converted to lowercase.

• Words should be stemmed.
  • Suffixes are removed.
  • E.g., noun plurals are transformed to singular forms.

• Stop-words should be removed.
  • A stop-word is a very common word in English (or whatever language
    is being parsed).
  • Typical words such as the, and, of, and on are removed.
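One way to carry out these steps, sketched with NLTK (the slides do not
name a library, so this choice is an assumption; it needs pip install
nltk and a one-time nltk.download('stopwords')):

# Pre-processing: normalize case, remove stop-words, stem.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

text = "The Musicians played famous Jazz standards"  # illustrative sentence

tokens = text.lower().split()                   # normalize case
tokens = [t for t in tokens if t not in stops]  # drop stop-words ("the")
tokens = [stemmer.stem(t) for t in tokens]      # strip suffixes

print(tokens)  # ['musician', 'play', 'famou', 'jazz', 'standard']

Stop-words are removed before stemming here so that the raw tokens still
match the entries in the stop-word list.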
Term Frequency

• Use the word count (frequency) in the document instead of just a
  zero or one.

• This differentiates between how many times a word is used, as
  sketched below.
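A short sketch of raw term frequency (the sentence is illustrative):

# Term frequency: raw counts instead of 0/1.
from collections import Counter

doc = "jazz musicians love jazz and play jazz standards"
tf = Counter(doc.split())

print(tf["jazz"])       # 3 -- used three times, not merely "present"
print(tf["standards"])  # 1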


Normalized Term Frequency

• Documents are of various lengths, and words occur with different
  frequencies.

• Words should not be too common or too rare.
  • Impose both an upper and a lower limit on the number (or fraction)
    of documents in which a word may occur.
  • Feature selection is often employed.

• The raw term frequencies are normalized in some way,
  • such as by dividing each by the total number of words in the
    document (sketched below),
  • or by the frequency of the specific term in the corpus.
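A sketch of the first normalization mentioned above, dividing each raw
count by the document's length (the text is illustrative):

# Normalized term frequency: raw count / total words in the document.
from collections import Counter

doc = "jazz musicians love jazz and play jazz standards"
tokens = doc.split()

norm_tf = {term: n / len(tokens) for term, n in Counter(tokens).items()}

print(norm_tf["jazz"])  # 3 / 8 = 0.375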
TF-IDF

• TF-IDF combines term frequency with inverse document frequency:

  $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$

• Inverse Document Frequency (IDF) of a term:

  $\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$
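A sketch that follows the two formulas above directly (the
three-document corpus is an illustrative assumption):

# TFIDF(t, d) = TF(t, d) * IDF(t), with IDF(t) = 1 + log(N / n_t).
import math
from collections import Counter

corpus = ["famous jazz saxophonist",
          "famous rock guitarist",
          "jazz piano trio"]
N = len(corpus)

def idf(term):
    n_t = sum(1 for doc in corpus if term in doc.split())  # doc count for term
    return 1 + math.log(N / n_t)

def tfidf(term, doc):
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)  # length-normalized TF
    return tf * idf(term)

# "jazz" appears in 2 of 3 documents; "saxophonist" in only 1 of 3,
# so the rarer term gets the larger weight in the first document.
print(round(tfidf("jazz", corpus[0]), 3))         # 0.468
print(round(tfidf("saxophonist", corpus[0]), 3))  # 0.700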
Example: Jazz Musicians

• 15 prominent jazz musicians and excerpts of their biographies
  from Wikipedia.

• Nearly 2,000 features remain after stemming and stop-word removal!

• Consider the sample phrase “Famous jazz saxophonist born in Kansas
  who played bebop and latin” after stemming.

[Figure: representation of the query “Famous jazz saxophonist born in
Kansas who played bebop and latin” after stopword removal and term
frequency normalization.]

[Figure: final TFIDF representation of the query “Famous jazz
saxophonist born in Kansas who played bebop and latin”.]
Beyond “Bag of Words”

• N-gram Sequences

• Named Entity Extraction

• Topic Models
N-gram Sequences
• In some cases, word order is important and you want to preserve
  some information about it in the representation.

• A next step up in complexity is to include sequences of adjacent
  words as terms.

• Adjacent pairs are commonly called bi-grams.

• Example: “The quick brown fox jumps”
  • It would be transformed into {quick, brown, fox, jumps,
    quick_brown, brown_fox, fox_jumps}, as sketched below.

• N-grams greatly increase the size of the feature set.
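A sketch of the bi-gram construction for that example (removal of the
stop-word "The" is assumed to have already happened):

# Bi-grams: pair each token with its right-hand neighbor.
tokens = ["quick", "brown", "fox", "jumps"]

bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
features = tokens + bigrams

print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']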


Topic Models
Text Mining Example

Task: predict the stock market based on the stories that appear on
the news wires.
Mining News Stories to Predict Stock Price Movement

[Figures: case-study slides on mining news stories to predict stock
price movement.]
