
Module 5

Representing and Mining Text


Dealing with Text
• Data are represented in ways natural to the problems from which
  they were derived.

• A vast amount of data exists as free-form text.

• If we want to apply the many data mining tools that we have at
  our disposal, we must
  • either engineer the data representation to match the tools
    (representation engineering), or
  • build new tools to match the data.


Why Text is Difficult
• Text is “unstructured” data.

• Linguistic structure is intended for human communication, not
  for computers.

• Word order matters sometimes.

• Text can be dirty.
  • People write ungrammatically, misspell words, abbreviate
    unpredictably, and punctuate randomly.
  • It may contain synonyms, homographs, abbreviations, etc.

• Context matters.
Text Representation

• Goal: take a set of documents (each of which is a relatively
  free-form sequence of words) and turn it into our familiar
  feature-vector form.

• A collection of documents is called a corpus.

• A document is composed of individual tokens or terms.

• Each document is one instance,
  • but we don’t know in advance what the features will be.
“Bag of Words”
• Treat every document as just a collection of individual words.

• Ignore grammar, word order, sentence structure, and (usually)
  punctuation.

• Treat every word in a document as a potentially important keyword
  of the document.

• What will be the feature’s value in a given document?
  • Each word is represented by a one (if the token is present in the
    document) or a zero (if the token is not present), as sketched below.

• Straightforward representation:
  • Inexpensive to generate.
  • Tends to work well for many tasks.
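A minimal sketch of this 0/1 representation in Python (the tiny corpus
and the whitespace tokenization below are illustrative assumptions, not
taken from the slides):

# Binary bag-of-words: one feature per unique word, valued 1 or 0.
docs = ["the quick brown fox", "the lazy dog", "quick brown dog"]

# The vocabulary: every unique token across the corpus becomes a feature.
vocab = sorted({tok for d in docs for tok in d.split()})

# Each document becomes a vector of ones (token present) and
# zeros (token absent).
vectors = [[1 if term in d.split() else 0 for term in vocab] for d in docs]

print(vocab)       # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
print(vectors[0])  # [1, 0, 1, 0, 1, 1]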
Pre-processing of Text
The following steps should be performed (see the sketch below):

• The case should be normalized.
  • Every term is converted to lowercase.

• Words should be stemmed.
  • Suffixes are removed.
  • E.g., noun plurals are transformed to singular forms.

• Stop-words should be removed.
  • A stop-word is a very common word in English (or whatever language
    is being parsed).
  • Typical words such as the, and, of, and on are removed.
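One way to carry out these steps, sketched with NLTK (the slides do not
name a library, so this choice is an assumption; it needs pip install
nltk and a one-time nltk.download('stopwords')):

# Pre-processing: normalize case, remove stop-words, stem.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

text = "The Musicians played famous Jazz standards"  # illustrative sentence

tokens = text.lower().split()                   # normalize case
tokens = [t for t in tokens if t not in stops]  # drop stop-words ("the")
tokens = [stemmer.stem(t) for t in tokens]      # strip suffixes

print(tokens)  # ['musician', 'play', 'famou', 'jazz', 'standard']

Stop-words are removed before stemming here so that the raw tokens still
match the entries in the stop-word list.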
Term Frequency

• Use the word count (frequency) in the document instead of just a
  zero or one.

• This differentiates between how many times a word is used, as
  sketched below.
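A short sketch of raw term frequency (the sentence is illustrative):

# Term frequency: raw counts instead of 0/1.
from collections import Counter

doc = "jazz musicians love jazz and play jazz standards"
tf = Counter(doc.split())

print(tf["jazz"])       # 3 -- used three times, not merely "present"
print(tf["standards"])  # 1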


Normalized Term Frequency

• Documents are of various lengths, and words occur with different
  frequencies.

• Words should not be too common or too rare.
  • Impose both an upper and a lower limit on the number (or fraction)
    of documents in which a word may occur.
  • Feature selection is often employed.

• The raw term frequencies are normalized in some way,
  • such as by dividing each by the total number of words in the
    document (sketched below),
  • or by the frequency of the specific term in the corpus.
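A sketch of the first normalization mentioned above, dividing each raw
count by the document's length (the text is illustrative):

# Normalized term frequency: raw count / total words in the document.
from collections import Counter

doc = "jazz musicians love jazz and play jazz standards"
tokens = doc.split()

norm_tf = {term: n / len(tokens) for term, n in Counter(tokens).items()}

print(norm_tf["jazz"])  # 3 / 8 = 0.375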
TF-IDF

• TF-IDF combines term frequency with inverse document frequency:

  $\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$

• Inverse Document Frequency (IDF) of a term:

  $\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$
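A sketch that follows the two formulas above directly (the
three-document corpus is an illustrative assumption):

# TFIDF(t, d) = TF(t, d) * IDF(t), with IDF(t) = 1 + log(N / n_t).
import math
from collections import Counter

corpus = ["famous jazz saxophonist",
          "famous rock guitarist",
          "jazz piano trio"]
N = len(corpus)

def idf(term):
    n_t = sum(1 for doc in corpus if term in doc.split())  # doc count for term
    return 1 + math.log(N / n_t)

def tfidf(term, doc):
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)  # length-normalized TF
    return tf * idf(term)

# "jazz" appears in 2 of 3 documents; "saxophonist" in only 1 of 3,
# so the rarer term gets the larger weight in the first document.
print(round(tfidf("jazz", corpus[0]), 3))         # 0.468
print(round(tfidf("saxophonist", corpus[0]), 3))  # 0.700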
Example: Jazz Musicians

• 15 prominent jazz musicians and excerpts of their biographies
  from Wikipedia.

• Nearly 2,000 features remain after stemming and stop-word removal!

• Consider the sample phrase “Famous jazz saxophonist born in Kansas
  who played bebop and latin” after stemming.

[Figure: representation of the query “Famous jazz saxophonist born in
Kansas who played bebop and latin” after stopword removal and term
frequency normalization.]

[Figure: final TFIDF representation of the query “Famous jazz
saxophonist born in Kansas who played bebop and latin”.]
Beyond “Bag of Words”

• N-gram Sequences

• Named Entity Extraction

• Topic Models
N-gram Sequences
• In some cases, word order is important and you want to preserve
  some information about it in the representation.

• A next step up in complexity is to include sequences of adjacent
  words as terms.

• Adjacent pairs are commonly called bi-grams.

• Example: “The quick brown fox jumps”
  • It would be transformed into {quick, brown, fox, jumps,
    quick_brown, brown_fox, fox_jumps}, as sketched below.

• N-grams greatly increase the size of the feature set.
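A sketch of the bi-gram construction for that example (removal of the
stop-word "The" is assumed to have already happened):

# Bi-grams: pair each token with its right-hand neighbor.
tokens = ["quick", "brown", "fox", "jumps"]

bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
features = tokens + bigrams

print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']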


Topic Models
Text Mining Example

Task: predict the stock market based on the stories that appear on
the news wires.
Mining News Stories to Predict Stock Price Movement

[Figures: case-study slides on mining news stories to predict stock
price movement.]
