Text Mining

From Wikipedia, the free encyclopedia



Text mining, also called text data mining and roughly equivalent to text analytics,
refers to the process of deriving high-quality information from text. High-quality information is typically
derived by identifying patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the
addition of some derived linguistic features and the removal of others, and subsequent insertion into a
database), deriving patterns within the structured data, and finally evaluation and interpretation of the
output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and
interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity
extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity
relation modeling (i.e., learning relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word frequency distributions,
pattern recognition, tagging/annotation, information extraction, data mining techniques including link and
association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn
text into data for analysis, via application of natural language processing (NLP) and analytical methods.
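
As a minimal sketch of the lexical-analysis step mentioned above, the following Python snippet counts word frequencies across a small corpus. The tokenizer, the function names, and the two-sentence corpus are illustrative assumptions, not part of any particular text mining product.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequencies(documents: list[str]) -> Counter:
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

# Hypothetical toy corpus, for illustration only.
corpus = [
    "Text mining turns text into data.",
    "Data mining finds patterns in data.",
]
freqs = word_frequencies(corpus)

# Print the three most frequent tokens with their counts.
print(freqs.most_common(3))
```

The resulting table of counts is a simple instance of turning text into data: each document collection becomes a structured mapping that statistical methods can work on.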

A typical application is to scan a set of documents written in a natural language and either model the
document set for predictive classification purposes or populate a database or search index with the
information extracted.
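
Populating a search index can be sketched with a small inverted index, which maps each term to the documents that contain it. This is only an illustration under simple assumptions: the document ids, the sample texts, and the helper names `build_index` and `search` are hypothetical, and real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical document set, for illustration only.
documents = {
    1: "Text mining derives information from text.",
    2: "Sentiment analysis is a text mining task.",
    3: "Relational databases store numerical data.",
}
index = build_index(documents)
print(sorted(search(index, "text mining")))
```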

Contents
1 Text mining and text analytics
2 History
3 Text analysis processes
4 Applications
    4.1 Security applications
    4.2 Biomedical applications
    4.3 Software applications
    4.4 Online media applications
    4.5 Marketing applications
    4.6 Sentiment analysis
    4.7 Academic applications
5 Software and applications
    5.1 Commercial
    5.2 Open source
6 Implications
7 See also
8 Notes
9 References
10 External links

Text mining and text analytics


The term text analytics describes a set of linguistic, statistical, and machine learning techniques that
model and structure the information content of textual sources for business intelligence, exploratory data
analysis, research, or investigation.[1] The term is roughly synonymous with text mining; indeed, Ronen
Feldman modified a 2000 description of "text mining"[2] in 2004 to describe "text analytics."[3] The latter
term is now used more frequently in business settings while "text mining" is used in some of the earliest
application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems,
whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism
that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These
techniques and processes discover and present knowledge (facts, business rules, and relationships) that
is otherwise locked in textual form, impenetrable to automated processing.

History
Labor-intensive manual text mining approaches first surfaced in the mid-1980s,[examples needed] but
technological advances have enabled the field to advance during the past decade. Text mining is an
interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and
computational linguistics. As most information (common estimates say over 80%)[5] is currently stored
as text, text mining is believed to have high potential commercial value. Increasing interest is being
paid to multilingual data mining: the
ability to gain information across languages and cluster similar items from different linguistic sources
according to their meaning.

The challenge of exploiting the large proportion of enterprise information that originates in
"unstructured" form has been recognized for decades.[6] It is recognized in the earliest definition of
business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence
System, which describes a system that will:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating
interest profiles for each of the 'action points' in an organization. Both incoming and internally generated
documents are automatically abstracted, characterized by a word pattern, and sent automatically to
appropriate action points."

Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s
and '90s as a software category and field of practice, the emphasis was on numerical data stored in
relational databases. This is not surprising: text in "unstructured" documents is hard to process. The
emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from
algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text
Data Mining:[7]

"For almost a decade the computational linguistics community has viewed large text collections as a
resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to
suggest a new emphasis: the use of large online text collections to discover new facts and trends about the
world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather,
a mixture of computationally-driven and user-guided analysis may open the door to exciting new results."

Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a
decade later.

Text analysis processes


Subtasks (components of a larger text-analytics effort) typically include:
Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a
set of textual materials, on the Web or held in a file system, database, or content management
system, for analysis.
Although some text analytics systems limit themselves to purely statistical methods, many others
apply more extensive natural language processing, such as part of speech tagging, syntactic
parsing, and other types of linguistic analysis.[citation needed]
Named entity recognition is the use of gazetteers or statistical techniques to identify named text
features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so
on. Disambiguation (the use of contextual clues) may be required to decide whether, for
instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star (Glenn or
Harrison?[who?]), a river crossing, or some other entity.
Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses,
and quantities (with units) can be discerned via regular expressions or other pattern matches.
Coreference: identification of noun phrases and other terms that refer to the same object.
Relationship, fact, and event extraction: identification of associations among entities and other
information in text.
Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting
various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics
techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in
distinguishing opinion holder and opinion object.[8]
Quantitative text analysis is a set of techniques stemming from the social sciences where either a
human judge or a computer extracts semantic or grammatical relationships between words in
order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose
of psychological profiling etc.[9]
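
Two of the subtasks listed above, recognition of pattern-identified entities and gazetteer-based named entity recognition, can be sketched in a few lines of Python. The regular expressions are deliberately simplistic, and the gazetteer entries, entity labels, function names, and sample sentence are all hypothetical. The gazetteer lookup performs no disambiguation, so it could not tell the different senses of "Ford" apart.

```python
import re

# Deliberately simple patterns; real e-mail and phone formats vary far more.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # assumes NNN-NNN-NNNN only
QUANTITY_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(kg|km|mg|ml)\b")

# Hypothetical gazetteer mapping surface strings to entity types.
GAZETTEER = {
    "IBM": "ORGANIZATION",
    "University of Manchester": "ORGANIZATION",
    "Tokyo": "PLACE",
}

def extract_pattern_entities(text):
    """Find e-mail addresses, phone numbers, and quantities by regex."""
    entities = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    entities += [("PHONE", m.group()) for m in PHONE_RE.finditer(text)]
    entities += [
        ("QUANTITY", f"{m.group(1)} {m.group(2)}")
        for m in QUANTITY_RE.finditer(text)
    ]
    return entities

def extract_gazetteer_entities(text):
    """Find gazetteer names in the text, ordered by position."""
    found = []
    for name, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            found.append((label, name, m.start()))
    return sorted(found, key=lambda entity: entity[2])

sample = (
    "Contact jane.doe@example.org or 555-123-4567 about the 2.5 kg "
    "sample sent from IBM to Tokyo."
)
print(extract_pattern_entities(sample))
print(extract_gazetteer_entities(sample))
```

A statistical recognizer would replace the fixed gazetteer with a model trained on annotated text; the lookup shown here only illustrates the simplest form of the technique.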

Applications
The technology is now broadly applied for a wide variety of government, research, and business needs.
Applications can be sorted into a number of categories by analysis type or by business function. Using
this approach to classifying solutions, application categories include:

Enterprise Business Intelligence/Data Mining, Competitive Intelligence
E-Discovery, Records Management
National Security/Intelligence
Scientific discovery, especially Life Sciences
Sentiment Analysis Tools, Listening Platforms
Natural Language/Semantic Toolkit or Service
Publishing
Automated ad placement
Search/Information Access
Social media monitoring

Security applications

Many text mining software packages are marketed for security applications, especially monitoring and
analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes.[10] It
is also involved in the study of text encryption/decryption.

Biomedical applications

Main article: Biomedical text mining

A range of text mining applications in the biomedical literature has been described.[11]
One online text mining application in the biomedical literature is GoPubMed.[12] GoPubMed was the first
semantic search engine on the Web.[citation needed] Another example is PubGene, which combines biomedical
text mining with network visualization as an Internet service.[13]

Software applications

Text mining methods and software are also being researched and developed by major firms, including IBM
and Microsoft, to further automate the mining and analysis processes, and by different firms working in
the area of search and indexing in general as a way to improve their results. Within the public sector much
effort has been concentrated on creating software for tracking and monitoring terrorist activities.[14]

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to disambiguate
information and to provide readers with greater search experiences, which in turn increases site
"stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share,
associate and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer
relationship management. Coussement and Van den Poel (2008)[15] apply it to improve predictive
analytics models for customer churn (customer attrition).[16]

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a
movie.[17] Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources
for affectivity of words and concepts have been made for WordNet[18] and ConceptNet,[19] respectively.
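
The idea of labeling the affectivity of words can be illustrated with a tiny lexicon-based scorer. This is a sketch under stated assumptions: the lexicon, its scores, the negator list, and the function name are hypothetical and are not taken from the WordNet or ConceptNet resources mentioned above.

```python
import re

# Hypothetical affect lexicon: word -> polarity score (assumed values).
LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "enjoyable": 1.5,
    "bad": -1.0,
    "boring": -1.5,
    "awful": -2.0,
}
NEGATORS = {"not", "never", "no"}

def score_review(text):
    """Sum lexicon scores, flipping the word right after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            score += -value if negate else value
        # Negation only applies to the token immediately after the negator.
        negate = False
    return score

print(score_review("The plot was not boring, and the acting was great."))
print(score_review("An awful, boring movie."))
```

A positive total suggests a favorable review and a negative total an unfavorable one. A classifier trained on a labeled data set, as in the movie review work cited above, would learn such weights from examples instead of taking them from a hand-built list.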

Text has been used to detect emotions in the related area of affective computing.[20] Text-based
approaches to affective computing have been used on multiple corpora such as student evaluations,
children's stories and news stories.

Academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing
indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information
is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for
an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal
Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer
specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in
the world. NaCTeM is operated by the University of Manchester[21] in close collaboration with the
Tsujii Lab,[22] University of Tokyo.[23] NaCTeM provides customised tools and research facilities and
offers advice to the academic community. It is funded by the Joint Information Systems
Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial
focus on text mining in the biological and biomedical sciences, research has since expanded into
the areas of social sciences.
In the United States, the School of Information at University of California, Berkeley is developing
a program called BioText to assist biology researchers in text mining and analysis.

Software and applications


Text mining computer programs are available from many commercial and open source companies and
sources.

Commercial

AeroText - a suite of text mining applications for content analysis. Content used can be in
multiple languages.
Attensity - hosted, integrated and stand-alone text mining (analytics) software that uses natural
language processing technology to address collective intelligence in social media and forums; the
voice of the customer in surveys and emails; customer relationship management; e-services;
research and e-discovery; risk and compliance; and intelligence analysis.
Autonomy - text mining, clustering and categorization software
Basis Technology - provides a suite of text analysis modules to identify language, enable search
in more than 20 languages, extract entities, and efficiently search for and translate entities.
Clarabridge - text analytics (text mining) software, including natural language (NLP), machine
learning, clustering and categorization. Provides SaaS, hosted and on-premise text and sentiment
analytics that enables companies to collect, listen to, analyze, and act on the Voice of the
Customer (VOC) from both external (Twitter, Facebook, Yelp!, product forums, etc.) and internal
sources (call center notes, CRM, Enterprise Data Warehouse, BI, surveys, emails, etc.).
Endeca Technologies - provides software to analyze and cluster unstructured text.
Expert System S.p.A. - suite of semantic technologies and products for developers and
knowledge managers.
Fair Isaac - leading provider of decision management solutions powered by advanced analytics
(includes text analytics).
General Sentiment - Social Intelligence platform that uses natural language processing to discover
affinities between the fans of brands with the fans of traditional television shows in social media.
Stand alone text analytics to capture social knowledge base on billions of topics stored to 2004.
IBM LanguageWare - the IBM suite for text analytics (tools and Runtime).
IBM SPSS - provider of Modeler Premium, which contains advanced NLP-based text analysis
capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction
with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey
responses using NLP-based capabilities for further analysis or reporting.
Inxight - provider of text analytics, search, and unstructured visualization technologies. (Inxight
was bought by Business Objects that was bought by SAP AG in 2008).
LanguageWare - text analysis libraries and customization software from IBM.
Language Computer Corporation - text extraction and analysis tools, available in multiple
languages.
Lexalytics - provider of a text analytics engine used in Social Media Monitoring, Voice of
Customer, Survey Analysis, and other applications.
LexisNexis - provider of business intelligence solutions based on an extensive news and company
information content set. LexisNexis acquired DataOps to pursue search
Mathematica - provides built in tools for text alignment, pattern matching, clustering and
semantic analysis.
SAS - SAS Text Miner and Teragram; commercial text analytics, natural language processing,
and taxonomy software used for Information Management. SAS Text Miner rated as the third
most used text mining software (9%) by Rexer's Annual Data Miner Survey in 2010.[24]
IBM SPSS - provider of IBM SPSS Modeler and IBM SPSS Text Analytics (now called IBM
SPSS Modeler Premium).[25] Rated as the second (17%) and fourth (7%), respectively, most used
text mining software by Rexer's Annual Data Miner Survey in 2010.[24]
StatSoft - provides STATISTICA Text Miner as an optional extension to STATISTICA Data
Miner, for Predictive Analytics Solutions. Rated as the top used text mining software (19%) by
Rexer's Annual Data Miner Survey in 2010.[24]
Sysomos - provider of a social media analytics software platform, including text analytics and
sentiment analysis on online consumer conversations.
WordStat - Content analysis and text mining add-on module of QDA Miner for analyzing large
amounts of text data.
Thomson Data Analyzer enables complex analysis on patent information, scientific publications
and news.

Open source

Carrot2 - text and search results clustering framework.
GATE - General Architecture for Text Engineering, an open-source toolbox for natural language
processing and language engineering
OpenNLP - natural language processing
Natural Language Toolkit (NLTK) - a suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for the Python programming language.
RapidMiner with its Text Processing Extension - data and text mining software. Rated as the fifth
most used text mining software (6%) by Rexer's Annual Data Miner Survey in 2010.[24]
Unstructured Information Management Architecture (UIMA) - a component framework to
analyze unstructured content such as text, audio and video, originally developed by IBM.
The programming language R provides a framework for text mining applications in the package
tm.

Implications
Until recently, websites most often used text-based searches, which only found documents containing
specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content
based on meaning and context (rather than just by a specific word).

Additionally, text mining software can be used to build large dossiers of information about specific
people and events. For example, large datasets based on data extracted from news reports can be built to
facilitate social networks analysis or counter-intelligence. In effect, the text mining software may act in a
capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of
analysis.

Text mining is also used in some email spam filters as a way of determining the characteristics of
messages that are likely to be advertisements or other unwanted material.
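
One common way to learn such message characteristics is a naive Bayes classifier over word counts. The source does not say which method any particular spam filter uses, so the following is only a minimal sketch of that one technique; the class name, the labels "spam" and "ham", and the four training messages are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        """Record the tokens of one labeled message."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.message_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the message tokens.

        Assumes at least one training message exists for each label.
        """
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocabulary)
        for token in tokenize(text):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / denominator)
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))

# Hypothetical training messages, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Win a free prize now", "spam")
spam_filter.train("Cheap offer, buy now", "spam")
spam_filter.train("Meeting agenda for Monday", "ham")
spam_filter.train("Please review the attached report", "ham")

print(spam_filter.classify("free offer now"))
print(spam_filter.classify("review the agenda"))
```

Add-one smoothing keeps a word that never appeared under a label from driving that label's probability to zero. A practical filter would need far more training data and would usually look at message headers and other features as well.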

See also

Approximate nonnegative matrix factorization, an algorithm used for text mining
BioCreative - text mining evaluation in biomedical literature
Business intelligence
Computational linguistics
Concept Mining
Data mining
Information retrieval
Name resolution
National Centre for Text Mining (NaCTeM)
Natural language processing
Stop words
Text classification sometimes is considered a (sub)task of text mining.
OpenNLP - Java NLP library from Apache
UIMA - Unstructured Information Management Architecture from IBM.
Web mining, a task that may involve text mining (e.g. first find appropriate web pages by
classifying crawled web pages, then extract the desired information from the text content of these
pages considered relevant).
w-shingling
Sequence mining: String and Sequence Mining
Noisy text analytics
Information extraction
Named entity recognition
Identity resolution
News analytics

Notes
1. ^ Defining Text Analytics
2. ^ KDD-2000 Workshop on Text Mining
3. ^ Text Analytics: Theory and Practice
4. ^ Hobbs, Walker, and Amsler, Natural Language Access to Structured Text, 1982
5. ^ a b Unstructured Data and the 80 Percent Rule
6. ^ http://www.b-eye-network.com/view/6311
7. ^ http://people.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
8. ^ http://www.clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=722
9. ^ http://dingo.sbs.arizona.edu/~mehl/eReprints/Text%20analysis%20Handbook.pdf
10. ^ Alessandro Zanasi: Virtual Weapons for Real Wars: Text Mining for National Security, E.
Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 53-60, Springer 2009
11. ^ K. Bretonnel Cohen & Lawrence Hunter (January 2008). "Getting Started in Text Mining".
PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579.
PMID 18225946. http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0040020.
12. ^ GoPubMed: exploring PubMed with the Gene Ontology, A. Doms and M. Schroeder, 2005,
http://nar.oxfordjournals.org/content/33/suppl_2/W783.long
13. ^ Tor-Kristian Jenssen, Astrid Lægreid, Jan Komorowski & Eivind Hovig (2001). "A literature
network of human genes for high-throughput analysis of gene expression". Nature Genetics 28
(1): 21-28. doi:10.1038/ng0501-21. PMID 11326270.
http://www.nature.com/ng/journal/v28/n1/abs/ng0501_21.html.
Summary: Daniel R. Masys (2001). "Linking microarray data to the literature". Nature
Genetics 28 (1): 9-10. doi:10.1038/ng0501-9. PMID 11326264.
14. ^ Texor
15. ^ Academic Papers about Analytical Customer Relationship Management
16. ^ Kristof Coussement, and Dirk Van den Poel (forthcoming 2008). "Integrating the Voice of
Customers through Call Center Emails into a Decision Support System for Churn Prediction".
Information and Management. http://www.textmining.ugent.be.
17. ^ Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002). "Thumbs up? Sentiment
Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP). pp. 79-86.
http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf.
18. ^ Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005). "Developing Affective Lexical
Resources". PsychNology Journal 2 (1): 61-83.
http://www.psychnology.org/File/PSYCHNOLOGY_JOURNAL_2_1_VALITUTTI.pdf.
19. ^ Erik Cambria, Robert Speer, Catherine Havasi and Amir Hussain (2010). "SenticNet: a Publicly
Available Semantic Resource for Opinion Mining". Proceedings of AAAI CSK. pp. 14-18.
http://www.aaai.org/ocs/index.php/FSS/FSS10/paper/download/2216/2617.pdf.
20. ^ Rafael A. Calvo, Sidney K. D'Mello (2010). "Affect Detection: An Interdisciplinary Review of
Models, Methods, and their Applications". IEEE Transactions on Affective Computing 1 (1):
18-37. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5520655.
21. ^ The University of Manchester
22. ^ Tsujii Laboratory
23. ^ The University of Tokyo
24. ^ a b c d Rexer Analytics 4th Annual Data Miner Survey - 2010
25. ^ IBM - SPSS - Software Products

References
Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine.
Artech House Books. ISBN 978-1-58053-984-5
Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons.
ISBN 978-0-470-17643-6
Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge
University Press. ISBN 978-0-521-83657-9
Indurkhya, N., and Damerau, F. (2010). Handbook Of Natural Language Processing, 2nd Edition.
Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1
Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN
1-84628-175-X
Konchady, M. Text Mining Application Programming (Programming Series). Charles River
Media. ISBN 1-58450-460-9
Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing.
Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9
Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and
Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN
978-0-12-386979-1
McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence".
DM Review, 21-22.
Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications.
Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3

External links


Marti Hearst: What Is Text Mining? (October, 2003)
Automatic Content Extraction, Linguistic Data Consortium
Automatic Content Extraction, NIST
Academic, Open Source, and Industrial tools, Alias-I

Retrieved from "http://en.wikipedia.org/w/index.php?title=Text_mining&oldid=516903582"



This page was last modified on 9 October 2012 at 22:39.
