Rapidminer Text 4.5 Tutorial
July 19, 2009
Copyright 2001-2009
The Word Vector Tool and this Tutorial are published under the GNU Public License.
Chapter 1
Introduction
The Word Vector Tool (WVTool) builds the core of the RapidMiner Text plugin and is a flexible Java library for statistical language modeling. In particular, it is used to create word vector representations of text documents in the vector space model [1]. In the vector space model, a document is represented by a vector that denotes the relevance of a given set of terms for this document. Terms are usually natural language words, but they can also be more general entities, such as words that are reduced to some linguistic base form, or abstract concepts such as <number>, denoting any occurrence of a number in the text.
From the early days of automatic text processing and information retrieval, the vector space model has played a very important role. It is the point of departure for many automatic text processing tasks, such as text classification, clustering, characterization and summarization, as well as information retrieval [2].

The aim of the Java WVTool is to provide a simple to use, simple to extend pure Java library for creating word vectors. It can easily be invoked from any Java application. Furthermore, the tool is tightly integrated with the RapidMiner machine learning environment [3], allowing you to perform diverse experiments on textual data directly. In this way, the WVTool bridges a gap between highly sophisticated linguistic packages such as the GATE system [11] on the one side and the many partial solutions that are part of diverse text and information retrieval applications on the other side. Most closely related to the Word Vector Tool is the Bow package [10], a C library for creating word vectors and for clustering and classifying text.
The next chapter explains the basic concepts of the library and how to use it from Java applications. Chapter 3 discusses the RapidMiner integration. Chapter 4 introduces some advanced topics, such as using a web crawler or dictionaries. Chapter 5 gives a brief overview of the performance of the WVTool on a test corpus.
Chapter 2
Using the WVTool as Java Library
The WVTool can be used as a standalone Java library or as a plugin for the RapidMiner environment.
2.1 Installation
To use the WVTool as a Java library, first obtain a copy of the WVTool from the SourceForge WVTool homepage (http://wvtool.sourceforge.net), uncompress the archive, and put the wvtool.jar file and all jar files in the lib subdirectory into your classpath.

There are two basic operations the WVTool is able to perform:

1. Create a word list (the dimensions of the vector space) from a set of text documents.
2. Create word vectors from a set of texts (given a word list).

A word list contains all terms used for vectorization together with some statistics (e.g. in how many documents a term appears). The word list is needed for vectorization to define which terms are considered as dimensions of the vector space, and for weighting purposes.

Both functions have two basic input parameters: first, an input list that tells the system which text documents to process, and second, a configuration object that tells the system which methods to use in the individual steps.
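To make the notion of a word list more concrete, the following is a minimal sketch in plain Java that collects, for each term, the number of documents in which it appears. It does not use the WVTool API; the class name `WordListSketch`, the lower-casing and the splitting at non-letter characters are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the WVTool library.
final class WordListSketch {

    /** Maps each term to the number of documents in which it appears at least once. */
    static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : documents) {
            // Collect the distinct terms of this document so that each counts once.
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    seen.add(token);
                }
            }
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Apple builds a PC", "An apple a day", "PC sales");
        // Prints the terms (the dimensions of the vector space) with their document counts.
        System.out.println(documentFrequencies(docs));
    }
}
```

The keys of the resulting map play the role of the dimensions of the vector space, and the counts are the kind of statistic a weighting scheme can use later.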
- A URI to the text resource. Currently this can be a local file/directory or a URL. In the case of a directory, all files in this directory are processed (without recursing into subdirectories). As the WVTool is extendable, other types of file references could be used as well, as long as the user provides a method that handles them (see 2.3).
- The language the document is written in (optional).
- The type of the document (optional).
- The character encoding of the document, e.g. UTF-8 (optional).
- A class label. Texts can be assigned to classes, such as topics. This information is usually used for automatic text classification, but could be relevant for word vectorization as well. A class label index ranges from 0 to m-1, where m is the number of classes.
In the following example, an input list with three entries is created, two pointing to documents on the local file system and one pointing to a web page.

    // Initialize the input list with three classes
    WVTFileInputList list = new WVTFileInputList(3);

    // Add entries
    list.addEntry(
        new WVTDocumentInfo("data/alt.atheism",
                            "txt", "", "english", 0));
    list.addEntry(
        new WVTDocumentInfo("data/soc.religion.christian",
                            "txt", "", "english", 1));
    list.addEntry(
        new WVTDocumentInfo("http://www-ai.cs.uni-dortmund.de",
                            "html", "", "english", 2));
2.3 Configuration

The WVTool is written in a modular way to allow a maximum of flexibility and extendibility. The general idea is that vectorization and word list creation consist of a fixed sequence of steps. For every step in the vectorization process, the user states the Java class that should be used for this step. This class can be one already included in the tool or a new one written by the user. The only constraint is that it has to implement the corresponding interface of the given step. In the following, these steps are described in more detail together with the available Java implementations:
TextLoader
The TextLoader is responsible for opening a stream to the processed document. Currently, the system provides one loader capable of reading from local files and URLs. The corresponding class is called UniversalLoader and should be sufficient for most applications.

- UniversalLoader

Decoder
If the text is encoded/wrapped (e.g. in HTML code), it has to be decoded to plain text before vectorization. Currently, only plain text (no decoding necessary) and XML based markup languages (tags are ignored) are supported.

- SimpleTagIgnoringReader - a reader that ignores tags.
- XMLInputFilter - Parses the file and removes tags from it.
- TextInputFilter - Reads the file as a text file.
- PDFInputFilter - Extracts the text parts of a PDF file.
- SelectingInputFilter - Selects the input filter automatically, based on the file suffix (default).

All input filters except the PDFInputFilter evaluate the encoding information given for each entry in the input list. If no (legal) encoding is given, the system default is used. Note that currently the encoding cannot be determined automatically for XML and HTML files.
CodeMapper
In some cases the encoding of a text has to be mapped to another encoding. For instance, one might like to remove all the accents from a French text in this step. At the moment only a dummy class is available.

- DummyCharConverter
Tokenizer
The tokenizer splits the whole text into individual units. Tokenization is a non-trivial task in general, though for vectorization a simple heuristic is often sufficient. Currently, only one tokenizer is available, which uses the Unicode specification to decide whether a character is a letter. All non-letter characters are assumed to be separators, thus the resulting tokens contain only letters. Additionally, there is a tokenizer that creates character n-grams from given tokens.

- SimpleTokenizer - splits the text at all non-letter characters (default).
- NGramTokenizer - creates character n-grams from the given tokens.
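The two tokenization heuristics described above can be sketched in plain Java as follows. This is not the code of SimpleTokenizer or NGramTokenizer; the class and method names are hypothetical, and the n-gram method produces n-grams of one fixed length only, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the WVTool library.
final class TokenizerSketch {

    /** Splits the text into maximal runs of Unicode letters; everything else separates tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Returns all character n-grams of exactly length n (counted in UTF-16 chars). */
    static List<String> charNGrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world 42!"));
        System.out.println(charNGrams("word", 3));
    }
}
```

Because digits and punctuation are treated as separators, a string such as "42" produces no token at all, which matches the statement that the resulting tokens contain only letters.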
WordFilter
In this step, tokens that should not be considered for vectorization are filtered. These are usually tokens appearing very often (referred to as stopwords). Standard English and German stopword lists are included.

- a standard English stop word list (default).
- StopWordsWrapperGerman - a standard German stop word list.
Stemmer/Reducer
Often it is useful to map different grammatical forms of a word to a common term. At the moment the system incorporates several different stemming algorithms: a Porter stemmer, a Lovins stemmer, a German stemmer and the Snowball stemmer package (providing stemmers for different languages, see [4]). In addition, you can define your own dictionary or use the Wordnet thesaurus (see 4.2).

- LovinsStemmerWrapper - a Lovins stemmer (default).
- PorterStemmerWrapper - a Porter stemmer.
- SnowballStemmerWrapper - the Snowball stemmer package. You need to define the language of each text that is parsed, as the corresponding stemmer is chosen according to this information.
- ToLowerCaseConverter - converts all characters to lower case.
- DictionaryStemmer - uses a dictionary to map words to a base form (see 4.2.1 for more information).
- DummyStemmer - does not do anything.
- WordNetHypernymStemmer - uses Wordnet to replace a word by its hypernym (see 4.2.2 for more information).
- WordNetSynonymStemmer - uses Wordnet to replace a word by a representative of its synset (see 4.2.2 for more information).

VectorCreation
After the tokens have been counted, the actual vectors have to be created. There are different schemes for doing this. They are based on the following counts:

- $f_{ij}$: the number of occurrences of term $i$ in document $j$
- $fd_j$: the total number of terms occurring in document $j$
- $ft_i$: the total number of documents in which term $i$ appears at least once

Based on these counts, currently four classes are available that measure the importance of term $i$ for document $j$, denoted by $v_{ij}$. These include:

- TFIDF - the tf/idf measure, $v_{ij} = \frac{f_{ij}}{fd_j} \log\left(\frac{|D|}{ft_i}\right)$, where $|D|$ is the total number of documents.
- TermFrequency - the relative frequency of a term in a document, $v_{ij} = \frac{f_{ij}}{fd_j}$. The resulting vector for each document is normalized to the Euclidean unit length.
- TermOccurrences - the absolute number of occurrences of a term, $v_{ij} = f_{ij}$. The resulting vector is not normalized.
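As a worked illustration of these weighting schemes, the following plain Java sketch computes the relative term frequency, a tf/idf weight and the Euclidean normalization. It assumes the common tf/idf form v_ij = (f_ij / fd_j) * log(|D| / ft_i); the class `WeightingSketch` and its method names are hypothetical and are not the WVTool implementation.

```java
// Hypothetical illustration of the weighting formulas, not WVTool code.
final class WeightingSketch {

    /** Relative frequency f_ij / fd_j of a term in a document. */
    static double termFrequency(int fij, int fdj) {
        return fdj == 0 ? 0.0 : (double) fij / fdj;
    }

    /** tf/idf weight, assuming v_ij = (f_ij / fd_j) * log(|D| / ft_i). */
    static double tfidf(int fij, int fdj, int numDocs, int fti) {
        if (fdj == 0 || fti == 0) {
            return 0.0; // term or document without any counts
        }
        return termFrequency(fij, fdj) * Math.log((double) numDocs / fti);
    }

    /** Scales a vector to Euclidean unit length; a zero vector stays zero. */
    static double[] normalize(double[] v) {
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = norm == 0.0 ? 0.0 : v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // A term occurring 2 times in a 10-term document, found in 10 of 100 documents.
        System.out.println(tfidf(2, 10, 100, 10));
        System.out.println(java.util.Arrays.toString(normalize(new double[] {3.0, 4.0})));
    }
}
```

Note that under this form a term appearing in every document gets the weight zero, since log(|D| / ft_i) = log(1) = 0.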
Output
The output step determines where the resulting vectors are written to. Currently, only writing them to a file is supported. This step must be configured, as there is no default location for the vectors.
The operators in the Text plugin for RapidMiner allow you to specify which Java class to use for a given step by defining the single steps as inner operators. This can be done in a static way (the same Java class is used for each document) or dynamically (the Java class is chosen depending on properties of the document, such as the language or the encoding). The following are two examples. The first example sets the Java class for the output step in a static way.
    FileWriter outFile = new FileWriter("wv.txt");
    WordVectorWriter wvw = new WordVectorWriter(outFile, true);

    config.setConfigurationRule(WVTConfiguration.STEP_OUTPUT,
        new WVTConfigurationFact(wvw));

The second example selects the stemming algorithm dynamically, depending on the language the text document is written in:

    final WVTStemmer dummyStemmer = new DummyStemmer();
    final WVTStemmer porterStemmer = new PorterStemmerWrapper();

    config.setConfigurationRule(WVTConfiguration.STEP_STEMMER,
        new WVTConfigurationRule() {
            public Object getMatchingComponent(WVTDocumentInfo d)
                    throws Exception {
                // Use the Porter stemmer for English texts only
                if (d.getContentLanguage().equals("english"))
                    return porterStemmer;
                else
                    return dummyStemmer;
            }
        });

By writing your own classes (implementing the corresponding interface) you can use your own methods instead of the ones provided with the tool.
achieved by calling the word list creation function with a list of String values as in the following example (creating a word list with only two entries):
    // The terms that define the dimensions of the vector space
    List dimensions = new Vector();
    dimensions.add("apple");
    dimensions.add("pc");

    wordList = wvt.createWordList(list, config, dimensions, false);
The last parameter determines whether additional terms occurring in the texts should be added to the word list.
Chapter 3
The Word Vector Tool and RapidMiner
Instead of using the WVTool as a library, you can use it directly with the RapidMiner system (formerly YALE, see [3]). RapidMiner provides a GUI to specify the input and the configuration for vector creation. In the following, it is assumed that you are familiar with the basic concepts of the RapidMiner environment. Please note that the WVTool is available as part of the Text plugin of RapidMiner.
3.1 Installation
The WVTool Plugin is installed by downloading the Text plugin jar file from the RapidMiner homepage
for each term. The text collection must be specified in one of two ways:

1. If the parameter list texts is specified, each key-value pair must contain the class label and the directory which holds the texts. In this case, the entries in default_encoding, default_language and default_type are used for all input documents.
2. Otherwise the operator expects an ExampleSet in its input. Up to four regular attributes of this example set having special names and the label are evaluated (see 2.2):

   (a) document_source - A file, directory, or URL specifying a (set of) text(s)
   (b) type - The document type
   (c) encoding - The content encoding
   (d) language - The content language
   (e) the label attribute - The class label of the text(s)
up, use the load function to load your original word list. overwrite parameter is set.
the ones that are generated by the TextInput. All terms for which you already decided that they should or should not be in the word list are preserved. All new terms will be between these values in the list (sorted according to their weight). You can also use the combo box to choose which weights should be displayed. After you have finished, simply save the word list as described above.
Chapter 4
Advanced Topics
4.1 Web Crawling
The WVTool contains an interface to the WebSPHINX web crawler package [7]. This enables you to obtain word vectors from web content easily. The WebSPHINX package is very flexible and allows you to configure the behavior of the crawler in various ways. To use it with the WVTool, you must first create a subclass of the abstract class WVToolCrawler. The additional methods you must implement determine whether a link should be visited and whether a page should be processed by the WVTool. The following is an example.
    WVToolCrawler test = new WVToolCrawler() {

        // Decides whether a page is processed by the WVTool
        protected boolean vectorizePage(Page page) {
            String url = page.getURL().toExternalForm();
            return url.contains("PERSONAL")
                && url.contains("html")
                && (!url.contains("index"));
        }

        // Decides whether a link is followed
        public boolean shouldVisit(Link link) {
            return link.getPageURL().
                toExternalForm().contains("PERSONAL");
        }
    };

    URL start = new URL("http://www-ai.cs.uni-dortmund.de/PERSONAL");
The crawler only visits links that point to a URL containing the term PERSONAL. A page is processed if its URL contains PERSONAL and html but does not contain index. The crawler starts at a page provided by the addRoot method. Also, the maximal depth of the crawler is set to 2. There are many other possible checks in the WebSPHINX package, e.g. based on regular expressions. Refer to the javadoc of WebSPHINX for more information. Given the personalized web crawler, you need to create an input list based on this crawler using the following code:
You can now use this input list just like the file input list.

The crawler can also be invoked from RapidMiner. To do so, add the Crawler operator to your experiment. Using the parameter url, you may define at which url the crawler starts. The crawler policy allows you to state rules on whether the crawler should follow a link and on whether it should vectorize a page. The following conditions are possible:

- visit_url - A page is only visited if its url contains all terms stated in this parameter.
- [...] - A link is only followed if the target url contains all terms stated in this parameter.
- [...] - A link is only followed if the link text contains all terms stated in this parameter.
If several expressions are given for the same condition, they are treated as a disjunction. This allows you to express DNF expressions for each individual condition. Conditions of different types are combined by conjunction, i.e. all of them have to be fulfilled.

An expression is either a String or a regular expression. For regular expressions, the Java regular expression semantics are used.
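The combination logic described above (a disjunction of expressions within one condition, a conjunction across condition types) can be sketched in plain Java as follows. This is not the Crawler operator's implementation: the class `CrawlingRuleSketch` is hypothetical, expressions are modeled as plain lists of terms rather than regular expressions, and any condition name other than visit_url would be an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the rule semantics, not RapidMiner code.
final class CrawlingRuleSketch {

    /** True if the value contains every term of at least one expression (disjunction). */
    static boolean matchesAny(String value, List<List<String>> expressions) {
        for (List<String> terms : expressions) {
            boolean all = true;
            for (String term : terms) {
                if (!value.contains(term)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    /** All condition types must be fulfilled (conjunction); a missing value counts as "". */
    static boolean accept(Map<String, String> values, Map<String, List<List<String>>> rules) {
        for (Map.Entry<String, List<List<String>>> rule : rules.entrySet()) {
            String value = values.getOrDefault(rule.getKey(), "");
            if (!matchesAny(value, rule.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> rules = new HashMap<>();
        // Two expressions for the same condition: (PERSONAL and html) or (staff).
        rules.put("visit_url", Arrays.asList(
            Arrays.asList("PERSONAL", "html"), Arrays.asList("staff")));
        Map<String, String> page = new HashMap<>();
        page.put("visit_url", "http://example.org/PERSONAL/a.html");
        System.out.println(accept(page, rules));
    }
}
```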
against the fixed terms specified in the file. If there are different matches, the first one is used. If no match was found, the system checks the word against all regular expressions in the order in which they appear in the file. Again, the first match is used.
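The lookup order described here (fixed terms first, then regular expressions in file order, first match wins) might be sketched in plain Java like this. The class `DictionaryLookupSketch` is hypothetical and does not read the dictionary file format; returning the word unchanged when nothing matches is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of the lookup order, not the DictionaryStemmer code.
final class DictionaryLookupSketch {

    private final Map<String, String> fixedTerms = new LinkedHashMap<>();
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> patternTargets = new ArrayList<>();

    /** Registers a fixed term; if the term is added twice, the first entry wins. */
    void addFixed(String word, String base) {
        fixedTerms.putIfAbsent(word, base);
    }

    /** Registers a regular expression; patterns are tried in insertion order. */
    void addPattern(String regex, String base) {
        patterns.add(Pattern.compile(regex));
        patternTargets.add(base);
    }

    /** Fixed terms are checked first, then the patterns; unmatched words stay unchanged. */
    String lookup(String word) {
        String base = fixedTerms.get(word);
        if (base != null) {
            return base;
        }
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(word).matches()) {
                return patternTargets.get(i);
            }
        }
        return word;
    }

    public static void main(String[] args) {
        DictionaryLookupSketch dict = new DictionaryLookupSketch();
        dict.addFixed("mice", "mouse");
        dict.addPattern("colou?rs?", "color");
        System.out.println(dict.lookup("mice") + " " + dict.lookup("colours"));
    }
}
```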
of the given word. As the part of speech is usually not known, the Word Vector Tool tries to resolve it first as noun, then as verb, adjective and adverb. For the stemmer based on synonyms, the word is reduced to the first representative of the synset; for hypernym based stemming, it is reduced to the first hypernym of the synset.
It is matched against the input text and only the first match is returned. The replacement pattern specifies how the final term is derived from the matched expression. It should contain at least one expression of the form $<groupNr>, which is replaced by the corresponding matching group. In the simplest case, the replacement string is just $0, stating that the whole expression should be used. Example: If the document contains the text "Amount: 5", the expression "Amount: ([0-9]+)" with the replacement pattern "$1" would extract the value 5.

By default, structured information and word vectors are extracted. If you want to use only extracted attributes, specify a min_occurrences that is higher than the number of input documents, so that no word vectors are created. As an additional hint, you can use the preview function to interactively deploy your queries.
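The extraction semantics described above (first match only, replacement pattern with $<groupNr> references) correspond closely to what java.util.regex offers. The following is a minimal sketch under that assumption; the class `RegexExtractionSketch` and the method `extract` are hypothetical names, not the operator's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the FeatureExtraction implementation.
final class RegexExtractionSketch {

    /**
     * Finds the first match of regex in text and expands the replacement
     * pattern ($0, $1, ...) for that match. Returns null if nothing matches.
     */
    static String extract(String text, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        StringBuffer sb = new StringBuffer();
        // Appends the text before the match followed by the expanded replacement.
        m.appendReplacement(sb, replacement);
        // Drop the prefix so that only the expanded replacement remains.
        return sb.substring(m.start());
    }

    public static void main(String[] args) {
        System.out.println(extract("Amount: 5", "Amount: ([0-9]+)", "$1"));
    }
}
```

With the replacement pattern $0 the same call returns the whole matched expression, here "Amount: 5".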
Accessing Webservices
Many information sources on the web are available through a WebService API. The MashUp operator allows you to enrich an existing example set with additional attributes obtained from such a WebService. The most important parameter of this operator is url. In this parameter you specify the url under which the service can be accessed. Most importantly, this url may contain expressions of the form <<attribute>>. These expressions are replaced by the value of the attribute for each example in the example set. In this way, one query is sent to the WebService for each example in the example set. The result of each query is parsed, and the attributes specified in the parameter attributes are extracted and added to the example. The syntax for the extraction of attributes is the same as in the WVTool. Again, be careful about namespaces!

A special function of the MashUp operator is that it allows you to use the same query twice. In this case, the result of the query is tokenized using the delimiters defined in the parameter delimiters, and the tokens are assigned to the attributes using this query. This allows you to parse expressions like <position>12,4;34,3</position> into two attributes.
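The two mechanisms described above, filling <<attribute>> placeholders in the url and splitting one query result by delimiters, can be sketched in plain Java as follows. The class `MashUpSketch` is hypothetical; URL-encoding the inserted values and treating a missing attribute as an empty string are assumptions, not documented behavior of the operator.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the MashUp operator's code.
final class MashUpSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("<<([^<>]+)>>");

    /** Replaces every <<name>> in the template by the (URL-encoded) attribute value. */
    static String fillUrl(String template, Map<String, String> example) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = example.getOrDefault(m.group(1), "");
            String encoded;
            try {
                encoded = URLEncoder.encode(value, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always supported", e);
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(encoded));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /** Splits one extracted value into tokens, one per target attribute. */
    static List<String> splitByDelimiters(String value, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(value, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> example = new HashMap<>();
        example.put("city", "Dortmund");
        System.out.println(fillUrl("http://example.org/q?city=<<city>>", example));
        System.out.println(splitByDelimiters("12,4;34,3", ";"));
    }
}
```

With ";" as the only delimiter, the value 12,4;34,3 yields the two tokens 12,4 and 34,3, one for each attribute that uses the query.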
Chapter 5
Performance
The WVTool has been designed and optimized for flexibility and extendibility rather than for efficiency. Nevertheless, it is well suited for large text corpora in the sense that it keeps only the word list and the currently processed text document in main memory. To give you an idea of the actual processing speed of the Word Vector Tool, the following table shows the processing times for vectorizing the well known 20 newsgroups [6] data set, containing 20,000 news articles.

                            WVTool
    word list creation      138 s
    word vector creation    341 s
    both                    479 s

For these experiments an Intel P4 with 2.6 GHz was used. For vector creation, the word list was pruned to contain only words appearing between 4 and 300 times.
Chapter 6
Acknowledgements

I would like to thank Ingo Mierswa and Simon Fischer for the first version of the WVTool operator and the corresponding documentation, Stefan Haustein for the TagIgnoringReader, and the creators of the Snowball stemmer package [4], Wordnet, PDFBox, FontBox, the Java Wordnet Library and WebSPHINX for making their source code publicly available.
Bibliography
[1] G. Salton, A. Wong, C. S. Yang: A vector space model for automatic indexing, Commun. ACM, 18, p. 613-620, 1975.
[2] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval; paperback, 464 pages, Addison Wesley, 1999.
[3] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz and T. Euler: YALE: Rapid Prototyping for Complex Data Mining Tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06).
[4] http://snowball.tartarus.org/
[5] http://www.nzdl.org/Kea/
[6] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups (originally donated by T. Mitchell)
[7] http://www.cs.cmu.edu/~rcm/websphinx/
[8] http://jwordnet.sourceforge.net
[9] http://wordnet.princeton.edu
[10] A.K. McCallum: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow, 1996.
[11] H. Cunningham, K. Humphreys, Y. Wilks, R. Gaizauskas: Software Infrastructure for Natural Language Processing, Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97), 1997.
Chapter 7
Appendix A - Java Example
The following is a complete example of how to invoke the WVTool from Java.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.List;
    import java.util.Vector;

    import edu.udo.cs.wvtool.config.WVTConfiguration;
    import edu.udo.cs.wvtool.config.WVTConfigurationFact;
    import edu.udo.cs.wvtool.generic.output.WordVectorWriter;
    import edu.udo.cs.wvtool.generic.stemmer.DummyStemmer;
    import edu.udo.cs.wvtool.generic.vectorcreation.TFIDF;
    import edu.udo.cs.wvtool.generic.vectorcreation.TermOccurrences;
    import edu.udo.cs.wvtool.main.WVTDocumentInfo;
    import edu.udo.cs.wvtool.main.WVTFileInputList;
    import edu.udo.cs.wvtool.main.WVTInputList;
    import edu.udo.cs.wvtool.main.WVTWordVector;
    import edu.udo.cs.wvtool.main.WVTool;
    import edu.udo.cs.wvtool.wordlist.WVTWordList;

    /**
     * An example program on how to use the Word Vector Tool.
     *
     * @author Michael Wurst
     */
    public class WVToolExample {

        public static void main(String[] args) throws Exception {

            // EXAMPLE HOW TO CALL THE PROGRAM FROM JAVA

            // Initialize the WVTool
            WVTool wvt = new WVTool(true);

            // Initialize the configuration
            WVTConfiguration config = new WVTConfiguration();
            config.setConfigurationRule(WVTConfiguration.STEP_STEMMER,
                new WVTConfigurationFact(new DummyStemmer()));

            // Initialize the input list with two classes
            WVTFileInputList list = new WVTFileInputList(2);

            // Add entries
            list.addEntry(
                new WVTDocumentInfo("data/alt.atheism",
                                    "txt", "", "english", 0));
            list.addEntry(
                new WVTDocumentInfo("data/soc.religion.christian",
                                    "txt", "", "english", 1));

            // Generate the word list
            WVTWordList wordList = wvt.createWordList(list, config);

            // Prune the word list
            wordList.pruneByFrequency(2, 5);

            // Store the word list in a file
            wordList.storePlain(new FileWriter("wordlist.txt"));

            // Alternatively: read an already created word list from a file
            // WVTWordList wordList2 =
            //     new WVTWordList(
            //         new FileReader("/home/wurst/tmp/wordlisttest.txt"));

            // Create the word vectors

            // Set up an output filter (write sparse vectors to a file)
            FileWriter outFile = new FileWriter("wv.txt");
            WordVectorWriter wvw = new WordVectorWriter(outFile, true);

            config.setConfigurationRule(
                WVTConfiguration.STEP_OUTPUT,
                new WVTConfigurationFact(wvw));

            config.setConfigurationRule(WVTConfiguration.STEP_VECTOR_CREATION,
                new WVTConfigurationFact(new TFIDF()));

            // Create the vectors
            wvt.createVectors(list, config, wordList);

            // Alternatively: create word list and vectors together
            // wvt.createVectors(list, config);

            // Close the output file
            wvw.close();
            outFile.close();

            // Just for demonstration: Create a vector from a String
            WVTWordVector q = wvt.createVector("cmu harvard net", wordList);
        }
    }
Chapter 8
Appendix B - RapidMiner Text Plugin Operator Reference
This chapter describes the Word Vector operators of the RapidMiner Text plugin.
8.1 Text
This section describes the text related operators of the WVTool plugin.
8.1.1 Crawler
Group: IO.Web

Generated output: ExampleSet, NumericalMatrix

Parameters:
- url: Specifies the url at which the crawler should start (string)
- crawling_rules: Specifies a set of rules that determine which links to follow and which pages to process (see tutorial for details) (list)
- max_depth: Specifies the maximal depth of the crawling process (integer; 0-+∞; default: 2)
- delay: Specifies the delay when visiting a page in milliseconds (integer; 0-+∞; default: 1000)
- max_threads: Specifies the number of crawling threads working in parallel (integer; 1-+∞; default: 1)
- output_dir: Specifies the directory to which to write the files (filename)
- extension: Specifies the extension of the stored files (string; default: 'txt')
- max_page_size: Specifies the maximum page size (in KB): pages larger than this limit are not downloaded (integer; 1-+∞; default: 100)
- user_agent: The identity the crawler uses while accessing a server (string; default: 'rapid-miner-crawler')
- obey_robot_exclusion: Specifies whether the crawler obeys the rules stating which pages on a site might be visited by a robot. Disable only if you know what you are doing and if you are sure not to violate any existing laws by doing so (boolean; default: true)

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
Short description:
directory.
Description:
8.1.2 DictionaryStemmer
Group: IO.Text.Stemmer

Required input: TokenSequence

Generated output: TokenSequence

Parameters:
- file: File that contains the dictionary. See operator reference for the file format. (filename)

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.3 EnglishStopwordFilter
Group: IO.Text.Filter

Required input: TokenSequence

Generated output: TokenSequence

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.4 FeatureExtraction
Group: IO.Text

Generated output: ExampleSet

Parameters:
- preview: Shows a preview for the results which will be achieved by the current configuration.
- texts: Specifies a list of class/directory pairs. (list)
- default_content_type: The default content type if not specified by the example set (pdf, html, htm, xml, text, txt). (string; default: )
- default_content_encoding: The default content encoding if not specified by the example set (only encodings supported by Java can be used). (string; default: )
- default_content_language: The default content language if not specified by the example set. (string; default: )
- use_content_attributes: If set to true, the returned example set will con[...] (boolean; default: false)
- id_attribute_type: Indicates if long ids (complete paths), short ids (last part of the source name), or numerical ids will be used.
- attributes: Specifies a list of attribute names and extraction queries. These queries can be XPath or a regular expression. If a regular expression is used, it is given in the form '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. A number sign in front of an attribute name marks the attribute as numeric. In these cases, the operator uses different heuristics to parse a number from the extracted string. An ! in front of an attribute name marks it as binary. For both XPath and regex, only the first match is used. (list)
- namespaces: Specifies pairs of identifier and namespace for use in XPath [...] (list)
- extractor_class: [...]

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
Short description:
Description:
8.1.5 GermanStemmer
Group: IO.Text.Stemmer

Required input: TokenSequence

Generated output: TokenSequence

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.6 GermanStopwordFilter
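Group: IO.Text.Filter

Required input: TokenSequence

Generated output: TokenSequence

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.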
8.1.7 LogFileSource
Group: IO.Web

Generated output: ExampleSet

Parameters:
- config_file: the format configuration file (filename)
- log_dir: the directory containing the log files (filename)
- dns_lookup: Perform reverse dns lookup on the client ip (boolean; default: false)
- robot_filter: file that contains regular expressions on user agents that should be filtered out. Each line must contain exactly one regular expression. (string)
- filetype_filter: file that contains regular expressions on files that should be filtered out. Each line must contain exactly one regular expression. (string)
- only_HTTP_200: [...]
- browser_matcher: [...] types. Each line must contain exactly an expression of the form <name>:<regular expression>. (list)
- os_matcher: [...] Each line must contain exactly an expression of the form <name>:<regular expression>. (list)
- language_matcher: [...] languages. Each line must contain exactly an expression of the form <name>:<regular expression>. (list)
- session_timeout: [...] that the second request can be assumed to be a new session (integer; 0-+∞; default: 400000)

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.8 LovinsStemmer
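Group: IO.Text.Stemmer

Required input: TokenSequence

Generated output: TokenSequence

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.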
Short description:
Description:
8.1.9 MashUp
Group: IO.Web

Required input: ExampleSet

Parameters:
- attributes: Specifies a list of attribute names and extraction queries. These queries can be XPath or a regular expression. If a regular expression is used, it is given in the form '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. A number sign in front of an attribute name marks the attribute as numeric. In these cases, the operator uses different heuristics to parse a number from the extracted string. An ! in front of an attribute name marks it as binary. For both XPath and regex, only the first match is used. (list)
- namespaces: Specifies pairs of identifier and namespace for use in XPath [...] (list)
- url: [...] terms of the form <attributeName> that are replaced by the value of the corresponding attribute before invoking the query. (string)
- separators: [...] by XPath or regular expression. (string)
- delay: Amount of milliseconds to wait between requests (integer; 0-+∞; default: 0)

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
Short description:
source.
Description:
8.1.10 NGramTokenizer
Group: IO.Text.Tokenizer

Required input: TokenSequence

Generated output: TokenSequence

Parameters:
- length: The maximal length of the ngrams. (integer; 1-+∞; default: 3)
- keep_terms: Indicates if the original terms should be kept along with the [...]

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.11 PorterStemmer
Group: IO.Text.Stemmer

Required input: TokenSequence

Generated output: TokenSequence

Values:
- applycount: The number of times the operator was applied.
- looptime: The time elapsed since the current loop started.
- time: The time elapsed since this operator started.
8.1.12 Segmenter
Group:
IO.Text.Misc
Parameters:
preview:
Shows a preview for the results which will be achieved by the current configuration.
texts:
A directory containing the documents to be segmented (filename)
The content type of the input texts (txt, xml, html) (string)
The directory to which to write the segments (filename)
Specifies a regular expression or XPath expression that matches against substrings of the content which should be treated as individual segments. The syntax is the same as for attribute extraction (see WVTool operator), but instead of extracting only the first match, all matches are extracted and written to individual files (string)
ignore_cdata:
HTML (boolean; default: true)
namespaces:
Specifies pairs of identifier and namespace for use in XPath
h. (list)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
8.1.13 ServerLog2Transactions
Group:
IO.Web
Required input:
ExampleSet
Generated output:
ExampleSet
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
Short description:
actions
Description:
8.1.14 SingleTextInput
Group:
IO.Text
Generated output:
ExampleSet WordList
Parameters:
text:
The input text. (string)
default_content_type:
The default content type if not specified by the example set (possible values: pdf, html, htm, xml, text, txt). (string; default: )
default_content_encoding:
The default content encoding if not specified by the example set (only encodings supported by Java can be used). (string; default: )
default_content_language:
The default content language if not specified by the example set. (string; default: )
prune_below:
Prune words that appear in at most that many documents. -1 for no pruning. Alternatively you can provide a percentage value, denoting the lowest document frequency in p words with the highest frequency. (string; default: '-1')
prune_above:
-1 for no pruning. Alternatively you can provide a percentage value, denoting the highest document frequency in p words with the lowest frequency. (string; default: '-1')
vector_creation:
Method used to create word vectors
use_content_attributes:
If set to true, the returned example set will con-
false)
use_given_word_list:
used (boolean; default: false)
input_word_list:
Load a word list from this file instead of creating it from the input data. (filename)
return_word_list:
If checked the word list will be returned as part of the result. (boolean; default: false)
Save the used word list into this file. (filename)
Indicates if long ids (complete paths), short ids (last part of the source name), or numerical ids will be used.
Specifies pairs of identifier and namespace for use in XPath
text_query:
be used for vectorization. This query can be XPath or a regular expression. If a regular expression is used, the query must have the following form: '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. For both XPath and regular expressions, all matches are concatenated and then passed to the vectorization process. (string)
create_text_visualizer:
be created which can be used in plotters etc. Note: Text visualization does not work for id type number. (boolean; default: false)
on_the_fly_pruning:
0-+∞; default: -1)
Values:
Inner operators:
8.1.15 SnowballStemmer
Group:
IO.Text.Stemmer
Required input:
TokenSequence
Generated output:
TokenSequence
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
8.1.16 SplitSegmenter
Group:
IO.Text.Misc
Parameters:
preview:
Shows a preview for the results which will be achieved by the current configuration.
texts:
A directory containing the documents to be segmented (filename)
output:
The directory to which to write the segments (filename)
split_expression:
Specifies a regular expression or XPath expression that matches against substrings of the content which should be treated as individual segments. The syntax is the same as for attribute extraction (see WVTool operator), but instead of extracting only the first match, all matches are extracted and written to individual files (string)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
8.1.17 StopwordFilterFile
Group:
IO.Text.Filter
Required input:
TokenSequence
Generated output:
TokenSequence
Parameters:
file:
File that contains the stopwords one per line (filename)
case_sensitive:
Should words be matched case sensitive (boolean; default: false)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
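The effect of the two parameters can be sketched in Java as follows. The sketch takes the lines of the stopword file as a list, so that it stays independent of any file access (the lines could be read with java.nio.file.Files.readAllLines). Skipping blank lines and trimming whitespace are assumptions, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the RapidMiner Text plugin.
final class StopwordFilterSketch {

    // Removes every token that is listed as a stopword (one stopword per line).
    static List<String> filter(List<String> tokens, List<String> stopwordLines,
                               boolean caseSensitive) {
        Set<String> stopwords = new HashSet<>();
        for (String line : stopwordLines) {
            String word = line.trim();
            if (!word.isEmpty()) { // assumption: blank lines are ignored
                stopwords.add(caseSensitive ? word : word.toLowerCase(Locale.ROOT));
            }
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
            if (!stopwords.contains(key)) {
                kept.add(token); // the token itself is passed on unchanged
            }
        }
        return kept;
    }
}
```

With the stopwords "the" and "and", the tokens The, cat, and, the, hat are reduced to cat, hat when matching is case insensitive, and to The, cat, hat when it is case sensitive.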
Short description:
external file.
Description:
8.1.18 StringTextInput
Group:
IO.Text
Required input:
ExampleSet
Generated output:
ExampleSet WordList
Parameters:
filter_nominal_attributes:
Indicates if nominal attributes should also be filtered in addition to string attributes. (boolean; default: false)
remove_original_attributes:
string attributes should also be removed after the word vector creation. (boolean; default: false)
default_content_type:
The default content type if not specified by the example set (possible values: pdf, html, htm, xml, text, txt). (string; default: )
default_content_encoding:
The default content encoding if not specified by the example set (only encodings supported by Java can be used). (string; default: )
default_content_language:
The default content language if not specified by the example set. (string; default: )
prune_below:
Prune words that appear in at most that many documents. -1 for no pruning. Alternatively you can provide a percentage value, denoting the lowest document frequency in p words with the highest frequency. (string; default: '-1')
prune_above:
-1 for no pruning. Alternatively you can provide a percentage value, denoting the highest document frequency in p words with the lowest frequency. (string; default: '-1')
vector_creation:
Method used to create word vectors
use_content_attributes:
If set to true, the returned example set will con-
false)
use_given_word_list:
used (boolean; default: false)
input_word_list:
Load a word list from this file instead of creating it from the input data. (filename)
return_word_list:
If checked the word list will be returned as part of the result. (boolean; default: false)
Save the used word list into this file. (filename)
Indicates if long ids (complete paths), short ids (last part of the source name), or numerical ids will be used.
Specifies pairs of identifier and namespace for use in XPath
text_query:
be used for vectorization. This query can be XPath or a regular expression. If a regular expression is used, the query must have the following form: '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. For both XPath and regular expressions, all matches are concatenated and then passed to the vectorization process. (string)
create_text_visualizer:
be created which can be used in plotters etc. Note: Text visualization does not work for id type number. (boolean; default: false)
on_the_fly_pruning:
0-+∞; default: -1)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
Inner operators:
8.1.19 StringTokenizer
Group:
IO.Text.Tokenizer
Required input:
TokenSequence
Generated output:
TokenSequence
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
8.1.20 TagLogSource
Group:
IO.Web
Generated output:
ExampleSet
Parameters:
tag_logfile:
the tag log file (filename)
of occurrences of a tag to be consid-
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
Short description:
Description:
8.1.21 TermNGramGenerator
Group:
IO.Text.Tokenizer
Required input:
TokenSequence
Generated output:
TokenSequence
Parameters:
max_length:
The maximal length of the ngrams. (integer; 1-+∞; default: 2)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
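Term n-grams up to a maximal length can be sketched in Java as follows. The sketch assumes that all n-grams from length 1 up to max_length are generated and that the terms of an n-gram are joined with an underscore; neither detail is stated above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the RapidMiner Text plugin.
final class TermNGramSketch {

    // For every start position, emits the n-grams of length 1..maxLength
    // that fit into the token sequence.
    static List<String> termNGrams(List<String> tokens, int maxLength) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = 1; n <= maxLength && start + n <= tokens.size(); n++) {
                // Assumption: "_" is used as separator between the terms.
                grams.add(String.join("_", tokens.subList(start, start + n)));
            }
        }
        return grams;
    }
}
```

For the tokens word, vector, tool and a maximal length of 2, the sketch yields word, word_vector, vector, vector_tool, tool.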
8.1.22 TextInput
Group:
IO.Text
Generated output:
ExampleSet WordList
Parameters:
texts:
Specifies a list of class/directory pairs. (list)
default_content_type:
The default content type if not specified by the example set (possible values: pdf, html, htm, xml, text, txt). (string; default: )
default_content_encoding:
The default content encoding if not specified by the example set (only encodings supported by Java can be used). (string; default: )
default_content_language:
The default content language if not specified by the example set. (string; default: )
prune_below:
Prune words that appear in at most that many documents. -1 for no pruning. Alternatively you can provide a percentage value, denoting the lowest document frequency in p words with the highest frequency. (string; default: '-1')
prune_above:
-1 for no pruning. Alternatively you can provide a percentage value, denoting the highest document frequency in p words with the lowest frequency. (string; default: '-1')
vector_creation:
Method used to create word vectors
use_content_attributes:
If set to true, the returned example set will con-
false)
use_given_word_list:
used (boolean; default: false)
input_word_list:
Load a word list from this file instead of creating it from the input data. (filename)
return_word_list:
If checked the word list will be returned as part of the result. (boolean; default: false)
Save the used word list into this file. (filename)
Indicates if long ids (complete paths), short ids (last part of the source name), or numerical ids will be used.
Specifies pairs of identifier and namespace for use in XPath
text_query:
be used for vectorization. This query can be XPath or a regular expression. If a regular expression is used, the query must have the following form: '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. For both XPath and regular expressions, all matches are concatenated and then passed to the vectorization process. (string)
create_text_visualizer:
be created which can be used in plotters etc. Note: Text visualization does not work for id type number. (boolean; default: false)
on_the_fly_pruning:
0-+∞; default: -1)
extend_exampleset:
If true, an input example set is not only used to specify the documents that should be vectorized, but this example set is merged with the vectors. Note that this works only with nominal ids! (boolean; default: false)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
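Pruning with absolute document counts can be sketched in Java as follows. Only the prune_below rule (words that appear in at most that many documents are removed, -1 disables pruning) is taken from the parameter description. The prune_above rule in the sketch, which removes words that appear in at least that many documents, is an assumption made by symmetry, and percentage values are not covered. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the RapidMiner Text plugin.
final class PruningSketch {

    // Returns the words that survive pruning by document frequency.
    // Each document is given as its list of tokens; -1 disables a bound.
    static Set<String> prunedVocabulary(List<List<String>> documents,
                                        int pruneBelow, int pruneAbove) {
        // Document frequency: number of documents that contain the word.
        Map<String, Integer> documentFrequency = new TreeMap<>();
        for (List<String> document : documents) {
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            int count = entry.getValue();
            boolean tooRare = pruneBelow != -1 && count <= pruneBelow;
            // Assumption: prune_above mirrors prune_below ("at least that many").
            boolean tooFrequent = pruneAbove != -1 && count >= pruneAbove;
            if (!tooRare && !tooFrequent) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

For the three documents (a, b, c), (a, b) and (a, d, d), the document frequencies are a=3, b=2, c=1, d=1. With prune_below set to 1 and prune_above set to 3, only b remains.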
8.1.23 TextObjectTextInput
Group:
IO.Text
Generated output:
ExampleSet WordList
Parameters:
default_content_type:
The default content type if not specified by the example set (possible values: pdf, html, htm, xml, text, txt). (string; default: )
default_content_encoding:
The default content encoding if not specified by the example set (only encodings supported by Java can be used). (string; default: )
default_content_language:
The default content language if not specified by the example set. (string; default: )
prune_below:
Prune words that appear in at most that many documents. -1 for no pruning. Alternatively you can provide a percentage value, denoting the lowest document frequency in p words with the highest frequency. (string; default: '-1')
prune_above:
-1 for no pruning. Alternatively you can provide a percentage value, denoting the highest document frequency in p words with the lowest frequency. (string; default: '-1')
vector_creation:
Method used to create word vectors
use_content_attributes:
If set to true, the returned example set will con-
false)
use_given_word_list:
used (boolean; default: false)
input_word_list:
Load a word list from this file instead of creating it from the input data. (filename)
return_word_list:
If checked the word list will be returned as part of the result. (boolean; default: false)
Save the used word list into this file. (filename)
Indicates if long ids (complete paths), short ids (last part of the source name), or numerical ids will be used.
Specifies pairs of identifier and namespace for use in XPath
text_query:
be used for vectorization. This query can be XPath or a regular expression. If a regular expression is used, the query must have the following form: '<regex-expression> <replacement-pattern>', where the <replacement-pattern> states how a match is replaced to generate the final information. '$1' would yield the first matching group as result. For both XPath and regular expressions, all matches are concatenated and then passed to the vectorization process. (string)
create_text_visualizer:
be created which can be used in plotters etc. Note: Text visualization does not work for id type number. (boolean; default: false)
on_the_fly_pruning:
0-+∞; default: -1)
Values:
8.1.24 ToLowerCaseConverter
Group:
IO.Text.Stemmer
Required input:
TokenSequence
Generated output:
TokenSequence
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
8.1.25 TokenLengthFilter
Group:
IO.Text.Filter
Required input:
TokenSequence
Generated output:
TokenSequence
Parameters:
min_chars:
The minimal number of characters that a token must contain to be considered. (integer; 0-+∞; default: 4)
max_chars:
The maximal number of characters that a token must contain
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
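The length filter can be sketched in Java as follows. The sketch assumes that both bounds are inclusive and that the length is counted in Java characters; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the RapidMiner Text plugin.
final class TokenLengthFilterSketch {

    // Keeps the tokens whose length lies between minChars and maxChars.
    // Assumption: both bounds are inclusive.
    static List<String> filterByLength(List<String> tokens, int minChars, int maxChars) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int length = token.length();
            if (length >= minChars && length <= maxChars) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With min_chars set to 4 and max_chars set to 10, the tokens a, text, vectorization, tool are reduced to text, tool.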
Short description:
they must contain.
Description:
8.1.26 TokenReplace
Group:
IO.Text.Transformer
Required input:
TokenSequence
Generated output:
TokenSequence
Parameters:
replace_dictionary:
Defines the replacements. (list)
Values:
applycount:
The number of times the operator was applied.
looptime:
time:
Short description:
Description: