Wachsmuth 2015
(preprint)
A dissertation presented by
Henning Wachsmuth
to the
Faculty of Computer Science, Electrical Engineering, and Mathematics
of the
University of Paderborn
Paderborn, Germany
January 2015
Dissertation
Pipelines for Ad-hoc Large-scale Text Mining
Henning Wachsmuth, University of Paderborn
Paderborn, Germany, 2015
Reviewers
Prof. Dr. Gregor Engels, University of Paderborn
Prof. Dr. Benno Stein, Bauhaus-Universität Weimar
Dr. Bernd Bohnet, Google London
Doctoral Committee
Prof. Dr. Gregor Engels, University of Paderborn
Prof. Dr. Benno Stein, Bauhaus-Universität Weimar
Dr. Bernd Bohnet, Google London
Prof. Dr. Hans Kleine Büning, University of Paderborn
Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn
To Max, my son.
Acknowledgments
The findings of this thesis should not be attributed to a single person. Al-
though I wrote the thesis on my own at the Database and Information Sys-
tems group and the Software Quality Lab of the University of Paderborn,
many people have worked together with me or have helped me in other
important respects during my PhD time.
First, I'd like to thank both my advisor, Gregor Engels, and my co-advisor,
Benno Stein, for supporting me throughout the whole time and for giving
me the feeling that my research is worth doing. Gregor, I express to you my
deep gratitude for letting me take my own path while keeping me in the
right direction. You showed me what a dissertation really means and you
continuously challenged me by finding even the smallest fallacy in my argu-
mentation. Benno, thank you so much for teaching me what science is all
about and how thoroughly I have to work for that. You improved my ability
to get the best out of me and you let me experience that the best ideas emerge
from collaboration. I'd like to thank Bernd Bohnet for such collaboration
and for directly saying yes when I asked you to be the third reviewer of
this thesis. Similarly, I thank Hans Kleine Büning and Friedhelm Meyer auf
der Heide for serving as members of my doctoral committee.
As this thesis reuses content of a number of conference publications, I'd
like to thank all my co-authors not named so far. In chronological order, these
are Peter Prettenhofer, Kathrin Bujna, Mirko Rose, Tsvetomira Palakarska,
and Martin Trenkmann. Thank you for your great work. Without you, parts
of this thesis wouldn't exist. Some parts have also profited from the effort of
students who worked at our lab, wrote their thesis under my supervision,
or participated in our project group ID|SE. Special thanks go to Joachim
Kühring and Steffen Beringer. Moreover, some results presented here are
based on work of companies we cooperated with in two research projects.
I want to thank Dennis Hannwacker in this regard, but also the other em-
ployees of Resolto Informatik and Digital Collections.
The mentioned projects, including my position, were funded by the Ger-
man Federal Ministry of Education and Research (BMBF), for which I'm
very grateful. For related reasons, I want to thank the company HRS, the
German Federal Ministry for Economic Affairs and Energy (BMWi), and
my employer, the University of Paderborn, in general.
I had a great time at the university, first and foremost because of my col-
leagues. Besides those named above, I say thank you to Fabian Christ and
Benjamin Nagel for all the fun discussions, for tolerating my habits, and for
becoming friends. The same holds for Christian Soltenborn and Christian
Gerth, who I particularly thank for making me confident about my research.
Further thanks go to Jan Bals, Markus Luckey, and Yavuz Sancar for exciting
foosball table matches, to the brave soccer team AG Engels, and to the rest
of our group. I express my gratitude to Friedhelm Wegener for constant and
patient technical help as well as to Stefan Sauer for managing all the official
matters, actively supported by Sonja Saage and Beatrix Wiechers.
Especially, I am grateful to Theo Lettmann for pushing me to apply for
my position, for guiding me in the early days of my PhD, and for giving me
advice whenever needed without measurable benefit for himself. Thanks to
the Webis group in Weimar and to the people at our university who made all
my conference attendances possible. Because of you, I could enter the com-
putational linguistics community, which I appreciate so much, and make
new friends around the world. I'd like to point out Julian Brooke, with whom
I enjoyed discussing research and life every year at a biannual conference,
as well as Alberto Barrón, who opened my eyes to my limited understand-
ing of the world with sincerity and a dry sense of humor.
Aside from my professional life, I deeply thank all my dear friends for so
many great experiences and fun memories, for unconditionally accepting
how I am, and for giving me relief from my everyday life. I'd like to name
Dirk, Kathrin, Sebastian, Stephan, and Tim here, but many more current
and former Bielefelders and Paderborners influenced me, including but not
limited to those from the HG Kurzfilme and the Projektbereich Eine Welt.
I thank my parents for always loving and supporting me and for being the
best role models I can imagine. Ipke, the long time you spent on the thorough
corrections and your encouraging feedback helped me more than I can tell.
Thanks also to the rest of my family for being such a wonderful family.
Finally, my greatest thanks go to my own small family, Katrin and Max,
for letting me experience that there are much more important things in life
than work, for giving me the chance to learn to be some sort of father,
for giving me a home throughout my PhD, for accepting my long working
hours, and for all the love and care I felt. Katrin, I'm sure you will eventually
find a profession that makes you as happy as research makes me. And Max,
I hope that the effort I put into this thesis and my excitement for learning
new stuff will give you inspiration for your life.
The basic notations used in this thesis are listed in the following. Specific
forms and variations of these notations are introduced where needed and
are marked explicitly with corresponding indices or similar.
Analysis
A    A text analysis algorithm.
𝒜    A set or a repository of text analysis algorithms.
Text
d    A portion or a unit of a text.
D    A text.
𝒟    A collection or a stream of texts.
Information
c    A piece of information, such as a class, an entity, a relation, etc.
C    An information type or a set of pieces of information.
𝒞    A set of information types or a specification of an information need.
Task
A query, which specifies a combination of information needs.
A scoped query, i.e., a query with assigned degrees of filtering.
The dependency graph of a scoped query.
Quality
q    A quality value or estimation.
q    A vector of quality estimations.
Q    A quality criterion.
𝒬    A set of quality criteria.
Measures
t    A run-time, possibly averaged over a certain unit of text.
a    An accuracy value.
p    A precision value.
r    A recall value.
f1   An F1-score.
3 Pipeline Design 71
3.1 Ideal Construction and Execution for Ad-hoc Text Mining . . 72
3.2 A Process-oriented View of Text Analysis . . . . . . . . . . . 85
3.3 Ad-hoc Construction via Partial Order Planning . . . . . . . 94
3.4 An Information-oriented View of Text Analysis . . . . . . . . 111
3.5 Optimal Execution via Truth Maintenance . . . . . . . . . . . 120
3.6 Trading Efficiency for Effectiveness in Ad-hoc Text Mining . 138
6 Conclusion 259
6.1 Contributions and Open Problems . . . . . . . . . . . . . . . 260
6.2 Implications and Outlook . . . . . . . . . . . . . . . . . . . . 264
B Software 283
B.1 An Expert System for Ad-hoc Pipeline Construction . . . . . 283
B.2 A Software Framework for Optimal Pipeline Execution . . . 288
B.3 A Web Application for Sentiment Scoring and Explanation . 290
B.4 Source Code of All Experiments and Case Studies . . . . . . 293
References 313
In turning from the smaller instruments in frequent use to
the larger and more important machines, the economy aris-
ing from the increase of velocity becomes more striking.
Charles Babbage
1 Introduction
1.1 Information Search in Times of Big Data
Figure 1.1: Screenshot of Pentaho Big Data Analytics as an example of enterprise
software. The shown heat grid visualizes the vehicle sales of a company.
deals, at its heart, with indexing and searching unstructured texts, data min-
ing targets the discovery of patterns in structured data. Natural language
processing, finally, is concerned with algorithms and engineering issues
for understanding and generating speech and human-readable text (Tsujii,
2011). It bridges the gap between the other fields by converting unstruc-
tured into structured information. Text mining is studied within the broad
interdisciplinary field of computational linguistics, as it addresses computa-
tional approaches from computer science to the processing of data and in-
formation while operationalizing findings from linguistics.
According to Sarawagi (2008), the most important text mining techniques
for identifying and filtering relevant texts and information within the three
fields refer to the areas of information extraction and text classification. The
former aims at extracting entities, relations between entities, and events the
entities participate in from mostly unstructured text. The latter denotes the
task of assigning a text to one or more predefined categories, such as topics,
genres, or sentiment polarities. Information extraction, text classification,
and similar tasks are considered in both natural language processing and
information retrieval. In this thesis, we summarize these tasks under the
term text analysis. All text analyses have in common that they can signifi-
cantly increase the velocity of information search in many situations.
In our research project InfexBA [2], for instance, we developed algorithms
for a fast extraction and aggregation of financial forecast events from online
news articles, thereby supporting strategic business decision making. Such
events describe financial developments of organizations and markets over
time. They may have an author, a date, and the like. Entities and events of
the implied types can be found in texts like "If Apple does end up launching a
television in the next two years as has been rumored, it could help Apple's annual
revenue skyrocket to $400 billion by 2015, according to Morgan Stanley analyst
Katy Huberty." [3] In contrast, the goal of our research project ArguAna [4] was
to classify and summarize opinions on products and their features found
in large numbers of review texts. To this end, we analyzed the sequence of
local sentiment on certain product features found in each of the reviews in
order to account for the argumentation of texts.
Of course, major search engines already use text analysis when address-
ing information needs (Pasca, 2011). E.g., a Google search in late 2014 for
[2] InfexBA: Information Extraction Technologies for Business Applications, funded by the
German Federal Ministry of Education and Research (BMBF), http://infexba.upb.de
[3] Taken from Business Insider, http://www.businessinsider.com/how-apples-annual-
revenue-could-hit-400-billion-by-2015-2012-5, accessed on December 8, 2014.
[4] ArguAna: Argumentation Analysis in Customer Opinion Mining, also funded by the
BMBF, http://www.arguana.com. Details on both research projects are given in Section 2.3.
1.2 A Need for Efficient and Robust Text Analysis Pipelines
Figure 1.2: Google result page for the query "Charles Babbage", showing an exam-
ple of directly providing relevant information instead of returning only web links.
"Charles Babbage", the author of this chapter's introductory quote, led to the
results in Figure 1.2, which convey that Babbage was recognized as a person
entity with a number of attributes and related entities.[5] The exact extent to
which today's search engines perform text analyses is hard to guess, since
they also rely on knowledge bases like Freebase.[6] However, the presentation
of analysis results as answers is currently restricted to some frequent entity
types, such as persons or locations. Correspondingly, the few text-based
applications for analyzing big data focus on predefined information needs
of wide interest. E.g., Appinions continuously mines and aggregates opin-
ions on specified topics for monitoring purposes.[7] But, in accordance with
the quote of Babbage, the benefit of text mining arising from the increase of
velocity becomes more striking when turning from predefined text analyses
in frequent use to arbitrary and more complex text analysis processes.
Figure 1.3: A text analysis pipeline Π = ⟨A, π⟩ with algorithm set A = {A1, ..., Am}
and schedule π. Each text analysis algorithm Ai ∈ A takes a text and information
of certain types as input and provides information of certain types as output.
lies on so-called features (Manning et al., 2008) that are derived from lexical
and syntactic annotations or even from entities like in ArguAna.
To realize the steps of a text analysis process, text analysis algorithms are
employed that annotate new information types in a text or that classify, re-
late, normalize, or filter previously annotated information. Such algorithms
perform analyses of different computational cost, ranging from the typically
cheap evaluation of single rules and regular expressions, over the matching
of lexicon terms and the statistical classification of text fragments, to com-
plex syntactic analyses like dependency parsing (Bohnet, 2010). Because of
the interdependencies between analysis steps, the standard way to realize a
text analysis process is in the form of a text analysis pipeline, which sequen-
tially applies each employed text analysis algorithm to its input.
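To make this concrete, the sequential application of algorithms over a shared set of annotations can be sketched as follows. This is an illustrative Python toy, not the thesis' implementation; the two algorithm functions and the annotation dictionary are made-up names.

```python
# Toy sketch of a text analysis pipeline: each algorithm reads the text
# and previously produced annotations, and adds its own output types.
def tokenize(text, annotations):
    # Produces the information type "tokens".
    annotations["tokens"] = text.split()

def tag_numbers(text, annotations):
    # Depends on "tokens"; produces the information type "numbers".
    annotations["numbers"] = [t for t in annotations["tokens"] if t.isdigit()]

def run_pipeline(text, schedule):
    # The schedule is simply the order of the algorithm list.
    annotations = {}
    for algorithm in schedule:
        algorithm(text, annotations)
    return annotations

result = run_pipeline("revenue may reach 400 billion by 2015",
                      [tokenize, tag_numbers])
print(result["numbers"])  # ['400', '2015']
```

Note how the interdependency (tag_numbers requires tokens) forces the schedule: reversing the list would fail, which is exactly why pipeline design involves scheduling.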
Figure 1.4: The basic text analysis scenario considered in this thesis: A text analysis
pipeline Π = ⟨A, π⟩ composes a subset A of the set of all available text analysis
algorithms {A1, ..., An} in order to infer output information of a structured set of
information types 𝒞 from a collection or a stream of input texts 𝒟.
numbers of texts. Both of them deal with ad-hoc information needs, i.e., infor-
mation needs that are stated ad-hoc and are, thus, unknown beforehand.
We argue that the process complexity and task specificity outlined above
prevent such ad-hoc large-scale text mining today due to three problems:
First, the design of text analysis pipelines in terms of selecting and sche-
duling algorithms for the information needs at hand and the input texts
to be processed is traditionally done manually, because it requires human
expert knowledge about the functionalities and interdependencies of the
algorithms (Wachsmuth et al., 2013a). If information needs are stated ad-
hoc, the design of pipelines also has to be done ad-hoc, which takes time in
case of manual construction, even if proper tool support is given (Kano et al.,
2010). Hence, text mining currently cannot be performed immediately.
Second, the run-time efficiency of traditionally executing text analysis
pipelines is low, because computationally expensive analyses are performed
on the whole input texts (Sarawagi, 2008). Techniques are missing that
identify those portions of the input texts which contain information relevant
to the information need at hand and that restrict expensive analyses to these
portions. Different texts vary in the distribution of relevant information,
which additionally makes these techniques input-dependent (Wachsmuth
and Stein, 2012). While a common approach to avoid efficiency problems
is to analyze input texts in advance when they are indexed (Cafarella et al.,
2005), this is not feasible for ad-hoc information needs. Also, the applica-
tion of faster algorithms (Pantel et al., 2004) seems critical because it mostly
comes at the price of reduced effectiveness. Hence, ad-hoc text mining
currently cannot be performed on large numbers of texts.
Third, text analysis pipelines tend not to infer high-quality information
with high robustness, because the employed algorithms traditionally rely on
features of input texts that depend on the domains of the texts (Blitzer
et al., 2007). The applications we target, however, may process texts from
arbitrary domains, such that pipelines will often fail to infer information ef-
fectively. An approach to still achieve user acceptance under limited effec-
tiveness is to explain how information was inferred (Li et al., 2012b), but this
is difficult for pipelines, as they realize a process with several complex and
uncertain decisions about natural language (Das Sarma et al., 2011). Hence,
text mining cannot generally guarantee high quality.
though, it suffices to follow the simple view that knowledge is an interpreta-
tion of data that is assumed to be true irrespective of the context (e.g., Apple
is a company), whereas information is data that has been given meaning by
a particular context (e.g., in this text, the term Apple denotes a company).
In this regard, knowledge can be understood as specified beforehand, while
information is inferred during processing. Now, when speaking of knowl-
edge about a text analysis process, we basically mean two kinds:
1. Knowledge about the text analysis task to be addressed, namely, the
information need at hand, expected properties of the input texts to be
processed, as well as efficiency and effectiveness criteria to be met.
2. Knowledge about the text analysis algorithms to be employed, na-
mely, their input and output information types, restrictions of their
applicability, and their expected efficiency and effectiveness.
Similarly, we distinguish the following three kinds of information obtained
within the process:
1. Information about the processed input texts, namely, their concretely
observed properties, especially domain-independent properties.
2. Information about the produced output, namely, the occurrence and
distribution of the different types of information in the input texts.
3. Information about the executed text analysis pipelines, namely, the
schedule of the employed text analysis algorithms as well as their
achieved efficiency and effectiveness as far as observable.
The exploitation of knowledge and information for automatically solving
problems is closely related to the field of artificial intelligence. Artificial in-
telligence describes the ability of software and machines to think and act ra-
tionally or human-like. Such a system aims to find efficient solutions
to problems based on operationalized expert knowledge and a perception
of its environment (Russell and Norvig, 2009). While text mining itself can
be viewed as a subfield of artificial intelligence, we here develop approaches
that use classical and statistical artificial intelligence techniques like plan-
ning, reasoning, and machine learning to improve traditional approaches
to text mining using the stated kinds of knowledge and information.
Our goal is to provide widely applicable approaches to the three require-
ments outlined at the end of Section 1.2, which is why we also cover aspects
of software engineering, such as the modeling of domains and the scaling of
methods. To investigate the defined research question, we evaluate all ap-
proaches with respect to these requirements. Some properties are proven
formally, whereas in all other cases we implement the approaches as open-
source software.
1.3 Towards Intelligent Pipeline Design and Execution
Figure 1.5: Our overall approach to enable ad-hoc large-scale text mining: Each
algorithm Ai in the automatically constructed ad-hoc large-scale text analysis pipeline
Π* = ⟨A*, π*⟩ gets only those portions of text from the input control that its output
is relevant for. The schedule π* is optimized in terms of efficiency, while the
effectiveness of Π* is improved through an overall analysis that produces the final
output information.
Figure 1.6: The three high-level contributions of the thesis: We present approaches
(1) to automatically design text analysis pipelines that optimally process input texts
ad-hoc, (2) to optimize the run-time efficiency of pipelines on all input texts, and
(3) to improve the robustness of pipelines on input texts from different domains.
Figure 1.7: The structure of this thesis according to the three main contributions,
showing short names of all sections of the three main chapters.
relevant for our purposes (Section 2.1). We point out the importance of text
analysis processes and their realization through pipelines in Section 2.2,
while case studies that we resort to in examples and experiments follow in
Section 2.3. Section 2.4 then summarizes the state of the art.
As Figure 1.7 shows, Chapter 3 deals with the automation of pipeline
design. In Section 3.1, we present paradigms of an ideal pipeline construc-
tion and execution. On this basis, we formalize key concepts from Sec-
tion 2.2 in a process-oriented view of text analysis (Section 3.2) and then
address ad-hoc pipeline construction (Section 3.3). Afterwards, we develop
an information-oriented view of text analysis (Section 3.4), which can be
operationalized to achieve an optimal execution (Section 3.5). This view
provides new ways of trading efficiency for effectiveness (Section 3.6).
Next, we optimize pipeline efficiency in Chapter 4, starting with a formal
solution to the optimal scheduling of text analysis algorithms (Section 4.1).
We analyze the impact of the distribution of relevant information in Sec-
tion 4.2, followed by our approach to optimized scheduling (Section 4.3)
that also requires the filtering view from Section 3.4. An analysis of the
heterogeneity of texts (Section 4.4) then motivates the need for adaptive
scheduling, which we approach in Section 4.5. Scheduling has implications
for pipeline parallelization, as we discuss in Section 4.6.
In Chapter 5, finally, we present a novel approach for improving pipeline
robustness, which refers to an ideal domain independence (Section 5.1).
We model text analysis from a structure-oriented viewpoint (Section 5.2),
which emphasizes the impact of the overall structure in the classification of
argumentative texts (Section 5.3). The model forms the basis for our overall
analysis in Section 5.4 and it can also be exploited for an explanation of
pipeline results (Section 5.5).
We conclude the thesis in Chapter 6 with a summary of our contributions
and of remaining problems (Section 6.1). As a closing step, we give an out-
look and we sketch implications for the given and other areas of computer
science (Section 6.2). Information on all text analysis algorithms, software,
and text corpora referred to in this thesis is found in Appendices A to C.
Table 1.1: Overview of the peer-reviewed publications this thesis is based on. For
each publication, the venue, the type, and the number of pages are given as well as
the main sections of this thesis in which content of the publication is reused.
Table 1.2: Overview of the student theses written in the context of this thesis. For
each student thesis, its type and its connection to this thesis are given as well as
the main sections of this thesis to which its content has contributed.
Table 1.1 lists the reference of each paper together with the short name of the
respective conference, the type and length of the publication, and the main
sections of this thesis in which content of the paper appears.
Moreover, some parts of this thesis integrate content of student theses.
Among these, the most noteworthy refers to Rose (2012), whose results have
significantly influenced the approach to ad-hoc pipeline construction pre-
sented in Section 3.3. For completeness, Table 1.2 lists all student theses
written in the context of this thesis. As shown, some of them tackle only
work related to this thesis, though. Also, some of those that are classified
as on-topic have contributed to this thesis to a limited degree only.
The exact reuse of content from the listed papers and theses is outlined
in the respective sections. In all cases, this thesis provides many new de-
tails and more comprehensive information on the discussed concepts. In
addition, extended evaluations and tool descriptions are given for most ap-
proaches. Besides, some parts of this thesis represent original contributions
that have not been published before, as pointed out where applicable.
I put my heart and my soul into my work, and have lost my
mind in the process.
Vincent van Gogh
2 Text Analysis Pipelines

2.1 Foundations of Text Mining
Text Mining
Text mining deals with the automatic or semi-automatic discovery of new,
previously unknown information of high quality from large numbers of
unstructured texts (Hearst, 1999). Contrary to what is sometimes assumed, the
types of information to be inferred from the texts are usually specified man-
ually beforehand, i.e., text mining tackles given tasks. As introduced in Sec-
tion 1.1, this commonly requires performing three steps in sequence, each of
which can be associated with one field (Ananiadou and McNaught, 2005):
1. Information retrieval. Gather input texts that are potentially relevant
for the given task.
2. Natural language processing. Analyze the input texts in order to iden-
tify and structure relevant information.[2]
3. Data mining. Discover patterns in the structured information that has
been inferred from the texts.
Hearst (1999) points out that the main aspects of text mining are actually
the same as those studied in empirical computational linguistics. Although
focusing on natural language processing, some of the problems computa-
tional linguistics is concerned with are also addressed in information re-
trieval and data mining, such as text classification or machine learning. In
this thesis, we refer to all these aspects with the general term text analy-
sis (cf. Section 1.1). In the following, we look at the concepts of the three
fields that are important for our discussion of text analysis.
[1] Notice that, throughout this thesis, we generally assume that the reader has a more or
less graduate-level background in computer science.
[2] Ananiadou and McNaught (2005) refer to the second step as information extraction.
While we agree that information extraction is often the most important part of this step,
other techniques from natural language processing also play a role, as discussed later in
this section.
Information Retrieval
Following Manning et al. (2008), the primary use case of information re-
trieval is to search and obtain those texts from a large collection of unstruc-
tured texts that can satisfy an information need, usually given in the form of
a query. In ad-hoc web search, such a query consists of a few keywords, but,
in general, it may also be given by a whole text, a logical expression, etc. An
information retrieval application assesses the relevance of all texts with re-
spect to a query based on some similarity measure. Afterwards, it ranks the
texts by decreasing relevance or it filters only those texts that are classified
as potentially relevant (Manning et al., 2008).
Although the improvement of ad-hoc search denotes one of the main mo-
tivations behind this thesis (cf. Chapter 1), we hardly consider the retrieval
step of text mining, since we focus on the inference of information from the
potentially relevant texts, as we detail in Section 2.2. Still, we borrow some
techniques from information retrieval, such as filtering or the determina-
tion of similar texts. For this purpose, we require the following concepts,
which are associated with information retrieval rather than with text analysis.
Vectors To determine the relevance of texts, many approaches map all texts
and queries into a vector space model (Manning et al., 2008). In general, such
a model defines a common vector representation x = (x1, ..., xk), k ≥ 1, for
all inputs, where each xi ∈ x formalizes an input property. A concrete input
like a text D is then represented by one value xi(D) for each xi. In web search,
the standard way to represent texts and queries is by the frequencies of the
words they contain from a set of (possibly hundreds of thousands of) words.
Generally, though, any measurable property of an input can be formalized,
which becomes particularly relevant for tasks like text classification.
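As a minimal illustration of such a word-frequency representation (the vocabulary, the function name, and the example text are made up for this sketch):

```python
from collections import Counter

def to_vector(text, vocabulary):
    # Map a text to its vector of word frequencies over a fixed vocabulary.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["apple", "company", "fruit"]
print(to_vector("Apple is a company and Apple is no fruit", vocabulary))
# [2, 1, 1]
```

A query would be mapped with the same function, so texts and queries live in the same vector space and can be compared directly.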
Similarity Given a common representation, similarities between texts and
queries can be computed. Most word frequencies of a search query will of-
ten be 0. In case they are of interest, a reasonable similarity measure is the
cosine distance, which puts emphasis on the properties that occur (Manning
et al., 2008). In Chapter 5, we compute similarities of whole texts, where a
zero does not always mean the absence of a property. Such scenarios sug-
gest other measures. In our experiments, we use the Manhattan distance be-
tween two vectors x^(1) and x^(2) of length k (Cha, 2007), which is defined as:

  $\mathrm{Manhattan\ distance}(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \sum_{i=1}^{k} \left| x_i^{(1)} - x_i^{(2)} \right|$
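In code, the measure is just a sum of absolute per-dimension differences (toy vectors for illustration; in the thesis the vectors hold text properties):

```python
def manhattan_distance(x1, x2):
    # Sum of absolute differences over all k dimensions.
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x1, x2))

print(manhattan_distance([2, 1, 1], [0, 1, 3]))  # 4
```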
Indexing While queries are typically stated ad-hoc, the key to efficient ad-
hoc search is that all texts in a given collection have been indexed before.
A query is then matched against the search index, thereby avoiding to pro-
cess the actual texts during search. Very sophisticated indexing approaches
exist and are used in today's web search engines (Manning et al., 2008). In
its basic form, a search index contains one entry for every measured prop-
erty. Each entry points to all texts that are relevant with respect to the prop-
erty. Some researchers, such as Cafarella et al. (2005), have adapted indexing
to information extraction by building specialized search indexes based on
concepts like entities. We discuss in Section 2.4 to what extent they reduce
the need for the ad-hoc large-scale text mining that we address in this thesis.
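The basic form described above, one entry per word pointing to all texts containing it, can be sketched as a toy inverted index (illustrative names and texts; real engines add ranking, compression, and much more):

```python
def build_index(texts):
    # One entry per word; each entry lists the ids of all texts containing it.
    index = {}
    for doc_id, text in enumerate(texts):
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return index

texts = ["charles babbage built machines",
         "a quote by babbage",
         "vehicle sales heat grid"]
index = build_index(texts)
print(sorted(index["babbage"]))  # [0, 1]
```

At query time, only the index entries for the query words are touched, which is what makes ad-hoc search fast once indexing has been paid for in advance.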
Filtering While the ranking of texts by relevance is not needed in this thesis,
we filter relevant portions of texts in Chapter 3. Filtering is addressed in
information retrieval on two levels: Text filtering classifies complete texts
as being relevant or irrelevant (Sebastiani, 2002), whereas passage retrieval
aims to determine the passages of a text that are relevant for answering a
given query (Cui et al., 2005). We investigate the difference between our and
existing filtering approaches in Section 2.4 and their integration at the end
of Chapter 3. Filtering is usually seen as a classification task (Manning et al.,
2008) and, thus, addressed as a text analysis. We cover text classification as
part of natural language processing, which we describe next.
analysis is hence often hard and can even be impossible. For instance, the
sentence "She's an Apple fan." alone leaves it undecidable whether it refers
to a fruit or a company.
Technically, natural language processing can be seen as the production
of annotations (Ferrucci and Lally, 2004). An annotation marks a text or a
span of text that represents an instance of a particular type of information.
We discuss the role of annotations more extensively in Section 2.2, before
we formalize the view of text analysis as an annotation task in Chapter 3.
Lexical and Syntactic Analyses For our purposes, we distinguish three
types of lexical and syntactic analyses: the segmentation of a text into single
units, the tagging of units, and the parsing of syntactic structure.
Mostly, the smallest text unit considered in natural language processing
is a token, denoting a word, a number, a symbol, or anything similar (Manning
and Schütze, 1999). Besides the tokenization of texts, we also perform
segmentation with sentence splitting and paragraph splitting in this thesis. In
terms of tagging, we look at part-of-speech, meaning the categories of tokens
like nouns or verbs, although much more specific part-of-speech tags are used
in practice (Jurafsky and Martin, 2009). Also, we perform lemmatization in
some experiments to get the lemmas of tokens, i.e., their dictionary forms,
such as "be" in case of "is" (Manning and Schütze, 1999). Finally, we use
shallow parsing, called chunking (Jurafsky and Martin, 2009), to identify different
types of phrases, and dependency parsing to infer the dependency tree
structure of sentences (Bohnet, 2010). Appendix A provides details on all
named analyses and on the respective algorithms we rely on. The output of
parsing is particularly important for information extraction.
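As a rough illustration of the segmentation analyses named above, tokenization and sentence splitting can be approximated with regular expressions; the actual algorithms used in this thesis (cf. Appendix A) are considerably more sophisticated:

```python
import re

def tokenize(text):
    """Segment a text into word, number, and symbol tokens (rough heuristic)."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Segment a text at sentence-final punctuation followed by whitespace,
    a naive heuristic that fails, e.g., on abbreviations."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

For example, `tokenize("She's an Apple fan.")` splits the apostrophe and the period into separate symbol tokens.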
Information Extraction The basic semantic concept is a named or numeric
entity from the real world (Jurafsky and Martin, 2009). Information extraction
analyzes usually unstructured texts in order to recognize references to
such entities, relations between entities, and events the entities participate
in (Sarawagi, 2008). In the classical view of the Message Understanding
Conferences, information extraction is seen as a template filling task (Chinchor
et al., 1993), where the goal is to fill entity slots of relation or event
templates with information from a collection or a stream of texts D.
The set of information types C to be recognized is often predefined, al-
though some recent approaches address this limitation (cf. Section 2.4).
Both rule-based approaches, e.g. based on regular expressions or lexicons,
and statistical approaches, mostly based on machine learning (see below),
are applied in information extraction (Sarawagi, 2008). The output is struc-
tured information that can be stored in databases or directly displayed to the
genre of a text in terms of the form, purpose, and/or intended audience of
the text (Stein et al., 2010), or authorship attribution, where the author of a text
is to be determined (Stamatatos, 2009). In automatic essay grading, the goal is
to assign ratings from a usually numeric classification scheme to texts like
essays (Dikli, 2006), and stance recognition seeks the stance of a person
with respect to some topic (Somasundaran and Wiebe, 2010). All these and
some related tasks are discussed in more or less detail in this thesis.
Of highest importance in our experiments is sentiment analysis, which
has become one of the most widely investigated text classification tasks in
the last decade. By default, sentiment analysis refers to the classification
of the sentiment polarity of a text as being positive or negative (Pang et al.,
2002). An example is shown in Figure 2.1(b). Sometimes, also an objective
(or neutral) polarity is considered, although this class rather refers to
subjectivity (Pang and Lee, 2004). Moreover, sentiment can also be assessed
on numeric scales (Pang and Lee, 2005), which we call sentiment scoring here.
We employ a number of sentiment analysis algorithms in Chapter 5. They
are listed in Appendix A.
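As a minimal illustration of polarity classification, a lexicon-based sketch might look as follows; the word lists are hypothetical stand-ins for a real sentiment lexicon, and none of the algorithms listed in Appendix A works this simply:

```python
# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"nice", "large", "comfortable", "good"}
NEGATIVE = {"cold", "problem", "bad"}

def polarity(text):
    """Classify the sentiment polarity of a text as positive or negative by
    counting lexicon hits (ties default to positive)."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score >= 0 else "negative"
```

On the hotel review of Figure 2.1(b), such a sketch would classify "The rooms are large and comfortable" as positive and "It is very cold" as negative.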
Like passage retrieval (see above), text classification does not always deal
with complete texts. In Chapter 5, we classify the subjectivity of single discourse
units, where objective units can be seen as facts and subjective units
as opinions. In opinion mining, such techniques are combined with information
extraction techniques to find opinions on certain topics (Popescu and
Etzioni, 2005), as done in one of our case studies (cf. Section 2.3).3 Sentiment
analysis and opinion mining are of high practical relevance, because
they can be used in text mining applications that analyze people's opinions
on products and brands in social media, online review sites, and the
like (Pang and Lee, 2008). For this purpose, data mining needs to be performed
on the output of the respective algorithms.
3
Unlike us, some researchers do not distinguish between sentiment analysis and opinion
mining, but they use these two terms interchangeably (Pang and Lee, 2008).
Figure 2.2: Illustration of a high-level view of data mining. Input data is repre-
sented as a set of instances, from which a model is derived using machine learning.
The model is then generalized to infer new output information.
Data Mining
Data mining primarily aims at the inference of new information of specified
types from typically huge amounts of input data, already given in structured
form (Witten and Frank, 2005). To address such a prediction problem,
the data is first converted into instances of a defined representation and then
handed over to a machine learning algorithm. The algorithm recognizes
statistical patterns in the instances that are relevant for the prediction problem.
This process is called training. The found patterns are then generalized, such
that they can be applied to infer new information from unseen data, generally
referred to as prediction. In this regard, machine learning can be seen
as the technical basis of data mining applications (Witten and Frank, 2005).
Figure 2.2 shows a high-level view of the outlined process.
Data mining and text mining are related in two respects: On the one hand,
the structured output information of text analysis serves as the input to machine
learning, e.g. to train a text classifier. On the other hand, many text
analyses themselves rely on machine learning algorithms to produce output
information. Both respects are important in this thesis. In the following, we
summarize the basic concepts relevant for our purposes.4
Machine Learning Machine learning describes the ability of an algorithm
to learn without being explicitly programmed (Samuel, 1959). An algorithm
can be said to learn from data with respect to a given prediction problem
and some quality measure, if the measured prediction quality increases the
more data is processed (Mitchell, 1997).5 Machine learning aims at prediction
problems where the target function, which maps input data to output in-
4
Besides the cited references, parts of the summary are inspired by the Coursera machine
learning course, https://www.coursera.org/course/ml (accessed on December 2, 2014).
5
A discussion of common quality measures follows at the end of this section.
Also, some features generalize worse than others, often because they cap-
ture domain-specific properties, as we see in Chapter 5.9
Generalization As shown in Figure 2.2, generalization refers to the infer-
ence of output information from unseen data based on patterns captured in
a learned model (Witten and Frank, 2005). As such, it is strongly connected
to the used machine learning algorithm. The training of such an algorithm
based on a given set of instances explores a large space of models, because
most algorithms have a number of parameters. An important decision in
this regard is how much to bias the algorithm with respect to the complex-
ity of the model to be learned (Witten and Frank, 2005). Simple models (say,
linear functions) induce a high bias, which may not fit the input data well,
but regularize noise in the data and, thus, tend to generalize well. Complex
models (say, high polynomials) can be fitted well to the data, but tend to
generalize less. We come back to this problem of fitting in Section 5.1.
During training, a machine learning algorithm incrementally chooses a
possible model and evaluates the model based on some cost function. The
choice relies on an optimization procedure, e.g. gradient descent, which stepwise
heads towards a local minimum of the cost function until convergence by
adapting the model to all input data (Witten and Frank, 2005). In large-scale
scenarios, a variant called stochastic gradient descent is often more suitable. It
repeatedly iterates over all data instances in isolation, thereby being much
faster while not guaranteeing to find a local minimum (Zhang, 2004). No
deep understanding of the generalization process is needed in this thesis,
since we focus only on the question of how to address text analysis tasks with
existing machine learning algorithms in order to then select an adequate
one. What matters for us is the type of learning that can or should be performed
within the task at hand. Mainly, we consider two very prominent
types in this thesis, supervised learning and unsupervised learning.
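As an illustration of the training procedure outlined above, the following sketch fits a univariate linear target function by stochastic gradient descent, updating the model after each single instance in isolation; the learning rate, cost function, and function names are illustrative choices, not those of any algorithm used in this thesis:

```python
import random

def sgd_linear(data, learning_rate=0.05, epochs=500, seed=0):
    """Fit the target function y = w*x + b by stochastic gradient descent on a
    squared-error cost, updating after each single (x, y) instance."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            error = (w * x + b) - y
            w -= learning_rate * error * x  # gradient of 0.5*error^2 w.r.t. w
            b -= learning_rate * error      # gradient of 0.5*error^2 w.r.t. b
    return w, b
```

On noise-free instances of y = 2x + 1, the procedure approaches w = 2 and b = 1.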
Supervised Learning In supervised learning, a machine learning algorithm
derives a model from known training data, i.e., pairs of a data instance and
the associated output information (Witten and Frank, 2005). The model can
then be used to predict output information for unknown data. The notion
of being supervised refers to the fact that the learning process is guided by
examples of correct predictions. In this thesis, we use supervised learning
for both statistical classification and statistical regression.
Classification describes the task of assigning a data instance to the most likely
of a set of two or more predefined discrete classes (Witten and Frank, 2005).
9
Techniques like feature selection and dimensionality reduction, which aim to reduce
the set of considered features to improve generalizability and training efficiency, among
other things (Hastie et al., 2009), are beyond the scope of this thesis.
Figure 2.4: Illustration of unsupervised learning: (a) Flat clustering groups a set
of instances into a (possibly predefined) number of clusters. (b) Hierarchical clus-
tering creates a binary hierarchy tree structure over the instances.
learning is clustering, which groups a set of instances into a possibly but not
necessarily predefined number of clusters (Witten and Frank, 2005). Here,
we consider only hard clusterings, where each instance belongs to a single
cluster that represents some class. Different from classification, the meaning
of a class is usually unknown in clustering, though. Clustering learns
patterns in the similarities of instances based on similarity measures like
those used in information retrieval (see above). The resulting model can
assign arbitrary instances to one of the given clusters. In text mining, clustering
is e.g. used to detect texts with similar properties.
Conceptually, two basic types of clustering exist, as shown in Figure 2.4.
While flat clustering partitions instances without specifying associations between
the created clusters, hierarchical clustering organizes instances in a hierarchical
tree (Manning et al., 2008). Each node in the tree represents a cluster
of a certain size. The root cluster consists of all instances and each leaf
refers to a single instance. A flat clustering can be derived from a hierarchical
clustering through cuts in the tree. The tree is incrementally created
by measuring distances between instances and clusters. To this end, a cluster
is e.g. represented by its centroid, i.e., the average of all instances in the
cluster (Manning et al., 2008). In general, both clustering types have certain
advantages with respect to efficiency and cluster quality. We rely on hierarchical
clustering in Chapter 5 for reasons discussed there. In particular, we
perform agglomerative hierarchical clustering, where the hierarchy is created
bottom-up, beginning with the single instances (Manning et al., 2008).
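The bottom-up creation of the hierarchy can be sketched as follows. The sketch merges clusters by centroid distance and stops once k clusters remain, i.e., at one possible cut of the tree, rather than building the complete tree; the use of the Manhattan distance is an illustrative choice:

```python
def centroid(cluster):
    """Average of all instances in a cluster (each instance a tuple of numbers)."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))

def agglomerative(points, k):
    """Bottom-up clustering: start with singleton clusters, then repeatedly
    merge the two clusters whose centroids are closest until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = sum(abs(a - b) for a, b in zip(ci, cj))  # Manhattan distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For the four points (0,0), (0,1), (5,5), (6,5) and k = 2, the two close pairs end up in separate clusters.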
Further Learning Types Some other machine learning types are used more
or less frequently in text mining, part of which are variations of supervised
learning. Sporadically, we talk about semi-supervised learning in this thesis,
which targets at tasks where much input data is available, but little known
training data. Intuitively, semi-supervised learning first derives patterns
from the training data and then applies knowledge about these patterns to
find similar patterns in the other data (Chapelle et al., 2006). Some research
in information extraction proposes self-supervised learning approaches that
aim to fully overcome the need for known data by generating training data
on their own (Banko et al., 2007). This can work when some output information
is accessible without uncertainty. We present a corresponding approach
in Chapter 5. Also, we employ an entity recognition algorithm that relies
on sequence labeling (cf. Appendix A). Sequence labeling classifies each instance
of a sequence of instances, exploiting information about the other instances.
While there are more learning types, like reinforcement learning, recommender
systems, or one-class classification, we do not apply them in this
thesis and thus omit them here for brevity.
text analyses (e.g. for sentiment analysis). Text corpora often contain annotations,
especially annotations of the target variable that represents the
output information to be inferred (e.g. the sentiment polarity of a text). In
contrast to the annotations produced by text analysis algorithms, corpus annotations
have usually been created manually in a cost-intensive annotation
process. To avoid such a process, they can sometimes be derived from existing
metadata (such as the author or star rating of a review). In both cases,
they are seen as ground truth annotations (Manning et al., 2008).
To allow for generalization, the compilation of texts in a text corpus usually
aims to be representative for some target variable C, i.e., it includes the
full range of variability of texts with respect to C (Biber et al., 1998). We discuss
representativeness at the beginning of Chapter 5. For evaluation, also
the distribution of texts over the values of C should be representative for
the real distribution. For machine learning, though, a balanced distribution,
where all values of the target variable are evenly represented, is favorable
according to statistical learning theory (Batista et al., 2004).
Effectiveness Text analysis approaches are mostly evaluated with respect to
their effectiveness, which quantifies the extent to which output information
is correct. Let a collection of input texts with ground truth annotations
for the target variable C of a text analysis approach be given. Then the effectiveness
of all approaches relevant in this thesis can be evaluated in the sense
of a two-class classification task, i.e., whether the decision to produce each
possible instance of C is correct or not.
We call the output instances of an approach the positives and all other
possible instances the negatives. On this basis, four different sets can be distinguished
(Witten and Frank, 2005): True positives (TP) are all positives that
belong to the ground truth, true negatives (TN) are all negatives that do not
belong to the ground truth, false negatives (FN) are all negatives that belong
to the ground truth, and false positives (FP) are all positives that do not belong
to the ground truth. Figure 2.5 illustrates the four sets.
Given the four sets, effectiveness can directly be quantified with different
measures whose adequateness depends on the given task. One measure is
given by the accuracy a, which denotes the ratio of correct decisions:

  a = (|TP| + |TN|) / (|TP| + |TN| + |FP| + |FN|)

The accuracy is an adequate measure when all decisions are of equal importance.
This holds for many text classification tasks as well as for other
text analysis tasks in which every portion of an input text is annotated and,
thus, requires a decision, such as in tokenization. In contrast, especially in
Figure 2.5: Venn diagram showing the four sets that can be derived from the
ground truth information of some type in a collection of input texts and the output
information of that type inferred from the input texts by a text analysis approach.
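The accuracy measure defined above translates directly into code; the function name is an illustrative choice:

```python
def accuracy(tp, tn, fp, fn):
    """Ratio of correct decisions over all decisions, computed from the
    sizes of the four sets TP, TN, FP, and FN."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For instance, 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives yield an accuracy of 0.85.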
Figure 2.6: Illustration of two ways of splitting a text corpus for development and
evaluation: (a) A training set is used for development, a validation set for optimiz-
ing parameters, and a test set for the final evaluation. (b) Each fold i out of n folds
serves for evaluation in the i-th of n runs, while all others are used for development.
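The n-fold splitting scheme of Figure 2.6(b) can be sketched as follows; the round-robin assignment of instances to folds is an illustrative simplification (random or stratified assignment is common in practice):

```python
def cross_validation_splits(instances, n):
    """Yield (development, evaluation) pairs: fold i is held out for evaluation
    in the i-th of n runs, while all other folds serve for development."""
    folds = [instances[i::n] for i in range(n)]  # round-robin fold assignment
    for i in range(n):
        development = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield development, folds[i]
```

Every instance is thus used for evaluation exactly once across the n runs.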
mation need. Since text analysis pipelines are in the focus of all approaches
proposed in this thesis, we now explain the outlined concepts of text ana-
lysis more comprehensively and we illustrate them at the end. Thereby, we
define the starting point for all discussions in Chapters 3 to 5.
cessing. We have introduced the general analyses that can infer these infor-
mation types from input texts in Section 2.1. However, even the inference of
a single information type often requires several analysis steps, each of which
refers to one text analysis. The reason is that many text analyses require as
input the output of other text analyses, which in turn depend on further text
analyses, and so forth. As a consequence, addressing such an information
need means the realization of a complex text analysis process. Common ex-
amples refer to the areas of information extraction and text classification, as
sketched below. In general, also other natural language processing tasks entail
a number of analysis steps, like semantic role labeling, which seeks the
associations between the verb in a sentence and its arguments (Gildea and
Jurafsky, 2002). Some researchers report on processes in the intersection of
the different areas with almost 30 steps (Solovyev et al., 2013).
Information Extraction As discussed in Section 2.1, information extrac-
tion often aims at filling complex event templates whose instances can be
stored in databases. Therefore, information extraction processes are made
up of possibly tens of analysis steps, covering the whole spectrum from lex-
ical and syntactic preprocessing over entity recognition, relation extraction,
and event detection to coreference resolution and normalization. While we
investigate processes with up to 11 distinguished analysis steps in the ex-
periments of the subsequent chapters, for brevity we here exemplify only
that even binary relation extraction may already require several steps.
In particular, assume that instances of the above-mentioned relation type
Founded shall be extracted from the sentences of an input text using super-
vised classification (cf. Section 2.1). Before features can be computed for
classification, both organization and time entities need to be recognized in
the sentences. Entity recognition often relies on the output of a chunker,
while relation extraction benefits from information about the positions of
candidate entities in a dependency parse tree (Sarawagi, 2008). These ana-
lyses are usually based on part-of-speech tags and lemmas, which mostly
makes a preceding tokenization and sentence splitting necessary.
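The resulting process can be pictured as a sequence of analysis steps over a shared document representation, each step reading the annotations of its predecessors; the step functions below are schematic stand-ins, not the algorithms of Appendix A:

```python
def run_pipeline(steps, text):
    """Execute analysis steps in sequence on a shared document representation;
    each step reads the annotations of its predecessors and adds its own."""
    doc = {"text": text}
    for step in steps:
        step(doc)
    return doc

# Schematic stand-ins for real analysis algorithms:
def tokenization(doc):
    doc["tokens"] = doc["text"].split()

def part_of_speech_tagging(doc):
    # A real tagger would assign proper tags; here every token gets a placeholder.
    doc["pos"] = [(token, "?") for token in doc["tokens"]]
```

Since `part_of_speech_tagging` reads `doc["tokens"]`, swapping the two steps would fail, which mirrors the interdependencies between analysis steps discussed above.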
Text Classification In terms of the number of distinguished analysis steps,
text classification processes tend to be shorter than information extraction
processes, because the focus is usually on the computation of the feature values
from which the class of an input text is inferred. Still, many features rely on the
existence of previously produced instances of information types, especially
those resulting from lexical and shallow syntactic analyses (cf. Section 2.1).
In sentiment analysis, for example, some baseline approaches derive fea-
tures from the output of tokenization and part-of-speech tagging only (Pang
et al., 2002), while others e.g. also perform chunking, and extract relations
between recognized domain-specific terms (Yi et al., 2003). Moreover, some
text classification approaches rely on fine-grained information from seman-
tic and pragmatic text analyses, such as the sentiment analysis in our case
study ArguAna that we introduce in Section 2.3.
Realization The complexity of common text analysis processes raises the
question of how to approach a text analysis task without losing one's mind
in the process, like van Gogh according to the introductory quote of this
chapter. As the examples above indicate, especially the dependencies between
analysis steps are not always clear in general (e.g. some entity recognition
algorithms require part-of-speech tags, while others do not). In addition,
errors may propagate through the analysis steps, because the output of one
step serves as input to subsequent steps (Bangalore, 2012). This entails the
danger of achieving limited overall effectiveness, although each single ana-
lysis step works fine. A common approach to avoid error propagation is to
perform joint inference, where all or at least some steps are performed con-
currently. Some studies indicate that joint approaches can be more effective
in tasks like information extraction16 (cf. Section 2.4 for details).
For our purposes, joint approaches entail limitations, though, because we
seek to realize task-specific processes ad-hoc for arbitrary information needs
from text analysis. Moreover, joint approaches tend to be computationally
expensive (Poon and Domingos, 2007), as they explore larger search spaces
emanating from combinations of information types. This can be problem-
atic for the large-scale scenarios we target at. Following Buschmann et al.
(1996), our requirements suggest the resort to a sequence of small analysis
steps composed to address a task at hand. In particular, small analysis steps
allow for an easy recombination and they simplify the handling of interdependencies.
Still, a joint approach may be used as a single step in such a
sequence. We employ a few joint approaches (e.g. the algorithm ene described
in Appendix A.1) in the experiments of this thesis. Now, we present
the text analysis pipelines that realize sequences of analysis steps.
(a) "Cairo, August 25th 2010 -- Forecast on Egypt's Automobile industry [...] In the next five years, revenues will rise by 97% to US-$19.6 bn. [...]"
(b) "This hotel is pretty nice. The rooms are large and comfortable. The lobby is also very nice. The only problem with this hotel is the pool. It is very cold. I might go back here."
Figure 2.8: Example for output information inferred from an input text to address
the main information needs in (a) the InfexBA project and (b) the ArguAna project.
Martin, 2009). Some types, though, occur in diverse types of natural lan-
guage texts, of which the most common are person names, location names,
and organization names. They have been in the focus of the CoNLL-2003
shared task on named entity recognition (Tjong Kim Sang and Meulder, 2003).
In Chapter 4, we analyze the distribution of the three entity types in several
text corpora from Appendix C in the context of influencing factors of pipe-
line efficiency. There, we rely on a common sequence labeling approach
to named entity recognition (cf. Section 2.1), using the algorithm ene (cf.
Appendix A) in the pipeline ene = (sse, sto2, tpo1, pch, ene).
Language Function Analysis Finally, we address the text classification
task language function analysis in this thesis. We introduced this task ourselves
in Wachsmuth and Bujna (2011). As argued there, every text can be
seen as being predominantly expressive, appellative, or informative. These
language functions define an abstract classification scheme, which can be understood
as capturing a single aspect of genres (Wachsmuth and Bujna,
2011). In Chapter 5, we concretize the scheme for product-related texts in
order to then outline how much text classification depends on the domain of
the input texts. Moreover, in Chapter 3 we integrate clf in the information
extraction pipelines from InfexBA (see above). In particular, we employ clf
to filter possibly relevant candidate input texts, which can be seen as one of
the most common applications of text classification in text mining.
2012). The leading software frameworks for text analysis, Apache UIMA
and GATE, target at pipelines (cf. Section 1.3). Some of our approaches as-
sume that no analysis is performed by more than one algorithm in a pipe-
line. This is usual, but not always the case (Whitelaw et al., 2008). As a con-
sequence, algorithms can never make up for errors of their predecessors,
which may limit the overall effectiveness of pipelines (Bangalore, 2012). In
addition, the task dependency of effective text analysis algorithms and pipe-
lines (cf. Sections 2.1 and 2.2) renders their use in the ad-hoc search scenar-
ios we focus on problematic (Etzioni, 2011). In the following, we describe
the most important approaches to tackle these problems, grouped under
the topics joint inference, pipeline enhancement, and task independence.
Joint Inference We have already outlined joint inference as a way to avoid
the problem of error propagation in classic pipelines in Section 2.2. Joint
approaches infer different types of information at the same time, thereby
mimicking the way humans process and analyze texts (McCallum, 2009).
Among others, tasks like entity recognition and relation extraction have
been said to benefit from joint inference (Choi et al., 2006). However, the
possible gain of effectiveness comes at the cost of lower efficiency and less
reusability (cf. Section 2.2), which is why we do not target at joint ap-
proaches in this thesis, but only integrate them when feasible.
Pipeline Enhancement Other researchers have addressed the error propagation
through iterative or probabilistic pipelines. In case of the former,
a pipeline is executed repeatedly, such that the output of later algorithms in
a pipeline can be used to improve the output of earlier algorithms (Hollingshead
and Roark, 2007).19 In case of the latter, a probability model is built
based on different possible outputs of each algorithm (Finkel et al., 2006) or
on confidence values given for the outputs (Raman et al., 2013). While these
approaches provide reasonable enhancements of the classic pipeline architecture,
they require modifications of the available algorithms and partly
also significantly reduce efficiency. Neither fits well to our motivation
of enabling ad-hoc large-scale text mining (cf. Section 1.1).
Task Independence The mentioned approaches can improve the effective-
ness of text analysis. Still, they have to be designed for the concrete task at
hand. For the extraction of entities and relations, Banko et al. (2007) intro-
duced open information extraction to overcome such task dependency. Unlike
traditional approaches for predefined entity and relation types (Cunning-
19
Iterative pipelines are somewhat related to compiler pipelines that include feedback
loops (Buschmann et al., 1996). There, results from later compiler stages (say, semantic ana-
lysis) are used to resolve ambiguities in earlier stages (say, lexical analysis).
ham, 2006), their system TextRunner efficiently looks for general syntactic
patterns (made up of verbs and certain part-of-speech tags) that indicate re-
lations. Instead of task-specific analyses, it requires only a keyword-based
query as input that allows identifying task-relevant relations. While Cun-
ningham (2006) argues that high effectiveness implies high specificity, open
information extraction targets at web-scale scenarios. There, precision can
be preferred over recall, which suggests the exploitation of redundancy in
the output information (Downey et al., 2005) and the resort to highly reliable
extraction rules, as in the subsequent system ReVerb (Fader et al., 2011).
Open information extraction denotes an important step towards the use
of text analysis in web search and big data analytics applications. Until
today, however, it is restricted to rather simple binary relation extraction
tasks (Mesquita et al., 2013). In contrast, we seek to be able to tackle arbitrary
text analysis tasks, for which appropriate algorithms are available. With re-
spect to pipelines, we address the problem of task dependency in Chapter 3
through an automatic design of text analysis pipelines.
context often aimed to reduce the cost of adapting to new domains by ex-
ploiting machine learning techniques for obtaining training data automati-
cally, surveyed by Turmo et al. (2006). However, the predominant approach
today is to tackle domain dependence through domain adaptation (Daumé
and Marcu, 2006), as explained below. Some approaches also strive for domain
independence based on generally valid features. From these, we adopt
the idea of focusing on structure, especially on the argumentation structure
of texts, which in turn relates to information structure and discourse structure.
Since robustness does not mean perfect effectiveness, we end with existing
work on how to increase the user acceptance of erroneous results.
Domain Adaptation The scenario usually addressed in domain adaptation
is that many training texts are given from some source domain, but only few
from a target domain (Blitzer et al., 2008). The goal is to learn a model on
the source texts that works well on unknown target texts. In information
extraction, most domain adaptation approaches share that they choose a
representation of the source texts that makes them close to the distribution of
the target texts (Gupta and Sarawagi, 2009). Similarly, domain adaptation
is often tackled in text classification by separating the domain-specific from
the domain-independent features and then exploiting knowledge about the
latter (Daumé and Marcu, 2006). Also, structural correspondences can be
learned between domains (Blitzer et al., 2007; Prettenhofer and Stein, 2011).
In particular, domain-specific features are aligned based on a few domain-independent features, e.g. "Stay away!" in the hotel domain may have a similar meaning as "Read the book!" in the film domain.
Domain adaptation, however, does not really apply to ad-hoc search and
similar applications, where it is not possible to access texts from all tar-
get domains in advance. This also excludes the approach of Gupta and
Sarawagi (2009) who derive domain-independent features from a compari-
son of the set of all unknown target texts to the set of known source texts.
Domain Independence Since domains are often characterized by content
words and the like (cf. Chapter 5 for details), most approaches that explic-
itly aim for domain independence try to abstract from content. Glorot et al.
(2011), for instance, argue that higher-level intermediate concepts obtained
through the non-linear input transformations of deep learning help in cross-
domain sentiment analysis of reviews. While we evaluate domain robust-
ness for the same task, we do not presume a certain type of machine learning
algorithms. Rather, we work on the features to be learned. Lipka (2013) ob-
serves that style features like character trigrams serve the robustness of text
quality assessment. Similar results are reported for authorship attribution
3 Pipeline Design

3.1 Ideal Construction and Execution for Ad-hoc Text Mining
Figure 3.1: Abstract view of the overall approach of this thesis (cf. Figure 1.5). Sec-
tions 3.1 to 3.3 discuss the automatic design of ad-hoc text analysis pipelines.
Now, the question is how to design an optimal pipeline for a given infor-
mation need C and a collection or a stream of input texts D. Our focus is
realizing complete text analysis processes rather than single text analyses.
Therefore, we consider the question for a universe where the set A_Ω of all available algorithms is predefined (cf. Section 2.2). Under this premise, the quality of a pipeline follows only from its construction and its execution.
As presented in Section 2.2, the design style of a pipeline Π = ⟨A, π⟩ is fixed, consisting of a sequence of algorithms where the output of one algorithm is the input of the next. Consequently, pipeline construction means the selection of an algorithm set A from A_Ω that can address C on D as well as the definition of a schedule π of the algorithms in A. Similarly, we use the term pipeline execution to refer to the application of a pipeline's algorithms to the texts in D and to its production of output information of the types in C.
While the process of producing output from an input is defined within an
algorithm, the execution can be influenced by controlling what part of the
input is processed by each algorithm. As a matter of fact, pipeline optimal-
ity follows from an optimal selection and scheduling of algorithms as well
as from an optimal control of the input of each selected algorithm.
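For illustration, the interplay of construction (algorithm selection and scheduling) and execution (input control) can be sketched in a few lines of Python; all algorithm and type names here are hypothetical toy examples, not part of our approach:

```python
# A toy sketch (not the thesis implementation) of pipeline execution with
# input control: a schedule is a list of algorithms, each annotating the
# text portions it receives; the input control restricts what each
# algorithm gets to see.

def run_pipeline(schedule, portions, input_control=None):
    """Apply the algorithms of a pipeline to text portions in order."""
    for algorithm in schedule:
        if input_control is not None:
            portions = input_control(algorithm, portions)  # control input
        for portion in portions:
            algorithm(portion)  # adds annotations in place
    return portions

# Toy "algorithms" that annotate a portion (a dict) with found types.
def find_time(portion):
    portion.setdefault("types", set()).add("Time")

def find_money(portion):
    portion.setdefault("types", set()).add("Money")

portions = run_pipeline([find_time, find_money],
                        [{"text": "In 2015, revenues grew to $5 bn."}])
```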
2. Admissibility. The schedule π fulfills all input constraints of all algorithms in A, i.e.,

    ∀ Ai ∈ (A1, . . . , Am):   Ci^(in) ⊆ C0 ∪ ⋃_{j=1}^{i−1} Cj^(out)        (3.3)
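This admissibility condition can be checked mechanically; the following Python sketch (with hypothetical type names) tests whether a schedule is admissible with respect to an initial set C0 of given types:

```python
# Sketch of the admissibility check behind Inequality 3.3: every
# algorithm's input types must be covered by the initially given types C0
# plus the output types of all algorithms scheduled before it.
# Algorithm and type names below are hypothetical.

def is_admissible(schedule, c0):
    """schedule: list of (input_types, output_types) pairs."""
    available = set(c0)
    for input_types, output_types in schedule:
        if not set(input_types) <= available:
            return False  # some input constraint is violated
        available |= set(output_types)
    return True

sse = ({"Text"}, {"Sentence"})      # sentence splitting
sto = ({"Sentence"}, {"Token"})     # tokenization
eti = ({"Token"}, {"Time"})         # time recognition

ok = is_admissible([sse, sto, eti], {"Text"})    # admissible ordering
bad = is_admissible([sto, sse, eti], {"Text"})   # sto lacks Sentence
```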
Figure 3.2: The impact of the selection and the schedule of the algorithms in a text
analysis pipeline: (a) Selecting a more effective algorithm set improves a pipelines
effectiveness, but it also entails higher run-time. (b) A smart scheduling of the
algorithms can improve the pipelines run-time without impairing its effectiveness.
entity type is recognized first, relation extraction must take place only on
those sentences that contain both a time entity and a money entity.
What we get from admissibility is that we can subdivide the problem of finding an optimal pipeline Π* = ⟨A*, π*⟩ into two subproblems. The first subproblem is to select an algorithm set A* that best matches the efficiency-effectiveness tradeoff to be made, which has to be inferred from the quality function Q at hand. This situation is illustrated in Figure 3.2(a). Once A* is given, the second subproblem breaks down to a single-criterion optimization problem that is independent from Q, namely, to schedule and execute the selected algorithms in the most efficient manner, because all pipelines based on A* are of equal effectiveness. Accordingly, the best pipeline for A* in Figure 3.2(b) refers to the one with the lowest run-time. We conclude that only the selection of algorithms actually depends on Q. Altogether, the developed pipeline optimization problem can be summarized as follows:2
Figure 3.3: Sample illustration of the four steps of designing an optimal text analysis pipeline for a collection or a stream of input texts D and an information need C using a selection A1, . . . , Am of the set of all text analysis algorithms A_Ω.

2 In Section 3.5, we offer evidence that the time required for filtering is in fact almost negligible.
Figure 3.4: Application of the paradigms from Figure 3.3 of designing an optimal pipeline Π2* = ⟨A2, π2*⟩ for addressing the information need Forecast(Time, Money) on the Revenue corpus. The application is based on the algorithm set A2.
Input Texts In the case study, we process the provided split of our Revenue
corpus, for which details are presented in Appendix C.1. We use the training
set of the corpus to estimate all run-times and initially filtered portions of
text of the employed text analysis algorithms.
Maximum Decomposition To address C, we need a recognition of time
entities and money entities, an extraction of their relations, and a detec-
tion of revenue and forecast events. For these text analyses, an input text
must first be segmented into sentences and tokens. Depending on the
employed algorithms, the tokens may additionally have to be extended by
part-of-speech tags, lemmas, and dependency parse information. Based on
these circumstances, we consider three algorithm sets for C, each of them
representing a different level of effectiveness:
A1 = { sse, sto2 , tpo1 , eti, emo, rtm1 , rre1 , rfo }
A2 = { sse, sto2 , tpo1 , pde1 , eti, emo, rtm2 , rre2 , rfo }
A3 = { sse, sto2 , tpo2 , tle, pde2 , eti, emo, rtm2 , rre2 , rfo }
For information on the algorithms (input and output, quality estimations,
etc.), see Appendix A. Differences between A1 , A2 , and A3 are that (1) only
A3 contains separated algorithms tpo2 and tle for part-of-speech tagging
and lemmatization, (2) because of the simple rule-based relation extractor
rtm1 , A1 requires no dependency parser, and (3) the revenue event detector
rre1 of A1 is faster but less effective than rre2. Exemplarily, Figure 3.4 illustrates the design of an optimal text analysis pipeline for A2, showing the result of maximum decomposition at the top, i.e., an initial valid pipeline Π2^(a) = ⟨A2, π2^(a)⟩. For each algorithm set, the initial pipeline schedules the algorithms in the ordering shown above.
Early Filtering Next, each of the three algorithm sets is modified by adding
filtering steps after every non-preprocessing algorithm. This results in three
modified pipelines, such as the pipeline Π2^(b) = ⟨A2, π2^(b)⟩ in Figure 3.4(b). The pipelines Π1^(b) to Π3^(b) perform filtering on the sentence level in order to match the information needed to address C.
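For illustration, such sentence-level filtering can be sketched as follows (a toy Python example with hypothetical sentence records, not our actual implementation):

```python
# Toy sketch of sentence-level early filtering: after an algorithm has
# annotated the sentences, only sentences containing an instance of the
# produced type are kept for the downstream algorithms.

def filter_sentences(sentences, required_type):
    return [s for s in sentences if required_type in s["types"]]

sentences = [
    {"text": "Google expects revenues of $5 bn in 2015.",
     "types": {"Time", "Money"}},
    {"text": "The company is based in Mountain View.", "types": set()},
]

# After time recognition, only sentences with a Time entity remain.
remaining = filter_sentences(sentences, "Time")
```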
Lazy Evaluation According to the input and output constraints of the employed algorithms (cf. Appendix A.1), the outputs of the algorithms tpo1 and pde1 in Π2^(b) = ⟨A2, π2^(b)⟩ are first needed by rtm2. In Π2^(c) = ⟨A2, π2^(c)⟩, we hence delay them until after eti and emo, as shown in Figure 3.4(c), and we perform similar operations on Π1^(b) and Π3^(b).
Optimal Scheduling Finally, we compute an optimal scheduling of the filtering stages in Π1^(c), . . . , Π3^(c) in order to obtain optimal pipelines Π1*, . . . , Π3*. Here, we sketch scheduling for Π2* = ⟨A2, π2*⟩ only. We know that admissible pipelines based on A2 execute eti and emo before rtm2 and eti also before rfo. Given these constraints, we apply the approximation of Inequality 3.5, i.e., we pairwise compute the optimal schedule of two filtering stages. E.g., for Ψemo = (sse, sto2, emo, emo^(F)) and Ψrfo = (sse, sto2, tpo1, rfo, rfo^(F)), we have:

    t(Ψrfo) + qΨrfo · t(Ψemo) = 1.67 ms + 0.20 ms
        < t(Ψemo) + qΨemo · t(Ψrfo) = 1.77 ms + 0.17 ms

Therefore, we move the algorithms in Ψrfo before Ψemo, which also means that we separate tpo1 from pde1 to insert tpo1 before rfo. For corresponding reasons, we postpone rtm2^(F) to the end of the schedule. Thereby, we obtain the final pipeline Π2* that is illustrated in Figure 3.4(d). Correspondingly, we obtain the following pipelines for A1 and A3:

π1* = (sse, sto2, rre1, rre^(F), eti, eti^(F), emo, emo^(F), tpo1, rfo, rfo^(F), rtm1, rtm^(F))

π3* = (sse, sto2, eti, eti^(F), emo, emo^(F), rre2, rre^(F), tpo2, tle, rfo, rfo^(F), pde2, rtm2, rtm^(F))
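The pairwise comparison underlying this scheduling step can be sketched in code; the selectivity values below are back-computed from the measured run-times and are illustrative only:

```python
# Sketch of the pairwise scheduling rule: a filtering stage 1 should be
# executed before a stage 2 if  t1 + q1*t2 < t2 + q2*t1,  where t is the
# stage's run-time and q its selectivity (the fraction of input it
# passes on).

def should_precede(t1, q1, t2, q2):
    return t1 + q1 * t2 < t2 + q2 * t1

t_emo, t_rfo = 1.77, 1.67          # run-times in ms
q_emo = 0.17 / t_rfo               # so that q_emo * t_rfo = 0.17 ms
q_rfo = 0.20 / t_emo               # so that q_rfo * t_emo = 0.20 ms

# 1.67 + 0.20 = 1.87 ms  <  1.77 + 0.17 = 1.94 ms, so rfo goes first.
rfo_first = should_precede(t_rfo, q_rfo, t_emo, q_emo)
```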
Baselines To show the single effects of our method, we consider all constructed intermediate pipelines. For A2, for instance, this means Π2^(a), Π2^(b), Π2^(c), and Π2*. In addition, we compare the schedules of the different optimized pipelines. I.e., for 1 ≤ i, j ≤ 3, we compare each Πi* = ⟨Ai, πi*⟩ to all pipelines ⟨Ai, πj*⟩ with i ≠ j except for π1*. π1* applies rre1 before time and money recognition, which would not be admissible for rre2. For ⟨Ai, πj*⟩, we assume that πj* refers to the algorithms of Ai.
Experiments We compute the run-time per sentence t(Π) and its standard deviation σ for each pipeline Π on the test set of the Revenue corpus using a 2 GHz Intel Core 2 Duo MacBook with 4 GB RAM. All run-times are averaged over five runs. Effectiveness is captured in terms of precision p, recall r, and F1-score f1 (cf. Section 2.1).
 πj       A1: p    r    t          A2: p    r    t           A3: p    r    t
 π^(a)    0.65  0.56   3.23 ±.07   0.72  0.58  51.05 ±.40    0.75  0.61  168.40 ±.57
 π^(b)    0.65  0.56   2.86 ±.09   0.72  0.58  49.66 ±.28    0.75  0.61  167.85 ±.70
 π^(c)    0.65  0.56   2.54 ±.08   0.72  0.58  15.54 ±.23    0.75  0.61   45.16 ±.53
 π1*      0.65  0.56   2.44 ±.03   0.72  0.58      –         0.75  0.61       –
 π2*      0.65  0.56   2.47 ±.15   0.72  0.58   4.77 ±.06    0.75  0.61   16.25 ±.15
 π3*      0.65  0.56   2.62 ±.05   0.72  0.58   4.95 ±.09    0.75  0.61   10.19 ±.05

Table 3.1: The precision p and recall r as well as the average run-time in milliseconds per sentence t(Πi^j) and its standard deviation σ for each considered pipeline Πi^j = ⟨Ai, πj⟩ based on A1, A2, and A3 on the Revenue corpus.
Results Table 3.1 lists the precision, recall, and run-time of each pipeline based on A1, A2, or A3. In all cases, the application of the paradigms does not change the effectiveness of the employed algorithm set.6 Both precision and recall significantly increase from A1 to A2 and from A2 to A3, leading to F1-scores of 0.60, 0.65, and 0.67, respectively. These values match the hypothesis that deeper analysis supports higher effectiveness.
Paradigms (b) to (c) reduce the average run-time of A1 from 3.23 ms per sentence of Π1^(a) to t(Π1*) = 2.44 ms. Π1* is indeed the fastest pipeline, but only at a low confidence level according to the standard deviations. The efficiency gain under A1 is largely due to early filtering and lazy evaluation. In contrast, the benefit of optimal scheduling becomes obvious for A2 and A3. Most significantly, Π3* clearly outperforms all other pipelines with t(Π3*) = 10.19 ms. It requires less than one fourth of the run-time of the pipeline after lazy evaluation, Π3^(c), and even less than one sixteenth of the run-time of Π3^(a).
In Figure 3.5(a), we plot the run-times of all considered pipelines as a
function of their effectiveness in order to stress the efficiency impact of the
four paradigms. The shown interpolated curves have the shape sketched in
Figure 3.2. While they grow more rapidly under increasing F1 -score, only a
moderate slope is observed after optimal scheduling. For A2, Figure 3.5(b) illustrates the main effects of the paradigms on the employed algorithms: Dependency parsing (pde1) takes about 90% of the run-time of both ⟨A2, π^(a)⟩ and ⟨A2, π^(b)⟩. Lazy evaluation then postpones pde1, reducing the run-time to one third. The same relative gain is achieved by optimal scheduling, resulting in ⟨A2, π2*⟩ where pde1 takes less than half of the total run-time.
6 In (Wachsmuth et al., 2011), we report on a small precision loss for A2, which we there assumed to emanate from noise of algorithms that operate on the token level. Meanwhile, we have found out that the actual reason was an implementation error, which is now fixed.
Figure 3.5: (a) Visualization of the run-time per sentence of each considered pipeline ⟨Ai, πj⟩ on the test set of the Revenue corpus at the levels of effectiveness represented by A1, A2, and A3. (b) Run-time per sentence of each algorithm A ∈ A2 on the test set depending on the schedule πj of the respective pipeline ⟨A2, πj⟩.
are needed that can (1) choose a complete algorithm set on their own and
(2) perform filtering for arbitrary text analysis tasks. We address these is-
sues in the remainder of Chapter 3. On this basis, Chapter 4 then turns our
view to the raised optimization problem of pipeline scheduling.
The rationale behind this process-oriented view is that all text analyses can
largely be operationalized as an annotation of input texts (cf. Section 2.2).
Hence, we can model general expert knowledge about text analysis processes in an ontology that is irrespective of the given task, whereas both the input texts to be processed and the information need to be addressed depend on the task. Such an annotation task metamodel serves as an upper ontology that is extended by concrete types of knowledge in a task at hand.
In particular, we model three aspects of the universe of annotation tasks:
1. The information to be annotated,
2. the analysis to be performed for annotation, and
3. the quality to be achieved by the annotation.
Each aspect subsumes different abstract concepts, each of which is instanti-
ated by the concrete concepts of the text analysis task at hand. Since OWL-
DL integrates types and instances within one model, such an instantiation
can be understood as an extension of the metamodel. Figure 3.6 illustrates
the complete annotation task metamodel as an RDF graph. In the following,
we discuss the rationale and representation of all shown concepts in detail.
For a concise presentation of limited complexity and for lack of other re-
quirements, we define only some concepts formally.
Figure 3.6: The proposed metamodel of the expert knowledge that is needed for
addressing annotation tasks, given in the form of an RDF graph. Black arrowheads
denote has relations and white arrowheads subclass relations. The six non-
white abstract concepts are instantiated by concrete concepts in an application.
Using the notion of active features and value constraints, we formally define the abstract information type to be found in annotation tasks as follows:
Information Type A set of instances of an annotation type denotes an in-
formation type C if it contains all instances that meet two conditions:
1. Active Feature. The instances in C either have no active feature or
they have the same single active feature.
2. Constraints. The instances in C fulfill the same value constraints.
By defining an information type C to have at most one active feature, we obtain a normalized unit of information in annotation tasks. I.e., every information need can be stated as a set of information types C = {C1, . . . , Ck}, meaning a conjunction C1 ∧ . . . ∧ Ck with k ≥ 1, as defined in Section 2.2. In this regard, we can denote the above-sketched example information need from InfexBA as {Forecast, Forecast.organization = Google}, where Forecast is a concrete annotation type with a feature organization.
With respect to an information type, the internal operations of a text analysis algorithm that infers this type from a text do not matter, but only the algorithm's behavior in terms of the input types it requires and the output types it produces. The actual quality of an algorithm (say, its efficiency and/or effectiveness) in processing a collection or a stream of texts is, in general, unknown beforehand. For many algorithms, quality estimations are known from evaluations, though. Formally, our abstract concept of an algorithm in the center of Figure 3.6 hence has the following properties:
Algorithm Let C be a set of information types and Q a set of quality criteria. Then an algorithm A is a 3-tuple ⟨C^(in), C^(out), q⟩ with C^(in) ≠ C^(out) and

1. Input types. C^(in) ⊆ C is a set of input information types,

2. Output types. C^(out) ⊆ C is a set of output information types, and

3. Quality estimations. q ∈ (Q1 ∪ {⊥}) × . . . × (Q|Q| ∪ {⊥}) contains one value qi for each quality criterion Qi ∈ Q. qi defines a quality estimation or it is unknown, denoted as ⊥.
Different from frameworks like Apache UIMA, the definition does not al-
low equal input and output types, which is important for ad-hoc pipeline
construction. We come back to this disparity in Section 3.3.
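For illustration, this abstract algorithm concept can be rendered as a small Python data structure; the names and quality criteria below are hypothetical, and the unknown value ⊥ is modeled as None:

```python
# A minimal Python rendering of the abstract algorithm concept: a
# 3-tuple of input types, output types, and quality estimations, where
# an unknown estimation (the bottom value in the definition) is None.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Algorithm:
    name: str
    inputs: frozenset                        # C^(in)
    outputs: frozenset                       # C^(out)
    quality: dict = field(default_factory=dict, compare=False)

    def __post_init__(self):
        # Unlike in Apache UIMA, equal input and output types are not allowed.
        assert self.inputs != self.outputs

eti = Algorithm("eti", frozenset({"Token"}), frozenset({"Time"}),
                {"run-time per sentence": 0.2, "recall": None})
```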
Now, assume that an algorithm has produced instances of an output type C ∈ C (say, Organization) for an information need C. As discussed in Section 3.1, a means to improve efficiency is early filtering, i.e., to further analyze only portions of text that contain instances of C and that, hence, may be relevant for C. Also, portions can be excluded from consideration if they span only instances that do not fulfill some value constraint in C (say, organization = Google). For such purposes, we introduce the notion of filters, which discard portions of an input text that do not meet some checked value constraint and, thus, filter the others. We formalize filters as follows:
Filter Let C be a set of information types. Then a filter is an algorithm A^(F) that additionally defines a 2-tuple ⟨C^(F), q^(F)⟩ such that

1. Value constraints. C^(F) ⊆ C is the set of value constraints of A^(F),

2. Selectivity estimations. q^(F) ∈ [0, 1]* is a vector of selectivity estimations of A^(F), where each estimation refers to a set of input types.
In line with our case study in Section 3.1, the definition states that a filter entails certain selectivities, which depend on the given input types. Selectivities, however, strongly depend on the processed input texts, as we observed in (Wachsmuth and Stein, 2012). Therefore, reasonable selectivity estimations can only be obtained during analysis and then assigned to a given filter.
Filters can be created on-the-fly for information types. A respective filter then has a single input type in C^(in) that equals its output type in C^(out) except that C^(out) additionally meets the filter's value constraints. We use filters in Section 3.3 in order to improve the efficiency of text analysis pipelines. In Sections 3.4 and 3.5, we outsource filtering into an input control, which makes an explicit distinction of filters obsolete.
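Such on-the-fly filter creation can be sketched as follows (an illustrative Python fragment; the portion records and the value constraint are hypothetical):

```python
# Sketch of on-the-fly filter creation for an information type: the
# resulting filter keeps exactly those text portions that contain an
# instance of the type satisfying the checked value constraint.

def create_filter(info_type, constraint=lambda instance: True):
    def filter_algorithm(portions):
        return [p for p in portions
                if any(constraint(i) for i in p.get(info_type, []))]
    return filter_algorithm

org_filter = create_filter(
    "Forecast", lambda i: i.get("organization") == "Google")

kept = org_filter([
    {"Forecast": [{"organization": "Google"}]},
    {"Forecast": [{"organization": "Apple"}]},
    {},  # no Forecast instance at all
])
```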
Figure 3.7: Excerpt from the annotation task ontology that we considered in the
project ArguAna. The shown concrete concepts instantiate the abstract concepts
of the annotation task metamodel from Figure 3.6.
The quality criteria directly imply possible quality prioritizations. Here, the
only quality prioritization assigns Prio 1 to accuracy. A more sophisticated
quality model follows in the evaluation in Section 3.3.
pipelinePartialOrderPlanning(C0, C, ρ, A_Ω)
 1:  Algorithm set A ← {A0}
 2:  Partial schedule π̃ ← ∅
 3:  Agenda Γ ← {⟨C, A0⟩ | C ∈ C \ C0}
 4:  while Γ ≠ ∅ do
 5:      Input requirement ⟨C, A⟩ ← Γ.poll()
 6:      if C ∈ C then
 7:          Filter A^(F) ← createFilter(C)
 8:          A ← A ∪ {A^(F)}
 9:          π̃ ← π̃ ∪ {(A^(F) < A)}
10:          ⟨C, A⟩ ← ⟨A^(F).C^(in).poll(), A^(F)⟩
11:      Algorithm A* ← selectBestAlgorithm(C, C, ρ, A_Ω)
12:      if A* = ⊥ then return ⊥
13:      A ← A ∪ {A*}
14:      π̃ ← π̃ ∪ {(A* < A)}
15:      Γ ← Γ ∪ {⟨C*, A*⟩ | C* ∈ A*.C^(in) \ C0}
16:  return ⟨A, π̃⟩

Pseudocode 3.1: Partial order planning for selecting an algorithm set A (with a partial schedule π̃) that addresses a planning problem ⟨C0, C, ρ, A_Ω⟩.
While filters reduce the input to be processed, they do not remove information types from the current state, thus never preventing subsequent algorithms from being applicable (Dezsényi et al., 2005). Consequently, the
preconditions of an algorithm will always be satisfied as soon as they are
satisfied once. Partial order planning follows a least commitment strategy,
which leaves the ordering of the actions as open as possible. Therefore, it
is, in many cases, a very efficient planning variant (Minton et al., 1995).
Pseudocode 3.1 shows our partial order planning approach to algorithm selection. Given a planning problem, the approach creates a complete algorithm set A together with a partial schedule π̃. Only to initialize planning, a helper finish algorithm A0 is first added to A. Also, the planning agenda Γ is derived from the information need C and the initial state C0 (pseudocode lines 1 to 3). Γ stores each open input requirement, i.e., a single precondition to be satisfied together with the algorithm it refers to. As long as open input requirements exist, lines 4 to 15 iteratively update the planning agenda while inserting algorithms into A and respective ordering constraints into π̃. In particular, line 5 retrieves an input requirement ⟨C, A⟩ from Γ using the method poll(). If C contains C, a filter A^(F) is created and integrated on-the-fly (lines 6 to 9). According to Section 3.2, A^(F) discards all portions of text that do not comprise instances of C. After replacing ⟨C, A⟩ with the input requirement of A^(F), line 11 selects an algorithm A* ∈ A_Ω that produces C and that is best in terms of the quality prioritization ρ. If any C cannot be
selectBestAlgorithm(C, C, ρ, A_Ω)
 1:  Algorithm set A_C ← {A ∈ A_Ω | C ∈ A.C^(out)}
 2:  if |A_C| = 0 then return ⊥
 3:  for each Quality criterion Qi with i from 1 to |ρ| do
 4:      Algorithm set A_C* ← ∅
 5:      Quality estimation q* ← Qi.worst()
 6:      for each Algorithm A ∈ A_C do
 7:          Quality estimation q ← estimateQuality(A, Qi, C, A_Ω)
 8:          if Qi.isEqual(q, q*) then A_C* ← A_C* ∪ {A}
 9:          if Qi.isBetter(q, q*) then
10:              A_C* ← {A}
11:              q* ← q
12:      A_C ← A_C*
13:      if |A_C| = 1 then return A_C.poll()
14:  return A_C.poll()
satisfied, planning fails (line 12) and does not reach line 16 to return a partially ordered pipeline ⟨A, π̃⟩.
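A simplified, non-authoritative sketch of this planning step (backward chaining over input requirements, without filters and without quality prioritization) may look as follows in Python:

```python
# Simplified sketch of algorithm selection by backward chaining: open
# input requirements are polled from an agenda and satisfied by some
# producing algorithm until none remain. The toy repository maps an
# algorithm name to its (input types, output types).
from collections import deque

def plan(c0, need, repository):
    """Return (selected algorithms, ordering constraints) or None."""
    selected, order = set(), set()
    agenda = deque((c, None) for c in need - c0)
    while agenda:
        ctype, consumer = agenda.popleft()
        producer = next((name for name, (inp, out) in repository.items()
                         if ctype in out), None)
        if producer is None:
            return None                      # planning fails
        order.add((producer, consumer))      # producer before consumer
        if producer not in selected:
            selected.add(producer)
            agenda.extend((c, producer)
                          for c in repository[producer][0] - c0)
    return selected, order

repo = {"sse": ({"Text"}, {"Sentence"}),
        "sto": ({"Sentence"}, {"Token"}),
        "eti": ({"Token"}, {"Time"})}
result = plan({"Text"}, {"Time"}, repo)
```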
Different from (Wachsmuth et al., 2013a), we also present the method selectBestAlgorithm in detail here, shown in Pseudocode 3.2. The underlying process was originally defined by Rose (2012). Lines 1 and 2 check if algorithms exist that produce the given precondition C. The set A_C of these algorithms is then compared subsequently for each quality criterion Qi in ρ (lines 3 to 13) in order to determine the set A_C* of all algorithms with the best quality estimation q* (initialized with the worst possible value of Qi in lines 4 and 5). To build A_C*, lines 6 to 11 iteratively compare the quality estimation q of each algorithm A in A_C with respect to Qi. Only possibly best algorithms are kept (line 12). In case only one algorithm remains for any Qi, it constitutes the single best algorithm (line 13). Otherwise, any best algorithm is eventually returned in line 14.
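This criterion-by-criterion narrowing can be sketched as follows; the candidate names and quality values are illustrative only:

```python
# Sketch of lexicographic algorithm selection over a quality
# prioritization: candidates are narrowed criterion by criterion, and a
# remaining tie is broken arbitrarily.

def select_best(candidates, prioritization):
    """prioritization: list of (criterion, higher_is_better) pairs."""
    if not candidates:
        return None
    for criterion, higher_is_better in prioritization:
        scores = [c["quality"].get(criterion) for c in candidates]
        known = [s for s in scores if s is not None]
        if not known:
            continue  # criterion unknown for all candidates
        best = max(known) if higher_is_better else min(known)
        candidates = [c for c, s in zip(candidates, scores) if s == best]
        if len(candidates) == 1:
            break
    return candidates[0]

rre1 = {"name": "rre1", "quality": {"f1": 0.60, "ms": 0.9}}
rre2 = {"name": "rre2", "quality": {"f1": 0.67, "ms": 2.5}}
best = select_best([rre1, rre2], [("f1", True), ("ms", False)])
```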
Finally, Pseudocode 3.3 estimates the quality q of an algorithm A from the repository A_Ω. q is naturally based on the quality estimation A.qi of A. For lack of better alternatives, we assume q to be the worst possible value of Qi whenever A.qi is not specified. If values of Qi cannot be aggregated (cf. Section 3.2), q simply equals A.qi (lines 1 to 3). Elsewise, lines 4 to 9 recursively aggregate the quality estimations q_C^(in) of all best algorithms producing
estimateQuality(A, Qi, C, A_Ω)
 1:  Quality estimation q ← A.qi
 2:  if q = ⊥ then q ← Qi.worst()
 3:  if Qi has no aggregate function then return q
 4:  for each Information type C^(in) ∈ A.C^(in) \ C do
 5:      Quality estimation q_C^(in) ← Qi.worst()
some information type C^(in) ∉ C. The reason is that other algorithms will be succeeded by a filter in the partial schedule π̃ (cf. Pseudocode 3.1). Since filters change the input to be processed, it seems questionable to aggregate quality estimations of algorithms before and after filtering.
greedyPipelineLinearization(A, π̃)
 1:  Algorithm set A' ← ∅
 2:  Schedule π ← ∅
 3:  while A' ≠ A do
 4:      Filtering stages Ψ ← ∅
 5:      for each Filter A^(F) ∈ {A ∈ A \ A' | A is a filter} do
 6:          Algorithm set A^(F) ← {A^(F)} ∪ getPredecessors(A \ A', π̃, A^(F))
 7:          Schedule π^(F) ← getAnyCorrectTotalOrdering(A^(F), π̃)
 8:          Estimated run-time q(⟨A^(F), π^(F)⟩) ← Σ_{A ∈ A^(F)} t(A)
 9:          Ψ ← Ψ ∪ {⟨A^(F), π^(F)⟩}
10:      Filtering stage ⟨Aj, πj⟩ ← argmin_{⟨A^(F), π^(F)⟩ ∈ Ψ} q(⟨A^(F), π^(F)⟩)
11:      π ← π ∪ πj ∪ {(A < Aj) | A ∈ A' ∧ Aj ∈ Aj}
12:      A' ← A' ∪ Aj
13:  return ⟨A, π⟩
from A_Ω. Similarly, we do not pay attention to algorithms with a circular dependency. As an example, assume that we have (1) a tokenizer sto, which requires C_sto^(in) = {Sentence} as input and produces C_sto^(out) = {Token} as output, and (2) a sentence splitter sse with C_sse^(in) = {Token} and C_sse^(out) = {Sentence}. Given each of them is the best to satisfy the other's precondition, these algorithms would be repeatedly added to the set of selected algorithms A in an alternating manner. A solution to avoid circular dependencies is to ignore algorithms whose input types are output types of algorithms already added to A. However, this might cause situations where planning fails, even though a valid pipeline would have been possible. Here, we leave more sophisticated solutions to future work. In the end, the described problem might be realistic, but it is in our experience far from common.
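For illustration, linearizing ordering constraints into a total schedule, including the detection of such circular dependencies, can be sketched via Kahn's algorithm:

```python
# Sketch of linearizing ordering constraints into a total schedule via
# Kahn's algorithm: repeatedly emit an algorithm without unscheduled
# predecessors; if none is ready although algorithms remain, the
# constraints contain a circular dependency (as in the sto/sse example).

def linearize(algorithms, constraints):
    preds = {a: set() for a in algorithms}
    for before, after in constraints:
        preds[after].add(before)
    schedule = []
    while preds:
        ready = sorted(a for a, p in preds.items() if not p)
        if not ready:
            return None  # circular dependency, no admissible schedule
        nxt = ready[0]
        del preds[nxt]
        for p in preds.values():
            p.discard(nxt)
        schedule.append(nxt)
    return schedule

sched = linearize({"sse", "sto", "eti"}, {("sse", "sto"), ("sto", "eti")})
cycle = linearize({"sse", "sto"}, {("sse", "sto"), ("sto", "sse")})
```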
Theorem 3.1. Let ⟨C0, C, ρ, A_Ω⟩ be a planning problem with a consistent algorithm repository A_Ω that does not contain circular dependencies. Then pipelinePartialOrderPlanning(C0, C, ρ, A_Ω) returns a complete algorithm set A for C \ C0 iff such an algorithm set exists in A_Ω.
Proof. We provide only an informal proof here, since the general correctness of partial order planning is known from the literature (Minton et al., 1995). The only case where planning fails is when selectBestAlgorithm finds no algorithm in A_Ω that satisfies C. Since A_Ω is consistent, this can happen only if C ∈ C holds. Then, C \ C0 must indeed be unsatisfiable using A_Ω. If C \ C0 is satisfiable using A_Ω, selectBestAlgorithm always returns an algorithm that satisfies C by definition of A_C. It remains to be shown that Pseudocode 3.1 returns a complete algorithm set A for C \ C0 then. Without circular dependencies in A_Ω, the while-loop in lines 4 to 15 always terminates, because (1) the number of input requirements added to Γ is finite and (2) an input requirement is removed from Γ in each iteration. As all added input requirements are satisfied, each algorithm in A works properly, while the initialization of Γ in line 3 ensures that all information types in C \ C0 are produced. Hence, A is complete and, so, Theorem 3.1 is correct.
Theorem 3.2. Let ⟨A, π̃⟩ be a partially ordered pipeline returned by pipelinePartialOrderPlanning for a planning problem ⟨C0, C, ρ, A_Ω⟩. Then greedyPipelineLinearization(A, π̃) returns an admissible pipeline ⟨A, π⟩ for C \ C0.
Proof. To prove Theorem 3.2, we first show the termination of greedyPipelineLinearization. The guaranteed total order in π then follows from induction over the length of π. According to the pseudocode of our planner (Pseudocode 3.1), each text analysis algorithm in A is a predecessor of at least one filter in A. Since all predecessors of the filter in the filtering stage ⟨Aj, πj⟩, chosen in line 10 of greedyPipelineLinearization, belong
3.3 Ad-hoc Construction via Partial Order Planning
Figure 3.9: Sketch of the worst-case number O(|A_Ω|^|C_Ω|) of calls of the method estimateQuality for a given algorithm A, visualized by the algorithms that produce a required information type and, thus, lead to a recursive call of the method.
This process is reflected in Figure 3.9. Analogous to our argumentation for the preconditions, the maximum recursion depth is |C_Ω|, which implies a total number of O(|A_Ω|^|C_Ω|) executions of estimateQuality. Therefore, we obtain the asymptotic worst-case overall run-time

    t_pipelinePartialOrderPlanning(C_Ω, A_Ω) = O(|C_Ω| · |A_Ω|^|C_Ω|).        (3.6)
This estimation seems problematic for large type systems C_Ω and algorithm repositories A_Ω. In practice, however, both the while-loop iterations and the
recursion depth are governed rather by the number of information types in
the information need C. Moreover, the recursion (which causes the main
factor in the worst-case run-time) assumes the existence of aggregate func-
tions, which will normally hold for efficiency criteria only. With respect
to algorithms, the actual influencing factor is the number of algorithms
that serve as preprocessors, called the branching factor in artificial intelli-
gence (Russell and Norvig, 2009). The average branching factor is limited
by C again. Additionally, it is further reduced through the disregard of
algorithms that allow for filtering (cf. line 6 in Pseudocode 3.3).
Given the output ⟨A, π̃⟩ of the planner, the run-time of greedyPipelineLinearization depends on the number of algorithms in A. Since the while-loop in Pseudocode 3.4 adds algorithms to the helper algorithm set A', it iterates O(|A|) times (cf. the proof of Theorem 3.2). So, the driver of the asymptotic run-time is not the number of loop iterations, but the computation of a transitive closure for getPredecessors, which typically takes O(|A|³) operations (Cormen et al., 2009). As mentioned above, the computation needs to be performed only once. Thus, we obtain a worst-case run-time of

    t_greedyPipelineLinearization(A) = O(|A|³).        (3.7)
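The transitive closure underlying getPredecessors can be sketched with Warshall's algorithm, whose triple loop over the algorithms accounts for the O(|A|³) bound; the names below are from the sto/sse toy example:

```python
# The predecessors of a filter are obtained from the transitive closure
# of the ordering constraints; Warshall's algorithm computes it with a
# triple loop over the nodes.

def transitive_closure(nodes, edges):
    reach = {a: {b for (x, b) in edges if x == a} for a in nodes}
    for k in nodes:                # intermediate node
        for i in nodes:
            if k in reach[i]:      # i reaches k, so i reaches k's targets
                reach[i] |= reach[k]
    return reach

reach = transitive_closure(["sse", "sto", "eti"],
                           {("sse", "sto"), ("sto", "eti")})
predecessors_of_eti = {a for a in reach if "eti" in reach[a]}
```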
Figure 3.10: A UML-like class diagram that shows the three-tier architecture of our expert system Pipeline XPS for ad-hoc pipeline construction and execution.
Figure 3.11: Visualization of the built-in quality model of our expert system. The
colored, partially labeled circles denote possible quality prioritizations. Exemplar-
ily, the shown implies relations illustrate that some prioritizations imply others.
The resulting descriptor files comprise all knowledge required by our expert system for ad-hoc pipeline construction, which we call Pipeline XPS.
Figure 3.10 sketches the three-tier architecture of Pipeline XPS in a UML-
like class diagram notation (OMG, 2011). As usual for expert systems, the
architecture separates the user interface from the inference engine, and both
of them from the knowledge base. In accordance with Section 3.2, the latter
stores all domain-specific expert knowledge in an annotation task ontology
realized with OWL-DL. Via a knowledge acquisition component, users (typi-
cally experts) can trigger an automatic ontology import that creates an algo-
rithm repository and a type system from a set of descriptor files. Conversely,
we decided to rely on a predefined quality model, both for lack of specified
quality criteria in Apache UIMA (cf. Section 3.2) and for convenience: since
the set of quality criteria is rather stable in text analysis, users thereby only
rarely need to deal with ontology specifications, if at all.
The quality model that we provide is visualized in Figure 3.11. It contains
six criteria from Section 2.2, one for efficiency (i.e., run-time per sentence) and
five for effectiveness (e.g., accuracy). Possible quality prioritizations are rep-
resented by small circles, some of which are labeled for illustration.
Table 3.2: Each information need C from the InfexBA and Genia tasks for which
we evaluate the run-time of ad-hoc pipeline construction, as well as the number of
algorithms |A| in the respective resulting text analysis pipeline Π = ⟨A, π⟩.
Figure 3.12: The run-time of our expert system on a standard computer for ad-hoc
pipeline construction in total as well as for algorithm selection and scheduling
alone, each as a function of the number of algorithms in the constructed pipeline.
The algorithms are selected from a repository of either 76 (solid curves) or 38
algorithms (dashed) and target event types from (a) InfexBA or (b) Genia.
all cases. While we construct pipelines once based on Ω1 and once based on
Ω2, the information needs lead to the same selected algorithm sets for both
ontologies. Thereby, we can directly compare the run-time of our expert
system under Ω1 and Ω2. The cardinalities of the algorithm sets are listed
in the right column of Table 3.2.
Figure 3.12 plots interpolated curves of the run-time of our expert sys-
tem as a function of the number of algorithms in the resulting pipeline. For
simplicity, we omit the standard deviations, which range between 3.6 ms
and 9.8 ms for pipeline construction in total, with proportionally lower
values for algorithm selection and scheduling. Even on the given, far from
up-to-date standard computer, both the algorithm selection via partial or-
der planning and the scheduling via greedy linearization take only a few
milliseconds for all information needs. The remaining run-time of pipeline
construction is spent on operations such as the creation of Apache UIMA de-
scriptor files. In contrast to the asymptotic worst-case run-times computed
above, the measured run-times seem to grow only linearly in the number of
employed text analysis algorithms in practice, although there is some noise
in the depicted curves because of the high deviations.
As expected from theory, the size of the algorithm repositories has only
a small effect on the run-time, since the decisive factor is the number of al-
gorithms available for each required text analysis. Accordingly, scheduling
does not depend on the size of the repository at all. Altogether, our expert
system takes at most 26 ms for pipeline construction in all cases, and this ef-
3 Pipeline Design 109
Table 3.3: The run-time t in ms per sentence on the test set of the Revenue corpus,
averaged over ten runs with standard deviation σ, as well as the precision p and
recall r of each text analysis pipeline resulting from one of the evaluated quality
prioritizations for the information need Revenue(Time, Money).
Figure 3.13: Abstract view of the overall approach of this thesis (cf. Figure 1.5). Sec-
tions 3.4 to 3.6 address the extension of a text analysis pipeline by an input control.
Google's ad revenues are going to reach $20B. The search company was founded in 1998.
Figure 3.14: A sample text with instances of information types associated to a fi-
nancial event and a foundation relation. One piece of information (the time) of the
financial event is missing, which can be exploited to filter and analyze only parts
of the text.
Figure 3.15: Modeling expert knowledge of filtering tasks: (a) A query defines the
relevance of a portion of text. (b) A scoped query specifies the degrees of filtering.
(c) The scoped query implies a dependency graph for the relevant information types.
Figure 3.16: The annotations (bottom) and the relevant portions (top) of a sample
text. For the query γ1 = Founded(Organization, Time), the only relevant portion of
text is dp2 on the paragraph level and ds2 on the sentence level, respectively.
[Figure 3.17 shows (a) the dependency graph with the degrees of filtering Sentence
and Paragraph and (b) the sample text "GOOGLE NEWS. 2014 ad revenues predicted.
Forecasts promising: Google, founded in 1998, hits $20B in 2014.", segmented into
the sentences ds1 to ds4.]
Figure 3.17: (a) The dependency graph of the scoped query γ4 = γ1 ∧ γ3. (b) The
scopes of a sample text associated to the degrees of filtering in γ4. They store the
portions of text that are relevant for γ4 after all text analyses have been performed.
The input control models the relevance of each portion of text using an in-
dependent set of propositional formulas. In a formula, every propositional
symbol represents an assumption about the portion of text, i.e., the assumed
existence of an information type or the assumed fulfillment of the scoped
query γ or a conjunction in γ. The formulas themselves denote justifica-
tions. A justification is an implication in definite Horn form whose conse-
quent corresponds to the fulfillment of a query or conjunction, while the
antecedent consists of the assumptions under which the fulfillment holds.
Concretely, the following formulas are defined initially. For each portion
of text d that is associated to an outer conjunction CS[CR(C1, …, Ck)] in γ,
we denote the relevance of d with respect to the scoped query γ as γ^(d) and
we let the input control model its justification as ψ^(d):

ψ^(d) :  CR^(d) ∧ C1^(d) ∧ … ∧ Ck^(d) → γ^(d)    (3.8)

Additionally, the input control defines a justification of the relevance Ci^(d) of
the portion d with respect to each inner conjunction of the outer conjunction
CS[CR(C1, …, Ck)] that has the form Ci = C′S[C′R(Ci1, …, Cil)]. Based on
the portions of text associated to the degree of filtering of Ci, we introduce
a formula ψi^(d′) for each such portion of text d′:

ψi^(d′) :  C′R^(d′) ∧ Ci1^(d′) ∧ … ∧ Cil^(d′) → Ci^(d)    (3.9)

This modeling step is repeated recursively until each child node Cij^(d′) in a
new formula ψi^(d′) represents either an entity type CE or a relation type CR.
As a result, the set of all formulas ψ^(d) and ψi^(d′) of all portions of text defines
what can initially be believed for the respective input text.
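Definite Horn formulas of this kind can be evaluated by simple forward chaining. The following Python sketch merely illustrates the principle; the string encoding of symbols (e.g., "g4(ds2)" standing for γ4^(ds2)) is an assumption made for readability, not the representation used in the thesis:

```python
def believed(formulas, assumptions):
    """Forward chaining over definite Horn clauses.

    A formula is a pair (antecedent, consequent) with the antecedent
    given as a set of propositional symbols; assumptions is the set of
    symbols currently believed (e.g., the recognized information types
    per portion of text). Returns all symbols derivable from them.
    """
    derived = set(assumptions)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in formulas:
            if consequent not in derived and antecedent <= derived:
                derived.add(consequent)
                changed = True
    return derived

# Formula (3.8) instantiated for one sentence of the running example:
formulas = [
    ({"Founded(ds2)", "Organization(ds2)", "Time(ds2)"}, "g4(ds2)"),
]
facts = {"Founded(ds2)", "Organization(ds2)", "Time(ds2)"}
```

With all three entity and relation instances believed, the relevance symbol for the sentence is derived; as soon as one assumption is missing, it is not.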
To give an example, we look at the sample text from Figure 3.17(b). The
scoped query γ4 to be addressed has two outer conjunctions, γ1 and γ3,
with the degrees of filtering Sentence and Paragraph, respectively. For the
four sentences and two paragraphs of the text, we have six formulas:

ψ^(ds1) :  Founded(ds1) ∧ Organization(ds1) ∧ Time(ds1) → γ4^(ds1)
ψ^(ds2) :  Founded(ds2) ∧ Organization(ds2) ∧ Time(ds2) → γ4^(ds2)
ψ^(ds3) :  Founded(ds3) ∧ Organization(ds3) ∧ Time(ds3) → γ4^(ds3)
ψ^(ds4) :  Founded(ds4) ∧ Organization(ds4) ∧ Time(ds4) → γ4^(ds4)
ψ^(dp1) :  Financial(dp1) ∧ Money(dp1) ∧ γ2^(dp1) → γ4^(dp1)
ψ^(dp2) :  Financial(dp2) ∧ Money(dp2) ∧ γ2^(dp2) → γ4^(dp2)

In the case of the latter two formulas, the relevance depends on the inner
conjunction γ2 of γ4, for which we define four additional formulas:
ψ2^(ds1) :  Forecast(ds1) ∧ Anchor(ds1) ∧ Time(ds1) → γ2^(dp1)
ψ2^(ds2) :  Forecast(ds2) ∧ Anchor(ds2) ∧ Time(ds2) → γ2^(dp2)
ψ2^(ds3) :  Forecast(ds3) ∧ Anchor(ds3) ∧ Time(ds3) → γ2^(dp2)
ψ2^(ds4) :  Forecast(ds4) ∧ Anchor(ds4) ∧ Time(ds4) → γ2^(dp2)

The antecedents of these formulas consist of entity and relation types only,
so no further formula needs to be added. Altogether, the relevance of the
six distinguished portions of the sample text is hence initially justified by
the ten defined formulas.
After each text analysis, the formulas of a processed input text must be
updated, because their truth depends on the set of currently believed as-
sumptions, which follows from the output of all text analysis algorithms
applied so far. Moreover, the set of current formulas implies whether a
portion of text must be processed by a specific text analysis algorithm or
not. In particular, an algorithm can cause a change of only those formulas
that include an output type of the algorithm. At the end of the text analy-
sis process, then, whatever formula remains must be the truth, just in the
sense of this chapter's introductory quote by Arthur Conan Doyle.
Here, by truth, we mean that the respective portions of text are relevant
with respect to the scoped query γ to be addressed. To maintain the rel-
evant portions of an input text, we have already introduced the concept
of scopes that are associated to the degrees of filtering in the dependency
graph Γ of γ. Initially, these scopes span the whole input text. Updating
the formulas then means to filter the scopes according to the output of a text
analysis algorithm. Similarly, we can restrict the analysis of that algorithm
to those portions of text its output types are relevant for. In the following,
we discuss how to perform these operations.
updateScopes(C(out))
1: for each Information type C(out) in C(out) do
2:   if C(out) is a degree of filtering in the dependency graph Γ then
3:     generateScope(C(out))
4: Scopes S ← getRelevantScopes(C(out))
5: for each Scope S in S do
6:   Information types C ← all C ∈ C(out) to which S is assigned
7:   for each Portion of text d in S do
8:     if not d contains an instance of any C ∈ C then S.remove(d)
9:   Scope S0 ← Γ.getRootScope(S)
10:  if S0 ≠ S then
11:    for each Portion of text d in S0 do
12:      if not d intersects with S then S0.remove(d)
13:  Scopes S′ ← Γ.getAllDescendantScopes(S0)
14:  for each Scope S′′ in S′ with S′′ ≠ S do
15:    for each Portion of text d in S′′ do
16:      if not d intersects with S0 then S′′.remove(d)
Pseudocode 3.5: Update of scopes based on the set of output types C(out) of a text
analysis algorithm and the produced instances of these types. An update may lead
both to the generation and to the filtering of the affected scopes.
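The core filtering steps of Pseudocode 3.5 can be illustrated in a few lines of Python. The sketch below makes simplifying assumptions that are not part of the pseudocode: a scope is a plain list of character intervals (start, end), and annotation instances are given as (start, end) offsets:

```python
def filter_scope(scope, instances):
    """Lines 5-8, sketched: keep only those portions of a scope that
    contain an instance of a relevant output type."""
    def contains(portion, inst):
        return portion[0] <= inst[0] and inst[1] <= portion[1]
    return [d for d in scope if any(contains(d, i) for i in instances)]

def filter_dependent_scope(dependent, retained):
    """Lines 9-16, sketched: a dependent scope keeps only the portions
    that intersect with a retained portion of the filtered scope."""
    def intersects(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return [d for d in dependent if any(intersects(d, r) for r in retained)]

# Four sentences as intervals; a time entity occurs only in ds2 and ds4:
sentences = [(0, 12), (13, 40), (41, 60), (61, 95)]
time_entities = [(20, 24), (70, 74)]
kept_sentences = filter_scope(sentences, time_entities)
# Paragraphs that intersect with a kept sentence remain relevant:
paragraphs = [(0, 40), (41, 95)]
kept_paragraphs = filter_dependent_scope(paragraphs, kept_sentences)
```

After time recognition, only the second and fourth sentence survive the first filter, and each paragraph survives the second filter only if it still intersects a surviving sentence.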
tified by replacing them with true and, consequently, deleting them from
the antecedents of the formulas. In addition, updating a formula ψ^(d) re-
quires a recursive update of all formulas that contain the consequent of ψ^(d).
In the given case, the consequent γ2^(dp1) of ψ2^(ds1) becomes false, which is
why ψ^(dp1) also cannot hold anymore. This in turn could render the fulfill-
ment of further nested conjunctions useless. However, such conjunctions
do not exist in ψ^(dp1). Therefore, the following formulas remain:
ψ^(ds2) :  Founded(ds2) ∧ Organization(ds2) → γ4^(ds2)
ψ^(ds4) :  Founded(ds4) ∧ Organization(ds4) → γ4^(ds4)
ψ^(dp2) :  Financial(dp2) ∧ Money(dp2) ∧ γ2^(dp2) → γ4^(dp2)
ψ2^(ds2) :  Forecast(ds2) ∧ Anchor(ds2) → γ2^(dp2)
ψ2^(ds4) :  Forecast(ds4) ∧ Anchor(ds4) → γ2^(dp2)
We summarize that the output of a text analysis algorithm is used to fil-
ter not only the scopes analyzed by the algorithm, but also the dependent
scopes of these scopes. The set of dependent scopes of a scope S consists of
the scope S0 associated to the root of the degree of filtering CS of S in the
dependency graph Γ of γ as well as of each scope S′ of a descendant degree
of filtering of the root. This, of course, includes the scopes of all ancestor
degrees of filtering of CS besides the root.
getRelevantScopes(C(out))
1: Scopes S ← ∅
2: for each Degree of filtering CS in the dependency graph Γ do
3:   if Γ.getChildren(CS) ∩ C(out) ≠ ∅ then
4:     S.add(Γ.getScope(CS))
5:   else if getPredecessorTypes(Γ.getChildren(CS)) ∩ C(out) ≠ ∅ then
6:     S.add(Γ.getScope(CS))
7: return S
Pseudocode 3.6: Determination of the set S of all scopes that are relevant with
respect to the output types C(out) of a text analysis algorithm.
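Under analogous simplifying assumptions, Pseudocode 3.6 can be sketched as follows. The dictionaries standing in for the dependency graph Γ and for the predecessor types are hypothetical data structures chosen for illustration only:

```python
def get_relevant_scopes(dep_graph, scopes, predecessor_types, out_types):
    """Sketch of Pseudocode 3.6: a scope is relevant if an output type
    is a child of its degree of filtering, or if an output type is a
    (transitive) predecessor type of one of these children.

    dep_graph maps each degree of filtering to its child information
    types, scopes maps it to its scope object, and predecessor_types
    maps an information type to the preprocessing types it requires.
    """
    relevant = []
    for degree, children in dep_graph.items():
        if children & out_types:                          # lines 3-4
            relevant.append(scopes[degree])
        elif any(predecessor_types.get(c, set()) & out_types
                 for c in children):                      # lines 5-6
            relevant.append(scopes[degree])
    return relevant

# The degree Sentence requires Founded, Organization, and Time;
# part-of-speech tags are a predecessor type of Organization:
dep_graph = {"Sentence": {"Founded", "Organization", "Time"}}
scopes = {"Sentence": "sentence-scope"}
predecessor_types = {"Organization": {"Token", "PartOfSpeech"}}
```

A part-of-speech tagger thus gets the sentence scope even though its output type never appears in the scoped query itself.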
Pseudocode 3.5 shows how to update the scopes of an input text based
on the output types C(out) of a text analysis algorithm. To enable filtering,
all scopes must initially be generated by segmentation algorithms (e.g., by a
sentence splitter), i.e., algorithms with an output type C(out) that denotes a
degree of filtering in the dependency graph Γ. This is done in lines 1 to 3 of
the pseudocode, given that the employed pipeline schedules the according
algorithms first. Independent of the algorithm, the method getRelevant-
Scopes next determines the set S of scopes that are relevant with respect
to the output of the applied algorithm (line 4).18 For each scope S ∈ S, a
portion of text d is maintained only if it contains an instance of one of the
types C ∈ C relevant for S (lines 5 to 8). Afterwards, lines 9 to 12
remove all portions of text from the root scope S0 of S that do not intersect
with any portion of text in S. Accordingly, only those portions of text in the
set of descendant scopes S′ of S0 are retained that intersect with a portion
in S0 (lines 13 to 16).
getRelevantScopes is given in Pseudocode 3.6: A scope is relevant with
respect to C(out) if at least one of two conditions holds for the associated
degree of filtering: First, an information type from C(out) is a child of the
degree in the dependency graph Γ (lines 3 to 4). Second, an information
type from C(out) serves as the required input of another algorithm in the
employed pipeline (lines 5 to 6), i.e., it denotes a preprocessing type in the
sense discussed at the end of Section 3.4. For example, part-of-speech tags
are not specified in γ4, but they might be necessary for the type Organization.
determineUnifiedScope(C(out))
1: for each Information type C(out) in C(out) do
2:   if C(out) is a degree of filtering in the dependency graph Γ then
3:     return the whole input text
4: Scopes S ← getRelevantScopes(C(out))
5: Scope S∪ ← ∅
6: for each Scope S in S do
7:   for each Portion of text d in S do
8:     if not d intersects with S∪ then S∪.add(d)
9:     else S∪.merge(d)
10: return S∪
Pseudocode 3.7: Determination of the unified scope S∪ of all portions of text that
are relevant with respect to the output types C(out) of a text analysis algorithm.
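With scopes again represented as character intervals, the add/merge logic of determineUnifiedScope reduces to a standard interval union. The following Python sketch shows this under the same simplifying assumptions as before (it is an illustration, not the thesis implementation):

```python
def determine_unified_scope(relevant_scopes):
    """Sketch of the add/merge steps of Pseudocode 3.7: unify the
    portions of all relevant scopes so that overlapping parts of the
    input text are analyzed only once.

    Each scope is a list of character intervals (start, end).
    """
    portions = sorted(d for scope in relevant_scopes for d in scope)
    unified = []
    for start, end in portions:
        if unified and start <= unified[-1][1]:
            # Overlapping or adjacent portion: merge into the last one.
            unified[-1] = (unified[-1][0], max(unified[-1][1], end))
        else:
            unified.append((start, end))
    return unified

# A sentence-level scope and a paragraph-level scope that partly overlap:
sentence_scope = [(13, 40), (61, 95)]
paragraph_scope = [(0, 40)]
unified = determine_unified_scope([sentence_scope, paragraph_scope])
```

Sorting the portions first makes a single linear merge pass sufficient, which mirrors the incremental unification in lines 5 to 9 of the pseudocode.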
all input texts (and, hence, not all possible annotations) are relevant to fulfill
the information needs at hand. In the following, we restrict our view to
pipelines where both prerequisites hold. As for the pipeline construction
in Section 3.3, we look at the correctness and run-time of the developed
approaches. In (Wachsmuth et al., 2013c), we have sketched these properties
roughly, whereas we analyze them more formally here.
Correctness Concretely, we investigate the question whether the execution
of a pipeline that is equipped with an input control, which determines and
updates the scopes of an input text before each step of a text analysis pro-
cess (as presented), is optimal in that it analyzes only relevant portions of
text.19 As throughout this thesis, we consider only pipelines where no out-
put type is produced by more than one algorithm (cf. Section 3.1). Also,
for consistent filtering, we require all pipelines to schedule the algorithms
whose output is needed for generating the scopes of an input text before
any possible filtering takes place. Given these circumstances, we now show
the correctness of our algorithms for determining and updating scopes:

Lemma 3.1. Let a text analysis pipeline Π = ⟨A, π⟩ address a scoped query γ on
an input text D. Let updateScopes(C(out)) be called after each execution of an
algorithm A ∈ A on D with the output types C(out) of A. Then every scope S of D
associated to γ always contains exactly those portions of text that are currently
relevant with respect to γ.
Proof. We prove the lemma by induction over the number m of text analysis
algorithms executed so far. By assumption, no scope is generated before the
first algorithm has been executed. So, for m = 0, the lemma holds. We there-
fore hypothesize that each generated scope S contains exactly those portions
of text that can be relevant with respect to γ after the execution of an arbi-
trary but fixed number m of algorithms.
Now, by definition, line 4 of updateScopes determines all scopes S whose
portions of text need to span an instance of one of the output types C(out) of
the (m+1)-th algorithm in order to be relevant for γ.20 Every such portion d
in a scope S ∈ S is retained in lines 7 to 8. Because of the induction hypothe-
sis, d can still fulfill the conjunction C that its associated degree of filtering
is assigned to. Consequently, also the outer conjunction of C can still be ful-
filled by the portions of text that intersect with S in lines 11 to 12 (e.g., the
paragraph dp2 in Figure 3.17(b) remains relevant after time recognition, as it
intersects with the scope of Sentence[Forecast(Anchor, Time)]). If an outer con-
junction becomes false (as for dp1 after money recognition), the same holds
for all inner conjunctions, which is why the respective portions of text (here,
ds1 only) are removed in lines 14 to 16. Altogether, exactly those portions
that are relevant with respect to any conjunction in γ remain after executing
the (m+1)-th algorithm. So, Lemma 3.1 holds.

19 Here, we analyze the optimality of pipeline execution for the case that both the algorithms
employed in a pipeline and the schedule of these algorithms have been defined. In contrast,
the examples at the beginning of Section 3.4 have suggested that the amount of text to be an-
alyzed (and, hence, the run-time optimal pipeline) may depend on the schedule. The problem
of finding the optimal schedule under the given filtering view is discussed in Chapter 4.
20 For the proof, it does not matter whether the instances of an information type in C(out)
are used to generate scopes, since no filtering has taken place yet in this case and, hence, the
whole text D can be relevant after lines 1 to 3 of updateScopes.

128 3.5 Optimal Execution via Truth Maintenance
Lemma 3.2. Let a text analysis pipeline Π = ⟨A, π⟩ address a scoped query γ on
an input text D. Further, let each degree of filtering in γ have an associated scope S
of D. Given that S contains exactly the portions of text that can be relevant with
respect to γ, the scope S∪ returned by determineUnifiedScope(C(out)) contains
a portion of text d ∈ D if and only if d is relevant for the information types C(out).

Proof. By assumption, every segmentation algorithm must always process
the whole input text, which is assured in lines 1 to 3 of Pseudocode 3.7. For
each other algorithm A ∈ A, exactly those scopes belong to S for which the
output of A may help to fulfill a conjunction (line 4). All portions of text of
the scopes in S are unified incrementally (lines 5 to 9) while preventing that
overlapping parts of the scopes are considered more than once. Thus, no
relevant portion of text is missed and no irrelevant one is analyzed.
The two lemmas lead to the optimality of using an input control:

Theorem 3.3. Let a text analysis pipeline Π = ⟨A, π⟩ address a scoped query γ
on an input text D. Let updateScopes(C(out)) be called after each execution of an
algorithm A ∈ A on D with the output types C(out) of A, and let each A process only
the portions of D returned by determineUnifiedScope(C(out)). Then Π analyzes
only portions of D that are currently relevant with respect to γ.

Proof. As Lemma 3.1 holds, all scopes contain exactly those portions of D
that are relevant with respect to γ according to the current knowledge.
As Lemma 3.2 holds, each algorithm employed in Π gets only those portions
of D its output is relevant for. From that, Theorem 3.3 follows directly.
Theorem 3.3 implies that an input-controlled text analysis pipeline does not
perform any unnecessary analysis. The intended benefit is to make a text
analysis process faster. Of course, the maintenance of relevant portions of
text naturally produces some overhead in terms of computational cost. In
the evaluation below, however, we give experimental evidence that these
additional costs affect the efficiency of an application only marginally in
comparison to the efficiency gains achieved through filtering. Before that,
we analyze the asymptotic time complexity of the proposed methods.
Figure 3.19: A UML-like class diagram that shows the high-level architecture of
realizing an input control as a filtering framework, which extends Apache UIMA.
that (1) analyzes the main parameters intrinsic to filtering and (2) offers ev-
idence for the efficiency of our proposed approach. Appendix B.4 provides
information on the Java source code of this evaluation.
Input Texts Our experiments are all conducted on texts from two text cor-
pora of different languages. First, the widely used English dataset of the
CoNLL-2003 shared task, which originally served for the development
of approaches to language-independent named entity recognition (cf. Ap-
pendix C.4); the dataset consists of 1,393 mixed classic newspaper stories.
Second, our complete Revenue corpus with 1,128 German online busi-
ness news articles, which we already processed in Sections 3.1 and 3.3 and
which is described in Appendix C.1.
Scoped Queries From a task perspective, the impact of our approach is pri-
marily influenced by the complexity and the filtering potential of the scoped
query to be addressed. To evaluate these parameters, we consider the ex-
ample queries γ1 to γ4 from Section 3.4 under three degrees of filtering:
Sentence, Paragraph, and Text, where the latter is equivalent to performing
no filtering at all. The resulting scoped queries are specified below.
Text Analysis Pipelines We address the scoped queries with different pipe-
lines, some of which use an input control, while the others do not. In all
cases, we employ a subset of eleven text analysis algorithms that have been
adjusted to serve as filtering analysis engines. Each of these algorithms can
be parameterized to work both on English and on German texts. Concretely,
we make use of the segmentation algorithms sto2, sse, and tpo1 as well as
of the chunker pch for preprocessing. The entity types that appear in the
queries (i.e., Time, Money, and Organization) are recognized with eti, emo,
and ene, respectively. Accordingly, we extract relations with the algorithms
rfo (Forecast), rfu (Founded), and rfi (Financial). While rfo operates only on
the sentence level, the other two qualify for arbitrary degrees of filtering.
Further information on the algorithms can be found in Appendix A.
All employed algorithms have a roughly comparable run-time that scales
linearly with the length of the processed input text. While computationally
expensive algorithms (say, a dependency parser) strongly increase the ef-
ficiency potential of filtering (the later such an algorithm is scheduled, the
better), employing them would render it hard to distinguish the effects of
filtering from those of the order of algorithm application (cf. Chapter 4).
Experiments We quantify the filtering potential of our approach by com-
paring the filter ratio (Filter %) of each evaluated pipeline Π, i.e., the quotient
between the number of characters processed by Π and the number of charac-
ters processed by a respective non-filtering pipeline. Similarly, we compute
Figure 3.20: Interpolated curves of the filter ratios of the algorithms in pipeline Π1
under three degrees of filtering for the query γ1 = Founded(Organization, Time) on
(a) the English CoNLL-2003 dataset and (b) the German Revenue corpus.
the time ratio (Time %) of each Π as the quotient between the run-time of Π
and the run-time of a non-filtering pipeline.22 All run-times are measured
on a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory and averaged
over ten runs (with standard deviation σ). In terms of effectiveness, below
we partly count only the positives (P), i.e., the number of extracted relations
of the types sought for, in order to roughly compare the recall of pipelines.
For the foundation relations, we also distinguish between false positives (FP)
and true positives (TP) to compute the extraction precision. To this end, we
manually decided for each positive whether it is true or false. In par-
ticular, an extracted foundation relation is considered a true positive if and
only if its anchor is brought into relation with the correct time entity while
spanning the correct organization entity.23
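Both evaluation measures reduce to simple ratios of counts. A minimal sketch (the character counts in the filter-ratio check are hypothetical; the precision example uses eight foundation relations with one false positive, matching the 87.5% precision reported in the discussion of Table 3.4):

```python
def filter_ratio(chars_with_filtering, chars_without_filtering):
    """Filter %: share of characters a filtering pipeline still processes."""
    return chars_with_filtering / chars_without_filtering

def time_ratio(time_with_filtering, time_without_filtering):
    """Time %: share of run-time a filtering pipeline still needs."""
    return time_with_filtering / time_without_filtering

def precision(tp, fp):
    """Extraction precision from true positives (TP) and false positives (FP)."""
    return tp / (tp + fp)

# Eight extracted foundation relations, one of them false:
p = precision(7, 1)  # 0.875, i.e., 87.5%
```

Lower ratios mean stronger filtering and larger savings, so a Filter % of 100% corresponds to the non-filtering baseline.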
Tradeoff between Efficiency and Effectiveness We analyze different de-
grees of filtering for the query γ1 = Founded(Organization, Time). In particu-
lar, we execute the pipeline Π1 = (spa, sse, eti, sto2, tpo2, pch, ene, rfu) on
both given corpora to address each of three scoped versions of γ1:

Sentence[γ1]    Paragraph[γ1]    Text[γ1]
To examine the effects of an input control, we first look at the impact
of the degree of filtering. Figure 3.20 illustrates the filter ratios of all single
algorithms in Π1 on each of the two corpora, with one interpolated curve for
every evaluated degree of filtering. As the beginnings of the curves convey,
even the segmentation of paragraphs (given for the paragraph level only)
and sentences already enables the input control to disregard small parts of a
22 We provide no comparison to existing filtering approaches, as these approaches do not
compete with our approach, but rather can be integrated with it (cf. Section 3.6).
23 An exact evaluation of precision and recall is hardly feasible on the input texts, since
the relation types sought for are not annotated. Moreover, the given evaluation of precision
is only fairly representative: in practice, many extractors do not look for cross-sentence and
cross-paragraph relations at all. In such cases, precision remains unaffected by filtering.
Table 3.4: The number of processed characters in millions with filter ratio Filter %,
the run-time t in seconds with standard deviation σ and time ratio Time %, and
the numbers of true positives (TP) and false positives (FP) as well as the resulting
precision p of pipeline Π1 for the query γ1 = Founded(Organization, Time) with three
degrees of filtering on the English CoNLL-2003 dataset and on the Revenue corpus.
text, namely those between the segmented text portions. The first algorithm
in Π1 that really reduces the number of relevant portions of text, then, is eti.
On the sentence level, it filters the relevant text down to 28.9% of the input
characters for the texts in the CoNLL-2003 dataset and to 42.0% for the
Revenue corpus. These values are further decreased by ene, such that rfu
has to analyze only 10.8% and 17.2% of all characters, respectively. The
values for the degree of filtering Paragraph behave similarly, while naturally
being higher.
The resulting overall efficiency and effectiveness values are listed in Ta-
ble 3.4. On the paragraph level, Π1 processes 81.5% of the 12.70 million
characters of the CoNLL-2003 dataset that it processes on the text level, re-
sulting in a time ratio of 69.0%. For both these degrees of filtering, the same
eight relations are extracted with a precision of 87.5%. So, no relation is
found that exceeds paragraph boundaries. Filtering on the sentence level
lowers the filter ratio to 40.6% and the time ratio to 32.9%. While this re-
duces the number of true positives to 5, it also prevents any false positive.
Such behavior may be coincidental, but it may also indicate a tendency to
achieve better precision when the filtered portions of text are small.
On the Revenue corpus, the filter and time ratios are higher due to a larger
number of time entities (which are the first entities produced by Π1). Still,
the use of an input control saves more than half of the run-time t when
filtering is performed on the sentence level. Even for simple binary relation
types like Founded, and even without employing any computationally expen-
sive algorithm, the efficiency potential of filtering hence becomes obvious.
At the same time, the numbers of found true positives in Table 3.4 (37 in
total, 27 within paragraphs, 14 within sentences) suggest that the use of an
input control provides an intuitive means to trade the efficiency of a pipeline
for its recall, whereas precision remains quite stable.
Table 3.5: The number of processed characters in millions with filter ratio Filter %,
the run-time t in seconds with standard deviation σ and time ratio Time %, and the
number of positives (in terms of extracted relations) of pipeline Π2 for the query
γ2 = Forecast(Anchor, Time) under three degrees of filtering on the Revenue corpus.
Table 3.6: The number of processed characters with filter ratio Filter % and the run-
time t in seconds with standard deviation σ and time ratio Time % of Π1, …, Π3
on the Revenue corpus under increasingly complex queries γ. Each run-time is
broken down into the times spent for text analysis and for input control. In the
right-most column, the positives are listed, i.e., the number of extracted relations.
Figure 3.21: Interpolated curve of the filter ratios of the eleven algorithms in the
pipeline Π4 for the scoped query γ4 = γ1 ∧ γ3 on the Revenue corpus.
In Table 3.6, we list the efficiency results and the numbers of positives for
the three queries. While the time ratios get slightly higher under increasing
query complexity (i.e., from γ1 to γ4), the input control saves over 50% of the
run-time of a standard pipeline in all cases. At the same time, up to 1,760
relations are extracted from the Revenue corpus (2,103 relations without
filtering). While the longest pipeline (Π4) processes the largest number of
characters (24.40 million), the filter ratio of Π4 (57.9%) rather appears to be
the weighted average of the filter ratios of Π1 and Π3.
For a more exact interpretation of the results of Π4, Figure 3.21 visualizes
the filter ratios of all algorithms in Π4. As shown, the interpolated curve
does not decline monotonously along the pipeline. Rather, the filter ratios
depend on what portions of text are relevant for which conjunctions in γ4,
which follows from the dependency graph of γ4 (cf. Figure 3.17(a)). For
instance, the algorithm rfo precedes the algorithm pch, but entails a lower
filter ratio (28.9% vs. 42%): rfo needs to analyze only the portions of text in
the scope of γ3. According to the pipeline's schedule, this means all sentences
with a time entity in paragraphs that contain a money entity. In contrast,
pch processes all sentences with time entities, as it produces a predecessor
type required by ene, which is relevant for the scope of γ1.
Besides the efficiency impact of controlling the input, Table 3.6 also pro-
vides insights into the efficiency of our implementation. In particular, it
opposes the analysis time of each pipeline (i.e., the overall run-time of the
employed text analysis algorithms) to the control time (i.e., the overall run-
time of the input control). In the case of Π1, the input control takes 1.0% of
the total run-time (0.7 of 74.9 seconds). This fraction grows only marginally
under increasing query complexity, as the control times of Π3 and Π4 suggest.
While our implementation certainly leaves room for optimization, we thus
conclude that the input control can be operationalized efficiently.
meet certain constraints, while discarding others. In Section 2.4, we have al-
ready pointed out that this kind of text filtering has been applied since the
early days in order to determine candidate texts for information extraction.
As such, text filtering can be seen as a regular text classification task.
Usually, the classification of candidate texts and the extraction of rele-
vant information from these texts are addressed in separate stages of a text
mining application (Cowie and Lehnert, 1996; Sarawagi, 2008). However,
they often share common text analyses, especially in terms of preprocess-
ing, such as tokenization or part-of-speech tagging. Sometimes, features for
text classification are also based on information types like entities, as is
the case, e.g., for the main approach in our project ArguAna (cf. Section 2.3)
as well as for related work like (Moschitti and Basili, 2004). Given that the two
stages are separated, all common text analyses are performed twice, which
increases run-time and produces redundant or inconsistent output.
To address these issues, Beringer (2012) has analyzed the integration of
text classification and information extraction pipelines experimentally in his
master's thesis written in the context of the thesis at hand. In particular,
the master's thesis investigates the hypothesis that the later filtering is per-
formed within an integrated pipeline, the higher its effectiveness but the
lower its efficiency will be (and vice versa).
While existing works implicitly support this hypothesis, they largely fo-
cus on effectiveness, such as Lewis and Tong (1992) who compare text fil-
tering at three positions in a pipeline. In contrast, Beringer (2012) explicitly
evaluates the efficiency-effectiveness tradeoff, focusing on the InfexBA pro-
cess (cf. Section 2.3) that has originally been proposed in (Stein et al., 2005):
Informational texts like reports and news articles are first filtered from a col-
lection of input texts. Then, forecasts are extracted from the informational
texts. To realize this process, the algorithm clf from (Wachsmuth and Bu-
jna, 2011) for language function analysis (cf. Section 2.3) is integrated in dif-
ferent positions of the optimized pipeline Π2 that we use in Section 3.1. The
later clf is scheduled in the integrated pipeline Π(b)2,lfa, the more information
is accessed for text classification, but the later the filtering of text portions
starts, too. The input control operates on the sentence level, which matches
the analyses of all employed algorithms. Therefore, observed effectiveness
differences must be caused by the text filtering stage. For this scenario,
Beringer (2012) performs several experiments with variations of Π(b)2,lfa.
Here, we exemplarily look at the main results of one of these experiments.
The experiment has been conducted on the union of the test sets from the
Revenue corpus and from the music part of the LFA-11 corpus, both of
which are described in Appendix C. This combination is not perfectly ap-
Figure 3.22: Illustration of the effectiveness of (a) filtering candidate input texts and
(b) extracting forecasts from these texts in comparison to the run-time in seconds
of the integrated text analysis pipeline Π(b)2,lfa depending on the position of the text
filtering algorithm clf in Π(b)2,lfa. The figure is based on results from (Beringer, 2012).
propriate for evaluation, both because the domain difference between the
corpora makes text classification fairly easy and because the music texts
contain no false positives with respect to the forecast extraction task at all.
Still, it suffices to outline the basic effects of the pipeline integration.
Figure 3.22 plots the efficiency and effectiveness of Π(b)2,lfa for different po-
sitions of clf in the pipeline. The run-times have been measured on a 3.3
GHz Intel Core i5 Windows 7 system with 8 GB memory. According to Fig-
ure 3.22(a), spending more run-time improves the accuracy of text filtering
in the given case, at least until the application of clf after tpo2.²⁵ This in
turn benefits the recall and, thus, the F1-score of extracting forecasts, which
are raised up to 0.59 and 0.64 in Figure 3.22(b), respectively.²⁶
The observed results indicate that integrating text filtering and text ana-
lysis provides another means to trade efficiency for effectiveness. As in our
experiments, the relevant portions of the filtered texts can then be main-
tained by our input control. We do not analyze the integration in detail in
this thesis. However, we point out that the input control does not prevent
text filtering approaches from being applicable, as long as it does not start to
restrict the input of algorithms before text filtering is finished. Otherwise,
less and differently distributed information is given for text filtering, which
can cause unpredictable changes in effectiveness, cf. (Beringer, 2012).
Aside from the outlined tradeoff, the integration of the two stages gener-
ally improves the efficiency of text mining. In particular, the more text ana-
lyses are shared by the stages, the more redundant effort can be avoided.
For instance, Π(b)2,lfa requires 19.1 seconds in total when clf is scheduled
²⁵ As shown in Figure 3.22(a), the accuracy is already close to its maximum when clf is
integrated after sto2, i.e., when token-based features are available, such as bag-of-words,
bigrams, etc. So, more complex features are not really needed in the end, which indicates
that the classification of language functions is comparably easy on the given input texts.
²⁶ While the extraction precision remains unaffected by the position of integration in the
experiment, this is primarily due to the lack of false positives in the LFA-11 corpus.
after tpo2, as shown in Figure 3.22. Separating text filtering and text ana-
lysis would require executing the first four algorithms of Π(b)2,lfa twice on all
filtered texts, hence taking a proportional amount of additional time (except
for the time taken by clf itself). The numeric efficiency impact of avoiding
redundant operations has not been evaluated in (Beringer, 2012). In the end,
however, the impact depends on the schedule of the employed algorithms
as well as on the fraction of relevant texts and relevant information in these
texts, which leads to the concluding remark of this chapter.
4 Pipeline Efficiency
Figure 4.1: Abstract view of the overall approach of this thesis (cf. Figure 1.5). All
sections of Chapter 4 contribute to the design of large-scale text analysis pipelines.
tion 4.3). In cases where input texts are homogeneous in the distribution of
relevant information, the approach reliably finds a near-optimal schedule
according to our evaluation. In other cases, there is not one single opti-
mal schedule (Section 4.4). To optimize efficiency, a pipeline then needs to
adapt to the input text at hand. Under high heterogeneity, such an adap-
tive scheduling works well by learning in a self-supervised manner what
schedule is fastest for which text (Section 4.5). For large-scale text mining,
a pipeline can finally be parallelized, as we outline in Section 4.6. The con-
tribution of Chapter 4 to our overall approach is shown in Figure 4.1.
Figure 4.2: (a) Venn diagram representation of a sample text with ten sentences,
among which one is a forecast that contains a money and an organization entity.
(b) The sentences of the sample text that need to be processed by each text analysis
algorithm in the pipelines ΠMOF (top) and ΠFOM (bottom), respectively.
same effectiveness in the tackled text analysis task, as long as both of them
are admissible (cf. Section 3.1).
As an example, consider the task to extract all sentences that denote fore-
casts with a money and an organization entity from a single news article,
which is related to our project InfexBA (cf. Section 2.3). Let the article con-
sist of ten sentences, six of which contain money entities. Let two of the six
sentences denote forecasts and let four of them contain organization entities.
Only one of the forecast sentences also spans an organization entity and, so,
contains all information sought for. Figure 4.2(a) represents such an article as a Venn
diagram. To tackle the task, assume that three algorithms AM , AO , and
AF for the recognition of money entities, organization entities, and forecast
events are given that have no interdependencies, meaning that all possible
schedules are admissible. For simplicity, let AM always take t(AM ) = 4 ms
to process a single sentence, while AO and AF need t(AO ) = t(AF ) = 5 ms.
Without an input control, each algorithm must process all ten sentences,
resulting in the following run-time t(no filtering ) of a respective pipeline:
t(no filtering) = 10 · t(AM) + 10 · t(AO) + 10 · t(AF) = 140 ms
Now, given an input control that performs filtering on the sentence level, it
may seem reasonable to apply the fastest algorithm AM first, e.g. in a pipeline
ΠMOF = (AM, AO, AF). This is exactly what our method greedyPipelineLinearization
from Section 3.3 does. As a result, AM is applied to all ten
sentences, AO to the six sentences with money entities (assuming all entities
are found), and AF to the four with money and organization entities (ac-
cordingly), as illustrated at the top of Figure 4.2(b). Hence, we have:
t(ΠMOF) = 10 · t(AM) + 6 · t(AO) + 4 · t(AF) = 90 ms
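The effect of input filtering on a schedule's run-time can be sketched in a few lines of Python. The function and its names are only illustrative; the per-sentence run-times and scope sizes are the ones from the example above:

```python
def pipeline_runtime(schedule, time_per_sentence_ms, scope_sizes):
    """Run-time of a pipeline: each algorithm processes only the sentences
    left relevant by its predecessors (the scope at its position)."""
    return sum(time_per_sentence_ms[a] * n
               for a, n in zip(schedule, scope_sizes))

t = {"AM": 4, "AO": 5, "AF": 5}  # ms per sentence, as in the example
pipeline_runtime(("AM", "AO", "AF"), t, [10, 10, 10])  # no filtering: 140 ms
pipeline_runtime(("AM", "AO", "AF"), t, [10, 6, 4])    # t(MOF): 90 ms
```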
t(Π(j)) =  { t1(D)                              if j = 1
           { t(Π(j−1)) + tj(S(Π(j−1)))          otherwise        (4.1)
This recursive definition resembles the one used by the Viterbi algorithm,
which operates on hidden Markov models (Manning and Schütze, 1999). A
hidden Markov model describes a statistical process as a sequence of states.
A transition from one state to another is associated with some state probability.
While the states are not visible, each state produces an observation with an
according probability. Hidden Markov models have the Markov property,
i.e., the probability of a future state depends on the current state only. On
this basis, the Viterbi algorithm computes the Viterbi path, which denotes
the most likely sequence of states for a given sequence of observations.
We adapt the Viterbi algorithm for scheduling an algorithm set A, such
that the Viterbi path corresponds to the run-time optimal admissible pipeline
Π = ⟨A, π⟩ on an input text D. As throughout this thesis, we restrict
our view to pipelines where no algorithm processes D multiple times. Also,
under admissibility, only algorithms with fulfilled input constraints can be
executed (cf. Section 3.1). Putting both together, we call an algorithm Ai
applicable at some position j in a pipeline's schedule, if Ai has not been ap-
plied at positions 1 to j−1 and if all input types Ai.C(in) of Ai are produced
by the algorithms at positions 1 to j−1 or are already given for D.
To compute the Viterbi path, the original Viterbi algorithm determines
the most likely sequence of states for each observation and possible state at
that position in an iterative (dynamic programming) manner. For our pur-
poses, we let states of the scheduling process correspond to the algorithms
in A, while each observation denotes the position in a schedule.¹ According
to the Viterbi algorithm, we then propose to store a pipeline Πi(j) from 1 to j
for each combination of a position j ∈ {1, . . . , m} and an algorithm Ai ∈ A
that is applicable at position j. To this end, we determine the set Φ(j−1)
with all those previously computed pipelines of length j−1 after which Ai
is applicable. The recursive function to compute the run-time of Πi(j) can be
directly derived from Equation 4.1:

t(Πi(j)) =  { ti(D)                                       if j = 1
            { min_{Πl ∈ Φ(j−1)} ( t(Πl) + ti(S(Πl)) )     otherwise        (4.2)
As can be seen, the scheduling process does not have the Markov property, be-
cause the run-time t(Πi(j)) of an algorithm Ai ∈ A at some position j de-
¹ Different from (Wachsmuth and Stein, 2012), we omit to explicitly define the underlying
model here for a more focused presentation. The adaptation works even without the model.
optimalPipelineScheduling({A1, . . . , Am}, D)
 1: for each i ∈ {1, . . . , m} do
 2:   if Ai is applicable in position 1 then
 3:     Pipeline Πi(1) ← (Ai)
 4:     Run-time t(Πi(1)) ← ti(D)
 5:     Scope S(Πi(1)) ← Si(D)
 6: for each j ∈ {2, . . . , m} do
 7:   for each i ∈ {1, . . . , m} do
 8:     Pipelines Φ(j−1) ← {Πl(j−1) | Ai is applicable after Πl(j−1)}
 9:     if Φ(j−1) ≠ ∅ then
10:       Pipeline Πk(j−1) ← arg min_{Πl ∈ Φ(j−1)} ( t(Πl) + ti(S(Πl)) )
11:       Pipeline Πi(j) ← Πk(j−1) ⊕ (Ai)
12:       Run-time t(Πi(j)) ← t(Πk(j−1)) + ti(S(Πk(j−1)))
13:       Scope S(Πi(j)) ← Si(S(Πk(j−1)))
14: return arg min_{Πi(m), i ∈ {1, . . . , m}} t(Πi(m))
pends on the scope it is executed on. Hence, we need to keep track of the
values t(Πi(j)) and S(Πi(j)) for each pipeline Πi(j) during the computation
process. After computing all pipelines Πi(m) based on the full algorithm
set A = {A1, . . . , Am}, the optimal schedule of A on the input text D is
the one of the pipeline Πi(m) with the lowest run-time t(Πi(m)).
Pseudocode 4.1 shows our adaptation of the Viterbi algorithm. In lines 1
to 5, a pipeline Πi(1) is stored for every algorithm Ai ∈ A that is already ap-
plicable given D only. The run-time t(Πi(1)) and the scope S(Πi(1)) of Πi(1) are
set to the respective values of Ai. Next, lines 6 to 13 incrementally compute
a pipeline Πi(j) of length j for each algorithm Ai that is applicable at all in
position j. Here, the set Φ(j−1) is computed in line 8. If Φ(j−1) is not empty
(which implies the applicability of Ai), lines 9 to 11 then create Πi(j) by ap-
pending Ai to the pipeline Πk(j−1) that is best in terms of Equation 4.2.² The
run-time and the scope of Πi(j) are computed accordingly (lines 12 and 13).
After the final iteration, the fastest pipeline Πi(m) of length m is returned as
an optimal solution (line 14). A trellis diagram that schematically illustrates
the described operations for Ai at position j is shown in Figure 4.3.
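A compact way to mirror this dynamic program in Python is sketched below. The algorithm properties are simplifying assumptions for illustration only: each algorithm has a fixed per-sentence run-time and an input-independent filter selectivity, so ti(S) and Si(S) reduce to multiplications, whereas the thesis measures run-times and scopes during execution.

```python
def optimal_pipeline_scheduling(algos, n_sentences):
    """Viterbi-style scheduling: algos maps a name to (ms per sentence,
    selectivity). Keeps, per length j and last algorithm, the cheapest
    pipeline together with its run-time and remaining scope size."""
    # length-1 pipelines (all algorithms assumed applicable on D)
    layer = {a: ((a,), t * n_sentences, n_sentences * s)
             for a, (t, s) in algos.items()}
    for _ in range(len(algos) - 1):
        nxt = {}
        for a, (t, s) in algos.items():
            # predecessor pipelines after which a is applicable (not yet used)
            cands = [(pipe + (a,), time + t * scope, scope * s)
                     for pipe, time, scope in layer.values() if a not in pipe]
            if cands:
                nxt[a] = min(cands, key=lambda c: c[1])
        layer = nxt
    best = min(layer.values(), key=lambda c: c[1])
    return best[0], best[1]  # (schedule, total run-time in ms)
```

Under these assumptions, all pipelines over the same algorithm set end up with the same scope, which is the property the correctness argument of the approach relies on.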
² Different from the pseudocode in (Wachsmuth and Stein, 2012), we explicitly check here
if Φ(j−1) is not empty. Apart from that, the pseudocodes differ only in terms of namings.
Figure 4.3: Trellis diagram of the operations of optimalPipelineScheduling for an
algorithm Ai at position j. Among all pipelines Πl(j−1) after which Ai is applicable,
the one minimizing t(Πl(j−1)) + ti(S(Πl(j−1))) is chosen as Πk(j−1) and extended to
Πi(j) with run-time t(Πi(j)) and scope S(Πi(j)). Algorithms already applied or not
yet applicable are skipped.
Proof. We show the lemma by induction over the length j. For j = 1, each
pipeline Πi(1) created in line 3 of optimalPipelineScheduling consists only of
the algorithm Ai. As there is only one pipeline of length 1 that ends with Ai,
Πi(1) = Π′i(1) holds for all i ∈ {1, . . . , m} and, so, Πi(1) is optimal. Therefore,
we hypothesize that the lemma is true for an arbitrary but fixed length j−1,
and we prove by contradiction that, in this case, it also holds for j. For this
purpose, we assume the opposite:

∃ Π′i(j) = ⟨Ai(j), π′⟩ = (A′1, . . . , A′j−1, Ai) :  t(Πi(j)) > t(Π′i(j))

According to Equation 4.1, this inequality can be rewritten as follows:

t(Π(j−1)) + ti(S(Π(j−1))) > t(Π′(j−1)) + ti(S(Π′(j−1)))

By definition, the pipelines Π(j−1) and Π′(j−1) employ the same algorithm
set A(j−1) = Ai(j) \ {Ai}. Since both pipelines are admissible, they entail the
same relevant portions of text, i.e., S(Π(j−1)) = S(Π′(j−1)). Therefore, the
run-time of algorithm Ai must be equal on S(Π(j−1)) and S(Π′(j−1)), so we
remain with the following inequality:

t(Π(j−1)) > t(Π′(j−1))

Π′(j−1) must end with a different algorithm A′j−1 than Aj−1 in Π(j−1), be-
cause otherwise Πi(j) would not be created from Π(j−1) in lines 10 and 11
of optimalPipelineScheduling. For the same reason, Π′(j−1) cannot belong
to the set Φ(j−1) computed in line 8. Each pipeline in Φ(j−1) is run-time
optimal according to the induction hypothesis, including the one that ends
with A′j−1 (if such a pipeline exists). But, then, Π′(j−1) cannot be optimal.
This means that the assumed opposite must be wrong.
Complexity Knowing that the approach is correct, we now come to its com-
putational complexity. As in Chapter 3, we rely on the O-notation (Cormen
et al., 2009) to capture the worst-case run-time of the approach. Given an
arbitrary algorithm set A and some input text D, Pseudocode 4.1 iterates ex-
actly |A| times over the |A| algorithms, once for each position in the sched-
ule to be computed. In each of these |A|² loop iterations, a pipeline Πi(j)
is determined based on at most |A|−1 other pipelines Πl(j−1), resulting in
O(|A|³) operations. For each Πi(j), the run-time t(Πi(j)) and the scope S(Πi(j))
are stored. In practice, these values are not known beforehand, but they
need to be measured when executing Πi(j) on its input. In the worst case, all
algorithms in A have an equal run-time tA(D) on D and they find relevant
information in all portions of text, i.e., Si(D) = D for each algorithm Ai ∈ A.
Then, all algorithms must indeed process the whole text D, which leads to
an overall upper bound of

t_optimalPipelineScheduling(A, D) = O(|A|³ · tA(D)).        (4.3)
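To make the cubic factor concrete, the worst-case number of candidate evaluations of Pseudocode 4.1 can be counted directly (a small illustrative helper, not from the thesis): position 1 initializes at most |A| pipelines, and each of the (|A|−1)·|A| remaining combinations of a position and an algorithm examines at most |A|−1 predecessor pipelines.

```python
def dp_candidate_evaluations(m):
    """Worst-case candidate evaluations of the scheduling DP for |A| = m."""
    return m + (m - 1) * m * (m - 1)  # init + positions 2..m x algorithms x predecessors

dp_candidate_evaluations(4)  # 4 + 3*4*3 = 40, i.e., cubic growth in |A|
```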
imental evidence that different input texts may lead to different run-time
optimal schedules. An analysis of the reasons behind follows in Section 4.2.
Details on the source code used here are found in Appendix B.4.
Information Need We consider the extraction of all forecasts on organiza-
tions with a time and a money entity. An example that spans all relevant
information types is the following: IBM will end the first-quarter 2011 with
$13.2 billion of cash on hand and with generated free cash flow of $0.8 billion.⁴ For
our input control from Section 3.5, the information need can be formalized
as the scoped query γ = Sentence[Forecast(Time, Money, Organization)].⁵
Algorithm Set To produce the output sought for in , we use the entity
recognition algorithms eti, emo, and ene as well as the forecast event detec-
tor rfo. To fulfill their input requirements, we additionally employ three
preprocessing algorithms, namely, sse, sto2 , and tpo1 . All algorithms oper-
ate on the sentence level (cf. Appendix A for further information). During
scheduling, we implicitly apply the lazy evaluation step from Section 3.1 to
create filtering stages, i.e., each preprocessor is scheduled as late as possible
in all pipelines. Instead of filtering stages, we simply speak of the algorithm
set A = {eti, emo, ene, rfo} in the following without loss of generality.
Text Corpora As in Section 3.5, we consider both our Revenue corpus (cf.
Appendix C.1) and the CoNLL-2003 dataset (cf. Appendix C.4). For lack of
alternatives to the employed algorithms, we rely on the German part of the
CoNLL-2003 dataset this time. We process only the training sets of the two
corpora. These training sets consist of 21,586 sentences (Revenue corpus)
and 12,713 sentences (CoNLL-2003 dataset), respectively.
Experiments We run all pipelines Πi(j), which are computed within the ex-
ecution of optimalPipelineScheduling, ten times on both text corpora using
a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory. For each Πi(j), we
measure the averaged overall run-time t(Πi(j)). In our experiments, all stan-
dard deviations were lower than 1.0 seconds on the Revenue corpus and
0.5 seconds on the CoNLL-2003 dataset. Below, we omit them for a concise
presentation. For similar reasons, we state only the number of sentences in
the scopes S(Πi(j)) of all Πi(j), instead of the sentences themselves.
Input Dependency of Pipeline Scheduling Figure 4.4 illustrates the appli-
cation of optimalPipelineScheduling for the algorithm set A to the two con-
⁴ Taken from http://ibm.com/investor/1q11/press.phtml, accessed on May 21, 2014.
⁵ We note here once that some of the implementations in the experiments in Chapter 4
do not use the exact input control approach presented in Chapter 3. Instead, the filtering
of relevant portions of text is directly integrated into the employed algorithms. However,
as long as only a single information need is addressed and only one degree of filtering is
specified, there will be no conceptual difference in the obtained results.

sidered corpora as trellis diagrams. The bold arrows correspond to the
respective Viterbi paths, indicating the optimal pipeline Πi(j) of each length j.
Given all four algorithms, the optimal pipeline takes 48.25 seconds on the
Revenue corpus, while the one on the CoNLL-2003 dataset requires 18.17
seconds. eti is scheduled first and ene is scheduled last on both corpora,
but the optimal schedule of emo and rfo differs. This emphasizes the input-
dependency of the run-time optimality of a schedule.
One main reason lies in the selectivities of the employed algorithms: On
the Revenue corpus, 3813 sentences remain relevant after applying
Πemo(2) = (eti, emo), as opposed to 2294 sentences in case of Πrfo(2) = (eti, rfo).
Conversely, only 82 sentences of the CoNLL-2003 dataset are filtered after
applying Πemo(2), whereas 555 sentences still need to be analyzed after Πrfo(2).
Altogether, each admissible pipeline Πi(4) based on the complete algorithm
set A classifies the same 215 sentences (1.0%) of the Revenue corpus as rele-
vant, while not more than two sentences of the CoNLL-2003 dataset (0.01%)
efficient way to compute benchmarks for more or less arbitrary text analysis
tasks.⁷ Rather, it clarifies the theoretical background of empirical findings
on efficient text analysis pipelines in terms of the underlying algorithmic
and linguistic determinants. In particular, we have shown that the optimal-
ity of a pipeline depends on the run-times and selectivities of the employed
algorithms on the processed input texts. In the next section, we investigate
the characteristics of collections and streams of input texts that influence
pipeline optimality. On this basis, we then turn to the development of effi-
cient practical scheduling approaches.
of text in D (of some specified text unit type, cf. Section 3.4). This frequency
can affect the efficiency of algorithms that take instances of C as input. Al-
though its impact is definitely worth analyzing, in the given context we are
primarily interested in the efficiency impact of filtering. Instead, we there-
fore capture the density of an information type C in D, which we define as
the fraction of portions of text in D in which instances of C are found.⁸
To illustrate the difference between frequency and density, assume that
relations of a type IsMarriedTo(Person, Person) shall be extracted from a given
text D with two portions of text (say, sentences). Let three person names be
found in the first sentence, while none is found in the second one. Then
the type Person has a relative frequency of 1.5 in D but a density of 0.5. The
frequency affects the average number of candidate relations for extraction.
In contrast, the density implies that relation extraction needs to take place
on 50% of the given sentences only, which is what we aim at.
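The two statistics from the IsMarriedTo example can be computed as follows (an illustrative snippet; the hypothetical argument holds the number of person names found per sentence):

```python
def frequency_and_density(mentions):
    """Relative frequency vs. density of an information type, given the
    number of instances found in each portion of text (here: sentence)."""
    n = len(mentions)
    frequency = sum(mentions) / n                 # avg. instances per sentence
    density = sum(m > 0 for m in mentions) / n    # fraction of sentences hit
    return frequency, density

frequency_and_density([3, 0])  # the example from the text: (1.5, 0.5)
```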
Now, consider the general case that some text analysis pipeline Π = ⟨A, π⟩
is given to address an information need C on a text D. The density of each
information type from C in D directly governs what portions of D are fil-
tered by our input control after the execution of an algorithm in A. De-
pending on the schedule π, the resulting run-times of all algorithms on the
filtered portions of text then sum up to the run-time of Π. Hence, it might
seem reasonable to conclude that the fraction of those portions of D, which
pipeline Π classifies as relevant, impacts the run-time optimality of Π. In
fact, however, the optimality depends on the portions of D classified as not
relevant, as follows from Theorem 4.2:
Theorem 4.2. Let Π = ⟨A, π⟩ be run-time optimal on a text D under all admis-
sible text analysis pipelines based on an algorithm set A, and let S(D) ⊆ D denote
a scope containing all portions of D classified as relevant by Π. Let S(D′) ⊆ D′
denote the portions of any other input text D′ classified as relevant by Π. Then
Π is also run-time optimal on (D\S(D)) ∪ S(D′).
Proof. In the proof, we denote the run-time of a pipeline Π on an arbitrary
scope S as tΠ(S). By hypothesis, the run-time tΠ(D) of Π = ⟨A, π⟩ is
optimal on D, i.e., for all admissible pipelines Π′ = ⟨A, π′⟩, we have

tΠ(D) ≤ tΠ′(D).        (4.4)

As known from Section 3.1, for a given input text D, all admissible text
analysis pipelines based on the same algorithm set A classify the same por-
tions S(D) ⊆ D as relevant. Hence, each portion of text in S(D) is processed
⁸ Besides, an influencing factor of efficiency is the length of the portions of text, of course.
However, we assume that, on average, all relative frequencies and densities equally scale
with the length. Consequently, both can be seen as an implicit model of length.
Theorem 4.2 states that the portions of text S(D) classified as relevant by a
text analysis pipeline have no impact on the run-time optimality of the pipe-
line. Consequently, differences in the efficiency of two admissible pipelines
based on the same algorithm set A must emanate from applying the algo-
rithms in A to different numbers of irrelevant portions of text. We give
experimental evidence for this conclusion in the following.
Figure 4.5: Interpolated curves of the average run-times per sentence of the pipe-
lines Π1 = (eti, rfo, emo, ene) and Π2 = (emo, eti, rfo, ene) under different densities
of the relevant information types in modified training sets of the Revenue corpus.
The densities were created by duplicating or deleting (a) relevant sentences, (b) ran-
dom irrelevant sentences, and (c) irrelevant sentences with money entities.
set of the Revenue corpus. In particular, we have modified the original cor-
pus texts by randomly duplicating or deleting
(a) relevant sentences, which contain all relevant information types,
(b) irrelevant sentences, which miss at least one relevant type,
(c) irrelevant sentences, which contain money entities, but which miss
at least one other relevant type.
In case (a) and (b), we created text corpora, in which the density of the whole
set C is 0.01, 0.02, 0.05, 0.1, and 0.2, respectively. In case (c), it is not possible
to obtain densities higher than slightly more than 0.021 from the training set
of the Revenue corpus, because at that density, all irrelevant sentences with
money entities have been deleted. Therefore, we restrict our view to five
densities between 0.009 and 0.021 in that case.
Experiments We have processed all created corpora ten times with both
Π1 and Π2 on a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory. Due
to the alterations, the corpora differ significantly in size. For this reason, we
compare the efficiency of the pipelines in terms of the average run-times per
sentence. Appendix B.4 outlines how to reproduce the experiments.
The Impact of the Distribution of Relevant Information Figure 4.5 plots
the run-times as a function of the density of C. In line with Theorem 4.2,
Figure 4.5(a) conveys that changing the number of relevant sentences does
not influence the absolute differences of the run-times of Π1 and Π2.⁹ In
⁹ Minor deviations occur on the processed corpora, since we have changed the number
of relevant sentences as opposed to the number of sentences that are classified as relevant.
contrast, the gap between the curves in Figure 4.5(b) increases proportion-
ally under growing density, because the two pipelines spend a proportional
amount of time processing irrelevant portions of text. Finally, the impact of
the distribution of relevant information becomes explicit in Figure 4.5(c): Π1
is faster at densities lower than about 0.018, but Π2 outperforms Π1 under a
higher density (0.021).¹⁰ The reason for the change in optimality is that, the
more irrelevant sentences with money entities are deleted, the fewer portions
of text are filtered after emo, which favors the schedule of Π2.
Altogether, we conclude that the distribution of relevant information can
be decisive for the optimal scheduling of a text analysis pipeline. While
there are other influencing factors, some of which trace back to the efficiency
of the employed text analysis algorithms (as discussed in the beginning of
this section), we have cancelled out many of these factors by only duplicat-
ing and deleting sentences from the Revenue corpus itself.
Figure 4.6: Illustration of the densities of person, organization, and location en-
tities in the sentences of two English and four German collections of texts. All
densities are computed based on the results of Πene = (sse, sto2, tpo1, pch, ene).
Table 4.1: The average run-time t in milliseconds per sentence and its standard
deviation for every algorithm in the pipeline Πene = (sse, sto2, tpo1, pch, ene)
on each evaluated text corpus. In the bottom line, the average of each algorithm's
run-time is given together with the standard deviation from the average.
So, to summarize, the evaluated text corpora and information types sug-
gest that the run-times of the algorithms used to infer relevant information
tend to remain rather stable. At the same time, the resulting distributions
of relevant information may completely change. In such cases, the practical
relevance of the impact outlined above becomes obvious, since, at least for
algorithms with similar run-times, the distribution of relevant information
will directly decide the optimality of the schedule of the algorithms.¹³
Figure 4.7: Illustration of the nodes, actions, and costs of the complete search graph
of the informed search for the optimal schedule of an algorithm set A.
gorithms and π(j) is an admissible schedule. The graph's root is the empty
pipeline ⟨∅, ∅⟩, and each leaf a complete pipeline ⟨A, π⟩. An edge represents
the execution of an applicable algorithm Ai ∈ A on the currently relevant
portions of D. Here, we define applicability exactly as in Section 4.1.¹⁶
The run-time of Ai represents the step cost, while the path and solution
costs refer to the run-times of partial and complete pipelines, respectively.
Figure 4.7 illustrates all concepts in an abstract search graph. It imitates the
trellis visualization from Figure 4.3 in order to show the connection to the
dynamic programming approach from Section 4.1.
However, the search graph will often be much larger than the respective trellis (in terms of the number of nodes), since an algorithm set A can entail up to |A|! admissible schedules. To find a solution efficiently, informed search aims to avoid exploring the complete search graph by following some search strategy that governs the order in which nodes of the search graph are generated.17 The notion of being informed refers to the use of a heuristic within the search strategy. Such a heuristic relies on problem-specific or domain-specific knowledge specified beforehand, which can be exploited to identify nodes that may quickly lead to a solution. For scheduling, we consider one of the most common search strategies, named best-first search, which always generates the successor nodes of the node with the lowest estimated solution cost first. Best-first search operationalizes its heuristic in the form of a heuristic function H, which estimates the cost of the cheapest path from a given node to a leaf node (Russell and Norvig, 2009).
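As a minimal sketch of how such a best-first strategy can be operationalized: the generic function below always expands the node whose path cost plus heuristic estimate (the estimated solution cost) is lowest. The function names and the toy cost model in the usage are our own illustration, not the thesis implementation.

```python
import heapq

def best_first_search(start, successors, heuristic, is_goal):
    """Generic best-first search: expand the node with the lowest
    estimated solution cost (path cost + heuristic estimate) first."""
    # Open list entries: (estimated solution cost, path cost, node).
    open_list = [(heuristic(start), 0.0, start)]
    while open_list:
        _, cost, node = heapq.heappop(open_list)
        if is_goal(node):
            return node, cost
        for succ, step_cost in successors(node):
            new_cost = cost + step_cost
            heapq.heappush(open_list, (new_cost + heuristic(succ), new_cost, succ))
    return None, float("inf")
```

With a zero heuristic (trivially optimistic), this degenerates to uniform-cost search; a tighter optimistic heuristic prunes more of the graph while preserving optimality.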
16 Again, we schedule single text analysis algorithms here for a less complex presentation. To significantly reduce the search space, it would actually be more reasonable to define the filter stages from Section 3.1 as actions, as we propose in (Wachsmuth et al., 2013a).
17 In the given context, it is important not to confuse the efficiency of the search for a pipeline schedule and the efficiency of the schedule itself. Both are relevant, as both influence the overall efficiency of addressing a text analysis task. We evaluate their influence below.
4 Pipeline Efficiency 165
The actually observed run-time t(Π) of Π and the value of the heuristic function then sum up to the estimated solution cost q(Π). Similar to the dynamic programming approach from Section 4.1, we hence need to keep track of all run-times and filtered portions of text. By that, we implicitly estimate the selectivities of all algorithms in A on the input text D at each possible position in a pipeline based on A.
Now, assume that each run-time estimation q(Ai) is optimistic, meaning that the actual run-time ti(S(Π)) of Ai exceeds q(Ai) on all scopes S(Π). In this case, H is optimistic, too, because at least one applicable algorithm has to be executed on the remaining scope S(Π). In the end, however, the only
way to guarantee optimistic run-time estimations consists in setting all of them to 0, which would render the defined heuristic H useless. Instead, we relax the need of finding an optimal schedule here to the need of optimizing the schedule with respect to the given run-time estimations q. Consequently, the accuracy of the run-time estimations implies a tradeoff between the efficiency of the search and the efficiency of the determined schedule: The higher the estimations are set, the fewer nodes A∗ search will expand on average, but also the less probable it is that it returns an optimal schedule, and vice versa. We analyze some of the effects of run-time estimations in the evaluation below.
Given an informed search scheduling problem ⟨A, π̃, q, D⟩ and the defined heuristic H, we can apply A∗ search to find an optimized schedule. However, A∗ search may still be inefficient when the number of nodes on the open list with similar estimated solution costs becomes large. Here, this can happen for algorithm sets with many admissible schedules of similar efficiency. Each time a node is expanded, every applicable algorithm needs to process the relevant portions of text of that node, which may cause high run-times in case of computationally expensive algorithms. To control the efficiency of A∗ search, we introduce a parameter k that defines the maximum number of nodes to be kept on the open list. Such a k-best variant of
19 For simplicity, we assume here that there is one type of portions of text only (say, Sentence) without loss of generality. For other cases, we could distinguish between instances of the different types and respective run-time estimations in Equation 4.6.
k-bestA∗PipelineScheduling(A, π̃, q, D, k)
1:  Pipeline Π0 ← ⟨∅, ∅⟩
2:  Scope S(Π0) ← D
3:  Run-time t(Π0) ← 0
4:  Algorithm set A1 ← {Ai ∈ A | ∄A: (A ≺ Ai) ∈ π̃}
5:  Estimated run-time q(Π0) ← H(Π0, A1, q)
6:  Pipelines open ← {Π0}
7:  loop
8:    Pipeline ⟨Ā, π̄⟩ ← open.poll(arg min_{Π ∈ open} q(Π))
9:    if Ā = A then return ⟨Ā, π̄⟩
10:   A1 ← {Ai ∈ A \ Ā | ∀(A ≺ Ai) ∈ π̃: A ∈ Ā}
11:   for each Algorithm Ai ∈ A1 do
12:     Pipeline Πi ← ⟨Ā ∪ {Ai}, π̄ ∪ {(A ≺ Ai) | A ∈ Ā}⟩
13:     Scope S(Πi) ← Si(S(⟨Ā, π̄⟩))
14:     Run-time t(Πi) ← t(⟨Ā, π̄⟩) + ti(S(⟨Ā, π̄⟩))
15:     Estimated run-time q(Πi) ← t(Πi) + H(Πi, A1 \ {Ai}, q)
16:     open ← open ∪ {Πi}
17:   while |open| > k do open.poll(arg max_{Π ∈ open} q(Π))
A∗ search considers only the seemingly best k nodes for expansion, which improves efficiency while no longer guaranteeing optimality (with respect to the given run-time estimations).20 In particular, k thereby provides another means to influence the efficiency-effectiveness tradeoff, as we also evaluate below. Setting k to ∞ yields a standard A∗ search.
Pseudocode 4.2 shows our k-best A∗ search approach for determining an optimized schedule of an algorithm set A based on an input text D. The root node of the implied search graph refers to the empty pipeline Π0 and to the complete input text S(Π0) = D. Π0 does not yield any run-time and, so, the estimated solution cost of Π0 equals the value of the heuristic H, which depends on the initially applicable algorithms in A1 (lines 1 to 5). Line 6 creates the set open from Π0, which represents the open list. In lines 7 to 17, the partial pipeline ⟨Ā, π̄⟩ with the currently best estimated solution cost is iteratively polled from the open list (line 8) and expanded until it contains all algorithms and is, thus, returned. Within one iteration, line 10 first determines all remaining algorithms that are applicable after ⟨Ā, π̄⟩ according to the given partial schedule π̃. Each such algorithm Ai processes the relevant portions of text of ⟨Ā, π̄⟩, thereby generating a successor node for
20 k-best variants of A∗ search have already been proposed for other tasks in natural language processing, such as parsing (Pauls and Klein, 2009).
168 4.3 Optimized Scheduling via Informed Search
the resulting pipeline Πi and its associated portions of texts (lines 11 to 13).21 The run-time and the estimated solution cost of Πi are then updated, before Πi is added to the open list (lines 14 to 16). After expansion, line 17 reduces the open list to the k currently best pipelines.22
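The k-best pruning of the open list can be sketched in a few lines of code. The run-times, selectivities, and the zero heuristic below are hypothetical placeholders (a zero heuristic is trivially optimistic), so this illustrates the pruned open list rather than the thesis implementation:

```python
import heapq

def k_best_schedule(algorithms, runtimes, selectivities, k):
    """k-best A* sketch: nodes are partial schedules; after every
    expansion, the open list is pruned to the k best-estimated nodes."""
    # Node: (estimated cost, accumulated run-time, remaining text fraction, schedule).
    open_list = [(0.0, 0.0, 1.0, ())]
    while open_list:
        _, cost, frac, sched = heapq.heappop(open_list)
        if len(sched) == len(algorithms):
            return sched, cost
        for a in algorithms:
            if a in sched:
                continue
            new_cost = cost + frac * runtimes[a]  # step cost on the remaining scope
            heapq.heappush(open_list,
                           (new_cost, new_cost, frac * selectivities[a], sched + (a,)))
        # Analog of line 17: keep only the k nodes with the lowest estimates.
        open_list = heapq.nsmallest(k, open_list)
        heapq.heapify(open_list)
    return None, float("inf")
```

With a small k, nodes on an optimal path may be pruned away, which mirrors the loss of the optimality guarantee discussed above.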
If more than one text would be processed, different schedules might be found, which entails the additional problem of inferring the fastest schedule in total from the fastest schedules of all texts.
Correctness Under the described circumstances, the standard A∗ search variant of k-bestA∗PipelineScheduling (which emanates from setting k to ∞ or, alternatively, from skipping line 17 of Pseudocode 4.2) can be said to be correct in that it always finds an optimal solution, as captured by the following theorem. As in the proof in Section 4.1, we refer to consistent algorithm sets without circular dependencies (cf. Section 3.3) here:23
Theorem 4.3. Let ⟨A, π̃, q, D⟩ be an informed search scheduling problem with a consistent algorithm set A that has no circular dependencies. If all estimations in q are optimistic on each portion of the text D, then the pipeline ⟨A, π⟩ returned by a call of k-bestA∗PipelineScheduling(A, π̃, q, D, ∞) is run-time optimal on D under all admissible pipelines based on A.
Proof. We only roughly sketch the proof, since the correctness of A∗ search has often been shown in the literature (Russell and Norvig, 2009). As clarified above, optimistic run-time estimations in q imply that the employed heuristic H is optimistic, too. When k-bestA∗PipelineScheduling returns a pipeline ⟨A, π⟩ (pseudocode line 9), the estimated solution cost q(⟨A, π⟩) of ⟨A, π⟩ equals its run-time t(⟨A, π⟩), as all algorithms have been applied. At the same time, no other pipeline on the open list has a lower estimated solution cost according to line 8. By definition of H, all estimated solution costs are optimistic. Hence, no pipeline on the open list can entail a lower run-time than ⟨A, π⟩, i.e., ⟨A, π⟩ is optimal. And, since algorithms from A are added to a pipeline on the open list in each iteration of the outer loop in Pseudocode 4.2, ⟨A, π⟩ is always eventually found.
algorithms are always applicable and take exactly the same run-time tA(D) on D. As a consequence, the schedule of a pipeline does not affect the pipeline's run-time and, so, the run-time is higher the longer the pipeline (i.e., the more algorithms it employs). Therefore, lines 7 to 17 generate the whole search graph except for the leaf nodes, because line 9 directly returns the pipeline of the first reached leaf node (which corresponds to a pipeline of length |A|). Since all algorithms can always be applied, there are |A| pipelines of length 1, |A|·(|A|−1) pipelines of length 2, and so forth. Hence, the number of generated nodes is

|A| + |A|·(|A|−1) + . . . + |A|·(|A|−1)· . . . ·2 = O(|A|!)   (4.7)
Avoiding this worst case is what the applied best-first search strategy aims for in the end. In addition, we can control the run-time with the parameter k. In particular, k changes the products in Equation 4.7. Within each product, the last factor denotes the number of possible expansions of a node of the respective length, while the multiplication of all other factors results in the number of such nodes to be expanded. This number is limited to k, which means that we can transform Equation 4.7 into

|A| + k·(|A|−1) + . . . + k·2 = O(k · |A|²)   (4.9)

Like above, a single node generation entails costs that largely result from tA(D) and at most |A|. As a result, we obtain a worst-case run-time of

t_k-bestA∗PipelineScheduling(A, D) = O(k · |A|² · (tA(D) + |A|))   (4.10)
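To make the two bounds concrete, the sums in Equations 4.7 and 4.9 can be evaluated directly. This small illustration is our own addition, not part of the thesis:

```python
from math import factorial

def nodes_full(m):
    """Worst-case nodes generated without pruning (Equation 4.7):
    m + m(m-1) + ... + m(m-1)...2, i.e., falling factorials of length 1..m-1."""
    return sum(factorial(m) // factorial(m - l) for l in range(1, m))

def nodes_kbest(m, k):
    """Worst-case nodes with the open list pruned to size k (Equation 4.9):
    m + k(m-1) + ... + k*2."""
    return m + k * sum(range(2, m))
```

For example, for m = |A| = 4 algorithms, the full graph generates 4 + 12 + 24 = 40 nodes in the worst case, while pruning to k = 2 caps this at 4 + 6 + 4 = 14.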
of input texts is worth being spent. On the other hand, we determine the conditions under which our approach manages to find a run-time optimal pipeline based on a respective training set. Details on the source code used in the evaluation are given in Appendix B.4.
Corpora As in Section 4.1, we conduct experiments on our Revenue corpus described in Appendix C.1 as well as on the German dataset of the CoNLL-2003 shared task described in Appendix C.4. First, we process different samples of the training sets of these corpora to obtain the algorithms' run-time estimations as well as to perform scheduling. Then, we execute the scheduled pipelines on the union of the respective validation and test sets in order to measure their run-time efficiency.
Queries We consider three information needs of different complexity that we represent as queries in the form presented in Section 3.4:

γ1 = Financial(Money, Forecast(Time))
γ2 = Forecast(Time, Money, Organization)
γ3 = Forecast(Revenue(Resolved(Time), Money, Organization))

γ1 and γ2 have already been analyzed in Sections 3.5 and 4.1, respectively. In contrast, we introduce γ3 here, which we also rely on when we analyze efficiency under increasing heterogeneity of input texts in Section 4.5. γ3 targets revenue forecasts that contain a resolvable time expression, a money value, and an organization name. A simple example for such a forecast is "Apple's annual revenues could hit $400 billion by 2015". We require all information of an instance used to address any of the queries to lie within a sentence, i.e., the degree of filtering is Sentence in all cases.
Pipelines To address γ1, γ2, and γ3, we assume the following pipelines to be given initially. They employ different algorithms from Appendix A:

Π1 = (sse, sto2, tpo2, eti, emo, rfo, rfi)
Π2 = (sse, sto2, tpo2, pch, ene, emo, eti, rfo)
Π3 = (sse, sto2, tpo2, pch, ene, emo, eti, nti, rre2, rfo)

Each of the three pipelines serves as input to all evaluated approaches, i.e., the respective pipeline is simply seen as an algorithm set with a partial schedule, for which an optimized schedule can then be computed. The algorithms in Π1 allow for only 15 different admissible schedules, whereas Π2 entails 84 and Π3 even 1638 admissible schedules.
Baselines We compare our approach to three baseline approaches. All approaches are equipped with our input control from Section 3.5 and, thus, process only relevant portions of text in each analysis step. We informally define the three baselines that we look at here by the rules they follow to obtain a schedule when given a training set:
1. Fixed baseline. Do not process the training set at all. Remain with
the schedule of the given text analysis pipeline.
2. Greedy baseline. Do not process the training set at all. Schedule the
given algorithms according to their run-time estimation in an increas-
ing and admissible order, as proposed in Section 3.3.
3. Optimal baseline. Process the training set with all possible admissi-
ble schedules by stepwise executing the given algorithms in a breadth-
first search manner (Cormen et al., 2009). Choose the schedule that is
run-time optimal on the training set.24
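The greedy baseline's rule can be read as a run-time-ordered topological sort: always execute the cheapest algorithm whose dependencies are already scheduled. A minimal sketch (the algorithm names, estimates, and dependencies in the usage are hypothetical):

```python
def greedy_schedule(algorithms, estimates, depends_on):
    """Greedy baseline sketch: repeatedly pick the applicable algorithm
    (all of its dependencies already scheduled) with the lowest
    run-time estimate. Assumes a consistent set without circular
    dependencies, as in Section 3.3."""
    schedule = []
    while len(schedule) < len(algorithms):
        applicable = [a for a in algorithms
                      if a not in schedule
                      and all(d in schedule for d in depends_on.get(a, []))]
        schedule.append(min(applicable, key=lambda a: estimates[a]))
    return schedule
```

Since it never processes a training text, its scheduling time is zero; the price is that it ignores the algorithms' selectivities entirely.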
The fixed baseline is used to highlight the general efficiency potential of scheduling when filtering is performed, while the comparison to the greedy baseline shows the benefit of processing a sample of texts. The last baseline is called optimal, because it guarantees to find the schedule that is optimal on a training set. However, its brute-force nature contrasts with the efficient process of our informed search approach, as we see below.
Different from Chapter 3, we do not construct filter stages here (cf. Section 3.1), but we schedule the single text analysis algorithms instead. This may affect the efficiency of both the greedy baseline and our k-best A∗ search approach, thereby favoring the optimal baseline to a certain extent. Still, it enables us to simplify the analysis of the efficiency impact of optimized scheduling, which is our main focus in the evaluation.
Experiments Below, we measure the absolute run-times of all approaches averaged over ten runs. We break these run-times down into the scheduling time on the training sets and the execution time on the combined validation and test sets in order to analyze and compare the efficiency of the approaches in detail. All experiments are conducted on a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory.25
Efficiency Impact of k-best A∗ Search Pipeline Scheduling First, we analyze the efficiency potential of scheduling a pipeline for each of the given queries with our k-best A∗ search approach in comparison to the optimal baseline. To imitate realistic circumstances, we use run-time estimations
24 The optimal baseline generates the complete search graph introduced above. It can be seen as a simple alternative to the optimal scheduling approach from Section 4.1.
25 Sometimes, the optimal baseline and our approaches return different pipelines in different runs of the same experiment. This can happen when the measured run-times of the analyzed pipelines are very close to each other. Since such behavior can also occur in practical applications, we simply average the run-times of the returned pipelines.
Table 4.2: Comparison between our 20-best A∗ search approach and the optimal baseline with respect to the scheduling time and execution time on the Revenue corpus for each query and for five different numbers of training texts.
obtained on one corpus (the CoNLL-2003 dataset), but schedule and execute all pipelines on another one (the Revenue corpus). Since it is not clear in advance what number of training texts suffices to find an optimal schedule, we perform scheduling based on five different training sizes (with 1, 10, 20, 50, and 100 texts). In contrast, we delay the analysis of the parameter k of our approach to later experiments. Here, we set k to 20, which has reliably produced near-optimal schedules in some preliminary experiments.
Table 4.2 contrasts the scheduling times and execution times of the two evaluated approaches as well as their standard deviations. In terms of scheduling time, the k-best A∗ search approach significantly outperforms the optimal baseline for all queries and training sizes.26 For γ1, k exceeds the number of admissible schedules (see above), meaning that the approach equals a standard A∗ search. Accordingly, the gains in scheduling time appear rather small. Here, the largest difference is observed for 100 training texts, where informed search is almost two times as fast as the optimal baseline (65.0 vs. 117.3 seconds). However, this factor goes up to over 30 in case of γ3 (e.g. 507.2 vs. 15589.1 seconds), which indicates the huge impact of informed search for larger search graphs. At the same time, it fully competes with the optimal baseline in finding the optimal schedule. Moreover,
26 Partly, our approach improves over the baseline even in terms of execution time. This, however, emanates from a lower system load, not from finding a better schedule.
Figure 4.8: Illustration of the execution times (medium colors on the left) and
the scheduling times (light colors on the right) as well as their standard devia-
tions (small black markers) of all evaluated approaches on the Revenue corpus for
each of the three addressed queries and 20 training texts.
Figure 4.9: The total run-times of the greedy baseline and two variants of our k-best A∗ search approach for addressing γ3 as a function of the number of processed texts. The dashed parts are extrapolated from the run-times of the approaches on the 366 texts in the validation and test set of the Revenue corpus.
Table 4.3: The average execution times in seconds with standard deviations of addressing the query γ3 using the pipelines scheduled by our 20-best A∗ search approach and the greedy baseline depending on the corpora on which (1) the algorithms' run-time estimations are determined, (2) scheduling is performed (in case of the 20-best A∗ search approach), and (3) the pipeline is executed.
k-best A∗ search approaches and the greedy baseline for γ3, assuming that the run-times grow proportionally to those on the 366 texts of the validation and test set of the Revenue corpus. Given 20 training texts, our approach actually saves time beginning at a number of 386 processed texts: There, the total run-time of 1-best A∗ search starts to be lower than the execution time of the greedy baseline. Later on, 5-best A∗ search becomes better, and so on. Consequently, our hypothesis already raised in Section 3.3 turns out to be true for this evaluation: Given that an information need must be addressed ad-hoc, a zero-time scheduling approach (like the greedy baseline) seems more reasonable, but when large amounts of text must be processed, performing scheduling based on a sample of texts is worth the effort.
Input Dependency of Optimized Scheduling Lastly, we investigate to what extent the given input texts influence the quality of the pipeline constructed through optimized scheduling. In particular, we evaluate all possible combinations of using the Revenue corpus and the CoNLL-2003 dataset as input for the three main involved steps: (1) Determining the run-time estimations of the pipeline's algorithms, (2) scheduling the algorithms, and (3) executing the scheduled pipeline. To see the impact of the input, we address only the query where our k-best A∗ search approach is most successful, i.e., γ3. We compare the approach to the greedy baseline, which also involves steps (1) and (3). This time, we leave both k and the number of training texts fixed, setting each of them to 20.
Table 4.3 lists the results for each corpus combination. A first observation matching our argumentation from Section 4.2 is that the efficiency impact of our approach remains stable under the different run-time estimations: The resulting execution times are exactly the same for the respective configurations on the CoNLL-2003 dataset and are also very similar on the Revenue corpus. In contrast, the greedy baseline is heavily affected by the estimations at hand. Overall, the execution times on the two corpora differ largely because only the Revenue corpus contains many portions of text that are relevant for γ3. The best execution times are 16.1 and 5.1 seconds, respectively, both achieved by the informed search approach. For the CoNLL-2003 dataset, however, we see that scheduling based on inappropriate training texts can have negative effects, as in the case of the Revenue corpus, where the efficiency of 20-best A∗ search significantly drops from 5.1 to 6.5 seconds. In practice, this becomes important when the input texts to be processed are heterogeneous, which we analyze in the following sections.
Table 4.4: The run-time t(Π) with standard deviation of each admissible pipeline Π based on the given algorithm set A on both processed corpora in comparison to the gold standard. #best denotes the number of texts Π is most efficient on.
on 96 of the 553 texts. While the best fixed pipeline, Π1, performs well on both corpora, Π2 and Π6 fail to maintain efficiency on the Revenue corpus, with e.g. Π6 being almost 50% slower than the gold standard. Although the gold standard significantly outperforms all pipelines on both corpora at a very high confidence level, the difference to the best fixed pipelines may seem acceptable. However, the case of Π2 and Π6 shows that a slightly different training set could have caused the optimized scheduling from Section 4.3 to construct a pipeline whose efficiency is not robust to changing distributions of relevant information. We hypothesize that such a danger becomes more probable the higher the text heterogeneity of a corpus is.28
To deal with text heterogeneity, the question is whether and how we can anticipate it for a collection or a stream of texts. Intuitively, it appears reasonable to assume that text heterogeneity relates to the mixing of types, domains, or corresponding text characteristics, as is typical for the results of an exploratory web search. However, the following example from text classification suggests that there is not only one dimension that governs the heterogeneity. In text classification tasks, the information sought for is the final class information of each text. While the density of classes naturally will be 1.0 in all cases (given that different classes refer to the same information type), what may vary is the distribution of those information types that serve as input for the final classification. For instance, our sentiment analysis approach developed in the ArguAna project (cf. Section 2.3) relies on the facts and opinions in a text. For our ArguAna TripAdvisor corpus (cf. Appendix C.2) and for the Sentiment Scale dataset from (Pang and Lee, 2005), we illustrate the distribution of these types in Figure 4.10.29
As can be seen, the distribution of relevant information in the ArguAna
TripAdvisor corpus remains nearly identical among its three parts.30 The
corpus compiles texts of the same type (user reviews) and domain (hotel),
but different topics (hotels) and authors, suggesting that the type and do-
main play a more important role. This is supported by the different densi-
ties of facts, positive, and negative opinions in the Sentiment Scale dataset,
which is comprised of more professional reviews from the movie domain.31
28 Also, larger numbers of admissible schedules make it harder to find a robust pipeline, since they allow for higher efficiency gaps, as we have seen in the evaluation of Section 4.3.
29 In (Wachsmuth et al., 2014a), we observe that the distributions and positions of facts and opinions influence the effectiveness of sentiment analysis. As soon as a pipeline restricts some analysis to certain portions of text only (say, to positive opinions), however, the different distributions will also impact the efficiency of the pipeline's schedule.
30 Since the distributions are computed based on the self-created annotations here, the values for the ArguAna TripAdvisor corpus differ from those in Appendix C.2.
31 In the given case, the density and relative frequency of each information type (cf. Section 4.2) are the same, since the information types define a partition of all portions of text.
However, the four parts of the Sentiment Scale dataset show a high variation. Especially the distribution of facts and opinions in Author d deviates
from the others, so the writing style of the texts seems to matter, too. We
conclude that it does not suffice to know the discussed characteristics for a
collection or a stream of texts in order to infer its heterogeneity. Instead, we
propose to quantify the differences between the input texts as follows.
text result from varying densities of C1, . . . , C|C| in the processed texts.32 So, the text heterogeneity of D can be quantified by measuring the variance of these densities in C. The outlined considerations give rise to a new measure that we call the averaged deviation:

Averaged Deviation Let C = {C1, . . . , C|C|} be an information need to be addressed on a collection or a stream of input texts D, and let σi(D) be the standard deviation of the density of Ci ∈ C in D, 1 ≤ i ≤ |C|. Then, the averaged deviation of C in D is

D(C|D) = (1/|C|) · Σ_{i=1}^{|C|} σi(D)   (4.11)
Given a text analysis task, the averaged deviation can be estimated based on a sample of texts. Different from other sampling-based approaches for efficiency optimization, like (Wang et al., 2011), it does not measure the typical characteristics of input texts, but it quantifies how much these characteristics vary. By that, the averaged deviation reflects the impact of the input texts to be processed by a text analysis pipeline on the pipeline's efficiency: the higher the averaged deviation, the more the optimal pipeline schedule will vary on different input texts.
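Equation 4.11 is straightforward to compute from per-text densities. A sketch (the type names and density values in the usage are invented; we use the population standard deviation, as the thesis does not commit to a specific estimator):

```python
from statistics import pstdev

def averaged_deviation(densities):
    """Averaged deviation (Equation 4.11): the mean, over the information
    types C_i, of the standard deviation of C_i's density across the
    input texts. `densities` maps each type to its per-text densities."""
    return sum(pstdev(values) for values in densities.values()) / len(densities)
```

A value of 0 means every text exhibits the same densities (a fixed schedule loses nothing); larger values indicate that the optimal schedule varies across texts.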
To illustrate the defined measure, we refer to Person, Location, and Organization entities again, for which we have presented the densities in two English and four German text corpora in Section 4.2. Now, we determine the standard deviations of these densities in order to compute the associated averaged deviations (as always, see Appendix B.4 for the source code). Table 4.5 lists the results, ordered by increasing averaged deviation.33 While the deviations behave quite orthogonally to the covered topics and genres, they seem connected to the quality of the texts in a corpus to some extent. Concretely, the Revenue corpus and Brown corpus (both containing a carefully planned choice of texts) show less heterogeneity than the random sample of Wikipedia articles and much less than the LFA-11 web crawl of smartphone blog posts. This matches the intuition of web texts being heterogeneous. An exception is given by the values of the CoNLL-2003 datasets, though, which rather suggest that high deviations correlate with high densities (cf. Figure 4.5). However, the LFA-11 corpus contradicts this, having the lowest densities but the second highest averaged deviation (18.4%).
32 Notice that even without an input control the number of instances of the relevant information types can affect the efficiency, as outlined at the beginning of Section 4.2. However, the density of information types might not be the appropriate measure in this case.
33 Some of the standard deviations of organization entities in Table 4.5 and the associated averaged deviations exceed those presented in (Wachsmuth et al., 2013c). This is because there we use a modification of the algorithm ene, which rules out some organization names.
Table 4.5: The standard deviations of the densities of person, organization, and location entities from Figure 4.5 (cf. Section 4.2) as well as the resulting averaged deviations, which quantify the text heterogeneity in the respective corpora. All values are computed based on the results of Πene = (sse, sto2, tpo1, pch, ene).
Altogether, the introduced measure does not clearly reflect any of the text characteristics discussed above. For efficiency purposes, it therefore serves as a proper means to compare the heterogeneity of different collections or streams of texts with respect to a particular information need. In contrast, it does not help to investigate our hypothesis that the danger of losing efficiency grows under increasing text heterogeneity, because it leaves unclear what a concrete averaged deviation value actually means. For this purpose, we need to estimate how much run-time is wasted by relying on a text analysis pipeline with a fixed schedule.
Figure 4.11: Illustration of computing the gold standard run-time t_gs(D) of an algorithm set A = {A1, . . . , Am} on a sample of portions of texts D = (d1, . . . , dn) for the simplified case that the algorithms in A have no interdependencies.
As a matter of fact, the overall run-time of the gold standard on the sample of texts D results from summing up all run-times t_gs(dj):

t_gs(D) = Σ_{j=1}^{n} t_gs(dj)   (4.13)

The computation of t_gs(D) is illustrated in Figure 4.11. Given t_gs(D) and the optimal pipeline's run-time t∗(D), we finally estimate the efficiency impact of text heterogeneity in the collection or stream of texts represented by the sample D as the fraction of run-time that can be saved through scheduling the algorithms depending on the input text, i.e., 1 − t_gs(D)/t∗(D).34
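Under the simplifying assumption of Figure 4.11 (no algorithm interdependencies), the gold standard run-time and the estimated saving can be computed directly from a run-time matrix. In this sketch we normalize by the run-time of the best fixed schedule; the schedule names and run-times in the usage are invented:

```python
def efficiency_loss(runtimes):
    """`runtimes[s]` lists the run-time of fixed schedule s on each text.
    The gold standard picks the fastest schedule per text (Equation 4.13);
    the returned value is the fraction of the best fixed schedule's
    run-time that adaptive, per-text scheduling would save."""
    n = len(next(iter(runtimes.values())))
    t_gs = sum(min(ts[j] for ts in runtimes.values()) for j in range(n))
    t_fixed = min(sum(ts) for ts in runtimes.values())
    return 1.0 - t_gs / t_fixed
```

If one fixed schedule is fastest on every text, the value is 0; the more the per-text winners differ, the closer it gets to the optimization potential of adaptive scheduling.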
Put the other way round, the estimated efficiency loss represents the optimization potential of choosing the schedule of a pipeline depending on the input text at hand. On this basis, a pipeline designer can decide whether it seems worth spending the additional effort of realizing such an adaptive scheduling. In the next section, we first present a corresponding approach. Then, we evaluate the importance of adaptive scheduling for maintaining efficiency under increasing text heterogeneity.
Figure 4.12: Illustration of the overall pipeline when addressing pipeline scheduling as a text classification problem. Based on the results of a prefix pipeline, a learned scheduling model decides what main pipeline to choose for the input text D at hand.
determine optimized pipelines for different samples of input texts and then let each of these become a candidate pipeline. In the following, we simply expect some k ≤ |A|! candidate pipelines Π = {Π1, . . . , Πk} to be given.
Now, the determination of an optimal pipeline Π∗(D) for an input text D ∈ D requires to have information about D. For text classification, we represent this information in the form of a feature vector x (cf. Section 2.1). Before we can find Π∗(D), we therefore need a mapping D → x, which in turn requires a preceding analysis of the texts in D.35 Let Π_pre = ⟨A_pre, π_pre⟩ be the pipeline that realizes this text analysis. For distinction, we call Π_pre the prefix pipeline and each Π ∈ Π a main pipeline. Under the premise that all algorithms from A_pre ∩ A have been removed from the main pipelines, the prefix pipeline can be viewed as the fixed first part of an overall pipeline, while each main pipeline denotes one of the possible second parts. The results of Π_pre for an input text D ∈ D lead to the feature values x(D), which can then be used to choose a main pipeline for D.
Concretely, we propose to realize the mapping from feature values to main pipelines as a statistical model obtained through machine learning on a set of training texts D_T. The integration of such a scheduling model into the overall pipeline is illustrated in Figure 4.12. We formalize our aim of finding a mapping D → Π as an adaptive scheduling problem:
adaptivePipelineScheduling(Π_pre, {Π1, ..., Πk}, D_T, D)
1:  for each main pipeline Πi do
2:      Regression model Y(Πi) ← initializeRegressionModel()
3:  for each input text D ∈ D_T do
4:      Π_pre.process(D)
5:      Feature values x(D) ← computeFeatureValues(Π_pre, D)
6:      for each main pipeline Πi do
7:          Run-time t(Πi) ← Πi.process(D)
8:          updateRegressionModel(Y(Πi), x(D), t(Πi))
9:  for each input text D ∈ D do
10:     Π_pre.process(D)
11:     Feature values x(D) ← computeFeatureValues(Π_pre, D)
12:     for each main pipeline Πi do
13:         Estimated run-time q(Πi) ← Y(Πi).predictRunTime(x(D))
14:     Main pipeline Π*(D) ← arg min_Πi q(Πi)
15:     Run-time t(Π*(D)) ← Π*(D).process(D)
16:     updateRegressionModel(Y(Π*(D)), x(D), t(Π*(D)))
Pseudocode 4.3: Learning the fastest main pipeline Π*(D) self-supervised for
each input text D from a training set D_T and then predicting and choosing Π*(D)
depending on the input text D ∈ D at hand while continuing learning online.
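The control flow of Pseudocode 4.3 can be sketched in Python as follows. All names here (the AdaptiveScheduler class, the featurize function, and the regressor interface with update and predict) are hypothetical stand-ins for illustration, not the implementation evaluated in this chapter:

```python
# Illustrative sketch of Pseudocode 4.3. Pipeline objects, the feature
# function, and the regressor interface are hypothetical stand-ins.
import time

class AdaptiveScheduler:
    def __init__(self, prefix, mains, make_regressor, featurize):
        self.prefix = prefix          # prefix pipeline (lines 4 and 10)
        self.mains = list(mains)      # candidate main pipelines
        self.featurize = featurize    # feature computation on prefix results
        self.models = {m: make_regressor() for m in self.mains}  # one Y per pipeline

    def train(self, texts):
        # Self-supervised phase (lines 3-8): run every main pipeline on each
        # training text and use the measured run-times as regression labels.
        for doc in texts:
            self.prefix.process(doc)
            x = self.featurize(doc)
            for m in self.mains:
                start = time.perf_counter()
                m.process(doc)
                self.models[m].update(x, time.perf_counter() - start)

    def process(self, doc):
        # Prediction phase (lines 9-16): choose the main pipeline with the
        # lowest estimated run-time, then keep learning online from the result.
        self.prefix.process(doc)
        x = self.featurize(doc)
        best = min(self.mains, key=lambda m: self.models[m].predict(x))
        start = time.perf_counter()
        best.process(doc)
        self.models[best].update(x, time.perf_counter() - start)
        return best
```

Note that only the pipeline actually chosen is executed and used for the online update, exactly as in lines 15 and 16 of the pseudocode.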
t_update(D, Π) ≤ t_pre(D) + t_fc(D) + t_main(D) + (|Π|+1) · t_reg(D_T)    (4.15)
Table 4.6: The standard deviations of the densities of all information types from C
in the four evaluated text corpora as well as the resulting averaged deviations. All
values are computed from the results of a non-filtering pipeline based on A.
texts are processed in each step (cf. Sections 3.4 and 3.5). Here, we set the
degree of filtering to Sentence.
Corpora For a careful analysis of our hypothesis, we need comparable col-
lections or streams of input texts that refer to different levels of text het-
erogeneity. Most existing corpora for information extraction tasks are too
small to create reasonable subsets of different heterogeneity like those used
in the evaluations above, i.e., the Revenue corpus (cf. Appendix C.1) and
the CoNLL-2003 dataset (cf. Appendix C.4). An alternative is given by web
crawls. Web crawls, however, tend to include a large fraction of completely
irrelevant texts (as indicated by our analysis in Section 4.2), which conceals
the efficiency impact of scheduling.
We therefore decided to create partly artificial text corpora D0 , . . . , D3
instead. D0 contains a random selection of 1500 original texts from the Rev-
enue corpus and the German CoNLL-2003 dataset. The other three consist
of both original texts and artificially modified versions of these texts, where
the latter are created by randomly duplicating one sentence, ensuring that
each text is unique in every corpus: D1 is made up of the 300 texts from D0
with the highest differences in the density of the information types relevant
for γ3 as well as 1200 modified versions. Accordingly, D2 and D3 are
created from the 200 and 100 highest-difference texts, respectively. Where not
stated otherwise, we use the first 500 texts of each corpus for training and
the remaining 1000 for testing (and updating regression models).
By resorting to modified duplicates, we limit our approach to a certain
extent in learning features from the input texts. In exchange, we gain the
ability to capture the impact of adaptive scheduling as a function of the text
heterogeneity, which we quantify using the averaged deviation measure from
Section 4.4. Table 4.6 lists the exact deviations of the densities of all relevant
information types in the sentences of each of the four corpora.
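As a rough illustration, the following sketch computes such an averaged deviation as the mean, over all information types, of the standard deviation of each type's per-text density. This mirrors the measure from Section 4.4 only in outline; the exact normalization there may differ:

```python
# Illustrative sketch of an averaged deviation measure; the exact
# definition in Section 4.4 may normalize differently.
from statistics import pstdev
from typing import Dict, List

def averaged_deviation(densities: Dict[str, List[float]]) -> float:
    """Average, over all information types, of the standard deviation
    of that type's density across the texts of a corpus.

    `densities` maps each information type to its per-text density values.
    """
    if not densities:
        return 0.0
    return sum(pstdev(values) for values in densities.values()) / len(densities)
```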
Algorithms and Pipelines To address the query γ3, we rely on a set of nine
text analysis algorithms (details on these are provided in Appendix A):
To allow for online learning, we trained linear regression models with the
Weka 3.7.5 implementation (Hall et al., 2009) of the incremental algorithm
Stochastic Gradient Descent (cf. Section 2.1). In all experiments, we let
the algorithm iterate 10 epochs over the training set, while its learning rate
was set to 0.01 and its regularization parameter to 0.00001.40
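A minimal self-contained counterpart of this training setup is an L2-regularized linear regression trained by stochastic gradient descent with the stated hyperparameters. The class below is an illustrative sketch, not Weka's implementation:

```python
# Illustrative SGD linear regression with the hyperparameters stated in
# the text (learning rate 0.01, regularization 1e-5, 10 epochs by default).
import random

class SGDLinearRegression:
    def __init__(self, n_features, lr=0.01, reg=0.00001):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.reg = reg

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        # One online step on a single (x, y) pair: squared-error gradient
        # plus the L2 penalty on the weights.
        err = self.predict(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * (err * xi + self.reg * self.w[i])
        self.b -= self.lr * err

    def fit(self, data, epochs=10, seed=0):
        # Batch training: iterate several epochs over a shuffled training
        # set of (feature vector, run-time) pairs; shuffles `data` in place.
        rng = random.Random(seed)
        for _ in range(epochs):
            rng.shuffle(data)
            for x, y in data:
                self.update(x, y)
```

The same `update` method serves both the initial training epochs and the later online updates on processed texts.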
Baselines The aim of adaptive scheduling is to achieve optimal efficiency
on collections or streams of input texts where no single optimal schedule
exists. In this regard, we see the optimal baseline from Section 4.3, which
determines the run-time optimal fixed pipeline (Π_pre, π*) on the training
set and then chooses this pipeline for each test text, as the main competi-
tor. Moreover, we introduce another baseline to assess whether adaptive
scheduling improves over trivial non-fixed approaches:
Random baseline. Do not process the training set at all. For each test
text, choose one of the fixed pipelines (pseudo-) randomly.
Finally, we compare all approaches against the gold standard, which knows
the optimal main pipeline for each text beforehand. Together with the optimal
baseline, the gold standard implies the optimization potential of adaptive
scheduling on a given collection or stream of input texts (cf. Section 4.4).
Experiments In the following, we present the results of a number of
efficiency experiments that were conducted on a 2 GHz Intel Core 2 Duo
MacBook with 4 GB memory. We do not report on effectiveness, since all main
pipelines are equally effective by definition. The pipelines' efficiency is
measured as the run-time in milliseconds per sentence, averaged over ten runs.
For reproducibility, all run-times and their standard deviations were saved
in a file in advance. In the experiments, we then loaded the precomputed
run-times instead of executing the pipelines.41
Efficiency Impact of Adaptive Scheduling We evaluate adaptive schedul-
ing on the test sets of each corpus D0 , . . . , D3 after training on the respective
training sets. Figure 4.13 compares the run-times of the main pipelines of
our approach to those of the two baselines and the gold standard as a func-
tion of the averaged deviation. The shown confidence intervals visualize
the standard deviations σ, which range from 0.029 to 0.043 milliseconds.
On the least heterogeneous corpus D0 , we achieve an average run-time
of 0.98 ms per sentence through adaptive scheduling. This is faster than the
40 In preceding experiments, we tested one other online learning algorithm, namely an
artificial neural network. Mostly, Stochastic Gradient Descent performed better.
41 For lack of relevance in our discussion, we leave out an analysis of the effects of relying
on precomputed run-times here. In (Wachsmuth et al., 2013c), we offer evidence that the
main effect is a significant reduction of the standard deviations of the pipelines' run-times.
(The figure plots run-times over the averaged deviations 13.8% (D0), 15.4% (D1),
16.6% (D2), and 18.5% (D3).)
Figure 4.13: Interpolated curves of the main pipelines average run-times of both
baselines, our adaptive scheduling approach, and the gold standard under increas-
ing averaged deviation, which represents the heterogeneity of the processed texts.
The background areas denote the 95% confidence intervals (± 2σ).
random baseline (1.06 ms), but slower than the optimal baseline (0.95 ms)
at a low confidence level. On all other corpora, our approach also outper-
forms the optimal baseline, providing evidence for the growing efficiency
impact of adaptive scheduling under increasing text heterogeneity. At the
averaged deviation of 18.5% on D3 , adaptive scheduling clearly succeeds
over both baselines, whose main pipelines take 37% and 40% more time on
average, respectively. There, the optimal baseline does not choose the main
pipeline that performs best on the test set. This matches our hypothesis
from Section 4.4 that higher text heterogeneity may cause a significant
efficiency loss when relying on a fixed schedule.
Altogether, the curves in Figure 4.13 emphasize that only our adaptive
scheduling approach manages to stay close to the gold standard on all cor-
pora. In this respect, one reason for the seemingly weak performance of our
approach on D0 lies in the low optimization potential of adaptive schedul-
ing on that corpus: The optimal baseline takes only about 12% more time on
average than the gold standard (0.95 ms as opposed to 0.85 ms). This indi-
cates very small differences in the main pipelines' run-times, which renders
the prediction of the fastest main pipeline both hard and quite unnecessary.
In contrast, D3 yields an optimization potential of over 50%. In the follow-
ing, we analyze the effects of these differences in detail.
Run-time and Error Analysis Figure 4.14 breaks down the run-times of the
three fixed pipelines, our approach, and the gold standard on D0 and D3
according to Inequality 4.15. On D3 , all fixed pipelines are significantly
slower than our approach in terms of overall run-time. The overall run-
time of our approach is largely caused by the prefix pipeline and the main
pipelines, while the time spent for computing the above-mentioned standard
feature types 1–3 (0.03 ms) and for the subsequent regression (0.01 ms)
(Π_pre, π1): 0.95 ms on D0, 0.96 ms on D3
(Π_pre, π2): 1.06 ms on D0, 1.02 ms on D3
(Π_pre, π3): 1.09 ms on D0, 1.08 ms on D3
(prefix pipeline Π_pre: 0.65 ms on D0, 0.51 ms on D3)
Figure 4.14: The average run-times per sentence (with standard deviations) of the
three fixed pipelines, our adaptive scheduling approach, and the gold standard on
the test sets of D0 and D3 . Each run-time is broken down into its different parts.
is almost negligible. This also holds for D0 where our approach performs
worse only than the optimal fixed pipeline (Π_pre, π1).
(Π_pre, π1) is the fastest pipeline on 598 of the 1000 test texts from D0,
whereas (Π_pre, π2) and (Π_pre, π3) have the lowest run-time on 229 and 216
texts, respectively (on some texts, the pipelines are equally fast). In contrast,
our approach chooses π2 (569 times) more often than π1 (349) and π3 (82),
which results in an accuracy of only 39% for choosing the optimal pipeline.
This behavior is caused by a mean regression error of 0.45 ms, which is
almost half as high as the run-times to be predicted on average and, thus,
often exceeds the differences between them. However, the success on D3
does not emanate from lower regression errors, which are in fact 0.24 ms
higher on average. Still, the accuracy is increased to 55%. So, the success
must result from larger differences in the main pipelines' run-times.
One reason can be inferred from the average run-time per sentence
of Π_pre in Figure 4.14, which is significantly higher on D0 (0.65 ms)
than on D3 (0.51 ms). Since the run-times of all algorithms in Π_pre scale
linearly with the number of input tokens, the average sentence length of D0
must exceed that of D3. Naturally, shorter sentences tend to contain less
relevant information. Hence, many sentences can be discovered as being ir-
relevant after few analysis steps by a pipeline that schedules the respective
text analysis algorithms early.
Learning Analysis The observed regression errors bring up the question of
how suitable the employed feature set is for learning adaptive scheduling.
To address this question, we built a separate regression model on the train-
ing sets of D0 and D3 , respectively, for each of the five distinguished feature
types in isolation as well as for combinations of them. For each model, we
then measured the resulting mean regression error as well as the classifi-
cation accuracy of choosing the optimal main pipeline. In Table 4.7, we
Table 4.7: The average regression time per sentence (including feature computation
and regression), the mean regression error, and the accuracy of choosing the optimal
pipeline for each input text in either D0 or D3 for different feature types.
compare these values to the respective regression time, i.e., the run-time per
sentence spent for feature computations and regression.
In terms of the mean regression error, the part-of-speech tags and regex
matches perform best among the single feature types, while the average
run-times fail completely, especially on D3 (1.34 ms). Still, the accuracy of
the average run-times is far from worst, indicating that they sometimes pro-
vide meaningful information. The best accuracy is clearly achieved by the
lexical statistics.42 Obviously, none of the single feature types dominates
the evaluation. The set of all features outperforms both the single types
and the standard features in most respects. Nevertheless, we use the stan-
dard features in all other experiments, because they entail a regression time
of only 0.04 to 0.05 milliseconds per sentence on average. In contrast, the
regex matches, e.g., need 0.16 ms alone on D0, which exceeds the difference
between the optimal baseline and the gold standard on D0 and, thus, ren-
ders the regex matches useless in the given setting.
The regex matches emphasize the need for efficiently computable features
that we discussed above. While the set of standard features fulfills
this requirement, it seems as if none of the five feature types really
captures the text characteristics relevant for adaptive scheduling.43
Alternatively, though, the features may also require more than the 500
training texts given so far. To rule out this possibility, we next analyze the
performance of the standard features depending on the size of the training
set. Figure 4.15 shows the main pipelines' run-times for nine training sizes
42 The low inverse correlation of the mean regression error and the classification accuracy
seems counterintuitive, but it indicates the limitations of these measures: E.g., a small
regression error can still be problematic if run-times differ only slightly, while a low
classification accuracy may have few negative effects in this case.
43 We have also experimented with other task-independent features, especially further
regular expressions, but their benefit was low. Therefore, we do not report on them here.
Figure 4.15: The average run-time per sentence of the main pipelines of the two
baselines, our adaptive scheduling approach, and the gold standard on the test
set of D0 as a function of the training size.
between 1 and 5000. Since the training set of D0 is limited, we have partly
performed training on duplicates of the texts in D0 (modified in the way
sketched above) where necessary. Adaptive scheduling does better than the
random baseline, but not better than the optimal baseline, on all training sizes
except for 1. The illustrated curve oscillates minimally in the beginning. After
its maximum at 300 training texts (1.05 ms), it declines monotonically
until it reaches 0.95 ms at size 1000. From there, the algorithm mimics the
optimal baseline, i.e., it chooses π1 on about 90% of the texts.
While the observed learning behavior may partly result from overfitting
the training set in consequence of using modified duplicates, it also under-
lines that the considered features simply do not suffice to always find the
optimal pipeline for each text. Still, more training decreases the danger of
being worse than without adaptive scheduling.
In addition, our approach continues learning in its update phase, as we
finally exemplify. Figure 4.16 plots two levels of detail of the learning curve
of the employed regression models on 15,000 modified duplicates of the
texts from D0 .44 Here, only one text is used for an initial training. As the
bold curve highlights, the mean regression error decreases on the first 4000
to 5000 texts to an area between 0.35 ms and 0.45 ms, where it stays most
of the time afterwards. Although the light curve reveals many outliers, we
conclude that online learning apparently works well.
(The figure plots the regression error in milliseconds over the 15,000 input
texts; the overall mean is 0.43 ms.)
Figure 4.16: The mean regression error for the main pipelines chosen by our adap-
tive scheduling approach on 15,000 modified versions of the texts in D0 with train-
ing size 1. The values of the two interpolated learning curves denote the mean of
100 (light curve) and 1000 (bold curve) consecutive predictions, respectively.
accurate run-time predictions. Aside from that, another way to reduce the
prediction errors may be to iteratively schedule each text analysis algorithm
separately. This would allow for more informed features in later predic-
tions, but it would also make the learning of the scheduling much more
complex. Moreover, the introductory example in Section 4.1 suggests that
the first filter stages in a pipeline tend to be most decisive for the pipeline's
efficiency. Since the main purpose of this section is to show how to deal
with text heterogeneity in general, we leave these and other extensions
of our adaptive scheduling approach for future work.
As a result, we close the analysis of the pipeline scheduling problem here.
Throughout this chapter, we have offered evidence that the optimization
of a pipeline's design and execution in terms of efficiency can drastically
speed up the realized text analysis process. The underlying goal in the con-
text of this thesis is to enable text analysis pipelines to be used for large-
scale text mining and, thus, to work on big data. The analysis of big data
strongly relies on distributed and parallelized processing. In the following,
we therefore conclude this chapter by discussing to what extent the devel-
oped scheduling approaches can be parallelized.
In Section 2.4, we have already clarified that text analysis processes are
very amenable to parallelization because different input texts are analyzed
independently in most cases. In general, parallelization may have a number
of purposes, as e.g. surveyed in (Kumar et al., 1994). Not all of these target
the memory and processing power of an application. For instance, par-
allelization can also be used to introduce redundancy into an application,
which allows a handling of machine breakdowns, thereby increasing the
fault tolerance of an application. While we briefly discuss fault tolerance
below, in the context of pipeline execution we are predominantly interested
in the question to what extent the run-time efficiency of the pipelines result-
ing from our approaches scales under parallelization, i.e., whether it grows
proportionally to the number of available machines. To this end, we qualita-
tively examine possible ways to parallelize pipeline execution with respect
to different efficiency-related metrics.
In particular, we primarily focus on the pipeline execution time, i.e., the
total run-time of all employed pipelines on the (possibly large) collection
or stream of input texts. This run-time is connected to other metrics: First,
some experiments in this chapter have indicated that the memory consumption
of maintaining pipelines on a machine matters, namely, a high memory
load lowers the efficiency of text analysis. Second, the impact of parallelization
depends on the extent to which machine idle times are avoided. In this
regard, machine utilization denotes the percentage of the overall run-time of
a machine in which it processes text. And third, the distribution of texts
over a network causes communication overhead, which we indirectly capture
as the network time. We assume these three to be most important for the
pipeline execution time and accordingly do not discuss other metrics.
Aside from a scalable execution, parallelization can also be exploited to
speed up pipeline scheduling. We analyze the effects of parallelization on
the scheduling time, i.e., the time spent for an optimized scheduling on a
sample of texts, as proposed in Section 4.3 (or for the optimal scheduling in
Section 4.1), as well as on the training time of our adaptive scheduling approach
from Section 4.5. Also, we look at the minimum response time of a pipeline,
which we define as the pipeline's run-time on a single input text. The min-
imum response time becomes important in ad-hoc text mining, when first
results need to be returned as fast as possible (cf. Section 3.3).
In the following, we examine four types of parallelization for the scenario
that a single text analysis task is to be addressed on a network of machines
with pipelines equipped with our input control from Section 3.5. All ma-
chines are uniform in speed and execute algorithms and pipelines in the
same way. They can receive arbitrary input texts from other machines, an-
Table 4.8: Qualitative overview of the expected effects of the four distinguished
types of parallelization with respect to each considered metric. The scale ranges
from very positive [++] over none or hardly any [o] to very negative [−−].
alyze the texts, and return the produced output information. We assess the
effects of each type on all metrics introduced above on a comparative scale
from very positive [++] and positive [+] over none or hardly any [o] to negative [−]
and very negative [−−]. Table 4.8 provides an overview of all effects.
To illustrate the different types, we consider three machines μ0, ..., μ2.
μ0 serves as the master machine that distributes input texts and aggregates
output information. Given this setting, we schedule four sample algorithms
related to our case study InfexBA from Section 2.3: a time recognizer AT ,
a money recognizer AM , a forecast event detector AF , and some segmen-
tation algorithm AS . Let the output of AS be required by all others and
let AF additionally depend on AT . Then, three admissible pipelines exist:
(AS , AT , AM , AF ), (AS , AT , AF , AM ), and (AS , AM , AT , AF ).
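These three admissible pipelines follow directly from the dependencies: every algorithm must appear after all of its prerequisites. A brute-force sketch (with hypothetical string identifiers for the four algorithms) enumerates all such orderings:

```python
# Illustrative sketch: enumerate all admissible schedules, i.e., all
# orderings in which every algorithm appears after its prerequisites.
# Brute force is fine for the small algorithm sets considered here.
from itertools import permutations

def admissible_schedules(algorithms, depends_on):
    valid = []
    for order in permutations(algorithms):
        pos = {a: i for i, a in enumerate(order)}
        if all(pos[d] < pos[a] for a, deps in depends_on.items() for d in deps):
            valid.append(order)
    return valid

# The dependency structure of the four sample algorithms from the text:
# A_S is required by all others, and A_F additionally depends on A_T.
deps = {"AT": {"AS"}, "AM": {"AS"}, "AF": {"AS", "AT"}}
```

With these dependencies, the enumeration yields exactly the three admissible pipelines named above.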
The same holds for exchanged roles of A1 and A2 . We have seen an ex-
ample that fulfills Inequality 4.16 in the evaluation of Section 4.1, where
the pipeline (eti, emo) outperforms the algorithm emo alone. The danger of
losing efficiency (which also exists for the minimum response time [+/−])
generally makes analysis parallelization questionable. While the schedul-
ing time [++] and training time [+] behave in the same way as for analysis
pipelining, other types of parallelization exist that come with only few notable
drawbacks and with even more benefits, as discussed next.
Figure 4.18: Parallel processing of a single input text with four sample text analysis
algorithms: (a) The master machine segments the input text into portions of text,
each of which is then processed on one machine. (b) Each machine processes the
whole input text, but schedules the algorithms in a distinct manner.
input text into single portions, whose size is constrained by the largest spec-
ified degree of filtering in the query to be addressed (cf. Section 3.4). The
portions can then be distributed to the available machines. Figure 4.18(a)
sketches such an input distribution for our four sample algorithms.
Pipeline duplication appears to be an almost perfect choice, at least when
a single text analysis pipeline is given, as in the case of our optimized
scheduling approach from Section 4.3. In contrast, the (ideally) even better
adaptive scheduling approach from Section 4.5 can still cause a high mem-
ory consumption, because every machine needs to maintain all candidate
schedules. A solution is to parallelize the schedules instead of the pipeline,
as illustrated in Figure 4.17(d). Such a schedule parallelization requires
storing only a subset of the schedules on each machine, thereby reducing
memory consumption [+]. The adaptive choice of a schedule (and, hence,
of a machine) for an input text then must take place on the master machine.
Consequently, idle times can occur, especially when the choice is very im-
balanced. In order to ensure a full machine utilization [++], input texts may
therefore have to be reassigned to other machines, which implies a negative
effect on the network time [−]. So, we cannot generally determine whether
schedule parallelization yields a better pipeline execution time than pipeline
duplication or vice versa [++].
In terms of scheduling time [++] and training time [++], schedule parallelization
behaves analogously to pipeline duplication, whereas the distribution
of schedules over machines will tend to benefit the minimum response time
on a single input text more clearly [++]: Similar to (Kalyanpur et al., 2011),
a text can be processed by each machine simultaneously (cf. Figure 4.18(b)).
As soon as the first machine finishes, the execution can stop to directly re-
turn the produced output information. However, the full potential of such
a massive parallelization is only achieved when all machines are working.
Still, schedule parallelization makes it easy to cope with machine break-
downs in general, indicating a high but not optimal fault tolerance [+].
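The first-finisher behavior described above can be sketched with threads. The schedule objects and their blocking process method are hypothetical stand-ins; in a real network setting, each schedule would run on its own machine:

```python
# Illustrative sketch of schedule parallelization for the minimum response
# time: run every candidate schedule on the same input text concurrently
# and return the output of whichever finishes first.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def first_finisher(schedules, doc):
    with ThreadPoolExecutor(max_workers=len(schedules)) as pool:
        futures = [pool.submit(s.process, doc) for s in schedules]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best effort; already-running schedules cannot be cancelled
        return next(iter(done)).result()
```

Since all schedules are equally effective by construction, returning the first result does not change the produced output information, only the response time.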
47 In exemplary tests with the main pipeline in our project ArguAna (cf. Section 2.3),
pipeline duplication on five machines reduced the pipeline execution time by a factor of 3.
In making a speech one must study three points: first, the
means of producing persuasion; second, the style, or language,
to be used; third, the proper arrangement of the various
parts of the speech.
Aristotle

5 Pipeline Robustness
The ultimate purpose of text analysis pipelines is to infer new information
from unknown input texts. To this end, the algorithms employed in pipe-
lines are usually developed on known training texts from the anticipated
domains of application (cf. Section 2.1). In many applications, however,
the unknown texts significantly differ from the known texts, because a con-
sideration of all possible domains within the development is practically in-
feasible (Blitzer et al., 2007). As a consequence, algorithms often fail to infer
information effectively, especially when they rely on features of texts that
are specific to the training domain. Such missing domain robustness constitutes
a fundamental problem of text analysis (Turmo et al., 2006; Daumé and
Marcu, 2006). The missing robustness of an algorithm directly reduces the
robustness of a pipeline it is employed in. This in turn limits the benefit of
pipelines in all search engines and big data analytics applications, where the
domains of texts cannot be anticipated. In this chapter, we present first sub-
stantial results of an approach that improves robustness by relying on novel
structure-based features that are invariant across domains.
Section 5.1 discusses how to achieve ideal domain independence in the-
ory. Since the domain robustness problem is very diverse, we then focus on
a specific type of text analysis tasks (unlike in Chapters 3 and 4). In par-
ticular, we consider tasks that deal with the classification of argumentative
texts, like sentiment analysis, stance recognition, or automatic essay grad-
ing (cf. Section 2.1). In Section 5.2, we introduce a shallow model of such
tasks, which captures the sequential overall structure of argumentative texts
on the pragmatic level while abstracting from their content. For instance,
5.1 Ideal Domain Independence for High-Quality Text Mining
Figure 5.1: Abstract view of the overall approach of this thesis (cf. Figure 1.5). The
main contribution of Chapter 5 is represented by the overall analysis.
Figure 5.2: Illustration of the domain dependence of text analysis for a two-class
classification task: Applying the decision boundary from domain A in some do-
main B with a different feature distribution (here, for x1 and x2 ) often works badly.
Figure 5.3: Illustration of two ways of improving the decision boundary from Fig-
ure 5.2 for the domains A and B as a whole: (a) Choosing a model that better fits
the task. (b) Choosing a training set (open icons) that represents both domains.
and underfitting exist (Montavon et al., 2012), in the end the appropriateness
of a model depends on the training set it is derived from. This directly
leads to the requirement of optimal representativeness.
The used training set governs the distribution of values of the considered
features and, hence, the quality of the derived model. According to learn-
ing theory (cf. Section 2.1), the best model is obtained when the training
set is optimally representative for the domain of application. The represen-
tativeness prevents the model from incorrectly generalizing from the train-
ing data. Given different domains of application, the training texts should,
thus, optimally represent all texts irrespective of their domains (Sapkota
et al., 2014). Figure 5.3(b) shows an alternative training set (open icons) for
the sample instances from domain A and B that addresses this requirement.
As for optimal fitting, it leads to a decision boundary that causes fewer
false classifications than the one shown in Figure 5.2. Among others, we
observe such behavior in (Wachsmuth and Bujna, 2011) after training on a
random crawl of blog posts instead of a focused collection of reviews.
Optimal representativeness is a primary goal of corpus design (Biber
et al., 1998). Besides the problem of how to achieve representativeness, an
optimally representative training set can only be built when enough training
texts are given from different domains. This contradicts one of the basic sce-
narios that motivates the need for domain robustness, namely, that enough
data is given from some source domain, but only few from a target domain.
The domain adaptation approaches summarized in Section 2.4 deal with
this scenario by learning the shift in the feature distribution between do-
mains or by aligning features from the source and the target domain. Hence,
they require at least some data from the target domain for training and, so,
need to assume the domains of application in advance.
Figure 5.4: Illustration of the different domain invariance of the features x1 and x2
with respect to the instances from domain A and B: Only for x2 , the distribution of
values over the circle and square remains largely invariant across the domains.
requires knowing the target function. Still, we believe that strongly domain-
invariant features can be found for many text analysis tasks. While the do-
main invariance of a certain set of features cannot be proven, the robustness
of an algorithm based on the features can at least be evaluated using test
sets from different domains of application. Such research has already been
done for selected tasks. For instance, Menon and Choi (2011) give exper-
imental evidence that features based on function words robustly achieve
high effectiveness in authorship attribution across domains.
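Such a robustness evaluation can be outlined generically: apply a classifier trained on one domain to test sets from the training domain and from an unseen domain, and report the accuracy drop. A minimal sketch with a hypothetical classify function:

```python
# Illustrative sketch: the drop in accuracy between an in-domain test
# set and a test set from an unseen domain indicates how domain-dependent
# the underlying features are. `classify` maps a feature vector to a
# label; each test set is a list of (feature vector, label) pairs.
def robustness_drop(classify, in_domain, out_domain):
    def accuracy(testset):
        correct = sum(1 for x, y in testset if classify(x) == y)
        return correct / len(testset)
    return accuracy(in_domain) - accuracy(out_domain)
```

A classifier built on domain-invariant features would show a drop close to zero, while one relying on domain-specific features would degrade noticeably.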
Domain-invariant features benefit the domain independence of text ana-
lysis algorithms and, consequently, the domain independence of a pipeline
that employs such algorithms. That being said, we explicitly point out that
the best set of features in terms of domain invariance is not necessarily the
best in terms of effectiveness. In the case of the figures above, for instance,
the domain-specific feature x1 may still add to the overall classification ac-
curacy, when used appropriately. Hence, domain-invariant features do not
solve the general problem of achieving optimal effectiveness in text analy-
sis, but they help to robustly maintain the effectiveness of a text analysis
pipeline when applying the pipeline to unknown texts.
To conclude, none of the described requirements can be realized per-
fectly in general, preventing ideal domain independence in practice. Still,
approaches to address each requirement exist. The question is whether gen-
eral ways can be found to overcome domain dependence, thereby improv-
ing pipeline robustness in ad-hoc large-scale text mining. For a restricted
setting, we consider this question in the remainder of Chapter 5.
We focus on tasks that deal with the classification of argumentative texts, like
essays, transcripts of political speeches, scientific articles, or reviews, since
we claim that they are particularly viable for the development of domain-
invariant features: In general, an argumentative text represents a written
form of monological argumentation. For our purposes, such argumentation
can be seen as a regulated sequence of text with the goal of providing per-
suasive arguments for an intended conclusion (cf. Section 2.4 for details).
This involves the identification of facts about the topic being discussed as
well as the structured presentation of pros and cons (Besnard and Hunter,
2008). As such, argumentative texts resemble the type of speeches Aristotle
refers to in the quote at the beginning of this chapter.
Typical tasks that target at argumentative texts are sentiment analysis,
stance recognition, and automatic essay grading among others (cf. Sec-
tion 2.1). In such tasks, domains are mostly distinguished in terms of topic,
like different product types in reviews or different disciplines of scientific
articles. Moreover, argumentative texts share common linguistic character-
istics in terms of their structure (Trosborg, 1997). Now, according to Aris-
totle, the arrangement of the parts of a speech (i.e., the overall structure of overall structure
the speech) plays an important role in making a speech. Putting both to-
gether, it therefore seems reasonable that the following two-fold hypothesis
holds for many tasks that deal with the classification of argumentative texts,
where overall structure can be equated with argumentation structure:
Figure 5.5: The proposed metamodel of the structure of an argumentative text (cen-
ter) and its connection to the argumentation (left) and the content (right) of the text.
Apart from that, the reinterpretation appears rather vague, because it leaves
open what exactly is meant by argumentation structure. In the following,
we present a metamodel that combines different concepts from the litera-
ture to define such structure for argumentative texts at a granularity that
we argue is promising for addressing text classification.
Figure 5.5 shows the ontological metamodel with all presented concepts.
Similar to the information-oriented view of text analysis in Section 3.4, the
ontology is not used in terms of a knowledge base, but it serves as an anal-
ogy to the process-oriented view in Section 3.2. We target at the center part
of the model, which defines the overall structure of argumentative texts, i.e.,
their argumentation structure. In some classification tasks, structure may
have to be analyzed in consideration of the content referred to in the argu-
mentation of a text. An example in this regard is stance recognition, where
the topic of a stance is naturally bound to the content. First, we thus intro-
duce the left and right parts. If the topic is known or irrelevant, however,
a focus on structure benefits domain robustness, as we see later on.
components (like a claim or a premise) and argumentative relations between
the components (like the support of a claim by a premise). In contrast, the
concrete types of components and relations differ. E.g., Toulmin (1958) fur-
ther divides premises into grounds and backings (cf. Section 2.4 for details).
The conclusion of an argumentation may or may not be captured explicitly
in an argument component itself. It usually corresponds to the stance of the
author of the text with respect to the topic being discussed.
The topic of an argumentative text sums up what the content of the text
is all about, such as the stay at some specific hotel in case of the mentioned
reviews. The topic is referred to in the text directly or indirectly by talking
about different semantic concepts. We use the generic term semantic concept
here to cover entities, attributes, and the like, such as a particular employee of
a hotel named John Doe or the hotel's staff in general. Semantic rela-
tions may exist between the concepts, e.g. John Doe works for the reviewed
hotel. As the examples show, the relevant concrete types of both semantic
concepts and semantic relations are often domain-specific, similar to what
we observed for annotation types in Section 3.2.
An actual understanding of the arguments in a text would be bound to
the contained semantic concepts and relations. In contrast, we aim to de-
termine only the class of an argumentative text given some classification
scheme here. Such a text class represents meta information about the text,
e.g. the sentiment score of a review or the name of its author. As long as
the meta information does not relate to the topic of a text, losing domain
independence by analyzing content-related structure seems unnecessary.
Stab and Gurevych (2014a) present an annotation of the structure of ar-
gumentative texts (precisely, of persuasive essays) that relates to the de-
fined concepts. They distinguish major claims (i.e., conclusions), claims,
and premises as argumentation components as well as support and attack
as argumentative relations. Such annotation schemes serve research on the
mining of arguments and their interactions (cf. Section 2.4). The induced
structure may also prove beneficial for text classification, though, especially
when the given classification scheme targets at the purpose of argumenta-
tion (as in the case of stance recognition). However, we seek for a model
that can be applied to several classification tasks. Accordingly, we need to
abstract from concrete classification tasks in the first place.
cleus and satellite of a discourse relation, like Mann and Thompson (1988),
but we simply specify the relation to be directed, either as CR(di, di+1) or
as CR(di+1, di). Figure 5.6 visualizes the resulting metamodel of overall
structure that emanates from the abstract concepts. In the following, we
exemplify how to instantiate the metamodel for a concrete task.8
different subsets of the 23 relation types from rhetorical structure theory
may be beneficial in different tasks, or even other relation types, such as
those used in the Penn Discourse Treebank (Carlson et al., 2001).
As an example, we illustrate the two instantiation steps for the sentiment
analysis of reviews, as tackled in our project ArguAna (cf. Section 2.3). Re-
views comprise a positional argumentation, where an author collates and
structures a choice of statements (i.e., facts and opinions) about a prod-
uct or service in order to inform intended recipients about his or her be-
liefs (Besnard and Hunter, 2008). The conclusion of a review is often not
explicit, but it is quantified in terms of an overall sentiment score or the like.
For example, a review from the hotel domain may look like the following:
We spent one night at that hotel. Staff at the front desk was very nice,
the room was clean and cozy, and the hotel lies in the city center... but
all this never justifies the price, which is outrageous!
Five statements can be identified in the review: A fact on the stay, followed
by two opinions on the staff and the room, another fact on the hotel's loca-
tion, and a final opinion on the price. Although there are hence more pos-
itive than negative statements, the argumentation structure of the review
reveals a negative global sentiment, i.e., the overall sentiment to be inferred
in the sentiment analysis of reviews.
In simplified terms, the argumentation structure is given by a sequential
composition of statements with local sentiments on certain aspects of the ho-
tel. Figure 5.7 models the argumentation structure of the example review as
an instance of our metamodel from Figure 5.5: Each statement represents
a discourse unit whose unit class corresponds to a local sentiment. Here,
we cover the positive and negative polarities of opinions as well as the ob-
jective nature of a fact as classes of local sentiment. They entail the order
relation already mentioned above. The five statements are connected by
four discourse relations. The discourse relations in turn instantiate three
out of an arbitrary number of discourse relation types.
What Figure 5.7 highlights is the sequence information induced by our
shallow structure-oriented view of text analysis. In particular, according to
this view an argumentative text implies both a sequence of unit classes and
a sequence of discourse relations. In the remainder of this chapter, we ex-
amine to what extent such sequences can be exploited in features for an effective
and domain-robust text analysis. However, Figure 5.7 also illustrates that
the representation of a text as an instance of the defined metamodel is te-
dious and space-consuming. From here on, we therefore prefer visualizations
in the style of Figure 5.6.
222 5.2 A Structure-oriented View of Text Analysis
Figure 5.7: Instantiation of the structure part of the metamodel from Figure 5.5
with concrete concepts of discourse relations and unit classes as well as with indi-
viduals of the concepts for the example hotel review discussed in the text.
Experiments Given the two domains for each of the two considered tasks,
we create classifiers for all feature types in isolation using supervised learn-
ing. Concretely, we separately train one linear multi-class support vector
machine from the LibSVM integration of Weka (Chang and Lin, 2011; Hall
et al., 2009) on the training set of each corpus, optimizing parameters on the
respective validation set.10 Then, we measure the accuracy of each feature
type on the test sets of the corpora in the associated task in two scenarios:
(1) when training is performed in-domain on the training set of the corpus,
and (2) when training is performed on the training set of the other domain.11
The results are listed in Table 5.1 for each possible combination A2B of a
training domain A and a testing domain B.
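The evaluation protocol behind these scenarios can be sketched as follows. The corpora, labels, and the majority-class stand-in classifier are hypothetical placeholders for illustration only; the actual experiments train the described LibSVM classifier on the corpus texts:

```python
from collections import Counter

class MajorityClassifier:
    """Stand-in for the actual linear SVM (hypothetical placeholder)."""
    def fit(self, texts, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self
    def predict(self, texts):
        return [self.label] * len(texts)

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def cross_domain_accuracies(corpora, make_classifier):
    """Measure accuracy for every combination A2B of a training
    domain A and a testing domain B."""
    results = {}
    for a, (train_x, train_y, _, _) in corpora.items():
        clf = make_classifier().fit(train_x, train_y)
        for b, (_, _, test_x, test_y) in corpora.items():
            results[f"{a}2{b}"] = accuracy(test_y, clf.predict(test_x))
    return results

# Toy corpora per domain: (train_texts, train_labels, test_texts, test_labels)
corpora = {
    "H": (["t1", "t2", "t3"], [1, 1, 2], ["t4", "t5"], [1, 2]),
    "F": (["t6", "t7", "t8"], [2, 2, 3], ["t9", "t10"], [2, 3]),
}
print(cross_domain_accuracies(corpora, MajorityClassifier))
```

Comparing A2A against B2A in the resulting table then isolates the effect of the training domain, as argued in footnote 11.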
Limited Domain Invariance of all Feature Types The token 1-grams per-
form comparably well in the in-domain scenarios (H2H, F2F, M2M, and
S2S), but their accuracy significantly drops in the out-of-domain scenarios
with a loss of up to 25 percentage points. The token 2-grams and 3-grams
behave inconsistently in these respects, indicating that their distribution is not
learned reliably from the given corpora. Either way, the minimum observed accuracy of
all word features lies below 33.3%, i.e., below chance in three-class classifi-
cation. The class features share this problem, but they seem less domain-
dependent where available in the respective scenarios. While the sentiment
scores and words work well in three of the four sentiment analysis
10
The Sentiment Scale dataset is partitioned into four author datasets (cf. Section C.4).
Here, we use the datasets of authors c and d for training, b for validation, and a for testing.
11
Research on domain adaptation often compares the accuracy of a classifier in its training
domain to its accuracy in some other test domain (i.e., A2A vs. A2B), because training data
from the test domain is assumed to be scarce. However, this leaves unclear whether an
accuracy change may be caused by a varying difficulty of the task at hand across domains.
For the analysis of domain invariance, we therefore focus on the comparison of different
training domains for the same test domain (i.e., A2A vs. B2A) here.
226 5.3 The Impact of the Overall Structure of Input Texts
Table 5.1: Accuracy of 15 common feature types in experiments with 3-class sen-
timent analysis and 3-class language function analysis for eight different scenarios
A2B. Here, A is the domain of the training texts, and B the domain of the test texts,
with A, B ∈ {hotel (H), film/movie (F), music (M), smartphone (S)}. The two right-
most columns show the minimum observed accuracy of each feature type and the
maximum loss of accuracy points caused by training out of the test domain.
Figure 5.8: Illustration of capturing the structure of a sample hotel review as a local
sentiment flow, i.e., the sequence of local sentiment in the statements of the review.
that these types do not suffice to achieve high quality in the evaluated clas-
sification tasks, although the combination of features at least performs best
in all in-domain scenarios. For out-of-domain scenarios, we hence need to
find features that are both effective and domain-robust.
Figure 5.9: (a) The fractions of positive opinions (green), objective facts (dark gray),
and negative opinions (red) in the texts of the ArguAna TripAdvisor corpus, sepa-
rated by the sentiment scores between 1 and 5 of the reviews they refer to. (b–d) The
respective fractions for statements at specific positions in the reviews.
distributions in Figure 5.9(d) do not differ clearly from those in Figure 5.9(a).
A possible explanation is that the last statements often summarize reviews.
However, they may also simply reflect the average.
The impact of local sentiment at certain positions indicates the impor-
tance of structural aspects of a review. Yet, it does not allow drawing in-
ferences about a review's overall structure. Therefore, we now come to the
third hypothesis, which refers to local sentiment flows. For generality, we
do not consider a review's title as part of its flow, since, unlike in the Argu-
Ana TripAdvisor corpus, many reviews have no title. Our method to test
the hypothesis for the corpus is to first determine sentiment flows that rep-
resent a significant fraction of all reviews in the corpus. Then, we compute
how often these patterns cooccur with each sentiment score.
We do not exactly determine local sentiment flows, though, because of
the varying lengths of reviews: The only five local sentiment flows that rep-
resent at least 1% of the ArguAna TripAdvisor corpus are trivial without
any change in local sentiment, e.g. (pos, pos, pos, pos) or (obj). In principle, a
solution is to length-normalize the flows. We return to this in the context of
our overall analysis in Section 5.4.13 From an argumentation perspective,
length normalization appears hard to grasp. Instead, we move from local
sentiment to changes in local sentiment here. More precisely, we combine
consecutive statements with the same local sentiment, thereby obtaining lo-
cal sentiment segments. We define the sentiment change flow of a review as
the sequence of all such segments in the review's body.14 In case of the ex-
ample in Figure 5.8, for instance, the second and third statement have the
same local sentiment. Hence, they refer to the same segment in the review's
sentiment change flow, (obj, pos, obj, neg).
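Constructing a sentiment change flow from a local sentiment flow amounts to collapsing runs of equal values. A minimal sketch in Python, using the labels of the example above:

```python
from itertools import groupby

def sentiment_change_flow(local_sentiment_flow):
    """Combine consecutive statements with the same local sentiment
    into segments, yielding the sentiment change flow of a review."""
    return tuple(label for label, _ in groupby(local_sentiment_flow))

# The example review: a fact, two positive opinions, a fact, a negative opinion.
flow = ("obj", "pos", "pos", "obj", "neg")
print(sentiment_change_flow(flow))  # ('obj', 'pos', 'obj', 'neg')
```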
In total, our corpus contains reviews with 826 different sentiment change
flows. Table 5.2 lists all those with a frequency of at least 1%. Together, they
cover over one third (34.8%) of all texts. The most frequent flow, (pos), repre-
sents the 161 (7.7%) fully positive hotel reviews, whereas the best global sen-
timent score 5 is indicated by flows with objective facts and positive opin-
ions (table lines 4, 5, and 7). Quite intuitively, (neg, pos, neg) and (pos, neg, pos)
denote typical flows of reviews with score 2 and 4, respectively. In contrast,
none of the listed flows clearly indicates score 3. The highest correlation is
observed for (neg, obj, neg), which results in score 1 in 88.9% of the cases.
13
Alternatively, Mao and Lebanon (2007) propose to ignore the objective facts. Our corresponding
experiments did not yield new insights except for a higher frequency of trivial flows.
For lack of relevance, we omit to present results on local sentiment flows here, but they can
be easily reproduced using the provided source code (cf. Appendix B.4).
14
In (Wachsmuth et al., 2014b), we name these sequences argumentation flows. In the given
more general context, we prefer a more task-specific naming in order to avoid confusion.
230 5.3 The Impact of the Overall Structure of Input Texts
Table 5.2: The 13 most frequent sentiment change flows in the ArguAna Trip-
Advisor corpus and their distribution over all possible global sentiment scores.
The outlined cooccurrences offer strong evidence for the hypothesis that
the global sentiment of a review depends on the reviews local sentiment
flow. Even more, they imply the expected effectiveness (in the hotel domain)
of a single feature based on a sentiment change flow. In particular, the fre-
quency of a flow can be seen as the recall of any feature that applies only to
reviews matching the flow. Correspondingly, the distribution of a flow over
the sentiment scores shows what precision the feature would achieve in pre-
dicting the scores. However, Table 5.2 also reveals that all found flows cooc-
cur with more than one score. Thus, we conclude that sentiment change
flows do not decide global sentiment alone. This becomes explicit for (obj,
pos, neg, pos), which is equally distributed over scores 3 to 5.
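This reading of a flow's frequency as recall and of its score distribution as precision can be made concrete in a few lines. The counts below are hypothetical, merely in the style of Table 5.2:

```python
def flow_feature_quality(flow_count, score_counts, corpus_size):
    """For a feature that fires only on reviews matching a given flow:
    its recall is the flow's corpus frequency, and its precision (for
    the best score) is the flow's highest score fraction."""
    recall = flow_count / corpus_size
    best_score = max(score_counts, key=score_counts.get)
    precision = score_counts[best_score] / flow_count
    return recall, best_score, precision

# Hypothetical counts: a flow occurring 27 times in a corpus of
# 2100 reviews, 24 of these occurrences with score 1.
recall, score, precision = flow_feature_quality(27, {1: 24, 2: 2, 3: 1}, 2100)
```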
Figure 5.10: The fractions of five types of discourse relations under all discourse re-
lations, separated by the sentiment scores between 1 and 5 of the reviews they refer
to. The discourse relations were found using the algorithm pdr (cf. Appendix A.1).
Table 5.3: The 12 most frequent discourse change flows in the ArguAna Trip-
Advisor corpus (when ignoring sequence and elaboration relations) and their dis-
tribution over all possible global sentiment scores. The relations were found using
the discourse parser pdr (cf. Appendix A).
(background, elaboration, contrast) for our example review. However, not co-
incidentally, we have left out both elaboration and sequence in Figure 5.10.
Together, these two types make up over 80% of all discourse relations, ren-
dering it hard to find common flows with other (potentially more relevant)
types. Thus, we ignore elaboration and sequence relations in the determina-
tion of the most frequent discourse change flows.
Table 5.3 shows the distributions of sentiment scores for the 12 discourse
change flows that each represent at least 1% of the reviews in the Argu-
Ana TripAdvisor corpus and that together cover 44.6% of the corpus' reviews in total.
The flows are grouped according to the contained discourse relation types.
This is possible because only contrast cooccurs with other types in the listed
flows, certainly due to the low frequency of the others.
About every fourth review (25.2%) contains only contrast relations (except
for sequences and elaborations). Compared to Figure 5.10, such reviews
differ from an average review with contrast relations, having their peak at
score 2 (28.1%). Similar observations can be made for (cause). The found
flows with circumstance relations suggest that discourse change flows do not
influence sentiment scores: No matter if a contrast is expressed before or af-
ter a circumstance, the respective review mostly tends to be negative. How-
ever, this is different for concession and motivation relations. E.g., a motiva-
tion in isolation leads to score 4 and 5 in over 60% of the cases, whereas the
flow (contrast, motivation) most often cooccurs with score 3. (motivation, con-
trast) even speaks for a negative review on average. An explanation might
be that readers should be warned right from the beginning in case of negative
hotel experiences, while recommendations are rather made at the end.
Altogether, we see that the correlations between scores and flows in Ta-
ble 5.3 are not as clear as for the sentiment change flows. While the existence
of certain discourse relations obviously affects the global sentiment of the
hotel reviews, their overall structure seems sometimes but not always deci-
sive for the sentiment analysis of reviews.
rally tend to assign positive weights to all values (or proceed accordingly if
no weights are used), thereby disregarding the flow as a whole.
In contrast, we now develop a novel feature type that makes the overall
structure of argumentative texts measurable by addressing structure classi-
fication as a relatedness problem, i.e., by computing how similar the
overall structure of a text is to a common set of known patterns of overall
structure. The idea behind it resembles explicit semantic analysis (Gabrilovich
and Markovitch, 2007), which measures the semantic relatedness of texts.
Explicit semantic analysis represents the meaning of a text as a weighted
vector of semantic concepts, where each concept is derived from a Wikipedia
article. The relatedness of two texts then corresponds to the similarity of
their vectors, e.g. measured using the cosine distance (cf. Section 2.1).
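For illustration, the vector similarity underlying explicit semantic analysis can be computed as a plain cosine; the concept vectors here are arbitrary toy values, not derived from Wikipedia:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (parallel vectors)
```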
Since we target at overall structure, we rely on our model from Section 5.2
instead of semantic concepts. In particular, in line with our analyses from
Section 5.3, we propose to capture the overall structure of an argumentative
text in terms of a specific unit class flow or of the discourse relation flow of
a text. From here on, we summarize all concrete types of unit classes (e.g.,
local sentiments, language functions, etc.) and the discourse relation type
under the term flow type, denoted as Cf. Hence, every flow can be seen as a
sequence of instances of the respective type:
Flow The flow f of a text of some concrete flow type Cf denotes the se-
quence of instances of Cf in the text.
Based on flows, we define the patterns of overall structure we seek for:
Flow Pattern A flow pattern f̄ denotes the average of a set of similar length-
normalized flows F = {f1, . . . , f|F|}, |F| ≥ 1, of some concrete flow type.
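Following this definition, a flow pattern is simply the elementwise average of equally long, length-normalized flows. A minimal sketch:

```python
def flow_pattern(flows):
    """Average a set of equally long, length-normalized flows
    elementwise to obtain a flow pattern."""
    assert flows and all(len(f) == len(flows[0]) for f in flows)
    return tuple(sum(values) / len(flows) for values in zip(*flows))

# Two length-normalized local sentiment flows (1.0 = positive, 0.0 = negative).
pattern = flow_pattern([(1.0, 0.5, 0.0), (1.0, 1.0, 0.0)])
print(pattern)  # (1.0, 0.75, 0.0)
```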
Analogous to deriving semantic concepts from Wikipedia articles, we deter-
mine a set of flow patterns F̄ = {f̄1, . . . , f̄|F̄|}, |F̄| ≥ 1, of some flow type Cf
from a training set of argumentative texts. Unlike the semantic concepts,
however, we aim for common patterns that represent more than one text.
Therefore, we construct F̄ from flow clusters that cooccur with text classes.
Each pattern is then deployed as a single feature, whose value corresponds
to the similarity between the pattern and a given flow. The resulting feature
vectors can be used for statistical approaches to text classification, i.e., for
learning to map flows to text classes (cf. Section 2.1). In the following, we
detail the outlined process and we exemplify it for sentiment scoring.
Figure 5.11: (a) Illustration of the local sentiment flow and the discourse flow of
the sample text from Section 5.3. (b) Length-normalized versions of the two flows
for length 18 (local sentiment) and 17 (discourse relations), respectively.
of some flow type Cf, we first need to determine the respective flow of each
text in DT. Usually, instances of Cf are not annotated beforehand. Conse-
quently, a text analysis pipeline Πf is required that can infer Cf from DT.15
The output of Πf directly leads to the flows. As an example, Figure 5.11(a)
depicts flows of two types for the sample text from Section 5.3: the local
sentiment flow already shown in Figure 5.8 as well as the discourse rela-
tion flow of the sample text with all occurring types of discourse relations
including elaboration (unlike the discourse change flows in Section 5.3).
The length of a flow follows from the number of discourse units in a
text (cf. Section 5.2), which varies among input texts in most cases. To assess
the similarity of flows, we thus convert each flow into a length-normalized
version. The decision of what length to use resembles the optimal fitting
problem sketched in Section 5.1: Short normalized versions may oversim-
plify long flows, losing potentially relevant flow type information. Long
versions may capture too much noise in the flows. A reasonable normalized
length should therefore be chosen depending on the expected average
or median length of the texts to be represented.
Besides the length, normalization brings up the question of whether and
how to interpolate values, at least in case of ordinal or metric flow types. For
instance, local sentiment could be interpreted as a numeric value between
0.0 (negative) and 1.0 (positive) with 0.5 meaning objective or neutral. In
such cases, an interpolation seems beneficial, e.g. for detecting similarities
between flows like (1.0, 0.5, 0.0) and (1.0, 0.0). For illustration, Figure 5.11(b)
shows normalized versions of the flows from Figure 5.11(a). There, some
non-linear interpolation is performed for the local sentiment values, while
15
In the evaluation at the end of this section, we present results on the extent to which the
effectiveness of inferring Cf affects the quality of the features based on the flow patterns.
the nominal discourse relation types are duplicated for lack of reasonable
alternatives. The chosen normalized lengths are exemplary only.
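The two normalization strategies can be sketched as follows: interpolation for metric flow types (linear here for simplicity, while the figure above performs a non-linear interpolation) and value duplication for nominal ones. The function names are our own:

```python
def normalize_numeric_flow(flow, length):
    """Length-normalize a metric flow (e.g. local sentiment as values
    in [0, 1]) via linear interpolation between its values."""
    if len(flow) == 1:
        return tuple(flow) * length
    result = []
    for i in range(length):
        pos = i * (len(flow) - 1) / (length - 1)  # position in the original flow
        lo = int(pos)
        hi = min(lo + 1, len(flow) - 1)
        frac = pos - lo
        result.append(flow[lo] * (1 - frac) + flow[hi] * frac)
    return tuple(result)

def normalize_nominal_flow(flow, length):
    """Length-normalize a nominal flow (e.g. discourse relation types)
    by duplicating values, for lack of a reasonable interpolation."""
    return tuple(flow[int(i * len(flow) / length)] for i in range(length))

print(normalize_numeric_flow((1.0, 0.5, 0.0), 5))  # (1.0, 0.75, 0.5, 0.25, 0.0)
print(normalize_nominal_flow(("B", "E", "C"), 6))  # ('B', 'B', 'E', 'E', 'C', 'C')
```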
Once the set of all normalized flows F has been created from DT, flow
patterns can be derived. As usual for feature computations (cf. Section 2.1),
however, it may be reasonable to discard rare flows beforehand (say, flows that
occur only once in the training set) in order to avoid capturing noise.
Now, our hypothesis behind flow patterns is that similar flows entail the
same or, if applicable, similar text classes. Here, the similarity of two length-
normalized flows is measured in terms of some adequate similarity func-
tion (cf. Section 2.1). For instance, the (inverse) Manhattan distance may
capture the similarity of the metric local sentiment flows. In case of dis-
course relation flows, we can at least compute the fraction of matches. With
respect to the chosen similarity function, flows that construct the same pat-
tern should be as similar as possible and flows that construct different pat-
terns as dissimilar as possible. Hence, it appears reasonable to partition the
set F using clustering (cf. Section 2.1) and to then derive flow patterns from
the resulting clusters.
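Both mentioned similarity functions can be sketched in a few lines. Mapping the Manhattan distance to (0, 1] via 1/(1+d) is one common choice, not necessarily the exact one used in our experiments:

```python
def manhattan_similarity(flow_a, flow_b):
    """Inverse Manhattan distance for metric flows of equal length,
    mapped to (0, 1] so that identical flows have similarity 1."""
    distance = sum(abs(a - b) for a, b in zip(flow_a, flow_b))
    return 1.0 / (1.0 + distance)

def match_similarity(flow_a, flow_b):
    """Fraction of positionwise matches for nominal flows of equal length."""
    return sum(a == b for a, b in zip(flow_a, flow_b)) / len(flow_a)

print(manhattan_similarity((1.0, 0.5, 0.0), (1.0, 0.0, 0.0)))  # 1 / 1.5
print(match_similarity(("B", "E", "C"), ("B", "C", "C")))       # 2 / 3
```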
In particular, we propose to perform supervised clustering, which can be
understood as a clustering variant, where we exploit knowledge about the
training text classes to ensure that all obtained clusters have a certain purity.
In accordance with the original purity definition (Manning et al., 2008), here
purity denotes the fraction of those flows in a cluster whose text class equals
the majority class in the cluster. This standard purity assumes exactly one cor-
rect class for each flow, implying that a flow alone decides the class, which
is not what we exclusively head for (as discussed in Section 5.2). At least
larger classification schemes speak for a relaxed purity definition. For exam-
ple, our results for sentiment scores between 1 and 5 in Section 5.3 suggest
also seeing the dominant neighbor of the majority score as correct. Either
way, based on any measure of purity, we define supervised flow clustering:
Supervised Flow Clustering Given a set of flows F with known classes, de-
termine a clustering F = {F1, . . . , F|F|} of F with F1 ∪ . . . ∪ F|F| = F and
Fi ∩ Fj = ∅ for Fi ≠ Fj, such that the purity of each Fi ∈ F lies above some threshold.
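The two purity variants can be sketched as follows. The relaxed variant is our reading of the text and assumes an ordinal classification scheme, such as sentiment scores, so that the dominant neighbor of the majority class counts as correct as well:

```python
from collections import Counter

def standard_purity(cluster_classes):
    """Fraction of flows in a cluster whose class equals the majority class."""
    counts = Counter(cluster_classes)
    return counts.most_common(1)[0][1] / len(cluster_classes)

def relaxed_purity(cluster_classes):
    """Relaxed variant for ordinal schemes (e.g. sentiment scores 1-5):
    the dominant neighbor of the majority class also counts as correct."""
    counts = Counter(cluster_classes)
    majority = counts.most_common(1)[0][0]
    neighbor = max((majority - 1, majority + 1), key=lambda c: counts.get(c, 0))
    return (counts[majority] + counts.get(neighbor, 0)) / len(cluster_classes)

scores = [4, 4, 4, 5, 5, 3, 1]
print(standard_purity(scores))  # 3/7: majority class 4
print(relaxed_purity(scores))   # 5/7: neighbor 5 dominates neighbor 3
```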
We seek for clusters with a high purity, as they indicate specific classes,
which matches the intuition behind the flow patterns to be derived. At the
same time, the number of clusters should be small in order to achieve a high
average cluster size and, thus, a high commonness of the flow patterns. An
easy way to address both requirements is to rely on a hierarchical cluster-
ing (cf. Section 2.1), where we can directly choose a flat clustering F with
a desired number |F| of clusters through cuts at appropriate nodes in the
underlying cluster hierarchy.
238 5.4 Features for Domain Independence via Supervised Clustering
[Figure 5.12: Sketch of a hierarchical flow clustering with the text classes of the
sample flows (values between 1 and 5) and the flow clusters chosen at the highest
cuts: a cluster for class 1, one for classes 3 and 4, one for classes 4 and 5, and
one for class 5.]
Figure 5.13: Sketch of the construction of (a) a sentiment flow pattern (dashed curve)
from two length-normalized sample local sentiment flows (circles and squares) and
(b) a discourse flow pattern (bold) from three according discourse relation flows.
determineFlowPatterns(DT, CT, Cf, Πf)
1: Training flows F ← ∅
2: for each i ∈ {1, . . . , |DT|} do
3:    Πf.process(DT[i])
4:    Flow f ← DT[i].getOrderedFlowClasses(Cf)
5:    f ← normalizeFlow(f)
6:    F ← F ∪ {⟨f, CT[i]⟩}
7: F ← retainSignificantFlows(F)
8: Flow patterns F̄ ← ∅
9: Clusters F ← performSupervisedFlowClustering(F)
10: F ← retainSignificantFlowClusters(F)
11: for each Cluster Fi ∈ F do F̄ ← F̄ ∪ {Fi.getCentroid()}
12: return F̄
Pseudocode 5.1: Determination of a common set of flow patterns F̄ from a set
of training texts DT and their associated known text classes CT. The patterns are
derived from flows of the type Cf that is in turn inferred with the pipeline Πf.
Figure 5.14: Two illustrations of measuring the similarity (here, the inverse Man-
hattan distance) of an unknown flow to the flow patterns resulting from the clusters
in Figure 5.12: (a) 2D plot of the distance of the flow's vector to the cluster centroids.
(b) Distances of the values of the flow to the values of the respective patterns.
createTrainingFeatureVectors(DT, CT, Cf, Πf, F̄)
1: Training vectors X ← ∅
2: for each i ∈ {1, . . . , |DT|} do
3:    Πf.process(DT[i])
4:    Flow f ← DT[i].getOrderedFlowClasses(Cf)
5:    f ← normalizeFlow(f)
6:    Feature values x(i) ← ⟨⟩
7:    for each Flow pattern f̄k ∈ F̄ do
8:       Feature value xk(i) ← computeSimilarity(f̄k, f)
9:       x(i) ← x(i) ∘ xk(i)
10:   X ← X ∪ {⟨x(i), CT[i]⟩}
11: return X
Pseudocode 5.2: Creation of a training feature vector for every text from a training
set DT with associated text classes CT. Each feature value denotes the similarity of
a flow pattern from F̄ to the text's flow of type Cf (inferred with the pipeline Πf).
represent the flow patterns, mapped into two dimensions (for illustration
purposes) that correspond to the positions in the flow. On the right, the
flow view shows the single values of the flows and flow patterns.
Pseudocode 5.2 presents how to create a vector for each text in DT based
on the flow patterns F*. Given the normalized flow f of a text (lines 3 to 5),
one feature value is computed for each f* ∈ F* by measuring the similarity
between f* and f (lines 8 and 9). The combination of the ordered set of
feature values and the text class CT[i] defines the vector (line 10).16
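A minimal Python sketch of this feature computation, assuming that flows and patterns are length-normalized lists of numeric values and using the inverse Manhattan distance from Figure 5.14 as the similarity measure; the function names and the exact inversion formula are illustrative assumptions.

```python
def manhattan_similarity(flow, pattern):
    """Inverse Manhattan distance between two equal-length flows: the
    smaller the summed value-wise distance, the higher the similarity."""
    distance = sum(abs(a - b) for a, b in zip(flow, pattern))
    return 1.0 / (1.0 + distance)

def create_training_vectors(flows, classes, patterns):
    """Sketch of Pseudocode 5.2: one feature value per flow pattern,
    namely the similarity of the text's normalized flow to that pattern."""
    return [([manhattan_similarity(f, p) for p in patterns], c)
            for f, c in zip(flows, classes)]
```

Any learning algorithm that accepts real-valued feature vectors can then be trained on the resulting pairs of vectors and classes.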
follows at the end of this section). It works irrespective of the type, lan-
guage, or other properties of the input texts being processed, since it out-
sources the specific analysis of producing the flow type Cf to the employed
text analysis pipeline Π. Nevertheless, our feature type explicitly aims to
serve approaches to the classification of argumentative texts, because it
relies on our hypothesis that the overall structure of a text is decisive for
its class, which will not hold for all other text classification tasks,
e.g. not for topic detection (cf. Section 2.1). While the feature type can cope
with all kinds of flow types as exemplified, we have indicated that nominal
flow types restrict its flexibility.
Correctness Similar to the scheduling approach in Section 4.5, the two
presented pseudocodes (Pseudocodes 5.1 and 5.2) define method schemes
rather than concrete methods. As a consequence, we again cannot prove
correctness here. In any case, the notion of correctness generally does not
apply well in the context of feature computations.
In particular, besides the flow type Cf that is defined as part of the input,
both the realized processes in general and the flow patterns in specific are
schematic in that they imply a number of relevant parameters:
1. Normalization. How to normalize flows and what length to use.
2. Similarity. How to measure the similarity of flows and clusters.
3. Purity. How to measure purity and what purity threshold to ensure.
4. Clustering. How to perform clustering and what algorithm to use.
5. Significance. How often a flow must occur to be used for clustering,
and how large a cluster must be to be used for pattern construction.
For some parameters, reasonable configurations may be found conceptually
with regard to the task at hand. Others should rather be found empirically.
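To make the first parameter concrete, the following Python sketch normalizes a flow to a fixed length by linear interpolation. Both the use of interpolation and the chosen target length are assumptions, one of several conceivable normalizations.

```python
def normalize_flow(flow, target_length):
    """Stretch or compress a flow of numeric values to a fixed length by
    linearly interpolating between neighboring positions. The target
    length is an open parameter of the approach."""
    if len(flow) == 1:
        return [float(flow[0])] * target_length
    result = []
    step = (len(flow) - 1) / (target_length - 1)
    for i in range(target_length):
        pos = i * step                    # fractional source position
        lo = int(pos)
        hi = min(lo + 1, len(flow) - 1)
        frac = pos - lo
        result.append(flow[lo] * (1 - frac) + flow[hi] * frac)
    return result
```

For nominal flow classes, such interpolation is not applicable; there, a normalization would have to sample or repeat class values instead.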
With respect to the question of how to perform clustering, we have clar-
ified that the benefit of a supervised flow clustering lies in the construction
of common flow patterns that cooccur with certain text classes. A regular
unsupervised clustering may also achieve commonness, but it may lose the
cooccurrences. Still, there are scenarios where the unsupervised variant can
make sense, e.g. when rather few input texts with classes are available, but
a large number of unknown texts. Then, semi-supervised learning could be
conducted (cf. Section 2.1) where flow patterns are first derived from the
unknown texts and cooccurrences thereafter from the known texts.
Although we see the choice of a concrete clustering algorithm as part of
the realization, above we propose to resort to hierarchical clustering in or-
der to be able to easily find flow clusters with some minimum purity. An
242 5.4 Features for Domain Independence via Supervised Clustering
In practice, the most expensive operation will often be the clustering in de-
termineFlowPatterns for larger numbers of input texts. The goal of this
chapter is not to optimize efficiency, which is why we do not evaluate run-
times in the following experiments, where we employ the proposed fea-
17 If the flow of each text from DT is computed only once during the whole process (see
above), Inequality 5.2 would even be reduced to O(|DT| · |fmax|).
5 Pipeline Robustness 243
Table 5.4: Accuracy of all evaluated feature types in 5-class and 3-class sentiment
analysis on the test hotel reviews of the ArguAna TripAdvisor corpus based on
ground-truth annotations (Corpus) or on self-created annotations (Self) as well as in
3-class sentiment analysis on the film reviews of author a, b, c and d in the Sentiment
Scale dataset. The bottom line compares our approach to that of Pang and Lee (2005).
best among all single feature types with an accuracy of 52% and 51%, respec-
tively. Combining all types boosts the accuracy to 54%. Using self-created
annotations, however, it drops significantly to 48%. The loss of feature
types 2 to 4 is even stronger, making them perform slightly worse than
the content and style features (40%–42% vs. 43%). These results do not seem
to match those of Wachsmuth et al. (2014a), where the regression error of the
argumentation features remains lower on the self-created annotations. The
reason can be inferred from the 3-class hotel results, which demonstrate
the effectiveness of modeling argumentation for sentiment scoring:
There, all argumentation feature types outperform feature type 1. This in-
dicates that at least the polarity of their classified scores is often correct, thus
explaining the low regression errors.
On all four film review datasets, the sentiment flow patterns classify
scores most accurately among the argumentation feature types, but their
effectiveness still remains limited.21 The content and style features domi-
nate the evaluation, which again gives evidence for the effectiveness of such
features within a domain (cf. Section 5.3). Compared to ova, our classifier
based on all feature types is significantly better on the reviews of author a
and a little worse on two other datasets (c and d).
We conclude that the proposed feature types are competitive, achieving
effectiveness similar to existing approaches. In the in-domain task, the
sentiment flow patterns do not fail, but they also do not excel. Their main
benefit lies in their strong domain invariance, as we see next.
21 We suppose that the reason mainly lies in the limited accuracy of 74% of our polarity
classifier csp in the film domain (cf. Appendix A.2), which reduces the impact of all features
that rely on local sentiment.
Figure 5.15: Accuracy of the evaluated feature types on the test hotel reviews in
the ArguAna TripAdvisor corpus based on self-created annotations with training
either on the training hotel reviews (H2H) or on the film reviews of author a, b, c,
or d in the Sentiment Scale dataset (Fa2H, Fb2H, Fc2H, and Fd2H).
Figure 5.16: Accuracy of the evaluated feature types on the film reviews of author
a, b, c, and d in the Sentiment Scale dataset in 10-fold cross-validation on these
reviews (Fa2Fa, Fb2Fb, Fc2Fc, and Fd2Fd) or with training on the training hotel re-
views of the ArguAna TripAdvisor corpus (H2Fa, H2Fb, H2Fc, and H2Fd).
Figure 5.17: (a) The three most common sentiment flow patterns in the training
set of the ArguAna TripAdvisor corpus, labeled with their associated sentiment
scores. (b–c) The corresponding sentiment flow patterns for all possible scores of
the texts of author c and d in the Sentiment Scale dataset, respectively.
with the sentiment flow patterns being worst. Apparently, the argumenta-
tion structure of film review author d, which is reflected by the found
sentiment flow patterns, must differ from that of the others.
Insights into Sentiment Flow Patterns In Figure 5.17(a), we plot the most
common sentiment flow pattern for each possible sentiment score among
those 38 patterns that we found in the training set of the ArguAna Trip-
Advisor corpus (based on self-created annotations). As depicted, the pat-
terns are constructed from the local sentiment flows of up to 226 texts. Be-
low, Figures 5.17(b–c) show the respective patterns for authors c and d in the
Sentiment Scale dataset. One of the 75 patterns of author c results from 155
flows, whereas all 41 patterns of author d represent at most 16 flows.
With respect to the shown sentiment flow patterns, the film reviews yield
less clear sentiment but more changes of local sentiment than the hotel re-
views. While there appears to be some similarity in the overall argumenta-
tion structure between the hotel reviews and the film reviews of author c,
two of the three patterns of author d contain little clear sentiment at all,
especially in the middle parts. We have already indicated the disparity
of the author d dataset in Figure 4.10 (Section 4.4). In particular, 73%
of all discourse units in the ArguAna TripAdvisor corpus are classified as
positive or negative opinions, but only 37% of the sentences of author d.
The proportions of the three other film datasets at least range between 58%
and 67%. These numbers also serve as a general explanation for the limited
accuracy of the argumentation feature types 1 to 3 in the film domain.
A solution to improve the accuracy and domain invariance of modeling
argumentation structure might be to construct sentiment flow patterns from
the subjective statements only or from the changes of local sentiment, which
we leave for future work. Here, we conclude that our novel feature type does
not yet solve the domain dependence problem of classifying argumentative
texts, but our experimental sentiment scoring results suggest that it denotes
a promising step towards more domain robustness.
of inferring all instances of the flow type of the flow patterns (including pre-
processing steps). The average run-times of the algorithms csb and csp em-
ployed here give a hint of the increased complexity of performing sentiment
scoring in the presented way (cf. Appendix A.2). Other flow types may be
much cheaper to infer (like function words), but also much more expensive.
Discourse relations, for instance, are often obtained through parsing. Still,
efficient alternatives exist (like our lexicon-based algorithm pdr), indicat-
ing the usual tradeoff between efficiency and effectiveness (cf. Section 3.1).
With respect to the second (clustering), we have discussed that our hierar-
chical approach may be too slow for larger numbers of training texts and
we have outlined flat clusterers as an alternative. Nevertheless, clustering
tends to represent the bottleneck of the feature computation and, thus, of
the training of an according algorithm.
However, training time is not of utmost importance in the scenario we tar-
get, where we assume the text analysis algorithms to choose from to be
given in advance (cf. Section 1.2). This observation conforms with our idea
of an overall analysis from Section 1.3: The determination of features takes
place within the development of an algorithm AT that produces instances
of a set of text classes CT . At development time, AT can be understood as
an overall analysis: It denotes the last algorithm in a pipeline T for infer-
ring CT while taking as input all information produced by the preceding
algorithms in T . Once AT is given, it simply serves as an algorithm in the
set of all available algorithms. In the end, our overall analysis can hence be
seen as a regular text analysis algorithm for cross-domain usage. Besides
the intended domain robustness, such an analysis provides the benefit that
its results can be explained, as we finally sketch in Section 5.5.
[Figure: explanation graph over the sample hotel review from Section 5.2, linking
the sentiment score (3 out of 5, a = 48%) to the underlying opinions and discourse
relations (background, elaboration, elaboration, contrast).]
Figure 5.18: Illustration of an explanation graph for the sentiment score of the sam-
ple text from Section 5.2. The dependencies between the different types of output
information as well as the accuracy estimations (a) are derived from knowledge
about the text analysis pipeline that has produced the output information.
between independent but fully overlapping nodes (cf. Appendix B.3 for ex-
amples). The latter reduces soundness, but it makes the graph easier to
grasp. Besides, Figure 5.18 indicates that explanation graphs can become
very large, which makes understanding them hard and time-consuming. To
deal with the size, our prototype reduces completeness by displaying only
the first layers in an overview graph and the others in a detail graph. Nev-
ertheless, long texts make the resort to explanation graphs questionable.
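To make the idea behind such graphs concrete, the following Python sketch models output information as nodes whose estimated accuracies are propagated along dependency edges. The node labels, accuracy values, and the weakest-link aggregation are all hypothetical; they merely illustrate how knowledge about the producing pipeline could yield the accuracy estimations shown in Figure 5.18.

```python
class ExplanationNode:
    """One piece of output information in an explanation graph, linked
    to the information it was derived from. The accuracy values are
    estimates taken from knowledge about the producing pipeline."""

    def __init__(self, label, accuracy, depends_on=()):
        self.label = label                  # e.g. "sentiment score"
        self.accuracy = accuracy            # estimated accuracy of the producer
        self.depends_on = list(depends_on)  # predecessor nodes in the graph

    def chained_accuracy(self):
        # One hypothetical aggregation: discount a node's accuracy by the
        # weakest chain of accuracies among the information it depends on.
        if not self.depends_on:
            return self.accuracy
        weakest = min(node.chained_accuracy() for node in self.depends_on)
        return self.accuracy * weakest
```

For instance, a sentiment score produced with estimated accuracy 0.5 from opinion polarities (0.8) that in turn rest on discourse units (0.9) would be annotated with a chained estimate of about 0.36.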
In terms of an internal view, the expressiveness of explanation graphs is
rather low because of their generic nature. Concretely, an explanation graph
provides no information about how the target instance emanates from the
annotations it depends on. As mentioned above, the decisive step of most
text classification approaches (and also many information extraction algo-
rithms) is the feature computation, which remains implicit in an explana-
tion graph. General information in this respect could be specified via
corresponding properties of the employed algorithm, say features: lexical
and shallow syntactic in the case of csb. For input-specific information, however, the
actually performed text analysis must be explained. This is easy for our
overall analysis from Section 5.4, as we discuss next.
Figure 5.19: Two possible visual explanations for the sentiment score of the sample
text from Figure 5.3 based on our model from Section 5.2: (a) Highlighting all local
sentiment. (b) Comparison to the most similar sentiment flow patterns.
seems sound and not very incomplete. In contrast, Figure 5.19(b) puts more
focus on structure. By comparing the local sentiment flow of the text to
common flow patterns, it directly visualizes the feature type developed in
Section 5.4. If few patterns are much more similar to the flow than all oth-
ers, the visualization serves as a sound and rather complete explanation of
that feature type. Given that a user believes the patterns are correct, there
should hence be no reason for mistrusting such an explanation.
To summarize, we claim that certain feature types can be explained ad-
equately by visualizing our model. However, many text classification ap-
proaches combine several features, like the one evaluated in Section 5.4. In
this case, both the soundness and the completeness of the visualizations will
be reduced. To analyze the benefit of explanations in such a scenario,
we conducted a first user study in our project ArguAna using the crowd-
sourcing platform Amazon Mechanical Turk,26 where so-called workers
can be requested to perform tasks. The workers are paid a small amount of
money if the results of the tasks are approved by the requester. For a concise
presentation, we only roughly outline the user study here.
The goal of the study was to examine whether explanations (1) help to
assess the sentiment of a text and (2) increase the speed of assessing the sen-
timent. To this end, each task asked a worker to classify the sentiment score
of 10 given reviews from the ArguAna TripAdvisor corpus (cf. Section C.2),
based on presented information of exactly one of the following three types
that we obtained from our prototypical web application (cf. Section B.3):
1. Plain text. The review in plain text form.
2. Highlighted text. The review in highlighted text form, as exemplified
in Figure 5.19(a).
26 Amazon Mechanical Turk, http://www.mturk.com, accessed on November 11, 2014.
Table 5.5: The average sentiment score classified by the Amazon Mechanical Turk
workers for all reviews in the ArguAna TripAdvisor corpus of each score between 1
and 5 depending on the presented information as well as the time the users took for
classifying 10 reviews averaged over all tasks or over the fastest 25%, respectively.
3. Plain text + local sentiment flow. The review in plain text form with
the associated local sentiment flow shown below the text.27
All 2100 reviews of the ArguAna TripAdvisor corpus were classified based
on each type by three different workers. To prevent flawed results, two
check reviews with unambiguous sentiment (score 1 or 5) were put among
every 10 reviews. We accepted only tasks with correctly classified check
reviews and we reassigned rejected tasks to other workers. Altogether, this
resulted in an approval rate of 93.1%, which indicates the quality of the con-
ducted crowdsourcing. Table 5.5 lists the aggregated classification results
separately for the reviews of each possible sentiment score (in the center) as
well as the seconds required by a worker to perform a task, averaged over
all tasks and over the fastest 25% of the tasks (on the right).
The average score of a TripAdvisor review lies between 3 and 4 (cf. Ap-
pendix C.2). As Table 5.5 shows, these weak sentiment scores were most
accurately assessed by the workers based on the plain text, possibly because
the focus on the text avoids a biased reading in this case. In contrast, espe-
cially the highlighted text seems to help to assess the other scores. At least
for score 2, also the local sentiment flow proves beneficial. In terms of the re-
quired time, the highlighted text clearly dominates the study. While type 3
speeds up the classification with respect to the fastest 25%, it entails the
highest time on average. The latter result is not unexpected, given the
complexity of understanding two visualizations instead of one.
6 Conclusion
The ability to perform text mining ad-hoc in the large has the potential
to substantially improve the way people find information today in terms of
speed and quality, both in everyday web search and in big data analytics.
More complex information needs can be fulfilled immediately, and previ-
ously hidden information can be accessed. At the heart of every text mining
application, relevant information is inferred from natural language texts by
a text analysis process. Mostly, such a process is realized in the form of a
pipeline that sequentially executes a number of information extraction, text
classification, and other natural language processing algorithms. As a mat-
ter of fact, text mining is studied in the field of computational linguistics,
which we consider from a computer science perspective in this thesis.
Besides the fundamental challenge of inferring relevant information ef-
fectively, we have revealed the automatic design of a text analysis pipeline
and the optimization of a pipeline's run-time efficiency and domain robust-
ness as major requirements for the enablement of ad-hoc large-scale text
mining. Then, we have investigated the research question of how to exploit
knowledge about a text analysis process and information obtained within
the process to approach these requirements. To this end, we have devel-
oped different models and algorithms that can be employed to address in-
formation needs ad-hoc on large numbers of texts. The algorithms rely on
classical and statistical techniques from artificial intelligence, namely, plan-
ning, truth maintenance, and informed search as well as supervised and
self-supervised learning. All algorithms have been analyzed formally, im-
plemented as software, and evaluated experimentally.
6.1 Contributions and Open Problems
To our knowledge, we are thereby the first to enable ad-hoc text analysis
for unanticipated information needs and input texts.3 Some minor prob-
lems of our approach remain for future work, like its current limitation to
single information needs. Most of these are of a technical nature and should
be solvable without restrictions (see the discussions in Sections 3.2 and 3.3
for details). Besides, a few compromises had to be made due to automa-
tion, especially the focus on either effectiveness or efficiency during the se-
lection of algorithms to be employed in a pipeline. Similarly, the flipside of
constructing and executing a pipeline ad-hoc is the missing opportunity to
evaluate the pipeline's quality before using it.
narios. Among others, big data requires dealing with huge memory con-
sumption. While we are confident that such challenges even increase the
impact of our approaches on a pipeline's efficiency, we cannot ultimately
rule out the possibility that they revert some achieved efficiency gains. Sim-
ilarly, streams of input texts have been used for motivation in this thesis,
but their analysis is left for future work. Finally, an open problem refers to
the limited accuracy of predicting pipeline run-times within our adaptive
scheduling approach, which prevents an efficiency impact of the approach
on real data of low heterogeneity. Possible solutions have been discussed in
Section 4.5. However, we do not deepen them in this thesis, since we have
presented successful alternatives for low heterogeneity (cf. Section 4.3).
tial flows we found model overall structure from a human perspective and,
thus, should be intuitively understandable, such as (positive, negative, posi-
tive). On the other hand, unlike all existing approaches we know of,
the developed feature type captures the structure of a text in overall terms.
We believe that these results open the door to new approaches in other areas
of computational linguistics, especially in those related to argumentation.
We detail the implications of our approach in the following.
bustness targets at specific text classification tasks. The latter even tends to
be slower than standard text classification approaches, although it at least
avoids performing deep analyses. In contrast, all approaches from Chap-
ters 3 and 4 fit together perfectly. Their integration will even solve remain-
ing problems. For instance, the restriction of our pipeline construction ap-
proach to single information needs is easy to manage when given an input
control (cf. Section 3.3). Moreover, there are scenarios where all approaches
have an impact. E.g., a sentiment analysis based only on the opinions of a
text allows for automatic design, optimized scheduling, and the classifica-
tion of overall structure. In addition, we have given hints in Section 5.4 on
how to transfer our robustness approach to further tasks.
We realized all approaches on top of the standard software framework for
text analysis, Apache UIMA. A promising step still to be taken is their de-
ployment in widely-recognized platforms and tools. In Section 3.5, we have
already argued that a native integration of our input control within Apache
UIMA would minimize the effort of using the input control while benefit-
ing the efficiency of many text analysis approaches based on the framework.
Similarly, applications like U-Compare, which serves for the development
and evaluation of pipelines (Kano et al., 2010), may in our view greatly ben-
efit from including the ad-hoc pipeline construction from Section 3.3 or the
scheduling approaches from Sections 4.3 and 4.5. We leave these and other
deployments for future work. The same holds for some important aspects of
using pipelines in practical applications that we have analyzed only roughly
here, such as the parallelization of pipeline execution (cf. Section 4.6) and
the explanation of pipeline results (cf. Section 5.5). Both fit well to the ap-
proaches we have presented, but still require more investigation.
We conclude that our contributions do not fully enable ad-hoc large-scale
text mining yet, but they define essential building blocks for achieving this
goal. The decisive question is whether academia and industry in the con-
text of information search will actually evolve in the direction suggested in
this thesis. While we can only guess, the superficial answer may be "no",
because there are too many possible variations of this direction. A more
nuanced view on today's search engines and the lasting hype around big
mining technologies is striking: Chiticariu et al. (2010b) highlight their im-
pact on enterprise analytics, and Etzioni (2011) stresses the importance of
directly returning relevant information as search results (cf. Section 2.4 for
details). Hence, we are confident that our findings have the potential of im-
proving the future of information search. In the end, leading search engines
show that this future has already begun (Pasca, 2011).
266 6.2 Implications and Outlook
dering pipeline (Angel, 2008). Similar to our input control, pipeline stages
like clipping decide what parts of a scene are relevant for the raster. While
the ordering of the high-level rendering stages is usually fixed, stages like a
shader compose several programmable steps whose schedule strongly im-
pacts rendering efficiency (Arens, 2014). A transfer of our approaches seems
possible, but it might put other parameters in the focus, since the execution
of a pipeline is parallelized on a specialized graphics processing unit.
Another related area is software engineering. Among others, recent soft-
ware testing approaches deal with the optimization of test plans (Güldali
et al., 2011). Here, an optimized scheduling can speed up detecting some
defined number of failures or achieving some defined code coverage. Ac-
cordingly, approaches that perform assembly-based method engineering
based on situational factors and a repository of services (Fazal-Baqaie et al.,
2013) should, in principle, be amenable to automation with an adaptation of the
pipeline construction from Section 3.3. Further possible applications reach
down to basic compiler optimization operations like list scheduling (Cooper
and Torczon, 2011). The use of information obtained from training input
is known in profile-guided compiler optimization (Hsu et al., 2002) where
such information helps to improve the efficiency of program execution, e.g.
by optimizing the scheduling of checked conditions in if-clauses.
Even outside computer science, our scheduling approaches may prove
beneficial. An example from the real world is the authentication of paintings
or paper money, which runs through a sequence of analyses with different
run-times and numbers of found forgeries. Also, we experience in everyday
life that scheduling affects the efficiency of solving a problem. For instance,
the number of slices needed to cut some vegetable into small cubes depends
on the ordering of the slices and the form of the vegetable. Moreover, the ab-
stract concept of adaptive scheduling from Section 4.5 should be applicable
to every performance problem where (1) different solutions to the problem
are most appropriate for certain situations or inputs and (2) where the per-
formance of a solution can be assessed somehow.
Altogether, we summarize that possible continuations of the work de-
scribed in this thesis are manifold. We hope that our findings will inspire
new approaches of other researchers and practitioners in the discussed
fields and that they might help anyone who encounters problems like those
we approached. With this in mind, we close the thesis with a translated
quote from the German singer Andreas Front: "What you learn from that is up
to you, though. I hope at least you have fun doing so."6
6 "Was Du daraus lernst, steht Dir frei. Ich hoffe nur, Du hast Spaß dabei." From the song "Spaß
dabei", http://andreas-front.bplaced.net/blog/, accessed on December 21, 2014.
A Text Analysis Algorithms
The evaluation of the design and execution of text analysis pipelines re-
quires resorting to concrete text analysis algorithms. Several of these algo-
rithms are employed in the experiments and case studies on our approaches
to enable ad-hoc large-scale text mining from Chapters 3 to 5. Some of them
have been developed by us, while others refer to existing software
libraries. In this appendix, we give basic details on the functionalities and
properties of all employed algorithms and on the text analyses they per-
form. First, we describe all algorithms in a canonical form (Appendix A.1).
Then, we present evaluation results on their efficiency and effectiveness as
far as available (Appendix A.2). In particular, the measured run-times are im-
portant in this thesis, because they directly influence the efficiency impact
of our pipeline optimization approaches, as discussed.
A.1 Analyses and Algorithms
Table A.1: The required input types C(in) and the produced output types C(out) of
all text analysis algorithms referred to in this thesis. Bracketed input types indicate
the existence of variations of the respective algorithm with and without these types.
Classification of Text
Text classification is one of the central text analysis types the approaches
in this thesis focus on. It assigns a class from a predefined scheme to each
given text. In our experiments and case studies in Chapter 5, we deal with
the classification of both whole texts and portions of text.
Language Functions Language functions target the question of why a text
was written. On an abstract level, most texts can be seen as being predom-
inantly expressive, appellative, or informative (Bühler, 1934). For product-
related texts, we concretized this scheme with a personal, a commercial,
and an informational class (Wachsmuth and Bujna, 2011).
clf (Wachsmuth and Bujna, 2011) is a statistical classifier, realized as a lin-
ear multi-class support vector machine from the LibSVM integration
of Weka (Chang and Lin, 2011; Hall et al., 2009). It assigns a language
function to a text based on different word, n-gram, entity, and part-of-
speech features (cf. Section 5.3). clf operates on the text level, requir-
ing sentences and tokens as well as, if given, time and money entities
as input and producing the language function classes with assigned
class values. It has been trained on German texts from the music do-
main and the smartphone domain, respectively.
and Lin, 2011; Hall et al., 2009). Similar to csb, it classifies the polarities
of opinions based on the contained words, part-of-speech tags, scores
from SentiWordNet (Baccianella et al., 2010), and the like. csp oper-
ates on the opinion level, requiring opinions and tokens with part-of-
speech as input and producing the polarity feature of each opinion.
It has been trained on English reviews from the hotel domain and the
movie domain, respectively.
Entity Recognition
According to Jurafsky and Martin (2009), the term entity is used not only to
refer to names that represent real-world entities, but also to specific types
of numeric information, like money and time expressions.
Money In terms of money information, we distinguish absolute men-
tions (e.g. 300 million dollars), relative mentions (e.g. by 10%), and com-
binations of these.
emo (self-implemented) is a rule-based money extractor that uses lexicon-
based regular expressions, which capture the structure of money en-
tities. emo operates on the sentence level, requiring sentence annota-
tions as input and producing money annotations. It works only on
German texts and targets news articles.
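Since emo is self-implemented and described here only at the level of its rule types, the following Python sketch merely illustrates the general idea of lexicon-based regular expressions for money entities. The patterns, the tiny currency lexicon, and the function name are simplified assumptions, not the actual rules of emo.

```python
import re

# Illustrative patterns in the spirit of a lexicon-based money extractor
# for German news text; the real rules are more comprehensive.
ABSOLUTE = re.compile(
    r'\b\d+(?:[.,]\d+)?\s*(?:Millionen|Milliarden)?\s*'
    r'(?:Euro|Dollar|EUR|USD)\b')
RELATIVE = re.compile(r'\bum\s+\d+(?:[.,]\d+)?\s*(?:%|Prozent)')

def extract_money(sentence):
    """Return all absolute and relative money mentions in a sentence."""
    return ABSOLUTE.findall(sentence) + RELATIVE.findall(sentence)
```

Because all groups are non-capturing, `findall` returns the full matched spans, which directly correspond to money annotations over the sentence.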
Named entities In some of our experiments and case studies, we deal with
the recognition of person, organization, and location names. These three
named entity types are in the focus of widely recognized evaluations, such
as the CoNLL-2003 shared task (Tjong Kim Sang and Meulder, 2003).
Time Similar to money entities, we consider text spans that represent pe-
riods of time (e.g. last year) or dates (e.g. 07/21/69) as time entities.
eti (self-implemented) is a rule-based time extractor that, analogously to emo, uses lexicon-based regular expressions, which capture the structure of time entities. eti operates on the sentence level, requiring sentence annotations as input and producing time annotations. It works on German texts only and targets news articles.
Parsing
In natural language processing, parsing denotes the syntactic analysis of
texts or sentences in order to identify relations between their different parts.
2 Stanford NER, http://nlp.stanford.edu/software/CRF-NER.shtml, accessed on October 15, 2014.
Discourse Units and Relations Discourse units are the minimal building blocks, in the sense of text spans, that make up the discourse of a text.
Several types of discourse relations may exist between discourse units; e.g., 23 types are distinguished by the widely followed rhetorical structure theory (Mann and Thompson, 1988).
pdr (self-implemented) is a rule-based discourse relation extractor that
mainly relies on language-specific lexicons with discourse connectives
to identify 10 discourse relation types, namely, background, cause, cir-
cumstance, concession, condition, contrast, motivation, purpose, sequence,
and summary. pdr operates on the discourse unit level, requiring dis-
course units and tokens with part-of-speech as input and producing
typed discourse relation annotations. It is implemented for English only and targets less-formatted texts like web user reviews.
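The core lookup behind such a connective-based approach can be sketched as follows; the mini-lexicon and the function are hypothetical stand-ins for the much larger language-specific lexicons of pdr:

```python
# Hypothetical mini-lexicon mapping connectives to relation types;
# the actual pdr lexicons are language-specific and far larger.
CONNECTIVE_TO_RELATION = {
    "because": "cause",
    "although": "concession",
    "if": "condition",
    "but": "contrast",
    "then": "sequence",
}

def classify_relation(first_unit, second_unit):
    """Guess the relation type between two adjacent discourse units
    from a discourse connective at the start of the second unit."""
    tokens = second_unit.lower().split()
    return CONNECTIVE_TO_RELATION.get(tokens[0]) if tokens else None
```

When no connective from the lexicon is found, the sketch abstains (returns None), mirroring the fact that rule-based extractors only fire on explicit cues.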
pdu (self-implemented) is a rule-based discourse unit segmenter that an-
alyzes commas, connectives (using language-specific lexicons), verb
types, ellipses, etc. to identify discourse units in terms of main clauses
with all their subordinate clauses. pdu operates on the text level, requiring sentences and tokens with part-of-speech as input and producing discourse unit annotations. It is implemented for English and German, and targets less-formatted texts like web user reviews.
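A rough sketch of comma- and connective-based segmentation is given below; the connective list is a tiny assumption, and the actual pdu additionally considers verb types, ellipses, and other cues:

```python
import re

# Tiny illustrative connective list; pdu uses language-specific lexicons
# and further cues such as verb types and ellipses.
CONNECTIVES = ("but", "although", "because", "while", "so")

def segment_units(sentence):
    """Split a sentence into discourse-unit candidates at commas that
    are directly followed by a known connective."""
    pattern = r",\s+(?=(?:%s)\b)" % "|".join(CONNECTIVES)
    return [unit.strip() for unit in re.split(pattern, sentence, flags=re.IGNORECASE)]
```

Note that the lookahead keeps the connective at the start of the new unit, so each unit remains a readable clause.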
Time/Money Relations The relations between time and money entities that
we consider here refer to all pairs in which both entities belong to the same statement on revenue.
rtm1 (self-implemented) is a rule-based relation extractor that simply ex-
tracts the closest pairs of time and money entities (in terms of the
number of characters). It operates on the sentence level, requiring sen-
tences, time and money entities as well as statements on revenue as
input and producing the time and money features of the latter. rtm1 works on arbitrary texts of any language.
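The closest-pair strategy can be sketched as follows, with entities represented as character spans; the function name is an assumption for illustration:

```python
def closest_time_for_money(time_spans, money_spans):
    """For each money entity, select the time entity with the smallest
    character distance, as in the simple strategy described above."""

    def distance(a, b):
        (a_start, a_end), (b_start, b_end) = a, b
        if a_end <= b_start:
            return b_start - a_end
        if b_end <= a_start:
            return a_start - b_end
        return 0  # overlapping spans

    return [(min(time_spans, key=lambda t: distance(t, money)), money)
            for money in money_spans]
```

With two time entities at characters 0-9 and 40-49 and one money entity at 12-20, the first time entity is chosen, since it is only 3 characters away.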
rtm2 (self-implemented) is a statistical relation extractor, realized as a linear
support vector machine from the LibSVM integration of Weka (Chang
and Lin, 2011; Hall et al., 2009). It classifies relations between candi-
date pairs of time and money entities based on several types of in-
formation. rtm2 operates on the sentence level, requiring sentences,
tokens with all features, time and money entities as well as statements
on revenue as input and producing the time and money features of the
latter. It works for German texts only and targets news articles.
Segmentation
Segmentation means the sequential partition of a text into single units. In
this thesis, we restrict our view to lexical and shallow syntactic segmenta-
tions in terms of the following information types.
Paragraphs We define a paragraph here syntactically to be a composition
of sentences that ends with a line break.
spa (self-implemented) is a rule-based paragraph splitter that looks for
line breaks that indicate paragraph ends. spa operates on the charac-
ter level, requiring only plain text as input and producing paragraph
annotations. It works on arbitrary texts of any language.
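The underlying rule amounts to little more than the following sketch, which returns character spans of paragraphs; the function is an illustrative assumption, not the spa source code:

```python
def split_paragraphs(text):
    """Return (start, end) character spans of paragraphs, where every
    line break ends a paragraph and empty lines are skipped."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch == "\n":
            if i > start:
                spans.append((start, i))
            start = i + 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans
```

Returning spans rather than substrings matches the annotation view used throughout the thesis, where paragraph annotations reference positions in the plain text.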
Sentences The sentences of a text segment the text into basic meaningful
grammatical units.
Tagging
Under the term tagging, we finally subsume text analyses that add informa-
tion to segments of a text, here to tokens in particular.
Lemmas A lemma denotes the dictionary form of a word (in the sense of a lexeme), such as be for am, are, or be itself. Lemmas are of particular importance for highly inflected languages like German, and they serve, among others, as input for many parsers (see above).
tle (Björkelund et al., 2010) is a statistical lemmatizer, realized as a large margin classifier in the above-mentioned Mate Tools, that uses several features to find the shortest edit script between the lemmas and the words. tle operates on the sentence level, requiring tokens as input and producing the lemma features of the tokens. It has been trained on a number of languages, including English and German, and targets well-formatted texts like news articles.
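The idea of an edit script between a word and its lemma can be illustrated with a simple suffix-based variant; this is a rough sketch, whereas tle learns full shortest edit scripts as classification targets:

```python
def edit_script(word, lemma):
    """Derive a suffix edit script (strip, append) that transforms the
    word into its lemma after their longest common prefix."""
    p = 0
    while p < min(len(word), len(lemma)) and word[p] == lemma[p]:
        p += 1
    return (word[p:], lemma[p:])

def apply_script(word, script):
    """Apply a (strip, append) script to a word."""
    strip, append = script
    return word[:len(word) - len(strip)] + append
```

The appeal of this representation is that a script learned from one word generalizes to unseen words with the same inflection pattern, e.g. the script derived from "walked" applies to "talked" as well.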
6 UIMA Whitespace Tokenizer, http://uima.apache.org/sandbox.html, accessed on October 14, 2014.
Efficiency Results
In terms of efficiency, Table A.2 shows the average run-time per sentence of
each algorithm. We measured all run-times in either five or ten runs on a 2
GHz Intel Core 2 Duo MacBook with 4 GB RAM, partly using the complete
respective corpus, partly its training set only.
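Such a measurement can be mimicked with a small harness like the following sketch; the stand-in algorithm and function name are assumptions:

```python
import time

def avg_ms_per_sentence(algorithm, sentences, runs=5):
    """Average wall-clock run-time per sentence in milliseconds,
    averaged over several runs as in the setup described above."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        for sentence in sentences:
            algorithm(sentence)
        total += time.perf_counter() - start
    return total / runs / len(sentences) * 1000.0

# Hypothetical stand-in for a text analysis algorithm: whitespace tokenization.
ms = avg_ms_per_sentence(str.split, ["a b c", "d e f g"], runs=3)
```

Averaging over several runs reduces the influence of warm-up effects and background load on the measured values.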
As can be seen, there is a small number of algorithms whose run-times
greatly exceed those of the others. Among these, the most expensive are the two dependency parsers, pde1 and pde2. Common dependency parsing approaches are of cubic computational complexity with respect to the
number of tokens in a sentence (Covington, 2001), although more efficient
approaches have been proposed recently (Bohnet and Kuhn, 2012). Still,
the importance of employing dependency parsing for complex text analysis
tasks like relation extraction is obvious (and, here, indicated by the different
F1 -scores of rtm2 and rtm1 ). The use of such algorithms particularly empha-
sizes the benefit of our pipeline optimization approaches, as exemplified in
the case study of Section 3.1.
Besides, we point out that, while we argue in Section 4.2 that algorithm
run-times remain comparably stable across corpora (compared to distribu-
tions of relevant information), a few outliers can be found in Table A.2.
Most significantly, rre2 has an average run-time per sentence of 0.81 milliseconds on the Revenue corpus, but only 0.05 milliseconds on the CoNLL-2003 dataset. The reason is that the latter contains only a very small
fraction of candidate statements on revenue that contain both a time and a
money entity. Consequently, the observed differences rather give another
indication of the benefit of filtering only relevant portions of text.
Effectiveness Results
The effectiveness values in Table A.2 were obtained on the test sets of
the specified corpora in all cases except for those on the Sentiment Scale
dataset, the Subjectivity dataset, and the Sentence polarity dataset. The
latter are computed using 10-fold cross-validation in order to make them
comparable to (Pang and Lee, 2005). All results are given in terms of the quality criteria we see as most appropriate for the respective text analyses (cf. Section 2.1 for details). For lack of the required ground-truth annotations, we could not evaluate the effectiveness of some algorithms, such as
pdr. Also, for a few algorithms, we analyzed a small subset of the Revenue
corpus manually to compute their precision (eti and emo) or accuracy (sse
and sto2 ). With respect to the effectiveness of the algorithms from existing software libraries, we refer to the respective literature.
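The measures reported in Table A.2 follow their standard definitions; for exact-match span evaluation, they can be computed as in this sketch:

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1-score of predicted annotation spans
    against gold spans under exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

For instance, one correct span out of two predicted and two gold spans yields precision, recall, and F1 of 0.5 each.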
Table A.2: Evaluation results on the run-time efficiency (in milliseconds per sen-
tence) and the effectiveness (as precision p, recall r, F1 -score f1 , and accuracy a) of
all text analysis algorithms referred to in this thesis on the specified text corpora.
B Software

B.1 An Expert System for Ad-hoc Pipeline Construction
Getting Started
Installation The expert system refers to the project XPS of the provided
software. By default, its annotation task ontology (cf. Section 3.2) com-
prises the algorithms and information types of the EfXTools project. When
using the integrated development environment Eclipse2 , Java projects can
be created from the respective top-level folders, taking each of them as a
root directory. Otherwise, an according procedure has to be performed.
General Information Our expert system Pipeline XPS can be seen as a
first prototype, which may still have some bugs and which is not robust to incorrect inputs and usage. Therefore, the instructions presented here should be followed carefully.
Launch Before the first launch, an option has to be adjusted if not using
Windows as the operating system: In the file ./XPS/conf/xps.properties, the
line starting with xps.treeTaggerModel that belongs to the operating system at hand must be uncommented, while the respective others must be commented out. The file Main.launch in the folder XPS can then be run in
order to launch the expert system. At first start, no annotation task ontology
is present in the system. After pressing OK in response to the appearing
popup window, a standard ontology is imported. When starting again, the
main window Pipeline XPS should appear as well as an Explanations window
with the message Pipeline XPS has been started.
User Interface Figure B.1 shows the prototypical user interface of the im-
plemented expert system from (Rose, 2012). Here, a user first sets the di-
rectory of an input text collection to be processed and chooses a quality
prioritization. Then, the user specifies an information need by repeatedly
choosing annotation types with active features (cf. Section 3.2).3 The addition of types to filter below does not replace the on-the-fly creation of filters from the pseudocode in Figure 3.1; rather, it denotes the definition of value constraints.4 Once all is set, pressing Start XPS leads to the ad-hoc
construction and execution of a pipeline. Afterwards, explanations and re-
sults are given in separate windows. We rely on this user interface in our
evaluation of ad-hoc pipeline construction in Section 3.3. In the following,
we describe how to interact with the user interface in more detail.
2 Eclipse, http://www.eclipse.org, accessed on October 20, 2014.
3 According to the properties of pipelinePartialOrderPlanning from Section 3.3, the expert system can construct pipelines for single information needs only. An integration with the implementation of the input control from Section 3.5 (cf. Appendix B.2) would allow handling different information needs at the same time, but this is left for future work.
4 Different from our model in Section 3.2, the user interface separates the specifications of information types and value constraints. Still, these inputs are used equally in both cases.
Figure B.1: Screenshot of the prototypical user interface of our expert system Pipe-
line XPS that realizes our approach to ad-hoc pipeline construction and execution.
2. Choose attributes for the added type by marking the appearing check-
boxes (if the added type has attributes at all).
3. Press the button Add this type.
Value Constraints The area Value constraints allows setting one or more
filters that represent the value constraints to be checked by the pipeline to
be constructed. For each filter, the following needs to be done:
1. In the Type to filter list, select the type to be filtered.
2. Select one of the appearing attributes of the selected type.
3. Select one of the three provided filters.
4. Insert the text to be used for filtering.
5. Press the button Add this filter.
Start XPS When all types and filters have been set, Start XPS constructs and
executes a pipeline for the specified information need and quality prioriti-
zation. Logs are shown in the console of Eclipse as well as in the Explana-
tions window. A Calculating results... window appears where all results are
shown when the pipeline execution is finished. The results are also written
to a file ./XPS/pipelineResults/resultOfPipeline-<pipelineNametimestamp>.txt.
All created pipeline descriptor files can be found in the ./XPS/temp/ directory, while the filter descriptor files are stored in ./XPS/temp/filter/.
Import Ontology By default, a sample ontology with a specified type system, an algorithm repository, and the built-in quality model described in Section 3.3 is set as the annotation task ontology to rely on. When pressing the button Import ontology, a window appears where an Apache UIMA
type system descriptor file can be selected as well as a directory in which to
look for the analysis engine descriptor files (i.e., the algorithm repository).
After pressing Import Ontology Information, the respective information is im-
ported into the ontology and Pipeline XPS is restarted.5
XPS The source code of the project XPS consists of four main pack-
ages: All classes related to the user interface of Pipeline XPS belong to
the package de.upb.mrose.xps.application, while the management of annota-
tion task ontologies and their underlying data model are realized by the
classes in de.upb.mrose.xps.knowledgebase and de.upb.mrose.xps.datamodel. Fi-
nally, de.upb.mrose.xps.problemsolver is responsible for the pipeline construc-
tion. Besides, some further packages handle the interaction with classes
and descriptors specific to Apache UIMA. For details on the architecture
and implementation of the expert system, we refer to (Rose, 2012).
EfXTools EfXTools is the primary software project containing text analysis
algorithms and text mining applications developed within our case study
InfexBA described in Section 2.3. A large fraction of the source code and as-
sociated files is not relevant for the expert system, but partly plays a role in
other experiments and case studies (cf. Appendix B.4 below). The algorithms
used by the expert system can be found in all sub-packages of the package
de.upb.efxtools.ae. The related Apache UIMA descriptor files are stored in
the folders desc, desc38, and desc76, where the latter two represent the algorithm
repositories evaluated in Section 3.3. Text corpora like the Revenue cor-
pus (cf. Appendix C.1) are given in the folder data.
Libraries The folder lib of XPS contains the following freely available Java
libraries, which are needed to compile the associated source code:6
Apache Jena, http://jena.apache.org
Apache Log4j, http://logging.apache.org/log4j/2.x/
Apache Lucene, http://lucene.apache.org
Apache Xerces, http://xerces.apache.org
JGraph, http://sourceforge.net/projects/jgraph
StAX, http://stax.codehaus.org
TagSoup, http://ccil.org/cowan/XML/tagsoup
Woodstox, http://woodstox.codehaus.org
StanfordNER, http://nlp.stanford.edu/software/CRF-NER.shtml
TreeTagger, http://www.ims.uni-stuttgart.de/projekte/
corplex/TreeTagger/
tt4j, http://code.google.com/p/tt4j/
Weka, http://www.cs.waikato.ac.nz/ml/weka/
Getting Started
The web application accesses a webservice to predict a sentiment score be-
tween 1 (worst) and 5 (best) for an English input text. The webservice re-
alizes the output analysis from Chapter 5, including the use of the feature
type developed in Section 5.4 and the creation of explanation graphs from
Section 5.5. While any input text can be entered by a user, the application targets the analysis of hotel reviews.
Before prediction, the application processes the entered text with a pipe-
line of several text analysis algorithms. In addition to the feature types con-
sidered in the evaluation of Section 5.4, it also extracts hotel names and
aspects and it derives features from the combination of local sentiment and
the found names and aspects. Unlike the evaluated sentiment scoring ap-
proach, prediction is then performed using supervised regression (cf. Sec-
tion 2.1). Afterwards, the application provides different visual explana-
tions of the prediction, as described in the following.
Figure B.2: Screenshot of the main user interface of the prototypical web applica-
tion for the prediction and explanation of the sentiment score of an input text.
opinions depend on the found hotel names and aspects (labeled as products
and product features) to achieve a simpler graph layout. Given that the
full overview graph style has been selected, the explanation graph includes
tokens with simplified part-of-speech tags as well as sentences besides the
outlined information. Otherwise, the tokens and sentences are shown in
the mentioned detail view only.
Figure B.3: Screenshot of the detail view of the explanation graph of a single dis-
course unit in the prototypical web application.
Acknowledgments
The development of the application was funded by the German Federal
Ministry of Education and Research (BMBF) as part of the project ArguAna described in Section 2.3. The application's user interface was implemented by the Resolto Informatik GmbH9, based in Herford, Germany. The
source code for predicting scores and creating explanation graphs was de-
veloped by the author of this thesis together with a research assistant from
the Webis group10 of the Bauhaus-Universität Weimar, Martin Trenkmann.
The latter also realized the webservice underlying the application.
Software
As already stated at the beginning of Appendix B, the provided software
is split into different projects. These projects have been created in different periods between 2009 and 2014. As the thesis at hand does not primarily target the publication of software, the projects (including all source code and experiment data) are not completely uniform and partly overlap. In case any problems are encountered when reproducing certain results, please contact the author of this thesis.
9 Resolto Informatik GmbH, http://www.resolto.com, accessed on October 17, 2014.
10 Webis group, http://www.webis.de, accessed on October 17, 2014.
B.4 Source Code of All Experiments and Case Studies
Text Corpora
The top-level folder corpora contains the three text corpora that we created ourselves in recent years and that are described in Appendices C.1 to C.3.
Each of them already comes in the format that is required to reproduce the
experiments and case studies.
For the processed existing corpora (cf. Appendix C.4), we provide con-
version classes in the projects. In particular, the CoNLL-2003 dataset can
be converted (1) into XMI files with the class CoNLLToXMIConverter in the
package de.upb.efxtools.application.convert found in the project EfXTools and
(2) into plain texts with the CoNLL03Converter in efxtools.sample.application
of IE-as-a-Filtering-Task. For the Sentiment Scale dataset and the related
sentence subjectivity and polarity datasets, three accordingly named XMI
conversion classes can be found in the package com.arguana.corpus.creation
of ArguAna. Finally, the Brown Corpus is converted using BrownCorpusTo-
PlainTextConverter from de.upb.efxtools.application.convert in EfXTools.
C Text Corpora

C.1 The Revenue Corpus
Table C.1: Numbers of texts from the listed websites in the complete Revenue cor-
pus as well as in its training set and in the union of its validation and test set.
Compilation
The Revenue corpus consists of 1128 German news articles from the years
2003 to 2009. These articles were manually selected from 29 source websites
by four employees of a company from the semantic technology field (see
acknowledgments below). Table C.1 lists the distribution of websites in the
corpus. As shown, we created a split of the corpus in which two thirds of the texts constitute the training set and one sixth forms the validation set and the test set each. In order to simulate the conditions of developing and applying
text analysis algorithms, the training texts were randomly chosen from the
seven most represented websites only, while the validation and test data
both cover all 29 sources. As a result, the training set of the corpus consists
Table C.2: Distribution of statements on revenues in the different parts of the Rev-
enue corpus, separated into the distributions of forecasts and of declarations.
of 752 texts with a total of 21,586 sentences, while the validation and test set comprise 188 texts each, with 5,751 and 6,038 sentences, respectively.
Annotations
In each text of the Revenue corpus, annotations of text spans are given on the event level and the entity level, as sketched in the following.
Event Level Every sentence with explicit time and money information that
represents a statement on the revenue of an organization or market is anno-
tated as either a forecast or a declaration. If a sentence comprises more than
one such statement on revenue, it is annotated multiple times.
Entity Level In each statement, the time expression and the monetary expression are marked as such (relative money information is preferred over absolute amounts in case they are separated). Accordingly, the subject is marked within the sentence if available; otherwise, its last mention in the preceding text is marked. The same holds for the optional entities, namely, a possible reference point that a relative time expression refers to, a trend word that indicates whether a relative monetary expression is increasing or decreasing, and the author of
a statement. All annotated entities are linked to the statement on revenue
they belong to. Only entities that belong to a statement are annotated.
Table C.2 gives an overview of the statements on revenue in the corpus.
Altogether, 2,075 statements are annotated. The varying distributions of forecasts and declarations indicate that the validation and test sets differ significantly from the training set.
Example
Figure C.1 shows a sample text from the Revenue corpus with one state-
ment on revenue, in particular a forecast. Besides the required time and
monetary expressions, the forecast spans an author mention as well as a
trend indicator of the monetary expression. Other relevant information is spread across the text, namely, the organization the forecast is about as well as a reference date that is needed to resolve the time expression.
Figure C.1: Illustration of a sample text from the Revenue corpus. Each statement
on revenue that spans a time expression and a monetary expression is manually
marked either as a forecast or as a declaration. Also, different types of information
needed to process the statement are annotated.
Annotation Process
Two employees of the above-mentioned company manually annotated all
texts from the corpus. They were given the following main guideline:
Search for sentences in the text (including its title) that contain state-
ments about the revenues of an organization or market with explicit
time and money information. Annotate each such sentence as a fore-
cast (if it is about the future) or as a declaration (if about the past).
Also, annotate the following information related to the statement:
Author. The person who made the statement (if given).
Money. The money information in the statement (prefer relative
over absolute information in case they are separated).
Subject. The organization or market the statement is about (annotate a mention in the statement if given, otherwise the closest mention in the preceding text).
Trend. A word that makes explicit whether the revenues increase
or decrease (if given).
Time. The time information in the statement.
Reference point. A point in time that the annotated time information refers to (if given).
Files
The Revenue corpus comes as a packed tar.gz archive (6 MB compressed; 32 MB uncompressed). The content of each contained news article is given as Unicode plain text with the source URL appended for access to the HTML
source code. Annotations are given in a standard XMI file preformatted for
the Apache UIMA framework.
Acknowledgments
The creation of the Revenue corpus was funded by the German Federal
Ministry of Education and Research (BMBF) as part of the project In-
fexBA, described in Section 2.3. The corpus was planned by the author
of this thesis together with a research assistant from the above-mentioned
Webis group of the Bauhaus-Universität Weimar, Peter Prettenhofer. The
described process of manually selecting and annotating the texts in the cor-
pus was conducted by the Resolto Informatik GmbH, also named above.
C.2 The ArguAna TripAdvisor Corpus

Compilation
The ArguAna TripAdvisor corpus is based on a highly balanced subset of a
dataset originally used for aspect-level rating prediction (Wang et al., 2010).
Figure C.2: (a) Distribution of the locations of the reviewed hotels in the original
TripAdvisor dataset from Wang et al. (2010). The ArguAna TripAdvisor corpus
contains 300 annotated texts of each of the seven marked locations. (b) Distribution
of the overall ratings of the reviews in the original dataset between 1 and 5.
The original dataset contains nearly 250,000 crawled English hotel reviews
from the travel website TripAdvisor2 that refer to 1850 hotels from over 60
locations. Each review comprises a text, a set of numerical ratings, and some
metadata. The quality of the texts is not perfect in all cases, most likely due to crawling errors: some line breaks have been lost, which hides a number of sentence boundaries and, sporadically, also word boundaries. The distributions of locations and overall ratings in the original dataset are illustrated
in Figure C.2. Since the reviews of the covered locations are more or less
randomly crawled, the distribution of overall ratings can be assumed to be
representative for TripAdvisor in general.
Our sampled subset consists of 2,100 reviews balanced with respect to
both location and overall rating. In particular, we selected 300 reviews each from seven of the 15 most-represented locations in the original dataset, 60 for every overall rating between 1 (worst) and 5 (best). This supports
an optimal training for machine learning approaches to rating prediction.
Moreover, the reviews of each location cover at least 10 hotels, but as few as possible, which is beneficial for opinion summarization approaches.
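A compilation balanced by location and overall rating can be sketched as follows; the field names and the function are assumptions for illustration, not the original compilation code:

```python
import random
from collections import defaultdict

def balanced_sample(reviews, locations, per_rating, seed=0):
    """Sample per_rating reviews for every combination of a selected
    location and an overall rating; assumes each review is a dict with
    hypothetical "location" and "rating" fields."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for review in reviews:
        if review["location"] in locations:
            buckets[(review["location"], review["rating"])].append(review)
    sample = []
    for key in sorted(buckets):
        sample.extend(rng.sample(buckets[key], per_rating))
    return sample
```

With seven locations and 60 reviews per rating between 1 and 5, this scheme yields the 7 x 5 x 60 = 2,100 reviews of the corpus.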
To counter location-specific bias, we propose a corpus split with a train-
ing set containing the reviews of three locations, and both a validation set
and a test set with two of the other locations. Table C.3 lists details about
the balanced compilation and the split.
Annotations
The reviews in the dataset from (Wang et al., 2010) have a title and a body
and they include different ratings and metadata. We maintain all this in-
formation as text-level and syntax-level annotations in the ArguAna Trip-
Advisor corpus. In addition, the corpus is enriched with annotations of lo-
cal sentiment at the discourse level and domain concepts at the entity level:
2 TripAdvisor, http://www.tripadvisor.com, accessed on October 17, 2014.
Table C.3: The number of reviewed hotels of each location in the complete Argu-
Ana TripAdvisor corpus and in its three parts as well as the number of reviews for
each sentiment score between 1 and 5 and in total.
Text Level Each review comes with optional ratings for seven hotel aspects, namely, value, room, location, cleanliness, front desk, service, and business
service, as well as with a mandatory overall rating. We interpret the overall
ratings as global sentiment scores. All ratings are integer values between 1
and 5. In terms of metadata, the ID and location of the reviewed hotel, the
username of the author, and the date of creation are given.
Syntax Level In every review text, the title and body are annotated as such
and they are separated by two line breaks.
Discourse Level All review texts are segmented into single statements that
represent single discourse units. A statement is a main clause together with
all its dependent subordinate clauses (hence, a statement spans at most one sentence). Each statement is classified as being an objective fact, a positive
opinion, or a negative opinion.
Entity Level Two types of domain concepts are marked as product features in
all texts: (1) hotel aspects, like those rated on the text level but also others like
atmosphere, and (2) everything that is called an amenity in the hotel domain,
e.g. facilities like a coffee maker or wifi as well as services like laundry.
Table C.4 lists the numbers of corpus annotations together with some
statistics. The corpus includes 31,006 classified statements and 24,596 prod-
uct features. On average, a text comprises 14.76 statements and 11.71 prod-
uct features. A histogram of the length of all reviews in terms of the number
of statements is given in Figure C.3(a), grouped into intervals. As can be
seen, over one third of all texts span less than 10 statements (intervals 0-4
and 5-9), whereas less than one fourth spans 20 or more. Figure C.3(b) visu-
alizes the distribution of sentiment scores for all intervals that cover at least
1% of the corpus. Most significantly, the fraction of reviews with sentiment score 3 increases with higher numbers of statements.
Table C.4: Statistics of the tokens, sentences, manually classified statements, and
manually annotated product features in the ArguAna TripAdvisor corpus.
Figure C.3: (a) Histogram of the number of statements in the texts of the ArguAna
TripAdvisor corpus, grouped into intervals. (b) Interpolated curves of the fraction
of sentiment scores in the corpus depending on the numbers of statements.
Example
Figure C.4 illustrates the main annotations of a sample review from the corpus. Each text has a specified title and body. In this case, the body spans nine mentions of product features, such as "location" or "internet access". It is segmented into 12 facts and opinions. The facts and opinions reflect the review's rather negative sentiment score of 2, while e.g. highlighting that the internet access was not seen as negative. Besides, Figure C.4 exemplifies the typical writing style often found in web user reviews like those from TripAdvisor: a few grammatical inaccuracies (e.g. inconsistent capitalization) and colloquial phrases (e.g. "like 2 mins walk"), but easily readable text.
Annotation Process
The classification of all statements in the texts of the ArguAna TripAdvisor
corpus was performed using crowdsourcing, while experts annotated the
product features. Beforehand, the segmentation of the texts into statements was done automatically using the algorithm pdu (cf. Appendix A). The manual annotation process is summarized in the following.
Crowdsourcing Annotation The statements were classified using the
crowdsourcing platform Amazon Mechanical Turk that we already relied
on in Section 5.5. The task we assigned to the workers here involved the
Figure C.4: Illustration of a review from the ArguAna TripAdvisor corpus. Each
review text has a title and a body. It is segmented into discourse-level statements
that have been manually classified as positive opinions (light green background),
negative opinions (medium red), and objective facts (dark gray). Also, manual an-
notations of domain concepts are provided (marked in bold).
Files
The ArguAna TripAdvisor corpus comes as a packed ZIP archive (8 MB
compressed; 28 MB uncompressed), which contains XMI files preformat-
ted for the Apache UIMA framework just as for the Revenue corpus in Ap-
pendix C.1. Moreover, we converted all those 196,865 remaining reviews
of the dataset from (Wang et al., 2010) that have a correct text and a correct
overall rating between 1 and 5 into the same format without manual annota-
tions but with all TripAdvisor metadata. This unannotated dataset (265 MB
compressed; 861 MB uncompressed) can be used both for semi-supervised
learning techniques (cf. Section 2.1) and for large-scale evaluations of rat-
ing prediction and the like. We attached some example applications and a
selection of the text analysis algorithms from Appendix A to the corpus. The
applications and algorithms can be executed to conduct the analyses from
Section 5.3, thereby demonstrating how to process the corpus.
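Because the corpus files are plain XMI, their annotations can also be inspected with any XML parser before a full Apache UIMA pipeline is set up. The following Python sketch illustrates this; note that the element and attribute names (arg:Opinion, arg:Fact, begin, end, polarity) are assumptions for illustration, since the actual names are defined by the type system descriptors shipped with the corpus.

```python
import xml.etree.ElementTree as ET

# Illustrative XMI snippet; the real corpus files follow the type system
# shipped with the corpus, so tag and attribute names here are assumptions.
XMI = """<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:arg="http:///arguana.ecore">
  <arg:Opinion xmi:id="1" begin="0" end="24" polarity="positive"/>
  <arg:Opinion xmi:id="2" begin="25" end="58" polarity="negative"/>
  <arg:Fact xmi:id="3" begin="59" end="90"/>
</xmi:XMI>"""

def read_statements(xmi_text):
    """Collect (type, begin, end, polarity) tuples from an XMI string."""
    root = ET.fromstring(xmi_text)
    statements = []
    for elem in root:
        kind = elem.tag.split('}')[-1]        # strip the XML namespace
        begin, end = int(elem.get('begin')), int(elem.get('end'))
        polarity = elem.get('polarity')       # None for objective facts
        statements.append((kind, begin, end, polarity))
    return statements

for statement in read_statements(XMI):
    print(statement)
```

Such a lightweight reader suffices for counting classes or spot-checking spans; any actual processing of the corpus should go through the UIMA framework, as demonstrated by the attached example applications.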
Acknowledgments
The creation of the ArguAna TripAdvisor corpus was funded by the Ger-
man Federal Ministry of Education and Research (BMBF) as part of the
project ArguAna, described in Section 2.3. The corpus was planned by
the author of this thesis together with a research assistant from the above-
The LFA-11 Corpus
Compilation
The LFA-11 corpus contains 2713 texts from the music domain as well as
2093 texts from the smartphone domain. The texts of these topical domains
come from different sources and are of very different quality and style:
The music collection is made up of user reviews, professional reviews,
and promotional texts from a social network platform, selected by employ-
ees of a company from the digital asset management industry (see acknowl-
edgments below). These texts are well-written and of homogeneous style.
On average, a music text spans 9.4 sentences with 23.0 tokens per sentence, according to the output of our algorithms sse and sto (cf. Appendix A.1). In
contrast, the texts in the smartphone collection are blog posts. These posts were retrieved via queries on a self-made Apache Lucene index, which was built for the Spinn3r corpus. Spinn3r aims at crawling and indexing
the whole blogosphere. Hence, the texts in the smartphone collection vary
strongly in quality and writing style. They have an average length of 11.8
sentences but only 18.6 tokens per sentence.
3 Apache Lucene, http://lucene.apache.org, accessed on October 19, 2014.
4 Spinn3r corpus, http://www.spinn3r.com, accessed on October 19, 2014.
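Length statistics like those above can in principle be reproduced with any sentence splitter and tokenizer. The following Python sketch illustrates the computation; the naive regex-based splitting merely stands in for the corpus algorithms sse and sto from Appendix A.1 and will not match their output exactly.

```python
import re

def length_statistics(texts):
    """Average number of sentences per text and tokens per sentence.

    Naive splitting on sentence-final punctuation and whitespace stands
    in for proper sentence splitting and tokenization.
    """
    sentence_counts, token_counts = [], []
    for text in texts:
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        sentence_counts.append(len(sentences))
        for sentence in sentences:
            token_counts.append(len(sentence.split()))
    avg_sentences = sum(sentence_counts) / len(sentence_counts)
    avg_tokens = sum(token_counts) / len(token_counts)
    return avg_sentences, avg_tokens

texts = ["A first sentence. A second one!", "Only one sentence here."]
print(length_statistics(texts))
```

For the corpus itself, the numbers reported above were of course computed on the annotated sentence and token spans, not on such heuristic splits.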
Table C.5: Distributions of the text-level classes in the three sets of the two topical
parts of the LFA-11 corpus for the three annotated types.
Annotations
All texts of the LFA-11 corpus are annotated on the text level with respect
to three classification schemes:
Text Level First, the language function of each text is annotated as being predominantly personal, commercial, or informational (cf. Section 2.3). Second, the texts are classified with respect to their sentiment polarity, where we distinguish positive, neutral, and negative sentiment. And third, the relevance with respect to the topic of the corpus part the text belongs to (music or smartphones) is annotated as being given (true) or not (false).
In the corpus texts, all three annotations are linked to a metadata annota-
tion that provides access to them. Some texts were annotated twice for inter-
annotator agreement purposes (see the annotation process below). These
texts have two annotations of each type. We created splits for each topic
with half of the texts in the training set and one fourth each in the validation set and the test set. Table C.5 shows the class distributions of
language functions, sentiment polarities, and topic relevance. The distribu-
tions indicate that the training, validation, and test sets differ significantly
from each other. In case of double-annotated texts, we used the annotation
of the second employee to compute the distributions. So, the exact frequen-
cies of the different classes depend on which annotations are used.
5 The language function annotation is called Genre in the corpus texts. Language functions can be seen as a single aspect of genres (Wachsmuth and Bujna, 2011).
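The 1/2, 1/4, 1/4 proportions of the splits can be illustrated as follows. The actual assignment of texts to the sets is fixed in the distributed corpus files; this Python sketch only demonstrates how such splits are derived from a shuffled list.

```python
import random

def split_corpus(texts, seed=42):
    """Shuffle and split into 1/2 training, 1/4 validation, 1/4 test."""
    texts = list(texts)
    random.Random(seed).shuffle(texts)   # fixed seed for reproducibility
    half, quarter = len(texts) // 2, len(texts) // 4
    training = texts[:half]
    validation = texts[half:half + quarter]
    test = texts[half + quarter:]        # test gets any remainder
    return training, validation, test

# E.g., the 2713 music texts yield sets of 1356, 678, and 679 texts.
train, val, test = split_corpus(range(2713))
print(len(train), len(val), len(test))
```

A stratified split that preserves the class distributions in each set would be preferable for skewed classes, which is why the distributions of the three sets are reported explicitly in Table C.5.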
Figure C.5: Translated excerpts from three texts of the music part of the LFA-11 corpus, exemplifying one instance of each language function. Notice that the translation to English may have affected the indicators of the annotated classes.

Personal: [...] How did Alex ask recently when he saw the Kravitz' latest best-of collection: Is it his own liking, the voting on his website or the chart position what counts? Good question. However, in our case, there is nothing to argue about: 27 songs, all were number one. The Beatles. Biggest Band on the Globe. [...]

Commercial: [...] The sitars sound authentically Indian. In combination with the three-part harmonious singing and the jingle-jangle of the rickenbacker guitars, they create an oriental flair without losing their Beatlesque elegance. If that doesn't make you smile! [...]

Informational: [...] It's All Too Much? No, no, still okay, though an enormous hype was made about the seemingly new Beatles song for decades. The point is that exactly this song Hey Bulldog has already been published several times, most recently on a reprint of "Yellow Submarine" in the year 1987. [...]
Example
Figure C.5 shows excerpts from three texts of the music collection, one out of
each language function class. The excerpts have been translated to English for convenience. The neutral sentiment of the personal text might seem inappropriate, but the given excerpt is misleading in this respect.
Annotation Process
The classification of all texts of the LFA-11 corpus was performed by two
employees of the mentioned company based on the following guidelines:
Read through each text of the two collections. First, tag the text as be-
ing predominantly personal, commercial, or informational with respect
to the product discussed in the text:
personal. Use this annotation if the text seems not to be of com-
mercial interest, but probably represents the personal view on the
product of a private individual.
commercial. Use this annotation if the text is of obvious com-
mercial interest. The text seems to predominantly aim at persuad-
ing the reader to buy or like the product.
informational. Use this annotation if the text seems not to be of
commercial interest with respect to the product. Instead, it pre-
dominantly appears to be informative in a journalistic manner.
Second, tag the sentiment polarity of the text:
neutral. Use this annotation if the text either reports on the product without making any positive or negative statement about it, or if the text is neither positive nor negative, but rather close to the middle between positive and negative.
Files
The LFA-11 corpus comes as a packed tar.gz archive (5 MB compressed;
35 MB uncompressed). Both the music and the smartphone texts are stored
in a standard UTF-8 encoded XMI file together with their annotations, pre-
formatted for the Apache UIMA framework.
Acknowledgments
The creation of the LFA-11 corpus was funded by the German Federal Min-
istry of Education and Research (BMBF) as part of the project InfexBA, de-
scribed in Section 2.3. Both parts of the corpus were planned by the author
of this thesis together with a research assistant from the above-mentioned
Webis group, Peter Prettenhofer. The latter gathered the texts of the smart-
phone collection, whereas the music texts were selected by the company
Digital Collections Verlagsgesellschaft mbH, based in Hamburg, Germany. This company also conducted the described annotation process.
Used Existing Text Corpora
In the following, we summarize the main facts about the purpose and compilation of each collection as well as about given annotations, as far as available. Also, we give references to where each collection can be accessed.
Brown Corpus
The Brown corpus (Francis, 1966) was introduced in the 1960s as a
standard text collection of present-day American English. It consists of 500
prose text samples of about 2000 words each. The samples are excerpts from
texts printed in the year 1961 that were written by native speakers of Amer-
ican English as far as determinable. They cover a wide range of styles and
varieties of prose. At a high level, they can be divided into informative prose
(374 samples) and imaginative prose (126 samples).
We process the Brown corpus in Sections 4.2 and 4.4 to show how rel-
evant information is distributed across texts and collections of texts. The
Brown corpus is free for non-commercial purposes and can be downloaded
at http://www.nltk.org/nltk_data (accessed on October 20, 2014).
Wikipedia Sample
The German Wikipedia sample that we experiment with consists of the first
10,000 articles from the Wikimedia dump from March 9, 2013, ordered ac-
cording to their internal page IDs. The complete dump contains over 3 mil-
lion Wikipedia pages, from which 1.8 million pages represent articles that
are neither empty nor stubs or simple lists.
As in the case of the Brown corpus, we process the Wikipedia sample in
Sections 4.2 and 4.4 to show how relevant information is distributed across
texts and collections of texts. The dump we rely on is outdated and not avail-
able anymore. However, similar dumps from later dates can be accessed at
http://dumps.wikimedia.org/dewiki (accessed on October 20, 2014).
7 Wikimedia, http://dumps.wikimedia.org, accessed on October 20, 2014.
References
Charu C. Aggarwal. Outlier Analysis. Springer, New York, NY, USA, 2013.
Eugene Agichtein and Luis Gravano. Querying Text Databases for Efficient Information Extraction. In Proceedings of the 19th International Conference on Data Engineering, pages 113–124, 2003.
David Ahn, Sisay F. Adafre, and Maarten de Rijke. Extracting Temporal Information from Open Domain Text: A Comparative Exploration. Journal of Digital Information Management, 3(1):14–20, 2005.
Rami Al-Rfou and Steven Skiena. SpeedRead: A Fast Named Entity Recognition Pipeline. In Proceedings of the 24th International Conference on Computational Linguistics, pages 51–66, 2012.
Sophia Ananiadou and John McNaught. Text Mining for Biology and
Biomedicine. Artech House, Inc., Norwood, MA, USA, 2005.
Maik Anderka, Benno Stein, and Nedim Lipka. Predicting Quality Flaws in User-generated Content: The Case of Wikipedia. In 35th International ACM Conference on Research and Development in Information Retrieval, pages 981–990, 2012.
Miguel Ballesteros, Bernd Bohnet, Simon Mille, and Leo Wanner. Deep-Syntactic Parsing. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, pages 1402–1413, 2014.
Srinivas Bangalore. Thinking outside the Box for Natural Language Processing. In Proceedings of the 13th International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–16, 2012.
Douglas Biber, Susan Conrad, and Randi Reppen. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, Cambridge, MA, USA, 1998.
John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn
Wortman. Learning Bounds for Domain Adaptation. In Advances in Neu-
ral Information Processing Systems 21. MIT Press, 2008.
Bernd Bohnet. Very High Accuracy and Fast Dependency Parsing is not a Contradiction. In Proceedings of the International Conference on Computational Linguistics, pages 89–97, 2010.
Bernd Bohnet and Jonas Kuhn. The Best of Both Worlds: A Graph-based Completion Model for Transition-based Parsers. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 77–87, 2012.
Bernd Bohnet, Alicia Burga, and Leo Wanner. Towards the Annotation of Penn TreeBank with Information Structure. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1250–1256, Nagoya, Japan, 2013.
Claire Cardie, Vincent Ng, David Pierce, and Chris Buckley. Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System. In Proceedings of the Sixth Applied Natural Language Processing Conference, pages 180–187, 2000.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
Laura Chiticariu, Yunyao Li, Sriram Raghavan, and Frederick R. Reiss. Enterprise Information Extraction: Recent Developments and Open Challenges. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
Yejin Choi, Eric Breck, and Claire Cardie. Joint Extraction of Entities and Relations for Opinion Recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 431–439, 2006.
Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. Question Answering Passage Retrieval using Dependency Relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 400–407, 2005.
Anish Das Sarma, Alpa Jain, and Philip Bohannon. Building a Generic Debugger for Information Extraction Pipelines. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2229–2232, 2011.
Hal Daumé III and Daniel Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26(1):101–126, 2006.
Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Jason Y. Zien. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. In Proceedings of the 12th International Conference on World Wide Web, pages 178–186, 2003.
Geoffrey B. Duggan and Stephen J. Payne. Text Skimming: The Process and Effectiveness of Foraging through Text under Time Pressure. Journal of Experimental Psychology: Applied, 15(3):228–242, 2009.
Michael Thomas Egner, Markus Lorch, and Edd Biddle. UIMA GRID: Distributed Large-scale Text Analysis. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pages 317–326, 2007.
Joseph L. Fleiss. Statistical Methods for Rates and Proportions. John Wiley &
Sons, second edition, 1981.
George Forman and Evan Kirshenbaum. Extremely Fast Text Feature Extraction for Classification and Indexing. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1221–1230, 2008.
W. Nelson Francis. A Standard Sample of Present-day English for Use with Digital Computers. Brown University, 1966.
Daniel Gruhl, Laurent Chavet, David Gibson, Joerg Meyer, Pradhan Pattanayak, Andrew Tomkins, and Jason Zien. How to Build a WebFountain: An Architecture for Very Large-scale Text Analytics. IBM Systems Journal, 43(1):64–76, 2004.
Baris Güldali, Holger Funke, Stefan Sauer, and Gregor Engels. TORC: Test Plan Optimization by Requirements Clustering. Software Quality Journal, pages 1–29, 2011.
Matthias Hagen, Benno Stein, and Tino Rüb. Query Session Detection as a Cascade. In 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pages 147–152, 2011.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10–18, 2009.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York, NY, USA, second edition, 2009.
Wei Chung Hsu, Howard Chen, Pen Chung Yew, and H. Chen. On the Predictability of Program Behavior using Different Input Data Sets. In Sixth Annual Workshop on Interaction between Compilers and Computer Architectures, pages 45–53, 2002.
John E. Kelly and Steve Hamm. Smart Machines: IBM's Watson and the Era of Cognitive Computing. Columbia University Press, New York, NY, USA, 2013.
Timo Klerx, Maik Anderka, Hans Kleine Büning, and Steffen Priesterjahn. Model-based Anomaly Detection for Discrete Event Systems. In Proceedings of the 26th IEEE International Conference on Tools with Artificial Intelligence, pages 665–672, 2014.
Eyal Krikon, David Carmel, and Oren Kurland. Predicting the Performance of Passage Retrieval for Question Answering. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2451–2454, 2012.
Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Intro-
duction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-
Cummings Publishing Co., Inc., Redwood City, CA, USA, 1994.
Knud Lambrecht. Information Structure and Sentence Form: Topic, Focus, and
the Mental Representations of Discourse Referents. Cambridge University
Press, New York, NY, USA, 1994.
Yong-Bae Lee and Sung Hyon Myaeng. Text Genre Classification with Genre-Revealing and Subject-Revealing Features. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 145–150, 2002.
David D. Lewis and Richard M. Tong. Text Filtering in MUC-3 and MUC-4. In Proceedings of the 4th Conference on Message Understanding, pages 51–66, 1992.
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.
Lianghao Li, Xiaoming Jin, and Mingsheng Long. Topic Correlation Analysis for Cross-Domain Text Classification. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, pages 998–1004, 2012a.
Qi Li, Sam Anzaroot, Wen-Pin Lin, Xiang Li, and Heng Ji. Joint Inference for Cross-document Information Extraction. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2225–2228, 2011.
Yunyao Li, Laura Chiticariu, Huahai Yang, Frederick R. Reiss, and Arnaldo Carreno-Fuentes. WizIE: A Best Practices Guided Development Environment for Information Extraction. In Proceedings of the ACL 2012 System Demonstrations, pages 109–114, 2012b.
Yi Mao and Guy Lebanon. Isotonic Conditional Random Fields and Local Sentiment Flow. Advances in Neural Information Processing Systems, 19:961–968, 2007.
Daniel Marcu. The Theory and Practice of Discourse Parsing and Summarization.
MIT Press, 2000.
David Meyer, Friedrich Leisch, and Kurt Hornik. The Support Vector Machine under Test. Neurocomputing, 55(1–2):169–186, 2003.
Vincent Ng. Supervised Noun Phrase Coreference Research: The First Fifteen Years. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1396–1411, 2010.
Bo Pang and Lillian Lee. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115–124, 2005.
Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
Adam Pauls and Dan Klein. k-best A* Parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 958–966, 2009.
Anton Riabov and Zhen Liu. Scalable Planning for Distributed Stream Processing Systems. In Proceedings of the Sixteenth International Conference on Automated Planning and Scheduling, pages 31–41, 2006.
Benno Stein, Sven Meyer zu Eissen, Gernot Gräfe, and Frank Wissbrock. Automating Market Forecast Summarization from Internet Data. In Proceedings of the Fourth Conference on WWW/Internet, pages 395–402, 2005.
Benno Stein, Sven Meyer zu Eissen, and Nedim Lipka. Genres on the Web, volume 42 of Text, Speech and Language Technology, chapter Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues, pages 167–190. Springer, Berlin Heidelberg New York, 2010.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, 2003.
Anna Trosborg. Text Typology: Register, Genre and Text Type. In Text Typology and Translation, pages 3–24. John Benjamins Publishing, Amsterdam, The Netherlands, 1997.
Jordi Turmo, Alicia Ageno, and Neus Català. Adaptive Information Extraction. ACM Computing Surveys, 38(2), 2006.
Maria Paz Garcia Villalba and Patrick Saint-Dizier. Some Facets of Argument Mining for Opinion Analysis. In Proceedings of the 2012 Conference on Computational Models of Argument, pages 23–34, 2012.
Henning Wachsmuth and Katrin Bujna. Back to the Roots of Genres: Text Classification by Language Function. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 632–640, 2011.
Daisy Z. Wang, Long Wei, Yunyao Li, Frederick R. Reiss, and Shivakumar Vaithyanathan. Selectivity Estimation for Extraction Operators over Text Data. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, pages 685–696, 2011.
Hongning Wang, Yue Lu, and Chengxiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 783–792, 2010.
Casey Whitelaw, Alex Kehlenbeck, Nemanja Petrovic, and Lyle Ungar. Web-scale Named Entity Recognition. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 123–132, 2008.
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Morgan Kaufmann Publishers, San Francisco, CA, 2nd
edition, 2005.
Qiong Wu, Songbo Tan, Miyi Duan, and Xueqi Cheng. A Two-Stage Algorithm for Domain Adaptation with Application to Sentiment Transfer.
Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne Niblack. Sen-
timent Analyzer: Extracting Sentiments About a Given Topic Using Nat-
ural Language Processing Techniques. In Proceedings of the Third IEEE
International Conference on Data Mining, pages 427434, 2003.
Monika Žáková, Petr Křemen, Filip Železný, and Nada Lavrač. Automating Knowledge Discovery Workflow Composition through Ontology-based Planning. IEEE Transactions on Automation Science and Engineering, 8(2):253–264, 2011.
Min Zhang, Jie Zhang, Jian Su, and Guodong Zhou. A Composite Kernel to Extract Relations Between Entities with Both Flat and Structured Features. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 825–832, 2006.
Tong Zhang. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 116–123, 2004.