Computational Linguistics and Intelligent Text Processing
18th International Conference, CICLing 2017
Budapest, Hungary, April 17–23, 2017
Revised Selected Papers, Part I
LNCS 10761
Lecture Notes in Computer Science 10761
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zurich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology Madras, Chennai, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7407
Alexander Gelbukh (Ed.)
Editor
Alexander Gelbukh
CIC, Instituto Politécnico Nacional
Mexico City, Mexico
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
To encourage providing algorithms and data along with the published papers, we
selected three winners of our Verifiability, Reproducibility, and Working Description
Award. The main factors in choosing the awarded submissions were technical
correctness and completeness, readability of the code and documentation, simplicity
of installation and use, and exact correspondence to the claims of the paper.
Unnecessary sophistication of the user interface was discouraged. Novelty and
usefulness of the results were not evaluated for this award; they were evaluated
for the paper itself and not for the data.
The following papers received the Best Paper Awards, the Best Student Paper
Award, as well as the Verifiability, Reproducibility, and Working Description Awards,
respectively:
Best Verifiability Award, First Place:
“Label-Dependencies Aware Recurrent Neural Networks”
by Yoann Dupont, Marco Dinarelli, and Isabelle Tellier
CICLing 2017 was hosted by the Pázmány Péter Catholic University, Faculty of
Information Technology and Bionics, Budapest, Hungary, and organized by the
CICLing 2017 Organizing Committee in conjunction with the Pázmány Péter Catholic
University, Faculty of Information Technology and Bionics, the Natural Language and
Text Processing Laboratory of the CIC, IPN, and the Mexican Society of Artificial
Intelligence (SMIA).
Organizing Committee
Attila Novák (Chair) MTA-PPKE Language Technology Research Group,
Pázmány Péter Catholic University
Gábor Prószéky MTA-PPKE Language Technology Research Group,
Pázmány Péter Catholic University
Borbála Siklósi MTA-PPKE Language Technology Research Group,
Pázmány Péter Catholic University
Program Committee
Ted Pedersen
Florian Holz
Miloš Jakubíček
Sergio Jiménez Vargas
Miikka Silfverberg
Ronald Winnemöller
Alexander Gelbukh
Eduard Hovy
Rada Mihalcea
Ted Pedersen
Yorick Wilks
Contents – Part I
General
Invited Paper:
Invited Paper:
Information Extraction
Speech Recognition
Invited Papers:
Sentiment Analysis
Opinion Mining
Invited Paper:
Reading the Author and Speaker: Towards a Holistic and Deep Approach
on Automatic Assessment of What is in One’s Words
Björn W. Schuller
Machine Translation
Text Summarization
Efficient Semantic Search Over Structured Web Data: A GPU Approach
Ha-Nguyen Tran, Erik Cambria, and Hoang Giang Do
Practical Applications
1 Introduction
Traditionally, natural language processing (NLP) relies on a preprocessing pipe-
line, such as the one described in [33] and depicted in Fig. 1. First, the document
is tokenized. This step needs language-specific tokenization tools. The token
sequence is then segmented into sentences. Afterwards, syntactic and semantic
analysis is performed (usually sentence-wise). Syntactic analysis outputs part-of-
speech tags, syntactic dependencies, etc. Semantic analysis extracts named entity
tags, semantic roles, etc. The actual natural language processing/understanding
(NLP/NLU) task, e.g., question answering or information extraction, uses fea-
tures from those preprocessing steps.
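The first two pipeline steps can be sketched in a few lines of Python. This is only an illustrative toy, not the pipeline of [33]: the function names `tokenize`, `split_sentences` and `pipeline` are hypothetical, the tokenizer is a simple English-centric regular expression, and the sentence splitter just cuts after ".", "!" and "?". The example input shows how a decision made during tokenization is inherited by the next step.

```python
import re


def tokenize(text):
    # Toy tokenizer (an assumption for illustration): runs of word
    # characters become tokens, every other non-space character is
    # a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(tokens):
    # Segment the token sequence into sentences after ".", "!" or "?".
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def pipeline(text):
    # Each step consumes the output of the previous one, so an early
    # mistake is passed on to every later step.
    return split_sentences(tokenize(text))


sentences = pipeline("I use Yahoo! Mail daily. It works.")
for sentence in sentences:
    print(sentence)
```

The input contains two sentences, but the tokenizer splits "Yahoo!" into "Yahoo" and "!", and the sentence splitter then ends a sentence after "!", so three sentences come out. Nothing downstream can repair this without redoing the tokenization.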
Since every preprocessing step can have deficiencies, the whole pipeline of
modules is prone to subsequent errors. Usually, it is hard, inefficient or even
impossible to recover from those errors, especially when they occur during
tokenization, i.e., in the first step of the pipeline. Although tokenization is
easy for many cases in English,1 it can be very hard for other languages, e.g.,
for Chinese because tokens are not separated by spaces, for German because of
compounds and for agglutinative languages like Turkish. Therefore, character-based
models have the potential of being more robust for natural language processing.
Furthermore, they support end-to-end approaches for text that do not require
manual definitions of features, similar to pixel-based models in vision or
acoustic signal-based approaches in speech recognition.
1
There are also difficult cases in English, such as “Yahoo!” or “San Francisco-Los
Angeles flights”.
© Springer Nature Switzerland AG 2018
A. Gelbukh (Ed.): CICLing 2017, LNCS 10761, pp. 3–16, 2018.
https://doi.org/10.1007/978-3-319-77113-7_1
H. Adel et al.
In the following, we will present an overview of work on character-based
models for a variety of tasks from different NLP areas.2
The history of character-based research in NLP is long and spans a broad array
of tasks. Here we make an attempt to categorize the literature of character-
level work into three classes based on the way they incorporate character-level
information into their computational models. The three classes we identified
are: tokenization-based models, bag-of-n-gram models and end-to-end
models [80]. However, mixtures are also possible, such as tokenization-based
bag-of-n-gram models or bag-of-n-gram models trained end-to-end.
On top of the categorization based on the underlying representation model,
we sub-categorize the work within each group into six abstract types of NLP
tasks (if possible) to be able to compare them more directly. These task types
are the following:
2
In our view, morpheme-based models are not true instances of character-level mod-
els as linguistically motivated morphological segmentation is an equivalent step to
tokenization, but on a different level. We therefore do not cover most work on mor-
phological segmentation in this paper.
Overview of Character-Based Models for Natural Language Processing
Fig. 3. Hierarchical RNN for part-of-speech tagging with character embeddings as input
Fig. 4. Hierarchical CNN + MLP for part-of-speech tagging with character embeddings
as input
Fig. 5. Hierarchical CNN + RNN for part-of-speech tagging with character embeddings
as input
their character-aware neural language model. They use convolution over char-
acter embeddings followed by a highway network [87] and feed its output into a
long short-term memory network that predicts the next word using a softmax
function.
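The first two stages of such an architecture, convolution over character embeddings followed by a highway layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited model: all weights are random and untrained, the sizes `DIM`, `WIDTH` and `N_FILTERS` are hypothetical, the nonlinearities (tanh in the convolution, ReLU in the highway transform) are choices made here, and the long short-term memory network and the softmax output are left out.

```python
import math
import random

rng = random.Random(0)            # fixed seed; weights are untrained placeholders
DIM, WIDTH, N_FILTERS = 4, 3, 5   # hypothetical sizes chosen for the sketch


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One random embedding vector per lowercase letter.
char_embedding = {c: [rng.uniform(-1, 1) for _ in range(DIM)]
                  for c in "abcdefghijklmnopqrstuvwxyz"}
# Each filter spans WIDTH consecutive characters.
filters = [rand_matrix(WIDTH, DIM) for _ in range(N_FILTERS)]


def char_cnn(word):
    """Convolve every filter over the word, then max-pool over positions."""
    vecs = [char_embedding[c] for c in word]
    vecs += [[0.0] * DIM] * max(0, WIDTH - len(vecs))  # zero-pad short words
    features = []
    for filt in filters:
        scores = []
        for start in range(len(vecs) - WIDTH + 1):
            window = vecs[start:start + WIDTH]
            total = sum(w * x for row, vec in zip(filt, window)
                        for w, x in zip(row, vec))
            scores.append(math.tanh(total))
        features.append(max(scores))
    return features


def highway(x, w_h, b_h, w_t, b_t):
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with gate t = sigmoid(W_t x + b_t)."""
    out = []
    for i, x_i in enumerate(x):
        pre_h = sum(w * v for w, v in zip(w_h[i], x)) + b_h[i]
        pre_t = sum(w * v for w, v in zip(w_t[i], x)) + b_t[i]
        h = max(0.0, pre_h)
        t = 1.0 / (1.0 + math.exp(-pre_t))
        out.append(t * h + (1.0 - t) * x_i)
    return out


w_h = rand_matrix(N_FILTERS, N_FILTERS)
w_t = rand_matrix(N_FILTERS, N_FILTERS)
b_h = [0.0] * N_FILTERS
b_t = [-2.0] * N_FILTERS          # negative bias favours carrying the input

features = char_cnn("budapest")
word_vector = highway(features, w_h, b_h, w_t, b_t)
```

In a full model, `word_vector` would be the per-word input to the recurrent network. The gate `t` lets each dimension either pass the convolutional feature through unchanged or replace it with a transformed value.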
Sequence Classification. Examples of tokenization-based models that per-
form sequence classification are CNNs used for sentiment classification [76] and
combinations of RNNs and CNNs used for language identification [41].
In the following, we explore a subset of bag-of-n-gram models that are used for
representation learning, information retrieval, and sequence classification tasks.
Representation Learning for Character Sequences. An early study in
this category of character-based models is [79]. Its goal is to create corpus-
based fixed-length distributed semantic representations for text. To train k-gram
embeddings, the top character k-grams are extracted from a corpus along with
their cooccurrence counts. Then, singular value decomposition (SVD) is used
to create low dimensional k-gram embeddings given their cooccurrence matrix.
To apply them to a piece of text, the k-grams of the text are extracted and
their corresponding embeddings are summed. The study evaluates the k-gram
embeddings in the context of word sense disambiguation.
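The steps described above (extract frequent character k-grams, count co-occurrences, reduce with SVD, sum the embeddings of a text's k-grams) can be sketched in plain Python. The details are assumptions made for illustration and are not taken from [79]: co-occurrence is counted within a window of two following k-grams, the toy corpus and all function names are hypothetical, embeddings are the singular vectors scaled by their singular values, and a simple power iteration stands in for a library SVD routine.

```python
from collections import Counter
import math


def char_kgrams(text, k=3):
    """All overlapping character k-grams of `text`."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def cooccurrence_matrix(corpus, vocab, k=3, window=2):
    """Symmetric counts of vocabulary k-grams occurring close to each other."""
    index = {g: i for i, g in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in vocab]
    for line in corpus:
        grams = [g for g in char_kgrams(line, k) if g in index]
        for pos, g in enumerate(grams):
            for other in grams[pos + 1:pos + 1 + window]:
                counts[index[g]][index[other]] += 1.0
                counts[index[other]][index[g]] += 1.0
    return counts


def truncated_svd(c, d, iters=300):
    """Top-d singular vectors and values of a symmetric matrix `c`,
    found by power iteration on c*c with deflation."""
    vectors, values = [], []
    for j in range(d):
        v = [1.0 + (i * 7 + j * 3) % 5 for i in range(len(c))]  # fixed start
        for _ in range(iters):
            v = mat_vec(c, mat_vec(c, v))
            for u in vectors:                 # remove directions already found
                proj = dot(u, v)
                v = [a - proj * b for a, b in zip(v, u)]
            norm = math.sqrt(dot(v, v))
            if norm < 1e-12:                  # matrix has rank below d
                return vectors, values
            v = [a / norm for a in v]
        cv = mat_vec(c, v)
        vectors.append(v)
        values.append(math.sqrt(dot(cv, cv)))
    return vectors, values


corpus = ["character based models are robust",
          "character n-gram models are simple",
          "word based models need tokenization"]
kgram_counts = Counter(g for line in corpus for g in char_kgrams(line))
vocab = [g for g, _ in kgram_counts.most_common(20)]      # top k-grams
vectors, values = truncated_svd(cooccurrence_matrix(corpus, vocab), d=2)
embeddings = {g: [vec[i] * s for vec, s in zip(vectors, values)]
              for i, g in enumerate(vocab)}


def embed_text(text):
    """Sum the embeddings of all known k-grams in `text`."""
    total = [0.0] * len(values)
    for g in char_kgrams(text):
        if g in embeddings:
            total = [t + e for t, e in zip(total, embeddings[g])]
    return total
```

Calling `embed_text("models")` returns a two-dimensional vector built only from character trigrams, so any string receives a representation as long as it contains at least one k-gram from the vocabulary.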
A more recent study [96] trains character n-gram embeddings in an end-
to-end fashion with a neural network. They are evaluated on word similarity,
sentence similarity and part-of-speech tagging.
Training character n-gram embeddings has also been proposed for biological
sequences [3,4] for a variety of bioinformatics tasks.
Information Retrieval. As mentioned before, character n-gram features are
widely used in the area of information retrieval [14,16,26,27,47,63].
Sequence Classification. Bag-of-n-gram models are used for language identifi-
cation [6,28], topic labeling [54], authorship attribution [70], word/text similarity
[10,30,96] and word sense disambiguation [79].
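As a rough illustration of a bag-of-n-gram classifier, the following sketch performs language identification by comparing character trigram counts. It is a toy under stated assumptions and does not reproduce any of the cited systems: the two one-sentence training texts, the cosine similarity scoring and the function names are all hypothetical choices.

```python
from collections import Counter
import math


def char_ngram_counts(text, n=3):
    """Bag of character n-grams; spaces mark the text boundaries."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    shared = set(a) & set(b)
    numerator = sum(a[g] * b[g] for g in shared)
    denominator = (math.sqrt(sum(v * v for v in a.values()))
                   * math.sqrt(sum(v * v for v in b.values())))
    return numerator / denominator if denominator else 0.0


# Toy training data: one sentence per language (an assumption for the sketch).
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: char_ngram_counts(text) for lang, text in training.items()}


def identify(text):
    """Return the language whose n-gram profile is most similar to `text`."""
    query = char_ngram_counts(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


print(identify("the dog and the fox"))
print(identify("der hund und die katze"))
```

No tokenizer is involved at any point: the classifier only sees overlapping character windows, which is why the same code can be applied to languages where word boundaries are hard to determine.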
3 Conclusion
References
1. Alex, B.: An unsupervised system for identifying English inclusions in German
text. In: Annual Meeting of the Association for Computational Linguistics (2005)
2. Andor, D., et al.: Globally normalized transition-based neural networks. In:
Annual Meeting of the Association for Computational Linguistics (2016)
3. Asgari, E., Mofrad, M.R.K.: Continuous distributed representation of biological
sequences for deep proteomics and genomics. PLoS ONE 10(11), 1–15 (2015)
4. Asgari, E., Mofrad, M.R.K.: Comparing fifty natural languages and twelve genetic
languages using word embedding language divergence (WELD) as a quantitative
measure of language distance. In: Workshop on Multilingual and Cross-lingual
Methods in NLP, pp. 65–74 (2016)
5. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end
attention-based large vocabulary speech recognition. In: IEEE International Con-
ference on Acoustics, Speech and Signal Processing, pp. 4945–4949 (2016)
6. Baldwin, T., Lui, M.: Language identification: the long and the short of the mat-
ter. In: Conference of the North American Chapter of the Association for Com-
putational Linguistics/Human Language Technologies, pp. 229–237 (2010)
7. Ballesteros, M., Dyer, C., Smith, N.A.: Improved transition-based parsing by
modeling characters instead of words with LSTMs. In: Conference on Empirical
Methods in Natural Language Processing (2015)
8. Bilmes, J., Kirchhoff, K.: Factored language models and generalized parallel back-
off. In: Conference of the North American Chapter of the Association for Com-
putational Linguistics/Human Language Technologies (2003)
9. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion.
Speech Commun. 50(5), 434–451 (2008)
10. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. Transactions of the Association for Computational Linguis-
tics (2017)
11. Bojanowski, P., Joulin, A., Mikolov, T.: Alternative structures for character-level
RNNs. In: Workshop at International Conference on Learning Representations
(2016)
12. Botha, J.A., Blunsom, P.: Compositional morphology for word representations
and language modelling. In: International Conference on Machine Learning (2014)
13. Cao, K., Rei, M.: A joint model for word embedding and word morphology.
In: Annual Meeting of the Association for Computational Linguistics, pp. 18–
26 (2016)
14. Cavnar, W.: Using an n-gram-based document representation with a vector
processing retrieval model. NIST Special Publication SP, pp. 269–269 (1995)
15. Chan, W., Jaitly, N., Le, Q.V., Vinyals, O.: Listen, attend and spell: A neural
network for large vocabulary conversational speech recognition. In: IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing, pp. 4960–4964
(2016)
16. Chen, A., He, J., Xu, L., Gey, F.C., Meggs, J.: Chinese text retrieval without
using a dictionary. ACM SIGIR Forum 31(SI), 42–49 (1997)
17. Chen, X., Xu, L., Liu, Z., Sun, M., Luan, H.: Joint learning of character and
word embeddings. In: International Joint Conference on Artificial Intelligence,
pp. 1236–1242 (2015)
18. Chiu, J.P.C., Nichols, E.: Named entity recognition with bidirectional
LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016)
dangerous from the point of view of party expediency are tolerated. If
they are tolerated greater expenditures of money and of other party
resources must be made when the final accounting with public
sentiment takes place. To put the matter in another way: the forms of
political corruption earlier described, i.e., corruption in connection
with the regulation of business and of vice and corruption in
connection with the buying and selling operations of the state, are for
the most part sources of income, whereas corruption in the form of
political control is mainly expenditure. Under George III., according
to Mr. Dorman B. Eaton, a “Patronage Secretary of the Treasury”
was appointed
“whose duty it was to stand between members and partisan managers
appealing for places for their favourites, on the one side, and the
heads of offices who needed to have these places filled with
competent persons, on the other side. This Secretary measured the
force of threats and took the weight of influence; he computed the
political value of a member’s support and deducted from it the official
appraisement of patronage before awarded to him. It is said that
actual accounts, Dr. and Cr., were kept with members by this
Patronage Secretary.”[66]
Whether or not “accounts Dr. and Cr.” are kept by our political
organisations, a calculus of essentially the same character must
underlie the determination of their policy.
On the spending, or political control, side of their ledgers the
various heads are comparatively simple. Office holders must be kept
in line, and to this end patronage, promotion, and the control of
primaries are important. The direct use of money for bribes may play
only a small part in this process; opportunities for auto-corruption
may be left open in special cases, but personal and party loyalty and
ambition can be relied on to a large extent. Back of the office holders
of the hour, however, there are the constantly recurring necessities
of election day. Party organisation must be kept up continuously,
involving the reward in some way of swarms of assistants and
hangers-on who cannot all be remunerated directly at public
expense. At times votes must be bought, and repeaters, thugs, and
ballot-box stuffers must be paid for their services. A heavy toll is apt
to be taken out of the funds used for such purposes by every hand
through which they pass on their way down. In addition to the
expenditures already noted there are many other occasions, some of
them quite legitimate in character, and others unobjectionable or
even laudable, for the lavish use of money to secure party success
and party control.
The situation which has just been described is so common that the
only justification for repeating its description here is the necessity of
completing an outline the other parts and interrelations of which are
somewhat more obscure. In the gradual awakening of the American
people to corrupt conditions existing in their government the first
evils clearly seen were the abuses of the patronage and the
defilement of the ballot-box. Civil service reform and corrupt
practices acts (the latter term seems lamentably narrow in its original
usage to the present somewhat more sophisticated generation) were
the result. Later the presence of purveyors of vice immediately
behind much of the prevailing electoral corruption was clearly
discerned, and the battle on that score is still being waged. It is
beyond question that our present local option movement is directed
against the saloon not so much because it is a place where
intoxicating liquor is sold, as because it is a political centre which did
not know how to be moderate in its exercise of power during the
days of its ascendancy. Still later the more secret relationship
between grafting business and political corruption was laid bare.
Renewed determination to impose the necessary measures of state
regulation and, more specifically, the campaign contribution issue
were the results. The problems presented by corrupt practices in
connection with political control are still far from adequate solution.
Reforms already achieved in the right direction, and still more the
determination to press for further reforms, are the most hopeful
features of the present situation. In our national government, for
example, the civil service movement has reached a gratifyingly high
development, but it still needs much extension and strengthening in
our states and cities. We have some stringent legislation against
ballot-box crimes, but, an election once settled, our tolerance on this
subject is amazing and deplorable. Every act which simplifies our
governmental machinery, which places responsibility squarely upon
a few shoulders and provides means for enforcing it, which shortens
our cumbersome ballots, which makes the primary accessible to
independent voters, will help in the solution of the problem of honest
party control.
Without undertaking a summary of the argument on the various
forms of business and political corruption the same point may be
made with regard to them that was made with reference to the
corruption in the professions, journalism, and the higher education,—
namely that the major forms of evil are recognised and savagely
criticised. To an even greater extent legislative action has been
secured against the primarily political forms of corruption. The fight
for the regulation of business is the great unsolved problem of our
time, but so far as it is successful we may expect not only more
honest business practices but also a favourable reaction upon
political life. A great many means may be brought to bear to secure
honesty in the buying and selling operations of the state and to
prevent the corrupt toleration of vice. Their success will mean that
the corrupt political manager will find himself deprived of some of his
most lucrative sources of income. A strong impression prevails at the
present time that corruption funds in general are much smaller in
amount than a few years ago. In part this is perhaps due to a change
of heart, in part to the fear, intensified by recent events, of exposure.
Perhaps, however, it is still more largely the result of a conviction
that the “goods” could not, or would not, in the present state of public
opinion, be delivered by the politicians. It is evident that the more
successful we are in thus drying up the income sources of venal
political organisations the smaller will be the resources available in
their hands for the extension and perpetuation of their power of
control.
FOOTNOTES:
[53] “Sin and Society,” p. 78.
[54] “Back to Beginnings,” Commencement Address, Oberlin
College, June 28, 1905.
[55] “The Nature of Political Corruption,” p. 46, supra.
[56] Limitation of the scope of this study to the internal forms of
corruption makes it impossible to discuss this very interesting
topic. It may be noted, however, that in international cases certain
peculiarities occur regarding the personal element of corruption.
When the military secrets of one government are purchased by
another, the faithless official of the former who makes the sale is,
of course, corrupt in the highest degree. What shall be said of the
nation making the purchase? Personal interest on its side is
merged in the collective interest of a commonwealth numbering
millions of inhabitants it may be. The case is not entirely unlike
those in which group interest rather than self interest impels to
corrupt action (see p. 65), except that in the latter the groups are
subordinate and not sovereign. If, however, the state which buys
the secrets of another government runs counter to international
law or morality in so doing, it may be held to be pursuing a
relatively narrow interest regardless of the broader interest of
humanity as a whole. From this point of view the state which uses
money for such ends is guilty of corruption although, of course, it
is a highly socialised form of corruption.
[57] “Law and Opinion in England,” p. 216.
[58] Cf. H. C. Adams, “Public Debts,” pt. iii, ch. iv, for a very
able discussion of the influence of the commercial spirit on public
officials.
[59] Cf. sec. vii, “Die Organisation als
Klassenerhöhungsmaschine,” in Robert Michels’s very thorough
and illuminating study of the organisation of the German social-
democracy. Archiv f. Sozialwissenschaft und Sozialpolitik, Bd.
xxiii, Heft 2 (September, 1906).
[60] For an overwhelmingly convincing presentation of
materials on this point cf. the “Digest of Report by the Bureau of
Municipal Research on the Administration of the Water Revenues,
Manhattan,”—Efficient Citizenship Leaflet, no. 145. Corruption of
this sneaking sort resembles tax dodging in that it is so largely
indulged in by otherwise respectable people. Cf. p. 192.
[61] “Enough Money to Uplift the World,” p. 6, by William H.
Allen, Director, Bureau of Municipal Research, reprinted as a
pamphlet by the Bureau from the World’s Work of May, 1909.
[62] Cf. p. 5, supra, for a discussion of the consequences of the
corrupt protection of vice and crime.
[63] Cf. p. 61, supra.
[64] Cf. especially No. 3 of the Taxation Series published by the
Board, entitled, “Cincinnati an Independent Assessment District,”
by Allen Ripley Foote.
[65] The assumption is not extreme. In the pamphlet referred to
it is held that by the various means proposed, Cincinnati’s (then)
rate of 2.96 per cent. might be reduced to 0.75 per cent. “When
the real estate of the state of Kansas was revalued by the Tax
Commission,” according to Mr. Foote, “the valuation was
increased 484 per cent.” Of course real increase of property
values through considerable periods of time accounts in part for
such totals whenever assessment periods are a number of years
apart.
[66] “Civil Service in Great Britain,” p. 154.
VI
CAMPAIGN CONTRIBUTIONS AND THE THEORY OF PARTY SUPPORT