PII: S0885-2308(16)30396-5
DOI: 10.1016/j.csl.2017.03.001
Reference: YCSLA 839
Please cite this article as: Marta R. Costa-jussà, Alexandre Allauzen, Loïc Barrault, Kyunghyun Cho, Holger Schwenk, Introduction to the Special Issue on Deep Learning Approaches for Machine Translation, Computer Speech & Language (2017), doi: 10.1016/j.csl.2017.03.001
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
Highlights

• It covers the main approach of neural machine translation in detail.
• It covers the main research contributions of the papers included in the special issue.
Introduction to the Special Issue on Deep Learning Approaches for Machine Translation

Marta R. Costa-jussà1, Alexandre Allauzen2, Loïc Barrault3, Kyunghyun Cho4, Holger Schwenk5

1 TALP Research Center, Universitat Politècnica de Catalunya
2 LIMSI, CNRS, Université Paris-Sud, Université Paris-Saclay
3 LIUM, University of Le Mans
4 Courant Institute of Mathematical Sciences and Center for Data Science, New York University
5 Facebook Artificial Intelligence Research

Abstract
Deep learning is revolutionizing speech and natural language technologies because it offers an effective way to train systems and obtain significant improvements. The main advantage of deep learning is that, given the right architecture, the system automatically learns features from data without the need to design them explicitly. This machine learning perspective is conceptually changing how speech and natural language technologies are addressed.

In the case of Machine Translation (MT), deep learning was first introduced in standard statistical systems. By now, end-to-end neural MT systems have reached competitive results. This introductory paper to the special issue describes how deep learning has been gradually introduced in MT. The introduction covers the topics addressed by the papers included in this special issue, which basically are: the integration of deep learning in statistical MT; the development of end-to-end neural MT systems; and the application of deep learning to interactive MT and MT evaluation.
1. Introduction
Considered one of the major advances in machine learning, deep learning has recently been applied with success to many areas, including Natural Language Processing, Speech Recognition and Image Processing. Deep learning techniques have surprised the entire community, both academia and industry, with their powerful ability to learn complex tasks from data.
Recently introduced to Machine Translation (MT), deep learning was first considered as a new kind of feature, integrated in standard statistical approaches [1]. Deep learning has been shown useful in translation and language modeling [2, 3, 4, 5] as well as in reordering [6], tuning [7] and rescoring [8]. Additionally, deep learning has recently been used to build entire end-to-end neural MT systems. This recent line of research opens new research perspectives and sketches new MT challenges, for instance dealing with: large vocabularies [18, 19, 20]; multimodal translation [21]; and the high computational cost, which implies new issues for training and decoding.
Interest in this field is also reflected by dedicated tutorials and courses1,2, and the number of publications on this topic in top conferences (e.g. ACL3 or EMNLP4) has dramatically increased in the last few years. The main goal of this pioneering special issue is to gather articles that give the reader a global vision, insight and understanding of the limits, challenges and impact of deep learning for MT. This special issue contains high-quality submissions in the following topic categories:
1 http://naacl.org/naacl-hlt-2015/tutorial-deep-learning.html
2 http://dl4mt.computing.dcu.ie/
3 http://acl2016.org/
4 http://www.emnlp2016.net/
• Using deep learning in statistical MT

• Neural MT

• Interactive Neural MT

• MT Evaluation enhanced with deep learning techniques
The rest of the paper is organized as follows. Section 2 briefly describes the main current alternatives for building a neural MT system. Section 3 overviews the papers in this special issue, ordered by the categories listed above. Finally, Section 4 discusses the main research perspectives on applying deep learning to MT.
2. The neural machine translation approach

In the encoder-decoder framework, many alternatives are possible for designing the encoder. A first approach is to use a simple recurrent neural network (RNN) to encode the input sequence [13]. However, compressing a sequence into a fixed-size vector appears too reductive to preserve source-side information. Newer systems therefore use a bidirectional RNN: source sequences are encoded into annotations by concatenating the two representations obtained with a forward and a backward RNN, respectively. In this case, each annotation vector contains information from the entire source sequence while focusing on a particular word [22]. An attention mechanism, implemented by a feed-forward neural network, is then used to attend to specific parts of the input and to generate an alignment between the input and output sequences. An alternative to the biRNN encoder is the stacked Long Short-Term Memory (LSTM) [23], as presented in [12, 24].
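To make the bidirectional encoding and the feed-forward attention concrete, the following is a minimal PyTorch sketch of one attention step over the annotations of a toy source sentence. It is an illustration of the general idea only, not the configuration of any of the cited systems; all layer sizes, class and variable names are assumptions.

```python
# Minimal sketch: bidirectional-RNN encoder with feed-forward attention.
import torch
import torch.nn as nn


class AttentiveEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Forward and backward RNNs; their states are concatenated into annotations.
        self.birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        # Feed-forward attention: scores each annotation against the decoder state.
        self.att_W = nn.Linear(2 * hid_dim, hid_dim)
        self.att_U = nn.Linear(hid_dim, hid_dim)
        self.att_v = nn.Linear(hid_dim, 1)

    def annotations(self, src_ids):
        # src_ids: (batch, src_len) -> annotations: (batch, src_len, 2*hid_dim)
        h, _ = self.birnn(self.embed(src_ids))
        return h

    def attend(self, annotations, dec_state):
        # dec_state: (batch, hid_dim); returns the context vector and soft alignment.
        scores = self.att_v(torch.tanh(
            self.att_W(annotations) + self.att_U(dec_state).unsqueeze(1)))
        alignment = torch.softmax(scores, dim=1)          # (batch, src_len, 1)
        context = (alignment * annotations).sum(dim=1)    # (batch, 2*hid_dim)
        return context, alignment


# Usage: one attention step over a toy source sentence of 5 token ids.
enc = AttentiveEncoder()
src = torch.randint(0, 1000, (1, 5))
ann = enc.annotations(src)
ctx, align = enc.attend(ann, torch.zeros(1, 128))
print(ctx.shape, align.squeeze(-1))
```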
(Figure: the encoder-decoder architecture with an attention mechanism, showing the source sentence, the decoder, and the softmax normalization at the output.)

A major problem with neural MT is the large softmax normalization at the output layer, whose cost depends on the target vocabulary size. Much research has been done to address this problem, such as performing the softmax on a subset of the outputs only [25], using a structured output layer [26], or self-normalization [27].
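As a brief illustration of the self-normalization idea mentioned above, the sketch below augments the training loss with a penalty that pushes the log-partition term towards zero, so unnormalized scores can be used at decoding time. This is a simplified sketch of the general technique of [27], not its exact implementation; the model, data and the value of alpha are assumptions.

```python
# Minimal sketch of a self-normalized training loss.
import torch
import torch.nn as nn

vocab, hidden, alpha = 50, 16, 0.1
proj = nn.Linear(hidden, vocab)               # output projection producing logits

def self_normalized_loss(hidden_states, targets):
    logits = proj(hidden_states)              # (batch, vocab)
    log_z = torch.logsumexp(logits, dim=-1)   # log-partition term per example
    nll = nn.functional.cross_entropy(logits, targets)
    return nll + alpha * (log_z ** 2).mean()  # penalize deviation of log Z from 0

h = torch.randn(4, hidden)
y = torch.randint(0, vocab, (4,))
print(self_normalized_loss(h, y))
```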
Another possibility is to perform translation at the subword level. This also has the advantage of allowing the generation of out-of-vocabulary words. Character-level machine translation has been presented in several papers [28, 29, 20]. Byte Pair Encoding (BPE) is a broadly used technique that performs very well on many language pairs [30].
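To illustrate the general idea behind BPE, the following toy sketch learns a few merge operations by repeatedly fusing the most frequent adjacent symbol pair. It follows the spirit of [30] but the corpus, merge count and function names are illustrative assumptions, not the cited setup.

```python
# Minimal sketch of learning BPE merges on a toy word-frequency dictionary.
import re
from collections import Counter


def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


# Toy vocabulary: words split into characters with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for step in range(6):
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```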
Papers in this special issue grouped by category (partial view):

Category: Using deep learning in Statistical MT
  - Source Sentence Simplification for Statistical MT, by Hasler et al.
  - Domain Adaptation Using Joint Neural Network Models, by Joty et al.
Category: Neural MT
  - Context-Dependent Word Representation for Neural MT, by Choi et al.
  - Multi-Way, Multilingual Neural MT, by Firat et al.
3. Overview of the papers in this special issue

This section summarises the papers in this special issue, covering the main idea and contribution of each one. Papers are organised in four categories: using
deep learning in statistical MT, neural MT, interactive neural MT and MT evaluation
with deep learning techniques.
3.1. Using deep learning in statistical MT

One of the first approaches to integrating neural networks and deep learning into MT was through rescoring the n-best lists of statistical MT systems [31, 2]. Given that statistical MT provides state-of-the-art results and deep learning helps in finding the right set of weights for statistical features, the scientific community is still pursuing research in this direction. In what follows, we summarise the main research contributions of the two papers in this special issue that use deep learning to improve statistical MT.
Source Sentence Simplification for Statistical Machine Translation by Eva Hasler, Adrià de Gispert, Felix Stahlberg, Aurelien Waite and Bill Byrne. Long sentences are a major challenge for MT in general. This paper uses text simplification to help hierarchical MT decoding with neural rescoring. The authors combine the full input sentence with a simplified version of the same sentence. Simplification of the input sentence is done by deleting the most redundant words in the sentence. The first step translates the simplified sentence to obtain an n-best list of translation candidates. The second step uses the n-best list to guide the decoding of the full input sentence.

The main contribution of the work is the procedure for integrating source sentence simplification into hierarchical MT decoding with neural rescoring. This contribution is interesting for all types of MT; therefore, an interesting follow-up to this paper would be to use source sentence simplification directly in a neural MT system.
Domain Adaptation Using Joint Neural Network Models by Shafiq Joty, Nadir Durrani, Hassan Sajjad and Ahmed Abdelali. Domain adaptation is still a challenge for MT systems in general, since parallel data is a scarce resource with respect to the variety of text types and genres. Neural translation models, such as joint models, have shown an improved adaptation ability thanks to their continuous representations. This paper investigates different ways to adapt this kind of model. Data selection and mixture modeling are the starting point of this work. The authors then propose a neural model to better estimate model weighting and instance selection in a neural framework. For instance, they introduce a pairwise model that minimizes the cross-entropy by regularizing the loss function with respect to an in-domain model. Experiments on the TED talks (Arabic-to-English) task show promising results.
3.2. Neural Machine Translation
Since the seminal work on neural MT [11, 12, 13], the encoder-decoder architecture has quickly emerged as an efficient solution, yielding state-of-the-art performance on several translation tasks. Beyond these important results, this kind of architecture renews the perspective of a multilingual approach to MT, but it also has some limitations. For instance, using source context information, dealing with highly multilingual frameworks and leveraging the abundant monolingual data remain difficult challenges.
Context-Dependent Word Representation for Neural Machine Translation by Heeyoul Choi, Kyunghyun Cho and Yoshua Bengio deals with two major problems in MT, namely word sense disambiguation (contextualization) and symbolization, which aims at solving the rare word problem. Contextualization is performed by masking out some dimensions of the target word embedding vectors (feedback) based on the input context, i.e. the average of the nonlinearly transformed source word embeddings. Symbolization is performed by introducing position-dependent special tokens to deal with digits, proper nouns and acronyms.
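The toy sketch below illustrates the general idea of symbolization: rare items such as digits are replaced by position-dependent placeholder tokens before translation and restored afterwards. It is a simplified illustration, not the exact scheme of Choi et al.; the token names and the restriction to digits are assumptions.

```python
# Toy symbolization: replace digits with position-dependent placeholders.
import re

def symbolize(tokens):
    mapping, out = {}, []
    for tok in tokens:
        if re.fullmatch(r"\d+", tok):                 # treat digits as rare symbols
            placeholder = f"<NUM_{len(mapping)}>"     # position-dependent token
            mapping[placeholder] = tok
            out.append(placeholder)
        else:
            out.append(tok)
    return out, mapping

def desymbolize(tokens, mapping):
    return [mapping.get(tok, tok) for tok in tokens]

src, mapping = symbolize("the meeting is on 12 May at 9".split())
print(src)                       # 12 -> <NUM_0>, 9 -> <NUM_1>
print(desymbolize(src, mapping)) # placeholders restored to the original digits
```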
Multi-Way, Multilingual Neural Machine Translation by Orhan Firat, Kyunghyun Cho and Yoshua Bengio proposes a single neural MT model that translates between multiple languages using one attention mechanism shared across all language pairs. The main achievement of this paper is that adding a language to the system increases the number of parameters only linearly, sharing the advantages of interlingua methods. The approach is tested on 8 language pairs (including linguistically similar and dissimilar pairs, as well as high- and low-resource pairs). The approach improves over a strong statistical MT system on low-resource language pairs, and achieves similar performance on the other pairs.

The shared attention mechanism is the main contribution of this paper compared to previous work on multilingual neural MT. This contribution is especially helpful when the number of language pairs increases dramatically (e.g. highly multilingual contexts like the European Union) and/or for low-resource language pairs.
On Integrating a Language Model into Neural Machine Translation by Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho and Yoshua Bengio. Neural MT training relies on the availability of parallel corpora, and for several language pairs this kind of resource is scarce. While conventional MT systems can leverage the abundant amount of monolingual data by means of target language models, neural MT systems are limited in their ability to benefit from this kind of resource. This paper explores two strategies to integrate recurrent neural language models into neural MT: shallow fusion simply combines the scores of the neural MT and target language models, while deep fusion combines the hidden states of both models. Experimental results show promising improvements in translation quality, compared to state-of-the-art MT systems, for both low- and high-resource language pairs, thanks to the use of monolingual resources.
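As a concrete illustration of the shallow-fusion idea, the sketch below combines the next-token scores of a translation model and a target-side language model log-linearly at one decoding step. It is a toy illustration of the general strategy, not the cited implementation; the weight beta and the probability tables are assumptions.

```python
# Minimal sketch of shallow fusion at one decoding step.
import math

def shallow_fusion_score(tm_probs, lm_probs, beta=0.3):
    """Log-linear combination of MT-model and language-model token probabilities."""
    return {tok: math.log(tm_probs[tok]) + beta * math.log(lm_probs[tok])
            for tok in tm_probs}

# Toy distributions over three candidate target tokens.
tm_probs = {"house": 0.6, "home": 0.3, "flat": 0.1}
lm_probs = {"house": 0.2, "home": 0.7, "flat": 0.1}
scores = shallow_fusion_score(tm_probs, lm_probs)
print(scores, "->", max(scores, key=scores.get))
```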
3.3. Interactive neural MT

Despite the promising results achieved by MT in the last decades, this technology is still error-prone for some domains and applications. Interactive MT is, in such cases, an appealing solution in which a human translator and the MT system collaborate to produce the final translation; integrating neural MT into this interactive framework raises new challenges.
Interactive Neural Machine Translation by Alvaro Peris, Miguel Domingo and Francisco Casacuberta investigates the integration of a neural MT system into an interactive setting. The authors propose a new interactive protocol which allows the user an enhanced and more efficient interaction with the system. First, a neural MT system is adapted to fit the prefix-based interactive scenario. In this conventional approach, the user corrects the translation hypothesis by reading it from left to right, creating a translation prefix that the MT system completes with a new hypothesis. This scenario is then extended by exploiting the peculiarities of neural MT systems: the user can validate word segments and the neural MT system fills the gaps by generating a new hypothesis. For both scenarios, a tailored decoding strategy is proposed. Simulated experiments are carried out on four different translation tasks (user manuals, medical texts and TED talk translations) involving four language pairs. The results show a significant reduction of the human effort.
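To illustrate the prefix-based scenario described above, the sketch below forces the tokens of a user-validated prefix and lets a model greedily complete the rest of the hypothesis. It is a toy illustration, not the decoding strategy of the paper; the `next_token_logprobs` interface and the canned model are assumptions.

```python
# Minimal sketch of prefix-constrained greedy completion.
from typing import Callable, Dict, List

def complete_prefix(next_token_logprobs: Callable[[List[str]], Dict[str, float]],
                    prefix: List[str], max_len: int = 20,
                    eos: str = "</s>") -> List[str]:
    hyp = list(prefix)                      # forced, user-validated tokens
    while len(hyp) < max_len:
        probs = next_token_logprobs(hyp)    # distribution over the next target token
        tok = max(probs, key=probs.get)     # greedy choice for the suffix
        if tok == eos:
            break
        hyp.append(tok)
    return hyp

# Toy "model": always prefers to finish the sentence with "the cat sat ."
def toy_model(hyp):
    canned = ["the", "cat", "sat", ".", "</s>"]
    nxt = canned[min(len(hyp), len(canned) - 1)]
    return {nxt: 0.9, "<unk>": 0.1}

print(complete_prefix(toy_model, prefix=["the", "cat"]))  # ['the', 'cat', 'sat', '.']
```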
3.4. MT evaluation with deep learning techniques

Evaluating translation quality is a long-standing scientific challenge which has generated a lot of debate within the scientific community. For MT system development, the goal is to define an automatic metric that can both rank different approaches to measure progress and provide a replicable measure of quality.

The paper on MT evaluation in this issue, by Francisco Guzmán, Shafiq Joty, Lluís Màrquez and Preslav Nakov, addresses this problem. Given a reference translation, the goal is to select the best translation from a pair of hypotheses. The paper proposes a neural architecture able to represent, in distributed vectors, the lexical, syntactic and semantic properties of the reference and the two hypotheses. The experimental setup relies on the WMT metrics shared task, and the new flexible model correlates highly with human judgments. Additional contributions include task-oriented tuning of the embeddings and sentence-based semantic representations.
4. Research perspectives
Neural MT is a very recent line of work which has already shown great results on many translation tasks. The community, however, lacks hindsight about how research in the area will evolve in the upcoming years. In comparison, more than ten years were necessary to establish the phrase-based approach as the widespread, robust and intensively tuned solution for MT. Neural MT questions this status by providing a new and unified framework which, to some extent, renders obsolete the interdependent components of statistical MT systems (word alignments, reordering models, phrase extraction). It is worth noting that we are only at the beginning and that neural MT opens a wide range of research perspectives.
For instance, the conventional training criterion suffers from several limitations that can be addressed within a discriminative framework [32] or with a learning-to-rank strategy [33]. Neural MT also suffers from the vocabulary limitation issue which is well known in the field of NLP. The complexity associated with a large output vocabulary can be reduced by combining word and character representations [34] or by using subword units [18].
Moreover, neural MT systems provide a very promising framework for learning continuous representations of textual data. This constitutes an important step, moving from the word to the sentence level. Along with the introduction of attention-based models, these peculiarities renew how the notion of context can be considered within the translation process. This could allow the model to take into account, for instance: a longer context, enabling document or discourse translation; a multimodal context when translating image captions; or a social anchor to deal with different writing styles. In the seminal paper on statistical machine translation [35], the authors set out the limits of the
approach, considering that: "in its most highly developed form, translation involves a careful study of the original text and may even encompass a detailed analysis of the author's life and circumstances. We, of course, do not hope to reach these pinnacles of the translator's art". While this is still valid today, neural MT creates a real opportunity to extend the application field of machine translation in many aspects, beyond "just" challenging state-of-the-art performance.
Acknowledgements
The work of the first author is supported by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund, through the postdoctoral senior grant Ramón y Cajal and the contract TEC2015-69266-P (MINECO/FEDER, UE). The fourth author thanks Facebook, Google (Google Faculty Award 2016) and NVIDIA (GPU Center of Excellence 2015-2016) for their support.
References
[1] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, 2003, pp. 48–54.
[3] H.-S. Le, A. Allauzen, F. Yvon, Continuous space translation models with neural networks, in: Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Association for Computational Linguistics, Montréal, Canada, 2012, pp. 39–48.
[4] A. Vaswani, Y. Zhao, V. Fossum, D. Chiang, Decoding with large-scale neural language models improves translation, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, A meeting of SIGDAT, a Special Interest Group of the ACL, 2013, pp. 1387–1392.
[5] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, Fast and robust neural network joint models for statistical machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, Maryland, 2014, pp. 1370–1380.
[6] P. Li, Y. Liu, M. Sun, T. Izuha, D. Zhang, A neural reordering model for phrase-based translation, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 1897–1907.
[7] L. Shen, A. Sarkar, F. J. Och, Discriminative reranking for machine translation, in: S. Dumais, D. Marcu, S. Roukos (Eds.), HLT-NAACL 2004: Main Proceedings, 2004.
[8] Z. Li, S. Khudanpur, Forest reranking for machine translation with the perceptron algorithm, in: GALE book chapter on "MT from text", 2009.
[9] R. Gupta, C. Orasan, J. van Genabith, Machine translation evaluation using recurrent neural networks, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015.
[12] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 3104–3112.
[13] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1724–1734.
[15] R. Sennrich, B. Haddow, A. Birch, Edinburgh neural machine translation systems for WMT 16, in: Proceedings of the First Conference on Machine Translation, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 371–376.
[18] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with
subword units, CoRR abs/1508.07909.
[19] M. R. Costa-jussà, J. A. R. Fonollosa, Character-based neural machine translation, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 357–361.
[20] J. Lee, K. Cho, T. Hofmann, Fully character-level neural machine translation without explicit segmentation, CoRR abs/1610.03017.
[21] D. Elliott, S. Frank, E. Hasler, Multi-language image description with neural sequence models, CoRR abs/1510.04709.
[24] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, CoRR abs/1609.08144.
[25] S. Jean, K. Cho, R. Memisevic, Y. Bengio, On using very large target vocabulary for neural machine translation, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Beijing, China, 2015.

[26] H.-S. Le, I. Oparin, A. Messaoudi, A. Allauzen, J.-L. Gauvain, F. Yvon, Large vocabulary SOUL neural network language models, in: INTERSPEECH, 2011.
[27] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, Fast and robust
neural network joint models for statistical machine translation, in: Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics,
2014, pp. 1370–1380.
[28] W. Ling, I. Trancoso, C. Dyer, A. W. Black, Character-based neural machine translation, CoRR abs/1511.04586.
[29] M. R. Costa-Jussà, J. A. R. Fonollosa, Character-based neural machine transla-
tion, CoRR abs/1603.00810.
[30] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, CoRR abs/1508.07909.
[31] H. Schwenk, M. R. Costa-jussà, J. A. R. Fonollosa, Continuous space language models for the IWSLT 2006 task, in: 2006 International Workshop on Spoken Language Translation, IWSLT, Keihanna Science City, Kyoto, Japan, November 27-28, 2006, pp. 166–173.
[32] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, Y. Liu, Minimum risk training for neural machine translation, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 2016.
[34] Y. Miyamoto, K. Cho, Gated word-character recurrent language model, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016.