DOI: 10.1016/j.csl.2017.03.001

Introduction to the Special Issue on Deep Learning Approaches for Machine Translation

Marta R. Costa-jussà, Alexandre Allauzen, Loïc Barrault, Kyunghyun Cho, Holger Schwenk

• It covers an introduction to the field of using Deep Learning in Machine Translation.

• It covers the main approach of neural machine translation in detail.

• It covers the main research contributions of the papers included in the special issue.


Marta R. Costa-jussà¹, Alexandre Allauzen², Loïc Barrault³, Kyunghyun Cho⁴ and Holger Schwenk⁵

¹ TALP Research Center, Universitat Politècnica de Catalunya
² LIMSI, CNRS, Université Paris-Sud, Université Paris-Saclay
³ LIUM, University of Le Mans
⁴ Courant Institute of Mathematical Sciences and Center for Data Science, New York University
⁵ Facebook Artificial Intelligence Research

Deep learning is revolutionizing speech and natural language technologies because it
offers an effective way to train systems and obtain significant improvements. The
main advantage of deep learning is that, given the right architecture, the system
automatically learns features from data without the need to design them explicitly.
This machine learning perspective is conceptually changing how speech and natural
language technologies are addressed.

In the case of Machine Translation (MT), deep learning was first introduced in standard
statistical systems. By now, end-to-end neural MT systems have reached competitive
results. This introductory paper to the special issue describes how deep learning has
been gradually introduced into MT. It covers all the topics addressed by the
papers in this special issue, namely: integration of deep learning
into statistical MT; development of the end-to-end neural MT system; and introduction of
deep learning into interactive MT and MT evaluation. Finally, it sketches
some research directions that MT is taking, guided by deep learning.
Keywords: Machine Translation, Deep learning

1. Introduction

Considered one of the major advances in machine learning, deep learning has
recently been applied with success to many areas, including Natural Language
Processing, Speech Recognition and Image Processing. Deep learning techniques have
surprised the entire community, both academia and industry, with their powerful ability to
learn complex tasks from data.

Recently introduced to Machine Translation (MT), deep learning was first considered
as a new kind of feature, integrated in standard statistical approaches [1]. Deep
learning has proven useful in translation and language modeling [2, 3, 4, 5] as well
as in reordering [6], tuning [7] and rescoring [8]. Additionally, deep learning has been
applied to MT evaluation [9] and quality estimation [10].

In the last couple of years, a new paradigm has emerged: neural MT
[11, 12, 13]. This new paradigm has yielded outstanding results, improving on
state-of-the-art results for several language pairs [14, 15, 16]. This approach uses an
encoder-decoder architecture, along with an attention-based model [17], to build an end-to-end
neural MT system. This recent line of research opens new research perspectives and
raises new MT challenges, for instance dealing with: large vocabularies [18, 19, 20];
multimodal translation [21]; and the high computational cost, which raises new issues for
large-scale optimization [16].

This hot topic is attracting interest from the scientific community and, in response,
several related events have been held (e.g. a tutorial and a winter school). Moreover, the
number of publications on this topic in top conferences (e.g. ACL or EMNLP) has
increased dramatically in recent years. The main goal of this pioneering special issue is
to gather articles that give the reader a global vision, insight and understanding
of the limits, challenges and impact of deep learning. This special issue contains high-quality
submissions in the following topic categories:

• Using deep learning in statistical MT

• Neural MT

• Interactive Neural MT

• MT Evaluation enhanced with deep learning techniques

The rest of the paper is organized as follows. Section 2 briefly describes the main
current alternatives for building a neural MT system. Section 3 gives an overview of the papers in
this special issue, ordered by the categories listed above. Finally, Section 4
discusses the main research perspectives on applying deep learning to MT.

2. Neural MT Brief Description

Most neural MT architectures are based on the so-called encoder-decoder approach,
where an input sequence in the source language is projected into a low-dimensional
space from which the output sequence in the target language is generated [12].

Many alternatives are then possible for designing the encoder. A first approach
is to use a simple recurrent neural network (RNN) to encode the input sequence [13].
However, compressing a sequence into a fixed-size vector appears to be too reductive
to preserve source-side information. New systems were therefore developed using
bidirectional RNNs (biRNNs). Source sequences are encoded into annotations by concatenating
the two representations obtained with a forward and a backward RNN, respectively.
In this case, each annotation vector contains information from the entire source sequence
but focuses on a particular word [22]. An attention mechanism implemented
by a feed-forward neural network is then used to attend to specific parts of the input and
to generate an alignment between the input and output sequences. An alternative to the
biRNN encoder is the stacked Long Short-Term Memory (LSTM) [23], as presented in
[12, 24].
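
As an illustration of the attention step described above, the following is a minimal sketch in plain Python, assuming an additive scoring function computed by a one-hidden-layer feed-forward network. The names `attend`, `W_a`, `U_a` and `v_a`, the dimensions and the random values are assumptions made for this example only; the random vectors stand in for the forward and backward RNN states and are not taken from any of the cited systems.

```python
import math
import random


def softmax(scores):
    """Normalize a list of scores into a probability distribution."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(annotations, decoder_state, W_a, U_a, v_a):
    """Score each annotation against the decoder state with a small
    feed-forward network, then return the alignment weights and the
    weighted average of the annotations (the context vector)."""
    projected_state = matvec(W_a, decoder_state)
    scores = []
    for annotation in annotations:
        projected = matvec(U_a, annotation)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected)]
        scores.append(sum(v * h for v, h in zip(v_a, hidden)))
    weights = softmax(scores)
    size = len(annotations[0])
    context = [sum(w * a[k] for w, a in zip(weights, annotations))
               for k in range(size)]
    return weights, context


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 source words, 4 units per direction,
# a 6-dimensional decoder state and a 3-unit attention layer.
T, H, D, A = 5, 4, 6, 3
forward_states = random_matrix(T, H)   # stand-in for the forward RNN states
backward_states = random_matrix(T, H)  # stand-in for the backward RNN states
# Each annotation concatenates the forward and backward state of one word.
annotations = [f + b for f, b in zip(forward_states, backward_states)]
decoder_state = [rng.uniform(-1.0, 1.0) for _ in range(D)]
W_a, U_a = random_matrix(A, D), random_matrix(A, 2 * H)
v_a = [rng.uniform(-1.0, 1.0) for _ in range(A)]

weights, context = attend(annotations, decoder_state, W_a, U_a, v_a)
print(weights)
```

The weights form a soft alignment over the source positions, and the context vector has the same size as one annotation.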
A major problem with neural MT is dealing with the large softmax normalization at
the output, whose cost depends on the target vocabulary size. Much research has
been done to address this problem, such as performing the softmax on a subset of the
outputs only [25], using a structured output layer [26] or self-normalization.

Figure 1: Standard architecture of current neural machine translation systems. [Figure not reproduced; its labels are: source sentence, encoder, decoder, softmax normalization, target sentence.]

Another possibility is to perform translation at a subword level. This also has the
advantage of allowing the generation of out-of-vocabulary words. Character-level machine
translation has been presented in several papers [28, 29, 20]. Byte Pair Encoding
(BPE) is a widely used technique that performs very well on many language pairs [30].
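
To make the idea of BPE concrete, the following is a minimal sketch of how merge operations can be learned from a word-frequency list: the most frequent pair of adjacent symbols is merged repeatedly. The toy vocabulary, the end-of-word marker `</w>` and the function names are assumptions chosen for illustration; this is not the implementation used in the cited work.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair by its concatenation."""
    merged = "".join(pair)
    # Match the pair only when both symbols are whole, space-delimited tokens.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda _: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily learn up to num_merges merge operations."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy vocabulary: words split into characters, with their frequencies.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)
print(segmented)
```

On this toy vocabulary the three learned merges build the subword unit `est</w>`, which is shared by "newest" and "widest". A rare or unseen word can then be segmented into such units rather than mapped to an unknown-word token.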

Category                               Papers
Using deep learning in Statistical MT  Source Sentence Simplification for Statistical MT by Hasler et al.
                                       Domain Adaptation Using Joint Neural Network Models by Joty et al.
Neural MT                              Context-Dependent Word Representation for Neural MT by Choi et al.
                                       Multi-Way, Multilingual Neural MT by Firat et al.
                                       On Integrating a Language Model into Neural MT by Gulcehre et al.
Interactive Neural MT                  Interactive Neural MT by Peris et al.
MT Evaluation with deep learning       MT Evaluation with Neural Networks by Guzman et al.

Table 1: Summary of papers in this special issue classified by categories.


3. Special Issue Overview

This section summarises the papers in this special issue, covering the main idea and
contribution of each one. The papers are organised in four categories: using
deep learning in statistical MT, neural MT, interactive neural MT and MT evaluation
with deep learning techniques.

3.1. Using deep learning in Statistical Machine Translation

One of the first approaches to integrating neural networks or deep learning into MT
was the rescoring of n-best lists from statistical MT systems [31, 2]. Given
that statistical MT provides state-of-the-art results and deep learning helps in finding
the right set of weights for statistical features, the scientific community is still doing
research in this direction. Below, we summarise the main research contributions
of the two papers in this special issue that use deep learning to improve statistical MT.

Source Sentence Simplification for Statistical Machine Translation by Eva Hasler,
Adrià de Gispert, Felix Stahlberg, Aurelien Waite and Bill Byrne. Long sentences are
a major challenge for MT in general. This paper uses text simplification to help hierarchical
MT decoding with neural rescoring. The authors combine the full input sentence
with a simplified version of the same sentence. The input sentence is simplified
by deleting its most redundant words. The
corresponding integration is done using a two-step decoding approach to process both
inputs. The first step translates the simplified input and produces an n-best list of candidates.
The second step uses the n-best list to guide the decoding of the full input.

The main contribution of the work is the procedure for integrating source sentence
simplification into hierarchical MT decoding with neural rescoring. This contribution
is relevant to all types of MT and, therefore, interesting future work
includes using source sentence simplification directly in a neural MT system.


Domain Adaptation Using Joint Neural Network Models by Shafiq Joty, Nadir
Durrani, Hassan Sajjad and Ahmed Abdelali. Domain adaptation is still a challenge
for MT systems in general, since parallel data can be considered a scarce resource
with respect to the differences between text types and genres. Neural translation models, such as
joint models, have shown an improved adaptation ability thanks to their continuous
representations. This paper investigates different ways to adapt this kind of model.
Data selection and mixture modeling are the starting point of this work. The authors
then propose a neural model to better estimate model weighting and instance selection
in a neural framework. For instance, they introduce a pairwise model that minimizes
the cross entropy by regularizing the loss function with respect to an in-domain model.
Experimental results on the TED talk (Arabic-to-English) task are promising.

3.2. Neural Machine Translation

Since the seminal work on neural MT [11, 12, 13], the encoder-decoder architecture
has rapidly emerged as an efficient solution, yielding state-of-the-art performance on several
translation tasks. Beyond these important results, this kind of architecture renews
the perspective of a multilingual approach to MT, but it also has some limitations. For
instance, using source context information, dealing with highly multilingual
frameworks and leveraging the abundant monolingual data all remain difficult.
Context-Dependent Word Representation for Neural Machine Translation by
Heeyoul Choi, Kyunghyun Cho and Yoshua Bengio deals with two major problems in
MT, namely word sense disambiguation (i.e. contextualization) and symbolization,
which aims at solving the rare word problem. Contextualization is performed by masking
out some dimensions of the target word embedding vectors (feedback) based on the
input context, i.e. the average of the nonlinearly transformed source word embeddings.
Symbolization is performed by introducing position-dependent special tokens to deal
with digits, proper nouns and acronyms.
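
The contextualization step can be pictured with the following sketch. It assumes that the nonlinear transformation is a tanh layer and that the mask is a soft, sigmoid-valued gate applied element-wise. Both choices, as well as the names `input_context`, `soft_mask`, `W` and `V` and all dimensions, are assumptions for illustration and may differ from the formulation in the paper.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def input_context(source_embeddings, W):
    """Average of the nonlinearly (tanh) transformed source embeddings."""
    transformed = [[math.tanh(z) for z in matvec(W, x)]
                   for x in source_embeddings]
    return [sum(column) / len(transformed) for column in zip(*transformed)]


def soft_mask(target_embedding, context, V):
    """Scale each dimension of the target embedding by a gate in (0, 1)
    computed from the input context (assumed sigmoid gate)."""
    gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(V, context)]
    return [g * e for g, e in zip(gate, target_embedding)]


rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 6 source words with 4-dimensional embeddings,
# a 5-dimensional context and 3-dimensional target embeddings.
source_embeddings = random_matrix(6, 4)
W = random_matrix(5, 4)
V = random_matrix(3, 5)
target_embedding = [rng.uniform(-1.0, 1.0) for _ in range(3)]

context = input_context(source_embeddings, W)
masked = soft_mask(target_embedding, context, V)
print(masked)
```

Dimensions whose gate is close to zero are effectively masked out, so the same target word can feed a different representation back to the decoder depending on the source sentence.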

Experiments on two tasks of the WMT 2015 international evaluation (Workshop on Statistical
Machine Translation, http://www.statmt.org/wmt15/) show that the proposed contextualization and
symbolization methods impact translation both quantitatively and qualitatively.

Multi-Way, Multilingual Neural Machine Translation by Orhan Firat, Kyunghyun
Cho, Baskaran Sankaran, Fatos T. Yarman Vural and Yoshua Bengio addresses
the challenge of efficiently managing highly multilingual environments. The paper
presents a multi-way, multilingual neural MT approach with an attention mechanism
shared across language pairs. The main achievement of this paper is that, while
multiple encoders and multiple decoders are kept, adding a language to the
system increases the number of parameters only linearly, sharing the advantages of
interlingua methods. The approach is tested on 8 language pairs (including linguistically
similar and dissimilar language pairs, and high- and low-resource language pairs). The
approach improves on a strong statistical MT system for low-resource language pairs, and it
achieves similar performance for the other language pairs.
The shared attention mechanism is the main contribution of this paper compared
to previous work on multilingual neural MT. This contribution is especially
helpful when the number of language pairs increases dramatically (e.g. in highly
multilingual contexts such as the European one) and/or for low-resource language pairs.
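
The linear growth can be illustrated with a small back-of-the-envelope count. The sketch below counts modules rather than parameters, and assumes that every language is used both as source and as target. The function names and the set of sizes are hypothetical.

```python
def pairwise_model_count(num_languages):
    """Dedicated single-pair systems: one model per ordered (source, target) pair."""
    return num_languages * (num_languages - 1)


def multiway_module_count(num_languages):
    """Multi-way system: one encoder and one decoder per language,
    plus a single attention mechanism shared by all pairs."""
    return 2 * num_languages + 1


for n in (2, 4, 8, 16):
    print(n, pairwise_model_count(n), multiway_module_count(n))
```

Under these assumptions, adding one language to the multi-way system adds one encoder and one decoder. In the pairwise setting, adding a language to N existing ones requires 2N new models.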

On Integrating a Language Model into Neural Machine Translation by Caglar
Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho and Yoshua Bengio. Neural MT training
relies on the availability of parallel corpora. For several language pairs, this kind
of resource is scarce. While conventional MT systems can leverage the abundant
monolingual data by means of target language models, neural MT systems
are limited in their ability to benefit from this kind of resource. This paper explores
two strategies for integrating recurrent neural language models into neural MT: a shallow
fusion, which simply combines the scores of the neural MT and target language models, and
a deep fusion, which fuses the hidden states of both models. Experimental
results show promising improvements in translation quality for both low- and
high-resource language pairs compared to state-of-the-art MT systems, thanks to
the effective exploitation of monolingual resources.
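
As a rough illustration of shallow fusion, the sketch below combines, for each candidate word, the log-probability from the translation model with a weighted log-probability from the language model. The weighted sum, the weight value, the candidate words and their probabilities are all assumptions made for this example and are not taken from the paper.

```python
import math


def shallow_fusion(nmt_log_probs, lm_log_probs, lm_weight):
    """Combine translation-model and language-model scores per candidate."""
    return {word: nmt_log_probs[word] + lm_weight * lm_log_probs[word]
            for word in nmt_log_probs}


# Hypothetical next-word distributions from the two models.
nmt_scores = {"house": math.log(0.5), "home": math.log(0.4), "hose": math.log(0.1)}
lm_scores = {"house": math.log(0.2), "home": math.log(0.7), "hose": math.log(0.1)}

fused = shallow_fusion(nmt_scores, lm_scores, lm_weight=0.5)
best_nmt_only = max(nmt_scores, key=nmt_scores.get)
best_fused = max(fused, key=fused.get)
print(best_nmt_only, best_fused)
```

In this toy case the translation model alone prefers "house", while the fused score prefers "home", because the language model, which in practice would be trained on monolingual data, strongly favours it. Deep fusion, by contrast, operates on the hidden states and is not covered by this sketch.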

3.3. Interactive Neural Machine Translation


Despite the promising results achieved by MT in recent decades, this technology is
still error-prone for some domains and applications. In such cases, interactive MT is
an efficient solution, defining a collaboration between a human translator and an MT
system, especially when the quality of fully automated systems is insufficient. This
approach consists of an iterative prediction–correction process in which the MT system
reacts by offering a new translation hypothesis after each user correction. The recent
emergence of neural MT has renewed the perspective for interactive MT, thereby setting
new challenges.
Interactive Neural Machine Translation by Alvaro Peris, Miguel Domingo and Francisco
Casacuberta investigates the integration of a neural MT system into an interactive
system. The authors propose a new interactive protocol that allows the user an enhanced
and more efficient interaction with the system. First, a neural MT system is
adapted to fit the prefix-based interactive scenario. In this conventional approach, the
user corrects the translation hypothesis by reading it from left to right, creating a translation
prefix that the MT system completes with a new hypothesis. This scenario is then
extended by exploiting the peculiarities of neural MT systems: the user can validate word
segments and the neural MT system fills the gaps by generating a new hypothesis. For
both scenarios, a tailored decoding strategy is proposed. Simulated experiments
are carried out on four different translation tasks (user manuals, medical texts and TED
talk translations) involving 4 language pairs. The results show a significant reduction
in human effort.
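
The prefix-based scenario can be sketched as a decoder that is forced to keep the prefix validated by the user and only generates the continuation. The example below uses greedy search over a toy next-word table in place of a neural MT system. The table, the function names and the use of greedy rather than beam search are assumptions for illustration, not the decoding strategy proposed in the paper.

```python
def complete_prefix(next_token_scores, prefix, max_len=20, eos="</s>"):
    """Keep the user-validated prefix fixed and greedily generate the rest."""
    hypothesis = list(prefix)
    while len(hypothesis) < max_len:
        scores = next_token_scores(hypothesis)
        token = max(scores, key=scores.get)
        if token == eos:
            break
        hypothesis.append(token)
    return hypothesis


# Toy stand-in for the model: next-word probabilities given the last word.
TOY_TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 0.8, "cat": 0.2},
    "cat": {"sleeps": 0.9, "</s>": 0.1},
    "dog": {"barks": 0.9, "</s>": 0.1},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}


def toy_scores(hypothesis):
    last = hypothesis[-1] if hypothesis else "<s>"
    return TOY_TABLE[last]


# The system proposes an initial hypothesis from an empty prefix.
first_hypothesis = complete_prefix(toy_scores, [])
# The user reads left to right and corrects the second word to "dog",
# so the validated prefix becomes ["the", "dog"].
second_hypothesis = complete_prefix(toy_scores, ["the", "dog"])
print(first_hypothesis, second_hypothesis)
```

Each user correction extends the validated prefix, and the system regenerates only the suffix. The human effort measured in simulated experiments of this kind is related to the number of such corrections.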

3.4. Machine Translation Evaluation with deep learning techniques


Progress in the field of MT relies heavily on the evaluation of proposed translations.
However, translation quality and its assessment are still an open question and a
scientific challenge that has generated much debate within the scientific community.
For MT system development, the goal is to define an automatic metric that
can both rank different approaches to measure progress and provide a replicable measure
to be optimized. As an illustration of its importance, shared tasks on this topic have been
organized every year since 2008 as part of the WMT evaluation campaigns.
Machine Translation Evaluation with Neural Networks by Francisco Guzman,
Shafiq Joty, Lluis Marquez and Preslav Nakov. Given a reference translation, the goal
is to select the better translation from a pair of hypotheses. The paper proposes a neural
architecture that represents the lexical, syntactic and semantic
properties of the reference and the two hypotheses as distributed vectors.

The experimental setup relies on the WMT metrics shared task, and the new flexible
model correlates highly with human judgments. Additional contributions include task-oriented
tuning of embeddings and sentence-based semantic representations.


4. Research perspectives

Neural MT is a very recent line of work which has already shown great results in
many translation tasks. The community, however, lacks hindsight about how research
in the area will evolve in the upcoming years. In comparison, more than ten
years were necessary to establish the phrase-based approach as the widespread, robust
and intensively tuned solution for MT. Neural MT challenges this position by
providing a new, unified framework which, to some extent, renders obsolete the
interdependent components of statistical MT systems (word alignments, reordering
models, phrase extraction). It is worth noting that we are only at the beginning and
that neural MT opens a wide range of research perspectives.


Nowadays, most neural MT systems are based on an auto-encoder architecture
which can evolve in many ways, for instance by considering different encoders or richer
attention-based models to better handle long-range reorderings and syntactic differences
between languages. The decoder, or generation model, is also an important
part. The current objective function is based on maximum likelihood and suffers from
several limitations that can be addressed within a discriminative framework [32] or with a
learning-to-rank strategy [33]. Neural MT also suffers from the vocabulary limitation
issue, which is well known in the field of NLP. The complexity associated with a large
output vocabulary hinders the application of such approaches to morphologically rich
languages and to non-canonical texts such as social media. To circumvent this limitation,
several solutions are under investigation: decoding at the character level [28, 19, 20],
combining word and character representations [34], or using subword units [18].
Moreover, neural MT systems provide a very promising framework for learning continuous
representations of textual data. This constitutes an important step in moving from the
word to the sentence level. Along with the introduction of the attention-based model,
these peculiarities renew how the notion of context can be considered within the translation
process. This could allow the model to take into account, for instance: a longer
context, enabling document or discourse translation; a multimodal context when translating
image captions; or a social anchor to deal with different writing styles. In the
seminal paper on statistical machine translation [35], the authors set out the limits of the
approach, considering that: "in its most highly developed form, translation involves a
careful study of the original text and may even encompass a detailed analysis of the
author's life and circumstances. We, of course, do not hope to reach these pinnacles of
the translator's art". While this is still valid today, neural MT creates a real opportunity
to extend the application field of machine translation in many respects, beyond "just"
challenging the state-of-the-art performance.


The work of the 1st author is supported by the Spanish Ministerio de Economía y
Competitividad and the European Regional Development Fund, through the postdoctoral
senior grant Ramón y Cajal and the contract TEC2015-69266-P (MINECO/FEDER,
UE). The 4th author thanks Facebook, Google (Google Faculty Award
2016) and NVidia (GPU Center of Excellence 2015-2016) for their support.



