
Bangladesh University of Engineering and

Technology
Department of Electrical and Electronic Engineering

NEURAL MACHINE TRANSLATION


WITH GENERATIVE ADVERSARIAL
NETWORK

By

Mursalin Ibne Salehin


Student ID: 1506174
&
Nibras Ul Islam
Student ID: 1506066

Under the Supervision of

Dr. Mohammad Ariful Haque


Professor
Department of EEE, BUET

A thesis submitted in partial fulfillment of the requirements for


the degree of Bachelor of Science in Electrical and Electronic Engineering
November 2021
Abstract

Neural machine translation is one of the most interesting areas of natural language processing. As translation becomes more important day by day, there is no way forward without the help of neural networks. They have the potential to create new novels, music albums and articles autonomously. Generative Adversarial Networks (GANs) have typically been used for continuous-space data such as images. While there are a few examples of GANs employed in audio and image translation, they are not commonly used in neural machine translation. We propose a GAN architecture for machine translation in order to find an improved model of neural machine translation. We also explore alternative approaches to processing the data in order to achieve better results in the shortest span of time.
Declaration of Authorship
We declare that this thesis titled "Neural Machine Translation with Generative Adversarial Network" and the work presented in it are our own. We confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where we have consulted the published work of others, this is always clearly attributed.

• Where we have quoted from the work of others, the source is always given.

• With the exception of such quotations, this thesis is entirely our own work.

• We have acknowledged all main sources of help.

Mursalin Ibne Salehin
&
Nibras Ul Islam
Certificate of Approval

This undergraduate thesis report titled "Neural Machine Translation with Generative Adversarial Network", submitted by Mursalin Ibne Salehin and Nibras Ul Islam, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor of Science in November 2021.

Dr. Mohammad Ariful Haque
Professor & Thesis Supervisor
Dept. of EEE, BUET
Acknowledgement

First of all, we are grateful to Almighty Allah for allowing us to complete this thesis. This thesis would not have been possible without the support of our teachers, friends and family; their encouragement and advice along the way were invaluable in producing this work.

We would like to thank our thesis advisor, Professor Mohammad Ariful Haque, for guiding our research and shaping this thesis. He has always been willing to give us the time and resources to explore different angles of the problem. He took a personal interest in our research and was a great source of motivation for us. Having him as our mentor over the last one and a half years has taught us work ethic and the basic principles of research, something we will carry with us in all our future endeavors.

We would like to heartily thank all of our friends - a list too large to write here, but they know who they are - for their support, both technical and motivational. They provided a support network which allowed us to stay sane through our years as undergraduates.

Most of all, we thank our parents. Their endless love, support, patience and encouragement enabled us to come this far. Words aren't enough to thank them; all we can muster is a solemn and grateful pause.
Contents
Abstract...........................................................................................................................................................i
Declaration of Authorship ..................................................................................................................... ii
Certificate of Approval........................................................................................................................... iii
Acknowledgement ................................................................................................................................... iv
Contents .........................................................................................................................................................v
List Of Figures ........................................................................................................................................ vii
1.1 Motivation ............................................................................................................................................ 2
1.2 Machine Translation ....................................................................................................................... 3
1.3 Machine Learning ................................................................................................................ 6
1.4 Challenges In NMT .........................................................................................................10
1.5 Thesis Contribution ........................................................................................................................14
1.6 Thesis Organization......................................................................................................................15
2| Methods of Machine Translation ................................................................................................16
2.1 Statistical Machine Translation ...............................................................................................17
2.2 Neural Machine Translation .....................................................................................................18
2.2.1 Sequence To Sequence Learning .........................................................................................19
2.2.2 Recurrent Neural Network (RNN) .....................................................................................24
2.3 Attention Mechanism ..................................................................................................................25
2.4 Generative Adversarial Network ............................................................................................27
2.5 Chapter Summary ........................................................................................................................28
3 | Literature Review...........................................................................................................................28
3.1 GENERATIVE ADVERSARIAL TRAINING FOR NEURAL MACHINE
TRANSLATION .........................................................................................................................................28
3.2 AUTOMATIC GENERATION OF NEWS COMMENTS BASED ON GATED
ATTENTION NEURAL NETWORKS..................................................................................................30
3.3 ADVERSARIAL FEATURE MATCHING FOR TEXT GENERATION.................................31
4 | Proposed method ..........................................................................................................................32
4.1 CycleGAN in NMT ............................................................................................................................33
4.2 DATASET...........................................................................................................................................36
4.3 System Architecture...................................................................................................................37
4.3.1 Preprocessing Module ..............................................................................................................37
4.3.2 Generator Module........................................................................................................................38
4.3.3 Discriminator Module ...............................................................................................................39

4.4 Activation Functions ..................................................................................................................... 40
4.5 Evaluation ........................................................................................................................................ 40
5 | Results and Analysis ..................................................................................................................... 41
5.1 Model Loss Graphs.......................................................................................................................... 41
5.2 Model Accuracy Graphs: ............................................................................................................... 42
5.3 Comparison with BLEU score on Validation Sets .............................................................. 43
6 | Future work and Conclusions .................................................................................................... 44
6.1 Future work ..................................................................................................................................... 44
6.2 Conclusion ................................................................................................................................... 45
Bibliography ............................................................................................................................................. 46

List Of Figures

1.1 Graph of Quality of different translation models…………………………………5

2.1 Encoder-Decoder Sequence to Sequence Model…………………..…………….20

2.2 Recurrent Neural Network……………………………………………………………….24

4.1 English-French Dataset…………………………………………………...………………..36

4.2 Generator Module…………………………………………………………………………….38


4.3 Discriminator Module ……………………………………………………………………...40

5.1 Train loss and validation loss (with attention)……………………………...…...41

5.2 Train loss and validation loss (without attention)……………………………...42

1| Introduction

From the earliest written languages to the present day,


human translation has always been an important way to
connect the world. Translation is necessary for the spread of
information, knowledge, and ideas. It is absolutely necessary
for effective and empathetic communication between
different cultures. It is more than just changing the words
from one language to another. As we continue to transition
more and more of our lives online, translation has become an
important way to reach large global audiences who are
looking for information on the internet. For the longest time,
translation was a highly manual process that relied solely on
human labor to accomplish. While human translation
continues to be the most reliable way to translate content, it
takes longer and tends to be more expensive if you’re doing it
for each individual piece of content. Translators had
constraints on the volume of content they could be expected
to accurately translate in a given time, meaning that there
were large volumes of content for which it would be hard to
justify translating based on the time, cost, and effort involved.
Alternative methods of translation have started appearing in
more recent years with the advent of machine translation
(MT) in the 1940s and 50s. Machine translation completely changed the way translation could be done, as it added powerful AI and automation to the translation process. In this introductory chapter, the necessity of doing this thesis on neural machine translation is discussed. The goal of this thesis, its challenges and its contributions are presented here.

1.1 Motivation
At its core, machine translation is fully automated software
that translates content from one language to another. Since a
large portion of the world’s content is inaccessible to people
that don’t speak the original source language, machine
translation can effectively translate content faster and into
more languages. If people could communicate in a single language, then many problems could be solved easily. Machine translation gives us an opportunity to bring all the people of the world under one common language. Machine translation systems are most commonly used when there's a lot of information that needs translation (i.e., hundreds of thousands of words or more). In those situations, traditional human translation wouldn't be feasible due to the sheer volume of content, so we turn to AI. There are multiple types of machine translation, and the accuracy and time taken by the different models vary. We are always looking for more accurate and faster models so that they can translate huge amounts of data in less time. In most machine translation cases, we need to train our model with a paired dataset. This generally limits the ability of the model, and it also takes more time to train on such a dataset. But if we can create a model which can be trained with an unpaired dataset, it can be more accurate and less time consuming. This will create more opportunities for translation from one language to another.

Using a Generative Adversarial Network (GAN) in machine translation can give us a better and faster way of translating. Here, we can train on an unpaired dataset, so the model will be able to translate different and difficult sets of words. This method can make the job of translation easier.

1.2 Machine Translation


Machine Translation (MT), or automated translation, is a process in which computer software translates text from one language to another without human involvement.

MT works with large amounts of source- and target-language text that are compared and matched against each other by a machine translation engine. We differentiate three types of machine translation methods:

• Rules-based machine translation: It uses grammar and


language rules, developed by language experts, and
dictionaries which can be customized to a specific topic
or industry.
• Statistical machine translation: It does not rely on linguistic rules and words; it learns how to translate by analyzing large amounts of existing human translations.
• Neural machine translation: It teaches itself how to translate by using a large neural network. This method is becoming more and more popular as it provides better results for many language pairs.

Machine translation has some vital benefits.

• Saves time: Machine language translation can save


significant time as it is capable of translating entire text
documents in seconds. However, please bear in mind
that human translators should always post-edit
translations done by MTs.
• Reduces costs: Machine Translation can substantially
lower your costs, as it requires less human involvement.
• Memorizes terms: Another benefit of machine language
translation is its ability to memorize key terms and
reuse them wherever they might fit.

While both statistical and neural MT use huge datasets of
translated sentences to teach software to find the best
translation, the models themselves are different. Statistical
MT translates sentences by breaking them up into phrases,
translating the pieces, then trying to stitch those translations
back together. Neural MT, on the other hand, uses neural
networks to consider whole sentences when predicting
translations, which allows it to take into account the context
in which each word and phrase is used.

Figure 1.1 : Graph of Quality of different translation models

From the chart above, we can see that neural machine translation is currently the state-of-the-art technology in machine translation and offers the highest quality translation.

1.3 Machine Learning

Artificial Intelligence (AI) is a science devoted to making


machines think and act like humans. Machine Learning is a
subset of artificial intelligence focusing on a specific goal:
setting computers up to be able to perform tasks without the
need for explicit programming. Computers are fed structured
data (in most cases) and ‘learn’ to become better at
evaluating and acting on that data over time. There are many
uses of machine learning, so there is no shortage of machine
learning algorithms.
There are four types of machine learning algorithms:
supervised, semi-supervised, unsupervised and
reinforcement.

• Supervised Learning: It is a subset of machine learning


that requires the most ongoing human participation —
hence the name ‘supervised’. The computer is fed
training data and a model explicitly designed to ‘teach’ it
how to respond to the data. Once the model is in place,
more data can be fed into the computer to see how well
it responds — and the programmer can confirm
accurate predictions, or can issue corrections for any
incorrect responses. Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
Supervised learning can be separated into three types of


problems when data mining—classification, regression and
forecasting:

Classification: It uses an algorithm to accurately assign


test data into specific categories. It recognizes specific
entities within the dataset and attempts to draw some
conclusions on how those entities should be labeled or
defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest.

Regression: It is used to understand the relationship


between dependent and independent variables. It is
commonly used to make projections, such as for sales
revenue for a given business. Linear
regression, logistical regression, and polynomial
regression are popular regression algorithms.

Forecasting: Forecasting is the process of making
predictions about the future based on the past and
present data, and is commonly used to analyse trends.

• Semi-supervised Learning: In semi-supervised learning,


the computer is fed a mixture of correctly labeled data
and unlabeled data, and searches for patterns on its
own. The labeled data serves as ‘guidance’ from the
programmer, but they do not issue ongoing corrections.
By using this combination, machine learning algorithms
can learn to label unlabeled data.

• Unsupervised Learning: Unsupervised learning, also


known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabeled
datasets. These algorithms discover hidden patterns or
data groupings without the need for human
intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition. The algorithm tries to organize the data in some way to describe its structure. This might mean grouping the data into clusters or arranging it in a way that looks more organized. As it assesses more data, its ability to make decisions on that data gradually improves and becomes more refined.

Unsupervised learning models are utilized for three main


tasks—clustering, association, and dimensionality reduction.

Clustering: It is a data mining technique which groups


unlabeled data based on their similarities or differences.
Clustering algorithms are used to process raw, unclassified
data objects into groups represented by structures or
patterns in the information. Clustering algorithms can be
categorized into a few types, specifically exclusive,
overlapping, hierarchical, and probabilistic.

Association Rule: An association rule is a rule-based method


for finding relationships between variables in a given dataset.
These methods are frequently used for market basket
analysis, allowing companies to better understand
relationships between different products. Understanding
consumption habits of customers enables businesses to
develop better cross-selling strategies and recommendation
engines.

Dimensionality reduction: While more data generally yields


more accurate results, it can also impact the performance of machine learning algorithms (e.g. over-fitting), and it can also make it difficult to visualize datasets. Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the integrity of the dataset as much as possible. It is commonly used in the data preprocessing stage.

Reinforcement learning: Reinforcement learning focuses on


regimented learning processes, where a machine learning
algorithm is provided with a set of actions, parameters and
end values. By defining the rules, the machine learning
algorithm then tries to explore different options and
possibilities, monitoring and evaluating each result to
determine which one is optimal. Reinforcement learning teaches the machine through trial and error. It learns from past experiences and begins to adapt its approach in response to the situation to achieve the best possible result.

1.4 Challenges In NMT

Neural Machine Translation (NMT) is difficult for many


reasons. There are many challenges when we work on neural
machine translation. A known challenge in translation is that in different domains, words have different translations and meaning is expressed in different styles.
in developing machine translation systems targeted at a
specific use case is domain adaptation. A well-known
property is that increasing amounts of training data lead to
better results. A small sample size is a barrier to getting better accuracy from machine learning algorithms. Moreover, most datasets are highly imbalanced, meaning the number of word samples in one language is not the same as the number of word samples in the other language. Imbalanced data causes biased learning, and therefore predictions by these models will be biased as well, which degrades the accuracy of the model. But we need to improve performance with smaller amounts of training data, because sometimes a large amount of data cannot be found. NMT can also perform poorly on rare words, which affects the results. Another flaw of
encoder-decoder NMT models is the inability to properly
translate long sentences. Word Alignment is another problem
which needs to be overcome. There are other challenges
when it comes to Decoding. The task of decoding is to find the
full sentence translation with the highest probability. Despite
its recent successes, neural machine translation still has to
overcome various challenges, most notably performance out-
of-domain and under low resource conditions.

The main challenges are given below:

• Lack of existing work: GANs have mainly been used for generating images, audio and video - continuous data. There has not been much work on applying them to discrete data such as text. As seen in the literature survey, there have been few attempts to use a GAN to generate independent text; rather, there are restricted uses such as MedGAN and CSGAN-NMT. We aim to address this gap with this thesis, thus providing a baseline for future research.

• The vanishing gradient problem: Sometimes the discriminator becomes so successful that it rejects anything the generator makes, halting the learning process of the generator. This is due to the discrete nature of the text. In normal circumstances, such as with images or audio, the data lie in a continuous space. To put it in the words of Dr. Goodfellow, the creator of GANs, you can add a minute increase to "1" to obtain "1.001", but you can't add 0.001 to the word "Penguin". However, if we convert these words into embeddings, then this might just be possible. Existing word embeddings such as Word2Vec and GloVe associate similar words with each other; the popular Word2Vec example is that it associates "King" with "Male" and "Queen" with "Female".

• Training time: It takes a long time to train even a single epoch, and we need to train for tens, hundreds or even thousands of epochs. The DCGAN architecture trained on the MNIST dataset requires about 4000 epochs of training to achieve decent results. Since we are working with text data, we won't need as much time, but each individual epoch still takes a long time. This depends more on the type of input we use than on the network. If we train the network on individual characters, training becomes extremely time consuming; even normal RNNs such as Karpathy's Char-RNN take a long time to train. Word-level models, on the other hand, train much faster but have considerably worse performance. Due to the lack of computational power, we plan to use word-level methods and tweak them to achieve a significant measure of performance.

• Preprocessing: There are a few approaches to preprocessing. One major consideration is whether to use a character-level model or a word-level model. As mentioned earlier, the premier character-level model is Karpathy's Char-RNN. It takes a lot of time to train, though it demonstrates astounding results. Word-level models have considerably worse performance but require much less training time; however, tweaking the model is difficult and requires a lot of work. In addition, the representation of the words is also a consideration. Using a word embedding such as GloVe or Word2Vec might lead to better outputs at the cost of speed. On the other hand, simply converting each individual word into a number and using that as an embedding leads to faster training but worse outputs.

• Evaluation: There is no proper way for the generator to test for "correct" language on its own. While metrics exist for other tasks, no metric exists to properly evaluate whether the language generated by the model is "human-like". This is why we bring in the discriminator in the first place: since such a metric doesn't exist, we train a network to do it. This means that the discriminator is of the utmost importance; if the discriminator is not trained properly, the network will fail. We plan to take steps to ensure that the discriminator is properly trained and does not fail at any stage.

1.5 Thesis Contribution


The challenges of machine translation are presented in Section 1.4. In this thesis, ways of handling these problems are presented. Moreover, this thesis shows the effectiveness of using unsupervised data to perform neural machine translation. CycleGAN, which is a way to handle unsupervised data properly, is used in this thesis. The challenges are handled in a way that improves accuracy, and a more flexible and accurate method is introduced. Finally, the thesis work achieves better accuracy in neural machine translation.

1.6 Thesis Organization

This undergraduate thesis report consists of six chapters. The task of neural machine translation and the motivation for this thesis are discussed in chapter 1 (Introduction); the challenges and contributions in this context are also discussed in the same chapter. In the second chapter, the methods of machine translation are discussed. In chapter 3, the literature review is presented. In chapter 4, the proposed method, along with the dataset and architecture, is discussed. The experimentation and results on neural machine translation are analyzed in the next chapter, chapter 5 (Results and Analysis). Finally, the report ends in chapter 6 (Future Work and Conclusions) with concluding remarks and the scope for further improvement of the proposed method.

2| Methods of Machine
Translation

Machine translation (MT) is an important sub-field


of natural language processing that aims to translate
natural languages using computers. In recent years,
end-to-end neural machine translation (NMT) has
achieved great success and has become the new
mainstream method in practical MT systems. Here,
we provide a broad review of the methods for NMT
and focus on methods relating to architectures,
decoding, and data augmentation.

Early solutions took the form of rule-based systems


where rules were programmed in by a human, termed
rule-based machine translation (RBMT). With
advances in statistical methods, using data to learn
these rules and to resolve ambiguity in rules through
context has been attempted by a class of methods
under the umbrella of Statistical Machine Translation
(SMT). Another class of solutions proposed
prediction of the target sentence, from several
examples, called example-based MT (EBMT).

2.1 Statistical Machine
Translation
A statistical MT model uses the following noisy-channel formulation for a source sequence x and a target sequence y:

$$\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(x \mid y)\, P(y)$$

The probabilities $P(x \mid y)$ and $P(y)$ are estimated using frequencies of phrase or word units. The collection of phrases is restricted to a few words to make the computation of these probabilities feasible, decomposing them into products of factors. The computations of frequency tables were parallelized over multiple CPU cores and computers, providing the earliest usable translation systems for the public. SMT brought about significant improvements to automatic translation, to the point that it was deployed in popular online services like Google Translate.
Parallel to the success of neural networks in image recognition, speech recognition, etc., deep neural networks (DNNs) found widespread use and had evolved into the de facto method by the time of this work. The class of solutions that uses deep neural networks to learn a translation model from data is widely known as Neural Machine Translation (NMT), which is used extensively in this work. NMT is a different, GPU-heavy paradigm involving neural networks that learns by backpropagation, unlike the frequency-based procedures of SMT. We discuss the components that make up modern NMT in detail ahead.

2.2 Neural Machine Translation

This section introduces and describes the building blocks of the NMT approaches. First, machine translation is cast as a sequence-to-sequence learning problem. With such a formulation, methods to decompose sentences into sequences of meaningful units are required to make implementation feasible.

2.2.1 Sequence To Sequence Learning

A sequence-to-sequence model aims to map a fixed-length input to a fixed-length output where the lengths of the input and output may differ.

For example, translating “What are you doing today?”


from English to Chinese has input of 5 words and
output of 7 symbols (今天你在做什麼?). Clearly, we

can’t use a regular LSTM network to map each word


from the English sentence to the Chinese sentence.

This is why the sequence to sequence model is used


to address problems like that one.

Working Principle of Sequence to Sequence Model:

Figure 2.1: Encoder-Decoder Sequence to Sequence Model

The model consists of 3 parts: encoder, intermediate


(encoder) vector and decoder.

Encoder:

• A stack of several recurrent units (LSTM or GRU


cells for better performance) where each accepts a single element of the input sequence, collects information for that element and propagates it forward.

• In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as $x_t$, where $t$ is the order of that word.

• The hidden states $h_t$ are computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$

This simple formula represents an ordinary recurrent neural network: we just apply the appropriate weights to the previous hidden state $h_{t-1}$ and the input vector $x_t$.

Encoder Vector

• This is the final hidden state produced from the


encoder part of the model. It is calculated using
the formula above.

• This vector aims to encapsulate the information


for all input elements in order to help the decoder
make accurate predictions.

• It acts as the initial hidden state of the decoder
part of the model.

Decoder

• A stack of several recurrent units where each predicts an output $y_t$ at a time step $t$.

• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.

• In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as $y_t$, where $t$ is the order of that word.

• Any hidden state $h_t$ is computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1}\right)$$

As you can see, we are just using the previous hidden state to compute the next one.

• The output $y_t$ at time step $t$ is computed using the formula:

$$y_t = \mathrm{softmax}\left(W^{(S)} h_t\right)$$

We calculate the outputs using the hidden state at the current time step together with the respective weight $W^{(S)}$. Softmax is used to create a probability vector which helps us determine the final output (e.g. a word in the question-answering problem).

The power of this model lies in the fact that it can


map sequences of different lengths to each other. As
you can see the inputs and outputs are not correlated
and their lengths can differ. This opens a whole new
range of problems which can now be solved using
such architecture.
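
To make this data flow concrete, a minimal sketch of an encoder-decoder model in Keras is given below. It follows the structure just described (a GRU encoder producing a state, a GRU decoder initialized from that state, and a softmax output); the vocabulary sizes, embedding width and hidden size are illustrative placeholders rather than the configuration used later in this thesis.

```python
# Minimal encoder-decoder (sequence to sequence) sketch in Keras.
# All sizes are illustrative placeholders, not the thesis configuration.
from tensorflow.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, emb_dim, hidden = 5000, 5000, 128, 256

# Encoder: reads the source tokens and returns its final hidden state (the encoder vector).
enc_in = Input(shape=(None,), name="source_tokens")
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
_, enc_state = GRU(hidden, return_state=True)(enc_emb)

# Decoder: starts from the encoder state and predicts the target tokens step by step.
dec_in = Input(shape=(None,), name="target_tokens")
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _ = GRU(hidden, return_sequences=True, return_state=True)(dec_emb, initial_state=enc_state)
probs = Dense(tgt_vocab, activation="softmax")(dec_out)  # probability over the target vocabulary

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```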

2.2.2 Recurrent Neural Network (RNN)

Recurrent neural networks (RNN) are a class of neural


networks that are helpful in modeling sequence data. Derived
from feed-forward networks, RNNs exhibit similar behavior
to how human brains function. Simply put: recurrent neural
networks produce predictive results in sequential data that
other algorithms can’t.

Figure 2.2 : Recurrent Neural Network

2.3 Attention Mechanism
The introduction of attention mechanism (Bahdanau et al.,
2015) is a milestone in NMT architecture research. The
attention network computes the relevance of each value
vector based on queries and keys. This can also be interpreted
as a content-based addressing scheme (Graves et al., 2014).
Formally, given a set of $m$ query vectors $Q \in \mathbb{R}^{m \times d}$, a set of $n$ key vectors $K \in \mathbb{R}^{n \times d}$ and associated value vectors $V \in \mathbb{R}^{n \times d}$, the computation of the attention network involves two steps. The first step is to compute the relevance between queries and keys, which is formally described as:

$$R = \mathrm{score}(Q, K)$$

where score is a scoring function with several alternatives, and $R \in \mathbb{R}^{m \times n}$ is a matrix storing the relevance score between each query and key. The next step is to compute the output vectors.

For each query vector, the corresponding output vector is expressed as a weighted sum of the value vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(R)\, V$$

Based on the scoring function, attention networks can be roughly classified into two categories: additive attention (Bahdanau et al., 2015) and dot-product attention (Luong et al., 2015). Additive attention models the score through a feed-forward neural network, while dot-product attention uses the dot product to compute the matching score. In practice, dot-product attention is much faster than additive attention.
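
As a small illustration of these two steps, the following NumPy sketch computes dot-product attention. The dimensions are arbitrary, and the division by the square root of d follows the common scaled dot-product variant; nothing here is specific to this thesis.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q: (m, d) queries, K: (n, d) keys, V: (n, d) values."""
    d = Q.shape[-1]
    R = Q @ K.T / np.sqrt(d)                   # relevance scores R = score(Q, K), shape (m, n)
    W = np.exp(R - R.max(axis=-1, keepdims=True))
    W = W / W.sum(axis=-1, keepdims=True)      # row-wise softmax
    return W @ V                               # weighted sum of value vectors, shape (m, d)

Q = np.random.randn(2, 8)   # two query vectors
K = np.random.randn(5, 8)   # five key vectors
V = np.random.randn(5, 8)   # associated value vectors
print(dot_product_attention(Q, K, V).shape)    # (2, 8)
```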

2.4 Generative Adversarial
Network
Generative Adversarial Networks, or GANs for short, are an
approach to generative modeling using deep learning
methods, such as convolutional neural networks.

This is another approach that can be used in neural machine translation.

Generative modeling is an unsupervised learning task in


machine learning that involves automatically discovering and
learning the regularities or patterns in input data in such a
way that the model can be used to generate or output new
examples that plausibly could have been drawn from the
original dataset.

GANs are a clever way of training a generative model by


framing the problem as a supervised learning problem with
two sub-models: the generator model that we train to
generate new examples, and the discriminator model that
tries to classify examples as either real (from the domain) or
fake (generated). The two models are trained together in a
zero-sum game, adversarial, until the discriminator model is
fooled about half the time, meaning the generator model is
generating plausible examples.
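
The sketch below illustrates this alternating training scheme on a toy one-dimensional problem: the "real" data are samples from a Gaussian, the generator maps noise to a single value, and the discriminator outputs a real/fake probability. The model shapes, data distribution and number of steps are assumptions made purely for illustration and are not part of this thesis.

```python
import numpy as np
from tensorflow.keras import layers, models

latent_dim = 8
generator = models.Sequential([layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
                               layers.Dense(1)])
discriminator = models.Sequential([layers.Dense(16, activation="relu", input_shape=(1,)),
                                   layers.Dense(1, activation="sigmoid")])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to update the generator while the discriminator is frozen.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

for step in range(1000):
    noise = np.random.randn(64, latent_dim)
    fake = generator.predict(noise, verbose=0)
    real = np.random.normal(3.0, 1.0, size=(64, 1))       # toy "real" data
    # 1) Train the discriminator to separate real samples from generated ones.
    discriminator.train_on_batch(np.vstack([real, fake]),
                                 np.vstack([np.ones((64, 1)), np.zeros((64, 1))]))
    # 2) Train the generator (through the frozen discriminator) to be classified as real.
    gan.train_on_batch(noise, np.ones((64, 1)))
```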

2.5 Chapter Summary
Researchers from various fields are working with neural
machine translation. There are many methods that can be
applied in NMT. Here, we have tried to focus on the methods
that we are going to be using in our thesis. Sequence to
Sequence model, Attention mechanism and GANs are needed
for our proposed method.

3 | Literature Review

3.1 GENERATIVE ADVERSARIAL TRAINING FOR


NEURAL MACHINE TRANSLATION

Yang, Z., Chen, W., Wang, F. and Xu, B. have used a conditional
sequence generative adversarial network for neural machine
translation (CSGAN-NMT). The proposed model consists of two
sub-models, a generator and a discriminator. The generator
generates text based on a source language. And the discriminator
evaluates this translation by predicting how probable it is that this
is the correct translation. To reach a Nash equilibrium, a gamified minimax process is played between the two sub-models until they arrive at a win-win situation.

The generator consisted of an encoder-decoder system of Gated


Recurrent Units (GRUs) with 512 hidden units. These measures were chosen so as to prevent a manually designed loss function from misleading the model into generating suboptimal translations. They evaluated their network on the NIST Chinese-English dataset, and to further test the effectiveness of the approach, results were also provided on an English-German translation task. Beam search was utilized for generating translations.

The beam width was set at 10, and log-likelihood scores were not normalized by sentence length. The models were implemented in TensorFlow, with synchronous training on up to four K80 GPUs in a multi-GPU setup on a single machine. Experimental results achieved with the proposed CSGAN-NMT model show significant improvement over older works. They also considered a variant called multi-CSGAN-NMT, a scenario with multiple generators and discriminators that achieves remarkable results, where each generator can be considered an agent that has the ability to interact with other generator agents and even send messages.

The use of two independent discriminators allowed the generators to learn far better. Through various tests, the alternate variant showed even more improvement over the initially suggested model. In their tests, they also noticed that discriminators with an accuracy that was too high or too low performed badly.

3.2 AUTOMATIC GENERATION OF NEWS COMMENTS


BASED ON GATED ATTENTION NEURAL NETWORKS

Zheng, H., Wang, W., Chen, W., and Sangaiah, A. have proposed a gated attention neural network model (GANN) that comprises two main elements. The first is a comment generator built on an encoder-decoder framework, where the encoder component converts all words of the title into one-hot vectors and obtains their embedding representations by multiplying with the embedding matrix.
generator is triggered by the last hidden vector of the title. Similar
to the encoder, the model converts the sequence of comment
words into one-hot vectors and gets their low-dimensional
representations through the shared embedding matrix.
Introduction of modules such as the gated attention mechanism
and a relevance control module is done, so as to guarantee the
contextual relevance between comments and news, by assigning
different weights to different parts of contextual information,
which has proven to improve the performance. The second
element is a comment discriminator, which is used to improve the
accuracy of comment generation. This is a concept inspired by Generative Adversarial Networks (GANs). The various tests performed on a large dataset show the effectiveness of GANN compared to other generators, and the generated news comments were found to be close to human comments.

The widespread adoption of electronic health records by healthcare organisations, along with the increase in the quality and quantity of data, has motivated computational advances in medical research. However, there are various concerns over privacy which limit the access to and collaborative use of this data.

3.3 ADVERSARIAL FEATURE MATCHING FOR TEXT


GENERATION

Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. have proposed a framework for generating realistic text via adversarial training. It employs the conventional architecture of Generative Adversarial Networks (GANs), having a generator and a discriminator, where a long short-term memory network (LSTM) is utilized as the generator and a convolutional network as the discriminator. Instead of using the standard GAN objective, the framework proposes matching the high-dimensional latent feature distributions of real and synthetic sentences, undertaken via a kernelized discrepancy metric. With the proposed framework modules, it alleviates the mode-collapsing problem and thus eases adversarial training. This particular
model delivers superior performance compared to the other
related approaches. It not only produces realistic sentences, but it
also enables the learned latent representation space to smoothly
encode plausible sentences. The methods that were employed were quantitatively evaluated against baseline models and existing methods as benchmarks, and the results indicate superior performance of the proposed methods.

4 | Proposed method

There are several methods for performing neural machine translation; sequence-to-sequence translation and translation with an attention mechanism are among them. In this thesis, we perform neural machine translation with a CycleGAN.

The CycleGAN is a technique that involves the automatic training of word-to-word translation models without paired examples. The models are trained in an unsupervised manner using a collection of words from the source and target domains that do not need to be related in any way. Training a model for word-to-word translation typically requires a large dataset of paired examples. Such datasets can be difficult and expensive to prepare, and in some cases impossible, because some languages have very few dataset resources.

In this thesis, we also perform sequence-to-sequence translation with and without attention, and then compare the results with our CycleGAN-based neural machine translation model.

The CycleGAN model was described by Jun-Yan Zhu, et al. in


their 2017 paper titled “Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks.”

4.1 CycleGAN in NMT

The benefit of the CycleGAN model is that it can be trained


without paired examples. That is, it does not require
examples of sentences before and after the translation in
order to train the model.

The model architecture comprises two generator models: one generator (Generator-A) for generating sentences in the first domain (Domain-A) and a second generator (Generator-B) for generating sentences in the second domain (Domain-B).

• Generator-A -> Domain-A


• Generator-B -> Domain-B
The generator models perform neural machine translation.
Generator-A takes a sentence from Domain-B as input and
Generator-B takes a sentence from Domain-A as input.

• Domain-B -> Generator-A -> Domain-A


• Domain-A -> Generator-B -> Domain-B

Each generator has a corresponding discriminator model. The


first discriminator model (Discriminator-A) takes real
sentences from Domain-A and generated sentences from
Generator-A and predicts whether they are real or fake. The
second discriminator model (Discriminator-B) takes real
sentences from Domain-B and generated sentences from
Generator-B and predicts whether they are real or fake.

• Domain-A -> Discriminator-A -> [Real/Fake]


• Domain-B -> Generator-A -> Discriminator-A -> [Real/Fake]
• Domain-B -> Discriminator-B -> [Real/Fake]
• Domain-A -> Generator-B -> Discriminator-B -> [Real/Fake]

The discriminator and generator models are trained in an adversarial zero-sum process, like normal GAN models. The generators learn to better fool the discriminators, and the discriminators learn to better detect fake sentences. Together, the models find an equilibrium during the training process.

Additionally, the generator models are regularized so that they do not just create new sentences in the target domain, but instead produce translated versions of the input sentences from the source domain. This is achieved by using generated sentences as input to the corresponding generator model and comparing the output sentence to the original sentence. Passing a sentence through both generators is called a cycle. Together, the pair of generator models is trained to better reproduce the original source sentence, a property referred to as cycle consistency.

• Domain-B -> Generator-A -> Domain-A -> Generator-B ->


Domain-B
• Domain-A -> Generator-B -> Domain-B -> Generator-A ->
Domain-A

There is one further element to the architecture, referred to as the identity mapping. This is where a generator is provided with a sentence from the target domain as input and is expected to generate the same sentence without change. This addition to the architecture is optional.

• Domain-A -> Generator-A -> Domain-A


• Domain-B -> Generator-B -> Domain-B
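
The sketch below shows how the adversarial, cycle-consistency and identity terms described above could be combined into a single generator update. It is a simplification under stated assumptions: the generators and discriminators are small dense networks over fixed-length embedding vectors rather than sentence models, the loss weights are illustrative, and the corresponding discriminator update is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, losses, optimizers

dim = 32  # toy embedding size (illustrative)

def make_generator():
    return models.Sequential([layers.Dense(64, activation="relu", input_shape=(dim,)),
                              layers.Dense(dim)])

def make_discriminator():
    return models.Sequential([layers.Dense(64, activation="relu", input_shape=(dim,)),
                              layers.Dense(1, activation="sigmoid")])

gen_AB, gen_BA = make_generator(), make_generator()   # Domain-A -> Domain-B and Domain-B -> Domain-A
disc_A, disc_B = make_discriminator(), make_discriminator()
bce, mae = losses.BinaryCrossentropy(), losses.MeanAbsoluteError()
opt = optimizers.Adam(2e-4)
lambda_cycle, lambda_identity = 10.0, 5.0              # illustrative loss weights

@tf.function
def train_generators(real_A, real_B):
    with tf.GradientTape() as tape:
        fake_B, fake_A = gen_AB(real_A), gen_BA(real_B)
        # Adversarial terms: each generator tries to make its discriminator output "real".
        adv = bce(tf.ones_like(disc_B(fake_B)), disc_B(fake_B)) + \
              bce(tf.ones_like(disc_A(fake_A)), disc_A(fake_A))
        # Cycle consistency: A -> B -> A and B -> A -> B should reconstruct the input.
        cycle = mae(real_A, gen_BA(fake_B)) + mae(real_B, gen_AB(fake_A))
        # Identity mapping: feeding a target-domain sample should leave it unchanged.
        identity = mae(real_B, gen_AB(real_B)) + mae(real_A, gen_BA(real_A))
        loss = adv + lambda_cycle * cycle + lambda_identity * identity
    gen_vars = gen_AB.trainable_variables + gen_BA.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss, gen_vars), gen_vars))
    return loss

# One illustrative step on random "embeddings"; the discriminator update would be analogous.
print(float(train_generators(tf.random.normal((16, dim)), tf.random.normal((16, dim)))))
```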
4.2 DATASET
Generally, when it comes to machine learning tasks, the data is
required to be of a particular format and must be split into a
“training” and a “testing” set. This is because such tasks usually
involve predicting some variable. However, when it comes to the
generation of language, this is not applicable. There is no variable
to predict. Rather, language is simply generated. With that in
mind, any corpus of text can be used as an input dataset. We have
restricted ourselves to a few well-known datasets in order to build
our model. However, it must be reiterated that the model is
applicable to any corpus of text.

Here, we have used a dataset from the manythings.org website, which can be accessed free of cost. We have used the "fra-eng" corpus, which contains French-English paired data.

Figure 4.1: English-French Dataset

4.3 System Architecture

The overall system architecture is fixed. It consists of just three


modules. The pre-processing module handles cleaning of the input
data and converts it into a machine-friendly representation. The
generator network is responsible for attempting to generate text
while the discriminator network judges the text. Based on the
network’s output, the loss function will propagate and update
either the discriminator network or the generator network. Over
time, each network will learn more and more and hence produce
even better results. This in turn will allow the model to truly
succeed and perhaps even fool human beings.

4.3.1 Preprocessing Module

It is easier for a model to recognize smaller numerical patterns in


order to generate text, rather than incorporating larger words.
That is, a model can understand a sequence of numbers easily. It
has no way of understanding random strings of text. Since
computers operate in binary and not text, all text must be
represented in some form of numbers. However, there are some
challenges to this. We can’t simply convert the text into numbers.
The first thing to consider is case. Is “Word” the same as “word”?
Indeed, it is, but a naive approach would label them both as
different words. So all words must be converted to lowercase. The
next point to consider is the sentences in the text. Foremost, sentences have to be identified. Not all sentences end with a period. Some end with a question mark, others with an
exclamation. There are complex sentences and compound
sentences. English is not a very well structured language. So the
sentences have to be identified and stored. Thirdly, not all
sentences are born equal. Some may be short, while others may be
long. But a model can’t really accept uneven input like that. So we
have to find a suitable standard sentence length that is neither too
long, which would require most sentences to be padded, nor too
short which would require most of the sentences to be truncated.
However, regardless of the chosen sentence length, both padding and truncation will be needed, so functions must be developed for both. Lastly, the actual embedding is to be considered.
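
A minimal sketch of this kind of preprocessing, using the Keras tokenizer, is given below: lowercasing, punctuation stripping, integer encoding, and padding or truncating to a fixed length. The sentence length of 10 is an arbitrary illustrative choice, not the value used in our experiments.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["What are you doing today?", "Machine translation saves time."]

# lower=True folds "Word" and "word" together; the default filters strip punctuation
# such as '?' and '.', so sentence-final marks do not become separate tokens.
tokenizer = Tokenizer(lower=True, oov_token="<unk>")
tokenizer.fit_on_texts(sentences)

seqs = tokenizer.texts_to_sequences(sentences)             # words -> integer ids
padded = pad_sequences(seqs, maxlen=10,
                       padding="post", truncating="post")  # pad short, truncate long sentences
print(padded)
```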

4.3.2 Generator Module

Figure 4.2 : Generator Module

The generator network receives an input of the preprocessed text


corpus that we want to imitate. The input text is received in batches and the generator attempts to imitate the batch. This
generated text is then passed on to the discriminator module. If
the discriminator determines that the generated text is fake, then
the loss function propagates to the generator and the gradient is
updated.

4.3.3 Discriminator Module

The discriminator module receives two inputs. The first is a


randomly chosen sample text from the dataset used. The second
input is text generated by the generator network. Both texts are
pre-processed in the same manner and the discriminator is
provided with the outputs. It does not know which of the two texts
is generated and which is real. Rather it must make a prediction. If
the prediction is right, then the loss propagates through to the
generator network. However, if the discriminator makes an
incorrect prediction, then the gradient will pass through the
discriminator network instead. This will allow the discriminator
to learn and hence perform better against future samples.

Figure 4.3: Discriminator Module

4.4 Activation Functions


For the generator we have used the "softmax" activation, and for the discriminator we have used the "relu" and "sigmoid" functions.
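
As an illustration of where these activations sit, a possible arrangement is sketched below; the layer sizes and vocabulary size are placeholders and not the exact configuration used in our networks.

```python
from tensorflow.keras import layers, models

vocab_size, hidden = 5000, 256  # illustrative placeholders

# Generator output layer: softmax yields a probability distribution over the target vocabulary.
generator_head = layers.Dense(vocab_size, activation="softmax")

# Discriminator: relu hidden layer followed by a sigmoid real/fake probability.
discriminator = models.Sequential([
    layers.Dense(hidden, activation="relu", input_shape=(hidden,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.summary()
```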

4.5 Evaluation
When it comes to generated text, there are no well-established metrics (Novikova et al.) to evaluate its quality. The best, and perhaps only, proper way of evaluating generated text is via human judgement. This is an expensive and time-consuming task, and for the purposes of our study we use a small collection of individuals for evaluation. For neural machine translation itself, we have used the BLEU score.
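
For reference, the BLEU score can be computed with NLTK as in the sketch below; the reference and hypothesis sentences are made up purely for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each entry in `references` is a list of acceptable token lists for one sentence;
# `hypotheses` holds the corresponding model translations.
references = [[["je", "suis", "etudiant"]], [["il", "fait", "froid", "aujourd'hui"]]]
hypotheses = [["je", "suis", "etudiant"], ["il", "fait", "tres", "froid"]]

smooth = SmoothingFunction().method1  # avoids zero scores on very short sentences
print("Corpus BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))
```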

5 | Results and Analysis
Here we have performed neural machine translation with a sequence-to-sequence model, both without and with attention. Then we used our proposed model to perform neural machine translation using CycleGAN.

We have produced loss and accuracy vs. epoch graphs for the training and validation sets, and we have also computed the BLEU score of the different models.

5.1 Model Loss Graphs

Figure 5.1: Train loss and validation loss (with attention)

In this process, we used attention. Here, we can see that the loss for the validation set is below 1 and for the training set it is below 0.5.

Figure 5.2 : Train loss and validation loss (without attention)

In this process, we did not use attention. Here, we can see that the loss for the validation set is above 2 and for the training set it is just below 2.

5.2 Model Accuracy Graphs:


Figure 5.4: (a) model accuracy graph with attention & (b) model accuracy graph with CycleGAN

From the above graphs, we can see that the model with CycleGAN has higher accuracy than the model with attention. At 30 epochs, the model with attention has nearly 77% accuracy, whereas the model with CycleGAN has nearly 86% accuracy on the validation set.

The graphs clearly show that the CycleGAN model outperforms the attention model.

5.3 Comparison with BLEU score on


Validation Sets

NMT Model                                      BLEU Score
Sequence to Sequence model                     0.77
Sequence to Sequence model with attention      0.83
NMT with CycleGAN                              0.87

Table 1: BLEU scores of the different models

From the BLEU scores we can see that neural machine translation with CycleGAN is the best of the three models.

CycleGAN creates a lot of opportunities for neural machine translation because it can be trained with unsupervised data, so languages which have few resources can be trained easily and machine translation can still be performed. Its impact is therefore greater than that of the other two models.

6 | Future work and Conclusions

6.1 Future work

There is a lot of scope for future work to be done in this domain.


There are two distinct routes for future work. The first is methods
to increase the performance of the GAN itself. The second is in
using this GAN for other, greater, purposes. Simply training the
model for longer periods of time might very well lead to incredible
results. In addition, we have used a relatively simple network
architecture in order to establish a baseline.

Expanding this network, either by making it wider or by making it


deeper, may lead to the network learning in a better and faster
manner.
While generated sentences are somewhat relevant, they do tend to suddenly spout gibberish, so using a relevance module is an appealing option. This would allow us to control what kind of text is generated, which is essential in order to maintain a coherent dialogue or text for any kind of creative endeavour. This model can also be trained on longer sentences; longer sentences are difficult to train on, so improvement can be made in this area.

6.2 Conclusion

In this thesis we have explored the use of Cycle Generative


Adversarial Networks in the domain of natural language
generation. While it does present an interesting alternative to
traditional text generation methodologies, it still requires a lot of
work before it can be deemed as viable. However, the CycleGAN
approach does have its own advantages. Primarily, using a word
level model is much faster than the character level models
traditionally used. It also means that the words themselves do not
need to be learned, just the meaning. In addition, the CycleGAN is
generalized and can be adapted to any situation, whether it is in
generating music lyrics, game dialogue or full novels.
Keeping the importance of neural machine translation in mind,
we can say that use of CycleGAN in this sector will improve overall
neural machine translation.

Bibliography
1. Minh-Thang Luong, Hieu Pham, and Christopher Manning. Effective
approaches to attention-based neural machine translation. arXiv, 2015.

2. Fedus, W., Goodfellow, I.J., and Dai, A.M. (2018). 'MaskGAN: Better Text
Generation via Filling in the ______.' CoRR abs/1801.07736.

3. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A.C., and Bengio, Y.(2014). 'Generative Adversarial Nets'.
Advances in Neural Information Processing Systems, pp. 2672- 2680.

4. Liang, C., Yang, X., Wham, D., Pursel, B., Passonneaur, R., and Giles, C.
(2017). 'Distractor Generation with Generative Adversarial Nets for
Automatically Creating Fill-in-the-blank Questions', Proceedings of the
Knowledge Capture Conference, Article 33.

5. Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013). 'Efficient Estimation
of Word Representations in Vector Space.' CoRR abs/1301.3781.

6. Novikova, J., Dušek, O., Curry, A.C., and Rieser, V. (2017). 'Why We Need
New Evaluation Metrics for NLG'. Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing, pp. 2231-2242.

7. Pennington, J., Socher, R., and Manning, C. (2014). 'Glove: Global Vectors for
Word Representation'. Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 1532- 1543.

8. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen,
X. (2016). 'Improved Techniques for Training GANs.' Proceedings of the 30th
International Conference on Neural Information Processing Systems, pp.
2234-2242.

9. Yang, Z., Chen, W., Wang, F., and Xu, B. (2018). 'Generative Adversarial
Training for Neural Machine Translation'. Neurocomputing, Vol. 321, pp. 146-
155.

10. https://becominghuman.ai/what-is-deep-learning-and-why-you-need-it-9e2fc0f0e61b

15. https://bloomberg.github.io/foml/#home

11. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

18. www.python.org

19. https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45

20. https://machinelearningmastery.com/cyclegan-tutorial-with-keras/
