
Bangladesh University of Engineering and

Technology
Department of Electrical and Electronic Engineering

NEURAL MACHINE TRANSLATION


WITH GENERATIVE ADVERSARIAL
NETWORK

By

Mursalin Ibne Salehin


Student ID: 1506174
&
Nibras Ul Islam
Student ID: 1506066

Under the Supervision of

Dr. Mohammad Ariful Haque


Professor
Department of EEE, BUET

A thesis submitted in partial fulfillment of the requirements for


the degree of Bachelor of Science in Electrical and Electronic Engineering
November 2021
Abstract

Neural machine translation is one of the most interesting areas of natural language processing. As translation becomes more important day by day, there is no way forward without the help of neural networks. They have the potential to create new novels, music albums and articles autonomously. Generative Adversarial Networks (GANs) have typically been used for continuous-space data such as images. While there are a few examples of GANs employed in audio and image translation, they are not commonly used in neural machine translation. We propose a GAN architecture for machine translation in order to find an improved model of neural machine translation. We also explore alternative approaches to processing the data in order to achieve better results in the shortest span of time.
Declaration of Authorship
We declare that this thesis titled "Neural Machine Translation with Generative Adversarial Network" and the work presented in it are our own. We confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where we have consulted the published work of others, this is always clearly attributed.

• Where we have quoted from the work of others, the source is always given.

• With the exception of such quotations, this thesis is entirely our own work.

• We have acknowledged all main sources of help.

Mursalin Ibne Salehin
&
Nibras Ul Islam
Certificate of Approval

This undergraduate thesis report titled "Neural Machine Translation with Generative Adversarial Network", submitted by Mursalin Ibne Salehin and Nibras Ul Islam, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor of Science in November 2021.

Dr. Mohammad Ariful Haque
Professor & Thesis Supervisor
Dept. of EEE, BUET
Acknowledgement

First of all, we are grateful to Almighty Allah for allowing us to complete this thesis. This thesis would not have been possible without the support of our teachers, friends and family; their encouragement and advice along the way were invaluable in producing this work.

We would like to thank our thesis advisor, Professor Mohammad Ariful Haque, for guiding our research and shaping this thesis. He has always been willing to give us the time and resources to explore different angles of the problem. He took a personal interest in our research and was a great source of motivation for us. Having him as our mentor over the last one and a half years has taught us work ethic and the basic principles of research, something we will carry with us in all our future endeavors.

We would like to heartily thank all of our friends - a list too large to write here, but they know who they are - for their support, both technical and motivational. They provided a support network which allowed us to stay sane through our years as undergraduates.

Most of all, we thank our parents. Their endless love, support, patience and encouragement enabled us to come this far. Words aren't enough to thank them; all we can muster is a solemn and grateful pause.
Contents
Abstract...........................................................................................................................................................i
Declaration of Authorship ..................................................................................................................... ii
Certificate of Approval........................................................................................................................... iii
Acknowledgement ................................................................................................................................... iv
Contents .........................................................................................................................................................v
List Of Figures ........................................................................................................................................ vii
1.1 Motivation ............................................................................................................................................ 2
1.2 Machine Translation ....................................................................................................................... 3
1.3 Machine Learning ................................................................................................................ 6
1.4 Challenges In NMT .........................................................................................................10
1.5 Thesis Contribution ........................................................................................................................14
1.6 Thesis Organization......................................................................................................................15
2| Methods of Machine Translation ................................................................................................16
2.1 Statistical Machine Translation ...............................................................................................17
2.2 Neural Machine Translation .....................................................................................................18
2.2.1 Sequence To Sequence Learning .........................................................................................19
2.2.2 Recurrent Neural Network (RNN) .....................................................................................24
2.3 Attention Mechanism ..................................................................................................................25
2.4 Generative Adversarial Network ............................................................................................27
2.5 Chapter Summary ........................................................................................................................28
3 | Literature Review...........................................................................................................................28
3.1 GENERATIVE ADVERSARIAL TRAINING FOR NEURAL MACHINE
TRANSLATION .........................................................................................................................................28
3.2 AUTOMATIC GENERATION OF NEWS COMMENTS BASED ON GATED
ATTENTION NEURAL NETWORKS..................................................................................................30
3.3 ADVERSARIAL FEATURE MATCHING FOR TEXT GENERATION.................................31
4 | Proposed method ..........................................................................................................................32
4.1 CycleGAN in NMT ............................................................................................................................33
4.2 DATASET...........................................................................................................................................36
4.3 System Architecture...................................................................................................................37
4.3.1 Preprocessing Module ..............................................................................................................37
4.3.2 Generator Module........................................................................................................................38
4.3.3 Discriminator Module ...............................................................................................................39

4.4 Activation Functions ..................................................................................................................... 40
4.5 Evaluation ........................................................................................................................................ 40
5 | Results and Analysis ..................................................................................................................... 41
5.1 Model Loss Graphs.......................................................................................................................... 41
5.2 Model Accuracy Graphs: ............................................................................................................... 42
5.3 Comparison with BLEU score on Validation Sets .............................................................. 43
6 | Future work and Conclusions .................................................................................................... 44
6.1 Future work ..................................................................................................................................... 44
6.2 Conclusion ................................................................................................................................... 45
Bibliography ............................................................................................................................................. 46

List Of Figures

1.1 Graph of Quality of different translation models…………………………………5

2.1 Encoder-Decoder Sequence to Sequence Model…………………..…………….20

2.2 Recurrent Neural Network……………………………………………………………….24

4.1 English-French Dataset…………………………………………………...………………..36

4.2 Generator Module…………………………………………………………………………….38


4.3 Discriminator Module ……………………………………………………………………...40

5.1 Train loss and validation loss (with attention)……………………………...…...41

5.2 Train loss and validation loss (without attention)……………………………...42

1| Introduction

From the earliest written languages to the present day,


human translation has always been an important way to
connect the world. Translation is necessary for the spread of
information, knowledge, and ideas. It is absolutely necessary
for effective and empathetic communication between
different cultures. It is more than just changing the words
from one language to another. As we continue to transition
more and more of our lives online, translation has become an
important way to reach large global audiences who are
looking for information on the internet. For the longest time,
translation was a highly manual process that relied solely on
human labor to accomplish. While human translation
continues to be the most reliable way to translate content, it
takes longer and tends to be more expensive if you’re doing it
for each individual piece of content. Translators had
constraints on the volume of content they could be expected
to accurately translate in a given time, meaning that there
were large volumes of content for which it would be hard to
justify translating based on the time, cost, and effort involved.
Alternative methods of translation have started appearing in
more recent years with the advent of machine translation
(MT) in the 1940s and 50s. Machine translation completely changed the way translation could be done, as it added powerful AI and automation to the translation process. In this introductory chapter, the necessity of doing this thesis on neural machine translation is discussed. The goal of this thesis, its challenges and its contributions are presented here.

1.1 Motivation
At its core, machine translation is fully automated software
that translates content from one language to another. Since a
large portion of the world’s content is inaccessible to people
that don’t speak the original source language, machine
translation can effectively translate content faster and into
more languages. If people could communicate in a single language, then many problems could be solved easily. Machine translation gives us an opportunity to bring all the people of the world under one common language. Machine translation systems are most commonly used when there's a lot of information that needs translation (i.e., hundreds of thousands of words or more). In those situations, traditional human translation wouldn't be feasible due to the sheer volume of content, so we turn to AI. There are multiple types of machine translation, and the accuracy and time taken by the different models vary. We are always looking for more accurate and faster models so that they can translate huge amounts of data in less time. In most machine translation cases, we need to train our model with a paired dataset. This generally limits the ability of the model, and it also takes more time to train on such a dataset. But if we can create a model which can be trained with an unpaired dataset, it can be more accurate and less time consuming. This will create more opportunities for translation from one language to another.

Using a Generative Adversarial Network (GAN) in machine translation can give us a better and faster way of translating. Here, we can train on an unpaired dataset, so the model will be able to translate different and difficult sets of words. This method can make the job of translation easier.

1.2 Machine Translation


Machine Translation (MT), or automated translation, is a process in which computer software translates text from one language to another without human involvement.

MT works with large amounts of source- and target-language text that are compared and matched against each other by a machine translation engine. We differentiate three types of machine translation methods:

• Rules-based machine translation: It uses grammar and


language rules, developed by language experts, and
dictionaries which can be customized to a specific topic
or industry.
• Statistical machine translation: It does not rely on linguistic rules and words; it learns how to translate by analyzing large amounts of existing human translations.
• Neural machine translation: It teaches itself how to translate by using a large neural network. This method is becoming more and more popular as it provides better results for many language pairs.

Machine translation has some vital benefits.

• Saves time: Machine language translation can save


significant time as it is capable of translating entire text
documents in seconds. However, please bear in mind
that human translators should always post-edit
translations done by MTs.
• Reduces costs: Machine Translation can substantially
lower your costs, as it requires less human involvement.
• Memorizes terms: Another benefit of machine language
translation is its ability to memorize key terms and
reuse them wherever they might fit.

While both statistical and neural MT use huge datasets of
translated sentences to teach software to find the best
translation, the models themselves are different. Statistical
MT translates sentences by breaking them up into phrases,
translating the pieces, then trying to stitch those translations
back together. Neural MT, on the other hand, uses neural
networks to consider whole sentences when predicting
translations, which allows it to take into account the context
in which each word and phrase is used.

Figure 1.1 : Graph of Quality of different translation models

From the chart above, we can see that neural machine translation is currently the state-of-the-art technology in machine translation and offers the highest quality translation.

1.3 Machine Learning

Artificial Intelligence (AI) is a science devoted to making


machines think and act like humans. Machine Learning is a
subset of artificial intelligence focusing on a specific goal:
setting computers up to be able to perform tasks without the
need for explicit programming. Computers are fed structured
data (in most cases) and ‘learn’ to become better at
evaluating and acting on that data over time. There are many
uses of machine learning, so there is no shortage of machine
learning algorithms.
There are four types of machine learning algorithms:
supervised, semi-supervised, unsupervised and
reinforcement.

• Supervised Learning: It is a subset of machine learning


that requires the most ongoing human participation —
hence the name ‘supervised’. The computer is fed
training data and a model explicitly designed to ‘teach’ it
how to respond to the data. Once the model is in place,
more data can be fed into the computer to see how well
it responds — and the programmer can confirm
accurate predictions, or can issue corrections for any
incorrect responses. Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
Supervised learning can be separated into three types of


problems when data mining—classification, regression and
forecasting:

Classification: It uses an algorithm to accurately assign


test data into specific categories. It recognizes specific
entities within the dataset and attempts to draw some
conclusions on how those entities should be labeled or
defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest.

Regression: It is used to understand the relationship


between dependent and independent variables. It is
commonly used to make projections, such as for sales
revenue for a given business. Linear
regression, logistical regression, and polynomial
regression are popular regression algorithms.

Forecasting: Forecasting is the process of making
predictions about the future based on the past and
present data, and is commonly used to analyse trends.

• Semi-supervised Learning: In semi-supervised learning,


the computer is fed a mixture of correctly labeled data
and unlabeled data, and searches for patterns on its
own. The labeled data serves as ‘guidance’ from the
programmer, but they do not issue ongoing corrections.
By using this combination, machine learning algorithms
can learn to label unlabeled data.

• Unsupervised Learning: Unsupervised learning, also


known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabeled
datasets. These algorithms discover hidden patterns or
data groupings without the need for human
intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition. The algorithm tries to organize the data in some way to describe its structure. This might mean grouping the data into clusters or arranging it in a way that looks more organized. As it assesses more data, its ability to make decisions on that data gradually improves and becomes more refined.

Unsupervised learning models are utilized for three main


tasks—clustering, association, and dimensionality reduction.

Clustering: It is a data mining technique which groups


unlabeled data based on their similarities or differences.
Clustering algorithms are used to process raw, unclassified
data objects into groups represented by structures or
patterns in the information. Clustering algorithms can be
categorized into a few types, specifically exclusive,
overlapping, hierarchical, and probabilistic.

Association Rule: An association rule is a rule-based method


for finding relationships between variables in a given dataset.
These methods are frequently used for market basket
analysis, allowing companies to better understand
relationships between different products. Understanding
consumption habits of customers enables businesses to
develop better cross-selling strategies and recommendation
engines.

Dimensionality reduction: While more data generally yields


more accurate results, it can also impact the performance of machine learning algorithms (e.g. over-fitting), and it can also make it difficult to visualize datasets. Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the integrity of the dataset as much as possible. It is commonly used in the data preprocessing stage.

Reinforcement learning: Reinforcement learning focuses on


regimented learning processes, where a machine learning
algorithm is provided with a set of actions, parameters and
end values. By defining the rules, the machine learning
algorithm then tries to explore different options and
possibilities, monitoring and evaluating each result to
determine which one is optimal. Reinforcement learning teaches the machine through trial and error. It learns from past experiences and begins to adapt its approach in response to the situation to achieve the best possible result.

1.4 Challenges In NMT

Neural Machine Translation (NMT) is difficult for many


reasons. There are many challenges when we work on neural
machine translation. A known challenge in translation is that in different domains, words have different translations and meaning is expressed in different styles.
in developing machine translation systems targeted at a
specific use case is domain adaptation. A well-known
property is that increasing amounts of training data lead to
better results. A small sample size is a barrier to getting better accuracy from machine learning algorithms. Moreover, most datasets are highly imbalanced, meaning the number of word samples in one language is not the same as the number of word samples in the other language. Imbalanced data causes biased learning, and therefore predictions by these models will be biased as well, which degrades the accuracy of the model. But we need to improve performance with smaller amounts of training data, because sometimes a large amount of data cannot be found. NMT can also perform poorly on rare words, which affects the results. Another flaw of
encoder-decoder NMT models is the inability to properly
translate long sentences. Word Alignment is another problem
which needs to be overcome. There are other challenges
when it comes to Decoding. The task of decoding is to find the
full sentence translation with the highest probability. Despite
its recent successes, neural machine translation still has to
overcome various challenges, most notably performance out-
of-domain and under low resource conditions.

The main challenges are given below:

• Lack of existing work: GANs have mainly been used for generating images, audio and video - continuous data. There has not been much work on applying them to discrete data such as text. As seen in the literature survey, there have been few attempts to use a GAN to generate independent text; rather, there are restricted uses such as MedGAN and CSGAN-NMT. We aim to address this gap with this thesis, thus providing a baseline for future research.

• The vanishing gradient problem: Sometimes the discriminator becomes so successful that it rejects anything the generator makes, halting the learning process of the generator. This is due to the discrete nature of the text. In normal circumstances, such as with images or audio, the data lie in a continuous space. To put it in the words of Dr. Goodfellow, the creator of GANs, you can add a minute increase to "1" to obtain "1.001", but you can't add 0.001 to the word "Penguin". However, if we convert these words into embeddings, then this might just be possible. Existing word embeddings such as Word2Vec and GloVe associate similar words with each other; the popular Word2Vec example is that it associates "King" with "Male" and "Queen" with "Female".

• Training time: It takes a long time to train even a single epoch, and we need to train for tens, hundreds or even thousands of epochs. The DCGAN architecture trained on the MNIST dataset requires about 4000 epochs of training to achieve decent results. Since we are working with text data, we won't need as much time, but each individual epoch still takes a long time. This depends more on the type of input we use than on the network. If we train the network on individual characters, training becomes extremely time consuming; even normal RNNs such as Karpathy's Char-RNN take a long time to train. Word-level models, on the other hand, train much faster but have considerably worse performance. Due to the lack of computational power, we plan to use word-level methods and tweak them to achieve a significant measure of performance.

• Preprocessing: There are a few approaches to preprocessing. One major consideration is whether to use a character-level model or a word-level model. As mentioned earlier, the premier character-level model is Karpathy's Char-RNN. It takes a lot of time to train, though it demonstrates astounding results. Word-level models have considerably worse performance but require much less training time; however, tweaking the model is difficult and requires a lot of work. In addition, the representation of the words is also a consideration. Using a word embedding such as GloVe or Word2Vec might lead to better outputs at the cost of speed. On the other hand, simply converting each individual word into a number and using that as an embedding leads to faster training but worse outputs.

• Evaluation: There is no proper way for the generator to test for "correct" language on its own. While metrics exist for other tasks, no metric exists to properly evaluate whether the language generated by the model is "human-like". This is why we bring in the discriminator in the first place: since such a metric doesn't exist, we train a network to do it. This means that the discriminator is of the utmost importance; if the discriminator is not trained properly, the network will fail. We plan to take steps to ensure that the discriminator is properly trained and does not fail at any stage.

1.5 Thesis Contribution


The challenges of machine translation are presented in Section 1.4. In this thesis, ways of handling these problems are presented. Moreover, this thesis shows the effectiveness of using unsupervised data to perform neural machine translation. CycleGAN, which is a way to handle unsupervised data properly, is used in this thesis. The challenges are handled in a way that improves accuracy, and a more flexible and accurate method is introduced. Finally, the thesis work achieves better accuracy in neural machine translation.

1.6 Thesis Organization

This undergraduate thesis report consists of six chapters. The task of neural machine translation and the motivation for this thesis are discussed in chapter 1 (Introduction); the challenges and contributions in this context are also discussed in the same chapter. In the second chapter, the methods of machine translation are discussed. In chapter 3, the literature review is presented. In chapter 4, the proposed method, along with the dataset and architecture, is discussed. The experimentation and results on neural machine translation are analyzed in the next chapter, chapter 5 (Results and Analysis). Finally, the report ends in chapter 6 (Future Work and Conclusions) with concluding remarks and the scope for further improvement of the proposed method.

2| Methods of Machine
Translation

Machine translation (MT) is an important sub-field


of natural language processing that aims to translate
natural languages using computers. In recent years,
end-to-end neural machine translation (NMT) has
achieved great success and has become the new
mainstream method in practical MT systems. Here,
we provide a broad review of the methods for NMT
and focus on methods relating to architectures,
decoding, and data augmentation.

Early solutions took the form of rule-based systems


where rules were programmed in by a human, termed
rule-based machine translation (RBMT). With
advances in statistical methods, using data to learn
these rules and to resolve ambiguity in rules through
context has been attempted by a class of methods
under the umbrella of Statistical Machine Translation
(SMT). Another class of solutions proposed
prediction of the target sentence, from several
examples, called example-based MT (EBMT).

2.1 Statistical Machine
Translation
A statistical MT model uses the following noisy-channel formulation for a source sequence x and a target sequence y:

$$\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} P(x \mid y)\, P(y)$$

The probabilities $P(x \mid y)$ and $P(y)$ are estimated using frequencies of phrase or word units. The collection of phrases is restricted to a few words to make the computation of these probabilities feasible, decomposing them into products of factors. The computations of frequency tables were parallelized over multiple CPU cores and computers, providing the earliest usable translation systems for the public. SMT brought about significant improvements to automatic translation, to the point that it was deployed in popular online services like Google Translate.
Parallel to the success of neural networks in image recognition, speech recognition, etc., deep neural networks (DNNs) found widespread use and had evolved into the de facto method by the time of this work. The class of solutions that uses deep neural networks to learn a translation model from data is widely known as Neural Machine Translation (NMT), which is used extensively in this work. NMT is a different, GPU-heavy paradigm involving neural networks that learns by backpropagation, unlike the frequency-based procedures of SMT. We discuss the components that make up modern NMT in detail ahead.

2.2 Neural Machine Translation

This section introduces and describes the building blocks of the NMT approaches. First, machine translation is cast as a sequence-to-sequence learning problem. With such a formulation, methods to decompose sentences into sequences of meaningful units are required to make implementation feasible.

2.2.1 Sequence To Sequence Learning

A sequence-to-sequence model aims to map a fixed-length input to a fixed-length output where the lengths of the input and output may differ.

For example, translating “What are you doing today?”


from English to Chinese has input of 5 words and
output of 7 symbols (今天你在做什麼?). Clearly, we

can’t use a regular LSTM network to map each word


from the English sentence to the Chinese sentence.

This is why the sequence to sequence model is used


to address problems like that one.

Working Principle of Sequence to Sequence Model:

Figure 2.1: Encoder-Decoder Sequence to Sequence Model

The model consists of 3 parts: encoder, intermediate


(encoder) vector and decoder.

Encoder:

• A stack of several recurrent units (LSTM or GRU


cells for better performance) where each accepts a single element of the input sequence, collects information for that element and propagates it forward.

• In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as $x_t$, where $t$ is the order of that word.

• The hidden states $h_t$ are computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$

This simple formula represents an ordinary recurrent neural network: we just apply the appropriate weights to the previous hidden state $h_{t-1}$ and the input vector $x_t$.

Encoder Vector

• This is the final hidden state produced from the


encoder part of the model. It is calculated using
the formula above.

• This vector aims to encapsulate the information


for all input elements in order to help the decoder
make accurate predictions.

• It acts as the initial hidden state of the decoder
part of the model.

Decoder

• A stack of several recurrent units where each predicts an output $y_t$ at a time step $t$.

• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.

• In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as $y_t$, where $t$ is the order of that word.

• Any hidden state $h_t$ is computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1}\right)$$

As you can see, we are just using the previous hidden state to compute the next one.

• The output $y_t$ at time step $t$ is computed using the formula:

$$y_t = \mathrm{softmax}\left(W^{(S)} h_t\right)$$

We calculate the outputs using the hidden state at the current time step together with the respective weight $W^{(S)}$. Softmax is used to create a probability vector which helps us determine the final output (e.g. a word in the question-answering problem).

The power of this model lies in the fact that it can


map sequences of different lengths to each other. As
you can see the inputs and outputs are not correlated
and their lengths can differ. This opens a whole new
range of problems which can now be solved using
such architecture.
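
To make this data flow concrete, a minimal sketch of an encoder-decoder model in Keras is given below. It follows the structure just described (a GRU encoder producing a state, a GRU decoder initialized from that state, and a softmax output); the vocabulary sizes, embedding width and hidden size are illustrative placeholders rather than the configuration used later in this thesis.

```python
# Minimal encoder-decoder (sequence to sequence) sketch in Keras.
# All sizes are illustrative placeholders, not the thesis configuration.
from tensorflow.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, emb_dim, hidden = 5000, 5000, 128, 256

# Encoder: reads the source tokens and returns its final hidden state (the encoder vector).
enc_in = Input(shape=(None,), name="source_tokens")
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
_, enc_state = GRU(hidden, return_state=True)(enc_emb)

# Decoder: starts from the encoder state and predicts the target tokens step by step.
dec_in = Input(shape=(None,), name="target_tokens")
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _ = GRU(hidden, return_sequences=True, return_state=True)(dec_emb, initial_state=enc_state)
probs = Dense(tgt_vocab, activation="softmax")(dec_out)  # probability over the target vocabulary

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```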

2.2.2 Recurrent Neural Network (RNN)

Recurrent neural networks (RNN) are a class of neural


networks that are helpful in modeling sequence data. Derived
from feed-forward networks, RNNs exhibit similar behavior
to how human brains function. Simply put: recurrent neural
networks produce predictive results in sequential data that
other algorithms can’t.

Figure 2.2 : Recurrent Neural Network

2.3 Attention Mechanism
The introduction of attention mechanism (Bahdanau et al.,
2015) is a milestone in NMT architecture research. The
attention network computes the relevance of each value
vector based on queries and keys. This can also be interpreted
as a content-based addressing scheme (Graves et al., 2014).
Formally, given a set of $m$ query vectors $Q \in \mathbb{R}^{m \times d}$, a set of $n$ key vectors $K \in \mathbb{R}^{n \times d}$ and associated value vectors $V \in \mathbb{R}^{n \times d}$, the computation of the attention network involves two steps. The first step is to compute the relevance between queries and keys, which is formally described as:

$$R = \mathrm{score}(Q, K)$$

where score is a scoring function with several alternatives, and $R \in \mathbb{R}^{m \times n}$ is a matrix storing the relevance score between each query and key. The next step is to compute the output vectors.

For each query vector, the corresponding output vector is expressed as a weighted sum of the value vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(R)\, V$$

Based on the scoring function, attention networks can be roughly classified into two categories: additive attention (Bahdanau et al., 2015) and dot-product attention (Luong et al., 2015). Additive attention models the score through a feed-forward neural network, while dot-product attention uses the dot product to compute the matching score. In practice, dot-product attention is much faster than additive attention.
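
As a small illustration of these two steps, the following NumPy sketch computes dot-product attention. The dimensions are arbitrary, and the division by the square root of d follows the common scaled dot-product variant; nothing here is specific to this thesis.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Q: (m, d) queries, K: (n, d) keys, V: (n, d) values."""
    d = Q.shape[-1]
    R = Q @ K.T / np.sqrt(d)                   # relevance scores R = score(Q, K), shape (m, n)
    W = np.exp(R - R.max(axis=-1, keepdims=True))
    W = W / W.sum(axis=-1, keepdims=True)      # row-wise softmax
    return W @ V                               # weighted sum of value vectors, shape (m, d)

Q = np.random.randn(2, 8)   # two query vectors
K = np.random.randn(5, 8)   # five key vectors
V = np.random.randn(5, 8)   # associated value vectors
print(dot_product_attention(Q, K, V).shape)    # (2, 8)
```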

2.4 Generative Adversarial
Network
Generative Adversarial Networks, or GANs for short, are an
approach to generative modeling using deep learning
methods, such as convolutional neural networks.

This is another approach that can be used in neural machine translation.

Generative modeling is an unsupervised learning task in


machine learning that involves automatically discovering and
learning the regularities or patterns in input data in such a
way that the model can be used to generate or output new
examples that plausibly could have been drawn from the
original dataset.

GANs are a clever way of training a generative model by


framing the problem as a supervised learning problem with
two sub-models: the generator model that we train to
generate new examples, and the discriminator model that
tries to classify examples as either real (from the domain) or
fake (generated). The two models are trained together in a
zero-sum game, adversarial, until the discriminator model is
fooled about half the time, meaning the generator model is
generating plausible examples.
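
The sketch below illustrates this alternating training scheme on a toy one-dimensional problem: the "real" data are samples from a Gaussian, the generator maps noise to a single value, and the discriminator outputs a real/fake probability. The model shapes, data distribution and number of steps are assumptions made purely for illustration and are not part of this thesis.

```python
import numpy as np
from tensorflow.keras import layers, models

latent_dim = 8
generator = models.Sequential([layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
                               layers.Dense(1)])
discriminator = models.Sequential([layers.Dense(16, activation="relu", input_shape=(1,)),
                                   layers.Dense(1, activation="sigmoid")])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to update the generator while the discriminator is frozen.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

for step in range(1000):
    noise = np.random.randn(64, latent_dim)
    fake = generator.predict(noise, verbose=0)
    real = np.random.normal(3.0, 1.0, size=(64, 1))       # toy "real" data
    # 1) Train the discriminator to separate real samples from generated ones.
    discriminator.train_on_batch(np.vstack([real, fake]),
                                 np.vstack([np.ones((64, 1)), np.zeros((64, 1))]))
    # 2) Train the generator (through the frozen discriminator) to be classified as real.
    gan.train_on_batch(noise, np.ones((64, 1)))
```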

2.5 Chapter Summary
Researchers from various fields are working with neural
machine translation. There are many methods that can be
applied in NMT. Here, we have tried to focus on the methods
that we are going to be using in our thesis. Sequence to
Sequence model, Attention mechanism and GANs are needed
for our proposed method.

3 | Literature Review

3.1 GENERATIVE ADVERSARIAL TRAINING FOR


NEURAL MACHINE TRANSLATION

Yang, Z., Chen, W., Wang, F. and Xu, B. have used a conditional
sequence generative adversarial network for neural machine
translation (CSGAN-NMT). The proposed model consists of two
sub-models, a generator and a discriminator. The generator
generates text based on a source language. And the discriminator
evaluates this translation by predicting how probable it is that this
is the correct translation. To reach a Nash equilibrium, a gamified minimax process is played between the two sub-models until they arrive at a win-win situation.

The generator consisted of an encoder-decoder system of Gated


Recurrent Units (GRUs) with 512 hidden units. These measures were chosen so as to prevent a manually designed loss function from misleading the model into generating suboptimal translations. They evaluated their network on the NIST Chinese-English dataset, and to further test the effectiveness of the approach, results were also provided on an English-German translation task. Beam search was utilized for generating translations.

The beam width was set at 10, and log-likelihood scores were not normalized by sentence length. The models were implemented in TensorFlow, with synchronous training on up to four K80 GPUs in a multi-GPU setup on a single machine. Experimental results achieved with the proposed CSGAN-NMT model show significant improvement over older works. They also considered a variant called multi-CSGAN-NMT, a scenario with multiple generators and discriminators that achieves remarkable results, where each generator can be considered an agent that has the ability to interact with other generator agents and even send messages.

The use of two independent discriminators allowed the generators to learn far better. Through various tests, the alternate variant showed even more improvement over the initially suggested model. In their tests, they also noticed that discriminators with an accuracy that was too high or too low performed badly.

3.2 AUTOMATIC GENERATION OF NEWS COMMENTS


BASED ON GATED ATTENTION NEURAL NETWORKS

Zheng, H., Wang, W., Chen, W., and Sangaiah, A. have proposed a gated attention neural network model (GANN) that comprises two main elements. The first is a comment generator built on an encoder-decoder framework, where the encoder component converts all words of the title into one-hot vectors and obtains their embedding representations by multiplying with the embedding matrix.
generator is triggered by the last hidden vector of the title. Similar
to the encoder, the model converts the sequence of comment
words into one-hot vectors and gets their low-dimensional
representations through the shared embedding matrix.
Introduction of modules such as the gated attention mechanism
and a relevance control module is done, so as to guarantee the
contextual relevance between comments and news, by assigning
different weights to different parts of contextual information,
which has proven to improve the performance. The second
element is a comment discriminator, which is used to improve the
accuracy of comment generation. This is a concept inspired by Generative Adversarial Networks (GANs). The various tests performed on a large dataset show the effectiveness of GANN compared to other generators, and the generated news comments were found to be close to human comments.

The widespread adoption of electronic health records by healthcare organisations, along with the increase in the quality and quantity of data, has motivated computational advances in medical research. However, there are various concerns over privacy which limit the access to and collaborative use of this data.

3.3 ADVERSARIAL FEATURE MATCHING FOR TEXT


GENERATION

Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. have proposed a framework for generating realistic text via adversarial training. It employs the conventional architecture of Generative Adversarial Networks (GANs), having a generator and a discriminator, where a long short-term memory network (LSTM) is utilized as the generator and a convolutional network as the discriminator. Instead of using the standard GAN objective, the framework proposes matching the high-dimensional latent feature distributions of real and synthetic sentences, undertaken via a kernelized discrepancy metric. With the proposed framework modules, it alleviates the mode-collapsing problem and thus eases adversarial training. This particular
model delivers superior performance compared to the other
related approaches. It not only produces realistic sentences, but it
also enables the learned latent representation space to smoothly
encode plausible sentences. The methods that were employed were quantitatively evaluated against baseline models and existing methods as benchmarks, and the results indicate superior performance of the proposed methods.

4 | Proposed method

There are several methods for performing neural machine translation; sequence-to-sequence translation and translation with an attention mechanism are among them. In this thesis, we perform neural machine translation with a CycleGAN.

The CycleGAN is a technique that involves the automatic training of word-to-word translation models without paired examples. The models are trained in an unsupervised manner using a collection of words from the source and target domains that do not need to be related in any way. Training a model for word-to-word translation typically requires a large dataset of paired examples. Such datasets can be difficult and expensive to prepare, and in some cases impossible, because some languages have very few dataset resources.

In this thesis, we also perform sequence-to-sequence translation with and without attention, and then compare the results with our CycleGAN-based neural machine translation model.

The CycleGAN model was described by Jun-Yan Zhu, et al. in


their 2017 paper titled “Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks.”

4.1 CycleGAN in NMT

The benefit of the CycleGAN model is that it can be trained


without paired examples. That is, it does not require
examples of sentences before and after the translation in
order to train the model.

The model architecture comprises two generator models: one generator (Generator-A) for generating sentences in the first domain (Domain-A) and a second generator (Generator-B) for generating sentences in the second domain (Domain-B).

• Generator-A -> Domain-A


• Generator-B -> Domain-B
The generator models perform neural machine translation.
Generator-A takes a sentence from Domain-B as input and
Generator-B takes a sentence from Domain-A as input.

• Domain-B -> Generator-A -> Domain-A


• Domain-A -> Generator-B -> Domain-B

Each generator has a corresponding discriminator model. The


first discriminator model (Discriminator-A) takes real
sentences from Domain-A and generated sentences from
Generator-A and predicts whether they are real or fake. The
second discriminator model (Discriminator-B) takes real
sentences from Domain-B and generated sentences from
Generator-B and predicts whether they are real or fake.

• Domain-A -> Discriminator-A -> [Real/Fake]


• Domain-B -> Generator-A -> Discriminator-A -> [Real/Fake]
• Domain-B -> Discriminator-B -> [Real/Fake]
• Domain-A -> Generator-B -> Discriminator-B -> [Real/Fake]

The discriminator and generator models are trained in an adversarial zero-sum process, like normal GAN models. The generators learn to better fool the discriminators, and the discriminators learn to better detect fake sentences. Together, the models find an equilibrium during the training process.

Additionally, the generator models are regularized so that they do not just create new sentences in the target domain, but instead produce translated versions of the input sentences from the source domain. This is achieved by using generated sentences as input to the corresponding generator model and comparing the output sentence to the original sentence. Passing a sentence through both generators is called a cycle. Together, the pair of generator models is trained to better reproduce the original source sentence, a property referred to as cycle consistency.

• Domain-B -> Generator-A -> Domain-A -> Generator-B ->


Domain-B
• Domain-A -> Generator-B -> Domain-B -> Generator-A ->
Domain-A

There is one further element to the architecture, referred to as the identity mapping. This is where a generator is provided with a sentence from the target domain as input and is expected to generate the same sentence without change. This addition to the architecture is optional.

• Domain-A -> Generator-A -> Domain-A


• Domain-B -> Generator-B -> Domain-B
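
The sketch below shows how the adversarial, cycle-consistency and identity terms described above could be combined into a single generator update. It is a simplification under stated assumptions: the generators and discriminators are small dense networks over fixed-length embedding vectors rather than sentence models, the loss weights are illustrative, and the corresponding discriminator update is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, losses, optimizers

dim = 32  # toy embedding size (illustrative)

def make_generator():
    return models.Sequential([layers.Dense(64, activation="relu", input_shape=(dim,)),
                              layers.Dense(dim)])

def make_discriminator():
    return models.Sequential([layers.Dense(64, activation="relu", input_shape=(dim,)),
                              layers.Dense(1, activation="sigmoid")])

gen_AB, gen_BA = make_generator(), make_generator()   # Domain-A -> Domain-B and Domain-B -> Domain-A
disc_A, disc_B = make_discriminator(), make_discriminator()
bce, mae = losses.BinaryCrossentropy(), losses.MeanAbsoluteError()
opt = optimizers.Adam(2e-4)
lambda_cycle, lambda_identity = 10.0, 5.0              # illustrative loss weights

@tf.function
def train_generators(real_A, real_B):
    with tf.GradientTape() as tape:
        fake_B, fake_A = gen_AB(real_A), gen_BA(real_B)
        # Adversarial terms: each generator tries to make its discriminator output "real".
        adv = bce(tf.ones_like(disc_B(fake_B)), disc_B(fake_B)) + \
              bce(tf.ones_like(disc_A(fake_A)), disc_A(fake_A))
        # Cycle consistency: A -> B -> A and B -> A -> B should reconstruct the input.
        cycle = mae(real_A, gen_BA(fake_B)) + mae(real_B, gen_AB(fake_A))
        # Identity mapping: feeding a target-domain sample should leave it unchanged.
        identity = mae(real_B, gen_AB(real_B)) + mae(real_A, gen_BA(real_A))
        loss = adv + lambda_cycle * cycle + lambda_identity * identity
    gen_vars = gen_AB.trainable_variables + gen_BA.trainable_variables
    opt.apply_gradients(zip(tape.gradient(loss, gen_vars), gen_vars))
    return loss

# One illustrative step on random "embeddings"; the discriminator update would be analogous.
print(float(train_generators(tf.random.normal((16, dim)), tf.random.normal((16, dim)))))
```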
4.2 DATASET
Generally, when it comes to machine learning tasks, the data is
required to be of a particular format and must be split into a
“training” and a “testing” set. This is because such tasks usually
involve predicting some variable. However, when it comes to the
generation of language, this is not applicable. There is no variable
to predict. Rather, language is simply generated. With that in
mind, any corpus of text can be used as an input dataset. We have
restricted ourselves to a few well-known datasets in order to build
our model. However, it must be reiterated that the model is
applicable to any corpus of text.

Here, we have used a dataset from the manythings.org website, which can be accessed free of cost. We have used the "fra-eng" corpus, which contains French-English paired data.

Figure 4.1: English-French Dataset

4.3 System Architecture

The overall system architecture is fixed. It consists of just three


modules. The pre-processing module handles cleaning of the input
data and converts it into a machine-friendly representation. The
generator network is responsible for attempting to generate text
while the discriminator network judges the text. Based on the
network’s output, the loss function will propagate and update
either the discriminator network or the generator network. Over
time, each network will learn more and more and hence produce
even better results. This in turn will allow the model to truly
succeed and perhaps even fool human beings.

4.3.1 Preprocessing Module

It is easier for a model to recognize smaller numerical patterns in


order to generate text, rather than incorporating larger words.
That is, a model can understand a sequence of numbers easily. It
has no way of understanding random strings of text. Since
computers operate in binary and not text, all text must be
represented in some form of numbers. However, there are some
challenges to this. We can’t simply convert the text into numbers.
The first thing to consider is case. Is “Word” the same as “word”?
Indeed, it is, but a naive approach would label them both as
different words. So all words must be converted to lowercase. The
next point to consider is the sentences in the text. Foremost, sentences have to be identified. Not all sentences end with a period. Some end with a question mark, others with an
exclamation. There are complex sentences and compound
sentences. English is not a very well structured language. So the
sentences have to be identified and stored. Thirdly, not all
sentences are born equal. Some may be short, while others may be
long. But a model can’t really accept uneven input like that. So we
have to find a suitable standard sentence length that is neither too
long, which would require most sentences to be padded, nor too
short which would require most of the sentences to be truncated.
However, regardless of the chosen sentence length, both padding and truncation will be needed, so functions must be developed for both. Lastly, the actual embedding is to be considered.
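
A minimal sketch of this kind of preprocessing, using the Keras tokenizer, is given below: lowercasing, punctuation stripping, integer encoding, and padding or truncating to a fixed length. The sentence length of 10 is an arbitrary illustrative choice, not the value used in our experiments.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["What are you doing today?", "Machine translation saves time."]

# lower=True folds "Word" and "word" together; the default filters strip punctuation
# such as '?' and '.', so sentence-final marks do not become separate tokens.
tokenizer = Tokenizer(lower=True, oov_token="<unk>")
tokenizer.fit_on_texts(sentences)

seqs = tokenizer.texts_to_sequences(sentences)             # words -> integer ids
padded = pad_sequences(seqs, maxlen=10,
                       padding="post", truncating="post")  # pad short, truncate long sentences
print(padded)
```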

4.3.2 Generator Module

Figure 4.2 : Generator Module

The generator network receives an input of the preprocessed text


corpus that we want to imitate. The input text is received in batches and the generator attempts to imitate the batch. This
generated text is then passed on to the discriminator module. If
the discriminator determines that the generated text is fake, then
the loss function propagates to the generator and the gradient is
updated.

4.3.3 Discriminator Module

The discriminator module receives two inputs. The first is a


randomly chosen sample text from the dataset used. The second
input is text generated by the generator network. Both texts are
pre-processed in the same manner and the discriminator is
provided with the outputs. It does not know which of the two texts
is generated and which is real. Rather it must make a prediction. If
the prediction is right, then the loss propagates through to the
generator network. However, if the discriminator makes an
incorrect prediction, then the gradient will pass through the
discriminator network instead. This will allow the discriminator
to learn and hence perform better against future samples.

Figure 4.3: Discriminator Module

4.4 Activation Functions


For the generator we have used the "softmax" activation, and for the discriminator we have used the "relu" and "sigmoid" functions.
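
As an illustration of where these activations sit, a possible arrangement is sketched below; the layer sizes and vocabulary size are placeholders and not the exact configuration used in our networks.

```python
from tensorflow.keras import layers, models

vocab_size, hidden = 5000, 256  # illustrative placeholders

# Generator output layer: softmax yields a probability distribution over the target vocabulary.
generator_head = layers.Dense(vocab_size, activation="softmax")

# Discriminator: relu hidden layer followed by a sigmoid real/fake probability.
discriminator = models.Sequential([
    layers.Dense(hidden, activation="relu", input_shape=(hidden,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.summary()
```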

4.5 Evaluation
When it comes to generated text, there are no well-established metrics (Novikova et al.) to evaluate its quality. The best, and perhaps only, proper way of evaluating generated text is via human judgement. This is an expensive and time-consuming task, and for the purposes of our study we use a small collection of individuals for evaluation. For neural machine translation itself, we have used the BLEU score.
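
For reference, the BLEU score can be computed with NLTK as in the sketch below; the reference and hypothesis sentences are made up purely for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each entry in `references` is a list of acceptable token lists for one sentence;
# `hypotheses` holds the corresponding model translations.
references = [[["je", "suis", "etudiant"]], [["il", "fait", "froid", "aujourd'hui"]]]
hypotheses = [["je", "suis", "etudiant"], ["il", "fait", "tres", "froid"]]

smooth = SmoothingFunction().method1  # avoids zero scores on very short sentences
print("Corpus BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))
```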

5 | Results and Analysis
Here we have performed neural machine translation with a sequence-to-sequence model, both without and with attention. Then we used our proposed model to perform neural machine translation using CycleGAN.

We have produced loss and accuracy vs. epoch graphs for the training and validation sets, and we have also computed the BLEU score of the different models.

5.1 Model Loss Graphs

Figure 5.1: Train loss and validation loss (with attention)

In this process, we used attention. Here, we can see that the loss for the validation set is below 1 and for the training set it is below 0.5.

Figure 5.2 : Train loss and validation loss (without attention)

In this process, we did not use attention. Here, we can see that the loss for the validation set is above 2 and for the training set it is just below 2.

5.2 Model Accuracy Graphs:


Figure 5.4: (a) model accuracy graph with attention & (b) model accuracy graph with CycleGAN

From the above graphs, we can see that the model with CycleGAN has higher accuracy than the model with attention. At 30 epochs, the model with attention has nearly 77% accuracy, whereas the model with CycleGAN has nearly 86% accuracy on the validation set.

The graphs clearly show that the CycleGAN model outperforms the attention model.

5.3 Comparison with BLEU score on


Validation Sets

NMT Model                                      BLEU Score
Sequence to Sequence model                     0.77
Sequence to Sequence model with attention      0.83
NMT with CycleGAN                              0.87

Table 1: BLEU scores of the different models

From the BLEU scores we can see that neural machine translation with CycleGAN is the best of the three models.

CycleGAN creates a lot of opportunities for neural machine translation because it can be trained with unsupervised data, so languages which have few resources can be trained easily and machine translation can still be performed. Its impact is therefore greater than that of the other two models.

6 | Future work and Conclusions

6.1 Future work

There is a lot of scope for future work to be done in this domain.


There are two distinct routes for future work. The first is methods
to increase the performance of the GAN itself. The second is in
using this GAN for other, greater, purposes. Simply training the
model for longer periods of time might very well lead to incredible
results. In addition, we have used a relatively simple network
architecture in order to establish a baseline.

Expanding this network, either by making it wider or by making it


deeper, may lead to the network learning in a better and faster
manner.
While generated sentences are somewhat relevant, they do tend to suddenly spout gibberish, so using a relevance module is an appealing option. This would allow us to control what kind of text is generated, which is essential in order to maintain a coherent dialogue or text for any kind of creative endeavour. This model can also be trained on longer sentences; longer sentences are difficult to train on, so improvement can be made in this area.

6.2 Conclusion

In this thesis we have explored the use of Cycle Generative


Adversarial Networks in the domain of natural language
generation. While it does present an interesting alternative to
traditional text generation methodologies, it still requires a lot of
work before it can be deemed as viable. However, the CycleGAN
approach does have its own advantages. Primarily, using a word
level model is much faster than the character level models
traditionally used. It also means that the words themselves do not
need to be learned, just the meaning. In addition, the CycleGAN is
generalized and can be adapted to any situation, whether it is in
generating music lyrics, game dialogue or full novels.
Keeping the importance of neural machine translation in mind,
we can say that use of CycleGAN in this sector will improve overall
neural machine translation.

Bibliography
1. Minh-Thang Luong, Hieu Pham, and Christopher Manning. Effective
approaches to attention-based neural machine translation. arXiv, 2015.

2. Fedus, W., Goodfellow, I.J., and Dai, A.M. (2018). 'MaskGAN: Better Text
Generation via Filling in the ______.' CoRR abs/1801.07736.

3. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A.C., and Bengio, Y.(2014). 'Generative Adversarial Nets'.
Advances in Neural Information Processing Systems, pp. 2672- 2680.

4. Liang, C., Yang, X., Wham, D., Pursel, B., Passonneaur, R., and Giles, C.
(2017). 'Distractor Generation with Generative Adversarial Nets for
Automatically Creating Fill-in-the-blank Questions', Proceedings of the
Knowledge Capture Conference, Article 33.

5. Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013). 'Efficient Estimation
of Word Representations in Vector Space.' CoRR abs/1301.3781.

6. Novikova, J., Dušek, O., Curry, A.C., and Rieser, V. (2017). 'Why We Need
New Evaluation Metrics for NLG'. Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing, pp. 2231-2242.

7. Pennington, J., Socher, R., and Manning, C. (2014). 'Glove: Global Vectors for
Word Representation'. Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 1532- 1543.

8. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen,
X. (2016). 'Improved Techniques for Training GANs.' Proceedings of the 30th
International Conference on Neural Information Processing Systems, pp.
2234-2242.

9. Yang, Z., Chen, W., Wang, F., and Xu, B. (2018). 'Generative Adversarial
Training for Neural Machine Translation'. Neurocomputing, Vol. 321, pp. 146-
155.

10. https://becominghuman.ai/what-is-deep-learning-and-why-you-need-it-9e2fc0f0e61b

15. https://bloomberg.github.io/foml/#home

11. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

18. www.python.org

19. https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45

20. https://machinelearningmastery.com/cyclegan-tutorial-with-keras/
