Thesis Book 2
Technology
Department of Electrical and Electronic Engineering
By
Declaration of Authorship
We declare that this thesis, titled "Neural Machine Translation with Generative Adversarial Network", and the work presented in it are our own. We confirm that –
Certificate of Approval
Acknowledgement
We would like to heartily thank all of our friends - a list too large to write here, but they know who they are - for their support, both technical and motivational. They provided a support network which allowed us to stay sane through our years as undergraduates. Most of all, we thank our parents: their endless love, support, patience and encouragement enabled us to come this far. Words are not enough to thank them all; all we can muster is a solemn and grateful pause.
Contents
Abstract...........................................................................................................................................................i
Declaration of Authorship ..................................................................................................................... ii
Certificate of Approval........................................................................................................................... iii
Acknowledgement ................................................................................................................................... iv
Contents .........................................................................................................................................................v
List of Figures ........................................................................................................................................ vii
1.1 Motivation ............................................................................................................................................ 2
1.2 Machine Translation ....................................................................................................................... 3
1.4 Challenges In NMT .........................................................................................................................10
1.5 Thesis Contribution ........................................................................................................................14
1.6 Thesis Organization......................................................................................................................15
2| Methods of Machine Translation ................................................................................................16
2.1 Statistical Machine Translation ...............................................................................................17
2.2 Neural Machine Translation .....................................................................................................18
2.2.1 Sequence To Sequence Learning .........................................................................................19
2.2.2 Recurrent Neural Network (RNN) .....................................................................................24
2.3 Attention Mechanism ..................................................................................................................25
2.4 Generative Adversarial Network ............................................................................................27
2.5 Chapter Summary ........................................................................................................................28
3 | Literature Review...........................................................................................................................28
3.1 Generative Adversarial Training for Neural Machine Translation .......................................28
3.2 Automatic Generation of News Comments Based on Gated Attention Neural Networks ......30
3.3 Adversarial Feature Matching for Text Generation .................................................................31
4 | Proposed method ..........................................................................................................................32
4.1 CycleGAN in NMT ............................................................................................................................33
4.2 Dataset ...........................................................................................................................................36
4.3 System Architecture...................................................................................................................37
4.3.1 Preprocessing Module ..............................................................................................................37
4.3.2 Generator Module........................................................................................................................38
4.3.3 Discriminator Module ...............................................................................................................39
4.4 Activation Functions ..................................................................................................................... 40
4.5 Evaluation ........................................................................................................................................ 40
5 | Results and Analysis ..................................................................................................................... 41
5.1 Model Loss Graphs.......................................................................................................................... 41
5.2 Model Accuracy Graphs ............................................................................................................... 42
5.3 Comparison with BLEU score on Validation Sets .............................................................. 43
6 | Future work and Conclusions .................................................................................................... 44
6.1 Future work ..................................................................................................................................... 44
6.2 Conclusion ................................................................................................................................... 45
Bibliography ............................................................................................................................................. 46
List of Figures
1| Introduction
changed the way translation could be done, as it added powerful AI and automation to the translation process. In this introductory chapter, the motivation for doing a thesis on neural machine translation is discussed. The goals, challenges and contributions of this thesis are also presented here.
1.1 Motivation
At its core, machine translation is fully automated software that translates content from one language to another. Since a large portion of the world's content is inaccessible to people who do not speak the original source language, machine translation can translate content faster and into more languages. If people could communicate in a single language, many problems could be solved easily, and machine translation gives us an opportunity to bring all the people of the world under one common language. Machine translation systems are most commonly used when there is a large amount of information that needs translation (i.e., hundreds of thousands of words or more). In those situations, traditional human translation is not feasible due to the sheer volume of content, so we turn to AI. There are multiple types of machine translation, and they differ in accuracy and in the time they take to translate. We are always looking for more accurate and faster models, so that huge amounts of data can be translated in less time. In most machine translation settings, we need to train our model on a paired dataset; this generally limits the ability of the model and also makes training more time consuming. If we can instead create a model that can be trained on an unpaired dataset, translation can become more accurate and less time consuming, which will create more opportunities for translation from one language to another.
machine translation engine. We differentiate three types of
machine translation methods:
While both statistical and neural MT use huge datasets of
translated sentences to teach software to find the best
translation, the models themselves are different. Statistical
MT translates sentences by breaking them up into phrases,
translating the pieces, then trying to stitch those translations
back together. Neural MT, on the other hand, uses neural
networks to consider whole sentences when predicting
translations, which allows it to take into account the context
in which each word and phrase is used.
1.3 Machine Learning
training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
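As a toy illustration of this loop (the data, the linear model and the learning rate below are illustrative assumptions, not part of our system), the parameters of a model can be adjusted step by step until a mean-squared-error loss is sufficiently small:

import numpy as np

# Toy supervised-learning loop: fit y = w*x + b by gradient descent on the MSE loss.
x = np.array([0.0, 1.0, 2.0, 3.0])        # inputs
y = np.array([1.0, 3.0, 5.0, 7.0])        # correct outputs (generated by y = 2x + 1)
w, b = 0.0, 0.0                           # model parameters, initially wrong
lr = 0.05                                 # learning rate

for step in range(500):
    pred = w * x + b                      # model predictions
    error = pred - y
    loss = np.mean(error ** 2)            # loss function (mean squared error)
    w -= lr * np.mean(2 * error * x)      # adjust parameters along the negative gradient
    b -= lr * np.mean(2 * error)

print(round(w, 2), round(b, 2), round(loss, 4))   # w and b approach 2 and 1, loss approaches 0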
Forecasting: Forecasting is the process of making
predictions about the future based on the past and
present data, and is commonly used to analyse trends.
make decisions on that data gradually improves and
becomes more refined.
make it difficult to visualize datasets. Dimensionality reduction is a technique used when the number of features, or dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable size while preserving the integrity of the dataset as much as possible, and it is commonly used in the data preprocessing stage.
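As one widely used example of such a technique (shown only for illustration; the data here is random and our system does not depend on this step), principal component analysis (PCA) from scikit-learn reduces many features to a few components while keeping most of the variance:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)                 # 100 samples with 10 features each (illustrative data)

pca = PCA(n_components=2)                   # keep only 2 dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of the original variance preserved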
in different domains, words have different translations, and meaning is expressed in different styles. Hence, a crucial step in developing machine translation systems targeted at a specific use case is domain adaptation. A well-known property is that increasing amounts of training data lead to better results, and a small sample size is a barrier to getting good accuracy from machine learning algorithms. Moreover, most datasets are highly imbalanced, meaning that the number of word samples in one language is not the same as the number of word samples in the other. Imbalanced data causes biased learning, so predictions made by such models will also be biased, which degrades the accuracy of the model. At the same time, we need to improve performance with smaller amounts of training data, because a large amount of data cannot always be found. NMT can also perform poorly on rare words, which affects the results. Another flaw of encoder-decoder NMT models is the inability to properly translate long sentences. Word alignment is another problem that needs to be overcome, and there are further challenges when it comes to decoding: the task of decoding is to find the full sentence translation with the highest probability. Despite its recent successes, neural machine translation still has to overcome various challenges, most notably performance out-of-domain and under low-resource conditions.
• Lack of Existing Work: GANs have mainly been used for generating images, audio and video, i.e. continuous data. There has not been much work on applying them to discrete data such as text. As seen in the literature survey, there have been little to no attempts to use the GAN framework to generate independent text; rather, there are restricted uses such as MedGAN and CSGAN-NMT. We aim to help address this gap by completing our thesis, thus providing a baseline for future research.
The DCGAN architecture trained on the MNIST dataset requires about 4,000 epochs of training to achieve decent results. Since we are working with text data, we will not need as many epochs, but each individual epoch takes a lot of time. This depends more on the type of input we use than on the network itself. If we train the network on individual characters, training becomes extremely time consuming; even ordinary RNNs such as Karpathy's Char-RNN take a long time to train. Word-level models, on the other hand, train much faster but have considerably worse performance. Due to the lack of computational power, we plan to use word-level methods and tweak them to achieve a reasonable measure of performance.
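As a sketch of what word-level input preparation might look like (the tokenizer settings and the example sentences below are assumptions for illustration, not our final pipeline), Keras' Tokenizer maps every distinct word to an integer index:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["the cat sat on the mat", "the dog sat on the log"]

# Word-level tokenization: each distinct word receives one integer id,
# and unseen words are mapped to the out-of-vocabulary token.
tokenizer = Tokenizer(num_words=10000, oov_token="<unk>")
tokenizer.fit_on_texts(sentences)

print(tokenizer.texts_to_sequences(sentences))          # word ids for each sentence
print(tokenizer.texts_to_sequences(["the bird sat"]))   # an unknown word becomes the <unk> id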
and using that as an embedding leads to an increase in performance but worse outputs.
1.6 Thesis Organization
2| Methods of Machine Translation
2.1 Statistical Machine Translation
A statistical MT model uses the following formulation
for a source sequence x and a target sequence y:
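A standard way to write this (the classical noisy-channel formulation, stated here as the assumed form) is:

y* = argmax_y P(y | x) = argmax_y P(x | y) · P(y)

where P(x | y) is a translation model estimated from parallel data and P(y) is a language model over the target language; decoding searches for the target sequence y* that maximizes this product.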
it was deployed in popular online services like Google Translate.
Parallel to the success of neural networks in image recognition, speech recognition and related fields, deep neural networks (DNNs) have found widespread use and had evolved into the de facto method at the time of pursuing this work. The class of solutions that uses deep neural networks to learn a translation model from data is widely known as Neural Machine Translation (NMT), and it is used extensively in this work. NMT is a different, GPU-heavy paradigm built on neural networks that learn by backpropagation, unlike the frequency-based procedures of SMT. We discuss the components that make up modern NMT in detail in the sections ahead.
constituted by meaningful units are required to make
implementation feasible.
Working Principle of Sequence to Sequence Model:
Encoder:
• In a question-answering problem, the input sequence is a collection of all the words from the question. Each word is represented as x_t, where t is the position of that word in the sequence.
Encoder Vector
• It acts as the initial hidden state of the decoder
part of the model.
Decoder
22
As you can see, we are just using the previous hidden
state to compute the next one.
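A minimal way to write this recurrence (the weight matrices W and U are notational assumptions) is:

h_t = f(W · x_t + U · h_(t-1))

where x_t is the input at step t, h_(t-1) is the previous hidden state and f is a non-linear activation such as tanh. The decoder follows the same recurrence, with the encoder vector supplying its initial hidden state.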
2.2.2 Recurrent Neural Network (RNN)
2.3 Attention Mechanism
The introduction of the attention mechanism (Bahdanau et al., 2015) is a milestone in NMT architecture research. The attention network computes the relevance of each value vector based on queries and keys. This can also be interpreted as a content-based addressing scheme (Graves et al., 2014). Formally, given a set of m query vectors Q ∈ R^(m×d), a set of n key vectors K ∈ R^(n×d) and associated value vectors V ∈ R^(n×d), the computation of the attention network involves two steps. The first step is to compute the relevance between queries and keys, which is formally described as:
R = score(Q, K)
Attention(Q, K, V) = softmax(R) · V
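As a concrete sketch of these two steps in code (using the scaled dot product as the score function, which is one common choice among several), a minimal NumPy version looks like this:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Step 1: relevance between queries and keys (here a scaled dot-product score).
    R = Q @ K.T / np.sqrt(Q.shape[-1])        # shape (m, n)
    # Step 2: normalise the relevance scores and take a weighted sum of the values.
    return softmax(R, axis=-1) @ V            # shape (m, d)

m, n, d = 2, 4, 8
Q, K, V = np.random.rand(m, d), np.random.rand(n, d), np.random.rand(n, d)
print(attention(Q, K, V).shape)               # (2, 8)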
2.4 Generative Adversarial Network
Generative Adversarial Networks, or GANs for short, are an
approach to generative modeling using deep learning
methods, such as convolutional neural networks.
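The core idea can be summarised by the two-player minimax objective of Goodfellow et al. (2014), in which a generator G and a discriminator D are trained against each other:

min_G max_D V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log(1 - D(G(z))) ]

The discriminator D learns to distinguish real samples from generated ones, while the generator G learns to produce samples that D classifies as real.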
2.5 Chapter Summary
Researchers from various fields are working with neural
machine translation. There are many methods that can be
applied in NMT. Here, we have tried to focus on the methods
that we are going to be using in our thesis. Sequence to
Sequence model, Attention mechanism and GANs are needed
for our proposed method.
3 | Literature Review
Yang, Z., Chen, W., Wang, F. and Xu, B. have used a conditional sequence generative adversarial network for neural machine translation (CSGAN-NMT). The proposed model consists of two sub-models, a generator and a discriminator. The generator generates text based on a source-language sentence, and the discriminator evaluates this translation by predicting how probable it is that it is the correct translation. To reach a Nash equilibrium, a minimax game is played between the two sub-models.
suggested model. In their tests, they also noticed that discriminators with an accuracy that was either too high or too low performed badly.
Zheng, H., Wang, W., Chen, W., and Sangaiah, A. have proposed a gated attention neural network model (GANN) that comprises two main elements. The first is a comment generator built on an encoder-decoder framework: the encoder converts every word of the news title into a one-hot vector and obtains its embedding representation by multiplying it with an embedding matrix. The decoder component of the generator is initialized with the last hidden vector of the title. Similarly to the encoder, the model converts the sequence of comment words into one-hot vectors and obtains their low-dimensional representations through the shared embedding matrix. Modules such as a gated attention mechanism and a relevance control module are introduced to guarantee the contextual relevance between comments and news by assigning different weights to different parts of the contextual information, which has been shown to improve performance. The second element is a comment discriminator, which is used to improve the accuracy of comment generation; this is a concept inspired by Generative Adversarial Networks (GAN). The various tests performed on a large dataset show the effectiveness of GANN compared to other generators, and the generated news comments were found to be close to human comments.
Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. have proposed a framework for generating realistic text via adversarial training. It employs the conventional architecture of Generative Adversarial Networks (GAN), with a generator and a discriminator, where a long short-term memory network (LSTM) is used as the generator and a convolutional network as the discriminator. Instead of using the standard GAN objective, the framework proposes matching the high-dimensional latent feature distributions of real and synthetic sentences via a kernelized discrepancy metric. With these modules, the framework alleviates the mode-collapsing problem and thus eases adversarial training. The model delivers superior performance compared to related approaches: it not only produces realistic sentences but also enables the learned latent representation space to smoothly encode plausible sentences. The methods were quantitatively evaluated against baseline models and existing methods, and the results indicate superior performance of the proposed approach.
4 | Proposed method
sentences for the first domain (Domain-A) and the second
generator (Generator-B) for generating sentences for the
second domain (Domain-B).
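The constraint that makes training on unpaired data possible is cycle consistency. Written in the usual CycleGAN form (stated here as the assumed formulation, with Generator-A mapping Domain-A to Domain-B and Generator-B mapping back), translating a sentence to the other domain and then back should reproduce the original:

L_cyc = E_{x ~ A}[ || G_B(G_A(x)) - x || ] + E_{y ~ B}[ || G_A(G_B(y)) - y || ]

This reconstruction term is added to the adversarial losses from the two discriminators, so each generator is pushed both to produce convincing sentences in the target domain and to preserve the content of its input.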
4.3 System Architecture
sentences have to be identified. Not all sentences end with a period: some end with a question mark, others with an exclamation mark, and there are complex sentences and compound sentences, so sentence boundaries in English are not always straightforward to detect. The sentences therefore have to be identified and stored. Thirdly, not all sentences are born equal: some may be short, while others may be long, but a model cannot accept uneven input like that. We therefore have to find a suitable standard sentence length that is neither too long, which would require most sentences to be padded, nor too short, which would require most of the sentences to be truncated. However, regardless of the chosen length, both padding and truncation will be needed, so functions have to be written for each. Lastly, the actual embedding has to be considered.
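A minimal sketch of these preprocessing steps (the regular expression, the padding token and the fixed length of 10 are illustrative assumptions rather than the exact values of our pipeline) could look as follows:

import re

PAD = "<pad>"
MAX_LEN = 10                               # assumed standard sentence length

def split_sentences(text):
    # Split on '.', '?' or '!' followed by whitespace; a simplification of real
    # sentence-boundary detection.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def pad_or_truncate(tokens, max_len=MAX_LEN):
    # Truncate long sentences and pad short ones so every input has the same length.
    tokens = tokens[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))

text = "Is this a question? Yes! And this is a rather long declarative example sentence."
for sentence in split_sentences(text):
    print(pad_or_truncate(sentence.lower().split()))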
batches and the generator attempts to imitate the batch. This generated text is then passed on to the discriminator module. If the discriminator determines that the generated text is fake, the loss is propagated back to the generator and its gradients are updated.
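The following Keras sketch shows the shape of this adversarial update (the layer sizes and the random stand-in batches are assumptions for illustration; the actual generator and discriminator are described in Sections 4.3.2 and 4.3.3):

import numpy as np
from tensorflow import keras

LATENT, SEQ_LEN, EMBED, BATCH = 64, 10, 32, 16   # toy sizes only

generator = keras.Sequential([
    keras.layers.Dense(SEQ_LEN * EMBED, activation="tanh", input_shape=(LATENT,)),
    keras.layers.Reshape((SEQ_LEN, EMBED)),
])
discriminator = keras.Sequential([
    keras.layers.Flatten(input_shape=(SEQ_LEN, EMBED)),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Stacked model used only for generator updates; the discriminator is frozen inside it.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_batch = np.random.rand(BATCH, SEQ_LEN, EMBED)       # stands in for embedded real text
noise = np.random.normal(size=(BATCH, LATENT))
fake_batch = generator.predict(noise, verbose=0)

# The discriminator learns to label real batches 1 and generated batches 0 ...
discriminator.train_on_batch(real_batch, np.ones((BATCH, 1)))
discriminator.train_on_batch(fake_batch, np.zeros((BATCH, 1)))

# ... and the generator is updated so that the frozen discriminator is pushed towards
# calling its output real: the loss propagates back into the generator's weights.
gan.train_on_batch(noise, np.ones((BATCH, 1)))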
Figure 4.3: Discriminator Module
4.5 Evaluation
When it comes to generated text, there are no well-established metrics (Novikova et al.) for evaluating the quality of the text. The best, and perhaps the only, proper way of evaluating generated text is via human judgement. This is an expensive and time-consuming task, so for the purposes of our study we use a small group of individuals for evaluation. In addition, we have used the BLEU score for the neural machine translation results.
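As a sketch of how such a score can be computed for a single translated sentence (using NLTK's implementation; the sentences are illustrative only):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]    # list of human reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]     # model output

# Smoothing avoids zero scores when some higher-order n-grams have no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))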
5 | Results and Analysis
Here we have performed neural machine translation with a sequence-to-sequence model, both without attention and with attention. Then, we used our proposed model to perform neural machine translation using CycleGAN.
Figure 5.2: Train loss and validation loss (without attention)
In this process, we did not use attention. Here, we can see that the loss on the validation set is above 2, while on the training set it is just below 2.
Figure 5.4: (a) model accuracy graph with attention and (b) model accuracy graph with CycleGAN
From the above graphs, we can see that the model with CycleGAN has higher accuracy than the model with attention. At 30 epochs, the model with attention reaches nearly 77% accuracy, whereas the model with CycleGAN reaches nearly 86% accuracy on the validation set.
trained easily and machine translation can be performed. Its impact is greater than that of the other two models.
dialogue or text for any kind of creative endeavour. This model can also be trained on longer sentences; longer sentences are difficult to train on, so improvement can be made in this area.
6.2 Conclusion
Bibliography
1. Luong, M.-T., Pham, H., and Manning, C. (2015). 'Effective Approaches to Attention-based Neural Machine Translation'. arXiv.
2. Fedus, W., Goodfellow, I.J., and Dai, A.M. (2018). 'MaskGAN: Better Text Generation via Filling in the ______'. CoRR abs/1801.07736.
3. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., and Bengio, Y. (2014). 'Generative Adversarial Nets'. Advances in Neural Information Processing Systems, pp. 2672-2680.
4. Liang, C., Yang, X., Wham, D., Pursel, B., Passonneau, R., and Giles, C. (2017). 'Distractor Generation with Generative Adversarial Nets for Automatically Creating Fill-in-the-blank Questions'. Proceedings of the Knowledge Capture Conference, Article 33.
5. Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013). 'Efficient Estimation of Word Representations in Vector Space'. CoRR abs/1301.3781.
6. Novikova, J., Dušek, O., Curry, A.C., and Rieser, V. (2017). 'Why We Need New Evaluation Metrics for NLG'. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2231-2242.
7. Pennington, J., Socher, R., and Manning, C. (2014). 'GloVe: Global Vectors for Word Representation'. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.
8. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). 'Improved Techniques for Training GANs'. Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2234-2242.
9. Yang, Z., Chen, W., Wang, F., and Xu, B. (2018). 'Generative Adversarial Training for Neural Machine Translation'. Neurocomputing, Vol. 321, pp. 146-155.
10. https://becominghuman.ai/what-is-deep-learning-and-why-you-need-it-9e2fc0f0e61b
11. https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
15. https://bloomberg.github.io/foml/#home
18. www.python.org
19. https://towardsdatascience.com/animated-rnn-lstm-and-gru-ef124d06cf45
20. https://machinelearningmastery.com/cyclegan-tutorial-with-keras/