Solutions
This assignment is split into two sections: Neural Machine Translation with RNNs and Analyzing NMT Systems. The first is primarily coding and implementation focused, whereas the second consists entirely of written analysis questions. If you get stuck on the first section, you can always work on the second, as the two sections are independent of each other. Note that the NMT system is more complicated than the neural networks we have previously constructed within this class and takes about 2 hours to train on a GPU. Thus, we strongly recommend you get started early with this assignment. Finally, the notation and implementation of the NMT system are a bit tricky, so if you ever get stuck along the way, please come to Office Hours so that the TAs can support you.
Figure 1: Seq2Seq model with multiplicative attention, shown on the third step of the decoder. The hidden states $h^{enc}_i$ and cell states $c^{enc}_i$ are defined below.
the embedding size. We then feed the embeddings to a convolutional layer¹ while maintaining their shapes. We feed the convolutional layer outputs to the bidirectional encoder, yielding hidden states and cell states for both the forwards (→) and backwards (←) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h^{enc}_i$ and cell states $c^{enc}_i$:

$$h^{enc}_i = [\overleftarrow{h^{enc}_i}; \overrightarrow{h^{enc}_i}] \quad\text{where } h^{enc}_i \in \mathbb{R}^{2h\times 1},\ \overleftarrow{h^{enc}_i}, \overrightarrow{h^{enc}_i} \in \mathbb{R}^{h\times 1},\ 1 \le i \le m \qquad (1)$$

$$c^{enc}_i = [\overleftarrow{c^{enc}_i}; \overrightarrow{c^{enc}_i}] \quad\text{where } c^{enc}_i \in \mathbb{R}^{2h\times 1},\ \overleftarrow{c^{enc}_i}, \overrightarrow{c^{enc}_i} \in \mathbb{R}^{h\times 1},\ 1 \le i \le m \qquad (2)$$

We then initialize the decoder's first hidden state $h^{dec}_0$ and cell state $c^{dec}_0$ with a linear projection of the encoder's final hidden state and final cell state:²

$$h^{dec}_0 = W_h [\overleftarrow{h^{enc}_1}; \overrightarrow{h^{enc}_m}] \quad\text{where } h^{dec}_0 \in \mathbb{R}^{h\times 1},\ W_h \in \mathbb{R}^{h\times 2h} \qquad (3)$$

$$c^{dec}_0 = W_c [\overleftarrow{c^{enc}_1}; \overrightarrow{c^{enc}_m}] \quad\text{where } c^{dec}_0 \in \mathbb{R}^{h\times 1},\ W_c \in \mathbb{R}^{h\times 2h} \qquad (4)$$
With the decoder initialized, we must now feed it a target sentence. On the $t$th step, we look up the embedding for the $t$th subword, $y_t \in \mathbb{R}^{e\times 1}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h\times 1}$ from the previous timestep (we will explain what this is later down this page!) to produce $\overline{y}_t \in \mathbb{R}^{(e+h)\times 1}$. Note that for the first target subword (i.e. the start token) $o_0$ is a zero vector. We then feed $\overline{y}_t$ as input to the decoder.

$$h^{dec}_t, c^{dec}_t = \text{Decoder}(\overline{y}_t, h^{dec}_{t-1}, c^{dec}_{t-1}) \quad\text{where } h^{dec}_t \in \mathbb{R}^{h\times 1},\ c^{dec}_t \in \mathbb{R}^{h\times 1} \qquad (5)$$

We then use $h^{dec}_t$ to compute multiplicative attention over the encoder hidden states:

$$e_{t,i} = (h^{dec}_t)^T W_{attProj}\, h^{enc}_i \quad\text{where } e_t \in \mathbb{R}^{m\times 1},\ 1 \le i \le m \qquad (6)$$

Here $e_{t,i}$ is a scalar, the $i$th element of $e_t \in \mathbb{R}^{m\times 1}$, computed using the hidden state of the decoder at the $t$th step, $h^{dec}_t \in \mathbb{R}^{h\times 1}$, the attention projection $W_{attProj} \in \mathbb{R}^{h\times 2h}$, and the hidden state of the encoder at the $i$th step, $h^{enc}_i \in \mathbb{R}^{2h\times 1}$.
We then apply the softmax function to $e_t$ to obtain the attention distribution $\alpha_t \in \mathbb{R}^{m\times 1}$, and compute the attention output $a_t \in \mathbb{R}^{2h\times 1}$ as the weighted sum of the encoder hidden states, $a_t = \sum_{i=1}^{m} \alpha_{t,i}\, h^{enc}_i$. We now concatenate the attention output $a_t$ with the decoder hidden state $h^{dec}_t$ and pass this through a linear layer, tanh, and dropout to obtain the combined-output vector $o_t$:

$$u_t = [a_t; h^{dec}_t] \quad\text{where } u_t \in \mathbb{R}^{3h\times 1} \qquad (10)$$

$$v_t = W_u u_t \quad\text{where } v_t \in \mathbb{R}^{h\times 1},\ W_u \in \mathbb{R}^{h\times 3h} \qquad (11)$$

$$o_t = \text{dropout}(\tanh(v_t)) \quad\text{where } o_t \in \mathbb{R}^{h\times 1} \qquad (12)$$
¹ Check out https://cs231n.github.io/convolutional-networks for an in-depth description of convolutional layers if you are not familiar with them.
² If it's not obvious, think about why we regard $[\overleftarrow{h^{enc}_1}, \overrightarrow{h^{enc}_m}]$ as the 'final hidden state' of the Encoder.
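Putting the attention and combined-output equations together, the following is a rough PyTorch sketch of one decoder step's attention computation: equation (6), the softmax and weighted sum described above, and equations (10)-(12). The names att_projection and combined_output_projection, the dropout rate, and the shapes are assumptions for illustration only.

```python
# One attention step under the assumed shapes above; not the exact starter code.
import torch
import torch.nn as nn
import torch.nn.functional as F

h, m, batch = 512, 20, 4
att_projection = nn.Linear(2 * h, h, bias=False)              # plays the role of W_attProj
combined_output_projection = nn.Linear(3 * h, h, bias=False)  # plays the role of W_u in (11)
dropout = nn.Dropout(p=0.3)                                   # dropout rate is arbitrary here

enc_hiddens = torch.randn(batch, m, 2 * h)   # h_i^enc for i = 1..m
dec_hidden = torch.randn(batch, h)           # h_t^dec

# e_{t,i} = (h_t^dec)^T W_attProj h_i^enc, computed for all i at once
enc_hiddens_proj = att_projection(enc_hiddens)                           # (batch, m, h)
e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)    # (batch, m)

alpha_t = F.softmax(e_t, dim=1)                                          # attention distribution
a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)            # (batch, 2h) attention output

u_t = torch.cat([a_t, dec_hidden], dim=1)    # (10): (batch, 3h)
v_t = combined_output_projection(u_t)        # (11): (batch, h)
o_t = dropout(torch.tanh(v_t))               # (12): combined-output vector
```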
Then, we produce a probability distribution $P_t$ over target subwords at the $t$th timestep:

$$P_t = \text{softmax}(W_{vocab}\, o_t) \quad\text{where } P_t \in \mathbb{R}^{V_t\times 1},\ W_{vocab} \in \mathbb{R}^{V_t\times h} \qquad (13)$$

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we compute the cross-entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target subword at timestep $t$:

$$J_t(\theta) = \text{CrossEntropy}(P_t, g_t) = -\sum_{w} g_{t,w} \log P_{t,w} \qquad (14)$$

Here, $\theta$ represents all the parameters of the model and $J_t(\theta)$ is the loss on step $t$ of the decoder. Now that we have described the model, let's try implementing it for Mandarin Chinese to English translation!
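As a small illustration of this output layer and loss, here is a possible sketch; target_vocab_projection is a hypothetical layer name and the vocabulary size is made up.

```python
# Sketch of P_t and J_t(theta): project the combined output to vocabulary logits,
# then apply a cross-entropy loss against the gold subword index.
import torch
import torch.nn as nn
import torch.nn.functional as F

h, V_t, batch = 512, 32000, 4
target_vocab_projection = nn.Linear(h, V_t, bias=False)

o_t = torch.randn(batch, h)               # combined-output vector from equation (12)
logits = target_vocab_projection(o_t)     # (batch, V_t)
P_t = F.softmax(logits, dim=1)            # probability distribution over target subwords

gold = torch.randint(0, V_t, (batch,))    # index of the gold subword at step t
# F.cross_entropy applies log-softmax internally, so it takes the raw logits;
# this equals -log P_t[gold], i.e. equation (14) with a one-hot g_t.
J_t = F.cross_entropy(logits, gold)
```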
Note that this virtual environment will not be needed on the VM.
(e) (8 points) (coding) Implement the decode function in nmt_model.py. This function constructs $\overline{y}$ and runs the step function over every timestep for the input. You can run a non-comprehensive sanity check by executing:

python sanity_check.py 1e
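For orientation, here is a hedged sketch of what such a decode loop might look like: at each step the target embedding $y_t$ is concatenated with the previous combined output $o_{t-1}$ to form $\overline{y}_t$, which is fed to the per-step computation. The function step below is only a placeholder standing in for what you implement in part (f); all names and shapes are illustrative.

```python
# Sketch of the decode loop; `step` is a placeholder, not the real implementation.
import torch

e, h, batch, tgt_len = 256, 512, 4, 15
Y = torch.randn(tgt_len, batch, e)           # target subword embeddings y_1..y_T
o_prev = torch.zeros(batch, h)               # o_0 is a zero vector
dec_state = (torch.zeros(batch, h), torch.zeros(batch, h))  # (h_0^dec, c_0^dec) from (3)-(4)
combined_outputs = []

def step(ybar_t, dec_state):
    """Placeholder for the per-step computation sketched earlier; returns the new
    decoder state and the combined output o_t."""
    h_dec, c_dec = dec_state
    return (h_dec, c_dec), torch.zeros(ybar_t.size(0), h)

for y_t in torch.split(Y, 1, dim=0):         # iterate over timesteps
    y_t = y_t.squeeze(0)                     # (batch, e)
    ybar_t = torch.cat([y_t, o_prev], dim=1) # (batch, e + h)
    dec_state, o_t = step(ybar_t, dec_state)
    combined_outputs.append(o_t)
    o_prev = o_t
```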
(f) (10 points) (coding) Implement the step function in nmt_model.py. This function applies the Decoder's LSTM cell for a single timestep, computing the encoding of the target subword $h^{dec}_t$, the attention scores $e_t$, the attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$. You can run a non-comprehensive sanity check by executing:

python sanity_check.py 1f
(g) (3 points) (written) The generate_sent_masks() function in nmt_model.py produces a tensor called enc_masks. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to 'pad' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the step() function (lines 311-312).

First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way.
Solution: The masks assign $-\infty$ to the attention scores at positions corresponding to 'pad' tokens in the input. This forces the attention probabilities of the 'pad' positions to be 0 after applying softmax, so the 'pad' token embeddings do not affect the attention outputs. This is necessary because the 'pad' tokens are just additional elements used to make the sentence lengths within a batch equal; they do not appear in the actual sentences, so we do not want them to affect the output probability distribution.
Another solution:
1. What effect: for every batch, the attention scores at positions corresponding to zero-padded embeddings are set to $-\infty$. This way, when the attention distribution $\alpha_t$ is computed, the probability mass goes to the non-padded words, while the padded positions receive negligible probability. Finally, when the attention output $a_t$ is computed, only the hidden states that are not multiplied by a near-zero probability contribute to the sum, as illustrated in the sketch below.
2. Why necessary: using masks in this way is an efficient way to compute the true attention distribution, which should involve only the non-padded entries. Involving padded entries would distort the attention representation.
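For illustration, here is a minimal sketch of this masking pattern, assuming a boolean mask and the shapes discussed above rather than the exact starter-code names.

```python
# Positions marked 1 (pad) get their attention scores set to -inf before the softmax,
# so they receive exactly zero probability mass.
import torch
import torch.nn.functional as F

batch, m = 2, 5
e_t = torch.randn(batch, m)                       # raw attention scores
enc_masks = torch.tensor([[0, 0, 0, 1, 1],        # last two source positions are pads
                          [0, 0, 0, 0, 0]], dtype=torch.bool)

e_t = e_t.masked_fill(enc_masks, -float('inf'))   # -inf at pad positions
alpha_t = F.softmax(e_t, dim=1)                   # pad positions now get probability 0
```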
Now it’s time to get things running! As noted earlier, we recommend that you develop the code on your
personal computer. Confirm that you are running in the proper conda environment and then execute
the following command to train the model on your local machine:
sh run.sh train_local
( Windows ) run.bat train_local
For a faster way to debug by training on less data, you can run the following instead:
sh run.sh train_debug
( Windows ) run.bat debug
To help with monitoring and debugging, the starter code logs loss and perplexity during training using TensorBoard³. TensorBoard provides tools for logging and visualizing training information from experiments. To open TensorBoard, run the following in your conda environment:

tensorboard --logdir=runs
You should see a significant decrease in loss during the initial iterations. Once you have ensured that
your code does not crash (i.e. let it run till iter 10 or iter 20), power on your VM from the Azure Web
Portal. Then read the Managing Code Deployment to a VM section of our Practical Guide to VMs (link
also given on website and Ed) for instructions on how to upload your code to the VM.
Next, install necessary packages to your VM by running:
pip install -r gpu_requirements.txt
Finally, turn to the Managing Processes on a VM section of the Practical Guide and follow the instructions to create a new tmux session. Concretely, run the following command to create a tmux session called nmt:

tmux new -s nmt
Once you know your code is running properly, you can detach from the session and close your ssh connection to the server. To detach from the session, run:

tmux detach
You can return to your training model by ssh-ing back into the server and attaching to the tmux session
by running:
tmux a -t nmt
(h) (3 points) (written) Once your model is done training (this should take under 2 hours on the
VM), execute the following command to test the model:
sh run.sh test
( Windows ) run.bat test
Please report the model’s corpus BLEU Score. It should be larger than 18.
Solution: Corpus BLEU: 20.349133759154903 (with early stopping after 4 epochs, 18800 iterations)
(i) (4 points) (written) In class, we learned about dot product attention, multiplicative attention, and additive attention. As a reminder, dot product attention is $e_{t,i} = s_t^T h_i$, multiplicative attention is $e_{t,i} = s_t^T W h_i$, and additive attention is $e_{t,i} = v^T \tanh(W_1 h_i + W_2 s_t)$.
i. (2 points) Explain one advantage and one disadvantage of dot product attention compared to multiplicative attention.
ii. (2 points) Explain one advantage and one disadvantage of additive attention compared to multiplicative attention.
³ https://pytorch.org/docs/stable/tensorboard.html
Solution:
i. An advantage of dot product attention compared to multiplicative attention is that it has no learnable parameters and is just a vector dot product, so it is fast to compute. A disadvantage is that a simple dot product is not expressive enough to capture which parts of $s_t$ and $h_i$ to pay attention to, because it is an unweighted element-wise similarity.
ii. An advantage of additive attention compared to multiplicative attention is that similarity is computed in a non-linearly transformed space where both $h_i$ and $s_t$ have their own learnable weights, which adds flexibility to the parameter space. A disadvantage is that the computation is more expensive (see the sketch below for a side-by-side comparison of the three scoring functions).
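For reference, here is an illustrative sketch of the three scoring functions side by side; all dimensions and weight matrices are made up for the example, and note that in practice multiplicative attention can use a rectangular $W$ so that $s_t$ and $h_i$ have different dimensions.

```python
# Dot-product, multiplicative, and additive attention scores for one decoder state
# against m encoder states. Purely illustrative shapes and parameters.
import torch
import torch.nn as nn

d, batch, m = 512, 4, 20
s_t = torch.randn(batch, d)                     # decoder state
H = torch.randn(batch, m, d)                    # encoder states h_1..h_m

# Dot-product attention: no parameters at all.
e_dot = torch.bmm(H, s_t.unsqueeze(2)).squeeze(2)               # (batch, m)

# Multiplicative attention: one weight matrix W.
W = nn.Linear(d, d, bias=False)
e_mult = torch.bmm(W(H), s_t.unsqueeze(2)).squeeze(2)           # (batch, m)

# Additive attention: two weight matrices, a nonlinearity, and a score vector v.
W1, W2 = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
e_add = v(torch.tanh(W1(H) + W2(s_t).unsqueeze(1))).squeeze(2)  # (batch, m)
```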
Solution: Adding a 1D Convolutional layer after the embedding layer and before passing the
embeddings into the bidirectional encoder could help the NMT system by allowing the model to
capture local dependencies and patterns among the character sequences. Since each Mandarin
Chinese character is either an entire word or a morpheme in a word, the convolutional layer can
identify these patterns and use them to inform the representation of the word or phrase as a whole.
For example, in the case of the characters 电, 脑, and 电脑, the convolutional layer could potentially
learn to recognize the pattern of the characters 电 (electricity) and 脑 (brain) occurring together to
form the word 电脑 (computer). By capturing these patterns, the model may be able to improve
its ability to handle rare or unseen words, which is important for NMT systems since they must be
able to translate sentences containing previously unseen vocabulary.
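As a sketch of the idea, a 1D convolution with a small kernel can pool adjacent character embeddings, such as 电 and 脑, into a joint representation while keeping the sequence shape unchanged. The kernel width and padding below are arbitrary choices for illustration, not the assignment's actual hyperparameters.

```python
# A 1D convolution over source embeddings that preserves their shape.
import torch
import torch.nn as nn

e, batch, src_len = 256, 4, 20
conv = nn.Conv1d(in_channels=e, out_channels=e, kernel_size=3, padding=1)

X = torch.randn(src_len, batch, e)      # source embeddings: (src_len, batch, e)
X_conv = conv(X.permute(1, 2, 0))       # Conv1d expects (batch, channels, length)
X_conv = X_conv.permute(2, 0, 1)        # back to (src_len, batch, e): shape maintained
```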
(b) (8 points) Here we present a series of errors we found in the outputs of our NMT model (which
is the same as the one you just trained). For each example of a reference (i.e., ‘gold’) English
translation, and NMT (i.e., ‘model’) English translation, please:
1. Identify the error in the NMT translation.
2. Provide possible reason(s) why the model may have made the error (either due to a specific
linguistic construct or a specific model limitation).
3. Describe one possible way we might alter the NMT system to fix the observed error. There is more than one possible fix for each error. For example, it could be tweaking the size of the hidden layers or changing the attention mechanism.
Below are the translations that you should analyze as described above. Only analyze the underlined
error in each sentence. Rest assured that you don’t need to know Mandarin to answer these
questions. You just need to know English! If, however, you would like some additional color on the
source sentences, feel free to use a resource like https://www.archchinese.com/chinese_english_
dictionary.html to look up words. Feel free to search the training data file to have a better sense of
how often certain characters occur.
Solution:
1. The error in the NMT translation is the use of the singular form "culprit" instead of the plural form "culprits".
2. The model might have made this error because it did not attend to the cue for plurality in the source (Mandarin) sentence. Additionally, the model might have been trained on a dataset in which the singular form "culprit" occurs more frequently than the plural form.
3. A possible way to address this error is to encourage the attention mechanism to attend to the grammatical number of nouns in the source sentence, or to increase the occurrence of plural words in our dataset.
Solution:
1. The error in the NMT translation is the repetition of the phrase "resources have been exhausted".
2. One possible reason for this error is that the NMT system did not capture the meaning of the words "space" or "accommodate", leading to inaccurate attention weights over the source sentence while translating the first part of the sentence ("there is almost no space to accommodate these people").
3. This error could be addressed by adjusting the attention mechanism to better capture the meaning of the sentence. Another way is to increase the amount of training data, which would help improve the accuracy of the translation.
Solution:
1. The error in the NMT translation is that it misses the meaning of 国殇日 ("national mourning day") and mistranslates it as "today's day".
2. The model may not have learned the specific translation of "国殇日", as it is a culturally specific term, and may have relied on the literal translation of each individual character.
3. The model may benefit from being trained on a larger corpus of text that includes culturally specific terms and phrases. Additionally, it could be improved by incorporating additional context and domain-specific knowledge during the training process, such as knowledge of national holidays and events.
Solution:
1. The error is that the NMT translation is missing the first half of the reference translation, which is the translation of the Chinese idiom.
2. One possible reason for this error is the shortage of idiomatic phrases in the training data, which makes it difficult for the model to understand idiomatic expressions, as well as the structure of the Chinese language.
3. To help the model better understand idiomatic expressions, we could provide it with a larger training set that includes more diverse examples of idioms and their translations. Additionally, we could explore incorporating a pre-trained language model to improve its understanding of the structure of the Chinese language.
(c) (14 points) BLEU score is the most commonly used automatic evaluation metric for NMT systems. It is usually calculated across the entire test set, but here we will consider BLEU defined for a single example.⁵ Suppose we have a source sentence s, a set of k reference translations $r_1, \ldots, r_k$, and a candidate translation c. To compute the BLEU score of c, we first compute the modified n-gram precision $p_n$ of c, for each of $n = 1, 2, 3, 4$, where n is the n in n-gram:

$$p_n = \frac{\displaystyle\sum_{\text{ngram} \in c} \min\Big(\max_{i=1,\ldots,k} \text{Count}_{r_i}(\text{ngram}),\ \text{Count}_{c}(\text{ngram})\Big)}{\displaystyle\sum_{\text{ngram} \in c} \text{Count}_{c}(\text{ngram})} \qquad (15)$$
Here, for each of the n-grams that appear in the candidate translation c, we count the maximum number of times it appears in any one reference translation, capped by the number of times it appears in c (this is the numerator). We divide this by the number of n-grams in c (the denominator).
Next, we compute the brevity penalty BP. Let len(c) be the length of c and let len(r) be the
length of the reference translation that is closest to len(c) (in the case of two equally-close reference
translation lengths, choose len(r) as the shorter one).
$$BP = \begin{cases} 1 & \text{if } \operatorname{len}(c) \ge \operatorname{len}(r) \\ \exp\left(1 - \dfrac{\operatorname{len}(r)}{\operatorname{len}(c)}\right) & \text{otherwise} \end{cases} \qquad (16)$$

Lastly, the BLEU score for candidate c with respect to $r_1, \ldots, r_k$ is:

$$BLEU = BP \times \exp\left(\sum_{n=1}^{4} \lambda_n \log p_n\right) \qquad (17)$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are weights that sum to 1. The log here is the natural log.
⁵ This quantity can also be computed with NLTK's sentence_bleu function (http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu). Note that the NLTK function is sensitive to capitalization; in this question, all text is lowercased, so capitalization is irrelevant.
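For concreteness, here is a plain-Python sketch that computes BLEU exactly as defined in equations (15)-(17); it is an illustrative re-implementation, not the NLTK function mentioned in the footnote, and it assumes every pₙ with nonzero weight is itself nonzero.

```python
# Sentence-level BLEU per equations (15)-(17): clipped n-gram precision, brevity
# penalty, and a weighted geometric mean. Inputs are lists of tokens.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, refs, n):
    # Numerator of (15): clip each candidate n-gram count by its max count in any reference.
    cand_counts = Counter(ngrams(cand, n))
    clipped = 0
    for gram, count in cand_counts.items():
        max_ref = max(Counter(ngrams(r, n))[gram] for r in refs)
        clipped += min(max_ref, count)
    total = sum(cand_counts.values())   # denominator: number of n-grams in c
    return clipped / total if total > 0 else 0.0

def bleu(cand, refs, weights=(0.25, 0.25, 0.25, 0.25)):
    # Brevity penalty (16): compare len(c) with the closest reference length (ties -> shorter).
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    # Weighted sum of log precisions (17); assumes each used p_n is > 0.
    log_p = sum(w * math.log(modified_precision(cand, refs, n))
                for n, w in enumerate(weights, start=1) if w > 0)
    return bp * math.exp(log_p)
```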
Which of the two NMT translations is considered the better translation according to the BLEU
Score? Do you agree that it is the better translation?
Solution:
1. BLEU score for c1 with respect to r1, r2:
We first compute the modified n-gram precisions $p_1$ and $p_2$ of c1:
$$p_1 = \frac{0+0+0+0+0+1+1+1+1}{9} = \frac{4}{9}, \qquad p_2 = \frac{0+0+0+0+0+1+1+1}{8} = \frac{3}{8}$$
We then compute the length of the candidate translation, len(c), and the length of the reference translation that is closest to len(c) (here the first reference translation r1):
$$\operatorname{len}(c_1) = 9, \qquad \operatorname{len}(r_1) = 11$$
Next, we compute the brevity penalty BP:
$$BP = \exp\left(1 - \frac{\operatorname{len}(r)}{\operatorname{len}(c)}\right) = \exp\left(1 - \frac{11}{9}\right) = e^{-2/9}$$
Finally, the BLEU score for candidate c1 with respect to r1, r2 is:
$$BLEU = BP \times \exp\left(\sum_{n=1}^{2} \lambda_n \log p_n\right) = \exp\left(-\frac{2}{9} + 0.5 \times \left(\log\frac{4}{9} + \log\frac{3}{8}\right)\right) \approx 0.327$$
2. BLEU score for c2 with respect to r1, r2:
$$p_1 = \frac{6}{6} = 1, \qquad p_2 = \frac{0+1+1+1+0}{5} = \frac{3}{5}$$
$$\operatorname{len}(c_2) = 6, \qquad \operatorname{len}(r_2) = 6, \qquad BP = 1$$
$$BLEU = BP \times \exp\left(\sum_{n=1}^{2} \lambda_n \log p_n\right) = \exp\left(0.5 \times \left(\log 1 + \log\frac{3}{5}\right)\right) \approx 0.775$$
According to the BLEU scores for c1 and c2, the second NMT translation c2 is considered the better translation. However, I would not agree that c2 is translated better than c1.
ii. (5 points) Our hard drive was corrupted and we lost Reference Translation r1. Please recompute BLEU scores for c1 and c2, this time with respect to r2 only. Which of the two NMT translations now receives the higher BLEU score? Do you agree that it is the better translation?
Solution:
1. BLEU score for c1 with respect to r2:
$$p_1 = \frac{0+0+0+0+0+1+1+1+1}{9} = \frac{4}{9}, \qquad p_2 = \frac{0+0+0+0+0+1+1+1}{8} = \frac{3}{8}$$
$$\operatorname{len}(c_1) = 9, \qquad \operatorname{len}(r_2) = 6, \qquad BP = 1$$
$$BLEU = BP \times \exp\left(\sum_{n=1}^{2} \lambda_n \log p_n\right) = \exp\left(0.5 \times \left(\log\frac{4}{9} + \log\frac{3}{8}\right)\right) \approx 0.408$$
2. BLEU score for c2 with respect to r2:
$$p_1 = \frac{1+0+0+1+1+0}{6} = \frac{1}{2}, \qquad p_2 = \frac{0+0+0+1+0}{5} = \frac{1}{5}$$
$$\operatorname{len}(c_2) = 6, \qquad \operatorname{len}(r_2) = 6, \qquad BP = 1$$
$$BLEU = BP \times \exp\left(\sum_{n=1}^{2} \lambda_n \log p_n\right) = \exp\left(0.5 \times \left(\log\frac{1}{2} + \log\frac{1}{5}\right)\right) \approx 0.316$$
The first translation c1 now has a higher BLEU score, which is reasonable as c1 seems to
be the better translation.
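As a quick arithmetic check, the four scores above can be reproduced by plugging the computed pₙ and BP values into BLEU = BP × exp(Σₙ λₙ log pₙ) with λ₁ = λ₂ = 0.5:

```python
# Numeric check of the worked BLEU values using only the p_n and BP numbers above.
import math

def bleu_from_parts(bp, p1, p2, lam=0.5):
    return bp * math.exp(lam * (math.log(p1) + math.log(p2)))

print(bleu_from_parts(math.exp(1 - 11 / 9), 4 / 9, 3 / 8))  # c1 vs r1, r2  -> ~0.327
print(bleu_from_parts(1.0, 1.0, 3 / 5))                     # c2 vs r1, r2  -> ~0.775
print(bleu_from_parts(1.0, 4 / 9, 3 / 8))                   # c1 vs r2 only -> ~0.408
print(bleu_from_parts(1.0, 1 / 2, 1 / 5))                   # c2 vs r2 only -> ~0.316
```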
iii. (2 points) Due to data availability, NMT systems are often evaluated with respect to only a
single reference translation. Please explain (in a few sentences) why this may be problematic. In
your explanation, discuss how the BLEU score metric assesses the quality of NMT translations
when there are multiple reference translations versus a single reference translation.
Solution: Translations from a source language can vary a lot due to the flexibility of the target language, e.g. through the use of synonyms or different phrasings. An NMT system evaluated with respect
to only a single reference translation can be penalized for these legitimate variations, receiving a low BLEU score even though it produces a high-quality translation. With multiple reference translations, the BLEU score metric is more reliable and accurate, as it can cover different variations of the translated sentences. In contrast, the BLEU score can vary widely and may not accurately reflect quality when NMT translations are assessed against a single reference translation. However, the original BLEU paper notes that one may use a big test corpus with a single reference translation, provided that the translations are not all from the same translator.
iv. (2 points) List two advantages and two disadvantages of BLEU, compared to human evaluation,
as an evaluation metric for Machine Translation.
Solution:
Advantages:
1. BLEU is an automatic evaluation metric that is quicker and less expensive than human evaluation, which can take weeks or months to finish and involves human labor that cannot be reused.
2. The BLEU score is language-independent, i.e. it can be applied to any source-target language pair, and it shows a high correlation with human judgements.
Disadvantages:
1. The BLEU metric neither considers the meanings of words nor understands their significance in context. For example, prepositions usually carry the least importance, yet BLEU treats them as being just as important as noun and verb keywords.
2. BLEU does not recognize variants of words and does not fully take word order into account.
Submission Instructions
You should submit this assignment on Gradescope as two submissions, one for "Assignment 4 [coding]" and another for "Assignment 4 [written]":
1. Run the collect_submission.sh script on Azure to produce your assignment4.zip file. You can use scp to transfer files between Azure and your local computer.
3. Upload your written solutions to Gradescope to "Assignment 4 [written]". When you submit your assignment, make sure to tag all the pages for each problem according to Gradescope's submission directions. Points will be deducted if the submission is not correctly tagged.