results with Fairseq platform. Opus6 provides a good amount of parallel corpus. For the Marathi-English pair, around 1 million sentence pairs are available, among which only the Tatoeba, Wikimedia and Bible datasets are useful, as the other data consists mostly of instructions. When a sanity check was done on these usable sets as well, it was found that not all sentences are correctly aligned and that there are some fatal mismatches. We initially tried to rectify them, but later decided not to and ignored the Bible dataset completely. We decided to keep the Wikimedia and Tatoeba datasets for validation purposes, as together they left us with just around 53k sentence pairs, and we did not use them in training.
Through scraping, we had more than 6 million sentence pairs, but we kept only those sentence pairs which were almost correct. We put a hard rule of dictionary-based word matching and considered only those sentence pairs in which at least 30% of the translated words matched dictionary words. This left us with around 3 million sentence pairs.
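A minimal sketch of this filtering rule is shown below; the dictionary structure, the whitespace tokenization and the exact matching criterion are illustrative assumptions rather than the exact script used.

    # Hedged sketch of the dictionary-based filtering rule described above.
    # mr_en_dict (Marathi word -> list of English translations), whitespace
    # tokenization and the matching criterion are illustrative assumptions.
    def keep_pair(marathi_sent, english_sent, mr_en_dict, threshold=0.30):
        """Keep the pair only if at least `threshold` of the English words
        are dictionary translations of some word in the Marathi sentence."""
        expected = set()
        for word in marathi_sent.split():
            expected.update(mr_en_dict.get(word, []))
        english_words = english_sent.lower().split()
        if not english_words:
            return False
        matched = sum(1 for w in english_words if w in expected)
        return matched / len(english_words) >= threshold

    # Usage (keeps roughly correct pairs out of the scraped collection):
    # filtered = [(mr, en) for mr, en in scraped_pairs if keep_pair(mr, en, mr_en_dict)]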
We used the WordPiece tokenizer by huggingface7 to tokenize the Marathi and English text. We also lowercased the English text throughout the experiments to reduce the learning burden of letter casing for English; there is no concept of "case" in the Marathi language. Sample parallel corpus examples can be found at the project github location8.
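For illustration, tokenization along these lines can be done with the pytorch-pretrained-bert package referenced below; the choice of the multilingual BERT WordPiece vocabulary here is an assumption made for the sketch, not necessarily the vocabulary used in the experiments.

    # Hedged sketch: WordPiece tokenization of lowercased English and Marathi text
    # with pytorch-pretrained-bert; the multilingual vocabulary is an assumption.
    from pytorch_pretrained_bert import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

    english = "Life is like a kite.".lower()   # English is lowercased throughout
    marathi = "आयुष्य पतंगासारखं आहे."             # Marathi has no notion of letter case

    en_tokens = tokenizer.tokenize(english)    # WordPiece pieces; continuations start with '##'
    mr_tokens = tokenizer.tokenize(marathi)

    print(' '.join(en_tokens))
    print(' '.join(mr_tokens))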
3. Experiments

3.1. Setup

We used Facebook's sequence-to-sequence library Fairseq9 to train the translation models and run inference. This neatly written and easy-to-use library provides multiple state-of-the-art architectures for building translation models. We installed the library on a Linux machine with 4x Nvidia V100 32 GB GPUs. Even though multiple algorithms are available, we focused mainly on the following Transformer-based architectures: transformer-wmt-en-de, transformer-iwslt-de-en, transformer-wmt-en-de-big-t2t and transformer-vaswani-wmt-en-de-big.

Fairseq also provides the option to tokenize the input text with the sentencepiece tokenizer10 and a GPT tokenizer, but we opted to tokenize the text with the WordPiece tokenizer instead, before passing the text for training. In future work, we plan to build a sentencepiece model from Marathi and English news corpora and evaluate it against the existing WordPiece tokenizer.
3.2. Training

We trained multiple models with the above-mentioned Transformer architectures, using various hyper-parameters suggested in the respective papers and in Fairseq github discussions. The following is one of the training commands we used.

Training Command:

    CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train mr2en_token_data \
        --arch transformer_vaswani_wmt_en_de_big \
        --share-decoder-input-output-embed \
        --optimizer adam --adam-betas '(0.9,0.98)' --clip-norm 0.0 \
        --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --update-freq 2 \
        --max-source-positions 512 --max-target-positions 512 \
        --skip-invalid-size-inputs-valid-test

Note that, as we were using 4 GPUs instead of the 8 GPUs mentioned in many state-of-the-art papers, we set --update-freq to 2. This is done to mimic training with 8 GPUs. We tried different optimizers but finally settled on the Adam optimizer because of its stable loss reduction. We noticed that increasing --warmup-updates from 4k to 10k improved convergence and reduced the overall number of iterations. Also note that we did not use the --fp16 option in the above command, which could have improved training speed, but we observed that it reduces the BLEU score marginally.
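As a quick sanity check on that setting, the arithmetic below shows why --update-freq 2 on 4 GPUs approximates the 8-GPU recipes, assuming (as in Fairseq) that --max-tokens is a per-GPU budget.

    # Back-of-the-envelope check of the effective batch size per optimizer step.
    # In Fairseq, --max-tokens is a per-GPU budget and --update-freq accumulates
    # gradients over that many forward/backward passes before stepping.
    max_tokens  = 4096   # --max-tokens
    num_gpus    = 4      # CUDA_VISIBLE_DEVICES=0,1,2,3
    update_freq = 2      # --update-freq

    effective_tokens = max_tokens * num_gpus * update_freq
    print(effective_tokens)              # 32768
    print(effective_tokens == 4096 * 8)  # True -> matches an 8-GPU run with update-freq 1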
We stopped the training once the perplexity (ppl) went below 3. Smaller models like transformer-wmt-en-de and transformer-iwslt-de-en took around 30 hours, whereas the other two big models took 50+ hours. For all the transformer-based models, the loss went below 1 for the train and test sets.

3.3. Results

To make the comparison fair, we used the 16th checkpoint of each model throughout. The Marathi text was run against the Google-cloud-api-v2 to collect the results for the comparison. Inference on GPU was preferred over CPU, as we could utilize all 4 GPUs effectively with the 4 models. The following is one of the commands we used for inference.

Inference Command:

    CUDA_VISIBLE_DEVICES=0 python interactive.py \
        --path ../translation_task/checkpoints_transformer_iwslt_de_en/checkpoint16.pt \
        ../translation_task/mr2en_token_data \
        --beam 5 --source-lang mr --target-lang en \
        --input ../translation_task/set3_tokens.mr --sacrebleu \
        --skip-invalid-size-inputs-valid-test --batch-size 32 --buffer-size 32

We used a beam size of 5, which gave a better BLEU score than any other option. After inference, we obtained tokenized English text, which we de-tokenized and used further for the model metric comparisons. Note that calculating BLEU on tokenized text yields inflated scores, which is unfair to Google Translator, and hence we avoided it throughout.
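The de-tokenization step is not spelled out above; a minimal sketch that merges WordPiece continuation pieces back into words, assuming the standard '##' prefix convention, is:

    # Hedged sketch: merge WordPiece pieces back into plain words before scoring.
    def detokenize_wordpiece(tokens):
        words = []
        for tok in tokens:
            if tok.startswith('##') and words:
                words[-1] += tok[2:]   # glue continuation piece onto the previous word
            else:
                words.append(tok)
        return ' '.join(words)

    print(detokenize_wordpiece(['life', 'is', 'like', 'a', 'kit', '##e', '.']))
    # -> 'life is like a kite .'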
6 http://opus.nlpl.eu/
7 https://pypi.org/project/pytorch-pretrained-bert/
8 https://github.com/swapniljadhav1921/marathi-2-english-NMT
9 https://fairseq.readthedocs.io/
10 https://github.com/google/sentencepiece
Table 1. BLEU score comparison on small sentences having word-count less than 15.

Table 2. BLEU score comparison on medium to large sentences having word-count more than 15.

    Models                  BLEU     Raw-BLEU
    Google                  28.60    17.47
    wmt-en-de               26.87    26.10
    iwslt-de-en             26.06    25.28
    wmt-en-de-big-t2t       29.50    28.73
    vaswani-wmt-en-de-big   27.18    26.50

Table 3. Error between actual translation word-count and predicted translation word-count.

Table 4. Comparison between existing Translators.

    Marathi Text:        आयुष्य पतंगासारखं आहे. मांजा धरला तर वेगात उंच झेपावत नाही आणि सोडला तर कुठे जाईल त्याचा नेम नाही.
    Actual Translation:  Life is like a kite. If you keep holding the thread, it will not rise faster and if you loose it then not sure where it will land.
    Our Model:           life is like a kite . holding a cat does not accelerate high speeds and does not specify where it will go if left unattended .
    Google:              Life is like a kite. If you catch a cat, it does not jump high and it does not specify where you will go.
    Facebook:            Life is like a fall. If you hold a cat, you don't run high in speed and if you leave it, there is no name where it goes .
    Yandex:              The life is like The Moth . I held her naked buttocks in my hands as she rode me until she climaxed .
We used the sacreBLEU11 library to calculate the corpus-BLEU score (with the smoothing function enabled) and the raw-corpus-BLEU score (with the smoothing function disabled). As mentioned before, we used the Tatoeba and Wikimedia parallel corpora of around 53k sentence pairs as the validation set. Tatoeba contains smaller everyday sentences and greetings, while Wikimedia has long scientific sentences.
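For reference, both scores can be computed with the sacreBLEU Python API roughly as below; mapping "smoothing disabled" to smooth_method='none' is an assumption about the exact setting used.

    # Hedged sketch: corpus BLEU with and without smoothing via sacreBLEU.
    import sacrebleu

    hypotheses = ['life is like a kite .']    # de-tokenized model outputs
    references = ['Life is like a kite .']    # actual English translations

    bleu     = sacrebleu.corpus_bleu(hypotheses, [references])                        # default smoothing
    raw_bleu = sacrebleu.corpus_bleu(hypotheses, [references], smooth_method='none')  # smoothing disabled

    print(bleu.score, raw_bleu.score)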
From Table 1, we can see that for smaller sentences with a word count of less than 15, all transformer models crushed Google in both BLEU and Raw-BLEU scores. The boundary of 15 words was chosen based on a study12 which states that the average sentence length in current usage is around 10-20 words. Also note that the gap between the BLEU and Raw-BLEU scores is smaller for all transformer models than for Google. This table shows that for smaller everyday sentences and greetings, our models outperformed Google easily. The vaswani-big and wmt-t2t architectures performed the best.

From Table 2, we can see that all the models, including Google, struggled to go beyond a BLEU score of 30. Only the wmt-t2t model was able to outscore Google in BLEU, while Google in turn struggled in Raw-BLEU score compared to all the other models. This shows that the wmt-t2t model was able to translate longer, complex sentences with a good score and did better than Google.

Table 3 shows the comparison between the word counts of the actual English text and the predicted English text. Even though this score does not signify the quality of the translation, it is often used to check the sanity of the model. Here we can see that all transformer-based models outperformed Google in both Mean-Absolute-Error (MAE) and Root-Mean-Squared-Error (RMSE). Again, the wmt-t2t model performed the best overall.
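The word-count error metrics in Table 3 are the standard MAE and RMSE; a small sketch of how they can be computed from whitespace word counts (the counting method is an assumption) is:

    # Hedged sketch: MAE and RMSE between reference and predicted word counts.
    import math

    def word_count_errors(references, hypotheses):
        diffs = [len(r.split()) - len(h.split()) for r, h in zip(references, hypotheses)]
        mae  = sum(abs(d) for d in diffs) / len(diffs)
        rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
        return mae, rmse

    mae, rmse = word_count_errors(['Life is like a kite .'], ['life is like a kite .'])
    print(mae, rmse)   # 0.0 0.0 for this toy pair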
3.4. Discussion

Any language base model requires a huge amount of data to train deep architectures. We saw that one of the best word-embedding models, BERT13, was trained on more than 100 GB of textual data. Similarly, to train translation models and to make them learn how to generate sentence structures, and even to transliterate proper nouns instead of translating them, we need a large parallel corpus. And Google supposedly has a very large corpus; to the best of our knowledge, Google relies on scraping and community help14. As Google has not

11 https://github.com/mjpost/sacreBLEU
12 https://techcomm.nz/Story?Action=View&Story_id=106
13 https://github.com/google-research/bert
14 https://translate.google.com/community#mr/en