
ENGLISH TO LUGANDA MACHINE TRANSLATION
FINAL TERM PAPER

KISEJJERE RASHID
BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING
MAKERERE UNIVERSITY
COURSE INSTRUCTOR: MR. GALIWANGO MARVIN
KAMPALA, UGANDA
[email protected]

Abstract—This paper studies English to Luganda translation using different machine classification models and deep learning. There are many machine translation resources available right now, but I could not find any public English to Luganda translation paper showing how the translation process works. Luganda is one of the most widely spoken languages in Uganda. It has a very large vocabulary, which means that working on it requires a very large dataset. In this paper, I go through different machine learning models that can be implemented to translate a given English text into Luganda. Because Luganda is a new field in translation, I experiment with multiple machine learning classification models such as SVMs, logistic regression models, and many more, then with deep learning models such as RNNs and LSTMs, and finally with advanced mechanisms such as attention and Transformers, plus some additional techniques of transfer learning.
Index Terms—Artificial Intelligence, machine translation, classification task, RNNs, transfer learning, LSTMs, attention

I. INTRODUCTION

Machine translation is a field that was and still is under active research, and so far researchers have come up with multiple machine translation approaches. These techniques mainly include Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). A detailed explanation of these approaches is given in the next section. Machine translation is one of the major subcategories of NLP, as it involves a proper understanding of two different languages. This is always challenging because languages tend to have a very large vocabulary, so a lot of computing resources are needed for a machine translation system to be as accurate as possible. The data used in the process also has to be very accurate, since its quality directly affects the accuracy of the models, so coming up with a very accurate model is tricky. Throughout this paper, I explain how I was able to come up with a couple of translation models using different strategies of machine learning.

II. BACKGROUND AND MOTIVATION

The background of machine translation comes mainly from three translation approaches, i.e. Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT).

A. Rule Based Machine Translation (RBMT)

Rule-based machine translation (RBMT), as the name states, is mainly about researchers coming up with different rules that a text in a given language can follow to arrive at its respective translation. It is the oldest machine translation technique and it was used in the 1970s.

B. Statistical Machine Translation (SMT)

Statistical machine translation (SMT) is an older translation technique that uses a statistical model to create a representation of the relationships between sentences, words, and phrases in a given text. This representation is then applied to a second language to convert these elements into the new language. One of the main advantages of this technique is that it can improve on rule-based MT, although it shares some of the same problems.

C. Neural Machine Translation (NMT)

Neural machine translation was developed using deep learning techniques. This method is generally faster and more accurate than the other methods. Neural MT is rapidly becoming the standard in MT engine development.
The translation of a given language into another language can be treated as a machine classification problem. In this type of problem, the model can only predict values within a known domain. The domain in this case could be either the set of characters that make up the vocabulary of a given language or the set of words in that vocabulary. This shows that the classification model could be either word-based or character-based. I will elaborate more on this issue in the next sections.
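To make the word-based versus character-based distinction concrete, the minimal sketch below builds both kinds of vocabulary from a toy corpus. It is only an illustration under stated assumptions: the function name `build_vocab` and the two sample strings are hypothetical and are not taken from the paper's source code or dataset.

```python
def build_vocab(sentences, level="word"):
    """Map each distinct token to an integer class index."""
    tokens = set()
    for sentence in sentences:
        # Word level splits on whitespace; character level keeps every symbol.
        units = sentence.split() if level == "word" else list(sentence)
        tokens.update(units)
    # Sort so that the index assignment is reproducible.
    return {token: index for index, token in enumerate(sorted(tokens))}


# Toy strings used purely for illustration (assumption, not dataset rows).
corpus = ["oli otya", "oli bulungi"]

word_vocab = build_vocab(corpus, level="word")
char_vocab = build_vocab(corpus, level="char")

# The size of each vocabulary is the number of classes the model predicts over.
print(len(word_vocab), len(char_vocab))
```

A character-level vocabulary stays small even for a large corpus, while a word-level vocabulary grows with every new word, which is one reason a large-vocabulary language needs more data at the word level.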
III. LITERATURE REVIEW

Translation is a crucial aspect of communication for individuals who speak different languages. With the advent of Artificial Intelligence (AI), translation has become more efficient and accurate, making it possible to communicate with individuals in other languages in real time. There are basically two major learning techniques that can be used:

Supervised learning is a type of machine learning where the model is trained on a labeled dataset and makes predictions based on the input data and the labeled output. Supervised learning algorithms have been used to train AI-powered English to Luganda translation systems. The model is trained on a large corpus of bilingual text data, which helps it learn the relationships between English and Luganda words and phrases. This allows the model to make predictions about the Luganda translation of an English text based on the input data. This is the most widely used type of machine learning, and it includes the well-known deep neural networks.

Unsupervised learning is a type of machine learning where the model is not trained on labeled data but instead learns from the input data alone. Unsupervised learning algorithms can also be used to develop AI-powered English to Luganda translation systems. The model uses techniques such as clustering and dimensionality reduction to learn the relationships between English and Luganda words and phrases. This allows the model to make predictions about the Luganda translation of an English text based on the input data.

In conclusion, AI-powered English to Luganda translation has the potential to greatly improve the speed and accuracy of translations.

IV. RESEARCH GAPS

Below are some of the major research gaps in the field of machine translation.
• Limited Training Data: The quality of AI-powered translations is heavily dependent on the amount and quality of the data used to train the model. Further research is needed to explore methods for obtaining high-quality training data.
• Lack of Cultural Sensitivity: AI-powered translation systems can produce translations that are grammatically correct but lack the cultural sensitivity of human translations. This can result in translations that are culturally inappropriate or that do not accurately convey the original message.
• Vulnerability to Errors: A machine learning system can only understand what it has been trained on. In cases where the input is not similar to the data it was trained on, the system can easily produce undesired results.

V. CONTRIBUTIONS OF THIS PAPER

One of the major aims of this paper is to lay a foundation for further and much more detailed research into the translation of large-vocabulary languages like Luganda, by showing the different machine learning techniques that can be used to achieve this.

VI. METHODOLOGY

The problem being investigated in this project is to develop an AI-powered English to Luganda translation system. The significance of this problem lies in the growing demand for high-quality and culturally sensitive translations, particularly in the field of commerce and communication between English and Luganda-speaking communities.

The scope of the project is to develop an AI system that is capable of accurately translating English text into Luganda text, while also preserving the meaning and cultural context of the original text.

To address this problem, the proposed AI approach is to develop a neural machine translation (NMT) model. The NMT model will be trained on the English and Luganda parallel corpus dataset, and will use this data to learn the relationship between the two languages. The AI process can be summarized as follows:

Data Collection: Collect a large corpus of parallel text data in English and Luganda.

Pre-processing: Pre-process the data to remove irrelevant information and standardize the text.

Model Selection: Choose the neural machine translation model that is best suited for the problem.

Model Training: Train the NMT model on the pre-processed data.

Model Evaluation: Evaluate the trained model on a held-out set of data to determine its performance.

Deployment: Deploy the trained model for use in a real-world setting.

Continuous Improvement: Continuously evaluate the performance of the model and make improvements as needed.

The evaluation framework used in this project relies mainly on accuracy metrics. Accuracy is a measure of how well the model is able to translate a given text correctly.

In conclusion, the proposed AI approach for this project is to develop a neural machine translation model that can accurately translate English text into Luganda text while preserving the meaning and cultural context of the original text.
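The paper does not spell out how accuracy is computed, so the following is only one plausible reading: token-level accuracy over a held-out set, ignoring padding positions. The function name `token_accuracy`, the padding index 0, and the toy index sequences are assumptions made for illustration.

```python
def token_accuracy(predicted, reference, pad_id=0):
    """Fraction of non-padding reference tokens that were predicted correctly."""
    correct = 0
    total = 0
    for pred_seq, ref_seq in zip(predicted, reference):
        for pred_tok, ref_tok in zip(pred_seq, ref_seq):
            if ref_tok == pad_id:
                continue  # padding positions do not count towards the score
            total += 1
            correct += int(pred_tok == ref_tok)
    return correct / total if total else 0.0


# Toy vectorized sentences (assumption): 0 marks padding.
reference = [[5, 8, 2, 0], [7, 3, 0, 0]]
predicted = [[5, 8, 9, 0], [7, 3, 4, 0]]

accuracy = token_accuracy(predicted, reference)
print(accuracy)
```

Masking the padding matters because padded positions are trivially easy to predict and would otherwise inflate the reported accuracy.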
VII. DATASET DESCRIPTION

The dataset [15] I used was created by Makerere University and it contains approximately 15k English sentences with their respective Luganda translations. Below are the reasons for choosing this dataset.
• Scarcity of Luganda datasets. Luganda isn't a widely known language worldwide and it is mainly used in Uganda, so the only major dataset I could find was this one.
• Cost. The dataset is available for free for anyone to use and edit.
• Accuracy. The accuracy of the dataset is good, so it is the best available option.
• Size. The dataset is relatively large and diverse enough to build a good model from.

VIII. DATA PREPARATION AND EXPLORATORY DATA ANALYSIS

A. DATA PREPARATION

Data preparation refers to the steps taken to turn raw data into improved data that can be used to train a machine learning model. The data preparation process for my model was as follows:
• Removal of punctuation and unnecessary spaces. This is necessary to prevent the model from training on a large amount of unnecessary data.
• Converting the words in the dataset to lowercase. Since string comparison in Python is case-sensitive, a word like “Hello” is different from “hello”. To avoid this problem I had to change the case.
• Vectorization of the dataset. Vectorization is the process of converting a given text into numerical indices. This is necessary because the machine learning pipeline can only be trained on numerical data.
• Removal of null values. All the rows that had null data had to be dropped, because for textual data it is very difficult to estimate the value in the null spot.
Those were the data preparation processes I used in the model creation process.

B. DATA ANALYSIS

Exploratory data analysis is the process of performing initial investigations on data to discover anomalies and patterns. It is mainly carried out through visualization of the data. Below are the visualizations and their meanings.

1) Word Cloud: A word cloud is a graphical representation of the words that are used frequently in the dataset. This is important as it shows that the model will highly depend on those particular words.
Word cloud for the Luganda sentences.

2) Correlation matrix: This is a matrix showing the correlation of the different values to each other. Plotting a 2D correlation matrix for the entire dataset is almost impossible, but what is possible is plotting it for a particular sentence. The matrix shows the correlation for a given sentence. Here the model will have to pay a lot of attention to the words that are highly correlated to each other.
Correlation matrix for the respective Luganda sentence.
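The data preparation steps listed in Section VIII-A (punctuation and space removal, lowercasing, null removal, and vectorization) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the paper's models; the helper names `clean`, `prepare`, and `vectorize` and the sample sentence pairs are hypothetical.

```python
import re
import string


def clean(text):
    """Lowercase, strip ASCII punctuation, and collapse repeated spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def prepare(pairs):
    """Drop pairs with a missing side, then clean both sides."""
    return [(clean(en), clean(lg)) for en, lg in pairs if en and lg]


def vectorize(sentences):
    """Replace each word with an integer index; 0 is left free for padding."""
    index = {}
    vectors = []
    for sentence in sentences:
        vectors.append([index.setdefault(word, len(index) + 1)
                        for word in sentence.split()])
    return vectors, index


# Toy rows (assumption): the second pair has a null translation and is dropped.
raw_pairs = [("Hello,  World!", "Gyebale ko!"), ("Thank you.", None)]

pairs = prepare(raw_pairs)
vectors, index = vectorize([en for en, _ in pairs])
print(pairs, vectors)
```

In practice each language would get its own index, and a library vectorizer would usually replace the hand-written one; the sketch only shows the order of the steps.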
3) Sentence Lengths plots: Through these plots, we are able to determine the length to which all the sentences in the dataset should be padded, because during the training process they are all supposed to be of the same length.
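As a small illustration of why the sentence lengths matter, the sketch below pads every vectorized sentence to the length of the longest one. Padding to the maximum length and using 0 as the padding index are assumptions made for the example; the function name `pad_sequences` is chosen here for readability and the sample vectors are made up.

```python
def pad_sequences(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Toy vectorized sentences of different lengths (assumption).
vectors = [[4, 9], [7, 2, 5, 1], [3]]

padded = pad_sequences(vectors)
print(padded)
```

If a few sentences are much longer than the rest, a shorter target length with truncation may be a reasonable alternative, which is the kind of decision the length plots help with.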

The sentence-length figures show the maximum sentence lengths for the English and the Luganda sentences, respectively.

4) Box Plot: A box plot is a visual representation that can be used to show the major outliers in the dataset. Plotting a box plot for the entire dataset is also almost impossible, but what is possible is plotting the box plot for a particular sentence. This shows the possible outliers in the sentence, so that during the training process the model ends up not paying a lot of attention to those particular words.
Box plot for one of the sentences in the dataset.

In conclusion, data preparation and exploratory data analysis are key steps in the creation of a very accurate model.

IX. AI MODEL SELECTION AND OPTIMIZATION

Throughout the project, I created three models: one with recurrent neural networks, another with the attention mechanism, and the last one with transfer learning on a pre-trained Hugging Face transformer model.
• The recurrent neural network model was a simple model that uses RNNs to translate the text. Its accuracy was very poor because the vocabulary of the two languages was very large. These types of RNNs are best suited to small vocabularies.
• The attention mechanism model. This turned out to be much better than the RNN model. Attention is a mechanism used in deep neural networks where the model can focus on only the important parts of a given text by assigning them larger weights.
• The other model I created used transformers. Transformers are also deep learning models that are built on top of attention layers. This makes them much more efficient when it comes to NLP tasks.

X. ACCOUNTABILITY

In the context of AI, “accountability” refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate or deploy, in accordance with their roles and applicable regulatory frameworks, and that they will demonstrate this through their actions and decision-making process (for example, by providing documentation on key decisions throughout the AI system lifecycle or conducting or allowing auditing where justified).

AI accountability is very important because it's a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I also apply accountability to my own model. Since my AI model mainly depends on the dataset, it's best to make sure that the quality of the dataset is constantly improved and filtered, because even slight modifications in the spelling of the words will decrease the model's accuracy.

XI. RESULTS AND DISCUSSION

I split the data into a training set and a validation set. Below are the results.
The training accuracy is of 92

A. Validation and Accuracy plot

It's clear that the model is overfitting the dataset, but its accuracy is still fairly good.

B. ATTENTION PLOT

An attention plot is a figure showing how the model was able to predict the given output. Words that were predicted with a very high probability are more strongly coloured.

XII. CONCLUSION AND FUTURE WORKS

I hope this paper will give a basic understanding of the different machine learning methods that can be used to create a deep learning model capable of translating a given English text into Luganda. The same idea can be used to translate different languages.

The model currently is overfitting the dataset. One way to overcome this is to increase the size of the data, because the dataset contains only about 15k sentences. Increasing the dataset to about a million sentences should considerably improve the model's accuracy.

Another option is the use of other machine learning techniques like transformers. The model illustrated above was based on the attention mechanism of neural networks. Using transformers should improve the quality of the model even more. Transformers are usually complicated to build from scratch, so what I would recommend instead is fine-tuning an already trained model, which is called transfer learning.

A. Dataset and python source code

LINK to the Final Python Source Code - https://colab.research.google.com/drive/1NsRAxdftGIzqzeIMw3NFY9xClLk49f

LINK to the used dataset - https://zenodo.org/record/5855017

LINK to the YouTube Video - https://youtu.be/RLXfM0iLQag

XIII. ACKNOWLEDGMENT

Special thanks to Mr. Ggaliwango Marvin for his never ending support towards my research on this project. I also want to appreciate Dr. Rose Nakibuule for the provision of the foundation knowledge needed for this project. [4]
REFERENCES
[1] M. Singh, R. Kumar, and I. Chana, "Neural-Based Machine Translation System Outperforming Statistical Phrase-Based Machine Translation for Low-Resource Languages," 2019 Twelfth International Conference on Contemporary Computing (IC3), 2019, pp. 1-7, doi: 10.1109/IC3.2019.8844915. V. Bakarola and J. Nasriwala, "Attention based Neural Machine Translation with Sequence to Sequence Learning on Low Resourced Indic Languages," 2021 2nd International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS), 2021, pp. 178-182, doi: 10.1109/ACCESS51619.2021.9563317.
[2] Academy, E. (2022) How to Write a Research Hypothesis - Enago Academy, Enago Academy. Available at: https://www.enago.com/academy/how-to-develop-a-good-research-hypothesis/ (Accessed: 17 November 2022). What is the project scope? (2022). Available at: https://www.techtarget.com/searchcio/definition/project-scope (Accessed: 17 November 2022).
[3] Machine translation – Wikipedia (2022). Available at: https://en.wikipedia.org/wiki/Machine_translation (Accessed: 17 November 2022).
[4] K. Chen et al., "Towards More Diverse Input Representation for Neural Machine Translation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1586-1597, 2020, doi: 10.1109/TASLP.2020.2996077.
[5] O. Mekpiroon, P. Tammarattananont, N. Apitiwongmanit, N. Buasroung, T. Charoenporn and T. Supnithi, "Integrating Translation Feature Using Machine Translation in Open Source LMS," 2009 Ninth IEEE International Conference on Advanced Learning Technologies, 2009, pp. 403-404, doi: 10.1109/ICALT.2009.136.
[6] J.-W. Hung, J.-R. Lin and L.-Y. Zhuang, "The Evaluation Study of the Deep Learning Model Transformer in Speech Translation," 2021 7th International Conference on Applied System Innovation (ICASI), 2021, pp. 30-33, doi: 10.1109/ICASI52993.2021.9568450.
[7] V. Alves, J. Ribeiro, P. Faria and L. Romero, "Neural Machine Translation Approach in Automatic Translations between Portuguese Language and Portuguese Sign Language Glosses," 2022 17th Iberian Conference on Information Systems and Technologies (CISTI), 2022, pp. 1-7, doi: 10.23919/CISTI54924.2022.9820212.
[8] Machine Translation – Towards Data Science. (2022). Retrieved 24 November 2022, from https://towardsdatascience.com/tagged/machine-translation
[9] H. Sun, R. Wang, K. Chen, M. Utiyama, E. Sumita and T. Zhao, "Unsupervised Neural Machine Translation With Cross-Lingual Language Representation Agreement," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1170-1182, 2020, doi: 10.1109/TASLP.2020.2982282.
[10] Y. Wu, "A Chinese-English Machine Translation Model Based on Deep Neural Network," 2020 International Conference on Intelligent Transportation, Big Data and Smart City (ICITBS), 2020, pp. 828-831, doi: 10.1109/ICITBS49701.2020.00182.
[11] L. Wang, "Adaptability of English Literature Translation from the Perspective of Machine Learning Linguistics," 2020 International Conference on Computers, Information Processing and Advanced Education (CIPAE), 2020, pp. 130-133, doi: 10.1109/CIPAE51077.2020.00042.
[12] S. P. Singh, H. Darbari, A. Kumar, S. Jain and A. Lohan, "Overview of Neural Machine Translation for English-Hindi," 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), 2019, pp. 1-4, doi: 10.1109/ICICT46931.2019.8977715.
[13] R. F. Gibadullin, M. Y. Perukhin and A. V. Ilin, "Speech Recognition and Machine Translation Using Neural Networks," 2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), 2021, pp. 398-403, doi: 10.1109/ICIEAM51226.2021.9446474.
[14] How to Build Accountability into Your AI. (2021). Retrieved 24 November 2022, from https://hbr.org/2021/08/how-to-build-accountability-into-your-ai
[15] Mukiibi, J., Hussein, A., Meyer, J., Katumba, A., and Nakatumba-Nabende, J. (2022). The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition. Retrieved 24 November 2022, from https://zenodo.org/record/5855017
