
Named-Entity Recognition for a Low-resource Language using Pre-Trained Language Model


Hailemariam Mehari Yohannes∗
Systems and Information Engineering
University of Tsukuba
Tsukuba, Ibaraki, Japan
[email protected]

Toshiyuki Amagasa
Center for Computational Sciences
University of Tsukuba
Tsukuba, Ibaraki, Japan
[email protected]

∗ First Author

ABSTRACT
This paper proposes a method for Named-Entity Recognition (NER) for a low-resource language, Tigrinya, using a pre-trained language model. Tigrinya is a morphologically rich language, yet it is underrepresented in the field of NLP, mainly due to the limited amount of annotated data available. To address this problem, we introduce the first publicly available NER dataset for Tigrinya. The dataset contains 69,309 tokens that were manually annotated based on the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging schema. Specifically, we develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa. It is first trained on an unlabeled Tigrinya corpus using Masked Language Modeling (MLM). We then show the validity of TigRoBERTa by fine-tuning it for two downstream tasks, namely NER and Part-of-Speech (POS) tagging. The experimental results show that the method achieves an 81.05% F1-score for NER and 92% accuracy for POS tagging, which is better than or comparable to a baseline method based on the CNN-BiLSTM-CRF model.

CCS CONCEPTS
• Information retrieval → Document representation;

KEYWORDS
Named entity recognition, POS tagging, pre-trained language model, low-resource language, RoBERTa

ACM Reference Format:
Hailemariam Mehari Yohannes and Toshiyuki Amagasa. 2022. Named-Entity Recognition for a Low-resource Language using Pre-Trained Language Model. In The 37th ACM/SIGAPP Symposium on Applied Computing (SAC '22), April 25–29, 2022, Virtual Event. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3477314.3507066

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SAC '22, April 25–29, 2022, Virtual Event
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-8713-2/22/04 . . . $15.00
https://doi.org/10.1145/3477314.3507066

1 INTRODUCTION
Tigrinya (also known as Tigrigna) is a Semitic (Afro-Asiatic) language spoken by an estimated population of more than 9 million¹ in northern Ethiopia and Eritrea. The Semitic family includes modern languages such as Tigrinya, Tigre, Amharic, Hebrew, and Arabic².

Despite its large speaker population, Tigrinya does not yet have sufficient linguistic resources and is regarded as a low-resource language, largely because it has received little attention from researchers. As a result, most NLP tasks for such low-resource languages, e.g., Named Entity Recognition (NER), POS tagging, sentiment analysis, and question answering, are still in the early stages and lack sufficient language resources (e.g., corpora), while NLP for resource-rich languages (e.g., English and French) has advanced significantly thanks to the emergence of deep learning technologies.

One of the obstacles is that creating language resources is costly. For example, high-quality human-annotated datasets are essential for training NLP models for tasks such as NER, POS tagging, sentiment analysis, and text classification, but building an annotated dataset is very expensive and time-consuming. According to [3], recent studies on named entity identification relied on hand-crafted features and very large knowledge resources, which is time-consuming and not appropriate for low-resource languages.

So far, there have been efforts to create datasets for low-resource languages; for example, [11] created a NER dataset for ten South African languages based on government data. Additionally, [31] created a dataset for some African languages (such as Amharic), but it is not publicly available. Another attempt by [1] also created a NER dataset for 10 African languages; unfortunately, Tigrinya was not included in their study. To our knowledge, the only publicly available dataset for Tigrinya is the Nagaoka corpus [35], which can only be used for the POS tagging task.

As for state-of-the-art NLP, many recent works use Transformer architectures [38] that have been pre-trained on language modeling tasks. Recent studies have shown that transfer learning by fine-tuning pre-trained language models [9, 16, 18] can improve the performance of downstream tasks. Pre-trained language models therefore play a crucial role in developing NLP tasks with small training samples.

¹ https://utalk.com/en/store/tigrinya
² https://www.ucl.ac.uk/atlas/tigrinya/language.html


Multilingual Transformer models [8, 26] have benefited several low- and rich-resource languages. However, the Tigrinya language was not considered in the pre-training of multilingual Transformer models. Furthermore, several deep learning and supervised studies [7, 10, 14, 17, 30, 40] use multiple processing layers to learn a hierarchical representation of data and have achieved better results in many NLP domains.

Having observed this, we develop in this paper the first publicly available dataset for Tigrinya tagged for Named Entity Recognition (NER). More precisely, we have annotated 69,309 tokens in 3,625 sentences for five entity types: person (PER), location (LOC), organization (ORG), date and time (DATE), and miscellaneous (MISC). We then introduce a pre-trained language model for the Tigrinya language, which is a RoBERTa-based [18] model. The language model was trained exclusively on a Tigrinya corpus using the Masked Language Modeling (MLM) task and has the same size as the RoBERTa-base model. We name the language model TigRoBERTa (Tig refers to Tigrinya, and RoBERTa refers to the Transformer model used).

We then apply TigRoBERTa to two different downstream sequence labeling tasks: NER and POS tagging. For this purpose, we fine-tune TigRoBERTa using the respective labeled datasets. The experimental results show that TigRoBERTa achieves an F1-score of 81.05% for NER and 92% accuracy for POS tagging. We further explore the CNN-BiLSTM-CRF as a baseline model for the NER and POS tagging tasks. To initialize the CNN-BiLSTM-CRF and test the significance of pre-trained word embeddings, we pre-trained word2vec [22] embeddings on the Tigrinya corpus. The CNN-BiLSTM-CRF model with the word2vec Tigrinya embeddings achieved a 68.86% F1-score for the NER task and 94% accuracy for POS tagging. Thus, we surpassed the highest accuracy reported so far and achieved the best result for the POS tagging task.

Our contributions can be summarized as follows:

• We develop and release the first publicly available named entity recognition dataset for the Tigrinya language.
• We develop and release a language model pre-trained exclusively on a Tigrinya language corpus.
• We introduce supervised and transfer learning techniques for the Tigrinya language.

We expect that our dataset and the proposed language model will contribute to researchers working on NLP for Tigrinya.

The rest of this paper is organized as follows: Section 2 provides a brief literature review of previous NLP work on Tigrinya and other low-resource languages. Section 3 presents the development and annotation process of the Named Entity Recognition dataset. Section 4 describes how the data was collected for pre-training the model and the methodology used. Section 5 discusses the experimental results on the NER and POS tagging datasets using the TigRoBERTa and CNN-BiLSTM-CRF models. Finally, Section 6 presents the conclusions and directions for future work.

2 LITERATURE REVIEW
In recent years, there have been several studies on NLP for the Tigrinya language, such as Neural Machine Translation (NMT) [2, 5, 23, 34], text classification [12], POS tagging [33, 37], and stemming Tigrinya words for information retrieval [24].

Unlike rich-resource languages like English, there is almost no available Tigrinya corpus, which makes it difficult for researchers to develop tools. To the best of our knowledge, the only publicly available labeled corpus for the Tigrinya language is the Nagaoka POS tagging corpus [35], which contains gold POS labels for 72,080 tokens and 4,656 sentences.

The work with the Nagaoka corpus proposed a method for POS tagging using traditional supervised machine learning approaches [36]. The authors evaluated traditional machine learning methods, i.e., Conditional Random Fields (CRF) and Support Vector Machines (SVM). The original Nagaoka Tigrinya POS tagging corpus contained 73 labels, which were reduced to 20 labels, and the method achieved an accuracy of 89.92% and 90.89% for SVM and CRF, respectively.

Another study on Tigrinya POS tagging using the Nagaoka corpus was conducted by [37]. The authors evaluated Deep Neural Network (DNN) classifiers: Feed-Forward Neural Network (FFNN), Long Short-Term Memory (LSTM), Bidirectional LSTM, and Convolutional Neural Network (CNN) using word2vec neural word embeddings. They reported that the BiLSTM approach was the most suitable for POS tagging and achieved 91.1% accuracy.

Moreover, [34] investigated the effects of morphological segmentation on the performance of statistical machine translation from English to Tigrinya. They performed segmentation to achieve better word alignment and to reduce vocabulary dropouts, thereby improving the language model in both languages. Furthermore, they explored two segmentation schemes, i.e., one based on longest-affix segmentation and another based on fine-grained morphological segmentation. Another study by [2] investigated the use of current Neural Machine Translation (NMT) techniques for Tigrinya. The author used a Transformer-based architecture to achieve better translation performance than previous approaches and proposed a method for the English-Tigrinya translation task using the JW300 En-Ti dataset.

Another study by [12] investigated text classification based on CNN-BiLSTM for Tigrinya. They created a manually annotated dataset of 30,000 documents from Tigrinya news with the six categories "sports", "agriculture", "politics", "religion", "education", and "health", as well as an unannotated corpus of more than six million words. However, they did not make their corpus publicly available. In their study, they evaluated word2vec and fastText word embeddings in classification models by applying CNNs to Tigrinya news articles. [1] investigated the NER task for ten African languages and created and published a NER dataset for each language. They also investigated cross-domain transfer with experiments on five languages using the WikiAnn dataset, as well as cross-lingual transfer for low-resource named entity recognition.

As a result of reviewing the Tigrinya NLP literature, we found that none of the previous studies has applied neural network approaches to the NER task.


Figure 1: An example of Named Entity Recognition in English and Tigrinya. PER, LOC, and DATE are identified as entities.

3 TIGRINYA NER CORPUS

3.1 Tigrinya Script
Tigrinya uses the Ge'ez script. Ge'ez is used as an abugida (alphasyllabary) for several Afro-Asiatic and Nilo-Saharan languages in Ethiopia and Eritrea in the Horn of Africa. In Amharic and Tigrinya, the script is often called fidäl, meaning "script" or "letter." Tigrinya is written from left to right in the Ethiopic script, which is an abugida; in abugida scripts, each symbol represents a pair of a consonant sound and a vowel sound. In Ethiopia, native Tigrinya speakers in the Tigray region are called tigrāwāy for males, tigrāweytī for females, and tigrāwōt or tegaru as a group. The dialects of Tigrinya differ in sound, spelling, and grammar³.

3.2 Corpus Annotation
According to [20], a language is considered low-resource if there are no (or not enough) annotated corpora, name dictionaries, appropriate morphological analyzers, POS taggers, or a treebank in that language. Developing a corpus for a low-resource language like Tigrinya is very difficult for at least two reasons. First, there is no annotation tool for low-resource languages. Second, developing a labeled corpus is expensive and time-consuming. To address this problem, we develop a newly labeled corpus for named entity recognition in the Tigrinya language. We labeled 69,309 tokens in 3,625 sentences. Since there is no annotation tool for the Tigrinya language, our corpus was annotated manually. The corpus contains sentences from 2015–2021 on various topics. Table 1 shows the distribution of the different text topics in the corpus. We annotated five entity types: person name (PER), location (LOC), organization (ORG), date and time (DATE), and miscellaneous (MISC) using the BIO standard. The annotated tags were inspired by the English CoNLL-2003 corpus [29]. In addition, we follow the MUC-6 [32] annotation guidelines.

Table 1: Distribution of different text types in the corpus.

News article topic    Percentage
Agriculture           8
Business              12
Culture               12
Health                15
History               10
Politics              18
Sport                 10
General               15
Total                 100

Table 2: Entity tags with their frequency and percentage in the corpus.

Entity tag      Frequency of tokens    Percentage
PER             2,095                  2.8
LOC             3,333                  4.94
ORG             2,612                  3.75
DATE            2,881                  4.21
MISC            369                    0.5
O               58,019                 83.8
Total tokens    69,309                 100

Table 3: Inter-annotator agreement for our dataset, calculated using Cohen's kappa for each entity tag. Disagreements were resolved by discussion.

Entity    Kappa statistic
PER       0.98
LOC       0.96
ORG       0.95
DATE      0.93
MISC      0.85

³ https://blog.amara.org/2021/08/04/new-to-amara-tigrinya/amp/
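To make the tagging scheme concrete, the following is a small illustrative example (in English, mirroring Figure 1; the actual corpus is in Tigrinya) of how a sentence is stored as token/tag pairs under the BIO scheme. The sentence, names, and tags here are hypothetical and not taken from the corpus:

    # Illustrative BIO-tagged sentence (hypothetical example).
    # B- marks the first token of an entity, I- a continuation token, O a non-entity token.
    tagged_sentence = [
        ("Abraham", "B-PER"),
        ("visited", "O"),
        ("Mekelle", "B-LOC"),
        ("on", "O"),
        ("15", "B-DATE"),
        ("June", "I-DATE"),
        ("2021", "I-DATE"),
        (".", "O"),
    ]

    # A CoNLL-style file stores one "token<TAB>tag" pair per line, with a blank line
    # separating sentences.
    conll_lines = "\n".join(f"{token}\t{tag}" for token, tag in tagged_sentence)
    print(conll_lines)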


In the following, we summarize the annotation guidelines for the five classes.

PER: personal names, including first, middle, and last names. Personal names that refer to an organization, location, event, law, or prize were not tagged with the PER tag.
LOC: all country, region, state, and city names, as well as non-GPE locations such as mountains, rivers, and bodies of water.
ORG: proper names covering all kinds of organizations, sports teams, multinational organizations, political parties, and unions, as well as proper names referring to facilities.
DATE: absolute date expressions denoting a particular segment of time, i.e., a particular day, season, quarter, year, decade, or century.
MISC: other types of entities, e.g., events, specific disease names, etc.
O: used for non-entity tokens.

The annotation process was carried out by three paid and four volunteer human annotators who have a linguistic background and are native speakers of Tigrinya. Table 2 shows the frequency of each entity tag. The corpus was annotated according to the established Beginning, Inside, and Outside (BIO) scheme, where "B" indicates the first word of an entity, "I" indicates the remaining words of the same entity, and "O" indicates that the tagged word is not a named entity. Our corpus will be made publicly available on GitHub⁴ for research purposes after the publication of the related papers.

⁴ https://github.com/mehari-eng/Tigrinya-NER

To validate the annotation quality, we report inter-annotator agreement scores in Table 3, calculated using Cohen's kappa [21] for all entity tags. We calculated the inter-annotator agreement between two annotation sets of 3,625 sentences with 69,309 tokens. The agreement for the PER and LOC annotations is relatively high, whereas the kappa agreement for MISC is low compared to the other entities; MISC was thus the most difficult tag for our annotators. The goal of our annotation procedure was to produce a high-quality corpus by ensuring high annotator agreement.
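The paper does not describe the tooling used to compute the kappa values in Table 3. As a minimal sketch, assuming scikit-learn is available, overall agreement can be computed directly from two annotators' token-level tag sequences, and a per-entity figure can be approximated by collapsing the tags to a binary "belongs to entity X" indicator (one possible simplification, not necessarily the authors' exact procedure). The tag sequences below are hypothetical:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical token-level tags from two annotators over the same tokens.
    annotator_a = ["B-PER", "O", "B-LOC", "O", "B-DATE", "I-DATE", "O"]
    annotator_b = ["B-PER", "O", "B-LOC", "O", "O",      "I-DATE", "O"]

    # Overall agreement across all tags.
    print("overall kappa:", cohen_kappa_score(annotator_a, annotator_b))

    # Approximate per-entity agreement: reduce each position to a binary indicator
    # for the given entity type and compute kappa on the binary sequences.
    def per_entity_kappa(a, b, entity):
        a_bin = [int(t.endswith(entity)) for t in a]
        b_bin = [int(t.endswith(entity)) for t in b]
        return cohen_kappa_score(a_bin, b_bin)

    for entity in ["PER", "LOC", "DATE"]:
        print(entity, per_entity_kappa(annotator_a, annotator_b, entity))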
4 PROPOSED METHOD

4.1 Overview
In this work, we propose a new pre-trained RoBERTa-based language model for the Tigrinya language. Language models are trained to predict tokens in a text based on their surrounding context, and the resulting model can be used for different downstream tasks, such as Named Entity Recognition (NER) or Part-of-Speech (POS) tagging. Figure 2 shows an overview of our proposed language model, the source data used, and the architecture used to generate our model, as well as the fine-tuning of the model for downstream tasks such as NER and POS tagging.

4.2 Transformer-based Architectures
Our model exploits the well-studied Transformer [38]. It uses an encoder-decoder architecture for converting one sequence into another: the encoder takes a sequence as input and converts it into an embedding, which is a vector representation of the input, and the decoder takes an embedding as input and converts it into a sequence. The encoder and decoder consist of several multi-headed attention blocks stacked on top of each other. Recent approaches, such as BERT, ALBERT, RoBERTa, GPT-3, and XLNet [9, 16, 18, 27, 39], use Transformers to create embeddings that can be used for other tasks.

In this study, we use RoBERTa [18], which is a Transformer-based model and a replication of BERT (Bidirectional Encoder Representations from Transformers) [9]. BERT uses two training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In the MLM phase, 15% of the words are replaced with [MASK] tokens and the model tries to predict the original masked words. In the NSP phase, the model concatenates two masked sentences as input during pre-training; sometimes they correspond to sentences that were next to each other in the original text and sometimes not, and the model predicts whether or not the two sentences followed each other⁵. In addition, the BERT architecture uses a stack of either 12 (base) or 24 (large) encoders.

RoBERTa was developed by Facebook [18] with the aim of optimizing the training of BERT. It has a similar architecture to BERT and was introduced to improve BERT's training procedure: RoBERTa modifies key hyper-parameters of BERT, including removing BERT's next-sentence-prediction objective in order to train on longer sequences, and introduces dynamic masking. RoBERTa also trains with much larger mini-batches and learning rates. This allows RoBERTa to improve on the MLM objective compared to BERT and leads to better performance on downstream tasks. RoBERTa comes in two sizes: the base model consists of 12 layers with a hidden size of 768 (about 125M parameters), while the large model has 24 layers with a hidden size of 1024 (about 355M parameters).

⁵ https://huggingface.co/bert-base-uncased

4.3 TigRoBERTa Language Model
Recent studies [9, 16, 18] have shown that pre-trained language models improve the performance of many NLP tasks. Several rich- and low-resource languages have benefited from multilingual Transformer models [8, 26]; unfortunately, the Tigrinya language was not considered in the pre-training of the multilingual Transformer language models. To solve this problem, we have pre-trained a new language model for Tigrinya. Our model is trained purely on a Tigrinya corpus, and we name it TigRoBERTa (Tig refers to Tigrinya and RoBERTa refers to the Transformer model used). We use the RoBERTa-base configuration with 12 blocks, 768 hidden dimensions, 12 attention heads, and a maximum sequence length of 512.

Data preparation. The dataset used for pre-training TigRoBERTa was compiled from various Tigrinya online platforms, mainly different news portals, and some freely available e-books, including the Bible. The model was therefore trained on a small dataset and for a small number of epochs compared to the most common monolingual Transformer models.

The initial Tigrinya corpus contained slightly less than 4.5 million sentences (approx. 800 MB). Through data exploration, we observed that it contained documents in other languages, especially English and Amharic, so we had to clean the dataset before training our tokenizer and model. After cleaning, the Tigrinya corpus was reduced to 4.3 million sentences (approx. 750 MB).
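The paper does not describe how the non-Tigrinya documents were removed. One possible cleaning step, sketched below, filters corpus lines by the proportion of characters falling in the Ethiopic (Ge'ez) Unicode block, which covers Tigrinya script; the threshold and file names are assumptions for illustration only:

    # Hypothetical cleaning step: keep only lines whose alphabetic characters are
    # mostly in the Ethiopic Unicode block (U+1200-U+137F), and drop lines dominated
    # by Latin-script (e.g., English) text.
    def is_mostly_ethiopic(line, threshold=0.7):
        letters = [ch for ch in line if ch.isalpha()]
        if not letters:
            return False
        ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
        return ethiopic / len(letters) >= threshold

    def clean_corpus(in_path="tigrinya_raw.txt", out_path="tigrinya_clean.txt"):
        with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                if is_mostly_ethiopic(line):
                    fout.write(line)

    if __name__ == "__main__":
        clean_corpus()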


Figure 2: Proposed language model: the source text data used for pre-training, the architecture used (RoBERTa), the generation of the TigRoBERTa model, and the fine-tuning of TigRoBERTa on the NER and POS tagging tasks.

Table 4: Hyper-parameter settings for the CNN-BiLSTM-CRF model in the NER and POS tagging experiments.

Hyper-parameter           NER & POS
Char window size          3
Char number of filters    30
Dropout ratio             0.5
Batch size                10
Learning rate             0.01
Decay rate                0.05
Gradient clipping         5

Figure 3: Architecture of the CNN-BiLSTM-CRF model.

Tokenizer. For our implementation, we use a Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 50,265 units. We replaced the default BPE tokenizer of RoBERTa with a Tigrinya tokenizer. Using BPE allows learning a sub-word vocabulary of modest size that can encode any input without producing "unknown" tokens. The model's inputs consist of 512 contiguous tokens that can span multiple documents. Special tokens are added to the vocabulary to represent the beginning and end of the input sequence (<s>, </s>), where:
• <s> marks the beginning of the sentence; and
• </s> marks the end of the sentence.
In addition, unknown, masking, and padding tokens are also added, where:
• <unk> is needed for unknown sub-strings occurring during inference;
• <pad> tokens are needed for short sentences, since batch training requires uniformly long inputs; and
• <mask> tokens are required for language modeling, since training is based on hiding a random number of input tokens in a given sentence and predicting them correctly.

Model training. TigRoBERTa shares its architecture with RoBERTa's base model, which is a replication of and improvement over BERT. Since the model is RoBERTa-like, we train it with a masked language modeling task. This involves masking part of the input, about 15% of the tokens, and then learning a model that predicts the missing tokens. MLM is often used as a pre-training task to allow the model to learn text patterns from unlabeled data. The optimizer used is Adam with a learning rate of 3e-4, a weight decay of 0.01, 5,000 warm-up steps, and a maximum of 100k steps, with a sequence length of 512, for 8 epochs, which took 7 days of training. Once we had pre-trained our model, we fine-tuned it on each downstream task, as discussed in Section 5.2.2.
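A condensed sketch of this tokenizer and pre-training setup is shown below, assuming the Hugging Face tokenizers, transformers, and datasets libraries (the paper does not name its implementation stack). The vocabulary size, special tokens, masking rate, optimizer settings, and sequence length follow the description above; the file and directory names and the batch size are assumptions:

    import os
    from tokenizers import ByteLevelBPETokenizer
    from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    # 1) Train a byte-level BPE tokenizer on the cleaned Tigrinya corpus.
    os.makedirs("tigroberta-tokenizer", exist_ok=True)
    bpe = ByteLevelBPETokenizer()
    bpe.train(files=["tigrinya_clean.txt"], vocab_size=50265,
              special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
    bpe.save_model("tigroberta-tokenizer")

    # 2) Wrap it for transformers and build a RoBERTa-base-sized model from scratch.
    tokenizer = RobertaTokenizerFast.from_pretrained("tigroberta-tokenizer", model_max_length=512)
    config = RobertaConfig(vocab_size=50265, num_hidden_layers=12, hidden_size=768,
                           num_attention_heads=12, max_position_embeddings=514)
    model = RobertaForMaskedLM(config=config)

    # 3) Tokenize the corpus and train with the MLM objective (15% masking).
    dataset = load_dataset("text", data_files={"train": "tigrinya_clean.txt"})["train"]
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                          batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

    # Trainer uses an Adam(W) optimizer by default; max_steps takes precedence over epochs.
    args = TrainingArguments(output_dir="tigroberta", num_train_epochs=8,
                             learning_rate=3e-4, weight_decay=0.01,
                             warmup_steps=5000, max_steps=100_000,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()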
5 EXPERIMENTAL EVALUATION
In this section, we present the experimental evaluation of the proposed TigRoBERTa model on two downstream tasks, i.e., NER and POS tagging. Specifically, in addition to TigRoBERTa, we employ a CNN-BiLSTM-CRF model as a baseline and compare their performance. For all experiments, we used 80% of the data for training, 10% for testing, and the remaining 10% for validation.

5.1 Experimental Setting
5.1.1 TigRoBERTa. To apply our TigRoBERTa model to downstream tasks, we further train the pre-trained model by replacing the fully connected output layer of the network with a new set of output layers that produce the desired output. Only the output parameters are learned from scratch; the remaining model parameters are slightly fine-tuned. In this phase, we fine-tuned our model to evaluate its performance on two different downstream sequence labeling tasks, NER and POS tagging.
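A minimal sketch of this fine-tuning step for NER, assuming the Hugging Face token-classification head (an assumption; the paper does not name its implementation): the pre-trained TigRoBERTa encoder is loaded and a freshly initialized classification layer is trained on the BIO labels. The checkpoint path, epoch count, and batch size are illustrative assumptions:

    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    # BIO label set for the five entity types described in Section 3.2.
    labels = ["O",
              "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
              "B-DATE", "I-DATE", "B-MISC", "I-MISC"]
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}

    # "tigroberta" is assumed to be the directory holding the pre-trained checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("tigroberta")
    model = AutoModelForTokenClassification.from_pretrained(
        "tigroberta", num_labels=len(labels), id2label=id2label, label2id=label2id)
    # The classification head on top of the encoder is randomly initialized and learned
    # from scratch, while the pre-trained encoder weights are only slightly adjusted.

    args = TrainingArguments(output_dir="tigroberta-ner",
                             num_train_epochs=3,              # assumption, not stated in the paper
                             per_device_train_batch_size=16)  # assumption, not stated in the paper
    # train_ds and dev_ds would be the 80%/10% train and validation splits of the NER
    # corpus, tokenized with `tokenizer` and aligned with the BIO labels (omitted here).
    # Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds).train()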


5.1.2 Baseline: CNN-BiLSTM-CRF. To evaluate the baseline performance, we experimented with a CNN-BiLSTM-CRF model. This model architecture was proposed by [19]. We first use a Convolutional Neural Network (CNN) to encode each word into its character-level representation. Then, we concatenate the character-level and word-level representations and pass them to the BiLSTM layer. The CRF layer produces sentence-level tag information for sequence prediction. To initialize our model, we pre-trained word2vec word embeddings for Tigrinya with different dimensions. Figure 3 shows the architecture of the CNN-BiLSTM-CRF model.

Word embedding. A word embedding maps a word to a vector representation in such a way that words with similar meanings are mapped to points close to each other, while dissimilar words are mapped to points far apart. Distributional pre-trained word embeddings such as word2vec, GloVe, and fastText [6, 22, 25] are trained with a neural network; they encode similarities based on the contexts in which words appear and are used as input to deep neural network classifiers. In this work, we used the word2vec method, proposed by [22] and released in 2013, to pre-train Tigrinya embeddings. It takes a large text corpus as input and generates a vector space. We pre-trained word2vec embeddings with 50, 100, 200, and 300 dimensions (using the 4.3 million Tigrinya sentences), as well as randomly initialized 100-dimensional embeddings over the labeled corpora (NER: 3,625 sentences; POS tagging: 4,656 sentences). We used the Gensim Python library [28] to train the Tigrinya word2vec embeddings with the default parameters.
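A minimal sketch of this embedding step with Gensim (Gensim 4.x API; the input file name and the 100-dimension setting are chosen for illustration, and apart from the vector size the defaults are kept, as in the paper):

    from gensim.models import Word2Vec

    # Each line of the cleaned corpus is treated as one whitespace-tokenized sentence
    # (hypothetical file name; the paper does not describe its preprocessing).
    with open("tigrinya_clean.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f if line.strip()]

    # Train 100-dimensional word2vec vectors; the other settings (50, 200, 300) are
    # obtained by changing vector_size.
    model = Word2Vec(sentences=sentences, vector_size=100)
    model.wv.save_word2vec_format("tigrinya_word2vec_100d.txt")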
LSTM layer. Recurrent Neural Networks (RNNs) [4] are a class of neural networks that are powerful for sequence modeling. In theory, RNNs can handle long-term dependencies, but in practice they fail due to vanishing gradients. Long Short-Term Memory (LSTM) was introduced by [13] and was designed to avoid the problem of long-term dependencies. An LSTM has three gates: the forget gate, the input gate, and the output gate.
• The forget gate decides what information to discard from the cell state.
• The input gate decides what information to store in the cell state.
• The output gate determines the value of the next hidden state, which contains information about previous inputs.

Conditional Random Field (CRF). The CRF [15] is a class of statistical modeling methods commonly used in pattern recognition and machine learning for structured prediction. On top of the LSTM layer, the CRF decodes the labels for the whole sequence.

5.2 Experimental Results
5.2.1 Baseline Model Result. Table 5 gives the F1-scores obtained by the CNN-BiLSTM-CRF model on the Tigrinya NER dataset. Named Entity Recognition (NER) is the task of identifying and categorizing entities in text: given a corpus, NER attempts to find and classify the named entities it contains. In this experiment, we trained the CNN-BiLSTM-CRF model on our NER dataset, which contains 5 entity tags, 3,625 sentences, and 69,309 tokens. We experimented with different word embedding dimensions and random initialization to initialize the CNN-BiLSTM-CRF model.

Table 5: Evaluation of the CNN-BiLSTM-CRF model using different word2vec embedding settings on NER. The results are F1-scores on the test set.

Embedding    Dimension    F1-score
Random       100          63.05
Word2Vec     50           66.44
Word2Vec     100          68.86
Word2Vec     200          66.72
Word2Vec     300          67.64

Table 6: F1-scores of the CNN-BiLSTM-CRF model for the PER, LOC, ORG, DATE, and MISC tags with different word2vec embedding settings on the test set.

Embedding    Dimension    PER      LOC      ORG      DATE     MISC
Random       100          70.07    64.76    45.78    60.15    41.23
Word2Vec     50           77.54    67.58    50.31    57.75    46.18
Word2Vec     100          78.95    70.34    59.67    57.14    54.51
Word2Vec     200          76.73    68.54    52.46    58.90    49.06
Word2Vec     300          77.68    67.83    59.67    57.34    51.76

Table 7: Evaluation of the CNN-BiLSTM-CRF model using different embedding settings, compared with similar published work. The three sections are, in order: our models; previously reported neural models (BiLSTM+ReLU, BiLSTM+Softmax, LSTM+ReLU, and LSTM); and previously reported SVM and CRF models. All experiments were performed on the Nagaoka corpus for the POS tagging task. Results are given in accuracy.

Algorithm         Embedding    Dimension    Accuracy
CNN-BiLSTM-CRF    Random       100          90.2
CNN-BiLSTM-CRF    Word2Vec     50           93.1
CNN-BiLSTM-CRF    Word2Vec     100          94
CNN-BiLSTM-CRF    Word2Vec     200          93
CNN-BiLSTM-CRF    Word2Vec     300          93
BiLSTM+ReLU       Random       100          91.1
BiLSTM+Softmax    Random       100          89.6
LSTM+ReLU         Random       100          89.1
LSTM              Random       100          89
SVM               -            -            89.92
CRF               -            -            90.89
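The NER F1-scores reported in Tables 5, 6, 8, and 9 are entity-level metrics computed over BIO tag sequences. A minimal sketch of how such scores can be obtained from gold and predicted tags, assuming the seqeval library (an assumption; the paper does not state its evaluation tooling), on hypothetical data:

    from seqeval.metrics import classification_report, f1_score

    # Hypothetical gold and predicted BIO tag sequences for two sentences.
    y_true = [["B-PER", "O", "B-LOC", "O"], ["B-DATE", "I-DATE", "O"]]
    y_pred = [["B-PER", "O", "B-ORG", "O"], ["B-DATE", "I-DATE", "O"]]

    # Entity-level (not token-level) scores: an entity counts as correct only if both
    # its type and its full span are predicted correctly.
    print("micro F1:", f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))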


Table 8: Fine-tuning TigRoBERTa on NER and POS tagging. The evaluation metrics are precision, recall, and F1-score for NER, and precision, recall, F1-score, and accuracy for POS tagging.

Task    Metric       Dev      Test
NER     Precision    75.48    80
NER     Recall       77.62    82.14
NER     F1-score     76.53    81.05
POS     Precision    90.64    90.20
POS     Recall       90.20    89.80
POS     F1-score     90.42    90
POS     Accuracy     91.5     92

Table 9: Fine-tuning TigRoBERTa on the NER dataset: results for the PER, LOC, ORG, DATE, and MISC entities on the test set. The evaluation metrics are precision, recall, and F1-score.

Entity    Precision    Recall    F1-score
PER       86           79        82
LOC       80           81        81
ORG       72           81        76
DATE      86           79        82
MISC      61           67        64

Table 5 illustrates the performance of the five word2vec embedding settings and the random embeddings. In this experiment, we found that using the pre-trained word embeddings improves performance compared to the random embeddings. The CNN-BiLSTM-CRF model achieved its best result, an F1-score of 68.86, with the 100-dimensional word2vec embeddings. Table 6 also shows the F1-scores for the PER, LOC, ORG, DATE, and MISC entity tags.

Furthermore, Table 7 gives the accuracy results on the POS tagging task. For this task, we used the publicly available Nagaoka dataset [35]. In Table 7, we report our experimental results using different sets of word2vec Tigrinya embeddings and compare them with previous studies on the same dataset. The three sections are, in order: our results; the experiments with BiLSTM+ReLU, BiLSTM+Softmax, LSTM+ReLU, and LSTM; and the experiments with SVM and CRF. The results are reported in terms of accuracy. Table 7 shows that our model improved accuracy by 3.11–4.08 percentage points, scoring 94%, compared with the SVM and CRF of [36], and by 2.9–5 percentage points over [37], which considered random initialization with 100-dimensional word2vec embeddings. Thus, our model outperformed the highest accuracy scores reported previously and establishes a new result for POS tagging on the Nagaoka corpus.

5.2.2 Improving the Baseline Model using Transfer Learning. In transfer learning, a model is first trained on a large corpus for an initial task and then fine-tuned for various downstream tasks. The technique of fine-tuning a pre-trained language model is widely used and has improved various NLP tasks. Table 8 shows the results of fine-tuning TigRoBERTa on the NER and POS tagging datasets. The results are given in terms of precision, recall, F1-score, and accuracy. For the NER dataset, TigRoBERTa achieved F1-scores of 76.53% and 81.05% on the dev and test sets, respectively. Similarly, TigRoBERTa achieved 91.5% and 92% accuracy on the dev and test sets for POS tagging, respectively. We found that the CNN-BiLSTM-CRF model outperformed TigRoBERTa on POS tagging by 2% accuracy on the test set, while for the NER task TigRoBERTa improved the test F1-score by 12.19 points compared to the CNN-BiLSTM-CRF model (i.e., 68.86%), as shown in Table 5.

Furthermore, Table 9 shows the results of fine-tuning TigRoBERTa on the NER dataset for the PER, LOC, ORG, DATE, and MISC entities. Our model performs well in predicting the PER and DATE tags, with an F1-score of 82% for each. Similarly, our model achieved 81% and 76% for the LOC and ORG categories, respectively.

6 CONCLUSIONS
Tigrinya is poorly researched due to the lack of comprehensive and freely available data. To address this problem, we have presented the first publicly available dataset for Tigrinya tagged with named entities. We tagged 69,309 tokens in 3,625 sentences. We have also presented a pre-trained RoBERTa-based language model for Tigrinya, which we call TigRoBERTa. TigRoBERTa was trained on a Tigrinya corpus and evaluated on two different sequence tagging tasks: NER and POS tagging. In addition, we have investigated the CNN-BiLSTM-CRF model for the NER and POS tagging tasks. The experimental results show that TigRoBERTa outperforms the CNN-BiLSTM-CRF on the NER dataset, while the CNN-BiLSTM-CRF model also performs well on POS tagging. In the future, we plan to expand the study to other NLP domains that have not yet been studied for the Tigrinya language.

REFERENCES
[1] David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, et al. 2021. MasakhaNER: Named Entity Recognition for African Languages. arXiv preprint arXiv:2103.11811 (2021).
[2] Isayas Berhe Adhanom. [n. d.]. A First Look into Neural Machine Translation for Tigrinya. ([n. d.]).
[3] Norah Alsaaran and Maha Alrabiah. 2021. Arabic Named Entity Recognition: A BERT-BGRU Approach. CMC-Computers, Materials & Continua 68, 1 (2021), 471–485.
[4] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[5] Zemicheal Berihu, Gebremariam Mesfin Assres, Mulugeta Atsbaha, and Tor-Morten Grønli. 2020. Enhancing Bi-directional English-Tigrigna Machine Translation Using Hybrid Approach. In Norsk IKT-konferanse for forskning og utdanning.
[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[7] Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4 (2016), 357–370.
[8] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Xishuang Dong, Shanta Chowdhury, Lijun Qian, Xiangfang Li, Yi Guan, Jinfeng Yang, and Qiubin Yu. 2019. Deep learning for named entity recognition on Chinese electronic medical records: Combining deep transfer learning with multitask bi-directional LSTM RNN. PLoS ONE 14, 5 (2019), e0216046.
[11] Roald Eiselen. 2016. Government domain named entity recognition for South African languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 3344–3348.


[12] Awet Fesseha, Shengwu Xiong, Eshete Derb Emiru, Moussa Diallo, and Abdelghani Dahou. 2021. Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya. Information 12, 2 (2021), 52.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Rasmus Hvingelby, Amalie Brogaard Pauli, Maria Barrett, Christina Rosted, Lasse Malm Lidegaard, and Anders Søgaard. 2020. DaNE: A named entity resource for Danish. In Proceedings of the 12th Language Resources and Evaluation Conference. 4597–4604.
[15] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).
[16] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[17] ThAnh Lê and M. S. Burtsev. 2019. A deep neural network model for the task of Named Entity Recognition. International Journal of Machine Learning and Computing 9, 1 (2019), 8–13.
[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[19] Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016).
[20] Michael Franklin Mbouopda and Paulin Melatagia Yonta. 2020. Named Entity Recognition in Low-resource Languages using Cross-lingual distributional word representation. Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées 33 (2020).
[21] Mary L. McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282.
[22] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[23] Alp Öktem, Mirko Plitt, and Grace Tang. 2020. Tigrinya neural machine translation with transfer learning for humanitarian response. arXiv preprint arXiv:2003.11523 (2020).
[24] Omer Osman and Yoshiki Mikami. 2012. Stemming Tigrinya words for information retrieval. In Proceedings of COLING 2012: Demonstration Papers. 345–352.
[25] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[26] Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502 (2019).
[27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[28] Radim Řehůřek, Petr Sojka, et al. 2011. Gensim—statistical semantics in Python. Retrieved from gensim.org (2011).
[29] Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050 (2003).
[30] Richa Sharma, Sudha Morwal, Basant Agarwal, Ramesh Chandra, and Mohammad S. Khan. 2020. A deep neural network-based model for named entity recognition for Hindi language. Neural Computing and Applications 32, 20 (2020), 16191–16203.
[31] Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 3273–3280.
[32] Beth M. Sundheim. 1995. Overview of results of the MUC-6 evaluation. (1995).
[33] Yemane Tedla and Kazuhide Yamamoto. 2017. Analyzing word embeddings and improving POS tagger of Tigrinya. In 2017 International Conference on Asian Language Processing (IALP). IEEE, 115–118.
[34] Yemane Tedla and Kazuhide Yamamoto. 2017. Morphological Segmentation for English-to-Tigrinya Statistical Machine Translation. Int. J. Asian Lang. Process. 27, 2 (2017), 95–110.
[35] Yemane Keleta Tedla, Kazuhide Yamamoto, and Ashuboda Marasinghe. 2016. Nagaoka Tigrinya Corpus: Design and development of part-of-speech tagged corpus. Nagaoka University of Technology (2016), 1–4.
[36] Yemane Keleta Tedla, Kazuhide Yamamoto, and Ashuboda Marasinghe. 2016. Tigrinya part-of-speech tagging with morphological patterns and the new Nagaoka Tigrinya corpus. International Journal of Computer Applications 146, 14 (2016).
[37] Senait Gebremichael Tesfagergish and Jurgita Kapociute-Dzikiene. 2020. Deep Learning-Based Part-of-Speech Tagging of the Tigrinya Language. In International Conference on Information and Software Technologies. Springer, 357–367.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[39] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).
[40] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55–75.

