
Joint Khmer Word Segmentation and Part-of-Speech Tagging

Using Deep Learning

Rina Buoy†  Nguonly Taing†  Sokchea Kor‡

† Techo Startup Center (TSC)
‡ Royal University of Phnom Penh (RUPP)
{rina.buoy,nguonly.taing}@techostartup.center
[email protected]

Abstract

Khmer text is written from left to right with optional spaces. A space does not serve as a word boundary; instead, it is used for readability or other functional purposes. Word segmentation is a prior step for downstream tasks such as part-of-speech (POS) tagging, and thus the robustness of POS tagging depends highly on word segmentation. Conventional Khmer POS tagging is a two-stage process that begins with word segmentation, followed by the actual tagging of each word. In this work, a joint word segmentation and POS tagging approach using a single deep learning model is proposed so that word segmentation and POS tagging can be performed simultaneously. The proposed model was trained and tested using the publicly available Khmer POS dataset. The validation suggests that the performance of the joint model is on par with conventional two-stage POS tagging.

Keywords: Khmer, NLP, POS tagging, Deep Learning, LSTM, RNN

1 Introduction

1.1 Background

Khmer (KHM) is the official language of the Kingdom of Cambodia, and the Khmer script is used in the writing system of Khmer and of minority languages such as Kuay, Tampuan, Jarai, Krung, Brao and Kravet. The Khmer language and writing system were heavily influenced by Pali and Sanskrit in early history [1].

The Khmer script is believed to have originated from the Brahmi Pallava script. The Khmer writing system has undergone ten evolutions over more than 1400 years. In the early and mid-19th century, there was an effort to romanize the Khmer language; however, it was not successful. This incident led to Khmerization in the middle and late 19th century [1]. Khmer is classified as a low-resource language [2].

Khmer text is written from left to right with optional spaces. A space does not serve as a word boundary; instead, it is used for readability or other functional purposes. Therefore, word segmentation is a prior step in Khmer text processing tasks. Various efforts, including dictionary-based and statistical models, have been made to solve the Khmer segmentation problem [3][4].

Khmer has two phonological features that distinguish it from other Southeast Asian languages such as Thai and Burmese. Khmer is not a tonal language; instead, a large set of vowel phonemes compensates for the absence of tones. Khmer also has a large set of consonant clusters (C1C2 or C1C2C3) and allows complex initial consonant clusters at the beginning of a syllable [2].

1.2 POS Tagging

Part-of-speech tagging is a sequence labelling task in which each word is assigned one tag from a predefined tag set according to its syntactic function. POS tagging is required for various downstream tasks such as spelling checking, parsing, grammar induction, word sense disambiguation, and information retrieval [5].

There is no explicit word delimiter in the Khmer writing system. Automatic word segmentation is run to obtain segmented words, and POS tagging is performed afterwards. The performance of POS tagging is reliant on the results of segmentation in this two-stage approach [5].

For languages such as Khmer, Thai and Burmese, which do not have an explicit word separator, the definition of a word is not a natural concept; therefore, segmentation and POS tagging cannot be separated, as both tasks unavoidably affect one another [2].

In this paper, we thus propose joint word segmentation and POS tagging using a single deep learning network to remove the adverse dependency effect. The proposed model is a bidirectional long short-term memory (LSTM) recurrent network with one-to-one architecture. The network takes a sequence of characters as input and outputs a sequence of POS tags. We use the publicly available Khmer POS dataset by [5] to train and validate the model.

2 Related Work

2.1 Word Segmentation

One of the early studies on Khmer word segmentation was done by [3]. The authors proposed a bidirectional maximum matching approach (BiMM) to maximize segmentation accuracy. BiMM performs maximum matching twice, forward and backward. The average accuracy of BiMM was reported to be 98.13%. BiMM was, however, unable to handle out-of-vocabulary (OOV) words or to take context into account.

Another word segmentation approach was proposed by [6][7], which used a conditional random field (CRF). The feature template was defined using a trigram model and 10 tags for each input character. 5,000 sentences were manually segmented and used to train a first CRF model, which was then used to segment more sentences. Additional manual corrections were needed to build a training corpus of 97,340 sentences; 12,468 sentences were used as a test set. Two CRF models were trained: the 2-tag model predicted only word boundaries, while the 5-tag model predicted word boundaries and also identified compound words. Both models obtained the same F1 score of 0.985.

[8] proposed a deep learning approach for the Khmer word segmentation task. Two long short-term memory networks were studied. The character-level network took a sequence of characters as input, while the character-cluster network took, instead, a sequence of Khmer character clusters (KCC).

2.2 Khmer POS Tagging

Khmer is a low-resource language with limited natural language processing (NLP) research. One of the earlier studies on POS tagging used a rule-based transformation approach [9]. The same authors later introduced a hybrid method by combining trigram models and the rule-based transformation approach. 27 tags were defined. The dataset was manually annotated and contained about 32,000 words. For known words, the hybrid model achieved up to 95.55% and 94.15% accuracy on the training and test sets, respectively. For unknown words, 90.60% and 69.84% were obtained on the training and test sets.

Another Khmer POS tagging study was done by [10], which used a conditional random field model. The authors reused the tag definitions of [9] and built a training corpus of 41,058 words. The authors experimented with various feature templates including morphemes, word shapes and named entities; the best template gave an accuracy of 96.38%.

Based on the Choun Nat dictionary, [5] defined 24 tags, some of which were added with the word disambiguation task in mind. The authors used the automatic word segmentation tool by [6] to segment 12,000 sentences, along with some manual corrections. Various machine learning algorithms were used to train POS taggers, including Hidden Markov Models (HMM), Maximum Entropy (ME), Support Vector Machines (SVM), Conditional Random Fields (CRF), Ripple-Down Rules (RDR) and the Two Hours of Annotation approach (a combination of HMM and Maximum Entropy Markov Model). The RDR approach outperformed the rest, achieving an accuracy of 95.33% on the test set, while the CRF, HMM and SVM approaches achieved comparable results.
2.3 Joint Segmentation and POS Tagging

For most Indo-European languages, POS tagging can be done after segmentation since spaces are used to separate words in their writing systems. Most East and Southeast Asian languages, on the contrary, do not use any explicit word separator, and the definition of a word is not well defined. Segmentation and POS tagging, therefore, cannot be separated. The authors of [2] suggested applying joint segmentation and POS tagging to low-resource languages which share linguistic features similar to Khmer and Burmese.

3 Modified POS Tag Set

[5] proposed a comprehensive set of 24 POS tags derived from the Choun Nat dictionary. This tag set is shown in Table 1. In this work, we propose the following revisions to the tag set defined in [5]:

• Grouping the measure (M) tag under the noun (NN) tag, since the syntactic role of the measure tag is the same as that of the noun tag. For example, some words belonging to the measure tag are កបល (head), ស្រមាប់ (set), and អង្គ (person).

• Grouping the relative pronoun (RPN) tag under the pronoun tag, since the relative pronoun tag covers only one word (ែដល).

• Grouping the currency tag (CUR), double tag (ៗ - DBL), et cetera tag (។ល។ - ETC), and end-of-sentence tag (។ - KAN) under the symbol (SYM) tag.

• Grouping the interjection (UH) tag under the particle tag (PA).

• Grouping the adjective verb (VB_JJ) and complement verb (V_COM) tags under the verb (VB) tag.

After applying the above revisions, the resulting tag set consists of 15 tags, as shown in Table 2.
Table 1. The POS Tag Set Proposed by [5]

No. Tag  Description          No. Tag    Description
1   AB   Abbreviation         13  NN     Noun
2   AUX  Auxiliary Verb       14  PN     Proper Noun
3   CC   Conjunction          15  PA     Particle
4   CD   Cardinal Number      16  PRO    Pronoun
5   CUR  Currency             17  QT     Question Word
6   DBL  Double Sign          18  RB     Adverb
7   DT   Determiner Pronoun   19  RPN    Relative Pronoun
8   ETC  Khmer Sign (។ល។)     20  SYM    Symbol/Sign
9   IN   Preposition          21  UH     Interjection
10  JJ   Adjective            22  VB     Verb
11  KAN  Khmer Sign (។)       23  VB_JJ  Adjective Verb
12  M    Measure              24  V_COM  Verb Complement

Table 2. The Revised POS Tag Set Proposed in This Work

No. Tag  Description          No. Tag    Description
1   AB   Abbreviation         9   NN     Noun
2   AUX  Auxiliary Verb       10  PN     Proper Noun
3   CC   Conjunction          11  PA     Particle
4   CD   Cardinal Number      12  PRO    Pronoun
5   DT   Determiner Pronoun   13  QT     Question Word
6   IN   Preposition          14  RB     Adverb
7   JJ   Adjective            15  SYM    Symbol/Sign
8   VB   Verb

The descriptions of the revised POS tags are as follows:

1. Abbreviation (AB): In Khmer writing, an abbreviation can be written with or without dots. Without explicit dots, there is an ambiguity between a word and an abbreviation. For example, គម or គ.ម. (kilometer).

2. Adjective (JJ): An adjective is a word used to describe a noun and is generally placed after the noun, except for loanwords from Pali or Sanskrit [5]. Some common Khmer adjectives are, for example, ស (white), ល្អ (good), តូច (small), and ធំ (big). ឧត្តម and មហា are examples of Pali/Sanskrit loanwords.

3. Adverb (RB): An adverb is a word used to describe a verb, an adjective or another adverb [5][11]. For example, some words belonging to the adverb tag are េពក (very), ណាស់ (very), េហើយ (already), and េទើប (just).

4. Auxiliary Verb (AUX): Only three words are tagged as auxiliary verbs, and their syntactic role is to indicate tense [5]: បាន or មាន indicates past tense, កំពុង indicates progressive tense, and នឹង indicates future tense.

5. Cardinal Number (CD): A cardinal number is used to indicate quantity [5]. Some examples of cardinal numbers are ១ (one), បី (three), បួន (four), and លាន (million).

6. Conjunction (CC): A conjunction is a word used to connect words, phrases or clauses [5][11]. For example, some words belonging to the conjunction tag are េបើ (if), ្របសិនេបើ (if), ពីេ្រពាះ (because), េ្រពាះ (because), and ពុំេនាះេសាត (nevertheless).

7. Determiner Pronoun (DT): A determiner is a word used to indicate the location or uncertainty of a noun. Determiners are equivalent to the English words this, that, those, these, all, every, each, and some. In Khmer grammar, a determiner is tagged as either a pronoun or an adjective [11]; however, a determiner pronoun tag is used in [5] and in this work. For example, some words belonging to the determiner pronoun tag are េនះ (this), េនាះ (that), សព្វ (every), ទាំងេនះ (these), ទាំងេនាះ (those), and ខ្លះ (some).

8. Pronoun (PRO): A pronoun is a word used to refer to a noun or noun phrase that was already mentioned [5][11]. In this work, the pronoun tag is used to tag both personal pronouns and the relative pronoun. In Khmer grammar, there is only one relative pronoun, which is ែដល (that, which, where, who).

9. Preposition (IN): A preposition is a word used with a noun, noun phrase, verb or verb phrase to indicate time, place, location, possession and so on [5][11]. For example, some words belonging to the preposition tag are េនេលើ (above), កាលពី (from), តាម (by), and អំពី (about).

10. Noun (NN): A noun is a word used to identify a person, animal, tree, place, or object in general [5][11]. For example, some words belonging to the noun tag are សិស្ស (student), េគា (cow), តុ (table), and កសិករ (farmer).

11. Proper Noun (PN): A proper noun is a word used to identify the name of a particular person, animal, tree, place, or location [5][11]. For example, some words belonging to the proper noun tag are សុខា (Sokha - a person's name), ភ្នំេពញ (Phnom Penh), កម្ពុជា (Cambodia), and សុីធីអុិន (CTN).

12. Question Word (QT): Some examples of question words are េតើ (what) and ដូចេម្តច (how).

13. Verb (VB): A verb is a word used to describe an action, state, or condition [5][11]. For example, some words belonging to the verb tag are េដើរ (to walk), ជា (to be), េមើល (to watch), and េ្រសក (to be thirsty). [5] introduced two more tags for verbs: the adjective verb (VB_JJ) and the complement verb (V_COM). VB_JJ was used to tag any verb behaving like an adjective in large compound words such as មា៉សុីនេបាកេខាអាវ (washing machine), កំបិតចិតបែន្ល (knife), and so on. V_COM, on the other hand, was used to tag any verb in verb phrases or collocations such as េរៀនេចះ (to learn), ្របលងជាប់ (to pass), and so on. Both VB_JJ and V_COM are dropped in this work because identifying them should be done via a compound and collocation identification task [12]. Certain Khmer compounds are loose and have the same structure as a Subject-Verb-Object sentence [13][2]. This is illustrated in Figure 1.

14. Particle (PA): There is no clear concept of a particle in Khmer grammar. [5] identified three groups of particles, namely hesitation, response and final. Some examples of particles are េអុើ (hesitation), េអើ (response), សុិន (final), and ចុះ (final).

15. Symbol (SYM): Various symbols in writing are grouped under the SYM tag. The SYM tag includes currency signs, Khmer signs (ៗ, ។, ។ល។), various borrowed signs (+, -, ?) and so on.

4 POS Tagging Methodology

The application of deep learning to the Khmer word segmentation task was first proposed by [8]. In this work, we extend [8] by combining the word segmentation and POS tagging tasks in a single deep learning model. The proposed model is a variant of recurrent neural networks. The details are explained below.

4.1 Recurrent Neural Network

A recurrent neural network (RNN) is a type of neural network with cycles in its connections. That means the value of an RNN cell depends on both the current inputs and its previous outputs. Elman networks, or simple recurrent networks, have proven to be very effective in NLP tasks [14]. An illustration of a simple recurrent network is given in Figure 2.

The hidden vector h_t depends on both the input x_t and the previous hidden vector h_{t-1}:

    h_t = g(U h_{t-1} + W x_t)    (1)

Where:

• U and W are trainable weight matrices. They are shared by all time steps in the sequence.

The fact that the computation of h_t requires h_{t-1} makes the RNN an ideal candidate for sequence tagging tasks such as POS tagging.

4.2 Recurrent Neural Network for Sequence Labelling

Various forms of RNN architecture are given in Figure 3. The choice of an RNN architecture depends on the application of interest. POS tagging is a sequence labelling task, for which the one-to-one architecture is suitable. A possible RNN model for POS tagging is given in Figure 4. Here are the forward steps taken by an RNN model:

1. A sequence of inputs is encoded or embedded with an embedding layer.

2. Hidden vectors are computed by unrolling the computational graph through time.

3. At each time step, the RNN cell outputs an output vector.

4. A softmax activation is applied to the output vectors.

4.3 Stacking

One or more RNN layers can be stacked on top of each other to form a deep RNN model. An illustration of a stacked RNN model is shown in Figure 4. In a stacked RNN model, an input sequence is fed to the first RNN layer to produce h_t^1; h_t^1 is then fed to the next RNN layer, and so on up to the output layer.

Stacking is used to learn representations at different levels across layers. The optimal number of stacked layers depends on the application of interest and is a matter of hyper-parameter tuning [14].
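As a concrete illustration of the one-to-one architecture and stacking described above, the following PyTorch sketch builds a two-layer Elman RNN tagger that emits one tag score vector per time step, as in equation (1). It is a minimal sketch with placeholder dimensions, not the implementation used in this paper:

    import torch
    import torch.nn as nn

    class StackedRNNTagger(nn.Module):
        def __init__(self, input_dim=132, hidden_dim=100, num_tags=16,
                     num_layers=2):
            super().__init__()
            # nn.RNN computes h_t = tanh(W x_t + U h_{t-1}); num_layers > 1
            # stacks cells so layer l consumes the hidden vectors of layer l-1.
            self.rnn = nn.RNN(input_dim, hidden_dim, num_layers=num_layers,
                              batch_first=True)
            self.out = nn.Linear(hidden_dim, num_tags)  # one score vector per step

        def forward(self, x):               # x: (batch, seq_len, input_dim)
            h, _ = self.rnn(x)              # h: (batch, seq_len, hidden_dim)
            return self.out(h)              # logits: (batch, seq_len, num_tags)

    # One-to-one tagging: 4 one-hot characters in, 4 tag score vectors out.
    model = StackedRNNTagger()
    logits = model(torch.eye(132)[:4].unsqueeze(0))
    print(logits.shape)  # torch.Size([1, 4, 16])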
Figure 1. Similar Structure of a Sentence and Compound Noun

Figure 2. A simple RNN [14]

Figure 3. Various RNN architectures [15]


Figure 4. An example RNN model for the POS tagging task [14]

4.4 Bidirectionality

In forward mode (from left to right), h_t^f represents what the model has processed from the first input in the sequence up to the input at time t:

    h_t^f = RNN_forward(x_{1:t})    (2)

Where:

• h_t^f represents the forward hidden vector after the network sees the input sequence up to time step t.

• x_{1:t} are the inputs up to time t.

In certain applications, such as sequence labelling, a model can have access to the entire input sequence. It is then possible to process an input sequence from both directions, forward and backward [14]. The backward hidden vector h_t^b, which summarizes the sequence from time step n down to t in reverse order, can be expressed as:

    h_t^b = RNN_backward(x_{n:t})    (3)

Where:

• h_t^b represents the backward hidden vector after the network sees the input sequence from time step n to t.

• x_{n:t} are the inputs up to time t in reverse order.

An RNN model which processes the sequence from both directions is known as a bidirectional RNN (bi-RNN). h_t^f and h_t^b can be averaged or concatenated to form h_t.

4.5 Long Short-Term Memory (LSTM)

When processing a long sequence, an RNN cell has difficulty carrying forward critical information, for two reasons [14]:

• The weight matrices need to provide information for the current output as well as the future outputs.

• Back-propagation through time suffers from the vanishing gradient problem due to repeated multiplications along a long sequence.

Long Short-Term Memory (LSTM) networks were devised to address the above issues by introducing sophisticated gate mechanisms and an additional context vector to control the flow of information into and out of the units [14]. Each gate in an LSTM network consists of:

1. A feed-forward layer

2. A sigmoid activation

3. Element-wise multiplication with the gated layer

The sigmoid function is chosen to gate the information flow because it tends to squash its output to either zero or one. The combined effect of a sigmoid activation and element-wise multiplication is the same as binary masking [14].

There are three gates in an LSTM cell. The details are as follows:

• Forget Gate: As the name implies, the objective of a forget gate is to remove irrelevant information from the context vector. The equations of a forget gate are given below:

    f_t = σ(U_f h_{t-1} + W_f x_t)    (4)
    k_t = c_{t-1} ⊙ f_t    (5)

Where:

– σ is the sigmoid activation.
– U_f and W_f are weight matrices.
– f_t and k_t are vectors at time t.
– c_{t-1} is the previous context vector.
– ⊙ is the element-wise multiplication operator.

• Add Gate: As the name implies, the objective of an add gate is to add relevant information to the context vector. The equations of an add gate are given below:

    g_t = tanh(U_g h_{t-1} + W_g x_t)    (6)

Like a standard RNN cell, the above equation extracts information from the previous hidden vector and the current input.

    i_t = σ(U_i h_{t-1} + W_i x_t)    (7)
    j_t = g_t ⊙ i_t    (8)

The current context vector is updated as follows:

    c_t = j_t + k_t    (9)

Where:

– U_i, W_i, U_g, and W_g are weight matrices.
– g_t, i_t, and j_t are vectors at time t.
– c_t is the current context vector at time t.

• Output Gate: As the name implies, the objective of an output gate is to determine what information is needed to update the current hidden vector at time t:

    o_t = σ(U_o h_{t-1} + W_o x_t)    (10)
    h_t = o_t ⊙ tanh(c_t)    (11)

Where:

– U_o and W_o are weight matrices.
– o_t is a vector at time t.
– h_t is the hidden vector at time t.

4.6 Bidirectional LSTM Network for Joint Word Segmentation and POS Tagging

In this section, we introduce a character-level bidirectional LSTM (Bi-LSTM) network for joint word segmentation and POS tagging. It is bidirectional since the model has access to the entire input sequence during the forward run.

The descriptions of the proposed Bi-LSTM network, shown in Figure 5, are as follows:

1. Inputs: The network takes a sequence of characters as input.

2. Input Encoding: One-hot encoding is used to encode each input character.

3. LSTM Layers: The network processes the input sequence in both directions, forward and backward. The forward and backward hidden vectors of the final LSTM stack are concatenated to form a single hidden vector.

4. Feed-forward Layer: The concatenated hidden vector is fed into a feed-forward layer to produce an output vector. The size of the output vector is equal to the number of POS tags plus one, as an additional no-space (NS) tag is introduced. The NS tag is explained in the inputs and labels preparation section (Section 5.2).

5. Softmax: A softmax activation is applied to the output vector to produce probabilistic outputs.

In the proposed models of [8], by contrast, a sigmoid function is used to output the probability of whether a character or cluster is the start of a word.
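Steps 1-5 above can be sketched in PyTorch roughly as follows. This is a minimal illustration consistent with the description (one-hot character inputs, a stacked bidirectional LSTM, a feed-forward layer sized to the number of POS tags plus the NS tag, and a softmax); it is not the authors' released code, and all names are placeholders:

    import torch
    import torch.nn as nn

    class JointSegPosTagger(nn.Module):
        def __init__(self, num_chars=132, hidden_dim=100, num_pos_tags=15,
                     num_layers=2):
            super().__init__()
            # Stacked, bidirectional LSTM over one-hot character vectors.
            self.lstm = nn.LSTM(num_chars, hidden_dim, num_layers=num_layers,
                                bidirectional=True, batch_first=True)
            # Output size = number of POS tags + 1 for the no-space (NS) tag.
            self.ff = nn.Linear(2 * hidden_dim, num_pos_tags + 1)

        def forward(self, chars_onehot):    # (batch, seq_len, num_chars)
            # Forward and backward hidden states come back concatenated,
            # hence the 2 * hidden_dim input to the feed-forward layer.
            h, _ = self.lstm(chars_onehot)
            return self.ff(h)               # logits; softmax applied at step 5

    # Step 5: softmax over the tag dimension turns logits into probabilities.
    model = JointSegPosTagger()
    probs = model(torch.eye(132)[:5].unsqueeze(0)).softmax(dim=-1)
    print(probs.shape)  # torch.Size([1, 5, 16])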
Table 3. POS Tag Frequency in the Training Set

Tag    Frequency  Percentage
NN     32297      25.03%
PN     20084      15.57%
VB     18604      14.42%
PRO    13950      10.81%
IN     13446      10.42%
RB     6428       4.98%
SYM    5839       4.53%
JJ     4446       3.45%
DT     4311       3.34%
CD     3337       2.59%
CC     2788       2.16%
AUX    2466       1.91%
PA     885        0.69%
QT     79         0.06%
AB     69         0.05%
Total  129029     100.00%

4.7 Cross-Entropy Loss

The multi-class cross-entropy loss is used to train the proposed model. The loss is expressed as follows:

    L(y, ŷ) = −∑_{k=1}^{K} y^k log(ŷ^k)    (12)

Where:

• ŷ^k is the predicted probability of class k.

• y^k is 0 or 1, indicating whether class k is the correct classification.

• K is the number of classes.
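In PyTorch, equation (12), summed over every character position, corresponds to nn.CrossEntropyLoss applied to per-character logits with the gold tag indices (including NS) as targets. A hypothetical training step, reusing the JointSegPosTagger sketch from Section 4.6 and the hyper-parameters listed in the training configuration below, might look like this:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()   # combines log-softmax with eq. (12)
    model = JointSegPosTagger()         # placeholder model from the earlier sketch
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    def train_step(chars_onehot, gold_tags):
        """chars_onehot: (batch, seq_len, 132); gold_tags: (batch, seq_len) int64."""
        logits = model(chars_onehot)    # (batch, seq_len, num_tags)
        # CrossEntropyLoss expects (N, C); flatten the batch and time dimensions.
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         gold_tags.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()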

5 Experimental Setup

5.1 Dataset - Train/Test Set

The dataset used in this work is obtained from [5]. It is available on GitHub (https://github.com/ye-kyaw-thu/khPOS) under the CC BY-NC-SA 4.0 license. The dataset consists of 12,000 sentences (25,626 words in total). The average number of words per sentence is 10.75. The word segmentation tool from [6] was used, and manual corrections were also performed. The original dataset has 24 POS tags; the most common tags are NN, PN, PRO, IN and VB.

We revised the dataset by grouping some tags as per the above discussion. The revised dataset has 15 tags in total - 9 tags fewer. The count and proportion of each tag are given in Table 3 in descending order. The 12,000-sentence dataset was used as the training set. [5] provided a separate open test set for evaluating model performance.

5.2 Inputs and Labels Preparation

An example of a training sentence is given below.

ខ្ញំុ/PRO ្រសលាញ់/VB ែខ្មរ/PN ។/SYM
(I love Khmer)

• A space denotes a word boundary.

• The POS tag of a word comes right after the slash.

The corresponding input and target sequence of the above training sentence is illustrated in Figure 6. If a character is the starting character of a word, its label is the POS tag of that word, which also marks the beginning of a new word. Otherwise, the no-space (NS) tag is assigned.

5.3 Training Configuration

The network was implemented in the PyTorch framework and trained on Google Colab Pro. Training utilized mini-batches on a GPU. The following hyper-parameters were used:

• Number of LSTM stacks = 2

• Hidden dimension = 100

• Batch size = 128

• Optimizer = Adam with a learning rate of 0.001

• Epochs = 100

• Loss function = categorical cross-entropy loss

Each input character was encoded as a one-hot vector of 132 dimensions, which is the number of Khmer characters including numerals and other signs. A sketch of this input and label encoding is given below.
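The following sketch shows one plausible way to build the character-level tensors described above: a 132-dimensional one-hot vector per character and, per character, either the word's POS tag (first character of the word) or NS (all other characters). The char2id mapping and helper names are illustrative assumptions, not artifacts of the paper:

    import torch

    # Illustrative tag inventory: the 15 POS tags plus the no-space (NS) tag.
    TAGS = ["AB", "AUX", "CC", "CD", "DT", "IN", "JJ", "NN",
            "PA", "PN", "PRO", "QT", "RB", "SYM", "VB", "NS"]
    TAG2ID = {t: i for i, t in enumerate(TAGS)}

    def encode_sentence(tagged_words, char2id, num_chars=132):
        """tagged_words: list of (word, pos_tag) pairs for one sentence.
        char2id: dict mapping each of the 132 characters/signs to an index.
        Returns (one_hot, labels): (seq_len, 132) floats and (seq_len,) ints."""
        char_ids, labels = [], []
        for word, tag in tagged_words:
            for i, ch in enumerate(word):
                char_ids.append(char2id[ch])
                # The first character of a word carries the word's POS tag;
                # every following character is labelled NS.
                labels.append(TAG2ID[tag] if i == 0 else TAG2ID["NS"])
        one_hot = torch.nn.functional.one_hot(
            torch.tensor(char_ids), num_classes=num_chars).float()
        return one_hot, torch.tensor(labels)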
Figure 5. The proposed Bi-LSTM network for joint segmentation and POS tagging at character
level

Figure 6. Input and Output Sequence Representation


6 Results and Evaluation

The accuracy of word segmentation is defined as:

    Accuracy = Count_word^correct / Count_word^corpus    (13)

Where:

• Count_word^correct is the number of correctly segmented words.

• Count_word^corpus is the number of words in the corpus.

The accuracy for a given tag i is defined as:

    Accuracy_i = Count_{pos,i}^correct / Count_{pos,i}^corpus    (14)

Where:

• Count_{pos,i}^correct is the number of correctly predicted occurrences of POS tag i.

• Count_{pos,i}^corpus is the number of occurrences of POS tag i in the corpus.

The overall POS tagging accuracy is given by:

    Accuracy = ∑_{i=1}^{tag} Count_{pos,i}^correct / ∑_{i=1}^{tag} Count_{pos,i}^corpus    (15)

Where:

• tag is the number of POS tags.

The segmentation and overall POS tagging accuracy are given in Tables 4 and 6, respectively, while the accuracy breakdowns by tag are given in Table 5.

Table 4. Accuracy of Word Segmentation

Metric    Training Set  Test Set
Accuracy  99.27%        97.11%

Table 5. POS Tag Accuracy Breakdowns

Tag  Train    Test
NN   98.43%   93.75%
PN   98.94%   96.40%
VB   96.90%   89.12%
PRO  98.86%   97.39%
IN   97.49%   92.05%
RB   95.00%   87.23%
SYM  100.00%  99.74%
JJ   92.35%   79.89%
DT   99.71%   96.99%
CD   98.52%   96.82%
CC   96.06%   94.09%
AUX  100.00%  98.15%
PA   95.77%   83.33%
QT   100.00%  100.00%
AB   100.00%  83.33%

Table 6. Overall Accuracy of POS Tagging

Metric    Training Set  Test Set
Accuracy  98.14%        94.00%
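Equations (13) through (15) are simple counting; a minimal sketch, assuming predictions and references are already aligned item by item (a simplification; a real segmentation evaluation must first align the two segmentations):

    def segmentation_accuracy(pred_words, gold_words):
        """Eq. (13): fraction of gold words reproduced exactly, assuming
        the two lists are aligned (a simplifying assumption)."""
        correct = sum(p == g for p, g in zip(pred_words, gold_words))
        return correct / len(gold_words)

    def tag_accuracies(pred_tags, gold_tags):
        """Eqs. (14)-(15): per-tag and overall POS tagging accuracy."""
        correct, total = {}, {}
        for p, g in zip(pred_tags, gold_tags):
            total[g] = total.get(g, 0) + 1
            correct[g] = correct.get(g, 0) + (p == g)
        per_tag = {t: correct.get(t, 0) / n for t, n in total.items()}  # eq. (14)
        overall = sum(correct.values()) / sum(total.values())           # eq. (15)
        return per_tag, overall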
7 Discussion

The trained model achieved segmentation and POS tagging accuracies of 97.11% and 94.00%, respectively. Compared with [5], our POS tagging accuracy was 1.33% lower on the open test set. In the work of [5], the overall error should be composed of two components, the segmentation error and the POS tagging error, and is approximated by the equation below:

    ϵ_t = ϵ_s + ϵ_p    (16)

Where:

• ϵ_t is the overall error.

• ϵ_s is the segmentation error.

• ϵ_p is the POS tagging error.

Since the segmentation error was not included in [5], the reported error was just the POS tagging error. [5] used the segmentation tool by [6], with a reported error (ϵ_s) of 1.5%. The estimated overall error (ϵ_t) of [5] is about 6.17%, since the highest accuracy of [5] was reported to be 95.33% (ϵ_p = 4.67%).

Thus, the performance of the joint segmentation and POS tagging model, with an ϵ_t of 6.00%, is on par with the conventional two-stage POS tagging method.
8 Conclusion and Future Work

In this work, we proposed joint word segmentation and POS tagging using a deep learning approach. We presented a bidirectional LSTM network that takes inputs at the character level and outputs a sequence of POS tags. The overall accuracy of the proposed model is on par with the conventional two-stage Khmer POS tagging. We believe the available training dataset is limited in size, and a significantly larger dataset is required to train a more robust joint POS tagging model with greater generalization ability.

References

[1] Makara Sok. Phonological Principles and Automatic Phonemic and Phonetic Transcription of Khmer Words. PhD thesis, Payap University, 2016.

[2] Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 18(2), December 2018.

[3] Narin Bi and Nguonly Taing. Khmer word segmentation based on bi-directional maximal matching for plain text and Microsoft Word documents. Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014.

[4] Chea Sok Huor, Top Rithy, Ros Pich Hemy, Vann Navy, Chin Chanthirith, and Chhoeun Tola. Word bigram vs orthographic syllable bigram in Khmer word segmentation. PAN Localization Team, 2007.

[5] Ye Kyaw Thu, Vichet Chea, and Yoshinori Sagisaka. Comparison of six POS tagging methods on 12K sentences Khmer language POS tagged corpus. 1st Regional Conference on OCR and NLP for ASEAN Languages, 2017.

[6] Vichet Chea, Ye Kyaw Thu, Chenchen Ding, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Khmer word segmentation using conditional random fields. Khmer Natural Language Processing, 2015.

[7] Ye Kyaw Thu, Vichet Chea, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. A large-scale study of statistical machine translation methods for Khmer language. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pages 259-269, Shanghai, China, October 2015.

[8] Rina Buoy, Nguonly Taing, and Sokchea Kor. Khmer word segmentation using BiLSTM networks. 4th Regional Conference on OCR and NLP for ASEAN Languages, 2020.

[9] C. Nou and W. Kameyama. Khmer POS tagger: A transformation-based approach with hybrid unknown word handling. In International Conference on Semantic Computing (ICSC 2007), pages 482-492, 2007.

[10] Sokunsatya Sangvat and Charnyote Pluempitiwiriyawej. Khmer POS tagging using conditional random fields. Communications in Computer and Information Science, 2017.

[11] National Council of Khmer Language. Khmer Grammar Book. National Council of Khmer Language, 2018.

[12] Wirote Aroonmanakun. Thoughts on word and sentence segmentation in Thai. 2007.

[13] Sok Khin. Khmer Grammar. Royal Academy of Cambodia, 2007.

[14] Dan Jurafsky and James H. Martin. Speech and Language Processing. 3rd edition draft, 2020.

[15] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. Blog post, 2015.