Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM: Habash and Sadat 2006

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs.

SVM

Mohamed Eldesouki1 , Younes Samih2 ,


Ahmed Abdelali , Mohammed Attia3 , Hamdy Mubarak1 , Kareem Darwish1 , and Laura Kallmeyer2
1

1
Qatar Computing Research Institute, HBKU, Doha, Qatar
2
Dept. of Computational Linguistics,University of Düsseldorf, Düsseldorf, Germany
3
Google Inc., New York City, USA
1
{mohamohamed,hmubarak,aabdelali,kdarwish}@hbku.edu.qa
2
{samih,kallmeyer}@phil.hhu.de
3
[email protected]
arXiv:1708.05891v1 [cs.CL] 19 Aug 2017

Abstract lation (Habash and Sadat, 2006). Most previous


work has mostly focused on segmenting Modern
Arabic word segmentation is essential for Standard Arabic (MSA) achieving segmentation
a variety of NLP applications such as ma- accuracies of nearly 99% (Abdelali et al., 2016;
chine translation and information retrieval. Pasha et al., 2014). MSA is the lingua franca of
Segmentation entails breaking words into the Arab world, and it is typically used in writ-
their constituent stems, affixes and cli- ten and formal communications. Dialectal Ara-
tics. In this paper, we compare two ap- bic (DA) segmentation on the other hand has re-
proaches for segmenting four major Ara- ceived limited attention, with most of the work
bic dialects using only several thousand focusing on the Egyptian dialect (Habash et al.,
training examples for each dialect. The 2013; Samih et al., 2017). Arabic dialects are typ-
two approaches involve posing the prob- ically spoken and are used in informal communi-
lem as a ranking problem, where an SVM cations. The advent of the social media and the
ranker picks the best segmentation, and ubiquity of smart phones has led to a greater need
as a sequence labeling problem, where a for dialectal processing such as dialect identifica-
bi-LSTM RNN coupled with CRF deter- tion (Eldesouki et al., 2016; Khurana et al., 2016),
mines where best to segment words. We morphological analysis (Habash et al., 2013) and
are able to achieve solid segmentation re- machine translation (Sennrich et al., 2016; Sajjad
sults for all dialects using rather limited et al., 2013). Yet, dialectal training corpora for a
training data. We also show that employ- variety of NLP modules, including segmentation,
ing Modern Standard Arabic data for do- continue to be limited and often nonexistent.
main adaptation and assuming context in- In this work, we focus on the segmentation of
dependence improve overall results. four major Arabic dialects, namely Egyptian, Lev-
antine, Gulf, and Maghrebi. We particularly focus
1 Introduction
on DA text from Twitter, a popular social media
Arabic has both complex morphology and orthog- platform, from which we can obtain large amounts
raphy, where stems are typically derived from a of text in different dialects written by ordinary so-
closed set of roots to which affixes such as coor- cial media users and exhibiting nonstandard or-
dinating conjunctions, determiners, and pronouns thography. We employ two machine learning ap-
are attached to form words. Segmenting Arabic proaches for building robust segmentation mod-
words into their constituent parts is important for ules using limited training data (350 tweets con-
a variety of natural language processing applica- taining several thousand words per dialect). In
tions. For example, segmentation has been shown one approach, we pose the segmentation as a rank-
to improve the effectiveness of information re- ing problem where all possible segmentations of
trieval (Darwish et al., 2014a) and machine trans- a word are ranked using a Support Vector Ma-
chine (SVM) based ranker. In the second, we ambiguation of Arabic (MADA). They trained and
use bidirectional Long Short Term Memory (bi- evaluated their system on both Penn Arabic Tree-
LSTM) Recurrent Neural Network (RNN) with bank (PATB) (parts 1-3) and the Egyptian Arabic
Conditional Random Fields (CRF) to perform se- Treebank (parts 1-5) (Maamouri et al., 2014) and
quence labeling over the characters in words. For they achieved 97.5% accuracy. MADAMIRA1
both, we adopt the simplifying assumption that (Pasha et al., 2014) is a new version of MADA
word segmentation can be reliably performed in- and includes functionality for analyzing dialec-
dependent of context. Though the assumption is tal Egyptian. Monroe et al. (2014) used a single
not always correct, it has been shown to be fairly dialect-independent model for segmenting all Ara-
robust for more than 99% of word occurrences in bic dialects including MSA. They argue that their
Arabic text (Abdelali et al., 2016). Lastly, given segmenter is better than other segmenters that use
the large overlap between MSA and DA, we em- sophisticated linguistic analysis. They evaluated
ploy segmented MSA data to further improve di- their model on three corpora, namely parts 1-3 of
alectal segmentation. Penn Arabic Treebank (PATB), Broadcast News
The contribution of this paper are as follows: Arabic Treebank (BN), and parts 1-8 of the BOLT
Phase 1 Egyptian Arabic Treebank (ARZ) report-
• We present robust DA segmenters for four major
ing a 95.13% F1 score.
Arabic dialects. We plan to open-source all of
them.
3 Dialectal Arabic
• We provide an exposition of challenges associ-
ated with performing in situ DA segmentation 3.1 DA Challenges
including segmentation guidelines and the effect
DA shares many MSA challenges, such as hav-
of orthographic standardization.
ing complex templatic derivational morphology
• We compare two machine learning approaches
and concatenative orthography. Most nouns and
that can generalize well even when limited train-
verbs are typically derived from a closed set of
ing data is available.
roots, which are fitted into templates to gener-
2 Related Work ate stems. Templates may indicate morphologi-
cal features such POS tag, gender, and number.
Work on dialectal Arabic is fairly new compared Stems may accept prefixes, such as coordinating
to MSA. A number of research projects were de- conjunction and prepositions, or suffixes, such as
voted to dialect identification (Biadsy et al., 2009; pronouns, to form words. While dialects mostly
Zbib et al., 2012; Zaidan and Callison-Burch, comply with the templatic nature of morphology2 ,
2014; Eldesouki et al., 2016). There are five major they diverge from MSA in other aspects such as:
dialects including Egyptian, Gulf, Iraqi, Levantine
and Maghribi. Few resources for these dialects • Lack of standard orthography, particularly for
are available such as the CALLHOME Egyptian ‚«
strictly dialectal words such as àA  “E$An”
Arabic Transcripts (LDC97T19), which was made
available for research as early as 1997. Newly de-
(because), which may also appear as àA ‚Ê«

veloped resources include the corpus developed by
‚Ó
“El$An” or àA  “m$An” (Habash et al., 2012).
Bouamor et al. (2014), which contains 2,000 par- • Word borrowing from other languages (Ibrahim,
allel sentences in multiple dialects and MSA as 2006), such as ½J“CK. “blAStk” (your place)
well as English translation.
in Maghrebi, or code switching with other lan-
For the segmentation, Mohamed et al. (2012) built
guages (Samih et al., 2016).
a segmenter based on memory-based learning.
The segmenter has been trained on a small cor- • Fusing multiple words together by concatenat-
pus of Egyptian Arabic comprising 320 comments ing tokens and dropping letters, such as the word
containing 20,022 words from www.masrawy. ½Ëñ®K
“yqwlk” (he says to you), “yqwl lk” are
com that were segmented and annotated by two concatenated and one “l” is dropped.
native speakers. They reported a 91.90% accu- 1
MADAMIRA release 20160516 2.1
racy on the task of segmentation. MADA-ARZ 2
Minor exception exist such the Egyptian template “At-
(Habash et al., 2013) is an Egyptian Arabic ex- fEl” that occasionally replaces the MSA template “AnfEl” as
tension of the Morphological Analysis and Dis- in Q儺K@ “Atksr” (broke)
• Additional affixes. Dialectal-specific affixes as the Maghrebi word AÒJ
» “kymA” (like/as in) and
may arise because of: the alteration of pro- the Leventine word ½J
ë “hyk” (like this). Given
nouns, such as the feminine second person pro-
the filtered tweets, we randomly selected 2,000
noun from ¼ “k” to ú»“ky” or the plural pronoun

unique tweets for each dialect, and we asked a na-
Õç' “tm” to ñK “tw”; the introduction of negation tive speaker of each dialect to manually select 350
prefix-suffix combination AÓ - € “mA-$”, which tweets that are heavily dialectal. Table 4 lists the
number of tweets that we obtained for each dialect
behaves like the French “ne-pas” negation con-
and the number of words they contain.
struct; the placement of present tense markers,
such as “b” in Egyptian and Levantine; the use Field Annotation
of different future markers such as “H”, “h”, and
“g” instead of “s” for MSA; and the shortening
Orig. word ½Ëñ®J
K. “byqwlk”
of prepositions and fusing them with the words Meaning he is saying to you
they precede such as the transformation of úΫ In situ Segm. ¼+ Èñ®K
+ H. “b+yqwl+k”
“ElY” (on) to ¨ “E”. CODA ½Ë Èñ®J
K. “byqwl lk”
• Letter substitution, where some letters are com- CODA Segm. ¼+ È Èñ®K
+ H. “b+yqwl l+k”
monly substituted for others such as “v” which
is replaced with “t” in Egyptian (as in Q
J» “ktyr” Table 1: Egyptian annotation example
(much)) or “q” which is replaced with “j” in Dialect No of Tweets No of Tokens
Gulf (as in h. Y“ “Sdj” (really). Egyptian 350 6,721
• Syntactic differences, such as the use of mascu- Levantine 350 6,648
line plural or singular noun forms instead dual Gulf 350 6,844
and feminine plural, the dropping of some ar- Maghrebi 350 5,495
ticles and preposition in some syntactic con-
structs, and the abandonment of some suffixes Table 2: Dataset size for the different dialects
such as “wn” in favor of “wA” for verbs and Segmentation of DA can be applied on the orig-
“yn” for nouns. inal raw text, or on the cleaned text after correct-
Using raw text from social media introduces ad- ing spelling mistakes and applying conventional
ditional phenomena such as word elongation, such orthography rules, such as CODA (Habash et al.,
2012). In this work, we decided to segment the
@P Q
J

J
k @ “>KyyyyrrA” (finally) instead of @Q
g @
original raw text. Though Egyptian CODA is a
“>KyrA”, and the use of non-Arabic characters
reasonably stable standard, CODA for other di-
such as Urdu characters (Darwish et al., 2012).
alects are either immature or nonexistent. Also,
4 Dataset CODA conversion tools are lacking for most di-
alects3 . Building such tools requires the estab-
We constructed our dataset by obtaining 350 lishment of clear guidelines, is laborious, and may
tweets that were authored for each of the follow- require large annotated corpora (Eskander et al.,
ing four dialects: Egyptian, Levantine, Gulf, and 2013), such as the LDC Egyptian Treebank.
Maghrebi. For dialectal Egyptian tweets, we ob- To prepare the ground truth data for a dialect,
tained the dataset described in (Darwish et al., we enlisted an annotator who is either a native
2014b), and we used the same methodology to speaker for the dialect or well versed in it and has
construct the dataset for the remaining dialects. background in natural language processing. The
Initially, we obtained 175 million Arabic tweets by authors along with another native speaker of the
querying the Twitter API using the query “lang:ar” dialect made multiple review rounds on the work
during March 2014. Then, we identified tweets of the annotator to ensure consistency and quality.
whose authors identified their location in countries The annotation guidelines were fairly straightfor-
where the dialects of interest are spoken (ex. Mo- ward. Basically, we asked annotators to:
rocco, Algeria, and Tunisia for Maghrebi) using
a large location gazetteer (Mubarak and Darwish, • segment words in a way that would maintain the
2014). Then we filtered the tweets using a list con- 3
except for Egyptian CODA tool that is embedded in
taining 10 strong dialectal words per dialect, such MADAMIRA
correct number of part of speech tags Diff. % Examples
• favor stems when repeated letters are dropped as Same no. of segments and same POS tags
Table 1 variable
2.4%
àA ‚Ê«
 ⇔ àA ‚«

• segment multiple concatenated words with spellings “E$An, El$An”
pluses as in the “merged words” example in Ta-
ble 3.
dropped
2.3%
Ð Qg A+K. ⇔ Ð Qg + H.
• attach injected long vowels that trail preposi- letters “b+Htrm, b+AHtrm”
tions or pronouns to the preposition or pronoun merged
1.4%
Ñ« AK
⇔ Ñ«+ AK

respectively (ex. ú¾J


Ë “lyky” (to you – feminine) words “yA+Em, yA Em”

→ “ly+ky”) Shortened
0.4%
ú¯ ⇔ ¬ , úΫ ⇔

¨
• treat dialectal words that originated as multiple particles “E, ElA f, fy”
fused words as single tokens (ex. €C« “ElA$” elongations 0.3% éJ
Ë ⇔ éJ
J

J
Ë “lyyyyh,lyh”

(why) – originally Zú
æ… ø
@ úΫ “ElY >y $y'”) Different no. of segments or POS tags
• do not segment name mentions and hashtags spelling
2.2%
B@ ð ⇔ Bð à@ ⇔ AK@
In what follows, we discuss the advantages errors “An, Ana wlA, wAlA”
fused ù+Ë ÈA  ⇔ ù+ËA¯
¯
and disadvantages of segmenting raw text versus 0.8%

the CODA’fied text with some statistics obtained letters “qAl+y, qAl l+y”
for the Egyptian tweets for which we have a
CODA’fied version as exemplified in Table 1. The Table 3: Original vs CODA Segmentations
main advantage of segmenting raw text is that it
doesn’t need any preprocessing tool to generate
CODA orthography, and the main advantage of when preceded by the present tense marker H. “b”
CODA is that is regularizes text making it more or the future tense marker ë “h”, and the “A” in
uniform and easier to process. We manually
negation particle AÓ“mA” which is often written as
compared the CODA version to the raw version
of 2,000 words in our Egyptian dataset. We Гm” and attached to the following word, and the
found that in 75.4% of the words, segmentation trailing letters in prepositions. “Merged words”
of original raw words is exactly the same as and “word elongations” are common in social me-
the their CODA’ifed equivalents (ex.
áÓ+ð dia, where users try to keep within limit by drop-
“w+mn” (and from) and Aê+ÊÒªK , “nEml+hA” ping the spaces between letters that do not connect
or to stress words respectively.
(we do it)). Further, if we normalize some char-
 Though some processing such as splitting of
acters, namely è ← è ,ø
← ø ,@ ← @, @, @, words or removing elongations is required to over-
and non-Arabic characters come the phenomena in this group, in situ segmen-

ø
← þ
,¼ ← ¹,Ã , ¬ ← ¬ ,h. ← h,
tation of raw words would yield identical segments
and remove diacritics, the percentage of matching with the same POS tags as their CODA counter-
increases to 90.3%. Table 3 showcases the parts. Thus, the segmentation of raw words could
remaining differences between raw and CODA be sufficient for 97% of words.
segmentations and how often they appear. The In the second group (accounting for 3% of
differences are divided into two groups. In the the cases), both may have a different number
first group (accounting for 6.8% of the cases), of segments or POS tags, which would compli-
the number of word segments remains the same cate downstream processing such as POS tagging.
and both the raw and CODA’fied segments would They involve spelling errors and the fusion of two
have the same POS tags. identical consecutive letters (gemmination). Cor-
In this group, the “variable spelling” class con- recting such errors may require a spell checker.
tains dialectal words that may have different com- We opted to segment raw input without correction
mon spellings with one “standard” spelling in in our reference, and we kept stem, such as verbs
CODA. The “dropped letter” and “shortened parti- and nouns, complete at the expense of other seg-
cles” classes typically involve the omission of let- ments such as prepositions as in the example in

ters such as the first person imperfect prefix @ “>” Table 1.
5 Segmentation Approaches • Conditional probability that a leading character
sequence is a prefix.
We present here two different systems for word
• Conditional probability that a trailing character
segmentation. The first uses SVM-based rank-
sequence is a suffix.
ing (SVMRank )4 to rank different possible seg-
mentations for a word using a variety of features. • probability of the prefix given the suffix.
The second uses bi-LSTM-CRF, which performs • probability of the suffix given the prefix.
sequence-to-sequence mapping to guess word seg- • unigram probability of the stem (more details
mentation. about calculating this is showing below).
• unigram probability of the stem with first suffix.
5.1 SVMRank Approach • whether a valid stem template can be obtained
This approach is inspired by the work done by from the stem.
Abdelali et al. (2016), in which they used SVM • whether the stem that has no trailing suffixes ap-
based ranking to ascertain the best segmentation pears in a gazetteer of person and location names
for Modern Standard Arabic (MSA), which they (Abdelali et al., 2016).
show to be fast and accurate. The approach in- • whether the stem is a function word, such as
volves generating all possible segmentations of a úΫ “ElY” (on), áÓ “mn” (from), and Ó  “m$”
word and then ranking them.
(not).
In training, we generate all possible seg-
• whether the stem appears in the AraComLex5
mentations of a word based on a closed set of
Arabic lexicon (Attia et al., 2011) or in the
prefixes and suffixes, and the correct segmen-
Buckwalter lexicon (Buckwalter, 2002). This is
tation is assigned rank 1 and all other incorrect
sensible considering the large overlap between
segmentations are assigned rank 2. Our valid
MSA and DA.
affixes include MSA prefixes and suffixes that
we extracted from Farasa (Abdelali et al., 2016) • length difference from the average stem length.
and additional dialectal prefixes and suffixes The segmentations with their corresponding
that we observed during training. Since we are features are then passed to the SVM ranker
not mapping words into a standard spelling, (Joachims, 2006) for training. Our SVMRank uses
such as CODA, prefixes and suffixes may have a linear kernel and a trade-off parameter between
multiple different representations. For example, training error and margin of 100.
given the dialectal Egyptian word for “I do not Before training the classifier, features needed to
play”, it could be spelled as “m+b+lEb+$”, calculated in advance. As training data, we used
“mA+b+lEb+$”, “mA+b+AlEb+$”, the aforementioned sets of 350 dialectal tweets for
“m+b+lEb+$y”, “mA+b+AlEb+$y”, etc. In each dialect containing typically several thousand
this example, the first prefix could be “m” or words each. We also use three parts of the Penn
“mA” and the suffix could be “$” or “$y”. Arabic Treebank (ATB); part 1 (version 4.1), 2
Here are two example dialectal Egyptian words (version 3.1), and 3 (version 2), which have a com-
to demonstrate segmentation: bined size of 628,870 tokens, to lookup MSA seg-
• Given the input word: 
€ñËA« “EAlw$” mentations. The intuition behind using such seg-
mented MSA data for lookup is that both MSA and
(on the face), possible segmentations are:
dialects share a fair amount of vocabulary. Thus,
{E+Al+w$} (correct segmentation), {E+Alw$},
using the ATB corpus has the effect of increasing
{E+Al+w+$}, {E+Alw+$}, {EAlw+$}, and
coverage.
{EAlw$}.
We also adopted the simplifying assumption
• Given the input word: ú¾K
XAK. “bAdyky”

that any given word has only 1 possible correct
(I give you (feminine)), possible segmenta- segmentation regardless of context. Though this
tions are: {b+Ady+ky} (correct segmentation), assumption is not always true, previous work on
{b+Adyky}, {bAdy+ky}, and {bAdyky}. MSA has shown that it holds for 99% of the cases
We use the following features in training the (Abdelali et al., 2016). Invoking this assumption
classifier: has multiple positive implications, namely: we
4 5
https://www.cs.cornell.edu/people/tj/ http://sourceforge.net/projects/
svm_light/svm_rank.html aracomlex/
can use the segmentations that we observed during interpretation about this architecture can be found
training directly, which typically cover most com- in (Lipton et al., 2015).
mon function words, or segmentations that we ob-
5.2.2 Bi-directional LSTM
served in the ATB, which cover most MSA words
that may be prevalent in dialectal text; and we can Bi-LSTM networks (Schuster and Paliwal, 1997)
cache word segmentations leading to significant are extensions to single LSTM networks. They are
speedup. Thus, we experimented with three dif- capable of learning long-term dependencies and
ferent lookup schemes for every word, namely: 1) maintain contextual features from past and future.
we output the rankers guess directly (None); 2) As shown in Figure 1, they comprise two separate
if exists, we use seen segmentations in dialectal hidden layers that feed forward to the same out-
training set, and the output of the ranker otherwise put layer. A bi-LSTM calculates the forward hid-


(DA); 3) if exists, we use seen segmentation in di- den sequence h , the backward hidden sequence
←−
alectal training set, else we use segmentation that h and the output sequence y by iterating over the
we observed in the ATB, and lastly the output of following equations :
the ranker (DA+MSA). →
− →

ht = σ(Wx− → xt + W−
h
→−→ h t−1 + b−
h h
→)
h
5.2 Bi-LSTM-CRF Approach ←
− ←

ht = σ(Wx← − x + W←
h t
−←− h t−1 + b←
h h
−)
h
5.2.1 Long Short-term Memory →
− ←−
yt = W −→ h + W←
hy t
− h + by
hy t
Recurrent Neural Network (RNN) belongs to a
family of neural networks suited for modeling se- More interpretations about these formulas are
quential data. Given an input sequence x = found in Graves et al. (2013a)
(x1 , ..., xn ), an RNN computes the output vector
yt of each word xt by iterating the following equa- 5.2.3 Conditional Random Fields (CRF)
tions from t = 1 to n: Over the past few years, bi-LSTMs have achieved
many ground-breaking results in many NLP tasks
ht = f (Wxh xt + Whh ht−1 + bh ) because of their ability to cope with long dis-
yt = Why ht + by tance dependencies and exploit contextual fea-
tures from past and future states. Still, when
where ht is the hidden states vector, W denotes they are used for some specific sequence classi-
weight matrix, b denotes bias vector and f is the fication tasks, (such as segmentation and named
activation function of the hidden layer. Theoreti- entity detection), where there is strict dependence
cally, RNNs can learn long distance dependencies, between output labels, they fail to generalize per-
still in practice they fail due vanishing/exploding fectly. During the training phase of the bi-LSTM
gradients (Bengio et al., 1994). To solve this prob- networks, the resulting probability distributions
lem , Hochreiter and Schmidhuber (1997) intro- for different time steps are independent from each
duced the LSTM RNN. The idea consists of aug- other. To overcome the independence assumptions
menting an RNN with memory cells to overcome imposed by the bi-LSTM and to exploit these kind
difficulties with training and efficiently cope with of labeling constraints in our Arabic segmentation
long distance dependencies. The output of the system, we model label sequence logic jointly us-
LSTM hidden layer ht given input xt is com- ing Conditional Random Fields (CRF) (Lafferty
puted via the following intermediate calculations: et al., 2001).
(Graves, 2013):
5.2.4 bi-LSTM-CRF for DA Segmentation
it = σ(Wxi xt + Whi ht−1 + Wci ct−1 + bi )
In this model we consider Arabic segmentation as
ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1 + bf ) a sequence labeling problem at the character level.
ct = ft ct−1 + it tanh(Wxc xt + Whc ht−1 + bc ) Each character is labeled with one of five labels
ot = σ(Wxo xt + Who ht−1 + Wco ct + bo ) B, M, E, S, W B that designate the segmentation
decision boundaries: Beginning, Middle, End of
ht = ot tanh(ct )
a multi-character segment, Single character seg-
where σ is the logistic sigmoid function, and i, ment, and Word Boundary respectively. Figure
f , o and c are respectively the input gate, forget 1 illustrates our segmentation model and how the

model takes the word éJ.ʯ “qlbh” (his heart) as its
gate, output gate and cell activation vectors. More
idea behind dropout involves randomly omitting a
certain percentage of the neurons in each hidden
layer for each presentation of the samples during
training. This encourages each neuron to depend
less on the other neurons to learn the right seg-
mentation decision boundaries. We apply dropout
masks to the character embedding layer before in-
putting to the bi-LSTM and to its output vector. In
our experiments, we find that dropout with a fixed
rate of 0.5 decreases overfitting and improves the
overall performance of our system. We also em-
ploy early stopping (Caruana et al., 2000; Graves
et al., 2013b) to mitigate overfitting by monitoring
the model’s performance on the development set.

6 Experiments and Results


Figure 1: Architecture of our proposed neural net-
work Arabic segmentation model applied to word. As described earlier, we perform several exper-
éJ.ʯ “qlbh” and output “qlb+h” iments for each dialect. These involve training
using dialectal data while using different lookup
current input and predicts its correct segmentation. schemes, namely: no lookup (None); lookup
The model is comprised of the following three lay- from dialectal training only (DA); and a cascaded
ers: lookup from dialectal training and then MSA
• Input layer: containing character embeddings. (DA+MSA). For all our experiments, we use 5
fold cross validation with 70/10/20 train/dev/test
• Hidden layer: bi-LSTM maps character repre-
splits. We use the Farasa MSA segmenter as a
sentations to hidden sequences.
baseline. Table 4 reports on the results for both
• Output layer: CRF computes the probability dis-
segmentation approaches and in combination of
tribution over all labels.
using different lookup schemes. As the results
At the input layer, a look-up table is ini- clearly shows, using an MSA segmenter yields
tialized with randomly uniform sampled embed- suboptimal results for dialects. Also, when no
dings6 mapping each character in the input to d- lookup is used, the bi-LSTM-CRF sequence la-
dimensional vector. At the hidden layer, the output beler performs better than the SVM ranker for all
from the character embeddings is used as the input dialects. However, using lookup leads to greater
to the bi-LSTM layer to obtain fixed-dimensional improvements for the SVM approach leading to
representations for each character. At the output the best results for Levantine, Gulf, and Maghrebi
layer, a CRF is applied over the hidden represen- and slightly lower results for Egyptian. Further,
tation of the bi-LSTM to obtain the probability SVMRank seemed to have benefited more from the
distribution over all the labels. Training is per- DA lookup, while bi-LSTM-CRF benefited more
formed using stochastic gradient (SGD) descent from the MSA lookup. As for Egyptian segmenta-
with momentum 0.9 and batch size 50, optimizing tion, we suspected that it performed better for both
the cross entropy objective function. approaches than the segmentation for the other di-
alects, because the percentage of test words that
5.2.5 Optimization
appear in the training set was greater for Egyptian.
Due to the relatively small size the training and The percentages for all the dialects are:
development sets, overfitting poses a considerable
Egyptian Levantine Gulf Maghrebi
challenge for our Dialectal Arabic segmentation
64.7% 54.7% 56.7% 55.2%
system. To make sure that our model learns sig-
nificant representations, we resort to dropout (Hin- As for the lower results for Maghrebi, we noticed
ton et al., 2012) to mitigate overfitting. The basic that Maghrebi has many more affixes than MSA
6
and other dialects. These affixes contribute to the
We did not use pre-trained character embeddings, be-
cause we conducted side experiments with and without pre- data sparsity and complexity of the segmentation
trained embeddings and the results were mixed task. For example, we enumerated 24 prefixes
Training Set Look-up Egyptian Levantine Gulf Maghrebi
None 91.0 87.8 87.7 84.7
SVMRank DA 94.5 92.9 92.8 90.5
DA+MSA 94.6 93.3 93.1 91.2
None 93.8 91.0 89.4 87.1
bi-LSTM-CRF DA 94.2 91.8 90.8 88.5
DA+MSA 95.0 93.0 91.9 90.1
Farasa 85.7 82.6 82.9 82.6

Table 4: SVMRank and bi-LSTM results w/ and w/o lookup

for Maghrebi compared to 8 for MSA, 17 for data, the disputed letter belongs to the first to-
Levantine and Gulf, and 12 for Egyptian. Sim- ken while in the system output, it belongs to the
ilarly, Maghrebi had more suffixes than MSA second.
and other dialects. To ascertaining the effect of • Like the SVM, the system often fails due to un-
CODA’fication, we ran an extra experiment were conventional spelling. For example the word
we trained our best SVMRank system using the “lAxwyA” (to my brother) is a mis-
AK
ñkB
CODA’fied version of the Egyptian data, and the
segmentation accuracy increased from 94.6% to spelling of AK
ñk B.
96.8%. Thus, having stable CODA standards and • The majority of the remaining errors are simply
reliable conversion tools may positively impact mis-tokenization due to the system’s inability to
dialectal processing. Next, we elaborate on typical decide whether a substring (which out of con-
errors of both approaches. text can be a valid token) is an independent to-

ken or part of a word, e.g. ½+ÊJ.®J‚Ó “mstqbl+k”
SVMRank Errors: We examined the errors that (your future), which is predicted by the system
the SVM ranker produced for different dialects as “m+staqbl+k”, where it correctly recognizes
and the most common types involved: the genitive pronoun in the end, but mistakenly
tags the first radical as a separate segment.
• erroneous splitting of leading or trailing charac-
ters when they were not prefixes or suffixes re- 7 Conclusion
spectively or not splitting actual prefix and suf-
fixes. For example, àñºJ
+k “H+ykwn” (will In this paper we presented two approaches in-
volving SVM-based ranking and bi-LSTM-CRF
be) was segmented as “Hyk+wn”.
sequence labeling for segmenting Egyptian, Lev-
• the use of non-Arabic letters, wrong form of antine, Gulf, and Maghrebi dialects. Both ap-
alef, or “h” instead of “p”. For example proaches yield strong comparable results that
Á+Ê+K
Ag. “jAy+l+k” (I am coming to you), range between 91% and 95% accuracy for dif-
where “A” and “k” were replace with “|” and a ferent dialects. To perform the work, we cre-
Farsi character respectively, was not segmented. ated training corpora containing naturally occur-
• long words with multiple segments such as ring text from social media for the aforementioned

+ ®Ê ® J+Ó “m+tqlq+y+nA+$” (don’t make
 J+J
€+A dialects. We plan to release the data and the result-
us angry) where the ranker chose to segment it ing segmenters to the research community. For fu-
as “m+tqlq+yn+A$”. ture work, we want to perform domain adaptation
using large MSA data, such as ATB, to improve
bi-LSTM-CRF Errors: The errors in this system segmentation results. Further, we plan to investi-
are broadly classified into three categories: gate building a joint model capable of segmenting
all the dialects with minimal loss in accuracy.
• Ambiguous in token boundary because of char-
acter sharing in case of gemmination/elision.
“En A” (about us) is References
For example the word AJ«
En and AK nA. The two à Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and
actually two tokens á« Hamdy Mubarak. 2016. Farasa: A fast and furious
n letters are now merged into one. In the gold segmenter for arabic. In Proceedings of the 2016
Conference of the North American Chapter of the Alex Graves. 2013. Generating sequences with
Association for Computational Linguistics: Demon- recurrent neural networks. arXiv preprint
strations. Association for Computational Linguis- arXiv:1308.0850 .
tics, San Diego, California, pages 11–16.
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mo-
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia hamed. 2013a. Hybrid speech recognition with deep
Tounsi, and Josef van Genabith. 2011. An open- bidirectional lstm. In Automatic Speech Recognition
source finite state morphological transducer for and Understanding (ASRU), 2013 IEEE Workshop
modern standard arabic. In Proceedings of the on. IEEE, pages 273–278.
9th International Workshop on Finite State Methods
and Natural Language Processing. Association for Alex Graves, Abdel-rahman Mohamed, and Geoffrey
Computational Linguistics, pages 125–133. Hinton. 2013b. Speech recognition with deep re-
current neural networks. In Acoustics, speech and
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. signal processing (icassp), 2013 ieee international
1994. Learning long-term dependencies with gradi- conference on. IEEE, pages 6645–6649.
ent descent is difficult. IEEE transactions on neural
networks 5(2):157–166. Nizar Habash, Mona T Diab, and Owen Rambow.
2012. Conventional orthography for dialectal ara-
Fadi Biadsy, Julia Hirschberg, and Nizar Habash. 2009. bic. In LREC. pages 711–718.
Spoken arabic dialect identification using phonotac-
tic modeling. In Proceedings of the EACL 2009 Nizar Habash, Ryan Roth, Owen Rambow, Ramy Es-
Workshop on Computational Approaches to Semitic kander, and Nadi Tomeh. 2013. Morphological
Languages. Association for Computational Linguis- analysis and disambiguation for dialectal arabic. In
tics, Stroudsburg, PA, USA, Semitic ’09, pages 53– Hlt-Naacl. pages 426–432.
61.
Nizar Habash and Fatiha Sadat. 2006. Arabic pre-
Houda Bouamor, Nizar Habash, and Kemal Oflazer. processing schemes for statistical machine transla-
2014. A multidialectal parallel corpus of arabic. tion. In Proceedings of the Human Language Tech-
In Nicoletta Calzolari (Conference Chair), Khalid nology Conference of the NAACL, Companion Vol-
Choukri, Thierry Declerck, Hrafn Loftsson, Bente ume: Short Papers. Association for Computational
Maegaard, Joseph Mariani, Asuncion Moreno, Jan Linguistics, pages 49–52.
Odijk, and Stelios Piperidis, editors, Proceedings of
the Ninth International Conference on Language Re- Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky,
sources and Evaluation (LREC’14). European Lan- Ilya Sutskever, and Ruslan R Salakhutdinov. 2012.
guage Resources Association (ELRA), Reykjavik, Improving neural networks by preventing co-
Iceland. adaptation of feature detectors. arXiv preprint
arXiv:1207.0580 .
Tim Buckwalter. 2002. Buckwalter {Arabic} morpho-
logical analyzer version 1.0 . Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural computation
Rich Caruana, Steve Lawrence, and Lee Giles. 2000.
9(8):1735–1780.
Overfitting in neural nets: Backpropagation, conju-
gate gradient, and early stopping. In NIPS. pages Zeinab Ibrahim. 2006. Borrowing in modern standard
402–408. arabic. Innovation and Continuity in Language and
Kareem Darwish, Walid Magdy, and Ahmed Mourad. Communication of Different Language Cultures 9.
2012. Language processing for arabic microblog Edited by Rudolf Muhr pages 235–260.
retrieval. In Proceedings of the 21st ACM inter- Thorsten Joachims. 2006. Training linear svms in lin-
national conference on Information and knowledge ear time. In Proceedings of the 12th ACM SIGKDD
management. ACM, pages 2427–2430. international conference on Knowledge discovery
Kareem Darwish, Walid Magdy, et al. 2014a. Arabic and data mining. ACM, pages 217–226.
information retrieval. Foundations and Trends® in
Sameer Khurana, Ahmed Ali, and Steve Renals.
Information Retrieval 7(4):239–342.
2016. Multi-view dimensionality reduction for di-
Kareem Darwish, Hassan Sajjad, and Hamdy Mubarak. alect identification of arabic broadcast speech. arXiv
2014b. Verifiably effective arabic dialect identifica- preprint arXiv:1609.05650 .
tion. In EMNLP. pages 1465–1468.
John D. Lafferty, Andrew McCallum, and Fernando
Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, and C. N. Pereira. 2001. Conditional random fields:
Kareem Darwish. 2016. Qcri@ dsl 2016: Spoken Probabilistic models for segmenting and labeling se-
arabic dialect identification using textual features. quence data. In Proc. ICML.
VarDial 3 page 221.
Zachary C Lipton, David C Kale, Charles Elkan, and
Ramy Eskander, Nizar Habash, Owen Rambow, and Randall Wetzell. 2015. A critical review of recur-
Nadi Tomeh. 2013. Processing spontaneous orthog- rent neural networks for sequence learning. CoRR
raphy. abs/1506.00019.
Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Rabih Zbib, Erika Malchiodi, Jacob Devlin, David
Ciul, Nizar Habash, and Ramy Eskander. 2014. De- Stallard, Spyros Matsoukas, Richard Schwartz, John
veloping an egyptian arabic treebank: Impact of di- Makhoul, Omar F. Zaidan, and Chris Callison-
alectal morphology on annotation and tool develop- Burch. 2012. Machine translation of arabic dialects.
ment. In LREC. pages 2348–2354. In Proceedings of the 2012 Conference of the North
American Chapter of the Association for Computa-
Emad Mohamed, Behrang Mohit, and Kemal Oflazer. tional Linguistics: Human Language Technologies.
2012. Annotating and learning morphological seg- Association for Computational Linguistics, Strouds-
mentation of egyptian colloquial arabic. In LREC. burg, PA, USA, NAACL HLT ’12, pages 49–59.
pages 873–877.
Will Monroe, Spence Green, and Christopher D Man-
ning. 2014. Word segmentation of informal arabic
with domain adaptation. In ACL (2). pages 206–211.
Hamdy Mubarak and Kareem Darwish. 2014. Using
twitter to collect a multi-dialectal corpus of arabic.
In Proceedings of the EMNLP 2014 Workshop on
Arabic Natural Language Processing (ANLP). pages
1–7.
Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab,
Ahmed El Kholy, Ramy Eskander, Nizar Habash,
Manoj Pooleery, Owen Rambow, and Ryan M Roth.
2014. Madamira: A fast, comprehensive tool for
morphological analysis and disambiguation of Ara-
bic. Proc. LREC .
Hassan Sajjad, Kareem Darwish, and Yonatan Be-
linkov. 2013. Translating dialectal Arabic to En-
glish. In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 2: Short Papers). Sofia, Bulgaria, ACL ’13,
pages 1–6.
Younes Samih, Mohammed Attia, Mohamed Eldes-
ouki, Hamdy Mubarak, Ahmed Abdelali, Laura
Kallmeyer, and Kareem Darwish. 2017. A neu-
ral architecture for dialectal arabic segmentation.
WANLP 2017 (co-located with EACL 2017) page 46.
Younes Samih, Suraj Maharjan, Mohammed Attia,
Laura Kallmeyer, and Thamar Solorio. 2016.
Multilingual code-switching identification via lstm
recurrent neural networks. In Proceedings of the
Second Workshop on Computational Approaches
to Code Switching,. Austin, TX, pages 50–59.
http://www.aclweb.org/anthology/W/W16/W16-
58.pdf#page=62.
Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-
tional recurrent neural networks. IEEE Transactions
on Signal Processing 45(11):2673–2681.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural machine translation of rare words with
subword units. In Proceedings of the 54th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers). Association for
Computational Linguistics, Berlin, Germany, pages
1715–1725. http://www.aclweb.org/anthology/P16-
1162.
Omar F Zaidan and Chris Callison-Burch. 2014. Ara-
bic dialect identification. Computational Linguistics
40(1):171–202.

You might also like