Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM
1 Qatar Computing Research Institute, HBKU, Doha, Qatar
2 Dept. of Computational Linguistics, University of Düsseldorf, Düsseldorf, Germany
3 Google Inc., New York City, USA
1 {mohamohamed,hmubarak,aabdelali,kdarwish}@hbku.edu.qa
2 {samih,kallmeyer}@phil.hhu.de
3 [email protected]
arXiv:1708.05891v1 [cs.CL] 19 Aug 2017
• treat dialectal words that originated as multiple words fused together as single tokens (ex. "ElA$" (why), originally "ElY >y $y'")
• do not segment name mentions and hashtags

In what follows, we discuss the advantages and disadvantages of segmenting raw text versus the CODA'fied text, with some statistics obtained for the Egyptian tweets for which we have a CODA'fied version, as exemplified in Table 1. The main advantage of segmenting raw text is that it doesn't need any preprocessing tool to generate CODA orthography, and the main advantage of CODA is that it regularizes text, making it more uniform and easier to process. We manually compared the CODA version to the raw version of 2,000 words in our Egyptian dataset. We found that in 75.4% of the words, the segmentation of the original raw words is exactly the same as their CODA'ified equivalents (ex. "w+mn" (and from) and "nEml+hA" (we do it)). Further, if we normalize some characters, namely the different hamzated forms of alef, alef maqsura vs. ya, teh marbuta vs. heh, and non-Arabic characters such as Farsi letter variants, and remove diacritics, the percentage of matching increases to 90.3%. Table 3 showcases the remaining differences between raw and CODA segmentations and how often they appear. The differences are divided into two groups. In the first group (accounting for 6.8% of the cases), the number of word segments remains the same, and both the raw and CODA'fied segments would have the same POS tags.

In this group, the "variable spelling" class contains dialectal words that may have different common spellings, with one "standard" spelling in CODA. The "dropped letter" and "shortened particles" classes typically involve the omission of letters, such as the first person imperfect prefix ">" when preceded by the present tense marker "b" or the future tense marker "h", the "A" in the negation particle "mA", which is often written as "m" and attached to the following word, and the trailing letters in prepositions. "Merged words" and "word elongations" are common in social media, where users try to keep within the length limit by dropping the spaces between letters that do not connect, or to stress words, respectively. Though some processing, such as the splitting of words or the removal of elongations, is required to overcome the phenomena in this group, in situ segmentation of raw words would yield identical segments with the same POS tags as their CODA counterparts. Thus, the segmentation of raw words could be sufficient for 97% of words.

In the second group (accounting for 3% of the cases), the two versions may have a different number of segments or POS tags, which would complicate downstream processing such as POS tagging. These cases involve spelling errors and the fusion of two identical consecutive letters (gemination). Correcting such errors may require a spell checker. We opted to segment raw input without correction in our reference, and we kept stems, such as verbs and nouns, complete at the expense of other segments such as prepositions, as in the example in Table 1.

[Table 3: Original vs. CODA segmentations with class frequencies. Recoverable entries include: shortened particles (e.g., "E, ElA" for "ElY", "f" for "fy") 0.4%; word elongations (e.g., "lyyyyh" for "lyh") 0.3%; spelling errors (e.g., "An" for "AnA", "wlA" for "wAlA") 2.2%; fused letters (e.g., "qAl+y" for "qAl l+y") 0.8%.]
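The character normalization step described above can be sketched as follows. This is an illustrative Python sketch operating on Buckwalter-transliterated text, not the authors' code; the exact mapping table is an assumption based on the character classes named in the text (alef variants, alef maqsura vs. ya, teh marbuta vs. heh, diacritic removal).

```python
# Hypothetical normalization sketch over Buckwalter transliteration.
NORM_MAP = {
    ">": "A", "<": "A", "|": "A",  # hamzated/madda alef variants -> bare alef
    "Y": "y",                      # alef maqsura -> ya
    "p": "h",                      # teh marbuta -> heh
}
DIACRITICS = set("aiuoFNK~")       # short vowels, tanween, shadda

def normalize(word: str) -> str:
    out = []
    for ch in word:
        if ch in DIACRITICS:
            continue               # drop diacritics entirely
        out.append(NORM_MAP.get(ch, ch))
    return "".join(out)

print(normalize("<lY"))   # -> "Aly"
print(normalize("qAla"))  # -> "qAl"
```

With a table of this kind, raw and CODA'fied spellings that differ only in these character classes compare as equal, which is what lifts the match rate from 75.4% to 90.3%.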
5 Segmentation Approaches

We present here two different systems for word segmentation. The first uses SVM-based ranking (SVMRank)4 to rank different possible segmentations of a word using a variety of features. The second uses a bi-LSTM-CRF, which performs sequence-to-sequence mapping to guess word segmentation.

4 https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html

5.1 SVMRank Approach

This approach is inspired by the work of Abdelali et al. (2016), in which they used SVM-based ranking to ascertain the best segmentation for Modern Standard Arabic (MSA), which they show to be fast and accurate. The approach involves generating all possible segmentations of a word and then ranking them.

In training, we generate all possible segmentations of a word based on a closed set of prefixes and suffixes; the correct segmentation is assigned rank 1, and all other, incorrect segmentations are assigned rank 2. Our valid affixes include MSA prefixes and suffixes that we extracted from Farasa (Abdelali et al., 2016) and additional dialectal prefixes and suffixes that we observed during training. Since we are not mapping words into a standard spelling, such as CODA, prefixes and suffixes may have multiple different representations. For example, the dialectal Egyptian word for "I do not play" could be spelled as "m+b+lEb+$", "mA+b+lEb+$", "mA+b+AlEb+$", "m+b+lEb+$y", "mA+b+AlEb+$y", etc. In this example, the first prefix could be "m" or "mA" and the suffix could be "$" or "$y".

Here are two example dialectal Egyptian words to demonstrate segmentation:
• Given the input word "EAlw$" (on the face), possible segmentations are: {E+Al+w$} (correct segmentation), {E+Alw$}, {E+Al+w+$}, {E+Alw+$}, {EAlw+$}, and {EAlw$}.
• Given the input word "bAdyky" (I give you (feminine)), possible segmentations are: {b+Ady+ky} (correct segmentation), {b+Adyky}, {bAdy+ky}, and {bAdyky}.

We use the following features in training the classifier:
• Conditional probability that a leading character sequence is a prefix.
• Conditional probability that a trailing character sequence is a suffix.
• Probability of the prefix given the suffix.
• Probability of the suffix given the prefix.
• Unigram probability of the stem (more details about calculating this are given below).
• Unigram probability of the stem with the first suffix.
• Whether a valid stem template can be obtained from the stem.
• Whether the stem that has no trailing suffixes appears in a gazetteer of person and location names (Abdelali et al., 2016).
• Whether the stem is a function word, such as "ElY" (on), "mn" (from), and "m$" (not).
• Whether the stem appears in the AraComLex5 Arabic lexicon (Attia et al., 2011) or in the Buckwalter lexicon (Buckwalter, 2002). This is sensible considering the large overlap between MSA and DA.
• Length difference from the average stem length.

5 http://sourceforge.net/projects/aracomlex/

The segmentations with their corresponding features are then passed to the SVM ranker (Joachims, 2006) for training. Our SVMRank uses a linear kernel and a trade-off parameter between training error and margin of 100. Before training the classifier, the features needed to be calculated in advance. As training data, we used the aforementioned sets of 350 dialectal tweets for each dialect, each typically containing several thousand words. We also use three parts of the Penn Arabic Treebank (ATB), part 1 (version 4.1), part 2 (version 3.1), and part 3 (version 2), which have a combined size of 628,870 tokens, to look up MSA segmentations. The intuition behind using such segmented MSA data for lookup is that MSA and the dialects share a fair amount of vocabulary. Thus, using the ATB corpus has the effect of increasing coverage.
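The candidate-generation and rank-assignment step described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: the affix inventories below are tiny illustrative subsets of the Farasa-derived and dialectal affix sets, and the function names are invented for the sketch.

```python
# Enumerate prefix+stem+suffix splits of a word against closed affix
# sets, then label the gold split rank 1 and all competitors rank 2,
# as in SVMRank-style training. Affix lists are illustrative only.
PREFIXES = ["", "m", "mA", "b", "m+b", "mA+b", "E", "E+Al", "w"]
SUFFIXES = ["", "$", "$y", "k", "ky", "hA", "w$"]

def candidates(word):
    """Yield '+'-joined splits of `word` with a non-empty stem."""
    for p in PREFIXES:
        pre = p.replace("+", "")           # surface form of the prefix
        for s in SUFFIXES:
            suf = s.replace("+", "")       # surface form of the suffix
            if word.startswith(pre) and word.endswith(suf):
                stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
                if stem:
                    yield "+".join(x for x in (p, stem, s) if x)

def ranked_training_pairs(word, gold):
    """Pair every candidate with rank 1 (gold) or rank 2 (non-gold)."""
    return [(c, 1 if c == gold else 2) for c in set(candidates(word))]

pairs = ranked_training_pairs("EAlw$", "E+Al+w$")
# The gold split receives rank 1; all competing splits receive rank 2.
```

In the real system, each candidate would additionally be mapped to the feature vector described above before being handed to the SVM ranker.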
We also adopted the simplifying assumption that any given word has only one possible correct segmentation regardless of context. Though this assumption is not always true, previous work on MSA has shown that it holds for 99% of the cases (Abdelali et al., 2016). Invoking this assumption has multiple positive implications, namely: we can use the segmentations that we observed during training directly, which typically cover most common function words, or segmentations that we observed in the ATB, which cover most MSA words that may be prevalent in dialectal text; and we can cache word segmentations, leading to a significant speedup. Thus, we experimented with three different lookup schemes for every word, namely: 1) we output the ranker's guess directly (None); 2) if it exists, we use the segmentation seen in the dialectal training set, and the output of the ranker otherwise (DA); 3) if it exists, we use the segmentation seen in the dialectal training set, else the segmentation observed in the ATB, and lastly the output of the ranker (DA+MSA).

5.2 Bi-LSTM-CRF Approach

5.2.1 Long Short-term Memory

A Recurrent Neural Network (RNN) belongs to a family of neural networks suited for modeling sequential data. Given an input sequence $x = (x_1, ..., x_n)$, an RNN computes the output vector $y_t$ of each word $x_t$ by iterating the following equations from $t = 1$ to $n$:

$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

where $h_t$ is the hidden state vector, $W$ denotes a weight matrix, $b$ denotes a bias vector, and $f$ is the activation function of the hidden layer. Theoretically, RNNs can learn long distance dependencies; in practice, however, they fail to do so due to vanishing/exploding gradients (Bengio et al., 1994). To solve this problem, Hochreiter and Schmidhuber (1997) introduced the LSTM RNN. The idea consists of augmenting an RNN with memory cells to overcome difficulties with training and to efficiently cope with long distance dependencies. The output of the LSTM hidden layer $h_t$ given input $x_t$ is computed via the following intermediate calculations (Graves, 2013):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors. More interpretation of this architecture can be found in Lipton et al. (2015).

5.2.2 Bi-directional LSTM

Bi-LSTM networks (Schuster and Paliwal, 1997) are extensions of single LSTM networks. They are capable of learning long-term dependencies and maintain contextual features from both past and future states. As shown in Figure 1, they comprise two separate hidden layers that feed forward to the same output layer. A bi-LSTM calculates the forward hidden sequence $\overrightarrow{h}$, the backward hidden sequence $\overleftarrow{h}$, and the output sequence $y$ by iterating over the following equations:

$$\overrightarrow{h}_t = \sigma(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$$
$$\overleftarrow{h}_t = \sigma(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t-1} + b_{\overleftarrow{h}})$$
$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$$

More interpretation of these formulas can be found in Graves et al. (2013a).

5.2.3 Conditional Random Fields (CRF)

Over the past few years, bi-LSTMs have achieved many ground-breaking results in many NLP tasks because of their ability to cope with long distance dependencies and to exploit contextual features from past and future states. Still, when they are used for some specific sequence classification tasks (such as segmentation and named entity detection), where there is strict dependence between output labels, they fail to generalize perfectly. During the training phase of bi-LSTM networks, the resulting probability distributions for different time steps are independent of each other. To overcome the independence assumptions imposed by the bi-LSTM and to exploit this kind of labeling constraint in our Arabic segmentation system, we model label sequence logic jointly using Conditional Random Fields (CRF) (Lafferty et al., 2001).

5.2.4 bi-LSTM-CRF for DA Segmentation

In this model, we consider Arabic segmentation as a sequence labeling problem at the character level. Each character is labeled with one of five labels, B, M, E, S, and WB, which designate the segmentation decision boundaries: Beginning, Middle, and End of a multi-character segment, Single-character segment, and Word Boundary, respectively. Figure 1 illustrates our segmentation model and how the model takes the word "qlbh" (his heart) as its input.
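The label scheme above can be made concrete with a small converter from segmented words to character labels. This is an assumed sketch, not the paper's code; the helper name is invented, and the example reuses the word "qlbh" segmented as "qlb+h".

```python
# Convert '+'-segmented words into the character-level label sequence
# described above: B/M/E for multi-character segments, S for
# single-character segments, and WB between consecutive words.
def bmes_labels(segmented_words):
    """segmented_words: list of '+'-joined segmentations, e.g. 'qlb+h'."""
    labels = []
    for i, word in enumerate(segmented_words):
        if i > 0:
            labels.append("WB")            # boundary between words
        for seg in word.split("+"):
            if len(seg) == 1:
                labels.append("S")         # single-character segment
            else:                          # B, then middles, then E
                labels.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return labels

print(bmes_labels(["qlb+h"]))  # -> ['B', 'M', 'E', 'S']
```

Inverting this mapping at decoding time recovers the segment boundaries from the label sequence the bi-LSTM-CRF predicts.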
The idea behind dropout involves randomly omitting a certain percentage of the neurons in each hidden layer for each presentation of the samples during training. This encourages each neuron to depend less on the other neurons to learn the right segmentation decision boundaries. We apply dropout masks to the character embedding layer before input to the bi-LSTM, and to its output vector. In our experiments, we find that dropout with a fixed rate of 0.5 decreases overfitting and improves the overall performance of our system. We also employ early stopping (Caruana et al., 2000; Graves et al., 2013b) to mitigate overfitting by monitoring the model's performance on the development set.
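The dropout masking described above can be sketched in a few lines. This is an illustrative pure-Python sketch under the assumption of standard inverted dropout (surviving activations are rescaled so their expected value is unchanged); the function name and the toy vector are invented for the example.

```python
import random

# Inverted-dropout sketch: zero each dimension with probability p and
# rescale survivors by 1/(1-p), as applied to embedding vectors above.
def dropout(vector, p=0.5, rng=random):
    """Apply one inverted-dropout mask to a single embedding vector."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * scale for x in vector]

rng = random.Random(0)                      # seeded for reproducibility
v = dropout([0.2, -0.4, 0.1, 0.3], p=0.5, rng=rng)
# With p=0.5, each surviving value is doubled; dropped ones are 0.0.
```

At test time no mask is applied, and thanks to the rescaling no compensation of the weights is needed.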
for Maghrebi, compared to 8 for MSA, 17 for Levantine and Gulf, and 12 for Egyptian. Similarly, Maghrebi had more suffixes than MSA and the other dialects. To ascertain the effect of CODA'fication, we ran an extra experiment where we trained our best SVMRank system using the CODA'fied version of the Egyptian data, and the segmentation accuracy increased from 94.6% to 96.8%. Thus, having stable CODA standards and reliable conversion tools may positively impact dialectal processing. Next, we elaborate on typical errors of both approaches.

SVMRank Errors: We examined the errors that the SVM ranker produced for the different dialects, and the most common types involved:

• erroneous splitting of leading or trailing characters when they were not prefixes or suffixes, respectively, or failure to split actual prefixes and suffixes. For example, "H+ykwn" (will be) was segmented as "Hyk+wn".
• the use of non-Arabic letters, the wrong form of alef, or "h" instead of "p". For example, "jAy+l+k" (I am coming to you), where "A" and "k" were replaced with "|" and a Farsi character respectively, was not segmented.
• long words with multiple segments, such as "m+tqlq+y+nA+$" (don't make us angry), where the ranker chose to segment it as "m+tqlq+yn+A$".

bi-LSTM-CRF Errors: The errors in this system are broadly classified into three categories:

• Ambiguity in token boundaries because of character sharing in cases of gemination/elision. For example, the word "EnA" (about us) is actually two tokens, "En" and "nA"; the two "n" letters are merged into one. In the gold data, the disputed letter belongs to the first token, while in the system output it belongs to the second.
• Like the SVM, the system often fails due to unconventional spelling. For example, the word "lAxwyA" (to my brother) is a misspelling of its conventional form with the hamzated alef.
• The majority of the remaining errors are simply mis-tokenizations due to the system's inability to decide whether a substring (which out of context can be a valid token) is an independent token or part of a word, e.g. "mstqbl+k" (your future), which is predicted by the system as "m+staqbl+k", where it correctly recognizes the genitive pronoun at the end but mistakenly tags the first radical as a separate segment.

7 Conclusion

In this paper, we presented two approaches, involving SVM-based ranking and bi-LSTM-CRF sequence labeling, for segmenting the Egyptian, Levantine, Gulf, and Maghrebi dialects. Both approaches yield strong, comparable results that range between 91% and 95% accuracy for the different dialects. To perform the work, we created training corpora containing naturally occurring text from social media for the aforementioned dialects. We plan to release the data and the resulting segmenters to the research community. For future work, we want to perform domain adaptation using large MSA data, such as the ATB, to improve segmentation results. Further, we plan to investigate building a joint model capable of segmenting all the dialects with minimal loss in accuracy.

References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, San Diego, California, pages 11–16.
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef van Genabith. 2011. An open-source finite state morphological transducer for Modern Standard Arabic. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing. Association for Computational Linguistics, pages 125–133.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

Fadi Biadsy, Julia Hirschberg, and Nizar Habash. 2009. Spoken Arabic dialect identification using phonotactic modeling. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Association for Computational Linguistics, Stroudsburg, PA, USA, Semitic '09, pages 53–61.

Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland.

Tim Buckwalter. 2002. Buckwalter Arabic morphological analyzer version 1.0.

Rich Caruana, Steve Lawrence, and Lee Giles. 2000. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NIPS, pages 402–408.

Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, pages 2427–2430.

Kareem Darwish, Walid Magdy, et al. 2014a. Arabic information retrieval. Foundations and Trends in Information Retrieval 7(4):239–342.

Kareem Darwish, Hassan Sajjad, and Hamdy Mubarak. 2014b. Verifiably effective Arabic dialect identification. In EMNLP, pages 1465–1468.

Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, and Kareem Darwish. 2016. QCRI @ DSL 2016: Spoken Arabic dialect identification using textual features. VarDial 3, page 221.

Ramy Eskander, Nizar Habash, Owen Rambow, and Nadi Tomeh. 2013. Processing spontaneous orthography.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013a. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, pages 273–278.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013b. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, pages 6645–6649.

Nizar Habash, Mona T. Diab, and Owen Rambow. 2012. Conventional orthography for dialectal Arabic. In LREC, pages 711–718.

Nizar Habash, Ryan Roth, Owen Rambow, Ramy Eskander, and Nadi Tomeh. 2013. Morphological analysis and disambiguation for dialectal Arabic. In HLT-NAACL, pages 426–432.

Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, pages 49–52.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Zeinab Ibrahim. 2006. Borrowing in Modern Standard Arabic. Innovation and Continuity in Language and Communication of Different Language Cultures 9. Edited by Rudolf Muhr, pages 235–260.

Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 217–226.

Sameer Khurana, Ahmed Ali, and Steve Renals. 2016. Multi-view dimensionality reduction for dialect identification of Arabic broadcast speech. arXiv preprint arXiv:1609.05650.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzell. 2015. A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019.

Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash, and Ramy Eskander. 2014. Developing an Egyptian Arabic treebank: Impact of dialectal morphology on annotation and tool development. In LREC, pages 2348–2354.

Emad Mohamed, Behrang Mohit, and Kemal Oflazer. 2012. Annotating and learning morphological segmentation of Egyptian colloquial Arabic. In LREC, pages 873–877.

Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word segmentation of informal Arabic with domain adaptation. In ACL (2), pages 206–211.

Hamdy Mubarak and Kareem Darwish. 2014. Using Twitter to collect a multi-dialectal corpus of Arabic. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 1–7.

Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proc. LREC.

Hassan Sajjad, Kareem Darwish, and Yonatan Belinkov. 2013. Translating dialectal Arabic to English. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, ACL '13, pages 1–6.

Younes Samih, Mohammed Attia, Mohamed Eldesouki, Hamdy Mubarak, Ahmed Abdelali, Laura Kallmeyer, and Kareem Darwish. 2017. A neural architecture for dialectal Arabic segmentation. WANLP 2017 (co-located with EACL 2017), page 46.

Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, and Thamar Solorio. 2016. Multilingual code-switching identification via LSTM recurrent neural networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching. Austin, TX, pages 50–59.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1715–1725.

Omar F. Zaidan and Chris Callison-Burch. 2014. Arabic dialect identification. Computational Linguistics 40(1):171–202.

Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F. Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL HLT '12, pages 49–59.