Manually Constructed Context-Free Grammar For Myanmar Syllable Structure
Proceedings of the EACL 2012 Student Research Workshop, pages 32–37,
Avignon, France, 26 April 2012.
© 2012 Association for Computational Linguistics
3. V = Attached Vowel
4. K = Vowel Killer or ASAT
5. D = Diacritic
6. I = Free standing Vowel
7. N = Digit

The notation [ ] means 0 or 1 occurrence and { } means 0 or more occurrences.

However, in this paper, we ignore digits, free standing vowels and punctuation marks in writing the grammar for the Myanmar syllable, and we focus only on the five basic and major sub syllabic groups, namely consonants (C), medials (M), attached vowels (V), a vowel killer (K) and diacritics (D). The following subsection gives the details of each sub syllabic group.

2.1 Brief Description of Basic Myanmar Sub Syllabic Elements

Each Myanmar consonant has a default vowel sound and works as a syllable by itself. The set of consonants in the Unicode chart is C = {က, ခ, ဂ, ဃ, င, စ, ဆ, ဇ, ဈ, ဉ, ည, ဋ, ဌ, ဍ, ဎ, ဏ, တ, ထ, ဒ, ဓ, န, ပ, ဖ, ဗ, ဘ, မ, ယ, ရ, လ, ဝ, သ, ဟ, ဠ}, having 33 elements. In addition, the letter အ can act as a consonant as well as a free standing vowel.

Medials, or consonant conjuncts, are modifiers of the syllable's vowel and they are encoded separately in the Unicode encoding. There are four basic medials in the Unicode chart, represented as the set M = {ျ, ြ, ွ, ှ}.

The set V of Myanmar attached vowel characters in Unicode contains 8 elements {ါ, ာ, ိ, ီ, ု, ူ, ေ, ဲ} (Peter and William, 1996).

Diacritics alter the vowel sounds of accompanying consonants and are used to indicate tone level. There are 2 diacritical marks {့, း} in Myanmar script, and the set is represented as D.

The asat, or killer, representing the set K = {်}, is a visibly displayed sign. In some cases it indicates that the inherent vowel sound of a consonant letter is suppressed. In other cases it combines with other characters to form a vowel letter. Regardless of its function, this visible sign is always represented by the character U+103A (http://www.unicode.org/versions/Unicode6.0.0/ch11.pdf) [John Okell, 1994].

In the Unicode chart, the diacritics group D and the vowel killer or ASAT "K" are included in the group named "various signs".

2.2 Preprocessing of Texts - Contraction

In writing a formal grammar for a Myanmar syllable, there are cases where two or more Myanmar characters combine with each other, and the resulting combined forms are used in the Myanmar traditional writing system although they are not coded directly in the Myanmar Unicode chart. Such combinations of vowels and medials are described in detail below.

Two or more Myanmar attached vowels combine to form three new members {ော, ော်, ို} of the vowel set.

Glyph   Unicode for Contraction   Description
ော      1031 + 102C               Vowel sign E + AA
ော်     1031 + 102C + 103A        Vowel sign E + AA + ASAT
ို      102D + 102F               Vowel sign I + U

“Table 1. Contractions of vowels”

Similarly, the 4 basic Myanmar medials combine with each other in different ways and produce a new set of medials {ျွ, ြွ, ျှ, ြှ, ွှ, ျွှ, ြွှ}. [Tin Htay Hlaing and Yoshiki Mikami, 2011]

Glyph   Unicode for Contraction   Description
ျွ      103B + 103D               Consonant Sign Medial YA + WA
ြွ      103C + 103D               Consonant Sign Medial RA + WA
ျှ      103B + 103E               Consonant Sign Medial YA + HA
ြှ      103C + 103E               Consonant Sign Medial RA + HA
ွှ      103D + 103E               Consonant Sign Medial WA + HA
ျွှ     103B + 103D + 103E        Consonant Sign Medial YA + WA + HA
ြွှ     103C + 103D + 103E        Consonant Sign Medial RA + WA + HA

“Table 2. Contractions of Medials”

The above mentioned combinations of characters are each considered as one vowel or one medial in constructing the grammar. The complete sets of elements for vowels and medials used in writing the grammar are depicted in the table below. (Sorting order of medials and attached vowels in Myanmar Orthography.)

Name of Sub Syllabic Component    Elements
Medials or Conjunct Consonants    ျ, ြ, ွ, ှ, ျွ, ြွ, ျှ, ြှ, ွှ, ျွှ, ြွှ
Attached vowels                   အ, ါ, ာ, ိ, ီ, ု, ူ, ေ, ဲ, ော, ော်, ို

In the Myanmar language, a single consonant can be taken as one syllable because Myanmar script is an abugida, which means all letters have an inherent vowel. Consonants can also be followed by vowels, a consonant, the vowel killer and medials in different combinations.

One special feature is that if there are two consonants in a given syllable, the second consonant must be followed by the vowel killer (K). We found 1872 combinations of sub-syllabic elements in Myanmar Orthography [Myanmar Language Commission, 2006]. The table below shows the top level combinations of these sub-syllabic elements.

Consonant   Consonant followed   Consonant followed   Consonant followed
only        by Vowel             by Consonant         by Medial
C           CV                   CCK                  CM
            CVCK                 CCKD                 CMV
            CVD                                       CMVD
            CVCKD                                     CMVCK
                                                      CMVCKD
                                                      CMCK
                                                      CMCKD
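The contraction step of Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `tokenize`, `SINGLE` and `CONTRACTIONS` are hypothetical, the code point ranges are assumptions read off the Unicode Myanmar block, and the letter အ is left out. The sketch labels each character with its sub syllabic class and merges the contracted forms of Tables 1 and 2 into a single V or M token by trying longer sequences first.

```python
# Single-character classes (assumed code point ranges, Unicode Myanmar block).
SINGLE = {}
SINGLE.update({chr(cp): "C" for cp in range(0x1000, 0x1021)})  # 33 consonants
SINGLE.update({chr(cp): "M" for cp in range(0x103B, 0x103F)})  # 4 basic medials
SINGLE.update({chr(cp): "V" for cp in range(0x102B, 0x1033)})  # 8 attached vowels
SINGLE["\u103A"] = "K"                                         # asat (vowel killer)
SINGLE.update({"\u1037": "D", "\u1038": "D"})                  # 2 diacritics

# Contracted forms from Tables 1 and 2, each treated as one vowel or medial.
CONTRACTIONS = [
    ("V", "\u1031\u102C\u103A"),
    ("V", "\u1031\u102C"),
    ("V", "\u102D\u102F"),
    ("M", "\u103B\u103D\u103E"),
    ("M", "\u103C\u103D\u103E"),
    ("M", "\u103B\u103D"),
    ("M", "\u103C\u103D"),
    ("M", "\u103B\u103E"),
    ("M", "\u103C\u103E"),
    ("M", "\u103D\u103E"),
]
# Try longer sequences first so that a three-character contraction wins
# over its two-character prefix.
CONTRACTIONS.sort(key=lambda pair: len(pair[1]), reverse=True)


def tokenize(text):
    """Return a list of (class, characters) tokens for a Myanmar string."""
    tokens = []
    i = 0
    while i < len(text):
        for cls, seq in CONTRACTIONS:
            if text.startswith(seq, i):
                tokens.append((cls, seq))
                i += len(seq)
                break
        else:
            ch = text[i]
            if ch not in SINGLE:
                raise ValueError(f"unsupported character U+{ord(ch):04X}")
            tokens.append((SINGLE[ch], ch))
            i += 1
    return tokens
```

For example, `tokenize("\u1000\u103B\u103D\u1031\u102C\u1037")` yields four tokens of classes C, M, V and D, because U+103B U+103D and U+1031 U+102C are each merged into one contracted token.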
In the above FSA, an interesting point is that a single consonant can be a syllable because Myanmar consonants have default vowel sounds. That is why state 2 can be a final state. For instance, the Myanmar word “မိန်းမ” (meaning “woman” in English) has two syllables. In the first syllable “မိန်း”, the sub syllabic elements are Consonant (မ) + Vowel (ိ) + Consonant (န) + Vowel Killer (်) + Diacritic (း). The second syllable has only one consonant, “မ”.

3 Myanmar Syllable Structure in Context-Free Grammar

3.1 Manually Constructed Context-Free Grammar for Myanmar Syllable Structure

Context free (CF) grammar refers to the grammar rules of languages which are formulated independently of any context. A CF-grammar is defined by:
1. A finite terminal vocabulary VT.
2. A finite auxiliary vocabulary VA.
3. An axiom S ∈ VA.
4. A finite number of context-free rules P of the form A → φ, where A ∈ VA and φ ∈ {VA ∪ VT}* (M. Gross and A. Lentin, 1970).

The grammar G to represent all possible structures of a Myanmar syllable can be written as G = (VT, VA, P, S), where the elements of P are:

S → က X      # Such production will be expanded for 33 consonants.
X → ျ A      # Such production will be expanded for 11 medials.
X → ာ B      # Such production will be expanded for 12 vowels.
X → C ် D
X → ε
A → ာ B      # Such production will be expanded for 12 vowels.
A → C ် D
A → ε
B → C ် D
B → D
B → ε
D → ့        # Diacritics
D → း        # Diacritics
D → ε
C → က        # Such production will be expanded for 33 consonants.

The total number of productions/rules to recognize Myanmar syllable structure is 111, and we found that the director symbol sets (also known as first and follow sets) for the same non-terminal symbol with different productions are disjoint.

This is the property of an LL(1) grammar: for each non terminal that appears on the left side of more than one production, the director symbol sets of all the productions in which it appears on the left side are disjoint. Therefore, our proposed grammar can be said to be an LL(1) grammar.

The term LL(1) is made up as follows. The first L means reading from Left to right, the second L means using Leftmost derivations, and the “1” means with one symbol of lookahead. (Robin Hunter, 1999)

3.2 Parse Table for Myanmar CFG

The following table is a part of the parse table made from the productions of the proposed LL(1) grammar.

      က            င            ...   $
S     S → က X      S → င X      ...
X     X → C ် D    X → C ် D    ...
A     A → C ် D    A → C ် D    ...
B     B → C ် D    B → C ် D    ...
D                               ...
C     C → က        C → င        ...

“Table 5. Parse Table for Myanmar Syllable”
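As an illustration of how such a parse table drives a predictive parser, the following Python sketch recognizes a syllable at the level of sub syllabic classes. It is a simplification and not the paper's 111-production grammar: each class is collapsed to one assumed terminal (c, m, v, k, d), the names `TABLE`, `recognize` and `accepted_patterns` are hypothetical, and at end of input the sketch selects B → ε rather than B → D.

```python
from itertools import product

# Class-level terminals: c = consonant, m = medial, v = attached vowel,
# k = asat (vowel killer), d = diacritic. Upper-case names are nonterminals.
# Each entry maps (nonterminal, lookahead) to the right-hand side to apply.
TABLE = {
    ("S", "c"): ["c", "X"],
    ("X", "m"): ["m", "A"],
    ("X", "v"): ["v", "B"],
    ("X", "c"): ["C", "k", "D"],
    ("X", "$"): [],
    ("A", "v"): ["v", "B"],
    ("A", "c"): ["C", "k", "D"],
    ("A", "$"): [],
    ("B", "c"): ["C", "k", "D"],
    ("B", "d"): ["D"],
    ("B", "$"): [],
    ("D", "d"): ["d"],
    ("D", "$"): [],
    ("C", "c"): ["c"],
}


def recognize(classes):
    """Return True if the class string (e.g. "cmvd") is one valid syllable."""
    if any(ch not in "cmvkd" for ch in classes):
        return False
    stack = ["$", "S"]                  # top of the stack is the list's end
    symbols = list(classes) + ["$"]     # input followed by the end marker
    pos = 0
    while stack:
        top = stack.pop()
        look = symbols[pos]
        if top == look:                 # terminal (or end marker) matches
            if top == "$":
                return True
            pos += 1
        elif (top, look) in TABLE:      # expand the nonterminal on the stack
            stack.extend(reversed(TABLE[(top, look)]))
        else:                           # no table entry: syntax error
            return False
    return False


def accepted_patterns(max_len=7):
    """Enumerate every class string up to max_len that the table accepts."""
    return {
        "".join(p)
        for n in range(1, max_len + 1)
        for p in product("cmvkd", repeat=n)
        if recognize(p)
    }
```

Under these assumptions, `accepted_patterns()` returns the lower-case forms of the fourteen top level combinations listed earlier (C, CV, CVCK, CVD, CVCKD, CCK, CCKD, CM, CMV, CMVD, CMVCK, CMVCKD, CMCK, CMCKD) and nothing else, which gives a quick consistency check between the grammar and that table.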
In the above table, the topmost row represents the terminal symbols whereas the leftmost column represents the non terminal symbols. The entries in the table are the productions to apply for each pair of non terminal and terminal.

An example Myanmar syllable having 4 different sub syllabic elements is parsed using the proposed grammar and the above parse table. The parsing steps show the proper working of the proposed grammar, and the detail of parsing the syllable is as follows.

Input Syllable = က(C) + m(M) + v(V) + d(D), where m, v and d stand for the medial, attached vowel and diacritic of the input syllable.

Parse Stack   Remaining Input   Parser Action
S $           က m v d $         S → က X
က X $         က m v d $         MATCH
X $           m v d $           X → m A
m A $         m v d $           MATCH
A $           v d $             A → v B
v B $         v d $             MATCH
B $           d $               B → D
D $           d $               D → d
d $           d $               MATCH
$             $                 SUCCESS

“Table 6. Parsing a Myanmar Syllable using predictive top-down parsing method”

4 Conclusion

This study shows the power of Chomsky's context free grammar, as it can be applied to describe not only sentence structure but also the syllable structure of an Asian script, Myanmar. Though the number of productions in the proposed grammar for the Myanmar syllable is large, the syntactic structure of a Myanmar syllable is correctly recognized and the grammar is not ambiguous.

Further, in parsing a Myanmar syllable, it is necessary to do preprocessing called contraction for input sequences of vowels and consonant conjuncts or medials to meet the requirements of the traditional writing system. However, because of these contracted forms, a single lookahead symbol in our proposed LL(1) grammar does not refer exactly to one character, and it may be a combination of two or more characters in parsing a Myanmar syllable.

5 Discussion and Future Work

Myanmar script is a syllabic as well as an agglutinative script. Every Myanmar word or sentence is composed of a series of individual syllables. Thus, it is critical to have an efficient way of recognizing syllables in conformity with the rules of the Myanmar traditional writing system.

Our intended research is the automatic syllabification of Myanmar polysyllabic words using formal language theory.

One option is to modify our current CFG to recognize consecutive syllables as a first step. We found that if the current CFG is changed to accept a sequence of syllables, the grammar is no longer LL(1). We then need to use one of the statistical methods, for example a probabilistic CFG, to choose the correct productions or the best parse for finding syllable boundaries.

Again, it is necessary to calculate the probability values for each production based on the frequency of occurrence of a syllable in the dictionary we referred to, or using a treebank. We need a Myanmar corpus or treebank which contains evidence for rule expansions for syllable structure, and such a resource does not yet exist for Myanmar. The time and cost of constructing a corpus by ourselves must also be taken into consideration.

Another approach is to construct a finite state transducer for automatic syllabification of Myanmar words. If we choose this approach, we first need to construct a regular grammar to recognize Myanmar syllables. We already have the Myanmar syllable structure in a regular grammar. However, for finite state syllabification using weights, there is a lack of resources for a training database.

We still have many language specific issues to be addressed for implementing Myanmar script using a CFG or FSA. As a first issue, our current grammar is based on five basic sub-syllabic elements, and thus developing a grammar which can handle all seven Myanmar sub syllabic elements will be future study.

Our current grammar is based on the code point values of the input syllables or words. As a second issue, we therefore need to consider different presentations or code point values of the same character. Moreover, we have special writing traditions for some characters, for example, such
as consonant stacking, e.g. ဗုဒ္ဓ (Buddha) and မန္တလေး (Mandalay, second capital of Myanmar), consonant repetition, e.g. တက္ကသိုလ် (University), kinzi, e.g. အင်္ဂတေ (Cement), and loan words, e.g. ဘတ်(စ်) (bus). To represent such complex forms in a computer system, we use the invisible Virama sign (U+1039). Therefore, it is necessary to construct productions which conform to the stored character code sequence of the Myanmar language.
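To make the point about stored code sequences concrete, the following Python sketch lists the code points of a stacked form and checks for the invisible Virama. It is only an illustration: the helper names `describe` and `has_stacking` are hypothetical, and the sample string is an assumed encoding of the word for Buddha as U+1017 U+102F U+1012 U+1039 U+1013.

```python
import unicodedata

VIRAMA = "\u1039"  # invisible sign that stacks the following consonant


def describe(text):
    """Return one 'U+XXXX NAME' string per stored code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def has_stacking(text):
    """Return True if the stored sequence contains the invisible Virama."""
    return VIRAMA in text


# Assumed stored sequence for a stacked form: consonant, vowel, consonant,
# Virama, consonant. The Virama is stored even though it is not displayed.
buddha = "\u1017\u102F\u1012\u1039\u1013"
for line in describe(buddha):
    print(line)
```

A grammar that works on code points would therefore see a consonant, a vowel, a consonant, the Virama and another consonant for this input, a sequence that the five sub syllabic classes used so far do not cover.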
References