Manually Constructed Context-Free Grammar For Myanmar Syllable Structure


Tin Htay Hlaing


Nagaoka University of Technology
Nagaoka, JAPAN
[email protected]

Proceedings of the EACL 2012 Student Research Workshop, pages 32–37, Avignon, France, 26 April 2012. © 2012 Association for Computational Linguistics

Abstract

The Myanmar language and script are unique and complex. To our knowledge, little work has yet been done on describing Myanmar script using formal language theory. This paper presents a manually constructed context-free grammar (CFG) with 111 productions that describes Myanmar syllable structure. We make the CFG conform to the properties of an LL(1) grammar so that a conventional technique, predictive top-down parsing, can be applied to identify Myanmar syllables. We present Myanmar syllable structure according to orthographic rules and discuss a preprocessing step, called contraction, for vowels and consonant conjuncts. Because of these contracted forms, the "1" in our LL(1) grammar does not mean exactly one character of lookahead. We use five basic sub-syllabic elements to construct the CFG and find that all possible syllable combinations in Myanmar Orthography can be parsed correctly using the proposed grammar.

1 Introduction

Formal language theory is a common way to represent the grammatical structures of natural languages and programming languages. The grammar hierarchy originates in the pioneering work of Noam Chomsky (Noam Chomsky, 1957). A large body of work in natural language processing uses Chomsky's grammars to describe the grammatical rules of natural languages. However, no formal rules have yet been established for a grammar of Myanmar script. The long-term goal of this study is to develop automatic syllabification of Myanmar polysyllabic words using regular grammars and/or finite state methods, so that the syllabified strings can be used for Myanmar sorting.

In this paper, as a preliminary stage, we describe the structure of a Myanmar syllable in a context-free grammar and parse syllables using the predictive top-down parsing technique to determine whether a given syllable is recognized by the proposed grammar. Further, the constructed grammar includes linguistic information and follows the traditional writing system of Myanmar script.

2 Myanmar Script

Myanmar is a syllabic script and has a complex orthographic structure. Myanmar words are formed from sequences of syllables, and each syllable may contain up to seven different sub-syllabic elements. Each component group has its own members, which occur in a specific order.

Myanmar script has 33 consonants, 8 vowels (free standing and attached)¹, 2 diacritics, 11 medials, a vowel killer or ASAT, 10 digits and 2 punctuation marks.

¹ Free-standing vowel syllables (e.g. ဣ) and attached vowel symbols.

The structure of a Myanmar syllable, with its 7 different components, is as follows in Backus Normal Form (BNF):

S := C{M}{V}[CK][D] | I[CK] | N

where S = Syllable and
1. C = Consonant
2. M = Medial, or consonant conjunct, or attached consonant
3. V = Attached vowel
4. K = Vowel killer or ASAT
5. D = Diacritic
6. I = Free-standing vowel
7. N = Digit

The notation [ ] means zero or one occurrence, and { } means zero or more occurrences.

In this paper, however, we ignore digits, free-standing vowels and punctuation marks when writing the grammar for a Myanmar syllable, and focus only on the five major sub-syllabic groups: consonants (C), medials (M), attached vowels (V), the vowel killer (K) and diacritics (D). The following subsection gives the details of each sub-syllabic group.

2.1 Brief Description of Basic Myanmar Sub Syllabic Elements

Each Myanmar consonant has a default vowel sound and works as a syllable by itself. The set of consonants in the Unicode chart is C = {က, ခ, ဂ, ဃ, င, စ, ဆ, ဇ, ဈ, ဉ, ည, ဋ, ဌ, ဍ, ဎ, ဏ, တ, ထ, ဒ, ဓ, န, ပ, ဖ, ဗ, ဘ, မ, ယ, ရ, လ, ဝ, သ, ဟ, ဠ}, which has 33 elements. In addition, the letter အ can act as a consonant as well as a free-standing vowel.

Medials, or consonant conjuncts, are modifiers of the syllable's vowel and are encoded separately in Unicode. There are four basic medials in the Unicode chart, represented as the set M = {ျ, ြ, ွ, ှ}.

The set V of Myanmar attached vowel characters in Unicode contains 8 elements {ါ, ာ, ိ, ီ, ု, ူ, ေ, ဲ} (Peter T. Daniels and William Bright, 1996).

Diacritics alter the vowel sounds of the accompanying consonants and are used to indicate tone level. There are 2 diacritical marks {့, း} in Myanmar script, and this set is represented as D.

The asat, or killer, which forms the set K = {်}, is a visibly displayed sign. In some cases it indicates that the inherent vowel sound of a consonant letter is suppressed. In other cases it combines with other characters to form a vowel letter. Regardless of its function, this visible sign is always represented by the character U+103A.² [John Okell, 1994]

In the Unicode chart, the diacritics group D and the vowel killer or ASAT (K) are included in the group named "various signs".

² http://www.unicode.org/versions/Unicode6.0.0/ch11.pdf

2.2 Preprocessing of Texts - Contraction

In writing a formal grammar for a Myanmar syllable, there are cases where two or more Myanmar characters combine with each other, and the resulting combined forms are used in the traditional Myanmar writing system although they are not coded directly in the Myanmar Unicode chart. Such combinations of vowels and of medials are described in detail below.

Two or more Myanmar attached vowels combine to form three new members {ော, ော်, ို} of the vowel set.

Glyph | Unicode for Contraction | Description
ေ + ာ | 1031 + 102C | Vowel sign E + AA
ေ + ာ + ် | 1031 + 102C + 103A | Vowel sign E + AA + ASAT
ိ + ု | 102D + 102F | Vowel sign I + U

“Table 1. Contractions of vowels”

Similarly, the 4 basic Myanmar medials combine with each other in several ways and produce a new set of medials {ျွ, ြွ, ျှ, ြှ, ွှ, ျွှ, ြွှ}. [Tin Htay Hlaing and Yoshiki Mikami, 2011]

Glyph | Unicode for Contraction | Description
ျ + ွ | 103B + 103D | Consonant Sign Medial YA + WA
ြ + ွ | 103C + 103D | Consonant Sign Medial RA + WA
ျ + ှ | 103B + 103E | Consonant Sign Medial YA + HA
ြ + ှ | 103C + 103E | Consonant Sign Medial RA + HA
ွ + ှ | 103D + 103E | Consonant Sign Medial WA + HA
ျ + ွ + ှ | 103B + 103D + 103E | Consonant Sign Medial YA + WA + HA
ြ + ွ + ှ | 103C + 103D + 103E | Consonant Sign Medial RA + WA + HA

“Table 2. Contractions of Medials”

Each of the combinations above is treated as a single vowel or medial when constructing the grammar. The complete sets of vowels and medials used in writing the grammar are shown in the table below.³

Name of Sub Syllabic Component | Elements
Medials or Conjunct Consonants | 11 elements: the four basic medials and the seven contracted medials of Table 2
Attached vowels | 12 elements, beginning with အ and including the contracted vowels of Table 1

“Table 3. List of vowels and Medials”

³ Sorting order of medials and attached vowels in Myanmar Orthography.

2.3 Combinations of Syllabic Components within a Syllable

As mentioned in the earlier sections, we choose only 5 basic sub-syllabic components, namely consonants (C), medials (M), attached vowels (V), the vowel killer (K) and diacritics (D), to describe a Myanmar syllable. Because our intended use of syllabification is sorting, we omit stand-alone vowels and digits when describing Myanmar syllable structure. Moreover, according to the sorting order of Myanmar Orthography, a stand-alone vowel is sorted as the syllable with the same pronunciation written with the 5 sub-syllabic elements above. For example, the stand-alone vowel “ဣ” is sorted as the combination of the consonant “အ” and the attached vowel “ိ”, that is, as “အိ”.

In Myanmar, a single consonant can be taken as one syllable, because Myanmar script is an abugida, in which all letters have an inherent vowel. Consonants can be followed by vowels, consonants, the vowel killer and medials in different combinations. One special feature is that if a syllable contains two consonants, the second consonant must be followed by the vowel killer (K). We found 1872 combinations of sub-syllabic elements in Myanmar Orthography [Myanmar Language Commission, 2006]. The table below shows the top-level combinations of these sub-syllabic elements.

Consonant only | Consonant followed by Vowel | Consonant followed by Consonant | Consonant followed by Medial
C | CV | CCK | CM
  | CVCK | CCKD | CMV
  | CVD |  | CMVD
  | CVCKD |  | CMVCK
  |  |  | CMVCKD
  |  |  | CMCK
  |  |  | CMCKD

“Table 4. Possible Combinations within a Syllable”

The combinations among the five basic sub-syllabic components can also be described using a finite state automaton (FSA). We also find that Myanmar orthographic syllable structure can be described by a regular grammar.

[Figure 1: a finite state automaton with states 1 to 7 and transitions labelled C, M, V, K and D]

“Figure 1. FSA for a Myanmar Syllable”
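As an illustration of the claim that the syllable structure is regular, the five-class part of the BNF rule in Section 2 can be written as a regular expression over the class symbols C, M, V, K and D. This is only a minimal sketch: the pattern is derived from the BNF rule rather than from the transitions of Figure 1, and the names `SYLLABLE`, `TABLE_4` and `is_syllable` are hypothetical.

```python
import re

# Five-class part of the BNF rule S := C{M}{V}[CK][D], written over the
# class symbols themselves (not over Myanmar characters).
SYLLABLE = re.compile(r"CM*V*(CK)?D?")

# Top-level combinations listed in Table 4.
TABLE_4 = [
    "C", "CV", "CVCK", "CVD", "CVCKD",
    "CCK", "CCKD",
    "CM", "CMV", "CMVD", "CMVCK", "CMVCKD", "CMCK", "CMCKD",
]


def is_syllable(classes: str) -> bool:
    """Return True if the whole class string fits the BNF-derived pattern."""
    return SYLLABLE.fullmatch(classes) is not None


if __name__ == "__main__":
    for pattern in TABLE_4:
        print(pattern, is_syllable(pattern))
```

Every combination in Table 4 matches this pattern. The pattern is slightly more permissive than the table, since the BNF rule also admits strings such as CD that the table does not list.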
In the FSA above, an interesting point is that a single consonant can be a syllable, because Myanmar consonants have default vowel sounds. That is why state 2 can be a final state. For instance, the Myanmar word “မိန်းမ” (meaning “woman” in English) has two syllables. In the first syllable “မိန်း”, the sub-syllabic elements are Consonant (မ) + Vowel (ိ) + Consonant (န) + Vowel Killer (်) + Diacritic (း). The second syllable consists of a single consonant, “မ”.

3 Myanmar Syllable Structure in Context-Free Grammar

3.1 Manually Constructed Context-Free Grammar for Myanmar Syllable Structure

A context-free (CF) grammar is a grammar whose rules are formulated independently of any context. A CF grammar is defined by:
1. A finite terminal vocabulary VT.
2. A finite auxiliary vocabulary VA.
3. An axiom S ∈ VA.
4. A finite number of context-free rules P of the form A → α, where A ∈ VA and α ∈ {VA ∪ VT}*.
(M. Gross and A. Lentin, 1970)

The grammar G that represents all possible structures of a Myanmar syllable can be written as G = (VT, VA, P, S), where the elements of P are listed below. Here ⟨m⟩ stands for a medial terminal and ⟨v⟩ for an attached-vowel terminal.

S → က X      # Such productions are expanded for all 33 consonants.
X → ⟨m⟩ A    # Such productions are expanded for all 11 medials.
X → ⟨v⟩ B    # Such productions are expanded for all 12 vowels.
X → C ် D
X → ε
A → ⟨v⟩ B    # Such productions are expanded for all 12 vowels.
A → C ် D
A → ε
B → C ် D
B → D
B → ε
D → ့        # Diacritics
D → း        # Diacritics
D → ε
C → က        # Such productions are expanded for all 33 consonants.

The total number of productions (rules) needed to recognize Myanmar syllable structure is 111, and we found that the director symbol sets (also known as first and follow sets) of different productions for the same non-terminal symbol are disjoint.

This is the property of an LL(1) grammar: for each non-terminal that appears on the left side of more than one production, the director symbol sets of all the productions in which it appears on the left side are disjoint. Therefore, our proposed grammar can be called an LL(1) grammar.

The term LL(1) is made up as follows. The first L means reading from left to right, the second L means using leftmost derivations, and the “1” means with one symbol of lookahead. (Robin Hunter, 1999)

3.2 Parse Table for Myanmar CFG

The following table is part of the parse table built from the productions of the proposed LL(1) grammar.

[Parse table excerpt: the columns are terminal symbols, including က, င and the end marker $; the rows are the non-terminals S, X, A, B, D and C. The recoverable entries include S → က X and C → က under က, and S → င X and C → င under င.]

“Table 5. Parse Table for Myanmar Syllable”
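The grammar of Section 3.1, together with the contraction step of Section 2.2, can be sketched as a small table-driven predictive parser. This is an illustration under stated assumptions, not the paper's implementation. The names `tokenize`, `parse` and `TABLE` are hypothetical. The 12-element vowel set is assumed to be the 8 Unicode vowel signs, the sign U+1036 and the 3 contracted vowels of Table 1. The diacritics are assumed to be U+1037 and U+1038. For B, the sketch applies B → ε at the end of input, which is a choice made for this example.

```python
# Terminal sets (the vowel and diacritic sets are assumptions of this sketch).
CONS = [chr(c) for c in range(0x1000, 0x1021)]           # 33 consonants
MEDIALS = ["\u103B", "\u103C", "\u103D", "\u103E",       # 4 basic medials
           "\u103B\u103D", "\u103C\u103D", "\u103B\u103E", "\u103C\u103E",
           "\u103D\u103E", "\u103B\u103D\u103E", "\u103C\u103D\u103E"]
VOWELS = ([chr(c) for c in range(0x102B, 0x1033)] + ["\u1036"]
          + ["\u1031\u102C", "\u1031\u102C\u103A", "\u102D\u102F"])
ASAT = "\u103A"
DIACRITICS = ["\u1037", "\u1038"]

# Contracted forms, longest first, so that one lookahead token may span
# two or three code points.
CONTRACTIONS = sorted((t for t in MEDIALS + VOWELS if len(t) > 1),
                      key=len, reverse=True)


def tokenize(text):
    """Split text into tokens, merging contracted vowels and medials."""
    tokens, i = [], 0
    while i < len(text):
        for form in CONTRACTIONS:
            if text.startswith(form, i):
                tokens.append(form)
                i += len(form)
                break
        else:  # no contraction starts here: emit a single code point
            tokens.append(text[i])
            i += 1
    return tokens


CKD = ["C", ASAT, "D"]  # right-hand side shared by X, A and B

# Parse table: TABLE[non-terminal][lookahead token] -> right-hand side.
TABLE = {
    "S": {c: [c, "X"] for c in CONS},
    "X": {**{m: [m, "A"] for m in MEDIALS}, **{v: [v, "B"] for v in VOWELS},
          **{c: CKD for c in CONS}, "$": []},
    "A": {**{v: [v, "B"] for v in VOWELS}, **{c: CKD for c in CONS}, "$": []},
    "B": {**{c: CKD for c in CONS}, **{d: ["D"] for d in DIACRITICS}, "$": []},
    "D": {**{d: [d] for d in DIACRITICS}, "$": []},
    "C": {c: [c] for c in CONS},
}


def parse(syllable):
    """Return True if the syllable is recognized by the sketched grammar."""
    tokens = tokenize(syllable) + ["$"]
    stack, pos = ["$", "S"], 0
    while stack:
        top, look = stack.pop(), tokens[pos]
        if top in TABLE:                      # non-terminal: expand it
            rhs = TABLE[top].get(look)
            if rhs is None:
                return False
            stack.extend(reversed(rhs))
        elif top == look:                     # terminal or end marker: match
            pos += 1
        else:
            return False
    return pos == len(tokens)


if __name__ == "__main__":
    print(parse("\u1019\u102D\u1014\u103A\u1038"))  # C V C K D
    print(parse("\u1000\u1000"))                    # C C without asat
```

With these assumed sets, the production count is 33 + 25 + 14 + 3 + 3 + 33 = 111, which agrees with the total stated in Section 3.1. Because contracted forms are merged by `tokenize`, the single lookahead token may cover more than one code point.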
In the table above, the topmost row lists the terminal symbols, whereas the leftmost column lists the non-terminal symbols. The entries in the table are the productions to apply for each pair of non-terminal and terminal.

An example Myanmar syllable with 4 different sub-syllabic elements is parsed using the proposed grammar and the parse table above. The parsing steps show that the proposed grammar works properly; the details of parsing the syllable are as follows. Here ⟨m⟩, ⟨v⟩ and ⟨d⟩ stand for the medial, the attached vowel and the diacritic of the input syllable.

Input Syllable = က (C) + ⟨m⟩ (M) + ⟨v⟩ (V) + ⟨d⟩ (D)

Parse Stack | Remaining Input | Parser Action
S $ | က ⟨m⟩ ⟨v⟩ ⟨d⟩ $ | S → က X
က X $ | က ⟨m⟩ ⟨v⟩ ⟨d⟩ $ | MATCH
X $ | ⟨m⟩ ⟨v⟩ ⟨d⟩ $ | X → ⟨m⟩ A
⟨m⟩ A $ | ⟨m⟩ ⟨v⟩ ⟨d⟩ $ | MATCH
A $ | ⟨v⟩ ⟨d⟩ $ | A → ⟨v⟩ B
⟨v⟩ B $ | ⟨v⟩ ⟨d⟩ $ | MATCH
B $ | ⟨d⟩ $ | B → D
D $ | ⟨d⟩ $ | D → ⟨d⟩
⟨d⟩ $ | ⟨d⟩ $ | MATCH
$ | $ | SUCCESS

“Table 6. Parsing a Myanmar Syllable using predictive top-down parsing method”

4 Conclusion

This study shows the power of Chomsky's context-free grammar: it can be applied to describe not only sentence structure but also the syllable structure of an Asian script, Myanmar. Although the number of productions in the proposed grammar for the Myanmar syllable is large, the syntactic structure of a Myanmar syllable is correctly recognized and the grammar is not ambiguous.

Further, in parsing a Myanmar syllable, it is necessary to apply the preprocessing step called contraction to input sequences of vowels and consonant conjuncts (medials) in order to meet the requirements of the traditional writing system. Because of these contracted forms, however, a single lookahead symbol in our proposed LL(1) grammar does not refer to exactly one character; it may be a combination of two or more characters.

5 Discussion and Future Work

Myanmar script is syllabic as well as agglutinative. Every Myanmar word or sentence is composed of a series of individual syllables. Thus, it is critical to have an efficient way of recognizing syllables in conformity with the rules of the traditional Myanmar writing system. Our intended research is the automatic syllabification of Myanmar polysyllabic words using formal language theory.

One option is to modify our current CFG to recognize consecutive syllables as a first step. We found that if the current CFG is changed to accept a sequence of syllables, the grammar is no longer LL(1). We would then need one of the statistical methods, for example a probabilistic CFG, to choose the correct productions or the best parse for finding syllable boundaries. It would also be necessary to calculate a probability value for each production, based on the frequency of occurrence of a syllable in the dictionary we referred to or in a treebank. We need a Myanmar corpus or treebank that contains evidence for rule expansions for syllable structure, and such a resource does not yet exist for Myanmar. The time and cost of constructing a corpus ourselves also have to be considered.

Another approach is to construct a finite state transducer for automatic syllabification of Myanmar words. If we choose this approach, we first need to construct a regular grammar that recognizes Myanmar syllables. We already have the Myanmar syllable structure in regular grammar. However, for finite state syllabification using weights, there is a lack of resources for a training database.

We still have many language-specific issues to address in implementing Myanmar script using a CFG or an FSA. As a first issue, our current grammar is based on five basic sub-syllabic elements, so developing a grammar that can handle all seven Myanmar sub-syllabic elements is left for future study. Our current grammar is based on the code point values of the input syllables or words. As a second issue, therefore, we need to consider different representations, or code point values, of the same character. Moreover, we have special writing traditions for some characters, such as consonant stacking, e.g. ဗုဒ္ဓ (Buddha) and မန္တလေး (Mandalay, the second capital of Myanmar); consonant repetition, e.g. တက္ကသိုလ် (University); kinzi, e.g. အင်္ဂ (Cement); and loan words, e.g. ဘတ်(စ်) (bus). To represent such complex forms in a computer system, we use the invisible Virama sign (U+1039). Therefore, it is necessary to construct productions that conform to the stored character code sequence of the Myanmar language.
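To make the point about stored code sequences concrete, the short sketch below lists the code points of the stacking example ဗုဒ္ဓ and locates the invisible Virama sign between the two stacked consonants. The function name `stacked_pairs` is hypothetical, and the sketch only inspects the stored sequence; it does not extend the grammar.

```python
import unicodedata

VIRAMA = "\u1039"  # invisible sign that marks consonant stacking


def stacked_pairs(text):
    """Return (upper, lower) consonant pairs joined by the Virama sign."""
    return [(text[i - 1], text[i + 1])
            for i, ch in enumerate(text)
            if ch == VIRAMA and 0 < i < len(text) - 1]


if __name__ == "__main__":
    word = "\u1017\u102F\u1012\u1039\u1013"  # stored sequence of the Buddha example
    for ch in word:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(stacked_pairs(word))
```

A grammar that works on stored code sequences would have to treat U+1039 as an additional terminal, since it is not part of the five sub-syllabic groups used here.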

References

John Okell. “Burmese: An Introduction to the Script”. Northern Illinois University Press, 1994.

M. Gross, A. Lentin. “Introduction to Formal Grammars”. Springer-Verlag, 1970.

Myanmar Language Commission. “Myanmar Orthography”, Third Edition. University Press, Yangon, Myanmar, 2006.

Noam Chomsky. “Syntactic Structures”. Mouton De Gruyter, Berlin, 1957.

Peter T. Daniels, William Bright. “The World's Writing Systems”. Oxford University Press, 1996.

Robin Hunter. “The Essence of Compilers”. Prentice Hall, 1999.

Tin Htay Hlaing, Yoshiki Mikami. “Collation Weight Design for Myanmar Unicode Texts”, in Proceedings of Human Language Technology for Development, organized by PAN Localization (Asia), AnLoc (Africa) and IDRC (Canada), Alexandria, Egypt, May 2011, pages 1–6.