B. Hettige
(08/8021)
University of Moratuwa
Sri Lanka
December 2010
published or written by another person except where the
..
Budditha Hettige
Date
Candidate
The above candidate has carried out research for the M. Phil. dissertation under my
supervision.
..
..
Date
..
Dr.
Abstract
Communication is fundamental to the evolution and development of all kinds of living
beings. Languages should undisputedly be recognized as the most remarkable artifacts ever
developed by mankind to enable communication. The computer has also become a unique
machine because of its capacity to communicate with humans through languages. The
languages understood by computers and by humans are quite different, yet
people can communicate with computers. This is possible because the computer is
fundamentally an artifact that translates one language into another. Therefore, computers
should be able to perform language translation better than any other computing task.
Nowadays, computing is evolving to enable machine-to-machine communication with little
or no human intervention, yet humans continue to face what is called the language barrier
in communication. In particular, a vast collection of world knowledge written in English has
been inaccessible to communities who cannot communicate in English. Such communities
are also unable to contribute to the development of world knowledge because of the
language barrier. As a result, many researchers have embarked on research in computer-aided
natural language translation, an area commonly known as Machine Translation. Apertium,
Babel Fish, Google Translate, SYSTRAN, EDR, Anusaaraka, AnglaHindi, AnglaBharti
and Mantra are some examples of popular machine translation systems. These systems use
various approaches, including Human-assisted, Rule-based, Corpus-based, Knowledge-based,
Hybrid and Agent-based, to translate from one language to another. However, because of the
inherent diversity of natural languages, a generic machine translation approach is far
from reality.
This thesis presents a computational grammar for the Sinhala language, used to develop an
English to Sinhala machine translation system with an underlying theoretical basis. The
system is known as BEES, an acronym for Bilingual Expert for English to Sinhala machine
translation. The concept of Varanegeema (conjugation) in the Sinhala language is the
philosophical basis of the approach taken in developing BEES. Varanegeema
can handle a large number of language primitives associated with
nouns and verbs, such as person,
gender, tense, number, preposition and subjectivity/objectivity. More importantly,
Varanegeema allows all associated word forms to be derived from a given base word, which
drastically reduces the size of the Sinhala dictionary. Since
Varanegeema can be expressed as a set of rules, it fits well with a rule-based
implementation of machine translation. BEES implements 85 grammar rules for
Sinhala nouns and 18 rules for Sinhala verbs. BEES comprises seven modules,
namely the English Morphological Analyzer, English Parser, English to Sinhala Base Word
Translator, Sinhala Morphological Generator, Sinhala Parser, Transliteration Module and
Intermediate Editor. In addition to these modules, the system comprises four dictionaries,
namely the English dictionary, the Sinhala dictionary, the English-Sinhala bilingual
dictionary and the concept dictionary. BEES primarily shares features with the Rule-based,
Context-based and Human-assisted approaches to machine translation. BEES has been
implemented in Java and SWI-Prolog and runs on both Linux and Windows.
BEES has been evaluated to test the
hypothesis that the concepts of Varanegeema can be used to drive English to Sinhala machine
translation. The evaluation was carried out in
three steps. As the first step, all the language processing components, such as the
morphological analyzers, parsers, translator and transliteration module, were tested
through a white box testing approach. To test each module, several online testing tools,
including an English morphological analyzer, an English parser and a Sinhala word
generator, were implemented. Using these online tools, each module was thoroughly tested
against a carefully created test plan. In addition, an online evaluation test bed was
implemented to continuously capture feedback from online users. This test
bed lets users construct different types of sentences from a given set of words. The Word
Error Rate and the Sentence Error Rate were calculated from these evaluation results.
Finally, intelligibility and accuracy tests were conducted with human
support.
To evaluate the intelligibility and accuracy of the English to Sinhala machine
translation system, the following steps were followed. Two hundred sample sentences were
collected and grouped into 20 sets of 10 sentences each. Each sentence was then
translated using the system, and each set was given to
human translators for scoring. Intelligibility and accuracy were calculated
from these scores. The experimental results show that the English
morphological analyzer, English parser, English to Sinhala base word translator, Sinhala
morphological generator and Sinhala sentence generator work with more
than 90% accuracy. The overall evaluation shows 89% accuracy, with a word error
rate of 7.2% and a sentence error rate of 5.4%.
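As an illustration only, the two error rates can be computed as sketched below, assuming
the usual definitions: the word error rate is the word-level edit distance between the
system output and a reference translation divided by the number of reference words, and
the sentence error rate is the fraction of sentences that differ from their reference.
The function names and the sample sentences are hypothetical and are not part of BEES.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word-level edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def sentence_error_rate(references, hypotheses):
    """Fraction of sentences that are not identical to their reference."""
    wrong = sum(1 for r, h in zip(references, hypotheses)
                if r.split() != h.split())
    return wrong / len(references)


if __name__ == "__main__":
    refs = ["the boy reads a book", "she sings"]
    hyps = ["the boy read a book", "she sings"]
    print(word_error_rate(refs, hyps), sentence_error_rate(refs, hyps))
```

With this definition a single wrong word makes the whole sentence count as an error, which
is why the two rates are normally reported together.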
BEES successfully translates English sentences with simple or complex subjects and
objects. The system handles the most commonly used tense patterns,
including active and passive voice forms.
Acknowledgements
This thesis is the result of four years of devoted work, during which I have been
accompanied and supported by many people. I am pleased to now have
the opportunity to express my gratitude to all of them.
I am grateful to the University of Moratuwa, especially to the Faculty of
Information Technology, for providing me with the opportunity to carry out this research.
The first person I would like to thank is my supervisor, Prof. Asoka Karunananda,
for whom a few lines are too short to give a complete account of my deep
appreciation. This study would not have been such a success without his
commonsense knowledge and perceptiveness. I owe him much gratitude for
showing me this way of research. Besides being an excellent supervisor,
Prof. Karunananda has been an understanding teacher, and he has supported me
in every aspect of this research.
I am also grateful to Dr. Sarath Bannayake, Head, Department of Statistics
and Computer Science, University of Sri Jayawardenepura, for the assistance he
gave me during the research work.
With great pleasure and a deep sense of gratitude, I acknowledge Mr. P. Dias,
former Head and Senior Lecturer, Department of Statistics and Computer Science,
University of Sri Jayawardenepura, for his great help in devising a method
for evaluation.
I would also like to thank Mr. Niranjan Bandara, Lecturer, Department of Sinhala
and Mass Communication, University of Sri Jayawardenepura, for his valuable
support in resolving some Sinhala language issues.
I express my deep gratitude to Venerable
Kirioruwe Dhamananada thera, Venerable Kukulpane Sudassi thera and Venerable
Mattumagala Chandanada thera for their valuable support in solving
Sinhala and English language problems by sharing their knowledge of Sinhala, Pali
and Sanskrit language structures.
January 3, 2011
Budditha Hettige
Table of Contents
Declaration of the Candidate and the Supervisor
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Preamble
1.7 Hypothesis
1.9 Summary
2.1 Introduction
2.8 Summary
3.1 Introduction
3.12.3 Conjugation
3.13 Summary
4.1 Introduction
4.4 Hypothesis
4.10 Summary
5.4 Summary
Chapter 6 Implementation
6.1 Introduction
6.5 Summary
7.6 Summary
Chapter 8 Evaluation
8.1 Introduction
8.8 Summary
9.1 Introduction
9.3 Limitations
9.5 Summary
References
List of Figures
Figure 2.1: Architecture for a rule-based machine translation system
List of Tables
Table 2.1: Existing Machine translation systems
Chapter 1
INTRODUCTION
1.1 Preamble
A natural language is one of the most marvelous artifacts ever invented by mankind. It is
the cornerstone of all kinds of communication. Each natural language serves to
describe the thoughts of humans in a particular environment. As such, a natural
language has a strong bearing on the culture and the environment within which a
community lives. This is why we find a large number of different
natural languages worldwide. Despite the differences in languages, people still want
to communicate with those who use different languages. Differences in languages
have thus become a barrier to cross-cultural communication. In particular, many nations
have not been able to access the huge reservoir of world knowledge written in English
unless they have a sound knowledge of English. On the other hand, people
who do not know English are not able to contribute to world knowledge. The
importance of the mother tongue for the discovery and creation of new
systems of knowledge is indisputable. Consequently, this has resulted in what is called
the language barrier in communication. In fact, this issue exists not only between English
and other languages, but between any two languages.
Of course, people have long practiced a solution to this issue:
translation by someone who knows both languages. However, can we
really expect everyone to know every language? Undoubtedly, this is impractical.
The emergence of digital computer technology in the early 1950s gave rise to the
concept of machine translation, which seeks assistance from computers in meeting
the long-felt language needs of humans. Since then, hundreds of research projects have
been conducted on translation between natural languages. Machine translation is
a branch of Natural Language Processing, which comes under the broad area of
Artificial Intelligence. It is commonly said that machine translation has been one of
the least successful areas of Artificial Intelligence over the last sixty years. As such, a
generic approach to machine translation has remained an unrealized dream of researchers,
and machine translation approaches have become highly language specific.
[154]. According to its design, a machine translation system can be broadly
categorized as either a direct translation system or an indirect
translation system. A direct translation system translates the source language into the
target language by using word-to-word or phrase-to-phrase mapping. In contrast,
indirect translation systems use an Interlingua or some kind of transfer method. This
approach starts with an analysis of the source text and performs a synthesis to generate
the corresponding text in the target language. Figure 1.1 gives the classic pyramid showing
the relationship between these two approaches to machine translation.
Under the above two broad categories, several approaches have been used to develop
hundreds of machine translation systems all over the world. Among these,
the Human-assisted, Rule-based, Statistical, Example-based, Knowledge-based,
Hybrid and Agent-based approaches are commonly cited as the most successful
approaches to machine translation.
A comparison of the existing machine translation systems and their approaches shows that
many of these systems use a sequential architecture for Natural Language Processing and
machine translation [59]. This sequence comprises steps such as preprocessing,
Objective 4:
Evaluate the system
1.7 Hypothesis
In order to achieve the above aim and objectives, the hypothesis employed in this
thesis is that the concepts of Varanegeema (conjugation) in the Sinhala
language can be used to drive English to Sinhala machine translation.
Chapter 4 discusses the novel approach taken to develop the English to Sinhala machine
translation system. It first presents the hypothesis of the project. The
chapter then explains the mechanism of the translation process, the nature of the input
and output, and the key features of the system.
Chapter 5 describes the design of the proposed English to Sinhala machine translation
system. Each module of the design is explained separately by
describing its functionality and its relation to the other modules.
Chapter 6 presents the implementation of the English to Sinhala machine translation
system. It gives implementation details of the Prolog-based modules, the Java-based
user interface, the Intermediate Editor and the ontology of the lexical databases.
Chapter 7 shows how BEES works in practice when translating a given English
text. It also explains applications of BEES as a standalone translator, an
on-demand translator, a web page translator and a selected-text translator.
Chapter 8 reports the evaluation of the English to Sinhala machine translation system. The
evaluation methodology, evaluation steps, participants and results
are given in this chapter.
Chapter 9 concludes the thesis by referring to the achievement of each objective. The
chapter also presents the limitations of the research and further work.
1.9 Summary
This chapter provided an overview of the entire project by describing the problem
to be addressed, the aim, the objectives and the hypothesis employed in the thesis. It
briefly explained the proposed English to Sinhala machine translation system. The structure
of the rest of the thesis has also been presented.
The next chapter gives a critical review of the existing approaches to machine
translation, together with major machine translation systems based on these
approaches.
Chapter 2
STATE OF THE ART OF MACHINE TRANSLATIONS
2.1 Introduction
The previous chapter presented an overview of the thesis. This chapter gives the state
of the art of Natural Language Processing, with special attention to Machine
Translation. Some related fundamental aspects of Machine Translation are
also discussed in this chapter.
intelligence. After that, in 1957, Noam Chomsky, regarded in the academic and scientific
community as one of the fathers of modern linguistics, introduced Syntactic
Structures for grammar [31]. It is recognized as one of the most important texts in the
field of linguistics. It subsequently became a fundamental theory for Natural Language
Processing, and many machine translation systems use this syntactic
structure [31][33].
Natural Language Processing (NLP) comes under the broad field of Artificial
Intelligence. NLP is used for several tasks, including machine translation,
automatic summarization, information retrieval, optical character recognition, speech
recognition and text-to-speech [107][128][147].
Depending on the task, Natural Language Processing systems must address several issues,
such as natural language understanding, natural language generation, speech and
text segmentation, part-of-speech tagging and word sense disambiguation [84]
In the Indian region, a number of machine translation systems have used this
approach, including Anusaaraka, ManTra, MaTra and AnglaBharti [133][38][146].
Anusaaraka [4][7] is a popular human-assisted translation system for Indian
languages that makes text in one Indian language accessible in another Indian
language. The system uses the Paninian Grammar model [6] for its language analysis.
The Anusaaraka project [16] was developed to translate Punjabi, Bengali,
Telugu, Kannada and Marathi into Hindi. English-Hindi Anusaaraka
translates English text into Hindi. The approach and lexicon are general, but the
system has mainly been applied to children's stories [95].
MaTra is a human-assisted, transfer-based translation system for English to Hindi
[11]. The system uses general-purpose lexicons and has been applied mainly in the domain
of news. MaTra follows a structural and lexical transfer approach to machine
translation. It aims to produce understandable output with wide coverage,
rather than perfect output for a limited range of sentences.
ManTra [106] is a machine-assisted translation tool that translates English text into
Hindi in several domains. ManTra is based on Tree Adjoining Grammar (TAG).
The ManTra system started with the translation of administrative documents, such
as appointment letters, notifications and circulars issued by the central government, from
English to Hindi.
AnglaBharti [103] is also a human-assisted machine translation system used in
India. Since India has many languages, there is a variety of machine translation
systems. For example, AnglaHindi [133] translates English to Hindi using a machine-aided
translation methodology. The human-aided machine translation approach is a
common feature of most Indian machine translation systems. In addition, these
systems use both pre-editing and post-editing as the means of
human intervention in machine translation.
Chandrashekhar Research Centre [20] has developed a machine-aided translation
system for Tamil to Hindi.
The input text is in Tamil and the output is produced
as Hindi text. Stand-alone, API and web-based online versions have been developed.
A Tamil morphological analyzer and a Tamil-Hindi bilingual dictionary are
by-products of this system [133].
In addition to the above, KSHALT is a human-assisted machine translation
system that translates English into Korean [85]. This translation system
contains four phases, namely the English parser, the English analyzer, English to Korean
transfer and Korean generation.
A number of machine translation systems have been designed using the rule-based
approach. Among others, Apertium [18] is an open-source, rule-based machine translation
system that can be used to translate between any two related languages. The Apertium
engine follows a shallow transfer approach and consists of eight pipelined modules:
a de-formatter, a morphological analyzer, a part-of-speech (PoS) tagger, a lexical
transfer module, a structural transfer module, a morphological generator, a post-generator
and a re-formatter.
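The pipelined arrangement described above can be pictured with the following minimal
sketch, in which each module is a function whose output feeds the next one. This is not
Apertium's actual interface: the stage functions are hypothetical placeholders, only the
de-formatter, the analyzer and the re-formatter do anything, and the remaining stages
simply pass their input through to show where real modules would sit.

```python
import re


def deformatter(text):
    """Strip markup so that later stages see plain text only."""
    return re.sub(r"<[^>]+>", "", text)


def morphological_analyzer(text):
    """Placeholder analysis: split the text into surface tokens."""
    return text.split()


def passthrough(units):
    """Stand-in for a stage that is not modelled in this sketch."""
    return units


def reformatter(units):
    """Join the generated tokens back into a sentence."""
    return " ".join(units)


# The eight stages, in the order listed in the text.
PIPELINE = [
    ("de-formatter", deformatter),
    ("morphological analyzer", morphological_analyzer),
    ("PoS tagger", passthrough),
    ("lexical transfer", passthrough),
    ("structural transfer", passthrough),
    ("morphological generator", passthrough),
    ("post-generator", passthrough),
    ("re-formatter", reformatter),
]


def run_pipeline(text, stages=PIPELINE):
    """Feed the output of each stage into the next one."""
    data = text
    for _name, stage in stages:
        data = stage(data)
    return data


if __name__ == "__main__":
    print(run_pipeline("<p>the boy  reads</p>"))
```

Keeping every stage behind the same call shape is what allows a module to be replaced
without touching the rest of the chain.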
Figure 2.1: Architecture for a rule-based machine translation system (source language
analyzer and dictionary, bilingual translator and bilingual dictionary, target language
dictionary)
difficult task in practice [135]. However, the Interlingua approach has several
advantages. Among others, it makes adding a new
language easier than any other method. It also has several disadvantages. Meaning
representation is the critical issue in the Interlingua approach. If the representation
is too simple, meaning will be lost in translation. On the other hand, if it is too
complex, analysis and generation become too difficult.
A number of machine translation systems have been developed using the Interlingua
approach. Abdelhadi and others have developed an English to Arabic machine
translation system based on the Interlingua approach [1]. They use a mapping
system between Arabic and the intermediate representation. This mapping system contains
three steps, namely selecting lexical items for each Interlingua concept, mapping the
semantic roles, and mapping the semantic features of each Interlingua concept to the
appropriate syntactic features in the feature structure.
ICENT is an Interlingua-based Chinese-English natural language
translation system [167]. It introduces a realization mechanism for
Chinese language analysis, which comprises syntactic parsing and semantic analysis,
and gives the design of the Interlingua in detail.
The Thai to English machine translation system is another successful
system [29]. It translates Thai sentences
into an Interlingua representation of a Thai LFG tree, using an LFG grammar and a
bottom-up parser.
contains four key components, namely the Multi-Agent Engine, the Virtual World, the
Ontology and the Interfaces [130][131]. The multi-agent engine provides run-time support
for agents. The engine starts as the first step of the system.
environment of the multi-agent system. Using this Virtual World, agents
cooperate and compete with each other as they construct and modify the current
scene. The Ontology contains the conceptual problem-domain knowledge of each agent.
A number of NLP systems have been developed using multi-agent
system technology [175][129][130][113][36]. Most of these systems use agents to
handle semantics in translation.
Minakow and others [113] have developed a multi-agent text understanding
system for the car insurance domain. The system
understands a given text in four steps,
namely morphological analysis, syntax analysis, semantic analysis and pragmatic
analysis. For analysis, the whole text is divided into sentences, and the first three
stages are applied to each sentence. After each paragraph has been analyzed, the text is
passed to pragmatic analysis.
Stefanini and others have developed a multi-agent general natural language
processing system named TALISMAN [141]. TALISMAN agents can communicate with
each other without central control and can directly exchange
information using an interaction language. Linguistic agents are governed by a set of
local rules. TALISMAN deals with ambiguities and provides a distributed
algorithm for resolving conflicts arising from uncertain information.
evaluated using the BLEU score metric [123], and reasonable results were achieved.
At present, they are working on an English to Sinhala machine translation
system based on translation memories [156]. They have designed a translation tool
named OpenTM, which is based on translation memories. They state
that OpenTM is suitable for any language pair in which at least
one language requires complex script support.
Further, many other local researchers have developed prototype English to
Sinhala machine translation systems using several approaches. In 2003, Vithanage
and others developed an English to Sinhala machine translation system for the
weather forecasting domain [153].
simple sentences and works on a limited set of words and limited sentence
patterns. This translation system is fundamentally rule-based and uses
paragraph and sentence tokenization, simple parsers (English and Sinhala),
translators and Sinhala sentence generators for English to Sinhala translation.
In 2008, Fernando and others developed an English to Sinhala machine translation
system using Artificial Neural Networks [47]. A Probabilistic Neural Network,
based on Bayesian classifiers, is used to identify the English grammar. This
system achieved 50% accuracy in grammatical translation. It was
tested with 84 test cases covering 12 tenses, and it is capable of translating only
simple sentences.
In addition to the above, researchers elsewhere in the world have attempted to develop
machine translation systems for Sinhala. Among others, Hearth and others have
attempted to develop a translation system from Japanese to modern Sinhalese [57]. The
system has a limited vocabulary and handles translations only within its domain.
Asian countries including India, Japan and Thailand have also developed
morphological analyzers for computer-based natural language processing [5][6]. For
example, the Anusaaraka system includes morphological analyzers for six Indian
languages [16]. Anusaaraka has been designed to translate among major Indian
languages, and its morphological analysis is based on paradigms. A paradigm is
used both for word analysis and for word generation. Akshar Bharati and
others have also developed a Generic Morphological Analysis Shell that can be used to
develop morphological analyzers for different minority languages [5]. This shell
uses finite state transducers (FSTs) with features to give the analysis of a given word.
Further, it integrates paradigms with augmented FSTs. The current model has been
developed for sample data of Hindi, Telugu, Tamil and Russian. The
shell uses dictionaries, a paradigm table and paradigm
classes.
especially suitable for ambiguous grammars and are used for parsing in computational
linguistics. Many of these parsers have already been implemented in C, Java, Perl
and Python. The X-Saiga parsers are developed under the X-Saiga project
to create algorithms and implementations that enable the construction of language
processors such as recognizers, parsers, interpreters and translators. Several algorithms
have been implemented at various stages of developing X-Saiga [166].
A bottom-up parser attempts to identify the most fundamental units first and then
builds trees upwards towards the start symbol. These parsers are used to analyze
both natural languages and computer languages. Several types of parsers
have been developed using the bottom-up approach, including operator precedence
parsers, LR parsers and CYK parsers.
An operator precedence parser is a bottom-up parser that interprets an
operator-precedence grammar [162]. An LR parser [132] also uses bottom-up parsing: it
reads the input from left to right and constructs a rightmost derivation of the
sentence in reverse. CYK parsers use the Cocke-Younger-Kasami algorithm, whose parsing
technique is also bottom-up. CYK parsers operate on context-free
grammars given in Chomsky normal form (CNF) [31][32].
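To make the bottom-up idea concrete, the following is a minimal sketch of a CYK
recognizer for a grammar in Chomsky normal form. The toy grammar, the dictionary layout
(`lexical` for rules of the form A -> word, `binary` for rules of the form A -> B C) and
the function name are assumptions made for illustration; they are not taken from any of
the parsers cited above.

```python
from itertools import product


def cyk_recognize(tokens, lexical, binary, start="S"):
    """Return True if the CNF grammar derives the token sequence."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][k] holds the nonterminals that derive tokens[i : i + k + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(lexical.get(tok, ()))
    # Build longer spans from pairs of shorter, already analysed spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]


# Toy grammar: S -> NP VP, NP -> Det N, VP -> V NP.
LEXICAL = {"the": {"Det"}, "boy": {"N"}, "book": {"N"}, "reads": {"V"}}
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}

if __name__ == "__main__":
    print(cyk_recognize("the boy reads the book".split(), LEXICAL, BINARY))
    print(cyk_recognize("boy the reads".split(), LEXICAL, BINARY))
```

The table is filled from single words upwards, so every larger constituent is assembled
only from constituents that have already been recognized, which is the defining property
of bottom-up parsing.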
In addition to the above, parsers have been developed in several programming
languages, especially Prolog [25], and with a number of parser development tools,
including ANTLR, Yacc and JavaCC. Using these programming languages and
tools, many parsers have been developed for
natural languages as well as computer programming languages.
2.8 Summary
This chapter gave a detailed discussion of machine translation systems and the
approaches they use. Table 2.1 shows selected successful machine translation
systems with their language pair, approach and system type.
System         Language pair               Approach, system type
Anusaaraka
AnglaBharti    English to ... languages    Human-assisted, Application
AnglaHindi     English to Hindi            Application
ManTra         English to Hindi            MT
MaTra          English to Hindi            Human-aided, transfer-based, Application
Google TR      Several languages           Statistical, Web-based
Babel Fish     Several languages
Yahoo TR       Several languages           Statistical, Web-based
Apertium       Related languages           Rule-based, Application
EDR            English/Japanese
According to the literature survey, the author has identified that the human-assisted and
rule-based approaches are more suitable for unrelated language pairs such as
English and Sinhala. The next chapter reviews features of the English and Sinhala languages
with a view to identifying issues related to machine translation from English to Sinhala.
Chapter 3
OVERVIEW OF THE ENGLISH AND SINHALA LANGUAGES
3.1 Introduction
The previous chapter discussed machine translation systems in detail. The
author pointed out issues in adapting an existing translation system to
construct an English to Sinhala machine translation system. The literature review
also revealed that the development of a machine translation system depends entirely
on the structure of the source and target languages. Therefore, this
chapter studies the language primitives and structures of the English and Sinhala
languages. This study provides insight into how translation
from English to Sinhala can be done.
previous example the morpheme "boy" is a stem and "s" is an affix. Stems and
affixes take part in both the inflection and the derivation of words, which together are
called word formation [109]. Inflection provides the various forms of a single word, such
as singular and plural (e.g. singular "man", plural "men" in English). Derivation creates
new words from old ones (e.g. the creation of "dogcatcher" from "dog", "catch" and
"er" is a derivational process) [117][84]. Compared with other Indo-European
languages, English grammar has minimal inflection. Therefore, English
morphology is simpler than that of other Indo-European languages. With the exception
of pronouns, English words have relatively few forms.
English nouns take regular and irregular inflections. Regular inflection
gives the general forms of the singular, plural and possessive cases. Table 3.1 shows
regular and irregular nouns with their inflected forms.
                      Regular    Irregular
Singular              boy        man
Plural                boys       men
Singular possessive   boy's      man's
Plural possessive     boys'      men's
Considering the morphology of the English noun, there is a very limited number of
rules for noun inflection. Table 3.2 shows some morphological rules for the
English noun. Basically, the plural noun is formed by adding a suffix such as "s", "es",
"ies" or "ves" to the singular noun. The possessive case is formed by adding "'s"
or "s'".
Morphological structure   Base word   Example
Singular noun             Boy         Boy
Plural, Base + s          Boy         Boys
Plural, Base + es         Class       Classes
                          Baby        Babies
                          Knife       Knives
                          School      Schools
                          Boy         Boys
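The plural rules in the table can be expressed as a short rule cascade. The sketch below
is an illustration under simplifying assumptions and is not the rule set used in BEES: the
function name is hypothetical, irregular nouns such as "man" are not handled, and the
"f"/"fe" rule over-generates for nouns such as "roof".

```python
def pluralize(noun):
    """Form a regular English plural with the suffixes s, es, ies and ves."""
    lower = noun.lower()
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"            # class -> classes
    if lower.endswith("y") and len(lower) > 1 and lower[-2] not in "aeiou":
        return noun[:-1] + "ies"      # baby -> babies
    if lower.endswith("fe"):
        return noun[:-2] + "ves"      # knife -> knives
    if lower.endswith("f"):
        return noun[:-1] + "ves"      # leaf -> leaves
    return noun + "s"                 # boy -> boys, school -> schools


if __name__ == "__main__":
    for word in ["boy", "class", "baby", "knife", "school"]:
        print(word, "->", pluralize(word))
```

The rules are tried from most specific to most general, so the default "s" suffix applies
only when no other pattern matches.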
verbs use different patterns. Then the regular verbs expect simple present (adding s)
and the Present Participle (adding ing) forms.
                     Regular verb   Irregular verb
Infinitive           play           eat
Past                 played         ate
Present participle   playing        eating
Past participle      played         eaten
Present:
  I                  play           eat
  You                play           eat
  He, She, It        plays          eats
  We                 play           eat
  You                play           eat
  They               play           eat
Morphological structure   Regular verb   Irregular verb
                          play           eat
                          plays          eats
Past (base + ed)          played         ate
                          playing        eating
                          played         eaten
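The verb forms in the table can be generated from a base form with a small rule function
plus an exception list for irregular verbs. This is a sketch for illustration only: the
form labels, the function name and the single-entry exception table are assumptions, and
spelling changes such as "write" to "writing" are deliberately left out.

```python
# Irregular verbs need their past forms listed explicitly.
IRREGULAR = {
    "eat": {"past": "ate", "past_participle": "eaten"},
}


def inflect(base, form):
    """Return the requested form of an English verb given its base form."""
    if form == "base":
        return base
    if form == "third_singular":
        return base + "s"              # play -> plays, eat -> eats
    if form == "present_participle":
        return base + "ing"            # play -> playing, eat -> eating
    if form in ("past", "past_participle"):
        if base in IRREGULAR:
            return IRREGULAR[base][form]
        return base + "ed"             # play -> played
    raise ValueError(f"unknown verb form: {form}")


if __name__ == "__main__":
    for verb in ["play", "eat"]:
        print(verb, [inflect(verb, f) for f in
                     ("third_singular", "past", "present_participle",
                      "past_participle")])
```

Only the past and past participle consult the exception table, which mirrors the table
above: the irregular verb differs from the regular one in exactly those two rows.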
Tense                        Example
                             I write a book
                             The boy sings a new song
Present continuous           I am writing a book
Present perfect
Present perfect continuous
Past tense                   I wrote a book
Past continuous
Past perfect
Past perfect continuous
Future tense
Future perfect continuous
Maldives, Dhivehi are the closest relative languages to Sinhala. Further, the Sinhala
script is the world's 16th most creative alphabet among today's functional
languages [35]. The most historical Sinhalese book, the Mahavansa [102], notes that
Prince Vijaya and his entourage, who came from India in the 5th century BC,
merged with the native Hela tribes known as Yakka and Naga, who spoke the Elu
language (the ancient form of the Sinhalese language), and that the new nation called
Sinhala came into existence with the Sinhala language.
Further, Sinhala differs from all other Indo-Aryan languages in that it contains a pair of
vowel sounds that are unique to it: the short vowel ඇ (ae) and the long vowel
ඈ (aae). Sinhala also contains a set of five nasal sounds known as half nasals or
prenasalized stops. These sounds, as represented in modern Sinhala writing with
their Romanized notations, are as follows: ඟ (nng), ඦ (ndj), ඬ (nnd), ඳ (nd), ඹ (mb)
[88].
The next subsection briefly describes the Sinhala alphabet, morphology and
syntax.
Sinhala Letters
w, wd, we, wE, b, B, W, W! ,, iD, iDD, t, ta, ft, T, , T!
l, L, . , >, V, , p, P, c, Cv [, {, P, g, G, v, V, K,
Consonants
, ; , : , o, O, k, |, m, M, n, N, u, U, h, r, ,, j, Y, I, i,
y, <, *
Semi-Consonants
x, (
Stroke name        Position    Example
Al-lakuna1         Upper       ia
Al-lakuna2         Upper
Aela-pilla         Right       ld
Kettiaedapilla     Right       le
Digaaedapilla      Right       lE
Ketti ispilla      Upper       ls
Diga ispilla       Upper       lS
Kettipaa pilla1    Lower       nq
Kettipaa pilla2    Lower       l=
Digapaa pilla1     Lower       nQ
Digapaa pilla2     Lower       l+
Gaettapilla        Right       iD
Kombuva            Left        fu
Gayanukitta        Right       T!
Character    Letter
la           la
la + w
la + wd      ld
la + we      le
la + wE      lE
la + b       ls
la + B       lS
la + W       l=
la + W!      l+
la + iD      lD
la + iDD     lDD
la + t       fl
la + ta      fla
la + ft      ffl
la + T       fld
la +         flda
la + T!      fl!
(lingaya), Number (Wachana), Person (Purusha) and Case (Vibhakthi). There are
three genders, namely masculine, feminine and neuter. Singular
and plural are the numbers, and there are three persons, namely first person
(Uthtamapurusha), second person (Maddamapurusha) and third person
ksidg = ksia + wd + g
= Upasarga + Prakurthi
ksis = ksia + b
= Prakurthi + Thadhitaya
Note that, in the above, the word ksia is a prakurthi, wd is a nama prathya, g
is a vibakthi suffix, fkd is an upasarga and b is a thadditha [43]. Note that the nama
prakurthi is a base form and the nama prathya is one of the inflection parts of the noun.
The vibakthi suffix is also an inflection part. Table 3.9 shows some case markers of
Sinhala nouns. Upasarga and thadditha change the meaning of the noun. Note that any
morphologically complex word can be broken up into several meaningful units called
morphemes. Therefore prakurthi, nama prathya, vibakthi prathya, thadditha and upasarga
are morphemes in Sinhala.
Case            Suffix
Nominative
Accusative
Instrumental    jsiska
Auxiliary       f.ka
Dative          g$ yg
Ablative        f.ka
Genitive        f.a
Locative        flfrys
Vocative
There are 27 forms of a noun that can be generated by inflecting a single root word
(prakurthi). This inflection is called Nama varanagilla (word conjugation). Sinhala
has more than a hundred rules for conjugating a noun from a given
base form (prakurthi). In Sinhala, 15 conjugation patterns have been identified for
generating nouns. These patterns are called Gana. There are six noun
generation forms (aeth ganaya, ali ganaya, tara ganaya, vasu ganaya, kaputu ganaya
and bamara ganaya) [41] that are used to generate masculine gender nouns. There are
nine generation forms (poth, akshara, basha, pili, akuru, polo, sulan, nuwara and
mutu) that are used to generate neuter gender nouns. Table 3.10 shows some rules for the
ksh; tal
A
Example
wksh; Wla;
A
wksh; wkqla;
Example
Example
we;a
we;d
f;la
;a
wef;la
l=
wef;l=
fldla
fldld
flla
la
fldflla
fll=
fldfll=
f.dka
f.dkd
fkla
ka
f.dfkla
fkl=
f.dfkl=
ksl
kslud
fula
kslfula
ful=
kslful=
lsUq,a
lsUq,d
f,la
,a
lsUqf,la
f,l=
,a
lsUqf,l=
ksia
ksid
fila
ia
ksfila
fil=
ia
ksfil=
Furthermore, sandhi rules are the morpho-graphemic rules describing the changes that
occur when different morphemes are concatenated. There are ten sandhi rules
in the Sinhala language, namely purwasswara lopa, parasawara lopa, swara,
swaradesha, gatradesha, purwarupa, pararupa, gathashwara lopa, agama and
dithwarupa. Nouns also undergo derivation. Derivation creates new words from
pre-existing words, often of different syntactic categories. The sandhi rules are used
for derivations.
language has only three tenses. They are the past, present and future tenses.
The main verb (Akkyathaya) takes three types of inflection, namely person, number
and sex. Tables 3.11 and 3.12 show the inflection forms of a verb in the active voice and
the passive voice.
Furthermore, the structure of Sinhala verbs differs from that of English verbs. The
Sinhala language has only three tenses, namely present
(Varthamana), past (Athitha) and future (Anagatha), whereas English shows 20 tenses
across the active and passive voices. Note that more than 18 inflection forms are available for a
Sinhala base verb, covering inflection for tense, number and person. In
addition, there are four moods, namely the indicative, optative, imperative
and conditional moods, and two participles, the present participle (Misrakriya) and
the past participle (Purvakriya). For example, hka and f.dia are inflected forms of the
above two participles.
In addition to the above, the other parts of speech, namely Nipatha and Upasarga, do not
take any inflections.
Person    Number      Present    Past      Future
First     Singular    n,         ne,S      n,kafk
First     Plural      n,uq       ne,Suq    n,kafkuq
Second    Singular    n,ys       ne,Sys    n,kafkys
Second    Plural      n,yq       ne,Syq    n,kafkyq
Third     Singular    n,hs       ne,S      n,kafkah
Third     Plural      n,;s       ne,Q      n,kafkdah
Person    Number      Present    Past        Future
First     Singular    nef,       ne,sKs      nef,kafk
First     Plural      nef,uq     ne,sKsuq    nef,kafkuq
Second    Singular    nef,ys     ne,sKsys    nef,kafkys
Second    Plural      nef,yq     ne,sKsyq    nef,kafkyq
Third     Singular    nef,hs     ne,sKs      nef,kafkah
Third     Plural      nef,;s     ne,qKq      nef,kafkdah
From the morphological point of view, a verb contains two parts, namely a base verb
and a suffix. The base verb is a prakurthi, and it is named kriya prakurthi. Different
verb forms are generated by adding different suffixes to the kriya prakurthi.
English language. There are also several prefixes available, such as non (non-usable)
and un (uninstall).
However, Sinhala generates words in different ways. The Sinhala
part of speech named Upasarga acts as a prefix of Sinhala words, and sandhi
rules are used to combine two or more words.
When comparing English and Sinhala nouns, English nouns have only three
types of inflection, namely number, case and person. The Sinhala noun has four
types of inflection, namely number, person, gender and determination. (In English,
determiners are separate words, as in a boy and the boy.)
The Sinhala verb is inflectionally richer than the English verb. Normally an English verb
has 5 forms, including simple present, past and past participle. However, the Sinhala
verb has more than 36 inflection forms across the two voices (active and passive) and
the person and number inflections. Sinhala also has 4 moods, namely indicative,
optative, imperative and conditional [54].
above two languages. This section describes the issues that need to be handled
in English to Sinhala machine translation.
The literary language and the spoken language differ from each other in
Sinhala.
Sinhala uses SOV (Subject Object Verb) word order and English uses SVO
(Subject Verb Object) word order.
Sinhala nouns have five types of inflection, namely gender, number, person,
case and article (definite/indefinite). English nouns have four types of
inflection, namely gender, number, person and case.
There is a difference between the Sinhala noun and the adjective form of the noun.
However, there is no such difference in English.
The Sinhala language has only three tenses, while English has 12 tenses.
Complete sentences
Noun phrases
URLs
Equations
Numbers etc.
The translation system needs to handle these texts for target language generation.
Identification of the complete sentence is one of the critical problems in machine
translation. A sentence in English normally ends with a full stop (.) followed by a
space. Using this two-character combination, the system identifies the end of a
sentence. However, names cause a problem (for example, A. B.
Fernando). Note that the A. is not a sentence ending; therefore, the HTML/Text parser
needs an internal mechanism to resolve such cases. In addition, noun
phrase identification is another issue in the translation. As an example, consider the
phrase A Computer Science Subject, which is translated as mrs.Kl jsoHd
jsIhla. Note that there are grammatical differences between the English and Sinhala
languages; therefore, word-level translation cannot be used. This is because there is a
difference between the noun form and the adjective form of Sinhala nouns (mrs.Klh
is a noun form and mrs.Kl is an adjective form). Also, in the Sinhala language, the article
comes with the Sinhala noun.
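The sentence-boundary problem described above can be illustrated with a minimal Prolog sketch. It is not the mechanism used by the HTML/Text parser of BEES, which the text does not describe; the predicate names split_sentences/2, ends_sentence/1 and is_initial/1 are hypothetical, and the only heuristic assumed is that a single upper-case letter followed by a full stop is an initial rather than a sentence ending.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% is_initial(+Token): Token is one upper-case letter followed by a dot,
% e.g. 'A.' in "A. B. Fernando" (assumed heuristic, not from BEES).
is_initial(Token) :-
    atom_chars(Token, [C, '.']),
    char_type(C, upper).

% ends_sentence(+Token): Token ends with a dot and is not an initial.
ends_sentence(Token) :-
    atom_chars(Token, Chars),
    last(Chars, '.'),
    \+ is_initial(Token).

% split_sentences(+Text, -Sentences): Sentences is a list of token lists.
split_sentences(Text, Sentences) :-
    atomic_list_concat(Tokens0, ' ', Text),   % split on single spaces
    exclude(==(''), Tokens0, Tokens),         % drop empty tokens
    group(Tokens, [], Sentences).

% group(+Tokens, +ReversedCurrentSentence, -Sentences)
group([], [], []) :- !.
group([], Acc, [S]) :-
    reverse(Acc, S).
group([T|Ts], Acc, [S|Ss]) :-
    ends_sentence(T), !,
    reverse([T|Acc], S),
    group(Ts, [], Ss).
group([T|Ts], Acc, Ss) :-
    group(Ts, [T|Acc], Ss).
```

With this sketch, the query split_sentences('A. B. Fernando reads. He writes.', S) keeps the initials inside the first sentence instead of treating A. and B. as sentence endings. Abbreviations such as titles would need further rules.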
3.12.3 Conjugation
Conjugation is another issue for machine translation: a number of word forms
must be generated from a single base word. To address this issue,
a machine translation system needs a word generator that produces the appropriate
word form. For English to Sinhala machine translation, the author uses a
Sinhala morphological generator to handle the conjugation issues.
3.13 Summary
In this chapter, the author presented an in-depth study of the English and Sinhala
languages, covering their morphology, syntax and semantics together with the
existing language issues. The next chapter discusses our novel approach to
English to Sinhala machine translation.
Chapter 4
NOVEL APPROACH TO MACHINE TRANSLATION
4.1 Introduction
Chapter 3 reviewed features of the English and Sinhala languages with a view to identifying
issues pertaining to English to Sinhala machine translation. It was pointed out that
machine translation systems need a theoretical base for the analysis of the source language
and the creation of the target language sentence. This chapter presents a theory-based
approach to machine translation from English to Sinhala.
Considering the Sinhala language, a Sinhala sentence can be divided into eight
components namely
1. Attributive adjunct of Subject (Wla; fYaIKh)
2. Subject (Wla;h)
3. Attributive adjunct of Object (lu fYaIKh)
4. Object (luh)
5. Attributive adjunct of Predicate (wdLHd; fYaIKh)
6. Attributive adjunct of the complement of predicate (wdLHd; mQK fYaIKh)
7. Complement of predicate (wdLHd; mQKh)
8. Predicate (wdLHd;h)
Table 4.1: Paradigm table for Kaputu Ganaya (lmqgq .Kh)
Base form: lmqgq
Form            Add     Remove    Example
ksh; tal                          lmqgd
wkshl Wla;      fgla    gq        lmqfgla
wksh; wkqla;    fgl=    gq        lmqfgl=
nyq Wla;        fgda    gq        lmqfgda
nyq wkqla;      ka                lmqgka
These components are the building blocks of the Sinhala sentence. Some selected
context-free grammar rules for the Sinhala language are listed below. All the
implemented rules are listed in Appendix C.
SubP = Subject Phrase
VebP = Verb Phrase
Sub = Subject
Obj = Object
ObjP = Objective Phrase
AdjSub = Attributive adjunct of Subject
AdjObj = Attributive adjunct of Object
Pre = Predicate
AdjPre = Attributive adjunct of Predicate
AdjCmp = Attributive adjunct of Complement
CmpPre = Complement of predicate
CmpPreP = Complement of predicate phrase
S → SubP VebP
SubP → Sub
SubP → AdjSub Sub
VebP → ObjP PreP
VebP → PreP
ObjP → Obj
ObjP → AdjObj Obj
PreP → AdjPre CmpPrep
PreP → CmpPrep
CmpPrep → Pre
CmpPrep → Pre CmpPre
CmpPre → Cmp
CmpPre → AdjCmp Cmp
Sub → Noun
AdjSub → Noun
Obj → Noun
AdjObj → Noun
AdjPre → Adv
Cmp → Noun
AdjCmp → Noun
Pre → Verb
Noun → [;u]
Noun → [YsIHhd]
Noun → [Ydrohl=]
Noun → [Kque;s]
Verb → [lf<ah]
Adv → [blauKska]
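The grammar rules listed above map naturally onto a Prolog definite clause grammar. The following is a minimal sketch of that mapping, not the implementation given in Appendix C. The lexicon entries (noun_a, noun_b, adv_a, verb_a) are placeholder tokens introduced only for illustration and stand in for the Sinhala words of the source.

```prolog
% Phrase-structure rules, transcribed from the rules listed above.
s       --> subp, vebp.
subp    --> sub.
subp    --> adjsub, sub.
vebp    --> objp, prep.
vebp    --> prep.
objp    --> obj.
objp    --> adjobj, obj.
prep    --> adjpre, cmpprep.
prep    --> cmpprep.
cmpprep --> pre.
cmpprep --> pre, cmppre.
cmppre  --> cmp.
cmppre  --> adjcmp, cmp.

% Category rules.
sub     --> noun.
adjsub  --> noun.
obj     --> noun.
adjobj  --> noun.
adjpre  --> adv.
cmp     --> noun.
adjcmp  --> noun.
pre     --> verb.

% Lexical rules backed by a placeholder lexicon (hypothetical entries).
noun --> [W], { noun_word(W) }.
verb --> [W], { verb_word(W) }.
adv  --> [W], { adv_word(W) }.

noun_word(noun_a).
noun_word(noun_b).
verb_word(verb_a).
adv_word(adv_a).
```

For example, phrase(s, [noun_a, noun_b, adv_a, verb_a]) succeeds as a subject, object, adverb, verb sequence, which reflects the SOV order of Sinhala, while phrase(s, [verb_a, noun_a]) fails because no subject phrase can start with a verb.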
4.4 Hypothesis
The hypothesis employed in the thesis can be stated as follows: concepts of Varanegeema
(conjugation) in the Sinhala language can be used to drive English to Sinhala machine
translation.
BEES provides built-in tools for maintenance, evaluation and updating of the
system
word. If there are multiple words available in the bilingual dictionary, then the system
looks up the relevant information in the concept dictionary to identify the most
suitable Sinhala base word. The concept dictionary stores concept
information for each Sinhala word. Otherwise, the English to Sinhala Base Word
Translator gives the most commonly used Sinhala base word for the given English base word.
After successful base word translation, the Sinhala parser (sentence composer)
generates the appropriate Sinhala sentence with the support of the Sinhala morphological
generator. The Sinhala Morphological Generator generates the appropriate Sinhala
words from the translated Sinhala base word and the given grammar information.
The Sinhala Parser uses the generated Sinhala words to compose a grammatically
correct Sinhala sentence.
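The base word selection described above can be sketched in Prolog as follows. This is an illustration under stated assumptions, not the BEES code: the predicates candidate/2, concept/2, most_usable/2 and select_base_word/3 and all their facts are hypothetical, and the Sinhala words are represented by placeholder atoms.

```prolog
% candidate(EnglishBaseWord, SinhalaBaseWord): toy bilingual dictionary.
candidate(dangerous, sw_a).
candidate(dangerous, sw_b).
candidate(boy, sw_c).

% concept(SinhalaBaseWord, Context): toy concept dictionary.
concept(sw_a, animal).
concept(sw_b, place).

% most_usable(EnglishBaseWord, SinhalaBaseWord): toy fallback choice.
most_usable(dangerous, sw_a).

% select_base_word(+EngWord, +Context, -SinWord)
% 1. A single candidate is returned directly.
% 2. With several candidates, the concept dictionary decides.
% 3. Otherwise the most usable word is returned.
select_base_word(EngWord, Context, SinWord) :-
    findall(W, candidate(EngWord, W), Ws),
    (   Ws = [Only]
    ->  SinWord = Only
    ;   member(SinWord, Ws), concept(SinWord, Context)
    ->  true
    ;   most_usable(EngWord, SinWord)
    ).
```

Here select_base_word(dangerous, place, W) picks the candidate whose concept matches the context, while an unknown context falls back to the most usable word.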
4.10 Summary
This chapter described a novel approach with a theoretical basis for English to
Sinhala machine translation. The translation system is presented as a rule-based system
known as BEES. The chapter also discussed the theoretical basis of the approach, the
hypothesis, the input to the system, the output of the system and its overall features. The next
chapter describes the design of the software solution of BEES.
Chapter 5
DESIGN OF BEES
5.1 Introduction
The previous chapter reported on a novel approach for English to Sinhala machine
translation. It pointed out theoretical basis, hypothesis, input, output and process of
the translation system, which is known as BEES. This chapter gives the design of
the English to Sinhala machine translation system, BEES. The system has been
designed as a rule-based machine translation system with 7 modules.
Morphological analyzer can identify its inflection forms such as play, plays and
playing. However, irregular words cannot be identified by the morphological
analyzer, and these words need to be stored separately in the lexical dictionary.
[Figure: diagram with the components English Sentence, English Dictionary, English Parser, English-Sinhala Bilingual & Concept dictionary, Sinhala Dictionary, Sinhala Parser and Sinhala Sentence]
Find the suitable Sinhala base word from the bilingual dictionary with the
full grammatical mapping. (If two or more words are available in the
bilingual dictionary, the system uses the concept dictionary to find the suitable
Sinhala base word.)
The English to Sinhala base word translator translates the English base word into
the Sinhala base word by using the concept dictionary and the English to Sinhala
bilingual dictionary.
V2
e, r
a, e, i, o, u, y
a
o
V3
V4
w, u
o, u
dictionaries in the machine translation system. The process of intermediate editing, before composing a Sinhala sentence, drastically reduces the computational cost
of running the Sinhala morphological analyzer and the parser. In addition, the requirement
for post-editing can be reduced by the process of intermediate editing. On the other
hand, intermediate editing can be used as a means of continuously capturing human
expertise for machine translation. This knowledge can be reused for subsequent
translations. As such, the concept of intermediate editing can be introduced as an
approach to automatic knowledge management in the machine translation system. It
should be noted that the knowledge used for pre-editing and post-editing cannot be
readily captured by the machine translation system, as these processes can be carried out
outside the machine translation system. In contrast, intermediate editing is an
integral part of the machine translation system, in which a human directly interacts with
the system. If the English to Sinhala base word translator cannot identify the
most suitable Sinhala word (that is, the grammatical mapping is not satisfied), then the intermediate
editor allows the user to select the suitable Sinhala word.
Therefore, the author has designed a new structure for the development of the English to
Sinhala bilingual dictionary [67]. The English-Sinhala bilingual dictionary is also
designed as a Prolog database.
The Sinhala dictionary stores Sinhala regular words, irregular words, lexical
information and sets of rules, which are required to generate Sinhala words [54][58].
All the rules are based on the fundamentals of the Sinhala language.
The concept dictionary [67] contains the context information for the Sinhala words.
This dictionary is used to identify the semantics of the words. All four
dictionaries together work as the ontology of the machine translation system.
[Figure: diagram with the components Internet, Sinhala Corpus, Sinhala Morphological Generator, Dictionary Updator, Concept Dictionary, English-Sinhala Bilingual Dictionary, Sinhala dictionary and English dictionary]
module identifies the usage and the availability of the given set of words. Further,
some Sinhala adjectives give a unique meaning and some words have a special usage.
Example: The English term dangerous has several Sinhala meanings, including
Nhdkl, if>dar, kmqre etc. However, not every Sinhala term is suitable for
every noun; for example, Nhdkl fldhd and if>dar imhd are valid
Sinhala phrases, whereas if>dar fldhd has no meaning. The online
search module has been designed to identify this type of word usage through
online Sinhala resources, and this information is stored in the concept dictionary.
5.4 Summary
This chapter described the design of the English to Sinhala machine translation
system, which contains seven modules, three supporting modules and four dictionaries.
The processes of each module were discussed in the chapter. The next chapter describes the
implementation of the software solution of BEES.
Chapter 6
IMPLEMENTATION
6.1 Introduction
The previous chapter described the design of the English to Sinhala machine
translation system. This chapter gives implementation details of all the modules
identified in the design.
word. The following code is used to consult the English dictionaries. It shows
how EMA consults the eng_reg_nouns.pl Prolog file.
consult('eng_reg_nouns.pl'),
The EMA uses the Prolog predicate loadEMA/2 to start the morphological
analysis. This predicate gives finish, unknown or error as the result of the analysis.
For example
loadEMA('boy eats rice', X).
X = finish.
The English morphological analyzer writes all the output data to a file named
ema_out.pl. Before analyzing a new data set, EMA clears all existing data and gets ready for
the new data set.
The EMA analyzes the text word by word. For each word it
gives all the grammatical information.
English morphological analysis can be divided into two categories, namely regular
word analysis and irregular word analysis. The irregular words are available in
the dictionary. The English irregular nouns, irregular verbs, irregular adjectives,
adverbs, prepositions, conjunctions and determinations are stored in the irregular
form. The following code shows how EMA analyzes an English adverb.
search_irr_word(EngWord) :-
    eiw(ID, av, Type, EngWord),
    write_output_advb(ID, Type, EngWord).
The predicate write_output_advb/3 is used to write the result to the output
file.
The Prolog predicate eng_advb/3 is used to represent the irregular adverb. The
following sample shows the English adverb slowly in the ema_out.pl file.
eng_advb([3000015], p, 'slowly').
The following code shows how EMA analyzes an English noun.
Singular (base noun) - y + ies = Plural noun
get_eng_noun(EWL, RootID, Sp, Sex, Type) :-
    append(Rest1, [i,e,s], EWL),
    append(Rest1, [y], Rest),
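Since the listing above is incomplete in the source, the following self-contained sketch shows how such a suffix rule can be completed. It is an illustration, not the EMA rule itself: plural_root/3 is a hypothetical name, the erw/4 facts follow the format used for eng_reg_noun.pl, and the entry for baby (including its ID) is invented for the example.

```prolog
:- use_module(library(lists)).

% Toy regular-noun dictionary in the erw/4 format (ID, type, sex, word).
erw(1000001, na, ma, 'boy').
erw(1000099, na, co, 'baby').   % hypothetical entry and ID

% plural_root(+Plural, -RootID, -Root)
% Rule 1: base - y + ies = plural (babies -> baby).
% Rule 2: base + s = plural      (boys -> boy).
plural_root(Plural, RootID, Root) :-
    atom_chars(Plural, EWL),
    (   append(Rest1, [i,e,s], EWL),
        append(Rest1, [y], Rest)
    ;   append(Rest, [s], EWL)
    ),
    atomic_list_concat(Rest, Root),
    erw(RootID, na, _, Root).     % accept only roots found in the dictionary
```

The dictionary lookup at the end rejects wrong splits: for babies the second rule proposes the root babie, which is not a dictionary entry and is therefore discarded.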
EMA uses the same method to analyze English verbs. The following code
shows how EMA analyzes an English verb in the simple present tense.
get_eng_verb(EWL, RootID, Tens) :-
    append(Rest, [s], EWL), concat_atom(Rest, GRoot),
    erw(RootID, vb, GRoot), Tens = sp.
The English morphological analyzer has been implemented with 14 rules for
analyzing regular nouns, 14 rules for English adjectives, 11 rules for English
regular verbs and 7 rules for irregular verbs.
The following output shows the result of the morphological analysis of the
English sentence "A good boy and his friend read the books everyday".
eng_input_sen_list(['a', 'good', 'boy', 'and', 'his', 'friend',
'read', 'the', 'books', 'quickly', []]).
eng_detm([3000001], id, 'a').
eng_adjv([3000004], p, 'good').
eng_noun([1000001], td, sg, ma, sb, 'boy').
eng_noun([1000001], td, sg, ma, ob, 'boy').
eng_conj([3000027], 0, 'and').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([1000011], td, sg, ma, sb, 'friend').
The above example shows how EMA analyzes the given words.
add_eng_sen_results('error')
).
Then the EPA clears all the variables and the previous data in the epa_out.pl file.
The following rules are used to analyze simple and compound sentences.
english_sentence(Out, NL, []) :- simple_sentence(Out, NL, []).
english_sentence(Out, NL, []) :- compound_sentence(Out, NL, []).
A compound sentence may consist of two simple sentences joined by a conjunction.
compound_sentence(Out, Sen, End) :-
A simple sentence may be one of four types, namely declarative, interrogative,
imperative or conditional. The following code shows the implementation.
simple_sentence(Out, NL, End) :- declarative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :- interrogative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :- imperative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :-
The English Parser analyzes the English sentence and gives the following information:
1. Type of the sentence
2. Tense of the sentence
3. Subject, complement, verb and the predicate
The following results are given for the English sentence "A good boy and his
friend read the books everyday".
eng_sen_verb([5000008]).
eng_sen_complement([3000003, 1000004, 3000016]).
eng_sen_subject([3000001, 3000004, 1000001, 3000027, 4000004, 1000011]).
eng_sen_predicate([5000008, 3000003, 1000004, 3000016]).
eng_sen_type(declarative).
eng_sen_ekeys([3000001, 3000004, 1000001, 3000027, 4000004,
The Prolog predicate eng_sen_verb/1 gives the verb of the sentence. This verb ID is
the same as the ID of the verb in the morphological analysis.
eng_verb([5000008], if, 'read').
The Prolog predicates eng_sen_complement, eng_sen_subject and
The above three predicates give the type, the tense and the result of the analysis. Note that
this information is used to generate the corresponding Sinhala sentence.
The bilingual translator stores all the output of the base-word translation in
the file named est_out.pl.
To identify the corresponding Sinhala base word, the bilingual translator uses the
following three rules.
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    eng_cons_word(H, S, SubID),
    subject_form_avlable(SubID),
    esw(_, H, S, Type, EW, SW).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    esu(EW, H, S, Type, SW, _).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
The Sinhala morphological generator generates appropriate Sinhala words for the
given grammar information. The Sinhala morphological generator generates Sinhala
Nouns, verbs, adjectives, adverbs and prepositions.
To generate Sinhala nouns, SMG uses the get_sin_noun/8 Prolog predicate. It
takes the Sinhala base word ID, person, number, sex, live code,
DI-code and case to generate a suitable Sinhala noun.
snoun([1000001], td, sg, ma, li, dr, v1,' ').
To generate a Sinhala noun, it uses Sinhala rules. The following code shows how
the Sinhala morphological generator generates a Sinhala noun.
Case 1: The Sinhala noun can be obtained directly from the Sinhala dictionary (no
generation is needed).
get_sin_noun(WID, P, N, S, L, DI, VB, NW) :-
    sn([WID], P, N, S, L, DI, VB, NW).
Like Sinhala nouns, Sinhala regular verbs are generated through a set of
Sinhala rules. The following code shows some of the rules for generating Sinhala regular verbs.
get_sin_final_verb(Skey, ps, P, N, fu, SW) :-
    sfv([Skey], _, _, _, _, _, APR, _, WD),
    verb_posfix(APR, ps, P, N, fu, ADD, REM),
    atom_chars(WD, WDL), atom_chars(ADD, ABL),
    atom_chars(REM, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL),
    concat_atom(NWDL, SW), write(SW).
The above code shows how Prolog generates a Sinhala verb. As a first step, the
Sinhala verb and its conjugation form are identified through the Sinhala
dictionary. After that, the conjugation rules are identified from the Sinhala rule
dictionary. Finally, using all this information, the final Sinhala word is generated.
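The core of this generation step is a remove-then-add operation on the base word. The sketch below isolates that operation; apply_rule/4 is a hypothetical helper name, not a BEES predicate. The romanised example (kaputu, remove tu, add tek) is an illustrative transliteration of the indefinite singular pattern in Table 4.1 and is an assumption of this sketch.

```prolog
:- use_module(library(lists)).

% apply_rule(+Base, +Remove, +Add, -Word)
% Strips the suffix Remove from Base and appends Add.
% Fails if Base does not end with Remove.
apply_rule(Base, Remove, Add, Word) :-
    atom_chars(Base, BaseChars),
    atom_chars(Remove, RemoveChars),
    atom_chars(Add, AddChars),
    append(Stem, RemoveChars, BaseChars),   % Base = Stem + Remove
    append(Stem, AddChars, WordChars),      % Word = Stem + Add
    atomic_list_concat(WordChars, Word).
```

For instance, apply_rule(kaputu, tu, tek, W) binds W to kaputek, and an empty remove part, as in apply_rule(play, '', s, W), simply appends the suffix. One such helper can serve both noun and verb rules because each rule is stored as an add part and a remove part.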
The Prolog predicates, namely translateSubject,
As the first step, the transliteration module converts the given set of words into a list. After
that, it transliterates the list word by word using an FST. In addition, the
module uses a character encoding system for the FST.
The following sample code shows some of the rules in the FST that represent
Sinhala vowel letters.
initial(1).
final(99).
% ***************************************************************
% Finite State Automata for Sinhala Vowels
% ***************************************************************
arc(50, 62, a, []).
arc(62, 70, e, []).
arc(70, 99, e, [e]).
arc(62, 99, a, [c]).
arc(62, 99, e, [d]).
arc(62, 99, i, [p]).
arc(62, 99, u, [s]).
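A transducer stored in this way can be run with a short recursive walk. The sketch below assumes that the four arguments of arc/4 are the source state, the target state, the input character and the list of output characters. The toy arcs and the names transduce/2 and walk/3 are hypothetical and do not reproduce the FST of BEES.

```prolog
:- use_module(library(lists)).

% Toy transducer in the arc(From, To, Input, Output) format (assumed meaning).
initial(1).
final(99).
arc(1, 2, a, []).
arc(2, 99, a, [x]).
arc(2, 99, e, [y]).

% transduce(+InputChars, -OutputChars)
transduce(Input, Output) :-
    initial(S0),
    walk(S0, Input, Output).

% walk(+State, +RemainingInput, -Output)
% Succeeds only if the input is consumed in a final state.
walk(S, [], []) :-
    final(S).
walk(S, [C|Cs], Out) :-
    arc(S, S1, C, O),
    walk(S1, Cs, Rest),
    append(O, Rest, Out).
```

With these toy arcs, the input [a, a] is mapped to [x] and [a, e] to [y], while the input [a] alone is rejected because state 2 is not final.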
In addition to the above, the intermediate editor uses two XML files, namely
reldata.xml and trasdata.xml, to store relations and the translated data. Figure
6.1 shows the user interface of the intermediate editor, including sample data.
eng_reg_noun.pl, eng_irr_verb.pl, eng_reg_verb.pl and
Table 6.2 shows the codes that are used to implement the grammar notation in the
English dictionary.
Category         Code    Meaning
Person           fs      1st person
                 sc      2nd person
                 td      3rd person
Number           sg      Singular
                 pl      Plural
Sex              ma      Masculine gender
                 fe      Feminine gender
                 co      Common gender
                 no      Neuter gender
Case             sb      Nominative case
                 ob      Objective case
                 po      Possessive case
                 rf      Reflexive (pronoun)*
Verb type        if      Infinitive
                 pa      Past
                 pp      Past Participle
                 rp      Present Participle
                 sp      Simple present
Determination    dr      Direct
                 id      Indirect
Adjectives               Passive
                         Comparative
                         Superlative
The following code shows sample data for the English irregular words
eiw(4000001, na, fs, sg, co, sb, 'i').
eiw(4000001, na, fs, sg, co, ob, 'me').
eiw(4000001, na, fs, sg, co, po, 'my').
eiw(4000001, na, fs, sg, co, po, 'mine').
eiw(4000001, na, fs, sg, co, rf, 'myself').
The English regular nouns are stored in a Prolog file named eng_reg_noun.pl
using the erw/4 Prolog predicate. The erw/4 predicate represents the word ID, word type,
sex and the word. The following two samples are regular nouns that are stored in
eng_reg_noun.pl.
erw(1000001, na, ma, 'boy').
erw(1000002, na, fe, 'girl').
The English morphological analyzer reads the Prolog predicates in these files and uses them to analyze
English words.
The English irregular verbs are saved in a file named eng_irr_verb.pl. This file
stores the English irregular verbs as Prolog predicates named
eiw/4, which represent the word ID, word type, tense of the verb and the English irregular
verb.
eiw(5000001, vb, if, 'eat').
eiw(5000001, vb, pt, 'ate').
eiw(5000001, vb, pp, 'eaten').
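Because all forms of an irregular verb share one word ID, the base form of an inflected verb can be recovered with a two-step lookup. The following sketch uses the three eiw/4 facts shown above; the predicate name irregular_verb/3 is hypothetical and is not part of EMA.

```prolog
% Irregular verb entries in the eiw/4 format (ID, type, tense, word).
eiw(5000001, vb, if, 'eat').
eiw(5000001, vb, pt, 'ate').
eiw(5000001, vb, pp, 'eaten').

% irregular_verb(+Word, -Tense, -Base)
% Finds the tense code of Word and the infinitive that shares its ID.
irregular_verb(Word, Tense, Base) :-
    eiw(ID, vb, Tense, Word),
    eiw(ID, vb, if, Base).
```

For example, irregular_verb(ate, T, B) gives T = pt and B = eat, so no suffix rule is needed for irregular forms.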
The English regular verbs are stored in a Prolog file named eng_reg_verb.pl. This
file contains English regular verbs in the erw/3 Prolog predicate format. The following
code shows how Prolog represents the English regular verbs.
erw(2000001, vb, 'play').
The erw/3 Prolog predicate uses the word ID, word type and the word to store
regular word information. The English morphological analyzer uses this information
to analyze English regular verbs.
In addition to the above, the other parts of speech such as adjectives, adverbs,
prepositions, conjunctions and interjections are stored in the Prolog file named
eng_irr_word.pl. The Prolog predicate named eiw/4 is used to store all the words.
The following code shows how each word is stored in the eiw/4 format. A special
notation is used to identify each word type (na-noun, vb-verb, dt-determinations, aj-adjective, av-adverb, pp-preposition, cn-conjunction and uv for auxiliary verbs).
eiw(3000001, dt, id, 'a').
eiw(3000004, aj, p, 'good').
eiw(3000014, av, p, 'badly').
eiw(3000026, pp, v5, 'to').
eiw(3000027, cn, 0, 'and').
eiw(3000029, vb, uv, 'will').
By using the online update module, this English dictionary can be updated automatically.
This is the main purpose of separating the English dictionary into several files.
compress the with the prolog type files namely sin_reg_nouns.pl, sin_irr_nouns.pl,
sin_reg_verb.pl,
sin_irr_verb.pl,
sin_irr_words.pl,
sin_case_rules.pl
and
sin_rule_dic.pl.
The file sin_reg_nouns.pl contains the Sinhala regular noun information. The
prolog predicate sn/9 is used to store all the information in the regular noun. The
ma,
li,
s900004,
s910000,
s910000,
s910000,
'').
The Sinhala irregular nouns are also stored in a Prolog file named sin_irr_nouns.pl
using the sn/8 Prolog predicate. The sn/8 predicate holds the word ID,
person, number, sex, live code, direct/indirect form, case and the Sinhala word. The
Sinhala dictionary uses Sinhala Unicode to store all the Sinhala
words. The following code shows samples of the Sinhala irregular words.
sn([4000001], fs, sg, co, li, dr, v1, '').
sn([4000001], fs, sg, co, li, dr, v2, '').
sn([4000001], fs, sg, co, li, dr, v3, '').
The Sinhala noun has nine cases, and these cases are represented by the codes v1 to v9.
The Sinhala regular verbs are stored in the Prolog file named sin_reg_verb.pl
using the Prolog predicate named sfv/9. It represents the word ID and the verb forms
for the active voice, the passive voice and the other verb (mood) forms.
sfv([5000001], s910001, s910002, s910001, s910001,
    s910001, s910001, s910001, '').
The Sinhala irregular verbs are stored in the prolog file named sin_irr_verb using the
prolog predicate sfv/6. The sfv/6 represents Word id, person, number, voice, tense
and the Sinhala verb. The following code shows samples for the Sinhala irregular
verbs.
sfv([8000002], fs, sg, at, pr, '').
sfv([8000002], fs, pr, at, pr, '').
All other Sinhala words, namely Sinhala adjectives, adverbs and prepositions, are
stored in a Prolog file named sin_irr_word.pl using the Prolog predicate siw/4. The
siw/4 Prolog predicate represents the Sinhala word ID, type, property and the Sinhala
word. The following sample code shows Sinhala words in the dictionary.
siw([3000034], aj, p, '').
siw([3000015], av, p, '').
siw([3000033], pp, v3, '').
Several rules are needed to generate a Sinhala noun. These rules are stored in
sin_rule_dic.pl and are used to generate the appropriate Sinhala noun form
from its base form. The following sample rules are used to generate Sinhala words.
These rules represent the implementation of the Sinhala kaputu ganaya.
The sin_rule_dic.pl file has been implemented with more than
100 rules to generate the appropriate Sinhala noun.
noun_posfix(s935001, li, bas,
'', '').
'', '').
'', '').
'', '').
'', '').
'', '').
'', '').
The noun_posfix/5 predicate is the rule format for the Sinhala noun, and it represents the rule ID,
live code, noun type, add part and remove part. These rules are the implementation
of the Sinhala noun palromdrim (Conjugation table). In addition to the above,
case rules are used to generate the complete Sinhala noun with the case effect. The case
rules are stored in a Prolog file named sin_case_rule.pl. The following code shows
sample case rules.
The Prolog predicate named noun_vib_postfix/4 gives the rule ID, case, add part
and remove part of the word. The Sinhala morphological generator uses all of
these rules to generate grammatically correct Sinhala terms.
The sin_rule_dic.pl file also stores the rules that are used to generate Sinhala verbs. The
Prolog predicate verb_posfix/7 is used to store the rule ID, voice, person, number, tense,
add part and remove part of the Sinhala verb. The following sample code shows
a sample rule for Sinhala verb generation.
The esw/6 prolog predicate is used to store appropriate Sinhala base word for a given
English base word. The esw/6 prolog predicate gives id, English word id, Sinhala
word id, word type, English word and the Sinhala word. Using the above predicate
all the Sinhala and English words are combined through the English-Sinhala
bilingual dictionary.
The eng_sin_usage_dic is used to store the most commonly used terms on the web. This
dictionary is automatically updated by the online update module. Like
eng_cons_word/3, the eng_usage_word/3 Prolog predicate is
used to store this usage information.
In addition to the above Sinhala resources, a Sinhala corpus is used as a supporting
resource to find available word forms. The Sinhala corpus information is stored in a
Prolog predicate named sc/1, and all of it is stored in a Prolog file named
sinhalacop.pl
sc('').
sc('').
sc('').
sc('').
sc('').
The present corpus contains 18613180 words, and these resources were collected from
the UCSC Sinhala corpus (LTRL). The Sinhala word generator uses these resources
to identify suitable Sinhala word forms directly.
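One plausible use of the corpus, sketched here under the assumption that sc/1 is consulted as a simple membership test, is to keep only the generated word forms that are attested. The CorpusFilter class and the placeholder strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical filter that keeps only candidate forms found in the corpus. */
final class CorpusFilter {
    private final Set<String> corpusWords;

    CorpusFilter(Set<String> corpusWords) {
        this.corpusWords = corpusWords;
    }

    /** Returns the candidates that occur in the corpus, in their original order. */
    List<String> attested(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (corpusWords.contains(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder strings stand in for Sinhala word forms (assumption).
        Set<String> corpus = new HashSet<>(Arrays.asList("formA", "formC"));
        CorpusFilter filter = new CorpusFilter(corpus);
        System.out.println(filter.attested(Arrays.asList("formA", "formB", "formC")));
    }
}
```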
Then the updater uses the online search module to get the grammar information from a
set of online resources. For example, the online search module uses the Madhura online
dictionary, the Cambridge dictionary, the Sensagent online dictionary and the Yahoo
search engine to get relevant English grammar information. The online updater gets the
relevant word information, such as the word type (regular noun, irregular noun, regular
verb, irregular verb, adjective, etc.), and the system then updates each dictionary.
The following sample code is used to update an English regular noun.
% Add Word to the English regular noun dictionary unless it is already there.
update_eng_reg_noun(Word, ID) :-
    write('try to update regular noun'),
    consult('c:\\bees3.2\\dic\\eng_reg_nouns.pl'),
    (   erw(ID, na, _, Word)
    ->  write('English regular noun available ('),
        write(ID), write(')'), nl
    ;   consult('c:\\bees3.2\\updateinfo.pl'),
        (   new_noun(Word, re, Word, _)
        ->  get_new_eng_reg_noun_key(ID),
            open('c:\\bees3.2\\dic\\eng_reg_nouns.pl', append, File),
            write(File, 'erw('), write(File, ID),
            write(File, ', na, no, \''),
            write(File, Word),
            ...

% Call the Java class SearchCambDic through JPL to query the Cambridge dictionary.
... :-
    use_module(library(jpl)),
    write('Call : http://dictionary.cambridge.org ..... '),
    jpl_new('SearchCambDic', [], F),
    jpl_call(F, searchDic, [Word], Out), write(Out), nl.
The Sinhala word generator can generate all the word forms for a given noun or verb.
These word forms need to be validated against the required rules.
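% Derive the base form BASE of the word WD using noun rule NP.
validate_baseform(WD, P, NP, BASE) :-
    ensure_loaded('c:\\bees3.2\\dic\\sin_rules_dic.pl'),
    noun_posfix(NP, P, bas, AB, RB),
    atom_chars(WD, WDL),
    atom_chars(AB, ABL),
    atom_chars(RB, RBL),
    append(WDL1, RBL, WDL),      % strip the remove part from the end of the word
    append(WDL1, ABL, NWDL),     % append the add part to the remaining stem
    concat_atom(NWDL, BASE),
    write(BASE), nl.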
//System.setProperty("http.proxyHost", "10.32.193.254");
//System.setProperty("http.proxyPort", "3128");
System.out.println("Connecting to http://search.yahoo.com/");
FileOutputStream fout = new FileOutputStream("tmp\\yahoo_search.html");
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fout, "ISO-8859-1"));
String uu = "http://search.yahoo.com/search?p=" + word;
String resultString = new String(uu.getBytes("UTF-8"));
6.5 Summary
This chapter reported the implementation of all the modules and dictionaries.
To implement the modules, the author used Java and Prolog technologies. The next
chapter discusses how BEES works in four environments, namely as a desktop
application, an online translator, a web page translator and a selected text
translator.
Chapter 7
BEES IN ACTION
7.1 Introduction
The previous chapter described the implementation of all the modules and dictionaries.
BEES has been implemented through several online and standalone applications.
This chapter describes these applications. The English to Sinhala machine
translation system has been implemented through four applications, namely:
1. BEES as an online translator
2. BEES as a webpage translator
3. BEES as a selected text translator
4. BEES as a Desktop Application
The web browser is the user interface of the system. The Apache web server handles all
the web-based transactions of the system. PSP provides facilities to run a Prolog-based
system through the web. The Prolog-based system is the core of the machine translation
system. Through the PSP scripts, the core system reads the input English sentence that
comes from the web client. After the translation, the core machine translation system
returns the output Sinhala sentence to the web client. Figure 7.2 shows the user
interface of the online BEES [72].
This system translates a given English web page into Sinhala and shows the output of
the translation in a web browser. Figure 7.4 shows the translated output of the
sample web page. The translation process is as follows. As a first step, the HTML
parser [66] analyzes the document and identifies the tags and the text. Consider the
following part of a simple HTML document.
<tr><td>
The Rabbit
</td></tr>
<tr><td>
<img src="trabsl1.jpg">
The Rabbit is a small and herbivorous animal.
It lives in the jungle. Rabbit has long and powerful legs.
</td></tr>
This HTML source contains several HTML tags and text. "The Rabbit" is a text
identified by the HTML parser. The parser then sends this text to the translation
module. The translation module reads the text and tries to translate it. In the
sentence analysis stage, the English parser rejects the input text because it is not
a sentence. Therefore, the system tries to identify it as a noun phrase. The English
parser recognizes the input text "The Rabbit" as a noun phrase. Then the translation
module uses the English to Sinhala word translator, the Sinhala morphological analyzer
and the Sinhala parser, and generates the appropriate Sinhala translation, හාවා.
Next, consider how the translation module works for a complete sentence. Assume
that the translation module reads the sentence "The Rabbit is a small and
herbivorous animal" as the input text. The English morphological analyzer reads the
input sentence and returns the following.
eng_detm([e1000002], dr, 'the').
eng_noun([e1000077], td, sg, ma, sb, 'rabbit').
eng_verb([e1000057], if, 'is').
eng_detm([e1000001], id, 'a').
eng_adjv([e1000074], p, 'small').
eng_conj([e1000020], 0, 'and').
eng_adjv([e1000076], p, 'herbivorous').
eng_noun([e1000059], td, sg, co, sb, 'animal').
The English parser identifies the subject, verb and complement of the sentence. It
stores this information using Prolog predicates such as eng_sen_verb/1,
The estrwords/4 Prolog predicate represents the bilingual information for each English
root word. Using all this information, the Sinhala morphological generator
generates suitable Sinhala words for the corresponding English words.
snoun([s1000078], td, sg, ma, li, dr, v1, 'හාවා').
sin_fverb([s1000059], td, sg, pr, 'ය').
sin_adjv([s1000076], 'කුඩා').
sin_conj([s1000018], 'සහ').
sin_adjv([s1000077], 'ශාක භක්ෂක').
snoun([s1000060], td, sg, co, li, id, v1, 'සිවුපාවෙක්').
Using all this information, the Sinhala parser generates the appropriate Sinhala
sentence: හාවා කුඩා සහ ශාක භක්ෂක සිවුපාවෙක් ය.
After the successful translation, the HTML parser reads the translated text and
composes the corresponding web page. Using this interface, the user can see the
original English web page and the translated Sinhala web page separately. Figure 7.4
shows the output web interface of the web page translator.
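The separation of tags from translatable text described for the web page translator can be sketched as follows. This is only an illustration: the HtmlTextExtractor class is hypothetical, and a regular expression stands in for the real HTML parser [66], so it would not cope with scripts, comments or malformed markup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the HTML parser step; a regex is an assumed simplification. */
final class HtmlTextExtractor {
    /** Returns the non-empty text segments that lie between tags. */
    static List<String> textSegments(String html) {
        List<String> segments = new ArrayList<>();
        // Replace every tag with a line break, then keep the non-empty lines.
        for (String line : html.replaceAll("<[^>]*>", "\n").split("\n")) {
            String text = line.trim();
            if (!text.isEmpty()) {
                segments.add(text);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String html = "<tr><td>\nThe Rabbit\n</td></tr>\n"
                + "<tr><td>\n<img src=\"trabsl1.jpg\">\nIt lives in the jungle.\n</td></tr>";
        // Each segment would be handed to the translation module in turn.
        for (String segment : textSegments(html)) {
            System.out.println(segment);
        }
    }
}
```

Each extracted segment would then be tried first as a sentence and, if the English parser rejects it, as a noun phrase, as in the "The Rabbit" example.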
When the translation system runs in user mode, the intermediate editor appears only
when the user asks to change the sentence. The following sample data shows how a
translation is processed.
Assume that the system reads "The good boy and his old mother are reading books" as
the input sentence. The English morphological analyzer then returns the following
output.
% Auto generated output
% **********************
eng_input_sen_list(['the', 'good', 'boy', 'and', 'his', 'old',
...
eng_conj([3000027], 0, 'and').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_adjv([3000035], p, 'old').
eng_adjv([3000062], p, 'mother').
eng_noun([1000025], td, sg, no, sb, 'mother').
eng_noun([1000025], td, sg, no, ob, 'mother').
eng_verb([5000026], if, 'are').
eng_verb([3000030], uv, 'are').
eng_verb([5000008], rp, 'reading').
eng_noun([1000004], td, pr, no, sb, 'books').
eng_noun([1000004], td, pr, no, ob, 'books').
...
    3000004, 1000001, 3000027, 4000004, 3000027, 4000004, 3000035, 1000025]).
eng_sen_predicate([3000030, 5000008, 1000004]).
eng_sen_type(declarative).
eng_sen_ekeys([3000003, 3000004, 1000001,
...
Using all this information, the English to Sinhala base word translator returns the
suitable Sinhala terms. The following code displays the result of the English to
Sinhala base word translator.
estrwords(1001, 3000003, 3000000, dt).
estrwords(1002, 3000004, 3000004, aj).
estrwords(1003, 1000001, 1000001, na).
estrwords(1004, 3000027, 3000027, cn).
estrwords(1005, 4000004, 4000004, na).
estrwords(1006, 3000035, 3000035, aj).
estrwords(1007, 1000025, 1000045, na).
Then the Sinhala morphological generator generates suitable Sinhala words with full
grammatical information. The output of the Sinhala morphological generator is as
follows.
sin_adjv([3000004],'').
snoun([1000001], td, sg, ma, li, dr, v1,' ').
sin_conj([3000027],'').
snoun([4000004], td, sg, ma, li, dr, v7,'').
sin_adjv([3000035],'').
snoun([1000045], td, sg, no, nl, dr, v1,'').
sin_sub_info([3000004,
1000001,
3000027,
4000004,
3000035,
1000045]).
sin_sub_word([,
, ,
, , , []]).
7.6 Summary
This chapter described how BEES works in four environments, namely as an online
application, a web page translator, a selected text translator and a desktop
application. The next chapter reports how the system was evaluated to find the
accuracy of the English to Sinhala machine translation.
Chapter 8
EVALUATION
8.1 Introduction
The approach and the implementation stages were discussed in the preceding
chapters. This chapter describes the evaluation of the approach, based on the
hypothesis formulated to test whether BEES is able to translate English text
into Sinhala. It also reports existing evaluation methodologies for machine
translation and our approach to evaluating the English to Sinhala machine
translation system.
several methods are used. These evaluation methods can be categorized into two
groups, namely automated evaluation and human-supported evaluation [98].
A number of standard evaluation metrics are available for automated machine
translation evaluation, such as BLEU [123], NIST [111] and METEOR [21]. These
metrics do not use human support in the evaluation process, and they are much
faster, easier and cheaper than human evaluation [2]. Most of these techniques are
based on n-gram metrics [90].
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the
quality of text that has been machine-translated from one natural language to
another [123]. It is one of the most commonly used evaluation metrics for statistical
machine translation systems. However, it does not provide sentence-level scores
[169].
scoring has been done based on the degree of intelligibility and comprehensibility.
A four-point scale has been used for the evaluation. The highest point is assigned to
a perfect translation and the lowest point to an unintelligible sentence.
Error analysis is one of the important factors in the evaluation of machine
translation systems. Errors are analyzed through the Word Error Rate (WER) and the
Sentence Error Rate (SER). Word error rate is a common metric of the performance
of a speech recognition or machine translation system. Word error rate and sentence
error rate can then be computed as:
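A commonly used formulation of these two rates is given below as an assumption, not as the exact definition used in this evaluation. Here S, D and I are the numbers of substituted, deleted and inserted words in the system output, N is the number of words in the reference translation, E is the number of sentences containing at least one error and T is the total number of sentences; all of these symbols are introduced only for this sketch.

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
\qquad
\mathrm{SER} = \frac{E}{T} \times 100\%
```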
Considering the above facts, the author has developed an evaluation methodology for
the English to Sinhala machine translation system.
Test case: Morphological rules for English Noun

No  Grammar              Morphological structure   Base word  Example
1   Singular noun        Base word                 boy        boy
2   Plural noun          Base + s                  boy        Boys
3   Plural noun          Base + es                 class      Classes
4   Plural noun          Plurals Base y + ies      baby       Babies
5   Plural noun          Plurals Base f + ves      knife      Knives
6   Singular Possessive  Base + 's                 Home       Home's
7   Plural Possessive    Plural + '                boy        Boys'
8   Singular noun        Verb Base + er            play       player
9   Plural noun          Verb Base + ers           play       players
10  Singular noun        Verb Base + ment          Pay        payment
11  Plural noun          Verb Base + ments         Pay        payments
Pattern              Example
Simple Present
Present Continuous
Present perfect
Simple past          I gave a book
Past Continuous
Past perfect
Find the suitable Sinhala base word from the bilingual dictionary with the full
grammatical mapping (if two or more words are available in the bilingual dictionary,
the system uses the concepts dictionary to find the suitable Sinhala base word).
If the grammatical mapping is not satisfied, the system uses the intermediate
editor.
If there is no corresponding Sinhala word for the given English base word in
the bilingual dictionary, the system uses the corresponding Sinhala
transliteration.
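The three cases above can be read as a small decision procedure, sketched below. The BaseWordResolver class, the Resolution names and the order in which the conditions are checked are assumptions made for illustration; the text does not fix that order.

```java
/** Hypothetical decision procedure for choosing how a base word is resolved. */
final class BaseWordResolver {
    enum Resolution { DICTIONARY, CONCEPTS_DICTIONARY, INTERMEDIATE_EDITOR, TRANSLITERATION }

    /**
     * candidateCount: number of Sinhala base words found in the bilingual dictionary.
     * grammaticalMappingSatisfied: whether the full grammatical mapping holds.
     */
    static Resolution resolve(int candidateCount, boolean grammaticalMappingSatisfied) {
        if (candidateCount == 0) {
            return Resolution.TRANSLITERATION;      // no dictionary entry at all
        }
        if (!grammaticalMappingSatisfied) {
            return Resolution.INTERMEDIATE_EDITOR;  // ask the user to intervene
        }
        if (candidateCount > 1) {
            return Resolution.CONCEPTS_DICTIONARY;  // disambiguate between candidates
        }
        return Resolution.DICTIONARY;               // single, grammatically valid entry
    }

    public static void main(String[] args) {
        System.out.println(resolve(1, true));
        System.out.println(resolve(3, true));
        System.out.println(resolve(2, false));
        System.out.println(resolve(0, false));
    }
}
```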
To evaluate the English to Sinhala base word translator, the author has implemented a
test tool that tests the functionality of the bilingual translator. The English to
Sinhala bilingual base word translator has been tested through the created test plan.
,sx.h
m%lD;sh
ksh; tal
example
Add
rem
example
mq
mq
we;a
mq
.Kh
mq
sl
mq
mq
To test the Sinhala morphological generator, the author has implemented a Sinhala word
conjugator, which gives all the Sinhala word forms for a given Sinhala word.
Figure 8.3 shows how the Sinhala word conjugator runs in the SWI-Prolog [143] interface.
The complete set of rules used to implement the Sinhala word generation is attached
at the end of the thesis.
This evaluation form is used to evaluate the English to Sinhala machine translation
system. Figure 8.5 shows the online evaluation test bed and Figure 8.6 shows the user
interface of the online evaluation form.
3. Each set of sentences is given to a human translator, who scores each
sentence using the following criteria (the same as the evaluation form of the
evaluation test bed)
The accuracy and the performance of the system have been calculated through all the
above results.
5. A good boy and his mother have been reading new books
olaI msrs <ufhla iy Tyqf.a uj w,q;a fmd;a lshjka isg we;af;dah
6. The beautiful girl was singing a song
,iaik .eyeKq <uhd .S;hla .dhkd lrka isfhah
7. We had written new books
wms w,q;a fmd;a ,shd ;snqfKuq
8. A good boy reads a good book
olaI ms <ufhla fydo fmd;la lshjhs
9. A new book is written by me
ud iska w;a fmd;la ,shkq ,nhs
10. A new book is being written by my good friend
uf.a fydo ;=rd iska w;a fmd;la ,shka we;
After the evaluation, the following experimental results were collected. Table 8.4
shows the results of the module tests, including English morphological analysis,
English syntax analysis and Sinhala morphological generation. It shows each test case
and the percentage of successful tests.
Percentage (%)
96
90
92
94
90
Sinhala transliteration
80
Table 8.5 shows the results of the human evaluation, including correct
subject-verb agreement, correct tense translation and correct noun and verb
generation. The experimental results show the number of correct sentences or words
out of 200 sample sentences. Each test case shows more than 80% correct results in
the evaluation.
Results
186
190
180
185
200
Table 8.6 shows the accuracy results for the 200 sample sentences. The
experimental results show that 71% of the sample is translated perfectly and 26% of
the sample is basically OK. Therefore, the system gives 97% translation accuracy.
Figure 8.7 shows the results of the system accuracy test.
                      Sentences
Perfect translation   143
Basically OK          52
Meaningless
Error
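As a worked check of the percentages quoted for Table 8.6, the arithmetic below uses only the two sentence counts given there and the sample size of 200. It suggests that the quoted 71% and 97% are rounded down; that reading is an assumption.

```latex
\frac{143}{200} = 71.5\%
\qquad
\frac{52}{200} = 26\%
\qquad
\frac{143 + 52}{200} = \frac{195}{200} = 97.5\%
```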
Using all the above results, the Word Error Rate (WER), the Sentence Error Rate
(SER) and the accuracy of the system are calculated. Table 8 shows the results of the
error calculation.

                            Percentage
Word Error Rate (WER)       7.2 %
Sentence Error Rate (SER)   5.4 %
Accuracy                    89.1 %
8.8 Summary
This chapter discussed the evaluation of the English to Sinhala machine translation
system (BEES). The evaluation was conducted in three steps. First, each module of
the machine translation system was tested with a white-box testing approach, using
the developed testing tools. Then, the system performance was evaluated and the
error rates were calculated from the results of the evaluation test bed. Finally,
the intelligibility and accuracy test was conducted with human support. The
experimental results show 89% accuracy of the overall system, a 7.2% word error rate
and a 5.4% sentence error rate.
Chapter 9
CONCLUSION AND FURTHER WORK
9.1 Introduction
Chapter 8 presented how BEES was evaluated to check the hypothesis that the
concept of Varanegeema (conjugation) can be used to drive English to Sinhala
machine translation. The hypothesis was tested by checking whether BEES is
able to translate English text into Sinhala. This chapter discusses the conclusions
drawn from the evaluation process. It reports 89% overall accuracy of BEES with a
7.2% word error rate and a 5.4% sentence error rate. It also reports some
limitations and further work.
The first objective was to critically review the existing systems for machine
translation. Machine translation is a subfield of Natural Language Processing
within Artificial Intelligence. During the last sixty years, hundreds of machine
translation systems have been developed all over the world. Most of these systems
have been developed using rule-based, statistical, agent-based or human-assisted
approaches. All of these approaches and 35 successful systems were discussed in the
second chapter, together with the available English to Sinhala prototype machine
translation systems. Further, the author critically reviewed the existing concepts
and techniques for natural language processing, with particular attention to
machine translation, and discussed each of them. Therefore, the author has
successfully achieved the first objective.
The final objective was to evaluate the system. The English to Sinhala machine
translation system was evaluated in three stages. First, each module of the
machine translation system was tested with a white-box testing approach, using the
developed testing tools. Then, the system performance was evaluated and the error
rates were calculated through the evaluation test bed. Finally, the intelligibility
and accuracy test was conducted with human support. The experimental results show
89% accuracy of the system, a 7.2% word error rate and a 5.4% sentence error rate.
According to the above facts, the author has successfully achieved this objective too.
9.3 Limitations
The English to Sinhala machine translation system has been developed as a rule-based
system, and the translation is carried out by the translation modules, namely the
English morphological analyzer, English parser, translator, Sinhala morphological
generator and Sinhala sentence composer. The system has several limitations.
The translation system works well on simple sentences. Translation of small
complex sentences also shows reasonably accurate results. However, the translation
system does not successfully handle multi-word expressions, idioms and compound
sentences. At present the lexical resources in the system are limited. For example,
the bilingual dictionary requires regular updating until the system gets away from
the out-of-vocabulary issue.
the development of systems for machine translation from English to those languages.
It should be noted that all languages have a concept similar to conjugation.
The system can also be expanded with more lexical resources such as
dictionaries. In fact, BEES can be updated via the intermediate editor while it is
being used. It would be appropriate to encourage human-assisted translation until the
system matures with enough resources.
Handling compound sentences and extending the parser to handle more
grammatical structures would be another direction of further work. In addition,
it would be worth considering the use of agent technology for improving various
aspects of BEES, including semantic handling and autonomous updating of lexical
resources.
Sinhala to English machine translation (the reverse of BEES) would be yet
another interesting piece of further work.
9.5 Summary
This chapter provided the conclusions for each objective and the limitations of the
English to Sinhala machine translation system. In addition, it pointed out some
further work related to the English to Sinhala machine translation system. The
conclusions supported the author's aim of developing a machine translation system
that works through the concept of Varanegeema. Based on the hypothesis formulated
in this thesis, the author's evaluation revealed that the English to Sinhala machine
translation system (BEES) is able to achieve the aim and objectives of this thesis.
References
[127] O. Ricardo et al., New algorithms assessing short summaries in expository texts
using latent semantic analysis, Behavior Research Methods, 2009, pp. 944-950.
[128] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Pearson
Education Inc, New Jersey, 1995.
[129] G. Rzevski, A new direction of research into Artificial Intelligence, Sri Lanka
Association for Artificial Intelligence 5th Annual Sessions, 2008.
[130] G. Rzevski, J. Himoff, P. Skobelev, "MAGENTA Technology: A Family of Multi-Agent
Intelligent Schedulers", conference on multi-agent systems in Karlsruhe,
February 2006.
[131] G. Rzevski Home page, Available: http://www.rzevski.net/
[132] C. Samuelsson, Notes on LR parser design, International Conference On
Computational Linguistics, Proceedings of the 15th conference on Computational
linguistics - Volume 1, Japan, 1994, pp. 386-390.
[133] Sanjay K. D. and Pramod P. S., Machine Translation System in Indian Perspectives,
Journal of Computer Science, 2010, pp. 1082-1087.
[134] U. S. Sannasgala, A. Perera, ViyakaranaVimansawa, Sanhida Mudranasaha
Prakashana, Pannipitiya, Sri Lanka, 1995.
[135] K. Shin-ichiro, M. Kazunori, Interlingua Developed and Utilized in Real
Multilingual MT Product Systems, AMTA/SIG-IL First Workshop on Interlinguas, 1997.
[136] B. Scott, A. Barreiro, "OpenLogos MT and the SAL representation language", In
Proceedings of the First International Workshop on Free/Open-Source Rule-Based
Machine Translation, 2009, pp. 19-26.
[137] R. Sinha, A. Jain, AnglaHindi: an English to Hindi machine-aided translation
system, MT Summit IX, New Orleans, USA, 2003, pp. 494-497.
[138] Sinhala Unicode, Available: http://www.locallanguages.lk
[139] H. Somers, Round-Trip Translation: What Is It Good For?, Proceedings of the
Australasian Language Technology Workshop, Australia, 2005, pp. 127-133.
[140] B. Srinivas, H. Patrick, K. Stephan, Statistical Machine Translation through Global
Lexical Selection and Sentence Reconstruction, Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics, Czech Republic,
Association for Computational Linguistics, 2007, pp. 152-159.
[141] M. H. Stefanini, Y. Demazeau, TALISMAN: A multi-agent system for natural
language processing, In Proceedings of SBIA'95, Springer Verlag, 1995, pp. 312-322.
[142] A. Stevenson, J. Elliott, R. Jones, The Little Oxford English Dictionary, Oxford
University Press, 2002.
[143] SWI-Prolog, Available: http://www.swi-prolog.org/index.html.
[144] SYSTRAN, Available: http://www.systransoft.com.
[145] I. Tatsuya, K. Akira, K. Yuka, Toshiba Rule-Based Machine Translation System,
NTCIR-7 PAT MT, Proceedings of NTCIR-7 Workshop Meeting, Japan, 2008, pp.
430-434.
[146] TDIL, Technology Development for Indian Languages, Available:
http://tdil.mit.gov.in/mat/ach-mat.htm.
[147] D. Thierry, A Short Introduction to Text-to-Speech Synthesis, TTS research team,
TCTS Lab, 1999, Available:
http://tcts.fpms.ac.be/synthesis/introtts_old.html
[148] P. Terence, ANTLR, ANother Tool for Language Recognition, 2008, Available:
http://www.antlr.org.
Appendix A:
English Morphological analyzer- Test plan
The following rules are used to analyze English regular words such as nouns, verbs
and adjectives. In addition to these rules, other available words such as irregular
nouns, irregular verbs, adjectives, adverbs, conjunctions and articles are directly
identified from the English dictionary.
Test case: Morphological rules for English Noun

No  Grammar                   Morphological structure   Base word  Example
1   Singular noun             Base word                 boy        boy
2   Plural noun               Base + s                  boy        Boys
3   Plural noun               Base + es                 class      Classes
4   Plural noun               Plurals Base y + ies      baby       Babies
5   Plural noun               Plurals Base f + ves      knife      Knives
6   Singular Possessive       Base + 's                 Home       Home's
7   Plural Possessive         Plural + '                boy        Boys'
8   Singular noun             Verb Base + er            play       player
9   Plural noun               Verb Base + ers           play       players
10  Singular noun             Verb Base + ment          Pay        payment
11  Plural noun               Verb Base + ments         Pay        payments
12  Singular noun             Verb Base + ion
13  Plural noun               Verb Base + ions
14  Singular noun (female)    Base Noun + ess

Test case: Morphological rules for English Adjective

No  Grammar                   Morphological structure   Base word  Example
15  (Positive) Adjective      Adjective Base            good       good
16  (Positive) Adjective      Noun Base + ish           Boy        Boyish
17  (Positive) Adjective      Noun Base + ful           Care       Careful
18  (Positive) Adjective      Noun Base + less          shame      Shameless
19  (Positive) Adjective      Noun Base + en            gold       Golden
20  (Positive) Adjective      Verb Base + less          Tire       Tireless
21  (Positive) Adjective      Verb Base + ative         Talk       Talkative
22  (Positive) Adjective      Verb Base + able          Move       Moveable
23  Adjective (comparative)   Adjective + er            sweet      Sweeter
24  Adjective (comparative)   Adjective + r             fine       finer
25  Adjective (comparative)   Adjective y + ier         happy      Happier
26  Adjective (Superlative)   Adjective + est           sweet      Sweetest
27  Adjective (Superlative)   Adjective + st            fine       finest
28  Adjective (Superlative)   Adjective y + iest        happy      Happiest

Test case: Morphological rules for English regular verbs

No  Grammar                   Morphological structure   Base word  Example
29  Infinitive                Base                      play       play
30  Past                      Base + ed                 play       Played
31  Present Participle        Base + ing                play       Playing
32  Past Participle           Base + ed                 play       Played
33  Past                      Base + d
34  Past Participle           Base + d
35  Past                      Base + ped
36  Past Participle           Base + ped
37  Past                      Base + ied
38  Past Participle           Base + ied
39  Simple present tense      Base + s                  play       Plays
40  Present Participle        Base + ing                walk       Walking
41  Simple present tense      Base + s                  walk       Walks

Test case: irregular verbs

No  Grammar                   Morphological structure   Base word  Example
42  Simple present tense      Base + es                 go         goes
43  Present Participle        Base e + ing              write      writing
44  Present Participle        Base + ting
45  Present Participle        Base + ring
46  Present Participle        Base e + ing              write      writing

Other test cases

No  Test case      Grammar          Morphological structure   Base word  Example
47  Determination  Direct/indirect  the/a, an                 a          a
48  Adverb         Adverb           Base                      quickly
49  unknown        Unknown word     Base                      Budditha   Budditha
50  Conjunction    Conjunction      Base                      and        and
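A few of the noun rules in this test plan (Base + s, Base + es, Base y + ies and Base f + ves) can be sketched as a lexicon-checked analyzer. This is an illustration under stated assumptions rather than the BEES analyzer: the NounPluralAnalyzer class and its tiny lexicon are hypothetical, and the extra "ves" to "fe" rule is added here so that "Knives" maps back to "knife".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical analyzer for a subset of the English noun plural rules. */
final class NounPluralAnalyzer {
    // Each rule is {suffix to strip, text to restore}; rules are tried in order.
    private static final String[][] RULES = {
        {"ies", "y"}, {"ves", "f"}, {"ves", "fe"}, {"es", ""}, {"s", ""}
    };

    private final Set<String> lexicon;

    NounPluralAnalyzer(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    /** Returns the base noun, or null when no rule yields a word in the lexicon. */
    String baseOf(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lexicon.contains(lower)) {
            return lower; // already a base word
        }
        for (String[] rule : RULES) {
            if (lower.endsWith(rule[0])) {
                String stem = lower.substring(0, lower.length() - rule[0].length());
                String candidate = stem + rule[1];
                if (lexicon.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed mini lexicon built from the base words in the test plan.
        Set<String> lexicon = new HashSet<>(Arrays.asList("boy", "class", "baby", "knife"));
        NounPluralAnalyzer analyzer = new NounPluralAnalyzer(lexicon);
        for (String word : new String[] {"Boys", "Classes", "Babies", "Knives"}) {
            System.out.println(word + " -> " + analyzer.baseOf(word));
        }
    }
}
```

Checking every candidate against the lexicon keeps the suffix rules from over-generating, which mirrors the test plan's split between rule-based regular words and dictionary-listed irregular ones.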
Appendix B:
Conjugation Table for Sinhala Language
Sinhala Noun Conjugation (Singular forms)
wxl
h
,sx.h
m%lD;sh
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
exampl
e
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
Fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
ksia
Wmdil
mq
lmq
jd
33
mq
jiq
34
mq
bis
35
36
37
mq
mq
mq
bns
l,jeos
fn,s
38
39
40
mq
mq
mq
fmdvs
usgs
fld,q
41
mq
fnanoq
ai
d
ai
d
and
aod
A,
d
avd
agd
A,
d
aod
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
.Kh
we;a
.Kh
w,s
.Kh
;drd
.Kh
jiq
.Kh
ksh; tal
a
hd
jd
hd
jd
jd
d
j
h
sh
example
;reKhd
foaj;djd
<uhd
.srjd
;drdjd
us;=rd
w.k
,sh
fj<U
wx.kdj
hqj;sh
.eyeksh
l;
uj
wdorh
/h
wl=r
.dia;=j
ish,a,
f.h
tl
nsl
..
we;d
fldld
f.dkd
kslud
lsUq,d
usksid
Wmdilh
d
lmqjd
fhla
fjla
fhla
fjla
fjla
frla
la
la
la
jla
hla
shla
la
la
hla
hla
la
jla
Q,la
hla
la
la
.la
f;la
flla
fkla
fula
f,la
fila
fhla
jiaid
afila
jiafila
biaid
afila
biafila
s
s
s
bnand
l,jeoaod
fn,a,d
afnla
afola
Af,la
s
s
s
bnafnla
l,jeoafola
fn,af,la
s
s
q
fmdvavd
usgagd
fld,a,d
afvla
afgla
Af,la
s
s
q
fmdvafvla
usgafgla
fld,af,la
fnanoaod
afola
fnanoafola
d
d
d
e
-
q
a
a
h
h
e
j
A,
h
.
d
d
d
d
d
d
hd
wksh; Wla;
q
a
a
a
x
a
a
a
a
a
a
r
d
d
d
re
q
a
a
e
q
a
a
a
x
;a
la
ka
ua
,a
ia
fjla
example
;reKfhla
foaj;dfjla
<ufhla
.srfjla
;drdfjla
;=frla
w.kla
,shla
fj<Ula
Wx.kdjla
hqj;shla
.eyekshla
l;la
ujla
wdorhla
/hla
wl=rla
.dia;=jla
ish,a,la
f.hla
tlla
nslla
..la
wef;la
fldflla
f.dfkla
kslfula
lsUqf,la
usksfila
Wmdilfhla
lmqfjla
42
mq
Wl=iq
43
44
mq
mq
llal=gq
ldl=
45
mq
46
47
48
49
50
51
52
53
54
55
56
57
58
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
w
w
w
lerfm
d;=
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
f.dak
fjo
f,v
fmd;a
wlaIr
NdId
w
w
w
w
w
w
w
w
w
w
w
w
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
w
w
w
w
w
w
w
w
w
w
w
w
w
w
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
fl<
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
lmqgq
.Kh
nur
.Kh
fmd;a
wlaIr
NdId
.Kh
ms,s
.Kh
wl=re
.Kh
fmdf,da
.Kh
kqjr
.Kh
uq;=
.Kh
ai
d
agd
al
d
A;
d
d
d
d
d
d
d
d
d
d
d
Wl=iaid
afila
Wl=iafila
q
q
llal=gagd
ldlald
afgla
aflla
q
q
Llal=gafgla
ldlaflla
lerfmd
;a;d
lmqgd
weoqrd
neuqKd
ljqvd
.=re,d
yQkd
nurd
f.dakd
fjod
f,vd
fmd;
wlaIrh
NdIdj
Af;la
fgla
frla
fKla
fvla
f,la
fkla
frla
fkla
fola
fvla
la
hla
jla
gqq
re
Kq
vq
,q
kq
r
k
o
v
a
lerfmd;af;
la
lmqfgla
weoqfrla
nuqfKla
ljqfvla
.=ref,la
yQfkla
nufrla
f.dafkla
fjfola
f,fvla
fmd;la
wlaIrhla
NdIdjla
q
e
q
q
q
q
a
h
j
A,
A;
ao
ak
au
ai
al
A;
ao
A,
ai
ka
o
.
o
U
s
s
s
s
s
s
=
=
q
q
q
os
ms,a,
le;a;
froao
bkak
neuau
mdmsiai
fldlal
w;a;
wjqreoao
l=,a,
fldiai
wekao
A,la
A;la
aola
akla
aula
aila
alla
A;la
aola
A,la
aila
s
s
s
s
s
s
=
=
q
q
q
ms,a,la
le;a;la
froaola
bkakla
neuaula
mdmsiaila
fldlalla
w;a;la
wjqreoaola
l=,a,la
fldiaila
e
q
q
q
s
wl=r
Wl=K
loq<
f;dg
fljsg
wK
iq,.
<o
i<U
kqjr
la
la
la
la
la
la
.la
ola
Ula
la
e
q
q
q
wl=rla
wl=Kla
loq<la
f;dgla
fljsgla
wKla
iq,.la
<sola
i<Ula
kqjrla
fld<
j,
x
x
x
x
x
x
fl<j,
wxlh
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
.K
h
we;a
.K
h
w,s
.K
h
;drd
.K
h
jiq
.K
h
,sx.h
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
m%lD;sh
example
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
Fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
ksia
Wmdil
a
fhl=
fjl=
fhl=
fjl=
fjl=
frl=
l
l
l
jl
hl
shl
l
l
hl
hl
l
jl
a,l
hl
l
l
.la
l=
fll=
fkl=
ful=
f,l=
fil=
hl=
mq
lmq
fjl=
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
jiq
bis
bns
l,jeos
fn,s
fmdvs
usgs
fld,q
fnanoq
Wl=iq
llal=gq
ldl=
Ail=
Ail=
Anl=
Aol=
A,l=
Avl=
Agl=
A,l=
Aol=
Ail=
Agl=
All=
wksh; wkqla;
r
example
;reKfhl=
foaj;dfjl=
d
<ufhl=
d
,srfjl=
d
;drdfjl=
re
;=frl=
w.kl
,shl
fj<Ul
wx.kdjl
hqj;shl
q
.eyekshl
a
l;l
a
ujl
wdorhl
/hl
e
wl=rl
.dia;=jl
q
ish,a,l
a
f.hl
a
tll
a
nsll
x
..l
a
we;l=
a
fldfll=
a
f.dfkl=
ua
kslful=
,a
lsUqf,l=
ia
Usksfil=
Wmdilfhl=
lmqfjl=
q
s
s
s
s
s
s
q
q
q
q
q
jiail=
biail=
Bnanl=
l,jeoaol=
fn,a,l=
fmdvavl=
usgagl=
fld,a,l=
fnanoaol=
Wl=iail=
Llal=gagl=
ldlall=
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
mq
mq
mq
mq
lmqgq
.Kh mq
mq
mq
mq
nur mq
.Kh mq
mq
fmd; w
a
.Kh
wlaI w
r
.Kh
NdId w
.Kh
w
w
w
w
w
w
w
ms,s w
.Kh w
w
w
w
w
w
w
w
wl=re w
.Kh w
w
w
w
w
fmdf w
w
,da
.Kh w
kqjr w
.Kh
uq;=
w
.Kh
lerfmd;a;l=
lmqgl=
weoqrl=
nuqKl=
ljqvl=
.=re,l=
yQkl=
nurl=
f.dakl=
fjol=
f,vl=
fmd;la
lerfmd;=
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
f.dak
fjo
f,v
fmd;a
A;l=
l=
l=
l=
l=
l=
l=
l=
l=
l=
l=
l
wlaIr
hl
wlaIrhl
NdId
jl
NdIdjl
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
A,l
A;l
aol
akl
aul
ail
all
A;l
aol
A,l
ail
s
s
s
s
s
s
=
=
q
q
q
ms,a,l
le;a;l
froaol
bkakl
neuaul
mdmsiail
fldlall
w;a;l
wjqreoaol
l=,a,l
fldiail
l
l
l
l
l
l
l
l
l
l
e
q
q
q
wl=rl
wl=Kl
q
q
q
q
q
q
x
x
x
f;dgl
fljsgl
wKl
iq,.l
,sol
i<Ul
kqjrl
fl<
.Kh
we;a
.Kh
w,s
.Kh
;drd
.Kh
jiq
.Kh
lmqgq
.Kh
nur
,sx
.
h
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
m%lD;sh
nyqjpkWla;
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
Usksia
Wmdil
fhda
fjda
hs
ja
fjda
frda
fkda
fhda
fUda
fjda
fhda
yq
jre
mq
example
r
-
example
a
hska
jka
hska
jqka
jka
ka
ka
ka
qka
jka
hka
ka
=ka
jreka
hka
j,a
-
;reKfhda
foaj;dfjda
<uhs
.srja
;drdfjdA
Us;=frda
w.fkda
,sfhda
fj<fUda
wx.kdfjda
hqj;sfhda
.eyekq
l;ayq
ujqjre
wdor
/hj,a
wl=re
.dia;=
j,a
f.j,a
d
d
d
re
k
U
-
nyqwkql;
a
r
d
d
d
e
-
a
-
example
;reKhska
foaj;djka
<uhska
.srjqka
;drdjka
Us;=rka
w.kka
,shka
fj<Uqka
wx.kdjka
hqj;shka
.eyekqka
l;=ka
ujqjreka
wdorhka
/hj,a
wl=re
.dia;=
nsla
;=
l=
kq
uq
,q
iq
fhda
nsla
.x
we;a;=
fldlal=
f.dkakq
ksluauq
lsUq,a,q
usksiaiq
Wmdilfhda
=ka
l=ka
kqka
uqka
,qka
iqka
hka
we;=ka
fldlalk
= a
f.dkakqka
ksluqka
lsUq,qka
usksiqka
Wmdilhka
lmq
fjda
lmqfjdA
jka
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
jiq
bis
bns
l,jeos
fn,s
fmdvs
usgs
fld,q
fnanoq
Wl=iq
llal=gq
ldl=
lerfmd;=
afida
afida
afnda
afoda
Af,da
afvda
afgda
Af,da
afoda
Afida
afgda
aflda
Af;da
q
s
s
s
s
s
s
q
q
q
q
q
q
jiafida
biafida
bnafnda
l,jeoafoda
fn,af,da
fmdvafvda
usgafgda
fld,Af,da
fnanoafoda
Wl=iafida
llal=gafgda
aika
aika
anka
aoka
A,ka
avka
agka
A,ka
aoka
aika
agka
alka
A;ka
mq
mq
mq
mq
mq
mq
mq
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
fgda
frda
fKda
fvda
f,da
fkda
e
gq
gq
Kq
vq
,q
kq
lerfmd;af
;da
lmqfgda
weoqfrda
nuqfKda
ljqfvda
.=ref,da
yQfkda
nure
ka
ka
ka
ka
ka
ka
eka
lmqjka
q
s
s
s
s
s
s
q
q
q
q
q
q
q
q
q
q
q
q
jiaika
biaika
bnanka
l,jeoaoka
fn,a,ka
fmdvavka
usgagka
fld,a,ka
fnanoaoka
Wl=iaika
llal=gagka
ldlalka
lerfmd;a;
ka
lmqgka
weoqrka
nuqKka
ljqvka
.=re,ka
yQkka
nureka
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
.Kh
fmd;a
.Kh
wlaIr
.Kh
NdId
.Kh
ms,s
.Kh
wl=re
.Kh
fmdf,da
.Kh
kqjr
.Kh
uq;=
.Kh
mq
mq
mq
w
f.dak
fjo
f,v
fmd;a
Akq
Aoq
Avq
f.dakakq
fjoaoq
f,vavq
fmd;
qka
qka
qka
f.dakqka
fjoqka
f,vqka
fmd;a
wlaIr
wlaIr
wlaIr
NdId
NdId
NdId
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
ms,s
Le;s
fros
bks
neus
mdmsis
fldl=
w;=
wjqreoq
l=,q
fldiq
ms,s
Le;s
fros
bks
neus
mdmsis
fldl=
w;=
wjqreoq
l=,q
fldiq
wl=re
wl=Kq
Loq,q
F;dgq
fljsgs
wK
iq,x
,sx
i<x
kqjr
wl=re
wl=Kq
Loq,q
f;dgq
fljsgs
wK
iq,x
,sx
i<x
kqjr
fl<
Appendix C:
Context-Free Grammar for Sinhala Language
Grammar notations
SubP = Subject Phrase
VebP = Verb Phrase
Sub = Subject
Obj = Object
ObjP = Object Phrase
AdjSub = Attributive adjunct of Subject
AdjObj = Attributive adjunct of Object
Pre = Predicate
AdjPre = Attributive adjunct of Predicate
AdjCmp = Attributive adjunct of Complement
CmpPre = Complement of predicate
CmpPreP = Complement of predicate phrase
VebP
SubP → Sub
SubP → AdjSub Sub
VebP → ObjP PreP
VebP → PreP
ObjP → Obj
ObjP → AdjObj Obj
PreP → AdjPre CmpPreP
PreP → CmpPreP
CmpPreP → Pre
CmpPreP → Pre CmpPre
CmpPre → Cmp
CmpPre → AdjCmp Cmp
Sub → Noun
AdjSub → Noun
Obj → Noun
AdjObj → Noun
AdjPre → Adv
Cmp → Noun
AdjCmp → Noun
AdjObj → Noun
AdjObj → Noun Noun
AdjObj → Noun Preposition Noun
AdjSub → Noun
AdjSub → Noun Noun
AdjSub → Noun Preposition Noun
Adv → Advb
Adv → Advb Preposition Advb
Pre → Verb
Pre → MisKri Verb
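The productions above can be exercised with a small recognizer. The following is only a minimal sketch, not the parser used in this work: it assumes a start rule S → SubP VebP (an assumption, not listed among the rules), treats Noun, Verb, Advb, Preposition and MisKri as terminal part-of-speech tags, and uses hypothetical names (GRAMMAR, ends, accepts) chosen for illustration.

```python
# Productions from Appendix C. The start rule "S" is an assumption made
# for this sketch; every other entry mirrors a rule in the list.
GRAMMAR = {
    "S": [["SubP", "VebP"]],  # assumed start rule
    "SubP": [["Sub"], ["AdjSub", "Sub"]],
    "VebP": [["ObjP", "PreP"], ["PreP"]],
    "ObjP": [["Obj"], ["AdjObj", "Obj"]],
    "PreP": [["AdjPre", "CmpPreP"], ["CmpPreP"]],
    "CmpPreP": [["Pre"], ["Pre", "CmpPre"]],
    "CmpPre": [["Cmp"], ["AdjCmp", "Cmp"]],
    "Sub": [["Noun"]],
    "Obj": [["Noun"]],
    "Cmp": [["Noun"]],
    "AdjCmp": [["Noun"]],
    "AdjPre": [["Adv"]],
    "AdjSub": [["Noun"], ["Noun", "Noun"], ["Noun", "Preposition", "Noun"]],
    "AdjObj": [["Noun"], ["Noun", "Noun"], ["Noun", "Preposition", "Noun"]],
    "Adv": [["Advb"], ["Advb", "Preposition", "Advb"]],
    "Pre": [["Verb"], ["MisKri", "Verb"]],
}


def ends(symbol, tags, start):
    """Return every position at which `symbol` can end when it begins at `start`."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the tag at the current position.
        if start < len(tags) and tags[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            # Advance every partial match through the next symbol.
            positions = {e for p in positions for e in ends(part, tags, p)}
        result |= positions
    return result


def accepts(tags):
    """True if the whole tag sequence can be derived from the assumed start symbol."""
    return len(tags) in ends("S", tags, 0)
```

For instance, accepts(["Noun", "Noun", "Verb"]) holds (subject, object, predicate), while accepts(["Verb", "Noun"]) does not, because every derivation must begin with a subject phrase. Plain recursive descent is sufficient here because none of the productions is left-recursive.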
Appendix D:
Finite State Transducer for Sinhala Transliteration
Model 1: For Original English text
[Figure: FST state diagram with states q0, V1 to V4, C, C1 to C8, D, D1 and D2]
q0 = {b,c,d,f,g,h,j,k,l,m,n,p,q,r,s,t,v,w,x,y,z}
FST for Consonants in Model 1 transliteration
[Figure: FST state diagram with states I, V2 to V7, D1, Q1 and Q2]
Q1 = { a, e, ,i, o, u, , }, Q2 = { a, e, i }
FST for Consonants in Type 2 transliteration
[Figure: FST state diagram with states C, C1 to C7, D1 to D4, Q1 and Q2]
Q1 = { k, g, c, j, t, d, b, m, y, r, f, v, s, h, l, n, p }
Q2 = { k, g, c, j, t, d, b, s, p }
FST for Consonants in Type 2 transliteration
Appendix E:
Sample Evaluation Form
Appendix F:
Sample of Evaluators' Comments
The following sample shows some of the evaluators' comments on the evaluation.