Information Theoretical Complexities in Developing A Bilingual Corpus: Critical Comparison Hindi and Marathi
Abstract
A critical comparison of the Hindi and Marathi languages, and its implications for building a
bilingual corpus from an IT perspective, is presented here. We strongly believe that the effort
required to build a corpus can be reduced if the similarities between the languages are properly
accommodated in the design and development of the corpus. If the corpus is domain specific, the
similarities can be exploited further to arrive at a more practicable design with less effort. The
paper also discusses the challenges that the dissimilarities between the two languages pose in
building the corpus.
As a first step towards such a design, we have attempted to provide formal definitions of the basic
concepts in terms that IT designers can readily understand. Results from two elementary experiments
have been used to gain insight into the complexities involved in the design of a bilingual corpus.
Hypotheses have been laid down and established which could help in understanding the complexities
involved in the design and development of a bilingual corpus.

1. Introduction

1.1. Human language is a most exciting and demanding puzzle

Theoretical CL (Computational Linguistics) takes up issues in theoretical linguistics and cognitive
science. Formal theories have reached a degree of complexity that can only be managed by employing
computers [22]. Models simulating aspects of the human language faculty facilitate implementing
them as computer programs, which in turn constitute the basis for the evaluation and further
development of the theories. Natural language applications built to run on computers have
tremendous importance. The first requirement when writing natural language applications is the
support of a good corpus. A corpus is representative of a language and is useful for any linguistic
analysis [16]. Building a corpus, or parallel corpora, is a task with both linguistic and
information-theoretical aspects to be taken care of. When we say applications, in today's scenario
we expect practically every application to run off-line on office computers or client machines at
the end-user level, with possible web support.

1.2 Bilingual Applications
If one is concerned with two natural languages at a time, then a bilingual corpus is what is
needed. Computational linguists have defined various characteristics of a good bilingual corpus
[12][24]. However, from the point of view of using computers as the cognitive artifacts in
applications, we need to identify the IT-related aspects. Storage formats, the appropriate choice of
databases, and the proper choice of tagging and aligning tools assume tremendous importance while
building the corpus [24]. Facilitating easy browsing and providing a proper API, so that software
developers can easily write applications which interface naturally with the corpus, are the factors
one should be concerned about while building a bilingual corpus.

We feel that earlier researchers have concentrated more on linguistic aspects, both pure and
computational in nature, and have not provided adequate linkage to down-to-earth software
engineering requirements. It is hoped that this article will help developers get a clearer picture
of the corpus, which in turn will make it easier to select software environments to support related
projects.
We have preferred to use the word e-corpus to emphasize the fact that the format will be
electronic, stored in such a way that it is readily accessible to anyone who is computer literate.
It will be easily browsable, without special browser requirements that would hinder the process.
Developers will be able to create, with little effort, applications whose interfaces use the corpus
at the back-end. As the title of the paper indicates, the chosen languages are Hindi and Marathi.
2.2
2.3 Formal Definitions

We present here the formal definitions that are requisite in building an e-corpus.
[Most of the following definitions have been adapted from William H Wilson, 2012,
http://www.cse.unsw.edu.au/~billw/nlpdict.html.]

[...] different sentences depending on the context [8]. A list of tags is given in Table 2.

Def 3: Part-of-speech tagging is the process of labelling each word, w, in each sentence, s, with
its part-of-speech category, t.

Def 4: We will refer to the tagset defined in Table 2 as the default tag-set.

Def 5: Storage format is the standardized format that is used to store the metadata so that it is
machine readable and interpretable. There are many standardized formats for encoding, such as
   Text Encoding Initiative (TEI),
   Translation Memory Exchange (TMX),
   Corpus Encoding Standard for XML (XCES) [24],
   IMS Corpus Workbench.
Each of the formats has its own advantages and disadvantages. We feel that the choice of format by
the corpus builder will not create any incompatibility with the choice of tagging tool, or of
alignment tools at a later stage, when the e_corpus is actually built.

Def 6: A RAW_resource is always a pair of text files in the corresponding languages, namely L1 and
L2, using Unicode, one being the translation of the other for a bilingual corpus. This means that we
avoid resources which could contain images or sound files, and that we assume the raw resource is
in Unicode format. Further, a resource is input into the corpus only when the builder is satisfied
with its translation into the other language.

Def 7: The sum total of the sizes of all RAW_resource files, in number of bytes, will be referred
to as the size of the corpus.

Obviously the actual footprint of the total content would be far larger than the 'size of the
corpus', as that would include processed files, accompanying software resources etc.

Def 8: The length of a sentence, s, will always be specified as 'number of tokens', k.

Token length of a sentence and byte length of a sentence are two different metrics, both of which
we will need in our analysis. Hence a separate definition of byte length is proposed. Tokens are
words separated by a token delimiter.

Def 9: The byte length of a sentence is the total number of bytes that constitute the sentence,
including white spaces and one byte for the sentence delimiter.

Choice of the Database System: Use of a database system is a necessity while building a corpus.
XML will be treated as the default database unless otherwise specified. The reasons for choosing
XML as the default database system are its interoperability with most platforms and programming
languages, and its software and hardware independence in the way information is stored. Since
Unicode is supported, it is ideal for storing text of any language, and allows the use of simple
text editors for the content part [24]. Most development tools will have natural compatibility with
XML.
The other options could be any of the standard relational databases such as MySQL, Oracle etc. The
choice of database should not affect the usability of the corpus.

Def 10: Paragraph is a subdivision of the text file of finite length, identified by special
delimiters like spaces, new line characters, tabs etc. Sometimes its length may be given as a
number of sentences.

Def 11: Context_Tags is a predefined set of tags (like the set of PoS tags) which can be associated
with each of the paragraphs. There can be different classes of tags: linguistic, situational and
cultural [19].
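
The two sentence metrics of Defs 8 and 9, and the use of XML as the default store, can be made concrete with a short sketch. This is only an illustration under stated assumptions: whitespace is taken as the token delimiter, UTF-8 is taken as the Unicode encoding (the text above only says "Unicode"), and the element and attribute names (pair, s, lang, tokens, bytes) are hypothetical, not part of TEI, TMX or XCES.

```python
import xml.etree.ElementTree as ET

# Def 9 counts exactly one byte for the sentence delimiter.
SENTENCE_DELIMITER_BYTES = 1


def token_length(sentence: str) -> int:
    """Def 8: number of tokens k, assuming whitespace-delimited tokens."""
    return len(sentence.split())


def byte_length(sentence: str, encoding: str = "utf-8") -> int:
    """Def 9: bytes of the sentence (white space included) plus one delimiter byte.

    The sentence is passed without its final delimiter.
    UTF-8 is an assumption; the paper does not fix an encoding.
    """
    return len(sentence.encode(encoding)) + SENTENCE_DELIMITER_BYTES


def aligned_pair_element(l1_sentence: str, l2_sentence: str,
                         l1: str = "hi", l2: str = "mr") -> ET.Element:
    """Build a hypothetical <pair> element for one aligned sentence pair."""
    pair = ET.Element("pair")
    for lang, text in ((l1, l1_sentence), (l2, l2_sentence)):
        attrib = {
            "lang": lang,
            "tokens": str(token_length(text)),
            "bytes": str(byte_length(text)),
        }
        sentence = ET.SubElement(pair, "s", attrib=attrib)
        sentence.text = text
    return pair


if __name__ == "__main__":
    # Illustrative Hindi/Marathi pair, sentence delimiters omitted.
    hindi = "क्या दर्द शरीर के अन्य अंगो में फैलता है"
    marathi = "ही वेदना शरीराच्या अन्य भागात पसरते का"
    print(ET.tostring(aligned_pair_element(hindi, marathi), encoding="unicode"))
```

Because each Devanagari code point occupies three bytes in UTF-8, the byte length grows much faster than the token length, which is why the two metrics are kept separate.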

Def 12: Context_tagging is the process of tagging each paragraph in a text file with predefined
tags.

Def 13: Since we are concerned with a bilingual product, the concept of "Direction" assumes
considerable importance. The e-corpus will have a direction specified as one of the elements of the
set {Uni, Bi}.

If the choice is Uni, then the alignment will be 'L1 to L2' or 'L2 to L1', which again will be
clearly specified in the definition. The choice 'Bi' means bi-directional and needs no further
specification since it is symmetric anyway. A bidirectional corpus in any case includes both
unidirectional corpora as recoverable sub-corpora [12]. The corpus will have various sub-corpora
that will be aligned (text by text, paragraph by paragraph, sentence by sentence, phrase by phrase
and word by word) [12].

Having defined the basic components in clear formal terms, we are now in a position to provide a
good implementable definition of monolingual e_corpus and bilingual e_corpus.

Def 14: Bilingual e_corpus

Bilingual e_corpus is a quadruplet
{Lan_Names, RawLang_Files, ProLang_Files, SoftToManage}
with the following characteristics:
1. It has a specified pair of languages associated with it. {Lan_Names = (L1, L2)}
2. It consists of a "repository" of contents included in the containers
   {RawLang_Files, ProLang_Files}, the combined size of which will be called the size of the
   corpus. RawLang_Files are a collection of resources, which are basically pairs of text files as
   defined in Def 6, whereas ProLang_Files are the XML files which provide exact alignment and
   tagging of the raw resources.
3. It is "Intelligent" in the sense that adequate interfaces are provided to
   add/modify/delete information,
   perform various linguistic operations,
   browse and extract information for user applications
   (and a lot more related functionalities).
4. It has a software component "SoftToManage", an integrated package with the utilities required
   to manage the repository, which also contains proper APIs to facilitate application development
   in Java/C++.
5. It is noiseless (possible noise is spelling mistakes, incorrect translations, incorrect
   character encoding, missing words). In short, it will have no linguistic inconsistencies.

The ProLang_Files is essentially a collection of tagged-repositories etc. obtained from the
RawLang_Files, arranged and structured in such a way as to facilitate the utilities in
SoftToManage to work properly.

Def 15: Context
Context is the physical environment in which a word is used [19]. A word can have a different POS
tag based on the context in which it is used [13]. Lexical ambiguity that arises in different
situations can be resolved using the contextual information available in the text [13].

3. Linguistic Similarities and their Information Theoretic Implications

Verbs   Number   Singular, Plural                   NIL
        Aspect   Imperfective, Perfective           NIL
        Mood     Indicative, imperative, optative   Subjunctive, conditional (Marathi)

Table 6. Number of words required to cover a certain percentage of the corpus.
% of Corpus   Hindi   Marathi
10%           4       17
20%           10      67
30%           26      213
40%           77      548
50%           199     1247
60%           486     2882
70%           1158    6922
80%           2874    18874

Table 7. A comparative list of syllables and words in Hindi and Marathi
                                     Hindi      Marathi
Corpus size (in no. of words)        2986063    1872345
Word types                           127241     210578
Syllable types                       3994       3757
Average no. of syllables in a word   2.23       2.97
Syllable mode                        2          3
Most frequent syllable               Ra         wa
Bigram syllable types                65697      69023
Most frequent bisyllable             ka:ra      A:he
Most frequent trisyllable            Hindi: u:na:kI, a:pa:ne, a:pa:nI, i:sa:ke, u:sa:kI
                                     Marathi: mha:NU:na, mha:Na:je, A:pa:lyA, A:he:wa, ka:rU:na
Maximum word length                  20         23
Average word length                  4.695      6.33
Total sentences                      171604     187373
Mean sentence length                 15.95      9.54

[...] patterns from the corpus. For example, a pattern with a high frequency in Hindi is compared
with its occurrence in Marathi. It has been observed that a pattern which has a high occurrence in
Hindi has a very low occurrence in Marathi. Such syllable patterns are therefore characteristic of
Hindi. Some patterns are unique to a particular language and hence can be used to identify that
language.

Table 8. Trisyllable patterns and their distribution in Hindi and Marathi
Trisyllable pattern in Hindi   Hindi   Marathi
ka:ra:ne                       10642   29
a:pa:ne                        8824    6
sa:ma:ya                       5152    29
u:sa:ke                        5057    1
ka:ra:we                       4995    643

5. Results of some Related Experiments

Tagging and aligning are two basic techniques used in providing interpretable structure to the
contents of a corpus [13]. We have run a few trials on selected taggers and aligners, and the
results of these experiments are reported here. A critical look at the outputs helps in
understanding the complexity of the whole process and also suggests how the 'similarity' between
the languages can help in reducing that complexity.
The data has been collected from the medical domain. A set of 90 sentences of varying length was
taken. Sentence length ranges from 3 to 28 tokens per sentence.

Tagged Output: Parsed sentences in a text file. A sample output selected from the parsed file is
shown in Table 9.
Discussions: The following observations were made:
   The postpositions in Hindi do not exist in Marathi.
   The word order remains almost the same, with few changes.
   Postpositions do not have any translation equivalents in Marathi, as these get converted into a
   grammatical feature.
   The lemma (root) words are the same for words which are common to both languages, as shown in
   the above table.
   If the starting few syllables of two words in Hindi and Marathi are similar, then their root
   words are translation equivalents of each other, like शरीर and शरीराच्या.

Sl. No of Words             1 2 3 4 5 6 7 8 9 10
Source Sentence (Hindi)     क्या दर्द शरीर के अन्य अंगो में फैलता है ?
Target Sentence (Marathi)   ही वेदना शरीराच्या अन्य भागात पसरते का ?
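
The last observation in the discussion, that Hindi and Marathi words sharing their starting syllables tend to be translation equivalents, can be sketched as a simple candidate-pairing step. This is a minimal illustration, not the procedure used in the experiments: it compares Unicode code points as a rough stand-in for syllables, and the function names and the `min_prefix` threshold are assumptions introduced here. The example words are written in standard Unicode Devanagari.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading Unicode code points that two words have in common."""
    count = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        count += 1
    return count


def prefix_candidates(source_tokens, target_tokens, min_prefix=3):
    """Pair source and target tokens whose shared prefix is long enough.

    min_prefix is a hypothetical threshold in code points, not a value
    taken from the paper; a syllable-aware comparison would be closer
    to the observation as stated.
    """
    pairs = []
    for source in source_tokens:
        for target in target_tokens:
            if shared_prefix_length(source, target) >= min_prefix:
                pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    hindi = "क्या दर्द शरीर के अन्य अंगो में फैलता है".split()
    marathi = "ही वेदना शरीराच्या अन्य भागात पसरते का".split()
    for source, target in prefix_candidates(hindi, marathi):
        print(source, target)
```

Pairs found this way are only candidates. Postpositions such as के and में have no Marathi counterpart to match, which is consistent with the observation that they are absorbed into a grammatical feature.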