
Information Theoretical Complexities in Developing A Bilingual Corpus: Critical Comparison Hindi and Marathi

This document summarizes and compares the Hindi and Marathi languages to inform the development of a bilingual corpus from both linguistic and information theoretical perspectives. It defines key concepts like files, paragraphs, sentences, and tokens. Preliminary experiments on the two languages provide insights into the complexities of designing a bilingual corpus that addresses both linguistic similarities and differences, as well as information theoretical aspects like storage formats and tools for tagging, aligning, and accessing the corpus content. The goal is to facilitate natural language applications that interface easily with the bilingual corpus.


ISSN:2229-6093

Sonal Khosla et al, Int.J.Computer Technology & Applications,Vol 6 (1),93-110

Information Theoretical Complexities in Developing a Bilingual Corpus: Critical comparison Hindi and Marathi

Sonal Khosla, Symbiosis International University
Haridasa Acharya, Symbiosis International University

Abstract

A critical comparison of the Hindi and Marathi languages, and its implications for building a bilingual corpus from an IT perspective, is presented here. We strongly believe that the effort required to build a corpus can be reduced if the similarities between the languages are properly accommodated in the design and development of the corpus. If the corpus is domain specific, the similarities can be exploited further to arrive at a more practicable design, and the effort can be reduced. The paper also discusses the challenges faced in building the corpus due to the dissimilarities of the two languages.

As a first step towards such a design, we have attempted to provide formal definitions of the basic concepts in terms that can be better understood by IT designers. Results from two elementary experiments have been used to gain insight into the complexities involved in the design of a bilingual corpus. Hypotheses have been laid down and established which could help in understanding the complexities involved in the design and development of a bilingual corpus.

1. Introduction

1.1 Human language is a most exciting and demanding puzzle

Theoretical CL (Computational Linguistics) takes up issues in theoretical linguistics and cognitive science. Formal theories have reached a degree of complexity that can only be managed by employing computers [22]. Models simulating aspects of the human language faculty facilitate implementing them as computer programs, which in turn constitute the basis for the evaluation and further development of the theories. Natural language applications built to run on computers have tremendous importance. The first thing required when natural language applications are written is the support of a good corpus. A corpus is representative of a language and is useful for any linguistic analysis [16]. Building a corpus, or parallel corpora, is a task with both linguistic and information theoretical aspects to be taken care of. When we say applications, in today's scenario we expect practically every application to run off-line on office computers or client machines at the end-user level, with possible web support.

1.2 Bilingual Applications

If one is concerned with two natural languages at a time, then a bilingual corpus is what is needed. Computational linguists have defined various characteristics of a good bilingual corpus [12][24]. However, from the point of view of using computers as the cognitive artifacts in applications, we need to identify the IT-related aspects. Storage formats, an appropriate choice of databases, and a proper choice of tagging and aligning tools assume tremendous importance while building the corpus [24]. Facilitating easy browsing and providing a proper API, so that software developers can easily write applications which naturally interface with the corpus, are the factors one should be concerned with while building a bilingual corpus.

IJCTA | Jan-Feb 2015 93


Available [email protected]

1.3 Objective of current paper


In this article we compare the languages Hindi and Marathi. The comparison has three parts. The first part is a formalization of some of the basic concepts, followed by our ideas on how the similarities and dissimilarities between the languages can be tackled while building a bilingual corpus. The second part covers the linguistic similarities and dissimilarities, where we base our arguments on what earlier researchers have written. The third and last part presents the results of a very limited experimentation, which support a few hypotheses.

We feel that earlier researchers have concentrated more on linguistic aspects, both pure and computational in nature, and have not provided an adequate link to down-to-earth software engineering requirements. It is hoped that this article will give developers a clearer picture of the corpus and, in turn, make the selection of software environments to support related projects easier.

We have preferred to use the word e-corpus to emphasize that the format will be electronic, stored in such a way that it is readily accessible to anyone who is computer literate. It will be easily browsable, without any special browser requirements that would hinder access. Developers will be able to easily create applications with interfaces that use the corpus at the back end. As the title of the paper indicates, the chosen languages are Hindi and Marathi.

2. Basic Definitions and Formal Theoretical Aspects


First, we formalize the basic definitions and concepts, which is the first step towards building an e_corpus.

2.1 Notations and Symbols


Table 1 gives the list of symbols and notations that are used in this article.

| Symbol | Notation | Meaning |
|---|---|---|
| F | file | A file is logical storage unit of a computer, which is a collection of data and information in the form of texts, pictures etc. |
| P | Paragraph | A paragraph is a suitable sub-division of contents of a file defined by paragraph delimiters which can be specially defined spaces, new line characters and other special characters. A file can always be divided into finite no of paragraphs, conversely a file is a union of paragraphs. |
| S | sentence | A sentence is a set of tokens (words/punctuation marks/numerals/dates etc.) having a meaning in itself, which conveys a statement, question, exclamation and command. In Hindi, a sentence ends in a viraam, question mark or exclamation mark. In Marathi, a sentence ends in a full stop, question mark or exclamation mark. |
| K | token | A token is a smallest unit of a sentence to be processed which can be a word, punctuation marks, numerals etc. It is usually enclosed by two blank characters or two punctuation marks. It represents one morphosyntactic unit. |
| W | word | A word is a sequence of characters/syllables with a defined meaning and space on either side |
| C | character | A character over here is referred to a syllable where a syllable is a unit of spoken language. |
| L | Lemma | A lemma is the canonical form of the token or the token itself. It is the base form of the word representing all word forms belonging to the same word. |
| L | Language_type | Language refers to the language in particular i.e. Hindi and Marathi |
| t | tag | Tag is a lexical category given to a particular token in a sentence. The lexical category is picked from the standard POS tagset given in Table 2. |

Table 1. List of Symbols, Notations with their meanings

2.3 Formal Definitions

We present here the formal definitions that are requisite in building an e-corpus.
[Most of the following definitions have been adapted from William H Wilson, 2012, http://www.cse.unsw.edu.au/~billw/nlpdict.html.]

Def 1: Part of speech, POS (Lexical category)
A set POS, also called a tagset, = {t1, t2, t3, ...}, where each element corresponds to a role which can be played by a word. It is a linguistic category.

Def 2: An ordered pair (w, t) is called a tagged-word if w is the word and t is its lexical category.
It should be remembered that the same word may have different tags when it appears in different sentences, depending on the context [8]. A list of tags is given in Table 2.

Def 3: Part-of-speech tagging is the process of labelling each word, w, in each sentence, s, with its part of speech category, t.

We will assume that an e-corpus should always be a tagged corpus to be apt for use in natural language processing applications. Further, we will assume that the tag-sets are unique within an e-corpus. In our work we have used the tagset given in Table 2, taken from the POS Tagset [4] developed at the Language Technology Research Centre at IIIT, Hyderabad. Note that there are exactly 26 tags, with 21 basic ones; these are common to all Indian languages, and this is the complete set.
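Defs 1 to 3 can be illustrated with a minimal Python sketch. The tag names are copied from Table 2; the class name `TaggedWord`, the function `tag_sentence` and the toy lexicon are hypothetical names introduced only for this illustration. The lookup ignores context, so it does not handle the case noted under Def 2 where one word takes different tags in different sentences.

```python
from dataclasses import dataclass
from typing import Dict, List

# Tag names copied from Table 2 (26 tags in total).
DEFAULT_TAGSET = frozenset({
    "NN", "NST", "NNP", "PRP", "DEM", "VM", "VAUX", "JJ", "RB", "PSP",
    "RP", "CC", "WQ", "QF", "QC", "QO", "CL", "INTF", "INJ", "NEG",
    "UT", "SYM", "C", "RDP", "ECH", "UNK",
})


@dataclass(frozen=True)
class TaggedWord:
    """Def 2: a word w paired with its lexical category t."""
    word: str
    tag: str

    def __post_init__(self) -> None:
        # Reject tags that are not part of the tagset.
        if self.tag not in DEFAULT_TAGSET:
            raise ValueError(f"unknown tag: {self.tag!r}")


def tag_sentence(words: List[str], lexicon: Dict[str, str]) -> List[TaggedWord]:
    """Def 3 (simplified): label each word with a tag from a lookup table.

    Words missing from the lexicon receive the UNK tag.
    """
    return [TaggedWord(w, lexicon.get(w, "UNK")) for w in words]


# Hypothetical toy lexicon (assumption) using romanized Hindi forms.
toy_lexicon = {"ke": "PSP", "hE": "VAUX"}
tagged = tag_sentence(["ghar", "ke", "hE"], toy_lexicon)
print([(tw.word, tw.tag) for tw in tagged])
```

A real tagger would need contextual information to choose among candidate tags; the lookup table here only shows the shape of the data.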

Table 2: POS Tagset for Indian Languages

| Sl. No. | Category | Tag Name | Remarks/Discussion | Example (Hindi) | Examples (Marathi) |
|---|---|---|---|---|---|
| 1.1 | Noun | NN | Common Noun | घयघयाहट , सेलन | घयघयाहट , सेलन |
| 1.2 | NLoc | NST | Noun denoting spatial and temporal expressions such as location and time. They are also postpositions in certain contexts. | ऊऩय | लय |
| 2 | Proper Noun | NNP | Proper Nouns used for manual annotation | करऩॉर | करऩॉर |
| 3.1 | Pronoun | PRP | A pronoun is a word that substitutes for a noun or a noun phrase. | आऩके , इसे | आऩल्मा , मारा |
| 3.2 | Demonstrative | DEM | Demonstratives indicate the person or thing to be referred to. It is used as a separate tag to mark the difference between demonstratives and pronouns. It points to a particular noun or to a noun it replaces. | उस | तिरा |
| 4 | Verb-finite | VM | A finite verb is used to mark tense and is used in agreement with the subject. If there is one verb in a sentence it is a finite verb. They provide grammatical information of gender, person, number, tense, aspect, mood and voice. | फर ु ाए , खािा , सोिा , योिा | फोरला , खािो , झोऩिो , याडिो |
| 5 | Verb Aux | VAUX | An auxiliary verb is a verb giving further semantic or syntactic information of the main verb following it. | है | आहे |
| 6 | Adjective | JJ | Adjectives are words that describe or modify another person or thing in a sentence. | नळीरे | नळीरे |
| 7 | Adverb | RB | Adverbs are words that modify verbs. | िेज , धीये | जोयाि , हऱूलाय |
| 8 | Post position | PSP | It is a word that is used to show the relation of a noun or pronoun to some other word in a sentence. It follows the object. | के , की , को | NA |
| 9 | Particles | RP | These are function words that show grammatical relationships with other words. | बी , िो | ऩण , िय |
| 10 | Conjuncts | CC | Conjuncts or conjunctions are words that join parts of a sentence. | औय | आणण |
| 11 | Question Words | WQ | These are the words that put up a question | क्मों , क्मा , कहा | का , काम , कुठे |
| 12.1 | Quantifiers | QF | These are the words that tell us how many or how much. They generally precede or modify nouns. | सबी , फहुि , थोडा , कभ | सगऱे , जास्ि , .थोड , कमभ |
| 12.2 | Cardinal | QC | These words are the cardinal numbers in the language which quantify and are adjectives referring to quantity | फीस , एक , दो , िीन | लीस , एक , दोन , तिन |
| 12.3 | Ordinal | QO | Ordinal numbers are words that represent the relative position of an item in an ordered sequence | ऩहरा , दसू या , िीसया | ऩहहरा , दसु या , तिसया |
| 12.4 | Classifier | CL | A classifier is a word which accompanies a noun in certain grammatical contexts and generally reflects some kind of conceptual classification of nouns. | NA | NA |
| 13 | Intensifier | INTF | These are the words that intensify adjectives or adverbs | खफ ू , फहुि , थोडा , कभ | खऩ ू , जास्ि , .थोड , कमभ |
| 14 | Interjection | INJ | Interjections are words that are used to exclaim, protest or command. They can sometimes be used by themselves. | अये , हा | अये , हो |
| 15 | Negation | NEG | These are the negative words in a language. | नही | नाही |
| 16 | Quotative | UT | A quotative introduces a quote. It is typically a verb and some Indian languages use it. | | |
| 17 | Special Symbol | SYM | It is used to tag all the special symbols which cannot be categorised in any other category | ? , : ; ! | ?,:;!. |
| 18 | Compounds | C | These are the words that are combined together to represent a single word. | रार यक्ि कोमळकाए | रार यक्ि ऩेळी |
| 19 | Reduplicative | RDP | It is used to mark those words in Indian languages that are repeated consecutively. | छोटे छोटे | छोटे छोटे |
| 20 | Echo | ECH | This category is designed for representing words in Indian languages that do not have any place in dictionary and can be called "nonsense words" | दलाई ळलाई | NA |
| 21 | Unknown | UNK | This category is used to mark the words whose category is not known which may be loan words or foreign words. | | |


Def 4: We will refer to the tagset defined in Table 2 as the default tag-set.

Def 5: Storage format is the standardized format that is used to store the metadata so that it is machine readable and interpretable.

There are many standardized formats for encoding, such as:
- Text Encoding Initiative (TEI)
- Translation Memory Exchange (TMX)
- Corpus Encoding Standard for XML (XCES) [24]
- IMS Corpus Workbench
Each of the formats has its own advantages and disadvantages. We feel that the choice of format by the corpus builder will not create any incompatibility with the choice of tagging tool, or even of alignment tools, at the later stage of actually building the e_corpus.

Def 6: A RAW_resource is always a pair of text files in the corresponding languages, namely L1 and L2, using Unicode, one being the translation of the other for a bilingual corpus.
This means that we avoid resources which could contain images or sound files, and assume that the raw resource is in Unicode format. Further, it means that a resource is input into the corpus only when the builder is satisfied with its translation into the other language.

Def 7: The sum total of the sizes of all RAW_resource files, in number of bytes, is referred to as the size of the corpus.

Obviously the actual footprint of the total content would be far larger than the 'size of the corpus', as that would include processed files, accompanying software resources etc.

Def 8: The length of a sentence, s, will always be specified as the 'number of tokens', k.

Token length of a sentence and byte length of a sentence are two different metrics which we will need in the analysis. Hence a separate definition of byte length is proposed. Tokens are words separated by a token delimiter.

Def 9: The byte length of a sentence is the total number of bytes that constitute the sentence, which includes white spaces and one byte for the sentence delimiter.

Choice of the Database System: Use of a database system is a necessity while building a corpus. XML is treated as the default database unless otherwise specified. The reasons for choosing XML as the default database system are its interoperability with most platforms and programming languages, and its software and hardware independence in the way information is stored. Since Unicode is supported, it is ideal for storing text of any language, and allows the use of simple text editors for the content part [24]. Most development tools have natural compatibility with XML.
The other options could be any of the standard relational databases like MySQL, Oracle etc. The choice of database should not affect the usability of the corpus.

Def 10: A paragraph is a subdivision of the text file of finite length, identified by special delimiters like spaces, new line characters, tabs etc. Sometimes its length may be indicated in number of sentences.

Def 11: Context_Tags is a predefined set of tags (like the set of PoS tags) which can be associated with each of the paragraphs.
There can be different classes of tags: linguistic, situational and cultural [19].
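The three size measures of Defs 7 to 9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: tokens are taken to be whitespace-separated units, the text is encoded as UTF-8, and the function names are hypothetical. Def 9 is followed literally, counting exactly one byte for the sentence delimiter, even though some delimiters occupy more than one byte in UTF-8.

```python
from typing import Iterable, Tuple


def token_length(sentence: str) -> int:
    """Def 8: number of tokens, taken here as whitespace-separated units."""
    return len(sentence.split())


def byte_length(sentence_body: str, encoding: str = "utf-8") -> int:
    """Def 9: bytes of the sentence text (white space included) plus one
    byte counted for the sentence delimiter.

    `sentence_body` is the sentence without its final delimiter.
    """
    return len(sentence_body.encode(encoding)) + 1


def corpus_size(raw_resources: Iterable[Tuple[str, str]],
                encoding: str = "utf-8") -> int:
    """Def 7: total size in bytes of all RAW_resource pairs (L1 text, L2 text)."""
    return sum(len(l1.encode(encoding)) + len(l2.encode(encoding))
               for l1, l2 in raw_resources)


# Two ASCII tokens: 7 bytes of text plus 1 byte for the delimiter.
print(token_length("ghar ke"), byte_length("ghar ke"))
# Two Devanagari code points take 3 bytes each in UTF-8.
print(byte_length("\u0939\u0948"))
```

The second call shows why the two metrics must be kept apart: a single Devanagari token of two code points already occupies six bytes of text in UTF-8.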


Def 12: Context_tagging is the process of tagging each paragraph in a text file with predefined tags.

Def 13: Since we are concerned with a bilingual product, the concept of "Direction" assumes considerable importance. The e-corpus will have a direction specified as one of the elements of the set {Uni, Bi}.

If the choice is Uni, then the aligning will be 'L1 to L2' or 'L2 to L1', which again will be clearly specified in the definition. The choice 'Bi' means bi-directional and needs no further specification, since it is symmetric anyway. A bidirectional corpus includes both unidirectional corpora as recoverable sub-corpora [12]. The corpus will have various sub-corpora that will be aligned (text by text, paragraph by paragraph, sentence by sentence, phrase by phrase and word by word) [12].

Having defined the basic components in clear formal terms, we are now in a position to provide a good implementable definition of a monolingual e_corpus and a bilingual e_corpus.

Def 14: Bilingual e_corpus

A bilingual e_corpus is a quadruplet {Lan_Names, RawLang_Files, ProLang_Files, SoftwareToManage} with the following characteristics:
1. It has a specified pair of languages associated with it: Lan_Names = (L1, L2).
2. It consists of a "repository" of contents included in the containers {RawLang_Files, ProLang_Files}, the combined size of which will be called the size of the corpus. RawLang_Files are a collection of resources, which are basically pairs of text files as defined in Def 6, whereas ProLang_Files are the XML files which provide exact alignment and tagging of the raw resources.
3. It is "intelligent" in the sense that adequate interfaces are provided to add/modify/delete information, perform various linguistic operations, and browse and extract information for user applications (and many more related functionalities).
4. It has a software component, "SoftToManage": an integrated package with the utilities required to manage the repository, which also contains proper APIs to facilitate application development in Java/C++.
5. It is noiseless (possible noise is spelling mistakes, incorrect translations, incorrect character encoding, missing words). In short, it will have no linguistic inconsistencies.

The ProLang_Files is essentially a collection of tagged-repositories etc. obtained from the RawLang_Files, arranged and structured in such a way as to help the utilities in SoftToManage work properly.

Def 15: Context
Context is the physical environment in which a word is used [19]. A word can have a different POS tag based on the context in which it is used [13]. Lexical ambiguity that arises in different situations can be resolved using the contextual information available in the text [13].

3. Linguistic Similarities and their Information Theoretic Implications


Developers face many challenges in building a bilingual corpus for the Hindi and Marathi pair of languages. The basic definitions discussed earlier in Section 2 and the functional requirements specified in the definition of a bilingual corpus are not easy to meet. Through a study of the similarities and dissimilarities of the pair of languages one can possibly counter some of the challenges and reduce complexities.

3.1 Vocabulary

Indian languages share a common origin and are known to have a common vocabulary of around 40 to 80 percent [9]. Hindi and Marathi are one such pair of Indo-Aryan languages, both written in the Devanagari script. They are known to be sister languages and share a significant proportion of their vocabulary [18]. Words that are phonologically and lexically similar are defined as cognates [18][14]. Out of the corpus of 6 million words created by the Central Institute of Indian Languages for Marathi and Hindi, 44.5% are cognates. Sometimes, however, these cognates have different meanings, which poses a word sense disambiguation problem for developers. These differences are due to the differences between Marathi and Hindi grammatical rules for the construction of verbs and their placement. Some words retain their meanings and have similar meanings, while others have become associated with different concepts [18]. The problems arising from bilingualism are reduced when the rate of cognates in the two languages is higher [14].

Some examples of cognates in Hindi and Marathi are given below:
Same origin; same meaning:
The word "Utsuk" means curious in both Hindi and Marathi.
Same origin; different meaning:
The word "shiksha" in Hindi means education, while the same word in Marathi means punishment.

Since very few similar words in the two languages have different meanings, the work involved in building lexical resources can be reduced by taking care of these cognates.

So from the designer's perspective we may conclude that:
- A good bilingual corpus faces the problems related to BIGDATA, i.e. the problems of Volume, Variety and Velocity.
- The common part of the vocabulary, and the presence of cognate words, will certainly reduce the Volume of the corpus significantly.

3.2 Script & Alphabet Set

The set of symbols of each language is unified into a single collection identified as a single script. These collections of symbols and scripts then serve as a reserve from which symbols are taken to write multiple languages.

Hindi and Marathi both use the Devanagari script for writing, which is a phonetic script. The Devanagari script used for Hindi and Marathi has 12 pure vowels, two additional loan vowels taken from Sanskrit and one loan vowel from English [9][10]. There are 34 pure consonants, 5 traditional conjuncts, 7 loan consonants and 2 traditional signs in the Devanagari script, and each consonant has 14 variations through integration of the 14 vowels, which produces 507 different alphabetical characters [9][11].

Apart from ऱ, which is used only in Marathi, the consonants are identical. In Marathi, different glyphs are preferred for U+0932 DEVANAGARI LETTER LA and U+0936 DEVANAGARI LETTER SHA [2].

The different committees of the Department of Electronics and the Department of Official Language, Govt. of India have developed a universal code, the Indian Standard Code for Information Interchange (ISCII). The ISCII code is a superset of all the characters required in the ten Brahmi-based Indian scripts. It is based on the standard ASCII code [1].


Unicode has also encoded the Indian language scripts and is based on the Indian national standard, ISCII. The Unicode standard has encoded the Devanagari characters in the same relative position as in the ISCII-1988 standard. This enables one-to-one mapping between different scripts in the Indian family [2]. The range of codes for Devanagari in The Unicode Standard, Version 7.0, is 0900–097F [2].

Since the script and the alphabet set are similar in both languages, conversion from Unicode to ISCII and vice versa is not language specific, but depends on the script.

Whenever we download or procure raw files or documents in Hindi or Marathi for inclusion in a repository, they have been passed or produced through some document editors. In most cases they need to be converted into a plain text resource using a code converter. Font Suvidha [http://www.fontsuvidha.com/] is one such software tool, developed to convert writing in Devanagari-script languages like Hindi, Marathi and many others, written in different fonts, to Unicode and vice versa. The availability of many such converters is a tremendous advantage for a corpus builder. Some tools also have a language detection feature to leave English text unchanged, so that documents with mixed contents (English, Hindi or Marathi) can be handled easily.

Hypothesis: The commonality of the Devanagari script between the two languages has made the development of such Unicode converters possible. (One Unicode converter can handle RAW files from both languages.)

3.3 Phonology

Phonology of a language is an important feature. Most often, phonetically similar words have similar spellings. Devanagari being a phonetic script, this aspect can be used to match misspelled words or missing/muted words [11].

For example, the Hindi words "aaya" and "gaya" rhyme with "aala" and "gela" in Marathi and have similar meanings as well. In the example sentences below, the words िनाल and िणाल also rhyme and have similar spellings.

Hindi: िनाल कभ कये |
Marathi: िणाल कभी कया .

The ण sound is more frequently used in Marathi.

Due to the phonetic similarity of different alphabets [7] and several features and sounds shared across Indian languages [5], an optimal keyboard common to all languages is possible. The different committees of the Department of Electronics and the Department of Official Language, Govt. of India have been evolving different codes and keyboards which could cater to all the Indian scripts.

Hypothesis: Due to the overlap in the vocabulary of Hindi and Marathi, words which have similar pronunciation and rhyme together can be taken directly into the corpus as equivalent words having similar meanings.

The above hypothesis is supported by the experiments described in Section 5. Although the phonology aspect is more applicable in building a speech corpus, we have still attempted to list in Table 3 the differences which need to be taken care of.


Table 3. Difference in pronunciation of certain consonants in Hindi and Marathi.

| Consonants | Marathi | Hindi |
|---|---|---|
| च, ज, झ and प | Multiple pronunciation | Single pronunciation |
| ऋ | /ru/ | /ri/ and similar to Sanskrit |
| T, TH, D, DH, t, th, d, dh | Words ending with these consonants are prolonged in pronunciation | No change |

च and ज are dental-alveolar in Marathi only, while these are alveolar in Hindi.
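The phonology hypothesis of this section, that similarly pronounced words can be proposed as equivalents, can be approximated in a few lines of Python. The sketch below is not the method used in the experiments of Section 5: it substitutes plain string similarity on romanized spellings (the standard library's `difflib.SequenceMatcher`) for a true phonetic comparison, and the function names and the threshold value are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def spelling_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two romanized word forms."""
    return SequenceMatcher(None, a, b).ratio()


def cognate_candidates(hindi_words: List[str], marathi_words: List[str],
                       threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    """Propose (Hindi, Marathi) pairs whose spellings are at least
    `threshold` similar. The threshold is an arbitrary assumption."""
    pairs = []
    for h in hindi_words:
        for m in marathi_words:
            score = spelling_similarity(h, m)
            if score >= threshold:
                pairs.append((h, m, score))
    return pairs


# Romanized forms quoted in the text above.
for h, m, score in cognate_candidates(["aaya", "gaya"], ["aala", "gela"]):
    print(h, m, round(score, 2))
```

Any pair proposed this way is only a candidate. As the "shiksha" example in Section 3.1 shows, identical forms can carry different meanings, so candidates still need checking against a lexicon or by a human annotator.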

3.4 Grammar

This is one aspect that needs to be studied considerably to build a highly accurate bilingual corpus. Table 4 gives a comprehensive list of similarities and dissimilarities in grammar between the two languages [20][21].

Hindi is a highly inflected language and requires the modification of a word to represent different grammatical categories such as tense, mood, voice, aspect, person etc. It adds prefixes and suffixes to form words. The inflection of verbs is called conjugation, and the inflection of nouns, adjectives and pronouns is called declension. Hindi uses postpositions (PSP) rather than prepositions for case marking and auxiliaries. In Marathi, postpositions are attached to the preceding word. Marathi also adds suffixes to roots to build words [20][21].

While converting from Hindi to Marathi, PSPs like के , भे or है are removed and added as a morphological phenomenon [11] or as grammatical information in the word itself. They are converted to a syntactic feature in Marathi [23]. Hence these Hindi PSPs do not find any translation equivalent in Marathi and are irrelevant for word alignment.

Multiple words in Hindi are converted into a single compound word in Marathi. The mean sentence length of Hindi is 15.95 while that of Marathi is 9.54 [3]. This is attributed to the fact that Marathi forms compound words. The pilot experiment conducted also shows that the sentence length in Hindi is always greater than the sentence length in Marathi. Section 5 shows the exact statistics of sentence length in the pilot study.

For example, consider the following sentence pair taken from the test data.

Hindi: क्मा ददद ळयीय के अन्म अंगो भें पैरिा है ?
Marathi: ही लेदना ळयीयाच्मा अन्म बागाि ऩसयिे का ?

The words ळयीय के are converted to ळयीयाच्मा in Marathi, and the words अंगो भें form the compound word बागाि . It is also seen that the PSPs के and भें are converted into a syntactic feature in Marathi.
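The sentence-length comparison described above can be reproduced on any set of aligned sentence pairs with a short Python sketch. The function names are hypothetical, tokens are assumed to be whitespace-separated, and the only input is the single sentence pair quoted in this section, copied as printed.

```python
from statistics import mean
from typing import List, Tuple


def mean_token_lengths(aligned_pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Mean sentence length in tokens for the Hindi and Marathi sides."""
    hindi = [len(h.split()) for h, _ in aligned_pairs]
    marathi = [len(m.split()) for _, m in aligned_pairs]
    return mean(hindi), mean(marathi)


def hindi_always_longer(aligned_pairs: List[Tuple[str, str]]) -> bool:
    """True if every Hindi sentence has more tokens than its Marathi pair."""
    return all(len(h.split()) > len(m.split()) for h, m in aligned_pairs)


# The sentence pair quoted in this section (Hindi first, Marathi second).
pairs = [
    ("क्मा ददद ळयीय के अन्म अंगो भें पैरिा है ?",
     "ही लेदना ळयीयाच्मा अन्म बागाि ऩसयिे का ?"),
]
print(mean_token_lengths(pairs), hindi_always_longer(pairs))
```

For this pair the Hindi side has ten tokens and the Marathi side eight, counting the question mark as a token on both sides.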


Table 4. Grammatical Categories of Hindi and Marathi

| Category | Sub-category | Similarities | Differences |
|---|---|---|---|
| Nouns | Number | Singular, Plural | NIL |
| Nouns | Gender | Masculine, Feminine | Neuter (Marathi) |
| Nouns | Case | Direct (nominative), Vocative | In Marathi, genitive, accusative-dative, instrumental, ablative, locative. All cases except vocative are marked by postpositions. In Hindi, oblique (direct) case is used to mark subject of sentences and is used to mark postpositions. |
| Nouns | Articles | NIL | In Hindi, definite and indefinite. In Marathi, no articles. |
| Adjectives | | Adjectives agree with the nouns they modify in number, gender, and case. | In Marathi, adjectives do not inflect unless they end in long /a/. |
| Verbs | Person | 1st, 2nd, 3rd | In Hindi, 2nd honorific |
| Verbs | Number | Singular, Plural | NIL |
| Verbs | Tense | Past, Present, Future | NIL |
| Verbs | Aspect | Imperfective, Perfective | NIL |
| Verbs | Mood | Indicative, imperative, optative | Subjunctive, conditional (Marathi) |
| Verbs | Forms | | Hindi verbs occur in the following forms: root, imperfect stem, perfect stem, and infinitive. The stems agree with nouns in gender and number. |
| Word Order | | Subject-Object-Verb. Modifiers precede the nouns they modify. | In Marathi, indirect objects precede direct objects. |

…. The common script and the phonetic similarities together will certainly reduce the variety part of the BIGDATA problem faced by developers.

3.5 Other Aspects

Other challenges in processing Hindi and Marathi are sentence length, lexical ambiguity, word order, etc. Hindi and Marathi sentences differ in length: a Marathi sentence is shorter than a Hindi sentence [3].

Because usage context differs, a word may receive different POS tags, which results in ambiguity. Indian languages are morphologically rich and allow the order of words in a sentence to change. Because of this free word order, alignment of equivalent words is challenging [4]. Although both Hindi and Marathi follow the Subject-Object-Verb order, actual usage still shows differing word order.

According to the inter-language comparison study in [17], the distance between Hindi and Marathi is very small.

Hindi and Marathi correspond closely and have largely similar structural properties [15][6]. Because of this structural similarity, the Marathi Wordnet can be developed through relation borrowing from the Hindi Wordnet [15][6].

4. Important Statistical Parameters

The number of word types is the number of distinct words in a corpus, which is also a measure of the vocabulary of the language. Table 5 gives the five most frequently used words and the five most frequent syllables in the Hindi corpus, with their percentage distributions [3].

Table 5. Some statistical measures
Frequently used words in Hindi | Percentage | Top five syllables in Hindi | Percentage
ke | 3.59 | ra | 5.27
hE | 3.08 | ka | 3.60
meM | 2.79 | na | 2.84
kI | 2.355 | sa | 2.80
se | 1.70 | pa | 2.17

Table 6 gives the number of words required to cover a certain percentage of the corpus [3]. This number differs drastically between Hindi and Marathi. As can be seen from the table, Marathi has a larger vocabulary than Hindi. This can be attributed to the fact that postpositions in Hindi like "ke", "hE", "meM", "kI" and "se" are the words that occur most often in Hindi, whereas in Marathi they are converted to a syntactic or grammatical feature, which produces more variation and a larger vocabulary. Table 7 gives a comparative list of syllable and word statistics in Hindi and Marathi.
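The coverage measure reported in Table 6 can be illustrated with a minimal sketch. It assumes the corpus is already tokenized into a list of words; the function name `words_needed_for_coverage` and the toy token list are hypothetical and are not taken from [3].

```python
from collections import Counter


def words_needed_for_coverage(tokens, percentages=(10, 20, 30, 40, 50, 60, 70, 80)):
    """Map each percentage to the number of most frequent word types
    whose combined frequency covers at least that share of the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    pending = sorted(percentages)
    needed = {}
    covered = 0
    # Walk the word types from most to least frequent, accumulating coverage.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # Integer comparison avoids floating-point rounding at the thresholds.
        while pending and covered * 100 >= pending[0] * total:
            needed[pending.pop(0)] = rank
        if not pending:
            break
    return needed


# Toy example in WX notation (illustrative data, not from the corpus).
toy_tokens = "ke hE ke meM ke hE kI se ke hE".split()
print(words_needed_for_coverage(toy_tokens, (40, 70, 100)))
# {40: 1, 70: 2, 100: 5}
```

A language whose frequent function words are absorbed into inflected forms, as described for Marathi, would be expected to need more word types at every threshold.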


Table 6. Number of words required to cover a certain percentage of the corpus.
% of Corpus | Hindi | Marathi
10% | 4 | 17
20% | 10 | 67
30% | 26 | 213
40% | 77 | 548
50% | 199 | 1247
60% | 486 | 2882
70% | 1158 | 6922
80% | 2874 | 18874

Table 7. A comparative list of syllable and words in Hindi and Marathi
Measure | Hindi | Marathi
Corpus Size (in no. of words) | 2986063 | 1872345
Word Types | 127241 | 210578
Syllable types | 3994 | 3757
Average no. of syllables in a word | 2.23 | 2.97
Syllable Mode | 2 | 3
Most frequent syllable | Ra | wa
Bigram syllable types | 65697 | 69023
Most frequent bisyllable | ka:ra | A:he
Most frequent trisyllable | u:na:kI, a:pa:ne, a:pa:nI, i:sa:ke, u:sa:kI | mha:NU:na, mha:Na:je, A:pa:lyA, A:he:wa, ka:rU:na
Maximum Word Length | 20 | 23
Average Word Length | 4.695 | 6.33
Total Sentences | 171604 | 187373
Mean Sentence Length | 15.95 | 9.54

Syllable patterns are very important in the study of languages. Table 8 shows trisyllable patterns and their distribution in Hindi and Marathi. These syllable frequencies are used to extract unique patterns from the corpus. For example, the frequency of a pattern that is highly frequent in Hindi is compared with its occurrence in Marathi. It has been observed that a pattern which has a high occurrence in Hindi has a very low occurrence in Marathi. Such syllable patterns are therefore characteristic of the Hindi language. Some patterns are unique to a particular language and hence can be used to identify that language.

Table 8. The trisyllable pattern and its distribution in Hindi and Marathi
Trisyllable Pattern in Hindi | Hindi | Marathi
ka:ra:ne | 10642 | 29
a:pa:ne | 8824 | 6
sa:ma:ya | 5152 | 29
u:sa:ke | 5057 | 1
ka:ra:we | 4995 | 643

5. Results of some Related Experiments

Tagging and aligning are two basic techniques used in providing interpretable structure to contents in a corpus [13]. We have run a few trials on selected taggers and aligners. Results of the experiments are reported here. A critical look into the outputs helps in understanding the complexity of the whole process and also suggests how the 'similarity' between the languages can help in reducing the complexity.

The data has been collected from the medical domain. A set of 90 sentences of varying length was taken. Sentence length ranges from 3 to 28 tokens per sentence.

Experiment 1 : Experiment on Tagging
Tools : Shallow Parser for Hindi and Marathi developed by IIIT Hyderabad.
Input : Raw file containing 90 sentences in Unicode format


Tagged Output : Parsed sentences in a text file. A sample output selected from the parsed file is shown in Table 9.
Discussions : The following observations were made:
• The postpositions in Hindi do not exist in Marathi.
• The word order remains almost the same, with few changes.
• Postpositions do not have any translation equivalents in Marathi, as they get converted into a grammatical feature.
• The lemma (root) words are the same for words which are common to both languages, as shown in Table 9.
• If the first few syllables of two words in Hindi and Marathi are similar, then their root words are translation equivalents of each other, like शरीर and शरीराच्या.
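The last observation suggests a simple heuristic that can be sketched as follows. This is only an illustration: it compares Unicode code points rather than true syllables, and the function names and the 0.75 threshold are assumptions chosen for the example, not part of the shallow parser or of the authors' procedure.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading code points that two words have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def likely_related(hindi_word: str, marathi_word: str, min_ratio: float = 0.75) -> bool:
    """Flag a word pair as a candidate translation equivalent when the shared
    prefix covers most of the shorter word (threshold is an assumption)."""
    shorter = min(len(hindi_word), len(marathi_word))
    if shorter == 0:
        return False
    return shared_prefix_len(hindi_word, marathi_word) / shorter >= min_ratio


# The pair from the observation above, and an unrelated pair for contrast.
print(likely_related("शरीर", "शरीराच्या"))  # the Hindi word is a full prefix of the Marathi form
print(likely_related("के", "भागात"))       # no shared prefix
```

A syllable-level comparison would need a syllabifier for Devanagari, which is left out here to keep the sketch self-contained.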

Table 9. Sample output of Experiment 1.
Language | Row | Content | No. of Tokens
Hindi | Sentence | क्या दर्द शरीर के अन्य अंगो में फैलता है ? | 10
Hindi | Sentence with lemma | क्या दर्द शरीर के अन्य अंग में फैलता है ? |
Hindi | POS Tags | WQ NN NN PSP JJ NN PSP VM VAUX SYM |
Marathi | Sentence | ही वेदना शरीराच्या अन्य भागात पसरते का ? | 8
Marathi | Sentence with lemma | ही वेदना शरीर अन्य भाग पसर का ? |
Marathi | POS Tags | DEM NN NN JJ NN VM WQ SYM |
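The difference between the two tag sequences in Table 9 can be made explicit with a short sketch. The tag strings are copied from the table; the helper name `tag_difference` is hypothetical.

```python
from collections import Counter

# POS tag sequences from Table 9.
hindi_tags = "WQ NN NN PSP JJ NN PSP VM VAUX SYM".split()
marathi_tags = "DEM NN NN JJ NN VM WQ SYM".split()


def tag_difference(tags_a, tags_b):
    """Return the tags that occur more often in A than in B, and vice versa."""
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    return dict(count_a - count_b), dict(count_b - count_a)


only_hindi, only_marathi = tag_difference(hindi_tags, marathi_tags)
print(len(hindi_tags), len(marathi_tags))  # 10 8
print(only_hindi)    # {'PSP': 2, 'VAUX': 1}
print(only_marathi)  # {'DEM': 1}
```

For this sentence pair the two postposition tags (PSP) account for the whole difference in token count, which is consistent with the observation that Hindi postpositions surface as grammatical features in Marathi.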

Experiment 2 : Experiment on Alignment
Tool : GIZA++
Input : Parallel corpus
Aligned output : Bilingual corpus aligned at word level. A sample output is shown in Table 10.
Discussions :
• With Marathi as source and Hindi as target, the word alignment procedure matches 83% of the Marathi words with some Hindi word; 17% are not aligned with any word and are hence null. Of these, 76% are correctly aligned pairs, in which the Marathi word is matched with the corresponding Hindi word.
• With Hindi as source and Marathi as target, the word alignment procedure matches 85% of the Marathi words with some Hindi word; 15% are not aligned with any word and are hence null. Of these, 73% are correctly aligned pairs, in which the Marathi word is matched with the corresponding Hindi word.
• Mean sentence length of Hindi = 1041/90 ≈ 11.57


Mean sentence length of Marathi = 811/90 ≈ 9.01
Average distance between the length of sentences = 2.75
• 25% of the Hindi words are also present in Marathi and 33% of the Marathi words are also present in Hindi, which indicates the commonality of vocabulary.

The mean difference between sentence lengths is 2.75, which means that a Hindi sentence is longer than a Marathi sentence by approximately 2.75 words. This supports the assumption that a long sentence gets translated to a long one while a short sentence gets translated into a short one.

The following is the output of GIZA++. Table 10 shows the serial number assigned to each word by the tool.
क्या ({ 1 }) दर्द ({ 2 }) शरीर ({ 3 }) के ({ }) अन्य ({ 4 }) अंगो ({ 5 }) में ({ }) फैलता ({ 6 7 }) है ({ }) ? ({ 8 })

Table 10. Sample output of Experiment No. 2
Sl. No. of Words | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Source Sentence (Hindi) | क्या | दर्द | शरीर | के | अन्य | अंगो | में | फैलता | है | ?
Target Sentence (Marathi) | ही | वेदना | शरीराच्या | अन्य | भागात | पसरते | का | ?

As can be seen from the output of GIZA++, each Hindi word is listed with the position of the Marathi word it is aligned to. For example, क्या ({ 1 }) means that the word क्या is aligned to the first word in the Marathi sentence, and दर्द ({ 2 }) means that the word दर्द in the Hindi sentence is aligned to the second word in the Marathi sentence. If no equivalent is found, the word is assigned null.
Out of the 10 Hindi words, 7 have been aligned to some Marathi word; the remaining 3 are not aligned and are shown with empty brackets. Of the 7 aligned words, 5 are correctly aligned. Four of these are one-to-one mappings, while the word फैलता ({ 6 7 }) is aligned to the 6th and 7th words of the Marathi sentence and is hence an example of one-to-many mapping.

Result of alignment on GIZA++ with Marathi as source and Hindi as target sentence

The experiment was repeated with Marathi as the source sentence and Hindi as the target sentence.
ही ({ 1 }) वेदना ({ 2 }) शरीराच्या ({ 3 }) अन्य ({ 5 }) भागात ({ 7 8 }) पसरते ({ 6 }) का ({ 9 }) ? ({ 10 })
All the Marathi words in the source sentence have been aligned to some Hindi word. There is no null assignment in this case.
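The bracket notation in the alignment output shown above can be read mechanically. The sketch below is a minimal parser written only against the sample lines quoted in this section; the function names are hypothetical, and it does not handle the other lines of a full GIZA++ alignment file.

```python
import re

# One entry looks like: word ({ 6 7 })  with an empty bracket meaning null.
ENTRY_RE = re.compile(r"(\S+)\s+\(\{([\d\s]*)\}\)")


def parse_alignment_line(line):
    """Return a list of (source_word, [target positions]) pairs."""
    return [(word, [int(i) for i in positions.split()])
            for word, positions in ENTRY_RE.findall(line)]


def summarize(pairs):
    """Count aligned, null and one-to-many source words."""
    nulls = sum(1 for _, targets in pairs if not targets)
    one_to_many = sum(1 for _, targets in pairs if len(targets) > 1)
    return {"words": len(pairs), "aligned": len(pairs) - nulls,
            "null": nulls, "one_to_many": one_to_many}


hindi_source = ("क्या ({ 1 }) दर्द ({ 2 }) शरीर ({ 3 }) के ({ }) अन्य ({ 4 }) "
                "अंगो ({ 5 }) में ({ }) फैलता ({ 6 7 }) है ({ }) ? ({ 8 })")
print(summarize(parse_alignment_line(hindi_source)))
# {'words': 10, 'aligned': 7, 'null': 3, 'one_to_many': 1}
```

The counts agree with the discussion of the Hindi-source sample: 7 aligned words, 3 null assignments and one one-to-many mapping. Judging which alignments are correct still requires a manual check or a gold standard.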


A summary of the statistics in Experiment No. 1 and Experiment No. 2 is given in Table 11.

Table 11. Summarized results of Experiment No. 1 and Experiment No. 2
ns = No. of sentences; ls = length of sentence (no. of words); Tw = Total words; C = Categories created by GIZA++; V = vocabulary (distinct words); cp = correctly aligned pairs; cw = common words; nu = null assignments; p:q = Hindi:Marathi (ratio) alignments; H = Hindi, M = Marathi. Columns 1:2 to 1:6 give the alignment ratios.

Set No | L | ns | ls | Tw | C | V | cp | cw | nu | 1:2 | 1:3 | 1:4 | 1:5 | 1:6
I | H | 24 | 16-28 | 435 | 100 | 224 | 112 | 92 | 169 | 35 | 7 | 1 | 0 | 0
I | M | 24 | 9-20 | 328 | 100 | 242
II | H | 36 | 9-14 | 425 | 100 | 250 | 168 | 101 | 149 | 27 | 7 | 1 | 1 | 1
II | M | 36 | 7-13 | 337 | 100 | 251
III | H | 30 | 1-8 | 181 | 100 | 150 | 56 | 52 | 41 | 2 | 1 | 0 | 0 | 0
III | M | 30 | 3-6 | 146 | 100 | 132
All | H | 90 | 3-20 | 1041 | 100 | 488 | 465 | 121 | 358 | 64 | 14 | 5 | 1 | 0
All | M | 90 | 3-20 | 811 | 100 | 526
All | M | 90 | 3-20 | 811 | 100 | 526 | 616 | 171 | 55 | 60 | 37 | 11 | 4 | 1
All | H | 90 | 3-20 | 1041 | 100 | 488
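Since Table 11 lists both the number of sentences (ns) and the total words (Tw) for each set, the mean sentence length per set and language can be recomputed as Tw/ns. The sketch below does only this arithmetic; the dictionary layout and names are assumptions made for the example.

```python
# (set, language) -> (ns, Tw), copied from Table 11.
TABLE_11 = {
    ("I", "H"): (24, 435),   ("I", "M"): (24, 328),
    ("II", "H"): (36, 425),  ("II", "M"): (36, 337),
    ("III", "H"): (30, 181), ("III", "M"): (30, 146),
    ("All", "H"): (90, 1041), ("All", "M"): (90, 811),
}

# Mean sentence length = total words / number of sentences.
mean_lengths = {key: tw / ns for key, (ns, tw) in TABLE_11.items()}

for (set_no, lang), mean in mean_lengths.items():
    print(f"Set {set_no:>3} {lang}: {mean:.2f} words per sentence")

# Difference of the overall means (Hindi minus Marathi).
overall_gap = mean_lengths[("All", "H")] - mean_lengths[("All", "M")]
print(f"Overall gap: {overall_gap:.2f}")
```

The difference of the means is not the same quantity as the average per-sentence distance; the latter needs the individual sentence lengths, which Table 11 does not list.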

6. CONCLUSIONS AND DISCUSSIONS

In this article an attempt has been made to formalize the concept of a bilingual corpus with appropriate definitions in terms of Information Technology, so that the concepts are better understood by a developer and hence become easier to implement.
Since our focus is on a Hindi and Marathi bilingual e_corpus, a study of similarities between the two languages has been presented with a view to reducing the complexity of the bilingual corpus. The various hypotheses stated in Sec 3 should help a developer.
Results of a few experiments in tagging and aligning are presented as evidence for some of the observations made earlier, which further strengthens our belief that a corpus designer can exploit the similarities to advantage.

References

[1] Anonymous, Script Grammar for Marathi language. Technical Report. Technology development for Indian languages Programme of DIT, Govt. of India in association with CDAC. Ver 1.4-2.


[2] Julie D. Allen. 2012. The Unicode Standard / the Unicode Consortium, Version 6.2. Technical Report. Published in Mountain View, CA. ISBN 978-1-936213-07-8. September 2012.

[3] Akshar Bharati, Prakash Rao, Rajeev Sangal and S.M.Bendre. 2002. Basic Statistical analysis of corpus and cross comparison among corpora. In Proceedings of 2002 International Conference on Natural Language Processing, Mumbai, India. (2002).

[4] Akshar Bharati, Rajeev Sangal, Dipti Mishra Sharma and Lakshmi Bai. 2006. AnnCorra : Annotating Corpora Guidelines For POS And Chunk Annotation For Indian Languages. Language Technologies Research Centre, Technical Report, IIIT, Hyderabad, 2006.

[5] Peri Bhaskararao. 2011. Salient Phonetic features of Indian languages in speech technology. Sadhana Vol. 36, Part 5, pp. 587-599. http://dx.doi.org/10.1007/s12046-011-0039-z. (October 2011).

[6] Pushpak Bhattacharya, Debasri Chakrabarti and Vaijayanthi M.Sarma. 2006. Complex Predicates in Indian language Wordnets. Language Resources and Evaluation, Vol. 40. pp. 331-355. (2006).

[7] Sandeep Chaware and Srikantha Rao. 2011. Rule based phonetic matching approach for Hindi and Marathi. Computer Science and Engineering: An International Journal (CSEIJ), vol.1, No.3. DOI : http:// 10.5121 /cseij 2011 (August 2011).

[8] Niladri Sekhar Dash. 2008. Corpus Linguistics: An Introduction. India: Pearson Education-Longman Publishing Co., pp. 208, ISBN: 81-317-1603-1, 2008.

[9] M.L.Dhore, S.K.Dixit and R.M.Dhore. 2012a. Hindi and Marathi to English NE Transliteration Tool using Phonology and Stress Analysis. Proceedings of 24th International Conference on Computational Linguistics: Demonstration Papers at IIT Bombay, pages 111–118 (2012).

[10] M.L.Dhore, S.K.Dixit and R.M.Dhore. 2012b. Issues in Hindi to English and Marathi to English Machine transliteration of Named Entities. International Journal of Computer Applications, Vol. 51, No.14 (August 2012).

[11] M.L.Dhore, R.M.Dhore and P.H.Rathod. 2013. Transliteration by Orthography or Phonology for Hindi and Marathi to English: Case Study. International Journal of Natural Language Computing, Vol.2, No.5 (October 2013). DOI : 10.5121/ijnlc.2013.2501.

[12] A.Frankenberg-Garcia. 2009. Compiling and Using a Parallel corpus for research in translation. International Journal of Translation, vol.21(1), pp.57-71, (2009).

[13] Nisheeth Joshi, Hemant Darbari and Iti Mathur. 2013. HMM Based POS tagger for Indian languages. Jan Zizka (Eds) : CCSIT, SIPP, AISC, PDCTA – 2013, pp.341–349, 2013. CS & IT-CSCP 2013, DOI : 10.5121/csit.2013.3639 (2013).

[14] Rujvi Kamat, Manisha Ghate, Tamar H.Gollan, Rachel Meyer, Florin Vaida, Robert K.Heaton, Scott Letendre, Donald Franklin, Terry Alexander, Igor Grant, Sanjay Mehendale and Thomas D.Marcotte. 2012. Effects of Marathi-Hindi bilingualism on


Neuropsychological performance. Journal of International Neuropsychological Society, Vol. 18, Issue 02, pp.305–313, March, 2012. http://dx.doi.org/10.1017/S1355617711001731.

[15] J. Ramanand, Akshay Ukey, Brahm Kiran Singh and Pushpak Bhattacharyya. 2007. Mapping and Structural analysis of Multilingual Wordnets. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 30(1). (March 2007).

[16] Shikar Kr. Sharma, Himadri Barali, Ambeshwar Gogoi, Ratul Ch. Deka and Anup Kr. Burman. 2012. A structured approach for building Assamese corpus: Insights, Applications and Challenges. In Proceedings of the 10th Workshop on Asian Language Resources, COLING 2012, pages 21-28.

[17] Anil Kumar Singh and Harshit Surana. 2007a. Can corpus based measures be used for comparative study of languages. In Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, pp 40–47, Prague. (June 2007).

[18] Anil Kumar Singh and Harshit Surana. 2007b. Study of Cognates among South Asian languages for the purpose of Building Lexical Resources. Journal of Language Technology. Dept. of IT, Govt. of India. 2007.

[19] Lichao Song. 2010. The role of context in Discourse Analysis. Journal of Language teaching and Research, Vol. 1, No. 6, pp. 876-879. doi: 10.4304/jltr.1.6.876-879 (November 2010).

[20] Irene Thompson. 2014. About World languages: Hindi. http://aboutworldlanguages.com/hindi. (July 2014).

[21] Irene Thompson. 2014. About World languages: Marathi. http://aboutworldlanguages.com/marathi. (December 2014).

[22] Hans Uszkoreit. 2000. What is Computational Linguistics. http://coli.uni-saarland.de/~hansu/what_is_cl.html

[23] Christopher C. Yang and Kar Wing Li. 2003. Automatic construction of English/Chinese parallel corpora. Journal of the American Society for Information Science and Technology. Vol. 54, Issue 8, pp. 730-742. http://dx.doi.org/10.1002/asi.10261 (June 2003).

[24] Johann Gamper and Paolo Dongilli, "Primary data encoding of a bilingual corpus", In Proceedings of the 11th Annual Meeting of the GLDV, Frankfurt a/M, Germany, July, 1999.
