Hindi Derivational Morphological Analyzer

Nikhil Kanuparthi, Abhilash Inumella, Dipti Misra Sharma
LTRC, IIIT-Hyderabad, India
{nikhil.kvs,abhilashi}@research.iiit.ac.in, [email protected]

Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON2012), pages 10–16, Montréal, Canada, June 7, 2012. © 2012 Association for Computational Linguistics

Abstract

Hindi is an Indian language which is relatively rich in morphology. A few morphological analyzers of this language have been developed. However, they give only inflectional analysis of the language. In this paper, we present our Hindi derivational morphological analyzer. Our algorithm upgrades an existing inflectional analyzer to a derivational analyzer and primarily achieves two goals. First, it successfully incorporates derivational analysis in the inflectional analyzer. Second, it also increases the coverage of the inflectional analysis of the existing inflectional analyzer.

1 Introduction

Morphology is the study of the processes of word formation and of linguistic units such as morphemes and affixes in a given language. It consists of two branches: derivational morphology and inflectional morphology. Derivational morphology is the study of those processes of word formation where new words are formed from the existing stems through the addition of morphemes. The meaning of the resultant new word is different from the original word and it often belongs to a different syntactic category. Example: happiness (noun) = happy (adjective) + ness. Inflectional morphology is the study of those processes of word formation where various inflectional forms are formed from the existing stems. Number is an example of inflectional morphology. Example: cars = car + plural affix 's'.

The main objective of our work is to develop a tool which executes the derivational morphological analysis of Hindi. Morphological analysis is an important step for any linguistically informed natural language processing task. Most morphological analyzers perform only inflectional analysis. However, derivational analysis is also crucial for better performance of several systems. Derivational analyzers are used to improve the efficiency of machine translators (C Gdaniec et al., 2001). They are also used in search engines to improve information extraction (J Vilares et al., 2001). Since derivational processes can often be productive in a language, the development of an effective derivational analyzer will prove beneficial in several aspects.

We developed a derivational analyzer for Hindi over an already existing inflectional analyzer developed at IIIT Hyderabad. In this approach, first, the derived words in Hindi were studied to obtain the derivational suffixes of the language. Then the rules were designed by understanding the properties of the suffixes. The Hindi Wikipedia was also utilized to collect the required background data. Finally, an algorithm was developed based on the above findings. This algorithm has been used to upgrade the inflectional analyzer to a derivational analyzer.

In the sections that follow, we describe the approach we followed to develop our derivational analyzer and the experiments that we conducted using our system.

2 Related Work

There is no derivational morphological analyzer for Hindi to the best of our knowledge. However, a few inflectional morphological analyzers (IIIT; Vishal and G. Singh, 2008; Niraj and Robert, 2010)
of this language have been developed. There are derivational analyzers for other Indian languages like Marathi (Ashwini Vaidya, 2009) and Kannada (Bhuvaneshwari C Melinamath et al., 2011). The Marathi morphological analyzer was built using a paradigm based approach whereas the Kannada analyzer was built using an FST based approach. As far as English is concerned, there are some important works (Woods, 2000; Hoeppner, 1982) pertaining to the area of derivational morphological analysis. However, both of these are lexicon based works.

For our work, we employed a set of suffix replacement rules and a dictionary in our derivational analyzer, having taken insights from Porter's stemmer (Porter, 1980) and the K-stemmer (R. Krovetz, 1993). They are amongst the most cited stemmers in the literature. The primary goal of Porter's stemmer is suffix stripping. So when a word is given as input, the stemmer strips all the suffixes in the word to produce a stem. It achieves the task in five steps, applying rules at each step. Given a word as input, the Krovetz stemmer removes inflectional suffixes present in the word in three steps. First it converts the plural form of the word into a singular form, then it converts past tense to present tense, and finally it removes -ing. As the last step, the stemmer checks the dictionary for any recoding and returns the stem. Our algorithm uses the main principles of both Porter's stemmer and the Krovetz stemmer. The suffix replacement rules of our algorithm resemble those of Porter's stemmer, and a segment of the algorithm is analogous to the dictionary based approach of the Krovetz stemmer.

3 Existing Inflectional Hindi Morphological Analyzers

A derivational morph analyzer can be developed from an existing morph analyzer instead of building one from scratch. So three inflectional analyzers were considered for the purpose. The morphological analyzer developed by Vishal and Gurpreet stores all the commonly used word forms for all Hindi root words in its database. Thus, space is a constraint for this analyzer but the search time is quite low. The morph analyzer developed by Niraj and Robert extracts a set of suffix replacement rules from a corpus and a dictionary. The rules are applied to an inflected
word to obtain the root word. They show that the process of developing such rule sets is simple and that it can be applied to develop morphological analyzers of other Indian languages.

However, our derivational analyzer is an extension of an existing inflectional morphological analyzer developed at IIIT Hyderabad (Bharati Akshar et al., 1995). The inflectional analyzer is based on the paradigm model. It uses the combination of paradigms and a root word dictionary to provide inflectional analysis. Given an inflected Hindi word, this inflectional analyzer returns its root form and other grammatical features such as gender, number, person, etc. For example: if the input word to the morphological analyzer is bAgabAnoM¹ (gardeners), the output will be bAgabAna (gardener), noun, m, pl, etc. Here the word bAgabAna is the root word of the input word. 'Noun' is the category of the input word, 'm' means masculine and 'pl' means that the input word is plural in number.

¹ The Hindi words are in wx-format (sanskrit.inria.fr/DATA/wx.html) followed by IIIT-Hyderabad.

The analyzer uses a root word dictionary for the purpose. If a word is present in the root word dictionary, the analyzer handles all the inflections pertaining to that word. For example: xe (give) is a root word present in the dictionary of the analyzer. xewA (gives), xenA (to give), xiyA (gave) and other inflectional forms of the root word xe are handled by the analyzer. There are 34407 words in the root word dictionary.

The analyzer handles inflected words using the paradigm tables. Every entry (word) in the dictionary has values like lexical category, paradigm class, etc. For example: there is a word pulisavAlA (policeman) in the dictionary. Its paradigm class is ladakA. Table 1 shows the paradigm forms of ladakA. Since the paradigm value of pulisavAlA is ladakA, its four inflections will be similar to the four paradigms of ladakA (root paradigm). The four inflections of pulisavAlA are pulisavAlA, pulisavAle, pulisavAle, pulisavAloM. Only the root form (word) pulisavAlA is present in the dictionary. In this way every root word present in the dictionary belongs to a paradigm class, and this paradigm class has a structured paradigm table containing all the inflections of the main paradigm. This paradigm table is used by the analyzer to reconstruct all the inflections of the root words belonging to this paradigm class. Therefore the analyzer can analyze a word only if its root word is present in the dictionary.

Table 1: Paradigm table of ladakA

Case      Singular form   Plural form
Direct    ladakA          ladake
Oblique   ladake          ladakoM

This inflectional morphological analyzer works as a platform for our derivational morphological analyzer. So our tool gives derivational analysis of all the words whose root forms are present in the root word dictionary. Our tool also tackles certain words whose root forms are not present in the root word dictionary of the IIIT morphological analyzer.

4 Approach

We pursued the following five step approach for building our derivational analyzer.

4.1 Studying Hindi Derivations

To build the derivational morphological analyzer, we first conducted a study to identify the derivational suffixes and the related morphological changes. After identifying the suffixes, the rules pertaining to these suffixes were obtained.

First, the nouns present in the Hindi vocabulary were studied. The study of nouns helped us in identifying some of the most productive derivational suffixes present in the language. For example, let us consider the word maxaxagAra (helper). This word is derived from the word maxaxa (maxaxagAra = maxaxa (help) + gAra). But gAra cannot be confirmed as a suffix because of just one
instance. In order to confirm gAra as a suffix, other words ending with gAra must also be examined. The more such words we find, the greater is the productivity of the suffix. Words like yAxagAra (derived from yAxa) and gunAhagAra (criminal) (derived from gunAha (crime)) prove that gAra is a derivational suffix. However, every word ending with gAra need not be a derived word. For example: the word aMgAra is not a derived word. Therefore only relevant words were studied and the suffixes were obtained only from them.

The entire process of obtaining the derivational suffixes was done manually and was a time consuming process. This process was repeated for adjectives as well. Only those suffixes that participate in the formation of nouns and adjectives were found. A total of 22 productive derivational suffixes were procured. Table 2 shows a few suffixes and their derivations.

Table 2: Example derivations of some suffixes

Suffix   Root      Derivation
bAna     bAga      bAgabAna
gAra     yAxa      yAxagAra
ika      aXikAra   aXikArika
I        KuSa      KuSI
AI       acCA      acCAI

4.2 Derivational Rules

After finding the derivational suffixes, two sets of derivational rules were developed for each suffix. The first set explains the formation of the derived words from their root words. Let us consider the suffix gAra. This suffix generates nouns from nouns and adjectives. The rule of this suffix explains the formation of derivations like yAxagAra (yAxagAra = yAxa (noun) + gAra) and maxaxagAra (maxaxagAra = maxaxa + gAra). The second set consists of reverse rules of the first set. The reverse rule for the previous example is noun/adj = noun - suffix. In this way, rules were developed for all the 22 derivational suffixes. These rules form a vital component of our algorithm. Table 3 contains the derivational rules of a few suffixes.

Table 3: Rules of few suffixes

Suffix   First set rules
gAra     noun = noun/adj + gAra
xAra     noun = noun/adj + xAra
ika      adj = noun - a + ika

4.3 Finding Majority Properties

The majority properties (of derived words of a suffix) are the properties which most of the words exhibit. Example: let us consider the derived words of the suffix vAlA. There are 36 derived words of the vAlA suffix in the root word dictionary. Some of these words are adjectives but the majority are nouns. Hence noun is fixed as the category (majority category) for derived words of this class. Similarly the majority paradigm class of these words is ladakA. The majority properties of derived words pertaining to all the 22 suffixes were acquired.

The majority properties of a suffix help us in the derivational analysis of the unknown derived words of that suffix. For example: consider the word GaravAlA (housekeeper). Let us assume that it is not present in the root word dictionary. Therefore the lexical category, paradigm value and other important features of this word are not known. But let us assume that this word is a genuine derived word of the suffix vAlA. So the tool must handle this case. The majority properties of the
vAlA suffix are assigned to this word. So noun and ladakA are fixed as the category and paradigm of this word. Thus the genuine derived words which are unknown to the analyzer will be analyzed using the majority properties.

The majority properties of derived words were obtained in two main steps. First, a suffix was considered. Then all the derived words pertaining to that suffix were acquired. Only genuine derived words were taken into consideration. Genuine derivations were found using the suffix derivational rules. Example: let us take the word maxaxagAra (ending with gAra). First, the root word of this word is retrieved using the gAra derivational rule. The root word according to the rule is maxaxa. This word is present in the dictionary and it also satisfies the category condition of the rule. The word maxaxa is a noun. Hence the word maxaxagAra is accepted as a derived word. If the word maxaxa is not found in the dictionary or if its category is not noun/adjective, the word maxaxagAra will be rejected. In this way all the valid derivations of the suffix were acquired. This process was repeated for other suffixes as well. In the second step, the majority properties of the derived words were directly retrieved.

Finally, a suffix table was built using the majority properties of the derived words. The suffix table contains all the suffixes and their inflectional forms. Table 4 contains a few suffixes and their inflectional forms. For example: the majority paradigm of derived words of the vAlA suffix is ladakA. This implies that the derived words of this suffix end with vAlA, vAle and vAloM. Thus the possible inflections of a suffix can be derived from its majority properties. This information was stored in a table. The majority properties and the suffix table play an important role in the analysis of the unknown words. Their usage in our algorithm will be described in the later sections.

Table 4: Few suffixes and their forms

Suffix   Suffix-forms
Ana      Ana
bAna     bAna, bAnoM
gAra     gAra, gAroM
xAra     xAra, xAroM
ika      ika
I        I
AI       AI
anI      anI, aniyAz, aniyoM

4.4 Using Wikipedia Data for Confirming Genuineness

If an invalid word is not analyzed by the inflectional analyzer, there is no need for proceeding to the derivational analysis of that word. Therefore the genuineness of a word must be tested before going for the derivational analysis. The Hindi Wikipedia was chosen as a resource that enables us to test the genuineness of a word.

A total of 400k words were extracted from the Hindi Wikipedia. This data contains many words which do not exist in Hindi vocabulary. So 220k proper Hindi words were selected (on the basis of frequency) from the data and a list containing those 220k words was created. A word will be treated as a genuine word only when it is present in that list. This assumption is used by our algorithm. The Wiki data is used as a standard corpus.

4.5 Algorithm for Derivational Analysis

An algorithm was developed to make use of the existing inflectional morphological analyzer for derivational analysis. This algorithm enabled us to bypass the construction of a derivational analyzer from scratch. The majority properties of the derivations, the Wikipedia data and the suffix table are also employed by the algorithm for analyzing unknown derivations.

Figure 1: Algorithm

The input to the algorithm is a word. The output is a combination of the inflectional analysis and the derivational analysis of the input word. For example, suppose the input word is bAgabAnoM (gardeners). First, the algorithm gives the inflectional analysis of the input word. In this case the word bAgabAnoM is a noun, plural in number, etc. Then it gives the information (category, gender) of the root word (bAga (garden)) from which the input word is derived (derivational analysis). So a dual analysis of the input word is provided.
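The control flow described in Sections 4.2 to 4.5 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy dictionaries, the function names (`inflectional_analyze`, `derivational_analyze`) and the rule listed for the suffix wA are hypothetical stand-ins, and the real inflectional analyzer is replaced by a lookup table. The sketch also assumes that, for words unknown to the inflectional analyzer, the root is not required to be in the root word dictionary; the assignment of majority properties is omitted.

```python
from typing import Optional

# Toy root word dictionary: word -> lexical category (illustrative entries only).
ROOT_DICT = {"pulisa": "noun", "pulisavAlA": "noun", "ladakA": "noun", "maxaxa": "noun"}

# Stand-in for the inflectional analyzer: inflected form -> normal form.
INFLECTED = {"pulisavAle": "pulisavAlA", "pulisavAlA": "pulisavAlA",
             "ladake": "ladakA", "ladakA": "ladakA"}

# Reverse rules: suffix -> (category of derived word, allowed root categories).
# The vAlA, xAra and gAra rules follow the text; the wA rule is hypothetical.
RULES = {
    "vAlA": ("noun", {"noun", "verb"}),
    "xAra": ("noun", {"noun", "adj"}),
    "gAra": ("noun", {"noun", "adj"}),
    "wA": ("noun", {"adj"}),
}

# Suffix table: inflected suffix form -> suffix.
SUFFIX_FORMS = {"vAlA": "vAlA", "vAle": "vAlA", "vAloM": "vAlA",
                "xAra": "xAra", "xAroM": "xAra",
                "gAra": "gAra", "gAroM": "gAra", "wA": "wA"}

# Stand-in for the 220k-word Wikipedia list used as a genuineness check.
WIKI_WORDS = {"kirAexAra", "pulisavAlA"}


def inflectional_analyze(word: str) -> Optional[str]:
    """Return the normal form of a word, or None if it cannot be analyzed."""
    return INFLECTED.get(word)


def derivational_analyze(word: str) -> Optional[dict]:
    """Return a derivational analysis of the word, or None if there is none."""
    normal = inflectional_analyze(word)
    if normal is not None:
        # Known word: strip a derivational suffix and validate the root.
        for suffix in sorted(RULES, key=len, reverse=True):
            if normal.endswith(suffix) and len(normal) > len(suffix):
                root = normal[: -len(suffix)]
                _, allowed = RULES[suffix]
                if ROOT_DICT.get(root) in allowed:
                    return {"normal_form": normal, "root": root, "suffix": suffix}
        return None
    # Unknown word: recover the normal form from the suffix table,
    # then accept it only if the Wikipedia list contains it.
    for form in sorted(SUFFIX_FORMS, key=len, reverse=True):
        if word.endswith(form) and len(word) > len(form):
            suffix = SUFFIX_FORMS[form]
            normal = word[: -len(form)] + suffix
            if normal in WIKI_WORDS:
                root = normal[: -len(suffix)]
                return {"normal_form": normal, "root": root, "suffix": suffix}
    return None


for w in ["pulisavAle", "kirAexAroM", "ladake", "ppppwA"]:
    print(w, derivational_analyze(w))
```

Trying longer suffix forms first is a design choice of this sketch, so that a form such as xAroM is matched before any shorter form that happens to be its tail.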
4.6 Examples

The following 4 examples explain the working of the algorithm in 4 different cases. These examples are provided to give a clear picture of the complete algorithm.

a) Example 1
Input word: pulisavAle (Policemen)
In step 2, the word is analyzed by the IIIT inflectional analyzer. In step 3a.1, the word pulisavAlA (Policeman) is the normal-form of the input word. The normal-form ends with one of our 22 suffixes (vAlA). The rule of the suffix is noun = noun/verb + vAlA. So the root word is pulisa because pulisavAlA = pulisa + vAlA. The word pulisa should be a noun or a verb in order to satisfy the rule. All the conditions are met and step 3a.5 becomes the vital final step. This step gives the information that the final root word pulisa is a masculine noun and the input word is also a masculine noun and it is plural in number. Here the information about the final root word and the input word is again given using the inflectional morphological analyzer.

b) Example 2
Input word: kirAexAroM (Tenants)
The IIIT inflectional analyzer cannot analyze this word. The word kirAexAroM ends with one of the forms (xAroM) present in the suffix table. The normal-form of the input word is obtained by replacing the suffix form in the input word with the suffix. Hence the normal-form of the input word kirAexAroM is kirAexAra. In this way, the normal-form of the input word is acquired without the inflectional analyzer. The word kirAexAra is present in the Wiki data and it ends with one of the 22 suffixes. The rule of the suffix is noun = noun/adj + xAra. So the root word is kirAe because kirAexAra = kirAe + xAra.

c) Example 3
Input word: ladake (Boys)
In step 2, the word is analyzed by the IIIT inflectional analyzer. The normal-form of the word is ladakA (boy). The normal-form of the word does not end with any of our 22 suffixes. So there is no derivational analysis in this particular case.

d) Example 4
Input word: ppppwA (invalid word)
The IIIT inflectional analyzer cannot analyze this word. The word ppppwA ends with one of the forms (wA) present in the suffix table. But the normal-form (ppppwA) is not present in Wikipedia. So there is no derivational analysis in this particular case.

4.7 Expanding Inflectional Analysis

The algorithm for derivational analysis was also used for expanding the inflectional analysis of the analyzer. Consider the second example in the previous section. The word kirAexAroM is analyzed by the derivational analyzer even though its root form (kirAexAra) is not present in the root word dictionary. Words like kirAexAra are genuine derivations and can be added to the root word dictionary. The addition of such words will extend the inflectional analysis of the analyzer. For example, if the
word kirAexAra is added, its forms kirAexAroM and kirAexAra will be automatically analyzed. This is because the word kirAexAra would be added along with its features/values like category, paradigm class, etc. Therefore all the words which fall into the example-2 category of the previous section can be added to the dictionary. All such words must be obtained in order to expand our dictionary. For this purpose, Wiki data consisting of 220k Wiki words was extracted from Wikipedia. Out of these 220k words, 40k words end with our 22 suffixes and their forms. So the derived words which can be analyzed by our system are part of this sub-dataset. Out of the 40k words, the derivational analyzer analyzed 5579 words. The inflectional analyzer analyzed only 2362 words out of 40000. So the derivational analyzer analyzed 3217 more derived words than the inflectional analyzer. These words were added to the root word dictionary for expanding the inflectional analysis of the analyzer. The algorithm which was designed to perform the derivational analysis thus also expanded the inflectional analysis of the analyzer.

5 Experiments and Results

The performance of our derivational analyzer should be compared with an existing derivational analyzer. Since there is no such derivational analyzer, we compared the performance of our tool with the existing IIIT inflectional analyzer (or the old morphological analyzer). The two tools must be tested on gold-data (data that does not contain any errors). For example: let us assume that we have a data set of 100 words and their morphological analysis. The analysis of these 100 words does not contain any errors, so it is gold-data. Now we must get the analysis of these 100 words from both the derivational analyzer and the old morphological analyzer. Then their analyses must be compared against the gold-data. This amounts to directly comparing the outputs of the derivational analyzer and the old morphological analyzer. This will help in evaluating the derivational analyzer. This method of evaluation will also tell us the improvement the derivational analyzer achieved.

Figure 2: Evaluation Methodology for Morph Analyzers

Figure 2 (Amba P Kulkarni, 2010) explains our evaluation methodology for morphological analyzers. Let us continue with the example mentioned in the previous paragraph. First, we find the analysis of the 100 words by the old morph analyzer. We compare its output with the gold output/analysis. Let there be 50 words which belong to Type-1. It means the gold analysis and morphological
analysis (by old Let there be 10 derivational 6 Conclusions
morph) of 50 words words which belong analyzer. As a
is perfectly equal. to Type-6. It means result of this We presented an
derivational improvement, the algorithm which
Table 5: Output analyzer. Finally overall Type-1 uses an exist- ing
analysis of old we compare the (Perfect output inflectional
morph analyzer
evaluations of the which is analyzer for
Type
old morphological completely performing
analyzer and our matching with the derivational
Number of
derivational an- gold output) of analysis. The
instances % of
Type alyzer. This is our derivational algorithm uses the
Type1 evaluation analyzer is nearly main principles of
Type2 methodology. 5% more than that both the Porters
Type3 So a gold-data of the old stemmer and
Type4 consisting of the morphological Krovetz stemmer
Type5 analysis of 5000 analyzer. The data for achieving the
words was taken. size is small (only task. The algorithm
Type6
The linguistic 5000). A testing on achieves decent
experts of IIIT Hy- a larger gold-data precision and recall.
Table 6: Output derabad have built will show an even It also expands the
analysis of this data and it was better picture of the coverage of the
derivational acquired from that improvement that inflectional
analyzer analyzer. But it
institution. The can be achieved by
Type the derivational must be incorpo-
5000 words were
tested on both the analyzer. rated in applications
Number of
derivational like machine
instances % of
analyzer and the translators which
Type
inflectional ana- use derivational
Type1
lyzer. analysis for
Type2
Both the understanding its
Type3
analyzers were real strengths and
Type4
tested on the gold- limitations.
Type5
Type6 data containing
5000 words. The References
table 6 proves
that the old that the Claudia Gdaniec,
performance of Esm Manandise,
morphological
Michael C.
analyzer could not the new McCord. 2001.
an- alyze 10 words derivational Derivational
but there is gold analyzer is better morphology to the
analysis of those than the old rescue: how it can
words. In this way, morphological help resolve
each type forms an analyzer. The old unfound words in
important part of analyzer could not MT, pp.129–131.
Summit VIII:
the evaluation provide any output
Machine
process. Similarly of 288 words Translation in the
we evalu- ate the (Type-6) whereas Information Age,
analysis of the 100 that number is only Proceedings,
words by the 31 in- case of the Santiago de

19
Compostela, Spain. Proceedings of
Jesus Vilares, David ANLC.
Cabrero and Miguel Wolfgang Hoeppner.
A. Alonso. 2001. 1982. A
Applying multilayered
Productive approach to the
Derivational handling of word
Morphology to formation. In
Term Indexing of Proceedings of
Spanish Texts. In COLING.
Proceedings of R. Krovetz. 1993. Viewing
CICLing. morphology as an
Vishal Goyal, Gurpreet inference process. In
Singh Lehal. 2008. Proceedings of COLING.
Hindi Mor- M. F. Porter. 1980. An
phological Analyzer algorithm for suffix
and Generator, pp. stripping. Originally
1156–1159. IEEE published in Program, 14
Computer Society no. 3, pp 130-137.
Press, California, Bharati Akshar, Vineet
USA. Chaitanya, Rajeev
Niraj Aswani, Robert Sangal. 1995. Natural
Gaizauskas. 2010. Language Processing: A
Develop- ing Paninian Perspec-
Morphological tive. Prentice-Hall of India.
Analysers for South Amba P Kulkarni.
Asian Lan- guages: 2010. A Report on
Experimenting with Evaluation of San-
the Hindi and skrit Tools.
Gujarati
Languages. In
Proceedings of
LREC.
Ashwini Vaidya. 2009.
Using paradigms
for certain
morphological
phenomena in
Marathi. In
Proceedings of
ICON.
Bhuvaneshwari C
Melinamath,
Shubhagini D. 2011.
A robust
Morphological
analyzer to capture
Kannada noun
Morphology, VOL
13. IPCSIT.
William A. Woods.
2000. Aggressive
Morphology for
Robust Lexical
Coverage. In

20

You might also like