
In Milena Slavcheva, Galia Angelova, and Kiril Simov (eds.), Readings in Multilinguality.

Selected Papers, pp. 134-141. INCOMA Ltd., Shoumen, Bulgaria, December 2006.
ISBN 978-954-91743-6-6.

Ontology-Based Word Sense Disambiguation in Parallel Corpora

Dan Tufiş

Research Institute for Artificial Intelligence


Romanian Academy
13, “13 Septembrie”, 050711, Bucharest 5, Romania
[email protected]

Abstract

Lately, there seems to be a growing acceptance of the idea that multilingual lexical ontologies might be the key towards aligning different views on the semantic atomic units to be used in characterizing the general meaning of various and multilingual documents. Comparing the performances of word sense disambiguation systems is a difficult evaluation task when different sense inventories are used, and even more difficult when the sense distinctions are not of the same granularity. The paper substantiates this statement by presenting a statistics-based system for word alignment and word sense disambiguation in parallel corpora. The system is supported by a lexical ontology made of aligned wordnets for the languages in the corpora. The wordnets are aligned via the Princeton WordNet, used as an interlingual index. The evaluation of the WSD system was performed on the same data, using three different sense inventories.

1 Introduction

Most difficult problems in natural language processing stem from the pervasive ambiguity of human languages. Ambiguity is present at all levels of the traditional structuring of a language system (phonology, morphology, lexicon, syntax, semantics), and not dealing with it at the proper level exponentially increases the complexity of problem solving. Currently, state-of-the-art taggers (combining various models, strategies and processing tiers) ensure no less than 97-98% accuracy in full morpho-lexical disambiguation. For such taggers a 2-best tagging is practically 100% correct. (In k-best tagging, instead of assigning each word exactly one tag, the most probable in the given context, it is allowed to occasionally attach at most k best tags to a word; if the correct tag is among the k best tags, the annotation is considered to be correct.)

One further step is the word sense disambiguation (WSD) process. In traditional compositional semantics, the meaning of a complex expression is supposed to be derivable from the meanings of its parts and the way in which those parts are combined. Depending on the representation formalisms for word-meaning representation, various calculi may be considered for computing the meaning of a complex expression from the atomic representations of the word senses. Obviously, one should be able, beforehand, to decide for each word in a text which of its possible meanings is, contextually, the right one. Therefore, it is a generally accepted idea that the WSD task is highly instrumental (if not indispensable) in the semantic processing of natural language documents.

The WSD problem can be stated as being able to associate to an ambiguous word (w) in a text or discourse the sense (sk) which is distinguishable from the other senses (s1, …, sk-1, sk+1, …, sn) prescribed for that word by a reference semantic lexicon. One such semantic lexicon (actually a lexical ontology) is Princeton WordNet (Fellbaum, 1998) version 2.0 (henceforth PWN, http://www.cogsci.princeton.edu/~wn/). PWN is a very fine-grained semantic lexicon currently containing 203,147 sense distinctions, clustered in 115,424 equivalence classes (synsets). Out of the 145,627 distinct words, 119,528 have only one single sense. However, the remaining 26,099 words are those that one would frequently meet in a regular text, and their ambiguity ranges from two senses up to 36. Several authors have considered that the sense granularity in PWN is too fine-grained for computer use, arguing that even for a human (native speaker of English) the sense differences of some words are very hard to distinguish reliably (and systematically). There have been several attempts to group the senses of the words in PWN into coarser-grained senses – hyper-senses – so that a clear-cut distinction among them is always possible for humans and (especially) computers. We will refer in this paper to two hyper-sense inventories used in the BalkaNet project (Tufiş, 2004). A comprehensive review of the WSD state of the art at the end of the 90's can be found in (Ide & Veronis, 2001). Stevenson and Wilks (1998) review several WSD systems that combined various knowledge sources to improve the disambiguation accuracy and address the issue of different granularities of the sense inventories. The SENSEVAL series of evaluation competitions on WSD (http://www.cs.unt.edu/~rada/senseval) is a very good source for learning how WSD has evolved in the last 6-7 years and where it stands nowadays.

We describe a multilingual environment containing several monolingual wordnets, aligned to PWN used as an interlingual index (ILI). The word sense disambiguation method combines word alignment technologies and interlingual equivalence relations in multilingual wordnets. Irrespective of the languages in the multilingual documents, the words of interest are disambiguated by using the same sense-inventory labels. The aligned wordnets were constructed in the context of the European project BalkaNet (Tufiş, 2004). The consortium developed monolingual wordnets for five Balkan languages (Bulgarian, Greek, Romanian, Serbian, and Turkish) and extended the Czech wordnet initially developed in the EuroWordNet project (Vossen, 1998). The version of the PWN used as ILI is an enhanced XML version where the synsets are linked to SUMO (Niles & Pease, 2001) conceptual categories and are also associated with IRST domain labels (Magnini & Cavaglia, 2000). In the present version of the BalkaNet ILI, 2066 distinct SUMO categories and 163 domain labels are used. Therefore, for our WSD experiments we had at our disposal three sense inventories with very different granularities: PWN senses, SUMO categories and IRST Domains.

2 Word Alignment

2.1 Preprocessing

The word alignment is the first step (the hardest) in our approach to the identification of word senses. The input format for the word aligner is obtained from two raw texts that represent reciprocal translations. The first pre-processing step deals with text segmentation and is achieved by a modified (much faster) version of the multilingual segmenter MtSeg developed for the MULTEXT project. The segmenter comes with tokenization resources for many Western European languages, further enhanced in the MULTEXT-EAST project (Dimitrova et al., 1998; Tufiş et al., 1998) with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The segmenter is able to recognize sentence and clause boundaries, dates, numbers and various fixed phrases, and to split clitics or contractions where needed. We significantly updated the tokenization resources for Romanian and English (the languages we have been most interested in lately). Additionally, for bilingual contexts we used the feedback from the lexical alignment phase to build a language-pair-dependent tokenization resource which stores multiword sequences in one language that are translated in the other language by a single word or by a multiword sequence (provided the sequences in the two languages cannot be aligned word-by-word).

The tokenized texts are further POS-tagged. Recently, we re-implemented the tiered tagging methodology (Tufiş, 1999), relying on a combination of an HMM tagger called TTL (Ion, 2006), which also produces the lemmatization, and a maximum-entropy tagger (Ceauşu, 2006). The HMM tagger works with a reduced internal tagset, while the ME tagger ensures the mapping of the first tagset onto a much larger one (the lexical tagset), dispensing with the hand-written mapping rules used by the initial version of the tiered tagging engine.

Lemmatization is in our case a straightforward process, since the monolingual lexicons developed within MULTEXT-EAST contain, for each word, its lemma and morpho-syntactic codes. Knowing the word form and its associated tag, the lemma extraction is simply a matter of lexicon lookup for those words that are in the lexicon. For the unknown words which are not tagged as proper names, a set of lemma candidates is generated by a set of suffix-stripping rules induced from the word-form lexicon. A four-gram letter Markov model (trained on the lemmas in the word-form dictionary) is used to choose the most likely lemma.

The next pre-processing step is sentence chunking in both languages. The chunks are recognized by a set of regular expressions defined over the tagsets, and they correspond to (non-recursive) noun phrases, adjectival phrases, prepositional phrases and verb complexes (analytical realization of tense, aspect, mood and diathesis, and phrasal verbs). The texts are further processed by a statistical dependency linking parser. Finally, the bitext is assembled as
an XML document (XCES-compliant format, http://www.cs.vassar.edu/XCES/), which is the standard input for most of our tools.

The word alignment process is preceded by a coarser-grained alignment, namely the sentence alignment, which transforms a parallel text <TL1, TL2> into a sequence of pairs of one or more sentences in language L1 (SL11 SL12 … SL1k) and one or more sentences in language L2 (SL21 SL22 … SL2m) so that the two ordered sets of sentences represent reciprocal translations. Such a pair is called a translation alignment unit (or translation unit). In the vast majority of cases a translation unit contains one sentence per language (this is called a 1-1 translation unit).

We developed a sentence aligner inspired by Moore's aligner (Moore, 2002) which, unlike it, is able to detect sentence alignments which are not necessarily 1-1 and can process arbitrarily large parallel data. It has a comparable precision but a better recall than Moore's aligner. Our aligner does not need a priori language-specific information, its parameters being set by a training phase on a small amount of human-checked alignment data (about 1000 sentences).

The sentence aligner consists of a hypothesis generator, which creates a list of plausible sentence alignments from the parallel corpus, and a filter, which removes the improbable alignments. The filter is an SVM binary classifier (Fan et al., 2005) initially trained on a Gold Standard. The features of the initial SVM model are: the word sentence length, the non-word sentence length, and the rank correlation for the first 25% of the most frequent words in the two parts of the training bitext. This model is used to preliminarily filter alignment hypotheses generated from the parallel corpus. The set of the remaining aligned sentences is used as the input for an EM algorithm which builds a word translation equivalence table by an approach similar to the IBM Model-1 procedure. The SVM model is then rebuilt (from the Gold Standard), this time including, as an additional feature, the number of word translation equivalents existing in the sentences of a candidate alignment pair. This new model is used by the SVM classifier for the final sentence alignment of the parallel corpus.

2.2 Two Aligners and Their Combination

The word alignment of a bitext is an explicit representation of the pairs of words <wL1, wL2> (called translation equivalence pairs) co-occurring in the same translation units and representing mutual translations. The general word alignment problem includes the cases where words in one part of the bitext are not translated in the other part (these are called null alignments) and the cases where multiple words in one part of the bitext are translated as one or more words in the other part (these are called expression alignments).

We developed two quite different word aligners, motivated by two distinct objectives: the first one, called YAWA (Tufiş et al., 2005), was motivated by a project aiming at the development of an interlingually aligned set of wordnets, while the other one, called MEBA (Tufiş et al., 2005), was developed within an ongoing SMT project. The first one was used for validating, against a multilingual corpus, the interlingual synset equivalences, and also for WSD experiments. Although initially it was concerned only with open-class words recorded in a wordnet, turning it into an "all words" aligner was not a difficult task.

YAWA is a three-stage lexical aligner that uses bilingual translation lexicons and phrase boundary detection to align the words of a given bitext. The translation lexicons are generated by a different module, TREQ (Tufiş, 2002; Tufiş et al., 2003), which generates translation equivalence hypotheses for the pairs of words (one for each language in the parallel corpus) which have been observed occurring in aligned sentences more often than expected by chance. The hypotheses are filtered by a log-likelihood score threshold. Several heuristics (string similarity/cognates, POS affinities and alignment locality) are used in a competitive linking manner (Melamed, 2001) to extract the most likely translation equivalents. (The alignment locality heuristic exploits the observation, made by several researchers, that adjacent words of a text in the source language tend to align to adjacent words in the target language. A stricter alignment locality constraint requires that all alignment links starting from a chunk in one language end in a chunk in the other language.)

YAWA generates a bitext alignment by incrementally adding new links to those created at the end of the previous stage. The existing links act as contextual restrictors for the newly added links. From one phase to the other, new links are added without deleting anything. This monotonic process requires a very high precision (at the price of a modest recall) for the first step. The next two steps are responsible for significantly improving the recall and ensuring an increased F-measure.

A quite different approach from the one used by YAWA is implemented in our second word aligner, called MEBA. It is a multiple-parameter and multiple-step algorithm using relevance thresholds specific to each parameter, but different from one step to the other. The implementation of
MEBA was strongly influenced by the famous five IBM models described in the seminal paper (Brown et al., 1993). We used GIZA++ (Och & Ney, 2000; Och & Ney, 2003) to estimate different parameters of the MEBA aligner.

MEBA is an iterative algorithm that takes advantage of all the pre-processing phases mentioned at the beginning of Section 2.

The alignment model considers a link between two candidate words as an object that is described by a feature-value structure (with values in the [0,1] interval) which we call the reification of the link. We differentiate between context-independent features that refer only to the tokens of the current link (translation equivalence, part-of-speech affinity, cognates, etc.) and context-dependent features that refer to the properties of the current link with respect to the rest of the links in a bitext (locality, number of traversed links, token index displacement, collocation). We also distinguish between bi-directional features (translation equivalence, part-of-speech affinity) and non-directional features (cognates, locality, number of traversed links, collocation, index displacement).

2.3 COWAL: The Combined Aligner

The Combined Word Aligner, COWAL, is a wrapper of the two aligners (YAWA and MEBA), merging the individual alignments and filtering the result. At the Shared Task on Word Alignment organized by the ACL2005 Workshop on "Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond" (Martin et al., 2005), we participated (on the Romanian-English track) with the two aligners and the combined one (COWAL). Out of 37 competing systems, COWAL (Tufiş et al., 2005) was rated first, MEBA 20th and TREQ-AL (Tufiş, 2002; Tufiş et al., 2003), the former version of YAWA, was rated 21st. The usefulness of the aligner combination was convincingly demonstrated. Meanwhile, both the individual aligners and their combination have been significantly improved.

One very simple but very effective method of alignment combination is a heuristic procedure which merges the alignments produced by two or more word aligners and filters out the links that are likely to be wrong. For the purpose of filtering, a link is characterized by its type, defined by the pair of indexes (i, j) and the POS of the tokens of the respective link. The likelihood of a link is proportional to the POS affinity of the tokens of the link and inversely proportional to the bounded relative position (BRP) of the respective tokens: BRP = 1 + ||i − j| − avg|, where avg is the average displacement, in a Gold Standard, of the aligned tokens with the same POSes as the tokens of the current link. From the same Gold Standard we estimated a threshold below which a link is removed from the final alignment.

A more elaborate alignment combination (with better results than the previous one) is modelled as a binary statistical classification problem (good/bad) and, as in the case of the previous method, the net result is the removal of the links which are likely to be wrong. We used the SVM training and classification toolkit LIBSVM (Fan et al., 2005) with the default parameters (C-SVC classification and radial basis kernel function). The classifier was trained with positive and negative examples of links. A subset of the Gold Standard alignment links was used as the positive examples set. The same number of negative examples was extracted from the alignments produced by COWAL and MEBA where they differ from the Gold Standard.

The result of the SVM-based combination (COWAL), compared with the individual aligners, is shown in Table 1.

Aligner   P        R        F-measure
YAWA      88.80%   74.83%   81.22%
MEBA      92.15%   73.40%   81.71%
COWAL     87.26%   80.94%   83.98%

Table 1: Combined alignment

COWAL is now embedded into a larger platform (called MTkit) that incorporates the tools for bitext pre-processing, a graphical interface that allows for comparing and editing different alignments, as well as a word sense disambiguation module. A snapshot of the COWAL graphical interface is shown in Figure 1.

Figure 1: COWAL Graphical User Interface
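The heuristic filtering step described in Section 2.3 can be sketched in a few lines. This is only an illustration with hypothetical names (the paper does not publish code); the POS-affinity scores, per-POS average displacements and the pruning threshold are assumed to have been estimated from the Gold Standard, as described above:

```python
def brp(i, j, avg):
    """Bounded relative position of a link: BRP = 1 + ||i - j| - avg|."""
    return 1 + abs(abs(i - j) - avg)

def link_score(pos_affinity, i, j, avg):
    """Link likelihood: proportional to the POS affinity of the linked
    tokens, inversely proportional to their bounded relative position."""
    return pos_affinity / brp(i, j, avg)

def filter_links(links, avg_by_pos, threshold):
    """Drop links scoring below the Gold-Standard-derived threshold.
    Each link is a tuple (i, j, pos_pair, pos_affinity)."""
    kept = []
    for i, j, pos_pair, affinity in links:
        avg = avg_by_pos.get(pos_pair, 0.0)
        if link_score(affinity, i, j, avg) >= threshold:
            kept.append((i, j, pos_pair, affinity))
    return kept
```

For instance, with an average displacement of 1.0 for noun-noun links, a link five positions off with a weak affinity scores far below a link at the expected displacement with a strong affinity, and is pruned.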

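The precision, recall and F-measure figures in Table 1 are the standard set-based alignment scores. A minimal sketch of their computation, assuming links are reduced to plain (i, j) index pairs and ignoring the sure/probable link distinction used in the shared-task evaluation (hypothetical function name):

```python
def evaluate_alignment(predicted, gold):
    """Precision, recall and F-measure of predicted alignment links
    against a Gold Standard, with links represented as (i, j) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links confirmed by the Gold Standard
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_measure
```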
The left pane in Figure 1 is the alignment viewer and editor area. The user can edit the alignments (delete and add one or multiple links). By double-clicking a word in this pane, its properties are automatically displayed in the right-hand windows. The upper-right window shows the lexico-syntactic properties of the selected word: the morphological analysis of the orthographic form, its lemma, and the syntactic chunk to which it belongs. Currently this pane is not editable. The bottom-right window displays the semantic properties of the selected word: its sense in the current context, the gloss for this sense, synonyms, hyperonyms, derivatives, etc. These properties are extracted from the wordnet of the language to which the selected word belongs. This pane is editable, but only the sense number is subject to user modification.

Although far from being perfect, the accuracy of word alignment technology and of the translation lexicons extracted from parallel corpora is rapidly improving. In the shared task evaluations of different word aligners, organized on the occasion of the 2003 NAACL Conference and the 2005 ACL Conference, our winning systems TREQ-AL (Tufiş et al., 2003) and COWAL (Tufiş et al., 2006) produced wordnet-relevant lexicons with F-measures as high as 84.26% and 89.92%. (Wordnet-relevant lexicons are restricted to translation pairs of the same major POS: nouns, verbs, adjectives and adverbs. Currently, with the most recent improvements, COWAL's F-measure is 92.08%.)

3 WN-based Sense Disambiguation

The task of word sense disambiguation (WSD) requires a reference sense inventory in terms of which the senses of the target words will be labeled. We have argued at length elsewhere (Tufiş & Ion, 2004) that a meaningful discussion of the performance of a WSD system cannot dispense with clearly specifying the sense inventory it uses, and that the comparison between two WSD systems that use different sense inventories is frequently more confusing than illuminating. Essentially, this is because the differences in the semantic distinctions (sense granularities) used by different semantic dictionaries (sense repositories) make the difficulty of the WSD task range over a large spectrum. For instance, the discrimination of homographs (more often than not having different
parts of speech, e.g. "(to) bottle" as storing liquids or gases in bottles, versus "bottle" as the recipient) is much simpler than metonymic distinctions (e.g. "bottle" as container versus "bottle" as content).

In our research, we used the Princeton WordNet 2.0 as the major sense inventory and the BalkaNet multilingual lexical ontology. By observing the interlingual synset mapping principle and incorporating most of the conceptual extensions proposed by EuroWordNet, the BalkaNet wordnets can be easily combined with any of the other semantic networks of EuroWordNet and, thus, one may speak about a really pan-European multilingual lexical ontology, covering at least 15 languages (Basque, Bulgarian, Catalan, Dutch, Czech, English, Estonian, French, German, Greek, Italian, Romanian, Serbian, Spanish, and Turkish).

The BalkaNet multilingual environment took advantage of the latest developments in the PWN, which was itself adopted as an interlingual index. This is a major difference with respect to EuroWordNet's ILI. As the SUMO/MILO and DOMAINS classifications have both been aligned with PWN, they automatically became available in each monolingual wordnet of BalkaNet. To allow the representation of language-idiosyncratic properties, structural knowledge present in the monolingual wordnets has precedence over the structural knowledge imported from the ILI. As the Romanian wordnet (Tufiş et al., 2006) imported the SUMO/MILO and DOMAINS labels, and its synsets' unique identifiers are the same as in the PWN, it is self-contained but at the same time can be directly plugged into a PWN-centered multilingual wordnet infrastructure.

Once the translation equivalents are identified, it is reasonable to expect that the words of a translation pair <wiL1, wjL2> share at least one conceptual meaning stored in an interlingual sense inventory. When interlingually aligned wordnets are available (as is our case), obtaining the sense labels for the words in a translation pair is straightforward: one has to identify for wiL1 the synset SiL1 and for wjL2 the synset SjL2 so that SiL1 and SjL2 are projected over the same interlingual concept. The index of this common interlingual concept (ILI) is the sense label of the two words wiL1 and wjL2. However, it is possible that no common interlingual projection will be found for the synsets to which wiL1 and wjL2 belong. In this case, the senses of the two words will be given by the indexes of the most similar interlingual concepts corresponding to the synsets of the two words. Our measure of the semantic similarity of interlingual concepts is based on the PWN structure. (For a detailed discussion and an in-depth analysis of several other measures see: Budanitsky, A., Hirst, G., Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Proceedings of the Workshop on WordNet and Other Lexical Resources, NAACL, Pittsburgh, June (2001), 29-34.) We compute the semantic similarity score by the formula SYM(ILI1, ILI2) = 1 / (1 + k), where k is the number of links from ILI1 to ILI2, or from both ILI1 and ILI2 to their nearest common ancestor.

After the WSD process has finished, the sense information is inserted into the XML encoding of the corpus. Which sense inventory (ILI, SUMO or DOMAINS) should be used in the encoding is a user-set parameter, which by default includes all of them.

<tu id="Ozz20">
 <seg lang="en">
  <s id="Oen.1.1.4.9">
   <w lemma="the" ana="Dd">The</w>
   <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military">patrols</w>
   <w lemma="do" ana="Vais">did</w>
   <w lemma="not" ana="Rmp" sn="1" oc="not" dom="factotum">not</w>
   <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum">matter</w>
   <c>,</c>
   <w lemma="however" ana="Rmp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">however</w>
   <c>.</c>
  </s>
 </seg>
 <seg lang="ro">
  <s id="Oro.1.2.5.9">
   <w lemma="şi" ana="Crssp">Şi</w>
   <w lemma="totuşi" ana="Rgp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum">totuşi</w>
   <c>,</c>
   <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military">patrulele</w>
   <w lemma="nu" ana="Qz" sn="1.x" oc="not" dom="factotum">nu</w>
   <w lemma="conta" ana="Vmii3p" sn="2.x" oc="SubjAssesAttr" dom="factotum">contau</w>
   <c>.</c>
  </s>
 </seg>
 …
</tu>

Figure 2: The final corpus encoding

Figure 2 shows the final encoding of one translation unit of the "1984" parallel corpus. The "sn" attribute represents the Princeton WordNet 2.0 unique synset identifier (ILI code), the "oc" attribute represents the SUMO ontology concept and the "dom" attribute represents the DOMAINS label.
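The sense-labeling decision described above can be sketched as follows. This is a simplified illustration with hypothetical names (the paper publishes no code), assuming synsets are represented by their ILI codes and that a `distance` routine counting links over the PWN structure is available:

```python
def sym(k):
    """SYM(ILI1, ILI2) = 1 / (1 + k), where k is the number of links
    between the two interlingual concepts (or from both of them to
    their nearest common ancestor)."""
    return 1.0 / (1.0 + k)

def label_pair(ilis_l1, ilis_l2, distance):
    """Sense-label a translation pair <w_L1, w_L2>.
    ilis_lX: ILI codes of the synsets containing the word in language X.
    distance(a, b): link count between two ILI concepts (hypothetical helper).
    Returns one ILI label per word."""
    common = set(ilis_l1) & set(ilis_l2)
    if common:
        ili = min(common)  # shared interlingual projection (pick one)
        return ili, ili
    # No common projection: fall back to the most similar concept pair.
    return max(((a, b) for a in ilis_l1 for b in ilis_l2),
               key=lambda pair: sym(distance(*pair)))
```

When the two words project onto the same interlingual concept, both receive that ILI code; otherwise the pair of interlingual concepts maximizing SYM is chosen.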
4 WSD Evaluation Finally, the most refined sense inventory of PWN
will be extremely useful in Natural Language
The BalkaNet version of the “1984” corpus is Understanding Systems, which would require a
encoded as a sequence of uniquely identified deep processing. Such a fine inventory would be
translation units. For the evaluation purposes, we highly beneficial in lexicographic and lexicological
selected a set of frequent English words (123 studies.
nouns and 88 verbs) the meanings of which were Similar findings on sense granularity for the
also encoded in the Romanian wordnet. The WSD task are discussed in (Stevenson & Wilks,
selection considered only polysemous words (at 1998) where for some coarser grained inventories
least two senses per part of speech) since the POS- even higher precisions are reported. However, we
ambiguous words are irrelevant as this distinction are not aware of better results in WSD exercises
is solved with high accuracy (more than 99%) by where the PWN sense inventory was used. The
our present tiered-tagger (Ceauşu, 2006). All the major explanation for this is that unlike the
occurrences of the target words were majority work in WSD that is based on
disambiguated by three independent experts who monolingual environments, we use for the
negotiated the disagreements and thus created a definition of sense contexts the cross-lingual
gold-standard annotation for the evaluation of translations of the occurrences of the target words.
precision and recall of the WSD algorithm. The The way one word in context is translated into one
table below summarizes the results. or more other languages is a very accurate and
highly discriminative knowledge source for the
Precision Recall F-measure decision-making.
78.21% 78.21% 78.21%
5. Conclusions
Table 2. WSD precision, recall and F-measure
Word Alignment is a highly promising technology
With the PWN senses identified (synset unique with real prospects of soon reaching full maturity
identifiers), sense labeling with either SUMO and reliability as needed by commercial
and/or IRST domains inventories is trivial, as applications. Among them, one could mention
described before, because the synset unique multilingual computational lexicography and
identifiers of PWN are already mapped (clustered) terminology, multilingual documents indexing and
onto these two sense inventories. Table 3 shows a retrieval, open domain natural language question
great variation in terms of Precision, Recall and F- answering and obviously machine translation. We
measure when different granularity sense described another application, WSD, which is not
inventories are considered for the WSD problem. an end in itself, but necessary at one level or
Thus, it is important to make the right choice on another to accomplish most natural language
the sense inventory to be used with respect to a processing tasks.
given application. Neither YAWA nor MEBA needs an a priori
bilingual dictionary, as this will be automatically
extracted by the TREQ or GIZA++. We made
Sense Inventory Precision Recall F-measure
evaluation of the individual alignments in both
PWN 115424 cat. 78.21% 78.21% 78.21% experimental settings: without a startup bilingual
Sumo 2066 cat. 85.08% 85.08% 85.08% lexicon and with an initial mid-sized bilingual
lexicon. Surprisingly enough, we found that while
Domains 163 cat. 93.30% 93.30% 93.30% the performance of YAWA increases a little bit
Table 3. Evaluation of the WSD in terms of three (approx. 1% increase of the F-measure) MEBA is
different sense inventories. doing better without an additional lexicon. So, in
the evaluation presented in the previous section
In case of a document classification problem, it is MEBA uses only the training data vocabulary. The
very likely that the IRST domain labels (or a automatically extracted lexicons, could be almost
similar granularity sense inventory) would suffice. 100% accurate (with a sufficiently high occurrence
The rationale is that IRST domains are directly threshold) which is obviously a very good starting
derived from the Universal Decimal Classification point in compiling bilingual dictionaries for
as used by most libraries and librarians. The language pairs where such electronic resources are
SUMO sense labeling will be definitely more not easily available.
useful in an ontology based intelligent system The results in Table 3 show that although we
interacting through a natural language interface. used the same WSD algorithm on the same text,
the performance scores (precision, recall, f- Martin, J., Mihalcea, R., Pedersen, T. Word Alignment for Languages
with Scarce Resources. In Proceeding of the ACL2005 Workshop
measure) significantly varied, with more than 15% on “Building and Using Parallel Corpora: Data-driven Machine
difference between the best (DOMAINS) and the Translation and Beyond”. June, 2005, Ann Arbor, Michigan,
June, Association for Computational Linguistics, 65–74
worst (PWN) f-measures. This is not surprising,
Melamed, D. Empirical Methods for Exploiting Parallel Texts.
but it shows that it is extremely difficult to Cambridge, MA: MIT Press, 2001
objectively compare and rate WSD systems Mihalcea R., and Pedersen, T. An Evaluation Exercise for Word
working with different sense inventories. Alignment, in Proceedings of the HLT/NAACL Workshop on
The potential drawback of this approach is that it relies on the existence of parallel data and of at least two aligned wordnets, which might not be available yet. Nevertheless, parallel resources are becoming increasingly available, in particular on the World Wide Web, and aligned wordnets are being produced for more and more languages (currently there are more than 40 ongoing wordnet projects for 37 languages). In the near future it should be possible to apply our method, and similar ones, to large amounts of parallel data and a wide spectrum of languages.

Acknowledgements. The reported work is the result of several years of intensive research at our institute. Many people deserve acknowledgements here, but special mention is due to Radu Ion, Alin Ceauşu, Dan Ştefănescu, Verginica Barbu-Mititelu and Elena Irimia, who are currently preparing their PhD theses on topics directly or closely related to those discussed in this paper.