WordNet-Based Information Retrieval
form, direct hypernym, and sense. For instance, in the text in Figure 2.1, a possible full annotation of the WordNet word apple at (1) and (2) is the WW triple (apple, edible_fruit_1, #07739125-noun), while for the apple at (3) and (4) it is the WW triple (apple, apple_tree_1, #12633994-noun), where Malus pumila and apple are synonyms of the same WW sense whose identifier is #12633994-noun.

"To determine if an apple (1) is ready to be picked, place a cupped hand under the fruit, lift and gently twist. If the apple (2) doesn't come away easily in your hand, then it's not ready to harvest."1
"A round, firm fruit with juicy flesh; the tree bearing this fruit, Malus pumila, comes from the family Rosaceae (rose family). There are many, many types of apples (3) grown all over the world today and these can be divided into eating, cooking and cider apples (4)."2

Fig. 2.1. Text passages from the BBC3

However, due to ambiguity in a context, the performance of a WSD algorithm, or limitations of the ontology of discourse, an ontology word may not be fully annotated or may have multiple annotations. As shown in Figure 2.2, the first, third, and eleventh senses among the more than 11 senses of the word "movement" have the common hypernym change_3. So, in a context such as "movement belonging to change", the word "movement" can either be not fully annotated, as (movement, change_3, #*), or have three annotations, namely (movement, change_3, #movement_1-noun), (movement, change_3, #movement_2-noun), and (movement, change_3, #movement_3-noun).

[Fig. 2.2: hypernym hierarchy of the senses of the word "movement": event_1 → act_2 → action_1 → change_3 → {movement_1, movement_3, movement_11}; event_1 → happening_1 → movement_2.]
Fig. 2.2. Example of senses of the word "movement"

In this paper, we introduce the notion of most specific common hypernyms. A most specific common hypernym, denoted msc_hypernym, is a semantic relation between a sense and a sense set. A sense s is said to be a most specific common hypernym of a sense set {s1, s2, ...} if s is a common hypernym of the sense set and no common hypernym of the sense set is more specific than s. For example, event_1 is a msc_hypernym of the four senses movement_1, movement_2, movement_3, and movement_11. We note that a sense set may have more than one msc_hypernym. Besides, we write possible_senses(f) to denote the possible senses of the form f in a certain context.

In addition, f/msc_hypernym(possible_senses(f)) is the combined information of a form and its respective msc_hypernym. We propose to use this combined information when a word has more than one sense determined by the WSD algorithm. For example, in a context, the WSD algorithm may determine the four possible senses movement_1, movement_2, movement_3, and movement_11 for the word "movement". Then, the word "movement" is represented by movement/event_1.

In summary, the annotation of an ontology word having word form f can take one of the following formats: (1) a word sense s, when the sense of the word is determined; (2) the combined information f/msc_hypernym(possible_senses(f)), when the word has more than one determined sense. The synonyms, hypernyms, and hyponyms of an ontology word can be derived from its sense.

2.2 WordNet-Based Information Retrieval Systems
Given the general limitations of lexical search, semantic search is a major topic of current IR research. A semantic search method often embeds semantic information in queries and/or documents and may expand them with related information. Ontologies are widely used in semantic search; WordNet is the ontology of interest in this paper. Depending on the purpose and structure of the employed ontology, IR systems use appropriate methods to exploit it. Therefore, in this paper we only survey research works using WordNet for query or document semantic annotation and expansion.

Query expansion is the process of adding to an original query new terms that are similar to the original words in the query, in order to improve retrieval performance ([14]). The works [22], [12] and [8] expanded queries by using WordNet. Document expansion is the process of enriching documents with related terms to improve retrieval performance. The works [25], [24], [6] and [16] expanded documents by using WordNet.

Table 2.1 presents our survey of text IR systems using WordNet features, in comparison with our proposed one. In it, we use the following notations: (1) s is a sense of a word; (2) form(s) is any form of a sense s; (3) hypernym(s) is any hypernym of a sense s; (4) hyponym(s) is any hyponym of a sense s; (5) f/msc_hypernym(possible_senses(f)) is the pair of a form f and its respective msc_hypernym in a certain context; and (6) a keyword is a word that is neither a stop-word nor a WW.

As shown in the table, [22] expanded a query with all forms of every sense s occurring in it. Also, [22], [12] and [8] used all forms of a sense and all forms of any hyponym of a sense in a query. Meanwhile, [25] used all forms of a sense to expand a document, and [24] and [6] additionally used all forms of any hypernym of a sense in a document. The work [16] used senses in both queries and documents, and all forms of any hypernym of a sense in a document.

In [22], the authors showed that using synonyms, hyponyms and their combination in queries derived from description statements improved retrieval performance, but reduced retrieval performance with queries derived from narrative statements. In [12], after the sense of a word in a query was determined, its synonyms, hyponyms, and definition were considered by some rules to be added into the query. Meanwhile, [8] expanded a query by using spreading activation on all relations in WordNet and selecting only the words that are important and found in WordNet to represent the query content.

1 http://www.bbc.co.uk/gardening/basics/techniques/growfruitandveg_harvestapples1.shtml
2 http://www.bbc.co.uk/dna/h2g2/A12745785
3 http://www.bbc.co.uk
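The msc_hypernym notion above can be sketched in a few lines of code. The following is a minimal illustration over a toy hypernym graph modeled on Figure 2.2; the dictionary and function names are ours, not the real WordNet 2.1 noun hierarchy or any WordNet API.

```python
# Sketch: most specific common hypernyms (msc_hypernym) of a sense set,
# over a toy hypernym graph modeled on Figure 2.2 (illustrative only).

# sense -> set of direct hypernyms
HYPERNYMS = {
    "movement_1": {"change_3"},
    "movement_2": {"happening_1"},
    "movement_3": {"change_3"},
    "movement_11": {"change_3"},
    "change_3": {"action_1"},
    "happening_1": {"event_1"},
    "action_1": {"act_2"},
    "act_2": {"event_1"},
    "event_1": set(),
}

def ancestors(sense):
    """All hypernyms of a sense, including the sense itself."""
    result, stack = {sense}, [sense]
    while stack:
        for h in HYPERNYMS[stack.pop()]:
            if h not in result:
                result.add(h)
                stack.append(h)
    return result

def msc_hypernyms(senses):
    """Common hypernyms that no other common hypernym specializes."""
    common = set.intersection(*(ancestors(s) for s in senses))
    # keep h only if no other common hypernym lies strictly below it
    return {h for h in common
            if not any(c != h and h in ancestors(c) for c in common)}
```

On this graph, msc_hypernyms({"movement_1", "movement_2", "movement_3", "movement_11"}) yields {"event_1"}, matching the example in the text, while the three senses under change_3 alone yield {"change_3"}.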
Table 2.1. Survey of search engines using features of WordNet
(Each surveyed system is shown with, in parentheses, the corresponding search model of Section 4.)

| Feature | Used in | Lucene (Lexical Search) | [22] (QE_Syn) | [12], [8] (QE_Syn_Hypo) | [25] (DE_Syn) | [24], [6] (DE_Syn_Hyper) | [16] (DE_Id_Hyper) | Our Search (DE_MscHyper) |
| sense s | Query | | | | | | x | x |
| | Doc | | | | | | x | x |
| f/msc_hypernym(possible_senses(f)) | Query | | | | | | | x |
| | Doc | | | | | | | x |
| form(s)/hypernym(s) | Query | | | | | | | x |
| | Doc | | | | | | | x |
| form(hyponym(s)) | Query | | | x | | | | |
| form(hypernym(s)) | Doc | | | | | x | x | x |
| hyponym(s) | Query | | | | | | | |
| hypernym(s) | Doc | | | | | | | x |
| form(s) | Query | | x | x | | | | |
| | Doc | | | | x | x | | x |
| keyword | Query | x | x | x | x | x | x | x |
| | Doc | x | x | x | x | x | x | x |

In addition to [25], [6] used hypernyms and rules to filter the senses determined by the employed WSD algorithm to expand documents. If there was still more than one suitable sense for a word after running the WSD algorithm and filtering, all of those senses were used. In [25] and [6], the authors constructed concepts using the format Lemma-POS-SN, where Lemma is a form, POS is the part of speech, and SN is the sense number of the word. For example, if surface is a noun and has sense 1, then it is denoted by Surface-Noun-1.

In [24], the authors modified the index term weights that are normally computed by the tf*idf scheme. In a document, the weight of an index term is increased when the term has semantic relations with other co-occurring terms in the document. The authors used WordNet to determine the semantic relations between the terms. In [16], each WW was replaced by the new format Sense|POS. For example, if surface is a noun and has offset 3447223, then it is denoted by 3447223|Noun. The authors also used forms of hypernyms of a word in WordNet to increase the performance of the system.

Like all systems expanding queries, the above query expansion systems spend time searching for related terms in an ontology and matching the new query against documents. Meanwhile, in document expansion systems, the search for related terms is offline and the query is not changed. Hence, our search applies document expansion.

Moreover, since the above-surveyed papers use word forms to represent word senses, they may reduce the precision of the system. Indeed, a query containing a word having form f and sense x could also match documents containing a word having the same form f but a different sense y. For example, with query expansion, for the query "search documents about rainfall", rain as a synonym of rainfall is added into the query. Therefore, documents about "a rain of bullets" will also be retrieved. Similarly, with document expansion, for a document about "rainfall", rain as a synonym of rainfall is added into the document. Therefore, the document will be retrieved by the query "search documents about a rain of bullets". The drawback is similar when only the word forms of hypernyms and hyponyms of senses are used.

Especially, when a word has more than one sense determined by a WSD algorithm, the above works randomly choose one of those senses, which may decrease retrieval performance if that is a wrong choice. In contrast, in our system, such a word is represented by the combination of its form and the msc_hypernym of the senses.

Besides, if a word w in a document has only one suitable sense s, then all forms, hypernyms, and form-hypernym pairs of s are virtually added into the document. Otherwise, if it is uttered by a form f in a document and determined by the pair f/msc_hypernym(possible_senses(f)), then f, msc_hypernym(possible_senses(f)), and all hypernyms of msc_hypernym(possible_senses(f)) are virtually added into the document.

3. THE PROPOSED WORDNET-BASED SEMANTIC SEARCH
3.1 System Architecture
Our proposed system architecture is shown in Figure 3.1. The WordNet Word Disambiguation-and-Annotation module extracts and embeds the most specific WW features in a raw query and a raw text document. The process is presented in detail in Sections 3.2 and 3.3. After that, the text is indexed by its contained WW features and keywords, and stored in the Extended WordNet Word-Keyword Annotated Text Repository. Semantic document search is performed via the WordNet Word-Keyword-Based Generalized VSM (Vector Space Model) module, as presented in Section 3.4.

[Fig. 3.1: a raw query passes through the WordNet Word Disambiguation and Annotation module to the WordNet Word-Keyword-Based Generalized VSM module, which returns ranked annotated text documents; raw text documents pass through WordNet Word Disambiguation and Annotation, then WordNet Word Extension and Indexing, into the Extended WordNet Word-Keyword Annotated Text Repository.]
Fig. 3.1. System architecture for WordNet-based semantic search

3.2 Word Sense Disambiguation using WordNet
Word sense disambiguation is to identify the right meaning of a word in its occurrence context. Lesk's algorithm ([11]) was one of the first WSD algorithms for phrases. The main idea of Lesk's
algorithm was to disambiguate word senses by finding the overlap among their sense definitions using a traditional dictionary. The works [13] and [4] proposed to use WordNet for Lesk's algorithm. Following [13], we modify Lesk's algorithm by exploiting the associated information of each sense of a word in WordNet, including its definition, synonyms, hyponyms, and hypernyms. By comparing the associated information of each sense of a word with its surrounding words, we can identify its right sense. However, if a word has two or more suitable senses, then our WSD algorithm finds the msc_hypernyms of those senses in the hypernym hierarchy of WordNet. We use WordNet version 2.1 for the WSD algorithm. Figure 3.2 describes the difference between the traditional KB-based WSDs and our KB-based WSD.

[Fig. 3.2: a word form W in a sentence or a paragraph is fed to the Word Sense Disambiguation Algorithm, which produces senses S1, S2, S3, …, Sn; after Highest Score Senses Filtering, the remaining senses Si, Sj, …, Sk are handled either by choosing the first sense in the sense list (tradition), by choosing all the senses in the sense list (tradition), or by choosing the msc_hypernyms of the senses and combining them with the word form W (ours).]

1. … in function in Lucene4, which is a general open source library for storing, indexing and searching documents [7].
2. Disambiguating and annotating the WordNet words in the document by using our WSD algorithm introduced in the above sections.
3. Extending the document with implied information:
   - If the sense s of the word is determined, then s and its expanded features form(s), hypernym(s), form(hypernym(s)), and form(s)/hypernym(s) are added to the document.
   - If the word has more than one sense, with f and msc_hypernym(possible_senses(f)) as its apparent form and most specific common hypernym, respectively, then f and f/msc_hypernym(possible_senses(f)) and their expanded features form(msc_hypernym(possible_senses(f))), msc_hypernym(possible_senses(f)), form(hypernym(msc_hypernym(possible_senses(f)))), hypernym(msc_hypernym(possible_senses(f))), and f/hypernym(msc_hypernym(possible_senses(f))) are added to the document.
4. Words not defined in WordNet are treated as plain keywords.
5. Original WordNet features, implied WordNet features and plain keywords are indexed.

4 http://lucene.apache.org
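The document-extension rules of step 3 above can be sketched as follows. The three lookup callables (forms, hypernyms, msc_hypernyms) are hypothetical stand-ins for the paper's WordNet 2.1 access layer; only the expansion logic follows the text.

```python
# Sketch of document-extension step 3. The lookup callables are hypothetical
# stand-ins for a WordNet access layer; features are returned as a set of
# senses, forms, and (form, hypernym) pairs.

def extend_word(f, senses, forms, hypernyms, msc_hypernyms):
    """Features virtually added for a WordNet word with apparent form f."""
    feats = set()
    if len(senses) == 1:                    # the sense s is determined
        s = senses[0]
        feats.add(s)
        feats.update(forms(s))              # form(s)
        for h in hypernyms(s):
            feats.add(h)                    # hypernym(s)
            feats.update(forms(h))          # form(hypernym(s))
            feats.add((f, h))               # form(s)/hypernym(s)
    else:                                   # more than one suitable sense
        for m in msc_hypernyms(senses):
            feats.add((f, m))               # f/msc_hypernym(possible_senses(f))
            feats.add(m)
            feats.update(forms(m))          # form(msc_hypernym(...))
            for h in hypernyms(m):
                feats.add(h)                # hypernym(msc_hypernym(...))
                feats.update(forms(h))      # form(hypernym(msc_hypernym(...)))
                feats.add((f, h))           # f/hypernym(msc_hypernym(...))
    return feats
```

With a single determined sense the word contributes its sense, forms, hypernyms, and form-hypernym pairs; with several senses it contributes only features derived from its form and msc_hypernyms, as described in Section 2.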
The L.A. Times document collection is employed, which was used by 15 of the 33 full papers at SIGIR-2007 and SIGIR-2008 about text IR using a TREC dataset. The L.A. Times consists of more than 130,000 documents in nearly 500MB. Next, the queries in the Adhoc Track-1999, which have answer documents in this collection, are used; 44 of the 50 queries in this Track are thus chosen. Each query has three portions, namely the title, description and narrative. Since a query title is short and looks like a typical user query, we only use query titles in all experiments, as in [22], [12], [8], [25], and [6], for instance.

We have evaluated and compared the IR models in terms of precision-recall (P-R) curves, F-measure-recall (F-R) curves, and single mean average precision (MAP) values ([3], [10], [15]). MAP is a single measure of retrieval quality across recall levels and is considered a standard measure in the TREC community ([23]).

The obtained values of the measures presented above might occur by chance. Therefore, a statistical significance test is required ([9]). We use Fisher's randomization (permutation) test for evaluating the significance of the observed difference between two systems, as recommended in [21]. As shown in [21], 100,000 permutations are acceptable for a randomization test, and the threshold 0.05 on the two-sided significance level, or two-sided p-value, can detect significance.

4.2 Testing results
We present experiments on the search performance of our system in comparison with the surveyed WordNet-based systems via seven different search models:
1. Lexical Search: This search uses the Lucene text search engine as a tweak of the traditional keyword-based VSM.
2. QE_Syn: The search uses synonyms of WordNet to expand queries only.
3. QE_Syn_Hypo: The search is similar to QE_Syn but uses both synonyms and forms of hyponyms to expand queries.
4. DE_Syn: The search uses synonyms of WordNet to expand documents. It employs the traditional KB WordNet-based WSD as presented in Section 3.2.
5. DE_Syn_Hyper: The search is similar to DE_Syn but uses both synonyms and forms of hypernyms to expand documents.
6. DE_Id_Hyper: The search uses the sense of a word to represent the word and forms of hypernyms of the sense to expand documents.
7. DE_MscHyper: This search uses our proposed model and system presented in Section 3.

In the QE_Syn and QE_Syn_Hypo query expansion models, the sense of a word in a query is semi-automatically determined to obtain high precision. In the DE_Syn, DE_Syn_Hyper and DE_Id_Hyper document expansion models, Lesk's algorithm is modified to automatically determine the sense of a word, as for our DE_MscHyper model. However, when a word in a context has many suitable senses, the other document expansion models choose the first-ranked sense among them to represent the word.

Table 4.1 and the Figure 4.1 plots present the average precisions and F-measures of the Lexical Search, QE_Syn, QE_Syn_Hypo, DE_Syn, DE_Syn_Hyper, DE_Id_Hyper and DE_MscHyper models at each of the standard recall levels. They show that DE_MscHyper performs better than the other six models in terms of the precision and F measures. The MAP values in Table 4.2 and the two-sided p-values in Table 4.3 show that taking latent ontological features in queries and documents into account does enhance text retrieval performance. In terms of the MAP measure, our semantic search (DE_MscHyper) performs about 17.7% better than the Lexical Search model, about 37% and 127.1% better than the QE_Syn and QE_Syn_Hypo models, and 9.4%, 19.2% and 23.2% better than the DE_Syn, DE_Syn_Hyper and DE_Id_Hyper models, respectively.

Table 4.1. The average precisions and F-measures at the eleven standard recall levels on 44 queries of the L.A. Times (columns are recall levels in %)

| Measure | Model | 0 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| Precision (%) | Lexical Search | 51 | 42 | 34 | 28 | 25 | 22 | 16 | 13 | 10 | 9 | 8 |
| | QE_Syn | 46 | 35 | 30 | 23 | 21 | 18 | 15 | 13 | 8 | 7 | 6 |
| | QE_Syn_Hypo | 34 | 22 | 20 | 15 | 14 | 12 | 8 | 5 | 3 | 2 | 2 |
| | DE_Syn | 51 | 45 | 37 | 30 | 27 | 23 | 18 | 14 | 12 | 10 | 10 |
| | DE_Syn_Hyper | 46 | 40 | 34 | 29 | 25 | 22 | 16 | 13 | 11 | 10 | 9 |
| | DE_Id_Hyper | 51 | 40 | 35 | 26 | 22 | 19 | 13 | 11 | 10 | 8 | 8 |
| | DE_MscHyper | 62 | 49 | 41 | 33 | 28 | 24 | 18 | 15 | 11 | 10 | 9 |
| F-measure (%) | Lexical Search | 0 | 13 | 19 | 20 | 21 | 22 | 18 | 16 | 12 | 11 | 10 |
| | QE_Syn | 0 | 12 | 17 | 18 | 19 | 19 | 17 | 15 | 11 | 9 | 9 |
| | QE_Syn_Hypo | 0 | 9 | 13 | 13 | 13 | 13 | 10 | 8 | 6 | 4 | 4 |
| | DE_Syn | 0 | 13 | 20 | 22 | 24 | 23 | 20 | 17 | 15 | 13 | 12 |
| | DE_Syn_Hyper | 0 | 13 | 19 | 22 | 22 | 22 | 18 | 16 | 14 | 12 | 11 |
| | DE_Id_Hyper | 0 | 13 | 19 | 21 | 20 | 19 | 14 | 13 | 11 | 10 | 10 |
| | DE_MscHyper | 0 | 14 | 21 | 24 | 24 | 23 | 20 | 17 | 14 | 12 | 11 |

[Fig. 4.1: average P-R curves (left) and average F-R curves (right), precision/F-measure (%) against recall (%).]
Fig. 4.1. Average P-R and F-R curves of the Lexical Search, QE_Syn, DE_Syn, DE_Id_Hyper and DE_MscHyper models on the 44 queries of TREC

Table 4.2. The mean average precisions on the 44 queries of TREC

| Model | DE_MscHyper | Lexical Search | QE_Syn | QE_Syn_Hypo | DE_Syn | DE_Syn_Hyper | DE_Id_Hyper |
| MAP | 0.251 | 0.2133 | 0.1832 | 0.1105 | 0.2295 | 0.2106 | 0.2037 |
| Improvement | | 17.7% | 37% | 127.1% | 9.4% | 19.2% | 23.2% |
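The Fisher randomization test used above can be sketched as follows: under the null hypothesis the two systems are exchangeable, so per-query average precision scores are randomly swapped between systems and the resulting absolute MAP difference is compared with the observed one. This is a minimal illustration, not the exact implementation of [21]; the function name and epsilon tolerance are ours.

```python
# Sketch of the two-sided Fisher randomization (permutation) test on MAP.
import random

def randomization_test(ap_a, ap_b, permutations=100_000, seed=0):
    """Two-sided p-value for the observed |MAP(A) - MAP(B)|."""
    rng = random.Random(seed)
    n = len(ap_a)
    observed = abs(sum(ap_a) - sum(ap_b)) / n
    count = 0
    for _ in range(permutations):
        diff = 0.0
        for a, b in zip(ap_a, ap_b):
            if rng.random() < 0.5:          # swap this query's scores
                a, b = b, a
            diff += a - b
        # small tolerance so floating-point ties count as at-least-as-extreme
        if abs(diff) / n >= observed - 1e-12:
            count += 1
    return count / permutations
```

With the per-query AP lists of two systems over the 44 queries, a returned p-value below 0.05 is read as a significant difference, as in Table 4.3.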
Table 4.3. Randomization tests of DE_MscHyper against the Lexical Search, QE_Syn, QE_Syn_Hypo, DE_Syn, DE_Syn_Hyper and DE_Id_Hyper models

| Model A | Model B | abs(MAP(A) − MAP(B)) | N− | N+ | Two-sided p-value |
| DE_MscHyper | Lexical Search | 0.0377 | 1335 | 1453 | 0.02788 |
| | QE_Syn | 0.0678 | 1181 | 1195 | 0.02376 |
| | QE_Syn_Hypo | 0.1405 | 161 | 153 | 0.00314 |
| | DE_Syn | 0.0215 | 11878 | 11574 | 0.23452 |
| | DE_Syn_Hyper | 0.0404 | 3763 | 3826 | 0.07589 |
| | DE_Id_Hyper | 0.0473 | 2187 | 2268 | 0.04455 |

5. CONCLUSION AND FUTURE WORKS
We have presented a generalized VSM that exploits all the ontological features of WordNet words for semantic text search. It covers the whole IR process, from a natural language query to a set of ranked answer documents. The experiments conducted on a TREC dataset have shown that our exploitation of WordNet features improves the search quality in terms of the precision, recall, F, and MAP measures.

For future work, we are considering combining WordNet with other ontologies to increase the number of WordNet words that can be covered. We are also researching rules to determine WordNet words that are not updated into the employed ontologies.

6. REFERENCES
[1] Agirre, E., Soroa, A.: Personalizing PageRank for Word Sense Disambiguation. In Proceedings of EACL-2009, pp. 33-41 (2009)
[2] Agirre, E., Lopez De Lacalle, O., Soroa, A.: Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD. In Proceedings of IJCAI-2009, pp. 1501-1506 (2009)
[3] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)
[4] Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of CICLing-2002, pp. 136-145 (2002)
[5] Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge MA (1998)
[6] Giunchiglia, F., Kharkevich, U., Zaihrayeu, I.: Concept Search. In Proceedings of ESWC-2009, pp. 429-444 (2009)
[7] Gospodnetic, O.: Parsing, Indexing, and Searching XML with Digester and Lucene. Journal of IBM DeveloperWorks (2003)
[8] Hsu, M.H., Tsai, M.F., Chen, H.H.: Combining WordNet and ConceptNet for Automatic Query Expansion: A Learning Approach. In Proceedings of AIRS-2008, LNCS, Vol. 4993, Springer, pp. 213-224 (2008)
[9] Hull, D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proceedings of ACM SIGIR-1993, pp. 329-338 (1993)
[10] Lee, D.L., Chuang, H., Seamons, K.: Document Ranking and the Vector-Space Model. IEEE Software, Vol. 14, pp. 67-75 (1997)
[11] Lesk, M.: Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of ACM SIGDOC-1986, pp. 24-26 (1986)
[12] Liu, S., Liu, F., Yu, C., Meng, W.: An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. In Proceedings of ACM SIGIR-2004, pp. 266-272 (2004)
[13] Liu, S., Yu, C., Meng, W.: Word Sense Disambiguation in Queries. In Proceedings of ACM CIKM-2005, pp. 525-532 (2005)
[14] Lu, X.A., Keefer, R.B.: Query Expansion/Reduction and Its Impact on Retrieval Effectiveness. In Proceedings of TREC-1994, pp. 231-240 (1994)
[15] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
[16] Mihalcea, R., Moldovan, D.: Semantic Indexing Using WordNet Senses. In Proceedings of the ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, pp. 35-45 (2000)
[17] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: WordNet: An On-line Lexical Database. International Journal of Lexicography, Vol. 3, pp. 235-244 (1990)
[18] Navigli, R., Lapata, M.: Graph Connectivity Measures for Unsupervised Word Sense Disambiguation. In Proceedings of IJCAI-2007, pp. 1683-1688 (2007)
[19] Pradhan, S., Loper, E., Dligach, D., Palmer, M.: SemEval-2007 Task 17: English Lexical Sample, SRL and All Words. In Proceedings of SemEval-2007, pp. 87-92 (2007)
[20] Sinha, R., Mihalcea, R.: Unsupervised Graph-Based Word Sense Disambiguation Using Measures of Word Semantic Similarity. In Proceedings of the IEEE International Conference on Semantic Computing (ICSC-2007), pp. 363-369 (2007)
[21] Smucker, M.D., Allan, J., Carterette, B.: A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In Proceedings of ACM CIKM-2007, pp. 623-632 (2007)
[22] Voorhees, E.M.: Query Expansion Using Lexical-Semantic Relations. In Proceedings of ACM SIGIR-1994, pp. 61-69 (1994)
[23] Voorhees, E.M., Harman, D.K.: TREC: Experiment and Evaluation in Information Retrieval. MIT Press (2005)
[24] Wang, B., Brookes, B.R.: A Semantic Approach for Web Indexing. In Proceedings of APWeb-2004, LNCS, Vol. 3007, Springer, pp. 59-68 (2004)
[25] Zaihrayeu, I., Sun, L., Giunchiglia, F., Pan, W., Ju, Q., Chi, M., Huang, X.: From Web Directories to Ontologies: Natural Language Processing Challenges. In Proceedings of ISWC-2007 + ASWC-2007, pp. 623-636 (2007)