Wordnet Improves Text Document Clustering: Andreas Hotho Steffen Staab Gerd Stumme
Andreas Hotho Steffen Staab Gerd Stumme Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany
hotho@aifb.uni-karlsruhe.de, staab@aifb.uni-karlsruhe.de, stumme@aifb.uni-karlsruhe.de
Abstract
Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. The bag-of-words representation used by these clustering methods is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. To address this problem, we integrate background knowledge, in our application WordNet, into the process of clustering text documents. We cluster the documents with a standard partitional algorithm. Our experimental evaluation on Reuters newsfeeds compares clustering results with pre-categorizations of the news. In the experiments, improvements of results by background knowledge compared to the baseline can be shown for many interesting tasks.
pork are found to be similar, because they both are subconcepts of meat in WordNet. The clustering is then performed with Bi-Section-KMeans, which has been shown to perform as well as other text clustering algorithms, and frequently better (cf. the seminal paper (Steinbach et al., 2000)). For the evaluation (cf. Section 4), we have investigated the Reuters corpus of newsfeeds, which comes with a set of categorizing labels attached to the documents. The evaluation results (cf. Section 5) compare the original classification with the partitioning produced by clustering the different representations of the text documents. Furthermore, by analysing the manually defined Reuters categories, we find explanations of when background knowledge helps. In Section 6, we point to some related work. Finally, we conclude that the best strategies that involve background knowledge are most often better than the baseline when word sense disambiguation and feature weighting are included (Section 7).
1. Introduction
With the abundance of text documents available through corporate document management systems and the World Wide Web, the efficient, high-quality partitioning of texts into previously unseen categories is a major topic for applications such as information retrieval from databases, business intelligence solutions or enterprise portals. So far, however, existing text clustering solutions only relate documents that use identical terminology, while they ignore conceptual similarity of terms such as defined in terminological resources like WordNet (Miller, 1995). In this paper we investigate which beneficial effects can be achieved for text document clustering by integrating an explicit conceptual account of terms found in WordNet. In order to come up with this result we have performed an empirical evaluation. We compare a simple baseline (Section 2) with different strategies for representing text documents that take background knowledge into account to various extents (Section 3). For instance, terms like beef and
As initial approach we have produced this standard representation of the texts by term vectors. The initial term vectors are further modified as follows. Stopwords are words which are considered as non-descriptive within a bag-of-words approach. Following common practice, we removed stopwords from the term set, using a standard list with 571 stopwords. 1 We have processed our text documents using the Porter stemmer introduced in (Porter, 1980). We used the stemmed terms to construct a vector representation for each text document. Then, we have investigated how pruning rare terms affects results. Depending on a pre-defined threshold δ, a term t is discarded from the representation (i.e., from the set of terms T) if its total term frequency over the corpus does not exceed δ. We have used the values 0, 5 and 30 for δ. The rationale behind pruning is that infrequent terms do not help for identifying appropriate clusters, but may still add noise to the distance measures, degrading overall performance. 2
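The preprocessing pipeline just described (stopword removal, then pruning of rare terms against a threshold δ) can be sketched as follows. The toy corpus, the three-word stopword list, and the threshold value are illustrative assumptions, not the paper's actual 571-word list or data:

```python
from collections import Counter

# Toy corpus: each document is a list of (already stemmed) terms.
docs = [
    ["beef", "price", "the", "rise"],
    ["pork", "price", "the", "fall"],
    ["beef", "export", "the", "grow"],
]
stopwords = {"the", "a", "of"}
delta = 1  # pruning threshold: discard terms whose total frequency <= delta

# 1. Stopword removal
docs = [[t for t in d if t not in stopwords] for d in docs]

# 2. Prune rare terms: total corpus frequency must exceed the threshold
totals = Counter(t for d in docs for t in d)
vocab = sorted(t for t, n in totals.items() if n > delta)

# 3. Term-frequency vectors over the pruned vocabulary
def tf_vector(doc):
    counts = Counter(doc)
    return [counts[t] for t in vocab]

vectors = [tf_vector(d) for d in docs]
print(vocab)    # ['beef', 'price']
print(vectors)  # [[1, 1], [0, 1], [1, 0]]
```

With δ = 1, all terms occurring only once are dropped, which mirrors how pruning removes noise-inducing infrequent terms.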
3.1. Ontology
The background knowledge we will exploit further on is encoded in a core ontology. We here present those parts of our wider ontology definition (cf. (Bozsak et al., 2002)) that we have exploited:
Definition: A core ontology is a tuple O := (C, ≤C) consisting of a set C whose elements are called concept identifiers, and a partial order ≤C on C, called concept hierarchy or taxonomy. Often we will call concept identifiers just concepts, for sake of simplicity.
Definition: If c1 ≤C c2, for c1, c2 ∈ C, then c1 is a subconcept of c2, and c2 is a superconcept of c1. If c1 ≤C c2 and there is no c3 ∈ C with c1 ≤C c3 ≤C c2 (c3 distinct from both), then c1 is a direct subconcept of c2, and c2 is a direct superconcept of c1. We note this by c1 ≺ c2.
According to the international standard ISO 704, we provide names for the concepts (and relations). Instead of name, we here call them signs or lexical entries to better describe the functions for which they are used.
Definition: A lexicon for an ontology O is a tuple Lex := (SC, RefC) consisting of a set SC whose elements are called signs for concepts, and a relation RefC ⊆ SC × C called lexical reference for concepts. Based on RefC, we define, for s ∈ SC, RefC(s) := {c ∈ C | (s, c) ∈ RefC} and, for c ∈ C, RefC⁻¹(c) := {s ∈ SC | (s, c) ∈ RefC}. An ontology with lexicon is a pair (O, Lex) where O is an ontology and Lex is a lexicon for O.
Tfidf weighs the frequency of a term in a document with a factor that discounts its importance when it appears in almost all documents. The tfidf (term frequency / inverted document frequency) 3 of term t in document d is defined by:
tfidf(d, t) := tf(d, t) · log(|D| / df(t)),
where df(t) is the document frequency of term t, counting in how many documents term t appears. If tfidf weighting is applied, then we replace the term vectors (tf(d, t1), ..., tf(d, tm)) by the vectors (tfidf(d, t1), ..., tfidf(d, tm)). There are more sophisticated measures than tfidf in the literature (see, e.g., (Amati et al., 2001)), but we abstract from them here, as this is not the main topic of this paper.
Based on the initial text document representation, we have first applied stopword removal. Then we performed stemming, pruning and tfidf weighting in all different combinations. This also holds for the initial document representation involving background knowledge described subsequently. When stemming and/or pruning and/or tfidf weighting was performed, we have always performed them in the order in which they have been listed here.
This definition allows for a very generic approach towards using ontologies for clustering. For the purpose of actual evaluation of clustering with background knowledge, we needed a specific resource which fits the document collection. We have chosen Wordnet 1.7, 4 as it fits the generality of the Reuters corpus. Wordnet (Miller, 1995) comprises a core ontology and a lexicon. It consists of 109377 concepts (synsets in Wordnet terminology) and 144684 lexical entries 5 (called words in Wordnet). One example synset is {foot, ft} and a corresponding word is foot. In Wordnet, the function RefC relates terms that have a lexical entry (e.g., foot and feet) with their corresponding concepts (e.g., the synsets {foot, ft}, {foot, human foot, pes}, ...). Thus, for a term t appearing in a document d, RefC(t) allows for retrieving its corresponding concepts.
4 http://www.cogsci.princeton.edu/wn/obtain.shtml
5 The actual number of lexical entries is higher in our count, as for one stem like foot, Wordnet includes several morphological derivations like feet.
In addition, Wordnet provides a ranking on the set RefC(s) for each lexical entry s, indicating the frequency of its usage in the English language. For example, RefC applied to foot returns {foot, ft} as the first concept, before the less common senses. Corresponding to our definition of a core ontology, Wordnet also offers access functions to its concept hierarchy ≤C.
Concept Vector Only (only). This strategy works like Replace Terms by Concepts, but it expels all terms from the vector representation. Thus, terms that do not appear in Wordnet are discarded; only the concept vector cd is used to represent document d.
3.3. Strategies for Disambiguation
The assignment of terms to concepts in Wordnet is ambiguous. Therefore, adding or replacing terms by concepts may add noise to the representation and may induce a loss of information. Therefore, we have also investigated how the choice of a most appropriate concept from the set of alternatives may influence the clustering results. While there is a whole field of research dedicated to word sense disambiguation (e.g., cf. (Ide & Véronis, 1998)), it has not been our intention to determine which strategy could be the most appropriate, but simply whether word sense disambiguation is needed at all. For this purpose, we have considered two simple disambiguation strategies besides the baseline:
All Concepts (all). The baseline strategy is not to do anything about disambiguation and to consider all concepts for augmenting the text document representation. Then, the concept frequencies are calculated as follows: cf(d, c) := Σ over all t ∈ T with c ∈ RefC(t) of tf(d, t).
So far, from all the descriptions given in Wordnet, we have exploited only information about nouns, i.e., we have used only the noun synsets available in Wordnet.
Using the morphological capabilities of Wordnet rather than a Porter stemmer, we achieved improved results. Therefore, when using background knowledge, stemming has only been performed for terms that do not appear as lexical entries in Wordnet. 3.2. Term vs. Concept Vector Strategies Enriching the term vectors with concepts from the core ontology has two benefits. First, it resolves synonyms; second, it introduces more general concepts which help identify related topics. For instance, a document about beef may not be related to a document about pork by the clustering algorithm if there are only beef and pork in the term vectors. But if the more general concept meat is added to both documents, their semantic relationship is revealed. We have investigated different strategies for adding or replacing terms by concepts: Add Concepts (add6). When applying this strategy, we have extended each term vector td by new entries for Wordnet concepts appearing in the document set. Thus, the vector td was replaced by the concatenation of td and cd, where cd := (cf(d, c1), ..., cf(d, cl)) is the concept vector and cf(d, c) denotes the frequency that a concept c appears in a document d, as indicated by applying the reference function RefC to all terms in the document d. For a detailed definition of cf, see the next subsection. Hence, a term that also appeared in Wordnet as a synset would be accounted for at least twice in the new vector representation, i.e., once as part of the old td and at least once as part of cd. It could be accounted for even more often, because a term like bank has several corresponding concepts in Wordnet.
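The concept-frequency computation and the term/concept integration strategies (add, repl, only) can be sketched with a hypothetical hand-built stand-in for the reference function RefC; real experiments would query Wordnet instead, and the concept names here are invented:

```python
from collections import Counter

# Hypothetical toy version of the lexical reference Ref_C: each term maps
# to the set of WordNet-like concepts (synsets) it may express.
REF = {
    "beef": {"beef", "meat"},   # simplified; real Ref_C yields synset ids
    "pork": {"pork", "meat"},
    "price": {"price"},
    "widgetron": set(),         # invented term with no WordNet entry
}

def concept_freq(doc):
    """cf(d, c) under the 'all concepts' strategy: every concept of
    every term occurrence is counted."""
    cf = Counter()
    for t in doc:
        for c in REF.get(t, ()):
            cf[c] += 1
    return cf

def vectorize(doc, mode):
    tf = Counter(doc)
    cf = concept_freq(doc)
    if mode == "add":    # terms plus concepts
        return tf + cf
    if mode == "repl":   # terms with a concept are counted only as concepts
        kept = Counter({t: n for t, n in tf.items() if not REF.get(t)})
        return kept + cf
    if mode == "only":   # concepts only; non-WordNet terms are discarded
        return cf

doc = ["beef", "pork", "widgetron"]
print(vectorize(doc, "add"))
print(vectorize(doc, "only"))
```

Note how under "add" the term beef is counted once as a term and twice at the concept level (as beef and as meat), exactly the double-accounting described above.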
First Concept (first). As mentioned in Sec. 3.1, Wordnet returns an ordered list of concepts when applying RefC to a term. Thereby, the ordering is supposed to reflect how common it is that a term expresses a concept in standard English language. More common term meanings are listed before less common ones. For a term t appearing in a document d, this strategy counts only the concept frequency cf(d, c) for the first-ranked element c of RefC(t), i.e., the most common meaning of t. For the other elements of RefC(t), frequencies of concepts are not increased by the occurrence of t. Thus the concept frequency is calculated by: cf(d, c) := Σ over all t ∈ T with c = first(RefC(t)) of tf(d, t).
Disambiguation by Context (context). The sense of a term that refers to several different concepts may be disambiguated by a simplified version of the strategy of Agirre and Rigau (1996):
Replace Terms by Concepts (repl). This strategy works like Add Concepts, but it expels from the vector representations all terms for which at least one corresponding concept exists. Thus, terms that appear in Wordnet are accounted for only at the concept level, but terms that do not appear in Wordnet are not discarded.
1. Define the semantic vicinity of a concept c to be the set of all its direct sub- and superconcepts.
2. Collect all terms that could express a concept from the conceptual vicinity of c, by applying the inverse reference function RefC⁻¹ to each concept in the vicinity.
3. The disambiguation function dis: T → C then assigns to a term t the concept c ∈ RefC(t) whose collected vicinity terms show the greatest overlap with the terms of the document at hand.
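A rough sketch of such a context-based choice, under the assumption that the overlap between a candidate concept's vicinity terms and the document decides the sense; the concept names and vicinity sets below are invented:

```python
# Hypothetical candidate concepts for an ambiguous term.
REF = {"bank": ["bank#finance", "bank#river"]}

# Terms that could express concepts in the semantic vicinity
# (direct sub-/superconcepts) of each candidate concept.
VICINITY_TERMS = {
    "bank#finance": {"money", "deposit", "credit"},
    "bank#river": {"shore", "water", "stream"},
}

def dis(term, doc_terms):
    """Pick the candidate concept whose vicinity terms overlap most
    with the document's terms; None if the term has no concepts."""
    candidates = REF.get(term, [])
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: len(VICINITY_TERMS[c] & set(doc_terms)))

doc = ["bank", "money", "credit", "rate"]
print(dis("bank", doc))  # bank#finance
```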
3.4. Strategies for Considering Hypernyms
The third set of strategies varies the amount of background knowledge. Its principal idea is that if a term like beef appears, one does not only represent the document by the concept corresponding to beef, but also by the concepts corresponding to meat and food etc., up to a certain level of generality. The following procedure realizes this idea by adding to the concept frequency of higher-level concepts in a document the frequencies of their subconcepts at most r levels down in the hierarchy. The vectors we consider are of the form (tf(d, t1), ..., tf(d, tm), cf(d, c1), ..., cf(d, cl)) (the concatenation of an initial term representation with a concept vector). Then the frequencies of the concept vector part are updated in the following way: for all c, replace cf(d, c) by cf(d, c) + Σ over b below c of cf(d, b), where the sum ranges over the subconcepts of c at most r levels below it in the taxonomy (for r = ∞, over all subconcepts of c). This implies: the strategy r = 0 does not change the given concept frequencies, r = n adds to each concept the frequency counts of all subconcepts in the n levels below it in the ontology, and r = ∞ adds to each concept the frequency counts of all its subconcepts.
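The hypernym strategy can be sketched with a toy taxonomy; the child-to-parent map below is an invented stand-in for Wordnet's hypernym hierarchy:

```python
# Toy taxonomy: each concept maps to its direct superconcept (hypernym).
PARENT = {"beef": "meat", "pork": "meat", "meat": "food"}

def add_hypernyms(cf, r):
    """Return updated concept frequencies: each concept also receives the
    counts of its subconcepts at most r levels below it in the taxonomy
    (r = 0 leaves cf unchanged)."""
    out = dict(cf)
    for concept, count in cf.items():
        node = concept
        for _ in range(r):                # walk up at most r levels
            node = PARENT.get(node)
            if node is None:
                break
            out[node] = out.get(node, 0) + count
    return out

cf = {"beef": 2, "pork": 1}
print(add_hypernyms(cf, 0))  # unchanged
print(add_hypernyms(cf, 1))  # meat gains 2 + 1 = 3
print(add_hypernyms(cf, 5))  # meat and food each gain 3
```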
We have chosen the Reuters-21578 news corpus ((Lewis, 1997)7, cf. Section 4.3), because it comprises an a priori categorization of documents, its domain is broad enough to be realistic, and the content of the news was understandable for non-experts (like us), so that we would be able to explain results. Furthermore, Reuters-21578 is a well-known, freely available and well-investigated corpus. Important reasons for us to use Wordnet as a core ontology in conjunction with Reuters-21578 as a corpus were that Wordnet is freely available and that it has not been specifically designed to facilitate the clustering task. We performed a second evaluation on the FAO Document Online Catalogue,8 in which the Food and Agriculture Organization (FAO) of the United Nations stores documents about agriculture, which are labeled with the controlled vocabulary AGROVOC.9 The evaluation on this domain and with this specific ontology provided similar results, which we omit here because of space restrictions. In the experiments we have varied the different strategies for plain term vector representation and for vector representations containing background knowledge as elaborated in Sections 2 and 3. We have clustered the representations using Bi-Section-KMeans and have compared the pre-categorization with our clustering results using standard measures for this task, as defined below. 4.2. Evaluation Measures The purity measure is based on the precision measure as well known from information retrieval (cf. (Pantel & Lin, 2002)). Each cluster P from a partitioning P of the overall document set D is treated as if it were the result of a query. Each set L of documents of a partitioning L, which is obtained by manual labeling, is treated as if it were the desired set of documents for a query. The two partitionings P and L are then compared as follows.
4. Partitional Clustering
Our incorporation of background knowledge is rather independent of the concrete clustering method. The only requirements we had were that the baseline could achieve good clustering results in an efficient way on the Reuters corpus. In (Steinbach et al., 2000) it has been shown that Bi-Section-KMeans, a variant of KMeans, fulfilled these conditions, while frequently outperforming standard KMeans as well as agglomerative clustering techniques. For our experiments, the similarity between two text documents d1 and d2 is measured by the cosine of the angle between the vectors representing them: cos(td1, td2) := (td1 · td2) / (||td1|| · ||td2||).
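A minimal implementation of this cosine similarity on plain frequency vectors might look like:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two document vectors; 0.0 when either
    vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Proportional vectors score 1; vectors with no shared terms score 0.
print(round(cosine([1, 2, 0], [2, 4, 0]), 3))  # 1.0
print(cosine([1, 0, 0], [0, 1, 1]))            # 0.0
```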
The precision of a cluster P for a given category L is given by Precision(P, L) := |P ∩ L| / |P|. The overall value for purity is computed by taking the weighted average of maximal precision values: Purity(P, L) := Σ over P ∈ P of (|P| / |D|) · max over L ∈ L of Precision(P, L).
For some selected parameter combinations that proved to be very good wrt. purity, we also investigated their inverse purity: InversePurity(P, L) := Σ over L ∈ L of (|L| / |D|) · max over P ∈ P of Precision(L, P).
4.1. Evaluation Setting The principal idea of the experiments was the comparison of clustering results on a standard text corpus against a manually predefined categorization of the corpus. Such a predefined categorization exists for only a few text corpora.
Both measures have the interval [0, 1] as their range. Their difference is that purity measures the purity of the resulting clusters when evaluated against a pre-categorization, while inverse purity measures how stable the pre-defined categories are when split up into clusters. Thus, purity achieves an optimal value of 1 when the number of clusters equals the number of documents, whereas inverse purity achieves an optimal value of 1 when the number of clusters equals 1. Another name in the literature for inverse purity is microaveraged precision. The reader may note that, in the evaluation of clustering results, microaveraged precision is identical to microaveraged recall (cf., e.g., (Sebastiani, 2002)). 4.3. The Reuters Corpus We have performed all evaluations on the Reuters-21578 document set. In order to be able to perform comparisons with an a priori categorization, we have restricted ourselves to the 12344 documents that were manually classified by Reuters. Documents in the manually classified set were labeled with zero, one, or more of the 135 pre-defined categories.10 The lack of a label indicates that the human annotator could not find an adequate category. We gathered all the documents without any category label into a new category defnoclass.11 Standard measures like purity (or mutual information or entropy) only allow for the comparison of two partitionings, but they do not allow for the comparison of structures when documents are manually assigned to several categories and/or documents are automatically assigned to multiple clusters. Therefore, we have selected only the first label of each document and ended up with a categorization of the documents into overall 82 categories, including defnoclass. To be able to perform evaluations for more different parameter settings, we have restricted the number of documents from the corpus. First, categories with extremely few documents (fewer than 15) have been discarded; thus, outlier categories are ignored in the evaluation.12 Second, we have restricted the category sizes to max. 100 documents by sampling. We call the resulting corpus PRC-min15-max100. It consists of 46 categories and 2619 documents with an average of 56.93 documents per category (standard deviation 33.12).
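The purity and inverse purity measures from Section 4.2 can be sketched on a toy partitioning; documents are represented as integer ids, and the example clusters and categories are invented:

```python
# Sketch of purity and inverse purity for a clustering (clusters) against
# a manual categorization (categories), following the precision-based
# definitions above. Documents are integer ids; sets model the partitions.
def precision(p, l):
    return len(p & l) / len(p)

def purity(clusters, categories, n_docs):
    return sum(len(p) / n_docs * max(precision(p, l) for l in categories)
               for p in clusters)

def inverse_purity(clusters, categories, n_docs):
    return sum(len(l) / n_docs * max(precision(l, p) for p in clusters)
               for l in categories)

cats = [{1, 2, 3}, {4, 5, 6}]        # manual categorization
clus = [{1, 2}, {3, 4, 5, 6}]        # clustering result
print(round(purity(clus, cats, 6), 3))          # 0.833
print(round(inverse_purity(clus, cats, 6), 3))  # 0.833
```

Splitting a category across clusters lowers inverse purity, while mixing categories inside a cluster lowers purity, matching the discussion above.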
The text document representation consists of term vectors of length 1219 to 9924 and concept vectors (or mixed term/concept vectors) of length 1468 to 16157, depending on the applied strategy.
10 The categories are called topics in Reuters-21578. To be more general, we will refer to them as categories in the sequel.
11 The 12344 documents are indicated by an attribute TOPIC set to yes and contain the text surrounded by the BODY tag.
12 We investigate in the technical report (Hotho et al., 2003) the influence of the 36 discarded outlier categories with their overall 136 documents. We observe a 2% lower purity for both the best baseline as well as for the results with background knowledge. The general results are the same.
5. Results
Each evaluation result described in the following denotes an average over 20 test runs performed on the given corpus for a given combination of parameter values, with randomly chosen initial values for Bi-Section-KMeans. The results we report here have been achieved for a fixed number of clusters. Varying the number of clusters for the parameter combinations described below has not altered the overall picture.
On the results we report in the text, we have applied t-tests to check for significance. All differences that are mentioned below are significant within a confidence of 99.5%.
5.1. Clustering without Background Knowledge Without background knowledge, averaged purity values ranged from 46.1% to 57% (cf. Figure 1). We have observed that tfidf weighting decisively increased purity values, irrespective of the combination of the other parameter values (see for instance Figure 1). Pruning with a threshold of 5 or 30 has not always shown an effect, but it always increased purity values when it was combined with tfidf weighting. 5.2. Clustering with Background Knowledge For clustering using background knowledge, we have also performed pruning and tfidf weighting as described above. The thresholds and modifications have been applied to concept frequencies (or mixed term/concept frequencies) instead of term frequencies only. We have computed the purity results for varying parameter combinations as described before. A subset of all cross evaluations is depicted in Figure 1. Each data point indicates a combination of values as follows: X-axis: On the X-axis, different parameter combinations are indicated. From bottom to top there are: Without background knowledge (Section 2) vs. with background knowledge (Section 3) (Ontology = false/true).
No use of hypernyms (r = 0) vs. five levels of hypernyms added to concept frequencies (r = 5); cf. Section 3.4 (Hypdepth = 0/5). Disambiguation strategy: All Concepts / First Concept / Disambiguation by Context; cf. Section 3.3 (Hypdis = all/first/context). Add Concepts vs. Replace Terms by Concepts vs. Concept Vector Only; cf. Section 3.2 (Hypint = add/repl/only).
          Purity          InversePurity
          avg     std     avg     std
          0.570   0.019   0.479   0.016
          0.585   0.014   0.492   0.017
          0.603   0.019   0.504   0.021
          0.618   0.015   0.514   0.019
          0.593   0.010   0.500   0.016

Table 1. Results on PRC-min15-max100 with prune=30 (with background knowledge also HYPDIS = context; avg denotes the average over 20 cluster runs and std the standard deviation).
Figure 1. Comparing clustering without background knowledge (leftmost column) against various combinations of parameter settings using background knowledge on PRC-min15-max100. (The X-axis enumerates the combinations of Hypint = add/repl/only, Hypdis = all/first/context, Hypdepth = 0/5, and Ontology = true/false.)
Y-axis: On the Y-axis, the resulting purity averaged over 20 test runs for each data point is shown. Different lines represent different combinations of tfidf weighting / no weighting with different pruning thresholds (0 vs. 5 vs. 30). Results. The baseline, i.e., the representation without background knowledge, is given by the best value, 57%, in the leftmost sector (the one for tfidf weighting and a pruning threshold of 30 in Figure 1). The best overall value is achieved by the following combination of strategies: background knowledge with five levels of hypernyms (r = 5), using disambiguation by context and term vectors extended by concept frequencies. Purity values then reached 61.8%, thus yielding a relative improvement of 8.4% compared to the baseline. Without the application of tfidf weighting, all parameter combinations achieve lower values. Also, the difference between the best baseline result (47%) and the best result achieved by adding background knowledge (48.6%) decreases considerably. Furthermore, strategies that consider hypernyms without tfidf weighting even decrease the purity compared to the baseline. Inverse Purity. As may be seen from the description in Section 4.2, purity does not discount evaluation results when splitting up large categories. Therefore, we have investigated how the inverse purity values would be affected for the best baseline (in terms of purity) and a typically good strategy based on background knowledge (again measured in terms of purity). Table 1 summarizes the results, favoring background knowledge over the baseline by 51.4% over 47.9%. Inverse Purity and Variance Analysis. We also investigated when and why background knowledge improves
the results of Bi-Section-KMeans by analyzing the within-class variance of the Reuters categorization of PRC-min15-max100. For a class L, the variance is defined as:
V(L) := Σ over d ∈ L of ||td − μL||², where μL denotes the centroid of the vectors of L.
Based on this, we define the normalized variance within a class as follows, where the denominator performs a normalization adjusting the variance to the corresponding overall variance of the document set D:
v(L) := V(L) / V(D).
This variance can be computed both for vector representations with and without background knowledge. We thus obtain two values for each class L, namely vB(L) (with background knowledge) and vO(L) (without).13 The normalized difference of the variances is obtained by vd(L) := (vO(L) − vB(L)) / vO(L).
The decreasing line in Figure 2 shows this normalized difference of the within-class variance between the representations with (strategy hypdepth=5, hypint=add, hypdis=context, prune=30) and without background knowledge. As becomes evident, for the large majority of pre-defined categories, background knowledge reduces the within-class variance, and hence makes them easier to identify for clustering algorithms which aim at minimizing variance, like Bi-Section-KMeans. Exceptions can be found when the category is characterized best by syntactic means (e.g., the category earn may best be clustered by tokens like "vs", which are not contained in Wordnet; see the leftmost category in Fig. 2). Furthermore, there is a clear tendency that a smaller variance within predefined categories goes along with a higher inverse purity compared to the best baseline. This tendency becomes evident when one compares the variance difference against the individual inverse purity values ipv(L) := max over P ∈ P of Precision(L, P), which again can be computed with (ipvB) and without (ipvO) background knowledge. This comparison is done in Figure 2 by comparing the variance difference against the inverse purity difference ipd(L) := (ipvB(L) − ipvO(L)) / ipvO(L)
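The within-class variance and its normalization can be sketched as follows; the toy two-dimensional vectors stand in for the actual document representations:

```python
# Sketch of the (normalized) within-class variance used in the analysis:
# V(L) sums squared distances of a category's document vectors to their
# centroid; dividing by V(D) over the whole corpus normalizes it.
def variance(vectors):
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(sum((v[i] - centroid[i]) ** 2 for i in range(dim))
               for v in vectors)

def normalized_variance(category_vectors, all_vectors):
    return variance(category_vectors) / variance(all_vectors)

docs = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
tight_category = docs[:2]   # documents close together -> small variance
print(variance(tight_category))                   # 2.0
print(normalized_variance(tight_category, docs))  # 0.1
```

A representation that pulls a category's documents closer together yields a smaller normalized variance, which is what the paper observes for most categories when background knowledge is added.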
de Buenaga Rodríguez et al. (2000) and Ureña López et al. (2001) show a successful integration of the Wordnet resource for a document categorization task. They use the Reuters corpus for evaluation and improve the classification results of the Rocchio and Widrow-Hoff algorithms by 20 points. In (Gonzalo et al., 1998), Wordnet is used for word sense disambiguation. They show, in an information retrieval setting, the improvement of the disambiguated synset model over the term vector model. In contrast to our approach, (de Buenaga Rodríguez et al., 2000), (Ureña López et al., 2001), and (Gonzalo et al., 1998) apply Wordnet to a supervised scenario (and not to an unsupervised one, as in our application), do not make use of Wordnet relations such as hypernyms, and build the term vectors manually. Approaches like term clustering (Karypis & Han, 2000), LSI (Deerwester et al., 1990) or PLSI (Hofmann, 1999) use statistical methods to compute a kind of concepts. These concepts are rather different from our definition of ontology concepts: they are not able to indicate the meaning of the concepts, and there exists no understandable mapping to lexical entries. A generalization of their concepts is not possible. We do not know of actual comparisons that relate KMeans or Bi-Section-KMeans with LSI or PLSI using the same dataset for clustering. We have built our numerical comparisons on Bi-Section-KMeans, which has proved to be very robust in a wide variety of experiments (Steinbach et al., 2000). Also in our experience it performed as well as other algorithms that we tested informally. Its standard parameter settings evaluated as well as other ones (e.g., bisecting based on variance instead of cardinality; cf. (Steinbach et al., 2000)).
Figure 2. Comparing the variance difference for each given category against the change of clustering results in terms of individual inverse purity values when the preprocessing strategy changes from the best baseline to a standard (good) background knowledge strategy (hypdepth=5, hypint=add, hypdis=context, prune=30) on PRC-min15-max100.
and against its linear interpolation. The diagram shows that the linear interpolation increases with decreasing variance difference. The correlation coefficient between variance difference and individual inverse purity supports this observation.
We have analyzed the categories whose identification by the cluster algorithm is not positively influenced by background knowledge according to the inverse purity difference. Besides the ones for which within-class variance is not reduced, problems occur for categories that have semantic overlap. For instance, dlr and money-fx are both about money and finance and often co-occur (as second or third Reuters label). A measure considering also the second and third Reuters labels (which is not possible with standard measures like purity) would probably even indicate a positive influence of background knowledge on the clustering.
6. Related Work
While we do not know of any research that exploits background knowledge for text document clustering, there are a number of related uses. Wordnet has mostly been used in information retrieval and in supervised learning scenarios up to now: In information retrieval, Voorhees (1994) as well as Moldovan and Mihalcea (2000) have explored the possibility to use Wordnet for retrieving documents by keyword search. It has already become clear by their work that particular care must be taken in order to improve precision and recall.
7. Conclusion
In this paper, we have discussed a way of incorporating background knowledge into a representation for text document clustering in order to improve clustering results. We have performed evaluations on the Reuters data set indicating good performance. In particular, we found that the best background knowledge strategy (e.g., hypint = add, hypdis = context, hypdepth = 5) can be safely used, as it always improves performance compared to the best baseline. The principal idea of our approach is that the variance of documents within one category is reduced by representation with background knowledge, thus improving results of text clustering measured in terms of purity and inverse purity with conventional means like Bi-Section-KMeans. To this end, different, but semantically similar terms in two text documents may contribute to a good similarity rating if they are related via Wordnet synsets or hypernyms.
Our experiments have shown that beneficial effects of background knowledge require some care; i.e., we used word sense disambiguation and feature weighting in order to achieve improvements of clustering results. We conjecture that more advanced word sense disambiguation and feature weighting schemes will further improve the effectiveness of text clustering. In our technical report (Hotho et al., 2003), we describe how to make further use of background knowledge for improving explanation capabilities. There we show how to exploit concept representations along a hierarchy, based on Formal Concept Analysis (Ganter & Wille, 1999), in order to derive commonalities and distinctions between different clustering results. For instance, one example result derived there is that several clusters are about food, some about coffee and some about cacao. This result is achieved without food appearing anywhere in the documents, but by taking advantage of the new representation that incorporates background knowledge.
Gonzalo, J., Verdejo, F., Chugur, I., & Cigarrán, J. (1998). Indexing with WordNet synsets can improve text retrieval. Proceedings ACL/COLING Workshop on Usage of WordNet for Natural Language Processing. Hofmann, T. (1999). Probabilistic latent semantic indexing. Research and Development in Information Retrieval (pp. 50-57). Hotho, A., Staab, S., & Stumme, G. (2003). Text clustering based on background knowledge (Technical Report). University of Karlsruhe, Institute AIFB. 36 pages. Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24, 1-40. Karypis, G., & Han, E. (2000). Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. Proc. of CIKM-00 (pp. 12-19). ACM Press. Lewis, D. (1997). Reuters-21578 text categorization test collection. Miller, G. (1995). WordNet: A lexical database for English. CACM, 38, 39-41. Moldovan, D. I., & Mihalcea, R. (2000). Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4, 34-43. Pantel, P., & Lin, D. (2002). Document clustering with committees. Proc. of SIGIR02, Tampere, Finland. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137. Salton, G. (1989). Automatic text processing: The transformation, analysis and retrieval of information by computer. Addison-Wesley. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1-47. Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. KDD Workshop on Text Mining. Ureña López, L. A., de Buenaga Rodríguez, M., & Hidalgo, J. M. G. (2001). Integrating linguistic resources in TC through WSD. Computers and the Humanities, 35(2), 215-230. Voorhees, E. M. (1994). Query expansion using lexical-semantic relations. Proceedings of ACM-SIGIR. Dublin, Ireland (pp. 61-69). ACM/Springer.
Acknowledgements. We thank our colleagues Alexander Maedche and Viktor Pekar for many fruitful discussions during the work on this paper. This work has been supported by EU in the IST project Bizon and by BMBF in the project PADLR.
References
Agirre, E., & Rigau, G. (1996). Word sense disambiguation using conceptual density. Proc. of COLING96. Amati, G., Carpineto, C., & Romano, G. (2001). FUB at TREC-10 web track: A probabilistic framework for topic relevance term weighting. The Tenth Text Retrieval Conference (TREC 2001). Online publication. Bozsak et al., E. (2002). KAON - towards a large scale semantic web. Proceedings of EC-Web (pp. 304-313). Aix-en-Provence, France: LNCS 2455, Springer. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41, 391-407. de Buenaga Rodríguez, M., Hidalgo, J. M. G., & Díaz Agudo, B. (2000). Using WordNet to complement training information in text categorization. Recent Advances in Natural Language Processing II. John Benjamins. Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin Heidelberg: Springer.