Lightweight Document Clustering
Sholom Weiss, Brian White, Chid Apte
IBM Research Report RC-21684
Abstract
A lightweight document clustering method is described that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters, or, by varying a single parameter, into a few dozen clusters. The method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. An efficient procedure for clustering is specified in two parts: (a) compute the k most similar documents for each document in the collection and (b) group the documents into clusters using these similarity scores. The method has been evaluated on a database of over 50,000 customer service problem reports that are reduced to 3,000 clusters and 5,000 exemplar documents. Results demonstrate efficient clustering performance with excellent group similarity measures.

Keywords: text clustering, structuring information to aid search and navigation, automated presentation of information, text data mining
1 Introduction
context of information retrieval. Instead of pre-clustering all documents in a database, the results of a query search can be clustered, with documents appearing in multiple clusters. Instead of presenting a user with a linear list of related documents, the documents can be grouped in a small number of clusters, perhaps ten, and the user has an overview of different documents that have been found in the search and their relationship within similar groups of documents. One approach to this type of visualization and presentation is described in [Zamir et al., 1997]. Here again though, the direct retrieval and linear list remains effective, especially when the user is given a "more like this" option that finds a subgroup of documents representing the cluster of interest to the user.
Document clustering can be of great value for tasks other than immediate information retrieval. Among these tasks are:

- summarization and label assignment, or
- dimension reduction and duplication elimination.

Let's look at these concepts by way of a help-desk example, where users submit problems or queries online to the vendor of a product. Each submission can be considered a document. By clustering the documents, the vendor can obtain an overview of the types of problems the customers are having; for example, a computer vendor might discover that printer problems comprise a large percentage of customer complaints. If the clusters form natural problem types, they may be assigned labels or topics. New user problems may then be assigned a label and sent to the problem queue for appropriate response. Any number of methods can be used for document categorization once the appropriate clusters have been identified. Typically, the number of clusters or categories numbers no more than a few hundred and often less than 100.
Not all users of a product report unique problems to the help-desk. It can be expected that most problem reports are repeat problems, with many users experiencing the same difficulty. Given enough users who report the same problem, an FAQ, Frequently Asked Questions report, may be created. To reduce the number of documents in the database of problem reports, redundancies in the documents must be detected. Unlike the summary of problem types, many problems will be similar but still have distinctions that are critical. Thus, while the number of clusters needed to eliminate duplication of problem reports can be expected to be much smaller than the total number of problem reports, the number of clusters is necessarily relatively large, much larger than needed for summarization of problem types.
In this paper, we describe a new lightweight procedure for clustering documents. It is intended to operate in high dimensions with tens of thousands of documents and is capable of clustering a database into the moderate number of clusters needed for summarization and label assignment or the very large number of clusters needed for the elimination of duplication.
The classical k-means technique [Hartigan and Wong, 1979] can be applied to document clustering. Its weaknesses are well known. The number of clusters k must be specified prior to application. The summary statistic is a mean of the values for each cluster. The individual members of the cluster can have a high variance and the mean may not be a good summary of the cluster members. As the number of clusters grows, for example to thousands of clusters, k-means clustering becomes untenable, approaching the O(n^2) comparisons where n is the number of documents. However, for relatively few clusters and a reduced set of pre-selected words, k-means can do well [Vaithyanathan and Dom, 1999].
More recent attention has been given to hierarchical agglomerative methods [Griffiths et al., 1997]. The documents are recursively merged bottom up, yielding a decision tree of recursively partitioned clusters. The distance measures used to find similarity vary from single-link to more computationally expensive ones, but they are closely tied to nearest-neighbor distance. The algorithm works by recursively merging the single best pair of documents or clusters, making the computational costs prohibitive for document collections numbering in the tens of thousands.
To cluster very large numbers of documents, possibly with a large number of clusters, some compromises must be made to reduce the number of indexed words and the number of expected comparisons. In [Larsen and Aone, 1999], indexing of each document is reduced to the 25 highest scoring TF-IDF words (term frequency and inverse document frequency [Salton and Buckley, 1997]), and then k-means is applied recursively, for k=9. While efficient, this approach has the classical weaknesses associated with k-means document clustering. A hierarchical technique that also works in steps with a small, fixed number of clusters is described in [Cutting et al., 1992].
We will describe a new lightweight procedure that operates efficiently in high dimensions and is effective in directly producing clusters that have objective similarity. Unlike k-means clustering, the number of clusters is dynamically determined, and similarity is based on nearest-neighbor distance, not mean feature distance. Thus, the new document clustering method maintains the key advantage of hierarchical clustering techniques, their compatibility with information retrieval methods, yet performance does not rapidly degrade for large numbers of documents.
3 Methods and Procedures
Our method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. That reduces a document's vector size and the computation time for distance measures for a clustering method. Our procedure for clustering is specified in two parts: (a) compute the k most similar documents (typically the top 10) for each document in the collection and (b) group the documents into clusters using these similarity scores. To be fully effective, both procedures must be computationally efficient.
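As a rough sketch of the reduced-indexing step, the Python fragment below keeps only the k best keywords of each document. Here "best" is taken to mean highest inverse document frequency, which is one plausible selection criterion rather than a description of our exact implementation; the function and variable names are illustrative.

import math
from collections import Counter

def reduced_index(docs, k=10):
    # docs: list of token lists, one per document.
    # Keep only the k "best" keywords per document; in this sketch,
    # "best" means highest inverse document frequency (rarest words).
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                  # document frequency of each word
    idf = {w: math.log(n / df[w]) for w in df}  # inverse document frequency
    reduced = []
    for tokens in docs:
        words = sorted(set(tokens), key=lambda w: idf[w], reverse=True)
        reduced.append(words[:k])               # the k keywords kept for indexing
    return reduced, idf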
Finding and scoring the k most similar documents for each document will be specified as a mathematical algorithm that processes fixed scalar vectors. The procedure is simple, a repetitive series of loops that accesses a fixed portion of memory, leading to efficient computation. The second procedure uses the scores for the k most similar documents in clustering the documents. Unlike the other algorithms described earlier, the second clustering step does not perform a "best-match first-out" merging. It merges documents and clusters on a "first-in first-out" basis.
Figure 1 describes the data structures needed to process the algorithms. Each of these lists can be represented as a simple linear vector. Figure 2 describes the steps for the computation of the k most similar documents, typically the top 10, for each document in the collection. Figure 3 is an efficient algorithm for these computations. Similarity or distance is measured by a simple additive count of words found in both documents that are compared plus their inverse document frequency. This differs from the standard TF-IDF formula in that term frequency is measured in binary terms, i.e. 0 or 1 for presence or absence. In addition, the values are not normalized; just the sum is used. In a comparative study, we show that TF-IDF has slightly stronger predictive value, but the simpler function has numerous advantages in terms of interpretability, simple additive computation, and elimination of storage of term frequencies. The steps in Figure 2 can readily be modified to use TF-IDF scoring.
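For concreteness, the scoring just described can be written as the following small function (a sketch in Python; idf is assumed to map each indexed word to its inverse document frequency, as in the earlier fragment):

def pair_score(keys_a, keys_b, idf):
    # Additive similarity between two reduced documents: every word found
    # in both contributes 1 (binary presence, not term frequency) plus its
    # inverse document frequency.  No normalization is applied.
    shared = set(keys_a) & set(keys_b)
    return sum(1.0 + idf[w] for w in shared)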
The remaining task is to group the documents into clusters using these similarity scores. We describe a single-pass algorithm for clustering, with at most k*n comparisons of similarity, where n is the number of documents.
(i) Get the next document's words (from doclist), and set all document scores to zero.
(ii) Get the next word, w, for the current document. If no words remain, store the k documents with the highest scores and continue with step (i).
(iii) For all documents having word w (from wordlist), add to their scores and continue with step (ii).
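The loop structure of steps (i)-(iii) can be sketched in Python as follows. The sketch assumes wordlist is the inverted index mapping each indexed word to the documents that contain it, and the remaining names are illustrative rather than taken from Figure 1:

import heapq
from collections import defaultdict

def top_k_similar(reduced, idf, k=10):
    # Build the word list: an inverted index from each indexed word
    # to the documents containing it.
    wordlist = defaultdict(list)
    for d, words in enumerate(reduced):
        for w in words:
            wordlist[w].append(d)
    matches = []
    for d, words in enumerate(reduced):          # step (i): next document
        scores = defaultdict(float)
        for w in words:                          # step (ii): next word w
            for other in wordlist[w]:            # step (iii): documents having w
                if other != d:
                    scores[other] += 1.0 + idf[w]
        # store the k documents with the highest scores for document d
        matches.append(heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]))
    return matches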
Given the k most similar documents for each document D_i, Figure 4 describes the actions that may be taken for each matched pair during cluster formation. Figure 5 describes an efficient algorithm for forming clusters. In Figure 5, N is the number of documents, D[i] is the i-th document, Cluster(D[i]) is the cluster of document D[i], and In Cluster(D[i]) indicates that the i-th document has been clustered. Documents are examined in a pairwise fashion proceeding with the first document and its top-k matches. Matches below a pre-set minimum score threshold are ignored. Clusters are formed by the document pairs not yet in clusters. Clusters are merged when the members of a matched pair appear in separate clusters. As we shall see in Section 4, not allowing merging yields a very large number of clusters whose members are highly similar. The single setting of the minimum score has a strong effect on the number of clusters; a high value produces a relatively large number of clusters and a zero value produces a relatively small number of clusters. Similarly, a high minimum score may leave some documents unclustered, while a low value clusters all documents. As an alternative to merging, it may be preferable to repeat the same document in multiple clusters. We do not report results on this form of duplication, typically done for smaller numbers of documents, but the procedure provides an option for duplicating documents across clusters.
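A minimal sketch of this single-pass clustering step, in the merging variant, might look as follows. Here matches is the per-document list of (document, score) pairs produced above, and the dictionary bookkeeping is illustrative rather than a description of the data structures in Figure 1:

def form_clusters(matches, min_score=0.0):
    # Single pass over the top-k matches of each document: at most k*n
    # similarity comparisons.  Pairs scoring below min_score are ignored;
    # a pair with neither member clustered starts a new cluster; a pair
    # with one member clustered pulls in the other; a pair spanning two
    # clusters causes the clusters to be merged.
    cluster_of = {}                  # document -> cluster id
    clusters = {}                    # cluster id -> set of documents
    next_id = 0
    for i, best in enumerate(matches):
        for j, score in best:
            if score < min_score:
                continue
            ci, cj = cluster_of.get(i), cluster_of.get(j)
            if ci is None and cj is None:
                clusters[next_id] = {i, j}
                cluster_of[i] = cluster_of[j] = next_id
                next_id += 1
            elif ci is None:
                clusters[cj].add(i)
                cluster_of[i] = cj
            elif cj is None:
                clusters[ci].add(j)
                cluster_of[j] = ci
            elif ci != cj:           # merge the two clusters
                for d in clusters[cj]:
                    cluster_of[d] = ci
                clusters[ci] |= clusters.pop(cj)
    return clusters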
How can we objectively evaluate clustering performance? Very often, the objective measure is related to the clustering technique. For example, k-means clustering can measure overall distance from the mean. Techniques that are based on nearest-neighbor distance, such as most information retrieval techniques, can measure distance from the nearest neighbor or the average distance from other cluster members.
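As one concrete example of such a measure, the average similarity of each clustered document to the other members of its cluster can be computed with the same additive score. This is only a sketch; reduced and idf are the reduced index and inverse document frequencies from the earlier fragments, and the measure itself is one of several reasonable choices:

def mean_within_cluster_similarity(clusters, reduced, idf):
    # Average, over all documents in non-singleton clusters, of the mean
    # additive similarity between a document and the other cluster members.
    total, count = 0.0, 0
    for members in clusters.values():
        members = list(members)
        if len(members) < 2:
            continue
        for i in members:
            sims = [sum(1.0 + idf[w] for w in set(reduced[i]) & set(reduced[j]))
                    for j in members if j != i]
            total += sum(sims) / len(sims)
            count += 1
    return total / count if count else 0.0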
[Figure 3: flowchart of the algorithm for computing the k most similar documents for each document, looping over each document's words via the word list and accumulating additive scores.]
[Figure 4: actions taken for each matched pair D_i, D_j during cluster formation, including skipping pairs whose score is below the Minimum Score and the cluster merge step when both documents are in separate clusters.]
[Figure 5: flowchart of the cluster-forming algorithm, iterating over each document's top-k matches, skipping matches below the Minimum Score, adding an unclustered document D[j] to Cluster(D[i]), and either merging clusters or repeating documents when Cluster(D[i]) and Cluster(D[j]) differ.]
that is particularly important for finding similar documents. Words can be assigned special tags, whereby they are given extra weight or, more importantly, their absence can lead to a subtraction from the score. For example, when two documents have different model numbers, we can subtract one from the score of the candidate matching document. This additive scheme for special tags is consistent with the general scheme of measuring similarity, yet provides a mechanism for handling exceptional matches of words.
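The effect of such a tag can be illustrated by extending the scoring sketch given earlier. The penalty of one follows the model-number example, and the tag sets are illustrative inputs rather than part of the indexing structures described above:

def tagged_pair_score(keys_a, keys_b, idf, tags_a=frozenset(), tags_b=frozenset()):
    # Start from the usual additive score over shared indexed words.
    shared = set(keys_a) & set(keys_b)
    score = sum(1.0 + idf[w] for w in shared)
    # Special tags: a tagged word (e.g. a model number) present in one
    # document but not the other subtracts one from the score.
    unmatched_tags = set(tags_a) ^ set(tags_b)
    score -= len(unmatched_tags)
    return score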
4 Results
Exemplar documents were selected for each of the 3,250 clusters found in the fourth entry of the table. For some large clusters, two or three exemplars were selected, for a total of 5,000 exemplar documents. Using the same scoring scheme, each of the exemplars was matched to the original 51,110 documents. 98.4% of the documents matched at least one of the exemplars, having at least one indexed word in common. 60.7% of the documents matched an exemplar of their assigned cluster, rather than an exemplar of an alternative cluster.
5 Discussion
References