Lightweight Document Clustering
Sholom Weiss, Brian White, Chid Apte
IBM Research Report RC-21684
Abstract
A lightweight document clustering method is described that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters, or, by varying a single parameter, into a few dozen clusters. The method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. An efficient procedure for clustering is specified in two parts: (a) compute the k most similar documents for each document in the collection and (b) group the documents into clusters using these similarity scores. The method has been evaluated on a database of over 50,000 customer service problem reports that are reduced to 3,000 clusters and 5,000 exemplar documents. Results demonstrate efficient clustering performance with excellent group similarity measures.

Keywords: text clustering, structuring information to aid search and navigation, automated presentation of information, text data mining
1 Introduction
context of information retrieval. Instead of pre-clustering all documents in a database, the results of a query search can be clustered, with documents appearing in multiple clusters. Instead of presenting a user with a linear list of related documents, the documents can be grouped in a small number of clusters, perhaps ten, and the user has an overview of different documents that have been found in the search and their relationship within similar groups of documents. One approach to this type of visualization and presentation is described in [Zamir et al., 1997]. Here again though, the direct retrieval and linear list remains effective, especially when the user is given a "more like this" option that finds a subgroup of documents representing the cluster of interest to the user.
Document clustering can be of great value for tasks other than immediate information retrieval. Among these tasks are:

- summarization and label assignment, or
- dimension reduction and duplication elimination.

Let's look at these concepts by way of a help-desk example, where users submit problems or queries online to the vendor of a product. Each submission can be considered a document. By clustering the documents, the vendor can obtain an overview of the types of problems the customers are having; for example, a computer vendor might discover that printer problems comprise a large percentage of customer complaints. If the clusters form natural problem types, they may be assigned labels or topics. New user problems may then be assigned a label and sent to the problem queue for appropriate response. Any number of methods can be used for document categorization once the appropriate clusters have been identified. Typically, the number of clusters or categories numbers no more than a few hundred and often less than 100.
Not all users of a product report unique problems to the help-desk. It can be expected that most problem reports are repeat problems, with many users experiencing the same difficulty. Given enough users who report the same problem, an FAQ, Frequently Asked Questions report, may be created. To reduce the number of documents in the database of problem reports, redundancies in the documents must be detected. Unlike the summary of problem types, many problems will be similar but still have distinctions that are critical. Thus, while the number of clusters needed to eliminate duplication of problem reports can be expected to be much smaller than the total number of problem reports, the number of clusters is necessarily relatively large, much larger than needed for summarization of problem types.
In this paper, we describe a new lightweight procedure for clustering documents. It is intended to operate in high dimensions with tens of thousands of documents and is capable of clustering a database into the moderate number of clusters needed for summarization and label assignment or the very large number of clusters needed for the elimination of duplication.
The classical k-means technique [Hartigan and Wong, 1979] can be applied to document clustering. Its weaknesses are well known. The number of clusters k must be specified prior to application. The summary statistic is a mean of the values for each cluster. The individual members of the cluster can have a high variance and the mean may not be a good summary of the cluster members. As the number of clusters grows, for example to thousands of clusters, k-means clustering becomes untenable, approaching the O(n^2) comparisons where n is the number of documents. However, for relatively few clusters and a reduced set of pre-selected words, k-means can do well [Vaithyanathan and Dom, 1999].
More recent attention has been given to hierarchical agglomerative methods [Griffiths et al., 1997]. The documents are recursively merged bottom up, yielding a decision tree of recursively partitioned clusters. The distance measures used to find similarity vary from single-link to more computationally expensive ones, but they are closely tied to nearest-neighbor distance. The algorithm works by recursively merging the single best pair of documents or clusters, making the computational costs prohibitive for document collections numbering in the tens of thousands.
To cluster very large numbers of documents, possibly with a large number of clusters, some compromises must be made to reduce the number of indexed words and the number of expected comparisons. In [Larsen and Aone, 1999], indexing of each document is reduced to the 25 highest scoring TF-IDF words (term frequency and inverse document frequency [Salton and Buckley, 1997]), and then k-means is applied recursively, for k=9. While efficient, this approach has the classical weaknesses associated with k-means document clustering. A hierarchical technique that also works in steps with a small, fixed number of clusters is described in [Cutting et al., 1992].
We will describe a new lightweight procedure that operates efficiently in high dimensions and is effective in directly producing clusters that have objective similarity. Unlike k-means clustering, the number of clusters is dynamically determined, and similarity is based on nearest-neighbor distance, not mean feature distance. Thus, the new document clustering method maintains the key advantage of hierarchical clustering techniques, their compatibility with information retrieval methods, yet performance does not rapidly degrade for large numbers of documents.
3 Methods and Procedures
Our method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. That reduces a document's vector size and the computation time for distance measures for a clustering method. Our procedure for clustering is specified in two parts: (a) compute the k most similar documents (typically the top 10) for each document in the collection and (b) group the documents into clusters using these similarity scores. To be fully effective, both procedures must be computationally efficient.
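As a rough sketch of the reduced-indexing step, the Python fragment below keeps only the k best keywords of each document. Here "best" is taken to mean highest inverse document frequency, which is one plausible selection criterion rather than a description of our exact implementation; the function and variable names are illustrative.

import math
from collections import Counter

def reduced_index(docs, k=10):
    # docs: list of token lists, one per document.
    # Keep only the k "best" keywords per document; in this sketch,
    # "best" means highest inverse document frequency (rarest words).
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                  # document frequency of each word
    idf = {w: math.log(n / df[w]) for w in df}  # inverse document frequency
    reduced = []
    for tokens in docs:
        words = sorted(set(tokens), key=lambda w: idf[w], reverse=True)
        reduced.append(words[:k])               # the k keywords kept for indexing
    return reduced, idf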
Finding and scoring the k most similar documents for each document will be specified as a mathematical algorithm that processes fixed scalar vectors. The procedure is simple, a repetitive series of loops that accesses a fixed portion of memory, leading to efficient computation. The second procedure uses the scores for the k most similar documents in clustering the documents. Unlike the other algorithms described earlier, the second clustering step does not perform a "best-match first-out" merging. It merges documents and clusters on a "first-in first-out" basis.
Figure 1 describes the data structures needed to process the algorithms. Each of these lists can be represented as a simple linear vector. Figure 2 describes the steps for the computation of the k most similar documents, typically the top 10, for each document in the collection. Figure 3 is an efficient algorithm for these computations. Similarity or distance is measured by a simple additive count of words found in both documents that are compared plus their inverse document frequency. This differs from the standard TF-IDF formula in that term frequency is measured in binary terms, i.e. 0 or 1 for presence or absence. In addition, the values are not normalized; just the sum is used. In a comparative study, we show that TF-IDF has slightly stronger predictive value, but the simpler function has numerous advantages in terms of interpretability, simple additive computation, and elimination of storage of term frequencies. The steps in Figure 2 can readily be modified to use TF-IDF scoring.
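For concreteness, the scoring just described can be written as the following small function (a sketch in Python; idf is assumed to map each indexed word to its inverse document frequency, as in the earlier fragment):

def pair_score(keys_a, keys_b, idf):
    # Additive similarity between two reduced documents: every word found
    # in both contributes 1 (binary presence, not term frequency) plus its
    # inverse document frequency.  No normalization is applied.
    shared = set(keys_a) & set(keys_b)
    return sum(1.0 + idf[w] for w in shared)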
The remaining task is to group the documents into clusters using these similarity scores. We describe a single-pass algorithm for clustering, with at most k*n comparisons of similarity, where n is the number of documents.
(i) Get the next document's words (from doclist), and set all document scores to zero.
(ii) Get the next word, w, for the current document. If no words remain, store the k documents with the highest scores and continue with step (i).
(iii) For all documents having word w (from wordlist), add to their scores and continue with step (ii).
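The loop structure of steps (i)-(iii) can be sketched in Python as follows. The sketch assumes wordlist is the inverted index mapping each indexed word to the documents that contain it, and the remaining names are illustrative rather than taken from Figure 1:

import heapq
from collections import defaultdict

def top_k_similar(reduced, idf, k=10):
    # Build the word list: an inverted index from each indexed word
    # to the documents containing it.
    wordlist = defaultdict(list)
    for d, words in enumerate(reduced):
        for w in words:
            wordlist[w].append(d)
    matches = []
    for d, words in enumerate(reduced):          # step (i): next document
        scores = defaultdict(float)
        for w in words:                          # step (ii): next word w
            for other in wordlist[w]:            # step (iii): documents having w
                if other != d:
                    scores[other] += 1.0 + idf[w]
        # store the k documents with the highest scores for document d
        matches.append(heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]))
    return matches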
Given the k most similar documents for each document D_i, Figure 4 describes the actions that may be taken for each matched pair during cluster formation. Figure 5 describes an efficient algorithm for forming clusters. In Figure 5, N is the number of documents, D[i] is the i-th document, Cluster(D[i]) is the cluster of document D[i], and In Cluster(D[i]) indicates that the i-th document has been clustered. Documents are examined in a pairwise fashion proceeding with the first document and its top-k matches. Matches below a pre-set minimum score threshold are ignored. Clusters are formed by the document pairs not yet in clusters. Clusters are merged when the members of a matched pair appear in separate clusters. As we shall see in Section 4, not allowing merging yields a very large number of clusters whose members are highly similar. The single setting of the minimum score has a strong effect on the number of clusters; a high value produces a relatively large number of clusters and a zero value produces a relatively small number of clusters. Similarly, a high minimum score may leave some documents unclustered, while a low value clusters all documents. As an alternative to merging, it may be preferable to repeat the same document in multiple clusters. We do not report results on this form of duplication, typically done for smaller numbers of documents, but the procedure provides an option for duplicating documents across clusters.
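A minimal sketch of this single-pass clustering step, in the merging variant, might look as follows. Here matches is the per-document list of (document, score) pairs produced above, and the dictionary bookkeeping is illustrative rather than a description of the data structures in Figure 1:

def form_clusters(matches, min_score=0.0):
    # Single pass over the top-k matches of each document: at most k*n
    # similarity comparisons.  Pairs scoring below min_score are ignored;
    # a pair with neither member clustered starts a new cluster; a pair
    # with one member clustered pulls in the other; a pair spanning two
    # clusters causes the clusters to be merged.
    cluster_of = {}                  # document -> cluster id
    clusters = {}                    # cluster id -> set of documents
    next_id = 0
    for i, best in enumerate(matches):
        for j, score in best:
            if score < min_score:
                continue
            ci, cj = cluster_of.get(i), cluster_of.get(j)
            if ci is None and cj is None:
                clusters[next_id] = {i, j}
                cluster_of[i] = cluster_of[j] = next_id
                next_id += 1
            elif ci is None:
                clusters[cj].add(i)
                cluster_of[i] = cj
            elif cj is None:
                clusters[ci].add(j)
                cluster_of[j] = ci
            elif ci != cj:           # merge the two clusters
                for d in clusters[cj]:
                    cluster_of[d] = ci
                clusters[ci] |= clusters.pop(cj)
    return clusters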
How can we objectively evaluate clustering performance? Very often, the objective measure is related to the clustering technique. For example, k-means clustering can measure overall distance from the mean. Techniques that are based on nearest-neighbor distance, such as most information retrieval techniques, can measure distance from the nearest neighbor or the average distance from other cluster members.
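As one concrete example of such a measure, the average similarity of each clustered document to the other members of its cluster can be computed with the same additive score. This is only a sketch; reduced and idf are the reduced index and inverse document frequencies from the earlier fragments, and the measure itself is one of several reasonable choices:

def mean_within_cluster_similarity(clusters, reduced, idf):
    # Average, over all documents in non-singleton clusters, of the mean
    # additive similarity between a document and the other cluster members.
    total, count = 0.0, 0
    for members in clusters.values():
        members = list(members)
        if len(members) < 2:
            continue
        for i in members:
            sims = [sum(1.0 + idf[w] for w in set(reduced[i]) & set(reduced[j]))
                    for j in members if j != i]
            total += sum(sims) / len(sims)
            count += 1
    return total / count if count else 0.0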
[Figure 3: flowchart of the algorithm for computing the k most similar documents for each document, looping over each document's words via the word list and accumulating additive scores.]
[Figure 4: actions taken for each matched pair D_i, D_j during cluster formation, including skipping pairs whose score is below the Minimum Score and the cluster merge step when both documents are in separate clusters.]
[Figure 5: flowchart of the cluster-forming algorithm, iterating over each document's top-k matches, skipping matches below the Minimum Score, adding an unclustered document D[j] to Cluster(D[i]), and either merging clusters or repeating documents when Cluster(D[i]) and Cluster(D[j]) differ.]
that is particularly important for finding similar documents. Words can be assigned special tags, whereby they are given extra weight or, more importantly, their absence can lead to a subtraction from the score. For example, when two documents have different model numbers, we can subtract one from the score of the candidate matching document. This additive scheme for special tags is consistent with the general scheme of measuring similarity, yet provides a mechanism for handling exceptional matches of words.
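The effect of such a tag can be illustrated by extending the scoring sketch given earlier. The penalty of one follows the model-number example, and the tag sets are illustrative inputs rather than part of the indexing structures described above:

def tagged_pair_score(keys_a, keys_b, idf, tags_a=frozenset(), tags_b=frozenset()):
    # Start from the usual additive score over shared indexed words.
    shared = set(keys_a) & set(keys_b)
    score = sum(1.0 + idf[w] for w in shared)
    # Special tags: a tagged word (e.g. a model number) present in one
    # document but not the other subtracts one from the score.
    unmatched_tags = set(tags_a) ^ set(tags_b)
    score -= len(unmatched_tags)
    return score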
4 Results
Exemplar documents were selected for each of the 3,250 clusters found in the fourth entry of the table. For some large clusters, two or three exemplars were selected, for a total of 5,000 exemplar documents. Using the same scoring scheme, each of the exemplars was matched to the original 51,110 documents. 98.4% of the documents matched at least one of the exemplars, having at least one indexed word in common. 60.7% of the documents matched an exemplar of their assigned cluster, rather than an exemplar of an alternative cluster.
5 Discussion
References