A Fast Clustering Algorithm To Cluster Very Large Categorical Data Sets in Data Mining
Abstract—Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with the results being competitive with conventional algorithms, it is observed that these techniques unfortunately generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries left unknown. This paper presents an analysis suggesting that this problem degrades the quality of the clustering result, and it presents a new link-based approach, which improves the conventional matrix by discovering unknown entries through similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
Index Terms—Clustering, categorical data, cluster ensembles, link-based similarity, data mining.
Fig. 3. The link-based cluster ensemble framework: 1) a cluster ensemble Π = {π_1, . . . , π_M} is created from M base clusterings, 2) a refined cluster-association matrix is then generated from the ensemble using a link-based similarity algorithm, and 3) a final clustering result (π*) is produced by a consensus function of the spectral graph partitioning.

used to represent the ensemble information. The focus has shifted from revealing the similarity among data points to estimating that between clusters. A new link-based algorithm has been specifically proposed to generate such measures in an accurate, inexpensive manner. The LCE methodology is illustrated in Fig. 3. It includes three major steps: 1) creating base clusterings to form a cluster ensemble (Π), 2) generating a refined cluster-association matrix (RM) using a link-based similarity algorithm, and 3) producing the final data partition (π*) by exploiting the spectral graph partitioning technique as a consensus function.

3.1 Creating a Cluster Ensemble

Type I (Direct ensemble). Following the study in [49], the first type of cluster ensemble transforms the problem of categorical data clustering to cluster ensembles by considering each categorical attribute value (or label) as a cluster in an ensemble. Let X = {x_1, . . . , x_N} be a set of N data points, A = {a_1, . . . , a_M} be a set of categorical attributes, and Π = {π_1, . . . , π_M} be a set of M partitions. Each partition π_i is generated for a specific categorical attribute a_i ∈ A. Clusters belonging to a partition π_i = {C_1^i, . . . , C_{k_i}^i} correspond to the different values of the attribute a_i = {a_1^i, . . . , a_{k_i}^i}, where ⋃_{j=1}^{k_i} C_j^i = X and k_i is the number of values of attribute a_i. With this formalism, categorical data X can be directly transformed to a cluster ensemble Π, without actually implementing any base clustering. While single-attribute data partitions may not be as accurate as those obtained from the clustering of all data attributes, they can bring about great diversity within an ensemble. Besides its efficiency, this ensemble generation method has the potential to lead to a high-quality clustering result.

Type II (Full-space ensemble). Unlike the previous case, the following two ensemble types are created from base clustering results, each of which is obtained by applying a clustering algorithm to the categorical data set. For this study, the k-modes technique [8] is used to generate base clusterings, each with a random initialization of cluster centers. For a full-space ensemble in particular, base clusterings are created from the original data, i.e., with all data attributes. To introduce an artificial instability to k-modes, the following two schemes are employed to select the number of clusters in each base clustering: 1) Fixed-k, k = ⌈√N⌉ (where N is the number of data points), and 2) Random-k, k ∈ {2, . . . , ⌈√N⌉}.

Type III (Subspace ensemble). Another alternative to generate diversity within an ensemble is to exploit a number of different data subsets. To this extent, the cluster ensemble is established on various data subspaces, from which base clustering results are generated [48]. Similar to the study in [41], for a given N × d data set of N data points, the number of attributes q in each generated subspace is determined using a uniform random variable α ∈ [0, 1], where q_min and q_max are the lower and upper bounds of the subspace size, respectively. In particular, q_min and q_max are set to 0.75d and 0.85d. Attributes are then selected one by one from the pool of d attributes until the collection of q attributes is obtained, the index of each randomly selected attribute being determined as h = ⌊1 + βd⌋, where h denotes the hth attribute in the pool of d attributes and β ∈ [0, 1) is a uniform random variable. Note that k-modes is exploited to create a cluster ensemble from the set of subspace attributes, using both Fixed-k and Random-k schemes for selecting the number of clusters.

3.2 Generating a Refined Matrix

Several cluster ensemble methods, for both numerical [28], [29] and categorical data [48], [49], are based on the binary cluster-association matrix. Each entry in this matrix, BM(x_i, cl) ∈ {0, 1}, represents a crisp association degree between data point x_i ∈ X and cluster cl ∈ Π. According to Fig. 2, which shows an example of a cluster ensemble and the corresponding BM, a large number of entries in the BM are unknown, each presented with "0." Such a condition occurs when relations between different clusters of a base clustering are originally assumed to be nil. In fact, each data point can possibly associate (to a certain degree within [0, 1]) with several clusters of any particular clustering. These hidden or unknown associations can be estimated from the similarity among clusters, discovered from a network of clusters.

Based on this insight, the refined cluster-association matrix (RM) is put forward as an enhanced variation of the original BM. Its aim is to approximate the value of unknown associations ("0") from known ones ("1"), whose association degrees are preserved within the RM, i.e., BM(x_i, cl) = 1 → RM(x_i, cl) = 1. For each clustering π_t, t = 1 . . . M, and its corresponding clusters C_1^t, . . . , C_{k_t}^t (where k_t is the number of clusters in the clustering π_t), the association degree RM(x_i, cl) ∈ [0, 1] that data point x_i ∈ X has with each cluster cl ∈ {C_1^t, . . . , C_{k_t}^t} is estimated as follows:

    RM(x_i, cl) = 1,                   if cl = C^t(x_i);
                  sim(cl, C^t(x_i)),   otherwise,                    (2)

where C^t(x_i) is the cluster label (corresponding to a particular cluster of the clustering π_t) to which data point x_i belongs. In addition, sim(C_x, C_y) ∈ [0, 1] denotes the similarity between any two clusters C_x, C_y, which can be discovered using the following link-based algorithm. Note that, for any clustering π_t ∈ Π, 1 ≤ Σ_{∀C∈π_t} RM(x_i, C) < k_t. Unlike a measure of fuzzy membership, the typical constraint of Σ_{∀C∈π_t} RM(x_i, C) = 1 is not appropriate for rescaling associations within the RM. In fact, this local normalization would significantly distort the true semantics of known associations ("1"), such that their magnitudes become dissimilar, differing from one clustering to another. According to the empirical investigation, this fuzzy-like enforcement decreases the quality of the RM, and hence the performance of the resulting cluster ensemble method.
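To make the estimation in (2) concrete, the following Python sketch fills a refined matrix from an ensemble. The cluster-similarity function is deliberately a placeholder: the paper computes sim(·,·) with its link-based (Weighted Triple-Quality) algorithm, which is not reproduced in this excerpt, so any function returning values in [0, 1] can be plugged in.

```python
def refine_matrix(ensemble, labels, sim):
    """Build the refined cluster-association matrix RM per Eq. (2).

    ensemble : list of clusterings; clustering t is a list of cluster ids
    labels   : labels[i][t] = cluster id of data point i in clustering t
    sim      : sim(cx, cy) -> similarity in [0, 1] between two clusters
    """
    rm = []
    for point_labels in labels:
        row = {}
        for t, clusters in enumerate(ensemble):
            own = point_labels[t]  # C^t(x_i): the known association ("1")
            for cl in clusters:
                # Known entries stay 1; unknown entries ("0" in the BM) are
                # estimated from the similarity to the point's own cluster.
                row[cl] = 1.0 if cl == own else sim(cl, own)
        rm.append(row)
    return rm

# Toy example: two clusterings of three points, with a placeholder
# similarity of 0.3 for any pair of distinct clusters.
ensemble = [["c1", "c2"], ["c3", "c4"]]
labels = [["c1", "c3"], ["c1", "c4"], ["c2", "c3"]]
rm = refine_matrix(ensemble, labels, lambda cx, cy: 0.3)
print(rm[0])  # {'c1': 1.0, 'c2': 0.3, 'c3': 1.0, 'c4': 0.3}
```

Note how known associations stay exactly 1 while each unknown entry inherits the similarity between its cluster and the point's own cluster in the same clustering, so the per-clustering row sum stays within [1, k_t) as stated above.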
IAM-ON ET AL.: A LINK-BASED CLUSTER ENSEMBLE APPROACH FOR CATEGORICAL DATA CLUSTERING 417
TABLE 2
Classification Accuracy of Different Clustering Methods
The five highest CA scores of each data set are highlighted in boldface. Note that unobtainable results are marked as “N/A.”
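Table 2 reports classification accuracy (CA). The excerpt does not define CA, so the sketch below uses a common convention for scoring a clustering against known classes: each cluster votes for its majority true class, and CA is the fraction of points whose cluster's majority class matches their own label. The paper's exact definition may differ in detail.

```python
from collections import Counter

def classification_accuracy(pred, truth):
    """CA under the majority-label convention (an assumption, see above).

    pred  : cluster id assigned to each data point
    truth : known class label of each data point
    """
    # Map every cluster to the most frequent true class among its members.
    majority = {}
    for c in set(pred):
        members = [t for p, t in zip(pred, truth) if p == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    # CA = fraction of points agreeing with their cluster's majority class.
    correct = sum(1 for p, t in zip(pred, truth) if majority[p] == t)
    return correct / len(truth)

pred = [0, 0, 0, 1, 1, 1]
truth = ['a', 'a', 'b', 'b', 'b', 'a']
print(classification_accuracy(pred, truth))  # 4/6 ~ 0.667
```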
KDDCup99, for which several cluster ensemble techniques (CO+SL, CO+AL, and CSPA) are immaterial.

With the measures of the LCE models being mostly higher than those of the corresponding baseline counterparts (Base), the quality of the RM appears to be significantly better than that of the original, binary variation. Compared to the LCE models that use Type-II and Type-III ensembles (both "Fixed-k" and "Random-k"), the LCE with a Type-I (or direct) ensemble is less effective. This is largely due to the quality of the base clusterings, which are single-attribute for Type-I and multiattribute for the others. Despite its inefficiency, CSPA has the best performance among the assessed ensemble methods. In addition, Cobweb is the most effective among the five categorical data clustering algorithms included in this evaluation. Similar experimental results are also observed using the NMI and AR evaluation indices. The corresponding details are given in Section II-A of the online supplementary material.

In order to further evaluate the quality of the identified techniques, the number of times that one method is significantly better and worse (at the 95 percent confidence level) than the others is assessed across the experimented data sets. Let X̄_C(i, ς) be the average value of validity index C ∈ {CA, NMI, AR} across n runs (n = 50 in this evaluation) for a clustering method i ∈ CM (CM is the set of 40 experimented clustering methods) on a specific data set ς ∈ DT (DT is the set of six data sets). To obtain a fair comparison, this pairwise assessment is conducted on the results for the six data sets where clustering results can be obtained for all clustering methods. Also note that CM consists of five clustering algorithms for categorical data and 35 different cluster ensemble models, each of which is a unique combination of ensemble type (i.e., Type-I, Type-II (Fixed-k), Type-II (Random-k), Type-III (Fixed-k), and Type-III (Random-k)) and ensemble method (i.e., LCE, Base, CO+SL, CO+AL, CSPA, HGPA, and MCLA).
    LX_C(i, ς) = X̄_C(i, ς) − 1.96 · S_C(i, ς) / √n

and

    UX_C(i, ς) = X̄_C(i, ς) + 1.96 · S_C(i, ς) / √n.

Note that S_C(i, ς) denotes the standard deviation of the validity index C across n runs for a clustering method i and a data set ς. The number of times that one method i ∈ CM is significantly better than its competitors, B_C(i) (in accordance with the validity criterion C), can be defined as

    B_C(i) = Σ_{∀ς ∈ DT} Σ_{∀i′ ∈ CM, i′ ≠ i} better_C(i, i′),        (8)

    better_C(i, i′) = 1, if LX_C(i, ς) > UX_C(i′, ς); 0, otherwise.   (9)

Similarly, the number of times that one method i ∈ CM is significantly worse than its competitors, W_C(i), in accordance with the validity criterion C, can be computed as

    W_C(i) = Σ_{∀ς ∈ DT} Σ_{∀i′ ∈ CM, i′ ≠ i} worse_C(i, i′),         (10)

    worse_C(i, i′) = 1, if UX_C(i, ς) < LX_C(i′, ς); 0, otherwise.    (11)
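The counting scheme in (8)-(11) amounts to comparing 95 percent confidence intervals: method i beats i′ on a data set when i's lower bound exceeds i′'s upper bound. A minimal Python sketch (the per-run scores below are made-up numbers for illustration, not the paper's results):

```python
from statistics import mean, stdev

Z = 1.96  # critical value for the 95 percent confidence level

def bounds(runs):
    """Confidence bounds (LX, UX) for one method on one data set."""
    half = Z * stdev(runs) / len(runs) ** 0.5
    return mean(runs) - half, mean(runs) + half

def better_worse(scores):
    """Count B(i) and W(i) over all data sets, per Eqs. (8)-(11).

    scores[i][d] = validity-index values across n runs for method i
    on data set d.
    """
    methods = list(scores)
    datasets = list(next(iter(scores.values())))
    B = {i: 0 for i in methods}
    W = {i: 0 for i in methods}
    for d in datasets:
        bw = {i: bounds(scores[i][d]) for i in methods}
        for i in methods:
            for j in methods:
                if i == j:
                    continue
                if bw[i][0] > bw[j][1]:  # better: LX(i) > UX(j)
                    B[i] += 1
                if bw[i][1] < bw[j][0]:  # worse: UX(i) < LX(j)
                    W[i] += 1
    return B, W

# Made-up scores for two methods on one data set:
scores = {"LCE":  {"d1": [0.90, 0.91, 0.89, 0.92]},
          "Base": {"d1": [0.70, 0.72, 0.69, 0.71]}}
B, W = better_worse(scores)
print(B["LCE"] - W["LCE"])  # total performance B - W = 1
```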
Using the aforementioned assessment formalism, Table 3 illustrates for each method the frequencies of significantly better (B) and significantly worse (W) performance, categorized in accordance with the evaluation indices (CA, NMI, and AR). (Table 3 note: for each evaluation index, "B" and "W" denote the number of times that a particular method performs significantly "better" and "worse" than the others.) The results shown in this table indicate the superior effectiveness of the proposed link-based methods, as compared to the other clustering techniques included in this experiment. To better perceive this comparison, Fig. 6 summarizes the total performance (B − W) of each clustering method, sorted in descending order, across all evaluation indices and six data sets. Note that the total performance (B − W) of any particular algorithm is specified as the difference between the corresponding values of B and W. It can be seen that all link-based methods perform better than their competitors. In fact, these LCE models have the five highest statistics of B − W, while CO+AL with a Type-II (Fixed-k) ensemble is the most effective among the compared techniques. In addition, Cobweb and Squeezer perform better than the other three categorical data clustering algorithms.
Fig. 6. The statistics of total performance (B − W) at the 95 percent confidence level, summarized across all evaluation indices and six data sets, and sorted in descending order.
422 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012
Fig. 7. The relations between DC ∈ {0.1, 0.2, . . . , 0.9} and the performance of the LCE models (the averages across all validity indices and six data sets), presented on the X-axis and Y-axis, respectively. Note that the performance of the other clustering methods is also included for comparison purposes. (a) Ensemble Type I. (b) Ensemble Type II, Fixed-k. (c) Ensemble Type II, Random-k. (d) Ensemble Type III, Fixed-k. (e) Ensemble Type III, Random-k.

Fig. 8. Performance of different cluster ensemble methods in accordance with ensemble size (M ∈ {10, 20, . . . , 100}), as the averages of validity measures (CA, NMI, and AR) across six data sets. (a) Ensemble Type II, Fixed-k. (b) Ensemble Type II, Random-k. (c) Ensemble Type III, Fixed-k. (d) Ensemble Type III, Random-k.
Another important investigation concerns the relations between the performance of the experimented cluster ensemble methods and the different types of ensemble explored in the present evaluation. To this point, it has been demonstrated that the LCE approach is more accurate than other cluster ensemble methods across all examined ensemble types. Further details of the results and discussion regarding the effect of ensemble types on the performance of LCE are provided in Section II-B of the online supplementary material.

4.4 Parameter and Complexity Analysis

The parameter analysis aims to provide a practical means by which users can make the best use of the link-based framework. Essentially, the performance of the resulting technique is dependent on the decay factor (i.e., DC ∈ [0, 1]), which is used in estimating the similarity among clusters and the association degrees previously unknown in the original BM. We varied the value of this parameter from 0.1 through 0.9, in steps of 0.1, and obtained the results in Fig. 7. Note that the presented results are obtained with an ensemble size (M) of 10. The figure clearly shows that the results of LCE are robust across different ensemble types and do not depend strongly on any particular value of DC. This makes it easy for users to obtain high-quality, reliable results, with the best outcomes being obtained with values of DC between 0.7 and 0.9. Although there is variation in response across the DC values, the performance of LCE is always better than that of any of the other clustering methods included in this assessment. Another important observation is that the effectiveness of the link-based measure decreases as DC becomes smaller. Intuitively, the significance of disclosed associations becomes trivial when DC is low. Hence, they may be overlooked by a consensus function, and the quality of the resulting data partition is not improved.

Another parameter that may determine the quality of the data partition generated by a cluster ensemble technique is the ensemble size (M). Intuitively, the larger an ensemble is, the better the performance becomes. According to Fig. 8, which is obtained with a DC of 0.9, this heuristic is generally applicable to LCE with Type-II and Type-III ensembles, whose average quality measures (across all validity indices and six data sets) gradually improve with increasing values of M ∈ {10, 20, . . . , 100}. Furthermore, LCE performs consistently better than its competitors at all ensemble sizes, while CO+SL is apparently the least effective. Note that a bigger ensemble leads to improved accuracy, but with a trade-off in runtime; again, even the worst results of LCE are better than the best results of the other methods.

Besides the previous quality assessments, the computational requirements of the link-based method are discussed here. Primarily, the time complexity of creating the RM is O(P² + NP), where N is the number of data points. While P denotes the number of clusters in a Type-II or Type-III ensemble, it represents the cardinality of all categorical values in a direct ensemble (i.e., Type-I). Please consult Section III in the online supplementary material for the details of the scalability evaluation.
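The O(P² + NP) complexity of building the RM can be read as two phases: computing pairwise similarities over the P clusters, then filling the N × P refined matrix. A back-of-envelope cost model (assigning unit cost to each similarity computation and each matrix entry is an assumption made purely for illustration):

```python
def rm_cost(n_points, n_clusters):
    """Rough operation count for building the RM: O(P^2 + N*P).

    Phase 1: pairwise similarity among all P clusters -> P * P units.
    Phase 2: one association entry per (point, cluster) pair -> N * P units.
    """
    pairwise = n_clusters * n_clusters
    fill = n_points * n_clusters
    return pairwise + fill

# For N = 1000 points and P = 30 clusters, the N*P fill phase dominates:
print(rm_cost(1000, 30))  # 30*30 + 1000*30 = 30900
```

This also explains why a Type-I (direct) ensemble can be costlier in this step: there P is the total number of distinct categorical values, which may far exceed the number of clusters in a Type-II or Type-III ensemble.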
directly used to generate the final data partition from the proposed RM. The LCE framework is generic such that it can be adopted for analyzing other types of data.

6 CONCLUSION

This paper presents a novel, highly effective link-based cluster ensemble approach to categorical data clustering. It transforms the original categorical data matrix to an information-preserving numerical variation (RM), to which an effective graph partitioning technique can be directly applied. The problem of constructing the RM is efficiently resolved by the similarity among categorical labels (or clusters), using the Weighted Triple-Quality similarity algorithm. The empirical study, with different ensemble types, validity measures, and data sets, suggests that the proposed link-based method usually achieves superior clustering results compared to those of the traditional categorical data algorithms and benchmark cluster ensemble techniques. Prominent future work includes an extensive study of the behavior of other link-based similarity measures within this problem context. Also, the new method will be applied to specific domains, including tourism and medical data sets.

ACKNOWLEDGMENTS

The authors are grateful to the anonymous referees for their constructive comments, which have helped considerably in revising this paper.

REFERENCES

[1] D.S. Hochbaum and D.B. Shmoys, "A Best Possible Heuristic for the K-Center Problem," Math. of Operations Research, vol. 10, no. 2, pp. 180-184, 1985.
[2] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[3] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[4] P. Zhang, X. Wang, and P.X. Song, "Clustering Categorical Data Based on Distance Vectors," The J. Am. Statistical Assoc., vol. 101, no. 473, pp. 355-367, 2006.
[5] J. Grambeier and A. Rudolph, "Techniques of Cluster Algorithms in Data Mining," Data Mining and Knowledge Discovery, vol. 6, pp. 303-360, 2002.
[6] K.C. Gowda and E. Diday, "Symbolic Clustering Using a New Dissimilarity Measure," Pattern Recognition, vol. 24, no. 6, pp. 567-578, 1991.
[7] J.C. Gower, "A General Coefficient of Similarity and Some of Its Properties," Biometrics, vol. 27, pp. 857-871, 1971.
[8] Z. Huang, "Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, pp. 283-304, 1998.
[9] Z. He, X. Xu, and S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data," J. Computer Science and Technology, vol. 17, no. 5, pp. 611-624, 2002.
[10] P. Andritsos and V. Tzerpos, "Information-Theoretic Software Clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150-165, Feb. 2005.
[11] D. Cristofor and D. Simovici, "Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms," J. Universal Computer Science, vol. 8, no. 2, pp. 153-172, 2002.
[12] D.H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, vol. 2, pp. 139-172, 1987.
[13] D. Gibson, J. Kleinberg, and P. Raghavan, "Clustering Categorical Data: An Approach Based on Dynamical Systems," VLDB J., vol. 8, nos. 3-4, pp. 222-236, 2000.
[14] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Information Systems, vol. 25, no. 5, pp. 345-366, 2000.
[15] M.J. Zaki and M. Peters, "CLICKS: Mining Subspace Clusters in Categorical Data via K-Partite Maximal Cliques," Proc. Int'l Conf. Data Eng. (ICDE), pp. 355-356, 2005.
[16] V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 73-83, 1999.
[17] D. Barbara, Y. Li, and J. Couto, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," Proc. Int'l Conf. Information and Knowledge Management (CIKM), pp. 582-589, 2002.
[18] Y. Yang, S. Guan, and J. You, "CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 682-687, 2002.
[19] D.H. Wolpert and W.G. Macready, "No Free Lunch Theorems for Search," Technical Report SFI-TR-95-02-010, Santa Fe Inst., 1995.
[20] L.I. Kuncheva and S.T. Hadjitodorov, "Using Diversity in Cluster Ensembles," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, pp. 1214-1219, 2004.
[21] H. Xue, S. Chen, and Q. Yang, "Discriminatively Regularized Least-Squares Classification," Pattern Recognition, vol. 42, no. 1, pp. 93-104, 2009.
[22] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering Aggregation," Proc. Int'l Conf. Data Eng. (ICDE), pp. 341-352, 2005.
[23] N. Nguyen and R. Caruana, "Consensus Clusterings," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 607-612, 2007.
[24] A.P. Topchy, A.K. Jain, and W.F. Punch, "Clustering Ensembles: Models of Consensus and Weak Partitions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1866-1881, Dec. 2005.
[25] C. Boulis and M. Ostendorf, "Combining Multiple Clustering Systems," Proc. European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 63-74, 2004.
[26] B. Fischer and J.M. Buhmann, "Bagging for Path-Based Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 11, pp. 1411-1415, Nov. 2003.
[27] C. Domeniconi and M. Al-Razgan, "Weighted Cluster Ensembles: Methods and Analysis," ACM Trans. Knowledge Discovery from Data, vol. 2, no. 4, pp. 1-40, 2009.
[28] X.Z. Fern and C.E. Brodley, "Solving Cluster Ensemble Problems by Bipartite Graph Partitioning," Proc. Int'l Conf. Machine Learning (ICML), pp. 36-43, 2004.
[29] A. Strehl and J. Ghosh, "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions," J. Machine Learning Research, vol. 3, pp. 583-617, 2002.
[30] H. Ayad and M. Kamel, "Finding Natural Clusters Using Multiclusterer Combiner Based on Shared Nearest Neighbors," Proc. Int'l Workshop Multiple Classifier Systems, pp. 166-175, 2003.
[31] A.L.N. Fred and A.K. Jain, "Combining Multiple Clusterings Using Evidence Accumulation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, June 2005.
[32] S. Monti, P. Tamayo, J.P. Mesirov, and T.R. Golub, "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data," Machine Learning, vol. 52, nos. 1/2, pp. 91-118, 2003.
[33] N. Iam-On, T. Boongoen, and S. Garrett, "Refining Pairwise Similarity Matrix for Cluster Ensemble Problem with Cluster Relations," Proc. Int'l Conf. Discovery Science, pp. 222-233, 2008.
[34] T. Boongoen, Q. Shen, and C. Price, "Disclosing False Identity through Hybrid Link Analysis," Artificial Intelligence and Law, vol. 18, no. 1, pp. 77-102, 2010.
[35] L. Getoor and C.P. Diehl, "Link Mining: A Survey," ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 3-12, 2005.
[36] D. Liben-Nowell and J. Kleinberg, "The Link-Prediction Problem for Social Networks," J. Am. Soc. for Information Science and Technology, vol. 58, no. 7, pp. 1019-1031, 2007.
[37] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On Combining Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, Mar. 1998.
[38] L.I. Kuncheva and D. Vetrov, "Evaluation of Stability of K-Means Cluster Ensembles with Respect to Random Initialization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1798-1808, Nov. 2006.
[39] A.P. Topchy, A.K. Jain, and W.F. Punch, "A Mixture Model for Clustering Ensembles," Proc. SIAM Int'l Conf. Data Mining, pp. 379-390, 2004.
[40] X.Z. Fern and C.E. Brodley, "Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach," Proc. Int'l Conf. Machine Learning (ICML), pp. 186-193, 2003.
[41] Z. Yu, H.-S. Wong, and H. Wang, "Graph-Based Consensus Clustering for Class Discovery from Gene Expression Data," Bioinformatics, vol. 23, no. 21, pp. 2888-2896, 2007.
[42] S. Dudoit and J. Fridlyand, "Bagging to Improve the Accuracy of a Clustering Procedure," Bioinformatics, vol. 19, no. 9, pp. 1090-1099, 2003.
[43] B. Minaei-Bidgoli, A. Topchy, and W. Punch, "A Comparison of Resampling Methods for Clustering Ensembles," Proc. Int'l Conf. Artificial Intelligence, pp. 939-945, 2004.
[44] X. Hu and I. Yoo, "Cluster Ensemble and Its Applications in Gene Expression Analysis," Proc. Asia-Pacific Bioinformatics Conf., pp. 297-302, 2004.
[45] M. Law, A. Topchy, and A.K. Jain, "Multiobjective Data Clustering," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 424-430, 2004.
[46] G. Karypis and V. Kumar, "Multilevel K-Way Partitioning Scheme for Irregular Graphs," J. Parallel Distributed Computing, vol. 48, no. 1, pp. 96-129, 1998.
[47] A. Ng, M. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, vol. 14, pp. 849-856, 2001.
[48] M. Al-Razgan, C. Domeniconi, and D. Barbara, "Random Subspace Ensembles for Clustering Categorical Data," Supervised and Unsupervised Ensemble Methods and Their Applications, pp. 31-48, Springer, 2008.
[49] Z. He, X. Xu, and S. Deng, "A Cluster Ensemble Method for Clustering Categorical Data," Information Fusion, vol. 6, no. 2, pp. 143-151, 2005.
[50] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 207-216, 1993.
[51] P.N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2005.
[52] G. Jeh and J. Widom, "SimRank: A Measure of Structural-Context Similarity," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 538-543, 2002.
[53] F. Fouss, A. Pirotte, J.M. Renders, and M. Saerens, "Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, Mar. 2007.
[54] E. Minkov, W.W. Cohen, and A.Y. Ng, "Contextual Search and Name Disambiguation in Email Using Graphs," Proc. Int'l Conf. Research and Development in IR, pp. 27-34, 2006.
[55] P. Reuther and B. Walter, "Survey on Test Collections and Techniques for Personal Name Matching," Int'l J. Metadata, Semantics and Ontologies, vol. 1, no. 2, pp. 89-99, 2006.
[56] L.A. Adamic and E. Adar, "Friends and Neighbors on the Web," Social Networks, vol. 25, no. 3, pp. 211-230, 2003.
[57] U. von Luxburg, "A Tutorial on Spectral Clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[58] A. Asuncion and D.J. Newman, "UCI Machine Learning Repository," School of Information and Computer Science, Univ. of California, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
[59] L. Hubert and P. Arabie, "Comparing Partitions," J. Classification, vol. 2, no. 1, pp. 193-218, 1985.
[60] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel Hypergraph Partitioning: Applications in VLSI Domain," IEEE Trans. Very Large Scale Integration Systems, vol. 7, no. 1, pp. 69-79, Mar. 1999.
[61] G. Das and H. Mannila, "Context-Based Similarity Methods for Categorical Attributes," Proc. Principles of Data Mining and Knowledge Discovery (PKDD), pp. 201-211, 2000.
[62] G. Das, H. Mannila, and P. Ronkainen, "Similarity of Attributes by External Probes," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 16-22, 1998.
[63] Y. Zhang, A. Fu, C. Cai, and P. Heng, "Clustering Categorical Data," Proc. Int'l Conf. Data Eng. (ICDE), p. 305, 2000.
[64] M. Dutta, A.K. Mahanta, and A.K. Pujari, "QROCK: A Quick Version of the ROCK Algorithm for Clustering of Categorical Data," Pattern Recognition Letters, vol. 26, pp. 2364-2373, 2005.
[65] E. Abdu and D. Salane, "A Spectral-Based Clustering Algorithm for Categorical Data Using Data Summaries," Proc. Workshop Data Mining using Matrices and Tensors, pp. 1-8, 2009.
[66] B. Mirkin, "Reinterpreting the Category Utility Function," Machine Learning, vol. 45, pp. 219-228, 2001.

Natthakan Iam-On received the PhD degree in computer science from Aberystwyth University. She is a lecturer in the School of Information Technology, Mae Fah Luang University, Thailand. Her research focuses on data clustering, cluster ensembles and applications to biomedical data analysis, advanced database technology, and knowledge discovery.

Tossapon Boongoen received the PhD degree in artificial intelligence from Cranfield University and worked as a postdoctoral research associate at Aberystwyth University, United Kingdom. He is a lecturer in the Department of Mathematics and Computer Science, Royal Thai Air Force Academy, Thailand. His research interests include data mining, link analysis, data clustering, fuzzy aggregation, and classification systems.

Simon Garrett founded and is CEO of Aispire Consulting Ltd., having worked at Aberystwyth University in the Department of Computer Science as both a lecturer and researcher. His research has been in machine learning and clustering, which have been his interests for more than 10 years. He has been recognized for his contribution to artificial immune systems, and has done work on their ability to cluster data and find cluster centers quickly and efficiently.

Chris Price received the BSc degree in computer science from Aberystwyth University, United Kingdom, in 1979 and, after eight years building artificial intelligence systems in industry, returned to academia in 1986. He received the PhD degree in computer science from Aberystwyth University in 1994, where he was made a full professor in 1999. Much of his research has concentrated on reasoning from models to build design and diagnosis tools for use by engineers in automotive and aerospace companies.