INCONCO: Interpretable Clustering of Numerical and Categorical Objects
Christian Böhm (Ludwig-Maximilians-University of Munich) and Claudia Plant
KDD'11, August 21–24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08.
ABSTRACT
The integrative mining of heterogeneous data and the interpretability of the data mining result are two of the most important challenges of today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, both challenges have not yet received the due attention. Only few approaches for clustering of objects with mixed-type attributes exist, and those few approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: to explain why a certain set of objects has been grouped into a cluster and what distinguishes one cluster from another. In this paper, we approach both challenges by constructing a relationship to the concept of data compression using the Minimum Description Length principle: a detected cluster structure is the better, the more efficiently it can be exploited for data compression. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off for attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed-type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. Thereby, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining; E.4 [Coding and Information Theory]: Data compaction and compression

General Terms
Algorithms

1. MOTIVATION
With multiple books, surveys, and papers, the research area of clustering has already gained a high level of maturity. Nevertheless, we address in this paper two challenges for the clustering problem which are of high practical importance but have received too little attention until now. The first of these two challenges is the interpretability of the result. Most of the current algorithms for clustering merely assign objects to groups without making any attempt to further explain why a certain set of objects has been selected to build a cluster and what distinguishes one cluster from another. In this paper, we represent clusters by model formulas, involving case-by-case analysis of categorical attributes and linear combinations of numerical attributes. For the interpretability of a clustering result involving a medium to high number of attributes, it is essential to avoid overly complex model formulas. Therefore, our algorithm searches specifically for a natural balance between a good fit of the data to the model and a suitable reduction of the model complexity (also leading to generalization). The most important goal of data mining in general is to gain actionable knowledge from data (or, colloquially speaking, to stop starving for knowledge while drowning in data). The objective of interpretability is mission-critical for a successful application of clustering in the whole knowledge discovery process.

The second big challenge addressed in this paper is mixed-type attributes. The overwhelming majority of previous clustering algorithms is restricted to numerical objects, like vectors, metric objects, or nodes in a weighted or unweighted graph. There are also some approaches for categorical objects like K-modes [8], but there is a severe lack of solutions which can handle objects having both numerical and categorical attributes. Most of the few existing approaches, like the algorithm K-Prototypes [17], typically use two different optimization goals, one for the numerical and another for the categorical data, see also Section 6. Whenever these two goals disagree, a manually chosen weighting factor has to decide how to resolve this tie situation. The weighting factor solution, however, is not completely satisfactory because it is difficult to select and it is not allowed to vary between different clusters or change over time (while the clusters evolve). Moreover, such approaches implicitly assume independence between attributes of different types. In contrast, our solution considers the task of learning weighting factors and detecting dependencies between attributes (of the same or different type) as part of the overall clustering process. It is commonly agreed in the data mining community that there is a severe lack of knowledge discovery methods for mixed-type attributes. In panel discussions [25] and position papers [29], this has been identified as one of the 10 most important challenges in data mining for the next decade. Our method gives a comprehensive answer to the two challenges of mixed-type attributes and interpretability in clustering.

Contributions
We propose a principled approach to informative clustering of objects represented by mixed categorical and numerical data. The optimization goal is defined in Section 3 and our algorithm INCONCO, based on this optimization goal, in Section 4.
The major benefits of our algorithm INCONCO (for INterpretable Clustering of Numerical and Categorical Objects) can be summarized as follows:

• Relying on the principle of Minimum Description Length (MDL), INCONCO seamlessly integrates numerical and categorical information in clustering.

• INCONCO clusters by revealing dependency patterns among the mixed-type attributes using an extended Cholesky decomposition.

• Attribute dependency patterns are formalized by conditional probabilities which yield interpretable formulas concisely describing the cluster content.

• The practical applicability of our algorithm is guaranteed since INCONCO does not require input parameters and is scalable.

Figure 1: PCA vs. the Conditional Cholesky Sequence.
new, uncorrelated (and possibly fewer) attributes, which are defined by the Eigenvectors corresponding to the large Eigenvalues. However, the interpretation of this model is difficult because (1) the Eigenvectors do not describe directly how the original attributes depend on each other but rather the indirect dependency on the new, Eigenvector-based attributes, and (2) the dependencies are not restricted to the most relevant ones: what is reported is always a complete model defining in an (n × m)-matrix how the original n attributes determine the m new ones. The only gain w.r.t. interpretability is the dimensionality reduction if m < n.

In this paper, we handle and represent dependencies by a decomposition which is inspired by the Cholesky decomposition [13]. The advantages of the original Cholesky decomposition over PCA are: (1) Cholesky is much more efficient than PCA, approximately by a factor of 10, see e.g. [23]. (2) Its result directly interrelates the original attributes to each other for a higher degree of interpretability. (3) Using suitable pivoting strategies [9], the number of reported attribute dependencies can be considerably further reduced over the PCA result. Our decomposition extends the original Cholesky decomposition in two additional aspects, yielding further major advantages over PCA: (4) categorical attributes are considered in a seamless way, and (5) of all observable dependencies only a subset of relevant ones is considered (where relevance is decided by MDL). We base our cluster notion on this extended Cholesky decomposition which captures mixed-type attribute dependencies in descriptive formulas. This powerful cluster model serves as an optimization goal for an efficient top-down clustering strategy.

Notation. We assume a database DB consisting of a number n = |DB| of objects, each consisting of c categorical and d numerical attributes A = {A, B, C, ...}, X = {x, y, z, ...} with |A| = c and |X| = d. The task of our clustering algorithm is to decompose DB into disjoint clusters {C_1, ..., C_k} where k is usually not an input parameter but determined by the overall clustering method. Each cluster is described by its weight w_i = |C_i|/n and a probability distribution for every categorical and numerical attribute. Every categorical attribute A ∈ A consists of m_A = |A| distinct values {a_1, ..., a_{m_A}}. The probability of each attribute value in a cluster is denoted by P(A = a_i). If categorical attributes A and B are dependent, we consider the joint probability of their value combinations. The numerical attributes are modeled by a multivariate Gaussian distribution:

p(x_1, ..., x_d) = N_{\mu,\Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \cdot e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}.

The joint probability of one (or more) categorical and one (or more) numerical attributes is modeled by a case-by-case analysis, associating to each categorical value an individual Gaussian:

p(A, x) = \begin{cases} P(A = a_1) \cdot N_{\mu_1,\Sigma_1}(x) & \text{if } A = a_1, \\ P(A = a_2) \cdot N_{\mu_2,\Sigma_2}(x) & \text{if } A = a_2, \\ \quad \vdots \end{cases}    (1)

The categorical attribute A may actually be composed of several attributes A = {A_{i1}, A_{i2}, ...}, and likewise, the numerical attribute may be composed of several numerical attributes x = (x_{j1}, x_{j2}, ...)^T. An attribute which may be composed of one or more original attributes of the same or different type is called a super-attribute S ⊆ A ∪ X, having a dimension d_S and a number of categorical attributes c_S with 0 ≤ d_S ≤ d and 0 ≤ c_S ≤ c. Our method described in Section 3 will find for each cluster a disjoint partitioning of the attributes into super-attributes. Thereby, we consider the super-attributes as independent from each other. This is not a limitation since it is also possible to group all original attributes into one super-attribute to regard every pair of original attributes as dependent.
attributes A = {A, B, C, ...}, X = {x, y, z, ...} with |A| = c
and |X | = d. The task of our clustering algorithm is to decom- Creating a Sequence of Conditional Probabilities
pose DB into disjoint clusters {C1 , ..., Ck } where k is eventually It is possible to rewrite the Gaussian probability density function
not an input parameter but determined by the overall clustering using the Cholesky decomposition to gain a sequence of condi-
method. Each cluster is described by its weight wi = |Cni | and tional probabilities which can be multiplied using Bayes’ theorem:
1 1 T −1 T −1
a probability distribution for every categorical and numerical at- Nμ,Σ (x) = · e− 2 (x−μ) ·(L ) ·L ·(x−μ) .
tribute. Every categorical attribute A ∈ A consists of mA = |A| (2π) |L|
d 2
distinct values {a1 , ..., amA }. The probability of each attribute By a substitution of a new series of variables μ = (μ1 , . . . , μd )T ,
value in a cluster is denoted by P (A = ai ). If categorical at- 1 ¯
μ1 = μ1 ; μi := μi − ¯ li,j (xj − μj ) ∀i ≥ 2. (2)
tributes A and B are dependent, we consider the joint probability li,i 1≤j<i
we obtain

N_{\mu,\Sigma}(x) = \prod_{1 \le i \le d} \frac{1}{\sqrt{2\pi}\, l_{i,i}} \cdot e^{-\frac{(x_i - \mu'_i)^2}{2 l_{i,i}^2}} = \prod_{1 \le i \le d} N_{\mu'_i,\, l_{i,i}^2}(x_i).
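The following Python sketch (with made-up numbers) illustrates Eq. (2) and the product formula: it factorizes a multivariate Gaussian via the Cholesky factor (computed here with numpy rather than by hand-coding the recurrences) and checks numerically that the product of conditional univariate Gaussians reproduces the full density.

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    # Hypothetical 3x3 covariance matrix and mean (placeholder values).
    Sigma = np.array([[2.0, 0.8, 0.3],
                      [0.8, 1.5, 0.5],
                      [0.3, 0.5, 1.0]])
    mu = np.array([1.0, -0.5, 2.0])
    x  = np.array([1.3,  0.1, 1.7])

    L = np.linalg.cholesky(Sigma)      # lower triangular, Sigma = L @ L.T
    Lbar = np.linalg.inv(L)            # [l-bar_{i,j}] from the text

    # Shifted means mu'_i according to Eq. (2).
    d = len(mu)
    mu_prime = mu.copy()
    for i in range(1, d):
        mu_prime[i] = mu[i] - (1.0 / Lbar[i, i]) * Lbar[i, :i] @ (x[:i] - mu[:i])

    # Product of conditional univariate Gaussians N_{mu'_i, l_{i,i}^2}(x_i) ...
    prod = np.prod([norm.pdf(x[i], loc=mu_prime[i], scale=L[i, i]) for i in range(d)])
    # ... equals the full multivariate density N_{mu, Sigma}(x).
    full = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    print(prod, full)                  # both values agree up to rounding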
The last formula can be interpreted as a sequence of conditional PDFs, because the first factor of the product formula uses the original mean \mu'_1 = \mu_1 and variance l_{1,1}^2 = \sigma_1^2. The second factor represents a Gaussian which has been shifted from the original \mu_2 using a linear term involving x_1, the third factor a Gaussian shifted by x_1 and x_2, and so on. Together, we can say that

N_{\mu,\Sigma}(x) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1, x_2) \cdot \ldots

The intuition of this sequence of conditional probabilities and the relationship to PCA is visualized in Figure 1. PCA handles the dependency by rotating the Gaussian until it is axis-parallel (by multiplication with the Eigenvector matrix). The resulting attributes (x' and y'), depicted above and to the left of the multivariate Gaussian, are independent but also less interpretable than the original coordinates. In contrast, Cholesky first evaluates the original x-dimension, as depicted below the multivariate Gaussian. Depending on the value of the x-coordinate (for the two marked points, x = 0 and x = 1), two different Gaussians for the y-coordinate are valid (determined by \mu'_y). Of course, the product of the two PCA probabilities always agrees with the conditional sequence: p(x') · p(y') = p(x) · p(y|x), because Cholesky and PCA are just two different methods to factorize the multivariate Gaussian.

Nevertheless, Cholesky has many advantages over PCA. Apart from being faster than PCA by a constant (but high) factor, the model is also much more interpretable for the following two reasons: (1) The original variables are maintained, and all projections are orthogonal to the original coordinates. Equation (2) can be effectively used to present the attribute dependencies in a formal but also intuitive way. (2) Typically, many of the triangular entries in the resulting matrix L are zero if the original covariance matrix is not fully occupied. Therefore, the model formulas generated by Cholesky are much simpler, particularly if the corresponding numerical attributes are reordered prior to the actual Cholesky decomposition. For reordering, several different pivoting algorithms such as AMD, MMD and SYMMMD [9] are available. In contrast, the Eigenvector matrices of PCA are often fully occupied, and even if a row-echelon form is gained from them, a fully occupied triangular matrix is typically obtained. In Section 3, we develop a strategy to exploit and enhance this effect by an MDL-based criterion. In the next section, we show how categorical attributes can be inserted at appropriate positions in the sequence of conditional probabilities.

Considering Categorical Attributes
If a numerical and a categorical attribute are not independent from each other, this dependency can also be handled by a conditional probability in either direction. The simplest case is that the numerical attribute is conditionally dependent on the categorical attribute, p(x|A). According to the definition in Equation (1), we have:

p(x|A) = \begin{cases} N_{\mu_1,\Sigma_1}(x) & \text{if } A = a_1, \\ N_{\mu_2,\Sigma_2}(x) & \text{if } A = a_2, \\ \quad \vdots \end{cases}

This also works if A and/or x are composed of more than one categorical/numerical attribute. Also the other way round, P(A|x) can easily be determined by exploiting Bayes' theorem:

P(A|x) = P(A) \cdot p(x|A) / p(x).

If a given super-attribute contains both numerical and categorical attributes, then the probabilities of the categorical attributes can be inserted into the conditional chain at the beginning. If two or more categorical variables A, B, C are dependent on each other, we insert at the beginning a chain like P(A) · P(B|A) · P(C|A, B). For every numerical attribute x which is dependent on one or more categorical attributes A, B (maybe in addition to a dependency on a combination of numerical attributes y, z which is already considered in the conditional sequence by the Cholesky decomposition), we simply replace the corresponding original expression p(x|y, z) in the chain by p(x|A, B, y, z). Note that this term is a shortcut for the interpretable formula:

x = \begin{cases} \alpha_{1,1}\, y + \beta_{1,1}\, z + \epsilon_{1,1} & \text{if } A = a_1 \wedge B = b_1, \\ \alpha_{1,2}\, y + \beta_{1,2}\, z + \epsilon_{1,2} & \text{if } A = a_1 \wedge B = b_2, \\ \quad \vdots \end{cases}

where \alpha_{i,j} and \beta_{i,j} are constants and \epsilon_{i,j} \sim N_{\mu,\sigma^2} is a normally distributed error term. Obviously, such formulas are the better interpretable, the fewer attributes are present in a super-attribute. We will show in the next section how to decide which attribute combinations should be regarded as independent in order to improve the overall clustering result and obtain a good trade-off between data fit and interpretability.
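To make the case-by-case shortcut concrete, here is a small Python sketch that evaluates such a model for one numerical target x depending on a categorical pair (A, B) and two numerical attributes y, z. The coefficients and noise level are invented placeholders:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical coefficients (alpha, beta) per category combination (A, B).
    coeffs = {
        ("a1", "b1"): (1.2, -0.4),
        ("a1", "b2"): (0.7,  0.9),
        ("a2", "b1"): (0.1,  1.5),
    }
    noise_std = 0.3   # standard deviation of the error term epsilon

    def draw_x(a, b, y, z):
        """x = alpha_{a,b} * y + beta_{a,b} * z + epsilon, cf. the case-by-case formula."""
        alpha, beta = coeffs[(a, b)]
        return alpha * y + beta * z + rng.normal(0.0, noise_std)

    print(draw_x("a1", "b2", y=2.0, z=-1.0))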
3. MINIMUM DESCRIPTION LENGTH
The principle of Minimum Description Length (MDL) is a well established technique to evaluate and compare different competing statistical models for the description of data, and for tuning the parameters of these models [26]. MDL optimizes for a good fit of the data to the model, formalized by the entropy of the data given the model, while punishing the model complexity. This is done by relating the problem of model selection to that of data compression: by Huffman coding, objects are coded with short codewords if they are, according to the model, in more likely areas of the (discrete or continuous) data space. In contrast, less likely objects are coded by longer codewords, which results in an average codeword length that is equivalent to the entropy of the data. If the model has parameters such as probabilities, mean vectors, or covariance matrices, these parameters have to be included in the compression cost as a "code book", required to be able to decompress the coded data again.

MDL for Data Compression
The coding costs for a categorical attribute A are provided by the entropy H(A), and for a numerical attribute x by the differential entropy h(x):

H(A) = - \sum_{a \in A} P(A = a) \log_2(P(A = a)),    (3)

h(x) = - \int_{\mathbb{R}^d} p(x) \log_2 p(x)\, dx    (4)

     = \frac{1}{2} \log_2\big((2\pi e)^d |\Sigma|\big) \quad \text{for } p(x) = N_{\mu,\Sigma}(x).    (5)

Both equations are also valid if a super-attribute is actually composed of two or more single categorical or numerical attributes, respectively. The summation of Eq. (3) is in this case over all combinations of values of the respective attributes. If the statistical model consists of clusters, each represented by a separate, complex model, then the MDL consists of the entropy of the cluster assignment (- \sum_{1 \le i \le k} w_i \cdot \log_2 w_i, also called the cluster-ID cost of the model), followed by the average of the entropies and differential entropies of all clusters, weighted by w_i = |C_i|/n.

Every parameter of the model is usually accounted for by \frac{1}{2} \log_2 n bits. Here, the number of bits depends on the overall number of objects to be described by the model. This corresponds to the observation that parameters should be coded with higher accuracy if a higher number of objects is coded. For fewer objects, a less accurate knowledge of the parameters is sufficient. The term \frac{1}{2} \log_2 n again represents an optimal balance of coding accuracy and data fit.
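A minimal Python sketch of these two coding-cost terms, assuming a categorical attribute given as a list of values and a numerical attribute given as a sample matrix (Eq. (5) is evaluated with the empirical covariance):

    import numpy as np
    from collections import Counter

    def categorical_entropy(values):
        """H(A) in bits, cf. Eq. (3)."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def gaussian_differential_entropy(X):
        """h(x) = 0.5 * log2((2*pi*e)^d |Sigma|) for a Gaussian model, cf. Eq. (5)."""
        X = np.atleast_2d(X)
        d = X.shape[1]
        Sigma = np.cov(X, rowvar=False).reshape(d, d)
        return 0.5 * np.log2((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

    print(categorical_entropy(["a1", "a1", "a2", "a3"]))
    print(gaussian_differential_entropy(np.random.default_rng(1).normal(size=(500, 2))))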
MDL for Dependent Mixed Attributes
Let us now consider a super-attribute S composed of one or more numerical attributes x = (x_{i1}, x_{i2}, ..., x_{id'})^T and one or more categorical attributes A = (A_{j1}, A_{j2}, ..., A_{jc'}), where c' is the number of categorical attributes and d' the number of numerical attributes (the dimension) in S. It is natural to consider the differential entropy of the common probability according to Equation (1):

h(A, x) = - \sum_{a \in A} \int_{\mathbb{R}^{d'}} p(A, x) \log_2 p(A, x)\, dx
        = - \sum_{a \in A} \int_{\mathbb{R}^{d'}} P(a) \cdot N_{\mu,\Sigma}(x) \cdot \log_2\big(P(a) \cdot N_{\mu,\Sigma}(x)\big)\, dx
        = - \sum_{a \in A} P(a) \log_2 P(a) + \sum_{a \in A} P(a) \cdot \frac{1}{2}\log_2\big((2\pi e)^{d'} |\Sigma|\big).    (6)

The entropy of every super-attribute can be determined using Equation (6). The number of parameters of a super-attribute having c' categorical attributes A = {A_{i1}, ..., A_{ic'}} and d' numerical attributes is determined as follows: First, the overall number of possible combinations of categories of this super-attribute must be determined by m = |A| = \prod_{1 \le j \le c'} m_{ij}. If the super-attribute has no categorical attributes, we set by default m = 1. For each of these combinations, the probability must be stored as one parameter. In addition, for each of the category combinations, one separate d'-variate Gaussian must be stored with d' components of the mean vector \mu and potentially \frac{1}{2} d' \cdot (d' + 1) entries of the decomposed covariance matrix L. The overall number of parameters therefore sums up to m \cdot (1 + d' + \frac{1}{2} d' \cdot (d' + 1)). The number of bits required for one cluster, including code book and data coding, corresponds to:

MDL(S) = |C_i| \cdot h(A, x) + \Big(1 + d' + \frac{1}{2} d' \cdot (d' + 1)\Big) \cdot m \cdot \frac{1}{2} \log_2 n.
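A sketch of this cost term in Python, reusing the style of the entropy helpers above and assuming the super-attribute is given as a list of categorical value tuples plus a numerical sample matrix for one cluster (all names are mine, not from the paper):

    import numpy as np
    from collections import Counter

    def mdl_super_attribute(cat_tuples, X, n_total):
        """MDL(S) = |C_i| * h(A, x) + (1 + d' + d'(d'+1)/2) * m * 0.5 * log2(n),
        estimated from the objects of one cluster restricted to super-attribute S."""
        n_cluster, d = X.shape
        groups = Counter(cat_tuples)                      # observed category combinations
        m = len(groups)
        probs = np.array([count / n_cluster for count in groups.values()])

        # h(A, x): categorical part plus probability-weighted Gaussian differential entropies.
        h = float(-(probs * np.log2(probs)).sum())
        for (combo, count), p in zip(groups.items(), probs):
            rows = [i for i, t in enumerate(cat_tuples) if t == combo]
            # Singleton groups fall back to a unit covariance in this sketch.
            Sigma = np.cov(X[rows], rowvar=False).reshape(d, d) if len(rows) > 1 else np.eye(d)
            h += p * 0.5 * np.log2((2 * np.pi * np.e) ** d * abs(np.linalg.det(Sigma)))

        n_params = m * (1 + d + d * (d + 1) / 2)
        return n_cluster * h + n_params * 0.5 * np.log2(n_total)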
Grouping Attributes into Super-attributes
Our algorithm starts with a number (c + d) of trivial super-attributes, each consisting of one attribute, and then merges in each step the best pair of old super-attributes into a new one. This is repeated until no more improvement is possible. Technically, we store during the run of the algorithm one vector of super-attributes representing the current state and one matrix containing all possible pairs of super-attributes to be formed from the current state. For each of the entries of the current state vector and the future state matrix, we store the value MDL(S). In each step of the algorithm, the pair (S, T) of super-attributes is merged which yields the maximum cost saving, i.e. for which MDL(S) + MDL(T) - MDL(S ∪ T) is maximal. Then, the entries corresponding to S and T in the current state vector and the corresponding rows and columns in the future state matrix are deleted, and the element (S ∪ T) is moved from the future state matrix to the current state vector. Finally, a new column (and row) of the symmetric future state matrix is determined by tentatively merging the new super-attribute with every other super-attribute in the current state vector. The whole procedure is repeated until no more MDL improvement is possible.

The example in Figure 2 starts on the left side with trivial super-attributes, each consisting of a single numerical or categorical attribute. The current state vector is {(A), (B), (x), (y), (z)}. All 2-combinations of these attributes are tentatively formed and stored in the upper triangular future matrix, as indicated in the diagram. In Figure 2 it is assumed that merging B and y (the marked entry of the future matrix) pays off best. Therefore, in the next step, the corresponding rows and columns are removed, a new row and column with super-attribute (By) is appended, and the corresponding new row and column is computed using tentative mergers into possible super-attributes {(ABy), (Bxy), (Byz)}. This is repeated until no more improvement is possible or all attributes are grouped into a common super-attribute.
Figure 2: Grouping of Super-attributes.
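The following Python sketch outlines this greedy bottom-up grouping under the assumption that a cost function such as mdl_super_attribute above is available; the pairwise bookkeeping with a "future state matrix" is simplified here to recomputing candidate merges in each step:

    def group_super_attributes(attributes, mdl_of):
        """Greedy merging of super-attributes (sets of attribute names).
        mdl_of(frozenset_of_attributes) -> coding cost MDL(S)."""
        state = [frozenset([a]) for a in attributes]      # trivial super-attributes
        while len(state) > 1:
            best_saving, best_pair = 0.0, None
            for i in range(len(state)):
                for j in range(i + 1, len(state)):
                    s, t = state[i], state[j]
                    saving = mdl_of(s) + mdl_of(t) - mdl_of(s | t)
                    if saving > best_saving:
                        best_saving, best_pair = saving, (s, t)
            if best_pair is None:                          # no merge reduces the MDL
                break
            s, t = best_pair
            state = [u for u in state if u not in (s, t)] + [s | t]
        return state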
Optimization Goal for Clustering
For each cluster C_i, 1 ≤ i ≤ k, an individual grouping S_i = {S_{i,1}, S_{i,2}, ...} of attributes into super-attributes is determined according to the algorithm proposed in the previous paragraph. The overall MDL of cluster C_i corresponds to:

MDL(C_i) = -n \cdot w_i \log_2(w_i) + \sum_{S_{i,j} \in S_i} MDL(S_{i,j}).    (7)

The complete database requires the following code length:

MDL(DB) = \sum_{1 \le i \le k} MDL(C_i).

The overall optimization goal of our clustering method is the minimization of MDL(DB), that means to find a complete and disjoint partitioning of DB into a set of clusters {C_1, ..., C_k} with DB = C_1 ∪ ... ∪ C_k and C_i ∩ C_j = ∅ ∀ i ≠ j such that

\{C_1, ..., C_k\} = \operatorname*{argmin}_{k \in \mathbb{N},\; C_1, ..., C_k \subseteq DB} \sum_{1 \le i \le k} MDL(C_i),    (8)

where k is usually not a user-provided parameter but, instead, also part of the optimization process. If the user intentionally wants to perform the clustering method with a certain, pre-selected number of clusters k, this is possible as well.

4. THE ALGORITHM INCONCO
The extended Cholesky decomposition proposed in Sections 2 and 3 to enhance interpretability and to handle mixed-type attributes (including the conditional sequences and the grouping of attributes into super-attributes) can in principle be used as a stand-alone technique to analyze heterogeneous data (without clustering, or as a postprocessing step after clustering). However, we integrated the extended Cholesky decomposition tightly into a clustering algorithm to perform a targeted search for interpretable mixed-type clusters. Our bisecting top-down paradigm is fully automatic and guarantees runtime efficiency. We first describe the algorithm k-INCONCO, in which the number k of clusters to be searched for is a parameter. Later, this algorithm will be merely used as a building block for the more general algorithm INCONCO which repeatedly applies k-INCONCO with k = 2 in a bisecting fashion. The algorithm k-INCONCO is an iterative algorithm performing the two steps, assignment of points to clusters and model computation, interchangeably until convergence. The model computation is essentially the grouping of attributes into super-attributes, as described in Section 3. During the attribute grouping, our extended Cholesky decomposition of the covariance matrix is computed, and the model is determined in the form of a sequence of conditional probabilities including numerical and categorical terms. The initialization is done by sampling k objects from DB, followed by a first assignment of the database objects to the cluster representatives using all attributes as singleton super-attributes. From the obtained clusters, the super-attributes are determined as described in Section 3.

In the assignment step, the models of each cluster are evaluated for every object o ∈ DB. The object o is assigned to that cluster in which it has the least MDL cost, i.e. for which it has the highest log-likelihood. Thus, also in the assignment step, possible dependencies between numerical and categorical attributes are considered. Therefore, the algorithm performs a targeted search for clusters exhibiting interpretable dependencies between attributes of the same or different kind.
The pseudocode of k-INCONCO is defined in Figure 3. INCONCO (a top-down partitioning algorithm) is a version of our method which does not require specifying the number k of clusters to be searched for in advance, and is thus fully automatic. INCONCO starts by assuming the complete database in one cluster. Then, iteratively, one cluster is selected and split up into two using the algorithm k-INCONCO with k = 2. Our method always selects that cluster for which the overall MDL profits most. INCONCO is also defined in Figure 3.

algorithm INCONCO (): set of clusters
    Cluster C := k-INCONCO (1);
    return PARTITION-REC (C);

algorithm k-INCONCO (k): set of k clusters
    {C1, ..., Ck} := INITIALIZATION (k);
    repeat
        assign every object to Ci with minimum coding cost;
        determine the super-attributes according to Section 3;
    until convergence;
    return {C1, ..., Ck};

procedure INITIALIZATION (k): set of k clusters
    resultMDL := ∞;
    for i := 1 to 10 do
        select k different objects from DB;
        assign the remaining objects according to MDL;
        determine the super-attributes according to Section 3;
        re-assign the objects according to the super-attributes;
        if MDL(DB) ≤ resultMDL then
            resultMDL := MDL(DB);
            store clusters and super-attributes in {C1, ..., Ck};
    return {C1, ..., Ck};

procedure PARTITION-REC (Cluster C): set of clusters
    {CL, CR} := k-INCONCO (2);
    if MDL(CL) + MDL(CR) ≥ MDL(C) then
        return {C};
    else
        return PARTITION-REC (CL) ∪ PARTITION-REC (CR);

Figure 3: The Algorithm INCONCO.

INCONCO clusters the data set in a bisecting style, starting with the complete data set and repeatedly partitioning it by calling the method k-INCONCO (with k = 2). Being an iterative method, we have to ensure that k-INCONCO converges and finds at least a local optimum of Eq. (7).

LEMMA 1 (MONOTONICITY OF MDL). In the iterations of k-INCONCO, Eq. (7) is a monotonically non-increasing function.

PROOF. k-INCONCO inherits from K-means the property that both the re-assignment of objects to clusters and the re-determination of the cluster model can never increase the MDL. This is obvious in the re-assignment step because an object is assigned to that cluster in which it has the least cost. It is also true for the re-determination of the model: The previous model (i.e. the set of super-attributes determined in the last iteration of k-INCONCO) of a certain cluster is always part of the solution space. If no better set of super-attributes can be found in the current iteration, we use the same super-attributes as in the last iteration. Of course, the contingency table(s) of the categorical attributes and the means and variances of the numerical attributes may be updated, but like in K-means, these updates can never decrease the log-likelihood and thus can never increase the MDL. If a better set of super-attributes can be found (with lower MDL cost), then the MDL has been decreased as well. Since both steps are monotonic, Eq. (7) is also a monotonically non-increasing function.

From the strict decrease of the MDL criterion it follows that (1) the algorithm converges (as stated in the following lemma), and (2) that the algorithm performs a steepest descent towards a local minimum of the cost function. Like in K-means and many similar clustering algorithms, it cannot be guaranteed that the local optimum also corresponds to the global optimum. As usual, we perform a number of runs of the overall algorithm (i.e. INCONCO) with different initializations in order to explore different parts of the solution space. We can easily decide according to the MDL criterion which of the runs has the best overall performance.

LEMMA 2 (CONVERGENCE OF k-INCONCO). The algorithm k-INCONCO converges after a finite number of iterations.

PROOF. From Lemma 1 it follows that (1) the MDL criterion is monotonically non-increasing. Being a compression-based criterion, it is obvious that (2) there exists a lower limit of the MDL since no infinite compression ratio is possible. Taking (1) and (2) together, k-INCONCO must converge after a finite number of steps.

5. EXPERIMENTS
In this section, we provide an extensive experimental evaluation comparing INCONCO to K-means [14], K-modes [8] and K-means-mixed [3]. K-means and K-modes serve as baselines and K-means-mixed is a state-of-the-art technique for mixed data clustering. All experiments have been performed on a workstation equipped with a 2.4 GHz CPU and 3 GB of main memory. We obtained the Java code of K-means-mixed from the authors and implemented all other algorithms in Java. For the comparison in effectiveness, we report the Normalized Mutual Information (NMI) between class labels and cluster IDs [28]. NMI scales between 0 and 1, where NMI = 1 represents the perfect clustering and NMI = 0 means a completely random clustering, i.e. no dependency between the cluster IDs assigned by the algorithm and the ground truth. Moreover, we report precision and recall. We report the best result of 10 random initializations. With MDL, INCONCO provides an internal clustering quality measure which is part of our method; therefore, selecting the best result is easy. In all experiments, we selected the result of INCONCO providing the best data compression, which is the result with the smallest MDL. For the comparison methods on synthetic data, we selected the result having the highest NMI with the ground truth.
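The NMI computation itself is standard; for instance, with scikit-learn (shown here only as an illustration of the evaluation protocol, not as the authors' code):

    from sklearn.metrics import normalized_mutual_info_score

    # Hypothetical ground-truth class labels and cluster IDs for six objects.
    class_labels = [0, 0, 0, 1, 1, 1]
    cluster_ids  = [1, 1, 0, 0, 0, 0]

    print(normalized_mutual_info_score(class_labels, cluster_ids))  # value in [0, 1]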
Proof of Concept
Figure 4: Mixed-type data set consisting of two clusters, represented by two numerical attributes (x, y) and two categorical attributes (color = {blue, red, green}; symbol = {□, ◇}).

To provide an objective comparison in effectiveness, we constructed a synthetic example data set consisting of numerical and categorical attributes with cluster-specific patterns of attribute dependencies. The data set displayed in Figure 4 is composed of two clusters C_1, C_2, each consisting of 1,000 objects. Especially the categorical attribute symbol seems to be very useful for cluster separation: in cluster C_1 we observe 89% □ and only 11% ◇; conversely, in C_2, 88% of the objects have the symbol ◇ and there are only 12% □ (cf. left and right panel of Figure 4). Moreover, the distribution of the numerical attributes supports the fact that both clusters have been generated by different mechanisms (cf. Figure 4 (middle)). However, considering each attribute separately reveals only parts of the ground truth. For each cluster, we generated a unique pattern of attribute dependencies, illustrated in Figure 5. Especially, in cluster C_1 the values of both categorical attributes depend on the numerical x-coordinate, see Figure 5(a).
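A data set with this kind of cluster-specific dependency structure can be generated along the following lines; the concrete probabilities, means and coefficients below are illustrative placeholders and not the exact parameters used for Figure 4:

    import numpy as np

    rng = np.random.default_rng(42)

    def sample_cluster_c1(n):
        """Cluster 1: symbol and color depend on x; the mean of x is set per category pair."""
        rows = []
        for _ in range(n):
            symbol = rng.choice(["box", "diamond"], p=[0.89, 0.11])
            color = rng.choice(["blue", "red", "green"])
            mu_x = {"box": 35.0, "diamond": 30.7}[symbol] + {"blue": 0.0, "red": 3.0, "green": 6.0}[color]
            x = rng.normal(mu_x, 1.0)
            y = rng.normal(0.0, 1.0)            # y kept independent of x in this placeholder
            rows.append((x, y, color, symbol))
        return rows

    def sample_cluster_c2(n):
        """Cluster 2: x linearly depends on y (mu(x) = 0.9 * y); symbols mostly diamonds."""
        rows = []
        for _ in range(n):
            symbol = rng.choice(["box", "diamond"], p=[0.12, 0.88])
            color = rng.choice(["blue", "red", "green"])
            y = rng.normal(10.0, 2.0)
            x = rng.normal(0.9 * y, 1.0)
            rows.append((x, y, color, symbol))
        return rows

    data = sample_cluster_c1(1000) + sample_cluster_c2(1000)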
                   wrongly     Prec.   Rec.    Prec.   Rec.    NMI
                   clustered   C1 %    C1 %    C2 %    C2 %
    INCONCO            1       100     99.9    99.9    100     0.99
    K-modes          209       93.9    84.6    85.9    94.5    0.52
      discretized    661       70.5    57.8    64.3    76.1    0.1
    K-means          348       100     65.2    74.2    100     0.44
      binarized      326       93.3    84.6    85.9    94.5    0.37
    K-means-mixed    225       89      88.4    88.4    89.1    0.49

Table 1: Results on Synthetic Data.

Table 1 summarizes the results of INCONCO and the comparison methods. As the only technique supporting mixed-type attribute dependencies, INCONCO perfectly clusters this data set. Only one of the 2,000 objects is wrongly clustered. This clustering has an NMI of 0.99 and precision and recall above 99% for both clusters. Besides partitioning the objects into clusters, the result of INCONCO includes a model for each cluster concisely describing the attribute dependencies using case-by-case analysis. In cluster C_1 the attributes symbol, color and x are grouped into a common super-attribute and the complex three-fold dependency is modeled by the following case-by-case analysis:

    μ(x) = 35.0 if symbol = □ and color = blue,
           37.9 if symbol = □ and color = red,
           41.0 if symbol = □ and color = green,
           30.7 if symbol = ◇ and color = blue,
           33.6 if symbol = ◇ and color = red,
           36.0 if symbol = ◇ and color = green.

In cluster C_2 the numerical attributes x and y are grouped into a common super-attribute and their strong linear dependency is discovered as follows: μ(x) = 0.9 · y.

Although the detected attribute dependencies are very valuable for interpretation, one might wonder whether this synthetic data set really requires such a complex algorithm, since the clusters seem to be separable using the categorical information alone. K-modes successfully detects the difference in the symbol distributions of the two clusters: all green and blue boxes are assigned to one cluster; all remaining objects are assigned to the other cluster. In total, 209 objects are wrongly clustered, which corresponds to an NMI of 0.52. Without considering the information of the numerical features, it is not possible to successfully cluster this example data set. Approaching the problem from the numerical side, we ran K-means on the numerical coordinates since the clusters seem to be well separable in the x-coordinate. In total, 348 objects are wrongly clustered, resulting in an NMI of 0.44. Let us note that some specialized algorithms for detecting linear and non-linear correlation clusters on numerical data have been proposed, e.g. 4C [7], ORCLUS [2] and CURLER [27]. However, these approaches are designed for numerical data only and none of them is suitable for detecting mixed-type attribute dependencies.

When mining mixed-type data, a frequently chosen option is to transform all attributes to a common type. To illustrate this, we re-ran K-modes and K-means after discretizing the numerical and binarizing the categorical attributes. This feature transformation does not improve the results of K-means and K-modes. Choosing a suitable resolution for discretization is a non-trivial task. We report the result obtained with 10 equi-width bins, which was the best choice among several trials. Since the attributes in our example are nominal, they have no natural order. Therefore, we binarized the attributes. After binarization, the categorical information has, with five attributes, more weight than the two numerical features. This weighting is artificially introduced by the feature transformation and not justified by the ground truth. Since these experiments illustrate the severe difficulties associated with feature transformation, we do not consider this option in the real data experiments.

Recently, some methods for integrative clustering of mixed-type data have been proposed, e.g. K-prototypes [17], CFIKP [30], CAVE [16] and K-means-mixed [3], see also Section 6. We selected K-means-mixed for comparison, which is a state-of-the-art technique performing metric learning to find a suitable weighting of the relative importance of numerical and categorical data in clustering. K-means-mixed assigns 225 objects to the wrong cluster, which results in an NMI of 0.49. Interestingly, K-means-mixed only clusters the data according to the attribute symbol, which is the most relevant single attribute for clustering, cf. Figure 4. To summarize, integrating numerical and categorical information as performed by K-means-mixed is not sufficient for successfully clustering our example data set. INCONCO performs best by exploiting mixed-type attribute dependencies.

Eigenvalue vs. Cholesky Decomposition
The major reason for basing our technique on the Cholesky decomposition is to enhance the interpretability of the attribute dependencies. We illustrate this on a synthetic data set of 1,000 objects and 10 numerical attributes which exhibits a strong correlation among two pairs of attributes. Figure 6(a) depicts the color-coded Eigenvectors resulting from PCA. Each attribute is expressed by a linear combination of all other attributes. Thus, the Eigenvector matrix is fully occupied and many entries differ much from zero (coded
in black color). Figure 6(b) depicts the triangular matrix resulting from the Cholesky decomposition. The interpretability is improved since, also in the occupied part of the matrix, many entries are relatively close to zero. Finally, Figure 6(c) displays the result after attribute grouping into super-attributes. Only the relevant two pairs of truly existing attribute dependencies have non-zero values. This fact is automatically transformed into an understandable formula and reported as part of the result by INCONCO.
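The effect can be reproduced with a few lines of Python on artificial data (the correlation structure below is invented for illustration): the eigenvector matrix of a covariance matrix with two correlated attribute pairs tends to be dense, while the Cholesky factor concentrates its off-diagonal mass in the rows of the dependent attributes.

    import numpy as np

    rng = np.random.default_rng(7)

    # 10 attributes; attribute 1 depends on 0, attribute 5 depends on 4 (two correlated pairs).
    X = rng.normal(size=(1000, 10))
    X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=1000)
    X[:, 5] = 0.8 * X[:, 4] + 0.2 * rng.normal(size=1000)

    Sigma = np.cov(X, rowvar=False)
    eigvecs = np.linalg.eigh(Sigma)[1]          # eigenvector matrix (PCA rotation)
    L = np.linalg.cholesky(Sigma)               # lower triangular Cholesky factor

    def occupancy(M, tol=0.05):
        """Fraction of entries with absolute value above tol."""
        return np.mean(np.abs(M) > tol)

    print("PCA eigenvector occupancy:", occupancy(eigvecs))
    print("Cholesky factor occupancy:", occupancy(L))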
Mixed-type Attribute Dependencies in Real Data
Dermatology Data. The dermatology data on erythemato-squamous diseases (http://archive.ics.uci.edu/ml/datasets/Dermatology) with 358 instances contains clinical and histopathological information measured by 33 categorical attributes and the age of the subject as a numerical attribute. INCONCO detects meaningful class-pure clusters in this data set and reveals interesting attribute dependencies. The result of INCONCO, with an NMI of 0.93, consists of 6 clusters, each of which represents a certain disease with high precision and recall, see Table 2.

The largest cluster consists of 112 subjects and represents the group of subjects suffering from psoriasis (class 1 in Table 2) with a precision of 99.1% and a recall of 100%. In this cluster, INCONCO reveals an interesting dependency between the categorical attributes itching and Koebner Phenomenon, both having 4 categories representing the intensity of the symptom. The Koebner Phenomenon is a reaction to skin injury often seen in psoriasis patients, and it perfectly makes sense that its intensity is strongly related to the intensity of itching. This dependency can only be observed for the psoriasis cluster.

Another interesting attribute dependency is observed in a cluster with 20 objects representing the disease pityriasis rubra pilaris (class 6, with precision and recall of 100%). In this cluster, the categorical attribute family history and the numerical attribute age depend on each other:

    μ(age) =  9.3 if family history = true,
             11.2 if family history = false.

The binary attribute family history indicates whether the disease has been observed in the family. In this cluster, patients with a family history are on average two years younger than patients without a family history. Probably, the disease is diagnosed earlier in young patients with a family history since they are sent to medical care by their parents. Also this dependency is cluster-specific.

Parameterized to find K = 6 clusters, K-modes and K-means do not perform well on this data set. The best result of K-means has an NMI of only 0.08. This is not surprising, since we have only one numerical attribute. K-modes performs slightly better with an NMI of 0.24. With an NMI of 0.86, K-means-mixed is the best comparison method. Interestingly, although the attribute age is the only numerical attribute, integrating numerical and categorical information is beneficial. Like INCONCO, K-means-mixed perfectly identifies the diseases lichen planus (class 3) and pityriasis rubra pilaris (class 6) by creating class-pure clusters. However, the diseases seboreic dermatitis (class 2) and pityriasis rosea (class 4) are assigned to a common cluster, resulting in a low precision for these classes, cf. Table 2. INCONCO creates separate clusters for these diseases and assigns most objects correctly (precision and recall for classes 2 and 4 are above 75%). Moreover, INCONCO much better identifies the largest group of psoriasis (precision and recall of class 1 above 99%). To summarize, INCONCO automatically detects the true number of clusters in this data set, reveals interesting novel information on cluster-specific attribute dependencies and outperforms the comparison methods in clustering accuracy.

    Algorithm / Class      1       2       3       4       5       6
    INCONCO
      Precision %         99.1    97.8   100.0    78.3   100.0   100.0
      Recall %           100.0    76.7   100.0    97.9   100.0   100.0
    K-means
      Precision %         30.3    21.3    31.3    18.9    13.9    40.4
      Recall %            21.6    26.7    28.2    31.2    22.9    95.0
    K-modes
      Precision %        100.0    20.6    29.7    17.7    34.8     1.0
      Recall %            52.7    59.0    72.2    63.2    28.8    85.0
    K-means-mixed
      Precision %        100.0    55.5   100.0    44.4    94.1   100.0
      Recall %            67.6   100.0   100.0   100.0   100.0   100.0

Table 2: Results on Dermatology Data. The classes correspond to: 1 psoriasis, 2 seboreic dermatitis, 3 lichen planus, 4 pityriasis rosea, 5 cronic dermatitis, 6 pityriasis rubra pilaris.

Automobile Data. The automobile data set (http://archive.ics.uci.edu/ml/datasets/Automobile) consists of 193 instances which are described by 15 numerical and 10 categorical attributes. The numerical attribute normalized-losses and 7 instances have been removed from the original data set due to too many missing values. The result of INCONCO consists of 7 clusters, each characterized by a unique pattern of attribute dependencies.

The clusters capture well the different segments of the automotive market. Cluster 1 is composed of 20 large, relatively cheap cars mostly manufactured by Nissan and Toyota (scaled average price: 0.37). Cluster 2 is composed of 25 cheap small cars, 22 of which have 2 doors only (average price: 0.18). Cluster 3 is composed of 24 expensive, very large cars mostly from the brands Mercedes and Volvo (average price: 0.88). Cluster 4 consists of 22 high-quality small cars, manufactured e.g. by Audi and Saab; 15 of these 22 cars have only two doors but are relatively expensive in comparison to cluster 2 (average price: 0.84). Cluster 5 represents 24 upper middle class cars (average price: 0.74). Cluster 6 represents 52 middle class cars of medium size and price. As indicated by positive values in the attribute symboling, these cars tend to best preserve their value over time. Cluster 7 represents 16 premium cars manufactured by BMW, Porsche, Jaguar and Mercedes (average price: 0.97).

Figure 7 visualizes the pair-wise attribute dependencies of each cluster. Pairwise dependencies among numerical, categorical and mixed (i.e. numerical-categorical) attributes are displayed in different colors (in the same order of attributes for each cluster). We can observe that (1) INCONCO detects unique cluster-specific attribute dependency patterns; (2) many interesting dependencies involving mixed types are detected; and (3) the dependency patterns are sparse, which is essential for interpretability. Let us focus on cluster 5, exhibiting a variety of attribute dependencies of all types. Note that Figure 7 displays the pair-wise dependencies only and does not visualize the grouping of attributes into super-attributes. In cluster 5, consisting of 24 data objects, 6 non-trivial super-attributes consisting of more than one attribute have been detected. One of the super-attributes describes a dependency between the numerical attributes engine size, horsepower and the binary attribute aspiration. INCONCO describes the well-known fact that with turbo aspiration larger horsepower is possible with a smaller engine size by the following formula:

    μ(engine size) = 0.86 · horsepower if aspiration = standard,
                     0.77 · horsepower if aspiration = turbo.

Another 3-fold dependency is detected between the binary attribute fuel type and the numerical attributes compression ratio and peak-rpm:

    μ(peak-rpm) = 0.58 · compression ratio if fuel type = diesel,
                  2.45 · compression ratio if fuel type = gas.
Figure 7: INCONCO Clustering of Automobile Data: Cluster-specific Attribute Dependency Patterns.
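Formulas of this kind are ordinary per-category least-squares fits; the following sketch shows how such coefficients could be estimated from hypothetical automobile records (column names follow the UCI attribute names, the numbers are invented):

    import numpy as np

    # (aspiration, horsepower, engine_size) -- invented sample records.
    records = [
        ("standard", 110.0,  95.0), ("standard", 140.0, 120.0), ("standard", 160.0, 138.0),
        ("turbo",    140.0, 108.0), ("turbo",    175.0, 135.0), ("turbo",    200.0, 154.0),
    ]

    def slope_per_category(records):
        """Least-squares slope of engine size on horsepower, separately per aspiration value."""
        slopes = {}
        for category in {r[0] for r in records}:
            hp   = np.array([r[1] for r in records if r[0] == category])
            size = np.array([r[2] for r in records if r[0] == category])
            slopes[category] = float(hp @ size / (hp @ hp))   # no-intercept fit: size ~ slope * hp
        return slopes

    print(slope_per_category(records))   # e.g. {'standard': ..., 'turbo': ...}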
By optimizing the objective function, the attribute significance is dynamically adjusted.

Discovering attribute dependency patterns in clustering mixed-type data is a research question which has not been addressed so far. Our approach INCONCO is a solution inspired by ideas for interpretable correlation clustering [1] and clustering by data compression [26].

7. CONCLUSION
Our algorithm INCONCO addresses major challenges in clustering: INCONCO clusters mixed-type data consisting of numerical and categorical attributes. INCONCO clusters by revealing cluster-specific dependency patterns among the attributes. Dependency patterns are discovered by integrating several numerical and categorical attributes into super-attributes which represent the semantic concepts of the cluster. To facilitate interpretation of the clustering result, we express dependency patterns by conditional probabilities of the original attributes. To further facilitate the practical application and interpretability, we base the clustering objective function of INCONCO on the Minimum Description Length principle, relating clustering to data compression. Supported by MDL, INCONCO detects the truly relevant clusters and attribute dependency patterns in the data.

8. REFERENCES
[1] E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek. Deriving quantitative models for correlation clusters. In KDD Conference, pages 4–13, 2006.
[2] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In SIGMOD Conference, pages 70–81, 2000.
[3] A. Ahmad and L. Dey. A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng., 63(2):503–527, 2007.
[4] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable clustering of categorical data. In EDBT Conference, pages 123–146, 2004.
[5] C. Böhm, C. Faloutsos, and C. Plant. Outlier-robust clustering using independent components. In SIGMOD Conference, pages 185–198, 2008.
[6] C. Böhm, S. Goebl, A. Oswald, C. Plant, M. Plavinski, and B. Wackersreuther. Integrative parameter-free clustering of data with mixed type attributes. In PAKDD, pages 38–47, 2010.
[7] C. Böhm, K. Kailing, P. Kröger, and A. Zimek. Computing clusters of correlation connected objects. In SIGMOD Conference, pages 455–466, 2004.
[8] A. Chaturvedi, P. E. Green, and J. D. Caroll. K-modes clustering. Journal of Classification, 18(1).
[9] T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 1996.
[10] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD Conference, pages 89–98. ACM Press, 2003.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD Conference, 1996.
[12] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. The VLDB Journal, 8(3-4):222–236, 2000.
[13] G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.
[14] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.
[15] Z. He, X. Xu, and S. Deng. Clustering mixed numeric and categorical data: A cluster ensemble approach. CoRR, abs/cs/0509011, 2005.
[16] C.-C. Hsu and Y.-C. Chen. Mining of mixed data with application to catalog marketing. Expert Syst. Appl., 32(1):12–23, 2007.
[17] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304, 1998.
[18] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data, 3:1:1–1:58, March 2009.
[19] J. B. Macqueen. Some methods of classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[20] G. Moise and J. Sander. Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In KDD Conference, pages 533–541, 2008.
[21] E. Müller, I. Assent, S. Günnemann, R. Krieger, and T. Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In ICDM Conference, pages 377–386, 2009.
[22] E. Müller, I. Assent, R. Krieger, T. Jansen, and T. Seidl. Morpheus: interactive exploration of subspace clustering. In KDD Conference, pages 1089–1092, 2008.
[23] Y. Okiyama, T. Nakano, K. Yamashita, Y. Mochizuki, N. Taguchi, and S. Tanaka. Acceleration of fragment molecular orbital calculations with Cholesky decomposition approach. Chemical Physics Letters, 490(1-3):84–89, 2010.
[24] D. Pelleg and A. Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In ICML Conference, pages 727–734, 2000.
[25] G. Piatetsky-Shapiro, C. Djeraba, L. Getoor, R. Grossman, R. Feldman, and M. J. Zaki. What are the grand challenges for data mining? KDD 2006 panel report. SIGKDD Explorations, 8(2):70–77, 2006.
[26] J. Rissanen. Information and Complexity in Statistical Modeling. Springer, 2007.
[27] A. K. Tung, X. Xu, and B. C. Ooi. CURLER: Finding and visualizing nonlinear correlation clusters. In SIGMOD Conference, pages 467–478, 2005.
[28] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML Conference, pages 1073–1080, 2009.
[29] Q. Yang and X. Wu. 10 challenging problems in data mining research. IJITDM, 5(4):597–604, 2006.
[30] J. Yin and Z. Tan. Clustering mixed type attributes in large dataset. In ISPA Conference, pages 655–661, 2005.
[31] M. J. Zaki, M. Peters, I. Assent, and T. Seidl. Clicks: An effective algorithm for mining subspace clusters in categorical datasets. Data Knowl. Eng., 60(1):51–70, 2007.
[32] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In SIGMOD Conference, pages 103–114, 1996.