
DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation

Alexander Hinneburg (1) and Hans-Henning Gabriel (2)

(1) Institute of Computer Science,
    Martin-Luther-University Halle-Wittenberg, Germany
    [email protected]
(2) Otto-von-Guericke-University Magdeburg, Germany
    [email protected]

Abstract. The Denclue algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. A disadvantage of Denclue 1.0 is that the hill climbing used there may make unnecessarily small steps in the beginning and never converges exactly to the maximum, it just comes close.
We introduce a new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling based methods while sacrificing only a small amount of accuracy.

1 Introduction
Clustering can be formulated in many different ways. Non-parametric methods
are well suited for exploring clusters, because no generative model of the data is
assumed. Instead, the probability density in the data space is directly estimated
from data instances. Kernel density estimation [15,14] is a principled way of doing
that task. There are several clustering algorithms, which exploit the adaptive
nature of a kernel density estimate. Examples are the algorithms by Schnell
[13] and Fukunaga [5], which use the gradient of the estimated density function.
The algorithms are also described in the books by Bock [3] and Fukunaga [4],
respectively. The Denclue framework for clustering [7,8] builds upon Schnell's
algorithm. There, clusters are defined by local maxima of the density estimate.
Data points are assigned to local maxima by hill climbing. Those points which
are assigned to the same local maximum are put into a single cluster.
However, the algorithms use only the directional information of the gradient;
the step size remains fixed throughout the hill climbing. This implies certain
disadvantages, namely that the hill climbing does not converge towards the local
maximum, it just comes close, and that the number of iteration steps may be large
due to many unnecessarily small steps in the beginning. The step size could be
heuristically adjusted by probing the density function at several positions in the
direction of the gradient. As the computation of the density function is relatively
costly, such a method involves extra costs for step size adjustment, which are
not guaranteed to be compensated by fewer iterations.
The contribution of this article is a new hill climbing method for kernel
density estimates with Gaussian kernels. The new method adjusts the step size
automatically at no additional cost and converges towards a local maximum. We
prove this by casting the hill climbing as a special case of the expectation
maximization algorithm. Depending on the convergence criterion, the new method
needs fewer iterations than fixed step size methods. Since the new hill climbing
can be seen as an EM algorithm, general acceleration methods for EM, like sparse
EM [11], can be used as well. We also explore acceleration by sampling. Fast
density estimation [17] could be combined with our method as well but is not
tested in this first study.
Other density-based clustering methods besides Denclue, which would benefit
from the new hill climbing, have been proposed by Herbin et al. [6]. Variants
of density-based clustering are Dbscan [12], Optics [1], and follow-up versions,
which, however, do not use a probabilistic framework. This lack of foundation
prevents the application of our new method there.
Related approaches include fuzzy c-means [2], which optimizes the location
of cluster centers and uses membership functions in a similar way as kernel
functions are used by Denclue. A subtle difference between fuzzy c-means and
Denclue is that in c-means the membership grades of a point belonging to
a cluster are normalized, s.t. the weights of a single data point for all clusters
sum to one. This additional restriction makes the clusters compete for data
points. Denclue does not have such a restriction. The mountain method [16] also
uses membership grades similar to those of c-means. It finds clusters by first
discretizing the data space into a grid, calculating the mountain function (which
is comparable to the density up to normalization) for all grid vertices, and
determining the grid vertex with the maximal mountain function as the center of
the dominant cluster. After the effects of the dominant cluster on the mountain
function are removed, the second dominant cluster is found. The method iterates
until the heights of the clusters drop below a predefined percentage of that of
the dominant cluster. As the number of grid vertices grows exponentially with
the dimensionality of the data space, the method is limited to low-dimensional
data. Niche clustering [10] uses a non-normalized density function as the fitness
function for prototype-based clustering in a genetic algorithm. Data points with
high density (larger than a threshold) are seen as core points, which are used to
estimate scale parameters similar to the smoothing parameter h introduced in
the next section.
The rest of the paper is structured as follows. In section 2, we briefly introduce
the old Denclue framework and in section 3 we propose our new improvements
for that framework. In section 4, we compare the old and the new hill climbing
experimentally.
Fig. 1. Kernel density estimate for one-dimensional data and different values of the
smoothing parameter h (left: h = 0.25, right: h = 0.75).

2 DENCLUE 1.0 framework for clustering

The Denclue framework [8] builds on non-parametric methods, namely kernel
density estimation. Non-parametric methods do not look for optimal parameters
of some model, but estimate desired quantities, like the probability density of
the data, directly from the data instances. This allows a more direct definition
of a clustering, in contrast to parametric methods, where a clustering corresponds
to an optimal parameter setting of some high-dimensional function. In the
Denclue framework, the probability density in the data space is estimated as a
function of all data instances x_t ∈ X ⊂ R^d, d ∈ N, t = 1, ..., N. The influences
of the data instances in the data space are modeled via a simple kernel function,
e.g. the Gaussian kernel

K(u) = (2\pi)^{-d/2} \exp\left(-\frac{\|u\|^2}{2}\right).

The sum of all kernels (with suitable normalization) gives an estimate of the
probability density at any point x in the data space,

p(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right).

The estimate p(x) inherits properties like differentiability from the original
kernel function. The quantity h > 0 specifies to what degree a data instance is
smoothed over the data space. When h is large, an instance stretches its influence
up to more distant regions. When h is small, an instance affects only the local
neighborhood. We illustrate the idea of kernel density estimation on
one-dimensional data as shown in figure 1.
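To make the estimator concrete, the following short Python sketch evaluates p(x) for a query point; the function and variable names are ours and purely illustrative, not part of any Denclue implementation:

import numpy as np

def gaussian_kde(x, data, h):
    """Evaluate the kernel density estimate p(x) with Gaussian kernels.

    x    : query point, shape (d,)
    data : data instances x_t, shape (N, d)
    h    : smoothing parameter (bandwidth), h > 0
    """
    N, d = data.shape
    u = (x - data) / h                                  # scaled differences (x - x_t) / h
    kernels = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * (u ** 2).sum(axis=1))
    return kernels.sum() / (N * h ** d)                 # 1/(N h^d) times the sum of kernels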
A clustering in the Denclue framework is defined by the local maxima of
the estimated density function. A hill climbing procedure is started for each
data instance, which assigns the instance to a local maximum. In the case of
Gaussian kernels, the hill climbing is guided by the gradient of p(x), which takes
the form

\nabla p(x) = \frac{1}{h^{d+2} N} \sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right) (x_t - x).   (1)

The hill climbing procedure starts at a data point and iterates until the density
does not grow anymore. The update formula of the iteration to proceed from
x^{(l)} to x^{(l+1)} is

x^{(l+1)} = x^{(l)} + \delta \, \frac{\nabla p(x^{(l)})}{\|\nabla p(x^{(l)})\|_2}.   (2)
Fig. 2. Example of a Denclue clustering based on a kernel density estimate and a
noise threshold ξ (the one-dimensional example contains two clusters and an outlier).

The step size δ is a small positive number. In the end, those end points of the
hill climbing iteration which are closer to each other than 2δ are considered to
belong to the same local maximum. Instances which are assigned to the same
local maximum are put into the same cluster.
A practical problem of gradient based hill climbing in general is the adaptation
of the step size. In other words, how far should one follow the direction of the
gradient? There are several general heuristics for this problem, which all need
to calculate p(x) several times to decide on a suitable step size.
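For later comparison, here is a minimal Python sketch of the fixed step size hill climbing described above. It reuses the gaussian_kde helper sketched in section 2; the concrete step size delta and the iteration cap are illustrative choices of ours, not values prescribed by Denclue 1.0:

import numpy as np

def gradient_hill_climb(x_start, data, h, delta=0.05, max_iter=1000):
    # Fixed step size gradient hill climbing (Denclue 1.0 style, sketch).
    N, d = data.shape
    x = x_start.copy()
    density = gaussian_kde(x, data, h)
    for _ in range(max_iter):
        u = (x - data) / h
        kernels = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * (u ** 2).sum(axis=1))
        grad = (kernels[:, None] * (data - x)).sum(axis=0) / (h ** (d + 2) * N)  # eq. (1)
        norm = np.linalg.norm(grad)
        if norm == 0.0:                         # already at a stationary point
            break
        x_new = x + delta * grad / norm         # eq. (2)
        new_density = gaussian_kde(x_new, data, h)
        if new_density <= density:              # stop when the density does not grow anymore
            break
        x, density = x_new, new_density
    return x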
In the presence of random noise in the data, the Denclue framework provides
an extra parameter ξ > 0, which treats all points assigned to local maxima x*
with p(x*) < ξ as outliers. Figure 2 sketches the idea of a Denclue clustering.

3 DENCLUE 2.0
In this section, we propose significant improvements of the Denclue 1.0 framework
for Gaussian kernels. Since the choice of the kernel type does not have large
effects on the results in the typical case, the restriction to Gaussian kernels is
not very serious. First, we introduce a new hill climbing procedure for Gaussian
kernels, which adjusts the step size automatically at no extra cost. The new
method really does converge towards a local maximum. We prove this property
by casting the hill climbing procedure as an instance of the expectation
maximization algorithm. Last, we propose sampling based methods to accelerate the
computation of the kernel density estimate.

3.1 Fast Hill Climbing


The goal of a hill climbing procedure is to maximize the density p(x). An alter-
native approach to gradient based hill climbing is to set the first derivative of
p(x) to zero and solve for x. Setting (1) to zero and rearranging we get
x = \frac{\sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right) x_t}{\sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right)}.   (3)
Fig. 3. (left) Gradient hill climbing as used by Denclue 1.0; (right) step size adjusting
hill climbing as used by Denclue 2.0.

Obviously, this is not a solution for x, since the vector x is still involved in the
right-hand side. Since x influences the right-hand side only through the kernel,
the idea is to compute the kernel for some fixed x and to update the vector on the
left-hand side according to formula (3). This gives a new iterative procedure with
the update formula

x^{(l+1)} = \frac{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right) x_t}{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right)}.   (4)

The update formula can be interpreted as a normalized, weighted average of
the data points, where the weights of the data points depend on the influence of
their kernels on the current x^{(l)}. In order to see that the new update formula
makes sense, it is instructive to look at the special case N = 1. In that case,
the estimated density function consists of just a single kernel and the iteration
jumps after one step to x_1, which is the maximum.
The behavior of Denclue 1.0's hill climbing and of the new hill climbing
procedure is illustrated in figure 3. The figure shows that the step size of the new
procedure is adjusted to the shape of the density function. On the other hand,
an iteration of the new procedure has the same computational cost as one of
the old gradient based hill climbing. So, adjusting the step size comes at no
additional cost. Another difference is that the hill climbing of the new method
really converges towards a local maximum, while the old method just comes
close.
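A minimal Python sketch of the new hill climbing, following update formula (4) and reusing the gaussian_kde helper from section 2; the names and the default threshold eps are illustrative choices, and the normalization constant of the kernel is omitted because it cancels in the ratio:

import numpy as np

def ssa_hill_climb(x_start, data, h, eps=1e-2, max_iter=100):
    # Step size adjusting hill climbing, update formula (4) (sketch).
    x = x_start.copy()
    density = gaussian_kde(x, data, h)
    path = [x.copy()]
    for _ in range(max_iter):
        u = (x - data) / h
        kernels = np.exp(-0.5 * (u ** 2).sum(axis=1))              # (2 pi)^(-d/2) cancels in (4)
        x = (kernels[:, None] * data).sum(axis=0) / kernels.sum()  # weighted average of the x_t
        new_density = gaussian_kde(x, data, h)
        path.append(x.copy())
        if (new_density - density) / new_density <= eps:           # relative density change small
            break
        density = new_density
    return x, path

The path of visited points is returned so that the sum of the last step sizes, needed by the cluster assignment heuristic described next, can be computed from it.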
Since the new method does not need the step size parameter δ, the assignment
of the instances to clusters is done in a new way. The problem is to define
a heuristic which automatically adjusts to the scale of distances between the
converged points.
Fig. 4. (left) Assignment to a local maximum; (right) ambiguous assignment. The
points M and M' denote the true but unknown local maxima.

A hill climbing is started at each data point x_t ∈ X and iterates until the
density does not change much anymore, i.e. until
[p(x_t^{(l)}) - p(x_t^{(l-1)})] / p(x_t^{(l)}) ≤ ε. The end point reached by the
hill climbing is denoted by x_t^* = x_t^{(l)}, and the sum of the k last step sizes is

s_t = \sum_{i=1}^{k} \|x_t^{(l-i+1)} - x_t^{(l-i)}\|_2.

The integer k is a parameter of the heuristic. We found that k = 2 worked well for
all experiments. Note that the number of iterations may vary between the data
points; however, we restricted the number of iterations to be larger than k. For
appropriate ε > 0, it is safe to assume that the end points x_t^* are close to the
respective local maxima. Typically, the step sizes shrink strongly before the
convergence criterion is met. Therefore, we assume that the true local maximum
lies within a ball around x_t^* of radius s_t. Thus, points belonging to the same
local maximum have end points x_t^* and x_{t'}^* which are closer than s_t + s_{t'}.
Figure 4 (left) illustrates that case.

However, there might exist rare cases in which such an assignment is not
unique. This happens when for three end points x_t^*, x_{t'}^*, and x_{t''}^* the
following conditions hold: ||x_t^* - x_{t'}^*|| ≤ s_t + s_{t'} and
||x_t^* - x_{t''}^*|| ≤ s_t + s_{t''}, but not ||x_{t'}^* - x_{t''}^*|| ≤ s_{t'} + s_{t''}.
In order to resolve the problem, the hill climbing is continued for all points which
are involved in such situations, until the convergence criterion is met for some
smaller ε (a simple way to reduce ε is to multiply it by a constant between zero
and one). After convergence is reached again, the ambiguous cases are rechecked.
The hill climbing is continued until all such cases are resolved. Since further
iterations cause the step sizes to shrink, the procedure will stop at some point.
The idea is illustrated in figure 4 (right).
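The following Python sketch implements the resulting assignment in a simplified form: end points are merged transitively whenever their distance is at most the sum of their ball radii, and the re-iteration with a shrinking ε for ambiguous cases is omitted for brevity. The radii s_t can be computed from the path returned by ssa_hill_climb above as the sum of the last k = 2 step lengths; all names are ours:

import numpy as np

def assign_clusters(endpoints, step_radii):
    # endpoints  : array (N, d), the end points x_t^* of the hill climbing
    # step_radii : array (N,), the radii s_t (sum of the k last step sizes)
    N = len(endpoints)
    labels = np.full(N, -1, dtype=int)
    next_label = 0
    for t in range(N):
        if labels[t] != -1:
            continue
        labels[t] = next_label
        stack = [t]
        while stack:                               # grow the cluster transitively
            i = stack.pop()
            dist = np.linalg.norm(endpoints - endpoints[i], axis=1)
            mergeable = np.where((dist <= step_radii + step_radii[i]) & (labels == -1))[0]
            labels[mergeable] = next_label
            stack.extend(mergeable.tolist())
        next_label += 1
    return labels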

However, so far it is not clear why the new hill climbing procedure converges
towards a local maximum. In the next section, we prove this claim.
3.2 Reduction to Expectation Maximization
We prove the convergence of the new hill climbing method by casting the
maximization of the density function as a special case of the expectation
maximization framework [9]. When using the Gaussian kernel, we can rewrite the
kernel density estimate p(x) in the form of a constrained mixture model with
Gaussian components,

p(x \mid \mu, \sigma) = \sum_{t=1}^{N} \pi_t \, \mathcal{N}(x \mid \mu_t, \sigma)   (5)

with the constraints π_t = 1/N, μ_t = x_t (μ denotes a vector consisting of all
concatenated μ_t), and σ = h. We can think of p(x | μ, σ) as a likelihood of x
given the model determined by μ and σ. Maximizing log p(x | μ, σ) with respect
to x is not possible in a direct way. Therefore, we resort to the EM framework by
introducing a hidden bit variable z ∈ {0,1}^N with \sum_{t=1}^{N} z_t = 1 and

z_t = \begin{cases} 1 & \text{if the density at } x \text{ is explained by } \mathcal{N}(x \mid \mu_t, \sigma) \text{ only} \\ 0 & \text{else.} \end{cases}   (6)

The complete log-likelihood is \log p(x, z \mid \mu, \sigma) = \log \left[ p(x \mid z, \mu, \sigma)\, p(z) \right] with
p(z) = \prod_{t=1}^{N} \pi_t^{z_t} and p(x \mid z, \mu, \sigma) = \prod_{t=1}^{N} \mathcal{N}(x \mid \mu_t, \sigma)^{z_t}.
In contrast to generative models, which use EM to determine the parameters
of the model, we maximize the complete likelihood with respect to x. The EM
framework ensures that maximizing the complete log-likelihood maximizes the
original log-likelihood as well. Therefore, we define the quantity
Q(x | x^{(l)}) = E[ \log p(x, z \mid \mu, \sigma) \mid \mu, \sigma, x^{(l)} ].
In the E-step the expectation Q(x | x^{(l)}) is computed with respect to z, with
x^{(l)} substituted for x, while in the M-step Q(x | x^{(l)}) is taken as a function
of x and maximized. The E-step boils down to computing the posterior probabilities
for the z_t:

E[z_t \mid \mu, \sigma, x^{(l)}] = p(z_t = 1 \mid x^{(l)}, \mu, \sigma)   (7)

= \frac{p(x^{(l)} \mid z_t = 1, \mu, \sigma)\, p(z_t = 1 \mid \mu, \sigma)}{\sum_{t'=1}^{N} p(x^{(l)} \mid z_{t'} = 1, \mu, \sigma)\, p(z_{t'} = 1 \mid \mu, \sigma)}   (8)

= \frac{\frac{1}{N} \mathcal{N}(x^{(l)} \mid \mu_t, \sigma)}{\sum_{t'=1}^{N} \frac{1}{N} \mathcal{N}(x^{(l)} \mid \mu_{t'}, \sigma)}   (9)

= \frac{\frac{1}{N h^d} K\left(\frac{x^{(l)} - x_t}{h}\right)}{p(x^{(l)})} = \theta_t   (10)
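The step from (9) to (10) becomes evident when the Gaussian component is written in terms of the kernel K; a short verification, using only the definitions from section 2:

\mathcal{N}(x \mid \mu_t, \sigma) = (2\pi)^{-d/2} h^{-d} \exp\left(-\frac{\|x - x_t\|^2}{2 h^2}\right) = h^{-d} K\left(\frac{x - x_t}{h}\right),

so that

\frac{\frac{1}{N} \mathcal{N}(x^{(l)} \mid \mu_t, \sigma)}{\sum_{t'=1}^{N} \frac{1}{N} \mathcal{N}(x^{(l)} \mid \mu_{t'}, \sigma)}
= \frac{\frac{1}{N h^d} K\left(\frac{x^{(l)} - x_t}{h}\right)}{\frac{1}{N h^d} \sum_{t'=1}^{N} K\left(\frac{x^{(l)} - x_{t'}}{h}\right)}
= \frac{\frac{1}{N h^d} K\left(\frac{x^{(l)} - x_t}{h}\right)}{p(x^{(l)})}.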
In the M-step, z_t is replaced by the fixed posterior θ_t, which yields
Q(x | x^{(l)}) = \sum_{t=1}^{N} \theta_t \left[ \log(1/N) + \log \mathcal{N}(x \mid \mu_t, \sigma) \right].
Computing the derivative with respect to x and setting it to zero yields
\sum_{t=1}^{N} \theta_t \frac{1}{\sigma^2} (x - \mu_t) = 0 and thus

x^{(l+1)} = \frac{\sum_{t=1}^{N} \theta_t \mu_t}{\sum_{t=1}^{N} \theta_t} = \frac{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right) x_t}{\sum_{t=1}^{N} K\left(\frac{x^{(l)} - x_t}{h}\right)}.   (11)

By starting the EM iteration with x^{(0)} = x_t, the method performs an iterative
hill climbing starting at data point x_t.
3.3 Sampling based Acceleration

As the hill climbing procedure is a special case of the expectation maximization
algorithm, we can employ general acceleration techniques known for EM to speed
up the Denclue clustering algorithm.
Most known acceleration methods for EM try to reduce the number of iterations
needed until convergence [9]. Since the number of iterations is typically quite
low, that kind of technique yields no significant reduction for the clustering
algorithm.
In order to speed up the clustering algorithm, the cost of the iterations themselves
should be reduced. One option is sparse EM [11], which still converges to the
true local maxima. The idea is to freeze small posteriors for several iterations,
so that only the p% largest posteriors are updated in each iteration. As the hill
climbing typically needs only a few iterations, we modify the hill climbing
starting at a single point x^{(0)} as follows. All kernels K((x^{(0)} - x_t)/h) are
determined in the initial iteration and x^{(1)} is determined as before. Let U be
the index set of the p% largest kernels and L its complement. Then, in the
following iterations the update formula is modified to

x^{(l+1)} = \frac{\sum_{t \in U} K\left(\frac{x^{(l)} - x_t}{h}\right) x_t + \sum_{t \in L} K\left(\frac{x^{(0)} - x_t}{h}\right) x_t}{\sum_{t \in U} K\left(\frac{x^{(l)} - x_t}{h}\right) + \sum_{t \in L} K\left(\frac{x^{(0)} - x_t}{h}\right)}.   (12)

The index sets U and L can be computed by sorting. The disadvantage of the
method is that the first iteration is still the same as in the original EM.
The original hill climbing converges towards a true local maximum of the
density function. However, we do not need the exact position of such a maximum.
It is sufficient for the clustering algorithm that all points of a cluster
converge to the same local maximum, regardless of where that location might be.
In that light, it makes sense to simplify the original density function by reducing
the data set to a set of p% representative points. That reduction can be done
in many ways. We consider here random sampling and k-means. So the number
of points N is reduced to a much smaller number of representative points N',
which are used to construct the density estimate.
Note that random sampling has much smaller costs than k-means. We investigate
in the experimental section whether the additional costs of k-means pay off
through fewer needed iterations or better cluster quality.
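A Python sketch of the sparse-EM variant of the hill climbing, implementing update formula (12); the fraction p of active kernels, the convergence test on the (unnormalized) kernel sum, and all names are illustrative choices of ours layered on the procedure described above:

import numpy as np

def sparse_ssa_hill_climb(x_start, data, h, p=0.2, eps=1e-2, max_iter=100):
    # Sparse EM variant of the step size adjusting hill climbing (sketch).
    x = x_start.copy()
    u0 = (x - data) / h
    k0 = np.exp(-0.5 * (u0 ** 2).sum(axis=1))           # kernels at x^(0), computed once
    n_active = max(1, int(p * len(data)))
    upper = np.argsort(k0)[-n_active:]                   # U: indices of the fraction p largest kernels
    lower = np.setdiff1d(np.arange(len(data)), upper)    # L: indices of the frozen kernels
    frozen_num = (k0[lower, None] * data[lower]).sum(axis=0)
    frozen_den = k0[lower].sum()
    prev = None
    for _ in range(max_iter):
        u = (x - data[upper]) / h
        k = np.exp(-0.5 * (u ** 2).sum(axis=1))          # only the active kernels are recomputed
        density = k.sum() + frozen_den                    # unnormalized density, frozen terms taken at x^(0)
        if prev is not None and (density - prev) / density <= eps:
            break
        prev = density
        # update formula (12): the frozen terms keep the kernel values of x^(0)
        x = ((k[:, None] * data[upper]).sum(axis=0) + frozen_num) / (k.sum() + frozen_den)
    return x

For the sampling-based variants no new update formula is needed: one simply passes the reduced data set (a random sample of the points, or the centroids produced by k-means) to the hill climbing instead of the full data set.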

4 Experimental Evaluation

We compared the new step size adjusting (SSA) hill climbing method with the old
fixed step size (FS) hill climbing. We used synthetic data with normally distributed
16-dimensional clusters with uniformly distributed centers and approximately
equal sizes. Both methods are tuned to find the perfect clustering in the most
efficient way. The total sum of the numbers of iterations for the hill climbings of
all data points is plotted versus the number of data points. SSA was run with
different values for ε, which controls the convergence criterion of SSA.
Fig. 5. Number of data points versus the total sum of the numbers of iterations, for
FS and for SSA with ε = 0, 1e-10, 1e-5, and 1e-2.
Figure 5 clearly shows that SSA (ε = 0.01) needs only a fraction of the number of
iterations of FS to achieve the same results. The costs per iteration are the
same for both methods.
Next, we tested the influence of different sampling methods on the computational
costs. Since the costs per iteration differ for sparse EM, we measure the
costs in the number of kernel computations versus the sample size. Figure 6 (left)
shows that sparse EM is more expensive than random sampling and k-means based
data reduction. The difference between the two latter methods is negligible, so
the additional effort of k-means during the data reduction does not pay off in
lower computational costs during the hill climbing. For sample size 100%, the
methods converge to the original SSA hill climbing.
For random sampling, we tested sample size versus cluster quality measured
by normalized mutual information (NMI is one if the perfect clustering is found).
Figure 6 (right) shows that the decrease of cluster quality is not linear in the
sample size. So, a sample of 20% is still sufficient for a good clustering when the
dimensionality is d = 16. Larger dimensionality requires larger samples as well as
more smoothing (larger h), but the clustering can still be found.
In the last experiment, we compared SSA, its sampling variants, and k-means
with the optimal k on various real data sets from the machine learning repository
with respect to cluster quality. Table 1 shows average values of NMI with standard
deviations for k-means and the sampling variants, but not for SSA, which is a
deterministic algorithm. SSA achieves cluster quality that is better than or
comparable to that of k-means. The sampling variants degrade with smaller sample
sizes (0.8, 0.4, 0.2), but k-means based data reduction suffers much less from
that effect. So, the additional effort of k-means based data reduction pays off
in cluster quality.
In all experiments, the smoothing parameter h was tuned manually. Currently,
we are working on methods to determine that parameter automatically. In
conclusion, we proposed a new hill climbing method for kernel density functions,
which really converges towards a local maximum and adjusts the step size
automatically. We believe that our new technique has some potential for
interesting combinations with parametric clustering methods.
Fig. 6. (left) Sample size versus number of kernel computations for random sampling,
sparse EM, and k-means sampling; (right) sample size versus cluster quality (normalized
mutual information, NMI) for d = 16 (h = 0.6), d = 32 (h = 0.8), d = 64 (h = 1.2), and
d = 128 (h = 1.7).

Table 1. NMI values for different data sets and methods; the first number in the three
rightmost columns gives the sample size.

          k-means      SSA    Random Sampling    Sparse EM          k-means Sampling
  iris    0.69±0.10    0.72   0.8: 0.66±0.05     0.8: 0.68±0.06     0.8: 0.67±0.06
                              0.4: 0.63±0.05     0.4: 0.60±0.06     0.4: 0.65±0.07
                              0.2: 0.63±0.06     0.2: 0.50±0.04     0.2: 0.64±0.07
  ecoli   0.56±0.05    0.67   0.8: 0.65±0.02     0.8: 0.66±0.00     0.8: 0.65±0.02
                              0.4: 0.62±0.06     0.4: 0.61±0.00     0.4: 0.65±0.04
                              0.2: 0.59±0.06     0.2: 0.40±0.00     0.2: 0.65±0.03
  wine    0.82±0.14    0.80   0.8: 0.71±0.06     0.8: 0.72±0.07     0.8: 0.70±0.11
                              0.4: 0.63±0.10     0.4: 0.63±0.00     0.4: 0.70±0.05
                              0.2: 0.55±0.15     0.2: 0.41±0.00     0.2: 0.58±0.21


References
1. M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points
to identify the clustering structure. In Proceedings SIGMOD'99, pages 49–60. ACM
Press, 1999.
2. J. Bezdek. Fuzzy Models and Algorithms for Pattern Recognition and Image
Processing. Kluwer Academic Publishers, 1999.
3. H. H. Bock. Automatic Classification. Vandenhoeck and Ruprecht, 1974.
4. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press,
1990.
5. K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function,
with applications in pattern recognition. IEEE Transactions on Information Theory,
21:32–40, 1975.
6. M. Herbin, N. Bonnet, and P. Vautrot. Estimation of the number of clusters and
influence zones. Pattern Recognition Letters, 22:1557–1568, 2001.
7. A. Hinneburg and D. Keim. An efficient approach to clustering in large multimedia
databases with noise. In Proceedings KDD'98, pages 58–65. AAAI Press, 1998.
8. A. Hinneburg and D. A. Keim. A general approach to clustering in large databases
with noise. Knowledge and Information Systems (KAIS), 5(4):387–415, 2003.
9. G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997.
10. O. Nasraoui and R. Krishnapuram. The unsupervised niche clustering algorithm:
extension to multivariate clusters and application to color image segmentation.
IFSA World Congress and 20th NAFIPS International Conference, 3, 2001.
11. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Learning in Graphical Models, pages 355–368. MIT
Press, 1999.
12. J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. Density-based clustering in spatial
databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge
Discovery, 2(2):169–194, 1997.
13. P. Schnell. A method to find point-groups. Biometrika, 6:47–48, 1964.
14. D. Scott. Multivariate Density Estimation. Wiley, 1992.
15. B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman
& Hall, 1986.
16. R. Yager and D. Filev. Approximate clustering via the mountain method. IEEE
Transactions on Systems, Man and Cybernetics, 24(8):1279–1284, 1994.
17. T. Zhang, R. Ramakrishnan, and M. Livny. Fast density estimation using CF-kernel
for very large databases. In Proceedings KDD'99, pages 312–316. ACM, 1999.
