Identifying the components

van Leeuwen, Matthijs; Vreeken, Jilles; Siebes, Arno

doi:10.1007/s10618-009-0137-2

Open access
Published: 17 July 2009

Volume 19, pages 176–193, (2009)

Data Mining and Knowledge Discovery

Abstract

Most, if not all, databases are mixtures of samples from different distributions. Transactional data is no exception. For the prototypical example, supermarket basket analysis, one also expects a mixture of different buying patterns. Households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identify the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge on the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.
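
The selection criterion described in the abstract, preferring the set of components whose combined description length of data and models is smallest, can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a much simpler model in which every item of a component is an independent Bernoulli variable, charges an assumed (1/2) log2(n) bits per item probability as model cost, and ignores the cost of recording which component a transaction belongs to. The function names, the item names, and the toy data are all hypothetical.

```python
import math


def encoded_bits(transactions, items):
    """Two-part description length, in bits, of a single component.

    Assumption: items are independent Bernoulli variables. This is a
    stand-in model chosen for brevity, not the encoding used in the paper.
    """
    n = len(transactions)
    if n == 0:
        return 0.0
    # Model cost: an assumed (1/2) * log2(n) bits per item probability.
    model_bits = 0.5 * math.log2(n) * len(items)
    data_bits = 0.0
    for item in items:
        count = sum(1 for t in transactions if item in t)
        # Laplace smoothing keeps p strictly between 0 and 1.
        p = (count + 1) / (n + 2)
        data_bits += -count * math.log2(p) - (n - count) * math.log2(1 - p)
    return model_bits + data_bits


def total_bits(components, items):
    """Combined size of a partition: the sum over its components."""
    return sum(encoded_bits(c, items) for c in components)


# Hypothetical toy database mixing two buying patterns.
items = ["bread", "milk", "diapers", "wine", "cheese", "olives"]
young = [{"bread", "milk", "diapers"}] * 20
retired = [{"wine", "cheese", "olives"}] * 20

one_component = total_bits([young + retired], items)
two_components = total_bits([young, retired], items)
print(f"one component:  {one_component:.1f} bits")
print(f"two components: {two_components:.1f} bits")
```

On this toy data the two-component partition needs fewer bits than the single component, so a minimum description length criterion would prefer it. Splitting a homogeneous database instead adds model cost without a matching saving on the data, which is how the same criterion can also decide against adding components.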

Author information

Authors and Affiliations

Department of Computer Science, Universiteit Utrecht, Utrecht, The Netherlands
Matthijs van Leeuwen, Jilles Vreeken & Arno Siebes

Corresponding author

Correspondence to Matthijs van Leeuwen.

Additional information

Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

van Leeuwen, M., Vreeken, J. & Siebes, A. Identifying the components. Data Min Knowl Disc 19, 176–193 (2009). https://doi.org/10.1007/s10618-009-0137-2

Received: 11 June 2009
Accepted: 20 June 2009
Published: 17 July 2009
Issue Date: October 2009
DOI: https://doi.org/10.1007/s10618-009-0137-2
