Abstract
Most, if not all, databases are mixtures of samples from different distributions. Transactional data is no exception. For the prototypical example, supermarket basket analysis, one also expects a mixture of different buying patterns. Households of retired people buy different collections of items than households with young children. Models that take such underlying distributions into account are in general superior to those that do not. In this paper we introduce two MDL-based algorithms that follow orthogonal approaches to identify the components in a transaction database. The first follows a model-based approach, while the second is data-driven. Both are parameter-free: the number of components and the components themselves are chosen such that the combined complexity of data and models is minimised. Further, neither prior knowledge on the distributions nor a distance metric on the data is required. Experiments with both methods show that highly characteristic components are identified.
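The selection criterion described above, choosing the number of components so that the combined encoded size of models and data is minimal, can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not detail. It assumes a deliberately simple independent-item encoding, and the names `component_bits` and `total_bits` and the toy baskets are hypothetical.

```python
import math

def component_bits(transactions, items):
    """Two-part code length in bits for one component.

    Assumed toy model: each item occurs independently with its
    empirical frequency inside the component.
    """
    n = len(transactions)
    if n == 0:
        return 0.0
    # Model cost: one occurrence count (a number in 0..n) per item.
    model_bits = len(items) * math.log2(n + 1)
    # Data cost: Shannon code length of every presence/absence bit.
    data_bits = 0.0
    for item in items:
        present = sum(1 for t in transactions if item in t)
        for count in (present, n - present):
            if count > 0:
                data_bits -= count * math.log2(count / n)
    return model_bits + data_bits

def total_bits(partition, items):
    """Total encoded size of a database split into components."""
    n = sum(len(part) for part in partition)
    k = len(partition)
    # Cost of stating, per transaction, which component it belongs to.
    assignment_bits = n * math.log2(k) if k > 1 else 0.0
    return assignment_bits + sum(component_bits(p, items) for p in partition)

# Hypothetical data: two buying patterns mixed in one database.
items = ["bread", "tea", "diapers", "juice"]
retired = [{"bread", "tea"}] * 20
families = [{"diapers", "juice"}] * 20
mixed = retired + families

one_component = total_bits([mixed], items)
two_components = total_bits([retired, families], items)
print(one_component, two_components)

# A homogeneous database, where splitting should not pay off.
uniform = [{"bread", "tea"}] * 40
```

On the mixed toy data the two-component description is shorter than the single-component one, while splitting the homogeneous database only adds assignment and model cost. In that sense no component count has to be supplied by hand: the partition with the smallest total is preferred.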
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Additional information
Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor.
About this article
Cite this article
van Leeuwen, M., Vreeken, J. & Siebes, A. Identifying the components. Data Min Knowl Disc 19, 176–193 (2009). https://doi.org/10.1007/s10618-009-0137-2
DOI: https://doi.org/10.1007/s10618-009-0137-2