Abstract
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
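To make the idea of selecting a set of subgroups rather than the individual top-k concrete, the following is a minimal Python sketch of one possible cover-based selection step inside a beam search. It is an illustration under stated assumptions, not the DSSD implementation: the `Subgroup` type, the `select_diverse` function, and the decay parameter `alpha` are hypothetical names introduced here. The weighting scheme, in which rows already covered by selected subgroups count for less, is an assumed heuristic.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Subgroup(NamedTuple):
    """Hypothetical candidate: a description, its quality, and the rows it covers."""
    description: str
    quality: float
    cover: FrozenSet[int]


def select_diverse(candidates: List[Subgroup], k: int, alpha: float = 0.5) -> List[Subgroup]:
    """Greedily pick up to k subgroups, down-weighting rows that are already covered.

    alpha is an assumed decay factor: a row covered c times by the current
    selection contributes alpha**c instead of 1 to a candidate's weight.
    With alpha = 1.0 this reduces to plain top-k selection on quality.
    """
    selected: List[Subgroup] = []
    counts: Dict[int, int] = {}  # row index -> number of selected subgroups covering it
    remaining = [c for c in candidates if c.cover]  # skip empty covers

    def score(c: Subgroup) -> float:
        # Average per-row weight of the cover, multiplied by the quality.
        weight = sum(alpha ** counts.get(row, 0) for row in c.cover) / len(c.cover)
        return weight * c.quality

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for row in best.cover:
            counts[row] = counts.get(row, 0) + 1
    return selected


if __name__ == "__main__":
    candidates = [
        Subgroup("a", 1.0, frozenset({0, 1, 2, 3})),
        Subgroup("b", 0.95, frozenset({0, 1, 2})),  # near-duplicate of "a"
        Subgroup("c", 0.6, frozenset({4, 5, 6})),   # lower quality, different rows
    ]
    print([s.description for s in select_diverse(candidates, k=2)])
```

In this toy input, plain top-k on quality would keep the two overlapping subgroups "a" and "b", whereas the weighted selection keeps "a" and "c", trading some quality for coverage of different rows.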
References
Abudawood T, Flach P (2009) Evaluation measures for multi-class subgroup discovery. In: Proceedings of the ECML/PKDD’09, Bled, pp 35–50
Atzmüller M, Lemmerich F (2009) Fast subgroup discovery for continuous target concepts. In: Proceedings of ISMIS ’09, Prague, pp 35–44
Aumann Y, Lindell Y (1999) A statistical theory for quantitative association rules. In: Proceedings of KDD’99, San Diego, pp 261–270
Bailey J, Dong G (2007) Contrast data mining: methods and applications. Tutorial at the IEEE international conference on data mining (ICDM), Omaha
Bay S, Pazzani M (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3): 213–246
Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the ICDM’07, Omaha, pp 63–72
Clark P, Boswell R (1991) Rule induction with CN2: some recent improvements. In: Proceedings of the European working session on learning (EWSL-91), Porto, pp 151–163
Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3: 261–283
Cover T, Thomas J (2006) Elements of information theory, 2nd ed. Wiley, New York
Daly O, Taniar D (2005) Exception rules in data mining. In: Encyclopedia of information science and technology (II), pp 1144–1148
Dong G, Zhang X, Wong L, Li J (1999) CAEP: classification by aggregating emerging patterns. In: Proceedings of DS’99, Tokyo, pp 30–42
Duivesteijn W, Knobbe A, Feelders A, van Leeuwen M (2010) Subgroup discovery meets Bayesian networks: an exceptional model mining approach. In: Proceedings of the ICDM’10, Sydney, pp 158–167
Friedman J, Fisher N (1999) Bump hunting in high-dimensional data. Stat Comput 9(2): 123–143
Garriga G, Kralj P, Lavrač N (2008) Closed sets for labeled data. J Mach Learn Res 9: 559–580
Grosskreutz H, Paurat D (2011) Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. In: Proceedings of the ECML/PKDD ’11, Athens, pp 533–548
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2): 210–226
Grosskreutz H, Rüping S, Wrobel S (2008) Tight optimistic estimates for fast subgroup discovery. In: Proceedings of the ECML/PKDD’08, Antwerp, pp 440–456
Grosskreutz H, Boley M, Krause-Traudes M (2010) Subgroup discovery for election analysis: a case study in descriptive data mining. In: Proceedings of DS’10, no. 6332 in LNAI. Springer, New York, pp 57–71
Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86
Heikinheimo H, Fortelius M, Eronen J, Mannila H (2007) Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr 34(6): 1053–1064
Klösgen W (1996) Explora: a multipattern and multistrategy discovery assistant. In: Advances in knowledge discovery and data mining. MIT Press, Cambridge, pp 249–271
Klösgen W (2002) Subgroup discovery. In: Handbook of data mining and knowledge discovery. Oxford University Press, Oxford
Knobbe A (2006) Multi-relational data mining. IOS Press, Amsterdam
Knobbe A, Ho E (2006a) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of the KDD’06, Philadelphia, pp 237–244
Knobbe A, Ho E (2006b) Pattern teams. In: Proceedings of the ECML PKDD’06, Berlin, pp 577–584
Knobbe A, Valkonet J (2009) Building classifiers from pattern teams. In: Proceedings of the ECML PKDD’09 workshop LeGo 2009, Bled, pp 77–93
Kocev D, Struyf J, Dzeroski S (2007) Beam search induction and similarity constraints for predictive clustering trees. In: LNCS KDID 2006, Berlin, pp 134–151
Kralj Novak P, Lavrač N, Webb G (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10: 377–403
Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22(1): 79–86
Lavrač N, Kavšek B, Flach P, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5: 153–188
Leman D, Feelders A, Knobbe A (2008) Exceptional model mining. In: Proceedings of the ECML/PKDD’08, vol 2, Antwerp, pp 1–16
Lemmerich F, Puppe F (2011) Local models for expectation-driven subgroup discovery. In: Proceedings of the ICDM’11, Vancouver
Lemmerich F, Rohlfs M, Atzmüller M (2010) Fast discovery of relevant subgroup patterns. In: Proceedings of FLAIRS, Daytona Beach
Liu B, Hsu W, Ma Y (2001) Discovering the set of fundamental rule changes. In: Proceedings of KDD’01, San Francisco, pp 335–340
Lowerre B (1976) The harpy speech recognition system. PhD thesis
Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: Proceedings of the KDD’96, Portland, pp 189–194
Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders P, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London
Morishita S, Sese J (2000) Traversing itemset lattice with statistical metric pruning. In: Proceedings PODS, Dallas, pp 226–236
Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings KDD’09, Paris, pp 647–656
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT’99, Jerusalem, pp 398–416
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8): 1226–1238
Pieters B, Knobbe A, Dzeroski S (2010) Subgroup discovery in ranked data, with an application to gene set enrichment. In: Proceedings preference learning workshop (PL 2010) at ECML PKDD ’10, Barcelona
Shell P, Rubio JH, Barro GQ (1994) Improving search through diversity. In: AAAI, Seattle, pp 1323–1328
Tsoumakas G, Vilcek J, Spyromitros L (2010) MULAN: a java library for multi-label learning. http://mulan.sourceforge.net/
van Leeuwen M (2010) Maximal exceptions with minimal descriptions. Data Min Knowl Discov 21(2): 259–276
van Leeuwen M, Knobbe A (2011) Non-redundant subgroup discovery in large and complex data. In: Proceedings of the ECML PKDD’11, Athens, pp 459–474
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214
Webb G (1995) Opus: an efficient admissible algorithm for unordered search. J Artif Intell Res 3: 431–465
Webb G (2001) Discovering associations with numeric variables. In: Proceedings of KDD’01, San Francisco, pp 383–388
Webb G, Butler S, Newlands D (2003) On detecting differences between groups. In: Proceedings of KDD’03, Washington, pp 256–265
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proceedings of PKDD 1997. Springer, Heidelberg, pp 78–87
Yan X, Han J (2002) gSpan: Graph-based substructure pattern mining. In: Proceedings of the ICDM’02, Maebashi, pp 721–724
Acknowledgements
The authors wish to thank Antti Ukkonen for sharing the Finnish elections data. This research is financially supported by the Netherlands Organisation for Scientific Research (NWO) through the EMM project (number 612.065.822) and a Rubicon grant.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Additional information
Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.
The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 (van Leeuwen and Knobbe 2011).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
van Leeuwen, M., Knobbe, A. Diverse subgroup set discovery. Data Min Knowl Disc 25, 208–242 (2012). https://doi.org/10.1007/s10618-012-0273-y
DOI: https://doi.org/10.1007/s10618-012-0273-y