Diverse subgroup set discovery

van Leeuwen, Matthijs; Knobbe, Arno

doi:10.1007/s10618-012-0273-y

Published: 10 June 2012

Volume 25, pages 208–242 (2012)

Data Mining and Knowledge Discovery

Matthijs van Leeuwen¹,² & Arno Knobbe³

Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
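
The abstract only names the idea of selecting a diverse set of subgroups for each level of a beam search; it does not spell out a procedure. As a rough illustration, the sketch below contrasts plain top-k selection with one possible cover-based selection, in which a candidate's quality is down-weighted when its rows are already covered by subgroups chosen earlier. The `Subgroup` type, the `diverse_select` function, the multiplicative weight `alpha`, and the toy candidates are all assumptions made for this example and should not be read as the DSSD implementation.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Subgroup(NamedTuple):
    description: str       # hypothetical human-readable condition
    quality: float         # value of some quality measure (higher is better)
    cover: FrozenSet[int]  # indices of the rows that satisfy the description


def top_k(candidates: List[Subgroup], k: int) -> List[Subgroup]:
    """Plain top-k: keep the k candidates with the highest quality."""
    return sorted(candidates, key=lambda s: s.quality, reverse=True)[:k]


def diverse_select(candidates: List[Subgroup], k: int,
                   alpha: float = 0.5) -> List[Subgroup]:
    """Greedy cover-based selection (illustrative assumption, not DSSD itself).

    Each row covered c times by already selected subgroups contributes
    alpha**c instead of 1 to a candidate's weight, so candidates that mostly
    re-cover known rows are penalised.
    """
    selected: List[Subgroup] = []
    times_covered: Dict[int, int] = {}
    remaining = list(candidates)

    def score(s: Subgroup) -> float:
        if not s.cover:
            return 0.0
        weight = sum(alpha ** times_covered.get(r, 0) for r in s.cover)
        return s.quality * weight / len(s.cover)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for r in best.cover:
            times_covered[r] = times_covered.get(r, 0) + 1
    return selected


# Toy candidates: three near-duplicates and one subgroup on different rows.
candidates = [
    Subgroup("age > 40", 0.90, frozenset({0, 1, 2, 3})),
    Subgroup("age > 39", 0.88, frozenset({0, 1, 2, 3, 4})),
    Subgroup("age > 41", 0.85, frozenset({0, 1, 2})),
    Subgroup("city = X", 0.60, frozenset({7, 8, 9})),
]

print([s.description for s in top_k(candidates, 2)])
# ['age > 40', 'age > 39']
print([s.description for s in diverse_select(candidates, 2)])
# ['age > 40', 'city = X']
```

In this toy setting, top-k fills the beam with two variants of the same pattern, whereas the cover-based variant trades some quality for a subgroup that describes different rows. That is the exploration versus exploitation balance the abstract refers to.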

References

Abudawood T, Flach P (2009) Evaluation measures for multi-class subgroup discovery. In: Proceedings of the ECML/PKDD’09, Bled, pp 35–50
Atzmüller M, Lemmerich F (2009) Fast subgroup discovery for continuous target concepts. In: Proceedings of ISMIS ’09, Prague, pp 35–44
Aumann Y, Lindell Y (1999) A statistical theory for quantitative association rules. In: Proceedings of KDD’99, San Diego, pp 261–270
Bailey J, Dong G (2007) Contrast data mining: methods and applications. Tutorial at the IEEE international conference on data mining (ICDM), Omaha
Bay S, Pazzani M (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3): 213–246
Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the ICDM’07, Omaha, pp 63–72
Clark P, Boswell R (1991) Rule induction with CN2: some recent improvements. In: Proceedings of the European working session on learning (EWSL-91), Porto, pp 151–163
Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3: 261–283
Cover T, Thomas J (2006) Elements of information theory, 2nd ed. Wiley, New York
Daly O, Taniar D (2005) Exception rules in data mining. In: Encyclopedia of information science and technology (II), pp 1144–1148
Dong G, Zhang X, Wong L, Li J (1999) CAEP: classification by aggregating emerging patterns. In: Proceedings of DS’99, Tokyo, pp 30–42
Duivesteijn W, Knobbe A, Feelders A, van Leeuwen M (2010) Subgroup discovery meets Bayesian networks: an exceptional model mining approach. In: Proceedings of the ICDM’10, Sydney, pp 158–167
Friedman J, Fisher N (1999) Bump hunting in high-dimensional data. Stat Comput 9(2): 123–143
Garriga G, Kralj P, Lavrac N (2008) Closed sets for labeled data. J Mach Learn Res 9: 559–580
Grosskreutz H, Paurat D (2011) Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. In: Proceedings of the ECML/PKDD ’11, Athens, pp 533–548
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2): 210–226
Grosskreutz H, Rüping S, Wrobel S (2008) Tight optimistic estimates for fast subgroup discovery. In: Proceedings of the ECML/PKDD’08, Antwerp, pp 440–456
Grosskreutz H, Boley M, Krause-Traudes M (2010) Subgroup discovery for election analysis: a case study in descriptive data mining. In: Proceedings of DS’10, no. 6332 in LNAI. Springer, New York, pp 57–71
Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86
Heikinheimo H, Fortelius M, Eronen J, Mannila H (2007) Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr 34(6): 1053–1064
Klösgen W (1996) Advances in knowledge discovery and data mining, chap Explora: a multipattern and multistrategy discovery assistant. MIT Press, Cambridge, pp 249–271
Klösgen W (2002) Handbook of data mining and knowledge discovery, chap Subgroup discovery. Oxford University Press, Oxford
Knobbe A (2006) Multi-relational data mining. IOS Press, Amsterdam
Knobbe A, Ho E (2006a) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of the KDD’06, Philadelphia, Berlin, pp 237–244
Knobbe A, Ho E (2006b) Pattern teams. In: Proceedings of the ECML PKDD’06, Berlin, pp 577–584
Knobbe A, Valkonet J (2009) Building classifiers from pattern teams. In: Proceedings of the ECML PKDD’09 workshop LeGo 2009, Bled, pp 77–93
Kocev D, Struyf J, Dzeroski S (2007) Beam search induction and similarity constraints for predictive clustering trees. In: LNCS KDID 2006, Berlin, pp 134–151
Kralj Novak P, Lavrač N, Webb G (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10: 377–403
Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22(1): 79–86
Lavrač N, Kavšek B, Flach P, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5: 153–188
Leman D, Feelders A, Knobbe A (2008) Exceptional model mining. In: Proceedings of the ECML/PKDD’08, vol 2, Antwerp, pp 1–16
Lemmerich F, Puppe F (2011) Local models for expectation-driven subgroup discovery. In: Proceedings of the ICDM’11, Vancouver
Lemmerich F, Rohlfs M, Atzmüller M (2010) Fast discovery of relevant subgroup patterns. In: Proceedings of FLAIRS, Daytona Beach
Liu B, Hsu W, Ma Y (2001) Discovering the set of fundamental rule changes. In: Proceedings of KDD’01, San Francisco, pp 335–340
Lowerre B (1976) The Harpy speech recognition system. PhD thesis
Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: Proceedings of the KDD’96, Portland, pp 189–194
Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders P, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London
Morishita S, Sese J (2000) Traversing itemset lattice with statistical metric pruning. In: Proceedings PODS, Dallas, pp 226–236
Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings KDD’09, Paris, pp 647–656
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT’99, Jerusalem, pp 398–416
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8): 1226–1238
Pieters B, Knobbe A, Dzeroski S (2010) Subgroup discovery in ranked data, with an application to gene set enrichment. In: Proceedings preference learning workshop (PL 2010) at ECML PKDD ’10, Barcelona
Shell P, Rubio JH, Barro GQ (1994) Improving search through diversity. In: AAAI, Seattle, pp 1323–1328
Tsoumakas G, Vilcek J, Spyromitros L (2010) MULAN: a Java library for multi-label learning. http://mulan.sourceforge.net/
van Leeuwen M (2010) Maximal exceptions with minimal descriptions. Data Min Knowl Discov 21(2): 259–276
van Leeuwen M, Knobbe A (2011) Non-redundant subgroup discovery in large and complex data. In: Proceedings of the ECML PKDD’11, Athens, pp 459–474
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214
Webb G (1995) Opus: an efficient admissible algorithm for unordered search. J Artif Intell Res 3: 431–465
Webb G (2001) Discovering associations with numeric variables. In: Proceedings of KDD’01, San Francisco, pp 383–388
Webb G, Butler S, Newlands D (2003) On detecting differences between groups. In: Proceedings of KDD’03, Washington, pp 256–265
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proceedings of PKDD 1997. Springer, Heidelberg, pp 78–87
Yan X, Han J (2002) gSpan: Graph-based substructure pattern mining. In: Proceedings of the ICDM’02, Maebashi, pp 721–724

Acknowledgements

The authors wish to thank Antti Ukkonen for sharing the Finnish elections data. This research is financially supported by the Netherlands Organisation for Scientific Research (NWO) through the EMM project (number 612.065.822) and a Rubicon grant.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Author information

Authors and Affiliations

Machine Learning, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
Matthijs van Leeuwen
Algorithmic Data Analysis, Department of Information and Computer Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
Matthijs van Leeuwen
Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands
Arno Knobbe

Corresponding author

Correspondence to Matthijs van Leeuwen.

Additional information

Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 (van Leeuwen and Knobbe 2011).

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

van Leeuwen, M., Knobbe, A. Diverse subgroup set discovery. Data Min Knowl Disc 25, 208–242 (2012). https://doi.org/10.1007/s10618-012-0273-y

Received: 31 October 2011
Accepted: 21 May 2012
Published: 10 June 2012
Issue Date: September 2012
DOI: https://doi.org/10.1007/s10618-012-0273-y
