1 Introduction

Machine-learning algorithms are nowadays used in almost every domain. While such algorithms are known to perform well in many tasks, they are often used as “black boxes,” i.e., their decision processes are too complex for humans to interpret. Black-box machine-learning algorithms offer only a partial representation of reality, and moreover, their performance is assessed using metrics that capture only particular aspects of the real world. The lack of interpretability considerably limits the level of trust humans put in black-box machine-learning algorithms and thus poses a barrier to the wide adoption of machine-learning techniques in real-world applications. In many applications, particularly when machine learning is used to aid high-stakes decision making, interest lies not only in having accurate predictions, but also in extracting information from the learned model that can be analyzed by human experts to produce valuable insights. In an attempt to overcome this barrier, interpretable and explainable machine learning have recently emerged as increasingly prominent topics, and in this context, classification problems hold a central position. In the standard classification setting, the goal is to learn a classifier that accurately maps data points to two or more mutually exclusive classes.

In this paper, we focus on a different setting, namely multi-label classification. In contrast to the standard single-label setting, in multi-label classification a point can be associated with more than one class at the same time. Owing to the increasing complexity of modern data, multi-label classification has become increasingly popular in recent years and has been studied extensively. Nonetheless, the main focus remains on improving predictive performance [1]; significantly less attention has been paid to interpretability.

Classification rules, owing to their simple structure, are gaining popularity in the interpretable multi-label classification literature. In rule-based approaches, the goal is to learn a set of rules that captures the most prominent associative patterns between the features and the labels in the data. A rule usually takes the form “({a set of predicates} \(\rightarrow \) {a set of labels})”.

For a given data point, a rule would usually predict the associated labels to be present, if all the predicates in the rule evaluate to true.

Due to the structural simplicity of rules, classifiers based on a set of rules are generally considered more interpretable than other types of classifiers, such as neural networks or even decision trees.

The research question that we bring forward is whether we can design rule-based multi-label classification methods that are both accurate and interpretable. Boomer [2, 3], a recently proposed rule-based multi-label classifier based on gradient boosting, gives promising results in terms of accuracy. However, despite being a rule-based approach, its interpretability is limited, because it often produces rule sets that are both too large and redundant. Such limitations increase the cognitive load on humans, making the resulting models hard to interpret and use.

In this work, we propose corset, a rule-based method that significantly improves over the state-of-the-art Boomer. The improvement is due to (1) reducing rule redundancy, which is achieved by incorporating a term in our objective that penalizes for rule overlap, and (2) explicitly limiting the complexity of rules via a suite of novel sampling schemes. As a result, our method produces a concise set of interpretable rules. An illustration of the concept of our approach is given in Fig. 1.

Fig. 1: Illustration of the concept of our approach for multi-label rule selection. A toy dataset is visualized as a feature matrix and a label matrix. Four rules are shown as colored regions. Regions covered by the same rule are connected by a dashed arrow. The rules in green are chosen because of accuracy, generality, and diversity; the rules in red are discarded.

Fig. 2: Micro \(F_1\) as a function of the number of rules for corset vs. Boomer on the \(\textsf{bibtex}\) dataset.

Example. To illustrate the improvement of corset over Boomer, we consider as an example the \(\textsf{bibtex}\) dataset, where each data point represents a scientific article, with bag-of-words features and topics as labels. We first consider predictive performance as a function of the number of rules. In Fig. 2, we show the micro-averaged \(F_1\) scores, a popular measure for multi-label classification used throughout this paper, for both corset and Boomer. Due to the conciseness of its learned rule set, corset achieves a score close to 0.36 with about 100 rules, whereas Boomer needs over 800 rules to achieve similar performance. Note that corset's performance starts to drop after about 100 rules, as there are no more good rules to learn. The drop indicates overfitting, which can be addressed by standard methods, e.g., cross-validation.
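For concreteness, micro-averaged \(F_1\) pools true positives, false positives, and false negatives over all labels before computing a single \(F_1\) value. Below is a minimal sketch using scikit-learn; the indicator matrices are illustrative toy data, not the actual \(\textsf{bibtex}\) results.

```python
# Minimal sketch: micro-averaged F1 for multi-label predictions.
# Y_true and Y_pred are toy binary indicator matrices
# (rows = data points, columns = labels), purely illustrative.
import numpy as np
from sklearn.metrics import f1_score

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# Micro-averaging pools the per-label confusion counts and then
# computes a single F1 from the pooled counts.
print(f1_score(Y_true, Y_pred, average="micro"))
```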

In addition, Fig. 3 demonstrates the conciseness of the rules found by corset vs. the ones found by Boomer. Here, we show a subset of rules as a bipartite graph, where nodes at the top represent labels and nodes at the bottom represent the predicates (features). Rules are represented by colors, and two nodes are connected if they are part of the same rule. corset uses fewer rules than Boomer and rules tend to contain fewer predicates, resulting in a sparser graph.

Finally, in Fig. 4, we reveal the underlying block structure in the subsets of the feature matrix \(\textbf{F}\) and label matrix \(\textbf{L}\) of \(\textsf{bibtex}\) induced by the same set of rules generated by corset shown in Fig. 3. The rows and columns of the two matrices are re-ordered using a bi-clustering algorithm [4]. Specifically, each row in \(\textbf{F}\) (and \(\textbf{L}\)) corresponds to a training point, which is included if it is covered by at least one rule. An entry in \(\textbf{F}\) or \(\textbf{L}\) is plotted white if the corresponding value in the matrix is 0 and non-white if it is 1. Further, a non-white entry is gray if the corresponding feature (label) is not selected in the rule set; otherwise, the color of a block depends on the rule that covers it.

Contributions. Concretely, in this work we make the following contributions.

  • We frame the problem of learning concise rule sets as an optimization problem. The objective function to be maximized over the space of possible rule sets is a linear combination of a quality function and a diversity function. The optimization problem is \(\textbf{NP}\)-hard and our proposed algorithm, corset, given a set of rule candidates, achieves an approximation ratio of 2, that is, it finds a rule set that is guaranteed to have value of the objective function which is at least half the value of the best possible rule set(s).

  • The performance of corset depends on the quality of the rule candidates. To find good rules efficiently, we design a suite of fast sampling algorithms with probabilistic guarantees, as well as an effective heuristic. The sampling algorithms allow us to eschew the combinatorial nature of the search space of rules and to work with a considerably smaller space that is relevant to our objective.

  • Our experiments show that corset achieves competitive predictive performance compared to state-of-the-art rule-based multi-label classifiers, while offering significantly better interpretability.

Fig. 3: An example set of rules returned by our algorithm (top) and Boomer (bottom) on the \(\textsf{bibtex}\) dataset. We depict all rules for the set of labels {SNA, socialnets, social, networks, analysis}.

Fig. 4: Block structures of a feature sub-matrix (left) and a label sub-matrix (right) revealed by corset rules for the set of labels {SNA, socialnets, social, networks, analysis}.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 formalizes the problem we consider. Section 4 illustrates corset, omitting the details of the rule-sampling algorithms it relies on, which are described in Sects. 5 and 6. Section 7 analyzes the complexity of corset. Afterwards, Sect. 8 presents a thorough experimental evaluation of corset. Lastly, Sect. 9 demonstrates the practical applicability of our methods through a case study.

2 Related work

Multi-label classification. In multi-label classification, the goal is to learn a function that maps input points to one or more predefined categories. For instance, a song can be associated with multiple music genres. A plethora of algorithms have been proposed for this problem; interested readers may refer to a survey [1]. The simplest approaches to multi-label classification are the so-called transformation methods, which convert the original problem into multiple single-label classification problems. The main drawback of these approaches is that they fail to capture label correlations. To overcome this issue, label power-set approaches map each distinct set of labels to a unique meta-label, which serves as the target label for a single-label classifier. Clearly, these approaches do not scale with the number of labels, and the pruned problem transformation method [5] has been proposed as a remedy. Another line of research focuses on designing ad hoc multi-label classification methods by extending existing single-label algorithms. Examples include adaptations of support vector machines [6], k-nearest-neighbor classifiers [7], and perceptrons [8].

Interpretable machine learning. There is no agreed formal definition of interpretability, but it can be loosely defined as the degree to which a human can understand the cause of a decision [9]. Broadly speaking, interpretability in machine learning can be achieved by imposing constraints or penalties to guarantee that the process behind the decision of the algorithm is understandable to humans.

A related topic is explainable machine learning, where the goal is to provide “explanations” for the predictions of black-box models. Although interpretable and explainable machine learning pursue the same general goal and are sometimes lumped together in the literature, explanations for black boxes can be problematic, misleading, and error-prone [10]. By contrast, interpretable models, due to their inherent simplicity, are easier to understand and troubleshoot. Thus, in high-stake domains, where it is crucial to understand the underlying mechanism leading to a given prediction, it is advisable to prioritize interpretable models over “explained” black-box ones.

Rule-based approaches to single-label classification. Research in interpretable machine learning has boomed in recent years. Rule-based (or associative) approaches have shown promising potential, because decisions are driven by a simple set of “if-then” rules. Liu et al. [11] were among the first to investigate association rule mining for single-label classification tasks, followed by extensions such as MCAR [12] and ADA [13]. These approaches are conceptually similar, but differ in their methodologies for rule learning, ranking, pruning, and prediction.

Concise single-label rule sets. Our work pursues, for the first time, the goal of designing a multi-label associative classifier that achieves a given classification performance with the smallest possible number of rules. A similar objective has recently been considered in the context of single-label classification. In particular, Zhang et al. [14] frame the problem of learning a set of classification rules as an optimization problem with an objective function combining rule quality and diversity. A 2-approximation algorithm is then proposed to solve this problem, which relies on existing frameworks for max-sum diversification and pattern sampling. In this paper, we investigate how to extend these ideas to the multi-label classification setting. The problem of controlling the number of rules used for prediction has also been studied for single-label rule boosting, where learned rules are combined additively [15]; an extension to multi-label classification represents a possible direction for future work. A different line of work, which has also addressed rule set conciseness [16, 17], relies on encoding single-label rule learning problems as SAT programs and solving them via SAT solvers. Finally, Wang et al. [18] propose to control the number of learned rules using a Bayesian approach.

In addition to the number of rules, different definitions of interpretability have been explored in association rule mining and single-label rule-based classification. For instance, rule set interpretability has been quantified by the number of conditions used by the rules [17] and by the minimum description length principle [19].

Rule-based approaches to multi-label classification. In general, adapting from the single-label to the multi-label setting is not trivial, and while single-label associative classification has been studied extensively, relatively few attempts have been made at associative multi-label classification. In an early work, Thabtah et al. [20] propose a label ranking-based assignment method. More recently, new approaches have been developed, and SeCo [21] and Boomer [2] represent the state of the art in rule-based multi-label classification. The main limitation of existing works, addressed in our paper, is that they use a very large set of highly redundant rules, which hinders interpretability. We compare our method against SeCo [21] and Boomer [2] in Sect. 8.

Pattern sampling. Association pattern discovery, which aims at discovering relevant association patterns between items in large datasets, is challenging due to the prohibitive size of the pattern space. This challenge is inherited by rule-based classifiers.

Deterministic approaches to association pattern discovery are based on exhaustive enumeration of the pattern space [22,23,24]. Furthermore, such approaches typically impose hard thresholds on some frequency-based measure (such as support and confidence) to distinguish interesting patterns from irrelevant ones.

Despite the remarkable progress made over the past years [22,23,24], deterministic approaches to association pattern discovery still incur several limitations [25,26,27]. First, the required enumeration (or partial enumeration) of the search space makes it hard to control the running time and consumed memory. Second, finding an appropriate threshold can be a non-intuitive and even cumbersome task. Third, these approaches generally do not allow enough flexibility in the choice of the interestingness measure and often return enormous amounts of redundant patterns that describe the same local phenomenon.

To overcome the aforementioned limitations, efficient sampling methods have been proposed for various pattern mining tasks [25, 26]. These methods sample directly from the output space of patterns, circumventing the enumeration of the search space; they do not require setting a hard threshold; they are robust to redundancy in the output patterns; and they allow great efficiency and flexibility in choosing the target distribution to sample patterns from.

In the multi-label classification setting, however, existing pattern samplers do not deliver satisfactory performance (Sects. 6 and 8). In this work, we extend existing methods to efficiently find high-quality candidate multi-label rules, as discussed in detail in Sects. 5 and 6.

3 Problem statement

At a high level, our objective is to capture the relevant patterns in the data that best discriminate sets of labels and are as concise as possible. Next we formally define the problem.

3.1 Preliminaries

We denote sets and multisets by uppercase letters, e.g., X. For a finite X, we denote by \(\mathcal {P} \left( X\right) \) its power set. We consider a binary dataset \(\mathcal {D}\) (all values are either 0 or 1) over a feature set \(\mathcal {F}\) and a label set \(\mathcal {L}\). The dataset \(\mathcal {D} \) is a set of data records, \(D _1, \ldots , D _{n}\). A data record or instance \(D =(F, L)\) consists of a set of features \(F \subseteq \mathcal {F} \) and a set of labels \(L \subseteq \mathcal {L} \). We denote by \(F _{D} \) and \(L _{D} \) the feature set and label set of \(D\), respectively. Furthermore, we denote by \(|\mathcal {F} |\) and \(|\mathcal {L} |\) the dimensions of the feature and label space, respectively, and we denote by \(|\mathcal {D} |\) the total number of data records. We use \(\Vert \mathcal {F} \Vert \) and \(\Vert \mathcal {L} \Vert \) to refer to the total number of feature and label occurrences over all data records.

In multi-label classification, the goal is to learn a function mapping as accurately as possible the features \(F _{D} \) to one or more labels \(L _{D}\). We use mappings consisting of conjunctive rules. A conjunctive rule \(R =({B} \rightarrow {H})\) is composed of a non-empty feature set \({B} \) (called body) and a non-empty label set \({H} \) (called head). The body \({B} \) can be viewed as a predicate \({B}: \left\{ 0, 1\right\} ^{|\mathcal {F} |} \rightarrow \left\{ \text {true}, \text {false}\right\} \), which states whether an instance contains all the features in \({B} \). If the predicate evaluates to \(\text {true}\) for some instance, the head of \(R \) specifies that the labels in \({H} \) should be predicted as present.
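To make these semantics concrete, the following sketch encodes data records and conjunctive rules as immutable sets. The class and method names are our own illustrative choices, not the paper's implementation.

```python
# Sketch of the rule semantics above; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    features: frozenset  # F_D
    labels: frozenset    # L_D

@dataclass(frozen=True)
class Rule:
    body: frozenset      # B: a conjunction of features
    head: frozenset      # H: labels predicted as present

    def covers(self, d: Record) -> bool:
        # The body matches D: all body features occur in D.
        return self.body <= d.features

    def matches(self, d: Record) -> bool:
        # The rule matches D: both body and head match.
        return self.covers(d) and self.head <= d.labels

d = Record(frozenset({"f1", "f2"}), frozenset({"a", "b"}))
r = Rule(frozenset({"f1"}), frozenset({"a"}))
print(r.covers(d), r.matches(d))  # True True
```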

We say that a body \({B} \) matches a data record \(D \) if \({B} \subseteq F _{D} \). Similarly, a head \({H} \) matches \(D \) if \({H} \subseteq L _{D} \). We also say that a rule \(R \) covers a data record \(D \) if \({B} _{R} \subseteq F _{D} \), and that \(R \) matches \(D \) if both \({B} _{R} \) and \({H} _{R} \) match \(D \). For a dataset \(\mathcal {D}\), we denote the support set of \(X \in \{{B}, {H}, R \}\) by:

$$\begin{aligned} \mathcal {D} \left[ X\right] =\left\{ D \in \mathcal {D} \mid X \text { matches } D \right\} . \end{aligned}$$

The space of all possible rules we consider is \(\mathcal {U} = \mathcal {P} \left( \mathcal {F} \right) \times \mathcal {P} \left( \mathcal {L} \right) \), i.e., the Cartesian product of the power set of the feature set and the power set of the label set.

As explained in the remainder of this section, our ultimate goal is to find the most succinct set of rules to model the main dependencies in the data that are relevant for multi-label classification. Rule sets are denoted by \(\mathcal {R} \). They are in disjunctive normal form, i.e., they are “OR of ANDs”, so that a label is predicted for a data record \(D\) if it belongs to the head \({H} \) of at least one of the rules \(R \in \mathcal {R} \) covering \(D\).

Before introducing the details of the proposed problem formulation, we remark that more complex and structured rule-based models, such as rule lists or decision trees, have emerged as popular alternatives to rule set models [10]. In addition, negative rules, which predict the absence of labels rather than their presence, have also been investigated [28].

The choice of using simple rule sets in disjunctive normal form is a consequence of the fundamental principle underlying this work, namely the design of a multi-label classifier that prioritizes interpretability while also offering accurate classification. Unlike rule lists and decision trees, which introduce hierarchical structures among the rules and thereby complicate interpretation, rule sets are easier to interpret due to the absence of hierarchy. Similarly, we do not consider negative rules because they may lead to labels being simultaneously predicted as present and absent, which also hampers interpretability.

3.2 Problem formulation

We want to discover rules that are accurate and general, but also sufficiently different from each other. To capture this trade-off, we design an objective function that consists of a quality term \(q: \mathcal {U} \rightarrow \mathbb {R} \) measuring the accuracy and generality of a single rule, and a diversity term \(d: \mathcal {U} \times \mathcal {U} \rightarrow \mathbb {R} \) measuring the distance between pairs of rules.

Quality term. Given a rule \(R\) and a set of rules \(\mathcal {R}\), the quality \(q (R; \mathcal {R})\) of \(R \) with respect to \(\mathcal {R}\) is the product of two values: the uncovered area \(\text {area}\! \left( R; \mathcal {R} \right) \), capturing the generality of \(R\) with respect to \(\mathcal {R}\), and its adjusted accuracy \(a (R)\),

$$\begin{aligned} q (R; \mathcal {R}) = \text {area}\! \left( R; \mathcal {R} \right) \cdot a (R). \end{aligned}$$
(1)

Next we describe these two functions. To capture generality, we first define the coverage of \(R \) as:

$$\begin{aligned} \text {cov}\! \left( R \right) = \left\{ \left( i, k\right) \mid R \text { matches } D _i \in \mathcal {D} \text { and } k \in {H} \right\} . \end{aligned}$$
(2)

In other words, the coverage of a rule is the set of label occurrences it matches in a dataset. To incorporate what is already covered by a set of selected rules \(\mathcal {R}\), we define uncovered area of \(R\) with respect to \(\mathcal {R}\) as

$$\begin{aligned} \text {area}\! \left( R; \mathcal {R} \right) = |\text {cov}\! \left( R \right) \setminus \bigcup \limits _{R ' \in \mathcal {R}} \text {cov}\! \left( R '\right) |, \end{aligned}$$
(3)

that is, the number of label occurrences covered by \(R\) after excluding those already covered by at least one rule in \(\mathcal {R}\). Thus, a rule \(R\) is considered general with respect to \(\mathcal {R}\) if \(\text {area}\! \left( R; \mathcal {R} \right) \) is large.
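The two quantities translate directly into code. The sketch below transcribes Eqs. (2) and (3), reusing the illustrative Record/Rule classes above and representing the dataset as a list of records.

```python
# Coverage (Eq. 2) and uncovered area (Eq. 3); illustrative sketch.
def cov(rule, dataset):
    # Label occurrences (record index, label) matched by the rule.
    return {(i, k) for i, d in enumerate(dataset)
            if rule.matches(d) for k in rule.head}

def area(rule, selected, dataset):
    # Size of the coverage of `rule` minus what `selected` already covers.
    covered = set()
    for r in selected:
        covered |= cov(r, dataset)
    return len(cov(rule, dataset) - covered)
```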

Before introducing the adjusted-accuracy function, we need some additional notation. Data records whose labels contain \({H}\) are said to be positive with respect to \({H}\), whilst the remaining ones are negative. More formally, a head \({H}\) bi-partitions a dataset \(\mathcal {D}\) into two disjoint sets: a set of positive data records \(\mathcal {D}^+_{{H}} = \{ D \in \mathcal {D} \mid {H} \subseteq L _{D} \}\) and a set of negative data records \(\mathcal {D}^-_{{H}} = \{ D \in \mathcal {D} \mid {H} \nsubseteq L _{D} \}\). Given a rule \(R\), let \(P_{\mathcal {D} \left[ R \right] } = \frac{|\mathcal {D} \left[ R \right] |}{|\mathcal {D} \left[ {B} \right] |}\) be the precision of \(R \), and let \(P_{\mathcal {D}} = \frac{|\mathcal {D} \left[ {H} \right] |}{|\mathcal {D} |}\) be the base rate of \({H}\) in \(\mathcal {D}\). We denote the corresponding binomial distributions by \(\text {Bin}\! \left( P_{\mathcal {D} \left[ R \right] } \right) \) and \(\text {Bin}\! \left( P_{\mathcal {D}} \right) \), respectively. Then, the adjusted accuracy of \(R\) is defined as:

$$\begin{aligned} a \left( R \right) = I\! \left( R \right) \cdot D_{\text {KL}}\! \left( \text {Bin}\! \left( P_{\mathcal {D} \left[ R \right] } \right) \mid \mid \text {Bin}\! \left( P_{\mathcal {D}} \right) \right) , \end{aligned}$$
(4)

where \(I\! \left( R \right) \) is 1 if \(P_{\mathcal {D} \left[ R \right] } > P_{\mathcal {D}} \) and 0 otherwise, and \(D_{\text {KL}}\! \left( \cdot \mid \mid \cdot \right) \) is the KL divergence between two probability distributions. The underlying intuition is that if the precision of a rule is below its base rate, it is useless and receives a zero score. If instead the precision of a rule is larger than the base rate, the higher the precision is, the larger the score.
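The adjusted accuracy is then a guarded KL divergence. The sketch below treats \(\text{Bin}(p)\) as a single-trial binomial (i.e., Bernoulli) distribution, which is our reading of Eq. (4) and should be taken as an assumption.

```python
# Adjusted accuracy (Eq. 4); Bin(p) is treated as Bernoulli(p).
from math import log

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q); assumes 0 < q < 1.
    out = 0.0
    if p > 0:
        out += p * log(p / q)
    if p < 1:
        out += (1 - p) * log((1 - p) / (1 - q))
    return out

def adjusted_accuracy(precision, base_rate):
    # Rules no better than the base rate are useless and score zero.
    if precision <= base_rate:
        return 0.0
    return kl_bernoulli(precision, base_rate)
```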

Diversity term. We measure the distance between two rules by how much their coverages overlap. Formally, given two rules \(R _1\) and \(R _2\), their distance is defined as

$$\begin{aligned} d \left( R _1, R _2\right) = 1 - \frac{|\text {cov}\! \left( R _1\right) \cap \text {cov}\! \left( R _2\right) |}{|\text {cov}\! \left( R _1\right) \cup \text {cov}\! \left( R _2\right) |}, \end{aligned}$$
(5)

which is the Jaccard distance between \(\text {cov}\! \left( R _1\right) \) and \(\text {cov}\! \left( R _2\right) \).

Total quality and diversity. We extend the quality function and diversity function to rule sets, since we want to learn a set of rules. The extension of diversity is straightforward. Given a rule set \(\mathcal {R}\), we define its total diversity as:

$$\begin{aligned} d \left( \mathcal {R} \right) = \sum \limits _{R _i, R _j \in \mathcal {R}, i \ne j} d (R _i, R _j), \end{aligned}$$
(6)

i.e., the sum of all pairwise diversity values in \(\mathcal {R}\).

The extension of the quality function from single rules to rule sets requires careful judgment. The total quality of a rule set \(\mathcal {R} \) is defined as:

$$\begin{aligned} q \left( \mathcal {R} \right) = \sum \limits _{R _i \in \mathcal {R}} q \left( R _i; \mathcal {R} _{1, \dots , i-1}\right) . \end{aligned}$$
(7)

Thus, this definition implicitly defines an order on the rules.

We argue that an order is required to make the total quality function meaningful. Consider an alternative definition: \(q '\left( \mathcal {R} \right) = \sum \nolimits _{R \in \mathcal {R}} q \left( R; \mathcal {R} {\setminus } \{R \}\right) \), which does not assume any order on \(\mathcal {R} \). Recall that \(q (R; \mathcal {R} {\setminus } \{R \}) = \text {area}\! \left( R; \mathcal {R} {\setminus } \{R \}\right) \cdot a (R)\), where \(\text {area}\! \left( R; \mathcal {R} {\setminus } \{R \}\right) \) counts the label occurrences uniquely covered by \(R \). The issue with the definition of \(q '\) is that the \(\text {area}\! \) term in \(q '\) only considers the label occurrences that are uniquely covered by a single rule and neglects those that are covered more than once. In contrast, it is more desirable to consider the union of the label occurrences covered by all rules in \(\mathcal {R} \). Our proposed definition in Eq. (7) fulfills this desideratum.

Problem definition. We frame the learning problem as a combinatorial optimization problem with budget constraint, where we set a budget on the maximum number of rules to discover, and rules should be selected to maximize a linear combination of the total quality and total diversity.

Problem 1

Given a dataset \(\mathcal {D} =\left\{ D _i\right\} _{i=1}^{n}\), a budget \(\mathcal {B} \in \mathbb {Z}_{+} \), a space of rules \(\mathcal {S} \subseteq \mathcal {U} \), and a parameter \(\lambda \in \mathbb {R} _+\), find a set of \(\mathcal {B} \) rules \(\mathcal {R} = \left\{ R _1, \ldots , R _\mathcal {B} \right\} \subseteq \mathcal {S} \), to maximize the following objective

$$\begin{aligned} f \left( \mathcal {R} \right) = q \left( \mathcal {R} \right) + \lambda d \left( \mathcal {R} \right) . \end{aligned}$$
(8)

This problem is known to be NP-hard [29]. In the next section, we present a greedy algorithm which finds a solution to Problem 1 with an approximation factor of 2, provided that the space of rules \(\mathcal {S} \) can be visited in polynomial time.
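Putting Eqs. (5)-(8) together, the objective of Problem 1 can be evaluated for an ordered rule set as follows. The sketch reuses the cov helper above; adj_acc, a map from rules to their adjusted accuracies, is assumed precomputed.

```python
# Objective f (Eq. 8) = ordered total quality (Eq. 7)
# + lambda * pairwise Jaccard diversity (Eqs. 5-6). Illustrative sketch.
from itertools import combinations

def jaccard_distance(c1, c2):
    union = c1 | c2
    return 1 - len(c1 & c2) / len(union) if union else 0.0

def objective(rules, dataset, lam, adj_acc):
    covered, quality = set(), 0.0
    for r in rules:                   # the order of `rules` matters (Eq. 7)
        c = cov(r, dataset)
        quality += len(c - covered) * adj_acc[r]
        covered |= c
    diversity = sum(jaccard_distance(cov(r1, dataset), cov(r2, dataset))
                    for r1, r2 in combinations(rules, 2))
    return quality + lam * diversity
```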

4 The CORSET learning algorithm

In this section, we present a meta algorithm named corset (\({\underline{co}ncise\, \underline{r}ule\, \underline{set}}\)) for Problem 1. corset greedily picks one rule at a time from a pool of candidate rules, so as to maximize a modified version of the marginal gain for the objective in Eq. (8), i.e.,

$$\begin{aligned} f '\left( \mathcal {R} \cup \left\{ R \right\} \right) - f '\left( \mathcal {R} \right) = \frac{1}{2} q \left( R; \mathcal {R} \right) + \lambda \sum \limits _{R _j \in \mathcal {R}} d (R, R _j). \end{aligned}$$
(9)

Here, \(f '\left( \mathcal {R} \right) \) is the same as \(f \left( \mathcal {R} \right) \) in our problem formulation, with the sole exception that the quality term is multiplied by \(\frac{1}{2}\). The difference can be absorbed by adjusting \(\lambda \) accordingly and is only needed for the proof of Proposition 1.

The candidate rules corset picks from are generated by a procedure called GenCandRules. The effectiveness of GenCandRules heavily affects the predictive performance of the classifier. The goal is to sample high-quality rules in terms of generality, diversity, and accuracy, that is, rules tailored to our objective. This is a challenging goal, since the size of the rule space is exponential [30], and GenCandRules should therefore avoid exploring the whole space. We defer the description of candidate generation to Sects. 5 and 6.

For now, we focus on the description of the main algorithm. corset maintains a set of selected rules, \(\mathcal {R} \), which is initially empty. At each iteration, it considers a pool of candidate rules generated by GenCandRules. Within this pool, the rule \(R ^*\) maximizing the marginal gain in Eq. (9) with respect to \(\mathcal {R}\) is selected and added to \(\mathcal {R} \). The process stops when the proportion of labels in \(\mathcal {L}\) predicted by \(\mathcal {R} \cup \{R ^*\}\) and not by \(\mathcal {R} \) falls below a user-specified tolerance level \(\tau \).

To ensure the aforementioned approximation guarantee, we repeat the greedy procedure a second time over the union of the rules generated in all iterations, because using a different candidate set at each iteration does not offer an approximation guarantee. Denote by \(\mathcal {R} '\) the rule set obtained in this second round. We return as solution whichever of \(\mathcal {R} \) and \(\mathcal {R} '\) yields the larger objective value, since in practice \(\mathcal {R} '\) is not necessarily better than \(\mathcal {R} \).

The pseudo-code of corset is shown in Algorithm 1. Note that GenCandRules receives as input the current rule set \(\mathcal {R} \), so as to generate rules different from \(\mathcal {R} \), as explained in the next section.
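The following skeleton illustrates the greedy loop with the marginal gain of Eq. (9) and the τ-based stopping criterion. Here gen_cand_rules stands in for GenCandRules, and the second greedy round over the pooled candidates is omitted for brevity; all names are illustrative.

```python
# Skeleton of the greedy selection in Algorithm 1 (illustrative sketch).
def marginal_gain(r, selected, dataset, lam, adj_acc, covered):
    # Eq. (9): half the quality gain plus the diversity gain.
    gain_q = 0.5 * len(cov(r, dataset) - covered) * adj_acc[r]
    gain_d = lam * sum(jaccard_distance(cov(r, dataset), cov(s, dataset))
                       for s in selected)
    return gain_q + gain_d

def greedy_select(dataset, lam, adj_acc, gen_cand_rules, tau, n_labels):
    selected, covered, pool_history = [], set(), []
    while True:
        pool = gen_cand_rules(selected)    # fresh candidates each round
        pool_history.extend(pool)          # kept for the second greedy pass
        best = max(pool, key=lambda r: marginal_gain(
            r, selected, dataset, lam, adj_acc, covered))
        # Stop when the best rule adds too few newly predicted labels.
        new = {k for _, k in cov(best, dataset)} - {k for _, k in covered}
        if len(new) / n_labels < tau:
            break
        selected.append(best)
        covered |= cov(best, dataset)
    return selected, pool_history
```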

We provide the approximation guarantee next.

Proposition 1

For a fixed pool of candidate rules, corset is a 2-approximation algorithm for Problem 1.

Proof

Problem 1 is a special case of a general problem known as max-sum diversification [29], which has been studied in the context of result diversification [31]. In the general formulation, given a set of elements U in a metric space and a set function \(f: 2^{U} \rightarrow \mathbb {R} \), the problem asks for a subset \(S \subseteq U\) that maximizes an objective function \(f(S) + \lambda \sum _{u,v \in S} dist(u,v)\) under cardinality (or more general) constraints, where dist is a pairwise distance function.

Setting f to be the quality function \(q \) in Eq. (7) and dist to be the Jaccard distance \(d \) in Eq. (5), it follows that Problem 1 is a special case of the max-sum diversification problem.

A result due to Borodin et al. [29] shows that a simple greedy algorithm, which iteratively picks the element maximizing the marginal gain in the objective function, guarantees an approximation factor of 2 for the max-sum diversification problem and therefore for Problem 1, as long as f is monotone and submodular and \(d \) is a metric.

As a consequence, to show that the 2-approximation guarantee holds for corset, it is sufficient to show that the quality term \(q \left( \mathcal {R} \right) \) is monotone and submodular, since the Jaccard distance \(d \) is known to be a proper metric [32].

First, \(q \left( \mathcal {R} \right) \) is a sum of nonnegative terms. Hence, it is a monotone (and non-decreasing) function.

Next, we show that \(q \left( \mathcal {R} \right) \) is also submodular. To prove the submodularity of \(q \), consider two rule sets \(\mathcal {R} '\) and \(\mathcal {R} ''\) such that \(\mathcal {R} ' \subseteq \mathcal {R} ''\). Given a new rule \(R ^*\), the marginal gain in quality resulting from adding \(R ^*\) to \(\mathcal {R} '\) is:

$$\begin{aligned} \Delta ' = q \left( \mathcal {R} ' \cup \{R ^*\}\right) - q \left( \mathcal {R} '\right) = a \left( R ^*\right) \times |\text {cov}\! \left( R ^*\right) \setminus \bigcup \limits _{R \in \mathcal {R} '} \text {cov}\! \left( R \right) |. \end{aligned}$$

Similarly, the marginal gain in quality for \(\mathcal {R} ''\) is:

$$\begin{aligned} \Delta ''&\; = q \left( \mathcal {R} '' \cup \{R ^*\}\right) - q \left( \mathcal {R} ''\right) = a \left( R ^*\right) \times |\text {cov}\! \left( R ^*\right) \setminus \bigcup \limits _{R \in \mathcal {R} ''} \text {cov}\! \left( R \right) | \\&\;= a \left( R ^*\right) \times |\text {cov}\! \left( R ^*\right) \setminus \bigcup \limits _{R \in \mathcal {R} '} \text {cov}\! \left( R \right) \setminus \bigcup \limits _{R \in \mathcal {R} '' \setminus \mathcal {R} '} \text {cov}\! \left( R \right) |. \end{aligned}$$

Since

$$\begin{aligned}&\; |\text {cov}\! \left( R ^*\right) \setminus \bigcup \limits _{R \in \mathcal {R} '} \text {cov}\! \left( R \right) | \ge |\text {cov}\! \left( R ^*\right) \setminus \bigcup \limits _{R \in \mathcal {R} '} \text {cov}\! \left( R \right) \setminus \bigcup \limits _{R \in \mathcal {R} '' \setminus \mathcal {R} '} \text {cov}\! \left( R \right) |, \end{aligned}$$

it immediately follows that \(\Delta '\ge \Delta ''\). Therefore, the quality function \(q \) is submodular.

We conclude that corset is a 2-approximation algorithm for Problem 1. This means that, given a space of candidate rules, corset is guaranteed to return rule sets achieving a value of the objective in Eq. (8) that is at least half of the maximum value of the objective function achieved by the optimal rule set(s). \(\square \)

It is important to note that the approximation guarantee does not hold with respect to the entire rule space, but thanks to the second round of greedy selection, it holds with respect to the entire pool of rules sampled in all the iterations of the algorithm.

Algorithm 1: The corset algorithm.

Prediction. At prediction time, given the set of selected rules \(\mathcal {R}\), we return the set of predicted labels for a data record \(D = (F _{D}, L _{D})\) as \(\bigcup _{R \in \mathcal {R} \mid {B} _{R} \subseteq F _{D}} {H} _{R} \), that is, the union of the heads of the rules such that \(F _{D} \) contains all features in the body \({B} \) or, equivalently, such that \(R \) covers \(D\).
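A sketch of the prediction step, reusing the earlier illustrative Rule/Record classes:

```python
# Predicted labels for a record: union of heads of covering rules.
def predict(rules, d):
    labels = set()
    for r in rules:
        if r.covers(d):          # B_R is contained in F_D
            labels |= r.head
    return labels
```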

5 Rule sampling

In the next two sections, we present the main contribution of our work, a suite of rule-sampling algorithms used by \(\textsc {GenCandRules} \). In this section, we first describe the technical basis of our proposal. Then, we formulate our sampling problem and present the algorithms for it. The devised sampling algorithms have, however, important limitations. In the next section, we discuss such limitations and describe practical enhancements.

5.1 Background: two-stage pattern sampling

Our sampling scheme builds on the pattern-sampling algorithms proposed by Boley et al. [25, 26]. These algorithms allow us to sample patterns according to a target distribution over the pattern space, without the need of exhaustive enumeration. The target distribution reflects a measure of interestingness for the patterns. Example measures include support, area, and, if the data are labelled, discriminativity. Sampling algorithms for a variety of measures share a two-stage structure, whilst the details depend on the measure under consideration.

The key insight brought by Boley et al. [25, 26] is that random experiments can be designed to reveal frequent events. We use sampling by support and by area for illustration. Consider a dataset \(\mathcal {D} =\left\{ D _1, \ldots , D _{n}\right\} \) over a finite ground set \(\mathcal {E} \), with \(D \subseteq \mathcal {E} \) for each \(D \in \mathcal {D} \). Consider the problem of sampling an itemset (pattern) \(F \subseteq \mathcal {E} \) with probability proportional to its support \(q_{\text {supp}}\! \left( F\right) =|\mathcal {D} \left[ F\right] |\).

For each \(D \in \mathcal {D} \), the set of itemsets including \(D \) in their support is \(\mathcal {P} \left( D \right) \). It can be shown that sampling an itemset F uniformly from the multiset \(\biguplus _{D \in \mathcal {D}} \mathcal {P} \left( D \right) \), where \(\biguplus \) denotes multiset union, is the same as sampling F with probability proportional to \(|\mathcal {D} \left[ F\right] |\). To avoid materializing \(\biguplus _{D \in \mathcal {D}} \mathcal {P} \left( D \right) \), Boley et al. use a two-step procedure:

  1. sample a data record \(D\) with probability proportional to the weight \(w \!\left( D \right) = \sum _{F \in \mathcal {P} \left( D \right) } 1 = 2^{|D |}\);

  2. sample an itemset F uniformly from \(\mathcal {P} \left( D \right) \).

To sample from the “area” distribution \(q_{\text {area}}\! \left( F\right) = |F| \, |\mathcal {D} \left[ F\right] |\), the above procedure is changed as follows: set weights \(w \!\left( D \right) = \sum _{F \in \mathcal {P} \left( D \right) } |F| = |D |\, 2^{|D | - 1}\), and then sample F from \(\mathcal {P} \left( D \right) \) with probability proportional to |F|.
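Both two-step procedures fit in a few lines. In the sketch below, the pivot trick for the area case is one standard way to draw a subset with probability proportional to its size; records are frozensets over the ground set, and all names are ours.

```python
# Two-stage sampling by support or area (illustrative sketch).
import random

def sample_itemset(records, by_area=False):
    # Stage 1: draw a record with weight 2^|D| (support)
    # or |D| * 2^(|D|-1) (area).
    weights = [len(d) * 2 ** (len(d) - 1) if by_area else 2 ** len(d)
               for d in records]
    d = random.choices(records, weights=weights, k=1)[0]
    items = sorted(d)
    if not by_area:
        # Stage 2 (support): F is uniform over the power set of D.
        return frozenset(x for x in items if random.random() < 0.5)
    # Stage 2 (area): Pr(F) proportional to |F|. Pivot trick: force one
    # item in, take the rest uniformly; summing over the possible pivots
    # of a subset F yields probability proportional to |F|.
    pivot = random.choice(items)
    rest = frozenset(x for x in items if x != pivot and random.random() < 0.5)
    return rest | {pivot}
```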

The two-stage sampling idea can be generalized to a number of other measures. Some of them, such as discriminativity, which we use later, require sampling tuples of data records rather than a single one in the first stage.

Next we describe two sampling distributions and the corresponding sampling algorithms for our objective. The first distribution is a generalization of the area function (not discussed by Boley et al. [25, 26]) and is used for head sampling. The second distribution is discriminativity used for body sampling. For discriminativity, we propose an improved sampling algorithm, which is faster than the original version [26].

5.2 Sampling objectives

Our rule sampling objective can be expressed using the chain rule of probabilities as a product of two values, one reflecting the generality of a rule \(R =({B} \rightarrow {H})\) given the current set of rules \(\mathcal {R}\), and the other its discriminative power:

$$\begin{aligned} \text {Pr}\left( {B}, {H} \right) \propto w ({B}, {H}; \mathcal {R}) = q_{a}\! \left( {B}; {H} \right) \cdot \text {area}\! \left( {H}; \mathcal {R} \right) . \end{aligned}$$
(10)

Note that the uncovered area function in Eq. (3) generalizes to heads, i.e., \(\text {area}\! \left( {H}; \mathcal {R} \right) = |\text {cov}\! \left( {H} \right) {\setminus } \bigcup _{{H} ^{'} \in \mathcal {R}} \text {cov}\! \,({H} ^{'})|\), where \(\text {cov}\! \left( {H} \right) = \left\{ \left( i, k\right) \mid {H} \text { matches } D _i \in \mathcal {D} \text { and } k \in {H} \right\} \) and the union ranges over the heads \({H} ^{'}\) of the rules in \(\mathcal {R} \). We use \(\text {area}\! \left( {H}; \mathcal {R} \right) \) because we want to extract heads that cover the largest possible area and are as diverse as possible.

For \(q_{a}\! \), we choose the discriminativity measure studied by Boley et al. [25], which permits sampling in polynomial time.

Given a head \({H} \subseteq \mathcal {L} \), the discriminativity of \({B} \) is defined as:

$$\begin{aligned} q_\text {disc}\!\left( {B}; {H} \right) = |\mathcal {D}^+_{{H}} \left[ {B} \right] |\; |\mathcal {D}^-_{{H}} \setminus \mathcal {D}^-_{{H}} \left[ {B} \right] |. \end{aligned}$$
(11)

The goal is to sample bodies that have as large support as possible in \(\mathcal {D}^+_{{H}}\) and as small support as possible in \(\mathcal {D}^-_{{H}}\). Thus, discriminativity captures the ability of a body \({B}\) in discriminating between the presence and absence of a given head \({H}\) in data records.

To sample from the distribution in Eq. (10), we use the following steps:

  1. sample \({H}\) with probability proportional to \(\text {area}\! \left( {H}; \mathcal {R} \right) \);

  2. sample \({B}\) with probability proportional to \(q_{a}\! \left( {B}; {H} \right) \).

We explain each sampling step next.

5.3 Head sampling

To sample from \(\text {area}\! \left( {H}; \mathcal {R} \right) \), we apply a two-step sampling procedure similar to that of Boley et al. [25]: we first sample a data record \(D\) with probability proportional to its weight \(w \!\left( D ; \mathcal {R} \right) \) and then sample \({H}\) from \(D\). The function \(\text {area}\! \left( {H}; \mathcal {R} \right) \) is a generalization of the area function considered by Boley et al. [25]. Adapting the original algorithm to our case requires designing a weight function \(w \!\left( \cdot ; \mathcal {R} \right) \) appropriate for our target. To define \(w \!\left( \cdot ; \mathcal {R} \right) \), a few new definitions are needed. Given a rule \(R\) and a data record \(D\), the \(D\)-specific coverage of \(R\) is defined as:

$$\begin{aligned} \text {cov}\! _{D} \!\left( R \right) = {\left\{ \begin{array}{ll} L _{D} \cap {H} _{R}, &{}\text { if } R \text { covers } D, \\ \emptyset , &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(12)

Extending \(D\)-specific coverage to a rule set \(\mathcal {R}\), we have:

$$\begin{aligned} \text {cov}\! _{D} \!\left( \mathcal {R} \right) = \bigcup \limits _{R \in \mathcal {R}} \text {cov}\! _{D} \!\left( R \right) . \end{aligned}$$
(13)

Given a label set \({H}\), its marginal coverage with respect to \(\mathcal {R}\)  is:

$$\begin{aligned} \text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) = \left( L _{D} \cap {H} \right) \setminus \text {cov}\! _{D} \!\left( \mathcal {R} \right) , \end{aligned}$$
(14)

that is, the covered label occurrences in \(D\) by \({H}\), excluding those by \(\mathcal {R}\). As a shortcut, we define \(\overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) = \text {cov}\! _{D}\!\left( L _{D}; \mathcal {R} \right) \), i.e., the set of label occurrences in \(D\) not covered by \(\mathcal {R}\).

The weight of a label set \({H} \) on a data record \(D\) is:

$$\begin{aligned} w \!\left( {H}, D ; \mathcal {R} \right) = |\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) |. \end{aligned}$$
(15)

We give a small example to illustrate these definitions. Consider, for instance, a data record \(D\) with \(L _{D} = \{a, b, c\}\), a head \({H} = \{b, c\}\), and a rule set \(\mathcal {R} = \{R _1, R _2, R _3\}\), where \(R _1\) and \(R _2\) cover \(D\) with \({H} _{R _1} \cap L _{D} = \left\{ a\right\} \) and \({H} _{R _2} \cap L _{D} = \left\{ a, b\right\} \), while \(R _3\) does not cover \(D\).

For \(R _1\) and \(R _2\), the sets \(\text {cov}\! _{D} \!\left( \cdot \right) \) are \(\left\{ a\right\} , \left\{ a, b\right\} \), respectively. For \(R _3\), the set \(\text {cov}\! _{D} \!\left( R _3\right) \) is \(\emptyset \) since \(R _3\) does not cover \(D\). Therefore, \(\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) = \left\{ c\right\} \) and \(w \!\left( {H}, D ; \mathcal {R} \right) = 1\).

The intuition of the definition of \(w \!\left( {H}, D ; \mathcal {R} \right) \) is that \({H} \) has large weight on \(D \) if it contains many label occurrences not covered by \(\mathcal {R} \). Therefore, the weight of any data record \(D\) is simply the summation of the weights over all possible heads and has the following simple form:

$$\begin{aligned} w \!\left( D ; \mathcal {R} \right)&= \sum _{{H} \subseteq L _{D}} w \!\left( {H}, D ; \mathcal {R} \right) \nonumber \\&= \sum _{{H} \subseteq L _{D}} |\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) | \nonumber \\&= \sum _{\begin{array}{c} ({H} _1 \cup {H} _2) \subseteq L _{D} \\ \text { s.t. } {H} _1 \subseteq \overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) , \\ {H} _2 \subseteq \text {cov}\! _{D} \left( \mathcal {R} \right) , \\ {H} _1 \cap {H} _2 = \emptyset \end{array}} |{H} _1| \nonumber \\&= \sum _{{H} _1 \subseteq \overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) } |{H} _1| \sum _{{H} _2 \subseteq \text {cov}\! _{D} \left( \mathcal {R} \right) } 1\nonumber \\&= \left( \left| \overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) \right| 2^{|\overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) | - 1}\right) 2^{|\text {cov}\! _{D} \left( \mathcal {R} \right) |} \nonumber \\&= |\overline{\text {cov}\!}_{D}\!\left( \mathcal {R} \right) | \;2^{|L _{D} | - 1}. \end{aligned}$$
(16)

In the third equality, a head \({H} \) is split into two disjoint parts, \({H} _1 \text { and } {H} _2\), such that \({H} _1\) belongs to the uncovered label occurrences, whereas \({H} _2\) belongs to the already covered ones. The remaining equalities follow from simple algebra.

Using these weights, we adapt the sampling algorithm of Boley et al. [25, 26] as per Algorithm 2.

Algorithm 2: Two-stage head sampling.
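The sketch below illustrates the two stages: records are drawn with the weights of Eq. (16), and a head is then drawn from the chosen record with weight \(|\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) |\) via a pivot trick on the uncovered labels. The data structures are illustrative.

```python
# Two-stage head sampling (sketch of Algorithm 2).
import random

def sample_head(records, covered_of):
    # covered_of[d]: labels of record d already covered by the current
    # rule set R (illustrative representation).
    uncov = {d: d.labels - covered_of[d] for d in records}
    # Stage 1: weight of D is |uncov(D)| * 2^(|L_D| - 1), as in Eq. (16).
    weights = [len(uncov[d]) * 2 ** (len(d.labels) - 1) for d in records]
    d = random.choices(records, weights=weights, k=1)[0]
    # Stage 2: a head H (subset of L_D) has weight |H & uncov(D)|. Pivot on
    # one uncovered label, then include each other label of D independently
    # with probability 1/2.
    pivot = random.choice(sorted(uncov[d]))
    rest = {k for k in d.labels if k != pivot and random.random() < 0.5}
    return frozenset(rest | {pivot})
```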

The sampling algorithm is supported by probabilistic guarantees, as formalized next.

Proposition 2

Algorithm 2 returns \({H} \sim \text {area}\! \left( {H}; \mathcal {R} \right) \).

Proof

We prove that Algorithm 2 returns \({H} \sim \text {area}\! \left( {H}; \mathcal {R} \right) \).

$$\begin{aligned}&Pr({H} \text { is drawn})=\sum _{D \in \mathcal {D}} Pr({H} \text { is drawn and } D \text { is drawn} )\\&\quad = \sum _{D \in \mathcal {D}} Pr(D \text { is drawn} ) Pr({H} \text { is drawn from } \mathcal {P} \left( L _{D} \right) ) \\&\quad = \sum _{D \in \mathcal {D} \left[ {H} \right] } Pr(D \text { is drawn} ) Pr({H} \text { is drawn from } \mathcal {P} \left( L _{D} \right) ) \\&\quad \propto \sum _{D \in \mathcal {D} \left[ {H} \right] } w(D;\mathcal {R}) \times \frac{w({H}, D; \mathcal {R})}{w(D;\mathcal {R})} \\&\quad = \sum _{D \in \mathcal {D} \left[ {H} \right] }w({H}, D; \mathcal {R}) = \sum _{D \in \mathcal {D} \left[ {H} \right] } |\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) | = \text {area}\! \left( {H}; \mathcal {R} \right) . \end{aligned}$$

The first and second equalities follow from the law of total probability and the chain rule of probabilities, respectively. The third equality holds because \({H} \) can only be sampled from \(D \in \mathcal {D} \left[ {H} \right] \): if \(D \) is not in \(\mathcal {D} \left[ {H} \right] \), it has zero probability of generating \({H}\). In the fourth equality, we have used the definitions of \(w \!\left( {H}, D ; \mathcal {R} \right) \) and \(w \!\left( D ; \mathcal {R} \right) \), as well as the equality:

$$\begin{aligned} \sum _{{H} \subseteq L _{D}} w \!\left( {H}, D ; \mathcal {R} \right) = w \!\left( D ; \mathcal {R} \right) . \end{aligned}$$

Finally, the last equality follows since \(\text {area}\! \left( {H}; \mathcal {R} \right) \) can be obtained by definition as the sum of the marginal coverage \(|\text {cov}\! _{D}\!\left( {H}; \mathcal {R} \right) |\) of \({H} \) on \(D\), given \(\mathcal {R}\), over all data records \(D \in \mathcal {D} \left[ {H} \right] \). \(\square \)

5.4 Body sampling

After a head \({H}\) is sampled, we sample \({B}\) according to \(q_{a}\! \left( {B}; {H} \right) = q_\text {disc}\!\left( {B}; {H} \right) \), from Eq. (11). The two-stage sampling scheme by Boley et al. [25, 26] can be applied for this case. In contrast to the previous cases, the weight function is defined on pairs of data records:

$$\begin{aligned} w (D^+, D^-) = 2^{|D^+ | }- 2^{|D^+ \cap D^- |} - |D^+ \setminus D^- |, \end{aligned}$$
(17)

where \(D^+ \in \mathcal {D}^+_{{H}} \) and \(D^- \in \mathcal {D}^-_{{H}} \), and \(|D^+ |\) (resp. \(|D^- |\)) denotes the number of features present in \(D^+ \) (resp. \(D^- \)). Thus, pre-computing the weights leads to quadratic space complexity in \(|\mathcal {D} |\), which limits the practicality of the sampling procedure.

The above limitation is addressed by Boley et al. [26] using the technique of coupling from the past (cftp), which leads to linear space complexity. Unlike many Markov chain Monte Carlo (mcmc) methods, cftp guarantees that samples are generated exactly according to the target distribution. It operates by simulating the Markov chain backwards, sampling from a proposal distribution, until all states coalesce to the same unique state. The main challenges in using cftp are the design of the proposal distribution and the efficient monitoring of the coalescence condition.

The proposal distribution should be (i) efficient to sample from; and (ii) an appropriate approximation to the target distribution to obtain fast convergence. Boley et al. [26] devise a “general-purpose” proposal distribution, which works for all target distributions they consider. For the case of discriminativity, the proposal distribution is defined as:

$$\begin{aligned} \overline{w}(D^+, D^-) = \sqrt{w _1\!\left( D^+ \right) \cdot w _2\!\left( D^- \right) }, \end{aligned}$$
(18)

where \(w _1\!\left( D^+ \right) =2^{|D^+ |} - |D^+ | - 1\) and \(w _2\!\left( D^- \right) =2^{|\mathcal {F} |} - 2^{|D^- |} - |D^- | - 1\). Sampling from \(\overline{w}\!\left( \cdot , \cdot \right) \) can be done efficiently by sampling separately from \(w _1\!\left( \cdot \right) \) and \(w _2\!\left( \cdot \right) \). However, we argue that \(\overline{w}\!\left( \cdot , \cdot \right) \) is not a good approximation of the target in our setting, and therefore it suffers from slow convergence; we provide empirical evidence to support this claim in Sect. 8. The reason is that when data records are high-dimensional but sparse, as is often the case in multi-label classification, \(w _2\!\left( D^- \right) \), and hence \(\overline{w}\!\left( D^+, D^- \right) \), grows exponentially with the number of features, making the acceptance probability extremely low and, as a consequence, convergence extremely slow.

To overcome the convergence issue, we use a different proposal distribution, better suited to our setting. Our proposal is the same as in Eq. (18), except that \(w _2\) is replaced by a uniform function and the square root is removed:

$$\begin{aligned} \overline{w}'(D^+, D^-) = w _1\!\left( D^+ \right) = 2^{|D^+ |} - 1 - |D^+ |. \end{aligned}$$
(19)

It is easy to sample from this modified proposal distribution because we can sample \(D^+\) weighted by \(w _1\!\left( D^+ \right) \) and \(D^-\) uniformly at random. Further, \(\overline{w}'\) has an appealing property:

Proposition 3

\(\overline{w}'\) is a tight upper bound of the target weight function \(w \) in Eq. (17).

Proof

First, \(2^{|D^+ |} - 1 - |D^+ |\) is the value taken on by the weight function \(w (D^+, D^-)\) for a tuple \((D^+, D^-)\) when \(D^+ \cap D^- = \emptyset \).

Also when \(|D^+ \cap D^- | = 1\), we have \(w(D^+, D^-) = 2^{|D^+|} - 2 - (|D^+ |-1) = 2^{|D^+ |} -1 - |D^+ |\).

Finally, when \(|D^+ \cap D^- | = |I| \ge 2\), we have \(w(D^+, D^-) = 2^{|D^+ |} - 2^{|I|} -( |D^+ | - |I|) = 2^{|D^+ |} - 2^{|I|} - |D^+ | + |I|\). Because \(2^{|I|} \ge |I|\) for all I and \(2^{|I|}\) grows at a faster rate than |I|, the target weight function \(w (D^+, D^-)\) achieves its maximum \(2^{|D^+ |} -1 - |D^+ |\) at \(|I| = 0\) and \(|I| = 1\) and takes smaller values for \(|I|\ge 2\). It follows that \(\overline{w}'(D^+, D^-) = 2^{|D^+ |} - 1 - |D^+ |\) is a tight upper bound for \(w (D^+, D^-)\), as claimed above. \(\square \)

Proposition 3 ensures that our proposal distribution \(\overline{w}'\) provides a better approximation of the target, as compared to \(\overline{w}\) proposed by Boley et al. [26]. Therefore, it is expected that \(\overline{w}'\) gives faster convergence than \(\overline{w}\). We empirically evaluate the convergence speed in Sect. 8.

Body sampling is summarized in Algorithm 3. We first use cftp (lines 1-9) to sample a pair \(\left( D^+, D^- \right) \); then, we sample a body in line 11. We denote by \(u(\cdot )\) the uniform distribution over a set. For brevity, we use a boldface letter to denote a pair (tuple) of records, e.g., \(\textbf{D} \). We denote an empty pair by \(\perp \) and define \(\overline{w}(\perp ) / w (\perp ) = 1\).

Algorithm 3: Two-stage body sampling.
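Because Proposition 3 guarantees that \(\overline{w}'\) upper-bounds \(w \), an exact rejection sampler over record pairs offers a simple alternative illustration of the first stage; Algorithm 3 uses cftp instead. The sketch and its names are ours, and it assumes at least one positive record has two or more features.

```python
# First stage of body sampling via exact rejection (illustrative sketch;
# the paper's Algorithm 3 uses coupling from the past instead).
import random

def w_target(dp, dn):
    # Eq. (17); dp, dn are frozensets of features.
    return 2 ** len(dp) - 2 ** len(dp & dn) - len(dp - dn)

def w_proposal(dp):
    # Eq. (19); independent of the negative record.
    return 2 ** len(dp) - 1 - len(dp)

def sample_pair(positives, negatives):
    wp = [w_proposal(dp) for dp in positives]   # assumes some weight > 0
    while True:
        dp = random.choices(positives, weights=wp, k=1)[0]
        dn = random.choice(negatives)           # uniform over negatives
        # Accept with probability w / w' <= 1 (Proposition 3); accepted
        # pairs are distributed according to the target weight w.
        if random.random() < w_target(dp, dn) / w_proposal(dp):
            return dp, dn
```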

6 Enhancements to the sampling scheme

In this section, we first discuss important drawbacks of the pattern sampling schemes described in Sect. 5 and then describe our enhancements to tackle these drawbacks. The benefits of such enhancements are thoroughly demonstrated via extensive experiments presented in Sect. 8.

6.1 Limitations of the two-stage pattern-sampling framework

While theoretically sound, in our setting, the two-stage sampling framework [25, 26] suffers from two limitations, as can be verified empirically. First, we observe that most of the sampled rules are very specific, with very low support. Second, rule interpretability is not explicitly considered.

Heavy-hitter problem for head sampling. Consider the head-sampling step. Notice that the weight of a data record \(D \) in Eq. (16) is exponential in \(|L _{D} |\). If there is a data record \(D \in \mathcal {D} \) whose \(|L _{D} |\) is moderately larger than the rest, its weight dominates, making it very likely to be sampled in the first sampling step. We refer to this issue as the heavy-hitter problem. For instance, in \(\textsf{bibtex}\), the largest label set of a data record \(D ^*\) contains 28 labels, while the second largest contains 16. The probability of \(D ^*\) being sampled is \(99.97\%\). A head sampled from \(D ^*\) has an expected length of 14.5, and empirically, heads of about this length match only a few data records. Thus, most of the sampled heads have low support, hampering the goal of sampling general rules.

Heavy-hitter problem for body sampling. A similar issue arises in body sampling. The weight function in Eq. (17) grows exponentially with \(|D^+|\), so that cftp most likely returns the positive data records with the highest number of present features. Therefore, sampled bodies tend to be very long and have small support (often 1). Thus, they may have high discriminativity but cannot generalize to unseen data.

Head interpretability. Interpretability of heads is a central focus of our work. Nonetheless, in the original pattern-sampling algorithms [25, 26], all elements of \(\mathcal {P} \!\left( \mathcal {L} \right) \) are considered possible heads, regardless of whether they are interpretable.

The original head-sampling algorithms described in Sect. 5.3 sample new heads based on global correlations. Some heads are sampled simply because the labels in them co-occur frequently rather than because the labels are strongly correlated. Such heads may contain overall prominent labels, but they may be hard to interpret.

The root of the above limitations is the enormous sample space under consideration, which contains a large number of undesirable candidates: bodies and heads with small support, as well as heads with limited interpretability. Next, we describe a strategy to eliminate such undesirable bodies and heads, thus effectively addressing the above limitations of the original two-stage pattern-sampling framework.

6.2 Head sampling under interpretable label space

We propose to restrict the label sample space to a much smaller space \(\mathcal {S} ^{-} \subseteq \mathcal {P} \!\left( \mathcal {L} \right) \), designed to contain only interpretable label sets, so as to mitigate the heavy-hitter problem. We call \(\mathcal {S} ^{-} \) the interpretable label space. Before describing the construction of \(\mathcal {S} ^{-} \), we note that pattern sampling under any subspace of \(\mathcal {P} \!\left( \mathcal {L} \right) \) is a slight generalization of the original sampling setting. Most importantly, the original sampling algorithms can be adapted to different sample spaces, such as \(\mathcal {S} ^{-}\), while preserving their probabilistic guarantees.

In Algorithm 4, we describe a procedure for sampling by uncovered area under \(\mathcal {S} ^{-} \). The algorithm can be easily adapted to other sampling objectives, including discriminativity. Compared to sampling under \(\mathcal {P} \!\left( \mathcal {L} \right) \), we require the extra step of determining, for each \(D \in \mathcal {D} \), the set \(I \left[ D \right] \) of patterns in \(\mathcal {S} ^{-} \) contained in \(L _{D} \), and computing the weight of \(D \) accordingly.

Algorithm 4: Head sampling under \(\mathcal {S} ^{-} \) according to uncovered area.

Constructing \(\mathcal {S} ^{-}\). To construct the interpretable label space, we first define interpretability in our setting. Humans tend to think in an associative manner [33]. The underlying cognitive process is called associative activation, which can be described as “ideas that have been evoked trigger many other ideas, in a spreading cascade of activity in your brain” [34, p. 51]. To accommodate this tendency, we argue that a label set is interpretable if the corresponding labels are sufficiently associated. The problem of constructing \(\mathcal {S} ^{-}\) is then framed as finding sufficiently associated label sets. We rely on a graph-based approach whereby we construct a suitable label graph and extract its dense subgraphs. Specifically, we construct a directed weighted graph \(\mathcal {G} =\left( V, E, p \right) \). Each node represents a label. A node pair \((u, v)\) is an edge in \(E \) if \(\mathcal {D} \left[ \left\{ u\right\} \right] \cap \mathcal {D} \left[ \left\{ v\right\} \right] \ne \emptyset \). The corresponding weight is defined as \(p(u, v) = \frac{|\mathcal {D} \left[ \left\{ u\right\} \right] \cap \mathcal {D} \left[ \left\{ v\right\} \right] |}{|\mathcal {D} \left[ \left\{ u\right\} \right] |}\), which can be interpreted as the conditional probability that label v occurs given that label u occurs.

The need for edge directions arises because, in real-world multi-label datasets, the association between labels can be asymmetric. For instance, in \(\textsf{bibtex}\), general labels, such as \(\textsf{statistical physics}\), and specific ones, such as \(\textsf{simulation}\), co-exist. The asymmetry implies that a single value assigned to a pair of labels cannot fully capture their interaction: the specific label \(\textsf{simulation}\) is likely to co-occur with \(\textsf{statistical physics}\) in most of its occurrences, while the converse is not true.

The probabilistic interpretation of the edge weights suggests that \(\mathcal {G} \) can be viewed as a probabilistic graph [35]. From this point of view, the problem of finding sufficiently associated label sets can be cast as finding highly probable cliques in \(\mathcal {G} \) [36], i.e., cliques whose probability of forming exceeds a pre-specified threshold. To solve this problem, we adapt a recursive depth-first search (DFS) procedure similar to the one proposed by Mukherjee et al. [36].
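The following sketch conveys the flavor of such a procedure: it enumerates label sets whose clique probability is at least a threshold `theta`, exploiting the fact that the product of edge weights can only shrink as labels are added (the actual procedure of [36] applies further pruning).

```python
def probable_cliques(edges, labels, theta):
    """Enumerate label sets whose probability of forming a clique --
    here, the product of the weights of all directed edges among the
    labels -- is at least `theta`.  Since all weights are <= 1, the
    product is monotonically non-increasing, so cutting a branch that
    falls below `theta` is safe."""
    results = []

    def extend(clique, prob, candidates):
        if len(clique) >= 2:
            results.append(frozenset(clique))
        for i, v in enumerate(candidates):
            p, ok = prob, True
            for u in clique:
                if (u, v) in edges and (v, u) in edges:
                    p *= edges[(u, v)] * edges[(v, u)]
                else:
                    ok = False
                    break
            if ok and p >= theta:
                extend(clique + [v], p, candidates[i + 1:])

    extend([], 1.0, list(labels))
    return results
```

Combined with the previous sketch, `probable_cliques(build_label_graph(Y), range(Y.shape[1]), theta)` would yield a candidate \(\mathcal {S} ^{-} \).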

Efficient preprocessing. Line 1 of Algorithm 4 can be executed efficiently by framing the problem appropriately: given the set of subsets \(\mathcal {S} ^{-}\), we must find, for each \(D \in \mathcal {D} \), the subsets in \(\mathcal {S} ^{-}\) that are contained in \(L _{D} \). A naive solution checks the containment relation for every pair of \(L _{D} \) and element of \(\mathcal {S} ^{-} \), which in practice can take hours on many datasets. The problem, however, is an instance of the set containment problem, extensively studied by the database community. Among the several efficient solutions proposed for this problem, we resort to the well-established PRETTI algorithm [37], built upon inverted indices and prefix trees, which effectively brings the running time down to a few seconds.
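As an illustration of the underlying idea (omitting PRETTI's prefix tree), an inverted index suffices to compute \(I \left[ D \right] \) for all records at once; the sketch below assumes records and patterns are given as sets of labels:

```python
from collections import defaultdict

def restricted_patterns_per_record(records, patterns):
    """For every record (a set of labels), list the patterns of S^-
    contained in it (the set I[D] of line 1 of Algorithm 4).
    Intersecting the inverted lists of a pattern's labels yields
    exactly the records containing the whole pattern."""
    inverted = defaultdict(set)              # label -> record ids
    for i, labels in enumerate(records):
        for l in labels:
            inverted[l].add(i)
    I = defaultdict(list)                    # record id -> patterns
    for S in patterns:
        if not S:
            continue
        # intersect shortest lists first to keep intermediates small
        lists = sorted((inverted[l] for l in S), key=len)
        for i in set(lists[0]).intersection(*lists[1:]):
            I[i].append(S)
    return I
```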

6.3 Improved body sampling

To alleviate the heavy-hitter problem during body sampling, we consider two approaches. The first approach is based on a reduced sample space but may have scalability issues. The second is a greedy heuristic, which explicitly maximizes a modified version of discriminativity.

1.  Using a reduced sample space. We adopt an idea similar to that used in head sampling (Sect. 6.2) and use a reduced sample space \(\mathcal {S} ^{-}\) for body sampling. In practice, however, while the label matrix in real multi-label classification tasks is typically sparse, the feature matrix can be dense. In this case, a scalability issue arises, since the DFS procedure may take exponential time. For sufficiently sparse graphs this is not a concern, whereas in denser graphs constructing \(\mathcal {S} ^{-}\) becomes a key scalability bottleneck.

2.  A greedy heuristic. To address the above scalability issue, we propose a greedy heuristic, which drops the probabilistic guarantee but is highly effective in practice. We use cftp, as in Algorithm 3, to sample a tuple \(\left( D^+, D^- \right) \). Then, we greedily select features in \(D^+ \setminus D^- \) to maximize a modified version of discriminativity: for any \({B}\), we define the measure

$$\begin{aligned} \phi ({B}) = |\mathcal {D} \left[ {B} \right] \cap \mathcal {D}^+_{{H}} | - \gamma |\mathcal {D} \left[ {B} \right] \cap \mathcal {D}^-_{{H}} |, \end{aligned}$$
(20)

where \(\gamma \) weighs the relative importance of positive and negative support: smaller values of \(\gamma \) lead to more general but more error-prone bodies. Further, we apply early stopping (controlled by \(\epsilon \)) when \(|\mathcal {D} \left[ H\right] |\) is too small.

The algorithm is described in Algorithm 5. It iteratively picks the feature \(h \in F _{D^+} \cup F _{D^-}\) that maximizes the marginal gain of \(\phi \). The best feature \(h^*\) is added to \({B} \) and the support is updated accordingly. Finally, a linear sweep over \({B} \) finds the body with the highest objective value (Eq. (20)). In practice, we use a pre-computed inverted index to allow for efficient intersection of supports. We have also investigated variations of Algorithm 5 in which the input is deterministic, the difference in line 4 is normalized by \(|\mathcal {D} [\{h\}]|\), and the support is replaced by the portion of support not belonging to the support of previously chosen rules.

Summary. The second approach scales better than the first for dense feature matrices. The first approach, however, has the following advantages: (1) body sampling retains probabilistic guarantees; (2) it is much faster when the feature matrix is sparse; (3) bodies in the reduced sample space \(\mathcal {S} ^{-}\) are composed of highly correlated attributes and are hence easy to interpret. In the sequel, we use corset-surs to denote the version that uses the first approach for body sampling, and corset-gh the version that uses the second.

Algorithm 5 A greedy heuristic for body sampling.
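A compact Python sketch of the heuristic follows; the variable names and the restriction of supports to \(\mathcal {D}^+_{{H}} \cup \mathcal {D}^-_{{H}}\) are ours, and the early-stopping check on \(|\mathcal {D} \left[ H\right] |\) is omitted for brevity:

```python
def greedy_body(candidates, support, pos, neg, gamma=0.5):
    """Greedy body construction in the spirit of Algorithm 5.
    `candidates` holds the features of D+ and D-, `support[f]` is the
    set of record ids containing feature f, and `pos`/`neg` are the
    positive and negative supports of the head.  Features are added one
    at a time to maximize the marginal gain of phi (Eq. (20)); a final
    sweep returns the best prefix of the greedy sequence."""
    def phi(records):
        return len(records & pos) - gamma * len(records & neg)

    # D[B] restricted to pos | neg; records outside never affect phi.
    covered = pos | neg
    body, trace = [], []
    remaining = set(candidates)
    while remaining:
        best = max(remaining, key=lambda f: phi(covered & support[f]))
        covered &= support[best]
        body.append(best)
        trace.append(phi(covered))
        remaining.discard(best)
    if not trace:
        return []
    k = max(range(len(trace)), key=trace.__getitem__)  # best prefix
    return body[:k + 1]
```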

7 Complexity analysis

Time complexity. Let \(T_f \) be the time complexity of evaluating the quality and diversity functions; \(T_f \) is bounded by \(|\mathcal {D} | ( |\mathcal {F} | + |\mathcal {L} | )\). Let \(\mathcal {S} ^{-} _{\mathcal {L}}\) be the interpretable sample space for head sampling and \(\mathcal {S} ^{-} _{\mathcal {F}}\) the reduced sample space for body sampling. The pre-processing times \(T_S ^{\mathcal {L}}\) and \(T_S ^{\mathcal {F}}\) to construct \(\mathcal {S} ^{-} _{\mathcal {L}}\) and \(\mathcal {S} ^{-} _{\mathcal {F}}\) are exponential in the worst case. It follows that the time complexity of corset is \(\mathcal {O} ( B |\mathcal {C}_R | T_f + T_S ^{\mathcal {L}} + T_S ^{\mathcal {F}} )\). If \(\mathcal {F}\) is sufficiently sparse, the exponential worst case is not a concern in practice. When \(\mathcal {F}\) is dense, it is appropriate to use corset-gh, whose time complexity is \(\mathcal {O} ( B |\mathcal {C}_R | T_f + T_S ^{\mathcal {L}} )\).

Space complexity. Space \(S_R \!=\!\mathcal {O} (|\mathcal {D} | + |\mathcal {F} | + |\mathcal {L} |)\) is required to keep a single rule, as we store its body, head, and support. Let \(S_S \) denote the space complexity of sampling. In both head and (owing to cftp) body sampling, we only need to store a single weight value for each data record. Building \(\mathcal {S} ^{-} _{\mathcal {L}}\) and \(\mathcal {S} ^{-} _{\mathcal {F}}\) requires space \(\mathcal {O} (|\mathcal {L} |^2)\) and \(\mathcal {O} (|\mathcal {F} |^2)\), respectively, although the corresponding graphs are very sparse in practice. Furthermore, storing samples from \(\mathcal {S} ^{-} _{\mathcal {L}}\) and \(\mathcal {S} ^{-} _{\mathcal {F}}\) takes space \(\mathcal {O} (|\mathcal {D} ||\mathcal {S} ^{-} _{\mathcal {L}}|)\) and \(\mathcal {O} (|\mathcal {D} ||\mathcal {S} ^{-} _{\mathcal {F}}|)\), respectively. Combining the above, we have \(S_S \!=\!\mathcal {O} (|\mathcal {L} |^2 + |\mathcal {F} |^2 + |\mathcal {D} | |\mathcal {S} ^{-} _{\mathcal {L}}| + |\mathcal {D} | |\mathcal {S} ^{-} _{\mathcal {F}}|)\), and the space complexity of corset is \(\mathcal {O} ( ||\mathcal {F} || + ||\mathcal {L} || + |\mathcal {C}_R |S_R + S_S )\). When corset-gh is used, the greedy body sampler only takes space \(\mathcal {O} (|\mathcal {F} |)\); hence, \(S_S \) reduces to \(S_{S +}\!=\!\mathcal {O} (|\mathcal {L} |^2 + |\mathcal {D} | |\mathcal {S} ^{-} _{\mathcal {L}}| + |\mathcal {F} |)\), and the space complexity of corset-gh is \(\mathcal {O} ( ||\mathcal {F} || + ||\mathcal {L} || + |\mathcal {C}_R |S_R + S_{S +})\).

8 Experimental evaluation

The main goals in this section are twofold: (i) we empirically verify that the sampling algorithms used by corset outperform prior approaches in a variety of settings; and (ii) we show that corset (and in particular its two implementations corset-surs and corset-gh) delivers a concise set of rules while still providing competitive performance in multi-label classification. We first present the experimental settings and then the results.

8.1 Experimental setup

We describe the data, the baseline methods and the parameter settings used in the experiments. The design choices specific to the experiments comparing different sampling approaches are deferred to the dedicated section (Sect. 8.2).

Datasets. We use both synthetic and real-world datasets.

We use synthetic datasets to better understand the behavior of the methods with respect to different parameters. Data are obtained from a set of generating rules, and as a consequence, a notion of ground truth is available. For simplicity, bodies and heads are assumed to be composed of 3 attributes and 3 labels, respectively. Moreover, different bodies do not share attributes, while heads may share labels. For each generating rule, we sample its support either (i) uniformly at random, or (ii) from a skewed distribution in which a small subset of rules covers a large portion of the data, mimicking the typical behavior of real-world data. Supports of different rules may thus have arbitrary intersection. To obtain the synthetic dataset from the generating rules, we initialize the feature and label matrices to all zeros; then, for each generating rule, once its support is sampled, we set its attributes and labels to 1 over its support. The skewed case (ii) is expected to reflect real-world data more accurately, as real data typically contain a small set of very general rules and a larger number of specific rules.

Further, noise is injected by flipping each entry in the feature and label matrices with a fixed probability.
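A minimal generator along these lines (a sketch under the stated conventions, with assumed argument names) can be written as follows:

```python
import numpy as np

def synthetic_data(n, n_feat, n_lab, rules, noise, seed=0):
    """Generate a synthetic dataset from generating rules, as described
    above.  Each rule is a triple (attributes, labels, support_size);
    its attributes and labels are set to 1 over a randomly drawn
    support, and every entry of both matrices is finally flipped with
    probability `noise`."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, n_feat), dtype=int)
    Y = np.zeros((n, n_lab), dtype=int)
    for attrs, labels, k in rules:
        support = rng.choice(n, size=k, replace=False)
        X[np.ix_(support, attrs)] = 1
        Y[np.ix_(support, labels)] = 1
    # inject noise by XOR-flipping random entries
    X ^= (rng.random(X.shape) < noise).astype(int)
    Y ^= (rng.random(Y.shape) < noise).astype(int)
    return X, Y
```

For instance, `synthetic_data(1000, 100, 100, [([0, 1, 2], [0, 1, 2], 200)], noise=0.01)` creates a dataset with one generating rule of 3 attributes, 3 labels, and support 200.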

For real-world data, we use 9 heterogeneous benchmark datasets for multi-label classification. Summary statistics of the datasets are shown in Table 1. Categorical and numerical features are converted to binary form. For simplicity, we binarize numerical features by setting to 0 all values below a given percentile \( p \) (the 90-th percentile by default) and setting the remaining values to 1; a more refined pre-processing could further improve performance.
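This binarization step amounts to a column-wise thresholding; a sketch:

```python
import numpy as np

def binarize_numerical(X_num, p=90):
    """Binarize numerical features column-wise: values below the p-th
    percentile of their column become 0, the rest become 1 (the default
    pre-processing described above, with p = 90)."""
    thresholds = np.percentile(X_num, p, axis=0)
    return (X_num >= thresholds).astype(int)
```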

Next, we provide a brief description of each dataset used in our experiments. The \(\textsf { mediamill}\) [38] dataset is used for generic (broadcast news) video indexing. The \(\textsf { Yelp}\) [39] dataset contains customer reviews of various businesses, used to classify the quality of those businesses. The \(\textsf { corel-5k}\) [40] dataset is a benchmark for image classification. The \(\textsf { bibtex}\) [41] dataset, introduced in Sect. 1, is composed of bibtex entries associated with publications, and the goal is to assign a list of tags to each publication. The \(\textsf { enron}\) [42] dataset consists of a collection of email messages categorized into topics. The \(\textsf { medical}\) [43] dataset is used to predict a number of diseases given clinical free text, and is described in more depth in Sect. 9. The \(\textsf { birds}\) [44] dataset is used to predict the set of bird species present in a ten-second audio clip. The \(\textsf { emotions}\) [45] dataset classifies music into the emotions it evokes. The \(\textsf { CAL500}\) [46] dataset consists of songs, and the classifier maps each song to a number of semantic concepts.

Metrics. To measure the quality of a classifier, we primarily use the popular balanced micro \(F_1\) score, which micro-averages precision and recall. In addition, for real data, we report the balanced macro \(F_1\) score, which macro-averages precision and recall, as well as the Hamming accuracy (or Hamming score), defined as the fraction of label occurrences that are correctly predicted.

To monitor rule diversity, we report the average pairwise intersection size between the coverage of different rules. To assess interpretability, we rely primarily on the number of rules \(|\mathcal {R} |\), but we also consider the number of conditions in each rule and the degree of association of the rule heads.

Table 1 Summary statistics of the datasets used in the experimental evaluation

Baselines. We compare our classifier with three baselines.

SeCo [47] is a rule-based classifier that extracts new rules iteratively and discards covered examples from the training data once enough of their labels are predicted by already-learned rules. To learn rules, SeCo starts from the most general body and proceeds by adding conditions. Given a rule body, SeCo searches for the best possible head according to a metric, pruning the search space by exploiting properties of the metric and introducing a bias towards heads with multiple labels.

Boomer [2, 15] utilizes the gradient-boosting framework to learn ensembles of single-label or multi-label classification rules that are combined additively to minimize the empirical risk with respect to a suitable loss function.

svm-br [48, 49] is a linear support vector machine classifier based on the binary relevance approach, whereby each label is treated independently. Unlike corset, Boomer, and SeCo, svm-br is not a rule-based model and does not offer opportunities for interpretation. Therefore, in our experimental evaluation, svm-br serves as a representative black-box machine-learning algorithm. The goal of comparing corset with svm-br is to demonstrate that simple rule-based models, if appropriately learned, can be competitive with popular black-box machine-learning algorithms; in other words, we seek to demonstrate that if there is a price to pay in performance for pursuing interpretability, that price is consistently small, and hence the pursuit of interpretability is justified.

While we consider multiple baselines to compare corset against, Boomer is the most important one because it achieves the state-of-the-art performance in rule-based multi-label classification.

However, Boomer and corset are significantly different in nature. As an ensemble method, Boomer achieves its good performance by learning a large number of weak (i.e., poorly optimized) rules, which are aggregated for prediction. Boomer optimizes only for classification performance and does not favor ensembles composed of a small number of rules.

On the other hand, to achieve high interpretability, corset is the first multi-label rule-based classifier which optimizes both classification performance and rule set conciseness. corset does not aggregate weak rules in an ensemble like Boomer. Instead, corset builds a small set of carefully selected rules and predicts using the union of the predictions of the selected rules.

As a consequence of these differences, Boomer generally requires a larger number of rules than corset to achieve a similar level of classification performance, as we demonstrate in our experiments.

For the synthetic datasets, we focus on comparing our approach with Boomer for increasing number of rules, whereas for the real-world datasets we consider all baselines.

Parameter setting. For the experiments with synthetic data, we explore the scalability of our algorithm with respect to the number of attributes and labels, as well as its robustness to noise. We vary the level of noise (the proportion of flipped entries in the feature and label matrices) and the number of attributes and labels in a geometric progression of ratio 1.5. When not varied, the number of attributes and labels is fixed to 100 and the noise level to 0.01. When the noise level is varied, there are 10 ground-truth rules; otherwise, the number of generating rules increases with the size of the data and is given by \( \lfloor \frac{\min ( |\mathcal {F}|, |\mathcal {L}| ) }{ 3 } \rfloor \).

For the experiments with real-world data, we tune the hyper-parameters of Boomer and corset via random search to minimize micro-averaged \(F_1\) on a validation set.

For Boomer, we optimize the shrinkage parameter \(\eta \), which controls the impact of individual rules, over \(\{0.1, 0.3, 0.5\}\), and the L2 regularization parameter \(\lambda \) over \(\{0.0, 1.0, 10.0\}\). A key strength of Boomer is its ability to optimize different decomposable and non-decomposable loss functions. For hyper-parameter tuning in our experiments, we consider all 4 loss functions provided in the available implementation, namely a variant of the logistic loss applied to each label individually and a variant of the logistic loss applied to all labels simultaneously (continuous surrogates for the standard Hamming loss and 0/1 loss, respectively), as well as a variant of the squared error loss and a variant of the squared hinge loss, both applied to each label individually. Boomer additionally allows learning rules with heads composed of either a single label or all labels; in hyper-parameter tuning, we consider both options.

corset does not support different loss functions as Boomer does; instead, it is tailored to the optimization of the objective function in Eq. (8). Consequently, the loss function implicitly minimized by corset is the objective function in Eq. (8) reversed in sign. For corset, the size \(|\mathcal {C} |\) of each sampled pool of rules does not need to be tuned: larger pools improve performance at the cost of increased running time. During tuning we fix \(|\mathcal {C} |=150\); otherwise, \(|\mathcal {C} |\) is set to 500 by default. For hyper-parameter tuning, all hyper-parameters are searched in the range (0, 1), except \(\lambda \), which is searched in \((10^{-2}, 10^{2})\).

Moreover, we also investigate the impact of \(\lambda \) on the diversity and accuracy of the set of chosen rules \(\mathcal {R}\), by varying it in a geometric progression of ratio 10.

All reported experimental results are averages over 10 repetitions.

Implementation. Experiments are executed on a machine with \(2\!\times \!10\)-core Xeon E5 2680 v2 2.80 GHz processors and 256 GB of memory. Our implementation is available online. To speed up and facilitate hyper-parameter tuning for corset, we implement two practical changes. First, we run only the first round of greedy selection in Algorithm 1; the second round guarantees the approximation factor but often offers only a modest increase in performance, not worth the increase in running time. Second, in the experiments with real data, we pass as input to corset the number of rules to be returned (at most 150) instead of the tolerance parameter (c in Algorithm 1), to reduce variability and simplify hyper-parameter optimization.

8.2 Sampling performance

We evaluate the performance of our sampling algorithms and compare them against a few alternatives.

Head sampling. In this setting, we consider the task of sampling heads according to the area function. We compare two samplers: our sampler described in Alg. 4 and the original two-stage pattern sampler introduced by Boley et al. [25] (the baseline). They differ in the sample space they use: the baseline sampler operates in the original label space, i.e., the powerset of all labels, while ours uses a subspace of it, namely the interpretable label space, introduced in Sect. 6.2.

For each configuration of dataset and sampler, we sample 1000 heads. For each sampled head \({H} \), we evaluate its quality using the logarithm of its area and its support size, i.e., \(\log \left( \text {area}\!\ ({H})\right) \) and \(\log \left( |\mathcal {D} \left[ {H} \right] |\right) \). In addition, we quantify the association of the labels in \({H} \) (our proxy of interpretability) using the edge density of the subgraph induced by \({H} \). Specifically, given a label graph \(\mathcal {G} =\left( V, E, p \right) \) and a set of labels (nodes) \({H} \), denote \(\mathcal {G} \left[ {H} \right] \) as the subgraph induced by \({H} \) in \(\mathcal {G}\). Then, the edge density of \(\mathcal {G} \left[ {H} \right] \) is defined as \(d\left( \mathcal {G} \left[ {H} \right] \right) = \left( \sum _{e \in \mathcal {G} \left[ {H} \right] }{p \left( e\right) }\right) / \left( |{H} |\times \left( |{H} |-1\right) \right) \). We denote \(d\left( {H} \right) = d\left( \mathcal {G} \left[ {H} \right] \right) \) for brevity.
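Assuming the label graph is represented as a dictionary of edge weights (as in the sketch of Sect. 6.2), the edge density can be computed as follows:

```python
def edge_density(edges, H):
    """Edge density of the subgraph induced by a head H in the label
    graph: the total weight of the induced (directed) edges divided by
    the number of possible directed edges, |H| * (|H| - 1)."""
    H = list(H)
    if len(H) < 2:
        return 0.0
    total = sum(edges.get((u, v), 0.0) for u in H for v in H if u != v)
    return total / (len(H) * (len(H) - 1))
```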

In Fig. 5, we show the distributions of the metric scores, visualized using violin plots [50]. Our sampler produces better heads than the baseline according to all measures on average; moreover, the best head (by any given metric) obtained by our sampler is no worse than that of the baseline. Further, on datasets where the heavy-hitter problem is pronounced, such as \(\textsf { bibtex}\), the baseline suffers the most, and many of its heads have zero support.

Fig. 5 Distributions of quality measures for the heads obtained by the two samplers. Both samplers target the area function but operate under different sample spaces: the interpretable label space (Sect. 6.2) or the original space (used in [25]). We consider three measures: log of support size (left), log of area (middle), and edge density of the label subgraph (right). For all measures, larger values are better

Body sampling. We then consider sampling bodies according to the discriminativity function and compare two samplers: the one used by corset, which samples from a reduced space (Sect. 6.3), and a baseline [26], which instead samples from the original feature space.

For each dataset, we consider the top-50 heads ranked by area. For each head and sampler, we sample 100 bodies and take the best one according to a specific goodness measure.

We consider two goodness measures for the bodies: the quality function \(q \) (as in Eq. (1)) used in our problem formulation and the micro-averaged \(F_1\) score.

For each configuration of dataset and head, we report the relative difference between the best scores obtained by the two samplers. For a given metric m, the body \({B} \) obtained by our sampler, and the body \({B} _{\text {B}}\) obtained by the baseline, the relative difference according to m is calculated as \(\Delta _{m}\left( {B}; {B} _{\text {B}}\right) = \left( m\left( {B} \right) - m\left( {B} _{\text {B}}\right) \right) / m\left( {B} _{\text {B}}\right) \); for brevity, we write \(\Delta _{m}\). To bring the numerical scales of different datasets to a similar level, we report the logarithmic version of \(\Delta _{m}\): \(\log \left( \Delta _{m}\right) = \text {sign} \left( \Delta _{m}\right) \log \left( |\Delta _{m}| + 1\right) \).
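For clarity, the transformation can be written as a small helper (assuming a non-zero baseline score):

```python
import math

def log_relative_difference(m_ours, m_baseline):
    """Signed-log relative difference used in Fig. 6; a positive value
    means our sampler scores higher on metric m."""
    delta = (m_ours - m_baseline) / m_baseline
    return math.copysign(math.log(abs(delta) + 1.0), delta)
```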

The distributions of \(\log \left( \Delta _{m}\right) \) are summarized in Fig. 6 using letter-value plots [51].

The two measures give different impressions: according to the \(F_1\) score, our sampler is almost always the best choice, while for \(q \), our sampler still outperforms the baseline, but to a lesser extent than in the comparison based on the \(F_1\) score (e.g., on \(\textsf { birds}\) and \(\textsf { corel-5k}\)).

Fig. 6 Distributions of the relative difference for the two samplers (ours and the baseline) evaluated on two metrics: the quality function \(q \) (left) and the micro-averaged \(F_1\) score (right). Both samplers target the discriminativity function but operate under different sample spaces. A relative difference greater (smaller) than zero indicates that our sampler (the baseline) is better; dashed vertical lines mark this boundary

Fig. 7 Effect of the proposal distribution on the number of samples until coalescence. The smaller the value, the better

Fig. 8 Synthetic datasets generated from rules with uniform (top) and skewed (bottom) coverage. Micro-averaged \(F_1\) score against proportion of noise (left), number of attributes (middle) and labels (right). The x-axis is in log scale

Effect of proposal distributions on convergence speed. We examine the proposal distribution used by the cftp subroutine (Algorithm 3) and study its effect on the convergence speed of the Markov chain. In particular, we compare the proposal in Eq. (18) (used by Boley et al. [26]) against the proposal in Eq. (19) (used by corset). Recall that the latter is a tight approximation of the target distribution, while the former is not. We measure the convergence speed of a Markov chain by N, the number of Monte Carlo samples that are drawn until the chain coalesces.

For each dataset and proposal distribution, we sample 50 random labels. Further, for each label, 100 bodies are sampled, resulting in a total of 5000 samples. Finally, we report the average of \(\log (N)\).

The results are summarized in Fig. 7 and show clear superiority of our proposal distribution.

8.3 Classification performance

Synthetic datasets. Results on synthetic datasets, both for data generated from rules with uniform and skewed coverage, are shown in Fig. 8.

In general, when the noise level is not too large, corset tends to recover the generating rules.

In the experiments with synthetic data, we let corset run until the stopping condition is met, with \(\tau = 0.01\). The number of rules retrieved by corset typically coincides with the number of generating rules, and it is always at most the maximum number of generating rules, 33. On the other hand, Boomer with 10 rules consistently offers poor performance. The classification accuracy of Boomer increases with the number of rules, but even against Boomer with 1000 rules, our method performs better while using a very concise set of rules. Unlike Boomer, corset seeks to uncover the true set of generating rules and only uses those for classification. Thus, the experiments with synthetic datasets clearly show the advantage of our approach. Also note that the performance of corset, unlike that of Boomer, does not significantly deteriorate when \(|\mathcal {F} |\) increases.

Table 2 Micro-averaged \(F_1\)-scores on real datasets achieved by corset-surs, corset-gh, SeCo, Boomer with increasing number of rules, and svm-br
Table 3 Macro-averaged \(F_1\)-scores on real datasets achieved by corset-surs, corset-gh, SeCo, Boomer with increasing number of rules, and svm-br
Table 4 Hamming scores on real datasets achieved by corset-surs, corset-gh, SeCo, Boomer with increasing number of rules, and svm-br

Real datasets: classification performance and interpretability. Results on real datasets, for classification performance and interpretability, are shown in Tables 2, 3, and 4, where classification performance is measured by micro-averaged \(F_1\) score (optimized in hyper-parameter tuning), macro-averaged \(F_1\) score, and Hamming score, respectively. The tables also show the number of rules, which is our main measure of interpretability. While we show results for three different classification performance metrics, the algorithms are tuned to optimize the micro-averaged \(F_1\) score. Furthermore, caution is required in assessing the Hamming scores, because they tend to be biased towards more conservative classifiers that predict fewer labels to be present.

Fig. 9 Average coverage overlap between pairs of rules as a function of \(\lambda \) (lower values indicate higher diversity) on real datasets. Both axes are in log scale

If training does not terminate within 12 hours, we report NA in the corresponding table entry. Tables 2, 3, and 4 show that Boomer requires very large sets of rules to achieve competitive performance; the interpretability of its rule sets is thus questionable. Similarly, svm-br performs well, but it is not designed for interpretability. Instead, corset extracts a small set of rules, guaranteeing ease of interpretation, and yet it is consistently competitive with the baselines on all datasets, according to all considered performance metrics. corset generally requires fewer rules than rule-based alternatives to attain similar performance in multi-label classification. Further, it is never drastically worse than Boomer with 1000 rules or even svm-br, suggesting that the price of interpretability, if there is one, is small when corset is used. Finally, note that corset-gh often outperforms corset-surs, albeit using a larger set of rules.

Real datasets: diversity and impact of \(\lambda \). A fundamental characteristic of corset is that it allows the user to control the degree of diversity in the set of recovered rules via a single tunable parameter \(\lambda \). In Fig. 9, we show for a subset of datasets that the shared coverage among rules is lower for corset than for Boomer, and moreover that increasing the value of \(\lambda \) is very effective in reducing overlap between rules. As the impact of \(\lambda \) on overlap does not differ significantly between corset-surs and corset-gh, we only show results for the former.

Fig. 10 Micro-averaged \(F_1\) score as a function of \(\lambda \) (lower values indicate higher diversity) on real datasets. The x-axis is in log scale

Fig. 11 Distributions of the number of attributes in the bodies of the rules extracted by corset-surs, corset-gh, and Boomer with 1000 rules. Smaller values indicate fewer attributes. The x-axis is in log scale

In addition, since \(\lambda \) determines the weight assigned to the diversity term in Eq. (8), it also affects the performance of corset. Figure 10, however, shows that the performance of corset, as measured by the micro-averaged \(F_1\) score, does not vary monotonically with \(\lambda \), although there is evidence that very large values (e.g., \(\lambda = 1000\)) typically yield poor classification. The level of rule diversity leading to optimal classification performance varies across datasets; hence, in practice, \(\lambda \) must be carefully tuned to optimize the performance of corset. Note that the results in Fig. 10 are obtained by fixing the number of rules to 100 for both corset and Boomer and by choosing the better of corset-surs and corset-gh in terms of micro-averaged \(F_1\) score.

Real datasets: body and head sizes. In addition to the number of rules considered before, it is possible to measure the interpretability of a model by the sizes of the bodies of its rules (i.e., the number of predicates of each rule in a rule set).

The number of rules within a rule set may be limited; however, if the bodies of those rules contain a substantial number of conditions, the interpretability of the rule set may still be impaired. We show that this is not the case for corset.

The number of attributes in bodies may vary considerably depending on whether corset-surs or corset-gh is used. Figure 11 shows the distributions of body sizes over the rule sets obtained by corset-surs, corset-gh, and Boomer (with 1000 rules); we consider the bodies of the rules whose results are summarized in Table 2. The rules obtained by corset-surs consistently have a small number of attributes (always \(\le 10\)). In general, the distribution of body sizes for Boomer is comparable to those of corset-surs and corset-gh; nonetheless, Boomer produces a significant number of extremely long bodies in all datasets.

Although corset-surs does not explicitly limit body sizes, the limited sizes are a by-product of the reduced sample space constructed prior to sampling. As illustrated in Sect. 6.3, samples in the reduced space correspond to highly probable cliques in the feature graph, and given how “probable” is defined, such cliques typically have a small number of nodes (attributes).

Fig. 12 Distributions of the number of labels in the heads of the rules extracted by corset-surs and corset-gh. Smaller values indicate smaller heads. The x-axis is in log scale

Similarly, corset-gh forms bodies by greedily selecting attributes to maximize a modified version of discriminativity, controlling the balance between generality and precision with the user-specified parameter \(\gamma \). Small values of \(\gamma \) favor generality, allowing for more mistakes and leading to bodies with few conditions, whereas larger values of \(\gamma \) generate more specific rules with more attributes. In Fig. 11, we show results for \(\gamma =0.5\). For this value of \(\gamma \), bodies sampled by corset-gh tend to have more attributes than those sampled by corset-surs. Nevertheless, remarkably large bodies are present only in the \(\textsf { mediamill}\) dataset, whereas Boomer relies on rules with large bodies in all datasets.

Shorter heads do not necessarily correspond to more interpretable heads; however, small heads are typically easy to interpret. For completeness, in Fig. 12, we additionally show, with the same parameter settings, the distributions of head sizes in the rules chosen by corset-surs and corset-gh. Boomer is not included in the comparison because it lets the user decide a priori whether to learn rules with heads composed of a single label or of all labels.

The distributions of the head lengths in the different datasets suggest that corset consistently learns rules with heads composed of a limited number of labels. Similarly to the case of body sampling, this experimental finding is largely a consequence of the construction of the interpretable label space prior to sampling, whereby long and hard-to-interpret heads are discarded.

Real datasets: running time. The main goal pursued in this work is to achieve a good balance between interpretability and classification quality. corset carefully chooses each rule to be used for prediction, which requires a significant amount of time.

Fast training is not a primary concern of our work.

Nonetheless, it is important to show that the execution of corset terminates within a reasonable amount of time in all the considered datasets. Table 5 reports the training time of corset and its competitors.

The time required to obtain predictions for new data records is generally modest for all algorithms; thus, prediction times are not included in the measurements reported in Table 5.

Table 5 Execution time (in seconds) of corset-surs, corset-gh, SeCo, Boomer with increasing number of rules, and svm-br

corset always finishes training in less than 12 minutes. Boomer and svm-br are faster to train than corset in most cases; their better efficiency can be attributed to the implementation language (C/C++ versus Python and Cython for corset) and, most importantly, to the continuous nature of the optimization problems they solve.

On the other hand, SeCo is drastically slower than all the other algorithms.

For corset, a considerable amount of time is spent in the pre-processing stage to construct the interpretable label space for head sampling and, in corset-surs, the reduced sample space for body sampling, which, as mentioned in Sect. 6.3, is a potential bottleneck.

In practice, however, if one is interested in learning multiple rule sets (e.g., corresponding to different values of \(\lambda \)), it suffices to carry out the pre-processing stage just once. While corset-gh avoids the scalability issues that may arise in constructing the reduced sample space for body sampling, it requires significantly more time than corset-surs to perform a single iteration. Consequently, when the attribute space is sparse, the construction of the reduced sample space is fast and corset-surs runs faster than corset-gh.

As explained at the end of Sect. 8.1, in our experimental evaluation with real data, by default we execute only the first round of greedy selection of corset. This is because the gain in classification quality offered by the second round is generally not worth the increase in running time: the running time incurred by corset with both rounds of greedy selection is often 2 to 3 times that of corset with only one round.

9 Case study

We carry out a case study on the \(\textsf { medical}\) [43] dataset to showcase an application of corset in the medical domain. When machine-learning models are used to assist medical decision making, interpretability of the models is of particular importance [10], since medical decisions are often high-stake and may deeply affect the lives of patients. In the context of rule-based models, human practitioners may have to assess the rules one by one to ensure that they have a sufficient understanding of the model. Arguably, rule sets that are very large and highly redundant are costly to check, and thus undesirable; concise rule sets are instead particularly valuable. To illustrate this point through a concrete example, we use the \(\textsf { medical}\) dataset, which contains fully anonymized clinical free text from medical records, each labeled with one or more disease names. The disease names follow the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) standard, a globally used mapping of disease names to unique codes [52].

Because medical practitioners wish to evaluate the rules qualitatively using their expertise, it is necessary to focus on a small set of rules, for instance one composed of 10 rules.

The first 10 rules picked by corset are general (as indicated by their coverage) and easy, often trivial, to interpret. All of them have coverage above 40, and the average coverage is 123. In contrast, the first 10 rules picked by Boomer appear opaque and exhibit zero coverage. In fact, corset with 10 rules achieves a micro-averaged \(F_1\) score above 0.7, while Boomer with the same number of rules yields a drastically lower score. In particular, as shown in Sect. 8, Boomer does not grant high-quality classification with 10 or 100 rules, but only with 1000 rules, and a set of 1000 rules is impractical for medical practitioners to inspect.

Therefore, Boomer, which does not impose constraints on model complexity, contradicts the principles of interpretable machine learning and may not be suitable for applications where interpretability is an important concern.

As regards the first 10 rules learned by corset, however, it is important to note that most of these remarkably accurate and general rules do not necessarily generate novel insight. For example, the first rule selected by corset is \( R _1 = (\{ fever \ \& \ cough \} \rightarrow \{ fever \ \& \ cough \})\). We argue that this is not a drawback of corset; rather, it is a characteristic of the data.

The most general and accurate rules are necessary for corset to perform well in the multi-label classification task. For instance, \(R _1\) has an uncovered area of 238 and adjusted accuracy of 1.06.

Even if the rules represent somewhat trivial associations between some words and some diseases, they are beneficial in revealing the salient patterns in the data, potentially confirming the medical practitioner’s knowledge and helping build trust in the models.

It is also possible to run corset to extract a larger rule set, containing more specific and non-trivial rules. Further, if required, it is simple to filter out all rules capturing trivial associations during post-processing. As a demonstration, we run corset for 30 iterations and filter out all rules whose body attributes appear among the labels in the head (e.g., \( R _1 = (\{ fever \ \& \ cough \} \rightarrow \{ fever \ \& \ cough \})\)). In Table 6, we list the first 10 rules after the post-processing (in the order of selection by corset). The first rule maps the word “myelomeningocele” to the label “spina bifida without mention of hydrocephalus.” This is a rule of broad coverage. As “myelomeningocele” is one of four types of “spina bifida,” the rule is easy to interpret: “myelomeningocele” is likely to identify “spina bifida without mention of hydrocephalus,” and since rules for other types of “spina bifida” are not present, we infer that this is likely the prevalent type of “spina bifida” in the data at hand. As another example, the fourth rule in the table, which is also general and easy to interpret, contains two attributes in its body: it maps the words “lobe” and “atelectasis” to “pulmonary collapse.” The rule suggests that when a lobe (i.e., a section) of the lung is affected by atelectasis, there is a significant chance that the lung will collapse. This is likely because atelectasis can decrease the amount of oxygen that reaches the lung tissue and can lead to the accumulation of fluid or mucus in the lung, which can further decrease lung expansion and increase the risk of collapse. An atelectasis may also indicate the collapse of the entire lung, but the rule under consideration suggests that, in the \(\textsf { medical}\) dataset, the word “lobe” leads to a better discrimination of the subsequent occurrence of “pulmonary collapse.”

As a final example, we consider a rule with a longer head. The eighth rule in the table maps the word “throat” to two diseases: “fever and physiologic disturbances of temperature regulation” and “acute pharyngitis.” The rule indicates a strong correlation in the data between the presence of symptoms in the throat and the joint occurrence of fever and physiologic disturbances of temperature regulation and acute pharyngitis. This is probably because the pharynx in the throat is a sensitive area that can be easily infected by pathogens and viruses, which can cause acute pharyngitis, i.e., an inflammation of the pharynx, and fever as part of the immune response of the body. For instance, both acute pharyngitis and fever may be caused by the widespread influenza virus. Additionally, physiologic disturbances of temperature regulation may be due to the body's attempt to fight off the infection through fever.

All rules in Table 6, even the ones with the smallest coverage, are biologically meaningful, as they admit scientific explanations.

Some of the rules may describe associations that look obvious to medical practitioners. In this case, the practitioner still acquires information about which diverse patterns in the data are most relevant for multi-label disease classification.

The \(\textsf { medical}\) dataset is relatively small and does not contain a large number of surprising patterns.

More unexpected rules can still be found in the \(\textsf { medical}\) dataset; however, they would have limited support and, if chosen, would lead to a decrease in classification performance.

Regardless of the knowledge generated by the extracted rules, our in-depth examination of the \(\textsf { medical}\) dataset provides an example application of corset in a consequential domain, demonstrating that the rule sets provided by corset are more interpretable and more practical for assisting high-stake decision making than those provided by existing algorithms such as Boomer.

Table 6 First 10 rules chosen by corset (body, head and coverage) after post-processing to filter out “trivial” rules

10 Conclusion

We propose a novel rule-based classifier, corset, for multi-label classification tasks. Our training objective explicitly penalizes rule redundancy, thereby encouraging the algorithm to learn a concise and easy-to-interpret set of rules.

Furthermore, we design a suite of fast sampling algorithms that efficiently generate rules with good accuracy and interpretability. We show through extensive experiments that our sampling algorithms are highly effective and that corset achieves performance comparable to strong baselines while offering better interpretability. Thus, corset strikes a favorable balance between performance and interpretability in multi-label classification, and, as demonstrated in our case study, it is particularly valuable in multi-label classification tasks where interpretability is of primary interest.

Our work opens interesting questions for future research. Can we design training objectives that reflect popular multi-label classification metrics, while producing concise rule sets? Can we use the techniques in this work to address the interpretability issue of existing rule-based classifiers?