As many real-world classification problems require assigning more than one label to an instance, multi-label classification (MLC) has become a well-established topic in the machine learning community. There are various applications of MLC such as text categorization [lewis1992, klimt2004], the annotation of images [boutell2004, li2008] and music [trohidis2008, turnbull2008], as well as use cases in bioinformatics [diplaris2005] and medicine [pestian2007].
Rule learning algorithms are a well-researched approach to solving classification problems [furnkranz2012]. In comparison to complex statistical methods, such as support vector machines or artificial neural networks, their main advantage is the interpretability of the resulting models. Rule-based models can easily be understood by humans and form a structured hypothesis space that can be analyzed and modified by domain experts. Ideally, rule-based approaches yield insight into the application domain by revealing patterns and regularities hidden in the data, and they allow one to reason about why individual predictions have been made by a system. This is especially relevant in safety-critical domains, such as medicine, power systems, or financial markets, where malfunctions and unexpected behavior may entail the risk of health damage or financial harm.
1.0.1 Motivation and goals.
To assess the quality of multi-label predictions in terms of a single score, several commonly used performance measures exist. Even though some of them originate from measures used in binary or multi-class classification, different ways to aggregate and average the predictions for individual labels and instances — most prominently micro- and macro-averaging — exist in MLC. Some measures like subset accuracy are even unique to the multi-label setting. No studies that investigate the effects of using different rule learning heuristics in MLC and discuss how they affect different multi-label performance measures have been published so far.
In accordance with previous publications in single-label classification, we argue that all common rule learning heuristics essentially trade off two aspects, consistency and coverage [furnkranz2005]. Our long-term goal is to better understand how these two aspects should be weighed to assess the quality of candidate rules during training if one is interested in a model that optimizes a certain multi-label performance measure. As a first step towards this goal, we present a method for flexibly creating rule-based models that are built with respect to certain heuristics. Using this method, we empirically analyze how different heuristics affect the models in terms of predictive performance and model characteristics. We demonstrate how models that aim to optimize a given multi-label performance measure can deliberately be trained by choosing a suitable heuristic. By comparing our results to a state-of-the-art rule learner, we emphasize the need for configurable approaches that can flexibly be tailored to different multi-label measures. Due to space limitations, we restrict ourselves to micro-averaged measures, as well as to Hamming and subset accuracy.
1.0.2 Structure of this work.
We start in Section 2 by giving a formal definition of multi-label classification tasks as well as an overview of inductive rule learning and the rule evaluation measures that are relevant to this work. Based on these foundations, in Section 3, we discuss our approach for flexibly creating rule-based classifiers that are built with respect to said measures. In Section 4, we present the results of the empirical study we have conducted, before we provide an overview of related work in Section 5. Finally, we conclude in Section 6 by recapitulating our results and giving an outlook on planned future work.
MLC is a supervised learning problem in which the task is to associate an instance with one or several labels $\lambda_i$ out of a finite label space $\mathbb{L} = \{\lambda_1, \dots, \lambda_n\}$, with $n = |\mathbb{L}|$ being the total number of predefined labels. An individual instance $\boldsymbol{x}_j$ is represented in attribute-value form, i.e., it consists of a vector $\boldsymbol{x}_j = (v_1, \dots, v_l) \in \mathbb{D} = A_1 \times \dots \times A_l$, where $v_k$ is the value of a numeric or nominal attribute $A_k$. Additionally, each instance $\boldsymbol{x}_j$ is associated with a binary label vector $\boldsymbol{y}_j = (y_1, \dots, y_n)$, where $y_i$ indicates the presence ($1$) or absence ($0$) of label $\lambda_i$. Consequently, the training data set of a MLC problem can be defined as a set of tuples $T = \{(\boldsymbol{x}_1, \boldsymbol{y}_1), \dots, (\boldsymbol{x}_m, \boldsymbol{y}_m)\}$, with $m$ being the number of available training instances. The classifier function $g: \mathbb{D} \rightarrow \{0, 1\}^n$, that is deduced from a given training data set, maps an instance $\boldsymbol{x}$ to a predicted label vector $\hat{\boldsymbol{y}} = (\hat{y}_1, \dots, \hat{y}_n)$.
2.1 Classification rules
In this work, we are concerned with the induction of conjunctive, propositional rules of the form $r: \hat{y}_i \leftarrow b$. The body $b$ of such a rule consists of one or several conditions that compare an attribute value of an instance to a constant by using a relational operator such as $=$ and $\neq$ (in the case of nominal attributes), or $\leq$ and $>$ (in the case of numerical attributes). On the one hand, the body of a conjunctive rule can be viewed as a predicate that states whether an instance satisfies all of the given conditions, i.e., whether the instance is covered by the rule or not. On the other hand, the head of a (single-label head) rule consists of a single label assignment ($\hat{y}_i = 0$ or $\hat{y}_i = 1$) that specifies whether the label $\lambda_i$ should be predicted as present ($1$) or absent ($0$).
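The body-as-predicate view can be sketched in Python as follows (an illustrative sketch of our own; the class names and the dict-based instance representation are assumptions, not part of the paper's implementation):

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Condition:
    attribute: str  # name of the attribute the condition refers to
    operator: str   # "==", "!=" (nominal) or "<=", ">" (numerical)
    value: Any      # the constant to compare against

    def is_satisfied(self, instance: dict) -> bool:
        v = instance[self.attribute]
        return {"==": v == self.value, "!=": v != self.value,
                "<=": v <= self.value, ">": v > self.value}[self.operator]

@dataclass
class Rule:
    head: Tuple[str, int]  # single-label assignment, e.g. ("label_3", 1)
    body: List[Condition]  # conjunction of conditions

    def covers(self, instance: dict) -> bool:
        # an instance is covered iff it satisfies all conditions
        return all(c.is_satisfied(instance) for c in self.body)
```

For example, `Rule(head=("bird", 1), body=[Condition("has_wings", "==", True)])` covers exactly those instances whose `has_wings` attribute equals `True`.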
2.2 Binary relevance method
In the present work, we use the binary relevance transformation method (cf. [boutell2004]), which reduces MLC to binary classification by treating each label of a MLC problem independently. For each label $\lambda_i$, we aim at learning rules that predict the minority class $t_i \in \{0, 1\}$, i.e., rules that contain the label assignment $\hat{y}_i = t_i$ in their head. We define $t_i = 1$, if the corresponding label $\lambda_i$ is associated with less than 50% of the training instances, or $t_i = 0$ otherwise.
A rule-based classifier — also referred to as a theory — combines several rules into a single model. In this work, we use (unordered) rule sets containing all rules that have been induced for the individual labels. Such a rule set can be considered as a disjunction of conjunctive rules (DNF). At prediction time, all rules that cover a given instance are taken into account to determine the predicted label vector $\hat{\boldsymbol{y}}$. An individual element $\hat{y}_i$, which corresponds to the label $\lambda_i$, is set to the minority class $t_i$ if at least one of the covering rules contains the label assignment $\hat{y}_i = t_i$ in its head. Otherwise, the element is set to the majority class $1 - t_i$. As all rules that have been induced for a label have the same head, no conflicts may arise in this process.
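The prediction procedure for such an unordered rule set can be sketched as follows (an illustrative Python sketch; the tuple-based rule representation and the `minority` mapping are our own choices):

```python
def covers(body, instance):
    """A body is a list of (attribute, operator, constant) conditions;
    an instance is covered iff it satisfies all of them."""
    ops = {"==": lambda a, b: a == b, "!=": lambda a, b: a != b,
           "<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
    return all(ops[op](instance[attr], val) for attr, op, val in body)

def predict(rules, instance, minority):
    """Predict a binary label vector from an unordered rule set (DNF).

    rules: list of (label, body) pairs; each rule's head assigns the
    minority class of its label. minority: dict label -> minority class.
    """
    # default: the majority class (1 - minority) for every label
    pred = {label: 1 - t for label, t in minority.items()}
    for label, body in rules:
        if covers(body, instance):
            pred[label] = minority[label]  # at least one covering rule fires
    return pred
```

Since all rules for a label share the same head, the order in which the covering rules are considered is irrelevant.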
2.3 Bipartition evaluation functions
To assess the quality of individual rules, usually bipartition evaluation functions are used [tsoumakas2009]. Such functions — also called heuristics — map a two-dimensional confusion matrix to a heuristic value. A confusion matrix consists of the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) labels that are predicted by a rule. We calculate the example-wise aggregated confusion matrix $C$ for a rule as

$$C = C_1^1 \oplus \dots \oplus C_n^1 \oplus \dots \oplus C_1^m \oplus \dots \oplus C_n^m \qquad (1)$$

where $\oplus$ denotes the cell-wise addition of atomic confusion matrices $C_i^j$ that correspond to label $\lambda_i$ and instance $\boldsymbol{x}_j$.
Further, let $y_i^j$ and $\hat{y}_i^j$ denote the absence ($0$) or presence ($1$) of label $\lambda_i$ for an instance $\boldsymbol{x}_j$ according to the ground truth and a rule's prediction, respectively. Based on these variables, we calculate the elements of $C_i^j$ as

$$TP_i^j = \llbracket y_i^j = \hat{y}_i^j = t_i \rrbracket \quad FP_i^j = \llbracket y_i^j \neq \hat{y}_i^j = t_i \rrbracket \quad TN_i^j = \llbracket y_i^j = \hat{y}_i^j \neq t_i \rrbracket \quad FN_i^j = \llbracket \hat{y}_i^j \neq y_i^j = t_i \rrbracket \qquad (2)$$

where $\llbracket x \rrbracket = 1$, if $x$ is true, $0$ otherwise.
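A possible implementation of these atomic confusion matrices and their cell-wise aggregation (a Python sketch under our own tuple representation; `t` denotes the minority class of the respective label):

```python
def atomic_confusion(y, y_hat, t):
    """Atomic confusion matrix (TP, FP, TN, FN) for one label/instance pair.

    y: ground-truth value, y_hat: the rule's prediction, t: minority class.
    A prediction of t counts as positive; any other prediction as negative.
    """
    tp = int(y_hat == t and y == t)  # predicted minority class, correctly
    fp = int(y_hat == t and y != t)  # predicted minority class, incorrectly
    tn = int(y_hat != t and y != t)  # predicted majority class, correctly
    fn = int(y_hat != t and y == t)  # predicted majority class, incorrectly
    return (tp, fp, tn, fn)

def aggregate(matrices):
    """Cell-wise addition of a sequence of atomic confusion matrices."""
    return tuple(map(sum, zip(*matrices)))
```

Exactly one of the four cells is $1$ for each label/instance pair, so the aggregated cells sum to the number of pairs considered.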
2.4 Rule learning heuristics
A good rule learning heuristic should (among other aspects) take both the consistency and the coverage of a rule into account [janssen2010, furnkranz2012]. On the one hand, rules should be consistent, i.e., their prediction should be correct for as many of the covered instances as possible. On the other hand, rules with high coverage, i.e., rules that cover a large number of instances, tend to be more reliable, even though they may be less consistent.
The precision metric exclusively focuses on the consistency of a rule. It calculates as the fraction of correct predictions among all covered instances:

$$\text{prec}(C) = \frac{TP}{TP + FP} \qquad (3)$$

In contrast, recall focuses on the coverage of a rule. It measures the fraction of covered instances among all — covered and uncovered — instances for which the label assignment in the rule's head is correct:

$$\text{rec}(C) = \frac{TP}{TP + FN} \qquad (4)$$
The F-measure calculates as the weighted harmonic mean of precision and recall, where the parameter $\beta$ controls the trade-off between both measures:

$$F_\beta(C) = \frac{(1 + \beta^2) \cdot \text{prec}(C) \cdot \text{rec}(C)}{\beta^2 \cdot \text{prec}(C) + \text{rec}(C)} \qquad (5)$$

As an alternative to the F-measure, we use different parameterizations of the m-estimate in this work. It is defined as

$$\text{m-estimate}(C) = \frac{TP + m \cdot \frac{P}{P + N}}{TP + FP + m} \qquad (6)$$

where $P = TP + FN$ and $N = FP + TN$. Depending on the parameter $m \geq 0$, this measure trades off precision and weighted relative accuracy (WRA). If $m = 0$, it is equivalent to precision and therefore focuses on consistency. As $m$ approaches $\infty$, it converges to WRA and puts more emphasis on coverage [furnkranz2012].
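The heuristics above can be written out as follows (a Python sketch operating on the four cells of a confusion matrix; the zero-denominator handling is our own convention):

```python
def precision(tp, fp, tn, fn):
    """Fraction of correct predictions among all covered instances."""
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def recall(tp, fp, tn, fn):
    """Fraction of covered positives among all positives."""
    return tp / (tp + fn) if tp + fn > 0 else 0.0

def f_measure(tp, fp, tn, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    p, r = precision(tp, fp, tn, fn), recall(tp, fp, tn, fn)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r) if b2 * p + r > 0 else 0.0

def m_estimate(tp, fp, tn, fn, m=0.0):
    """Trades off precision (m = 0) against WRA (m -> infinity)."""
    pos, neg = tp + fn, fp + tn  # P and N
    denom = tp + fp + m
    return (tp + m * pos / (pos + neg)) / denom if denom > 0 else 0.0
```

Note that `m_estimate(..., m=0)` coincides with `precision(...)`, while for very large `m` the value approaches the prior $P / (P + N)$-weighted behavior of WRA.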
3 Induction of rule-based theories
For our experimental study, we implemented a method that allows generating a large number of rules for a given training data set in a short amount of time (cf. Section 3.1); the source code is available at https://github.com/mrapp-ke/RuleGeneration. The rules should ideally be unbiased, i.e., they should not be biased in favor of a certain heuristic, and they should be diverse, i.e., general rules should be included as well as specific ones. Given that these requirements are met, we consider the generated rules to be representative samples of the space of all possible rules, which is far too large to be explored exhaustively. We use the generated candidate rules as a starting point for building different theories. Each theory consists of a subset of the rules, selected with respect to a specific heuristic (cf. Section 3.2) and filtered according to a threshold (cf. Section 3.3). Whereas the first step yields a theory with high coverage, the threshold selection aims at improving its consistency.
3.1 Generation of candidate rules
As noted in Section 2.2, we consider each label $\lambda_i$ of a MLC problem independently. For each of the $n$ labels, we train multiple random forests [breiman2001], using varying configuration parameters, and extract rules from their decision trees. We use the random forest implementation provided by Weka 3.9.3, which is available at https://www.cs.waikato.ac.nz/ml/weka. As illustrated in Algorithm 1, we repeat the process until a predefined number of rules has been generated.
Each random forest consists of a predefined number of decision trees. To ensure that we are able to generate diverse rules later on, we vary the configuration parameter that specifies the maximum depth of the trees (unrestricted, if set to $0$) (cf. Algorithm 1, trainForest). For building individual trees, we only take a subset of the available training instances and attributes into account, which guarantees a diverse set of trees. Bagging is used for sampling the training instances, i.e., if $m$ instances are available in total, $m$ instances (a sample size of 100%, by default) are drawn randomly with replacement. Additionally, each time a new node is added to a decision tree, only a random selection of the $l$ available attributes is considered.
To extract rules from a random forest (cf. Algorithm 1, extractRules), we traverse all paths from the root node to a leaf in each of its decision trees. We only consider paths that lead to a leaf where the minority class $t_i$ is predicted. As a consequence, all rules that are generated with respect to a certain label $\lambda_i$ have the same head $\hat{y}_i = t_i$. The body of a rule consists of the conjunction of all conditions encountered on the path from the root to the corresponding leaf.
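The path-based rule extraction can be sketched as follows (an illustrative Python sketch; the dict-based tree representation is hypothetical and does not mirror Weka's internal data structures):

```python
def extract_rules(node, t, path=()):
    """Collect one rule body per root-to-leaf path whose leaf predicts the
    minority class t.

    A node is either a leaf {'predict': value} or an inner node
    {'attr': a, 'thresh': v, 'left': ..., 'right': ...} that splits on
    attr <= v (left branch) vs. attr > v (right branch).
    """
    if "predict" in node:  # leaf: emit a rule only for the minority class
        return [list(path)] if node["predict"] == t else []
    a, v = node["attr"], node["thresh"]
    # extend the condition path along both branches
    return (extract_rules(node["left"], t, path + ((a, "<=", v),))
            + extract_rules(node["right"], t, path + ((a, ">", v),)))
```

Each returned body is the conjunction of all split conditions on the way from the root to a minority-class leaf.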
3.2 Candidate subset selection
Like many traditional rule learning algorithms, we use a separate-and-conquer (SeCo) strategy for selecting candidate rules, i.e., new rules are added to the theory until all training instances are covered (or until the theory describes the training data sufficiently according to some stopping criterion). Whenever a new rule is added to the theory, the training instances it covers are removed (“separate” step), and the next rule is chosen according to its performance on the remaining instances (“conquer” step).
To create different theories, we select subsets of the rules that have been generated earlier (cf. Section 3.1). We apply the SeCo strategy for each label independently, i.e., for each label $\lambda_i$ we take all rules with head $\hat{y}_i = t_i$ into account. Among these candidates, we successively select the best rule according to a heuristic (cf. Section 2.4) until all positive training instances, i.e., those associated with the minority class $t_i$ of label $\lambda_i$, are covered. To measure the quality of a candidate according to the heuristic, we only take yet uncovered instances into account for computing the confusion matrix $C$. If two candidates evaluate to the same heuristic value, we prefer the one that a) covers more true positives, or b) contains fewer conditions in its body. Whenever a new rule is added, the overall coverage of the theory increases, as more positive training instances are covered. However, the rule may also cover some of the negative instances, i.e., those associated with the majority class. As the rule's prediction is incorrect in such cases, the consistency of the theory may decrease.
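The selection procedure for a single label can be sketched as follows (a Python sketch; the dict-based rule representation and the callback `eval_rule` are our own abstractions, not part of the actual implementation):

```python
def seco_select(candidates, pos, heuristic, eval_rule):
    """Greedy separate-and-conquer selection of rules for one label.

    candidates: list of rules, each a dict with at least a "body" entry;
    pos: ids of the positive (minority-class) training instances;
    heuristic(tp, fp, tn, fn) -> float scores a candidate;
    eval_rule(rule, uncovered) -> (tp, fp, tn, fn, covered_pos_ids)
    computes the confusion matrix on the still-uncovered instances only.
    """
    theory, uncovered = [], set(pos)
    while uncovered:
        best, best_key, best_cov = None, None, set()
        for rule in candidates:
            tp, fp, tn, fn, cov = eval_rule(rule, uncovered)
            # prefer the higher heuristic value; break ties by more true
            # positives, then by fewer conditions in the body
            key = (heuristic(tp, fp, tn, fn), tp, -len(rule["body"]))
            if best_key is None or key > best_key:
                best, best_key, best_cov = rule, key, cov
        if not best_cov:  # best rule covers no remaining positives: stop
            break
        theory.append(best)    # "conquer": add the best rule
        uncovered -= best_cov  # "separate": remove covered positives
    return theory
```

The loop terminates once all positives are covered, or once even the best remaining candidate covers no uncovered positive instance.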
3.3 Threshold selection
As described in Section 3.2, we use a SeCo strategy to select rules until, for each label, all positive training instances are covered. In this way, the coverage of the resulting theory is maximized at the expense of consistency, because each rule contributes to the overall coverage, but might introduce wrong predictions for some instances. To trade off between these aspects, we allow a threshold to be specified (optionally) that aims at diminishing the effects of inconsistent rules. It is compared to a heuristic value that is calculated for each rule according to a given heuristic. For calculating this heuristic value, the rule's predictions on the entire training data set are taken into account. This is different from the candidate selection discussed in Section 3.2, where instances that are already covered by previously selected rules are not considered. Because the candidate selection aims at selecting non-redundant rules that cover the positive training instances as uniformly as possible, it considers rules in the context of their predecessors. In contrast, the threshold is applied at prediction time, when no order is imposed on the rules, i.e., all rules whose heuristic value exceeds the threshold contribute equally to the prediction.
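The threshold-based filtering step then reduces to the following (a Python sketch; `global_confusion` stands for evaluating a rule on the entire training data, independently of all other rules, as described above):

```python
def filter_by_threshold(theory, tau, heuristic, global_confusion):
    """Keep only the rules whose heuristic value on the *entire* training
    data reaches the threshold tau.

    theory: list of rules; heuristic(tp, fp, tn, fn) -> float;
    global_confusion(rule) -> (tp, fp, tn, fn) evaluates a rule in
    isolation, i.e., ignoring which instances other rules cover.
    """
    return [r for r in theory if heuristic(*global_confusion(r)) >= tau]
```

Raising `tau` removes the least consistent rules, trading some of the theory's coverage for consistency.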
In this section, we present an empirical study that emphasises the need to use varying heuristics for candidate selection and filtering to learn theories that are tailored to specific multi-label measures. We further compare our method to different baselines to demonstrate the benefits of being able to flexibly adjust a learner to different measures, rather than employing a general-purpose learner.
4.1 Experimental setup
We applied our method to eight different data sets taken from the Mulan project (data sets and detailed statistics are available at http://mulan.sourceforge.net/datasets-mlc.html). We set the minimum number of rules to be generated to 300,000 (cf. Algorithm 1). For candidate selection according to Section 3.2, we used the m-estimate (cf. Equation 6) with varying values for the parameter $m$. For each of these variants, we applied varying thresholds according to Section 3.3. The thresholds have been chosen such that they are satisfied by at least a certain percentage of the selected rules. All results have been obtained using 10-fold cross validation.
In addition to the m-estimate, we also used the F-measure (cf. Equation 5) with varying $\beta$-parameters. As the conclusions drawn from these experiments are very similar to those for the m-estimate, we focus on the latter at this point.
Among the performance measures that we report are micro-averaged precision and recall. Given a global confusion matrix that consists of the TP, FP, TN, and FN aggregated over all test instances and labels, these two measures are calculated as defined in Equations 3 and 4. Moreover, we report the micro-averaged F1 score (cf. Equation 5 with $\beta = 1$) as well as Hamming and subset accuracy. Hamming accuracy calculates as

$$\text{Hamm. Acc.} = \frac{TP + TN}{TP + FP + TN + FN}$$

whereas subset accuracy differs from the other measures, because it is computed instance-wise. Given true label vectors $\boldsymbol{y}_1, \dots, \boldsymbol{y}_m$ and predicted label vectors $\hat{\boldsymbol{y}}_1, \dots, \hat{\boldsymbol{y}}_m$, it measures the fraction of perfectly labeled instances:

$$\text{Subs. Acc.} = \frac{1}{m} \sum_{j=1}^{m} \llbracket \boldsymbol{y}_j = \hat{\boldsymbol{y}}_j \rrbracket$$
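These evaluation measures can be computed as follows (a Python sketch; micro-averaged precision, recall, and F1 follow from the aggregated counts via the formulas of Section 2):

```python
def micro_confusion(Y, Y_hat):
    """Aggregate TP/FP/TN/FN over all test instances and labels.

    Y, Y_hat: lists of equally long binary label vectors (ground truth
    and predictions, respectively).
    """
    tp = fp = tn = fn = 0
    for y, y_hat in zip(Y, Y_hat):
        for yi, pi in zip(y, y_hat):
            tp += yi == 1 and pi == 1
            fp += yi == 0 and pi == 1
            tn += yi == 0 and pi == 0
            fn += yi == 1 and pi == 0
    return tp, fp, tn, fn

def hamming_accuracy(Y, Y_hat):
    """Fraction of correctly predicted labels among all labels."""
    tp, fp, tn, fn = micro_confusion(Y, Y_hat)
    return (tp + tn) / (tp + fp + tn + fn)

def subset_accuracy(Y, Y_hat):
    """Instance-wise: fraction of perfectly predicted label vectors."""
    return sum(y == y_hat for y, y_hat in zip(Y, Y_hat)) / len(Y)
```

Note that a single wrongly predicted label already makes an instance count as a miss for subset accuracy, which is why it is the strictest of the reported measures.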
4.2 Analysis of different parameter settings
For a broad analysis, we trained 400 theories per data set using the same candidate rules, but selecting and filtering them differently by using varying combinations of the parameter $m$ and the threshold, as discussed in Section 4.1. We visualize the performance and characteristics of the resulting models as two-dimensional matrices of scores (cf. e.g. Figure 1). One dimension corresponds to the used $m$-parameter, the other to the threshold.
Ranks and standard deviation of average ranks over all data sets according to Hamming and subset accuracy, using different parameters $m$ (horizontal axis) and thresholds (vertical axis). The best parameters for the different data sets are indicated by red + signs.
Some of the used data sets (cal500, flags, and yeast) contain very frequent labels for which the minority class $t_i = 0$. This is rather atypical in MLC and causes the unintuitive effect that the removal of individual rules results in a theory with greater recall and/or lower precision. To be able to compare different parameter settings across multiple data sets, we worked around this effect by altering the affected data sets, i.e., by inverting all labels for which $t_i = 0$.
4.2.1 Predictive performance.
In Figures 1 and 2, the average ranks of the tested configurations according to different performance measures are depicted. The rank of each of the 400 parameter settings was determined for each data set separately and then averaged over all data sets. The depicted standard deviations show that the optimal parameter setting for a respective measure may vary depending on the data set. However, for each measure there is an area in the parameter space where a good setting can be found with high certainty.
As can clearly be seen, precision and recall are competing measures. The former is maximized by choosing small values for $m$ and filtering extensively, whereas the latter benefits from large values for $m$ and no filtering. Interestingly, setting $m = 0$, i.e., selecting candidates according to the precision metric, does not result in the models with the highest overall precision. This is in accordance with Figure 3, where the models with the highest F1 score do not result from using the F1-measure for candidate selection. Instead, optimizing the F1 score requires choosing small values for $m$ to trade off between consistency and coverage. The same applies to Hamming and subset accuracy, albeit both of these measures demand putting even more weight on consistency and filtering more extensively compared to F1.
(Example models: Mi. Precision = 74.07%, Mi. Recall = 78.26% vs. Mi. Precision = 65.61%, Mi. Recall = 89.57%.)
4.2.2 Model characteristics.
Besides the predictive performance, we are also interested in the characteristics of the theories. Figure 4 shows how the number of rules in a theory, as well as the average number of conditions per rule, are affected by varying parameter settings. The number of rules consistently declines when using greater values for the parameter $m$ and/or smaller values for the threshold, resulting in less complex theories that can be comprehended by humans more easily. The average number of conditions is mostly affected by the parameter $m$.
Figure 5 provides an example of how different parameters affect the model characteristics. It shows the rules for predicting the same label as induced by two fundamentally different approaches. The first approach reaches high scores according to the F1-measure, Hamming accuracy, and subset accuracy, whereas the second one results in high recall.
4.3 Baseline comparison
Although the goal of this work is not to develop a method that generally outperforms existing rule learners, we want to ensure that we achieve competitive results. For this reason, we compared our method to JRip, Weka's re-implementation of Ripper [cohen1995], using the binary relevance method. By default, Ripper uses incremental reduced error pruning (IREP) and post-processes the induced rule set. Although our approach could make use of such optimizations, this is beyond the scope of this work. For a fair comparison, we also report the results of JRip without using IREP and/or with post-processing turned off.
(Table 1: predictive performance in terms of F1, Hamming accuracy, and subset accuracy.)
Note that we do not consider the random forests from which we generate rules (cf. Section 3.1) to be relevant baselines. This is because random forests use voting for making a prediction, which is fundamentally different from rule learners that model a DNF. Also, we train random forests consisting of a very large number of trees with varying depths in order to generate diverse rules. In our experience, such random forests perform badly compared to commonly used configurations.
We tested three different configurations of our approach. The parameters used by these configurations have been determined on a validation set by using nested 5-fold cross validation on the training data. For the first configuration, the parameters have been chosen such that the F1-measure is maximized. The other two configurations were tuned with respect to Hamming and subset accuracy, respectively.
According to Table 1, our method is able to achieve reasonable predictive performance. With respect to the measure they try to optimize, our approaches generally rank before JRip with optimizations turned off, which is the competitor that is conceptually closest to our method. Although IREP definitely has a positive effect on the predictive performance, our approaches also tend to outperform JRip with IREP enabled but post-processing turned off. Despite the absence of advanced pruning and post-processing techniques, our approaches are even able to surpass the fully fledged variant of JRip on some data sets. We consider these results a clear indication that it is indispensable to be able to flexibly adapt the heuristic used by a rule learner — which JRip is not capable of — if one aims at deliberately optimizing a specific multi-label performance measure.
5 Related work
Several rule-based approaches to multi-label classification have been proposed in the literature. On the one hand, there are methods based on descriptive rule learning, such as association rule discovery [thabtah2004, thabtah2006, li2008, lakkaraju2016, allamanis2013, cano2013], or evolutionary classification systems [arunadevi2011, avila2010]. On the other hand, there are algorithms that adopt the separate-and-conquer strategy used by many traditional rule learners for binary or multi-class classification, e.g. by Ripper [cohen1995], and transfer it to MLC [mencia2016, rapp2018]. Whereas descriptive rule learning usually does not aim at discovering rules that minimize a certain (multi-label) loss, the latter approaches employ a heuristic-guided search for rules that optimize a given rule learning heuristic and hence could benefit from the results of this work.
Similar to our experiments, empirical studies aimed at discovering optimal rule learning heuristics have been published in the realm of single-label classification [janssen2008, janssen2010]. Moreover, ROC space isometrics have proven to be a helpful tool for investigating the properties of bipartition evaluation functions [flach2003, furnkranz2003]. They have successfully been used in the literature to study the effects of using different heuristics in separate-and-conquer algorithms [furnkranz2005], or for ranking and filtering rules [furnkranz2004].
In this work, we presented a first empirical study that thoroughly investigates the effects of using different rule learning heuristics for candidate selection and filtering in the context of multi-label classification. As commonly used multi-label measures, such as micro-averaged F1, Hamming accuracy, or subset accuracy, require putting more weight on the consistency of rules than on their coverage, models that perform well with respect to these measures are usually small and tend to contain specific rules. This is beneficial in terms of interpretability, as less complex models are assumed to be easier for humans to understand.
As our main contribution, we emphasise the need to flexibly trade off the consistency and coverage of rules, e.g., by using parameterized heuristics like the m-estimate, depending on the multi-label measure that should be optimized by the model. Our study revealed that the choice of the heuristic is not straightforward, because selecting rules that minimize a certain loss function locally does not necessarily result in that loss being optimized globally. E.g., selecting rules according to the F1-measure does not result in the overall F1 score being maximized. For optimal results, the trade-off between consistency and coverage should be fine-tuned depending on the data set at hand. However, our results indicate that, even across different domains, the optimal settings for maximizing a measure can often be found in the same region of the parameter space.
In this work, we restricted our study to DNFs, i.e., models that consist of non-conflicting rules which all predict the same outcome for an individual label. On the one hand, this restriction simplifies the implementation and the comprehensibility of the learner, as no conflicts may arise at prediction time. On the other hand, we expect that including both rules that model the presence of labels and rules that model their absence could be beneficial in terms of robustness and could have similar, positive effects on the consistency of the models as the threshold selection used in this work. Furthermore, we leave the empirical analysis of macro-averaged performance measures for future work.