1 Introduction
Multilabel classification (MLC) is the task of learning a model for assigning a set of labels to unknown instances [16]. For example, newspaper articles can often be associated with multiple topics. This is in contrast to binary or multiclass classification, where single classes are predicted. As many studies show, MLC approaches that are able to take correlations between labels into account can be expected to achieve better predictive results (see [7, 11, 16]; and references therein).
In addition to statistical approaches that often rely on complex mathematical concepts, such as Bayesian or neural networks, rule learning algorithms have recently been proposed as an alternative. Rules are not only a natural and simple form to represent a learned model, but they are also well suited for making discovered correlations between instance attributes and labels explicit [11]. Especially in safety-critical application domains, such as medicine, power systems, autonomous driving or financial markets, where hidden malfunctions could lead to life-threatening actions or economic loss, the possibility of interpreting, inspecting and verifying a classification model is essential (cf., e.g., [9]). However, the algorithm of [11], which is based on the separate-and-conquer (SeCo) strategy, can only learn dependencies in which the presence or absence of a single label depends on a subset of the instance's features. In particular, co-occurrences of labels – a common pattern in multilabel data – are representable only by a combination of rules. Conversely, algorithms based on subgroup discovery have been proposed that are able to find single rules predicting a subset of the possible labels [5]. However, this framework relies on adaptations of conventional rule learning heuristics for rating and selecting candidate rules and can thus not easily be adapted to the variety of loss functions commonly used for evaluating multilabel predictions. Such an adaptation is not straightforward, because it is not known whether these measures satisfy properties like antimonotonicity that can ensure an efficient exploration of the search space of all possible rule heads – a space that grows exponentially with the number of available labels.
Thus, the main contribution of this work (presented in Section 3) is to formally define antimonotonicity in the context of multilabel rules and to prove that selected multilabel metrics satisfy that property. Based on these findings, we present an algorithm that prunes the search for multilabel rules in Section 4. This algorithm is not meant to set new standards in terms of predictive performance, but to serve as a starting point for developing more advanced approaches. Nevertheless, in Section 5 we show that it is able to compete with different baselines in terms of predictive and – more importantly – computational performance.
2 Preliminaries
The task of MLC is to associate an instance with one or several labels $\lambda_i$ out of a finite label space $\mathcal{L} = \{\lambda_1, \dots, \lambda_n\}$, with $n$ being the number of available labels. An instance $\boldsymbol{x}$ is typically represented in attribute-value form, i.e., it consists of a vector $\boldsymbol{x} = (v_1, \dots, v_l) \in A_1 \times \dots \times A_l$, where $A_i$ is a numeric or nominal attribute. Each instance $\boldsymbol{x}_i$ is mapped to a binary label vector $\boldsymbol{y}_i = (y_i^1, \dots, y_i^n) \in \{0, 1\}^n$ which specifies the labels that are associated with the example $\boldsymbol{x}_i$. Consequently, the training data set of a MLC problem can be defined as a sequence of tuples $T = ((\boldsymbol{x}_1, \boldsymbol{y}_1), \dots, (\boldsymbol{x}_m, \boldsymbol{y}_m))$ with $m$ examples. The model which is derived from a given multilabel data set can be viewed as a classifier function $g$ mapping a single example $\boldsymbol{x}$ to a prediction $\hat{\boldsymbol{y}} = (\hat{y}^1, \dots, \hat{y}^n)$.

2.1 Multilabel rule learning
We are concerned with learning multilabel rules $\boldsymbol{r}: \hat{\boldsymbol{y}} \leftarrow B$. The body $B$ may consist of several conditions that the examples covered by the rule have to satisfy. In this work only conjunctive, propositional rules are considered, i.e., each condition compares an attribute's value to a constant by either using equality (nominal attributes) or inequalities (numerical attributes). It is also possible to include label conditions in the body [11, 12]. This allows to expose, and distinguish between, unconditional or global dependencies and conditional or local dependencies [7].
The head $\hat{\boldsymbol{y}}$ consists of one or several label attributes ($\hat{y}_i = 0$ or $\hat{y}_i = 1$) which specify the absence or presence of the corresponding label $\lambda_i$. Rules that contain a single label attribute in their head are referred to as single-label head rules, whereas multilabel head rules may contain several label attributes in their head.
A predicted label vector may have different semantics. We differentiate between full predictions and partial predictions.

Full predictions: Each rule predicts a full label vector, i.e., if a label attribute is not contained in the head, the absence of the corresponding label is predicted.

Partial predictions: Each rule predicts the presence or absence of the label only for a subset of the possible labels. For the remaining labels the rule does not make a prediction (but other rules might).
We believe that partial predictions have several conceptual and practical advantages and therefore we focus on that particular strategy throughout the remainder of this work.
2.2 Bipartition evaluation functions
To evaluate the quality of multilabel predictions, we use bipartition evaluation measures (cf. [16]) which are based on evaluating differences between true (ground truth) and predicted label vectors. They can be considered as functions of two-dimensional label confusion matrices which represent the true positive ($TP$), false positive ($FP$), true negative ($TN$) and false negative ($FN$) label predictions. For a given example $\boldsymbol{x}_i$ and a label $\lambda_j$, the elements of an atomic confusion matrix $C_i^j$ are computed as

$$TP_i^j = [\![ y_i^j = 1 \wedge \hat{y}_i^j = 1 ]\!] \quad FP_i^j = [\![ y_i^j = 0 \wedge \hat{y}_i^j = 1 ]\!] \quad TN_i^j = [\![ y_i^j = 0 \wedge \hat{y}_i^j = 0 ]\!] \quad FN_i^j = [\![ y_i^j = 1 \wedge \hat{y}_i^j = 0 ]\!] \quad (1)$$

where the variables $y_i^j$ and $\hat{y}_i^j$ denote the absence (0) or presence (1) of label $\lambda_j$ for example $\boldsymbol{x}_i$ according to the ground truth or the predicted label vector, respectively.
Note that for candidate rule selection we assess $TP$, $FP$, $TN$, and $FN$ differently. To ensure that absent and present labels have the same impact on the performance of a rule, we always count correctly predicted labels as $TP$ and incorrectly predicted labels as $FP$. Labels for which no prediction is made are counted as $TN$ if they are absent, or as $FN$ if they are present.
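The counting scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the representation of a partial prediction as `None` is an assumption made for this sketch.

```python
# Sketch of the counting scheme used for candidate rule selection:
# correctly predicted labels count as TP, incorrect predictions as FP;
# labels without a prediction count as TN if absent, FN if present.
def atomic_counts(y_true, y_pred):
    """y_true: list of 0/1 ground-truth labels; y_pred: list of 0/1/None,
    where None means the rule makes no prediction for that label.
    Returns (TP, FP, TN, FN)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred is None:        # no prediction made for this label
            if truth == 1:
                fn += 1         # label present but not predicted
            else:
                tn += 1         # label absent and not predicted
        elif pred == truth:
            tp += 1             # any correct prediction counts as TP
        else:
            fp += 1             # any incorrect prediction counts as FP
    return tp, fp, tn, fn
```

For example, `atomic_counts([1, 0, 1], [1, 1, None])` yields one $TP$ (first label), one $FP$ (second label) and one $FN$ (unpredicted present label).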
2.2.1 Multilabel evaluation functions
In the following, some of the most common bipartition metrics used for MLC are presented (cf., e.g., [16]). They are surjections mapping a confusion matrix $C$ to a heuristic value $\delta(C) \in [0, 1]$. Predictions that reach a greater heuristic value outperform those with smaller values.

Precision: Percentage of correct predictions among all predicted labels.

$$\delta_{prec}(C) = \frac{TP}{TP + FP} \quad (2)$$

Hamming accuracy: Percentage of correctly predicted present and absent labels among all labels.

$$\delta_{Hamm}(C) = \frac{TP + TN}{TP + FP + TN + FN} \quad (3)$$

F-measure: Weighted harmonic mean of precision and recall. If $\beta < 1$, precision has a greater impact. If $\beta > 1$, the F-measure becomes more recall-oriented.

$$\delta_{F}(C) = \frac{(1 + \beta^2) \cdot TP}{(1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP} \quad (4)$$

Subset accuracy: Percentage of perfectly predicted label vectors among all examples. Per definition, it is always calculated using example-based averaging.

$$\delta_{acc} = \frac{1}{m} \sum_{i=1}^{m} [\![ \boldsymbol{y}_i = \hat{\boldsymbol{y}}_i ]\!] \quad (5)$$
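The four measures above can be written directly as functions of the aggregated counts (subset accuracy, being example-based, operates on whole label vectors instead). The following is a sketch under the standard definitions; function names are ours.

```python
# The bipartition measures (2)-(5) as functions of confusion-matrix counts.
def precision(tp, fp, tn, fn):
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def hamming_accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def f_measure(tp, fp, tn, fn, beta=1.0):
    # weighted harmonic mean of precision and recall, rewritten in counts
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0

def subset_accuracy(y_true_vectors, y_pred_vectors):
    # fraction of examples whose label vector is predicted perfectly
    m = len(y_true_vectors)
    return sum(yt == yp for yt, yp in zip(y_true_vectors, y_pred_vectors)) / m
```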
2.2.2 Aggregation and averaging
When evaluating multilabel predictions which have been made for $m$ examples with $n$ labels, one has to deal with the question of how to aggregate the resulting $m \cdot n$ atomic confusion matrices. Essentially, there are four possible averaging strategies – either (label- and example-based) micro-averaging, label-based (macro-)averaging, example-based (macro-)averaging or (label- and example-based) macro-averaging. Due to space limitations, we restrict our analysis to the most popular aggregation strategy employed in the literature, namely micro-averaging. This particular averaging strategy is formally defined as

$$\delta(C_1^1 \oplus C_1^2 \oplus \dots \oplus C_m^{n-1} \oplus C_m^n) \quad (6)$$

where the operator $\oplus$ denotes the cell-wise addition of confusion matrices.
2.2.3 Relation to conventional association rule discovery
To illustrate the difference between measures used in association rule discovery and in multilabel rule learning, assume that a rule with a multilabel head covers three examples. In conventional association rule discovery the head is considered to be satisfied only if it matches an example exactly; if this is the case for one of the three covered examples, the resulting precision/confidence value is 1/3. This essentially corresponds to subset accuracy. On the other hand, micro-averaged precision would correspond to the fraction of correctly predicted labels among all predicted labels, which also gives credit for partially correct predictions.
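The contrast can be made concrete with a small, made-up example: a rule with a two-label head covers three examples, of which only the first matches the head exactly. The label values below are hypothetical and chosen for illustration.

```python
# Association-rule confidence requires the whole head to be satisfied;
# micro-averaged precision credits partially correct predictions.
covered = [   # ground truth for the two head labels of each covered example
    (1, 1),   # head fully satisfied
    (1, 0),   # one of two labels correct
    (1, 0),   # one of two labels correct
]
head = (1, 1)  # the rule predicts both labels as present

confidence = sum(truth == head for truth in covered) / len(covered)
micro_precision = (sum(t == p for truth in covered
                       for t, p in zip(truth, head))
                   / (len(covered) * len(head)))
print(confidence)       # 1/3: only the first example matches exactly
print(micro_precision)  # 4/6: four of six predicted labels are correct
```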
3 Properties of multilabel evaluation measures
To induce multilabel head rules, we need to find the multilabel head $\hat{\boldsymbol{y}}^*$ which reaches the best possible performance

$$h^* = \max_{\hat{\boldsymbol{y}}} \delta(\hat{\boldsymbol{y}} \leftarrow B) \quad (7)$$

given an evaluation function $\delta$ and a body $B$. In this section we consider rule evaluation functions that are based on micro-averaged atomic confusion matrices in a partial prediction setting, i.e., where $\delta$ is defined as in (6).
Due to the exponential complexity of an exhaustive search, it is crucial to prune the search for the best multilabel head by leaving out unpromising label combinations. The first property which can be exploited for pruning searches – while still being able to find the best solution – is antimonotonicity.
Definition 1 (Antimonotonicity)
Let $\boldsymbol{r}_1: \hat{\boldsymbol{y}}_1 \leftarrow B$ and $\boldsymbol{r}_2: \hat{\boldsymbol{y}}_2 \leftarrow B$ denote two multilabel head rules consisting of the same body $B$ and heads $\hat{\boldsymbol{y}}_1$ and $\hat{\boldsymbol{y}}_2$, respectively. It is further assumed that $\hat{\boldsymbol{y}}_1 \subset \hat{\boldsymbol{y}}_2$. A multilabel evaluation function $\delta$ is antimonotonic if the following condition is met, i.e., if no head that results from adding additional labels to $\hat{\boldsymbol{y}}_1$ may result in $h^*$ being reached:

$$\delta(\boldsymbol{r}_1) < h^* \implies \delta(\boldsymbol{r}_2) < h^*$$
In addition to the adaptation of antimonotonicity in Definition 1, we propose decomposability as a stronger criterion. It comes at linear cost, as the best possible head can be deduced by considering each available label separately. Due to its restrictiveness, if Definition 2 is met, Definition 1 is met as well.
Definition 2 (Decomposability)
A multilabel evaluation function is decomposable if the following conditions are met:

If the multilabel head rule contains a label attribute for which the corresponding single-label head rule does not reach $h^*$, the multilabel head rule cannot reach that performance either (and vice versa).

If all single-label head rules which correspond to the label attributes of the multilabel head reach $h^*$, the multilabel head rule reaches that performance as well (and vice versa).
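For a decomposable measure, these two conditions justify a simple linear-time procedure: evaluate every single-label head separately and merge all those that attain the maximum. The following is a sketch of that idea, with names of our own choosing.

```python
# Sketch of how decomposability permits a linear-time search for the best
# head: evaluate each single-label head in isolation and merge all label
# attributes that attain the maximum heuristic value.
def best_head_decomposable(candidate_labels, evaluate):
    """candidate_labels: iterable of label attributes (e.g. 'y2=1');
    evaluate: heuristic value of the single-label head, higher is better.
    Returns the merged multilabel head and its heuristic value."""
    scores = {attr: evaluate(attr) for attr in candidate_labels}
    best = max(scores.values())
    # merge every single-label head that reaches the best score
    return {attr for attr, s in scores.items() if s == best}, best
```

For instance, if two label attributes tie for the best single-label score, both end up in the returned multilabel head.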
In the following we examine selected multilabel metrics in terms of decomposability and antimonotonicity to reveal whether they satisfy these properties when making partial predictions (cf. Section 2.1).
Theorem 3.1
Microaveraged precision is decomposable.
Proof
We rewrite the performance calculation for a multilabel head rule $\boldsymbol{r}$ with head $\hat{\boldsymbol{y}} = \{\hat{y}_1, \dots, \hat{y}_k\}$ using the fact that the single-label head rules $\boldsymbol{r}_j$ with $j \in \{1, \dots, k\}$ share the same body and therefore cover the same number of examples $m'$.

$$\delta_{prec}(\boldsymbol{r}) = \frac{\sum_j TP_j}{\sum_j (TP_j + FP_j)} = \frac{\sum_j TP_j}{k \cdot m'} = \frac{1}{k} \sum_j \frac{TP_j}{m'} = \frac{1}{k} \sum_j \delta_{prec}(\boldsymbol{r}_j) \quad (8)$$

Thus, the micro-averaged precision for $\boldsymbol{r}$ corresponds to the average of the micro-averaged precision of the single-label head rules $\boldsymbol{r}_j$. As we assume that $h^*$ is maximal, it follows that the average can only reach $h^*$ if $\delta_{prec}(\boldsymbol{r}_j) = h^*$ for all single-label head rules $\boldsymbol{r}_j$.
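A small numerical check of this argument can be run on toy data. The label values below are made up; the point is only that, with a shared body, every single-label head is evaluated on the same covered examples, so the micro-averaged precision of the combined head equals the mean of the single-label precisions.

```python
# Toy check of Theorem 3.1: micro-averaged precision of a two-label head
# equals the average of the two single-label precisions.
covered_truth = [(1, 0), (1, 1), (0, 1)]  # ground truth for labels a and b
head = (1, 1)                             # predict both labels as present

def label_precision(j):
    # precision of the single-label head for label j over covered examples
    correct = sum(truth[j] == head[j] for truth in covered_truth)
    return correct / len(covered_truth)

singles = [label_precision(j) for j in range(2)]
micro = (sum(truth[j] == head[j] for truth in covered_truth for j in range(2))
         / (len(covered_truth) * 2))
assert abs(micro - sum(singles) / len(singles)) < 1e-12
```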
Theorem 3.2
Microaveraged Hamming accuracy is decomposable.
Proof
Similar to (8), we rewrite the micro-averaged Hamming accuracy of a multilabel head rule $\boldsymbol{r}$ with head $\hat{\boldsymbol{y}} = \{\hat{y}_1, \dots, \hat{y}_k\}$ in terms of averaging the performance of single-label head rules $\boldsymbol{r}_j$. This is possible as the performance for each label calculates as the percentage of $TP$ and $TN$ among all labels, and the denominator $TP_j + FP_j + TN_j + FN_j = m$ is the same for each label.

$$\delta_{Hamm}(\boldsymbol{r}) = \frac{\sum_j (TP_j + TN_j)}{\sum_j (TP_j + FP_j + TN_j + FN_j)} = \frac{\sum_j (TP_j + TN_j)}{k \cdot m} = \frac{1}{k} \sum_j \delta_{Hamm}(\boldsymbol{r}_j) \quad (9)$$
Theorem 3.3
Subset accuracy is antimonotonic.
Proof
In accordance with Definition 1, two multilabel head rules $\boldsymbol{r}_1$ and $\boldsymbol{r}_2$, for whose heads the subset relationship $\hat{\boldsymbol{y}}_1 \subset \hat{\boldsymbol{y}}_2$ holds, take part in equation (10). The subscript notation is used to denote that a left-hand expression should be evaluated using the respective rule. The proof is based on writing subset accuracy in terms of the per-example confusion matrix counts (cf. line 2).
(10) 
In (10) it is concluded that when using the rule $\boldsymbol{r}_1$ the performance for at least one example is less than when using the best possible rule. Due to the definition of subset accuracy, the performance for that example must be 0 in the first case and 1 in the latter (cf. line 3). As the performance only evaluates to 0 if at least one label is predicted incorrectly, the head $\hat{\boldsymbol{y}}_1$ must contain a label attribute which predicts the corresponding label incorrectly (cf. line 4). When adding additional label attributes, the prediction for that label will still be incorrect (cf. line 5). Therefore, for all multilabel head rules which result from adding additional label attributes to the head $\hat{\boldsymbol{y}}_1$, the performance for that example evaluates to 0 (cf. line 6). Consequently, none of them can reach the overall performance of $\boldsymbol{r}_1$, nor $h^*$ (cf. lines 7 and 8).
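The core of this argument – an example lost once is lost forever as the head grows – can be demonstrated exhaustively on toy data. The label values below are invented; the per-head score is the fraction of covered examples for which every label attribute in the head is correct.

```python
# Toy demonstration of Theorem 3.3: once a head predicts some label
# incorrectly for an example, adding further label attributes can never
# make that example correct again, so the subset score is non-increasing
# along every chain of growing heads.
from itertools import combinations

covered_truth = [(1, 0, 1), (1, 1, 0), (0, 1, 1)]
full_head = {0: 1, 1: 1, 2: 0}   # label index -> predicted value

def subset_score(head):
    # fraction of covered examples for which every head label is correct
    return sum(all(truth[j] == v for j, v in head.items())
               for truth in covered_truth) / len(covered_truth)

# check every head against every one-label extension of it
for size in range(1, len(full_head)):
    for attrs in combinations(full_head, size):
        head = {j: full_head[j] for j in attrs}
        for extra in set(full_head) - set(attrs):
            extended = {**head, extra: full_head[extra]}
            assert subset_score(extended) <= subset_score(head)
```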
Lemma 1
Microaveraged recall is decomposable.
Proof
The mediant of two fractions $\frac{a}{b}$ and $\frac{c}{d}$ is defined as $\frac{a + c}{b + d}$. The micro-averaged recall of a multilabel head rule $\boldsymbol{r}$ is the mediant of the performances which are obtained for the corresponding single-label head rules $\boldsymbol{r}_j$ with $j \in \{1, \dots, k\}$ according to the recall metric.

$$\delta_{rec}(\boldsymbol{r}) = \frac{\sum_j TP_j}{\sum_j (TP_j + FN_j)} \quad (11)$$

The mediant inequality states that the mediant lies between the fractions it is calculated from, i.e., that $\min_j \delta_{rec}(\boldsymbol{r}_j) \le \delta_{rec}(\boldsymbol{r}) \le \max_j \delta_{rec}(\boldsymbol{r}_j)$. This is in accordance with Definition 2.
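The mediant property is easy to verify numerically. The counts below are made up for illustration.

```python
# Toy check of Lemma 1: micro-averaged recall over two labels is the
# mediant of the per-label recalls and therefore lies between them.
tp = [3, 1]   # true positives per label (assumed counts)
fn = [1, 3]   # false negatives per label (assumed counts)

recalls = [t / (t + f) for t, f in zip(tp, fn)]   # per-label recalls
micro = sum(tp) / (sum(tp) + sum(fn))             # mediant of the recalls
assert min(recalls) <= micro <= max(recalls)
```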
Theorem 3.4
Microaveraged Fmeasure is decomposable.
Proof
Micro-averaged F-measure calculates as the (weighted) harmonic mean of micro-averaged precision and recall. This proof is based on the finding that both of these metrics fulfill the properties of decomposability (cf. Theorem 3.1 and Lemma 1). As multiple metrics take part in the proof, we use a superscript notation to distinguish between the best possible performances according to different metrics, e.g., $h^*_F$ in case of the F-measure. Furthermore, we exploit the fact that the weighted harmonic mean of two values never exceeds the greater of the two, i.e., $\delta_F \le \max(\delta_{prec}, \delta_{rec})$.
(12) 
In (12) the first property of Definition 2 is proved. As the premise of the proof, we assume w.l.o.g. that the best possible performance according to the recall metric is equal to or greater than the best performance according to precision, i.e., that the relation $h^*_{rec} \ge h^*_{prec}$ holds. We further assume that the F-measure of a single-label head rule $\boldsymbol{r}_j$ is less than the best possible performance $h^*_F$ (cf. lines 1 and 2). When rewriting the F-measure in terms of the harmonic mean of precision and recall, it follows that either the recall or the precision of $\boldsymbol{r}_j$ must be less than $h^*_{rec}$, respectively $h^*_{prec}$. Due to the premise of the proof, $h^*_{rec}$ can be considered as an upper limit for both recall and precision (cf. line 3). Furthermore, because precision and recall are decomposable, the multilabel head rule cannot outperform $h^*_F$ (cf. lines 5, 7 and 8). In order to prove the second property of decomposability to be met, the derivation in (13) uses a similar approach as in (12). However, it does not rely on the premise above.
(13) 
4 Algorithm for learning multilabel head rules
To evaluate the utility of these properties, we implemented a multilabel rule learning algorithm based on the SeCo algorithm for learning single-label head rules by Loza Mencía and Janssen [11]. Both algorithms share a common structure where new rules are induced iteratively and the examples they cover are removed from the training data set if enough of their labels are predicted by already learned rules. The rule induction process continues until only few training examples are left. To classify test examples, the learned rules are applied in the order of their induction. If a rule fires, the labels in its head are applied unless they were already set by a previous rule.
For learning new multilabel rules, our algorithm performs a top-down greedy search, starting with the most general rule. By adding additional conditions to the rule's body it can successively be specialized, resulting in fewer examples being covered. Potential conditions result from the values of nominal attributes or from averaging two adjacent values of the sorted examples in case of numerical attributes. Whenever a new condition is added, a corresponding single- or multilabel head that predicts the labels of the covered examples as accurately as possible must be found.
Evaluating possible multilabel heads To find the best head for a given body, different label combinations must be evaluated by calculating a score based on the averaging and evaluation strategy used. The algorithm performs a breadth-first search by recursively adding additional label attributes to the (initially empty) head and keeps track of the best rated head. Instead of performing an exhaustive search, the search space is pruned according to the findings in Section 3. When pruning according to antimonotonicity, unnecessary evaluations of label combinations are omitted in two ways: On the one hand, if adding a label attribute causes the performance to decrease, the recursion is not continued at deeper levels of the currently searched subtree. On the other hand, the algorithm keeps track of already evaluated or pruned heads and prevents these heads from being evaluated in later iterations. When a decomposable evaluation metric is used, no deep searches through the label space must be performed. Instead, all possible single-label heads are evaluated in order to identify those that reach the highest score, which are then merged into one multilabel head rule.
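The antimonotonic variant of the search can be sketched as follows. This is a simplified illustration in our own notation, not the authors' implementation: heads are sets of label attributes, and `evaluate` is assumed to be an antimonotonic heuristic.

```python
# Sketch of a breadth-first search over label-attribute heads that exploits
# antimonotonicity: subtrees where adding an attribute decreased the score
# are not expanded, and already-seen heads are never evaluated twice.
def find_best_head(attributes, evaluate):
    """attributes: candidate label attributes; evaluate: maps a frozenset
    of attributes to a heuristic value (assumed antimonotonic)."""
    best_head, best_score = None, float("-inf")
    frontier = [frozenset()]
    seen = set(frontier)
    while frontier:
        next_frontier = []
        for head in frontier:
            for attr in attributes:
                if attr in head:
                    continue
                candidate = head | {attr}
                if candidate in seen:
                    continue            # equivalent heads evaluated once
                seen.add(candidate)
                score = evaluate(candidate)
                if score > best_score:
                    best_head, best_score = candidate, score
                # prune: stop expanding once the score decreased
                if not head or score >= evaluate(head):
                    next_frontier.append(candidate)
        frontier = next_frontier
    return best_head, best_score
```

With a precision-like heuristic such as `len(head & target) / len(head)`, only the promising branches of the label lattice are expanded.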
Fig. 1 illustrates how the algorithm prunes a search through the label space using antimonotonicity and decomposability. The nodes of the given search tree correspond to the evaluations of label combinations, resulting in heuristic values. The edges correspond to adding an additional label to the head which is represented by the preceding node. As equivalent heads need not be evaluated multiple times, the tree is unbalanced.
5 Evaluation
The purpose of the experimental evaluation was to demonstrate the applicability of the proposed SeCo algorithm despite the exponentially large search space. We did not expect any significant improvements in predictive performance since no enhancements in that respect were made to the original algorithm as proposed in [11].
Experimental setup We compared our multilabel head algorithm to its single-label head counterpart and also to the binary relevance method on 8 different data sets.¹ Following [11], we used Hamming accuracy, subset accuracy (only for multilabel heads), micro-averaged precision and F-measure (with a fixed $\beta$) on partial predictions for candidate rule selection and also allowed negative assignments in the heads.

¹ scene (6, 1.06), emotions (6, 1.87), flags (7, 3.39), yeast (14, 4.24), birds (19, 1.01), genbase (27, 1.25), medical (45, 1.24), cal500 (174, 26.15), with respective number of labels and cardinality, from http://mulan.sf.net. Source code and results are available at https://github.com/keelm/SeCo-MLC.
Predictive performance Due to space limitations, we limit ourselves to the results of the statistical tests (following [8]). The null hypothesis of the Friedman test that all algorithms have the same predictive quality could not be rejected for many of the evaluation measures, such as subset accuracy and micro- and macro-averaged F1. In the other cases, the Nemenyi post-hoc test was not able to assess a statistical difference between the algorithms using the same heuristic.

Computational costs As expected, SeCo finds rules with a comparable predictive performance when searching for multilabel head rules. However, from the point of view of the proven properties of the evaluation measures, it was more interesting to demonstrate the usefulness of antimonotonicity and decomposability regarding the computational efficiency. Fig. 2 shows the relation between the time spent for finding single- vs. multilabel head rules using the same heuristic and data set. The empty forms denote the single-label times multiplied by the number of labels in the data set. Note that a full exploration of the label space was already intractable for the smaller data sets on our system. We can observe that the costs for learning multilabel head rules are in the same order of magnitude despite effectively exploring the full label space for each candidate body.
Rule models When analyzing the characteristics of the models which have been learned by the proposed algorithm, it becomes apparent that more multilabel head rules are learned when using the precision metric than when using one of the other metrics. This is due to the fact that precision only takes $TP$ and $FP$ into account. Therefore, the performance of such a rule depends exclusively on the examples it covers. When using another metric, where the performance also depends on uncovered examples, it is very likely that the performance of a rule slightly decreases when adding an additional label to its head. This causes single-label heads to be preferred. The inclusion of a factor which takes the head's size into account could resolve this bias and lead to heads with more labels.
Whether more labels in the head are desirable or not highly depends on the data set at hand, the particular scenario and the preferences of the user, as generally do comprehensibility and interpretability of rules. These issues cannot be solved by the proposed method, nor are they in the scope of this work. However, the proposed extension of SeCo to multilabel head rules can lay the foundation for further improvements, gaining better control over the characteristics of the induced model and hence better adaptation to the requirements of a particular use case.
The extended expressiveness of multilabel head rules can be illustrated by the following example. Consider the rules in Fig. 3, learned on the data set flags, which maps characteristics of a flag and the corresponding country to the colors appearing on the flag. The shown rules all cover the flag of the US Virgin Islands. Whereas in this case the single-label heads allow an easier visualization of the pairwise dependencies between characteristics/labels and labels, the multilabel head rules can represent more complex relationships and provide a more direct explanation of why the respective colors are predicted for the flag.
6 Related work
So far, only a few approaches to multilabel rule learning can be found in the literature. Most of them are based on association rule (AR) discovery. Alternatively, a few approaches use evolutionary algorithms or classifier systems for evolving multilabel classification rules [2, 3, 4]. Creating rules with several labels in the head is usually implemented as a post-processing step. For example, [15] and similarly [10] induce single-label ARs which are merged to create multilabel rules. By using a separate-and-conquer approach, the step of inducing descriptive but often redundant models of the data is omitted and predictive rules are produced directly [11].

Most of the approaches mentioned so far have in common that they are restricted to expressing a certain type of relationship, since labels are only allowed as the consequent of a rule. Approaches that allow labels as antecedents of an implication are often restricted to global label dependencies, such as the approaches by [14, 6, 13] that use the relationships discovered by AR mining on the label matrix for refining the predictions of multilabel classifiers.
The antimonotonicity property is already well known from AR learning and subgroup discovery. For instance, it is used by the Apriori algorithm [1] to prune searches for frequent item sets. [5] already used antimonotonicity for efficiently mining subgroups in multilabel problems. However, in contrast to our work, they have not considered evaluation measures that are commonly used in MLC, but instead adapted metrics that are commonly used in subgroup discovery. We believe that the antimonotonicity property must be assessed differently in a multilabel context. This is because AR learning neglects partial matches and labels that are not present in the heads (cf. Sec. 2.2.3). In contrast, most MLC measures are much more sensitive in this respect. This is also demonstrated by the more restrictive property of decomposability which does not exist in common metrics for AR.
7 Conclusions
In this work, we formulated antimonotonicity and decomposability criteria for multilabel rule learning and formally proved that several common multilabel evaluation measures meet these properties. Furthermore, we demonstrated how these results can be used to efficiently find rules with multilabel heads that are optimal with respect to commonly used multilabel evaluation functions. Our experiments showed that more work is needed to effectively combine such rules into a powerful rulebased theory.
References
 [1] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining, pp. 307–328 (1995)
 [2] Allamanis, M., Tzima, F., Mitkas, P.: Effective Rule-Based Multilabel Classification with Learning Classifier Systems. In: Proc. of the 11th Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA 2013). pp. 466–476 (2013)

 [3] Arunadevi, J., Rajamani, V.: An evolutionary multi label classification using associative rule mining for spatial preferences. International Journal of Computer Applications (3), 28–37 (2011), Special Issue on Artificial Intelligence Techniques – Novel Approaches and Practical Applications

 [4] Ávila, J., Galindo, E., Ventura, S.: Evolving Multi-label Classification Rules with Gene Expression Programming: A Preliminary Study. In: Corchado, E., Romay, M.G., Savio, A. (eds.) Hybrid Artificial Intelligence Systems. pp. 9–16. Springer (2010)
 [5] Bosc, G., Golebiowski, J., Bensafi, M., Robardet, C., Plantevit, M., Boulicaut, J.F., Kaytoue, M.: Local subgroup discovery for eliciting and understanding new structureodor relationships. In: Proc. of the 19th Int. Conf. on Discovery Science (DS16). pp. 19–33 (2016)
 [6] Charte, F., Rivera, A.J., del Jesús, M.J., Herrera, F.: LIMLC: A label inference methodology for addressing high dimensionality in the label space for multilabel classification. IEEE Transactions on Neural Networks and Learning Systems 25(10), 1842–1854 (2014)

 [7] Dembczyński, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence and loss minimization in multi-label classification. Machine Learning 88(1–2), 5–45 (2012)
 [8] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
 [9] Kayande, U., De Bruyn, A., Lilien, G.L., Rangaswamy, A., Van Bruggen, G.H.: How incorporating feedback mechanisms in a DSS affects DSS evaluations. Information Systems Research 20(4), 527–546 (2009)

 [10] Li, B., Li, H., Wu, M., Li, P.: Multi-label Classification based on Association Rules with Application to Scene Classification. In: Proceedings of the 9th International Conference for Young Computer Scientists (ICYCS 2008). pp. 36–41. IEEE Computer Society (2008)
 [11] Loza Mencía, E., Janssen, F.: Learning rules for multilabel classification: A stacking and a separateandconquer approach. Machine Learning 105(1), 77–126 (2016)
 [12] Malerba, D., Semeraro, G., Esposito, F.: A multistrategy approach to learning multiple dependent concepts. In: Nakhaeizadeh, G., Taylor, C.C. (eds.) Machine learning and statistics: The interface, pp. 87–106. John Wiley & Sons, London (1997)
 [13] Papagiannopoulou, C., Tsoumakas, G., Tsamardinos, I.: Discovering and exploiting deterministic label relationships in multilabel learning. In: Proc. of the 21th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. pp. 915–924 (2015)
 [14] Park, S.H., Fürnkranz, J.: Multilabel classification with label constraints. In: Proc. of the ECML PKDD 2008 Workshop on Preference Learning (PL08). pp. 157–171 (2008)
 [15] Thabtah, F., Cowling, P., Peng, Y.: Multiple labels associative classification. Knowledge and Information Systems 9(1), 109–129 (2006)
 [16] Tsoumakas, G., Katakis, I., Vlahavas, I.P.: Mining Multilabel Data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer (2010)