Conformal Rule-Based Multi-label Classification

07/16/2020, by Eyke Hüllermeier, et al.

We advocate the use of conformal prediction (CP) to enhance rule-based multi-label classification (MLC). In particular, we highlight the mutual benefit of CP and rule learning: Rules have the ability to provide natural (non-)conformity scores, which are required by CP, while CP suggests a way to calibrate the assessment of candidate rules, thereby supporting better predictions and more elaborate decision making. We illustrate the potential usefulness of calibrated conformity scores in a case study on lazy multi-label rule learning.

1 Introduction

The setting of multi-label classification (MLC), which generalizes standard multi-class classification by relaxing the assumption of mutual exclusiveness of classes, has received a lot of attention in machine learning, and various methods for tackling this problem have been proposed in the literature [15]. A rule-based approach to MLC is appealing and comes with a number of interesting properties. For example, rules are potentially interpretable and can provide explanations of a prediction [7]. Moreover, due to their local nature, rule-based predictors are very expressive and can adapt to local properties of the data in a flexible way.

In the context of MLC, the local nature of rules may also cause difficulties, however. In particular, due to the imbalance between positive and negative labels, which is typical for MLC, “good” rules with positive predictions that can stand up to negative rules are difficult to find. Here, we advocate the combination of multi-label rule learning with conformal prediction (CP) to mitigate this problem. To the best of our knowledge, CP has not been used in the context of MLC (neither rule-based nor otherwise) so far.

2 Multilabel Classification

Let $\mathcal{X}$ denote an instance space, and let $\mathcal{L} = \{\lambda_1, \ldots, \lambda_K\}$ be a finite set of class labels. We assume that an instance $x \in \mathcal{X}$ is (probabilistically) associated with a subset of labels $L \subseteq \mathcal{L}$; this subset is often called the set of relevant (positive) labels, while the complement $\mathcal{L} \setminus L$ is considered as irrelevant (negative) for $x$. We identify a set $L$ of relevant labels with a binary vector $y = (y_1, \ldots, y_K)$, where $y_i = \llbracket \lambda_i \in L \rrbracket$ (here $\llbracket \cdot \rrbracket$ is the indicator function, i.e., $\llbracket P \rrbracket = 1$ if the predicate $P$ is true and $= 0$ otherwise). By $\mathcal{Y} = \{0,1\}^K$ we denote the set of possible labelings.

Given training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \subset \mathcal{X} \times \mathcal{Y}$, the goal in MLC is to learn a predictive model in the form of a multilabel classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$, which is a mapping that assigns a (predicted) label subset to each instance $x \in \mathcal{X}$. Thus, the output of a classifier $h$ is a vector of predictions $h(x) = (h_1(x), \ldots, h_K(x))$, also denoted as $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_K)$. For measuring the (generalization) performance of such a model, a large spectrum of loss functions or performance metrics has been proposed in the literature, including the Hamming loss $\ell_H(y, \hat{y}) = \frac{1}{K} \sum_{i=1}^K \llbracket y_i \neq \hat{y}_i \rrbracket$ and the F1-measure [4].
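To make these two metrics concrete, here is a minimal sketch (in Python/NumPy; the function and variable names are our own, not part of the paper) of the Hamming loss and the micro-averaged F1-measure on binary label matrices.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label positions predicted incorrectly,
    averaged over all examples and all K labels."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1: confusion counts are pooled over all
    examples and labels before computing precision/recall."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 1.0

# toy example with N=3 instances and K=4 labels
Y_true = [[1, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
Y_pred = [[1, 0, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
print(hamming_loss(Y_true, Y_pred), micro_f1(Y_true, Y_pred))
```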

3 Conformal Prediction

Conformal prediction [13, 12, 3, 6] is a framework for reliable prediction that is rooted in classical frequentist statistics and hypothesis testing. Given a sequence of training observations $(x_1, y_1), \ldots, (x_N, y_N)$ and a new query $x_{N+1}$ with unknown outcome $y_{N+1}$, the basic idea is to hypothetically replace $y_{N+1}$ by each candidate, i.e., to test the hypothesis $y_{N+1} = y$ for all $y \in \mathcal{Y}$. Only those outcomes $y$ for which this hypothesis can be rejected at a predefined level of confidence are excluded, while those for which the hypothesis cannot be rejected are collected to form the prediction set or prediction region $Y^{pr} \subseteq \mathcal{Y}$. By construction, the set-valued prediction $Y^{pr}$ is guaranteed to cover the true outcome $y_{N+1}$ with a pre-specified probability of $1 - \epsilon$ (for example 95%).

Hypothesis testing is done in a nonparametric way: Consider any "nonconformity" function $f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ that assigns scores $\alpha = f(x, y)$ to input/output tuples; the latter can be interpreted as a measure of "strangeness" of the pattern $(x, y)$, i.e., the higher the score, the less the data point conforms to what one would expect to observe. Applying this function to the sequence of observations, with a specific (though hypothetical) choice of $y = y_{N+1}$, yields a sequence of scores $\alpha_1, \ldots, \alpha_N, \alpha_{N+1}$, where $\alpha_n = f(x_n, y_n)$. Denote by $\sigma$ the permutation of $\{1, \ldots, N+1\}$ that sorts the scores in increasing order, i.e., such that $\alpha_{\sigma(1)} \leq \cdots \leq \alpha_{\sigma(N+1)}$. Under the assumption that the hypothetical choice of $y_{N+1}$ is in agreement with the true data-generating process, and that this process has the property of exchangeability (which is weaker than the assumption of independence and essentially means that the order of observations is irrelevant), every permutation $\sigma$ has the same probability of occurrence. Consequently, the probability that $\alpha_{N+1}$ is among the $\epsilon \cdot 100\,\%$ highest nonconformity scores should be low. This notion can be captured by the $p$-value associated with the candidate $y$, defined as

$p(y) := \dfrac{\#\{\, n \in \{1, \ldots, N+1\} \,:\, \alpha_n \geq \alpha_{N+1} \,\}}{N+1}$   (1)

According to what we said, the probability that $p(y) \leq \epsilon$ (i.e., $\alpha_{N+1}$ is among the $\epsilon \cdot 100\,\%$ highest $\alpha$-values) is upper-bounded by $\epsilon$. Thus, the hypothesis $y_{N+1} = y$ can be rejected for those candidates $y$ for which $p(y) \leq \epsilon$.
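As an illustration of how (1) translates into a prediction region, the following sketch (hypothetical names; it assumes a simple nonconformity function $f(x, y)$ that does not depend on the rest of the sample) computes a $p$-value for every candidate outcome and keeps those that cannot be rejected at level $\epsilon$.

```python
def conformal_prediction_region(train, x_query, candidates, f, epsilon=0.05):
    """Transductive conformal prediction for a single query.

    train      -- list of (x, y) pairs
    x_query    -- the new instance
    candidates -- iterable of candidate outcomes y
    f          -- nonconformity function f(x, y) -> float
    epsilon    -- significance level (target coverage 1 - epsilon)
    """
    region = []
    for y in candidates:
        # hypothetically extend the sample by (x_query, y)
        scores = [f(x, yy) for x, yy in train] + [f(x_query, y)]
        alpha_new = scores[-1]
        # p-value: fraction of scores at least as "strange" as the new one (Eq. 1)
        p = sum(a >= alpha_new for a in scores) / len(scores)
        if p > epsilon:          # hypothesis y_{N+1} = y cannot be rejected
            region.append(y)
    return region
```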

Conformal prediction as outlined above realizes transductive inference, although inductive variants also exist [9], in which the nonconformity scores in (1) are produced on a separate training or validation data set. The error bounds are valid and well calibrated by construction, regardless of the nonconformity function $f$. However, the choice of this function has an important influence on the efficiency of conformal prediction, that is, the size of prediction regions: The more suitably the nonconformity function is chosen, the smaller these sets will be.

4 Conformal Rule-Based MLC

A rule-based classifier in the context of MLC is understood as a collection of individual rules $r_1, \ldots, r_M$, where each rule $r$ is characterized by a head $h(r)$ and a body $b(r)$. Roughly speaking, the rule head makes an assertion about the relevance of the labels $\lambda_i$, while the rule body specifies conditions under which this assertion is valid. It typically appears in the form of a logical predicate that specifies conditions on a query instance $x$, for example a logical conjunction of restrictions on some of the features (e.g., a numerical value $x_j$ must lie in a certain interval).
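For concreteness, such a single-label rule could be represented roughly as follows; this is only an illustrative sketch with assumed names, restricted to interval conditions on numerical features.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    """A rule with a conjunctive body of interval conditions on numerical
    features and a head asserting relevance (1) or irrelevance (0) of one label."""
    body: Dict[int, Tuple[float, float]]  # feature index -> (lower, upper) bound
    label: int                            # index i of the label lambda_i
    prediction: int                       # 1 = relevant (positive rule), 0 = irrelevant

    def covers(self, x) -> bool:
        # the body is a conjunction: every interval condition must hold
        return all(lo <= x[j] <= hi for j, (lo, hi) in self.body.items())

# e.g. "IF 0.2 <= x_3 <= 0.7 AND 1.0 <= x_5 <= 4.0 THEN label 2 is relevant"
r = Rule(body={3: (0.2, 0.7), 5: (1.0, 4.0)}, label=2, prediction=1)
```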

4.1 Lazy Rule Learning

Here, we consider a lazy approach to multi-label rule learning, in which, instead of (eagerly) inducing a complete model from the training data $\mathcal{D}$, a single rule is induced at prediction time [1, 5]. This rule is specifically tailored to a query instance $x_q$, for which a prediction is sought. More concretely, considering a binary relevance approach, a separate rule is constructed for each label $\lambda_i$. The rule head is of the form $\lambda_i = 0$ or $\lambda_i = 1$. In the first case, the rule is a negative rule that predicts $\lambda_i$ to be irrelevant, in the second case a positive rule that predicts $\lambda_i$ to be relevant.

The local nature of rules has advantages but may also cause difficulties, especially in the context of MLC, where the data is highly imbalanced. In many cases, only a tiny fraction of the labels is relevant (positive), while the majority is irrelevant (negative). In general, this makes it difficult to find a "good" rule with positive predictions in its head, where the quality of a rule is typically measured in terms of two criteria, namely support (the body should be general enough to cover many instances) and confidence (the covered instances should belong to the same class). Conversely, the learner has a strong incentive to make negative predictions, especially for loss functions such as the Hamming loss. For example, the default rule with empty body, which predicts all labels to be always negative, will often have a very low Hamming loss, because most labels will be negative in the test examples. At the same time, this rule has a large support. When learning a single rule, as opposed to a complete model with many rules, that single rule must at least be better than the default rule, which is difficult for positive rules, as these normally have a small support.

4.2 Conformity of Positive and Negative Predictions

In general, the evaluation of negative rules is systematically better than the evaluation of positive rules. This is a motivation for the use of conformal prediction, which, if applied in a per-class manner, could "calibrate" the evaluations. More specifically, for a query instance $x_q$ and a label $\lambda_i$, we propose the conformity (instead of non-conformity) score

$c(x_q, \lambda_i, y) := \max \{\, q(r) \,:\, r \in \mathcal{R}(x_q, \lambda_i, y) \,\}$   (2)

where $y \in \{0,1\}$, $\mathcal{R}(x_q, \lambda_i, y)$ is a set of candidate rules that cover $x_q$ and predict $y$ for the label $\lambda_i$, and $q(r)$ is an evaluation measure informing about the quality of the rule $r$. As already said, such measures typically depend on the confidence and the support of the rule. In our illustration below, we shall use a lower confidence bound on the rule's precision, computed from $n$, the number of examples covered by the rule, and $\hat{p}$, the fraction of covered examples with the predicted label [2], though any other measure could be used as well. Practically, it might be difficult to determine the maximum in (2) exactly, as an exhaustive search of the candidate set $\mathcal{R}(x_q, \lambda_i, y)$ might be infeasible. Instead, greedy search techniques are often used to find an approximately optimal rule.

The measure (2) appears to be a very natural measure of conformity: The conformity of $y$ for $\lambda_i$ is high if a high-quality rule can be found that predicts $\lambda_i = y$ for $x_q$. A measure of plausibility of this label is then given by

$\pi(x_q, \lambda_i, y) := \dfrac{\#\{\, x \in \mathcal{D}_{i,y} \,:\, c(x, \lambda_i, y) \leq c(x_q, \lambda_i, y) \,\}}{|\mathcal{D}_{i,y}|}$   (3)

where $\mathcal{D}_{i,y} \subseteq \mathcal{D}$ denotes the training examples whose true value of $\lambda_i$ equals $y$, and $c(x, \lambda_i, y)$ is the conformity of the training example $x$ determined in a leave-one-out manner (i.e., the quality of the best rule for $x$ found in $\mathcal{D} \setminus \{x\}$). In the following, we write $\pi^{+}(x_q)$ and $\pi^{-}(x_q)$ as shorthand for $\pi(x_q, \lambda_i, 1)$ and $\pi(x_q, \lambda_i, 0)$, respectively. In other words, if $\pi^{+}(x_q) = 0.8$, it means that the quality of the best positive rule for $x_q$ is better than the quality of 80% of the rules found for the truly positive examples in the training data, and the same interpretation applies to $\pi^{-}(x_q)$. Consequently, only low values close to 0 provide real evidence against a certain prediction. For example, if $\pi^{+}(x_q) = 0.2$, it means that the positive rule found for $x_q$ is still better than 20% of the rules for the truly positive examples in the training data. In the spirit of hypothesis testing, one would "reject" the positive class only if $\pi^{+}(x_q) < \delta$ for some critical threshold such as $\delta = 0.05$ or $\delta = 0.1$, and similarly for the negative class.
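A rough sketch of how the plausibility (3) could be computed from leave-one-out conformity scores is given below; `best_rule_quality` is a placeholder for the (approximate) maximization in (2), e.g., a greedy rule search, and all names are our assumptions rather than the authors' implementation.

```python
def plausibility(x_query, label_value, train_pool, best_rule_quality):
    """Calibrated plausibility of predicting `label_value` (0 or 1) for x_query.

    train_pool        -- list of training instances whose true label equals `label_value`
    best_rule_quality -- function(x, label_value, data) -> quality of the best
                         covering rule predicting `label_value` for x (Eq. 2)
    """
    # conformity of the query: best rule found on the full training pool
    c_query = best_rule_quality(x_query, label_value, train_pool)

    # leave-one-out conformity of every training example with that label
    c_train = []
    for i, x in enumerate(train_pool):
        rest = train_pool[:i] + train_pool[i + 1:]
        c_train.append(best_rule_quality(x, label_value, rest))

    # fraction of training examples whose best rule is no better than the query's (Eq. 3)
    return sum(c <= c_query for c in c_train) / len(c_train)
```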

Figure 1: Positive and negative conformity scores (2) and calibrated plausibilities (3) for the first label in the emotions data. Positive examples are plotted as red, negative examples as blue points.

As an illustration, Fig. 1 shows the distribution of positive and negative conformity scores (2) and calibrated plausibilities (3) for the first label in the emotions data (on a randomly chosen training set of size 400), a common benchmark data set with 596 examples, 72 attributes, and 6 labels [14]. Here, simple rules in the form of Parzen windows [11] have been learned, searching the space of such rules in a greedy, bottom-up manner (starting with a small window around $x_q$ and successively increasing its size). As expected, the positive examples tend to have a higher positive than negative plausibility, and vice versa for the negative examples. Moreover, the sum of the two scores tends to be upper-bounded by 1 and is sometimes much closer to 0, suggesting higher certainty in the true label in some cases and less in others, again confirming the appropriateness of the conformity measure (2).
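The greedy, bottom-up Parzen-window search could be sketched roughly as follows. The quality measure used here is a generic Hoeffding-style lower confidence bound on the rule's precision, standing in for the lower confidence bound $q(r)$ of Section 4.2; the exact bound, the radius grid, and all names are assumptions for illustration (the radii presuppose standardized features).

```python
import numpy as np

def lcb_quality(n_covered, n_correct, delta=0.05):
    """Hoeffding-style lower confidence bound on the precision of a rule
    (a stand-in for q(r); the paper's exact bound may differ)."""
    if n_covered == 0:
        return 0.0
    p_hat = n_correct / n_covered
    return p_hat - np.sqrt(np.log(1.0 / delta) / (2.0 * n_covered))

def best_window_rule(x_query, X, y_i, target, radii=np.linspace(0.1, 2.0, 20)):
    """Greedy bottom-up search: grow a ball-shaped window around x_query and
    keep the best-scoring window predicting `target` (0 or 1) for label i."""
    X, y_i = np.asarray(X), np.asarray(y_i)
    dist = np.linalg.norm(X - np.asarray(x_query), axis=1)
    best = 0.0
    for r in radii:                      # successively enlarge the window
        covered = dist <= r
        n = int(covered.sum())
        n_correct = int(np.sum(y_i[covered] == target))
        best = max(best, lcb_quality(n, n_correct))
    return best                          # conformity score c(x_query, lambda_i, target)
```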

4.3 Prediction and Decision Making

Given a query $x_q$, the degrees $\pi^{+}(x_q)$ and $\pi^{-}(x_q)$ provide useful information about the plausibility of the positive and negative class, respectively, and hence a suitable basis for prediction and decision making. The arguably most obvious idea is to compare the two degrees and predict the label with higher plausibility, i.e., positive if $\pi^{+}(x_q) \geq \pi^{-}(x_q)$ and negative otherwise. Yet, since MLC losses are not necessarily symmetric, and the class distribution is imbalanced, one may also think of a more general decision rule of the form

$\hat{y}_i = \llbracket\, \pi^{+}(x_q) - \pi^{-}(x_q) \geq t \,\rrbracket$   (4)

where $t$ is a parameter. Fig. 2 (top) shows the average test performance (over 50 random splits into 400 training examples and 196 test examples) on the emotions data in terms of the Hamming loss and (micro) F1-measure. As can be seen, by tuning the threshold $t$, the performance can indeed be optimized, although the symmetric default is already close to optimal, confirming that the scores (3) are already well calibrated.

Figure 2: Top: Hamming loss and F-measure on the emotions data, depending on the threshold in the decision rule (4). Bottom: Accuracy-rejection curves for Hamming loss and F1-measure on the same data.

Recalling that conformal prediction is actually conceived for set-valued prediction, one may also think of using the two plausibilities to support more sophisticated decision making. One example is multi-label classification with (partial) abstention, where the learner is allowed to abstain on those labels on which it is not certain enough [8]. A natural reason to abstain, for example, is a low support for both options: $\max(\pi^{+}(x_q), \pi^{-}(x_q)) < s$, where $s$ is again a threshold. The effectiveness of such an approach is shown by the accuracy-rejection curves in Fig. 2 (bottom), which depict the average Hamming loss and F1-measure on those parts of the test data on which the learner does not abstain. The curves show a drastic increase in performance with an increasing amount of abstention (i.e., increasing $s$), suggesting that the learner is indeed abstaining on the right labels, namely those that are most uncertain (note that the accuracy-rejection curve for random abstention is flat).
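Putting the pieces together, the per-label decision logic, i.e., the thresholded rule (4) combined with partial abstention, might look as follows; the thresholds t and s and the function name are illustrative assumptions.

```python
def decide(pi_pos, pi_neg, t=0.0, s=0.0):
    """Per-label decision based on the calibrated plausibilities pi+ and pi-.

    Returns 1 (relevant), 0 (irrelevant), or None (abstain) when neither
    class receives enough support, i.e. max(pi_pos, pi_neg) < s.
    """
    if max(pi_pos, pi_neg) < s:
        return None                              # partial abstention on this label
    return 1 if pi_pos - pi_neg >= t else 0      # Eq. (4); t = 0 is the symmetric rule
```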

5 Conclusion and Outlook

The purpose of this paper is to highlight the potential usefulness of combining multi-label (rule) learning with conformal prediction. On the one side, rules provide a natural means for producing conformity scores of candidate labelings, very much like the nearest neighbor methods that are commonly used for CP [10]. On the other side, CP allows for producing meaningful and better calibrated measures of support in favor of label relevance, thus providing the basis for improved prediction, especially in advanced settings like MLC with abstention.

Exploiting the potential of this approach requires answers to a multitude of questions. One important building block, for example, is the class of candidate rules and the search in this class. Lazy rule learning as well as ensemble methods appear to be appealing in this regard. Moreover, to capture correlations and dependencies between different labels, the approach should be generalized toward the learning of rules with multi-label heads, predicting complete label combinations instead of individual labels.

Acknowledgements

This work was supported by the German Research Foundation (DFG) under grant number 400845550.

References

  • [1] Aha, D. (ed.): Lazy Learning. Kluwer Academic Publ. (1997)
  • [2] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2–3), 235–256 (2002)
  • [3] Balasubramanian, V., Ho, S., Vovk, V. (eds.): Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications. Morgan Kaufmann (2014)
  • [4] Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence and loss minimization in multi-label classification. Machine Learning 88(1–2), 5–45 (2012)
  • [5] Friedman, J., Kohavi, R., Yun, Y.: Lazy decision trees. In: Proceedings AAAI-96, pp. 717–724. Morgan Kaufmann, Menlo Park, California (1996)
  • [6] Gammerman, A., Vovk, V., Boström, H., Carlsson, L.: Conformal and probabilistic prediction with applications: Editorial. Machine Learning 108(3), 379–380 (2019)
  • [7] Loza Mencia, E., Fürnkranz, J., Hüllermeier, E., Rapp, M.: Learning interpretable rules for multi-label classification. In: Escalante, H.J., Escalera, S., Guyon, I., Baró, X., Güçlütürk, Y., Güçlü, U., van Gerven, M. (eds.) Explainable and Interpretable Models in Computer Vision and Machine Learning, pp. 81–113. The Springer Series on Challenges in Machine Learning, Springer-Verlag (2018)
  • [8] Nguyen, V.L., Hüllermeier, E.: Reliable multi-label classification: Prediction with partial abstention. In: Proc. AAAI-20, Thirty-Fourth AAAI Conference on Artificial Intelligence. New York, USA (2020)
  • [9] Papadopoulos, H.: Inductive conformal prediction: Theory and application to neural networks. Tools in Artificial Intelligence 18(2), 315–330 (2008)
  • [10] Papadopoulos, H., Vovk, V., Gammerman, A.: Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research 40, 815–840 (2011)
  • [11] Parzen, E.: On estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065–1076 (1962)
  • [12] Shafer, G., Vovk, V.: A tutorial on conformal prediction. Journal of Machine Learning Research 9, 371–421 (2008)
  • [13] Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer-Verlag (2005)
  • [14] Wieczorkowska, A., Synak, P., Ras, Z.: Multi-label classification of emotions in music. In: Klopotek, M., Wierzchon, S., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining. Springer, Berlin, Heidelberg (2006)
  • [15] Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8), 1819–1837 (2014)