The $F_\beta$-measure is a widely used performance measure for multi-label classification (MLC) problems. In an MLC problem, multiple labels can be active in an instance simultaneously; a good example is image tagging, where several tags (such as sky, sand, water) can be active in the same image. In such problems, when evaluating the performance of a classifier on a particular instance, it is important to balance the recall of the classifier on that instance, i.e. the fraction of active labels for the instance that are correctly predicted as such, against its precision on the instance, i.e. the fraction of labels predicted to be active that are actually so. The $F_\beta$-measure accomplishes this by taking the (possibly weighted) harmonic mean of these two quantities.
Unfortunately, as with most discrete prediction problems, optimizing the $F_\beta$-measure directly during training is computationally hard. Consequently, one generally settles for some form of approximation. One approach is to simply treat the labels as independent and train a separate binary classifier for each label; this is sometimes referred to as the binary relevance (BR) approach. This ignores the fact that labels can be correlated (e.g. sky and cloud may be more likely to co-occur than sky and computer). Several other approaches have been proposed in recent years (Dembczynski et al., 2013; Koyejo et al., 2015; Wu and Zhou, 2017; Pillai et al., 2017).
In this paper, we turn to the theory of convex calibrated surrogate losses, which has yielded convex risk minimization algorithms for several other discrete prediction problems in recent years (Bartlett et al., 2006; Zhang, 2004a; Tewari and Bartlett, 2007; Steinwart, 2007; Duchi et al., 2010; Gao and Zhou, 2013; Ramaswamy et al., 2014, 2015), to design principled surrogate risk minimization algorithms for the multi-label $F_\beta$-measure. In particular, for an MLC problem with $s$ tags, the total number of possible labelings of an instance is $2^s$ (each tag can be active or inactive). Viewing the $F_\beta$-measure as (one minus) a $2^s \times 2^s$ loss matrix, we show that this matrix has rank at most $s^2+1$, and apply the results of Ramaswamy et al. (2014) to design an output coding scheme that reduces the learning problem to a set of $s^2+1$ binary class probability estimation (CPE) problems. By using a suitable binary surrogate risk minimization algorithm (such as binary logistic regression) for these binary problems, we effectively construct an $(s^2+1)$-dimensional convex calibrated surrogate loss for the $F_\beta$-measure. We also give a quantitative regret transfer bound for the constructed surrogate, which allows us to transfer any regret guarantees for the binary subproblems to guarantees on the $F_\beta$-regret for the overall MLC problem. In particular, this means that using a consistent learner for the binary problems yields a consistent learner for the MLC problem (one whose $F_\beta$-regret goes to zero as the training sample size increases).
Our algorithm is related to the plug-in algorithm of Dembczynski et al. (2013), which also estimates $s^2+1$ statistics of the underlying distribution. Dembczynski et al. (2013) estimate these statistics by reducing the $F_\beta$-maximization problem to $s$ multiclass CPE problems, each with at most $s+1$ classes (plus one binary CPE problem); we do so by reducing the problem to $s^2+1$ binary CPE problems. As we show, both algorithms effectively estimate the same statistics, and indeed, both perform similarly in experiments. Interestingly, the algorithm of Dembczynski et al. (2013), while motivated primarily by the plug-in approach, can also be viewed as minimizing a certain convex calibrated surrogate loss (different from ours); conversely, our algorithm, while motivated primarily by the convex calibrated surrogates approach, can also be viewed as a plug-in algorithm. Our study brings out interesting connections between the two approaches; in addition, to the best of our knowledge, our analysis is the first to provide a quantitative regret transfer bound for calibrated surrogates for the $F_\beta$-measure.
Organization. Section 2 discusses related work. Section 3 gives preliminaries and background. Section 4 gives our convex calibrated surrogates for the $F_\beta$-measure; Section 5 provides a regret transfer bound for them. Section 6 discusses the relationship with the plug-in algorithm of Dembczynski et al. (2013). Section 7 summarizes our experiments.
2 Related Work
There has been much work on multi-label learning, learning with the $F_\beta$-measure, and convex calibrated surrogates. Below we briefly discuss the work most closely related to our study. For detailed surveys on multi-label learning, we refer the reader to Zhang and Zhou (2014) and Pillai et al. (2017).
Bayes optimal multi-label classifiers. In an elegant study, Dembczynski et al. (2011) characterized in detail the form of a Bayes optimal multi-label classifier for the $F_1$-measure. In particular, they showed that, for an $s$-label MLC problem, given a certain set of $s^2+1$ statistics of the true conditional label distribution (the distribution over labelings), one can compute a Bayes optimal classifier for the $F_1$-measure in $O(s^3)$ time. Their result extends to general $F_\beta$-measures. Bayes optimal classifiers have also been studied for other MLC performance measures, such as Hamming loss and subset 0-1 loss (Dembczynski et al., 2010).
Consistent algorithms for multi-label learning. Dembczynski et al. (2013) extended and operationalized the results of Dembczynski et al. (2011) by providing a consistent plug-in MLC algorithm for the $F_\beta$-measure. Specifically, they showed that the $s^2+1$ statistics of the conditional label distribution needed to compute a Bayes optimal classifier can be estimated via $s$ multiclass CPE problems, each with at most $s+1$ classes, plus one binary CPE problem; the statistics estimated by solving these CPE problems can then be plugged into the $O(s^3)$-time procedure of Dembczynski et al. (2011) to produce a consistent plug-in algorithm, termed the exact F-measure plug-in (EFP) algorithm. Consistent learning algorithms have also been studied for other multi-label performance measures (Gao and Zhou, 2013; Koyejo et al., 2015).¹ The simple approach of learning an independent binary classifier for each of the labels, known as binary relevance (BR), is known to yield a consistent algorithm for the Hamming loss; it also yields a consistent algorithm for the $F_\beta$-measure under the assumption of conditionally independent labels, but can be arbitrarily bad otherwise (Dembczynski et al., 2011).
¹ Note that while the study of Koyejo et al. (2015) also includes the $F_\beta$-measure (among other performance measures), their study is in the context of what has been referred to as the ‘expected utility maximization’ (EUM) framework; in contrast, our study is in the context of what has been referred to as the ‘decision-theoretic analysis’ (DTA) framework. Their results are generally incomparable to ours. (In particular, under the EUM framework, Koyejo et al. (2015) showed that a thresholding approach leads to Bayes optimal performance; in contrast, under the DTA framework, Dembczynski et al. (2011) showed that a thresholding approach cannot be optimal for general distributions.)
Large-margin algorithms for multi-label learning. Several studies have considered large-margin algorithms for multi-label learning with the $F_\beta$-measure. These include the reverse multi-label (RML) and sub-modular multi-label (SML) algorithms of Petterson and Caetano (2010, 2011), which make use of the StructSVM framework (Tsochantaridis et al., 2005), and more recently, the label-wise and instance-wise margin optimization (LIMO) algorithm due to Wu and Zhou (2017), which aims to simultaneously optimize several different multi-label performance measures. Dembczynski et al. (2013) proved that the RML and SML algorithms are inconsistent for the $F_\beta$-measure and showed that they are outperformed by the EFP algorithm. We include a comparison with LIMO in our experiments.
Multivariate $F_\beta$-measure for binary classification. The $F_\beta$-measure is also used as a multivariate performance measure in binary classification tasks with significant class imbalance. This use of the $F_\beta$-measure is related to, but distinct from, its use in MLC problems. Several approaches have been proposed that aim to optimize the multivariate $F_\beta$-measure in binary classification (Joachims, 2005; Ye et al., 2012; Parambath et al., 2014).
Convex calibrated surrogates. Convex surrogate losses are frequently used in machine learning to design computationally efficient learning algorithms. The notion of calibrated surrogate losses, which ensures that minimizing the surrogate loss can (in the limit of sufficient data) recover a Bayes optimal model for the target discrete loss, was initially studied in the context of binary classification (Bartlett et al., 2006; Zhang, 2004b) and multiclass 0-1 classification (Zhang, 2004a; Tewari and Bartlett, 2007). In recent years, calibrated surrogates have been designed for several more complex learning problems, including general multiclass problems and certain types of subset ranking and multi-label problems (Steinwart, 2007; Duchi et al., 2010; Gao and Zhou, 2013; Ramaswamy et al., 2013, 2014, 2015). In our work, we will make use of a result of Ramaswamy et al. (2014), who designed convex calibrated surrogates based on output coding for multiclass problems with low-rank loss matrices.
3 Preliminaries and Background
3.1 Problem Setup
Multi-label classification (MLC). In an MLC problem, there is an instance space $\mathcal{X}$, and a set of $s$ labels or ‘tags’ $[s] = \{1,\ldots,s\}$ that can be associated with each instance in $\mathcal{X}$. For example, in image tagging, $\mathcal{X}$ is the set of possible images, and $[s]$ indexes a set of $s$ pre-defined tags (such as sky, cloud, water, etc.) that can be associated with each image. The learner is given a training sample $S = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathcal{X} \times \{0,1\}^s)^m$, where the labeling $y_i \in \{0,1\}^s$ indicates which of the $s$ tags are active in instance $x_i$ (specifically, $y_{ij} = 1$ denotes that tag $j$ is active in instance $x_i$, and $y_{ij} = 0$ denotes it is inactive). The goal is to learn from these examples a multi-label classifier $h: \mathcal{X} \to \{0,1\}^s$ which, given a new instance $x$, predicts which tags are active or inactive via $\hat{y} = h(x)$.
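$F_\beta$-measure. For any $\beta > 0$, the $F_\beta$-measure evaluates the quality of an MLC prediction as follows. Given a true labeling $y \in \{0,1\}^s$ and a predicted labeling $\hat{y} \in \{0,1\}^s$, the recall and precision are given by
$$\mathrm{rec}(y,\hat{y}) = \frac{\sum_{j=1}^s y_j \hat{y}_j}{\|y\|_1}, \qquad \mathrm{prec}(y,\hat{y}) = \frac{\sum_{j=1}^s y_j \hat{y}_j}{\|\hat{y}\|_1}.$$
In words, the recall measures the fraction of active tags that are predicted correctly, and the precision measures the fraction of tags predicted as active that are actually so. The $F_\beta$-measure balances these two quantities by taking their (weighted) harmonic mean:
$$F_\beta(y,\hat{y}) = \frac{1+\beta^2}{\dfrac{\beta^2}{\mathrm{rec}(y,\hat{y})} + \dfrac{1}{\mathrm{prec}(y,\hat{y})}} = \frac{(1+\beta^2) \sum_{j=1}^s y_j \hat{y}_j}{\beta^2 \|y\|_1 + \|\hat{y}\|_1}.$$
Clearly, $F_\beta(y,\hat{y}) \in [0,1]$. Higher values of the $F_\beta$-measure correspond to better quality predictions. We will take $0/0 = 1$, so that when $y = \hat{y} = 0$, we have $F_\beta(y,\hat{y}) = 1$. The most commonly used instantiation is the $F_1$-measure, which weighs recall and precision equally; other commonly used variants include the $F_2$-measure, which weighs recall more heavily than precision, and the $F_{0.5}$-measure, which weighs precision more heavily than recall.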
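As a concrete illustration, the following minimal Python sketch computes the measure for a pair of 0/1 label vectors. It assumes the standard closed form $F_\beta(y,\hat{y}) = (1+\beta^2)\sum_j y_j\hat{y}_j \,/\, (\beta^2\|y\|_1 + \|\hat{y}\|_1)$, i.e. the weighted harmonic mean of recall and precision, together with the convention $0/0 = 1$. The function name `f_beta` and the example vectors are illustrative and not taken from the paper.

```python
def f_beta(y, y_hat, beta=1.0):
    """F_beta between two 0/1 label vectors, with the convention 0/0 = 1."""
    if len(y) != len(y_hat):
        raise ValueError("labelings must have the same length")
    tp = sum(a * b for a, b in zip(y, y_hat))   # active tags predicted as active
    denom = beta ** 2 * sum(y) + sum(y_hat)     # beta^2 * ||y||_1 + ||y_hat||_1
    if denom == 0:                              # y and y_hat are both all-zero
        return 1.0
    return (1 + beta ** 2) * tp / denom


y_true = [1, 1, 1, 0]   # three active tags
y_pred = [1, 0, 0, 0]   # one tag predicted, and it is correct

f1 = f_beta(y_true, y_pred, beta=1.0)
f2 = f_beta(y_true, y_pred, beta=2.0)
f_half = f_beta(y_true, y_pred, beta=0.5)
print(f1, f2, f_half)
```

In this example the recall is 1/3 and the precision is 1, so $F_2$ (which emphasizes recall) comes out lower than $F_1$, and $F_{0.5}$ (which emphasizes precision) comes out higher.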
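Assuming that training examples are drawn IID from some underlying probability distribution $D$ on $\mathcal{X} \times \{0,1\}^s$, it is natural then to measure the quality of a multi-label classifier $h$ by its $F_\beta$-generalization accuracy:²
$$\mathrm{acc}^{F_\beta}_D[h] = \mathbf{E}_{(x,y)\sim D}\big[ F_\beta(y, h(x)) \big].$$
The Bayes $F_\beta$-accuracy is then the highest possible value of the $F_\beta$-generalization accuracy for $D$:
$$\mathrm{acc}^{F_\beta,*}_D = \sup_{h: \mathcal{X} \to \{0,1\}^s} \mathrm{acc}^{F_\beta}_D[h].$$
The $F_\beta$-regret of a multi-label classifier $h$ is then the difference between the Bayes $F_\beta$-accuracy and the $F_\beta$-accuracy of $h$:
$$\mathrm{regret}^{F_\beta}_D[h] = \mathrm{acc}^{F_\beta,*}_D - \mathrm{acc}^{F_\beta}_D[h].$$
² Note that our focus is on instance-averaged performance (Zhang and Zhou, 2014).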
Our goal will be to design consistent algorithms for the $F_\beta$-measure, i.e. algorithms whose $F_\beta$-regret converges (in probability) to zero as the number of training examples increases. In particular, since we cannot maximize the (discrete) $F_\beta$-measure directly, we would like to design consistent algorithms that instead maximize a concave surrogate performance measure or, equivalently, minimize a convex surrogate loss. For this, we will turn to the theory of convex calibrated surrogates.
3.2 Convex Calibrated Surrogates for Multiclass Problems
Here we review the theory of convex calibrated surrogates for multiclass classification problems, and in particular, the result of Ramaswamy et al. (2014) for low-rank multiclass loss matrices that we will use in our work. We will apply the theory to the multi-label $F_\beta$-measure in Section 4.
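Multiclass classification. Consider a standard multiclass (not multi-label) learning problem with instance space $\mathcal{X}$ and label space $\mathcal{Y} = [n] = \{1,\ldots,n\}$ (i.e., $n$ classes). Let $L \in \mathbb{R}_+^{n \times n}$ be a loss matrix whose $(y,\hat{y})$-th entry $\ell_{y,\hat{y}}$ (for each $y,\hat{y} \in [n]$) specifies the loss incurred on predicting $\hat{y}$ when the true label is $y$ (the 0-1 loss is a special case with $\ell_{y,\hat{y}} = \mathbf{1}(\hat{y} \neq y)$). Then, given a training sample $S = ((x_1,y_1),\ldots,(x_m,y_m))$ with examples drawn IID from some underlying probability distribution $D$ on $\mathcal{X} \times [n]$, the performance of a classifier $h: \mathcal{X} \to [n]$ is measured by its $L$-generalization error $\mathrm{er}^L_D[h] = \mathbf{E}_{(x,y)\sim D}[\ell_{y,h(x)}]$, or its $L$-regret $\mathrm{regret}^L_D[h] = \mathrm{er}^L_D[h] - \mathrm{er}^{L,*}_D$, where $\mathrm{er}^{L,*}_D = \inf_{h: \mathcal{X} \to [n]} \mathrm{er}^L_D[h]$ is the Bayes $L$-error for $D$. A learning algorithm that maps training samples $S$ to classifiers $h_S$ is said to be (universally) $L$-consistent if for all $D$ and for $S \sim D^m$, $\mathrm{regret}^L_D[h_S] \to 0$ in probability as $m \to \infty$.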
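Surrogate risk minimization and calibrated surrogates. Since minimizing the discrete loss $L$ directly is computationally hard, a common algorithmic framework is to minimize a surrogate loss $\psi: [n] \times \mathbb{R}^d \to \mathbb{R}_+$ for some suitable $d$. In particular, given a multiclass training sample $S$ as above, one learns a $d$-dimensional ‘scoring’ function $f_S: \mathcal{X} \to \mathbb{R}^d$ by solving
$$\min_{f \in \mathcal{F}} \; \frac{1}{m} \sum_{i=1}^m \psi(y_i, f(x_i))$$
over a suitably rich class of functions $\mathcal{F}$; and then returns $h_S = \mathrm{decode} \circ f_S$ for some suitable mapping $\mathrm{decode}: \mathbb{R}^d \to [n]$. In practice, the surrogate $\psi$ is often chosen to be convex in its second argument to enable efficient minimization. It is known that if the minimization is performed over a universal function class (with suitable regularization), then the resulting algorithm is universally $\psi$-consistent, i.e. the $\psi$-regret converges to zero: $\mathrm{er}^\psi_D[f_S] - \mathrm{er}^{\psi,*}_D \to 0$ in probability as $m \to \infty$ (where $\mathrm{er}^\psi_D[f_S] = \mathbf{E}_{(x,y)\sim D}[\psi(y, f_S(x))]$ is the $\psi$-generalization error of $f_S$ and $\mathrm{er}^{\psi,*}_D$ is the Bayes $\psi$-error). The surrogate $\psi$, together with the mapping decode, is said to be $L$-calibrated if this also implies $L$-consistency, i.e. if
$$\mathrm{er}^\psi_D[f_S] - \mathrm{er}^{\psi,*}_D \to 0 \;\; \Longrightarrow \;\; \mathrm{er}^L_D[\mathrm{decode} \circ f_S] - \mathrm{er}^{L,*}_D \to 0.$$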
Thus, given a target loss $L$, the task of designing an $L$-consistent algorithm reduces to designing a convex $L$-calibrated surrogate-mapping pair $(\psi, \mathrm{decode})$; the resulting surrogate risk minimization algorithm (implemented in a universal function class with suitable regularization) is then universally $L$-consistent.
Result of Ramaswamy et al. (2014) for low-rank loss matrices. The result of Ramaswamy et al. (2014) effectively decomposes multiclass problems into a set of binary CPE problems; to describe the result, we will need the following definition for binary losses:
Definition 1 (Strictly proper composite binary losses (Reid and Williamson, 2010)).
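A binary loss $\phi: \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$ is strictly proper composite with underlying (invertible) link function $\gamma: [0,1] \to \mathbb{R}$ if for all $p \in [0,1]$ and $u \neq \gamma(p)$:
$$\mathbf{E}_{y \sim p}\big[ \phi(y,u) - \phi(y,\gamma(p)) \big] > 0,$$
where $y \sim p$ denotes a $\{\pm 1\}$-valued random variable that takes value $+1$ with probability $p$ and value $-1$ with probability $1-p$.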
Intuitively, minimizing a strictly proper composite binary loss allows one to recover accurate class probability estimates for binary CPE problems: the learned real-valued score is simply inverted via $\gamma^{-1}$ (Reid and Williamson, 2010).
We can now state the result of Ramaswamy et al. (2014), which for multiclass loss matrices $L$ of rank $d$, gives a family of $d$-dimensional convex $L$-calibrated surrogates defined in terms of strictly proper composite binary losses as follows (result specialized here to the case of square loss matrices, and stated with a small change in normalization):
Theorem 2 (Ramaswamy et al. (2014)).
Let be a rank- multiclass loss matrix, with for some . Let be any strictly proper composite binary loss, with underlying link function . Define a multiclass surrogate and mapping as follows:
Then is -calibrated.
The above result effectively decomposes the multiclass problem into $d$ binary CPE problems, where the labels for these CPE problems can themselves be given as probabilities in $[0,1]$ rather than binary values (see Ramaswamy et al. (2014) for details). For our purposes, we will use the standard binary logistic loss for the binary CPE problems, which is known to be strictly proper composite (see Section 4 below for more details).
4 Convex Calibrated Surrogates for $F_\beta$
In order to construct convex calibrated surrogates (and corresponding surrogate risk minimization algorithms) for the multi-label $F_\beta$-measure, we will start by viewing the multi-label learning problem as a giant multiclass classification problem with $n = 2^s$ classes (this is only for the purpose of analysis and derivation of the surrogates; as we will see, the actual algorithms we obtain will require learning only $s^2+1$ real-valued score functions). To this end, let us define the $F_\beta$-loss matrix $L^{F_\beta} \in [0,1]^{2^s \times 2^s}$ as follows:
$$\ell^{F_\beta}_{y,\hat{y}} = 1 - F_\beta(y,\hat{y}), \qquad y, \hat{y} \in \{0,1\}^s.$$
The $F_\beta$-loss matrix has low rank. We show here that (a slightly shifted version of) the above loss matrix has rank at most $s^2+1$.
Stratifying over the different values of , we can write this as
This proves the claim. ∎
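The rank claim can be checked numerically on a small instance. The sketch below is an illustration, not part of the paper: it builds the $2^s \times 2^s$ matrix of $F_1$ values (i.e. one minus the loss matrix, which is the shifted matrix up to sign) for $s = 5$ using exact rational arithmetic, and computes its rank by Gaussian elimination. The choice $s = 5$ is made here only because it is the smallest case in which the bound $s^2+1 = 26$ is below the trivial bound $2^s = 32$. The helper names and the closed form used for $F_\beta$ (with $0/0 = 1$) are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product


def f_beta(y, y_hat, beta_sq=1):
    """Exact F_beta for 0/1 tuples (beta_sq = beta^2), taking 0/0 as 1."""
    tp = sum(a * b for a, b in zip(y, y_hat))
    denom = beta_sq * sum(y) + sum(y_hat)
    return Fraction(1) if denom == 0 else Fraction((1 + beta_sq) * tp, denom)


def matrix_rank(rows):
    """Rank of a matrix of Fractions via Gaussian elimination (no rounding)."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[rank][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


s = 5
labelings = list(product([0, 1], repeat=s))            # all 2^s labelings
F = [[f_beta(y, y_hat) for y_hat in labelings] for y in labelings]
r = matrix_rank(F)
print("matrix size:", len(labelings), "rank:", r, "bound s^2+1:", s * s + 1)
```

Exact fractions are used rather than floating point so that the computed rank does not depend on a numerical tolerance.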
$F_\beta$-calibrated surrogates. Given the above result, we can now apply Theorem 2 to construct a family of $(s^2+1)$-dimensional convex calibrated surrogate losses for the $F_\beta$-loss matrix.³ Specifically, starting with any strictly proper composite binary loss $\phi$ with underlying link function $\gamma$, we define a multiclass surrogate and mapping as follows (where we denote ):
where are as defined in Eqs. (2-5). Then, by Theorem 2 and the proof of Proposition 3, it follows that the resulting surrogate-mapping pair is calibrated for the $F_\beta$-loss matrix.⁴ Therefore, the resulting surrogate risk minimization algorithm, when implemented in a universal function class (with suitable regularization), is consistent for the $F_\beta$-measure. The algorithm is summarized in Algorithm 2. Note that since the targets of the constituent binary problems take values in $\{0,1\}$, in this case minimizing the surrogate risk above amounts to solving $s^2+1$ binary CPE problems with standard binary (non-probabilistic) labels.
³ Note that minimizing the generalization error under the shifted loss matrix is equivalent to minimizing the generalization error under the $F_\beta$-loss matrix, and therefore a calibrated surrogate for the shifted matrix is also calibrated for the $F_\beta$-loss matrix.
⁴ Note that when applying Theorem 2 here, we have and , and therefore , , and .
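To make the reduction to binary problems concrete, the sketch below shows one output coding that is consistent with the rank argument: a labeling $y$ is mapped to $s^2+1$ binary targets, one indicating $y = 0$ and one for each pair $(j,k)$ indicating that tag $j$ is active and exactly $k$ tags are active in total. It then checks, for $s = 3$, that $F_\beta(y,\hat{y})$ is an inner product between this code and a vector depending only on $\hat{y}$. The ordering of the targets, the function names `encode` and `decode_weights`, and the closed form used for $F_\beta$ are assumptions made for illustration; the paper's exact definitions are given in its Eqs. (2-5).

```python
from fractions import Fraction
from itertools import product


def encode(y):
    """Map a 0/1 labeling y to s^2 + 1 binary targets (illustrative ordering)."""
    s, k = len(y), sum(y)
    bits = [1 if k == 0 else 0]                    # target 0: no tag is active
    for j in range(s):                             # target (j, l): tag j is active
        for l in range(1, s + 1):                  # and exactly l tags are active
            bits.append(1 if (y[j] == 1 and k == l) else 0)
    return bits


def decode_weights(y_hat, beta_sq=1):
    """Vector b(y_hat) such that F_beta(y, y_hat) = <encode(y), b(y_hat)>."""
    s, k_hat = len(y_hat), sum(y_hat)
    w = [Fraction(1 if k_hat == 0 else 0)]
    for j in range(s):
        for l in range(1, s + 1):
            w.append(Fraction((1 + beta_sq) * y_hat[j], beta_sq * l + k_hat))
    return w


def f_beta(y, y_hat, beta_sq=1):
    """Exact F_beta for 0/1 tuples (beta_sq = beta^2), taking 0/0 as 1."""
    tp = sum(a * b for a, b in zip(y, y_hat))
    denom = beta_sq * sum(y) + sum(y_hat)
    return Fraction(1) if denom == 0 else Fraction((1 + beta_sq) * tp, denom)


s = 3
labelings = list(product([0, 1], repeat=s))
mismatches = 0
for y in labelings:
    for y_hat in labelings:
        inner = sum(a * b for a, b in zip(encode(y), decode_weights(y_hat)))
        if inner != f_beta(y, y_hat):
            mismatches += 1
print("code length:", len(encode(labelings[0])), "mismatches:", mismatches)
```

Under this coding, each training example contributes one ordinary 0/1 label to each of the $s^2+1$ binary problems, which is why no probabilistic labels are needed.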
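Choice of strictly proper composite binary loss $\phi$. As a specific instantiation, in our experiments, we will make use of the binary logistic loss $\phi_{\log}: \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$ given by
$$\phi_{\log}(y,u) = \ln\big(1 + e^{-yu}\big)$$
as the binary loss $\phi$ above; this is known to be strictly proper composite (Reid and Williamson, 2010), with underlying logit link function $\gamma_{\log}$ given by
$$\gamma_{\log}(p) = \ln\Big(\frac{p}{1-p}\Big).$$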
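The role of the link function can be illustrated numerically. The sketch below assumes the standard forms of the logistic loss, $\phi(y,u) = \ln(1+e^{-yu})$ for $y \in \{\pm 1\}$, and of the logit link, $\gamma(p) = \ln(p/(1-p))$. It minimizes the expected loss over a grid of scores for a fixed class probability $p$ and checks that the minimizing score, passed through the inverse link, recovers $p$ up to the grid resolution. The value $p = 0.3$ and the grid are arbitrary choices for illustration.

```python
import math


def logistic_loss(y, u):
    """Binary logistic loss for a label y in {+1, -1} and a real score u."""
    return math.log1p(math.exp(-y * u))


def logit(p):
    """Logit link: maps a probability in (0, 1) to a real-valued score."""
    return math.log(p / (1 - p))


def inv_logit(u):
    """Inverse link (sigmoid): maps a score back to a probability."""
    return 1 / (1 + math.exp(-u))


def expected_loss(p, u):
    """Expected logistic loss at score u when y = +1 with probability p."""
    return p * logistic_loss(+1, u) + (1 - p) * logistic_loss(-1, u)


p = 0.3
grid = [i / 100 for i in range(-500, 501)]          # scores from -5 to 5, step 0.01
u_best = min(grid, key=lambda u: expected_loss(p, u))
print(u_best, logit(p), inv_logit(u_best))
```

This is the sense in which a learned score can be inverted through the link to obtain a class probability estimate.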
Implementation of ‘decode’ mapping. The ‘decode’ mapping above can be implemented in $O(s^3)$ time using a procedure due to Dembczynski et al. (2011); details are provided in the Appendix for completeness. In particular, Dembczynski et al. (2011) show that if one knows the true conditional MLC distribution $p(y \mid x)$, then one can use $s^2+1$ statistics of this distribution to construct a Bayes optimal classifier for the $F_\beta$-measure; they then provide a procedure to perform this computation in $O(s^3)$ time. As we discuss in greater detail in Section 6, our surrogate loss can be viewed as computing estimates of the same $s^2+1$ statistics from the training sample $S$, and therefore our algorithm, which applies the ‘decoding’ procedure of Dembczynski et al. (2011) to these estimated quantities, can be viewed as effectively learning a form of ‘plug-in’ multi-label classifier for the $F_\beta$-measure.
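A decoding step of this kind can be sketched as follows. The code is a hedged illustration in the spirit of the procedure described above, not a transcription of the paper's Appendix. It assumes the statistics are $P(y = 0 \mid x)$ and $P(y_j = 1, \|y\|_1 = k \mid x)$ for $j,k \in [s]$, and uses the fact that, for a fixed number $k$ of predicted tags, the expected $F_\beta$ is a sum of per-tag gains, so the best prediction of size $k$ takes the $k$ tags with the largest gains. The function `decode`, its argument layout, and the random toy distribution are assumptions of this sketch; the result is compared with brute-force search over all $2^s$ predictions.

```python
import random
from itertools import product


def f_beta(y, y_hat, beta=1.0):
    """F_beta between two 0/1 label vectors, with the convention 0/0 = 1."""
    tp = sum(a * b for a, b in zip(y, y_hat))
    denom = beta ** 2 * sum(y) + sum(y_hat)
    return 1.0 if denom == 0 else (1 + beta ** 2) * tp / denom


def decode(p_zero, p_joint, beta=1.0):
    """Return a labeling maximizing expected F_beta, and that expected value.

    p_zero        : P(y = 0 | x)
    p_joint[j][l] : P(y_j = 1, ||y||_1 = l + 1 | x), for j, l in range(s)
    """
    s = len(p_joint)
    best_val, best_y = p_zero, [0] * s        # predicting no tags scores P(y = 0 | x)
    for k in range(1, s + 1):                 # k = number of tags predicted active
        # gain[j]: contribution of tag j to the expected F_beta when k tags are predicted
        gain = [sum(p_joint[j][l] * (1 + beta ** 2) / (beta ** 2 * (l + 1) + k)
                    for l in range(s)) for j in range(s)]
        top = sorted(range(s), key=lambda j: gain[j], reverse=True)[:k]
        val = sum(gain[j] for j in top)
        if val > best_val:
            best_val, best_y = val, [1 if j in top else 0 for j in range(s)]
    return best_y, best_val


# Toy check against brute force on a random distribution over {0,1}^s.
random.seed(0)
s = 4
labelings = list(product([0, 1], repeat=s))
weights = [random.random() for _ in labelings]
total = sum(weights)
prob = {y: w / total for y, w in zip(labelings, weights)}

p_zero = prob[(0,) * s]
p_joint = [[sum(p for y, p in prob.items() if y[j] == 1 and sum(y) == l + 1)
            for l in range(s)] for j in range(s)]

y_star, val = decode(p_zero, p_joint)
brute = max(sum(p * f_beta(y, y_hat) for y, p in prob.items()) for y_hat in labelings)
print(y_star, round(val, 6), round(brute, 6))
```

Computing the gains costs $O(s^2)$ for each of the $s$ candidate sizes, which matches the cubic running time quoted above; in the learning setting the inputs would be estimated probabilities rather than exact ones.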
5 Regret Transfer Bound
Above, we constructed a family of surrogate-mapping pairs that are calibrated for the $F_\beta$-loss matrix (Eq. (6) together with its ‘decode’ mapping), yielding a family of surrogate risk minimization algorithms for the $F_\beta$-measure (Algorithm 2). We now give a quantitative regret transfer bound showing that any guarantees on the surrogate regret also translate to guarantees on the target $F_\beta$-regret. Specifically, the surrogate loss was defined in terms of a constituent strictly proper composite binary loss $\phi$. We show that if the binary loss $\phi$ is strongly proper composite (a relatively mild condition satisfied by several common strictly proper composite binary losses, including the logistic loss), then for all models $f$, we can upper bound the target $F_\beta$-regret of the multi-label classifier $\mathrm{decode} \circ f$ in terms of the surrogate regret of $f$. In order to prove the regret transfer bound, we will need the following definition:
Definition 4 (Strongly proper composite binary losses (Agarwal, 2014)).
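Let $\lambda > 0$. A binary loss $\phi: \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$ is said to be $\lambda$-strongly proper composite with underlying (invertible) link function $\gamma: [0,1] \to \mathbb{R}$ if for all $p \in [0,1]$, $u \in \mathbb{R}$:
$$\mathbf{E}_{y \sim p}\big[ \phi(y,u) - \phi(y,\gamma(p)) \big] \;\geq\; \frac{\lambda}{2} \big( \gamma^{-1}(u) - p \big)^2.$$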
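The condition can be checked numerically for the logistic loss. The sketch below assumes the inequality takes the form: expected loss at score $u$ minus expected loss at the optimal score $\gamma(p)$ is at least $(\lambda/2)(\gamma^{-1}(u) - p)^2$. The constant $\lambda = 4$ for the logistic loss is not stated in the text above and is an assumption here; it follows from Pinsker's inequality, since the excess expected logistic loss equals the KL divergence between Bernoulli($p$) and Bernoulli($\gamma^{-1}(u)$). The grids over $p$ and $u$ are arbitrary.

```python
import math


def sigmoid(u):
    """Inverse of the logit link."""
    return 1 / (1 + math.exp(-u))


def expected_logistic(p, u):
    """Expected logistic loss at score u when y = +1 w.p. p and y = -1 w.p. 1 - p."""
    return p * math.log1p(math.exp(-u)) + (1 - p) * math.log1p(math.exp(u))


def excess(p, u):
    """Expected loss at score u minus expected loss at the optimal score logit(p)."""
    return expected_logistic(p, u) - expected_logistic(p, math.log(p / (1 - p)))


lam = 4.0  # assumed strong-properness constant for the logistic loss (see lead-in)
gaps = [excess(p, u) - (lam / 2) * (sigmoid(u) - p) ** 2
        for p in (i / 100 for i in range(1, 100))      # p in {0.01, ..., 0.99}
        for u in (k / 10 for k in range(-60, 61))]     # u in {-6.0, ..., 6.0}
holds = min(gaps) >= -1e-9                             # small slack for rounding
print("inequality holds on the grid:", holds)
```

A grid check of this kind is only a sanity test on the sampled points, not a proof of the inequality.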
Additional notation. To prove our regret transfer bound, we will also need some additional notation. In particular, for each , we will define the vectors
Intuitively, the elements of are the ‘class probability functions’ corresponding to the binary CPE problems effectively created by the surrogate loss defined in Eq. (6). The function learned by minimizing will be such that will serve as an estimate of .
Regret transfer bound. We are now ready to state and prove the following regret transfer bound for the family of surrogate losses defined in the previous section:
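Theorem 5.
Let $\phi$ be a $\lambda$-strongly proper composite binary loss with underlying link function $\gamma$. Let the surrogate loss and the ‘decode’ mapping be defined as in Eq. (6) and the mapping that accompanies it. Then for all probability distributions $D$ on $\mathcal{X} \times \{0,1\}^s$ and all $f: \mathcal{X} \to \mathbb{R}^{s^2+1}$, we have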
Now, since $\phi$ is $\lambda$-strongly proper composite with link function $\gamma$, we have
Moreover, we have
and for , we have
It can be verified that is maximized at , yielding for each ,
Combining Eq. (13) with the inequalities that follow it and applying Jensen’s inequality (to the convex function ) proves the claim. ∎
Remark. We note that Theorem 5 gives a self-contained proof that the surrogate-mapping pair defined in Eq. (6) and its accompanying ‘decode’ mapping is calibrated for the $F_\beta$-loss matrix, since the result implies that for any sequence of models learned from training samples of increasing size ,
Nevertheless, since the design of our surrogate-mapping pair was based on the work of Ramaswamy et al. (2014), we chose to present their calibration result (Theorem 2) first. We also note that, while we have stated the above regret transfer bound for the $F_\beta$-measure, a similar bound also applies more generally to all multiclass problems with low-rank loss matrices as considered in Theorem 2, thus yielding a stronger (quantitative) result than Theorem 2 (Ramaswamy, 2015).
6 Relationship with Plug-in Algorithm of Dembczynski et al. (2013)
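The plug-in algorithm of Dembczynski et al. (2013), termed exact F-measure plug-in (EFP), estimates the following $s^2+1$ statistics of the conditional label distribution $p(y \mid x)$:
$$P(y = 0 \mid x); \qquad P(y_j = 1, \, \|y\|_1 = k \mid x), \quad j,k \in [s].$$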
It formulates estimation of the first statistic above as a binary CPE problem (solved via binary logistic regression), and estimation of the remaining $s^2$ statistics as $s$ multiclass CPE problems (one for each $j \in [s]$), each with $s+1$ classes (solved via multiclass logistic regression). In practice, since the label vectors are typically sparse (only a small subset of the labels are active in any instance), the effective number of classes for each of the $s$ problems is much smaller than $s+1$, and Dembczynski et al. (2013) exploit this fact by considering the statistics $P(y_j = 1, \|y\|_1 = k \mid x)$ only for small $k$ (based on the maximum number of active labels in the training instances).
As the proof of Theorem 5 makes clear, our algorithm can be viewed as estimating the vector , with estimation of each component formulated as a binary CPE problem; in particular, having learned a score vector , our algorithm yields as an estimate for . A closer look reveals that captures essentially the same statistics as above:⁵
⁵ Note that for each $j \in [s]$, the probabilities $P(y_j = 1, \|y\|_1 = k \mid x)$ ($k \in [s]$) and $P(y_j = 0 \mid x)$ estimated by the $j$-th multiclass problem in EFP add up to 1, so the EFP algorithm effectively estimates a total of $s^2+1$ statistics.
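The correspondence between the two reductions can be illustrated at the level of training labels. The sketch below assumes that the $j$-th multiclass problem in EFP assigns class 0 when tag $j$ is inactive and class $\|y\|_1$ otherwise, and that the binary problems use the indicators of the events $\{y_j = 1, \|y\|_1 = k\}$; both are assumptions of this illustration, stated in terms of the statistics discussed above rather than taken verbatim from either algorithm. Under these assumptions, the binary target for $(j,k)$ is exactly the indicator that the $j$-th multiclass label equals $k$.

```python
from itertools import product


def efp_class(y, j):
    """Multiclass label for tag j: 0 if tag j is inactive, else the number of active tags."""
    return sum(y) if y[j] == 1 else 0


def binary_target(y, j, k):
    """Binary label for the pair (j, k): tag j is active and exactly k tags are active."""
    return 1 if (y[j] == 1 and sum(y) == k) else 0


s = 3
agree = all(
    binary_target(y, j, k) == (1 if efp_class(y, j) == k else 0)
    for y in product([0, 1], repeat=s)
    for j in range(s)
    for k in range(1, s + 1)
)
example = (1, 0, 1)
print([efp_class(example, j) for j in range(s)], agree)
```

Each multiclass problem thus carries the same information as $s$ of the binary problems, with the remaining class (tag inactive) determined by the constraint that the class probabilities sum to 1.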