In many real-world classification tasks, the performance metric used to evaluate a multi-class classifier is a non-decomposable function of the classifier's confusion matrix and cannot be expressed as a sum or expectation of losses on individual data points; this includes, for example, the micro and macro F-measures used widely in information retrieval and the multi-class G-mean metric popular in class-imbalanced problems (see Table 1 for other examples). While there has been much work in recent years on understanding the consistency properties of plug-in or cost-sensitive risk minimization based learning algorithms for ‘binary’ non-decomposable metrics [1, 2, 3, 4, 5], little is known about the form of the optimal classifier for a general multi-class non-decomposable metric, or about how these learning algorithms for binary performance metrics, which make use of a brute-force line search over a single threshold/cost parameter, generalize to the multi-class case, where the number of parameters to be tuned scales with the number of classes.
In this paper, we provide a general framework for analysing a multi-class non-decomposable performance metric, where the problem of finding the optimal classifier for the performance metric is viewed as an optimization problem over the space of all confusion matrices achievable under the given distribution. Using this framework, we show that, under a continuous distribution, the optimal classifier for any multi-class performance metric (that satisfies a mild condition) can be obtained by solving a cost-sensitive classification problem, where the costs are given by the gradient of the non-decomposable metric at the optimal confusion matrix. This result generalizes a previous result for binary non-decomposable metrics and also recovers several previous results on the form of the optimal classifier for specific binary performance metrics [6, 3, 5].
A natural first-cut learning algorithm that arises from the above characterization is one that learns a plug-in classifier by applying an empirical weight matrix chosen by a brute-force search to a suitable class probability estimator. While this method can be shown to be statistically consistent with respect to the given performance metric (under a continuous distribution), it becomes computationally inefficient when the number of classes is large. As an alternative, we provide an efficient learning algorithm based on the conditional gradient (CG) optimization method (which we call the ‘BayesCG’ algorithm) that avoids a brute-force search over costs and can be seen as instead running the CG method over the space of feasible confusion matrices; the resulting algorithm proceeds via a sequence of cost-sensitive classification problems, the solutions for which take the form of plug-in classifiers. We show that the BayesCG algorithm is consistent for performance metrics that are concave functions of the confusion matrix; to the best of our knowledge, this is the first efficient learning algorithm (whose running time is polynomial in the number of classes) that is provably consistent for a large family of multi-class non-decomposable metrics. Also, unlike the brute-force plug-in method, the BayesCG algorithm requires no assumptions on the form of the optimal classifier for the given performance metric and hence on the underlying distribution.
Our consistency result makes use of a novel proof technique based on the convergence analysis of the CG method. More specifically, we show that the linear optimization step of the above CG method is solved approximately in the BayesCG algorithm and thus establish a regret bound for the algorithm for smooth concave performance metrics. For performance metrics that are non-smooth concave functions of the confusion matrix, we prescribe applying the BayesCG algorithm to a suitable smooth approximation of these performance metrics; we instantiate and show consistency of this approach for concave performance metrics such as the G-mean, H-mean and Q-mean.
1.1 Related Work
There have been several algorithms designed to optimize non-decomposable classification metrics, particularly in the binary classification setting; these include the binary plug-in algorithm that applies an empirical threshold to a class probability estimate [8, 1, 2, 3, 5], cost-sensitive risk minimization based approaches [9, 3, 4], methods that optimize convex and non-convex approximations to the given performance metric [10, 11, 12, 13, 14], and decision-theoretic methods that learn a class probability estimate and compute predictions that maximize the expected value of the performance metric on a test set [15, 16, 9]. Of these, the plug-in method is known to be consistent for any binary performance metric for which the optimal classifier is threshold-based, while the cost-sensitive approach has been shown to be consistent for the class of fractional-linear performance metrics. There have also been results characterizing the optimal classifier for several binary non-decomposable metrics [1, 6, 2, 3], with the specific form of the classifier available in closed form for fractional-linear metrics (i.e., metrics that are ratios of linear functions).
We would also like to point out that there has been some work on designing algorithms for optimizing the F-measure in multi-label classification settings [17, 18, 19, 4] and consistency results for these methods [19, 20], but these results do not apply to the setting considered in this paper. In particular, while the multi-class performance metrics that we seek to optimize are non-decomposable/non-additive over data points, the standard performance metrics of interest in a multi-label setting can indeed be expressed as a sum of losses on individual examples, with each loss on an example potentially being a non-decomposable function of the labels on the example.
Organization. We start with some preliminaries and background on non-decomposable performance metrics in Section 2. In Section 3, we give a general framework for analysing multi-class non-decomposable performance metrics and use this framework to derive the form of the optimal classifier for a non-decomposable performance metric. Based on this characterization, we consider a brute-force plug-in method for a multi-class non-decomposable metric in Section 4, and show that this method is consistent. In Section 5, we design an alternate efficient learning algorithm based on the conditional gradient optimization method, which we show is consistent for a large family of concave non-decomposable metrics. All proofs not in the main text are provided in the Appendix.
2 Preliminaries and Background
Notations. For any , we shall denote . For a predicate , we shall denote by the indicator function that takes value 1 if is true and 0 otherwise. The probability simplex of dimension will be denoted by . For a matrix , we will use to denote the column of the matrix, and shall refer to as the norm of and to as the norm of ; for any two matrices , we shall denote their component-wise inner product as . For any set , we denote its closure under an appropriate metric space by . For maximization over integral sets, the notation shall refer to ties being broken in favor of the larger number.
Problem Setup. Let be an instance space and be a set of class labels. We are given a training sample
drawn i.i.d. according to an underlying (unknown) probability distribution over , and the goal in a multi-class classification problem is to learn from these examples a prediction model , which when given a new instance , makes a prediction . We will be interested in the more general problem of learning from , a randomized classifier that for each instance outputs a probability distribution over the labels in ; note that any deterministic classifier can be seen as a randomized classifier whose output is always a vertex of the probability simplex . In particular, we will consider settings where the performance of is evaluated using a non-decomposable performance metric that cannot be expressed as a sum or expectation of losses on individual examples. We shall denote the marginal of over as , the conditional class probabilities for an instance as , and the prior class probabilities as ; for a sample , we shall use to denote the empirical distribution which has its mass uniformly on the instances in .
|Binary -metric||||Non-concave, Pseudo-linear|
|Jaccard Coefficient (JAC)||||Non-concave, Pseudo-linear|
|Micro -metric||-||||Non-concave, Pseudo-linear|
|H-Mean (HM)||||Concave, Non-smooth|
|Q-Mean (QM)||||Concave, Non-smooth|
|G-Mean (GM)||[29, 30]||Concave, Non-smooth|
|Min-Max metric||||Concave, Non-differentiable|
Multi-class Non-decomposable Performance Metrics. Let us first define for a deterministic classifier and distribution , the confusion matrix as
the corresponding confusion matrix for a randomized classifier is given by
In this paper, we shall be interested in non-decomposable performance metrics that can be expressed as a continuous and bounded function of the confusion matrix:
For example, the macro F-measure used widely in text retrieval can be expressed as a function of the confusion matrix. Table 1 contains several examples of performance metrics that are functions of the confusion matrix. (Footnote 1: For all performance metrics considered in this paper, higher values indicate better performance.) (Footnote 2: In the setting considered here, the goal is to maximize a performance metric that can be expressed as a (non-decomposable) function of expectations; this is referred to by Ye et al. (2012) as the expected utility maximization setup and is different from the decision-theoretic setting that they consider, where one looks at the expectation of a non-decomposable performance metric on examples, and seeks to maximize its limiting value as the number of examples grows.) Throughout this paper, we shall use the term performance metric to refer to both the metric itself and the function of the confusion matrix that defines it.
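To make the notion of a metric that is a function of the confusion matrix concrete, the following is a minimal sketch in plain Python. It assumes the usual convention that entry (i, j) of the confusion matrix is the fraction of examples with true label i that are predicted as j, and the usual definitions of the G-mean and H-mean as the geometric and harmonic means of the per-class recalls; the function names are hypothetical and chosen only for illustration.

```python
import math

def confusion_matrix(y_true, y_pred, n_classes):
    """Empirical confusion matrix: C[i][j] is the fraction of examples
    whose true label is i and whose predicted label is j."""
    m = len(y_true)
    C = [[0.0] * n_classes for _ in range(n_classes)]
    for y, y_hat in zip(y_true, y_pred):
        C[y][y_hat] += 1.0 / m
    return C

def recalls(C):
    """Per-class recall C[i][i] / (sum of row i); the row sums are the
    empirical class priors, assumed non-zero for every class."""
    return [C[i][i] / sum(C[i]) for i in range(len(C))]

def accuracy(C):
    """A decomposable (linear) metric: the trace of the confusion matrix."""
    return sum(C[i][i] for i in range(len(C)))

def g_mean(C):
    """Geometric mean of the per-class recalls (non-decomposable)."""
    r = recalls(C)
    return math.prod(r) ** (1.0 / len(r))

def h_mean(C):
    """Harmonic mean of the per-class recalls (non-decomposable)."""
    r = recalls(C)
    if min(r) == 0.0:
        return 0.0
    return len(r) / sum(1.0 / r_i for r_i in r)

# Example: three classes, eight labelled examples.
C = confusion_matrix([0, 0, 0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 0, 2, 2], 3)
print(accuracy(C), g_mean(C), h_mean(C))
```

Accuracy is a sum over entries of the matrix and hence over individual examples, whereas the G-mean and H-mean combine the entries non-linearly and so cannot be written as an average of per-example losses.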
-consistency. We now consider the optimal value of performance metric over all randomized classifiers:
and shall refer to the classifier attaining the above value, if one exists, as the -optimal classifier. One can then define the -regret of classifier as
A learning algorithm that takes a training sample drawn i.i.d. from and outputs a classifier is said to be -consistent if the -regret of classifier goes to zero in probability:
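where the convergence in probability is over the random draw of the training sample from the underlying distribution. (Footnote 3: A sequence of random variables converges in probability to a value if, for every positive tolerance, the probability that the sequence deviates from that value by at least the tolerance goes to zero as the sample size grows.)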
Optimal Classifier for Decomposable Metrics. While in general, it is not clear if there exists a classifier that attains the optimal value of a given performance metric , it is well-known that when is a linear function (i.e., can be expressed as an expectation of a loss on individual examples), a -optimal classifier always exists. In particular, if takes the form
for some matrix , then any classifier that satisfies the following condition is -optimal:
It is seen that there always exists a deterministic classifier that satisfies the above condition. Also, it is worth noting that maximizing the above performance metric is equivalent to solving a cost-sensitive classification problem, with the costs given by the negative of the ‘gain’ matrix .
Plug-in Algorithm for Decomposable Metrics. A standard approach for maximizing a decomposable metric (or equivalently solving a cost-sensitive classification problem) is the plug-in method, where one first obtains a class probability estimation (CPE) model from the given training sample and constructs a classifier for any instance . This approach can be shown to be -consistent if the CPE algorithm used to learn is such that  (which is indeed the case for any algorithm that performs a regularized empirical risk minimization of a proper loss such as the logistic loss [33, 34]).
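As an illustration of the plug-in construction described above, the sketch below builds a classifier that predicts the label maximizing the gain-weighted class probability estimate. The indexing convention (gain[i][j] is the reward for predicting j when the true label is i), the function names and the toy probability model are assumptions made for this example; ties are broken in favor of the larger label, following the convention stated in the Notations paragraph.

```python
def weighted_argmax(eta_x, gain):
    """Return the label j maximizing sum_i eta_x[i] * gain[i][j].
    Ties are broken in favor of the larger label."""
    n = len(eta_x)
    scores = [sum(eta_x[i] * gain[i][j] for i in range(n)) for j in range(n)]
    best = max(scores)
    return max(j for j in range(n) if scores[j] == best)

def plug_in_classifier(cpe_model, gain):
    """Wrap a class probability estimation model (a callable mapping an
    instance to a list of estimated class probabilities) into a classifier."""
    return lambda x: weighted_argmax(cpe_model(x), gain)

# Hypothetical CPE model over three classes, for illustration only.
def toy_cpe(x):
    return [0.6, 0.3, 0.1] if x < 0 else [0.2, 0.2, 0.6]

identity_gain = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # ordinary 0-1 accuracy
rare_class_gain = [[1, 0, 0], [0, 1, 0], [0, 0, 10]]  # rewards class 2 heavily

h_acc = plug_in_classifier(toy_cpe, identity_gain)
h_rare = plug_in_classifier(toy_cpe, rare_class_gain)
print(h_acc(-1.0), h_rare(-1.0))
```

With the identity gain matrix this reduces to predicting the most probable class; other gain matrices shift the decision regions without retraining the probability model.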
Known Results for Binary Non-decomposable Performance Metrics. We now summarize what is understood about the optimal classifier for binary non-decomposable performance metrics and about the consistency properties of learning algorithms for these metrics. It is known that, under a continuous distribution, the optimal classifier for a binary monotonic non-decomposable metric is obtained by placing a suitable threshold on the posterior class probability function. For certain specific performance metrics, such as those that are fractional-linear, i.e., ratios of linear functions (e.g., binary F-measure and JAC measure) [1, 35, 6, 3, 2], and the approximate median significance (AMS) metric, this characterization holds even without the continuity assumption on the distribution; for some of these metrics, the exact form of the threshold is also available in closed form [3, 5]. It is also known that a plug-in algorithm that constructs a classifier by assigning an empirical threshold to a suitable class probability estimate (see Algorithm 1) is statistically consistent with respect to any binary non-decomposable metric for which the optimal classifier is of the above thresholded form [2, 3]; a similar result has also been shown for a cost-sensitive risk minimization based approach for fractional-linear metrics.
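The binary plug-in approach summarized above can be sketched as a search over candidate thresholds on a held-out sample. The code below is a simple quadratic-time illustration written for this purpose and is not a reproduction of Algorithm 1; the function names and the choice of the F1-measure as the target metric are assumptions.

```python
def f1_score(y_true, y_pred):
    """Binary F1-measure computed from true/false positives and negatives."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(eta_hat, y_true, metric=f1_score):
    """Try each distinct estimated probability as a threshold and keep the
    one whose thresholded classifier scores highest on the held-out data."""
    best_t, best_val = None, -1.0
    for t in sorted(set(eta_hat)):
        y_pred = [1 if e >= t else 0 for e in eta_hat]
        val = metric(y_true, y_pred)
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val

# Held-out estimates of P(y = 1 | x) and the corresponding labels.
eta_hat = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
y_true = [1, 1, 0, 1, 0, 0]
print(best_threshold(eta_hat, y_true))
```

Only thresholds equal to one of the observed scores need to be tried, since the predictions on the held-out sample do not change between consecutive scores.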
While there has been a lot of work on binary non-decomposable metrics as seen above, little is known about how these results extend to the multi-class case. In particular, what is the form of the optimal classifier for a general multi-class non-decomposable metric? How do the plug-in and cost-sensitive risk minimization based algorithms for binary performance metrics, which essentially need to tune a single parameter, generalize to the multi-class case, where the number of parameters to be tuned grows with the number of classes? In this paper, we address these questions.
Before we proceed further, we will find it convenient to define for any given function , the set of weighted argmax classifiers obtained by a gain matrix on :
Finally, a function is said to be -Lipschitz w.r.t. the norm over , for some , if
and is -smooth w.r.t. the norm over , for some , if
3 Characterization of the Optimal Classifier for a General Multi-class Performance Metric
We start by providing a generic framework for studying a multi-class non-decomposable performance metric, where we view the problem of finding the optimal classifier for a non-decomposable metric as an optimization problem over the space of all confusion matrices that are attainable under the given distribution. Using this framework, we give a characterization of the optimal classifier for a non-decomposable metric; in particular, we show that under a continuous distribution, the optimal classifier for any multi-class non-decomposable performance metric (that satisfies a mild condition) can be obtained by maximizing a decomposable performance metric, whose gain matrix is given by the gradient of the non-decomposable metric at the optimal confusion matrix. To our knowledge, this is the first such result for a general multi-class non-decomposable metric, generalizing a previous result for binary non-decomposable metrics and also recovering previous results on the form of the optimal classifier for several performance metrics [6, 3, 5].
Feasible confusion matrices. We begin by defining the set of feasible confusion matrices for a distribution as the set of all confusion matrices achievable by a randomized classifier under :
Note that every matrix in this set is such that its row sums are equal to the prior class probabilities. It can be shown that this set is convex.
Proposition 1 (Convexity of ).
is a convex set.
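The reasoning behind the convexity claim and the row-sum property can be sketched as follows. The symbols used here ($h$ for a randomized classifier with values in the simplex $\Delta_n$, $C[h]$ for its confusion matrix under $D$, and $\lambda$ for a mixing weight) are assumed notation introduced for this sketch and may differ from the notation of the displayed equations.

```latex
% Confusion matrix of a randomized classifier h : X -> \Delta_n (assumed notation)
C_{ij}[h] \;=\; \mathbf{E}_{(X,Y)\sim D}\big[\,\mathbf{1}(Y=i)\,h_j(X)\,\big].

% Row sums equal the class priors, because the entries of h(X) sum to one:
\sum_{j=1}^{n} C_{ij}[h]
  \;=\; \mathbf{E}\Big[\mathbf{1}(Y=i)\sum_{j=1}^{n} h_j(X)\Big]
  \;=\; \mathbf{P}(Y=i).

% Convexity: for classifiers h_1, h_2 and \lambda \in [0,1], the pointwise mixture
% h_\lambda = \lambda h_1 + (1-\lambda) h_2 again takes values in \Delta_n, and by
% linearity of expectation
C[h_\lambda] \;=\; \lambda\,C[h_1] \;+\; (1-\lambda)\,C[h_2].
```

Every convex combination of two achievable confusion matrices is therefore itself achieved by a randomized classifier, namely the corresponding mixture.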
The problem of finding the optimal classifier for the given performance metric can now be cast as an optimization problem over ; we shall shortly see that this viewpoint is useful in both characterizing the optimal classifier for the performance metric and in designing consistent learning algorithms for the metric.
We next make the following continuity assumption on , which is essentially a multi-class extension of a similar assumption on in  (in the binary label setting).
Assumption A (Continuity of ). Let
be a random variable distributed uniformly over the simplex, and let be a base measure over such that . Let denote the probability measure that is associated with the random variable . We will say that a distribution satisfies Assumption A if is absolutely continuous w.r.t. .
We shall also make a mild assumption on that is satisfied by all performance metrics in Table 1 except the min-max metric.
Assumption B. We will say that satisfies Assumption B w.r.t. distribution if it is continuous, differentiable and bounded over , and is strictly increasing in the diagonal elements of its argument and non-increasing in the non-diagonal elements of its argument.
Under the above assumptions on and , we now show that a -optimal classifier always exists and can be obtained by maximizing a decomposable performance metric constructed from the gradient of at the optimal confusion matrix.
Theorem 2 (Characterization of -optimal Classifier for a General Multi-class Non-decomposable Metric Under Continuous Distributions).
Let distribution satisfy Assumption A, and satisfy Assumption B w.r.t. . Then there exists a classifier that is -optimal. Furthermore, for , we have
and thus any classifier of the following form is -optimal:
The above theorem is a multi-class generalization of the corresponding result for binary monotonic performance metrics, and in addition gives the precise form of the optimal classifier for the given performance metric. By a simple application of this theorem, we recover previous results on the form of the optimal classifier for performance metrics that are fractional-linear, such as the F-measure and Jaccard coefficient, and also for the AMS metric.
Before we prove Theorem 2, we will find it useful to state the following lemma.
Lemma 3 (Uniqueness of Optimal Confusion Matrix for Gain Matrices Obtained from Gradients of ).
Under the assumptions on and in Theorem 2, for any , we have
Moreover, the above set is a singleton.
The proof of Theorem 2 then follows from the first order necessary conditions for optimality of a confusion matrix and the above result.
Proof of Theorem 2.
We shall first show that there exists a -optimal classifier. By compactness of , we know that there exists such that
It remains to be shown that there exists a classifier that achieves this confusion matrix, i.e., . For this, we note from the first order necessary condition for optimality of , given convexity of (see Proposition 1), that
The above equation along with Lemma 3 implies that
Thus and hence there exists a classifier such that . This completes the proof of existence of a -optimal classifier.
Next for , we further have
Clearly, a classifier that maximizes the linear performance metric is also -optimal; as seen in Eq. (2), such a classifier takes the form given in the theorem statement. ∎
Remark 1 (Necessity of continuity Assumption A on ).
We note here that for the above characterization to hold for a general non-decomposable performance metric, the continuity assumption on distribution (Assumption A) is indeed necessary. We illustrate this fact for the H-mean performance metric by constructing a simple distribution that does not satisfy this assumption, and where a classifier of the form in the theorem statement is not necessarily optimal. Consider the following distribution over with . It can be seen that the unique optimal classifier for the H-mean performance metric is , whose confusion matrix and the gradient of at are given by:
Clearly, any classifier will have ; hence
It is worth noting that for certain restricted families of performance metrics, the characterization in Theorem 2 holds even without Assumption A on the distribution; this is the case, for example, when the metric is fractional-linear (e.g., F-measure, JAC) [3, 4] or convex (e.g., AMS metric).
Remark 2 (Extension to the min-max metric).
A result similar to the one in Theorem 2 also holds for the min-max metric, where it is well known from classical detection theory (in particular, from min-max hypothesis testing) that the optimal classifier for this metric is obtained by maximizing a decomposable metric with an appropriate gain matrix . In fact, one can show that if is an optimal classifier for the min-max metric , and is in the sub-differential of at , then
4 A Consistent Plug-in Method for Multi-class Non-decomposable Metrics Based on a Brute-force Search
Based on the above characterization of the optimal classifier of a non-decomposable metric, we now consider a simple plug-in based learning algorithm for a multi-class non-decomposable metric that uses a brute-force search over gain matrices; this approach can be seen as a natural extension of the binary plug-in method in Algorithm 1. We show that this method is consistent with respect to a general non-decomposable metric, and also provide an explicit regret bound for this method for the special case of performance metrics that exhibit a certain convexity-like property. In the next section, we design an alternate efficient learning algorithm based on the conditional gradient algorithm which is consistent for a large family of non-decomposable metrics.
Clearly, if the optimal confusion matrix for a multi-class non-decomposable metric is known a priori, one can learn a simple plug-in classifier by applying the gradient of the metric at this matrix to a suitable class probability estimator. In the absence of this knowledge, a natural first-cut approach would be to perform a brute-force search over all gain matrices with bounded entries, and pick the one for which the resulting plug-in classifier yields the maximum performance value on a held-out part of the training set (see Algorithm 2). (Footnote 4: Since a plug-in classifier constructed from a gain matrix is invariant to scaling of entries of the matrix, it suffices to perform the search over gain matrices with bounded entries.) While for the binary case, this brute-force search essentially reduces to a search over thresholds (on the class probability estimate) that can be performed efficiently in time linear in the number of held-out instances (as seen in Algorithm 1), for the general multi-class case, it is not clear if an exact search is tractable; in practice, this maximization over gain matrices can be performed approximately by considering only a finite number of matrices obtained from a fine-grained grid.
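The approximate grid-based search mentioned above can be sketched as follows. This is an illustration under stated assumptions and not a reproduction of Algorithm 2: the grid values, the function names, the choice of the G-mean as the target metric and the toy held-out data are all hypothetical, and every class is assumed to appear in the held-out sample.

```python
import itertools
import math

def weighted_argmax(eta_x, gain):
    """Label j maximizing sum_i eta_x[i] * gain[i][j]; ties go to the larger label."""
    n = len(eta_x)
    scores = [sum(eta_x[i] * gain[i][j] for i in range(n)) for j in range(n)]
    best = max(scores)
    return max(j for j in range(n) if scores[j] == best)

def g_mean(y_true, y_pred, n):
    """Geometric mean of per-class recalls on a labelled sample."""
    rec = []
    for i in range(n):
        idx = [k for k, y in enumerate(y_true) if y == i]
        rec.append(sum(1 for k in idx if y_pred[k] == i) / len(idx))
    return math.prod(rec) ** (1.0 / n)

def brute_force_plugin(eta_hat, y_true, n, grid=(0.0, 0.5, 1.0)):
    """Enumerate every n x n gain matrix with entries on the grid and keep the
    one whose plug-in classifier scores best on the held-out sample.
    The number of candidates is len(grid) ** (n * n)."""
    best_gain, best_val = None, -1.0
    for flat in itertools.product(grid, repeat=n * n):
        gain = [list(flat[i * n:(i + 1) * n]) for i in range(n)]
        y_pred = [weighted_argmax(e, gain) for e in eta_hat]
        val = g_mean(y_true, y_pred, n)
        if val > best_val:
            best_gain, best_val = gain, val
    return best_gain, best_val

# Held-out class probability estimates and labels for an imbalanced toy problem.
eta_hat = [[0.7, 0.2, 0.1], [0.7, 0.1, 0.2], [0.5, 0.4, 0.1],
           [0.5, 0.1, 0.4], [0.8, 0.1, 0.1]]
y_true = [0, 0, 1, 2, 0]
gain, value = brute_force_plugin(eta_hat, y_true, 3)
print(gain, value)
```

Even with three grid values and three classes the search visits 3^9 = 19683 candidate matrices, which illustrates why the cost of this approach grows quickly with the number of classes.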
We now show that (under a continuous distribution) the brute-force plug-in method is statistically consistent with respect to the given performance metric.
Theorem 4 (Consistency of Brute-force Plug-in Algorithm for Multi-class Non-decomposable Metrics).
The above guarantee applies to all performance metrics in Table 1. Before we prove this result, we state a couple of lemmas; in the first lemma, we consider a classifier obtained by applying a fixed gain matrix to a class probability estimation model, and show convergence of the entries of the confusion matrix for this classifier to those of a classifier obtained by applying the gain matrix to the true class probability function; in the second lemma, we give a uniform convergence bound for the confusion matrix of a set of weighted argmax classifiers.
Lemma 5 (Convergence of conf for fixed gain matrix).
Let satisfy Assumption A. Let be a class probability estimation model learned using a sample drawn i.i.d. from . For a fixed gain matrix such that no two columns are identical, let and be classifiers constructed as follows: and . If is such that , then (as ).
Lemma 6 (Uniform Convergence Generalization Bound for conf Over ).
Let be a fixed function and be a sample drawn i.i.d. according to . For any , we have with probability at least (over draw of from ),
where is a distribution-independent constant.
We are now ready to prove Theorem 4.
Proof of Theorem 4.
where the fourth step follows by definition of . By assumption B on , the matrix has no two identical columns, and hence by Lemma 5 we have that converges to as goes to . Along with the continuity of , this ensures that . By suitably conditioning on and using the uniform convergence bound in Lemma 6, one gets . ∎
For a special class of performance metrics that satisfy a certain convexity-like property, we have an explicit regret bound guarantee for the brute-force plug-in method.
Theorem 7 (Regret Bound for Brute-force Plug-in Algorithm for Convex-like Non-decomposable Metrics).
Let satisfy Assumption A, and satisfy Assumption B w.r.t. . Furthermore, let be -Lipschitz w.r.t. the norm over , and be such that there exists such that . If is the classifier learned by Algorithm 2 using training sample with parameter , then for any , we have with probability at least (over draw of from ):
where is a distribution-independent constant.
The above result applies to several performance metrics including the AMS measure () , the binary F-measure ()  and the multi-class micro F-measure () . The proof of this theorem follows a similar progression as that of Theorem 4 and additionally makes use of the convexity-like property of and the following regret bound for a linear/decomposable performance metric defined using a bounded gain matrix.
Lemma 8 (Regret Bound for Linear/Decomposable Performance Metric with Bounded Gain Matrix).
Let be a fixed gain matrix. Let be a class probability estimation model and be a classifier constructed such that . We then have
Remark 3 (Connection to the method of Parambath et al. (2014)).
For certain classes of performance metrics, the brute-force method in Algorithm 2 can be made more efficient by considering in the maximization step, only those gain matrices that are obtained from gradients of at feasible confusion matrices in . This is beneficial for example, in the case of fractional-linear performance metrics such as the binary and micro F-measure, where any gradient obtained from a feasible confusion matrix can be parametrized using a single scalar. The method of Parambath et al. (2014), which makes use of this fact, can be seen as a special case of Algorithm 2.
5 A Consistent and Efficient Algorithm for Multi-class Non-decomposable Metrics Based on the Conditional Gradient Method
While the (brute-force) plug-in method analyzed in the previous section is consistent for any non-decomposable metric for which the optimal classifier is of a certain desired form, the number of parameters that need to be tuned in this method grows with the number of classes; in particular, the number of evaluations of the performance metric required in this method could be exponential in the number of classes. In this section, we provide an alternate efficient learning algorithm based on the conditional gradient (CG) optimization method and show that this algorithm is consistent for a large family of concave performance metrics. Also, unlike the brute-force plug-in method, the CG based method makes no assumption on the form of the optimal classifier and hence on the underlying distribution.
More specifically, we pose the problem of learning a classifier for a non-decomposable metric as a constrained optimization problem over the space of feasible confusion matrices, and explore the use of optimization methods for solving this problem. However, unlike a standard optimization problem where the constraint is explicitly specified, in the problem that we consider, testing feasibility of a confusion matrix is not tractable in general; this precludes the use of standard gradient descent based constrained optimization solvers for this problem. Instead, we make use of the conditional gradient (CG) method which does not require the constraint set to be explicitly specified, and instead only requires access to a linear optimization oracle over the constraint set. In particular, this method proceeds via a sequence of linear optimization steps, each of which is equivalent to maximization of a decomposable performance metric and thus can be solved efficiently.
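The idea of running the conditional gradient method over confusion matrices can be sketched with a generic Frank-Wolfe loop. This is a textbook-style illustration and not the BayesCG algorithm itself: the step size 2/(t+2), the use of the same sample for both the plug-in step and the confusion matrix estimate, the function names, and the smooth concave metric (one minus the mean squared per-class miss rate) are all assumptions made for this example. Each linear optimization step is solved approximately by a plug-in classifier whose gain matrix is the current gradient.

```python
def weighted_argmax(eta_x, gain):
    """Label j maximizing sum_i eta_x[i] * gain[i][j]; ties go to the larger label."""
    n = len(eta_x)
    scores = [sum(eta_x[i] * gain[i][j] for i in range(n)) for j in range(n)]
    best = max(scores)
    return max(j for j in range(n) if scores[j] == best)

def empirical_conf(y_true, y_pred, n):
    """C[i][j] = fraction of examples with true label i predicted as j."""
    m = len(y_true)
    C = [[0.0] * n for _ in range(n)]
    for y, p in zip(y_true, y_pred):
        C[y][p] += 1.0 / m
    return C

def psi(C, prior):
    """Illustrative smooth concave metric: 1 - mean_i (1 - recall_i)^2."""
    n = len(C)
    return 1.0 - sum((1.0 - C[i][i] / prior[i]) ** 2 for i in range(n)) / n

def grad_psi(C, prior):
    """Gradient of psi w.r.t. C, treating the priors (row sums) as constants."""
    n = len(C)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        G[i][i] = 2.0 * (1.0 - C[i][i] / prior[i]) / (n * prior[i])
    return G

def conditional_gradient(eta_hat, y_true, n, num_iters=20):
    """Return the final confusion matrix, the mixture of (weight, gain matrix)
    pairs defining a randomized classifier, and the empirical class priors.
    Every class is assumed to appear in y_true."""
    m = len(y_true)
    prior = [sum(1 for y in y_true if y == i) / m for i in range(n)]
    gain = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    C = empirical_conf(y_true, [weighted_argmax(e, gain) for e in eta_hat], n)
    mixture = [(1.0, gain)]
    for t in range(1, num_iters + 1):
        gain = grad_psi(C, prior)  # linear objective at the current iterate
        C_t = empirical_conf(y_true, [weighted_argmax(e, gain) for e in eta_hat], n)
        step = 2.0 / (t + 2)
        C = [[(1 - step) * C[i][j] + step * C_t[i][j] for j in range(n)]
             for i in range(n)]
        mixture = [(w * (1 - step), g) for w, g in mixture] + [(step, gain)]
    return C, mixture, prior

eta_hat = [[0.7, 0.2, 0.1], [0.7, 0.1, 0.2], [0.5, 0.4, 0.1],
           [0.5, 0.1, 0.4], [0.8, 0.1, 0.1]]
y_true = [0, 0, 1, 2, 0]
C, mixture, prior = conditional_gradient(eta_hat, y_true, 3)
print(psi(C, prior))
```

The iterate always remains a convex combination of confusion matrices of plug-in classifiers, so no explicit feasibility test is needed; the returned mixture weights describe a randomized classifier that picks one of the plug-in classifiers at random.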