Classification is a well-studied discrete prediction problem, and in this paper we are interested in risk bounds for classifiers. The classifiers we focus on are obtained through Empirical Risk Minimization (ERM; Vapnik, 2013). While the goal in classification is to minimize the misclassification probability (a.k.a. the expected 0-1 loss or the risk), minimizing the empirical risk (the 0-1 loss on a sample, as prescribed by ERM) can be computationally hard (Höffgen et al., 1995). It is therefore common to minimize a convex surrogate loss as a means to minimize the true risk, which is defined in terms of the non-convex 0-1 loss. ERM with the surrogate loss gives us approximate minimizers of the surrogate risk, i.e., the expected surrogate loss, so the first question that comes to mind is whether minimizers of the surrogate risk are also minimizers of the true risk, i.e., whether, ultimately, good classifiers can be obtained by ERM with a surrogate loss. When a surrogate loss enjoys this guarantee, we say that it is calibrated or Fisher-consistent. The question of calibration for different losses has been investigated by a number of authors (Zhang, 2004; Liu, 2007; Tewari and Bartlett, 2007; Reid and Williamson, 2009; Guruprasad and Agarwal, 2012; Ramaswamy et al., 2013; Calauzènes et al., 2013; Ramaswamy and Agarwal, 2016; Doğan et al., 2016). Rather than being concerned with just calibration, here we investigate how to obtain bounds for the true risk of (surrogate-risk) ERM classifiers. Thanks to the solid understanding of techniques to obtain risk bounds for empirical risk minimizers (Steinwart and Christmann, 2008; Koltchinskii, 2011), we can follow the approach of Steinwart (2007); Ávila Pires et al. (2013) and focus on techniques for converting surrogate risk bounds into true risk bounds (to which we will henceforth refer as “bound conversion”). As a bonus, having effective means to perform this conversion also enables us to answer calibration questions for surrogate losses.
We build on the works of Steinwart (2007); Ávila Pires et al. (2013), exploring the concept of calibration functions, which are an effective tool for bound conversion. In fact, the toolset developed by Steinwart (2007) fully constitutes an approach for bound conversion, at least in an abstract sense. This toolset generalizes techniques for the binary case that were introduced by Bartlett et al. (2006) and used to characterize bound conversion for a large family of popular surrogate losses. Unfortunately, notwithstanding their power, the techniques of Steinwart (2007) are too abstract for one to perform bound conversion for multiclass losses without a significant amount of effort directed to each specific loss, an effort concentrated in the calculation of the aforementioned calibration functions. In contrast, calibration functions for common choices of binary surrogate losses can be obtained almost immediately (see Theorem 3.5).
Our goal is, therefore, to simplify the process of calculating calibration functions in the multiclass case for various surrogate losses, and our main contribution is a generic way to “reduce” multiclass calibration functions to binary “calibration-like” functions that can be easily computed for specific surrogate loss choices.
We achieve our goal by designing a set of conditions that, when satisfied by a particular loss, yield a function that is essentially a calibration function for a binary loss, similar to the calibration function presented by Bartlett et al. (2006) for margin-based losses. As an advantage, we are able to easily generalize, to the multiclass case, a result by Bartlett et al. (2006) that gives improved calibration functions when the data distribution satisfies the Mammen-Tsybakov noise condition (Mammen et al., 1999; Boucheron et al., 2005; Bartlett et al., 2006).
Our analysis generalizes the work of Ávila Pires et al. (2013), who presented calibration functions for a family of multiclass classification losses introduced by Lee et al. (2004). While Ávila Pires et al. (2013) investigate a cost-sensitive setting, we restrict our considerations to the ordinary (cost-insensitive) classification problem, and we refine their results for this case.
While various multiclass surrogate losses have been proposed in the literature (see Table 3), many of them share a similar structure that allows our results to be widely applicable. In order to illustrate the application of our results, we perform case studies for our analysis. We verify the proposed conditions in order to easily obtain calibration functions for the loss of Lee et al. (2004), thus recovering the results of Ávila Pires et al. (2013) in the cost-insensitive setting. Also as case studies, we obtain novel calibration functions for the decoupled unconstrained background discrimination losses presented by Zhang (2004), the logistic regression loss (a special case of the coupled unconstrained background discrimination losses), and the one-versus-all loss (Rifkin and Klautau, 2004). Specific instantiations of the decoupled unconstrained background discrimination loss (including the one-versus-all loss) require verification of an additional condition that, we believe, is not harder to verify than a related binary calibration function is to derive. We verify this condition for some choices of unconstrained background discrimination losses. Our analysis does not cover the surrogate losses proposed by Weston and Watkins (1998); Zou et al. (2006); Beijbom et al. (2014), for which we believe a different analysis is required.
This work is structured as follows. In Section 2, we introduce the classification problem and some notation, and we discuss the conversion of surrogate risk bounds into true risk bounds. Section 3 presents a review of related work and introduces the core concepts that we use for the bound conversion. We follow with Section 4, where we introduce a general analysis that allows us to reduce multiclass calibration functions to binary calibration functions. Then, in Section 5, we perform “case studies” by looking at how the analysis works for specific families of surrogate losses. We conclude this work in Section 6, with a commentary on the strengths and limitations of our results, and a discussion of possible extensions of our work.
In classification we wish to find a function [Footnote 1: We will frequently omit well-understood technical details, such as measurability. In our discussions (but not the proofs), we will mention minimizers of lower-bounded functions that may not have a minimizer, e.g., the exponential function. In those cases, the considerations are easily extended to approximate minimizers that are arbitrarily close to the infimum.], called a classifier, achieving the smallest expected misclassification error, also known as the misclassification rate or risk
where and , respectively. The risk can also be written as the expected value of a loss, in this case the function , which is called the 0-1 loss. The goal of the classification problem can also be stated as minimizing the excess risk
whenever the Bayes-risk is bounded in absolute value.
What we defined as the classification problem is often referred to as multiclass classification. When there are only two classes, in particular, the classification problem is called binary classification. It is possible to define the classification problem in more general terms, to include cost-sensitive classification (Zhang, 2004; Steinwart, 2007; Ramaswamy et al., 2013; Ávila Pires et al., 2013); however, we will leave this direction aside in this work.
In the classification learning problem, the distribution is unknown and we are only given a finite, i.i.d. sample . Moreover, one typically fixes a set of classifiers, , called the hypothesis class, in which case the goal of the problem can be written as minimizing the -excess risk
Empirical risk minimization.
As shown by Höffgen et al. (1995); Ben-David et al. (2003); Feldman et al. (2012); Nguyen and Sanner (2013), computing empirical risk minimizers for the empirical risk in (2.2) is NP-hard for some commonly used hypothesis classes, so one often replaces the empirical risk with an empirical surrogate risk
where is a convex surrogate loss, is a set of scores and is a score function. One chooses the loss and the hypothesis class so that minimizing (2.3) over can be done efficiently. ERM with (2.3) as its objective will allow us to obtain guarantees (risk bounds) for the surrogate risk
While the true loss yields a value in {0, 1} when given a prediction and a class , the surrogate loss will yield a real number when given a -dimensional real vector, called a score, and a class . The set of scores is commonly chosen to be itself, or the space of sum-to-zero scores , where is the -dimensional vector of ones. A third common choice is the -dimensional simplex .
In order to properly have classifiers, we will transform scores into classes using the maximum selector defined by [Footnote 2: If the argmax is not a singleton, we pick an arbitrary element from it. Our results will be worst-case when it comes to ties, so that tie-breaking in the maximum selector is not an issue and can be done arbitrarily.]
It is easy to show that any classifier can be obtained by composing a score function with the maximum selector, so using score functions does not inherently limit solutions for the classification problem.
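The composition argument can be illustrated with a minimal Python sketch. It assumes classes are indexed from 0 and breaks ties by taking the lowest index (the text allows any tie-breaking rule); the names `maximum_selector` and `one_hot_score` are hypothetical helpers introduced only for this illustration.

```python
from typing import List, Sequence


def maximum_selector(score: Sequence[float]) -> int:
    """Return the index of a largest coordinate of `score`.

    Ties are broken by taking the lowest index, which is one arbitrary
    choice among those the text permits.
    """
    best = 0
    for k, value in enumerate(score):
        if value > score[best]:
            best = k
    return best


def one_hot_score(label: int, num_classes: int) -> List[float]:
    """Score vector whose unique maximum sits at coordinate `label`."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


# Any classifier g can be written as maximum_selector composed with the
# score function x -> one_hot_score(g(x), K); here we check the round trip.
recovered = [maximum_selector(one_hot_score(y, 4)) for y in range(4)]
```

Under these assumptions, `recovered` equals the list of original labels, which is the sense in which working with score functions does not restrict the set of attainable classifiers.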
In the binary case, a common loss choice is a margin loss (Steinwart and Christmann, 2008, Section 2.3) with score set and convex transformation function . Table 1 contains some common choices of (see also Steinwart and Christmann, 2008, Section 2.3; Hastie et al., 2009, Chapter 4), and it includes non-convex choices for completeness.
Applying ERM to some of the margin losses with from Table 1 and an appropriate choice of in fact gives a correspondence to successful binary classification methods. For example, SVMs use the hinge loss, ridge regression uses the squared loss, logistic regression uses the logistic loss, and AdaBoost uses the exponential loss (see Table 21.1, and Sections 4.4.1 and 10.4 of Hastie et al., 2009).
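As a rough illustration of margin losses, the sketch below writes four standard transformation functions in their usual textbook forms and evaluates the margin loss as the transformation applied to the product of label and score. The exact parametrization in Table 1 may differ, so the formulas here are assumptions; labels are taken in {-1, +1} and scores are real numbers.

```python
import math


def hinge(s: float) -> float:
    """Hinge transformation, associated with SVMs."""
    return max(0.0, 1.0 - s)


def squared(s: float) -> float:
    """Squared transformation, associated with ridge regression."""
    return (1.0 - s) ** 2


def logistic(s: float) -> float:
    """Logistic transformation (natural logarithm)."""
    return math.log1p(math.exp(-s))


def exponential(s: float) -> float:
    """Exponential transformation, associated with AdaBoost."""
    return math.exp(-s)


def margin_loss(phi, score: float, label: int) -> float:
    """Binary margin loss phi(label * score), with label in {-1, +1}."""
    return phi(label * score)


# A correct but low-margin prediction is still penalized by the hinge loss,
# and a wrong prediction with the same score is penalized more.
loss_correct = margin_loss(hinge, 0.5, +1)
loss_wrong = margin_loss(hinge, 0.5, -1)
```

With these definitions, `loss_correct` is 0.5 and `loss_wrong` is 1.5, reflecting that the loss depends on the score only through the margin.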
The convexity of (precisely, the convexity of for all ) and the choice of will ensure that the empirical risk can be minimized efficiently, which satisfactorily addresses the computational side to constructing classifiers with ERM and surrogate losses. On the statistical side, we are concerned with obtaining true risk bounds for these classifiers.
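To make the computational side concrete, here is a small sketch of ERM with a convex surrogate, assuming the logistic margin loss, a one-dimensional linear score function, and plain gradient descent. The toy sample, step size, and iteration count are illustrative assumptions and not taken from the paper.

```python
import math

# Toy, linearly separable sample of (x, y) pairs with labels in {-1, +1}.
SAMPLE = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]


def empirical_surrogate_risk(w, b, sample):
    """Average logistic margin loss of the score function x -> w * x + b."""
    total = sum(math.log1p(math.exp(-y * (w * x + b))) for x, y in sample)
    return total / len(sample)


def empirical_zero_one_risk(w, b, sample):
    """Fraction of sample points misclassified by sign(w * x + b).

    A score of exactly zero is counted as an error (worst case on ties).
    """
    errors = sum(1.0 for x, y in sample if y * (w * x + b) <= 0)
    return errors / len(sample)


def fit(sample, steps=200, lr=0.5):
    """Gradient descent on the (convex) empirical surrogate risk."""
    w, b = 0.0, 0.0
    n = len(sample)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in sample:
            margin = y * (w * x + b)
            # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
            coeff = -y / (1.0 + math.exp(margin))
            grad_w += coeff * x / n
            grad_b += coeff / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w_hat, b_hat = fit(SAMPLE)
```

On this separable toy sample the fitted score function drives the empirical surrogate risk below its value at the zero score function while also attaining zero empirical 0-1 risk; whether such behavior transfers to the true risk in general is exactly the question that bound conversion addresses.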
We know that under certain conditions ERM will give us, with high probability, an approximate surrogate risk minimizer over (Steinwart and Christmann, 2008; Koltchinskii, 2011), which will be a bound on the surrogate risk of a score function. We have a straightforward way to convert score functions into classifiers, so what remains is to show that ERM on the empirical surrogate risk will also give us, with high-probability, an approximate true risk minimizer.
While our work is primarily motivated by ERM, our main concern is bound conversion, and we are able to make statements about any learning algorithm for which a surrogate risk bound is available. A learning algorithm is a function mapping samples to hypotheses [Footnote 3: Extensions to randomized algorithms, mapping samples to distributions over score functions, are straightforward.]. For example, learning algorithms following the ERM approach satisfy
when given a sample .
As we are not concerned with specifics of bounding the surrogate risk, we will leave this problem aside in our discussion of bound conversion. Assumption 1 establishes that classifiers constructed by a given learning algorithm from a random sample of size have surrogate risk (conditioned on the sample) bounded by with probability at least . The surrogate risk of a random score function conditioned on a sample taking values in is defined as
Similarly, for a random classifier ,
Assumption 1 (Surrogate risk bound).
Given a learning algorithm and a hypothesis class , there exists a function s.t. for every and the following holds w.p. at least :
where is a -i.i.d.
Steinwart and Christmann (2008); Koltchinskii (2011) discuss techniques for obtaining bounds that satisfy Assumption 1. Alternatively, we can replace Assumption 1 with an assumption that an “expectation bound” is available, i.e., that (2.5) holds in expectation (where the expectation is taken over the sample ). In this case, using bound conversion we obtain expectation bounds for the true risk. Assumption 1, in addition to a -excess surrogate risk bound, also gives an excess surrogate risk bound, since (2.5) implies that
is the approximation error w.r.t. the surrogate risk, which can only be made small with appropriate choices of . From a non-parametric point of view, one should control by trading it off with , so as to obtain an appropriate rate of convergence for the excess surrogate risk (Steinwart and Christmann, 2008, p.8).
There are different ways to convert surrogate risk bounds into true risk bounds. The following well-known result can be applied if the surrogate loss upper-bounds the - loss.
Theorem 2.1 (Boucheron et al. 2005).
Given and satisfying Assumption 1, if (, ), then, for all , with probability at least , we have
A limitation of Theorem 2.1 is that the resulting true risk bounds can be loose (Lin, 2004). For example, take , for , , and (the hinge loss). Then , but . Besides the undesirable factor of , even if as , we cannot guarantee from the bound in Theorem 2.1 that we get an optimal classifier with probability at least .
Zhang (2004); Lin (2004); Chen and Sun (2006); Bartlett et al. (2006); Steinwart (2007) provide tighter guarantees for the true risk by using what came to be known as calibration functions. The following theorem can be inferred from the more general theoretical framework proposed by Steinwart (2007).
Theorem 2.2 (Steinwart 2007).
Given a surrogate loss , assume that there exists a positive function s.t., for every , and , with if
Assume also that is measurable. Then, given and satisfying Assumption 1, for all , w.p. at least , we have
The function in Theorem 2.2 is called a calibration function (Steinwart, 2007). Steinwart (2007) presents a general, extensive discussion on calibration functions, a few of which are reported in Section 3.4.2. In order to properly obtain Theorem 2.2 even if is not invertible, we can use .
Theorem 2.2 states that if the excess surrogate risk goes to zero at a particular rate, we can also get a rate at which the excess true risk goes to zero. Sometimes, we may know that a calibration function exists for without knowing the calibration function itself. In this case, we know that the excess surrogate risk goes to zero iff the excess true risk goes to zero, so surrogate risk minimizers are also true risk minimizers. Conversely, if some surrogate risk minimizer is not a true risk minimizer, then no calibration function can exist. The existence of a calibration function is equivalent to Fisher-consistency (Liu, 2007) or classification-calibration (Steinwart, 2007; Tewari and Bartlett, 2007); we will also call this property consistency and calibration.
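The rate conversion can be sketched numerically. The code below assumes a non-decreasing calibration function on [0, 1] that vanishes at zero, and inverts it through the generalized inverse sup{eps : delta(eps) <= t}, computed by bisection. This particular form of the inverse, and the two example calibration functions (linear for the hinge loss, quadratic for the squared loss, as in Bartlett et al., 2006), are assumptions made for illustration and are not quoted from the theorem.

```python
def generalized_inverse(delta, t, tol=1e-12):
    """Return sup{eps in [0, 1] : delta(eps) <= t} by bisection.

    Assumes `delta` is non-decreasing on [0, 1] with delta(0) <= t.
    """
    lo, hi = 0.0, 1.0
    if delta(hi) <= t:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delta(mid) <= t:
            lo = mid  # mid is still feasible, move the lower end up
        else:
            hi = mid  # mid is infeasible, move the upper end down
    return lo


def delta_linear(eps):
    """Linear calibration function, as for the hinge loss."""
    return eps


def delta_quadratic(eps):
    """Quadratic calibration function, as for the squared loss."""
    return eps * eps


surrogate_excess = 0.04  # hypothetical bound on the excess surrogate risk
bound_linear = generalized_inverse(delta_linear, surrogate_excess)
bound_quadratic = generalized_inverse(delta_quadratic, surrogate_excess)
```

For the same surrogate bound of 0.04, the linear calibration function gives a true-risk bound of about 0.04 and the quadratic one gives about 0.2, which shows how the shape of the calibration function affects the converted rate.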
A limitation of Theorem 2.2 is the lack of elegant bounds for the -excess true risk when the Bayes optimal classifier cannot be obtained from a hypothesis in . From a parametric point of view, we want to get true risk bounds that mirror our surrogate risk bounds, that is, bounds on the -excess true risk given bounds on the -excess surrogate risk. This means that in a parametric setting we are concerned about using Assumption 1, to obtain that for all , w.p. at least ,
Theorem 2.2, however, implies that for all w.p. at least , we have
Long and Servedio (2013) have been concerned with guarantees of the above type in specific settings, but we will work with bounds that have the form of (2.8). Extending calibration functions to the parametric setting (to the so-called -calibration functions) will be left as future work. Bounds with the form of (2.8) are, however, still informative in the non-parametric setting, when the approximation error can be controlled or is zero.
3 The calibration toolset
In this section we discuss calibration functions in more detail, surveying existing results from the literature. Section 3.1 is an instantiation of the theoretical framework of Steinwart (2007) for the classification problem, where we present the so-called maximum calibration function, which is an important type of calibration function for our analysis. In Section 3.3 we discuss existing Fisher-consistency results and calibration functions for binary classification, and in Section 3.4 we have a discussion of the corresponding results for multiclass classification.
3.1 The maximum calibration function
Steinwart (2007) defined a function that depends on the given surrogate loss and constitutes a key notion for calibration functions. is special because no calibration function for the given surrogate loss is larger than . Moreover, if the loss is calibrated, then is a calibration function (hence the name maximum calibration function). As a consequence (see Theorem 3.3), the surrogate loss is calibrated iff for all , i.e., iff is a calibration function. Conveniently, any positive lower bound to the maximum calibration function is also a calibration function, which is a useful fact for understanding and calculating calibration functions for specific losses.
In order to define , we must define three useful concepts (see Definition 3.1): The set of scores in whose maximum coordinate is (), the set of scores that give -sub-optimal class predictions (), and the set of -sub-optimal indices with maximum probability (). A score is -sub-optimal for a given if . On the other hand, a score is -optimal for a given if .
Given a set of scores let, for and
We will override notation and use to denote the pointwise surrogate risk for a surrogate loss with by
and we will write when the choice of is clear from context. The distinction between the surrogate risk and the pointwise surrogate risk can be made by the first argument. For now, we will be concerned with the pointwise surrogate risk, for which we will define calibration functions. In Section 3.2, we will discuss how to use these calibration functions to obtain calibration functions per se (in the sense of Theorem 2.2), for the surrogate risk.
In Definition 3.2, we present two important functions introduced by Steinwart (2007): and . The former is the difference between the smallest surrogate risk of any -suboptimal score and the optimal surrogate risk. If any score has surrogate risk closer to the optimal surrogate risk than , the score must be -optimal w.r.t. . Confronting this fact with Theorem 2.2, we see that if is positive for all , then it is a calibration function. It is, however, a calibration function only for the pointwise surrogate risk defined in terms of , so in order to define a that is a calibration function iff is a calibration function for all , it is natural to take as the infimum of over all . If is a calibration function, it is called the maximum calibration function.
Given a set of scores and a surrogate loss , let
If for all , then it is called the maximum calibration function.
Theorem 3.3 (Steinwart 2007).
is always non-negative, and no calibration function for the choice of and surrogate loss (and optionally of ) is larger than the corresponding .
As mentioned, Theorem 3.3 and non-decreasingness of imply that for some iff the corresponding surrogate loss is not calibrated.
3.2 From to risk bounds
By definition of , we have that for all , and if
and if (3.3) holds then we also have
If for all and all , then (3.4) holds for all , otherwise the guarantee is vacuous. Moreover, the guarantee breaks down if the infimum in (3.2) is unbounded, so we will assume otherwise in Assumption 2.
Given a surrogate loss , we have
If the surrogate loss satisfies Assumption 2, then any positive function that lower-bounds will allow us to obtain a similar guarantee implying (3.4). Therefore, the strategy for calculating calibration functions will be to find a positive function that lower-bounds for all , or a positive . Once we have one of these, we can obtain Theorem 2.2 or a similar result by taking the expectation of (3.2) or (3.3) with , the conditional probability of given , defined as almost everywhere (a.e.) for all [Footnote 4: For this argument to work, the surrogate risk minimizer must be measurable, which we assume to be the case. Formally, following Steinwart (2007), we assume that for every , there exists a measurable function s.t. a.e.]. This integration step is used by Zhang (2004); Chen and Sun (2006); Steinwart (2007) to obtain risk bounds from calibration functions that they define. We now proceed to results that yield calibration functions for classification, first binary, then multiclass.
3.3 Calibration functions in binary classification
Bartlett et al. (2006) characterized for margin losses and also characterized the conditions under which is a calibration function for convex, lower-bounded [Footnote 5: is lower-bounded if .]. In Definition 3.4, we introduce , which Bartlett et al. (2006) have shown to be equal to in binary classification (see Theorem 3.5 ahead). By comparing from Definition 3.2 and from Definition 3.4, and taking Theorem 3.5 into consideration, we can point out a few facts that will shape the conditions that we design to reduce multiclass classification calibration functions, namely, to binary classification calibration-like functions, viz. . For each , the worst-case distribution is , i.e., . Moreover, for every , . In Theorem 3.5 and henceforth, we will denote the subdifferential of at by .
Consider a surrogate loss with and . Let
Theorem 3.5 (Bartlett et al. 2006).
Assume , and where is convex and lower-bounded. If then
Moreover, is a calibration function iff has a unique, positive derivative at zero (i.e., ). Finally, if is convex with a unique, positive derivative at zero, then is convex.
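A numerical sketch can help when checking binary calibration functions for a given transformation function. It assumes the form used by Bartlett et al. (2006) for convex transformations: the value of the transformation at zero minus the infimum over scores s of the pointwise surrogate risk (1 + eps)/2 * phi(s) + (1 - eps)/2 * phi(-s). The infimum is approximated by ternary search over a bounded interval, which relies on convexity; the interval width and iteration count are arbitrary illustrative choices.

```python
import math


def hinge(s):
    return max(0.0, 1.0 - s)


def squared(s):
    return (1.0 - s) ** 2


def exponential(s):
    return math.exp(-s)


def binary_calibration(phi, eps, bound=20.0, iters=200):
    """Approximate phi(0) - inf_s [eta*phi(s) + (1-eta)*phi(-s)], eta=(1+eps)/2.

    Uses ternary search on [-bound, bound], valid for convex phi. If the
    infimum is not attained inside the interval, the result is only an
    approximation from below.
    """
    eta = (1.0 + eps) / 2.0

    def pointwise_risk(s):
        return eta * phi(s) + (1.0 - eta) * phi(-s)

    lo, hi = -bound, bound
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if pointwise_risk(m1) <= pointwise_risk(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return phi(0.0) - pointwise_risk(0.5 * (lo + hi))


values_at_half = {
    "hinge": binary_calibration(hinge, 0.5),
    "squared": binary_calibration(squared, 0.5),
    "exponential": binary_calibration(exponential, 0.5),
}
```

At eps = 0.5 the sketch reproduces the closed forms known from Bartlett et al. (2006): eps for the hinge loss, eps squared for the squared loss, and 1 - sqrt(1 - eps^2) for the exponential loss.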
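[Table of transformation functions and their calibration functions; only the row label “Truncated square (squared hinge)” is recoverable.]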
The functions and do not have a calibration function because they violate lower-boundedness. The function with does not have unique derivative at zero, so, by Theorem 3.5, it is not calibrated. The calibration function for the margin loss based on is not reported by Steinwart (2007) but is evident from a result shown by Zou et al. (2006) and later, independently, by Ávila Pires et al. (2013). The multiclass version of the result is given later in this text as Lemma 5.18. Informally, in the binary case Lemma 5.18 implies that if and has a minimum , then is the same for with and for with . Thus, we can combine Lemma 5.18 with Theorem 3.5 to obtain that the for is the same when and (since they are equal in ). Also, and share the same .
It would seem that transformation functions leading to superlinear , such as the squared transformation function, lead to true risk bounds with slower rates than transformation functions associated with linear , such as the hinge loss. That would be true if the rates of the surrogate risk bounds were asymptotically the same for all the losses, which is not always the case. As shown by Mammen et al. (1999) (see also Boucheron et al. (2005); Bartlett et al. (2006)) fast rates can be obtained under the following low-noise condition, known as the Mammen-Tsybakov noise condition, which states that there exists and s.t. for every classifier
where is the Bayes-optimal classifier. We see that (3.5) interpolates between the noiseless case () and the case where no assumption about the noise is made (, ). Under the Mammen-Tsybakov noise condition, it is possible to get faster rates, as shown by Bartlett et al. (2006, Theorems 3 and 5). Theorem 3 of Bartlett et al. (2006), presented here as Theorem 3.6, improves over Theorem 2.2 by using the Mammen-Tsybakov noise condition. We can see from Theorem 3.6 that the right-hand side of (3.6) with becomes , which gives a fast rate if and a “slow” rate with . As shown by Bartlett et al. (2006, Theorem 5), fast rates for the true risk can be obtained by combining (3.6) and fast rates for surrogate risk, which can be obtained, for example, if scores are bounded in a range , and if is strictly convex and Lipschitz in the interval .
3.4 Calibration functions and multiclass classification
The panorama of calibration and calibration functions for multiclass classification losses is significantly more dispersed than in binary classification, due to the many existing generalizations of the binary margin loss, some of which are collected in Table 3.
For , various choices of and are possible (see Zhang, 2004), and we compute calibration functions for some of these in Section 5. The loss requires (strictly) increasing, and corresponds to when . The surrogate also generalizes different entries in Table 3, if we are flexible about . It is easy to see that with , and , corresponds to ; with , and it corresponds to , and with , and it corresponds to . It can also be seen that with , , and , the logistic regression loss (Zhang, 2004), is equivalent to with , and (see Proposition 5.8).
Although , , , , and all reduce to a margin loss in the binary case, they lead to classifiers with substantially different behaviors in terms of calibration and calibration functions, as we will see next.
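The reduction to a margin loss in the binary case can be checked on a few multiclass surrogates. The sketch below uses commonly stated forms of three losses: the loss of Lee et al. (2004) as a sum of phi(-s_k) over the incorrect classes (intended for sum-to-zero scores), the one-versus-all loss as phi(s_y) plus that same sum, and the logistic regression loss as the negative log-softmax. These formulas are assumptions on my part and may differ in notation or scaling from the entries of Table 3; classes are indexed from 0.

```python
import math


def hinge(s):
    return max(0.0, 1.0 - s)


def lee_loss(phi, score, label):
    """Sum of phi(-s_k) over classes k other than the true label."""
    return sum(phi(-s) for k, s in enumerate(score) if k != label)


def one_versus_all_loss(phi, score, label):
    """phi on the true class score plus phi(-s_k) on every other class."""
    rest = sum(phi(-s) for k, s in enumerate(score) if k != label)
    return phi(score[label]) + rest


def logistic_regression_loss(score, label):
    """Negative log-softmax of the true class, computed stably."""
    top = max(score)
    log_sum = top + math.log(sum(math.exp(s - top) for s in score))
    return -score[label] + log_sum


# Binary case with a sum-to-zero score (s, -s) and true class 0: each loss
# depends on the score only through the margin s.
s = 0.5
binary_score = [s, -s]
lee_value = lee_loss(hinge, binary_score, 0)
ova_value = one_versus_all_loss(hinge, binary_score, 0)
logistic_value = logistic_regression_loss(binary_score, 0)
```

Under these assumed forms, the Lee et al. loss equals phi(s), the one-versus-all loss equals 2 * phi(s), and the logistic regression loss equals log(1 + exp(-2s)), so all three are margin losses of s in the two-class case even though they differ once there are more than two classes.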
We will first provide an overview of calibration results for convex surrogate losses, with calibration functions presented later. By reduction to the binary case, we get from Theorem 3.5 that is a necessary condition for calibration of , , , , and . Any condition presented ahead for the consistency of these losses will be, of course, in addition to , and the assumption of Theorem 3.5 that is convex and lower-bounded.
Lee et al. (2004); Zhang (2004); Liu (2007); Tewari and Bartlett (2007) showed that is consistent when is differentiable. Tewari and Bartlett (2007) provided a counterexample of a with a kink that is not calibrated, similar to Proposition 3.7.
The loss with is not classification-calibrated.
is not consistent in general, but it is consistent for distributions there (Zhang, 2004; Liu, 2007; Tewari and Bartlett, 2007). Zhang (2004) defines a property called order-preservation, which is sufficient for calibration when .
Definition 3.8 (Order-preservation).
A loss is order-preserving if for all , has a minimizer s.t. for every , .
is inconsistent in general (Tewari and Bartlett, 2007), but consistent for when allows them to be order-preserving (Zhang, 2004). In particular, is order-preserving whenever is twice-differentiable with and for all (Zou et al., 2006).
With , the losses and are neither order-preserving nor calibrated (Zhang, 2004; Liu, 2007; Tewari and Bartlett, 2007). Informally, any minimizer of any of these two losses with any convex with satisfies . In the binary case, this is not a problem because . In the multiclass case, however, with and , we have whenever , so we can only “find” , but not , which is what we are interested in. Liu (2007) modified with to obtain the calibrated loss . Interestingly, if we let , we can show that with and with and have the same (see Lemma 5.18).
When and is differentiable we can use KKT conditions (Boyd and Vandenberghe, 2004) to get conditions for consistency of some losses, as done by Zhang (2004) for many of the losses in Table 3. For , for example, if the optimizer exists, it must satisfy, for all ,
If, in particular, is increasing and the function mapping to the zero of is well-defined and increasing for all , we get that is order-preserving and, thus, calibrated (Zhang, 2004). The one-versus-all loss () is a special case of , so under similar conditions it is calibrated. Zhang (2004) showed that with continuously-differentiable and increasing is also order-preserving and thus consistent (see also Tewari and Bartlett, 2007, for an alternative proof). We can also see that is order-preserving and thus calibrated under the same conditions on as , since is (strictly) increasing (cf. Zhang (2004) and Beijbom et al. (2014)).
3.4.2 Calibration functions
Calibration functions have been calculated for (Zhang, 2004) and for (Chen and Sun, 2006; Ávila Pires et al., 2013), with specific choices of . The first result we present is Theorem 3.9, due to Zhang (2004). The function defined in Theorem 3.9 is the optimal (binary) surrogate risk when and , and the condition in (3.7) corresponds to strong concavity (Nesterov, 2013, Definition 2.1.2 and Theorem 2.1.9). In particular, (as pointed out by Zhang (2004)) if for all and some , then (3.7) is satisfied with . We will recover Theorem 3.9 as a special case of our Lemma 5.4, at which the argument used to prove Theorem 3.9 will be evident.
Theorem 3.9 (Zhang 2004).
Consider with and convex, lower-bounded, differentiable s.t. for all . Given , let
is concave, and if there is a s.t. for all
then, for all and all s.t. , we also have
Chen and Sun (2006) derive calibration functions for with convex, differentiable, increasing, and satisfying and , based on an assumption that can be shown to imply the existence of a calibration function. Ávila Pires et al. (2013) proved a result for more general in a cost-sensitive setting, which we present as Theorem 3.10. Theorem 3.10 has a form closer to Theorem 3.5, with two notable differences: The calibration function depends on both and , rather than just , and the form for takes an infimum over . We can use Theorem 3.10 to obtain a calibration function in terms of alone, but we will take a slightly different route and arrive at such a result in Section 5.3. If the infimum over in (3.8) is taken at , then Theorem 3.10 becomes a generalization of Theorem 3.5 (when is taken as the multiclass generalization of the margin loss). The infimum is taken for , , and (see Section 5.3).
Theorem 3.10 (Ávila Pires et al. 2013).
Consider with convex and let