A loss function is the means by which a learning algorithm’s performance is judged. A binary loss function is a loss for a supervised prediction problem where there are two possible labels associated with the examples. A composite loss is the composition of a proper loss (defined below) and a link function (also defined below). In this paper we study composite binary losses and develop a number of new characterisation results.
Informally, proper losses are well-calibrated losses for class probability estimation, that is, for the problem of not only predicting a binary classification label but also providing an estimate of the probability that an example will have a positive label. Link functions are often used to map the outputs of a predictor to the interval $[0,1]$ so that they can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probability estimates (Platt, 2000; Gneiting and Raftery, 2007; Cohen and Goldszmidt, 2004) and in understanding the implications of requiring that loss functions provide good probability estimates (Bartlett and Tewari, 2007).
Much previous work in the machine learning literature has focussed on margin losses, which intrinsically treat positive and negative classes symmetrically. However it is now well understood how important it is to be able to deal with the non-symmetric case (Bach et al., 2006; Elkan, 2001; Beygelzimer et al., 2008; Buja et al., 2005; Provost and Fawcett, 2001). A key goal of the present work is to consider composite losses in the general (non-symmetric) situation.
Having the flexibility to choose a loss function is important in order to “tailor” the solution to a machine learning problem; confer (Hand, 1994; Hand and Vinciotti, 2003; Buja et al., 2005). Understanding the structure of the set of loss functions and having natural parametrisations of them is useful for this purpose. Even when one is using a loss as a surrogate for the loss one would ideally like to minimise, it is helpful to have an easy to use parametrisation — see the discussion of “surrogate tuning” in the Conclusion.
The paper is structured as follows. In §2 we introduce the notions of a loss and of the conditional and full risk, which we use extensively throughout the paper.
In §3 we introduce losses for Class Probability Estimation (CPE), define some technical properties of them, and present some structural results. We introduce and exploit Savage’s characterisation of proper losses and use it to characterise proper symmetric CPE-losses.
In §4 we define composite losses formally and characterise when a loss is a proper composite loss in terms of its partial losses. We introduce a natural and intrinsic parametrisation of proper composite losses and characterise when a margin loss can be a proper composite loss. We also show the relationship between regret and Bregman divergences for general composite losses.
In §6, motivated by the question of which is the best surrogate loss, we characterise when a proper composite loss is convex in terms of the natural parametrisation of such losses.
In §7 we study surrogate losses making use of some of the earlier material in the paper. A surrogate loss function is a loss function which is not exactly what one wishes to minimise but is easier to work with algorithmically. We define a well founded notion of “best” surrogate loss and show that some convex surrogate losses are incommensurable on some problems. We also study other notions of “best” and explicitly determine the surrogate loss that has the best surrogate regret bound in a certain sense.
Finally, in §8 we draw some more general conclusions.
Appendix C builds upon some of the results in the main paper and presents some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and shows that all convex proper losses are non-robust to misclassification noise.
2 Losses and Risks
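We write $x \wedge y := \min(x, y)$, and $[\![p]\!] = 1$ if $p$ is true and $[\![p]\!] = 0$ otherwise (this is the Iverson bracket notation as recommended by Knuth (1992)). The generalised function $\delta(\cdot)$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$ when $f$ is continuous at $0$ and $a < 0 < b$. Random variables are written in sans-serif font: $\mathsf{X}$, $\mathsf{Y}$.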
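Given a set of labels $\mathcal{Y} := \{-1, 1\}$ and a set of prediction values $\mathcal{V}$ we will say a loss is any function $\ell : \mathcal{Y} \times \mathcal{V} \to [0, \infty)$. (Restricting the output of a loss to $[0, \infty)$ is equivalent to assuming the loss has a lower bound and then translating its output.) We interpret such a loss as giving a penalty $\ell(y, v)$ when predicting the value $v$ when an observed label is $y$. We can always write an arbitrary loss in terms of its partial losses $\ell_1 := \ell(1, \cdot)$ and $\ell_{-1} := \ell(-1, \cdot)$ using
$$\ell(y, v) = [\![y = 1]\!]\,\ell_1(v) + [\![y = -1]\!]\,\ell_{-1}(v). \tag{1}$$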
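Our definition of a loss function covers all commonly used margin losses (i.e. those which can be expressed as $\ell(y, v) = \phi(yv)$ for some function $\phi : \mathbb{R} \to [0, \infty)$) such as the 0-1 loss $\ell(y, v) = [\![yv < 0]\!]$, the hinge loss $\ell(y, v) = \max(1 - yv, 0)$, the logistic loss $\ell(y, v) = \log(1 + e^{-yv})$, and the exponential loss $\ell(y, v) = e^{-yv}$ commonly used in boosting. It also covers class probability estimation losses where the predicted values $\hat\eta \in \mathcal{V} = [0, 1]$ are directly interpreted as probability estimates. (These are known as scoring rules in the statistical literature (Gneiting and Raftery, 2007).) We will use $\hat\eta$ instead of $v$ as an argument to indicate losses for class probability estimation and use the shorthand CPE losses to distinguish them from general losses. For example, square loss has partial losses $\ell_{-1}(\hat\eta) = \hat\eta^2$ and $\ell_1(\hat\eta) = (1 - \hat\eta)^2$, the log loss $\ell_{-1}(\hat\eta) = -\log(1 - \hat\eta)$ and $\ell_1(\hat\eta) = -\log(\hat\eta)$, and the family of cost-weighted misclassification losses parametrised by $c \in (0, 1)$ is given by
$$\ell_c(-1, \hat\eta) = c\,[\![\hat\eta \ge c]\!] \quad\text{and}\quad \ell_c(1, \hat\eta) = (1 - c)\,[\![\hat\eta < c]\!]. \tag{2}$$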
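The losses named above can be written down directly as functions of a label and a prediction. The following is a minimal sketch, assuming labels in {-1, +1} and the standard textbook forms of these losses; the function names are illustrative and not taken from the paper.

```python
import math

def square_loss(y, eta_hat):
    """Square loss on a probability estimate eta_hat in [0, 1]."""
    return (1.0 - eta_hat) ** 2 if y == 1 else eta_hat ** 2

def log_loss(y, eta_hat):
    """Log loss; infinite when the estimate rules out the observed label."""
    p = eta_hat if y == 1 else 1.0 - eta_hat
    return math.inf if p == 0.0 else -math.log(p)

def cost_weighted_loss(c):
    """Cost-weighted misclassification loss with threshold c in (0, 1)."""
    def loss(y, eta_hat):
        if y == 1:
            return (1.0 - c) if eta_hat < c else 0.0
        return c if eta_hat >= c else 0.0
    return loss

def margin_loss(phi):
    """Turn a function phi of the margin y * v into a loss on (y, v)."""
    return lambda y, v: phi(y * v)

hinge = margin_loss(lambda z: max(1.0 - z, 0.0))
exponential = margin_loss(lambda z: math.exp(-z))
```

Each CPE loss takes a probability estimate, whereas the margin losses take an arbitrary real-valued prediction; both fit the same `loss(y, prediction)` shape.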
2.1 Conditional and Full Risks
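Suppose we have random examples $\mathsf{X}$ with associated labels $\mathsf{Y} \in \{-1, 1\}$.
The joint distribution of $(\mathsf{X}, \mathsf{Y})$ is denoted $\mathbb{P}$ and the marginal distribution of $\mathsf{X}$ is denoted $M$. Let the observation conditional density be $\eta(x) := \Pr(\mathsf{Y} = 1 \mid \mathsf{X} = x)$. Thus one can specify an experiment by either $\mathbb{P}$ or $(\eta, M)$.
If $\eta \in [0, 1]$ is the probability of observing the label $y = 1$, the point-wise risk (or conditional risk) of the estimate $v \in \mathcal{V}$ is defined as the $\eta$-average of the point-wise loss for $v$:
$$L(\eta, v) := \mathbb{E}_{\mathsf{Y} \sim \eta}[\ell(\mathsf{Y}, v)] = \eta\,\ell_1(v) + (1 - \eta)\,\ell_{-1}(v).$$
Here $\mathsf{Y} \sim \eta$ is a shorthand for labels being drawn from a Bernoulli distribution with parameter $\eta$. When $\eta : \mathcal{X} \to [0, 1]$ is an observation-conditional density, taking the $M$-average of the point-wise risk gives the (full) risk of the estimator $v$, now interpreted as a function $v : \mathcal{X} \to \mathcal{V}$:
$$\mathbb{L}(\eta, v, M) := \mathbb{E}_{\mathsf{X} \sim M}[L(\eta(\mathsf{X}), v(\mathsf{X}))].$$
We sometimes write $\mathbb{L}(v, \mathbb{P})$ for $\mathbb{L}(\eta, v, M)$ where $(\eta, M)$ corresponds to the joint distribution $\mathbb{P}$. We write $\ell$, $L$ and $\mathbb{L}$ for the loss, point-wise and full risk throughout this paper. The Bayes risk is the minimal achievable value of the risk and is denoted
$$\underline{\mathbb{L}}(\eta, M) := \inf_{v \in \mathcal{V}^{\mathcal{X}}} \mathbb{L}(\eta, v, M) = \mathbb{E}_{\mathsf{X} \sim M}[\underline{L}(\eta(\mathsf{X}))],$$
where $[0, 1] \ni \eta \mapsto \underline{L}(\eta) := \inf_{v \in \mathcal{V}} L(\eta, v)$ is the point-wise or conditional Bayes risk.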
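To make the three quantities concrete, the sketch below computes the conditional risk as the Bernoulli average of the loss, the full risk as an average over a finitely supported marginal, and an approximation of the conditional Bayes risk by minimising over a grid. The two-point input space, the function `eta_fn` and the grid resolution are assumptions chosen purely for illustration.

```python
def conditional_risk(loss, eta, v):
    """L(eta, v): expected loss of prediction v when P(Y = 1) = eta."""
    return eta * loss(1, v) + (1.0 - eta) * loss(-1, v)

def full_risk(loss, eta_fn, v_fn, points, masses):
    """Risk of the predictor v_fn under a finitely supported marginal M."""
    return sum(m * conditional_risk(loss, eta_fn(x), v_fn(x))
               for x, m in zip(points, masses))

def bayes_risk_on_grid(loss, eta, grid):
    """Approximate the conditional Bayes risk by minimising over a grid."""
    return min(conditional_risk(loss, eta, v) for v in grid)

def square_loss(y, eta_hat):
    return (1.0 - eta_hat) ** 2 if y == 1 else eta_hat ** 2

grid = [i / 1000 for i in range(1001)]
eta_fn = lambda x: 0.2 if x == 0 else 0.9   # hypothetical eta(x) on X = {0, 1}
risk_at_truth = full_risk(square_loss, eta_fn, eta_fn, [0, 1], [0.5, 0.5])
bayes_at_03 = bayes_risk_on_grid(square_loss, 0.3, grid)
```

For square loss the conditional Bayes risk is $\eta(1-\eta)$, so `bayes_at_03` should be close to 0.21.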
There has been increasing awareness of the importance of the conditional Bayes risk curve $\underline{L}(\eta)$, also known as “generalized entropy” (Grünwald and Dawid, 2004), in the analysis of losses for probability estimation (Kalnishkan et al., 2004, 2007; Abernethy et al., 2009; Masnadi-Shirazi and Vasconcelos, 2009). Below we will see how it is effectively the curvature of $\underline{L}$ that determines much of the structure of these losses.
3 Losses for Class Probability Estimation
We begin by considering CPE losses, that is, functions $\ell : \{-1, 1\} \times [0, 1] \to [0, \infty)$, and briefly summarise a number of important existing structural results for proper losses, a large, natural class of losses for class probability estimation.
3.1 Proper, Fair, Definite and Regular Losses
There are a few properties of losses for probability estimation that we will require. If $\hat\eta$ is to be interpreted as an estimate of the true positive class probability $\eta$ (i.e., when $\mathcal{V} = [0, 1]$) then it is desirable to require that $L(\eta, \hat\eta)$ be minimised by $\hat\eta = \eta$ for all $\eta \in [0, 1]$. Losses that satisfy this constraint are said to be Fisher consistent and are known as proper losses (Buja et al., 2005; Gneiting and Raftery, 2007). That is, a proper loss $\ell$ satisfies $\underline{L}(\eta) = L(\eta, \eta)$ for all $\eta \in [0, 1]$. A strictly proper loss is a proper loss for which the minimiser of $L(\eta, \hat\eta)$ over $\hat\eta$ is unique.
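Properness can be probed numerically by checking that predicting the true probability is optimal. The sketch below does this on a grid; it is a heuristic check rather than a proof, and the absolute loss is included only as an assumed example of a loss that fails the test.

```python
def conditional_risk(loss, eta, eta_hat):
    return eta * loss(1, eta_hat) + (1.0 - eta) * loss(-1, eta_hat)

def risk_minimiser(loss, eta, grid):
    """Grid point minimising the conditional risk L(eta, .)."""
    return min(grid, key=lambda eta_hat: conditional_risk(loss, eta, eta_hat))

def looks_proper(loss, grid, tol=1e-12):
    """True if predicting eta itself is optimal for every eta on the grid."""
    return all(
        conditional_risk(loss, eta, eta)
        <= min(conditional_risk(loss, eta, e) for e in grid) + tol
        for eta in grid
    )

def square_loss(y, eta_hat):
    return (1.0 - eta_hat) ** 2 if y == 1 else eta_hat ** 2

def absolute_loss(y, eta_hat):
    # Not proper: its conditional risk is linear in eta_hat.
    return (1.0 - eta_hat) if y == 1 else eta_hat

grid = [i / 100 for i in range(101)]
```

With square loss the minimiser at $\eta = 0.3$ is $0.3$ itself, whereas the absolute loss is minimised at an endpoint of the interval, so its predictions cannot be read as probability estimates.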
We will say a loss is fair whenever
$$\ell_{-1}(0) = \ell_1(1) = 0. \tag{3}$$
That is, there is no loss incurred for perfect prediction. The main place fairness is relied upon is in the integral representation of Theorem 6 where it is used to get rid of some constants of integration. In order to explicitly construct losses from their associated “weight functions” as shown in Theorem 7, we will require that the loss be definite, that is, its point-wise Bayes risk for deterministic events (i.e., $\eta = 0$ or $\eta = 1$) must be bounded from below:
$$\underline{L}(0) > -\infty, \qquad \underline{L}(1) > -\infty. \tag{4}$$
Since properness of a loss ensures $\underline{L}(\eta) = L(\eta, \eta)$ we see that a fair proper loss is necessarily definite since $L(0, 0) = \ell_{-1}(0) = 0 > -\infty$, and similarly for $L(1, 1)$. Conversely, if a proper loss is definite then the finite values $\ell_{-1}(0)$ and $\ell_1(1)$ can be subtracted from $\ell_{-1}(\cdot)$ and $\ell_1(\cdot)$ to make it fair.
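Finally, for some of the results below to hold at the endpoints of the unit interval, we require a loss to be regular, that is,
$$\lim_{\eta \searrow 0} \eta\,\ell_1(\eta) = \lim_{\eta \nearrow 1} (1 - \eta)\,\ell_{-1}(\eta) = 0. \tag{5}$$
Intuitively, this condition ensures that making mistakes on events that never happen should not incur a penalty. Most of the situations we consider in the remainder of this paper involve losses which are proper, fair, definite and regular.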
3.2 The Structure of Proper Losses
A key result in the study of proper losses is originally due to Shuford et al. (1966) though our presentation follows that of Buja et al. (2005). It characterises proper losses for probability estimation via a constraint on the relationship between its partial losses.
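Theorem 1. Suppose $\ell$ is a loss and that its partial losses $\ell_1$ and $\ell_{-1}$ are both differentiable. Then $\ell$ is a proper loss if and only if for all $\hat\eta \in (0, 1)$
$$\frac{-\ell_1'(\hat\eta)}{1 - \hat\eta} = \frac{\ell_{-1}'(\hat\eta)}{\hat\eta} = w(\hat\eta) \tag{6}$$
for some weight function $w : (0, 1) \to \mathbb{R}^+$ such that $\int_\epsilon^{1-\epsilon} w(c)\,dc < \infty$ for all $\epsilon > 0$.
The equalities in (6) should be interpreted in the distributional sense.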
This simple characterisation of the structure of proper losses has a number of interesting implications. Observe from (6) that if $\ell$ is proper, given $\ell_1$ we can determine $\ell_{-1}$ or vice versa. Also, the partial derivative of the conditional risk can be seen to be the product of a linear term and the weight function:
Corollary 2. If $\ell$ is a differentiable proper loss then for all $\eta \in [0, 1]$
$$\frac{\partial}{\partial \hat\eta} L(\eta, \hat\eta) = (\hat\eta - \eta)\,w(\hat\eta). \tag{7}$$
Another corollary, observed by Buja et al. (2005), is that the weight function is related to the curvature of the conditional Bayes risk $\underline{L}$.
Corollary 3. Let $\ell$ be a twice differentiable proper loss with weight function $w$ defined as in equation (6). (The restriction to differentiable losses can be removed in most cases if generalised weight functions, that is, possibly infinite but defining a measure on $(0, 1)$, are permitted. For example, the weight function for the 0-1 loss is $w(c) = \delta(c - \tfrac{1}{2})$.) Then for all $c \in (0, 1)$ its conditional Bayes risk $\underline{L}$ satisfies
$$w(c) = -\underline{L}''(c). \tag{8}$$
One immediate consequence of this corollary is that the conditional Bayes risk for a proper loss is always concave. Along with an extra constraint, this gives another characterisation of proper losses (Savage, 1971; Reid and Williamson, 2009a).
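Theorem 4 (Savage). A loss function $\ell$ is proper if and only if its point-wise Bayes risk $\underline{L}(\eta)$ is concave and for each $\eta, \hat\eta \in (0, 1)$
$$L(\eta, \hat\eta) = \underline{L}(\hat\eta) + (\eta - \hat\eta)\,\underline{L}'(\hat\eta). \tag{9}$$
Furthermore if $\ell$ is regular this characterisation also holds at the endpoints $\eta, \hat\eta \in \{0, 1\}$.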
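Both the curvature relation and Savage's tangent characterisation can be checked numerically. The sketch below does so for log loss, assuming its conditional Bayes risk is the binary entropy and its weight function is $1/(c(1-c))$; derivatives are approximated by central finite differences, so agreement is only up to discretisation error.

```python
import math

def bayes_risk_log(eta):
    """Conditional Bayes risk of log loss, L(eta, eta): the binary entropy."""
    return -eta * math.log(eta) - (1.0 - eta) * math.log(1.0 - eta)

def conditional_risk_log(eta, eta_hat):
    return -eta * math.log(eta_hat) - (1.0 - eta) * math.log(1.0 - eta_hat)

def first_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def weight_from_bayes_risk(bayes_risk, c):
    """Weight function recovered as minus the curvature of the Bayes risk."""
    return -second_derivative(bayes_risk, c)

def savage_tangent(bayes_risk, eta, eta_hat):
    """Value at eta of the tangent to the Bayes risk curve taken at eta_hat."""
    slope = first_derivative(bayes_risk, eta_hat)
    return bayes_risk(eta_hat) + (eta - eta_hat) * slope
```

The tangent taken at $\hat\eta$ and evaluated at $\eta$ reproduces the conditional risk $L(\eta, \hat\eta)$, which is why the Bayes risk curve is the lower envelope of these lines.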
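This link between loss and concave functions makes it easy to establish a connection, as Buja et al. (2005) do, between regret for proper losses and Bregman divergences. The latter are generalisations of distances and are defined in terms of convex functions. Specifically, if $f : \mathcal{S} \to \mathbb{R}$ is a convex function over some convex set $\mathcal{S} \subseteq \mathbb{R}^n$ then its associated Bregman divergence (a concise summary of Bregman divergences and their properties is given by Banerjee et al. (2005, Appendix A)) is
$$D_f(s, s_0) := f(s) - f(s_0) - \langle s - s_0, \nabla f(s_0) \rangle \tag{10}$$
for any $s, s_0 \in \mathcal{S}$, where $\nabla f(s_0)$ is the gradient of $f$ at $s_0$. By noting that over $\mathcal{S} = [0, 1]$ we have $\nabla f = f'$, these definitions lead immediately to the following corollary of Theorem 4.
Corollary 5. If $\ell$ is a proper loss then its regret is the Bregman divergence associated with $f = -\underline{L}$. That is,
$$L(\eta, \hat\eta) - \underline{L}(\eta) = D_{-\underline{L}}(\eta, \hat\eta).$$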
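The identity between regret and a Bregman divergence can be verified on two familiar cases. This sketch assumes the standard closed forms: for square loss minus the Bayes risk is $\eta^2 - \eta$, giving the squared distance, and for log loss it is the negative binary entropy, giving the binary KL divergence.

```python
import math

def bregman(f, f_prime, s, s0):
    """Bregman divergence D_f(s, s0) for a convex f on an interval."""
    return f(s) - f(s0) - (s - s0) * f_prime(s0)

def regret(partial_pos, partial_neg, eta, eta_hat):
    """L(eta, eta_hat) - L(eta, eta); for a proper loss this is the regret."""
    risk = lambda e: eta * partial_pos(e) + (1.0 - eta) * partial_neg(e)
    return risk(eta_hat) - risk(eta)

# Square loss: minus the Bayes risk is f(eta) = eta^2 - eta.
sq_regret = regret(lambda e: (1 - e) ** 2, lambda e: e ** 2, 0.3, 0.6)
sq_bregman = bregman(lambda s: s * s - s, lambda s: 2 * s - 1, 0.3, 0.6)

# Log loss: minus the Bayes risk is the negative binary entropy.
neg_entropy = lambda s: s * math.log(s) + (1 - s) * math.log(1 - s)
logit = lambda s: math.log(s / (1 - s))
log_regret = regret(lambda e: -math.log(e), lambda e: -math.log(1 - e), 0.3, 0.6)
log_bregman = bregman(neg_entropy, logit, 0.3, 0.6)
```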
Many of the above results can be observed graphically by plotting the conditional risk for a proper loss as in Figure 1. Here we see the two partial losses on the left and right sides of the figure are related, for each fixed $\hat\eta$, by the linear map $\eta \mapsto L(\eta, \hat\eta) = (1 - \eta)\,\ell_{-1}(\hat\eta) + \eta\,\ell_1(\hat\eta)$. For each fixed $\eta$ the properness of $\ell$ requires that these convex combinations of the partial losses (each slice parallel to the left and right faces) are minimised when $\hat\eta = \eta$. Thus, the lines joining the partial losses are tangent to the conditional Bayes risk curve $\eta \mapsto \underline{L}(\eta) = L(\eta, \eta)$ shown above the dotted diagonal. Since the conditional Bayes risk curve is the lower envelope of these tangents it is necessarily concave. The coupling of the partial losses via the tangents to the conditional Bayes risk curve demonstrates why much of the structure of proper losses is determined by the curvature of $\underline{L}$, that is, by the weight function $w$.
The relationship between a proper loss and its associated weight function is captured succinctly via the following representation of proper losses as a weighted integral of the cost-weighted misclassification losses defined in (2). The reader is referred to (Reid and Williamson, 2009b) for the details, proof and the history of this result.
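Theorem 6. Let $\ell$ be a fair, proper loss. Then for each $\hat\eta \in (0, 1)$ and $y \in \{-1, 1\}$
$$\ell(y, \hat\eta) = \int_0^1 \ell_c(y, \hat\eta)\,w(c)\,dc, \tag{11}$$
where $w = -\underline{L}''$. Conversely, if $\ell$ is defined by (11) for some weight function $w : (0, 1) \to [0, \infty)$ then it is proper.
Some example losses and their associated weight functions are given in Table 1. Buja et al. (2005) show that $\ell$ is strictly proper if and only if $w(c) > 0$ in the sense that $w$ has non-zero mass on every open subset of $(0, 1)$.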
The following theorem from Reid and Williamson (2009a) shows how to explicitly construct a loss in terms of a weight function.
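Theorem 7. Given a weight function $w : [0, 1] \to [0, \infty)$, let $W(t) = \int^t w(c)\,dc$ and $\overline{W}(t) = \int^t W(c)\,dc$. Then the loss $\ell_w$ defined by
$$\ell_w(y, \hat\eta) = -\overline{W}(\hat\eta) - ([\![y = 1]\!] - \hat\eta)\,W(\hat\eta) \tag{12}$$
is a proper loss. Additionally, if $\overline{W}(0)$ and $\overline{W}(1)$ are both finite then
$$(y, \hat\eta) \mapsto \ell_w(y, \hat\eta) + (\overline{W}(1) - \overline{W}(0))\,[\![y = 1]\!] + \overline{W}(0) \tag{13}$$
is a fair, proper loss.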
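Observe that if $w$ and $v$ are weight functions which differ on a set of measure zero then they will lead to the same loss. A simple corollary to Theorem 6 is that the partial losses are given by
$$\ell_1(\hat\eta) = \int_{\hat\eta}^1 (1 - c)\,w(c)\,dc \quad\text{and}\quad \ell_{-1}(\hat\eta) = \int_0^{\hat\eta} c\,w(c)\,dc. \tag{14}$$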
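The weighted-integral representation can be exercised numerically: integrating the cost-weighted losses against a weight function should recover the partial losses. The sketch below assumes the partial losses take the form $\ell_1(\hat\eta) = \int_{\hat\eta}^1 (1-c)\,w(c)\,dc$ and $\ell_{-1}(\hat\eta) = \int_0^{\hat\eta} c\,w(c)\,dc$, and uses a simple midpoint rule; the constant weight 2 and the weight $1/(c(1-c))$ are the standard choices for square loss and log loss.

```python
def integrate(f, a, b, n=20000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def partial_losses_from_weight(w):
    """Partial losses of the fair proper loss generated by weight w."""
    def loss_pos(eta_hat):   # label +1: cost (1 - c) whenever eta_hat < c
        return integrate(lambda c: (1.0 - c) * w(c), eta_hat, 1.0)
    def loss_neg(eta_hat):   # label -1: cost c whenever c <= eta_hat
        return integrate(lambda c: c * w(c), 0.0, eta_hat)
    return loss_pos, loss_neg

square_pos, square_neg = partial_losses_from_weight(lambda c: 2.0)
log_pos, log_neg = partial_losses_from_weight(lambda c: 1.0 / (c * (1.0 - c)))
```

Here `square_pos(0.3)` approximates $(1 - 0.3)^2$ and `log_pos(0.3)` approximates $-\log 0.3$. The midpoint rule never evaluates the weight at the endpoints, which avoids the singularities of the log-loss weight.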
3.3 Symmetric Losses
We will say a loss is symmetric if $\ell_1(\hat\eta) = \ell_{-1}(1 - \hat\eta)$ for all $\hat\eta \in [0, 1]$. We say a weight function for a proper loss or the conditional Bayes risk is symmetric if $w(c) = w(1 - c)$ or $\underline{L}(c) = \underline{L}(1 - c)$ for all $c \in [0, 1]$. Perhaps unsurprisingly, an immediate consequence of Theorem 1 is that these two notions are identical.
Theorem 8. A proper loss is symmetric if and only if its weight function is symmetric.
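Requiring a loss to be proper and symmetric constrains the partial losses significantly. Properness alone completely specifies one partial loss from the other. Now suppose in addition that $\ell$ is symmetric. Combining $\ell_1(\hat\eta) = \ell_{-1}(1 - \hat\eta)$ with (6) implies
$$\ell_{-1}'(1 - \hat\eta) = \frac{1 - \hat\eta}{\hat\eta}\,\ell_{-1}'(\hat\eta). \tag{15}$$
This shows that $\ell_{-1}$ is completely determined by $\ell_{-1}(\hat\eta)$ for $\hat\eta \in [0, \tfrac{1}{2}]$ (or $\hat\eta \in [\tfrac{1}{2}, 1]$). Thus in order to specify a symmetric proper loss, one needs only to specify one of the partial losses on one half of the interval $[0, 1]$. Assuming $\ell_{-1}$ is continuous at $\tfrac{1}{2}$ (or equivalently that $w$ has no atoms at $\tfrac{1}{2}$), by integrating both sides of (15) we can derive an explicit formula for the other half of $\ell_{-1}$ in terms of that which is specified:
$$\ell_{-1}(1 - \hat\eta) = \ell_{-1}(\tfrac{1}{2}) - \int_{1/2}^{\hat\eta} \frac{1 - x}{x}\,\ell_{-1}'(x)\,dx, \tag{16}$$
which works for determining $\ell_{-1}$ on either $[0, \tfrac{1}{2}]$ or $[\tfrac{1}{2}, 1]$ when $\ell_{-1}$ is specified on $[\tfrac{1}{2}, 1]$ or $[0, \tfrac{1}{2}]$ respectively (recalling the usual convention that $\int_a^b = -\int_b^a$). We have thus shown: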
We demonstrate (16) with four examples. Suppose that for . Then one can readily determine the complete partial loss to be
Suppose instead that for . In that case we obtain
Suppose for . Then one can determine that
Finally consider specifying that for . In this case we obtain that
4 Composite Losses
General loss functions are often constructed with the aid of a link function. For a particular set of prediction values $\mathcal{V}$ this is any continuous mapping $\psi : [0, 1] \to \mathcal{V}$. In this paper, our focus will be composite losses for binary class probability estimation. These are the composition of a CPE loss $\ell : \{-1, 1\} \times [0, 1] \to [0, \infty)$ and the inverse of a link function $\psi$, an invertible mapping from the unit interval to some range of values. Unless stated otherwise we will assume $\psi : [0, 1] \to \mathbb{R}$. We will denote a composite loss by
$$\ell^\psi(y, v) := \ell(y, \psi^{-1}(v)). \tag{19}$$
The classical motivation for link functions (McCullagh and Nelder, 1989) is that often in estimating $\eta$ one uses a parametric representation of $\hat\eta : \mathcal{X} \to [0, 1]$ which has a natural scale not matching $[0, 1]$. Traditionally one writes $\hat\eta = \psi^{-1}(\hat{h})$ where $\psi^{-1}$ is the “inverse link” (and $\psi$ is of course the forward link). The function $\hat{h} : \mathcal{X} \to \mathbb{R}$ is the hypothesis. Often $\hat{h} = \hat{h}_\alpha$ is parametrised linearly in a parameter vector $\alpha$. In such a situation it is computationally convenient if $\ell(y, \psi^{-1}(\hat{h}))$ is convex in $\hat{h}$ (which implies it is convex in $\alpha$ when $\hat{h}$ is linear in $\alpha$).
Often one will choose the loss first (tailoring its properties by the weighting given according to $w$), and then choose the link somewhat arbitrarily to map the hypotheses appropriately. An interesting alternative perspective arises in the literature on “elicitability”. Lambert et al. (2008) (see also Gneiting, 2009) provide a general characterisation of proper scoring rules (i.e. losses) for general properties of distributions, that is, continuous and locally non-constant functions which assign a real value to each distribution over a finite sample space. In the binary case, these properties provide another interpretation of links that is complementary to the usual one that treats the inverse link as a way of interpreting scores as class probabilities.
To see this, we first identify distributions over with the probability of observing 1. In this case properties are continuous, locally non-constant maps . When a link function is continuous it can therefore be interpreted as a property since its assumed invertibility implies it is locally non-constant. A property is said to be elicitable whenever there exists a strictly proper loss for it so that the composite loss satisfies for all
Theorem 1 of (Lambert et al., 2008) shows that a property $\Gamma$ is elicitable if and only if $\Gamma^{-1}(r)$ is convex for all $r \in \mathrm{range}(\Gamma)$. This immediately gives us a characterisation of “proper” link functions: those that are both continuous and have convex level sets in $[0, 1]$, that is, the non-decreasing continuous functions. Thus in Lambert’s perspective, one chooses a “property” first (i.e. the invertible link) and then chooses the proper loss.
4.1 Proper Composite Losses
We will call a composite loss $\ell^\psi$ (19) a proper composite loss if $\ell$ in (19) is a proper loss for class probability estimation. As in the case for losses for probability estimation, the requirement that a composite loss be proper imposes some constraints on its partial losses. Many of the results for proper losses carry over to composite losses with some extra factors to account for the link function.
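Theorem 10. Let $\ell^\psi = \ell(y, \psi^{-1}(v))$ be a composite loss with differentiable and strictly monotone link $\psi$ and suppose the partial losses $\ell_{-1}^\psi(v)$ and $\ell_1^\psi(v)$ are both differentiable. Then $\ell^\psi$ is a proper composite loss if and only if there exists a weight function $w : (0, 1) \to \mathbb{R}^+$ such that for all $\hat\eta \in (0, 1)$
$$\frac{-(\ell_1^\psi)'(\psi(\hat\eta))}{1 - \hat\eta} = \frac{(\ell_{-1}^\psi)'(\psi(\hat\eta))}{\hat\eta} = \frac{w(\hat\eta)}{\psi'(\hat\eta)} =: \rho(\hat\eta), \tag{20}$$
where equality is in the distributional sense. Furthermore, $\rho(\hat\eta) \ge 0$ for all $\hat\eta \in (0, 1)$.
Proof. This is a direct consequence of Theorem 1 for proper losses for probability estimation and the chain rule applied to $\ell_y^\psi(\psi(\hat\eta)) = \ell_y(\hat\eta)$. Since $\psi$ is assumed to be strictly monotonic we know $\psi' > 0$ and so, since $w \ge 0$, we have $\rho \ge 0$.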
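As we shall see, the ratio $\rho(\hat\eta) = w(\hat\eta) / \psi'(\hat\eta)$ is a key quantity in the analysis of proper composite losses. For example, Corollary 2 has a natural analogue in terms of $\rho$ that will be of use later. It is obtained by letting $\hat\eta = \psi^{-1}(v)$ and using the chain rule.
Corollary 11. Suppose $\ell^\psi$ is a proper composite loss with conditional risk denoted $L^\psi$. Then
$$\frac{\partial}{\partial v} L^\psi(\eta, v) = (\psi^{-1}(v) - \eta)\,\rho(\psi^{-1}(v)). \tag{21}$$
Loosely speaking then, $\rho$ is a “co-ordinate free” weight function for composite losses where the link function $\psi$ is interpreted as a mapping from arbitrary $v \in \mathcal{V}$ to values $\hat\eta \in [0, 1]$ which can be interpreted as probabilities.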
Another immediate corollary of Theorem 10 shows how properness is characterised by a particular relationship between the choice of link function and the choice of partial composite losses.
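Corollary 12. Let $\ell^\psi := \ell(y, \psi^{-1}(v))$ be a composite loss with differentiable partial losses $\ell_1^\psi$ and $\ell_{-1}^\psi$. Then $\ell^\psi$ is proper if and only if the link $\psi$ satisfies
$$\psi^{-1}(v) = \frac{(\ell_{-1}^\psi)'(v)}{(\ell_{-1}^\psi)'(v) - (\ell_1^\psi)'(v)}, \quad \forall v \in \mathcal{V}. \tag{22}$$
Proof. Substituting $\hat\eta = \psi^{-1}(v)$ into (20) yields $-\psi^{-1}(v)\,(\ell_1^\psi)'(v) = (1 - \psi^{-1}(v))\,(\ell_{-1}^\psi)'(v)$, and solving this for $\psi^{-1}(v)$ gives the result.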
These results give some insight into the “degrees of freedom” available when specifying proper composite losses. Theorem 10 shows that the partial losses are completely determined once the weight function $w$ and $\psi$ (up to an additive constant) are fixed. Corollary 12 shows that for a given link $\psi$ one can specify one of the partial losses $\ell_y^\psi$ but then properness fixes the other partial loss $\ell_{-y}^\psi$. Similarly, given an arbitrary choice of the partial losses, equation (22) gives the single link which will guarantee the overall loss is proper.
We see then that Corollary 12 provides us with a way of constructing a reference link for arbitrary composite losses specified by their partial losses. The reference link can be seen to satisfy
$$\psi(\hat\eta) = \operatorname*{arg\,min}_{v \in \mathbb{R}} L^\psi(\hat\eta, v)$$
for $\hat\eta \in (0, 1)$ and thus calibrates a given composite loss in the sense of Cohen and Goldszmidt (2004).
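The reference link can be computed directly from the derivatives of the two partial losses. The sketch below assumes the relation $\psi^{-1}(v) = \ell_{-1}'(v) / (\ell_{-1}'(v) - \ell_1'(v))$ from (22) and applies it to a hypothetical asymmetric pair, exponential loss on positives and logistic loss on negatives, chosen only to illustrate a non-symmetric case. It then checks that the conditional risk is stationary at $v$ when the true probability equals $\psi^{-1}(v)$.

```python
import math

def inverse_reference_link(d_loss_pos, d_loss_neg, v):
    """Probability estimate psi^{-1}(v) implied by the partial-loss derivatives."""
    return d_loss_neg(v) / (d_loss_neg(v) - d_loss_pos(v))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical asymmetric pair: exponential loss on positives, logistic on negatives.
d_pos = lambda v: -math.exp(-v)   # derivative of exp(-v)
d_neg = lambda v: sigmoid(v)      # derivative of log(1 + exp(v))

v0 = 0.5
eta0 = inverse_reference_link(d_pos, d_neg, v0)
# Slope of the conditional risk at v0 when the true probability is eta0.
risk_slope = eta0 * d_pos(v0) + (1.0 - eta0) * d_neg(v0)
```

Because both partial losses in this example are convex in $v$, the vanishing slope means $v_0$ minimises the conditional risk at $\eta_0$, which is the calibration property described above.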
We now briefly consider an application of the parametrisation of proper losses as a weight function and link. In order to implement Stochastic Gradient Descent (SGD) algorithms one needs to compute the derivative of the loss with respect to predictions. Letting $\hat\eta = \psi^{-1}(v)$ be the probability estimate associated with the prediction $v$, we can use (21) when $\eta \in \{0, 1\}$ to obtain the update rules for positive and negative examples:
$$\frac{\partial}{\partial v}\ell_1^\psi(v) = (\hat\eta - 1)\,\rho(\hat\eta), \qquad \frac{\partial}{\partial v}\ell_{-1}^\psi(v) = \hat\eta\,\rho(\hat\eta).$$
Given an arbitrary weight function $w$ (which defines a proper loss via Corollary 2 and Theorem 4) and link $\psi$, the above equations show that one could implement SGD directly parametrised in terms of $\rho$ without needing to explicitly compute the partial losses themselves.
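A minimal sketch of SGD parametrised by $\rho$ follows, assuming the gradients take the form $(\hat\eta - 1)\rho(\hat\eta)$ for positive examples and $\hat\eta\,\rho(\hat\eta)$ for negative ones, with $\hat\eta = \psi^{-1}(v)$. The one-dimensional linear predictor, the toy data set, the learning rate and the two choices of $\rho$ (both computed for the logit link, where $\psi'(\hat\eta) = 1/(\hat\eta(1-\hat\eta))$) are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def loss_gradient(y, v, inverse_link, rho):
    """Derivative of the composite loss in v, using only inverse link and rho."""
    eta_hat = inverse_link(v)
    target = 1.0 if y == 1 else 0.0
    return (eta_hat - target) * rho(eta_hat)

def sgd(examples, inverse_link, rho, rate=0.1, epochs=50):
    """SGD for a one-dimensional linear predictor v = alpha * x."""
    alpha = 0.0
    for _ in range(epochs):
        for x, y in examples:
            alpha -= rate * loss_gradient(y, alpha * x, inverse_link, rho) * x
    return alpha

rho_log = lambda eta_hat: 1.0                                 # log loss, logit link
rho_square = lambda eta_hat: 2.0 * eta_hat * (1.0 - eta_hat)  # square loss, logit link

data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]  # toy data, illustrative only
alpha_log = sgd(data, sigmoid, rho_log)
alpha_square = sgd(data, sigmoid, rho_square)
```

With $\rho \equiv 1$ the update reduces to the familiar logistic regression gradient; swapping in a different $\rho$ changes the loss without touching the rest of the training loop.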
Finally, we make a note of an analogue of Corollary 5 for composite losses. It shows that the regret for an arbitrary composite loss is related to a Bregman divergence via its link.
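Corollary 13. Let $\ell^\psi$ be a proper composite loss with invertible link $\psi$. Then for all $\eta, \hat\eta \in (0, 1)$,
$$L^\psi(\eta, \psi(\hat\eta)) - \underline{L}(\eta) = D_{-\underline{L}}(\eta, \hat\eta).$$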
4.2 Margin Losses
The margin associated with a real-valued prediction $v \in \mathbb{R}$ and label $y \in \{-1, 1\}$ is the product $z = yv$. Any function $\phi : \mathbb{R} \to [0, \infty)$ can be used as a margin loss by interpreting $\phi(yv)$ as the penalty for predicting $v$ for an instance with label $y$. Margin losses are inherently symmetric since $yv = (-y)(-v)$ and so the penalty given for predicting $v$ when the label is $y$ is necessarily the same as the penalty for predicting $-v$ when the label is $-y$. Margin losses have attracted a lot of attention (Bartlett et al., 2000) because of their central role in Support Vector Machines (Cortes and Vapnik, 1995). In this section we explore the relationship between these margin losses and the more general class of composite losses and, in particular, symmetric composite losses.
Recall that a general composite loss is of the form $\ell^\psi(y, v) = \ell(y, \psi^{-1}(v))$ for a loss $\ell : \mathcal{Y} \times [0, 1] \to [0, \infty)$ and an invertible link $\psi : [0, 1] \to \mathbb{R}$. We would like to understand when margin losses can be understood as losses suitable for probability estimation tasks. As discussed above, proper losses are a natural class of losses over $[0, 1]$ for probability estimation so a natural question in this vein is the following: given a margin loss $\phi$ can we choose a link $\psi$ so that there exists a proper loss $\ell$ such that $\phi(yv) = \ell^\psi(y, v)$? In this case the proper loss will be $\ell(y, \hat\eta) = \phi(y\,\psi(\hat\eta))$.
The following corollary of Theorem 10 gives necessary and sufficient conditions on the choice of link to guarantee when a margin loss can be expressed as a proper composite loss.
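Corollary 14. Suppose $\phi : \mathbb{R} \to \mathbb{R}$ is a differentiable margin loss. Then, $\phi(yv)$ can be expressed as a proper composite loss $\ell^\psi(y, v)$ if and only if the link $\psi$ satisfies
$$\psi^{-1}(v) = \frac{\phi'(-v)}{\phi'(-v) + \phi'(v)}.$$
Proof. Margin losses, by definition, have partial losses $\ell_y^\psi(v) = \phi(yv)$, which means $(\ell_1^\psi)'(v) = \phi'(v)$ and $(\ell_{-1}^\psi)'(v) = -\phi'(-v)$.
Substituting these into (22) gives the result.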
This result provides a way of interpreting predictions $v$ as probabilities $\hat\eta = \psi^{-1}(v)$ in a consistent manner, for a problem defined by a margin loss. Conversely, it also guarantees that using any other link to interpret predictions as probabilities will be inconsistent. (Strictly speaking, if the margin loss has “flat spots”, i.e., where $\phi'(v) = 0$, then the choice of link may not be unique.) Another immediate implication is that for a margin loss to be considered a proper loss its link function must be symmetric in the sense that
$$\psi^{-1}(-v) = \frac{\phi'(v)}{\phi'(v) + \phi'(-v)} = 1 - \frac{\phi'(-v)}{\phi'(-v) + \phi'(v)} = 1 - \psi^{-1}(v),$$
and so, by letting $v = \psi(\hat\eta)$, we have $\psi^{-1}(-\psi(\hat\eta)) = 1 - \hat\eta$ and thus $\psi(\hat\eta) = -\psi(1 - \hat\eta)$.
Corollary 14 can also be seen as a simplified and generalised version of the argument by Masnadi-Shirazi and Vasconcelos (2009) that a concave minimal conditional risk function and a symmetric link completely determine a margin loss. (Shen (2005, Section 4.4) seems to have been the first to view margin losses from this more general perspective.)
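We now consider a couple of specific margin losses and show how they can be associated with a proper loss through the choice of link given in Corollary 14. The exponential loss $\phi(v) = e^{-v}$ gives rise to a proper loss $\ell(y, \hat\eta) = \phi(y\,\psi(\hat\eta))$ via the link
$$\psi^{-1}(v) = \frac{-e^{v}}{-e^{v} - e^{-v}} = \frac{1}{1 + e^{-2v}},$$
which has non-zero denominator. In this case $\psi(\hat\eta) = \tfrac{1}{2}\log\left(\frac{\hat\eta}{1 - \hat\eta}\right)$ is just the logistic link. Now consider the family of margin losses parametrised by $\alpha \in (0, \infty)$
$$\phi_\alpha(v) = \frac{\log(\exp((1 - v)\alpha) + 1)}{\alpha}.$$
This family of differentiable convex losses approximates the hinge loss as $\alpha \to \infty$ and was studied in the multiclass case by Zhang et al. (2009). Since these are all differentiable functions with $\phi_\alpha'(v) = \frac{-e^{\alpha(1 - v)}}{e^{\alpha(1 - v)} + 1}$, Corollary 14 and a little algebra gives
$$\psi^{-1}(v) = \left[1 + \frac{e^{2\alpha} + e^{\alpha(1 - v)}}{e^{2\alpha} + e^{\alpha(1 + v)}}\right]^{-1}.$$
Examining this family of inverse links as $\alpha \to \infty$ gives some insight into why the hinge loss is a surrogate for classification but not probability estimation. When $\alpha \to \infty$ an estimate $\hat\eta = \psi^{-1}(v) \approx \tfrac{1}{2}$ for all but very large $v \in \mathbb{R}$. That is, in the limit all probability estimates sit infinitesimally to the right or left of $\tfrac{1}{2}$ depending on the sign of $v$.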
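The link implied by a differentiable margin loss can be evaluated from its derivative alone. The sketch below assumes the relation $\psi^{-1}(v) = \phi'(-v) / (\phi'(-v) + \phi'(v))$, and assumes the smoothed hinge family has the form $\phi_\alpha(v) = \log(1 + e^{\alpha(1 - v)})/\alpha$, whose derivative is $-\sigma(\alpha(1 - v))$ with $\sigma$ the logistic function. Function names are illustrative.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def inverse_link_for_margin(phi_prime, v):
    """psi^{-1}(v) = phi'(-v) / (phi'(-v) + phi'(v)) for a margin loss phi."""
    return phi_prime(-v) / (phi_prime(-v) + phi_prime(v))

exp_prime = lambda v: -math.exp(-v)   # derivative of phi(v) = exp(-v)

def smoothed_hinge_prime(alpha):
    """Derivative of the assumed family log(1 + exp(alpha * (1 - v))) / alpha."""
    return lambda v: -sigmoid(alpha * (1.0 - v))

# Probability estimate at v = 0.5 as the family approaches the hinge loss.
estimates = {alpha: inverse_link_for_margin(smoothed_hinge_prime(alpha), 0.5)
             for alpha in (1.0, 5.0, 50.0)}
```

For the exponential loss this reproduces $1/(1 + e^{-2v})$, while for the smoothed hinge family the estimate at a fixed moderate $v$ drifts towards $\tfrac{1}{2}$ as $\alpha$ grows, matching the observation that the hinge loss carries little probability information.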
5 Classification Calibration and Proper Losses
The notion of properness of a loss designed for class probability estimation is a natural one. If one is only interested in classification (rather than estimating probabilities) a weaker condition suffices. In this section we will relate the weaker condition to properness.
5.1 Classification Calibration for CPE Losses
We begin by giving a definition of classification calibration for CPE losses (i.e., over the unit interval $[0, 1]$) and relate it to composite losses via a link.
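Definition 15. We say a CPE loss $\ell$ is classification calibrated at $c \in (0, 1)$ and write $\ell$ is $\mathrm{CC}_c$ if the associated conditional risk $L$ satisfies
$$\forall \eta \neq c, \quad \underline{L}(\eta) < \inf_{\hat\eta \,:\, (\hat\eta - c)(\eta - c) \le 0} L(\eta, \hat\eta). \tag{27}$$
The expression constraining the infimum ensures that $\hat\eta$ is on the opposite side of $c$ to $\eta$, or $\hat\eta = c$.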
The condition $\mathrm{CC}_{1/2}$ is equivalent to what is called “classification calibrated” by Bartlett et al. (2006) and “Fisher consistent for classification problems” by Lin (2002) although their definitions were only for margin losses.
One might suspect that there is a connection between classification calibration at $c$ and standard Fisher consistency for class probability estimation losses. The following theorem, which captures the intuition behind the “probing” reduction (Langford and Zadrozny, 2005), characterises the situation.
Theorem 16. A CPE loss $\ell$ is $\mathrm{CC}_c$ for all $c \in (0, 1)$ if and only if $\ell$ is strictly proper.
Proof is for all is equivalent to
which means is strictly proper.
The following theorem is a generalisation of the characterisation of $\mathrm{CC}_{1/2}$ for margin losses via $\phi'(0)$ due to Bartlett et al. (2006).
Suppose is a loss and suppose that and exist everywhere. Then for any is if and only if
Proof Since and are assumed to exist everywhere
exists for all . is is equivalent to
where we have used the fact that (29) with
and respectively substituted implies and
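If $\ell$ is proper with weight $w$, then for any $c \in (0, 1)$, $w(c) > 0$ if and only if $\ell$ is $\mathrm{CC}_c$.
The simple form of the weight function for the cost-sensitive misclassification loss $\ell_{c_0}$ ($w(c) = \delta(c - c_0)$) gives the following corollary (confer Bartlett et al. (2006)):
$\ell_{c_0}$ is $\mathrm{CC}_c$ if and only if $c_0 = c$.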
5.2 Calibration for Composite Losses
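The translation of the above results to general proper composite losses with invertible differentiable link $\psi$ is straightforward. Condition (27) becomes
$$\forall \eta \neq c, \quad \underline{L}^\psi(\eta) < \inf_{v \,:\, (\psi^{-1}(v) - c)(\eta - c) \le 0} L^\psi(\eta, v).$$
Theorem 16 then immediately gives:
A composite loss $\ell^\psi(\cdot, \cdot) = \ell(\cdot, \psi^{-1}(\cdot))$ with invertible and differentiable link $\psi$ is $\mathrm{CC}_c$ for all $c \in (0, 1)$ if and only if the associated proper loss $\ell$ is strictly proper.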