## 1 Introduction

A *loss* function is the means by which a learning algorithm’s performance
is judged. A *binary* loss function is a loss for a supervised prediction
problem where there are two possible labels associated with the examples. A
*composite* loss is the composition of a proper loss (defined below) and a
link function (also defined below). In this paper we study composite binary
losses and develop a number of new characterisation results.

Informally, proper losses are well-calibrated losses for class probability estimation, that is for the problem of not only predicting a binary classification label, but providing an estimate of the probability that an example will have a positive label. Link functions are often used to map the outputs of a predictor to the interval $[0,1]$ so that they can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probability estimates (Platt, 2000; Gneiting and Raftery, 2007; Cohen and Goldszmidt, 2004) and understanding the implications of requiring that loss functions provide good probability estimates (Bartlett and Tewari, 2007).

Much previous work in the machine learning literature has focussed on *margin losses*, which intrinsically treat positive and negative classes symmetrically. However, it is now well understood how important it is to be able to deal with the non-symmetric case (Bach et al., 2006; Elkan, 2001; Beygelzimer et al., 2008; Buja et al., 2005; Provost and Fawcett, 2001). A key goal of the present work is to consider composite losses in the general (non-symmetric) situation.

Having the flexibility to choose a loss function is important in order to “tailor” the solution to a machine learning problem; confer (Hand, 1994; Hand and Vinciotti, 2003; Buja et al., 2005). Understanding the structure of the set of loss functions and having natural parametrisations of them is useful for this purpose. Even when one is using a loss as a surrogate for the loss one would ideally like to minimise, it is helpful to have an easy-to-use parametrisation — see the discussion of “surrogate tuning” in the Conclusion.

The paper is structured as follows. In §2 we introduce the notions of a loss and of the conditional and full risk, which we will make extensive use of throughout the paper.

In §3 we introduce losses for Class Probability Estimation (CPE), define some technical properties of them, and present some structural results. We introduce and exploit Savage’s characterisation of proper losses and use it to characterise proper symmetric CPE-losses.

In §4 we define composite losses formally and characterise when a loss is a proper composite loss in terms of its partial losses. We introduce a natural and intrinsic parametrisation of proper composite losses and characterise when a margin loss can be a proper composite loss. We also show the relationship between regret and Bregman divergences for general composite losses.

In §5 we characterise the relationship between classification calibrated losses (as studied for example by Bartlett et al. (2006)) and proper composite losses.

In §6, motivated by the question of which is the best surrogate loss, we characterise when a proper composite loss is convex in terms of the natural parametrisation of such losses.

In §7 we study surrogate losses making use of some of
the earlier material in the paper. A *surrogate* loss function is a loss
function which is not exactly what one wishes to minimise but is easier to work
with algorithmically. We define a well founded notion of “best” surrogate
loss and show that some convex surrogate losses are incommensurable on some
problems.
We also study other notions of “best” and explicitly
determine the surrogate loss that has the best surrogate regret bound in a
certain sense.

Finally, in §8 we draw some more general conclusions.

Appendix C builds upon some of the results in the main paper and presents some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and shows that all convex proper losses are non-robust to misclassification noise.

## 2 Losses and Risks

We write $\llbracket p \rrbracket = 1$ if $p$ is true and $\llbracket p \rrbracket = 0$ otherwise (this is the Iverson bracket notation, as recommended by Knuth (1992)).
The generalised function $\delta(\cdot)$ is defined by $\int_{\mathbb{R}} \delta(x)\, f(x)\, dx = f(0)$ when $f$ is continuous at $0$.
Random variables are written in sans-serif font: $\mathsf{X}$, $\mathsf{Y}$. Given a set of
labels $\mathcal{Y} = \{-1, 1\}$ and a set of prediction values $\mathcal{V}$ we will
say a *loss* is any function $\ell\colon \mathcal{Y} \times \mathcal{V} \to [0, \infty)$
(restricting the output of a loss to $[0, \infty)$ is equivalent to
assuming the loss has a lower bound and then translating its output).
We interpret such a loss as giving a penalty $\ell(y, v)$ when predicting the
value $v \in \mathcal{V}$ when an observed label is $y \in \mathcal{Y}$.
We can always write an arbitrary loss in terms of its *partial losses*
$\ell_1 := \ell(1, \cdot)$ and $\ell_{-1} := \ell(-1, \cdot)$ using

$$\ell(y, v) = \llbracket y = 1 \rrbracket\, \ell_1(v) + \llbracket y = -1 \rrbracket\, \ell_{-1}(v). \qquad (1)$$

Our definition of a loss function covers all commonly used *margin losses*
(*i.e.* those which can be expressed as $\ell(y, v) = \phi(yv)$ for some
function $\phi\colon \mathbb{R} \to [0, \infty)$) such as the
*0-1 loss* $\phi(z) = \llbracket z \le 0 \rrbracket$, the
*hinge loss* $\phi(z) = \max(0, 1 - z)$, the
*logistic loss* $\phi(z) = \log(1 + e^{-z})$, and the
*exponential loss* $\phi(z) = e^{-z}$ commonly used in boosting.
It also covers *class probability estimation losses* $\ell\colon \{-1, 1\} \times [0, 1] \to [0, \infty)$ where the
predicted values $\hat q \in [0, 1]$ are directly interpreted as probability
estimates (these are known as *scoring rules* in the statistical literature
(Gneiting and Raftery, 2007)).
We will use $\hat q$ instead of $v$ as an argument to indicate losses for class
probability estimation and use the shorthand *CPE losses* to distinguish
them from general losses.
For example,
*square loss* has partial losses $\ell_1(\hat q) = (1 - \hat q)^2$
and $\ell_{-1}(\hat q) = \hat q^2$,
the *log loss* $\ell_1(\hat q) = -\log(\hat q)$
and $\ell_{-1}(\hat q) = -\log(1 - \hat q)$,
and the family of
*cost-weighted misclassification losses* parametrised by $c \in (0, 1)$ is
given by

$$\ell_c(y, \hat q) = c\, \llbracket y = -1 \rrbracket\, \llbracket \hat q > c \rrbracket + (1 - c)\, \llbracket y = 1 \rrbracket\, \llbracket \hat q \le c \rrbracket. \qquad (2)$$
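As a small concrete illustration of the cost-weighted loss of equation (2), the following sketch (function names are ours, not from the paper) computes its conditional risk and confirms that predicting on the same side of the threshold as the true class probability is cheaper:

```python
def cost_weighted_loss(y, q_hat, c):
    """l_c(y, q_hat) from (2): cost c for a false positive, 1 - c for a false negative."""
    if y == 1:
        return (1 - c) if q_hat <= c else 0.0
    return c if q_hat > c else 0.0

def conditional_risk(eta, q_hat, c):
    """L_c(eta, q_hat) = eta * l_c(1, q_hat) + (1 - eta) * l_c(-1, q_hat)."""
    return eta * cost_weighted_loss(1, q_hat, c) + (1 - eta) * cost_weighted_loss(-1, q_hat, c)

# With eta = 0.8 > c = 0.5, an estimate above the threshold has lower risk.
assert conditional_risk(0.8, 0.9, c=0.5) < conditional_risk(0.8, 0.3, c=0.5)
```

Note the loss only depends on which side of $c$ the estimate falls, which is why this family reappears as the "primitive" losses in the integral representation of §3.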

### 2.1 Conditional and Full Risks

Suppose we have random examples $\mathsf{X} \in \mathcal{X}$ with associated labels
$\mathsf{Y} \in \{-1, 1\}$.
The joint distribution of $(\mathsf{X}, \mathsf{Y})$ is denoted $P$ and the marginal distribution of $\mathsf{X}$ is denoted $M$. Let the observation-conditional density $\eta(x) := \Pr(\mathsf{Y} = 1 \mid \mathsf{X} = x)$. Thus one can specify an experiment by either $P$ or $(\eta, M)$. If $\eta \in [0, 1]$ is the probability of observing the label $y = 1$, the
*point-wise risk* (or *conditional risk*)
of the estimate $v \in \mathcal{V}$ is defined as the
$\eta$-average of the point-wise loss for $\eta$:

$$L(\eta, v) := \mathbb{E}_{\mathsf{Y} \sim \eta}\big[\ell(\mathsf{Y}, v)\big] = \eta\, \ell_1(v) + (1 - \eta)\, \ell_{-1}(v).$$

Here, $\mathsf{Y} \sim \eta$ is a shorthand for labels being drawn from a Bernoulli distribution with parameter $\eta$. When $\eta\colon \mathcal{X} \to [0, 1]$ is an observation-conditional density, taking the $M$-average of the point-wise risk gives the *(full) risk* of the estimator $v$, now interpreted as a function $v\colon \mathcal{X} \to \mathcal{V}$:

$$\mathbb{L}(\eta, v, M) := \mathbb{E}_{\mathsf{X} \sim M}\big[L(\eta(\mathsf{X}), v(\mathsf{X}))\big].$$

We sometimes write $\mathbb{L}(v)$ for $\mathbb{L}(\eta, v, M)$ where $(\eta, M)$
corresponds to the joint distribution $P$.
We write $\ell$, $L$ and $\mathbb{L}$ for the loss, point-wise
and full risk throughout this paper.
The *Bayes risk* is the minimal achievable value of the risk and is
denoted

$$\underline{\mathbb{L}}(\eta, M) := \inf_{v} \mathbb{L}(\eta, v, M) = \mathbb{E}_{\mathsf{X} \sim M}\big[\underline{L}(\eta(\mathsf{X}))\big],$$

where

$$\underline{L}(\eta) := \inf_{v \in \mathcal{V}} L(\eta, v)$$

is the *point-wise* or *conditional Bayes risk*.

There has been increasing awareness of the importance of the conditional Bayes risk curve $\eta \mapsto \underline{L}(\eta)$ — also known as “generalized entropy” (Grünwald and Dawid, 2004) — in the analysis of losses for probability estimation (Kalnishkan et al., 2004, 2007; Abernethy et al., 2009; Masnadi-Shirazi and Vasconcelos, 2009). Below we will see how it is effectively the curvature of $\underline{L}$ that determines much of the structure of these losses.

## 3 Losses for Class Probability Estimation

We begin by considering CPE losses, that is, functions
$\ell\colon \{-1, 1\} \times [0, 1] \to [0, \infty)$, and briefly summarise a number of
important existing structural results for *proper losses* — a large,
natural class of losses for class probability estimation.

### 3.1 Proper, Fair, Definite and Regular Losses

There are a few properties of losses for probability estimation that we will
require.
If $\hat q$ is to be interpreted as an estimate of the true positive class
probability $\eta$ (*i.e.*, when $\mathsf{Y} \sim \eta$) then it is desirable to require
that $L(\eta, \hat q)$ be minimised by $\hat q = \eta$ for all $\eta \in [0, 1]$.
Losses that satisfy this constraint are said to be *Fisher consistent* and
are known as *proper losses* (Buja et al., 2005; Gneiting and Raftery, 2007).
That is, a proper loss satisfies $L(\eta, \eta) = \underline{L}(\eta)$
for all $\eta \in [0, 1]$.
A *strictly proper* loss is a proper loss for which the minimiser of
$L(\eta, \hat q)$ over $\hat q$ is unique.

We will say a loss is *fair* whenever

$$\ell_1(1) = \ell_{-1}(0) = 0. \qquad (3)$$

That is, there is no loss incurred for perfect prediction.
The main place fairness is relied upon is in the integral representation of
Theorem 6 where it is used to get rid of some constants of
integration.
In order to explicitly construct losses from their associated “weight functions”
as shown in Theorem 7, we will require that the loss
be *definite*, that is, its point-wise Bayes risk for deterministic events
(*i.e.*, $\eta = 0$ or $\eta = 1$) must be bounded from below:

$$\underline{L}(0) > -\infty, \qquad \underline{L}(1) > -\infty. \qquad (4)$$

Since properness of a loss ensures $\underline{L}(\eta) = L(\eta, \eta)$, we see that a fair proper loss is necessarily definite since $\underline{L}(0) = L(0, 0) = \ell_{-1}(0) = 0$, and similarly for $\underline{L}(1) = \ell_1(1) = 0$. Conversely, if a proper loss is definite then the finite values $\ell_{-1}(0)$ and $\ell_1(1)$ can be subtracted from $\ell_{-1}$ and $\ell_1$ to make it fair.

Finally, for Theorem 4 to hold at the endpoints of the unit
interval, we require a loss to be *regular* (this
is equivalent to the conditions of Savage (1971) and
Schervish (1989));
that is,

$$\lim_{\hat q \to 0} \hat q\, \ell_1(\hat q) = \lim_{\hat q \to 1} (1 - \hat q)\, \ell_{-1}(\hat q) = 0. \qquad (5)$$

Intuitively, this condition ensures that making mistakes on events that never happen should not incur a penalty. Most of the situations we consider in the remainder of this paper will involve losses which are proper, fair, definite and regular.

### 3.2 The Structure of Proper Losses

A key result in the study of proper losses is originally due to Shuford et al. (1966), though our presentation follows that of Buja et al. (2005). It characterises proper losses for probability estimation via a constraint on the relationship between their partial losses.

###### Theorem 1

Suppose $\ell$ is a CPE loss and that its partial losses $\ell_1$ and $\ell_{-1}$ are both differentiable. Then $\ell$ is a proper loss if and only if for all $\hat q \in (0, 1)$

$$\frac{-\ell_1'(\hat q)}{1 - \hat q} = \frac{\ell_{-1}'(\hat q)}{\hat q} = w(\hat q) \qquad (6)$$

for some non-negative *weight function* $w\colon (0, 1) \to [0, \infty)$.

The equalities in (6) should be interpreted in the almost-everywhere sense.

This simple characterisation of the structure of proper losses has a number of interesting implications. Observe from (6) that if $\ell$ is proper, given $\ell_1$ we can determine $\ell_{-1}$ or vice versa. Also, the partial derivative of the conditional risk can be seen to be the product of a linear term and the weight function:

###### Corollary 2

If $\ell$ is a differentiable proper loss then for all $\eta, \hat q \in (0, 1)$

$$\frac{\partial}{\partial \hat q} L(\eta, \hat q) = (\hat q - \eta)\, w(\hat q). \qquad (7)$$

Another corollary, observed by Buja et al. (2005), is that the weight function is related to the curvature of the conditional Bayes risk $\underline{L}$.

###### Corollary 3

Let $\ell$ be a twice differentiable
proper loss with weight function $w$ defined as in equation
(6).
(The restriction to differentiable losses can be removed in most cases
if generalised weight functions — that is, possibly infinite but
defining a measure on $(0, 1)$ — are permitted. For example, the
weight function for the 0-1 loss is $2\delta(c - \frac{1}{2})$.)
Then for all $c \in (0, 1)$ its conditional Bayes risk satisfies

$$w(c) = -\underline{L}''(c). \qquad (8)$$

One immediate consequence of this corollary is that the conditional Bayes risk for a proper loss is always concave. Along with an extra constraint, this gives another characterisation of proper losses (Savage, 1971; Reid and Williamson, 2009a).
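The relationship in Corollary 3 is easy to check numerically. The following sketch (our own helper names, not code from the paper) takes log loss, whose conditional Bayes risk is the binary entropy, and recovers its known weight function $w(c) = 1/(c(1-c))$ by a finite-difference estimate of $-\underline{L}''$:

```python
import math

def bayes_risk_log(c):
    # Conditional Bayes risk of log loss: the binary entropy.
    return -c * math.log(c) - (1 - c) * math.log(1 - c)

def weight_from_risk(c, h=1e-5):
    # w(c) = -L''(c), estimated with a central second difference.
    return -(bayes_risk_log(c + h) - 2 * bayes_risk_log(c) + bayes_risk_log(c - h)) / h**2

for c in (0.2, 0.5, 0.7):
    assert abs(weight_from_risk(c) - 1 / (c * (1 - c))) < 1e-2
```

The same recipe works for any twice-differentiable proper loss: supply its conditional Bayes risk and read off the weight function.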

###### Theorem 4

(Savage) A loss function $\ell$ is proper if and only if its point-wise Bayes risk $\underline{L}$ is concave and for each $\eta, \hat q \in (0, 1)$

$$L(\eta, \hat q) = \underline{L}(\hat q) + (\eta - \hat q)\, \underline{L}'(\hat q). \qquad (9)$$

Furthermore, if $\ell$ is regular this characterisation also holds at the endpoints $\eta, \hat q \in \{0, 1\}$.

This link between losses and concave functions makes it easy to establish a
connection, as Buja et al. (2005) do, between *regret*
for proper losses and *Bregman divergences*.
The latter are generalisations of distances and are defined in terms of convex
functions.
Specifically, if $\phi\colon \mathcal{S} \to \mathbb{R}$ is a convex function over some convex set
$\mathcal{S} \subseteq \mathbb{R}^d$ then its associated Bregman divergence
(a concise summary of Bregman divergences and their properties
is given by Banerjee et al. (2005, Appendix A))
is

$$B_\phi(s, s_0) := \phi(s) - \phi(s_0) - \langle s - s_0, \nabla\phi(s_0) \rangle$$

for any $s, s_0 \in \mathcal{S}$, where $\nabla\phi(s_0)$ is the gradient of $\phi$ at $s_0$. By noting that $\phi = -\underline{L}$ is convex over $\mathcal{S} = [0, 1]$, these definitions lead immediately to the following corollary of Theorem 4.

###### Corollary 5

If $\ell$ is a proper loss then its regret $\Delta L(\eta, \hat q) := L(\eta, \hat q) - \underline{L}(\eta)$ is the Bregman divergence associated with $\phi = -\underline{L}$. That is,

$$\Delta L(\eta, \hat q) = B_{-\underline{L}}(\eta, \hat q). \qquad (10)$$
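Corollary 5 is concrete enough to verify directly. A minimal sketch (names ours) for square loss: its conditional Bayes risk is $c(1-c)$, so the associated Bregman divergence of $\phi(c) = c^2 - c$ is $(\eta - \hat q)^2$, and this should equal the regret:

```python
def sq_risk(eta, q):
    # Conditional risk of square loss.
    return eta * (1 - q) ** 2 + (1 - eta) * q ** 2

def sq_bayes(eta):
    # Conditional Bayes risk of square loss: eta * (1 - eta).
    return eta * (1 - eta)

def bregman_sq(eta, q):
    # B_phi(eta, q) with phi(c) = -L(c) = c^2 - c.
    phi = lambda c: c * c - c
    dphi = lambda c: 2 * c - 1
    return phi(eta) - phi(q) - (eta - q) * dphi(q)

for eta, q in [(0.3, 0.6), (0.9, 0.2)]:
    regret = sq_risk(eta, q) - sq_bayes(eta)
    assert abs(regret - bregman_sq(eta, q)) < 1e-12
```

For square loss the divergence collapses to the familiar squared difference $(\eta - \hat q)^2$; other proper losses give other divergences (log loss gives the binary KL divergence).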

Many of the above results can be observed graphically by plotting the conditional risk $L(\eta, \hat q)$ for a proper loss as in Figure 1. Here we see the two partial losses on the left and right sides of the figure are related, for each fixed $\hat q$, by the linear map $\eta \mapsto \eta\, \ell_1(\hat q) + (1 - \eta)\, \ell_{-1}(\hat q)$. For each fixed $\eta$ the properness of $\ell$ requires that these convex combinations of the partial losses (each slice parallel to the left and right faces) are minimised when $\hat q = \eta$. Thus, the lines joining the partial losses are tangent to the conditional Bayes risk curve $\eta \mapsto \underline{L}(\eta)$ shown above the dotted diagonal. Since the conditional Bayes risk curve is the lower envelope of these tangents it is necessarily concave. The coupling of the partial losses via the tangents to the conditional Bayes risk curve demonstrates why much of the structure of proper losses is determined by the curvature of $\underline{L}$ — that is, by the weight function $w = -\underline{L}''$.

The relationship between a proper loss and its associated weight function is captured succinctly via the following representation of proper losses as a weighted integral of the cost-weighted misclassification losses defined in (2). The reader is referred to (Reid and Williamson, 2009b) for the details, proof and the history of this result.

###### Theorem 6

Let $\ell$ be a fair, proper loss. Then for each $y \in \{-1, 1\}$ and $\hat q \in (0, 1)$

$$\ell(y, \hat q) = \int_0^1 \ell_c(y, \hat q)\, w(c)\, dc, \qquad (11)$$

where $w = -\underline{L}''$. Conversely, if $\ell$ is defined by (11) for some weight function $w\colon (0, 1) \to [0, \infty)$ then it is proper.

Some example losses and their associated weight functions are given in Table 1. Buja et al. (2005) show that $\ell$ is strictly proper if and only if $w > 0$, in the sense that $w$ has non-zero mass on every open subset of $(0, 1)$.
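The integral representation (11) can be checked numerically. The following sketch (helper names ours) integrates the cost-weighted losses of (2) against the constant weight $w(c) = 2$ and recovers the square-loss partials $(1-\hat q)^2$ and $\hat q^2$:

```python
def cw_loss(y, q, c):
    # Cost-weighted misclassification loss l_c of equation (2).
    if y == 1:
        return (1 - c) if q <= c else 0.0
    return c if q > c else 0.0

def integrated_loss(y, q, n=20000):
    # Midpoint-rule approximation of ∫_0^1 l_c(y, q) w(c) dc with w(c) = 2.
    h = 1.0 / n
    return sum(cw_loss(y, q, (i + 0.5) * h) * 2.0 * h for i in range(n))

q = 0.3
assert abs(integrated_loss(1, q) - (1 - q) ** 2) < 1e-3   # (1 - q)^2 = 0.49
assert abs(integrated_loss(-1, q) - q ** 2) < 1e-3        # q^2 = 0.09
```

Replacing the constant weight with $1/(c(1-c))$ would recover log loss instead, in line with Table 1.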

Loss | $\ell_1(\hat q)$ | $\ell_{-1}(\hat q)$ | $w(c)$ | $\underline{L}(c)$
---|---|---|---|---
0-1 | $\llbracket \hat q \le \frac12 \rrbracket$ | $\llbracket \hat q > \frac12 \rrbracket$ | $2\delta(c - \frac12)$ | $\min(c, 1 - c)$
Cost-weighted ($c_0$) | $(1 - c_0)\llbracket \hat q \le c_0 \rrbracket$ | $c_0\llbracket \hat q > c_0 \rrbracket$ | $\delta(c - c_0)$ | $\min((1 - c_0)c,\ c_0(1 - c))$
Square | $(1 - \hat q)^2$ | $\hat q^2$ | $2$ | $c(1 - c)$
Log | $-\log \hat q$ | $-\log(1 - \hat q)$ | $\frac{1}{c(1 - c)}$ | $-c\log c - (1 - c)\log(1 - c)$
Boosting | $\sqrt{\frac{1 - \hat q}{\hat q}}$ | $\sqrt{\frac{\hat q}{1 - \hat q}}$ | $\frac{1}{2}(c(1 - c))^{-\frac32}$ | $2\sqrt{c(1 - c)}$

The following theorem from Reid and Williamson (2009a) shows how to explicitly construct a loss in terms of a weight function.

###### Theorem 7

Given a weight function $w\colon (0, 1) \to [0, \infty)$, let $W(c) := \int^c w(t)\, dt$ and $\overline{W}(c) := \int^c W(t)\, dt$ be (indefinite) integrals of $w$ and $W$. Then the loss defined by

$$\ell(y, \hat q) = \llbracket y = 1 \rrbracket \big({-\overline{W}(\hat q)} - (1 - \hat q)\, W(\hat q)\big) + \llbracket y = -1 \rrbracket \big({-\overline{W}(\hat q)} + \hat q\, W(\hat q)\big) \qquad (12)$$

is a proper loss. Additionally, if $\overline{W}(0)$ and $\overline{W}(1)$ are both finite then

$$\ell(y, \hat q) = \llbracket y = 1 \rrbracket \big(\overline{W}(1) - \overline{W}(\hat q) - (1 - \hat q)\, W(\hat q)\big) + \llbracket y = -1 \rrbracket \big(\overline{W}(0) - \overline{W}(\hat q) + \hat q\, W(\hat q)\big) \qquad (13)$$

is a fair, proper loss.

Observe that if $w$ and $\tilde w$ are weight functions which differ on a set of measure zero then they will lead to the same loss. A simple corollary to Theorem 6 is that the partial losses are given by

$$\ell_1(\hat q) = \int_{\hat q}^1 (1 - c)\, w(c)\, dc \quad \text{and} \quad \ell_{-1}(\hat q) = \int_0^{\hat q} c\, w(c)\, dc \qquad (14)$$

for $\hat q \in (0, 1)$.
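The formulas in (14) make the weight-function parametrisation directly usable: one can construct a loss numerically from any chosen $w$. The sketch below (names ours) does this for $w(c) = 1/(c(1-c))$ and confirms that the integrals reduce to the log-loss partials $-\log \hat q$ and $-\log(1 - \hat q)$:

```python
import math

def w(c):
    # The log-loss weight function from Table 1.
    return 1.0 / (c * (1 - c))

def l1(q, n=20000):
    # l_1(q) = ∫_q^1 (1 - c) w(c) dc, midpoint rule.
    h = (1 - q) / n
    return sum((1 - (q + (i + 0.5) * h)) * w(q + (i + 0.5) * h) * h for i in range(n))

def lm1(q, n=20000):
    # l_{-1}(q) = ∫_0^q c w(c) dc, midpoint rule.
    h = q / n
    return sum(((i + 0.5) * h) * w((i + 0.5) * h) * h for i in range(n))

q = 0.25
assert abs(l1(q) + math.log(q)) < 1e-4       # -log(0.25)
assert abs(lm1(q) + math.log(1 - q)) < 1e-4  # -log(0.75)
```

Any other non-negative $w$ can be substituted to produce a different proper loss, which is precisely the design freedom Theorem 7 formalises.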

### 3.3 Symmetric Losses

We will say a loss $\ell$ is *symmetric* if $\ell_1(\hat q) = \ell_{-1}(1 - \hat q)$
for all $\hat q \in [0, 1]$. We say a weight function $w$ for a proper loss,
or the conditional Bayes risk $\underline{L}$, is
*symmetric* if $w(c) = w(1 - c)$ or $\underline{L}(c) = \underline{L}(1 - c)$ for all $c \in [0, 1]$.
Perhaps unsurprisingly, an immediate consequence of Theorem 1
is that these two notions are identical.

###### Corollary 8

A proper loss is symmetric if and only if its weight function is symmetric.

Requiring a loss to be proper and symmetric constrains the partial losses significantly. Properness alone completely specifies one partial loss from the other. Now suppose in addition that $\ell$ is symmetric. Combining $\ell_1(\hat q) = \ell_{-1}(1 - \hat q)$ with (6) implies

$$\ell_{-1}'(1 - \hat q) = \frac{1 - \hat q}{\hat q}\, \ell_{-1}'(\hat q). \qquad (15)$$

This shows that $\ell_{-1}$ is completely determined by its restriction to $[0, \frac12]$ (or $[\frac12, 1]$). Thus in order to specify a symmetric proper loss, one needs to only specify one of the partial losses on one half of the interval $[0, 1]$. Assuming $w$ is continuous at $\frac12$ (or equivalently that $w$ has no atoms at $\frac12$), by integrating both sides of (15) we can derive an explicit formula for the other half of $\ell_{-1}$ in terms of that which is specified:

$$\ell_{-1}(\hat q) = \ell_{-1}\!\left(\tfrac12\right) + \int_{1 - \hat q}^{\frac12} \frac{1 - t}{t}\, \ell_{-1}'(t)\, dt, \qquad (16)$$

which works for determining $\ell_{-1}$ on either $(\frac12, 1]$ or $[0, \frac12)$ when $\ell_{-1}$ is specified on $[0, \frac12]$ or $[\frac12, 1]$ respectively (recalling the usual convention that $\int_a^b = -\int_b^a$). We have thus shown:

###### Theorem 9

A symmetric proper loss $\ell$ is completely determined by either one of its partial losses restricted to one half of the unit interval; the remainder of the partial loss is given by (16).
We demonstrate (16) with four examples. Suppose that for . Then one can readily determine the complete partial loss to be

(17) |

Suppose instead that for . In that case we obtain

(18) |

Suppose for . Then one can determine that

Finally consider specifying that for . In this case we obtain that

## 4 Composite Losses

General loss functions are often constructed with the aid of a
*link function*.
For a particular set of prediction values $\mathcal{V}$ this is any continuous
mapping $\psi\colon [0, 1] \to \mathcal{V}$.
In this paper, our focus will be *composite losses* for binary class
probability estimation.
These are the composition of a CPE loss $\ell\colon \{-1, 1\} \times [0, 1] \to [0, \infty)$
and the inverse of a *link function* $\psi$, an invertible
mapping from the unit interval to some range of values.
Unless stated otherwise we will assume $\mathcal{V} \subseteq \mathbb{R}$.
We will denote a composite loss by

$$\lambda(y, v) := \ell(y, \psi^{-1}(v)). \qquad (19)$$
The classical motivation for link functions (McCullagh and Nelder, 1989) is that often in estimating $\eta$ one uses a parametric representation of $\hat\eta\colon \mathcal{X} \to [0, 1]$ which has a natural scale not matching $[0, 1]$. Traditionally one writes $\hat\eta = \psi^{-1} \circ h$, where $\psi^{-1}$ is the “inverse link” (and $\psi$ is of course the forward link). The function $h\colon \mathcal{X} \to \mathcal{V}$ is the hypothesis. Often $h$ is parametrised linearly in a parameter vector $\theta$. In such a situation it is computationally convenient if $\lambda(y, h(x))$ is convex in $h(x)$ (which implies it is convex in $\theta$ when $h$ is linear in $\theta$).

Often one will choose the loss first (tailoring its properties
by the weighting given according to $w$), and *then* choose the link
somewhat arbitrarily to map the hypotheses appropriately. An interesting
alternative perspective arises in the literature on “elicitability”.
Lambert et al. (2008) (see also (Gneiting, 2009)) provide a general characterisation of
proper scoring rules (i.e. losses) for general *properties* of
distributions, that is,
continuous and locally non-constant functions $\Gamma$ which assign a real
value to each distribution over a finite sample space. In the binary case,
these properties provide another interpretation of links that is complementary
to the usual one that treats the inverse link $\psi^{-1}$ as a way of
interpreting scores as class probabilities.

To see this, we first identify distributions over $\{-1, 1\}$ with $\eta \in [0, 1]$, the
probability of observing 1.
In this case properties are continuous, locally non-constant maps
$\Gamma\colon [0, 1] \to \mathcal{V}$.
When a link function $\psi$ is continuous it can therefore be interpreted as a
property since its assumed invertibility implies it is locally non-constant.
A property $\Gamma$ is said to be *elicitable*
whenever there exists a strictly proper loss for it, so that
the composite loss $\lambda = \ell(\cdot, \Gamma^{-1}(\cdot))$ satisfies for all $\eta \in [0, 1]$

$$\Gamma(\eta) = \arg\min_{v \in \mathcal{V}}\ \eta\, \lambda_1(v) + (1 - \eta)\, \lambda_{-1}(v).$$

Theorem 1 of (Lambert et al., 2008) shows that $\Gamma$ is
elicitable if and only if the level set $\Gamma^{-1}(v)$ is convex for all
$v \in \mathcal{V}$.
This immediately gives us a characterisation of “proper” link functions:
those that are both continuous and have convex level sets in $[0, 1]$ — they
are the non-decreasing continuous functions. Thus in Lambert’s perspective,
one chooses a “property” first (i.e. the invertible
link) and *then* chooses the
proper loss.

### 4.1 Proper Composite Losses

We will call a composite loss (19)
a *proper composite loss* if $\ell$
in (19) is a proper loss for class probability estimation.
As in the case of losses for probability estimation, the requirement that a
composite loss be proper imposes some constraints on its partial losses.
Many of the results for proper losses carry over to composite losses with some
extra factors to account for the link function.

###### Theorem 10

Let $\lambda(y, v) = \ell(y, \psi^{-1}(v))$ be a composite loss with differentiable and strictly monotone link $\psi$ and suppose the partial losses $\lambda_1$ and $\lambda_{-1}$ are both differentiable. Then $\lambda$ is a proper composite loss if and only if there exists a weight function $w\colon (0, 1) \to [0, \infty)$ such that for all $\hat q \in (0, 1)$

$$-\lambda_1'(\psi(\hat q)) = (1 - \hat q)\, \frac{w(\hat q)}{\psi'(\hat q)} \quad \text{and} \quad \lambda_{-1}'(\psi(\hat q)) = \hat q\, \frac{w(\hat q)}{\psi'(\hat q)}, \qquad (20)$$

where equality is in the almost-everywhere sense. Furthermore, the ratio $w(\hat q)/\psi'(\hat q)$ is of one sign on $(0, 1)$.

Proof This is a direct consequence of Theorem 1
for proper losses for probability estimation and the chain rule applied to
$\ell_y(\hat q) = \lambda_y(\psi(\hat q))$. Since $\psi$ is assumed to be strictly monotonic we know $\psi'(\hat q) \neq 0$ and so, since $w(\hat q) \ge 0$, the ratio $w(\hat q)/\psi'(\hat q)$ is well defined and of one sign.

As we shall see, the ratio $w/\psi'$ is a key quantity in the analysis of proper composite losses. For example, Corollary 2 has a natural analogue in terms of it that will be of use later. It is obtained by letting $\hat q = \psi^{-1}(v)$ and using the chain rule.

###### Corollary 11

Suppose $\lambda = \ell \circ \psi^{-1}$ is a proper composite loss with conditional risk denoted $L(\eta, v) := \eta\, \lambda_1(v) + (1 - \eta)\, \lambda_{-1}(v)$. Then

$$\frac{\partial}{\partial v} L(\eta, v) = \big(\psi^{-1}(v) - \eta\big)\, \frac{w(\psi^{-1}(v))}{\psi'(\psi^{-1}(v))}. \qquad (21)$$

Loosely speaking then, $w/\psi'$ is a “co-ordinate free” weight function for composite losses, where the link $\psi^{-1}$ is interpreted as a mapping from arbitrary prediction values $v$ to values which can be interpreted as probabilities.

Another immediate corollary of Theorem 10 shows how properness is characterised by a particular relationship between the choice of link function and the choice of partial composite losses.

###### Corollary 12

Let $\lambda$ be a composite loss with differentiable partial losses $\lambda_1$ and $\lambda_{-1}$. Then $\lambda$ is proper if and only if the link $\psi$ satisfies

$$\psi^{-1}(v) = \frac{\lambda_{-1}'(v)}{\lambda_{-1}'(v) - \lambda_1'(v)}. \qquad (22)$$

Proof
Substituting $v = \psi(\hat q)$ into (20)
yields
$\hat q\, \lambda_1'(\psi(\hat q)) + (1 - \hat q)\, \lambda_{-1}'(\psi(\hat q)) = 0$ and
solving this for $\psi$ gives the result.

These results give some insight into the “degrees of freedom” available when specifying proper composite losses. Theorem 10 shows that the partial losses are completely determined once the weight function $w$ and the link $\psi$ are fixed (up to an additive constant). Corollary 12 shows that for a given link $\psi$ one can specify one of the partial losses, but then properness fixes the other partial loss. Similarly, given an arbitrary choice of the partial losses, equation (22) gives the single link which will guarantee the overall loss is proper. We see then that Corollary 12 provides us with
a way of constructing a *reference link* for arbitrary composite losses
specified by their partial losses.
The reference link ensures that the probability estimate $\psi^{-1}(v)$ attached to the risk-minimising prediction $v$ coincides with the underlying class probability $\eta$,
and thus *calibrates* a given composite loss in the sense of Cohen and Goldszmidt (2004).

We now briefly consider an application of the parametrisation of proper losses as a weight function and link. In order to implement Stochastic Gradient Descent (SGD) algorithms one needs to compute the derivative of the loss with respect to predictions $v$. Letting $\hat q = \psi^{-1}(v)$ be the probability estimate associated with the prediction $v$, we can use (21) with $\eta = 1$ and $\eta = 0$ to obtain the update rules for positive and negative examples:

$$\frac{\partial}{\partial v}\, \lambda_1(v) = -(1 - \hat q)\, \frac{w(\hat q)}{\psi'(\hat q)}, \qquad (23)$$

$$\frac{\partial}{\partial v}\, \lambda_{-1}(v) = \hat q\, \frac{w(\hat q)}{\psi'(\hat q)}. \qquad (24)$$

Given an arbitrary weight function $w$ (which defines a proper loss via Corollary 2 and Theorem 4) and link $\psi$, the above equations show that one could implement SGD directly parametrised in terms of $w$ and $\psi$ without needing to explicitly compute the partial losses themselves.
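The update rules (23)–(24) can be sketched directly in code, parametrised only by a weight function and link (all names here are ours, not from the paper). With the log-loss weight $w(q) = 1/(q(1-q))$ and the logit link the ratio $w/\psi'$ is identically one, so the gradient reduces to the familiar logistic-regression form $\sigma(v) - \llbracket y = 1 \rrbracket$:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def grad_v(y, v, w, dpsi, inv_psi):
    """d/dv of the composite loss per (23)-(24):
    -(1 - q) w(q)/psi'(q) for y = +1 and q w(q)/psi'(q) for y = -1."""
    q = inv_psi(v)
    return (-(1 - q) if y == 1 else q) * w(q) / dpsi(q)

w = lambda q: 1.0 / (q * (1 - q))      # log-loss weight function
dpsi = lambda q: 1.0 / (q * (1 - q))   # derivative of the logit link

for y in (1, -1):
    for v in (-1.0, 0.0, 2.0):
        expected = sigmoid(v) - (1.0 if y == 1 else 0.0)
        assert abs(grad_v(y, v, w, dpsi, sigmoid) - expected) < 1e-9
```

Swapping in a different $w$ or $\psi$ changes the update without ever writing down the partial losses, which is the point made in the text.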

Finally, we make a note of an analogue of Corollary 5 for composite losses. It shows that the regret for an arbitrary composite loss is related to a Bregman divergence via its link.

###### Corollary 13

Let $\lambda = \ell \circ \psi^{-1}$ be a proper composite loss with invertible link $\psi$. Then for all $\eta \in [0, 1]$ and $v \in \mathcal{V}$,

$$L(\eta, v) - \underline{L}(\eta) = B_{-\underline{L}}\big(\eta, \psi^{-1}(v)\big). \qquad (25)$$

### 4.2 Margin Losses

The *margin* associated with a real-valued prediction $v \in \mathbb{R}$ and label
$y \in \{-1, 1\}$ is the product $yv$.
Any function $\phi\colon \mathbb{R} \to [0, \infty)$ can be used as a *margin loss* by
interpreting $\phi(yv)$ as the penalty for predicting $v$ for an instance with
label $y$.
Margin losses are inherently symmetric since $\phi(yv) = \phi((-y)(-v))$, and so the penalty
given for predicting $v$ when the label is $y$ is necessarily the
same as the penalty for predicting $-v$ when the label is $-y$.
Margin losses have attracted a lot of attention (Bartlett et al., 2000)
because of their central role in Support Vector Machines
(Cortes and Vapnik, 1995). In this section we explore the relationship between these margin losses and the more general class of composite losses and, in particular, symmetric composite losses.

Recall that a general composite loss is of the form $\lambda(y, v) = \ell(y, \psi^{-1}(v))$ for a CPE loss $\ell$ and an invertible link $\psi$. We would like to understand when margin losses can be understood as losses suitable for probability estimation tasks. As discussed above, proper losses are a natural class of losses over $[0, 1]$ for probability estimation, so a natural question in this vein is the following: given a margin loss $\phi$, can we choose a link $\psi$ so that there exists a proper loss $\ell$ such that $\phi(yv) = \ell(y, \psi^{-1}(v))$? In this case the proper loss will be $\ell(y, \hat q) = \phi(y\, \psi(\hat q))$.

The following corollary of Theorem 10 gives necessary and sufficient conditions on the choice of link to guarantee when a margin loss can be expressed as a proper composite loss.

###### Corollary 14

Suppose $\phi$ is a differentiable margin loss. Then $\phi$ can be expressed as a proper composite loss if and only if the link $\psi$ satisfies

$$\psi^{-1}(v) = \frac{\phi'(-v)}{\phi'(-v) + \phi'(v)}. \qquad (26)$$

Proof
Margin losses, by definition, have partial losses
$\lambda_1(v) = \phi(v)$ and $\lambda_{-1}(v) = \phi(-v)$, which means
$\lambda_1'(v) = \phi'(v)$ and $\lambda_{-1}'(v) = -\phi'(-v)$.
Substituting these into (22) gives the result.

This result provides a way of interpreting predictions $v$
as probabilities $\psi^{-1}(v)$ in a
consistent manner,
for a problem defined by a margin loss.
Conversely, it also guarantees that using any other link to interpret
predictions as probabilities will be inconsistent.
(Strictly speaking, if the margin loss has “flat spots” — i.e., points $v$ where
$\phi'(v) = 0$ — then the choice of link may not be unique.)
Another immediate implication is that for a margin loss to be considered a
proper loss its link function must be *symmetric* in the sense that

$$\psi^{-1}(v) + \psi^{-1}(-v) = 1,$$

and so, by letting $v = 0$, we have $2\psi^{-1}(0) = 1$ and thus $\psi^{-1}(0) = \frac12$.

Corollary 14 can also be seen as a simplified and
generalised version of the argument by Masnadi-Shirazi and Vasconcelos (2009)
that a concave minimal conditional risk function and a symmetric link completely
determine a margin loss.
(Shen (2005, Section 4.4) seems to have been the first to view margin
losses from this more general perspective.)

We now consider a couple of specific margin losses and show how they can be associated with a proper loss through the choice of link given in Corollary 14. The exponential loss $\phi(v) = e^{-v}$ gives rise to a proper loss via the link

$$\psi^{-1}(v) = \frac{e^{v}}{e^{v} + e^{-v}} = \frac{1}{1 + e^{-2v}},$$

which has non-zero denominator. In this case $\psi^{-1}$ is just the logistic link evaluated at $2v$; equivalently, $\psi(\hat q) = \frac12 \log\big(\frac{\hat q}{1 - \hat q}\big)$. Now consider the family of margin losses parametrised by $\alpha > 0$:

$$\phi_\alpha(v) = \frac{1}{\alpha}\, \log\big(1 + e^{\alpha(1 - v)}\big).$$

This family of differentiable convex losses approximates the hinge loss as $\alpha \to \infty$ and was studied in the multiclass case by Zhang et al. (2009). Since these are all differentiable functions with $\phi_\alpha'(v) \neq 0$, Corollary 14 and a little algebra gives

$$\psi_\alpha^{-1}(v) = \frac{\phi_\alpha'(-v)}{\phi_\alpha'(-v) + \phi_\alpha'(v)} = \frac{e^{\alpha v} + e^{\alpha}}{e^{\alpha v} + e^{-\alpha v} + 2 e^{\alpha}}.$$

Examining this family of inverse links as $\alpha \to \infty$ gives some insight into why the hinge loss is a surrogate for classification but not probability estimation. When $\alpha$ is large, $\psi_\alpha^{-1}(v)$ is an estimate close to $\frac12$ for all but very large $|v|$. That is, in the limit all probability estimates sit infinitesimally to the right or left of $\frac12$ depending on the sign of $v$.
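The exponential-loss example is easy to verify numerically. The link prescribed by Corollary 14 is half the logit, and the sketch below (names ours) confirms that the minimiser over $v$ of the conditional risk $\eta e^{-v} + (1 - \eta) e^{v}$ is exactly $\psi(\eta)$, so interpreting $\hat q = \psi^{-1}(v)$ is Fisher consistent:

```python
import math

def cond_risk(eta, v):
    # Conditional risk of the exponential margin loss phi(v) = exp(-v).
    return eta * math.exp(-v) + (1 - eta) * math.exp(v)

def psi(q):
    # Link from Corollary 14 for exponential loss: half the logit.
    return 0.5 * math.log(q / (1 - q))

for eta in (0.2, 0.5, 0.8):
    vs = [i / 1000.0 for i in range(-3000, 3001)]
    v_star = min(vs, key=lambda v: cond_risk(eta, v))  # grid-search minimiser
    assert abs(v_star - psi(eta)) < 1e-2
```

Running the same grid search for the hinge-like family above would show the minimisers collapsing toward $\{-1, +1\}$, mirroring the degenerate probability estimates discussed in the text.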

## 5 Classification Calibration and Proper Losses

The notion of properness of a loss designed for class probability estimation is a natural one. If one is only interested in classification (rather than estimating probabilities) a weaker condition suffices. In this section we will relate the weaker condition to properness.

### 5.1 Classification Calibration for CPE Losses

We begin by giving a definition of classification calibration
for CPE losses (*i.e.*, losses over
the unit interval $[0, 1]$) and relate it to composite losses via a link.

###### Definition 15

We say a CPE loss $\ell$ is *classification calibrated at $c \in (0, 1)$*,
and write $\ell$
is $\mathrm{CC}_c$, if the associated conditional risk $L$ satisfies

$$\forall \eta \neq c\colon \quad \inf_{\hat q\,:\,(\hat q - c)(\eta - c) \le 0} L(\eta, \hat q) > \underline{L}(\eta). \qquad (27)$$

The expression constraining the infimum ensures that $\hat q$ is on the opposite side of $c$ to $\eta$, or $\hat q = c$.

The condition $\mathrm{CC}_{\frac12}$ is equivalent to what is called “classification calibrated” by Bartlett et al. (2006) and “Fisher consistent for classification problems” by Lin (2002), although their definitions were only for margin losses.

One might suspect that there is a connection between classification calibration at $c$ and standard Fisher consistency for class probability estimation losses. The following theorem, which captures the intuition behind the “probing” reduction (Langford and Zadrozny, 2005), characterises the situation.

###### Theorem 16

A CPE loss $\ell$ is $\mathrm{CC}_c$ for all $c \in (0, 1)$ if and only if $\ell$ is strictly proper.

Proof That $\ell$ is $\mathrm{CC}_c$ for all $c \in (0, 1)$ is equivalent to requiring $L(\eta, \hat q) > \underline{L}(\eta)$ for all $\hat q \neq \eta$ (choose $c$ strictly between $\hat q$ and $\eta$), which means $\ell$ is strictly proper.

The following theorem is a generalisation of the characterisation of $\mathrm{CC}_{\frac12}$ for margin losses via the condition $\phi'(0) < 0$ due to Bartlett et al. (2006).

###### Theorem 17

Suppose $\ell$ is a CPE loss and suppose that $\ell_1'$ and $\ell_{-1}'$ exist everywhere. Then for any $c \in (0, 1)$, $\ell$ is $\mathrm{CC}_c$ if and only if

$$c\, \ell_1'(c) + (1 - c)\, \ell_{-1}'(c) = 0 \quad \text{and} \quad \ell_1'(c) < 0 < \ell_{-1}'(c). \qquad (28)$$

Proof Since $\ell_1'$ and $\ell_{-1}'$ are assumed to exist everywhere,
$\frac{\partial}{\partial \hat q} L(\eta, \hat q) = \eta\, \ell_1'(\hat q) + (1 - \eta)\, \ell_{-1}'(\hat q)$ exists for all $\eta$. That $\ell$ is $\mathrm{CC}_c$ is equivalent to

$$\frac{\partial}{\partial \hat q} L(\eta, \hat q)\Big|_{\hat q = c} < 0 \ \ \text{for all } \eta > c \quad \text{and} \quad \frac{\partial}{\partial \hat q} L(\eta, \hat q)\Big|_{\hat q = c} > 0 \ \ \text{for all } \eta < c \qquad (29)$$

$$\Longleftrightarrow \quad c\, \ell_1'(c) + (1 - c)\, \ell_{-1}'(c) = 0 \quad \text{and} \quad \ell_1'(c) < 0 < \ell_{-1}'(c), \qquad (30)$$

where we have used the fact that (29) with $\eta = 1$
and $\eta = 0$ respectively substituted implies $\ell_1'(c) < 0$ and
$\ell_{-1}'(c) > 0$.

If $\ell$ is proper, then by evaluating (7) at $\eta = 1$ and $\eta = 0$ we obtain $\ell_1'(c) = -(1 - c)\, w(c)$ and $\ell_{-1}'(c) = c\, w(c)$. Thus (30) implies $-(1 - c)\, w(c) < 0$ and $c\, w(c) > 0$, which holds if and only if $w(c) > 0$. We have thus shown the following corollary.

###### Corollary 18

If $\ell$ is proper with weight function $w$, then for any $c \in (0, 1)$, $\ell$ is $\mathrm{CC}_c$ if and only if $w(c) > 0$.

The simple form of the weight function for the cost-sensitive misclassification loss $\ell_{c_0}$ (namely $w(c) = \delta(c - c_0)$) gives the following corollary (confer Bartlett et al. (2006)):

###### Corollary 19

$\ell_{c_0}$ is $\mathrm{CC}_c$ if and only if $c = c_0$.
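Definition 15 lends itself to a direct numeric check. The sketch below (names ours) confirms that square loss — whose weight $w(c) = 2$ is positive everywhere — is classification calibrated at $c = 0.3$: estimates forced strictly to the wrong side of the threshold incur strictly more conditional risk than the Bayes risk, for every $\eta \neq c$:

```python
def cond_risk(eta, q):
    # Conditional risk of square loss.
    return eta * (1 - q) ** 2 + (1 - eta) * q ** 2

def wrong_side_inf(eta, c, n=10000):
    # Infimum of the conditional risk over estimates strictly on the
    # opposite side of c to eta, approximated on a grid.
    qs = [i / n for i in range(n + 1)]
    wrong = [q for q in qs if (q - c) * (eta - c) < 0]
    return min(cond_risk(eta, q) for q in wrong)

c = 0.3
for eta in (0.1, 0.25, 0.5, 0.9):
    bayes = cond_risk(eta, eta)  # square loss is proper: minimised at q = eta
    assert wrong_side_inf(eta, c) > bayes + 1e-6
```

Running the same check for a cost-weighted loss $\ell_{c_0}$ with $c \neq c_0$ would find wrong-side estimates with no extra risk, consistent with Corollary 19.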

### 5.2 Calibration for Composite Losses

The translation of the above results to general proper composite losses with invertible differentiable link is straightforward. Condition (27) becomes

$$\forall \eta \neq c\colon \quad \inf_{v\,:\,(\psi^{-1}(v) - c)(\eta - c) \le 0} L(\eta, v) > \underline{L}(\eta).$$

###### Corollary 20

A composite loss $\lambda = \ell \circ \psi^{-1}$ with invertible and differentiable link $\psi$ is $\mathrm{CC}_c$ for all $c \in (0, 1)$ if and only if the associated CPE loss $\ell$ is strictly proper.
