Exp-Concavity of Proper Composite Losses

by Parameswaran Kamalaruban, et al.

The goal of online prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied, and O(√T) and O(log T) regret bounds can be achieved for convex losses (Zinkevich, 2003) and for strictly convex losses with bounded first and second derivatives (Hazan et al., 2007), respectively. In special cases like the Aggregating Algorithm (Vovk, 1995) with mixable losses and the Weighted Average Algorithm (Kivinen and Warmuth, 1999) with exp-concave losses, it is possible to achieve O(1) regret bounds. van Erven (2012) has argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (the strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (Van Erven et al., 2012), we show that it is possible to transform (re-parameterize) any β-mixable binary proper loss into a β-exp-concave composite loss with the same β. In the multi-class case, we propose an approximation approach for this transformation.




1 Introduction

Loss functions are the means by which the quality of a prediction in a learning problem is evaluated. A composite loss (the composition of a class probability estimation (CPE) loss with an invertible link function, which is essentially just a re-parameterization) is proper if its risk is minimized when predicting the true underlying class probability (a formal definition is given later). Vernet et al. (2014) argue that there is no point in using losses that are neither proper nor proper composite, as they are inadmissible. Flexibility in the choice of loss function is important to tailor the solution to a learning problem (Buja et al. (2005), Hand (1994), Hand and Vinciotti (2003)), and it can be attained by characterizing the set of loss functions using natural parameterizations.

The goal of the learner in a game of prediction with expert advice (which is formally described in Section 2.5) is to predict as well as the best expert in the given pool of experts. The regret bound of the learner depends on the merging scheme used to merge the experts’ predictions and on the type of loss function used to measure the performance. It has already been shown that constant regret bounds are achievable for mixable losses when the Aggregating Algorithm is the merging scheme (Vovk, 1995), and for exp-concave losses when the Weighted Average Algorithm is the merging scheme (Kivinen and Warmuth, 1999). Exp-concavity trivially implies mixability. Even though the converse implication is not true in general, it can be made to hold under some re-parameterization. This paper discusses general conditions on proper losses under which they can be transformed into an exp-concave loss through a suitable link function. In the binary case, these conditions give two concrete formulas (Proposition 1 and Corollary 8) for link functions that can transform $\beta$-mixable proper losses into $\beta$-exp-concave, proper, composite losses. The explicit form of the link function given in Proposition 1 is derived using the same geometric construction used in van Erven (2012).

Further, we extend the work of Vernet et al. (2014) to provide a complete characterization of the exp-concavity of proper composite multi-class losses in terms of the Bayes risk associated with the underlying proper loss, and the link function. The mixability of proper losses (mixability of a proper composite loss is equivalent to the mixability of its generating proper loss) is studied in Van Erven et al. (2012). Using these characterizations (for the binary case), in Corollary 8 we derive an exp-concavifying link function that can also transform any $\beta$-mixable proper loss into a $\beta$-exp-concave composite loss. Since for multi-class losses these conditions do not hold in general, we propose a geometric approximation approach (Proposition 2) which takes a parameter $\epsilon$ and transforms the mixable loss function appropriately on a subset of the prediction space. When the prediction space is $\Delta^n$, any prediction belongs to the subset for sufficiently small $\epsilon$. In the conclusion we provide a way to use the Weighted Average Algorithm with learning rate $\beta$ for proper $\beta$-mixable but non-exp-concave loss functions to achieve a $\frac{\log N}{\beta}$ regret bound.

Exp-concave losses achieve an $O(\log T)$ regret bound in online convex optimization algorithms, which is a more general setting of online learning problems. Thus the exp-concavity characterization of composite losses could be helpful in constructing exp-concave losses for online learning problems.

The paper is organized as follows. In Section 2 we formally introduce the loss function, several loss types, conditional risk, proper composite losses and a game of prediction with expert advice. In Section 3 we consider our main problem: whether one can always find a link function to transform $\beta$-mixable losses into $\beta$-exp-concave losses. Section 4 concludes with a brief discussion. The impact of the choice of substitution function on the regret of the learner is explored via experiments in Appendix A. In Appendix B, we discuss the mixability conditions of probability games with continuous outcome space. Detailed proofs are in Appendix C.

2 Preliminaries and Background

This section provides the necessary background on loss functions, conditional risks, and the sequential prediction problem.

2.1 Notation

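We use the following notation throughout. A superscript prime, $A'$, denotes the transpose of the matrix or vector $A$, except when applied to a real-valued function, where it denotes the derivative ($f'$). We denote the matrix multiplication of compatible matrices $A$ and $B$ by $A \cdot B$, so the inner product of two vectors $x, y \in \mathbb{R}^n$ is $x' \cdot y$. Let $[n] := \{1, \dots, n\}$, $\mathbb{R}_+ := [0, \infty)$, and the $n$-simplex $\Delta^n := \{(p_1, \dots, p_n)' : 0 \le p_i \le 1 \ \forall i \in [n], \ \sum_{i \in [n]} p_i = 1\}$. If $x$ is an $n$-vector, $A = \mathrm{diag}(x)$ is the $n \times n$ matrix with entries $A_{i,i} = x_i$, $i \in [n]$, and $A_{i,j} = 0$ for $i \ne j$. If $A - B$ is positive definite (resp. semi-definite), then we write $A \succ B$ (resp. $A \succeq B$). We use $e_i^n$ to denote the $i$th $n$-dimensional unit vector, $e_i^n = (0, \dots, 0, 1, 0, \dots, 0)'$ when $i \in [n]$, and define $e_i^n = 0_n$ when $i > n$. The $n$-vector $1_n := (1, \dots, 1)'$. We write $[\![P]\!] = 1$ if $P$ is true and $[\![P]\!] = 0$ otherwise. Given a set $S$ and a weight vector $w$, the convex combination of the elements of the set w.r.t. the weight vector is denoted by $\mathrm{co}_w S$, and the convex hull of the set, which is the set of all possible convex combinations of the elements of the set, is denoted by $\mathrm{co}\, S$ (Rockafellar, 1970). If $S, T \subset \mathbb{R}^n$, then the Minkowski sum $S \boxplus T := \{s + t : s \in S, t \in T\}$. $\mathcal{Y}^{\mathcal{X}}$ represents the set of all functions $f : \mathcal{X} \to \mathcal{Y}$. Other notation (the Kronecker product $\otimes$, the Jacobian D, and the Hessian H) is defined in Appendix A of Van Erven et al. (2012). $Df(v)$ and $Hf(v)$ denote the Jacobian and Hessian of $f(v)$ w.r.t. $v$ respectively. When it is not clear from the context, we will explicitly mention the variable; for example $D_{\tilde{v}} f(v)$ where $v = h(\tilde{v})$.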

2.2 Loss Functions

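For a prediction problem with an instance space $\mathcal{X}$, outcome space $\mathcal{Y}$ and prediction space $\mathcal{V}$, a loss function $\ell : \mathcal{Y} \times \mathcal{V} \to \mathbb{R}_+$ (bivariate function representation) can be defined to assign a penalty $\ell(y, v)$ for predicting $v \in \mathcal{V}$ when the actual outcome is $y \in \mathcal{Y}$. When the outcome space $\mathcal{Y} = [n]$, the loss function is called a multi-class loss and it can be expressed in terms of its partial losses $\ell_i := \ell(i, \cdot)$ for any outcome $i \in [n]$, as $\ell(y, v) = \sum_{i \in [n]} [\![y = i]\!]\, \ell_i(v)$. The vector representation of the multi-class loss is given by $\ell : \mathcal{V} \to \mathbb{R}_+^n$, which assigns a vector $\ell(v) = (\ell_1(v), \dots, \ell_n(v))'$ to each prediction $v \in \mathcal{V}$. A loss is differentiable if all of its partial losses are differentiable. In this paper, we will use the bivariate function representation ($\ell(y, v)$) to denote a general loss function and the vector representation for multi-class loss functions.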

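The super-prediction set of a multi-class loss $\ell$ is defined as

$$ S_\ell := \{x \in \mathbb{R}^n : \exists v \in \mathcal{V},\ x \ge \ell(v)\}, $$

where the inequality is component-wise. For any dimension $n$ and $\beta \ge 0$, the $\beta$-exponential operator $E_\beta : [0, \infty]^n \to [0, 1]^n$ is defined by $E_\beta(x) := (e^{-\beta x_1}, \dots, e^{-\beta x_n})'$. For $\beta > 0$ it is clearly invertible with inverse $E_\beta^{-1}(z) = -\beta^{-1} (\ln z_1, \dots, \ln z_n)'$. The $\beta$-exponential transformation of the super-prediction set is given by

$$ E_\beta(S_\ell) := \{(e^{-\beta x_1}, \dots, e^{-\beta x_n})' : (x_1, \dots, x_n)' \in S_\ell\}, \quad \beta > 0. $$

A multi-class loss $\ell$ is convex if $v \mapsto \ell_y(v)$ is convex in $v$ for all $y \in [n]$, $\alpha$-exp-concave (for $\alpha > 0$) if $v \mapsto e^{-\alpha \ell_y(v)}$ is concave in $v$ for all $y \in [n]$, weakly mixable if the super-prediction set $S_\ell$ is convex (Kalnishkan and Vyugin, 2005), and $\beta$-mixable (for $\beta > 0$) if the set $E_\beta(S_\ell)$ is convex (Vovk and Zhdanov (2009), Vovk (1995)). The mixability constant $\beta_\ell$ of a loss $\ell$ is the largest $\beta$ such that $\ell$ is $\beta$-mixable; i.e. $\beta_\ell := \sup\{\beta > 0 : \ell \text{ is } \beta\text{-mixable}\}$. If the loss function $\ell$ is $\alpha$-exp-concave (resp. $\beta$-mixable) then it is $\alpha'$-exp-concave for any $0 < \alpha' \le \alpha$ (resp. $\beta'$-mixable for any $0 < \beta' \le \beta$), and its $\gamma$-scaled version ($\gamma \ell$) for some $\gamma > 0$ is $\frac{\alpha}{\gamma}$-exp-concave (resp. $\frac{\beta}{\gamma}$-mixable). If the loss $\ell$ is $\alpha$-exp-concave, then it is convex and $\alpha$-mixable.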
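The definition of exp-concavity can be probed numerically. The sketch below is an illustration rather than part of the paper: it restricts attention to binary losses parameterised by the predicted probability q of the first outcome, and tests concavity of q ↦ exp(−α ℓ_y(q)) with second differences on a grid. The helper names, the grid and the tolerance are assumptions made for this example, and a grid test can only give evidence, not a proof.

```python
import math

def is_concave_on_grid(f, lo=0.01, hi=0.99, steps=200, tol=1e-12):
    """Midpoint test: a concave f satisfies f(x-h) + f(x+h) <= 2 f(x)."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    return all(f(xs[i - 1]) + f(xs[i + 1]) - 2.0 * f(xs[i]) <= tol
               for i in range(1, steps))

def is_exp_concave(partial_losses, alpha):
    """alpha-exp-concavity: q -> exp(-alpha * l_y(q)) is concave for every y."""
    return all(is_concave_on_grid(lambda q, l=l: math.exp(-alpha * l(q)))
               for l in partial_losses)

# Binary log loss; q is the predicted probability of the first outcome.
log_loss = [lambda q: -math.log(q), lambda q: -math.log(1.0 - q)]

# Binary square loss in the sum-over-classes form:
# l_1(q) = (1-q)^2 + (0-(1-q))^2 = 2(1-q)^2 and l_2(q) = 2q^2.
square_loss = [lambda q: 2.0 * (1.0 - q) ** 2, lambda q: 2.0 * q ** 2]

if __name__ == "__main__":
    for name, loss, alphas in [("log", log_loss, (1.0, 1.5)),
                               ("square", square_loss, (0.25, 0.3))]:
        for a in alphas:
            print(name, a, is_exp_concave(loss, a))
```

For log loss with α = 1 the transformed partial losses are linear in q, which is the boundary case of concavity; this is why the test uses a small tolerance.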

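For a multi-class loss $\ell$, if the prediction space $\mathcal{V} = \Delta^n$ then it is said to be a multi-class probability estimation (multi-CPE) loss, where the predicted values are directly interpreted as probability estimates: $\ell : \Delta^n \to \mathbb{R}_+^n$. We will say a multi-CPE loss is fair whenever $\ell_i(e_i^n) = 0$ for all $i \in [n]$. That is, there is no loss incurred for perfect prediction. Examples of multi-CPE losses include the square loss $\ell_i^{\mathrm{sq}}(q) := \sum_{j \in [n]} ([\![i = j]\!] - q_j)^2$, the log loss $\ell_i^{\log}(q) := -\log q_i$, the absolute loss $\ell_i^{\mathrm{abs}}(q) := \sum_{j \in [n]} |[\![i = j]\!] - q_j|$, and the 0-1 loss $\ell_i^{01}(q) := [\![i \notin \arg\max_{j \in [n]} q_j]\!]$.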

2.3 Conditional and Full Risks

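Let $\mathsf{X}$ and $\mathsf{Y}$ be random variables defined on the instance space $\mathcal{X}$ and the outcome space $\mathcal{Y} = [n]$ respectively. Let $D$ be the joint distribution of $(\mathsf{X}, \mathsf{Y})$ and, for $x \in \mathcal{X}$, denote the conditional distribution by $p(x) = (p_1(x), \dots, p_n(x))'$ where $p_i(x) := P(\mathsf{Y} = i \mid \mathsf{X} = x)$ for all $i \in [n]$, and the marginal distribution by $M(x) := P(\mathsf{X} = x)$. For any multi-CPE loss $\ell$, the conditional risk $L_\ell$ is defined as

$$ L_\ell : \Delta^n \times \Delta^n \ni (p, q) \mapsto L_\ell(p, q) = \mathbb{E}_{\mathsf{Y} \sim p}[\ell_{\mathsf{Y}}(q)] = p' \cdot \ell(q) = \sum_{i \in [n]} p_i \ell_i(q) \in \mathbb{R}_+, $$

where $\mathsf{Y} \sim p$ represents a Multinomial distribution with parameter $p \in \Delta^n$. The full risk of the estimator function $q : \mathcal{X} \to \Delta^n$ is defined as

$$ \mathbb{L}_\ell(p, q, M) := \mathbb{E}_{(\mathsf{X}, \mathsf{Y}) \sim D}[\ell_{\mathsf{Y}}(q(\mathsf{X}))] = \mathbb{E}_{\mathsf{X} \sim M}[L_\ell(p(\mathsf{X}), q(\mathsf{X}))]. $$

Furthermore the Bayes risk is defined as

$$ \underline{\mathbb{L}}_\ell(p, M) := \inf_{q \in (\Delta^n)^{\mathcal{X}}} \mathbb{L}_\ell(p, q, M) = \mathbb{E}_{\mathsf{X} \sim M}[\underline{L}_\ell(p(\mathsf{X}))], $$

where $\underline{L}_\ell(p) = \inf_{q \in \Delta^n} L_\ell(p, q)$ is the conditional Bayes risk and is always concave (Gneiting and Raftery, 2007). If $\ell$ is fair, $\underline{L}_\ell(e_i^n) = \ell_i(e_i^n) = 0$. One can understand the effect of the choice of loss in terms of the conditional perspective (Reid and Williamson, 2011), which allows one to ignore the marginal distribution $M$ of $\mathsf{X}$, which is typically unknown.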

2.4 Proper and Composite Losses

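A multi-CPE loss $\ell : \Delta^n \to \mathbb{R}_+^n$ is said to be proper if for all $p \in \Delta^n$, $\underline{L}_\ell(p) = L_\ell(p, p) = p' \cdot \ell(p)$ (Buja et al. (2005), Gneiting and Raftery (2007)), and strictly proper if $\underline{L}_\ell(p) < L_\ell(p, q)$ for all $p, q \in \Delta^n$ with $p \ne q$. It is easy to see that the log loss, square loss, and 0-1 loss are proper while the absolute loss is not. Furthermore, both log loss and square loss are strictly proper, while 0-1 loss is proper but not strictly proper.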
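Properness can be illustrated with a small numerical experiment. The following sketch is an assumption-laden illustration (binary case only, grid search over q = (t, 1 − t), hypothetical function names): it computes the conditional risk p'·ℓ(q) and locates its minimiser for the log, square and absolute losses, using the standard definitions of those losses.

```python
import math

def conditional_risk(loss, p, q):
    """L(p, q) = sum_i p_i * loss_i(q) for probability vectors p and q."""
    return sum(pi * li for pi, li in zip(p, loss(q)))

def log_loss(q):
    return [-math.log(qi) for qi in q]

def square_loss(q):
    n = len(q)
    return [sum(((1.0 if i == j else 0.0) - q[j]) ** 2 for j in range(n))
            for i in range(n)]

def absolute_loss(q):
    n = len(q)
    return [sum(abs((1.0 if i == j else 0.0) - q[j]) for j in range(n))
            for i in range(n)]

def risk_minimiser(loss, p, steps=1000):
    """Grid search over binary predictions q = (t, 1 - t), endpoints excluded."""
    grid = [k / steps for k in range(1, steps)]
    return min(grid, key=lambda t: conditional_risk(loss, p, (t, 1.0 - t)))

if __name__ == "__main__":
    p = (0.7, 0.3)
    for name, loss in [("log", log_loss), ("square", square_loss),
                       ("absolute", absolute_loss)]:
        print(name, risk_minimiser(loss, p))
```

For the proper losses the minimiser sits at the true probability, whereas the absolute loss pushes the prediction towards a vertex of the simplex.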

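Given a proper loss $\ell$ with differentiable Bayes conditional risk $\underline{L}_\ell$, in order to be able to calculate derivatives easily, following Van Erven et al. (2012) we define

$$ \tilde{\Delta}^n := \{(p_1, \dots, p_{n-1})' : p \in \Delta^n\}, \qquad \Pi_\Delta : \Delta^n \ni p \mapsto \tilde{p} = (p_1, \dots, p_{n-1})' \in \tilde{\Delta}^n, $$
$$ \Pi_\Delta^{-1} : \tilde{\Delta}^n \ni \tilde{p} \mapsto \Big(\tilde{p}_1, \dots, \tilde{p}_{n-1}, 1 - \sum_{i \in [n-1]} \tilde{p}_i\Big)' \in \Delta^n, $$
$$ \tilde{\underline{L}}_\ell(\tilde{p}) := \underline{L}_\ell(\Pi_\Delta^{-1}(\tilde{p})), \qquad \tilde{\ell}(\tilde{p}) := (\ell_1(\Pi_\Delta^{-1}(\tilde{p})), \dots, \ell_{n-1}(\Pi_\Delta^{-1}(\tilde{p})))'. $$

Let $\tilde{\psi} : \tilde{\Delta}^n \to \mathcal{V}$ be continuous and strictly monotone (hence invertible) for some convex set $\mathcal{V} \subseteq \mathbb{R}^{n-1}$. It induces $\psi : \Delta^n \to \mathcal{V}$ via

$$ \psi := \tilde{\psi} \circ \Pi_\Delta. $$

Clearly $\psi$ is continuous and invertible with $\psi^{-1} = \Pi_\Delta^{-1} \circ \tilde{\psi}^{-1}$. We can now extend the notion of properness from $\Delta^n$ to the prediction space $\mathcal{V}$ using this link function. Given a proper loss $\ell : \Delta^n \to \mathbb{R}_+^n$, a proper composite loss $\ell^\psi : \mathcal{V} \to \mathbb{R}_+^n$ for multi-class probability estimation is defined as $\ell^\psi := \ell \circ \psi^{-1} = \ell \circ \Pi_\Delta^{-1} \circ \tilde{\psi}^{-1}$. We can easily see that the conditional Bayes risks of the composite loss and the underlying proper loss are equal ($\underline{L}_\ell = \underline{L}_{\ell^\psi}$). Every continuous proper loss has a convex super-prediction set (Vernet et al., 2014), so such losses are weakly mixable. Since applying a link function does not change the super-prediction set (it is just a re-parameterization), all proper composite losses are also weakly mixable.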
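As a concrete and familiar instance of a proper composite loss, the sketch below composes binary log loss with the logit link. This example is an illustration chosen here, not taken from the paper; it shows that the composite loss is the usual logistic loss and that the loss vectors traced out (and hence the super-prediction set) are the same under both parameterisations.

```python
import math

def logit(p):
    """Link psi: (0, 1) -> R."""
    return math.log(p / (1.0 - p))

def inv_logit(v):
    """Inverse link psi^{-1}: R -> (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def log_loss(p):
    """Partial losses (l_1, l_2) of binary log loss; p = P(outcome 1)."""
    return (-math.log(p), -math.log(1.0 - p))

def composite_log_loss(v):
    """Proper composite loss l o psi^{-1}, defined on real-valued scores v."""
    return log_loss(inv_logit(v))

if __name__ == "__main__":
    for v in (-2.0, 0.0, 1.5):
        print(v, composite_log_loss(v),
              (math.log1p(math.exp(-v)), math.log1p(math.exp(v))))
```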

2.5 Game of Prediction with Expert Advice

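Let $\mathcal{Y}$ be the outcome space, $\mathcal{V}$ be the prediction space, and $\ell : \mathcal{Y} \times \mathcal{V} \to \mathbb{R}_+$ be the loss function. Then a game of prediction with expert advice represented by the tuple ($\mathcal{Y}, \mathcal{V}, \ell$) can be described as follows: for each trial $t = 1, \dots, T$,

  • $N$ experts make their predictions $v_t^1, \dots, v_t^N \in \mathcal{V}$

  • the learner makes his own decision $v_t \in \mathcal{V}$

  • the environment reveals the actual outcome $y_t \in \mathcal{Y}$

Let $S = (y_1, \dots, y_T)$ be the outcome sequence in $T$ trials. Then the cumulative loss of the learner over $S$ is given by $L_{S,\ell} := \sum_{t=1}^T \ell(y_t, v_t)$, that of the $i$-th expert is given by $L_{S,\ell}^i := \sum_{t=1}^T \ell(y_t, v_t^i)$, and the regret of the learner is given by $R_{S,\ell} := L_{S,\ell} - \min_i L_{S,\ell}^i$. The goal of the learner is to predict as well as the best expert; to this end the learner tries to minimize the regret.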

When using the exponential weights algorithm (which is an important family of algorithms in the game of prediction with expert advice), at the end of each trial the weight of each expert is updated as $w_{t+1}^i \propto w_t^i \cdot e^{-\eta \ell(y_t, v_t^i)}$ for all $i \in [N]$, where $\eta$ is the learning rate and $w_t^i$ is the weight of the $i$-th expert at time $t$ (the weight vector of the experts at time $t$ is denoted by $w_t = (w_t^1, \dots, w_t^N)'$). Then, based on the weights of the experts, their predictions are merged using different merging schemes to make the learner’s prediction. The Aggregating Algorithm and the Weighted Average Algorithm are two important algorithms in the family of exponential weights algorithms.

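Consider multi-class games with outcome space $\mathcal{Y} = [n]$. In the Aggregating Algorithm with learning rate $\eta$, first the loss vectors of the experts and their weights are used to make a generalized prediction $g_t = (g_t(1), \dots, g_t(n))'$, which is given by

$$ g_t := E_\eta^{-1}\Big(\mathrm{co}_{w_t} \big\{E_\eta\big((\ell_1(v_t^i), \dots, \ell_n(v_t^i))'\big)\big\}_{i \in [N]}\Big) = E_\eta^{-1}\Big(\sum_{i \in [N]} w_t^i \big(e^{-\eta \ell_1(v_t^i)}, \dots, e^{-\eta \ell_n(v_t^i)}\big)'\Big). $$

Then this generalized prediction is mapped into a permitted prediction $v_t$ via a substitution function such that $(\ell_1(v_t), \dots, \ell_n(v_t))' \le c(\eta)\, (g_t(1), \dots, g_t(n))'$, where the inequality is element-wise and the constant $c(\eta)$ depends on the learning rate. If $\ell$ is $\eta$-mixable, then $E_\eta(S_\ell)$ is convex, so $g_t \in S_\ell$, and we can always choose a substitution function with $c(\eta) = 1$. Consequently for $\beta$-mixable losses, the learner of the Aggregating Algorithm is guaranteed to have regret bounded by $\frac{\log N}{\beta}$ (Vovk, 1995).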
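The two steps of the Aggregating Algorithm (generalized prediction, then substitution) can be sketched for one concrete game. The code below is a minimal illustration under assumptions that are not taken from the paper: outcomes are in {0, 1}, predictions are in [0, 1], the loss is (y − p)^2 with learning rate η = 2 (this loss is known to be 2-mixable), and the substitution function p = 1/2 + (g(0) − g(1))/2, clipped to [0, 1], is one standard choice for this game. All function names are hypothetical.

```python
import math
import random

ETA = 2.0  # assumed learning rate: (y - p)^2 on [0, 1] is 2-mixable

def square_loss(y, p):
    return (y - p) ** 2

def generalized_prediction(weights, preds, eta=ETA):
    """g(y) = -(1/eta) * ln sum_i w_i * exp(-eta * loss(y, p_i)), y in {0, 1}."""
    return [-math.log(sum(w * math.exp(-eta * square_loss(y, p))
                          for w, p in zip(weights, preds))) / eta
            for y in (0, 1)]

def substitution(g):
    """Map g to a permitted prediction p with loss(y, p) <= g(y) for both y."""
    return min(1.0, max(0.0, 0.5 + 0.5 * (g[0] - g[1])))

def run_aggregating_algorithm(expert_preds, outcomes, eta=ETA):
    """Return the learner's regret against the best expert."""
    n_experts = len(expert_preds[0])
    weights = [1.0 / n_experts] * n_experts
    learner_loss, expert_loss = 0.0, [0.0] * n_experts
    for preds, y in zip(expert_preds, outcomes):
        p = substitution(generalized_prediction(weights, preds, eta))
        learner_loss += square_loss(y, p)
        for i, pi in enumerate(preds):
            expert_loss[i] += square_loss(y, pi)
        # Exponential weights update followed by normalisation.
        weights = [w * math.exp(-eta * square_loss(y, pi))
                   for w, pi in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return learner_loss - min(expert_loss)

if __name__ == "__main__":
    rng = random.Random(0)
    preds = [[rng.random() for _ in range(5)] for _ in range(200)]
    ys = [rng.randint(0, 1) for _ in range(200)]
    print(run_aggregating_algorithm(preds, ys), math.log(5) / ETA)
```

Because the substitution step guarantees a per-round loss no larger than the generalized prediction, the regret printed by the demo stays below log N / η regardless of the outcome sequence.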

In the Weighted Average Algorithm with learning rate $\eta$, the experts’ predictions are simply merged according to their weights to make the learner’s prediction $v_t = \mathrm{co}_{w_t} \{v_t^i\}_{i \in [N]} = \sum_{i \in [N]} w_t^i v_t^i$, and this algorithm is guaranteed to have a $\frac{\log N}{\alpha}$ regret bound for $\alpha$-exp-concave losses (Kivinen and Warmuth, 1999). In either case, larger values of the constants $\beta$ and $\alpha$ give better regret bounds. Since an $\alpha$-exp-concave loss is $\beta$-mixable for some $\beta \ge \alpha$, the regret bound of the Weighted Average Algorithm is worse than that of the Aggregating Algorithm by a small constant factor. In Vovk (2001), it is noted that $(\ell_1(\mathrm{co}_{w_t}\{v_t^i\}_i), \dots, \ell_n(\mathrm{co}_{w_t}\{v_t^i\}_i))' \le g_t$ is always guaranteed only for $\eta$-exp-concave losses. Thus for $\alpha$-exp-concave losses, the Weighted Average Algorithm is equivalent to the Aggregating Algorithm with the weighted average of the experts’ predictions as its substitution function and $\alpha$ as the learning rate for both algorithms.
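A minimal sketch of the Weighted Average Algorithm follows, using binary log loss, which is 1-exp-concave in the predicted probability. The setup (random experts bounded away from 0 and 1, uniform initial weights, function names) is assumed for illustration only.

```python
import math
import random

def log_loss(y, p):
    """Binary log loss; p is the predicted probability of outcome 1."""
    return -math.log(p if y == 1 else 1.0 - p)

def weighted_average_algorithm(expert_preds, outcomes, alpha=1.0):
    """Return the learner's regret; expert_preds[t][i] is expert i at trial t."""
    n_experts = len(expert_preds[0])
    weights = [1.0 / n_experts] * n_experts
    learner_loss, expert_loss = 0.0, [0.0] * n_experts
    for preds, y in zip(expert_preds, outcomes):
        # Merging step: the learner plays the weighted average prediction.
        p = sum(w * q for w, q in zip(weights, preds))
        learner_loss += log_loss(y, p)
        for i, q in enumerate(preds):
            expert_loss[i] += log_loss(y, q)
        # Exponential weights update with learning rate alpha.
        weights = [w * math.exp(-alpha * log_loss(y, q))
                   for w, q in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return learner_loss - min(expert_loss)

if __name__ == "__main__":
    rng = random.Random(1)
    preds = [[0.05 + 0.9 * rng.random() for _ in range(4)] for _ in range(300)]
    ys = [rng.randint(0, 1) for _ in range(300)]
    print(weighted_average_algorithm(preds, ys), math.log(4))
```

No substitution step is needed here, which is the computational advantage over the Aggregating Algorithm mentioned in the text.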

Even though the choice of substitution function will not have any impact on the regret bound and the weight update mechanism of the Aggregating Algorithm, it will certainly have impact on the actual regret of the learner over a given sequence of outcomes. According to the results given in Appendix A (where we have empirically compared some substitution functions), this impact on the actual regret varies depending on the outcome sequence, and in general the regret values for practical substitution functions don’t differ much — thus we can stick with a computationally efficient substitution function.

3 Exp-Concavity of Proper Composite Losses

Exp-concavity of a loss is desirable for better (logarithmic) regret bounds in online convex optimization algorithms, and for efficient implementation of exponential weights algorithms. In this section we consider whether one can always find a link function that transforms a $\beta$-mixable proper loss into a $\beta$-exp-concave composite loss: first by using the geometry of the $\beta$-exponential transformation of the super-prediction set (Section 3.1), and then by using the characterization of the composite loss in terms of the associated Bayes risk (Sections 3.2 and 3.3).

3.1 Geometric approach

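In this section we will use the same construction used by van Erven (2012) to derive an explicit closed form of a link function that could re-parameterize any $\beta$-mixable loss into a $\beta$-exp-concave loss, under certain conditions which are explained below. Given a multi-class loss $\ell : \mathcal{V} \to \mathbb{R}_+^n$, define

$$ \ell(\mathcal{V}) := \{\ell(v) : v \in \mathcal{V}\}, \qquad B_\beta := \mathrm{co}\, E_\beta(\ell(\mathcal{V})). $$

For any $g \in B_\beta$ let $c(g) := \sup\{c \ge 0 : g + c\, 1_n \in B_\beta\}$. Then the “north-east” boundary of the set $B_\beta$ is given by $\partial_{1_n} B_\beta := \{g + c(g)\, 1_n : g \in B_\beta\}$. The following proposition is the main result of this section.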

Proposition 1

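Assume $\ell$ is strictly proper and it satisfies the condition $\partial_{1_n} B_\beta \subseteq E_\beta(\ell(\mathcal{V}))$ for some $\beta > 0$. Define $\psi(p) := J E_\beta(\ell(p))$ for all $p \in \Delta^n$, where $J = [I_{n-1}, -1_{n-1}]$. Then $\psi$ is invertible, and $\ell \circ \psi^{-1}$ is $\beta$-exp-concave over $\psi(\Delta^n)$, which is a convex set.

The condition stated in the above proposition is satisfied by any $\beta$-mixable proper loss in the binary case ($n = 2$), but it is not guaranteed in the multi-class case where $n > 2$. In the binary case the link function can be given as $\psi(p) = e^{-\beta \ell_1(p)} - e^{-\beta \ell_2(p)}$ for all $p \in \Delta^2$.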
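The binary link ψ(p) = exp(−β ℓ_1(p)) − exp(−β ℓ_2(p)) can be checked numerically on an example. The sketch below makes several assumptions for illustration: it uses the binary square loss in the form ℓ_1(p) = (1 − p)^2, ℓ_2(p) = p^2 with p the probability of the first outcome, takes β = 2 (the mixability constant usually quoted for this form), and tests concavity through chord slopes on a grid, which gives evidence rather than proof. Function names are hypothetical.

```python
import math

BETA = 2.0  # assumed mixability constant of l_1(p) = (1-p)^2, l_2(p) = p^2

def partial_losses(p):
    return (1.0 - p) ** 2, p ** 2

def link(p, beta=BETA):
    """psi(p) = exp(-beta * l_1(p)) - exp(-beta * l_2(p))."""
    l1, l2 = partial_losses(p)
    return math.exp(-beta * l1) - math.exp(-beta * l2)

def is_concave(xs, ys, tol=1e-9):
    """For increasing xs, a concave function has non-increasing chord slopes."""
    s = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    return all(s[i + 1] <= s[i] + tol for i in range(len(s) - 1))

def exp_concave_in(param, beta=BETA, steps=400):
    """Is exp(-beta * l_y) concave as a function of v = param(p), for y = 1, 2?"""
    ps = [k / steps for k in range(steps + 1)]
    vs = [param(p) for p in ps]
    return all(is_concave(vs, [math.exp(-beta * partial_losses(p)[y]) for p in ps])
               for y in (0, 1))

if __name__ == "__main__":
    print("identity parameterisation:", exp_concave_in(lambda p: p))
    print("link parameterisation:    ", exp_concave_in(link))
```

Under the identity parameterisation the loss is not 2-exp-concave, while after re-parameterising by the link the same check passes, which is the behaviour the proposition describes for the binary case.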

Unfortunately, the condition $\partial_{1_n} B_\beta \subseteq E_\beta(\ell(\mathcal{V}))$ is generally not satisfied; an example based on squared loss ($\beta = 1$ and $n = 3$ classes) is shown in Figure 1, where for $A$ and $B$ in $E_\beta(\ell(\mathcal{V}))$, the mid-point $C$ can travel along the ray of direction $1_3$ without hitting any point in the exp-prediction set $E_\beta(\ell(\mathcal{V}))$. Therefore we resort to approximating a given $\beta$-mixable loss by a sequence of $\beta$-exp-concave losses parameterised by a positive constant $\epsilon$, such that the approximation approaches the original loss in an appropriate sense as $\epsilon$ tends to 0. Without loss of generality, we assume $\beta = 1$.

Inspired by Proposition 1, a natural idea to construct the approximation is to add “faces” to the exp-prediction set such that all rays in the $1_n$ direction will be blocked. Technically, it turns out to be more convenient to add faces that block rays in (almost) all directions of the positive orthant; see Figure 2 for an illustration. In particular, we extend the “rim” of the exp-prediction set by hyperplanes that are $\epsilon$-close to axis-parallel. The key challenge underlying this idea is to design an appropriate parameterisation of the surrogate loss $\tilde{\ell}_\epsilon$, which not only produces such an extended exp-prediction set, but also ensures that $\tilde{\ell}_\epsilon(p) = \ell(p)$ for almost all $p \in \Delta^n$ as $\epsilon \to 0$.

Figure 1: Ray “escaping” in $1_n$ direction. More evidence in Figure 12 in Appendix D.
Figure 2: Adding “faces” to block rays in (almost) all positive directions.
Figure 3: Sub-exp-prediction set extended by removing near axis-parallel supporting hyperplanes.

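Given a $\beta$-mixable loss $\ell$, its sub-exp-prediction set defined as follows must be convex:

$$ T_\ell := \{E_\beta(\ell(p)) - x : p \in \Delta^n,\ x \in \mathbb{R}_+^n\}. $$

Note that $T_\ell$ extends infinitely in any direction $-p$ with $p \in \mathbb{R}_+^n$. Therefore it can be written in terms of supporting hyperplanes as

$$ T_\ell = \bigcap_{p \in \Delta^n} H_p, \qquad H_p := \{x \in \mathbb{R}^n : p' \cdot x \le \gamma_p\}, \qquad \gamma_p := \sup_{x \in T_\ell} p' \cdot x. \tag{11} $$

To extend the sub-exp-prediction set with “faces”, we remove some hyperplanes involved in (11) that correspond to the “rim” of the simplex (see Figure 3 for an illustration in 2-D):

$$ T_\ell^\epsilon := \bigcap_{p \in \Delta_\epsilon^n} H_p, \qquad \Delta_\epsilon^n := \{p \in \Delta^n : \min_{i \in [n]} p_i > \epsilon\}. $$

Since $\epsilon > 0$, for any $p \in \Delta_\epsilon^n$, $E_\beta^{-1}(H_p \cap \mathbb{R}_+^n)$ is exactly the super-prediction set of a log-loss with appropriate scaling and shifting (see proof in Appendix C). So it must be convex. Therefore $E_\beta^{-1}(T_\ell^\epsilon \cap \mathbb{R}_+^n) = \bigcap_{p \in \Delta_\epsilon^n} E_\beta^{-1}(H_p \cap \mathbb{R}_+^n)$ must be convex, and its recession cone is clearly $\mathbb{R}_+^n$. This guarantees that the following loss is proper over $\Delta^n$ (Williamson, 2014, Proposition 2):

$$ \tilde{\ell}_\epsilon(p) := \arg\min_{z \in E_\beta^{-1}(T_\ell^\epsilon \cap \mathbb{R}_+^n)} p' \cdot z, $$

where the argmin must be attained uniquely (Appendix C). Our next proposition states that $\tilde{\ell}_\epsilon$ meets all the requirements of approximation suggested above.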

Proposition 2

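For any $\epsilon > 0$, $\tilde{\ell}_\epsilon$ satisfies the condition of Proposition 1. In addition, $\tilde{\ell}_\epsilon = \ell$ over a subset $S_\epsilon \subseteq \Delta^n$, where for any $p$ in the relative interior of $\Delta^n$, $p \in S_\epsilon$ for sufficiently small $\epsilon$, i.e., $\lim_{\epsilon \to 0} \mathrm{vol}(\Delta^n \setminus S_\epsilon) = 0$.

Note that $\tilde{\ell}_\epsilon$ is not bounded for $\epsilon > 0$. While the result does not show that all $\beta$-mixable losses can be made $\beta$-exp-concave, it is suggestive that such a result may be obtainable by a different argument.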

3.2 Calculus approach

Proper composite losses are defined by the proper loss $\ell$ and the link $\psi$. In this section we will characterize the exp-concave proper composite losses in terms of $(H\tilde{\underline{L}}_\ell(\tilde{p}), D\tilde{\psi}(\tilde{p}))$. The following proposition provides the identities of the first and second derivatives of the proper composite losses (Vernet et al. (2014)).

Proposition 3

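For all $i \in [n]$, $\tilde{p} \in \mathrm{int}(\tilde{\Delta}^n)$ (the interior of $\tilde{\Delta}^n$), and $v = \tilde{\psi}(\tilde{p}) \in \mathcal{V}$ (so $\tilde{p} = \tilde{\psi}^{-1}(v)$),

$$ D\ell_i^\psi(v) = -(e_i^{n-1} - \tilde{p})' \cdot k(\tilde{p}), $$
$$ H\ell_i^\psi(v) = -\big((e_i^{n-1} - \tilde{p})' \otimes I_{n-1}\big) \cdot D_v[k(\tilde{p})] + k(\tilde{p})' \cdot [D\tilde{\psi}(\tilde{p})]^{-1}, $$

where $k(\tilde{p}) := -H\tilde{\underline{L}}_\ell(\tilde{p}) \cdot [D\tilde{\psi}(\tilde{p})]^{-1}$.

The term $k(\tilde{p})$ can be interpreted as the curvature of the Bayes risk function $\tilde{\underline{L}}_\ell$ relative to the rate of change of the link function $\tilde{\psi}$. In the binary case where $n = 2$, the above proposition reduces to

$$ (\ell_1^\psi)'(v) = -(1 - \tilde{p})\, k(\tilde{p}), \qquad (\ell_2^\psi)'(v) = \tilde{p}\, k(\tilde{p}), $$
$$ (\ell_1^\psi)''(v) = \frac{-(1 - \tilde{p})\, k'(\tilde{p}) + k(\tilde{p})}{\tilde{\psi}'(\tilde{p})}, \qquad (\ell_2^\psi)''(v) = \frac{\tilde{p}\, k'(\tilde{p}) + k(\tilde{p})}{\tilde{\psi}'(\tilde{p})}, $$

where $k(\tilde{p}) = \frac{-\tilde{\underline{L}}_\ell''(\tilde{p})}{\tilde{\psi}'(\tilde{p})} \ge 0$ and so $\frac{d}{dv} k(\tilde{p}) = \frac{k'(\tilde{p})}{\tilde{\psi}'(\tilde{p})}$.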

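A loss $\ell : \Delta^n \to \mathbb{R}_+^n$ is $\alpha$-exp-concave (i.e. $\Delta^n \ni q \mapsto \ell_y(q)$ is $\alpha$-exp-concave for all $y \in [n]$) if and only if the map $\Delta^n \ni q \mapsto L_\ell(p, q) = p' \cdot \ell(q)$ is $\alpha$-exp-concave for all $p \in \Delta^n$. It can be easily shown that the maps $v \mapsto \ell_y^\psi(v)$ are $\alpha$-exp-concave if and only if $H\ell_y^\psi(v) \succeq \alpha\, D\ell_y^\psi(v)' \cdot D\ell_y^\psi(v)$. By applying Proposition 3 we obtain the following characterization of the $\alpha$-exp-concavity of the composite loss $\ell^\psi$.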

Proposition 4

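A proper composite loss $\ell^\psi = \ell \circ \psi^{-1}$ is $\alpha$-exp-concave (with $\alpha > 0$ and $v = \tilde{\psi}(\tilde{p})$) if and only if for all $\tilde{p} \in \mathrm{int}(\tilde{\Delta}^n)$ and for all $i \in [n]$

$$ \big((e_i^{n-1} - \tilde{p})' \otimes I_{n-1}\big) \cdot D_v[k(\tilde{p})] \preceq k(\tilde{p})' \cdot [D\tilde{\psi}(\tilde{p})]^{-1} - \alpha\, k(\tilde{p})' \cdot (e_i^{n-1} - \tilde{p}) \cdot (e_i^{n-1} - \tilde{p})' \cdot k(\tilde{p}). $$

Based on this characterization, we can determine which loss functions can be exp-concavified by a chosen link function and how much a link function can exp-concavify a given loss function. In the binary case ($n = 2$), the above proposition reduces to the following.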

Proposition 5

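Let $\tilde{\psi} : [0, 1] \to \mathcal{V} \subseteq \mathbb{R}$ be an invertible link and $\ell : \Delta^2 \to \mathbb{R}_+^2$ be a strictly proper binary loss with weight function $w(\tilde{p}) := -H\tilde{\underline{L}}_\ell(\tilde{p}) = -\tilde{\underline{L}}_\ell''(\tilde{p})$. Then the binary composite loss $\ell^\psi := \ell \circ \Pi_\Delta^{-1} \circ \tilde{\psi}^{-1}$ is $\alpha$-exp-concave (with $\alpha > 0$) if and only if

$$ -\frac{1}{\tilde{p}} + \alpha\, w(\tilde{p})\, \tilde{p} \ \le\ \frac{w'(\tilde{p})}{w(\tilde{p})} - \frac{\tilde{\psi}''(\tilde{p})}{\tilde{\psi}'(\tilde{p})} \ \le\ \frac{1}{1 - \tilde{p}} - \alpha\, w(\tilde{p})\, (1 - \tilde{p}), \quad \forall \tilde{p} \in (0, 1). \tag{21} $$
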
The following proposition gives an easier-to-check necessary condition for the binary proper losses that generate an $\alpha$-exp-concave (with $\alpha > 0$) binary composite loss given a particular link function. Since scaling a loss function will not affect what a sensible learning algorithm will do, it is possible to normalize the loss functions by normalizing their weight functions by setting $w(\tfrac{1}{2}) = 1$. By this normalization we are scaling the original loss function by $\frac{1}{w(1/2)}$, and the super-prediction set is scaled by the same factor. If the original loss function is $\beta$-mixable (resp. $\alpha$-exp-concave), then the normalized loss function is $\beta\, w(\tfrac{1}{2})$-mixable (resp. $\alpha\, w(\tfrac{1}{2})$-exp-concave).

Proposition 6

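Let $\tilde{\psi} : [0, 1] \to \mathcal{V} \subseteq \mathbb{R}$ be an invertible link and $\ell : \Delta^2 \to \mathbb{R}_+^2$ be a strictly proper binary loss with weight function $w(\tilde{p})$ normalised such that $w(\tfrac{1}{2}) = 1$. Then the binary composite loss $\ell^\psi$ is $\alpha$-exp-concave (with $\alpha > 0$) only if

$$ \frac{\tilde{\psi}'(\tilde{p})}{\tilde{p}\,\big(2\tilde{\psi}'(\tfrac{1}{2}) - \alpha(\tilde{\psi}(\tilde{p}) - \tilde{\psi}(\tfrac{1}{2}))\big)} \ \lesseqgtr\ w(\tilde{p}) \ \lesseqgtr\ \frac{\tilde{\psi}'(\tilde{p})}{(1 - \tilde{p})\,\big(2\tilde{\psi}'(\tfrac{1}{2}) + \alpha(\tilde{\psi}(\tilde{p}) - \tilde{\psi}(\tfrac{1}{2}))\big)}, \quad \forall \tilde{p} \in (0, 1), \tag{22} $$

where $\lesseqgtr$ denotes $\le$ for $\tilde{p} \ge \tfrac{1}{2}$ and denotes $\ge$ for $\tilde{p} \le \tfrac{1}{2}$.

Proposition 5 provides necessary and sufficient conditions for the exp-concavity of binary composite losses, whereas Proposition 6 provides simple necessary but not sufficient conditions. By setting $\alpha = 0$ in all the above results for exp-concavity, we recover the convexity conditions for proper and composite losses which were already derived by Reid and Williamson (2010) for the binary case and Vernet et al. (2014) for the multi-class case.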
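The two-sided condition of Proposition 5, −1/p + α w(p) p ≤ w'(p)/w(p) − ψ''(p)/ψ'(p) ≤ 1/(1 − p) − α w(p)(1 − p), lends itself to a direct grid check. The sketch below is illustrative only: the function name, grid and tolerance are assumptions, derivatives are supplied by hand, and the square loss is taken in the form ℓ_1(p) = (1 − p)^2, ℓ_2(p) = p^2, whose weight function is the constant 2.

```python
def satisfies_binary_condition(w, dw, dpsi, d2psi, alpha, steps=999, tol=1e-9):
    """Grid check of
         -1/p + alpha*w*p <= w'/w - psi''/psi' <= 1/(1-p) - alpha*w*(1-p)
    for p in (0, 1). w, dw are the weight function and its derivative;
    dpsi, d2psi are the first and second derivatives of the link."""
    for k in range(1, steps + 1):
        p = k / (steps + 1)
        mid = dw(p) / w(p) - d2psi(p) / dpsi(p)
        lower = -1.0 / p + alpha * w(p) * p
        upper = 1.0 / (1.0 - p) - alpha * w(p) * (1.0 - p)
        if not (lower - tol <= mid <= upper + tol):
            return False
    return True

# Identity link: psi(p) = p, so psi' = 1 and psi'' = 0.
identity = (lambda p: 1.0, lambda p: 0.0)

# Log loss: w(p) = 1 / (p (1 - p)).
w_log = lambda p: 1.0 / (p * (1.0 - p))
dw_log = lambda p: (2.0 * p - 1.0) / (p * (1.0 - p)) ** 2

# Square loss in the form l_1 = (1-p)^2, l_2 = p^2: w(p) = 2.
w_sq = lambda p: 2.0
dw_sq = lambda p: 0.0

if __name__ == "__main__":
    for a in (1.0, 1.1):
        print("log loss, identity link, alpha =", a,
              satisfies_binary_condition(w_log, dw_log, *identity, alpha=a))
    for a in (0.5, 0.6):
        print("square loss, identity link, alpha =", a,
              satisfies_binary_condition(w_sq, dw_sq, *identity, alpha=a))
```

With the identity link, log loss meets both inequalities with equality at α = 1, which is why a small tolerance is used.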

3.3 Link functions

Figure 4: Necessary but not sufficient region of normalised weight functions to ensure $\alpha$-exp-concavity and convexity of proper losses (red and black: two values of $\alpha$; blue: convexity).
Figure 5: Necessary and sufficient region of unnormalised weight functions to ensure $\alpha$-exp-concavity of composite losses with canonical link (black, red and blue: three values of $\alpha$).

A proper loss can be exp-concavified ($\alpha > 0$) by some link function only if the loss is mixable ($\beta_\ell > 0$), and the maximum possible value of the exp-concavity constant is the mixability constant of the loss (since the link function won’t change the super-prediction set and an $\alpha$-exp-concave loss is always $\beta$-mixable for some $\beta \ge \alpha$).

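By applying the identity link $\tilde{\psi}(\tilde{p}) = \tilde{p}$ in (21) we obtain the necessary and sufficient conditions for a binary proper loss to be $\alpha$-exp-concave (with $\alpha > 0$), as given by

$$ -\frac{1}{\tilde{p}} + \alpha\, w(\tilde{p})\, \tilde{p} \ \le\ \frac{w'(\tilde{p})}{w(\tilde{p})} \ \le\ \frac{1}{1 - \tilde{p}} - \alpha\, w(\tilde{p})\, (1 - \tilde{p}), \quad \forall \tilde{p} \in (0, 1). \tag{23} $$

By substituting $\tilde{\psi}(\tilde{p}) = \tilde{p}$ in (22) we obtain the following necessary but not sufficient (simpler) constraints for a normalized binary proper loss to be $\alpha$-exp-concave:

$$ \frac{1}{\tilde{p}\,\big(2 - \alpha(\tilde{p} - \tfrac{1}{2})\big)} \ \lesseqgtr\ w(\tilde{p}) \ \lesseqgtr\ \frac{1}{(1 - \tilde{p})\,\big(2 + \alpha(\tilde{p} - \tfrac{1}{2})\big)}, \quad \forall \tilde{p} \in (0, 1), \tag{24} $$

which are illustrated as the shaded region in Figure 4 for different values of $\alpha$. Observe that normalized proper losses can be $\alpha$-exp-concave only for $0 < \alpha \le 4$. When $\alpha = 4$, only the normalized weight function of log loss ($w_{\ell^{\log}}(\tilde{p}) = \frac{1}{4\tilde{p}(1 - \tilde{p})}$) will satisfy (24), and when $\alpha > 4$, the allowable (necessary) region to ensure $\alpha$-exp-concavity vanishes. Thus normalized log loss is the most exp-concave normalized proper loss. Observe (from Figure 4) that normalized square loss ($w_{\ell^{\mathrm{sq}}}(\tilde{p}) = 1$) is at most 2-exp-concave. Further, from (24), if $\alpha' > \alpha$, then the allowable region to ensure $\alpha'$-exp-concavity will be within the region for $\alpha$-exp-concavity, and any allowable region to ensure $\alpha$-exp-concavity will be within the region for convexity, which is obtained by setting $\alpha = 0$ in (24). Here we recall the fact that, if the normalized loss function is $\alpha$-exp-concave, then the original loss function is $\frac{\alpha}{w(1/2)}$-exp-concave. The following theorem provides sufficient conditions for the exp-concavity of binary proper losses.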

Theorem 7

A binary proper loss $\ell : \Delta^2 \to \mathbb{R}_+^2$ with the weight function $w(\tilde{p})$ normalized such that $w(\tfrac{1}{2}) = 1$ is $\alpha$-exp-concave (with $\alpha > 0$) if


For square loss we can find that and