1 Introduction
Loss functions are the means by which the quality of a prediction in a learning problem is evaluated. A composite loss (the composition of a class probability estimation (CPE) loss with an invertible link function, which is essentially just a reparameterization) is proper if its risk is minimized when predicting the true underlying class probability (a formal definition is given later). [Vernet et al. [2014]] argue that there is no point in using losses that are neither proper nor proper composite, as they are inadmissible. Flexibility in the choice of loss function is important to tailor the solution to a learning problem ([Buja et al. [2005]], [Hand [1994]], [Hand and Vinciotti [2003]]), and it can be attained by characterizing the set of loss functions using natural parameterizations.
The goal of the learner in a game of prediction with expert advice (which is formally described in Section 2.5) is to predict as well as the best expert in the given pool of experts. The regret bound of the learner depends on the merging scheme used to merge the experts’ predictions and the type of loss function used to measure the performance. It has already been shown that constant regret bounds are achievable for mixable losses when the Aggregating Algorithm is the merging scheme ([Vovk [1995]]), and for expconcave losses when the Weighted Average Algorithm is the merging scheme ([Kivinen and Warmuth [1999]]). Expconcavity trivially implies mixability. The converse implication is not true in general, but it can be made to hold under a suitable reparameterization. This paper discusses general conditions on proper losses under which they can be transformed to an expconcave loss through a suitable link function. In the binary case, these conditions give two concrete formulas (Proposition 1 and Corollary 8) for link functions that can transform mixable proper losses into expconcave, proper, composite losses. The explicit form of the link function given in Proposition 1 is derived using the same geometric construction used in [van Erven [2012]].
Further, we extend the work by [Vernet et al. [2014]] to provide a complete characterization of the expconcavity of the proper composite multiclass losses in terms of the Bayes risk associated with the underlying proper loss, and the link function. The mixability of proper losses (mixability of a proper composite loss is equivalent to the mixability of its generating proper loss) is studied in [Van Erven et al. [2012]]. Using these characterizations (for the binary case), in Corollary 8 we derive an expconcavifying link function that can also transform any mixable proper loss into an expconcave composite loss. Since for the multiclass losses these conditions do not hold in general, we propose a geometric approximation approach (Proposition 2) which takes a parameter and transforms the mixable loss function appropriately on a subset of the prediction space. When the prediction space is , any prediction belongs to the subset for sufficiently small . In the conclusion we provide a way to use the Weighted Average Algorithm with learning rate for proper mixable but nonexpconcave loss functions to achieve regret bound.
Expconcave losses achieve logarithmic regret bounds in online convex optimization algorithms, which is a more general setting of online learning problems. Thus the expconcavity characterization of composite losses could be helpful in constructing expconcave losses for online learning problems.
The paper is organized as follows. In Section 2 we formally introduce the loss function, several loss types, conditional risk, proper composite losses and a game of prediction with expert advice. In Section 3 we consider our main problem — whether one can always find a link function to transform mixable losses into expconcave losses. Section 4 concludes with a brief discussion. The impact of the choice of substitution function on the regret of the learner is explored via experiments in Appendix A. In Appendix B, we discuss the mixability conditions of probability games with continuous outcome space. Detailed proofs are in Appendix C.
2 Preliminaries and Background
This section provides the necessary background on loss functions, conditional risks, and the sequential prediction problem.
2.1 Notation
We use the following notation throughout. A superscript prime, , denotes transpose of the matrix or vector , except when applied to a real-valued function where it denotes derivative (). We denote the matrix multiplication of compatible matrices and by , so the inner product of two vectors is . Let , and the simplex . If is a vector, is the matrix with entries , and for . If is positive definite (resp. semidefinite), then we write (resp. ). We use to denote the th dimensional unit vector, when , and define when . The vector . We write if is true and otherwise. Given a set and a weight vector , the convex combination of the elements of the set w.r.t the weight vector is denoted by , and the convex hull of the set which is the set of all possible convex combinations of the elements of the set is denoted by ([Rockafellar [1970]]). If , then the Minkowski sum . represents the set of all functions . Other notation (the Kronecker product , the Jacobian D, and the Hessian H) is defined in Appendix A of [Van Erven et al. [2012]]. and denote the Jacobian and Hessian of w.r.t. respectively. When it is not clear from the context, we will explicitly mention the variable; for example where .
2.2 Loss Functions
For a prediction problem with an instance space , outcome space and prediction space , a loss function (bivariate function representation) can be defined to assign a penalty for predicting when the actual outcome is . When the outcome space , the loss function is called a multiclass loss and it can be expressed in terms of its partial losses for any outcome , as . The vector representation of the multiclass loss is given by , which assigns a vector to each prediction . A loss is differentiable if all of its partial losses are differentiable. In this paper, we will use the bivariate function representation () to denote a general loss function and the vector representation for multiclass loss functions.
The superprediction set of a binary loss is defined as
where inequality is componentwise. For any dimension and , the exponential operator is defined by . For it is clearly invertible with inverse . The exponential transformation of the superprediction set is given by
A multiclass loss is convex if is convex in for all , expconcave (for ) if is concave in for all , weakly mixable if the superprediction set is convex ([Kalnishkan and Vyugin [2005]]), and mixable (for ) if the set is convex ([Vovk and Zhdanov [2009], Vovk [1995]]). The mixability constant of a loss is the largest such that is mixable; i.e. . If the loss function is expconcave (resp. mixable) then it is expconcave for any (resp. mixable for any ), and its scaled version () for some is expconcave (resp. mixable). If the loss is expconcave, then it is convex and mixable.
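As a rough numerical illustration of the expconcavity definition above, the following Python sketch tests on a grid whether the map from a prediction to exp(-eta * loss) is concave, using second differences. The binary partial losses used here (squared error (1 - q)^2 and q^2, and log loss -ln q and -ln(1 - q)) and the helper name `is_expconcave_on_grid` are assumptions chosen for illustration, not definitions taken from the paper.

```python
import math

def is_expconcave_on_grid(partial_losses, eta, lo, hi, steps=1000, tol=1e-12):
    """Grid test: is v -> exp(-eta * loss(v)) concave on [lo, hi] for every partial loss?"""
    h = (hi - lo) / steps
    for loss in partial_losses:
        f = lambda v, loss=loss: math.exp(-eta * loss(v))
        for k in range(1, steps):
            v = lo + k * h
            # A positive second difference signals local convexity, i.e. a violation.
            if f(v - h) - 2.0 * f(v) + f(v + h) > tol:
                return False
    return True

# Assumed binary partial losses; q is the predicted probability of the positive class.
square = [lambda q: (1.0 - q) ** 2, lambda q: q ** 2]
log = [lambda q: -math.log(q), lambda q: -math.log(1.0 - q)]

square_ok = is_expconcave_on_grid(square, eta=0.5, lo=0.0, hi=1.0)
square_too_large = is_expconcave_on_grid(square, eta=0.6, lo=0.0, hi=1.0)
log_ok = is_expconcave_on_grid(log, eta=1.0, lo=0.001, hi=0.999)
log_too_large = is_expconcave_on_grid(log, eta=1.1, lo=0.001, hi=0.999)
```

A grid test of this kind can only refute concavity or fail to refute it at the chosen resolution; it is not a proof.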
For a multiclass loss , if the prediction space then it is said to be a multiclass probability estimation (multiCPE) loss, where the predicted values are directly interpreted as probability estimates: . We will say a multiCPE loss is fair whenever , for all . That is, there is no loss incurred for perfect prediction. Examples of multiCPE losses include the square loss , the log loss , the absolute loss , and the 0-1 loss .
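The four example losses can be sketched in Python as partial losses indexed by the true class. The concrete formulas below are the commonly used forms and are assumptions here, since the exact expressions are not reproduced in this text; the function names are likewise illustrative.

```python
import math

def one_hot(i, n):
    """Unit vector with a 1 in position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def square_loss(i, q):
    e = one_hot(i, len(q))
    return sum((e[j] - q[j]) ** 2 for j in range(len(q)))

def log_loss(i, q):
    return -math.log(q[i]) if q[i] > 0 else math.inf

def absolute_loss(i, q):
    e = one_hot(i, len(q))
    return sum(abs(e[j] - q[j]) for j in range(len(q)))

def zero_one_loss(i, q):
    # Assumed form: no loss if class i attains the largest predicted probability.
    return 0.0 if q[i] == max(q) else 1.0

def is_fair(loss, n):
    """A loss is fair if predicting the vertex e_i costs nothing when the outcome is i."""
    return all(loss(i, one_hot(i, n)) == 0.0 for i in range(n))

fairness = {f.__name__: is_fair(f, 3)
            for f in (square_loss, log_loss, absolute_loss, zero_one_loss)}
```

Under these assumed definitions all four losses are fair, since each vanishes at a perfect prediction.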
2.3 Conditional and Full Risks
Let X and Y be random variables defined on the instance space and the outcome space respectively. Let be the joint distribution of and for , denote the conditional distribution by where , and the marginal distribution by . For any multiCPE loss , the conditional risk is defined as
(1)
where represents a Multinomial distribution with parameter . The full risk of the estimator function is defined as
Furthermore the Bayes risk is defined as
where is the conditional Bayes risk and is always concave ([Gneiting and Raftery [2007]]). If is fair, . One can understand the effect of choice of loss in terms of the conditional perspective ([Reid and Williamson [2011]]), which allows one to ignore the marginal distribution of X which is typically unknown.
2.4 Proper and Composite Losses
A multiCPE loss is said to be proper if for all , ([Buja et al. [2005]], [Gneiting and Raftery [2007]]), and strictly proper if for all and . It is easy to see that the log loss, square loss, and 0-1 loss are proper while absolute loss is not. Furthermore, both log loss and square loss are strictly proper while 0-1 loss is proper but not strictly proper.
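Properness can be probed numerically in the binary case: fix a true distribution p, scan predictions q on a grid, and see where the conditional risk (the p-weighted sum of partial losses) is smallest. The sketch below assumes the standard forms of the log, square and absolute losses; `risk_minimizer` is a hypothetical helper, and a grid search only suggests, rather than proves, properness.

```python
import math

def log_loss(i, q):
    return -math.log(q[i])

def square_loss(i, q):
    return sum(((1.0 if j == i else 0.0) - q_j) ** 2 for j, q_j in enumerate(q))

def absolute_loss(i, q):
    return sum(abs((1.0 if j == i else 0.0) - q_j) for j, q_j in enumerate(q))

def conditional_risk(loss, p, q):
    """Expected loss of prediction q when the outcome is drawn from p."""
    return sum(p_i * loss(i, q) for i, p_i in enumerate(p))

def risk_minimizer(loss, p, steps=99):
    """First coordinate c of the grid prediction (c, 1 - c) with the smallest risk."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda c: conditional_risk(loss, p, [c, 1.0 - c]))

p = [0.7, 0.3]
best_log = risk_minimizer(log_loss, p)
best_square = risk_minimizer(square_loss, p)
best_absolute = risk_minimizer(absolute_loss, p)
```

For log and square loss the minimizer coincides with the true probability 0.7, while for absolute loss the risk is linear in c and the minimizer is pushed to the edge of the grid, which is consistent with absolute loss not being proper.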
Given a proper loss with differentiable Bayes conditional risk , in order to be able to calculate derivatives easily, following [Van Erven et al. [2012]] we define
(2)  
(3)  
(4)  
(5)  
(6) 
Let be continuous and strictly monotone (hence invertible) for some convex set . It induces via
(7) 
Clearly is continuous and invertible with . We can now extend the notion of properness to the prediction space from using this link function. Given a proper loss , a proper composite loss for multiclass probability estimation is defined as . We can easily see that the conditional Bayes risks of the composite loss and the underlying proper loss are equal (). Every continuous proper loss has a convex superprediction set ([Vernet et al. [2014]]). Thus they are weakly mixable. Since by applying a link function the superprediction set won’t change (as it is just a reparameterization), all proper composite losses are also weakly mixable.
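To make the composite construction concrete, here is a minimal binary sketch. It assumes log loss as the underlying proper loss and the logit function as the link; both choices, and all function names, are illustrative assumptions rather than the paper's own example. Composing the loss with the inverse link gives a loss on real-valued predictions, and its conditional risk at the linked true probability matches that of the underlying proper loss.

```python
import math

def logit(p):
    """Assumed link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(v):
    return 1.0 / (1.0 + math.exp(-v))

def log_loss(y, p):
    """Binary log loss; p is the predicted probability of y = 1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def composite_loss(y, v):
    """Proper composite loss: the proper loss applied to the inverse-linked prediction."""
    return log_loss(y, inverse_logit(v))

def conditional_risk(loss, p_true, prediction):
    return p_true * loss(1, prediction) + (1.0 - p_true) * loss(0, prediction)

p_true = 0.3
risk_proper = conditional_risk(log_loss, p_true, p_true)
risk_composite = conditional_risk(composite_loss, p_true, logit(p_true))
```

With these choices the composite loss for y = 1 is ln(1 + exp(-v)), the familiar logistic loss.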
2.5 Game of Prediction with Expert Advice
Let be the outcome space, be the prediction space, and be the loss function. A game of prediction with expert advice, represented by the tuple (), can then be described as follows: for each trial ,

experts make their prediction

the learner makes his own decision

the environment reveals the actual outcome
Let be the outcome sequence in trials. Then the cumulative loss of the learner over is given by , of the ith expert is given by , and the regret of the learner is given by . The goal of the learner is to predict as well as the best expert; to which end the learner tries to minimize the regret.
When using the exponential weights algorithm (which is an important family of algorithms in the game of prediction with expert advice), at the end of each trial, the weight of each expert is updated as for all , where is the learning rate and is the weight of the expert at time (the weight vector of experts at time is denoted by ). Then based on the weights of experts, their predictions are merged using different merging schemes to make the learner’s prediction. The Aggregating Algorithm and the Weighted Average Algorithm are two important algorithms in the family of exponential weights algorithms.
Consider multiclass games with outcome space . In the Aggregating Algorithm with learning rate , first the loss vectors of the experts and their weights are used to make a generalized prediction which is given by
Then this generalized prediction is mapped into a permitted prediction via a substitution function such that , where the inequality is elementwise and the constant depends on the learning rate. If is mixable, then is convex, so , and we can always choose a substitution function with . Consequently for mixable losses, the learner of the Aggregating Algorithm is guaranteed to have regret bounded by ([Vovk [1995]]).
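The generalized prediction step can be sketched as follows, assuming the usual form from the Aggregating Algorithm literature: for each outcome, minus one over the learning rate times the log of the weighted sum of exponentiated negative expert losses. The function name, the example weights and the example expert predictions are hypothetical. For log loss with learning rate 1, the weighted average of the experts' predictions reproduces the generalized prediction exactly, so it serves as a valid substitution.

```python
import math

def generalized_prediction(weights, expert_losses, eta):
    """expert_losses[i][y] is the loss expert i would suffer if outcome y occurred."""
    n_outcomes = len(expert_losses[0])
    return [
        -math.log(sum(w * math.exp(-eta * losses[y])
                      for w, losses in zip(weights, expert_losses))) / eta
        for y in range(n_outcomes)
    ]

# Hypothetical example: three experts, two outcomes, normalized weights.
weights = [0.5, 0.3, 0.2]
expert_predictions = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

# Log loss of each expert under each possible outcome.
expert_losses = [[-math.log(q_y) for q_y in q] for q in expert_predictions]
g = generalized_prediction(weights, expert_losses, eta=1.0)

# Candidate substitution: the weighted average of the experts' predictions.
mixture = [sum(w * q[y] for w, q in zip(weights, expert_predictions)) for y in range(2)]
mixture_losses = [-math.log(m) for m in mixture]
```

For other mixable losses the substitution function generally has to be chosen differently; this sketch covers only the log loss case.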
In the Weighted Average Algorithm with learning rate , the experts’ predictions are simply merged according to their weights to make the learner’s prediction , and this algorithm is guaranteed to have a regret bound for expconcave losses ([Kivinen and Warmuth [1999]]). In either case it is preferred to have bigger values for the constants and to have better regret bounds. Since an expconcave loss is mixable for some , the regret bound of the Weighted Average Algorithm is worse than that of the Aggregating Algorithm by a small constant factor. In [Vovk [2001]], it is noted that is always guaranteed only for expconcave losses. Thus for expconcave losses, the Weighted Average Algorithm is equivalent to the Aggregating Algorithm with the weighted average of the experts’ predictions as its substitution function and as the learning rate for both algorithms.
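A minimal sketch of the Weighted Average Algorithm with the exponential weights update is given below. It assumes log loss (which is expconcave with constant 1) and uniform initial weights; the data, the function names and the fixed expert predictions are made up for illustration. Under these assumptions the regret should stay below ln(N) divided by the learning rate, where N is the number of experts.

```python
import math

def log_loss(y, q):
    return -math.log(q[y])

def weighted_average_algorithm(expert_predictions, outcomes, loss, eta):
    """expert_predictions[t][i] is expert i's probability vector in trial t."""
    n_experts = len(expert_predictions[0])
    weights = [1.0 / n_experts] * n_experts
    learner_loss = 0.0
    expert_losses = [0.0] * n_experts
    for preds, y in zip(expert_predictions, outcomes):
        n_classes = len(preds[0])
        # Learner predicts the weight-averaged expert prediction.
        learner_pred = [sum(w * p[k] for w, p in zip(weights, preds))
                        for k in range(n_classes)]
        learner_loss += loss(y, learner_pred)
        losses = [loss(y, p) for p in preds]
        expert_losses = [total + l for total, l in zip(expert_losses, losses)]
        # Exponential weights update followed by normalization.
        unnormalized = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        norm = sum(unnormalized)
        weights = [u / norm for u in unnormalized]
    return learner_loss, expert_losses

# Hypothetical data: three experts with fixed predictions over eight trials.
experts = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
outcomes = [0, 0, 1, 0, 1, 0, 0, 0]
rounds = [experts] * len(outcomes)
learner_loss, expert_losses = weighted_average_algorithm(rounds, outcomes, log_loss, eta=1.0)
regret = learner_loss - min(expert_losses)
```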
Even though the choice of substitution function will not have any impact on the regret bound and the weight update mechanism of the Aggregating Algorithm, it will certainly have impact on the actual regret of the learner over a given sequence of outcomes. According to the results given in Appendix A (where we have empirically compared some substitution functions), this impact on the actual regret varies depending on the outcome sequence, and in general the regret values for practical substitution functions don’t differ much — thus we can stick with a computationally efficient substitution function.
3 ExpConcavity of Proper Composite Losses
Expconcavity of a loss is desirable for better (logarithmic) regret bounds in online convex optimization algorithms, and for efficient implementation of exponential weights algorithms. In this section we will consider whether one can always find a link function that can transform a mixable proper loss into an expconcave composite loss: first by using the geometry of the set (Section 3.1), and then by using the characterization of the composite loss in terms of the associated Bayes risk (Sections 3.2 and 3.3).
3.1 Geometric approach
In this section we will use the same construction used by [van Erven [2012]] to derive an explicit closed form of a link function that could reparameterize any mixable loss into an expconcave loss, under certain conditions which are explained below. Given a multiclass loss , define
(8)  
(9) 
For any let . Then the “northeast” boundary of the set is given by . The following proposition is the main result of this section.
Proposition 1
Assume is strictly proper and it satisfies the condition : for some . Define for all , where . Then is invertible, and is expconcave over , which is a convex set.
The condition stated in the above proposition is satisfied by any mixable proper loss in the binary case (), but it is not guaranteed in the multiclass case where . In the binary case the link function can be given as for all .
Unfortunately, the condition that is generally not satisfied; an example based on squared loss ( and classes) is shown in Figure 3, where for and in , the midpoint can travel along the ray of direction without hitting any point in the expprediction set . Therefore we resort to approximating a given mixable loss by a sequence of expconcave losses parameterised by a positive constant , while the approximation approaches the original loss in some appropriate sense as tends to 0. Without loss of generality, we assume .
Inspired by Proposition 1, a natural idea to construct the approximation is by adding “faces” to the expprediction set such that all rays in the direction will be blocked. Technically, it turns out to be more convenient to add faces that block rays in (almost) all directions of the positive orthant. See Figure 3 for an illustration. In particular, we extend the “rim” of the expprediction set by hyperplanes that are close to axis-parallel. The key challenge underlying this idea is to design an appropriate parameterisation of the surrogate loss , which not only produces such an extended expprediction set, but also ensures that for almost all as .
Given a mixable loss , its subexpprediction set defined as follows must be convex:
(10) 
Note extends infinitely in any direction . Therefore it can be written in terms of supporting hyperplanes as
(11) 
To extend the subexpprediction set with “faces”, we remove some hyperplanes involved in (11) that correspond to the “rim” of the simplex (see Figure 3 for an illustration in 2D)
(12) 
Since , for any , is exactly the superprediction set of a log loss with appropriate scaling and shifting (see proof in Appendix C). So it must be convex. Therefore must be convex, and its recession cone is clearly . This guarantees that the following loss is proper over (Williamson, 2014, Proposition 2):
(13) 
where the argmin must be attained uniquely (Appendix C). Our next proposition states that meets all the requirements of approximation suggested above.
Proposition 2
For any , satisfies the condition . In addition, over a subset , where for any in the relative interior of , for sufficiently small , i.e., .
Note is not bounded for . While the result does not show that all mixable losses can be made expconcave, it is suggestive that such a result may be obtainable by a different argument.
3.2 Calculus approach
Proper composite losses are defined by the proper loss and the link . In this section we will characterize the expconcave proper composite losses in terms of . The following proposition provides the identities of the first and second derivatives of the proper composite losses (Vernet et al. (2014)).
Proposition 3
For all , (the interior of ), and (so ),
(14)  
(15) 
where
(16) 
The term can be interpreted as the curvature of the Bayes risk function relative to the rate of change of the link function . In the binary case where , the above proposition reduces to
(17)  
(18)  
(19) 
where and so .
A loss is expconcave (i.e. is expconcave for all ) if and only if the map is expconcave for all . It can be easily shown that the maps are expconcave if and only if . By applying Proposition 3 we obtain the following characterization of the expconcavity of the composite loss .
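In the scalar case the second-order test for expconcavity can be checked directly: differentiating exp(-eta f) twice shows that a twice-differentiable f is expconcave with constant eta exactly when f'' >= eta (f')^2 everywhere. The sketch below uses this scalar condition with hand-coded derivatives of two assumed binary partial losses, -ln q and (1 - q)^2, under the identity link; `max_expconcavity_constant` is a hypothetical helper that estimates the largest admissible constant on a grid.

```python
def max_expconcavity_constant(first_derivative, second_derivative, grid):
    """Grid estimate of the largest eta with f'' >= eta * (f')**2 on the grid."""
    return min(second_derivative(v) / first_derivative(v) ** 2
               for v in grid if first_derivative(v) != 0.0)

# Assumed partial loss f(q) = -ln q: f' = -1/q, f'' = 1/q**2.
log_grid = [k / 1000 for k in range(1, 1001)]
eta_log = max_expconcavity_constant(lambda q: -1.0 / q,
                                    lambda q: 1.0 / q ** 2,
                                    log_grid)

# Assumed partial loss f(q) = (1 - q)**2: f' = -2(1 - q), f'' = 2.
square_grid = [k / 1000 for k in range(0, 1000)]
eta_square = max_expconcavity_constant(lambda q: -2.0 * (1.0 - q),
                                       lambda q: 2.0,
                                       square_grid)
```

For this form of log loss the ratio is identically 1, and for this unnormalized square loss the binding point is q = 0, where the ratio equals 1/2.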
Proposition 4
A proper composite loss is expconcave (with and ) if and only if for all and for all
(20) 
Based on this characterization, we can determine which loss functions can be expconcavified by a chosen link function and how much a link function can expconcavify a given loss function. In the binary case (), the above proposition reduces to the following.
Proposition 5
Let be an invertible link and be a strictly proper binary loss with weight function . Then the binary composite loss is expconcave (with ) if and only if
(21) 
The following proposition gives an easier to check necessary condition for the binary proper losses that generate an expconcave (with ) binary composite loss given a particular link function. Since scaling a loss function will not affect what a sensible learning algorithm will do, it is possible to normalize the loss functions by normalizing their weight functions by setting . By this normalization we are scaling the original loss function by and the superprediction set is scaled by the same factor. If the original loss function is mixable (resp. expconcave), then the normalized loss function is mixable (resp. expconcave).
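The normalization step can be illustrated numerically. This sketch assumes the convention of Reid and Williamson that the weight function of a binary proper loss is minus the second derivative of its conditional Bayes risk, and it assumes the closed-form Bayes risks of log loss and of square loss with partial losses (1 - c)^2 and c^2. The function names and the finite-difference step are illustrative choices.

```python
import math

def bayes_risk_log(c):
    """Conditional Bayes risk of binary log loss (Shannon entropy in nats)."""
    return -(c * math.log(c) + (1.0 - c) * math.log(1.0 - c))

def bayes_risk_square(c):
    """Conditional Bayes risk of square loss with partial losses (1-c)^2 and c^2."""
    return c * (1.0 - c)

def weight_function(bayes_risk, c, h=1e-4):
    """Assumed definition: w(c) = -(Bayes risk)''(c), via a central second difference."""
    return -(bayes_risk(c - h) - 2.0 * bayes_risk(c) + bayes_risk(c + h)) / h ** 2

def normalized_weight(bayes_risk, c):
    """Weight function rescaled so that its value at c = 1/2 equals 1."""
    return weight_function(bayes_risk, c) / weight_function(bayes_risk, 0.5)

w_log_half = weight_function(bayes_risk_log, 0.5)
w_log_norm_quarter = normalized_weight(bayes_risk_log, 0.25)
w_square_norm_quarter = normalized_weight(bayes_risk_square, 0.25)
```

Under these assumptions the log loss weight at 1/2 is about 4, so normalization divides the loss by 4, while the normalized weight of this square loss is constant and equal to 1.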
Proposition 6
Let be an invertible link and be a strictly proper binary loss with weight function normalised such that . Then the binary composite loss is expconcave (with ) only if
(22) 
where denotes for and denotes for .
Proposition 5 provides necessary and sufficient conditions for the expconcavity of binary composite losses, whereas Proposition 6 provides simple necessary but not sufficient conditions. By setting in all the above results we have obtained for expconcavity, we recover the convexity conditions for proper and composite losses which are already derived by Reid and Williamson (2010) for the binary case and Vernet et al. (2014) for multiclass.
3.3 Link functions
A proper loss can be expconcavified () by some link function only if the loss is mixable () and the maximum possible value for expconcavity constant is the mixability constant of the loss (since the link function won’t change the superprediction set and an expconcave loss is always mixable for some ).
By applying the identity link in (21) we obtain the necessary and sufficient conditions for a binary proper loss to be expconcave (with ) as given by,
(23) 
By substituting in (22) we obtain the following necessary but not sufficient (simpler) constraints for a normalized binary proper loss to be expconcave
(24) 
which are illustrated as the shaded region in Figure 5 for different values of . Observe that normalized proper losses can be expconcave only for . When , only the normalized weight function of log loss () will satisfy (24), and when , the allowable (necessary) region to ensure expconcavity vanishes. Thus normalized log loss is the most expconcave normalized proper loss. Observe (from Figure 5) that normalized square loss () is at most 2-expconcave. Further from (24), if , then the allowable region to ensure expconcavity will be within the region for expconcavity, and also any allowable region to ensure expconcavity will be within the region for convexity, which is obtained by setting in (24). Here we recall the fact that, if the normalized loss function is expconcave, then the original loss function is expconcave. The following theorem provides sufficient conditions for the expconcavity of binary proper losses.
Theorem 7
A binary proper loss with the weight function normalized such that is expconcave (with ) if
and  
For square loss we can find that and