Convex surrogate losses are a central building block in machine learning for classification and classification-like problems. A growing body of work seeks to design and analyze convex surrogates for given loss functions, and more broadly, understand when such surrogates can and cannot be found. For example, recent work has developed tools to bound the required number of dimensions of the surrogate’s hypothesis space Frongillo and Kash (2015b); Ramaswamy and Agarwal (2016). Yet in some cases these bounds are far from tight, such as for abstain loss (classification with an abstain option) Bartlett and Wegkamp (2008); Yuan and Wegkamp (2010); Ramaswamy and Agarwal (2016); Ramaswamy et al. (2018); Zhang et al. (2018). Furthermore, the kinds of strategies available for constructing surrogates, and their relative power, are not well-understood.
We augment this literature by studying a particularly natural approach for finding convex surrogates, wherein one “embeds” a discrete loss. Specifically, we say a convex surrogate $L$ embeds a discrete loss $\ell$ if there is an injective embedding from the discrete reports (predictions) to a vector space such that (i) the original loss values are recovered, and (ii) a report is $\ell$-optimal if and only if the embedded report is $L$-optimal. If this embedding can be extended to a calibrated link function, which maps approximately $L$-optimal reports to $\ell$-optimal reports, consistency follows Agarwal and Agarwal (2015). Common examples which follow this general construction include hinge loss as a surrogate for 0-1 loss and the abstain surrogate mentioned above.
Using tools from property elicitation, we show a tight relationship between such embeddings and the class of polyhedral (piecewise-linear convex) loss functions. In particular, by focusing on Bayes risks, we show that every discrete loss is embedded by some polyhedral loss, and every polyhedral loss function embeds some discrete loss. Moreover, we show that any polyhedral loss gives rise to a calibrated link function to the loss it embeds, thus giving a very general framework to construct consistent convex surrogates for arbitrary losses.
The literature on convex surrogates focuses mainly on smooth surrogate losses Crammer and Singer (2001); Bartlett et al. (2006); Bartlett and Wegkamp (2008); Duchi et al. (2018); Williamson et al. (2016); Reid and Williamson (2010). Nevertheless, nonsmooth losses, such as the polyhedral losses we consider, have been proposed and studied for a variety of classification-like problems Yang and Koyejo (2018); Yu and Blaschko (2018); Lapin et al. (2015). A notable addition to this literature is Ramaswamy et al. (2018), who argue that nonsmooth losses may enable dimension reduction of the prediction space (range of the surrogate hypothesis) relative to smooth losses, illustrating this conjecture with a surrogate for abstain loss needing only $\lceil \log_2 n \rceil$ dimensions for $n$ labels, whereas the best known smooth loss needs $n - 1$. Their surrogate is a natural example of an embedding (cf. Section 5.1), and serves as inspiration for our work.
While property elicitation has by now an extensive literature Savage (1971); Osband and Reichelstein (1985); Lambert et al. (2008); Gneiting (2011); Steinwart et al. (2014); Frongillo and Kash (2015a); Fissler et al. (2016); Lambert (2018), these works are mostly concerned with point estimation problems. Literature directly connecting property elicitation to consistency is sparse, with the main reference being Agarwal and Agarwal (2015); note however that they consider single-valued properties, whereas properties elicited by general convex losses are necessarily set-valued.
For discrete prediction problems like classification, due to the hardness of directly optimizing a given discrete loss, many machine learning algorithms can be thought of as minimizing a surrogate loss function with better optimization qualities, e.g., convexity. Of course, to show that this surrogate loss successfully addresses the original problem, one needs to establish consistency, which depends crucially on the choice of link function that maps surrogate reports (predictions) to original reports. After introducing notation, and terminology from property elicitation, we thus give a sufficient condition for consistency, calibration (Def. 4), which depends solely on the conditional label distribution $p \in \Delta_\mathcal{Y}$.
2.1 Notation and Losses
Let $\mathcal{Y}$ be a finite outcome (label) space, and throughout let $n = |\mathcal{Y}|$. The set of probability distributions on $\mathcal{Y}$ is denoted $\Delta_\mathcal{Y} \subseteq \mathbb{R}^\mathcal{Y}_+$, represented as vectors of probabilities. We write $p_y$ for the probability of outcome $y$ drawn from $p \in \Delta_\mathcal{Y}$.
We assume that a given discrete prediction problem, such as classification, is given in the form of a discrete loss $\ell : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$, which maps a report (prediction) $r$ from a finite set $\mathcal{R}$ to the vector of loss values $\ell(r) = (\ell(r)_y)_{y \in \mathcal{Y}}$ for each possible outcome $y \in \mathcal{Y}$. We will assume throughout that the given discrete loss is non-redundant, meaning every report is uniquely optimal (minimizes expected loss) for some distribution $p \in \Delta_\mathcal{Y}$. Similarly, surrogate losses will be written $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$, typically with reports written $u \in \mathbb{R}^d$. We write the corresponding expected loss when $Y \sim p$ as $\langle p, \ell(r) \rangle$ and $\langle p, L(u) \rangle$. The Bayes risk of a loss $L$ is the function $\underline{L} : \Delta_\mathcal{Y} \to \mathbb{R}$ given by $\underline{L}(p) = \inf_{u \in \mathbb{R}^d} \langle p, L(u) \rangle$; naturally for discrete losses we write $\underline{\ell}$ (and the infimum is over $\mathcal{R}$).
For example, 0-1 loss is a discrete loss with $\mathcal{R} = \mathcal{Y}$ given by $\ell_{0\text{-}1}(r)_y = \mathbb{1}\{r \neq y\}$, with Bayes risk $\underline{\ell}_{0\text{-}1}(p) = 1 - \max_{y \in \mathcal{Y}} p_y$. Two important surrogates for $\mathcal{Y} = \{-1, 1\}$ are hinge loss $L_{\mathrm{hinge}}(u)_y = \max(0, 1 - uy)$, where $u \in \mathbb{R}$, and logistic loss $L_{\log}(u)_y = \log(1 + e^{-uy})$ for $u \in \mathbb{R}$.
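These definitions are easy to check numerically; the following sketch (our own code, with function names chosen here) evaluates the three losses and the Bayes risk of 0-1 loss for the binary outcome space $\mathcal{Y} = \{-1, 1\}$:

```python
import math

def zero_one(r, y):
    """0-1 loss: 1 if the report differs from the outcome."""
    return 0.0 if r == y else 1.0

def hinge(u, y):
    """Hinge loss for outcomes y in {-1, +1} and scalar report u."""
    return max(0.0, 1.0 - u * y)

def logistic(u, y):
    """Logistic loss for outcomes y in {-1, +1}."""
    return math.log(1.0 + math.exp(-u * y))

def bayes_risk_01(p):
    """Bayes risk of 0-1 loss: 1 - max_y p_y."""
    return 1.0 - max(p)
```

For instance, `bayes_risk_01((0.3, 0.7))` evaluates the formula $1 - \max_y p_y$ at $p = (0.3, 0.7)$.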
Most of the surrogates we consider will be polyhedral, meaning piecewise linear and convex; we therefore briefly recall the relevant definitions. In $\mathbb{R}^d$, a polyhedral set or polyhedron is the intersection of a finite number of closed halfspaces. A polytope is a bounded polyhedral set. A convex function is polyhedral if its epigraph is polyhedral, or equivalently, if it can be written as a pointwise maximum of a finite set of affine functions (Rockafellar, 1997).
Definition 1 (Polyhedral loss).
A loss $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$ is polyhedral if $L(u)_y$ is a polyhedral (convex) function of $u$ for each $y \in \mathcal{Y}$.
For example, hinge loss is polyhedral, whereas logistic loss is not.
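To see the max-of-affine representation concretely, the following sketch (ours) evaluates hinge loss as a pointwise maximum of its two affine pieces and checks the result agrees with the direct formula:

```python
def max_affine(pieces, u):
    """Evaluate a polyhedral convex function, represented as a pointwise
    maximum of affine pieces (a, b) |-> a*u + b, at the point u."""
    return max(a * u + b for a, b in pieces)

def hinge_pieces(y):
    """Hinge loss for outcome y as its two affine pieces: 0 and 1 - y*u."""
    return [(0.0, 0.0), (-float(y), 1.0)]

# Agreement with the direct formula max(0, 1 - u*y) on a few points:
for u in (-2.0, -0.5, 0.0, 0.7, 3.0):
    for y in (-1, 1):
        assert max_affine(hinge_pieces(y), u) == max(0.0, 1.0 - u * y)
```

Logistic loss admits no such finite representation, which is exactly why it is not polyhedral.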
2.2 Property Elicitation
To make headway, we will appeal to concepts and results from the property elicitation literature, which elevates the property, or map from distributions to optimal reports, as a central object to study in its own right. In our case, this map will often be multivalued, meaning a single distribution could yield multiple optimal reports. (For example, when $p = (\tfrac12, \tfrac12)$, both reports $-1$ and $1$ optimize 0-1 loss.) To this end, we will use double arrow notation to mean a mapping to all nonempty subsets, so that $\Gamma : \Delta_\mathcal{Y} \rightrightarrows \mathcal{R}$ is shorthand for $\Gamma : \Delta_\mathcal{Y} \to 2^{\mathcal{R}} \setminus \{\emptyset\}$. See the discussion following Definition 3 for conventions regarding $\gamma$, $\Gamma$, $\ell$, $L$, $r$, $u$, etc.
Definition 2 (Property, level set).
A property is a function $\Gamma : \Delta_\mathcal{Y} \rightrightarrows \mathcal{R}$. The level set of $\Gamma$ for report $r$ is the set $\Gamma_r := \{p \in \Delta_\mathcal{Y} : r \in \Gamma(p)\}$.
Intuitively, $\Gamma(p)$ is the set of reports which should be optimal for a given distribution $p$, and $\Gamma_r$ is the set of distributions for which the report $r$ should be optimal. For example, the mode is the property $\mathrm{mode}(p) = \arg\max_{y \in \mathcal{Y}} p_y$, and captures the set of optimal reports for 0-1 loss: for each distribution over the labels, one should report the most likely label. In this case we say 0-1 loss elicits the mode, as we formalize below.
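The mode and its agreement with the optimal reports of 0-1 loss can be sketched as follows (our code; the dictionary keys play the role of labels):

```python
def mode(p):
    """The mode property: the set of labels with maximal probability."""
    m = max(p.values())
    return {y for y, py in p.items() if py == m}

def optimal_reports_01(p):
    """Reports minimizing expected 0-1 loss <p, l(r)>, which equals
    1 - p_r for report r."""
    best = min(1.0 - py for py in p.values())
    return {r for r, pr in p.items() if abs((1.0 - pr) - best) < 1e-12}

p = {"a": 0.5, "b": 0.3, "c": 0.2}
assert mode(p) == optimal_reports_01(p) == {"a"}
```

A tied distribution such as `{"a": 0.4, "b": 0.4, "c": 0.2}` illustrates the multivalued case: both functions return the two-element set `{"a", "b"}`.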
Definition 3 (Elicits).
A loss $L$ elicits a property $\Gamma$ if for all $p \in \Delta_\mathcal{Y}$,
$\Gamma(p) = \arg\min_{u} \langle p, L(u) \rangle~.$
As $\Gamma$ is uniquely defined by $L$, we write $\mathrm{prop}[L]$ to refer to the property elicited by a loss $L$.
For finite properties (those with $|\mathcal{R}| < \infty$) and discrete losses, we will use lowercase notation $\gamma$ and $\ell$, respectively, with reports $r \in \mathcal{R}$; for surrogate properties and losses we use $\Gamma$ and $L$, with reports $u \in \mathbb{R}^d$. For general properties and losses, we will also use $\Gamma$ and $L$, as above.
2.3 Links and Embeddings
To assess whether a surrogate and link function align with the original loss, we turn to the common condition of calibration. Roughly, a surrogate and link are calibrated if the best possible expected loss achieved by linking to an incorrect report is strictly suboptimal.
Definition 4 (Calibrated).
Let original loss $\ell : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$ eliciting $\gamma$, proposed surrogate $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$, and link function $\psi : \mathbb{R}^d \to \mathcal{R}$ be given. We say $(L, \psi)$ is calibrated with respect to $\ell$ if for all $p \in \Delta_\mathcal{Y}$,
$\inf_{u \,:\, \psi(u) \notin \gamma(p)} \langle p, L(u) \rangle \;>\; \inf_{u \in \mathbb{R}^d} \langle p, L(u) \rangle~.$
It is well-known that calibration implies consistency, in the following sense (cf. Agarwal and Agarwal (2015)). Given feature space $\mathcal{X}$, fix a distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. Let $L^*$ be the best possible expected $L$-loss achieved by any surrogate hypothesis $h : \mathcal{X} \to \mathbb{R}^d$, and $\ell^*$ the best expected $\ell$-loss for any hypothesis $\mathcal{X} \to \mathcal{R}$, respectively. Then $(L, \psi)$ is consistent if, for any sequence of surrogate hypotheses $h_m$ whose $L$-loss limits to $L^*$, the $\ell$-loss of $\psi \circ h_m$ limits to $\ell^*$. As Definition 4 does not involve the feature space $\mathcal{X}$, we will drop it for the remainder of the paper.
Several consistent convex surrogates in the literature can be thought of as “embeddings”, wherein one maps the discrete reports to a vector space, and finds a convex loss which agrees with the original loss. A key condition is that the original reports should be optimal exactly when the corresponding embedded points are optimal. We formalize this notion as follows.
Definition 5 (Embeds).
A loss $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$ embeds a loss $\ell : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$ if there exists some injective embedding $\varphi : \mathcal{R} \to \mathbb{R}^d$ such that (i) for all $r \in \mathcal{R}$ we have $L(\varphi(r)) = \ell(r)$, and (ii) for all $p \in \Delta_\mathcal{Y}$, $r \in \mathcal{R}$ we have
$r \in \gamma(p) \iff \varphi(r) \in \Gamma(p)~. \qquad (3)$
Note that it is not clear if embeddings give rise to calibrated links; indeed, apart from mapping the embedded points back to their original reports via $\varphi^{-1}$, how to map the remaining values of $u \in \mathbb{R}^d$ is far from clear. We address the question of when embeddings lead to calibrated links in Section 4.
To illustrate the idea of embedding, let us examine hinge loss in detail as a surrogate for 0-1 loss for binary classification. Recall that we have $\mathcal{Y} = \mathcal{R} = \{-1, 1\}$, with $L_{\mathrm{hinge}}(u)_y = \max(0, 1 - uy)$ and $\ell_{0\text{-}1}(r)_y = \mathbb{1}\{r \neq y\}$, typically with link function $\psi = \mathrm{sign}$. We will see that hinge loss embeds (2 times) 0-1 loss, via the embedding $\varphi : r \mapsto r$. For condition (i), it is straightforward to check that $L_{\mathrm{hinge}}(\varphi(r)) = 2\,\ell_{0\text{-}1}(r)$ for all $r \in \mathcal{R}$. For condition (ii), let us compute the property each loss elicits, i.e., the set of optimal reports for each $p \in \Delta_\mathcal{Y}$:
In particular, we see that $\gamma := \mathrm{prop}[2\ell_{0\text{-}1}]$ is the mode, and $r \in \gamma(p) \iff \varphi(r) \in \Gamma(p)$ for $\Gamma := \mathrm{prop}[L_{\mathrm{hinge}}]$. With both conditions of Definition 5 satisfied, we conclude that $L_{\mathrm{hinge}}$ embeds $2\ell_{0\text{-}1}$. In this particular case, it is known that $(L_{\mathrm{hinge}}, \mathrm{sign})$ is calibrated for $\ell_{0\text{-}1}$; we address in Section 4 the interesting question of whether embeddings lead to calibration in general.
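Both conditions of Definition 5 can be verified numerically for this example; the sketch below (ours) checks condition (i) exactly and condition (ii) on a grid of distributions, using the fact that the expected hinge loss is piecewise linear in $u$ with knots at $\pm 1$, so its infimum is attained on the grid:

```python
def hinge(u, y):
    return max(0.0, 1.0 - u * y)

def zero_one(r, y):
    return 0.0 if r == y else 1.0

embed = {-1: -1.0, +1: +1.0}                 # phi(r) = r

# (i) hinge recovers 2 * (0-1 loss) at the embedded points:
for r in (-1, 1):
    for y in (-1, 1):
        assert hinge(embed[r], y) == 2 * zero_one(r, y)

# (ii) r optimal for 0-1 loss  <=>  phi(r) optimal for hinge:
us = [k / 100.0 for k in range(-300, 301)]   # grid containing the knots +-1
for k in range(1, 100, 2):                   # p1 = Pr[y = +1], avoiding ties
    p1 = k / 100.0
    exp_L = lambda u: (1 - p1) * hinge(u, -1) + p1 * hinge(u, 1)
    min_L = min(exp_L(u) for u in us)        # attained at u = +-1, in the grid
    exp_l = {r: (1 - p1) * zero_one(r, -1) + p1 * zero_one(r, 1) for r in (-1, 1)}
    for r in (-1, 1):
        r_opt = exp_l[r] <= min(exp_l.values()) + 1e-9
        u_opt = exp_L(embed[r]) <= min_L + 1e-9
        assert r_opt == u_opt
```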
3 Embeddings and Polyhedral Losses
In this section, we establish a tight relationship between the technique of embedding and the use of polyhedral (piecewise-linear convex) surrogate losses. We defer to the following section the question of when such surrogates are consistent.
To begin, we observe that, somewhat surprisingly, our embedding condition in Definition 5 is equivalent to merely matching Bayes risks. This useful fact will drive many of our results.
Proposition 1.
A loss $L$ embeds a discrete loss $\ell$ if and only if $\underline{L} = \underline{\ell}$.
Proof. Throughout we have $\ell : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$ and $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$, and define $\gamma := \mathrm{prop}[\ell]$ and $\Gamma := \mathrm{prop}[L]$. Suppose $L$ embeds $\ell$ via the embedding $\varphi$. Letting $\mathcal{U} := \varphi(\mathcal{R})$, define $\Lambda : \Delta_\mathcal{Y} \rightrightarrows \mathcal{U}$ by $\Lambda : p \mapsto \varphi(\gamma(p))$. To see that $\Lambda(p) \neq \emptyset$ for all $p$, note that by the definition of $\gamma$ as the property elicited by $\ell$ we have some $r \in \gamma(p)$, and by the embedding condition (3), $\varphi(r) \in \Gamma(p)$. By Lemma 3, we see that $L|_\mathcal{U}$ (the loss $L$ with reports restricted to $\mathcal{U}$) elicits $\Lambda$ and $\underline{L|_\mathcal{U}} = \underline{L}$. As $L(\varphi(r)) = \ell(r)$ by the embedding, we have
$\langle p, L(\varphi(r)) \rangle = \langle p, \ell(r) \rangle$
for all $p \in \Delta_\mathcal{Y}$, $r \in \mathcal{R}$. Combining with the above, we now have $\underline{L} = \underline{L|_\mathcal{U}} = \underline{\ell}$.
For the reverse implication, assume that $\underline{L} = \underline{\ell}$. In what follows, we implicitly work in the affine hull of $\Delta_\mathcal{Y}$, so that interiors are well-defined, and $\underline{L}$ may be differentiable on the (relative) interior of $\Delta_\mathcal{Y}$. Since $\ell$ is discrete, $-\underline{\ell}$ is polyhedral as the pointwise maximum of a finite set of linear functions. The projection of its epigraph onto $\Delta_\mathcal{Y}$ forms a power diagram by Theorem 4, whose cells are full-dimensional and correspond to the level sets $\gamma_r$ of $\gamma$.
For each $r \in \mathcal{R}$, let $p_r$ be a distribution in the interior of the cell $\gamma_r$, and let $u_r \in \Gamma(p_r)$. Observe that, by definition of the Bayes risk and $\underline{L} = \underline{\ell}$, for all $p \in \Delta_\mathcal{Y}$ and $u \in \mathbb{R}^d$,
the hyperplane $p' \mapsto \langle p', L(u) \rangle$ lies above the graph of $\underline{L}$, and touches it at the point $(p, \underline{L}(p))$ if and only if $u \in \Gamma(p)$. Thus, the hyperplane $p' \mapsto \langle p', L(u_r) \rangle$ supports $\underline{\ell}$ at the point $(p_r, \underline{\ell}(p_r))$, and thus does so at the entire facet corresponding to $\gamma_r$; by the above, $u_r \in \Gamma(p)$ for all such distributions $p$ as well. We conclude that $r \in \gamma(p) \iff u_r \in \Gamma(p)$, satisfying condition (3) for $\varphi : r \mapsto u_r$. To see that the loss values match, we merely note that the supporting hyperplanes to the facets of $\underline{L}$ and $\underline{\ell}$ are the same, and the loss values are uniquely determined by the supporting hyperplane. (In particular, if $p \mapsto \langle p, z \rangle$ supports the facet corresponding to $\gamma_r$, we have $\ell(r)_y = \langle \delta_y, z \rangle = z_y$, where $\delta_y$ is the point distribution on outcome $y$.) ∎
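The forward direction of Proposition 1 can be sanity-checked on the running hinge example: the Bayes risk of hinge loss should equal that of $2\ell_{0\text{-}1}$, namely $2\min(p_{-1}, p_{+1})$. A small numeric check (ours):

```python
def bayes_risk_hinge(p_minus, p_plus):
    """inf_u of p_minus*max(0, 1+u) + p_plus*max(0, 1-u).  The expected
    loss is piecewise linear in u with knots at -1 and +1, so the
    infimum is attained on the grid below."""
    us = [k / 100.0 for k in range(-300, 301)]
    return min(p_minus * max(0.0, 1.0 + u) + p_plus * max(0.0, 1.0 - u)
               for u in us)

# Bayes risk of 2 * (0-1 loss) is 2 * (1 - max_y p_y) = 2 * min(p-, p+):
for k in range(0, 101):
    p_plus = k / 100.0
    p_minus = 1.0 - p_plus
    assert abs(bayes_risk_hinge(p_minus, p_plus)
               - 2 * min(p_minus, p_plus)) < 1e-9
```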
From this more succinct embedding condition, we can in turn simplify the condition that a loss embeds some discrete loss: it does if and only if its Bayes risk is polyhedral. (We say a concave function is polyhedral if its negation is a polyhedral convex function.) Note that the Bayes risk, a function from distributions over $\mathcal{Y}$ to the reals, may be polyhedral even if the loss itself is not.
Theorem 1.
A loss $L$ embeds a discrete loss if and only if $\underline{L}$ is polyhedral.
Proof. If $L$ embeds $\ell$, Proposition 1 gives us $\underline{L} = \underline{\ell}$, and its proof already argued that $\underline{\ell}$ is polyhedral. For the converse, let $\underline{L}$ be polyhedral; we again examine the proof of Proposition 1. The projection of the epigraph of $-\underline{L}$ onto $\Delta_\mathcal{Y}$ forms a power diagram by Theorem 4 with finitely many cells, which we can index by a finite report set $\mathcal{R}$. Defining the property $\gamma : \Delta_\mathcal{Y} \rightrightarrows \mathcal{R}$ by $\gamma(p) = \{r \in \mathcal{R} : p \in \gamma_r\}$, where $\gamma_r$ is the cell indexed by $r$, we see that the same construction gives us points $u_r$ such that $r \in \gamma(p) \iff u_r \in \Gamma(p)$. Defining $\ell : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$ by $\ell(r) = L(u_r)$, the same proof shows that $L$ embeds $\ell$. ∎
Corollary 1.
Every polyhedral loss embeds a discrete loss.
We now turn to the reverse direction: which discrete losses are embedded by some polyhedral loss? Perhaps surprisingly, we show that every discrete loss is embeddable, using a construction via convex conjugate duality which has appeared several times in the literature (e.g. Duchi et al. (2018); Abernethy et al. (2013); Frongillo and Kash (2014)). Note however that the number of dimensions required could be as large as $n$.
Theorem 2.
Every discrete loss is embedded by a polyhedral loss.
Proof. Let $\bar{\ell} : \mathbb{R}^\mathcal{Y} \to \mathbb{R} \cup \{\infty\}$ be given by $\bar{\ell}(p) = -\underline{\ell}(p)$ for $p \in \Delta_\mathcal{Y}$ and $\bar{\ell}(p) = \infty$ otherwise, and let $C = \bar{\ell}^*$, the convex conjugate of $\bar{\ell}$. From standard results in convex analysis, $C$ is polyhedral as $\bar{\ell}$ is, and $C$ is finite on all of $\mathbb{R}^\mathcal{Y}$ as the domain of $\bar{\ell}$ is bounded (Rockafellar, 1997, Corollary 13.3.1). Note that $\bar{\ell}$ is a closed convex function, as $\underline{\ell}$ is the infimum of affine functions, and thus $C^* = \bar{\ell}$. Define $L : \mathbb{R}^\mathcal{Y} \to \mathbb{R}^\mathcal{Y}$ by $L(u) = C(u)\mathbf{1} - u$, where $\mathbf{1} \in \mathbb{R}^\mathcal{Y}$ is the all-ones vector. We first show that $L$ embeds $\ell$, and then establish that the range of $L$ is in fact $\mathbb{R}^\mathcal{Y}_+$, as desired.
We compute Bayes risks and apply Proposition 1 to see that $L$ embeds $\ell$. For any $p \in \Delta_\mathcal{Y}$, we have
$\underline{L}(p) = \inf_u \langle p, C(u)\mathbf{1} - u \rangle = \inf_u C(u) - \langle p, u \rangle = -C^*(p) = -\bar{\ell}(p) = \underline{\ell}(p)~.$
It remains to show that $L(u) \in \mathbb{R}^\mathcal{Y}_+$ for all $u \in \mathbb{R}^\mathcal{Y}$. Letting $\delta_y$ be the point distribution on outcome $y$, we have for all $y \in \mathcal{Y}$,
$L(u)_y = C(u) - u_y \geq \big( \langle \delta_y, u \rangle + \underline{\ell}(\delta_y) \big) - u_y = \underline{\ell}(\delta_y) \geq 0~,$
where the final inequality follows from the nonnegativity of $\ell$. ∎
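The construction in this proof can be carried out by hand for 0-1 loss on $n = 2$ outcomes; the sketch below (our code, with the closed form of $C$ computed by us from the piecewise-linear structure) checks both that the embedding points recover the loss vectors and that the Bayes risks match:

```python
def C(u1, u2):
    """Convex conjugate of the negated Bayes risk of 0-1 loss on two
    outcomes: C(u) = max_{p in simplex} <p, u> + min(p1, p2).  The
    objective is piecewise linear in p, so the max is attained at one of
    p = (1, 0), (0, 1), or (1/2, 1/2)."""
    return max(u1, u2, (u1 + u2) / 2.0 + 0.5)

def L(u1, u2):
    """The polyhedral surrogate L(u) = C(u)*1 - u from the construction."""
    c = C(u1, u2)
    return (c - u1, c - u2)

# The embedding points recover the 0-1 loss vectors:
assert L(1.0, 0.0) == (0.0, 1.0)     # embeds the report "outcome 1"
assert L(0.0, 1.0) == (1.0, 0.0)     # embeds the report "outcome 2"

# Bayes risks match: inf_u <p, L(u)> = min(p1, p2), checked on a grid
# (by conjugacy the inf is attained at the embedding points, in the grid):
grid = [(a / 10.0, b / 10.0) for a in range(-20, 21) for b in range(-20, 21)]
for k in range(0, 11):
    p1, p2 = k / 10.0, 1.0 - k / 10.0
    best = min(p1 * v1 + p2 * v2 for v1, v2 in (L(u1, u2) for u1, u2 in grid))
    assert abs(best - min(p1, p2)) < 1e-9
```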
4 Consistency via Calibrated Links
We have now seen the tight relationship between polyhedral losses and embeddings; in particular, every polyhedral loss embeds some discrete loss. The embedding itself tells us how to link the embedded points back to the discrete reports (map $\varphi(r)$ to $r$), but it is not clear when this link can be extended to the remaining reports $u \in \mathbb{R}^d$, and whether such a link can lead to consistency. In this section, we give a construction to generate calibrated links for any polyhedral loss.
Appendix D contains the full proof; this section provides a sketch along with the main construction and result. The first step is to give a link $\psi$ such that exactly minimizing expected surrogate loss $\langle p, L(\cdot) \rangle$, followed by applying $\psi$, always exactly minimizes expected original loss $\langle p, \ell(\cdot) \rangle$. The existence of such a link is somewhat subtle, because in general some point $u$ that is far from any embedding point can minimize expected loss for two very different distributions $p, p'$, making it unclear whether there exists a link choice $\psi(u)$ that is simultaneously optimal for both. We show that as we vary $p$ over $\Delta_\mathcal{Y}$, there are only finitely many sets of the form $\Gamma(p)$ (Lemma 4). Associating each such set $U$ with $\mathcal{R}_U := \{r \in \mathcal{R} : \varphi(r) \in U\}$, the set of reports whose embedding points are in $U$, we enforce that all points in $U$ link to some report in $\mathcal{R}_U$. (As a special case, embedding points must link to their corresponding reports.) Proving this is well-defined uses a chain of arguments involving the Bayes risk, ultimately showing that if $u$ lies in multiple such sets $U$, the corresponding report sets all intersect at some common report.
Intuitively, to create separation, we just need to “thicken” this construction by mapping all approximately-optimal points $u$ to optimal reports $r$. Let $\mathcal{U}$ contain all optimal report sets of the form $\Gamma(p)$ above. A key step in the following definition will be to narrow down a “link envelope” $\Psi$, where $\Psi(u)$ denotes the legal or valid choices for the link value $\psi(u)$.
Given a polyhedral $L$ that embeds some $\ell$, an $\epsilon > 0$, and a norm $\|\cdot\|$, the $\epsilon$-thickened link $\psi$ is constructed as follows. First, initialize $\Psi(u) = \mathcal{R}$ for all $u \in \mathbb{R}^d$. Then for each $U \in \mathcal{U}$, for all points $u$ such that $\inf_{u^* \in U} \|u - u^*\| \leq \epsilon$, update $\Psi(u) \leftarrow \Psi(u) \cap \mathcal{R}_U$. Finally, define $\psi(u)$ to be any element of $\Psi(u)$, breaking ties arbitrarily. If $\Psi(u)$ became empty, then leave $\psi(u)$ undefined.
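For the hinge example, where the finitely many optimal sets are $\{-1\}$, $\{1\}$, and $[-1, 1]$, the construction can be sketched as follows (our code; the tie-breaking rule is an arbitrary choice):

```python
def thickened_link(u, eps=0.25):
    """Sketch of the eps-thickened link for hinge loss (which embeds
    2 * 0-1 loss).  Each optimal set U is an interval [lo, hi] paired
    with its report set R_U."""
    optimal_sets = [((-1.0, -1.0), {-1}),      # U = {-1},     R_U = {-1}
                    ((+1.0, +1.0), {+1}),      # U = {+1},     R_U = {+1}
                    ((-1.0, +1.0), {-1, +1})]  # U = [-1, +1], R_U = {-1, +1}
    envelope = {-1, +1}                        # initialize Psi(u) to all reports
    for (lo, hi), R_U in optimal_sets:
        if max(lo - u, u - hi, 0.0) <= eps:    # distance from u to [lo, hi]
            envelope &= R_U
    assert envelope, "for small enough eps the envelope is never empty"
    return min(envelope)                       # break ties arbitrarily
```

Points within $\epsilon$ of an optimal set pick up its reports, e.g. `thickened_link(0.9)` links to $1$; points in the interior keep the full envelope and tie-break, which is harmless since they are never $\epsilon$-optimal for any $p$.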
Theorem 3.
Let $L$ be polyhedral, and $\ell$ the discrete loss it embeds from Theorem 1. Then for small enough $\epsilon > 0$, the $\epsilon$-thickened link $\psi$ is well-defined and, furthermore, is a calibrated link from $L$ to $\ell$.
Proof sketch. Well-defined: For the initial construction above, we argued that if some collection of sets $U_1, \ldots, U_m \in \mathcal{U}$ overlap at a point $u$, then their report sets also overlap, so there is a valid choice $\psi(u) \in \mathcal{R}_{U_1} \cap \cdots \cap \mathcal{R}_{U_m}$. Now, we thicken all sets by a small enough $\epsilon$; it can be shown that if the thickened sets overlap at $u$, then the sets $U_1, \ldots, U_m$ themselves overlap, so again the report sets overlap and there is a valid choice $\psi(u)$.
Calibrated: By construction of the thickened link, if $u$ maps to an incorrect report, i.e. $\psi(u) \notin \gamma(p)$, then $u$ must have distance at least $\epsilon$ to the optimal set $\Gamma(p)$. We then show that the minimal gradient of the expected loss along any direction away from $\Gamma(p)$ is lower-bounded, giving a constant excess expected loss at $u$. ∎
5 Application to Specific Surrogates
Our results give a framework to construct consistent surrogates and link functions for any discrete loss, but they also provide a way to verify the consistency or inconsistency of given surrogates. Below, we illustrate the power of this framework with specific examples from the literature, as well as new examples. In some cases we simplify existing proofs, while in others we give new results, such as a new calibrated link for abstain loss, and the inconsistency of the recently proposed Lovász hinge.
5.1 Consistency of abstain surrogate and link construction
In classification settings with a large number of labels, several authors consider a variant of classification, with the addition of a “reject” or abstain option. For example, Ramaswamy et al. (2018) study the loss $\ell_{\frac12} : \mathcal{Y} \cup \{\bot\} \to \mathbb{R}^\mathcal{Y}_+$ defined by $\ell_{\frac12}(r)_y = 0$ if $r = y$, $\frac12$ if $r = \bot$, and $1$ otherwise. Here, the report $\bot$ corresponds to “abstaining” if no label is sufficiently likely, specifically, if no $y \in \mathcal{Y}$ has $p_y \geq \frac12$. Ramaswamy et al. (2018) provide a polyhedral surrogate for $\ell_{\frac12}$, which we present here. Letting $d = \lceil \log_2 n \rceil$, their surrogate is given by
$L_\bot(u)_y = \max\Big( 0,\; \max_{j \in [d]} \big(1 - B(y)_j\, u_j\big) \Big)~, \qquad (4)$
where $B : \mathcal{Y} \to \{-1, 1\}^d$ is an arbitrary injection; let us assume $n = 2^d$ so that we have a bijection. Consistency is proven for a particular link function $\psi$, which maps $u$ to the label $y$ whose code $B(y)$ matches $\mathrm{sign}(u)$, and to $\bot$ when some coordinate of $u$ is too close to zero.
In light of our framework, we can see that $L_\bot$ is an excellent example of an embedding, where $\varphi(y) = B(y)$ for $y \in \mathcal{Y}$ and $\varphi(\bot) = 0$. Moreover, the link function $\psi$ can be recovered from Theorem 3 for an appropriate choice of norm and $\epsilon$; see Figure 1(L). Hence, our framework would have simplified the process of finding such a link, and the corresponding proof of consistency. To illustrate this point further, we give an alternate link $\psi'$, corresponding to a different choice of norm and $\epsilon$, shown in Figure 1(R):
Theorem 3 immediately gives calibration of $(L_\bot, \psi')$ with respect to $\ell_{\frac12}$. Aside from its simplicity, one possible advantage of $\psi'$ is that it appears to yield the same constant in generalization bounds as $\psi$, yet assigns $\bot$ to much less of the surrogate space $\mathbb{R}^d$. It would be interesting to compare the two links in practice.
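Under our reading of eq. (4), the embedding claimed above can be checked directly for $n = 4$, $d = 2$; the code below, including the particular bijection $B$, is an illustrative choice of ours:

```python
# Binary codes B(y) in {-1, +1}^2 for n = 4 labels (an arbitrary bijection):
B = {0: (-1, -1), 1: (-1, +1), 2: (+1, -1), 3: (+1, +1)}

def L_abstain(u, y):
    """Our reading of eq. (4): L(u)_y = max(0, max_j (1 - B(y)_j * u_j))."""
    return max(0.0, max(1.0 - b * uj for b, uj in zip(B[y], u)))

def ell_abstain(r, y):
    """Abstain loss: 0 if correct, 1/2 for abstaining, 1 otherwise."""
    if r == "abstain":
        return 0.5
    return 0.0 if r == y else 1.0

# Check the embedding phi(y) = B(y), phi(abstain) = 0, at scale 2:
for y in B:
    for out in B:
        assert L_abstain(B[y], out) == 2 * ell_abstain(y, out)
    assert L_abstain((0.0, 0.0), y) == 2 * ell_abstain("abstain", y)
```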
5.2 Inconsistency of Lovász hinge
Many structured prediction settings can be thought of as making multiple predictions at once, with a loss function that jointly measures error based on the relationship between these predictions Hazan et al. (2010); Gao and Zhou (2011); Osokin et al. (2017). In the case of binary predictions, these settings are typically formalized by taking the predictions and outcomes to be vectors, so $\mathcal{R} = \mathcal{Y} = \{-1, 1\}^n$. One then defines a joint loss function, which is often merely a function of the set of mispredictions, meaning $\ell(r)_y = g(\{i : r_i \neq y_i\})$ for some set function $g : 2^{[n]} \to \mathbb{R}_+$. For example, Hamming loss is simply given by $g : S \mapsto |S|$. In an effort to provide a general convex surrogate for these settings when $g$ is a submodular function, Yu and Blaschko (2018) introduce the Lovász hinge, which leverages the well-known convex Lovász extension of submodular functions. While the authors provide theoretical justification and experiments, consistency of the Lovász hinge is left open, which we resolve.
Rather than formally define the Lovász hinge, we defer the complete analysis to Appendix E, and focus here on the case $n = 2$. For brevity, we write $g_1 = g(\{1\})$, $g_2 = g(\{2\})$, and $g_{12} = g(\{1, 2\})$. Assuming $g$ is normalized and increasing (meaning $g(\emptyset) = 0$ and $g(S) \leq g(T)$ whenever $S \subseteq T$), the Lovász hinge is given by
$L_g(u)_y = \max\big(\, g_1 (s_1)_+ + (g_{12} - g_1)(s_2)_+ \,,\; g_2 (s_2)_+ + (g_{12} - g_2)(s_1)_+ \big)~, \qquad (13)$
where $s_i = 1 - u_i y_i$ and $(x)_+ = \max(0, x)$. We will explore the range of values of $g$ for which $(L_g, \psi)$ is consistent, where the link function is fixed as $\psi = \mathrm{sign}$, with ties broken arbitrarily.
Let us consider $g_1 = g_2 = g_{12} = 1$, for which $\ell$ is merely 0-1 loss on $\mathcal{Y} = \{-1, 1\}^2$. For consistency, then, for any distribution $p$, we must have that whenever $u$ minimizes the expected loss $\langle p, L_g(u) \rangle$, then $\psi(u) = \mathrm{sign}(u)$ must be the most likely outcome, in $\arg\max_y p_y$. Simplifying eq. (13), however, we have
$L_g(u)_y = \max\big( 0,\, 1 - u_1 y_1,\, 1 - u_2 y_2 \big)~,$
which is exactly the abstain surrogate (4) for $n = 4$. We immediately conclude that $(L_g, \mathrm{sign})$ cannot be consistent, as the origin $u = 0$ will be the unique optimal report for $L_g$ under distributions $p$ with $p_y < 1/2$ for all $y$, and one can simply take a distribution which disagrees with the way ties are broken in $\mathrm{sign}(0)$. For example, if we take $\mathrm{sign}(0) = (1, 1)$, then under any such $p$ with $(1, 1) \notin \arg\max_y p_y$, we have $\{0\} = \arg\min_u \langle p, L_g(u) \rangle$, yet $\psi(0) = (1, 1) \notin \gamma(p)$.
In fact, this example is typical: using our embedding framework, and characterizing when the origin is an embedded point, one can show that $L_g$ is consistent if and only if $g_{12} = g_1 + g_2$. Moreover, this linear case, which corresponds to $g$ being modular, is exactly the case where the Lovász hinge reduces to weighted Hamming loss, which is trivially consistent from the consistency of hinge loss for 0-1 loss. In Appendix E, we generalize this observation to all $n$: $L_g$ is consistent if and only if $g$ is modular. In other words, even restricting to submodular $g$, the only consistent Lovász hinge is weighted Hamming loss. These results cast doubt on the effectiveness of the Lovász hinge in practice.
5.3 Inconsistency of top- losses
In certain classification problems when ground truth may be ambiguous, such as object identification, it is common to predict a set of possible labels. As one instance, the top-$k$ classification problem is to predict the set of $k$ most likely labels; formally, we have $\mathcal{Y} = [n]$, $k < n$, $\mathcal{R} = \{S \subseteq [n] : |S| = k\}$, and discrete loss $\ell_k(S)_y = \mathbb{1}\{y \notin S\}$. Surrogates for this problem commonly take reports $u \in \mathbb{R}^n$, with the link $\psi : u \mapsto \{i : u_i \geq u_{[k]}\}$, where $u_{[k]}$ is the $k$-th largest entry of $u$.
Lapin et al. (2015, 2016, 2018) provide the following convex surrogate loss $L_k$ for this problem, which Yang and Koyejo (2018) show to be inconsistent (Yang and Koyejo also introduce a consistent surrogate, but it is non-convex):
where $e_y \in \mathbb{R}^n$ is $1$ in component $y$ and $0$ elsewhere. With our framework, we can say more. Specifically, while $L_k$ is not consistent for $\ell_k$, since $L_k$ is polyhedral (Lemma 17), we know from Theorem 1 that it embeds some discrete loss $\ell$, and from Theorem 3 that there is a link $\psi$ such that $(L_k, \psi)$ is calibrated (and consistent) for $\ell$. We therefore turn to deriving this discrete loss $\ell$.
For concreteness, consider the case $k = 2$ with $n = 3$ outcomes. By inspection, we can derive the properties elicited by $L_2$ and $\ell_2$, respectively, which reveals a finite set of reports, closed under permutations of the coordinates, that are always represented among the minimizers of $L_2$. The discrete loss $\ell$ that $L_2$ embeds is thus like $\ell_2$, with a punishment for reporting weight on elements of $\mathcal{Y}$ other than the outcome and a reward for showing a “higher confidence” in the correct outcome. Moreover, we can visually inspect the corresponding properties (Fig. 2) to immediately see why $L_2$ is inconsistent: for distributions where the two least likely labels are roughly equally (un)likely, the minimizer will put all weight on the most likely label, and thus fail to distinguish the other two. More generally, $L_k$ cannot be consistent because the property it embeds does not “refine” (subdivide) the top-$k$ property, so not just this link, but no link function, could make $L_k$ consistent.
6 Conclusion and Future Directions
This paper formalizes an intuitive way to design convex surrogate losses for classification-like problems—by embedding the reports into $\mathbb{R}^d$. We establish a close relationship between embedding and polyhedral surrogates, showing both that every polyhedral loss embeds a discrete loss (Theorem 1) and that every discrete loss is embedded by some polyhedral loss (Theorem 2). We then construct a calibrated link function from any polyhedral loss to the discrete loss it embeds, giving consistency for all such losses. We conclude with examples of how the embedding framework presented can be applied to understand existing surrogates in the literature, including those for the abstain loss, top-$k$ loss, and Lovász hinge.
One open question of particular interest involves the dimension of the input to a surrogate: given a discrete loss, can we construct a surrogate of minimal dimension that embeds it? If we naïvely embed the reports into an $n$-dimensional space, as in Theorem 2, the dimensionality of the problem scales with the number of possible labels $n$. As the dimension of the optimization problem is a function of this embedding dimension $d$, a promising direction is to leverage tools from elicitation complexity Lambert et al. (2008); Frongillo and Kash (2015b) and convex calibration dimension Ramaswamy and Agarwal (2016) to understand when we can take $d \ll n$.
We thank Arpit Agarwal and Peter Bartlett for several discussions early on, which led to several important insights. We thank Eric Balkanski for his help with Lemma 16. This material is based upon work supported by the National Science Foundation under Grant No. 1657598.
- Abernethy et al. [2013] Jacob Abernethy, Yiling Chen, and Jennifer Wortman Vaughan. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):12, 2013. URL http://dl.acm.org/citation.cfm?id=2465777.
- Agarwal and Agarwal [2015] Arpit Agarwal and Shivani Agarwal. On consistent surrogate risk minimization and property elicitation. In JMLR Workshop and Conference Proceedings, volume 40, pages 1–19, 2015. URL http://www.jmlr.org/proceedings/papers/v40/Agarwal15.pdf.
- Aurenhammer [1987] Franz Aurenhammer. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987. URL http://epubs.siam.org/doi/pdf/10.1137/0216006.
- Bach [2013] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends® in Machine Learning, 6(2-3):145–373, 2013.
- Bartlett and Wegkamp [2008] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.
- Bartlett et al. [2006] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. URL http://amstat.tandfonline.com/doi/abs/10.1198/016214505000000907.
- Boyd and Vandenberghe [2004] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Crammer and Singer [2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.
- Duchi et al. [2018] John Duchi, Khashayar Khosravi, and Feng Ruan. Multiclass classification, information, divergence and surrogate risk. The Annals of Statistics, 46(6B):3246–3275, 2018.
- Fissler et al. [2016] Tobias Fissler, Johanna F. Ziegel, and others. Higher order elicitability and Osband’s principle. The Annals of Statistics, 44(4):1680–1707, 2016.
- Frongillo and Kash [2014] Rafael Frongillo and Ian Kash. General truthfulness characterizations via convex analysis. In Web and Internet Economics, pages 354–370. Springer, 2014.
- Frongillo and Kash [2015a] Rafael Frongillo and Ian Kash. Vector-Valued Property Elicitation. In Proceedings of the 28th Conference on Learning Theory, pages 1–18, 2015a.
- Frongillo and Kash [2015b] Rafael Frongillo and Ian A. Kash. On Elicitation Complexity. In Advances in Neural Information Processing Systems 29, 2015b.
- Gao and Zhou [2011] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 341–358, 2011.
- Gneiting [2011] T. Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, 2011.
- Hazan et al. [2010] Tamir Hazan, Joseph Keshet, and David A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.
- Lambert [2018] Nicolas S. Lambert. Elicitation and evaluation of statistical forecasts. 2018. URL https://web.stanford.edu/~nlambert/papers/elicitability.pdf.
- Lambert et al. [2008] Nicolas S. Lambert, David M. Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pages 129–138, 2008.
- Lapin et al. [2015] Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In Advances in Neural Information Processing Systems, pages 325–333, 2015.
- Lapin et al. [2016] Maksim Lapin, Matthias Hein, and Bernt Schiele. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Lapin et al. [2018] Maksim Lapin, Matthias Hein, and Bernt Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533–1554, 2018.
- Osband and Reichelstein [1985] Kent Osband and Stefan Reichelstein. Information-eliciting compensation schemes. Journal of Public Economics, 27(1):107–115, June 1985. ISSN 0047-2727. doi: 10.1016/0047-2727(85)90031-3. URL http://www.sciencedirect.com/science/article/pii/0047272785900313.
- Osokin et al. [2017] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302–313, 2017.
- Ramaswamy and Agarwal [2016] Harish G. Ramaswamy and Shivani Agarwal. Convex calibration dimension for multiclass loss matrices. The Journal of Machine Learning Research, 17(1):397–441, 2016.
- Ramaswamy et al. [2018] Harish G. Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.
- Reid and Williamson [2010] M. D. Reid and R. C. Williamson. Composite binary losses. The Journal of Machine Learning Research, 11:2387–2422, 2010.
- Rockafellar [1997] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1997.
- Savage [1971] L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, pages 783–801, 1971.
- Steinwart et al. [2014] Ingo Steinwart, Chloé Pasin, Robert Williamson, and Siyu Zhang. Elicitation and identification of properties. In Proceedings of the 27th Conference on Learning Theory, pages 482–526, 2014.
- Williamson et al. [2016] Robert C. Williamson, Elodie Vernet, and Mark D. Reid. Composite multiclass losses. Journal of Machine Learning Research, 17(223):1–52, 2016.
- Yang and Koyejo [2018] Forest Yang and Sanmi Koyejo. On the consistency of top-k surrogate losses. 2018.
- Yu and Blaschko [2018] Jiaqian Yu and Matthew B. Blaschko. The Lovász hinge: A novel convex surrogate for submodular losses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- Yuan and Wegkamp [2010] Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(Jan):111–130, 2010.
- Zhang et al. [2018] Chong Zhang, Wenbo Wang, and Xingye Qiao. On reject and refine options in multicategory classification. Journal of the American Statistical Association, 113(522):730–745, 2018. doi: 10.1080/01621459.2017.1282372. URL https://doi.org/10.1080/01621459.2017.1282372.
Appendix A Power diagrams
First, we present several definitions from Aurenhammer (1987).
A cell complex in $\mathbb{R}^d$ is a set $\mathcal{C}$ of faces (of dimension $0, 1, \ldots, d$) which (i) union to $\mathbb{R}^d$, (ii) have pairwise disjoint relative interiors, and (iii) any nonempty intersection of faces in $\mathcal{C}$ is a face of each and an element of $\mathcal{C}$.
Given sites $s_1, \ldots, s_k \in \mathbb{R}^d$ and weights $w_1, \ldots, w_k \in \mathbb{R}$, the corresponding power diagram is the cell complex of cells
$\mathrm{cell}(s_i) = \big\{ x \in \mathbb{R}^d : \|x - s_i\|^2 - w_i \leq \|x - s_j\|^2 - w_j \text{ for all } j \big\}~.$
A cell complex $\mathcal{C}$ in $\mathbb{R}^d$ is affinely equivalent to a (convex) polyhedron $P \subseteq \mathbb{R}^{d+1}$ if $\mathcal{C}$ is the (linear) projection of the faces of $P$ onto $\mathbb{R}^d$.
Theorem 4 (Aurenhammer (1987)).
A cell complex is affinely equivalent to a convex polyhedron if and only if it is a power diagram.
In particular, one can consider the epigraph of a polyhedral convex function $F$ on $\mathbb{R}^d$ and the projection of its faces down to $\mathbb{R}^d$; in this case we call the resulting power diagram induced by the convex function $F$. We extend Aurenhammer's result to a weighted sum of convex functions, showing that the induced power diagram is the same for any choice of strictly positive weights.
Let $F_1, \ldots, F_m$ be polyhedral convex functions on $\mathbb{R}^d$. The power diagram induced by $F_\alpha = \sum_{i=1}^m \alpha_i F_i$ is the same for all $\alpha \in \mathbb{R}^m_{>0}$.
For any convex function $F$ with epigraph $\mathcal{F}$, the proof of Aurenhammer [1987, Theorem 4] shows that the power diagram induced by $F$ is determined by the facets of $\mathcal{F}$. Let $f$ be a facet of $\mathcal{F}$, and $C$ its projection down to $\mathbb{R}^d$. It follows that $F|_C$ is affine, and thus $F$ is differentiable on $\mathrm{int}\, C$ with constant derivative $\nabla F = a$ for some $a \in \mathbb{R}^d$. Conversely, for any subgradient $a$ of $F$, the set of points $\{x \in \mathbb{R}^d : a \in \partial F(x)\}$ is the projection of a face of $\mathcal{F}$; we conclude that this set equals $C$ when $a$ is the derivative of $F$ on $\mathrm{int}\, C$, and that the facets of $\mathcal{F}$ correspond one-to-one with the maximal regions on which $F$ is differentiable with constant derivative.
Now let $F = \sum_i \alpha_i F_i$ with epigraph $\mathcal{F}$, and $G = \sum_i \beta_i F_i$ with epigraph $\mathcal{G}$, for $\alpha, \beta \in \mathbb{R}^m_{>0}$. By Rockafellar [1997], $F$ and $G$ are polyhedral. We now show that $F$ is differentiable wherever $G$ is differentiable: by additivity of the subdifferential, $\partial F(x) = \sum_i \alpha_i \, \partial F_i(x)$, which is a singleton if and only if each $\partial F_i(x)$ is a singleton (as $\alpha_i > 0$ for all $i$), which in turn holds if and only if $\partial G(x)$ is a singleton.
From the above observations, every facet of $\mathcal{F}$ is determined by the derivative of $F$ at any point in the interior of its projection, and vice versa. Letting $x$ be such a point in the interior of the projection $C$ of a facet of $\mathcal{F}$, we now see that the facet of $\mathcal{G}$ containing $(x, G(x))$ has the same projection, namely $C$. Thus, the power diagrams induced by $F$ and $G$ are the same. The conclusion follows from the observation that the above held for any strictly positive weights $\alpha$ and $\beta$, while $F_1, \ldots, F_m$ remained fixed. ∎
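The invariance in the lemma can be checked numerically. The sketch below (an illustration on invented example functions, not part of the proof) represents two polyhedral convex functions on $\mathbb{R}$ as maxima of affine pieces, and verifies that the breakpoints of the induced diagram, i.e., the locations where the gradient of $\alpha_1 F_1 + \alpha_2 F_2$ changes, do not depend on the choice of strictly positive weights.

```python
import numpy as np

def grad_max_affine(x, slopes, intercepts):
    """Subgradient selection: the slope of an active piece of max_i (a_i x + b_i)."""
    return slopes[np.argmax(slopes * x + intercepts)]

f1 = (np.array([0.0, 1.0]), np.array([0.0, 0.0]))    # F_1(x) = max(0, x)
f2 = (np.array([-1.0, 2.0]), np.array([0.0, -1.0]))  # F_2(x) = max(-x, 2x - 1)

def cells(alpha, beta, grid):
    """Indices where the gradient of alpha*F_1 + beta*F_2 changes along the grid."""
    g = np.array([alpha * grad_max_affine(x, *f1) + beta * grad_max_affine(x, *f2)
                  for x in grid])
    # The cell structure is where the gradient changes, not its numeric value.
    return np.flatnonzero(np.diff(g) != 0)

grid = np.linspace(-2, 2, 401)
# Breakpoints (near x = 0 and x = 1/3) agree for different positive weights.
assert np.array_equal(cells(1.0, 1.0, grid), cells(3.0, 0.25, grid))
assert len(cells(1.0, 1.0, grid)) == 2
```

The cells are exactly the intervals on which both example functions are simultaneously affine, which is independent of the weights, as in the lemma.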
Appendix B Embedding properties
It is often convenient to work directly with properties and set aside the losses which elicit them. To this end, we say a property $\Gamma$ embeds another property $\gamma$ if eq. (3) holds. We begin with the notion of redundancy.
Definition 10 (Finite property, non-redundant).
A property $\Gamma : \Delta_\mathcal{Y} \to 2^\mathcal{R}$ is redundant if for some reports $r \neq r'$ we have $\Gamma_r \subseteq \Gamma_{r'}$, and non-redundant otherwise. $\Gamma$ is finite if it is non-redundant and $\mathcal{R}$ is a finite set. We typically write finite properties in lower case, as $\gamma$.
When working with convex losses which are not strictly convex, one quickly encounters redundant properties: if the expected loss is minimized at a point where $L$ is flat, then there will be an uncountable set of reports which also minimize the loss. As results in property elicitation typically assume non-redundant properties (e.g. Frongillo and Kash [2014, 2015b]), it is useful to consider a transformation which removes redundant level sets. We capture this transformation as the trim operator presented below.
Given an elicitable property $\Gamma$, we define $\mathrm{trim}(\Gamma)$ as the set of maximal level sets of $\Gamma$, i.e., those level sets not strictly contained in any other level set of $\Gamma$.
Note that the unlabeled property $\mathrm{trim}(\Gamma)$ is non-redundant, meaning that for any $\Gamma_r \in \mathrm{trim}(\Gamma)$, there is no level set $\Gamma_{r'} \in \mathrm{trim}(\Gamma)$ with $\Gamma_{r'} \neq \Gamma_r$ such that $\Gamma_r \subseteq \Gamma_{r'}$.
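The trim operator has a simple finite analogue: given level sets represented as sets of distributions, keep only the maximal ones. The following sketch (with invented level-set contents) illustrates this.

```python
def trim(level_sets):
    """Keep only the maximal sets: drop any set strictly contained in another.
    Level sets are given as frozensets of (e.g. discretized) distributions."""
    return {S for S in level_sets
            if not any(S < T for T in level_sets)}  # '<' is proper subset

# A redundant collection: the middle level set sits strictly inside another.
level_sets = {frozenset({'p1', 'p2'}), frozenset({'p2'}), frozenset({'p2', 'p3'})}
assert trim(level_sets) == {frozenset({'p1', 'p2'}), frozenset({'p2', 'p3'})}
```

By construction, no surviving set is strictly contained in another, mirroring the non-redundancy of $\mathrm{trim}(\Gamma)$.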
Before we state the Proposition needed to prove many of the statements in Section 3, we will need two general lemmas about properties and their losses. The first follows from standard results relating finite properties to power diagrams (see Theorem 4 in the appendix), and its proof is omitted. The second is closely related to the trim operator: it states that if some subset of the reports is always represented among the minimizers of a loss, then one may remove all other reports and elicit the same property (with those other reports removed).
Let $\gamma$ be a finite (non-redundant) property elicited by a loss $\ell$. Then the negative Bayes risk $\overline{\ell}$ of $\ell$ is polyhedral, and the level sets of $\gamma$ are the projections of the facets of the epigraph of $\overline{\ell}$ onto $\Delta_\mathcal{Y}$, and thus form a power diagram. In particular, the level sets are full-dimensional in $\Delta_\mathcal{Y}$ (i.e., of dimension $|\mathcal{Y}| - 1$).
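For intuition on the polyhedrality claim, the negative Bayes risk of any finite loss is a pointwise maximum of finitely many linear functions of $p$, hence polyhedral. A small sketch using 0-1 loss (an illustrative choice, not required by the lemma):

```python
import numpy as np

# 0-1 loss on three outcomes: L[r][y] = 0 if r == y else 1.
L = np.ones((3, 3)) - np.eye(3)

def neg_bayes_risk(p):
    """-min_r <p, L(r)> = max_r <p, -L(r)>: a pointwise max of linear maps."""
    return np.max(-L @ p)

# For 0-1 loss the Bayes risk is 1 - max_y p_y, so the negative Bayes risk
# at p = (0.5, 0.3, 0.2) is -(1 - 0.5) = -0.5.
assert np.isclose(neg_bayes_risk(np.array([0.5, 0.3, 0.2])), -0.5)
```

The facet of the epigraph active at $p$ corresponds to the maximizing report $r$, whose level set here is the full-dimensional region where $p_r$ is largest.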
Let $L : \mathcal{R} \to \mathbb{R}^\mathcal{Y}_+$ elicit $\Gamma$, and let $\mathcal{R}' \subseteq \mathcal{R}$ such that $\Gamma(p) \cap \mathcal{R}' \neq \emptyset$ for all $p \in \Delta_\mathcal{Y}$. Then $L' = L|_{\mathcal{R}'}$ ($L$ restricted to $\mathcal{R}'$) elicits $\Gamma'$ defined by $\Gamma'(p) = \Gamma(p) \cap \mathcal{R}'$. Moreover, the Bayes risks of $L$ and $L'$ are the same.
Let $p \in \Delta_\mathcal{Y}$ be fixed throughout. First let $r \in \Gamma'(p) = \Gamma(p) \cap \mathcal{R}'$. Then $r \in \Gamma(p)$, so as $r$ minimizes $\langle p, L(\cdot) \rangle$ over $\mathcal{R} \supseteq \mathcal{R}'$ we have in particular $r \in \mathrm{argmin}_{r'' \in \mathcal{R}'} \langle p, L(r'') \rangle$. For the other direction, suppose $r \in \mathrm{argmin}_{r'' \in \mathcal{R}'} \langle p, L(r'') \rangle$. By our assumption, we must have some $r^* \in \Gamma(p) \cap \mathcal{R}'$. On the one hand, $\langle p, L(r) \rangle \leq \langle p, L(r^*) \rangle$. On the other, as $r^* \in \Gamma(p)$, we certainly have $\langle p, L(r^*) \rangle \leq \langle p, L(r) \rangle$. But now we must have $\langle p, L(r) \rangle = \langle p, L(r^*) \rangle$, and thus $r \in \Gamma(p)$ as well. We now see $r \in \Gamma(p) \cap \mathcal{R}' = \Gamma'(p)$. Finally, the equality of the Bayes risks follows immediately by the above, as $\min_{r \in \mathcal{R}} \langle p, L(r) \rangle = \min_{r \in \mathcal{R}'} \langle p, L(r) \rangle$ for all $p \in \Delta_\mathcal{Y}$. ∎
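The lemma can be sanity-checked numerically: when $\mathcal{R}'$ always intersects the set of minimizers, restricting the loss to $\mathcal{R}'$ leaves the Bayes risk unchanged. The sketch below (on a randomly generated example loss) arranges the hypothesis by duplicating one report, so the restriction that drops the duplicate always retains a minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
# A finite loss on 4 reports and 3 outcomes (rows are reports); report 3 is a
# copy of report 0, so R' = {0, 1, 2} intersects the minimizers for every p.
L = rng.random((3, 3))
L = np.vstack([L, L[0]])

def bayes_risk(L, p):
    """min_r <p, L(r)> over the reports (rows) of L."""
    return np.min(L @ p)

for _ in range(100):
    p = rng.dirichlet(np.ones(3))
    # Restricting to reports {0, 1, 2} leaves the Bayes risk unchanged.
    assert np.isclose(bayes_risk(L, p), bayes_risk(L[:3], p))
```

The same check fails in general if the dropped reports are sometimes the unique minimizers, which is exactly the role of the hypothesis $\Gamma(p) \cap \mathcal{R}' \neq \emptyset$.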
We now state a useful result for proving the existence of an embedding loss, which shows the remarkable structure of embeddable properties and the properties that embed them. First, we conclude that any embeddable property must be elicitable. We also conclude that if $\Gamma$ embeds $\gamma$, the level sets of $\Gamma$ must all be redundant relative to those of $\gamma$. In other words, $\Gamma$ is exactly the property $\gamma$, just with other reports filling in the gaps between the embedded reports of $\gamma$. (When working with convex losses, these extra reports typically form the convex hull of the embedded reports.) In this sense, we can regard embedding as a minor departure from direct elicitation: if a loss $L$ elicits $\Gamma$ which embeds $\gamma$, we can think of $L$ as essentially eliciting $\gamma$ itself. Finally, we have an important converse: if $\Gamma$ has finitely many full-dimensional level sets, or if $\mathrm{trim}(\Gamma)$ is finite, then $\Gamma$ must embed some finite elicitable property with those same level sets.
Let $\Gamma : \Delta_\mathcal{Y} \to 2^{\mathbb{R}^d}$ be an elicitable property. The following are equivalent:
1. $\Gamma$ embeds a finite property $\gamma : \Delta_\mathcal{Y} \to 2^\mathcal{R}$.
2. $\mathrm{trim}(\Gamma)$ is a finite set, and $\bigcup \mathrm{trim}(\Gamma) = \Delta_\mathcal{Y}$.
3. There is a finite set of full-dimensional level sets of $\Gamma$, and their union is $\Delta_\mathcal{Y}$.
Moreover, when any of the above hold, $\mathrm{trim}(\Gamma) = \{\gamma_r : r \in \mathcal{R}\}$, and $\gamma$ is elicitable.
Let $L : \mathbb{R}^d \to \mathbb{R}^\mathcal{Y}_+$ elicit $\Gamma$.
1 $\Rightarrow$ 2: By the embedding condition, taking $\mathcal{R}' = \varphi(\mathcal{R})$, where $\varphi$ is the embedding, satisfies the conditions of Lemma 3: for all $p \in \Delta_\mathcal{Y}$, as $\gamma(p) \neq \emptyset$ by definition, we have some $r \in \gamma(p)$ and thus some $\varphi(r) \in \Gamma(p) \cap \mathcal{R}'$. Let $G$ be the negative Bayes risk of $L$, which is convex, and $g$ that of $L|_{\mathcal{R}'}$. By the Lemma, we also have $G = g$. As $\mathcal{R}'$ is finite, $g$ is polyhedral. Moreover, the projection of the epigraph of $g$ onto $\Delta_\mathcal{Y}$ forms a power diagram, with the facets projecting onto the level sets of $\gamma$, the cells of the power diagram. (See Theorem 4.) As $L$ elicits $\Gamma$, for all $u \in \mathbb{R}^d$, the hyperplane $\{(p, \langle p, -L(u) \rangle) : p \in \mathbb{R}^\mathcal{Y}\}$ is a supporting hyperplane of the epigraph of $G$ at $(p, G(p))$ if and only if $u \in \Gamma(p)$. This supporting hyperplane exposes some face of the epigraph of $G = g$, which must be contained in some facet $f$. Thus, the projection of the exposed face, which is $\Gamma_u$, must be contained in the projection of $f$, which is a level set of $\gamma$. We conclude that $\Gamma_u \subseteq \gamma_r$ for some $r \in \mathcal{R}$. Hence, $\mathrm{trim}(\Gamma) = \{\gamma_r : r \in \mathcal{R}\}$, which is finite, and unions to $\Delta_\mathcal{Y}$.
2 $\Rightarrow$ 3: let $u_1, \ldots, u_k \in \mathbb{R}^d$ be a set of distinct reports such that $\mathrm{trim}(\Gamma) = \{\Gamma_{u_1}, \ldots, \Gamma_{u_k}\}$. Now as $\bigcup \mathrm{trim}(\Gamma) = \Delta_\mathcal{Y}$, for any $p \in \Delta_\mathcal{Y}$, we have some $i$ with $p \in \Gamma_{u_i}$