1 Introduction
Loss functions are a cornerstone of statistics and machine learning: they measure the difference, or “loss,” between a ground-truth label and a prediction. Some loss functions, such as the hinge loss of support vector machines, are intimately connected to the notion of separation margin, a prevalent concept in statistical learning theory that has been used to prove the famous perceptron mistake bound (Rosenblatt, 1958) and many other generalization bounds (Vapnik, 1998; Schölkopf and Smola, 2002). For probabilistic classification, the most popular loss is arguably the (multinomial) logistic loss. It is smooth, enabling fast convergence rates, and the softmax operator provides a consistent mapping to probability distributions. However, the logistic loss does not enjoy a margin, and the generated probability distributions have dense support, which is undesirable in some applications for interpretability or computational efficiency reasons.

To address these shortcomings, Martins and Astudillo (2016) proposed a new loss based on the projection onto the simplex. Unlike the logistic loss, this “sparsemax” loss has a natural separation margin and induces a sparse probability distribution. However, the sparsemax loss was derived in a relatively ad hoc manner and remains poorly understood. A thorough understanding of the core principles underpinning these losses, which would enable the creation of new losses combining their strengths, is still lacking.
This paper studies and extends Fenchel-Young (FY) losses, recently proposed for structured prediction (Niculae et al., 2018). We show that FY losses provide a generic and principled way to construct a loss with an associated probability distribution. We uncover a fundamental connection between generalized entropies, margins, and sparse probability distributions. In sum, we make the following contributions.


We introduce regularized prediction functions to generalize the softmax and sparsemax transformations, possibly beyond the probability simplex (§2).

We study FY losses and their properties, showing that they unify many existing losses, including the hinge, logistic, and sparsemax losses (§3).

We then show how to seamlessly create entire new families of losses from generalized entropies. We derive efficient algorithms to compute the associated probability distributions, making such losses appealing both in theory and in practice (§4).

We characterize which entropies yield sparse distributions and losses with a separation margin, notions we prove to be intimately connected (§5).

Finally, we demonstrate FY losses on the task of sparse label proportion estimation (§6).
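Notation. We denote the probability simplex by $\triangle^d := \{p \in \mathbb{R}_+^d : \|p\|_1 = 1\}$, the domain of $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ by $\mathrm{dom}(\Omega) := \{y \in \mathbb{R}^d : \Omega(y) < \infty\}$, the Fenchel conjugate of $\Omega$ by $\Omega^*(\theta) := \sup_{y \in \mathrm{dom}(\Omega)} \langle \theta, y \rangle - \Omega(y)$, and the indicator function of a set $\mathcal{C}$ by $I_{\mathcal{C}}$.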
2 Regularized prediction functions
We consider a general predictive setting with input $x \in \mathcal{X}$ and a parametrized model $f_W\colon \mathcal{X} \to \mathbb{R}^d$, producing a score vector $\theta := f_W(x) \in \mathbb{R}^d$. To map $\theta$ to predictions, we introduce regularized prediction functions.

Regularized prediction function. Let $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function, with $\mathrm{dom}(\Omega) \subseteq \mathbb{R}^d$. The prediction function regularized by $\Omega$ is defined by
$$\hat{y}_\Omega(\theta) \in \operatorname*{argmax}_{y \in \mathrm{dom}(\Omega)} \langle \theta, y \rangle - \Omega(y). \tag{1}$$
We emphasize that the regularization is w.r.t. the output and not w.r.t. the model parameters $W$, as is usually the case in the literature. The optimization problem in (1) balances between two terms: an “affinity” term $\langle \theta, y \rangle$, and a “confidence” term $\Omega(y)$, which should be low if $y$ is “uncertain”. Two important classes of convex $\Omega$ are (squared) norms and, when $\mathrm{dom}(\Omega)$ is the probability simplex, generalized negative entropies. However, our framework does not require $\Omega$ to be convex in general. Allowing extended-real $\Omega$ further permits general domain constraints in (1) via indicator functions, as we now illustrate.
Examples.
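When $\Omega = I_{\triangle^d}$, $\hat{y}_\Omega(\theta)$ is a one-hot representation of the argmax prediction
$$\hat{y}_\Omega(\theta) \in \operatorname*{argmax}_{y \in \triangle^d} \langle \theta, y \rangle. \tag{2}$$
We can see that output as a probability distribution that assigns all probability mass to a single class. When $\Omega = -H^s + I_{\triangle^d}$, where $H^s(p) := -\sum_i p_i \log p_i$ is Shannon’s entropy, $\hat{y}_\Omega(\theta)$ is the well-known softmax
$$\hat{y}_\Omega(\theta) = \mathrm{softmax}(\theta) := \frac{\exp(\theta)}{\sum_{j=1}^d \exp(\theta_j)}. \tag{3}$$
See Boyd and Vandenberghe (2004, Ex. 3.25) for a derivation. The resulting distribution always has dense support. When $\Omega = \frac{1}{2}\|\cdot\|^2 + I_{\triangle^d}$, $\hat{y}_\Omega(\theta)$ is the Euclidean projection onto the probability simplex
$$\hat{y}_\Omega(\theta) = \mathrm{sparsemax}(\theta) := \operatorname*{argmin}_{p \in \triangle^d} \|p - \theta\|^2, \tag{4}$$
a.k.a. the sparsemax transformation (Martins and Astudillo, 2016). The distribution has sparse support (it may assign exactly zero probability to low-scoring classes) and can be computed exactly in $O(d)$ time (Brucker, 1984; Duchi et al., 2008; Condat, 2016). This paradigm is not limited to the probability simplex: when $\Omega(p) = -\sum_i H^s([p_i, 1 - p_i])$, we get
$$\hat{y}_\Omega(\theta) = \mathrm{sigmoid}(\theta) := \frac{1}{1 + \exp(-\theta)}, \tag{5}$$
i.e., the sigmoid function evaluated coordinate-wise. We can think of its output as a positive measure (unnormalized probability distribution).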
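As a minimal sketch of the four examples above, the following pure-Python snippet implements the argmax, softmax, sparsemax, and sigmoid mappings. The function names and the score vector are ours and purely illustrative; for simplicity, the sparsemax routine uses a sort-based $O(d \log d)$ procedure rather than the linear-time algorithms cited above.

```python
import math

def argmax_onehot(theta):
    """Unregularized case: one-hot vector on a highest-scoring class."""
    k = max(range(len(theta)), key=lambda i: theta[i])
    return [1.0 if i == k else 0.0 for i in range(len(theta))]

def softmax(theta):
    """Negative Shannon entropy regularizer: always dense."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def sparsemax(theta):
    """Euclidean projection onto the simplex (sort-based variant)."""
    z = sorted(theta, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The threshold comes from the largest j with 1 + j * z_j > cumsum.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(t - tau, 0.0) for t in theta]

def sigmoid(theta):
    """Coordinate-wise sigmoid: output is not normalized."""
    return [1.0 / (1.0 + math.exp(-t)) for t in theta]

theta = [1.0, 0.5, -1.0]  # illustrative scores
print(argmax_onehot(theta))
print(softmax(theta))
print(sparsemax(theta))
print(sigmoid(theta))
```

On this input, softmax assigns strictly positive mass to every class, whereas sparsemax assigns exactly zero to the lowest-scoring one.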
Properties.
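We now discuss simple properties of regularized prediction functions. The first two assume that $\Omega$ is a symmetric function, i.e., that it satisfies
$$\Omega(y) = \Omega(Py) \quad \forall y \in \mathrm{dom}(\Omega),\ \forall P \in \mathcal{P}, \tag{6}$$
where $\mathcal{P}$ is the set of $d \times d$ permutation matrices.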
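Properties of $\hat{y}_\Omega(\theta)$

Effect of a permutation. If $\Omega$ is symmetric, then $\forall P \in \mathcal{P}$: $\hat{y}_\Omega(P\theta) = P\,\hat{y}_\Omega(\theta)$.

Order preservation. Let $p = \hat{y}_\Omega(\theta)$. If $\Omega$ is symmetric, then the coordinates of $p$ and $\theta$ are sorted the same way, i.e., $\theta_i > \theta_j \Rightarrow p_i \ge p_j$ and $p_i > p_j \Rightarrow \theta_i > \theta_j$.

Gradient mapping. $\hat{y}_\Omega(\theta)$ is a subgradient of $\Omega^*$ at $\theta$, i.e., $\hat{y}_\Omega(\theta) \in \partial\Omega^*(\theta)$. If $\Omega$ is strictly convex, $\hat{y}_\Omega(\theta)$ is the gradient of $\Omega^*$, i.e., $\hat{y}_\Omega(\theta) = \nabla\Omega^*(\theta)$.

Temperature scaling. For any constant $t > 0$, $\hat{y}_{t\Omega}(\theta) \in \partial\Omega^*(\theta/t)$. If $\Omega$ is strictly convex, $\hat{y}_{t\Omega}(\theta) = \hat{y}_\Omega(\theta/t) = \nabla\Omega^*(\theta/t)$.

The proof is given in §C.1. For classification, the order-preservation property ensures that the highest-scoring classes according to $\theta$ and $\hat{y}_\Omega(\theta)$ agree with each other:
$$\operatorname*{argmax}_{i \in [d]} \theta_i = \operatorname*{argmax}_{i \in [d]} \big(\hat{y}_\Omega(\theta)\big)_i. \tag{7}$$
Temperature scaling is useful to control how close we are to unregularized prediction functions.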
3 Fenchel-Young losses
Table 1: Examples of Fenchel-Young losses. Rows: squared; perceptron (Rosenblatt, 1958); hinge (Crammer and Singer, 2001); sparsemax (Martins and Astudillo, 2016); logistic (multinomial); logistic (one-vs-all).
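In this section, we introduce Fenchel-Young losses as a natural way to learn models whose output layer is a regularized prediction function.

Fenchel-Young loss generated by $\Omega$. Let $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function such that the maximum in (1) is achieved for all $\theta \in \mathbb{R}^d$. Let $y \in \mathrm{dom}(\Omega)$ be a ground-truth label and $\theta \in \mathrm{dom}(\Omega^*) = \mathbb{R}^d$ be a vector of prediction scores. The Fenchel-Young loss $L_\Omega\colon \mathrm{dom}(\Omega^*) \times \mathrm{dom}(\Omega) \to \mathbb{R}_+$ generated by $\Omega$ is
$$L_\Omega(\theta; y) := \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle. \tag{8}$$
Fenchel-Young losses can also be written as $L_\Omega(\theta; y) = f_\theta(y) - f_\theta(\hat{y}_\Omega(\theta))$, where $f_\theta(y) := \Omega(y) - \langle \theta, y \rangle$, highlighting the relation with the regularized prediction function $\hat{y}_\Omega(\theta)$. Therefore, as long as we can compute $\hat{y}_\Omega(\theta)$, we can evaluate the associated Fenchel-Young loss $L_\Omega(\theta; y)$. Examples of Fenchel-Young losses are given in Table 1. In addition to the aforementioned multinomial logistic and sparsemax losses, we recover the squared, hinge and one-vs-all logistic losses, for suitable choices of $\Omega$ and $\mathrm{dom}(\Omega)$.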
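The following sketch evaluates a Fenchel-Young loss through the identity $L_\Omega(\theta; y) = f_\theta(y) - f_\theta(\hat{y}_\Omega(\theta))$ with $f_\theta(y) = \Omega(y) - \langle \theta, y \rangle$. It assumes the negative Shannon entropy (paired with softmax) and the squared norm $\frac{1}{2}\|\cdot\|^2$ (paired with sparsemax) as regularizers; all function names and inputs are illustrative choices of ours.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def sparsemax(theta):
    z = sorted(theta, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(t - tau, 0.0) for t in theta]

def neg_shannon(p):
    """Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)

def half_sq_norm(p):
    """Omega(p) = 0.5 * ||p||^2."""
    return 0.5 * sum(pi * pi for pi in p)

def fy_loss(theta, y, omega, predict):
    """L(theta; y) = f(y) - f(y_hat), where f(p) = omega(p) - <theta, p>."""
    def f(p):
        return omega(p) - sum(t * pi for t, pi in zip(theta, p))
    return f(y) - f(predict(theta))

theta = [1.0, 0.5, -1.0]
y = [1.0, 0.0, 0.0]  # one-hot ground truth, class k = 0

logistic = fy_loss(theta, y, neg_shannon, softmax)
sparsemax_loss = fy_loss(theta, y, half_sq_norm, sparsemax)

# Closed form of the multinomial logistic loss: logsumexp(theta) - theta_k.
closed_form = math.log(sum(math.exp(t) for t in theta)) - theta[0]
print(logistic, closed_form)
print(sparsemax_loss)
```

Only `omega` and `predict` change between the two losses, which is the point of the construction: once the prediction function is available, the loss follows.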
Properties.
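As the name indicates, this family of loss functions is grounded in the Fenchel-Young inequality (Borwein and Lewis, 2010, Proposition 3.3.4)
$$\Omega^*(\theta) + \Omega(y) \ge \langle \theta, y \rangle \quad \forall \theta \in \mathrm{dom}(\Omega^*),\ y \in \mathrm{dom}(\Omega). \tag{9}$$
The inequality, together with well-known properties of convex conjugates, implies the following results.

Properties of FY losses

Non-negativity. $L_\Omega(\theta; y) \ge 0$ for any $\theta \in \mathrm{dom}(\Omega^*)$ and $y \in \mathrm{dom}(\Omega)$.

Zero loss. If $\Omega$ is a lower semi-continuous proper convex function, then $\min_\theta L_\Omega(\theta; y) = 0$ and $L_\Omega(\theta; y) = 0$ iff $y \in \partial\Omega^*(\theta)$. If $\Omega$ is strictly convex, then $L_\Omega(\theta; y) = 0$ iff $y = \hat{y}_\Omega(\theta) = \nabla\Omega^*(\theta)$.

Convexity & subgradients. $L_\Omega$ is convex in $\theta$ and the residual vectors are its subgradients: $\hat{y}_\Omega(\theta) - y \in \partial L_\Omega(\theta; y)$.

Differentiability & smoothness. If $\Omega$ is strictly convex, then $L_\Omega$ is differentiable and $\nabla L_\Omega(\theta; y) = \hat{y}_\Omega(\theta) - y$. If $\Omega$ is strongly convex, then $L_\Omega$ is smooth, i.e., $\nabla L_\Omega(\theta; y)$ is Lipschitz continuous.

Temperature scaling. For any constant $t > 0$, $L_{t\Omega}(\theta; y) = t\, L_\Omega(\theta/t; y)$.

Remarkably, the non-negativity and convexity properties hold even if $\Omega$ is not convex. The zero loss property follows from the fact that, if $\Omega$ is l.s.c. proper convex, then (9) becomes an equality (i.e., the duality gap is zero) if and only if $\theta \in \partial\Omega(y)$. It suggests that minimizing a Fenchel-Young loss requires adjusting $\theta$ to produce predictions $\hat{y}_\Omega(\theta)$ that are close to the target $y$, reducing the duality gap.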
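To illustrate the non-negativity, zero-loss, and gradient properties numerically, the sketch below assumes the negative Shannon entropy restricted to the simplex as regularizer, for which the loss reads $\log\sum_i \exp(\theta_i) + \sum_i y_i \log y_i - \langle \theta, y \rangle$ and the prediction function is softmax. It compares the residual $\hat{y}_\Omega(\theta) - y$ with a central finite-difference gradient; the helper names, inputs, and step size are illustrative assumptions.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def logistic_fy_loss(theta, y):
    """logsumexp(theta) + sum_i y_i log y_i - <theta, y>."""
    m = max(theta)
    lse = m + math.log(sum(math.exp(t - m) for t in theta))
    neg_entropy = sum(yi * math.log(yi) for yi in y if yi > 0.0)
    return lse + neg_entropy - sum(t * yi for t, yi in zip(theta, y))

def numerical_gradient(theta, y, eps=1e-6):
    """Central finite differences of the loss with respect to theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        diff = logistic_fy_loss(up, y) - logistic_fy_loss(down, y)
        grad.append(diff / (2.0 * eps))
    return grad

theta = [1.0, 0.5, -1.0]
y = [0.7, 0.3, 0.0]  # label proportions rather than a one-hot label

loss = logistic_fy_loss(theta, y)
residual = [p - yi for p, yi in zip(softmax(theta), y)]
numeric = numerical_gradient(theta, y)
print(loss)
print(residual)
print(numeric)

# Zero loss when the scores reproduce the target: theta = log y on its support.
zero = logistic_fy_loss([math.log(0.7), math.log(0.3)], [0.7, 0.3])
print(zero)
```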
Relation with Bregman divergences.
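Fenchel-Young losses seamlessly work when ground-truth vectors are label proportions, i.e., $y \in \triangle^d$ instead of $y \in \{e_i\}_{i=1}^d$. For instance, setting $\Omega$ to the Shannon negative entropy restricted to $\triangle^d$ yields the cross-entropy loss, $L_\Omega(\theta; y) = \mathrm{KL}(y \,\|\, \mathrm{softmax}(\theta))$, where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. From this example, it is tempting to conjecture that a similar result holds for more general Bregman divergences (Bregman, 1967). Recall that the Bregman divergence generated by a strictly convex and differentiable $\Psi$ is
$$B_\Psi(y \,\|\, \hat{y}) := \Psi(y) - \Psi(\hat{y}) - \langle \nabla\Psi(\hat{y}), y - \hat{y} \rangle, \tag{10}$$
the difference at $y$ between $\Psi$ and its linearization around $\hat{y}$. It turns out that $L_\Omega(\theta; y)$ is not in general equal to $B_\Omega(y \,\|\, \hat{y}_\Omega(\theta))$. However, when $\Omega = \Psi + I_{\mathcal{C}}$, where $\Psi$ is a Legendre-type function (Rockafellar, 1970; Wainwright and Jordan, 2008), meaning that it is strictly convex, differentiable and its gradient explodes at the boundary of its domain, we have the following proposition, proved in §C.2.

Let $\Omega := \Psi + I_{\mathcal{C}}$, where $\Psi$ is of Legendre type and $\mathcal{C} \subseteq \mathrm{dom}(\Psi)$ is a convex set. Then, for all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$, we have:
$$0 \le B_\Omega(y \,\|\, \hat{y}_\Omega(\theta)) \le L_\Omega(\theta; y), \tag{11}$$
with equality when the loss is zero. If $\mathcal{C} = \mathrm{dom}(\Psi)$, i.e., $\Omega = \Psi$, then $L_\Omega(\theta; y) = B_\Omega(y \,\|\, \hat{y}_\Omega(\theta))$.

As an example, applying (11) with $\Psi = \frac{1}{2}\|\cdot\|^2$ and $\mathcal{C} = \triangle^d$, we get that the sparsemax loss is a convex upper bound for the non-convex $\frac{1}{2}\|y - \mathrm{sparsemax}(\theta)\|^2$. This suggests that the sparsemax loss can be useful for sparse label proportion estimation, as we confirm in §6.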
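The upper-bound relation for the sparsemax case can be checked numerically. The sketch below assumes the sparsemax loss in the form $\frac{1}{2}\|y - \theta\|^2 - \frac{1}{2}\|\hat{y} - \theta\|^2$ with $\hat{y} = \mathrm{sparsemax}(\theta)$, and compares it with $\frac{1}{2}\|y - \hat{y}\|^2$ on randomly drawn scores and sparse label proportions. The sampling scheme, dimension, and seed are arbitrary illustrative choices, not taken from the experiments.

```python
import random

def sparsemax(theta):
    z = sorted(theta, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(t - tau, 0.0) for t in theta]

def half_sq_dist(a, b):
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sparsemax_loss(theta, y):
    """Fenchel-Young loss generated by 0.5 * ||.||^2 on the simplex."""
    return half_sq_dist(y, theta) - half_sq_dist(sparsemax(theta), theta)

def bregman_term(theta, y):
    """0.5 * ||y - sparsemax(theta)||^2, non-convex in theta."""
    return half_sq_dist(y, sparsemax(theta))

random.seed(0)
d = 5
worst_gap = float("inf")
for _ in range(1000):
    theta = [random.gauss(0.0, 2.0) for _ in range(d)]
    # Sparse label proportions: roughly half of the entries are zero.
    raw = [random.random() if random.random() < 0.5 else 0.0 for _ in range(d)]
    if sum(raw) == 0.0:
        raw[0] = 1.0
    total = sum(raw)
    y = [r / total for r in raw]
    gap = sparsemax_loss(theta, y) - bregman_term(theta, y)
    worst_gap = min(worst_gap, gap)

# Expected to be non-negative up to floating-point rounding.
print(worst_gap)
```

The gap equals $\langle y - \hat{y}, \hat{y} - \theta \rangle$, which is non-negative because $\hat{y}$ is the Euclidean projection of $\theta$ onto a convex set containing $y$.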
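The relation between Fenchel-Young losses and Bregman divergences can be further clarified using duality. Letting $\theta = \nabla\Omega(\hat{y})$ (i.e., $(\theta, \hat{y})$ is a dual pair), we have $\Omega^*(\theta) = \langle \theta, \hat{y} \rangle - \Omega(\hat{y})$. Substituting in (10), we get $B_\Omega(y \,\|\, \hat{y}) = L_\Omega(\theta; y)$. In other words, Fenchel-Young losses can be viewed as a “mixed-form Bregman divergence” (Amari, 2016, Theorem 1.1) where the argument $\hat{y}$ in (10) is replaced by its dual point $\theta$. This difference is best seen by comparing the function signatures, $L_\Omega\colon \mathrm{dom}(\Omega^*) \times \mathrm{dom}(\Omega) \to \mathbb{R}_+$ vs. $B_\Omega\colon \mathrm{dom}(\Omega) \times \mathrm{relint}(\mathrm{dom}(\Omega)) \to \mathbb{R}_+$. An important consequence is that Fenchel-Young losses do not impose any restriction on their left argument $\theta$: our assumption that the maximum in the prediction function (1) is achieved for all $\theta \in \mathbb{R}^d$ implies $\mathrm{dom}(\Omega^*) = \mathbb{R}^d$.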
4 New loss functions for sparse probabilistic classification
In the previous section, we presented Fenchel-Young losses in a broad setting. We now restrict to classification over the probability simplex and show how to easily create several entire new families of losses.
Generalized entropies.
A natural choice of regularization function $\Omega$ over the probability simplex is $\Omega = -H$, where $H$ is a generalized entropy (DeGroot, 1962; Grünwald and Dawid, 2004): a concave function over $\triangle^d$, used to measure the “uncertainty” in a distribution $p \in \triangle^d$.

Assumptions: We will make the following assumptions about $H$.

A.1. Zero entropy: $H(p) = 0$ if $p$ is a delta distribution, i.e., $p \in \{e_i\}_{i=1}^d$.

A.2. Strict concavity: $H((1 - \alpha)p + \alpha p') > (1 - \alpha)H(p) + \alpha H(p')$, for $p \ne p'$, $\alpha \in (0, 1)$.

A.3. Symmetry: $H(p) = H(Pp)$ for any $P \in \mathcal{P}$.

Assumptions A.2 and A.3 imply that $H$ is Schur-concave (Bauschke and Combettes, 2017), a common requirement in generalized entropies. This in turn implies assumption A.1, up to a constant (that constant can easily be subtracted so as to satisfy assumption A.1). As suggested by the next result, proved in §C.3, together, these assumptions imply that $H$ can be used as a sensible uncertainty measure.

If $H$ satisfies assumptions A.1–A.3, then it is non-negative and uniquely maximized by the uniform distribution $p = \mathbf{1}/d$.

A particular case of generalized entropies satisfying assumptions A.1–A.3 are uniformly separable functions of the form $H(p) = \sum_{j=1}^d h(p_j)$, where $h\colon [0, 1] \to \mathbb{R}_+$ is a non-negative strictly concave function such that $h(0) = h(1) = 0$. However, our framework is not restricted to this form.

Induced Fenchel-Young loss.
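If the ground truth is $y = e_k$ and assumption A.1 holds, (8) becomes
$$L_{-H}(\theta; e_k) = (-H)^*(\theta) - \theta_k. \tag{12}$$
By using the fact that $\Omega^*(\theta + c\mathbf{1}) = \Omega^*(\theta) + c$ for all $c \in \mathbb{R}$ if $\mathrm{dom}(\Omega) \subseteq \triangle^d$, we can further rewrite it as
$$L_{-H}(\theta; e_k) = (-H)^*(\theta - \theta_k \mathbf{1}). \tag{13}$$
This expression shows that Fenchel-Young losses over $\triangle^d$ can be written solely in terms of the generalized “cumulant function” $(-H)^*$. Indeed, when $H$ is Shannon’s entropy, we recover the cumulant (a.k.a. log-partition) function $(-H^s)^*(\theta) = \log \sum_{i=1}^d \exp(\theta_i)$. When $H$ is strongly concave over $\triangle^d$, we can also see $(-H)^*$ as a smoothed max operator (Niculae and Blondel, 2017; Mensch and Blondel, 2018) and hence $L_{-H}(\theta; e_k)$ can be seen as a smoothed upper bound of the perceptron loss $(\theta; e_k) \mapsto \max_{i \in [d]} \theta_i - \theta_k$.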
Tsallis entropies (Tsallis, 1988).
Defined as $H^T_\alpha(p) := k(\alpha - 1)^{-1}(1 - \|p\|_\alpha^\alpha)$, where $\alpha \ge 0$ and $k$ is an arbitrary positive constant, these entropies arise as a generalization of the Shannon-Khinchin axioms to non-extensive systems (Suyari, 2004) and have numerous scientific applications (Gell-Mann and Tsallis, 2004; Martins et al., 2009). For convenience, we set $k = \alpha^{-1}$ for the rest of this paper. Tsallis entropies satisfy assumptions A.1–A.3 and can also be written in uniformly separable form:
$$H^T_\alpha(p) = \sum_{j=1}^d h_\alpha(p_j) \quad \text{with} \quad h_\alpha(t) := \frac{t - t^\alpha}{\alpha(\alpha - 1)}. \tag{14}$$
The limit case $\alpha \to 1$ corresponds to the Shannon entropy. When $\alpha = 2$, we recover the Gini index (Gini, 1912), a popular “impurity measure” for decision trees:
$$H^T_2(p) = \frac{1}{2} \sum_{j=1}^d p_j (1 - p_j) = \frac{1}{2}\big(1 - \|p\|_2^2\big). \tag{15}$$
It is easy to check that $L_{-H^T_2}$ recovers the sparsemax loss (Martins and Astudillo, 2016) (cf. Table 1). Another interesting case is $\alpha \to +\infty$, which gives $H^T_\infty(p) = 0$, hence $L_{-H^T_\infty}$ is the perceptron loss in Table 1. The resulting “argmax” distribution puts all probability mass on the top-scoring classes. In summary, $\hat{y}_{-H^T_\alpha}$ for $\alpha \in \{1, 2, \infty\}$ is softmax, sparsemax, and argmax, and $L_{-H^T_\alpha}$ is the logistic, sparsemax and perceptron loss, respectively. Tsallis entropies induce a continuous parametric family subsuming these important cases. Since the best surrogate loss often depends on the data (Nock and Nielsen, 2009), tuning $\alpha$ typically improves accuracy, as we confirm in §6.
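The special cases discussed above can be checked with a short sketch. It assumes the separable form $H^T_\alpha(p) = \sum_j (p_j - p_j^\alpha) / (\alpha(\alpha - 1))$, i.e., the normalization $k = 1/\alpha$, and treats $\alpha = 1$ as the Shannon limit; the function name and the test distribution are illustrative.

```python
import math

def tsallis_entropy(p, alpha):
    """Tsallis entropy with k = 1/alpha; alpha = 1 is taken as the Shannon limit."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return sum(pi - pi ** alpha for pi in p) / (alpha * (alpha - 1.0))

p = [0.5, 0.3, 0.2]

shannon = tsallis_entropy(p, 1.0)
near_shannon = tsallis_entropy(p, 1.0001)       # approaches the Shannon entropy
gini = tsallis_entropy(p, 2.0)                  # 0.5 * (1 - ||p||_2^2)
large_alpha = tsallis_entropy(p, 100.0)         # tends to 0 as alpha grows
delta = tsallis_entropy([0.0, 1.0, 0.0], 1.5)   # zero entropy on a vertex

print(shannon, near_shannon)
print(gini)
print(large_alpha)
print(delta)
```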
Norm entropies.
An interesting class of non-separable entropies are entropies generated by a $q$-norm, defined as $H^N_q(p) := 1 - \|p\|_q$. We call them norm entropies. From the Minkowski inequality, $q$-norms with $1 < q < \infty$ are strictly convex on the simplex, so $H^N_q$ satisfies assumptions A.1–A.3 for $1 < q < \infty$. The limit case $q \to \infty$ is particularly interesting: in this case, we obtain $H^N_\infty(p) = 1 - \|p\|_\infty$, recovering the Berger-Parker dominance index (Berger and Parker, 1970), widely used in ecology to measure species diversity. We surprisingly encounter $H^N_\infty$ again in §5, as a limit case for the existence of separation margins.
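Computing $\hat{y}_{-H}(\theta)$.
For non-separable entropies $H$, the regularized prediction function $\hat{y}_{-H}(\theta)$ does not generally enjoy a closed-form expression and one must resort to projected gradient methods to compute it. Fortunately, for uniformly separable entropies, which we saw to be the case of Tsallis entropies, we now show that $\hat{y}_{-H}(\theta)$ can be computed in linear time.

Reduction to root finding. Let $H(p) = \sum_i h(p_i) + I_{\triangle^d}(p)$, where $h\colon [0, 1] \to \mathbb{R}_+$ is strictly concave and differentiable. Then,
$$\hat{y}_{-H}(\theta) = p(\tau^\star) := (-h')^{-1}\big(\max\{\theta - \tau^\star, -h'(0)\}\big), \tag{16}$$
where $\tau^\star$ is a root of $\phi(t) := \langle p(t), \mathbf{1} \rangle - 1$, in the tight search interval $[\tau_{\min}, \tau_{\max}]$, where $\tau_{\min} := \max(\theta) + h'(1)$ and $\tau_{\max} := \max(\theta) + h'(1/d)$. An approximate $\tau$ such that $|\phi(\tau)| \le \epsilon$ can be found in $O(d \log(1/\epsilon))$ time by, e.g., bisection. The related problem of Bregman projection onto the probability simplex was recently studied by Krichene et al. (2015), but our derivation is different and more direct (cf. §C.4).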
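A minimal bisection sketch for the Tsallis case follows. It assumes $h_\alpha(t) = (t - t^\alpha)/(\alpha(\alpha - 1))$ with $\alpha > 1$, for which the candidate solution at threshold $\tau$ is $p_i(\tau) = \max\{(\alpha - 1)(\theta_i - \tau) + 1/\alpha,\ 0\}^{1/(\alpha - 1)}$, and a search interval $[\max(\theta) + h_\alpha'(1),\ \max(\theta) + h_\alpha'(1/d)]$. The function name, tolerance, and iteration cap are our own illustrative choices.

```python
def tsallis_predict(theta, alpha=1.5, eps=1e-10, max_iter=200):
    """Approximate the Tsallis-regularized prediction by bisection on tau."""
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")
    d = len(theta)
    power = 1.0 / (alpha - 1.0)

    def p_of(tau):
        return [max((alpha - 1.0) * (t - tau) + 1.0 / alpha, 0.0) ** power
                for t in theta]

    top = max(theta)
    # h'(1) = -1/alpha and h'(1/d) = (1 - alpha * d**(1 - alpha)) / (alpha * (alpha - 1)).
    lo = top - 1.0 / alpha                      # phi(lo) >= 0
    hi = top + (1.0 - alpha * d ** (1.0 - alpha)) / (alpha * (alpha - 1.0))  # phi(hi) <= 0
    tau = lo
    for _ in range(max_iter):
        tau = 0.5 * (lo + hi)
        phi = sum(p_of(tau)) - 1.0
        if abs(phi) <= eps:
            break
        if phi > 0.0:
            lo = tau   # total mass too large: raise the threshold
        else:
            hi = tau   # total mass too small: lower the threshold
    return p_of(tau)

theta = [1.0, 0.5, -1.0]
p_sparsemax = tsallis_predict(theta, alpha=2.0)  # should match sparsemax
p_mid = tsallis_predict(theta, alpha=1.5)
print(p_sparsemax)
print(p_mid)
```

Each bisection step costs one pass over the $d$ coordinates, and $\phi$ is non-increasing in $\tau$, so the bracket always contains a root.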
5 Separation margin of FY losses
In this section, we show that the simple assumptions A.1–A.3 about a generalized entropy $H$ are enough to obtain results about the separation margin associated with $L_{-H}$. The notion of margin is well known in machine learning, lying at the heart of support vector machines and leading to generalization error bounds (Vapnik, 1998; Schölkopf and Smola, 2002; Guermeur, 2007). We provide a definition and will see that many other Fenchel-Young losses also have a “margin,” for suitable conditions on $H$. Then, we take a step further, and connect the existence of a margin with the sparsity of the regularized prediction function, providing necessary and sufficient conditions for Fenchel-Young losses to have a margin. Finally, we show how this margin can be computed analytically.

Separation margin. Let $L(\theta; e_k)$ be a loss function over $\mathbb{R}^d \times \{e_i\}_{i=1}^d$. We say that $L$ has the separation margin property if there exists $m > 0$ such that:
$$\theta_k \ge m + \max_{j \ne k} \theta_j \quad \Rightarrow \quad L(\theta; e_k) = 0. \tag{17}$$
The smallest possible $m$ that satisfies (17) is called the margin of $L$, denoted $\mathrm{margin}(L)$.
Examples.
The most famous example of a loss with a separation margin is the multi-class hinge loss, $L(\theta; e_k) = \max\{0, \max_{j \ne k} 1 + \theta_j - \theta_k\}$, which we saw in Table 1 to be a Fenchel-Young loss: it is immediate from the definition that its margin is $1$. Less trivially, Martins and Astudillo (2016, Prop. 3.5) showed that the sparsemax loss also has the separation margin property. On the negative side, the logistic loss does not have a margin, as it is strictly positive. Characterizing which Fenchel-Young losses have a margin is an open question which we address next.
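The three examples can be contrasted on a score vector that satisfies the margin condition with $m = 1$ and on one that violates it. The sketch below assumes the hinge loss $\max\{0, \max_{j \ne k} 1 + \theta_j - \theta_k\}$, the sparsemax loss $\frac{1}{2}\|e_k - \theta\|^2 - \frac{1}{2}\|\mathrm{sparsemax}(\theta) - \theta\|^2$, and the logistic loss $\log\sum_i \exp(\theta_i) - \theta_k$; the function names and inputs are illustrative.

```python
import math

def sparsemax(theta):
    z = sorted(theta, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(t - tau, 0.0) for t in theta]

def hinge_loss(theta, k):
    return max(0.0, max(1.0 + t - theta[k]
                        for j, t in enumerate(theta) if j != k))

def sparsemax_loss(theta, k):
    y = [1.0 if i == k else 0.0 for i in range(len(theta))]
    p = sparsemax(theta)
    to_target = 0.5 * sum((yi - t) ** 2 for yi, t in zip(y, theta))
    to_pred = 0.5 * sum((pi - t) ** 2 for pi, t in zip(p, theta))
    return to_target - to_pred

def logistic_loss(theta, k):
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta)) - theta[k]

k = 0
separated = [3.0, 2.0, 0.5]   # theta_k = 1 + max_{j != k} theta_j
violated = [2.5, 2.0, 0.5]    # theta_k falls short of the unit margin

for theta in (separated, violated):
    print(hinge_loss(theta, k), sparsemax_loss(theta, k), logistic_loss(theta, k))
```

On the separated scores, the hinge and sparsemax losses vanish while the logistic loss stays strictly positive; on the violated scores, all three are positive.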
Conditions for existence of margin.
To accomplish our goal, we need to characterize the gradient mappings $\partial(-H)$ and $\nabla(-H)^*$ associated with generalized entropies (note that $\partial(-H)$ is never single-valued: if $\theta$ is in $\partial(-H)(p)$, then so is $\theta + c\mathbf{1}$, for any constant $c \in \mathbb{R}$). Of particular importance is the subdifferential set $\partial(-H)(e_k)$. The next proposition, whose proof we defer to §C.5, uses this set to provide a necessary and sufficient condition for the existence of a separation margin, along with a formula for computing it.

Let $H$ satisfy A.1–A.3. Then:

1. The loss $L_{-H}$ has a separation margin iff there is an $m > 0$ such that $m e_k \in \partial(-H)(e_k)$.

2. If the above holds, then the margin of $L_{-H}$ is given by the smallest such $m$ or, equivalently,
$$\mathrm{margin}(L_{-H}) = \sup_{p \in \triangle^d} \frac{H(p)}{1 - \|p\|_\infty}. \tag{18}$$

Reassuringly, the first part confirms that the logistic loss does not have a margin, since $\partial(-H^s)(e_k) = \emptyset$. A second interesting fact is that the denominator of (18) is the generalized entropy $H^N_\infty(p)$ introduced in §4: the $\infty$-norm entropy. As Figure 1 suggests, this entropy provides an upper bound for convex losses with unit margin. This provides some intuition for formula (18), which seeks a distribution $p$ maximizing the entropy ratio between $H(p)$ and $H^N_\infty(p)$.
Equivalence between sparsity and margin.
The next result, proved in §C.6, characterizes more precisely the image of $\nabla(-H)^*$. In doing so, it establishes a key result in this paper: a sufficient condition for the existence of a separation margin in $L_{-H}$ is the sparsity of the regularized prediction function $\hat{y}_{-H} \equiv \nabla(-H)^*$, i.e., its ability to reach the entire simplex, including the boundary points. If $H$ is uniformly separable, this is also a necessary condition.

Equivalence between sparse probability distribution and loss enjoying a margin. Let $H$ satisfy A.1–A.3 and be uniformly separable, i.e., $H(p) = \sum_{i=1}^d h(p_i)$. Then the following statements are all equivalent:

1. $\partial(-H)(p) \ne \emptyset$ for any $p \in \triangle^d$;

2. The mapping $\nabla(-H)^*$ covers the full simplex, i.e., $\nabla(-H)^*(\mathbb{R}^d) = \triangle^d$;

3. $L_{-H}$ has the separation margin property.

For a general $H$ (not necessarily separable) satisfying A.1–A.3, we have (1) $\Leftrightarrow$ (2) $\Rightarrow$ (3).
Let us reflect for a moment on the three conditions stated in Proposition 5. The first two conditions involve the subdifferential and gradient of $-H$ and its conjugate; the third condition is the margin property of $L_{-H}$. To provide some intuition, consider the case where $H$ is separable with $H(p) = \sum_i h(p_i)$ and $h$ is differentiable in $(0, 1)$. Then, from the concavity of $h$, its derivative $h'$ is decreasing, hence the first condition is met if $\lim_{t \to 0^+} h'(t) < \infty$ and $\lim_{t \to 1^-} h'(t) > -\infty$. This is the case with Tsallis entropies for $\alpha > 1$, but not Shannon entropy, since $h'(t) = -1 - \log t$ explodes at $0$. Functions whose gradient “explodes” at the boundary of their domain (hence failing to meet the first condition in Proposition 5) are called “essentially smooth” (Rockafellar, 1970). For those functions, $\nabla(-H)^*$ maps only to the relative interior of $\triangle^d$, never attaining boundary points (Wainwright and Jordan, 2008); this is expressed in the second condition. This prevents essentially smooth functions from generating a sparse $\hat{y}_{-H}(\theta)$ or (if they are separable) a loss $L_{-H}$ with a margin, as asserted by the third condition. Since Legendre-type functions (§3) are strictly convex and essentially smooth, by Proposition 3, loss functions for which the composite form $L_{-H}(\theta; y) = B_{-H}(y \,\|\, \hat{y}_{-H}(\theta))$ holds, which is the case of the logistic loss but not of the sparsemax loss, do not enjoy a margin and cannot induce a sparse probability distribution. This is geometrically visible in Figure 1.

Margin computation.
For Fenchel-Young losses that have the separation margin property, Proposition 5 provided a formula for determining the margin. While informative, formula (18) is not very practical, as it involves a generally non-convex optimization problem. The next proposition, proved in §C.7, takes a step further and provides a remarkably simple closed-form expression for generalized entropies that are twice-differentiable. To simplify notation, we denote by $\nabla_j H(p) \equiv (\nabla H(p))_j$ the $j$-th component of $\nabla H(p)$.

Assume $H$ satisfies the conditions in Proposition 5 and is twice-differentiable on the simplex. Then, for arbitrary $j \ne k$:
$$\mathrm{margin}(L_{-H}) = \nabla_j H(e_k) - \nabla_k H(e_k). \tag{19}$$
In particular, if $H$ is separable, i.e., $H(p) = \sum_{i=1}^d h(p_i)$, where $h\colon [0, 1] \to \mathbb{R}_+$ is concave, twice differentiable, with $h(0) = h(1) = 0$, then
$$\mathrm{margin}(L_{-H}) = h'(0) - h'(1) = -\int_0^1 h''(t)\, dt. \tag{20}$$
Example: case of Tsallis and norm entropies.
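As a hedged worked example, the closed-form margin expressions can be evaluated by hand. We assume the Tsallis generator $h_\alpha(t) = (t - t^\alpha)/(\alpha(\alpha - 1))$ with $\alpha > 1$ together with the separable formula $\mathrm{margin} = h'(0) - h'(1)$, and, for norm entropies $H^N_q(p) = 1 - \|p\|_q$ with $q > 1$, a formal application of $\mathrm{margin} = \nabla_j H(e_k) - \nabla_k H(e_k)$ for $j \ne k$. The derivation below is our own sketch under these assumptions.

```latex
% Tsallis entropies, alpha > 1 (assumed generator h_alpha).
\begin{align*}
h_\alpha'(t) &= \frac{1 - \alpha t^{\alpha - 1}}{\alpha(\alpha - 1)}, \qquad
h_\alpha'(0) = \frac{1}{\alpha(\alpha - 1)}, \qquad
h_\alpha'(1) = -\frac{1}{\alpha}, \\
\mathrm{margin}(L_{-H^T_\alpha}) &= h_\alpha'(0) - h_\alpha'(1)
  = \frac{1}{\alpha(\alpha - 1)} + \frac{1}{\alpha}
  = \frac{1}{\alpha - 1}.
\end{align*}

% Norm entropies, q > 1 (gradient of 1 - ||p||_q evaluated at a vertex e_k).
\begin{align*}
\nabla_j H^N_q(p) &= -\frac{p_j^{\,q-1}}{\|p\|_q^{\,q-1}}, \qquad
\nabla_k H^N_q(e_k) = -1, \qquad
\nabla_j H^N_q(e_k) = 0 \quad (j \ne k), \\
\mathrm{margin}(L_{-H^N_q}) &= \nabla_j H^N_q(e_k) - \nabla_k H^N_q(e_k) = 0 - (-1) = 1.
\end{align*}
```

For $\alpha = 2$ the Tsallis margin equals $1$, in agreement with the unit margin of the sparsemax loss, and it diverges as $\alpha \to 1^+$, in agreement with the logistic loss having no margin.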
6 Experimental results
As we saw, $\alpha$-Tsallis entropies generate a family of losses, with the logistic ($\alpha \to 1$) and sparsemax ($\alpha = 2$) losses as important special cases. In addition, they are twice differentiable for $\alpha \in [1, 2)$, produce sparse probability distributions for $\alpha > 1$ and are computationally efficient for any $\alpha \ge 1$, thanks to Proposition 4. In this section, we demonstrate their usefulness on the task of label proportion estimation and compare different solvers for computing $\hat{y}_{-H^T_\alpha}$.
Table 3: Test-set mean Jensen-Shannon divergence / mean squared error for Tsallis losses.

Dataset | $\alpha = 1$ (logistic) | $\alpha = 1.5$ | $\alpha = 2$ (sparsemax) | tuned $\alpha$
Birds | 0.359 / 0.530 | 0.364 / 0.504 | 0.364 / 0.504 | 0.358 / 0.501
Cal500 | 0.454 / 0.034 | 0.456 / 0.035 | 0.452 / 0.035 | 0.456 / 0.034
Emotions | 0.226 / 0.327 | 0.225 / 0.317 | 0.225 / 0.317 | 0.224 / 0.321
Mediamill | 0.375 / 0.208 | 0.363 / 0.193 | 0.356 / 0.191 | 0.361 / 0.193
Scene | 0.175 / 0.344 | 0.176 / 0.363 | 0.176 / 0.363 | 0.175 / 0.345
TMC | 0.225 / 0.337 | 0.224 / 0.327 | 0.224 / 0.327 | 0.217 / 0.328
Yeast | 0.307 / 0.183 | 0.314 / 0.186 | 0.314 / 0.186 | 0.307 / 0.183
Avg. rank | 2.57 / 2.71 | 2.71 / 2.14 | 2.14 / 2.00 | 1.43 / 1.86
Label proportion estimation.
Given an input vector $x \in \mathcal{X} \subseteq \mathbb{R}^p$, where $p$ is the number of features, our goal is to estimate a vector of label proportions $y \in \triangle^d$, where $d$ is the number of classes. If $y$ is sparse, we expect Tsallis losses to outperform the conventional logistic loss on this task. At training time, given a set of $n$ $(x_i, y_i)$ pairs, we estimate a matrix $W \in \mathbb{R}^{d \times p}$ by minimizing the convex objective
$$R(W) := \sum_{i=1}^n L_\Omega(W x_i; y_i) + \frac{\lambda}{2}\|W\|_F^2. \tag{23}$$
We use L-BFGS (Liu and Nocedal, 1989) for simplicity. From Proposition 9 and using the chain rule, we obtain the gradient expression $\nabla R(W) = (\hat{Y}_\Omega - Y)^\top X + \lambda W$, where $\hat{Y}_\Omega$, $Y$ and $X$ are matrices whose rows gather $\hat{y}_\Omega(W x_i)$, $y_i$ and $x_i$, for $i = 1, \dots, n$. At test time, we predict label proportions by $\hat{y}_\Omega(W x)$.

We ran experiments on standard multi-label benchmark datasets; see §B for dataset characteristics. For all datasets, we removed samples with no label, normalized samples to have zero mean and unit variance, and normalized labels to lie in the probability simplex. We chose $\lambda$ and $\alpha$ against the validation set. We report the test-set mean Jensen-Shannon divergence, $\mathrm{JS}(p, y) := \frac{1}{2}\mathrm{KL}\big(p \,\|\, \frac{p + y}{2}\big) + \frac{1}{2}\mathrm{KL}\big(y \,\|\, \frac{p + y}{2}\big)$, and the mean squared error in Table 3. As can be seen, the loss with tuned $\alpha$ achieves the best averaged rank overall. Tuning $\alpha$ allows us to choose the best loss in the family in a data-driven fashion. Additional experiments confirm these findings; see §B.

Solver comparison.
Next, we compared bisection (binary search) and Brent’s method for solving (1) by root finding (Proposition 4). We focus on a single Tsallis entropy, and also compare against using a generic projected gradient algorithm (FISTA) to solve (1) naively. We measure the time needed to reach a solution within a fixed tolerance, over 200 samples. Median and 99% CI times reported in Figure 3 reveal that root finding scales better, with Brent’s method outperforming FISTA by one to two orders of magnitude.
7 Related work
Proper scoring rules (proper losses) are a well-studied object in statistics (Grünwald and Dawid, 2004; Gneiting and Raftery, 2007) and machine learning (Reid and Williamson, 2010; Williamson et al., 2016); they measure the discrepancy between a ground truth and a probability forecast in a Fisher-consistent manner. From Savage (1971) (see also Gneiting and Raftery (2007)), we can construct a proper scoring rule by
(24) 
recovering the wellknown relation between Bregman divergences and proper scoring rules. For example, using the Gini index generates the Brier score (Brier, 1950)