Probabilistic classification in PyTorch/TensorFlow/scikit-learn with Fenchel-Young losses
Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses which combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and make it easy to create useful new ones. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing both in theory and practice.
Loss functions are a cornerstone of statistics and machine learning: They measure the difference, or “loss,” between a ground truth and a prediction. As such, much work has been devoted to designing loss functions for a variety of supervised learning tasks, including regression (Huber, 1964), classification (Crammer and Singer, 2001), ranking (Joachims, 2002) and structured prediction (Lafferty et al., 2001; Collins, 2002; Tsochantaridis et al., 2005), to name only a few well-known directions.
Proper composite loss functions offer a principled framework unifying various existing loss functions. A proper composite loss is the composition of a proper loss between two probability distributions, with an invertible mapping from real vectors to probability distributions. The theoretical properties of proper loss functions, also known as proper scoring rules (Grünwald and Dawid, 2004; Gneiting and Raftery, 2007; and references therein), such as their Fisher consistency (classification calibration) and correspondence with Bregman divergences, are now well-understood. However, not all existing losses are proper composite loss functions; a notable example is the hinge loss used in support vector machines. In fact, we shall see that any loss function enjoying a separation margin, a prevalent concept in statistical learning theory which has been used to prove the famous perceptron mistake bound (Rosenblatt, 1958) and many other generalization bounds (Vapnik, 1998; Schölkopf and Smola, 2002), cannot be written in composite proper loss form.
At the same time, loss functions are often intimately related to an underlying statistical model and prediction function. For instance, the logistic loss corresponds to the multinomial distribution and the softmax operator, while the conditional random field (CRF) loss (Lafferty et al., 2001) for structured prediction is tied with marginal inference (Wainwright and Jordan, 2008). Both are instances of generalized linear models (Nelder and Baker, 1972; McCullagh and Nelder, 1989), associated with exponential family distributions. More recently, Martins and Astudillo (2016) proposed a new classification loss based on the projection onto the simplex. Unlike the logistic loss, this “sparsemax” loss induces probability distributions with sparse support, which is desirable in some applications for interpretability or computational efficiency reasons. However, the sparsemax loss was derived in a relatively ad-hoc manner and it is still relatively poorly understood. Is it one of a kind or can we generalize it in a principled manner? Thorough understanding of the core principles underpinning existing losses and their associated predictive model, potentially enabling the creation of useful new losses, is one of the main quests of this paper.
The starting points of this paper are the notions of output regularization and regularized prediction functions, which we use to provide a variational perspective on many existing prediction functions, including the aforementioned softmax, sparsemax and marginal inference. Based on simple convex duality arguments, we then introduce Fenchel-Young losses, a new way to automatically construct a loss function associated with any regularized prediction function. As we shall see, our proposal recovers many existing loss functions, which is in a sense surprising since many of these losses were originally proposed by independent efforts. Our framework goes beyond the simple probabilistic classification setting: We show how to create loss functions over various structured domains, including convex polytopes and convex cones. Our framework encourages the loss designer to think geometrically about the outputs desired for the task at hand. Once a (regularized) prediction function has been designed, our framework generates a corresponding loss function automatically. We will demonstrate the ease of creating loss functions, including useful new ones, using abundant examples throughout this paper.
The rest of this paper is organized as follows.
We introduce regularized prediction functions, unifying and generalizing softmax, sparsemax and marginal inference, among many others.
We introduce Fenchel-Young losses to learn models whose output layer is a regularized prediction function. We provide an in-depth study of their properties, showing that they unify many existing losses, including unstructured and structured losses.
We study Fenchel-Young losses for probabilistic classification. We show how to seamlessly create entire new families of losses from generalized entropies.
We characterize which entropies yield sparse distributions and losses with a separation margin, notions we prove to be intimately connected. Furthermore, we show that losses that enjoy a margin and induce a sparse distribution are precisely the ones that cannot be written in proper composite loss form.
We study Fenchel-Young losses for positive measures (unnormalized probability distributions), providing a new perspective on one-vs-all loss reductions.
We study Fenchel-Young losses for structured prediction. We introduce the concepts of probability distribution and mean regularizations, providing a unifying perspective on a number of structured losses, including the conditional random field (CRF) loss and the recently-proposed SparseMAP loss (Niculae et al., 2018). We illustrate these results by deriving losses over various convex polytopes, including ranking losses.
We present primal and dual algorithms for learning models with Fenchel-Young losses defined over arbitrary domains. We derive new efficient algorithms to compute regularized prediction functions and proximity operators, which are a core sub-routine for dual training algorithms.
We demonstrate the ability of Fenchel-Young losses to induce sparse distributions on two tasks: label proportion estimation and dependency parsing.
Finally, we review related work on proper (composite) losses and other losses proposed in the literature, Fenchel duality, and approximate inference.
This paper builds upon two previously published shorter conference papers. The first (Niculae et al., 2018) introduced Fenchel-Young losses in the structured prediction setting but only provided a limited analysis of their properties. The second (Blondel et al., 2019) provided a more in-depth analysis but focused on unstructured probabilistic classification. This paper provides a comprehensive study of Fenchel-Young losses across various domains. Besides a much more thorough treatment of previously covered topics, this paper contributes entirely new sections, including §6 on losses for positive measures, §8 on primal and dual training algorithms, and §A.2 on loss “Fenchel-Youngization”. We provide in §7 a new unifying view of structured prediction losses, and discuss at length various convex polytopes, promoting a geometric approach to structured prediction loss design; we also provide novel results in this section regarding structured separation margins (Proposition 7.4), proving the unit margin of the SparseMAP loss. We demonstrate how to use our framework to create useful new losses, including ranking losses, not covered in the previous two papers.
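We denote the $(d-1)$-dimensional probability simplex by $\triangle^d := \{p \in \mathbb{R}_+^d \colon \|p\|_1 = 1\}$. We denote the convex hull of a set $\mathcal{Y}$ by $\operatorname{conv}(\mathcal{Y})$ and the conic hull by $\operatorname{cone}(\mathcal{Y})$. We denote the domain of a function $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ by $\operatorname{dom}(\Omega) := \{\mu \in \mathbb{R}^d \colon \Omega(\mu) < \infty\}$. We denote the Fenchel conjugate of $\Omega$ by $\Omega^*(\theta) := \sup_{\mu \in \operatorname{dom}(\Omega)} \langle \theta, \mu \rangle - \Omega(\mu)$. We denote the indicator function of a set $\mathcal{C}$ by
$$I_{\mathcal{C}}(\mu) := \begin{cases} 0 & \text{if } \mu \in \mathcal{C} \\ \infty & \text{otherwise} \end{cases} \tag{1}$$
and its support function by $\sigma_{\mathcal{C}}(\theta) := \sup_{\mu \in \mathcal{C}} \langle \theta, \mu \rangle$. We define the proximity operator (a.k.a. proximal operator) of $\Omega$ by
$$\operatorname{prox}_{\Omega}(\eta) := \operatorname*{argmin}_{\mu \in \operatorname{dom}(\Omega)} \frac{1}{2}\|\mu - \eta\|^2 + \Omega(\mu). \tag{2}$$
We denote the interior and relative interior of $\mathcal{C}$ by $\operatorname{int}(\mathcal{C})$ and $\operatorname{relint}(\mathcal{C})$, respectively. We denote $[x]_+ := \max(x, 0)$, evaluated element-wise.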
In this section, we introduce the concept of regularized prediction function (§2.1), which is central to this paper. We then give simple and well-known examples of such functions (§2.2) and discuss their properties in a general setting (§2.3,§2.4).
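We consider a general predictive setting with input variables $x \in \mathcal{X}$, and a parametrized model $f_W \colon \mathcal{X} \to \mathbb{R}^d$ (which could be a linear model or a neural network), producing a score vector $\theta := f_W(x) \in \mathbb{R}^d$. In a simple multi-class classification setting, the score vector is typically used to pick the highest-scoring class among $d$ possible ones:
$$\hat{y}(\theta) \in \operatorname*{argmax}_{j \in [d]} \theta_j. \tag{3}$$
This can be generalized to an arbitrary output space $\mathcal{Y} \subseteq \mathbb{R}^d$ by using instead
$$\hat{y}(\theta) \in \operatorname*{argmax}_{y \in \mathcal{Y}} \langle \theta, y \rangle, \tag{4}$$
where $\langle \theta, y \rangle$ intuitively captures the affinity between $x$ (since $\theta$ is produced by $f_W(x)$) and $y$. Therefore, (4) seeks the output $y$ with greatest affinity with $x$. The support function $\sigma_{\mathcal{Y}}(\theta) = \max_{y \in \mathcal{Y}} \langle \theta, y \rangle$ can be interpreted as the largest projection of any element of $\mathcal{Y}$ onto the line generated by $\theta$.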
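As a small illustration of the linear maximization in (4), the following sketch enumerates a finite output set and returns the element with the largest inner product with the score vector, together with the value of the support function. The function name `map_oracle` and the toy output sets are assumptions made for this example; explicit enumeration is only practical for small sets.

```python
import numpy as np

def map_oracle(theta, outputs):
    """Return the row of `outputs` maximizing <theta, y>, and that maximal value."""
    outputs = np.asarray(outputs, dtype=float)
    scores = outputs @ theta              # <theta, y> for every candidate y
    best = int(np.argmax(scores))
    return outputs[best], float(scores[best])

theta = np.array([0.5, 2.0, -1.0])

# Multi-class case: the output set is the standard basis, so the oracle picks a class.
y_class, support_class = map_oracle(theta, np.eye(3))

# A hypothetical structured output set: binary vectors with exactly two ones.
pairs = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
y_pair, support_pair = map_oracle(theta, pairs)
```

With one-hot outputs the oracle reduces to picking the highest-scoring class; with the second set, the number of outputs and the dimensionality happen to coincide only by accident of the example.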
Clearly, (4) recovers (3) with $\mathcal{Y} = \{e_1, \dots, e_d\}$, where $e_j$ is a standard basis vector, $e_j := [0, \dots, 0, 1, 0, \dots, 0]$ with the one in position $j$. In this case, the cardinality $|\mathcal{Y}|$ and the dimensionality $d$ coincide, but this need not be the case in general. Eq. (4) is often called a linear maximization oracle or maximum a-posteriori (MAP) oracle (Wainwright and Jordan, 2008). The latter name comes from the fact that (4) coincides with the mode of the Gibbs distribution defined by
$$p(y; \theta) \propto \exp(\langle \theta, y \rangle).$$
We now extend the prediction function (4) by replacing $\mathcal{Y}$ with its convex hull $\operatorname{conv}(\mathcal{Y})$ and introducing a regularization function $\Omega$ into the optimization problem:
$$\hat{y}_{\Omega}(\theta) \in \operatorname*{argmax}_{\mu \in \operatorname{conv}(\mathcal{Y})} \langle \theta, \mu \rangle - \Omega(\mu).$$
We emphasize that the regularization is w.r.t. predictions (outputs) and not w.r.t. model parameters (denoted by $W$ in this paper), as is usually the case in the literature. We illustrate the regularized prediction function pipeline in Figure 1.
The regularized prediction function (8) casts computing a prediction as a variational problem. It involves an optimization problem that balances between two terms: an “affinity” term $\langle \theta, \mu \rangle$, and a “confidence” term $\Omega(\mu)$ which should be low if $\mu$ is “uncertain.” Two important classes of convex $\Omega$ are (squared) norms and, when $\operatorname{dom}(\Omega)$ is the probability simplex, generalized negative entropies. However, our framework does not require $\Omega$ to be convex in general.
Introducing $\Omega(\mu)$ in (8) tends to move the prediction away from the vertices of $\operatorname{conv}(\mathcal{Y})$. Unless the regularization term $\Omega(\mu)$ is negligible compared to the affinity term $\langle \theta, \mu \rangle$, a prediction becomes a convex combination of several vertices. As we shall see in §7, we can interpret this prediction as the mean under some underlying distribution. This contrasts with (4), which always outputs the most likely vertex, i.e., the mode.
Regularized prediction functions are in fact not limited to convex hulls. We now state their precise definition in complete generality.
Prediction function regularized by $\Omega$
Let $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function, with $\operatorname{dom}(\Omega) \subseteq \mathbb{R}^d$. The prediction function regularized by $\Omega$ is defined by
$$\hat{y}_{\Omega}(\theta) \in \operatorname*{argmax}_{\mu \in \operatorname{dom}(\Omega)} \langle \theta, \mu \rangle - \Omega(\mu). \tag{8}$$
Allowing extended-real $\Omega$ permits general domain constraints in (8) via indicator functions. For instance, choosing $\Omega = I_{\operatorname{conv}(\mathcal{Y})}$, where $I_{\mathcal{C}}$ is the indicator function defined in (1), recovers the MAP oracle (4). Importantly, the choice of domain $\operatorname{dom}(\Omega)$ is not limited to convex hulls. For instance, we will also consider conic hulls, $\operatorname{cone}(\mathcal{Y})$, later in this paper.
Regularized prediction functions $\hat{y}_{\Omega}$ involve two main design choices: the domain $\operatorname{dom}(\Omega)$ over which $\Omega$ is defined and $\Omega$ itself. The choice of $\operatorname{dom}(\Omega)$ is mainly dictated by the type of output we want from $\hat{y}_{\Omega}$, such as $\operatorname{dom}(\Omega) = \operatorname{conv}(\mathcal{Y})$ for convex combinations of elements of $\mathcal{Y}$, and $\operatorname{dom}(\Omega) = \operatorname{cone}(\mathcal{Y})$ for conic combinations. The choice of the regularization $\Omega$ itself further governs certain properties of $\hat{y}_{\Omega}$, including, as we shall see in the sequel, its sparsity or its use of prior knowledge regarding the importance or misclassification cost of certain outputs. The choices of $\operatorname{dom}(\Omega)$ and $\Omega$ may also be constrained by computational considerations. Indeed, while computing $\hat{y}_{\Omega}(\theta)$ involves a potentially challenging constrained maximization problem in general, we will see that certain choices of $\Omega$ lead to closed-form expressions. The power of our framework is that the user can focus solely on designing and computing $\hat{y}_{\Omega}$: We will see in §3 how to automatically construct a loss function associated with $\hat{y}_{\Omega}$.
To illustrate regularized prediction functions, we give several concrete examples enjoying a closed-form expression.
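When $\Omega = I_{\triangle^d}$, $\hat{y}_{\Omega}(\theta)$ is a one-hot representation of the argmax prediction
$$\hat{y}_{\Omega}(\theta) \in \operatorname*{argmax}_{y \in \{e_1, \dots, e_d\}} \langle \theta, y \rangle. \tag{9}$$
We can see that output as a probability distribution that assigns all probability mass to a single class. When $\Omega = -H^{s} + I_{\triangle^d}$, where $H^{s}(p) := -\sum_i p_i \log p_i$ is Shannon’s entropy, $\hat{y}_{\Omega}(\theta)$ is the well-known softmax
$$\operatorname{softmax}(\theta) := \frac{\exp(\theta)}{\sum_{j=1}^{d} \exp(\theta_j)}. \tag{10}$$
See Boyd and Vandenberghe (2004, Ex. 3.25) for a derivation. The resulting distribution always has dense support. When $\Omega = \frac{1}{2}\|\cdot\|^2 + I_{\triangle^d}$, $\hat{y}_{\Omega}(\theta)$ is the Euclidean projection onto the probability simplex
$$\operatorname{sparsemax}(\theta) := \operatorname*{argmin}_{p \in \triangle^d} \|p - \theta\|^2, \tag{11}$$
a.k.a. the sparsemax transformation (Martins and Astudillo, 2016). It is well-known that
$$\operatorname*{argmin}_{p \in \triangle^d} \|p - \theta\|^2 = [\theta - \tau \mathbf{1}]_+, \tag{12}$$
for some threshold $\tau \in \mathbb{R}$. Hence, the predicted distribution can have sparse support (it may assign exactly zero probability to low-scoring classes). The threshold $\tau$ can be computed exactly in $O(d)$ time (Brucker, 1984; Duchi et al., 2008; Condat, 2016).
The regularized prediction function paradigm is, however, not limited to the probability simplex: When $\Omega(p) = \sum_i p_i \log p_i + (1 - p_i) \log(1 - p_i) + I_{[0,1]^d}(p)$, we get
$$\hat{y}_{\Omega}(\theta) = \operatorname{sigmoid}(\theta) := \frac{1}{1 + \exp(-\theta)}, \tag{13}$$
i.e., the sigmoid function evaluated coordinate-wise. We can think of its output as a positive measure (unnormalized probability distribution).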
We will see in §4 that the first three examples (argmax, softmax and sparsemax) are particular instances of a broader family of prediction functions, using the notion of generalized entropy. The last example is a special case of regularized prediction function over positive measures, developed in §6. Regularized prediction functions also encompass more complex convex polytopes for structured prediction, as we shall see in §7.
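The closed-form examples above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names are chosen for this example, and the sparsemax threshold is found with a simple sort-based procedure in $O(d \log d)$ time instead of the linear-time algorithms cited in the text.

```python
import numpy as np

def argmax_onehot(theta):
    """One-hot vector on the highest-scoring class."""
    out = np.zeros(len(theta))
    out[int(np.argmax(theta))] = 1.0
    return out

def softmax(theta):
    """Softmax, shifted by the max score for numerical stability."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def sparsemax(theta):
    """Euclidean projection onto the simplex via sorting (O(d log d))."""
    u = np.sort(theta)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(u) - 1.0                # cumulative sums minus the target mass
    k = np.arange(1, len(theta) + 1)
    support = u - cssv / k > 0               # coordinates kept in the support
    rho = k[support][-1]                     # support size
    tau = cssv[rho - 1] / rho                # threshold
    return np.maximum(theta - tau, 0.0)

def sigmoid(theta):
    """Coordinate-wise sigmoid; the output need not sum to one."""
    return 1.0 / (1.0 + np.exp(-theta))

theta = np.array([1.0, 0.8, -1.0])
p_hard = argmax_onehot(theta)
p_soft = softmax(theta)
p_sparse = sparsemax(theta)
p_sig = sigmoid(theta)
```

On this score vector, softmax assigns strictly positive probability to every class, whereas sparsemax assigns exactly zero probability to the lowest-scoring class.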
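From Danskin’s theorem (Danskin, 1966) (see also Bertsekas (1999, Proposition B.25)), $\hat{y}_{\Omega}(\theta)$ is a subgradient of $\Omega^*$ at $\theta$, i.e., $\hat{y}_{\Omega}(\theta) \in \partial \Omega^*(\theta)$. If, furthermore, $\Omega$ is strictly convex, then $\hat{y}_{\Omega}(\theta)$ is the gradient of $\Omega^*$ at $\theta$, i.e., $\hat{y}_{\Omega}(\theta) = \nabla \Omega^*(\theta)$. This interpretation of $\hat{y}_{\Omega}(\theta)$ as a (sub)gradient mapping will play a crucial role in the next section for deriving a loss function associated with $\hat{y}_{\Omega}$.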
where denotes the infimal convolution of with . Furthermore, from Danskin’s theorem, , where denotes an optimal solution of the infimum in (14). We can think of that infimum as the dual of the optimization problem in (8). When , is known as the Moreau envelope of (Moreau, 1965) and using Moreau’s decomposition, we obtain . As another example, when , we obtain
where we used , the support function of . In particular, when , we have . This dual view is informative insofar as it suggests that regularized prediction functions with minimize a trade-off between maximizing the value achieved by the unregularized prediction function , and a proximity term .
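We now discuss simple yet useful properties of regularized prediction functions. The first two assume that $\Omega$ is a symmetric function, i.e., that it satisfies
$$\Omega(\mu) = \Omega(P\mu) \quad \forall \mu \in \operatorname{dom}(\Omega), \ \forall P \in \mathcal{P}, \tag{16}$$
where $\mathcal{P}$ is the set of $d \times d$ permutation matrices.
Properties of regularized prediction functions
Effect of a permutation. If $\Omega$ is symmetric, then $\forall P \in \mathcal{P}$: $\hat{y}_{\Omega}(P\theta) = P\hat{y}_{\Omega}(\theta)$.
Order preservation. Let $\mu = \hat{y}_{\Omega}(\theta)$. If $\Omega$ is symmetric, then the coordinates of $\mu$ and $\theta$ are sorted the same way, i.e., $\theta_i > \theta_j \Rightarrow \mu_i \ge \mu_j$ and $\mu_i > \mu_j \Rightarrow \theta_i > \theta_j$.
Approximation error. If $\Omega$ is $\gamma$-strongly convex and bounded with $L \le \Omega(\mu) \le U$ for all $\mu \in \operatorname{dom}(\Omega)$, then $\frac{1}{2}\|\hat{y}(\theta) - \hat{y}_{\Omega}(\theta)\|^2 \le \frac{U - L}{\gamma}$.
Temperature scaling. For any constant $t > 0$, $\hat{y}_{t\Omega}(\theta) \in \partial \Omega^*(\theta / t)$. If $\Omega$ is strictly convex, $\hat{y}_{t\Omega}(\theta) = \hat{y}_{\Omega}(\theta / t) = \nabla \Omega^*(\theta / t)$.
Constant invariance. For any constant $c \in \mathbb{R}$, $\hat{y}_{\Omega + c}(\theta) = \hat{y}_{\Omega}(\theta)$.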
The proof is given in Appendix B.1.
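For classification, the order-preservation property ensures that the highest-scoring classes according to $\theta$ and $\hat{y}_{\Omega}(\theta)$ agree with each other:
$$\operatorname*{argmax}_{i \in [d]} \theta_i = \operatorname*{argmax}_{i \in [d]} \big(\hat{y}_{\Omega}(\theta)\big)_i. \tag{17}$$
Temperature scaling is useful to control how close we are to unregularized prediction functions. Clearly, $\hat{y}_{t\Omega}(\theta) \to \hat{y}(\theta)$ as $t \to 0$, where $\hat{y}(\theta)$ is defined in (4).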
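The permutation, order-preservation and temperature-scaling properties can be checked numerically on a concrete case. The sketch below assumes the softmax as the regularized prediction function (Shannon negative entropy on the simplex); the variable names and the random score vector are illustrative only.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=5)

# Effect of a permutation: permuting the scores permutes the prediction.
perm = rng.permutation(5)
perm_ok = np.allclose(softmax(theta[perm]), softmax(theta)[perm])

# Order preservation: scores and probabilities are sorted the same way.
order_ok = np.array_equal(np.argsort(theta), np.argsort(softmax(theta)))

# Temperature scaling: softmax(theta / t) approaches the one-hot argmax as t -> 0.
onehot = np.zeros(5)
onehot[int(np.argmax(theta))] = 1.0
gaps = [float(np.abs(softmax(theta / t) - onehot).max()) for t in (1.0, 0.1, 0.01)]
```

The list `gaps` records the largest coordinate-wise deviation from the one-hot prediction at each temperature, and it shrinks as the temperature decreases.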
In the previous section, we introduced regularized prediction functions over arbitrary domains, as a generalization of classical (unregularized) decision functions. In this section, we introduce Fenchel-Young losses for learning models whose output layer is a regularized prediction function. We first give their definitions and state their properties (§3.1). We then discuss their relationship with Bregman divergences (§3.2) and their Bayes risk (§3.3). Finally, we show how to construct a cost-sensitive loss from any Fenchel-Young loss (§3.4).
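Given a regularized prediction function $\hat{y}_{\Omega}$, we define its associated loss as follows.
Fenchel-Young loss generated by $\Omega$
Let $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function such that the maximum in (8) is achieved for all $\theta \in \mathbb{R}^d$. Let $y \in \mathcal{Y} \subseteq \operatorname{dom}(\Omega)$ be a ground-truth label and $\theta \in \operatorname{dom}(\Omega^*) = \mathbb{R}^d$ be a vector of prediction scores.
The Fenchel-Young loss $L_{\Omega} \colon \operatorname{dom}(\Omega^*) \times \operatorname{dom}(\Omega) \to \mathbb{R}_+$ generated by $\Omega$ is
$$L_{\Omega}(\theta; y) := \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle. \tag{18}$$
It is easy to see that Fenchel-Young losses can be rewritten as
$$L_{\Omega}(\theta; y) = f_{\theta}(y) - f_{\theta}(\hat{y}_{\Omega}(\theta)), \tag{19}$$
where $f_{\theta}(\mu) := \Omega(\mu) - \langle \theta, \mu \rangle$, highlighting the relation with regularized prediction functions. Therefore, as long as we can compute a regularized prediction function $\hat{y}_{\Omega}(\theta)$, we can automatically obtain an associated Fenchel-Young loss $L_{\Omega}(\theta; y)$. Conversely, we also have that $\hat{y}_{\Omega}$ outputs the prediction minimizing the loss:
$$\hat{y}_{\Omega}(\theta) \in \operatorname*{argmin}_{\mu \in \operatorname{dom}(\Omega)} L_{\Omega}(\theta; \mu). \tag{20}$$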
Examples of existing losses that fall into the Fenchel-Young loss family are given in Table 1. Some of these examples are discussed in more detail later in this paper.
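To make the construction concrete, the sketch below builds a Fenchel-Young loss from nothing more than a regularizer and its prediction function, evaluating the conjugate at the maximizer. It is an illustration under stated assumptions: the helper `fy_loss` and its signature are hypothetical, and the two instances shown are the Shannon negative entropy (which gives the logistic loss) and the squared 2-norm on the simplex (which gives the sparsemax loss).

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def sparsemax(theta):
    u = np.sort(theta)[::-1]
    cssv = np.cumsum(u) - 1.0
    k = np.arange(1, len(theta) + 1)
    rho = k[u - cssv / k > 0][-1]
    return np.maximum(theta - cssv[rho - 1] / rho, 0.0)

def neg_shannon_entropy(p):
    """sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz])))

def half_sq_norm(p):
    return 0.5 * float(np.dot(p, p))

def fy_loss(theta, y, omega, predict):
    """Omega^*(theta) + Omega(y) - <theta, y>, with the conjugate evaluated at predict(theta)."""
    p = predict(theta)
    conjugate = float(np.dot(theta, p)) - omega(p)
    return conjugate + omega(y) - float(np.dot(theta, y))

theta = np.array([1.0, 0.8, -1.0])
y = np.array([0.0, 1.0, 0.0])            # one-hot ground truth (class index 1)

logistic = fy_loss(theta, y, neg_shannon_entropy, softmax)
sparsemax_loss = fy_loss(theta, y, half_sq_norm, sparsemax)
```

For the one-hot label used here, `logistic` equals the log-sum-exp of the scores minus the score of the true class, which is the usual multinomial logistic loss.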
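As the name indicates, this family of loss functions is grounded in the Fenchel-Young inequality (Borwein and Lewis, 2010, Proposition 3.3.4)
$$\Omega^*(\theta) + \Omega(\mu) \ge \langle \theta, \mu \rangle \quad \forall \theta \in \operatorname{dom}(\Omega^*), \ \mu \in \operatorname{dom}(\Omega). \tag{21}$$
The inequality, together with well-known results regarding convex conjugates, implies the following properties of Fenchel-Young losses.
Properties of Fenchel-Young losses
Non-negativity. $L_{\Omega}(\theta; y) \ge 0$ for any $\theta \in \operatorname{dom}(\Omega^*)$ and $y \in \mathcal{Y} \subseteq \operatorname{dom}(\Omega)$.
Zero loss. If $\Omega$ is a lower semi-continuous proper convex function, then $\min_{\theta} L_{\Omega}(\theta; y) = 0$, and $L_{\Omega}(\theta; y) = 0 \Leftrightarrow y \in \partial \Omega^*(\theta)$. If $\Omega$ is strictly convex, then $L_{\Omega}(\theta; y) = 0 \Leftrightarrow y = \hat{y}_{\Omega}(\theta) = \nabla \Omega^*(\theta)$.
Convexity & subgradients. $L_{\Omega}$ is convex in $\theta$ and the residual vectors are its subgradients: $\hat{y}_{\Omega}(\theta) - y \in \partial L_{\Omega}(\theta; y)$.
Differentiability & smoothness. If $\Omega$ is strictly convex, then $L_{\Omega}$ is differentiable and $\nabla L_{\Omega}(\theta; y) = \hat{y}_{\Omega}(\theta) - y$. If $\Omega$ is strongly convex, then $L_{\Omega}$ is smooth, i.e., $\nabla L_{\Omega}(\theta; y)$ is Lipschitz continuous.
Temperature scaling. For any constant $t > 0$, $L_{t\Omega}(\theta; y) = t L_{\Omega}(\theta / t; y)$.
Constant invariance. For any constant $c \in \mathbb{R}$, $L_{\Omega + c}(\theta; y) = L_{\Omega}(\theta; y)$.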
Remarkably, the non-negativity, convexity and constant invariance properties hold even if is not convex. The zero loss property follows from the fact that, if is l.s.c. proper convex, then (21) becomes an equality (i.e., the duality gap is zero) if and only if . It suggests that the minimization of Fenchel-Young losses attempts to adjust the model to produce predictions that are close to the target , reducing the duality gap. This is illustrated with and (leading to the squared loss) in Figure 3.
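The gradient and zero-loss properties can be verified numerically. The following sketch assumes the logistic loss (the Fenchel-Young loss generated by the Shannon negative entropy on the simplex) and compares the residual vector with a central finite-difference gradient; the step size and tolerances are arbitrary choices for this example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def logistic_loss(theta, y):
    """logsumexp(theta) + sum_i y_i log y_i - <theta, y>, with 0 log 0 = 0."""
    m = np.max(theta)
    lse = m + np.log(np.sum(np.exp(theta - m)))
    nz = y > 0
    return float(lse + np.sum(y[nz] * np.log(y[nz])) - np.dot(theta, y))

theta = np.array([0.3, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])

# Residual vector: prediction minus ground truth.
analytic = softmax(theta) - y

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (logistic_loss(theta + eps * e, y) - logistic_loss(theta - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
grad_ok = np.allclose(analytic, numeric, atol=1e-6)

# Zero loss: the loss vanishes when the target equals the prediction.
loss_at_prediction = logistic_loss(theta, softmax(theta))
loss_at_label = logistic_loss(theta, y)
```

Because the gradient is simply the residual, a training loop only needs the prediction function to compute the backward pass through the loss.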
Our assumption that the maximum in the regularized prediction function (8) is achieved for all implies that . This assumption is quite mild and does not require to be bounded. Minimizing w.r.t. is therefore an unconstrained convex optimization problem. This contrasts with proper loss functions, which are defined over the probability simplex, as discussed in §10.
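Fenchel-Young losses seamlessly work when $\mathcal{Y} = \triangle^d$ instead of $\mathcal{Y} = \{e_1, \dots, e_d\}$. For example, in the case of the logistic loss, where $-\Omega$ is the Shannon entropy restricted to $\triangle^d$, allowing $y \in \triangle^d$ instead of $y \in \{e_1, \dots, e_d\}$ yields the cross-entropy loss, $L_{\Omega}(\theta; y) = \operatorname{KL}(y \| \operatorname{softmax}(\theta))$, where $\operatorname{KL}$ denotes the (generalized) Kullback-Leibler divergence
$$\operatorname{KL}(y \| p) := \sum_i y_i \log \frac{y_i}{p_i} - \sum_i y_i + \sum_i p_i. \tag{22}$$
This can be useful in a multi-label setting with supervision in the form of label proportions.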
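As a quick numerical check of the label-proportion case, the sketch below evaluates the logistic Fenchel-Young loss on a target that lies inside the simplex and compares it with the Kullback-Leibler divergence between the target and the softmax prediction. The particular scores and proportions are made up for illustration; since both vectors sum to one, the generalized divergence reduces to the standard one.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.5])
y = np.array([0.7, 0.2, 0.1])            # label proportions (strictly positive here)

p = softmax(theta)
m = np.max(theta)
lse = m + np.log(np.sum(np.exp(theta - m)))

# Fenchel-Young loss: logsumexp(theta) + sum_i y_i log y_i - <theta, y>.
fy = float(lse + np.sum(y * np.log(y)) - np.dot(theta, y))

# Kullback-Leibler divergence between the target and the prediction.
kl = float(np.sum(y * np.log(y / p)))
```

The two quantities agree up to floating-point error, so minimizing the loss pulls the predicted distribution toward the observed proportions.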
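From this example, it is tempting to conjecture that a similar result holds for more general Bregman divergences (Bregman, 1967). Recall that the Bregman divergence $B_{\Omega} \colon \operatorname{dom}(\Omega) \times \operatorname{relint}(\operatorname{dom}(\Omega)) \to \mathbb{R}_+$ generated by a strictly convex and differentiable $\Omega$ is
$$B_{\Omega}(y \| \mu) := \Omega(y) - \Omega(\mu) - \langle \nabla \Omega(\mu), y - \mu \rangle. \tag{23}$$
In other words, this is the difference at $y$ between $\Omega$ and its linearization around $\mu$. It turns out that $L_{\Omega}(\theta; y)$ is not in general equal to $B_{\Omega}(y \| \hat{y}_{\Omega}(\theta))$. In fact the latter is not necessarily convex in $\theta$ while the former always is. However, there is a duality relationship between Fenchel-Young losses and Bregman divergences, as we now discuss.
Letting $\theta = \nabla \Omega(\mu)$ (i.e., $(\theta, \mu)$ is a dual pair), we have $\Omega^*(\theta) = \langle \theta, \mu \rangle - \Omega(\mu)$. Substituting in (23), we get $B_{\Omega}(y \| \mu) = L_{\Omega}(\theta; y)$. In other words, Fenchel-Young losses can be viewed as a “mixed-form Bregman divergence” (Amari, 2016, Theorem 1.1) where the argument $\mu$ in (23) is replaced by its dual point $\theta$. This difference is best seen by comparing the function signatures, $L_{\Omega} \colon \operatorname{dom}(\Omega^*) \times \operatorname{dom}(\Omega) \to \mathbb{R}_+$ vs. $B_{\Omega} \colon \operatorname{dom}(\Omega) \times \operatorname{relint}(\operatorname{dom}(\Omega)) \to \mathbb{R}_+$. An important consequence is that Fenchel-Young losses do not impose any restriction on their left argument $\theta$: Our assumption that the maximum in the prediction function (8) is achieved for all $\theta \in \mathbb{R}^d$ implies $\operatorname{dom}(\Omega^*) = \mathbb{R}^d$. In contrast, a Bregman divergence would typically need to be composed with a mapping from $\mathbb{R}^d$ to $\operatorname{dom}(\Omega)$, such as $\hat{y}_{\Omega}$, resulting in a possibly non-convex function.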
We can make the relationship with Bregman divergences further precise when , where is restricted to the class of so-called Legendre-type functions (Rockafellar, 1970; Wainwright and Jordan, 2008). We first recall the definition of this class of functions and then state our results. Essentially smooth and Legendre type functions
A function is essentially smooth if
is differentiable throughout ,
and for any sequence contained in , and converging to a boundary point of .
A function is of Legendre type if
it is strictly convex on
and essentially smooth.
For instance, is Legendre-type with , and is Legendre-type with . However, is not Legendre-type, since the gradient of does not explode everywhere on the boundary of . The Legendre-type assumption crucially implies that
We can use this fact to derive the following results, proved in Appendix B.2. Relation with Bregman divergences
Let be of Legendre type with and
let be a convex set.
Let be the restriction of to , i.e., .
Bregman projection. The prediction function regularized by , , reduces to the Bregman projection of onto :
Difference of divergences. For all and :
Bound. For all and :
with equality when the loss is minimized
Composite form. When , i.e., , we have equality for all
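We illustrate these properties using $\Psi = \frac{1}{2}\|\cdot\|^2$ as a running example. From the first property, since $\nabla \Psi^*(\theta) = \theta$ and $B_{\Psi}(y \| \mu) = \frac{1}{2}\|y - \mu\|^2$, we get
$$\hat{y}_{\Omega}(\theta) = \operatorname*{argmin}_{\mu \in \mathcal{C}} \frac{1}{2}\|\mu - \theta\|^2,$$
recovering the Euclidean projection onto $\mathcal{C}$. The reduction of regularized prediction functions to Bregman projections (when $\Psi$ is of Legendre type) is useful because there exist efficient algorithms for computing the Bregman projection onto various convex sets (Yasutake et al., 2011; Suehiro et al., 2012; Krichene et al., 2015; Lim and Wright, 2016). Therefore, we can use these algorithms to compute $\hat{y}_{\Omega}(\theta)$ provided that $\nabla \Psi^*(\theta)$ is available.
From the second property, we obtain for all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$
$$L_{\Omega}(\theta; y) = \frac{1}{2}\|y - \theta\|^2 - \frac{1}{2}\|\hat{y}_{\Omega}(\theta) - \theta\|^2.$$
This recovers the expression of the sparsemax loss given in Table 1 with $\mathcal{C} = \triangle^d$.
From the third property, we obtain for all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$
$$\frac{1}{2}\|y - \hat{y}_{\Omega}(\theta)\|^2 \le L_{\Omega}(\theta; y).$$
This shows that $L_{\Omega}(\theta; y)$ provides a convex upper-bound for the possibly non-convex composite function $\frac{1}{2}\|y - \hat{y}_{\Omega}(\theta)\|^2$. In particular, when $\mathcal{C} = \triangle^d$, we get $\frac{1}{2}\|y - \operatorname{sparsemax}(\theta)\|^2 \le L_{\Omega}(\theta; y)$. This suggests that the sparsemax loss is useful for sparse label proportion estimation, as confirmed in our experiments (§9).
Finally, from the last property, if $\mathcal{C} = \mathbb{R}^d$, we obtain $L_{\Omega}(\theta; y) = \frac{1}{2}\|y - \theta\|^2$, which is indeed the squared loss given in Table 1.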
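The upper-bound claim in the running example can be probed numerically. The sketch below assumes the squared 2-norm restricted to the probability simplex, so that the prediction is the sparsemax projection, and it writes the loss as the difference of two squared distances to the score vector. Random scores and Dirichlet-distributed targets are arbitrary test inputs, not data from the paper.

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection onto the simplex via sorting."""
    u = np.sort(theta)[::-1]
    cssv = np.cumsum(u) - 1.0
    k = np.arange(1, len(theta) + 1)
    rho = k[u - cssv / k > 0][-1]
    return np.maximum(theta - cssv[rho - 1] / rho, 0.0)

rng = np.random.default_rng(0)
losses, gaps = [], []
for _ in range(200):
    theta = 2.0 * rng.normal(size=4)
    y = rng.dirichlet(np.ones(4))                 # target inside the simplex
    p = sparsemax(theta)
    # Sparsemax loss as a difference of squared distances to theta.
    loss = 0.5 * np.sum((y - theta) ** 2) - 0.5 * np.sum((p - theta) ** 2)
    # Possibly non-convex composite function of theta.
    composite = 0.5 * np.sum((y - p) ** 2)
    losses.append(float(loss))
    gaps.append(float(loss - composite))

min_loss = min(losses)
min_gap = min(gaps)
```

Across the sampled pairs, the loss is non-negative and never falls below the composite squared distance, up to floating-point error.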
In this section, we discuss the relation between the pointwise Bayes risk (minimal achievable loss) of a Fenchel-Young loss and Bregman information (Banerjee et al., 2005).
Let $Y$ be a random variable taking values in $\operatorname{dom}(\Omega)$ and following the distribution $p$. The expected loss (a.k.a. expected risk) is then
Here, we defined the Bregman information of by
We refer the reader to Banerjee et al. (2005) for a detailed discussion as to why the last two equalities hold. The r.h.s. is exactly equal to the difference between the two sides of Jensen’s inequality and is therefore non-negative. For this reason, it is sometimes also called Jensen gap (Reid and Williamson, 2011).
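As a worked illustration of the Jensen gap, the sketch below assumes that the regularizer is the Shannon negative entropy and that the random label is a one-hot vector drawn from a distribution `p`, so that its mean is `p` itself. Under these assumptions the gap equals the Shannon entropy of `p`, and the code compares it with the expected logistic loss at the scores `log p` and at randomly drawn scores. All names and numbers are illustrative.

```python
import numpy as np

def neg_entropy(mu):
    """sum_i mu_i log mu_i, with the convention 0 log 0 = 0."""
    mu = np.asarray(mu, dtype=float)
    nz = mu > 0
    return float(np.sum(mu[nz] * np.log(mu[nz])))

def logistic_loss(theta, y):
    m = np.max(theta)
    lse = m + np.log(np.sum(np.exp(theta - m)))
    return float(lse + neg_entropy(y) - np.dot(theta, y))

p = np.array([0.6, 0.3, 0.1])            # distribution over three classes
labels = np.eye(3)                       # one-hot realizations of the label

# Jensen gap: E[Omega(Y)] - Omega(E[Y]), where E[Y] = p for one-hot labels.
mean_label = labels.T @ p
jensen_gap = sum(p[i] * neg_entropy(labels[i]) for i in range(3)) - neg_entropy(mean_label)

def expected_loss(theta):
    return sum(p[i] * logistic_loss(theta, labels[i]) for i in range(3))

risk_at_log_p = expected_loss(np.log(p))

# Expected loss at random scores, for comparison.
rng = np.random.default_rng(0)
risk_elsewhere = min(expected_loss(rng.normal(size=3)) for _ in range(200))
```

In this example the expected loss at `log p` matches the Jensen gap, and none of the randomly drawn score vectors achieves a smaller expected loss.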
From Proposition 21, we know that for all . Therefore, the pointwise Bayes risk coincides precisely with the Bregman information of ,
provided that . A similar relation between Bayes risk and Bregman information exists for proper losses (Reid and Williamson, 2011). We can think of (38) as a measure of the “difficulty” of the task. Combining (36) and (38), we obtain
the pointwise “regret” of w.r.t.