Learning with Fenchel-Young Losses

01/08/2019 ∙ by Mathieu Blondel, et al. ∙ Mathieu Blondel Unbabel Inc. 0

Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses which combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and make it easy to create useful new ones. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing both in theory and practice.

Code Repositories

fenchel-young-losses

Probabilistic classification in PyTorch/TensorFlow/scikit-learn with Fenchel-Young losses



1 Introduction

Loss functions are a cornerstone of statistics and machine learning: They measure the difference, or “loss,” between a ground truth and a prediction. As such, much work has been devoted to designing loss functions for a variety of supervised learning tasks, including regression (Huber, 1964), classification (Crammer and Singer, 2001), ranking (Joachims, 2002) and structured prediction (Lafferty et al., 2001; Collins, 2002; Tsochantaridis et al., 2005), to name only a few well-known directions.

For the case of probabilistic classification, proper composite loss functions (Reid and Williamson, 2010; Williamson et al., 2016) offer a principled framework unifying various existing loss functions. A proper composite loss is the composition of a proper loss between two probability distributions, with an invertible mapping from real vectors to probability distributions. The theoretical properties of proper loss functions, also known as proper scoring rules (Grünwald and Dawid (2004); Gneiting and Raftery (2007); and references therein), such as their Fisher consistency (classification calibration) and correspondence with Bregman divergences, are now well-understood. However, not all existing losses are proper composite loss functions; a notable example is the hinge loss used in support vector machines. In fact, we shall see that any loss function enjoying a separation margin, a prevalent concept in statistical learning theory which has been used to prove the famous perceptron mistake bound (Rosenblatt, 1958) and many other generalization bounds (Vapnik, 1998; Schölkopf and Smola, 2002), cannot be written in composite proper loss form.

At the same time, loss functions are often intimately related to an underlying statistical model and prediction function. For instance, the logistic loss corresponds to the multinomial distribution and the softmax operator, while the conditional random field (CRF) loss (Lafferty et al., 2001) for structured prediction is tied with marginal inference (Wainwright and Jordan, 2008). Both are instances of generalized linear models (Nelder and Baker, 1972; McCullagh and Nelder, 1989), associated with exponential family distributions. More recently, Martins and Astudillo (2016) proposed a new classification loss based on the projection onto the simplex. Unlike the logistic loss, this “sparsemax” loss induces probability distributions with sparse support, which is desirable in some applications for interpretability or computational efficiency reasons. However, the sparsemax loss was derived in a relatively ad-hoc manner and it is still relatively poorly understood. Is it one of a kind or can we generalize it in a principled manner? Thorough understanding of the core principles underpinning existing losses and their associated predictive model, potentially enabling the creation of useful new losses, is one of the main quests of this paper.

This paper.

This paper starts from the notions of output regularization and regularized prediction functions, which we use to provide a variational perspective on many existing prediction functions, including the aforementioned softmax, sparsemax and marginal inference. Based on simple convex duality arguments, we then introduce Fenchel-Young losses, a new way to automatically construct a loss function associated with any regularized prediction function. As we shall see, our proposal recovers many existing loss functions, which is in a sense surprising since many of these losses were originally proposed by independent efforts. Our framework goes beyond the simple probabilistic classification setting: We show how to create loss functions over various structured domains, including convex polytopes and convex cones. Our framework encourages the loss designer to think geometrically about the outputs desired for the task at hand. Once a (regularized) prediction function has been designed, our framework generates a corresponding loss function automatically. We will demonstrate the ease of creating loss functions, including useful new ones, using abundant examples throughout this paper.

Organization and contributions.

The rest of this paper is organized as follows.

  • We introduce regularized prediction functions, unifying and generalizing softmax, sparsemax and marginal inference, among many others.

  • We introduce Fenchel-Young losses to learn models whose output layer is a regularized prediction function. We provide an in-depth study of their properties, showing that they unify many existing losses, including unstructured and structured losses.

  • We study Fenchel-Young losses for probabilistic classification. We show how to seamlessly create entire new families of losses from generalized entropies.

  • We characterize which entropies yield sparse distributions and losses with a separation margin, notions we prove to be intimately connected. Furthermore, we show that losses that enjoy a margin and induce a sparse distribution are precisely the ones that cannot be written in proper composite loss form.

  • We study Fenchel-Young losses for positive measures (unnormalized probability distributions), providing a new perspective on one-vs-all loss reductions.

  • We study Fenchel-Young losses for structured prediction. We introduce the concepts of probability distribution and mean regularizations, providing a unifying perspective on a number of structured losses, including the conditional random field (CRF) loss and the recently-proposed SparseMAP loss (Niculae et al., 2018). We illustrate these results by deriving losses over various convex polytopes, including ranking losses.

  • We present primal and dual algorithms for learning models with Fenchel-Young losses defined over arbitrary domains. We derive new efficient algorithms to compute regularized prediction functions and proximity operators, which are a core sub-routine for dual training algorithms.

  • We demonstrate the ability of Fenchel-Young losses to induce sparse distributions on two tasks: label proportion estimation and dependency parsing.

  • Finally, we review related work on proper (composite) losses and other losses proposed in the literature, Fenchel duality, and approximate inference.

Previous papers.

This paper builds upon two previously published shorter conference papers. The first (Niculae et al., 2018) introduced Fenchel-Young losses in the structured prediction setting but only provided a limited analysis of their properties. The second (Blondel et al., 2019) provided a more in-depth analysis but focused on unstructured probabilistic classification. This paper provides a comprehensive study of Fenchel-Young losses across various domains. Besides a much more thorough treatment of previously covered topics, this paper contributes entirely new sections, including §6 on losses for positive measures, §8 on primal and dual training algorithms, and §A.2 on loss “Fenchel-Youngization”. We provide in §7 a new unifying view of structured prediction losses, and discuss at length various convex polytopes, promoting a geometric approach to structured prediction loss design; we also provide novel results in this section regarding structured separation margins (Proposition 7.4), proving the unit margin of the SparseMAP loss. We demonstrate how to use our framework to create useful new losses, including ranking losses, not covered in the previous two papers.

Notation.

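We denote the $d$-dimensional probability simplex by $\triangle^d := \{p \in \mathbb{R}^d_+ : \|p\|_1 = 1\}$. We denote the convex hull of a set $\mathcal{Y}$ by $\mathrm{conv}(\mathcal{Y})$ and the conic hull by $\mathrm{cone}(\mathcal{Y})$. We denote the domain of a function $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ by $\mathrm{dom}(\Omega) := \{\mu \in \mathbb{R}^d : \Omega(\mu) < \infty\}$. We denote the Fenchel conjugate of $\Omega$ by $\Omega^*(\theta) := \sup_{\mu \in \mathrm{dom}(\Omega)} \langle \theta, \mu \rangle - \Omega(\mu)$. We denote the indicator function of a set $\mathcal{C}$ by

$$I_{\mathcal{C}}(\mu) := \begin{cases} 0 & \text{if } \mu \in \mathcal{C} \\ \infty & \text{otherwise} \end{cases} \tag{1}$$

and its support function by $\sigma_{\mathcal{C}}(\theta) := \sup_{\mu \in \mathcal{C}} \langle \theta, \mu \rangle$. We define the proximity operator (a.k.a. proximal operator) of $\Omega$ by

$$\mathrm{prox}_\Omega(\eta) := \arg\min_{\mu \in \mathrm{dom}(\Omega)} \frac{1}{2} \|\mu - \eta\|^2 + \Omega(\mu). \tag{2}$$

We denote the interior and relative interior of $\mathcal{C}$ by $\mathrm{int}(\mathcal{C})$ and $\mathrm{relint}(\mathcal{C})$, respectively. We denote $[x]_+ := \max(x, 0)$, evaluated element-wise.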

2 Regularized prediction functions

In this section, we introduce the concept of regularized prediction function (§2.1), which is central to this paper. We then give simple and well-known examples of such functions (§2.2) and discuss their properties in a general setting (§2.3 and §2.4).

2.1 Definition

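We consider a general predictive setting with input variables $x \in \mathcal{X}$, and a parametrized model $f_W\colon \mathcal{X} \to \mathbb{R}^d$ (which could be a linear model or a neural network), producing a score vector $\theta := f_W(x) \in \mathbb{R}^d$. In a simple multi-class classification setting, the score vector is typically used to pick the highest-scoring class among $d$ possible ones

$$\hat{y}(\theta) \in \arg\max_{j \in [d]} \theta_j. \tag{3}$$

This can be generalized to an arbitrary output space $\mathcal{Y} \subseteq \mathbb{R}^d$ by using instead

$$\hat{y}(\theta) \in \arg\max_{y \in \mathcal{Y}} \langle \theta, y \rangle, \tag{4}$$

where $\langle \theta, y \rangle$ intuitively captures the affinity between $x$ (since $\theta$ is produced by $f_W(x)$) and $y$. Therefore, (4) seeks the output $y$ with greatest affinity with $x$. The support function $\sigma_{\mathcal{Y}}(\theta) = \max_{y \in \mathcal{Y}} \langle \theta, y \rangle$ can be interpreted as the largest projection of any element of $\mathcal{Y}$ onto the line generated by $\theta$.

Clearly, (4) recovers (3) with $\mathcal{Y} = \{e_1, \dots, e_d\}$, where $e_j$ is a standard basis vector, $e_j := [0, \dots, 0, 1, 0, \dots, 0]$ with the one in position $j$. In this case, the cardinality $|\mathcal{Y}|$ and the dimensionality $d$ coincide, but this need not be the case in general. Eq. (4) is often called a linear maximization oracle or maximum a-posteriori (MAP) oracle (Wainwright and Jordan, 2008). The latter name comes from the fact that (4) coincides with the mode of the Gibbs distribution defined by

$$p(y; \theta) \propto \exp(\langle \theta, y \rangle). \tag{5}$$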
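The oracle (4) and the Gibbs distribution (5) can be sketched for a small finite output set. This is a minimal illustration, not the paper's implementation: the helper names `map_oracle` and `gibbs` and the example scores are assumptions, and the output set is enumerated explicitly, which is only feasible when it is small.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))


def map_oracle(theta: Sequence[float], outputs: List[Vector]) -> Vector:
    """Return an element of `outputs` maximizing <theta, y>, as in (4)."""
    return max(outputs, key=lambda y: dot(theta, y))


def gibbs(theta: Sequence[float], outputs: List[Vector]) -> List[float]:
    """Probabilities proportional to exp(<theta, y>), as in (5)."""
    scores = [dot(theta, y) for y in outputs]
    m = max(scores)  # shift scores for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


# Illustrative scores (hypothetical values).
theta = [0.1, 1.5, -0.3]

# Multi-class case: the outputs are the standard basis vectors, so (4) reduces to (3).
one_hot = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
y_hat = map_oracle(theta, one_hot)

# The mode of the Gibbs distribution coincides with the MAP output.
probs = gibbs(theta, one_hot)
mode = one_hot[probs.index(max(probs))]

# A structured case: binary vectors that are not one-hot.
pairs = [(1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0)]
best_pair = map_oracle(theta, pairs)
```

For structured output sets, enumeration is replaced in practice by a combinatorial solver; the sketch only makes the definition concrete.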

Prediction over convex hulls.

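We now extend the prediction function (4) by replacing $\mathcal{Y}$ with its convex hull $\mathrm{conv}(\mathcal{Y}) := \{\mathbb{E}_p[Y] : p \in \triangle^{|\mathcal{Y}|}\}$ and introducing a regularization function $\Omega$ into the optimization problem:

$$\hat{y}_\Omega(\theta) \in \arg\max_{\mu \in \mathrm{conv}(\mathcal{Y})} \langle \theta, \mu \rangle - \Omega(\mu). \tag{6}$$

We emphasize that the regularization is w.r.t. predictions (outputs) and not w.r.t. model parameters (denoted by $W$ in this paper), as is usually the case in the literature. We illustrate the regularized prediction function pipeline in Figure 1.

Unsurprisingly, the choice $\Omega = 0$ recovers the unregularized prediction function (4). This follows from the fundamental theorem of linear programming (Dantzig et al., 1955, Theorem 6), which states that the maximum of a linear form over a convex polytope is always achieved at one of its vertices:

$$\max_{\mu \in \mathrm{conv}(\mathcal{Y})} \langle \theta, \mu \rangle = \max_{y \in \mathcal{Y}} \langle \theta, y \rangle. \tag{7}$$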

Why regularize outputs?

The regularized prediction function (8) casts computing a prediction as a variational problem. It involves an optimization problem that balances between two terms: an “affinity” term $\langle \theta, \mu \rangle$, and a “confidence” term $\Omega(\mu)$ which should be low if $\mu$ is “uncertain.” Two important classes of convex $\Omega$ are (squared) norms and, when $\mathrm{dom}(\Omega)$ is the probability simplex, generalized negative entropies. However, our framework does not require $\Omega$ to be convex in general.

Introducing $\Omega(\mu)$ in (8) tends to move the prediction away from the vertices of $\mathrm{conv}(\mathcal{Y})$. Unless the regularization term $\Omega(\mu)$ is negligible compared to the affinity term $\langle \theta, \mu \rangle$, a prediction becomes a convex combination of several vertices. As we shall see in §7, we can interpret this prediction as the mean under some underlying distribution. This contrasts with (4), which always outputs the most likely vertex, i.e., the mode.


Figure 1: Illustration of the proposed regularized prediction framework over a convex hull $\mathrm{conv}(\mathcal{Y})$. A parametrized model $f_W\colon \mathcal{X} \to \mathbb{R}^d$ (linear model, neural network, etc.) produces a score vector $\theta \in \mathbb{R}^d$. The regularized prediction function $\hat{y}_\Omega$ produces a prediction belonging to $\mathrm{conv}(\mathcal{Y})$. Regularized prediction functions are not limited to convex hulls and can be defined over arbitrary domains (Def. 2.1).

Prediction over arbitrary domains.

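Regularized prediction functions are in fact not limited to convex hulls. We now state their precise definition in complete generality.

Definition 2.1 (Prediction function regularized by $\Omega$). Let $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function, with $\mathrm{dom}(\Omega) \subseteq \mathbb{R}^d$. The prediction function regularized by $\Omega$ is defined by

$$\hat{y}_\Omega(\theta) \in \arg\max_{\mu \in \mathrm{dom}(\Omega)} \langle \theta, \mu \rangle - \Omega(\mu). \tag{8}$$

Allowing extended-real $\Omega$ permits general domain constraints in (8) via indicator functions. For instance, choosing $\Omega = I_{\mathrm{conv}(\mathcal{Y})}$, where $I_{\mathcal{C}}$ is the indicator function defined in (1), recovers the MAP oracle (4). Importantly, the choice of domain $\mathrm{dom}(\Omega)$ is not limited to convex hulls. For instance, we will also consider conic hulls, $\mathrm{cone}(\mathcal{Y})$, later in this paper.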

Choosing $\Omega$.

Regularized prediction functions $\hat{y}_\Omega$ involve two main design choices: the domain $\mathrm{dom}(\Omega)$ over which $\Omega$ is defined and $\Omega$ itself. The choice of $\mathrm{dom}(\Omega)$ is mainly dictated by the type of output we want from $\hat{y}_\Omega$, such as $\mathrm{dom}(\Omega) = \mathrm{conv}(\mathcal{Y})$ for convex combinations of elements of $\mathcal{Y}$, and $\mathrm{dom}(\Omega) = \mathrm{cone}(\mathcal{Y})$ for conic combinations. The choice of the regularization $\Omega$ itself further governs certain properties of $\hat{y}_\Omega$, including, as we shall see in the sequel, its sparsity or its use of prior knowledge regarding the importance or misclassification cost of certain outputs. The choices of $\mathrm{dom}(\Omega)$ and $\Omega$ may also be constrained by computational considerations. Indeed, while computing $\hat{y}_\Omega(\theta)$ involves a potentially challenging constrained maximization problem in general, we will see that certain choices of $\Omega$ lead to closed-form expressions. The power of our framework is that the user can focus solely on designing and computing $\hat{y}_\Omega$: We will see in §3 how to automatically construct a loss function associated with $\hat{y}_\Omega$.

2.2 Examples

To illustrate regularized prediction functions, we give several concrete examples enjoying a closed-form expression.

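When $\Omega = I_{\triangle^d}$, $\hat{y}_\Omega(\theta)$ is a one-hot representation of the argmax prediction

$$\hat{y}_\Omega(\theta) \in \arg\max_{p \in \triangle^d} \langle \theta, p \rangle. \tag{9}$$

We can see that output as a probability distribution that assigns all probability mass to the same class. When $\Omega = -H^s + I_{\triangle^d}$, where $H^s(p) := -\sum_i p_i \log p_i$ is Shannon’s entropy, $\hat{y}_\Omega(\theta)$ is the well-known softmax

$$\hat{y}_\Omega(\theta) = \mathrm{softmax}(\theta) := \frac{\exp(\theta)}{\sum_{j=1}^d \exp(\theta_j)}. \tag{10}$$

See Boyd and Vandenberghe (2004, Ex. 3.25) for a derivation. The resulting distribution always has dense support. When $\Omega = \frac{1}{2}\|\cdot\|^2 + I_{\triangle^d}$, $\hat{y}_\Omega$ is the Euclidean projection onto the probability simplex

$$\hat{y}_\Omega(\theta) = \mathrm{sparsemax}(\theta) := \arg\min_{p \in \triangle^d} \|p - \theta\|^2, \tag{11}$$

a.k.a. the sparsemax transformation (Martins and Astudillo, 2016). It is well-known that

$$\arg\min_{p \in \triangle^d} \|p - \theta\|^2 = [\theta - \tau \mathbf{1}]_+, \tag{12}$$

for some threshold $\tau \in \mathbb{R}$. Hence, the predicted distribution can have sparse support (it may assign exactly zero probability to low-scoring classes). The threshold $\tau$ can be computed exactly in $O(d)$ time (Brucker, 1984; Duchi et al., 2008; Condat, 2016).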
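The softmax and sparsemax maps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: it uses the simpler sort-based threshold search, which runs in O(d log d), rather than the linear-time algorithms cited above, and the function names and example scores are hypothetical.

```python
import math
from typing import List


def softmax(theta: List[float]) -> List[float]:
    """Prediction regularized by the negative Shannon entropy on the simplex."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def sparsemax(theta: List[float]) -> List[float]:
    """Euclidean projection of theta onto the probability simplex.

    Sort-based variant: find the threshold tau such that
    sum_j max(theta_j - tau, 0) = 1, then clip.
    """
    z = sorted(theta, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score stays in the support iff 1 + k * z_k > cumsum.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(t - tau, 0.0) for t in theta]


theta = [0.5, 0.3, -1.0]  # hypothetical scores
p_soft = softmax(theta)      # dense: every class gets positive mass
p_sparse = sparsemax(theta)  # sparse: the lowest-scoring class gets exactly zero
```

On this input the sparsemax output puts zero mass on the third class, while the softmax output stays strictly positive everywhere, matching the dense versus sparse support discussed above.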

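The regularized prediction function paradigm is, however, not limited to the probability simplex: When $\Omega(p) = -\sum_i H^s([p_i, 1 - p_i]) + I_{[0,1]^d}(p)$, we get

$$\hat{y}_\Omega(\theta) = \mathrm{sigmoid}(\theta) := \frac{1}{1 + \exp(-\theta)}, \tag{13}$$

i.e., the sigmoid function evaluated coordinate-wise. We can think of its output as a positive measure (unnormalized probability distribution).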

We will see in §4 that the first three examples (argmax, softmax and sparsemax) are particular instances of a broader family of prediction functions, using the notion of generalized entropy. The last example is a special case of regularized prediction function over positive measures, developed in §6. Regularized prediction functions also encompass more complex convex polytopes for structured prediction, as we shall see in §7.

Figure 2: Examples of regularized prediction functions. The unregularized prediction (argmax) always hits a vertex of the probability simplex, leading to a probability distribution that puts all probability mass on the same class. Unlike softmax, which always occurs in the relative interior of the simplex and thus leads to a dense distribution, sparsemax (Euclidean projection onto the probability simplex) may hit the boundary, leading to a sparse probability distribution. We also display the sigmoid operator, which lies in the unit cube and is thus not guaranteed to output a valid probability distribution.

2.3 Gradient mapping and dual objective

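From Danskin’s theorem (Danskin, 1966) (see also Bertsekas (1999, Proposition B.25)), $\hat{y}_\Omega(\theta)$ is a subgradient of $\Omega^*$ at $\theta$, i.e., $\hat{y}_\Omega(\theta) \in \partial \Omega^*(\theta)$. If, furthermore, $\Omega$ is strictly convex, then $\hat{y}_\Omega(\theta)$ is the gradient of $\Omega^*$ at $\theta$, i.e., $\hat{y}_\Omega(\theta) = \nabla \Omega^*(\theta)$. This interpretation of $\hat{y}_\Omega(\theta)$ as a (sub)gradient mapping will play a crucial role in the next section for deriving a loss function associated with $\hat{y}_\Omega$.

Viewing $\hat{y}_\Omega(\theta)$ as the (sub)gradient of $\Omega^*(\theta)$ is also useful to derive the dual of (8). Let $\Omega := \Psi + \Phi$. It is well-known (Borwein and Lewis, 2010; Beck and Teboulle, 2012) that

$$\Omega^*(\theta) = (\Psi + \Phi)^*(\theta) = \inf_{u \in \mathbb{R}^d} \Phi^*(u) + \Psi^*(\theta - u) =: (\Phi^* \,\square\, \Psi^*)(\theta), \tag{14}$$

where $f \,\square\, g$ denotes the infimal convolution of $f$ with $g$. Furthermore, from Danskin’s theorem, $\hat{y}_\Omega(\theta) = \nabla \Psi^*(\theta - u^\star)$, where $u^\star$ denotes an optimal solution of the infimum in (14). We can think of that infimum as the dual of the optimization problem in (8). When $\Psi = \frac{1}{2}\|\cdot\|^2$, $\Omega^*$ is known as the Moreau envelope of $\Phi^*$ (Moreau, 1965) and using Moreau’s decomposition, we obtain $\hat{y}_\Omega(\theta) = \theta - \mathrm{prox}_{\Phi^*}(\theta) = \mathrm{prox}_\Phi(\theta)$. As another example, when $\Omega = \Psi + I_{\mathcal{C}}$, we obtain

$$\Omega^*(\theta) = \inf_{u \in \mathbb{R}^d} \sigma_{\mathcal{C}}(u) + \Psi^*(\theta - u), \tag{15}$$

where we used $I_{\mathcal{C}}^* = \sigma_{\mathcal{C}}$, the support function of $\mathcal{C}$. In particular, when $\mathcal{C} = \mathrm{conv}(\mathcal{Y})$, we have $\sigma_{\mathcal{C}}(u) = \max_{y \in \mathcal{Y}} \langle u, y \rangle$. This dual view is informative insofar as it suggests that regularized prediction functions $\hat{y}_\Omega(\theta)$ with $\Omega = \Psi + I_{\mathcal{C}}$ trade off the value $\sigma_{\mathcal{C}}(u)$ achieved by the unregularized prediction function $\hat{y}(u)$ against a proximity term $\Psi^*(\theta - u)$.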

2.4 Properties

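We now discuss simple yet useful properties of regularized prediction functions. The first two assume that $\Omega$ is a symmetric function, i.e., that it satisfies

$$\Omega(\mu) = \Omega(P\mu) \quad \forall \mu \in \mathrm{dom}(\Omega), \ \forall P \in \mathcal{P}, \tag{16}$$

where $\mathcal{P}$ is the set of $d \times d$ permutation matrices.

Properties of regularized prediction functions

  1. Effect of a permutation. If $\Omega$ is symmetric, then $\forall P \in \mathcal{P}$: $\hat{y}_\Omega(P\theta) = P \hat{y}_\Omega(\theta)$.

  2. Order preservation. Let $\mu = \hat{y}_\Omega(\theta)$. If $\Omega$ is symmetric, then the coordinates of $\mu$ and $\theta$ are sorted the same way, i.e., $\theta_i > \theta_j \Rightarrow \mu_i \ge \mu_j$ and $\mu_i > \mu_j \Rightarrow \theta_i > \theta_j$.

  3. Approximation error. If $\Omega$ is $\gamma$-strongly convex and bounded with $L \le \Omega(\mu) \le U$ for all $\mu \in \mathrm{dom}(\Omega)$, then $\frac{1}{2}\|\hat{y}(\theta) - \hat{y}_\Omega(\theta)\|^2 \le \frac{U - L}{\gamma}$.

  4. Temperature scaling. For any constant $t > 0$, $\hat{y}_{t\Omega}(\theta) \in \partial \Omega^*(\theta / t)$. If $\Omega$ is strictly convex, $\hat{y}_{t\Omega}(\theta) = \hat{y}_\Omega(\theta / t) = \nabla \Omega^*(\theta / t)$.

  5. Constant invariance. For any constant $c \in \mathbb{R}$, $\hat{y}_{\Omega + c}(\theta) = \hat{y}_\Omega(\theta)$.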

The proof is given in Appendix B.1.

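For classification, the order-preservation property ensures that the highest-scoring classes according to $\theta$ and $\hat{y}_\Omega(\theta)$ agree with each other:

$$\arg\max_{i \in [d]} \theta_i = \arg\max_{i \in [d]} \big(\hat{y}_\Omega(\theta)\big)_i. \tag{17}$$

Temperature scaling is useful to control how close we are to unregularized prediction functions. Clearly, $\hat{y}_{t\Omega}(\theta) \to \hat{y}(\theta)$ as $t \to 0$, where $\hat{y}(\theta)$ is defined in (4).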
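Temperature scaling and order preservation can be observed numerically. The sketch below assumes the regularizer is the negative Shannon entropy restricted to the simplex, so that the prediction regularized by t times that regularizer is the softmax of the scores divided by t; the function name and the scores are illustrative.

```python
import math
from typing import List


def softmax_with_temperature(theta: List[float], t: float) -> List[float]:
    """Softmax of theta / t, i.e. the prediction regularized by t * (negative entropy)."""
    scaled = [x / t for x in theta]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def ranking(values: List[float]) -> List[int]:
    """Indices of `values` sorted from largest to smallest."""
    return sorted(range(len(values)), key=lambda i: -values[i])


theta = [1.0, 2.0, 0.5]  # hypothetical scores

p_hot = softmax_with_temperature(theta, 10.0)   # strong regularization: near uniform
p_one = softmax_with_temperature(theta, 1.0)    # plain softmax
p_cold = softmax_with_temperature(theta, 0.01)  # weak regularization: near one-hot

# Order preservation: the prediction ranks the classes like the scores do.
same_order = ranking(p_one) == ranking(theta)
```

As t shrinks, the mass concentrates on the highest-scoring class, which is the sense in which the regularized prediction approaches the unregularized one.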

3 Fenchel-Young losses

In the previous section, we introduced regularized prediction functions over arbitrary domains, as a generalization of classical (unregularized) decision functions. In this section, we introduce Fenchel-Young losses for learning models whose output layer is a regularized prediction function. We first give their definitions and state their properties (§3.1). We then discuss their relationship with Bregman divergences (§3.2) and their Bayes risk (§3.3). Finally, we show how to construct a cost-sensitive loss from any Fenchel-Young loss (§3.4).

3.1 Definition and properties

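Given a regularized prediction function $\hat{y}_\Omega$, we define its associated loss as follows.

Definition (Fenchel-Young loss generated by $\Omega$). Let $\Omega\colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function such that the maximum in (8) is achieved for all $\theta \in \mathbb{R}^d$. Let $y \in \mathcal{Y} \subseteq \mathrm{dom}(\Omega)$ be a ground-truth label and $\theta \in \mathrm{dom}(\Omega^*) = \mathbb{R}^d$ be a vector of prediction scores.

The Fenchel-Young loss $L_\Omega\colon \mathrm{dom}(\Omega^*) \times \mathrm{dom}(\Omega) \to \mathbb{R}_+$ generated by $\Omega$ is

$$L_\Omega(\theta; y) := \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle. \tag{18}$$

It is easy to see that Fenchel-Young losses can be rewritten as

$$L_\Omega(\theta; y) = f_\theta(y) - f_\theta(\hat{y}_\Omega(\theta)), \tag{19}$$

where $f_\theta(\mu) := \Omega(\mu) - \langle \theta, \mu \rangle$, highlighting the relation with regularized prediction functions. Therefore, as long as we can compute a regularized prediction function $\hat{y}_\Omega(\theta)$, we can automatically obtain an associated Fenchel-Young loss $L_\Omega(\theta; y)$. Conversely, we also have that $\hat{y}_\Omega$ outputs the prediction minimizing the loss:

$$\hat{y}_\Omega(\theta) \in \arg\min_{\mu \in \mathrm{dom}(\Omega)} L_\Omega(\theta; \mu). \tag{20}$$

Examples of existing losses that fall into the Fenchel-Young loss family are given in Table 1. Some of these examples will be discussed in more detail in the sequel of this paper.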
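The rewriting of the loss as a difference of two evaluations of the same objective, once at the label and once at the regularized prediction, suggests a generic recipe: given a regularizer and its prediction function, the loss follows mechanically. The sketch below is a minimal illustration under that reading; the function names are hypothetical, and the two regularizers used are the negative Shannon entropy on the simplex (logistic loss) and half the squared norm on the whole space (squared loss).

```python
import math
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(theta: Vector) -> Vector:
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def neg_shannon_entropy(mu: Vector) -> float:
    """sum_i mu_i log mu_i on the simplex, with the convention 0 log 0 = 0."""
    return sum(m * math.log(m) for m in mu if m > 0.0)


def half_squared_norm(mu: Vector) -> float:
    return 0.5 * dot(mu, mu)


def fenchel_young_loss(
    theta: Vector,
    y: Vector,
    omega: Callable[[Vector], float],
    predict: Callable[[Vector], Vector],
) -> float:
    """f(y) - f(predict(theta)) with f(mu) = omega(mu) - <theta, mu>."""

    def f(mu: Vector) -> float:
        return omega(mu) - dot(theta, mu)

    return f(y) - f(predict(theta))


theta = [1.0, 2.0, 0.5]  # hypothetical scores
y = [1.0, 0.0, 0.0]      # one-hot label for the first class

# Logistic loss: should equal log-sum-exp(theta) minus the score of the true class.
logistic = fenchel_young_loss(theta, y, neg_shannon_entropy, softmax)
log_sum_exp = math.log(sum(math.exp(t) for t in theta))

# Squared loss: the regularized prediction is the identity map.
squared = fenchel_young_loss(theta, y, half_squared_norm, lambda t: list(t))
```

`predict` must return a maximizer of the regularized problem for `omega`; pairing a regularizer with the wrong prediction function would silently produce a different quantity.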

Loss
Squared
Perceptron
Logistic
Hinge
Sparsemax
Logistic (one-vs-all)
Structured perceptron
Structured hinge
CRF  
SparseMAP
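Table 1: Examples of regularized prediction functions and their corresponding Fenchel-Young losses. For multi-class classification, we assume $\mathcal{Y} = \{e_i\}_{i=1}^d$ and the ground-truth is $y = e_k$, where $e_i$ denotes a standard basis (“one-hot”) vector. For structured classification, we assume that elements of $\mathcal{Y}$ are $d$-dimensional binary vectors with $d \ll |\mathcal{Y}|$, and we denote by $\mathrm{conv}(\mathcal{Y}) = \{\mathbb{E}_p[Y] : p \in \triangle^{|\mathcal{Y}|}\}$ the corresponding marginal polytope (Wainwright and Jordan, 2008). We denote by $H^s(p) := -\sum_i p_i \log p_i$ the Shannon entropy of a distribution $p \in \triangle^{|\mathcal{Y}|}$.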

Properties.

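As the name indicates, this family of loss functions is grounded in the Fenchel-Young inequality (Borwein and Lewis, 2010, Proposition 3.3.4)

$$\Omega^*(\theta) + \Omega(\mu) \ge \langle \theta, \mu \rangle \quad \forall \theta \in \mathrm{dom}(\Omega^*), \ \mu \in \mathrm{dom}(\Omega). \tag{21}$$

The inequality, together with well-known results regarding convex conjugates, implies the following properties of Fenchel-Young losses.

Properties of Fenchel-Young losses

  1. Non-negativity. $L_\Omega(\theta; y) \ge 0$ for any $\theta \in \mathrm{dom}(\Omega^*) = \mathbb{R}^d$ and $y \in \mathcal{Y} \subseteq \mathrm{dom}(\Omega)$.

  2. Zero loss. If $\Omega$ is a lower semi-continuous proper convex function, then $\min_\theta L_\Omega(\theta; y) = 0$, and $L_\Omega(\theta; y) = 0 \Leftrightarrow y \in \partial \Omega^*(\theta)$. If $\Omega$ is strictly convex, then $L_\Omega(\theta; y) = 0 \Leftrightarrow y = \hat{y}_\Omega(\theta)$.

  3. Convexity & subgradients. $L_\Omega$ is convex in $\theta$ and the residual vectors are its subgradients: $\hat{y}_\Omega(\theta) - y \in \partial L_\Omega(\theta; y)$.

  4. Differentiability & smoothness. If $\Omega$ is strictly convex, then $L_\Omega$ is differentiable and $\nabla L_\Omega(\theta; y) = \hat{y}_\Omega(\theta) - y$. If $\Omega$ is strongly convex, then $L_\Omega$ is smooth, i.e., $\nabla L_\Omega(\theta; y)$ is Lipschitz continuous.

  5. Temperature scaling. For any constant $t > 0$, $L_{t\Omega}(\theta; y) = t L_\Omega(\theta / t; y)$.

  6. Constant invariance. For any constant $c \in \mathbb{R}$, $L_{\Omega + c}(\theta; y) = L_\Omega(\theta; y)$.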

Remarkably, the non-negativity, convexity and constant invariance properties hold even if $\Omega$ is not convex. The zero loss property follows from the fact that, if $\Omega$ is l.s.c. proper convex, then (21) becomes an equality (i.e., the duality gap is zero) if and only if $\theta \in \partial \Omega(\mu)$. It suggests that the minimization of Fenchel-Young losses attempts to adjust the model to produce predictions $\hat{y}_\Omega(\theta)$ that are close to the target $y$, reducing the duality gap. This is illustrated with $\Omega = \frac{1}{2}\|\cdot\|^2$ and $\mathrm{dom}(\Omega) = \mathbb{R}^d$ (leading to the squared loss) in Figure 3.
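As a worked instance, assume the regularizer is half the squared Euclidean norm on the whole space (the squared-loss case mentioned above). The conjugate, the prediction and the loss then have closed forms, and the zero-loss and gradient properties can be read off directly:

$$
\begin{aligned}
\Omega(\mu) &= \tfrac{1}{2}\|\mu\|^2, \qquad \mathrm{dom}(\Omega) = \mathbb{R}^d, \\
\Omega^*(\theta) &= \sup_{\mu \in \mathbb{R}^d} \langle \theta, \mu \rangle - \tfrac{1}{2}\|\mu\|^2 = \tfrac{1}{2}\|\theta\|^2, \qquad \text{attained at } \mu = \hat{y}_\Omega(\theta) = \theta, \\
L_\Omega(\theta; y) &= \tfrac{1}{2}\|\theta\|^2 + \tfrac{1}{2}\|y\|^2 - \langle \theta, y \rangle = \tfrac{1}{2}\|y - \theta\|^2, \\
\nabla_\theta L_\Omega(\theta; y) &= \theta - y = \hat{y}_\Omega(\theta) - y .
\end{aligned}
$$

The loss is zero exactly when $\theta = y$, i.e., when the prediction $\hat{y}_\Omega(\theta)$ matches the target, and the gradient is the residual between prediction and target.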

Domain of $\Omega^*$.

Our assumption that the maximum in the regularized prediction function (8) is achieved for all $\theta \in \mathbb{R}^d$ implies that $\mathrm{dom}(\Omega^*) = \mathbb{R}^d$. This assumption is quite mild and does not require $\mathrm{dom}(\Omega)$ to be bounded. Minimizing $L_\Omega(\theta; y)$ w.r.t. $\theta$ is therefore an unconstrained convex optimization problem. This contrasts with proper loss functions, which are defined over the probability simplex, as discussed in §10.


Figure 3: Illustration of the Fenchel-Young loss , here with and . Minimizing w.r.t.  can be seen as minimizing the duality gap, the difference between and the tangent , at (the ground truth). The regularized prediction is the value of at which the tangent touches . When is of Legendre type, is equal to the Bregman divergence generated by between and (cf. §3.2). However, we do not require that assumption in this paper.

3.2 Relation with Bregman divergences

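Fenchel-Young losses seamlessly work when $y \in \mathrm{conv}(\mathcal{Y})$ instead of $y \in \mathcal{Y}$. For example, in the case of the logistic loss, where $-\Omega$ is the Shannon entropy restricted to $\triangle^d$, allowing $y \in \triangle^d$ instead of $y \in \{e_i\}_{i=1}^d$ yields the cross-entropy loss, $L_\Omega(\theta; y) = \mathrm{KL}(y \,\|\, \mathrm{softmax}(\theta))$, where $\mathrm{KL}$ denotes the (generalized) Kullback-Leibler divergence

$$\mathrm{KL}(y \,\|\, \mu) := \sum_i y_i \log \frac{y_i}{\mu_i} - \sum_i y_i + \sum_i \mu_i. \tag{22}$$

This can be useful in a multi-label setting with supervision in the form of label proportions.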
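The identity between the logistic Fenchel-Young loss with soft labels and the Kullback-Leibler divergence to the softmax output can be checked numerically. The sketch assumes the regularizer is the negative Shannon entropy on the simplex, whose conjugate is the log-sum-exp function, and uses the generalized divergence with the convention 0 log 0 = 0; the label proportions and scores are hypothetical.

```python
import math
from typing import List


def softmax(theta: List[float]) -> List[float]:
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def logistic_fy_loss(theta: List[float], y: List[float]) -> float:
    """Conjugate(theta) + Omega(y) - <theta, y> for the negative Shannon entropy."""
    m = max(theta)
    log_sum_exp = m + math.log(sum(math.exp(t - m) for t in theta))
    neg_entropy = sum(v * math.log(v) for v in y if v > 0.0)
    return log_sum_exp + neg_entropy - sum(t * v for t, v in zip(theta, y))


def generalized_kl(y: List[float], mu: List[float]) -> float:
    """sum_i y_i log(y_i / mu_i) - sum_i y_i + sum_i mu_i, with 0 log 0 = 0."""
    log_term = sum(v * math.log(v / m) for v, m in zip(y, mu) if v > 0.0)
    return log_term - sum(y) + sum(mu)


theta = [1.0, -0.5, 0.3]      # hypothetical scores
proportions = [0.7, 0.2, 0.1]  # hypothetical label proportions on the simplex

loss = logistic_fy_loss(theta, proportions)
divergence = generalized_kl(proportions, softmax(theta))
```

Both quantities agree up to floating-point error because the label proportions sum to one, which makes the two normalization terms of the generalized divergence cancel.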

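From this example, it is tempting to conjecture that a similar result holds for more general Bregman divergences (Bregman, 1967). Recall that the Bregman divergence $B_\Omega\colon \mathrm{dom}(\Omega) \times \mathrm{relint}(\mathrm{dom}(\Omega)) \to \mathbb{R}_+$ generated by a strictly convex and differentiable $\Omega$ is

$$B_\Omega(y \,\|\, \mu) := \Omega(y) - \Omega(\mu) - \langle \nabla \Omega(\mu), y - \mu \rangle. \tag{23}$$

In other words, this is the difference at $y$ between $\Omega$ and its linearization around $\mu$. It turns out that $L_\Omega(\theta; y)$ is not in general equal to $B_\Omega(y \,\|\, \hat{y}_\Omega(\theta))$. In fact the latter is not necessarily convex in $\theta$ while the former always is. However, there is a duality relationship between Fenchel-Young losses and Bregman divergences, as we now discuss.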

A “mixed-space” Bregman divergence.

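Letting $\theta = \nabla \Omega(\mu)$ (i.e., $(\theta, \mu)$ is a dual pair), we have $\Omega^*(\theta) = \langle \theta, \mu \rangle - \Omega(\mu)$. Substituting in (23), we get $B_\Omega(y \,\|\, \mu) = L_\Omega(\theta; y)$. In other words, Fenchel-Young losses can be viewed as a “mixed-form Bregman divergence” (Amari, 2016, Theorem 1.1) where the argument $\mu$ in (23) is replaced by its dual point $\theta$. This difference is best seen by comparing the function signatures, $L_\Omega\colon \mathrm{dom}(\Omega^*) \times \mathrm{dom}(\Omega) \to \mathbb{R}_+$ vs. $B_\Omega\colon \mathrm{dom}(\Omega) \times \mathrm{relint}(\mathrm{dom}(\Omega)) \to \mathbb{R}_+$. An important consequence is that Fenchel-Young losses do not impose any restriction on their left argument $\theta$: Our assumption that the maximum in the prediction function (8) is achieved for all $\theta \in \mathbb{R}^d$ implies $\mathrm{dom}(\Omega^*) = \mathbb{R}^d$. In contrast, a Bregman divergence would typically need to be composed with a mapping from $\mathbb{R}^d$ to $\mathrm{dom}(\Omega)$, such as $\hat{y}_\Omega$, resulting in a possibly non-convex function.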

Case of Legendre-type functions.

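We can make the relationship with Bregman divergences further precise when $\Omega = \Psi + I_{\mathcal{C}}$, where $\Psi$ is restricted to the class of so-called Legendre-type functions (Rockafellar, 1970; Wainwright and Jordan, 2008). We first recall the definition of this class of functions and then state our results.

Definition (Essentially smooth and Legendre type functions).

A function $\Psi$ is essentially smooth if

  • $\mathrm{int}(\mathrm{dom}(\Psi))$ is non-empty,

  • $\Psi$ is differentiable throughout $\mathrm{int}(\mathrm{dom}(\Psi))$,

  • and $\lim_{i \to \infty} \|\nabla \Psi(\mu^i)\| = +\infty$ for any sequence $\{\mu^i\}$ contained in $\mathrm{int}(\mathrm{dom}(\Psi))$, and converging to a boundary point of $\mathrm{int}(\mathrm{dom}(\Psi))$.

A function $\Psi$ is of Legendre type if

  • it is strictly convex on $\mathrm{int}(\mathrm{dom}(\Psi))$

  • and essentially smooth.

For instance, $\Psi(\mu) = \frac{1}{2}\|\mu\|^2$ is Legendre-type with $\mathrm{dom}(\Psi) = \mathbb{R}^d$, and $\Psi(\mu) = \sum_i \mu_i \log \mu_i$ is Legendre-type with $\mathrm{dom}(\Psi) = \mathbb{R}^d_+$. However, $\Omega(\mu) = \frac{1}{2}\|\mu\|^2 + I_{\mathbb{R}^d_+}(\mu)$ is not Legendre-type, since the gradient of $\Omega$ does not explode everywhere on the boundary of $\mathrm{dom}(\Omega)$. The Legendre-type assumption crucially implies that

$$\nabla \Psi(\nabla \Psi^*(\theta)) = \theta \quad \text{for all } \theta \in \mathrm{int}(\mathrm{dom}(\Psi^*)). \tag{24}$$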

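We can use this fact to derive the following results, proved in Appendix B.2.

Proposition (Relation with Bregman divergences).

Let $\Psi$ be of Legendre type with $\mathrm{dom}(\Psi^*) = \mathbb{R}^d$ and let $\mathcal{C} \subseteq \mathrm{dom}(\Psi)$ be a convex set.
Let $\Omega$ be the restriction of $\Psi$ to $\mathcal{C}$, i.e., $\Omega := \Psi + I_{\mathcal{C}}$.

  1. Bregman projection. The prediction function regularized by $\Omega$, $\hat{y}_\Omega(\theta)$, reduces to the Bregman projection of $\hat{y}_\Psi(\theta)$ onto $\mathcal{C}$:

    $$\hat{y}_\Omega(\theta) = \arg\min_{\mu \in \mathcal{C}} B_\Psi(\mu \,\|\, \hat{y}_\Psi(\theta)). \tag{25}$$

  2. Difference of divergences. For all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$:

    $$L_\Omega(\theta; y) = B_\Psi(y \,\|\, \hat{y}_\Psi(\theta)) - B_\Psi(\hat{y}_\Omega(\theta) \,\|\, \hat{y}_\Psi(\theta)). \tag{26}$$

  3. Bound. For all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$:

    $$0 \le B_\Psi(y \,\|\, \hat{y}_\Omega(\theta)) \le L_\Omega(\theta; y) \tag{27}$$

    with equality when the loss is minimized

    $$\hat{y}_\Omega(\theta) = y \Leftrightarrow L_\Omega(\theta; y) = 0 \Leftrightarrow B_\Psi(y \,\|\, \hat{y}_\Omega(\theta)) = 0. \tag{28}$$

  4. Composite form. When $\mathcal{C} = \mathrm{dom}(\Psi)$, i.e., $\Omega = \Psi$, we have equality for all $\theta \in \mathbb{R}^d$

    $$L_\Omega(\theta; y) = B_\Omega(y \,\|\, \hat{y}_\Omega(\theta)). \tag{29}$$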

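We illustrate these properties using $\Psi = \frac{1}{2}\|\cdot\|^2$ as a running example. From the first property, since $\hat{y}_\Psi(\theta) = \theta$ and $B_\Psi(y \,\|\, \mu) = \frac{1}{2}\|y - \mu\|^2$, we get

$$\hat{y}_\Omega(\theta) = \arg\min_{\mu \in \mathcal{C}} \frac{1}{2}\|\mu - \theta\|^2, \tag{30}$$

recovering the Euclidean projection onto $\mathcal{C}$. The reduction of regularized prediction functions to Bregman projections (when $\Psi$ is of Legendre type) is useful because there exist efficient algorithms for computing the Bregman projection onto various convex sets (Yasutake et al., 2011; Suehiro et al., 2012; Krichene et al., 2015; Lim and Wright, 2016). Therefore, we can use these algorithms to compute $\hat{y}_\Omega(\theta)$ provided that $\hat{y}_\Psi(\theta)$ is available.

From the second property, we obtain for all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$

$$L_\Omega(\theta; y) = \frac{1}{2}\|y - \theta\|^2 - \frac{1}{2}\|\hat{y}_\Omega(\theta) - \theta\|^2. \tag{31}$$

This recovers the expression of the sparsemax loss given in Table 1 with $\mathcal{C} = \triangle^d$.

From the third property, we obtain for all $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$

$$\frac{1}{2}\|y - \hat{y}_\Omega(\theta)\|^2 \le L_\Omega(\theta; y). \tag{32}$$

This shows that $L_\Omega(\theta; y)$ provides a convex upper-bound for the possibly non-convex composite function $B_\Psi(y \,\|\, \hat{y}_\Omega(\theta))$. In particular, when $\mathcal{C} = \triangle^d$, we get $\frac{1}{2}\|y - \mathrm{sparsemax}(\theta)\|^2 \le L_\Omega(\theta; y)$. This suggests that the sparsemax loss is useful for sparse label proportion estimation, as confirmed in our experiments (§9).

Finally, from the last property, if $\Omega = \Psi = \frac{1}{2}\|\cdot\|^2$, we obtain $L_\Omega(\theta; y) = \frac{1}{2}\|y - \theta\|^2$, which is indeed the squared loss given in Table 1.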
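For the running example with the simplex as constraint set, the difference-of-divergences form of the loss (half the squared distance from the label to the scores, minus half the squared distance from the sparsemax output to the scores) and its lower bound (half the squared distance from the label to the sparsemax output) can be checked on a small input. The sketch below is illustrative: the projection uses a sort-based routine, and the scores and label proportions are hypothetical.

```python
from typing import List


def sparsemax(theta: List[float]) -> List[float]:
    """Euclidean projection onto the probability simplex (sort-based)."""
    z = sorted(theta, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:  # the k-th largest score is in the support
            tau = (cumsum - 1.0) / k
    return [max(t - tau, 0.0) for t in theta]


def half_sq_dist(a: List[float], b: List[float]) -> float:
    """Half the squared Euclidean distance between a and b."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))


theta = [0.5, 0.3, -1.0]  # hypothetical scores
y = [0.2, 0.3, 0.5]       # hypothetical label proportions on the simplex

y_hat = sparsemax(theta)

# Sparsemax loss written as a difference of two squared distances.
loss = half_sq_dist(y, theta) - half_sq_dist(y_hat, theta)

# Squared distance between label and prediction, which the loss upper-bounds.
lower_bound = half_sq_dist(y, y_hat)
```

The bound is loose here because the label is far from the projection; the two sides meet when the prediction equals the label, in which case both are zero.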

3.3 Expected loss, Bayes risk and Bregman information

In this section, we discuss the relation between the pointwise Bayes risk (minimal achievable loss) of a Fenchel-Young loss and Bregman information (Banerjee et al., 2005).

Expected loss.

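Let $Y$ be a random variable taking values in $\mathcal{Y}$ following the distribution $p \in \triangle^{|\mathcal{Y}|}$. The expected loss (a.k.a. expected risk) is then

$$
\begin{aligned}
\mathbb{E}_p[L_\Omega(\theta; Y)] &= \mathbb{E}_p[\Omega^*(\theta) + \Omega(Y) - \langle \theta, Y \rangle] && (33) \\
&= \Omega^*(\theta) + \mathbb{E}_p[\Omega(Y)] - \langle \theta, \mathbb{E}_p[Y] \rangle && (34) \\
&= L_\Omega(\theta; \mathbb{E}_p[Y]) + \mathbb{E}_p[\Omega(Y)] - \Omega(\mathbb{E}_p[Y]) && (35) \\
&= L_\Omega(\theta; \mathbb{E}_p[Y]) + I_\Omega(Y; p). && (36)
\end{aligned}
$$

Here, we defined the Bregman information of $Y$ by

$$I_\Omega(Y; p) := \min_{\mu \in \mathrm{dom}(\Omega)} \mathbb{E}_p[B_\Omega(Y \,\|\, \mu)] = \mathbb{E}_p[B_\Omega(Y \,\|\, \mathbb{E}_p[Y])] = \mathbb{E}_p[\Omega(Y)] - \Omega(\mathbb{E}_p[Y]). \tag{37}$$

We refer the reader to Banerjee et al. (2005) for a detailed discussion as to why the last two equalities hold. The r.h.s. is exactly equal to the difference between the two sides of Jensen’s inequality $\mathbb{E}_p[\Omega(Y)] \ge \Omega(\mathbb{E}_p[Y])$ and is therefore non-negative. For this reason, it is sometimes also called Jensen gap (Reid and Williamson, 2011).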
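The Jensen-gap form of the Bregman information, and the way the expected loss splits into a loss at the mean plus that gap, can be checked on a small discrete distribution. The sketch assumes the regularizer is half the squared norm, so that the Fenchel-Young loss is the squared loss and the Bregman information is half the total variance; the outcomes, probabilities and scores are hypothetical.

```python
from typing import List

Vector = List[float]


def half_sq_norm(mu: Vector) -> float:
    return 0.5 * sum(m * m for m in mu)


def squared_loss(theta: Vector, y: Vector) -> float:
    """Fenchel-Young loss generated by half the squared norm."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(theta, y))


def expectation(values: List[Vector], probs: List[float]) -> Vector:
    """Coordinate-wise mean of `values` under `probs`."""
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]


# A hypothetical discrete distribution over three outcomes.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [0.5, 0.3, 0.2]
mean = expectation(values, probs)

# Jensen gap: E[Omega(Y)] - Omega(E[Y]).
bregman_information = (
    sum(p * half_sq_norm(v) for p, v in zip(probs, values)) - half_sq_norm(mean)
)

# Half the total variance, which should coincide with the Jensen gap here.
half_variance = sum(p * squared_loss(mean, v) for p, v in zip(probs, values))

# Expected loss at some scores versus loss at the mean plus the Bregman information.
theta = [0.2, 0.9]
expected_loss = sum(p * squared_loss(theta, v) for p, v in zip(probs, values))
decomposed = squared_loss(theta, mean) + bregman_information
```

Since the Bregman information does not depend on the scores, minimizing the expected loss amounts to minimizing the loss at the mean label.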

Bayes risk.

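From Proposition 21, we know that $\min_\theta L_\Omega(\theta; y) = 0$ for all $y \in \mathrm{dom}(\Omega)$. Therefore, the pointwise Bayes risk coincides precisely with the Bregman information of $Y$,

$$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_p[L_\Omega(\theta; Y)] = I_\Omega(Y; p), \tag{38}$$

provided that $\mathbb{E}_p[Y] \in \mathrm{dom}(\Omega)$. A similar relation between Bayes risk and Bregman information exists for proper losses (Reid and Williamson, 2011). We can think of (38) as a measure of the “difficulty” of the task. Combining (36) and (38), we obtain

$$\mathbb{E}_p[L_\Omega(\theta; Y)] - \min_{\theta' \in \mathbb{R}^d} \mathbb{E}_p[L_\Omega(\theta'; Y)] = L_\Omega(\theta; \mathbb{E}_p[Y]), \tag{39}$$

the pointwise “regret” of $\theta$ w.r.t. $p$.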