Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms

05/24/2018 ∙ by Mathieu Blondel, et al.

We study in this paper Fenchel-Young losses, a generic way to construct convex loss functions from a convex regularizer. We provide an in-depth study of their properties in a broad setting and show that they unify many well-known loss functions. When constructed from a generalized entropy, which includes well-known entropies such as Shannon and Tsallis entropies, we show that Fenchel-Young losses induce a predictive probability distribution and develop an efficient algorithm to compute that distribution for separable entropies. We derive conditions for generalized entropies to yield a distribution with sparse support and losses with a separation margin. Finally, we present both primal and dual algorithms to learn predictive models with generic Fenchel-Young losses.


1 Introduction

Loss functions are a cornerstone of statistics and machine learning: They measure the difference, or "loss," between a ground-truth label and a prediction. Some loss functions, such as the hinge loss of support vector machines, are intimately connected to the notion of separation margin—a prevalent concept in statistical learning theory, which has been used to prove the famous perceptron mistake bound (Rosenblatt, 1958) and many other generalization bounds (Vapnik, 1998; Schölkopf and Smola, 2002). For probabilistic classification, the most popular loss is arguably the (multinomial) logistic loss. It is smooth, enabling fast convergence rates, and the softmax operator provides a consistent mapping to probability distributions. However, the logistic loss does not enjoy a margin, and the generated probability distributions have dense support, which is undesirable in some applications for interpretability or computational efficiency reasons.

To address these shortcomings, Martins and Astudillo (2016) proposed a new loss based on the projection onto the simplex. Unlike the logistic loss, this "sparsemax" loss has a natural separation margin and induces a sparse probability distribution. However, the sparsemax loss was derived in a relatively ad-hoc manner and remains poorly understood. A thorough understanding of the core principles underpinning these losses, one that would enable the creation of new losses combining their strengths, is still lacking.

This paper studies and extends Fenchel-Young (F-Y) losses, recently proposed for structured prediction (Niculae et al., 2018). We show that F-Y losses provide a generic and principled way to construct a loss with an associated probability distribution. We uncover a fundamental connection between generalized entropies, margins, and sparse probability distributions. In sum, we make the following contributions.


  • We introduce regularized prediction functions to generalize the softmax and sparsemax transformations, possibly beyond the probability simplex (§2).

  • We study F-Y losses and their properties, showing that they unify many existing losses, including the hinge, logistic, and sparsemax losses (§3).

  • We then show how to seamlessly create entire new families of losses from generalized entropies. We derive efficient algorithms to compute the associated probability distributions, making such losses appealing both in theory and in practice (§4).

  • We characterize which entropies yield sparse distributions and losses with a separation margin, notions we prove to be intimately connected (§5).

  • Finally, we demonstrate F-Y losses on the task of sparse label proportion estimation (§6).

Notation. We denote the probability simplex by $\triangle^d := \{p \in \mathbb{R}_+^d : \lVert p \rVert_1 = 1\}$, the domain of $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ by $\mathrm{dom}(\Omega) := \{\mu \in \mathbb{R}^d : \Omega(\mu) < \infty\}$, the Fenchel conjugate of $\Omega$ by $\Omega^*(\theta) := \sup_{\mu \in \mathrm{dom}(\Omega)} \langle \theta, \mu \rangle - \Omega(\mu)$, and the indicator function of a set $\mathcal{C}$ by $I_{\mathcal{C}}$ (equal to $0$ on $\mathcal{C}$ and $\infty$ outside).

2 Regularized prediction functions

We consider a general predictive setting with input variables $x \in \mathcal{X}$ and a parametrized model $f_W \colon \mathcal{X} \to \mathbb{R}^d$, producing a score vector $\theta := f_W(x) \in \mathbb{R}^d$. To map $\theta$ to predictions, we introduce regularized prediction functions.

Regularized prediction function

Let $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function, with $\mathrm{dom}(\Omega) \subseteq \mathbb{R}^d$. The prediction function regularized by $\Omega$ is defined by

$\widehat{y}_\Omega(\theta) \in \operatorname*{argmax}_{\mu \in \mathrm{dom}(\Omega)} \; \langle \theta, \mu \rangle - \Omega(\mu).$   (1)

We emphasize that the regularization is w.r.t. the output $\mu$ and not w.r.t. the model parameters $W$, as is usually the case in the literature. The optimization problem in (1) balances between two terms: an "affinity" term $\langle \theta, \mu \rangle$, and a "confidence" term $-\Omega(\mu)$, which should be low if $\mu$ is "uncertain." Two important classes of convex $\Omega$ are (squared) norms and, when $\mathrm{dom}(\Omega)$ is the probability simplex, generalized negative entropies. However, our framework does not require $\Omega$ to be convex in general. Allowing extended-real $\Omega$ further permits general domain constraints in (1) via indicator functions, as we now illustrate.

Examples.

When $\Omega = I_{\triangle^d}$, the indicator function of the simplex, $\widehat{y}_\Omega(\theta)$ is a one-hot representation of the argmax prediction

$\widehat{y}_\Omega(\theta) \in \operatorname*{argmax}_{\mu \in \triangle^d} \; \langle \theta, \mu \rangle.$   (2)

We can see that output as a probability distribution that assigns all probability mass to the argmax class. When $\Omega(\mu) = -H^{\mathrm{S}}(\mu) + I_{\triangle^d}(\mu)$, where $H^{\mathrm{S}}(\mu) := -\sum_j \mu_j \log \mu_j$ is Shannon's entropy, $\widehat{y}_\Omega(\theta)$ is the well-known softmax

$\widehat{y}_\Omega(\theta) = \exp(\theta) \big/ \textstyle\sum_j \exp(\theta_j).$   (3)

See Boyd and Vandenberghe (2004, Ex. 3.25) for a derivation. The resulting distribution always has dense support. When $\Omega(\mu) = \frac{1}{2}\lVert\mu\rVert_2^2 + I_{\triangle^d}(\mu)$, $\widehat{y}_\Omega(\theta)$ is the Euclidean projection onto the probability simplex

$\widehat{y}_\Omega(\theta) = \operatorname*{argmin}_{\mu \in \triangle^d} \; \lVert \mu - \theta \rVert_2^2,$   (4)

a.k.a. the sparsemax transformation (Martins and Astudillo, 2016). The distribution has sparse support (it may assign exactly zero probability to low-scoring classes) and can be computed exactly in $O(d \log d)$ time, or $O(d)$ expected time (Brucker, 1984; Duchi et al., 2008; Condat, 2016). This paradigm is not limited to the probability simplex: when $\Omega(\mu) = \sum_j \mu_j \log \mu_j + (1 - \mu_j)\log(1 - \mu_j) + I_{[0,1]^d}(\mu)$, we get

$\widehat{y}_\Omega(\theta) = \big(1 + \exp(-\theta)\big)^{-1},$   (5)

i.e., the sigmoid function evaluated coordinate-wise. We can think of its output as a positive measure (unnormalized probability distribution).
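
To make these mappings concrete, here is a small NumPy sketch (ours, not the authors' reference implementation) of the softmax, sparsemax, and sigmoid prediction functions; the sparsemax routine uses the standard sort-based simplex projection.

```python
import numpy as np

def softmax(theta):
    """Prediction function for Omega = negative Shannon entropy, Eq. (3)."""
    e = np.exp(theta - theta.max())            # shift for numerical stability
    return e / e.sum()

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex, Eq. (4) (sort-based)."""
    z = np.sort(theta)[::-1]                   # scores sorted in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0                 # coordinates kept in the support
    tau = cssv[support][-1] / k[support][-1]   # threshold
    return np.maximum(theta - tau, 0.0)

def sigmoid(theta):
    """Coordinate-wise prediction for the one-vs-all entropy, Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-theta))

theta = np.array([1.2, 0.3, -0.5])
print(softmax(theta), sparsemax(theta), sigmoid(theta))
```

For this score vector, sparsemax returns [0.95, 0.05, 0], assigning exactly zero probability to the lowest-scoring class, while softmax keeps all three classes in its support.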

Properties.

We now discuss simple properties of regularized prediction functions. The first two assume that $\Omega$ is a symmetric function, i.e., that it satisfies

$\Omega(P\mu) = \Omega(\mu) \quad \forall\, \mu \in \mathrm{dom}(\Omega),\ \forall\, P \in \mathcal{P},$   (6)

where $\mathcal{P}$ is the set of $d \times d$ permutation matrices.

Properties of $\widehat{y}_\Omega$

  1. Effect of a permutation. If $\Omega$ is symmetric, then $\forall P \in \mathcal{P}$: $\widehat{y}_\Omega(P\theta) = P\,\widehat{y}_\Omega(\theta)$.

  2. Order preservation. Let $\mu := \widehat{y}_\Omega(\theta)$. If $\Omega$ is symmetric, then the coordinates of $\mu$ and $\theta$ are sorted the same way, i.e., $\theta_i > \theta_j \Rightarrow \mu_i \ge \mu_j$ and $\mu_i > \mu_j \Rightarrow \theta_i > \theta_j$.

  3. Gradient mapping. $\widehat{y}_\Omega(\theta)$ is a subgradient of $\Omega^*$ at $\theta$, i.e., $\widehat{y}_\Omega(\theta) \in \partial\Omega^*(\theta)$. If $\Omega$ is strictly convex, $\widehat{y}_\Omega$ is the gradient of $\Omega^*$, i.e., $\widehat{y}_\Omega(\theta) = \nabla\Omega^*(\theta)$.

  4. Temperature scaling. For any constant $t > 0$, $\widehat{y}_{t\Omega}(\theta) = \widehat{y}_\Omega(\theta / t)$.

The proof is given in §C.1. For classification, the order-preservation property ensures that the highest-scoring classes according to $\theta$ and to $\widehat{y}_\Omega(\theta)$ agree with each other:

$\operatorname*{argmax}_j\, \theta_j = \operatorname*{argmax}_j\, [\widehat{y}_\Omega(\theta)]_j.$   (7)

Temperature scaling is useful to control how close we are to unregularized prediction functions.

3 Fenchel-Young losses

| Loss | $\mathrm{dom}(\Omega)$ | $\Omega(\mu)$ | $\widehat{y}_\Omega(\theta)$ | $L_\Omega(\theta; y)$ |
|---|---|---|---|---|
| Squared | $\mathbb{R}^d$ | $\frac{1}{2}\lVert\mu\rVert_2^2$ | $\theta$ | $\frac{1}{2}\lVert y - \theta\rVert_2^2$ |
| Perceptron (Rosenblatt, 1958) | $\triangle^d$ | $0$ | $\operatorname{argmax}(\theta)$ | $\max_i \theta_i - \theta_k$ |
| Hinge (Crammer and Singer, 2001) | $\triangle^d$ | $\langle \mu, e_k - \mathbf{1}\rangle$ | $\operatorname{argmax}(\theta + \mathbf{1} - e_k)$ | $\max_i\,(\theta_i + [\![i \ne k]\!]) - \theta_k$ |
| Sparsemax (Martins and Astudillo, 2016) | $\triangle^d$ | $\frac{1}{2}\lVert\mu\rVert_2^2$ | $\operatorname{sparsemax}(\theta)$ | $\frac{1}{2}\lVert e_k - \theta\rVert_2^2 - \frac{1}{2}\lVert\widehat{y}_\Omega(\theta) - \theta\rVert_2^2$ |
| Logistic (multinomial) | $\triangle^d$ | $-H^{\mathrm{S}}(\mu)$ | $\operatorname{softmax}(\theta)$ | $\log\sum_i \exp\theta_i - \theta_k$ |
| Logistic (one-vs-all) | $[0,1]^d$ | $\sum_i \mu_i\log\mu_i + (1-\mu_i)\log(1-\mu_i)$ | $\operatorname{sigmoid}(\theta)$ | $\sum_i \log\big(1 + \exp(-(2(e_k)_i - 1)\,\theta_i)\big)$ |

Table 1: Examples of regularized prediction functions and their associated Fenchel-Young losses. For multi-class classification, we denote the ground truth by $y = e_k$, where $e_k$ denotes a standard basis ("one-hot") vector. We denote by $H^{\mathrm{S}}(\mu)$ the Shannon entropy of a distribution $\mu \in \triangle^d$.

In this section, we introduce Fenchel-Young losses as a natural way to learn models whose output layer is a regularized prediction function.

Fenchel-Young loss generated by $\Omega$

Let $\Omega \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be a regularization function such that the maximum in (1) is achieved for all $\theta \in \mathbb{R}^d$. Let $y \in \mathrm{dom}(\Omega)$ be a ground-truth label and $\theta \in \mathbb{R}^d$ be a vector of prediction scores. The Fenchel-Young loss generated by $\Omega$ is

$L_\Omega(\theta; y) := \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle.$   (8)

Fenchel-Young losses can also be written as $L_\Omega(\theta; y) = f_\theta(y) - f_\theta(\widehat{y}_\Omega(\theta))$, where $f_\theta(\mu) := \Omega(\mu) - \langle \theta, \mu \rangle$, highlighting the relation with the regularized prediction function $\widehat{y}_\Omega$. Therefore, as long as we can compute $\widehat{y}_\Omega(\theta)$, we can evaluate the associated Fenchel-Young loss $L_\Omega(\theta; y)$. Examples of Fenchel-Young losses are given in Table 1. In addition to the aforementioned multinomial logistic and sparsemax losses, we recover the squared, hinge, and one-vs-all logistic losses, for suitable choices of $\Omega$ and $\mathrm{dom}(\Omega)$.
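
As an illustration of this recipe, the following hedged sketch (not the paper's code) evaluates a Fenchel-Young loss from any pair $(\Omega, \widehat{y}_\Omega)$ via the identity $L_\Omega(\theta; y) = f_\theta(y) - f_\theta(\widehat{y}_\Omega(\theta))$; instantiated with the negative Shannon entropy and softmax, it reproduces the multinomial logistic loss.

```python
import numpy as np

def fy_loss(theta, y, omega, predict):
    """Generic Fenchel-Young loss: f_theta(y) - f_theta(y_hat), Eq. (8)."""
    f = lambda mu: omega(mu) - theta @ mu
    return f(y) - f(predict(theta))

# Example: Omega = negative Shannon entropy restricted to the simplex.
def neg_shannon(mu):
    mu = np.clip(mu, 1e-15, 1.0)      # avoid log(0); 0 log 0 = 0
    return np.sum(mu * np.log(mu))

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([1.0, 0.2, -0.3])
y = np.array([0.0, 1.0, 0.0])         # one-hot ground truth e_k with k = 2
print(fy_loss(theta, y, neg_shannon, softmax))       # Fenchel-Young loss value
print(np.log(np.sum(np.exp(theta))) - theta[1])      # logistic loss, same value
```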

Properties.

As the name indicates, this family of loss functions is grounded in the Fenchel-Young inequality (Borwein and Lewis, 2010, Proposition 3.3.4)

$\Omega^*(\theta) + \Omega(\mu) \ge \langle \theta, \mu \rangle \quad \forall\, \theta \in \mathbb{R}^d,\ \mu \in \mathrm{dom}(\Omega).$   (9)

The inequality, together with well-known properties of convex conjugates, implies the following results.

Properties of F-Y losses

  1. Non-negativity. $L_\Omega(\theta; y) \ge 0$ for any $\theta \in \mathbb{R}^d$ and $y \in \mathrm{dom}(\Omega)$.

  2. Zero loss. If $\Omega$ is a lower semi-continuous proper convex function, then $\inf_\theta L_\Omega(\theta; y) = 0$, and $L_\Omega(\theta; y) = 0$ iff $\theta \in \partial\Omega(y)$. If $\Omega$ is strictly convex, then $L_\Omega(\theta; y) = 0$ iff $\widehat{y}_\Omega(\theta) = y$.

  3. Convexity & subgradients. $L_\Omega(\theta; y)$ is convex in $\theta$ and the residual vectors are its subgradients: $\widehat{y}_\Omega(\theta) - y \in \partial_\theta L_\Omega(\theta; y)$.

  4. Differentiability & smoothness. If $\Omega$ is strictly convex, then $L_\Omega(\theta; y)$ is differentiable and $\nabla_\theta L_\Omega(\theta; y) = \widehat{y}_\Omega(\theta) - y$. If $\Omega$ is strongly convex, then $L_\Omega(\theta; y)$ is smooth, i.e., $\nabla_\theta L_\Omega(\theta; y)$ is Lipschitz continuous.

  5. Temperature scaling. For any constant $t > 0$, $L_{t\Omega}(\theta; y) = t\, L_\Omega(\theta / t;\, y)$.

Remarkably, the non-negativity and convexity properties hold even if $\Omega$ is not convex. The zero loss property follows from the fact that, if $\Omega$ is l.s.c. proper convex, then (9) becomes an equality (i.e., the duality gap is zero) if and only if $\theta \in \partial\Omega(\mu)$. It suggests that minimizing a Fenchel-Young loss requires adjusting $\theta$ to produce predictions $\widehat{y}_\Omega(\theta)$ that are close to the target $y$, reducing the duality gap.
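
The residual form of the (sub)gradient can be checked numerically; the snippet below (a sketch of ours, under the same conventions as the previous ones) compares the analytic gradient $\widehat{y}_\Omega(\theta) - y$ of the logistic Fenchel-Young loss with a finite-difference estimate.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def logistic_fy(theta, y):
    # Omega* - <theta, y>; Omega(y) = 0 for one-hot y
    return np.log(np.sum(np.exp(theta))) - theta @ y

theta = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0, 0.0])
analytic = softmax(theta) - y                       # y_hat - y
numeric = np.array([(logistic_fy(theta + e, y) - logistic_fy(theta - e, y)) / (2e-6)
                    for e in np.eye(3) * 1e-6])     # central finite differences
print(np.allclose(analytic, numeric, atol=1e-6))    # True
```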

Relation with Bregman divergences.

Fenchel-Young losses seamlessly work when the ground-truth vectors are label proportions, i.e., $y \in \triangle^d$ instead of $y \in \{e_1, \dots, e_d\}$. For instance, setting $\Omega$ to the Shannon negative entropy restricted to $\triangle^d$ yields the cross-entropy loss, $L_\Omega(\theta; y) = \mathrm{KL}\big(y \,\|\, \operatorname{softmax}(\theta)\big)$, where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. From this example, it is tempting to conjecture that a similar result holds for more general Bregman divergences (Bregman, 1967). Recall that the Bregman divergence generated by a strictly convex and differentiable $\Psi$ is

$D_\Psi(y \,\|\, \mu) := \Psi(y) - \Psi(\mu) - \langle \nabla\Psi(\mu),\, y - \mu \rangle,$   (10)

the difference at $y$ between $\Psi$ and its linearization around $\mu$. It turns out that $L_\Omega(\theta; y)$ is not in general equal to $D_\Psi(y \,\|\, \widehat{y}_\Omega(\theta))$. However, when $\Omega = \Psi + I_{\mathcal{C}}$, where $\Psi$ is a Legendre-type function (Rockafellar, 1970; Wainwright and Jordan, 2008), meaning that it is strictly convex, differentiable and its gradient explodes at the boundary of its domain, we have the following proposition, proved in §C.2. Let $\Omega = \Psi + I_{\mathcal{C}}$, where $\Psi$ is of Legendre type and $\mathcal{C} \subseteq \mathrm{dom}(\Psi)$ is a convex set. Then, for all $y \in \mathcal{C}$ and $\theta \in \mathbb{R}^d$, we have:

$L_\Omega(\theta; y) \ge D_\Psi\big(y \,\|\, \widehat{y}_\Omega(\theta)\big),$   (11)

with equality when $\widehat{y}_\Omega(\theta) = \nabla\Psi^*(\theta)$, i.e., when the unconstrained maximizer already lies in $\mathcal{C}$, in which case the loss is $L_\Omega(\theta; y) = D_\Psi(y \,\|\, \nabla\Psi^*(\theta))$. As an example, applying (11) with $\Psi = \frac{1}{2}\lVert\cdot\rVert_2^2$ and $\mathcal{C} = \triangle^d$, we get that the sparsemax loss is a convex upper bound for the non-convex $\frac{1}{2}\lVert y - \operatorname{sparsemax}(\theta)\rVert_2^2$. This suggests that the sparsemax loss can be useful for sparse label proportion estimation, as we confirm in §6.
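
To make the inequality concrete, here is a short worked derivation of ours for the sparsemax case, with $\Psi = \frac{1}{2}\lVert\cdot\rVert_2^2$ and $\mathcal{C} = \triangle^d$ (writing $\widehat{y} := \widehat{y}_\Omega(\theta)$):

```latex
% Sparsemax instance of (11): \Psi = \tfrac12\|\cdot\|_2^2, \mathcal{C} = \triangle^d.
L_\Omega(\theta; y)
  = \tfrac12\|y-\theta\|_2^2 - \tfrac12\|\widehat{y}-\theta\|_2^2
  = \tfrac12\|y-\widehat{y}\|_2^2 + \langle \theta - \widehat{y},\; \widehat{y} - y\rangle
  \;\ge\; \tfrac12\|y-\widehat{y}\|_2^2 = D_\Psi(y \,\|\, \widehat{y}),
% where the inequality uses the projection optimality condition
% \langle \theta - \widehat{y},\, \mu - \widehat{y}\rangle \le 0 for all \mu \in \triangle^d.
```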

The relation between Fenchel-Young losses and Bregman divergences can be further clarified using duality. Letting $\theta = \nabla\Psi(\mu)$ (i.e., $(\mu, \theta)$ is a dual pair), we have $\Psi(\mu) + \Psi^*(\theta) = \langle \theta, \mu \rangle$. Substituting in (10), we get $D_\Psi(y \,\|\, \mu) = \Psi(y) + \Psi^*(\theta) - \langle \theta, y \rangle = L_\Psi(\theta; y)$. In other words, Fenchel-Young losses can be viewed as a "mixed-form Bregman divergence" (Amari, 2016, Theorem 1.1) where the argument $\mu$ in (10) is replaced by its dual point $\theta = \nabla\Psi(\mu)$. This difference is best seen by comparing the function signatures, $D_\Psi \colon \mathrm{dom}(\Psi) \times \mathrm{dom}(\Psi) \to \mathbb{R}_+$ vs. $L_\Omega \colon \mathbb{R}^d \times \mathrm{dom}(\Omega) \to \mathbb{R}_+$. An important consequence is that Fenchel-Young losses do not impose any restriction on their left argument $\theta$: Our assumption that the maximum in the prediction function (1) is achieved for all $\theta \in \mathbb{R}^d$ implies $\mathrm{dom}(\Omega^*) = \mathbb{R}^d$.

[Figure 1: three panels (binary case) showing generalized entropies (Tsallis entropies for several values of $\alpha$, and a norm entropy), the corresponding regularized prediction functions, and the corresponding Fenchel-Young losses.]

Figure 1: New families of losses made possible by our framework. Left: Tsallis and norm entropies. Center: regularized prediction functions. Right: Fenchel-Young losses. Except for softmax, which never exactly reaches 0, all distributions shown in the center can have sparse support. As can be checked visually, $\widehat{y}_\Omega$ is differentiable everywhere when $1 \le \alpha < 2$. Hence, $L_\Omega$ is twice differentiable everywhere for these values.

4 New loss functions for sparse probabilistic classification

In the previous section, we presented Fenchel-Young losses in a broad setting. We now restrict our attention to classification over the probability simplex and show how to easily create entire new families of losses.

Generalized entropies.

A natural choice of regularization function over the probability simplex is $\Omega = -H + I_{\triangle^d}$, where $H$ is a generalized entropy (DeGroot, 1962; Grünwald and Dawid, 2004): a concave function over $\triangle^d$, used to measure the "uncertainty" in a distribution $p \in \triangle^d$.

Assumptions: We will make the following assumptions about $H$.

  A.1. Zero entropy: $H(p) = 0$ if $p$ is a delta distribution, i.e., $p \in \{e_1, \dots, e_d\}$.

  A.2. Strict concavity: $H\big((1-\lambda)p + \lambda p'\big) > (1-\lambda)\,H(p) + \lambda\,H(p')$, for $p \neq p'$, $\lambda \in (0, 1)$.

  A.3. Symmetry: $H(Pp) = H(p)$ for any $P \in \mathcal{P}$.

Assumptions A.2 and A.3 imply that $H$ is Schur-concave (Bauschke and Combettes, 2017), a common requirement in generalized entropies. This in turn implies assumption A.1, up to a constant (that constant can easily be subtracted so as to satisfy assumption A.1). As suggested by the next result, proved in §C.3, together, these assumptions imply that $H$ can be used as a sensible uncertainty measure. If $H$ satisfies assumptions A.1–A.3, then it is non-negative and uniquely maximized by the uniform distribution $p = \mathbf{1}/d$. A particular case of generalized entropies satisfying assumptions A.1–A.3 are uniformly separable functions of the form $H(p) = \sum_{j=1}^d g(p_j)$, where $g$ is a non-negative strictly concave function such that $g(0) = g(1) = 0$. However, our framework is not restricted to this form.

Induced Fenchel-Young loss.

If the ground truth is $y = e_k$ and assumption A.1 holds, (8) becomes

$L_\Omega(\theta; e_k) = \Omega^*(\theta) - \theta_k.$   (12)

By using the fact that $\Omega^*(\theta + c\mathbf{1}) = \Omega^*(\theta) + c$ for all $c \in \mathbb{R}$ if $\mathrm{dom}(\Omega) \subseteq \triangle^d$, we can further rewrite it as

$L_\Omega(\theta; e_k) = \Omega^*(\theta - \theta_k \mathbf{1}).$   (13)

This expression shows that Fenchel-Young losses over $\triangle^d$ can be written solely in terms of the generalized "cumulant function" $\Omega^*$. Indeed, when $H$ is Shannon's entropy, we recover the cumulant (a.k.a. log-partition) function $\Omega^*(\theta) = \log \sum_j \exp \theta_j$. When $H$ is strongly concave over $\triangle^d$, we can also see $\Omega^*$ as a smoothed max operator (Niculae and Blondel, 2017; Mensch and Blondel, 2018) and hence $L_\Omega(\theta; e_k)$ can be seen as a smoothed upper bound of the perceptron loss $\max_j \theta_j - \theta_k$.

We now give two examples of generalized entropies. The resulting families of prediction and loss functions, new to our knowledge, are illustrated in Figure 1. We provide more examples in §A.

Tsallis $\alpha$-entropies (Tsallis, 1988).

Defined as $H^{\mathrm{T}}_\alpha(p) := \frac{k}{\alpha - 1}\big(1 - \sum_j p_j^\alpha\big)$, where $\alpha \ge 1$ and $k$ is an arbitrary positive constant, these entropies arise as a generalization of the Shannon-Khinchin axioms to non-extensive systems (Suyari, 2004) and have numerous scientific applications (Gell-Mann and Tsallis, 2004; Martins et al., 2009). For convenience, we set $k = 1/\alpha$ for the rest of this paper. Tsallis entropies satisfy assumptions A.1–A.3 and can also be written in uniformly separable form:

$H^{\mathrm{T}}_\alpha(p) = \frac{1}{\alpha(\alpha - 1)} \sum_j \big(p_j - p_j^\alpha\big).$   (14)

The limit case $\alpha \to 1$ corresponds to the Shannon entropy. When $\alpha = 2$, we recover the Gini index (Gini, 1912), a popular "impurity measure" for decision trees:

$H^{\mathrm{T}}_2(p) = \frac{1}{2}\sum_j p_j (1 - p_j) = \frac{1}{2}\big(1 - \lVert p \rVert_2^2\big).$   (15)

It is easy to check that $\Omega = -H^{\mathrm{T}}_2 + I_{\triangle^d}$ recovers the sparsemax loss (Martins and Astudillo, 2016) (cf. Table 1). Another interesting case is $\alpha \to \infty$, which gives $H^{\mathrm{T}}_\infty = 0$, hence $L_\Omega$ is the perceptron loss in Table 1. The resulting "argmax" distribution puts all probability mass on the top-scoring classes. In summary, for $\alpha \in \{1, 2, \infty\}$, $\widehat{y}_\Omega$ is softmax, sparsemax, and argmax, and $L_\Omega$ is the logistic, sparsemax, and perceptron loss, respectively. Tsallis entropies induce a continuous parametric family subsuming these important cases. Since the best surrogate loss often depends on the data (Nock and Nielsen, 2009), tuning $\alpha$ typically improves accuracy, as we confirm in §6.
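
The special cases above are easy to check numerically; the following sketch (ours, using the convention of Eq. (14) with $k = 1/\alpha$) evaluates the Tsallis entropy for a few values of $\alpha$.

```python
import numpy as np

def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy with k = 1/alpha, Eq. (14); alpha -> 1 gives Shannon."""
    if np.isclose(alpha, 1.0):
        p_pos = p[p > 0]
        return -np.sum(p_pos * np.log(p_pos))            # Shannon limit
    return np.sum(p - p ** alpha) / (alpha * (alpha - 1))

p = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(p, 1.0))                            # Shannon entropy
print(tsallis_entropy(p, 1.0001))                         # close to the Shannon value
print(tsallis_entropy(p, 2.0), 0.5 * (1 - np.sum(p**2)))  # Gini index, Eq. (15), same value
```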

Norm entropies.

An interesting class of non-separable entropies are the entropies generated by a $q$-norm, defined as $H^{\mathrm{N}}_q(p) := 1 - \lVert p \rVert_q$. We call them norm entropies. From the Minkowski inequality, $q$-norms with $1 < q < \infty$ are strictly convex on the simplex, so $H^{\mathrm{N}}_q$ satisfies assumptions A.1–A.3 for $1 < q < \infty$. The limit case $q = \infty$ is particularly interesting: in this case, we obtain $H^{\mathrm{N}}_\infty(p) = 1 - \max_j p_j$, recovering the Berger-Parker dominance index (Berger and Parker, 1970), widely used in ecology to measure species diversity. We surprisingly encounter $H^{\mathrm{N}}_\infty$ again in §5, as a limit case for the existence of separation margins.

Computing $\widehat{y}_\Omega$.

For non-separable entropies $H$, the regularized prediction function $\widehat{y}_\Omega(\theta)$ does not generally enjoy a closed-form expression, and one must resort to projected-gradient methods to compute it. Fortunately, for uniformly separable entropies, which we saw to be the case of Tsallis entropies, we now show that $\widehat{y}_\Omega(\theta)$ can be computed in linear time.

Reduction to root finding

Let $\Omega = -H + I_{\triangle^d}$, where $H(p) = \sum_j g(p_j)$ and $g$ is strictly concave and differentiable. Then,

$[\widehat{y}_\Omega(\theta)]_j = \max\big\{(g')^{-1}(\tau - \theta_j),\, 0\big\},$   (16)

where $\tau$ is a root of $\phi(\tau) := \sum_j \max\{(g')^{-1}(\tau - \theta_j),\, 0\} - 1$, in the tight search interval $[\tau_{\min}, \tau_{\max}]$, with $\tau_{\min} := \max_j \theta_j + g'(1)$ and $\tau_{\max} := \max_j \theta_j + g'(1/d)$. An approximate $\tau$ such that $|\phi(\tau)| \le \epsilon$ can be found in $O(d \log(1/\epsilon))$ time by, e.g., bisection. The related problem of Bregman projection onto the probability simplex was recently studied by Krichene et al. (2015), but our derivation is different and more direct (cf. §C.4).
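
As an illustration of this reduction, here is a hedged sketch (not the authors' implementation) specialized to Tsallis entropies, for which $(g')^{-1}$ is available in closed form; it computes $\widehat{y}_\Omega(\theta)$ for any $\alpha > 1$ by bisection on a threshold that is an affine reparametrization of the $\tau$ in (16).

```python
import numpy as np

def tsallis_predict(theta, alpha, n_iter=50):
    """Compute y_hat_Omega(theta) for Omega = -Tsallis_alpha + simplex indicator, alpha > 1.
    Uses the closed form p_j(tau) = [(alpha-1)*theta_j - tau]_+^(1/(alpha-1))
    and bisection on tau so that sum_j p_j(tau) = 1 (root-finding reduction)."""
    z = (alpha - 1.0) * theta
    # The root lies in [max(z) - 1, max(z)]: at the left end the top coordinate alone
    # already contributes 1; at the right end all coordinates vanish.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.maximum(z - 0.5 * (lo + hi), 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()   # tiny renormalization to absorb the residual bisection error

theta = np.array([1.2, 0.3, -0.5])
print(tsallis_predict(theta, alpha=2.0))   # matches the sparsemax projection of theta
print(tsallis_predict(theta, alpha=1.5))   # a sparser-than-softmax distribution
```

For $\alpha = 2$ the update reduces to the sparsemax thresholding, and the cost per bisection step is $O(d)$, in line with the linear-time claim above.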

5 Separation margin of F-Y losses

In this section, we are going to see that the simple assumptions A.1–A.3 about a generalized entropy $H$ are enough to obtain results about the separation margin associated with $L_\Omega$. The notion of margin is well-known in machine learning, lying at the heart of support vector machines and leading to generalization error bounds (Vapnik, 1998; Schölkopf and Smola, 2002; Guermeur, 2007). We provide a definition and will see that many other Fenchel-Young losses also have a "margin," for suitable conditions on $H$. Then, we take a step further, and connect the existence of a margin with the sparsity of the regularized prediction function, providing necessary and sufficient conditions for Fenchel-Young losses to have a margin. Finally, we show how this margin can be computed analytically.

Separation margin

Let $L(\theta; e_k)$ be a loss function over $\mathbb{R}^d \times \{e_1, \dots, e_d\}$. We say that $L$ has the separation margin property if there exists $m > 0$ such that:

$\theta_k \ge m + \max_{j \neq k} \theta_j \;\Longrightarrow\; L(\theta; e_k) = 0.$   (17)

The smallest possible $m$ that satisfies (17) is called the margin of $L$, denoted $\mathrm{margin}(L)$.

Examples.

The most famous example of a loss with a separation margin is the multi-class hinge loss, $L(\theta; e_k) = \max_i\,(\theta_i + [\![i \ne k]\!]) - \theta_k$, which we saw in Table 1 to be a Fenchel-Young loss: it is immediate from the definition that its margin is $1$. Less trivially, Martins and Astudillo (2016, Prop. 3.5) showed that the sparsemax loss also has the separation margin property. On the negative side, the logistic loss does not have a margin, as it is strictly positive. Characterizing which Fenchel-Young losses have a margin is an open question which we address next.

Conditions for existence of margin.

To accomplish our goal, we need to characterize the gradient mappings $\partial\Omega$ and $\partial\Omega^*$ associated with generalized entropies, where $\Omega = -H + I_{\triangle^d}$ (note that $\partial\Omega(p)$ is never single-valued: if $\theta$ is in $\partial\Omega(p)$, then so is $\theta + c\mathbf{1}$, for any constant $c \in \mathbb{R}$). Of particular importance is the subdifferential set $\partial\Omega(e_k)$. The next proposition, whose proof we defer to §C.5, uses this set to provide a necessary and sufficient condition for the existence of a separation margin, along with a formula for computing it. Let $H$ satisfy A.1–A.3. Then:

  1. The loss $L_\Omega$ has a separation margin iff there is an $m > 0$ such that $m\,e_k \in \partial\Omega(e_k)$.

  2. If the above holds, then the margin of $L_\Omega$ is given by the smallest such $m$ or, equivalently,

     $\mathrm{margin}(L_\Omega) = \sup_{p \in \triangle^d} \frac{H(p)}{1 - \lVert p \rVert_\infty}.$   (18)

Reassuringly, the first part confirms that the logistic loss does not have a margin, since $\partial\Omega(e_k) = \emptyset$ for the Shannon entropy. A second interesting fact is that the denominator of (18) is the generalized entropy introduced in §4: the $\infty$-norm entropy $H^{\mathrm{N}}_\infty(p) = 1 - \lVert p \rVert_\infty$. As Figure 1 suggests, this entropy provides an upper bound for convex losses with unit margin. This provides some intuition to the formula (18), which seeks a distribution maximizing the entropy ratio between $H$ and $H^{\mathrm{N}}_\infty$.

Equivalence between sparsity and margin.

The next result, proved in §C.6, characterizes more precisely the image of $\widehat{y}_\Omega$. In doing so, it establishes a key result in this paper: a sufficient condition for the existence of a separation margin in $L_\Omega$ is the sparsity of the regularized prediction function $\widehat{y}_\Omega$, i.e., its ability to reach the entire simplex, including the boundary points. If $H$ is uniformly separable, this is also a necessary condition.

Equivalence between sparse probability distribution and loss enjoying a margin

Let $H$ satisfy A.1–A.3 and be uniformly separable, i.e., $H(p) = \sum_j g(p_j)$. Then the following statements are all equivalent:

  1. $\partial\Omega(p) \neq \emptyset$ for any $p \in \triangle^d$;

  2. The mapping $\widehat{y}_\Omega$ covers the full simplex, i.e., $\widehat{y}_\Omega(\mathbb{R}^d) = \triangle^d$;

  3. $L_\Omega$ has the separation margin property.

For a general (not necessarily separable) $H$ satisfying A.1–A.3, we have (1) $\Rightarrow$ (2) $\Rightarrow$ (3).

Let us reflect for a moment on the three conditions stated in Proposition 5. The first two conditions involve the subdifferential and gradient of $\Omega$ and its conjugate; the third condition is the margin property of $L_\Omega$. To provide some intuition, consider the case where $H$ is separable with $H(p) = \sum_j g(p_j)$ and $g$ is differentiable in $(0, 1]$. Then, from the concavity of $g$, its derivative is decreasing, hence the first condition is met if $g'(0^+) < \infty$ and $g'(1) > -\infty$. This is the case with Tsallis entropies for $\alpha > 1$, but not Shannon entropy, since $g'(t) = -\log t - 1$ explodes at $t = 0$. Functions whose gradient "explodes" in the boundary of their domain (hence failing to meet the first condition in Proposition 5) are called "essentially smooth" (Rockafellar, 1970). For those functions, $\nabla\Omega^*$ maps only to the relative interior of $\triangle^d$, never attaining boundary points (Wainwright and Jordan, 2008); this is expressed in the second condition. This prevents essentially smooth functions from generating a sparse $\widehat{y}_\Omega$ or (if they are separable) a loss with a margin, as asserted by the third condition. Since Legendre-type functions (§3) are strictly convex and essentially smooth, by Proposition 3, loss functions for which the composite form $L_\Omega(\theta; y) = D_\Psi(y \,\|\, \widehat{y}_\Omega(\theta))$ holds, which is the case of the logistic loss but not of the sparsemax loss, do not enjoy a margin and cannot induce a sparse probability distribution. This is geometrically visible in Figure 1.

Margin computation.

For Fenchel-Young losses that have the separation margin property, Proposition 5 provided a formula for determining the margin. While informative, formula (18) is not very practical, as it involves a generally non-convex optimization problem. The next proposition, proved in §C.7, takes a step further and provides a remarkably simple closed-form expression for generalized entropies that are twice-differentiable. To simplify notation, we denote by $\nabla_j H(p)$ the $j$-th component of $\nabla H(p)$. Assume $H$ satisfies the conditions in Proposition 5 and is twice-differentiable on the simplex. Then, for arbitrary $k \in \{1, \dots, d\}$ and any $j \neq k$:

$\mathrm{margin}(L_\Omega) = \nabla_j H(e_k) - \nabla_k H(e_k).$   (19)

In particular, if $H$ is separable, i.e., $H(p) = \sum_j g(p_j)$, where $g$ is concave, twice differentiable, with $g(0) = g(1) = 0$, then

$\mathrm{margin}(L_\Omega) = g'(0) - g'(1).$   (20)

The compact formula (19) provides a geometric characterization of separable entropies and their margins: (20) tells us that only the slopes of $g$ at the two extremities of $[0, 1]$ are relevant in determining the margin.
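
As a quick sanity check (a worked example of ours, not taken from the paper), plugging the Gini entropy of §4 into (20) recovers the unit margin of the sparsemax loss:

```latex
% Gini case: H^T_2(p) = \sum_j g(p_j) with g(t) = \tfrac12\, t(1-t).
g'(t) = \tfrac12 - t
\quad\Longrightarrow\quad
\operatorname{margin}(L_\Omega) = g'(0) - g'(1)
  = \tfrac12 - \bigl(-\tfrac12\bigr) = 1.
```

Indeed, $\operatorname{sparsemax}(\theta) = e_k$, and hence the loss vanishes, exactly when $\theta_k \ge 1 + \max_{j \neq k} \theta_j$.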

Example: case of Tsallis and norm entropies.

As seen in §4, Tsallis entropies are separable with $g(t) = \frac{t - t^\alpha}{\alpha(\alpha - 1)}$. For $\alpha > 1$, $g'(t) = \frac{1 - \alpha t^{\alpha - 1}}{\alpha(\alpha - 1)}$, hence $g'(0) = \frac{1}{\alpha(\alpha - 1)}$ and $g'(1) = -\frac{1}{\alpha}$. Proposition 5 then yields

$\mathrm{margin}(L_\Omega) = \frac{1}{\alpha - 1}.$   (21)

Norm entropies, while not separable, have gradient $\nabla H^{\mathrm{N}}_q(p) = -\big(p / \lVert p \rVert_q\big)^{q-1}$, giving $\nabla_j H^{\mathrm{N}}_q(e_k) = 0$ for $j \neq k$ and $\nabla_k H^{\mathrm{N}}_q(e_k) = -1$, so

$\mathrm{margin}(L_\Omega) = 1,$   (22)

as confirmed visually in Figure 1, in the binary case.

6 Experimental results

As we saw, $\alpha$-Tsallis entropies generate a family of losses, with the logistic ($\alpha = 1$) and sparsemax ($\alpha = 2$) losses as important special cases. In addition, they are twice differentiable for $1 \le \alpha < 2$, produce sparse probability distributions for $\alpha > 1$, and are computationally efficient for any $\alpha$, thanks to Proposition 4. In this section, we demonstrate their usefulness on the task of label proportion estimation and compare different solvers for computing $\widehat{y}_\Omega(\theta)$.

| Dataset | $\alpha = 1$ (logistic) | $\alpha = 1.5$ | $\alpha = 2$ (sparsemax) | $\alpha$ tuned |
|---|---|---|---|---|
| Birds | 0.359 / 0.530 | 0.364 / 0.504 | 0.364 / 0.504 | 0.358 / 0.501 |
| Cal500 | 0.454 / 0.034 | 0.456 / 0.035 | 0.452 / 0.035 | 0.456 / 0.034 |
| Emotions | 0.226 / 0.327 | 0.225 / 0.317 | 0.225 / 0.317 | 0.224 / 0.321 |
| Mediamill | 0.375 / 0.208 | 0.363 / 0.193 | 0.356 / 0.191 | 0.361 / 0.193 |
| Scene | 0.175 / 0.344 | 0.176 / 0.363 | 0.176 / 0.363 | 0.175 / 0.345 |
| TMC | 0.225 / 0.337 | 0.224 / 0.327 | 0.224 / 0.327 | 0.217 / 0.328 |
| Yeast | 0.307 / 0.183 | 0.314 / 0.186 | 0.314 / 0.186 | 0.307 / 0.183 |
| Avg. rank | 2.57 / 2.71 | 2.71 / 2.14 | 2.14 / 2.00 | 1.43 / 1.86 |

Figure 2: Test-set performance of Tsallis losses for various $\alpha$ on the task of sparse label proportion estimation: average Jensen-Shannon divergence (left) and mean squared error (right) in each cell. Lower is better.

[Figure 3: runtime comparison (time in ms) of Brent's method, bisection, and FISTA.]

Figure 3: Median time until the prescribed accuracy is met for computing $\widehat{y}_\Omega(\theta)$.

Label proportion estimation.

Given an input vector $x \in \mathbb{R}^p$, where $p$ is the number of features, our goal is to estimate a vector of label proportions $y \in \triangle^d$, where $d$ is the number of classes. If $y$ is sparse, we expect the superiority of Tsallis losses over the conventional logistic loss on this task. At training time, given a set of $(x_i, y_i)$ pairs, we estimate a matrix $W \in \mathbb{R}^{d \times p}$ by minimizing the convex objective

$\min_{W} \; \sum_{i=1}^n L_\Omega(W x_i;\, y_i) + \frac{\lambda}{2}\lVert W \rVert_F^2.$   (23)

We use L-BFGS (Liu and Nocedal, 1989) for simplicity. From Proposition 9 and using the chain rule, we obtain the gradient expression $\nabla_W = (\widehat{Y} - Y)^\top X + \lambda W$, where $X$, $Y$, and $\widehat{Y}$ are matrices whose rows gather $x_i$, $y_i$, and $\widehat{y}_\Omega(W x_i)$, for $i = 1, \dots, n$. At test time, we predict label proportions by $\widehat{y}_\Omega(W x)$.
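
A minimal sketch of this training procedure is given below, assuming the hypothetical helper tsallis_predict from the sketch in §4 (valid for $\alpha > 1$) and SciPy's L-BFGS; X (n × p features) and Y (n × d label proportions, rows on the simplex) are placeholder arrays, not the paper's datasets.

```python
import numpy as np
from scipy.optimize import minimize

def fit_tsallis(X, Y, alpha=1.5, lam=1e-3):
    """Minimize Eq. (23): sum_i L_Omega(W x_i; y_i) + lam/2 ||W||_F^2, with L-BFGS.
    Requires alpha > 1 and the tsallis_predict helper defined earlier."""
    n, p = X.shape
    d = Y.shape[1]

    def loss_and_grad(w_flat):
        W = w_flat.reshape(d, p)
        Theta = X @ W.T                                     # n x d score vectors
        Y_hat = np.stack([tsallis_predict(t, alpha) for t in Theta])
        # F-Y loss via L = [Omega(y) - <theta, y>] - [Omega(y_hat) - <theta, y_hat>].
        omega = lambda P: -np.sum(P - P**alpha, axis=1) / (alpha * (alpha - 1))
        losses = (omega(Y) - np.sum(Theta * Y, axis=1)) \
                 - (omega(Y_hat) - np.sum(Theta * Y_hat, axis=1))
        obj = losses.sum() + 0.5 * lam * np.sum(W**2)
        grad = (Y_hat - Y).T @ X + lam * W                  # residual-based gradient
        return obj, grad.ravel()

    res = minimize(loss_and_grad, np.zeros(d * p), jac=True, method="L-BFGS-B")
    return res.x.reshape(d, p)
```

The gradient line mirrors the expression $(\widehat{Y} - Y)^\top X + \lambda W$ above, so no automatic differentiation is needed.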

We ran experiments on standard multi-label benchmark datasets — see §B for dataset characteristics. For all datasets, we removed samples with no label, normalized samples to have zero mean and unit variance, and normalized labels to lie in the probability simplex. We chose $\alpha$ and $\lambda$ against the validation set. We report the test-set mean Jensen-Shannon divergence and the mean squared error in Figure 2. As can be seen, the loss with tuned $\alpha$ achieves the best average rank overall. Tuning $\alpha$ allows choosing the best loss in the family in a data-driven fashion. Additional experiments confirm these findings — see §B.

Solver comparison.

Next, we compared bisection (binary search) and Brent's method for solving (1) by root finding (Proposition 4). We focus on $\alpha = 1.5$, i.e., the $1.5$-Tsallis entropy, and also compare against using a generic projected-gradient algorithm (FISTA) to solve (1) naively. We measure the time needed to reach a solution within a prescribed tolerance of the optimum, over 200 samples. Median and 99% CI times reported in Figure 3 reveal that root finding scales better, with Brent's method outperforming FISTA by one to two orders of magnitude.
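
For completeness, here is a hedged sketch of how Brent's method (via scipy.optimize.brentq) can replace the bisection loop of the earlier Tsallis sketch on the same bracketing interval; it is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import brentq

def tsallis_predict_brent(theta, alpha):
    """Same prediction as the bisection sketch, with the threshold found by Brent's method."""
    z = (alpha - 1.0) * theta
    def phi(tau):
        return np.sum(np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))) - 1.0
    # phi is decreasing, non-negative at z.max() - 1 and negative at z.max(): a valid bracket.
    tau = brentq(phi, z.max() - 1.0, z.max())
    p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

print(tsallis_predict_brent(np.array([1.2, 0.3, -0.5]), alpha=2.0))  # sparsemax again
```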

7 Related work

Proper scoring rules (proper losses) are well-studied objects in statistics (Grünwald and Dawid, 2004; Gneiting and Raftery, 2007) and machine learning (Reid and Williamson, 2010; Williamson et al., 2016) that measure the discrepancy between a ground truth $y$ and a probability forecast $p$ in a Fisher-consistent manner. From Savage (1971) (see also Gneiting and Raftery (2007)), we can construct a proper scoring rule from a generalized entropy $H$ by

$S(p; y) := H(p) + \langle \nabla H(p),\, y - p \rangle = H(y) + D_{-H}(y \,\|\, p),$   (24)

recovering the well-known relation between Bregman divergences and proper scoring rules. For example, using the Gini index