# Geometric Losses for Distributional Learning

Building upon recent advances in entropy-regularized optimal transport, and upon Fenchel duality between measures and continuous functions, we propose a generalization of the logistic loss that incorporates a metric or cost between classes. Unlike previous attempts to use optimal transport distances for learning, our loss results in unconstrained convex objective functions, supports infinite (or very large) class spaces, and naturally defines a geometric generalization of the softmax operator. The geometric properties of this loss make it suitable for predicting sparse and singular distributions, for instance supported on curves or hyper-surfaces. We study the theoretical properties of our loss and showcase its effectiveness on two applications: ordinal regression and drawing generation.


## 1 Introduction

For probabilistic classification, the most popular loss is arguably the (multinomial) logistic loss. It is smooth, enabling fast convergence rates, and the softmax operator provides a consistent mapping to probability distributions. In many applications, different costs are associated to misclassification errors between classes. While a cost-aware generalization of the logistic loss exists (margincrf), it does not provide a cost-aware counterpart of the softmax. The softmax is pointwise by nature: it is oblivious to misclassification costs or to the geometry of classes.

Optimal transport (Wasserstein) losses have recently gained popularity in machine learning, for their ability to compare probability distributions in a geometrically faithful manner, with applications such as classification (kusner2015word), clustering (cuturi2014fast), domain adaptation (courty2017joint), dictionary learning (rolet2016fast) and generative models training (montavon2016wasserstein; arjovsky2017wasserstein). For probabilistic classification, frogner_learning_2015 proposes to use entropy-regularized optimal transport (cuturi_sinkhorn_2013) in the multi-label setting. Although this approach successfully leverages a cost between classes, it results in a non-convex loss, when combined with a softmax. A similar regularized Wasserstein loss is used by luise2018wasserstein in conjunction with a kernel ridge regression procedure (ciliberto2016consistent) in order to obtain a consistency result.

The relation between the logistic loss and the maximum entropy principle is well-known. Building upon a generalization of the Shannon entropy originating from entropy-regularized optimal transport (feydy_interpolating_2018) and Fenchel duality between measures and continuous functions, we propose a generalization of the logistic loss that takes into account a metric or cost between classes. Unlike previous attempts to use optimal transport distances for learning, our loss is convex, and naturally defines a geometric generalization of the softmax operator. Besides providing novel insights into the logistic loss, our loss is theoretically sound, even when learning and predicting continuous probability distributions over a potentially infinite number of classes. To sum up, our contributions are as follows.

##### Organization and contributions.

• We introduce the distribution learning setting, review existing losses leveraging a cost between classes and point out their shortcomings (§2).

• Building upon entropy-regularized optimal transport, we present a novel cost-sensitive distributional learning loss and its corresponding softmax operator. Our proposal is theoretically sound even in continuous measure spaces (§3).

• We study the theoretical properties of our loss, such as its Fisher consistency (§4). We derive tractable methods to compute and minimize it in the discrete distribution setting. We propose an abstract Frank-Wolfe scheme for computations in the continuous setting.

• Finally, we demonstrate its effectiveness on two discrete prediction tasks involving a geometric cost: ordinal regression and drawing generation using VAEs (§5).

##### Notation.

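We denote by $\mathcal{X}$ a finite or infinite input space, and by $\mathcal{Y}$ a compact, potentially infinite output space. When $\mathcal{Y}$ is a finite set of $d$ classes, we write $\mathcal{Y} = [d] \triangleq \{1, \dots, d\}$. We denote by $\mathcal{C}(\mathcal{Y})$, $\mathcal{M}(\mathcal{Y})$, $\mathcal{M}^+(\mathcal{Y})$ and $\mathcal{M}_1^+(\mathcal{Y})$ the sets of continuous (bounded) functions, Radon (positive) measures and probability measures on $\mathcal{Y}$. Note that in finite dimensions, $\mathcal{M}_1^+([d]) = \triangle^d$ is the probability simplex and $\mathcal{C}([d]) = \mathbb{R}^d$. We write vectors in $\mathbb{R}^d$ and continuous functions in $\mathcal{C}(\mathcal{Y})$ with small normal letters, e.g., $f, g$. In the finite setting, where $\mathcal{Y} = [d]$, we define $f_i \triangleq f(i)$. We write elements of $\triangle^d$ and measures in $\mathcal{M}(\mathcal{Y})$ with greek letters $\alpha, \beta$. We write matrices and operators with capital letters, e.g., $C$. We denote by $\otimes$ and $\oplus$ the tensor product and sum, and by $\langle \cdot, \cdot \rangle$ the scalar product.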

## 2 Background

In this section, after introducing distributional learning in a discrete setting, we review two lines of work for taking into account a cost between classes: cost-augmented losses, and geometric losses based on Wasserstein and energy distances. Their shortcomings motivate the introduction of a new geometric loss in §3.

### 2.1 Discrete distribution prediction and learning

We consider a general predictive setting in which an input vector is fed to a parametrized model (e.g., a neural network), that predicts a score vector . At test time, that vector is used to predict the most likely class . In order to predict a probability distribution , it is common to compose with a link function , where . A typical example of link function is the softmax.

To learn the model parameters , it is necessary to define a loss between a ground-truth and the score vector . Composite losses (reid_composite_binary; vernet_2016) decompose that loss into a loss , where and : . Note that depending on and , is not necessarily convex in . More recently, blondel_learning_2018; fy_losses_journal introduced Fenchel-Young losses, a generic way to directly construct a loss and a corresponding link . We will revisit and generalize that framework to the continuous output setting in the sequel of this paper. Given a loss and a training set of input-distribution pairs, , where and , we then minimize , potentially with regularization on .

### 2.2 Cost-augmented losses

Before introducing a new geometric cost-sensitive loss in §3, let us now review classical existing cost-sensitive loss functions. Let $C$ be a $d \times d$ matrix, such that $c_{i,j} \ge 0$ is the cost of misclassifying class $i$ as class $j$. We assume $c_{i,i} = 0$ for all $i \in [d]$. To take into account the cost $C$, in the single label setting, it is natural to define a loss $L$ as follows

$$L(y, f) = c_{y, y'} \quad \text{where} \quad y' \in \operatorname*{argmax}_{i \in [d]} f_i. \tag{1}$$

To obtain a loss on distributions, we simply define $\ell(\delta_y, f) \triangleq L(y, f)$, where $\delta_y$ is the one-hot representation of $y$. Note that choosing $c_{i,j} = 1$ when $i \neq j$ and $c_{i,j} = 0$ otherwise (i.e., $C = 1 - I_{d \times d}$) reduces to the zero-one loss. To obtain a convex upper-bound, (1) is typically replaced with a cost-augmented hinge loss (multiclass_svm; structured_hinge):

$$L(y, f) = \max_{i \in [d]} \, c_{y,i} + f_i - f_y. \tag{2}$$

Replacing the max above with a log-sum-exp leads to a cost-augmented version of the logistic (or conditional random field) loss (margincrf). Another convex relaxation is the cost-sensitive pairwise hinge loss (multiclass_weston; duchi_multiclass_2016). Remarkably, all these losses use only one row of $C$, the one corresponding to the ground truth $y$. Because of this dependency on $y$, it is not clear how to define a probabilistic mapping at test time. In this paper, we propose a loss which comes with a geometric generalization of the softmax operator. That operator uses the entire cost matrix $C$.
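As a concrete illustration of (1), (2) and the log-sum-exp variant, the following minimal NumPy sketch evaluates the three losses for a single labelled example. The function names and the absolute-difference cost matrix are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def cost_of_prediction(y, f, C):
    """Eq. (1): cost of predicting the arg-max class when the truth is y."""
    return float(C[y, int(np.argmax(f))])

def cost_augmented_hinge(y, f, C):
    """Eq. (2): max_i c_{y,i} + f_i - f_y (only row y of C is used)."""
    return float(np.max(C[y] + f) - f[y])

def cost_augmented_logistic(y, f, C):
    """Eq. (2) with the max replaced by a (stabilised) log-sum-exp."""
    z = C[y] + f
    m = np.max(z)
    return float(m + np.log(np.sum(np.exp(z - m))) - f[y])

# Hypothetical example: 4 ordered classes with cost |i - j| (an assumption).
d = 4
C = np.abs(np.subtract.outer(np.arange(d), np.arange(d))).astype(float)
f = np.array([0.5, 2.0, -1.0, 0.0])  # scores
y = 3                                # ground-truth class
```

On this example the hinge loss upper-bounds the cost (1), and the logistic variant upper-bounds the hinge loss, which matches the "convex upper-bound" reading of the text.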

### 2.3 Wasserstein and energy distance losses

Wasserstein or optimal transport distances recently gained popularity as a loss in machine learning for their ability to compare probability distributions in a geometrically faithful manner. As a representative application, frogner_learning_2015 proposed to use entropy-regularized optimal transport (cuturi_sinkhorn_2013) for cost-sensitive multi-label classification. Effectively, optimal transport lifts a distance or cost $C$ between classes to a distance between probability distributions over these classes. Following genevay_stochastic_2016, given a ground-truth probability distribution $\alpha$ and a predicted probability distribution $\beta$, we define

$$\text{OT}_{C,\varepsilon}(\alpha, \beta) \triangleq \min_{\pi \in \mathcal{U}(\alpha, \beta)} \langle \pi, C \rangle + \varepsilon \, \text{KL}(\pi \,|\, \alpha \otimes \beta), \tag{3}$$

where $\mathcal{U}(\alpha, \beta)$ is the transportation polytope, a subset of $\triangle^{d \times d}$ whose elements have constrained marginals: $\pi \mathbf{1} = \alpha$ and $\pi^\top \mathbf{1} = \beta$. KL is the Kullback–Leibler divergence (a.k.a. relative entropy). Because $\beta$ needs to be a valid probability distribution, frogner_learning_2015 propose to use $\beta = \text{softmax}(f)$, where $f \in \mathbb{R}^d$ is a vector of prediction scores. Unfortunately, the resulting composite loss, $\text{OT}_{C,\varepsilon}(\alpha, \text{softmax}(f))$, is not convex w.r.t. $f$. Another class of divergences between measures $\alpha$ and $\beta$ stems from energy distances (szekely_energy_2013) and maximum mean discrepancies. However, composing these divergences with a softmax again breaks convexity in $f$. In contrast, our proposal is convex in $f$ and defines a natural geometric softmax.
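To make (3) concrete, here is a minimal sketch of how the regularized transport cost could be evaluated between two discrete distributions with log-domain Sinkhorn updates. The solver, the function names and the fixed iteration count are assumptions of this example (the text does not prescribe them), and the sketch assumes strictly positive $\alpha$ and $\beta$.

```python
import numpy as np

def _lse(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def entropic_ot(alpha, beta, C, eps=1.0, n_iter=500):
    """Approximate OT_{C,eps}(alpha, beta) of Eq. (3) and its transport plan.

    The plan is sought in the form
    pi_ij = alpha_i * beta_j * exp((u_i + v_j - c_ij) / eps),
    and u, v are updated alternately so that each marginal constraint holds.
    """
    log_a, log_b = np.log(alpha), np.log(beta)
    u = np.zeros_like(alpha)
    v = np.zeros_like(beta)
    for _ in range(n_iter):
        u = -eps * _lse(log_b[None, :] + (v[None, :] - C) / eps, axis=1)
        v = -eps * _lse(log_a[:, None] + (u[:, None] - C) / eps, axis=0)
    log_ref = log_a[:, None] + log_b[None, :]          # log of alpha (x) beta
    pi = np.exp(log_ref + (u[:, None] + v[None, :] - C) / eps)
    kl = np.sum(pi * (np.log(pi) - log_ref))           # KL(pi | alpha (x) beta)
    return float(np.sum(pi * C) + eps * kl), pi

# Hypothetical example on 3 ordered classes with cost |i - j|.
alpha = np.array([0.2, 0.3, 0.5])
beta = np.array([0.5, 0.25, 0.25])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
value, pi = entropic_ot(alpha, beta, C)
```

Since the product plan $\alpha \otimes \beta$ is feasible with zero KL term, the returned value should lie between $0$ and $\langle \alpha \otimes \beta, C \rangle$ once the iterations have converged.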

## 3 Continuous and cost-sensitive distributional learning and prediction

In this section, we construct a loss between probability measures and score functions, canonically associated with a link function. Our construction takes into account a cost function between classes. Unlike existing methods reviewed in §2.2, our loss is well defined and convex on compact, possibly infinite spaces . We start by extending the setting of §2.1 to predicting arbitrary probabilities, for instance having continuous densities with respect to the Lebesgue measure or singular distributions supported on curves or surfaces.

### 3.1 Continuous probabilities and score functions

We consider a compact metric space of outputs $\mathcal{Y}$, endowed with a symmetric cost function $C: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. We wish to predict probabilities over $\mathcal{Y}$, that is, learn to predict distributions $\alpha \in \mathcal{M}_1^+(\mathcal{Y})$. The space of probability measures forms a closed subset of the space of Radon measures, i.e., $\mathcal{M}_1^+(\mathcal{Y}) \subset \mathcal{M}(\mathcal{Y})$. From the Riesz representation theorem, $\mathcal{M}(\mathcal{Y})$ is the topological dual of the space of continuous functions $\mathcal{C}(\mathcal{Y})$, endowed with the uniform convergence norm $\|\cdot\|_\infty$. The topological duality between the primal and the dual defines a pairing, similar to a “scalar product”, between these spaces:

$$\langle \alpha, f \rangle \triangleq \int_{\mathcal{Y}} f(y) \, \mathrm{d}\alpha(y) = \mathbb{E}[f(Y)], \tag{4}$$

for all $f \in \mathcal{C}(\mathcal{Y})$ and $\alpha \in \mathcal{M}_1^+(\mathcal{Y})$, where $Y$ is a random variable with law $\alpha$. This pairing also defines the natural topology to compare measures and to differentiate functionals depending on measures. This is the so-called weak topology, which corresponds to the convergence in law of random variables. A sequence $(\alpha_n)_n$ is said to converge weakly to some $\alpha$ if for all functions $f \in \mathcal{C}(\mathcal{Y})$, $\langle \alpha_n, f \rangle \to \langle \alpha, f \rangle$. Note that when endowing $\mathcal{M}(\mathcal{Y})$ with this weak topology, the dual of $\mathcal{M}(\mathcal{Y})$ is $\mathcal{C}(\mathcal{Y})$, which is the key to be able to use duality (and in particular the Legendre-Fenchel transform) from convex optimization. Using this topology is fundamental to define geometric losses that can cope with arbitrary, possibly highly localized or even singular distributions (for instance sparse sums of Diracs or measures concentrated on thin sets such as 2-D curves or 3-D surfaces).

Similarly to the discrete setting reviewed in §2.1, in the continuous setting, we now wish to predict a distribution by setting , where , (i.e., is unconstrained), and is a link function. We propose to use maps between the primal and the dual score space as link functions. As we shall see, such mirror maps are naturally defined by continuous convex functions on the primal space, through Fenchel-Legendre duality. Our framework recovers the discrete case as a particular case, with corresponding to and to , through the isomorphisms and for all , .

Regularization of optimal transport is our key tool to construct entropy functions which are continuous with respect to the weak topology, and that can be conjugated to define a link function. It allows us to naturally leverage a cost between classes.

### 3.2 An entropy function for continuous probabilities

The regularized optimal transport cost (3) remains well defined when and belong to a continuous measure space , with now being a subset of with marginal constraints. It induces the self-transport functional (feydy_interpolating_2018), that we reuse for our purpose:

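$$\Omega_C(\alpha) \triangleq \begin{cases} -\frac{1}{2} \text{OT}_{C,\varepsilon=2}(\alpha, \alpha) & \text{for } \alpha \in \mathcal{M}_1^+(\mathcal{Y}) \\ +\infty & \text{otherwise.} \end{cases} \tag{5}$$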

We will omit the dependency of on when clear from context. It is shown by feydy_interpolating_2018 that is continuous and convex on , and strictly convex on , where continuity is taken w.r.t. the weak topology. We call , the Sinkhorn negentropy. As a negative entropy function, it can be used to measure the uncertainty in a probability distribution (lower is more uncertain), as illustrated in Figure 2. It will prove crucial in our loss construction. In the above, we have set w.l.o.g. to recover simple asymptotical behavior of , as will be clear in Prop. 1.

We first recall some known results from feydy_interpolating_2018. Using Fenchel-Rockafellar duality theorem (rockafellar_extension_1966), the function rewrites as the solution to a Kantorovich-type dual problem (see e.g., villani_optimal_2008). For all , we have that

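$$-\Omega_C(\alpha) = \max_{f \in \mathcal{C}(\mathcal{Y})} \langle \alpha, f \rangle - \log \Big\langle \alpha \otimes \alpha, e^{\frac{f \oplus f - C}{2}} \Big\rangle, \tag{6}$$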

where we use the homogeneous dual (i.e. with a log in the maximization), as explained in cuturi_semidual_2018.

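$\Omega$ is differentiable in the sense of measures (santambrogio_optimal_2015), meaning that there exists a continuous function $\nabla\Omega(\alpha) \in \mathcal{C}(\mathcal{Y})$ such that, for all $\xi_1, \xi_2 \in \mathcal{M}_1^+(\mathcal{Y})$ and $t > 0$,

$$\Omega(\alpha + t(\xi_2 - \xi_1)) = \Omega(\alpha) + t \langle \xi_2 - \xi_1, \nabla\Omega(\alpha) \rangle + o(t). \tag{7}$$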

As shown in feydy_interpolating_2018, this function , that we call the symmetric Sinkhorn potential, is a particular solution of the dual problem. It is the only function in such that , where the soft $C$-transform operator (cuturi_semidual_2018) is defined as

$$T(f, \alpha)(y) \triangleq -2 \log \Big\langle \alpha, e^{\frac{f - C(y, \cdot)}{2}} \Big\rangle. \tag{8}$$

This operator can be understood as the log-convolution of the measure with the Sinkhorn kernel . The Sinkhorn potential has the remarkable property of being defined on all , even though the support of may be smaller. Given any dual solution to (6), which is defined -almost everywhere, we have , i.e. extrapolates the values of on the support of , using the Sinkhorn kernel.

##### Special cases.

The following proposition, which is an original contribution, shows that the Sinkhorn negentropy asymptotically recovers the negative Shannon entropy and Gini index (gini_index) when rescaling the cost. The Sinkhorn negentropy therefore defines a parametric family of negentropies, recovering these important special cases. Note however that on continuous spaces , the Shannon entropy is not weak continuous and thus cannot be used to define geometric loss and link functions, the softmax link function being geometry-oblivious. Similarly, the Gini index is not defined on , as it involves the squared values of in a discrete setting.

###### Proposition 1 (Asymptotics of Sinkhorn negentropies).

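For $\mathcal{Y}$ compact, the rescaled Sinkhorn negentropy converges to a kernel norm for high regularization $\varepsilon$. Namely, for all $\alpha \in \mathcal{M}_1^+(\mathcal{Y})$, we have

$$\varepsilon \, \Omega_{C/\varepsilon}(\alpha) \xrightarrow{\varepsilon \to +\infty} \frac{1}{2} \langle \alpha \otimes \alpha, -C \rangle. \tag{9}$$

Let $\mathcal{Y}$ be discrete and choose $C = 1 - I_{d \times d}$. The Sinkhorn negentropy converges to the Shannon negentropy for low regularization, and to the negative Gini index for high regularization:

$$\Omega_{C/\varepsilon}(\alpha) \xrightarrow{\varepsilon \to 0} \langle \alpha, \log \alpha \rangle, \qquad \varepsilon \, \Omega_{C/\varepsilon}(\alpha) \xrightarrow{\varepsilon \to +\infty} \frac{1}{2} \big( \|\alpha\|_2^2 - 1 \big). \tag{10}$$

Proof is provided in §A.1. The first part of the proposition shows that the Sinkhorn negentropies converge to a kernel norm (see e.g., sriperumbudur_universality_2011). This is similar to the regularized Sinkhorn divergences converging to an Energy Distance (szekely_energy_2013) for $\varepsilon \to +\infty$ (genevay_sinkhorn-autodiff_2017; feydy_interpolating_2018).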
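The two limits of Prop. 1 can be checked numerically in the discrete case. The sketch below is only an illustration under stated assumptions: it solves the dual problem (6) with a damped fixed-point iteration on the soft C-transform (8), a solver choice made for this example rather than one prescribed by the text. It relies on the first-order optimality condition of (6), which, after normalising the dual variable, reads $f = T(f, \alpha)$ and gives $\Omega_C(\alpha) = -\langle \alpha, f \rangle$. All function names are hypothetical, and $\alpha$ is assumed to have strictly positive entries.

```python
import numpy as np

def _lse(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def soft_c_transform(f, alpha, C):
    """Eq. (8): T(f, alpha)(y) = -2 log <alpha, exp((f - C(y, .)) / 2)>."""
    return -2.0 * _lse(np.log(alpha)[None, :] + (f[None, :] - C) / 2.0, axis=1)

def sinkhorn_negentropy(alpha, C, n_iter=500):
    """Return (Omega_C(alpha), f), with f a maximiser of the dual problem (6).

    Damped iteration f <- (f + T(f, alpha)) / 2; at the fixed point f = T(f, alpha)
    the log term of (6) vanishes, so that -Omega_C(alpha) = <alpha, f>.
    """
    f = np.zeros_like(alpha)
    for _ in range(n_iter):
        f = 0.5 * (f + soft_c_transform(f, alpha, C))
    return float(-np.dot(alpha, f)), f

alpha = np.array([0.2, 0.3, 0.5])
C01 = 1.0 - np.eye(3)                      # 0-1 cost matrix

# Low regularization: Omega_{C/eps} approaches the Shannon negentropy.
omega_low, _ = sinkhorn_negentropy(alpha, C01 / 1e-3)
shannon = float(np.sum(alpha * np.log(alpha)))

# High regularization: eps * Omega_{C/eps} approaches the negative Gini index.
eps = 1e3
omega_high, _ = sinkhorn_negentropy(alpha, C01 / eps)
gini = 0.5 * (float(np.sum(alpha ** 2)) - 1.0)
```

With these settings, `omega_low` should agree with `shannon` and `eps * omega_high` with `gini`, up to an error that shrinks as the regularization is pushed further towards its limit.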

##### From probabilities to potentials.

The symmetric Sinkhorn potential is a continuous function, or a vector in the discrete setting. It can be interpreted as a distance field to the distribution . We visualize this field on a 2D space in Figure 1, where is the set of pixels of an image, and we wish to predict a 2-dimensional probability distribution in . Predicting a distance field to a measure is more convenient than predicting a distribution directly, as it has unconstrained values and is therefore easier to optimize against. For this reason, we propose to learn parametric models that predict a “distance field” given an input . In the following section, we construct a link function , for general probability measure and function spaces and , so as to obtain a distributional estimator .

### 3.3 Fenchel-Young losses in continuous setting

To that end, we generalize in this section the recently-proposed Fenchel-Young (FY) loss framework (blondel_learning_2018; fy_losses_journal), originally limited to discrete cost-oblivious measure spaces, to infinite measure spaces. Inspired by that line of work, we use Legendre-Fenchel duality to define loss and link functions from Sinkhorn negative entropies, in a principled manner. We define the Legendre-Fenchel conjugate of as

$$\Omega^\star(f) \triangleq \max_{\alpha \in \mathcal{M}_1^+(\mathcal{Y})} \langle \alpha, f \rangle - \Omega(\alpha). \tag{11}$$

Rigorously, is a pre-conjugate, as is defined on , the topological dual of continuous functions . For a comprehensive and rigorous treatment of the theory of conjugation in infinite spaces, and in particular Banach spaces as is the case of , see mitter_convex_2008.

As $\Omega$ is strictly convex, $\Omega^\star$ is differentiable everywhere and we have, from a Danskin theorem (danskin_theory_1966) with left Banach space and right compact space (bernhard_variations_1990, Theorem C.1):

$$\nabla\Omega^\star(f) = \operatorname*{argmax}_{\alpha \in \mathcal{M}_1^+(\mathcal{Y})} \langle \alpha, f \rangle - \Omega(\alpha) \in \mathcal{M}_1^+(\mathcal{Y}).$$

That gradient can be used as a link from $\mathcal{C}(\mathcal{Y})$ to $\mathcal{M}_1^+(\mathcal{Y})$. It can also be interpreted as a regularized prediction function (blondel_learning_2018; mensch_differentiable_2018). Following the FY loss framework, we define the loss associated with $\Omega$ by

$$\ell_\Omega(\alpha, f) \triangleq \Omega^\star(f) + \Omega(\alpha) - \langle \alpha, f \rangle. \tag{12}$$

In the discrete single-label setting, that loss is also related to the construction of duchi_multiclass_2016. From the Fenchel-Young theorem (rockafellar_convex_1970), $\ell_\Omega(\alpha, f) \ge 0$, with equality if and only if $\alpha = \nabla\Omega^\star(f)$. The loss is thus positive, convex and differentiable in its second argument, and minimizing it amounts to finding the pre-image of the target distribution $\alpha$ with respect to the link (mapping) $\nabla\Omega^\star$.

Our construction is a generalization of the Fenchel-Young loss framework (blondel_learning_2018; fy_losses_journal), in the sense that it relies on topological duality between and , instead of the Hilbertian structure of and , to construct the loss and link function . We now instantiate the Fenchel-Young loss (12) with Sinkhorn negentropies in order to obtain a novel cost-sensitive loss.
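Before moving to Sinkhorn negentropies, it may help to see the Fenchel-Young construction (12) in the classical discrete case. The sketch below assumes that $\Omega$ is the Shannon negentropy $\langle \alpha, \log \alpha \rangle$ on the simplex, whose conjugate is the log-sum-exp and whose link is the softmax; the loss then coincides with a Kullback–Leibler divergence. Function names are illustrative only.

```python
import numpy as np

def lse(f):
    """Log-sum-exp: conjugate of the Shannon negentropy on the simplex."""
    m = np.max(f)
    return float(m + np.log(np.sum(np.exp(f - m))))

def softmax(f):
    """Gradient of the log-sum-exp, i.e. the link function in this case."""
    z = np.exp(f - np.max(f))
    return z / z.sum()

def shannon_negentropy(alpha):
    """<alpha, log alpha> with the convention 0 log 0 = 0."""
    nz = alpha > 0
    return float(np.sum(alpha[nz] * np.log(alpha[nz])))

def fenchel_young_loss(alpha, f):
    """Eq. (12) with Omega = Shannon negentropy: the usual logistic loss."""
    return lse(f) + shannon_negentropy(alpha) - float(alpha @ f)

def kl(alpha, beta):
    """Kullback-Leibler divergence between two discrete distributions."""
    nz = alpha > 0
    return float(np.sum(alpha[nz] * np.log(alpha[nz] / beta[nz])))

alpha = np.array([0.2, 0.3, 0.5])   # target distribution
f = np.array([1.0, -0.5, 2.0])      # unconstrained scores
loss = fenchel_young_loss(alpha, f)
```

The loss is non-negative and vanishes when the scores are a pre-image of the target under the link, for instance $f = \log \alpha$.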

### 3.4 A new geometrical loss and softmax

The key ingredients to derive a Fenchel-Young loss and a link are the conjugate and its gradient. Remarkably, they enjoy a simple form with Sinkhorn negentropies, as shown in the following proposition.

###### Proposition 2 (Conjugate of the Sinkhorn negentropy).

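For all $f \in \mathcal{C}(\mathcal{Y})$, the Legendre-Fenchel conjugate $\Omega^\star$ of $\Omega$ defined in (5) and its gradient read

$$\begin{aligned} \text{g-LSE}(f) &\triangleq \Omega^\star(f) = -\log \min_{\alpha \in \mathcal{M}_1^+(\mathcal{Y})} \Phi(\alpha, f) && (13) \\ \text{g-softmax}(f) &\triangleq \nabla\Omega^\star(f) = \operatorname*{argmin}_{\alpha \in \mathcal{M}_1^+(\mathcal{Y})} \Phi(\alpha, f) && (14) \\ \text{where} \quad \Phi(\alpha, f) &\triangleq \Big\langle \alpha \otimes \alpha, \exp\Big(-\frac{f \oplus f + C}{2}\Big) \Big\rangle && (15) \end{aligned}$$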

and where g stands for geometric and LSE for log-sum-exp.

The proof can be found in §A.2. $\nabla\Omega^\star(f)$ is the usual Fréchet derivative of $\Omega^\star$, that lies a priori in the topological dual space of $\mathcal{C}(\mathcal{Y})$, i.e. $\mathcal{M}(\mathcal{Y})$. From a Danskin theorem (bernhard_variations_1990), it is in fact a probability measure. The probability distribution $\nabla\Omega^\star(f)$ is typically sparse, as the minimizer of a quadratic on a convex subspace of $\mathcal{M}(\mathcal{Y})$. We call the loss generated by the Sinkhorn negentropy the g-logistic loss.

##### Special cases.

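Let $\mathcal{Y} = [d]$ and $C = 1 - I_{d \times d}$ (0-1 cost matrix). From Prop. 1, $\Omega_{C/\varepsilon}$ asymptotically recovers the negative Shannon entropy when $\varepsilon \to 0$, and the negative Gini index when $\varepsilon \to +\infty$ (after rescaling by $\varepsilon$). The link $\nabla\Omega^\star$ is then equal to the softmax, and to the sparsemax (martins_softmax_2016), respectively. Likewise, $\ell_\Omega$ recovers the logistic and sparsemax losses. When $\varepsilon \to 0$, because $c_{i,j}/\varepsilon \to +\infty$ if $i \neq j$ and $0$ otherwise, we see that the logistic loss infinitely penalizes inter-class errors. That is, to obtain zero logistic loss, the model must assign probability 1 to the correct class. The limit case $\varepsilon \to 0$ is the only case for which g-softmax always outputs completely dense distributions. In the continuous case, for $\varepsilon \to +\infty$, $\Omega^\star(f)$ degenerates into a positive deconvolution objective with simplex constraint:

$$\max_{\alpha \in \mathcal{M}_1^+(\mathcal{Y})} \langle \alpha, f \rangle - \frac{1}{2} \langle \alpha \otimes \alpha, -C \rangle. \tag{16}$$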

Fig. 1 shows that has indeed a deconvolutional effect.

### 3.5 Computation

Before studying the g-logistic loss and the g-softmax link function, we now describe practical algorithms for computing the g-LSE and the g-softmax in the discrete and continuous cases. The key element in using the g-LSE as a layer in an arbitrarily complex model is to minimize the quadratic function $\Phi(\cdot, f)$ on $\mathcal{M}_1^+(\mathcal{Y})$. We can then use the minimum value in the forward pass, and the minimizer in the backward pass, during e.g. SGD training.

##### Continuous optimisation.

In the general case where $\mathcal{Y}$ is compact, we cannot represent $\alpha$ using a finite vector. Yet, we can use a Frank-Wolfe scheme to progressively add spikes, i.e. Diracs, to an iterate sequence $(\alpha_t)_t$. For this, we need to compute, at each iteration, the gradient of $\Phi_f \triangleq \Phi(\cdot, f)$ in the sense of measures (7), i.e. the function in $\mathcal{C}(\mathcal{Y})$

$$\nabla\Phi_f(\alpha) = \exp\Big(-\frac{f + T(-f, \alpha)}{2}\Big), \tag{17}$$

which only requires computing the C-transform of $-f$ on the measure $\alpha$, similarly to regularized optimal transport. The simplest Frank-Wolfe scheme then updates

$$y_t \in \operatorname*{argmin}_{y \in \mathcal{Y}} \nabla\Phi_f(\alpha_{t-1})(y), \qquad \alpha_t = \alpha_{t-1} + \frac{2}{t+2} (\delta_{y_t} - \alpha_{t-1}). \tag{18}$$

Indeed, for $g \in \mathcal{C}(\mathcal{Y})$, the minimizer of $\langle \alpha, g \rangle$ on $\mathcal{M}_1^+(\mathcal{Y})$ is the Dirac $\delta_y$ where $y$ minimizes $g$. This optimization scheme may be refined to ensure geometric convergence. It can be used to identify Diracs from a continuous distance field $f$, similar to super-resolution approaches proposed in bredies2013inverse; boyd2017alternating. It requires working with a computer-friendly representation of $f$, so that we can obtain an approximation of $y_t$ efficiently, using e.g. non-convex optimization. Another approach is to rely on a deep parametrization of a particle swarm, as proposed by boyd2018deeploco. We leave such an application for future work, and focus on an efficient discrete solver for the g-LSE and g-softmax.
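The Frank-Wolfe update (18) can be sketched on a finite set of points standing in for $\mathcal{Y}$. This is an illustrative discretisation made for this example, not the off-the-grid procedure itself: $\Phi(\cdot, f)$ becomes the quadratic form $\alpha^\top K \alpha$ with $K_{ij} = \exp(-(f_i + f_j + c_{i,j})/2)$, and the arg-min in (18) becomes an arg-min over coordinates of the gradient. The function name, the uniform initialisation and the iteration count are assumptions.

```python
import numpy as np

def g_softmax_frank_wolfe(f, C, n_iter=2000):
    """Approximate g-softmax(f) and g-LSE(f) of Eqs. (13)-(15) on a finite set.

    Minimises Phi(alpha, f) = alpha^T K alpha over the simplex with the
    step size 2 / (t + 2) of Eq. (18). C is assumed symmetric.
    """
    K = np.exp(-(f[:, None] + f[None, :] + C) / 2.0)
    alpha = np.full(f.shape[0], 1.0 / f.shape[0])   # uniform starting point
    for t in range(1, n_iter + 1):
        grad = 2.0 * (K @ alpha)                    # gradient of the quadratic
        y_t = int(np.argmin(grad))                  # best Dirac: linear minimisation
        step = 2.0 / (t + 2.0)
        alpha = (1.0 - step) * alpha                # alpha + step * (delta_y - alpha)
        alpha[y_t] += step
    g_lse = -np.log(alpha @ K @ alpha)
    return alpha, float(g_lse)

# Hypothetical usage: with a very large off-diagonal cost, the output should be
# close to the usual softmax and log-sum-exp (low-regularization limit of Prop. 1).
f = np.array([0.0, 1.0, -1.0])
C_big = 1e3 * (1.0 - np.eye(3))
alpha_hat, g_lse_hat = g_softmax_frank_wolfe(f, C_big)
```

With the plain $2/(t+2)$ step the convergence is only sublinear, so many iterations are needed for an accurate minimizer; this is one reason why the text turns to a dedicated discrete solver.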

##### Discrete optimisation.

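In the discrete case, we can parametrize $\alpha$ in logarithmic space, by setting $\alpha = \exp(l - \text{LSE}(l))$, with $l \in \mathbb{R}^d$. $\Omega^\star(f)$ then reads

$$\max_{l \in \mathbb{R}^d} \, -\log \sum_{i,j=1}^d e^{l_i + l_j - \frac{f_i + f_j + c_{i,j}}{2}} + 2 \, \text{LSE}(l). \tag{19}$$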

This objective is non-convex on but invariant with translation and convex on . It thus admits a unique solution, that we can find using an unconstrained quasi-Newton solver like L-BFGS (liu_limited_1989), that we stop when the iterates are sufficiently stable. For $l^\star$ that maximizes (19), the gradient $\nabla\Omega^\star(f) = \exp(l^\star - \text{LSE}(l^\star))$ is used for backpropagation and at test time. As $\nabla\Omega^\star(f)$ is sparse, we expect some coordinates $l_i$ to go to $-\infty$. In practice, $\exp(l_i)$ then underflows to $0$ after a few iterations.
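The log-space objective (19) is easy to evaluate and to sanity-check. The sketch below, with hypothetical function names, computes it for an arbitrary $l$ and compares it with $-\log \Phi(\alpha, f)$ for $\alpha = \exp(l - \text{LSE}(l))$; it deliberately stops short of the maximization itself, which the text delegates to a quasi-Newton solver.

```python
import numpy as np

def lse(a):
    """Numerically stable log-sum-exp over all entries of an array."""
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))

def objective_19(l, f, C):
    """Objective of Eq. (19) at the log-weights l."""
    E = l[:, None] + l[None, :] - (f[:, None] + f[None, :] + C) / 2.0
    return -lse(E) + 2.0 * lse(l)

def neg_log_phi(alpha, f, C):
    """-log Phi(alpha, f), with Phi as in Eq. (15), for a discrete alpha."""
    K = np.exp(-(f[:, None] + f[None, :] + C) / 2.0)
    return float(-np.log(alpha @ K @ alpha))

# Hypothetical inputs: random log-weights and scores, cost |i - j|.
rng = np.random.default_rng(0)
d = 5
l = rng.normal(size=d)
f = rng.normal(size=d)
C = np.abs(np.subtract.outer(np.arange(d), np.arange(d))).astype(float)
alpha = np.exp(l - lse(l))          # softmax parametrization of the simplex
```

Both evaluations should agree, and shifting `l` by a constant should leave the objective unchanged, which is the translation invariance mentioned above.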

##### Two-dimensional convolution.

In the discrete case, when dealing with two-dimensional potentials and measures, the objective function (19) can be written with a convolution operator, as where . It is therefore efficiently computable and differentiable on GPUs, especially when the kernel is separable in height and width, e.g. for the norm, in which case we perform 2 successive one-dimensional convolutions. We use this computational trick in our variational auto-encoder experiments (§5).
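The separability argument can be illustrated as follows. The sketch assumes a squared Euclidean cost between pixels, $C(p, q) = \|p - q\|_2^2 / \sigma^2$, so that $e^{-C/2}$ is a Gaussian kernel that factorizes over rows and columns; the cost, the bandwidth `sigma` and the function names are assumptions of this example. Applying the kernel to an image then reduces to two one-dimensional products instead of one product with an $(hw) \times (hw)$ matrix.

```python
import numpy as np

def gaussian_kernel_1d(n, sigma):
    """1-D factor exp(-(i - k)^2 / (2 sigma^2)) of the separable kernel."""
    idx = np.arange(n)
    return np.exp(-np.subtract.outer(idx, idx) ** 2 / (2.0 * sigma ** 2))

def apply_separable_kernel(image, sigma):
    """Apply exp(-C / 2) to an h x w image as two successive 1-D products."""
    h, w = image.shape
    Kh = gaussian_kernel_1d(h, sigma)   # acts along the height
    Kw = gaussian_kernel_1d(w, sigma)   # acts along the width
    return Kh @ image @ Kw.T

def phi_2d(alpha, f, sigma):
    """Phi(alpha, f) of Eq. (15) for 2-D alpha and f, using the separable kernel."""
    a = alpha * np.exp(-f / 2.0)
    return float(np.sum(a * apply_separable_kernel(a, sigma)))
```

The same structure is what makes the computation amenable to standard convolution routines on GPUs; here plain matrix products are used to keep the example dependency-free beyond NumPy.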

## 4 Geometric and statistical properties

We start by studying the mirror map $\nabla\Omega^\star$, that we expect to invert the mapping $\nabla\Omega$. This study is necessary as we cannot rely on typical conjugate inversion results (e.g., rockafellar_convex_1970, Theorem 26.5), that would stipulate that $\nabla\Omega^\star = (\nabla\Omega)^{-1}$ on the domain of $\Omega^\star$. Indeed, this result is stated in finite dimension, and requires that $\Omega$ and $\Omega^\star$ be Legendre, i.e. be strictly convex and differentiable on their domain of definition, and have diverging derivative on the boundaries of these domains (see also wainwright_graphical_2008). This is not the case of the Sinkhorn negentropy, which requires novel adjustments. With these at hand, we show that parametric models involving a final g-softmax layer can be trained to minimize a certain well-behaved Bregman divergence on the space of probability measures. Proofs are reported in §A.3 and §A.4.

### 4.1 Geometry of the link function

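We have constructed the link function $\nabla\Omega^\star$ in the hope that it would allow going from a symmetric Sinkhorn potential $\nabla\Omega(\alpha)$ back to the original measure $\alpha$. The following lemma states that this is indeed the case, and derives two consequences on the space of symmetric Sinkhorn potentials, defined as $\mathcal{F} \triangleq \{\nabla\Omega(\alpha), \, \alpha \in \mathcal{M}_1^+(\mathcal{Y})\}$.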

###### Lemma 1 (Inversion of the Sinkhorn potentials).
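$$\begin{aligned} \forall \alpha \in \mathcal{M}_1^+(\mathcal{Y}), \quad & \nabla\Omega^\star \circ \nabla\Omega(\alpha) = \alpha. && (20) \\ \forall f \in \mathcal{F}, \quad & \nabla\Omega \circ \nabla\Omega^\star(f) = f, \quad \Omega^\star(f) = 0. && (21) \end{aligned}$$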

The computation of the Sinkhorn potential thus inverts the g-LSE operator on the space $\mathcal{F}$, which is included in the $0$-level set of $\Omega^\star$. This is similar to the set being the level set of the log-sum-exp function when using the Shannon negentropy as .

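This lemma is not sufficient for our purpose, as we want to characterize the action of $\nabla\Omega \circ \nabla\Omega^\star$ on all continuous functions $f \in \mathcal{C}(\mathcal{Y})$. For this, note that the g-LSE operator has the same behavior as the log-sum-exp when composed with the addition of a constant $c \in \mathbb{R}$:

$$\Omega^\star(f + c) = \Omega^\star(f) + c, \qquad \nabla\Omega^\star(f + c) = \nabla\Omega^\star(f). \tag{22}$$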

Therefore, for all , , which almost makes a part of the space of potentials . Yet, in contrast with the Shannon entropy case, the inclusion of in is strict. Indeed, following §3.2 implies that there exists such that is the image of the C-transform operator. The operator has therefore an extrapolation effect, as it replaces onto the set of Sinkhorn potentials. This is made clear by the following proposition.

###### Proposition 3 (Extrapolation effect of ∇Ω∘∇Ω⋆).

For all , we define the extrapolation of to be

$$f^E \triangleq -T\big(-(f - \Omega^\star(f)), \nabla\Omega^\star(f)\big) + \Omega^\star(f). \tag{23}$$

Then, for all ,

The extrapolation operator translates to , extrapolates so that it becomes a Sinkhorn potential, then translates back the result so that . Its effect clearly appears in Figure 2 (right), where we see that is a projection of onto the cylinder .

### 4.2 Relation to Hausdorff divergence

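Recall that the Bregman divergence (bregman_1967) generated by a strictly convex $\Omega$ is defined as

$$D_\Omega(\alpha, \beta) \triangleq \Omega(\alpha) - \Omega(\beta) - \langle \nabla\Omega(\beta), \alpha - \beta \rangle. \tag{24}$$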

When is the classical negative Shannon entropy , it is well-known that equals the Kullback-Leibler divergence and it is easy to check that

$$\ell_\Omega(\alpha, f) = D_\Omega(\alpha, \nabla\Omega^\star(f)) = \text{KL}(\alpha, \text{softmax}(f)). \tag{25}$$

The equivalence between Fenchel-Young loss and composite Bregman divergence , however, no longer holds true when is the Sinkhorn negentropy defined in (5). In that case, can be interpreted as an asymmetric Hausdorff divergence (aspert2002mesh; feydy_interpolating_2018). It forms a geometric divergence akin to OT distances, and estimates the distance between distribution supports. As we now show, provides an upper-bound on the composition of that divergence with .

###### Proposition 4 (ℓΩ upper-bounds Hausdorff divergence).
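$$\begin{aligned} D_\Omega(\alpha, \nabla\Omega^\star(f)) &= \ell_\Omega(\alpha, f^E) && (26) \\ &= \ell_\Omega(\alpha, f) - \langle \alpha, f^E - f \rangle \le \ell_\Omega(\alpha, f) && (27) \end{aligned}$$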

with equality if .

In contrast with the KL divergence, the asymmetric Hausdorff divergence is finite even when , a geometrical property that it shares with optimal transport divergences. We now use Prop. 4 to derive a new consistency result justifying our loss. Let us assume that input features and output distributions follow a distribution . We define the Hausdorff divergence risk and the Fenchel-Young loss risk as

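$$\mathcal{E}(\beta) \triangleq \mathbb{E}[D_\Omega(\alpha, \beta(x))] \quad \text{and} \quad \mathcal{R}(g) \triangleq \mathbb{E}[\ell_\Omega(\alpha, g(x))],$$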

where the expectation is taken w.r.t. . We define their associated Bayes estimators as

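$$\beta^\star \triangleq \operatorname*{argmin}_{\beta: \mathcal{X} \to \mathcal{M}_1^+(\mathcal{Y})} \mathcal{E}(\beta) \quad \text{and} \quad g^\star \triangleq \operatorname*{argmin}_{g: \mathcal{X} \to \mathcal{C}(\mathcal{Y})} \mathcal{R}(g).$$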

The next proposition guarantees calibration of with respect to the asymmetric Hausdorff divergence .

###### Proposition 5 (Calibration of the g-logistic loss).

The g-logistic loss where is defined in (5) is Fisher consistent with the Hausdorff divergence for the same . That is,

$$\mathcal{E}(\beta^\star) = \mathcal{R}(g^\star), \quad \text{with} \quad g^\star = \nabla\Omega \circ \beta^\star. \tag{28}$$

The excess of risk in the Hausdorff divergence is controlled by the excess of risk in the g-logistic loss. For all , we have

$$\mathcal{E}(\nabla\Omega^\star \circ g) - \mathcal{E}(\beta^\star) \le \mathcal{R}(g) - \mathcal{R}(g^\star). \tag{29}$$

This result, that follows the terminology of tewari_consistency_2005, shows that is suitable for learning predictors that minimize .

## 5 Applications

We present two experiments that demonstrate the validity and usability of the geometric softmax in practical use-cases. We provide a PyTorch package for reusing the discrete geometric softmax layer.

### 5.1 Ordinal regression

We first demonstrate the g-softmax for ordinal regression. In this setting, we wish to predict an ordered category among $d$ categories, and we assume that the cost of predicting $\hat{y}$ instead of $y$ is symmetric and depends on the difference between $\hat{y}$ and $y$. For instance, when predicting ratings, we may have three categories bad ≺ average ≺ good. This is typically modeled by a cost-function , where is the or cost. We use the real-world ordinal datasets provided by gutierrez_ordinal_2016, using their predefined 30 cross-validation folds.

##### Experiment and results.

We study the performance of the geometric softmax in this discrete setting, where the score function is assumed to be a linear function of the input features , i.e., , with , and . We compare its performance to multinomial regression, and to immediate threshold and all-threshold logistic regression (rennie_loss_2005), using a reference implementation provided by pedregosa_consistency_2017. We use a cross-validated penalty term on the linear score model . To compute the Hausdorff divergence at test time and the geometric loss during training, we set .

The results, aggregated over datasets and cross-validation folds, are reported in Table 1. We observe that the g-logistic regression performs better than the others for the Hausdorff divergence on average. It performs slightly worse than a simple logistic regression in terms of accuracy, but slightly better in terms of mean absolute error (MAE, the reference metric in ordinal regression). It thus provides a viable alternative to thresholding techniques, which perform worse in accuracy but better in MAE. It has the further advantage of naturally providing a distribution of output given an input . We simply have, for all , .

##### Calibration of the geometric loss.

We validate Prop. 5 experimentally on the ordinal regression dataset car. During training, we measure the geometric cross-entropy loss and the Hausdorff divergence on the train and validation set. Figure 3 shows that is indeed an upper bound of , and that the difference between both terms reduces to almost $0$ on the train set. Prop. 5 ensures this finding provided that the set of scoring functions is large enough, which appears to be approximately the case here.

### 5.2 Drawing generation with variational auto-encoders

The proposed geometric loss and softmax are suitable to estimate distributions from inputs. As a proof-of-concept experiment, we therefore focus on a setting in which distributional output is natural: generation of hand-drawn doodles and digits, using the Google QuickDraw (ha_neural_2017) and MNIST datasets. We train variational autoencoders on these datasets using, as output layers, (1) the KL divergence with normalized output and (2) our geometric loss with normalized output. These approaches output an image prediction using a softmax/g-softmax over all pixels, which is justified when we seek to output a concentrated distributional output. This is the case for doodles and digits, which can be seen as 1D distributions in a 2D space. It differs from the more common approach that uses a binary cross-entropy loss for every pixel and makes it possible to capture interactions between pixels at the feature extraction level. We use a standard KL penalty on the latent space distribution.

Using the g-softmax takes into account a cost between pixels and , that we set to be the Euclidean cost , where is the cost and is the typical distance of interaction—we choose in our experiments. We therefore made the hypothesis that it would help in reconstructing the input distributions, forming a non-linear layer that captures interaction between inputs in a non-parametric way.
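To make the cost between pixels concrete, here is a small sketch that builds a pairwise cost matrix on a pixel grid. The exact formula and the value of the interaction distance are not recoverable from the text above, so the form `||x - y||^2 / (2 * sigma^2)`, the function name `pixel_cost_matrix`, and the value `sigma=2.0` are all assumptions chosen for illustration.

```python
def pixel_cost_matrix(height, width, sigma):
    """Pairwise cost between the pixels of a height x width grid.

    Assumption: C(x, y) = ||x - y||^2 / (2 * sigma^2). The scaling used
    in the experiments may differ.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [((r1 - r2) ** 2 + (c1 - c2) ** 2) / (2.0 * sigma ** 2) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]

# Illustrative 3x3 grid; sigma is an arbitrary value for this sketch.
C = pixel_cost_matrix(3, 3, sigma=2.0)
```

The resulting matrix has one row and one column per pixel, so its size grows quadratically with the number of pixels.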

##### Results.

We fit a simple MLP VAE on 28x28 images from the QuickDraw Cat dataset. Experimental details are reported in Appendix B (see Figure 6). We also present an experiment with 64x64 images and a DCGAN architecture, as well as a visualization of a VAE fitted on MNIST. In Figure 4, we compare the reconstructions and the samples after training our model with the g-softmax and simple softmax loss. Using the g-softmax, which has a deconvolutional effect, yields images that are concentrated near the edges we want to reconstruct. We compare the training curves for both the softmax and g-softmax versions: using the g-softmax link function and its associated loss better minimizes the asymmetric Hausdorff divergence. The cost of computation is again increased by a factor of 10.

## 6 Conclusion

We introduced a principled way of learning distributional predictors in potentially continuous output spaces, taking into account a cost function between inputs. We constructed a geometric softmax layer, which we derived from Fenchel conjugation theory in Banach spaces. The key to our construction is an entropy function derived from regularized optimal transport, convex and weakly continuous on probability measures. Beyond the experiments in discrete measure spaces that we presented, our framework opens the door to new applications that are intrinsically off-the-grid, such as super-resolution.

## Acknowledgements

The work of A. Mensch and G. Peyré has been supported by the European Research Council (ERC project NORIA). A. Mensch thanks Jean Feydy and Thibault Séjourné for fruitful discussions.

## Appendix A Proofs

We prove the propositions in order of appearance in the main text.

### A.1 Asymptotics of the Sinkhorn negentropy: Proof of Prop. 1

###### Proof.

We start by showing the Shannon entropy limit of the Sinkhorn entropy, in the discrete case. In this case, we use the standard Kantorovich dual (cuturi_sinkhorn_2013). Let , , and

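$$\Omega(\alpha) \triangleq \Omega_{C/\varepsilon}(\alpha) = -\max_{f \in \mathbb{R}^d} \; \langle \alpha, f \rangle - \Big\langle \alpha \otimes \alpha, \exp\Big(\frac{f \oplus f - C/\varepsilon}{2}\Big) \Big\rangle + 1. \tag{30}$$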

For all

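$$\Psi_\alpha(f) \triangleq \langle \alpha, f \rangle - \Big\langle \alpha \otimes \alpha, \exp\Big(\frac{f \oplus f - C/\varepsilon}{2}\Big) \Big\rangle + 1 = \sum_{i=1}^{d} f_i \alpha_i - \sum_{i,j=1}^{d} \alpha_i \alpha_j \exp\Big(\frac{f_i + f_j - c_{i,j}/\varepsilon}{2}\Big) + 1. \tag{31}$$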

For optimal in (30), letting , we have, using element-wise multiplication ,

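$$\nabla \Psi_\alpha(f) = \alpha - \alpha^2 * e^{f} = 0, \quad \text{i.e.} \quad e^{f_i} = \frac{1}{\alpha_i} \quad \text{for all } i \in [d]. \tag{32}$$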

Replacing in (30), we obtain

$$\Omega(\alpha) = \langle \alpha, \log \alpha \rangle + \sum_{i=1}^{d} \alpha_i - 1 = \langle \alpha, \log \alpha \rangle. \tag{33}$$
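The computation in (31) to (33) can be checked numerically on a small example. The sketch below evaluates the dual objective in the limiting regime where the off-diagonal kernel terms vanish, so that only the diagonal terms remain; the function name `psi` and the test vector `alpha` are assumptions made for this illustration.

```python
import math

def psi(alpha, f):
    """Dual objective (31) when only the diagonal kernel terms remain."""
    linear = sum(a * fi for a, fi in zip(alpha, f))
    quadratic = sum(a * a * math.exp(fi) for a, fi in zip(alpha, f))
    return linear - quadratic + 1.0

alpha = [0.2, 0.3, 0.5]                        # hypothetical discrete distribution
f_star = [-math.log(a) for a in alpha]         # stationary point given by (32)
omega = -psi(alpha, f_star)                    # value of (30) at the maximizer
neg_entropy = sum(a * math.log(a) for a in alpha)

# Since psi is concave in f, any perturbation of f_star should lower it.
perturbed = [fi + delta for fi, delta in zip(f_star, [0.1, -0.2, 0.05])]
gap = psi(alpha, f_star) - psi(alpha, perturbed)
```

Here `omega` and `neg_entropy` should agree up to floating-point error, which matches the Shannon negentropy obtained in (33).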

Let us now consider the limit for of , for an arbitrary symmetric cost matrix . We rewrite

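$$\Omega_{C/\varepsilon}(\alpha) = \max_{f \in \mathcal{C}(Y)} 2 \Big\langle \alpha, \frac{f}{2} \Big\rangle - \varepsilon \Big\langle \alpha \otimes \alpha, e^{\frac{f \oplus f}{2} - \frac{C}{\varepsilon}} \Big\rangle = \mathrm{OT}_\varepsilon(\alpha, \alpha). \tag{34}$$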

The asymptotic behavior of , namely

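$$\varepsilon \, \Omega_{C/\varepsilon}(\alpha) \xrightarrow[\varepsilon \to +\infty]{} \frac{1}{2} \langle \alpha \otimes \alpha, -C \rangle, \tag{35}$$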

is then a simple consequence of the asymptotics of Sinkhorn OT distances (genevay_sinkhorn-autodiff_2017), that we apply in the symmetric case. In the discrete setting, the result for becomes, if ,

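$$\frac{1}{2} \langle \alpha \otimes \alpha, I_{d \times d} - 1 \rangle = \frac{1}{2} \Big( \sum_{i=1}^{d} \alpha_i^2 - 1 \Big), \tag{36}$$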

as , which concludes the proof.
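The identity behind (36) is easy to verify numerically. The sketch below assumes that the discrete cost is the 0-1 cost (zero on the diagonal, one elsewhere), which is what the expression $I_{d \times d} - 1$ in (36) suggests; the vector `alpha` is an arbitrary example.

```python
alpha = [0.2, 0.3, 0.5]          # hypothetical discrete distribution
d = len(alpha)

# Assumed 0-1 cost: zero on the diagonal, one elsewhere, so that -C = I - 1.
C = [[0.0 if i == j else 1.0 for j in range(d)] for i in range(d)]

# Right-hand side of (35): one half of <alpha (x) alpha, -C>.
limit = 0.5 * sum(alpha[i] * alpha[j] * (-C[i][j]) for i in range(d) for j in range(d))

# Closed form of (36), using that the entries of alpha (x) alpha sum to one.
gini = 0.5 * (sum(a * a for a in alpha) - 1.0)
```

Both quantities evaluate to the same number, which is the negative of one half of the Gini impurity of `alpha`.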

### A.2 Construction of the geometric softmax: Proof of Prop. 2

###### Proof.

We can rewrite the self transport with the change of variable , due to feydy_global_2018. We then have , and

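$$\begin{aligned} \Omega(\alpha) \triangleq -\frac{1}{2} \mathrm{OT}_2(\alpha, \alpha) &= -\max_{f \in \mathcal{C}(Y)} \; \langle \alpha, f \rangle - \log \Big\langle \alpha \otimes \alpha, \exp\Big(\frac{f \oplus f - C}{2}\Big) \Big\rangle \\ &= -\max_{\mu \in \mathcal{M}^+(Y)} \; -2 \Big\langle \alpha, \log \frac{d\alpha}{d\mu} \Big\rangle - \log \|\mu\|_k^2, \\ \text{where} \quad \|\mu\|_k^2 &\triangleq \int_X \int_X \exp\Big(-\frac{C(x, y)}{2}\Big) \, d\mu(x) \, d\mu(y) \end{aligned}$$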

is the kernel norm defined with kernel . Then, the conjugate of reads, for all ,

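$$\begin{aligned} \Omega^\star(f) &= \max_{\alpha \in \mathcal{M}_1^+(Y)} \langle \alpha, f \rangle - \Omega(\alpha) \\ &= \max_{\mu \in \mathcal{M}^+(Y)} \log \frac{\iint_{X^2} \exp\Big(\frac{f(x) + f(y)}{2}\Big) \, d\mu(x) \, d\mu(y)}{\iint_{X^2} \exp\Big(-\frac{C(x, y)}{2}\Big) \, d\mu(x) \, d\mu(y)}, \end{aligned}$$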

where we have used the conjugation of the relative entropy over the space of probability measure :

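$$\max_{\alpha \in \mathcal{M}_1^+(Y)} \langle \alpha, f \rangle - 2 \Big\langle \alpha, \log \frac{d\alpha}{d\mu} \Big\rangle = 2 \log \int_X \exp\Big(\frac{f(x)}{2}\Big) \, d\mu(x).$$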
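In the discrete case, this conjugation formula can be checked directly. The sketch below compares the closed form with the objective evaluated at the candidate maximizer, taken proportional to $\mu \exp(f/2)$, and at two other points of the simplex; the vectors `f` and `mu` and the function name `objective` are assumptions made for this illustration.

```python
import math

f = [0.5, -1.0, 2.0]           # hypothetical scores
mu = [0.3, 1.2, 0.7]           # positive weights, not necessarily normalized

def objective(alpha):
    """<alpha, f> - 2 <alpha, log(alpha / mu)> for alpha in the simplex."""
    return sum(a * (fi - 2.0 * math.log(a / m)) for a, fi, m in zip(alpha, f, mu))

# Candidate maximizer: alpha proportional to mu * exp(f / 2).
weights = [m * math.exp(fi / 2.0) for fi, m in zip(f, mu)]
z = sum(weights)
alpha_star = [w / z for w in weights]

closed_form = 2.0 * math.log(z)
value_at_star = objective(alpha_star)
value_uniform = objective([1.0 / 3.0] * 3)
value_other = objective([0.1, 0.1, 0.8])
```

The value at `alpha_star` should match `closed_form`, and the other two evaluations should fall below it, as expected for a maximizer of a concave objective.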

We now revert the first change of variable, setting , and . We have

$$\Omega^\star(f) = \max_{\alpha \in \mathcal{M}_1^+(Y)} -\log \iint_{X^2} \exp\Big(-\frac{f(x) + f(y) + C(x, y)}{2}\Big) \, d\alpha(x) \, d\alpha(y),$$

and the first part of the proposition follows:

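$$\text{g-LSE}(f) = \Omega^\star(f) = -\log \min_{\alpha \in \mathcal{M}_1^+(Y)} \Big\langle \alpha \otimes \alpha, \exp\Big(-\frac{f \oplus f + C}{2}\Big) \Big\rangle.$$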

We have assumed that is positive definite, which ensures that the bivariate function

$$\Phi(f, \alpha) \triangleq \Big\langle \alpha \otimes \alpha, \exp\Big(-\frac{f \oplus f + C}{2}\Big) \Big\rangle \tag{37}$$

is strictly convex in and in . Let . The gradient of with respect to is a measure that reads

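$$\nabla_f \Phi(f, \alpha) = -\alpha \exp\big(-f - T_C(-f, \alpha)\big) \in \mathcal{M}(Y), \tag{38}$$

where we recall

$$T_C(f, \alpha) \triangleq -2 \log \Big\langle \alpha, \exp\Big(\frac{f - C}{2}\Big) \Big\rangle. \tag{39}$$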

From a generalized version of the Danskin theorem (bernhard_variations_1990), the function

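$$f \mapsto \operatorname*{argmin}_{\alpha \in \mathcal{M}_1^+(Y)} \Big\langle \alpha \otimes \alpha, \exp\Big(-\frac{f \oplus f + C}{2}\Big) \Big\rangle \tag{40}$$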

is differentiable everywhere and has for gradient . Composing with the , we obtain

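$$\nabla \Omega^\star(f) \in \mathcal{M}_1^+(Y), \quad \text{and} \quad \nabla \Omega^\star(f) \propto \alpha^\star \exp\big(-f - T_C(-f, \alpha^\star)\big), \tag{41}$$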

where $\propto$ indicates proportionality. To conclude, we use Lemma 2, which describes the minimizers of (37) and which we prove in the next section. It ensures that on the support of . Therefore

$$\nabla \Omega^\star(f) = \alpha^\star \in \mathcal{M}_1^+(Y), \tag{42}$$

and the proposition follows. ∎
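For a finite class space, the variational form of the g-LSE and the associated g-softmax can be approximated with a generic solver. The sketch below minimizes the quadratic form in (37) over the simplex with plain projected gradient descent; this solver is an assumption chosen for simplicity, not the algorithm used in the paper, and the names `project_simplex` and `g_softmax`, the iteration count, and the off-diagonal cost of 100 are likewise hypothetical.

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumsum += u_k
        t = (cumsum - 1.0) / k
        if u_k - t > 0:
            theta = t              # keeps the threshold of the last valid k
    return [max(x - theta, 0.0) for x in v]

def g_softmax(f, C, n_iter=2000):
    """Approximate minimizer of (37) over the simplex, and -log of the minimum."""
    d = len(f)
    K = [[math.exp(-(f[i] + f[j] + C[i][j]) / 2.0) for j in range(d)] for i in range(d)]
    # The gradient 2 K alpha is Lipschitz with constant at most 2 * (max row sum of K).
    step = 1.0 / (2.0 * max(sum(row) for row in K))
    alpha = [1.0 / d] * d
    for _ in range(n_iter):
        grad = [2.0 * sum(K[i][j] * alpha[j] for j in range(d)) for i in range(d)]
        alpha = project_simplex([a - step * g for a, g in zip(alpha, grad)])
    value = sum(alpha[i] * K[i][j] * alpha[j] for i in range(d) for j in range(d))
    return alpha, -math.log(value)

# Sanity check: a cost that is huge off the diagonal makes the kernel nearly diagonal.
f = [0.0, 1.0, 2.0]
C = [[0.0 if i == j else 100.0 for j in range(3)] for i in range(3)]
alpha_star, g_lse = g_softmax(f, C)
z = sum(math.exp(fi) for fi in f)
softmax = [math.exp(fi) / z for fi in f]
lse = math.log(z)
```

In this nearly diagonal regime, `alpha_star` should be close to the ordinary softmax of `f` and `g_lse` close to its log-sum-exp, which gives a quick consistency check before trying costs with genuine interactions between classes.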

### A.3 Geometry of the link function: Proofs of Lemma 1 and Prop. 3

We first state and prove Lemma 2 on the optimality condition in the minimization of . We then prove Lemma 1, establish some basic properties of the extrapolation operator and prove Prop. 3.

#### A.3.1 Necessary and sufficient condition of optimality in ∇Ω⋆(f)

Finding the minimizer of amounts to finding the distribution for which and its C-transform are the least distant, as stated in the following lemma.

###### Lemma 2 (∇Ω⋆ from first order optimality condition).

is the only distribution such that there exists a constant such that

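$$\begin{aligned} \frac{f(y) + T(-f, \alpha)(y)}{2} &= A \quad \forall \, y \in \operatorname{supp} \alpha, \\ \frac{f(y) + T(-f, \alpha)(y)}{2} &\le A \quad \forall \, y \in Y \setminus \operatorname{supp} \alpha, \end{aligned}$$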