Learning with Differentiable Perturbed Optimizers

by Quentin Berthet, et al.

Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g. sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees. We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.



1 Introduction

Many applications of machine learning benefit from the possibility of training compositional models by gradient descent using end-to-end differentiability. Yet there remain many fields in which discrete decisions are required at intermediate steps of a data processing pipeline, notably those involving sequences of decisions and/or discrete objects (e.g., in robotics, graphics or biology). This is the result of many factors: discrete decisions provide a much sought-after interpretability of what a black box is actually doing, and discrete solvers are built upon decades of advances in combinatorial algorithms (Schrijver, 2003) to make quick decisions (e.g., sorting, picking closest neighbors, exploring options with beam search, and more generally with knapsack, routing and shortest-path problems). Even though these discrete decisions can be easily computed, in what would be called a forward pass in a deep learning context, the derivatives of these decisions with respect to inputs are degenerate: small changes in the inputs yield either no change or discontinuous changes in the outputs. As a consequence, discrete solvers break the back-propagation of computational graphs, and cannot be incorporated in end-to-end learning.

In order to expand the set of operations that can be incorporated in differentiable models, we propose and investigate a systematic method to transform blocks with discrete optimizers as outputs into differentiable operations. Our approach relies on the method of stochastic perturbations, the theory of which was developed and applied to several tasks of machine learning recently (see, e.g. Hazan et al., 2016, for a survey). In a nutshell, we perturb the inputs of a discrete solver with random noise, and consider the perturbed optimal solutions of the problem. The method is both easy to analyze theoretically and trivial to implement. Using the formal expectation of these perturbed solutions, we show that they are never locally constant and everywhere differentiable, with successive derivatives being expectations of simple expressions.

Related work.

Our work is part of growing efforts to modify operations in order to make them differentiable, which we now review.

Differentiating through an argmax. Several works have studied the introduction of regularization in the optimization problem in order to make the argmax differentiable. However, these works are usually problem-specific, since a new optimization problem needs to be solved. Examples include assignments (Adams & Zemel, 2011) and more generally optimal transport (Bonneel et al., 2016; Cuturi, 2013), differentiable dynamic programming (Mensch & Blondel, 2018), and differentiable submodular optimization (Djolonga & Krause, 2017). An exception is the so-called SparseMAP algorithm (Niculae et al., 2018), which is based on Frank-Wolfe or active-set algorithms for solving the $\ell_2$-regularized problem, and on implicit differentiation for computing a Jacobian. Like our proposal, SparseMAP only requires access to a linear maximization oracle. However, it is sequential in nature, while our approach is trivial to parallelize. More recently, Agrawal et al. (2019) analyze implicit differentiation on solutions of convex optimization. They express the derivatives of the argmax exactly, leading to a zero Jacobian almost everywhere for optimization over polytopes. Vlastelica et al. (2019) propose a new scheme to interpolate in a piecewise-linear manner between locally constant regions. Their aim is to keep the same value for the Jacobian of the argmax over a large region of inputs, allowing for zero Jacobians as well.

Perturbation methods. The idea of using the expectation of a perturbed max and argmax, commonly known as the “Gumbel trick”, dates back to Gumbel (1954), and the notion of random choice models (Luce, 1959; McFadden et al., 1973; Guadagni & Little, 1983). They are exploited in online learning and bandit problems in order to promote exploration, and induce robustness to adversarial data (see, e.g., Abernethy et al., 2016, for a survey). In relation with our work, they are used for action spaces that are combinatorial in nature (Neu & Bartók, 2016). The Gumbel trick has also been used together with a softmax to obtain differentiable sampling (Jang et al., 2016; Maddison et al., 2016).
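As a quick numerical illustration of the Gumbel trick mentioned above (our sketch, not code from any of the cited works): for i.i.d. standard Gumbel noise, the expected perturbed maximum equals the log-sum-exp of the scores shifted by the Euler-Mascheroni constant, which is easy to verify by Monte-Carlo:

```python
import numpy as np

# Monte-Carlo check of the Gumbel trick: E[max_i(theta_i + G_i)], for i.i.d.
# standard Gumbel G_i, equals logsumexp(theta) + Euler-Mascheroni constant.
rng = np.random.default_rng(0)
theta = np.array([1.0, 0.0, -0.5])

g = rng.gumbel(size=(500_000, theta.size))      # Gumbel(0, 1) samples
empirical = np.max(theta + g, axis=1).mean()    # averaged perturbed maximum
lse = np.log(np.exp(theta).sum())               # log-sum-exp of the scores

assert abs(empirical - (lse + np.euler_gamma)) < 0.02
```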

The use of perturbation techniques as an alternative to MCMC techniques for sampling was pioneered by Papandreou & Yuille (2011). They are used to compute the expected statistics arising in the gradient of conditional random fields. They show exactness for the fully perturbed (but intractable) case and propose "low-rank" perturbations as an approximation. These results are extended by Hazan & Jaakkola (2012), who proved that the maximum with low-rank perturbations, in expectation, provides an upper bound on the log-partition, and proposed to replace the log-partition in the conditional random field loss by that expectation. Their results, however, are limited to discrete product spaces. Shortly after, Hazan et al. (2013) derived new lower bounds on the partition function and proposed a new unbiased sequential sampler for the Gibbs distribution based on low-rank perturbations. These results were further refined by Gane et al. (2014) and Orabona et al. (2014). Shpakova & Bach (2016) further studied these bounds and proposed a doubly stochastic scheme. Balog et al. (2017) explored other distributions from extreme value theory, including the Fréchet and Weibull distributions. Apart from Lorberbom et al. (2019), who use a finite-difference method, we are not aware of any prior work using perturbation techniques to differentiate through an argmax. As reviewed above, all papers focus on (approximately) sampling from the Gibbs distribution, upper-bounding the log-partition function, or differentiating through the max.

Our contributions.


  • We propose a new general method to transform discrete optimizer outputs, inspired by the stochastic perturbation literature. This versatile method is easy to apply to any black-box solver without ad hoc modifications.

  • Our stochastic smoothing allows argmax differentiation, through the formal perturbed maximizer. We show that its Jacobian is well-defined and non-zero everywhere, thereby avoiding vanishing gradients.

  • The successive derivatives of the perturbed maximum and argmax are expressed as simple expectations, which are easy to approximate with Monte-Carlo methods.

  • This particular approach to operator smoothing yields natural connections to the recently-proposed Fenchel-Young losses by Blondel et al. (2019). We show that the equivalence via duality with regularized optimization makes Fenchel-Young losses particularly natural. We propose a doubly stochastic scheme for minimization of these losses, for unsupervised and supervised learning.

  • We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.

2 Perturbed maximizers

Given a finite set $\mathcal{Y} \subset \mathbb{R}^d$ of distinct points and its convex hull $\mathcal{C}$, we consider a general discrete optimization problem parameterized by an input $\theta \in \mathbb{R}^d$ as follows:

$$F(\theta) = \max_{y \in \mathcal{C}} \langle y, \theta \rangle, \qquad y^*(\theta) = \operatorname*{argmax}_{y \in \mathcal{C}} \langle y, \theta \rangle. \qquad (1)$$

As detailed in Section 2.2 below, this formulation encompasses a variety of discrete operations such as picking the maximum value or the top-$k$ largest values of a vector; ranking the entries of a vector; or computing the shortest path over a weighted graph, to name just a few. In all cases, $\mathcal{C}$ is a polytope and these problems are linear programs (LPs). For almost every $\theta$, the argmax is unique, and $y^*(\theta) = \nabla_\theta F(\theta)$. While widespread, these functions do not have the convenient properties of blocks in end-to-end learning methods, such as smoothness or differentiability. In particular, $y^*$ is piecewise constant: its gradient is zero almost everywhere, and undefined otherwise.

To address these issues, we simply add to $\theta$ a random noise vector $\varepsilon Z$, where $\varepsilon > 0$ is a temperature parameter and $Z$ has a positive and differentiable density $d\mu(z) \propto \exp(-\nu(z))\,dz$ on $\mathbb{R}^d$, ensuring that $\operatorname{argmax}_{y \in \mathcal{C}} \langle y, \theta + \varepsilon Z \rangle$ is almost surely (a.s.) uniquely defined. This induces a probability distribution $p_\theta$ on $\mathcal{Y}$ given by $p_\theta(y) = \mathbb{P}\big(y^*(\theta + \varepsilon Z) = y\big)$.

Taking expectations w.r.t. the random perturbation leads to smoothed versions of $F$ and $y^*$ (see Figure 1):

Definition 2.1.

For all $\theta \in \mathbb{R}^d$ and $\varepsilon > 0$, we define
the perturbed maximum as

$$F_\varepsilon(\theta) = \mathbb{E}\Big[\max_{y \in \mathcal{C}} \langle y, \theta + \varepsilon Z \rangle\Big],$$

the perturbed maximizer as

$$y^*_\varepsilon(\theta) = \mathbb{E}\Big[\operatorname*{argmax}_{y \in \mathcal{C}} \langle y, \theta + \varepsilon Z \rangle\Big].$$
This creates a general and natural model on the variable $y$, when observations are solutions of optimization problems with uncertain costs. It enables the modeling of phenomena where agents choose an optimal $y$ based on uncertain knowledge of $\theta$, or under varying circumstances. We view this as a generalization of, or alternative to, the Gibbs distribution, rather than an approximation thereof.

Figure 1: Over a polytope $\mathcal{C}$, stochastic perturbations on $\theta$ induce a probability distribution $p_\theta$ (in red) on the vertices $\mathcal{Y}$. The perturbed maximizer $y^*_\varepsilon(\theta)$ is the expectation of $Y$ under $p_\theta$.

Models of random optimizers for linear problems with perturbed inputs are the subject of a wide literature in machine learning, under the name of perturb-and-MAP (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012), and of the follow-the-perturbed-leader method in online learning (Hannan, 1957; Kalai & Vempala, 2003; Abernethy et al., 2014). We refer to it here as the perturbed model. A notable example is when $\mathcal{Y}$ is the set of one-hot encodings of $d$ classes, $\mathcal{C}$ is the unit simplex, and $Z$ has the Gumbel distribution (Gumbel, 1954). In that case it is well known that $p_\theta$ is the Gibbs distribution, proportional to $\exp(\langle y, \theta \rangle / \varepsilon)$, that $F_\varepsilon$ is (up to scaling by $\varepsilon$) the log-sum-exp function of $\theta/\varepsilon$, and that $y^*_\varepsilon(\theta)$ is the vector of softmax (or exponential weights) of the components of $\theta/\varepsilon$. The following result shows that the perturbed maximizer can also be defined as the solution of a convex problem, by Fenchel-Rockafellar duality.

Proposition 2.1.

Let $\Omega$ be the Fenchel dual of $F_1$ (the perturbed maximum at $\varepsilon = 1$), with domain $\mathcal{C}$. We have that

$$F_\varepsilon(\theta) = \max_{y \in \mathcal{C}} \big\{ \langle y, \theta \rangle - \varepsilon\,\Omega(y) \big\}, \qquad y^*_\varepsilon(\theta) = \operatorname*{argmax}_{y \in \mathcal{C}} \big\{ \langle y, \theta \rangle - \varepsilon\,\Omega(y) \big\} = \nabla_\theta F_\varepsilon(\theta). \qquad (2)$$
As $F_\varepsilon$ generalizes the log-sum-exp function obtained for Gumbel noise on the simplex, its dual $\Omega$ is a generalization of the negative entropy (which is the Fenchel dual of log-sum-exp). These connections have been studied in many parts of statistics and machine learning (Wainwright et al., 2008). In the literature on bandit problems and online learning, these links between regularization and perturbation for linear optimization problems are well studied, and applied to gradient-based algorithms (Abernethy et al., 2014, 2016).
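As a concrete numerical check of the Gumbel case discussed above (an illustrative sketch of ours, with hypothetical helper names), a Monte-Carlo average of one-hot argmax solutions under Gumbel perturbations recovers the softmax:

```python
import numpy as np

def perturbed_argmax(theta, eps=1.0, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of the perturbed maximizer over the unit simplex.

    Each sample solves the original problem (here a plain argmax) on a
    Gumbel-perturbed input; the empirical frequency of each winner
    estimates y*_eps(theta)."""
    rng = np.random.default_rng(seed)
    z = rng.gumbel(size=(n_samples, theta.size))
    winners = np.argmax(theta + eps * z, axis=1)
    return np.bincount(winners, minlength=theta.size) / n_samples

def softmax(theta, eps=1.0):
    e = np.exp((theta - theta.max()) / eps)
    return e / e.sum()

theta = np.array([2.0, 0.5, -1.0])
estimate = perturbed_argmax(theta)
# With Gumbel noise, the exact perturbed maximizer is softmax(theta / eps).
assert np.allclose(estimate, softmax(theta), atol=0.01)
```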

This model and the perturbed functions of Definition 2.1 inherit several important properties from this formulation. First, it allows taking derivatives with respect to the input $\theta$ of both $F_\varepsilon$ and $y^*_\varepsilon$ (Proposition 2.2). These derivatives are also easily expressed as expectations involving $F$ and $y^*$ with noisy inputs, as discussed in Section 3. In turn, this yields fast computational methods for these functions and their derivatives, described in Section 3.1. Further, by the duality point of view describing $y^*_\varepsilon$ as a regularized maximizer, there exists a natural convex loss for this model that can be efficiently optimized in $\theta$, given observed data. We describe this formalism in Section 4. All proofs are in the appendix.

2.1 Properties of the model

This model modifies the maximum and maximizer by perturbation. Because of the simple action of the stochastic noise $\varepsilon Z$, we can analyze their properties precisely.


Proposition 2.2.

Assume that $\mathcal{C}$ is a convex polytope with non-empty interior, and that $Z$ has a positive differentiable density. The perturbed model and the associated functions $F_\varepsilon$, $y^*_\varepsilon$, and $\Omega$ have the following properties, for $\theta \in \mathbb{R}^d$ and $\varepsilon > 0$:

  • $F_\varepsilon$ is strictly convex, twice differentiable and Lipschitz-continuous, and its gradient is Lipschitz-continuous with a constant proportional to $1/\varepsilon$.

  • $\Omega$ is strongly convex, differentiable, and Legendre-type.

  • $y^*_\varepsilon(\theta)$ is in the interior of $\mathcal{C}$ and is differentiable in $\theta$.

  • Impact of the temperature $\varepsilon$: we have $F_\varepsilon(\theta) = \varepsilon F_1(\theta/\varepsilon)$ and $y^*_\varepsilon(\theta) = y^*_1(\theta/\varepsilon)$.

We develop the simple expressions for the derivatives of $F_\varepsilon$ and $y^*_\varepsilon$ in further detail in Section 3. By this proposition, since $F_\varepsilon$ is strictly convex, it is nowhere locally linear, so $y^*_\varepsilon$ is nowhere locally constant. Formally, $y^*_\varepsilon = \nabla F_\varepsilon$ is a mirror map, a one-to-one mapping from $\mathbb{R}^d$ onto the interior of $\mathcal{C}$. The gradient of $\Omega$ is its functional inverse, by convex duality between these functions (see, e.g., the surveys Wainwright et al., 2008; Bubeck et al., 2015, and references therein).

Remark 1.

For these properties to hold, it is crucial that $\mathcal{C}$ has non-empty interior, i.e., that $\mathcal{Y}$ does not lie in an affine subspace of lower dimension. To adapt to cases where $\mathcal{Y}$ lies in such a subspace, we consider inputs up to vectors orthogonal to that subspace, or represent $\mathcal{C}$ in a lower-dimensional space. As an example, over the unit simplex with Gumbel noise, the log-sum-exp function is not strictly convex, and is in fact linear along the all-ones vector $\mathbf{1}$. In such cases, the model is only well-specified in $\theta$ up to the space orthogonal to $\mathcal{C}$, which does not affect prediction tasks.

For any positive temperature, these properties imply that there is an informative, well-defined and nonzero gradient in $\theta$. The limiting behavior at extreme temperatures can also be explicitly derived from these properties.


Proposition 2.3.

With the conditions of Proposition 2.2, for $\theta$ such that $y^*(\theta)$ is a unique maximizer:

- When $\varepsilon \to 0$, $F_\varepsilon(\theta) \to F(\theta)$ and $y^*_\varepsilon(\theta) \to y^*(\theta)$.

- When $\varepsilon \to \infty$, $y^*_\varepsilon(\theta) \to \operatorname{argmin}_{y \in \mathcal{C}} \Omega(y)$.

For every $\theta$, we have $\varepsilon c_1 \le F_\varepsilon(\theta) - F(\theta) \le \varepsilon c_2$,

for constants $c_1$ and $c_2$ depending only on $\mathcal{C}$ and the noise distribution.

The properties of the distribution $p_\theta$ in this model are well studied in the perturbations literature (see, e.g., Hazan et al., 2016, for a survey). Notably, it does not have a simple closed-form expression in general, but it can be very easy to sample from. By the argmax definition, simulating $y \sim p_\theta$ only requires sampling $Z$ (e.g., Gaussian, or a vector of i.i.d. Gumbel variables) and solving the original optimization problem. This is the case in the applications we consider (e.g., max, ranking, shortest paths). This is in stark contrast to the Gibbs distribution, which has the opposite properties.

2.2 Examples

Many operations frequently used in machine learning can be written as optimal decisions over a discrete set $\mathcal{Y}$, in the form of the problem in Equation (1). They correspond to linear programs (typically with integral solutions), but we emphasize again that this structure need not be known or exploited to use the perturbed maximizers in practice.

Maximum. The max function from $\mathbb{R}^d$ to $\mathbb{R}$, which returns the largest of the $d$ entries of a vector, is ubiquitous in machine learning, the hallmark of any classification task. It is equal to an LP over the unit simplex, with $y^*(\theta)$ the one-hot indicator of the largest entry of $\theta$.

Top-$k$. The function from $\mathbb{R}^d$ to $\mathbb{R}$ that returns the sum of the $k$ largest entries of a vector is also commonly used. Akin to the maximum, it is the value of an LP (cf. Appendix for more details).
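For instance (an illustrative sketch of ours, with hypothetical helper names), the perturbed top-$k$ selector can be obtained by averaging hard top-$k$ indicators over noisy inputs:

```python
import numpy as np

def topk_indicator(theta, k):
    """Argmax of <y, theta> over {y in {0,1}^d : sum(y) = k}:
    the 0/1 indicator of the k largest entries."""
    y = np.zeros_like(theta)
    y[np.argpartition(-theta, k - 1)[:k]] = 1.0
    return y

def perturbed_topk(theta, k, eps=0.5, n_samples=10_000, seed=0):
    """Monte-Carlo perturbed top-k: a smooth indicator with entries in [0, 1]."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, theta.size))
    return np.mean([topk_indicator(t, k) for t in theta + eps * z], axis=0)

theta = np.array([1.5, 0.2, -0.3, 1.0])
hard = topk_indicator(theta, 2)        # selects entries 0 and 3
soft = perturbed_topk(theta, 2)        # fractional memberships, summing to k
assert np.array_equal(hard, [1.0, 0.0, 0.0, 1.0])
assert np.isclose(soft.sum(), 2.0)
assert soft[0] > soft[1] > soft[2]
```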

Ranking. The function returning the ranks (in descending order) of a vector $\theta \in \mathbb{R}^d$ can be written as the argmax of a linear program over the permutahedron, the convex hull of the permutations of any vector $\rho$ with distinct entries. Different reference vectors $\rho$ yield different perturbed operations; a vector with evenly spaced entries such as $(1, \ldots, d)$ is commonly used.
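As a small sketch (ours, using ascending ranks with $\rho = (1, \ldots, d)$), perturbed ranks are averages of hard ranks of noisy inputs, and they vary smoothly with the input while preserving its ordering:

```python
import numpy as np

def hard_rank(theta):
    """Ascending ranks: the smallest entry gets 1 and the largest gets d.
    This is the argmax of <y, theta> over the permutahedron of (1, ..., d)."""
    out = np.empty_like(theta)
    out[np.argsort(theta)] = np.arange(1, theta.size + 1)
    return out

def perturbed_rank(theta, eps=0.5, n_samples=5_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, theta.size))
    return np.mean([hard_rank(t) for t in theta + eps * z], axis=0)

theta = np.array([0.3, -1.2, 2.0, 0.1])
hard = hard_rank(theta)
soft = perturbed_rank(theta)                   # real-valued "soft" ranks
assert list(hard) == [3.0, 1.0, 4.0, 2.0]
assert soft[2] > soft[0] > soft[3] > soft[1]   # the ordering is preserved
```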

Shortest paths. For a graph $G = (V, E)$ and positive costs over its edges, the problem of finding a shortest path (i.e., one with minimal total cost) between a source and a target vertex can be written in our setting as an LP over path indicator vectors (see Appendix).
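On a toy graph (an illustrative sketch of ours), the perturbed shortest path averages path indicator vectors over perturbed edge costs, putting most of its mass on the cheaper route:

```python
import numpy as np

# Tiny DAG: s->a, s->b, a->t, b->t, with edge order [sa, sb, at, bt].
# Each s-t path is a 0/1 indicator vector over the edges.
PATHS = np.array([[1, 0, 1, 0],    # path s-a-t
                  [0, 1, 0, 1]],   # path s-b-t
                 dtype=float)

def shortest_path(costs):
    """Minimizer of <y, costs> over the path indicators (equivalently,
    an argmax of the negated costs)."""
    return PATHS[np.argmin(PATHS @ costs)]

def perturbed_shortest_path(costs, eps=0.5, n_samples=20_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, costs.size))
    return np.mean([shortest_path(c) for c in costs + eps * z], axis=0)

costs = np.array([1.0, 1.0, 0.5, 1.5])    # s-a-t costs 1.5, s-b-t costs 2.5
hard = shortest_path(costs)
soft = perturbed_shortest_path(costs)     # smooth edge "usage" values
assert np.array_equal(hard, [1.0, 0.0, 1.0, 0.0])
assert soft[0] > 0.6 and soft[0] > soft[1]
```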

Other discrete problems. Finding linear assignments, minimum spanning trees or maximum flows, and even applying logical operators, can all be written in this form. The strength of our method is that it adapts to any setting where the scores to be maximized correspond to a linear function $\langle y, \theta \rangle$, for some embedding $\mathcal{Y}$ of the possible outputs in $\mathbb{R}^d$.

3 Differentiation of soft maximizers

As noted above, for suitable noise distributions, the perturbed maximizer $y^*_\varepsilon$ is differentiable in its inputs, with non-zero Jacobian. Further, the derivatives associated to this model can be expressed as simple expectations.

Proposition 3.1.

(Abernethy et al., 2016, Lemma 1.5) For noise $Z$ with density $d\mu(z) \propto \exp(-\nu(z))\,dz$ and twice differentiable $\nu$, the following hold:

$$\nabla_\theta F_\varepsilon(\theta) = \mathbb{E}\big[y^*(\theta + \varepsilon Z)\big] = \mathbb{E}\big[F(\theta + \varepsilon Z)\,\nabla_z \nu(Z)/\varepsilon\big],$$

$$J_\theta\, y^*_\varepsilon(\theta) = \mathbb{E}\big[y^*(\theta + \varepsilon Z)\,\nabla_z \nu(Z)^\top/\varepsilon\big].$$

We discuss in the following subsection efficient techniques to evaluate $y^*_\varepsilon$ and its Jacobian in practice, or to generate stochastic gradients, based on these expressions.

Remark 2.

Being able to compute the perturbed maximizer $y^*_\varepsilon$ and its Jacobian allows one to optimize functions that depend on $\theta$ through $y^*_\varepsilon(\theta)$. This can be used to alter the costs to promote solutions with certain desired properties. Moreover, in a supervised learning setting, this allows training models containing blocks with inputs $\theta = g_w(x)$ (for some features $x$) and outputs $y^*_\varepsilon(\theta)$, compared to a response $y$ through a loss $L$.

For first-order methods, back-propagating requires not only the usual architecture-dependent derivatives of $g_w$, but also a gradient in $\theta$ of the loss. If this block is a strict discrete maximizer $\theta \mapsto y^*(\theta)$, as noted above, the computational graph is broken. However, with our proposed modification, we have that

$$\nabla_\theta \big[ L\big(y^*_\varepsilon(\theta)\big) \big] = J_\theta\, y^*_\varepsilon(\theta)^\top\, \nabla L\big(y^*_\varepsilon(\theta)\big), \qquad (3)$$

and the gradient can be fully backpropagated. Perturbed maximizers can therefore be used in end-to-end prediction models, for any loss $L$ on the predicted maximizer. Furthermore, we describe in Section 4 a loss that can be directly optimized in $\theta$ by first-order methods. It comes with a strong algorithmic advantage, as it requires only computing the perturbed maximizer, and not its Jacobian.

3.1 Practical implementation

For any $\theta$, the perturbed maximizer $y^*_\varepsilon(\theta)$ is a solution of a convex optimization problem (Eq. 2). If $\Omega$ has a simple form, $y^*_\varepsilon$ can be computed either explicitly or through convex optimization (e.g., exponential weights for the entropic regularizer on the simplex).

More generally, by their expressions as expectations, the perturbed maximizer and its Jacobian in the input can be stochastically approximated with Monte-Carlo methods. This only requires being able to sample from $Z$ efficiently, and to solve the maximization problem over $\mathcal{C}$, which is a much weaker requirement (see, e.g., the examples in Section 2.2).

Definition 3.1.

Given $\theta \in \mathbb{R}^d$, let $Z^{(1)}, \ldots, Z^{(M)}$ be i.i.d. copies of $Z$ and, for $m \in \{1, \ldots, M\}$, let

$$y^{(m)} = y^*\big(\theta + \varepsilon Z^{(m)}\big).$$

A Monte-Carlo estimate of $y^*_\varepsilon(\theta)$ is:

$$\bar{y}_{\varepsilon, M}(\theta) = \frac{1}{M} \sum_{m=1}^{M} y^{(m)}.$$

Since $\mathbb{E}\big[y^{(m)}\big] = y^*_\varepsilon(\theta)$ for every $m$, by definition of $y^*_\varepsilon$, this yields an unbiased estimate of $y^*_\varepsilon(\theta)$.
Note that the formulae in Proposition 3.1 give several manners of stochastically approximating $F_\varepsilon$, $y^*_\varepsilon$ and their derivatives, by averaging over evaluations of $F$, $y^*$ and $\nabla\nu$. This yields unbiased estimates of $F_\varepsilon$, of $y^*_\varepsilon$, and of its Jacobian. The plurality of these formulae gives the user several options for practical implementation, depending on how convenient it is to compute the maximum or the maximizer, how numerically stable they are, or what the impact of the $1/\varepsilon$ factors is.
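For standard Gaussian noise, $\nu(z) = \lVert z \rVert^2/2$ and $\nabla\nu(z) = z$, so the Jacobian estimate reduces to a simple correlation between perturbed solutions and noise. The following sketch (ours, with hypothetical helper names) also checks two structural properties: symmetry (the Jacobian is the Hessian of $F_\varepsilon$) and zero row sums (the maximizer stays on the simplex):

```python
import numpy as np

def perturbed_argmax_jacobian(theta, eps=1.0, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of J_theta y*_eps(theta) = E[y*(theta + eps Z) Z^T]/eps
    for standard Gaussian Z (Proposition 3.1 with nu(z) = ||z||^2 / 2),
    here for the argmax over the unit simplex."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, theta.size))
    onehots = np.eye(theta.size)[np.argmax(theta + eps * z, axis=1)]
    return onehots.T @ z / (eps * n_samples)

theta = np.array([0.4, 0.0, -0.2])
J = perturbed_argmax_jacobian(theta)
assert np.allclose(J, J.T, atol=0.01)              # Hessian of F_eps: symmetric
assert np.allclose(J.sum(axis=1), 0.0, atol=0.01)  # rows sum to zero (simplex)
assert np.all(np.diag(J) > 0)                      # raising theta_i favors y_i
```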


When these estimates or their derivatives are used in stochastic gradient descent for training in supervised learning, a full approximation of the gradients is not always necessary. Indeed, taking only one (or a small number) of perturbations is acceptable here, as the gradients are stochastic in the first place. We call this scheme doubly stochastic, as it is stochastic with respect to both training samples and noise.

A great strength of this method is the absence of conceptual overhead or of supplementary computations: only samples from the chosen noise distribution and solutions of the original problem are required. Further, even though our analysis relies on the specific structure of the problem as an LP, these algorithms do not. The Monte-Carlo estimates can be obtained by calling a solver as a black box, without requiring knowledge of the problem or of the algorithm that solves it.

Implementation details. Two ways in which this implementation can be optimized are parallelization and warm starts. To alleviate the dependency of the running time on the number of perturbations $M$, we can independently sample the $Z^{(m)}$ and compute the $y^{(m)}$ in parallel. Moreover, for certain algorithms, starting from a solution or near-solution can lead to significant speed-ups; using the unperturbed solution $y^*(\theta)$ as initialization can improve running times dramatically, especially at lower temperatures.

4 Perturbed model learning with Fenchel-Young losses

There is a large literature on learning the parameter $\theta$ of a Gibbs distribution from data $y_1, \ldots, y_n$, through maximization of the likelihood:

$$\max_\theta\; \frac{1}{n} \sum_{i=1}^{n} \langle y_i, \theta \rangle - A(\theta), \qquad (4)$$

where $A$ denotes the log-partition function. Such objectives can be optimized by taking gradients of the empirical log-likelihood, of the form

$$\frac{1}{n} \sum_{i=1}^{n} y_i - \mathbb{E}_{\mathrm{Gibbs}}[Y],$$

thus earning the name of moment-matching procedures. The expectation under the Gibbs distribution is however hard to evaluate in some cases. This motivates its replacement by $y^*_\varepsilon(\theta)$ (perturb-and-MAP in this literature), and the use of this method to learn the parameters of the model (Papandreou & Yuille, 2011), as a proxy for maximum likelihood.

This minimization is equivalent to maximizing a term akin to Equation (4), substituting the log-partition function with $F_\varepsilon$. We show here that this approach can be formally analyzed through the use of Fenchel-Young losses (Blondel et al., 2019) in this context. The use of these losses also drastically improves the algorithmic aspects of the learning tasks, because of the specific expression of the gradient in $\theta$ of the loss.

In the perturbed model, the Fenchel-Young loss is defined for $\theta \in \mathbb{R}^d$ and $y \in \mathcal{C}$ by

$$L_\varepsilon(\theta; y) = F_\varepsilon(\theta) + \varepsilon\,\Omega(y) - \langle \theta, y \rangle.$$

It is, among other things, nonnegative, convex in $\theta$, and minimized with value 0 if and only if $\theta$ is such that $y^*_\varepsilon(\theta) = y$. It is equal to the Bregman divergence associated with $\varepsilon\Omega$.

As $\theta$ and $y$ interact in this loss only through a scalar product, for random $Y$ we have

$$\mathbb{E}\big[L_\varepsilon(\theta; Y)\big] = L_\varepsilon\big(\theta; \mathbb{E}[Y]\big) + c,$$

where $c$ does not depend on $\theta$. This is particularly convenient in analyzing the performance of Fenchel-Young losses in generative models. The gradient of the loss is

$$\nabla_\theta L_\varepsilon(\theta; y) = y^*_\varepsilon(\theta) - y.$$

The Fenchel-Young loss can therefore be interpreted as a loss in $\theta$ whose gradient is a function of $y^*_\varepsilon(\theta)$. Moreover, it can be optimized in $\theta$ with first-order methods simply by computing the perturbed maximizer, without having to compute its Jacobian. It is therefore a particular case of the situation described in Eq. (3), allowing one to virtually bypass the perturbed maximizer block in the output, and to directly optimize a loss between observations $y$ and model outputs $\theta$.
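Concretely (an illustrative sketch of ours, with hypothetical helper names), a stochastic gradient of the Fenchel-Young loss needs only perturbed argmax calls, never a Jacobian:

```python
import numpy as np

def fy_loss_gradient(theta, y, eps=1.0, n_samples=1_000, seed=0):
    """Stochastic gradient of the Fenchel-Young loss: y*_eps(theta) - y,
    with y*_eps estimated by a Monte-Carlo perturbed argmax (Gaussian noise)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, theta.size))
    onehots = np.eye(theta.size)[np.argmax(theta + eps * z, axis=1)]
    return onehots.mean(axis=0) - y

theta = np.array([0.0, 1.0, 0.0])
y = np.array([0.0, 1.0, 0.0])        # observed one-hot decision
g = fy_loss_gradient(theta, y)
assert abs(g.sum()) < 1e-9           # both terms lie on the simplex
assert g[1] < 0                      # a gradient step raises theta_1 further
```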

These ideas are developed in the following two subsections, in the cases of unsupervised and supervised learning.

4.1 Unsupervised learning - parameter estimation

For a given sequence of observations $y_1, \ldots, y_n$, a natural problem is to find the parameter $\theta$ that best fits these observations under the Fenchel-Young loss. We show here that this approach is particularly appropriate if the $y_i$'s are generated by the perturbed model,

Figure 2: Results for image classification with the perturbed argmax. Left: training accuracy, using the cross-entropy and the Fenchel-Young loss for two sample sizes $M$. Center: test accuracy for these three methods. Right: impact of the parameter $\varepsilon$ on the test and train loss.

that is, $y_i = y^*(\theta_0 + \varepsilon Z_i)$, for some unknown $\theta_0$. In this unsupervised model, we therefore have natural empirical and population losses, given a sample of size $n$:

$$\bar{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon(\theta; y_i), \qquad \bar{L}(\theta) = \mathbb{E}\big[L_\varepsilon(\theta; Y)\big].$$

Their gradients are given by

$$\nabla_\theta \bar{L}_n(\theta) = y^*_\varepsilon(\theta) - \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \nabla_\theta \bar{L}(\theta) = y^*_\varepsilon(\theta) - \mathbb{E}[Y].$$

The empirical loss is minimized for $\hat{\theta}_n$ such that $y^*_\varepsilon(\hat{\theta}_n) = \frac{1}{n}\sum_i y_i$, and the population loss when $y^*_\varepsilon(\theta) = \mathbb{E}[Y]$, i.e., at $\theta = \theta_0$. As a consequence, the whole battery of statistical results, from asymptotic to non-asymptotic, can be leveraged, and we present the simplest one (asymptotic normality).

Proposition 4.1.

When $n$ goes to $\infty$, under the assumptions of Proposition 2.2 on the model, we have

$$\sqrt{n}\,\big(\hat{\theta}_n - \theta_0\big) \longrightarrow \mathcal{N}(0, \Sigma)$$

in distribution, where $\Sigma$ is the covariance of $\big(J_\theta\, y^*_\varepsilon(\theta_0)\big)^{-1} Y$.

In practice, $\theta$ can be fitted by stochastic optimization. Taking the loss on a single observation $y_i$ yields the gradient

$$\nabla_\theta L_\varepsilon(\theta; y_i) = y^*_\varepsilon(\theta) - y_i.$$

As a consequence, as usual in stochastic optimization, this is a stochastic gradient either for $\bar{L}_n$ (w.r.t. an index $i$ uniform in $\{1, \ldots, n\}$) or for $\bar{L}$ (w.r.t. $y_i$ drawn from $p_{\theta_0}$).

The methods described in Section 3.1 to stochastically approximate the gradient are particularly well adapted here. Indeed, following Shpakova & Bach (2016), given an observation $y_i$ and a current value $\theta_t$, a doubly stochastic version of the gradient is obtained by

$$g_t = \frac{1}{M} \sum_{m=1}^{M} y^*\big(\theta_t + \varepsilon Z^{(m)}\big) - y_i. \qquad (5)$$

This can also be incorporated into a procedure where batches of data points are used to compute approximate gradients, where the number $M$ of artificial samples and the batch size can be chosen separately.
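Putting this doubly stochastic scheme to work on a toy instance (an illustrative sketch of ours, not code from the paper): with Gaussian perturbations on the simplex, averaged doubly stochastic SGD recovers the true parameter up to the constant-shift ambiguity of Remark 1:

```python
import numpy as np

def argmax_onehot(theta):
    y = np.zeros_like(theta)
    y[np.argmax(theta)] = 1.0
    return y

rng = np.random.default_rng(0)
eps, theta_true = 1.0, np.array([1.0, 0.0, -1.0])

# Observations drawn from the perturbed model: y_i = y*(theta_true + eps Z_i).
Y = np.stack([argmax_onehot(theta_true + eps * rng.normal(size=3))
              for _ in range(2_000)])

# Doubly stochastic SGD on the Fenchel-Young loss: one observation and one
# fresh perturbation per step (M = 1), with Polyak averaging of the iterates.
theta, theta_avg, lr, T = np.zeros(3), np.zeros(3), 0.1, 30_000
for _ in range(T):
    y_i = Y[rng.integers(len(Y))]
    y_hat = argmax_onehot(theta + eps * rng.normal(size=3))
    theta -= lr * (y_hat - y_i)
    theta_avg += theta / T

# theta is identified only up to constant shifts (Remark 1): compare centered.
center = lambda v: v - v.mean()
assert np.linalg.norm(center(theta_avg) - center(theta_true)) < 0.5
assert theta_avg[0] > theta_avg[1] > theta_avg[2]
```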

4.2 Supervised learning

This loss can also be applied to a supervised learning task with observations $(x_i, y_i)$, in a model class of parameterized functions $g_w$, for $w \in \mathcal{W}$. The Fenchel-Young loss between $g_w(x_i)$ and $y_i$ is, in this case as well, a natural and convenient loss to minimize. We consider

$$\min_w\; \frac{1}{n} \sum_{i=1}^{n} L_\varepsilon\big(g_w(x_i); y_i\big).$$

As in the unsupervised setting, this loss can be motivated by a generative model: for some parameter $w_0$,

$$y_i = y^*\big(g_{w_0}(x_i) + \varepsilon Z_i\big). \qquad (6)$$

This gives a natural distribution for discrete outputs that are optimizers, where the population loss is $\mathbb{E}\big[L_\varepsilon(g_w(X); Y)\big]$. The population loss is therefore minimized at $w_0$. The gradient of the empirical loss is given by

$$\frac{1}{n} \sum_{i=1}^{n} \big(\partial_w g_w(x_i)\big)^\top \big( y^*_\varepsilon(g_w(x_i)) - y_i \big).$$

Each term in the sum, the gradient of the loss for a single observation, is therefore a stochastic gradient for the empirical loss (w.r.t. an index $i$ uniform in $\{1, \ldots, n\}$) or for the population loss (w.r.t. a random $(X, Y)$ from the generative model). As in unsupervised learning, a doubly stochastic gradient is obtained by averaging $M$ perturbed argmax solutions.
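A minimal end-to-end sketch of this supervised scheme (ours, not code from the paper) with a linear model $\theta = Wx$: by the chain rule, the per-sample gradient in $W$ is the outer product of $y^*_\varepsilon(Wx) - y$ with $x$:

```python
import numpy as np

def argmax_onehot(theta):
    y = np.zeros_like(theta)
    y[np.argmax(theta)] = 1.0
    return y

rng = np.random.default_rng(0)
eps = 1.0
W_true = np.array([[ 2.0, 0.0],
                   [-1.0, 1.5],
                   [ 0.0, -2.0]])

# Supervised data from the generative model with g_w(x) = W x.
X = rng.normal(size=(5_000, 2))
Y = np.stack([argmax_onehot(W_true @ x + eps * rng.normal(size=3)) for x in X])

# Doubly stochastic gradient on the Fenchel-Young loss (M = 1), with averaging.
W, W_avg, lr, T = np.zeros((3, 2)), np.zeros((3, 2)), 0.05, 40_000
for _ in range(T):
    i = rng.integers(len(X))
    y_hat = argmax_onehot(W @ X[i] + eps * rng.normal(size=3))
    W -= lr * np.outer(y_hat - Y[i], X[i])   # chain rule through theta = W x
    W_avg += W / T

# The unperturbed decisions of the learned model mostly match those of W_true.
preds = np.argmax(X @ W_avg.T, axis=1)
truth = np.argmax(X @ W_true.T, axis=1)
assert (preds == truth).mean() > 0.8
```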

5 Experiments

We demonstrate how perturbed maximizers can be used in a supervised learning setting, as described in Section 4. We do so in several tasks with features $x_i$ and responses $y_i$ in a discrete set of optimizers $\mathcal{Y}$. In this section, since we focus on the prediction task, the issues raised in Remark 1 do not apply, and for ease of notation we write these maxima over polytopes that might have empty interior.

When learning with Fenchel-Young losses, we simulate doubly stochastic gradients of the empirical loss with $M$ artificial perturbations (see Equation 5).

We give here a proof of concept of the adaptivity of this method to several tasks, and exhibit its performance for simple models (small neural networks, linear models).

We will open-source a package that makes any black-box solver differentiable in just a few lines of code.

Figure 3: In the shortest paths experiment, training features are images. Shortest paths are computed based on terrain costs, hidden to the network. Training responses are shortest paths based on this cost. We illustrate the impact of the temperature on perturbed shortest paths.

5.1 Perturbed max

We use the perturbed argmax with Gaussian noise in an image classification task. We train a vanilla CNN made of 4 convolutional and 2 fully connected layers on the CIFAR-10 dataset for 600 epochs with minibatches of size 32. The 10 network outputs are the entries of $\theta$, and we minimize the Fenchel-Young loss $L_\varepsilon(\theta; y)$ between $\theta$ and the one-hot encoding $y$ of the label, for different temperatures $\varepsilon$ and numbers of perturbations $M$.

We analyze the impact of these two algorithmic parameters on optimization and generalization. We exhibit the final loss and accuracy for different numbers of perturbations $M$ in the doubly stochastic gradient, and observe competitive performance compared to using the standard cross-entropy loss (Figure 2, left and center).

We also highlight the importance of the temperature parameter $\varepsilon$ for the algorithm (see Figure 2, right). Very low temperatures do not smooth the maximizer enough, and no fitting occurs, even on training data. At very high temperatures, the perturbed maximizer carries less information about $\theta$, and gradient estimates are closer to pure noise, which degrades the ability both to fit the training data and to generalize to the test data.

5.2 Perturbed ranking

We create a ranking task by projecting, for each instance $i$, feature vectors along an unknown direction $w$. The label $y_i$ is then given by the vector of ranks of these projections. This provides a simple experiment with permuted vectors as labels, for which the Fenchel-Young loss is convex in $\theta$: we fit $y^*_\varepsilon(\theta_i)$ to the response $y_i$. Since $\theta_i$ is linear in $w$, the loss is also convex in $w$, and we explore the performance of our framework in this simple setting. In our experiments we use instances of vectors in small dimension, split between a training set and a test set.

To explore the complexity of this task, we create a range of datasets where the projections are perturbed with Gaussian noise of increasing variance before being ranked. These labels follow precisely the distribution of the generative model described in Equation (6), and the population loss is indeed minimized at the unknown direction $w$. For noiseless labels, our doubly stochastic scheme allows us to generalize accurately to the test data after a few epochs (see Figure 4). Even under small variances, we are able to accurately predict a high proportion of the correct ranks, and predictably fail for larger variances.

Figure 4: The accuracy, measured in percentage of perfect ranks (complete recovery) and partial ranks (mean of correct positions), under different noise variances on the labels.

5.3 Perturbed shortest path

We replicate the experiment of Vlastelica et al. (2019), aiming to learn the travel costs in graphs based on features, given examples of shortest-path solutions (see Figure 3). The dataset consists of 10,000 RGB images illustrating Warcraft terrains in the form of 2D grid networks. The responses are shortest paths between the top-left and bottom-right corners, for costs, hidden to the network, corresponding to the terrain type. These responses are represented as boolean matrices indicating the vertices along the shortest path.

Figure 5: Accuracy of the predicted path, measured by the ratio of costs between the predicted path and the actual shortest path, for several values of the temperature parameter $\varepsilon$.

We train a purely convolutional neural network made of 3 layers with a Fenchel-Young loss between the predicted costs $\theta$ and the shortest path $y$. We optimize with minibatches, several values of the temperature parameter $\varepsilon$, and a single perturbation ($M = 1$). After only a few epochs, we generalize very well and accurately predict the shortest path on the test data (see Figure 5). We also show the impact of the temperature parameter on a finer metric based on the ratio between the optimal and the predicted cost.

6 Conclusion

Despite a large body of work on perturbation techniques for machine learning, most existing works focus on approximating sampling, log-partitions, and expectations under the Gibbs distribution. Together with novel theoretical insights, we proposed a framework for differentiating through not only a max, but also an argmax, without ad hoc modification of the underlying solver. In addition, by defining an equivalent regularizer $\Omega$, we showed how to construct Fenchel-Young losses and proposed a doubly stochastic scheme, enabling both unsupervised and supervised learning. Experiments validated the ease of application of our framework to various tasks.


FB’s work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). FB also acknowledges support from the European Research Council (grant SEQUOIA 724063).


  • Abernethy et al. (2014) Abernethy, J., Lee, C., Sinha, A., and Tewari, A. Online linear optimization via smoothing. In Conference on Learning Theory, pp. 807–823, 2014.
  • Abernethy et al. (2016) Abernethy, J., Lee, C., and Tewari, A. Perturbation techniques in online learning and optimization. Perturbations, Optimization, and Statistics, pp. 233, 2016.
  • Adams & Zemel (2011) Adams, R. P. and Zemel, R. S. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
  • Agrawal et al. (2019) Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570, 2019.
  • Bach et al. (2012) Bach, F., Jenatton, R., Mairal, J., Obozinski, G., et al. Optimization with sparsity-inducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
  • Balog et al. (2017) Balog, M., Tripuraneni, N., Ghahramani, Z., and Weller, A. Lost relatives of the Gumbel trick. In Proc. of ICML, pp. 371–379, 2017.
  • Blondel (2019) Blondel, M. Structured prediction with projection oracles. In Proc. of NeurIPS, 2019.
  • Blondel et al. (2019) Blondel, M., Martins, A. F., and Niculae, V. Learning with Fenchel-Young losses. arXiv preprint arXiv:1901.02324, 2019.
  • Bonneel et al. (2016) Bonneel, N., Peyré, G., and Cuturi, M. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Trans. Graph., 35(4):71–1, 2016.
  • Bubeck et al. (2015) Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • Chandrasekaran et al. (2012) Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
  • Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
  • Djolonga & Krause (2017) Djolonga, J. and Krause, A. Differentiable learning of submodular models. In Proc. of NeurIPS, pp. 1013–1023, 2017.
  • Gane et al. (2014) Gane, A., Hazan, T., and Jaakkola, T. Learning with maximum a-posteriori perturbation models. In Proc. of AISTATS, pp. 247–256, 2014.
  • Guadagni & Little (1983) Guadagni, P. M. and Little, J. D. A logit model of brand choice calibrated on scanner data. Marketing Science, 2(3):203–238, 1983.
  • Gumbel (1954) Gumbel, E. J. Statistical Theory of Extreme Values and some Practical Applications: A Series of Lectures. Number 33. US Govt. Print. Office, 1954.
  • Hannan (1957) Hannan, J. Approximation to Bayes risk in repeated plays. In Dresher, M., Tucker, A. W., and Wolfe, P. (eds.), Contributions to the Theory of Games 3, pp. 97–139. Princeton University Press, 1957.
  • Hazan & Jaakkola (2012) Hazan, T. and Jaakkola, T. On the partition function and random maximum a-posteriori perturbations. Proc. of ICML, 2012.
  • Hazan et al. (2013) Hazan, T., Maji, S., and Jaakkola, T. On sampling from the Gibbs distribution with random maximum a-posteriori perturbations. In Advances in Neural Information Processing Systems, pp. 1268–1276, 2013.
  • Hazan et al. (2016) Hazan, T., Papandreou, G., and Tarlow, D. Perturbations, Optimization, and Statistics. MIT Press, 2016.
  • Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Kalai & Vempala (2003) Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. In Proc. 16th Annual Conference on Learning Theory (COLT-2003), Lecture Notes in Artificial Intelligence, pp. 506–521. Springer, 2003.

  • Lorberbom et al. (2019) Lorberbom, G., Gane, A., Jaakkola, T., and Hazan, T. Direct optimization through argmax for discrete variational auto-encoder. In Proc. of NeurIPS, 2019.
  • Luce (1959) Luce, R. D. Individual choice behavior. 1959.
  • Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • McFadden et al. (1973) McFadden, D. et al. Conditional logit analysis of qualitative choice behavior. 1973.
  • Mensch & Blondel (2018) Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In Proc. of ICML, 2018.
  • Neu & Bartók (2016) Neu, G. and Bartók, G. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. The Journal of Machine Learning Research, 17(1):5355–5375, 2016.
  • Niculae et al. (2018) Niculae, V., Martins, A. F., Blondel, M., and Cardie, C. SparseMAP: Differentiable sparse structured inference. In Proc. of ICML, 2018.
  • Orabona et al. (2014) Orabona, F., Hazan, T., Sarwate, A. D., and Jaakkola, T. S. On measure concentration of random maximum a-posteriori perturbations. In ICML, pp. 432–440, 2014.
  • Papandreou & Yuille (2011) Papandreou, G. and Yuille, A. L. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proc. of ICCV, pp. 193–200, 2011.
  • Peyré & Cuturi (2019) Peyré, G. and Cuturi, M. Computational optimal transport. Now Publishers, Inc., 2019.
  • Rockafellar & Wets (2009) Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • Schrijver (2003) Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency, volume 24. Springer Science & Business Media, 2003.
  • Shpakova & Bach (2016) Shpakova, T. and Bach, F. Parameter learning for log-supermodular distributions. In Advances in Neural Information Processing Systems, pp. 3234–3242, 2016.
  • van der Vaart (2000) van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  • Vlastelica et al. (2019) Vlastelica, M., Paulus, A., Musil, V., Martius, G., and Rolínek, M. Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175, 2019.
  • Wainwright et al. (2008) Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

Appendix A Proofs of technical results

Proof of Proposition 2.1.

The function is the Fenchel dual of (see Proposition 2.2, impact of the temperature), and is defined on . As in (Abernethy et al., 2014), we have that

It is maximized at , by Fenchel–Rockafellar duality (see, e.g., Wainwright et al., 2008, Appendix A). ∎
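In the special case of the max function with Gumbel noise, the perturbed maximum admits the closed form ε log Σᵢ exp(θᵢ/ε), up to an additive constant, and Proposition 2.1 then identifies the perturbed maximizer with its gradient, the softmax. The following finite-difference sketch checks this identity numerically under those assumptions (illustrative values, not the paper's code):

```python
import numpy as np

theta = np.array([0.3, -1.0, 2.0])  # illustrative cost vector
eps = 0.7                           # illustrative temperature

def F(t):
    # Closed form of E[max_i (t_i + eps * G_i)] for i.i.d. standard Gumbel
    # noise G, up to the additive constant eps times the Euler-Mascheroni
    # constant, which does not affect the gradient.
    return eps * np.log(np.exp(t / eps).sum())

# Central finite-difference gradient of F ...
grad_fd = np.array([(F(theta + 1e-5 * e) - F(theta - 1e-5 * e)) / 2e-5
                    for e in np.eye(theta.size)])

# ... which should match the softmax, i.e. the perturbed maximizer
# predicted by Proposition 2.1.
softmax = np.exp(theta / eps) / np.exp(theta / eps).sum()
```

The same check can be run with Monte Carlo estimates of F in place of the closed form, at the cost of sampling noise.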

Proof of Proposition 2.2.

The proof of these properties makes use of the notion of the normal fan of . It is the set of all normal cones to all faces of the polytope (Rockafellar & Wets, 2009). For each face, such a cone is the set of vectors in such that the linear program on with this vector as cost is maximized on this face. These cones form a partition of , and they are full-dimensional if and only if they are associated with a vertex of . These vertices are a subset of , corresponding to extreme points of .

As a consequence of this normal cone structure, since has a positive density, it assigns positive mass to sets if and only if they have non-empty interior, so for any and any , if and only if . In most applications, to begin with (all points are potential maximizers for some vector of costs, otherwise they would not be included in the set), and all points in have positive mass.

Properties of

- is strictly convex

The function is convex, as a maximum of convex (linear) functions. By definition of , for every and , for we have

The inequality holds with equality if and only if it holds within the expectation for almost all , since the distribution of is positive on . If the function is not strictly convex, there therefore exist and such that

for all , for almost all . In this case, is linear on the segment for almost all .

If is contained in the boundary between the normal cones to and , for all distinct , we have for all such pairs of , so is orthogonal to the span of all the pairwise differences of . However, since has non-empty interior, it is not contained in a strict affine subspace of , so . As a consequence, for distinct and , there exists such that and are in the interior of two normal cones to different . The same therefore holds under perturbations of in a small enough ball of , so cannot be linear on almost all segments , and is strictly convex.

- is twice differentiable, as a direct consequence of Proposition 3.1.

- is -Lipschitz

is the maximum of finitely many functions that are -Lipschitz, and therefore satisfies this property itself. As an expectation of such functions, satisfies the same property.

- is -gradient Lipschitz.

We have, by Proposition 3.1, for and in

As a consequence, by the Cauchy–Schwarz inequality, and Lipschitz property of , it holds that

Properties of

The function is the Fenchel dual of , which is strictly convex and smooth. As a consequence, is differentiable on the image of – the interior of – and it is -strongly convex.

- Legendre type property

The regularization function is differentiable on the interior. If there is a point of the boundary such that does not diverge when approaching , then taking such that (where is the normal cone to at ), we obtain . However, takes values in the interior of (see immediately below), leading to a contradiction.

Properties of

- The perturbed maximizer is in the interior of

Since the distribution of has positive density, the probability that (i.e. ) is positive for all . As a consequence, since

with all positive weights , is in the interior of the convex hull of .

- The function is differentiable, by twice differentiability of , by Proposition 3.1.

Influence of temperature parameter

We have for all

As a consequence

Since , and since , we have . ∎

Proof of Proposition 2.3.

We recall our assumption that yields a unique maximizer of the linear program on . This is true almost everywhere, and assumed here for simplicity of the results. We briefly discuss at the end of this proof how this can be painlessly extended to the more general case.

Limit at low temperatures

Since is convex (see proof of Proposition 2.2), Jensen's inequality yields

Further, we have for all

Taking expectations on both sides yields that

As a consequence, when , combining these two inequalities yields that .

Regarding the behavior of the perturbed maximizer , we follow the arguments of (Peyré & Cuturi, 2019, Proposition 4.1). By Proposition 2.1 and the definition of , we have

Since is continuous, it is bounded on , and the right-hand side above is bounded by , for some . As a consequence, when , . For any sequence , the sequence lies in a compact set . Therefore, it has a subsequence converging to some limit . However, since , we have , by continuity. Since is the unique maximizer, . As a consequence, all convergent subsequences of converge to the same limit : it is the unique accumulation point of this sequence. It follows directly that converges to , as it lives in a compact set, which yields the desired result.

Limit at high temperatures. By Proposition 2.2, , so the desired result follows by continuity of the perturbed maximizer.

Nonasymptotic inequalities. These inequalities follow directly from those proved to establish limits at low temperatures.

If is such that the maximizer is not unique (which occurs only on a set of measure 0), the only result affected is the convergence of when . Following the same proof as (Peyré & Cuturi, 2019, Proposition 4.1), it can be shown to converge to the minimizer of over the set of maximizers. This point is always unique, being the minimizer of a strongly convex function over a convex set. ∎

Proof of Proposition 4.1.

We follow the classical proofs in M-estimation (see, e.g. van der Vaart, 2000, Section 5.3). First, the estimator is consistent by virtue of the continuous mirror map between and . For large enough , the probability of each extreme point of is positive. By definition of the estimator and the stationarity condition for , under these conditions we have

By the law of large numbers,

converges to its expectation a.s. Since , the inverse of , is also continuous (because is convex and smooth), we have that converges to a.s.

We write the first-order conditions for at , and the Taylor expansion with Lagrange remainder for each coordinate, one by one


where is such that, for all coordinates

for some . We note here that since the estimator is not necessarily one-dimensional, cannot be written directly as for some , since the Taylor expansion with Lagrange remainder does not hold in its multivariate form. However, proceeding coordinate by coordinate as here allows us to circumvent this issue.

We have that . Since a.s., we have that for all , so a.s. Rearranging terms in Eq. (7), we have

By the central limit theorem,

in distribution. As a consequence, by convergence of and Slutsky’s lemma, we have the convergence in distribution

Appendix B Examples of discrete decision problems as linear programs

Our method applies seamlessly to all decision problems over discrete sets. Indeed, any problem of the form , for some score function , can at least be written in the form

by representing as the vertices of the unit simplex in . However, for most interesting decision problems that can actually be solved in practice, the score function takes a simpler form , for some representation of and some . We give here a non-exhaustive list of interesting problems of this type.
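The generic lifting above, from an arbitrary finite decision problem to a linear program over the vertices of the unit simplex, can be sketched in a few lines (an illustrative toy with a hypothetical decision set and scores, not from the paper):

```python
import numpy as np

# Hypothetical finite decision set S and score function f: any problem
# max_{s in S} f(s) lifts to the linear program max_y <y, theta> over
# the one-hot vertices of the unit simplex, with theta_i = f(s_i).
S = ["sort", "match", "route"]
f = {"sort": 1.2, "match": 3.4, "route": 0.7}

theta = np.array([f[s] for s in S])   # embed the scores as a cost vector
vertices = np.eye(len(S))             # one-hot vertices of the simplex

# Solve the linear program by enumerating vertices: the optimum is the
# one-hot vector at the argmax of theta, recovering the discrete decision.
y_star = vertices[np.argmax(vertices @ theta)]
decision = S[int(np.argmax(y_star))]
```

Of course, practical problems exploit structure rather than enumerating an exponentially large vertex set; the point is only that the perturbed-optimizer machinery applies once the problem is in this linear form.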

Maximum. The max function from to , which returns the largest entry of a vector, is ubiquitous in machine learning and the hallmark of any classification task. It is equal to the maximum of a linear function over the standard unit simplex.
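For this example, with Gumbel perturbations the distribution of the perturbed argmax over the simplex vertices is exactly the softmax (the classical Gumbel-max trick), which the following Monte Carlo sketch checks numerically (illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 0.5])  # arbitrary scores
eps = 1.0                          # temperature / noise amplitude

# Perturbed argmax over the one-hot simplex vertices, by Monte Carlo:
# each sample picks the vertex maximizing <y, theta + eps * Z>.
n = 200_000
idx = np.argmax(theta + eps * rng.gumbel(size=(n, theta.size)), axis=1)
freq = np.bincount(idx, minlength=theta.size) / n

# With Gumbel noise, the expectation is exactly softmax(theta / eps)
# (the Gumbel-max trick), so the empirical frequencies should match it.
softmax = np.exp(theta / eps) / np.exp(theta / eps).sum()
```

For other noise distributions the closed form is lost, but the Monte Carlo average above remains a valid, differentiable-in-expectation surrogate for the argmax.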