1 Introduction
Many applications of machine learning benefit from the possibility to train by gradient descent compositional models using endtoend differentiability. Yet, there remain many fields in which discrete decisions are required at intermediate steps of a data processing pipeline, notably those involving sequences of decisions and/or discrete objects (e.g., in robotics, graphics or biology). This is the result of many factors: discrete decisions provide a much soughtfor interpretability of what a blackbox is actually doing and discrete solvers are built upon decades of advances in combinatorial algorithms (Schrijver, 2003)
to make quick decisions (e.g., sorting, picking closest neighbors, exploring options with beamsearch, and more generally with knapsack, routing and shortest paths problems). Even though these discrete decisions can be easily computed, in what would be called a forward pass in a deep learning context, the derivatives of these decisions with respect to inputs are degenerate (small changes in the inputs either yield no change or discontinuous changes in the outputs). As a consequence, discrete solvers break the backpropagation of computational graphs, and cannot be incorporated in endtoend learning.
In order to expand the set of operations that can be incorporated in differentiable models, we propose and investigate a systematic method to transform blocks with discrete optimizers as outputs into differentiable operations. Our approach relies on the method of stochastic perturbations, the theory of which was developed and applied to several tasks of machine learning recently (see, e.g. Hazan et al., 2016, for a survey). In a nutshell, we perturb the inputs of a discrete solver with random noise, and consider the perturbed optimal solutions of the problem. The method is both easy to analyze theoretically and trivial to implement. Using the formal expectation of these perturbed solutions, we show that they are never locally constant and everywhere differentiable, with successive derivatives being expectations of simple expressions.
Related work.
Our work is part of growing efforts to modify operations in order to make them differentiable, that we now review.
Differentiating through an argmax. Several works have studied the introduction of regularization in the optimization problem in order to make the argmax differentiable. However these works are usually problemspecific, since a new optimization problem needs to be solved. Examples include assignments (Adams & Zemel, 2011) and more generally optimal transport (Bonneel et al., 2016; Cuturi, 2013), differentiable dynamic programming (Mensch & Blondel, 2018), differentiable submodular optimization (Djolonga & Krause, 2017). An exception is the socalled SparseMAP algorithm (Niculae et al., 2018), which is based on FrankWolfe or activeset algorithms for solving the regularizedproblem, and on implicit differentiation for computing a Jacobian. Like our proposal, SparseMAP only requires access to a linear maximization oracle. However, it is sequential in nature, while our approach is trivial to parallelize. More recently Agrawal et al. (2019) analyze implicit differentiation on solutions of convex optimization. They express the derivatives of the argmax exactly, leading to zero Jacobian almost everywhere for optimization over polytopes. Vlastelica et al. (2019)
propose a new scheme to interpolate in a piecewiselinear manner between locally constant regions. Their aim is to keep the same value for the Jacobian of the argmax for a large region of inputs, allowing for zero Jacobians as well.
Perturbation methods. The idea of using the expectation of a perturbed max and argmax, commonly known as the “Gumbel trick”, dates back to Gumbel (1954), and the notion of random choice models (Luce, 1959; McFadden et al., 1973; Guadagni & Little, 1983). They are exploited in online learning and bandit problems in order to promote exploration, and induce robustness to adversarial data (see, e.g., Abernethy et al., 2016, for a survey). In relation with our work, they are used for action spaces that are combinatorial in nature (Neu & Bartók, 2016). The Gumbel trick has also been used together with a softmax to obtain differentiable sampling (Jang et al., 2016; Maddison et al., 2016).
The use of perturbation techniques as an alternative to MCMC techniques for sampling was pioneered by Papandreou & Yuille (2011). They are used to compute the expected statistics arising in the gradient of conditional random fields. They show the exactness for the fully perturbed (but intractable case) and propose “lowrank” perturbations as an approximation. These results are extended by Hazan & Jaakkola (2012), who proved that the maximum with lowrank perturbations, in expectation, provides an upperbound on the log partition and proposed to replace the log partition in conditional random fields loss by that expectation. Their results, however, are limited to discrete product spaces. Shortly after, Hazan et al. (2013) derived new lower bounds on the partition function and proposed a new unbiased sequential sampler for the Gibbs distribution based on lowrank perturbations. These results were further refined by Gane et al. (2014) and Orabona et al. (2014). Shpakova & Bach (2016) further studied these bounds and proposed a doubly stochastic scheme. Balog et al. (2017) explored other distributions from extreme value theory, including the Fréchet and Weibull distributions. Apart from Lorberbom et al. (2019), who use a finite difference method, we are not aware of any prior work using perturbation techniques to differentiate through an argmax. As reviewed above, all papers focus on (approximately) sampling from the Gibbs distribution, upperbounding the log partition function, or differentiating through the max.
Our contributions.

[topsep=0pt,itemsep=2pt,parsep=2pt,leftmargin=10pt]

We propose a new general method to transform discrete optimizer outputs, inspired by the stochastic perturbation literature. This versatile method is easy to apply to any blackbox solvers without adhoc modifications.

Our stochastic smoothing allows argmax differentiation, through the formal perturbed maximizer. We show that its Jacobian is welldefined and nonzero everywhere, thereby avoiding vanishing gradients.

The successive derivatives of the perturbed maximum and argmax are expressed as simple expectations, which are easy to approximate with MonteCarlo methods.

This particular approach to operator smoothing yields natural connections to the recentlyproposed FenchelYoung losses by Blondel et al. (2019). We show that the equivalence via duality with regularized optimization makes FenchelYoung losses particularly natural. We propose a doubly stochastic scheme for minimization of these losses, for unsupervised and supervised learning.

We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
2 Perturbed maximizers
Given a finite set of distinct points and its convex hull, we consider a general discrete optimization problem parameterized by an input as follows:
(1) 
As detailed in Section 2.2 below, this formulation encompasses a variety of discrete operations such as picking the maximum value or the top
largest values of a vector; ranking the entries of a vector; or computing the shortest path over a weighted graph, to name just a few. In all cases,
is a polytope and these problems are linear programs (LP). For almost every
, the argmax is unique, and . While widespread, these functions do not have the convenient properties of blocks in endtoend learning methods, such as smoothness or differentiability. In particular, is piecewise constant: its gradient is zero almost everywhere, and undefined otherwise.To address these issues, we simply add to a random noise vector , where is a temperature parameter and has a positive and differentiable density on , ensuring that
is almost surely (a.s.) uniquely defined. This induces a probability distribution
on given by .Taking expectations w.r.t. the random perturbation leads to smoothed versions of and (see Figure 1): [linewidth=0.5pt]
Definition 2.1.
For all , and , we define
the perturbed maximum as
the perturbed maximizer as
This creates a general and natural model on the variable , when observations are solutions of optimization problems, with uncertain costs. It enables the modeling of phenomena where agents chose an optimal based on uncertain knowledge of , or varying circumstances. We view this as a generalization, or alternative to the Gibbs distribution, rather than an approximation thereof.
Models of random optimizers for linear problems with perturbed inputs are the subject of a wide litterature in machine learning, under the name of PerturbandMAP (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012), and perturbed leader method in online learning (Hannan, 1957; Kalai & Vempala, 2003; Abernethy et al., 2014). We refer to it here as the perturbed model. A notorious example is when
is the set of onehotencoding of
classes, is the unit simplex, and has the Gumbel distribution (Gumbel, 1954). In that case it is wellknown that is the Gibbs distribution, proportional to , is the logsumexp function of , and is the vector of softmax (or exponential weights) of the components of . The following result shows that the perturbed maximizer can also be defined as the solution of a convex problem, by FenchelRockafellar duality. [linewidth=0.5pt]Proposition 2.1.
Let be the Fenchel dual of , with domain . We have that
(2) 
As generalizes the logsumexp function for Gumbel noise on the simplex, its dual is a generalization of the negative entropy (which is the Fenchel dual of logsumexp). These connections have been studied in many parts of statistical and machine learning (Wainwright et al., 2008). In the literature on bandit problems and online learning, these links between regularization and perturbation for linear optimization problems are wellstudied, and applied to gradientbased algorithms (Abernethy et al., 2014, 2016).
This model, and the perturbed functions of Definition 2.1 inherit several important properties from this formulation. First, it allows to take derivatives with respect to the input of and of (Proposition 2.2). These derivatives are also easily expressed as expectations involving and with noisy inputs, as discussed in Section 3. In turn, this yields fast computational methods for these functions and their derivatives, described in Section 3.1. Further, by the duality point of view describing as a regularized maximizer, there exists a natural convex loss for this model that can be efficiently optimized in , for data . We describe this formalism in Section 4. All proofs are in the appendix.
2.1 Properties of the model
This model modifies the maximum and maximizer by perturbation. Because of the simple action of the stochastic noise , we can analyze their properties precisely.
[linewidth=0.5pt]
Proposition 2.2.
Assume is a convex polytope with nonempty interior, and has positive differentiable density. The perturbed model and the associated functions , , and have the following properties, for and :

[topsep=0pt,itemsep=2pt,parsep=2pt,leftmargin=5pt]

is strictly convex, twice differentiable, Lipschitzcontinuous and its gradient is Lipschitzcontinuous.

is strongly convex, differentiable, and Legendretype.

is in the interior of and is differentiable in .

Impact of the temperature : we have that
We develop in further details the simple expressions for derivatives of and in Section 3. By this proposition, since is strictly convex, it is nowhere locally linear, so is nowhere locally constant. Formally, is a mirror map, a onetoone mapping from unto the interior of . The gradient of is its functional inverse, by convex duality between these functions (see, e.g. surveys Wainwright et al., 2008; Bubeck et al., 2015, and references therein).
Remark 1.
For these properties to hold, it is crucial that has nonempty interior, i.e., that does not lie in an affine subspace of lower dimension. To adapt to cases where lies in a subspace, we consider the set of inputs up to vectors orthogonal to , or represent in a lowerdimensional subspace. As an example, over the unit simplex and Gumbel noise, the logsumexp is not strictly convex, and in fact linear along the allones vector . In such cases, the model is only wellspecified in up to the space orthogonal to , which does not affect prediction tasks.
For any positive temperature, these properties imply that there is an informative, welldefined and nonzero gradient in . The limiting behavior at extreme temperatures can also be explicitly derived from these properties.
[linewidth=0.5pt]
Proposition 2.3.
With the conditions of Proposition 2.2, for such that is a unique maximum:
 When , and .
 When , .
For every , we have ,
for constants and .
The properties of the distributions in this model are well studied in the perturbations literature (see, e.g. Hazan et al., 2016, for a survey). They notably do not have a simple closedform expression, but can be very easy to sample from. By the argmax definition, simulating , only requires to sample (e.g., Gaussian, or vector of i.i.d. Gumbel), and to solve the original optimization problem. It is the case in the applications we consider (e.g., max, ranking, shortest paths). This is in stark contrast to the Gibbs distribution, which has the opposite properties.
2.2 Examples
Many operations frequently used in machine learning can be written as optimal decisions over a discrete set and can be written as the problem in Equation (1). They correspond to a linear program (typically with integral solutions), but we emphasize again that this structure need not be known or exploited to use the perturbed maximizers in practice.
Maximum. The max function from to , that returns the largest of entries of a vector is ubiquitous in machine learning, the hallmark of any classification task. It is equal to an LP over the unit simplex , with .
Top . The function from to that returns the sum of the largest entries of a vector is also commonly used. Akin to the maximum, it is the value of an LP (cf. Appendix for more details).
Ranking. The function returning the ranks (in descending order) of a vector can be written as the argmax of a linear program over the permutahedron, the convex hull of permutations of any vector with distinct entries. Using different reference vectors yield different perturbed operations, and is commonly used.
Shortest paths. For a graph and positive costs over edges , the problem of finding a shortest path (i.e. with minimal total cost) from vertices to can be written in our setting with as an LP (see Appendix).
Other discrete problems. Finding linear assignments, minimum spanning trees, maximum flows and even applying logical operators can be written in this form. The strength of our method is that it adapts to any setting where the scores to be maximized correspond to a linear function , for some embedding of the possible outputs in .
3 Differentiation of soft maximizers
As noted above, for the right noise distributions, the perturbed maximizer is differentiable in its inputs, with nonzero Jacobian. Further, the derivatives associated to this model can be expressed as simple expectations. [linewidth=0.5pt]
Proposition 3.1.
(Abernethy et al., 2016, Lemma 1.5) For noise with distribution and twice differentiable , the following hold:
We discuss in the following subsection efficient techniques to evaluate in practice and its Jacobian, or to generate stochastic gradients, based on these expressions.
Remark 2.
Being able to compute the perturbed maximizer and its Jacobian allows to optimize functions that depend on through . This can be used to alter the costs to promote solutions with certain desired properties. Moreover, in a supervised learning setting, this allows to train models containing blocks with inputs (for some features ) and outputs , for response
For firstorder methods, backpropagating requires not only the usual architecturedependent , but also a gradient in of the loss . If this block is a strict discrete maximizer , as noted above, the computational graph is broken. However, with our proposed modification, we have that
(3) 
and the gradient can be fully backpropagated. Perturbed maximizers can therefore be used in endtoend prediction models, for any loss
on the predicted maximizer. Furthermore, we describe in Section 4 a loss that can be directly optimized in by firstorder methods. It comes with a strong algorithmic advantage, as it requires only to compute the perturbed maximizer and not its Jacobian.3.1 Practical implementation
For any , the perturbed maximizer is a solution of a convex optimization problem (Eq. 2). If has a simple form, can be computed either explicitly or through convex optimization (e.g. exponential weights for the entropy on the simplex).
More generally, by their expressions as expectations, the perturbed maximizer and its Jacobian in the input can be stochastically approximated with MonteCarlo methods. This only requires to efficiently sample from , and to solve the maximization problem over , which is a much weaker requirement (see, e.g., examples in Section 2.2). [linewidth=0.5pt]
Definition 3.1.
Note that the formulae in Proposition 3.1 give several manners to stochastically approximate , , and their derivatives by using , and and averages. This yields unbiased estimates for , , and its Jacobian. The plurality of these formulae gives the user several options for practical implementation, depending on how convenient it is to compute the maximum or the maximizer, how numerically stable they are, or what the impact of the factors are.
If
or its derivatives are used in stochastic gradient descent for training in supervised learning, a full approximation of the gradients is not always necessary. Indeed, taking only
(or a small number) of observations is acceptable here, as the gradients are stochastic in the first place. We call this scheme doubly stochastic, as it is stochastic with respect to both training samples and noise.A great strength of this method is the absence of conceptual overhead, or of supplementary computations. Only sampling from the chosen noise distribution and solutions of the problem are required. Further, even though our analysis relies on the specific structure of the problem as an LP, these algorithms do not. The MonteCarlo estimates can be obtained by using a function as a blackbox, without requiring knowledge of the problem or of the algorithm that solves it.
Implementation details. Two methods in which this implementation can be optimized are parallelization and warm starts. Indeed, to alleviate the dependency in of the running time, we can independently sample the and compute the in parallel. On the other hand, on certain algorithms, starting from a solution or nearsolution can lead to significant speedups. Using as initialization can improve running times dramatically, especially at lower temperatures.
4 Perturbed model learning with FenchelYoung losses
There is a large literature on learning parameters of a Gibbs distribution based on data , through maximization of the likelihood:
(4) 
They can be optimized by taking gradients of the empirical loglikelihood, of the form
thus earning the name of momentmatching procedures. The expectation of the Gibbs distribution is however hard to evaluate in some cases. This motivates its replacement by
(perturbandMAP in this literature), and to use this method to learn the parameters in this model (Papandreou & Yuille, 2011), as a proxy for loglikelihood.This minimization is equivalent to maximizing a term akin to Equation (4), substituting the logpartition with . We show here that this approach can be formally analyzed by the use of FenchelYoung losses (Blondel, 2019) in this context. The use of these losses also drastically improves the algorithmic aspects of the learning tasks, because of the specific expression of the gradients in of the loss.
In the perturbed model, the FenchelYoung loss is defined for by
It is–among other things–nonnegative, convex in , and minimized with value 0 if and only if is such that . It is equal to the Bregman divergence associated to
As and interact in this loss only through a scalar product, for random we have
where does not depend on . This is particularly convenient in analyzing the performance of FenchelYoung losses in generative models. The gradient of the loss is
The FenchelYoung loss can therefore be interpreted as a loss in that is a function of . Moreover, it can be optimized in with firstorder methods simply by computing the soft maximizer, without having to compute its Jacobian. It is therefore a particular case of the situation described in Eq. (3), allowing to even bypass virtually the perturbed maximizer block in the output, and to directly optimize a loss between observations and model outputs .
These ideas are developed in the following two subsections, in the cases of unsupervised and supervised learning.
4.1 Unsupervised learning  parameter estimation
For a given sequence of observations , a natural problem is to find the parameter that best fits these observations with the FenchelYoung loss. We show here that this approach is particularly appropriate if the ’s are generated by the perturbed model
that is, , for some unknown . In this unsupervised model, we therefore have a natural empirical and population loss , given a sample of size
Their gradients are given by
The empirical loss is minimized for such that and the population loss when . As a consequence, the whole battery of statistical results, from asymptotic to nonasymptotic, can be leveraged, and we present the simplest one (asymptotic normality). [linewidth=0.5pt]
Proposition 4.1.
When goes to , with the assumptions of Proposition 2.2 on the model, we have
in distribution, where is the covariance of .
In practice, can be fitted by stochastic optimization. Taking the loss only on observation yields
As a consequence, as usually in stochastic optimization, is a stochastic gradient either for (w.r.t. a random uniform from ) or for (w.r.t. drawn from ).
The methods described in Section 3.1 to stochastically approximate the gradient are particularly adapted here. Indeed, following Shpakova & Bach (2016), given an observation and a current value , a doubly stochastic version of the gradient is obtained by
(5) 
This can also be incorporated with a procedure where batches of data points are used to compute an approximate gradients, where the number of artificial samples and the batch size can be chosen separately.
4.2 Supervised learning
This loss can also be applied to a supervised learning task with observations , in a model class of parametrized functions , for . The FenchelYoung loss between and is, in this case as well, a natural and convenient loss to minimize. We consider
As in the unsupervised setting, this loss can be motivated by a generative model for some parameter such that
(6) 
This gives a natural distribution for discrete outputs that are optimizers, where the population loss is
The population loss is therefore minimized at . The gradient of the empirical loss is given by
Each term in the sum, gradient of the loss for a single observation, is therefore a stochastic gradient for (w.r.t. uniform in ) or for (w.r.t. to a random from
). As in unsupervised learning, a doubly stochastic gradient is obtained by averaging
perturbed argmax.5 Experiments
We demonstrate how perturbed maximizers can be used in a supervised learning setting, as described in Section 4. We do so in several tasks with features and responses in a discrete set of optimizers . In this section, since we focus on the prediction task, the issues raised in Remark 1 do not apply. We can write these maxima over polytopes that might have empty interior, for ease of notation.
When learning with the FenchelYoung losses, we simulate doubly stochastic gradients of the empirical loss with artificial perturbations (see Equation 5)
We give here a proof of concept of the adaptivity of this method to several tasks, and exhibit its performance for simple models (small neural networks, linear models).
We will opensource a package allowing to make any blackbox solver differentiable in just a few lines of code.
5.1 Perturbed max
We use the perturbed argmax with Gaussian noise in an image classification task. We train a vanillaCNN made of 4 convolutional and 2 fully connected layers on the CIFAR10 dataset for 600 epochs with minibatches of size 32. The 10 network outputs are the entries of
and we minimize the FenchelYoung loss between and , with different temperatures and number of perturbations .We analyze the impact of these two algorithmic parameters on the optimization and generalization abilities. We exhibit the final loss and accuracy for different number of perturbations in the doubly stochastic gradient (), and observe competitive performance compared to using the standard crossentropy loss (Figure 2, left and center).
We also highlight the importance of the temperature parameter on the algorithm (see Figure 2, right). Very low temperatures do not smooth the maximizers enough and no fitting occurs, even on training data. At very high temperatures, the perturbed maximizer carries less information about , and gradient estimates are closer to pure noise, which degrades the ability to fit to training and to generalize to the testing data.
5.2 Perturbed ranking
We create a ranking task by, at each instance , projecting vectors of dimension along an unknown direction . The label is then given by the vector of ranks of , for . This provides a simple experiment with permuted vectors as labels, for which the FenchelYoung loss is convex in : we fit to the response . Since is linear in , the loss is also convex in and we explore here the performance of our framework in a simple setting. On our experiments, we use instances of vectors in dimension  with instances in the training set and on the test set.
To explore the complexity of this task, we create a range of datasets, where the projections are perturbed with Gaussian noise of variance
before being ranked. They follow precisely the distribution of the generative model described in Equation 6, and the population loss is indeed minimized at . For correct labels (), our doubly stochastic scheme allows us to accurately generalize after epochs to the test data (see Figure 4). Even under small variances, we are able to accurately predict a high proportion of the correct ranks, and predictably fail for larger variances.5.3 Perturbed shortest path
We replicate the experiment of Vlastelica et al. (2019), aiming to learn the travel costs in graphs based on features, given examples of shortest path solutions (see Figure 3). On a dataset of 10,000 RGB images of size illustrating Warcraft terrains in the form of 2D grid networks. The responses are a shortest path between the topleft and bottomright corners, for costs hidden to the network corresponding, to the terrain type. These responses are represented as boolean matrices indicating the vertices along the shortest path.
We train a purely convolutional neural network made of 3 layers with a FenchelYoung loss between the predicted costs
and the shortest path . We optimize over epochs with batches of size , temperature parameter and , a single perturbation. We are able, only after a few epochs, to generalize very well for , and to accurately predict the shortest path on the test data (see Figure 5). We show here the impact of the temperature parameter on a finer metric based on the ratio between the optimal and the predicted cost.6 Conclusion
Despite a large body of work on perturbations techniques for machine learning, most existing works focused on approximating sampling, logpartitions and expectations under the Gibbs distribution. Together with novel theoretical insights, we proposed a framework for differentiating through, not only a max, but also an argmax, without adhoc modification of the underlying solver. In addition, by defining an equivalent regularizer , we showed how to construct FenchelYoung losses and proposed a doubly stochastic scheme, enabling both unsupervised and supervised learning. Experiments validated the ease of application of our framework to various tasks.
Aknowledgements.
FB’s work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19P3IA0001 (PRAIRIE 3IA Institute). FB also acknowledges support from the European Research Council (grant SEQUOIA 724063).
References
 Abernethy et al. (2014) Abernethy, J., Lee, C., Sinha, A., and Tewari, A. Online linear optimization via smoothing. In Conference on Learning Theory, pp. 807–823, 2014.
 Abernethy et al. (2016) Abernethy, J., Lee, C., and Tewari, A. Perturbation techniques in online learning and optimization. Perturbations, Optimization, and Statistics, pp. 233, 2016.
 Adams & Zemel (2011) Adams, R. P. and Zemel, R. S. Ranking via Sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
 Agrawal et al. (2019) Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558–9570, 2019.
 Bach et al. (2012) Bach, F., Jenatton, R., Mairal, J., Obozinski, G., et al. Optimization with sparsityinducing penalties. Foundations and Trends® in Machine Learning, 4(1):1–106, 2012.
 Balog et al. (2017) Balog, M., Tripuraneni, N., Ghahramani, Z., and Weller, A. Lost relatives of the Gumbel trick. In Proc. of ICML, pp. 371–379, 2017.
 Blondel (2019) Blondel, M. Structured prediction with projection oracles. In Proc. of NeurIPS, 2019.
 Blondel et al. (2019) Blondel, M., Martins, A. F., and Niculae, V. Learning with fenchelyoung losses. arXiv preprint arXiv:1901.02324, 2019.
 Bonneel et al. (2016) Bonneel, N., Peyré, G., and Cuturi, M. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Trans. Graph., 35(4):71–1, 2016.
 Bubeck et al. (2015) Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(34):231–357, 2015.
 Chandrasekaran et al. (2012) Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational mathematics, 12(6):805–849, 2012.
 Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pp. 2292–2300, 2013.
 Djolonga & Krause (2017) Djolonga, J. and Krause, A. Differentiable learning of submodular models. In Proc. of NeurIPS, pp. 1013–1023, 2017.
 Gane et al. (2014) Gane, A., Hazan, T., and Jaakkola, T. Learning with maximum aposteriori perturbation models. In Proc. of AISTATS, pp. 247–256, 2014.

Guadagni & Little (1983)
Guadagni, P. M. and Little, J. D.
A logit model of brand choice calibrated on scanner data.
Marketing science, 2(3):203–238, 1983.  Gumbel (1954) Gumbel, E. J. Statistical Theory of Extreme Values and some Practical Applications: A Series of Lectures. Number 33. US Govt. Print. Office, 1954.
 Hannan (1957) Hannan, J. Approximation to bayes risk in repeated plays. In Dresher, M., Tucker, A. W., and Wolfe, P. (eds.), Contributions to the Theory of Games 3, pp. 97–139. Princeton University Press, 1957.
 Hazan & Jaakkola (2012) Hazan, T. and Jaakkola, T. On the partition function and random maximum aposteriori perturbations. Proc. of ICML, 2012.
 Hazan et al. (2013) Hazan, T., Maji, S., and Jaakkola, T. On sampling from the gibbs distribution with random maximum aposteriori perturbations. In Advances in Neural Information Processing Systems, pp. 1268–1276, 2013.
 Hazan et al. (2016) Hazan, T., Papandreou, G., and Tarlow, D. Perturbations, Optimization, and Statistics. MIT Press, 2016.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.

Kalai & Vempala (2003)
Kalai, A. and Vempala, S.
Efficient algorithms for online decision.
In Proc. 16th Annual Conference on Learning Theory
(COLT2003)
, Lecture Notes in Artificial Intelligence, pp. 506–521. Springer, 2003.
 Lorberbom et al. (2019) Lorberbom, G., Gane, A., Jaakkola, T., and Hazan, T. Direct optimization through argmax for discrete variational autoencoder. In Proc. of NeurIPS, 2019.
 Luce (1959) Luce, R. D. Individual choice behavior. 1959.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 McFadden et al. (1973) McFadden, D. et al. Conditional logit analysis of qualitative choice behavior. 1973.
 Mensch & Blondel (2018) Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In Proc. of ICML, 2018.
 Neu & Bartók (2016) Neu, G. and Bartók, G. Importance weighting without importance weights: An efficient algorithm for combinatorial semibandits. The Journal of Machine Learning Research, 17(1):5355–5375, 2016.
 Niculae et al. (2018) Niculae, V., Martins, A. F., Blondel, M., and Cardie, C. Sparsemap: Differentiable sparse structured inference. In Proc. of ICML, 2018.
 Orabona et al. (2014) Orabona, F., Hazan, T., Sarwate, A. D., and Jaakkola, T. S. On measure concentration of random maximum aposteriori perturbations. In ICML, pp. 432–440, 2014.
 Papandreou & Yuille (2011) Papandreou, G. and Yuille, A. L. PerturbandMAP random fields: Using discrete optimization to learn and sample from energy models. In Proc. of ICCV, pp. 193–200, 2011.
 Peyré & Cuturi (2019) Peyré, G. and Cuturi, M. Computational optimal transport. Now Publishers, Inc., 2019.
 Rockafellar & Wets (2009) Rockafellar, R. T. and Wets, R. J.B. Variational analysis, volume 317. Springer Science & Business Media, 2009.
 Schrijver (2003) Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency, volume 24. Springer Science & Business Media, 2003.
 Shpakova & Bach (2016) Shpakova, T. and Bach, F. Parameter learning for logsupermodular distributions. In Advances in Neural Information Processing Systems, pp. 3234–3242, 2016.
 van der Vaart (2000) van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge university press, 2000.
 Vlastelica et al. (2019) Vlastelica, M., Paulus, A., Musil, V., Martius, G., and Rolínek, M. Differentiation of blackbox combinatorial solvers. arXiv preprint arXiv:1912.02175, 2019.
 Wainwright et al. (2008) Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
Appendix A Proofs of technical results
Proof of Proposition 2.1.
Proof of Proposition 2.2.
The proof of these properties makes use of the notion of the normal fan of . It is the set of all normal cones to all faces of the polytope (Rockafellar & Wets, 2009). For each face, such a cone is the set of vectors in such that the linear program on with this vector as cost is maximized on this face. They form a partition of , and these cones are full dimensional if and only if they are associated to a vertex of . These vertices are a subset of , corresponding to extreme points of .
As a consequence of this normal cone structure, since has a positive density, it assigns positive mass to sets if and only if they have nonempty interior, so for any , and any , if and only if . In most applications, to begin with (all are potential maximizer for some vector of costs, otherwise they are not included in the set), and all points in have positive mass.
Properties of
 is strictly convex
The function is convex, as a maximum of convex (linear) functions. By definition of , for every and , for we have
The inequality holds with equality if and only if it holds within the expectation for almost all since the distribution of is positive on . If the function is not strictly convex, there exists therefore and such that
for all , for almost all . In this case, is linear on the segment for almost all .
If is contained in the boundary between the normal cones to and , for all distinct , we have for all such pairs of , so is orthogonal to the span of all the pairwise differences of . However, since has no empty interior, it is not contained in a strict affine subspace of so . As a consequence, for distinct and , there exists such that and are in the interior of two normal cones to different . As a consequence, the same holds under perturbations of in a small enough ball of , so cannot be linear on almost all segments , and is strictly convex.
 is twice differentiable, as a direct consequence of Proposition 3.1.
 is Lipschitz
is the maximum of finitely many functions that are Lipschitz. It therefore also satisfies this property. is an expectation of such functions, therefore it satisfies the same property.
 is gradient Lipschitz.
We have, by Proposition 3.1, for and in
As a consequence, by the Cauchy–Schwarz inequality, and Lipschitz property of , it holds that
Properties of
The function is the Fenchel dual of , which is strictly convex and smooth. As a consequence, is differentiable on the image of – the interior of – and it is strongly convex.
 Legendre type property
The regularization function is differentiable on the interior. If there is a point of its boundary such that does not diverge when approaching , then taking such that (where is the normal cone to at ), then . However, takes image in the interior of (see immediately below), leading to a contradiction.
Properties of
 The perturbed maximizer is in the interior of
Since the distribution of has positive density, the probability that (i.e. ) is positive for all . As a consequence, since
with all positive weights , is in the interior of the convex hull of .
 The function is differentiable, by twice differentiability of , by Proposition 3.1.
Influence of temperature parameter
We have for all
As a consequence
Since , and since , we have . ∎
Proof of Proposition 2.3.
We recall that we assume that yields a unique maximum to the linear program on . This is true almost everywhere, and assumed here for simplicity of the results. We discuss briefly at the end of this proof how this can be painlessly extended to the more general case.
Limit at low temperatures
Since is convex (see proof of Proposition 2.2), so by Jensen’s inequality
Further, we have for all
Taking expectations on both sides yields that
As a consequence, when , combining these two inequalities yields that .
Regarding the behavior of the perturbed maximizer , we follow the arguments of (Peyré & Cuturi, 2019, Proposition 4.1). By Proposition 2.1 and the definition of , we have
Since is continuous, it is bounded on , and the right hand term above is bounded by , for some . As a consequence, when , . For any sequence , the sequence is in a compact . Therefore, it has a subsequence that converges to some limit . However, since , we have , by continuity. Since is a unique maximizer, . As a consequence, all convergent subsequences of converge to the same limit : it is the unique accumulation point of this sequence. It follows directly that converges to , as it lives in a compact set, which yields the desired result.
Limit at high temperatures By Proposition 2.2, , so the desired result follows by continuity of the perturbed maximizer.
Nonasymptotic inequalities. These inequalities follow directly from those proved to establish limits at low temperatures.
If is such that the maximizer is not unique (which occurs only on a set of measure 0), the only result affected is the convergence of when . Following the same proof of (Peyré & Cuturi, 2019, Proposition 4.1), it can be shown to converge to the minimizer of over the set of maximizer. This point is always unique, as the minimizer of a strongly convex function over a convex set. ∎
Proof of Proposition 4.1.
We follow the classical proofs in Mestimation (see, e.g. van der Vaart, 2000, Section 5.3). First, the estimator is consistent as a virtue of the continuous mirror map between and . For large enough , since the probability of each extreme point of is positive. By definition of the estimator and stationarity condition for , we have in these conditions
By the law of large numbers,
converges to its expectation a.s. Since , the inverse of , is also continuous (by the fact that is convex smooth), so we have that converges to a.s.We write the first order conditions for at and the Taylor expansion with Lagrange remainder for all coordinates, one by one
(7) 
where is such that, for all coordinates
for some . We note here that since the estimator is not necessarily in dimension 1, cannot be written directly as for some , since the Taylor expansion with Lagrange remainder is not true in its multivariate form. However, doing it coordinatebycoordinate as here allows to circumvent this issue.
We have that . Since a.s. we have that for all , so a.s. Rearranging terms in Eq. (7), we have
By the central limit theorem,
in distribution. As a consequence, by convergence of and Slutsky’s lemma, we have the convergence in distribution∎
Appendix B Examples of discrete decision problems as linear programs
Our method applies seamlessly to all decision problems over discrete sets. Indeed, any problem of the form , for some score function , can at least be written in the form
by representing as the vertices of the unit simplex in . However, for most interesting decision problem that can actually be solved in practice, the score function takes a simpler form , for some representation of and some . We give here a nonexhaustive list of examples of interesting problems of this type.
Maximum. The max function from to , that returns the largest among the entries of a vector is ubiquitous in machine learning, the hallmark of any classification task. It is equal to over the standard unit simplex.