 # SoftSort: A Continuous Relaxation for the argsort Operator

While sorting is an important procedure in computer science, the argsort operator - which takes as input a vector and returns its sorting permutation - has a discrete image and thus zero gradients almost everywhere. This prohibits end-to-end, gradient-based learning of models that rely on the argsort operator. A natural way to overcome this problem is to replace the argsort operator with a continuous relaxation. Recent work has shown a number of ways to do this, but the relaxations proposed so far are computationally complex. In this work we propose a simple continuous relaxation for the argsort operator which has the following qualities: it can be implemented in three lines of code, achieves state-of-the-art performance, is easy to reason about mathematically - substantially simplifying proofs - and is faster than competing approaches. We open source the code to reproduce all of the experiments and results.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Gradient-based optimization lies at the core of the success of deep learning. However, many common operators have discrete images and thus exhibit zero gradients almost everywhere, making them unsuitable for gradient-based optimization. Examples include the

Heaviside step function, the argmax operator, and - the central object of this work - the argsort

operator. To overcome this limitation, continuous relaxations for these operators can be constructed. For example, the sigmoid function serves as a continuous relaxation for the

Heaviside step function, and the softmax operator serves as a continuous relaxation for the argmax. These continuous relaxations have the crucial property that they provide meaningful gradients that can drive optimization. Because of this, operators such as the softmax are ubiquitous in deep learning. In this work we are concerned with continuous relaxations for the argsort operator.

Formally, we define the argsort operator as the mapping from -dimensional real vectors to the set of permutations over elements , where returns the permutation that sorts in decreasing order111This is called the sort operator in (Grover et al., 2019). We adopt the more conventional naming.. For example, for the input vector we have . If we let denote the set of permutation matrices of dimension , following (Grover et al., 2019) we can define, for a permutation , the permutation matrix as:

 Pπ[i,j]={1 if j=πi,0 otherwise

This is simply the one-hot representation of . Note that with these definitions, the mapping that sorts in decreasing order is . Also, if we let , then the argsort operator can be recovered from by a simple matrix multiplication via

 argsort(s)=Pargsort(s)¯1n

Because of this reduction from the argsort operator to the operator, in this work, as in previous works (Mena et al., 2018; Grover et al., 2019; Cuturi et al., 2019), our strategy to derive a continuous relaxation for the argsort operator is to instead derive a continuous relaxation for the operator. This is analogous to the way that the softmax operator relaxes the argmax operator.

The main contribution of this paper is the proposal of a family of simple continuous relaxation for the operator, which we call SoftSort, and define as follows:

 SoftSortdτ(s)=softmax(−d(% sort(s)1T,1sT)τ)

where the softmax operator is applied row-wise, is any differentiable almost everywhere, semi–metric function of applied pointwise, and is a temperature parameter that controls the degree of the approximation. In simple words: the -th row of the SoftSort operator is the softmax of the negative distances to the -th largest element.

Throughout this work we will predominantly use (the norm), but our theoretical results hold for any such , making the approach flexible. The SoftSort

operator is trivial to implement in automatic differentiation libraries such as TensorFlow

and PyTorch

(Paszke et al., 2017), and we show that:

• SoftSort achieves state-of-the-art performance on multiple benchmarks that involve reordering elements.

• SoftSort shares the same desirable properties as the NeuralSort operator (Grover et al., 2019). Namely, it is row-stochastic, converges to in the limit of the temperature, and can be projected onto a permutation matrix using the row-wise argmax operation. However, SoftSort is significantly easier to reason about mathematically, which substantially simplifies the proof of these properties.

• The SoftSort operator is faster than the NeuralSort operator of (Grover et al., 2019), the fastest competing approach, and empirically just as easy to optimize in terms of the number of gradient steps required for the training objective to converge.

Therefore, the SoftSort operator advances the state of the art in differentiable sorting by significantly simplifying previous approaches. To better illustrate the usefulness of the mapping defined by SoftSort, we show in Figure 1 the result of applying the operator to soft-sort a sequence of vectors according to the order given by respective scores . Soft-sorting the is achieved by multiplying to the left by .

The code and experiments are available at https://github.com/sprillo/softsort

## 2 Related Work

Relaxed operators for sorting procedures were first proposed in the context of Learning to Rank with the end goal of enabling direct optimization of Information Retrieval (IR) metrics. Many IR metrics, such as the Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002), are defined in terms of ranks. Formally, the rank operator is defined as the function that returns the inverse of the argsort permutation: , or equivalently . The rank operator is thus intimately related to the argsort operator; in fact, a relaxation for the operator can be obtained by transposing a relaxation for the operator, and vice-versa; this duality is apparent in the work of (Cuturi et al., 2019).

We begin by discussing previous work on relaxed rank operators in section 2.1. Next, we discuss more recent work, which deals with relaxations for the operator.

### 2.1 Relaxed Rank Operators

The first work to propose a relaxed rank operator is that of (Taylor et al., 2008). The authors introduce the relaxation given by where , and show that this relaxation, as well as its gradients, can be computed exactly in time . Note that as , 222Except when there are ties, which we assume is not the case. Ties are merely a technical nuisance and do not represent a problem for any of the methods (ours or other’s) discussed in this paper.. This relaxation is used in turn to define a surrogate for NDCG which can be optimized directly.

In (Qin et al., 2010), another relaxation for the rank operator is proposed, defined as:

 ˆπτ(s)[i]=1+∑j≠iσ(si−sjτ) (1)

where is the sigmoid function. Again, as . This operator can be computed in time , which is faster than the approach of (Taylor et al., 2008).

Note that the above two approaches cannot be used to define a relaxation for the argsort operator. Indeed, and are not relaxations for the operator. Instead, they directly relax the rank operator, and there is no easy way to obtain a relaxed argsort or operator from them.

### 2.2 Sorting via Bipartite Maximum Matchings

The work of (Mena et al., 2018) draws an analogy between the argmax operator and the argsort operator by means of bipartite maximum matchings: the argmax operator applied to an -dimensional vector can be viewed as a maximum matching on a bipartite graph with vertices in one component and vertex in the other component, the edge weights equal to the given vector ; similarly, a permutation matrix can be seen as a maximum matching on a bipartite graph between two groups of vertices with edge weights given by a matrix . This induces a mapping (for ‘matching’) from the set of square matrices to the set . Note that this mapping has arity , unlike the operator which has arity .

Like the operator, the operator has discrete image , so to enable end-to-end gradient-based optimization, (Mena et al., 2018) propose to relax the matching operator by means of the Sinkhorn operator ; is a temperature parameter that controls the degree of the approximation; as they show that . The Sinkhorn operator maps the square matrix to the Birkhoff polytope , which is defined as the set of doubly stochastic matrices (i.e. rows and columns summing to ).

The computational complexity of the (Mena et al., 2018) approach to differentiable sorting is thus where is the number of Sinkhorn iterates used to approximate ; the authors use for all experiments.

### 2.3 Sorting via Optimal Transport

The recent work of (Cuturi et al., 2019) also makes use of the Sinkhorn operator to derive a continuous relaxation for the operator. This time, the authors are motivated by the observation that a sorting permutation for can be recovered from an optimal transport plan between two discrete measures defined on the real line, one of which is supported on the elements of and the other of which is supported on arbitrary values

. Indeed, the optimal transport plan between the probability measures

and (where is the Dirac delta at ) is given by the matrix .

Notably, a variant of the optimal transport problem with entropy regularization yields instead a continuous relaxation mapping to the Birkhoff polytope ; plays a role similar to the temperature in (Mena et al., 2018), with as . This relaxation can be computed via Sinkhorn iterates, and enables the authors to relax by means of

. Gradients can be computed by backpropagating through the Sinkhorn operator as in

(Mena et al., 2018).

The computational complexity of this approach is again

. However, the authors show that a generalization of their method can be used to compute relaxed quantiles in time

, which is interesting in its own right.

### 2.4 Sorting via a sum-of-top-k elements identity

Finally, a more computationally efficient approach to differentiable sorting is proposed in (Grover et al., 2019). The authors rely on an identity that expresses the sum of the top elements of a vector as a symmetric function of that only involves max and min operations (Ogryczak and Tamir, 2003, Lemma 1). Based on this identity, and denoting by the matrix of absolute pairwise differences of elements of , namely , the authors prove the identity:

 Pargsort(s)[i,j]={1 if j=argmax(ci),0 otherwise (2)

where , and denotes the column vector of all ones.

Therefore, by replacing the argmax operator in Eq. 2 by a row-wise softmax, the authors arrive at the following continuous relaxation for the operator, which they call NeuralSort:

 NeuralSortτ(s)[i,:]=softmax(ciτ) (3)

Again, is a temperature parameter that controls the degree of the approximation; as they show that . Notably, the relaxation proposed by (Grover et al., 2019) can be computed in time , making it much faster than the competing approaches of (Mena et al., 2018; Cuturi et al., 2019).

## 3 SoftSort: A simple relaxed sorting operator

In this paper we propose SoftSort, a simple continuous relaxation for the operator. We define SoftSort as follows:

 SoftSortdτ(s)=softmax(−d(% sort(s)1T,1sT)τ) (4)

where is a temperature parameter that controls the degree of the approximation and is semi–metric function applied pointwise that is differentiable almost everywhere. Recall that a semi–metric has all the properties of a metric except the triangle inequality, which is not required. Examples of semi–metrics in include any positive power of the absolute value. The SoftSort operator has similar desirable properties to those of the NeuralSort operator, while being significantly simpler. Here we state and prove these properties. We start with the definition of Unimodal Row Stochastic Matrices (Grover et al., 2019), which summarizes the properties of our relaxed operator:

###### Definition 1

(Unimodal Row Stochastic Matrices). An matrix is Unimodal Row Stochastic (URS) if it satisfies the following conditions:

1. Non-negativity: .

2. Row Affinity: .

3. Argmax Permutation: Let denote a vector of size such that . Then, , i.e., it is a valid permutation.

While NeuralSort and SoftSort yield URS matrices (we will prove this shortly), the approaches of (Mena et al., 2018; Cuturi et al., 2019) yield bistochastic matrices. It is natural to ask whether URS matrices should be preferred over bistochastic matrices for relaxing the operator. Note that URS matrices are not comparable to bistochastic matrices: they drop the column-stochasticity condition, but require that each row have a distinct argmax, which is not true of bistochastic matrices. This means that URS matrices can be trivially projected onto the probability simplex, which is useful for e.g. straight-through gradient optimization, or whenever hard permutation matrices are required, such as at test time. The one property URS matrices lack is column-stochasticity, but this is not central to soft sorting. Instead, this property arises in the work of (Mena et al., 2018) because their goal is to relax the bipartite matching operator (rather than the argsort operator), and in this context bistochastic matrices are the natural choice. Similarly, (Cuturi et al., 2019) yields bistochastic matrices because they are the solutions to optimal transport problems (this does, however, allow them to simultaneously relax the argsort and rank operators). Since our only goal (as in the NeuralSort paper) is to relax the argsort operator, column-stochasticity can be dropped, and URS matrices are the more natural choice.

Now on to the main Theorem, which shows that SoftSort has the same desirable properties as NeuralSort. These are (Grover et al., 2019, Theorem 4):

###### Theorem 1

The SoftSort operator has the following properties:

1. Unimodality: ,

is a unimodal row stochastic matrix. Further, let

denote the permutation obtained by applying argmax row-wise to . Then, .

2. Limiting behavior: If no elements of coincide, then:

 limτ→0+SoftSortdτ(s)=P% argsort(s)

In particular, this limit holds almost surely if the entries of are drawn from a distribution that is absolutely continuous w.r.t. the Lebesgue measure on .

Proof.

1. Non-negativity and row affinity follow from the fact that every row in is the result of a softmax operation. For the third property, we use that softmax preserves maximums and that has a unique minimum at for every . Formally, let , i.e. are the decreasing order statistics of . Then:

 ui =argmaxjSoftSortdτ(s)[i,j] =argmaxj(softmax(−d(s[i],sj)/τ)) =argminj(d(s[i],sj)) =argsort(s)[i]

as desired.

2. It suffices to show that the -th row of converges to the one-hot representation of . But by part 1, the -th row of is the softmax of where is a vector whose unique argmax is . Since it is a well-known property of the softmax that (Elfadel and Wyatt, 1994), we are done.

Note that the proof of unimodality of the SoftSort operator is straightforward, unlike the proof for the NeuralSort operator, which requires proving a more technical Lemma and Corollary (Grover et al., 2019, Lemma 2, Corollary 3).

The row-stochastic property can be loosely interpreted as follows: row of SoftSort and NeuralSort encodes a distribution over the value of the rank element, more precisely, the probability of it being equal to for each . In particular, note that the first row of the operator is precisely the softmax of the input vector. In general, the -th row of the operator is the softmax of the negative distances to the -th largest element.

Finally, regarding the choice of in SoftSort, even though a large family of semi–metrics could be considered, in this work we experimented with the absolute value as well as the square distance and found the absolute value to be marginally better during experimentation. With this in consideration, in what follows we fix the absolute value function, unless stated otherwise. We leave for future work learning the metric or exploring a larger family of such functions.

## 4 Comparing SoftSort to NeuralSort

### 4.1 Mathematical Simplicity

The difference between SoftSort and NeuralSort becomes apparent once we write down what the actual operators look like; the equations defining them (Eq. 3, Eq. 4) are compact but do not offer much insight. Note that even though the work of (Grover et al., 2019) shows that the NeuralSort operator has the desirable properties of Theorem 1, the paper never gives a concrete example of what the operator instantiates to in practice, which keeps some of its complexity hidden.

Let be the function defined as , where the softmax is applied row-wise. Suppose that and that is sorted in decreasing order . Then the NeuralSort operator is given in Figure 2 and the SoftSort operator is given in Figure 3

. Note that the diagonal of the logit matrix has been

-centered by subtracting a constant value from each row; this does not change the softmax and simplifies the expressions. The SoftSort operator is straightforward, with the -th entry of the logit matrix given by . In contrast, the -th entry of the NeuralSort operator depends on all intermediate values . This is a consequence of the coupling between the NeuralSort operator and the complex identity used to derive it. As we show in this paper, this complexity can be completely avoided, and results in further benefits beyond aesthetic simplicity such as flexibility, speed and mathematical simplicity.

Note that for an arbitrary region of space other than , the NeuralSort and SoftSort operators look just like Figures 2 and 3 respectively except for relabelling of the and column permutations. Indeed, we have:

###### Proposition 1

For both and , the following identity holds:

 f(s)=f(sort(s))Pargsort(s) (5)

We defer the proof to appendix B. This proposition is interesting because it implies that the behaviour of the SoftSort and NeuralSort operators can be completely characterized by their functional form on the region of space where . Indeed, for any other value of , we can compute the value of or by first sorting , then applying SoftSort or NeuralSort, and finally sorting the columns of the resulting matrix with the inverse permutation that sorts . In particular, to our point, the proposition shows that Figures 2 and 3 are valid for all up to relabeling of the (by ) and column permutations (by the inverse permutation that sorts ).

To further our comparison, in appendix D we show how the SoftSort and NeuralSort operators differ in terms of the size of their matrix entries.

### 4.2 Code Simplicity

In Figures 4 and 5 we show TensorFlow implementations of the NeuralSort and SoftSort operators respectively. SoftSort has a simpler implementation than NeuralSort, which we shall see is reflected in its faster speed. (See section 5.5)

Note that our implementation of SoftSort is based directly off Eq. 4, and we rely on the sort operator. We would like to remark that there is nothing wrong with using the sort operator in a stochastic computation graph. Indeed, the sort function is continuous, almost everywhere differentiable (with non-zero gradients) and piecewise linear, just like the max, min or ReLU functions.

Finally, the unimodality property (Theorem 1) implies that any algorithm that builds a relaxed permutation matrix can be used to construct the true discrete permutation matrix. This means that any relaxed sorting algorithm (in particular, NeuralSort) is lower bounded by the complexity of sorting, which justifies relying on sorting as a subroutine. As we show later, SoftSort is faster than NeuralSort. Also, we believe that this modular approach is a net positive since sorting in CPU and GPU is a well studied problem (Singh et al., 2017) and any underlying improvements will benefit SoftSort’s speed as well. For instance, the current implementation in TensorFlow relies on radix sort and heap sort depending on input size.

## 5 Experiments

We first compare SoftSort to NeuralSort on the benchmarks from the NeuralSort paper (Grover et al., 2019), using the code kindly open-sourced by the authors. We show that SoftSort performs comparably to NeuralSort. Then, we devise a specific experiment to benchmark the speeds of the SoftSort and NeuralSort operators in isolation, and show that SoftSort is faster than NeuralSort while taking the same number of gradient steps to converge. This makes SoftSort not only the simplest, but also the fastest relaxed sorting operator proposed to date.

### 5.1 Models

For both SoftSort and NeuralSort we consider their deterministic and stochastic variants as in (Grover et al., 2019). The deterministic operators are those given by equations 3 and 4. The stochastic variants are Plackett-Luce distributions reparameterized via Gumbel distributions (Grover et al., 2019, Section 4.1), where the operator that is applied to the samples is relaxed by means of the SoftSort or NeuralSort operator; this is analogous to the Gumbel-Softmax trick where the argmax operator that is applied to the samples is relaxed via the softmax operator (Jang et al., 2017; Maddison et al., 2017).

### 5.2 Sorting Handwritten Numbers

The large-MNIST dataset of handwritten numbers is formed by concatenating randomly selected MNIST digits

. In this task, a neural network is presented with a sequence of

large-MNIST numbers and must learn the permutation that sorts these numbers. Supervision is provided only in the form of the ground-truth permutation. Performance on the task is measured by:

1. The proportion of permutations that are perfectly recovered.

2. The proportion of permutation elements that are correctly recovered.

Note that the first metric is always less than or equal to the second metric. We use the setup in (Cuturi et al., 2019) to be able to compare against their Optimal-Transport-based method. They use epochs to train all models.

The results for the first metric are shown in Table 1. We report the mean and standard deviation over 10 runs. We see that SoftSort and NeuralSort perform identically for all values of . Moreover, our results for NeuralSort are better than those reported in (Cuturi et al., 2019), to the extent that NeuralSort and SoftSort outperform the method of (Cuturi et al., 2019) for

, unlike reported in said paper. We found that the hyperparameter values reported in

(Grover et al., 2019) and used by (Cuturi et al., 2019) for NeuralSort are far from ideal: (Grover et al., 2019) reports using a learning rate of and temperature values from the set . However, a higher learning rate dramatically improves NeuralSort’s results, and higher temperatures also help. Concretely, we used a learning rate of for all the SoftSort and NeuralSort models, and a value of for and for . The results for the second metric are reported in appendix E. In the appendix we also include learning curves for SoftSort and NeuralSort, which show that they converge at the same speed.

### 5.3 Quantile Regression

As in the sorting task, a neural network is presented with a sequence of large-MNIST numbers. The task is to predict the median element from the sequence, and this is the only available form of supervision. Performance on the task is measured by mean squared error and Spearman’s rank correlation. We used iterations to train all models.

The results are shown in Table 2. We used a learning rate of for all models - again, higher than that reported in (Grover et al., 2019) - and grid-searched the temperature on the set - again, higher than that reported in (Grover et al., 2019). We see that SoftSort and NeuralSort perform similarly, with NeuralSort sometimes outperforming SoftSort and vice versa. The results for NeuralSort are also much better than those reported in (Grover et al., 2019), which we attribute to the better choice of hyperparameters, concretely, the higher learning rate. In the appendix we also include learning curves for SoftSort and NeuralSort, which show that they converge at the same speed.

### 5.4 Differentiable kNN

In this setup, we explored using the SoftSort operator to learn a differentiable -nearest neighbours (kNN) classifier that is able to learn a representation function, used to measure the distance between the candidates.

In a supervised training framework, we have a dataset that consists of pairs of a datapoint and a label. We are interested in learning a map to embed every such that we can use a kNN classifier to identify the class of by looking at the class of its closest neighbours according to the distance . Such a classifier would be valuable by virtue of being interpretable and robust to both noise and unseen classes.

This is achieved by constructing episodes during training that consist of one pair and candidate pairs for , arranged in two column vectors and . The probability of class under a kNN classifier is the average of the first entries in the vector

 Pargsort(−∥Φ(X)−Φ(^x)∥2)IY=^y

where is the vector of squared distances from the candidate points and is the binary vector indicating which indexes have class . Thus, if we replace by the SoftSort operator we obtain a differentiable relaxation . To compute the loss we follow (Grover et al., 2019) and use . We also experimented with the cross entropy loss, but the performance went down for both methods.

When , our method is closely related to Matching Networks (Vinyals et al., 2016). This follows from the following result: (See proof in Appendix C)

###### Proposition 2

Let and be the differentiable kNN operator using . If we choose the embedding function to be of norm , then

 ˆP(^y|^x,X,Y)=∑i:yi=^yeΦ(^x)⋅Φ(xi)/∑i=1…neΦ(^x)⋅Φ(xi)

This suggests that our method is a generalization of Matching Networks, since in our experiments larger values of yielded better results consistently and we expect a kNN classifier to be more robust in a setup with noisy labels. However, Matching Networks use contextual embedding functions, and different networks for and the candidates , both of which could be incorporated to our setup. A more comprehensive study comparing both methods on a few shot classification dataset such as Omniglot (Lake et al., 2011) is left for future work.

We applied this method to three benchmark datasets: MNIST, Fashion MNIST and CIFAR-10 with canonical splits. As baselines, we compare against NeuralSort as well as other kNN models with fixed representations coming from raw pixels, a PCA feature extractor and an auto-encoder. All the results are based on the ones reported in (Grover et al., 2019). We also included for comparison a standard classifier using a convolutional network.

Results are shown in Table 3. In every case, we achieve comparable accuracies with NeuralSort implementation, either slightly outperforming or underperforming NeuralSort. See hyperparameters used in appendix A.

### 5.5 Speed Comparison

We set up an experiment to compare the speed of the SoftSort and NeuralSort operators. We are interested in exercising both their forward and backward calls. To this end, we set up a dummy learning task where the goal is to perturb an -dimensional input vector to make it become sorted. We scale to and feed it through the SoftSort or NeuralSort operator to obtain , and place a loss on

that encourages it to become equal to the identity matrix, and thus encourages the input to become sorted.

Concretely, we place the cross-entropy loss between the true permutation matrix and the predicted soft URS matrix:

 L(ˆP)=−1nn∑i=1logˆP[i,i]

This encourages the probability mass from each row of to concentrate on the diagonal, which drives to sort itself. Note that this is a trivial task, since for example a pointwise ranking loss (Taylor et al., 2008, Section 2.2) leads the input to become sorted too, without any need for the SoftSort or NeuralSort operators. However, this task is a reasonable benchmark to measure the speed of the two operators in a realistic learning setting.

We benchmark in the range to , and batch inputs

together to exercise batching. Thus the input is a parameter tensor of shape

. Models are trained for epochs, which we verified is enough for the parameter vectors to become perfectly sorted by the end of training (i.e., to succeed at the task).

In Figure 6 we show the results for the TensorFlow implementations of NeuralSort and SoftSort given in Figures 4 and 5 respectively. We see that on both CPU and GPU, SoftSort is faster than NeuralSort. For , SoftSort is about 6 times faster than the NeuralSort implementation of (Grover et al., 2019) on both CPU and GPU. We tried to speed up the NeuralSort implementation of (Grover et al., 2019), and although we were able to improve it, NeuralSort was still slower than SoftSort, concretely: slower on CPU and slower on GPU. Details of our improvements to the speed of the NeuralSort operator are provided in appendix A.4.5.

The performance results for PyTorch are provided in the appendix and are similar to the TensorFlow results. In the appendix we also show that the learning curves of SoftSort with and NeuralSort are almost identical; interestingly, we found that using converges more slowly on this synthetic task.

We also investigated if relying on a sorting routine could cause slower run times in worst-case scenarios. When using sequences sorted in the opposite order we did not note any significant slowdowns. Furthermore, in applications where this could be a concern, the effect can be avoided entirely by shuffling the inputs before applying our operator.

As a final note, given that the cost of sorting is sub-quadratic, and most of the computation is payed when building and applying the matrix, we also think that our algorithm could be made faster asymptotically by constructing sparse versions of the SoftSort operator. For applications like differentiable nearest neighbors, evidence suggests than processing longer sequences yields better results, which motivates improvements in the asymptotic complexity. We leave this topic for future work.

## 6 Conclusion

We have introduced SoftSort, a simple continuous relaxation for the argsort operator. The -th row of the SoftSort operator is simply the softmax of the negative distances to the -th largest element.

SoftSort has similar properties to those of the NeuralSort operator of (Grover et al., 2019). However, due to its simplicity, SoftSort is trivial to implement, more modular, faster than NeuralSort, and proofs regarding the SoftSort operator are effortless. We also empirically find that it is just as easy to optimize.

Fundamentally, SoftSort advances the state of the art in differentiable sorting by significantly simplifying previous approaches. Our code and experiments can be found at https://github.com/sprillo/softsort.

## Acknowledgments

We thank Assistant Professor Jordan Boyd-Graber from University of Maryland and Visting Researcher at Google, and Thomas Müller from Google Language Research at Zurich for their feedback and comments on earlier versions of the manuscript. We would also like to thank the anonymous reviewers for their feedback that helped improve this work.

## References

• M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. External Links: Link Cited by: §1.
• M. Cuturi, O. Teboul, and J. Vert (2019) Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 6858–6868. External Links: Link Cited by: Appendix A, Table 5, Appendix E, §1, §2.3, §2.4, §2, §3, §5.2, §5.2, Table 1.
• I. M. Elfadel and J. L. Wyatt (1994) The ”softmax” nonlinearity: derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector (Eds.), pp. 882–887. External Links: Link Cited by: item 2.
• A. Grover, E. Wang, A. Zweig, and S. Ermon (2019) Stochastic optimization of sorting networks via continuous relaxations. In International Conference on Learning Representations, External Links: Link
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 630–645. External Links: ISBN 978-3-319-46493-0 Cited by: §A.3.1.
• E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, External Links: Link Cited by: §5.1.
• K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst. 20 (4), pp. 422–446. External Links: ISSN , Link, Document Cited by: §2.
• B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum (2011) One shot learning of simple visual concepts. In CogSci, Cited by: §5.4.
• C. Maddison, A. Mnih, and Y. Teh (2017)

The concrete distribution: a continuous relaxation of discrete random variables

.
In International Conference on Learning Representations, pp. . Cited by: §5.1.
• G. Mena, D. Belanger, S. Linderman, and J. Snoek (2018) Learning latent permutations with gumbel-sinkhorn networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.2, §2.2, §2.2, §2.3, §2.4, §3.
• W. Ogryczak and A. Tamir (2003) Minimizing the sum of the k largest functions in linear time. Information Processing Letters 85 (3), pp. 117 – 122. External Links: ISSN , Document, Link Cited by: §2.4.
• A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. Devito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems 30, Cited by: §A.4.6, §1.
• T. Qin, T. Liu, and H. Li (2010) A general approximation framework for direct optimization of information retrieval measures. Inf. Retr. 13 (4), pp. 375–397. External Links: ISSN , Link, Document Cited by: §2.1.
• D. Singh, I. Joshi, and J. Choudhary (2017) Survey of gpu based sorting algorithms. International Journal of Parallel Programming 46, pp. . External Links: Document Cited by: §4.2.
• M. Taylor, J. Guiver, S. Robertson, and T. Minka (2008) SoftRank: optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, New York, NY, USA, pp. 77–86. External Links: ISBN 9781595939272, Link, Document Cited by: §2.1, §2.1, §5.5.
• O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, Cited by: §5.4.

## Appendix A Experimental Details

We use the code kindly open sourced by (Grover et al., 2019) to perform the sorting, quantile regression, and kNN experiments. As such, we are using the same setup as in (Grover et al., 2019). The work of (Cuturi et al., 2019) also uses this code for the sorting task, allowing for a fair comparison.

To make our work self-contained, in this section we recall the main experimental details from (Grover et al., 2019), and we also provide our hyperparameter settings, which crucially differ from those used in (Grover et al., 2019) by the use of higher learning rates and temperatures (leading to improved results).

### a.1 Sorting Handwritten Numbers

#### a.1.1 Architecture

The convolutional neural network architecture used to map

large-MNIST images to scores is as follows:

 Conv[Kernel: 5x5, Stride: 1, Output: 112x28x32, Activation: Relu] → Pool[Stride: 2, Output: 56x14x32] → Conv[Kernel: 5x5, Stride: 1, Output: 56x14x64, Activation: Relu] → Pool[Stride: 2, Output: 28x7x64] → FC[Units: 64, Activation: Relu] → FC[Units: 1, Activation: None]

Recall that the large-MNIST dataset is formed by concatenating four MNIST images, hence each large-MNIST image input is of size .

For a given input sequence of large-MNIST images, using the above CNN we obtain a vector of scores (one score per image). Feeding this score vector into NeuralSort or SoftSort yields the matrix which is a relaxation for .

#### a.1.2 Loss Functions

To lean to sort the input sequence of large-MNIST digits, (Grover et al., 2019) imposes a cross-entropy loss between the rows of the true permutation matrix and the learnt relaxation , namely:

 L=1nn∑i,j=11{P[i,j]=1}logˆP(s)[i,j]

This is the loss for one example in the deterministic setup. For the stochastic setup with reparameterized Plackett-Luce distributions, the loss is instead:

 L=1nn∑i,j=1ns∑k=11{P[i,j]=1}logˆP(s+zk)[i,j]

where are samples from the Gumbel distribution.

#### a.1.3 Hyperparamaters

For this task we used an Adam optimizer with an initial learning rate of and a batch size of . The temperature was selected from the set based on validation set accuracy on predicting entire permutations. As a results, we used a value of for and for . iterations were used to train all models. For the stochastic setting, samples were used.

#### a.1.4 Learning Curves

In Figure 7 (a) we show the learning curves for deterministic SoftSort and NeuralSort, with . These are the average learning curves over all runs. Both learning curves are almost identical, showing that in this task both operators can essentially be exchanged for one another.

### a.2 Quantile Regression

#### a.2.1 Architecture

The same convolutional neural network as in the sorting task was used to map large-MNIST images to scores. A second neural network

with the same architecture but different parameters is used to regress each image to its value. These two networks are then used by the loss functions below to learn the median element.

#### a.2.2 Loss Functions

To learn the median element of the input sequence of large-MNIST digits, (Grover et al., 2019) first soft-sorts via which allows extracting the candidate median image. This candidate median image is then mapped to its predicted value via the CNN . The square loss between and the true median value is incurred. As in (Grover et al., 2019, Section E.2), the loss for a single example can thus compactly be written as (in the deterministic case):

 L=∥y−gθ(ˆP(s)x)∥22

For the stochastic case, the loss for a single example is instead:

 L=ns∑k=1∥y−gθ(ˆP(s+zk)x)∥22

where are samples from the Gumbel distribution.

#### a.2.3 Hyperparamaters

We used an Adam optimizer with an initial learning rate of and a batch size of . The value of was grid searched on the set based on validation set MSE. The final values of used to train the models and evaluate test set performance are given in Table 4. iterations were used to train all models. For the stochastic setting, samples were used.

#### a.2.4 Learning Curves

In Figure 7 (b) we show the learning curves for deterministic SoftSort and NeuralSort, with . Both learning curves are almost identical, showing that in this task both operators can essentially be exchanged for one another.

### a.3 Differentiable KNN

#### a.3.1 Architectures

To embed the images before applying differentiable kNN, we used the following convolutional network architectures. For MNIST:

 Conv[Kernel: 5x5, Stride: 1, Output: 24x24x20, Activation: Relu] → Pool[Stride: 2, Output: 12x12x20] → Conv[Kernel: 5x5, Stride: 1, Output: 8x8x50, Activation: Relu] → Pool[Stride: 2, Output: 4x4x50] → FC[Units: 50, Activation: Relu]

and for Fashion-MNIST and CIFAR-10 we used the ResNet18 architecture (He et al., 2016) as defined in github.com/kuangliu/pytorch-cifar, but we keep the dimensional output before the last classification layer.

For the baseline experiments in pixel distance with PCA and kNN, we report the results of (Grover et al., 2019), using scikit-learn implementations.

In the auto-encoder baselines, the embeddings were trained using the follow architectures. For MNIST and Fashion-MNIST:

 Encoder: FC[Units: 500, Activation: Relu] → FC[Units: 500, Activation: Relu] → FC[Units: 50, Activation: Relu] Decoder: → FC[Units: 500, Activation: Relu] → FC[Units: 500, Activation: Relu] → FC[Units: 784, Activation: Sigmoid]

For CIFAR-10, we follow the architecture defined at github.com/shibuiwilliam/Keras_Autoencoder, with a bottleneck dimension of .

#### a.3.2 Loss Functions

For the models using SoftSort or NeuralSort we use the negative of the probability output from the kNN model as a loss function. For the auto-encoder baselines we use a per-pixel binary cross entropy loss.

#### a.3.3 Hyperparamaters

We perform a grid search for , , learning rates taking values in , and . We train for epochs and choose the model based on validation loss. The optimizer used is SGD with momentum of . Every batch has episode, each containing candidates.

### a.4 Speed Comparison

#### a.4.1 Architecture

The input parameter vector of shape (20 being the batch size) is first normalized to and then fed through the NeuralSort or SoftSort operator, producing an output tensor of shape .

#### a.4.2 Loss Function

We impose the following loss term over the batch:

 L(ˆP)=−12020∑i=11nn∑j=1logˆP[i,j,j]

This loss term encourages the probability mass from each row of to concentrate on the diagonal, i.e. encourages each row of to become sorted in decreasing order. We also add an penalty term which ensures that the entries of do not diverge during training.

#### a.4.3 Hyperparamaters

We used epochs to train the models, with the first epoch used as burn-in to warm up the CPU or GPU (i.e. the first epoch is excluded from the time measurement). We used a temperature of for NeuralSort and for SoftSort. The entries of are initialized uniformly at random in . A momentum optimizer with learning rate and momentum was used. With these settings, 100 epochs are enough to sort each row of in decreasing order perfectly for .

Note that since the goal is to benchmark the operator’s speeds, performance on the Spearman rank correlation metric is anecdotal. However, we took the trouble of tuning the hyperparameters and the optimizer to make the learning setting as realistic as possible, and to ensure that the entries in are not diverging (which would negatively impact and confound the performance results). Finally, note that the learning problem is trivial, as a pointwise loss such as sorts the rows of without need for the NeuralSort or SoftSort operator. However, this bare-bones task exposes the computational performance of the NeuralSort and SoftSort operators.

#### a.4.4 Learning Curves

In Figure 8 we show the learning curves for ; the Spearman correlation metric is plotted against epoch. We see that SoftSort with and NeuralSort have almost identical learning curves. Interestingly, SoftSort with converges more slowly. Figure 8: Learning curves for SoftSort with d=|⋅|p for p∈{1,2}, and NeuralSort, on the speed comparison task.

#### a.4.5 NeuralSort Performance Improvement

We found that the NeuralSort implementations provided by  (Grover et al., 2019) in both TensorFlow and PyTorch have complexity . Indeed, in their TensorFlow implementation (Figure 4), the complexity of the following line is :

B = tf.matmul(A_s, tf.matmul(one,
tf.transpose(one)))


since the three matrices multiplied have sizes , , and respectively. To obtain complexity we associate differently:

B = tf.matmul(tf.matmul(A_s, one),
tf.transpose(one))


The same is true for their PyTorch implementation (Figure 11). This way, we were able to speed up the implementations provided by  (Grover et al., 2019).

#### a.4.6 PyTorch Results

In Figure 9 we show the benchmarking results for the PyTorch framework (Paszke et al., 2017). These are analogous to the results presented in Figure 6) of the main text. The results are similar to those for the TensorFlow framework, except that for PyTorch, NeuralSort runs out of memory on CPU for , on GPU for , and SoftSort runs out of memory on CPU for .

#### a.4.7 Hardware Specification

We used a GPU V100 and an n1-highmem-2 (2 vCPUs, 13 GB memory) Google Cloud instance to perform the speed comparison experiment.

We were also able to closely reproduce the GPU results on an Amazon EC2 p2.xlarge instance (4 vCPUs, 61 GB memory) equipped with a GPU Tesla K80, and the CPU results on an Amazon EC2 c5.2xlarge instance (8 vCPUs, 16 GB memory).

## Appendix B Proof of Proposition 1

First we recall the Proposition:

Proposition. For both (with any ) and , the following identity holds:

 f(s)=f(sort(s))Pargsort(s) (6)

To prove the proposition, we will use the following two Lemmas:

###### Lemma 1

Let be a permutation matrix, and let be any function. Let be the pointwise application of , that is:

 G(A1,…,Ak)i,j=g((A1)i,j,…,(Ak)i,j) (7)

Then the following identity holds for any :

 G(A1,…,Ak)P=G(A1P,…,AkP) (8)

Proof of Lemma 1. Since is a permutation matrix, multiplication to the right by permutes columns according to some permutation, i.e.

 (AP)i,j=Ai,π(j) (9)

for some permutation and any . Thus, for any fixed :

 (G(A1,…,Ak)P)i,j (i)= G(A1,…,Ak)i,π(j) (ii)= g((A1)i,π(j),…,(Ak)i,π(j)) (iii)= g((A1P)i,j,…,(AkP)i,j) (iv)= G(A1P,…,AkP)i,j

where follow from Eq. 9, and follow from Eq. 7. This proves the Lemma.

###### Lemma 2

Let be a permutation matrix, and denote the row-wise softmax, i.e.:

 σ(A)i,j=exp{Ai,j}∑kexp{Ai,k} (10)

Then the following identity holds for any :

 σ(A)P=σ(AP) (11)

Proof of Lemma 2. As before, there exists a permutation such that:

 (BP)i,j=Bi,π(j) (12)

for any . Thus for any fixed :

 (σ(A)P)i,j (i)= σ(A)i,π(j) (ii)= exp{Ai,π(j)}∑kexp{Ai,π(k)} (iii)= exp{(AP)i,j}∑kexp{(AP)i,k} (iv)= σ(AP)i,j

where follow from Eq. 12 and follow from the definition of the row-wise softmax (Eq. 10). This proves the Lemma.

We now leverage the Lemmas to provide proofs of Proposition 1 for each operator. To unclutter equations, we will denote by the row-wise softmax operator.

Proof of Proposition 1 for SoftSort. We have that:

 SoftSortdτ(sort(s))Pargsort(s) (i)= σ(−d(sort(sort(s))1T,1sort(s)T)τ)Pargsort(s) (ii)= σ(−d(sort(