1 Introduction
There are many different kinds of combinatorial optimisation (CO) problems, spanning bin packing, routing, assignment, scheduling, constraint satisfaction, and more. Solving these problems while sidestepping their inherent computational intractability has great importance and impact for the real world, where poor bin packing or routing leads to wasted profit or excess greenhouse emissions (Salimifard et al., 2012). General solving frameworks or metaheuristics for all these problems are desirable, due to their conceptual simplicity and ease of deployment, but they require manual tailoring to each individual problem. One such metaheuristic is Simulated Annealing (SA) (Kirkpatrick et al., 1987), a simple and equally popular iterative global optimisation technique for numerically approximating the global minimum of both continuous- and discrete-variable problems. While SA has wide applicability, this is also its Achilles' heel, as it leaves many design choices to the user. Namely, a user has to design 1) neighbourhood proposal distributions, which define the space of possible transitions from a solution at time $t$ to solutions at time $t+1$, and 2) a temperature schedule, which determines the balance of exploration to exploitation. In this work, we mitigate the need for extensive fine-tuning of SA's parameters by designing a learnable proposal distribution, which we show improves convergence speed with little computational overhead (limited to $O(n)$ per step for problem size $n$).
In recent years, research on approximate optimisation methods has been inundated by works in machine learning for CO (ML4CO) (Bengio et al., 2018). Much of the focus has been on end-to-end neural architectures (Bello et al., 2016; Vinyals et al., 2017; Dai et al., 2017; Kool et al., 2018; Emami & Ranka, 2018; Bresson & Laurent, 2021), which learn the instance-to-solution mapping directly by brute force—in CO these are sometimes referred to as construction heuristics. Other works focus on learning good parameters for classic algorithms, whether they be parameters of the original algorithm (Kruber et al., 2017; Bonami et al., 2018) or extra neural parameters introduced into the computational graph of classic algorithms (Gasse et al., 2019; Gupta et al., 2020; Kool et al., 2021; da Costa et al., 2020; Wu et al., 2019b; Chen & Tian, 2019; Fu et al., 2021). Our method, neural simulated annealing (Neural SA), can be viewed as sitting firmly within this last category.

SA is an improvement heuristic; it navigates the search space of feasible solutions by iteratively applying (small) perturbations to previously found solutions. Figure 1 illustrates this for the Travelling Salesperson Problem (TSP), perhaps the most classic of NP-hard problems. In this work, we pose this process as a Reinforcement Learning (RL) agent navigating an environment in search of better solutions. In this light, the proposal distribution is an optimisable quantity. Conveniently, our method inherits convergence guarantees from SA. We are able to directly optimise the proposal distribution using policy optimisation for both faster convergence and better solution quality under a fixed computation budget. We demonstrate Neural SA on four tasks: Rosenbrock's function, a toy 2D optimisation problem, where we can easily visualise and analyse what is being learnt; the Knapsack and Bin Packing problems, which are classic NP-hard resource allocation problems; and the TSP.
Our contributions are:


We pose simulated annealing as a Markov decision process, bringing it into the realm of reinforcement learning. This allows us to optimise the proposal distribution in a principled manner, while preserving all the convergence guarantees of vanilla simulated annealing.

We show competitive performance with off-the-shelf CO tools and other ML4CO methods on the Knapsack, Bin Packing, and Travelling Salesperson problems, in terms of solution quality and wall-clock time.

We show our method transfers to problems of different sizes, performing well on problems up to 40× larger than the ones used for training.

Our method is competitive within the ML4CO space while using a very lightweight architecture, with the number of learnable parameters on the order of hundreds or fewer.
2 Background and Related Work
Here we outline the basic simulated annealing algorithm and its main components. We then provide an overview of prior work in the machine learning literature that has sought to learn parts of the algorithm, or in which SA has found uses.
Combinatorial optimisation
A combinatorial optimisation problem is defined by a triple $(\Psi, \Omega, E)$, where $\Psi$ is the set of problem instances (city locations in the TSP), $\Omega(\psi)$ is the set of feasible solutions given an instance $\psi \in \Psi$ (Hamiltonian cycles in the TSP), and $E$ is an energy function (tour length in the TSP). Without loss of generality, the task is to minimise the energy $E(x)$ over feasible solutions $x \in \Omega(\psi)$. CO problems are in general NP-hard, meaning that there is no known algorithm that solves them in time polynomial in the number of bits representing a problem instance.
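To make the triple concrete, the following minimal sketch (ours, purely for illustration) instantiates it for a toy TSP: the instance is a list of city coordinates, the feasible set is the permutations of the cities, and the energy is the tour length.

```python
import itertools
import math

# A toy TSP instance: the city coordinates.
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

def tour_length(tour, cities):
    """The energy E: total length of the closed tour."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

# The feasible set is every Hamiltonian cycle, i.e. every permutation of the
# cities; brute-force enumeration is only possible for tiny n.
best = min(itertools.permutations(range(len(cities))),
           key=lambda t: tour_length(t, cities))
print(best, tour_length(best, cities))  # perimeter of the unit square: 4.0
```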
Simulated Annealing
Simulated annealing (Kirkpatrick et al., 1987) is a metaheuristic for CO problems. It builds an inhomogeneous Markov chain $x_0 \to x_1 \to \cdots \to x_K$, for $x_t \in \Omega(\psi)$, asymptotically converging to a minimiser of $E$. The stochastic transitions depend on two quantities: 1) a proposal distribution, and 2) a temperature schedule. The proposal distribution $q : \Omega \to \mathcal{P}(\Omega)$, for $\mathcal{P}(\Omega)$ the space of probability distributions on $\Omega$, suggests new states in the chain. It perturbs current solutions to new ones, potentially leading to lower energies immediately or later on. After perturbing a solution $x_t$ into $x' \sim q(\cdot \mid x_t)$, a Metropolis–Hastings (MH) step (Metropolis et al., 1953; Hastings, 1970) is executed. This either accepts the perturbation ($x_{t+1} = x'$) or rejects it ($x_{t+1} = x_t$)—see Algorithm 1 for details. The target distribution of the MH step has form $\pi_t(x) \propto e^{-E(x)/T_t}$, where $T_t$ is the temperature at time $t$. In the limit $T_t \to 0$, this distribution tends to a sum of Dirac deltas on the minimisers of the energy. The temperature is annealed, according to the temperature schedule $(T_t)_{t \geq 0}$, from high to low, to steer the target distribution smoothly from broad to peaked around the global optima. The algorithm is outlined in Algorithm 1. Under certain regularity conditions, and provided the chain is long enough, it will visit the minimisers almost surely (Geman & Geman, 1984). More concretely,

$$\lim_{t \to \infty} \Pr\left(x_t \in \Omega^*\right) = 1, \qquad (1)$$

where $\Omega^*$ denotes the set of global minimisers of $E$.
Despite this guarantee, practical convergence speed is determined by $q$ and the temperature schedule, which are hard to fine-tune. There exist problem-specific heuristics for setting these (Pereira & Fernandes, 2004; Cicirello, 2007), but in this paper we propose to learn the proposal distribution instead.
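For reference, the skeleton that Algorithm 1 describes fits in a few lines of Python. This is a hedged sketch rather than the paper's implementation: the `propose` argument stands in for the problem-specific proposal distribution $q$, and the exponential cooling matches the schedule used later in Section 4.

```python
import math
import random

def simulated_annealing(x0, energy, propose, T0=1.0, TK=0.01, K=10_000):
    """Vanilla SA: perturb the solution, then Metropolis-Hastings accept/reject."""
    gamma = (TK / T0) ** (1.0 / K)      # exponential cooling: T_{t+1} = gamma * T_t
    x, cur_E, T = x0, energy(x0), T0
    best, best_E = x, cur_E
    for _ in range(K):
        x_new = propose(x)              # sample x' ~ q(. | x_t)
        dE = energy(x_new) - cur_E
        # MH step: always accept downhill moves; uphill with prob exp(-dE/T).
        if dE <= 0 or random.random() < math.exp(-dE / T):
            x, cur_E = x_new, cur_E + dE
        if cur_E < best_E:
            best, best_E = x, cur_E
        T *= gamma
    return best, best_E
```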
2.1 Simulated annealing and machine learning
A natural way to combine machine learning and simulated annealing is to design local improvement heuristics that feed off each other. Cai et al. (2019) and Vashisht et al. (2020) use RL to find good initial solutions that are later refined by standard SA. That is fundamentally different from our approach, as we augment SA with RL-optimisable components, instead of simply using RL and SA as standalone algorithms that only interact via shared solutions. In fact, our method is perfectly compatible with theirs and any other SA application. Another line of work seeks to optimise different components of SA with RL (Wauters et al., 2020; Khairy et al., 2020; Beloborodov et al., 2020; Mills et al., 2020) or statistical machine learning techniques (Blum et al., 2020). In contrast to these methods, which optimise individual hyperparameters of SA, we frame SA itself as an RL problem, which allows us to define and train the proposal distribution as a policy.
Closer to our method, other approaches improve the proposal distribution. In Adaptive Simulated Annealing (ASA) (Ingber, 1996), the proposal distribution is not fixed but evolves throughout the annealing process as a function of the variance of the quality of visited solutions. ASA improves the convergence of standard SA but is not learnable like Neural SA. To the best of our knowledge, Marcos Alvarez et al. (2012) are the only others to learn the proposal distribution for SA, but they rely on supervised learning, requiring high-quality solutions or good search strategies to imitate, both of which are expensive to compute. Conversely, Neural SA is fully unsupervised, and thus easier to train and extend to different CO tasks. Finally, SA is also akin to Metropolis–Hastings, a popular choice for Markov Chain Monte Carlo (MCMC) sampling. Noé et al. (2019), Albergo et al. (2019) and de Haan et al. (2021) recently studied how to learn the proposal distribution of an MCMC chain for sampling the Boltzmann distribution of a physical system. While their results serve as motivation for our method, we investigate a completely different context and set of applications.
Lastly, our work falls under bilevel optimisation methods, where an outer optimisation loop finds the best parameters of an inner optimisation. This encompasses settings such as learning the parameters (Rere et al., 2015) or hyperparameters of a neural network optimiser (Maclaurin et al., 2015; Andrychowicz et al., 2016) and meta-learning (Finn et al., 2017). However, most recent approaches assume differentiable losses on continuous state spaces (Likhosherstov et al., 2021; Ji et al., 2021; Vicol et al., 2021), while we focus on the more challenging CO setting. We note, however, that the methods in Vicol et al. (2021) are based on evolution strategies and could be used in the discrete setting.
2.2 Markov Decision Processes
Simulated annealing naturally fits into the Markov Decision Process (MDP) framework, as we explain below. An MDP consists of states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, an immediate reward function $r(s, a)$, a transition kernel $p(s' \mid s, a)$, and a discount factor $\gamma$. On top of this MDP we add a stochastic policy $\pi(a \mid s)$. The policy and transition kernel together define a length-$K$ trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_K)$, which is a sample from the distribution $p(\tau) = p(s_0) \prod_{t=0}^{K-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, where $s_0$ is sampled from the start-state distribution $p(s_0)$. One can then define the discounted return over a trajectory as $R(\tau) = \sum_{t=0}^{K-1} \gamma^t r(s_t, a_t)$. We say that we have solved an MDP if we have found a policy that maximises the expected return $\mathbb{E}_{\tau \sim p(\tau)}[R(\tau)]$.
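As a small worked example of the quantities above, the discounted return of a single trajectory can be accumulated backwards over its reward sequence (a sketch; the function name is ours):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t, computed right-to-left in O(K)."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

assert discounted_return([1.0, 1.0], gamma=0.5) == 1.5  # 1 + 0.5 * 1
```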
3 Method
Here we outline our approach to learning the proposal distribution. First we define an MDP corresponding to SA. We then show how the proposal distribution can be optimised, and justify why this does not affect the convergence guarantees of the classic algorithm.
3.1 MDP Formulation
We formalise SA as an MDP, with states $s_t = (\psi, x_t, T_t)$, for $\psi$ a parametric description of the problem instance as in Section 2, $x_t$ the current solution, and $T_t$ the instantaneous temperature. Examples are in Section 4. Our actions perturb the current solution, $x_t \mapsto x' \in \mathcal{N}(x_t)$, where $x'$ is a solution in the neighbourhood $\mathcal{N}(x_t)$ of $x_t$. It is common to define small neighbourhoods, to limit energy variation from one state to the next. This heuristic discards exceptionally good and exceptionally bad moves, but since the latter are more common than the former, it generally leads to faster convergence.
We view the MH step in SA as a stochastic transition kernel, governed by the current temperature of the system, with transition probabilities following a Gibbs distribution and dynamics

$$x_{t+1} = \begin{cases} x' & \text{with probability } \min\left\{1, e^{-(E(x') - E(x_t))/T_t}\right\} \\ x_t & \text{otherwise.} \end{cases} \qquad (2)$$
This defines a transition kernel $p(s_{t+1} \mid s_t, a_t)$, where $s_{t+1} = (\psi, x_{t+1}, T_{t+1})$. For rewards, we use either the immediate gain $E(x_t) - E(x_{t+1})$ or the primal reward, the best (lowest) energy attained along the rollout. We explored training with two different methods: Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Evolution Strategies (ES) (Salimans et al., 2017). The immediate gain works best with PPO, where at each iteration of the rollout it gives fine-grained feedback on whether the previous action helped or not. The primal reward works best with ES because it is non-local, returning the minimum along an entire rollout at the very end. We explored using the acceptance count as a reward but found that this sometimes led to pathological behaviours. Similarly, we tried the primal integral (Berthold, 2013), which encourages finding a good solution fast, but found we could not get the training dynamics to converge.
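Given the energy trace of one rollout, the two reward choices can be sketched as follows; negating the best energy so that higher is better is our sign convention, not necessarily the paper's exact definition.

```python
def immediate_gains(energies):
    """Per-step reward E(x_t) - E(x_{t+1}): positive whenever a move helped.
    `energies` holds E(x_0), ..., E(x_K) along one rollout (used with PPO)."""
    return [energies[t] - energies[t + 1] for t in range(len(energies) - 1)]

def primal_reward(energies):
    """Single terminal reward: the best energy seen in the rollout (used with ES)."""
    return -min(energies)
```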
3.2 Policy Network Architecture
SA chains are long, so we need as lightweight a policy architecture as possible. Furthermore, this architecture should have the capacity to scale to varying numbers of inputs, so that we can transfer experience across problems of different size $n$. We opt for a very simple network, shown in Figure 2. For each dimension of the problem we map the state into a set of features; for all problems we try, there is a natural way to do this. Each feature is fed into an MLP, embedding it into a logit space, followed by a softmax function to yield probabilities. The complexity of this architecture scales linearly with $n$, and the computation is embarrassingly parallel, which is important since we plan to evaluate it many times. A notable property of this architecture is that it is permutation equivariant (Zaheer et al., 2017)—permuting the input items permutes the output probabilities in the same way—an important requirement for the CO problems we consider. Note that our model is a permutation-equivariant set-to-set mapping, but we have not used attention or other kinds of pairwise interaction, to keep the computational complexity linear in the number of items.
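A minimal PyTorch sketch of this pointwise architecture is below; the layer sizes follow the appendix (two layers, 16 hidden units, ReLU), while the class name and feature layout are ours.

```python
import torch
import torch.nn as nn

class PointwisePolicy(nn.Module):
    """Permutation-equivariant policy: the same tiny MLP scores every item,
    and a softmax over items yields the proposal distribution."""
    def __init__(self, in_features, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):                  # feats: (batch, n, in_features)
        logits = self.mlp(feats).squeeze(-1)   # (batch, n); O(n) per step
        return torch.softmax(logits, dim=-1)

# Permuting the items permutes the output probabilities identically:
# policy(feats[:, perm]) == policy(feats)[:, perm]
```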
Convergence
Convergence of SA to the optimum in the infinite-time limit requires the Markov chain of the proposal distribution to be irreducible (van Laarhoven & Aarts, 1987), meaning that for any temperature, any two states are reachable from one another through a sequence of transitions with positive conditional probability under $q$. Our neural network policy satisfies this condition as long as the softmax layer does not assign zero probability to any state, a condition which is met in practice. Thus Neural SA inherits the convergence guarantees of SA.
4 Experiments
Figure: Results on Rosenbrock's function: (a) example trajectory, moving from red to blue, showing convergence around the minimiser at (1, 1); (b) Neural SA has a higher acceptance ratio than the baseline, a trend observed in all experiments; (c) standard deviation of the learned policy as a function of iteration—large initial steps offer great gains, followed by small exploitative steps; (d) a non-adaptive vanilla SA baseline cannot match an adaptive one, no matter the standard deviation.
We evaluate our method on four tasks—Rosenbrock's function and the Knapsack, Bin Packing, and TSP problems—using the same architecture and hyperparameters of Neural SA for all tasks. This shows the wide applicability and ease of use of our method. For each task (except Rosenbrock's function) we test Neural SA on problems of different size $n$, training only on the smallest. Similarly, we consider rollouts of different lengths, training only on short ones. This accelerates training and shows Neural SA's generalisation capabilities. This type of transfer learning is one of the challenges in ML4CO (Joshi et al., 2019b), and is a merit of our lightweight, equivariant architecture. In all experiments, we start from trivial or random solutions and adopt an exponential multiplicative cooling schedule, as originally proposed by Kirkpatrick et al. (1987), with $T_{t+1} = \gamma T_t$ for $0 < \gamma < 1$. In practice, we define the temperature schedule by fixing the initial and final temperatures $T_0$ and $T_K$, and computing $\gamma = (T_K / T_0)^{1/K}$ according to the desired number of steps $K$. This allows us to vary the rollout length while maintaining the same range of temperatures for every run. We provide more precise experimental details in the appendix.

4.1 The Rosenbrock function
The Rosenbrock function is a common benchmark for optimisation algorithms. It is a non-convex function over Euclidean space, defined as

$$E(x, y) = (1 - x)^2 + 100 \left(y - x^2\right)^2, \qquad (3)$$

with global minimum at $(1, 1)$. Of course, gradient-based optimisers are better suited to this problem, but we use it as a toy example to showcase the properties of Neural SA. Our policy is an axis-aligned Gaussian $a \sim \mathcal{N}(0, \operatorname{diag}(\sigma^2(s)))$, where we parametrise the variance by a small two-layer MLP with a ReLU in the middle. Proposals are of the form $x' = x + a$, and the state is given by the current solution and temperature. An example rollout is shown in Figure 2(a).

We contrast Neural SA against vanilla SA with a fixed proposal distribution, i.e. constant $\sigma$, for a range of values of $\sigma$, averaged over problem instances. Figure 2(d) shows that no constant-variance policy can outperform an adaptive policy on this problem. Plots of the acceptance ratio in Figure 2(b) show Neural SA has a higher acceptance probability early in the rollout, a trend we observed in all experiments, suggesting its proposals are skewed towards lower-energy solutions than those of standard SA. Figure 2(c) shows the output of the variance network as a function of time. It has learnt to make large steps until hitting the basin, whereupon large moves will be rejected with high probability, so the variance must be reduced.

4.2 Knapsack Problem
The Knapsack problem is a classic CO problem in resource allocation. Given a set of $n$ items, each with a value $v_i$ and weight $w_i$, the goal is to find a subset that maximises the sum of values while respecting a maximum total weight $W$. This is the 0–1 Knapsack Problem, which is weakly NP-complete, has a search space of size $2^n$, and corresponds to the integer linear program

$$\max_{x \in \{0,1\}^n} \sum_{i=1}^{n} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \leq W. \qquad (4)$$
Table 1: Results on the Knapsack problem.

| | Random Search | Bello RL | Bello AS | SA | Ours (PPO) | Ours (ES) | Greedy | ORTools |
|---|---|---|---|---|---|---|---|---|
| Knap50 | | | | | | | | |
| Knap100 | | | | | | | | |
| Knap200 | | | | | | | | |
| Knap500 | | — | — | | | | | |
| Knap1K | | — | — | — | | | | |
| Knap2K | | — | — | — | | | | |
Solutions are represented as a binary vector $x \in \{0, 1\}^n$, with $x_i = 0$ for 'out of the knapsack' and $x_i = 1$ for 'in the knapsack'. Our proposal distribution flips individual bits, one at a time, with the constraint that we cannot flip a bit if the knapsack capacity would be exceeded. The neighbourhood of $x$ is thus all feasible solutions at a Hamming distance of 1 from $x$. We use the proposal distribution described in Section 3.2 and illustrated in Figure 2, consisting of a pointwise embedding of each item—its weight, value, occupancy bit, the knapsack's overall capacity, and the global temperature—into a logit space, followed by a softmax over items. Mathematically, the policy and the state–action to proposal mapping are

$$\pi(a \mid s) = \operatorname{softmax}\big(h(u_1), \ldots, h(u_n)\big), \qquad x' = x \oplus a, \qquad (5)$$

where $u_i = (w_i, v_i, x_i, W, T)$ collects the features of item $i$ and $h$ is a small two-layer neural network with ReLU activations, comprising only 112 parameters. Actions are sampled from the categorical distribution induced by the softmax and cast to one-hot vectors $a \in \{0, 1\}^n$.
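One plausible way to realise the feasibility constraint is to mask infeasible flips before the softmax, as in the sketch below; whether the constraint is enforced by masking or by rejection is our assumption.

```python
import torch

def knapsack_proposal_probs(logits, x, w, W):
    """Mask flips that would overfill the knapsack, then softmax.
    logits: (n,) per-item scores from h; x: (n,) occupancy bits;
    w: (n,) weights; W: scalar capacity."""
    free = W - (w * x).sum()
    # Flipping item i is infeasible only when it is out (x_i = 0) and w_i > free.
    infeasible = (x == 0) & (w > free)
    masked = logits.masked_fill(infeasible, float("-inf"))
    return torch.softmax(masked, dim=-1)
```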
Neural networks have been used to solve the Knapsack Problem in Vinyals et al. (2017), Nomer et al. (2020), and Bello et al. (2016). We follow the setup of Bello et al. (2016), homing in on three self-generated datasets: Knap50, Knap100 and Knap200. Knap$n$ consists of $n$ items with weights and values generated uniformly at random in $[0, 1]$, with capacities of 12.5 for Knap50 and 25 for Knap100 and Knap200. We use ORTools (Perron & Furnon, 2019) to compute ground-truth solutions. Results in Table 1 show that Neural SA improves over vanilla SA by up to 10% in optimality gap, and over heuristic methods (Random Search) by much more. Neural SA falls slightly behind two methods by Bello et al. (2016), which use (1) a large attention-based pointer network with several orders of magnitude more parameters (Bello RL), and (2) the same network coupled with 5000 iterations of their Active Search method (Bello AS). It also falls behind a greedy heuristic that packs the knapsack in order of value-to-weight ratio. In Figure 4 we analyse the policy network and a typical rollout. It has learnt a mostly greedy policy, filling its knapsack with light, valuable objects and only ejecting them when full, in line with the value-to-weight greedy heuristic. Despite not coming top among the methods, Neural SA is typically within 1–3% of the minimum energy, although its architecture was not designed for this problem in particular.
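The value-to-weight heuristic referenced above is simple enough to state in full (a standard greedy sketch):

```python
def greedy_knapsack(values, weights, W):
    """Pack items in decreasing value-to-weight order while capacity allows."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total_w, packed = 0.0, []
    for i in order:
        if total_w + weights[i] <= W:
            packed.append(i)
            total_w += weights[i]
    return packed
```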
4.3 Bin Packing Problem
The Bin Packing problem is similar in nature to the Knapsack problem. Here, one wants to pack all $n$ items into the smallest possible number of bins, where each item has weight $w_i$, and we assume, without loss of generality, bins of equal capacity $C \geq \max_i w_i$; there would be no valid solution otherwise. This problem is NP-hard and has a search space of size equal to the Bell number $B_n$. If $x_{ib} = 1$ denotes item $i$ occupying bin $b$, then the problem can be written as minimising an energy:

$$\begin{aligned} \text{minimise} \quad & \sum_{b} \mathbb{1}\Big[\textstyle\sum_{i} x_{ib} > 0\Big] \qquad (6) \\ \text{subject to} \quad & \textstyle\sum_{b} x_{ib} = 1, \qquad \sum_{i} w_i x_{ib} \leq C, \qquad x_{ib} \in \{0, 1\}, \end{aligned}$$

where the constraints apply for all $i$ and $b$. We define the policy in two steps: we first pick an item $i$, and then select a bin $b$ to place it into. We can then write the policy as $\pi(a \mid s) = \pi(b \mid i, s)\, \pi(i \mid s)$, which we define as

$$\pi(i \mid s) \propto \exp\big(h_{\text{item}}(u_i)\big), \qquad \pi(b \mid i, s) \propto \exp\big(h_{\text{bin}}(u_{ib})\big), \qquad (7)$$

where $u_i$ and $u_{ib}$ collect the relevant item and bin features, $b_i$ is the bin item $i$ is in before the action (in terms of $x$, we have $x_{i b_i} = 1$), $c_b$ is the free capacity of bin $b$ ($c_b = C - \sum_i w_i x_{ib}$), and both $h_{\text{item}}$ and $h_{\text{bin}}$ are lightweight architectures with a ReLU nonlinearity between two layers. We sample from the policy ancestrally, first drawing an item from $\pi(i \mid s)$, followed by a bin from $\pi(b \mid i, s)$. Results in Table 2 show that our lightweight model finds solutions with energies only about 1% higher than the minima found by FFD (Johnson, 1973), a very strong heuristic for this problem (Rieck, 2021). We also very often beat the SCIP (Gamrath et al., 2020a, b) optimizer in ORTools, which timed out on most problems. Figure 5 compares the convergence speed of Neural SA with vanilla SA and a third option, Greedy Neural SA, which uses argmax samples from the policy. The learnt policy, visualised in Figure 6, converges much faster than the vanilla version. Again we see that our method, although simple, is competitive with hand-designed alternatives, whereas vanilla SA is not.
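Ancestral sampling from the two-stage policy can be sketched as follows. Precomputing bin logits for every item is for clarity only; in practice one would score bins just for the chosen item.

```python
import torch

def sample_action(item_logits, bin_logits):
    """Two-stage ancestral sampling: i ~ pi(i|s), then b ~ pi(b|i,s).
    item_logits: (n,); bin_logits: (n, n), row i scoring bins for item i."""
    item_dist = torch.distributions.Categorical(logits=item_logits)
    i = item_dist.sample()
    bin_dist = torch.distributions.Categorical(logits=bin_logits[i])
    b = bin_dist.sample()
    # The joint log-probability factorises: log pi(i|s) + log pi(b|i,s).
    log_prob = item_dist.log_prob(i) + bin_dist.log_prob(b)
    return i, b, log_prob
```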
Table 2: Results on the Bin Packing problem.

| | SA | Ours (PPO) | Ours (ES) | ORTools (SCIP) | FFD |
|---|---|---|---|---|---|
| Bin50 | | | | | |
| Bin100 | | | | | |
| Bin200 | | | | | |
| Bin500 | | | | | |
| Bin1000 | | | | | |
| Bin2000 | | | | | |
4.4 Travelling Salesperson Problem
Imagine making a round road trip through $n$ cities, planning the shortest route that visits each city exactly once; this is the Travelling Salesperson Problem (TSP) (Applegate et al., 2006). The TSP has been a long-time favourite of computer scientists due to its easy description and NP-hardness (the base search space has size equal to the factorial of the number of cities). Here we use it as an example of a difficult CO problem. We compare with Concorde (Applegate et al., 2006) and LKH3 (Helsgaun, 2000), two custom solvers for the TSP. Given $n$ cities with spatial coordinates $c_1, \ldots, c_n$, we wish to find a linear ordering of the cities, called a tour and denoted by the permutation vector $x$, that minimises the tour length

$$E(x) = \sum_{i=1}^{n} \left\lVert c_{x_i} - c_{x_{i+1}} \right\rVert_2, \qquad (8)$$

where we have defined $x_{n+1} = x_1$ for convenience of notation. Our action space consists of so-called 2-opt moves (Croes, 1958), which reverse contiguous segments of a tour. An example of a 2-opt move is shown in Figure 1. We have a two-stage architecture, as in Bin Packing, which selects the start and end cities of the segment to reverse. Denoting $i$ as the start and $j$ as the end city, we have $\pi(a \mid s) = \pi(j \mid i, s)\, \pi(i \mid s)$, parametrised as

$$\pi(i \mid s) \propto \exp\big(h_{\text{start}}(u_i)\big), \qquad \pi(j \mid i, s) \propto \exp\big(h_{\text{end}}(u_j, u_i)\big), \qquad (9)$$

where $u_i$ collects the coordinates of city $i$ and its tour neighbours, plus the temperature, and $h_{\text{start}}$ and $h_{\text{end}}$ are again simple two-layer MLPs. We test on the publicly available TSP20/50/100 datasets (Kool et al., 2018), with 10K problems each, and generate TSP200/500 with 1K tours each. Results in Table 3 show Neural SA improves on vanilla SA. Albeit not outperforming Fu et al. (2021), Neural SA is neck-and-neck with other neural improvement heuristics, GATT {1000} (Wu et al., 2019b) and da Costa et al. {500} (2020). Since, unlike the competing methods, Neural SA is not custom-designed for the TSP, we view this as surprisingly good. A more complete comparison, including other neural approaches, is given in the appendix, Table 12.
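A 2-opt move is just a segment reversal on the tour, as in this sketch; note that only the two edges at the segment boundary change, so the induced change in tour length can be scored in O(1).

```python
def two_opt(tour, i, j):
    """Reverse the tour segment between positions i and j (0 <= i < j < n)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

# Reversing positions 1..3 of [0, 1, 2, 3, 4] yields [0, 3, 2, 1, 4].
assert two_opt([0, 1, 2, 3, 4], 1, 3) == [0, 3, 2, 1, 4]
```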
Table 3: Results on the TSP; each cell shows cost (optimality gap) time. Methods marked * are self-reported.

| Method | TSP20 | TSP50 | TSP100 | TSP200 | TSP500 |
|---|---|---|---|---|---|
| Concorde | 3.836 (0.00%) 48s | 5.696 (0.00%) 2m | 7.764 (0.00%) 7m | 10.70 (0.00%) 38m | 16.54 (0.00%) 7h58m |
| LKH3 | 3.836 (0.00%) 1m | 5.696 (0.00%) 14m | 7.764 (0.00%) 1h | 10.70 (0.00%) 21m | 16.54 (0.00%) 1h15m |
| SA | 3.881 (1.17%) 5s | 5.943 (4.34%) 37s | 8.343 (7.45%) 3m | 11.98 (11.87%) 9m | 20.22 (22.25%) 56m |
| Ours (PPO) | 3.838 (0.05%) 9s | 5.734 (0.67%) 1m | 7.874 (1.42%) 9m | 11.00 (2.80%) 16m | 17.64 (6.65%) 2h16m |
| Ours (ES) | 3.840 (0.10%) 9s | 5.828 (2.32%) 1m | 8.191 (5.50%) 9m | 11.74 (9.72%) 16m | 20.27 (22.55%) 2h16m |
| ORTools* | 3.86 (0.85%) 1m | 5.85 (2.87%) 5m | 8.06 (3.86%) 23m | — | — |
| GATT {1000}* | 3.84 (0.03%) 12m | 5.75 (0.83%) 16m | 8.01 (3.24%) 25m | — | — |
| da Costa et al. {500}* | 3.84 (0.01%) 5m | 5.72 (0.36%) 7m | 7.91 (1.84%) 10m | — | — |
| Fu et al.* | 3.84 (0.00%) 1m | 5.70 (0.01%) 8m | 7.76 (0.04%) 15m | — | — |

5 Discussion
Neural SA is a general, plug-and-play method, requiring only the definition of neighbourhoods for the proposal distribution and training problem instances (no solutions needed). It also obviates time-consuming architecture design, since a simple MLP is enough for a range of CO problems. In this section, we discuss some of the main features of Neural SA.
Computational Efficiency Neural SA requires few computational resources given its compact architecture, with 384 parameters on the TSP, 160 for Bin Packing, and 112 for Knapsack. Further, the cost of each step scales linearly in the problem size, since the architectures are embarrassingly parallel. In terms of running times, Neural SA is on par with, and often faster than, other TSP solvers (see Tables 3 and 12). For the Knapsack and Bin Packing problems, we compare running times against ORTools, as shown in the appendix, Table 4. Neural SA lags behind ORTools on the Knapsack, for which an efficient branch-and-bound solver is known. However, for the Bin Packing problem, Neural SA is much faster than the available Mixed-Integer Programming solver, which only found trivial solutions for the larger problem sizes. Finally, Neural SA is also fast to train: only a few minutes with PPO and up to a few hours with ES. This can be attributed to its low number of parameters but also to its generalisation ability; in all experiments, we could get away with training only on the smallest instances with very short rollouts.
PPO vs ES Neural SA can be trained with any policy optimisation method, which makes it highly extensible. We found no clear winner between PPO and ES, apart from on the TSP, where PPO excelled and generalised better to larger instances. We also observed PPO to converge faster than ES, but ES policies were more robust, still performing well when we switched to greedy sampling, for example. Interestingly, the acceptance rate over trajectories was problem-dependent and always higher in Neural SA (both PPO and ES) than in vanilla SA, contradicting the conventional wisdom that it should be held at 0.44 throughout a rollout (Lam & Delosme, 1988).
Generalisation Our experiments show Neural SA generalises to different problem sizes and rollout lengths; a remarkable feat for such a simple pipeline, since transfer learning is notoriously difficult in RL and CO. Many ML4CO methods do handle problems of different sizes but underperform when tested on larger instances than the ones seen in training (Kool et al., 2018; Joshi et al., 2019b) (see appendix, Table 9). Fu et al. (2021) achieve better generalisation results for the TSP but had to resort to a suite of techniques to allow a small supervised model to be applied to larger problems. These are not easy to implement, are TSP-specific, and constitute only the first step in a complex pipeline that still relies on a tailored Monte-Carlo tree search algorithm.
Solution Quality In all problems we considered, Neural SA, with little to no fine-tuning of its hyperparameters, outperformed vanilla SA and could get within a few percentage points or less of the global minima. Conversely, state-of-the-art SA variants are designed by searching a large space of different hyperparameters (Franzin & Stützle, 2019), a costly process that Neural SA helps us mitigate. Neural SA did not achieve state-of-the-art results, but that was not to be expected, nor was it our main goal. Instead, we envision Neural SA as a general-purpose solver, allowing researchers and practitioners to get a strong baseline quickly, without the need to fine-tune classic CO algorithms or design and train complex neural networks. Given its good performance, small computational footprint, and fast training across a diverse set of CO problems, we believe Neural SA is a promising solver that can strike the right balance among solution quality, computing costs and development time.
6 Conclusion
We presented Neural SA, a neurally augmented simulated annealing algorithm in which the SA chain is a trajectory from an MDP. In this light, the proposal distribution can be interpreted as a policy, which can be optimised. This has numerous benefits: 1) accelerated convergence of the chain, 2) the ability to condition the proposal distribution on side-information, 3) no need for ground-truth data to learn the proposal distribution, 4) lightweight architectures that can be run on CPU, unlike many contemporary ML4CO methods, 5) scalability to large problems due to its light computational overhead, and 6) generalisation across different problem sizes.
These contributions show that augmenting classic, time-tested (meta)heuristics with learnable components is a promising direction for future research in ML4CO. In contrast to expensive end-to-end methods in previous work, this could be a more promising path towards machine learning models capable of solving a wide range of CO problems. As we show in this paper, this approach can yield solid results for different problems while preserving the theoretical guarantees of existing CO algorithms and requiring only simple neural architectures that can be easily trained on small problems.
The ease of use and flexibility of Neural SA do come with drawbacks. In all experiments we were not able to achieve the minimum energy, even though we could usually get within a percentage point. The model also has no built-in termination condition, nor can it provide a certificate on the quality of the solutions found. There is also the question of how to tune the temperature schedule, which we did not attempt in this work. These shortcomings are all points to be addressed in upcoming research. We are also interested in extending the framework to multiple trajectories, as in parallel tempering (Swendsen & Wang, 1986) or genetic algorithms (Holland, 1992). For these, we would maintain a population of chains, which could exchange information.

References
 Albergo et al. (2019) Albergo, M., Kanwar, G., and Shanahan, P. Flow-based generative models for markov chain monte carlo in lattice field theory. Physical Review D, 100(3):034515, 2019.
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pp. 3981–3989, 2016.
 Applegate et al. (2006) Applegate, D. L., Bixby, R. E., Chvátal, V., and Cook, W. J. The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2006.
 Bello et al. (2016) Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016. URL http://arxiv.org/abs/1611.09940.

 Beloborodov et al. (2020) Beloborodov, D., Ulanov, A. E., Foerster, J. N., Whiteson, S., and Lvovsky, A. Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization. Machine Learning: Science and Technology, 2(2):025009, 2020.
 Bengio et al. (2018) Bengio, Y., Lodi, A., and Prouvost, A. Machine learning for combinatorial optimization: a methodological tour d'horizon. CoRR, abs/1811.06128, 2018. URL http://arxiv.org/abs/1811.06128.
 Berthold (2013) Berthold, T. Measuring the impact of primal heuristics. Operations Research Letters, 41(6):611–614, 2013. ISSN 01676377. doi: https://doi.org/10.1016/j.orl.2013.08.007. URL https://www.sciencedirect.com/science/article/pii/S0167637713001181.
 Blum et al. (2020) Blum, A., Dan, C., and Seddighin, S. Learning complexity of simulated annealing, 2020.

 Bonami et al. (2018) Bonami, P., Lodi, A., and Zarpellon, G. Learning a classification of mixed-integer quadratic programming problems. In van Hoeve, W.-J. (ed.), Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 595–604, Cham, 2018. Springer International Publishing. ISBN 978-3-319-93031-2.
 Bresson & Laurent (2021) Bresson, X. and Laurent, T. The transformer network for the traveling salesman problem. arXiv preprint arXiv:2103.03012, 2021.
 Cai et al. (2019) Cai, Q., Hang, W., Mirhoseini, A., Tucker, G., Wang, J., and Wei, W. Reinforcement learning driven heuristic optimization, 2019.
 Chen & Tian (2019) Chen, X. and Tian, Y. Learning to perform local rewriting for combinatorial optimization. Advances in Neural Information Processing Systems, 32:6281–6292, 2019.
 Cicirello (2007) Cicirello, V. A. On the design of an adaptive simulated annealing algorithm. In Proceedings of the international conference on principles and practice of constraint programming first workshop on autonomous search, 2007.
 Croes (1958) Croes, G. A. A method for solving traveling-salesman problems. Operations Research, 6:791–812, 1958.
 da Costa et al. (2020) da Costa, P. R., Rhuggenaath, J., Zhang, Y., and Akcay, A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. CoRR, abs/2004.01608, 2020. URL https://arxiv.org/abs/2004.01608.
 Dai et al. (2017) Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. CoRR, abs/1704.01665, 2017. URL http://arxiv.org/abs/1704.01665.
 de Haan et al. (2021) de Haan, P., Rainone, C., Cheng, M. C. N., and Bondesan, R. Scaling up machine learning for quantum field theory with equivariant continuous flows, 2021.
 Emami & Ranka (2018) Emami, P. and Ranka, S. Learning permutations with sinkhorn policy gradient. CoRR, abs/1805.07010, 2018. URL http://arxiv.org/abs/1805.07010.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR, 2017.
 Franzin & Stützle (2019) Franzin, A. and Stützle, T. Revisiting simulated annealing: A componentbased analysis. Computers & operations research, 104:191–206, 2019.
 Fu et al. (2021) Fu, Z.H., Qiu, K.B., and Zha, H. Generalize a small pretrained model to arbitrarily large tsp instances. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):7474–7482, May 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16916.
 Gamrath et al. (2020a) Gamrath, G., Anderson, D., Bestuzheva, K., Chen, W.K., Eifler, L., Gasse, M., Gemander, P., Gleixner, A., Gottwald, L., Halbig, K., Hendel, G., Hojny, C., Koch, T., Le Bodic, P., Maher, S. J., Matter, F., Miltenberger, M., Mühmer, E., Müller, B., Pfetsch, M. E., Schlösser, F., Serrano, F., Shinano, Y., Tawfik, C., Vigerske, S., Wegscheider, F., Weninger, D., and Witzig, J. The SCIP Optimization Suite 7.0. Technical report, Optimization Online, March 2020a. URL http://www.optimizationonline.org/DB_HTML/2020/03/7705.html.
 Gamrath et al. (2020b) Gamrath, G., Anderson, D., Bestuzheva, K., Chen, W.K., Eifler, L., Gasse, M., Gemander, P., Gleixner, A., Gottwald, L., Halbig, K., Hendel, G., Hojny, C., Koch, T., Le Bodic, P., Maher, S. J., Matter, F., Miltenberger, M., Mühmer, E., Müller, B., Pfetsch, M. E., Schlösser, F., Serrano, F., Shinano, Y., Tawfik, C., Vigerske, S., Wegscheider, F., Weninger, D., and Witzig, J. The SCIP Optimization Suite 7.0. ZIBReport 2010, Zuse Institute Berlin, March 2020b. URL http://nbnresolving.de/urn:nbn:de:0297zib78023.

 Gasse et al. (2019) Gasse, M., Chetelat, D., Ferroni, N., Charlin, L., and Lodi, A. Exact combinatorial optimization with graph convolutional neural networks. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/d14c2267d848abeb81fd590f371d39bd-Paper.pdf.
 Geman & Geman (1984) Geman, S. and Geman, D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984. doi: 10.1109/TPAMI.1984.4767596.
 Gupta et al. (2020) Gupta, P., Gasse, M., Khalil, E. B., Kumar, M. P., Lodi, A., and Bengio, Y. Hybrid models for learning to branch. CoRR, abs/2006.15212, 2020. URL https://arxiv.org/abs/2006.15212.
 Hastings (1970) Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 04 1970. ISSN 00063444. doi: 10.1093/biomet/57.1.97. URL https://doi.org/10.1093/biomet/57.1.97.
 Helsgaun (2000) Helsgaun, K. An effective implementation of the lin–kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000.
 Holland (1992) Holland, J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA, 1992. ISBN 0262082136.
 Ingber (1996) Ingber, L. Adaptive simulated annealing (asa): lessons learned. Control and Cybernetics, 25(1), 1996.
 Ji et al. (2021) Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882–4892. PMLR, 2021.
 Johnson (1973) Johnson, D. S. Nearoptimal bin packing algorithms. PhD thesis, Massachusetts Institute of Technology, 1973.
 Joshi et al. (2019a) Joshi, C. K., Laurent, T., and Bresson, X. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019a.
 Joshi et al. (2019b) Joshi, C. K., Laurent, T., and Bresson, X. On learning paradigms for the travelling salesman problem. arXiv preprint arXiv:1910.07210, 2019b.
 Khairy et al. (2020) Khairy, S., Shaydulin, R., Cincio, L., Alexeev, Y., and Balaprakash, P. Learning to optimize variational quantum circuits to solve combinatorial problems. Proceedings of the AAAI Conference on Artificial Intelligence, 34(03):2367–2375, Apr 2020. ISSN 21595399. doi: 10.1609/aaai.v34i03.5616. URL http://dx.doi.org/10.1609/aaai.v34i03.5616.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

 Kirkpatrick et al. (1987) Kirkpatrick, S., Gelatt Jr, C. D., and Vecchi, M. P. Optimization by simulated annealing. In Readings in Computer Vision, pp. 606–615. Elsevier, 1987.
 Kool et al. (2018) Kool, W., van Hoof, H., and Welling, M. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2018.
 Kool et al. (2021) Kool, W., van Hoof, H., Gromicho, J., and Welling, M. Deep policy dynamic programming for vehicle routing problems. arXiv preprint arXiv:2102.11756, 2021.
 Kruber et al. (2017) Kruber, M., Lübbecke, M., and Parmentier, A. Learning when to use a decomposition. In CPAIOR, pp. 202–210, 05 2017. ISBN 978-3-319-59775-1. doi: 10.1007/978-3-319-59776-8_16.
 Lam & Delosme (1988) Lam, J. and Delosme, J.M. Performance of a new annealing schedule. In Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 306–311, 1988.
 Likhosherstov et al. (2021) Likhosherstov, V., Song, X., Choromanski, K., Davis, J., and Weller, A. Debiasing a firstorder heuristic for approximate bilevel optimization. arXiv preprint arXiv:2106.02487, 2021.
 Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradientbased hyperparameter optimization through reversible learning, 2015.
 Marcos Alvarez et al. (2012) Marcos Alvarez, A., Maes, F., and Wehenkel, L. Supervised learning to tune simulated annealing for in silico protein structure prediction. In ESANN 2012 proceedings, 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 49–54. Ciaco, 2012.
 Metropolis et al. (1953) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114. URL https://doi.org/10.1063/1.1699114.
 Mills et al. (2020) Mills, K., Ronagh, P., and Tamblyn, I. Finding the ground state of spin hamiltonians with reinforcement learning. Nature Machine Intelligence, 2(9):509–517, 2020.

 Noé et al. (2019) Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457), 2019.
 Nomer et al. (2020) Nomer, H. A., Alnowibet, K. A., Elsayed, A., and Mohamed, A. W. Neural knapsack: A neural network based solver for the knapsack problem. IEEE Access, 8:224200–224210, 2020.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
 Pereira & Fernandes (2004) Pereira, A. I. and Fernandes, E. M. G. P. A study of simulated annealing variants. In Proceedings of XXVIII Congreso de Estadística e Investigación Operativa, 2004.
 Perron & Furnon (2019) Perron, L. and Furnon, V. Ortools, 2019. URL https://developers.google.com/optimization/.
 Rere et al. (2015) Rere, L. R., Fanany, M. I., and Arymurthy, A. M. Simulated annealing algorithm for deep learning. Procedia Computer Science, 72:137–144, 2015. ISSN 18770509. doi: https://doi.org/10.1016/j.procs.2015.12.114. The Third Information Systems International Conference 2015.
 Rieck (2021) Rieck, B. Basic analysis of bin-packing heuristics. arXiv preprint arXiv:2104.12235, 2021.
 Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. ArXiv, abs/1703.03864, 2017.
 Salimifard et al. (2012) Salimifard, K., Shahbandarzadeh, H., and Raeesi, R. Green transportation and the role of operations research. In 2012 International Conference on Traffic and Transportation Engineering (ICTTE 2012), pp. 74–79, 2012.

 Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Swendsen & Wang (1986) Swendsen, R. H. and Wang, J.S. Replica monte carlo simulation of spinglasses. Phys. Rev. Lett., 57:2607–2609, Nov 1986. doi: 10.1103/PhysRevLett.57.2607. URL https://link.aps.org/doi/10.1103/PhysRevLett.57.2607.
 van Laarhoven & Aarts (1987) van Laarhoven, P. and Aarts, E. Simulated Annealing: Theory and Applications, chapter 3, Thm. 6. Mathematics and Its Applications. Springer Netherlands, 1987. ISBN 9789027725134. URL https://books.google.co.in/books?id=IgUab6Dp_IC.
 Vashisht et al. (2020) Vashisht, D., Rampal, H., Liao, H., Lu, Y., Shanbhag, D., Fallon, E., and Kara, L. B. Placement in integrated circuits using cyclic reinforcement learning and simulated annealing, 2020.
 Vicol et al. (2021) Vicol, P., Metz, L., and SohlDickstein, J. Unbiased gradient estimation in unrolled computation graphs with persistent evolution strategies. In International Conference on Machine Learning, pp. 10553–10563. PMLR, 2021.
 Vinyals et al. (2017) Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks, 2017.
 Wauters et al. (2020) Wauters, M. M., Panizon, E., Mbeng, G. B., and Santoro, G. E. Reinforcement-learning-assisted quantum optimization. Physical Review Research, 2(3), Sep 2020. ISSN 2643-1564. doi: 10.1103/physrevresearch.2.033446. URL http://dx.doi.org/10.1103/PhysRevResearch.2.033446.
 Wu et al. (2019a) Wu, Y., Song, W., Cao, Z., Zhang, J., and Lim, A. Learning improvement heuristics for solving the travelling salesman problem. CoRR, abs/1912.05784, 2019a. URL http://arxiv.org/abs/1912.05784.
 Wu et al. (2019b) Wu, Y., Song, W., Cao, Z., Zhang, J., and Lim, A. Learning improvement heuristics for solving the travelling salesman problem. CoRR, abs/1912.05784, 2019b. URL http://arxiv.org/abs/1912.05784.
 Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
Appendix A Additional Experimental Information and Results
A.1 General Information
Implementation
Our code was implemented in PyTorch 1.9 (Paszke et al., 2019) and run on a standard machine with a single RTX 2080 GPU. The code will be made publicly available upon publication.
Architectures
In all experiments, the proposal distribution is parametrised by a two-layer neural network, with a ReLU activation and 16 neurons in the hidden layer; the size of the input layer is problem-specific. When using PPO, we also need a critic network to estimate the state-value function so that we can compute advantages using the Generalised Advantage Estimator (GAE) (Schulman et al., 2016). The critic network does not share any parameters with the proposal distribution (actor) but has the exact same architecture. The only difference is that the actor outputs the logits of the proposal distribution, whereas the critic outputs action values from which we compute the necessary state values.

Training
We train Neural SA using both Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Evolution Strategies (ES) (Salimans et al., 2017). Across all experiments, most of the hyperparameters of both of these methods are kept constant, as detailed below.


ES: We use a population of 16 perturbations sampled from a Gaussian with standard deviation 0.05. Updates are fed into an SGD optimizer with learning rate 1e-3 and momentum 0.9 (see the sketch after this list).
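A minimal sketch of the resulting ES update follows Salimans et al. (2017); standardising the returns is a common variant and our assumption here.

```python
import torch

def es_gradient_estimate(theta, fitness, pop=16, sigma=0.05):
    """Estimate grad_theta E[fitness(theta + sigma * eps)] from a population
    of Gaussian perturbations. `theta` is the flattened parameter vector."""
    eps = torch.randn(pop, theta.numel())
    returns = torch.tensor([fitness(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (returns[:, None] * eps).mean(dim=0) / sigma
    return grad  # ascent direction; negate before a minimising SGD step
```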
Testing
The randomly generated datasets used for testing can be recreated by fixing the seed of PyTorch's random number generator. Similarly, we evaluate each configuration (problem size, number of steps) 5 times and report the average as well as the standard deviation across the different runs. For reproducibility, each of these runs is also seeded.
Running Times
We compare the running times of Neural SA and other combinatorial optimisation methods. Table 4 shows the running times of Neural SA against those of ORTools on the Knapsack and Bin Packing problems, while Table 12 shows running times on the Travelling Salesperson Problem for Neural SA and a number of competing solvers.
Table 4: Running times of Neural SA and ORTools on the Knapsack and Bin Packing problems.

| $n$ | Knapsack: Ours | Knapsack: ORTools | Bin Packing: Ours | Bin Packing: ORTools |
|---|---|---|---|---|
| 50 | | | | |
| 100 | | | | |
| 200 | | | | |
| 500 | | | | |
| 1000 | | | | |
| 2000 | | | | |
A.2 Knapsack Problem
Data
We consider different problem sizes, with Knap$n$ consisting of $n$ items, each with a weight and value sampled from a uniform distribution over $[0, 1]$. Each problem also has an associated capacity, that is, the maximum weight the knapsack can carry. Here we follow Bello et al. (2016) and set the capacity to 12.5 for Knap50 and to 25 for Knap100 and Knap200; for larger problems we set a proportionally larger capacity.
Initial Solution
We start with a feasible initial solution corresponding to an empty knapsack, that is, $x = \mathbf{0}$. This is the trivial (and worst) feasible solution, so our models do not require any form of initialisation, preprocessing or heuristic.
Training
We train only on Knap50 with short rollouts. The model is trained for 1000 epochs, each of which is run on 256 random problems generated on the fly as described in the previous section. We fix the initial and final temperatures $T_0$ and $T_K$ and compute the temperature decay as $\gamma = (T_K / T_0)^{1/K}$.

Testing
We evaluate Neural SA on test sets of 1000 randomly generated Knapsack problems, while varying the length of the rollout. For each problem size $n$, we consider rollouts of several different lengths. The initial and final temperatures are kept fixed to their training values, and the temperature decay varies as a function of the rollout length $K$, $\gamma = (T_K / T_0)^{1/K}$.
We compare our methods against one of the dedicated solvers for the Knapsack in ORTools (Perron & Furnon, 2019) (the Knapsack Multidimension Branch and Bound Solver). We also compare sampled and greedy variants of Neural SA: the former samples actions from the proposal distribution, while the latter always selects the most likely action.
Table 5: Knapsack problem, greedy and sampled variants of Neural SA against ORTools.

| | Greedy | Sampled | ORTools |
|---|---|---|---|
| Knap50 | | | |
| Knap100 | | | |
| Knap200 | | | |
| Knap500 | | | |
| Knap1K | | | |
| Knap2K | | | |
Table 6: Knapsack problem, greedy and sampled variants of Neural SA against ORTools.

| | Greedy | Sampled | ORTools |
|---|---|---|---|
| Knap50 | | | |
| Knap100 | | | |
| Knap200 | | | |
| Knap500 | | | |
| Knap1K | | | |
| Knap2K | | | |
A.3 Bin Packing Problem
Data
We consider problems of different sizes, with Bin$n$ consisting of $n$ items, each with a weight (size) sampled from a uniform distribution over $[0, 1]$. Without loss of generality, we also assume $n$ bins, all with unit capacity. Each dataset Bin$n$ in Tables 7 and 8 contains 1000 such random Bin Packing problems, used to evaluate the methods at test time.
Initial Solution
We start from the solution where each item is assigned to its own bin, i.e. $x_{ib} = 1$ if and only if $b = i$.
Training
We train only on Bin50 with short rollouts. The model is trained for 1000 epochs, each of which is run on 256 random problems generated on the fly as described in the previous section. We keep the same form of temperature decay, but use different initial and final temperatures for PPO and ES.
Testing
We evaluate Neural SA on test sets of 1000 randomly generated Bin Packing problems, while varying the length of the rollout. For each problem size $n$, we consider rollouts of several different lengths. The initial and final temperatures are kept the same as in training, and the temperature decay parameter varies as a function of the rollout length $K$, $\gamma = (T_K / T_0)^{1/K}$.
We compare Neural SA against First-Fit-Decreasing (FFD) (Johnson, 1973), a powerful heuristic for the Bin Packing problem, and against the ORTools (Perron & Furnon, 2019) MIP solver powered by SCIP (Gamrath et al., 2020a). The ORTools solver can be quite slow on Bin Packing, so we set a timeout of 1 minute per problem for Bin500 and Bin1000 and of 2 minutes for Bin2000 to match Neural SA's running times (see Table 4). We also compare sampled and greedy variants of Neural SA: the former samples actions from the proposal distribution, while the latter always selects the most likely action.
Greedy  Sampled  ORTools  FFD  

Bin50  27.62  27.43  27.36  27.29  27.24  27.10  
Bin100  53.80  53.63  53.54  53.44  53.38  53.91  
Bin200  105.63  105.78  105.64  105.51  105.43  109.19  
Bin500  259.09  260.86  260.65  260.42  260.27  
Bin1K  512.66  517.87  517.46  517.08  516.84  
Bin2K  1030.66  1029.89  1029.11  1028.67 
Greedy  Sampled  ORTools  FFD  

Bin50  27.10  
Bin100  
Bin200  
Bin500  
Bin1K  
Bin2K 
A.4 Travelling Salesperson Problem (TSP)
Data
We generate random instances of the 2D Euclidean TSP by sampling coordinates uniformly in a unit square, as done in previous research (Kool et al., 2018; Chen & Tian, 2019; da Costa et al., 2020). We assume complete graphs (fully-connected TSP), which means every pair of cities is connected by a valid route (an edge).
Initial Solution
We start with a random tour, which is simply a random permutation of the city indices. This is likely to be a poor initial solution, as it ignores all information about the problem, namely the coordinates of the cities. Nevertheless, Neural SA achieves competitive results in spite of this, and it is reasonable to expect an improvement in its performance (at least in running time) with better initialisation methods, as in LKH3 (Helsgaun, 2000), for instance.
Training
We train only on TSP20 with very short rollouts. As in the other problems we consider, we train on 256 random problems generated on the fly for each epoch. We also maintain the same initial temperature and cooling schedule, but use lower final temperatures for the TSP, which we empirically found to work best with the training dynamics of PPO and ES. We also use a different number of epochs for each training method, 1000 for PPO and 10,000 for ES, as the latter converges more slowly.
Testing
We evaluate Neural SA on TSP20, TSP50 and TSP100 using the 10K problem instances made available by Kool et al. (2018), which allows us to directly compare our methods to previous research on the TSP. We also consider larger problem sizes, namely TSP200 and TSP500, to showcase the scalability of Neural SA. For each of these, we randomly generate 1000 instances by uniformly sampling coordinates in a 2D unit square. For each problem size $n$, we consider rollouts of several different lengths; these differ from the other CO problems we study, since the complexity of the TSP is related to the number of edges rather than the number of cities $n$. We also compare sampled and greedy variants of Neural SA: the former samples actions from the proposal distribution, while the latter always selects the most likely action.
We compare Neural SA against the standard solvers LKH3 (Helsgaun, 2000) and Concorde (Applegate et al., 2006), which we have run ourselves. We also compare against the self-reported results of other deep learning models that have targeted the TSP and relied on the test data provided by Kool et al. (2018): GCN (Joshi et al., 2019a), GAT (Kool et al., 2018), GATT (Wu et al., 2019a), and the works of da Costa et al. (2020) and Fu et al. (2021).
Note that Fu et al. (2021) also provide results for TSP200 and TSP500, but given that we do not know the exact test instances they used, it is hard to make a direct comparison to our results, especially regarding running times; they use a dataset of 128 instances, while we use 1000. For that reason, we omitted these results from Table 3 in the main text but, for the sake of completeness, present them in Table 12.
Generalisation
We always train Neural SA on the smallest of the problem sizes we consider. In Table 9, we compare Neural SA with other models in the literature that have been evaluated the same way: trained on TSP20 only and tested on TSP20, 50 and 100. While not outperforming the model by Fu et al. (2021), Neural SA, especially with PPO, generalises better than previous end-to-end methods (Kool et al., 2018).
Table 9: Generalisation on the TSP for models trained on TSP20 only. Methods marked * are self-reported.

| | TSP20 | TSP50 | TSP100 |
|---|---|---|---|
| Kool et al. (2018)* | | | |
| Fu et al. (2021)* | | | |
| SA | | | |
| Neural SA (PPO) | | | |
| Neural SA (ES) | | | |
Table 10: TSP tour costs for greedy and sampled Neural SA against LKH3 and Concorde.

| | Greedy | Sampled | LKH3 | Concorde |
|---|---|---|---|---|
| TSP20 | | | 3.836 | 3.836 |
| TSP50 | | | 5.696 | 5.696 |
| TSP100 | | | 7.764 | 7.764 |
Table 11: TSP tour costs for greedy and sampled Neural SA against LKH3 and Concorde, including larger instances.

| | Greedy | Sampled | LKH3 | Concorde |
|---|---|---|---|---|
| TSP20 | | | 3.836 | 3.836 |
| TSP50 | | | 5.696 | 5.696 |
| TSP100 | | | 7.764 | 7.764 |
| TSP200 | | | 10.70 | 10.70 |
| TSP500 | | | 16.54 | 16.54 |
Table 12: Full comparison on the TSP; each cell shows cost (optimality gap) time. Methods marked * are self-reported.

| Method | TSP20 | TSP50 | TSP100 | TSP200 | TSP500 |
|---|---|---|---|---|---|
| Concorde (Applegate et al., 2006) | 3.836 (0.00%) 48s | 5.696 (0.00%) 2m | 7.764 (0.00%) 7m | 10.70 (0.00%) 38m | 16.54 (0.00%) 7h58m |
| LKH3 (Helsgaun, 2000) | 3.836 (0.00%) 1m | 5.696 (0.00%) 14m | 7.764 (0.00%) 1h | 10.70 (0.00%) 21m | 16.54 (0.00%) 1h15m |
| ORTools (Perron & Furnon, 2019) | 3.86 (0.85%) 1m | 5.85 (2.87%) 5m | 8.06 (3.86%) 23m | — | — |
| SA | 3.881 (1.17%) 10s | 5.943 (4.34%) 37s | 8.343 (7.45%) 3m | 11.98 (11.87%) 9m | 20.22 (22.25%) 56m |
| Neural SA PPO | 3.837 (0.02%) 17s | 5.727 (0.54%) 1m | 7.856 (1.18%) 9m | 10.96 (2.50%) 15m | 17.64 (6.65%) 2h16m |
| Neural SA ES | 3.840 (0.10%) 10s | 5.828 (2.32%) 1m | 8.191 (5.50%) 9m | 11.74 (9.72%) 15m | 20.27 (22.55%) 2h16m |
| GCN Greedy (Joshi et al., 2019a)* | 3.86 (0.60%) 6s | 5.87 (3.10%) 55s | 8.41 (8.38%) 6m | — | — |
| GCN Beam Search (Joshi et al., 2019a)* | 3.84 (0.01%) 12m | 5.70 (0.01%) 18m | 7.87 (1.39%) 40m | — | — |
| GAT Greedy (Kool et al., 2018)* | 3.85 (0.34%) 0s | 5.80 (1.76%) 2s | 8.12 (4.53%) 6s | — | — |
| GAT Sampling (Kool et al., 2018)* | 3.84 (0.08%) 5m | 5.73 (0.52%) 24m | 7.94 (2.26%) 1h | — | — |
| GATT {1000} (Wu et al., 2019a)* | 3.84 (0.03%) 12m | 5.75 (0.83%) 16m | 8.01 (3.24%) 25m | — | — |
| GATT {3000} (Wu et al., 2019a)* | 3.84 (0.00%) 39m | 5.72 (0.34%) 45m | 7.91 (1.85%) 1h | — | — |
| GATT {5000} (Wu et al., 2019a)* | 3.84 (0.00%) 1h | 5.71 (0.20%) 1h | 7.87 (1.42%) 2h | — | — |
| da Costa et al. (2020) {500}* | 3.84 (0.01%) 5m | 5.72 (0.36%) 7m | 7.91 (1.84%) 10m | — | — |
| da Costa et al. (2020) {1000}* | 3.84 (0.00%) 10m | 5.71 (0.21%) 13m | 7.86 (1.26%) 21m | — | — |
| da Costa et al. (2020) {2000}* | 3.84 (0.00%) 15m | 5.70 (0.12%) 29m | 7.83 (0.87%) 41m | — | — |
| Att-GCRN+MCTS (Fu et al., 2021)* | 3.84 (0.00%) 2m | 5.69 (0.01%) 9m | 7.76 (0.03%) 15m | 10.81 (0.88%) 3m | 16.96 (2.96%) 6m |