
Neural Simulated Annealing

Simulated annealing (SA) is a stochastic global optimisation technique applicable to a wide range of discrete and continuous variable problems. Despite its simplicity, the development of an effective SA optimiser for a given problem hinges on a handful of carefully handpicked components, namely the neighbour proposal distribution and the temperature annealing schedule. In this work, we view SA from a reinforcement learning perspective and frame the proposal distribution as a policy, which can be optimised for higher solution quality given a fixed computational budget. We demonstrate that Neural SA, with a learnt proposal distribution parametrised by small equivariant neural networks, outperforms SA baselines on a number of problems: Rosenbrock's function, the Knapsack problem, the Bin Packing problem, and the Travelling Salesperson problem. We also show that Neural SA scales well, generalising to significantly larger problems than the ones seen during training, while achieving performance comparable to popular off-the-shelf solvers and other machine learning methods in terms of solution quality and wall-clock time.



1 Introduction

There are many different kinds of combinatorial optimisation (CO) problem, spanning bin packing, routing, assignment, scheduling, constraint satisfaction, and more. Solving these problems while sidestepping their inherent computational intractability has great importance and impact for the real world, where poor bin packing or routing leads to wasted profit or excess greenhouse emissions (Salimifard et al., 2012). General solving frameworks or metaheuristics for all these problems are desirable, due to their conceptual simplicity and ease of deployment, but require manual tailoring to each individual problem. One such metaheuristic is Simulated Annealing (SA) (Kirkpatrick et al., 1987), a simple and very popular iterative global optimisation technique for numerically approximating the global minimum of both continuous- and discrete-variable problems. While SA has wide applicability, this is also its Achilles' heel, leaving many design choices to the user. Namely, a user has to design 1) neighbourhood proposal distributions, which define the space of possible transitions from a solution at time t to solutions at time t + 1, and 2) a temperature schedule, which determines the balance of exploration to exploitation. In this work, we mitigate the need for extensive fine-tuning of SA's parameters by designing a learnable proposal distribution, which we show improves convergence speed with little computational overhead (limited to O(n) per step for problem size n).

In recent years, research on approximate optimisation methods has been inundated by works in machine learning for CO (ML4CO) (Bengio et al., 2018). A lot of the focus has been on end-to-end neural architectures (Bello et al., 2016; Vinyals et al., 2017; Dai et al., 2017; Kool et al., 2018; Emami & Ranka, 2018; Bresson & Laurent, 2021). These work by brute-force learning of the instance-to-solution mapping; in CO these are sometimes referred to as construction heuristics. Other works focus on learning good parameters for classic algorithms, whether they be parameters of the original algorithm (Kruber et al., 2017; Bonami et al., 2018) or extra neural parameters introduced into the computational graph of classic algorithms (Gasse et al., 2019; Gupta et al., 2020; Kool et al., 2021; da Costa et al., 2020; Wu et al., 2019b; Chen & Tian, 2019; Fu et al., 2021). Our method, neural simulated annealing (Neural SA), can be viewed as sitting firmly within this last category.

Figure 1: Neural SA pipeline for the TSP. Starting with a solution (tour) x_t, we sample an action a_t from our learnable policy/proposal distribution, defining the start and end points of a 2-opt move (replacing two old edges with two new ones). Each pane shows both the linear and graph-based representations of a tour. From x_t and a_t we form a proposal x', which is either accepted or rejected in an MH step. Accepted moves assign x_{t+1} = x'; rejected moves assign x_{t+1} = x_t.

SA is an improvement heuristic; it navigates the search space of feasible solutions by iteratively applying (small) perturbations to previously found solutions. Figure 1 illustrates this for the Travelling Salesperson Problem (TSP), perhaps the most classic of NP-hard problems. In this work, we pose this as a Reinforcement Learning (RL) agent navigating an environment, searching for better solutions. In this light the proposal distribution is an optimisable quantity. Conveniently, our method inherits convergence guarantees from SA. We are able to directly optimise the proposal distribution using policy optimisation for both faster convergence and better solution quality under a fixed computation budget. We demonstrate Neural SA on four tasks: Rosenbrock’s function, a toy 2D optimisation problem, where we can easily visualise and analyse what is being learnt; the Knapsack and Bin Packing problems, which are classic NP-hard resource allocation problems; and the TSP.

Our contributions are:


  • We pose simulated annealing as a Markov decision process, bringing it into the realm of reinforcement learning. This allows us to optimise the proposal distribution in a principled manner, still preserving all the convergence guarantees of vanilla simulated annealing.

  • We show competitive performance to off-the-shelf CO tools and other ML4CO methods on the Knapsack, Bin Packing, and Travelling Salesperson problems, in terms of solution quality and wall-clock time.

  • We show our methods transfer to problems of different sizes, performing well on problems significantly larger than the ones used for training.

  • Our method is competitive within the ML4CO space while using a very lightweight architecture, with a number of learnable parameters on the order of hundreds or fewer.

2 Background and Related Work

Here we outline the basic simulated annealing algorithm and its main components. Then we provide an overview of prior works in the machine learning literature which have sought to learn parts of the algorithm or where SA has found uses in machine learning.

Combinatorial optimisation

A combinatorial optimisation problem is defined by a triple (Ψ, Ω, E), where Ψ is the set of problem instances ψ (city locations in the TSP), Ω(ψ) is the set of feasible solutions given ψ (Hamiltonian cycles in the TSP), and E : Ω(ψ) → ℝ is an energy function (tour length in the TSP). Without loss of generality, the task is to minimise the energy E. CO problems are in general NP-hard, meaning that there is no known algorithm to solve them in time polynomial in the number of bits that represents a problem instance.

Simulated Annealing

Simulated annealing (Kirkpatrick et al., 1987) is a metaheuristic for CO problems. It builds an inhomogeneous Markov chain x_0, x_1, ..., x_K, for x_t ∈ Ω(ψ), asymptotically converging to a minimizer of E. The stochastic transitions depend on two quantities: 1) a proposal distribution, and 2) a temperature schedule. The proposal distribution q : Ω(ψ) → P(Ω(ψ)), for P(Ω(ψ)) the space of probability distributions on Ω(ψ), suggests new states in the chain. It perturbs current solutions to new ones, potentially leading to lower energies immediately or later on. After perturbing a solution x_t into a candidate x', a Metropolis–Hastings (MH) step (Metropolis et al., 1953; Hastings, 1970) is executed. This either accepts the perturbation (x_{t+1} = x') or rejects it (x_{t+1} = x_t); see Algorithm 1 for details. The target distribution of the MH step has form π_t(x) ∝ exp(−E(x)/T_t), where T_t is the temperature at time t. In the limit T_t → 0, this distribution tends to a sum of Dirac deltas on the minimisers of the energy. The temperature is annealed, according to the temperature schedule T_t, from high to low, to steer the target distribution smoothly from broad to peaked around the global optima. The algorithm is outlined in Algorithm 1. Under certain regularity conditions and provided the chain is long enough, it will visit the minimisers almost surely (Geman & Geman, 1984). More concretely,

lim_{K→∞} P(x_K ∈ Ω*) = 1,

where Ω* denotes the set of global minimisers of E.
Despite this guarantee, practical convergence speed is determined by the proposal distribution q and the temperature schedule, which are hard to fine-tune. There exist problem-specific heuristics for setting these (Pereira & Fernandes, 2004; Cicirello, 2007), but in this paper we propose to learn the proposal distribution.
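The loop described above is short enough to state concretely. Below is a minimal vanilla-SA sketch in Python (a sketch, not the paper's implementation; `energy` and `propose` are problem-specific callables supplied by the user):

```python
import math
import random

def simulated_annealing(x0, energy, propose, temperatures, rng=random.Random(0)):
    """Minimise `energy` starting from x0, following a temperature schedule."""
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for T in temperatures:
        x_new = propose(x, rng)  # sample from the proposal distribution
        e_new = energy(x_new)
        # Metropolis-Hastings step: always accept downhill moves,
        # accept uphill moves with probability exp(-(E' - E) / T)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            x, e = x_new, e_new
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Toy example: minimise a 1-D quadratic over the integers with +/-1 moves
schedule = [2.0 * (0.99 ** t) for t in range(2000)]
x, e = simulated_annealing(
    10,
    energy=lambda x: (x - 3) ** 2,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
    temperatures=schedule,
)
```

The toy problem converges to the minimiser x = 3; for the CO problems in the paper, `propose` becomes a bit-flip, item-move, or 2-opt neighbourhood instead.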

2.1 Simulated annealing and machine learning

A natural way to combine machine learning and simulated annealing is to design local improvement heuristics that feed off each other. Cai et al. (2019) and Vashisht et al. (2020) use RL to find good initial solutions that are later refined by standard SA. That is fundamentally different from our approach, as we augment SA with RL-optimisable components, instead of simply using the two as standalone algorithms that only interact via shared solutions. In fact, our method is perfectly compatible with theirs and any other SA application. Another line of work seeks to optimise different components of SA with RL (Wauters et al., 2020; Khairy et al., 2020; Beloborodov et al., 2020; Mills et al., 2020) or statistical machine learning techniques (Blum et al., 2020). In contrast to these methods, which optimise individual hyperparameters in SA, we frame SA itself as an RL problem, which allows us to define and train the proposal distribution as a policy.

Closer to our method, other approaches improve the proposal distribution. In Adaptive Simulated Annealing (ASA) (Ingber, 1996), the proposal distribution is not fixed but evolves throughout the annealing process as a function of the variance of the quality of visited solutions. ASA improves the convergence of standard SA but is not learnable like Neural SA. To the best of our knowledge, Marcos Alvarez et al. (2012) are the only others to learn the proposal distribution for SA, but they rely on supervised learning, requiring high-quality solutions or good search strategies to imitate, both expensive to compute. Conversely, Neural SA is fully unsupervised, and thus easier to train and extend to different CO tasks. Finally, SA is also akin to Metropolis–Hastings, a popular choice for Markov Chain Monte Carlo (MCMC) sampling. Noé et al. (2019), Albergo et al. (2019) and de Haan et al. (2021) recently studied how to learn the proposal distribution of an MCMC chain for sampling the Boltzmann distribution of a physical system. While their results serve as motivation for our methods, we investigate a completely different context and set of applications.

Lastly, our work falls under bi-level optimisation methods, where an outer optimisation loop finds the best parameters of an inner optimisation. This encompasses situations such as learning the parameters (Rere et al., 2015) or hyperparameters of a neural network optimiser (Maclaurin et al., 2015; Andrychowicz et al., 2016) and meta-learning (Finn et al., 2017). However, most recent approaches assume differentiable losses on continuous state spaces (Likhosherstov et al., 2021; Ji et al., 2021; Vicol et al., 2021), while we focus on the more challenging CO setting. We note, however, that the methods in Vicol et al. (2021) are based on evolution strategies and could be used in the discrete setting.

2.2 Markov Decision Processes

Simulated annealing naturally fits into the Markov Decision Process (MDP) framework, as we explain below. An MDP consists of states s ∈ S, actions a ∈ A, an immediate reward function r(s, a, s'), a transition kernel p(s' | s, a), and a discount factor γ ∈ [0, 1]. On top of this MDP we add a stochastic policy π(a | s). The policy and transition kernel together define a length-K trajectory τ = (s_0, a_0, s_1, a_1, ..., s_K), which is a sample from the distribution p(τ) = ρ(s_0) ∏_{t=0}^{K−1} π(a_t | s_t) p(s_{t+1} | s_t, a_t), where s_0 is sampled from the start-state distribution ρ. One can then define the discounted return over a trajectory, R(τ) = Σ_{t=0}^{K−1} γ^t r(s_t, a_t, s_{t+1}). We say that we have solved an MDP if we have found a policy that maximises the expected return E_τ[R(τ)].
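Under stated assumptions (generic callables standing in for π, p, and r, which the paper instantiates per problem), trajectory sampling and the discounted return above can be sketched as:

```python
import random

def rollout(s0, policy, kernel, reward, K, rng=random.Random(0)):
    """Sample a length-K trajectory and collect per-step rewards."""
    s, rewards = s0, []
    for _ in range(K):
        a = policy(s, rng)           # a_t ~ pi(. | s_t)
        s_next = kernel(s, a, rng)   # s_{t+1} ~ p(. | s_t, a_t)
        rewards.append(reward(s, a, s_next))
        s = s_next
    return rewards

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, on a deterministic chain where each action increments the state and the reward is the state increment, `rollout` returns a constant reward sequence whose discounted return is a geometric sum.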

3 Method

Here we outline our approach to learn the proposal distribution. First we define an MDP corresponding to SA. We then show how the proposal distribution can be optimised and provide a justification that this does not affect convergence guarantees of the classic algorithm.

3.1 MDP Formulation

We formalise SA as an MDP, with states s_t = (ψ, x_t, T_t), where ψ is a parametric description of the problem instance as in Section 2, x_t the current solution, and T_t the instantaneous temperature. Examples are in Section 4. Our actions a_t perturb the current solution, x' = f(x_t, a_t), where x' is a solution in the neighbourhood N(x_t) of x_t. It is common to define small neighbourhoods, to limit energy variation from one state to the next. This heuristic discards exceptionally good and exceptionally bad moves, but since the latter are more common than the former, it generally leads to faster convergence.

We view the MH step in SA as a stochastic transition kernel, governed by the current temperature of the system, with transition probabilities following a Gibbs distribution and dynamics

x_{t+1} = x' with probability min{1, exp(−(E(x') − E(x_t))/T_t)}, and x_{t+1} = x_t otherwise.

This defines a transition kernel p(s_{t+1} | s_t, a_t), where we have s_{t+1} = (ψ, x_{t+1}, T_{t+1}). For rewards, we use either the immediate gain r_t = E(x_t) − E(x_{t+1}) or the primal reward, the negative of the minimum energy found along the rollout, given at the final step. We explored training with two different methods: Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Evolution Strategies (ES) (Salimans et al., 2017). The immediate gain works best with PPO, where at each iteration of the rollout it gives fine-grained feedback on whether the previous action helped or not. The primal reward works best with ES because it is non-local, returning the minimum along an entire rollout at the very end. We also explored using the acceptance count but found that this sometimes led to pathological behaviours. Similarly, we tried the primal integral (Berthold, 2013), which encourages finding a good solution fast, but found we could not get the training dynamics to converge.
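As a small illustration of the two reward choices (a sketch, not the paper's code), both can be computed from the energy trace of a rollout:

```python
def immediate_gains(energies):
    """Per-step reward r_t = E(x_t) - E(x_{t+1}): positive when energy drops."""
    return [e0 - e1 for e0, e1 in zip(energies, energies[1:])]

def primal_reward(energies):
    """Single terminal reward: negative of the best energy seen along the rollout."""
    return -min(energies)
```

The immediate gains give dense, step-level credit (suiting PPO), while the primal reward is a single sparse signal at the end of the rollout (suiting ES).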

Input: initial state s_0 = (ψ, x_0, T_0), proposal distribution q_θ, transition function f, temperature schedule T_t, energy function E
  for t = 0, ..., K − 1 do
      a_t ~ q_θ(· | s_t) {Sample action}
      x' ← f(x_t, a_t)
      u ~ Uniform(0, 1) {Metropolis–Hastings step}
      if u ≤ min{1, exp(−(E(x') − E(x_t))/T_t)} then
          x_{t+1} ← x'
      else
          x_{t+1} ← x_t
      end if
  end for
Algorithm 1 Neural simulated annealing. To get back to vanilla SA, replace the parametrised proposal distribution q_θ with a uniform distribution over neighbourhoods N(x_t).

3.2 Policy Network Architecture

SA chains are long, so we need as lightweight a policy architecture as possible. Furthermore, this architecture should have the capacity to scale to varying numbers of inputs, so that we can transfer experience across problems of different size n. We opt for a very simple network, shown in Figure 2. For each dimension of the problem we map the state into a set of features; for all problems we try, there is a natural way to do this. Each feature is fed into an MLP, embedding it into a logit space, followed by a softmax function to yield probabilities. The complexity of this architecture scales linearly with n and the computation is embarrassingly parallel, which is important since we plan to evaluate it many times. A notable property of this architecture is that it is permutation equivariant (Zaheer et al., 2017) under permutations of the n objects, an important requirement for the CO problems we consider. Note that our model is a permutation-equivariant set-to-set mapping, but we have not used attention or other kinds of pairwise interaction, to keep the computational complexity linear in the number of items.
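A minimal sketch of this pointwise-MLP-plus-softmax policy, here in NumPy with hypothetical feature and hidden sizes, makes the permutation equivariance easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d per-item features, h hidden units
d, h = 4, 16
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, 1)), np.zeros(1)

def policy_logits(features):
    """Apply the same two-layer MLP to each of the n items independently."""
    hidden = np.maximum(features @ W1 + b1, 0.0)  # (n, h), ReLU
    return (hidden @ W2 + b2).squeeze(-1)         # (n,)

def policy_probs(features):
    z = policy_logits(features)
    z = z - z.max()  # numerically stabilised softmax over items
    p = np.exp(z)
    return p / p.sum()

# Permutation equivariance: permuting the items permutes the probabilities
x = rng.normal(size=(5, d))
perm = rng.permutation(5)
assert np.allclose(policy_probs(x)[perm], policy_probs(x[perm]))
```

Because the same MLP is applied pointwise and the softmax only normalises across items, the cost is O(n) and the output respects any reordering of the input set.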

Figure 2: Policy network used in all experiments. The same MLP is applied to all inputs pointwise.


Convergence of SA to the optimum in the infinite time limit requires the Markov chain of the proposal distribution to be irreducible (van Laarhoven & Aarts, 1987), meaning that for any temperature, any two states are reachable through a sequence of transitions with positive conditional probability under q_θ. Our neural network policy satisfies this condition as long as the softmax layer does not assign zero probability to any state, a condition which is met in practice. Thus Neural SA inherits the convergence guarantees of SA.

4 Experiments

Figure 3: Results on Rosenbrock's function: (a) Example trajectory (2D rollout), moving from red to blue, showing convergence around the minimiser at (1, 1); (b) Neural SA has a higher acceptance ratio than the baseline, a trend observed in all experiments; (c) Standard deviation of the learned policy as a function of iteration: large initial steps offer great gains, followed by small exploitative steps; (d) A non-adaptive vanilla SA baseline cannot match an adaptive one, no matter the standard deviation.

We evaluate our method on four tasks (Rosenbrock's function and the Knapsack, Bin Packing, and TSP problems) using the same architecture and hyperparameters of Neural SA for all tasks. This shows the wide applicability and ease of use of our method. For each task (except for Rosenbrock's function) we test Neural SA on problems of different size n, training only on the smallest. Similarly, we consider rollouts of different lengths, training only on short ones. This accelerates training and shows Neural SA's generalisation capabilities. This type of transfer learning is one of the challenges in ML4CO (Joshi et al., 2019b), and is a merit of our lightweight, equivariant architecture. In all experiments, we start from trivial or random solutions and adopt an exponential multiplicative cooling schedule as originally proposed by Kirkpatrick et al. (1987), with T_{t+1} = γ T_t for a decay factor γ < 1. In practice, we define the temperature schedule by fixing the initial and final temperatures T_0 and T_K, and computing γ according to the desired number of steps K. This allows us to vary the rollout length while maintaining the same range of temperatures for every run. We provide more precise experimental details in the appendix.
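The schedule construction described above (fixing T_0 and T_K, then solving T_K = T_0 γ^K for γ) can be written as:

```python
def gamma_for_schedule(T0, TK, K):
    """Decay factor so that T_t = T0 * gamma**t reaches TK after K steps."""
    return (TK / T0) ** (1.0 / K)

def temperature_schedule(T0, TK, K):
    """Exponential multiplicative cooling: [T_0, T_1, ..., T_K]."""
    g = gamma_for_schedule(T0, TK, K)
    return [T0 * g ** t for t in range(K + 1)]
```

This keeps the temperature range fixed while the rollout length K varies, as in the experiments.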

4.1 The Rosenbrock function

The Rosenbrock function is a common benchmark for optimisation algorithms. It is a non-convex function over Euclidean space defined as

E(x_1, x_2) = (1 − x_1)^2 + 100 (x_2 − x_1^2)^2,

with global minimum at (1, 1). Of course, gradient-based optimisers are more suited to this problem, but we use it as a toy example to showcase the properties of Neural SA. Our policy is an axis-aligned Gaussian a ~ N(0, σ_θ(s_t)^2 I), where we parametrise the standard deviation σ_θ by a small two-layer MLP with a ReLU in the middle. Proposals are of the form x' = x_t + a, and the state is given by s_t = (x_t, T_t). An example rollout is in Figure 3(a).
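For concreteness, here is a sketch of the vanilla-SA baseline on the Rosenbrock function with a fixed-σ Gaussian proposal (hypothetical defaults for the step count and temperatures; the learnt policy would adapt σ over the rollout instead):

```python
import math
import random

def rosenbrock(x, y):
    """E(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimised at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def sa_rosenbrock(steps=20000, T0=1.0, TK=1e-4, sigma=0.1, rng=random.Random(0)):
    """Vanilla SA with a constant-variance axis-aligned Gaussian proposal."""
    gamma = (TK / T0) ** (1.0 / steps)
    x, e = (-1.0, -1.0), rosenbrock(-1.0, -1.0)
    best, best_e = x, e
    T = T0
    for _ in range(steps):
        cand = (x[0] + rng.gauss(0, sigma), x[1] + rng.gauss(0, sigma))
        e_new = rosenbrock(*cand)
        # Metropolis-Hastings acceptance
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / T):
            x, e = cand, e_new
            if e < best_e:
                best, best_e = x, e
        T *= gamma
    return best, best_e
```

With a fixed σ the walk descends into the curved valley and then creeps slowly along it, which is exactly the behaviour the adaptive, learnt variance improves upon.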

We contrast Neural SA against vanilla SA with a fixed proposal distribution, i.e. a ~ N(0, σ^2 I) with constant σ, for different values of σ, averaged over problem instances. Figure 3(d) shows that no constant-variance policy can outperform an adaptive policy on this problem. Plots of the acceptance ratio in Figure 3(b) show Neural SA has a higher acceptance probability early in the rollout, a trend we observed in all experiments, suggesting its proposals are skewed towards lower-energy solutions than standard SA's. Figure 3(c) shows the output of the variance network as a function of time. It has learnt to make large steps until hitting the basin, whereupon large moves will be rejected with high probability, so the variance must be reduced.

4.2 Knapsack Problem

The Knapsack problem is a classic CO problem in resource allocation. Given a set of n items, each with a value v_i and weight w_i, the goal is to find a subset that maximises the sum of values while respecting a maximum total weight C. This is the 0-1 Knapsack Problem, which is weakly NP-complete, has a search space of size 2^n, and corresponds to the integer linear program

maximise Σ_i v_i x_i subject to Σ_i w_i x_i ≤ C, x_i ∈ {0, 1}.

Figure 4: Knapsack policy, with logits shown in each pane. Light, valuable objects are favoured for insertion. Once inserted, the policy downweights an object's probability of flipping state again. Interestingly, the ejection probability of heavy, valueless objects is low, perhaps because ejection only makes sense close to overflowing, although the policy does not receive the free capacity as a feature.
Random Search Bello RL Bello AS SA Ours (PPO) Ours (ES) Greedy OR-Tools
Knap500 - -
Knap1K - - -
Knap2K - - -
Table 1: Average cost of solutions for the Knapsack Problem across five random seeds and, in parentheses, optimality gap to best solution found among solvers. Bigger is better. *Values as reported by Bello et al. (2016) for reference.

Solutions are represented as a binary vector x ∈ {0, 1}^n, with x_i = 0 for 'out of the bin' and x_i = 1 for 'in the bin'. Our proposal distribution flips individual bits, one at a time, with the constraint that we cannot flip a bit if the bin capacity would be exceeded. The neighbourhood of x is thus all feasible solutions at a Hamming distance of 1 from x. We use the proposal distribution described in Section 3.2 and illustrated in Figure 2, consisting of a pointwise embedding of each item (its weight, value, occupancy bit, the knapsack's overall capacity, and the global temperature) into a logit space, followed by a softmax. Mathematically, the policy and the state-action to proposal mapping are

q_θ(a = e_i | s_t) = softmax_i(g_θ(w_i, v_i, x_i, C, T_t)) and f(x_t, a) = x_t ⊕ a,

where g_θ is a small two-layer neural network with ReLU activations, comprising only 112 parameters, e_i is the i-th standard basis vector, and ⊕ denotes elementwise addition modulo 2. Actions are sampled from the categorical distribution induced by the softmax and cast to one-hot vectors a ∈ {0, 1}^n.
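A sketch of the feasibility-masked bit-flip proposal (a hypothetical helper, not the paper's code; the logits would come from the policy MLP in the learnt case, or be uniform for vanilla SA):

```python
import math
import random

def knapsack_flip_proposal(x, weights, capacity, logits, rng=random.Random(0)):
    """Sample an item to flip, masking flips that would exceed capacity.

    x: 0/1 occupancy list; logits: one score per item.
    """
    used = sum(w for w, b in zip(weights, x) if b)
    # An item may always be ejected; it may only be inserted if it fits
    feasible = [i for i in range(len(x))
                if x[i] == 1 or used + weights[i] <= capacity]
    # Softmax over feasible items only (numerically stabilised)
    m = max(logits[i] for i in feasible)
    probs = [math.exp(logits[i] - m) for i in feasible]
    z = sum(probs)
    r, acc = rng.random() * z, 0.0
    for i, p in zip(feasible, probs):
        acc += p
        if r <= acc:
            return i
    return feasible[-1]
```

The mask keeps every proposed neighbour at Hamming distance 1 and feasible, so the MH step never has to reject on feasibility grounds.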

Neural networks have been used to solve the Knapsack Problem in Vinyals et al. (2017), Nomer et al. (2020), and Bello et al. (2016). We follow the setup of Bello et al. (2016), homing in on three self-generated datasets: Knap50, Knap100 and Knap200. Knap-n consists of n items with weights and values generated uniformly at random, with capacities set as in Bello et al. (2016). We use OR-Tools (Perron & Furnon, 2019) to compute ground truth solutions. Results in Table 1 show that Neural SA improves over vanilla SA by up to 10% optimality gap, and over heuristic methods (Random Search) by much more. Neural SA falls slightly behind two methods by Bello et al. (2016), which use (1) a large attention-based pointer network with several orders of magnitude more parameters (Bello RL), and (2) the same network coupled with 5000 iterations of their Active Search method (Bello AS). It also falls behind a greedy heuristic for packing a knapsack based on the value-to-weight ratio. In Figure 4 we analyse the policy network and a typical rollout. It has learnt a mostly greedy policy, filling its knapsack with light, valuable objects and only ejecting them when full. This is in line with the value-to-weight greedy heuristic. Despite not coming out on top among all methods, Neural SA is typically within 1-3% of the minimum energy, even though its architecture was not designed for this problem in particular.

4.3 Bin Packing Problem

The Bin Packing problem is similar to the Knapsack problem in nature. Here, one wants to pack all of n items into the smallest number of bins possible, where each item i has weight w_i, and we assume, without loss of generality, bins of equal capacity C ≥ max_i w_i; there would be no valid solution otherwise. This problem is NP-hard and has a search space of size equal to the Bell number B_n. If x_{ij} = 1 denotes item i occupying bin j, and y_j = 1 denotes bin j being in use, then the problem can be written as minimising an energy:

minimise Σ_j y_j (6)
subject to Σ_i w_i x_{ij} ≤ C y_j, Σ_j x_{ij} = 1, x_{ij}, y_j ∈ {0, 1},

where the constraints apply for all i and j. We define the policy in two steps: we first pick an item i, and then select a bin b to place it into. We can then write the policy as q_θ(a = (i, b) | s_t) = q_θ(i | s_t) q_θ(b | i, s_t), which we define as

q_θ(i | s_t) = softmax_i(g_θ(w_i, c_{b(i)}, T_t)) and q_θ(b | i, s_t) = softmax_b(h_θ(c_b, w_i, T_t)),

where b(i) is the bin item i is in before the action (in terms of x, the unique j with x_{ij} = 1), c_j is the free capacity of bin j (c_j = C − Σ_i w_i x_{ij}), and both g_θ and h_θ are lightweight two-layer architectures with a ReLU nonlinearity between the two layers. We sample from the policy ancestrally, sampling first an item from q_θ(i | s_t), followed by a bin from q_θ(b | i, s_t). Results in Table 2 show that our lightweight model is able to find solutions only about 1% higher in energy than the minimum found by First Fit Decreasing (FFD) (Johnson, 1973), a very strong heuristic for this problem (Rieck, 2021). We also very often beat the SCIP (Gamrath et al., 2020a,b) optimizer in OR-Tools, which timed out on most problems. Figure 5 compares the convergence speed of Neural SA with vanilla SA and a third option, Greedy Neural SA, which uses argmax samples from the policy. The learnt policy, visualised in Figure 6, converges much faster than the vanilla version. Again, we see that our method, although simple, is competitive with hand-designed alternatives, whereas vanilla SA is not.

Figure 5: Bin50 primal objective for vanilla, Neural, and Greedy Neural SA with 25th, 50th, and 75th percentiles.
SA Ours (PPO) Ours (ES) OR-Tools (SCIP) FFD
Table 2: Average cost of solutions for the Bin Packing Problem across five random seeds and, in parentheses, optimality gap to the best solution found among solvers. Lower is better. We set a timeout for OR-Tools of 1 minute per problem for Bin50-1000 and of 2 minutes for Bin2000; * indicates only the trivial solution was found in this time.
Figure 6: Bin Packing policy (logits), consisting of two networks, an item selector and a bin selector. The item selector uses the item weight and bins' used capacities to select an item to move. The bin selector then places this item in a bin, based on the target bin's fullness and the selected item's weight. The learnt policy is very sensible: the item selector looks for a light item in an under-full bin, and the bin selector then places it in an almost-full bin. We mask bins with insufficient free capacity, hence the triangular logit spaces.

4.4 Travelling Salesperson Problem

Imagine you will make a round road-trip through n cities and want to plan the shortest route visiting each city once; this is the Travelling Salesperson Problem (TSP) (Applegate et al., 2006). The TSP has been a long-time favourite of computer scientists due to its easy description and NP-hardness (the base search space has size n!, the factorial of the number of cities). Here we use it as an example of a difficult CO problem. We compare with Concorde (Applegate et al., 2006) and LKH-3 (Helsgaun, 2000), two custom solvers for the TSP. Given n cities with spatial coordinates y_1, ..., y_n, we wish to find a linear ordering of the cities, called a tour, denoted by the permutation vector σ(i) for i = 1, ..., n, minimising the tour length

E(σ) = Σ_{i=1}^{n} ||y_{σ(i)} − y_{σ(i+1)}||_2,

where we have defined σ(n + 1) = σ(1) for convenience of notation. Our action space consists of so-called 2-opt moves (Croes, 1958), which reverse contiguous segments of a tour. An example of a 2-opt move is shown in Figure 1. We have a two-stage architecture, like in Bin Packing, which selects the start and end cities of the segment to reverse. Denoting i as the start and j as the end city, we have q_θ(a = (i, j) | s_t) = q_θ(i | s_t) q_θ(j | i, s_t), parametrised pointwise over per-city features, where the features for city i are the coordinates of the city and of its tour neighbours, y_{σ(i−1)}, y_{σ(i)}, y_{σ(i+1)}, together with the temperature. Again, we use simple two-layer MLPs. We test on the publicly available TSP20/50/100 datasets (Kool et al., 2018) with 10K problems each and generate TSP200/500 with 1K tours each. Results in Table 3 show Neural SA improves on vanilla SA. Albeit not outperforming Fu et al. (2021), Neural SA is neck-and-neck with other neural improvement heuristics, GAT-T{1000} (Wu et al., 2019b) and Costa{500} (da Costa et al., 2020). Since Neural SA is not custom-designed for the TSP like the competing methods, we view this as surprisingly good. A more complete comparison, including other neural approaches, is given in the appendix, Table 12.
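The 2-opt move and the tour-length objective described above can be sketched as follows (hypothetical helpers, not the paper's implementation; indices i..j denote the segment to reverse):

```python
import math

def tour_length(cities, tour):
    """Total length of a closed tour; tour is a permutation of city indices."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

def two_opt(tour, i, j):
    """Reverse the segment tour[i..j], replacing edges (i-1, i) and (j, j+1)."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
```

On a unit square, for instance, a 2-opt move untangles the crossed tour [0, 2, 1, 3] into the perimeter tour [0, 1, 2, 3], strictly reducing the length.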

5 Discussion

TSP20 TSP50 TSP100 TSP200 TSP500
Cost Gap Time Cost Gap Time Cost Gap Time Cost Gap Time Cost Gap Time
Concorde 3.836 0.00% 48s 5.696 0.00% 2m 7.764 0.00% 7m 10.70 0.00% 38m 16.54 0.00% 7h58m
LKH-3 3.836 0.00% 1m 5.696 0.00% 14m 7.764 0.00% 1h 10.70 0.00% 21m 16.54 0.00% 1h15m
SA 3.881 1.17% 5s 5.943 4.34% 37s 8.343 7.45% 3m 11.98 11.87% 9m 20.22 22.25% 56m
Ours (PPO) 3.838 0.05% 9s 5.734 0.67% 1m 7.874 1.42% 9m 11.00 2.80% 16m 17.64 6.65% 2h16m
Ours (ES) 3.840 0.10% 9s 5.828 2.32% 1m 8.191 5.50% 9m 11.74 9.72% 16m 20.27 22.55% 2h16m
OR-Tools* 3.86 0.85% 1m 5.85 2.87% 5m 8.06 3.86% 23m - - - - - -
GAT-T{1000}* 3.84 0.03% 12m 5.75 0.83% 16m 8.01 3.24% 25m - - - - - -
Costa {500}* 3.84 0.01% 5m 5.72 0.36% 7m 7.91 1.84% 10m - - - - - -
Fu et al.* 3.84 0.00% 1m 5.70 0.01% 8m 7.76 0.04% 15m - - - - - -
Table 3: Comparison of Neural SA against competing methods with similar running times on TSP. Extended version in Table 12. Lower is better. *Values as reported in respective works (Wu et al., 2019a; da Costa et al., 2020; Fu et al., 2021).

Neural SA is a general, plug-and-play method, requiring only the definition of neighbourhoods for the proposal distribution and training problem instances (no solutions needed). It also obviates time-consuming architecture design, since a simple MLP is enough for a range of CO problems. In this section, we discuss some of the main features of Neural SA.

Computational Efficiency Neural SA requires little computational resources given its compact architecture, with 384 parameters on the TSP, 160 for Bin Packing, and 112 for Knapsack. Further, the cost of each step scales linearly in the problem size, since the architectures are embarrassingly parallel. In terms of running times, Neural SA is on par with and often faster than other TSP solvers (see Tables 3, 12). For the Knapsack and Bin Packing problems, we compare running times against OR-Tools, as shown in the appendix, Table 4. Neural SA lags behind OR-Tools on the Knapsack, for which an efficient branch-and-bound solver is known. However, for the Bin Packing problem, Neural SA is much faster than the available Mixed-Integer Programming solver, which only found trivial solutions on the largest instances. Finally, Neural SA is also fast to train: only a few minutes with PPO and up to a few hours with ES. This can be attributed to its low number of parameters but also to its generalisation ability; in all experiments, we could get away with training only on the smallest instances with very short rollouts.

PPO vs ES Neural SA can be trained with any policy optimisation method, making it highly extensible. We found no clear winner between PPO and ES, apart from on the TSP, where PPO excelled and generalised better to larger instances. We also observed PPO to converge faster than ES, but ES policies were more robust, still performing well when we switched to greedy sampling, for example. Interestingly, the acceptance rate over trajectories was problem dependent and always higher in Neural SA (both PPO and ES) than in vanilla SA, contradicting conventional wisdom that it should be held at 0.44 throughout a rollout (Lam & Delosme, 1988).

Generalisation Our experiments show Neural SA generalises to different problem sizes and rollout lengths; a remarkable feat for such a simple pipeline, since transfer learning is notoriously difficult in RL and CO. Many ML4CO methods do handle problems of different sizes but underperform when tested on larger instances than the ones seen in training (Kool et al., 2018; Joshi et al., 2019b) (see appendix, Table 9). Fu et al. (2021) achieve better generalisation results for the TSP but had to resort to a suite of techniques to allow a small supervised model to be applied to larger problems. These are not easy to implement, are TSP-specific, and constitute only the first step in a complex pipeline that still relies on a tailored Monte-Carlo tree search algorithm.

Solution Quality In all problems we considered, Neural SA, with little to no fine-tuning of its hyperparameters, outperformed vanilla SA and could get within a few percentage points or less of the global minima. Conversely, state-of-the-art SA variants are designed by searching a large space of different hyperparameters (Franzin & Stützle, 2019), a costly process that Neural SA helps us mitigate. Neural SA did not achieve state-of-the-art results, but that was not to be expected nor our main goal. Instead, we envision Neural SA as a general-purpose solver, allowing researchers and practitioners to get a strong baseline quickly without the need to fine-tune classic CO algorithms or design and train complex neural networks. Given the good performance, small computational resources, and fast training across a diverse set of CO problems, we believe Neural SA is a promising solver that can strike the right balance among solution quality, computing costs, and development time.

6 Conclusion

We presented Neural SA, a neurally augmented simulated annealing, where the SA chain is viewed as a trajectory from an MDP. In this light, the proposal distribution can be interpreted as a policy, which can be optimised. This has numerous benefits: 1) accelerated convergence of the chain; 2) the ability to condition the proposal distribution on side-information; 3) no need for ground-truth data to learn the proposal distribution; 4) lightweight architectures that can be run on CPU, unlike many contemporary ML4CO methods; 5) scalability to large problems due to its small computational overhead; 6) generalisation across different problem sizes.

These contributions show that augmenting classic, time-tested (meta-)heuristics with learnable components is a promising direction for future research in ML4CO. In contrast to the expensive end-to-end methods of previous work, this could be a more promising path towards machine learning models capable of solving a wide range of CO problems. As we show in this paper, this approach can yield solid results on different problems while preserving the theoretical guarantees of existing CO algorithms and requiring only simple neural architectures that can be easily trained on small problems.

The ease of use and flexibility of Neural SA do come with drawbacks. In all experiments we were unable to reach the minimum energy, even though we could usually get within a percentage point of it. The model also has no built-in termination condition, nor can it provide a certificate on the quality of the solutions found. There is also the question of how to tune the temperature schedule, which we did not attempt in this work. These shortcomings are all points to be addressed in future research. We are also interested in extending the framework to multiple trajectories, as in parallel tempering (Swendsen & Wang, 1986) or genetic algorithms (Holland, 1992). For these, we would maintain a population of chains, which could exchange information.


Appendix A Additional Experimental Information and Results

a.1 General Information


Our code was implemented in PyTorch 1.9 (Paszke et al., 2019) and run on a standard machine with a single RTX 2080 GPU. The code will be made publicly available upon publication.


In all experiments, the proposal distribution is parametrised by a two-layer neural network with ReLU activations and 16 neurons in the hidden layer, where the size of the input is problem specific. When using PPO, we also need a critic network to estimate the state-value function so that we can compute advantages with the Generalised Advantage Estimator (GAE) (Schulman et al., 2016). The critic network does not share any parameters with the proposal distribution (actor) but has the exact same architecture. The only difference is that the actor outputs the logits of the proposal distribution, whereas the critic outputs action values from which we compute the necessary state values.
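As a sketch of this architecture (pure Python rather than PyTorch, with an illustrative weight initialisation of our own choosing), the actor/critic network is simply:

```python
import random

def init_layer(n_in, n_out, rng=random.Random(0), scale=0.1):
    """Random weights and zero biases for a dense layer (illustrative)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, layer1, layer2):
    """Two-layer MLP with a 16-unit ReLU hidden layer; the input size is
    problem specific and the output is the action logits (actor) or
    values (critic)."""
    (W1, b1), (W2, b2) = layer1, layer2
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```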


We train Neural SA using both Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Evolution Strategies (ES) (Salimans et al., 2017). Across all experiments, most of the hyper-parameters of both of these methods are kept constant, as detailed below.


  • PPO: We optimise both actor and critic networks using Adam (Kingma & Ba, 2015) with learning rate of , weight decay of and . For PPO, we set the discount factor and clipping threshold to and , respectively, and compute advantages using GAE (Schulman et al., 2016) with trace decay .

  • ES: We use a population of 16 perturbations sampled from a Gaussian of standard deviation 0.05. Updates are fed into an SGD optimizer with learning rate 1e-3 and momentum 0.9.
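A sketch of one ES update under these settings (we assume the plain Salimans-style gradient estimator; the exact variant is not spelled out above):

```python
import random

def es_step(theta, velocity, fitness, pop_size=16, sigma=0.05,
            lr=1e-3, momentum=0.9, rng=random.Random(0)):
    """One Evolution Strategies update: estimate the gradient of the
    expected fitness from Gaussian perturbations of the parameters,
    then apply SGD with momentum. `theta`/`velocity` are flat parameter
    lists and `fitness` maps parameters to a scalar to be maximised."""
    grad = [0.0] * len(theta)
    for _ in range(pop_size):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += f * e / (pop_size * sigma)
    velocity = [momentum * v + lr * g for v, g in zip(velocity, grad)]
    return [t + v for t, v in zip(theta, velocity)], velocity
```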


The randomly generated datasets used for testing can be recreated by setting the seed of Pytorch’s random number generator to . Similarly, we evaluate each configuration (problem size, number of steps) 5 times and report the average as well as the standard deviation across the different runs. For reproducibility, we also seed each of these runs (seeds , , , and ).

Running Times

We compare the running times of Neural SA and other combinatorial optimisation methods. Table 4 shows the running times of Neural SA against those of OR-Tools on the Knapsack and Bin Packing problems, while Table 12 shows running times on the Travelling Salesperson Problem for Neural SA and a number of competing solvers.

Knapsack Bin Packing
Ours OR-Tools Ours OR-Tools
Table 4: Comparison of running times (at test time) for Neural SA (PPO/ES) against OR-Tools for the Knapsack and Bin Packing problems. We report the average time to evaluate one instance with each method for different problem sizes.

a.2 Knapsack Problem


We consider different problem sizes, with Knap consisting of items, each with a weight and value sampled from a uniform distribution, . Each problem also has an associated capacity, that is, the maximum weight the knapsack can hold. Here we follow Bello et al. (2016) and set and . However, for larger problems we set .

Initial Solution

We start with a feasible initial solution corresponding to an empty knapsack, that is, . That is the trivial (and worst) feasible solution, so our models do not require any form of initialisation pre-processing or heuristic.


We train only on Knap50 with short rollouts of length steps. The model is trained for 1000 epochs, each of which is run on 256 random problems generated on the fly as described in the previous section. We set the initial and final temperatures to and , and compute the temperature decay as .
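A geometric cooling schedule of this form can be sketched as follows (our helper; the decay constant is simply the one that interpolates between the chosen initial and final temperatures over the rollout):

```python
def geometric_schedule(t_init, t_final, num_steps):
    """Temperatures T_k = t_init * alpha**k, with alpha chosen so that
    the final step reaches t_final (standard geometric cooling)."""
    alpha = (t_final / t_init) ** (1.0 / max(num_steps - 1, 1))
    return [t_init * alpha ** k for k in range(num_steps)]
```

Note how, for a fixed temperature range, longer rollouts imply slower cooling, which is why the decay is recomputed as a function of the rollout length at test time.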


We evaluate Neural SA on test sets of 1000 randomly generated Knapsack problems, while varying the length of the rollout. For each problem size , we consider rollouts of length , , and . The initial and final temperatures are kept fixed to and , respectively, and the temperature decay varies as a function of , .

We compare our methods against one of the dedicated solvers for knapsack in OR-Tools (Perron & Furnon, 2019) (Knapsack Multidimension Branch and Bound Solver). We also compare sampled and greedy variants of Neural SA. The former samples actions from the proposal distribution while the latter always selects the most likely action.
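The distinction between the two variants can be sketched as follows (our helper; it assumes the policy outputs unnormalised logits):

```python
import math
import random

def select_action(logits, greedy=False, rng=random.Random(0)):
    """Greedy picks the most likely action; sampled draws from the
    categorical distribution defined by the logits (softmax)."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)                      # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(logits) - 1
```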

Greedy Sampled OR-Tools
Table 5: ES results on the Knapsack benchmark. Bigger is better. Comparison among rollouts of different lengths: 1, 2, 5 or 10 times the dimension of the problem.
Greedy Sampled OR-Tools
Table 6: PPO results on the Knapsack benchmark. Bigger is better. Comparison among rollouts of different lengths: 1, 2, 5 or 10 times the dimension of the problem.

a.3 Bin Packing Problem


We consider problems of different sizes, with Bin consisting of items, each with a weight (size) sampled from a uniform distribution, . Without loss of generality, we also assume bins, all with unitary capacity. Each dataset Bin in Tables 7 and 8 contains 1000 such random Bin Packing problems used to evaluate the methods at test time.

Initial Solution

We start from the solution where each item is assigned to a different bin, e.g. .


We train only on Bin50 with short rollouts of length steps. The model is trained for 1000 epochs, each of which is run on 256 random problems generated on the fly as described in the previous section. We keep the same temperature decay with , but use different initial and final temperatures for PPO and ES. For PPO, we set and , whereas for ES we set and .


We evaluate Neural SA on test sets of 1000 randomly generated Bin Packing problems, while varying the length of the rollout. For each problem size , we consider rollouts of length , , and . The initial and final temperatures are kept the same as in training, and the temperature decay parameter varies as a function of , .

We compare Neural SA against First-Fit-Decreasing (FFD) (Johnson, 1973), a powerful heuristic for the Bin Packing problem, and against the OR-Tools (Perron & Furnon, 2019) MIP solver powered by SCIP (Gamrath et al., 2020a). The OR-Tools solver can be quite slow on Bin Packing, so we set a timeout of 1 minute per problem for Bin50-1000 and of 2 minutes for Bin2000 to match Neural SA running times (see Table 4).

We also compare sampled and greedy variants of Neural SA. The former naturally samples actions from the proposal distribution while the latter always selects the most likely action.

Greedy Sampled OR-Tools FFD
Bin50 27.62 27.43 27.36 27.29 27.24 27.10
Bin100 53.80 53.63 53.54 53.44 53.38 53.91
Bin200 105.63 105.78 105.64 105.51 105.43 109.19
Bin500 259.09 260.86 260.65 260.42 260.27
Bin1K 512.66 517.87 517.46 517.08 516.84
Bin2K 1030.66 1029.89 1029.11 1028.67
Table 7: ES results on the Bin Packing benchmark. Lower is better.
Greedy Sampled OR-Tools FFD
Bin50 27.10
Table 8: PPO results on the Bin Packing benchmark. Lower is better.

a.4 Travelling Salesperson Problem (TSP)


We generate random instances for 2D Euclidean TSP by sampling coordinates uniformly in a unit square, as done in previous research (Kool et al., 2018; Chen & Tian, 2019; da Costa et al., 2020). We assume complete graphs (fully-connected TSP), which means every pair of cities is connected by a valid route (an edge).

Initial Solution

We start with a random tour, which is simply a random permutation of the city indices. This is likely to be a poor initial solution, as it ignores any information about the problem, namely the coordinates of each city. Nevertheless, Neural SA achieves competitive results, and it is reasonable to expect an improvement in its performance (at least in running time) with better initialisation methods, as in LKH-3 (Helsgaun, 2000) for instance.
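The initialisation and objective can be sketched as (our helpers):

```python
import math
import random

def random_tour(num_cities, rng=random.Random(0)):
    """Initial solution: a uniformly random permutation of city indices."""
    tour = list(range(num_cities))
    rng.shuffle(tour)
    return tour

def tour_length(tour, coords):
    """Total Euclidean length of the closed tour (the SA energy)."""
    return sum(math.dist(coords[tour[k]], coords[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))
```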


We train only on TSP20 with very short rollouts of length . Just like in the other problems we consider, we train on 256 random problems generated on the fly for each epoch. We also maintain the same initial temperature and cooling schedule with and , but use lower final temperatures for the TSP. We set for PPO and for ES, which we empirically found to work best with the training dynamics of each method. We also use a different number of epochs for each training method, 1000 for PPO and 10,000 for ES, as the latter converges more slowly.


We evaluate Neural SA on TSP20, TSP50 and TSP100 using the 10K problem instances made available in Kool et al. (2018). This allows us to directly compare our methods to previous research on the TSP. We also consider larger problem sizes, namely TSP200 and TSP500, to showcase the scalability of Neural SA. For each of these, we randomly generate 1000 instances by uniformly sampling coordinates in a 2D unit square. For each problem size , we consider rollouts of length , , and . That differs from the other CO problems we study, since the complexity of the TSP is related to the number of edges rather than the number of cities . We also compare sampled and greedy variants of Neural SA. The former samples actions from the proposal distribution, while the latter always selects the most likely action.

We compare Neural SA against standard solvers LKH-3 (Helsgaun, 2000) and Concorde (Applegate et al., 2006), which we have run ourselves. We also compare against the self-reported results of other Deep Learning models that have targeted TSP and relied on the test data provided by Kool et al. (2018): GCN (Joshi et al., 2019b), GAT (Kool et al., 2018), GAT-T (Wu et al., 2019a), and the works of da Costa et al. (2020) and Fu et al. (2021).

Note that Fu et al. (2021) also provide results for TSP200 and TSP500, but given that we do not know the exact test instances they used, it is hard to make a direct comparison to our results, especially regarding running times; they use a dataset of 128 instances, while we use 1000. For that reason, we omitted these results from Table 3 in the main text, but for the sake of completeness, presented them in Table 12.


We always train Neural SA only on the smallest of the problem sizes we consider. In Table 9, we compare Neural SA with other models in the literature that have been evaluated the same way: trained on TSP20 only and tested on TSP20, 50 and 100. While not outperforming the model by Fu et al. (2021), Neural SA, especially with PPO, does generalise better than previous end-to-end methods (Kool et al., 2018).

TSP20 TSP50 TSP100
Kool et al. (2018)*
Fu et al. (2021)*
Neural SA (PPO)
Neural SA (ES)
Table 9: Optimality gap for models trained on TSP20 and evaluated on the test instances provided by Kool et al. (2018) for TSP20/50/100; *values taken from the respective papers.
Greedy Sampled LKH-3 Concorde
TSP20 3.836 3.836
TSP50 5.696 5.696
TSP100 7.764 7.764
Table 10: ES results on the TSP benchmark. Lower is better
Greedy Sampled LKH-3 Concorde
TSP20 3.836 3.836
TSP50 5.696 5.696
TSP100 7.764 7.764
TSP200 10.70 10.70
TSP500 16.54 16.54
Table 11: PPO results on the TSP benchmark. Lower is better
Figure 7: Policy for the Travelling Salesperson Problem. At each step, an action consists of selecting a pair of cities, one after the other. The figure depicts a TSP problem laid out in the 2D plane, with the learnt proposal distribution over the first city on the left and, on the right, the distribution over the second city given the first. We mask out the first city and exclude its neighbours as candidates for the second city, because selecting those would lead to no change in the tour. The model clearly has a strong preference towards a few cities, but otherwise the probability mass is spread almost uniformly among the other nodes. However, once the first city is fixed, Neural SA strongly favours nodes close to it. That is desirable behaviour, which even features in popular algorithms like LKH-3 (Helsgaun, 2000): a 2-opt move actually adds the edge between the two selected cities to the tour, so leaning towards pairs of cities that are close to each other is more likely to lead to shorter tours.
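The 2-opt move described in the caption can be sketched as follows (our index convention; i and j are positions in the tour):

```python
def two_opt(tour, i, j):
    """Apply a 2-opt move: reverse the segment between positions i+1
    and j, replacing edges (tour[i], tour[i+1]) and (tour[j], tour[j+1])
    with (tour[i], tour[j]) and (tour[i+1], tour[j+1])."""
    if i > j:
        i, j = j, i
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
```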
TSP20 TSP50 TSP100 TSP200 TSP500
Cost Gap Time Cost Gap Time Cost Gap Time Cost Gap Time Cost Gap Time
Concorde (Applegate et al., 2006) 3.836 0.00% 48s 5.696 0.00% 2m 7.764 0.00% 7m 10.70 0.00% 38m 16.54 0.00% 7h58m
LKH-3 (Helsgaun, 2000) 3.836 0.00% 1m 5.696 0.00% 14m 7.764 0.00% 1h 10.70 0.00% 21m 16.54 0.00% 1h15m
OR-Tools (Perron & Furnon, 2019) 3.86 0.85% 1m 5.85 2.87% 5m 8.06 3.86% 23m - - - - - -
SA 3.881 1.17% 10s 5.943 4.34% 37s 8.343 7.45% 3m 11.98 11.87% 9m 20.22 22.25% 56m
Neural SA PPO 3.837 0.02% 17s 5.727 0.54% 1m 7.856 1.18% 9m 10.96 2.50% 15m 17.64 6.65% 2h16m
Neural SA ES 3.840 0.10% 10s 5.828 2.32% 1m 8.191 5.50% 9m 11.74 9.72% 15m 20.27 22.55% 2h16m
GCN Greedy (Joshi et al., 2019a)* 3.86 0.60% 6s 5.87 3.10% 55s 8.41 8.38% 6m - - - - - -
GCN Beam Search (Joshi et al., 2019a)* 3.84 0.01% 12m 5.70 0.01% 18m 7.87 1.39% 40m - - - - - -
GAT Greedy (Kool et al., 2018)* 3.85 0.34% 0s 5.80 1.76% 2s 8.12 4.53% 6s - - - - - -
GAT Sampling (Kool et al., 2018)* 3.84 0.08% 5 m 5.73 0.52% 24m 7.94 2.26% 1 h - - - - - -
GAT-T {1000} (Wu et al., 2019a)* 3.84 0.03% 12m 5.75 0.83% 16m 8.01 3.24% 25m - - - - - -
GAT-T {3000} (Wu et al., 2019a)* 3.84 0.00% 39m 5.72 0.34% 45 m 7.91 1.85% 1 h - - - - - -
GAT-T {5000} (Wu et al., 2019a)* 3.84 0.00% 1 h 5.71 0.20% 1 h 7.87 1.42% 2 h - - - - - -
da Costa et al. (2020) {500}* 3.84 0.01% 5m 5.72 0.36% 7m 7.91 1.84% 10m - - - - - -
da Costa et al. (2020) {1000}* 3.84 0.00% 10m 5.71 0.21% 13m 7.86 1.26% 21 m - - - - - -
da Costa et al. (2020) {2000}* 3.84 0.00% 15m 5.70 0.12% 29m 7.83 0.87% 41m - - - - - -
Att-GCRN+MCTS (Fu et al., 2021)* 3.84 0.00% 2m 5.69 0.01% 9m 7.76 0.03% 15m 10.81 0.88% 3m 16.96 2.96% 6m
Table 12: Comparison of different TSP solvers on the 10K instances for TSP20/50/100 provided in Kool et al. (2018), and 1K random instances for TSP200/500. We report the average solution cost, optimality gap and running time (to solve all instances) for each problem size. We split competing neural methods into two groups: construction heuristics (Kool et al., 2018; Joshi et al., 2019a) and improvement heuristics like Neural SA (Wu et al., 2019a; da Costa et al., 2020; Fu et al., 2021). *Values as reported in the corresponding paper. Different test data.