NeuroLKH
We present NeuroLKH, a novel algorithm that combines deep learning with the strong traditional heuristic Lin-Kernighan-Helsgaun (LKH) for solving the Traveling Salesman Problem (TSP). Specifically, we train a Sparse Graph Network (SGN) with supervised learning for edge scores and unsupervised learning for node penalties, both of which are critical for improving the performance of LKH. Based on the output of SGN, NeuroLKH creates the edge candidate set and transforms edge distances to guide the searching process of LKH. Extensive experiments firmly demonstrate that, by training one model on a wide range of problem sizes, NeuroLKH significantly outperforms LKH and generalizes well to much larger sizes. Also, we show that NeuroLKH can be applied to other routing problems such as the Capacitated Vehicle Routing Problem (CVRP), the Pickup and Delivery Problem (PDP), and CVRP with Time Windows (CVRPTW).
The Traveling Salesman Problem (TSP) is an important NP-hard Combinatorial Optimization Problem with extensive industrial applications in various domains. Exact methods have exponential worst-case computational complexity, which renders them impractical for solving large-scale problems in practice, even for highly optimized solvers such as Concorde. In contrast, although lacking optimality guarantees and non-trivial theoretical analysis, heuristic solvers search for near-optimal solutions with much lower complexity. They are usually desirable for real-life applications where statistically better performance is the goal.
Traditional heuristic methods are manually designed based on expert knowledge which is usually human-interpretable. However, supported by the recent development of deep learning technology, modern methods train powerful deep neural networks to learn the complex patterns from the TSP instances generated from some specific distributions
Vinyals et al. (2015); Bello et al. (2016); Dai et al. (2017); Kool et al. (2019); Joshi et al. (2019); Xin et al. (2020); Wu et al. (2021); Xin et al. (2021). The performance of deep learning models for solving TSP is constantly improved by these works, which unfortunately remain far worse than strong traditional heuristic solvers and are generally limited to relatively small problem sizes. We believe that learning-based methods should be combined with strong traditional heuristic algorithms, which is also suggested by Bengio et al. (2020). In such a way, while learning the complex patterns from data samples, the efficient heuristics highly optimized by researchers for decades can be effectively utilized, especially for problems such as TSP which are well studied due to their importance.
The Lin-Kernighan-Helsgaun (LKH) algorithm Helsgaun (2000, 2009) is generally considered a very strong heuristic for solving TSP, which is developed based on the Lin-Kernighan (LK) heuristic Lin and Kernighan (1973). LKH iteratively searches for λ-opt moves to improve the existing solution, where λ edges of the tour are exchanged for another λ edges to form a shorter tour. To save search time, the edges to add are limited to a small edge candidate set, which is created before the search. One of the most significant contributions of LKH is to generate the edge candidate set based on the Minimum Spanning Tree, rather than using the nearest-neighbor method of the LK heuristic. Furthermore, LKH applies penalty values to the nodes, which are iteratively optimized using subgradient optimization (detailed in Section 3). The optimized node penalties are used by LKH to transform the edge distances for the λ-opt searching process and improve the quality of the edge candidate set, both of which help find better solutions.
However, the edge candidate set generation in LKH is still guided by hand-crafted rules, which could limit the quality of edge candidates and hence the search performance. Moreover, the iterative optimization of node penalties is time-consuming, especially for large-scale problems. To address these limitations, we propose NeuroLKH, a novel learning-based method featuring a Sparse Graph Network (SGN) combined with the highly efficient λ-opt local search of LKH. SGN outputs the edge scores and node penalties simultaneously, which are trained by supervised learning and unsupervised learning, respectively. NeuroLKH transforms the edge distances based on the node penalties learned inductively from training instances, instead of performing iterative optimization for each instance, therefore saving a significant amount of time. More importantly, at the same time the edge scores are used to create the edge candidate set, leading to substantially better sets than those created by LKH. NeuroLKH trains one single network on TSP instances across a wide range of sizes and generalizes well to substantially larger problems with minutes of unsupervised offline fine-tuning to adjust the node penalty scales for different sizes.
Like existing works on deep learning models for solving TSP, NeuroLKH aims to learn complex patterns from data samples to find better solutions for instances following specific distributions. Following the evaluation process in these works, we perform extensive experiments. Results show that NeuroLKH improves the baseline algorithms by large margins, not only across the wide range of training problem sizes, but also on much larger problem sizes not used in training. Furthermore, NeuroLKH trained with instances of relatively simple distributions generalizes well to traditional benchmarks with various node distributions such as TSPLIB Reinelt (1991). Also, we show that NeuroLKH can be applied to guide the extension of LKH Helsgaun (2017) for more complicated routing problems such as the Capacitated Vehicle Routing Problem (CVRP), the Pickup and Delivery Problem (PDP) and CVRP with Time Windows (CVRPTW), using generated test datasets and traditional benchmarks Solomon (1987); Uchoa et al. (2017).
To date, for routing problems such as TSP, most works focus on learning construction heuristics, where deep neural networks are trained to sequentially select the nodes to visit with supervised learning Vinyals et al. (2015); Hottung et al. (2021) or reinforcement learning Bello et al. (2016); Dai et al. (2017); Nazari et al. (2018); Kool et al. (2019); Kwon et al. (2020). Similarly, networks are trained to pick edges in Joshi et al. (2019); Kool et al. (2021). In another line of works Chen and Tian (2019); Wu et al. (2021); Hottung and Tierney (2020); Hao Lu (2020); da Costa et al. (2020), researchers employ deep learning models to learn the actions for improving existing solutions, such as picking regions and rules or selecting nodes for the 2-opt heuristic. However, the performance of these works is still quite far from strong non-learning heuristics such as LKH. In addition, they focus only on relatively small problems (up to hundreds of nodes). A recent work Fu et al. (2021) generalizes a network pre-trained on fixed-size small graphs to larger problems by sampling small subgraphs for inference and merging the results. This interesting idea can be applied to very large graphs; however, the performance is still inferior to LKH and deteriorates rapidly as the problem size increases.
In a concurrent work Zheng et al. (2021), a VSR-LKH method is proposed which also applies a learning method in combination with LKH. However, very different from our method, VSR-LKH applies traditional reinforcement learning during the searching process for each instance, instead of learning patterns for a class of instances. Moreover, VSR-LKH aims to guide the decision on edge selections within the edge candidate set, which is generated using the original procedure of LKH. NeuroLKH significantly outperforms VSR-LKH by large margins in all the settings of our experiments on testing instances following the training distributions, especially when the time limits are short. Even more impressively, NeuroLKH achieves performance similar to VSR-LKH on traditional benchmark TSPLIB Reinelt (1991) with various node distributions, which are very different from the training distributions for NeuroLKH.
The Lin-Kernighan-Helsgaun (LKH) algorithm Helsgaun (2000, 2009) is a local optimization algorithm developed based on the λ-opt move Lin (1965), where λ edges in the current tour are exchanged for another set of λ edges to achieve a shorter tour. While solving one instance, the LKH algorithm can conduct multiple trials to find better solutions. In each trial, starting from a randomly initialized tour, it iteratively searches for λ-opt exchanges that improve the tour, until no such exchanges can be found. In each iteration, the λ-opt exchanges are searched in the ascending order of the variable λ, and the tour is replaced once an exchange is found to reduce the tour distance.
One central rule is that the λ-opt searching process is restricted and directed by an edge candidate set, which is created before the search based on the α-measure using sensitivity analysis of the Minimum Spanning Tree. Here we briefly introduce the related concepts. A TSP instance can be viewed as an undirected graph G = (V, E) with V as the set of nodes and E as the set of edges weighted by distances. A spanning tree of G is a connected subgraph with |V| − 1 edges from E and no cycles, in which any pair of nodes is connected by a path. A 1-tree of G is a spanning tree for the subgraph induced by the node set V \ {1}, combined with two edges in E connected to node 1, an arbitrary special node in V. A minimum 1-tree is the 1-tree with minimum length. The α-measure of an edge (i, j) for graph G is defined as α(i, j) = L(T⁺(i, j)) − L(T), where L(T) is the length of the minimum 1-tree T and L(T⁺(i, j)) is the length of the minimum 1-tree required to include the edge (i, j). The α-measure of an edge can thus be viewed as the extra length the minimum 1-tree incurs to include this edge.
The edge candidate set consists of the K edges with the smallest α-measures connected to each node (K = 5 as default). During the λ-opt searching process, the edges to be included into the new tour are limited to the edges in this candidate set, and edges with smaller α-measures have higher priorities to be searched over. Therefore this candidate set not only restricts but also directs the search.
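To make the 1-tree machinery concrete, here is a minimal sketch (not the paper's implementation) that computes the length of a minimum 1-tree with Prim's algorithm, taking node 0 as the special node and a symmetric distance matrix `d`:

```python
def min_one_tree_length(d):
    """Length of a minimum 1-tree: a minimum spanning tree over nodes
    1..n-1 plus the two cheapest edges incident to special node 0."""
    n = len(d)
    INF = float("inf")
    in_tree = [False] * n
    in_tree[0] = in_tree[1] = True      # node 0 is excluded; Prim starts at node 1
    best = [d[1][j] for j in range(n)]  # cheapest link into the growing tree
    best[0] = INF
    total = 0.0
    for _ in range(n - 2):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        total += best[j]
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k] and d[j][k] < best[k]:
                best[k] = d[j][k]
    return total + sum(sorted(d[0][1:])[:2])  # attach node 0's two cheapest edges
```

On a unit-square instance the minimum 1-tree coincides with the optimal tour of length 4, illustrating why the 1-tree length lower-bounds the optimal tour distance (every tour is itself a 1-tree).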
Moreover, the quality of the α-measures can be improved significantly by a subgradient optimization method. If we add a penalty π_i to each node i and transform the original distance d(i, j) of the edge (i, j) to a new distance d'(i, j) = d(i, j) + π_i + π_j, the optimal tour for the TSP will stay the same but the minimum 1-tree usually will change. This is useful because, by definition, a minimum 1-tree with node degrees all equal to 2 is an optimal solution for the corresponding TSP instance. With the length of the minimum 1-tree resulting from the penalty vector π denoted as L(T_π), the value w(π) = L(T_π) − 2·Σ_i π_i is a lower bound of the optimal tour distance for the original TSP instance. LKH applies subgradient optimization Held and Karp (1971) to iteratively maximize this lower bound for multiple steps until convergence by applying π^{t+1} = π^t + s_t (d^t − 2) at step t, where s_t is the scalar step size and d^t is the vector of node degrees in the minimum 1-tree with penalty π^t. Therefore, the node degrees are pushed towards 2. The α-measures after this optimization substantially improve the quality of the edge candidate set. Furthermore, the transformed edge distances after this optimization help find better solutions when used during the searching process for λ-opt exchanges. However, subgradient optimization still has a major limitation: it is performed on each instance iteratively until convergence, which costs a large amount of time, especially for large-scale problems. Moreover, even after subgradient optimization, some critical patterns could be missed by the relatively straightforward sensitivity analysis of the spanning tree. Therefore, the quality of the edge candidate set could still be improved by large margins, which would in turn improve the overall performance.
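The subgradient update above can be sketched as follows (a simplified illustration; LKH's actual implementation uses a more elaborate step-size schedule). `one_tree_degrees` computes the node degrees of the minimum 1-tree under the transformed distances, and each step moves π in the direction of the subgradient d − 2:

```python
def one_tree_degrees(d, pi):
    """Degrees of each node in the minimum 1-tree under penalties pi,
    using transformed distances d'(i,j) = d(i,j) + pi[i] + pi[j]."""
    n = len(d)
    INF = float("inf")
    dp = [[d[i][j] + pi[i] + pi[j] for j in range(n)] for i in range(n)]
    deg = [0] * n
    in_tree = [False] * n
    in_tree[0] = in_tree[1] = True       # node 0 excluded; Prim starts at node 1
    best = list(dp[1])
    best[0] = INF
    parent = [1] * n
    for _ in range(n - 2):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[j] = True
        deg[j] += 1
        deg[parent[j]] += 1
        for k in range(n):
            if not in_tree[k] and dp[j][k] < best[k]:
                best[k], parent[k] = dp[j][k], j
    # attach the special node 0 with its two cheapest transformed edges
    nbrs = sorted(range(1, n), key=lambda j: dp[0][j])[:2]
    deg[0] += 2
    for j in nbrs:
        deg[j] += 1
    return deg

def subgradient_opt(d, steps=100, step_size=0.1):
    """Held-Karp style penalty optimization: push 1-tree degrees toward 2.
    A fixed step size is used here for simplicity."""
    pi = [0.0] * len(d)
    for _ in range(steps):
        deg = one_tree_degrees(d, pi)
        pi = [p + step_size * (dg - 2) for p, dg in zip(pi, deg)]
    return pi
```

When the minimum 1-tree already has all degrees equal to 2 (i.e., it is a tour), the subgradient vanishes and the penalties stop changing.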
We propose the NeuroLKH algorithm, which employs a Sparse Graph Network to learn the complex patterns associated with the TSP instances generated from a distribution. Concretely, the network will learn the edge scores and node penalties simultaneously with a multi-task training process. The edge scores are trained with supervised learning for creating the edge candidate set, while the node penalties are trained with unsupervised learning for transforming the edge distances. The architecture of NeuroLKH is presented in Figure 1, along with the original LKH algorithm. We will detail the Sparse Graph Network, the training process and the proposed NeuroLKH algorithm in the following.
For the Sparse Graph Network (SGN), we represent the TSP instance as a sparse directed graph G_s = (V, E_s) containing the node set V and a sparse edge set E_s, which only includes the γ shortest edges pointing out of each node, as shown in the leftmost green box in Figure 1, where the circles represent the nodes and the diamonds represent the directed edges. Sparsification of the graph is crucial for effectively training the deep learning model on large TSP instances and generalizing to even larger sizes. Note that an edge (i, j) belonging to E_s does not necessarily mean that the opposite-direction edge (j, i) belongs to E_s. The node inputs are the node coordinates and the edge inputs are the edge distances. Though we focus on 2-dimensional TSP with Euclidean distance as in other deep learning literature like Kool et al. (2019), the model can be applied to other kinds of TSP.
The SGN consists of 1) one encoder embedding the edge and node inputs into the corresponding feature vectors, and 2) two decoders for the edge scores and node penalties, respectively.
Encoder. The encoder first linearly projects the node inputs and the edge inputs into feature vectors x_i^0 and e_ij^0, respectively, where D is the feature dimension. Then the node and edge features are embedded with Sparse Graph Convolutional Layers, which are defined formally as follows:

$$\hat{x}_i^{l} = W_1^{l} x_i^{l-1} + \sum_{(i,j)\in E_s} \Big( \sigma(e_{ij}^{l-1}) \oslash \sum_{(i,m)\in E_s} \sigma(e_{im}^{l-1}) \Big) \odot W_2^{l} x_j^{l-1} \qquad (1)$$

$$x_i^{l} = x_i^{l-1} + \mathrm{ReLU}\big(\mathrm{BN}(\hat{x}_i^{l})\big) \qquad (2)$$

$$\hat{e}_{ij}^{l} = W_3^{l} e_{ij}^{l-1} + W_4^{l} x_i^{l-1} + W_5^{l} x_j^{l-1} + W_6^{l} e_{ji}^{l-1} \qquad (3)$$

$$e_{ij}^{l} = e_{ij}^{l-1} + \mathrm{ReLU}\big(\mathrm{BN}(\hat{e}_{ij}^{l})\big) \qquad (4)$$

where ⊙ and ⊘ represent the element-wise multiplication and the element-wise division, respectively; l is the layer index; W_1^l to W_6^l are trainable parameters; σ is the sigmoid function; Eqs. (2) and (4) consist of a Skip-Connection layer He et al. (2016) and a Batch Normalization layer Ioffe and Szegedy (2015) in each; and the idea of element-wise attention in Eq. (1) is adopted from Bresson and Laurent (2017). As the input graph is directed and sparse, edges with different directions are embedded separately. But obviously the embedding of an edge (i, j) should benefit from knowing whether its opposite-direction counterpart (j, i) is also in the graph and the information of e_ji when it exists, which motivates our design of Eqs. (3) and (4).

Decoders. The edge decoder takes the edge embeddings e_ij^L from the encoder and embeds them with two layers of linear projection followed by ReLU activation into h_ij. Then the edge scores are calculated as follows:

$$\beta_{ij} = \frac{\exp(W_e h_{ij})}{\sum_{(i,m)\in E_s} \exp(W_e h_{im})} \qquad (5)$$

Similarly, the node decoder first embeds the node embeddings x_i^L with two layers of linear projection and ReLU activation into z_i. Then the node penalties are calculated as follows:

$$\pi_i = C \tanh(W_v z_i) \qquad (6)$$

where W_e and W_v are trainable parameters; tanh is used to keep the node penalties in the range of (−C, C).
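A NumPy sketch of one Sparse Graph Convolutional layer may clarify the dataflow of Eqs. (1)-(4); for brevity it omits Batch Normalization and the reverse-direction edge term of Eq. (3), and the weight names are illustrative rather than taken from the released code:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sgn_layer(x, e, nbr, W):
    """One Sparse Graph Convolutional layer (sketch of Eqs. (1)-(4)).
    x: (n, D) node features; e: (n, g, D) features of the g directed edges
    leaving each node; nbr: (n, g) neighbor indices; W: dict of (D, D)
    weight matrices."""
    sig = 1.0 / (1.0 + np.exp(-e))                 # sigma(e), gate values
    attn = sig / sig.sum(axis=1, keepdims=True)    # element-wise attention, Eq. (1)
    agg = (attn * (x[nbr] @ W["W2"])).sum(axis=1)  # gated neighbor aggregation
    x_new = x + relu(x @ W["W1"] + agg)            # skip-connection, Eq. (2)
    # Eq. (3): combine edge features with both endpoint node features
    e_hat = e @ W["W3"] + (x @ W["W4"])[:, None, :] + x[nbr] @ W["W5"]
    e_new = e + relu(e_hat)                        # skip-connection, Eq. (4)
    return x_new, e_new
```

Stacking several such layers gives the encoder; the node and edge feature shapes are preserved from layer to layer, which is what makes the skip-connections in Eqs. (2) and (4) well-defined.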
We train the network to learn the edge scores with supervised learning. The edge loss is detailed as follows:

$$\mathcal{L}_e = -\frac{1}{|E_s|} \sum_{(i,j)\in E_s} y_{ij} \log \beta_{ij} \qquad (7)$$

where y_ij = 1 if the edge (i, j) belongs to the optimal tour and y_ij = 0 otherwise. Effectively, we increase the edge scores if the edge belongs to the optimal tour and decrease them otherwise.
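Under the assumption that the edge scores are normalized over the candidate edges leaving each node, the score and loss computations can be sketched as follows (shapes and names are illustrative, not taken from the released code):

```python
import numpy as np

def edge_scores(h, w_e):
    """Sketch of Eq. (5): softmax over the g candidate edges leaving each node.
    h: (n, g, D) decoder edge embeddings; w_e: (D,) output weights."""
    logits = h @ w_e                                        # (n, g)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable
    return z / z.sum(axis=1, keepdims=True)

def edge_loss(beta, y):
    """Sketch of Eq. (7): negative log-likelihood of the optimal-tour edges.
    y: (n, g) binary mask, 1 where the candidate edge is in the optimal tour."""
    return -np.sum(y * np.log(beta + 1e-9)) / beta.shape[0]
```

Because the scores of each node's candidate edges sum to one, raising the score of an optimal-tour edge automatically lowers the scores of its competitors.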
The node penalties are trained by unsupervised learning. Similar to the goal of subgradient optimization in LKH, we aim to pull the Minimum 1-Tree generated from the TSP graph closer to a tour, in which all nodes have a degree of 2. An important distinction from LKH is that we are learning the patterns for a class of TSP instances following a distribution, instead of optimizing the penalties for a specific TSP instance. The node loss is detailed as follows:
$$\mathcal{L}_\pi = -\frac{1}{|V|} \sum_{i\in V} \big(d_i(\pi) - 2\big)\, \pi_i \qquad (8)$$

where d_i(π) is the degree of node i in the minimum 1-tree induced with penalty π (treated as a constant with respect to π). The penalties are increased for nodes with degrees larger than 2 and decreased for nodes with smaller degrees. The SGN is trained for the task of outputting the edge scores and node penalties simultaneously with the loss function L = L_e + η L_π, where η is the coefficient for balancing the two losses.

The process of using NeuroLKH to solve one instance is shown in Algorithm 1. Firstly, the TSP instance is converted to a sparse directed graph G_s. Then the SGN encoder embeds the nodes and edges in G_s into feature embeddings, based on which the decoders output the node penalties π and edge scores β. Afterwards, NeuroLKH creates a powerful edge candidate set and transforms the distance of each edge effectively, which further guides NeuroLKH to conduct multiple LKH trials to find good solutions. We detail each part as follows.
Transform Edge Distance. Based on the node penalties π, the original edge distances d(i, j) are transformed into new distances d'(i, j) = d(i, j) + π_i + π_j, which are used in the search process. With such a transformation, the optimal solution tour stays the same. The tour distance calculated with the transformed edge distances is then reduced by 2·Σ_i π_i to restore the tour distance for the original TSP.
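This transformation is a few lines of code; a small sketch (with hypothetical function names) shows both directions:

```python
def transform_distances(d, pi):
    """d'(i,j) = d(i,j) + pi[i] + pi[j]. Every node appears as an endpoint
    of exactly two tour edges, so all tour lengths shift by the same
    constant 2*sum(pi) and the optimal tour is unchanged."""
    n = len(d)
    return [[d[i][j] + pi[i] + pi[j] for j in range(n)] for i in range(n)]

def restore_tour_length(transformed_length, pi):
    """Recover the original tour distance from the transformed one."""
    return transformed_length - 2 * sum(pi)
```

Since the shift is identical for every tour, the ranking of tours is preserved; only the reported distance needs correcting.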
Create Edge Candidate Set. For each node i, the edge scores β_ij are sorted over the edges (i, j) ∈ E_s, and the edges with the top-K largest scores are included in the edge candidate set. Edges with larger scores have higher priorities in the candidate set and are tried first for adding during the exchanges of the LKH search process. Note that neither the original LKH nor NeuroLKH can guarantee that all the edges in the optimal tour are included in the edge candidate set. However, optimal solutions are still likely to be found during the multiple trials.
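Creating the candidate set from the SGN scores is a simple per-node top-K selection; a NumPy sketch (names are illustrative):

```python
import numpy as np

def candidate_set(beta, nbr, K=5):
    """Keep the K highest-scoring candidate edges per node, ordered by
    score so that higher-scoring edges are tried first during the search.
    beta: (n, g) edge scores; nbr: (n, g) neighbor indices.
    Returns an (n, K) array of candidate neighbors per node."""
    order = np.argsort(-beta, axis=1)[:, :K]
    return np.take_along_axis(nbr, order, axis=1)
```

The ordering matters as much as the membership: because LKH tries candidates in priority order, a well-ranked set steers the λ-opt search toward promising exchanges early.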
LKH Searching Trials. To solve one TSP instance, LKH conducts multiple trials to find better solutions. In each trial, one tour is initialized randomly, and iterations of LKH search are conducted for λ-opt exchanges until the tour can no longer be improved by such exchanges. In each iteration, LKH searches in the ascending order of λ for λ-opt exchanges that reduce the tour length, which are applied once found.
Based on the trained SGN network, NeuroLKH infers the edge distance transformation and the candidate set to guide the LKH trials, which is done by performing a forward calculation through the model. This is much faster than the corresponding procedure in the original LKH, which employs subgradient optimization on each instance iteratively until convergence and is apparently time-consuming, especially for large-scale problems. More importantly, rather than using the hand-crafted rules based on sensitivity analysis in the original LKH, NeuroLKH learns to create edge candidate sets of much higher quality with the powerful deep model, leading to significantly better performance.
In this section, we conduct extensive experiments on TSP with various sizes and show the effectiveness of NeuroLKH compared to the baseline algorithms. Our code is publicly available at https://github.com/liangxinedu/NeuroLKH.
Dataset distribution. Closely following the existing works such as Kool et al. (2019)
, we experiment with 2-dimensional TSP instances in the Euclidean distance space, where both coordinates of each node are generated independently from a unit uniform distribution. We train only one network using TSP instances ranging from 101 to 500 nodes. Since the amount of supervision and feedback during training is linearly related to the number of nodes, we generate fewer instances for larger sizes, resulting in approximately 780,000 instances in total, so that the amounts of supervision and feedback are kept similar across different sizes. We use Concorde (https://www.math.uwaterloo.ca/tsp/concorde) to get the optimal edges for the supervised training of edge scores. For testing, we generate 1000 instances for each testing problem size.

Hyperparameters. We choose the number γ of directed edges pointing out of each node in the sparse edge set such that only 0.01% of the edges in the optimal tours are missed in E_s for the training dataset, and we conduct experiments to justify this choice in Appendix Section A. The hidden dimension D of the network with Sparse Graph Convolutional Layers and the node penalty coefficient η in the loss function are kept fixed across all experiments. The network is trained by the Adam Optimizer Kingma and Ba (2014) with a learning rate of 0.0001 for 16 epochs, which takes approximately 4 days. The deep learning models are trained and evaluated with one RTX-2080Ti GPU. The other parts of the experiments without deep models for NeuroLKH and the other baselines are conducted with random seed 1234 on an Intel(R) Core(TM) i9-10940X CPU unless stated otherwise. Hyperparameters for the LKH searching process are consistent with the example script for TSP given by LKH available online
(http://akira.ruc.dk/%7Ekeld/research/LKH-3/LKH-3.0.6.tgz) and with those used in Zheng et al. (2021).

Table 1: Comparison on the training sizes; the three column groups correspond to n = 100, 200 and 500 nodes. Gaps are in units of 0.01% and * marks optimal values. VSR-LKH and NeuroLKH use the same time limits as the corresponding LKH rows, so their Time columns are left blank.

| Method | Time(s) | Obj | Gap(‱) | Time(s) | Obj | Gap(‱) | Time(s) | Obj | Gap(‱) |
|---|---|---|---|---|---|---|---|---|---|
| Concorde | 207 | *7.753246 | 0.000 | 1072 | *10.701303 | 0.000 | 17022 | *16.541830 | 0.000 |
| LKH (1 trial) | 33 | 7.755071 | 2.353 | 80 | 10.707043 | 5.364 | 338 | 16.556733 | 9.009 |
| VSR-LKH | | 7.754980 | 2.236 | | 10.706739 | 5.080 | | 16.557297 | 9.350 |
| NeuroLKH | | 7.753332 | 0.111 | | 10.701873 | 0.533 | | 16.543197 | 0.826 |
| LKH (10 trials) | 43 | 7.754177 | 1.200 | 111 | 10.703724 | 2.263 | 445 | 16.548017 | 3.740 |
| VSR-LKH | | 7.754184 | 1.209 | | 10.703997 | 2.518 | | 16.549591 | 4.692 |
| NeuroLKH | | 7.753311 | 0.083 | | 10.701623 | 0.299 | | 16.542880 | 0.634 |
| LKH (100 trials) | 127 | 7.753450 | 0.263 | 368 | 10.701755 | 0.423 | 1147 | 16.543707 | 1.134 |
| VSR-LKH | | 7.753407 | 0.207 | | 10.701687 | 0.359 | | 16.543085 | 0.759 |
| NeuroLKH | | 7.753270 | 0.030 | | 10.701381 | 0.073 | | 16.542163 | 0.201 |
| LKH (1000 trials) | 938 | 7.753254 | 0.010 | 2805 | 10.701351 | 0.045 | 7527 | 16.542125 | 0.178 |
| VSR-LKH | | 7.753322 | 0.097 | | 10.701336 | 0.031 | | 16.541934 | 0.063 |
| NeuroLKH | | 7.753247 | 0.000 | | 10.701303 | 0.000 | | 16.541847 | 0.010 |
Here, we compare NeuroLKH with the original LKH algorithm Helsgaun (2009) and the recently proposed VSR-LKH algorithm Zheng et al. (2021). We do not compare with other deep learning based methods here because their performances are rather inferior to LKH, and most of them can hardly generalize to problems with more than 100 nodes. One exception is the method in Fu et al. (2021), which is tested on large problems but the performances are still far worse than LKH.
All algorithms are run once for each testing instance as we find running multiple times only provides very marginal improvement. For each testing problem size, we run the original LKH for 1, 10, 100, and 1000 trials, and record the total amounts of time in solving the 1000 instances. Then we impose the same amounts of time as time limits to NeuroLKH and VSR-LKH for solving the same 1000 instances for fair comparison. Note that for NeuroLKH, the solving time is the summation of the inference time of SGN on GPU and LKH searching time on CPU. In the following tables, for each size and time limit, we report the average performance (tour distance) and the total solving time for the 1000 testing instances.
Comparison on training sizes. In Table 1, we report the performances of LKH, VSR-LKH and NeuroLKH on three testing datasets with 100, 200 and 500 nodes, which are within the size range of instances used in training. Note that we train only one SGN Network on a wide range of problem sizes and here we use these three sizes to demonstrate the testing performances. We also use the exact solver Concorde on these instances to obtain the optimal solutions and compute the optimality gap for each method. As shown in this table, it is clear that NeuroLKH outperforms both LKH and VSR-LKH significantly and consistently across different problem sizes and with different time limits. Notably, the optimality gaps are reduced by at least an order of magnitude for most of the cases, which is a significant improvement.
Generalization analysis on larger sizes. We further show the generalization ability of NeuroLKH on much larger graph sizes of 1000, 2000 and 5000 nodes. Note that while the edge scores in SGN generalize well without any modification, it is hard for the node penalties to directly generalize. This is because they are trained unsupervisedly and SGN does not have any knowledge about how to penalize the nodes for larger TSP instances. Nevertheless, this can be resolved by a simple fine-tuning step. As the learned node embeddings are very powerful, we only fine-tune the very small number of parameters in the SGN node decoder and keep the other parameters fixed. Specifically, for each of the large sizes, we fine-tune the node decoder for 100 iterations with a fixed batch size, which takes less than one minute for each size of 1000, 2000 and 5000. This fast fine-tuning process is for TSPs of one size generated from the distribution instead of specific instances, and may be viewed as adjusting the scale of penalties for large sizes. The generalization results are summarized in Table 2. Note that we do not run Concorde here due to the prohibitively long running time, and the gaps are with respect to the best value found by all methods. Clearly, NeuroLKH generalizes well to substantially larger problem sizes and the improvement of NeuroLKH over baselines is significant and consistent across all the settings.
Further discussion. The inference time of SGN in NeuroLKH for the 1000 instances of 100, 200, 500, 1000, 2000 and 5000 nodes is 3s, 6s, 16s, 33s, 63s and 208s, respectively, which grows approximately linearly with the number of nodes n. In contrast, the subgradient optimization in LKH and VSR-LKH needs 20s, 51s, 266s, 1028s, 4501s and 38970s, which grows superlinearly with n and is much longer than SGN inference, especially for large-scale problems. For NeuroLKH, the saved time is used to conduct more trials, which effectively helps to find better solutions. This effect is more salient with short time limits. Meanwhile, the number of trials is also small under short time limits and the algorithm only searches a small number of solutions, in which case the guidance of the edge candidate set is more important. Due to these two reasons, the improvement of NeuroLKH over baselines is particularly substantial for short time limits. This is a desirable property, especially for time-critical applications and solving large-scale problems, for which large numbers of trials are not feasible.
Table 2: Generalization to larger sizes; the three column groups correspond to n = 1000, 2000 and 5000 nodes. Gaps are in units of 0.01% and are with respect to the best value found by all methods; VSR-LKH and NeuroLKH use the same time limits as the corresponding LKH rows.

| Method | Time(s) | Obj | Gap(‱) | Time(s) | Obj | Gap(‱) | Time(s) | Obj | Gap(‱) |
|---|---|---|---|---|---|---|---|---|---|
| LKH (1 trial) | 1183 | 23.155916 | 10.593 | 4843 | 32.483851 | 11.264 | 40048 | 51.025519 | 12.284 |
| VSR-LKH | | 23.154946 | 10.173 | | 32.485551 | 11.788 | | 51.025539 | 12.288 |
| NeuroLKH | | 23.133494 | 0.899 | | 32.449752 | 0.755 | | 50.965382 | 0.484 |
| LKH (10 trials) | 1414 | 23.143435 | 5.197 | 5322 | 32.466953 | 6.056 | 41523 | 50.998721 | 7.026 |
| VSR-LKH | | 23.143347 | 5.159 | | 32.467997 | 6.377 | | 51.000093 | 7.295 |
| NeuroLKH | | 23.133066 | 0.714 | | 32.449519 | 0.683 | | 50.965219 | 0.452 |
| LKH (100 trials) | 2567 | 23.135427 | 1.735 | 7371 | 32.455454 | 2.512 | 47884 | 50.976677 | 2.700 |
| VSR-LKH | | 23.134426 | 1.302 | | 32.454427 | 2.195 | | 50.979317 | 3.218 |
| NeuroLKH | | 23.132258 | 0.365 | | 32.448666 | 0.420 | | 50.964677 | 0.345 |
| LKH (1000 trials) | 12884 | 23.132216 | 0.347 | 25613 | 32.448954 | 0.509 | 103885 | 50.965233 | 0.455 |
| VSR-LKH | | 23.131658 | 0.105 | | 32.447953 | 0.200 | | 50.965300 | 0.468 |
| NeuroLKH | | 23.131414 | 0.000 | | 32.447304 | 0.000 | | 50.962916 | 0.000 |
In Figure 2, we plot the performance of the LKH, VSR-LKH and NeuroLKH algorithms for solving the testing datasets with different numbers of nodes against different running time to visualize the improvement process (the resulting objective values after each trial). The time limits are set to the longest ones used in Table 1 and Table 2, which are the running time of LKH with 1000 trials. Clearly, NeuroLKH outperforms both LKH and VSR-LKH significantly and consistently across different problem sizes and with different time limits. In particular, NeuroLKH is superior as it not only reaches good solutions fast but also converges to better solutions eventually. With the same performance (i.e. objective value), NeuroLKH considerably reduces the computational time. We can also conclude that when the time limit is short, the improvement of NeuroLKH over baselines is particularly substantial. In addition, we show that the subgradient optimization is necessary for LKH and VSR-LKH. As exhibited in Figure 2, the performances of both LKH and VSR-LKH are much worse without subgradient optimization (w/o SO). More impressively, even ignoring the preprocessing time (IPT) used for subgradient optimization (pertaining to LKH and VSR-LKH) and Sparse Graph Network inferring (pertaining to NeuroLKH), NeuroLKH still outstrips both LKH and VSR-LKH. Note that this comparison is unfair for NeuroLKH as LKH and VSR-LKH consume much longer preprocessing time which is unavoidable.
For the results reported in Table 1 and Table 2, almost all the improvements of NeuroLKH over LKH and VSR-LKH on different sizes and with different time limits are statistically significant with confidence levels larger than 99%. The only exception is the performance on TSP with 100 nodes under the running time of LKH with 1000 trials, where the confidence levels are 90.5% and 97.6% for the two improvements, respectively.
In the Appendix Section A, we also show that NeuroLKH substantially outperforms other deep learning based methods Kool et al. (2019, 2021); Wu et al. (2021); Joshi et al. (2019); Hottung et al. (2021); Fu et al. (2021); Kwon et al. (2020); da Costa et al. (2020).
Generalization to TSPLIB benchmark. Besides generalization to larger sizes, generalization to different distributions remains a crucial challenge for deep learning based methods in existing works. The TSPLIB benchmark contains instances with various node distributions, making it extremely hard for such methods. We test on all the 72 TSPLIB instances with Euclidean distances and less than 10000 nodes. The number of trials is set to be the number of nodes and the algorithms are run 10 times for each instance following the convention for TSPLIB in Helsgaun (2000); Zheng et al. (2021). With the various unknown node distributions, we do not fine-tune the model for the node penalties and only use the edge scores in NeuroLKH. For the 24 instances labeled as hard in Zheng et al. (2021)
, which the original LKH fails to solve optimally during at least one of the 10 runs, NeuroLKH trained with uniformly distributed data is able to find optimal solutions 6.13 times on average, which is much better than LKH (3.75 times). As a method that learns online during the search, VSR-LKH finds optimal solutions 6.42 times on average, slightly better than NeuroLKH. While NeuroLKH improves the results on most hard instances, it could generalize poorly on instances with certain special patterns, such as instances where most nodes are located along several horizontal lines, making it fail to solve 11 of the 48 easy instances optimally for some runs.
With the same training dataset size, we trained another model NeuroLKH_M using a mixture of instances with uniformly distributed nodes, clustered nodes with 3-8 clusters, half uniform and half clustered nodes following Uchoa et al. (2017). NeuroLKH_M finds optimal solutions 6.79 times on average for the hard instances and fails to solve only 5 easy instances optimally for some runs, better than the NeuroLKH trained with only uniformly distributed instances. For all the 72 instances, NeuroLKH_M finds optimal solutions 8.74 times on average, which is much better than LKH (7.92 times) but slightly worse than VSR-LKH (8.78 times). Detailed results of each instance are listed in the Appendix Section B.
Finally, we show that NeuroLKH can be easily extended to much more complicated routing problems such as the Capacitated Vehicle Routing Problem (CVRP), the Pickup and Delivery Problem (PDP), and CVRP with Time Windows (CVRPTW), which we briefly introduce in Appendix Section C. Unlike in TSP, node penalties do not apply to these problems, so NeuroLKH only learns the edge candidate set. As these three problems are very hard to solve and optimal solutions cannot be obtained in a reasonable amount of time, we use LKH with 10000 trials to produce solutions as training labels. The demands, capacities, and the starts and ends of the time windows are taken as node inputs along with the coordinates. For PDP, we add connections between each pair of pickup and delivery nodes and assign weight matrices to these connections in Eq. (2). For PDP and CVRPTW, the edge directions affect tour feasibility; therefore the model learns in-direction and out-direction edge scores for each node with Eq. (5).
The node coordinates are generated uniformly from the unit square for all three problems, following Kool et al. (2019); Li et al. (2021). For CVRP, the demands of customers are generated uniformly from the integers {1..9} with the capacity fixed as , compatible with the largest CVRP (100 nodes) studied in Kool et al. (2019). For CVRPTW, we generate demands, capacity, serving time and time windows in the same way as Falkner and Schmidt-Thieme (2020). A training dataset for CVRP with 101-500 nodes and instances for each size (about 180000 in total) is used to train the SGN for 10 epochs. PDP and CVRPTW are harder to solve; therefore we use a training dataset with 41-200 nodes and instances for each size. The other SGN hyperparameters are the same as for TSP, and those for the LKH searching process are consistent with the example scripts provided by LKH for CVRP, PDP and CVRPTW (with the SPECIAL hyperparameter).
In Table 3, we report the performance of NeuroLKH and the original LKH on testing datasets of 1000 instances for the smallest and largest graph sizes (numbers of customers) used in training, as well as a much larger generalization size. We use the solving time of LKH with 100, 1000 and 10000 trials as the time limits. For 100 trials, both methods fail to find feasible solutions for fewer than 1% of the PDP and CVRPTW test instances with 300 nodes; whenever this happens, we push the infeasible visits to the end of the solution to restore feasibility. The inference time of SGN is 1s, 3s, 7s, 10s, 19s and 40s in total for the 1000 instances in the testing datasets with 40, 100, 200, 300, 500 and 1000 nodes, respectively, a tiny fraction of the LKH searching time. As shown in Table 3, NeuroLKH significantly improves the solution quality over the original LKH, a very strong heuristic solver, for all three problems, showing its potential for handling various types of routing problems.
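The repair mentioned above ("push the infeasible visits to the end") can be illustrated for the PDP precedence constraint. This is a hypothetical sketch of one such repair, not the paper's implementation; `pickup_of` is an assumed mapping from each delivery node to its pickup node.

```python
def repair_precedence(tour, pickup_of):
    """Move each delivery that appears before its pickup to the end of the
    tour, so that every pickup precedes its delivery (illustrative repair)."""
    seen, kept, deferred = set(), [], []
    for node in tour:
        if node in pickup_of and pickup_of[node] not in seen:
            deferred.append(node)  # delivery visited before its pickup: defer
        else:
            kept.append(node)
            seen.add(node)
    # deferred deliveries go last; by then all pickups have been visited
    return kept + deferred
```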
| | Method | Time(s) | Obj | Gap(%) | Time(s) | Obj | Gap(%) | Time(s) | Obj | Gap(%) |
|---|---|---|---|---|---|---|---|---|---|---|
| **CVRP** | | *Size 100* | | | *Size 500* | | | *Size 1000 (generalization)* | | |
| | LKH (100 trials) | 485 | 15.8363 | 1.675 | 2043 | 42.1621 | 5.394 | 4607 | 58.1372 | 9.750 |
| | NeuroLKH | | 15.7770 | 1.295 | | 41.7311 | 4.316 | | 56.6469 | 6.937 |
| | LKH (1000 trials) | 4520 | 15.6483 | 0.468 | 15812 | 40.6103 | 1.515 | 30133 | 54.3412 | 2.584 |
| | NeuroLKH | | 15.6295 | 0.348 | | 40.4974 | 1.233 | | 54.0499 | 2.034 |
| | LKH (10000 trials) | 45435 | 15.5823 | 0.044 | 166875 | 40.0670 | 0.157 | 319368 | 53.1093 | 0.259 |
| | NeuroLKH | | 15.5754 | 0.000 | | 40.0043 | 0.000 | | 52.9723 | 0.000 |
| **PDP** | | *Size 40* | | | *Size 200* | | | *Size 300 (generalization)* | | |
| | LKH (100 trials) | 115 | 6.2495 | 0.819 | 2832 | 13.8390 | 5.535 | 7939 | 17.0913 | 6.916 |
| | NeuroLKH | | 6.2241 | 0.409 | | 13.6246 | 3.899 | | 16.7867 | 5.011 |
| | LKH (1000 trials) | 845 | 6.2088 | 0.163 | 21216 | 13.2850 | 1.310 | 55643 | 16.2447 | 1.620 |
| | NeuroLKH | | 6.2041 | 0.087 | | 13.2443 | 0.999 | | 16.1857 | 1.251 |
| | LKH (10000 trials) | 7989 | 6.1998 | 0.018 | 195220 | 13.1387 | 0.194 | 515377 | 16.0119 | 0.163 |
| | NeuroLKH | | 6.1988 | 0.000 | | 13.1132 | 0.000 | | 15.9857 | 0.000 |
| **CVRPTW** | | *Size 40* | | | *Size 200* | | | *Size 300 (generalization)* | | |
| | LKH (100 trials) | 147 | 9.3051 | 1.081 | 813 | 26.1757 | 7.124 | 1746 | 34.2301 | 8.798 |
| | NeuroLKH | | 9.2606 | 0.597 | | 25.4000 | 3.949 | | 32.9676 | 4.786 |
| | LKH (1000 trials) | 1017 | 9.2276 | 0.239 | 4525 | 24.9770 | 2.218 | 7820 | 32.2671 | 2.559 |
| | NeuroLKH | | 9.2207 | 0.164 | | 24.7857 | 1.435 | | 32.0224 | 1.781 |
| | LKH (10000 trials) | 9624 | 9.2073 | 0.018 | 45509 | 24.5338 | 0.405 | 75481 | 31.5719 | 0.350 |
| | NeuroLKH | | 9.2056 | 0.000 | | 24.4350 | 0.000 | | 31.4620 | 0.000 |
Performance on traditional benchmarks. To show the effectiveness of NeuroLKH on complicated routing problems with various distributions, we perform experiments on the CVRPLIB Uchoa et al. (2017) and Solomon Solomon (1987) benchmark datasets. CVRPLIB contains CVRP instances of various sizes combining 3 depot positionings, 3 customer positionings and 7 demand distributions. The Solomon benchmark contains CVRPTW instances with 100 customers and various distributions of time windows. We detail the benchmarks, the training datasets and the results for each instance in Appendix Section D. In summary, on the 43 instances with 100-300 nodes in CVRPLIB, NeuroLKH improves the average performance on 38, 38 and 31 instances when the time limits are set to the time of LKH with 100, 1000 and 10000 trials, respectively. On the 11 Solomon R2-type instances, NeuroLKH outperforms LKH almost consistently across all settings (32 of the 33 instance-setting combinations).
In this paper, we propose an algorithm that harnesses the power of deep learning models in combination with a strong heuristic for TSP. Specifically, a Sparse Graph Network is trained to predict the edge scores and the node penalties, which are used to generate the edge candidate set and transform the edge distances, respectively. As shown in the extensive experiments, the improvement of NeuroLKH over baseline algorithms is consistent and significant across different time limits, and NeuroLKH generalizes well both to instances with much larger graph sizes than the training sizes and to traditional benchmarks with various node distributions. We also use CVRP, PDP and CVRPTW to demonstrate that NeuroLKH applies effectively to other routing problems. NeuroLKH learns routing patterns for TSP that generalize well to much larger sizes and different node distributions. However, for more complicated routing problems such as CVRP and CVRPTW, although NeuroLKH generalizes well to larger sizes, it is hard to generalize directly to other distributions of demands and time windows without training, which is a limitation of NeuroLKH and is left for future research. In addition, NeuroLKH can be further combined with other learning-based techniques such as sparsifying the TSP graph Sun et al. (2021), and with other strong traditional algorithms such as the Hybrid Genetic Search Vidal et al. (2012).
This work was supported by the A*STAR Cyber-Physical Production System (CPPS) – Towards Contextual and Intelligent Response Research Program, under the RIE2020 IAF-PP Grant A19C1a0018, and Model Factory@SIMTech, in part by the National Natural Science Foundation of China under Grant 61803104 and Grant 62102228, and in part by the Young Scholar Future Plan of Shandong University under Grant 62420089964188.
References (excerpt):

- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hottung, A., Bhandari, B., and Tierney, K. (2021). Learning a latent search space for routing problems using variational autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR).
- Vidal, T., Crainic, T. G., Gendreau, M., Lahrichi, N., and Rei, W. (2012). A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Operations Research, 60(3), pp. 611–624.
- Xin, L., Song, W., Cao, Z., and Zhang, J. (2021). Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 12042–12049.

To verify the quality of the edge candidate set learned by NeuroLKH, we report two metrics for the candidate sets obtained by different methods: the average ranking of the optimal edges and the percentage of optimal edges missed from the set. For the sensitivity analysis of the Minimum Spanning Tree with subgradient optimization used in the LKH algorithm, 0.68% and 0.67% of the optimal edges are missed in the candidate set for TSP100 and TSP500, respectively, with average rankings of the optimal edges of 1.670 and 1.681. The ideal average ranking is 1.5, since the two optimal edges of each node would rank first and second. NeuroLKH reduces the average ranking to 1.557 and 1.597, with only 0.05% and 0.09% of the optimal edges missed, which justifies its effectiveness in learning desirable edge candidates.
For TSP, we choose the number of directed edges pointing from each node in the sparse edge set as so as to include most of the edges of the optimal tours in the sparse graph, which leaves only 0.01% of the optimal edges missing from the sparse graph on the training dataset. In our experiments with (trained with 20% of the training samples to save time), 0.643%, 0.209% and 0.208% of the optimal edges are missed in the candidate set, with average rankings of the optimal edges of 1.653, 1.646 and 1.640 for TSP500, respectively. With (i.e. numbers of edges), the average ranking improves only marginally with similar percentages of missed optimal edges, while the computational time increases noticeably. For the other routing problems we find similar results, so we use for consistency. We find that the network rarely assigns a high edge score to an edge with a considerably large Euclidean distance, so such edges are rarely included in the candidate set. A larger value is therefore unnecessary and does not affect the performance much as long as it is not too small (e.g., less than 20).
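A plausible sketch of the sparse-graph construction and the missed-optimal-edge statistic is given below; since the exact number of directed edges per node is elided above, `k=20` is an assumed illustrative default.

```python
import math

def knn_sparse_graph(coords, k=20):
    """Directed sparse graph: each node points to its k nearest neighbors.
    k=20 is an assumed value for illustration."""
    n = len(coords)
    edges = {}
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(coords[i], coords[j]))
        edges[i] = order[1:k + 1]  # skip the node itself
    return edges

def missed_fraction(edges, optimal_tour):
    """Fraction of optimal (undirected) tour edges absent from the sparse graph."""
    n = len(optimal_tour)
    missed = 0
    for idx in range(n):
        a, b = optimal_tour[idx], optimal_tour[(idx + 1) % n]
        # an undirected optimal edge is covered if either direction is present
        if b not in edges[a] and a not in edges[b]:
            missed += 1
    return missed / n
```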
The model outputs the node penalties within the range of with . In the original LKH algorithm, a subgradient optimization process optimizes the node penalties iteratively until convergence for each instance. In this process, for the training instances whose coordinates always lie between 0 and 1, we find that the penalties are usually between -10 and 10 (across different sizes). When testing on instances with other coordinate ranges, we scale each instance so that the coordinates lie between 0 and 1, keeping the aspect ratio fixed so that the objective value is simply scaled by a constant. Therefore, we use in our experiments.
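The instance scaling just described, i.e. a single common factor so the aspect ratio is preserved and every tour length scales by the same constant, can be sketched as follows (the function name is our own):

```python
def normalize_coords(coords):
    """Scale coordinates into the unit square with one factor, so the aspect
    ratio is preserved and tour lengths are scaled by that same constant."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    scale = max(max(xs) - min(xs), max(ys) - min(ys))
    scale = scale or 1.0  # degenerate case: all nodes coincide
    scaled = [((x - min(xs)) / scale, (y - min(ys)) / scale) for x, y in coords]
    return scaled, scale
```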
In Table S.1, we compare NeuroLKH with other recently proposed deep-learning-based methods on TSP100. Notably, most of them can hardly handle problems with more than 100 nodes. One exception is the method in [8], which is tested on large problems, but its performance deteriorates rapidly as the problem size increases and remains inferior to LKH. We adopt the results from the original works; the datasets tested on may differ but are sampled from the same distribution, so the optimality gap is a more meaningful measure than the objective value. The running time is reported for solving 1000 instances in total, assuming it scales linearly with the number of instances. Clearly, NeuroLKH significantly outperforms the other methods with a short running time. More importantly, as shown in Table 1 and Table 2, NeuroLKH generalizes well to large TSP instances with up to 5000 nodes.
| Method | Time(s) | Gap(‱) | Method | Time(s) | Gap(‱) | Method | Time(s) | Gap(‱) |
|---|---|---|---|---|---|---|---|---|
GCN greedy [18] | 36 | 838.000 | AM Greedy [21] | 0.6 | 453.000 | AM sampling [21] | 360 | 226.000 |
Wu [33] | 720 | 142.000 | GCN bs [18] | 240 | 139.000 | CVAE-Opt-RS [15] | 50500 | 135.000 |
da Costa [5] | 246 | 87.000 | CVAE-Opt-DE [15] | 55100 | 34.000 | POMO [22] | 6 | 14.000 |
Fu [8] | 90 | 4.000 | DPDP 10k [20] | 456 | 0.900 | DPDP 100k [20] | 990 | 0.400 |
NeuroLKH | 33 | 0.111 | NeuroLKH | 127 | 0.030 | NeuroLKH | 938 | 0.000 |
NeuroLKH is trained using only instances with nodes generated from the uniform distribution. With the same training dataset size, we train another model, NeuroLKH_M, using a mixture of instances with uniformly distributed nodes, clustered nodes with 3-8 clusters, and half-uniform-half-clustered nodes, following [30]. Following the convention for TSPLIB in [12, 36], the number of trials is set to the number of nodes and the algorithms are run 10 times for each instance. During each run, the algorithm stops when the optimal solution is found, and the number of trials actually conducted is reported. We show the results of LKH, VSR-LKH, NeuroLKH and NeuroLKH_M for each instance in Table S.2, Table S.3 and Table S.4, with the optimal tour distance shown under the instance name. We report the number of runs in which the optimal solution is found, the best performance (tour distance) over the runs, the average performance, the average running time (seconds) and the average number of trials actually conducted. The results of LKH are the same as reported in [36] (except the running time, since we run all algorithms on our machine for a fair comparison), while the results of VSR-LKH differ slightly due to behavior in the code that is not controlled by the random seed.
Here we briefly introduce the Capacitated Vehicle Routing Problem (CVRP), the Pickup and Delivery Problem (PDP) and CVRP with Time Windows (CVRPTW). For CVRP, multiple routes can be planned; in each route, the vehicle starts from the depot, visits some customers and returns to the depot. The total demand of the customers in each route cannot exceed the vehicle capacity, and each customer must be visited exactly once. CVRPTW generalizes CVRP with the additional constraint that each customer must be visited within the corresponding time window, where time is spent both traveling between nodes and serving customers. For PDP, the customers form pairs of pickup and delivery nodes; the vehicle starts from the depot, visits each customer node once and returns to the depot, with the constraint that each pickup node must be visited before its corresponding delivery node. The goal of all three problems is to minimize the total tour distance.
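As a worked illustration of the CVRP constraints just described, the following hypothetical checker (our own helper, not part of NeuroLKH) verifies a multi-route solution, with node 0 denoting the depot:

```python
def cvrp_feasible(routes, demand, capacity, n_customers):
    """Check the CVRP constraints: every customer 1..n_customers is visited
    exactly once, and no route's total demand exceeds the vehicle capacity."""
    visited = [node for route in routes for node in route]
    if sorted(visited) != list(range(1, n_customers + 1)):
        return False  # a customer is missing or visited more than once
    return all(sum(demand[n] for n in route) <= capacity for route in routes)
```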
Similarly, we plot the performance of the LKH and NeuroLKH algorithms for solving CVRP, PDP and CVRPTW in Figure S.1, which shows trends similar to those in Figure 2. The time limits are set to the longest ones used in Table 3, i.e., the running time of the LKH algorithm with 10000 trials.
For the results reported in Table 3, almost all improvements of NeuroLKH over LKH across sizes and time limits are statistically significant at confidence levels above 99%. The only exceptions are the smallest size of each problem with the longest time limit (the running time of LKH with 10000 trials), where the confidence levels are 98.7%, 98.9% and 77.9% for CVRP100, PDP40 and CVRPTW40, respectively. The confidence level for CVRPTW40 is relatively low because, with such a long time limit, LKH already solves CVRPTW with 40 nodes nearly to optimality, leaving little room for NeuroLKH to improve.
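Such confidence levels can be obtained with a paired significance test on the per-instance objectives. Below is a hedged sketch using a paired z-test; the normal approximation is reasonable for ~1000 paired instances, though the paper does not state which test it used.

```python
import math
import statistics

def paired_confidence(obj_a, obj_b):
    """One-sided confidence (in %) that method B beats method A on average,
    via a paired z-test (normal approximation, adequate for large samples)."""
    diffs = [a - b for a, b in zip(obj_a, obj_b)]  # positive when B is better
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error
    z = mean / se
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z) in %
```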
CVRPLIB [30] contains CVRP instances of various sizes combining 3 depot positionings, 3 customer positionings and 7 demand distributions. We train one network using CVRP instances ranging from 101 to 300 nodes. The instances are generated from the mixture of distributions proposed in [30], and we generate instances for each size in the training dataset, resulting in approximately 120000 instances in total.
The Solomon benchmark [28] contains CVRPTW instances with 100 customers and various distributions of time windows. An additional objective for this benchmark is to minimize the number of routes; therefore, the goal is to minimize the tour distance while using the minimum number of routes. We choose the R2 type as the testbed in our experiment. We generate a training dataset of instances with 100 customers. The node coordinates are generated independently from the uniform distribution over [0, 80]. The demands are generated from a Gaussian distribution with mean 15 and standard deviation 10, and the capacity is fixed as 1000. The serving time for each customer is fixed as 10. The center of the time window for node is generated from the uniform distribution over the interval , where is the distance between node and the depot. The width of the time window is generated from a Gaussian distribution with the mean and standard deviation set to 115 and 35, 240 and 0, 350 and 160, 150 and 380, 470 and 70, respectively. For each of the first two parameter sets, four variants are generated with 0%, 25%, 50% and 100% of the customers receiving time windows; for the last three parameter sets, all customers receive time windows, resulting in 11 instance types in total. We generate 5000 instances for each type in the training dataset. Please refer to the code for more details.

As the running times are all relatively short, we run both LKH and NeuroLKH 100 times on each instance. The results are shown in Table S.5, Table S.6 and Table S.7, with the time limits set to the running time of LKH with 100, 1000 and 10000 trials. The optimal tour distance is shown under the instance name. We report the average running time (seconds), the best performance (tour distance) over the runs, the average performance, and the number of runs in which the optimal solution is found.
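The time-window scheme above can be sketched as follows. Since the center interval is elided in the text, this sketch assumes centers drawn uniformly from [d_i, HORIZON − d_i] so every window is reachable from the depot; `HORIZON` itself is an assumed value, not the paper's.

```python
import math
import random

HORIZON = 1000.0  # assumed scheduling horizon; the paper's value is not given here

def gen_time_window(node_xy, depot_xy, width_mean, width_std):
    """Illustrative time-window generator: center uniform over an assumed
    reachability interval, width Gaussian with the given mean/std, then
    clipped to [0, HORIZON]."""
    d = math.dist(node_xy, depot_xy)
    center = random.uniform(d, HORIZON - d)          # assumed center interval
    width = max(0.0, random.gauss(width_mean, width_std))
    start = max(0.0, center - width / 2.0)
    end = min(HORIZON, center + width / 2.0)
    return start, end
```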