Matrix Encoding Networks for Neural Combinatorial Optimization

06/21/2021 ∙ by Yeong-Dae Kwon, et al. ∙ SAMSUNG 0

Machine Learning (ML) can help solve combinatorial optimization (CO) problems better. A popular approach is to use a neural net to compute on the parameters of a given CO problem and extract useful information that guides the search for good solutions. Many CO problems of practical importance can be specified in a matrix form of parameters quantifying the relationship between two groups of items. There is currently no neural net model, however, that takes in such matrix-style relationship data as an input. Consequently, these types of CO problems have been out of reach for ML engineers. In this paper, we introduce Matrix Encoding Network (MatNet) and show how conveniently it takes in and processes parameters of such complex CO problems. Using an end-to-end model based on MatNet, we solve asymmetric traveling salesman (ATSP) and flexible flow shop (FFSP) problems as the earliest neural approach. In particular, for a class of FFSP we have tested MatNet on, we demonstrate a far superior empirical performance to any methods (neural or not) known to date.



There are no comments yet.


page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many combinatorial optimization (CO) problems of industrial importance are NP-hard, leaving them intractable to solve optimally at large scales. Fortunately, researchers in operations research (OR) have developed ways to tackle these NP-hard problems in practice, mixed integer programming (MIP) and (meta-) heuristics being two of the most general and popular approaches. With the rapid progress in deep learning techniques over the last several years, a new approach based on machine learning (ML) has emerged. ML has been applied in both ways successfully 

[Bengio], as a helper for the traditional OR methods aforementioned or as an independent CO problem solver trained in an end-to-end fashion.

One way to leverage ML in solving CO problems is to employ a front-end neural net, which directly takes in the data specifying each problem. The neural net plays a critical role of analyzing all input data as a whole, from which it extracts useful “global” information. In end-to-end ML-based approaches, the global information is typically encoded onto the representations of the entities making up the problem. Greedy selection strategies based on these representations are no longer too near-sighted, allowing globally (near-) optimal solutions to be found. Existing OR methods can also benefit from incorporating the global information. Many hybrid approaches already exist that leverage information extracted by a neural net, both in ML-MIP forms [hybrid_1, hybrid_2] and ML-heuritic forms [hybrid_5, hybrid_4, hybrid_3].

Literature on neural combinatorial optimization contains many different types of the front-end models, conforming to the variety of data types upon which CO problems are defined. Yet, there has been no research for a model that encodes matrix-type data. This puts a serious limitation on the range of CO problems that an ML engineer can engage. Take, for example, the traveling salesman problem (TSP), the most intensely studied topic by the research community of neural combinatorial optimization. Once we lift the Euclidean distance restriction (the very first step towards making the problem more realistic), the problem instantly becomes a formidable challenge that has never been tackled before because it requires a neural net that can process the distance matrix111One could, in principle, consider a general GNN approach such as Khalil et al. [dai] that can encode any arbitrary graphs. However, as can be seen from the comparison of the TSP results of Khalil et al.( 8% error on random 100-city instances) with some of the state-of-the-art ML results (e.g., Kwon et al. [POMO], 0.2% error), a general GNN is most suitable for problems that deal with diverse graph structures, not those with fixed, dense graph structures (such as matrices)..

The list of other classical CO problems based on data matrices includes job-shop/flow-shop scheduling problems and linear/quadratic assignment problems, just to name a few. All of these examples are fundamental CO problems with critical industrial applications. To put it in general terms, imagine a CO problem made up of two different classes of items, and , where and are the sizes of and respectively. We are interested in the type where features of each item are defined by its relationships with those in the other class. The data matrix we want to encode, , would be given by the problem222The relationship can have multiple () features, in which case we are given number of different data matrices , , , . For simplicity we only use =1 in this paper, but our model covers >1 cases as well. (See Appendix.), where its th element represents a quantitative relationship of some sort between a pair of items (, ). The sets and are unordered lists, meaning that the orderings of the rows and the columns of

are arbitrary. Such permutation invariance built into our data matrices is what sets them apart from other types of matrices representing stacked vector-lists or a 2D image.

In this paper, we propose Matrix Encoding Network (MatNet) that computes good representations for all items in and within which the matrix

containing the relationship information is encoded. To demonstrate its performance, we have implemented MatNet as the front-end models for two end-to-end reinforcement learning (RL) algorithms that solve the asymmetric traveling salesman (ATSP) and the flexible flow shop (FFSP) problems. These classical CO problems have not been solved using deep neural networks. From our experiments, we have confirmed that MatNet achieves near-optimal solutions. Especially, for the specific FFSP instances we have investigated, our MatNet-based approach performs substantially better than the conventional methods used in operations research including mixed integer programming and meta-heuristics.

2 Related work

A neural net can accommodate variable-length inputs as is needed for encoding the parameters of a CO problem. Vinyals et al. [pointer_net_Vinyals]

, one of the earliest neural CO approaches, have introduced Ptr-Nets that use a recurrent neural network (RNN) 

[RNN] as the front-end model. Points on the plane are arranged into a sequence, and RNN processes the Cartesian coordinates of each point one by one. Bello et al. [pointer_net_Bello] continue the RNN approach, enhancing the Ptr-Net method through combination with RL. Nazari et al. [pointer_net_Nazari] further improve this approach by discarding the RNN encoder, but keeping the RNN decoder to handle the variable input size.

Graph neural network (GNN) [GNN0, GNN] is another important class of encoding networks used for neural CO, particularly on (but not limited to) graph problems. GNN learns the message passing policy between nodes that can be applied to an arbitrary graph with any number of nodes. Khalil et al. [dai] are among the firsts to show that a GNN-based framework can solve many different types of CO problems on graph in a uniform fashion. Li et al. [graph_1] extend the work by switching to a more advanced structure of graph convolutional networks (GCNs) [GCN]. Manchanda et al. [graph_2] demonstrate efficient use of GCN that can solve CO problems on graphs with billions of edges. Karalias and Loukas [graph_unsupervised]

use unsupervised learning on GNN to improve the quality of the CO solutions.

Encoding networks most relevant to our work are those based on the encoder of the Transformer [transformer]. Note that this encoder structure can be viewed as a graph attention network (GAT) [GATs] operating on fully connected graphs. Using this type of encoder, Kool et al. [Kool] encode Cartesian coordinates of all nodes given by the problem simultaneously and solve the TSP and many other related routing problems. Kwon et al. [POMO] use the same encoder as Kool et al.but produce solutions of significantly improved quality with the use of the RL training and the inference algorithms that are more suitable for combinatorial optimizations.

3 Model architecture

Figure 1: A complete bipartite graph with weighted edges.

MatNet can be considered a particular type of GNN that operates on a complete bipartite graph with weighted edges. The graph has two sets of nodes, and , and the edge between and has the weight , the -th element of , as illustrated in Figure 1.

MatNet is inspired by the encoder architecture of the Attention Model (AM) by Kool 

et al. [Kool], but because the AM deals with a quite different type of a graph (fully connected nodes and no edge weights), we have made significant modifications to it. The encoder of the AM follows the node-embedding framework of graph attention networks (GATs) [GATs], where a model is constructed by stacking multiple () graph attentional layers. Within each layer, a node ’s vector representation is updated to using aggregated representations of its neighbors as in


Here, is the set of all neighboring nodes of . The (learnable) update function is composed of multiple attention heads, and their aggregation process utilizes the attention score for each pair of nodes , which is a function of and .

MatNet also generally follows the GATs framework, but it differs in two major points: (1) It has two independent update functions that separately apply to nodes in and . (2) The attention score for a pair of nodes is not just a function of and , but the edge weight as well. The update function in each layer of MatNet can be described as


3.1 Dual graph attentional layer

Figure 2: (a) An overview of the MatNet architecture. (b) Mixed-score attention. “Multi-Head Mixed-Score Attention” block in (a) consists of many independent copies of the mixed-score attentions arranged in parallel and fully connected (FC) layers at the input and output interfaces (not drawn).

The use of two update functions and in Eq. (3.2) is somewhat unorthodox. The GATs framework requires that the update rule be applied uniformly to all nodes because it loses its universality otherwise. In our case, however, we focus on a particular type of a problem, where the items in and belong to two qualitatively different classes. Having two separate functions allows MatNet to develop a customized representation strategy for items in each class.

The two update functions are visualized as a dual structure of a graph attentional layer depicted in Figure 2(a). The MatNet architecture is a stack of graph attentional layers, and each layer is made of two sub-blocks that implement and . The two sub-blocks are structurally identical, where each one is similar to the encoder layer of the AM, which in turn is based on that of the Transformer model [transformer]. Note, however, that while both the AM and the Transformer model use the simple self-attention for encoding, MatNet requires a clear separation of queries (nodes in one set) and the key-value pairs (nodes in the other set) at the input stage to perform cross-attentions. For detailed descriptions of computing blocks used in Figure 2(a), we refer the readers to the Transformer architecture [transformer].

3.2 Mixed-score attention

We now focus on how MatNet makes use of the edge weight in its attention mechanism. “Multi-Head Mixed-Score Attention” block in Figure 2(a) is the same as “Multi-Head Attention” block of the Transformer except that the scaled dot-product attention in each attention head is replaced by the mixed-score attention shown in Figure 2(b). Originally, the scaled dot-product attention outputs the weighted sum of the values, where the weights are calculated from the attention scores (the scaled dot-products) of query-key pairs. For MatNet, in addition to these internally generated scores, it must use the externally given relationship scores, , which is also defined for all query-key pairs. The mixed-score attention mechanism mixes the (internal) attention score and the (external) relationship score before passing it to the next “SoftMax” stage.

Rather than using a handcrafted prioritizing rule to combine the two types of the scores, we let the neural net figure out the best mixing formula. “Trainable element-wise function” block in Figure 2

(b) is implemented by a multilayer perceptron (MLP) with two input nodes and a single output node. Each head in the multi-head attention architecture is given its own MLP parameters, so that some of the heads can learn to focus on a specific type of the scores while others learn to balance, depending on the type of graph features that each of them handles. Note that it is an element-wise function, so that the output of the mixed-score attention is indifferent to the ordering of

and , ensuring the permutation-invariance of the operation.

The attention mechanism of the Transformer is known for its efficiency as it can be computed by highly optimized matrix multiplication codes. The newly inserted MLP can also be implemented using matrix multiplications without the codes that explicitly loop over each rows and column of the matrix or each attentional head. Actual implementation of MatNet keeps the efficient matrix forms of an attention mechanism. Also, note that the duality of the graph attentional layers of MatNet explained in Section 3.1 is conveniently supported by , a transpose of the matrix .

3.3 The initial node representations

The final pieces of MatNet that we have not yet explained are the initial node representations that are fed to the model to set it in motion. One might be tempted to use the edge weights directly for these preliminary representations of nodes, but the vector representations are ordered lists while edges of a graph offer no particular order.

MatNet uses zero-vectors to embed nodes in , and one-hot vectors for nodes in (or vice versa). The use of the same zero-embeddings for all nodes of is allowed, as long as all nodes in are embedded differently, because they acquire unique representations after the first graph attentional layer. The zero-vector embedding enables MatNet to support variable-sized inputs. Regardless of how the number of the nodes in changes from a problem instance to another, MatNet can process them all in a uniform manner. See Appendix for a further discussion on scalability of MatNet, a related topic.

One-hot vector embedding scheme for nodes in , on the other hand, provides a somewhat limited flexibility towards varying , the number of columns in the input matrix. Before a MatNet model is trained, its user should prepare a pool of different one-hot vectors. When a matrix with () columns is given, different one-hot vectors are randomly drawn from the pool in sequence and used as the initial node representations for -nodes.

Instance augmentation via different sets of embeddings.

The one-hot embedding scheme naturally accommodates “instance augmentation” strategy proposed by Kwon et al. [POMO]. For each random sequence of one-hot vectors chosen to embed -nodes, a MatNet model encodes the matrix in a different way. (That is, the number of node representation sets for a given problem instance can be augmented from one to many.) One can take advantage of this property and easily create many dissimilar solutions simply by running the model repeatedly, each time with a new embedding sequence. Among those multiple solutions, the best one is chosen. This way, a higher quality solution can be acquired at the cost of increased runtime. The diversity of solutions created by the instance augmentation technique are far more effective than those produced by the conventional multi-sampling technique. (See Appendix.)

4 Asymmetric traveling salesman problem

Figure 3: The decoder. Scalars and

are the mask and the selection probability for


Problem definition.

In the traveling salesman problem (TSP), the goal is to find the permutation of the given cities so that the total distance traveled for a round trip is minimized. For each pair of “from” city and “to” city , a distance is given, which constitutes an -by- distance matrix. To verify that MatNet can handle general cases, we focus on the asymmetric traveling salesman problem (ATSP). This problem does not have the restriction so that its distance matrix is asymmetric. We use “tmat”-class ATSP instances that have the triangle inequality and are commonly studied by the OR community [atsp]. (See Appendix for detailed explanation of the tmat class.) We solve the ATSP of three different sizes, having 20, 50, and 100 number of cities.

MatNet configuration.

MatNet is used to encode the distance matrix. Regardless of the size of the ATSP, we have used the same MatNet structure. The MatNet model is constructed by stacking encoding layers. The embedding dimension, , for input/output of each layer is set to 256. Both sub-modules ( and ) use attention heads, where each head processes query, key, and value as 16-dimensional vectors (i.e., ). The score-mixing MLP in each head has one hidden layer with 16 nodes. For implementation of the “Scale” block in Figure 2(b), we apply scaling of and then soft-clipping to using a function, following the convention of Bello et al. [pointer_net_Bello]. “Feed-forward” blocks in Figure 2(a) is implemented with one hidden layer of dimension . We use instance normalization for “Add & Norm.” For the initial node representations, we use zero vectors for “from” cities and one-hot vectors to embed “to” cities. The size of the pool from which we draw the one-hot embedding vectors is adjusted to the minimum, the number of the cities of the problem (i.e., ).


Once MatNet processes the distance matrix and produces vector representations of “from” and “to” cities, a solution of ATSP (a permutation sequence of all cities) can be constructed autoregressively, one city at a time. The decoder model of Kool et al. [Kool] (shown in Figure 3) is used repeatedly at each step. Two “from” city representations, one for the current city and the other for the first city of the tour333When deciding where to go next, the decoder needs to have the information of the first city, because it is also the final destination of the tour., are concatenated to make the QUERY token. Vector representations of all “to” cities go into the decoder, and for each city the decoder calculates the probability to select it as the next destination. A mask (0 for the unvisited and for the visited cities) is used inside the multi-head attention block to guide the decision process, as well as right before the final softmax operation to enforce one-visit-per-city rule of the problem.


We use the POMO training algorithm [POMO] to perform reinforcement learning on the MatNet model and the decoder. That is, for an ATSP with size , different solutions (tours) are generated, each of them starting from a different city. The averaged tour length of these solutions is used as a baseline for REINFORCE algorithm [REINFORCE]. We use Adam optimizer [adam] with a learning rate of without a decay and a batch size of 200444If needed, a smaller batch size can be used to reduce GPU memory usage. A batch size of 50 achieves the similar results when trained with a learning rate of , although this takes a bit longer to converge.

. With an epoch being defined as training 10,000 randomly generated problem instances, we train 2,000, 8,000, and 12,000 epochs for

20, 50, and 100, respectively. For 20 and 50, they take roughly 6 and 55 hours, respectively, on a single GPU (Nvidia V100). For , we accumulate gradients from 4 GPUs (each handling a batch of size 50) and the training takes about 110 hours. Multi-GPU processes are used for speed, as well as to overcome our GPU memory constraint (32GB each).


We use the POMO inference algorithm [POMO]. That is, we allow the decoder to produce solutions in parallel, each starting from a different city, and choose the best solution. We use sampled trajectories for a POMO inference rather than greedy ones (argmax on action probabilities) [Kool] because they perform slightly better in our models.

4.1 Comparison with other baselines

Method ATSP20 ATSP50 ATSP100
Len. Gap Time Len. Gap Time Len. Gap Time
CPLEX 1.54 - (12m) 1.56 - (1h) 1.57 - (5h)
Nearest Neighbor 2.01 30.39% (-) 2.10 34.61% (-) 2.14 36.10% (-)
Nearest Insertion 1.80 16.56% (-) 1.95 25.16% (-) 2.05 30.79% (-)
Furthest Insertion 1.71 11.23% (-) 1.84 18.22% (-) 1.94 23.37% (-)
LKH3 1.54 0.00% (1s) 1.56 0.00% (11s) 1.57 0.00% (1m)
MatNet 1.55 0.53% (2s) 1.58 1.34% (8s) 1.62 3.24% (34s)
MatNet (128) 1.54 0.01% (4m) 1.56 0.11% (17m) 1.59 0.93% (1h)
Table 4.1: Experiment results on 10,000 instances of ATSP

In Table 4.1, we compare the performance of our trained models with those of other representative baseline algorithms on 10,000 test instances of the tmat-class ATSP. The table shows the average length of the tours generated by each method (displayed in units of ). The gap percentage is with respect to the CPLEX results. Times are accumulated for the computational processes only, excluding the program and the data (matrices) loading times. This makes the comparisons between different algorithms more transparent. Note that the heuristic baseline algorithms require only a single thread to run. Because it is easy to make them run in parallel on modern multi-core processors, we record their runtimes divided by 8. (They should be taken as references only and not too seriously.) For consistency, inference times of all MatNet-based models are measured on a single GPU, even though some models are trained by multiple GPUs.

Details of all baseline algorithms for both the ATSP and the FFSP experiments, including how we have implemented them, are given in Appendix.

Mixed-integer programming (MIP).

Many CO problems in the industry are solved by MIP because it can model a wide range of CO problems, and many powerful MIP solvers are readily available. If one can write down a mathematical model that accurately describes the CO problem at hand, the software can find a good solution using highly engineered branch-and-bound type algorithms. This is done in an automatic manner, freeing the user from the burden of programming. Another merit of the MIP approach is that it provides the optimality guarantee. MIP solvers keep track of the gap between the best solution found so far and the best lower (or upper) bound for the optimal.

To solve test instances using MIP, we use one of the well-known polynomial MIP formulations of the ATSP [miller_ATSP] and let CPLEX [cplex] solve this model with the distance matrices that we provide. CPLEX is one of the popular commercial optimization software used by the OR community. With 10-second timeouts, CPLEX exactly solves all instances. For and 100 test instances, it solves about 99% and 96% of them to the optimal values, respectively. Solutions that are not proven optimal have the the average optimality gap of 1.2% for and 0.5% for .


Nearest Neighbor (NN), Nearest Insertion (NI), and Furthest Insertion (FI) are simple greedy-selection algorithms commonly cited as baselines for TSP algorithms. Our implementations of NN, NI, and FI for ATSP in C++  solve 10,000 test instances very quickly, taking at most a few seconds (when 100). We thus omit their runtimes in Table 4.1.

The other heuristic baseline, LKH3 [LKH3], is a state-of-the-art algorithm carefully engineered for various classical routing problems. It relies on a local search algorithm using -opt operations to iteratively improve its solution. Even though it offers no optimality guarantee, our experiment shows that it finds the optimal solutions for most of our test instances in a very short time.


In Table 4.1, the performance of the MatNet-based ATSP solver is evaluated by two inference methods: 1) a single POMO rollout, and 2) choosing the best out of 128 solutions generated by the instance augmentation technique (i.e., random one-hot vector assignments for initial node embeddings) for each problem instance. The single-rollout inference method produces a solution in selection steps just like the other greedy-selection algorithms (NN, NI, and FI), but the quality of the MatNet solution is far superior. With instance augmentation (128), MatNet’s optimality gap goes down to 0.01% for 20-city ATSP, or to less than 1% for 100-city ATSP.

5 Flexible flow shop problem

Figure 4: (LEFT) Processing time matrices for a 3-stage 4-4-4-machine 20-job flexible flow shop problem. (RIGHT) A Gantt chart showing a valid schedule for the given problem. Numbers on colored strips are assigned job indices. This schedule has been produced by our MatNet-based neural net model, and it has a makespan of 25.

Problem definition.

Flexible flow shop problem (FFSP), or sometimes called hybrid flow shop problem (HFSP), is a classical optimization problem that captures the complexity of the production scheduling processes in real manufacturing applications. It is defined with a set of jobs that has to be processed in stages all in the same order. Each stage consists of machines555In general, the number of machines in each stage can vary from stage to stage. When all stages have just one machine, the problem simplifies to flow shop problem., and a job can be handled by any of the machines in the same stage. A machine cannot process more than one job at the same time. The goal is to schedule the jobs so that all jobs are finished in the shortest time possible (i.e., minimum makespan).

For our experiment, we fix on a configuration of stages and machines of a moderate complexity. We assume that there are stages, and each stage has machines. At the th stage, the processing time of the job on the machine is given by . Therefore, an instance of the problem is defined by three processing time matrices ( and ), all of which have the size -by-. is filled with a random integer between 2 and 9 for all , , and . We solve three types of FFSP with different number of jobs, namely, 20, 50, and 100. An example instance of FFSP with job count is shown in Figure 4 along with a Gantt chart showing an example of a valid schedule.

Encoding & decoding.

To encode three processing time matrices, , we use 3 copies of a MatNet model, one for each matrix, and acquire vector representations for machines and for jobs in stage

. This stage-wise encoding scheme is the key to develop a proper scheduling strategy, which should follow different job-selection rules depending on whether you are in an early stage or a late one. All three MatNets are configured identically, using the same hyperparameters as the ATSP encoder explained in the previous section

666except for the number of encoding layers , descreased from 5. This change is optional, but we have found that is unnecessarily too large for the task.. Machines are initially embedded with one-hot vectors (), and jobs are embedded with zero vectors. We also use three decoding neural nets, , all of them having the same architecture as the ATSP decoder (Figure 3).

Generating an FFSP solution means completing a Gantt chart like the one in Figure 4. We start at the upper left corner of the chart (time ) and move from stage 1 to stage 3. At stage , we loop over its 4 machines , each time using as the QUERY token for . The input embeddings for are (), plus one extra (21st) job embedding made of learnable parameters for “skip” option. outputs selection probabilities for jobs . When the “skip” option is selected, no job is assigned to machine even if there are available jobs. This is sometimes necessary to create better schedules. Obviously, unavailable jobs (already processed or not ready for stage ) are masked in the decoder. When no job can be selected (no job at stage , or the machine is busy), we simply do not assign any job to the machine and move on. When we finish assigning jobs for all machines, we increment time and repeat until all jobs are finished.


The decoding process is a series of job assignments to machines in each stage, but note that the order of the machines we assign jobs to can be arbitrary. We use number of permutations of (#1, #2, #3, #4) for the order of machines in which we execute job assignments at each stage. From the same set of vector representations of machines () and jobs (), this permutation trick can create 24 heterogeneous trajectories (different schedules), which we use for the POMO reinforcement learning to train the networks. We use a batch size 50 and Adam optimizer with a learning rate of . One epoch being processing 1,000 instances, we train for (100, 150, 200) epochs for FFSP with job size (20, 50, 100), which takes (4, 8, 14) hours. For FFSP with , we accumulate gradients from 4 GPUs.


We use sampled solutions and the POMO inference algorithm with 24 machine-order permutations similarly to the training. Interestingly, the sampled solutions are significantly better than the greedily-chosen ones, which could be a consequence of the stochastic nature of the problem in our approach. Each decoder makes decisions without any knowledge of other () stages.

5.1 Comparison with other baselines

In Table 5.2, we record average makespan (MS) of the schedules produced by various optimization methods for the same 1,000 test instances. The gaps are given in absolute terms with respect to MatNet with augmentation result. We display the runtime of each method, following the same rules that we use for Table 4.1 (e.g., computation time only, scaling by for single-thread processes, and single-GPU inference for the MatNet-based method).

Mixed-integer programming (MIP).

Modeling flow shop problems using MIP is possible, but it is much more complex than for TSPs. We use an FFSP model found in literature [ffsp_model_ref] with modifications that (empirically) improves the search speed of CPLEX. Even with the modifications, however, CPLEX cannot produce optimal solutions for any of the test instances within a reasonable time. For FFSPs with job size , we show two results, recorded with timeouts of 60 and 600 seconds per instance. For and 100, no valid solutions are found within 600 seconds.


Random and Shortest-Job-First (SJF) methods are greedy heuristics. They create valid schedules in an one-shot manner using the Gantt-chart-completion strategy, similarly to our MatNet based model. SJF assigns the shortest jobs (at most 4) in ascending order that are available for each stage at each time step .

Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) are two metaheuristics widely used by the OR community to tackle the FFSP. Metaheuristics are systematic procedures to create heuristics, most useful for optimization problems that are too complex to use MIP approaches or to engineer problem-specific handcrafted algorithms. Our implementation of GA and PSO are based on the works found in the OR literature [GA_for_FFSP, PSO_for_FFSP].


Solutions produced by the learned heuristic of our MatNet based model significantly outperform those of the conventional OR approaches, both in terms of the qualities and the runtimes. Some commercial applications of the FFSP require solving just one instance as best as one can within a relatively generous time budget. Hence, we have also tested the performance of baseline algorithms under ample time to improve their solutions on single instances. The results are provided in Appendix. Even in this case, however, the conventional OR algorithms do not produce better solutions than those of our MatNet method.

Method FFSP20 FFSP50 FFSP100
MS Gap Time MS Gap Time MS Gap Time
CPLEX (60s) 46.4 21.0 (17h)
CPLEX (600s) 36.6 11.2 (167h)
Random 47.8 22.4 (1m) 93.2 43.6 (2m) 167.2 77.5 (3m)
Shortest Job First 31.3 5.9 (40s) 57.0 7.4 (1m) 99.3 9.6 (2m)
Genetic Algorithm 30.6 5.2 (7h) 56.4 6.8 (16h) 98.7 9.0 (29h)
Particle Swarm Opt. 29.1 3.7 (13h) 55.1 5.5 (26h) 97.3 7.6 (48h)
MatNet 27.3 1.9 (8s) 51.5 1.9 (14s) 91.5 1.8 (27s)
MatNet (128) 25.4 - (3m) 49.6 - (8m) 89.7 - (23m)
Table 5.2: Experiment results on 1,000 instances of FFSP

6 Conclusion & Discussion

In this paper, we have introduced MatNet, a neural net capable of encoding matrix-style relationship data found in many CO problems. Using MatNet as a front-end model, we have solved two classical optimization problems of different nature, the ATSP and the FFSP, for the first time using the deep learning approach.

A neural heuristic solver that clearly outperforms conventional OR methods has been rare, especially for classical optimization problems. Perhaps, this is because the range of CO problems ML researchers try to tackle has been too narrow, only around simple problems (such as the ATSP) for which good heuristics and powerful MIP models already exist. The FFSP is not one of them, and we have shown that our MatNet based FFSP solver significantly outperforms other algorithms. Our results are promising, hinting the prospect of real-world deployments of neural CO solvers in the near future. More research is needed, however, to fully reflect combinations of many different types of constraints posed by real-world problems.

For our experiments, we have chosen to use an end-to-end RL approach for its simplicity. It is purely data-driven and does not require any engineering efforts by a domain expert. We would like to emphasize, however, that the use of MatNet is not restricted to end-to-end methods only. Hybrid models (ML + OR algorithms) based on the MatNet encoder should be possible, which will have better performance and broader applicability.

We have implemented our MatNet model in Pytorch. The training and the testing codes for the experiments described in the paper are publicly available.

777An URL address will be added here later.


Appendix A MatNet variants

a.1 Multiple data matrices

A combinatorial optimization problem can be presented with multiple () relationship features between two groups of items. In FFSP, for example, a production cost could be different for each process that one has to take into account for scheduling in addition to the processing time for each pair of the job and the machine. (The optimization goal in this case would be to minimize the weighted sum of the makespan and the total production cost.) When there are number of matrices that need to be encoded (, , , ), MatNet can be easily expended to accommodate such problems by using the mixed-score attention shown in Figure A.1 instead of the one in Figure 2(b). “Trainable element-wise function” block in Figure A.1 is now an MLP with input nodes and 1 output node.

Figure A.1: Mixed-score attention with number of data matrices (, , , ).

a.2 Alternative encoding sequences

Equation (2) in the main text describes the application of and in the graph attentional layer of MatNet that happens in parallel. One can change it to be sequential, meaning that is applied first and then the application of follows using the updated vector representations . This is illustrated in Figure A.2. The opposite ordering is also valid, in which is applied first and then follows (not drawn).

Figure A.2: An alternative MatNet structure with the sequential update scheme. The bold line indicates the change from the original.

It is not yet clear why, but we have empirically found that the alternative MatNet structure above leads to better-performing models than the original structure in some experiments (that are not the ATSP and the FFSP experiments described in the paper).

Appendix B ATSP definition and baselines

b.1 Tmat class

While one can generate an ATSP instance by simply filling a distance matrix with random numbers alone (an “amat” class problem), such problem lacks the correlation between distances and is uninteresting. In our ATSP experiments, we use “tmat” class ATSP instances [atsp] that have the triangle inequality. First, we populate the distance matrix with independent random integers between 1 and 10, except for the diagonal elements that are set to 0. If we find for any , we replace it with . This procedure is repeated until no more changes can be made.

Generation speed of tmat class instances can be greatly enhanced using parallel processing on GPUs. Below is the Python code for the instance generation using Pytorch library.

def generate(batch_size, problem_size, min_val=1, max_val=1000000):
    problems = torch.randint(low=min_val, high=max_val+1, size=(batch_size, problem_size, problem_size))
    problems[:, torch.arange(problem_size), torch.arange(problem_size)] = 0
    while True:
        old_problems = problems.clone()
        problems, _ = (problems[:, :, None, :] + problems[:, None, :, :].transpose(2,3)).min(dim=3)
        if (problems == old_problems).all():
    return problems

b.2 MIP model

Our MIP model for ATSP is based on Miller et al. [miller_ATSP] developed in 1960.


(@)[3pt]llb & City index


(@)[3pt]llbb & Number of Cities
& Distance from city to city

Decision variables

(@)[3pt]llbb &  
& arbitrary numbers representing the order of city in the tour



Subject to:


Constraints (B.4) and (B.5) enforce the one-visit-per-city rule. Constraint set (B.6) could look unintuitive, but it is used to prevent subtours so that all cities are contained in a single tour of length .

b.3 Heuristics

Greedy-selection heuristics

Implementation of Nearest Neighbor (NN) is self-explanatory. On the other hand, exact procedures for the insertion-type heuristics (NI and FI) can vary when applied to the ATSP. Our approach is the following: for each city that is not included in the partially-completed round-trip tour of cities, we first determine the insertion point (one of choices) that would make the insertion of the city into the tour increase the tour length the smallest. With the insertion point designated for each city, every city now has the increment value for the tour length associated with it. Nearest Insertion (NI) selects the city with the smallest increment value, and Furthest Insertion (FI) selects the one with the largest increment value.


The version of LKH3 that we use is 3.0.6. The source code is downloaded from Parameter MAX_TRIALS is set to , where is the number of cities. Parameter RUNS is set to 1. All other parameters are set to the default values.

Appendix C FFSP definition and baselines

c.1 Unrelated parallel machines

We choose to generate processing time matrices with random numbers. This makes machines in the same stage totally unrelated to each other. This is somewhat unrealistic because an easy (short) job for one machine tends to be easy for the others, and the sames goes for a difficult (long) job, too. The type of FFSP that is studied most in the OR literature is a simpler version, in which all machines are identical (“uniform parallel machines” [ffsp_survey]). Solving “unrelated parallel machines” FFSP is more difficult, but the algorithms designed for this type of FFSP can be adapted to real-world problems more naturally than those assuming identical machines.

c.2 MIP model

Our MIP model for FFSP is (loosely) based on Asadi-Gangraj [ffsp_model_ref] but improved to perform better when directly entered into the CPLEX platform.


(@)[3pt]llbbbbb & Stage index
& Job index
& Machine index in each stage
& Number of jobs
& Number of stages


(@)[3pt]llbbb& Number of machines in stage
& A very large number
& Processing time of job in stage on machine

Decision variables

(@)[3pt]llbbb & Completion time of job in stage



Subject to:


Constraint set (C.8) ensures that each job must be assigned to one machine at each stage. Constraint sets (C.9)–(C.13) define precedence relationship () between jobs within a stage. Constraint set (C.9) indicates that every job has no precedence relationship () with itself. Constraint set (C.10) indicates that the sum of all precedence relationships (the sum of all ) in a stage is the same as minus the number of machines that no job has been assigned. Constraint set (C.11) expresses that only the jobs assigned to the same machine can have precedence relationships () among themselves. Constraint sets (C.12) and (C.13) mean that a job can have at most one preceding job and one following job. Constraint set (C.14) indicates that completion time of job in the first stage is greater than or equal to its processing time in this stage. The relation between completion times in two consecutive stages for job can be seen in Constraint set (C.15). Constraint set (C.16) guarantees that no more than one job can run on the same machine at the same time.

c.3 Metaheuristics

Genetic algorithm

Genetic algorithm (GA) iteratively updates multiple candidate solutions called chromosomes. Child chromosomes are generated from two parents using crossover methods, and mutations are applied on chromosomes for better exploration.

Our implementation of GA is based on chromosomes made of number of real numbers, where is the number of stages, and is the number of jobs. Each real number within a chromosome corresponds to the scheduling of one job at one stage. The integer part of the number determines the index of the assigned machine, and the fractional part determines the priority among the jobs when there are multiple jobs simultaneously available. The integer and the fractional parts are independently inherited during crossover. For mutation, we randomly select one from the following four methods: exchange, inverse, insert, and change. The number of chromosomes we use is 25, and the crossover ratio and the mutation rate are both 0.3. One of the initial chromosomes is set to the solution of the SJF heuristic and the best-performing chromosome is conserved throughout each iteration. We run 1,000 iterations per instance.

Particle swarm optimization

Particle swarm optimization (PSO) is a metaheuristic algorithm that iteratively updates multiple candidate solutions called particles. Particles are updated by the weighted sum of the inertial value, the local best and the global best at each iteration.

Our PSO solution representation (a particle) has the same form as a chromosome of our GA implementation. We use 25 number of particles for PSO. The inertial weight is 0.7, and the cognitive and social constants are set to 1.5. One of the initial particles is made to represent the solution of the SJF heuristic. We run 1,000 iterations per instance.

Appendix D Training curves

Figure D.1: The training curve of the MatNet-based ATSP solver with 100-city instances.
Figure D.2: The training curve of the MatNet-based FFSP solver with 100-job instances.

Appendix E Instance augmentation vs. sampling

The instance augmentation technique effectively creates different problem instances from which the neural net generates solutions of great diversity [POMO]. In Table E.3, we compare the performance of the instance augmentation method with that of the sampling method on the ATSP of different sizes. (The first two data rows of Table E.3 are identical to the last two rows of Table 1 in the main text.) The instance augmentation method takes a bit more time than the sampling method when the two use the same number (128) of rollouts because the former requires a new encoding procedure for each rollout while the latter needs to run the encoder just once and then reuses its output repeatedly. The table shows that the sampling method clearly suffers from the limited range of solutions it can create. Even when the number of rollouts used is larger by a factor of 10, the sampling method cannot outperform the instance augmentation technique in terms of the solution quality.

Method ATSP20 ATSP50 ATSP100
Len. Gap Time Len. Gap Time Len. Gap Time
MatNet (single POMO) 1.55 0.53% (2s) 1.58 1.34% (8s) 1.62 3.24% (34s)
MatNet (128 inst. aug.) 1.54 0.01% (4m) 1.56 0.11% (17m) 1.59 0.93% (1h)
MatNet (128 sampling) 1.54 0.22% (1m) 1.57 0.52% (7m) 1.60 1.89% (37m)
MatNet (1280 sampling) 1.54 0.15% (12m) 1.57 0.37% (1h) 1.60 1.58% (6h)
Table E.3: Experiment results on 10,000 instances of ATSP using different inference methods.

Appendix F Generalization performance on FFSP

A trained MatNet model can encode a matrix of an arbitrary number of rows (or columns). Such characteristics of MatNet is most useful for problems like FFSP, in which one of the two groups of items that the problem deals with is frequently updated. If we imagine a factory that is in need for an optimization tool for the FFSP-type scheduling problems, it is likely that its schedules are updated every day (or even every hour) or so, each time with a different (various sizes) set of jobs. The set of machines in the schedule, on the other hand, is unlikely to change on a daily basis. The processing time matrices in this case have rows of an unspecified size but a fixed number of columns. A MatNet-based FFSP solver can naturally handle such data.

In Table F.1, we show the performance of the three MatNet-based FFSP solvers. They are the same models that are used for the FFSP experiments described in the main text, e.g., in Table 2. Each model is trained with the FFSP instances of a different number of jobs (). We test them with 1,000 instances of the FFSP of different job sizes (), and the average makespans are shown in the table. We have used 128 instance augmentation for inference. Notice that the MatNet-based models can handle cases reasonably well, even though they have not encountered such large instances during the training. SFJ results are displayed as a baseline.

Method FFSP20 FFSP50 FFSP100 FFSP1000
Shortest Job First 31.3 57.0 99.3 847.0
MatNet () 25.4 50.3 91.2 814.4
MatNet () 25.2 49.6 89.9 803.9
MatNet () 25.3 49.6 89.7 803.2
Table F.1: Generalization test results on 1,000 instances of FFSP.

Appendix G One-instance FFSP

A scheduling task in a factory does not require solving many different problem instances at once. Moreover, the runtime is usually allowed quite long for the scheduling. We test each baseline algorithm and our MatNet-based model on a single FFSP instance and see how much each algorithm can improve its solution quality within a reasonably long time. For each FFSP size (), one problem instance is selected manually from 1,000 saved test instances based on the results from the “fast scheduling” experiments (in Table 2 of the main text). We have chosen instances whose makespans roughly match the average makespans of all test instances. The processing time matrices for the selected and 100 instances are displayed in Figure 4, Figure G.1, and Figure G.2, respectively.

Table G.1 shows the result of the one-instance experiments.88footnotemark: 8For both the MIP and the metaheuristic approaches, we find that only a relatively small improvement is possible even when they are allowed to keep searching for many more hours. MatNet approach produces much better solutions in the order of seconds.999The runtimes for GA and PSO are divided by 8, following the convention used in the main text, only for 128 run cases. Single-instance single-run programs cannot easily utilize multiprocessing.

99footnotetext: Here, for MatNet approaches, we simply use the instance augmentation technique only. There are, however, better inference techniques for neural net based approaches that are more suitable for solving “one-instance CO problems.” See, for example, Hottung et al. [hottung2021efficient].
Method FFSP20 FFSP50 FFSP100
(Fig.4) (Fig.G.1) (Fig.G.2)
MS Time MS Time MS Time
CPLEX (10 min timeout) 37 (10m)
CPLEX (10 hour timeout) 34 (10h)
Shortest Job First 31 (-) 58 (-) 100 (-)
GA (10 iters, 1 run) 30 (4m) 57 (8m) 99 (14m)
GA (10 iters, 1 run) 30 (8h) 57 (13h) 99 (26h)
GA (10 iters, 128 run) 29 (1h) 57 (2h) 99 (4h)
PSO (10 iters, 1 run) 29 (7m) 56 (14m) 98 (24m)
PSO (10 iters, 1 run) 27 (12h) 56 (21h) 98 (40h)
PSO (10 iters, 128 run) 27 (2h) 55 (4h) 98 (7h)
MatNet (single POMO) 27 (2s) 51 (4s) 93 (6s)
MatNet (128 inst. aug.) 25 (5s) 50 (10s) 91 (11s)
MatNet (1280 inst. aug.) 25 (8s) 49 (16s) 90 (23s)
Table G.1: One-instance FFSP experiment results.
Figure G.1: FFSP50 processing time matrices used for the one-instance experiment.
Figure G.2: FFSP100 processing time matrices used for the one-instance experiment.