1. Introduction
Quadratic unconstrained binary optimization (QUBO) is a standard model for optimization problems (not only) in the quantum world, as it can be used as input for algorithms like the quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014) or quantum annealing (QA) (Kadowaki and Nishimori, 1998). A QUBO instance of size $n$ is given as an $n \times n$ matrix $Q$ with $Q_{ij} \in \mathbb{R}$ for all $i, j$. A solution to a QUBO instance $Q$ is a vector $x^* \in \{0,1\}^n$ so that
$$x^* = \arg\min_{x \in \{0,1\}^n} x^T Q x.$$
Note that QUBO instances can trivially be derived from instances of Ising spin glasses (McGeoch, 2014). Translations to QUBO and/or Ising models exist for a multitude of common optimization problems (Lucas, 2014; Glover et al., 2018), including many important NP-hard problems like 3-SAT (Choi, 2010) or scheduling problems (Stollenwerk and Basermann, 2016).
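The QUBO definition can be made concrete with a small sketch: a brute-force solver that evaluates $x^T Q x$ for every binary vector. This is our own illustration (the function names are hypothetical, and it assumes numpy); it is only feasible for tiny instances, since the search space grows exponentially, but it serves as a reference implementation of the objective.

```python
import itertools
import numpy as np

def qubo_energy(Q, x):
    """Energy x^T Q x of a binary vector x for the QUBO matrix Q."""
    x = np.asarray(x)
    return float(x @ Q @ x)

def solve_qubo_brute_force(Q):
    """Return a binary vector minimizing x^T Q x (exponential in n)."""
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        e = qubo_energy(Q, bits)
        if e < best_e:
            best_x, best_e = np.array(bits), e
    return best_x, best_e
```

Exact solvers like qbsolv (used later in this paper for labeling) replace this enumeration with heuristic search, but optimize the same objective.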
In this paper, we focus on the well-known Traveling Salesman Problem (TSP): A TSP instance for $n$ cities is given as an $n \times n$ matrix $D$ with $D_{ij} \geq 0$ for all $i, j$, where $D_{ij}$ denotes the distance from city $i$ to city $j$. A solution to a TSP instance is a vector $\pi^*$ that is a permutation of $(1, \dots, n)$ and fulfills
$$\pi^* = \arg\min_{\pi} \sum_{i=1}^{n} D_{\pi(i)\,\pi(i \bmod n + 1)}.$$
Despite apparent parallels in the formulation of QUBO and TSP instances, the best translation from a TSP instance for $n$ cities produces a QUBO instance of size $n^2$, resulting roughly in a QUBO matrix with $n^4$ matrix cells in total (Feld et al., 2018). This boost in size makes the QUBO translation rather inefficient for many practical applications and sometimes prohibits the resulting QUBO instances from being solved on quantum hardware at all, since current machines running QAOA or QA are severely limited in the number of available qubits. However, since the computed QUBO instances originate from the smaller TSP instances, they clearly contain some redundant information.
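The quadratic blow-up can be illustrated with a sketch of the standard one-hot translation, in which binary variable $x_{v,t}$ encodes "city $v$ is visited at tour position $t$", yielding $n^2$ variables. The code below is our own hedged reconstruction of this well-known scheme (cf. Feld et al., 2018; Lucas, 2014); the helper names and the choice of penalty weight are assumptions, not the exact formulation used in the paper.

```python
import numpy as np

def tsp_to_qubo(D, penalty=None):
    """Translate an n-city distance matrix D into an n^2 x n^2 QUBO matrix.
    Variable x[v*n + t] == 1 means city v is visited at tour position t."""
    n = D.shape[0]
    # Constraint weight; must dominate the distance terms (our heuristic choice).
    A = penalty if penalty is not None else 2 * D.max() * n
    Q = np.zeros((n * n, n * n))
    idx = lambda v, t: v * n + t

    # Constraint: each city appears at exactly one position.
    # (1 - sum_t x_{v,t})^2 expands to linear (-A) and quadratic (+2A) terms.
    for v in range(n):
        for t in range(n):
            Q[idx(v, t), idx(v, t)] -= A
            for t2 in range(t + 1, n):
                Q[idx(v, t), idx(v, t2)] += 2 * A
    # Constraint: each position holds exactly one city.
    for t in range(n):
        for v in range(n):
            Q[idx(v, t), idx(v, t)] -= A
            for v2 in range(v + 1, n):
                Q[idx(v, t), idx(v2, t)] += 2 * A
    # Objective: distance between cities at consecutive (cyclic) positions.
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            for t in range(n):
                i, j = idx(u, t), idx(v, (t + 1) % n)
                Q[min(i, j), max(i, j)] += D[u, v]
    return Q
```

Note that a valid tour attains a strictly negative energy (the constant term of the squared constraints is dropped), while infeasible assignments are pushed up by the penalty terms.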
In order to assess alternative approaches to using the limited quantum hardware for solving QUBO problems, we apply neural networks (NNs) in this paper. These can help to bridge the gap until sufficiently large quantum hardware becomes available, but may also provide hooks for additional analysis. From a black-box perspective, a NN solving QUBOs can be treated like a quantum annealer by the calling modules. Having such a mock-up helps to identify which aspects of software engineering are really specific to the quantum solution and which originate from the problem definition.
We explain the considered variants of NNs and how to apply them to work with problems formulated as QUBOs along the way. Using these NNs, we provide first empirical evidence for the following four hypotheses:

Autoencoding QUBO instances generated from TSP instances is possible, resulting in a hidden space of the size of the original TSP encoding (Fig. 1a, Sec. 2).

NNs can be trained to solve QUBO instances generated from TSP instances (Fig. 1b, Sec. 3).

NNs can be trained to solve the encoded hidden spaces of these QUBO instances (Fig. 1c, Sec. 4).

NNs can be trained to solve arbitrary QUBO instances (Fig. 1d, Sec. 5).
An overview of the tested network architectures and setups is given in Figure 5. We discuss the lessons learned from these experiments and motivate further research in Sec. 6.
2. Autoencoding QUBO formulations of TSP
Autoencoders (AEs) are NNs that typically possess an hourglass form: In the center they feature a hidden layer that is substantially smaller than the same-sized input and output layers. An AE is trained to reproduce its input data, but as the hidden layer is smaller than the input samples, it cannot simply “pass through” its inputs. Instead, the AE’s first half (called the encoder) needs to learn to abstract the most relevant features in order to populate the latent space, i.e., the information packed as densely as possible into the smallest hidden layer. The second half (called the decoder) then uses this representation to reconstruct the original input as closely as possible (Hubens, 2018).
Once trained, AEs can be used to compress and decompress information by using the encoder and decoder parts separately, or to detect anomalies (i.e., input data not fitting the previously constructed latent space is assumed to substantially differ from previous training data). In our case, we use the process of training various AEs to estimate the entropy contained within the input data: the smallest latent space that still allows for almost lossless autoencoding gives an estimate of the entropy contained in the data set, given that the encoder and decoder have been trained perfectly (if they have not, the estimate becomes rougher).
2.1. Setup
We have trained, tested and validated the network using different data sets. The training data consists of 11,000 randomly generated TSP instances that have been translated to QUBO; the test and validation data sets each consist of 1,000 samples.
There are different types of AEs, each with different advantages and disadvantages. The vanilla autoencoder represents the simplest form and consists of a network with three layers. After the input layer, a dense layer with a ReLU activation function reduces the input’s dimensionality, followed by a second dense layer using sigmoid as an activation function that reconstructs the input. The multi-layer autoencoder extends the previously described version by two more layers in both the encoder and the decoder part. All layers use the ReLU activation function except the last layer, where the sigmoid activation function is used again. Finally, the convolutional autoencoder uses three-dimensional tensors instead of one-dimensional vectors; it is designed to be more suitable for compressing images and is tested here for compressing matrices. Our setup consists of eleven layers: starting with an input layer, there are four encoder layers, two of which are pooling layers and the other two convolutional layers with ReLU activation function. The decoder part consists of six layers: three convolutional layers with ReLU activation function, two upsampling layers and finally an output layer that uses the sigmoid activation function. The initial layer set for each of the AEs is inspired by (Hubens, 2018). Depending on the type of layer (convolutional or dense), the input data’s form must be adjusted. For the convolutional layers, a QUBO matrix is represented as an array of arrays. For the dense layers, the arrays have to be flattened, so QUBO problems are represented as a one-dimensional array in order to enable the network to recognize different problems. The final settings for each network were determined through various experiments and evaluations, which are presented in the following subsection.
The mean squared error (MSE) was used as a loss function for each AE since it shows a higher sensitivity to outliers than, for example, the absolute error. MSE calculates the average of the squared errors between predicted and actual output vectors.
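The MSE loss described above can be written down in a few lines; this is a generic illustration (assuming numpy), not code from the paper:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted output vectors.
    Squaring amplifies large deviations, hence the sensitivity to outliers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```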
The optimizers adam and stochastic gradient descent (SGD) were used to optimize the AE networks. Compared to SGD, adam, which was specially developed for training NNs, has the advantage that its learning rates are adaptive and potentially specific to each parameter. While adam uses little memory and converges faster, SGD is usually better at generalizing (Lu, 2017). We measured accuracy using two methods: The default accuracy compares each predicted output with the actual output and returns the percentage of correctly predicted outputs. This process is repeated after each episode, with one episode corresponding to a training pass over the entire input data set. However, this accuracy is of limited interest for our motivation, since we are rather interested in whether the shortest tour is returned after encoding and decoding the QUBO matrix. Therefore, the after-evaluation accuracy was also used for training, testing and evaluation. This consideration is necessary because there are at least two shortest tours in an undirected graph, as for each tour there exists a reverse tour of the same length. Thus, the second accuracy uses the energy values of the solved QUBO problems, both with the actual qubit configurations and with the predicted ones. Accuracy is then calculated as the fraction of matching energies among all energies.
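The energy-based after-evaluation accuracy can be sketched as follows; function and parameter names are our own, hypothetical reconstruction of the metric described above (assuming numpy):

```python
import numpy as np

def after_evaluation_accuracy(Q_list, x_true_list, x_pred_list):
    """Fraction of instances whose predicted qubit configuration attains the
    same QUBO energy as the reference configuration. Tours of equal length,
    e.g. a tour and its reverse, thereby count as correct."""
    energy = lambda Q, x: float(np.asarray(x) @ Q @ np.asarray(x))
    hits = sum(
        np.isclose(energy(Q, x_pred), energy(Q, x_true))
        for Q, x_true, x_pred in zip(Q_list, x_true_list, x_pred_list)
    )
    return hits / len(Q_list)
```

Comparing energies rather than bit vectors is what makes degenerate optima (multiple equally short tours) count as correct predictions.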
Each AE was trained for epochs. Various learning rates were tried for the SGD optimizer, starting with a learning rate of , decay steps and a decay rate of . There were two further training setups with an initial learning rate of and , respectively (Mack, 2018). For each network type, the best results were achieved using a learning rate of . The adam optimizer was configured with no initial learning rate, and in the event of poor optimization, the mentioned configurations for learning rate and decay were applied. The batch size was set to . It turned out that the AE using the adam optimizer showed better results than the one using SGD.
2.2. Evaluation
Evaluating the AEs should identify whether (and to what extent) the QUBO representation of TSP instances can be reduced while still being able to reconstruct the input. For this purpose, the NNs were trained and evaluated differently, ranging from no reduction in dimensionality down to a reduction to one fourth of the original size. The experiments started with TSP instances with cities (TSP), i.e., a sized QUBO matrix. Even though this problem size is not challenging for computers or humans, it served as a baseline for determining the best solution.
The vanilla AE reconstructed the QUBO well up to a size of . Beyond that, the (after-evaluation) accuracy was below . The accuracy of the multi-layered autoencoder (MLAE) and the convolutional autoencoder (CAE) was at least , even with a reduction to a quarter of the original size. For this reason, the vanilla AE was not evaluated further.
When encoding TSP instances with cities (TSP), both AEs performed well; the CAE was slightly better. The after-evaluation accuracy of the MLAE is for TSP and for TSP. The CAE achieves an accuracy of (TSP) and (TSP). The default accuracy was (MLAE) and (CAE). The average energy difference of predictions that did not correspond to the actual energy was for the MLAE and for the CAE. Since the CAE was best able to reconstruct the input, the MLAE was not evaluated further.
As the CAE in combination with adam as the optimizer achieved the best results, this setup was chosen for the following experiments involving an encoder part.
In summary, it is indeed possible to reduce the dimensionality of TSP instances represented as QUBO problems. A reduction to one fourth of the original size shows that the QUBO matrices contain a lot of redundant information. If a network for outputting the correct qubit configuration can be trained on reduced input alone, training time can be drastically reduced. Fig. (a) and Fig. (b) show that the reduction task is quite simple for the AEs, since training already converges in early epochs.
3. Solving QUBO formulations of TSP
The next step is to check whether a NN can be trained to solve a given QUBO problem. More specifically: Is it possible to learn a qubit configuration that optimally solves a given problem?
The networks were again trained with a QUBO representation of TSP instances. However, since the required output differs from that of the AE part, new output data had to be generated accordingly. The required output for the NN is the qubit configuration for the shortest tour within the TSP instance. The corresponding qubit configurations were determined using qbsolv, a tool for operating the quantum annealing hardware by D-Wave Systems (Inc, 2019). Qbsolv can also be used as a classical solver for QUBO problems. The functionality of qbsolv regarding the solution of TSP instances up to a size of cities was checked and verified by comparing the returned tours with those calculated using Google’s OR-Tools (Inc, 2019) as well as with the solutions of the data sets by (Burkardt, 2019).
In order to determine a suitable NN for solving TSP instances, a recurrent neural network (RNN) and a convolutional neural network (CNN) were implemented. The results of both networks were compared, whereby again all networks were trained with a data set of size 11,000, and 1,000 samples each were used for test and validation.
3.1. Recurrent Neural Network
Our initial network model was inspired by (Bello et al., 2016). They used one network architecture that solves both TSP and the likewise NP-complete knapsack problem. Their network uses the two-dimensional coordinates of the cities as input and the sequence of the cities to be visited as output. In our work, however, the inputs are TSP instances represented as QUBO matrices and the output is the shortest tour encoded as a qubit configuration.
We use a pointer network consisting of two recurrent NN modules (encoder and decoder). As in (Bello et al., 2016), we implement attention using long short-term memory (LSTM) cells (Lihala, 2019). The loss is calculated using binary cross-entropy. This loss function is suitable for problems with a yes/no decision, which is the case with our 0/1 output representing the qubit configuration.
With regard to the optimizer for training the RNN, we have strictly adhered to the structure of (Bello et al., 2016). They propose to use optimization via policy gradients instead of a supervised loss function (as used for the AE mentioned above). The reason for this is that the model’s performance may otherwise be tied to the labels’ quality.
For this, a Monte Carlo approach was implemented in the reinforcement algorithm in order to perform the policy parameter updates using random sampling (Sanjeevi, 2018). In addition to this model-free approach, adam was used for optimization. Again, default accuracy was used during training, and the after-evaluation accuracy was subsequently used for evaluating the QUBO data.
3.2. Convolutional Neural Network
Any QUBO instance can be represented as a two-dimensional matrix, which is why we also implemented a convolutional neural network (CNN). Our CNN consists of six convolutional layers and two dense layers. All but the final layer are paired with a ReLU activation; the final layer uses a softmax activation function. The CNN’s training was optimized with adam.
The first round of experiments used binary cross-entropy as the loss function. The network loss decreased as desired, but the accuracy did not increase. After analyzing the predicted outputs, it was found that the qubit configuration was incomplete. Most of the time, only two or three cities were visited within a TSP instance of cities (TSP), or three to four cities with a TSP instance of cities (TSP). This observation led to changing the loss function. The binary cross-entropy function was extended by a function that checks how many qubits are set to . The function increases the loss if the number of set qubits does not match the number of cities. In addition, the loss is increased if not every city was visited but some city was visited several times.
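A penalty-extended loss of this kind could be sketched as follows. This is a hypothetical reconstruction in plain numpy, not the paper's actual implementation: the function names, the penalty weighting, and the soft counting of set qubits are our own assumptions.

```python
import numpy as np

def penalized_bce(y_true, y_pred, n_cities, weight=1.0, eps=1e-7):
    """Binary cross-entropy plus penalties if the predicted configuration
    does not set exactly n_cities qubits, or visits some city repeatedly.
    A sketch of the extended loss described above, not the original code."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Penalty: deviation of the (soft) number of set qubits from n_cities.
    count_penalty = abs(y_pred.sum() - n_cities)
    # Penalty: a city visited more than once (rows of the n x n assignment).
    rows = y_pred.reshape(n_cities, n_cities)
    repeat_penalty = np.maximum(rows.sum(axis=1) - 1, 0).sum()
    return float(bce + weight * (count_penalty + repeat_penalty))
```

A real training setup would express the same penalties with differentiable tensor operations of the framework in use so that gradients flow through them.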
3.3. Setup
The RNN TSP solver was first trained with the coordinates of the cities. This was to check whether the network, which was inspired by (Bello et al., 2016), gave similar results. It was trained on TSP and TSP and indeed delivered similar results.
Then problems in QUBO representation were used as input and the resulting qubit configuration as output. The batch size was set to and the network was tested with and hidden units per layer. The range of learning rates was the same as with the AE. To save training time, the RNN was first tested with hidden units and three learning rates. The loss was best at a learning rate of with a decay of at steps. However, training with hidden units resulted in a network that was not able to recognize the hidden logic within the qubits for problems with more than cities. Accordingly, the number of hidden units was increased to . This lengthened the training time, but the entire logic of the QUBO representing the TSP still could not be learned.
We suspect that the problem lies in the layers used, because – as can also be seen with the AEs – convolutional layers process QUBOs better. Since a further increase in the hidden dimensions would lead to a further increase in training time, we focused on the CNN for further analysis.
The CNN was trained and compared with and units per layer. Training the network containing units with TSP instances worked well, but the model overfitted. This is because the network is designed for complex problems, but a TSP with cities is just too simple. To prevent overfitting, dropout layers that randomly ignore units were added to the model when training with TSP instances.
TSP instances were used to train the units model, while TSP instances were used for models having and units (but no dropout layer).
3.4. Evaluation
Before the loss function was adjusted as described above, the predictions did not set qubits to , but only two or three. Afterwards, the network learned that the goal is to minimize the energy and that it therefore has to consider all constraints.
The TSP setup was trained for epochs. However, the training itself only required epochs for the ideal result. After the dropout layer was added, the network no longer overfitted and showed a loss of around (see Fig. (a)). In of the cases, the predicted values matched the actual values. In cases where they did not match, the average difference between the actual and the calculated distance of the shortest tour was . Considering that the distances were chosen randomly between and 10,000, the network did understand its task.
When training the TSP, the dropout layer was not used. units were not sufficient to achieve good results: a default accuracy of was achieved. After an increase to hidden units, still without a dropout layer, an after-evaluation accuracy of was achieved. The average distance for non-matching actual and predicted data was .
Fig. (b) shows the training of TSP. One can see that the loss starts lower than with the TSP. A major disadvantage of convolutional layers is the training time. In order to save processing time, all TSP instances were trained with pre-trained networks, i.e., networks that were previously trained using TSP. This procedure helps to reduce the processing time, since the loss starts at a lower point because only the last layers have to be trained. It also leads to fewer epochs until training converges. In this specific case, epochs were sufficient. We also checked that these results are similar to a CNN that was trained on TSP without pre-trained layers: the training took four times longer, and the results were worse after epochs, but approximately the same after epochs.
4. Solving encoded states of QUBO formulations of TSP
We now present a network architecture for solving NP-complete problems that uses the encoder part of the CAE combined with the CNN TSP solver (see Fig. (c)). The idea is to reduce the dimensionality of the QUBO problems and use this representation to train the network solving the problem. These networks were chosen because the CAE showed the best results at reconstructing QUBOs and the CNN TSP solver accordingly performed best at solving TSP instances. When combining the networks, the setup described in the previous sections was used.
Training the CNN with compressed QUBO data from TSP instances again led to overfitting. Thus, dropout layers were added to address this problem. Instances of TSP were only tested with units because the results were good enough. The training of the combinatorial NN had very similar results to the CNN. The loss converged at and had a default accuracy of and an after-evaluation accuracy of . The average difference between all non-matching and actual results was . The network was trained over epochs.
The compression of the input had almost no effect on the network’s ability to learn qubit configurations, while the training time was greatly reduced when only compressed input was used: the CNN used about hours of training time, while the combinatorial NN only used about hours.
In order to learn TSP instances, the combinatorial NN again used pre-trained layers, i.e., those of the TSP combinatorial NN. Again, there are only minor differences from the CNN results. The loss was (see Fig. 12), which is identical to the CNN’s loss. The network was trained for epochs, had a default accuracy of , an after-evaluation accuracy of , and a mean difference between non-matching and actual results of . This value is higher than that of the CNN, but still acceptable as the cities’ distances were randomly chosen between and 10,000.
5. Solving arbitrary QUBO instances
Finally, we want to take another step towards generalization and train NNs to solve arbitrary QUBOs. In this way, they can be functionally used in place of a quantum annealing solver.
5.1. Setup
Random QUBOs have no inherent structure that could be exploited by an AE, so we only trained CNNs for this task. The input data was generated by filling the upper triangular matrix with random numbers between −10,000 and 10,000. The output was generated by labeling the given input with the qubit configurations created using qbsolv. The training data set consisted of 11,000 samples; the validation and test data sets each consisted of 1,000 samples.
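The generation of such random instances can be sketched in a few lines; this is our own illustration (assuming numpy), with hypothetical function and parameter names:

```python
import numpy as np

def random_qubo(n, low=-10_000, high=10_000, rng=None):
    """Generate a random QUBO instance by filling the upper triangular part
    (including the diagonal) with uniform random values; the lower triangle
    stays zero, which loses no generality for QUBO."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((n, n))
    iu = np.triu_indices(n)
    Q[iu] = rng.uniform(low, high, size=len(iu[0]))
    return Q
```

Labels would then be obtained by running a QUBO solver such as qbsolv on each generated matrix.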
5.2. Evaluation
We want to show that a single NN can solve not only a specific NP-complete problem, but a generic one. In order to ensure comparability, random QUBOs were created that have the same dimensionality as TSP and TSP QUBOs.
The CNN was able to learn from the random QUBOs: dimensional matrices (equivalent to TSP) were trained using units per layer over epochs and had a loss of . The default accuracy was , the afterevaluation accuracy . The mean energy difference between nonmatching actual and predicted results was , which is much higher than with TSP. However, the qubit configuration was correct for almost every second result.
Training on random data is far more complex than training on a specific problem (see Fig. (a) and Fig. (b)). The network did not overfit, not even with twice as many units. The random sized QUBO problems (equivalent to TSP) were trained with a network having units per layer over epochs and reached a loss of . Default accuracy and after-evaluation accuracy were around , and the average energy difference of the non-matching outputs was .
Training with random values took a lot of time for relevant QUBO sizes, and the accuracy fell faster with increasing QUBO dimensionality than with TSP. A network for random QUBOs that reused a network pre-trained on sized random QUBO problems achieved an accuracy of and did not perform comparably to the CNN. Beyond the fact that an energy minimum is sought, the larger network cannot reuse much information.
6. Conclusion
We provided empirical evidence for four hypotheses. (1) AEs are able to filter out the overhead induced by a QUBO translation of TSP to some extent. They can thus be used to guess the original complexity of a problem from its QUBO formulation. (2) NNs can be trained to return the qubit configuration resulting in minimum energy for a QUBO problem generated from a TSP instance. They are thus able to solve TSP even in a larger QUBO translation. (3) Accordingly, NNs can also solve QUBO problems originating from TSP given their latent space representation (instead of the full QUBO matrix). (4) NNs can be trained to solve QUBO problems in general. The fact that CNNs appear most effective suggests that QUBO problems can be treated more like a somewhat local graph problem and less like combinatorial optimization.
These first steps call for immediate follow-up research. Most importantly, a thorough study of the various impacts of the overhead from the QUBO translation is necessary: How do networks that have been trained for (a) solving TSP in native encoding, (b) solving QUBO translations of TSP, and (c) solving QUBO in general compare on the same set of problems regarding various performance metrics? Are there cases where a QUBO translation may actually be easier to solve than other representations of TSP? Does specialized training on just one type of QUBO bring any advantage over training on random QUBOs? How do the results on TSP (whose QUBO translation introduces a quadratic overhead) compare to problems with more (or less) efficient QUBO translations?
From this experience report, a strong argument can be made for mathematically solid interfaces in quantum computing: The NNs we trained should be able to replace any other means of solving QUBOs fully transparently to the provider of the problem instances. A diverse pool of mechanisms for solving QUBOs should prove useful to establish QUBO as a suitable formulation for optimization problems and thus prepare for the eventual deployment of quantum-based machines. Current breakthrough technology like neuromorphic hardware may thus serve as a bridge to the quantum age.
We argue that for some time to come, quantum software will usually only be shipped as a module within larger, mostly classical software applications. Furthermore, these modules will usually come with fully classical counterparts as quantum resources will remain comparatively limited and thus should not be used up unnecessarily, for example when testing other parts of the software where a good enough approximation of the quantum module suffices. We think that NNs may provide a very generic tool to produce such counterparts as it has been done in this case study for quantum annealing or QAOA, even though their rather blackbox nature opens up a new field of testing issues. Effectively, we argue that any approach to the integration of quantum modules should aim to include similar classical approximation models at least for the near future.
We would like to point out that even in the presence of large-scale quantum hardware, handling QUBO problems with NNs might still be useful for pre- and post-processing of problem instances, dispatching instances to various hardware platforms, or providing estimates of the inherent complexity of a specific problem or problem instance. As we have shown that NNs can handle the structure of QUBO matrices well, they may also be able to learn transformations on them (ideally with automatic reduction of size) or help with introspection of the optimization process and effectively the debugging of optimization problem formulations or quantum hardware platforms.
References

I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2016). Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
Burkardt (2019). Online resource.
V. Choi (2010). Adiabatic quantum algorithms for the NP-complete maximum-weight independent set, exact cover and 3SAT problems. arXiv preprint arXiv:1004.2226.
E. Farhi, J. Goldstone, and S. Gutmann (2014). A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028.
S. Feld et al. (2018). A hybrid solution method for the capacitated vehicle routing problem using a quantum annealer. arXiv preprint arXiv:1811.07403.
F. Glover, G. Kochenberger, and Y. Du (2018). A tutorial on formulating and using QUBO models. arXiv preprint arXiv:1811.11538.
Hubens (2018). Online resource.
D-Wave Systems Inc. (2019). qbsolv. Online resource.
Google Inc. (2019). OR-Tools. Online resource.
T. Kadowaki and H. Nishimori (1998). Quantum annealing in the transverse Ising model. Physical Review E 58(5), pp. 5355.
Lihala (2019). Online resource.
Lu (2017). Online resource.
A. Lucas (2014). Ising formulations of many NP problems. Frontiers in Physics 2, p. 5.
Mack (2018). Online resource.
C. C. McGeoch (2014). Adiabatic quantum computation and quantum annealing: theory and practice. Synthesis Lectures on Quantum Computing 5(2), pp. 1–93.
Sanjeevi (2018). Online resource.
T. Stollenwerk and A. Basermann (2016). Experiences with scheduling problems on adiabatic quantum computers. In 1st Int’l Workshop on Post-Moore Era Supercomputing (PMES), pp. 45–46.