
Fast semidefinite programming with feedforward neural networks

Semidefinite programming is an important optimization task, often used in time-sensitive applications. Though semidefinite programs are solvable in polynomial time, in practice they can be too slow for online, i.e. real-time, applications. Here we propose to solve feasibility semidefinite programs using artificial neural networks. Given the optimization constraints as an input, a neural network outputs values for the optimization parameters such that the constraints are satisfied, both for the primal and the dual formulations of the task. We train the network without having to exactly solve the semidefinite program even once, thus avoiding the possibly time-consuming task of generating many training samples with conventional solvers. The neural network method is only inconclusive if both the primal and dual models fail to provide feasible solutions. Otherwise we always obtain a certificate, which guarantees that false positives are excluded. We examine the performance of the method on a hierarchy of quantum information tasks, the Navascués-Pironio-Acín hierarchy applied to the Bell scenario. We demonstrate that the trained neural network gives decent accuracy, while showing orders of magnitude increase in speed compared to a traditional solver.




I Introduction

Neural network-based algorithms have produced astounding results for many tasks. A notable example is image inpainting: given an image with missing patches, a properly trained algorithm can fill in the patches convincingly, matching both local and global expectations. In many fields of science and technology we face a similar problem: given some elements of a matrix, find a completion such that the whole matrix is positive semidefinite. Such a problem is an instance of a feasibility semidefinite program (SDP), a widely studied class of convex optimization problems with a broad range of applications Wolkowicz et al. (2012); Vandenberghe and Boyd (1999). SDPs can be solved in polynomial time and, by duality, typically provide a certificate of reaching the optimum. However, in various use-cases solving the problem exactly becomes impractical due to extensive runtimes, which limits its utility in many real-time applications. This has led to several approaches that reduce the runtime, even at the cost of accuracy Majumdar et al. (2020); Arora et al. (2005); Jain and Yao (2011); Hazan and Koren (2016); Bidyarthy et al. (2014); Mazziotti (2011); Shah et al. (2016).

Figure 1: General scheme of our machine learning approach. Given a new instance of a feasibility SDP, described by the vector of fixed elements (purple dots), a neural network outputs the missing elements (yellow triangles) of the constraint matrix. The neural network's parameters are updated in order to make the completed matrix as positive semidefinite as possible (i.e. to maximize its smallest eigenvalue).

Here we explore an alternative solver for SDPs using neural networks. We replace the standard optimization procedure with a single nonlinear function, a feedforward artificial neural network. We focus on feasibility SDPs, where, given a description of the problem as an input, the neural network will guess the optimization parameter values such that they complete a matrix in a positive semidefinite way (illustrated in Fig. 1). If the prediction of the network results in a positive semidefinite matrix then we obtain a certificate of feasibility. If it fails to do so, then this could be either because there does not exist a positive semidefinite completion or because the neural network just didn’t find it. Luckily, the SDP has a dual formulation, in which the dual task is feasible if and only if the primal is infeasible. Thus we also check the dual task with a second neural network in a similar manner as the primal. It is only if both networks fail that we do not get a definite statement of feasibility. In other words, for the neural network-based method the duality gap is not guaranteed to be zero, while dedicated SDP solvers can in principle close this gap.

We pay this price in precision in order to gain a significantly improved runtime. Contrary to traditional solvers, for each new instance of a problem our neural network does not have to start from scratch, since it learned the structure of the problem previously, during training, which allows for a strong advantage in runtime. The flexibility of neural networks allows the user to tune this trade-off between accuracy and runtime by changing the size and architecture of the neural network. The technique could be useful for screening data before running slower, albeit more precise solvers, when one has many instances of similarly structured optimization tasks. Additionally, it can be very useful in scenarios where quick or even real-time solving of SDPs is required, such as in calculating the safe zones of self-driving cars, collision avoidance Prajna and Jadbabaie (2004); Frazzoli et al. (2012), control systems and robotics tasks Vandenberghe and Boyd (1999); Boyd and Vandenberghe (1997), black-box quantum random number generation and quantum protocols Pironio et al. (2010); Brask et al. (2017), analyzing ground state energies in quantum chemistry and many-body physics tasks Mazziotti (2004); Bravyi et al. (2019); Fukuda et al. (2007); Nakata et al. (2001, 2008); Navascués et al. (2013) or bounding problems in NP Boyd and Vandenberghe (1997); Goemans and Williamson (1995).

We examine the performance of the method on a central scenario in quantum theory, Bell nonlocality Bell (1964); Brunner et al. (2014). At a glance, the basic task is to decide what kind of shared resource is necessary or sufficient for creating correlations between two parties. In particular, quantum resources, as opposed to classical ones, can lead to so-called nonlocal behavior. Previous works have used machine learning in tackling the question of whether classical resources are sufficient Kriváchy et al. (2020); Bharti et al. (2019), in classifying behavior by learning from many samples Canabarro et al. (2019), or in proposing new experiments Melnikov et al. (2020). In the current work we turn towards another question: is a quantum resource sufficient to reproduce the correlations? A hierarchy of SDPs proposed by Navascués, Pironio and Acín (NPA) helps in answering this question Navascués et al. (2007). Each step of the hierarchy is a feasibility SDP which demonstrates whether the correlations under scrutiny fall within a given relaxation of the set of quantum correlations. For higher levels of the hierarchy the relaxations become tighter, and they eventually converge to the true quantum set Navascués et al. (2008); Doherty et al. (2008). Each SDP in the hierarchy is a feasibility SDP in which some elements of the so-called moment matrix are fixed by physically observed correlations, and the other elements need to be completed such that the whole matrix is positive semidefinite.

We apply our technique to this problem by training two neural networks for each level of the hierarchy, one for the primal and one for the dual problem. For a fixed level of the hierarchy, the physically observed correlations define the SDP, so this is the input we use for the neural network. For training, we use only random correlations as inputs, without having to solve the SDP exactly even once. Once trained, the neural networks can predict matrix completions orders of magnitude quicker for a new correlation than conventional solvers, while keeping a decent accuracy.

In the following we introduce our approach to solving feasibility SDPs in detail. We follow by describing the Bell scenario and the hierarchy of SDPs which approximate the quantum set of correlations. Then we present the performance results and conclude with a discussion. Finally, we present a more detailed introduction to neural networks and the details of our implementation in the Methods section.

II Neural networks for solving SDPs

Consider the following family of generic nonlinear optimization tasks, parametrized by a vector $c$:

$$\min_x f(x, c) \quad \text{s.t.} \quad g_i(x, c) \geq 0, \; i = 1, \dots, k, \qquad (1)$$

where $x \in \mathbb{R}^n$ and $c \in \mathbb{R}^m$. We focus on a differentiable objective $f$ and constraints $g_i$. The general spirit of our work is to encode the optimization problem in the loss function of an artificial neural network, such that it learns to satisfy the constraints and minimize the objective, i.e. $L(x, c) = f(x, c) + \lambda \sum_i \max(0, -g_i(x, c))$, where $\lambda$ is introduced to balance the objective and constraints in the loss function. Then we train the neural network to minimize this loss by inputting random configuration vectors $c$, and asking the neural network to output $x$, a solution which it guesses to be optimal. After training on many random configurations from within the family, the neural network will have learned the structure of the problem, and when evaluated on a new instance it will predict a close-to-optimal solution. Within this family of (fixed-size) optimization problems we thus automate the solving process: for a new instance of an optimization problem (characterized by $c$), instead of solving it from scratch, the neural network predicts the solution almost instantaneously.
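The loss construction above can be sketched in a few lines of code (the toy objective, constraint and helper names here are our own illustration, not the paper's implementation):

```python
def penalty_loss(f, gs, x, c, lam=10.0):
    """Loss = objective + lam * total constraint violation.

    f(x, c) is the objective; each constraint g(x, c) >= 0 contributes
    to the penalty only when it is violated (g < 0).
    """
    violation = sum(max(0.0, -g(x, c)) for g in gs)
    return f(x, c) + lam * violation

# Toy instance: minimize x^2 subject to x >= c.
f = lambda x, c: x ** 2
g = lambda x, c: x - c

feasible = penalty_loss(f, [g], x=2.0, c=1.0)    # 4.0: no penalty
infeasible = penalty_loss(f, [g], x=0.0, c=1.0)  # 0 + 10 * 1 = 10.0
```

During training, a network producing $x$ from $c$ would be updated by gradient descent on this loss, averaged over random $c$.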

In the current work we examine an important subclass of optimization problems, feasibility SDPs Wolkowicz et al. (2012). First, we do this because SDPs are specific instances of conic programming and thus have a dual problem, the optimal solution of which approaches the primal's from below and achieves a zero duality gap. With our technique we can train a neural network also for the dual task, and by evaluating both the dual and primal neural networks we can obtain bounds on how well the two models together manage to reduce the duality gap. Second, we only examine feasibility problems because in these the machine's task is very pronounced and its success is easy to check: predict a variable $x$ such that the constraints are satisfied. If successful, we have a certificate of feasibility. Finally, note that well-formulated SDPs with a bounded objective function can be approximated arbitrarily well with a series of feasibility SDPs using bisection techniques, with the accuracy increasing exponentially with the number of feasibility SDPs solved.
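The bisection reduction can be sketched as follows (the feasibility oracle is a stand-in for any feasibility solver; the function name and the toy threshold are ours):

```python
def minimize_by_bisection(is_feasible, lo, hi, tol=1e-6):
    """Approximate min{t : the problem with objective bounded by t is feasible}.

    `is_feasible(t)` must be monotone: feasible for all t above the optimum,
    infeasible below it. Each call halves the interval, so the accuracy
    improves exponentially in the number of feasibility problems solved.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            hi = mid  # keep the feasible upper end
        else:
            lo = mid  # discard the infeasible lower end
    return hi

# Toy oracle whose hidden optimum is 3.0.
opt = minimize_by_bisection(lambda t: t >= 3.0, lo=0.0, hi=10.0)
```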

A feasibility SDP can be expressed as

$$\text{find } X \quad \text{s.t.} \quad \langle A_i, X \rangle = b_i \;\; \forall i, \qquad X \succeq 0, \qquad (2)$$

where $X$ is an $n \times n$ symmetric matrix, $\langle A, B \rangle = \mathrm{Tr}(A^\dagger B)$ is the Hilbert-Schmidt inner product, and the matrices $A_i$ and vector $b$ encode the constraints. The dual task of the primal SDP can be stated as

$$\text{find } y \quad \text{s.t.} \quad \sum_i y_i A_i \succeq 0, \qquad \langle b, y \rangle = -\delta, \qquad (3)$$

where $\delta$ is an arbitrary positive number, which we introduce in order to be able to state the dual problem also as a feasibility task. For feasibility problems, either the primal or the dual task has a solution.
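A claimed solution of the primal form (2) can be verified directly, which is what makes a network's output a certificate. A minimal numpy sketch (the helper name and toy instance are ours):

```python
import numpy as np

def is_primal_feasible(X, As, b, tol=1e-8):
    """Check <A_i, X> = b_i for all i and that X is positive semidefinite."""
    X = np.asarray(X, dtype=float)
    constraints_ok = all(abs(np.trace(A.T @ X) - bi) < tol
                         for A, bi in zip(As, b))
    psd_ok = np.linalg.eigvalsh((X + X.T) / 2).min() >= -tol
    return bool(constraints_ok and psd_ok)

# Toy instance: the constraints fix the diagonal of a 2x2 matrix to (1, 1).
As = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
b = [1.0, 1.0]
ok = is_primal_feasible(np.array([[1.0, 0.5], [0.5, 1.0]]), As, b)   # PSD completion
bad = is_primal_feasible(np.array([[1.0, 2.0], [2.0, 1.0]]), As, b)  # min eigenvalue -1
```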

Our main tools for solving these SDPs are artificial neural networks. A feedforward neural network is a model for a generic multidimensional nonlinear function. One of its simplest realizations, a multilayer perceptron, is characterized by the number of neurons per layer (width), the number of layers (depth), and the activation functions used at the neurons, which altogether model an iterative sequence of parametrized affine and fixed nonlinear transformations on the input, i.e.

$$x_{l+1} = \sigma_l(W_l x_l + b_l),$$

where the matrix $W_l$ and vector $b_l$ parametrize the affine transformation, $\sigma_l$ is a fixed differentiable nonlinear function, and $x_l$ is the input of layer $l$, with $x_0$ being the input of the model and the last layer's output being the output of the model. During training, the parameters of the model ($W_l$, $b_l$) are updated such that they minimize a differentiable loss function over the training set. Once trained, the neural network can be evaluated on new input instances. For a more detailed introduction to neural networks we refer the reader to the Methods section or to Ref. Goodfellow et al. (2016).
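In code, the layer recursion reads as follows (a self-contained numpy sketch; the exponential linear activation mirrors a choice described later in the text, while the helper names and widths are ours):

```python
import numpy as np

def elu(z):
    # Exponential linear unit, applied element-wise.
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def mlp_forward(x, weights, biases):
    """Iterate x_{l+1} = sigma(W_l x_l + b_l); the final layer is linear."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = elu(W @ x + b)
    return weights[-1] @ x + biases[-1]

rng = np.random.default_rng(0)
widths = [4, 8, 8, 3]  # input of size 4, two hidden layers of 8, output of size 3
weights = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [np.zeros(m) for m in widths[1:]]
y = mlp_forward(rng.normal(size=4), weights, biases)
```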

Our approach to solving SDPs via neural networks is illustrated in Fig. 1. We examine a family of feasibility SDP problems, where the matrices $A_i$ are fixed and only the values of $b$ differ. We input the constraints of the SDP ($b$) into a feedforward neural network. The neural network outputs the optimization variables (the free values of $X$ for the primal, or $y$ for the dual) and the loss is taken to be minus one times the smallest eigenvalue of the constraint matrix. Thus the neural network tries to push the lowest eigenvalue of the constraint matrix to be as large as possible. For a test sample, if this smallest eigenvalue of the primal constraint matrix is positive, then the machine has found a positive semidefinite completion of $X$, i.e. we obtain a certificate of feasibility. Alternatively, if the dual neural network finds a positive semidefinite solution to the dual task in Eq. (3), we obtain a certificate of infeasibility of the primal task. It is only if neither the primal nor the dual neural network gives a positive semidefinite solution that we are uncertain of the feasibility.
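The primal loss can be written in a few lines: place the network's outputs into the free slots of the constraint matrix and return minus the smallest eigenvalue (a sketch; the index convention for the free entries is ours):

```python
import numpy as np

def completion_loss(fixed, free_idx, outputs):
    """Fill the symmetric matrix `fixed` at upper-triangle positions
    `free_idx` with the network outputs, then return -lambda_min."""
    M = fixed.copy()
    for (i, j), v in zip(free_idx, outputs):
        M[i, j] = M[j, i] = v
    return -np.linalg.eigvalsh(M).min()

# 2x2 example: diagonal fixed to 1, off-diagonal predicted by the network.
fixed = np.eye(2)
# Output 0.0 completes it to the identity: loss = -1 (feasible).
good = completion_loss(fixed, [(0, 1)], [0.0])
# Output 2.0 makes it indefinite: smallest eigenvalue -1, loss = +1.
bad = completion_loss(fixed, [(0, 1)], [2.0])
```

Training drives this loss below zero whenever a positive semidefinite completion exists and the network can find it.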

III Case study in quantum nonlocality

To examine the performance of the method, we consider a hierarchy of feasibility SDPs, where the constraint matrices naturally have several elements fixed and the others left free. In the Navascués-Pironio-Acín (NPA) hierarchy Navascués et al. (2007, 2008), using such SDPs we can explore the limitations of correlations coming from shared quantum resources. The hierarchy can be applied in a variety of scenarios such as estimating randomness in device-independent scenarios Nieto-Silleras et al. (2014); Bancal et al. (2014), network nonlocality Pozas-Kerstjens et al. (2019); Wolfe et al. (2020), sequential Bell tests Bowles et al. (2020) or bounding ground state energies Navascués et al. (2013), but here, for concreteness and easy benchmarking, we examine the standard bipartite Bell scenario, which is an excellent sandbox for examining the strength of correlations between two systems Bell (1964); Brunner et al. (2014).

In the Bell scenario a source is distributed to two parties, Alice and Bob, and they each additionally receive an input ($x$ and $y$, respectively) and produce an output ($a$ and $b$, respectively), as illustrated in Fig. 2(a). For the current work we only consider discrete inputs and outputs of cardinality 2, i.e. $a, b, x, y \in \{0, 1\}$. We say that there is no signaling among the parties if the statistics of the inputs and outputs obey

$$\sum_b p(ab|xy) = \sum_b p(ab|xy') \;\; \forall a, x, y, y', \qquad \sum_a p(ab|xy) = \sum_a p(ab|x'y) \;\; \forall b, y, x, x'.$$

Any correlation for which these constraints hold is in the set of no-signaling correlations $\mathcal{NS}$. We enforce no-signaling and that the inputs are independent from the source. Then, just by observing the statistics of the inputs and the outputs of the two parties, $p(ab|xy)$, one can in certain cases deduce that stronger-than-classical sources were used, e.g. quantum or post-quantum sources. With any classical source, the obtained statistics can be described by a shared random variable $\lambda$, constraining the correlations to be of the form

$$p(ab|xy) = \sum_\lambda q(\lambda)\, p_A(a|x\lambda)\, p_B(b|y\lambda)$$

for some conditional probability distributions $p_A$, $p_B$, otherwise known as response functions, and some probability distribution $q(\lambda)$. Any correlation which can be described by such a model (otherwise known as a local hidden variable model) is within the local set $\mathcal{L}$.
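Both definitions above can be checked numerically: any local model built from a shared random variable necessarily satisfies the no-signaling constraints. A numpy sketch (the array conventions are our own):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random local model: shared variable lam with distribution q, and
# response functions pA[x, lam, a] = p(a|x,lam), pB[y, lam, b] = p(b|y,lam).
n_lam = 4
q = rng.dirichlet(np.ones(n_lam))
pA = rng.dirichlet(np.ones(2), size=(2, n_lam))
pB = rng.dirichlet(np.ones(2), size=(2, n_lam))

# p(a,b|x,y) = sum_lam q(lam) p(a|x,lam) p(b|y,lam), stored as p[a, b, x, y].
p = np.einsum('l,xla,ylb->abxy', q, pA, pB)

# No-signaling: Alice's marginal is independent of y, Bob's of x.
marg_A = p.sum(axis=1)  # p(a|x,y)
marg_B = p.sum(axis=0)  # p(b|x,y)
ns = (np.allclose(marg_A[:, :, 0], marg_A[:, :, 1])
      and np.allclose(marg_B[:, 0, :], marg_B[:, 1, :]))
```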

The Born rule of quantum theory postulates that for any quantum source, $p(ab|xy)$ must have the form

$$p(ab|xy) = \mathrm{Tr}\!\left[\rho \left(M^a_x \otimes N^b_y\right)\right], \qquad (8)$$

where $\rho$ is a Hermitian positive semidefinite matrix (or operator, more generally) with unit trace, which can be thought of as the quantum analogue of a probability distribution, and the $M^a_x$ are projective matrices such that $\sum_a M^a_x = \mathbb{1}$, and similarly for the $N^b_y$. The set $\{M^a_x\}_a$ represents Alice's measurement when her input is $x$ and output is $a$, and analogously for Bob. Any correlation having a quantum explanation according to the above equation is in the set $\mathcal{Q}$.
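The Born rule can be evaluated directly. The sketch below (our own construction, not from the paper) computes $p(ab|xy)$ for a maximally entangled two-qubit state with the standard CHSH-optimal projective measurements, for which the CHSH value reaches the Tsirelson bound $2\sqrt{2}$:

```python
import numpy as np

def projectors(theta):
    # Rank-1 projectors of the qubit observable cos(theta) Z + sin(theta) X.
    v = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    P = np.outer(v, v)
    return [P, np.eye(2) - P]

# Maximally entangled state (|00> + |11>)/sqrt(2) as a density matrix.
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(phi, phi)

# CHSH-optimal measurement angles.
A = [projectors(t) for t in (0.0, np.pi / 2)]
B = [projectors(t) for t in (np.pi / 4, -np.pi / 4)]

# Born rule: p[a, b, x, y] = Tr[rho (A_x^a tensor B_y^b)].
p = np.array([[[[np.trace(rho @ np.kron(A[x][a], B[y][b])).real
                 for y in range(2)] for x in range(2)]
               for b in range(2)] for a in range(2)])

E = np.array([[sum((-1) ** (a + b) * p[a, b, x, y]
                   for a in range(2) for b in range(2))
               for y in range(2)] for x in range(2)])
chsh = E[0, 0] + E[0, 1] + E[1, 0] - E[1, 1]  # = 2*sqrt(2)
```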

Figure 2: (a) The Bell scenario. A source (either no-signaling, quantum or local) is distributed to the two parties, Alice (A) and Bob (B). Independently from the source they each receive an input ($x$ and $y$, resp.) and produce an output ($a$ and $b$, resp.). The correlations are characterized by the joint statistics $p(ab|xy)$. (b) A slice of the set of no-signaling correlations, where the axes are the CHSH expression divided by four and a relabeling of it (CHSH'/4). The isotropic line (dotted red line) runs between the PR-box (top yellow dot) and the fully mixed state (gray dot). The red dot represents the Tsirelson bound. The 8 local vertices achieving a CHSH value of 2 do not appear in this slice, but for illustration are projected onto 2 points (yellow dots on the border of $\mathcal{L}$).

The no-signaling set and the local set are polytopes Brunner et al. (2014), bounded by convex combinations of a finite set of vertices, or equivalently by linear inequalities. For the local set there are 8 non-trivial inequalities, all of them equivalent up to relabeling to the same Clauser-Horne-Shimony-Holt (CHSH) inequality, which is a linear expression bounded by the value 2 for local behaviors Clauser et al. (1969). This inequality defines a facet, spanned by 8 local vertices. The no-signaling set is bounded by 24 vertices, of which 16 are also local vertices, and 8 are genuinely no-signaling vertices, all equivalent under relabeling to the Popescu-Rohrlich box (PR-box) Popescu and Rohrlich (1994); Rastall (1985); Khalfin and Tsirelson (1985); Barrett et al. (2005), defined as

$$p_{\mathrm{PR}}(ab|xy) = \begin{cases} 1/2 & \text{if } a \oplus b = xy, \\ 0 & \text{otherwise,} \end{cases}$$

which gives a CHSH value of 4, far above the local bound. The relation between $\mathcal{L}$ and $\mathcal{NS}$ is visualized on a two-dimensional slice of the space in Fig. 2(b).
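The PR-box and its CHSH value can be reproduced in a few lines (our own sketch):

```python
import numpy as np

# PR-box: p(a,b|x,y) = 1/2 if a XOR b == x*y, else 0; stored as p[a, b, x, y].
p = np.zeros((2, 2, 2, 2))
for a in range(2):
    for b in range(2):
        for x in range(2):
            for y in range(2):
                if (a ^ b) == x * y:
                    p[a, b, x, y] = 0.5

E = np.array([[sum((-1) ** (a + b) * p[a, b, x, y]
                   for a in range(2) for b in range(2))
               for y in range(2)] for x in range(2)])
chsh = E[0, 0] + E[0, 1] + E[1, 0] - E[1, 1]  # = 4, above the local bound of 2
```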

Quantum behaviors can at most achieve a CHSH value of $2\sqrt{2}$, the Tsirelson bound, thus achieving nonlocality Cirel'son (1980). However, the quantum set is not a polytope and is thus more difficult to characterize than the local or no-signaling sets. An additional difficulty is that one cannot in general bound the dimension of the density matrix required. A general technique for approximating the quantum set is an infinite hierarchy of relaxations, known as the NPA hierarchy Navascués et al. (2007). Each relaxation is defined by a finite set of strings of moments of the quantum measurement operators of Alice and Bob. In this paper we examine the relaxations $Q_1$, $Q_{1+AB}$, $Q_2$ and $Q_3$, where the number denotes the maximal degree of moments used in the relaxation, and $Q_{1+AB}$ only includes single moments and cross-moments of Alice's and Bob's measurements, a commonly used intermediate level between $Q_1$ and $Q_2$. The set $Q_{n+1}$ is a subset of $Q_n$, and as $n$ approaches infinity we are guaranteed to recover the actual quantum set $\mathcal{Q}$ Navascués et al. (2008); Doherty et al. (2008).

The basic idea of the relaxations is that if a distribution has a quantum description according to Eq. (8), then the so-called moment matrix should be positive semidefinite. A moment matrix, $\Gamma$, is constructed by taking a set of strings of moments of measurement operators, e.g. for $Q_1$ one takes $S_1 = \{\mathbb{1}, A_0, A_1, B_0, B_1\}$. Other single moments are not needed due to normalization, so for clarity we leave away the output superscript from now on. One constructs the moment matrix by taking the expectation value of the dyadic product of $S_1$ and its element-wise adjoint set $S_1^\dagger$, which in the case of single moments is the same as $S_1$ since the projectors are self-adjoint. The moment matrix is the following, where $\langle C \rangle$ is the expectation value with respect to $\rho$, i.e. $\langle C \rangle = \mathrm{Tr}(\rho C)$. Only the upper triangle is shown as $\Gamma$ is symmetric.

$$\Gamma = \begin{pmatrix} 1 & \langle A_0 \rangle & \langle A_1 \rangle & \langle B_0 \rangle & \langle B_1 \rangle \\ \cdot & \langle A_0 \rangle & \mathbf{\langle A_0 A_1 \rangle} & \langle A_0 B_0 \rangle & \langle A_0 B_1 \rangle \\ \cdot & \cdot & \langle A_1 \rangle & \langle A_1 B_0 \rangle & \langle A_1 B_1 \rangle \\ \cdot & \cdot & \cdot & \langle B_0 \rangle & \mathbf{\langle B_0 B_1 \rangle} \\ \cdot & \cdot & \cdot & \cdot & \langle B_1 \rangle \end{pmatrix} \qquad (10)$$

For any behavior derived from some quantum source according to Eq. (8), such a moment matrix must be positive semidefinite. Notice that even without knowing $\rho$, almost all elements can be identified with the observed statistics via the Born rule in Eq. (8). However, a few elements are unphysical and cannot be observed, since they would require the joint measurement of two of Alice's or Bob's operators, which is prohibited for noncommuting measurements. For $Q_1$ there are only two such unknown elements, denoted in bold in the matrix (10). For higher levels, however, their number grows quickly: there are 8, 22, 52, 92 and 142 unknown elements in the moment matrices of levels $Q_{1+AB}$, $Q_2$, $Q_3$, $Q_4$ and $Q_5$, respectively, while each of them has the same number of elements fixed by the statistics $p(ab|xy)$.
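For level 1 the completion problem is small enough to write out explicitly. The sketch below (our own helper; the example expectation values correspond to the uniformly random behavior, whose classical model gives a known completion) builds the moment matrix and checks positive semidefiniteness:

```python
import numpy as np

def moment_matrix_level1(EA, EB, EAB, u_A, u_B):
    """Level-1 NPA moment matrix in the basis (1, A0, A1, B0, B1).

    EA[x] = <A_x>, EB[y] = <B_y>, EAB[x, y] = <A_x B_y> come from the
    observed statistics; u_A = <A0 A1> and u_B = <B0 B1> are the two
    unknown elements to be completed.
    """
    return np.array([
        [1.0,   EA[0],     EA[1],     EB[0],     EB[1]],
        [EA[0], EA[0],     u_A,       EAB[0, 0], EAB[0, 1]],
        [EA[1], u_A,       EA[1],     EAB[1, 0], EAB[1, 1]],
        [EB[0], EAB[0, 0], EAB[1, 0], EB[0],     u_B],
        [EB[1], EAB[0, 1], EAB[1, 1], u_B,       EB[1]],
    ])

# Uniformly random behavior: <A_x> = <B_y> = 1/2, <A_x B_y> = 1/4.
EA = EB = np.array([0.5, 0.5])
EAB = np.full((2, 2), 0.25)
# Its classical model suggests the completion u_A = u_B = 1/2.
G = moment_matrix_level1(EA, EB, EAB, u_A=0.5, u_B=0.5)
min_eig = np.linalg.eigvalsh(G).min()  # >= 0: a feasible completion exists
```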

The task is now the following: given some statistics from a Bell scenario, $p(ab|xy)$, and a level of the hierarchy (say $Q_1$), find real numbers for the unknown elements of the moment matrix such that it is positive semidefinite. If such a completion is possible then $p \in Q_1$. We can see that this problem is of the form (2) by defining the matrices $A_i$ to have ones only in the matrix indices where physical elements appear and zeros otherwise, and by identifying the entries of $b$ with the appropriate elements of $p(ab|xy)$. As such, a dual formulation also exists: if there is a positive semidefinite completion of the dual constraint matrix (see Eq. (3)), then we certify that $p \notin Q_1$. Otherwise, if we do not find a positive semidefinite matrix for either the primal or the dual, then we are left in uncertainty. In the following we train neural networks for the primal and dual of this problem for different levels of the NPA hierarchy and examine their performance.

We examine two ways to generate input behaviors $p(ab|xy)$. Since local behaviors can easily be screened out by checking their CHSH values, we do not need to train the machine to perform well on those. Thus we take samples from the nonlocal part of the no-signaling set. We examine only a non-redundant part of this set by not considering the 8-fold symmetry under relabeling. As a result, the set we consider is spanned by the 8 extremal local points and the PR-box which maximize the canonical CHSH inequality (see the yellow dots in Fig. 2(b) for an illustration). With a slight abuse of notation we continue to label this set as $\mathcal{NS}$.

In the first sampling technique, we generate random samples from this set using hit-and-run sampling Smith (1984); Bélisle et al. (1998). In the second, which we refer to as weighted vertex sampling, we take uniformly randomly weighted mixtures of the 9 vertices, with 8-fold weight on the PR-box. Formally, if the local vertices are $l_1, \dots, l_8$ and the PR-box is $p_{\mathrm{PR}}$, then a training sample is generated as

$$p = \frac{\sum_{i=1}^{8} w_i\, l_i + 8 w_9\, p_{\mathrm{PR}}}{\sum_{i=1}^{8} w_i + 8 w_9},$$

where the $w_i$ are uniformly drawn random numbers. This sampling technique resulted in more reliable training for the dual neural network.
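The weighted vertex sampling can be sketched as follows (the normalization and the toy two-dimensional stand-in vertices are our own illustration of the weighting described above):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_weighted_vertex(local_vertices, pr_box):
    """Random convex mixture of the 8 local vertices and the PR-box,
    with the PR-box weight scaled 8-fold before normalizing."""
    w = rng.uniform(size=9)
    w[8] *= 8.0            # 8-fold weight on the PR-box
    w /= w.sum()           # normalize to a convex combination
    vertices = np.vstack([local_vertices, pr_box])
    return w @ vertices

# Toy 2-dimensional stand-ins for the vertices, just to show the mechanics.
local_vertices = rng.uniform(size=(8, 2))
pr_box = np.array([0.0, 1.0])
sample = sample_weighted_vertex(local_vertices, pr_box)
```

In the actual setting the vertices are the 9 behaviors $l_1, \dots, l_8, p_{\mathrm{PR}}$ flattened into probability vectors.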

Among other factors, the accuracy of the trained network depends on the architecture of the neural network, in particular on its size. In the following we examine a family of neural networks, one for each level of the hierarchy. We keep the depth at a constant 8 layers, while we vary the width in order to accommodate the growing complexity of higher levels. In particular, we take the width to be three times the number of elements which need to be predicted. The last layer differs from the preceding ones: first, its width is exactly the number of elements which need to be predicted. Second, whereas in the other layers we use exponential linear activations, we use a linear activation for the final layer of the primal network. For the dual network a linear activation can lead to instability, since the optimal solution for maximizing the smallest eigenvalue of the dual matrix in Eq. (3) can tend to infinity. Thus we limit the outputs to a bounded range by using a hyperbolic tangent activation. With a suitable choice of this bound, this only minimally restricts the generality of the solver, giving errors well below the standard inaccuracy of the machine learning predictions. For additional details of the training procedure see the Methods section.

IV Results

Figure 3: The maximum of the smallest eigenvalue of the primal and dual SDP matrix as predicted by the machine (blue and red dashed lines, resp.), compared to the true value calculated by an SDP solver (green and orange solid lines, resp.), as a function of the mixing parameter $v$ of the isotropic line, for several different levels of the hierarchy. For the dual methods, in order to have numerical stability the outputs are restricted to a bounded range. The Tsirelson bound at $v = 1/\sqrt{2}$ is denoted by the solid magenta line.

The memory usage of the neural network method is approximately the same as for conventional SDP solvers, with the construction of the moment matrix being the primary bottleneck. Hence, in the following we compare the two approaches from two other aspects, their accuracy and runtime.

First let us examine Tsirelson's bound in detail by using both the primal and the dual neural networks for relaxations $Q_1$, $Q_{1+AB}$, $Q_2$ and $Q_3$ on the isotropic line, i.e. on distributions

$$p_v = v\, p_{\mathrm{PR}} + (1 - v)\, p_{\mathrm{flat}},$$

where $p_{\mathrm{flat}}(ab|xy) = 1/4$ is the completely flat distribution and $v$ is the mixing parameter. The edge of the quantum set, Tsirelson's bound, is at precisely $v = 1/\sqrt{2}$. The isotropic line is portrayed in Fig. 2(b). In Fig. 3 we depict the minimum eigenvalue of the predicted moment matrix as a function of $v$, and contrast it with the maximum possible minimum eigenvalue, calculated by an SDP solver. For the results in Fig. 3 we use hit-and-run training for the primal networks and weighted vertex training for the duals, since the dual neural network showed poor convergence and performance when trained on hit-and-run samples.
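The isotropic line and its CHSH value can be generated directly (our own sketch; along this line CHSH grows linearly as $4v$, hitting $2\sqrt{2}$ at $v = 1/\sqrt{2}$):

```python
import numpy as np

# PR-box and uniform ("completely flat") distribution, indexed [a, b, x, y].
p_pr = np.array([[[[0.5 if (a ^ b) == x * y else 0.0
                    for y in range(2)] for x in range(2)]
                  for b in range(2)] for a in range(2)])
p_flat = np.full((2, 2, 2, 2), 0.25)

def isotropic(v):
    # p_v = v * p_PR + (1 - v) * p_flat
    return v * p_pr + (1 - v) * p_flat

def chsh(p):
    E = np.array([[sum((-1) ** (a + b) * p[a, b, x, y]
                       for a in range(2) for b in range(2))
                   for y in range(2)] for x in range(2)])
    return E[0, 0] + E[0, 1] + E[1, 0] - E[1, 1]

v_tsirelson = 1 / np.sqrt(2)  # edge of the quantum set on this line
```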

We further explore the effect of choosing one training sampling technique versus the other by evaluating the accuracy of the $Q_2$ primal neural network solver for both hit-and-run and weighted vertex sampling. When testing the performance we additionally included a set of correlations coming from random two-qubit pure states measured in random projective bases (all sampled uniformly in the Haar measure), as well as results from the isotropic line. Results are portrayed in Table 1.


Tested on \ Trained on | Hit and run | Weighted vertices
Hit and run | 81% | 60%
Weighted vertices | 77% | 77%
Quantum pure | 80% | 50%
Tsirelson bound (nonlocal part) | 0.176/0.207 = 85% | 0.193/0.207 = 93%

Table 1: Percent of test samples for which the $Q_2$ primal neural network found a positive semidefinite completion, for different training and test sampling techniques. For the test sets, consisting of 10000 samples each, we used only correlations which have positive semidefinite completions. For the Tsirelson bound the percentage is calculated only on the nonlocal part of the isotropic line.

Finally, in Table 2 we compare the time performance of the primal neural network to standard solver software (MOSEK), run on the same personal computer at 100% CPU usage. We note that the timing results depend on the choice of neural network architecture. Recall that for different levels of the hierarchy we kept the depth fixed and increased the width linearly. This was an arbitrary choice, which leads to the accuracy results also visible in Table 2. If one would like higher accuracy on larger problem sizes, one could use larger neural networks, at the cost of somewhat slower evaluation times. We can see that for the choice we made the speed of solving is orders of magnitude quicker than solving with MOSEK, hence there is still much space left for improving the accuracy if required.

| Q1 | Q1AB | Q2 | Q3
SDP variables | 2 | 8 | 22 | 52
NN accuracy | 98% | 79% | 81% | 60%
NN Tsirelson | 99% | 97% | 85% | 77%
NN time per sample | 17 μs | 20 μs | 36 μs | 210 μs
MOSEK time per sample | 9 ms | 10 ms | 13 ms | 17 ms

Table 2: Time per sample and accuracy for different levels of the hierarchy. The neural networks (NN) used all have a depth of 8 and a width which is 3 times the number of SDP variables which need to be predicted.

V Discussion

We have developed a solver for feasibility SDPs of a fixed structure using feedforward neural networks. Our approach is an unsupervised one in the sense that training is done only with random input samples, and the neural network must itself learn the output which gives "the most positive semidefinite" constraint matrix, by constructing that matrix explicitly. This is facilitated through a loss function which motivates the neural network to maximize the smallest eigenvalue of the generated constraint matrix. In future work it could be interesting to explore alternative approaches more in line with supervised learning, where one generates a training set by taking many random inputs and calculating the solutions using conventional solvers; these input-output pairs would then serve as training data for a neural network. Though this could be more precise than the approach explored in this work, it comes with the disadvantage that generating training samples is slow. This is a similar bottleneck as for the training of the method examined in this work, since we must calculate the eigenvalues of the constraint matrix for each training sample, a calculation which for an $n \times n$ matrix typically scales as $O(n^3)$ in practice. We note that we briefly tried a supervised learning approach in which we circumvent these time-consuming training steps by generating random positive semidefinite matrices and asking a feedforward neural network to predict the missing element positions given the known element positions. This gives a quick training procedure; however, it is not as accurate as the technique demonstrated here, since the learned completions do not try to push the matrix to be "as positive semidefinite as possible".

A direction worth exploring in future work would be the use of generative methods. Instead of learning the distribution over outputs given an input (as is done in discriminative techniques), generative neural networks learn the distribution over the inputs, and can generate new instances. Though we were inspired by such generative machine learning-based image inpainting, the task here is to always predict the same unknown elements of the matrix given the constraints $b$. Thus it is unnecessarily expensive to learn a generative distribution over all positive semidefinite matrices, when one only needs to learn the conditional probabilities of some elements given the others. Nonetheless, if sufficiently strong generative techniques were used, their performance could be comparable to ours, though the same trade-off between training speed and accuracy is expected to also be present for generative models.

Examining generative models leads us towards more general scenarios than the one considered here. In the current work we discussed families of feasibility SDP problems where the structure of the problem is fixed. This is a common scenario in applications, where the matrices $A_i$ (the structure of the problem) are essentially fixed, and the task must be solved for different specific instances given by $b$. If this is not the case then in principle the $A_i$ can be provided to the neural network alongside $b$. For the example of matrix completion, a different structure would mean that different elements of the matrix need to be filled in, which, for example, a generative model could tackle, as previously discussed. In the current work we evaluated our approach on tasks where the structure of the problem requires us to fill in a matrix, as shown in Fig. 1, where the entries of the constraint matrix are either completely fixed by $b$ or are free optimization variables. For more general settings, variable elimination on the constraints can help reduce the problem to this form, then allowing our method to be used with slight modifications.

Furthermore, one could examine even more generic tasks by moving outside the realm of semidefinite programs to generic nonlinear objective functions and constraints, as in Eq. (1), where neural networks could be trained to minimize a Lagrangian function incorporating all constraints and objectives with penalty terms. There have been many works addressing optimization problems using machine learning Bengio et al. (2020); however, previous works addressing SDPs specifically have mostly focused on developing neural networks for specialized hardware. In principle these are quick, and for specific problems, such as semidefinite programming, they can guarantee convergence Jiang and Wang (1999); Hopfield and Tank (1985). However, if simulated on a general-purpose digital computer, these neural network optimizations are also slow, since they must evaluate the constraints many times before converging.

The choice of the training set is important, especially since we are working in high-dimensional spaces. For example, training the primal on random quantum points is very inefficient, as most of them are local points. Another example is our preference for weighted vertex sampling for the dual neural network. Most samples generated with hit-and-run sampling lie within the quantum set, so a dual network trained on them focuses on performing well in that region. However, in order to function well on the isotropic line, for example, it must also perform well in the region outside the quantum set, so that it can detect the boundary. We overcame this difficulty by introducing the weighted vertex sampling. In general, some domain knowledge is useful in choosing the training set for such neural network-based SDP solvers.
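For concreteness, hit-and-run sampling can be sketched as below. The sketch uses the unit hypercube as the feasible region purely for illustration; in the paper the region is the no-signaling polytope, and only the chord computation changes for a general polytope.

```python
import numpy as np

def hit_and_run_box(x, n_steps, rng):
    """Hit-and-run sampling, illustrated on the unit hypercube [0, 1]^n.
    Each step picks a uniformly random direction, computes the chord of
    the feasible region through the current point, and jumps to a
    uniformly random point on that chord."""
    x = x.copy()
    for _ in range(n_steps):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)
        # Per-coordinate step limits t keeping 0 <= x + t * d <= 1.
        t_a = (0.0 - x) / d
        t_b = (1.0 - x) / d
        t_min = np.minimum(t_a, t_b).max()  # tightest lower chord endpoint
        t_max = np.maximum(t_a, t_b).min()  # tightest upper chord endpoint
        x = x + rng.uniform(t_min, t_max) * d
    return x

rng = np.random.default_rng(1)
sample = hit_and_run_box(np.full(8, 0.5), n_steps=50, rng=rng)
```

Starting from an interior point, every iterate stays inside the region, and the chain mixes towards the uniform distribution over the polytope.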

Finally, an interesting question worth exploring more in depth is the advantage of neural networks versus conventional solvers. In which scenarios could such a neural network approach have an advantage compared to standard solvers? Clearly, in applications where many instances of semidefinite programs must be solved quickly, a neural network-based approach could be useful, for example in calculating the safe zones of self-driving cars, collision avoidance Prajna and Jadbabaie (2004); Frazzoli et al. (2012), control systems and robotics tasks Vandenberghe and Boyd (1999); Boyd and Vandenberghe (1997), black-box quantum random number generation and quantum protocols Pironio et al. (2010); Brask et al. (2017), testing many different candidate materials in quantum chemistry and many-body physics tasks Mazziotti (2004); Bravyi et al. (2019); Fukuda et al. (2007); Nakata et al. (2001, 2008); Navascués et al. (2013), or bounding problems in NP Boyd and Vandenberghe (1997); Goemans and Williamson (1995). Furthermore, we note that there are additional potential advantages of using neural network solvers for SDPs. One can use standard neural network analysis tools, such as optimizing over the input to see which inputs trigger the strongest response of the network. For example, in the NPA hierarchy, if the distribution depends on some parameters in a nonlinear manner, then standard SDP solvers could not help us. However, with a trained neural network we could ask it to find the input distribution for which the smallest eigenvalue achieved by the network is as negative as possible, thus finding “very nonlocal” behaviors in a sense.

In conclusion, due to the rich variety of machine learning architectures, there could be a myriad of ways to tackle semidefinite programming. Here we demonstrated a simple feedforward neural network approach to optimization with semidefinite constraints which, by construction, is unable to give false positives. The approach is also applicable to the dual formulation of the SDP. Our neural network, once trained, is orders of magnitude quicker than standard SDP solvers when evaluating multiple instances of the same type of problem. Though it is less precise, the performance is acceptable for the archetypal Bell scenario, and by changing network architectures one can tune the trade-off between precision and speed.

VI Methods

Here we introduce a simple artificial neural network architecture, the multilayer perceptron, and the details of our implementation for feasibility SDP solving. For a more in-depth, pedagogical introduction to modern neural networks we refer the interested reader to Ref. Goodfellow et al. (2016).

A multilayer perceptron is a model for a generic nonlinear multivalued, multivariate function, i.e. a function f: R^n → R^m. It is constructed from a series of layers, each layer consisting of an affine and a subsequent nonlinear transformation. The depth of a neural network is the number of layers used, d. A layer i is characterized by its width w_i and the activation function σ_i used in it, though typically all layers (except the first and the last) have the same width w and activation σ. The output of layer i is

x^{(i)} = σ_i( W_i x^{(i-1)} + b_i ),

where the weight matrix W_i and bias vector b_i parametrize the affine transformation, σ_i is the fixed differentiable nonlinear “activation” function, typically an element-wise operation, and x^{(i-1)} is the input of layer i, with x^{(0)} being the input of the model and x^{(d)} being the output. The free, trainable parameters of the model (the W_i and b_i) are often collectively referred to as the weights and are denoted by θ.
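As a minimal sketch of this forward pass (in NumPy rather than the TensorFlow implementation used in this work; the layer sizes below are illustrative):

```python
import numpy as np

def elu(z):
    # Exponential linear unit, applied element-wise (the hidden
    # activation used in this work).
    return np.where(z > 0, z, np.exp(z) - 1.0)

def mlp_forward(x, weights, biases):
    """Forward pass of a multilayer perceptron: each layer applies an
    affine map followed by a nonlinearity, x_i = sigma(W_i x_{i-1} + b_i).
    The final layer is left linear, as in the primal network."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = elu(W @ x + b)
    return weights[-1] @ x + biases[-1]

# Tiny example: depth 3, input size 8, hidden width 24, output size 24.
rng = np.random.default_rng(0)
sizes = [8, 24, 24, 24]
weights = [rng.normal(0, 0.1, (sizes[i + 1], sizes[i])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
y = mlp_forward(rng.normal(size=8), weights, biases)
```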

A neural network is trained on so-called training data D, with the objective of minimizing a loss function L(D; θ). The most common algorithm to train a neural network is stochastic gradient descent, in which a random subset of the training data, a batch B ⊂ D, is taken and evaluated by the neural network. For each batch the loss is evaluated and the weights are updated according to the gradient of the loss function,

θ ← θ − η ∇_θ L(B; θ),

where η is the learning rate. The process is repeated with all batches of the data set. If the data is finite, the algorithm reiterates over the data for many epochs. However, in the application analyzed in this work new training data can be artificially generated on the go, hence we do not use epochs but use fresh data for each round. There are many variants of stochastic gradient descent. In the current work we additionally applied momentum to the updates, i.e. the updater remembers the update of the previous step, Δθ_{t−1}, and adds it with a weight μ, such that

Δθ_t = −η ∇_θ L(B; θ) + μ Δθ_{t−1},   θ ← θ + Δθ_t.

This is an example of a procedure which helps to avoid getting stuck in local minima during training. Typically the learning rate is decayed during training so that the model can be fine-tuned towards the end.
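The momentum update above can be sketched as follows; the toy quadratic loss is purely for illustration, while the learning rate and momentum values are those quoted later for the primal training schedule.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.005, momentum=0.8):
    """One stochastic-gradient-descent update with momentum: the
    velocity remembers the previous update and is added with weight
    `momentum` to the new gradient step."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

# Minimize a toy quadratic loss L(theta) = ||theta||^2 / 2, whose
# gradient is simply theta.
theta = np.ones(4)
v = np.zeros(4)
for _ in range(2000):
    theta, v = sgd_momentum_step(theta, theta, v)
```

After enough steps the iterate converges to the minimizer at the origin.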

Training is typically done until the loss appears to converge or until a fixed number of steps. Once trained, the neural network can be evaluated on new input instances, and is often benchmarked on a test set, which is composed of data that it hasn’t seen during training.

In the current work we used neural networks of depth 8 and a width of 3 times the number of SDP optimization parameters (see Table 2 for primal task sizes). The final layer width is precisely the number of SDP optimization parameters. The input size is always 8, as this is the number of parameters needed to characterize the conditional probability vector p for binary variables (i.e. the number of unique, physically observable elements in Eq. (10)). The loss used is

L(x, y) = − min eig( M(x, y) ),

where we introduced the function M, which creates the constraint matrix from the inputs x and outputs y (see Eqs. (II), (10) for the primal and Eq. (II) for the dual), and the eig function, which returns the set of eigenvalues of a matrix. Such an eigenvalue function is implemented in the TensorFlow machine learning library Martín Abadi et al. (2015). Note that the dependence on the weights θ is implicit in the output y. Also, recall that the input x corresponds to the constraint data in the generic introduction in the main text, or to the probability distribution p in the examined quantum case study.
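A NumPy sketch of this loss is given below (the paper's implementation uses TensorFlow's differentiable eigenvalue routine instead; the constraint-matrix builder M is problem-specific and is replaced here by an explicit test matrix):

```python
import numpy as np

def feasibility_loss(M):
    """Negated smallest eigenvalue of the constraint matrix M.
    Driving the loss to zero or below means M is positive semidefinite,
    i.e. the network output is a feasible point of the SDP."""
    eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues of a symmetric matrix
    return -eigenvalues.min()

psd_loss = feasibility_loss(np.eye(3))            # PSD matrix: loss <= 0
infeasible_loss = feasibility_loss(np.diag([1.0, -2.0]))  # loss > 0
```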

Before feeding any input correlation into the network (be that for training or testing), we permute the labels in p such that it maximally violates the canonical CHSH inequality (see e.g. Ref. Brunner et al. (2014)), then we apply the standard preprocessing used with neural networks, namely shifting and scaling each feature to zero mean and unit variance. In each round of training we generate 10000 random p vectors, either through hit-and-run sampling of the no-signaling polytope or with the weighted vertex sampling (see Eq. (11)). We perform 800 rounds of training in total with stochastic gradient descent, of which the first 200 rounds have a fixed learning rate of 0.005 (0.0001 for the dual network), while for the rest the rate decays by a factor of 0.99 per round; the momentum is kept at a constant 0.8.
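The preprocessing and learning-rate schedule can be sketched as follows; the function names are illustrative, and the mean and standard deviation are assumed to be estimated from the training data.

```python
import numpy as np

def standardize(features, mean, std):
    """Standard neural-network preprocessing: shift and scale each
    feature to zero mean and unit variance."""
    return (features - mean) / std

def learning_rate(round_index, base=0.005, flat_rounds=200, decay=0.99):
    """Schedule from the text: a fixed rate for the first 200 rounds,
    then decay by a factor of 0.99 per round."""
    if round_index < flat_rounds:
        return base
    return base * decay ** (round_index - flat_rounds)

standardized = standardize(np.array([1.0, 3.0]), mean=2.0, std=1.0)
```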

The neural network used for the dual task (Eq. (II)) is slightly different from the one used for the primal. We have to restrict the outputs to be finite, since the optimal solution tends to infinity for points where the dual is feasible; we do this by replacing the linear activation used in the final layer of the primal neural network with a hyperbolic tangent activation function. All other activations in the networks are exponential linear units. Even with this restriction, the dual neural network tries to increase its weights as much as possible in order to get marginal improvements in the hyperbolic tangent functions (pushing the outputs of neurons as close to 1 as possible). This can lead to exploding gradients, which we counteract by using the default Keras L2 activity regularizer on each layer with a small weight Chollet et al. (2015).
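The two dual-specific ingredients can be sketched in NumPy as below; the regularization weight here is illustrative, not the value used in this work.

```python
import numpy as np

def dual_output_layer(h, W, b):
    """Final layer of the dual network: a hyperbolic tangent keeps every
    output in (-1, 1), since the optimal dual solution can run off to
    infinity at feasible points."""
    return np.tanh(W @ h + b)

def l2_activity_penalty(activations, weight=1e-3):
    """L2 activity regularization (as provided by Keras), penalizing
    large layer outputs to counteract exploding weights."""
    return weight * np.sum(activations ** 2)

# Even with very large pre-activations, the outputs stay bounded.
out = dual_output_layer(np.ones(3), 10.0 * np.eye(3), np.zeros(3))
penalty = l2_activity_penalty(np.array([1.0, 2.0]))
```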

VII Acknowledgments

Most of the computations were performed at University of Geneva on the Baobab cluster. TK, YC and NB acknowledge financial support from the Swiss National Science Foundation (Starting grant DIAQ, and QSIT), and the European Research Council (ERC MEC). JB and DC acknowledge support from Fundació Cellex, Fundació Mir-Puig, Generalitat de Catalunya (CERCA, AGAUR SGR 1381). JB additionally acknowledges support from the Spanish MINECO (Severo Ochoa SEV-2015-0522) and the AXA Chair in Quantum Information Science. DC additionally acknowledges support from the Government of Spain (Ramon y Cajal fellowship, FIS2020-TRANQI and Severo Ochoa CEX2019-000910-S) and ERC AdG CERQUTE.