I Introduction
Neural networkbased algorithms have produced astounding results for many tasks. A notable example is that of image inpainting. Given an image with missing patches in it, properly trained algorithms can fill the patch in convincingly, matching both local and global expectations. In many fields of science and technology we find ourselves facing a similar problem: given some elements of a matrix, find a completion of it, such that the whole matrix is positive semidefinite. Such a problem is an instance of a feasibility semidefinite program (SDP), which is a widely studied class of convex optimization problems, having a broad range of applications
Wolkowicz et al. (2012); Vandenberghe and Boyd (1999). SDPs can be solved in polynomial time and by duality are typically able to provide a certificate of reaching the optimum. However, in various usecases solving the problem exactly becomes impractical due to extensive runtimes, which can limit its utility in many realtime applications. This has led to several approaches to reduce the runtime, even if at the cost of losing accuracy Majumdar et al. (2020); Arora et al. (2005); Jain and Yao (2011); Hazan and Koren (2016); Bidyarthy et al. (2014); Mazziotti (2011); Shah et al. (2016).Here we explore an alternative solver for SDPs using neural networks. We replace the standard optimization procedure with a single nonlinear function, a feedforward artificial neural network. We focus on feasibility SDPs, where, given a description of the problem as an input, the neural network will guess the optimization parameter values such that they complete a matrix in a positive semidefinite way (illustrated in Fig. 1). If the prediction of the network results in a positive semidefinite matrix then we obtain a certificate of feasibility. If it fails to do so, then this could be either because there does not exist a positive semidefinite completion or because the neural network just didn’t find it. Luckily, the SDP has a dual formulation, in which the dual task is feasible if and only if the primal is infeasible. Thus we also check the dual task with a second neural network in a similar manner as the primal. It is only if both networks fail that we do not get a definite statement of feasibility. In other words, for the neural networkbased method the duality gap is not guaranteed to be zero, while dedicated SDP solvers can in principle close this gap.
We pay this price in precision in order to gain a significantly improved runtime. Contrary to traditional solvers, for each new instance of a problem our neural network does not have to start from scratch to solve it, since it learned the structure of the problem previously, during training, which allows for a strong advantage in runtime. The flexibility of neural networks allow the user to tune this tradeoff in accuracy versus runtime by changing the size and architecture of the neural network. The technique could be useful for screening data before running slower, albeit more precise solvers, when one has many instances of similarly structured optimization tasks. Additionally, it can be very useful in scenarios where quick or even realtime solving of SDPs are required, such as in calculating the safe zones of selfdriving cars, collision avoidance Prajna and Jadbabaie (2004); Frazzoli et al. (2012), control systems and robotics tasks Vandenberghe and Boyd (1999); Boyd and Vandenberghe (1997), blackbox quantum random number generation and quantum protocols Pironio et al. (2010); Brask et al. (2017), analyzing ground state energies in quantum chemistry and manybody physics tasks Mazziotti (2004); Bravyi et al. (2019); Fukuda et al. (2007); Nakata et al. (2001, 2008); Navascués et al. (2013) or bounding problems in NP Boyd and Vandenberghe (1997); Goemans and Williamson (1995).
We examine the performance of the method on a central scenario in quantum theory, Bell nonlocality Bell (1964); Brunner et al. (2014). At a glance, the basic task is to decide what kind of shared resource is necessary or sufficient for creating correlations between two parties. In particular, quantum resources, as opposed to classical ones, can lead to socalled nonlocal behavior. Previous works have used machine learning in tackling the question of whether classical resources are sufficient Kriváchy et al. (2020); Bharti et al. (2019)
, in classifying behavior by learning from many samples
Canabarro et al. (2019) or by proposing new experiments Melnikov et al. (2020). In the current work we turn towards another question: is a quantum resource sufficient to reproduce the correlations? A hierarchy of SDPs proposed by Navascués, Pironio and Acín (NPA) helps in answering this question Navascués et al. (2007). Each step of the hierarchy is a feasibility SDP which demonstrates whether the correlations under scrutiny fall within a given relaxation of the set of quantum correlations. For higher levels of the hierarchy the relaxations become tighter, and eventually converge to the true quantum set Navascués et al. (2008); Doherty et al. (2008). Each SDP in the hierarchy is a feasibility SDP in which some elements of the socalled moment matrix are fixed by physically observed correlations, and the other elements need to be completed such that the whole matrix is positive semidefinite.
We apply our technique to this problem by training two neural networks for each level of the hierarchy, one for the primal and one for the dual problem. For a fixed level of the hierarchy, the physically observed correlations define the SDP, so this is the input we use for the neural network. For training, we use only random correlations as inputs, without having to solve the SDP exactly even once. Once trained, the neural networks can predict matrix completions orders of magnitude quicker for a new correlation than conventional solvers, while keeping a decent accuracy.
In the following we introduce our approach to solving feasibility SDPs in detail. We follow by describing the Bell scenario and the hierarchy of SDPs which approximate the quantum set of correlations. Then we present the performance results and conclude with a discussion. Finally, we present a more detailed introduction to neural networks and the details of our implementation in the Methods section.
Ii Neural networks for solving SDP
Consider the following family of generic nonlinear optimization tasks, parametrized by a vector .
s.t.  (1) 
where , . We focus on a differentiable objective and constraints
. The general spirit of our work is to encode the optimization problem in the loss function of an artificial neural network, such that it learns to satisfy the constraints and minimize the objective function, i.e.
, where is introduced to balance the objective and constraints in the loss function. Then we train the neural network to minimize this loss by inputting random configuration vectors , and asking the neural network to output , a solution which it guesses to be optimal. After training on many random configurations from within the family the neural network will learn the structure of the problem, and when evaluated on a new instance, it will predict a closeto optimal solution. Within this family of (fixedsize) optimization problems we thus automate the solving process, since for a new instance of an optimization problem (characterized by ), instead of solving it from scratch, the neural network predicts the solution almost instantaneously.In the current work we examine an important subclass of optimization problems, feasibility SDPs Wolkowicz et al. (2012). First, we do this because SDPs are specific instances of conic programming and thus have a dual problem, the optimal solution of which approaches the primal’s from below and achieves a zero duality gap. With our technique we can train a neural network also for the dual task, and by evaluating both the dual and primal neural networks we can obtain bounds on how well the two models together manage to reduce the duality gap. Second, we only examine feasibility problems because in these the machine’s task is very pronounced and its success is easy to check: predict a variable such that . If successful, we have a certificate of feasibility. Finally, note that wellformulated SDPs with a bounded objective function can be approximated arbitrarily well with a series of feasibility SDPs using bisection techniques, with the accuracy increasing exponentially with the number of feasibility SDPs solved.
A feasibility SDP can be expressed as
Find  
s.t.  (2)  
where is a matrix, is the HilbertSchmidt inner product, and the matrices and vector encode the constraints. The dual task of the primal SDP can be stated as
Find  
s.t.  (3) 
where is an arbitrary positive number, which we introduced in order to be able to state the dual problem also as a feasibility task. For feasibility problems, either the primal or the dual task has a solution.
Our main tools for solving these SDPs are artificial neural networks. A feedforward neural network is a model for a generic multidimensional nonlinear function. One of its simplest realization, a multilayer perceptron, is characterized by the number of neurons per layer (width), the number of layers (depth), and the activation functions used at the neurons, which altogether model an iterative sequence of parametrized affine and fixed nonlinear transformations on the input, i.e.
(4) 
where the matrix and vector
parametrize the linear transformation,
is a fixed differentiable nonlinear function, and is the input of layer , being the input of the model and being the output. During training, the parameters of the model () are updated such that they minimize a differentiable loss function of the training set. Once trained, the neural network can be evaluated on new input instances. For a more detailed introduction to neural networks we refer the reader to the Methods section or to Ref. Goodfellow et al. (2016).Our approach to solving SDPs via neural networks is illustrated in Fig. 1. We examine a family of feasibility SDP problems, where is fixed and only the values of are different. We input the constraints of the SDP () in a feedforward neural network. The neural network outputs the optimization variables (the free values of for the primal or for the dual) and the loss is taken to be minus one times the smallest eigenvalue of the constraint matrix. Thus the neural network tries to push the lowest eigenvalue of the constraint matrix to be as large as possible. For a test sample, if this smallest eigenvalue of the primal constraint is positive, then the machine has found a positive semidefinite completion of , i.e. we obtain a certificate of feasibility. Alternatively if the dual neural network finds a positive semidefinite completion of (II), we obtain a certificate of infeasibility of the primal task. It is only if neither the primal nor the dual neural network give positive semidefinite solutions that we are uncertain of the feasibility.
Iii Case study in quantum nonlocality
To examine the performance of the method, we consider a hierarchy of feasibility SDPs, where the constraint matrices naturally have several elements fixed and the others left free. In the Navascués–Pironio–Acín (NPA) hierarchy Navascués et al. (2007, 2008)
, using such SDPs we can explore the limitations of correlations coming from shared quantum resources. The hierarchy can be applied in a variety of scenarios such as estimating randomness in device independent scenarios
NietoSilleras et al. (2014); Bancal et al. (2014), network nonlocality PozasKerstjens et al. (2019); Wolfe et al. (2020), sequential Bell tests Bowles et al. (2020) or bounding ground state energies Navascués et al. (2013), but here, for concreteness and easy benchmarking, we examine the standard bipartite Bell scenario, which is an excellent sandbox for examining the strength of correlations between two systems Bell (1964); Brunner et al. (2014).In the Bell scenario a source is distributed to two parties, Alice and Bob, and they each additionally receive an input ( and , respectively), and produce an output ( and , respectively), as illustrated in Fig. 2(a). For the current work we will only consider discrete inputs and outputs of cardinality 2, i.e. . We say that there is no signaling among the parties if the statistics of the inputs and outputs obey
(5)  
(6) 
Any correlation for which these constraints hold are in the set of nosignaling correlations . We enforce nosignaling and that the inputs are independent from the source. Then, just by observing the statistics of the inputs and the outputs of the two parties,
, one can in certain cases deduce that stronger than classical sources were used, e.g. quantum or postquantum sources. With any classical source, the obtained statistics can be described by a shared random variable
, constraining the correlations to be of the form(7) 
for some conditional probability distributions, otherwise known as response functions, and some probability distribution . Any correlation which can be described by such a model (otherwise known as a local hidden variable model) is within the local set .
The Born rule of quantum theory postulates that for any quantum source, must have the form
(8) 
where
is a Hermitian positive semidefinite matrix (or operator, more generally) with unit trace, and can be thought of as the quantum analogue of a probability distribution,
are projective matrices such that , and similarly for . The set represents Alice’s measurement when her input is and output is , and analogously for Bob. Any correlation having a quantum explanation according to the above equation is in the set .The nosignaling set and the local set are polytopes Brunner et al. (2014), bounded by convex combinations of a finite set of vertices, or equivalently by linear inequalities. For the local set there are 8 nontrivial inequalities, all of them equivalent up to relabeling to the same Clauser–Horne–Shimony–Holt (CHSH) inequality, which is a linear expression bounded by the value for local behaviors Clauser et al. (1969). This inequality defines a facet, spanned by 8 local vertices. The nosignaling set is bounded by 24 vertices of which 16 are also local vertices, and 8 are genuinely nosignaling vertices, all equivalent under relabeling to the Popescu–Rohrlich box (PRbox) Popescu and Rohrlich (1994); Rastall (1985); Khalfin and Tsirelson (1985); Barrett et al. (2005), defined as
(9) 
which gives a CHSH value of , far above the local bound. The relation between and is visualized on a twodimensional slice of the space in Fig. 2(b).
Quantum behaviors can at most achieve a CHSH value of , the Tsirelson bound, thus achieving nonlocality Cirel’son (1980). However, the quantum set is not a polytope and is thus more difficult to characterize than the local or nosignaling sets. An additional difficulty is that one can not in general bound the dimension of the density matrix required. A general technique for approximating the quantum set is an infinite hierarchy of relaxations, known as the NPA hierarchy Navascués et al. (2007). Each relaxation is defined by a finite set of strings of moments of the quantum measurement operators of Alice and Bob. In this paper we will examine the relaxations , AB, , , where the number denotes the maximal degree of moments used in the relaxation, and for AB only includes single moments and crossmoments of Alice’s and Bob’s measurement, a commonly used intermediate level between and . The set is a subset of , and as approaches infinity we are guaranteed to recover the actual quantum set Navascués et al. (2008); Doherty et al. (2008).
The basic idea of the relaxations is that if a distribution has a quantum description according to Eq. (8), then the socalled moment matrix should be positive semidefinite. A moment matrix, , is constructed by taking a set of strings of moments of measurement operators, e.g. for one takes . Other single moments are not needed due to normalization, so for clarity we leave away the output superscript from now on. One constructs the moment matrix by taking the expectation value of the dyadic product of and its elementwise adjoint set , which in the case of single moments is the same as since the projectors are selfadjoint. The moment matrix is the following, where the is the expectation value with respect to , i.e. . Only the upper triangle is shown as is symmetric.
(10) 
For any behavior derived from some quantum source , such a moment matrix must be positive semidefinite. Notice that even without knowing , almost all elements can be identified with the observed statistics via the Born rule in Eq. (8). However, a few elements are unphysical and can not be observed, since they would require the joint measurement of two of Alice’s or Bob’s operators, which is prohibited for noncommuting measurements. For there are only two such unknown elements, denoted by bold text in the matrix (10), however for higher levels their number grows quickly, namely there are 8, 22, 52, 92 and 142 unknown elements in the moment matrices of levels AB, while each of them have the same number of elements fixed by the statistics .
The task is now the following: given some statistics from a Bell scenario, , and a level of the hierarchy (say ), find real numbers for the unknown elements of the moment matrix such that it is positive semidefinite. If such a completion is possible then . We can see that this problem is of the form (II) by defining the matrices to have ones only in the matrix indices where physical elements appear and zero otherwise, and by identifying the entries of with the appropriate elements of . As such, a dual formulation also exists, such that is if there is a positive semidefinite completion of the dual constraint matrix (see Eq. (II)), then we certify that . Otherwise, if we do not find a positive semidefinite for either the primal or dual, then we are left in uncertainty. In the following we train neural networks for the primal and dual of this problem for different levels of the NPA hierarchy and examine their performance.
We examine two ways to generate input behaviors . Since local behaviors can easily be screened out by checking their CHSH values, we do not need to train the machine to perform well on those. Thus we take samples from the nonlocal part of the nosignaling set, . We examine only a nonredundant part of this set by not considering the 8fold symmetry under relabeling. As a result the set we consider is spanned by the 8 extremal local points and the PRbox which maximize the canonical CHSH inequality (see the yellow dots in Fig. 2(b) for an illustration). With a slight abuse of notation we continue to label this set as .
In the first sampling technique, we generate random samples from using hit and run sampling Smith (1984); Bélisle et al. (1998). In the second, which we will refer to as weighted vertex sampling we take uniformly randomly weighted mixtures of the 9 vertices of , with 8fold weight on the PRbox. Formally, if the local vertices are and the PRbox is , then a training sample is generated as
(11) 
where the are uniformly drawn random numbers. This sampling technique resulted in a more reliable training for the dual neural network.
Among other factors, the accuracy of the trained network depends on the architecture of the neural network, in particular on its size. In the following we will examine a family of neural networks, one for each level of the hierarchy. We keep the depth at a constant of 8 layers, while we vary the width in order to accompany the growing complexity of higher levels. In particular we take the width to be three times the number of elements which need to be predicted. The last layer is different from the preceding ones, first off, because its width is exactly the size of the number of elements which need to be predicted. Second, whereas in other layers we use exponential linear activations, we use a linear activation for the final layer of the primal network. For the dual network using linear activation can lead to instability, since the optimal solution to for maximizing the eigenvalue of the dual matrix in Eq. (II) can tend to infinity. Thus we limit the outputs such that by using a tangent hyperbolic activation. For a small enough (e.g. we used ) this only minimally restricts the generality of the solver, giving errors well below the standard inaccuracy of the machine learning predictions. For additional details of the training procedure see the Methods section.
Iv Results
The memory usage of the neural network method is approximately the same as for conventional SDP solvers, with the construction of the moment matrix being the primary bottleneck. Hence, in the following we compare the two approaches from two other aspects, their accuracy and runtime.
First let us examine Tsirelson’s bound in detail by using both the primal and the dual neural networks for relaxations 1, 1AB, 2 and 3 on the isotropic line, i.e. on distributions
(12) 
where is the completely flat distribution and is the mixing parameter. The edge of the quantum set, Tsirelson’s bound, is at precisely . The isotropic line is portrayed in Fig. 2(b). In Fig. 3 we depict the minimum eigenvalue of the predicted moment matrix as a function , and contrast it to the maximum possible minimum eigenvalue, calculated by an SDP solver. For the results in Fig. 3 we use hit and run training for the primal networks and weighted vertex for the duals, since the dual neural network showed poor convergence and performance when trained on hit and run sampling.
We further explore the effect of choosing one training sampling technique versus the other by evaluating the accuracy of the
2 primal neural network solver for both hit and run and weighted vertex sampling. When testing the performance we additionally included a set of correlations coming from random twoqubit pure states measured in random projective bases (all sampled uniformly in the Haar measure) as well as results from the isotropic line. Results are portrayed in Table
1.Tested on  Trained on  

Hit and run  Weighted vertices  
Hit and run  81%  60% 
Weighted vertices  77%  77% 
Quantum pure  80%  50% 
Tsirelson bound (nonlocal part)  0.176/0.207 = 85%  0.193/0.207 = 93% 
Finally, in Table 2 we compare the time performance of the primal neural network to standard solving software (MOSEK), solved on the same personal computer at 100% CPU usage. We note that the timing results depend on the choice of neural network architecture. Recall that for different levels of the hierarchy we kept the depth fixed and increased the width linearly. This was an arbitrary choice which leads to the accuracy results, also visible in Table 2. If one would like to have higher accuracy on larger problem sizes, one could use larger neural networks, at the cost of having somewhat slower evaluation times. We can see that for the choice we made the speed of solving is orders of magnitude quicker than solving with MOSEK, hence there is still much space left for improving the accuracy if required.
Q1  Q1AB  Q2  Q3  
SDP variables  2  8  22  52 
NN accuracy  98%  79%  81%  60% 
NN Tsirelson  99%  97%  85%  77% 
NN time per sample  17 s  20 s  36 s  210 s 
MOSEK time per sample  9 ms  10 ms  13 ms  17ms 
V Discussion
We have developed a solver for feasibility SDPs of a fixed structure using feedforward neural networks. Our approach is an unsupervised learning one in the sense that training is done only with random input samples, and the neural network must learn itself the best output, which gives “the most positivesemidefinite” constraint matrix by constructing it explicitly. This is facilitated through a loss function which motivates the neural network to maximize the smallest eigenvalue of the generated constraint matrix. In future work it could be interesting to explore alternative approaches more in line with supervised learning, where one generates a training set by taking many random inputs and calculating the solution using conventional solvers, which inputoutput pairs would be training data for a neural network. Though it could be more precise than the approach explored in this work, it comes with the disadvantage that generating training samples is slow. This is a similar bottleneck as for the training of the method examined in this work, since we must calculate the eigenvalues of the constraint matrix for each training sample, which calculation for an
matrix typically scales as in practice. We note that we have tried briefly to do a supervised learning approach in which we circumvent these timely training steps by generating random positive semidefinite matrices and asking a feedforward neural network to predict the missing element positions given the known element positions. This gives a quick training procedure, however it is not as accurate as the technique demonstrated here, since the learned completions are not trying to push the matrix to be “as positive semidefinite as possible”.A direction worth exploring in future work would be the use of generative methods. Instead of learning the distribution over outputs given an input (as it is done in discriminative techniques), generative neural networks learn the distribution over the inputs, and can generate new instances. Though we were inspired by such generative machine learningbased image inpainting, the task here is to always predict the same unknown elements of the matrix given the constraints . Thus it is unnecessarily expensive to learn a generative distribution over all positive semidefinite matrices, when one only needs to learn the conditional probabilities of some elements given others. Nonetheless, if sufficiently strong generative techniques were used their performance could be comparable to ours, though the same tradeoff between training speed and accuracy is expected to also be present for generative models.
Examining generative models lead us towards more general scenarios than the one considered here. In the current work we discussed families of feasibility SDP problems where the structure of the problem is fixed. This is a common scenario in applications, where (the structure of the problem) is essentially fixed, and the task must be solved for different specific instances given by . If this is not the case then in principle can be provided next to to the neural network. For the example of matrix completion, a different would mean that different elements of the matrix need to be filled in, which for example a generative model could tackle, as previously discussed. In the current work we evaluated our approach on such tasks, i.e. tasks where the structure of the problem requires us to fill in a matrix, as shown in Fig. 1, where the entries of are either completely fixed by or are free optimization variables. For more general settings variable elimination on the constraints can help reduce the problem to this form, then allowing for our method to be used with slight modifications.
Furthermore, one could examine even more generic tasks by moving outside the realm of semidefinite programs to generic nonlinear objective functions and constraints, as in Eq. (II), where neural networks could be trained to minimize a Lagrangian function, incorporating all constraints and objectives with penalty terms. There have been many works addressing optimization problems using machine learning Bengio et al. (2020), however previous works addressing SDP specifically have been mostly focused on developing neural networks for specialized hardware. In principle these are quick, and for specific problems, such as for semidefinite programming, they can guarantee convergence Jiang and Wang (1999); Hopfield and Tank (1985). However, if simulated on a generalpurpose digital computer, these neural network optimizations are also slow, since they must evaluate the constraints many times before converging.
The choice of the training set is important, especially since we are working in highdimensional spaces. For example training on random quantum points for the primal is very inefficient, as most of them are local points. Another example is our preference of using the weighted vertex sampling for the dual neural network. Most samples generated with hit and run sampling are within , so the dual network focuses on performing well on that region. However, in order to function well on the isotropic line, for example, it must also perform well on the region outside , so that it can detect the boundary. We overcame this difficulty by introducing the weighted vertex sampling. In general some domain knowledge is useful in choosing the training set for such neural networkbased SDP solvers.
Finally, an interesting question worth exploring more in depth is the advantage of neural networks versus conventional solvers. In which scenario could such a neural network approach have an advantage compared to standard solvers? Clearly, in application where many instances of semidefinite programs must be solved quickly, a neural networkbased approach could be useful, for example in calculating the safe zones of selfdriving cars, collision avoidance Prajna and Jadbabaie (2004); Frazzoli et al. (2012), control systems and robotics tasks Vandenberghe and Boyd (1999); Boyd and Vandenberghe (1997), blackbox quantum random number generation and quantum protocols Pironio et al. (2010); Brask et al. (2017), testing many different candidate materials in quantum chemistry and manybody physics tasks Mazziotti (2004); Bravyi et al. (2019); Fukuda et al. (2007); Nakata et al. (2001, 2008); Navascués et al. (2013) or bounding problems in NP Boyd and Vandenberghe (1997); Goemans and Williamson (1995). Furthermore, we note that there are additional potential advantages of using neural network solvers for SDPs. One can use standard neural network analysis tools, such as optimizing over the input to see which inputs trigger the strongest response of the network. For example in the NPA hierarchy if the distribution depends on parameters in some nonlinear manner, then SDP solvers could not help us. However, with a trained neural network we could ask it to find the a for which the maximal smallest eigenvalue is as negative as possible, thus to find “very nonlocal” behaviors in a sense.
In conclusion, due to the rich variety of machine learning architectures, there could be a myriad of ways to tackle semidefinite programming. Here we demonstrated a simple feedforward neural network approach to optimization with semidefinite constraints which, by construction, is unable to give false positives. The approach is also applicable to the dual formulation of the SDP. Our neural network, once trained, is orders of magnitudes quicker than standard SDP solvers when evaluating multiple instances of the same type of problem. Though it is less precise, the performance is acceptable for the archetypal Bell scenario, and by changing network architectures, one can tune the tradeoff of precision and speed.
Vi Methods
Here we introduce a simple artificial neural network architecture, the multilayer perceptron, and the details of our implementation for feasibility SDP solving. For a more indepth, pedagogical introduction to modern neural networks we refer the interested reader to Ref.
Goodfellow et al. (2016).A multilayer perceptron is a model for a generic nonlinear multivalued, multivariate function, i.e. . It is constructed of a series of layer, each layer consisting of an affine and a subsequent nonlinear transformation. The depth of a neural network is the number of layers used, . A layer is characterized by its width and the activation function used in it, , though typically all layers (except the first and the last) have the same width and activation . The output of layer is
(13) 
where the weight matrix
and bias vector
parametrize the affine transformation, is the fixed differentiable nonlinear “activation” function, and is typically an elementwise operation, and is the input of layer , being the input of the model and being the output. The free, trainable parameters of the model are often collectively referred to as the weights and are denoted by .A neural network is trained on the socalled training data , with the objective of minimizing a loss function
. The most common algorithm to train a neural network is stochastic gradient descent, in which a random subset of the training data, a batch
, is taken and evaluated by the neural network. For each batch the loss is evaluated and the weights are updated according to the gradient of the loss function,(14) 
where
is the learning rate. The process is repeated with all batches of the data set. If the data is finite, then the algorithm reiterates over the data for many epochs. However, in the application analyzed in this work new training data can be artificially generated on the go, hence we do not use epochs, but use fresh data for each round. There are many variants of stochastic gradient descent. In the current work we additionally applied momentum to the updates, i.e. the updater remembers the update of the previous step,
, and adds this with a weight , such that(15) 
This is an example of a procedure which helps to avoid getting stuck in local minima during training. Typically the learning rate is decayed during training so that we can finetune the model towards the end.
Training is typically done until the loss appears to converge or until a fixed number of steps. Once trained, the neural network can be evaluated on new input instances, and is often benchmarked on a test set, which is composed of data that it hasn’t seen during training.
In the current work we used neural networks of depth 8 and a width is 3 times the number of SDP optimization parameters (see Table 2 for primal task sizes). The final layer width is precisely the number of SDP optimization parameters. The input size is always 8, as this is the number of parameters needed to characterize the conditional probability vector
for binary variables (i.e. the number of unique, physically observable elements in (
10)). The loss used is(16) 
where we introduced the function , which creates the constraint matrix from the inputs and outputs (see Eqs. (II), (10) for the primal and Eq. (II) for the dual), and the
function which returns the set of eigenvalues of a matrix. Such an eigenvalue function is implemented in the TensorFlow machine learning library
Martín Abadi et al. (2015). Note that the dependence on the weights is implicitly in the output . Also, recall that the input corresponds to in the generic introduction in the maintext, or in the examined quantum case study.Before feeding any input correlation into the network (be that training or testing) we permute the labels in such that it maximally violates the canonical CHSH inequality (see e.g. Ref. Brunner et al. (2014)
), then we apply the standard preprocessing used with neural networks, that for each feature the mean is 0 and variance is 1. In each round of training we generate 10000 random
vectors either through hit and run sampling of the no signaling polytope or with the weighted vertex sampling (see Eq. (11)). We perform 800 rounds of training in total with stochastic gradient descent, of which the first 200 rounds have a fixed learning rate of 0.005 (0.0001 for the dual network), while the rest have a decay rate of 0.99 per round, and the momentum is kept at a constant rate of 0.8.The neural network used for the dual task (Eq. II
)) is slightly different from the one used for the primal. We have to restrict the outputs to be finite, since the optimal solution tends to infinity for points where the dual is feasible; we do this by replacing the linear activation used in the final layer of the primal neural network with a tangent hyperbolic activation function. All other activations in the networks are exponential linear functions. Even with this restriction the dual neural network tries to increase its weights as much as possible in order to get marginal improvements in the tangent hyperbolic functions (pushing outputs of neurons as close to 1 as possible). This can lead to exploding gradients, which we counteract by using the default Keras L2 activity regularizer on each layer with weight
Chollet and others (2015).Vii Acknowledgments
Most of the computations were performed at University of Geneva on the Baobab cluster. TK, YC and NB acknowledge financial support from the Swiss National Science Foundation (Starting grant DIAQ, and QSIT), and the European Research Council (ERC MEC). JB and DC acknowledge support from Fundació Cellex, Fundació MirPuig, Generalitat de Catalunya (CERCA, AGAUR SGR 1381). JB additionally acknowledges support from the Spanish MINECO (Severo Ochoa SEV20150522) and the AXA Chair in Quantum Information Science. DC additionally acknowledges support from the Government of Spain (Ramon y Cajal fellowship, FIS2020TRANQI and Severo Ochoa CEX2019000910S) and ERC AdG CERQUTE.
References
 Wolkowicz et al. (2012) H. Wolkowicz, R. Saigal, and L. Vandenberghe, Handbook of Semidefinite Programming: Theory Algorithms and Applications (Springer Science & Business Media, 2012).
 Vandenberghe and Boyd (1999) L. Vandenberghe and S. Boyd, Applications of semidefinite programming, Applied Numerical Mathematics 29, 283 (1999).
 Majumdar et al. (2020) A. Majumdar, G. Hall, and A. A. Ahmadi, Recent Scalability Improvements for Semidefinite Programming with Applications in Machine Learning Control and Robotics, Annual Review of Control, Robotics, and Autonomous Systems 3, 331 (2020).
 Arora et al. (2005) S. Arora, E. Hazan, and S. Kale, in 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05) (2005), pp. 339–348.
 Jain and Yao (2011) R. Jain and P. Yao, in 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (2011), pp. 463–471.
 Hazan and Koren (2016) E. Hazan and T. Koren, A lineartime algorithm for trust region problems, Mathematical Programming 158, 363 (2016).
 Bidyarthy et al. (2014) A. S. Bidyarthy, R. Raman, and D. Thomas, in Proceedings of the 7th ACM India Computing Conference (Association for Computing Machinery, New York, NY, USA, 2014), COMPUTE ’14, pp. 1–8.
 Mazziotti (2011) D. A. Mazziotti, LargeScale Semidefinite Programming for ManyElectron Quantum Mechanics, Physical Review Letters 106, 083001 (2011).
 Shah et al. (2016) S. Shah, A. K. Yadav, C. D. Castillo, D. W. Jacobs, C. Studer, and T. Goldstein, in Computer Vision – ECCV 2016, edited by B. Leibe, J. Matas, N. Sebe, and M. Welling (Springer International Publishing, Cham, 2016), Lecture Notes in Computer Science, pp. 717–735.
 Prajna and Jadbabaie (2004) S. Prajna and A. Jadbabaie, in Hybrid Systems: Computation and Control, edited by R. Alur and G. J. Pappas (Springer, Berlin, Heidelberg, 2004), Lecture Notes in Computer Science, pp. 477–492.
 Frazzoli et al. (2012) E. Frazzoli, Z.H. Mao, J.H. Oh, and E. Feron, Resolution of Conflicts Involving Many Aircraft via Semidefinite Programming, Journal of Guidance, Control, and Dynamics (2012).
 Boyd and Vandenberghe (1997) S. Boyd and L. Vandenberghe, Semidefinite Programming Relaxations of NonConvex Problems in Control and Combinatorial Optimization, in Communications, Computation, Control, and Signal Processing: a tribute to Thomas Kailath, edited by A. Paulraj, V. Roychowdhury, and C. D. Schaper (Springer US, Boston, MA, 1997), pp. 279–287.
 Pironio et al. (2010) S. Pironio, A. Acín, S. Massar, A. B. de la Giroday, D. N. Matsukevich, P. Maunz, S. Olmschenk, D. Hayes, L. Luo, T. A. Manning, et al., Random numbers certified by Bell’s theorem, Nature 464, 1021 (2010).
 Brask et al. (2017) J. B. Brask, A. Martin, W. Esposito, R. Houlmann, J. Bowles, H. Zbinden, and N. Brunner, MegahertzRate SemiDeviceIndependent Quantum Random Number Generators Based on Unambiguous State Discrimination, Physical Review Applied 7, 054018 (2017).
 Mazziotti (2004) D. A. Mazziotti, Firstorder semidefinite programming for the direct determination of twoelectron reduced density matrices with application to manyelectron atoms and molecules, The Journal of Chemical Physics 121, 10957 (2004).
 Bravyi et al. (2019) S. Bravyi, D. Gosset, R. König, and K. Temme, Approximation algorithms for quantum manybody problems, Journal of Mathematical Physics 60, 032203 (2019).
 Fukuda et al. (2007) M. Fukuda, B. J. Braams, M. Nakata, M. L. Overton, J. K. Percus, M. Yamashita, and Z. Zhao, Largescale semidefinite programs in electronic structure calculation, Mathematical Programming 109, 553 (2007).
 Nakata et al. (2001) M. Nakata, H. Nakatsuji, M. Ehara, M. Fukuda, K. Nakata, and K. Fujisawa, Variational calculations of fermion secondorder reduced density matrices by semidefinite programming algorithm, The Journal of Chemical Physics 114, 8282 (2001).
 Nakata et al. (2008) M. Nakata, B. J. Braams, K. Fujisawa, M. Fukuda, J. K. Percus, M. Yamashita, and Z. Zhao, Variational calculation of secondorder reduced density matrices by strong Nrepresentability conditions and an accurate semidefinite programming solver, The Journal of Chemical Physics 128, 164113 (2008).
 Navascués et al. (2013) M. Navascués, A. GarcíaSáez, A. Acín, S. Pironio, and M. B. Plenio, A paradox in bosonic energy computations via semidefinite programming relaxations, New Journal of Physics 15, 023026 (2013).
 Goemans and Williamson (1995) M. X. Goemans and D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM 42, 1115 (1995).
 Bell (1964) J. S. Bell, On the Einstein Podolsky Rosen paradox, Physics Physique Fizika 1, 195 (1964).
 Brunner et al. (2014) N. Brunner, D. Cavalcanti, S. Pironio, V. Scarani, and S. Wehner, Bell nonlocality, Reviews of Modern Physics 86, 419 (2014).
 Kriváchy et al. (2020) T. Kriváchy, Y. Cai, D. Cavalcanti, A. Tavakoli, N. Gisin, and N. Brunner, A neural network oracle for quantum nonlocality problems in networks, npj Quantum Information 6, 1 (2020).
 Bharti et al. (2019) K. Bharti, T. Haug, V. Vedral, and L.C. Kwek, How to Teach AI to Play Bell NonLocal Games: Reinforcement Learning, arXiv:1912.10783 [quantph] (2019).
 Canabarro et al. (2019) A. Canabarro, S. Brito, and R. Chaves, Machine Learning Nonlocal Correlations, Physical Review Letters 122, 200401 (2019).
 Melnikov et al. (2020) A. A. Melnikov, P. Sekatski, and N. Sangouard, Setting Up Experimental Bell Tests with Reinforcement Learning, Physical Review Letters 125, 160401 (2020).
 Navascués et al. (2007) M. Navascués, S. Pironio, and A. Acín, Bounding the Set of Quantum Correlations, Physical Review Letters 98, 010401 (2007).
 Navascués et al. (2008) M. Navascués, S. Pironio, and A. Acín, A convergent hierarchy of semidefinite programs characterizing the set of quantum correlations, New Journal of Physics 10, 073013 (2008).
 Doherty et al. (2008) A. C. Doherty, Y. Liang, B. Toner, and S. Wehner, in 2008 23rd Annual IEEE Conference on Computational Complexity (2008), pp. 199–210.
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
 NietoSilleras et al. (2014) O. NietoSilleras, S. Pironio, and J. Silman, Using complete measurement statistics for optimal deviceindependent randomness evaluation, New Journal of Physics 16, 013035 (2014).
 Bancal et al. (2014) J.D. Bancal, L. Sheridan, and V. Scarani, More randomness from the same data, New Journal of Physics 16, 033011 (2014).
 PozasKerstjens et al. (2019) A. PozasKerstjens, R. Rabelo, Ł. Rudnicki, R. Chaves, D. Cavalcanti, M. Navascués, and A. Acín, Bounding the Sets of Classical and Quantum Correlations in Networks, Physical Review Letters 123, 140503 (2019).
 Wolfe et al. (2020) E. Wolfe, A. PozasKerstjens, M. Grinberg, D. Rosset, A. Acín, and M. Navascues, Quantum Inflation: A General Approach to Quantum Causal Compatibility, arXiv:1909.10519 [quantph] (2020).
 Bowles et al. (2020) J. Bowles, F. Baccari, and A. Salavrakos, Bounding sets of sequential quantum correlations and deviceindependent randomness certification, Quantum 4, 344 (2020).
 Clauser et al. (1969) J. F. Clauser, M. A. Horne, A. Shimony, and R. A. Holt, Proposed Experiment to Test Local HiddenVariable Theories, Physical Review Letters 23, 880 (1969).
 Popescu and Rohrlich (1994) S. Popescu and D. Rohrlich, Quantum nonlocality as an axiom, Foundations of Physics 24, 379 (1994).
 Rastall (1985) P. Rastall, Locality Bells Theorem and Quantum Mechanics, Foundations of Physics 15, 963 (1985).
 Khalfin and Tsirelson (1985) L. A. Khalfin and B. Tsirelson, Quantum and quasiclassical analogs of Bell inequalities., Symposium on the Foundations of Modern Physics pp. 441–460 (1985).
 Barrett et al. (2005) J. Barrett, N. Linden, S. Massar, S. Pironio, S. Popescu, and D. Roberts, Nonlocal correlations as an informationtheoretic resource, Physical Review A 71, 022101 (2005).
 Cirel’son (1980) B. S. Cirel’son, Quantum generalizations of Bells inequality, Letters in Mathematical Physics 4, 93 (1980).
 Smith (1984) R. L. Smith, Efficient Monte Carlo Procedures for Generating Points Uniformly Distributed Over Bounded Regions, Operations Research 32, 1296 (1984).
 Bélisle et al. (1998) C. Bélisle, A. Boneh, and R. Caron, Convergence properties of HitandRun samplers, Communications in Statistics. Stochastic Models 14 (1998).
 Bengio et al. (2020) Y. Bengio, A. Lodi, and A. Prouvost, Machine learning for combinatorial optimization: A methodological tour d’horizon, European Journal of Operational Research (2020).

Jiang and Wang (1999)
D. Jiang and
J. Wang,
A recurrent neural network for realtime semidefinite programming
, IEEE Transactions on Neural Networks 10, 81 (1999).  Hopfield and Tank (1985) J. J. Hopfield and D. W. Tank, “Neural” computation of decisions in optimization problems, Biological Cybernetics 52, 141 (1985).
 Martín Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., TensorFlow: LargeScale Machine Learning on Heterogeneous Systems (2015).
 Chollet and others (2015) F. Chollet and others, Keras (2015).