I Introduction
The possibility of creating stronger than classical correlations between distant parties has deep implications for both the foundations and applications of quantum theory. These ideas were initiated by Bell Bell (1964), with subsequent research leading to the theory of Bell nonlocality Brunner et al. (2014). In the Bell scenario multiple parties jointly share a single classical or quantum source, often referred to as local and nonlocal sources, respectively. Recently, interest in more elaborate causal structures, in which several independent sources are shared among the parties over a network, has been on the rise Branciard et al. (2010, 2012); Fritz (2012). In contrast to the Bell scenario, in even slightly more complex networks the boundary between local and nonlocal correlations becomes nonlinear and the local set nonconvex, greatly complicating rigorous analysis. Though some progress has been made Henson et al. (2014); Tavakoli et al. (2014); Chaves et al. (2015); Wolfe et al. (2016); Rosset et al. (2016); Navascues and Wolfe (2017); Rosset et al. (2017); Chaves (2016); Fraser and Wolfe (2018); Weilenmann and Colbeck (2018); Luo (2018); Renou et al. (2019a); Gisin et al. (2019); Renou et al. (2019b); Pozas-Kerstjens et al. (2019), we still lack a robust set of tools to investigate generic networks from an analytic and numerical perspective.
Here we explore the use of machine learning in these problems. In particular we tackle the membership problem for causal structures, i.e. given a network and a distribution over the observed outputs, we must decide whether it could have been produced by using exclusively local resources. We encode the causal structure into a neural network and ask the network to reproduce the target distribution. By doing so, we approximate the question “does a local causal model exist?” with “is a local causal model learnable?”. Neural networks have proven to be useful ansätze for generic nonlinear functions in terms of expressivity, ease of learning and robustness, both in and outside the domain of physical sciences Melko et al. (2019); Iten et al. (2018); Melnikov et al. (2018); van Nieuwenburg et al. (2017); Carrasquilla and Melko (2017). They have also been used in the study of nonlocality; however, our method differs significantly from previous ones Deng (2018); Canabarro et al. (2019), in particular because it allows us to obtain explicit local models.
In our approach we exploit the fact that both causal structures and feedforward neural networks have their information flow determined by a directed acyclic graph. For any given distribution over observed variables and an ansatz causal structure, we train a neural network which respects that causal structure to reproduce the target distribution. This is equivalent to having a neural network learn the local responses of the parties to their inputs. If the target distribution is inside the local set, then a sufficiently expressive neural network should be able to learn the appropriate response functions and reproduce it. For distributions outside the local set, we should see that the machine cannot approximate the given target. This gives us a criterion for deciding whether a target distribution is inside the local set or not. In particular, if a given distribution is truly outside the local set, then by adding noise in a physically relevant way we should see a clear transition in the machine’s behavior when entering the set of local correlations.
We explore the strength of this method by examining a notorious causal structure, the so-called ‘triangle’ network, depicted in Fig. 1. The triangle configuration is among the simplest tripartite networks, yet it poses immense challenges theoretically and numerically. We use the triangle with quaternary outcomes as a testbed for our neural network oracle. Apart from checking the consistency of our method with known results, we examine the distribution proposed in Gisin (2019), which we refer to as the Elegant distribution from here on. Our method gives solid evidence that the Elegant distribution is outside the local set, as originally conjectured. We also use our method to get an estimate of the noise robustness of this nonlocal distribution.
II Encoding causal structures into neural networks
The methods developed in this work are in principle applicable to any causal structure. For the sake of simplicity we demonstrate how to encode a network nonlocality configuration into a neural network on the simple, yet nontrivial example of the triangle network with quaternary outputs and no inputs. In this scenario three sources, $\alpha$, $\beta$ and $\gamma$, send information through either a classical or a quantum channel to three parties, Alice, Bob and Charlie. The flow of information is constrained such that the sources are independent from each other, and each one sends information to only two of the three parties, as depicted in Fig. 1. Alice, Bob and Charlie process their inputs with arbitrary local response functions, and they each output a number $a, b, c \in \{0, 1, 2, 3\}$, respectively. Under the assumption that each source is independent and identically distributed from round to round, and that the local response functions are fixed, such a scenario is well characterized by the probability distribution $p(a, b, c)$ over the random variables of the outputs.
If quantum channels are permitted from the sources to the parties then the set of distributions is larger than that achievable classically. Due to the nonlocal nature of quantum mechanics, these correlations are often referred to as nonlocal ones, as opposed to local behaviors arising from only using classical channels. In the classical case, the scenario is equivalent to a causal structure, otherwise known as a Bayesian network Pearl (2000); Koller and Friedman (2009).
For the classical setup we can assume without loss of generality that the sources each send a random variable drawn from a uniform distribution on the continuous interval between 0 and 1. If we also incorporate the network constraint then the probability distribution over the parties’ outputs can be written as
$$p(a,b,c) = \int_0^1 \mathrm{d}\alpha\, \mathrm{d}\beta\, \mathrm{d}\gamma\; p_A(a|\beta,\gamma)\, p_B(b|\gamma,\alpha)\, p_C(c|\alpha,\beta). \qquad (1)$$
We now construct a neural network which is able to approximate a distribution of the form (1). We use a feedforward neural network, since it is described by a directed acyclic graph, similarly to a causal structure Goodfellow et al. (2016); Pearl (2000); Koller and Friedman (2009). This allows for a seamless transfer from the causal structure to the neural network model. We simply take the hidden variables $\alpha, \beta, \gamma$ to be inputs and the conditional probabilities $p_A(a|\beta,\gamma)$, $p_B(b|\gamma,\alpha)$ and $p_C(c|\alpha,\beta)$ to be the outputs, for each possible value of $a, b, c \in \{0, 1, 2, 3\}$. So as to respect the communication constraints of the triangle, the neural network is not fully connected, as shown in Fig. 1. We evaluate the neural network for $N_{\text{batch}}$ values of $\alpha, \beta, \gamma$ in order to approximate the joint probability distribution (1) with a Monte Carlo approximation,
$$p_M(a,b,c) = \frac{1}{N_{\text{batch}}} \sum_{i=1}^{N_{\text{batch}}} p_A(a|\beta_i,\gamma_i)\, p_B(b|\gamma_i,\alpha_i)\, p_C(c|\alpha_i,\beta_i). \qquad (2)$$
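The Monte Carlo estimate described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the implementation used for the results: the source labels `alpha`, `beta`, `gamma`, the choice of which pair of sources reaches which party, the single hidden layer of width 8, and the random (untrained) weights are all hypothetical. The point is the restricted connectivity: each party's response network only ever sees its own two sources.

```python
import math
import random

N_OUT = 4  # quaternary outputs per party


def make_mlp(n_in, width, n_out, rng):
    """Random weights for a one-hidden-layer perceptron (hypothetical sizes)."""
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(width)]
    b1 = [rng.gauss(0, 1) for _ in range(width)]
    w2 = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(n_out)]
    b2 = [0.0] * n_out
    return w1, b1, w2, b2


def response(mlp, sources):
    """Conditional distribution over one party's output, given its two sources."""
    w1, b1, w2, b2 = mlp
    hidden = [math.tanh(sum(w * x for w, x in zip(row, sources)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    top = max(logits)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(z - top) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def monte_carlo_distribution(alice, bob, charlie, n_batch, rng):
    """Average the product of local responses over sampled hidden variables."""
    p = [[[0.0] * N_OUT for _ in range(N_OUT)] for _ in range(N_OUT)]
    for _ in range(n_batch):
        alpha, beta, gamma = rng.random(), rng.random(), rng.random()
        # Triangle constraint (assumed labeling): each party sees two sources.
        pa = response(alice, (beta, gamma))
        pb = response(bob, (gamma, alpha))
        pc = response(charlie, (alpha, beta))
        for a in range(N_OUT):
            for b in range(N_OUT):
                for c in range(N_OUT):
                    p[a][b][c] += pa[a] * pb[b] * pc[c] / n_batch
    return p


rng = random.Random(0)
alice, bob, charlie = (make_mlp(2, 8, N_OUT, rng) for _ in range(3))
p_m = monte_carlo_distribution(alice, bob, charlie, n_batch=500, rng=rng)
total = sum(p_m[a][b][c]
            for a in range(N_OUT) for b in range(N_OUT) for c in range(N_OUT))
```

Because every sampled term is a product of three normalized conditional distributions, the estimate is itself a normalized distribution and is local by construction, whatever the weights are.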
The cost function can be any measure of discrepancy between the target distribution $p_t$ and the neural network’s output $p_M$, such as the Kullback–Leibler divergence of one relative to the other, namely $\sum_{a,b,c} p_t(a,b,c) \log \frac{p_t(a,b,c)}{p_M(a,b,c)}$. In order to train the neural network we synthetically generate uniform random numbers for the hidden variables, the inputs. We then adjust the weights of the network after evaluating a minibatch of size $N_{\text{batch}}$ with conventional neural network optimization methods Goodfellow et al. (2016). The minibatch size is chosen arbitrarily and can be increased in order to increase the neural network’s precision.
By encoding the causal structure in a neural network like this, we can train the neural network to try to reproduce a given target distribution. The procedure generalizes in a straightforward manner to any causal structure, and is thus in principle applicable to any quantum nonlocality network problem.
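As a small illustration of the cost function discussed above, the following sketch evaluates a Kullback–Leibler divergence between two distributions stored as flat lists. The direction of the divergence (target relative to model), the `eps` guard and the two toy distributions are assumptions made for this example only.

```python
import math


def kl_divergence(p_target, p_model, eps=1e-12):
    """Sum over outcomes of p_target * log(p_target / p_model).

    Terms with p_target == 0 contribute nothing; p_model is clipped at eps
    (an assumption of this sketch) to avoid dividing by zero.
    """
    total = 0.0
    for pt, pm in zip(p_target, p_model):
        if pt > 0.0:
            total += pt * math.log(pt / max(pm, eps))
    return total


# Toy distributions over the 4 * 4 * 4 = 64 joint outcomes (a, b, c).
uniform = [1.0 / 64] * 64
peaked = [0.5] + [0.5 / 63] * 63

cost_same = kl_divergence(uniform, uniform)  # identical distributions
cost_diff = kl_divergence(peaked, uniform)   # mismatch gives a positive cost
```

The cost vanishes only when the model reproduces the target, so driving it towards zero is what "learning the local model" amounts to in this picture.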
III Results
Figure 2: The curve of target distributions and the local set, for a generic noisy distribution (left) and for the specific case of the Fritz distribution with a 2-qubit Werner state shared between Alice and Bob (right). The grey dots depict the target distributions, while the red dots depict the distributions which the neural network finds.
Given a target distribution $p_t$, the neural network provides an explicit model for a distribution $p_M$, which is, according to the machine, the closest local distribution to $p_t$. The distribution $p_M$ is guaranteed to be from the local set by construction. The neural network will almost never exactly reproduce the target distribution, since $p_M$ is evaluated by sampling the neural network a finite number of times. As such, to use the neural network as an oracle we could define some confidence level for the similarity between $p_t$ and $p_M$. It is, however, more generic and perhaps more informative to search instead for transitions in the machine’s behavior when giving it different target distributions from both outside and inside the local set. We will typically define a family of target distributions $p_t(v)$ by taking a distribution which is believed to be nonlocal and adding some noise controlled by the parameter $v$, with $v = 0$ being the completely noisy distribution and $v = 1$ being the noiseless, “most nonlocal” one. By adding noise in a physically meaningful way we guarantee that at some parameter value, $v^*$, we will enter the local set and stay in it for $v \le v^*$. For each noisy target distribution we retrain the neural network and obtain a family of learned distributions $p_M(v)$. Observing a qualitative change in the machine’s performance at some point is an indication of traversing the local set’s boundary. In this work we extract information from the learned model through
(i) the distance between the target and the learned distribution, $d_M(v) = \lVert p_t(v) - p_M(v) \rVert$,
(ii) the learned distributions $p_M(v)$, in particular by examining the local response functions of Alice, Bob and Charlie.
Observing a clear liftoff of the distance $d_M(v)$ at some point is a signal that we are leaving the local set. Somewhat surprisingly, we can deduce even more from the distance $d_M(v)$. Though the shape of the local set and the threshold value $v^*$ are unknown, in some cases, under mild assumptions, we can estimate not only $v^*$, but also the angle at which the curve $p_t(v)$ exits the local set, and in addition gain some insight into the shape of the local set near $p_t(v^*)$. To do this, let us first assume that the local set is flat near $p_t(v^*)$ and that $p_t(v)$ is a straight curve. Then the true distance from the local set is
$$d(v) = \begin{cases} 0 & \text{if } v \le v^*, \\ \lVert p_t(v) - p_t(v^*) \rVert \sin\theta & \text{if } v > v^*, \end{cases} \qquad (3)$$
where $\theta$ is the angle between the curve $p_t(v)$ and the local set’s hyperplane (see Fig. 2 for an illustration). In the more general setting Eq. (3) is still approximately correct if $p_t(v)$ is almost straight and the local set is almost flat near $p_t(v^*)$. We denote our approximation of the true distance from the local set as $\hat{d}(v)$. We use Eq. (3) to calculate it but keep in mind that it is only an approximation. Given an estimate for the two parameters $\hat{v}^*$ and $\hat{\theta}$, this function can be compared to what the machine perceives as a distance, $d_M(v)$. Finding a match between the two distance functions gives us strong evidence that the curve indeed exits the local set at $\hat{v}^*$ at an angle $\hat{\theta}$, where the hat is used to signify the obtained estimates.
We also get information out of the learned model by looking at the local responses of Alice, Bob and Charlie. Recall that the shared random variables, the sources, are uniformly distributed, hence the response functions encode the whole problem. These are already interesting in themselves and can guide us towards analytic guesses of the ideal response functions. However, they can also be used to verify our results in some special cases. For example, if $\theta = 90^\circ$ and the local set is sufficiently flat, then the response functions should be the same for all $v \ge v^*$. On the other hand, if $\theta < 90^\circ$ then we are in a scenario similar to that of the left panel in Fig. 2 and the response functions should differ for different values of $v$.
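The flat-boundary, straight-curve approximation behind Eq. (3) can be illustrated with a short sketch. Several ingredients here are assumptions made for the example rather than details taken from the text: the Euclidean norm, a simple white-noise family `v * p_ideal + (1 - v) * uniform` (which is exactly straight in probability space), a four-outcome toy distribution, and the hypothetical values of the threshold and angle.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two distributions given as flat lists."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def noisy_family(p_ideal, v):
    """Hypothetical white-noise family: v * p_ideal + (1 - v) * uniform."""
    n = len(p_ideal)
    return [v * pi + (1.0 - v) / n for pi in p_ideal]


def approx_distance(p_ideal, v, v_star, theta):
    """Zero inside the local set, chord length times sin(theta) outside it."""
    if v <= v_star:
        return 0.0
    chord = euclidean(noisy_family(p_ideal, v), noisy_family(p_ideal, v_star))
    return chord * math.sin(theta)


p_ideal = [1.0, 0.0, 0.0, 0.0]   # toy target, not a distribution from the paper
v_star, theta = 0.5, math.pi / 2  # hypothetical threshold and exit angle

d_inside = approx_distance(p_ideal, 0.3, v_star, theta)
d_mid = approx_distance(p_ideal, 0.75, v_star, theta)
d_full = approx_distance(p_ideal, 1.0, v_star, theta)
```

For a straight family like this one the approximate distance grows linearly in `v - v_star`, which is the kind of liftoff one would compare against the distance reported by the trained network.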
III.1 Fritz distribution
First let us consider the quantum distribution proposed by Fritz Fritz (2012), which can be viewed as the CHSH Bell scenario wrapped into the triangle topology. Alice and Bob share a singlet, i.e. $|\psi^-\rangle = \frac{1}{\sqrt{2}}(|01\rangle - |10\rangle)$, while each of them shares either a maximally entangled or a classically correlated state with Charlie, such as $\rho_{BC} = \frac{1}{2}(|00\rangle\langle 00| + |11\rangle\langle 11|)$ and similarly for $\rho_{AC}$. Alice measures the state she shares with Charlie and, depending on this random bit, she measures either the Pauli $\sigma_x$ or $\sigma_z$ observable on her half of the singlet. Bob does the same with the state he shares with Charlie and measures either $(\sigma_x + \sigma_z)/\sqrt{2}$ or $(\sigma_x - \sigma_z)/\sqrt{2}$. They then both output the measurement result and the bit which they used to decide the measurement. Charlie measures both sources in the computational basis and announces the two bits. We now introduce a finite visibility $v$ for the singlet shared by Alice and Bob, thus examining a Werner state,
$$\rho(v) = v\, |\psi^-\rangle\langle\psi^-| + (1 - v)\, \frac{\mathbb{1}}{4}, \qquad (4)$$
where $\frac{\mathbb{1}}{4}$ denotes the maximally mixed state of two qubits. For such a state we expect to find a local model below the threshold of $v^* = \frac{1}{\sqrt{2}}$.
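The link between the Werner-state visibility and the CHSH scenario can be checked numerically with a short, dependency-free sketch. The measurement settings used here (Pauli Z and X for Alice, their normalized sum and difference for Bob) are the standard CHSH choice and are an assumption of this example; the helper names are hypothetical.

```python
import math


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def trace_product(a, b):
    """Tr(a b) for square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


SX = [[0.0, 1.0], [1.0, 0.0]]
SZ = [[1.0, 0.0], [0.0, -1.0]]


def combo(cx, cz):
    """The observable cx * sigma_x + cz * sigma_z."""
    return [[cx * SX[i][j] + cz * SZ[i][j] for j in range(2)] for i in range(2)]


def werner_state(v):
    """v * singlet projector + (1 - v) * identity / 4, basis |00>,|01>,|10>,|11>."""
    singlet = [[0.0, 0.0, 0.0, 0.0],
               [0.0, 0.5, -0.5, 0.0],
               [0.0, -0.5, 0.5, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
    return [[v * singlet[i][j] + (1.0 - v) * (0.25 if i == j else 0.0)
             for j in range(4)] for i in range(4)]


def chsh(v):
    """CHSH expression for the Werner state with the assumed settings."""
    rho = werner_state(v)
    s = 1.0 / math.sqrt(2.0)
    a0, a1 = SZ, SX
    b0, b1 = combo(s, s), combo(-s, s)  # (Z + X)/sqrt(2) and (Z - X)/sqrt(2)

    def corr(a, b):
        return trace_product(rho, kron(a, b))

    return corr(a0, b0) + corr(a0, b1) + corr(a1, b0) - corr(a1, b1)


s_noiseless = abs(chsh(1.0))                   # Tsirelson value 2 * sqrt(2)
s_threshold = abs(chsh(1.0 / math.sqrt(2.0)))  # local bound 2
```

With these settings the magnitude of the CHSH value scales linearly with the visibility, so it drops to the local bound of 2 exactly at a visibility of one over the square root of two.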
In Fig. 3 we plot the distances discussed previously for $\hat{v}^* = \frac{1}{\sqrt{2}}$ and $\hat{\theta} = 90^\circ$. The coincidence of the two curves is already good evidence that the machine finds the closest local distributions to the target distributions. Upon examining the response functions of Alice, Bob and Charlie, also in Fig. 3, we see that they do not change above $\hat{v}^*$, which means that the machine finds the same distributions for target distributions outside the local set. This is in line with our expectations. Due to the connection with the CHSH Bell scenario, we believe the curve exits the local set perpendicularly and that the local set is a polytope, as depicted on the right panel in Fig. 2. These results reaffirm that our algorithm functions well.
III.2 Elegant distribution
Next we turn our attention to a distribution which is more native to the triangle structure, as it combines entangled states and entangled measurements. We examine the Elegant distribution, which is conjectured in Gisin (2019) to be outside the local set. The three parties share singlets and each perform a measurement on their two qubits, the eigenstates of which are
(5) 
where the states entering (5) are the pure qubit states with unit length Bloch vectors pointing at the four vertices of the tetrahedron, together with the same states for the inverted tetrahedron.
We examine two noise models, one at the sources and one at the detectors. First we introduce a visibility $v$ to the singlets such that all three shared quantum states have the form (4). Second, we examine detector noise, in which each detector defaults independently with probability $1 - v$ and gives a random output as a result. This is equivalent to adding white noise to the quantum measurements performed by the parties, i.e. the positive operator-valued measure elements are $v\, \Pi_j + (1 - v)\, \frac{\mathbb{1}}{4}$, where $\Pi_j$ denotes the projector onto the $j$-th eigenstate in (5).
For both noise models we see a transition in the distance $d_M(v)$, depicted in Fig. 4, giving us strong evidence that the conjectured distribution is indeed nonlocal. Through this examination we gain insight into the noise robustness of the Elegant distribution as well. For each noise model, the distribution still appears to be nonlocal above the estimated threshold $\hat{v}^*$ at which the transition occurs, and we also obtain an estimate $\hat{\theta}$ of the angle at which the curve exits the local set. Note that for both distribution families, by looking at the unit tangent vector, one can analytically verify that the curves are almost straight for values of $v$ above the observed threshold. This gives us even more confidence that it is legitimate to use the approximate distance $\hat{d}(v)$ as a reference. In Fig. 4 we illustrate how the response function of Charlie changes when adding detector noise. It is peculiar how the machine often prefers horizontal and vertical separations of the latent variable space, with very clean, deterministic responses, similarly to how we would do it intuitively, especially for noiseless target distributions.
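The detector noise model described above acts on the observed outputs only, so its effect on any tripartite distribution can be sketched classically. In this illustration each detector reports the true outcome with probability `v` and a uniformly random outcome otherwise; using `v` for the probability that a detector works, the dictionary representation, and the deterministic toy input are assumptions of this sketch.

```python
from itertools import product

N_OUT = 4  # quaternary outputs per party


def add_detector_noise(p, v):
    """Apply independent detector noise to p, a dict {(a, b, c): probability}.

    Each detector works with probability v; a failed detector reports a
    uniformly random outcome, independent of the true one.
    """
    noisy = {}
    for reported in product(range(N_OUT), repeat=3):
        total = 0.0
        for works in product((True, False), repeat=3):
            weight = 1.0
            for w in works:
                weight *= v if w else (1.0 - v) / N_OUT
            # True outcomes of failed detectors are summed out; those of
            # working detectors must match what was reported.
            marginal = sum(
                prob for true, prob in p.items()
                if all(t == r for t, r, w in zip(true, reported, works) if w)
            )
            total += weight * marginal
        noisy[reported] = total
    return noisy


p_toy = {(0, 0, 0): 1.0}  # toy deterministic distribution, for illustration
noisy_08 = add_detector_noise(p_toy, 0.8)
noiseless = add_detector_noise(p_toy, 1.0)
fully_noisy = add_detector_noise(p_toy, 0.0)
```

In the toy case each detector reports 0 with probability `0.8 + 0.2 / 4 = 0.85`, so the outcome `(0, 0, 0)` keeps probability `0.85 ** 3`; at `v = 0` the family ends at the uniform distribution, as expected for a completely noisy target.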
IV Discussion
The standard method for tackling the membership problem in network nonlocality is numerical optimization. For a fixed number of possible outputs per party, $N$, without loss of generality one can take the hidden variables to be discrete with a finite alphabet size, and the response functions to be deterministic. In fact the cardinality of the hidden variables can be upper bounded as a function of $N$ Rosset et al. (2017). Specifically for the triangle this upper bound is $N^3 - N$. This results in a straightforward optimization over the probabilities of each hidden variable symbol and the deterministic responses of the observers, giving both continuous and discrete optimization parameters. For binary outputs, i.e. $N = 2$, this means only 15 continuous and 72 discrete variables and is feasible. However, already for the case of quaternary outputs, $N = 4$, this optimization is a computational nightmare on standard CPUs with a looming 177 continuous and 10800 discrete optimization parameters. Even when constraining the response functions to be the same for the three parties and the latent variables to have the same distributions, the problem becomes intractable at a hidden variable cardinality which is still much lower than the current upper bound of 60 that needs to be examined. Standard numerical optimization tools quickly become infeasible even for the triangle configuration, not to mention larger networks!
The causal modeling and Bayesian network communities examine scenarios similar to those relevant for quantum information Pearl (2000); Koller and Friedman (2009). At the core of both lines of research are directed acyclic graphs and the probability distributions generated by them. In these communities there exist methods for this so-called ‘structure recovery’ or ‘structure learning’ task. However, these methods are either not applicable to our particular scenarios or are also approximate learning methods which make many assumptions on the hidden variables, including that the hidden variables are discrete. Hence, even if these learning methods are quicker than standard optimization for current scenarios of interest, they will run into the scaling problem of the latent variable cardinality.
The method demonstrated in this paper attacks the problem from a different angle. It relaxes both the discrete hidden variable and deterministic response function assumptions which are made by the methods previously mentioned. The complexity of the problem now boils down to the response functions of the observers, each of which is represented by a feedforward neural network. Though our method is an approximate one, one can increase its precision by increasing the size of the neural network, the number of samples we sum over ($N_{\text{batch}}$) and the amount of time provided for learning. Due to universal approximation theorems we are guaranteed to be able to represent essentially any function with arbitrary precision Cybenko (1989); Hornik (1991); Lu et al. (2017). For the distributions examined here we find that there is no significant change in the learned distributions after increasing the neural network’s width and depth above some moderate level; that is, we have reached a plateau. For the Elegant distribution, for example, for each party we used depth 5 and width 30. We note, however, that we did not do a rigorous examination of how much this can be reduced while still detecting the same thresholds. Also, for larger network sizes the machine could in principle learn more complex functions. However, studying this thoroughly can be tedious since the amount of time that the machine needs to learn typically increases with network size. We were satisfied with the current complexity, since getting a local model for a single target distribution takes a few minutes on a standard computer at the minibatch size $N_{\text{batch}}$ used here. The question of the minimal required complexity of the response functions is interesting enough to merit a separate study.
We have demonstrated how, by adding noise to a distribution and examining a family of distributions with the neural network, we can deduce information about the membership problem. For a single target distribution the machine finds only an upper bound to the distance from the local set. By examining families of target distributions, however, we get a robust signature of nonlocality due to the clear transitions in the distance function, which match very well with the approximately expected distances.
V Conclusion
In conclusion, we provide a method for testing whether a distribution is classically reproducible over a directed acyclic graph, relying on a fundamental connection to neural networks. The simple yet effective method can be used for arbitrary causal structures, even in cases where current analytic tools are unavailable and numerical methods are futile. It allows quantum information scientists to test whether their conjectured quantum, or post-quantum, distributions are locally reproducible, hopefully paving the way to a deeper understanding of quantum nonlocality in networks.
To illustrate the relevance of the method, we have applied it to an open problem, giving firm numerical evidence that the Elegant distribution is nonlocal on the triangle network, and getting estimates for its noise robustness under two physically relevant noise models.
The obtained results on nonlocality are convincing, but are still just numerical evidence. Examining whether a certificate of nonlocality can be obtained from machine learning techniques would be an interesting further research direction. In particular, it would be fascinating if a machine could derive, or at least give a good guess for, a Bell-type inequality which is violated by the Elegant distribution. In general, seeing what insight can be gained about the boundary of the local set from machine learning would be interesting. Perhaps a step in this direction would be to understand better what the machine learned, for example by somehow extracting an interpretable model from the neural network analytically, instead of by sampling from it. A different direction would be to apply similar ideas to networks with quantum sources, allowing a machine to learn quantum strategies for some target distributions.
Acknowledgements.
The authors thank Raban Iten, Tony Metger, Elisa Bäumer and Marc-Olivier Renou for fruitful discussions. TK, YC, NG and NB acknowledge financial support from the Swiss National Science Foundation (Starting grant DIAQ, and QSIT), and the European Research Council (ERC MEC). DC acknowledges support from the Ramon y Cajal fellowship, Spanish MINECO (QIBEQI, Project No. FIS2016-80773-P, and Severo Ochoa SEV-2015-0522), Fundació Cellex, Generalitat de Catalunya (SGR875 and CERCA Program), and ERC CoG QITBOX.
References
 Bell (1964) J. S. Bell, On the Einstein Podolsky Rosen Paradox, Physics Physique Fizika 1, 195 (1964).
 Brunner et al. (2014) N. Brunner, D. Cavalcanti, S. Pironio, V. Scarani, and S. Wehner, Bell Nonlocality, Reviews of Modern Physics 86, 419 (2014).
 Branciard et al. (2010) C. Branciard, N. Gisin, and S. Pironio, Characterizing the Nonlocal Correlations Created via Entanglement Swapping, Physical Review Letters 104, 170401 (2010).
 Branciard et al. (2012) C. Branciard, D. Rosset, N. Gisin, and S. Pironio, Bilocal versus Nonbilocal Correlations in Entanglement-Swapping Experiments, Physical Review A 85, 032119 (2012).
 Fritz (2012) T. Fritz, Beyond Bell’s Theorem: Correlation Scenarios, New Journal of Physics 14, 103001 (2012).
 Henson et al. (2014) J. Henson, R. Lal, and M. F. Pusey, Theory-Independent Limits on Correlations from Generalized Bayesian Networks, New Journal of Physics 16, 113043 (2014).
 Tavakoli et al. (2014) A. Tavakoli, P. Skrzypczyk, D. Cavalcanti, and A. Acín, Nonlocal Correlations in the Star-Network Configuration, Physical Review A 90, 062109 (2014).
 Chaves et al. (2015) R. Chaves, C. Majenz, and D. Gross, Information-Theoretic Implications of Quantum Causal Structures, Nature Communications 6, 5766 (2015).
 Wolfe et al. (2016) E. Wolfe, R. W. Spekkens, and T. Fritz, The Inflation Technique for Causal Inference with Latent Variables, arXiv:1609.00672 [quant-ph, stat] (2016).
 Rosset et al. (2016) D. Rosset, C. Branciard, T. J. Barnea, G. Pütz, N. Brunner, and N. Gisin, Nonlinear Bell Inequalities Tailored for Quantum Networks, Physical Review Letters 116, 010403 (2016).
 Navascues and Wolfe (2017) M. Navascues and E. Wolfe, The Inflation Technique Completely Solves the Causal Compatibility Problem, arXiv:1707.06476 [quant-ph, stat] (2017).
 Rosset et al. (2017) D. Rosset, N. Gisin, and E. Wolfe, Universal Bound on the Cardinality of Local Hidden Variables in Networks, Quantum Information and Computation 18 (2017).
 Chaves (2016) R. Chaves, Polynomial Bell Inequalities, Physical Review Letters 116, 010402 (2016).
 Fraser and Wolfe (2018) T. C. Fraser and E. Wolfe, Causal Compatibility Inequalities Admitting Quantum Violations in the Triangle Structure, Physical Review A 98, 022113 (2018).
 Weilenmann and Colbeck (2018) M. Weilenmann and R. Colbeck, Non-Shannon Inequalities in the Entropy Vector Approach to Causal Structures, Quantum 2, 57 (2018).
 Luo (2018) M.-X. Luo, Computationally Efficient Nonlinear Bell Inequalities for Quantum Networks, Physical Review Letters 120, 140402 (2018).
 Renou et al. (2019a) M.-O. Renou, E. Bäumer, S. Boreiri, N. Brunner, N. Gisin, and S. Beigi, Genuine Quantum Nonlocality in the Triangle Network, arXiv:1905.04902 [quant-ph] (2019a).
 Gisin et al. (2019) N. Gisin, J.-D. Bancal, Y. Cai, A. Tavakoli, E. Z. Cruzeiro, S. Popescu, and N. Brunner, Constraints on Nonlocality in Networks from No-Signaling and Independence, arXiv:1906.06495 [quant-ph] (2019).
 Renou et al. (2019b) M.-O. Renou, Y. Wang, S. Boreiri, S. Beigi, N. Gisin, and N. Brunner, Limits on Correlations in Networks for Quantum and No-Signaling Resources, arXiv:1901.08287 [quant-ph], Physical Review Letters, in press (2019b).
 Pozas-Kerstjens et al. (2019) A. Pozas-Kerstjens, R. Rabelo, Ł. Rudnicki, R. Chaves, D. Cavalcanti, M. Navascues, and A. Acín, Bounding the Sets of Classical and Quantum Correlations in Networks, arXiv:1904.08943 [quant-ph] (2019).
 Melko et al. (2019) R. G. Melko, G. Carleo, J. Carrasquilla, and J. I. Cirac, Restricted Boltzmann Machines in Quantum Physics, Nature Physics p. 1 (2019).
 Iten et al. (2018) R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, Discovering Physical Concepts with Neural Networks, arXiv:1807.10300 [physics, physics:quant-ph] (2018).
 Melnikov et al. (2018) A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Active Learning Machine Learns to Create New Quantum Experiments, Proceedings of the National Academy of Sciences 115, 1221 (2018).
 van Nieuwenburg et al. (2017) E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Learning Phase Transitions by Confusion, Nature Physics 13, 435 (2017).
 Carrasquilla and Melko (2017) J. Carrasquilla and R. G. Melko, Machine Learning Phases of Matter, Nature Physics 13, 431 (2017).
 Deng (2018) D.-L. Deng, Machine Learning Detection of Bell Nonlocality in Quantum Many-Body Systems, Physical Review Letters 120, 240402 (2018).
 Canabarro et al. (2019) A. Canabarro, S. Brito, and R. Chaves, Machine Learning Nonlocal Correlations, Physical Review Letters 122, 200401 (2019).
 Gisin (2019) N. Gisin, Entanglement 25 Years after Quantum Teleportation: Testing Joint Measurements in Quantum Networks, Entropy 21, 325 (2019).
 Pearl (2000) J. Pearl, Causality: Models, Reasoning, and Inference (Cambridge University Press, 2000).
 Koller and Friedman (2009) D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009).
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
 Cybenko (1989) G. Cybenko, Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals and Systems 2, 303 (1989).
 Hornik (1991) K. Hornik, Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks 4, 251 (1991).
 Lu et al. (2017) Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, The Expressive Power of Neural Networks: A View from the Width, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017), pp. 6231–6239.