# Learning Algebraic Models of Quantum Entanglement

We give a thorough overview of supervised learning and network design for learning membership on algebraic varieties via deep neural networks. We show how artificial neural networks can be trained to predict the entanglement type for quantum states. We give examples for detecting degenerate states, as well as border rank classification for up to 5 binary qubits and 3 qutrits (ternary qubits).

## Authors

• 2 publications
• 4 publications
• ### Mixed State Entanglement Classification using Artificial Neural Networks

Reliable methods for the classification and quantification of quantum en...
02/11/2021 ∙ by Cillian Harney, et al. ∙ 0

• ### Entangled Datasets for Quantum Machine Learning

High-quality, large-scale datasets have played a crucial role in the dev...
09/08/2021 ∙ by Louis Schatzki, et al. ∙ 50

• ### Detecting quantum entanglement with unsupervised learning

Quantum properties, such as entanglement and coherence, are indispensabl...
03/08/2021 ∙ by Yiwei Chen, et al. ∙ 79

• ### Negational Symmetry of Quantum Neural Networks for Binary Pattern Classification

Entanglement is a physical phenomenon, which has fueled recent successes...
05/20/2021 ∙ by Nanqing Dong, et al. ∙ 23

• ### Compact Neural Networks based on the Multiscale Entanglement Renormalization Ansatz

The goal of this paper is to demonstrate a method for tensorizing neural...
11/09/2017 ∙ by Andrew Hallam, et al. ∙ 0

• ### Deep convolutional tensor network

Tensor networks are linear algebraic representations of quantum many-bod...
05/29/2020 ∙ by Philip Blagoveschensky, et al. ∙ 0

• ### A Boolean Functions Theoretic Approach to Quantum Hypergraph States and Entanglement

We establish an one-to-one correspondence between the Boolean functions ...
11/01/2018 ∙ by Supriyo Dutta, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Recent efforts to unite Quantum Information, Quantum Computing, and Machine Learning have largely been centered on integrating quantum algorithms and quantum information processing into machine learning architectures [Wiebe2012, Lloyd2014, Rebentrost2014, Schuld2014, Schuld2015, Wiebe2016, Biamonte2017, Sheng2017, Kerenidis2018, Wossnig2019]

. Our approach is quite the opposite – we leverage Machine Learning techniques to build classifiers to distinguish different types of quantum entanglement. Machine Learning has been used to address problems in quantum physics and quantum information, such as quantum state tomography

[Quek2018], quantum error correction code [Nautrup2018] and wavefunction reconstruction [Beach2018]. Here we focus on quantum entanglement. While we were inspired by the approach of learning algebraic varieties in [Breiding2018], our methods differ in that we are not trying to find intrinsic defining equations of the algebraic models for entanglement types, but rather building a classifier that directly determines the entanglement class. Distinguishing entanglement types may be useful for quantum information processing and in improving and increasing the efficiency of quantum algorithms, quantum communication protocols, quantum cryptographic schemes, and quantum games. Our methods generalize to cases where a classification of all entanglement types is not known, and cases where the number of different classes is not finite (see for instance [LandsbergTensorBook, Ch.10], or [Vinberg-Elasvili]).

We only focus on pure states for representing quantum systems, which is sufficient for studying quantum computations and quantum algorithms. This is opposed to the noisy approach with density matrices and mixed states, which is used when one needs to account for the noise and the interaction with the environment [Nielsen2011].

### 1.1. Basic notions

The quantum state of a particle can be represented by a unit vector

in a Hilbert space , with basis in decimal notation. The state of an -qudit quantum system is represented by a unit vector

in a tensor product

of the state spaces for each particle, where for simplicity we assume all particles of the same type. The tensor space has basis , where are basis elements of . The dual vector space is represented by vectors , and we use the standard Hermitian inner product , where , and similarly for .

The study of quantum entanglement is often focused on orbits (equivalence classes) under the action of the SLOCC (Stochastic Local Operations and Classical Communication) group, which algebraists know as , the cartesian product of special linear groups, i.e. normal changes of coordinates in each mode.

### 1.2. First examples

Consider the pair of binary state particles, called a 2-qubit system. Here there are only two different entanglement types up to the SLOCC action, represented by and . All other non-zero 2-qubit states can be moved to one of these by the action of the SLOCC group. Determining on which orbit a given state is (and hence its entanglement type), can be done by computing the matrix determinant: Set . The value of is either 0 or if is respectively of type or . In particular, the non-general states live on a quadric hypersurface, and thus an Artificial Neural Network classifier (see Section 2.1) can be trained to test membership on these two types of states, see Example 4 below.

Going to the 3-qubit system, the rank (a numerical invariant) and the determinant (a polynomial invariant) generalize to the multilinear rank and the Cayley hyperdeterminant, respectively. A given state has coordinates

 φijk=⟨ijk|φ⟩.

The three 1-flattenings are as follows:

 F1=(φijk)i,jk,F2=(φijk)j,ik,F3=(φijk)k,ij.

The ranks of these flattenings comprise a vector called the multilinear rank. If any flattening has rank 0, is zero, which cannot occur if we assume is a unit vector. If the multilinear rank is , then is separable. If the multilinear rank is then is a bi-separable state of the form up to SLOCC. The other bi-separable states are permutations of this up to SLOCC. Finally, if the multilinear rank is , then up to a SLOCC transformation, is either the so called W-state or a general point . These states are distinguished, respectively, by the vanishing or non-vanishing of the well-known SLOCC-invariant, the hyperdeterminant:

 Δ222(x)=x2000x2111+x2011x2100+x2010x2101+x2001x2110−2x000x011x100x111−2x000x010x101x111−2x000x001x110x111+4x000x011x101x110−2x010x011x100x101−2x001x011x100x110−2x001x010x101x110+4x001x010x100x111.

In this case these are all the possible types of states up to SLOCC, and we have a complete algebraic description of these states as well. So, this is a good testing ground for other methods since we know how to test ground truth. In Section 4.2 we summarize the performance of a neural network for separating these states.

In the 4-qubit case there is still a classification, and a complete set of invariants that can be used to classify entanglement types is known [Holweck2014, Holweck2017]. In addition, algebraic invariants for all border ranks for 4-qubit systems are known classically (originally studied by [Segre1920]) and are given by the minors of 1- and 2- flattenings [LandsbergTensorBook, Ch. 7.2]. We note that the hyperdeterminant has degree 24, and is computable by Schlafli’s method, [GKZ]. However, in the 5-qubit case the hyperdeterminant has degree 128, is not computable by Schlafli’s method, [WeyZel_Sing], and is surely very complicated as indicated by the 4-qubit case [HSYY_hyperdet]. However, in Section 3.3 we show that neural networks can distinguish between singular and non-singular states even though algebraic invariants are likely to fail.

### 1.3. Prior Work and Outline

Machine Learning has been used to study entanglement detection, entanglement measurement, and entanglement classification. A common focus is the density matrices formalism, for which the asymptotic problem of deciding separability is a NP-hard problem. Neural networks have been used for estimating entanglement measures such as logarithmic negativity, Rényi entropy or concurrence in 2-qubits (pure and mixed) or many-body systems

[Gray2018, Berkovits2018, Govender2017]; for encoding several CHSH (witness) inequalities simultaneously in a network to detect entangled states [Ma2018]; for computing the closest separable state in a complex valued network [Che2018]; for recognizing the entanglement class of 3-qubit systems [Behrman2011]; and for detecting entanglement in families of qubits and qutrits systems in the bipartite case [Wisniewska2015]

. A convex hull approximation method combined with decision tree and standard ensemble learning (bagging) algorithms was used in

[Lu2018] to classify separable and entangled states. The forest algorithm (also using decision trees) was used in [Wang2017]

to detect entanglement and was compared to quantum state tomography (up to five qubits). Principal Component Analysis (PCA) was used to determine the dimension and intrinsic defining equations of certain algebraic varieties

[Breiding2018]

. In the 2-qubit case neural networks, support vector machines, and decision trees were used to detect non-classical correlations such as entanglement, non-locality, and quantum steering

[Yang2018].

Indeed, Machine Learning can be a relevant tool for entanglement detection, at least in some limited cases. However, prior studies were mostly limited to the 2-qubit or bipartite case, because of the possibility of generating properly separable and entangled mixed states using the PPT criterion or Schmidt decomposition. It seems difficult to generalize the prior supervised approaches to higher dimensions or to systems with more than two particles.

Instead, we focus on the problem of entanglement detection and classification by learning algebraic varieties (Segre variety, dual variety, secant varieties) that characterize different entanglement classes for pure states. Our method can be generalized to higher dimensions and systems with several particles, bringing original tools for distinguishing non-equivalent entanglement classes for quantum systems for which we do not know the complete classification or we do not have exact algorithms (as proposed by [Holweck2012, Holweck2014, Holweck2017]) to determine the entanglement type (as it is the case for 5-qubit systems for instance).

As noted in [Breiding2018] there are many instances of high degree hypersurfaces (like the hypersurface of degenerate states in ) that are easy to sample, but there is little hope to learn their equations from samples, so prior methods do not apply. However, in Section 3.3 we will show several cases where an artificial neural network can be trained to perform the task of determining membership on these high degree hypersurfaces (which the inaccessible defining equation would also do).

In Section 2

, we investigate several architectures of feed-forward neural networks for learning membership on algebraic varieties, by using previous knowledge about the defining equations to design networks. In Section

3, we study several examples by building classifiers with neural networks for the cases of qubits and qutrits. Finally, in Section 4, we use these classifiers to distinguish quantum entanglement classes.

## 2. Network design for learning algebraic varieties

### 2.1. Basics of Artificial Neural Networks

Inspired by biological neural networks, Artificial Neural Networks (ANN) are computing systems whose goal is to reproduce the functionality and basic structure of the human brain. They are used for the learning of a specific task (such as classification and regression), without any explicit instructions or rules.

An artificial neural network is composed of several artificial neurons, regrouped using a specific architecture, and is able to perform a specific task using a learning algorithm.

In 1943, McCulloch and Pitts proposed the first model of an artificial neuron [McCulloch1990]. An artificial neuron (see Figure 1

) is defined by inputs (eventually coming for other neurons), by weights (synaptic weights associated with each input), a weighted sum (computed from the two previous ones), a threshold (or bias), an activation function, and an output.

The output of the neuron is determined as follows. If the weighted sum is strictly greater than the threshold, then the neuron is said to be active, and the output of the neuron will be equal to the activation function applied to the weighted sum (with the threshold frequently added to the weighted sum). Denote by the inputs of the neuron, and by the weight associated with the -th input. The weighted sum (frequently denoted by ) is

 U=w1x1+w2x2+⋯+wnxn . (1)

The threshold is a real number that can be considered as a weight associated with an additional input 1 during the learning process (see Section 2.2). The output of the neuron is equal to , with the activation function of the neuron. Usually we want activation functions to be nonlinear in order to allow such neural networks to compute nontrivial problems using only a small number of neurons [Cybenko1989]. However, activation functions can be chosen as binary step or linear functions but will lead to several problems during the learning step or for dealing with non-binary classification problems. One may also desire other properties in an activation function, such as continuously differentiable, monotonic, smooth., and approximates the identity near the origin (see Figure 2 for a summary).

###### Example 1.

Consider the problem of reproducing the basic logical function OR with an artificial neuron. Denote by and the two inputs, which take values from . The output of OR is if either input is and otherwise. Choose the step activation function . The weighted sum and output are respectively:

 U=w1x1+w2x2 , g(U)={1if U>θ0otherwise (2)

In order to model OR with a single neuron the following conditions are necessary:

 0⋅w1+0⋅w2≤θ,0⋅w1+1⋅w2>θ,1⋅w1+0⋅w2>θ,1⋅w1+1⋅w2>θ.

 θ≥0, w1>θ, w2>θ .

If , , , this single neuron reproduces OR for inputs in . A visualization of this binary classification is given in Figure 3.

Geometrically, if one can find a hyperplane (codimension 1) that separates the data into two classes, then the binary classification problem can be solved using a single artificial neuron. Here, the equation of one such hyperplane is given by , which corresponds to the following values for the threshold and weights: , , . ∎

On the other hand, if we want to model the logical XOR function with only one neuron, one can show that this is not possible by the same type of analysis of Example 1

. Therefore, one needs more complex structures and more neurons to solve more difficult problems, and this leads us to consider the model of Multi-Layer Perceptron (MLP). It is composed of an input layer, one or several hidden layers, and an output layer. Each layer is composed of a certain number of neurons that are not connected. In a feed-forward dense layer configuration, each neuron in a layer is connected to every neuron of the previous and next layer, and the signal is only propagated in one direction (from left (input layer) to right (output layer)).

An Artificial Neural Network (ANN) learns tasks from examples via optimizing weights in a network. These examples form the training dataset of correct input-output pairs. The goal of the supervised approach with ANNs is to learn how to associate the correct output to a given input. This task can be performed after a training step, where all the weights and thresholds will be optimized to reduce the error defined by the difference between the output given by the network and the correct output provided in the training dataset. In the feed-forward neural network model, the back-propagation algorithm is used to update the weights and reduce the error, also known as the loss function [Rumelhart1988, LeCun2015].

After training, the network should be able to answer correctly for inputs taken from the training dataset but should also be able to make correct predictions for new data. In order to test the network, we give a testing dataset to evaluate the capacity of the network to generalize the previous learning by making predictions. In addition, one can consider a validation dataset, used during the training step to avoid or detect overfitting.

The activation function is an important choice when designing a neural network for learning a specific task. Frequently neurons in the same layer will have the same activation function, while two adjacent layers may have different activation functions.

For the input and hidden layers, ReLU is commonly used as an activation function. The advantages of ReLU is that it is computationally efficient (quick convergence of the network), it is non-linear, and admits a derivative (thus allowing back-propagation). However, when the inputs approach zero or become negative, the learning becomes harder (dying ReLU problem). To prevent that, one can use Leaky ReLU activation, which introduces a small positive slope in the negative area, permitting back-propagation.

The activation function of the neurons in the output layer will depend on the problem one wants to solve. Sigmoid and SoftMax (see Eq. 3

) functions are the most used ones, and the choice will depend on the classification problem one wants to solve. In fact, if it is a binary classification problem, then Sigmoid function with only one neuron will be more efficient. When one deals with a classification problem with multiple classes, several neurons at the output (with categorical presentation of the data) should be introduced. Therefore, if a state can belong to several classes at the same time, one will use the Sigmoid function in this case. Otherwise, if every state only belongs to one class, then SoftMax function will be a good choice.

 SoftMax(z)=(σ(z)1,σ(z)2,…,σ(z)K)with z=(z1,z2,…,zK) and σ(z)j=ezj∑Kk=1ezk,∀j∈⟦1,K⟧ (3)

The loss function should also be chosen based on the problem of interest. Choices of activation functions for the output layer and loss functions are summarized in Table 1.

In our work, we implemented neural networks in Python using the Keras library

[Chollet2015]. It provides a flexible implementation of feed-forward neural networks and the choice of network parameters (architecture, activation functions, loss function, solver, etc.). We chose to implement the Nadam solver provided by the Keras library for minimizing the loss function during the learning step [Sutskever2013, Dozat2016].

Once activation and loss functions are fixed, the remaining parameters to be configured are the number and size of the layers. A classical result, known as the Universal Approximation Theorem states that “a feed-forward network with an input layer and a single hidden layer with a finite number of neurons can approximate any continuous function on compact subsets of [Csaji2001]. In 1989, the theorem was proven using Sigmoid activation functions [Cybenko1989]. Kurt Hornik showed that the theorem does not depend on the activation functions, but rather on the feed-forward multilayer architecture in itself [Hornik1991].

These results concern the case of depth-bounded (e.g. 2) networks. On the other hand, several works recently proposed an analogous theorem for width-bounded (with ReLU activation functions) networks [Lu2017, Hanin2017]. The question of providing the best network, in terms of accuracy and computational efficiency, for a given task or problem, is still open and the question of the choice of the depth and the width of neural networks are most of the time chosen by practice. Designing a network often leads one to consider a trade off between performance and computational cost. Some architectures, like a decreasing number of neurons as the neural network becomes deeper, seem to present practical efficiency. However, there is no theoretical proof of the superiority of such networks, and our experience seems to show that a network with the same number of neurons in each layer could perform with the same accuracy for predictions.

### 2.2. Learning algebraic varieties

An algebraic variety is a geometrical object defined as the zero locus of a set of polynomials. In order to teach a machine how to recognize points on algebraic varieties, one must be able to encode in the learning model the polynomial equations defining these objects (or an approximation of them).

###### Example 2 (Learning affine linear spaces).

Consider the equation of a line in : , with , and real coefficients. Suppose we want to determine membership on . This is a binary classification problem that can be modeled with a single artificial neuron with two inputs and . Denote by , respectively , the weight associated with the first input , respectively , and the threshold of the neuron (which is here considered as a weight associated with the input value ). The expression of the weighted sum is:

 U=w1x+w2y+θ

Note that if we set , and , we retrieve the equation of the line . This means that the weighted sum of the neuron will be equal to 0 if and only if the inputs correspond to a point on the line .

This observation can be generalized to any dimension. In fact, any linear subspace of codimension 1 in a -dimension space, i.e. defined by a single linear equation with variables, can be modeled by using a single artificial neuron with inputs corresponding to the variables. By model we mean the fact that the output of the neuron is always the same value when the inputs represent a point contained in this linear subspace, with the activation function can be taken as the identity function. The following is straightforward.

###### Proposition 2.1.

Let be an affine linear space in defined by , with real variables and real coefficients . Then can be modeled using a single artificial neuron, and the weights must satisfy: and , with , as shown in Figure 4.

###### Example 3.

We now generalize the previous example to higher codimension. Let be a line in . A line in a 3-dimensional space is an algebraic variety of codimension 2, defined by a system of two linear equations. The vanishing of these two equations is the intersection of two planes, the line . So, we can combine in the same layer two artificial neurons to model the two equations defining the line (Figure 5). The output of the first neuron should be equal to the same value (namely 0) when the input point of the network is on the first plane (and thus on the line), and the same applies for the second neuron.

If we also take the identity as activation function for both neurons, the output of the two neurons will be precisely when a point is on the line . This result is straightforward to generalize to higher codimension:

###### Proposition 2.2.

Let be an affine linear subspace of codimension , defined by equations of the type , with real variables and real coefficients , . Then this subspace can be modeled using artificial neurons combined in a layer, and the weights of the -th neuron must satisfy: and , with .

###### Example 4.

[Curves] Consider a circle in defined by the equation: , with the radius of the circle. By only using the weighted sum of the inputs with the identity as activation function, a single neuron is not able to model this quadratic equation. The idea is to use the function for the activation function, in order to introduce squared terms of the input variables and at the output of the neuron. This idea was already used in [Kileel2019] to study the power of deep neural networks with polynomial activation functions. Such networks are also known as Polynomial Neural Networks (PNN) and an adaptive architecture was proposed in [Oh2003].

If we consider a single artificial neuron, with and the weights associated with and respectively, the threshold, and the activation function, then the output is equal to:

 s=g(U)=(w1x+w2y+θ)2=w12x2+w22y2+2w1w2xy+2w1θx+2w2θy+θ2 (4)

If we want to model the equation of the circle, then a single artificial neuron is not sufficient. However, it is possible by adding another neuron in the same layer, and then combine these two outputs in a third neuron (in another layer) (see Figure 6). This result can be generalized as follows:

###### Proposition 2.3.

Any polynomial equation of degree 2 with real variables , i.e of the form , with , can be modeled using artificial neurons: neurons with (square) activation function in a first layer, and one neuron with identity activation function in a second layer.

###### Proof.

Let be a polynomial of degree 2 as in the statement of the proposition, which can be represented by the matrix equation for a real symmetric matrix , with .

Let be the weight associated with the -th input and the -th neuron of the first layer, let be the weight associated with the output of the -th neuron in the first layer and the single neuron in the second layer, let be the thresholds of the neurons in the first layer, and be the threshold of the neuron in the second layer. Each neuron in the first layer has square activation function, and the neuron of the second layer has the identity activation function. If we denote by the outputs of the neurons in the first layer, and the output of the single neuron in the second layer, then:

 sn=Un=n−1∑i=1bisi+θnwithsi=U12=(n−1∑j=1wj,ixj+θi)2. (5)

So is an inhomogeneous linear combination of squares of affine linear forms. This can be represented as for a real diagonal matrix , and the entries of are the linear forms (before squaring) in the expressions for the , and . Thus, the proposition is equivalent to the diagonalization of real symmetric matrices, which is well known to always be possible via orthogonal matrices. The entries of the orthogonal matrices used in the diagonalization give the weights of the network. ∎

To build a network for modeling any polynomial equation of degree in variables, a dimension count gives a good guess for the appropriate architecture. The Alexander-Hirschowitz (AH) theorem (see [Brambilla2008] for a modern treatment) tells us when the naive dimension count fails. In fact, the AH theorem states that a general111By “general” we mean avoiding a measure-zero set of possible counterexamples. homogeneous polynomial of degree in variables can be expressed as the sum of -th powers of linear forms (except for quadratic forms (Prop 2.3) and a few other special cases), i.e.

Dehomogenizing the AH result (by setting the last variable to 1, for example) one obtains a bound for 2-layer neural networks modeling affine hypersurfaces. We implement neurons with activation function in the first layer, and then combine all the outputs in a linear combination using a neuron in the second layer. Suppose we have inputs corresponding to the variables of the homogeneous polynomial . If we denote by the weight associated with the -th input and the -th neuron in the first layer, and by the threshold of the -th neuron, then the output of the neuron in the first layer is

 sj=g(Uj)=(n∑i=1wi,jxi+θj)d.

The threshold introduces an inhomogeneity which we can remove by adding an extra variable , replacing with . Then we may apply the AH theorem for variables, then set to obtain the bound for the inhomogeneous form of the output. This idea also appears in [Kileel2019].

The last step of the network design is to solve the classification problem, namely: is the input point on (or outside) the variety defined by the equations modeled by the network? We recall that after the two first layers, we should retrieve one specific value (most of the time 0) when the input point is on the variety, and any other value if the point is not. So, the last step is equivalent to recognizing a real number in an interval (whose size depends on the training data). We cannot just add a Sigmoid neuron to solve directly the classification problem because single neurons only solve inequalities and not equalities. So, we need to add another layer before the output layer to be able to recognize this specific value , and then solve this binary classification problem at the output of the network.

After testing several architectures, the task of recognizing a real number in an interval can be performed using a single layer with four neurons with LeakyReLU activation functions. Thus, by adding such a layer after the first two layers (modeling the equations of the variety), and by adding a last layer (output layer) with only one neuron (with Sigmoid activation function), one can potentially learn any algebraic variety defined by a set of homogeneous polynomials (Figure 7).

However, the questions of the ability to correctly train the network, and the ability for the optimizer to find the right weights and thresholds are important for practical implementations. For homogeneous equations of low degree and few variables (such as the circle case), the network is able to learn the correct weights, and one can recover the equation of the circle from the weights of the network. However, when the degree and the number of variables increases, the number of neurons needed in the first layer increases very quickly and it becomes harder to converge to a set of weights and thresholds that model the desired equation. To overcome this, in Subsection 2.4 we slightly modify the structure of the network to reduce the number of weights and parameters.

### 2.3. Expressive Power of Deep Neural Networks

For deeper neural networks with -th power activation functions we can apply the same dimension-counting principle as was employed in the AH case in order to determine the maximum width of each layer to avoid over-fitting and redundancy in the network. See also Kileel et al., [Kileel2019, Sec. 3.2] for several concrete examples. Let us give a recursive construction. Suppose we are working over the complex numbers and want to approximate a homogeneous polynomial of degree as a sum of the -th powers of homogeneous degree polynomials in variables as

 f≃p∑j=1λj(gj)d.

Note that we’re not going to count the parameters in this case because we can re-scale the by bringing in a -th root. Then we need at most parameters to express , but the right-hand side of the above equation uses parameters. So, we expect that we can do the approximation if , or if

 p≥⌈(n−1+dede)(n−1+ee)⌉.

If has degree and the functions are taken from a space that requires parameters, repeating the same argument we find that as long as we have nodes with

 p(M−1+ee)≥(n−1+dede). (6)

then we expect to be able to approximate . For more layers we iterate (6). For example, in the 3-layer case with nodes on layer using degree activation functions, if

 p1(p2(p3(n−1+d3d3)−1+d2d2)−1+d1)d1≥(n−1+d1d2d3d1d2d3).

then we can expect to be able to model with degree . One might guess that a classification result like the AH theorem might hold, but relatively little is known. See [Kileel2019, Thm. 10] for another bound.

###### Remark 2.4.

The ability of neural networks to approximate any function, in the sense of the Universal Approximation Theorem, can lead one to ask how it is different from polynomial fitting. In [Cheng2018], an analytic argument is given that neural networks are in fact essentially polynomial regression models, with the effective degree of the polynomial growing at each hidden layer. According to the authors, if the activation function is any polynomial, or implemented by one, a neural network exactly performs polynomial regression. In the same spirit, neural network architectures are proposed in [Li2003]

, showing the correspondence between the values of the function to be interpolated, the basis functions and the basic points on one hand, and the weights, the activations functions and the thresholds (bias) on the other hand. However, in practical applications, data is often noisy and incomplete, and polynomial interpolation is generally subject to overfitting

[Witten2016], while neural networks are able to perform when there are noisy or incomplete data, and have the ability to generalize from the input data [Ong2006]

. This dilemma between fitting the training dataset and being able to generalize the models to unencountered data is known as the bias-variance trade-off, and it is discussed in

[Belkin2018]. Raturi explains in [Raturi2018] that solving the same problem as polynomial interpolation requires much less computational time and resources when using neural networks. In the same paper, he shows that neural networks provide better approximations of two special functions compared with Lagrange interpolation. Finally, it is shown in [Llanas2006] that neural networks can interpolate and model a function by means of sigmoidal functions to approximate samples in any dimension, with arbitrary precision and without training.

### 2.4. Hybrid networks

Here we introduce a simpler network architecture for classifying membership on a parametrized algebraic variety. We observed that during the learning process, which is an optimization process, the algorithm is not always able to reach the set of weights that minimize the loss function. This is due to the existence of local minima in the loss function, related to the utilization of the non-convex activation functions.

The second layer (Figure 7) contains only one neuron (with the identity activation function), introduced to take the linear combination of -th power of weighted sums of input variables. One would like to remove this layer, and directly link outputs of the first layer with the third layer (containing only neurons with LeakyReLU activation function). We call these hybrid networks, because they combine layers with both and LeakyReLU activation functions.

Moreover, the geometrical interpretation of the network is now different since every neuron in the second layer (LeakyReLU neurons) take as input a different linear combination of all -th power forms. The second layer combines different homogeneous polynomials that are not necessarily equal to the homogeneous polynomial defining the algebraic variety, but they can be used to approximate this last as a set of inequalities. The third and last layer of the network is the output layer, which will depend on the classification problem one wants to solve (binary classification, several classes, etc.). The architecture of the hybrid network is summarized in Figure 8.

This type of network showed better results and quicker learning than the previous version of the network. The AH theorem still can be used to give an idea about the number of neurons in the first layer, and sometimes extra neurons should be added or removed to boost the performance of the network during the learning phase. By experience, the number of neurons in the second layer can be chosen quite small (between 4 and 10 for binary classification problems) and will depend on the number of neurons in the output layer (it might not be smaller than the number of neurons in the output layer).

## 3. Experiments

Consider a pair, , of a vector space and a group . The set of all such pairs for which acts on with finitely many orbits has been classified by Kac [Kac80, Kac85]. Since then Vinberg and others have classified all the orbits in many of these cases. There are special cases where acts on with infinitely many orbits, yet those orbits may still be represented using finitely many parameters, the tame case. A fantastic example of such is [Vinberg-Elasvili], which gave a classification of the orbits of 9 dimensional trivectors utilizing a connection to the Lie algebra . This classification also implies a classification for 3 qutrit systems among others. After orbit classification, one desires effective methods to determine orbit membership. Separating orbits is difficult in general and lies at the core of geometric approaches to Valliant’s version of P versus NP [LandsbergCambridgeBook].

One approach is to use Invariant Theory to build a set of invariants and covariants that characterize each orbit, or each family represented by parameters [Holweck2012, Luque2003]. In the four-qubit case, an infinite tame case, Verstraete et al. gave a list of 9 normal forms (depending on parameters) that give a parametrization of all SLOCC orbits [Verstraete2002, Chterental2006]. An algorithm was proposed in [Holweck2014, Holweck2017], for the four-qubit case, to determine orbit membership. In general, the complexity of these invariants grows very rapidly with the number of particles in the system. Already in the five-qubit case, a complete description of the algebra of SLOCC-invariant polynomials is out of reach of any computer system [Luque2006].

Therefore, it is worth considering other approaches. In this section we give examples that serve as proof of the concept that artificial neural networks can be trained to give efficient and effective classifiers for quantum states.

### 3.1. Modeling quantum states

In our work, we only focused on the real case, i.e. -qudits systems that are represented by tensors in . Therefore, such states can be represented as dimensional vectors with coordinates in , and this vector will be used as input vector for the neural network.

### 3.2. Detecting separable states

A state is separable or unentangled if it can be written as a tensor product of pure states as follows

 φ=φ1⊗⋯⊗φn,

for for all . Algebraic geometers call the projective variety of separable states the Segre variety, which is parametrized as:

 Seg:PH×⋯×PH→PH⊗n([x1],[x2],…,[xn])↦[x1⊗x2⊗…⊗xn],

where the square brackets denote the equivalence class by rescaling, which makes physical sense to consider since we assume states are unit vectors. We uniformly sample the space of separable states working on representatives of projective points on spheres:

 SH×⋯×SH→SH⊗n,

where denotes the states of unit norm, and similarly .

The following is straightforward.

###### Proposition 3.1.

A non-zero state is separable if and only if either of the following equivalent conditions.

1. is in the SLOCC orbit of .

2. All of the 1-flattenings have rank 1.

Separable quantum states are states where quantum entanglement is not present at all. Detecting separable states thus becomes equivalent to detecting quantum entanglement. Therefore, one way to determine if a state is separable is to compute the compact form of the SVD of every 1-flattening.

In order to illustrate our methods, we show how we train an artificial neural network to perform this task.

#### 3.2.1. Networks for pure states

The set of separable states is the zeroset of the two by two minors of 1-flattenings, and thus is defined by a set of homogeneous polynomials of degree 2. In the case, there is only one defining equation, the determinant. In higher dimensions, several equations of degree 2 define the Segre variety. To select the number of equations, we take (at least) the codimension of the Segre variety in the projective space.

For each quadratic equation, we use our criteria (see Proposition 2.3) to deduce the number of needed neurons, and then to design the Hybrid Networks. In some cases, we chose to add several extra neurons in the first layer (with quadratic activation function) to accelerate the learning process. As explained before, the second layer will only contain neurons with LeakyReLU activation functions. The last layer will contain only one neuron, with Sigmoid activation function in this case (because of the binary classification problem).

We also implemented networks with only LeakyReLU activation functions neurons (excepted for the last layer). For all the considered cases, we used the same network architecture, which is composed of 4 first layers and a last output layer. We chose to implement a decreasing structure in the network, with 100 neurons in the first layer, 50 in the second, 25 in the third, and 16 in the fourth.

In each case, we precise the architecture of the network in the second column of the results tables.

#### 3.2.2. Results

In Table 2 we present the result for the , the , the , and the cases with hybrid networks. In the , the , and the cases, we used a training dataset of size 56200, a validation dataset of size 12800, and a testing dataset of size 32000. In the case, we doubled the size of the training and the validation datasets only (same size for the testing one). We reached an average accuracy of 93% on the testing datasets for separability classification.

We used LeakyReLU networks to study the same cases as we did with the hybrid networks, with the addition of the one, and we regrouped the results in Table 3. In the and cases, we used a training dataset of size 102400, a validation dataset of size 25600, and a testing dataset of size 32000. In the , and cases, we used a training dataset of size 502600, a validation dataset of size 55600, and a testing dataset of size 32000. We reached an average accuracy of 98% on the testing datasets for separability classification.

These results show that an ANN can be trained to distinguish between separable and entangled states. From an algebraic geometric point of view, determining if a tensor is on the Segre variety is equivalent to decide if it is a rank one tensor, and this task can be performed using truncated SVD for each flattening. The complexity of computing the singular values of an

matrix is . This is roughly times the complexity of reading the matrix. Computing the singular values of the -th flattening of a tensor of format has complexity

 O(ni(n1⋯ntni)min{ni,(n1⋯ntni)})=O((n1⋯nt)min{ni,(n1⋯ntni)).

For “balanced” cases , thus this complexity is at most

 O((n1⋯nt)ni),

which is roughly times the complexity of reading the tensor. Now computing all of the singular values of the tensor (in a naive way, not informing one computation by the results of another) one would expect complexity . The cost of evaluating a trained network is equal to the cost of a feed-forward propagation, which needs to compute all the weighted sums (which is equivalent to compute matrix multiplications) and evaluate the activation function for each neuron. It will in fact depend on the architecture of the network. If we denote by the number of layers (without counting the input layer) and by the number of neurons in each layer, it has a complexity roughly equal to .

This indicates that as the dimensions of the tensor spaces increase, that the cost of evaluating an already trained network should be more efficient than the computation of the SVD of flattenings. Though the cost of training the network cannot be ignored, it is a one-time cost, and the trained network can be used for efficient computation for any number of tests. In addition, once the tensor rank is larger than the dimension of any of the factors, there are few effective methods for determining rank, so a trained ANN classifier may provide valuable insight.

### 3.3. Detecting degenerate states

Recall from linear algebra that a matrix is degenerate (rank deficient) if and only if any of the following equivalent conditions hold:

1. There are non-zero vectors and such that for all and for all .

2. In the case , the determinant vanishes: .

3. Up to Gaussian elimination has a zero row and a zero column.

These conditions carry over in the tensor setting, following [GKZ]. A state is degenerate if one of the following equivalent conditions hold:

1. There is a pure state such that the contraction

2. The hyperdeterminant vanishes: , where is the hypermatrix satisfying .

3. Up to SLOCC, the coordinates for all of Hamming distance away from .

The hyperdeterminant is a SLOCC invariant polynomial, and thus is used to characterize quantum entanglement. The zero locus of the equation defined by the hyperdeterminant also defines the dual of the Segre variety [Miyake2002], and the study of singularities of the associated hypersurfaces gives also a qualitative information about quantum entanglement [Holweck2014a, Holweck2016].

On the other hand, the hyperdeterminant can sometimes also be regarded as a quantitative measure of entanglement and can be used to study entanglement in quantum algorithms for example [Holweck2016a, Jaffali2019]. In a concrete sense the non-degenerate states are the most entangled ones and maximizing the absolute value of the hyperdeterminant can lead to maximally entangled states [Gour2010, Chen2013, Holweck2019]. While the hyperdeterminant is a polynomial test for degeneracy, it has very high degree and rarely feasible to compute apart from the smallest cases. We note that one of us recently showed that hyperdeterminants coming from the root system E8 can be computed efficiently [HolweckOedingE8].

We wish to show that we can still learn degenerate states even in cases where an expression or evaluation method for the hyperdeterminant is not known explicitly.

#### 3.3.1. A note on uniform sampling

The SLOCC description of degenerate states gives a way to uniformly sample the set. We take a random state with coordinates for all of Hamming distance away from , we renormalize so that , and then we apply a random element of the SLOCC group.

#### 3.3.2. Networks for degenerate states

The hyperdeterminant is a homogeneous polynomial whose degree is respectively for , and its degree is 36 in the case (see [GKZ] for concrete formulas).

In the case, the first layer of the hybrid network must contain at least neurons with activation functions. We chose to implement 60 neurons in the first layer for the 3-qubit case. In the two other cases, the number of needed neurons is way too high to hope an implementation using Keras. One can still use fewer neurons to approximate the hyperdeterminant using hybrid networks, and even if the accuracy will not be very close to 100%, on can still reach 60% or more and use the idea presented in Section 4.1 to have a better overall accuracy. Another possible idea is to use Standard Ensemble Learning (also known as bagging) and train several networks in order to build a committee that will give a more accurate answer concerning the detection of degenerate states.

The second possibility is to use only LeakyReLU activation functions (except for the output layer), as it was done in Subsection 3.2. We also used the same decreasing structure with four layers, except in the 4-qubit case, when we chose to double the number of neurons in the three first layers.

The non-degenerate states are in fact non-separable states. This case is thus already solved by the separability classifiers presented in Section 3.2.

#### 3.3.3. Results

In Table 4 we present the result for the case with the hybrid network. We used a training dataset of size 202600, a validation dataset of size 25600, and a testing dataset of size 32000. We reached 92% of accuracy for the testing dataset.

In Table 5 we present the results for the , , , and cases with networks using LeakyReLU activation functions. In the case, we used a training dataset of size 502600, a validation dataset of size 55600, and a testing dataset of size 32000. In the and cases, we used a training dataset of size 252400, a validation dataset of size 55600, and a testing dataset of size 52000. In the case, we used a training dataset of size 352400, a validation dataset of size 55600, and a testing dataset of size 52000. We reached an average accuracy of 96% on the testing datasets for classification of degenerate states.

The obtained results are quite interesting, especially in the 5-qubit case. In fact, as mentioned before, no explicit expression of the hyperdeterminant is available, because of the complications related with computations. The network has 98% accuracy for the testing dataset, which means that we can determine with very high accuracy if a tensor is degenerate or not, which is equivalent to determine if the hyperdeterminant vanishes or not. In the context of qualitative characterization of entanglement, one only need to know if the state annihilates or not the hyperdeterminant. In this sense, we provide an original result for the evaluation of the nullity of the hyperdeterminant for real tensors, that can be used to investigate quantum entanglement for 5-qubit systems (see Section 4.3).

### 3.4. Border-rank classification

A state is said to have rank if there is an expression with and . It is well known that when rank is not a closed condition, and it is not semi-continuous. The first example is 3 qubits, where the -state , which has rank 3, and is in the closure of the generic orbit, that of , which has rank 2.

In another context, Bini [Bini_BorderRank] defined the notion of border rank to regain semi-continuity. One says that a state has border rank if there exists a family of rank states and . Equivalently, the set of border rank states, denoted is the Zariski closure of the states of rank .

By construction, when we have a chain

 σ1⊊σ2⊊⋯⊊σg=H⊗n

which ends at because the set of separable states is linearly non-degenerate in . The minimal integer for which is called the generic rank. A simple dimension count gives the expected generic rank:

 e=⌈dnn(d−1)+1⌉

The only known case for tensors of type where the generic rank differs from the expected rank is the case , so tensors or 3-qutrits where the generic rank is 5 and not 4. In the case , so or 4 qubits, where the generic rank is the expected generic rank of 4, even though it is known that the set of rank 3 tensors is defective. It is conjectured that these are the only such cases (see [BaurDraismadeGraaf, AOP_Segre, ChiOttVan]).

Let denote the states of rank exactly . When there can be more than one semi-algebraic set that is full dimensional. These ranks are called typical ranks

. Given a probability measure

on the measure of represents the probability of a random state having rank . We learned of the following example from [deSilva_Lim_ill_posed] which illustrates the situation with real typical ranks.

###### Example 5.

Set , . Both 2 and 3 are typical ranks, and moreover that a positive volume set of tensors with rank 3 have no optimal low-rank approximation. The hyperdeterminant separates

into regions of constant typical rank. We formed a sample of tensors of real rank at most 3 as follows. We constructed rank-one tensors as a tensor product of three vectors with entries uniformly distributed in

and normalized to unit-norm. Then we took the sum of these vectors and normalized again. For such a distribution of tensors we found the following frequencies:

1. with approximate frequency 86.6% , in which case the -rank is 2,

2. with approximate frequency 13.4% , in which case -rank is 3,

3. with frequency 0% , in which case the -rank can be 0,1,2,3. ∎

#### 3.4.1. A note on uniform sampling

We construct points on algebraic varieties by utilizing rational parameterizations of the form , with a rational function, and , subsets of a normed linear space. If we uniformly sample the source it is generally the case that is not a uniform sample of the image. Instead of trying to uniformly sample the image of these varieties [Hernandez-Mederos2003, Pagani2018, Dufresne2018], we provide training data that is constructed in the way we think that one might construct the test data, or how one might imagine the data is produced from a state constructed in a lab. Whether this is a reasonable assumption is open for debate. On one hand, it is always possible (by forcing incoherence, for instance) for an adversarial entity to construct a data set on the same variety on which we’re testing membership that fools our classifier. On the other hand, we can say that if we train our neural network on data produced by a certain process, we can be confident that if the validation data is constructed by the same process, then the accuracy numbers we report reflect a reasonable measure of confidence in the classifier.

#### 3.4.2. Results

In Table 6 we present the result for the rank classification with the hybrid network. We used a training dataset of size 102400, a validation dataset of size 25600, and a testing dataset of size 32000. We reached 87% accuracy for the testing dataset.

In Table 7 we present the results for the , , and cases with network using LeakyReLU activation functions. In the rank classification problem, we used a training dataset of size 502500, a validation dataset of size 55600, and a testing dataset of size 52000. In the and border rank classification problems, we respectively used a training dataset of size 502400 and 802400, a validation dataset of size 55600 and a testing dataset of size 52000.

In the case, we trained our networks to recognize the exact rank of tensors. In fact, for building the training and validation datasets, we provided tensors from Segre variety for rank one, tensors that are SLOCC equivalents to the biseparable and GHZ states for rank 2, and states that are SLOCC equivalents to the W state for rank 3, following the classification of 3-qubits (see Table 8). We also took into account the typical rank (see Example 5) and used the sign of the hyperdeterminant to generate rank 2 and rank 3 tensors.

In the and cases, we generated tensors for each class by summing rank one tensors (and renormalized). The class ‘0’ corresponds to rank one tensors, and the other classes correspond to tensors with border rank . In the , we generated tensors up to border rank 4, and up to border rank 5 in the case. By using this way of generating vectors for each class, we introduce noise in the data. However, even in the presence of noise, the network is able to learn and predict border rank of states with a quite interesting accuracy (84% in the 4-qubit case, and 80% in the 5-qubit case, on the testing dataset (which is also generated by the same process)).

For example, we evaluate the 5-qubit network on the input state (see Section 4.1 for how we make predictions), we obtain most of the time the class ‘1’ (see Figure 9), which corresponds to tensors with border rank 2.

## 4. How to use the classifier after the fact

### 4.1. Predictions for a single quantum state

In this work, we study quantum entanglement from a qualitative point of view. Quantum states are regrouped into classes, which correspond to the orbits of the group SLOCC. When the number of orbits is infinite, we talk about families depending on parameters. Any state belonging to a specific class will be equivalent to another state in the same class by the action of an element of this group. Moreover, other properties, such as separability, degeneracy, and tensor rank, are invariants under the action of the group SLOCC. Consequently, all of our classifiers should theoretically give the same answer for SLOCC equivalent states.

However, the accuracy of our classifiers is not exactly equal to 100%, and thus the probability of misclassification is non-zero, which gives little confidence for a single test. In order to avoid (or significantly reduce) misclassification, when one wants to make predictions for a specific quantum state (or set of states), the idea is to generate SLOCC equivalent states and look at the most frequent answer of the neural network classifier. All the answers can be regrouped into a histogram in order to give a graphical representation of the neural network output distribution. Examples are investigated in the next sections. We report the results after applying the classifier to enough points of the orbit so that the consensus was reached (the answer was the same for larger data samples).

### 4.2. Three-qubit entanglement classification

The 3-qubit case was the first case to illustrate the existence of non-equivalent entangled states, such as and states, as it was proven by [Duer2000]. The entanglement classification of 3-qubit systems (see Table 8) can be described by join, secant and tangential varieties [Holweck2012], and by using dual varieties as well [Holweck2012, Miyake2002, Miyake2003, Miyake2004]. Rank can also be used as an algebraic measure of entanglement to distinguish between several 3-qubits entanglement classes.

Therefore, our binary classifiers built in Section 3 can be used to distinguish between 3-qubit entanglement classes. In fact, separable states can be detected using the binary classifier for tensors on the Segre variety, presented in Section 3.2. All states that are SLOCC equivalent to the state correspond to non-degenerate states, and thus can be recognized using the binary classifier for tensors on the dual variety of the Segre variety, presented in Section 3.3. Finally, to distinguish between Bi-Separable states and states that are SLOCC equivalents to , we exploit the rank of tensors, by using the rank classifier, presented in Section 3.4, as a predictor for distinguishing between rank 2 and rank 3 tensors. All the results are regrouped in Figures 10,11,12,13.

### 4.3. Entanglement of five-qubit systems

In the five-qubit case, since no complete or parametrized classification is known, distinguishing entanglement classes is a difficult task. Some prior work focused on using filters [Osterloh2006] or using polynomial invariants [Luque2006]. In these papers, they considered four 5-qubit systems , , , and (see Equations 78910).

 (7)
 (8)
 |Φ3⟩=1√6(√2|11111⟩+|11000⟩+|00100⟩+|00010⟩+|00001⟩) (9)
 (10)

These states do not belong to the same entanglement classes, as it was shown in [Osterloh2006, Luque2006]. However, one can ask if these states are degenerate or not. We used our classifier for 5-qubits degenerate states to investigate this question, and it turned out that for all of these four states, the hyperdeterminant vanishes, as it shown in Figure 14.

On the other hand, we can also propose several non-degenerate states using our classifier. For example, the states and (see Equations 11 and 12) are non-degenerate states, as it is confirmed by the prediction results of our classifier presented in Figure 15. This statement can be verified using the study of hypersurface singularities associated with these two states, as it was done in [Holweck2014a, Holweck2016].

 (11)
 |δ2⟩=1√11(|00000⟩+|00100⟩+|00111⟩+|01010⟩−|01101⟩+|10001⟩+|10011⟩+|10111⟩−|11000⟩+|11110⟩) (12)

## 5. Acknowledgements

This work was partially supported by the Région Bourgogne Franche-Comté, project PHYFA (contract 20174-06235), the French “Investissements d’Avenir” programme, project ISITE-BFC (contract ANR-15-IDEX-03) and by “BQR Mobilité Doctorante” at U.T.B.M. The first author thanks the Department of Mathematics at Auburn University for its hospitality. The first author wants also to thank Dr. Mark Carpenter (Auburn University) and Dr. Frabrice Lauri (U.T.B.M.) for lectures and discussion about Machine Learning. The authors want to thank Dr. Frédéric Holweck for his advices, comments and discussions.