# An efficient quantum algorithm for generative machine learning

A central task in the field of quantum computing is to find applications where quantum computer could provide exponential speedup over any classical computer. Machine learning represents an important field with broad applications where quantum computer may offer significant speedup. Several quantum algorithms for discriminative machine learning have been found based on efficient solving of linear algebraic problems, with potential exponential speedup in runtime under the assumption of effective input from a quantum random access memory. In machine learning, generative models represent another large class which is widely used for both supervised and unsupervised learning. Here, we propose an efficient quantum algorithm for machine learning based on a quantum generative model. We prove that our proposed model is exponentially more powerful to represent probability distributions compared with classical generative models and has exponential speedup in training and inference at least for some instances under a reasonable assumption in computational complexity theory. Our result opens a new direction for quantum machine learning and offers a remarkable example in which a quantum algorithm shows exponential improvement over any classical algorithm in an important application field.

## Authors

• 5 publications
• 3 publications
• 1 publication
01/20/2021

### Enhancing Generative Models via Quantum Correlations

Generative modeling using samples drawn from the probability distributio...
09/23/2021

### Quantum algorithms for group convolution, cross-correlation, and equivariant transformations

Group convolutions and cross-correlations, which are equivariant to the ...
10/29/2020

### Quantum advantage for differential equation analysis

Quantum algorithms for both differential equation solving and for machin...
10/26/2016

### Quantum-enhanced machine learning

The emerging field of quantum machine learning has the potential to subs...
08/05/2021

### Quantum Topological Data Analysis with Linear Depth and Exponential Speedup

Quantum computing offers the potential of exponential speedups for certa...
11/17/2017

### Hardening Quantum Machine Learning Against Adversaries

Security for machine learning has begun to become a serious issue for pr...
04/22/2020

### Fast Quantum Algorithm for Learning with Optimized Random Features

Kernel methods augmented with random features give scalable algorithms f...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## References

• (1) Shor, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM review 41, 303–332 (1999).
• (2) Feynman, R. P. Simulating physics with computers. International journal of theoretical physics 21, 467–488 (1982).
• (3) Lloyd, S. et al. Universal quantum simulators. SCIENCE-NEW YORK THEN WASHINGTON- 1073–1077 (1996).
• (4) Biamonte, J. et al. Quantum machine learning. Nature 195–202.
• (5) Ciliberto, C. et al. Quantum machine learning: a classical perspective. arXiv preprint arXiv:1707.08561 (2017).
• (6) Brandão, F. G. & Svore, K. Quantum speed-ups for semidefinite programming. arXiv preprint arXiv:1609.05537 (2016).
• (7) Brandão, F. G. et al. Exponential quantum speed-ups for semidefinite programming with applications to quantum learning. arXiv preprint arXiv:1710.02581 (2017).
• (8) Farhi, E. et al. A quantum adiabatic evolution algorithm applied to random instances of an np-complete problem. Science 292, 472–475 (2001).
• (9) Jebara, T. Machine learning: discriminative and generative, vol. 755 (Springer Science & Business Media, 2012).
• (10) Harrow, A. W., Hassidim, A. & Lloyd, S. Quantum algorithm for linear systems of equations. Physical review letters 103, 150502 (2009).
• (11) Wiebe, N., Braun, D. & Lloyd, S. Quantum algorithm for data fitting. Physical review letters 109, 050505 (2012).
• (12) Lloyd, S., Mohseni, M. & Rebentrost, P. Quantum algorithms for supervised and unsupervised machine learning. arXiv preprint arXiv:1307.0411 (2013).
• (13) Lloyd, S., Mohseni, M. & Rebentrost, P.

Quantum principal component analysis.

Nature Physics 10, 631–633 (2014).
• (14) Rebentrost, P., Mohseni, M. & Lloyd, S. Quantum support vector machine for big data classification. Physical review letters 113, 130503 (2014).
• (15) Cong, I. & Duan, L. Quantum discriminant analysis for dimensionality reduction and classification. New Journal of Physics 18, 073011 (2016).
• (16) Giovannetti, V., Lloyd, S. & Maccone, L. Quantum random access memory. Physical review letters 100, 160501 (2008).
• (17) Shalev-Shwartz, S. & Ben-David, S. Understanding machine learning: From theory to algorithms (Cambridge university press, 2014).
• (18) Goodfellow, I., Bengio, Y. & Courville, A. Deep learning (MIT press, 2016).
• (19) Giovannetti, V., Lloyd, S. & Maccone, L. Architectures for a quantum random access memory. Physical Review A 78, 052310 (2008).
• (20) Aaronson, S. Read the fine print. Nature Physics 11, 291–293 (2015).
• (21) Bishop, C. M. Pattern recognition and machine learning (springer, 2006).
• (22) Supplementary Material.
• (23) Le Roux, N. & Bengio, Y. Representational power of restricted boltzmann machines and deep belief networks. Neural computation 20, 1631–1649 (2008).
• (24) Perez-Garcia, D., Verstraete, F., Wolf, M. & Cirac, J. Peps as unique ground states of local hamiltonians. Quantum Information & Computation 8, 650–663 (2008).
• (25) Schwarz, M., Temme, K. & Verstraete, F. Preparing projected entangled pair states on a quantum computer. Physical review letters 108, 110502 (2012).
• (26) Abrams, D. S. & Lloyd, S.

Quantum algorithm providing exponential speed increase for finding eigenvalues and eigenvectors.

Physical Review Letters 83, 5162 (1999).
• (27) Dagum, P. & Luby, M. Approximating probabilistic inference in bayesian belief networks is np-hard. Artificial intelligence 60, 141–153 (1993).
• (28) Kitaev, A. Y., Shen, A. & Vyalyi, M. N. Classical and quantum computation, vol. 47 (American Mathematical Society Providence, 2002).
• (29) Oliveira, R. & Terhal, B. M. The complexity of quantum spin systems on a two-dimensional square lattice. Quantum Information & Computation 8, 900–924 (2008).

## I Representational power and generalization ability

In this section, we briefly introduce some basic concepts in statistical learning theory

vapnik2013natures , the theoretical foundation of machine learning. Then we discuss the intuitive connection between the generalization ability and representational power of a machine learning model. This connection does not hold in the sense of mathematical rigor and counter examples exist, however, it is still a good guiding principle in practice. More details can be found in the introductory book shalev2014understandings .

A simplified way to formulate a machine learning task is as follows: given data generated independently from a distribution (which is unknown), a set of hypothesis

(which could be a model with some parameters) and a loss function

characterizing how “good” a hypothesis is, try to find an produced by some machine learning algorithm , minimizing the actual loss:

 LD(hA)∼ϵbias+ϵest where ϵbias=minh∈HLD(h) and ϵest depends on M. (S1)

The term represents the bias error. When is a hypothesis class which is not rich enough or the number of parameters in the model is small, this term might be large no matter how many training data are given, which leads to underfitting. The term represents the generalization error. When is a very rich hypothesis class or the number of parameters in the model is too large, this term might be large if the number of training data is not enough, which leads to overfitting. This is the bias-complexity trade-off of a machine learning model.

The generalization ability of a machine learning model is usually quantified by sample complexity in the framework of Probably Approximately Correct (PAC) learning valiant1984theorys , which means if the number of training data is larger than , the generalization error is bounded by with probability at least . It has been proved that the sample complexity has the same order of magnitude as a quantity of the hypothesis set , the Vapnik-Chervonenkis dimension (VC dimension)vapnik2015uniforms

. The VC dimension could be used to characterize the “effective” degree of freedom of a model, which is the number of parameters in most situations. With a smaller VC dimension, the generalization ability gets better.

There are some caveats of using the VC dimension to connect the generalization ability and the representational power of a machine learning model though, which include: (i) the VC dimension theory only works in the framework of PAC learning (thus restricted to supervised learning), so it is not directly applicable to generative models; (ii) the number of parameters does not always match the VC dimension, e.g., the combination of several parameters as

only counts as one effective parameter, meanwhile there also exists such a model with only one parameter but having infinite VC dimensions. However, despite of existence of those counter examples, the VC dimension matches the number of parameters in most situations, so it is still a guiding principle to choose models with a smaller number of parameters. If the number of parameters is large, in order to determine each parameter, it needs a large amount of information, thus a large number of data is necessary.

In the case of the QGM, we find a distribution (illustrated by a blue circle in Fig. S1) such that, in order to represent it or equivalently make small, the factor graph should have a number of parameters exponentially large (Fig. S1b). In this case, the distribution that the factor graph could represent has a very large degrees of freedom. However, the distribution which the QGM could represent only occupies a small corner of the whole distribution space (Fig. S1b), so the information needed to determine the parameters will be much smaller. Similar reasoning has been used to illustrate the necessity of using deep architecture in neural network goodfellow2016deeps ; maass1994comparisons ; maass1997boundss ; montufar2014numbers ; poggio2015theorys ; eldan2015powers ; mhaskar2016learning1s ; raghu2016expressives ; poole2016exponentials ; mhaskar2016learning2s ; lin2016doess ; liang2016deeps .

## Ii Reduction of typical generative models to factor graphs

In this section, we review typical classical generative models with graph structure. There are two large classes: one is the probabilistic graphical model koller2009probabilistics ; bishop2006patterns , and the other is the energy-based generative neural network. Strictly speaking, the later one is a special case of the former, but we regard it as another category because it is usually referred in the context of deep learning goodfellow2016deeps . We discuss the “canonical form” of generative models, the factor graph, and show how to convert various classical generative models into this canonical form. The details could be found in the books in Refs. koller2009probabilistics ; bishop2006patterns .

First, we introduce two probabilistic graphical models as shown in Fig. S2: the Bayesian network (directed graphical model) and the Markov random field (undirected graphical model). A Bayesian network (Fig. S2a) is defined on a directed acyclic graph. For each node , assign a transition probability where the parent means starting point of a directed edge of which the end point is . If there is no parent for , assign a probability . The total probability distribution is the product of these conditional probabilities. A Markov random field (Fig. S2b) is defined on an undirected graph. The sub-graph in each dashed circle is a clique in which each pair of nodes is connected. For each clique, assign a non-negative function for each node variable. The total probability distribution is proportional to the product of these functions.

Then we introduce four typical generative neural networks as shown in Fig. S3: the Boltzmann machine, the restricted Boltzmann machine, the deep Boltzmann machine, and the deep belief network. These neural networks are widely used in deep learning goodfellow2016deeps .

All the above generative models could be represented in the form of factor graph. Factor graph could be viewed as a canonical form of the generative models with graph structure. The factor graph representation of the Bayesian network and the Markov random field simply regards the conditional probability and function as the factor correlation . The factor graph representations of the Boltzmann machine, the restricted Boltzmann machine and the deep Boltzmann machine are graphs such that there is a square in each edge in the original network with correlation . The factor graph with unbounded degrees in these cases could be simulated by a factor graph with a constantly bounded degree, which is shown in Fig. S4. The idea is to add equality constraints which could be simulated by correlation with being a very large positive number since

 eaxixj−axi/2−axj/2=δxixj+e−a/2(1−δxixj)⟶δxixj as a→+∞. (S2)

The factor graph representation of the deep belief network is a mixture of the Bayesian network and the restricted Boltzmann machine except that one correlation in the part of directed graph involves an unbounded number of variables. This is the conditional probability with one variable conditioned on the values of variables in the previous layer. It is

 P(y|⋯,xi,⋯)=e(∑iaijxi+bj)y1+e∑iaijxi+bj. (S3)

where are the parents of . Given the values of , it’s easy to sample according to (which is actually the original motivation for introducing the deep belief network). Thus the process could be represented by a Boolean circuit with random coins. In fact, a circuit with random coins is a Bayesian network: each logical gate is basically a conditional probability (for example, the AND gate could be simulated by the conditional probability ). So this conditional probability could be represented by a Bayesian network in which the degree of each node is bounded by a constant. Thus deep belief network could be represented by a factor graph with a constant degree.

One important property is that the conditional probability of a factor graph is still factor graph. Actually, the correlation becomes simpler: the number of variables involved does not increase, which means approximately computing the conditional probability of probabilistic graphical models could be reduced to preparing the state (which will be shown in next section ). Then we arrive at the following lemma:

###### Lemma 1.

All the probability distributions or conditional probability distributions of probabilistic graphical models and energy-based neural networks can be represented efficiently by the factor graph with a constant degree.

## Iii Parent Hamiltonian of the State |Q⟩

We consider tensor network representation of defined on a graph with a constant degree.

 |Q⟩≡M1⊗⋯⊗Mn|G⟩ (S4)

where is the graph state that is the unique ground state of the frustration-free local Hamiltonian with zero ground-state energy:

 HG=∑iHi=∑iI−(⨂j∈{i's neighbors}Zj)⊗Xi2 (S5)

where each is the projector of the stabilizer gottesman1997stabilizers ; raussendorf2001ones supported only on the node and its neighbors, thus being local since the degree of the graph is a constant. And

 Hi|G⟩=0. (S6)

Next we construct a Hamiltonian:

 HQ=∑iH′i=∑i⎛⎝⨂j∈{i}∪{i'% s neighbors}M−1j⎞⎠†Hi⎛⎝⨂j∈{i}∪{i's neighbors}M−1j⎞⎠. (S7)

First, we show that is the ground state of . Since is positive semidefinite, thus is also positive semidefinite, the eigenvalue of is no less than .

 ⟨Q|H′i|Q⟩ = ⟨Q|⎛⎝⨂j∈{i}∪{i's neighbors}M−1j⎞⎠†Hi⎛⎝⨂j∈{i}∪{i's % neighbors}M−1j⎞⎠|Q⟩ (S8) = ⟨G|⎛⎝⨂k∈all the nodes except for js(MkM†k)−1⎞⎠⊗Hi|G⟩ = 0,

so which means is the ground state of . Then we prove is the unique ground state. Suppose satisfying for some state , which implies for every . So

 ⟨Q′|H′i|Q′⟩ = ⟨Q′|⎛⎝⨂j∈{i}∪{i's % neighbors}M−1j⎞⎠†Hi⎛⎝⨂j∈{i}∪{i's neighbors}M−1j⎞⎠|Q′⟩ (S9) = ⟨G′|Hi⊗⎛⎝⨂k∈all % the nodes except for js(MkM†k)−1⎞⎠|G′⟩ = ∣∣ ∣∣Hi⊗⎛⎝⨂k∈all the nodes except for jsM−1k⎞⎠|G′⟩∣∣ ∣∣2 = 0,

which implies

 Hi⊗⎛⎝⨂k∈all the nodes except for i and jsM−1k⎞⎠|G′⟩=0⟹Hi|G′⟩=0 (S10)

for every , thus . This proves the uniqueness of as the ground state of .

## Iv Proof of Theorem 2

In this section, we use computational complexity theory to prove the exponential improvement on representational power of the QGM over any factor graph in which each correlation is easy to compute given a specific value for all variables (including all the models mentioned above). The discussion of the relevant computational complexity theory could be found in the book in Ref. arora2009computationals or in the recent review article on quantum supremacy Harrow2017s .

To prove theorem 2, first we define the concept of multiplicative error. Denote the probability distribution produced by the QGM as . Then we ask whether there exists a factor graph such that its distribution approximates to some error. It is natural to require that if is very small, should also be very small, which means rare events should still be rare. So we define the following error model:

###### Definition 1 (Multiplicative Error).

Distribution approximates distribution to multiplicative error means

 |p(x)−q(x)|≤γq(x) (S11)

for any , where .

This error can also guarantee that any local behavior is approximately the same since the -distance can be bounded by it:

 ∑x|p(x)−q(x)|≤∑xγq(x)=γ. (S12)

But only bounding -distance cannot guarantee that rare event is still rare. Multiplicative error for small implies that the KL-divergence is bounded by

 ∣∣∣∑xq(x)logp(x)q(x)∣∣∣ ≤ ∑xq(x)log(1+∣∣∣p(x)q(x)−1∣∣∣) (S13) ≤ ∑xq(x)γ=γ.

The probability of a factor graph can be written as

 p(x|z)=∑yp(x,y,z)∑x,yp(x,y,z)=∑yf(x,y)∑x,yf(x,y) (S14)

where is product of a polynomial number of relatively simple correlations, thus non-negative and can be computed in polynomial time and are hidden variables.

The probability of the QGM can be written as

 q(x)=∑yg(x,y)∑x,yg(x,y) (S15)

where is product of a polynomial number of tensors given a specific assignment of indices and are the virtual indices or physical indices of hidden variables, is the remaining physical indices. Different from , is a complex number in general. Thus in some sense, we can say is the result of quantum interference in contrast to the case of (or ) which is only summation of non-negative numbers. Since is summation of complex numbers, in the process of summation, it can oscillate dramatically, so we expect being more complex than (or ). This can be formalized as the following lemma.

###### Lemma 2 (Stockmeyer’s theorem stockmeyer1976polynomials ).

There exists an algorithm which can approximate

 P=Prt[f(t)=1]=12r∑t∈{0,1}rf(t) (S16)

by , for any boolean function , to multiplicative error if can be computed efficiently given .

algorithms denote algorithms that a probabilistic Turing machine, supplied with an oracle that can solve all the NP problems in one step, can run in polynomial time.

Without the constraint , approximating by such that is in general #P-hard even if . Roughly speaking, the complexity class #P valiant1979complexitys includes problems counting the number of witnesses of an NP problem, which is believed much harder than NP (see Ref. Harrow2017s for a quantum computing oriented introduction). The above lemma shows that summation of non-negative numbers is easier than complex numbers in general. This formulates that quantum interference is more complex than classical probability superposition.

Stockmeyer’s theorem was firstly used to separate classical and quantum distribution in Ref. aaronson2011computationals where the classical distribution is sampled by a probabilistic Turing machine in polynomial time and the quantum distribution is sampled from linear optics network. In some sense, our result could be viewed as a development of this result: the classical distribution is not necessarily sampled by a classical device efficiently, instead, the distribution could be approximated by a factor graph to multiplicative error. An efficient classical device is a special case of factor graph because it could be represented as a Boolean circuit with random coins, this could be represented as the Bayesian network as we have discussed above about the representation of deep belief network by factor graph.

Then we give the proof of theorem 2:

###### Proof.

Assume there exists a factor graph from which we can compute the conditional probability , we will prove that approximating to multiplicative error, i.e., , is in (the meaning of this complexity class will be explained later).

Suppose is defined as , there exists an algorithm approximating by such that and by such that . Define and we have

 |˜p(x)−p(x|z)| = ∣∣∣˜P2˜P1−P2P1∣∣∣ (S17) ≤ ∣∣∣˜P2˜P1−P2˜P1∣∣∣+∣∣∣P2˜P1−P2P1∣∣∣ = |˜P2−P2|˜P1+P2∣∣∣1˜P1−1P1∣∣∣ = |˜P2−P2|˜P1+P2˜P1P1|˜P1−P1| ≤ (γ1+γ2)P2˜P1 ≤ γ1+γ21−γ1P2P1 = γ1+γ21−γ1p(x|z),

then

 |˜p(x)−q(x)|≤|˜p(x)−p(x|z)|+|p(x|z)−q(x)|≤γ1+γ21−γ1p(x|z)+γq(x)≤γ+γ1+γ2+γ2γ1−γ1q(x)<12q(x), (S18)

the last step is because we choose and as sufficiently small as . Under the assumption that the representation is efficient, such can be represented by a polynomial size circuit, where the circuit corresponds to the description of the factor graph. So can be computed in . “/poly” is because we do not need to construct the circuit efficiently karp1982turings .

In Ref. gao2016quantums , we introduced a special form of QGM that corresponds to a graph state (Fig. S5) with one layer of invertible matrices such that computing to multiplicative error with is at least . So assuming the efficient representation of QGM by factor graph, we will get

 P#P⊆BPPNP/poly. (S19)

Then it follows basically the same reasoning as the proof of theorem 3 in Ref. aaronson2017implausibilitys except they consider and the result is polynomial hierarchy collapse to the fourth level. According to Toda’s theorem toda1989computationals , PH, this implies . According to Adleman’s result adleman1978twos , BPPP/poly, relativizes, which means . Karp-Lipton theorem karp1982turings states if , then (polynomial hierarchy collapse to the second level); the result is also relativizing then it follows if , then which means (polynomial hierarchy collapse to the third level). ∎

## V Training and inference in quantum generative model

In this section, we discuss how to train the QGM and make inference on it. First, we briefly review how to train and make inference on some typical factor graphs. Then we reduce inference on the QGM to preparation of a tensor network state. Finally, we derive the formula for computing gradient of the KL-divergence of QGM and reduce it to the preparation of a tensor network state.

Inference problems on probabilistic graphical models include computing marginal probability and conditional probability (which includes marginal probability as a special case when the set is empty). To approximately compute the probability on some variables, we sample the marginal probability or the conditional probability , and then measure the variable set . We use the Boltzmann machine as an example to show how to train energy-based neural networks. The KL-divergence given data is

 1M∑v∈data setlogp(v). (S20)

The training is to optimize this quantity with the gradient descent method. The gradient (for simplicity, we only present between hidden and visible nodes) is

 1M∂aij∑v∈data setlogp(v) = 1M∑v∈data set∑hP(h|v)hivj−∑h,vP(h,v)hivj (S21) = ⟨hivj⟩data−⟨hivj⟩model.

The subscript “data” denotes distribution with probability randomly chosen from the training data according to , where is the distribution defined by the graphical model, randomly sampling hidden variables . The distribution “model” is . So the training is reduced to sampling some distribution or conditional distribution defined by the generative neural network and then estimating the expectation value of local observables. Since the QGM could represent conditional probability of these models and the corresponding state is the unique ground state of local Hamiltonian, inference and training could be reduced to the ground state preparation problem.

Similarly, inference on the QGM could also be reduced to preparation of a quantum state. As an example, let us compute the marginal probability for the QGM:

 q(x)=∑yq(x,y)=∑y⟨Q|x,y⟩⟨x,y|Q⟩⟨Q|Q⟩=⟨Q|(|x⟩⟨x|⊗I)|Q⟩⟨Q|Q⟩=⟨Q|(O⊗I)|Q⟩⟨Q|Q⟩. (S22)

So the problem is reduced to preparing the state and then measuring the local observable . Similarly, the conditional probability is given by

 q(x|z)=∑yq(x,y,z)q(z)=⟨Q(z)|(|x⟩⟨x|⊗I)|Q(z)⟩⟨Q(z)|Q(z)⟩=⟨Q(z)|(O⊗I)|Q(z)⟩⟨Q(z)|Q(z)⟩, (S23)

where

 |Q(z)⟩≡(I⊗⟨z|)|Q⟩ (S24)

is a tensor network state.

The KL-divergence of the QGM is given by

 1M∑v∈data setlog⟨Q(v)|Q(v)⟩−log⟨Q|Q⟩, (S25)

and its derivative with respect to a parameter in is

 ∂θi⎛⎝1M∑v∈data setlog⟨Q(v)|Q(v)⟩−log⟨Q|Q⟩⎞⎠ = 1M∑v∈data set∂θi⟨Q(v)|Q(v)⟩⟨Q(v)|Q(v)⟩−∂θi⟨Q|Q⟩⟨Q|Q⟩. (S26)

Let us consider the second term first.

 ∂θi⟨Q|Q⟩⟨Q|Q⟩ = ∂θi⟨G|M†1M1⊗⋯⊗M†nMn|G⟩⟨Q|Q⟩ (S27) = ⟨G|M†1M1⊗⋯⊗∂θiM†iMi⊗⋯⊗M†nMn|G⟩⟨Q|Q⟩ = ⟨G|M†1M1⊗⋯⊗(M†i(∂θiMi)+(∂θiM†i)Mi)⊗⋯⊗M†nMn|G⟩⟨Q|Q⟩ = ⟨G|M†1M1⊗⋯⊗(M†i(∂θiMi)M−1iMi+M†iM†−1i(∂θiM†i)Mi)⊗⋯⊗M†nMn|G⟩⟨Q|Q⟩ = ⟨Q|(∂θiMi)M−1i+M†−1i(∂θiM†i)|Q⟩⟨Q|Q⟩.

It is basically the same for if is the parameter for the unconditioned qubit:

 ∂θi⟨Q(v)|Q(v)⟩⟨Q(v)|Q(v)⟩=⟨Q(v)|(∂θiMi)M−1i+M†−1i(∂θiM†i)|Q(v)⟩⟨Q(v)|Q(v)⟩. (S28)

If is the parameter of conditioned variables, it is more complicated. Let which is independent of and , , where denote all variables in except .

 ∂θi⟨Q(v)|Q(v)⟩⟨Q(v)|Q(v)⟩ = ⟨Ri|∂θi(M†i|vi⟩⟨vi|Mi)|Ri⟩⟨Q(v)|Q(v)⟩ (S29) = ⟨Ri|(∂θiM†i)|vi⟩⟨vi|Mi+M†i|vi⟩⟨vi|(∂θiMi)|Ri⟩⟨Q(v)|Q(v)⟩ = ⟨Ri|M†iM†−1i(∂θiM†i)|vi⟩⟨vi|Mi+M†i|vi⟩⟨vi|(∂θiMi)M−1iMi|Ri⟩⟨Q(v)|Q(v)⟩ = ⟨Q(v/vi)|M†−1i(∂θiM†i)|vi⟩⟨vi|+|vi⟩⟨vi|(∂θiMi)M−1i|Q(v/vi)⟩⟨Q(v)|Q(v)⟩ = ⟨Q(v/vi)|M†−1i(∂θiM†i)|vi⟩⟨vi|+|vi⟩⟨vi|(∂θiMi)M−1i|Q(v/vi)⟩⟨Q(v/vi)|Q(v/vi)⟩/⟨Q(v)|Q(v)⟩⟨Q(v/vi)|Q(v/vi)⟩ = ⟨Q(v/vi)|M†−1i(∂θiM†i)|vi⟩⟨vi|+|vi⟩⟨vi|(∂θiMi)M−1i|Q(v/vi)⟩⟨Q(v/vi)|Q(v/vi)⟩/⟨Q(v/vi)|vi⟩⟨vi|Q(v/vi)⟩⟨Q(v/vi)|Q(v/vi)⟩.

So computing gradient of KL-divergence of QGM reduces to preparing tensor network state ( being empty, or respectively) and measuring the expectation value of , and . If the denominator is small, the error of the estimation through sampling could be large, in particular when the denominator is exponentially close to . But we do not need to be pessimistic. Since this denominator represents the probability of getting for measuring the -th variable, conditioned on the remaining variables being

, this quantity should not be small if the model distribution is close to the real data distribution. Otherwise, we are not able to observe this data. If the model distribution is far from the real data distribution, there is no need to estimate the gradient precisely. Moreover, statistical fluctuation in sampling in this case could even help to jump out of the local minimum, which is analogous to the stochastic gradient descent method in traditional machine learning

shalev2014understandings ; goodfellow2016deeps .

The efficiency of estimating the expectation value of local observable through sampling depends on the maximum value of the observable. In the case of calculation of gradient of the KL-divergence, the maximum value depends on

whose minimum singular value is set to

. Thus, the number of sampling is bounded by , where is the condition number which is the maximum singular value of , is the error of the estimation. In the case when is large, is ill-conditioned, resulting in bad estimation. One strategy to avoid this situation is to use regularization term shalev2014understandings ; goodfellow2016deeps . For example, we may use

 ∑itrM†iMi−χ|detMi|2, (S30)

as a penalty where

is a hyperparameter which adjust the importance of the second term. The first term guarantees the singular value of

should not be large and the second term guarantees that should not be too small. In the case that is not too large, it guarantees that is not too small, which means is not too ill-conditioned.

## Vi Proof of Theorem 3

First, we prove that could represent any tensor network efficiently:

###### Lemma 3.

Choosing the conditioned variables and invertible matrices properly, could efficiently represent any tensor network in which each tensor has constant degree and the virtual index range is bounded by a constant.

###### Proof.

For a tensor , consider a quantum circuit preparing the following state:

 ∑i⋯jAi⋯j|i⋯j⟩ (S31)

where the dimension of the Hilbert space is bounded by a constant if the number of indices and the range of each index of this tensor are bounded. The quantum circuit could be represented by a state like the one in Fig. S5 with constant size, conditioned on specific values of some variables. This is just a post-selection of measurement result in measurement-based quantum computingraussendorf2001ones . Then we consider contracting virtual indices between different tensors. In this case, direct contraction will lead to a problem: the edge connecting two different tensors may be an identity tensor instead of a Hadamard tensor as in state . We can solve this problem by introducing an extra variable in the middle of these two indices and connecting the two indices by Hadamard tensor. Further applying a Hadamard gate on it and conditioning on this extra variable being , the net effect is equivalent to connecting the two indices by an identity tensor.∎

Since computation on the QGM is reduced to preparing , we will focus on tensor network construction for the state in the following discussion. For an instance that our algorithm could run in polynomial time, and should be both at least . We construct a tensor network satisfying this requirement which at the same time encodes universal quantum computing. Therefore, a classical model is not able to to produce this result if quantum computation can not be efficiently simulated classically.

First, we construct the tensor network state encoding universal quantum computation. Consider the following history state encoding history of quantum computation:

 |ψclock⟩=1√T+1T∑t=0|t⟩⊗Vt⋯V1|0⟩⊗m (S32)

where is the number of gates in the quantum circuit, is the unary representation of the number (the first bits are set to and the last bits are set to ), is the input state in the 2D layout of the circuits shown in Fig. 1 of Ref.oliveira2008complexitys with (which is the number of qubits times the depth of the original circuit, i.e. roughly the same as ) qubits and is the -th gate defined in Fig. S6b, i.e., for . This history state has been used to prove QMA-hardness for local Hamiltonian on a 2D lattice oliveira2008complexitys . This state can be represented by a “2D” tensor network shown in Fig. S6d since there is only a constant number of gates applying on each qubit. It can be verified directly that the tensor network in Fig. S6d represents . For simplicity of illustration, we consider computation on a single qubit where the “2D” network (actually 1D in this case) is shown in Fig. S6a. The tensor in Fig. S6f must have the form and all of them have the same weight; the left line in Fig. S6d will be entangled with state in right line; the tensor in Fig. S6g is control- gate serving to make the right line in Fig. S6d if the left line is . In this way, the state is exactly . The general case (-qubit case) for universal quantum computing is constructed in the same way.

Second, we calculate the overlap between two successive tensor networks. From Fig. S6d, direct calculation shows that the tensor network is

 1√t+1t∑t1=0|t1⟩⊗Vt1⋯V1|0⟩⊗m. (S33)

Note that the number of bits in clock register is the same for any . Then the overlap between and is

 ⟨Qt|Qt−1⟩=1√(t−1)t(t−1)=√1−1t, (S34)

so

 ηt=1−1t≥12 for t≥2, (S35)

which means

 1mintηt=O(1). (S36)

Third, we calculate the parent Hamiltonian of tensor network . For simplicity, we consider quantum circuit on single qubit (Fig. S6a) and the general case follows similarly. Each local term is constructed from five variables shown in Fig. S6e. With different virtual indices denoted by , the tensor corresponds to a five dimensional space spanned by the following states:

 |000⟩⊗|x⟩|0⟩, |111⟩⊗|0⟩|0⟩, |100⟩⊗|x⟩|0⟩+|110⟩⊗|0⟩Ut|x⟩=|100⟩⊗|x⟩|0⟩+|110⟩⊗Vt(|x⟩|0⟩), (S37)

where can be either or . The corresponding local term in parent Hamiltonian is projector such that its null space is exactly the above subspace so its rank is :

 Πt = Π(1)t+Π(2)t+Π(3)t+Π(4)t,

where

 Π(1)t = |01⟩⟨01|⊗I⊗I⊗I+I⊗|01⟩⟨01|⊗I⊗I,rank: 23+23=16, Π(2)t = |000⟩⟨000|⊗I⊗|1⟩⟨1|+|111⟩⟨111|⊗(I⊗I−|00⟩⟨00|), rank:% 2+3=5, Π(3)t = |100⟩⟨100|⊗I⊗|1⟩⟨1|+|110⟩⟨110|