# Quantum Deep Learning

In recent years, deep learning has had a profound impact on machine learning and artificial intelligence. At the same time, algorithms for quantum computers have been shown to efficiently solve some problems that are intractable on conventional, classical computers. We show that quantum computing not only reduces the time required to train a deep restricted Boltzmann machine, but also provides a richer and more comprehensive framework for deep learning than classical computing and leads to significant improvements in the optimization of the underlying objective function. Our quantum methods also permit efficient training of full Boltzmann machines and multi-layer, fully connected models and do not have well known classical counterparts.


## Introduction

We present quantum algorithms to perform deep learning that outperform conventional, state-of-the-art classical algorithms in terms of both training efficiency and model quality. Deep learning is a recent technique used in machine learning that has substantially impacted the way in which classification, inference, and artificial intelligence (AI) tasks are modeled HOT06 ; CW08 ; Ben09 ; LYK+10 . It is based on the premise that to perform sophisticated AI tasks, such as speech and visual recognition, it may be necessary to allow a machine to learn a model that contains several layers of abstractions of the raw input data. For example, a model trained to detect a car might first accept a raw image, in pixels, as input. In a subsequent layer, it may abstract the data into simple shapes. In the next layer, the elementary shapes may be abstracted further into aggregate forms, such as bumpers or wheels. At even higher layers, the shapes may be tagged with words like “tire” or “hood”. Deep networks therefore automatically learn a complex, nested representation of raw data similar to layers of neuron processing in our brain, where ideally the learned hierarchy of concepts is (humanly) understandable. In general, deep networks may contain many levels of abstraction encoded into a highly connected, complex graphical network; training such graphical networks falls under the umbrella of deep learning.

Boltzmann machines (BMs) are one such class of deep networks, which formally are a class of recurrent neural nets with undirected edges and thus provide a generative model for the data. From a physical perspective, Boltzmann machines model the training data with an Ising model that is in thermal equilibrium. These spins are called units in the machine learning literature and encode features and concepts while the edges in the Ising model’s interaction graph represent the statistical dependencies of the features. The set of nodes that encode the observed data and the output are called the visible units ($v$), whereas the nodes used to model the latent concept and feature space are called the hidden units ($h$). Two important classes of BMs are the restricted Boltzmann machine (RBM), which takes the underlying graph to be a complete bipartite graph, and the deep restricted Boltzmann machine (dRBM), which is composed of many layers of RBMs (see Figure 1). For the purposes of discussion, we assume that the visible and hidden units are binary.

A Boltzmann machine models the probability of a given configuration of visible and hidden units by the Gibbs distribution (with inverse temperature $\beta = 1$):

$$P(v,h) = e^{-E(v,h)}/Z, \tag{1}$$

where $Z$ is a normalizing factor known as the partition function and the energy $E(v,h)$ of a given configuration of visible and hidden units is given by

$$E(v,h) = -\sum_i v_i b_i - \sum_j h_j d_j - \sum_{i,j} w^{vh}_{i,j} v_i h_j - \sum_{i,j} w^{v}_{i,j} v_i v_j - \sum_{i,j} w^{h}_{i,j} h_i h_j. \tag{2}$$

Here the vectors $b$ and $d$ are biases that provide an energy penalty for a unit taking the value $1$, and $w^{vh}_{i,j}$, $w^{v}_{i,j}$, and $w^{h}_{i,j}$ are weights which assign an energy penalty if the visible and hidden units both take value $1$. We denote $w = [w^{vh}, w^{v}, w^{h}]$ and let $n_v$ and $n_h$ be the numbers of visible and hidden units, respectively.
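As a concrete illustration of (1) and (2), the following minimal sketch evaluates the energy and the Gibbs distribution of a tiny RBM by brute-force enumeration. It assumes an RBM (so the intra-layer weights $w^{v}$ and $w^{h}$ are zero), and the parameter values and the function names `energy` and `gibbs_distribution` are hypothetical choices made for this example only. The enumeration costs $2^{n_v+n_h}$ evaluations, which is exactly the cost that makes direct computation impractical for large models.

```python
import itertools
import math

def energy(v, h, b, d, w):
    """Energy of an RBM configuration: Eq. (2) with no intra-layer weights."""
    e = -sum(vi * bi for vi, bi in zip(v, b))
    e -= sum(hj * dj for hj, dj in zip(h, d))
    e -= sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h)))
    return e

def gibbs_distribution(b, d, w):
    """Return ({(v, h): P(v, h)}, Z) by exhaustive enumeration, as in Eq. (1)."""
    configs = [(v, h)
               for v in itertools.product((0, 1), repeat=len(b))
               for h in itertools.product((0, 1), repeat=len(d))]
    boltzmann = {c: math.exp(-energy(c[0], c[1], b, d, w)) for c in configs}
    z = sum(boltzmann.values())
    return {c: x / z for c, x in boltzmann.items()}, z

# Hypothetical toy parameters: 2 visible units and 2 hidden units.
b = [0.1, -0.2]
d = [0.3, 0.0]
w = [[0.5, -0.4], [0.2, 0.1]]
p, z = gibbs_distribution(b, d, w)
```

The all-zero configuration always has energy $0$, so its probability is simply $1/Z$, which gives a quick sanity check on any implementation.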

Given some a priori observed data, referred to as the training set, learning for these models proceeds by modifying the strengths of the interactions in the graph to maximize the likelihood of the Boltzmann machine producing the given observations. Consequently, the training process uses gradient descent to find weights and biases that optimize the maximum–likelihood objective (ML–objective):

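$$O_{\rm ML} := \frac{1}{N_{\rm train}} \sum_{v \in x_{\rm train}} \log\left(\sum_{h=1}^{n_h} P(v,h)\right) - \frac{\lambda}{2} w^T w, \tag{3}$$

where $N_{\rm train}$ is the size of the training set, $x_{\rm train}$ is the set of training vectors, and $\lambda$ is an $L_2$–regularization term to combat overfitting. The derivative of $O_{\rm ML}$ with respect to the weights is

$$\frac{\partial O_{\rm ML}}{\partial w_{i,j}} = \langle v_i h_j \rangle_{\rm data} - \langle v_i h_j \rangle_{\rm model} - \lambda w_{i,j}, \tag{4}$$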

where the brackets denote the expectation values over the data and model for the BM. The remaining derivatives take a similar form Hin02 .
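The gradient formula (4) can be checked numerically on a model small enough to enumerate. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes an RBM, interprets the inner sum in (3) as a sum over all hidden configurations, and uses hypothetical names (`objective`, `gradient_w`) and toy values. The data term clamps the visible units to each training vector, and the model term averages over the full Gibbs distribution.

```python
import itertools
import math

def energy(v, h, b, d, w):
    """RBM energy: Eq. (2) with no intra-layer weights."""
    return -(sum(x * y for x, y in zip(v, b)) + sum(x * y for x, y in zip(h, d))
             + sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h))))

def objective(data, b, d, w, lam):
    """O_ML of Eq. (3), computed by exhaustive enumeration."""
    vs = list(itertools.product((0, 1), repeat=len(b)))
    hs = list(itertools.product((0, 1), repeat=len(d)))
    z = sum(math.exp(-energy(v, h, b, d, w)) for v in vs for h in hs)
    log_lik = sum(math.log(sum(math.exp(-energy(x, h, b, d, w)) for h in hs) / z)
                  for x in data)
    reg = 0.5 * lam * sum(wij ** 2 for row in w for wij in row)
    return log_lik / len(data) - reg

def gradient_w(data, b, d, w, lam):
    """Eq. (4): <v_i h_j>_data - <v_i h_j>_model - lam * w_ij."""
    n_v, n_h = len(b), len(d)
    vs = list(itertools.product((0, 1), repeat=n_v))
    hs = list(itertools.product((0, 1), repeat=n_h))
    grad = [[-lam * w[i][j] for j in range(n_h)] for i in range(n_v)]
    # Model term: expectation under the joint Gibbs distribution.
    z = sum(math.exp(-energy(v, h, b, d, w)) for v in vs for h in hs)
    for v in vs:
        for h in hs:
            p = math.exp(-energy(v, h, b, d, w)) / z
            for i in range(n_v):
                for j in range(n_h):
                    grad[i][j] -= p * v[i] * h[j]
    # Data term: visible units clamped to each training vector.
    for x in data:
        zx = sum(math.exp(-energy(x, h, b, d, w)) for h in hs)
        for h in hs:
            p = math.exp(-energy(x, h, b, d, w)) / zx
            for i in range(n_v):
                for j in range(n_h):
                    grad[i][j] += p * x[i] * h[j] / len(data)
    return grad

# Hypothetical toy problem.
b, d = [0.1, -0.2], [0.3, 0.0]
w = [[0.5, -0.4], [0.2, 0.1]]
data = [(1, 0), (1, 1), (0, 1)]
lam = 0.05
g = gradient_w(data, b, d, w, lam)
```

Comparing `g` against a central finite difference of `objective` is a convenient way to confirm that the signs in (4) are implemented consistently.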

Computing these gradients directly from (1) and (4) is exponentially hard in $n_v$ and $n_h$; thus, classical approaches resort to approximations such as contrastive divergence Hin02 ; SMH07 ; Tie08 ; SH09 ; Ben09 . Unfortunately, contrastive divergence does not provide the gradient of any true objective function ST10 , it is known to lead to suboptimal solutions TH09 ; BD07 ; FI11 , it is not guaranteed to converge in the presence of certain regularization functions ST10 , and it cannot be used directly to train a full Boltzmann machine. We show that quantum computation provides a much better framework for deep learning and illustrate this by providing efficient alternatives to these methods that are elementary to analyze, accelerate the learning process and lead to better models for the training data.

## GEQS Algorithm

We propose two quantum algorithms: Gradient Estimation via Quantum Sampling (GEQS) and Gradient Estimation via Quantum Amplitude Estimation (GEQAE). These algorithms prepare a coherent analog of the Gibbs state for Boltzmann machines and then draw samples from the resultant state to compute the expectation values in (4). Formal descriptions of the algorithms are given in the appendix. Existing algorithms for preparing these states LB97 ; TD98 ; PW09 ; DF11 ; ORR13 tend not to be efficient for machine learning applications or do not offer clear evidence of a quantum speedup. The inefficiency of PW09 ; ORR13 is a consequence of the uniform initial state having small overlap with the Gibbs state. The complexities of these prior algorithms, along with our own, are given in Table 1.

Our algorithms address this problem by using a non-uniform prior distribution for the probabilities of each configuration, which is motivated by the fact that we know a priori from the weights and the biases that certain configurations will be less likely than others. We obtain this distribution by using a mean–field (MF) approximation to the configuration probabilities. This approximation is classically efficient and typically provides a good approximation to the Gibbs states observed in practical machine learning problems Jor99 ; WH02 ; Tie08 . Our algorithms exploit this prior knowledge to refine the Gibbs state from copies of the MF state. This allows the Gibbs distribution to be prepared efficiently and exactly if the two states are sufficiently close.

The MF approximation, $Q(v,h)$, is defined to be the product distribution that minimizes the Kullback–Leibler divergence $\mathrm{KL}(Q\,\|\,P)$. The fact that it is a product distribution means that it can be efficiently computed and also can be used to find a classically tractable estimate of the partition function $Z$:

$$\log Z_Q := \sum_{v,h} Q(v,h) \log\left(\frac{e^{-E(v,h)}}{Q(v,h)}\right).$$

Here $Z_Q \le Z$ and equality is achieved if and only if $\mathrm{KL}(Q\,\|\,P) = 0$ Jor99 . Here $Q(v,h)$ does not need to be the MF approximation. The same formula also applies if $Q(v,h)$ is replaced by another efficient approximation, such as a structured mean–field theory calculation Xin02 .
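The bound $Z_Q \le Z$ can be illustrated on a toy RBM. The sketch below assumes the standard naive mean–field fixed-point updates, $\mu_i = \sigma(b_i + \sum_j w_{i,j}\nu_j)$ and $\nu_j = \sigma(d_j + \sum_i w_{i,j}\mu_i)$, iterated for a fixed number of sweeps; this iteration is a common heuristic and is not guaranteed to reach the global minimizer of the Kullback–Leibler divergence. All names and parameter values are hypothetical. Because the bound follows from Jensen's inequality, it holds for any product distribution, converged or not.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, b, d, w):
    """RBM energy: Eq. (2) with no intra-layer weights."""
    return -(sum(x * y for x, y in zip(v, b)) + sum(x * y for x, y in zip(h, d))
             + sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h))))

def mean_field(b, d, w, sweeps=200):
    """Naive mean-field marginals (mu for visible, nu for hidden units)."""
    n_v, n_h = len(b), len(d)
    mu, nu = [0.5] * n_v, [0.5] * n_h
    for _ in range(sweeps):
        mu = [sigmoid(b[i] + sum(w[i][j] * nu[j] for j in range(n_h))) for i in range(n_v)]
        nu = [sigmoid(d[j] + sum(w[i][j] * mu[i] for i in range(n_v))) for j in range(n_h)]
    return mu, nu

def q_prob(v, h, mu, nu):
    """Product distribution Q(v, h) defined by the marginals."""
    p = 1.0
    for bit, m in zip(tuple(v) + tuple(h), list(mu) + list(nu)):
        p *= m if bit else 1.0 - m
    return p

def log_partition_bounds(b, d, w):
    """Return (log Z, log Z_Q) by brute-force enumeration."""
    mu, nu = mean_field(b, d, w)
    z, log_zq = 0.0, 0.0
    for v in itertools.product((0, 1), repeat=len(b)):
        for h in itertools.product((0, 1), repeat=len(d)):
            e = energy(v, h, b, d, w)
            q = q_prob(v, h, mu, nu)
            z += math.exp(-e)
            log_zq += q * (-e - math.log(q))
    return math.log(z), log_zq

# Hypothetical toy parameters.
b, d = [0.1, -0.2], [0.3, 0.0]
w = [[0.5, -0.4], [0.2, 0.1]]
log_z, log_zq = log_partition_bounds(b, d, w)
```

In a practical setting the enumeration in `log_partition_bounds` would be replaced by the closed-form expression for the mean energy and entropy of a product distribution; the enumeration is used here only so that the exact $\log Z$ is available for comparison.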

Let us assume that a constant $\kappa$ is known such that

$$P(v,h) \le \frac{e^{-E(v,h)}}{Z_Q} \le \kappa Q(v,h), \tag{5}$$

and define the following “normalized” probability of a configuration as

$$\mathcal{P}(v,h) := \frac{e^{-E(v,h)}}{\kappa Z_Q Q(v,h)}. \tag{6}$$

Note that

$$Q(v,h)\mathcal{P}(v,h) \propto P(v,h), \tag{7}$$

which means that if the state

$$\sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle, \tag{8}$$

is prepared and each of the amplitudes is multiplied by $\sqrt{\mathcal{P}(v,h)}$ then the result will be proportional to the desired state.

The above process can be made operational by adding an additional quantum register to compute $\mathcal{P}(v,h)$ and using quantum superposition to prepare the state

$$\sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle|\mathcal{P}(v,h)\rangle\left(\sqrt{1-\mathcal{P}(v,h)}\,|0\rangle + \sqrt{\mathcal{P}(v,h)}\,|1\rangle\right). \tag{9}$$

The target Gibbs state is obtained if the right–most qubit is measured to be $1$. Preparing (9) is efficient because $e^{-E(v,h)}$ and $Q(v,h)$ can be calculated in time that is polynomial in the number of visible and hidden units. The success probability of preparing the state in this manner is

$$P_{\rm success} = \frac{Z}{\kappa Z_Q} \ge \frac{1}{\kappa}. \tag{10}$$

In practice, our algorithm uses quantum amplitude amplification BHM+00 to quadratically boost the probability of success if (10) is small.
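The arithmetic behind (5), (6), and (10) can be verified classically, since measuring the right–most qubit of (9) is the coherent counterpart of ordinary rejection sampling with proposal $Q$ and acceptance probability $\mathcal{P}$. The sketch below is not a quantum simulation. It assumes a toy RBM and an arbitrary product distribution with hypothetical marginals `MU` and `NU` (the text notes that $Q$ need not be the MF approximation), and it picks the smallest $\kappa$ that satisfies (5).

```python
import itertools
import math

# Hypothetical toy RBM (2 visible, 2 hidden units) and product distribution Q.
B, D = [0.1, -0.2], [0.3, 0.0]
W = [[0.5, -0.4], [0.2, 0.1]]
MU, NU = [0.6, 0.5], [0.6, 0.5]   # marginals of Q, chosen by hand

def energy(v, h):
    return -(sum(x * y for x, y in zip(v, B)) + sum(x * y for x, y in zip(h, D))
             + sum(W[i][j] * v[i] * h[j] for i in range(2) for j in range(2)))

def q_prob(v, h):
    p = 1.0
    for bit, m in zip(v + h, MU + NU):
        p *= m if bit else 1.0 - m
    return p

CONFIGS = [(v, h) for v in itertools.product((0, 1), repeat=2)
           for h in itertools.product((0, 1), repeat=2)]

Z = sum(math.exp(-energy(v, h)) for v, h in CONFIGS)
# Variational estimate: log Z_Q = sum_Q Q log(e^{-E} / Q).
Z_Q = math.exp(sum(q_prob(v, h) * (-energy(v, h) - math.log(q_prob(v, h)))
                   for v, h in CONFIGS))
# Smallest kappa satisfying Eq. (5) for this Q.
KAPPA = max(math.exp(-energy(v, h)) / (Z_Q * q_prob(v, h)) for v, h in CONFIGS)

def accept_prob(v, h):
    """Normalized probability of Eq. (6)."""
    return math.exp(-energy(v, h)) / (KAPPA * Z_Q * q_prob(v, h))

# Probability that the flag qubit is measured as 1, Eq. (10).
P_SUCCESS = sum(q_prob(v, h) * accept_prob(v, h) for v, h in CONFIGS)
```

Conditioned on success, the distribution $Q(v,h)\mathcal{P}(v,h)/P_{\rm success}$ coincides with the Gibbs distribution, which is the content of (7).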

The complexity of the algorithm is determined by the number of quantum operations needed in the gradient calculation. Since the evaluation of the energy requires a number of operations that, up to logarithmic factors, scales linearly with the total number of edges in the model, the combined cost of estimating the gradient is

$$\tilde{O}\left(N_{\rm train} E \left(\sqrt{\kappa} + \max_{x \in x_{\rm train}} \sqrt{\kappa_x}\right)\right), \tag{11}$$

where $\kappa_x$ is the value of $\kappa$ corresponding to the case where the visible units are constrained to be $x$. The cost of estimating $Q(v,h)$ and $Z_Q$ (see appendix) does not asymptotically contribute to the cost. In contrast, the number of operations required to classically estimate the gradient using greedy layer–by–layer optimization Ben09 scales as

$$\tilde{O}\left(N_{\rm train} \ell E\right), \tag{12}$$

where $\ell$ is the number of layers in the dRBM and $E$ is the number of connections in the BM. Assuming that $\kappa$ is a constant, the quantum sampling approach provides an asymptotic advantage for training deep networks. We provide numerical evidence in the appendix showing that $\kappa$ can often be made constant by increasing $n_h$ and the regularization parameter $\lambda$.

The number of qubits required by our algorithm is minimal compared to existing quantum machine learning algorithms ABG06 ; LMR13 ; RML13 ; QKS15 . This is because the training data does not need to be stored in a quantum database, which would otherwise require logical qubits NC00 ; GLM08 . Rather, if $\mathcal{P}(v,h)$ is computed with $\lceil \log(1/\mathcal{E}) \rceil$ bits of precision and can be accessed as an oracle then only

$$O(n_h + n_v + \log(1/\mathcal{E}))$$

logical qubits are needed for the GEQS algorithm. The number of qubits required will increase if $\mathcal{P}(v,h)$ is computed using reversible operations, but recent developments in quantum arithmetic can substantially reduce such costs WR14 .

Furthermore, the exact value of $\kappa$ need not be known. If a value of $\kappa$ is chosen that does not satisfy (5) for all configurations then our algorithm will still be able to approximate the gradient if $\mathcal{P}(v,h)$ is clipped to the interval $[0,1]$. The algorithm can therefore always be made efficient, at the price of introducing errors in the resultant probability distribution, by holding $\kappa$ fixed as the size of the BM increases. These errors emerge because the state preparation algorithm will under–estimate the relative probability of configurations that violate (5); however, if the sum of the probabilities of these violations is small then a simple continuity argument reveals that the fidelity of the approximate Gibbs state and the correct state is high. In particular, if we define “bad” to be the set of configurations that violate (5) then the continuity argument shows that if

$$\sum_{(v,h) \in {\rm bad}} P(v,h) \le \epsilon,$$

then the fidelity of the resultant state with the Gibbs state is at least $1-\epsilon$. This is formalized in the appendix.
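The effect of choosing $\kappa$ too small can be explored numerically. The sketch below clips the acceptance probability to $[0,1]$, forms the squared amplitudes of the post-selected state, and computes its overlap with the exact Gibbs state together with $\epsilon$, the total Gibbs probability of the configurations that violate (5). It is a classical calculation on a toy RBM with hypothetical parameters and a hand-picked product distribution, and it takes the fidelity to be the overlap $|\langle\psi|\phi\rangle|$ of the two states, which is an assumption about the convention.

```python
import itertools
import math

# Hypothetical toy RBM (2 visible, 2 hidden units) and product distribution Q.
B, D = [0.1, -0.2], [0.3, 0.0]
W = [[0.5, -0.4], [0.2, 0.1]]
MU, NU = [0.6, 0.5], [0.6, 0.5]

def energy(v, h):
    return -(sum(x * y for x, y in zip(v, B)) + sum(x * y for x, y in zip(h, D))
             + sum(W[i][j] * v[i] * h[j] for i in range(2) for j in range(2)))

def q_prob(v, h):
    p = 1.0
    for bit, m in zip(v + h, MU + NU):
        p *= m if bit else 1.0 - m
    return p

CONFIGS = [(v, h) for v in itertools.product((0, 1), repeat=2)
           for h in itertools.product((0, 1), repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in CONFIGS)
Z_Q = math.exp(sum(q_prob(v, h) * (-energy(v, h) - math.log(q_prob(v, h)))
                   for v, h in CONFIGS))
KAPPA_TIGHT = max(math.exp(-energy(v, h)) / (Z_Q * q_prob(v, h)) for v, h in CONFIGS)

def clipped_fidelity(kappa):
    """Return (overlap with the Gibbs state, epsilon) when P is clipped to [0, 1]."""
    overlap, norm, eps = 0.0, 0.0, 0.0
    for v, h in CONFIGS:
        p = math.exp(-energy(v, h)) / Z
        ratio = math.exp(-energy(v, h)) / (kappa * Z_Q * q_prob(v, h))
        if ratio > 1.0:                 # configuration violates Eq. (5): "bad"
            eps += p
        weight = q_prob(v, h) * min(1.0, ratio)   # squared amplitude after clipping
        overlap += math.sqrt(weight * p)
        norm += weight
    return overlap / math.sqrt(norm), eps

fid_small, eps_small = clipped_fidelity(0.5 * KAPPA_TIGHT)
fid_safe, eps_safe = clipped_fidelity(1.01 * KAPPA_TIGHT)
```

With a value of $\kappa$ slightly above the tight one, no configuration is clipped and the overlap is $1$ up to rounding; halving $\kappa$ produces a nonempty bad set while the overlap stays above $1-\epsilon$.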

Our algorithms are not expected to be both exact and efficient for all BMs. If they were then they could be used to learn ground–state energies of non–planar Ising models, implying that $\mathrm{NP} \subseteq \mathrm{BQP}$, which is widely believed to be false. Therefore BMs exist for which our algorithm will fail to be efficient and exact, modulo complexity theoretic assumptions. It is unknown how common these hard examples are in practice; however, they are unlikely to be commonplace because of the observed efficacy of the MF approximation for trained BMs Jor99 ; WH02 ; Tie08 ; SH09 and because the weights used in trained models tend to be small.

## GEQAE Algorithm

New forms of training, such as our GEQAE algorithm, are possible in cases where the training data is provided via a quantum oracle, allowing access to the training data in superposition rather than sequentially. The idea behind the GEQAE algorithm is to leverage the data superposition by amplitude estimation BHM+00 , which leads to a quadratic reduction in the variance in the estimated gradients over the GEQS algorithm. GEQAE consequently leads to substantial performance improvements for large training sets. Also, allowing the training data to be accessed quantumly allows it to be pre–processed using quantum clustering and data processing algorithms ABG06 ; ABG07 ; LMR13 ; RML13 ; QKS15 .

The quantum oracle used in GEQAE abstracts the access model for the training data. The oracle can be thought of as a stand–in for a quantum database or an efficient quantum subroutine that generates the training data (such as a pre–trained quantum Boltzmann machine or quantum simulator). As the training data must be directly (or indirectly) stored in the quantum computer, GEQAE typically requires more qubits than GEQS; however, this is mitigated by the fact that quantum superposition allows training over the entire set of training vectors in one step as opposed to learning over each training example sequentially. This allows the gradient to be accurately estimated while accessing the training data at most $O(\sqrt{N_{\rm train}})$ times.

Let $U_O$ be a quantum oracle that for any index $i$ has the action

$$U_O |i\rangle|y\rangle := |i\rangle|y \oplus x_i\rangle, \tag{13}$$

where $x_i$ is a training vector. This oracle can be used to prepare the visible units in the state of the $i$-th training vector. A single query to this oracle then suffices to prepare a uniform superposition over all of the training vectors, which then can be converted into

$$\frac{1}{\sqrt{N_{\rm train}}} \sum_{i,h} \sqrt{Q(x_i,h)}\,|i\rangle|x_i\rangle|h\rangle\left(\sqrt{1-\mathcal{P}(x_i,h)}\,|0\rangle + \sqrt{\mathcal{P}(x_i,h)}\,|1\rangle\right), \tag{14}$$

by repeating the state preparation method given in (9).

GEQAE computes expectations such as $\langle v_i h_j \rangle$ over the data and model by estimating (a) $P(1)$, the probability of measuring the right–most qubit in (14) to be $1$ and (b) $P(11)$, the probability of measuring $v_i = h_j = 1$ and the right–most qubit in (14) to be $1$. It then follows that

$$\langle v_i h_j \rangle = \frac{P(11)}{P(1)}. \tag{15}$$
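The ratio in (15) can be checked classically for the model expectation. The sketch below uses the measurement statistics of the state in (9), rather than the data state in (14), and computes $P(1)$ and $P(11)$ by direct summation instead of amplitude estimation; it only confirms that the ratio reproduces $\langle v_i h_j \rangle_{\rm model}$. The toy RBM, the product distribution, and the function names are hypothetical.

```python
import itertools
import math

# Hypothetical toy RBM (2 visible, 2 hidden units) and product distribution Q.
B, D = [0.1, -0.2], [0.3, 0.0]
W = [[0.5, -0.4], [0.2, 0.1]]
MU, NU = [0.6, 0.5], [0.6, 0.5]

def energy(v, h):
    return -(sum(x * y for x, y in zip(v, B)) + sum(x * y for x, y in zip(h, D))
             + sum(W[i][j] * v[i] * h[j] for i in range(2) for j in range(2)))

def q_prob(v, h):
    p = 1.0
    for bit, m in zip(v + h, MU + NU):
        p *= m if bit else 1.0 - m
    return p

CONFIGS = [(v, h) for v in itertools.product((0, 1), repeat=2)
           for h in itertools.product((0, 1), repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in CONFIGS)
Z_Q = math.exp(sum(q_prob(v, h) * (-energy(v, h) - math.log(q_prob(v, h)))
                   for v, h in CONFIGS))
KAPPA = max(math.exp(-energy(v, h)) / (Z_Q * q_prob(v, h)) for v, h in CONFIGS)

def flag_weight(v, h):
    """Probability of observing (v, h) together with flag qubit 1."""
    return q_prob(v, h) * math.exp(-energy(v, h)) / (KAPPA * Z_Q * q_prob(v, h))

def ratio_estimate(i, j):
    """P(11) / P(1) of Eq. (15), computed by direct summation."""
    p1 = sum(flag_weight(v, h) for v, h in CONFIGS)
    p11 = sum(flag_weight(v, h) for v, h in CONFIGS if v[i] == 1 and h[j] == 1)
    return p11 / p1

def exact_expectation(i, j):
    """<v_i h_j> under the exact Gibbs distribution."""
    return sum(v[i] * h[j] * math.exp(-energy(v, h)) for v, h in CONFIGS) / Z
```

Both $\kappa$ and $Z_Q$ cancel in the ratio, so the estimate is insensitive to the particular normalization chosen as long as (5) holds.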

These two probabilities can be estimated by sampling, but a more efficient method is to learn them using amplitude estimation BHM+00 , a quantum algorithm that uses phase estimation on Grover’s algorithm to directly output these probabilities in a qubit string. If we demand the sampling error to scale as $1/\sqrt{N_{\rm train}}$ (in rough analogy to the previous case) then the query complexity of GEQAE is

$$\tilde{O}\left(\sqrt{N_{\rm train}}\, E \left(\kappa + \max_x \kappa_x\right)\right). \tag{16}$$

Each energy calculation requires $\tilde{O}(E)$ arithmetic operations, therefore (16) gives that the number of non–query operations scales as

$$\tilde{O}\left(\sqrt{N_{\rm train}}\, E^2 \left(\kappa + \max_x \kappa_x\right)\right). \tag{17}$$

If the success probability is known to within a constant factor then amplitude amplification BHM+00 can be used to boost the success probability prior to estimating it. The original success probability is then computed from the amplified probability. This reduces the query complexity of GEQAE to

$$\tilde{O}\left(\sqrt{N_{\rm train}}\, E \left(\sqrt{\kappa} + \max_x \sqrt{\kappa_x}\right)\right). \tag{18}$$

GEQAE is therefore preferable to GEQS if $\sqrt{N_{\rm train}} \gg E$.

## Parallelizing Algorithms

Greedy CD–$k$ training is embarrassingly parallel, meaning that almost all parts of the algorithm can be distributed over parallel processing nodes. However, the $k$ rounds of sampling used to train each layer in CD–$k$ cannot be easily parallelized. This means that simple but easily parallelizable models (such as GMMs) can be preferable in some circumstances HDY+12 . In contrast, GEQS and GEQAE can leverage the parallelism that is anticipated in a fault–tolerant quantum computer to train a dRBM much more effectively. To see this, note that the energy is the sum of the energies of each layer, which can be computed in depth $O(\log M)$ (see Figure 1) and summed in depth $O(\log \ell)$. The MF state preparations can be executed simultaneously and the correct sample(s) located via a log–depth calculation. The depth of GEQS is therefore

$$O\left(\log\left(\left[\kappa + \max_x \kappa_x\right] M \ell N_{\rm train}\right)\right). \tag{19}$$

Since each of the derivatives output by GEQAE can be computed independently, the depth of GEQAE is

$$O\left(\sqrt{N_{\rm train}}\left[\kappa + \max_x \kappa_x\right] \log(M\ell)\right). \tag{20}$$

The depth can be reduced, at the price of increased circuit size, by dividing the training set into mini–batches and averaging the resultant derivatives.

Training using $k$–step contrastive divergence (CD-$k$) requires depth

$$O\left(k \ell^2 \log(M N_{\rm train})\right). \tag{21}$$

The $O(\ell^2)$ scaling arises because CD-$k$ is a feed–forward algorithm, whereas GEQS and GEQAE are not.

## Numerical Results

We address the following questions regarding the behavior of our algorithms:

1. What are typical values of $\kappa$?

2. How do models trained using CD-$k$ differ from those trained with GEQS and GEQAE?

3. Do full BMs yield substantially better models than dRBMs?

To answer these questions, and need to be computed classically, requiring time that grows exponentially with . Computational restrictions therefore severely limit the size of the models that we can study through numerical experiments. In practice, we are computationally limited to models with at most units. We train the following dRBMs with layers, hidden units, and visible units.

Large-scale traditional data sets for benchmarking machine learning, such as MNIST lecun1998mnist , are impractical here due to computational limitations. Consequently, we focus on synthetic training data consisting of four distinct functions:

$$[x_1]_j = \begin{cases} 1 & \text{if } j \le n_v/2 \\ 0 & \text{otherwise} \end{cases}, \qquad [x_2]_j = j \bmod 2, \tag{22}$$

as well as their bitwise negations. We add Bernoulli noise to each of the bits in the bit string to increase the size of the training sets. In particular, we take each of the four patterns in (22) and flip each bit with probability . We use training examples in each of our numerical experiments; each vector contains binary features. Our task is to infer a generative model for these four vectors. We provide numerical experiments on sub–sampled MNIST digit data in the appendix. The results are qualitatively similar.
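A training set of the kind described by (22) can be generated as in the sketch below. The bit index $j$ is assumed to start at $1$, the clean patterns are assumed to be cycled through in order, and the flip probability and training set size are left as free parameters (`flip_prob`, `n_train`) because their values are not recoverable from this text. The function names are hypothetical.

```python
import random

def base_patterns(n_v):
    """The two patterns of Eq. (22) and their bitwise negations (j is 1-indexed)."""
    x1 = [1 if 2 * j <= n_v else 0 for j in range(1, n_v + 1)]
    x2 = [j % 2 for j in range(1, n_v + 1)]
    return [x1, x2, [1 - bit for bit in x1], [1 - bit for bit in x2]]

def noisy_training_set(n_v, n_train, flip_prob, seed=0):
    """Cycle through the four patterns and flip each bit independently."""
    rng = random.Random(seed)
    patterns = base_patterns(n_v)
    data = []
    for k in range(n_train):
        clean = patterns[k % len(patterns)]
        # XOR with a Bernoulli(flip_prob) draw implements the bit flip.
        data.append([bit ^ (rng.random() < flip_prob) for bit in clean])
    return data

# Example call with hypothetical settings.
train = noisy_training_set(n_v=6, n_train=8, flip_prob=0.1)
```

Setting `flip_prob` to zero returns the clean patterns, which is a simple way to check the indexing convention.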

Figure 2 shows that doubling the number of visible units does not substantially increase $\kappa$ for this data set (with ), despite the fact that their Hilbert space dimensions differ by a factor of . This illustrates that $\kappa$ primarily depends on the quality of the MF approximation, rather than $n_v$ and $n_h$. Similar behavior is observed for full BMs, as shown in the appendix. Furthermore, typically results in a close approximation to the true Gibbs state. Although $\kappa$ is not excessive, we introduce “hedging strategies” in the appendix that can reduce $\kappa$ to roughly .

We further examine the scaling of $\kappa$ for random (untrained) RBMs via

$$\kappa_{\rm est} = \sum_{v,h} P^2(v,h)/Q(v,h). \tag{23}$$
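The estimate in (23) is the average of the ratio $P/Q$ under the Gibbs distribution, so it satisfies $1 \le \kappa_{\rm est} \le \max_{v,h} P(v,h)/Q(v,h)$; the lower bound follows from the Cauchy–Schwarz inequality. The sketch below evaluates it by enumeration for a toy RBM with hypothetical parameters and a hand-picked product distribution; for larger models the sum would have to be estimated by sampling.

```python
import itertools
import math

# Hypothetical toy RBM (2 visible, 2 hidden units) and product distribution Q.
B, D = [0.1, -0.2], [0.3, 0.0]
W = [[0.5, -0.4], [0.2, 0.1]]
MU, NU = [0.6, 0.5], [0.6, 0.5]

def energy(v, h):
    return -(sum(x * y for x, y in zip(v, B)) + sum(x * y for x, y in zip(h, D))
             + sum(W[i][j] * v[i] * h[j] for i in range(2) for j in range(2)))

def q_prob(v, h):
    p = 1.0
    for bit, m in zip(v + h, MU + NU):
        p *= m if bit else 1.0 - m
    return p

CONFIGS = [(v, h) for v in itertools.product((0, 1), repeat=2)
           for h in itertools.product((0, 1), repeat=2)]
Z = sum(math.exp(-energy(v, h)) for v, h in CONFIGS)
P = {c: math.exp(-energy(*c)) / Z for c in CONFIGS}

# Eq. (23): Gibbs-weighted average of the ratio P / Q.
kappa_est = sum(P[c] ** 2 / q_prob(*c) for c in CONFIGS)
max_ratio = max(P[c] / q_prob(*c) for c in CONFIGS)
```

A value of `kappa_est` close to $1$ indicates that $Q$ is already a good approximation to the Gibbs distribution.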

Figure 3 shows that for small random RBMs, for .

This leads to the second issue: determining the distribution of weights for actual Boltzmann machines. Figure 4 shows that large RBMs trained using contrastive divergence have weights that tend to rapidly shrink as is increased. For , the empirical scaling is which suggests that will not diverge as grows. Although taking reduces considerably, the scaling is also reduced. This may be a result of regularization having different effects for the two training sets. In either case, these results coupled with those of Figure 3 suggest that $\kappa$ should be manageable for large networks.

We assess the benefits of GEQS and GEQAE by comparing the average values of $O_{\rm ML}$ found under contrastive divergence and under our quantum algorithm, for dRBMs. The difference between the optima found is significant for even small RBMs; the differences for deep networks can be on the order of . The data in Table 2 show that ML-training leads to substantial improvements in the quality of the resultant models. We also observe that contrastive divergence can outperform gradient descent on the ML objective in highly constrained cases. This is because the stochastic nature of the contrastive divergence approximation makes it less sensitive to local minima.

The modeling capacity of a full BM can significantly outperform a dRBM in terms of the quality of the objective function. In fact, we show in the appendix that a full Boltzmann machine with and can achieve . dRBMs with comparable numbers of edges result in (see Table 2), which is less than the full BM. Since our quantum algorithms can efficiently train full BMs in addition to dRBMs, the quantum framework enables forms of machine learning that are not only richer than classically tractable methods but also potentially lead to improved models of the data.

## Conclusions

A fundamental result of our work is that training Boltzmann machines can be reduced to a problem of quantum state preparation. This state preparation process notably does not require the use of contrastive divergence approximation or assumptions about the topology of the graphical model. We show that the quantum algorithms not only lead to significantly improved models of the data, but also provide a more elegant framework in which to think about training BMs. This framework enables the wealth of knowledge developed in quantum information and condensed matter physics on preparing and approximating Gibbs states to be leveraged during training.

Our quantum deep learning framework enables the refining of MF approximations into states that are close to (or equivalent to) the desired Gibbs state. This state preparation method allows a BM to be trained using a number of operations that does not explicitly depend on the number of layers in a dRBM. It also allows a quadratic reduction in the number of times the training data must be accessed and enables full Boltzmann machines to be trained. Our algorithms can also be better parallelized over multiple quantum processors, addressing a major shortcoming of deep learning HDY+12 .

While numerical results on small examples are encouraging in advance of having a scalable quantum computer, future experimental studies using quantum hardware will be needed to assess the generalization performance of our algorithm. Given our algorithm’s ability to provide better gradients than contrastive divergence, it is natural to expect that it will perform well in that setting by using the same methodologies currently used to train deep Boltzmann machines Ben09 . Regardless, the myriad advantages offered by quantum computing to deep learning not only suggest a significant near-term application of a quantum computer but also underscore the value of thinking about machine learning from a quantum perspective.

## Appendix A Quantum algorithm for state preparation

We begin by showing how quantum computers can draw unbiased samples from the Gibbs distribution, thereby allowing the probabilities to be computed by sampling (or by quantum sampling). The idea behind our approach is to prepare a quantum distribution that approximates the ideal probability distribution over the model or data. This approximate distribution is then refined using rejection sampling into a quantum distribution that is, to within numerical error, the target probability distribution ORR13 . If we begin with a uniform prior over the amplitudes of the Gibbs state, then preparing the state via quantum rejection sampling is likely to be inefficient. This is because the success probability depends on the ratio of the partition functions of the initial state and the Gibbs state PW09 , which in practice is exponentially small for machine learning problems. Instead, our algorithm uses a mean–field approximation, rather than a uniform prior, over the joint probabilities in the Gibbs state. We show numerically that this extra information can be used to boost the probability of success to acceptable levels. The required expectation values can then be found by sampling from the quantum distribution. We show that the number of samples needed to achieve a fixed sampling error can be quadratically reduced by using a quantum algorithm known as amplitude estimation BHM+00 .

We first discuss the process by which the initial quantum distribution is refined into a quantum coherent Gibbs state (often called a coherent thermal state or CTS). We then discuss how mean–field theory, or generalizations thereof, can be used to provide suitable initial states for the quantum computer to refine into the CTS. We assume in the following that all units in the Boltzmann machine are binary valued. Other valued units, such as Gaussian units, can be approximated within this framework by forming a single unit out of a string of several qubits.

First, let us define the mean–field approximation to the joint probability distribution to be $Q(v,h)$. For more details on the mean–field approximation, see Section G. We also use the mean–field distribution to compute a variational approximation to the partition functions needed for our algorithm. These approximations can be efficiently calculated (because the probability distribution factorizes) and are defined below.

###### Definition 1.

Let $Q$ be the mean–field approximation to the Gibbs distribution $P = e^{-E}/Z$, then

$$\log Z_Q := \sum_{v,h} Q(v,h) \log\left(\frac{e^{-E(v,h)}}{Q(v,h)}\right).$$

Furthermore, for any $x \in x_{\rm train}$ let $Q_x$ be the mean–field approximation to the Gibbs distribution found for a Boltzmann machine with the visible units clamped to $x$, then

$$\log Z_{x,Q} := \sum_{h} Q_x(x,h) \log\left(\frac{e^{-E(x,h)}}{Q_x(x,h)}\right).$$

In order to use our quantum algorithm to prepare $P$ from $Q$ we need to know an upper bound, $\kappa$, on the ratio of the approximation $P(v,h) \approx e^{-E(v,h)}/Z_Q$ to $Q(v,h)$. We formally define this below.

###### Definition 2.

Let $\kappa > 0$ be a constant that is promised to satisfy for all visible and hidden configurations $(v,h)$

$$\frac{e^{-E(v,h)}}{Z_Q} \le \kappa Q(v,h), \tag{24}$$

where $Z_Q$ is the approximation to the partition function given in Definition 1.

We also define an analogous quantity appropriate for the case where the visible units are clamped to one of the training vectors.

###### Definition 3.

Let $\kappa_x > 0$ be a constant that is promised to satisfy for $x \in x_{\rm train}$ and all hidden configurations $h$

$$\frac{e^{-E(x,h)}}{Z_{x,Q}} \le \kappa_x Q_x(x,h), \tag{25}$$

where $Z_{x,Q}$ is the approximation to the partition function given in Definition 1.

###### Lemma 1.

Let $Q(v,h)$ be the mean–field probability distribution for a Boltzmann machine, then for all configurations of hidden and visible units we have

$$P(v,h) \le \frac{e^{-E(v,h)}}{Z_Q} \le \kappa Q(v,h).$$

###### Proof.

The mean–field approximation can also be used to provide a lower bound for the log–partition function. For example, Jensen’s inequality shows that

$$\begin{aligned} \log(Z) &= \log\left(\sum_{v,h} Q(v,h) \frac{e^{-E(v,h)}}{Q(v,h)}\right) \\ &\ge \sum_{v,h} Q(v,h) \log\left(\frac{e^{-E(v,h)}}{Q(v,h)}\right) = \log(Z_Q). \end{aligned} \tag{26}$$

This shows that $Z_Q \le Z$ and hence

$$P(v,h) \le e^{-E(v,h)}/Z_Q, \tag{27}$$

where $Z_Q$ is the approximation to $Z$ that arises from using the mean–field distribution. The result then follows from (27) and Definition 2. ∎

The result of Lemma 1 allows us to prove the following lemma, which gives the success probability for preparing the Gibbs state from the mean–field state.

###### Lemma 2.

A coherent analog of the Gibbs state for a Boltzmann machine can be prepared with a probability of success of $\frac{Z}{\kappa Z_Q}$. Similarly, the Gibbs state corresponding to the visible units being clamped to a configuration $x$ can be prepared with success probability $\frac{Z_x}{\kappa_x Z_{x,Q}}$.

###### Proof.

The first step in the algorithm is to compute the mean–field parameters $\mu_i$ and $\nu_j$ using (56). These parameters uniquely specify the mean–field distribution $Q$. Next the mean–field parameters are used to approximate the partition functions $Z$ and $Z_x$. These mean–field parameters are then used to prepare a coherent analog of $Q(v,h)$, denoted as $|\psi_Q\rangle$, by performing a series of single–qubit rotations:

$$|\psi_Q\rangle := \prod_i R_y\left(2\arcsin(\sqrt{\mu_i})\right)|0\rangle \prod_j R_y\left(2\arcsin(\sqrt{\nu_j})\right)|0\rangle = \sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle. \tag{28}$$

The remaining steps use rejection sampling to refine this crude approximation to $\sum_{v,h} \sqrt{P(v,h)}\,|v\rangle|h\rangle$.
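The identity in (28) can be confirmed with a small state-vector calculation. The sketch below assumes the standard convention $R_y(\theta) = \begin{pmatrix} \cos(\theta/2) & -\sin(\theta/2) \\ \sin(\theta/2) & \cos(\theta/2) \end{pmatrix}$, applies one rotation per qubit to $|0\rangle$, and builds the amplitudes of the product state with the first qubit as the most significant bit. The marginals passed in are hypothetical stand-ins for the mean–field parameters $\mu_i$ and $\nu_j$.

```python
import itertools
import math

def ry(theta):
    """Matrix of the single-qubit rotation R_y(theta)."""
    c, s = math.cos(theta / 2.0), math.sin(theta / 2.0)
    return [[c, -s], [s, c]]

def rotate_zero(theta):
    """Amplitudes of R_y(theta)|0>, i.e. the first column of the matrix."""
    m = ry(theta)
    return [m[0][0], m[1][0]]

def product_state(marginals):
    """Amplitudes of the tensor product of R_y(2 arcsin(sqrt(m)))|0> over all qubits."""
    amps = [1.0]
    for m in marginals:
        a0, a1 = rotate_zero(2.0 * math.asin(math.sqrt(m)))
        amps = [a * x for a in amps for x in (a0, a1)]
    return amps

# Hypothetical mean-field parameters: two visible units and one hidden unit.
mu, nu = [0.3, 0.8], [0.5]
amps = product_state(mu + nu)
```

Each rotation sends $|0\rangle$ to $\sqrt{1-\mu}\,|0\rangle + \sqrt{\mu}\,|1\rangle$, so the squared amplitude of a bit string equals the product of the corresponding marginals, which is $Q(v,h)$.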

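For compactness we define

$$\mathcal{P}(v,h) := \frac{e^{-E(v,h)}}{\kappa Z_Q Q(v,h)}. \tag{29}$$

$\mathcal{P}(v,h)$ can be computed efficiently from the mean–field parameters and so an efficient quantum algorithm (quantum circuit) also exists to compute $\mathcal{P}(v,h)$. Lemma 1 also guarantees that $0 \le \mathcal{P}(v,h) \le 1$.

Since quantum operations (with the exception of measurement) are linear, if we apply the algorithm to a state $\sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle|0\rangle$ we obtain $\sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle|\mathcal{P}(v,h)\rangle$. We then add an additional quantum bit, called an ancilla qubit, and perform a controlled rotation of the form $R_y\left(2\arcsin\sqrt{\mathcal{P}(v,h)}\right)$ on this qubit to enact the following transformation:

$$\sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle|\mathcal{P}(v,h)\rangle|0\rangle \mapsto \sum_{v,h} \sqrt{Q(v,h)}\,|v\rangle|h\rangle|\mathcal{P}(v,h)\rangle\left(\sqrt{1-\mathcal{P}(v,h)}\,|0\rangle + \sqrt{\mathcal{P}(v,h)}\,|1\rangle\right). \tag{30}$$

The quantum register that contains the qubit string $\mathcal{P}(v,h)$ is then reverted to the $|0\rangle$ state by applying the same operations used to prepare $\mathcal{P}(v,h)$ in reverse. This process is possible because all quantum operations, save measurement, are reversible. Since $\mathcal{P}(v,h) \in [0,1]$, (30) is a properly normalized quantum state and in turn its square is a valid probability distribution.

If the rightmost quantum bit in (30) is measured and a result of $1$ is obtained (recall that projective measurements always result in a unit vector) then the remainder of the state will be proportional to

$$\sum_{v,h} \sqrt{Q(v,h)\mathcal{P}(v,h)}\,|v\rangle|h\rangle = \sqrt{\frac{Z}{\kappa Z_Q}} \sum_{v,h} \sqrt{\frac{e^{-E(v,h)}}{Z}}\,|v\rangle|h\rangle = \sqrt{\frac{Z}{\kappa Z_Q}} \sum_{v,h} \sqrt{P(v,h)}\,|v\rangle|h\rangle, \tag{31}$$

which is the desired state up to a normalizing factor. The probability of measuring $1$ is the square of this constant of proportionality

$$P(1|\kappa, Z_Q) = \frac{Z}{\kappa Z_Q}. \tag{32}$$

Note that this is a valid probability because Lemma 1 gives that $Z \le \kappa Z_Q$.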

Preparing a quantum state that can be used to estimate the expectation values over the data requires a slight modification to this algorithm. First, for each $x \in x_{\rm train}$ needed for the expectation values, we replace $Q(v,h)$ with the constrained mean–field distribution $Q_x(x,h)$. Then using this data the quantum state

$$\sum_h \sqrt{Q_x(x,h)}\,|x\rangle|h\rangle, \tag{33}$$

can be prepared. We then follow the exact same protocol using $Q_x$ in place of $Q$, $Z_x$ in place of $Z$, and $Z_{x,Q}$ in place of $Z_Q$. The success probability of this algorithm is

$$P(1|\kappa, Z_{x,Q}) = \frac{Z_x}{\kappa_x Z_{x,Q}}. \tag{34}$$

The approach to the state preparation problem used in Lemma 2 is similar to that of PW09 , with the exception that we use a mean-field approximation rather than the infinite temperature Gibbs state as our initial state. This choice of initial state is important because the success probability of the state preparation process depends on the distance between the initial state and the target state. For machine learning applications, the inner product between the Gibbs state and the infinite temperature Gibbs state is often exponentially small; whereas we find in Section E.3 that the mean–field and the Gibbs states typically have large overlaps.

The following lemma is a more general version of Lemma 2 that shows that if an insufficiently large value of $\kappa$ is used then the state preparation algorithm can still be employed, but at the price of reduced fidelity with the ideal coherent Gibbs state.

###### Lemma 3.

If we relax the assumptions of Lemma 2 such that for all and for all and , then a state can be prepared that has fidelity at least with the target Gibbs state with probability at least .

###### Proof.

Let our protocol be that used in Lemma 2 with the modification that the rotation is only applied if . This means that prior to the measurement of the register that projects the state onto the success or failure branch, the state is

 (35)

The probability of successfully preparing the approximation to the state is then

The fidelity of the resultant state with the ideal state is

since for all . Now using the same trick employed in (37) and the assumption that , we have that the fidelity is bounded below by

$$\frac{Z(1-\epsilon)}{Z\sqrt{1-\epsilon}} = \sqrt{1-\epsilon} \ge 1-\epsilon. \tag{38}$$

The corresponding algorithms are outlined in Algorithm 1 and Algorithm 2, for preparing the state required to compute the model expectation and the data expectation, respectively.

## Appendix B Gradient calculation by sampling

Our first algorithm for estimating the gradients of $O_{\rm ML}$ involves preparing the Gibbs state from the mean–field state and then drawing samples from the resultant distribution in order to estimate the expectation values required in the expression for the gradient. We refer to this algorithm as GEQS (Gradient Estimation via Quantum Sampling) in the main body. We also optimize the GEQS algorithm by utilizing a quantum algorithm known as amplitude amplification BHM+00 (a generalization of Grover’s search algorithm Gro96), which quadratically reduces the mean number of repetitions needed to draw a sample from the Gibbs distribution using the approach in Lemma 2 or Lemma 3.
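A classical analogue may help fix ideas: measuring the flag qubit plays the same role as the accept/reject step of rejection sampling with the initial distribution as the proposal. The sketch below is not the quantum algorithm and makes several assumptions: the model parameters and the product proposal `MU` are hypothetical, the constant `C` stands in for $\kappa Z_Q$, and only the model expectation $\langle v_i h_j\rangle$ is estimated.

```python
import itertools
import math
import random

# Hypothetical toy model: 2 visible and 2 hidden binary units.
W = [[0.8, -0.5], [0.3, 0.6]]
B = [0.1, -0.2]
D = [-0.3, 0.2]
MU = [0.6, 0.5, 0.5, 0.6]   # product proposal: Pr(unit = 1) for v1, v2, h1, h2

STATES = list(itertools.product((0, 1), repeat=4))


def energy(s):
    """E(v,h) = -b.v - d.h - v^T W h for a joint configuration s."""
    v, h = s[:2], s[2:]
    linear = sum(b * x for b, x in zip(B, v)) + sum(d * y for d, y in zip(D, h))
    pair = sum(W[i][j] * v[i] * h[j] for i in range(2) for j in range(2))
    return -(linear + pair)


def q(s):
    """Probability of s under the product proposal."""
    return math.prod(m if x else 1.0 - m for m, x in zip(MU, s))


Z = sum(math.exp(-energy(s)) for s in STATES)
C = max(math.exp(-energy(s)) / q(s) for s in STATES)  # stands in for kappa * Z_Q


def draw_gibbs_sample(rng):
    """Propose from Q, accept with probability exp(-E) / (C * Q)."""
    tries = 0
    while True:
        tries += 1
        s = tuple(int(rng.random() < m) for m in MU)
        if rng.random() < math.exp(-energy(s)) / (C * q(s)):
            return s, tries


def estimate_model_expectation(i, j, n_samples, rng):
    """Monte Carlo estimate of <v_i h_j> and the mean number of proposals."""
    total, tries = 0, 0
    for _ in range(n_samples):
        s, t = draw_gibbs_sample(rng)
        total += s[i] * s[2 + j]
        tries += t
    return total / n_samples, tries / n_samples


def exact_model_expectation(i, j):
    """Brute-force <v_i h_j> under the Gibbs distribution."""
    return sum(s[i] * s[2 + j] * math.exp(-energy(s)) for s in STATES) / Z
```

The mean number of proposals per accepted sample approaches `C / Z`, the reciprocal of the acceptance probability; this is the quantity that amplitude amplification reduces quadratically in the quantum setting.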

It is important to see that the distributions that this algorithm prepares are not directly related to the mean–field distribution. The mean–field distribution is chosen because it is an efficiently computable distribution that is close to the true Gibbs distribution and thereby gives a shortcut to preparing the state. Alternative choices, such as the uniform distribution, will ideally result in the same final distribution but may require many more operations than would be required if the mean–field approximation were used as a starting point. We state the performance of the GEQS algorithm in the following theorem.

###### Theorem 1.

There exists a quantum algorithm that can estimate the gradient of $O_{\rm ML}$ using $N_{\rm train}$ samples for a Boltzmann machine on a connected graph with $E$ edges. The mean number of quantum operations required by the algorithm to compute the gradient is

$$\tilde{O}\left(N_{\rm train}E\left(\sqrt{\kappa}+\sqrt{\max_v \kappa_v}\right)\right),$$

where $\kappa_v$ is the value of $\kappa$ that corresponds to the Gibbs distribution when the visible units are clamped to $v$, and $f \in \tilde{O}(g)$ implies $f \in O(g)$ up to polylogarithmic factors.

###### Proof.

We use Algorithm 3 to compute the required gradients. It is straightforward to see from Lemma 2 that Algorithm 3 draws samples from the Boltzmann machine and then estimates the expectation values needed to compute the gradient of the log–likelihood by drawing samples from these states. The subroutines that generate these states, qGenModelState and qGenDataState, given in Algorithm 1 and Algorithm 2, represent the only quantum processing in this algorithm. The number of times the subroutines must be called on average before a success is observed is given by the mean of a geometric distribution with success probability given by Lemma 2 that is at least

$$\min\left\{\frac{Z}{\kappa Z_Q}, \min_x\frac{Z_x}{\kappa_x Z_{x,Q}}\right\}. \tag{39}$$

Lemma 1 gives us that $Z_Q \le Z$ and hence the probability of success satisfies

$$\min\left\{\frac{Z}{\kappa Z_Q}, \min_x\frac{Z_x}{\kappa_x Z_{x,Q}}\right\} \ge \frac{1}{\kappa+\max_v\kappa_v}. \tag{40}$$

Normally, (40) implies that preparation of the Gibbs state would require $O(\kappa + \max_v \kappa_v)$ calls to Algorithm 1 and Algorithm 2 on average, but the quantum amplitude amplification algorithm BHM+00 reduces the average number of repetitions needed before a success is obtained to $O(\sqrt{\kappa + \max_v \kappa_v})$. Algorithm 3 therefore requires an average number of calls to qGenModelState and qGenDataState that scales as $O(N_{\rm train}\sqrt{\kappa + \max_v \kappa_v})$.

Algorithm 1 and Algorithm 2 require preparing the mean–field state, computing the energy of a configuration $(v,h)$, and performing a controlled rotation. Assuming that the graph is connected, the number of hidden and visible units is $O(E)$. Since the cost of synthesizing single qubit rotations to within error $\epsilon$ is $O(\log(1/\epsilon))$ KMM+13 ; RS14 ; BRS14 and the cost of computing the energy is $\tilde{O}(E)$, it follows that the cost of these algorithms is $\tilde{O}(E)$. Thus the expected cost of Algorithm 3 is $\tilde{O}(N_{\rm train}E\sqrt{\kappa+\max_v\kappa_v}) \subseteq \tilde{O}(N_{\rm train}E(\sqrt{\kappa}+\sqrt{\max_v\kappa_v}))$ as claimed. ∎
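The quadratic saving from amplitude amplification can be illustrated with a short calculation. The sketch below assumes the success probability $p$ is known exactly, which is a simplification (the version used in the proof handles an unknown $p$ and gives the same scaling on average); the values chosen for $\kappa$ and $\max_v\kappa_v$ are hypothetical. It uses the standard result that $m = \lfloor \pi/(4\theta) \rfloor$ Grover-type iterations, with $\sin^2\theta = p$, boost the success probability to $\sin^2((2m+1)\theta)$.

```python
import math


def expected_classical_repetitions(p):
    """Mean of a geometric distribution with success probability p."""
    return 1.0 / p


def amplitude_amplification_plan(p):
    """Return (iterations, final success probability) for a known p in (0, 1]."""
    theta = math.asin(math.sqrt(p))
    iterations = math.floor(math.pi / (4.0 * theta))
    final_success = math.sin((2 * iterations + 1) * theta) ** 2
    return iterations, final_success


# Hypothetical values: kappa = 60 and max_v kappa_v = 40 give the bound p >= 1/100.
kappa, kappa_v_max = 60.0, 40.0
p_lower = 1.0 / (kappa + kappa_v_max)

repeats = expected_classical_repetitions(p_lower)
iterations, final_success = amplitude_amplification_plan(p_lower)
```

With these numbers, repeat-until-success needs 100 attempts on average, while 7 amplification iterations already push the success probability above 0.99.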

In contrast, the number of operations and queries to $U_O$ required to estimate the gradients using greedy layer–by–layer optimization scales as Ben09

$$\tilde{O}(N_{\rm train}\ell E), \tag{41}$$

where $\ell$ is the number of layers in the deep Boltzmann machine. Assuming that $\kappa$ is a constant, it follows that the quantum sampling approach provides an asymptotic advantage for training deep networks. In practice, the two approaches are difficult to compare directly because they optimize different objective functions and thus the qualities of the resultant trained models will differ. It is reasonable to expect, however, that the quantum approach will tend to find superior models because it optimizes the maximum-likelihood objective function up to sampling error due to taking finite $N_{\rm train}$.

Note that Algorithm 3 has an important advantage over many existing quantum machine learning algorithms ABG06 ; LMR13 ; RML13 ; QKS15 : it does not require that the training vectors are stored in quantum memory. It requires only qubits if a numerical precision of is needed in the evaluation of the . This means that a demonstration of this algorithm that would not be classically simulatable could be performed with fewer than qubits, assuming that bits of precision suffices for the energy. In practice though, additional qubits will likely be required to implement the required arithmetic on a quantum computer. Recent developments in quantum rotation synthesis could, however, be used to remove the requirement that the energy is explicitly stored as a qubit string WR14 , which may substantially reduce the space requirements of this algorithm. Below we consider the opposite case: the quantum computer can coherently access the database of training data via an oracle. The algorithm requires more qubits (space), however it can quadratically reduce the number of samples required for learning in some settings.

## Appendix C Training via quantum amplitude estimation

We now consider a different learning environment, one in which the user has access to the training data via a quantum oracle which could represent either an efficient quantum algorithm that provides the training data (such as another Boltzmann machine used as a generative model) or a quantum database that stores the memory via a binary access tree NC00 ; GLM08 , such as a quantum Random Access Memory (qRAM) GLM08 .

If we denote the training set as $\{x_i \mid i = 1, \ldots, N_{\rm train}\}$, then the oracle is defined as a unitary operation as follows:

###### Definition 4.

$U_O$ is a unitary operation that performs, for any computational basis state $|i\rangle$ and any $|y\rangle$,

$$U_O|i\rangle|y\rangle := |i\rangle|y\oplus x_i\rangle,$$

where is the training set and .

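A single quantum access to $U_O$ is sufficient to prepare a uniform distribution over all the training data

$$U_O\left(\frac{1}{\sqrt{N_{\rm train}}}\sum_{i=1}^{N_{\rm train}}|i\rangle|0\rangle\right) = \frac{1}{\sqrt{N_{\rm train}}}\sum_{i=1}^{N_{\rm train}}|i\rangle|x_i\rangle. \tag{42}$$

The state $\frac{1}{\sqrt{N_{\rm train}}}\sum_{i=1}^{N_{\rm train}}|i\rangle|0\rangle$ can be efficiently prepared using quantum techniques QKS15 and so the entire procedure is efficient.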
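The action of the oracle in Definition 4 can be sketched with a toy state-vector simulation. The training vectors below are hypothetical 3-bit strings, and a quantum state is represented simply as a dictionary from basis labels `(i, y)` to amplitudes; this is an illustrative model of $U_O$, not an implementation of a qRAM.

```python
import math

# Hypothetical training set of 3-bit vectors x_i, stored as integers.
TRAIN = [0b101, 0b011, 0b110, 0b000]


def apply_oracle(state):
    """U_O |i>|y> = |i>|y XOR x_i>, applied to a dict {(i, y): amplitude}."""
    return {(i, y ^ TRAIN[i]): amp for (i, y), amp in state.items()}


def uniform_index_state():
    """Uniform superposition over indices with the data register set to 0."""
    amp = 1.0 / math.sqrt(len(TRAIN))
    return {(i, 0): amp for i in range(len(TRAIN))}


# One oracle call loads every training vector next to its index.
loaded = apply_oracle(uniform_index_state())

# Measuring the loaded state yields each (i, x_i) pair with probability 1/N_train.
probabilities = {key: amp ** 2 for key, amp in loaded.items()}
```

Because XOR is its own inverse, applying `apply_oracle` a second time returns the data register to zero, which reflects the unitarity of $U_O$.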

At first glance, the ability to prepare a superposition over all data in the training set seems to be a powerful resource. However, a similar probability distribution can also be generated classically using one query by picking a random training vector. More sophisticated approaches are needed if we wish to leverage such quantum superpositions of the training data. Algorithm 4 utilizes such superpositions to provide advantages, under certain circumstances, for computing the gradient. The performance of this algorithm is given in the following theorem.

###### Theorem 2.

There exists a quantum algorithm that can compute , or for any corresponding to visible/hidden unit pairs for a Boltzmann machine on a connected graph with edges to within error using an expected number of queries to that scales as

$$\tilde{O}\left(\frac{\kappa+\max_v\kappa_v}{\delta}\right),$$

and a number of quantum operations that scales as

$$\tilde{O}\left(\frac{E(\kappa+\max_v\kappa_v)}{\delta}\right),$$

for constant learning rate .

Algorithm 4 requires the use of the amplitude estimation BHM+00 algorithm, which provides a quadratic reduction in the number of samples needed to learn the probability of an event occurring, as stated in the following theorem.

###### Theorem 3 (Brassard, Høyer, Mosca and Tapp).

For any positive integer $L$, the amplitude estimation algorithm of BHM+00 takes as input a quantum algorithm that does not use measurement and with success probability $a$ and outputs $\tilde{a}$ $(0 \le \tilde{a} \le 1)$ such that

$$|\tilde{a}-a| \le \frac{\pi(\pi+1)}{L}$$

with probability at least $8/\pi^2$. It uses exactly $L$ iterations of Grover’s algorithm. If $a=0$ then $\tilde{a}=0$ with certainty, and if $a=1$ and $L$ is even, then $\tilde{a}=1$ with certainty.
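To give a feel for the quadratic reduction, the following sketch compares the number of Grover iterations implied by the bound $\pi(\pi+1)/L$ with a classical sample count for the same target error. The classical count is an assumption of this example: it uses Hoeffding's inequality, $\Pr(|\hat{a}-a| \ge \delta) \le 2e^{-2n\delta^2}$, at the same confidence level $8/\pi^2$, and is meant as a rough baseline rather than a tight bound.

```python
import math


def grover_iterations_for_error(delta):
    """Smallest L such that pi * (pi + 1) / L <= delta."""
    return math.ceil(math.pi * (math.pi + 1.0) / delta)


def hoeffding_samples_for_error(delta, confidence=8.0 / math.pi ** 2):
    """Smallest n such that 2 * exp(-2 * n * delta**2) <= 1 - confidence."""
    return math.ceil(math.log(2.0 / (1.0 - confidence)) / (2.0 * delta ** 2))


# Target error of 0.01 on the estimated probability.
quantum_cost = grover_iterations_for_error(0.01)
classical_cost = hoeffding_samples_for_error(0.01)
```

Halving `delta` doubles `quantum_cost` but roughly quadruples `classical_cost`, which is the $1/\delta$ versus $1/\delta^2$ scaling.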

This result is central to the proof of Theorem 2 which we give below.

###### Proof of Theorem 2.

Algorithm 4 computes the derivative of $O_{\rm ML}$ with respect to the weights. The algorithm can be trivially adapted to compute the derivatives with respect to the biases. The first step in the algorithm prepares a uniform superposition of all training data and then applies $U_O$ to it. The result of this is

$$\frac{1}{\sqrt{N_{\rm train}}}\sum_{p=1}^{N_{\rm train}}|p\rangle|x_p\rangle, \tag{43}$$

as claimed.

Any quantum algorithm that does not use measurement is linear and hence applying qGenDataState (Algorithm 2) to (43) yields

 :=1√NtrainNtrain∑p=1|p⟩|xp⟩∑