Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines

11/14/2016 ∙ by Dmytro Korenkevych, et al. ∙ 0

Quantum annealing (QA) is a hardware-based heuristic optimization and sampling method applicable to discrete undirected graphical models. While similar to simulated annealing, QA relies on quantum, rather than thermal, effects to explore complex search spaces. For many classes of problems, QA is known to offer computational advantages over simulated annealing. Here we report on the ability of recent QA hardware to accelerate training of fully visible Boltzmann machines. We characterize the sampling distribution of QA hardware, and show that in many cases, the quantum distributions differ significantly from classical Boltzmann distributions. In spite of this difference, training (which seeks to match data and model statistics) using standard classical gradient updates is still effective. We investigate the use of QA for seeding Markov chains as an alternative to contrastive divergence (CD) and persistent contrastive divergence (PCD). Using k=50 Gibbs steps, we show that for problems with high-energy barriers between modes, QA-based seeds can improve upon chains with CD and PCD initializations. For these hard problems, QA gradient estimates are more accurate, and allow for faster learning. Furthermore, and interestingly, even the case of raw QA samples (that is, k=0) achieved similar improvements. We argue that this relates to the fact that we are training a quantum rather than classical Boltzmann distribution in this case. The learned parameters give rise to hardware QA distributions closely approximating classical Boltzmann distributions that are hard to train with CD/PCD.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 5

page 9

page 14

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the early 1980s, a number of authors suggested that certain computations might be accelerated with computers making use of quantum resources [Ben80, Deu85]. Feynman’s 1981 proposal [Fey82] suggested that quantum systems themselves might be more efficiently modelled with quantum computers. Over a decade later, Peter Shor devised a polynomial-time quantum method for factoring large integers. Despite this theoretical promise, progress towards experimental quantum computing platforms remained limited. It was not until 1998, with the introduction of quantum annealing (QA) [KN98], that a path to scalable quantum hardware emerged. While existing QA machines are not computationally universal, QA machines are available now at large scales and offer significant speedups for certain problem classes [DBI15]. Here, we explore the potential of QA to accelerate training of probabilistic models.

The QA heuristic operates in a manner analogous to simulated annealing (SA), but relies on quantum, rather than thermal, fluctuations to foster exploration through a search space. Just as thermal fluctuations are annealed in SA, quantum fluctuations are annealed in QA.

With the exception of [AH15, BGBO16], most applications run on QA hardware have used the optimization potential of quantum annealing. In [AH15]

, the focus is on training a 4-layer deep belief network. Pre-training of each layer uses restricted Boltzmann machines (RBMs) trained using QA via a complete bipartite graph embedding.

[AH15] tested their approach against 1-step contrastive divergence (CD) samples on a coarse-version of MNIST and concluded that QA sped up training significantly. In [BGBO16], the authors consider training a fully connected Boltzmann machine (BM) using QA via a complete graph embedding on the hardware graph. They report a training speed-up compared to training with simulated annealing directly on the complete graph. These studies assume that the quantum hardware produces a classical Boltzmann distribution. In contrast, in this paper we do not assume the QA samples are Boltzmann. We demonstrate the differences between classical Boltzmann and QA hardware samples, and explore the impact of these differences in training fully-visible BMs in small density estimation tasks. Training of BMs is a natural application domain because available QA hardware realizes Boltzmann-like distributions, inference in BMs is known to be very hard [LS10], and BMs are a building block of many generative probabilistic models [SH09].

We begin with background on QA on the annealing-based quantum system, highlighting its practical constraints. We characterize the sampling done by the hardware, which in some cases is Boltzmann and in other cases differs significantly from Boltzmann. We then describe the challenge of learning probabilistic models with BMs, and how QA might accelerate such training. We provide benchmark results on the learning of multimodal distributions, and quantify the benefits that QA can offer. Lastly, we show the impact of the non-Boltzmann nature of the D-Wave system, and how this impacts learning. We conclude with directions for future work.

2 Quantum Annealing

QA uses quantum-mechanical processes to minimize and sample from energy-based models. The D-Wave machine implements the Ising model energy function:

111Vectors are indicated in lowercase bold font, and matrices in uppercase bold font.

with variable connectivity defined by a graph

. The 2000-qubit D-Wave system allows for up to

variables with sparse bipartite connectivity. The connectivity graph of the D-Wave device is called Chimera, and denoted . consists of an array of unit cells with connection between unit cells as in Fig. 0(a), which shows a graph.

(a) The Chimera graph consisting of a array of bipartite unit cells. Nodes represent problem variables with programmable weights , and edges have a programmable connection.
programming time 25 ms
anneal time s/sample
readout time 260 s/sample
(b) Typical timing data of the 2000-qubit D-Wave system.
Figure 1: 2000-qubit D-Wave system parameters.

The tree-width of graph is 48, so exact inference is practically impossible. It is simple to convert valued spins to Boolean-valued variables so that also defines a BM with energy and the same sparse bipartite connectivity.

Quantum mechanics replaces the energy function with a linear operator acting on states and returning new states . This energy operator is described by the Hamiltonian, a matrix whose components are indexed by . The diagonal elements of record the energy of the corresponding states, i.e., , and the off-diagonal elements of act to transform states. In the D-Wave machine the only allowed off-diagonal contributions are those which flip bits, i.e. for

Quantum processes favor states corresponding to the eigenvectors of low-energy eigenvalues of

. Thus, at zero temperature when

, quantum evolution corresponds to uniform sampling within the eigenspace corresponding to the lowest eigenvalue (energy) of

. However, gives rise to eigenvectors that are linear combinations of basis vectors. These states are called superpositions, and are interpreted as follows. An arbitrary superposition is written as where is a weight (often called an amplitude), and is the basis vector corresponding to state . In superposition any particular state,

, is observed with probability proportional to

. Thus, the quantum state implicitly encodes degrees of freedom. Superposition states are unavailable in non-quantum devices, and are a source of the speedups seen in quantum computations. In hardware like the D-Wave annealer, superposition states are generated by physical processes and do not need to be simulated.

In QA algorithms, is varied over time so that222 is Iverson’s bracket defined to be 1 if predicate is true, and 0 otherwise.

(1)

The time-dependent weightings are monotonically decreasing/increasing and satisfy and , so that we evolve from — which has no diagonal energy contribution, and which assigns equal probability to all states () — to the Ising energy function . The decreasing quantum effects mediated by give rise to the name quantum annealing. For certain classes of optimization problems, quantum annealing can be dramatically faster than simulated annealing [KYN15, DBI15].

On the 2000-qubit D-Wave system, the annealing time is programmable (the default anneal time is 20 s). A single sample is then measured (drawn) at time , and the process is repeated in an i.i.d. fashion for subsequent anneals. On the first anneal, the parameters and must be specified, requiring a programming time around ms. Further timing data of the 2000-qubit D-Wave system are listed in Fig. 0(b).

The Ising Hamiltonian described above is a zero temperature () idealization of real-world complexities. Important deviations from ideality arise from:

  • Finite temperature: QA hardware does not operate at zero temperature. In units where the parameters lie in the interval and , the effective hardware temperature is problem dependent and usually between and .333The sampling distribution is not Boltzmann so the notion of temperature as it appears in a Boltzmann distribution is ill-defined, and many factors beyond physical temperature contribute to an effective “temperature.”

  • Parameter misspecification: During programming the

    parameters are subject to additive Gaussian noise having standard deviations

    and respectively. Additionally, in spite of calibration of the device, small systematic deviations from the idealized Ising model arise because the Ising model is only an approximation to the true low-energy physics.

  • Dynamics: The quantum mechanical evolution of the annealing process cannot be simulated at large scales (even for idealized models), and quantum effects can cause significant deviations from the classical Boltzmann distribution. A better approximation is obtained using the density matrix of the quantum Boltzmann distribution , but even this approximation fails to capture the out-of-equilibrium effects of rapid annealing within the D-Wave device [RYA16].

In spite of these complexities, it remains true that QA hardware rapidly produces i.i.d. low energy samples from programmable Chimera-structured energy models. Here, we explore whether this capability can be harnessed for efficient learning of Chimera-structured BMs. As our interest is on the sampling aspects of learning, we focus on fully visible models to avoid the confounding influence of multimodal likelihood functions.

3 QA Versus Boltzmann Sampling

To begin, we explore the QA sampling distributions. As a rough characterization, we might expect a Boltzmann distribution , and indeed for some problems this is a good description. However, the Boltzmann distribution assumes classical statistics, and numerous experiments have confirmed the quantum nature of the D-Wave systems [LPS14, DBI15]. With different choices of energy functions we can clearly expose its quantum properties.

Consider the Ising model illustrated on Fig. 1(a). The model consists of 4 unit cells. The variables within each unit cell are strongly ferromagnetically444Ferromagnetic () connections induce neighbouring spins to take the same value in low energy states. connected with connection weights of . The connections between unit cells form a frustrated loop, and have weights 10 times weaker in magnitude than the intra-cell connections. The weights on all variables are zero. We call this a frustrated loop of clusters problem, and reference it as FCL-1.

(a) FCL-1 problem. Edge colors represent weights of connections, and all biases are 0.
(b) Probabilities of local optima for QA and MCMC annealing samples. Red bars represent exact Boltzmann probabilities of local optima, while blue and green bars represent empirical probabilities of QA and MCMC samples.
Figure 2: FCL-1.

The energy landscape of FCL-1 has 16 local minima corresponding to the possible configurations of four unit cells (each variable within a unit cell takes the same value in low-energy states). These 16 local minima are separated by high-energy barriers. To cross a barrier, an entire unit cell must be flipped, incurring an energy penalty of . Among the 16 local minima, 8 are ground states and 8 are excited states, and the energy gap between ground and excited states is 4.

The energy barriers make it very difficult for any single-spin-flip Markov chain Monte Carlo (MCMC) algorithm to move between valleys. To draw approximate Boltzmann samples from FCL-1, we ran MCMC chains from random initializations, and updated each using blocked Gibbs sampling with Gibbs updates, annealed over 1000 temperature steps.555Blocked Gibbs sampling without annealing performed much more poorly. The inverse temperature steps were set uniformly over the interval so there are 10 blocked Gibbs updates at each .

Under FCL-1, we also generated QA samples666We used 100 random spin-reversal transformations as suggested by D-Wave to mitigate parameter misspecifications., each obtained with a annealing process. To adjust for physical temperature of the hardware, we scaled down the values of by a factor of , which is a crude estimate of the parameter for this problem. As a result, the model programmed on hardware had all values within the range as required by the 2000-qubit D-Wave system.

The resulting empirical probabilities of 16 local minima under both MCMC (green) and QA (blue) sampling are shown in Fig. 1(b). The abscissa represents the 16 local minima. The ordinate records the probability of each local minimum. Red bars show the probabilities of local minima under a classical Boltzmann distribution. QA empirical probabilities follow the exact Boltzmann probabilities closely, with a Kullback–Leibler (KL) divergence of empirical distribution from exact Boltzmann distribution of . In contrast, MCMC annealing substantially over-samples excited states with corresponding . MCMC chains become trapped in excited minima during the anneal, and are not able to cross barriers between states as the temperature decreases.

The failure of the MCMC annealing process is shown more in detail in Fig. 3. Here, the abscissa records inverse temperature, and the ordinate records probability. The solid green, red, and blue curves represent the exact combined probabilities of all 16 local minima, all 8 ground states, and all 8 excited states respectively. The dashed lines represent corresponding empirical probabilities derived from MCMC chains at each temperature step. Notably, the exact probabilities of excited states change non-monotonically during the annealing process. At early stages of the anneal at low values, the probability of excited states increases as a function of as probability flows from the entire solution space into the local minima. As increases further, the dynamics alter. Probability transitions from excited states to ground states, and the total probability of excited states decreases as a function of . The MCMC process is able to accurately model probabilities of all states at early stages of the anneal, but when the energy barriers between states grow sufficiently large, the process freezes, and the probabilities of local minima do not change. As a result, MCMC over-samples excited minima.

Figure 3: Dynamics of the MCMC annealing process on FCL-1.

It might be argued that a single parameter, , can be adjusted to provide a close match between the QA distribution and the corresponding Boltzmann distribution, since there are only two relevant distinct energies within FCL-1. To address this concern, we modified FCL-1 by breaking symmetry within the inter-cell frustrated loop connections. The modified problem, FCL-2, is shown on Fig. 3(a).

(a) Cluster problem with modified inter-cell couplings (FCL-2).
(b) Empirical probabilities of local optima obtained from QA MCMC samples. Red bars represent the exact Boltzmann probabilities of local optima, blue and green bars represent empirical probabilities derived from QA and MCMC samples respectively.
Figure 4: FCL-2.

FCL-2 has the same 16 low-energy local optima, but 4 of these are ground states, and the remaining 12 excited states have diverse energy values. We repeated the sampling procedures described above using the same value of to adjust the values programmed on hardware. The results are presented in Fig. 3(b). Again we see that the empirical QA samples closely follow the exact Boltzmann distribution, with KL divergence of 0.006, while MCMC annealing continues to over-sample excited states, only reaching a KL divergence of 0.28.

Thus far, the QA distributions closely approximate the classical Boltzmann distribution. A little digging into the physics yields the reason. During quantum annealing, there is a freeze-out analogous to the classical freeze-out seen in Fig. 3. For FCL-1 and FCL-2, the equivalence of all intra-cell interactions means that quantum effects at the freeze-out point affect all ferromagnetically connected clusters equally. This freeze-out translates to a simple energy shift in the classical spectrum, so that the quantum Boltzmann distribution is very similar to the classical distribution. In general however, clusters might not freeze at the same point. Next, we consider Ising models where the QA distribution deviates from the classical Boltzmann. Such models can be obtained by differentiating among the couplings. Thus, we consider the FCL-3 problem of Fig. 4(a).

(a) Cluster problem with uneven cluster strengths (FCL-3). Colour represents weights of the connections.
(b) The QA distribution deviates substantially from classical Boltzmann, but is in a qualitative agreement with the Redfield simulation of the quantum dynamics.
Figure 5: FCL-3.

The results of the same sampling procedure applied to FCL-3 are presented in Fig. 4(b). Again, red bars represent the classical Boltzmann probabilities of energy local minima, and blue and green bars represent empirical probabilities of local minima derived from QA and MCMC samples respectively. Now we see that the QA distribution deviates substantially from classical Boltzmann with a KL divergence similar to that obtained by a MCMC and anneal procedure (0.11). Clusters with large (strong) freeze earlier in the quantum annealing process compared to weak ones [Ami15]. Hence, qubits in strong clusters equilibrate under a quantum Boltzmann distribution at a lower energy scale than qubits in weak clusters. The result is a distorted distribution that deviates from the classical Boltzmann. To confirm this explanation, we applied a classical Redfield simulation of the quantum dynamics [ATA09]. Orange bars in Fig. 4(b) show empirical probabilities of local minima derived using this simulation agree closely with probabilities derived from QA samples.

Lastly, we modified cluster strengths for an FCL-2 problem (with a broken symmetry between excited states) and denoted the resulting problem FCL-4 (Fig. 5(a)). The sampling results are shown in Fig. 5(b). The QA distribution again deviates significantly from the classical Boltzmann, but agrees closely with the quantum simulation.

(a) Cluster problem with uneven cluster strengths and modified inter-cell couplings. Colour represents weights of the connections.
(b) Sampling results for FCL-4. The QA distribution deviates substantially from classical Boltzmann one.
Figure 6: FCL-4.

From a machine learning perspective, these asymmetric cluster problems may appear discouraging, as they suggest that the general QA distribution has a complicated form that depends on unknown factors, e.g. freeze-out points for different qubits. In the next section, however, we show that at least in considered cases it is possible to adjust (with simple learning rules) hardware parameters to match classical Boltzmann distributions of interest.

4 Training Boltzmann Machines Using QA

4.1 Fully Visible Boltzmann Machines

A Boltzmann machine defines a probability distribution over

-valued variables as

(2)

where the partition function is . For Chimera-structured BMs the vector of sufficient statistics is given by . Often, hidden variables are introduced to increase the modeling flexibility of BMs, but we defer the study of hidden variable models because the likelihood surfaces that result become multimodal. BMs play an important role in many machine learning algorithms, and serve as building blocks for undirected generative models such as deep BMs [SH09].

In fully visible BMs, the parameters are learned from training data by maximizing the expected log-likelihood of :

(3)
(4)

where is the training data distribution. Though is a concave function (making maximization straightforward in principle), neither nor can be determined exactly for models at large scale. Thus, training of practically relevant BMs is typically very difficult. The dominant approach to training BMs is stochastic gradient ascent, where approximations to are used [You98]. MCMC (specifically Gibbs sampling) is used to estimate needed for at parameter setting , and is updated (most simply) according to the estimated gradient as . A variety of methods are available for the gradient step size . The efficacy of stochastic gradient ascent depends on the quality of the gradient estimates, and two methods are commonly applied to seed the MCMC chains with good starting configurations. Contrastive Divergence (CD) [Hin02, CPH05] initializes the Markov chains with the data elements themselves since (at least for well-trained models) these are highly likely states. Persistent Contrastive Divergence (PCD) [Tie08], improves upon CD by initializing the Markov chains needed for with samples from the previous chain at . If gradient steps on are small, it is hoped that samples from rapidly equilibrate under .

The approaches used in CD and PCD to foster rapid equilibration acutely fail in multimodal probability distributions that have high-energy barriers. However, even simple problems at modest sizes can show the effects of poor equilibration under PCD as the problem size grows. To demonstrate this, we generated 20 Chimera-structured Ising models with and randomly sampled from at sizes (72 variables), (128 variables), and (200 variables). PCD-estimated gradients used 1000 chains with either 2, 10, or 50 blocked Gibbs updates, and all models were trained for 500 iterations using Nesterov-accelerated gradients [Nes83]

. The Nesterov method uses momentum (past gradients), and is more susceptible to noisy gradients than stochastic gradient descent

[DGN14]. The learned model results are presented on Fig. 7 ( is learned, and is fixed to zero). The abscissa represents problem size, and the ordinate represents the log-likelihood-ratio averaged on test data. Note that this ratio is a sampling-based estimate of . The exact model is recovered when the KL divergence is zero. As expected, models trained using exact samples achieve a KL divergence close to 0 on all instances, but PCD requires progressively more Gibbs updates as the problem size increases.

Figure 7: Training of random BMs. Models trained with exact samples minimize the KL divergence, but models trained with approximate PCD sampling require progressively more Gibbs updates to perform well. Solid lines represent the mean value across 20 random instances, and dashed lines represent 25th and 75th percentiles.

In subsequent experiments, we explore whether QA may improve upon CD and PCD by providing MCMC seeds that more accurately sample low-energy states of thus allowing for faster equilibration and better gradient estimates.

4.2 Experiments

In training models on QA hardware, it is important to distinguish from the D-Wave QA sampling distribution. By we denote the distribution formed by sampling the QA hardware at parameter followed by sweeps of blocked Gibbs updates at parameter . In particular, is the raw hardware distribution at , and . In the experiments we report, we use blocked Gibbs sweeps.

To test QA for BM learning we train fully visible multimodal Chimera-structured models. For a variety of problems up to scale (200 variables), we specify , draw exact Boltzmann samples777We can sample exactly because the treewidth of is 20. from , and try to recover from the samples. We compare the efficacy of CD, PCD, and QA-seeded MCMC chains. In all CD/PCD/QA cases, each chain is run for 50 blocked Gibbs updates. To assess the accuracy of the learned models, we measure the log likelihood on both training and held out test data, and compare these results to known optimal values.

For each FCL problem, we generate a training and a test set of size using an exact Boltzmann sampler. All FCL problems have and only parameters are learned. During training, gradients are estimated from 1000 Monte Carlo chains seeded with CD, PCD, or QA initializations. The QA seeds are obtained by calling the quantum hardware with the standard anneal. In all cases, 50 block Gibbs updates are performed on the seeds. To speed training, we used Nesterov accelerated gradients. The results for FCL-1 are presented in Fig. 8. After about 30 iterations, the CD and PCD procedures collapse, and the corresponding log likelihoods deteriorate. This occurs when the energy barriers between local optima in the learned model energy landscape become too large for the MCMC chains to cross efficiently with 50 Gibbs updates. As a result, MCMC-based procedures obtain biased gradients and the CD/PCD models drift away from the optimal region. In contrast, QA-seeded gradients consistently improve the log-likelihood value for about 70 updates and stagnate within of .

Figure 8: Training on FCL-1 using Nesterov-accelerated gradient updates with constant step size 0.1 ( in the reformulation of [SMDH13]). Both CD and PCD procedures become unstable, but QA-seeded gradients exhibit stable learning.

The poor performance of CD and PCD is due in part to the choice of the Nesterov accelerated gradient updates, which, as mentioned earlier, are more sensitive to noisy gradients than stochastic gradient descent updates. Interestingly, increasing the number of Gibbs steps (up to ) does not help either CD or PCD significantly. As expected, we found training CD/PCD with simple stochastic gradient updates to be more effective over a wide range of iteration-independent learning rates . A smaller learning rate effectively corresponds to a larger number of Gibbs updates at a larger learning rate, and therefore improves the quality of estimated gradients, but takes more time. We trained CD/PCD models for 10,000 iterations, and compared to 200 iterations of training using QA with Nesterov-accelerated gradients. The CD/PCD learning rates were varied from , where learning rapidly goes unstable, to where learning was impractically slow within 10,000 iterations. The results are shown in Fig. 9. We found that some of the CD and PCD trained models achieved values similar to that of QA-based learning, but required times as many model updates.

Figure 9: Training on FCL-2. QA is trained using Nesterov updates, while CD/PCD are trained using standard stochastic gradient descent with a fixed learning rate. Decreasing the learning rate for CD and PCD improves the stability of the procedures, but increases the number of iterations required to reach low values of KL divergence.

It is reassuring that QA samples are able to improve upon CD/PCD in FCL-1 and FCL-2 where the QA distribution closely follows the classical Boltzmann distribution (see Figs. 1(b) and 3(b)). However, what about training on FCL-3 where QA exhibits strongly non-Boltzmann behavior (see Fig. 4(b))? In order for the difference in cluster strengths to be reflected in the data, we scaled down all in FCL-3 by a factor of 3.888The FCL-3 weights are strong enough that there are negligibly few broken intracluster bonds, and therefore training data generated for FCL-3 and FCL-1 are almost identical. We train a BM using QA-seeded gradients and fixed learning rate on the resulting problem to learn parameters .

To characterize , we determine the occupation of local minima under (in red) and (in blue). In Fig. 10 green bars represent the local minima occupation probabilities in the scaled-FCL-3 training data.

Figure 10: Training on scaled-FCL-3 where parameters are scaled down from FCL-3 by a factor of 3. The bars indicate the local minimum probabilities derived from the learned model using a Boltzmann distribution (red), and the hardware distribution (blue). Green bars are the probabilities in the training data.

The occupation probabilities do not sum to 1 as there is significant probability of occupying states with broken intracluster bonds. The Boltzmann distribution fits the data poorly, but fits the data well. More detailed examination reveals that over-samples the states that are under-sampled when the QA hardware is used to sample from FCL-3 (Fig. 4(b)

). The learning procedure therefore adjusts the model to compensate for the deviation of QA distribution from classical Boltzmann. This suggests two important conclusions. Firstly, the gradients of the loss function Eq. (

4) used in the training procedure and derived under the assumption of classical Boltzmann distribution remain useful in optimizing the model under non-Boltzmann QA distribution. Secondly, the parameters of the hardware distribution in this case are flexible enough to closely approximate a classical Boltzmann distribution of interest.

5 Assessment of Learned QA Distributions

The results of the previous section suggest that the learned models may not be good fits to training data under Boltzmann assumptions, but may be when sampling according to . Ideally, we would quantify this by measuring log likelihood on test data, but this is not directly possible because a closed form expression of the hardware distribution is unavailable. Instead, we fit a density estimate to data sampled from , and evaluate test set log-likelihood using the tractable fit.

(a) NADE estimates on test data.
(b) Boltzmann estimates on test data.
Figure 11: Analytic density estimates.

Let represent a tractable fit obtained from samples of , which approximates . We require that can be evaluated for any so that the log likelihood of test data may be computed. One choice for is the neural autoregressive density estimator (NADE) [LM11]

. NADE decomposes the joint distribution into a product of conditional density estimates, one for each dimension of

. NADE often outperforms other density estimators, but it suffers from slow training and the necessity of hyperparameter tuning. We made some effort to optimize hyperparameters, but improved values are likely possible.

Consider again the FCL-3 problem. We denote the FCL-3 parameters by , and the parameters of the model learned under QA gradients as . Let and represent the Boltzmann and hardware probability distributions for parameters . We compile three data sets each consisting of samples from (data), , and . The data sets are further split into 5000 training and 5000 test points. To apply NADE to the datasets, we use an RBM with 200 hidden units with a learning rate initialized to and decreased over time as

. The NADE optimization is terminated when the algorithm sees no performance improvement for 10 consecutive epochs. We validate the quality of the resultant NADE models by showing scatter plots of log probability of each test point with respect to its energy (first three panels of Fig. 

10(a)). The NADE models are all roughly Boltzmann with log probability decreasing approximately linearly with as expected. In Fig. 10(a) we show the average test set log-likelihood of the NADE models trained on samples from , and . For comparison, the horizontal blue line denotes the likelihood of test data under the true model . According to NADE, the hardware model is a better fit to test data than .

The NADE algorithm is heuristic and introduces its own error in estimating the test set log likelihoods, and our hope is that the NADE error is smaller than the differences in test set log likelihoods. For models of unknown structure, we have no better alternative than a blackbox approach like NADE, but on these problems where we know the training data is Boltzmann distributed we can do better. As all three distributions should be either Boltzmann or close to Boltzmann, we fit a Boltzmann distribution to each set of samples. Fig. 10(b) shows analogous results but under a Boltzmann fit rather than a NADE fit. In this case we see that on test data is an excellent fit, and almost matches the true test set log likelihood. Thus, the QA-enabled training procedure learns a very good data model under the hardware distribution despite the fact that the hardware distribution is significantly non-Boltzmann. In the rest of the paper, we assume that is calculated using Boltzmann estimates.

Lastly, we characterize the relative computational effort of learning on larger problems. These problems consist of 200 variables arranged as a array of unit cell clusters with , and with inter-cell couplings that are randomly . These problems have many local minima due to the frustrated loops between clusters, and have high-energy barriers between local minima. We indicate a particular realization of this model as and create test and training states of 500,000 each by sampling from . Parameters are learned from the training data using PCD and QA seeded gradients, and approximate KL divergence is measured using the test data. In all cases, we use 1000 Monte Carlo chains and apply 50 blocked Gibbs updates. In Figs. 11(a) and 11(b) we show the number of gradient updates required by PCD and QA-seeded gradients to achieve a specified under stochastic gradient (SGD) and Nesterov updates. We ran PCD at 9 different learning rates ranging from down to , and QA-seeded gradients at learning rates , , and . At each divergence, we counted the number of gradient updates in the method requiring the fewest number of updates to attain that . For comparison, we also indicate the rate of learning under exact gradients using a step size of 0.1.

The curves labeled and are the two variants of hardware-trained models. Curves that terminate at finite values indicate that no lower divergence was found. We see that Nesterov updates using QA gradients result in the most rapid learning.

6 Training Quantum Boltzmann Machines Using QA

We have seen that QA-seeded MCMC can speed training of some classical BMs. The learning rule we employ, (which assumes a classical Boltzmann sampling distribution), results in models that adapt to the biases arising from deviations between the QA sampling distribution and the classical Boltzmann distribution. As a consequence, is usually a better model than . In light of this, it is natural to explore a training procedure that avoids blocked Gibbs postprocessing entirely, namely , and evaluate the generalization of on test data.

This may seem a strange learning rule as it is motivated by assuming the QA sampling distribution is Boltzmann, which it clearly is not. However, as we show next it can be theoretically motivated.

6.1 Fully Visible Quantum Boltzmann Machines

When annealing classically, the dynamics can freeze as the temperature drops below the size of relevant energy barriers. We provided an example of this in Fig. 3 for classical annealing on FCL-1. A similar effect can occur during quantum annealing where dynamics freeze at time prior to the end of the quantum anneal at . Thus, a more accurate model of QA distribution is described in [Ami15] using a transverse Ising Hamiltonian for the Hamiltonian of Eq. (1) and some . The density matrix of the distribution defined by is

where the partition function is simply the trace of the matrix , and the probability of state is the diagonal entry of . Maximizing the log likelihood of this distribution is difficult. Instead, [AAR16] proposes to maximize a lower bound obtained using the Golden-Thompson inequality:

The gradient of this lower bound can be estimated exactly as in (4) using the raw QA samples, that is, using :

(5)
(a) Stochastic gradient updates.
(b) Nesterov updates.
Figure 12: Learning on four randomly generated frustrated cluster loop problems.

6.2 Experiments

We generated problems as in Section 4.2. We focus on the problem class where QA sampling shows the largest deviation from classical Boltzmann sampling, namely a array of clusters with randomly assigned cluster strengths from as in FCL-4, and where all 4 cycles are frustrated, and have otherwise random couplings from . Training and test sets had size points each, generated by an exact Boltzmann sampler. On these problems, provides better fits than . As mentioned before, we used raw hardware samples (postprocessing offered no improvement) and used to measure performance.

We tested annealed learning using gradient step sizes decaying as .999The 200 scaling factor was determined by cross validation to provide good learning under PCD. Both CD and PCD used 10,000 blocked Gibbs updates at each parameter update. Our findings for cluster problems are summarized in Fig. 13.

These plots show the evolution, over the SGD iterations, of test set KL divergences for software runs and for QA runs (the dotted red line is the performance of QA using , for reference). The values shown are the best for each algorithm where . For these examples, only CD with 10,000 blocked Gibbs updates was competitive with QA.

Figure 13: Test set performance under annealed learning schedules.

7 Discussion

In this work, we have studied the utility of quantum annealing in training hard fully visible Boltzmann distributions. We have empirically characterized the sampling distribution of the D-Wave QA device on a number of problem classes, and shown that, while the device is effective at sampling low-energy configurations, the sampling distribution can differ significantly from classical Boltzmann. In spite of this, a learning procedure that updates model parameters as if the sampling distribution were Boltzmann results in excellent models as long as samples are drawn from the QA hardware followed by Gibbs updates. We tested several values of and we noticed improvements over CD and PCD. Interestingly, raw QA samples (i.e., ) provided similar improvements. We justify this by relating learning in classical BMs and quantum BMs as described in [AAR16]. We have demonstrated computational benefits over PCD and CD by measuring the decrease in the number of parameter updates required for training, and shown benefits under both fixed and decaying learning rates.

These promising results justify further exploration. Firstly, the computational benefits of QA over CD/PCD were demonstrated in artificial problems constructed to have high-energy barriers between modes, but which were small enough to yield exact results. We anticipate that more realistic problems also having large energy barriers would show similar QA improvement, but this should be validated. Secondly, we would like to have further evidence that the QA model of [AAR16] or an extension of it can be used to justify the parameter update rule of Eq. (5) to raw QA samples. Our motivation is heuristic, and a deeper understanding might provide more effective learning updates. Thirdly, the sparsity of connections on current QA hardware limits the expressiveness of models, and hidden variables are required to model distributions of practical interest. Thus, studies similar to this one should characterize performance for QA-based learning in models with hidden variables. Lastly, QA hardware is continuously being improved, and new parameters that control the quantum annealing path (the and functions of Eq. (1)) have recently been developed. Learning to exploit these additional controls for improved training is an important and challenging task.

References