Restricted Boltzmann machines (RBMs) and other Boltzmann machines are the dominant means of using deep learning to solve tasks that involve unsupervised learning and probabilistic modeling, such as filling in missing values or classification with missing inputs [Goodfellow et al.2013]. Unfortunately, the log likelihood of the RBM is intractable [Long and Servedio2010], and for other Boltzmann machines most other interesting quantities are intractable as well. In this paper, we explore the use of quantum hardware to overcome these difficulties. This approach could possibly unlock the untapped potential of non-restricted Boltzmann machines.
The model may be trained using sampling-based approximations to the gradient of the log likelihood [Younes1998, Tieleman2008]. However, drawing a fair sample from the model is also intractable [Long and Servedio2010].
Existing approaches are based on Markov chain Monte Carlo (MCMC) procedures. The cost of drawing a fair sample using an MCMC method may be high if the number of steps required to get a good sample is high. This occurs in practice because some RBMs represent distributions with modes that are separated by regions of extremely low probability, which the Markov chain crosses only rarely. This is particularly problematic because it interacts with the learning procedure in a vicious circle: as training progresses, parameters (weights and biases) gradually become larger, corresponding to sharper probabilities (higher near training examples, and smaller elsewhere), i.e., corresponding to sharper modes separated by zones of lower probability. Since training procedures based on approximating the log-likelihood gradient require sampling from the model (usually by MCMC), as training progresses sampling becomes more difficult (mixing more slowly between modes, i.e., more samples would be required to achieve the same level of variance in the MCMC estimator of the gradient), making the gradient less reliable and thus slowing down training.
One possible solution is to construct a physical system whose natural behavior is to take on states with the desired probability. One may then obtain the desired samples by observing the behavior of the system, rather than explicitly performing computations to simulate the dynamics of such a system. We refer to this approach as “physical computation”. It is similar in spirit to “analog computation” but we find that term inappropriate in this case, since the sampled states remain digital. Note that this is different from the idea of building an RBM “in hardware”–we are not merely advocating designing an FPGA that specializes in performing the kinds of digital computations used for simulating an RBM.
Physical computation is a strategy being actively pursued by D-Wave Systems Inc. (http://www.dwavesys.com/en/products-services.html) and DARPA’s UPSIDE program (http://www.darpa.mil/Our_Work/MTO/Programs/Unconventional_Processing_of_Signals_for_Intelligent_Data_Exploitation_(UPSIDE).aspx). In particular, the D-Wave Two system can be viewed as a physical implementation of an RBM. Most approaches to physical computation share the property that they greatly simplify the complexity of a task that is difficult for digital computers, but also introduce many limitations that digital computers do not share. For instance, any physical implementation of an RBM will likely face the issues of noisy parameters, limited parameter range and restricted architecture. This paper aims to better understand the effect of these three constraints on the training and performance of the physical RBM and, ultimately, the feasibility of the physical approach. In particular, we would like to address the following questions:
Which constraint has the worst effect on performance?
Under which circumstances can a physical implementation of the RBM be reasonably trained?
Are there ways to mitigate the degrading effects of constraints imposed by physical computation?
Currently, the only practical physical RBM available is the D-Wave Two system (but see [Dupret et al.1996] for earlier work on physical computation also associated with Ising models). It suffers from all three of the limitations we wish to study. In order to study each limitation in isolation, we performed a suite of feasibility studies using a simulated physical computer, which we implemented in software on a GPU. Using a simulation allows us to observe what happens when a physical computer has, for example, noisy parameters but neither a limited parameter range nor architecture restrictions. Because these experiments are performed in simulation, we do not capture the benefit of physical computation: faster, less correlated samples. Instead, we aim to characterize the potential detriments of physical computation. In particular, by studying each constraint in isolation, we are able to infer their relative effect on performance and thereby offer guidance for how both hardware and algorithm designers can best focus their efforts on those properties of physical computation that impose the greatest barriers to its practical use.
2 Restricted Boltzmann machines
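In particular, the energy function is

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top \mathbf{W} \mathbf{h} - \mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h},$$

where $\mathbf{v}$ and $\mathbf{h}$ are the binary visible and hidden state vectors, $\mathbf{W}$ is the weight matrix, and $\mathbf{b}$ and $\mathbf{c}$ are the visible and hidden biases.
This particular form of energy makes the computation of conditional probabilities trivial:

$$P(h_j = 1 \mid \mathbf{v}) = \sigma\Big(c_j + \sum_i W_{ij} v_i\Big), \qquad P(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big),$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid.
Although conditional sampling in an RBM is trivial, sampling from the joint $p(\mathbf{v}, \mathbf{h})$ or from the marginal $p(\mathbf{v})$ cannot be done in a single step and requires the use of Markov chain Monte Carlo, which in general becomes computationally expensive if the parameters $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{c}$ are configured in a way that makes the Markov chain mix slowly.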
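As a minimal sketch of the block Gibbs sampling implied by these conditionals, the following NumPy code alternates between sampling the hidden units given the visible units and the visible units given the hidden units. The function names, the toy layer sizes and the use of the conventional symbols W, b and c for the weights and biases are assumptions made for illustration; this is not the authors' GPU implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs sweep: sample h given v, then v given h."""
    p_h = sigmoid(c + v @ W)        # P(h_j = 1 | v), shape (batch, n_hidden)
    h = (rng.random(p_h.shape) < p_h).astype(np.float64)
    p_v = sigmoid(b + h @ W.T)      # P(v_i = 1 | h), shape (batch, n_visible)
    v_new = (rng.random(p_v.shape) < p_v).astype(np.float64)
    return v_new, h

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4          # toy sizes, chosen only for illustration
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

# Three independent chains started from random binary states.
v = rng.integers(0, 2, size=(3, n_visible)).astype(np.float64)
for _ in range(100):                # many sweeps are needed when mixing is slow
    v, h = gibbs_step(v, W, b, c, rng)
```

Each sweep is cheap; the cost discussed in the text comes from the number of sweeps needed before the chain state can be treated as a fair sample.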
3 RBM Learning and Inference
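Given some dataset $\mathcal{D}$ ($\mathbf{v}^{(t)}$ for $t = 1, \dots, T$), training an RBM is most commonly done via an approximation to the gradient of the log-likelihood with respect to the model parameters $\theta_i$, elements of the parameter vector $\boldsymbol{\theta}$:

$$\frac{\partial}{\partial \theta_i} \sum_{t=1}^{T} \log p(\mathbf{v}^{(t)}) = -\sum_{t=1}^{T} \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(t)})}\left[\frac{\partial E(\mathbf{v}^{(t)}, \mathbf{h})}{\partial \theta_i}\right] + T\, \mathbb{E}_{p(\mathbf{v}, \mathbf{h})}\left[\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta_i}\right].$$

The log-likelihood gradient has two contributions: one in the “positive phase” with the expectation over the model’s conditional hidden unit distribution given the data; the other in the “negative phase” with the expectation over the model’s full joint distribution.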
While the expectation over the conditional distribution in the clamped condition is straightforward to compute, the same cannot be said of the expectation over the joint distribution in the unclamped condition. The evaluation of this expectation is intractable for all but very small RBMs where the sum over either all states of the visible layer or all states of the hidden layer is feasible to compute. In practice, we commonly resort to an approximation to this expectation via sampling. The persistent contrastive divergence (PCD) algorithm (also known as stochastic maximum likelihood) [Younes1998, Tieleman2008] uses a persistent Gibbs (MCMC) sampling scheme that sequentially samples from the conditionals $p(\mathbf{h} \mid \mathbf{v})$ and $p(\mathbf{v} \mid \mathbf{h})$ to recover samples from the joint distribution. These samples are then used in a Monte Carlo approximation of the negative phase contribution of the log likelihood gradient.
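The PCD procedure described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function name `pcd_update`, the learning rate, the toy data and the layer sizes are hypothetical, and the symbols W, b and c follow the conventional RBM parameterization rather than any code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_chain, W, b, c, lr, rng, k=1):
    """One PCD-k update; returns the new parameters and chain state."""
    # Positive phase: exact conditional expectation of h given the data.
    ph_data = sigmoid(c + v_data @ W)
    # Negative phase: advance the persistent chains by k block Gibbs sweeps.
    for _ in range(k):
        ph = sigmoid(c + v_chain @ W)
        h = (rng.random(ph.shape) < ph).astype(np.float64)
        pv = sigmoid(b + h @ W.T)
        v_chain = (rng.random(pv.shape) < pv).astype(np.float64)
    ph_chain = sigmoid(c + v_chain @ W)
    # Stochastic gradient ascent on the log-likelihood estimate.
    grad_W = v_data.T @ ph_data / len(v_data) - v_chain.T @ ph_chain / len(v_chain)
    grad_b = v_data.mean(axis=0) - v_chain.mean(axis=0)
    grad_c = ph_data.mean(axis=0) - ph_chain.mean(axis=0)
    return W + lr * grad_W, b + lr * grad_b, c + lr * grad_c, v_chain

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                      # toy sizes for illustration
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(20, n_visible)).astype(np.float64)
chains = rng.integers(0, 2, size=(10, n_visible)).astype(np.float64)
for _ in range(50):
    W, b, c, chains = pcd_update(data, chains, W, b, c, lr=0.05, rng=rng)
```

The chain state is carried from one update to the next instead of being restarted at the data, which is what distinguishes PCD from ordinary contrastive divergence.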
While PCD has established itself as probably the most popular method of maximizing log likelihood in RBMs, it suffers from one important weakness. In many situations, as learning progresses and the model parameters begin to increase in magnitude, the Gibbs sampler at the heart of the negative phase contribution of the gradient can suffer from poor mixing properties. Generally, this occurs when the hidden and visible activations become highly correlated. Poor mixing in the Markov chain induced by Gibbs sampling leads to poor sample diversity, which in turn leads to poor estimates of the negative phase statistics and ultimately to a poor approximation of the likelihood gradient. This problem can be somewhat mitigated by increasing sample diversity through the use of PCD-$k$ (using $k$ Gibbs sampling steps between gradient updates). Other ways to mitigate the negative phase mixing issue include the use of auxiliary parameters [Tieleman and Hinton2009] and tempering methods [Salakhutdinov2010b, Desjardins, Courville, and Bengio2010, Cho, Raiko, and Ilin2010].
The promise of a physical implementation of the RBM is that we entirely sidestep the difficult mixing problem that occurs in the negative phase of training by acquiring fair, uncorrelated samples directly from a physical implementation of the RBM. In the next section we review the D-Wave machine, to our knowledge the only existing physical implementation of an RBM-like model.
4 The D-Wave system
The D-Wave Two system implements an Ising model [Ising1925]. Specifically, it has a signed state vector $\mathbf{s} \in \{-1, 1\}^n$ and a quadratic energy function

$$E(\mathbf{s}) = -\mathbf{s}^\top \mathbf{J} \mathbf{s} - \mathbf{g}^\top \mathbf{s},$$

where $\mathbf{J}$ is analogous to the weights of a Boltzmann machine and $\mathbf{g}$ is analogous to its biases. The set of Ising model distributions with states in $\{-1, 1\}^n$ is isomorphic to the set of Boltzmann machine distributions with states in $\{0, 1\}^n$. The conversion between the parameters of the two model families is a linear mapping. An RBM with states $\mathbf{v}$ and $\mathbf{h}$ in $\{0, 1\}$, encoded with weights and biases $\mathbf{W}$, $\mathbf{b}$ and $\mathbf{c}$, can be converted to use states in $\{-1, 1\}$ via the mapping:

$$\mathbf{W}' = \tfrac{1}{4}\mathbf{W}, \qquad \mathbf{b}' = \tfrac{1}{2}\mathbf{b} + \tfrac{1}{4}\mathbf{W}\mathbf{1}, \qquad \mathbf{c}' = \tfrac{1}{2}\mathbf{c} + \tfrac{1}{4}\mathbf{W}^\top\mathbf{1}.$$

One can draw samples from a Boltzmann machine using the D-Wave Two system just by performing this linear conversion of the parameters prior to requesting the sample. The resulting sample may be converted to a $\{0, 1\}$ sample simply by replacing all instances of -1 with 0. The choice of parameterization affects the learning dynamics of stochastic gradient descent, and the Boltzmann parameterization is usually better, so it is generally best to regard the model as a Boltzmann machine even if the interface to the sampling hardware uses the Ising parameterization.
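The linear conversion between the two parameterizations can be checked numerically. The sketch below assumes the conventional RBM energy with weights W and biases b and c (an assumption about notation), substitutes each 0/1 state x by the signed state 2x - 1, and verifies by enumeration that the two energies differ only by a constant, so that both parameter sets define the same distribution. The helper names `rbm_energy` and `to_ising` are hypothetical.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Bipartite quadratic energy, used for both state conventions."""
    return -(v @ W @ h) - b @ v - c @ h

def to_ising(W, b, c):
    """Map {0,1}-state RBM parameters to {-1,+1}-state parameters."""
    W_s = W / 4.0
    b_s = b / 2.0 + W.sum(axis=1) / 4.0
    c_s = c / 2.0 + W.sum(axis=0) / 4.0
    return W_s, b_s, c_s

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))     # toy RBM: 3 visible, 2 hidden units
b = rng.standard_normal(3)
c = rng.standard_normal(2)
W_s, b_s, c_s = to_ising(W, b, c)

# Enumerate all 2^5 joint states and record the energy difference.
offsets = []
for bits in itertools.product([0.0, 1.0], repeat=5):
    v, h = np.array(bits[:3]), np.array(bits[3:])
    sv, sh = 2 * v - 1, 2 * h - 1   # 0 -> -1, 1 -> +1
    offsets.append(rbm_energy(v, h, W, b, c) - rbm_energy(sv, sh, W_s, b_s, c_s))
offsets = np.array(offsets)

# A state-independent offset is absorbed by the partition function.
same_distribution = np.allclose(offsets, offsets[0])
```

Converting a signed sample back to the 0/1 convention is then the elementwise map (s + 1) / 2, which replaces every -1 with 0.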
The actual probability distribution sampled by the D-Wave Two system deviates slightly from the Boltzmann distribution $p(\mathbf{s}) \propto \exp(-E(\mathbf{s}))$. Moreover, it is difficult to control the value of the weights $\mathbf{J}$ or the biases $\mathbf{g}$ precisely. Both effects can be approximated by adding Gaussian noise to $\mathbf{J}$ and $\mathbf{g}$. To simulate the D-Wave Two system with reasonable accuracy, the noise should be added to $\mathbf{J}$ once each time the value of $\mathbf{J}$ is changed to a new unique value, but the noise on $\mathbf{g}$ should be resampled every time a new sample is drawn (Andrew Berkley, D-Wave Principal Scientist, personal communication). This is the approach we take in our GPU-based simulator of the D-Wave Two system. (One complication we do not attempt to model is that if the same value of $\mathbf{J}$ is requested twice, the error on $\mathbf{J}$ should be the same both times; it is not truly noise, but rather a deterministic error that has a Gaussian distribution when compared over multiple points in $\mathbf{J}$ space.) Other approaches to physical computation, such as those explored by DARPA’s UPSIDE program, face similar issues with noise.
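The two noise regimes described above (weight noise fixed for as long as the weights are unchanged, bias noise redrawn for every sample) could be simulated along the following lines. This is only a sketch under assumptions: it uses a software Gibbs sampler as a stand-in for the hardware, works in the 0/1 RBM parameterization, and the class name `NoisySampler`, the noise level and the number of Gibbs steps are hypothetical rather than taken from the authors' simulator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class NoisySampler:
    """Toy stand-in for a noisy physical sampler (illustrative only)."""

    def __init__(self, W, b, c, sigma, rng):
        self.rng, self.sigma = rng, sigma
        self.b, self.c = b, c
        # Weight noise is drawn once, when the weights are loaded.
        self.W_noisy = W + sigma * rng.standard_normal(W.shape)

    def sample(self, v, n_steps=10):
        # Bias noise is resampled each time a new sample is requested.
        b = self.b + self.sigma * self.rng.standard_normal(self.b.shape)
        c = self.c + self.sigma * self.rng.standard_normal(self.c.shape)
        for _ in range(n_steps):
            ph = sigmoid(c + v @ self.W_noisy)
            h = (self.rng.random(ph.shape) < ph).astype(np.float64)
            pv = sigmoid(b + h @ self.W_noisy.T)
            v = (self.rng.random(pv.shape) < pv).astype(np.float64)
        return v

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((6, 4))   # toy sizes for illustration
b, c = np.zeros(6), np.zeros(4)
sampler = NoisySampler(W, b, c, 0.05, rng)
W_before = sampler.W_noisy.copy()
v0 = np.zeros(6)
samples = np.stack([sampler.sample(v0) for _ in range(5)])
```

Loading new weights would correspond to constructing a new `NoisySampler`, which is the only point at which the weight noise changes.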
The D-Wave Two system also imposes restrictions on the magnitude of each individual element of $\mathbf{J}$ and $\mathbf{g}$. This is common to most approaches to physical computation.
Finally, many elements of $\mathbf{J}$ are constrained to be zero. This is because the various elements of the state vector are physically laid out in a 2-D grid, and only nearby elements can interact with each other. Specifically, the connectivity of the graphical model is constrained to be a chimera graph as illustrated in Fig. 1. We observe that this chimera graph can be partitioned to form a bipartite graph. Under such a partition, the D-Wave Two system comes very close to being an RBM. The only difference between this model and an RBM is that the noise on the biases causes the biases to be random variables rather than parameters of the model.
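The bipartite partition can be illustrated with a small script. The sketch assumes the commonly described chimera layout: an 8 x 8 grid of unit cells, each a complete bipartite graph between two groups of 4 units, with one group coupled to the matching units of the cell below and the other group to the matching units of the cell to the right. The node labelling and the `layer` assignment are hypothetical choices made for this example.

```python
def chimera_edges(m=8, t=4):
    """Edges of an m x m chimera graph with K_{t,t} unit cells.

    A node is labelled (row, col, side, k) with side in {0, 1}.
    """
    edges = []
    for r in range(m):
        for c in range(m):
            for i in range(t):
                # Intra-cell couplings: every side-0 unit to every side-1 unit.
                for j in range(t):
                    edges.append(((r, c, 0, i), (r, c, 1, j)))
                # Inter-cell couplings: side 0 vertically, side 1 horizontally.
                if r + 1 < m:
                    edges.append(((r, c, 0, i), (r + 1, c, 0, i)))
                if c + 1 < m:
                    edges.append(((r, c, 1, i), (r, c + 1, 1, i)))
    return edges

def layer(node):
    """Assign a node to layer 0 (e.g. visible) or layer 1 (e.g. hidden)."""
    r, c, side, _ = node
    return (r + c + side) % 2

edges = chimera_edges()
nodes = {n for edge in edges for n in edge}
# The partition is bipartite if no edge joins two nodes of the same layer.
is_bipartite = all(layer(u) != layer(v) for u, v in edges)
n_layer0 = sum(layer(n) == 0 for n in nodes)
```

Under this assignment the roles of the two groups alternate from cell to cell in a checkerboard pattern, which is what keeps the inter-cell couplings between layers.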
Denil and de Freitas [Denil and de Freitas2011] have also explored the use of D-Wave hardware for training RBMs. Like our work, their work is primarily a feasibility study based on software simulations. Their approach differs from ours in three respects: 1) We partition the D-Wave Two system into visible and hidden states using a partitioning that makes the chimera graph bipartite, so the hardware implements an RBM. Denil and de Freitas used a different partitioning that allowed visible-visible and hidden-hidden interactions. 2) We train using sampling-based approximations to the log likelihood gradient, while they train using empirical derivatives of an autoencoder-like cost function. 3) Our focus is on understanding how detrimental each of the limitations of the D-Wave hardware is in isolation, while Denil and de Freitas focus on devising an algorithm that works reasonably well with all limitations in place simultaneously.
a) Test NLL estimator computed by sampling with no added parameter noise from RBMs trained with various parameter noise levels. For each noise level, 5 models were trained using the same hyperparameters but different seeds. b) Test NLL estimator computed by sampling from RBMs trained with various magnitude constraints. For each magnitude level, 5 models were trained using the same hyperparameters but different seeds.
Random samples after 100,000 Gibbs steps for an RBM trained without noise (top row) and for an RBM trained with Gaussian noise of standard deviation applied to weights and biases (bottom row), for different levels of parameter noise.
5 Methodological notes
datasets. For all experiments involving training on the simulated physical computer, we used the simulator to draw samples for the negative phase of PCD, but used exact mean field for the positive phase. Training examples were binarized every time they were presented by sampling from a Bernoulli distribution, such that the grayscale value in [0, 1] in the original image gives the probability of that pixel being a 1 in the binary image. Unless explicitly stated, all models were trained using the same hyperparameters.
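The stochastic binarization described above can be sketched in a few lines. The function name `binarize` and the toy grayscale values are assumptions for illustration; the averaging at the end only checks that the Bernoulli means match the grayscale intensities.

```python
import numpy as np

def binarize(gray, rng):
    """Sample a binary image; gray values in [0, 1] are Bernoulli means."""
    return (rng.random(gray.shape) < gray).astype(np.float64)

rng = np.random.default_rng(0)
gray = np.array([[0.0, 0.25, 1.0]])     # toy "image" with three pixels
# A fresh binary image is drawn every time the example is presented.
draws = np.stack([binarize(gray, rng) for _ in range(2000)])
mean_image = draws.mean(axis=0)         # should be close to the gray values
```

Because a new binary image is drawn at each presentation, the model sees many binary versions of each training example over the course of training.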
Negative log-likelihood (NLL) of all models is approximated using annealed importance sampling (AIS) [Salakhutdinov and Murray2008]. When noise is added to parameters, the expected AIS is computed by Monte Carlo, with test examples binarized by following the same method as with training examples.
Although the constraints we apply are dictated by the D-Wave Two system, we are simulating a low-precision RBM, which means that constraints are enforced on parameters directly, without converting them to the Ising parametrization first.
All images of samples display the expected value of the visible units given binary samples of the hidden units.
6 Simulating noisy parameters
Consider the case where we have a trained RBM (trained by any successful means; in these experiments we obtained ours by traditional training on a digital computer), and we would like to draw samples from it using physical computation. In this case, we know that the model parameters represent the desired distribution well. However, when loaded into the physical computer, the parameters may not be preserved exactly. We simulate this by adding Gaussian noise to the parameters.
See Fig. 2 for a summary of the experimental results in this case. We find that noise on biases has a negligible effect on NLL compared to noise on weights. This could be explained by the fact that there are simply more weight parameters than bias parameters contributing to the energy function. In that case, the variance of the energy function would be dominated by the variance on the weights. From these tests, we can observe two things: 1) adding noise to the model parameters quickly degrades its performance, and 2) noise on the biases is less harmful than noise on the weights.
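The counting argument above can be made explicit with a short calculation. It assumes the conventional RBM energy with weights $W_{ij}$ and biases $b_i$, $c_j$, and independent zero-mean Gaussian noise of common variance $\sigma^2$ on every parameter; both are assumptions made for this illustration rather than statements about the hardware.

```latex
% Perturbation of the energy of a fixed binary state (v, h) when noise
% \epsilon is added to every parameter:
\Delta E(\mathbf{v}, \mathbf{h})
  = -\sum_{i,j} \epsilon^{W}_{ij} v_i h_j
    - \sum_i \epsilon^{b}_i v_i
    - \sum_j \epsilon^{c}_j h_j .

% With independent noise and v_i, h_j \in \{0, 1\} (so v_i^2 = v_i):
\operatorname{Var}\!\left[\Delta E(\mathbf{v}, \mathbf{h})\right]
  = \sigma^2 \Big(
      \underbrace{\textstyle\sum_i v_i \sum_j h_j}_{\text{weights}}
      + \underbrace{\textstyle\sum_i v_i + \sum_j h_j}_{\text{biases}}
    \Big).
```

For a state with, say, 100 active visible units and 100 active hidden units, the weight term contributes $10{,}000\,\sigma^2$ and the bias term only $200\,\sigma^2$, which is consistent with the observation that bias noise matters much less.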
Of course, these parameters were trained to work well in the absence of noise. It is possible to learn different parameters, that are chosen to diminish the effect of noise. In order to do this, we trained an RBM using the simulated physical computer to draw the negative phase samples during training. The negative phase repels the model parameters from regions that produced poor samples. Using noise on the parameters while generating the negative phase samples increases the range of the repulsion–not only must the parameters not generate bad samples, noisy versions of the parameters must not do so either.
We compared how RBM performance evolves as we increase parameter noise during sampling with that of the RBM trained without noisy parameters (Fig. 2). We used the same noise distribution for both weights and biases.
We find that training with noisy parameters helps reduce the degrading effect of sampling with noisy parameters. For instance, by training with on parameters and sampling with the same , we were able to reduce the increase in the NLL estimator by 21.9% on average when compared to training without noise. Furthermore, the benefits of training with noisy parameters before sampling with noisy parameters extend to noise levels greater than those used during training.
The effect of training with noisy parameters is also qualitatively visible when looking at samples from the model (Fig. 4). We observe that adding noise to parameters during sampling increases visual noise in samples, and also makes samples collapse to major modes. By training with noisy parameters, we are able to soften these effects, even when sampling with parameter noise greater than training noise.
As for how much parameter noise an RBM can support during training, we trained RBMs using various noise levels on weights and biases and computed their test NLL estimator when sampling with no added noise (Fig. 3). A noise level of is the biggest noise we could add before the RBM’s performance noticeably started to degrade. This means that noisy parameters negatively affect learning for all but the smallest noise values.
7 Simulating limited parameter range
We now turn our attention to the parameter range constraint. We trained RBMs by forcing their parameter magnitude to stay below a certain threshold value and observed the effect of that value on test NLL (Fig. 3). Whenever parameter updates would bring a parameter outside of that range, it was clipped to the threshold value.
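The clipping rule can be sketched as a projection applied after each parameter update. The function name `clipped_update`, the learning rate and the toy values are assumptions for illustration; the threshold of 1.0 is used here only as an example value.

```python
import numpy as np

def clipped_update(param, grad, lr, limit):
    """Gradient-ascent step followed by projection onto [-limit, limit]."""
    return np.clip(param + lr * grad, -limit, limit)

W = np.array([[0.95, -0.2], [-0.99, 0.5]])
grad = np.array([[1.0, 1.0], [-1.0, 1.0]])
# The unclipped update would give [[1.05, -0.1], [-1.09, 0.6]].
W_new = clipped_update(W, grad, lr=0.1, limit=1.0)
```

Entries that would leave the allowed range are set to the nearest boundary, while all other entries receive the ordinary update.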
We find that a magnitude constraint higher than or equal to 1.0 has little to no effect on performance, but that forcing parameter magnitudes to be smaller than that quickly degrades performance.
8 Combined simulation of noise and limited parameter range
We combined noise and magnitude constraints to see how they interact. We explored constraint space around reasonable noise and magnitude values and looked at how they affect NLL (Fig. 7). The two constraints appear to work well together. In fact, a model with higher noise and small parameter values performs nearly as well as a standard RBM. We think that the constraint on parameter values may actually be helpful, because it forces the RBM to find good weight vector directions that generalize well, rather than just scaling up its weights to overpower the noise. As always, one should be careful about generalizing these conclusions to values outside the ranges evaluated in these experiments.
9 Simulating limited connectivity
We trained RBMs by forcing a random subset of weights to be zero and observed how it affected test NLL (Fig. 7). It turns out the RBM can cope with a reasonable number of removed connections: even when half the weights are forced to be zero, test NLL only increases by about 4.3%. However, physical implementations will likely have sparse connectivity; for instance, the connectivity pattern of a D-Wave machine (Fig. 1) applied to an RBM with 784 visible units and 784 hidden units is such that over 99% of its connections are removed. In the aforementioned experiment, removing 99% of the connections results in a disappointing test NLL. When looking at samples (Fig. 8), we observe that the RBM’s representative power decreases as we force more weights to be zero, until samples no longer resemble digits.
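Forcing a random subset of weights to zero can be implemented with a fixed binary mask, as in the sketch below. The function name `random_mask`, the initial weight scale and the choice to remove an exact number of connections are assumptions made for illustration, not details taken from the experiments.

```python
import numpy as np

def random_mask(shape, frac_removed, rng):
    """Binary mask with a fixed number of randomly chosen zeros."""
    n = shape[0] * shape[1]
    n_removed = int(round(frac_removed * n))
    flat = np.ones(n)
    flat[rng.choice(n, size=n_removed, replace=False)] = 0.0
    return flat.reshape(shape)

rng = np.random.default_rng(0)
mask = random_mask((784, 784), 0.5, rng)   # remove half of the connections
W = 0.01 * rng.standard_normal((784, 784)) * mask

# During training the same mask would multiply every weight update, e.g.
#   W += lr * (grad_W * mask)
# so that removed connections stay exactly at zero.
```

A structured pattern such as the chimera graph would use the same mechanism, with the mask built from the hardware's adjacency structure instead of a random draw.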
Fortunately, physical implementations of an RBM will most likely have some kind of structure to their connectivity pattern, so the results we get by forcing a random subset of the weights to be zero are somewhat pessimistic.
When we train an RBM with 784 visible units and 784 hidden units with a chimera connectivity pattern, results are much better. There are many ways to map pixels of an image to visible units of the model; we tried two that seemed the most logical (Fig. 1). The pixel blocks mapping led to a test NLL of 138.2, while the extended pixel blocks mapping led to a test NLL of 160.9. When we look at samples from both RBMs (Fig. 9), we see that digit structure is much better preserved than when we randomly force the same proportion of weights to be zero, although samples still barely look like digits. In all cases, the limited architecture seems to be the most damaging constraint studied in this paper.
In this paper, we have performed a series of simulation experiments to determine the feasibility of implementing an RBM using physical computation. We have evaluated the impact of three barriers to the success of physical computation: noise on the model parameters, limited range on the model parameters, and limited topology of the model.
We have found that noise on the parameters noticeably degrades performance, though this can be mitigated by training using the same sampler in the negative phase as will be used to draw samples at test time. We have found that the limits on the range of the parameters do not significantly impair the performance of the RBM. Finally, and most importantly, we have found that restrictions on the topology of the model can impair the model’s performance more than any of the other limitations we consider. While structured sparsity like in the D-Wave Two system’s chimera topology does perform well for the number of connections it has, the overall number of connections is still low enough to cause many difficulties.
Note however that experiments on noisy weights were performed on fully-connected RBMs. If, as suggested when discussing Fig. 2, the effect of noisy parameters is dominated by noise on weights because there are more weights than biases, then a constrained architecture might mitigate the effect of noisy weights simply by reducing their number. This needs to be verified in future experiments.
This suggests that quantum hardware designers should concentrate their efforts on reducing noise on weights and on increasing the number of connections between elements in the quantum computers, and quantum machine learning researchers should focus their efforts on designing approaches that can cope with noisy weights and restricted topology.
- [Bengio2012] Bengio, Y. 2012. Deep learning of representations for unsupervised and transfer learning. In JMLR W&CP: Proc. Unsupervised and Transfer Learning challenge and workshop, volume 27, 17–36.
- [Cho, Raiko, and Ilin2010] Cho, K.; Raiko, T.; and Ilin, A. 2010. Parallel tempering is efficient for learning restricted Boltzmann machines. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2010).
- [Coates and Ng2011] Coates, A., and Ng, A. Y. 2011. The importance of encoding versus training with sparse coding and vector quantization. In ICML’2011.
- [Denil and de Freitas2011] Denil, M., and de Freitas, N. 2011. Toward the implementation of a quantum rbm. NIPS*2011 Workshop on Deep Learning and Unsupervised Feature Learning.
- [Desjardins, Courville, and Bengio2010] Desjardins, G.; Courville, A.; and Bengio, Y. 2010. Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, 145–152.
- [Dupret et al.1996] Dupret, A.; Belhaire, E.; Rodier, J.-C.; Lalanne, P.; Prevost, D.; Garda, P.; and Chavel, P. 1996. An optoelectronic CMOS circuit implementing a simulated annealing algorithm. IEEE Journal of Solid-State Circuits 31(7):1046–1050.
- [Goodfellow et al.2013] Goodfellow, I. J.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems 26 (NIPS’13). Nips Foundation (http://books.nips.cc).
- [Hinton et al.2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
- [Hinton, Osindero, and Teh2006] Hinton, G. E.; Osindero, S.; and Teh, Y. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18:1527–1554.
- [Ising1925] Ising, E. 1925. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift fur Physik 31:253–258.
- [Larochelle, Bengio, and Turian2010] Larochelle, H.; Bengio, Y.; and Turian, J. 2010. Tractable multivariate binary density estimation and the restricted Boltzmann forest. Neural Computation 22(9):2285–2307.
- [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- [Long and Servedio2010] Long, P. M., and Servedio, R. A. 2010. Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning (ICML’10).
- [Salakhutdinov and Murray2008] Salakhutdinov, R., and Murray, I. 2008. On the quantitative analysis of deep belief networks. In Cohen, W. W.; McCallum, A.; and Roweis, S. T., eds., ICML 2008, volume 25, 872–879. ACM.
- [Salakhutdinov2010a] Salakhutdinov, R. 2010a. Learning deep Boltzmann machines using adaptive MCMC. In Bottou, L., and Littman, M., eds., Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), volume 1, 943–950. ACM.
- [Salakhutdinov2010b] Salakhutdinov, R. 2010b. Learning in Markov random fields using tempered transitions. In NIPS’2010.
- [Smolensky1986] Smolensky, P. 1986. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E., and McClelland, J. L., eds., Parallel Distributed Processing, volume 1. Cambridge: MIT Press. chapter 6, 194–281.
- [Tieleman and Hinton2009] Tieleman, T., and Hinton, G. 2009. Using fast weights to improve persistent contrastive divergence. In ICML’2009.
- [Tieleman2008] Tieleman, T. 2008. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Cohen, W. W.; McCallum, A.; and Roweis, S. T., eds., ICML 2008, 1064–1071. ACM.
- [Younes1998] Younes, L. 1998. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models, 177–228.