Quantum mechanics is widely believed to produce distributions of data that are difficult to replicate classically Boixo et al. (2018). As deep learning is implemented by applying a series of transformations to latent probability distributions in order to approximate empirical distributions given by data, this has led to a recent trend of applying quantum computing techniques to machine learning Biamonte et al. (2017); Wiebe et al. (2016); Farhi and Neven (2018); Pudenz and Lidar (2013); Neven et al. (2012); Lloyd et al. (2013); Denchev et al. (2012); Killoran et al. (2018); Rebentrost et al. (2018). One approach to this task consists of applying well-known quantum algorithms, such as the HHL algorithm Harrow et al. (2009), to replace their classical counterparts, such as the BLAS library Biamonte et al. (2017). This method has been applied to principal component analysis Lloyd et al. (2014), support vector machines Rebentrost et al. (2014), and most recently, generative adversarial networks Dallaire-Demers and Killoran (2018); Lloyd and Weedbrook (2018). These architectures, however, require large quantum networks and extensive use of quantum memory, which are not yet available on the current generation of quantum devices Aaronson (2015).
Instead, one could focus on translating such classical machine learning algorithms for use on current and near-term quantum devices, in the so-called Noisy Intermediate-Scale Quantum (NISQ) era Preskill (2018). This has been made possible by recent developments in neutral atom and ion trap architectures Endres et al. (2016); Zhang et al. (2017), as well as gate-model processors Neill et al. (2018); Kandala et al. (2017). Inspired by variational hybrid quantum-classical algorithms Farhi et al. (2014); McClean et al. (2016); Peruzzo et al. (2014), there has been an effort to create machine learning models that use small quantum devices to aid in the training of classical machine learning architectures. Recently studied hybrid architectures include generalizations of Helmholtz machines Benedetti et al. (2018); Perdomo-Ortiz et al. (2018) and variational autoencoders Khoshaman et al. (2018). In these models, the small quantum device encodes a quantum Boltzmann machine (QBM) Amin et al. (2018), which is a quantum generalization of classical restricted Boltzmann machines (RBMs) Smolensky (1986). In this paper, we focus on using small QBMs to improve the performance of generative adversarial networks (GANs).
GANs are state-of-the-art classical machine learning architectures for unsupervised learning Arjovsky et al. (2017); Goodfellow et al. (2016); Goodfellow (2016); Gulrajani et al. (2017); Kodali et al. (2018); Metz et al. (2016); Goodfellow et al. (2014). They have a wide range of applications, including regression and classification Radford et al. (2015); Salimans et al. (2016) and image generation Denton et al. (2015); Reed et al. (2016); Karras et al. (2018); Isola et al. (2017); Li and Wand (2016). GANs can be thought of as a zero-sum game between two players, the generator and the discriminator, each implemented as a neural network. Taking the concrete setting of image generation, the generator learns to create images resembling a given data set of authentic images, and the discriminator learns to distinguish between images produced by the generator and images drawn from the true data set Goodfellow et al. (2014). The generator does not have direct access to the data set, and only learns how to create images through feedback from the discriminator, that is, through an error signal backpropagated through the GAN.
Although GANs are the most ubiquitous adversarial models, they are notoriously difficult to train Salimans et al. (2016). One of the major reasons is that the generator and discriminator generically learn at different rates. One approach to combating this issue is the use of associative adversarial networks (AANs) Arici and Celikyilmaz (2016). In this architecture, a Boltzmann machine acts as an associative memory that learns the high-level feature distribution of a layer of the discriminator. The generator then draws samples from the distribution approximated by the Boltzmann machine. The associative memory also adds expressivity to the network by providing the generator with inputs drawn from a more meaningful, data-specific distribution, rather than a uniform or Gaussian distribution as is the case for standard GANs. Motivated by the observed improvement in performance of QBMs over RBMs Amin et al. (2018); Kieferová and Wiebe (2017), we propose a method of implementing hybrid quantum-classical AANs (QAANs), where the associative memory is instead provided by a QBM.
The structure of the paper is as follows. In Sec. II, we give an introduction to classical adversarial networks and, in particular, AANs. In Sec. III, we construct a quantum analog of AANs, and give a numerical comparison between classical and quantum AANs in Sec. IV. Finally, we discuss our results and future research directions in Sec. V.
II Classical Generative Models
II.1 Restricted Boltzmann Machines
A Boltzmann machine is an energy-based generative model, and one of the first neural networks capable of learning internal representations for and sampling from arbitrary probability distributions Hinton and Sejnowski (1986). Recent work has seen successful applications of this model to a wide variety of machine learning tasks, including image Hinton et al. (2006), text Salakhutdinov and Hinton (2009), and speech Mohamed et al. (2012) generation. It also serves as a key component in other machine learning architectures, such as deep belief networks Hinton et al. (2006). A Boltzmann machine is characterized by an energy function
$$E(\mathbf{z}; \theta) = -\sum_{i} b_i z_i - \sum_{i<j} w_{ij} z_i z_j, \tag{1}$$
where $\mathbf{z}$ is a binary vector and $\theta = \{\mathbf{b}, \mathbf{w}\}$ are the model parameters. The model can be viewed as a two-layer network by dividing its nodes into visible units, which represent the input data, and hidden units, which form an internal representation of the data. In practice, since training a general Boltzmann machine is impractical, we consider restricted Boltzmann machines (RBMs), which further simplify the model to only contain connections between visible and hidden units Smolensky (1986) (see Fig. 1). By labeling the indices of visible nodes by $\nu$ and the indices of hidden nodes by $\eta$, we can separate our vector as $\mathbf{z} = (\mathbf{v}, \mathbf{h})$, and rewrite the energy function in the form:
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{\nu} b_\nu v_\nu - \sum_{\eta} b_\eta h_\eta - \sum_{\nu, \eta} w_{\nu\eta} v_\nu h_\eta. \tag{2}$$
For an input vector $\mathbf{v}$, this network assigns the probability
$$p(\mathbf{v}; \theta) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)}, \tag{3}$$
where $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)}$ is the partition function. The model parameters are then chosen such that samples drawn from the marginal probability distribution $p(\mathbf{v}; \theta)$ approximate samples drawn from the empirical probability distribution of the data $p_{\text{data}}(\mathbf{v})$. This is achieved by minimizing the Kullback–Leibler (KL) divergence Kullback and Leibler (1951) between $p_{\text{data}}$ and $p$, or equivalently, by minimizing the negative log-likelihood
$$\mathcal{L}(\theta) = -\sum_{\mathbf{v}} p_{\text{data}}(\mathbf{v}) \log p(\mathbf{v}; \theta). \tag{4}$$
The minimization of $\mathcal{L}$ is usually performed using a gradient-based optimizer, where the gradient of the loss function is:
$$\partial_\theta \mathcal{L} = \sum_{\mathbf{v}} p_{\text{data}}(\mathbf{v}) \left\langle \partial_\theta E \right\rangle_{\mathbf{v}} - \left\langle \partial_\theta E \right\rangle. \tag{5}$$
Here, $\langle \cdot \rangle$ denotes the average with respect to a Boltzmann distribution with the energy given by Eq. (2), and $\langle \cdot \rangle_{\mathbf{v}}$ is the same but with the visible nodes clamped to $\mathbf{v}$. The first term is known as the positive phase, and the second term as the negative phase. Exact maximum likelihood learning of the negative phase requires knowledge of the partition function, which is intractable. Therefore, in practice, this phase is approximated using Gibbs sampling Geman and Geman (1984). We use Persistent Contrastive Divergence (PCD) Tieleman (2008a) to train our RBM and simulated annealing to sample from it; both are described in Appendix A.
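As an illustration of the positive and negative phases described above, the following is a minimal NumPy sketch of one PCD update. It is not the implementation used in this paper: it assumes units taking values in {-1, +1}, an energy of the form $E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}\cdot\mathbf{v} - \mathbf{c}\cdot\mathbf{h} - \mathbf{v}^T W \mathbf{h}$, and block Gibbs sampling for the negative phase (rather than simulated annealing). The names `energy`, `sample_pm1`, `pcd_step`, `lr`, and `k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(v, h, a, c, W):
    """Batch RBM energy E(v, h) = -a.v - c.h - v^T W h (assumed +/-1 units)."""
    return -(v @ a) - (h @ c) - np.einsum("bi,ij,bj->b", v, W, h)

def sample_pm1(field):
    """Sample +/-1 units from their local fields: P(+1) = sigmoid(2 * field)."""
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
    return np.where(rng.random(field.shape) < p_plus, 1.0, -1.0)

def pcd_step(v_data, chain_v, a, c, W, lr=0.05, k=1):
    """One persistent contrastive divergence update (illustrative only)."""
    # Positive phase: hidden expectations with the visibles clamped to data.
    h_data = np.tanh(v_data @ W + c)
    # Negative phase: advance the persistent chains by k block Gibbs sweeps.
    for _ in range(k):
        chain_h = sample_pm1(chain_v @ W + c)
        chain_v = sample_pm1(chain_h @ W.T + a)
    h_model = np.tanh(chain_v @ W + c)
    n, m = len(v_data), len(chain_v)
    # Descend the negative log-likelihood: clamped statistics minus model statistics.
    W = W + lr * (v_data.T @ h_data / n - chain_v.T @ h_model / m)
    a = a + lr * (v_data.mean(axis=0) - chain_v.mean(axis=0))
    c = c + lr * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return chain_v, a, c, W

# Toy usage with two visible patterns and two hidden units.
v_data = np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]])
a, c, W = np.zeros(3), np.zeros(2), np.zeros((3, 2))
chain_v = v_data.copy()
for _ in range(10):
    chain_v, a, c, W = pcd_step(v_data, chain_v, a, c, W)
```

The persistent chains are carried from one update to the next instead of being restarted at the data, which is the feature that distinguishes PCD from ordinary contrastive divergence.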
II.2 Generative Adversarial Networks
Generative adversarial networks (GANs) are structured probabilistic models over the space of observed variables $\mathbf{x}$ and latent variables $\mathbf{z}$ Goodfellow et al. (2014); Goodfellow (2016). They implicitly model high-dimensional distributions of data and can be used to efficiently generate samples from these distributions. GANs are generally characterized by a pair of neural networks competing against each other in a zero-sum game. The generator network $G$ with parameters $\theta_G$ learns to map vectors $\mathbf{z}$ from the latent space to samples $G(\mathbf{z}; \theta_G)$ drawn from a probability distribution that is close to $p_{\text{data}}$. The discriminator network $D$ with parameters $\theta_D$ receives samples $\mathbf{x}$ from both the generator and the true distribution of data, and emits a probability $D(\mathbf{x}; \theta_D)$ that the input is real. The generator wishes to minimize its loss function $L_G$ by attempting to fool the discriminator into believing that its generated samples are real. The discriminator, on the other hand, tries to minimize its loss function $L_D$ by correctly classifying the inputs as either fake or real Goodfellow et al. (2016). We can cast these statements into the minimax optimization problem Goodfellow (2016):
$$\min_{\theta_G} \max_{\theta_D} V(\theta_G, \theta_D), \qquad V(\theta_G, \theta_D) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}\left[\log D(\mathbf{x}; \theta_D)\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\left[\log\left(1 - D(G(\mathbf{z}; \theta_G); \theta_D)\right)\right], \tag{6}$$
where the goal is to find the optimal generator parameters
$$\theta_G^{*} = \arg\min_{\theta_G} \max_{\theta_D} V(\theta_G, \theta_D). \tag{7}$$
Here, $\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}$ denotes the expectation value when $\mathbf{x}$ is drawn from real data, and $\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}$ denotes the expectation value when $\mathbf{z}$ is drawn from the distribution of latent variables. For GANs, the latent variables are normally chosen to be Gaussian or uniform noise Goodfellow et al. (2014). The first term in Eq. (6) favors the discriminator outputting $1$ on real data, while the second term favors the discriminator outputting $0$ on generated data. The generator strives to achieve the opposite. The solution to this optimization problem is a point of Nash equilibrium where the generator samples are indistinguishable from the real data and the discriminator predicts $1/2$ on all inputs.
Unfortunately, finding the Nash equilibrium is quite difficult in practice Salimans et al. (2016). In recent years, there have been many proposals aiming to improve the training of GANs. Some of them include considering a non-saturating version of the generator loss function Goodfellow (2016), introducing surrogate objective functions Metz et al. (2016), regularizing the discriminator by adding gradient penalty terms Gulrajani et al. (2017); Kodali et al. (2018), and using a formulation of the training objective based on the Wasserstein-1 distance Arjovsky et al. (2017); Gulrajani et al. (2017). Most of these architectures are extensions of deep convolutional GANs (DCGANs) Radford et al. (2015), which use convolutional neural networks (CNNs) for their generators and discriminators, instead of the fully-connected dense layers proposed initially Goodfellow et al. (2014). Our implementation of a DCGAN is described in Fig. 3. In this paper we focus on the AAN architecture Arici and Celikyilmaz (2016), which is an extension of the DCGAN architecture, and the improvements that it brings to training GANs, as described in Section II.3.
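To make the minimax objective concrete, the sketch below estimates the standard GAN value function $V = \mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1 - D(G(\mathbf{z})))]$ from discriminator outputs, together with the commonly used non-saturating generator loss $-\mathbb{E}[\log D(G(\mathbf{z}))]$. It is only an illustration: the function names and the clipping constant `eps` are hypothetical, and the discriminator outputs are supplied directly as arrays rather than computed by a network.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V = E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_loss_nonsaturating(d_fake, eps=1e-12):
    """Non-saturating generator loss -E[log D(G(z))]."""
    return float(-np.mean(np.log(np.clip(d_fake, eps, 1.0))))

# At the Nash equilibrium the discriminator outputs 1/2 on every input,
# so the value function equals log(1/2) + log(1/2) = -log 4.
half = np.full(8, 0.5)
v_equilibrium = gan_value(half, half)
```

The non-saturating loss keeps the generator gradient from vanishing when the discriminator confidently rejects generated samples, which is the situation early in training.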
II.3 Associative Adversarial Networks
An associative memory provided by an RBM can circumvent the imbalance in training rates typically present in a GAN and enhance the expressivity of the model; such architectures are called associative adversarial networks (AANs) Arici and Celikyilmaz (2016). In an AAN, the latent space for the generator is treated as a feature space that is learned by the RBM. The RBM is trained simultaneously with the GAN on an intermediate layer of the discriminator, with a sigmoid activation function interpreted as the probability of a particular neuron firing. In general, in an AAN, the discriminator $D$ is decomposed into a mapping $f$ into the feature space and a classifier $c$ such that:
$$D(\mathbf{x}) = c\left(f(\mathbf{x})\right). \tag{8}$$
We find that in practice, however, using a trivial classifier as in Fig. (b)b suffices to improve the performance of the generator.
Though the initial motivation behind AANs was to bring the generator and discriminator learning rates in line with each other—as often the instability in GAN training is due to the discriminator learning more quickly than the generator—our DCGAN implementation for the data sets under consideration experiences no notable difference in the learning rates of the generator and discriminator, and thus our main advantage when using an AAN is due to the generator expressivity gained when it draws samples from an improved feature space rather than from noise. Our AAN architecture consists of the DCGAN architecture described in Fig. 3, coupled to an RBM as in Fig. 4.
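The data flow of an AAN can be sketched in a few lines. The toy example below is not the convolutional architecture of Fig. 3: it assumes a single dense layer with sigmoid activation as the feature map, a logistic classifier on top of it, and hypothetical names (`feature_map`, `classifier`, `binarize`, `Wf`, `wc`). It only illustrates how the discriminator factors into a feature map followed by a classifier, and how the sigmoid features are read as firing probabilities to produce binary training vectors for the Boltzmann machine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_map(x, Wf):
    """Map inputs to the feature space; outputs lie in (0, 1)."""
    return sigmoid(x @ Wf)

def classifier(features, wc):
    """Map features to the probability that the input is real."""
    return sigmoid(features @ wc)

def discriminator(x, Wf, wc):
    """Discriminator written as the composition classifier(feature_map(x))."""
    return classifier(feature_map(x, Wf), wc)

def binarize(features):
    """Sample binary vectors, reading each feature as a firing probability."""
    return (rng.random(features.shape) < features).astype(float)

# Toy usage: 4 inputs of dimension 6 mapped to a 3-dimensional feature space.
x = rng.normal(size=(4, 6))
Wf = rng.normal(size=(6, 3))
wc = rng.normal(size=3)
d_out = discriminator(x, Wf, wc)
rbm_batch = binarize(feature_map(x, Wf))  # training data for the associative memory
```

In a full AAN, samples drawn from the trained Boltzmann machine would then replace the Gaussian or uniform noise fed to the generator.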
III Quantum Generative Models
III.1 Quantum Boltzmann Machines
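Quantum Boltzmann machines (QBMs) are a recently introduced method of quantizing Boltzmann machines that have been numerically observed to give a quantum speedup in both the rate of training and in the accuracy of the approximating distributions Amin et al. (2018); Kieferová and Wiebe (2017). As initially proposed, instead of considering the classical energy function of Eq. (1), one considers the Hamiltonian:
$$H = -\sum_{i} \Gamma_i \sigma_i^x - \sum_{i} b_i \sigma_i^z - \sum_{i<j} w_{ij} \sigma_i^z \sigma_j^z \tag{9}$$
and the thermal density matrix:
$$\rho = \frac{e^{-H}}{\operatorname{Tr}\left(e^{-H}\right)}, \tag{10}$$
where now $\theta = \{\Gamma, \mathbf{b}, \mathbf{w}\}$ are the model parameters. Defining the projector onto the subspace with visible nodes equal to $\mathbf{v}$ as $\Lambda_{\mathbf{v}}$, our goal is to now train the parameters $\Gamma$, $\mathbf{b}$, and $\mathbf{w}$ such that the probability distribution of samples from
$$p(\mathbf{v}; \theta) = \operatorname{Tr}\left(\Lambda_{\mathbf{v}} \rho\right) \tag{11}$$
approximates the empirical probability distribution $p_{\text{data}}(\mathbf{v})$. Due to the difficulties in computing the log-likelihood of this distribution Amin et al. (2018), one instead usually trains QBMs to minimize the upper bound on the loss function
$$\tilde{\mathcal{L}}(\theta) = -\sum_{\mathbf{v}} p_{\text{data}}(\mathbf{v}) \log \frac{\operatorname{Tr}\left(e^{-H_{\mathbf{v}}}\right)}{\operatorname{Tr}\left(e^{-H}\right)}, \tag{12}$$
where $H_{\mathbf{v}} = \langle \mathbf{v} | H | \mathbf{v} \rangle$ is the clamped Hamiltonian. (The recently introduced relative entropy training Kieferová and Wiebe (2017); Wiebe and Wossnig (2019) is approximation free, but we do not study its behavior in this work.) The gradients of this loss function are now given by:
$$\partial_\theta \tilde{\mathcal{L}} = \sum_{\mathbf{v}} p_{\text{data}}(\mathbf{v}) \left\langle \partial_\theta H_{\mathbf{v}} \right\rangle_{\mathbf{v}} - \left\langle \partial_\theta H \right\rangle,$$
where here $\langle \cdot \rangle$ denotes expectation values taken with respect to $\rho$ and $\langle \cdot \rangle_{\mathbf{v}}$ denotes expectation values taken with respect to the thermal density matrix with clamped Hamiltonian $H_{\mathbf{v}}$. Training on this upper bound leads to $\Gamma_i \to 0$, so in training $\Gamma_i$ is fixed to some constant and treated as a learning hyperparameter Amin et al. (2018).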
In order to numerically simulate large QBMs on a classical computer, we consider only the stoquastic Hamiltonian given by Eq. (9), and not more general QMA-hard Hamiltonians which have been studied in the context of QBMs Kieferová and Wiebe (2017); Wiebe et al. (2019). The details of our Monte Carlo-based simulation method are given in Appendix B.
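For a handful of qubits, the thermal state and the visible marginal of a transverse-field Ising QBM can be computed by exact diagonalization, which may help fix ideas before turning to Monte Carlo methods. The sketch below is not the simulation method of Appendix B. It assumes the sign convention $H = -\sum_i \Gamma_i \sigma_i^x - \sum_i b_i \sigma_i^z - \sum_{i<j} w_{ij} \sigma_i^z \sigma_j^z$, unit temperature, and that the visible qubits are the first `n_visible` sites; all function names are hypothetical.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def op(single, site, n):
    """Embed a single-qubit operator at `site` in an n-qubit register."""
    return reduce(np.kron, [single if k == site else I2 for k in range(n)])

def qbm_hamiltonian(gamma, b, w):
    """Transverse-field Ising Hamiltonian (assumed sign convention, see lead-in)."""
    n = len(b)
    H = np.zeros((2**n, 2**n))
    for i in range(n):
        H -= gamma[i] * op(X, i, n) + b[i] * op(Z, i, n)
        for j in range(i + 1, n):
            H -= w[i, j] * op(Z, i, n) @ op(Z, j, n)
    return H

def thermal_state(H):
    """rho = exp(-H) / Tr exp(-H), built from the eigendecomposition of H."""
    evals, evecs = np.linalg.eigh(H)
    weights = np.exp(-(evals - evals.min()))  # shift for numerical stability
    weights /= weights.sum()
    return (evecs * weights) @ evecs.T

def visible_marginal(rho, n_visible, n):
    """p(v) = Tr(Lambda_v rho): sum the diagonal of rho over hidden configurations."""
    diag = np.real(np.diag(rho)).reshape(2**n_visible, 2 ** (n - n_visible))
    return diag.sum(axis=1)

# Toy usage: one visible and one hidden qubit with a small transverse field.
gamma = np.array([0.5, 0.5])
b = np.array([0.3, 0.0])
w = np.array([[0.0, 0.8], [0.0, 0.0]])
rho = thermal_state(qbm_hamiltonian(gamma, b, w))
p_v = visible_marginal(rho, n_visible=1, n=2)
```

The cost of this approach grows as $2^n$ in memory and worse in time, which is why exact diagonalization is only practical for very small systems.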
III.2 Quantum-Classical Associative Adversarial Networks
We now quantize AANs by transforming the associated RBM into a QBM, and call the resulting architecture a quantum-classical associative adversarial network (QAAN). Our implementation otherwise exactly follows that of our AAN (see Sec. II.3).
The hybrid quantum-classical nature of our QAAN architecture lends itself well to being implementable on NISQ devices. Though in general simulating quantum thermal states is inefficient even on quantum devices, there is evidence that the structure of QBMs allows for efficient heuristic training on NISQ devices Anschuetz and Cao (2019). Furthermore, as QBM training only necessitates measuring simple local observables in the QBM state, there is no need for a quantum memory for the training data. Finally, as the QBM is only trained on a feature space of much lower dimensionality than the data, many fewer qubits are required to implement the QBM than if it were to directly learn the data. This hybrid quantum-classical approach to machine learning, inspired by variational hybrid quantum-classical algorithms Farhi et al. (2014); McClean et al. (2016); Peruzzo et al. (2014), can serve as a model for similar future quantum machine learning architectures that minimize the need for large quantum devices.
We compare the performance of the classical and quantum-classical architectures on three data sets of increasing difficulty. First, we show that QBMs outperform RBMs on a simple synthetic data set of mixed Bernoulli distributions, thus suggesting that quantum models can provide an improvement in approximating certain distributions. Next, we compare the performance of DCGAN, AAN, and QAAN architectures on the MNIST data set LeCun (1998), which is a standard benchmark used in classical machine learning. Finally, we test the three architectures on the more challenging CIFAR-10 data set Krizhevsky and Hinton (2009), and show that our implementation of a QAAN architecture produces samples that more closely mirror the samples drawn from the data distributions.
In order to quantitatively evaluate the different architectures on real data sets, we use the Inception score Salimans et al. (2016) and the Fréchet Inception distance (FID) Heusel et al. (2017). The Inception score computes the KL divergence between the conditional class distribution and the marginal class distribution over generated samples as predicted by an Inception-v3 network Szegedy et al. (2016). A higher score indicates better generated images. This is the most widely used metric for evaluating GANs and allows for easy comparisons with previous works. However, the IS has some limitations in assessing the realism and intra-class diversity of the generated samples. The FID is a more comprehensive metric that has been shown to mitigate some of these shortcomings Heusel et al. (2017). It relies on computing the Wasserstein-2 distance between the generated data and the real data in the feature space of an Inception-v3 network. Lower FID values are better and suggest that the generated images are more similar to the original data set. A detailed description of both metrics is provided in Appendix D.
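Both metrics reduce to short formulas once the Inception-v3 outputs are available. The sketch below assumes those outputs are precomputed: `probs` holds the predicted class probabilities $p(y|\mathbf{x})$ for each generated sample, and `mu`, `cov` are the mean and covariance of Inception features for each data set. It uses the standard definitions, namely the exponentiated mean KL divergence for the Inception score and the closed-form Fréchet distance between two Gaussians, $\|\mu_1 - \mu_2\|^2 + \operatorname{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$, for the FID. The function names are hypothetical, and the precise conventions used in this paper are those of Appendix D.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)); probs has shape (n_samples, n_classes)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between two Gaussians."""
    diff = mu1 - mu2
    # Tr sqrt(cov1 cov2): the product of two PSD matrices has real,
    # non-negative eigenvalues, so the trace is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

# Toy usage: confident, evenly spread predictions over 10 classes give the
# maximal score, while uninformative predictions give the minimal score of 1.
score_best = inception_score(np.eye(10))
score_worst = inception_score(np.full((5, 10), 0.1))
fid_same = frechet_distance(np.zeros(3), np.eye(3), np.zeros(3), np.eye(3))
```

The toy cases show the two extremes of the Inception score: it is bounded above by the number of classes and below by 1.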
In all of our experiments, samples are randomly drawn from each model and used to evaluate its performance. Similar results are obtained for different initializations of the network parameters. The Inception score showed a higher variance over different data subsets and therefore we report the average over batches, each consisting of generated images. Since the FID relies on a particular layer of a pre-trained network and can only be unambiguously defined for colored images, we only use it in quantifying the performance of our models on CIFAR-10.
IV.1 Synthetic Data
We begin by comparing the learning capability of an RBM with that of a QBM by training both on samples from the mixed Bernoulli distribution:
$$p_{\text{data}}(\mathbf{v}) = \frac{1}{M} \sum_{k=1}^{M} p^{N - d_{\mathbf{v}}^{k}} (1 - p)^{d_{\mathbf{v}}^{k}},$$
with $M$ random modes $\mathbf{s}^{k}$, where $d_{\mathbf{v}}^{k}$ denotes the Hamming distance between $\mathbf{v}$ and $\mathbf{s}^{k}$, and $N$ is the number of visible units. We choose and for our numerical experiments. More details about our training procedure are provided in Appendix C. Samples are then drawn from each of the data, RBM, and QBM distributions. The resulting empirical probability distributions are given by Fig. 5. To quantitatively measure the distance between the original and reconstructed distributions, we compute the KL divergence $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p)$. We obtain a KL divergence of approximately for the RBM and for the QBM. We see that the QBM outperforms the RBM, even though both of them have the same number of parameters and are trained similarly.
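A mixed Bernoulli distribution of this kind, $p_{\text{data}}(\mathbf{v}) = \frac{1}{M} \sum_k p^{N - d_k(\mathbf{v})} (1 - p)^{d_k(\mathbf{v})}$ with $d_k$ the Hamming distance to the $k$-th mode, can be tabulated exactly for small $N$, and the KL divergence to any model distribution follows directly. The sketch below is illustrative only: the values of `N`, `M`, and `p` in the usage example are placeholders and are not the values used in our experiments, and the function names are hypothetical.

```python
import itertools
import numpy as np

def mixed_bernoulli(modes, p):
    """Tabulate P(v) = (1/M) sum_k p^(N - d_k(v)) (1 - p)^(d_k(v)) over all v."""
    modes = np.asarray(modes)
    M, N = modes.shape
    states = np.array(list(itertools.product([0, 1], repeat=N)))
    # Hamming distance from every state to every mode, shape (2^N, M).
    d = (states[:, None, :] != modes[None, :, :]).sum(axis=2)
    probs = (p ** (N - d) * (1.0 - p) ** d).mean(axis=1)
    return states, probs

def kl_divergence(p_ref, q, eps=1e-12):
    """D_KL(p_ref || q) over a shared discrete support."""
    mask = p_ref > 0
    return float(np.sum(p_ref[mask] * np.log(p_ref[mask] / (q[mask] + eps))))

# Toy usage with placeholder values: N = 4 bits, M = 2 modes, p = 0.9.
states, p_data = mixed_bernoulli([[0, 0, 0, 0], [1, 1, 1, 1]], p=0.9)
uniform = np.full(len(p_data), 1.0 / len(p_data))
kl_to_uniform = kl_divergence(p_data, uniform)
```

In practice the model distribution would be estimated from samples of the trained RBM or QBM rather than being the uniform distribution used here.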
IV.2 MNIST Data
We now compare our implementations of DCGAN, AAN, and QAAN. We train all three networks on the MNIST handwritten digit data set LeCun (1998), which consists of grayscale images, each of size $28 \times 28$ pixels. We rescale the pixel values to the interval before feeding them into our networks. The training parameters for this data set are summarized in Appendix C.
We monitor the performance of each network by computing the Inception score on generated images after every epoch, as shown in Fig. 6. We notice that the results converge within the considered epochs, with QAAN reaching a better Inception score than its classical counterpart by roughly . It is worth mentioning that all of the architectures considered achieve a score that is close to that of real data (). We also visually examine samples of generated handwritten digits, as shown in Fig. 7, and confirm that they are almost indistinguishable from the original data.
IV.3 CIFAR-10 Data
Finally, we study the performance of our models on the CIFAR-10 data set. This data set consists of colored natural scene images, each of size $32 \times 32$ pixels and $3$ color channels, divided across $10$ different classes Krizhevsky and Hinton (2009). The training procedure is identical to the case of MNIST and is described in Appendix C.
We once again compute the Inception score and FID on generated images after every epoch, and plot our results for all three models in Fig. 8. Notice that the Inception scores are overall lower than in the case of MNIST, even though both data sets have $10$ classes. This can be attributed to the increased complexity of the data set of colored images. The better performance of QAAN is more prominent in the FID metric, where it achieves a consistently better score in the last five training epochs when compared to DCGAN and AAN. The improvement in FID is approximately by the end of training. The metrics reported in Fig. 8 are on par with those obtained for a variety of classical GANs, such as WGANs, with similar architectures Fedus et al. (2017); Ostrovski et al. (2018). We also note that both AAN and QAAN have a steeper learning curve in the early epochs, which can be attributed to the associative memory providing a more meaningful latent space for the generator. Sample images from CIFAR-10 and our models are shown in Fig. 9. The generated images look very realistic.
In Sec. IV we showed that the QAAN architecture can learn the MNIST and CIFAR-10 data sets more effectively than the AAN and DCGAN architectures. Since the only difference between the QAAN and AAN architectures is the use of a QBM, rather than an RBM, we attribute to it the observed increase in performance. Nonetheless, QAAN’s learning advantage is not as substantial as QBM’s edge on simpler data sets, as supported by Amin et al. (2018); Kieferová and Wiebe (2017) and our results in Sec. IV.1. We suspect that this is due to the moderate size of the QBM used in our QAAN architecture. The bottleneck in our numerical experiments comes from simulating the QBM qubits on a classical computer through Monte Carlo sampling, which severely limits the accessible system sizes for the QAAN associative memory. Further improvements may be gained by improving our classical simulations and by considering QMA-hard Hamiltonians Kieferová and Wiebe (2017); Wiebe et al. (2019), which we leave to future work.
As QAANs are a quantum-assisted classical architecture, they lend themselves well to potential experimental implementations on NISQ devices. There are proposals for implementing QBMs on quantum annealing devices Amin (2015) and generic NISQ devices Anschuetz and Cao (2019), and the necessary Gibbs distributions have been produced on atomic lattice systems with similar Hamiltonians through Hamiltonian quenching Bernien et al. (2017); Kaufman et al. (2016). Furthermore, as the necessary number of visible units of the QBM only grows as the dimensionality of the latent space of the QAAN (which in general is much smaller than the size of the probability distribution being approximated), the necessary number of qubits also remains small. For instance, our simulations considered Boltzmann machines with visible units and hidden units (see Appendix C), whereas the dimensionality of the MNIST input data is $28 \times 28 = 784$ and the dimensionality of the CIFAR-10 input data is $32 \times 32 \times 3 = 3072$. These considerations suggest that a QAAN could be implemented in the very near future.
While preparing this manuscript, we became aware of a similar project that also considered quantum-assisted classical AANs Wilson et al. (2019). We believe that our work, although similar in network architecture and scope, is still very different when it comes to model training and testing. In particular, we use quantum Monte Carlo to train the QBM, while the authors in Wilson et al. (2019) had access to a quantum annealing platform with many more qubits than our simulations could achieve. Nonetheless, our quantum-classical implementation seems to yield better generative performance under the considered metrics. Furthermore, we test our models on a more complex data set of colored images.
In conclusion, we have introduced a new hybrid quantum-classical generative model capable of successfully learning distributions over complex data sets. We have shown numerically that our model slightly outperforms analogous classical generative architectures when trained under similar conditions. In addition, the model could potentially be experimentally tested on NISQ devices. Our work adds to the rapidly expanding family of quantum-enhanced machine learning algorithms.
Acknowledgements. We are grateful to Maxwell Nye for insightful discussion on classical machine learning. We would also like to thank Aram Harrow for his guidance on this project. ERA is partially supported by a Francis C. W. Greenlaw fellowship, the Henry W. Kendall Fellowship Fund, and a Lester Wolfe fellowship from MIT. CZ is partially supported by a Whiteman fellowship from MIT.
- Boixo et al. (2018) S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven, Nat. Phys. , 1 (2018).
- Biamonte et al. (2017) J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
- Wiebe et al. (2016) N. Wiebe, A. Kapoor, and K. M. Svore, Quantum Info. Comput. 16, 541 (2016).
- Farhi and Neven (2018) E. Farhi and H. Neven, “Classification with Quantum Neural Networks on Near Term Processors,” (2018), arXiv:1802.06002 [quant-ph] .
- Pudenz and Lidar (2013) K. L. Pudenz and D. A. Lidar, Quantum Inf. Process. 12, 2027 (2013).
- Neven et al. (2012) H. Neven, V. S. Denchev, G. Rose, and W. G. Macready, in Proceedings of the Asian Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 25, edited by S. C. H. Hoi and W. Buntine (PMLR, Singapore Management University, Singapore, 2012) pp. 333–348.
- Lloyd et al. (2013) S. Lloyd, M. Mohseni, and P. Rebentrost, “Quantum algorithms for supervised and unsupervised machine learning,” (2013), arXiv:1307.0411 [quant-ph] .
- Denchev et al. (2012) V. S. Denchev, N. Ding, S. V. N. Vishwanathan, and H. Neven, “Robust Classification with Adiabatic Quantum Optimization,” (2012), arXiv:1205.1148 [quant-ph] .
- Killoran et al. (2018) N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd, “Continuous-variable quantum neural networks,” (2018), arXiv:1806.06871 [quant-ph] .
- Rebentrost et al. (2018) P. Rebentrost, T. R. Bromley, C. Weedbrook, and S. Lloyd, Phys. Rev. A 98, 042308 (2018).
- Harrow et al. (2009) A. W. Harrow, A. Hassidim, and S. Lloyd, Phys. Rev. Lett. 103, 150502 (2009).
- Lloyd et al. (2014) S. Lloyd, M. Mohseni, and P. Rebentrost, Nat. Phys. 10, 631 (2014).
- Rebentrost et al. (2014) P. Rebentrost, M. Mohseni, and S. Lloyd, Phys. Rev. Lett. 113, 130503 (2014).
- Dallaire-Demers and Killoran (2018) P.-L. Dallaire-Demers and N. Killoran, Phys. Rev. A 98, 012324 (2018).
- Lloyd and Weedbrook (2018) S. Lloyd and C. Weedbrook, Phys. Rev. Lett. 121, 040502 (2018).
- Aaronson (2015) S. Aaronson, Nat. Phys. 11, 291 (2015).
- Preskill (2018) J. Preskill, Quantum 2, 79 (2018).
- Endres et al. (2016) M. Endres, H. Bernien, A. Keesling, H. Levine, E. R. Anschuetz, A. Krajenbrink, C. Senko, V. Vuletic, M. Greiner, and M. D. Lukin, Science 354, 1024 (2016).
- Zhang et al. (2017) J. Zhang, G. Pagano, P. W. Hess, A. Kyprianidis, P. Becker, H. Kaplan, A. V. Gorshkov, Z.-X. Gong, and C. Monroe, Nature 551, 601 (2017).
- Neill et al. (2018) C. Neill, P. Roushan, K. Kechedzhi, S. Boixo, S. V. Isakov, V. Smelyanskiy, A. Megrant, B. Chiaro, A. Dunsworth, K. Arya, R. Barends, B. Burkett, Y. Chen, Z. Chen, A. Fowler, B. Foxen, M. Giustina, R. Graff, E. Jeffrey, T. Huang, J. Kelly, P. Klimov, E. Lucero, J. Mutus, M. Neeley, C. Quintana, D. Sank, A. Vainsencher, J. Wenner, T. C. White, H. Neven, and J. M. Martinis, Science 360, 195 (2018).
- Kandala et al. (2017) A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Nature 549, 242 (2017).
- Farhi et al. (2014) E. Farhi, J. Goldstone, and S. Gutmann, “A Quantum Approximate Optimization Algorithm,” (2014), arXiv:1411.4028 [quant-ph] .
- McClean et al. (2016) J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New J. Phys. 18, 023023 (2016).
- Peruzzo et al. (2014) A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, Nat. Commun. 5, 4213 (2014).
- Benedetti et al. (2018) M. Benedetti, J. Realpe-Gómez, and A. Perdomo-Ortiz, Quantum Sci. Technol. 3, 034007 (2018).
- Perdomo-Ortiz et al. (2018) A. Perdomo-Ortiz, M. Benedetti, J. Realpe-Gómez, and R. Biswas, Quantum Sci. Technol. 3, 030502 (2018).
- Khoshaman et al. (2018) A. Khoshaman, W. Vinci, B. Denis, E. Andriyash, and M. H. Amin, Quantum Sci. Technol. 4, 014001 (2018).
- Amin et al. (2018) M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, Phys. Rev. X 8, 021050 (2018).
- Smolensky (1986) P. Smolensky (MIT Press, Cambridge, MA, USA, 1986) Chap. Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281.
- Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” (2017), arXiv:1701.07875 [stat.ML] .
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
- Goodfellow (2016) I. Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” (2016), arXiv:1701.00160 [cs.LG] .
- Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017) pp. 5767–5777.
- Kodali et al. (2018) N. Kodali, J. Hays, J. Abernethy, and Z. Kira, “On Convergence and Stability of GANs,” (2018).
- Metz et al. (2016) L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled Generative Adversarial Networks,” (2016), arXiv:1611.02163 [cs.LG] .
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 2672–2680.
- Radford et al. (2015) A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” (2015), arXiv:1511.06434 [cs.LG] .
- Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016) pp. 2234–2242.
- Denton et al. (2015) E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, in Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc., 2015) pp. 1486–1494.
- Reed et al. (2016) S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, in Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, edited by M. F. Balcan and K. Q. Weinberger (PMLR, New York, New York, USA, 2016) pp. 1060–1069.
- Karras et al. (2018) T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” (2018), arXiv:1812.04948 [cs.NE] .
- Isola et al. (2017) P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 10.1109/cvpr.2017.632.
- Li and Wand (2016) C. Li and M. Wand, in Computer Vision – ECCV 2016, edited by B. Leibe, J. Matas, N. Sebe, and M. Welling (Springer International Publishing, Cham, 2016) pp. 702–716.
- Arici and Celikyilmaz (2016) T. Arici and A. Celikyilmaz, “Associative Adversarial Networks,” (2016), arXiv:1611.06953 [cs.LG] .
- Kieferová and Wiebe (2017) M. Kieferová and N. Wiebe, Phys. Rev. A 96, 062327 (2017).
- Hinton and Sejnowski (1986) G. E. Hinton and T. J. Sejnowski (MIT Press, Cambridge, MA, USA, 1986) Chap. Learning and Relearning in Boltzmann Machines, pp. 282–317.
- Hinton et al. (2006) G. E. Hinton, S. Osindero, and Y.-W. Teh, Neural Comput. 18, 1527 (2006).
- Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. Hinton, Int. J. Approx. Reason. 50, 969 (2009).
- Mohamed et al. (2012) A.-r. Mohamed, G. E. Dahl, and G. Hinton, IEEE Transactions on Audio, Speech, and Language Processing 20, 14 (2012).
- Kullback and Leibler (1951) S. Kullback and R. A. Leibler, Ann. Math. Statist. 22, 79 (1951).
- Geman and Geman (1984) S. Geman and D. Geman, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 721 (1984).
- Tieleman (2008a) T. Tieleman, in Proceedings of the 25th International Conference on Machine Learning, ICML ’08 (ACM, New York, NY, USA, 2008) pp. 1064–1071.
- (53) The recently introduced relative entropy training Kieferová and Wiebe (2017); Wiebe and Wossnig (2019) is approximation free, but we do not study its behavior in this work.
- Wiebe et al. (2019) N. Wiebe, A. Bocharov, P. Smolensky, M. Troyer, and K. M. Svore, “Quantum Language Processing,” (2019), arXiv:1902.05162 [quant-ph] .
- Anschuetz and Cao (2019) E. R. Anschuetz and Y. Cao, “Realizing Quantum Boltzmann Machines Through Eigenstate Thermalization,” (2019), arXiv:1903.01359 [quant-ph] .
- LeCun (1998) Y. LeCun, “The MNIST database of handwritten digits,” (1998).
- Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Tech. Rep. (Citeseer, 2009).
- Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017) pp. 6626–6637.
- Szegedy et al. (2016) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) pp. 2818–2826.
- Fedus et al. (2017) W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow, “Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step,” (2017), arXiv:1710.08446 [stat.ML] .
- Ostrovski et al. (2018) G. Ostrovski, W. Dabney, and R. Munos, “Autoregressive Quantile Networks for Generative Modeling,” (2018), arXiv:1806.05575 [cs.LG] .
- Amin (2015) M. H. Amin, Phys. Rev. A 92, 052323 (2015).
- Bernien et al. (2017) H. Bernien, S. Schwartz, A. Keesling, H. Levine, A. Omran, H. Pichler, S. Choi, A. S. Zibrov, M. Endres, M. Greiner, et al., Nature 551, 579 (2017).
- Kaufman et al. (2016) A. M. Kaufman, M. E. Tai, A. Lukin, M. Rispoli, R. Schittko, P. M. Preiss, and M. Greiner, Science 353, 794 (2016).
- Wilson et al. (2019) M. Wilson, T. Vandal, T. Hogg, and E. Rieffel, “Quantum-assisted associative adversarial network: Applying quantum annealing in deep learning,” (2019), arXiv:1904.10573 [cs.LG] .
- Wiebe and Wossnig (2019) N. Wiebe and L. Wossnig, “Generative training of quantum Boltzmann machines with hidden units,” (2019), arXiv:1905.09902 [quant-ph] .
- Hinton (2002) G. E. Hinton, Neural Comput. 14, 1771 (2002).
- Tieleman (2008b) T. Tieleman, in Proceedings of the 25th International Conference on Machine Learning, ICML ’08 (ACM, New York, NY, USA, 2008) pp. 1064–1071.
- Hinton (2012) G. E. Hinton, “A Practical Guide to Training Restricted Boltzmann Machines,” in Neural Networks: Tricks of the Trade: Second Edition, edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012) pp. 599–619.
- Kirkpatrick et al. (1983) S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science 220, 671 (1983).
- Barahona (1982) F. Barahona, J. Phys. A 15, 3241 (1982).
- Hukushima and Iba (2003) K. Hukushima and Y. Iba, AIP Conf. Proc. 690, 200 (2003).
- Machta (2010) J. Machta, Phys. Rev. E 82, 026704 (2010).
- Suzuki (1976) M. Suzuki, Prog. Theor. Phys. 56, 1454 (1976).
- Hastings (1970) W. K. Hastings, Biometrika 57, 97 (1970).
- Kingma and Ba (2014) D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” (2014), arXiv:1412.6980 [cs.LG] .
- Glorot and Bengio (2010) X. Glorot and Y. Bengio, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, edited by Y. W. Teh and M. Titterington (PMLR, Chia Laguna Resort, Sardinia, Italy, 2010) pp. 249–256.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009) pp. 248–255.
- Barratt and Sharma (2018) S. Barratt and R. Sharma, “A Note on the Inception Score,” (2018), arXiv:1801.01973 [stat.ML] .
- Borji (2018) A. Borji, “Pros and Cons of GAN Evaluation Measures,” (2018), arXiv:1802.03446 [cs.CV] .
Appendix A RBM Training and Sampling
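Although the joint probability distribution over the visible and hidden units is generally intractable for computing expectation values, the bipartite structure of the RBM makes it simple to sample from the conditional probability distributions of each layer given the other. Each conditional factorizes over units, with the activation probability of a unit given by $\sigma$ applied to an affine function of the opposite layer's state, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. We can use these relations to approximate the negative phase of the gradient during training.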
Contrastive divergence (CD) Hinton (2002) uses a data sample as the starting point of a Markov chain at every optimization iteration and then performs block Gibbs sampling for $k$ steps using Eqs. (19) and (20). Persistent contrastive divergence (PCD) Tieleman (2008b) instead relies on a single Markov chain with a persistent state that is preserved at the end of each optimization iteration and passed on as the initial state of the next iteration. In practice, a small number of Gibbs steps has been shown to be sufficient Hinton (2012).
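The difference between the two chain initializations can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes units taking values in {0, 1}, and the names `W`, `a`, `b`, `gibbs_step`, and `run_chain` are hypothetical and not taken from the implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One block Gibbs step: sample h given v, then v given h."""
    p_h = sigmoid(b + v @ W)                        # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + h @ W.T)                      # p(v_i = 1 | h)
    return (rng.random(p_v.shape) < p_v).astype(float)

def run_chain(v_init, W, a, b, k, rng):
    """Run k block Gibbs steps starting from v_init."""
    v = v_init.copy()
    for _ in range(k):
        v = gibbs_step(v, W, a, b, rng)
    return v

rng = np.random.default_rng(0)
n_vis, n_hid, batch = 6, 4, 8                       # toy sizes (assumed)
W = 0.1 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
data = rng.integers(0, 2, size=(batch, n_vis)).astype(float)

# CD: every optimization iteration restarts the chain from the data batch.
v_cd = run_chain(data, W, a, b, k=1, rng=rng)

# PCD: the chain state is carried over between optimization iterations.
v_pcd = data.copy()
for _ in range(3):                                  # three mock iterations
    v_pcd = run_chain(v_pcd, W, a, b, k=1, rng=rng)
    # a parameter update using v_pcd as negative-phase samples would go here
```

In both cases the final visible states serve as the negative-phase samples; only the starting point of the chain differs.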
In order to generate samples from the RBM, we apply the same procedure, but instead of initializing the visible units with a data sample, we use a random initialization. Furthermore, we use simulated annealing Kirkpatrick et al. (1983) by introducing an inverse temperature $\beta$ that scales the energy function, and hence the model parameters. We initialize our Markov chain with a sample drawn from a uniform distribution at $\beta = 0$ and then gradually lower the temperature at each step of the Markov chain until we reach $\beta = 1$, which corresponds to the desired distribution parameters. We use a linear schedule in $\beta$ for annealing. This procedure improves the diversity of our samples by increasing the mixing rate of the chain and helps the training procedure avoid getting trapped in local minima of the loss function.
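A minimal sketch of this annealed sampling procedure is given below, again assuming {0, 1}-valued units. The function name `annealed_samples` and the toy parameters are hypothetical; the inverse temperature simply multiplies all weights and biases inside the conditional probabilities.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def annealed_samples(W, a, b, n_samples, n_steps, rng):
    """Block Gibbs sampling with a linear schedule in the inverse temperature.

    n_steps should be at least 2 so that the schedule ends at beta = 1.
    """
    # At beta = 0 all configurations are equally likely, so start uniformly.
    v = rng.integers(0, 2, size=(n_samples, W.shape[0])).astype(float)
    for beta in np.linspace(0.0, 1.0, n_steps):
        # Scaling the energy by beta scales every weight and bias by beta.
        p_h = sigmoid(beta * (b + v @ W))
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(beta * (a + h @ W.T))
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v

rng = np.random.default_rng(1)
W = np.zeros((6, 4))
a = np.full(6, 20.0)      # strong positive visible bias (toy example)
b = np.zeros(4)
samples = annealed_samples(W, a, b, n_samples=8, n_steps=10, rng=rng)
```

With the strongly biased toy parameters above, the final samples are expected to consist almost entirely of ones, which gives a quick sanity check that the chain ends at the target distribution.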
Note that thus far we have assumed that our inputs, given by the visible units, are binary vectors. In the case of real-valued data, such as pixel values of images, we can often rescale the input to be in the range $[0, 1]$ and treat it as the expectation value of a binary variable Goodfellow et al. (2016). We can then sample each entry in our visible vector from a Bernoulli distribution whose mean is the corresponding rescaled value. Although there are formulations of RBMs with real-valued inputs, known as Gaussian-Bernoulli RBMs, here we use Bernoulli RBMs as they are generally easier to train Hinton (2012).
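The rescaling and Bernoulli sampling of real-valued inputs can be illustrated as follows. The helper `binarize` and the assumed pixel range of 0 to 255 are illustrative choices, not details taken from this work.

```python
import numpy as np

def binarize(images, rng, lo=0.0, hi=255.0):
    """Rescale pixel values to [0, 1] and sample binary visible units."""
    p = np.clip((images - lo) / (hi - lo), 0.0, 1.0)   # Bernoulli means
    return (rng.random(p.shape) < p).astype(np.float32)

rng = np.random.default_rng(2)
images = np.array([[0.0, 255.0], [255.0, 0.0]])        # toy 2x2 "image"
visible = binarize(images, rng)

# Averaging many binarized copies recovers the rescaled intensity.
gray = np.full((10000,), 127.5)
mean_estimate = binarize(gray, rng).mean()
```

Extreme pixel values map deterministically to 0 or 1, while intermediate intensities are reproduced only on average over many samples.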
Appendix B QBM Training and Sampling
We begin by rewriting Eq. (14) for each of the QBM parameters, such that at every training step we must estimate:
As mentioned in Sec. III.1, the use of the approximate lower bound given by Eq. (12) precludes the training of the transverse field strengths. The positive phases of the gradient can be calculated exactly, as given in Amin et al. (2018):
As in the RBM case, it is difficult to estimate the negative phase of the gradient as it requires sampling from a quantum thermal state, a problem which in general is NP-hard Barahona (1982). To approximately sample from our numerically simulated QBM, we perform population-annealed Monte Carlo sampling. In the population annealing sampling heuristic Hukushima and Iba (2003), a population of replicas of the system in question is maintained at infinite temperature, and then cooled to some finite temperature by an annealing schedule of steps, which is analogous to the number of Gibbs steps in the training of RBMs. With each cooling step, replicas are duplicated or deleted based on an estimate of their relative Boltzmann weights, and are equilibrated according to some Monte Carlo algorithm Machta (2010). By sampling a population from our quantum Boltzmann distribution, we are able to parallelize the generation of a mini-batch of samples with one run of the algorithm.
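The structure of population annealing can be sketched on a classical Ising energy function. This is a simplified illustration under several assumptions: the population size is kept fixed by multinomial resampling (one of several possible resampling rules), equilibration uses single-spin-flip Metropolis updates, and all function names and parameter values are hypothetical.

```python
import numpy as np

def ising_energy(s, b, w):
    """E(s) = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j for each replica row."""
    return -(s @ b) - 0.5 * np.einsum("ri,ij,rj->r", s, w, s)

def resample(states, energies, delta_beta, rng):
    """Duplicate or delete replicas according to relative Boltzmann weights."""
    log_w = -delta_beta * energies
    weights = np.exp(log_w - log_w.max())           # numerically stable
    idx = rng.choice(len(states), size=len(states), p=weights / weights.sum())
    return states[idx]                              # fancy indexing copies

def metropolis_sweep(s, b, w, beta, rng):
    """One single-spin-flip Metropolis sweep applied to all replicas."""
    for i in range(s.shape[1]):
        d_e = 2.0 * s[:, i] * (b[i] + s @ w[:, i])  # energy change of a flip
        accept = rng.random(len(s)) < np.exp(-beta * np.clip(d_e, 0.0, None))
        s[accept, i] *= -1
    return s

def population_anneal(b, w, n_replicas, betas, rng):
    # Infinite temperature (beta = 0): spins are uniformly random.
    s = rng.choice([-1.0, 1.0], size=(n_replicas, len(b)))
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        s = resample(s, ising_energy(s, b, w), beta - beta_prev, rng)
        s = metropolis_sweep(s, b, w, beta, rng)
    return s

rng = np.random.default_rng(3)
b = np.full(4, 5.0)                                 # strong field (toy example)
w = np.zeros((4, 4))                                # symmetric, zero diagonal
replicas = population_anneal(b, w, 16, np.linspace(0.0, 2.0, 11), rng)
```

The final population plays the role of one mini-batch of approximate samples, so a single run yields all samples needed for one gradient estimate.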
To perform Monte Carlo sampling at each equilibration step, we use the Trotter–Suzuki mapping of the stoquastic Hamiltonian to a classical energy function with an extra imaginary time dimension, which is discretized into imaginary time slices Suzuki (1976). Under this mapping, the quantum thermal distribution at inverse temperature given by the stoquastic Hamiltonian is approximated by the classical thermal distribution:
We impose periodic boundary conditions along the imaginary time dimension, such that the slice following the last one is identified with the first. In the limit of infinitely many imaginary time slices, this approximation is exact. Then, we perform the necessary Monte Carlo sampling using the Metropolis–Hastings algorithm Hastings (1970) on the mapped set of spins.
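To make the mapping concrete, the sketch below evaluates the unnormalized log-weight of a classical spin configuration with an extra imaginary time dimension and performs a Metropolis sweep over it. It assumes a transverse-field Ising Hamiltonian of the form $H = -\Gamma \sum_a \sigma^x_a - \sum_a b_a \sigma^z_a - \sum_{a<b} w_{ab} \sigma^z_a \sigma^z_b$ with a uniform $\Gamma > 0$; this form, the symbol names, and the function names are assumptions made for illustration. The log-weight is recomputed from scratch for every proposed flip, which is simple but inefficient.

```python
import numpy as np

def trotter_log_weight(s, b, w, gamma, beta):
    """Unnormalized log-probability of spins s with shape (M, n), entries +/-1.

    M is the number of imaginary time slices and n the number of qubits.
    """
    eps = beta / s.shape[0]
    # Coupling between neighboring time slices induced by the transverse field.
    k_perp = 0.5 * np.log(1.0 / np.tanh(eps * gamma))
    in_slice = s @ b + 0.5 * np.einsum("mi,ij,mj->m", s, w, s)
    # np.roll implements the periodic boundary in imaginary time.
    between = np.sum(s * np.roll(s, -1, axis=0))
    return eps * in_slice.sum() + k_perp * between

def metropolis_sweep(s, b, w, gamma, beta, rng):
    """Propose one flip per spin and time slice; accept with min(1, ratio)."""
    for m in range(s.shape[0]):
        for a in range(s.shape[1]):
            old = trotter_log_weight(s, b, w, gamma, beta)
            s[m, a] *= -1
            new = trotter_log_weight(s, b, w, gamma, beta)
            if rng.random() >= np.exp(min(0.0, new - old)):
                s[m, a] *= -1                       # reject: undo the flip
    return s

rng = np.random.default_rng(4)
b = np.array([0.3, -0.2])
w = np.array([[0.0, 0.5], [0.5, 0.0]])              # symmetric, zero diagonal
spins = rng.choice([-1.0, 1.0], size=(8, 2))        # 8 slices, 2 qubits
spins = metropolis_sweep(spins, b, w, gamma=1.0, beta=1.0, rng=rng)
```

Because of the periodic boundary, the log-weight is invariant under cyclic shifts of the time slices, which is a convenient check on any implementation of the mapping.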
In our simulations, we use initial population replicas corresponding to a single mini-batch, a linear annealing schedule of steps from to , and imaginary time slices.
Appendix C Training Parameters
We train all of our networks using the Adam method for stochastic optimization Kingma and Ba (2014), with for all trained variables, for our Boltzmann machines, and for our generator and discriminator. As in previous works involving QBMs Amin et al. (2018); Khoshaman et al. (2018), we take all . During training on synthetic data, we set the learning rate for both of our Boltzmann machines to and use Gibbs steps. Each Boltzmann machine has visible units and hidden units, which are sufficient for approximating the studied Bernoulli distribution.
When training on the MNIST and CIFAR-10 data sets, we consider Boltzmann machines with visible units and hidden units. For consistency, we also set the dimension of the noise distribution for DCGAN to be . We use a learning rate of for both our generator and discriminator, and a learning rate of for our Boltzmann machines with Gibbs steps. Furthermore, we initialize our weights using Xavier initialization Glorot and Bengio (2010) and initialize our biases to zero. In order to help the discriminator learn in the early stages of training, we use soft and noisy labels, where a random number between and is used instead of labels (fake images) and a random number between and is used instead of labels (real images). Each model is trained for epochs, where an epoch represents one full pass through the training data.
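The soft and noisy labels described above can be generated as in the following sketch. The intervals used as defaults here are placeholders chosen purely for illustration and are not the values used in this work; the function name is likewise hypothetical.

```python
import numpy as np

def soft_noisy_labels(n, real, rng, fake_range=(0.0, 0.1), real_range=(0.9, 1.0)):
    """Draw n discriminator targets uniformly from a small interval.

    The default intervals are illustrative placeholders only.
    """
    lo, hi = real_range if real else fake_range
    return rng.uniform(lo, hi, size=n)

rng = np.random.default_rng(5)
real_targets = soft_noisy_labels(64, real=True, rng=rng)
fake_targets = soft_noisy_labels(64, real=False, rng=rng)
```

These targets replace the hard labels in the discriminator loss, so the discriminator is never pushed toward fully saturated outputs early in training.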
Appendix D Performance Evaluation Metrics
In this section we describe how various GAN architectures are quantitatively compared. For data sets drawn from a known distribution, we can evaluate the performance of a GAN by computing the KL divergence between the empirical distribution of generated samples and the known distribution. Unfortunately, for most image data sets the underlying distribution is unknown and this method is not applicable. Ideally, one would want humans to judge the quality of generated samples in comparison with the original data. However, as this approach is very subjective and almost always impractical, two alternative metrics have been proposed and successfully used to evaluate GANs.
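For a small discrete sample space, the KL divergence to a known distribution can be estimated directly from counts, as sketched below. The direction of the divergence (empirical distribution first, known distribution second) and the encoding of bit strings as integers are assumptions of this example, and `empirical_kl` is a hypothetical helper.

```python
import numpy as np

def empirical_kl(samples, p_true):
    """KL(q_emp || p_true) for binary samples of shape (N, n).

    p_true has length 2**n and is indexed by the bit string read as an
    integer, most significant bit first.
    """
    n = samples.shape[1]
    idx = samples.astype(int) @ (2 ** np.arange(n)[::-1])
    q = np.bincount(idx, minlength=2 ** n) / len(samples)
    mask = q > 0                                    # 0 * log(0) is taken as 0
    return float(np.sum(q[mask] * np.log(q[mask] / p_true[mask])))

p_uniform = np.full(4, 0.25)
matched = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
collapsed = np.zeros((5, 2))
kl_matched = empirical_kl(matched, p_uniform)       # samples match p_uniform
kl_collapsed = empirical_kl(collapsed, p_uniform)   # all samples identical
```

A generator that reproduces the target distribution gives a divergence of zero, whereas one that collapses onto a single bit string is penalized by the logarithm of the inverse probability of that string.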
D.1 Inception Score
The first metric uses the Inception network Szegedy et al. (2016), trained on the ImageNet data set Deng et al. (2009), to assess the quality of the generated images. This metric is called the Inception score Salimans et al. (2016) and is defined as the exponential of the average KL divergence between the conditional label distribution $p(y|\mathbf{x})$ and the marginal distribution $p(y)$ over generated samples; that is:
$$\mathrm{IS} = \exp\left(\mathbb{E}_{\mathbf{x}}\left[D_{\mathrm{KL}}\left(p(y|\mathbf{x}) \,\|\, p(y)\right)\right]\right) = \exp\left(H(y) - \mathbb{E}_{\mathbf{x}}\left[H(y|\mathbf{x})\right]\right),$$
where $H$ denotes the entropy and the expectation value is taken over image vectors $\mathbf{x}$ sampled from the generator. We recognize the right-hand side as the exponential of the mutual information $I(y; \mathbf{x})$. This metric captures two key features that we are looking for in our generated images: the depiction of clearly identifiable objects (i.e. $p(y|\mathbf{x})$ should have low entropy for easily classifiable samples) and a high diversity in samples (i.e. $p(y)$ should have high entropy if all classes are approximately equally represented) Barratt and Sharma (2018).
Note that the use of the Inception network in computing the above score is only appropriate for colored images, which is not the case for the MNIST data set. We follow the approach described in Kodali et al. (2018) and train a simple 4-layer CNN on MNIST, which achieves an accuracy of and therefore can be viewed as a reliable classifier. We then use it to compute the Inception score in an otherwise identical fashion.
The Inception score is a widely adopted evaluation scheme and is known to match well with the human perception of image quality Salimans et al. (2016). However, it also has some known drawbacks, such as favoring models that memorize the training data or those that generate clear yet unnatural combinations of objects Borji (2018); Barratt and Sharma (2018).
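Given the class probabilities predicted by a classifier for a batch of generated images, the score can be computed as in the sketch below. The array `p_yx` stands in for the classifier outputs (one row per image), and the small constant `eps` is added only for numerical stability; both names are illustrative.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Exponential of the mean KL divergence between p(y|x) and p(y).

    p_yx has shape (N, K): one row of K class probabilities per image.
    """
    p_y = p_yx.mean(axis=0)                         # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 4 classes: score close to 4.
score_diverse = inception_score(np.eye(4))
# Completely uninformative predictions: score equal to 1.
score_uniform = inception_score(np.full((10, 4), 0.25))
```

The two toy cases show the extremes of the metric: the score ranges from 1 for uninformative predictions up to the number of classes for confident and evenly distributed predictions.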
D.2 Fréchet Inception Distance
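The Fréchet Inception distance, introduced by Heusel et al. (2017), avoids some of the problems of the Inception score described above by directly comparing the statistics of synthetic samples to those of real-world data. This metric computes the similarity between the features extracted by the pool3 layer of Inception-v3 when the network is supplied with real data and with generated images. These features can be thought of as drawn from multivariate Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $\mathcal{N}(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, respectively. The Fréchet distance, also known as the Wasserstein-2 distance, is defined as:
$$d^2 = \left\|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\right\|_2^2 + \mathrm{Tr}\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\left(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g\right)^{1/2}\right).$$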
This distance is lower when the features extracted from generated data are distributed similarly to those extracted from real images.
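The distance between two sets of feature vectors can be computed as sketched below. In practice the features would be Inception-v3 pool3 activations; here random arrays serve as stand-ins, and the function name is hypothetical. The trace of the matrix square root is obtained from the eigenvalues of the product of the two covariance matrices, which avoids an explicit matrix square root; the sketch assumes feature dimension of at least 2.

```python
import numpy as np

def frechet_distance(feat_real, feat_gen):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (N, d) with d >= 2.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative up to
    # round-off for positive semi-definite covariances.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(6)
features = rng.standard_normal((200, 4))            # stand-in feature vectors
d_same = frechet_distance(features, features)       # identical statistics
d_shift = frechet_distance(features, features + 3.0)  # mean shifted by 3 per dim
```

Identical feature sets give a distance of zero up to round-off, and a pure shift of the mean contributes its squared Euclidean norm, here 4 × 3² = 36.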