Quantum Semantic Learning by Reverse Annealing an Adiabatic Quantum Computer

03/25/2020 ∙ by Lorenzo Rocutto, et al. ∙ Consiglio Nazionale delle Ricerche

Boltzmann Machines constitute a class of neural networks with applications to image reconstruction, pattern classification and unsupervised learning in general. Their most common variants, called Restricted Boltzmann Machines (RBMs), exhibit a good trade-off between computability on existing silicon-based hardware and generality of possible applications. Still, the diffusion of RBMs is quite limited, since their training process proves to be hard. The advent of commercial Adiabatic Quantum Computers (AQCs) raised the expectation that implementations of RBMs on such quantum devices could increase the training speed with respect to conventional hardware. To date, however, the implementation of RBM networks on AQCs has been limited by the low qubit connectivity when each qubit acts as a node of the neural network. Here we demonstrate the feasibility of a complete RBM on AQCs, thanks to an embedding that associates its nodes with virtual qubits, thus outperforming previous implementations based on incomplete graphs. Moreover, to accelerate the learning, we implement a semantic quantum search which, unlike previous proposals, takes the input data as initial boundary conditions to start each learning step of the RBM, thanks to a reverse annealing schedule. Such an approach, unlike the more conventional forward annealing schedule, allows sampling configurations in a meaningful neighborhood of the training data, mimicking the behavior of the classical Gibbs sampling algorithm. We show that learning based on reverse annealing quickly raises the sampling probability of a meaningful subset of the set of configurations. Even without a proper optimization of the annealing schedule, the RBM semantically trained by reverse annealing achieves better scores on reconstruction tasks.





Unsupervised learning algorithms are expected to be empowered by the advent of practical quantum computing. Nonetheless, the achievement of a computational advantage over the Von Neumann–Zuse computer architecture schmidhuber2006colossus, based on classical binary logic, is still on its way for useful-sized problems, mainly because quantum hardware is in an early stage of development. To speed up the process, hardware synthesis techniques mueck2017quantum must be constantly updated to harness the full potential of quantum processing units.

Boltzmann Machines (BMs) are a prominent example of unsupervised learning algorithms barlow1989unsupervised, le2013building, introduced in 1985 by Ackley, Hinton and Sejnowski ackley1985learning. It has been proved that a trained BM can be used as a universal approximator of probability distributions on binary variables sussmann1988learning, younes1996synchronous. Such a property makes BMs particularly suitable as generative models theis2015note, srivastava2012multimodal to reconstruct partially missing data. Despite such theoretical representative power, one of the steps of the training algorithm makes it computationally expensive to train large BMs hinton2002training. More specifically, such a step requires extracting samples from the instantaneous distribution approximated by the BM. The cost to produce each sample grows rapidly as the problem size increases. Thus, BMs are usually not applicable to useful-sized problems.

To achieve the full power of BMs, a training method that scales well as the size and the complexity of a BM increase would be required. A possible solution follows from the observation that BMs can be naturally implemented on a specific architecture of quantum computers referred to as Adiabatic Quantum Computers (AQCs). Being able to train useful-sized BMs on an AQC would make it possible to tackle and solve relevant problems in many fields, ranging from recommendation systems bell2007lessons, to anomaly detection yang2017improved, rastatter2019abnormal, to quantum tomography carrasquilla2019reconstructing.

Over the years, BMs have found application mostly in a simplified form, called the Restricted Boltzmann Machine (RBM) smolensky1986information, hinton2002training. RBMs have lower representative power than BMs, but they are easier to train and use. Due to their popularity in the classical version, RBMs are usually chosen as a benchmark to evaluate the potential advantages of quantum learning algorithms benedetti2016estimation.

Since 2011, researchers have explored the possibility of using an AQC to approximate the probability distribution needed during the training of an RBM denil2011toward, dumoulin2014challenges. Experimental results suggest that an AQC can be operated to extract samples from the distribution associated with an RBM at the computational cost of a single quantum operation amin2015searching, boixo2016computational, korenkevych2016benchmarking. Thus, sampling from AQCs seems to have a cost that does not depend on the size or the complexity of the RBM. In 2015, Adachi and Henderson adachi2015application performed computations on an actual AQC to train a Deep Belief Network, a directed-graph version of the RBM. In 2016, Benedetti et al. benedetti2016estimation introduced an efficient technique for estimating the effective temperature of the distribution produced by the AQC. Developing hardware synthesis techniques to sample effectively from a given probability distribution is beneficial for unsupervised learning in general. For instance, a hybrid algorithm has been recently proposed where a classical Generative Adversarial Network (GAN) is supported by a quantum-trained RBM. Previously, some of the authors exploited deep learning algorithms prati2017quantum to achieve control of qubits for the preparation of quantum states porotti2019coherent, porotti2019reinforcement, paparelle2020digitally. Here we follow the opposite approach, since we combine machine learning with the computational power of the quantum hardware itself.

Following the practice of benchmarking BMs through a restricted version, we tackled the problem of training an RBM using an AQC. Results obtained in such a restricted case are fundamental building blocks for more complex networks.

We applied embedding techniques to prepare a graph of virtual qubits arranged to implement all the weights present in a classical RBM. We compare our results to those obtained with a network characterized by sparse connections between the visible and the hidden layers benedetti2016estimation to show the improvement in the learning. Unlike circuital quantum computers based on quantum logic gates rieffel2011quantum, adiabatic quantum computers childs2001robustness, boixo2016computational, johnson2011quantum, farhi2002quantum, dickson2013thermally provide the quantum analog of simulated annealing: they return probabilistic answers by driving a quantum system from a superposition of classical states to a frozen state which approximates the solution of a problem. Such a process is called quantum annealing. Unfortunately, the commonly adopted forward annealing schedule fails to exploit the information carried by the elements of the dataset, since it starts the search from a generic multi-qubit state. Contrary to previous algorithms, we introduce semantic learning based on the reverse annealing schedule, which consists of partially relaxing the multi-qubit state from a chosen initial classical state, thus performing a local quantum search in the configuration space.

The results obtained by applying the reverse annealing schedule take advantage of the ability to start the annealing process from configurations belonging to the dataset. For this reason the approach is referred to as semantic, and it is more informative than the forward annealing method.

To assess the performance of the training process, we trained an RBM with 16 visible units and 16 hidden units to reconstruct the Bars and Stripes dataset. The performance is evaluated in terms of the percentage of reconstructed pixels and the average Log-Likelihood of the dataset. All the quantum computations have been performed on a D-Wave 2000Q™ System processor johnson2011quantum, dwave2019technical, dwave2017reverse.

We show that embedded topologies do not appreciably interfere with the capacity of the AQC to produce correctly distributed samples. The use of embedding techniques thus allows faster learning compared to the sparse case.

The sampling distribution produced by the AQC in the case of the reverse annealing schedule is closer to the configurations corresponding to the dataset than that produced by the more conventional forward annealing. As a consequence, the learning process exhibits a novel semantic behavior that brings pros and cons. In our case, the RBM trained with reverse annealing achieves overall a better reconstruction score compared to the usual quantum algorithm for AQCs. Furthermore, reverse annealing leads to a sampling probability of the elements of the dataset which is twice that of forward annealing and higher than that of the classical method.

In the Results Section (I), the classical RBM is briefly introduced and the quantum training method for the RBM is described, both in its forward annealing and reverse annealing schedule. The experimental results for both approaches are shown. In Section II the results are discussed. The Method Section (III) describes the technical details and the design of the experiment. Finally, Section IV summarizes the main conclusions and the perspectives.

I Results

i.1 Structure of BM and RBM

A Boltzmann Machine is represented by a graph composed of units that can assume binary values in $\{0, 1\}$, linked by real weighted connections ackley1985learning. Units are divided into two classes, visible and hidden. In general, a BM has connections between any pair of units, while the graph of an RBM is bipartite, with no hidden–hidden or visible–visible connections. Figure 1 b shows the structure of a 16x16 RBM. Before the training, the weights are initialized randomly, according to some probability distribution. A training process that suitably modifies the weights of the BM is needed to encode information. A trained BM can accept an incomplete item and reconstruct the missing parts according to the previously acquired information.

Figure 1: a: Schematic overview of the methods to train a Restricted Boltzmann Machine. The green and yellow boxes show which part of the gradient is estimated by each step of the algorithm. Steps that require only classical computations are circled in orange. During the data states phase, the elements of the dataset are loaded in the visible units (red), and the hidden units (green) are updated accordingly. The result allows the estimation of the positive statistics. Next, one step among Gibbs, forward annealing, and reverse annealing sampling is chosen to estimate the negative statistics. Forward annealing is the only one that does not depend on states loaded from the dataset. b: Structure of the Restricted Boltzmann Machine used in the article, with 16 visible units fully connected to 16 hidden units.

In the case of image reconstruction, each pixel of the image, known or missing, is univocally represented by a visible unit. The visible units correspond to both the input and the output of the generative model. Indeed, when reconstructing a corrupted image, we initialize and fix part of the visible units, while the uninitialized visible units output the missing pixels. Hidden units, on the other hand, confer to the BM the power to extract essential features from the dataset. A higher number of hidden units raises the ability of the BM to mimic the structure of the dataset.

i.1.1 Internal dynamics of a BM

In 1985, Ackley, Hinton, and Sejnowski ackley1985learning proposed a computational algorithm that produces configurations distributed according to an energy model summarized in Supplementary Note 1.

First, visible units are initialized with arbitrary binary values. Next, the values of the hidden units are updated according to a probabilistic rule that depends on the state of the visible units. Then, visible units are updated according to the state of the hidden units. Such an update is usually iterated many times, and the output configuration is achieved by reading the final values of the units. If the number of update steps is sufficiently high, the output configurations are extracted from a distribution which approximates the probability from the energy model

$P(\mathbf{v}, \mathbf{h}) = e^{-E(\mathbf{v}, \mathbf{h})/T} / Z$     (1)

where $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})/T}$ is the normalizing partition function, $E$ is a dimensionless function depending on the state of the BM units that we will call energy, and $T$ is a dimensionless parameter called temperature, in analogy with thermodynamic functions of statistical physics. In the case of an RBM,

$E(\mathbf{v}, \mathbf{h}) = - \sum_{i,j} w_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$     (2)

where $v_i$ is the value of the $i$-th visible unit, $h_j$ is the value of the $j$-th hidden unit, $w_{ij}$ is the weight of the connection between the $i$-th visible unit and the $j$-th hidden unit, while $a_i$ and $b_j$ are real parameters called biases. Since different values of the temperature correspond to different uniform rescalings of the weights and biases, we are allowed to set $T = 1$.
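As a concrete check of Equation 2, the energy functional can be evaluated directly; the minimal sketch below (plain NumPy, with illustrative toy values of our own choosing) computes $E(\mathbf{v}, \mathbf{h})$ at $T = 1$:

```python
import numpy as np

def rbm_energy(v, h, w, a, b):
    """Dimensionless RBM energy of Equation 2 at T = 1:
    E(v, h) = -v.W.h - a.v - b.h."""
    return -(v @ w @ h) - (a @ v) - (b @ h)

# Toy instance: 2 visible and 2 hidden units, uniform weights, zero biases.
w = 0.5 * np.ones((2, 2))
a = np.zeros(2)
b = np.zeros(2)
v = np.array([1.0, 0.0])
h = np.array([1.0, 1.0])
energy = rbm_energy(v, h, w, a, b)  # -(0.5 + 0.5) = -1.0
```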

The sampling algorithm is described in detail in Supplementary Note 2. Such a method to extract configurations is well established in physics, and it is usually referred to as Gibbs sampling. We call $k$ the chosen number of times the visible units are updated.
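The alternating update described above can be sketched in a few lines; the function names and the use of NumPy are our own choices, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(v0, w, a, b, k):
    """k rounds of block Gibbs sampling for a binary RBM:
    hidden units are drawn from P(h_j=1|v) = sigmoid(b_j + (v W)_j),
    then visible units from P(v_i=1|h) = sigmoid(a_i + (W h)_i)."""
    v = v0.copy()
    h = (rng.random(b.size) < sigmoid(b + v @ w)).astype(float)
    for _ in range(k):
        h = (rng.random(b.size) < sigmoid(b + v @ w)).astype(float)
        v = (rng.random(a.size) < sigmoid(a + w @ h)).astype(float)
    return v, h
```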

i.1.2 How to train a BM

In the following, the training of the BM involves a dataset of images composed of $4 \times 4$ pixels, known as the Bars and Stripes (BAS) dataset. The images consist of monochromatic (black or white) rows or columns. Figure 2 c shows some sample images taken from the $4 \times 4$ version of such dataset.
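For reference, the full $n \times n$ Bars and Stripes dataset can be generated exhaustively; this is a standard construction, sketched here in Python (the function name is ours):

```python
import itertools
import numpy as np

def bars_and_stripes(n):
    """All n x n Bars-and-Stripes images, flattened to binary vectors.
    Every row is constant (stripes) or every column is constant (bars);
    the all-black and all-white images belong to both families, so the
    set removes the duplicates, leaving 2**(n+1) - 2 images."""
    images = set()
    for pattern in itertools.product([0, 1], repeat=n):
        col = np.array(pattern).reshape(n, 1)
        images.add(tuple(np.tile(col, (1, n)).ravel()))    # stripes
        images.add(tuple(np.tile(col.T, (n, 1)).ravel()))  # bars
    return np.array(sorted(images))

bas4 = bars_and_stripes(4)  # 30 distinct images of 16 pixels each
```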

Figure 2: a, Reconstruction score vs training epochs for the classical, forward annealing and reverse annealing case, respectively. The learning curves are obtained with a fixed rescaling parameter and fixed weight initialization parameters. b, Average Log-Likelihood vs training epochs for the classical, forward annealing and reverse annealing case. c, Some of the images used during training, taken from the Bars and Stripes dataset, together with the reconstructions obtained by Gibbs sampling at different epochs.

Each training image can be mapped to an $N_v$-dimensional binary vector $\mathbf{v}$, where each term in the vector corresponds to a visible unit of the BM. The probability for the BM to output a training vector can be raised by adjusting the weights and biases so as to lower the energy of that vector and to raise the energy of other vectors.

The learning rule for performing stochastic steepest ascent in the log probability of the training data is expressed by:

$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$     (3)

where $\varepsilon$ is the learning rate ackley1985learning.

The term $\langle v_i h_j \rangle_{\text{data}}$ is called positive statistics. It represents the expected value of $v_i h_j$ when the visible units are fixed to the values of a training vector, averaged over all vectors of the training dataset. It is defined as

$\langle v_i h_j \rangle_{\text{data}} = \frac{1}{N_D} \sum_{\mathbf{v} \in D} v_i \, P(h_j = 1 \mid \mathbf{v})$     (4)

where $D$ is the training set, and $N_D$ is the number of elements in $D$.

Instead, $\langle v_i h_j \rangle_{\text{model}}$ is called negative statistics and represents the expectation value of $v_i h_j$ when both visible and hidden units are produced by the model (e.g. during the Gibbs sampling algorithm). It is defined as

$\langle v_i h_j \rangle_{\text{model}} = \sum_{\mathbf{v}, \mathbf{h}} v_i h_j \, P(\mathbf{v}, \mathbf{h})$     (5)
In the RBM case, since there are no direct connections between hidden units, it is fairly easy to get an unbiased sample of $\langle v_i h_j \rangle_{\text{data}}$: a single Gibbs sampling step to update the hidden units is sufficient, repeated for each element of the dataset (see Supplementary Note 1).
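Since $P(h_j = 1 \mid \mathbf{v})$ is available in closed form for an RBM, the positive statistics of Equation 4 can be computed without sampling; a minimal sketch, with our own naming:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_statistics(data, w, b):
    """<v_i h_j>_data of Equation 4: for each training vector the
    hidden activation probabilities are exact, so the expectation is
    v_i * P(h_j = 1 | v) averaged over the dataset."""
    p_h = sigmoid(data @ w + b)            # shape (N_D, N_h)
    return data.T @ p_h / data.shape[0]    # shape (N_v, N_h)
```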

Getting an unbiased sample of $\langle v_i h_j \rangle_{\text{model}}$ is much more difficult. From Equation 5 it follows that an exact computation would involve a summation over all possible combinations of values for hidden and visible units, which means summing over $2^{N_v + N_h}$ elements. In many practical situations, an exact computation of the negative statistics is unfeasible.

The negative statistics is usually approximated as follows. Initially, the values of the visible units are set equal to those of a training vector. Next, the Gibbs sampling algorithm is repeatedly applied. More precisely, a sequence of configurations is produced by alternately extracting the values of the hidden units at fixed visible units and the values of the visible units at fixed hidden units, while averaging the value of $v_i h_j$ along the sequence. Each step of this sequence has the same computational cost as the estimation of the positive statistics. This process is repeated for each element of the dataset. This provides an estimate of $\langle v_i h_j \rangle_{\text{model}}$, which is indeed unbiased if a good level of convergence, a.k.a. equilibration, is attained for each sequence. RBMs typically learn better if more steps of alternating Gibbs sampling are used before collecting vectors for the negative statistics. Initializing the visible units with elements from the dataset is not a mathematical requirement, since average values at equilibrium ought to be independent of the initial conditions, but it is known to enhance the quality of the learning hinton2002training.
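The classical approximation described above can be sketched as a contrastive-divergence-style estimator; the chain length `k` and all names below are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_statistics_cd(data, w, a, b, k=1):
    """Approximate <v_i h_j>_model by starting an alternating Gibbs
    chain from every training vector and running it for k steps, as
    described in the text. The final hidden probabilities are used
    instead of binary samples to reduce the variance of the estimate."""
    v = data.copy()
    for _ in range(k):
        h = (rng.random((v.shape[0], b.size)) < sigmoid(v @ w + b)).astype(float)
        v = (rng.random((v.shape[0], a.size)) < sigmoid(h @ w.T + a)).astype(float)
    p_h = sigmoid(v @ w + b)
    return v.T @ p_h / v.shape[0]
```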

The Gibbs sampling algorithm has made it possible to train RBMs in some cases (e.g. the Netflix prize challenge amatriain2013big). On the contrary, unrestricted BMs are still out of reach even with this importance-sampling algorithm, due to the presence of hidden–hidden connections which make each step of the alternating Gibbs sampling much more difficult. Indeed, nowadays there is hardly any known application of BMs to real-sized problems. In principle, our findings can be extended to BMs as well.

From now on we focus on the RBM case. Indeed, even if fully connected BMs could gain a greater advantage from the quantum training process, RBMs are usually chosen as a benchmark due to their popularity in the classical framework.

Note that the biases also need to be trained, but their update requires no additional overhead if we can effectively estimate the negative part of the statistics. Such a property is explicitly discussed in Supplementary Note 1.

i.2 Quantum–trained RBM

Training an RBM is made easier by a device capable of producing, within a few operations, the configurations needed to estimate $\langle v_i h_j \rangle_{\text{model}}$. It has been suggested that the Adiabatic Quantum Computers (AQCs) produced by D-Wave Systems can extract binary configurations distributed according to the expression of $P(\mathbf{v}, \mathbf{h})$ in Equation 1 denil2011toward.

AQCs are quantum processors that tackle optimization problems by driving a quantum system from a superposition of classical states to a frozen state corresponding to an approximate solution boixo2016computational, johnson2011quantum, farhi2002quantum, dickson2013thermally.

The qubits composing the processor are forced to evolve according to an Ising Hamiltonian that encodes the problem to solve. The total Hamiltonian implemented on D-Wave devices takes the expression dwave2019technical, morita2008mathematical, hajek1988cooling:

$\mathcal{H}(t) = -A(s(t)) \sum_i \sigma_i^x + B(s(t)) \, \mathcal{H}_p, \qquad \mathcal{H}_p = \sum_{(i,j)} J_{ij} \sigma_i^z \sigma_j^z + \sum_i h_i \sigma_i^z$     (6)

where $t$ is the physical time, $\sigma_i^x$ and $\sigma_i^z$ are the Pauli matrices that act along the $x$ and $z$ directions, and $J_{ij}$, $h_i$ are real parameters chosen by the user. The annealing process ideally returns the classical spin configuration that minimizes $\mathcal{H}_p$.

Figure 3: Visual representation of the process according to both the quantum forward and reverse annealing schedules. Each bin of the gray 2-D histograms represents a possible configuration produced by the AQC. The height of a bin represents the probability that the corresponding configuration is the output of a single annealing process. The colored surface is a representation of the problem Hamiltonian in Equation 8. Each configuration corresponds to a different value of $\mathcal{H}_p$. For clarity, the problem is reduced to only two dimensions, and neighboring configurations differ by a single bit. A further graphical simplification consists of smoothing the surface to make it continuous. a, During forward annealing, the $\sigma^z$ component of the annealing Hamiltonian increases from 0 to its maximum value. This behavior is represented by the increase in the weight of $B(s)$. The configurations histogram, initially almost flat, becomes more and more peaked around the configurations that minimize $\mathcal{H}_p$. b, During reverse annealing, the $\sigma^z$ component of the annealing Hamiltonian (Equation 6) is initially set at its maximum value and the transverse $\sigma^x$ component is null. It is then possible to initialize the system such that a single configuration has probability almost equal to 1 (apart from noise effects). The annealing is then performed in reverse, lowering the $\sigma^z$ components and raising the $\sigma^x$ components in the annealing Hamiltonian. The system partially relaxes, and configurations close to the initial one become populated according to the corresponding value of $\mathcal{H}_p$, thanks to quantum tunneling and thermal effects. The final step consists of standard forward annealing.

The function $s(t)$ is called the annealing schedule. During the usual annealing schedule, $s(t) = t/\tau$, where $\tau$ is the total time of the annealing process, and thus $s$ increases linearly from 0 to 1. Such a schedule is usually called forward annealing, and its shape is represented in Figure 1 a. Figure 3 a shows how $A$ and $B$ vary with respect to time. In the case of the D-Wave 2000Q™ System processor, the user can specify a few distinct points during the annealing where $s$ can be set to any value in $[0, 1]$ dwave2017reverse. In particular, the schedule is called reverse annealing if $s = 1$ at the start of the annealing. The value $s = 1$ corresponds to $A(s) = 0$, which means the qubits are not coupled to any transverse field. As a consequence, each qubit can be put in a classical state, spin up or spin down, because spin flips are forbidden dwave2017reverse. Next, the parameter $s$ is reduced, therefore allowing spin flips. Figure 1 a presents the shape of a reverse annealing schedule. Figure 3 b shows how $A$ and $B$ vary with respect to time in this case.
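In practice, a reverse annealing schedule can be described as a piecewise-linear list of (time, s) points, which is the form accepted by the D-Wave interface; the holding value and times below are purely illustrative, not those used in the experiments:

```python
def reverse_schedule(s_hold=0.45, t_relax=2.0, t_hold=2.0, t_quench=2.0):
    """Piecewise-linear reverse annealing schedule as (time_us, s) pairs:
    start fully classical at s = 1, relax down to s_hold, pause there,
    then anneal forward back to s = 1. All parameter values are
    illustrative placeholders."""
    t1 = t_relax
    t2 = t1 + t_hold
    t3 = t2 + t_quench
    return [(0.0, 1.0), (t1, s_hold), (t2, s_hold), (t3, 1.0)]
```

On D-Wave hardware such a list would be submitted together with the chosen classical initial state of the qubits, which is what enables the semantic initialization discussed in the text.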

Previous works have exploited reverse annealing to boost non-negative matrix factorization ottaviani2018low, to enhance genetic algorithms king2019quantum, and to solve portfolio optimization problems venturelli2019reverse.

In general, mapping a problem on an AQC and running an annealing cycle returns a configuration that is not the ground state of $\mathcal{H}_p$. The reason resides in the sources of noise affecting the hardware dumoulin2014challenges, dwave2019technical, which cause the violation of the hypotheses of the adiabatic theorem. Dumoulin et al. dumoulin2014challenges observe that the erroneous, higher-energy configurations produced by the AQC are approximately distributed as if they were extracted from a Boltzmann distribution at an effective dimensionless temperature $T_{\text{eff}}$:

$P(\mathbf{x}) = \dfrac{e^{-E_{\mathbf{x}}/T_{\text{eff}}}}{\sum_{\mathbf{x}'} e^{-E_{\mathbf{x}'}/T_{\text{eff}}}}$     (7)

where $E_{\mathbf{x}}$ is the eigenvalue corresponding to the eigenstate $|\mathbf{x}\rangle$ of the Hamiltonian $\mathcal{H}_p$ (see also amin2015searching, raymond2016global). The effective temperature is a real parameter determined by the presence of thermal noise inside the chip dwave2019technical. Its time evolution is considered difficult to forecast adachi2015application, benedetti2016estimation. The key idea consists of exploiting such a property to produce configurations that are distributed appropriately for the RBM training.

i.2.1 Mapping the RBM on an adiabatic quantum computer

We now show how to set the parameters of a D-Wave AQC such that the configurations produced are distributed according to Equation 7 at the effective temperature $T_{\text{eff}}$.

First, we introduce two sets of qubits, named $V$ and $H$. The set $V$ contains $N_v$ qubits, each one corresponding to a visible unit. $H$ contains $N_h$ qubits corresponding to the hidden units. $\mathcal{H}_p$ is rewritten as

$\mathcal{H}_p = \sum_{(i,j)} J_{ij} \sigma_i^z \sigma_j^z + \sum_{i \in V} h_i \sigma_i^z + \sum_{j \in H} h_j \sigma_j^z$     (8)

where $J_{ij}$, $h_i$, $h_j$ are real parameters. The summation in the first term is to be performed over the set of couplers linking visible and hidden qubits.

Note that the RBM energy functional in Equation 2 now appears very close in shape to $\mathcal{H}_p$ in Equation 8. Indeed, the two expressions can be mapped into each other if a suitable rescaling of the parameters is performed. In particular, note that the units of the RBM assume binary values in $\{0, 1\}$, while the eigenvalues of $\sigma^z$ belong to $\{-1, +1\}$. We can recover the same energy levels in the two expressions by the following mapping:

$J_{ij} = -\dfrac{w_{ij}}{4 \hat{T}_{\text{eff}}}, \qquad h_i = -\dfrac{1}{\hat{T}_{\text{eff}}}\left(\dfrac{a_i}{2} + \sum_j \dfrac{w_{ij}}{4}\right), \qquad h_j = -\dfrac{1}{\hat{T}_{\text{eff}}}\left(\dfrac{b_j}{2} + \sum_i \dfrac{w_{ij}}{4}\right)$     (9)

where $\hat{T}_{\text{eff}}$ is an experimental estimate of the parameter $T_{\text{eff}}$ that appears in Equation 7. If the AQC parameters are initialized as in Equation 9, each annealing cycle samples a probability distribution similar to that appearing in Equation 7 adachi2015application, benedetti2016estimation. In other words, the AQC produces a configuration needed for the training at the cost of a single quantum operation.
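The mapping of Equation 9 takes a few lines; the sketch below assumes the $\{0,1\} \to \{-1,+1\}$ convention $v = (1 + s)/2$ and uses our own function names:

```python
import numpy as np

def rbm_to_ising(w, a, b, t_eff):
    """Equation 9: map RBM parameters (units in {0, 1}) to Ising
    couplings and fields (spins in {-1, +1}) under v = (1 + s) / 2,
    rescaled by the estimated effective temperature t_eff."""
    J = -w / (4.0 * t_eff)
    h_vis = -(a / 2.0 + w.sum(axis=1) / 4.0) / t_eff
    h_hid = -(b / 2.0 + w.sum(axis=0) / 4.0) / t_eff
    return J, h_vis, h_hid
```

A brute-force check on a toy RBM confirms that, with this mapping, $E(\mathbf{v}, \mathbf{h}) - \hat{T}_{\text{eff}} \, \mathcal{H}_p(\mathbf{s})$ is constant over all configurations, i.e. the two energy landscapes coincide up to an offset.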

i.3 Estimation of the negative statistics by forward annealing

The estimation of $\langle v_i h_j \rangle_{\text{data}}$ is fairly easy on a classical processor; therefore, we look for the quantum advantage in the estimation of $\langle v_i h_j \rangle_{\text{model}}$.

The customary quantum algorithm for the estimation of $\langle v_i h_j \rangle_{\text{model}}$ adachi2015application, benedetti2016estimation assumes that the value of $T_{\text{eff}}$ can be considered constant over time. Such an assumption is tricky, since it has been proved that a wrong guess for $T_{\text{eff}}$ can lead to severe errors during training benedetti2016estimation. Nonetheless, a reasonable guess allows an RBM to be trained effectively benedetti2016estimation, benedetti2017quantum, henderson2019leveraging.

We trained an RBM composed of 16 visible units and 16 hidden units on the Bars and Stripes (BAS) dataset defined in mackay2003information. The dataset contains images composed of black or white pixels arranged in monochromatic bars or stripes. Since our RBM has 16 visible units, we used $4 \times 4$ images. Figure 1 b shows the structure of the chosen RBM.

We trained the RBM five times using quantum forward annealing. Each instance was initialized with random weights and biases extracted from the same probability distribution. The weight initialization spread is the standard deviation of the Gaussian used to randomly generate the weights and the biases. The weight initialization range defines the symmetric interval centered at zero that truncates the Gaussian from which the weights and the biases are sampled. The amplitude of the interval is bounded below by the S/N ratio due to both the uncertainty in the values of the currents and the random fluctuations of the chip. It is also bounded above by the maximum values of the currents. Among the amplitudes tested, the chosen one was sufficiently large to overcome the noise on the weights and sufficiently small not to reach the maximum values allowed for the currents. Supplementary Figure 1 presents the learning curve for an RBM initialized with smaller weights. The result shows that, if the aforementioned boundaries are respected, the initial distribution of the weights could have a reduced impact on the learning speed.

For each run we set the same fixed value of $\hat{T}_{\text{eff}}$. Figure 2 b shows the results obtained by averaging over the five runs, in terms of the average Log-Likelihood $\mathcal{L}$ of the dataset, defined as follows:

$\mathcal{L} = \dfrac{1}{N_D} \sum_{\mathbf{v} \in D} \log P(\mathbf{v})$     (10)

where $P(\mathbf{v})$ is calculated as it appears in Equation 7, with the hypothesis that $T_{\text{eff}} = 1$. The most expensive step in computing $\mathcal{L}$ is obtaining the explicit expression of the partition function of the distribution (i.e. the denominator in Equation 7). The number of units in the RBM (32 in total) is sufficiently low that we could calculate $\mathcal{L}$ without resorting to approximations.
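For a network of this size, the exact average Log-Likelihood of Equation 10 can be computed by summing out the hidden units analytically and enumerating the $2^{N_v}$ visible states; a sketch at $T = 1$, with our own naming:

```python
import itertools
import numpy as np

def log_likelihood(data, w, a, b):
    """Exact average Log-Likelihood of Equation 10 at T = 1, feasible
    only for small RBMs. With hidden units summed out analytically,
    log P(v) = F(v) - log Z, where F(v) = a.v + sum_j log(1 + exp(b_j + (vW)_j))
    and log Z is obtained by enumerating all 2**N_v visible states."""
    def log_unnorm(v):
        return a @ v + np.sum(np.logaddexp(0.0, b + v @ w))
    all_v = itertools.product([0.0, 1.0], repeat=a.size)
    log_z = np.logaddexp.reduce([log_unnorm(np.array(v)) for v in all_v])
    return float(np.mean([log_unnorm(v) - log_z for v in data]))
```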

We also trained the RBM five times, for the same number of epochs, using the classical algorithm. The instances were initialized with the weights already used in the forward annealing case, so that for each quantum instance there is a classical instance that used the same weight initialization.

i.4 Estimation of the negative statistics by reverse annealing

To introduce a semantic search emulating, on a quantum adiabatic computer, the initialization of the Gibbs sampling of the classical case, we adopt the reverse annealing schedule. To our knowledge, such an initialization method has never been used to train an RBM on an AQC. Indeed, the customary quantum annealing method allows a unique system initialization, corresponding to the equally probable quantum superposition of all the classical states. On the contrary, reverse annealing allows us to perform a quantum search in the neighborhood of the configuration set by the user as the initial state. It thus constitutes a possible way to emulate the initialization of the Gibbs sampling in the classical case.

We modified the usual quantum algorithm for the estimation of $\langle v_i h_j \rangle_{\text{model}}$, which exploited forward annealing, to use a sampling procedure based on reverse annealing (see Supplementary Note 2 for the original algorithm).

Input: Number of visible units $N_v$, number of hidden units $N_h$, desired number of iterations, initial weights $w_{ij}$, initial biases $a_i$ and $b_j$, annealing schedule $s(t)$ such that $s(0) = 1$, training dataset $D$ composed of $N_D$ elements. We suppose $\hat{T}_{\text{eff}}$ is known for each step.
Output: Matrix of dimensions $N_v \times N_h$ that contains the estimates for $\langle v_i h_j \rangle_{\text{model}}$.
Functions: eval returns values for $J_{ij}$, $h_i$ and $h_j$ according to Eq. 9; initialize_qubits initializes each visible unit of the RBM with the corresponding pixel in the $n$-th element of $D$, while the hidden units are updated with a single step of Gibbs sampling, then the qubits of the AQC are initialized with the corresponding values mapped into $\{-1, +1\}$; anneal uses quantum annealing with $\mathcal{H}_p$ as final Hamiltonian (Eq. 8) and $s(t)$ as annealing schedule; measure retrieves the binary value of a qubit after the annealing process ends.
1 for each element $n$ of the dataset $D$ do
2       for each annealing cycle in the group do
3             initialize_qubits; eval; anneal; measure the visible qubits; measure the hidden qubits
Algorithm 1 Quantum algorithm for the negative statistics estimation using reverse annealing

Compared to the usual procedure, the annealing cycles are performed in separate groups. For each group, the qubits corresponding to visible units are initialized with an element from the dataset, while the qubits corresponding to hidden units are initialized with binary values classically computed with a single step of Gibbs sampling. After all elements of the dataset have been used to initialize the annealing process, the output configurations are used to estimate $\langle v_i h_j \rangle_{\text{model}}$.
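The grouped estimation loop of Algorithm 1 can be sketched classically by replacing the quantum sampler with a stub; everything below (names, the echo-back stub, the number of reads) is illustrative scaffolding, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_reverse(v0, h0, n_reads):
    """Stand-in for the AQC call. On hardware this would submit the
    embedded Ising problem with a reverse annealing schedule and an
    initial classical state built from (v0, h0); here it simply echoes
    the initial state so the surrounding pipeline can be exercised."""
    return np.tile(v0, (n_reads, 1)), np.tile(h0, (n_reads, 1))

def negative_statistics_reverse(data, w, b, reads_per_item=10):
    """Grouped estimation of <v_i h_j>_model as in Algorithm 1: one
    group of annealing cycles per dataset element, visible qubits
    initialized with the data, hidden qubits with one classical Gibbs
    step, then v_i h_j averaged over every read."""
    acc = np.zeros_like(w)
    for v0 in data:
        h0 = (rng.random(b.size) < sigmoid(b + v0 @ w)).astype(float)
        vs, hs = sample_reverse(v0, h0, reads_per_item)
        acc += vs.T @ hs / reads_per_item
    return acc / data.shape[0]
```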

Figure 1 a reports a schematic overview of the different methods presented in the article that can be used to train an RBM.

We trained the RBM twice using reverse annealing. The instances were initialized with two weight sets randomly chosen from the five already used in the classical and forward annealing cases. We chose the same $\hat{T}_{\text{eff}}$ as in the forward annealing case.

Figure 2 b shows the results in terms of the average Log–Likelihood of the dataset, compared with the results obtained in the forward annealing case.

II Discussion

As one might expect, thanks to the full connectivity based on the embedding on virtual qubits, our forward annealing approach obtains Log-Likelihood scores appreciably better than sparse implementations dumoulin2014challenges, benedetti2016estimation. In contrast to Ref. benedetti2016estimation, where an RBM is implemented with only a fraction of the visible-to-hidden connections, our implementation realizes all connections. Our forward annealing method obtains a better score than the best forward annealing-based result of Ref. benedetti2016estimation, which was optimized step-by-step by a temperature estimation tool. Even without using any estimation technique, the higher number of connections was sufficient to gain a great advantage over the sparse implementation. For the sake of completeness, the same sparse embedding of Ref. benedetti2016estimation has been tested at a fixed temperature (see Supplementary Figure 2). As pointed out in Ref. dumoulin2014challenges, the complexity of the topology of the AQC seems to be a major bottleneck for the implementation of RBMs.

Figure 4: The distributions presented are reconstructed from the weights obtained during the training processes presented in Figure 2. a, Each colored band represents the probability that the corresponding image of the dataset is sampled from the Boltzmann distribution defined by the RBM at that epoch; each plot thus contains one band per image. The green line represents the probability that any of the images is sampled at a given epoch. The black line represents the probability that any of the less probable images is sampled at a given epoch. b and c represent the distributions produced by the D-Wave machine using forward annealing (b, in blue) and reverse annealing (c, in light blue) at regular intervals of epochs, starting from the same weight initialization used in a. In both b and c, the Boltzmann distribution at $T = 1$ corresponding to the instantaneous value of the RBM weights is represented in orange. The qualitative shape of the annealing schedule is also shown as a reference.

The value was chosen for the forward annealing case because it achieves the best Log–Likelihood score at epoch among a range of tested values. We kept this poor man’s choice of also in the reverse annealing case, and we set the shape of

without further exploration of the hyperparameter space of the annealing schedule. Despite such a lack of fine optimization, the reverse annealing performs even better than the forward annealing up to epoch

, thus proving capable of speeding up the learning in the initial phase, while remaining comparable at epoch , if we limit the evaluation to the Log–Likelihood.
We now turn our attention to the differences among the energy distributions produced during learning by the two paradigms. Both the forward and reverse annealing approaches initially produce a distribution corresponding to (Figure 4 b and 4 c), since the quantum-sampled distributions are visibly cooler than the theoretical distribution calculated at . Such behavior is a consequence of the choice to keep fixed. The value of was chosen to maximize the score at iteration for the forward annealing case, so the same value is not necessarily optimal during the first iterations.

At epoch , the lowest-energy configurations in the forward case lie at energy , while in the reverse case the lowest energy reached at the same epoch is ; the values are dimensionless, as follows from the definition of (Equation 2). Although the two cases share the same learning rate , the training based on reverse annealing modifies the RBM weights so that some configurations correspond to considerably lower energies. We now explore the consequences of such behavior.

Let us now consider the configurations whose visible units correspond exactly to one of the elements of the dataset. We define the set composed of such configurations. is a subset of the set of all possible configurations, and it is the most relevant one for evaluating learning. Figure 4 a shows the probability that, sampling at T=1 from the Boltzmann distribution defined by the weights of the RBM, we obtain a configuration in . Figure 4 shows how such probability evolves as a function of the training epochs for the forward, reverse, and classical cases, respectively. It also shows the evolution of the probability associated with each element of the dataset, dividing the plot into distinct bands.
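For a model of this size, the probability mass that the RBM assigns to the dataset configurations can again be computed exactly. A minimal sketch (our own naming; brute-force enumeration, feasible only for a few visible units):

```python
import numpy as np
from itertools import product

def dataset_probability(W, b, c, dataset):
    """Total Boltzmann probability at T=1 that a sampled visible
    configuration belongs to the dataset (the set S in the text).
    Brute-force enumeration over all 2**n_v visible configurations."""
    def free_energy(v):
        return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W))
    n_v = len(b)
    all_v = np.array(list(product([0, 1], repeat=n_v)), dtype=float)
    logZ = np.logaddexp.reduce([-free_energy(v) for v in all_v])
    # P(v) = exp(-F(v)) / Z, summed over the dataset elements
    return float(sum(np.exp(-free_energy(v) - logZ) for v in dataset))
```

As a sanity check, an untrained RBM with all parameters at zero assigns a uniform probability, so a dataset of k images out of 2**n_v configurations gets total mass k / 2**n_v.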

In the classical learning case, the probabilities appear more homogeneous than in the quantum cases. The training with forward annealing proceeds at a slower pace, but all the probabilities grow steadily. The training with reverse annealing clearly results in an increased overall probability for configurations belonging to the set . In particular, the probability of extracting a configuration in from the RBM trained with reverse annealing is approximately double that of the forward-trained case. Nonetheless, the average Log-Likelihood is similar in the two cases. The reason resides in the low probability associated with some configurations by the RBM trained with reverse annealing. The black line in Figure 4 a represents the sum of the probabilities of the less probable images. Since the BAS dataset is composed of images, the black line divides the dataset into two halves that are not equally represented. The RBM trained with reverse annealing is the best of the three cases in terms of the total probability of the dataset, at the cost of a lower probability for the less represented images. Such behavior could be connected to the fact that reverse annealing has a higher probability of producing configurations similar to those appearing in the dataset.

The different sampling distribution brings two competing consequences. At the beginning of the learning, the gradient estimation is influenced more by the configurations belonging to the dataset, which lowers their energy faster. This effect is exactly what the reverse annealing was intended for (Figure 4 a). Later during learning, some configurations belonging to are driven to low energies by the weight updates, and their probability is heavily underestimated. The training algorithm does not capture the relevance that such configurations have in the partition function, and thus allows their energy to decrease further at each step.
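The gradient step itself is the standard RBM update; what differs between the classical, forward-annealing, and reverse-annealing cases is only where the negative-phase samples come from. A hedged sketch of that update (our own naming, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_update(W, b, c, v_data, v_model, lr=0.1):
    """One in-place gradient step on the RBM parameters.
    Positive phase: expectations over the data batch `v_data`.
    Negative phase: expectations over `v_model`, the visible
    configurations returned by the sampler (Gibbs chain, forward
    or reverse annealing -- the update rule is identical)."""
    h_data = sigmoid(v_data @ W + c)    # E[h|v] on the data
    h_model = sigmoid(v_model @ W + c)  # E[h|v] on the model samples
    W += lr * (v_data.T @ h_data / len(v_data)
               - v_model.T @ h_model / len(v_model))
    b += lr * (v_data.mean(0) - v_model.mean(0))
    c += lr * (h_data.mean(0) - h_model.mean(0))
    return W, b, c
```

If the sampler over-represents dataset-like configurations, the negative phase partially cancels the positive phase on those configurations while leaving rarely sampled low-energy configurations uncorrected, which is the underestimation mechanism described above.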

The sampling described above may suffer from the extraction being too tightly bound to the dataset elements (which could be overcome by optimizing the annealing schedule ) or from a high sampling temperature, which in turn could be overcome by optimizing .

As a last consideration, we suggest evaluating an alternative, complementary figure of merit alongside the average Log–Likelihood to score a model. A potential issue with the Log–Likelihood manifests itself when some images have low probability. They lower the Log–Likelihood score, but this may not be informative about the capability to reconstruct data. Indeed, during reconstruction part of the visible units are clamped, so many of the other configurations have zero probability of being produced by the RBM. A better scoring method should only consider the ability of the final RBM to reconstruct each element of the training set. In our case, the RBMs have been trained on the whole dataset, so the training set and the test set coincide.

Figure 2 a presents the reconstruction score, i.e. the probability that an RBM reconstructs one of the four central pixels of a BAS image, averaged over the four pixels and over each image in the set. The reconstruction is performed by keeping the outer pixels clamped to their correct values. The number of Gibbs sampling steps is . Note that this choice for the clamped units dramatically reduces the probability that configurations corresponding to different images in the dataset take part in the reconstruction process, since for each choice of the outer pixels there is a single compatible dataset image. Each colored band represents the probability that the corresponding image of the dataset is sampled from the Boltzmann distribution associated with the RBM. The reconstruction score confirms the quality of the learning already assessed by the average Log–Likelihood. As expected, the low-probability images in the training with reverse annealing have a reduced impact on the score. At epoch , the reconstruction scores of the reverse and forward methods are statistically close, but reverse annealing exhibits better performance in earlier epochs (Figure 2 a).
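The clamped reconstruction described here amounts to a conditional Gibbs chain in which only the central pixels are resampled. The following is an illustrative sketch under our own conventions (pixel indices, step counts, and names are placeholders, not the paper's exact setup):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct_clamped(W, b, c, image, free_idx, n_gibbs=100, rng=None):
    """Reconstruct the pixels listed in `free_idx` while the remaining
    (outer) pixels stay clamped to their correct values. Returns the
    fraction of free pixels recovered after `n_gibbs` alternating
    Gibbs steps."""
    rng = rng or np.random.default_rng()
    v = image.astype(float).copy()
    v[free_idx] = rng.integers(0, 2, len(free_idx))  # randomize free pixels
    for _ in range(n_gibbs):
        # sample hidden units given the full visible vector
        h = (rng.random(len(c)) < sigmoid(v @ W + c)).astype(float)
        # resample only the unclamped visible units
        p_v = sigmoid(W @ h + b)
        v[free_idx] = (rng.random(len(free_idx)) < p_v[free_idx]).astype(float)
    return float(np.mean(v[free_idx] == image[free_idx]))
```

Averaging this fraction over the four central pixels and over every image in the dataset gives a score of the kind proposed above, one that is insensitive to how probability mass is shared among images.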

In general, the reverse annealing schedule introduces a meaningful search method in the learning process of an RBM on a quantum computer. Despite its slower learning rate compared to a classical machine in terms of epochs, one should remember that the total computational time is the product of the number of epochs and the computational time per epoch. Therefore, the advantage of the adiabatic quantum computer in managing RBMs, and more generally BMs, resides in its ability to be employed where conventional hardware fails, as soon as the number of qubits of the quantum processor can handle the size of the problem.

III Methods

III.1 Embedding the problem

To compare our results with the existing literature, an RBM is considered. Unlike the one unit–to–one qubit mapping, which is constrained by the topology of the quantum circuit to return a sparse RBM because of the missing connections benedetti2016estimation, we apply an embedding technique to synthesize a full RBM. The graph of the chosen embedding is shown in Supplementary Figure 3. In such an embedding, each unit of the RBM is mapped onto four distinct physical qubits, which are bonded together by a strong negative coupling . We used the strongest available coupling, , in the dimensionless units used in the manual of the processor dwave2019technical. The strong ferromagnetic coupling forces the four qubits to behave as a single two-state system, i.e. as a single virtual qubit. Indeed, each configuration in which two of the four qubits are antiparallel is strongly energetically suppressed. This choice extends the connectivity of the hardware, allowing us to implement all the desired connections. The embedding is reported in Supplementary Note 3. In principle, we could have used a different value for . Supplementary Figure 4 shows that for the learning gains no apparent advantage. We thus chose to use throughout the work to minimize the probability of breaking the virtual qubits.
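The chain construction can be sketched as below. Qubit labels and the chain coupling value are placeholders of our own: on the real processor the chains follow the Chimera working graph, and the coupling is the strongest value allowed in the hardware's dimensionless units.

```python
def virtual_qubit_embedding(n_units, chain_length=4, j_chain=-1.0):
    """Build the intra-chain couplers that tie `chain_length` physical
    qubits into one virtual qubit per RBM unit. Qubit labels are
    consecutive integers here, a stand-in for real Chimera coordinates,
    which depend on the processor's working graph."""
    chains = {u: list(range(u * chain_length, (u + 1) * chain_length))
              for u in range(n_units)}
    couplers = {}
    for qubits in chains.values():
        # ferromagnetic bonds along the chain keep its qubits aligned
        for q1, q2 in zip(qubits, qubits[1:]):
            couplers[(q1, q2)] = j_chain
    return chains, couplers
```

The logical RBM couplings are then placed between qubits belonging to different chains, which is what restores the full bipartite connectivity despite the sparse physical graph.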

As the embedding requires qubits, we could in principle embed the problem in distinct positions on the superconductive chip, since the total number of qubits equals . Unfortunately, due to faulty qubits, the problem can be mapped in only distinct positions. As a consequence, each annealing cycle produces samples. Running processes in parallel not only significantly reduces the computational cost per sample, but also helps to reduce the impact of systematic error sources that can be present in the AQC.

III.2 Weight initialization and parameters of the learning process

Weights and biases of the RBM were initialized by extracting random values from a Gaussian distribution with

, , truncated in . The learning rate was set to . In the forward annealing case, an annealing time of s was chosen. For the reverse annealing case, we set a reverse step of s, followed by a pause of s and a final s of forward annealing. During the pause .
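A reverse annealing schedule of this shape (ramp down, pause, ramp up) is typically passed to D–Wave samplers as a piecewise-linear list of (time, s) pairs via the `anneal_schedule` parameter. A sketch with placeholder arguments, since the paper's exact timings are not reproduced here:

```python
def reverse_schedule(t_reverse, s_target, t_pause, t_forward):
    """Piecewise-linear reverse-annealing schedule as (time, s) pairs:
    start at s=1, ramp down to `s_target` over `t_reverse`, hold for
    `t_pause`, then anneal forward back to s=1 over `t_forward`.
    All numeric inputs are placeholders, not the paper's values."""
    t1 = t_reverse
    t2 = t1 + t_pause
    t3 = t2 + t_forward
    return [(0.0, 1.0), (t1, s_target), (t2, s_target), (t3, 1.0)]
```

The initial state supplied alongside this schedule is what carries the semantic information: each anneal starts from a configuration built from a training datum rather than from the transverse-field ground state.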

For the classical case, the training was performed with . For the quantum case, we performed annealing cycles in each training epoch, both for forward and reverse annealing. Considering that each annealing cycle outputs configurations, this amounts to configurations.

III.3 Temperature estimation

For the annealing to produce configurations according to the Boltzmann distribution at , one needs to properly rescale the parameters of the RBM before passing them to the AQC (Equation 9). It is known that the best results are obtained by fine-tuning the parameter through an estimation of the effective temperature at every stage of the training benedetti2016estimation. Nonetheless, the training can be effective even assuming that is constant during training. In such a case, however, a wrong guess for can lead to severely suboptimal learning scores.
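Assuming the rescaling of Equation 9 amounts to dividing every parameter by the assumed effective inverse temperature (the exact convention may differ), the preprocessing step can be sketched as:

```python
import numpy as np

def rescale_for_aqc(W, b, c, beta_eff):
    """Rescale RBM parameters by a constant effective inverse
    temperature before programming them on the annealer, so that
    samples drawn at the hardware's physical temperature approximate
    the Boltzmann distribution at T=1. A wrong `beta_eff` leaves the
    samples too hot or too cold -- the failure mode noted in the text."""
    return W / beta_eff, b / beta_eff, c / beta_eff
```

With a fixed `beta_eff`, this is a one-line preprocessing step per epoch; a dynamic estimation would instead re-measure `beta_eff` from the returned samples at every iteration.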

We decided to avoid a dynamic estimation of in order to simplify the experiment design, to test the robustness of the approach, and to keep the focus on the differences between the forward and reverse annealing procedures. We initialized an RBM with random weights and biases extracted from the probability distribution defined in the previous section and trained it with forward annealing using several different values of the rescaling parameter . The value resulted in the best score after iterations. Note that such a value is significantly different from others reported in the literature for the sparse RBM, as in Ref. benedetti2016estimation, which we confirmed in our experiment (Supplementary Figure 3). The difference is attributed to the use of virtual qubits composed of four physical qubits bound by strong negative couplings. Indeed, higher values of the couplings between qubits can influence the freezing process and thus change amin2015searching; korenkevych2016benchmarking.

We trained an RBM with the sparse embedding of Ref. benedetti2017quantum, which indeed resulted in as the best value for the rescaling parameter. The corresponding learning curve is shown in Supplementary Figure 2. We also evaluated the fully connected RBM with initial weights and biases extracted from a Gaussian with , , truncated in (i.e. weights and biases smaller than those used to produce the plots in Figure 1 a and b). In such a case, the best value of among those tested is . The corresponding learning curve is shown in Supplementary Figure 1.

Note that the currents inside the device have maximum and minimum programmable values. It is thus possible that during learning the weights to be passed to the AQC exceed these boundaries. In our implementation, weights and biases never reached values close to the boundaries.

IV Conclusions

Boltzmann Machines can be trained with an algorithm that exploits AQCs in the spirit of achieving a computational advantage over the classical method. We showed that the use of embedding techniques preserves the quality of the produced configurations, and thus results in significantly better scores thanks to the increased connectivity. As opposed to the usual algorithm based on a forward annealing schedule, we implemented a semantic quantum search based on a reverse annealing schedule. We showed that such an algorithm quickly raises the sampling probability of the subset of the configuration set corresponding to elements of the dataset, and can achieve good reconstruction scores in slightly fewer training epochs than forward annealing. Reverse annealing captures the benefit of starting the algorithm by exploiting the full information provided by the elements of the dataset. It leads to a sampling probability of the elements of the dataset which is double that of forward annealing and higher than that of the classical method.

Our results, combined with hyperparameter optimization of the annealing schedule and temperature estimation techniques, pave the way towards the exploitation of both restricted and unrestricted Boltzmann Machines as soon as new generations of hardware with increased connectivity become available.

V Acknowledgements

EP gratefully acknowledges D–Wave for having partially granted access to the D–Wave 2000Q™ System processor, used to obtain the experimental results presented in this article. EP and LR would like to thank Daniele Ottaviani and Carlo Cavazzoni of CINECA for useful discussions.

VI Data availability

The data that support the findings of this study are available from the corresponding Author upon reasonable request.

VII Competing interests

The Authors declare that there are no competing interests.

VIII Author contributions

LR developed the quantum algorithm and analysed the data, CD contributed to theoretical aspects of Gibbs sampling and quantum dynamics, EP conceived and coordinated the research. All the Authors discussed the results and contributed to the writing of the manuscript.