F-Divergences and Cost Function Locality in Generative Modelling with Quantum Circuits

10/08/2021 ∙ by Chiara Leadbeater, et al. ∙ Cambridge Quantum Computing Ltd

Generative modelling is an important unsupervised task in machine learning. In this work, we study a hybrid quantum-classical approach to this task, based on the use of a quantum circuit Born machine. In particular, we consider training a quantum circuit Born machine using f-divergences. We first discuss the adversarial framework for generative modelling, which enables the estimation of any f-divergence in the near term. Based on this capability, we introduce two heuristics which demonstrably improve the training of the Born machine. The first is based on f-divergence switching during training. The second introduces locality to the divergence, a strategy which has proved important in similar applications in terms of mitigating barren plateaus. Finally, we discuss the long-term implications of quantum devices for computing f-divergences, including algorithms which provide quadratic speedups to their estimation. In particular, we generalise existing algorithms for estimating the Kullback-Leibler divergence and the total variation distance to obtain a fault-tolerant quantum algorithm for estimating another f-divergence, namely, the Pearson divergence.




I Introduction

One of the most challenging technological questions of our time is whether existing quantum computers can achieve quantum advantage in tasks of practical interest. Variational quantum algorithms (VQAs), which are well suited to the constraints imposed by existing devices, have emerged as the leading strategy for achieving such a quantum advantage [1, 2, 3, 4].

In VQAs, a problem-specific cost function, which typically consists of a functional of the output of a parameterised quantum circuit, is efficiently evaluated using a quantum computer. Meanwhile, a classical optimiser is leveraged to train the circuit parameters in order to minimise the cost function. This hybrid quantum-classical approach is robust to the limited connectivity and qubit count of existing devices, and, by restricting the circuit depth, also provides an effective strategy for error mitigation.

Given their flexibility, VQAs have been proposed for a vast array of applications. Of particular relevance are applications of VQAs to machine learning problems, including classification [5, 6, 7, 8, 9, 10], data compression [11, 12, 13], clustering [14], generative modelling [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32], and inference [33].

In this paper, we focus on a hybrid quantum-classical approach to generative modelling using a Born machine [34]. We adopt an adversarial framework for this task, in which a Born machine (the ‘generator’) generates samples intended to follow the target distribution, while a binary classifier (the ‘discriminator’) attempts to distinguish between generated samples and true samples. This is sometimes referred to in the literature as a quantum generative adversarial network.

In a generalisation of existing approaches, we consider training the Born machine with respect to any f-divergence as a cost function. Well-known examples of f-divergences include the Kullback-Leibler divergence (KL), the Jensen-Shannon divergence (JS), the squared Hellinger distance ($H^2$), the total variation distance (TV), and the Pearson divergence ($\chi^2$). In the adversarial framework, it is straightforward to estimate the f-divergence: any such divergence is defined in terms of the density ratio of the target distribution and model distribution, which can be estimated using standard techniques via the output of the binary classifier [35]. On this basis, we propose a heuristic for training the Born machine, based on the idea of dynamically switching the f-divergence during training in order to optimise the rate of convergence and utilise favourable qualities of each one. We also propose a second heuristic, based on introducing locality into the f-divergence, motivated by the now well-established connection between locality and barren plateaus in VQA training landscapes [36, 37]. For both heuristics, we provide numerical evidence to suggest that they can lead to (sometimes significant) performance improvements, particularly in under- and over-parameterised circuits.

We conclude this paper with a discussion of the longer-term implications of quantum devices for computing the f-divergences between two probability distributions. In particular, we discuss the existence of quadratic speedups for the estimation of TV and KL shown by [38, 39, 40] and extend these results to an algorithm for estimating $\chi^2$, assuming access to a fault-tolerant quantum computer.

The remainder of this paper is organised as follows. In Section II, we begin by introducing generative modelling, Born machines, and f-divergences. In Section III, we then introduce the two training heuristics for the Born machine. In Section IV, we provide numerical results to demonstrate the performance of the heuristics. In Section V, we discuss the long-term implications of quantum devices for computing f-divergences. Finally, in Section VI, we offer some concluding remarks.

II Background

II.1 Generative Modelling

Generative modelling is an unsupervised machine learning task in which the goal is to learn the probability distribution which generates a given data set. More precisely, given access to i.i.d. samples $\{x_i\}_{i=1}^{N}$ from a target distribution $p$, the objective of generative modelling is to learn a model $q_\theta$, typically parameterised by a finite-dimensional parameter vector $\theta$, which closely resembles $p$. Generative models find applications in a wide range of problems, ranging from the typical modalities of machine learning such as text [41], image [42] and graph [43] analysis, to problems in active learning, reinforcement learning [45], medical imaging [46], physics [47], and speech synthesis [48].

Broadly speaking, one can distinguish between two main categories of generative model: prescribed models and implicit models [49, 50]. Prescribed models provide an explicit parametric specification of the distribution of the observed random variable $x$, directly specifying the density $q_\theta(x)$. An example of a prescribed model is the ubiquitous multivariate Gaussian distribution. Implicit models, on the other hand, specify only the stochastic procedure which generates samples. An example of an implicit model is a complex computer simulation of some physical phenomenon, for which the likelihood function cannot be computed. Since, in this case, one no longer models $q_\theta(x)$ directly, valid objectives can now only involve quantities (e.g., expectation values) which can be estimated efficiently using samples.

In the last three decades, a number of generative models, both explicit and implicit, have been proposed in the machine learning literature. These include autoregressive models [51, 52], normalising flows [53, 54, 55], variational autoencoders [56, 57], Boltzmann machines [58, 59, 60], generative stochastic networks [61], generative moment matching networks [62, 63], and generative adversarial networks [64]. These models are classically implemented using deep neural network architectures. In recent years, however, hybrid quantum-classical approaches based on parameterised quantum circuits have also gained traction [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32].

II.2 Born Machines as Implicit Generative Models

By directly exploiting Born’s probabilistic interpretation of quantum wave functions [65], it is possible to model the probability distribution of classical data using a pure quantum state. Such models are referred to as Born machines [34]. We are particularly interested in Born machines for which the quantum state is obtained via a parameterised quantum circuit (as opposed to, say, a continuous time Hamiltonian evolution). These are known as quantum circuit Born machines (QCBMs) [15, 16].

The use of QCBMs as generative models is in large part motivated by their expressiveness. Indeed, it is now well established that Born machines have greater expressive power than classical models, including neural networks [20] and partially matrix product states [66] (see also [19]). This means, in particular, that QCBMs can efficiently represent certain distributions which are classically intractable to simulate (e.g., [67, 68, 69]). These include those recently used in a demonstration of quantum supremacy [70].

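Let us consider a binary vector $x \in \{0, 1\}^n$, with $n$ the number of qubits. A QCBM takes a product state $|0\rangle^{\otimes n}$ as input and evolves it into a normalised output state $|\psi(\theta)\rangle = U(\theta)|0\rangle^{\otimes n}$ via a parameterised quantum circuit $U(\theta)$. One can generate $n$-bit strings according to

$$q_\theta(x) = |\langle x | \psi(\theta) \rangle|^2, \qquad (1)$$

where $|x\rangle$ are computational basis states; sampling from this distribution then consists of a simple measurement. Since we only have access to samples $x \sim q_\theta$ and not the probabilities $q_\theta(x)$ themselves, the Born machine can be regarded as an implicit generative model. We consider parameterised quantum circuits of the form

$$U(\theta) = \prod_{k=1}^{d} W_k V_k(\theta_k),$$

where $\{W_k\}$ is a set of fixed unitaries, $\{V_k(\theta_k)\}$ is a set of parameterised unitaries, and $d$ is the depth of the circuit. We also assume that the $V_k(\theta_k)$ are rotations through angles $\theta_k$, generated by Hermitian operators $G_k$ with eigenvalues $\pm 1/2$. In this case, one can compute partial derivatives of $q_\theta(x)$ using the parameter-shift rule [71], which reads

$$\frac{\partial q_\theta(x)}{\partial \theta_k} = \frac{1}{2} \left[ q_{\theta_k^+}(x) - q_{\theta_k^-}(x) \right],$$

where $\theta_k^\pm = \theta \pm \frac{\pi}{2} e_k$, with $e_k$ a unit vector in the $k$-th direction. More generally, this formula allows one to express the first-order partial derivative of an expectation of a function $g(x)$ as

$$\frac{\partial}{\partial \theta_k} \mathbb{E}_{x \sim q_\theta}[g(x)] = \frac{1}{2} \left( \mathbb{E}_{x \sim q_{\theta_k^+}}[g(x)] - \mathbb{E}_{x \sim q_{\theta_k^-}}[g(x)] \right).$$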
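As a minimal numerical illustration of the parameter-shift rule (a sketch under stated assumptions, not the authors' implementation), consider a hypothetical single-qubit Born machine in which the state $R_y(\theta)|0\rangle$ is measured in the computational basis, so that the probability of outcome 1 is $\sin^2(\theta/2)$. The circuit and the function names are illustrative assumptions. Evaluating the same probability at $\theta \pm \pi/2$ and halving the difference reproduces the analytic derivative $\sin(\theta)/2$.

```python
import math

def born_prob_one(theta: float) -> float:
    """Probability of measuring 1 after applying RY(theta) to |0> (assumed toy circuit)."""
    return math.sin(theta / 2.0) ** 2

def parameter_shift_grad(prob, theta: float) -> float:
    """Parameter-shift estimate: half the difference of the probability at theta +/- pi/2."""
    shift = math.pi / 2.0
    return 0.5 * (prob(theta + shift) - prob(theta - shift))

theta = 0.7
shift_grad = parameter_shift_grad(born_prob_one, theta)
analytic_grad = 0.5 * math.sin(theta)  # d/dtheta of sin^2(theta / 2)
print(shift_grad, analytic_grad)
```

On hardware the two shifted probabilities would be estimated from measurement samples rather than evaluated exactly, so the result would carry sampling noise.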


The major challenge in using any implicit generative model is designing a suitable objective function. As noted before, one cannot compute $q_\theta(x)$ directly, and thus valid objectives can only involve statistical quantities (e.g., expectations) which can be efficiently computed using samples. For generative models based on QCBMs, various objectives have been proposed, including moment-matching, maximum mean discrepancy, Stein and Sinkhorn divergences, and adversarial objectives based on the Kullback-Leibler divergence. In this paper, we propose a more general class of objective functions, namely f-divergences, for training QCBMs.

II.3 Adversarial Generative Modelling with f-Divergences

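Let $f : (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$ and strict convexity at 1. Suppose that $p(x) = 0$ whenever $q_\theta(x) = 0$. The f-divergence, or Csiszár divergence [72, 73], between $p$ and $q_\theta$ is defined as

$$D_f(p \| q_\theta) = \sum_x q_\theta(x) \, f\!\left( \frac{p(x)}{q_\theta(x)} \right).$$

Suppose instead that $q_\theta(x) = 0$ whenever $p(x) = 0$. Then the f-divergence can be written as

$$D_f(p \| q_\theta) = \sum_x p(x) \, f^*\!\left( \frac{q_\theta(x)}{p(x)} \right), \qquad (6)$$

where the conjugate function $f^*$ is defined as $f^*(t) = t f(1/t)$ (not to be confused with the Fenchel conjugate). In what follows, we will generally prefer this formulation, as it leads to simpler expressions.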
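The equivalence of the two formulations can be checked numerically. The sketch below assumes the conjugate $f^*(t) = t f(1/t)$ and uses the unstandardised generator $f(t) = t \log t$ as an example; the two distributions and all function names are hypothetical. With this generator, both expressions reduce to $\sum_x p(x) \log(p(x)/q(x))$.

```python
import math

def conjugate(f):
    """Return the conjugate generator f*(t) = t * f(1 / t)."""
    return lambda t: t * f(1.0 / t)

def divergence_via_f(f, p, q):
    """sum_x q(x) f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def divergence_via_conjugate(f, p, q):
    """sum_x p(x) f*(q(x) / p(x)), assuming p(x) > 0 everywhere."""
    f_star = conjugate(f)
    return sum(px * f_star(qx / px) for px, qx in zip(p, q))

def kl_generator(t):
    """Example generator with f(1) = 0."""
    return t * math.log(t)

p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.2, 0.5, 0.3]  # hypothetical model distribution

d1 = divergence_via_f(kl_generator, p, q)
d2 = divergence_via_conjugate(kl_generator, p, q)
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))
print(d1, d2, kl_direct)
```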

The function $f$ is called the generator of the divergence. For different choices of $f$, one obtains well-known divergences such as TV, KL, and $\chi^2$. In this paper, we investigate the effect of this choice on the training of a QCBM. To ensure a fair comparison, we assume that the generators are standardised and normalised such that $f'(1) = 0$ and $f''(1) = 1$ [74]. This ensures that $D_f(p \| q_\theta) \geq 0$ with equality if and only if $p = q_\theta$, even if $p$ and $q_\theta$ are unnormalised. Note that one can normalise and standardise any generator with $f''(1) > 0$ by subtracting a term linear in $t - 1$ and rescaling.
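One standard way to carry out such a standardisation and normalisation, stated here as a general property of f-divergences rather than as the authors' exact construction, is the following. For a twice-differentiable generator with $f(1) = 0$ and $f''(1) > 0$, define

$$\tilde{f}(t) = \frac{f(t) - f'(1)\,(t - 1)}{f''(1)},$$

so that $\tilde{f}(1) = 0$, $\tilde{f}'(1) = 0$ and $\tilde{f}''(1) = 1$. For normalised distributions the linear term contributes nothing, since $\sum_x q(x)\,(p(x)/q(x) - 1) = 0$, so the divergence changes only by the constant factor $1/f''(1)$. As a worked example, $f(t) = t \log t$ has $f'(1) = 1$ and $f''(1) = 1$, giving

$$\tilde{f}(t) = t \log t - t + 1.$$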

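We minimise the f-divergence using gradient-based methods. We thus require the derivative of $D_f(p \| q_\theta)$ with respect to $\theta_k$. Using the chain and the parameter-shift rules, it is straightforward to compute

$$\frac{\partial D_f(p \| q_\theta)}{\partial \theta_k} = \sum_x \frac{\partial q_\theta(x)}{\partial \theta_k} \, f^{*\prime}\!\left( \frac{q_\theta(x)}{p(x)} \right) = \frac{1}{2} \left( \mathbb{E}_{x \sim q_{\theta_k^+}}\!\left[ f^{*\prime}\!\left( \frac{q_\theta(x)}{p(x)} \right) \right] - \mathbb{E}_{x \sim q_{\theta_k^-}}\!\left[ f^{*\prime}\!\left( \frac{q_\theta(x)}{p(x)} \right) \right] \right), \qquad (10)$$

where $f^{*\prime}$ denotes the derivative of the conjugate generator, $\theta_k^\pm = \theta \pm \frac{\pi}{2} e_k$, and the ratio inside the expectations is evaluated at the current (unshifted) parameters $\theta$.

We summarise some well-known f-divergences, the conjugates of their generators, and their parameter-shift rules, in Tables 1 and 2. We also plot some of the conjugate generators in Figure 1.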

Figure 1: Conjugate (left panel) and derivative (right panel) of the generator for several f-divergences. All generators have been standardised with $f'(1) = 0$ and normalised with $f''(1) = 1$, except for TV.

Returning to Equation (10), it is clear that the problem of computing the gradient reduces to that of estimating the probability ratio $r(x) = q_\theta(x)/p(x)$. We choose to define $r(x)$ in this way since it is more natural when one is interested in writing the f-divergence in terms of $f^*$, as we do here. Note that in some literature the ratio is defined in the reverse manner by switching the probabilities. We can estimate the probability ratio from the output of a binary classifier [35]. Suppose we assign samples $x \sim p$ to one class, and samples $x \sim q_\theta$ to another class. Suppose, in addition, that one has access to an exact binary classifier $c(x)$, which outputs the probability that the sample $x$ originated from $p$. Then, assuming uniform prior probabilities for the two classes, it is straightforward to show via Bayes’ theorem that (see Section II.2 in [50])

$$r(x) = \frac{q_\theta(x)}{p(x)} = \frac{1 - c(x)}{c(x)}.$$

In practice, we do not have access to the exact classifier $c(x)$. However, under the assumption that we can efficiently sample from both distributions, we can train a classifier $c_\phi(x)$, parameterised by $\phi$, to distinguish between the two distributions. One can use any proper scoring rule to train the classifier [50]. A typical choice is the binary cross-entropy loss, given by

$$L(\phi) = -\mathbb{E}_{x \sim p}\left[ \log c_\phi(x) \right] - \mathbb{E}_{x \sim q_\theta}\left[ \log\left( 1 - c_\phi(x) \right) \right].$$

The classifier seeks to minimise this objective, which corresponds to low classification errors. We emphasise that, in this objective, $\theta$ is fixed at the current QCBM parameters. The resulting classifier approximates the probability ratio for the current QCBM as

$$r(x) \approx r_\phi(x) = \frac{1 - c_\phi(x)}{c_\phi(x)}.$$

This can be plugged into Equation (10) to approximate the gradient. With this in mind, we define the cost function for the QCBM as

$$C(\theta) = \mathbb{E}_{x \sim q_\theta}\left[ f^{*\prime}\!\left( r_\phi(x) \right) \right],$$

where now the parameters $\phi$ of the classifier are fixed and the argument of the expectation value is independent of $\theta$. The adversarial generative modelling can be regarded as the following optimisation problem

$$\min_\phi L(\phi), \qquad \min_\theta C(\theta),$$

where the required expectation values are estimated from samples. In principle, the classifier can be trained to optimality in order to provide the best possible ratio for the generative model. Alternatively, the two objective functions can be optimised in tandem, using alternating gradient descent steps or a two-timescale gradient descent scheme [75].
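The link between an exact classifier and the probability ratio can be sketched in a few lines. The convention used here is an assumption made for illustration: the classifier output is taken to be the posterior probability that an outcome came from the target $p$ under equal class priors, and the ratio is taken as $q(x)/p(x)$. The distributions and function names are hypothetical.

```python
def exact_classifier(p, q):
    """Posterior probability that each outcome came from p, under equal class priors."""
    return [px / (px + qx) for px, qx in zip(p, q)]

def ratio_from_classifier(c):
    """Recover q(x) / p(x) from the classifier output c(x) = p(x) / (p(x) + q(x))."""
    return [(1.0 - cx) / cx for cx in c]

p = [0.4, 0.1, 0.3, 0.2]      # hypothetical target distribution
q = [0.25, 0.25, 0.25, 0.25]  # hypothetical model distribution

c = exact_classifier(p, q)
r = ratio_from_classifier(c)
true_ratio = [qx / px for px, qx in zip(p, q)]
print(r)
```

With a trained classifier in place of the exact one, the same transformation gives an approximate ratio whose quality depends on how well the classifier has been fitted.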

f-Divergence Definition Parameter-Shift
total variation
squared Hellinger
Kullback-Leibler (type I, forward)
Kullback-Leibler (type I, reverse)
Kullback-Leibler (type II, forward)
Kullback-Leibler (type II, reverse)     
Pearson (forward)
Pearson (reverse)
Table 1: A summary of well-known f-divergences, including the definition, the conjugate of the generator $f^*$, and the corresponding parameter-shift rule in terms of the ratio $r$. The symbol $\|$ indicates that the divergence is asymmetric, while a comma indicates that it is symmetric. Interestingly, one can construct a symmetric f-divergence for every asymmetric one (see Table 2).
f-Divergence Definition Parameter-Shift
symmetric Kullback-Leibler (type I, Jeffrey)
symmetric Kullback-Leibler (type II, Jensen-Shannon)
symmetric Pearson
Table 2: A summary of the symmetric f-divergences corresponding to some well-known asymmetric f-divergences, including the definition and the parameter-shift rule.

III Training Heuristics

III.1 Switching f-Divergences

In this Section, we describe a heuristic for dynamically switching between f-divergences throughout the training process of our generative model (specifically the QCBM).

To motivate this heuristic, we examine how $f^*(r)$ varies with respect to values of $r$. We begin by noting that all f-divergences which can be standardised agree on the divergence between nearby distributions [76], but can otherwise exhibit very different behaviours. In particular, we focus on their initial rates of convergence.

One may rationalise the different rates of convergence for each divergence at the beginning of training by considering the following argument [64, 50, 77]. Consider $n$ qubits, such that there are $2^n$ different values of $r(x)$. For a successful training, all these values need to converge towards 1 (which implies our goal that $q_\theta = p$). Now suppose we were to estimate the divergence in Equation (6) using a set of samples from the target distribution $p$. At the beginning of training, $\theta$ is initialised at random and $q_\theta$ is therefore expected to be far from the target. This means that $q_\theta(x) \ll p(x)$ for most of the samples. In other words, at the beginning of training most of the samples yield probability ratios $r(x) \approx 0$.

It is evident from the left panel of Figure 1 that some divergences, including TV, vary slowly in the region where $r \to 0$, and are therefore more liable to saturation in the initial stages of training. Other divergences, such as forward KL and reverse $\chi^2$, generate strong gradients in this region. In the limiting case where $p$ and $q_\theta$ have disjoint supports, TV and JS saturate, whereas forward KL diverges [78]. This problem is well known within the context of training generative adversarial networks; since an idealised formulation optimises JS, several alternative cost functions have been proposed to mitigate its slow initial convergence [64, 79, 77, 78].
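The saturation behaviour in the near-disjoint limit can be seen in a small numerical experiment. The sketch below uses the textbook definitions of TV, JS and $\mathrm{KL}(p \| q)$ (the latter taken here as a stand-in for the forward KL, which is an assumption about the naming convention) on a hypothetical four-outcome example whose overlap is controlled by `eps`. As `eps` shrinks, TV approaches 1 and JS approaches $\log 2$, while KL keeps growing.

```python
import math

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def jensen_shannon(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the mixture of p and q."""
    m = [0.5 * (px + qx) for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def nearly_disjoint(eps):
    """Target mostly on the first two outcomes, model mostly on the last two."""
    p = [0.5 - eps / 2, 0.5 - eps / 2, eps / 2, eps / 2]
    q = [eps / 2, eps / 2, 0.5 - eps / 2, 0.5 - eps / 2]
    return p, q

for eps in (1e-2, 1e-4, 1e-8):
    p, q = nearly_disjoint(eps)
    print(eps, total_variation(p, q), jensen_shannon(p, q), kl(p, q))
```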

Though we can only apply this logic to the particular regime where $p$ and $q_\theta$ are far apart, it is also evident from Figure 1 that the f-divergences exhibit a wide diversity of behaviours throughout most of training. We propose to exploit this with the following heuristic. At every optimisation step, we choose, for each direction in parameter space, the f-divergence that generates the largest gradient in that direction. This requires no additional quantum circuit evaluations since we only need to evaluate Equation (10) for the different generators. Concretely, the heuristic can be written as follows. For each step, to update parameter $\theta_k$, we choose the f-divergence labelled $f_k$, $f_k \in \mathcal{F}$, which obeys

$$f_k = \arg\max_{f \in \mathcal{F}} \left| \frac{\partial D_f(p \| q_\theta)}{\partial \theta_k} \right|.$$

For simplicity, in this paper, we restrict the set $\mathcal{F}$ to only contain those f-divergences illustrated in Figure 1. We call this heuristic f-switch.
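The per-parameter selection rule can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the model is a product-state Born machine with one $R_y$ rotation per qubit, the exact ratio $q_\theta(x)/p(x)$ stands in for a trained classifier, "largest gradient" is read as largest magnitude, and the candidate set contains four smooth divergences written through the derivative of their standardised conjugate generators. It is not the circuit or the set of divergences used in the paper.

```python
import math
from itertools import product

# Derivatives f*'(r) of a few standardised conjugate generators, with r = q(x) / p(x).
CONJUGATE_DERIVATIVES = {
    "KL(p||q)": lambda r: 1.0 - 1.0 / r,
    "KL(q||p)": lambda r: math.log(r),
    "Hellinger": lambda r: 2.0 * (1.0 - 1.0 / math.sqrt(r)),
    "Pearson": lambda r: r - 1.0,
}

def model_probs(thetas):
    """Toy product-state Born machine: qubit i is prepared as RY(thetas[i])|0>."""
    probs = {}
    for bits in product((0, 1), repeat=len(thetas)):
        prob = 1.0
        for b, t in zip(bits, thetas):
            prob *= math.sin(t / 2) ** 2 if b else math.cos(t / 2) ** 2
        probs[bits] = prob
    return probs

def shift_gradient(thetas, k, target, d_conj):
    """Parameter-shift gradient of D_f w.r.t. thetas[k], ratio held at the current thetas."""
    q = model_probs(thetas)
    plus = list(thetas)
    plus[k] += math.pi / 2
    minus = list(thetas)
    minus[k] -= math.pi / 2
    q_plus, q_minus = model_probs(plus), model_probs(minus)
    return 0.5 * sum((q_plus[x] - q_minus[x]) * d_conj(q[x] / target[x]) for x in target)

def f_switch_step(thetas, target, lr=0.1):
    """One update: for each parameter, use the divergence with the largest |gradient|."""
    new_thetas, chosen = list(thetas), []
    for k in range(len(thetas)):
        grads = {name: shift_gradient(thetas, k, target, d)
                 for name, d in CONJUGATE_DERIVATIVES.items()}
        best = max(grads, key=lambda name: abs(grads[name]))
        chosen.append(best)
        new_thetas[k] = thetas[k] - lr * grads[best]
    return new_thetas, chosen

target = model_probs([2.0, 1.0])  # hypothetical target generated by the same toy model
initial_thetas = [0.3, 2.5]
thetas = list(initial_thetas)
for _ in range(200):
    thetas, chosen = f_switch_step(thetas, target)
print(thetas, chosen)
```

Because the shifted distributions are shared by all candidate divergences, comparing them only requires re-weighting the same quantities, which mirrors the observation that no additional circuit evaluations are needed.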

III.2 Local Cost Functions

In this Section, we outline an alternative heuristic for training the QCBM, based on introducing locality into the cost function. Let us briefly provide some motivation for this approach. One of the most fundamental challenges associated with hybrid quantum-classical algorithms is the barren plateau phenomenon, whereby the gradient of the cost function vanishes exponentially in the number of qubits [80, 81, 82, 83, 84, 37, 85, 36, 86, 87, 88]. This phenomenon can arise due to deep unstructured ansätze [80], large entanglement [83, 84], high levels of noise [88], and global cost functions [36, 37]. As such, it is a rather general phenomenon in many quantum machine learning applications, including generative models. In the presence of barren plateaus, exponential precision (i.e., an exponential number of samples) is required in order to resolve against finite sampling noise and determine a minimising direction in the cost function landscape. Since the standard objective of quantum algorithms is to achieve a polynomial scaling in the system size (as opposed to the exponential scaling of classical algorithms), barren plateaus can destroy any hope of a variational quantum algorithm achieving quantum advantage.

Although, in this paper, we do not directly analyse the emergence of barren plateaus in the QCBM, we are nonetheless motivated by existing results on barren plateaus. We focus, in particular, on the connection between barren plateaus and global cost functions (i.e., cost functions defined in terms of global observables), given that such cost functions naturally arise in hybrid quantum-classical generative models. The connection between trainability and locality was first established by Cerezo et al. [36], who proved that cost functions defined in terms of global observables exhibit barren plateaus for all circuit depths in circuits composed of random two-qubit gates which act on alternating pairs of qubits (i.e., blocks forming local 2-designs). Meanwhile, local cost functions do not exhibit barren plateaus for shallow circuits; in this case, cost function gradients vanish at worst polynomially in the number of qubits.

On the basis of this result, there is clear motivation to seek a local cost function (i.e., a cost function defined in terms of local observables) for the hybrid quantum-classical generative model introduced in Section II.3. We now attempt to make some progress towards this goal.

We write $p_i$ and $q_{\theta,i}$ to denote the marginal distributions of the $i$-th element of the bit-string $x$ under the target and the model, respectively. Using Jensen’s inequality on Equation (6), it can be shown that the f-divergence between joint distributions is at least as large as the f-divergence between marginal distributions. Thus, we have

$$D_f(p \| q_\theta) \geq \frac{1}{n} \sum_{i=1}^{n} D_f(p_i \| q_{\theta,i}). \qquad (18)$$

Our heuristic consists of minimising the right-hand side of this inequality. Even though this is a lower bound on the original cost, it is a fully local cost function. Later, we show how to generalise this approach, allowing for a trade-off between trainability and accuracy. We call this heuristic f-local.
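The relation between the joint and the single-bit divergences can be checked numerically. The sketch below uses $\mathrm{KL}(p \| q)$ as an example f-divergence and two randomly generated 3-bit distributions (both hypothetical). Each single-bit marginal divergence is bounded by the joint divergence, and therefore so is their average; the use of the average as the local quantity is an assumption of this sketch.

```python
import math
import random
from itertools import product

def kl(p, q):
    """KL(p || q) for dictionaries over the same outcomes, with strictly positive q."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0.0)

def marginal(dist, i):
    """Marginal distribution of bit i of an n-bit distribution."""
    out = {0: 0.0, 1: 0.0}
    for bits, prob in dist.items():
        out[bits[i]] += prob
    return out

def random_distribution(n, rng):
    """Random strictly positive distribution over n-bit strings."""
    weights = {bits: rng.random() + 1e-3 for bits in product((0, 1), repeat=n)}
    total = sum(weights.values())
    return {bits: w / total for bits, w in weights.items()}

rng = random.Random(0)
n = 3
p = random_distribution(n, rng)
q = random_distribution(n, rng)

joint = kl(p, q)
local = [kl(marginal(p, i), marginal(q, i)) for i in range(n)]
print(joint, local, sum(local) / n)
```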

Let us show the difference between the global cost function (left-hand side of the inequality) and the local cost function (right-hand side) by means of an example. For ease of exposition, we assume in this discussion that the f-divergence of interest is the reverse KL with generator . We emphasise, however, that the methodology is generic to any f-divergence. We begin by rewriting the expression in Equation (1) as


where we have defined . We can thus write the reverse KL in the form of a generic cost function (see, e.g., [3]) as


where we define . This cost function is clearly global, since the observables, , act on all qubits.

Now, rewriting Equation (20) in terms of the adversarial approximation in Equation (14), we have


where , and . It is interesting to note that the global observable only enters into via the first term, namely . It is arguable, however, that the second term in , namely should also be regarded as a global quantity.

We now consider the fully local cost function in the right-hand side of Equation (18). Applying the adversarial approximation to each of the probability ratios, the QCBM objective is


where we have replaced the global observable in Equation (21) by the set of local observables


Here, is a projector on the computational basis for qubit , and denotes the identity on all qubits except qubit . We have also replaced the ‘global’ function in Equation (21) by the set of local functions


Here, is a set of ‘local’ classifiers, which act only on the marginal distribution corresponding to the qubit. That is to say, are trained to distinguish between samples and samples . One may ask why it is not sufficient to simply make only the observable, , local as is done in other literature addressing the barren plateau problem [36]. In our case, it turns out that if one does not also make the functions local, in other words by keeping the classifier ‘global’, the cost function becomes intractable to compute due to a need to explicitly compute joint probabilities from the circuit, . This hints at the subtlety that appears when attempting to address barren plateaus in generative modelling, that does not necessarily exist in other variational algorithms.

We are, of course, interested in whether the local cost function is faithful to the original cost function. Recall that we are minimising the lower bound in Equation (18). It is clear that, if the local cost function is minimised, so that $q_{\theta,i} = p_i$ for all $i$, and all of the marginals coincide, there is still no guarantee that the joint distributions will be identical. This observation suggests that, while this cost function may be more trainable than the original cost function on account of its locality, it may also be significantly less accurate. In an attempt to remedy this, we can instead consider a more general $k$-local cost function which acts on subsets of qubits. In particular, by defining , we can introduce




and where is a set of ‘$k$-local’ classifiers, defined in an obvious fashion. This $k$-local cost function now approximates the sum of the reverse KL between the $k$-marginals (of neighbouring qubits) of the target distribution $p$, and the variational distribution $q_\theta$.

Arguing as before, it is clear that the $k$-local cost function will admit additional global minima in comparison to the global cost function for any $k < n$. In particular, when the $k$-local cost function is minimised, the $k$-nearest neighbour marginals of $p$ and $q_\theta$ coincide. One can expect, however, that as the value of $k$ is increased, not only will the number of additional minima decrease, but the disparity between the joint distributions of the target and the model at these global minima will decrease. This suggests that in order to achieve a ‘sweet spot’ between trainability and accuracy, a reasonable approach is to start by optimising the $k$-local cost function with a small value of $k$ (promoting trainability), before iteratively increasing the value of $k$ (promoting accuracy) until $k = n$, thus recovering the global cost function.

We should remark that, while for ease of notation we have defined the $k$-local cost function in terms of marginals with respect to neighbouring qubits, one can in theory choose any sets of qubits of size at most $k$ (e.g., nearest neighbours, all possible combinations, and randomly sampled). In general, for a fixed value of $k$, this choice will influence the accuracy of the objective function, as well as its computational cost, and should be made on a case-by-case basis depending on the available computational resources.
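The three ways of choosing qubit subsets mentioned above can be sketched as follows. The helper names are hypothetical, qubits are indexed from 0, and for simplicity each helper returns subsets of exactly $k$ qubits rather than of size at most $k$. The last helper restricts sampled bit-strings to a subset, which is the data a classifier acting on that subset would be trained on.

```python
import random
from itertools import combinations

def neighbouring_subsets(n, k):
    """Contiguous windows of k neighbouring qubits."""
    return [tuple(range(i, i + k)) for i in range(n - k + 1)]

def all_subsets(n, k):
    """Every subset of exactly k qubits."""
    return list(combinations(range(n), k))

def random_subsets(n, k, count, seed=0):
    """A random selection of `count` distinct subsets of k qubits."""
    rng = random.Random(seed)
    return rng.sample(all_subsets(n, k), count)

def marginal_samples(samples, subset):
    """Restrict each sampled bit-string to the qubits in `subset`."""
    return [tuple(s[i] for i in subset) for s in samples]

samples = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1)]  # hypothetical 4-qubit samples
print(neighbouring_subsets(4, 2))         # [(0, 1), (1, 2), (2, 3)]
print(len(all_subsets(4, 2)))             # 6
print(marginal_samples(samples, (1, 2)))  # [(1, 1), (1, 0), (0, 1)]
```

The number of subsets, and hence of classifiers to train, grows from $n - k + 1$ for neighbouring windows to $\binom{n}{k}$ for all combinations, which is one way to see the trade-off between accuracy and computational cost.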

IV Numerical Results

In this Section, we present numerical results to illustrate the performance of the training heuristics proposed in Section III. Throughout this Section, we utilise a QCBM composed of alternating layers of single qubit gates and entangling gates (see Figure 2). We implement the quantum circuit using pytket [89] and execute the simulations with Qiskit [90]. The parameters of the QCBM are updated using stochastic gradient descent with a constant learning rate, which is tuned to each of the simulations.

Figure 2: The ansatz employed in numerical simulations (shown for three qubits). The ansatz consists of alternating layers of single qubit gates and entangling gates. Each single qubit layer consists of two single qubit rotations about two different axes. The entangling layer is composed of a ladder of CZ gates. There is an additional layer of Hadamard gates prior to the first layer, and an additional layer of single qubit rotations after the final layer. The total number of parameters in a circuit of depth $d$ is given by $2n(d+1)$, where $n$ is the number of qubits.
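The parameter counts quoted for the 3 qubit experiments in Table 3 (12, 18, 24 and 30 parameters for 1, 2, 3 and 4 layers) are consistent with two rotation angles per qubit in each layer plus one final rotation layer. The helper below encodes that count; the function name is hypothetical and the formula is inferred from the caption and those table entries rather than taken from the authors' code.

```python
def num_parameters(n_qubits: int, depth: int) -> int:
    """Two rotations per qubit in each of `depth` layers, plus one final rotation layer."""
    return 2 * n_qubits * (depth + 1)

for depth in (1, 2, 3, 4):
    print(depth, num_parameters(3, depth))
```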

Regarding the classical component of the adversarial generative model (i.e., the binary classifier), we use either a fully connected feed-forward neural network with ReLU neurons (NN), or a support vector machine with RBF kernel (SVM). Indeed, one rather surprising byproduct of our numerical investigation is that the training performance of the adversarial generative model could be improved, at times significantly, by using an SVM in place of an NN for this component (see Figure 3). This, in itself, should be of some interest to practitioners. Not only can SVMs be faster to train, but they depend on significantly fewer hyper-parameters than NNs, whose performance is often highly dependent on careful tuning of the number of hidden layers, the number of neurons in each hidden layer, the learning rate, the batch size, etc. While we do not suggest that SVMs will always outperform NNs in this setting, this does indicate that SVMs may represent a viable alternative. We implement the NNs using PyTorch [91], while the SVMs are implemented with scikit-learn [92]. The particular hyper-parameters used in each simulation are specified below.

Figure 3: Training performance of the QCBM in illustrative 3 qubit and 4 qubit experiments (panels: 3 qubits, 4 qubits) using 4 different classifiers. The classifiers are trained using 500 samples. We plot the bootstrapped median (solid line), as well as 90% confidence intervals (shaded).

In the majority of our numerical simulations, we consider a QCBM with $n = 3$ qubits. This corresponds to a discrete target distribution which takes $2^3 = 8$ values. We generally also assume that the target distribution corresponds to a particular instantiation of the QCBM, for a fixed number of layers. By varying the number of layers used to train the generative model, we can then investigate different parameterisation regimes of interest. In the case that the number of layers used to generate the target is greater than the number of layers used in the model, the model is under-parameterised (or severely under-parameterised). Meanwhile, when the number of layers used to generate the target and the number of layers used in the model are equal, the model is said to be exactly parameterised. In this case, a solution to the learning problem is guaranteed to exist: there exists $\theta^*$ such that $q_{\theta^*} = p$. Finally, when the number of layers used to generate the target is less than the number used in the model, the model is over-parameterised (or severely over-parameterised). We provide a more precise definition of these different cases, as applied to our numerics, in Table 3.

| | Severely Over | Over | Exact | Under | Severely Under |
|---|---|---|---|---|---|
| Number of parameters (layers) used to generate the target | 12 parameters (1 layer) | 12 parameters (1 layer) | 12 parameters (1 layer) | 30 parameters (4 layers) | 30 parameters (4 layers) |
| Number of parameters (layers) used for the model | 30 parameters (4 layers) | 24 parameters (3 layers) | 12 parameters (1 layer) | 18 parameters (2 layers) | 12 parameters (1 layer) |

Table 3: The different parameterisation regimes used in the 3 qubit numerical simulations.

For each of the settings (i.e., choice of circuit depth for the target and model, choice of heuristic, number of qubits) explored, we train the generative model using nine independent parameter initialisations. We then use a bootstrapping procedure to provide a more robust estimate of the median cost at each training epoch. We first take samples of size nine from the outcome of the nine independent experiments, 10,000 times with replacement. We then compute the median cost across each set of samples to obtain a distribution of 10,000 medians. Using this distribution, we compute the median and obtain error bars from the 5th and 95th percentiles, corresponding to a 90% confidence interval.
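The bootstrapping procedure can be sketched for a single training epoch as follows. The nine cost values are hypothetical, the function name is illustrative, and the choice of the 5th and 95th percentiles for a 90% interval, together with the nearest-rank percentile rule, are assumptions of this sketch.

```python
import random
import statistics

def bootstrap_median(costs, n_resamples=10_000, lower=5, upper=95, seed=0):
    """Bootstrapped median and percentile interval for one epoch's costs across runs."""
    rng = random.Random(seed)
    # Resample the runs with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(costs, k=len(costs)))
        for _ in range(n_resamples)
    )

    def percentile(pct):
        # Nearest-rank percentile of the sorted bootstrap medians.
        index = min(len(medians) - 1, max(0, round(pct / 100 * (len(medians) - 1))))
        return medians[index]

    return statistics.median(medians), percentile(lower), percentile(upper)

# Hypothetical costs at one epoch from nine independent initialisations.
costs = [0.12, 0.08, 0.30, 0.11, 0.09, 0.25, 0.10, 0.07, 0.15]
median, low, high = bootstrap_median(costs)
print(median, low, high)
```

Repeating this for every epoch gives the solid line (median) and shaded band (interval) of a training curve.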

IV.1 Switching f-Divergences

We begin by considering the performance of the heuristic introduced in Section III.1. The f-divergences that can be standardised locally behave as KL to second order [76]. Notably, TV cannot be standardised; indeed, it is straightforward to show that TV provides an upper bound for all other f-divergences with in this regime. For this reason, we evaluate both the exact TV and the exact KL to measure performance.

We begin by reporting the results obtained using an exact classifier, for each of the parameterisation regimes given in Table 3. The generator is trained using samples per iteration. The results are given in Table 4.

Our results indicate that the heuristic is able to outperform TV when the QCBM is (severely) over-parameterised. This may be due to the extra degrees of freedom in the model. These allow for more discrepancies between the loss landscapes of the f-divergences, which the heuristic is able to exploit. In Figures 4 and 5, we provide a more detailed illustration of the training performance of the f-switch heuristic in this regime. Figure 4 corresponds to an exact classifier: in this case, use of the heuristic significantly improves the convergence of the QCBM. Figure 5 corresponds to a trained classifier, trained on samples per iteration: in this case, use of the heuristic can lead to marginal performance improvements with respect to TV (left-hand figure). The remaining results in this Section are all reported for an exact classifier.

in training
(12, 30)
(12, 24)
(12, 12)
(30, 18)
(30, 12)
TV f-switch
KL f-switch
Table 4: Performance of the QCBM trained using the TV and the f-divergence heuristic for 3 qubits in over-, under-, and exactly parameterised regimes. We show the bootstrapped median of the TV (top two rows) and the KL (bottom two rows) after 500 epochs. The asterisk (*) on some of the experiments indicates that the cost is still converging. The bold indicates the regimes where f-switch significantly outperforms the other methods.
Figure 4: Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the severely over-parameterised case OO(12,30), using an exact classifier. We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of the TV (left) and the KL (right).
Figure 5: Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the severely over-parameterised case OO(12,30), using a trained SVM classifier. We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of both the TV (left) and the KL (right).

The average performance of the heuristic is similar to TV in the exactly and under-parameterised regimes. There are, however, initial parameter configurations within these regimes for which the heuristic significantly outperforms TV. In Figure 6, we plot the median losses obtained throughout the training of the QCBM in the under-parameterised U(30, 18) regime. The best-performing experiment in this regime is also presented in Figure 7, alongside all the other f-divergences considered in Figure 1. After 200 epochs, the training method that solely uses TV has converged, but all the other divergences, including the heuristic, continue to converge exponentially quickly to smaller losses. In the under-parameterised regime, the ansatz is not guaranteed to contain the true solution. However, after reaching a KL of , these f-divergences traverse similar landscapes. Since the f-switch heuristic is shown to reach a KL of , we can assume that all of these f-divergences will converge to the global minimum, with the heuristic arriving first.

Figure 6: Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the under-parameterised case U(30,18). We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of both the TV (left) and the KL (right).

Figure 7: Performance of the QCBM trained using several f-divergences for 3 qubits in the under-parameterised case U(30,18). The parameters are initialised using the parameters which gave the lowest cost during training in Figure 6. We show the exact TV (left) and the exact KL (right).

Finally, in Figure 8, we illustrate the mechanics of the f-switch heuristic. In particular, we plot which f-divergence is ‘activated’ for each direction in the parameter space, at each epoch of the training in Figure 7.

Figure 8: f-divergences chosen throughout the training of the heuristic in Figure 7 in each of the 18 directions in parameter space.

We remark that as the number of qubits is increased, the randomly initialised model and the target distributions are expected to be increasingly further apart. The heuristic can pick the divergence that provides the highest initial learning signal. For this reason, we expect the heuristic to become particularly useful as the number of qubits is increased.
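The per-direction selection described above can be sketched as follows. This is a hypothetical illustration rather than the procedure of Section III.1: it assumes that "highest learning signal" means the partial derivative of largest magnitude, and that gradients of each candidate divergence are already available. The function name `f_switch_step`, the plain gradient-descent update, and the example numbers are all assumptions.

```python
import numpy as np

def f_switch_step(params, grads_by_divergence, learning_rate=0.1):
    """One descent step that picks, per parameter, the divergence whose
    partial derivative has the largest magnitude (assumed selection rule).

    grads_by_divergence maps a divergence name to its gradient vector.
    Returns the updated parameters and the name chosen in each direction.
    """
    names = list(grads_by_divergence)
    grads = np.stack([np.asarray(grads_by_divergence[n], dtype=float) for n in names])
    chosen = np.argmax(np.abs(grads), axis=0)            # one index per direction
    step = grads[chosen, np.arange(grads.shape[1])]      # selected partial derivatives
    new_params = np.asarray(params, dtype=float) - learning_rate * step
    return new_params, [names[i] for i in chosen]

# Hypothetical gradients for a two-parameter model.
example_grads = {"TV": [0.1, -0.5], "KL": [-0.3, 0.2]}
new_params, active = f_switch_step(np.zeros(2), example_grads, learning_rate=1.0)
print(new_params, active)
```

Recording `active` at every epoch gives the kind of per-direction activation record that Figure 8 displays.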

IV.2 Local Cost Functions

We now turn our attention to the heuristic introduced in Section III.2, incorporating locality in the cost function, dubbed -local. In this Section, the target distribution is a discretised Gaussian. All classifiers are neural networks with hidden layer made of ReLU neurons, where is the locality parameter. The number of layers in the QCBM equals the number of qubits, . All expectation values are estimated using samples. In Figure 9, we plot the training performance of the QCBM using the global cost function and several -local cost functions, for 4, 5, and 6 qubit experiments. For 4 and 5 qubits, we show the bootstrapped median for the first training epochs, as well as 90% confidence intervals. For 6 qubits, we plot an illustrative training example for the first training epochs.

[Panels: 4 qubits, 5 qubits, 6 qubits]

Figure 9: Training performance of the QCBM using the global and local reverse KL for 4 qubits, 5 qubits, and 6 qubits, for a discretised Gaussian target distribution. For 4 qubits and 5 qubits, we show the bootstrapped median (solid line), as well as 90% confidence intervals (shaded). For 6 qubits, we plot an illustrative training example.

Let us make several remarks. Firstly, it would appear that the use of a -local cost function can indeed improve the convergence (rate) of the training procedure, particularly during the initial stages. This improvement is increasingly evident as the number of qubits is increased. As such, this approach could be regarded as a potential strategy for tackling barren plateaus in higher-dimensional problems. However, we leave a thorough study of this phenomenon to future work.

Secondly, it is clear that the use of any -local cost function will eventually prohibit convergence to the true target distribution. As discussed in Section III.2, the -local cost function is minimised whenever the -marginal distributions of the target and the model coincide, which does not necessarily imply that their joint distributions are equal. The smaller the value of , the greater the possible disparity between two distributions whose -marginals coincide. This is clearly visualised in Figure 9: as the value of decreases, the asymptotic reverse KL achieved during training with the -local cost function plateaus at increasingly larger values.
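A small example makes the gap between matching marginals and matching joints concrete. The sketch below assumes that the marginals in question are taken over all subsets of a given number of bits, which may differ from the precise definition in Section III.2; the helper names and the two-bit distributions are chosen for illustration only.

```python
import itertools
import numpy as np

def marginal(joint, subset, n_bits):
    """Marginal of a joint distribution over n_bits bits on the given bit subset.

    joint has length 2**n_bits, indexed by the bitstring read as an integer.
    """
    tensor = np.asarray(joint, dtype=float).reshape((2,) * n_bits)
    dropped = tuple(axis for axis in range(n_bits) if axis not in subset)
    return tensor.sum(axis=dropped)

def all_marginals_match(p, q, size, n_bits, atol=1e-12):
    """True if p and q agree on every marginal over `size` bits."""
    return all(
        np.allclose(marginal(p, s, n_bits), marginal(q, s, n_bits), atol=atol)
        for s in itertools.combinations(range(n_bits), size)
    )

# Two distributions over 2 bits: independent uniform bits vs. perfectly correlated bits.
uniform = np.array([0.25, 0.25, 0.25, 0.25])
correlated = np.array([0.5, 0.0, 0.0, 0.5])

single_bit_match = all_marginals_match(uniform, correlated, size=1, n_bits=2)
joint_match = all_marginals_match(uniform, correlated, size=2, n_bits=2)
print(single_bit_match, joint_match)
```

Both distributions have uniform single-bit marginals, so a cost built only from single-bit marginals cannot distinguish them, even though their TV distance is 0.5.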

As remarked previously, this suggests that an optimal training strategy may be to start the training procedure with a small value of , before iteratively increasing the value of as training proceeds. For example, let us consider the qubit experiment in Figure 9. Initially, the -local cost function (red) appears to yield the greatest convergence rate. After approximately 150 epochs, the -local cost function (purple) seems to become favourable. Asymptotically, one can imagine that the global cost function (blue) will be preferable. One observes similar behaviour in the qubit experiment in Figure 9.

In practice, of course, it is not possible to compute the reverse KL directly, and thus another tractable metric is required in order to determine the optimal moment for switching between the -local cost functions. Alternatively, one can simply increase the locality of the cost function after a set number of epochs.
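The fixed-interval alternative can be written as a simple schedule. The sketch below is hypothetical: the function name, the starting value, the stage length, and the convention that a locality parameter equal to the number of qubits corresponds to the global cost are all assumptions, not settings used in the experiments.

```python
def locality_schedule(epoch, n_qubits, start=1, epochs_per_stage=100):
    """Locality parameter to use at a given epoch.

    Starts at `start`, increases by one every `epochs_per_stage` epochs,
    and is capped at n_qubits (taken here to mean the global cost function).
    """
    if epochs_per_stage <= 0:
        raise ValueError("epochs_per_stage must be positive")
    return min(start + epoch // epochs_per_stage, n_qubits)

# Hypothetical 4-qubit run: the parameter goes 1, 2, 3, then stays at 4.
schedule = [locality_schedule(e, n_qubits=4) for e in (0, 99, 100, 250, 300, 1000)]
print(schedule)
```

A metric-driven variant would replace the epoch counter with a tractable proxy for the reverse KL, which is the open question raised above.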

V Estimation of f-Divergences on Fault-Tolerant Quantum Computers

The above discussion is purely heuristic in nature and suitable for near-term quantum computers, but we can also address f-divergences from the other end of the spectrum, using fault-tolerant devices. In particular, we can leverage a recent line of study into quantum property testing of distributions. The key question here is whether or not a particular probability distribution has a certain property.

The work of [38] was one of the first to provide such an answer, demonstrating a quadratic speedup for determining whether two distributions over $[n]$ are close or $\epsilon$-far in TV. These quantum algorithms typically work in the oracle model, and we measure run time relative to the number of queries to such an oracle (query complexity). In the classical case, we define oracle access to a distribution $p$ over $[n]$ as a map $O_p : [S] \to [n]$, for some $S$, such that $p_i = |\{s \in [S] : O_p(s) = i\}| / S$. The oracle is a mechanism to translate a uniform distribution over $[S]$ to the true distribution $p$ over $[n]$. In the quantum case, such an oracle is replaced by a unitary operator $\hat{O}_p$, acting on a state encoding $s$, along with an ancillary register to ensure reversibility, and defined as $\hat{O}_p |s\rangle |0\rangle = |s\rangle |O_p(s)\rangle$.
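The classical oracle model can be illustrated with a lookup table: a function from a finite set of inputs to outcomes, such that a uniformly random input yields an outcome distributed according to the target distribution. The sketch below is an assumption-laden toy; the helper names `make_oracle` and `induced_distribution` and the example counts are hypothetical.

```python
import numpy as np

def make_oracle(counts):
    """Build a classical oracle from integer counts.

    counts[i] is the number of inputs mapped to outcome i, so the induced
    probability of outcome i is counts[i] / sum(counts).
    Returns the oracle function and the size of its input domain.
    """
    table = np.repeat(np.arange(len(counts)), counts)
    return (lambda s: int(table[s])), len(table)

def induced_distribution(oracle, domain_size, n_outcomes):
    """Distribution over outcomes obtained by feeding the oracle uniform inputs."""
    hits = np.zeros(n_outcomes)
    for s in range(domain_size):
        hits[oracle(s)] += 1
    return hits / domain_size

# Hypothetical oracle with 4 inputs and 3 outcomes.
oracle, domain_size = make_oracle([2, 1, 1])
distribution = induced_distribution(oracle, domain_size, n_outcomes=3)
print(distribution)
```

Each call to `oracle` counts as one query, which is the resource measured by the query complexities quoted in this Section.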

We begin our discussion with the TV. The authors of [38] produced a quantum property testing algorithm for the TV via an algorithm which actually estimates the TV quadratically faster. The analysis in [38] resulted in an algorithm to estimate the TV up to additive error , with probability of success of , using samples. This was later improved by [39] to the following:

Theorem 1 (Section 4, Montanaro [39]).

Assume are two distributions on . Then there is a quantum algorithm that approximates up to an additive error , with probability of success , using quantum queries.

These ideas were extended in [40] to also give an algorithm for computing the (forward) KL quadratically faster than is possible classically (and also computing certain entropies of distributions). Due to the existence of the ratio in the expression for the KL, we must make a further assumption, which was not necessary in the case of the TV distance in Theorem 1. This assumption will also be necessary when considering many of the other divergences in Table 1. In particular, we must assume the two distributions are such that: , for some . (This assumption is appropriate when one defines the KL in terms of the generator and the ratio . Conversely, when one defines the KL in terms of the conjugate and the ratio , then the appropriate assumption would instead be that , .) This assumption is also necessary in the classical case. With this, we then have:

Theorem 2 (Theorem 4.1, Li and Wu [40]).

Assume are two distributions on satisfying for some . Then there is a quantum algorithm that approximates within an additive error with probability of success at least using quantum queries to and quantum queries to . (The notation ignores factors that are polynomial in and .)

These results cover two of the f-divergences we use above (see Table 1). In particular, the latter algorithm provides a quantum speedup since it is known that one requires classical queries to and respectively to estimate the KL [93]. On the other hand, we get a speedup for the former algorithm since it is known one requires  [94] queries to test if two distributions are near or far in TV classically, which is an easier problem than estimating the metric directly.

The key idea behind both of these algorithms is to use a subroutine known as quantum probability estimation or quantum counting, which is adapted from quantum amplitude estimation. This provides a quadratic speedup in producing estimates , of probabilities from the distributions , which are specified via a quantum oracle. Once the estimates of have been produced via the quantum subroutine, both of the above algorithms reduce to simple classical post-processing. This post-processing involves constructing a random variable, , whose expectation value gives exactly the divergence we require. For TV and KL estimation, this random variable is given by


By sampling this random variable according to another distribution (to be defined below), the quantity of interest is exactly given as an expectation value, namely


One can check [38, 40] that the suitable random variables are given by


Due to the probabilistic nature of quantum mechanics, one cannot obtain the exact values of the probabilities required to compute these expectation values. We must settle instead for approximations of , namely . These estimates are achieved using the quantum approximate counting lemma, which is an application of quantum amplitude estimation [95]. The work in [40] considered two versions of this algorithm, called EstAmp and EstAmp’. The only difference between these two algorithms is the behaviour when one of the probabilities, , is sufficiently close to zero. This is problematic in the case of the KL estimation (and indeed entropy estimation) in [40] since the relevant quantities diverge as . The same is true in our case, as appears in many f-divergences.
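To make the expectation-value formulation concrete, the sketch below uses exact probabilities and two standard constructions, stated here as assumptions rather than as the displayed equations of this Section: the TV as the expectation of |p_i - q_i| / (p_i + q_i) under the mixture r = (p + q) / 2, and the forward KL as the expectation of log(p_i / q_i) under p. Function names and the example distributions are hypothetical.

```python
import numpy as np

def tv_random_variable(p, q):
    """Values X_i = |p_i - q_i| / (p_i + q_i), set to 0 where p_i + q_i = 0."""
    total = p + q
    x = np.zeros_like(total)
    nonzero = total > 0
    x[nonzero] = np.abs(p - q)[nonzero] / total[nonzero]
    return x

def tv_as_expectation(p, q):
    """Exact E_{i ~ r}[X_i] with r = (p + q) / 2; equals the TV distance."""
    return float(np.sum(0.5 * (p + q) * tv_random_variable(p, q)))

def kl_as_expectation(p, q):
    """Exact E_{i ~ p}[log(p_i / q_i)]; assumes q_i > 0 wherever p_i > 0."""
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def tv_monte_carlo(p, q, n_samples, seed=0):
    """Classical Monte Carlo estimate of the same expectation by sampling i ~ r."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(p), size=n_samples, p=0.5 * (p + q))
    return float(tv_random_variable(p, q)[indices].mean())

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.full(4, 0.25)
tv_exact = tv_as_expectation(p, q)
kl_exact = kl_as_expectation(p, q)
tv_sampled = tv_monte_carlo(p, q, n_samples=200_000)
print(tv_exact, kl_exact, tv_sampled)
```

In the quantum setting the exact probabilities inside the random variable are replaced by amplitude-estimation outputs, and the classical averaging is replaced by the quantum Monte Carlo speedup.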

Theorem 3 (Theorem 13, Brassard et al. [95] and Theorem 2.3, Li and Wu [40]).

For any , there is a quantum algorithm (named EstAmp) with queries to a boolean function, that outputs for some such that


where . This promises with probability at least for and with probability greater than for . If then .
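The discretisation behind this kind of guarantee can be explored classically. The sketch below assumes the form of the amplitude estimation result as usually stated for Brassard et al.: with M = 2^t, outputs lie on the grid sin²(πl / M), and the error bound is 2πk sqrt(p(1 - p)) / M + k²π² / M². It does not simulate the quantum algorithm; it only takes the nearest grid point as an idealised output and checks it against the k = 1 bound. All function names are hypothetical.

```python
import math

def estamp_grid(t):
    """Possible outputs sin^2(pi * l / 2^t) for l = 0, ..., 2^(t-1) (assumed form)."""
    m = 2 ** t
    return [math.sin(math.pi * l / m) ** 2 for l in range(m // 2 + 1)]

def nearest_grid_estimate(p, t):
    """Idealised output: the grid point closest to the true probability p."""
    return min(estamp_grid(t), key=lambda g: abs(g - p))

def error_bound(p, t, k=1):
    """Assumed bound 2*pi*k*sqrt(p(1-p))/2^t + (k*pi/2^t)^2 on |estimate - p|."""
    m = 2 ** t
    return 2 * math.pi * k * math.sqrt(p * (1 - p)) / m + (k * math.pi / m) ** 2

# Compare the idealised estimate with the bound for a few hypothetical probabilities.
t = 6
checks = [
    (p, abs(nearest_grid_estimate(p, t) - p), error_bound(p, t))
    for p in (0.0, 0.01, 0.3, 0.5, 0.9, 1.0)
]
for p, err, bound in checks:
    print(f"p={p:.2f}  error={err:.2e}  bound={bound:.2e}")
```

The grid contains 0, so an estimate of exactly zero is possible for small probabilities, which is the case that the modified algorithm is designed to avoid.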

The modified algorithm (EstAmp’) outputs when EstAmp outputs , and outputs the same as EstAmp otherwise. Now that we have a mechanism for estimating the probabilities, we need a final ingredient, which is the generic speedup of Monte Carlo methods from [39]:

Theorem 4 (Theorem 5, Montanaro [39]).

Let be a quantum algorithm with output such that . Then for where , by using executions of and