
Policy Gradient based Quantum Approximate Optimization Algorithm

by Jiahao Yao, et al.

The quantum approximate optimization algorithm (QAOA), as a hybrid quantum/classical algorithm, has received much interest recently. QAOA can also be viewed as a variational ansatz for quantum control. However, its direct application to emergent quantum technology encounters additional physical constraints: (i) the states of the quantum system are not observable; (ii) obtaining the derivatives of the objective function can be computationally expensive or even inaccessible in experiments, and (iii) the values of the objective function may be sensitive to various sources of uncertainty, as is the case for noisy intermediate-scale quantum (NISQ) devices. Taking such constraints into account, we show that policy-gradient-based reinforcement learning (RL) algorithms are well suited for optimizing the variational parameters of QAOA in a noise-robust fashion, opening up the way for developing RL techniques for continuous quantum control. This is advantageous to help mitigate and monitor the potentially unknown sources of errors in modern quantum simulators. We analyze the performance of the algorithm for quantum state transfer problems in single- and multi-qubit systems, subject to various sources of noise such as error terms in the Hamiltonian, or quantum uncertainty in the measurement process. We show that, in noisy setups, it is capable of outperforming state-of-the-art existing optimization algorithms.





1 Introduction

Noisy intermediate-scale quantum (NISQ) devices are becoming increasingly accessible. However, their performance can be severely restricted by decoherence effects, which introduce noise into all components of the quantum computer, including initial state preparation, unitary evolution, and measurement/qubit readout. Thanks to the feasibility of being implemented and tested on near-term devices, hybrid quantum-classical algorithms, and in particular quantum variational algorithms (QVA), have received a significant amount of attention recently. Examples of QVA include the Variational Quantum Eigensolver (Peruzzo et al., 2014), the Quantum Approximate Optimization Algorithm (QAOA) (Farhi et al., 2014), Quantum Variational Autoencoders (Romero et al., 2017), etc. The common feature of these algorithms is that the final wavefunction can be prepared by applying a unitary evolution operator, parametrized using a relatively small number of parameters, to an initial wavefunction. The parameters can then be variationally optimized to maximize a given objective function, measured on the quantum device.

In this study, we consider the Quantum Approximate Optimization Algorithm (QAOA) (Farhi et al., 2014), a particularly simple algorithm that alternates between two different unitary time evolution operators generated by two Hamiltonians $H_1$ and $H_2$. QAOA has been studied in the context of a number of discrete (Farhi et al., 2014; Lloyd, 2018; Hadfield, 2018) and continuous (Verdon et al., 2019) optimization problems. QAOA has also been demonstrated to be universal under certain circumstances (Lloyd, 1995, 2018; Morales et al., 2019), in the sense that any element of a unitary group can be well approximated by a properly parameterized QAOA. This is highly nontrivial and is a unique quantum feature, since QAOA only has access to unitary operators generated by the two specific Hamiltonians $H_1$ and $H_2$. However, the control energy landscape of QAOA is known to be highly complex (Streif and Leib, 2019; Niu et al., 2019b), and its optimization can therefore be challenging.

QAOA can be naturally related to quantum control, and thus also to reinforcement learning problems. This has inspired studies from various angles, such as the Krotov method (Tannor et al., 1992), Pontryagin's maximum principle (Yang et al., 2017), tabular reinforcement learning methods (Chen et al., 2013; Bukov, 2018), function-approximation-based (deep) Q-learning methods (Bukov et al., 2018a; Sørdal and Bergli, 2019; An and Zhou, 2019; Zhang et al., 2019), policy gradient methods (Fösel et al., 2018; August and Hernández-Lobato, 2018; Chen and Xue, 2019; Niu et al., 2019a; Porotti et al., 2019), and methods inspired by the success of AlphaZero (Dalgaard et al., 2019). Most studies focus on noise-free scenarios, applicable to fault-tolerant quantum devices. In order to mitigate the errors on near-term devices, robust optimization based on sequential convex programming (SCP) has recently been studied (Kosut et al., 2013; Dong et al., 2019), which assumes that both the source and the range of magnitude of the error are known, but not its exact magnitude. In such a case, the authors found that robust optimization can significantly improve the accuracy of the variational solution.

Nonetheless, techniques such as SCP require access to information of the first as well as second order derivatives of the objective function, which can themselves be noisy and difficult to obtain on quantum devices. The objective function should also be at least continuous with respect to the error, a requirement which is not satisfied in the case of quantum uncertainty in the final measurement process (e.g. in the form of a bit flip or a phase flip). It is thus naturally desirable to only use function evaluations to perform robust optimization, while keeping the result resilient to unknown and generic types of errors.

In this paper, we demonstrate that reinforcement learning (RL) may be used to tackle all challenges above in optimizing the parameters of QAOA, and more generally QVA. Instead of directly optimizing the variational parameters themselves, we may assign a probability distribution to the parameter set, and perform the optimization with respect to the parameters of the probability distribution, denoted $\theta$. The modified objective function (called the reward function) can then be continuous with respect to $\theta$, even if the original objective function is not. The optimization procedure only requires a (possibly large) number of function evaluations, but no information about the first or second order derivatives. We show that a simple policy gradient method only introduces a small number of additional parameters in the optimization procedure, and can be used to optimize the parameters in QAOA. Since each step of the optimization only involves a small batch of samples, the optimization procedure can also be resilient to various sources of noise.

This paper is organized as follows. Section 2 provides a brief introduction to QAOA, its connection to quantum control, and the noise models. Section 3 introduces the policy gradient based QAOA (PG-QAOA), in the context of noise-free and noisy optimization. After introducing the test systems in Section 4, we present in Section 5 numerical results of PG-QAOA for single-qubit and multi-qubit examples under different noise models. Section 6 concludes and discusses further work. Additional numerical results are presented in the Appendices.

2 Preliminaries

2.1 QAOA and Quantum Control

Consider the Hilbert space $\mathcal{H} = (\mathbb{C}^2)^{\otimes n}$, with $n$ the number of qubits in the quantum system. Starting from an initial quantum state $|\psi_i\rangle$, in QAOA we apply two alternating unitary evolution operators (Farhi et al., 2014):

$$|\psi(t_1, \dots, t_{2p})\rangle = e^{-i t_{2p} H_2}\, e^{-i t_{2p-1} H_1} \cdots e^{-i t_2 H_2}\, e^{-i t_1 H_1}\, |\psi_i\rangle. \quad (2.1)$$

The unitary evolution is generated by the time-independent Hamiltonian operators $H_1$ and $H_2$, each applied for durations $t_{2j-1}$ and $t_{2j}$, respectively ($j = 1, \dots, p$); we refer to $p$ as the total depth. In QAOA, we have to adjust the parameters $t_1, \dots, t_{2p}$ to optimize an objective function, e.g. minimizing the energy (Ho and Hsieh, 2019) or maximizing the fidelity of being in some target state (the latter problem is often referred to as a state transfer problem). In the latter case, for a target wavefunction denoted by $|\psi_*\rangle$, the optimization problem becomes

$$\max_{t_1, \dots, t_{2p} \ge 0} F(t_1, \dots, t_{2p}) = |\langle \psi_* | \psi(t_1, \dots, t_{2p}) \rangle|^2.$$
The problem of finding the optimal parameters in QAOA can be reinterpreted as the following bilinear quantum optimal control problem

$$i \partial_s |\psi(s)\rangle = \big[(1 - u(s))\, H_1 + u(s)\, H_2\big] |\psi(s)\rangle, \qquad |\psi(0)\rangle = |\psi_i\rangle,$$

where $u(s) \in \{0, 1\}$ is the control field. In particular, when $u$ is chosen to be the following piecewise constant function,

$$u(s) = \begin{cases} 0, & T_{2j-2} \le s < T_{2j-1}, \\ 1, & T_{2j-1} \le s < T_{2j}, \end{cases} \qquad T_k = \sum_{l=1}^{k} t_l, \quad j = 1, \dots, p,$$

we recover the QAOA wavefunction (2.1). This is a special type of quantum control problem called bang-bang quantum control. For a protocol with durations $t_1, \dots, t_{2p}$, the total duration is defined as

$$T = \sum_{j=1}^{2p} t_j.$$
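As an illustration, the alternating-unitary ansatz and the fidelity objective can be evaluated numerically for a single qubit. The generators $H_1 = \sigma^z$, $H_2 = \sigma^x$ and the initial/target states below are hypothetical placeholders for this sketch, not the models studied later in the paper:

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def qaoa_state(durations, H1, H2, psi_i):
    """Apply the alternating QAOA unitaries: odd slots evolve under H1
    for t_{2j-1}, even slots under H2 for t_{2j}."""
    psi = psi_i
    for j, t in enumerate(durations):
        H = H1 if j % 2 == 0 else H2
        psi = expm(-1j * t * H) @ psi
    return psi

def fidelity(psi, target):
    """F = |<target|psi>|^2."""
    return float(np.abs(np.vdot(target, psi)) ** 2)

# Hypothetical single-qubit instance: |0> -> |+>
psi_i = np.array([1.0, 0.0], dtype=complex)
target = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)
F = fidelity(qaoa_state([0.3, 0.7, 0.2, 0.5], sz, sx, psi_i), target)
```

Since each factor is unitary, the state stays normalized and the fidelity is bounded in $[0, 1]$ for any choice of durations.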
2.2 Noisy Objective Functions

Practical QAOA calculations can be prone to noise. For instance, the Hamiltonian may take the form $H(\epsilon) = H + \epsilon \widetilde{H}$, where $H$ is the Hamiltonian in the absence of noise, $\widetilde{H}$ is the Hamiltonian modelling the noise source, and $\epsilon$ is the magnitude of the noise. We assume that only the range/magnitude of $\epsilon$ is known a priori, denoted by $\bar{\epsilon}$ (i.e. $|\epsilon| \le \bar{\epsilon}$), while the precise value of $\epsilon$ is not known. This setup will be referred to as Hamiltonian noise. The noisy optimization problem can then be solved as a max-min problem:

$$\max_{t_1, \dots, t_{2p} \ge 0}\; \min_{|\epsilon| \le \bar{\epsilon}}\; F(t_1, \dots, t_{2p}; \epsilon),$$

where $F(t_1, \dots, t_{2p}; \epsilon)$ denotes the fidelity for the given noise strength and control durations.

Noise may naturally also occur due to measurements. For instance, the final fidelity may only be measurable up to an additive Gaussian noise, i.e.

$$F_\sigma = \mathrm{clip}_{[0,1]}(F + \delta),$$

where $\delta \sim \mathcal{N}(0, \sigma^2)$. Here, the clipping function guarantees that the noisy fidelity is still bounded between 0 and 1. This will be referred to as Gaussian noise.
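A minimal sketch of this noise model (the fidelity value and noise level below are illustrative):

```python
import numpy as np

def gaussian_noisy_fidelity(F_exact, sigma, rng):
    """Additive Gaussian measurement noise, clipped so the reported
    fidelity stays in [0, 1]."""
    return float(np.clip(F_exact + rng.normal(0.0, sigma), 0.0, 1.0))

rng = np.random.default_rng(0)
samples = [gaussian_noisy_fidelity(0.95, 0.05, rng) for _ in range(2000)]
```

Note that the clipping slightly biases the observed mean below the exact fidelity whenever the noise would push the value above 1, which is the origin of the small gap discussed in the numerical experiments.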

One may even go one step further, and notice that quantum measurements produce an intrinsic source of uncertainty due to the probabilistic nature of quantum mechanics. Assuming the target state is an eigenstate of some measurable operator $O$ with eigenvalue $\lambda$, i.e. $O |\psi_*\rangle = \lambda |\psi_*\rangle$, a quantum measurement produces the eigenvalue $\lambda$ with probability $F$. Using this, we can define the following discrete cost function:

$$F_q = \begin{cases} 1, & \text{with probability } F, \\ 0, & \text{with probability } 1 - F. \end{cases}$$

Assuming the same state of the system is prepared anew in a series of experiments, a measurement in repeated experiments will produce a discrete set of ones and zeros, whose mean value converges to the true fidelity in the limit of a large number of quantum measurements. This setting was considered in (Bukov, 2018), and will be referred to as quantum measurement noise. We mention in passing that, in systems with large Hilbert spaces, such as multi-qubit systems, it is in fact more appropriate to optimize the expectation value of some observable instead of the fidelity.
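The discrete cost function can be simulated classically as a Bernoulli draw; the fidelity value and shot count below are arbitrary:

```python
import numpy as np

def quantum_measurement(F, rng):
    """Single-shot measurement: outcome 1 with probability F, else 0
    (the discrete cost F_q)."""
    return int(rng.random() < F)

rng = np.random.default_rng(42)
F_true = 0.8
shots = [quantum_measurement(F_true, rng) for _ in range(20000)]
estimate = sum(shots) / len(shots)   # converges to F_true for many shots
```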

3 Policy gradient based QAOA (PG-QAOA)

Being a variational ansatz, QAOA does not specify the optimization procedure to determine the variational parameters. In this paper, we demonstrate that policy gradient, which is a widely used algorithm in reinforcement learning, can be particularly robust to various sources of uncertainty in the physical system. In order to tackle the robust optimization of QAOA for general noise models, reinforcement learning algorithms provide a useful perspective.

We first reformulate the original problem as a probability-based optimization problem. The original optimization parameters are drawn from a probability distribution, described by some variational parameters $\theta$. Optimization is then performed over the variational parameters. Such techniques are used in natural evolution strategies (NES) (Wierstra et al., 2008) and model-based optimization (MBO) (Brookes et al., 2019). If the solution of the optimization problem is unique, we should expect that the probability distribution of each parameter will converge to a Dirac-$\delta$ distribution. This probability-based approach also has the advantage of being resilient to perturbations and noise. As will be shown later, the width of the probability distribution after optimization can also be used to qualitatively monitor the magnitude of the unknown noise. A common example of a probability-based optimization algorithm in reinforcement learning is the policy gradient algorithm, where the goal is to find the optimal policy to perform a given task (Williams, 1992; Sutton and Barto, 2018). An additional advantage of probability-based optimization is that it can be used to handle continuous and discrete variables in a unified fashion. Thus, the ideas we put forward below open up the way to applying RL techniques to continuous quantum control. In the context of QAOA, the durations can be treated as continuous variables without the error due to time discretization.

Let us begin by casting the QAOA control problem (2.1) in a reinforcement learning framework. We consider a finite-horizon episodic learning task, with $p$ steps per episode, naturally defined by the discrete character of the QAOA ansatz (2.1).

In RL, states uniquely determine the environment. In quantum control, the environment is given by the controlled quantum system. Hence, the natural choice for the RL state space is the Hilbert space $\mathcal{H}$. However, there are a number of problems associated with this choice: (i) in many-body quantum systems of $n$ qubits, the dimension of $\mathcal{H}$ is exponentially large, which raises questions about the scalability of the algorithm to a large number of qubits; (ii) the quantum states are artificial mathematical constructs, and cannot be observed in experiments; this violates the condition that the RL state space be fully observable. Indeed, reading out all entries of the quantum wavefunction requires full quantum tomography (Torlai et al., 2018), which scales exponentially with the number of qubits $n$. This comes in stark contrast with recent applications of RL to certain optimal quantum control problems, e.g. (Niu et al., 2019a; Dalgaard et al., 2019), in which the quantum wavefunction for small Hilbert spaces is indeed accessible on a classical computer.

It was recently suggested that, since the initial quantum state is known and Schrödinger evolution is deterministic, the quantum state at an intermediate time can be inferred from the sequence of actions taken at each time interval. Therefore, the sequence of all actions taken before a given episode step can be treated effectively as the RL state, and we work with this definition here. We mention in passing that this choice is not unique: in practice, reinforcement learning based methods often incorporate some form of embedding of the quantum state as their state. Notable examples include tabular Q-learning (Bukov et al., 2018a), Q-learning with neural networks (Sørdal and Bergli, 2019; An and Zhou, 2019), and LSTM-based memory proximal policy optimization (August and Hernández-Lobato, 2018; Fösel et al., 2018).

At every step in the episode, our RL agent is required to independently choose two actions from a continuous interval, representing the values of the durations $t_{2j-1}$ and $t_{2j}$. Hence, the action space is continuous. Actions are selected using the parameterized policy $\pi_\theta$. Since we use the fidelity as the objective function, the reward space is the interval $[0, 1]$.

In this work, we use the simplest ansatz, i.e. independent Gaussian distributions, to parameterize the policy over the control durations $t_1, \dots, t_{2p}$ in QAOA. Since a Gaussian is uniquely determined by its mean $\mu_j$ and standard deviation (std) $\sigma_j$, we have a total of $2p$ independent Gaussians, with variational parameters $\theta = \{\mu_j, \sigma_j\}_{j=1}^{2p}$. The total number of parameters is $4p$ (in particular, it does not scale with the number of qubits $n$). The probability density of all the parameters is the product of all the marginal distributions:

$$\pi_\theta(t_1, \dots, t_{2p}) = \prod_{j=1}^{2p} p(t_j; \mu_j, \sigma_j), \quad (3.1)$$

where $p(t; \mu, \sigma)$ is the probability density for the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$,

$$p(t; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\!\left( -\frac{(t - \mu)^2}{2\sigma^2} \right).$$
Note that with such a choice, $t_j$ may become negative, which lies outside the action space. We can enforce the constraint using a truncated Gaussian distribution (after proper normalization) or a log-normal distribution. In practice we observe that, with proper initialization, the positivity condition is automatically satisfied by the minimizer even with the simple choice in Eq. (3.1).
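The factorized Gaussian policy, its sampling, and its log-density can be sketched as follows (the value of $p$, the means, and the stds below are placeholders):

```python
import numpy as np

def sample_protocol(mu, sigma, rng):
    """Draw durations t_j ~ N(mu_j, sigma_j^2), independently per slot."""
    return rng.normal(mu, sigma)

def log_policy(t, mu, sigma):
    """log pi_theta(t) for the factorized Gaussian policy: a sum of
    independent Gaussian log-densities."""
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                        - (t - mu)**2 / (2.0 * sigma**2)))

mu = np.full(4, 0.5)      # 2p = 4 durations (p = 2)
sigma = np.full(4, 0.1)
rng = np.random.default_rng(1)
t = sample_protocol(mu, sigma, rng)
lp = log_policy(t, mu, sigma)
```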


The QAOA objective function (2.3) for the probability-based ansatz (3.1) introduced above now takes the form:

$$J(\theta) = \mathbb{E}_{t \sim \pi_\theta}\big[ F(t_1, \dots, t_{2p}) \big].$$

Here $J(\theta)$ is called the expected reward function. In this form, the objective function can be optimized using the REINFORCE algorithm for policy gradient (Williams, 1992):

$$\nabla_\theta J(\theta) = \mathbb{E}_{t \sim \pi_\theta}\big[ F(t_1, \dots, t_{2p})\, \nabla_\theta \log \pi_\theta(t_1, \dots, t_{2p}) \big].$$
In particular, the gradient can be evaluated without information about the first order derivative of the objective function $F$. In practice, we use a Monte Carlo approximation to evaluate this gradient, as shown in Algorithm 1. In order to reduce the variance of the gradient, usually a baseline $b$ is subtracted from the fidelity (Greensmith et al., 2004), i.e. $F$ is replaced with $F - b$ in the policy gradient; it is easy to compute the average fidelity over the MC sample (i.e. the batch), and we use that as the baseline. The resulting algorithm will be referred to as the policy gradient based QAOA (PG-QAOA).

0:  Input: initial guess for the mean $\mu$ and std $\sigma$; batch size $N$, learning rate $\eta$, total number of iterations $M$.
1:  Initialize the mean $\mu$ and std $\sigma$ with the initial guess.
2:  for $k = 1, \dots, M$ do
3:     Sample a batch of size $N$: $t^{(i)} \sim \pi_\theta$, $i = 1, \dots, N$.
4:     Compute the instantaneous fidelities $F(t^{(i)})$ and the averaged fidelity $b = \frac{1}{N} \sum_{i=1}^{N} F(t^{(i)})$.
5:     Compute the policy gradient $g = \frac{1}{N} \sum_{i=1}^{N} \big( F(t^{(i)}) - b \big)\, \nabla_\theta \log \pi_\theta(t^{(i)})$.
6:     Update the weights $\theta \leftarrow \theta + \eta\, g$.
7:  end for
Algorithm 1 Policy gradient based QAOA
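A self-contained sketch of Algorithm 1 on a toy single-qubit problem may look as follows. The generators ($\sigma^z$, $\sigma^x$), the target state, and all hyperparameters are illustrative choices, not the settings used in the experiments:

```python
import numpy as np
from scipy.linalg import expm

# Toy single-qubit environment (hypothetical generators and states)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
psi_i = np.array([1.0, 0.0], dtype=complex)                # |0>
target = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)  # |+>

def fidelity(durations):
    psi = psi_i
    for j, t in enumerate(durations):
        H = sz if j % 2 == 0 else sx
        psi = expm(-1j * t * H) @ psi
    return float(np.abs(np.vdot(target, psi)) ** 2)

def pg_qaoa(p=2, N=32, eta=0.05, iters=200, seed=0):
    """REINFORCE with a mean-fidelity baseline (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.4, 0.6, size=2 * p)   # policy means
    sigma = np.full(2 * p, 0.1)              # policy stds
    for _ in range(iters):
        batch = rng.normal(mu, sigma, size=(N, 2 * p))   # sample protocols
        F = np.array([fidelity(t) for t in batch])
        b = F.mean()                                     # baseline
        adv = (F - b)[:, None]
        # analytic gradient of log pi_theta for a factorized Gaussian policy
        g_mu = np.mean(adv * (batch - mu) / sigma**2, axis=0)
        g_sig = np.mean(adv * ((batch - mu)**2 - sigma**2) / sigma**3, axis=0)
        mu = mu + eta * g_mu                             # gradient ascent
        sigma = np.maximum(sigma + eta * g_sig, 1e-3)
    return mu, sigma, b

mu_opt, sigma_opt, mean_fid = pg_qaoa()
```

The analytic expressions for $\nabla_\theta \log \pi_\theta$ follow from differentiating the Gaussian log-density with respect to the means and stds; in the paper the corresponding update is performed by automatic differentiation in TensorFlow with the Adam optimizer.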

PG-QAOA can be naturally extended to the setting of robust optimization for the Hamiltonian noise. For the max-min problem, the policy gradient becomes

$$\nabla_\theta J(\theta) = \mathbb{E}_{t \sim \pi_\theta}\Big[ \Big( \min_{|\epsilon| \le \bar{\epsilon}} F(t; \epsilon) \Big)\, \nabla_\theta \log \pi_\theta(t) \Big].$$

In practice, we sample independent random realizations $\epsilon_1, \dots, \epsilon_m$ uniformly from the noise region, and use $\min_i F(t; \epsilon_i)$ as an approximation of the inner minimum. When the fidelity itself is noisy, as in the case of the Gaussian noise and the quantum measurement noise, we simply use the measured fidelity in the policy gradient step.
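The sampled approximation of the inner minimum can be sketched as follows, with a hypothetical fidelity oracle standing in for the quantum simulation:

```python
import numpy as np

def worst_case_fidelity(protocol, fidelity_fn, eps_bar, m, rng):
    """Approximate min over |eps| <= eps_bar with m uniform samples;
    fidelity_fn(protocol, eps) is a hypothetical noisy-fidelity oracle."""
    eps = rng.uniform(-eps_bar, eps_bar, size=m)
    return min(fidelity_fn(protocol, e) for e in eps)

# Toy oracle: fidelity degrades quadratically with the noise strength.
toy_fidelity = lambda t, e: 0.99 - e**2
rng = np.random.default_rng(3)
reward = worst_case_fidelity(None, toy_fidelity, eps_bar=0.2, m=8, rng=rng)
```

The resulting worst-case reward replaces the plain fidelity in step 5 of Algorithm 1.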

4 Quantum Qubit Models

We investigate the performance of PG-QAOA for a single-qubit system, and two different multi-qubit systems, defined as follows:

4.1 Single qubit model

Consider a single-qubit system, whose QAOA dynamics is generated by the Hamiltonians


with the Pauli matrices. The initial and target states are chosen to be the ground states of the two Hamiltonians, respectively. This control problem was introduced and analyzed in the context of reinforcement learning in Ref. (Bukov et al., 2018a): below the quantum speed limit (QSL), i.e. for small total durations $T$, it is not possible to prepare the target state with unit fidelity; yet, in this regime there is a unique optimal solution which maximizes the fidelity of being in the target state, and its fidelity is less than unity. Above the QSL, there exist multiple unit-fidelity solutions to this constrained optimization problem.

4.2 Multi-qubit Models

To compare the performance of PG-QAOA against alternative algorithms, we use multi-qubit systems. For the purpose of a more comprehensive analysis, we use two different models, discussed in (Bukov et al., 2018a; Niu et al., 2019b).

4.2.1 Multi-qubit system I

Consider first the transverse-field Ising model, described by the Hamiltonian (Bukov et al., 2018a):


Here is the total number of qubits. The global control field can take two discrete values, corresponding to the two alternating QAOA generators and , cf. Eq. (missing). The initial state is the ground state of , and the target state is chosen to be the ground state of , so the adiabatic regime is not immediately obvious; both states exhibit paramagnetic correlations and area-law bipartite entanglement. The overlap between the initial and target states goes down exponentially with increasing the number of qubits (with all other parameters kept fixed). This state preparation problem is motivated by applications in condensed matter theory. For , this qubit control problem was recently shown to exhibit similarities with optimization in glassy landscapes (Day et al., 2019); for there exist durations for which the optimal solution is doubly-degenerate and the optimization landscape features symmetry-breaking (Bukov et al., 2018b).
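For small systems, a transverse-field Ising Hamiltonian of this general form can be built explicitly as a dense matrix; the open-chain geometry and the couplings below are illustrative, not the values used in the testcase:

```python
import numpy as np
from functools import reduce

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def op_at(op, site, n):
    """Embed a single-site operator at `site` in the n-qubit space."""
    factors = [I2] * n
    factors[site] = op
    return reduce(np.kron, factors)

def tfim(n, J=1.0, g=1.0):
    """Open-chain transverse-field Ising model,
    H = -J sum_j sz_j sz_{j+1} - g sum_j sx_j  (J, g illustrative)."""
    H = np.zeros((2**n, 2**n), dtype=complex)
    for j in range(n - 1):
        H -= J * op_at(sz, j, n) @ op_at(sz, j + 1, n)
    for j in range(n):
        H -= g * op_at(sx, j, n)
    return H

H = tfim(3)
```

In the paper, such Hamiltonians are built with QuSpin/QuTiP rather than dense kron products, which is essential for larger qubit numbers.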

Additionally, we can also turn on small random Hamiltonian noise on the interaction terms of the first two bonds of the spin system, which mimics gate imperfections in the context of quantum computing. The choice of noisy bonds is arbitrary. To keep the notation compact, we define the noise tuple $\epsilon = (\epsilon_1, \epsilon_2)$, with each $\epsilon_j \in [-\bar{\epsilon}, \bar{\epsilon}]$, where $[-\bar{\epsilon}, \bar{\epsilon}]$ is the support of the uniform distribution.

4.2.2 Multi-qubit system II

Consider another benchmark example (Niu et al., 2019b). Here, we choose the two alternating Hamiltonians from QAOA as


where is the identity operator. The initial state is the product state , and the target state is the product state . This population transfer problem amounts to a qubit transfer.

The noisy multi-qubit system II uses the gate Hamiltonians:


Here, the three-body noise term breaks the particle number (a.k.a. magnetization) symmetry of the original noise-free system.

5 Numerical Experiments and Results

The models we introduced in Section 4 were also studied in (Dong et al., 2019) using SCP to mitigate the error due to Hamiltonian noise. First, we benchmark our results against the derivative-based algorithms SCP and b-GRAPE (Wu et al., 2019) in the context of the Hamiltonian noise. We also present results for PG-QAOA in the context of the Gaussian noise and the quantum measurement noise. Then we compare our results to other derivative-free optimization methods, including Nelder-Mead (Gao and Han, 2012), Powell (Powell, 1964), covariance matrix adaptation (CMA) (Hansen and Ostermeier, 2001), and particle swarm optimization (PSO) (Shi et al., 2001).

All numerical experiments are performed on the Savio2 cluster at Berkeley Research Computing (BRC). Each node is equipped with Intel Xeon E5-2680 v4 CPUs with 28 cores. PG-QAOA is implemented in TensorFlow 1.14 (Abadi et al., 2015) along with TensorFlow Probability 0.7.0 (Dillon et al., 2017). The quantum Hamiltonian environment is implemented using QuSpin (Weinberg and Bukov, 2017, 2019) and QuTiP (Johansson et al., 2012, 2013). The two blackbox optimization methods CMA and PSO are implemented with Nevergrad (Rapin and Teytaud, 2018).

Throughout, we used the Adam optimizer (Kingma and Ba, 2014) to train PG-QAOA, with a fixed learning rate and a periodically applied learning-rate decay. The training batch size is chosen per testcase (see figure captions). The initial values for the standard deviation parameters of the policy are either set to a constant or sampled from a truncated log-normal distribution. In the Single-qubit testcase (cf. Section 5.1) and the Multi-qubit I testcase (cf. Section 4.2.1), the initial values for the mean parameters of the policy are randomly sampled from a truncated normal distribution. In the Multi-qubit II testcase (cf. Section 4.2.2), the initial values for the means are sampled from a truncated normal distribution with different parameters. In practice, we noticed that the performance of PG-QAOA is sensitive to the initialization of the means. In some cases, the initialization was tuned to achieve better performance (cf. Figure 5).

In the numerical experiments, we do not enforce hard constraints on the positivity of the durations; yet, in practice we were still able to obtain protocols with positive durations. This is mainly because the initialization of the mean parameters of the policy is positive and sufficiently far away from zero, and because there already exist optimal protocol solutions (i.e. local minima of the control landscape close to the initial values) with positive durations.

5.1 Single qubit results

Figure 1: The distribution in the learning process for the single-qubit testcase. From left to right, snapshots of the training batch distribution in the (protocol duration, fidelity) space at different training episodes for PG-QAOA. Top row: noise-free fidelity problem (green circles). Middle row: Gaussian fidelity noise problem (red tri-ups), the corresponding exact fidelity values for comparison only (red diamonds, not used in training), and the mean mini-batch fidelity (dashed vertical line). Bottom row: quantum measurement noise problem (magenta crosses) with binary values, the corresponding exact fidelity values for comparison only (magenta squares, not used for training), and the mean mini-batch fidelity (dashed vertical line). The final learned distributions represent a set of solutions with different total protocol durations but still sharing the same optimal fidelity, demonstrating the machine learning aspect of the algorithm (see text). The standard deviation of the Gaussian noise is . The QAOA depth is . The PG-QAOA algorithm is trained with a single minibatch of size for a total iteration number . The initial mean values are randomly sampled from a truncated normal distribution, and the initial standard deviation values are sampled from a truncated log-normal distribution.

Figure 1 (topmost row) shows snapshots of the policy during training for PG-QAOA in the noise-free case. We sample a batch of protocols from the policy learned in the middle of training, and show its distribution in (protocol duration, fidelity)-space. Due to the random initialization of the policy parameters, the algorithm starts from a broad distribution. As the number of training episodes (a.k.a. optimization iterations) increases, the mean of the training batch distribution shifts ever closer to the unit-fidelity region, as expected. At the same time, the distribution also shrinks at later training episodes, and becomes approximately a delta-function in fidelity space in the infinite-training-episode limit, since the environment for the noise-free problem is deterministic (though the distribution may still exhibit a finite width due to the decay of the learning rate in the optimization procedure).

Figure 1 (middle and bottom rows) shows the effect of the two types of noise on the performance of PG-QAOA. We test both the Gaussian noise, which accounts for various potential classical measurement uncertainty sources in the lab, and the intrinsic quantum measurement noise induced by collapsing the wavefunction during measurements. In the case of quantum measurement noise (magenta), we use only binary fidelity values as the reward for PG-QAOA, cf. Eq. (2.10); the exact fidelity values for the batch (which are not binary) are shown for comparison purposes only. We emphasize that we do not repeat the quantum measurement on the same protocol several times, but only take a single quantum measurement for each protocol from the sampled batch in every iteration. The mean batch fidelity is shown as a vertical dashed line. In the case of Gaussian noise (red), the noisy fidelity values used for training are not binary; PG-QAOA is thus well-suited to handle both classical and quantum noise effects. Because we clip the Gaussian-noisy fidelities to the interval [0, 1], the mean fidelity of the policy (vertical dashed red line) remains slightly away from unity even after a large number of training episodes, introducing a small gap, also visible in the training curves for the multi-qubit examples (Figure 3, left).

Note that the policy optimized using PG-QAOA converges at later training episodes for both noisy settings (Gaussian and quantum measurement noise). An interesting feature is the remaining finite width along the protocol duration axis: these unit-fidelity protocols are indistinguishable from the point of view of the objective function, and are thus equally optimal. Hence, above the QSL, PG-QAOA is capable of learning multiple solutions simultaneously, unlike conventional optimal control algorithms, showcasing one of the advantages of using reinforcement learning. We can indeed verify that these distribution points correspond to distinct protocols by visualizing the batch trajectories on the Bloch sphere (the projective space of the single-qubit Hilbert space), cf. Figure 7. We mention in passing that, depending on the initialization of the policy parameters, PG-QAOA finds a different (but equivalent w.r.t. the reward) local basin of attraction in the control landscape, as can be seen from the difference in the mean total protocol duration at later training episodes for the noise-free and the two noisy cases.

5.2 Multi-qubit results

Figure 2 shows the training curves of PG-QAOA for an increasing number of qubits $n$ and QAOA depths. In accord with the fact that the multi-spin fidelity decreases exponentially with increasing $n$, the PG-QAOA algorithm takes longer to converge.

Figure 2: Multi-qubit systems, noise-free case. Learning curves (reward vs. episode number) for the Multi-qubit I testcase (a) and the Multi-qubit II testcase (b), for different numbers of qubits and QAOA depths, for three different random seeds. The PG-QAOA algorithm is trained with a batch size of 128 for 2000 iterations. The means are initialized from truncated normal distributions [left] and [right]; the stds are initialized from a truncated log-normal distribution.

Adding Gaussian and quantum measurement noise, in Fig. 3 we show the training curves for PG-QAOA. For each noisy case, we present the actual mean fidelities (red for the Gaussian noise and magenta for the quantum measurement noise); the exact fidelities (green) are shown only for comparison and are not used in training. Note that learning from quantum measurements is more prone to noise in the initial stage of the optimization, yet the algorithm converges within a smaller number of episodes compared to the case of the Gaussian noise. For Gaussian noise, similar to the single-qubit case, we observe a small gap between the exact fidelity and the noisy fidelity, due to clipping the noisy fidelities to fit within the interval [0, 1]. Empirically, we observe the gap size to be almost always about half the Gaussian noise level. This indicates that the probability distribution is moving in the correct direction (with fidelity close to unity) even though the observed fidelity is away from it due to the noise. More results are presented for the Gaussian noise and for the quantum measurement noise in Appendix B and Appendix C, respectively. In Figure 8, the optimization becomes more difficult with increasing qubit number, and the gap grows proportionally with the Gaussian noise level. In Figure 9, we show that the variance of the mean fidelities is reduced at larger batch sizes for the quantum measurement noise; similar results can be observed for the Gaussian noise as well.

Figure 3: Multi-qubit testcase I, training curves: the reward (mean batch fidelity; red for Gaussian noise and magenta for quantum measurement noise) used in PG-QAOA against the number of training episodes (i.e. iterations). For comparison purposes only, we also show the exact noise-free mean fidelity (green). Left: Gaussian noise. Right: quantum measurement noise. The standard deviation of the Gaussian noise is . The number of qubits is . The batch sizes for Gaussian noise and quantum measurement noise are and , respectively. The initial mean values are sampled from a truncated normal distribution, and the initial standard deviation values from a truncated log-normal distribution, for both noisy cases.

We now benchmark PG-QAOA against a number of different optimal control algorithms. In order to compare PG-QAOA with state-of-the-art optimization methods using gradient and Hessian information, such as b-GRAPE and SCP, we evaluate their performance using both the batch-average and the worst-case fidelity as reference. For protocol durations $t_1, \dots, t_{2p}$, the average and worst-case fidelity over a given support $[-\bar{\epsilon}, \bar{\epsilon}]$ of the uniform distribution are defined as

$$F_{\mathrm{avg}}(t) = \mathbb{E}_{\epsilon \sim \mathcal{U}[-\bar{\epsilon}, \bar{\epsilon}]}\big[ F(t; \epsilon) \big], \qquad F_{\mathrm{worst}}(t) = \min_{\epsilon \in [-\bar{\epsilon}, \bar{\epsilon}]} F(t; \epsilon).$$
A comparison for multi-qubit testcases I and II is shown in Figure 4 and Figure 5, respectively. In terms of both the average and the worst case, PG-QAOA performs comparably to SCP; although PG-QAOA is derivative-free with respect to the objective and relies only on a first-order optimizer for the policy parameters, it can occasionally reach even better solutions than SCP w.r.t. the average fidelity. PG-QAOA clearly outperforms b-GRAPE (Wu et al., 2019) in the numerical experiments involving a small number of qubits. We also observe a performance drop for PG-QAOA when the number of qubits is increased. Properly scaling up the performance of PG-QAOA with increasing remains a topic for further investigation.

Figure 4: Multi-qubit testcase I, algorithms comparison for Hamiltonian gate noise. Fidelity achieved by PG-QAOA (purple), SCP (blue) and b-GRAPE (orange) for a few different numbers of qubits and total QAOA depth values . The two panels correspond to different values of the support of the uniform distribution used for the Hamiltonian gate noise. We show both the average fidelity (solid lines) and the worst-case fidelity (dashed lines), cf. the definitions above. The PG-QAOA algorithm is trained with mini-batch size , except for , where . The initial values for the means are sampled from a truncated and the initial values for the standard deviations are kept constant at .
Figure 5: Multi-qubit testcase II, algorithms comparison for Hamiltonian gate noise. Comparison between PG-QAOA (purple) and SCP (blue) in terms of robust QAOA for multi-qubit testcase II with different numbers of qubits . The QAOA depth is and the support of the uniform distribution used for the Hamiltonian gate noise is . We show both the average fidelity (solid lines) and the worst-case fidelity (dashed lines), cf. the definitions above. The PG-QAOA algorithm is trained with mini-batch size for iterations. The initial values of the standard deviations are kept constant at ; the initial values for the means were drawn from for , for , and for .

Finally, in Figure 6 we compare against other widely used blackbox optimization methods: Nelder-Mead, Powell, covariance matrix adaptation (CMA), and particle swarm optimization (PSO). In contrast to PG-QAOA, which learns a probability distribution (in practice using MC-sampled batches), the other algorithms accept a single scalar cost-function value to optimize. Therefore, we feed them the mean fidelity over a (potentially noisy) training batch; this constitutes a fair comparison, since the mean batch fidelity is precisely the definition of the reward in policy gradient. The different algorithms perform comparably in the noise-free case (Figure 6, leftmost column). In the presence of measurement noise in the reward function, we observe a decrease in performance for all algorithms. At the same time, the decrease is smallest for PG-QAOA, which clearly outperforms the rest; this is particularly visible when the number of qubits is increased (note that, for , we keep fixed, so the maximum obtainable fidelity is expected to decrease). PG-QAOA appears less sensitive to the size of the Gaussian noise; moreover, it appears particularly well suited to handling quantum measurement noise.
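To illustrate this scalar-cost interface, the sketch below feeds a noisy mean batch fidelity to SciPy's Nelder-Mead implementation. The exponential-quadratic toy fidelity, the `target` angles, and all parameter values are our own stand-ins, not the paper's system:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
target = np.array([0.3, -0.7, 1.2])   # hypothetical optimal QAOA angles
sigma, batch = 0.05, 256              # noise level and batch size

def noisy_batch_fidelity(x):
    """Mean fidelity over a noisy batch: the single scalar a blackbox
    optimizer such as Nelder-Mead or Powell gets to see (toy model)."""
    f = np.exp(-np.sum((x - target) ** 2))          # noise-free fidelity
    samples = np.clip(f + rng.normal(0.0, sigma, batch), 0.0, 1.0)
    return samples.mean()

# Minimize the negative mean batch fidelity with a derivative-free method.
res = minimize(lambda x: -noisy_batch_fidelity(x),
               x0=np.full(3, 1.0), method="Nelder-Mead",
               options={"maxiter": 2000})
print("recovered angles:", np.round(res.x, 2))
```

Averaging over the batch reduces the noise on the scalar cost by a factor of the square root of the batch size, but, unlike the policy-gradient approach, the optimizer has no explicit model of the remaining stochasticity.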

Figure 6: Multi-qubit testcase I. Comparison between different optimization algorithms for qubits (first row) and qubits (second row), and different fidelity noise levels (cf. the -axis for the standard deviation of the Gaussian noise; the label "Q" (shaded area) stands for quantum measurement noise): PG-QAOA (blue), Nelder-Mead (orange), Powell (green), CMA (red), and PSO (purple). The comparison is shown on a log scale (upper row) and a linear scale (lower row). PG-QAOA outperforms the rest in the presence of noise. The batch sizes are for all methods, except for , where , and the total number of iterations is . For all PG-QAOA experiments, the initial values for the means are sampled from a truncated and the standard deviations are initialized from a truncated .

6 Conclusion and Outlook

Due to intrinsic limitations of near-term quantum devices, error mitigation techniques can be essential for the performance of quantum variational algorithms such as QAOA. Many classical optimization algorithms (derivative-free, or those requiring derivative information) may not perform well in the presence of noise. We demonstrate that probability-based optimization methods from reinforcement learning can be well suited to such tasks. This work considers the simplest setup, in which each optimization variable is parameterized by only two quantities, the mean and the standard deviation of an i.i.d. Gaussian distribution. The probability distribution is then optimized using the policy gradient method, which makes it possible to handle continuous control problems. We demonstrate that PG-QAOA does not require derivatives to be computed explicitly, and can perform well even if the objective function is not smooth with respect to the error. The performance of PG-QAOA can at times even be comparable to that of much more sophisticated algorithms, such as sequential convex programming (SCP), which require first- and second-order derivative information about the objective function. PG-QAOA also compares favorably to a number of commonly used blackbox optimization methods, particularly in experiments with noise and other sources of uncertainty.
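The setup summarized above (an independent Gaussian per optimization variable, trained by policy gradient on the mean batch fidelity) can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: `noisy_fidelity` is a stand-in for the actual quantum simulation, and all hyperparameters are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4                                   # number of QAOA angles (toy)
target = rng.uniform(-1.0, 1.0, p)      # hypothetical optimal angles
sigma_noise = 0.05                      # fidelity measurement noise

def noisy_fidelity(angles):
    """Toy stand-in for the noisy fidelity returned by the experiment."""
    f = np.exp(-np.sum((angles - target) ** 2, axis=-1))
    return np.clip(f + rng.normal(0.0, sigma_noise, f.shape), 0.0, 1.0)

# Each angle is parameterized by two numbers: a mean and a log-std.
mu, log_s = np.zeros(p), np.full(p, np.log(0.3))
lr_mu, lr_s, batch = 0.1, 0.02, 256

for episode in range(300):
    s = np.exp(log_s)
    eps = rng.standard_normal((batch, p))
    protocols = mu + s * eps                 # sample a batch of protocols
    reward = noisy_fidelity(protocols)       # mean batch reward = fidelity
    adv = reward - reward.mean()             # baseline for variance reduction
    # REINFORCE gradients of log N(a; mu, s): d/dmu = eps/s, d/dlog_s = eps**2 - 1.
    mu += lr_mu * np.mean(adv[:, None] * eps / s, axis=0)
    log_s += lr_s * np.mean(adv[:, None] * (eps ** 2 - 1.0), axis=0)

print("final noise-free fidelity:", np.exp(-np.sum((mu - target) ** 2)))
```

Because only sampled rewards enter the update, the loop never needs derivatives of the fidelity itself, which is the key property that makes the method robust to noisy and non-smooth objectives.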

Viewed from the perspective of reinforcement learning, the Gaussian probability distribution used in this work is the simplest possible choice. More involved distributions, such as multi-modal Gaussian distributions, flow-based models, and long short-term memory (LSTM) models, may be considered. Based on our preliminary results, these methods introduce a significantly larger number of parameters, but the benefit is not yet obvious. One can also employ more advanced RL algorithms, such as trust region policy optimization (Schulman et al., 2015) and proximal policy optimization (Schulman et al., 2017). Finally, this work only considers implementations on a classical computer. Implementing and testing PG-QAOA on near-term quantum computing devices, such as those provided by IBM and Rigetti Computing, is left for future work.

Acknowledgments

This work was partially supported by a Google Quantum Research Award (L.L., J.Y.) and by the Quantum Algorithm Teams Program under Grant No. DE-AC02-05CH11231 (L.L.). M.B. was supported by the Emergent Phenomena in Quantum Systems initiative of the Gordon and Betty Moore Foundation, and the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Quantum Algorithm Teams Program. We thank Yulong Dong for helpful discussions, and Berkeley Research Computing (BRC) for providing computational resources.


Appendix A Trajectories on the Bloch sphere

In this appendix, we visualize the final policies learned by the PG-QAOA algorithm in Fig. 1.

Figure 7: Single-qubit testcase. Trajectories of protocols sampled from the learned policy, plotted on the Bloch sphere. The three sets of curves correspond to the noise-free case (green), Gaussian noise (red), and quantum measurement noise (magenta). The simulation parameters are the same as in Fig. 1.

Appendix B Learning curves: Gaussian noise

In this appendix, we provide the learning curves for PG-QAOA for various values of the standard deviation (i.e. the noise level) of the Gaussian noise, and different numbers of qubits .

Figure 8: Multi-qubit testcase I, Gaussian noise. Training curves (reward vs. episode number) for various values of the Gaussian noise , and several numbers of qubits and QAOA depths . For comparison, we show the exact noise-free mean fidelity (green). The three rows correspond to different Gaussian noise levels; from top to bottom: , and . The three columns correspond to different numbers of qubits and QAOA depths ; from left to right: , and . The PG-QAOA algorithm is trained with mini-batch size 2048 for episodes. The means are initialized from a truncated Gaussian distribution and the standard deviations are initialized to 0.0024.

Appendix C Learning curves: quantum measurement noise

In this appendix, we provide the learning curves for PG-QAOA for various values of the batch size used in the quantum measurement noise simulations, and different numbers of qubits .

Figure 9: Multi-qubit testcase I, quantum measurement noise. Training curves (reward vs. episode number) for various values of the batch size , and several numbers of qubits and QAOA depths . For comparison, we show the exact noise-free mean fidelity (green). The two rows correspond to different mini-batch sizes ; from top to bottom: and . The three columns correspond to different numbers of qubits and QAOA depths ; from left to right: , and . The PG-QAOA algorithm is trained for episodes. The means are initialized from a truncated Gaussian distribution and the standard deviations are initialized to 0.0024.