1 Introduction
Noisy intermediate-scale quantum (NISQ) devices are becoming increasingly accessible. However, their performance can be severely restricted by decoherence effects, which introduce noise into all components of the quantum computer, including initial state preparation, unitary evolution, and measurement/qubit readout. Thanks to the feasibility of being implemented and tested on near-term devices, hybrid quantum-classical algorithms, and in particular quantum variational algorithms (QVA), have received a significant amount of attention recently. Examples of QVA include the Variational Quantum Eigensolver (Peruzzo et al., 2014), the Quantum Approximate Optimization Algorithm (QAOA) (Farhi et al., 2014), Quantum Variational Autoencoders (Romero et al., 2017), etc. The common feature of these algorithms is that the final wavefunction can be prepared by applying a unitary evolution operator, parametrized using a relatively small number of parameters, to an initial wavefunction. The parameters can then be variationally optimized to maximize a given objective function, measured on the quantum device.

In this study, we consider the Quantum Approximate Optimization Algorithm (QAOA) (Farhi et al., 2014), a particularly simple algorithm that alternates between two unitary time evolution operators of the form $e^{-i\gamma H_1}$ and $e^{-i\beta H_2}$ ($\gamma, \beta \ge 0$). QAOA has been studied in the context of a number of discrete (Farhi et al., 2014; Lloyd, 2018; Hadfield, 2018) and continuous (Verdon et al., 2019) optimization problems. QAOA has also been demonstrated to be universal under certain circumstances (Lloyd, 1995, 2018; Morales et al., 2019), in the sense that any element of a unitary group can be well approximated by a properly parameterized QAOA. This is highly nontrivial and a uniquely quantum feature, since QAOA only has access to unitary operators generated by two specific Hamiltonians $H_1$ and $H_2$. However, the control energy landscape of QAOA is known to be highly complex (Streif and Leib, 2019; Niu et al., 2019b), and its optimization can therefore be challenging.
QAOA can be naturally related to quantum control, and thus also to reinforcement learning problems. This has inspired studies from various angles, such as the Krotov method (Tannor et al., 1992), Pontryagin's maximum principle (Yang et al., 2017), tabular reinforcement learning methods (Chen et al., 2013; Bukov, 2018), function-approximation-based (deep) Q-learning methods (Bukov et al., 2018a; Sørdal and Bergli, 2019; An and Zhou, 2019; Zhang et al., 2019), policy gradient methods (Fösel et al., 2018; August and Hernández-Lobato, 2018; Chen and Xue, 2019; Niu et al., 2019a; Porotti et al., 2019), and methods inspired by the success of AlphaZero (Dalgaard et al., 2019). Most studies focus on noise-free scenarios, applicable to fault-tolerant quantum devices. In order to mitigate errors on near-term devices, robust optimization based on sequential convex programming (SCP) has recently been studied (Kosut et al., 2013; Dong et al., 2019); it assumes that both the source and the range of the magnitude of the error are known, but not its exact value. In such a case, the authors found that robust optimization can significantly improve the accuracy of the variational solution.
Nonetheless, techniques such as SCP require access to first- as well as second-order derivatives of the objective function, which can themselves be noisy and difficult to obtain on quantum devices. The objective function should also be at least continuous with respect to the error, a requirement which is not satisfied in the case of quantum uncertainty in the final measurement process (e.g. in the form of a bit flip or a phase flip). It is thus desirable to perform robust optimization using function evaluations only, while keeping the result resilient to unknown and generic types of errors.
In this paper, we demonstrate that reinforcement learning (RL) may be used to tackle all of the above challenges in optimizing the parameters of QAOA, and more generally of QVA. Instead of directly optimizing the variational parameters themselves, we assign a probability distribution to the parameter set, and perform the optimization with respect to the parameters of this probability distribution, denoted $\theta$. The modified objective function (called the reward function) can then be continuous with respect to $\theta$, even if the original objective function is not. The optimization procedure only requires a (possibly large) number of function evaluations, but no information about first- or second-order derivatives. We show that a simple policy gradient method introduces only a small number of additional parameters into the optimization procedure, and can be used to optimize the parameters in QAOA. Since each step of the optimization only involves a small batch of samples, the optimization procedure can also be resilient to various sources of noise.

This paper is organized as follows. Section 2 provides a brief introduction to QAOA, its connection to quantum control, and the noise models. Section 3 introduces the policy gradient based QAOA (PG-QAOA), in the context of noise-free and noisy optimization. After introducing the test systems in Section 4, we present in Section 5 numerical results of PG-QAOA for single-qubit and multi-qubit examples under different noise models. Section 6 concludes and discusses future work. Additional numerical results are presented in the Appendices.
2 Preliminaries
2.1 QAOA and Quantum Control
Consider the Hilbert space $\mathcal{H} = (\mathbb{C}^2)^{\otimes N}$, with $N$ the number of qubits in the quantum system. Starting from an initial quantum state $|\psi_i\rangle$, in QAOA we apply two alternating unitary evolution operators (Farhi et al., 2014):

$|\psi(\boldsymbol{\gamma}, \boldsymbol{\beta})\rangle = e^{-i \beta_p H_2} e^{-i \gamma_p H_1} \cdots e^{-i \beta_1 H_2} e^{-i \gamma_1 H_1} |\psi_i\rangle.$ (2.1)

The unitary evolution is generated by the time-independent Hamiltonian operators $H_1$ and $H_2$, each applied for a duration $\gamma_j$ and $\beta_j$, respectively ($j = 1, \ldots, p$); we refer to $p$ as the total depth. In QAOA, we have to adjust the parameters $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ to optimize an objective function, e.g. minimizing the energy (Ho and Hsieh, 2019) or maximizing the fidelity of being in some target state.^1 In the latter case, for a target wavefunction denoted by $|\psi_t\rangle$, the optimization problem becomes

$\max_{\boldsymbol{\gamma}, \boldsymbol{\beta}} F(\boldsymbol{\gamma}, \boldsymbol{\beta}),$ (2.2)
$F(\boldsymbol{\gamma}, \boldsymbol{\beta}) = |\langle \psi_t | \psi(\boldsymbol{\gamma}, \boldsymbol{\beta}) \rangle|^2.$ (2.3)

^1 The latter problem is often referred to as a state transfer problem.

The problem of finding the optimal parameters in QAOA can be reinterpreted as the following bilinear quantum optimal control problem

$i \partial_t |\psi(t)\rangle = \left[ w(t) H_1 + (1 - w(t)) H_2 \right] |\psi(t)\rangle, \quad |\psi(0)\rangle = |\psi_i\rangle,$ (2.4)

where $w(t) \in \{0, 1\}$. In particular, when $w(t)$ is chosen to be the following piecewise constant function

$w(t) = 1$ for $t \in [T_{2j-2}, T_{2j-1})$, and $w(t) = 0$ for $t \in [T_{2j-1}, T_{2j})$, $j = 1, \ldots, p$, (2.5)

with $T_0 = 0$, $T_{2j-1} = T_{2j-2} + \gamma_j$, and $T_{2j} = T_{2j-1} + \beta_j$, we recover the QAOA wavefunction (2.1). This is a special type of quantum control problem called bang-bang quantum control. For a protocol with durations $(\boldsymbol{\gamma}, \boldsymbol{\beta})$, the total duration is defined as

$T = \sum_{j=1}^{p} (\gamma_j + \beta_j).$ (2.6)
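To make the ansatz concrete, the alternating evolution (2.1), the fidelity (2.3), and the total duration (2.6) can be sketched in a few lines of Python. This is an illustrative sketch only: the generators `Z`, `X` and the durations in the usage example are placeholders, not the models studied later in the paper.

```python
import numpy as np

def evolve(H, t, psi):
    """Apply exp(-i t H) to psi, for a Hermitian H, via eigendecomposition."""
    lam, V = np.linalg.eigh(H)
    return V @ (np.exp(-1j * t * lam) * (V.conj().T @ psi))

def qaoa_state(H1, H2, gammas, betas, psi_i):
    """Eq. (2.1): alternately apply exp(-i gamma_j H1) and exp(-i beta_j H2)."""
    psi = psi_i
    for g, b in zip(gammas, betas):
        psi = evolve(H2, b, evolve(H1, g, psi))
    return psi

def fidelity(psi_t, psi):
    """Eq. (2.3): squared overlap with the target state."""
    return abs(np.vdot(psi_t, psi)) ** 2

def total_duration(gammas, betas):
    """Eq. (2.6): T = sum_j (gamma_j + beta_j)."""
    return sum(gammas) + sum(betas)

# Usage with placeholder single-qubit generators (not the paper's models):
Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
psi = qaoa_state(Z, X, [0.3, 0.7], [0.5, 0.2], np.array([1, 0], dtype=complex))
```

Since each factor is unitary, the prepared state stays normalized for any durations; only its fidelity with the target depends on them.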
2.2 Noisy Objective Functions
Practical QAOA calculations can be prone to noise. For instance, the Hamiltonians may take the form $H_i(\epsilon) = H_i + \epsilon \, \delta H_i$, where $H_i$ is the Hamiltonian in the absence of noise, $\delta H_i$ is the Hamiltonian modelling the noise source, and $\epsilon$ is the magnitude of the noise. We assume that only the range/magnitude of $\epsilon$ is known a priori, denoted by $\bar{\epsilon}$ (i.e. $|\epsilon| \le \bar{\epsilon}$), while the precise value of $\epsilon$ is not known. This setup will be referred to as Hamiltonian noise. The noisy optimization problem can be solved as a max-min problem:

$\max_{\boldsymbol{\gamma}, \boldsymbol{\beta}} \min_{|\epsilon| \le \bar{\epsilon}} F_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}),$ (2.7)

where

$F_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}) = |\langle \psi_t | \psi_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}) \rangle|^2$ (2.8)

denotes the fidelity for the given noise strength and control durations, with $|\psi_{\epsilon}\rangle$ the state prepared by the ansatz (2.1) using the noisy Hamiltonians.
Noise may naturally also occur due to measurements. For instance, the final fidelity may only be measurable up to an additive Gaussian noise, i.e.

$\widetilde{F}(\boldsymbol{\gamma}, \boldsymbol{\beta}) = \mathrm{clip}_{[0,1]}\left( F(\boldsymbol{\gamma}, \boldsymbol{\beta}) + \xi \right),$ (2.9)

where $\xi \sim \mathcal{N}(0, \sigma_n^2)$. Here, the $\mathrm{clip}_{[0,1]}$ function guarantees that the noisy fidelity is still bounded between 0 and 1. This will be referred to as Gaussian noise.
One may even go one step further, and notice that quantum measurements produce an intrinsic source of uncertainty due to the probabilistic nature of quantum mechanics. Assuming the target state is an eigenstate of some measurable operator $O$ with eigenvalue $\lambda$, i.e. $O |\psi_t\rangle = \lambda |\psi_t\rangle$, a quantum measurement produces the eigenvalue $\lambda$ with probability $F$. Using this, we can define the following discrete cost function:

$F_m = 1$ with probability $F$, and $F_m = 0$ with probability $1 - F$. (2.10)

Assuming the same state of the system is prepared anew in a series of experiments, repeated measurements produce a discrete set of ones and zeros, whose mean value converges to the true fidelity $F$ in the limit of a large number of quantum measurements. This setting was considered in (Bukov, 2018), and will be referred to as quantum measurement noise. We mention in passing that, in systems with large Hilbert spaces, such as multi-qubit systems, it is in fact more appropriate to optimize the expectation value of some observable, instead of the fidelity.
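The two measurement-noise models, Eqs. (2.9) and (2.10), amount to simple post-processing of the exact fidelity F; a minimal sketch (the fixed RNG seed is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noisy_fidelity(F, sigma_n):
    """Eq. (2.9): additive Gaussian measurement noise, clipped back to [0, 1]."""
    return float(np.clip(F + sigma_n * rng.standard_normal(), 0.0, 1.0))

def single_shot_fidelity(F):
    """Eq. (2.10): one projective measurement; returns 1 with probability F, else 0."""
    return int(rng.random() < F)
```

Averaging many single-shot outcomes converges to the true fidelity F, consistent with the discussion above.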
3 Policy gradient based QAOA (PG-QAOA)
Being a variational ansatz, QAOA does not specify the optimization procedure used to determine the variational parameters. In order to tackle the robust optimization of QAOA for general noise models, reinforcement learning algorithms provide a useful perspective. In this paper, we demonstrate that policy gradient, a widely used algorithm in reinforcement learning, can be particularly robust to various sources of uncertainty in the physical system.
We first reformulate the original problem as a probability-based optimization problem. The original optimization parameters are drawn from a probability distribution, described by some variational parameters $\theta$. The optimization is then performed over these variational parameters. Such techniques are used in natural evolution strategies (NES) (Wierstra et al., 2008) and model-based optimization (MBO) (Brookes et al., 2019). If the solution of the optimization problem is unique, we should expect the probability distribution of each parameter to converge to a Dirac delta distribution. This probability-based approach also has the advantage of being resilient to perturbations and noise. As will be shown later, the width of the probability distribution after optimization can also be used to qualitatively monitor the magnitude of the unknown noise. A common example of a probability-based optimization algorithm in reinforcement learning is the policy gradient algorithm, where the goal is to find the optimal policy for performing a given task (Williams, 1992; Sutton and Barto, 2018). An additional advantage of probability-based optimization is that it can handle continuous and discrete variables in a unified fashion. Thus, the ideas we put forward below open up the way to applying RL techniques to continuous quantum control. In the context of QAOA, the durations $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ can be treated as continuous variables, without the error due to time discretization.
Let us begin by casting the QAOA control problem, Eq. (2.2), in a reinforcement learning framework. We consider a finite-horizon episodic learning task, with $p$ steps per episode, naturally defined by the discrete character of the QAOA ansatz, Eq. (2.1).
In RL, states uniquely determine the environment. In quantum control, the environment is given by the controlled quantum system. Hence, the natural choice for the RL state space is the Hilbert space $\mathcal{H}$. However, there are a number of problems associated with this choice: (i) in many-body quantum systems of $N$ particles, $\mathcal{H}$ is exponentially large, which raises questions about the scalability of the algorithm to a large number of qubits; (ii) quantum states are artificial mathematical constructs and cannot be observed directly in experiments, which violates the condition that the RL state space be fully observable. Indeed, reading out all entries of the quantum wavefunction requires full quantum tomography (Torlai et al., 2018), which scales exponentially with the number of qubits $N$. This comes in stark contrast with recent applications of RL to certain optimal quantum control problems, e.g. (Niu et al., 2019a; Dalgaard et al., 2019), in which the quantum wavefunction for small Hilbert spaces is indeed accessible on a classical computer.
It was recently suggested that, since the initial quantum state is known and Schrödinger evolution is deterministic, the quantum state at an intermediate time can be inferred from the sequence of actions taken at each time interval. Therefore, the sequence of all actions taken before a given episode step can be treated effectively as the RL state, and we work with this definition here. We mention in passing that this choice is not unique: in practice, reinforcement learning based methods often incorporate some form of embedding of the quantum state as their state. Notable examples include tabular Q-learning (Bukov et al., 2018a), Q-learning networks (Sørdal and Bergli, 2019; An and Zhou, 2019), and LSTM-based memory proximal policy optimization (August and Hernández-Lobato, 2018; Fösel et al., 2018).
At every step in the episode, our RL agent is required to choose two actions independently out of a continuous interval, representing the values of the durations $\gamma_j$ and $\beta_j$. Hence, the action space is continuous. Actions are selected using the parameterized policy $\pi_\theta$. Since we use the fidelity as the objective function, the reward space is the unit interval $[0, 1]$.
In this work, we use the simplest ansatz, i.e. independent Gaussian distributions, to parameterize the policy over the control durations $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ in QAOA. Since a Gaussian is uniquely determined by its mean $\mu$ and standard deviation (std) $\sigma$, we have two independent variational parameters per duration, and the total number of parameters is $4p$ (in particular, it does not scale with the number of qubits $N$). The probability density of all the parameters is the product of all the marginal distributions:

$\pi_{\theta}(\boldsymbol{\gamma}, \boldsymbol{\beta}) = \prod_{j=1}^{p} \pi_{\mu_j^{\gamma}, \sigma_j^{\gamma}}(\gamma_j) \, \pi_{\mu_j^{\beta}, \sigma_j^{\beta}}(\beta_j),$ (3.1)

where $\pi_{\mu, \sigma}$ is the probability density of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$,

$\pi_{\mu, \sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$ (3.2)
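For reference, the log-density of the factorized Gaussian policy of Eqs. (3.1)-(3.2), the quantity whose gradient enters the policy gradient, can be sketched as follows (here `x` stacks all 2p sampled durations; the array shapes are illustrative assumptions):

```python
import numpy as np

def log_policy_density(x, mu, sigma):
    """log pi_theta(x) for independent Gaussians: a sum of 1D log-densities, Eq. (3.2)."""
    x, mu, sigma = (np.asarray(a, dtype=float) for a in (x, mu, sigma))
    return float(np.sum(-0.5 * ((x - mu) / sigma) ** 2
                        - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)))
```

Working with log-densities avoids numerical underflow when the number of factors grows with the depth p.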
Note that with such a choice, a sampled duration may become negative, which lies outside the action space. We can enforce positivity using a truncated Gaussian distribution (after proper normalization) or a log-normal distribution. In practice we observe that, with proper initialization, the positivity condition is automatically satisfied by the minimizer even with the simple choice in Eq. (3.2).

The QAOA objective function (2.3), for the probability-based ansatz (3.1) introduced above, now takes the form:
$J(\theta) = \mathbb{E}_{(\boldsymbol{\gamma}, \boldsymbol{\beta}) \sim \pi_{\theta}} \left[ F(\boldsymbol{\gamma}, \boldsymbol{\beta}) \right].$ (3.3)
Here $J(\theta)$ is called the expected reward function. In this form, the objective function can be optimized using the REINFORCE policy gradient algorithm (Williams, 1992):
$\nabla_{\theta} J(\theta) = \mathbb{E}_{(\boldsymbol{\gamma}, \boldsymbol{\beta}) \sim \pi_{\theta}} \left[ F(\boldsymbol{\gamma}, \boldsymbol{\beta}) \, \nabla_{\theta} \log \pi_{\theta}(\boldsymbol{\gamma}, \boldsymbol{\beta}) \right].$ (3.4)
In particular, the gradient can be evaluated without information about the first-order derivative of the objective function $F$. In practice, we use a Monte Carlo approximation to evaluate this gradient, as shown in Algorithm 1. In order to reduce the variance of the gradient, a baseline $b$ is usually subtracted from the fidelity (Greensmith et al., 2004), i.e. replacing $F$ with $F - b$ in Eq. (3.4); it is easy to compute the average fidelity over the MC sample (i.e. the batch), and we use that as the baseline. The resulting algorithm will be referred to as the policy gradient based QAOA (PG-QAOA).

PG-QAOA can be naturally extended to the setting of robust optimization for the Hamiltonian noise. For the max-min problem, the policy gradient in Eq. (3.4) becomes
$\nabla_{\theta} J(\theta) = \mathbb{E}_{(\boldsymbol{\gamma}, \boldsymbol{\beta}) \sim \pi_{\theta}} \left[ \min_{|\epsilon| \le \bar{\epsilon}} F_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}) \, \nabla_{\theta} \log \pi_{\theta}(\boldsymbol{\gamma}, \boldsymbol{\beta}) \right].$ (3.5)
In practice, we sample independent random realizations of $\epsilon$ uniformly from the noise region, and use the minimum of the corresponding fidelities as an approximation to the inner minimization in Eq. (3.5). When the fidelity itself is noisy, as in the case of the Gaussian noise and the quantum measurement noise, we simply use the measured fidelity of Eq. (2.9) or Eq. (2.10) in the policy gradient step of Eq. (3.4).
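A single Monte Carlo policy gradient step with a batch-mean baseline can be sketched as follows. This is a simplified stand-in for Algorithm 1: the closed-form grad-log-density expressions follow from Eq. (3.2), while the smooth toy reward, the initial parameters, and the RNG seed are illustrative assumptions, not one of the paper's fidelity functions.

```python
import numpy as np

rng = np.random.default_rng(1)

def pg_step(mu, sigma, reward_fn, batch=256, lr=0.1):
    """One REINFORCE ascent step (Eq. (3.4)) for an independent-Gaussian policy.

    mu, sigma : arrays of shape (2p,), the policy parameters theta.
    reward_fn : maps a sampled duration vector to a (possibly noisy) reward.
    """
    x = mu + sigma * rng.standard_normal((batch, mu.size))  # sample a batch of protocols
    F = np.array([reward_fn(xi) for xi in x])
    A = F - F.mean()                                        # subtract batch-mean baseline
    # grad log pi for each Gaussian factor:
    #   d/dmu    = (x - mu) / sigma^2
    #   d/dsigma = ((x - mu)^2 - sigma^2) / sigma^3
    g_mu = (A[:, None] * (x - mu) / sigma**2).mean(axis=0)
    g_sig = (A[:, None] * ((x - mu) ** 2 - sigma**2) / sigma**3).mean(axis=0)
    return mu + lr * g_mu, np.maximum(sigma + lr * g_sig, 1e-3)

# Toy illustration: a smooth "fidelity" peaked at a hypothetical optimal protocol.
target = np.array([0.5, 1.0])
reward = lambda x: np.exp(-np.sum((x - target) ** 2))
mu, sigma = np.array([0.2, 0.2]), np.array([0.5, 0.5])
for _ in range(300):
    mu, sigma = pg_step(mu, sigma, reward)
```

On this toy reward, the means drift toward the peak while the standard deviations shrink, mirroring the collapse of the policy toward a near-deterministic protocol described in Section 5.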
4 Quantum Qubit Models
We investigate the performance of PG-QAOA for a single-qubit system, and two different multi-qubit systems, defined as follows.
4.1 Single-qubit model
Consider a single-qubit system, whose QAOA dynamics is generated by the Hamiltonians
(4.1) 
with the Pauli matrices. The initial and target states are chosen to be the ground states of the two Hamiltonians, respectively. This control problem was introduced and analyzed in the context of reinforcement learning in Ref. (Bukov et al., 2018a): below the quantum speed limit (QSL), i.e. for total durations $T < T_{\mathrm{QSL}}$, it is not possible to prepare the target state with unit fidelity; yet, in this regime there is a unique optimal solution which maximizes the fidelity of being in the target state, and its fidelity is less than unity. Above the QSL, $T > T_{\mathrm{QSL}}$, there exist multiple unit-fidelity solutions to this constrained optimization problem.
4.2 Multi-qubit Models
To compare the performance of PG-QAOA against alternative algorithms, we use multi-qubit systems. For the purpose of a more comprehensive analysis, we use two different models, discussed in (Bukov et al., 2018a; Niu et al., 2019b).
4.2.1 Multi-qubit system I
Consider first the transverse-field Ising model, described by the Hamiltonian (Bukov et al., 2018a):
(4.2) 
Here $N$ is the total number of qubits. The global control field can take two discrete values, corresponding to the two alternating QAOA generators $H_1$ and $H_2$, cf. Eq. (2.1). The initial state is the ground state at one value of the control field, and the target state is chosen to be the ground state at the other value, so the adiabatic regime is not immediately obvious; both states exhibit paramagnetic correlations and area-law bipartite entanglement. The overlap between the initial and target states decreases exponentially with increasing number of qubits (with all other parameters kept fixed). This state preparation problem is motivated by applications in condensed matter theory. This qubit control problem was recently shown to exhibit similarities with optimization in glassy landscapes (Day et al., 2019), and there exist durations for which the optimal solution is doubly degenerate and the optimization landscape features symmetry breaking (Bukov et al., 2018b).
Additionally, we can turn on small random Hamiltonian noise in the interaction terms on the first two bonds of the spin system, which mimics gate imperfections in the context of quantum computing:
(4.3) 
The choice of noisy bonds is arbitrary. To keep the notation compact, we define the noise tuple $\epsilon = (\epsilon_1, \epsilon_2)$. Each $\epsilon_i$ is drawn from a uniform distribution with support $[-\bar{\epsilon}, \bar{\epsilon}]$.
4.2.2 Multi-qubit system II
Consider another benchmark example (Niu et al., 2019b). Here, we choose the two alternating QAOA Hamiltonians as
(4.4) 
where $I$ is the identity operator. The initial state and the target state are both chosen as product states, and this population transfer problem amounts to a qubit transfer.
The noisy multi-qubit system II uses the gate Hamiltonians:
(4.5) 
with $\epsilon$ the noise magnitude. Here, the three-body noise term breaks the particle-number (a.k.a. magnetization) symmetry of the original noise-free system.
5 Numerical Experiments and Results
The models we introduced in Section 4 were also studied in (Dong et al., 2019) using SCP to mitigate the error due to Hamiltonian noise. First, we benchmark our results against the derivative-based algorithms SCP and b-GRAPE (Wu et al., 2019) in the context of the Hamiltonian noise. We also present results for PG-QAOA in the context of the Gaussian noise and the quantum measurement noise. Then we compare our results to other derivative-free optimization methods, including Nelder-Mead (Gao and Han, 2012), Powell (Powell, 1964), covariance matrix adaptation (CMA) (Hansen and Ostermeier, 2001), and particle swarm optimization (PSO) (Shi et al., 2001).

All numerical experiments are performed on the Savio2 cluster at Berkeley Research Computing (BRC). Each node is equipped with Intel Xeon E5-2680 v4 CPUs with 28 cores. PG-QAOA is implemented in TensorFlow 1.14 (Abadi et al., 2015) together with TensorFlow Probability 0.7.0 (Dillon et al., 2017). The quantum Hamiltonian environment is implemented using QuSpin (Weinberg and Bukov, 2017, 2019) and QuTiP (Johansson et al., 2012, 2013). The two black-box optimization methods CMA and PSO are implemented with Nevergrad (Rapin and Teytaud, 2018).

Throughout, we used the Adam optimizer (Kingma and Ba, 2014) to train PG-QAOA, with a fixed initial learning rate and a learning-rate decay applied at regular intervals of iteration steps. The training batch size is specified per experiment (see figure captions). The initial values of the standard deviation parameters of the policy, $\sigma$, are either set to a constant or sampled from a truncated log-normal distribution. In the single-qubit test case (cf. Section 5.1) and the multi-qubit I test case (cf. Section 4.2.1), the initial values of the mean parameters of the policy, $\mu$, are randomly sampled from a truncated normal distribution; in the multi-qubit II test case (cf. Section 4.2.2), they are sampled from a different truncated normal distribution. In practice, we noticed that the performance of PG-QAOA is sensitive to the initialization of the means $\mu$. In some cases, the initialization was tuned to achieve better performance (cf. Figure 5).
In the numerical experiments, we do not enforce hard constraints on the positivity of the durations; yet, in practice we were still able to obtain protocols with positive durations. This is mainly because the initialization of the mean parameters of the policy is positive and sufficiently far away from zero, and because there already exist optimal protocols (i.e. local minima of the control landscape close to the initial values) with positive durations.
5.1 Single-qubit results
Figure 1 (topmost row) shows snapshots of the policy during training for PG-QAOA in the noise-free case. We sample a batch of protocols from the policy learned in the middle of training, and show its distribution in (protocol duration, fidelity) space. Due to the random initialization of the policy parameters $\theta$, the algorithm starts from a broad distribution. As the number of training episodes (a.k.a. optimization iterations) increases, the mean of the training batch distribution shifts ever closer to the unit-fidelity region, as expected. At the same time, the distribution also shrinks at later training episodes, and becomes approximately a delta function in fidelity space in the infinite-training-episode limit, since the environment for the noise-free problem is deterministic (though the distribution may still exhibit a finite width due to the decay of the learning rate in the optimization procedure).
Figure 1 (middle and bottom rows) shows the effect of the two types of noise on the performance of PG-QAOA. We test both the Gaussian noise, which accounts for various potential classical measurement uncertainty sources in the lab, and the intrinsic quantum measurement noise induced by collapsing the wavefunction during measurements. In the case of quantum measurement noise (magenta), we use only binary fidelity values as the reward for PG-QAOA, cf. Eq. (2.10); the exact fidelity values for the batch (which are not binary) are shown for comparison purposes only. We emphasize that we do not repeat the quantum measurement on the same protocol several times, but take only a single quantum measurement for each protocol from the sampled batch in every iteration. The mean batch fidelity is shown as a vertical dashed line. In the case of Gaussian noise (red), the noisy fidelity values used for training are not binary; PG-QAOA is thus well-suited to handle both classical and quantum noise effects. Because we clip the Gaussian-noisy fidelities to fit in the interval $[0, 1]$, the mean fidelity of the policy (vertical dashed red line) remains slightly away from unity even after a large number of training episodes, introducing a small gap, also visible in the training curves for the multi-qubit examples (Figure 3, left).
Note that the policy optimized using PG-QAOA converges at later training episodes in both noisy settings (Gaussian and quantum measurement noise). An interesting feature is the remaining finite width along the protocol duration axis: these unit-fidelity protocols are indistinguishable from the point of view of the objective function, and are thus equally optimal. Hence, above the QSL, PG-QAOA is capable of learning multiple solutions simultaneously, unlike conventional optimal control algorithms, showcasing one of the advantages of using reinforcement learning. We can indeed verify that these distribution points correspond to distinct protocols by visualizing the batch trajectories on the Bloch sphere (the projective space of the single-qubit Hilbert space), cf. Figure 7. We mention in passing that, depending on the initialization of the policy parameters, PG-QAOA finds a different (but equivalent w.r.t. the reward) local basin of attraction in the control landscape, as can be seen from the difference in the mean total protocol duration at later training episodes between the noise-free and the two noisy cases.
5.2 Multi-qubit results
Figure 2 shows the training curves of PG-QAOA for an increasing number of qubits $N$ and QAOA depths $p$. In accord with the fact that the multi-spin fidelity decreases exponentially with increasing $N$, the PG-QAOA algorithm takes longer to converge.
Adding Gaussian and quantum measurement noise, in Fig. 3 we show the corresponding training curves for PG-QAOA. For each noisy case, we present the actual mean fidelities (red for the Gaussian noise and magenta for the quantum measurement noise); the exact fidelities (green) are shown only for comparison and are not used in training. Note that learning from quantum measurements is more prone to noise in the initial stage of the optimization, yet the algorithm converges within a smaller number of episodes compared to the case of the Gaussian noise. For Gaussian noise, similar to the single-qubit case, we observe a small gap between the exact fidelity and the noisy fidelity, due to clipping the noisy fidelities to fit within the interval $[0, 1]$. Empirically, we observe the gap size to be almost always about half the Gaussian noise level. This indicates that the probability distribution is moving in the correct direction (with fidelity close to unity), even though the observed fidelity stays away from it due to the noise. More results are presented for the Gaussian noise and for the quantum measurement noise in Appendix B and Appendix C, respectively. In Figure 8, the optimization becomes more difficult with increasing qubit number, and the gap is enlarged in proportion to the Gaussian noise level. In Figure 9, we show that the variance of the mean fidelities is reduced at larger batch sizes for the quantum measurement noise; similar results can be observed for the Gaussian noise as well.
We now benchmark PG-QAOA against a number of different optimal control algorithms. In order to compare PG-QAOA with state-of-the-art optimization methods using gradient and Hessian information, such as b-GRAPE and SCP, we evaluate their performance using both the batch-average and the worst-case fidelity as reference. For protocol durations $(\boldsymbol{\gamma}, \boldsymbol{\beta})$, the average and worst-case fidelities over the support $[-\bar{\epsilon}, \bar{\epsilon}]$ of the uniform noise distribution are defined as
$F_{\mathrm{avg}}(\boldsymbol{\gamma}, \boldsymbol{\beta}) = \frac{1}{2\bar{\epsilon}} \int_{-\bar{\epsilon}}^{\bar{\epsilon}} F_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}) \, \mathrm{d}\epsilon,$ (5.1)
$F_{\mathrm{worst}}(\boldsymbol{\gamma}, \boldsymbol{\beta}) = \min_{|\epsilon| \le \bar{\epsilon}} F_{\epsilon}(\boldsymbol{\gamma}, \boldsymbol{\beta}).$ (5.2)
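When the noise enters through a single scalar ε, the quantities in Eqs. (5.1)-(5.2) can be estimated on a grid of noise realizations; a minimal sketch (the quadratic fidelity profile used in the test is purely illustrative):

```python
import numpy as np

def avg_and_worst_fidelity(fid_of_eps, eps_bar, n_grid=101):
    """Estimate the average (5.1) and worst-case (5.2) fidelity on a uniform
    grid over the noise support [-eps_bar, eps_bar]."""
    eps_grid = np.linspace(-eps_bar, eps_bar, n_grid)
    F = np.array([fid_of_eps(e) for e in eps_grid])
    return F.mean(), F.min()
```

The grid mean approximates the integral in (5.1), while the grid minimum lower-bounds the fidelity only up to the grid resolution; a finer grid tightens both estimates.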
A comparison for test cases multi-qubit I and II is shown in Figure 4 and Figure 5, respectively. In terms of both the average and the worst case, PG-QAOA performs comparably to SCP; although PG-QAOA is derivative-free with respect to the objective function, it can occasionally even reach better solutions than SCP w.r.t. the average fidelity. PG-QAOA clearly outperforms b-GRAPE (Wu et al., 2019) in the numerical experiments involving a small number of qubits. We also observe a performance drop for PG-QAOA when the number of qubits is increased. Properly scaling up the performance of PG-QAOA with increasing $N$ remains a topic for further investigation.
Finally, in Figure 6 we show a comparison with other widely used black-box optimization methods, namely Nelder-Mead, Powell, covariance matrix adaptation (CMA), and particle swarm optimization (PSO). In contrast to PG-QAOA, which learns a distribution (in practice using MC-sampled batches), the other algorithms accept a single scalar cost function value to optimize. Therefore, we use the mean fidelity over a (potentially noisy) training batch; this constitutes a fair comparison, since the mean batch fidelity is precisely the definition of the reward in policy gradient. The different algorithms have comparable performance in the noise-free case (Figure 6, leftmost column). In the presence of measurement noise in the reward function, we observe a decrease in performance for all algorithms. At the same time, the decrease is smallest for PG-QAOA, which clearly outperforms the rest; this is especially visible when the number of qubits is increased.^2 PG-QAOA appears less sensitive to the size of the Gaussian noise; moreover, it appears particularly suitable for handling quantum measurement noise.

^2 Note that, as the number of qubits grows, we keep the depth $p$ fixed, so the maximum obtainable fidelity is expected to decrease.
6 Conclusion and Outlook
Due to intrinsic limitations of near-term quantum devices, error mitigation techniques can be essential for the performance of quantum variational algorithms such as QAOA. Many classical optimization algorithms (derivative-free, or those requiring derivative information) may not perform well in the presence of noise. We demonstrate that probability-based optimization methods from reinforcement learning can be well suited for such tasks. This work considers the simplest setup, where we parameterize each optimization variable using only two variables describing an independent Gaussian distribution. The probability distribution is then optimized using the policy gradient method, which allows us to handle continuous control problems. We demonstrate that PG-QAOA does not require derivatives to be computed explicitly, and can perform well even if the objective function is not smooth with respect to the error. The performance of PG-QAOA can sometimes even be comparable to much more sophisticated algorithms, such as sequential convex programming (SCP), which require first- and second-order derivative information of the objective function. PG-QAOA also compares favorably to a number of commonly used black-box optimization methods, particularly in experiments with noise and other sources of uncertainty.
Viewed from the perspective of reinforcement learning, the Gaussian probability distribution used in this work is the simplest possible choice. More involved distributions, such as multimodal Gaussian distributions, flow-based models, and long short-term memory (LSTM) models may be considered. Based on our preliminary results, these methods introduce a significantly larger number of parameters, and the benefit is not yet obvious. We can also employ more advanced RL algorithms, such as trust region policy optimization (Schulman et al., 2015) and proximal policy optimization (Schulman et al., 2017). Finally, this work only considers implementations on a classical computer. Implementing and testing PG-QAOA on near-term quantum computing devices, such as those provided by IBM and Rigetti Computing, will be the subject of future work.

Acknowledgments. This work was partially supported by a Google Quantum Research Award (L.L., J.Y.) and by the Quantum Algorithm Teams Program under Grant No. DE-AC02-05CH11231 (L.L.). M.B. was supported by the Emergent Phenomena in Quantum Systems initiative of the Gordon and Betty Moore Foundation, and by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Quantum Algorithm Teams Program. We thank Yulong Dong for helpful discussions, and Berkeley Research Computing (BRC) for providing computational resources.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 An and Zhou (2019) Zheng An and DL Zhou. Deep reinforcement learning for quantum gate control. arXiv preprint arXiv:1902.08418, 2019.
 August and Hernández-Lobato (2018) Moritz August and José Miguel Hernández-Lobato. Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control. In International Conference on High Performance Computing, pages 591–613. Springer, 2018.
 Brookes et al. (2019) David H Brookes, Akosua Busia, Clara Fannjiang, Kevin Murphy, and Jennifer Listgarten. A view of estimation of distribution algorithms through the lens of expectation-maximization. arXiv preprint arXiv:1905.10474, 2019.
 Bukov (2018) Marin Bukov. Reinforcement learning for autonomous preparation of Floquet-engineered states: Inverting the quantum Kapitza oscillator. Physical Review B, 98(22):224305, 2018.
 Bukov et al. (2018a) Marin Bukov, Alexandre GR Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta. Reinforcement learning in different phases of quantum control. Physical Review X, 8(3):031086, 2018a.
 Bukov et al. (2018b) Marin Bukov, Alexandre GR Day, Phillip Weinberg, Anatoli Polkovnikov, Pankaj Mehta, and Dries Sels. Broken symmetry in a correlated quantum control landscape. Physical Review A, 2018b.
 Chen et al. (2013) Chunlin Chen, Daoyi Dong, Han-Xiong Li, Jian Chu, and Tzyh-Jong Tarn. Fidelity-based probabilistic Q-learning for control of quantum systems. IEEE Transactions on Neural Networks and Learning Systems, 25(5):920–933, 2013.
 Chen and Xue (2019) Jun-Jie Chen and Ming Xue. Manipulation of spin dynamics by deep reinforcement learning agent. arXiv preprint arXiv:1901.08748, 2019.
 Dalgaard et al. (2019) Mogens Dalgaard, Felix Motzoi, Jens Jakob Sørensen, and Jacob Sherson. Global optimization of quantum dynamics with AlphaZero deep exploration. arXiv preprint arXiv:1907.05672, 2019.
 Day et al. (2019) Alexandre GR Day, Marin Bukov, Phillip Weinberg, Pankaj Mehta, and Dries Sels. Glassy phase of optimal quantum control. Physical review letters, 122(2):020601, 2019.
 Dillon et al. (2017) Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. Tensorflow distributions. arXiv preprint arXiv:1711.10604, 2017.
 Dong et al. (2019) Yulong Dong, Xiang Meng, Lin Lin, Robert Kosut, and K Birgitta Whaley. Robust control optimization for quantum approximate optimization algorithm. arXiv preprint arXiv:1911.00789, 2019.
 Farhi et al. (2014) Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A Quantum Approximate Optimization Algorithm. arXiv preprint arXiv:1411.4028, 2014.
 Fösel et al. (2018) Thomas Fösel, Petru Tighineanu, Talitha Weiss, and Florian Marquardt. Reinforcement learning with neural networks for quantum feedback. Physical Review X, 8(3):031084, 2018.
 Gao and Han (2012) Fuchang Gao and Lixing Han. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and Applications, 51(1):259–277, 2012.
 Greensmith et al. (2004) Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
 Hadfield (2018) Stuart Hadfield. Quantum algorithms for scientific computing and approximate optimization. arXiv preprint arXiv:1805.03265, 2018.
 Hansen and Ostermeier (2001) Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
 Ho and Hsieh (2019) Wen Wei Ho and Timothy H Hsieh. Efficient variational simulation of nontrivial quantum states. SciPost Phys, 6:029, 2019.
 Johansson et al. (2012) J Robert Johansson, PD Nation, and Franco Nori. QuTiP: An open-source Python framework for the dynamics of open quantum systems. Computer Physics Communications, 183(8):1760–1772, 2012.
 Johansson et al. (2013) J Robert Johansson, Paul D Nation, and Franco Nori. QuTiP 2: A Python framework for the dynamics of open quantum systems. Computer Physics Communications, 184(4):1234–1240, 2013.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kosut et al. (2013) Robert L Kosut, Matthew D Grace, and Constantin Brif. Robust control of quantum gates via sequential convex programming. Physical Review A, 88(5):052326, 2013.
 Lloyd (1995) Seth Lloyd. Almost any quantum logic gate is universal. Physical Review Letters, 75(2):346, 1995.
 Lloyd (2018) Seth Lloyd. Quantum approximate optimization is computationally universal. arXiv preprint arXiv:1812.11075, 2018.
 Morales et al. (2019) Mauro ES Morales, Jacob Biamonte, and Zoltán Zimborás. On the universality of the quantum approximate optimization algorithm. arXiv preprint arXiv:1909.03123, 2019.
 Niu et al. (2019a) Murphy Yuezhen Niu, Sergio Boixo, Vadim N Smelyanskiy, and Hartmut Neven. Universal quantum control through deep reinforcement learning. npj Quantum Information, 5(1):33, 2019a.
 Niu et al. (2019b) Murphy Yuezhen Niu, Sirui Lu, and Isaac L Chuang. Optimizing QAOA: Success probability and runtime dependence on circuit depth. arXiv preprint arXiv:1905.12134, 2019b.
 Peruzzo et al. (2014) Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5:4213, 2014.
 Porotti et al. (2019) Riccardo Porotti, Dario Tamascelli, Marcello Restelli, and Enrico Prati. Coherent transport of quantum states by deep reinforcement learning. Communications Physics, 2(1):61, 2019.
 Powell (1964) Michael JD Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The computer journal, 7(2):155–162, 1964.
 Rapin and Teytaud (2018) J. Rapin and O. Teytaud. Nevergrad - a gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad, 2018.
 Romero et al. (2017) Jonathan Romero, Jonathan P Olson, and Alán Aspuru-Guzik. Quantum autoencoders for efficient compression of quantum data. Quantum Science and Technology, 2(4):045001, 2017.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Shi et al. (2001) Yuhui Shi et al. Particle swarm optimization: developments, applications and resources. In Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546), volume 1, pages 81–86. IEEE, 2001.
 Sørdal and Bergli (2019) Vegard B Sørdal and Joakim Bergli. Deep reinforcement learning for robust quantum optimization. arXiv preprint arXiv:1904.04712, 2019.
 Streif and Leib (2019) Michael Streif and Martin Leib. Training the quantum approximate optimization algorithm without access to a quantum processing unit. arXiv preprint arXiv:1908.08862, 2019.
 Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Tannor et al. (1992) David J Tannor, Vladimir Kazakov, and Vladimir Orlov. Control of photochemical branching: Novel procedures for finding optimal pulses and global upper bounds. In Time-dependent quantum molecular dynamics, pages 347–360. Springer, 1992.
 Torlai et al. (2018) Giacomo Torlai, Guglielmo Mazzola, Juan Carrasquilla, Matthias Troyer, Roger Melko, and Giuseppe Carleo. Neuralnetwork quantum state tomography. Nature Physics, 14(5):447–450, 2018.
 Verdon et al. (2019) Guillaume Verdon, Juan Miguel Arrazola, Kamil Brádler, and Nathan Killoran. A quantum approximate optimization algorithm for continuous problems. arXiv preprint arXiv:1902.00409, 2019.
 Weinberg and Bukov (2017) Phillip Weinberg and Marin Bukov. QuSpin: a Python package for dynamics and exact diagonalisation of quantum many-body systems. Part I: spin chains. SciPost Phys., 2(1), 2017.
 Weinberg and Bukov (2019) Phillip Weinberg and Marin Bukov. QuSpin: a Python package for dynamics and exact diagonalisation of quantum many-body systems. Part II: bosons, fermions and higher spins. SciPost Phys., 7:020, 2019.
 Wierstra et al. (2008) Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 3381–3387. IEEE, 2008.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 Wu et al. (2019) Re-Bing Wu, Haijin Ding, Daoyi Dong, and Xiaoting Wang. Learning robust and high-precision quantum controls. Physical Review A, 99(4):042327, 2019.
 Yang et al. (2017) Zhi-Cheng Yang, Armin Rahmani, Alireza Shabani, Hartmut Neven, and Claudio Chamon. Optimizing variational quantum algorithms using Pontryagin’s minimum principle. Physical Review X, 7(2):021027, 2017.
 Zhang et al. (2019) Xiao-Ming Zhang, Zezhu Wei, Raza Asad, Xu-Chen Yang, and Xin Wang. When does reinforcement learning stand out in quantum control? A comparative study on state preparation. arXiv preprint arXiv:1902.02157, 2019.
Appendix A Trajectories on the Bloch sphere
In this appendix, we visualize the final policies learned by the PG-QAOA algorithm in Fig. 1.
Appendix B Learning curves: Gaussian noise
In this appendix, we provide the learning curves of PG-QAOA for various values of the standard deviation (i.e., the noise level) of the Gaussian noise, and different numbers of qubits.
Appendix C Learning curves: quantum measurement noise
In this appendix, we provide the learning curves of PG-QAOA for various values of the batch size used in the quantum measurement noise simulations, and different numbers of qubits.
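The measurement noise studied in this appendix arises from estimating an observable with a finite number of shots. The sketch below shows one common way such noise can be emulated classically; the function name and interface are hypothetical illustrations, not the simulation code used in this work.

```python
import numpy as np

def shot_noise_estimate(probs, values, n_shots, rng=None):
    """Estimate an expectation value from a finite measurement budget.
    `probs[i]` is the exact Born probability of measurement outcome i and
    `values[i]` is the observable's eigenvalue for that outcome. The
    estimator fluctuates around the exact mean and converges as n_shots
    grows. NOTE: illustrative sketch, not the paper's simulation code."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw n_shots measurement outcomes from the exact distribution.
    counts = rng.multinomial(n_shots, probs)
    # Empirical (shot-noise-corrupted) estimate of the expectation value.
    return np.dot(counts, values) / n_shots
```

Larger batch sizes (more shots) reduce the statistical fluctuation of the estimate at the usual 1/sqrt(n_shots) rate, which is why the learning curves depend on the measurement batch size.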