1 Introduction
Recently, a number of quantum computing devices have become available on the cloud. Current devices, commonly referred to as Noisy Intermediate-Scale Quantum (NISQ) devices, operate on a small number of qubits and have limited error-correction capabilities. Demonstrating quantum advantage on these devices, that is, the ability to solve a problem more efficiently with quantum computational methods than with state-of-the-art classical methods, requires the development of algorithms that can run using a modest quantum circuit depth.
The Quantum Approximate Optimization Algorithm (QAOA) is one of the leading candidate algorithms for achieving quantum advantage in the near term. QAOA is a hybrid quantum-classical algorithm for approximately solving combinatorial problems [farhi2014quantum]. It combines a parameterized quantum state evolution, performed on a NISQ device, with a classical optimizer that is used to find optimal parameters. The quality of the solution produced by QAOA for a given combinatorial instance depends on the quality of the variational parameters found by the classical optimizer. Designing robust optimization methods for QAOA is therefore a prerequisite for achieving a practical quantum advantage.
Optimizing QAOA parameters is known to be a hard problem because the optimization objective is non-convex with low-quality, nondegenerate local optima [shaydulin2019multistart, zhou2018quantum]. Many approaches have been applied to QAOA parameter optimization, including gradient-based [romero2018strategies, zhou2018quantum, crooks2018performance] and derivative-free methods [wecker2016training, yang2017optimizing, shaydulin2019multistart]. Because the optimization objective of QAOA is specific to the underlying combinatorial instance, existing works approach the task of finding optimal QAOA parameters for a given instance in isolation, devising methods that require quantum circuit evaluations on the order of thousands. To the best of our knowledge, approaching QAOA parameter optimization as a learning task is underexplored. To that end, we propose two machine-learning-based methods for QAOA parameter optimization, in which the knowledge gained from solving training instances is leveraged to efficiently find high-quality solutions for unseen test instances with only a couple of hundred quantum circuit evaluations. Our novel treatment of the QAOA optimization task has the potential to make QAOA a cost-effective algorithm to run on near-term quantum computers.
The main contributions of our work are summarized as follows. First, we formulate the task of learning a QAOA parameter optimization policy as a reinforcement learning (RL) task. This approach learns a policy network that exploits geometrical regularities in the QAOA optimization objective of training instances to efficiently optimize new QAOA circuits of unseen test instances. Second, we propose a sampling-based QAOA optimization strategy based on a kernel density estimation (KDE) technique. This approach learns a generative model of optimal QAOA parameters, which can be used to generate new parameters and quickly solve test instances. In both approaches, we choose a training set of small-sized combinatorial instances that can be simulated on a classical computer, while the test set includes larger instances. We conduct extensive simulations using the IBM Qiskit Aer quantum circuit simulator to evaluate the performance of our proposed approaches and show that the two approaches substantially reduce the optimality gap compared with other commonly used off-the-shelf optimizers.
2 The Quantum Approximate Optimization Algorithm
The idea of encoding the solution to a combinatorial optimization problem in the spectrum of a quantum Hamiltonian goes back to 1989 [Apolloni1989]. Using this quantum encoding, one can find the optimal solution to the original combinatorial problem by preparing the highest energy eigenstate of the problem Hamiltonian. Multiple approaches inspired by the adiabatic theorem [Kato1950] have been proposed to achieve this. Of particular interest is the Quantum Approximate Optimization Algorithm, introduced in [farhi2014quantum], and its generalization, the Quantum Alternating Operator Ansatz [hadfield2017quantum]. QAOA is arguably one of the strongest candidates for demonstrating quantum advantage over classical approaches in the current NISQ era [streif2019comparison].
In QAOA, a classical binary assignment combinatorial problem is first encoded in a cost Hamiltonian $H_C$ by mapping the classical binary decision variables onto the eigenvalues of the quantum Pauli Z operator $\sigma^z$. For unconstrained combinatorial problems, the initial state is a uniform superposition quantum state $|+\rangle^{\otimes N}$, prepared by applying Hadamard gates on all qubits in the system. QAOA prepares a variational quantum state $|\psi(\boldsymbol{\beta}, \boldsymbol{\gamma})\rangle$ by applying a series of alternating operators $e^{-i \beta_k H_M}$ and $e^{-i \gamma_k H_C}$, $k = 1, \dots, p$,

$$|\psi(\boldsymbol{\beta}, \boldsymbol{\gamma})\rangle = e^{-i \beta_p H_M} e^{-i \gamma_p H_C} \cdots e^{-i \beta_1 H_M} e^{-i \gamma_1 H_C} |+\rangle^{\otimes N}, \tag{1}$$

where $\boldsymbol{\beta}, \boldsymbol{\gamma} \in \mathbb{R}^p$ are variational parameters, $N$ is the number of qubits or binary variables, and $H_M$ is the transverse field mixer Hamiltonian, $H_M = \sum_{i=1}^{N} \sigma^x_i$. Alternative initial states and mixer operators can be used to restrict the evolution to the subspace of feasible solutions for constrained optimization problems [hadfield2017quantum]. In order to find the highest energy eigenstate of $H_C$, a classical optimizer is used to vary the parameters $(\boldsymbol{\beta}, \boldsymbol{\gamma})$ so as to maximize the expected energy of $H_C$,

$$f(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \langle \psi(\boldsymbol{\beta}, \boldsymbol{\gamma}) | H_C | \psi(\boldsymbol{\beta}, \boldsymbol{\gamma}) \rangle. \tag{2}$$
In the limit $p \to \infty$, optimal variational parameters exist such that the resulting quantum state encodes the optimal solution to the classical combinatorial problem [farhi2014quantum]. In practice, the value of $p$ is chosen based on the trade-off between the achieved approximation ratio, the complexity of parameter optimization, and the accumulated errors. Ideally, increasing $p$ monotonically improves the QAOA approximation ratio [zhou2018quantum], although for higher $p$ the complexity of QAOA parameter optimization can limit the benefits [huang2019alibaba]. On real near-term quantum hardware, errors become a dominating factor. For instance, on a state-of-the-art trapped-ion quantum computer, increasing $p$ beyond a small depth does not lead to improvements in the approximation ratio because of errors [pagano2019quantum]. The hardware is rapidly progressing, however, and it is expected that QAOA with larger $p$ can be run in the foreseeable future, thus motivating our choice of circuit depths for the benchmarks in this paper.
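To make the alternating evolution in (1) and the objective (2) concrete, the following self-contained numpy sketch (our illustrative code, not the paper's implementation) simulates a depth-$p$ QAOA circuit for MaxCut with a dense statevector:

```python
import numpy as np

def maxcut_energies(n, edges):
    # Diagonal of the cost Hamiltonian H_C: the cut value of every basis state.
    z = np.arange(2 ** n)
    bits = (z[:, None] >> np.arange(n)) & 1          # bit q of each basis state
    e = np.zeros(2 ** n)
    for u, v, w in edges:
        e += w * (bits[:, u] != bits[:, v])          # an edge is cut when its endpoints differ
    return e

def qaoa_expectation(n, edges, betas, gammas):
    # <psi(beta,gamma)| H_C |psi(beta,gamma)> by dense statevector simulation.
    dim = 2 ** n
    energies = maxcut_energies(n, edges)
    psi = np.full(dim, 1 / np.sqrt(dim), dtype=complex)   # uniform superposition |+>^n
    for beta, gamma in zip(betas, gammas):
        psi = np.exp(-1j * gamma * energies) * psi        # phase operator e^{-i gamma H_C} (diagonal)
        c, s = np.cos(beta), -1j * np.sin(beta)
        for q in range(n):                                # mixer e^{-i beta H_M} = product of Rx(2 beta)
            r = psi.reshape(-1, 2, 2 ** q)                # axis 1 indexes qubit q
            a0 = r[:, 0, :].copy()
            r[:, 0, :] = c * a0 + s * r[:, 1, :]
            r[:, 1, :] = s * a0 + c * r[:, 1, :]
        psi = psi.reshape(dim)
    return float(np.real(np.vdot(psi, energies * psi)))
```

On the triangle graph, for example, the expectation at $\beta = \gamma = 0$ equals the average cut value of $1.5$, and a coarse grid search over a single layer's $(\beta_1, \gamma_1)$ already raises it well above that baseline.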
QAOA has been applied to a variety of problems, including graph maximum cut [crooks2018performance, zhou2018quantum], network community detection [shaydulin2018network, shaydulin2018community], and portfolio optimization, among many others [barkoutsos2019improving].
2.1 QAOA for MaxCut
In this paper, we explore QAOA applied to the graph maximum cut problem (MaxCut), which is among the most commonly used target problems for QAOA because of its equivalence to quadratic unconstrained binary optimization. Consider a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. The goal of MaxCut is to partition the set of vertices into two disjoint subsets such that the total weight of edges connecting the two subsets is maximized. Let binary variables $x_u \in \{0, 1\}$ denote the partition assignment of vertex $u$, $\forall u \in V$. Then MaxCut can be formulated as follows,
$$\max_{x} \; \sum_{u, v \in V} w_{uv} \left( x_u + x_v - 2 x_u x_v \right), \tag{3}$$

where $w_{uv}$ is the weight of edge $(u, v)$ if $(u, v) \in E$, and $w_{uv} = 0$ otherwise. The objective in (3) can be encoded in a problem Hamiltonian $H_C$ by mapping the binary variables $x_u$ onto the eigenvalues of the Pauli Z operator $\sigma^z$: $H_C = \frac{1}{2} \sum_{u, v \in V} w_{uv} \left( I - \sigma^z_u \sigma^z_v \right)$. The optimal binary assignment for (3) is therefore encoded in the highest energy eigenstate of $H_C$. Note that in the definition of $\sigma^z_u \sigma^z_v$, there is an implicit tensor product with the identity operator applied to all qubits except for qubits $u$ and $v$.

In [crooks2018performance] the author shows that QAOA for MaxCut can achieve approximation ratios exceeding those achieved by the classical Goemans-Williamson algorithm [goemans1995improved]. A number of theoretical results show that QAOA for MaxCut can improve on best-known classical approximation algorithms for some graph classes [osti_1492737, PhysRevA.97.022304].
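As a quick sanity check on the encoding, the snippet below (our illustration) verifies on a small graph that the binary term $x_u + x_v - 2 x_u x_v$ is $1$ exactly when the edge is cut and agrees with the spin form $(1 - s_u s_v)/2$ under the substitution $s = 1 - 2x$, and then solves MaxCut by brute-force enumeration:

```python
import itertools

def cut_binary(x, u, v):
    # x_u + x_v - 2 x_u x_v is 1 exactly when edge (u, v) crosses the partition
    return x[u] + x[v] - 2 * x[u] * x[v]

def cut_spin(s, u, v):
    # spin encoding s = 1 - 2x in {-1, +1}: (1 - s_u s_v) / 2
    return (1 - s[u] * s[v]) // 2

# a small example graph: a 4-cycle with one chord
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]

for x in itertools.product((0, 1), repeat=4):
    s = [1 - 2 * xi for xi in x]
    for u, v in edges:
        assert cut_binary(x, u, v) == cut_spin(s, u, v)

# brute-force MaxCut over all 2^4 assignments
best = max(sum(cut_binary(x, u, v) for u, v in edges)
           for x in itertools.product((0, 1), repeat=4))
print(best)  # → 4
```

Brute force is of course only feasible for tiny graphs; it serves here as the ground truth against which the quantum heuristic is compared.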
2.2 Linear Algebraic Interpretation of QAOA
Here, we provide a short linear algebraic description of QAOA for readers who are not familiar with the quantum computational framework. An $N$-qubit quantum state is a superposition (i.e., a linear combination) of $2^N$ computational basis states $|j\rangle$ that form an orthonormal basis set in a complex vector space $\mathbb{C}^{2^N}$,

$$|\psi\rangle = \sum_{j=0}^{2^N - 1} \alpha_j |j\rangle, \tag{4}$$

where $|\alpha_j|^2$ is the probability that the quantum state is in $|j\rangle$, with $\sum_j |\alpha_j|^2 = 1$. Quantum gates are unitary linear transformations on quantum states. For instance, the Pauli Z and Pauli X operators and the identity operator are
$$\sigma^z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad \sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \tag{5}$$
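As a concrete check of these operators, consider the two-qubit Hamiltonian $\frac{w}{2}(I \otimes I - \sigma^z \otimes \sigma^z)$ for a single edge of weight $w$. The numpy sketch below (our illustration, not from the paper) builds it and verifies the variational bound that the Rayleigh-Ritz quotient $\frac{\langle \psi | H | \psi \rangle}{\langle \psi | \psi \rangle}$ of any unit vector never exceeds the largest eigenvalue:

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=float)   # Pauli Z
X = np.array([[0, 1], [1, 0]], dtype=float)    # Pauli X (the mixer building block)
I2 = np.eye(2)

# Cost Hamiltonian of a single weighted edge: (w/2) (I ⊗ I − Z ⊗ Z)
w = 1.0
H = (w / 2) * (np.kron(I2, I2) - np.kron(Z, Z))

vals = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
lam_max = vals[-1]             # highest energy = the cut value of the edge

# Rayleigh-Ritz quotient of random unit vectors never exceeds lam_max
rng = np.random.default_rng(1)
for _ in range(100):
    v = rng.normal(size=4) + 1j * rng.normal(size=4)
    v /= np.linalg.norm(v)
    rayleigh = np.real(np.vdot(v, H @ v))
    assert rayleigh <= lam_max + 1e-9
```

The highest-energy eigenvectors here are $|01\rangle$ and $|10\rangle$, the two assignments that cut the edge, matching the encoding of Section 2.1.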
Note that the eigenvectors of the Pauli Z operator are the computational basis states $|0\rangle$ and $|1\rangle$ with eigenvalues $+1$ and $-1$, respectively. Therefore, $H_C$ and $H_M$ are $2^N \times 2^N$ linear operators, which for large $N$ cannot be efficiently simulated classically. $H_C$ is a Hermitian linear operator with real eigenvalues. Based on the min-max variational theorem, the minimum eigenvalue satisfies $\lambda_{\min} = \min_{\psi \neq 0} R(H_C, \psi)$ and the maximum eigenvalue satisfies $\lambda_{\max} = \max_{\psi \neq 0} R(H_C, \psi)$, where $R(H_C, \psi)$ is the Rayleigh-Ritz quotient, $R(H_C, \psi) = \frac{\langle \psi | H_C | \psi \rangle}{\langle \psi | \psi \rangle}$. Since any quantum state satisfies $\langle \psi | \psi \rangle = 1$, the highest energy of $H_C$ in (2) is in fact $\lambda_{\max}$. That said, QAOA constructs linear operators parameterized by the $(\boldsymbol{\beta}, \boldsymbol{\gamma})$ parameters, whose net effect is to transform the uniform superposition state into a unit-length eigenvector that corresponds to $\lambda_{\max}$.^{1}

^{1} Eigenvectors of a quantum Hamiltonian are referred to as its eigenstates; therefore, "transforming into an eigenvector of $H_C$ corresponding to $\lambda_{\max}$" is a different way of saying "preparing an eigenstate of $H_C$ corresponding to the energy $\lambda_{\max}$."

3 Related Works
In this section, a literature review of related research work is presented.
3.1 QAOA Parameter Optimization
A number of approaches have been explored for QAOA parameter optimization, including a variety of off-the-shelf gradient-based [romero2018strategies, crooks2018performance] and derivative-free methods [wecker2016training, yang2017optimizing, shaydulin2019multistart]. QAOA parameters have been demonstrated to be hard to optimize by off-the-shelf methods [shaydulin2019multistart] because the energy landscape of QAOA is non-convex with low-quality, nondegenerate local optima [zhou2018quantum]. Off-the-shelf optimizers ignore features of QAOA energy landscapes, such as the nonrandom concentration of optimal parameters. Researchers have demonstrated theoretically that for certain bounded-degree graphs [brandao2018fixed], QAOA energy landscapes are instance-independent, provided the instances come from a reasonable distribution. Motivated by this knowledge, we develop two machine-learning-based methods that exploit the geometrical structure of QAOA landscapes and the concentration of optimal parameters, so that the cost of QAOA parameter optimization can be amortized.
3.2 Hyperparameter and Optimizer Learning
Hyperparameter optimization, which involves optimizing the hyperparameters used to train a machine learning model, is an active field of research [automl_book]. Each hyperparameter choice corresponds to one optimization task, and thus tuning of hyperparameters can be regarded as a search over different optimization instances, which is analogous to our problem formulation. Recent methods devise sequential model-based Bayesian optimization techniques [feurer2015initializing] or asynchronous parallel model-based search [8638041], while older methods devise random-sampling-based strategies [prasanna2007].
On the other hand, learning an optimizer to train machine learning models has recently attracted considerable research interest. The motivation is to design optimization algorithms that can exploit structure within a class of problems, which is otherwise left unexploited by hand-engineered off-the-shelf optimizers. In existing works, the learned optimizer is implemented by long short-term memory networks [andrychowicz2016learning, verdon2019learning] or by the policy network of an RL agent [li2016learning]. Our RL-based approach to optimizer learning differs from that of [li2016learning] mainly in the choice of reward function and the policy search mechanism. In our work, a Markovian reward function is chosen to improve the learning process of a QAOA optimization policy.

4 Learning Optimal QAOA Parameters
Since evaluating a quantum circuit is an expensive task, the ability to find good variational parameters using a small number of calls to the quantum computer is crucial for the success of QAOA. To amortize the cost of QAOA parameter optimization across graph instances, we formulate the problem of finding optimal QAOA parameters as a learning task. We choose a set of graph instances, $\mathcal{G}_{train}$, that are representative of their respective populations as training instances for our proposed machine learning methods. The learned models can then be used to find high-quality solutions for unseen instances from a test set $\mathcal{G}_{test}$. In Section 4.1, we propose an RL-based approach to learn a QAOA parameter optimization policy. In Section 4.2, we propose a KDE-based approach to learn a generative model of optimal QAOA parameters. Section 4.3 elaborates on graph instances in $\mathcal{G}_{train}$ and $\mathcal{G}_{test}$.
4.1 Learning to Optimize QAOA Parameters with Deep Reinforcement Learning
Our first approach aims to learn a QAOA parameter optimization policy that can exploit structure and geometrical regularities in QAOA energy landscapes to find high-quality solutions within a small number of quantum circuit evaluations. This approach can potentially outperform hand-engineered optimizers that are designed to function in a general setting. We cast the problem of learning a QAOA optimizer as an RL task, where the learned RL policy is used to produce iterative parameter updates, in a way that is analogous to hand-engineered iterative optimizers.
In the RL framework, an autonomous agent learns how to map its state in a state space, $\mathcal{S}$, to an action from its action space, $\mathcal{A}$, by repeated interaction with an environment. The environment provides the agent with a reward signal, $\mathcal{R}$, in response to its action. Based on the reward signal, the agent either reinforces the action or avoids it at future encounters, in an attempt to maximize the expected total discounted rewards received over time [sutton2018reinforcement].
Mathematically, the RL problem is formalized as a Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{P}_0, \gamma)$, where the environment dynamics $\mathcal{P}$, that is, the model's state-action-state transition probabilities $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, and the initial distribution over the states, $\mathcal{P}_0(s_0)$, are unknown; $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function that guides the agent through the learning process; and $\gamma \in (0, 1]$ is a discount factor that bounds the cumulative rewards and trades off how farsighted or shortsighted the agent is in its decision making. A solution to the RL problem is a stationary Markov policy $\pi(a \mid s)$ that maps the agent's states to actions such that the expected total discounted rewards are maximized,

$$\max_{\pi} \; \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t, s_{t+1}) \right]. \tag{6}$$
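The role of the discount factor is easiest to see in how the return is accumulated. A two-line helper (ours, for illustration) evaluates the discounted sum in (6) right to left:

```python
def discounted_return(rewards, gamma):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., evaluated right to left
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma$ close to 0 the agent values only the immediate reward; with $\gamma$ close to 1 it weighs the whole trajectory, which matters here because a single parameter step rarely reaches the optimum by itself.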
In our setting, we seek a Markov policy that can be used to produce iterative QAOA parameter updates. This policy must exploit QAOA structure and geometrical regularities such that high-quality solutions can be achieved despite the challenges pertaining to QAOA parameter optimization. To do so, we formulate QAOA optimizer learning as an MDP as follows.

State space: $s_t = \{(\Delta f_{t,l}, \Delta \boldsymbol{\theta}_{t,l})\}_{l=1}^{L}$; that is, the state space is the set of finite differences in the QAOA objective and in the variational parameters between the current iteration $t$ and each of the $L$ history iterations.

Action space: $a_t = \Delta \boldsymbol{\theta}_t \in \mathbb{R}^{2p}$; that is, the action space is the set of step vectors used to update the $2p$ variational parameters $\boldsymbol{\theta} = (\boldsymbol{\beta}, \boldsymbol{\gamma})$.

Reward: $\mathcal{R}_t = f(\boldsymbol{\theta}_{t+1}) - f(\boldsymbol{\theta}_t)$; that is, the reward is the change in the QAOA objective between the next iteration and the current iteration.
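The state, action, and reward above can be wired into a minimal environment. The sketch below is our illustration (class and argument names are hypothetical, and any callable stands in for the QAOA objective (2)):

```python
from collections import deque
import numpy as np

class QAOAParamEnv:
    """Minimal sketch of the MDP: states are L historical (delta f, delta theta)
    finite differences; actions are parameter step vectors; the reward is the
    change in the objective. Illustrative, not the authors' exact implementation."""

    def __init__(self, objective, dim, history=4, seed=0):
        self.f, self.dim, self.L = objective, dim, history
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # start each episode from a random point in the parameter domain
        self.theta = self.rng.uniform(-np.pi, np.pi, self.dim)
        self.fval = self.f(self.theta)
        self.hist = deque([np.zeros(1 + self.dim)] * self.L, maxlen=self.L)
        return self._state()

    def _state(self):
        # flatten the L historical (delta f, delta theta) entries into one vector
        return np.concatenate(self.hist)

    def step(self, action):
        new_theta = self.theta + action
        new_f = self.f(new_theta)
        reward = new_f - self.fval            # R_t = f(theta_{t+1}) - f(theta_t)
        self.hist.append(np.concatenate(([reward], action)))
        self.theta, self.fval = new_theta, new_f
        return self._state(), reward
```

A policy interacting with this environment receives exactly the finite-difference information that a numerical gradient or Hessian approximation would use, which is the intuition behind the state design discussed next.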
The motivation behind our state space formulation comes from the fact that parameter updates at iteration $t$ should be in the direction of the gradient at $\boldsymbol{\theta}_t$, with a step size proportional to the curvature at $\boldsymbol{\theta}_t$, both of which can be numerically approximated by using the method of finite differences. The RL agent's learning task is therefore to find the optimal way of producing a step vector $\Delta \boldsymbol{\theta}_t$, given some collection of historical differences in the objective and parameter space, such that the expected total discounted rewards are maximized. Note that (6) is maximized when $\mathcal{R}_t > 0$ for every $t$, which means the QAOA objective has been increased between any two consecutive iterates. This choice of reward function adheres to the Markovian assumption and encourages the agent to take strides in the landscape of the parameter space that yield a higher increase in the QAOA objective (2), if possible, while maintaining conditional independence on historical states and actions.

Training Procedure
Learning a generalizable RL policy that performs well on a wide range of test instances requires the development of a proper training procedure and reward normalization scheme. We use the following strategy to train the RL agent on instances in $\mathcal{G}_{train}$. A training episode is defined to be a trajectory of fixed length sampled from a depth-$p$ QAOA objective (2) corresponding to one of the training instances. At the end of each episode, the trajectory is cut off and restarted from a random point in the domain of (2). Training instances are circulated in a round-robin fashion across episodes, mitigating destructive policy updates and overfitting. Rewards for a given training instance are normalized by the mean depth-$p$ QAOA objective corresponding to that instance, which is estimated a priori by uniformly sampling the variational parameters. Training is performed for a number of epochs, each consisting of several episodes, and policy updates are performed at the end of each epoch.
Deep RL Implementation
We train our proposed deep RL framework using the actor-critic Proximal Policy Optimization (PPO) algorithm [schulman2017proximal]. In PPO, a clipped surrogate advantage objective is used as the training objective,
$$L^{CLIP}(\phi) = \mathbb{E}_t \left[ \min \left( r_t(\phi) \hat{A}_t, \; \mathrm{clip}\left(r_t(\phi), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right], \tag{7}$$

where $r_t(\phi) = \frac{\pi_\phi(a_t \mid s_t)}{\pi_{\phi_{old}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies and $\hat{A}_t$ is an estimate of the advantage function.
The surrogate advantage objective is a measure of how the new policy performs relative to the old policy. Maximizing it without constraints could lead to large policy updates that may cause policy collapse. To mitigate this issue, PPO maximizes the minimum of an unclipped and a clipped version of the surrogate advantage, where the latter removes the incentive for moving the probability ratio outside of $[1 - \epsilon, 1 + \epsilon]$. Clipping therefore acts as a regularizer that controls how far the new policy can deviate from the old one while still improving the training objective. To further ensure reasonable policy updates, a simple early stopping method is adopted, which terminates gradient optimization on (7) when the mean KL-divergence between the new and old policies hits a predefined threshold.
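A minimal numpy rendering of the clipped surrogate (our sketch; real PPO implementations such as the one in [schulman2017proximal] operate on autograd tensors, and the sign is flipped here because library optimizers minimize):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # ratio of new to old policy probabilities for each sampled action
    ratio = np.exp(logp_new - logp_old)
    # clipped surrogate of (7): the min removes the incentive to push the
    # ratio outside [1 - eps, 1 + eps] in the direction favored by adv
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    return -np.mean(surrogate)   # negated because optimizers minimize
```

When the new and old policies coincide the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts past the clip boundary, the gradient through the clipped branch vanishes, which is exactly the regularizing effect described above.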
Fully connected multilayer perceptron networks with two hidden layers are used for both the actor (policy) and critic (value) networks, with the same activation units in all neurons and the output neurons scaled to the range of valid parameter updates. The discount factor $\gamma$ and the number of history iterations $L$ in the state formulation are fixed ahead of training. A Gaussian policy with a constant noise variance is adopted throughout the training in order to maintain constant exploration and avoid getting trapped in a locally optimal policy. At testing, the trained policy network corresponding to the mean of the learned Gaussian policy is used, without noise.

4.2 Learning to Sample Optimal QAOA Parameters with Kernel Density Estimation
In our second approach, we aim to learn the distribution of optimal QAOA parameters and use this distribution to sample QAOA parameters that can provide high-quality solutions for test instances. Although previous theoretical results show that optimal QAOA parameters concentrate for graph instances necessarily coming from the same distribution [brandao2018fixed], we learn a meta-distribution of optimal QAOA parameters for graph instances in $\mathcal{G}_{train}$, which come from diverse classes and distributions. This approach can potentially eliminate the need for graph feature engineering and similarity learning and can dramatically simplify the solution methodology.
To learn a meta-distribution of optimal QAOA parameters, we adopt a KDE technique. Suppose we have access to a set $\Theta = \{\boldsymbol{\theta}_i\}_{i=1}^{n}$ of optimal QAOA parameters for graph instances in $\mathcal{G}_{train}$ and a given QAOA circuit depth $p$. A natural local estimate for the density at a point $\boldsymbol{\theta}$ is the fraction of samples that fall within a neighborhood of width $h$ around $\boldsymbol{\theta}$, based on some distance metric. This estimate, however, is not smooth. Instead, one commonly adopts a Parzen-Rosenblatt approach with a Gaussian kernel to obtain a smoother estimate,
$$\hat{f}(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2 \pi h^2)^{d/2}} \exp \left( - \frac{\| \boldsymbol{\theta} - \boldsymbol{\theta}_i \|^2}{2 h^2} \right), \tag{8}$$

where $d = 2p$ is the dimension of the parameter vector and $h$ is the kernel bandwidth.
In order to generate new QAOA parameters using (8), a data point $\boldsymbol{\theta}_i$ from $\Theta$ is chosen uniformly at random with probability $1/n$, and a perturbation $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, h^2 I_d)$, that is, a sample drawn from a multivariate Gaussian distribution with zero mean and diagonal covariance matrix $h^2 I_d$, is added to it. One can confirm that this sampling strategy yields data points $\tilde{\boldsymbol{\theta}} = \boldsymbol{\theta}_i + \boldsymbol{\epsilon}$ distributed according to (8),

$$p(\tilde{\boldsymbol{\theta}}) = \sum_{i=1}^{n} p(\tilde{\boldsymbol{\theta}} \mid \boldsymbol{\theta}_i) \, p(\boldsymbol{\theta}_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2 \pi h^2)^{d/2}} \exp \left( - \frac{\| \tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}_i \|^2}{2 h^2} \right), \tag{9}$$

and hence $p(\tilde{\boldsymbol{\theta}}) = \hat{f}(\tilde{\boldsymbol{\theta}})$.
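The fit-and-sample procedure in (8)-(9) amounts to a few lines of numpy. The sketch below is our illustration, with `data` holding the rows of $\Theta$ and `h` the kernel bandwidth:

```python
import numpy as np

def kde_density(theta, data, h):
    # Parzen-Rosenblatt estimate (8) with an isotropic Gaussian kernel
    d = data.shape[1]
    sq = np.sum((data - theta) ** 2, axis=1)
    return float(np.mean(np.exp(-sq / (2 * h * h)) / (2 * np.pi * h * h) ** (d / 2)))

def kde_sample(data, h, size, rng):
    # sampling scheme of (9): pick a stored optimum uniformly at random,
    # then add N(0, h^2 I) noise to it
    idx = rng.integers(len(data), size=size)
    return data[idx] + rng.normal(scale=h, size=(size, data.shape[1]))
```

Sampling is therefore as cheap as indexing plus Gaussian noise, which is what makes drawing a few hundred candidate parameter settings per test instance practical.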
This methodology assumes that $\Theta$ is readily available. In order to construct $\Theta$ in practice, a derivative-free off-the-shelf optimizer is started from a large number of random points in the domain of (2), as in [zhou2018quantum]. Because (2) is known to be a non-convex function in the parameter space, a portion of these extensive searches may converge to low-quality local optima. To tackle this issue, we admit into $\Theta$ only parameters that achieve a sufficiently high optimality ratio for a given training instance and a given QAOA circuit depth.
4.3 Graph MaxCut Instances
In this subsection, we describe graph instances in $\mathcal{G}_{train}$ and $\mathcal{G}_{test}$. Four classes of graphs are considered: (1) Erdos-Renyi random graphs $G(n, p)$, where $n$ is the number of vertices and $p$ is the edge generation probability; (2) ladder graphs $L_n$, where $n$ is the length of the ladder; (3) barbell graphs $B_n$, formed by connecting two complete graphs $K_n$ by an edge; and (4) caveman graphs $C(l, k)$, where $l$ is the number of cliques and $k$ is the size of each clique. Figure 1 shows sample graph instances drawn from each graph class.
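For illustration, the four graph classes can be generated as edge lists in a few lines of pure Python. This is our simplified sketch (function names are ours; in particular, the caveman construction below joins disjoint cliques in a ring, a simplification of the standard definition):

```python
import random

def erdos_renyi(n, p, seed=0):
    # G(n, p): each of the n*(n-1)/2 possible edges appears with probability p
    rnd = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rnd.random() < p]

def ladder(n):
    # L_n: two length-n paths (the rails) joined by n rungs; 2n vertices
    rails = [(i, i + 1) for i in range(n - 1)] + \
            [(n + i, n + i + 1) for i in range(n - 1)]
    rungs = [(i, n + i) for i in range(n)]
    return rails + rungs

def barbell(k):
    # B_k: two complete graphs K_k joined by a single bridge edge; 2k vertices
    def clique(off):
        return [(off + u, off + v) for u in range(k) for v in range(u + 1, k)]
    return clique(0) + clique(k) + [(k - 1, k)]

def caveman(l, k):
    # simplified caveman graph: l disjoint k-cliques joined in a ring (l >= 2)
    edges = [(c * k + u, c * k + v) for c in range(l)
             for u in range(k) for v in range(u + 1, k)]
    edges += [(c * k, ((c + 1) % l) * k) for c in range(l)]
    return edges
```

Libraries such as networkx provide these generators directly; the hand-rolled versions are included only to make the edge counts and structure explicit.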
To construct $\mathcal{G}_{train}$, we choose one representative graph instance from each class and distribution. $\mathcal{G}_{test}$ contains instances with a varying number of vertices, as shown in Table 1. This split demonstrates that combining our proposed machine learning approaches with QAOA can be a powerful tool for amortizing the QAOA optimization cost across graph instances. In addition, we train on graph instances that are smaller than the test instances, to demonstrate that our proposed approaches are independent of instance size or complexity. We note that $\mathcal{G}_{train} \cap \mathcal{G}_{test} = \emptyset$.
Table 1: Graph classes and instance sizes in $\mathcal{G}_{test}$ ($|\mathcal{G}_{test}| = 94$ instances in total).
5 Results and Discussion
In this section, we present the results of our work. We use Qiskit Aer to perform noiseless simulations of QAOA circuits. In Figure 6 we show the expected energy landscape of the cost Hamiltonian for a depth-1 QAOA circuit with two variational parameters, $\beta_1$ and $\gamma_1$, for some graph instances. We can see that the expected energy of the cost Hamiltonian is non-convex in the parameter space and is noisy, because it is statistically estimated from quantum circuit measurements. These features tend to become more severe as the depth of the QAOA circuit increases, posing serious challenges for commonly used derivative-free off-the-shelf optimizers.
In Figure 10, we present training results associated with our proposed deep RL and KDE-based approaches. The RL learning curve during the training procedure is shown in Figure 10(a). The expected total discounted reward of the RL agent starts low, meaning that at the beginning of training the agent performs roughly as well as random sampling. As training progresses, the performance of the learned optimization policy on the training set improves. This can also be seen in Figure 10(b), which shows the best objective value for one of the training instances versus the time steps of an episode at different stages of training. The optimization policy learned at the end of training produces a trajectory that rises quickly to a higher value compared with the trajectories produced by the optimization policies learned at the middle and beginning of training. On the other hand, Figure 10(c) shows contour lines for the learned bivariate probability density obtained with our proposed KDE technique. We can see that optimal QAOA parameters for depth-1 QAOA circuits of training instances in $\mathcal{G}_{train}$ concentrate in some regions of the domain of (2).
Next, we benchmark our trained RL-based QAOA optimization policy and the sampling-based KDE QAOA optimization strategy by comparing them with common derivative-free off-the-shelf optimizers implemented in the NLopt nonlinear optimization package [nlopt], namely BOBYQA, COBYLA, and Nelder-Mead, as well as a purely random sampling-based strategy. Starting from 10 randomly chosen variational parameters in the domain of (2), each optimizer is given 10 attempts with a fixed budget of quantum circuit evaluations to solve QAOA energy landscapes corresponding to graph instances in $\mathcal{G}_{test}$. In each attempt, the variational parameters are sampled uniformly for the random sampling strategy (random) and from the learned density for the KDE-based approach (KDE), and the parameters with the highest objective value are chosen as the solution. Since our primary focus is to devise methods that find high-quality parameters within a few quantum circuit evaluations, we use the optimization policy learned by RL to generate short trajectories and then resume from the best parameters found by using Nelder-Mead for the remaining evaluations. This approach is motivated by our observation that the RL-based approach reaches regions with high-quality solutions rather quickly, yet subsequent quantum circuit evaluations are spent without further improvement in the objective value [khairy2019reinforcement]. Visualizations of the RL agent as it navigates the QAOA landscape and the results of the pure RL-based method are reported in [khairy2019reinforcement].
We compare our results with gradient-free methods for the following reasons. First, analytical gradients are in general not available, and evaluating gradients on a quantum computer is computationally expensive [crooks2018performance, guerreschi2017practical]. Under reasonable assumptions about the hardware [Guerreschi2019], one evaluation of the objective takes on the order of a second, so minimizing the number of objective evaluations needed for optimization is of utmost importance. Estimating a gradient requires at least two evaluations for each parameter, making gradient-based methods noncompetitive within the budget of objective evaluations that we chose for our benchmark. Second, gradients are sensitive to noise [zhu2019training], which includes stochastic noise originating from the random nature of quantum computation and the noise caused by hardware errors; this is typically addressed by increasing the number of samples used to estimate the objective. Moreover, optimizers employed for hyperparameter optimization, such as Bayesian optimization, require a few hundred evaluations simply to initialize the surrogate model, which rules them out of our benchmark given its small evaluation budget.
To report the performance, we group graph instances in $\mathcal{G}_{test}$ into three subgroups: (1) random graphs, which contains all graphs of the form $G(n, p)$; (2) community graphs, which contains graphs of the form $B_n$ and $C(l, k)$; and (3) ladder graphs, which contains graphs of the form $L_n$. In Figure 14, we report a boxplot of the expected optimality ratio, where the expectation is with respect to the highest objective value attained by a given optimizer in each of its attempts. The optimal solution to a graph instance in $\mathcal{G}_{test}$ is the largest known value found by any optimizer in any of its attempts for a given depth $p$. One can see that the median optimality ratio achieved by our proposed approaches outperforms that of the other commonly used optimizers. While the performance of derivative-free optimizers, random sampling, and the RL-based approach degrades as the dimension of the parameter space increases with $p$, the KDE-based approach maintains high optimality ratios and is the superior approach in all cases and across different graph classes. The RL-based approach and random sampling rank second and third, respectively, in the majority of cases. Random sampling turns out to be a competitive candidate for sampling variational parameters, especially for low-depth circuits and for random graph instances. We note that this finding had not been identified prior to our work because random search was not included in earlier experimental comparisons [shaydulin2019multistart, nakanishi2019sequential, larose2019variational, verdon2019learning]. In Table 2, we summarize the median optimality gap reduction factor with respect to Nelder-Mead attained by our proposed RL- and KDE-based approaches. As the table shows, our proposed approaches substantially reduce the optimality gap compared with Nelder-Mead, with a gap reduction factor consistently larger than one.
Table 2: Median optimality gap reduction factor with respect to Nelder-Mead for the proposed RL- and KDE-based optimizers on the random, community, and ladder graph classes.
In Figure 15, we show a boxplot of the expected approximation ratio of QAOA with respect to the classical optimal solution found by brute force across graph instances in $\mathcal{G}_{test}$. We can see that increasing the depth $p$ of the QAOA circuit improves the attained approximation ratio, especially for structured graphs (i.e., community and ladder graph instances).
6 Conclusion
In this paper, we formulated the problem of finding optimal QAOA parameters for approximately solving combinatorial problems as a learning task. Two machine-learning-based approaches have been proposed: an RL-based approach, which learns a policy network that can efficiently optimize new QAOA circuits by exploiting geometrical regularities in QAOA objectives, and a KDE-based approach, which learns a generative model of optimal QAOA parameters that can be used to sample parameters for new QAOA circuits. Our proposed approaches have been trained on a small set of small-sized training instances, yet they are capable of efficiently solving larger problem instances. When coupled with QAOA, our proposed approaches can be powerful tools for amortizing the QAOA optimization cost across combinatorial instances.
In our future work, we will investigate machinelearningbased methods for QAOA applied to constrained combinatorial optimization problems such as maximum independent set and max colorable subgraphs, which have important applications in many disciplines.
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DEAC0206CH11357. This research was funded in part by and used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DEAC0206CH11357. We gratefully acknowledge the computing resources provided on Bebop, a highperformance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. LC acknowledges support from LANL’s Laboratory Directed Research and Development (LDRD) program. LC was also supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.