Log In Sign Up

Monte Carlo Tree Search based Hybrid Optimization of Variational Quantum Circuits

Variational quantum algorithms stand at the forefront of simulations on near-term and future fault-tolerant quantum devices. While most variational quantum algorithms involve only continuous optimization variables, the representational power of the variational ansatz can sometimes be significantly enhanced by adding certain discrete optimization variables, as is exemplified by the generalized quantum approximate optimization algorithm (QAOA). However, the hybrid discrete-continuous optimization problem in the generalized QAOA poses a challenge to the optimization. We propose a new algorithm called MCTS-QAOA, which combines a Monte Carlo tree search method with an improved natural policy gradient solver to optimize the discrete and continuous variables in the quantum circuit, respectively. We find that MCTS-QAOA has excellent noise-resilience properties and outperforms prior algorithms in challenging instances of the generalized QAOA.


page 1

page 2

page 3

page 4


Quantum Adversarial Learning in Emulation of Monte-Carlo Methods for Max-cut Approximation: QAOA is not optimal

One of the leading candidates for near-term quantum advantage is the cla...

Toward Neural Network Simulation of Variational Quantum Algorithms

Variational quantum algorithms (VQAs) utilize a hybrid quantum-classical...

Automated Quantum Circuit Design with Nested Monte Carlo Tree Search

Quantum algorithms based on variational approaches are one of the most p...

Policy Gradient Approach to Compilation of Variational Quantum Circuits

We propose a method for finding approximate compilations of quantum circ...

Fault Tolerant Free Gait and Footstep Planning for Hexapod Robot Based on Monte-Carlo Tree

Legged robots can pass through complex field environments by selecting g...

Numerical and geometrical aspects of flow-based variational quantum Monte Carlo

This article aims to summarize recent and ongoing efforts to simulate co...

Qubit Routing using Graph Neural Network aided Monte Carlo Tree Search

Near-term quantum hardware can support two-qubit operations only on the ...

1 Introduction

Quantum computing provides a fundamentally different way for solving a variety of important problems in scientific computing, such as finding the ground state energy in computational chemistry, and the MaxCut problem in combinatorial optimization. Variational quantum circuits are perhaps the most important quantum algorithms on near term quantum devices 

(Preskill, 2018), mainly due to the tunability and the relatively short circuit depth (Cerezo et al., 2021b), as exemplified by the variational quantum eigensolver (VQE) (Peruzzo et al., 2014; McClean et al., 2016) and the quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014). A common thread in these algorithms is to variationally optimize a parameterized quantum circuit using classical methods to obtain an approximate ground state. For instance, in combinatorial optimization, QAOA encodes the classical objective function into a quantum Hamiltonian, and constructs a quantum circuit with a set of two alternating quantum gates. The continuous adjustable parameters are the duration or phases of the gates.

For quantum many-body problems, the expressivity of the QAOA ansatz may be limited: the exponentially large (in the number of qubits) Hilbert space may not be efficiently navigated by the dynamics generated by the alternating gate sequence. This can lead to circuit depths that grow with the system size 

(Ho and Hsieh, 2019), or render the target ground state outside the scope of accessible states altogether, thus fundamentally precluding its preparation. To address these problems, various versions of a generalized QAOA ansatz have been presented in recent works Zhu et al. (2020); Yao et al. (2020c); Chandarana et al. (2021), where additional control Hamiltonians are used to generate the variational circuits. In general, these Hamiltonians are tailored to the many-body system whose ground state we seek to prepare, and the extended Hamiltonian pool is often constructed using ideas from variational counter-diabatic (CD) driving (Sels and Polkovnikov, 2017). When the optimization of the parameterized circuit is performed successfully, the generalized ansatz produces a closer approximation to the ground state than the original alternating QAOA ansatz. The generalized QAOA may also significantly reduce the total protocol duration and therefore the depth of the quantum circuit while giving a high fidelity with respect to the ground state Yao et al. (2020c).

Figure 1: The schematics of MCTS-QAOA: MCTS provides promising paths for the discrete optimization search; the inner loop (highlighted in red) Policy Gradient (PG) solver evaluates the discrete sequence in a noise-robust way; the reward obtained is then propagated back through the search tree and used to improve the tree policy.

However, the ansatz of the generalized QAOA also results in a more challenging optimization problem. The original QAOA only involves optimization of continuous parameters. The generalized QAOA ansatz, in contrast, leads to a hybrid optimization problem that involves both the discrete variables (the choice of quantum gates) and the continuous variables (the duration of each gate). To solve this hybrid optimization problem, we propose a novel algorithm that combines the Monte Carlo Tree Search (MCTS) algorithm (Coulom, 2006; Browne et al., 2012; Abramson, 2014; Silver et al., 2016, 2017) – a powerful method in exploring the discrete sequence, with an improved noise-robust natural policy gradient solver for the continuous variables of a fixed gate sequence.


  • We propose the MCTS-QAOA algorithm which combines the MCTS algorithm and a noise-robust policy gradient solver. We show that it is not only efficient in exploring the quantum gate sequences but also robust in the presence of different types of noise.

  • The proposed MCTS-QAOA algorithm produces accurate results for problems that appear difficult or infeasible for previous algorithms based on the generalized QAOA ansatz, such as RL-QAOA (Yao et al., 2020b). In particular, MCTS-QAOA shows superior performance in the large protocol duration regime, where the hybrid optimization becomes challenging.

  • In order for the MCTS-QAOA algorithm to produce reliable optimal results, it is crucial that the inner loop solver finds the optimal continuous variables with high accuracy. Compared to the original PG-QAOA solver introduced in Yao et al. (2020a), we improve the inner loop solver with entropy regularization and the natural gradient method, and implement it in Jax (Bradbury et al., 2018), which offers more accurate, stable, and efficiently computed solutions during the continuous optimization.

  • For the physics models considered in this paper, we observe that there can be many “good” gate sequences. This means that for a large portion of gate sequences, the energy ratio obtained is not far away from the optimal energy ratio obtainable with the generalized QAOA ansatz, given that the continuous variables are solved with high quality. This phenomenon has not been recorded in the literature to the best of the authors’ knowledge.

Related works:

Quantum control and variational quantum eigensolver: Traditional optimal quantum control methods, often used in prior works, are GRAPE (Khaneja et al., 2005) and CRAB (Caneva et al., 2011)

. More recently, success has been seen by the combination of traditional methods with machine learning  

(Schäfer et al., 2020; Wang et al., 2020a; Sauvage and Mintert, 2019; Fösel et al., 2020; Nautrup et al., 2019; Albarrán-Arriagada et al., 2018; Sim et al., 2021; Wu et al., 2020a, b; Anand et al., 2020; Dalgaard et al., 2022)

, and especially reinforcement learning 

(Niu et al., 2019; Fösel et al., 2018; August and Hernández-Lobato, 2018; Porotti et al., 2019; Wauters et al., 2020; Yao et al., 2020a; Sung, 2020; Chen et al., 2013; Bukov, 2018; Bukov et al., 2018; Sørdal and Bergli, 2019; Bolens and Heyl, 2020; Dalgaard et al., 2020; Metz and Bukov, 2022)). Among them, Variational quantum eigensolver or VQE (Cerezo et al., 2021a; Tilly et al., 2021) provides a general framework applicable on noisy intermediate-scale quantum (NISQ) devices (Preskill, 2018) to variationally tune the circuit parameters and improve the approximation. In the fault tolerant setting, there are also possibilities of error mitigation via the variational quantum optimization (Sung et al., 2020; Arute et al., 2020).

QAOA (Farhi et al., 2014) can be viewed as a specific variational quantum algorithm, and can be extended to the generalized QAOA ansatz Zhu et al. (2020); Yao et al. (2020c); Chandarana et al. (2021). Prior works optimize the generalized QAOA greedily and progressively for each circuit layer or end-to-end as a large autoregressive network. The present work differs from these methods; we take advantage of the MCTS structure and formulate the problem as a two-level optimization.

MCTS and RL: Monte Carlo tree search (MCTS) has been one major workhorse behind the recent breakthrough of reinforcement learning algorithm, especially AlphaGo algorithms and variants (Silver et al., 2016, 2017, 2018; Schrittwieser et al., 2020; Ye et al., 2021). MCTS (Browne et al., 2012; Guo et al., 2014) makes use of a discrete hierarchical structure to figure out a better exploration in high dimensional search problems. While it is typically applied to discrete search, it has also been used in the continuous setting (Wang et al., 2020b), where the partition space of the whole space is viewed as branching of the tree. In the context of quantum computing, applications of MCTS have been recently emerged such as the Quantum Circuit Transformation (Zhou et al., 2020b), the quantum annealing schedules (Chen et al., 2020), and the quantum dynamics optimization (Dalgaard et al., 2020).

Further related works in hybrid optimization, counter-diabatic driving methods, and architecture search can be found in Appendix A.

2 Generalized QAOA ansatz

The generalized QAOA ansatz (Yao et al., 2020c) constructs a variational quantum circuit via the composition of a sequence of parameterized unitary operators:


Here the circuit parameters contain two components: i) the discrete variables define a sequence of Hamiltonians with length , while ii) the continuous variables represent the duration that each corresponding gate is applied for. It is further assumed that each Hamiltonian is selected from a fixed Hamiltonian pool , and consecutive gates are not repeated, i.e., . The total number of possible sequences is thus , which grows exponentially with , rendering exhaustive search intractable.

After applying the circuit to an initial quantum state , one obtains the final quantum state . To prepare a high quality approximation of the ground state of the target Hamiltonian , the continuous and discrete variables are solved for by minimizing the following objective function:


Note that the energy in the objective function is divided by the number of particles in the physical model, e.g. the number of qubits. This scaled objective function has a well-behaved limit when increasing the number of qubits, as required for larger-scale computations. Here, the energy function is always lower-bounded by the ground state energy . It is also worth noticing that the quantum states are unknown to the optimization algorithm (they cannot be measured), which increases the difficulty of the optimization algorithm.

3 Reinforcement learning setup

After defining the optimization problem posed by the generalized QAOA, let us briefly cast it within the RL framework.

3.1 Quantum constraints on the RL environment

Beyond classical physics, quantum mechanics imposes counterintuitive constraints on the state and reward spaces, which need to be embedded in a realistic RL environment.

First, the quantum state (or wavefunction) is not a physical observable by itself, and inference of the information of the full quantum state from experiments (called quantum state tomography) can require exponential resources. This fact is intimately related to the expected superior performance of quantum computers against their classical counterparts on certain tasks. To embed this quantum behavior into our environment simulator, we define the RL state as the sequence of actions applied (Bukov, 2018) rather than the quantum state. Starting from a fixed initial state, the quantum state is uniquely determined (though still unmeasurable) by the Hamiltonian sequence applied.

Second, (strong) quantum measurements lead to a collapse of the quantum wavefunction. This means that, once a measurement has been performed, the state itself is irreversibly lost. Therefore, a second constraint for our quantum RL environment is the sparsity of rewards. Indeed, only after the RL episode comes to an end, can we measure the energy and obtain the reward. In Sec. 4, we exploit this fact to introduce MCTS into the algorithm which does not evaluate the protocol during the construction of it. As a result, the evaluation is delegated to the noise-robust PG-QAOA solver.

3.2 The reinforcement learning environment

In the language of reinforcement learning (RL), the choice of quantum gates corresponds to the action of the learner, and the quantum circuit is completed after actions, which marks the end of the RL session/episode. The reward signal is provided by the inner loop solver which aims to compute the lowest possible energy that can be reached by the fixed chosen gate sequence. To be more specific, the action space is a set of Hamiltonians; the state space is the set of sequences of Hamiltonians with length no larger than . In particular, a session always starts with the empty sequence , and ends with a state given by a Hamiltonian sequence of length . When is not a terminal state, i.e., , the next state is obtained by appending the -th action at the end of , i.e., .

The reward only depends on the state , and it is set as whenever is not a terminal state. As explained in the previous section, this implements the physical constraint reflecting the inability to perform a strong quantum measurement without destroying the quantum state. When is a terminal state , we define


where are the duration obtained by the inner loop continuous optimizer, and the energy is defined in (2.2).

0:  UCB bound coefficient , number of outer loop iterations , number of random initialization .
1:  Initialize the Monte Carlo tree.
2:  for  do
3:     Pick a node according to the tree policy , cf. Eq. (4.1), using the UCB bound with parameter .
4:     if the tree node is not the terminal state then
5:        Randomly roll out from the current tree node to obtain a terminal state .
6:     end if
7:     for  do
8:        Run natural policy gradient method (see Algorithm 2

) to obtain the estimated reward

9:     end for
10:     Choose the best gate sequence durations according to the maximum reward across different random intialization of policy gradient.
11:     Back-propagate the reward from the node up to the root and update the statistics () on each node.
12:  end for
Algorithm 1 MCTS-QAOA

4 Monte Carlo tree search with improved policy gradient solver

In this section, we introduce MCTS-QAOA, an algorithm that solves the hybrid optimization problem defined by the generalized QAOA ansatz, using a combination of MCTS and an improved policy gradient solver. In the combined algorithm, MCTS serves as the solver for the outer optimization problem: it is used to search for high quality gate sequences . At the same time, we design an improved policy gradient solver to produce the optimal gates duration for the discrete sequence provided by MCTS. Finally, the outcome of the evaluation is propagated back through the nodes of the MC tree to improve the tree policy before the next iteration.

4.1 Discrete optimization: Monte Carlo tree search

MCTS-QAOA strikes an efficient balance between exploration and exploitation of the RL states, by leveraging the statistics recorded in a search tree. Each node of this tree corresponds to a state ; the child nodes denote all possible states following the state . For the problem considered in this paper, trajectories are loop-free, since each child state has one more action attached than its parent state . Thus, we refer to a given node by its corresponding state. In particular, the root node corresponds to the empty state , which has children, one for each action; any other non-terminal state has children, reflecting the constraint that no action can follow itself, and a terminal state has none. During the search process, each node keeps track of the statistics of two quantities: i) counts the selection of action at state ; ii) is the expected reward after taking action at state . Intuitively, the average is an estimate of how promising a child node is. Finally, a node is called fully expanded, if all its children are visited in the search, i.e., if for all ; otherwise, is called an expandable node, and is the focus of exploration.

In each MCTS iteration, the tree and the node statistics are updated as follows:

  1. Forming a search path. Starting from the root node, if the current node is fully expanded, then one of its children is chosen according to the following Upper Confidence Bound (UCB) (Auer et al., 2002):


    until reaching a terminal state or an expandable node; here denotes the tree policy. Then an unvisited child of the current node is chosen at random, unless the current node is a terminal state. After that, a simulation is rolled out with a uniform policy until reaching a terminal state.

  2. Evaluation and backup. The reward 111In order to distinguish the estimated reward from the true reward in the presence of noise, we denote the estimated reward as . of the terminal state is evaluated by the inner loop solver and the tree statistics are updated using for each visited edge .

For the generalized QAOA ansatz, the real challenge lies in the evaluation step. On the one hand, the overall minimization of the energy depends on the potential of the trajectory selected by the MCTS, whose role is to find the optimal trajectory sequence. On the other hand, if the accuracy of the evaluation is low, then the searching process can be stuck at a severely suboptimal solution. Similarly, if the evaluation is not efficient enough, then the benefit obtained by using quantum computation strategy will also be lost. And last but not least, if the evaluation results are not robust to noise, then the algorithm can hardly be carried out on quantum devices. Hence, the inner loop solver used to implement the evaluation must be able to efficiently offer high accuracy results while being robust to different kinds of noise. The above considerations refer to the generic case; in practice, the optimization dynamics of the algorithm is set by the properties of the optimization landscape.

4.2 Continuous optimization: natural policy gradient solver

For each terminal state reached in the MCTS process, an inner loop solver is invoked to produce the optimal duration and the reward which are then back-propagated through the tree to update the tree statistics. In order to ensure that the duration obtained has a practical magnitude and to allow for a fair comparison between algorithms, we further assume that the total duration of all gates is fixed as , which can be seen as a protocol for the circuit depth.

The continuous optimization problem for the inner-loop solver in the reward-evaluation step is thus


In order to avoid using explicit derivatives of the energy , we instead optimize the expectation of the energy

over a parameterized probability distribution of

; this is also crucial to to make the algorithm resilient to noise. More specifically, we set to ensure the constraints on , where

is a random variable drawn from the sigmoid Gaussian distribution

222 denotes the sigmoid Gaussian distribution with parameters and , i.e., the distribution of the Gaussian random variable

under the sigmoid transformation. It is also called the logit-normal distribution

. It can be parameterized as , where is a Gaussian random variable and

is the sigmoid function. Adding a Shannon entropy regularizer to the total expected reward we obtain the regularized objective function:


which is maximized over the parameters . Here , and denotes the temperature, which controls the trade-off between exploration and exploitation: higher temperature leads to a larger weight on the entropy term, and thus encourages exploration, while smaller reduces exploration. The entropy term can be derived from the definition of Shannon entropy, cf. Appendix D.

The inner loop solver is then constructed with a natural policy gradient (NPG) method applied to the regularized objective function using the natural gradient direction , where

is the Fisher information matrix for the joint distribution of

and is the gradient of with respect to the parameters. This procedure is different from the solver established in PG-QAOA (Yao et al., 2020a), where the standard gradient is used to update the parameters and no regularization is used. Using independent standard normal variables

, the natural gradient direction can be approximated by unbiased estimators:


where and is the -th -by- diagonal block of the Fisher information matrix, since is a block diagonal matrix, cf. Appendix D. In practice, we update instead of to ensure the positivity of , and we use the average of the unbiased estimators in (4.4) within a batch of size to give the approximation of the natural gradient direction.

The first term in the objective function can also be viewed as a smoothed reward function obtained with Gaussian perturbation. The parameter determines the distance between and (Nesterov and Spokoiny, 2017). If is too large, then is far from , and yields suboptimal solutions of since too much details are lost after the Gaussian smoothing. To avoid this, we propose to use a tempering technique (see for example Klink et al. (2020, Sec. 5); Abdolmaleki et al. (2018, Sec. 5); Haarnoja et al. (2018, Sec. 5)). More specifically, after a certain number of NPG iterations, we reduce the temperature , and in the final stage of entropy adjustment (cf. line 10-12 in Algorithm 2

), we discard the entropy term. In this way, the policy is less susceptible to highly suboptimal local maxima in the beginning of the inner loop optimization thanks to the entropy regularization. At the end of the optimization, the variance

decreases, since the temperature is reduced and the algorithm is able to achieve a higher precision as the smoothed problem becomes a better approximation to the original one. As a result, many policy gradient updates can be saved compared to the original policy gradient method in Yao et al. (2020a), and the quality of solutions is improved.

When the optimization by the inner loop solver is completed, the parameters are used to evaluate the reward to be back-propagated through the MC tree. More specifically, the gate sequence with duration is applied and a reward is obtained. In order to deal with noisy rewards, the evaluation is repeated times, and the average reward is sent to the discrete solver. The details of the inner loop algorithm is summarized in Algorithm 2.

0:  Action sequence , number of restarts , batch size , learning rates , total number of iterations , the number of evaluation repeats , the total gate duration , the initial temperature , the rate of temperature decrease .
1:  Randomly initialize the mean and variance .
2:  for  do
3:     Sample a batch of variables of size from sigmoid Gaussian distributions .
4:     Normalize the generalized QAOA parameter .
5:     Compute the approximate NPG direction using Eq. (4.4).
6:     Update the parameters with the gradient and learning rate .
7:     if  and  then .
8:     if  then .
9:  end for
10:  Apply the circuit times with gate sequence and durations , collect the rewards , and estimate the reward by .
10:  The mean and variance parameters and ; the estimated reward .
Algorithm 2 Improved policy gradient solver

4.3 Relation to previous algorithms used to optimize the generalized QAOA ansatz

We finish this section by a comparison of MCTS-QAOA with previous methods solving the QAOA problem. As shown in Table 1

, the CD-QAOA method adopts Scipy solver for the continuous optimization, which cannot be applied to problems with noise, and the RL-QAOA method can produce suboptimal solutions in certain regimes, which we verify with numerical experiments in the next section. Moreover, note that, due to the large neural network used in RL-QAOA, it is infeasible to apply the natural gradient methods as in Section 


= 0mm = 0mm

optimization (discrete) AutoReg+PG AutoReg+PG MCTS
optimization (continuous) SciPy PG
performance without noise
performance with noise

= 0.605mm = 0.984mm

Table 1: Comparison among the three algorithms for the generalized QAOA ansatz: CD-QAOA, RL-QAOA and MCTS-QAOA. In this table, AutoReg+PG stands for the policy gradient algorithm with the autoregressive neural network as a policy (Yao et al., 2020b); means the algorithm can fail in certain challenging regimes (e.g., large total duration ).

5 Numerical experiments

To benchmark the performance of MCTS-QAOA, we consider three physics models: the -dimensional Ising model, the -dimensional Ising model on a square lattice, and the Lipkin-Meshkov-Glick (LMG) model. The description of the models and the additional Hamiltonians inspired from the counter-diabatic theory can be found in Appendix B. In addition, in order to test the noise-resilience of MCTS-QAOA, we consider three types of noise models: classical measurement Gaussian noise, quantum measurement noise, and gate rotation error, cf. Appendix C.

We compare the performance of MCTS-QAOA with that of RL-QAOA, and provide an analysis on why RL-QAOA might fail in certain regimes. Further analysis of the energy landscape of the discrete optimization reveals a surprising phenomenon: for generalized QAOA with optimal choices of the continuous degrees of freedom, there can be a large number of discrete protocols producing relatively accurate energies.

5.1 Comparison with RL-QAOA

For the methods solving the generalized QAOA problem summarized in Table 1, the CD-QAOA algorithm cannot be applied to problems with noise since the continuous solver is not noise-resilient, while the RL-QAOA algorithm has been shown to be effective with relatively short total duration (using unnormalized Hamiltonians Yao et al. (2020b)). Therefore, we use RL-QAOA as a baseline when evaluating the performance of MCTS-QAOA, and we focus on the more challenging regime of large with normalized Hamiltonians333The Hamiltonians used in this work are normalized by their operator norm , i.e., we use instead of the original Hamiltonian . The reason for introducing the normalized Hamiltonian is that the dependence of the cost of performing a Hamiltonian evolution on a quantum device – – scales with the norm Berry et al. (2007); Low and Chuang (2017). Interested readers can refer to Appendix B for more details..

We first compare the performance of MCTS-QAOA against that of RL-QAOA for the physical systems discussed in Appendix B in the presence of quantum noise. Detailed numerical results for the noiseless experiments and other noise models can be found in Appendix C. In order to compare the performance of different optimizers, noisy rewards are offered to the optimizers during the training process, and the exact rewards are only used in evaluating the protocols found by the optimizers. For MCTS-QAOA, the protocol evaluated is given by a greedy search, i.e., a searching process with the exploration coefficient in Eq. (4.1).

Figure 2: (Quantum noise experiment) comparison between MCTS-QAOA and RL-QAOA with quantum measurement noise (Appendix C). (a): 1D spin- Ising chain (); (b): 2D spin- Ising chain (); (c): LMG model () at . The blue dotted line and the orange solid line display the energy ratio obtained by RL-QAOA and MCTS-QAOA. The green square shape and the red diamond shape in the left panel approximately corresponds to (an example in the small regime) and (an example in the large regime) with unnormalized Hamiltonians, respectively. The horizontal axis represents the total duration . MCTS-QAOA outperforms RL-QAOA in all tests.

Figure 2 shows the energy ratio evaluated for the protocols obtained by the optimizers across different lengths of total duration . For all three physics models, we find that the performance of MCTS-QAOA is at least as good as that of RL-QAOA for all protocol durations. In particular, for the 1D Ising model, MCTS-QAOA gives protocols that find close approximations to the true ground state when ; RL-QAOA gives inferior solutions in these settings. For the 2D Ising model, while the performance of RL-QAOA is similar with that of MCTS-QAOA at , the performance of RLQAOA at and is still inferior to that of MCTS-QAOA. For the LMG model, the quality of the gate sequence found by RL-QAOA further decreases when , and MCTS-QAOA is significantly more robust.

The inferior performance of RL-QAOA is directly related to the joint parameterization used in RL-QAOA for the continuous and discrete policies. Since RL-QAOA optimizes the continuous and discrete variables simultaneously, for each discrete sequence, the level of accuracy of the continuous optimization can be relatively low. Consequently, the optimizer can get stuck at a suboptimal discrete sequence.

Figure 3: Analysis of RL-QAOA using the LMG test: (a): Energy ratio versus number of function evaluations; (b): Number of unique gate sequences encountered versus number of function evaluations; (c): Histogram of the rewards received by the algorithm in the first 5000 iterations. The horizontal red line in the left / middle panel represents the maximal energy ratio and the maximal number of unique gate sequences encountered during the optimization, respectively. The orange line marks the transition between two stages of the training process.

To illustrate this behavior, we analyze the training of RL-QAOA using the LMG model with and noiseless rewards. Figure 3 summarizes the performance of RL-QAOA. Here by function evaluation we mean the computation of the objective function in (2.2). From Figure 3(a) and Figure 3(b), it is clear that the training process can be divided into two distinct stages, and the transition between the two stages is marked by the dashed-dotted vertical lines444These vertical lines are drawn at the point where the number of discrete protocol gate sequences drops to of the total number within a single mini-batch. In stage I, which is to the left of the vertical lines, the number of unique gate sequences encountered by RL-QAOA quickly increases, while the energy ratio keeps oscillating below zero, which suggests that RL-QAOA focuses on exploration and the continuous optimization is done only very roughly within stage I. In stage II, which is to the right of the vertical lines, the number of unique gate sequences encountered by RL-QAOA stops to grow, while the energy ratio obtained grow above zero and eventually gets stuck at around , which means that the algorithm stops its exploration and focuses on the optimization of the continuous variables for a fixed gate sequence with stage II. The overall performance of RL-QAOA can highly depend on the discrete gate sequence that the RL-QAOA agent decides to exploit. In the next section, we demonstrate that both the exploration and the exploitation phases in RL-QAOA can be suboptimal in this example, but the main issue is related to the suboptimal discrete sequences found in the exploration phase.

5.2 Landscape of the discrete optimization and comparison with random search

In order to further understand the relative importance of continuous optimization versus discrete optimization for the generalized QAOA, we study the energy landscape of discrete optimization. For each discrete gate sequence, we perform numerical optimization to identify the best continuous parameters , and record the corresponding energy ratio.

Energy landscape of discrete optimization. – A profile of the discrete optimization landscape can be given by solving the corresponding continuous optimization individually on a random subset of all possible gate sequences; if the total number of possible gate sequences is relatively small, this subset can actually be chosen to include all sequences. In our numerical implementation, each discrete gate sequence is sent to the natural policy gradient solver described in Section 4.2, and the continuous variables are solved for different regime. Histograms for the energy ratios obtained can then be drawn.

Figure 4: Discrete landscape of the LMG model (a, b, c) and the 1D Ising model (d, e, f): Histograms of the energy ratio optimized by the improved natural gradient solver for , respectively. samples are chosen from the discrete gate sequences of generalized QAOA with parameters , and (LMG) or (1D Ising). The dashed red line in the top right panel shows the energy ratio achieved by RL-QAOA in Figure 3; the green dashed line shows the energy ratio obtained by the NPG solver for the same gate sequences.

Figure 4 shows the discrete landscape for the LMG model and the 1D Ising model, respectively, where the parameters of the ansatz are , and the total number of gate sequences is thus . From the histogram plot, most gate sequences are concentrated at the right-most peak in the large regime. Far from searching “a needle in a haystack”, this showcases that there are plenty of “good”555“Good” gate sequences here means the optimized energy ratio is close to the optimal energy ratio obtainable within the generalized QAOA ansatz. gate sequences assuming that each continuous optimization parameter is well solved. Note that the behavior is significantly different from the discrete-only optimization, where the landscape has been shown to feature transitions between glassy, correlated and uncorrelated phases (Day et al., 2019). To the best of our knowledge, the existence of many good discrete gate sequences in the QAOA-type variational quantum algorithms has not been reported in the literature.

For the LMG model with total gate duration (cf. Figure 4), while most energy ratios fall into the cluster above , there is a smaller cluster located at . The energy ratio obtained by RL-QAOA (cf. Figure 3) falls into this cluster, which is depicted by the red dashed line, while the green dashed line shows the energy ratio obtained by the natural policy gradient solver with the same gate sequence. The green line corresponds to a higher energy ratio than the red line, which means that the optimization of the continuous variables in the second stage of RL-QAOA is not as good as the NPG solver, and the difference between the two lines indicates the suboptimality caused by the exploitation. However, the suboptimality of the RLQAOA solution is mainly due to the exploration stage, since the discrete sequence that RL-QAOA chooses to exploit represents a suboptimal local optimum that belongs to a cluster much smaller than the rightmost one in the histogram. The top right panel of Figure 4 also verifies the claim that RL-QAOA only does a rough optimization on the continuous variables before it stops exploration, since the energy ratios displayed there are mostly above , while the energy ratio obtained in optimization stage I is mainly negative. While the landscape of the hybrid optimization is challenging for RL-QAOA, the proposed method MCTS-QAOA is able to deal with it by using a noise-resilient solver for the continuous variables (NPG), and by exploring the discrete variables constantly using MCTS.

For the 1D Ising model with total gate duration shown in the bottom left panel, where the rightmost cluster is not the largest. This means that in this setting, it is more difficult to find a gate sequence in the rightmost cluster when the random search is used. We examine the performance of random search and MCTS-QAOA using this example in the next part.

Comparison with random search.

Figure 5: Comparison between MCTS-QAOA and Random Search: The blue and orange curves correspond to MCTS-QAOA and random search, respectively. The physics system is the 1D Ising model with duration , which corresponds to Figure 2 (a). The generalized QAOA parameters are and

. The horizontal axis is the number of function evaluations (with the function evaluation in the continuous optimization taken into account), and the vertical axis is the energy ratio. The shaded area for both algorithms represents the standard deviation across ten different random initializations.

A recent work Mania et al. (2018) points out that advanced RL methods need not outperform simpler methods such as random search. In fact, if there is no specific structure in a problem, a random search algorithm might be as efficient as any sophisticated algorithm. In addition, from the landscape illustrated in the previous histograms, one can see that, for the models we investigated, there are lots of gate sequences with relatively high energy ratios provided that the continuous protocols are optimized. Therefore, it is natural to compare MCTS-QAOA against the random search algorithm666The random search algorithm also uses a two-level optimization, where the continuous optimization is solved by the policy gradient algorithm and the discrete optimization uses the random search. Since we assume no prior knowledge, the random search would be uniformly random on the discrete search space.. For a fair comparison, we assume that the continuous optimization in both cases is solved by the natural policy gradient algorithm, and the difference only lies in the discrete optimization. In Figure 5, the best energy ratio in the training history is shown for the two methods, and one sees that MCTS-QAOA consistently outperforms the random search across different random seeds. MCTS-QAOA not only finds better gate sequences much faster, but also gives a smaller variance across different realizations. It is clear that instead of doing the search uniformly and treating each protocol as equally important, the tree statistics in MCTS-QAOA better guides into a more promising search direction.

6 Conclusion and discussions

In this paper, we study a continuous-discrete variational quantum algorithm for the generalized QAOA ansatz. To solve this hybrid optimization problem, we design a novel algorithm that combines the Monte Carlo tree search (MCTS) algorithm, a powerful method in exploring the discrete sequence, with an improved noise-robust policy gradient solver for the continuous duration variables of a fixed gate sequence. The proposed algorithms effectively generate robust quantum control where the prior methods fail.

In this context, we expect that random search algorithms cannot efficiently determine the best gate sequence if noisy rewards are used, while MCTS-QAOA is able to mitigate the noise and provide robust choice of gate sequence with the help of the tree structure it maintains. Moreover, it is possible for MCTS-QAOA to further reduce the number of evaluations by assigning different number of iterations for different gate sequences, e.g., it can assign more iterations for the more promising gate sequences. Also, MCTS-QAOA allows for the application of transfer learning using the tree statistics, which is not possible for the random search.

There are a number of possible ways to extend the problem presented in this paper:

Learning based guided search.

– MCTS can be possibly guided by a learned functional approximator, such as neural networks or tensor networks. We have also tried the implementation of AlphaZero in the same experimental settings. However, the neural network based method does not work better than the simple MCTS. We find that the value function mapping from the discrete gate sequences to the score was quite hard to learn. One reason might be that the continuous policy gradient will try the best to optimize the energy ratio to the highest, thus making this mapping from discrete sequences to score, highly non-linear. Also, in terms of sampling efficiency, the neural network based approach needs lots of samples to fit the function, which is a heavy overhead compared to the simple MCTS approach. Nevertheless, the question remains open as to how to upgrade MCTS to a guided search.

Amortized computation. – The computation within the policy gradient solver for different gate sequences can possibly be amortized. Currently, the continuous and discrete optimizations are separated. If some functional can be learned by replaying the data during the policy gradient iteration, the number of function evaluations can be further reduced. However, one difficulty in the quantum setting is that we do not have access to the quantum state, and thus we cannot learn a mapping taking the quantum state as input, unless we apply non-trivial quantum tomography. Therefore, how to reuse the past information and make the MCTS-QAOA algorithm quickly adaptive in physical setting remains to be investigated. Advanced algorithms like meta learning can be explored in the future work.

Budget-aware variational quantum algorithms. – A point of high interest is the design of budget-aware variational quantum algorithms. The importance of sample efficiency in the quantum setting can never be overemphasized. Each run of a quantum circuit can be expensive and quantum decoherence noise is usually not stationary over time. The budget-awareness property can be naturally incorporated in the MTCS framework. Making use of the tree structure, the adaptive algorithm would distribute more function evaluation budget to the most-visited or more promising nodes. The current algorithm likely operates in a budget-sufficient regime and uses the same amount of budget for each discrete gate sequences. We hope the adaptive algorithm can hit the sweet spot in the middle, i.e., use the right amount of computational budget and still compute the best possible gate sequence design. We hope that the present work will accelerate the research of budget-aware variational quantum algorithms in a realistic setting.


We thank Michael Luo for donating eight NVIDIA V100 Tensor Core GPUs to support computation. This work was partially supported by the Department of Energy under Grant No. DE-AC02-05CH11231 and No. DE-SC0017867 (L.L., J.Y.), and by the National Science Foundation under the NSF QLCI program through grant number OMA-2016245 (L.L.). M.B. was supported by the Marie Sklodowska-Curie grant agreement No 890711.

We used W&B Biewald (2020) to organize and analyze the experiments. The reinforcement learning networks are implemented in NumPy Harris et al. (2020), and Jax Bradbury et al. (2018); the quantum systems are simulated in Quspin Weinberg and Bukov (2017, 2019). We thank Berkeley Research Computing (BRC) and Google Cloud Computing Services (GCP) for providing the computational resources.


Appendix A Related works

Hybrid optimization: The generalized QAOA ansatz introduces a discrete and continuous control problem: the discrete degrees of freedom are the gates/unitaries that define the control protocol, while the continuous degrees of freedom are the gate duration. Most reinforcement learning algorithms (Lillicrap et al., 2016; Bertsekas, 2019; Trabucco et al., 2021) typically deal with the control of either discrete or continuous degree of freedom, and hardly consider the discrete and continuous control simultaneously in the policy. Even though the continuous control can always be discretized, it is always beneficial and desirable to consider discrete and continuous variables together, without loss of the flexibility of continuous control. Furthermore, the idea of continuous and discrete optimization can be quite general, and shows up in real world application like robotics (Neunert et al., 2020; Delalleau et al., 2019) and strategic games (Vinyals et al., 2019). Combining the discrete and continuous control together, the control capability of the algorithm can be quickly enhanced. In general, the discrete variables are usually chosen as the categories of actions, and the continuous variables will are naturally given by the strength for each specific action. Our work aims to shed light on the hybrid control in the field of quantum control, and we also hope it will accelerate the research of hybrid discrete-continuous optimization algorithms in the wider community.

Counter diabatic driving: Counter diabatic driving (Sels and Polkovnikov, 2017; Hegade et al., 2022), an example of a shortcut to adiabaticity (STA), introduces an extra auxiliary counter-diabatic (CD) Hamiltonian to suppress transitions (or excitations) between instantaneous eigenvalues.

For a given quantum state evolving under a time dependent Hamiltonian , the Schrödinger equation reads as


In the rotating frame, Hamiltonian remains stationary under the unitary transformation , i.e. in the instantaneous eigenbasis of Hamiltonian . The wave function in the rotating frame satisfies the following Schrödinger equation:


where . Specifically, instead of being diagonalized, the original Hamiltonian picks up an extra contribution due to the change in the parameter , and the effective Hamiltonian becomes


The idea of the CD driving is to evolve the system with the counterdiabatic Hamiltonian


Importantly, in the moving frame is stationary and no transitions occur.

However, in practice, the precise counter-diabatic Hamiltonian is intractable and usually approximated by different methods. A good number of prior works (Passarelli et al., 2020; Hartmann and Lechner, 2019; Hegade et al., 2022, 2021a; Zhou et al., 2020a; Wurtz and Love, 2021; Hegade et al., 2021b) are based on the concept of a variational approximation to the CD Hamiltonian (Sels and Polkovnikov, 2017). Most of these works typically make use of an analytically computed expression available for few-qubit systems; they first derive the continuous form of the variational gauge potential, and then discretize the underlying dynamics using the Trotter-Suzuki formula In this work, we aim to bypass these constraints by applying the variational generalized QAOA ansatz using additional gates, generated by terms that occur in the approximation to the variational adiabatic gauge potential. These extra gates can provide a shortcut to the preparation of the ground state, compared to the original alternating QAOA ansatz. Physically, this shortcut results in shorter circuit simulation times, which provices a significant advantage on noisy NISQ devices.

AutoML and neural architecture search: Automatic machine learning or AutoML has recently attracted lots of attentions as it reduces human efforts in designing the neural architecture from experience and instead leverage the computational power to search the best configuration. One of the most pronounced examples are neural architecture search (NAS) and their variants (Zoph and Le, 2017; Elsken et al., 2018; Liu et al., 2019; Cai et al., 2018; Real et al., 2020), where reinforcement learning or evolutionary strategies are used to find a better network architecture. Inspired by the success of AutoML, the architecture of quantum circuits can also be improved by machine learning algorithms, such as the quantum version of Neural Architecture Search (Wang et al., 2021a, b; Zhang et al., 2021, 2020; Kuo et al., 2021). These prior works interpret the problem as quantum compiling problems, which assembles quantum gates in the low level. Instead of exposing a huge number of choice alternatives for the search algorithms, our work specially uses the variational gauge potentials as the Hamiltonian pool for the search algorithm in a computation-efficient way. Compared with QAOA, MCTS-QAOA has more degree of freedom to approximate the unitary operator; compared with the quantum compiling, it does not search gates in the low level due to the constraint of computations. From this perspective, our method hits the sweet spot between the expressivity and efficiency.

Appendix B Setup of physical models

We first give a brief review on the physical models used in the numerical experiments. In all experiments, we choose the target state as the ground state of the Hamiltonian , denoted . The spin- matrices describing spin are denoted by . In contrast to the models considered in Yao et al. (2020b, c), the Hamiltonians used in this work is normalized by its operator norm , i.e., we use instead of the original Hamiltonian . The reason for introducing the normalized Hamiltonian is as follows. For generic Hamiltonians (e.g., sparse matrices), the cost of performing a Hamiltonian evolution on a quantum device is  Berry et al. (2007); Low and Chuang (2017). Due to the potential differences between the Hamiltonian norms in the Hamiltonian pool , using a normalized Hamiltonian (the corresponding duration parameter is thus multiplied by ) can lead to a more realistic estimate of the cost of the quantum simulation. Due to this multiplication factor, the duration shown in the results below is larger than that presented in Yao et al. (2020b, c).

One-dimensional (1D) Ising model

The spin- Ising Hamiltonian reads as:

where is the number of qubits and the parameters are set as and (Kim and Huse, 2013)

. These parameters are close to the critical line of the model in the thermodynamic limit, where the quantum phase transition occurs. They are also reported in Ref. 

(Matos et al., 2021) to be in the most challenging parameter region using QAOA. We use periodic boundary conditions here. The initial state for this experiment is given by -polarized product state, i.e. .

For the Hamiltonian pool, we use where . The operators are precisely the first three terms in the expansion for the adiabatic gauge potential of the translation-invariant 1D Ising model (Yao et al., 2020c).

Two-dimensional (2D) Ising model

The 2D spin- transverse-field Ising model reads:

where denotes nearest neighbors on the square lattice. The model parameters are set as and . The initial state is , i.e. -polarized product state on 2D lattice.

For the Hamiltonian pool, we use , where .

Lipkin-Meshkov-Glick (LMG) model

The Lipkin-Meshkov-Glick (LMG) model (Lipkin et al., 1965) reads:

where is the interactions trength, and stands for the magnetic field strength. The LMG model preserves the total spin, and the ground state is contained in an dimensional subspace due to this symmetry. This makes the LMG model particularly interesting because it allows us to simulate its dynamics for a large number of spins, where many-body effects, such as collective phenomena, dominate the physics of the system.

For instance, in the thermodynamic limit , the LMG model exhibits a quantum phase transition at  (Botet and Jullien, 1983). The transition is between a ferromagnetic (FM) order in the ground state in the -direction (), and the paramagnetic order ().

For the Hamiltonian pool, we use , where


Appendix C Noise models

An essential part of our study is the performance of the algorithms in the presence of noise. As mentioned in the main text, noise sets the current bottle neck for reliable quantum computation. Therefore, it is of primary importance for the near-term utility of quantum computers to develop stable and noise-robust manipulation algorithms.

We use the following three noise models in our numerical experiments: (i) classical measurement noise, (ii) quantum measurement error which micmic the situation on present-day NISQ devices, and (iii) gate rotation error noise.

Classical measurement Gaussian noise is added to the cost function according to

where and denotes the noise strength, and

is the normal distribution. Gaussian noise models various kinds of uncertainty present in experiments using an additive Gaussian random variable, which follows from the Central Limit theorem.

Quantum measurement noise:

where the noise strength depends on the strength of the energy quantum fluctuations

and is randomly sampled from . Quantum noise models the uncertainty arising from quantum measurements. For instance, quantum fluctuations are large when the evolved quantum state is far away from the target, while they decrease when the final state approaches the target ground state.

Gate rotation error noise:

where gate error strengths are multiplicative and the corresponding ratios are for some simulation parameter which controls the noise strength. Gate rotation errors Sung et al. (2020) present yet another common noise source, which arises due to imperfections or lack of calibration in the quantum computer hardware.

Appendix D Details for the natural policy gradient with entropy regularization

For a general -dimensional Gaussian distribution , the Shannon entropy is defined as , where . Hence

Omitting the constants, it is equivalent to take the entropy as

. For the model used, the probability distribution is a product of normal distribution, i.e.,

is a diagonal matrix with length and diagonal elements , so the corresponding entropy function is .

In the implementation, we adopt the parameterization to assure that is positive. Then for the distribution , we have


where the gradient is taken with respect to the parameters. Since are independent, the Fisher information matrix is a block diagonal matrix with the -th block equal to

Recall that for a fixed gate sequence , we set , where denotes the sigmoid function, and

Hence the gradient of is

where the gradient is taken with respect to the parameters, and . Therefore, the unbiased estimators for the variables are

where are independent standard normal variables and . As a result, the unbiased estimators for the natural gradient direction become

since is a block diagonal matrix with the -th block given by .

Appendix E Additional experiment results

In Section 5.1, we have presented a comparison between the RL-QAOA method and MCTS-QAOA for three different physics models with the quantum noise. In this section, we report the test results with the other types of noise, namely the results with the Gaussian noise, the results with the gate rotation error, and the results when no noise is considered (cf. Appendix C). We can observe from the comparison that MCTS-QAOA’s performance is much more stable and accurate.

Figure 6: (experiment with other types of noise models or without noise) comparison between MCTS-QAOA and RL-QAOA. The physics setup is the same as that in Figure 2. (a-c): Gaussian noise with ; (d-f): gate rotation noise with ; (g-i): experiments without noise (cf. Appendix C).

From Figure 6, one can observe similar behavior the two methods as in Section 5.1, i.e., MCTS-QAOA outperforms RL-QAOA in all settings and the gaps grow larger in the regime of large total gate durations. The raw data for the energy ratio obtained by MCTS-QAOA is summarized in Table 2 (highlighted in bold), which offers a more visually and quantitatively convenient comparison across different models.

Gate rotation noise Quantum noise Gaussian noise No noise
(Model) (a) Ising 1D
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
(Model) (b) Ising 2D
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
(Model) (c) LMG
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
() () () ()
Table 2: Energy ratio obtained by MCTS-QAOA: MCTS-QAOA using the Hamiltonian pool without identity (bold, see Appendix B) and with identity operator (gray in the parenthesis, see Appendix G). Sector (a): 1D spin- Ising chain (); Sector (b): 2D spin- Ising chain (); Sector (c): LMG model () at . We use for Gaussian noise and for the gate rotation noise (see Appendix. C).

Appendix F Additional numerical results on the energy landscape

In Section 5.2 we reported the discrete landscape of the generalized QAOA ansatz under the condition that the continuous variables are solved with high quality with the improved NPG solver. Here we include the landscape under another physical model, i.e. 2D Ising model. We consider the case where , and the total number of gate sequences is thus . Similar to the plots displayed in Section 5.2, the landscape with a longer total duration () features a dominant cluster at the rightmost part of the histogram. When the total duration is smaller, the number of clusters increases, and is shifted to the left.

Figure 8 shows the influence of the parameter in the discrete landscape for the LMG model with gate duration and . When and , the rightmost peak in the energy ratio histogram gets close to , which means that reaching the ground state would be a easy task in these two cases. The more difficult cases lies in between, for example when . For the parameter we choose in the main text, there is a bigger gap (cf. Fig. 4(c)) between the rightmost peak of the energy ratio and , which means the problem we choose to solve is relatively challenging.

Figure 7: Discrete landscape of 2D Ising model: (a-c): Histograms of the energy ratio optimized by the improved natural gradient solver for , respectively. samples are chosen from the discrete gate sequences of generalized QAOA with parameters and .
Figure 8: Discrete landscape of LMG model with respect to different parameter : (a-c): Histograms of the energy ratio optimized by the improved natural gradient solver for , and , respectively with gate duration . samples are chosen from the discrete gate sequences of generalized QAOA with parameters , and . For the LMG model, the gap between the right-most peak and is larger when is between and .

Appendix G Physical models with the identity action

The generalized QAOA ansatz provides us the freedom of adding different Hamiltonians to the Hamiltonian pool. One meaningful addition is the identity operator. Here, the identity operator corresponds to the identity gate that does not move the quantum state. If we take , then its corresponding unitary gate will be identity, i.e.