1 Introduction
Quantum computing provides a fundamentally different approach to solving a variety of important problems in scientific computing, such as finding the ground state energy in computational chemistry and the MaxCut problem in combinatorial optimization. Variational quantum circuits are perhaps the most important quantum algorithms on near-term quantum devices
(Preskill, 2018), mainly due to their tunability and relatively short circuit depth (Cerezo et al., 2021b), as exemplified by the variational quantum eigensolver (VQE) (Peruzzo et al., 2014; McClean et al., 2016) and the quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014). A common thread in these algorithms is to variationally optimize a parameterized quantum circuit using classical methods to obtain an approximate ground state. For instance, in combinatorial optimization, QAOA encodes the classical objective function into a quantum Hamiltonian and constructs a quantum circuit from a set of two alternating quantum gates; the continuous adjustable parameters are the durations or phases of the gates.

For quantum many-body problems, the expressivity of the QAOA ansatz may be limited: the exponentially large (in the number of qubits) Hilbert space may not be efficiently navigated by the dynamics generated by the alternating gate sequence. This can lead to circuit depths that grow with the system size
(Ho and Hsieh, 2019), or place the target ground state outside the set of accessible states altogether, thus fundamentally precluding its preparation. To address these problems, various versions of a generalized QAOA ansatz have been presented in recent works (Zhu et al., 2020; Yao et al., 2020c; Chandarana et al., 2021), where additional control Hamiltonians are used to generate the variational circuits. In general, these Hamiltonians are tailored to the many-body system whose ground state we seek to prepare, and the extended Hamiltonian pool is often constructed using ideas from variational counterdiabatic (CD) driving (Sels and Polkovnikov, 2017). When the optimization of the parameterized circuit is performed successfully, the generalized ansatz produces a closer approximation to the ground state than the original alternating QAOA ansatz. The generalized QAOA may also significantly reduce the total protocol duration, and therefore the depth of the quantum circuit, while maintaining a high fidelity with respect to the ground state (Yao et al., 2020c).

However, the generalized QAOA ansatz also results in a more challenging optimization problem. The original QAOA only involves the optimization of continuous parameters. The generalized QAOA ansatz, in contrast, leads to a hybrid optimization problem that involves both discrete variables (the choice of quantum gates) and continuous variables (the duration of each gate). To solve this hybrid optimization problem, we propose a novel algorithm that combines the Monte Carlo Tree Search (MCTS) algorithm (Coulom, 2006; Browne et al., 2012; Abramson, 2014; Silver et al., 2016, 2017), a powerful method for exploring discrete sequences, with an improved noise-robust natural policy gradient solver for the continuous variables of a fixed gate sequence.
Contributions:

We propose the MCTS-QAOA algorithm, which combines the MCTS algorithm with a noise-robust policy gradient solver. We show that it is not only efficient in exploring quantum gate sequences, but also robust in the presence of different types of noise.

The proposed MCTS-QAOA algorithm produces accurate results for problems that appear difficult or infeasible for previous algorithms based on the generalized QAOA ansatz, such as RL-QAOA (Yao et al., 2020b). In particular, MCTS-QAOA shows superior performance in the large protocol duration regime, where the hybrid optimization becomes challenging.

In order for the MCTS-QAOA algorithm to produce reliable results, it is crucial that the inner-loop solver finds the optimal continuous variables with high accuracy. Compared to the original PG-QAOA solver introduced in Yao et al. (2020a), we improve the inner-loop solver with entropy regularization and the natural gradient method, and implement it in Jax (Bradbury et al., 2018), which yields more accurate, stable, and efficiently computed solutions during the continuous optimization.

For the physics models considered in this paper, we observe that there can be many "good" gate sequences: for a large portion of gate sequences, the energy ratio obtained is not far from the optimal energy ratio obtainable with the generalized QAOA ansatz, provided that the continuous variables are solved with high quality. To the best of our knowledge, this phenomenon has not been reported in the literature.
Related works:
Quantum control and variational quantum eigensolver: Traditional optimal quantum control methods, often used in prior works, are GRAPE (Khaneja et al., 2005) and CRAB (Caneva et al., 2011). More recently, success has been achieved by combining traditional methods with machine learning (Schäfer et al., 2020; Wang et al., 2020a; Sauvage and Mintert, 2019; Fösel et al., 2020; Nautrup et al., 2019; Albarrán-Arriagada et al., 2018; Sim et al., 2021; Wu et al., 2020a, b; Anand et al., 2020; Dalgaard et al., 2022), and especially reinforcement learning (Niu et al., 2019; Fösel et al., 2018; August and Hernández-Lobato, 2018; Porotti et al., 2019; Wauters et al., 2020; Yao et al., 2020a; Sung, 2020; Chen et al., 2013; Bukov, 2018; Bukov et al., 2018; Sørdal and Bergli, 2019; Bolens and Heyl, 2020; Dalgaard et al., 2020; Metz and Bukov, 2022). Among these approaches, the variational quantum eigensolver (VQE) (Cerezo et al., 2021a; Tilly et al., 2021) provides a general framework, applicable on noisy intermediate-scale quantum (NISQ) devices (Preskill, 2018), to variationally tune the circuit parameters and improve the approximation. In the fault-tolerant setting, there are also possibilities of error mitigation via variational quantum optimization (Sung et al., 2020; Arute et al., 2020). QAOA (Farhi et al., 2014) can be viewed as a specific variational quantum algorithm, and can be extended to the generalized QAOA ansatz (Zhu et al., 2020; Yao et al., 2020c; Chandarana et al., 2021). Prior works optimize the generalized QAOA greedily and progressively for each circuit layer, or end-to-end as a large autoregressive network. The present work differs from these methods: we take advantage of the MCTS structure and formulate the problem as a two-level optimization.
MCTS and RL: Monte Carlo tree search (MCTS) has been a major workhorse behind recent breakthroughs in reinforcement learning, especially the AlphaGo family of algorithms and its variants (Silver et al., 2016, 2017, 2018; Schrittwieser et al., 2020; Ye et al., 2021). MCTS (Browne et al., 2012; Guo et al., 2014) makes use of a discrete hierarchical structure to guide exploration in high-dimensional search problems. While it is typically applied to discrete search, it has also been used in continuous settings (Wang et al., 2020b), where partitions of the search space are viewed as branches of the tree. In the context of quantum computing, applications of MCTS have recently emerged, such as quantum circuit transformation (Zhou et al., 2020b), quantum annealing schedules (Chen et al., 2020), and quantum dynamics optimization (Dalgaard et al., 2020).
Further related works in hybrid optimization, counterdiabatic driving methods, and architecture search can be found in Appendix A.
2 Generalized QAOA ansatz
The generalized QAOA ansatz (Yao et al., 2020c) constructs a variational quantum circuit via the composition of a sequence of parameterized unitary operators:
$$U(\boldsymbol{\tau}, \boldsymbol{h}) = e^{-i \tau_p H_{h_p}} \cdots e^{-i \tau_2 H_{h_2}} \, e^{-i \tau_1 H_{h_1}}. \qquad (2.1)$$
Here the circuit parameters contain two components: i) the discrete variables $\boldsymbol{h} = (h_1, \ldots, h_p)$ define a sequence of Hamiltonians of length $p$, while ii) the continuous variables $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_p)$ represent the duration for which each corresponding gate is applied. It is further assumed that each Hamiltonian $H_{h_j}$ is selected from a fixed Hamiltonian pool $\mathcal{A}$, and that consecutive gates are not repeated, i.e., $h_j \neq h_{j+1}$. The total number of possible sequences is thus $|\mathcal{A}|\,(|\mathcal{A}|-1)^{p-1}$, which grows exponentially with $p$, rendering exhaustive search intractable.
After applying the circuit to an initial quantum state $|\psi_i\rangle$, one obtains the final quantum state $|\psi(\boldsymbol{\tau}, \boldsymbol{h})\rangle = U(\boldsymbol{\tau}, \boldsymbol{h})\,|\psi_i\rangle$. To prepare a high-quality approximation of the ground state of the target Hamiltonian $H_*$, the continuous and discrete variables are solved for by minimizing the following objective function:
$$\min_{\boldsymbol{\tau}, \boldsymbol{h}} \; E(\boldsymbol{\tau}, \boldsymbol{h}) := \frac{1}{N} \, \langle \psi(\boldsymbol{\tau}, \boldsymbol{h}) | \, H_* \, | \psi(\boldsymbol{\tau}, \boldsymbol{h}) \rangle. \qquad (2.2)$$
Note that the energy in the objective function is divided by the number of particles $N$ in the physical model, e.g., the number of qubits. This scaled objective function has a well-behaved limit as the number of qubits increases, as required for larger-scale computations. The energy function is always lower-bounded by the (scaled) ground state energy. It is also worth noting that the quantum states are unknown to the optimization algorithm (they cannot be measured directly), which increases the difficulty of the optimization.
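As a concrete illustration, the circuit composition in (2.1) and the scaled objective (2.2) can be sketched in a few lines of numpy. The two-gate single-qubit pool, the gate sequence, and the initial state below are hypothetical toy choices for the sketch, not the models studied in this paper.

```python
import numpy as np

def evolve(psi, H, tau):
    """Apply exp(-i * tau * H) to psi via eigendecomposition (H is Hermitian)."""
    lam, V = np.linalg.eigh(H)
    return V @ (np.exp(-1j * tau * lam) * (V.conj().T @ psi))

def qaoa_state(psi0, pool, seq, taus):
    """Compose the circuit (2.1): apply exp(-i tau_j H_{h_j}) for j = 1..p."""
    psi = psi0
    for h, tau in zip(seq, taus):
        psi = evolve(psi, pool[h], tau)
    return psi

def scaled_energy(psi, H_target, n_particles):
    """Objective (2.2): energy per particle, <psi| H* |psi> / N."""
    return float(np.real(psi.conj() @ (H_target @ psi))) / n_particles

# Toy single-qubit pool (Pauli X and Z); illustrative, not the paper's models.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
pool = [X, Z]
psi0 = np.array([1, 0], dtype=complex)      # initial state |0>

psi = qaoa_state(psi0, pool, seq=[0, 1, 0], taus=[0.3, 0.7, 0.5])
print(np.linalg.norm(psi))                  # unitary evolution preserves the norm
print(scaled_energy(psi, Z, n_particles=1)) # lower-bounded by the ground energy per particle
```

The eigendecomposition route avoids a matrix-exponential dependency and makes the unitarity of each gate explicit.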
3 Reinforcement learning setup
After defining the optimization problem posed by the generalized QAOA, let us briefly cast it within the RL framework.
3.1 Quantum constraints on the RL environment
Beyond classical physics, quantum mechanics imposes counterintuitive constraints on the state and reward spaces, which need to be embedded in a realistic RL environment.
First, the quantum state (or wavefunction) is not a physical observable by itself, and inferring the full quantum state from experiments (quantum state tomography) can require exponential resources. This fact is intimately related to the expected superior performance of quantum computers over their classical counterparts on certain tasks. To embed this quantum behavior into our environment simulator, we define the RL state as the sequence of actions applied (Bukov, 2018), rather than as the quantum state. Starting from a fixed initial state, the quantum state is uniquely determined (though still unmeasurable) by the Hamiltonian sequence applied.
Second, (strong) quantum measurements lead to a collapse of the quantum wavefunction. This means that, once a measurement has been performed, the state itself is irreversibly lost. Therefore, a second constraint for our quantum RL environment is the sparsity of rewards. Indeed, only after the RL episode comes to an end can we measure the energy and obtain the reward. In Sec. 4, we exploit this fact to introduce MCTS into the algorithm, which does not evaluate the protocol while constructing it; as a result, the evaluation is delegated to the noise-robust PG-QAOA solver.
3.2 The reinforcement learning environment
In the language of reinforcement learning (RL), the choice of quantum gates corresponds to the action of the learner, and the quantum circuit is completed after $p$ actions, which marks the end of the RL session/episode. The reward signal is provided by the inner-loop solver, which aims to compute the lowest possible energy that can be reached with the chosen gate sequence held fixed. To be more specific, the action space is the set of Hamiltonians $\mathcal{A}$; the state space is the set of sequences of Hamiltonians of length no larger than $p$. In particular, a session always starts with the empty sequence $s_0 = ()$, and ends with a terminal state given by a Hamiltonian sequence of length $p$. When $s_t$ is not a terminal state, i.e., $t < p$, the next state is obtained by appending the $(t+1)$-th action $h_{t+1}$ to $s_t$, i.e., $s_{t+1} = (h_1, \ldots, h_t, h_{t+1})$.
The reward only depends on the state $s$, and it is set to zero whenever $s$ is not a terminal state. As explained in the previous section, this implements the physical constraint reflecting the inability to perform a strong quantum measurement without destroying the quantum state. When $s$ is a terminal state $s_p$, we define
$$r(s_p) = -E(\boldsymbol{\tau}^*, \boldsymbol{h}), \qquad (3.1)$$
where $\boldsymbol{\tau}^*$ are the durations obtained by the inner-loop continuous optimizer, and the energy $E$ is defined in (2.2).
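The environment described above can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for the inner-loop continuous solver of Sec. 4 (which would return $-E(\boldsymbol{\tau}^*, \boldsymbol{h})$); here a toy reward is used so the sketch is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class QAOAEnv:
    """RL environment: states are gate sequences, rewards are sparse.
    `evaluate` is invoked only at terminal states, mimicking the single
    measurement available at the end of an episode."""
    n_actions: int   # size of the Hamiltonian pool |A|
    depth: int       # episode length p
    evaluate: Callable[[Tuple[int, ...]], float]

    def reset(self) -> Tuple[int, ...]:
        return ()    # the empty sequence s_0

    def legal_actions(self, state):
        # Consecutive gates may not repeat: h_{t+1} != h_t.
        return [a for a in range(self.n_actions) if not state or a != state[-1]]

    def step(self, state, action):
        assert action in self.legal_actions(state)
        next_state = state + (action,)
        done = len(next_state) == self.depth
        reward = self.evaluate(next_state) if done else 0.0   # sparse reward
        return next_state, reward, done

# Toy usage with a hypothetical reward: the number of distinct gates used.
env = QAOAEnv(n_actions=3, depth=4, evaluate=lambda seq: float(len(set(seq))))
s = env.reset()
for a in [0, 1, 2, 1]:
    s, r, done = env.step(s, a)
print(s, r, done)  # (0, 1, 2, 1) 3.0 True
```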
4 Monte Carlo tree search with improved policy gradient solver
In this section, we introduce MCTS-QAOA, an algorithm that solves the hybrid optimization problem defined by the generalized QAOA ansatz using a combination of MCTS and an improved policy gradient solver. In the combined algorithm, MCTS serves as the solver for the outer optimization problem: it is used to search for high-quality gate sequences $\boldsymbol{h}$. At the same time, we design an improved policy gradient solver to produce the optimal gate durations $\boldsymbol{\tau}$ for the discrete sequence provided by MCTS. Finally, the outcome of the evaluation is propagated back through the nodes of the MC tree to improve the tree policy before the next iteration.
4.1 Discrete optimization: Monte Carlo tree search
MCTS-QAOA strikes an efficient balance between exploration and exploitation of the RL states by leveraging the statistics recorded in a search tree. Each node of this tree corresponds to a state $s$; the child nodes denote all possible states following $s$. For the problem considered in this paper, trajectories are loop-free, since each child state has one more action attached than its parent state. Thus, we refer to a given node by its corresponding state. In particular, the root node corresponds to the empty state, which has $|\mathcal{A}|$ children, one for each action; any other non-terminal state has $|\mathcal{A}| - 1$ children, reflecting the constraint that no action can follow itself, and a terminal state has none. During the search process, each node keeps track of two statistics: i) $N(s, a)$ counts the selections of action $a$ at state $s$; ii) $W(s, a)$ accumulates the rewards obtained after taking action $a$ at state $s$. Intuitively, the average $W(s, a)/N(s, a)$ is an estimate of how promising a child node is. Finally, a node $s$ is called fully expanded if all of its children have been visited in the search, i.e., if $N(s, a) > 0$ for all $a$; otherwise, $s$ is called an expandable node and is the focus of exploration.
In each MCTS iteration, the tree and the node statistics are updated as follows:

Forming a search path. Starting from the root node, if the current node is fully expanded, then one of its children is chosen according to the following Upper Confidence Bound (UCB) (Auer et al., 2002):

$$\pi_{\text{tree}}(s) = \operatorname*{arg\,max}_{a} \left[ \frac{W(s, a)}{N(s, a)} + c_p \sqrt{\frac{\ln \left( \sum_{a'} N(s, a') \right)}{N(s, a)}} \right], \qquad (4.1)$$

where $\pi_{\text{tree}}$ denotes the tree policy and $c_p$ the exploration coefficient. This descent continues until reaching a terminal state or an expandable node. Then an unvisited child of the current node is chosen at random, unless the current node is a terminal state. After that, a simulation is rolled out with a uniform policy until reaching a terminal state.

Evaluation and backup. The estimated reward $\hat{r}$ of the terminal state is evaluated by the inner-loop solver, and the tree statistics are updated with $\hat{r}$ for each visited edge $(s, a)$. (We write $\hat{r}$ rather than $r$ in order to distinguish the estimated reward from the true reward in the presence of noise.)
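The two steps above can be sketched as a single MCTS iteration. The `evaluate` callback again stands in for the inner-loop solver, and the toy reward in the usage example (number of distinct gates) is hypothetical.

```python
import math
import random
from collections import defaultdict

def mcts_iteration(legal, is_terminal, evaluate, stats, c=1.0, rng=random):
    """One MCTS iteration: UCB selection through fully expanded nodes,
    expansion of one unvisited child, uniform rollout to a terminal state,
    evaluation, and backup of the reward along the selected path.
    `stats` maps (state, action) -> [visit count N, accumulated reward W]."""
    state, path = (), []
    while not is_terminal(state):
        actions = legal(state)
        unvisited = [a for a in actions if stats[(state, a)][0] == 0]
        if unvisited:                      # expandable node: expand one child
            a = rng.choice(unvisited)
            path.append((state, a))
            state = state + (a,)
            break
        n_s = sum(stats[(state, a)][0] for a in actions)
        a = max(actions, key=lambda a: stats[(state, a)][1] / stats[(state, a)][0]
                + c * math.sqrt(math.log(n_s) / stats[(state, a)][0]))  # UCB, Eq. (4.1)
        path.append((state, a))
        state = state + (a,)
    while not is_terminal(state):          # rollout with a uniform policy
        state = state + (rng.choice(legal(state)),)
    r = evaluate(state)                    # delegated to the inner-loop solver
    for s, a in path:                      # backup
        stats[(s, a)][0] += 1
        stats[(s, a)][1] += r
    return state, r

# Toy usage: 3 gates, depth 4, hypothetical reward = number of distinct gates.
stats = defaultdict(lambda: [0, 0.0])
legal = lambda s: [a for a in range(3) if not s or a != s[-1]]
terminal = lambda s: len(s) == 4
rng = random.Random(0)
best = max(mcts_iteration(legal, terminal, lambda seq: len(set(seq)),
                          stats, rng=rng)[1] for _ in range(50))
print(best)
```

Note that rollout states are not added to the tree; only the edges on the selected path accumulate statistics, matching the backup step described above.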
For the generalized QAOA ansatz, the real challenge lies in the evaluation step. On the one hand, the overall minimization of the energy depends on the potential of the trajectory selected by MCTS, whose role is to find the optimal gate sequence. On the other hand, if the accuracy of the evaluation is low, the search process can get stuck at a severely suboptimal solution. Similarly, if the evaluation is not efficient enough, the benefit obtained from the quantum computation strategy will be lost. Last but not least, if the evaluation results are not robust to noise, the algorithm can hardly be carried out on quantum devices. Hence, the inner-loop solver used to implement the evaluation must be able to efficiently deliver high-accuracy results while being robust to different kinds of noise. The above considerations refer to the generic case; in practice, the optimization dynamics of the algorithm is set by the properties of the optimization landscape.
4.2 Continuous optimization: natural policy gradient solver
For each terminal state reached in the MCTS process, an inner-loop solver is invoked to produce the optimal durations $\boldsymbol{\tau}^*$ and the reward, which are then backpropagated through the tree to update the tree statistics. In order to ensure that the durations obtained have a practical magnitude, and to allow for a fair comparison between algorithms, we further assume that the total duration of all gates is fixed to $T$, which can be seen as a proxy for the circuit depth.
The continuous optimization problem for the inner-loop solver in the reward-evaluation step is thus

$$\min_{\boldsymbol{\tau}} \; E(\boldsymbol{\tau}, \boldsymbol{h}) \quad \text{s.t.} \quad \sum_{j=1}^{p} \tau_j = T, \quad \tau_j \geq 0. \qquad (4.2)$$
In order to avoid using explicit derivatives of the energy $E(\boldsymbol{\tau}, \boldsymbol{h})$, we instead optimize the expectation of the energy over a parameterized probability distribution of the durations; this is also crucial to make the algorithm resilient to noise. More specifically, we define $\boldsymbol{\tau}$ through a transformation of a random variable $\boldsymbol{x}$ that enforces the constraints in (4.2), where $\boldsymbol{x}$ is drawn from the sigmoid Gaussian distribution with parameters $(\boldsymbol{\mu}, \boldsymbol{\sigma})$. (The sigmoid Gaussian distribution, also called the logit-normal distribution, is the distribution of a Gaussian random variable under the sigmoid transformation; it can be parameterized as $x = \mathrm{sigmoid}(\mu + \sigma \varepsilon)$, where $\varepsilon$ is a standard Gaussian random variable and $\mathrm{sigmoid}(z) = 1/(1 + e^{-z})$.) Adding a Shannon entropy regularizer to the total expected reward, we obtain the regularized objective function:
$$J_\beta(\boldsymbol{\mu}, \boldsymbol{\sigma}) = \mathbb{E}_{\boldsymbol{x} \sim \pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}} \left[ -E(\boldsymbol{\tau}(\boldsymbol{x}), \boldsymbol{h}) \right] + \beta \, S(\pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}), \qquad (4.3)$$
which is maximized over the parameters $(\boldsymbol{\mu}, \boldsymbol{\sigma})$. Here $\pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}$ denotes the policy, $S$ its Shannon entropy, and $\beta$ the temperature, which controls the trade-off between exploration and exploitation: a higher temperature puts a larger weight on the entropy term and thus encourages exploration, while a smaller $\beta$ reduces exploration. The entropy term can be derived from the definition of the Shannon entropy, cf. Appendix D.
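A minimal sketch of sampling durations from the sigmoid-Gaussian policy follows. The rescaling by `T / x.sum()` is one plausible way to enforce the total-duration constraint of (4.2) and is an assumption of this sketch; the exact transform used in the implementation may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_tau(mu, log_sigma, T, rng):
    """Sample gate durations from the sigmoid-Gaussian (logit-normal) policy:
    x_j = sigmoid(mu_j + sigma_j * eps_j) with eps_j standard normal, then
    rescale so that tau_j > 0 and sum_j tau_j = T (assumed normalization)."""
    eps = rng.standard_normal(mu.shape)
    x = sigmoid(mu + np.exp(log_sigma) * eps)
    return T * x / x.sum()

rng = np.random.default_rng(0)
tau = sample_tau(np.zeros(5), np.zeros(5), T=10.0, rng=rng)
print(tau.sum())   # the total-duration constraint holds by construction
```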
The inner-loop solver is then constructed by applying a natural policy gradient (NPG) method to the regularized objective function, using the natural gradient direction $F^{-1} \nabla J_\beta$, where $F$ is the Fisher information matrix of the policy $\pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}$ and $\nabla J_\beta$ is the gradient of $J_\beta$ with respect to the parameters. This procedure differs from the solver established in PG-QAOA (Yao et al., 2020a), where the standard gradient is used to update the parameters and no regularization is used. Using independent standard normal variables, the natural gradient direction can be approximated by unbiased estimators:
$$F_j^{-1} \, \widehat{\nabla_{\theta_j} J_\beta}, \qquad \widehat{\nabla_{\theta_j} J_\beta} = -E(\boldsymbol{\tau}(\boldsymbol{x}), \boldsymbol{h}) \, \nabla_{\theta_j} \log \pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}(\boldsymbol{x}) + \beta \, \nabla_{\theta_j} S(\pi_{\boldsymbol{\mu}, \boldsymbol{\sigma}}), \qquad (4.4)$$
where $\theta_j = (\mu_j, \sigma_j)$ and $F_j$ is the $j$-th $2 \times 2$ diagonal block of the Fisher information matrix, since $F$ is block diagonal, cf. Appendix D. In practice, we update $\log \sigma_j$ instead of $\sigma_j$ to ensure the positivity of $\sigma_j$, and we use the average of the unbiased estimators in (4.4) within a batch to approximate the natural gradient direction.
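The update can be sketched as follows for a factorized Gaussian policy in the pre-sigmoid variables; for simplicity the sigmoid transform and duration constraint are omitted, and the quadratic reward in the usage example is hypothetical. Under the $(\mu, \log \sigma)$ parameterization the Fisher blocks are diag($1/\sigma^2$, 2), so the preconditioning in (4.4) reduces to elementwise rescaling.

```python
import numpy as np

def npg_step(mu, log_sigma, reward, beta, lr, batch, rng):
    """One natural policy gradient step for a factorized Gaussian policy with
    parameters (mu, log sigma). The Fisher matrix is block diagonal with
    per-gate blocks diag(1/sigma^2, 2), so preconditioning is elementwise."""
    sigma = np.exp(log_sigma)
    g_mu = np.zeros_like(mu)
    g_ls = np.zeros_like(log_sigma)
    for _ in range(batch):
        eps = rng.standard_normal(mu.shape)
        r = reward(mu + sigma * eps)
        g_mu += r * eps / sigma            # score-function estimator w.r.t. mu
        g_ls += r * (eps ** 2 - 1.0)       # score-function estimator w.r.t. log sigma
    g_mu /= batch
    g_ls = g_ls / batch + beta             # entropy gradient w.r.t. log sigma is 1
    # Inverse Fisher blocks: sigma^2 for the mu component, 1/2 for log sigma.
    return mu + lr * sigma ** 2 * g_mu, log_sigma + lr * 0.5 * g_ls

# Toy usage with tempering on a hypothetical quadratic reward peaked at 2.
rng = np.random.default_rng(0)
mu, log_sigma = np.zeros(3), np.zeros(3)
reward = lambda x: -np.sum((x - 2.0) ** 2)
beta = 0.5
for it in range(600):
    if it % 200 == 199:
        beta *= 0.1                        # tempering: lower the temperature
    mu, log_sigma = npg_step(mu, log_sigma, reward, beta, lr=0.05, batch=64, rng=rng)
print(np.round(mu, 2))                     # mu approaches the optimum at 2
```

The usage loop also illustrates the tempering schedule discussed below: the variance shrinks as the temperature is reduced, tightening the Gaussian smoothing around the optimum.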
The first term in the objective function (4.3) can also be viewed as a smoothed reward function obtained with Gaussian perturbation. The parameter $\boldsymbol{\sigma}$ determines the distance between the smoothed and the original objective (Nesterov and Spokoiny, 2017). If $\boldsymbol{\sigma}$ is too large, the smoothed objective is far from the original one and yields suboptimal solutions, since too much detail is lost after the Gaussian smoothing. To avoid this, we propose to use a tempering technique (see, for example, Klink et al. (2020, Sec. 5); Abdolmaleki et al. (2018, Sec. 5); Haarnoja et al. (2018, Sec. 5)). More specifically, after a certain number of NPG iterations, we reduce the temperature $\beta$, and in the final stage of entropy adjustment (cf. lines 10-12 in Algorithm 2), we discard the entropy term. In this way, thanks to the entropy regularization, the policy is less susceptible to highly suboptimal local maxima at the beginning of the inner-loop optimization. Towards the end of the optimization, the variance decreases as the temperature is reduced, and the algorithm is able to achieve a higher precision since the smoothed problem becomes a better approximation to the original one. As a result, many policy gradient updates can be saved compared to the original policy gradient method in Yao et al. (2020a), and the quality of the solutions is improved.

When the optimization by the inner-loop solver is completed, the optimized parameters are used to evaluate the reward to be backpropagated through the MC tree. More specifically, the gate sequence with the corresponding durations is applied and a reward is obtained. In order to deal with noisy rewards, the evaluation is repeated several times, and the average reward is sent to the discrete solver. The details of the inner-loop algorithm are summarized in Algorithm 2.
4.3 Relation to previous algorithms used to optimize the generalized QAOA ansatz
We finish this section with a comparison of MCTS-QAOA to previous methods for optimizing the generalized QAOA ansatz. As shown in Table 1, the CD-QAOA method adopts a SciPy solver for the continuous optimization, which cannot be applied to problems with noise, and the RL-QAOA method can produce suboptimal solutions in certain regimes, which we verify with numerical experiments in the next section. Moreover, note that, due to the large neural network used in RL-QAOA, it is infeasible to apply natural gradient methods as in Section 4.2.
Table 1: Comparison of methods based on the generalized QAOA ansatz.

Method                     | CD-QAOA     | RL-QAOA     | MCTS-QAOA
optimization (discrete)    | AutoReg+PG  | AutoReg+PG  | MCTS
optimization (continuous)  | SciPy       | PG          | NPG
performance without noise  | ✓           | ✓✗          | ✓
performance with noise     | ✗           | ✓✗          | ✓
5 Numerical experiments
To benchmark the performance of MCTS-QAOA, we consider three physics models: the one-dimensional Ising model, the two-dimensional Ising model on a square lattice, and the Lipkin-Meshkov-Glick (LMG) model. The description of the models and of the additional Hamiltonians inspired by counterdiabatic theory can be found in Appendix B. In addition, in order to test the noise resilience of MCTS-QAOA, we consider three types of noise models: classical measurement Gaussian noise, quantum measurement noise, and gate rotation error, cf. Appendix C.
We compare the performance of MCTS-QAOA with that of RL-QAOA, and provide an analysis of why RL-QAOA might fail in certain regimes. Further analysis of the energy landscape of the discrete optimization reveals a surprising phenomenon: for the generalized QAOA with optimal choices of the continuous degrees of freedom, there can be a large number of discrete protocols producing relatively accurate energies.
5.1 Comparison with RL-QAOA
For the methods solving the generalized QAOA problem summarized in Table 1, the CD-QAOA algorithm cannot be applied to problems with noise since its continuous solver is not noise-resilient, while the RL-QAOA algorithm has been shown to be effective for relatively short total durations (using unnormalized Hamiltonians, Yao et al. (2020b)). Therefore, we use RL-QAOA as a baseline when evaluating the performance of MCTS-QAOA, and we focus on the more challenging regime of large total duration $T$ with normalized Hamiltonians. (The Hamiltonians used in this work are normalized by their operator norm, i.e., we use $H/\|H\|$ instead of the original Hamiltonian $H$. The reason for introducing the normalized Hamiltonian is that the cost of performing a Hamiltonian evolution on a quantum device scales with the norm of the Hamiltonian times the evolution time (Berry et al., 2007; Low and Chuang, 2017). Interested readers can refer to Appendix B for more details.)
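The normalization described in the footnote amounts to a one-line rescaling; a minimal numpy sketch, with a toy diagonal Hamiltonian as a hypothetical example:

```python
import numpy as np

def normalize(H):
    """Rescale H by its operator (spectral) norm; for a Hermitian matrix this
    is the largest eigenvalue magnitude, so the normalized H has norm 1."""
    return H / np.linalg.norm(H, 2)

H = 3.0 * np.array([[1.0, 0.0], [0.0, -1.0]])   # 3 * Pauli-Z, operator norm 3
print(np.linalg.norm(normalize(H), 2))          # 1.0
```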
We first compare the performance of MCTS-QAOA against that of RL-QAOA for the physical systems discussed in Appendix B in the presence of quantum noise. Detailed numerical results for the noiseless experiments and other noise models can be found in Appendix C. In order to compare the performance of different optimizers, noisy rewards are offered to the optimizers during the training process, and the exact rewards are only used in evaluating the protocols found by the optimizers. For MCTS-QAOA, the protocol evaluated is given by a greedy search, i.e., a search process with the exploration coefficient in Eq. (4.1) set to zero.
Figure 2 shows the energy ratio evaluated for the protocols obtained by the optimizers across different total durations $T$. For all three physics models, we find that the performance of MCTS-QAOA is at least as good as that of RL-QAOA for all protocol durations. In particular, for the 1D Ising model, MCTS-QAOA gives protocols that closely approximate the true ground state at large $T$, where RL-QAOA gives inferior solutions. For the 2D Ising model, while the performance of RL-QAOA is similar to that of MCTS-QAOA at the smaller durations, it remains inferior to MCTS-QAOA at the larger ones. For the LMG model, the quality of the gate sequences found by RL-QAOA further decreases at large $T$, and MCTS-QAOA is significantly more robust.
The inferior performance of RL-QAOA is directly related to the joint parameterization of the continuous and discrete policies used in RL-QAOA. Since RL-QAOA optimizes the continuous and discrete variables simultaneously, for each discrete sequence the accuracy of the continuous optimization can be relatively low. Consequently, the optimizer can get stuck at a suboptimal discrete sequence.
To illustrate this behavior, we analyze the training of RL-QAOA using the LMG model with noiseless rewards. Figure 3 summarizes the performance of RL-QAOA. Here, by function evaluation we mean the computation of the objective function in (2.2). From Figure 3(a) and Figure 3(b), it is clear that the training process can be divided into two distinct stages, and the transition between the two stages is marked by the dash-dotted vertical lines (drawn at the point where the number of distinct protocol gate sequences within a single mini-batch drops to a small fraction of the total number). In stage I, to the left of the vertical lines, the number of unique gate sequences encountered by RL-QAOA quickly increases, while the energy ratio keeps oscillating below zero; this suggests that RL-QAOA focuses on exploration and performs the continuous optimization only very roughly within stage I. In stage II, to the right of the vertical lines, the number of unique gate sequences encountered by RL-QAOA stops growing, while the energy ratio grows above zero and eventually gets stuck at a suboptimal value; this means that the algorithm stops its exploration and focuses on the optimization of the continuous variables for a fixed gate sequence within stage II. The overall performance of RL-QAOA can thus depend strongly on the discrete gate sequence that the RL-QAOA agent decides to exploit. In the next section, we demonstrate that both the exploration and the exploitation phases in RL-QAOA can be suboptimal in this example, but that the main issue is related to the suboptimal discrete sequences found in the exploration phase.
5.2 Landscape of the discrete optimization and comparison with random search
In order to further understand the relative importance of continuous versus discrete optimization for the generalized QAOA, we study the energy landscape of the discrete optimization. For each discrete gate sequence, we perform numerical optimization to identify the best continuous parameters $\boldsymbol{\tau}^*$, and record the corresponding energy ratio.
Energy landscape of discrete optimization. – A profile of the discrete optimization landscape can be obtained by solving the corresponding continuous optimization individually on a random subset of all possible gate sequences; if the total number of possible gate sequences is relatively small, this subset can be chosen to include all sequences. In our numerical implementation, each discrete gate sequence is sent to the natural policy gradient solver described in Section 4.2, and the continuous variables are solved in the different duration regimes. Histograms of the energy ratios obtained can then be drawn.
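The profiling procedure above can be sketched as follows; the `inner_solver` callback in the usage example is a hypothetical stand-in for the NPG solver, returning a toy score instead of an optimized energy ratio.

```python
import random
from collections import Counter

def random_sequence(n_actions, p, rng):
    """Sample a gate sequence uniformly among those without consecutive repeats."""
    seq = [rng.randrange(n_actions)]
    while len(seq) < p:
        a = rng.randrange(n_actions - 1)
        seq.append(a if a < seq[-1] else a + 1)   # skip the previous action
    return tuple(seq)

def landscape_histogram(inner_solver, n_actions, p, n_samples, rng):
    """Profile the discrete landscape: run the (stand-in) continuous solver on
    random gate sequences and histogram the resulting energy ratios."""
    ratios = [inner_solver(random_sequence(n_actions, p, rng))
              for _ in range(n_samples)]
    return Counter(round(r, 2) for r in ratios)

# Hypothetical stand-in for the NPG solver: ratio = fraction of distinct gates.
rng = random.Random(0)
hist = landscape_histogram(lambda s: len(set(s)) / 4, n_actions=4, p=6,
                           n_samples=200, rng=rng)
print(sorted(hist))   # the distinct "energy ratio" values observed
```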
Figure 4 shows the discrete landscape for the LMG model and the 1D Ising model, respectively, where the pool size and circuit depth of the ansatz are small enough that the total number of gate sequences can be enumerated. From the histogram plot, most gate sequences are concentrated at the rightmost peak in the large-$T$ regime. Far from searching for "a needle in a haystack", this showcases that there are plenty of "good" gate sequences (i.e., sequences whose optimized energy ratio is close to the optimal energy ratio obtainable within the generalized QAOA ansatz), assuming that the continuous optimization for each sequence is well solved. Note that this behavior is significantly different from the discrete-only optimization, where the landscape has been shown to feature transitions between glassy, correlated, and uncorrelated phases (Day et al., 2019). To the best of our knowledge, the existence of many good discrete gate sequences in QAOA-type variational quantum algorithms has not been reported in the literature.
For the LMG model (cf. Figure 4), while most energy ratios fall into the rightmost cluster, there is a smaller cluster located at lower energy ratios. The energy ratio obtained by RL-QAOA (cf. Figure 3) falls into this smaller cluster, depicted by the red dashed line, while the green dashed line shows the energy ratio obtained by the natural policy gradient solver with the same gate sequence. The green line corresponds to a higher energy ratio than the red line, which means that the optimization of the continuous variables in the second stage of RL-QAOA is not as good as that of the NPG solver; the difference between the two lines indicates the suboptimality caused by the exploitation. However, the suboptimality of the RL-QAOA solution is mainly due to the exploration stage, since the discrete sequence that RL-QAOA chooses to exploit represents a suboptimal local optimum belonging to a cluster much smaller than the rightmost one in the histogram. The top right panel of Figure 4 also verifies the claim that RL-QAOA only performs a rough optimization of the continuous variables before it stops exploring: the energy ratios displayed there are mostly positive, while the energy ratio obtained in optimization stage I is mainly negative. While the landscape of the hybrid optimization is challenging for RL-QAOA, the proposed MCTS-QAOA method is able to deal with it by using a noise-resilient solver for the continuous variables (NPG) and by constantly exploring the discrete variables using MCTS.
For the 1D Ising model, shown in the bottom left panel, the rightmost cluster is not the largest. This means that in this setting it is more difficult to find a gate sequence in the rightmost cluster when random search is used. We examine the performance of random search and MCTS-QAOA on this example in the next part.
Comparison with random search. –
A recent work (Mania et al., 2018) points out that advanced RL methods need not outperform simpler methods such as random search. In fact, if there is no specific structure in a problem, a random search algorithm might be as efficient as any sophisticated algorithm. In addition, the landscape illustrated in the previous histograms shows that, for the models we investigated, there are many gate sequences with relatively high energy ratios, provided that the continuous protocols are optimized. Therefore, it is natural to compare MCTS-QAOA against a random search algorithm. (The random search algorithm also uses a two-level optimization, where the continuous optimization is solved by the policy gradient algorithm and the discrete optimization uses random search; since we assume no prior knowledge, the random search is uniform on the discrete search space.) For a fair comparison, we assume that the continuous optimization in both cases is solved by the natural policy gradient algorithm, so that the difference only lies in the discrete optimization. In Figure 5, the best energy ratio in the training history is shown for the two methods, and one sees that MCTS-QAOA consistently outperforms random search across different random seeds. MCTS-QAOA not only finds better gate sequences much faster, but also exhibits a smaller variance across different realizations. It is clear that instead of searching uniformly and treating each protocol as equally important, the tree statistics in MCTS-QAOA guide the search into more promising directions.
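The random-search baseline can be sketched as a best-so-far loop over uniformly sampled sequences; the `evaluate` callback is again a hypothetical stand-in for the NPG inner-loop solver.

```python
import random

def random_search(evaluate, n_actions, p, iters, rng):
    """Uniform random search over gate sequences without consecutive repeats;
    `evaluate` stands in for the NPG inner-loop solver. Returns the
    best-so-far reward curve."""
    best, curve = float("-inf"), []
    for _ in range(iters):
        seq = [rng.randrange(n_actions)]
        while len(seq) < p:
            a = rng.randrange(n_actions - 1)
            seq.append(a if a < seq[-1] else a + 1)   # skip the previous action
        best = max(best, evaluate(tuple(seq)))
        curve.append(best)
    return curve

# Hypothetical reward: number of distinct gates in the sequence.
curve = random_search(lambda s: len(set(s)), n_actions=3, p=5,
                      iters=100, rng=random.Random(1))
print(curve[-1])   # best reward found after 100 uniform samples
```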
6 Conclusion and discussions
In this paper, we study a continuous-discrete variational quantum algorithm based on the generalized QAOA ansatz. To solve this hybrid optimization problem, we design a novel algorithm that combines the Monte Carlo tree search (MCTS) algorithm, a powerful method for exploring discrete sequences, with an improved noise-robust policy gradient solver for the continuous duration variables of a fixed gate sequence. The proposed algorithm effectively generates robust quantum control where prior methods fail.
In this context, we expect that random search algorithms cannot efficiently determine the best gate sequence when noisy rewards are used, while MCTS-QAOA is able to mitigate the noise and provide a robust choice of gate sequence with the help of the tree structure it maintains. Moreover, MCTS-QAOA could further reduce the number of evaluations by assigning different numbers of iterations to different gate sequences, e.g., more iterations to the more promising ones. MCTS-QAOA also allows for the application of transfer learning using the tree statistics, which is not possible for the random search.
There are a number of possible ways to extend the work presented in this paper:
Learning-based guided search. – MCTS could be guided by a learned function approximator, such as a neural network or a tensor network. We have also tried an AlphaZero-style implementation in the same experimental settings. However, the neural-network-based method does not work better than the simple MCTS. We find that the value function mapping discrete gate sequences to scores is quite hard to learn. One reason might be that the continuous policy gradient pushes the energy ratio as high as possible for every sequence, which makes this mapping from discrete sequences to scores highly nonlinear. In terms of sampling efficiency, the neural-network-based approach also needs many samples to fit the function, which is a heavy overhead compared to the simple MCTS approach. Nevertheless, the question of how to upgrade MCTS to a guided search remains open.
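For reference, the selection rule of the plain (non-neural) MCTS discussed here is conventionally the UCT score; below is a generic textbook-style sketch, not the paper's exact code:

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing mean reward plus an exploration bonus;
    unvisited children always score infinity and are expanded first."""
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")
        mean = ch["reward_sum"] / ch["visits"]
        return mean + c * math.sqrt(math.log(total) / ch["visits"])
    return max(children, key=score)

children = [
    {"gate": "X", "visits": 10, "reward_sum": 9.0},  # well-explored, mean 0.9
    {"gate": "Y", "visits": 2,  "reward_sum": 1.0},  # rarely tried, mean 0.5
    {"gate": "Z", "visits": 0,  "reward_sum": 0.0},  # never tried
]
chosen = uct_select(children)
```

A learned guide would replace (or bias) this score with a trained prior over gates, which is exactly the component we found hard to fit in practice.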
Amortized computation. – The computation within the policy gradient solver for different gate sequences could potentially be amortized. Currently, the continuous and discrete optimizations are separate. If a function could be learned by replaying the data gathered during the policy gradient iterations, the number of function evaluations could be further reduced. However, one difficulty in the quantum setting is that we do not have access to the quantum state, and thus we cannot learn a mapping that takes the quantum state as input, unless we apply nontrivial quantum tomography. How to reuse past information and make the MCTS-QAOA algorithm quickly adaptive in a physical setting therefore remains to be investigated. Advanced approaches such as meta-learning can be explored in future work.
Budget-aware variational quantum algorithms. – A point of high interest is the design of budget-aware variational quantum algorithms. The importance of sample efficiency in the quantum setting cannot be overemphasized: each run of a quantum circuit can be expensive, and quantum decoherence noise is usually not stationary over time. Budget-awareness can be naturally incorporated into the MCTS framework: making use of the tree structure, an adaptive algorithm would distribute more of the function-evaluation budget to the most-visited or most promising nodes. The current algorithm likely operates in a budget-sufficient regime and uses the same amount of budget for each discrete gate sequence. We hope an adaptive algorithm can hit the sweet spot in the middle, i.e., use the right amount of computational budget while still finding the best possible gate sequence. We hope the present work will accelerate research on budget-aware variational quantum algorithms in realistic settings.
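As a toy illustration of such budget-awareness (a hypothetical scheme, not part of the current algorithm), one could distribute the policy-gradient iteration budget in proportion to the MCTS visit counts:

```python
def allocate_budget(visit_counts, total_budget, min_iters=1):
    """Distribute a total number of policy-gradient iterations across
    candidate gate sequences in proportion to their MCTS visit counts,
    while guaranteeing each sequence at least `min_iters` iterations.
    (Hypothetical scheme for illustration, not the paper's algorithm.)"""
    n = len(visit_counts)
    total_visits = sum(visit_counts)
    remaining = total_budget - min_iters * n
    shares = [min_iters + (remaining * v) // total_visits for v in visit_counts]
    # give integer-division leftovers to the most-visited sequence
    top = max(range(n), key=lambda i: visit_counts[i])
    shares[top] += total_budget - sum(shares)
    return shares

shares = allocate_budget([10, 5, 1], total_budget=100)
```

The design choice here is the simplest proportional rule; smarter rules could also fold in reward uncertainty, e.g., bandit-style allocation.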
Acknowledgement
We thank Michael Luo for donating eight NVIDIA V100 Tensor Core GPUs to support computation. This work was partially supported by the Department of Energy under Grant No. DE-AC02-05CH11231 and No. DE-SC0017867 (L.L., J.Y.), and by the National Science Foundation under the NSF QLCI program through grant number OMA-2016245 (L.L.). M.B. was supported by the Marie Skłodowska-Curie grant agreement No. 890711.
We used W&B (Biewald, 2020) to organize and analyze the experiments. The reinforcement learning networks are implemented in NumPy (Harris et al., 2020) and JAX (Bradbury et al., 2018); the quantum systems are simulated in QuSpin (Weinberg and Bukov, 2017, 2019). We thank Berkeley Research Computing (BRC) and Google Cloud Computing Services (GCP) for providing the computational resources.
References
 Abdolmaleki et al. (2018) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920v1, Jun 2018. URL http://arxiv.org/abs/1806.06920v1.
 Abramson (2014) Bruce Abramson. The expected-outcome model of two-player games. Morgan Kaufmann, 2014.
 Albarrán-Arriagada et al. (2018) Francisco Albarrán-Arriagada, Juan C Retamal, Enrique Solano, and Lucas Lamata. Measurement-based adaptation protocol with quantum reinforcement learning. Phys. Rev. A, 98:042315, 2018. doi: 10.1103/PhysRevA.98.042315. URL https://link.aps.org/doi/10.1103/PhysRevA.98.042315.
 Anand et al. (2020) Abhinav Anand, Matthias Degroote, and Alán Aspuru-Guzik. Natural evolutionary strategies for variational quantum computation. arXiv preprint arXiv:2012.00101, 2020.
 Arute et al. (2020) Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Sergio Boixo, Michael Broughton, Bob B Buckley, David A Buell, et al. Hartree-Fock on a superconducting qubit quantum computer. Science, 369(6507):1084–1089, 2020.
 Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
 August and Hernández-Lobato (2018) Moritz August and José Miguel Hernández-Lobato. Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control. In International Conference on High Performance Computing, pages 591–613. Springer, 2018. URL https://arxiv.org/abs/1802.04063.
 Berry et al. (2007) Dominic W Berry, Graeme Ahokas, Richard Cleve, and Barry C Sanders. Efficient quantum algorithms for simulating sparse Hamiltonians. Commun. Math. Phys., 270(2):359–371, 2007.
 Bertsekas (2019) Dimitri Bertsekas. Reinforcement learning and optimal control. Athena Scientific, 2019.
 Biewald (2020) Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
 Bolens and Heyl (2020) Adrien Bolens and Markus Heyl. Reinforcement learning for digital quantum simulation. arXiv preprint arXiv:2006.16269, 2020. URL https://arxiv.org/abs/2006.16269.
 Botet and Jullien (1983) R Botet and R Jullien. Large-size critical behavior of infinitely coordinated systems. Physical Review B, 28(7):3955, 1983.
 Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
 Browne et al. (2012) Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
 Bukov (2018) Marin Bukov. Reinforcement learning for autonomous preparation of Floquet-engineered states: Inverting the quantum Kapitza oscillator. Physical Review B, 98(22):224305, 2018. doi: 10.1103/PhysRevB.98.224305. URL https://link.aps.org/doi/10.1103/PhysRevB.98.224305.
 Bukov et al. (2018) Marin Bukov, Alexandre GR Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta. Reinforcement learning in different phases of quantum control. Physical Review X, 8(3):031086, 2018. doi: 10.1103/PhysRevX.8.031086. URL https://link.aps.org/doi/10.1103/PhysRevX.8.031086.
 Cai et al. (2018) Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
 Caneva et al. (2011) Tommaso Caneva, Tommaso Calarco, and Simone Montangero. Chopped random-basis quantum optimization. Phys. Rev. A, 84:022326, 2011. doi: 10.1103/PhysRevA.84.022326. URL https://link.aps.org/doi/10.1103/PhysRevA.84.022326.
 Cerezo et al. (2021a) M. Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C. Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R. McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, and Patrick J. Coles. Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, 2021a. doi: 10.1038/s42254021003489. URL https://doi.org/10.1038/s42254021003489.
 Cerezo et al. (2021b) Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, et al. Variational quantum algorithms. Nature Reviews Physics, 3(9):625–644, 2021b.
 Chandarana et al. (2021) P. Chandarana, N. N. Hegade, Koushik Paul, F. Albarrán-Arriagada, Enrique Solano, A. del Campo, and Xi Chen. Digitized-counterdiabatic quantum approximate optimization algorithm. arXiv preprint arXiv:2107.02789v2, Jul 2021. URL http://arxiv.org/abs/2107.02789v2.
 Chen et al. (2013) Chunlin Chen, Daoyi Dong, Han-Xiong Li, Jian Chu, and Tzyh-Jong Tarn. Fidelity-based probabilistic Q-learning for control of quantum systems. IEEE transactions on neural networks and learning systems, 25(5):920–933, 2013. doi: 10.1109/TNNLS.2013.2283574.
 Chen et al. (2020) Yu-Qin Chen, Yu Chen, Chee-Kong Lee, Shengyu Zhang, and Chang-Yu Hsieh. Optimizing quantum annealing schedules: From Monte Carlo tree search to QuantumZero. arXiv preprint arXiv:2004.02836v2, Apr 2020. URL http://arxiv.org/abs/2004.02836v2.
 Coulom (2006) Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
 Dalgaard et al. (2020) Mogens Dalgaard, Felix Motzoi, Jens Jakob Sørensen, and Jacob Sherson. Global optimization of quantum dynamics with AlphaZero deep exploration. npj Quantum Information, 6(1), jan 2020. doi: 10.1038/s4153401902410. URL https://doi.org/10.1038%2Fs4153401902410.

 Dalgaard et al. (2022) Mogens Dalgaard, Felix Motzoi, and Jacob Sherson. Predicting quantum dynamical cost landscapes with deep learning. Physical Review A, 105(1):012402, 2022.
 Day et al. (2019) Alexandre GR Day, Marin Bukov, Phillip Weinberg, Pankaj Mehta, and Dries Sels. Glassy phase of optimal quantum control. Physical review letters, 122(2):020601, 2019. doi: 10.1103/PhysRevLett.122.020601.
 Delalleau et al. (2019) Olivier Delalleau, Maxim Peter, Eloi Alonso, and Adrien Logut. Discrete and continuous action representation for practical rl in video games. arXiv preprint arXiv:1912.11077, 2019.
 Elsken et al. (2018) Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377v3, Aug 2018. URL http://arxiv.org/abs/1808.05377v3. Journal of Machine Learning Research, 20:1–21, 2019.
 Farhi et al. (2014) Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A Quantum Approximate Optimization Algorithm. arXiv preprint arXiv:1411.4028, 2014. URL https://arxiv.org/pdf/1411.4028.pdf.
 Fösel et al. (2018) Thomas Fösel, Petru Tighineanu, Talitha Weiss, and Florian Marquardt. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X, 8:031084, 2018. doi: 10.1103/PhysRevX.8.031084. URL https://link.aps.org/doi/10.1103/PhysRevX.8.031084.
 Fösel et al. (2020) Thomas Fösel, Stefan Krastanov, Florian Marquardt, and Liang Jiang. Efficient cavity control with snap gates. arXiv preprint arXiv:2004.14256, 2020. URL https://arxiv.org/abs/2004.14256.
 Guo et al. (2014) Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014.
 Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018. URL https://arxiv.org/abs/1812.05905.
 Harris et al. (2020) Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
 Hartmann and Lechner (2019) Andreas Hartmann and Wolfgang Lechner. Rapid counterdiabatic sweeps in lattice gauge adiabatic quantum computing. New Journal of Physics, 21(4):043025, 2019. doi: 10.1088/13672630/ab14a0. URL https://doi.org/10.1088%2F13672630%2Fab14a0.
 Hegade et al. (2021a) N. N. Hegade, P. Chandarana, K. Paul, X. Chen, F. AlbarránArriagada, and E. Solano. Portfolio optimization with digitizedcounterdiabatic quantum algorithms. arXiv preprint arXiv:2112.08347v1, Dec 2021a. URL http://arxiv.org/abs/2112.08347v1.
 Hegade et al. (2021b) Narendra N. Hegade, Koushik Paul, Yongcheng Ding, Mikel Sanz, F. AlbarránArriagada, Enrique Solano, and Xi Chen. Shortcuts to adiabaticity in digitized adiabatic quantum computing. Physical Review Applied, 15(2), feb 2021b. doi: 10.1103/physrevapplied.15.024038. URL https://doi.org/10.1103%2Fphysrevapplied.15.024038.
 Hegade et al. (2022) Narendra N. Hegade, Xi Chen, and Enrique Solano. Digitizedcounterdiabatic quantum optimization. arXiv preprint arXiv:2201.00790v1, Jan 2022. URL http://arxiv.org/abs/2201.00790v1.
 Ho and Hsieh (2019) Wen Wei Ho and Timothy H Hsieh. Efficient variational simulation of nontrivial quantum states. SciPost Phys, 6:29, 2019.
 Khaneja et al. (2005) Navin Khaneja, Timo Reiss, Cindie Kehlet, Thomas Schulte-Herbrüggen, and Steffen J Glaser. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. Journal of magnetic resonance, 172(2):296–305, 2005.
 Kim and Huse (2013) Hyungwon Kim and David A Huse. Ballistic spreading of entanglement in a diffusive nonintegrable system. Physical review letters, 111(12):127205, 2013.
 Klink et al. (2020) Pascal Klink, Carlo D’Eramo, Jan R Peters, and Joni Pajarinen. Selfpaced deep reinforcement learning. Advances in Neural Information Processing Systems, 33:9216–9227, 2020.
 Kuo et al. (2021) En-Jui Kuo, Yao-Lung L. Fang, and Samuel Yen-Chi Chen. Quantum architecture search via deep reinforcement learning. arXiv preprint arXiv:2104.07715v1, Apr 2021. URL http://arxiv.org/abs/2104.07715v1.
 Lillicrap et al. (2016) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1509.02971.
 Lipkin et al. (1965) Harry J Lipkin, N Meshkov, and AJ Glick. Validity of manybody approximation methods for a solvable model:(i). exact solutions and perturbation theory. Nuclear Physics, 62(2):188–198, 1965.
 Liu et al. (2019) Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=S1eYHoC5FX.
 Low and Chuang (2017) Guang Hao Low and Isaac L. Chuang. Optimal hamiltonian simulation by quantum signal processing. Phys. Rev. Lett., 118:010501, 2017.
 Mania et al. (2018) Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055v1, Mar 2018. URL http://arxiv.org/abs/1803.07055v1.
 Matos et al. (2021) Gabriel Matos, Sonika Johri, and Zlatko Papić. Quantifying the efficiency of state preparation via quantum variational eigensolvers. PRX Quantum, 2(1), 2021. doi: 10.1103/prxquantum.2.010309. URL https://doi.org/10.1103%2Fprxquantum.2.010309.
 McClean et al. (2016) Jarrod R McClean, Jonathan Romero, Ryan Babbush, and Alán AspuruGuzik. The theory of variational hybrid quantumclassical algorithms. New J. Phys., 18(2):023023, 2016.
 Metz and Bukov (2022) Friederike Metz and Marin Bukov. Self-correcting quantum many-body control using reinforcement learning with tensor networks. arXiv preprint arXiv:2201.11790, 2022.
 Nautrup et al. (2019) Hendrik Poulsen Nautrup, Nicolas Delfosse, Vedran Dunjko, Hans J Briegel, and Nicolai Friis. Optimizing quantum error correction codes with reinforcement learning. Quantum, 3:215, 2019. doi: 10.22331/q20191216215.
 Nesterov and Spokoiny (2017) Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
 Neunert et al. (2020) Michael Neunert, Abbas Abdolmaleki, Markus Wulfmeier, Thomas Lampe, Jost Tobias Springenberg, Roland Hafner, Francesco Romano, Jonas Buchli, Nicolas Heess, and Martin Riedmiller. Continuous-discrete reinforcement learning for hybrid control in robotics. arXiv preprint arXiv:2001.00449, 2020.
 Niu et al. (2019) Murphy Yuezhen Niu, Sergio Boixo, Vadim N Smelyanskiy, and Hartmut Neven. Universal quantum control through deep reinforcement learning. npj Quantum Information, 5(1):1–8, 2019. doi: 10.1038/s4153401901413. URL https://doi.org/10.1038/s4153401901413.
 Passarelli et al. (2020) G. Passarelli, V. Cataudella, R. Fazio, and P. Lucignano. Counterdiabatic driving in the quantum annealing of the spin model: A variational approach. Phys. Rev. Research, 2:013283, Mar 2020. doi: 10.1103/PhysRevResearch.2.013283. URL https://link.aps.org/doi/10.1103/PhysRevResearch.2.013283.

 Peruzzo et al. (2014) Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature communications, 5:4213, 2014. doi: 10.1038/ncomms5213. URL https://doi.org/10.1038/ncomms5213.
 Porotti et al. (2019) Riccardo Porotti, Dario Tamascelli, Marcello Restelli, and Enrico Prati. Coherent transport of quantum states by deep reinforcement learning. Communications Physics, 2(1):1–9, 2019. doi: 10.1038/s42005-019-0169-x. URL https://doi.org/10.1038/s42005-019-0169-x.
 Preskill (2018) John Preskill. Quantum computing in the nisq era and beyond. Quantum, 2:79, 2018.
 Real et al. (2020) Esteban Real, Chen Liang, David So, and Quoc Le. AutoML-Zero: Evolving machine learning algorithms from scratch. In International Conference on Machine Learning, pages 8007–8019. PMLR, 2020.
 Sauvage and Mintert (2019) Frederic Sauvage and Florian Mintert. Optimal quantum control with poor statistics. arXiv preprint arXiv:1909.01229, 2019. URL https://arxiv.org/abs/1909.01229.
 Schäfer et al. (2020) Frank Schäfer, Michal Kloc, Christoph Bruder, and Niels Lörch. A differentiable programming method for quantum control. Machine Learning: Science and Technology, 1, 2020. doi: 10.1088/26322153/ab9802.
 Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, dec 2020. doi: 10.1038/s41586020030514. URL https://doi.org/10.1038%2Fs41586020030514.
 Sels and Polkovnikov (2017) Dries Sels and Anatoli Polkovnikov. Minimizing irreversible losses in quantum systems by local counterdiabatic driving. Proceedings of the National Academy of Sciences, 114(20):E3909–E3916, 2017.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
 Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay. Science, 362(6419):1140–1144, 2018.
 Sim et al. (2021) Sukin Sim, Jonathan Romero, Jérôme F Gonthier, and Alexander A Kunitsa. Adaptive pruning-based optimization of parameterized quantum circuits. Quantum Science and Technology, 6(2):025019, mar 2021. doi: 10.1088/2058-9565/abe107. URL https://doi.org/10.1088/2058-9565/abe107.
 Sørdal and Bergli (2019) Vegard B Sørdal and Joakim Bergli. Deep reinforcement learning for robust quantum optimization. arXiv preprint arXiv:1904.04712, 2019. URL https://arxiv.org/abs/1904.04712.
 Sung (2020) Kevin Sung. Towards the First Practical Applications of Quantum Computers. PhD thesis, University of Michigan, 2020.
 Sung et al. (2020) Kevin J Sung, Jiahao Yao, Matthew P Harrigan, Nicholas C Rubin, Zhang Jiang, Lin Lin, Ryan Babbush, and Jarrod R McClean. Using models to improve optimizers for variational quantum algorithms. Quantum Science and Technology, 5(4):044008, 2020.
 Tilly et al. (2021) Jules Tilly, Hongxiang Chen, Shuxiang Cao, Dario Picozzi, Kanav Setia, Ying Li, Edward Grant, Leonard Wossnig, Ivan Rungger, George H. Booth, and Jonathan Tennyson. The variational quantum eigensolver: a review of methods and best practices. arXiv preprint arXiv:2111.05176v1, Nov 2021. URL http://arxiv.org/abs/2111.05176v1.
 Trabucco et al. (2021) Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline modelbased optimization. In International Conference on Machine Learning, pages 10358–10368. PMLR, 2021.
 Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multiagent reinforcement learning. Nature, 575(7782):350–354, 2019.
 Wang et al. (2020a) Guoming Wang, Dax Enshan Koh, Peter D Johnson, and Yudong Cao. Bayesian inference with engineered likelihood functions for robust amplitude estimation. arXiv preprint arXiv:2006.09350, 2020a.
 Wang et al. (2021a) Hanrui Wang, Yongshan Ding, Jiaqi Gu, Zirui Li, Yujun Lin, David Z. Pan, Frederic T. Chong, and Song Han. Quantumnas: Noiseadaptive search for robust quantum circuits. arXiv preprint arXiv:2107.10845v5, Jul 2021a. URL http://arxiv.org/abs/2107.10845v5.
 Wang et al. (2021b) Hanrui Wang, Jiaqi Gu, Yongshan Ding, Zirui Li, Frederic T. Chong, David Z. Pan, and Song Han. Roqnn: Noiseaware training for robust quantum neural networks. arXiv preprint arXiv:2110.11331v1, Oct 2021b. URL http://arxiv.org/abs/2110.11331v1.
 Wang et al. (2020b) Linnan Wang, Rodrigo Fonseca, and Yuandong Tian. Learning search space partition for blackbox optimization using monte carlo tree search. arXiv preprint arXiv:2007.00708, 2020b.
 Wauters et al. (2020) Matteo M Wauters, Emanuele Panizon, Glen B Mbeng, and Giuseppe E Santoro. Reinforcement learning assisted quantum optimization. Phys. Rev. Research, 2:033446, 2020. doi: 10.1103/PhysRevResearch.2.033446. URL https://link.aps.org/doi/10.1103/PhysRevResearch.2.033446.
 Weinberg and Bukov (2017) Phillip Weinberg and Marin Bukov. Quspin: a python package for dynamics and exact diagonalisation of quantum many body systems part i: spin chains. SciPost Phys, 2(1), 2017.
 Weinberg and Bukov (2019) Phillip Weinberg and Marin Bukov. Quspin: a python package for dynamics and exact diagonalisation of quantum many body systems. part ii: bosons, fermions and higher spins. SciPost Phys., 7(arXiv: 1804.06782):020, 2019.
 Wu et al. (2020a) Re-Bing Wu, Xi Cao, Pinchen Xie, and Yuxi Liu. End-to-end quantum machine learning with quantum control systems. arXiv preprint arXiv:2003.13658, 2020a. URL https://arxiv.org/abs/2003.13658.
 Wu et al. (2020b) Yadong Wu, Zengming Meng, Kai Wen, Chengdong Mi, Jing Zhang, and Hui Zhai. Active learning approach to optimization of experimental control. arXiv preprint arXiv:2003.11804, 2020b.
 Wurtz and Love (2021) Jonathan Wurtz and Peter J Love. Counterdiabaticity and the quantum approximate optimization algorithm. arXiv preprint arXiv:2106.15645, 2021.
 Yao et al. (2020a) Jiahao Yao, Marin Bukov, and Lin Lin. Policy gradient based quantum approximate optimization algorithm. In Mathematical and Scientific Machine Learning, pages 605–634. PMLR, 2020a.
 Yao et al. (2020b) Jiahao Yao, Paul Köttering, Hans Gundlach, Lin Lin, and Marin Bukov. Noise-robust end-to-end quantum control using deep autoregressive policy networks. arXiv preprint arXiv:2012.06701, 2020b.
 Yao et al. (2020c) Jiahao Yao, Lin Lin, and Marin Bukov. Reinforcement learning for many-body ground state preparation based on counterdiabatic driving. arXiv preprint arXiv:2010.03655, 2020c. URL https://arxiv.org/abs/2010.03655.
 Ye et al. (2021) Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. arXiv preprint arXiv:2111.00210v2, Oct 2021. URL http://arxiv.org/abs/2111.00210v2.
 Zhang et al. (2020) ShiXin Zhang, ChangYu Hsieh, Shengyu Zhang, and Hong Yao. Differentiable quantum architecture search. arXiv preprint arXiv:2010.08561, 2020.
 Zhang et al. (2021) ShiXin Zhang, ChangYu Hsieh, Shengyu Zhang, and Hong Yao. Neural predictor based quantum architecture search. arXiv preprint arXiv:2103.06524v1, Mar 2021. URL http://arxiv.org/abs/2103.06524v1.
 Zhou et al. (2020a) Hui Zhou, Yunlan Ji, Xinfang Nie, Xiaodong Yang, Xi Chen, Ji Bian, and Xinhua Peng. Experimental realization of shortcuts to adiabaticity in a nonintegrable spin chain by local counterdiabatic driving. Physical Review Applied, 13(4):044059, 2020a.
 Zhou et al. (2020b) Xiangzhen Zhou, Yuan Feng, and Sanjiang Li. A monte carlo tree search framework for quantum circuit transformation. arXiv preprint arXiv:2008.09331v2, Aug 2020b. URL http://arxiv.org/abs/2008.09331v2.
 Zhu et al. (2020) Linghua Zhu, Ho Lun Tang, George S Barron, Nicholas J Mayhall, Edwin Barnes, and Sophia E Economou. An adaptive quantum approximate optimization algorithm for solving combinatorial problems on a quantum computer. arXiv preprint arXiv:2005.10258, 2020. URL https://arxiv.org/abs/2005.10258.
 Zoph and Le (2017) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.
Appendix A Related works
Hybrid optimization: The generalized QAOA ansatz introduces a discrete and continuous control problem: the discrete degrees of freedom are the gates/unitaries that define the control protocol, while the continuous degrees of freedom are the gate durations. Most reinforcement learning algorithms (Lillicrap et al., 2016; Bertsekas, 2019; Trabucco et al., 2021) deal with the control of either discrete or continuous degrees of freedom, and rarely consider discrete and continuous control simultaneously in the policy. Even though continuous controls can always be discretized, it is beneficial to treat the discrete and continuous variables together, without losing the flexibility of continuous control. The idea of joint continuous and discrete optimization is quite general, and shows up in real-world applications such as robotics (Neunert et al., 2020; Delalleau et al., 2019) and strategic games (Vinyals et al., 2019). By combining discrete and continuous control, the control capability of an algorithm can be greatly enhanced. In general, the discrete variables are chosen as the categories of actions, and the continuous variables are naturally given by the strength of each specific action. Our work aims to shed light on hybrid control in the field of quantum control, and we hope it will also accelerate research on hybrid discrete-continuous optimization algorithms in the wider community.
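One common way to parameterize such a hybrid policy (an illustrative assumption, not the specific parameterization used in MCTS-QAOA) is to factor it into a categorical distribution over the discrete gates and a per-gate Gaussian over the continuous duration:

```python
import math
import random

def sample_hybrid_action(logits, mu, sigma, rng):
    """Sample from a factored hybrid policy: a categorical distribution
    over gates (softmax of `logits`) picks the discrete action, and a
    Gaussian attached to the chosen gate samples its duration.
    (Hypothetical parameterization for illustration.)"""
    # numerically stable softmax over the discrete gate logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # categorical draw by inverse-CDF sampling
    u, acc = rng.random(), 0.0
    gate = len(probs) - 1
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            gate = i
            break
    duration = abs(rng.gauss(mu[gate], sigma[gate]))  # keep durations positive
    return gate, duration

rng = random.Random(1)
gate, duration = sample_hybrid_action(
    logits=[0.1, 2.0, -1.0], mu=[0.5, 1.0, 0.2], sigma=[0.1, 0.1, 0.1], rng=rng)
```

The factored structure keeps both gradients (through the Gaussian) and tree search (over the categorical part) tractable, which is the essence of the hybrid setting.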
Counterdiabatic driving: Counterdiabatic driving (Sels and Polkovnikov, 2017; Hegade et al., 2022), an example of a shortcut to adiabaticity (STA), introduces an extra auxiliary counterdiabatic (CD) Hamiltonian to suppress transitions (or excitations) between instantaneous eigenstates.
For a given quantum state $|\psi(t)\rangle$ evolving under a time-dependent Hamiltonian $H(\lambda(t))$, the Schrödinger equation reads as
$i\hbar\,\partial_t\,|\psi(t)\rangle = H(\lambda(t))\,|\psi(t)\rangle.$ (A.1)
In the rotating frame, the Hamiltonian remains stationary under the unitary transformation $U(\lambda)$ that diagonalizes $H(\lambda)$, i.e., in the instantaneous eigenbasis of the Hamiltonian $H(\lambda)$. The wave function $|\tilde{\psi}\rangle = U^\dagger(\lambda)\,|\psi\rangle$ in the rotating frame satisfies the following Schrödinger equation:
$i\hbar\,\partial_t\,|\tilde{\psi}\rangle = \left(\tilde{H}(\lambda) - \dot{\lambda}\,\tilde{\mathcal{A}}_\lambda\right)|\tilde{\psi}\rangle,$ (A.2)
where $\tilde{\mathcal{A}}_\lambda = i\hbar\,U^\dagger\,\partial_\lambda U$ is the adiabatic gauge potential in the rotating frame. Specifically, instead of being diagonalized, the original Hamiltonian picks up an extra contribution due to the change in the parameter $\lambda$, and the effective Hamiltonian becomes
$H_{\mathrm{eff}} = \tilde{H}(\lambda) - \dot{\lambda}\,\tilde{\mathcal{A}}_\lambda.$ (A.3)
The idea of CD driving is to evolve the system with the counterdiabatic Hamiltonian
$H_{\mathrm{CD}}(t) = H(\lambda) + \dot{\lambda}\,\mathcal{A}_\lambda.$ (A.4)
Importantly, the resulting effective Hamiltonian in the moving frame is then the diagonal $\tilde{H}(\lambda)$: it is stationary, and no transitions occur.
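The transitionless property can be checked numerically on the smallest example: a two-level system $H(\lambda)=\lambda\sigma^z+\Delta\sigma^x$, for which the gauge potential in the convention of (A.4) is known in closed form, $\mathcal{A}_\lambda = -\frac{\Delta}{2(\lambda^2+\Delta^2)}\,\sigma^y$. The sketch below (illustrative parameter values, not from the paper) ramps $\lambda$ quickly and compares the bare evolution with the CD-driven one:

```python
import numpy as np

# Pauli matrices
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def step(psi, H, dt):
    """One exact step of exp(-i H dt) via eigendecomposition of H."""
    w, v = np.linalg.eigh(H)
    return v @ (np.exp(-1j * w * dt) * (v.conj().T @ psi))

def ground(H):
    """Instantaneous ground state (lowest eigenvector) of H."""
    return np.linalg.eigh(H)[1][:, 0]

delta = 0.5                                 # fixed transverse field
lam = lambda t: -2.0 + 4.0 * t              # fast linear ramp, t in [0, 1]
dlam = 4.0                                  # d(lambda)/dt
H0 = lambda l: l * sz + delta * sx          # bare Hamiltonian H(lambda)
A = lambda l: -delta / (2 * (l**2 + delta**2)) * sy   # exact gauge potential

n, dt = 2000, 1.0 / 2000
psi_cd = ground(H0(lam(0.0)))               # start in the initial ground state
psi_bare = psi_cd.copy()
for k in range(n):
    t = (k + 0.5) * dt                      # midpoint rule for the ramp
    psi_cd = step(psi_cd, H0(lam(t)) + dlam * A(lam(t)), dt)
    psi_bare = step(psi_bare, H0(lam(t)), dt)

target = ground(H0(lam(1.0)))
f_cd = abs(target.conj() @ psi_cd) ** 2     # close to 1: CD driving is transitionless
f_bare = abs(target.conj() @ psi_bare) ** 2  # well below 1: the bare ramp excites
```

The CD-driven fidelity stays at unity (up to time-discretization error), whereas the bare fast ramp loses substantial ground-state fidelity, which is precisely the effect the approximate pool terms aim to capture for many-body systems.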
However, in practice, the precise counterdiabatic Hamiltonian is intractable and is usually approximated by various methods. A good number of prior works (Passarelli et al., 2020; Hartmann and Lechner, 2019; Hegade et al., 2022, 2021a; Zhou et al., 2020a; Wurtz and Love, 2021; Hegade et al., 2021b) are based on the concept of a variational approximation to the CD Hamiltonian (Sels and Polkovnikov, 2017). Most of these works make use of an analytically computed expression available for few-qubit systems; they first derive the continuous form of the variational gauge potential, and then discretize the underlying dynamics using the Trotter-Suzuki formula. In this work, we aim to bypass these constraints by applying the variational generalized QAOA ansatz using additional gates, generated by terms that occur in the approximation to the variational adiabatic gauge potential. These extra gates can provide a shortcut to the preparation of the ground state, compared to the original alternating QAOA ansatz. Physically, this shortcut results in shorter circuit simulation times, which provides a significant advantage on noisy NISQ devices.
AutoML and neural architecture search: Automatic machine learning, or AutoML, has recently attracted considerable attention, as it reduces the human effort of designing neural architectures by hand and instead leverages computational power to search for the best configuration. One of the most prominent examples is neural architecture search (NAS) and its variants (Zoph and Le, 2017; Elsken et al., 2018; Liu et al., 2019; Cai et al., 2018; Real et al., 2020), where reinforcement learning or evolutionary strategies are used to find better network architectures. Inspired by the success of AutoML, the architecture of quantum circuits can also be improved by machine learning algorithms, such as quantum versions of neural architecture search (Wang et al., 2021a, b; Zhang et al., 2021, 2020; Kuo et al., 2021). These prior works interpret the problem as a quantum compiling problem, which assembles quantum gates at a low level. Instead of exposing a huge number of alternatives to the search algorithm, our work specifically uses the variational gauge potentials as the Hamiltonian pool in a computation-efficient way. Compared with QAOA, MCTS-QAOA has more degrees of freedom to approximate the unitary operator; compared with quantum compiling, it does not search over low-level gates, due to computational constraints. From this perspective, our method hits a sweet spot between expressivity and efficiency.
Appendix B Setup of physical models
We first give a brief review of the physical models used in the numerical experiments. In all experiments, we choose the target state as the ground state of the Hamiltonian , denoted . The spin matrices describing spin are denoted by . In contrast to the models considered in Yao et al. (2020b, c), the Hamiltonians used in this work are normalized by their operator norm , i.e., we use instead of the original Hamiltonian . The reason for introducing the normalized Hamiltonian is as follows. For generic Hamiltonians (e.g., sparse matrices), the cost of performing a Hamiltonian evolution on a quantum device is Berry et al. (2007); Low and Chuang (2017). Due to the potential differences between the Hamiltonian norms in the Hamiltonian pool , using a normalized Hamiltonian (whose corresponding duration parameter is thus multiplied by ) leads to a more realistic estimate of the cost of the quantum simulation. Due to this multiplication factor, the durations shown in the results below are larger than those presented in Yao et al. (2020b, c).
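The normalization step can be sketched as follows: since exp(-i tau H) = exp(-i (tau ||H||) (H / ||H||)), dividing the Hamiltonian by its spectral norm simply multiplies the duration by that norm (the helper name below is ours):

```python
import numpy as np

def normalize_hamiltonian(H, tau):
    """Return (H / ||H||, tau * ||H||); the generated evolution
    exp(-1j * tau * H) is unchanged, but the generator now has unit norm."""
    norm = np.linalg.norm(H, ord=2)  # spectral norm; largest |eigenvalue| for Hermitian H
    return H / norm, tau * norm
```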
One-dimensional (1D) Ising model
The spin Ising Hamiltonian reads as:
where is the number of qubits and the parameters are set as and (Kim and Huse, 2013). These parameters are close to the critical line of the model in the thermodynamic limit, where the quantum phase transition occurs. They are also reported in Ref. (Matos et al., 2021) to lie in the most challenging parameter region for QAOA. We use periodic boundary conditions here. The initial state for this experiment is given by the polarized product state, i.e., . For the Hamiltonian pool, we use , where . The operators are precisely the first three terms in the expansion for the adiabatic gauge potential of the translation-invariant 1D Ising model (Yao et al., 2020c).
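As a concrete illustration, a mixed-field 1D Ising Hamiltonian with periodic boundary conditions can be assembled by embedding Pauli matrices into the full 2^L-dimensional space. The couplings are left as arguments; the specific parameter values used in the experiments are not hard-coded here:

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def op_at(op, site, L):
    """Embed a single-site operator at `site` in an L-qubit chain."""
    ops = [I2] * L
    ops[site] = op
    return reduce(np.kron, ops)

def ising_1d(L, J, hx, hz):
    """Mixed-field 1D Ising Hamiltonian with periodic boundary conditions:
    H = J * sum_i Z_i Z_{i+1} + hx * sum_i X_i + hz * sum_i Z_i.
    (For L = 2 the periodic wrap counts the single bond twice.)"""
    H = np.zeros((2**L, 2**L))
    for i in range(L):
        H += J * op_at(Z, i, L) @ op_at(Z, (i + 1) % L, L)
        H += hx * op_at(X, i, L) + hz * op_at(Z, i, L)
    return H
```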
Two-dimensional (2D) Ising model
The 2D spin transverse-field Ising model reads:
where denotes nearest neighbors on the square lattice. The model parameters are set as and . The initial state is , i.e., the polarized product state on the 2D lattice.
For the Hamiltonian pool, we use , where .
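Relative to the 1D chain, the only new ingredient in the 2D model is the enumeration of nearest-neighbor bonds on the square lattice. A sketch with row-major site indexing (our convention; the boundary-condition flag is included for generality):

```python
def square_lattice_edges(Lx, Ly, periodic=True):
    """Enumerate nearest-neighbor bonds on an Lx-by-Ly square lattice,
    with sites indexed row-major as i = x * Ly + y."""
    edges = []
    for x in range(Lx):
        for y in range(Ly):
            i = x * Ly + y
            if periodic or x + 1 < Lx:          # bond in the x direction
                edges.append((i, ((x + 1) % Lx) * Ly + y))
            if periodic or y + 1 < Ly:          # bond in the y direction
                edges.append((i, x * Ly + (y + 1) % Ly))
    return edges
```

The interaction term of the 2D Hamiltonian is then a sum of two-site operators over these edges, built exactly as in the 1D case.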
Lipkin-Meshkov-Glick (LMG) model
The Lipkin-Meshkov-Glick (LMG) model (Lipkin et al., 1965) reads:
where is the interaction strength, and stands for the magnetic field strength. The LMG model preserves the total spin, and due to this symmetry the ground state is contained in an -dimensional subspace. This makes the LMG model particularly interesting, because it allows us to simulate its dynamics for a large number of spins, where many-body effects, such as collective phenomena, dominate the physics of the system.
For instance, in the thermodynamic limit , the LMG model exhibits a quantum phase transition at (Botet and Jullien, 1983). The transition is between ferromagnetic (FM) order of the ground state in the direction () and paramagnetic order ().
For the Hamiltonian pool, we use , where
(B.1) 
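Because the LMG dynamics stays in the symmetric sector, it is generated by collective spin operators acting on a (2S+1)-dimensional space with S = N/2, which is what makes large-N simulation tractable. A sketch of their construction from the standard angular-momentum matrix elements:

```python
import numpy as np

def spin_matrices(S):
    """Collective spin operators S_x, S_y, S_z for total spin S
    in the (2S+1)-dimensional symmetric sector."""
    m = np.arange(S, -S - 1, -1)            # magnetic quantum numbers S, S-1, ..., -S
    Sz = np.diag(m)
    # raising-operator matrix elements: <m+1|S+|m> = sqrt(S(S+1) - m(m+1))
    sp = np.diag(np.sqrt(S * (S + 1) - m[1:] * (m[1:] + 1)), k=1)
    Sx = (sp + sp.T) / 2
    Sy = (sp - sp.T) / (2j)
    return Sx, Sy, Sz
```

The LMG Hamiltonian and the pool operators in Eq. (B.1) are then polynomials in these matrices, so the simulation cost scales with N rather than 2^N.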
Appendix C Noise models
An essential part of our study is the performance of the algorithms in the presence of noise. As mentioned in the main text, noise sets the current bottleneck for reliable quantum computation. It is therefore of primary importance for the near-term utility of quantum computers to develop stable and noise-robust manipulation algorithms.
We use the following three noise models in our numerical experiments: (i) classical measurement noise; (ii) quantum measurement noise, which mimics the situation on present-day NISQ devices; and (iii) gate rotation error.
Classical measurement noise: Gaussian noise is added to the cost function according to
where and denotes the noise strength, and is the normal distribution. Gaussian noise models the various kinds of uncertainty present in experiments through an additive Gaussian random variable, as justified by the central limit theorem.
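A minimal sketch of this noise model: the exact cost value is perturbed by an additive zero-mean Gaussian of strength `sigma` (the function name and the fixed seed are ours):

```python
import numpy as np

def noisy_cost_gaussian(cost, sigma, rng=np.random.default_rng(0)):
    """Classical measurement noise: additive zero-mean Gaussian of
    strength sigma on top of the exact cost value."""
    return cost + sigma * rng.standard_normal()
```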
Quantum measurement noise:
where the noise strength depends on the magnitude of the quantum energy fluctuations, and is randomly sampled from . Quantum noise models the uncertainty arising from quantum measurements. For instance, quantum fluctuations are large when the evolved quantum state is far from the target, and they decrease as the final state approaches the target ground state.
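A hedged sketch of this noise model: the energy estimate is perturbed by a Gaussian whose scale is set by the quantum energy fluctuation dE = sqrt(<H^2> - <H>^2) of the evolved state; the precise prefactor used in the experiments is not reproduced here:

```python
import numpy as np

def noisy_cost_quantum(psi, H, sigma, rng=np.random.default_rng(0)):
    """Quantum measurement noise (sketch): perturb the energy by a
    Gaussian whose scale is the quantum energy fluctuation of psi.
    Eigenstates of H give zero fluctuation, so the estimate sharpens
    as the state approaches the target ground state."""
    e = np.real(psi.conj() @ H @ psi)
    var = np.real(psi.conj() @ (H @ H) @ psi) - e**2
    de = np.sqrt(max(var, 0.0))
    return e + sigma * de * rng.standard_normal()
```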
Gate rotation error noise:
where the gate error strengths are multiplicative and the corresponding ratios are for some simulation parameter that controls the noise strength. Gate rotation errors (Sung et al., 2020) are yet another common noise source, arising from imperfections or lack of calibration in the quantum computer hardware.
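A sketch of the gate rotation error: each gate duration is multiplied by an independent random ratio near 1, with `eps` controlling the noise strength (the Gaussian form of the ratios is an assumption of this sketch):

```python
import numpy as np

def perturb_durations(taus, eps, rng=np.random.default_rng(0)):
    """Gate rotation error (sketch): multiply each gate duration by an
    independent random ratio close to 1; eps controls the noise strength."""
    ratios = 1.0 + eps * rng.standard_normal(len(taus))
    return np.asarray(taus) * ratios
```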
Appendix D Details for the natural policy gradient with entropy regularization
For a general dimensional Gaussian distribution , the Shannon entropy is defined as , where . Hence
Omitting the constants, it is equivalent to take the entropy as
. For the model used, the probability distribution is a product of normal distributions, i.e.,
is a diagonal matrix with length and diagonal elements , so the corresponding entropy function is . In our implementation, we adopt the parameterization to ensure that is positive. Then, for the distribution , we have
and
where the gradient is taken with respect to the parameters. Since are independent, the Fisher information matrix is a block diagonal matrix with the th block equal to
Recall that for a fixed gate sequence , we set , where denotes the sigmoid function, and
Hence the gradient of is
where the gradient is taken with respect to the parameters, and . Therefore, the unbiased estimators for the variables are
where are independent standard normal variables and . As a result, the unbiased estimators for the natural gradient direction become
since is a block diagonal matrix with the th block given by .
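To summarize the update, here is a minimal sketch of the natural-gradient preconditioning for an independent Gaussian policy, assuming the standard deviations are parameterized through their logarithm (one common choice that keeps them positive; the paper's exact parameterization may differ). With that choice, the per-dimension Fisher blocks are diag(1/sigma_i^2, 2), so preconditioning reduces to elementwise rescaling of the vanilla gradients:

```python
import numpy as np

def natural_gradient_step(mu, log_sigma, g_mu, g_log_sigma, lr):
    """Natural policy gradient update for an independent Gaussian policy
    N(mu_i, sigma_i^2) with sigma_i = exp(log_sigma_i).  The Fisher
    matrix is block diagonal with per-dimension blocks diag(1/sigma_i^2, 2),
    so F^{-1} * grad is an elementwise rescaling."""
    sigma2 = np.exp(2 * log_sigma)
    mu_new = mu + lr * sigma2 * g_mu                    # F^{-1} block for the mean
    log_sigma_new = log_sigma + lr * g_log_sigma / 2.0  # F^{-1} block for log-sigma
    return mu_new, log_sigma_new
```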
Appendix E Additional experiment results
In Section 5.1, we presented a comparison between the RL-QAOA method and MCTS-QAOA for three different physical models with quantum noise. In this section, we report the test results with the other types of noise, namely the results with Gaussian noise, with the gate rotation error, and with no noise at all (cf. Appendix C). The comparison shows that MCTS-QAOA's performance is considerably more stable and accurate.
From Figure 6, one can observe similar behavior of the two methods as in Section 5.1: MCTS-QAOA outperforms RL-QAOA in all settings, and the gap grows larger in the regime of large total gate durations. The raw data for the energy ratio obtained by MCTS-QAOA are summarized in Table 2 (highlighted in bold), which offers a visually and quantitatively convenient comparison across the different models.
Table 2: Energy ratio obtained by MCTS-QAOA under gate rotation noise, quantum noise, Gaussian noise, and no noise, for (a) the 1D Ising model, (b) the 2D Ising model, and (c) the LMG model.
Appendix F Additional numerical results on the energy landscape
In Section 5.2 we reported the discrete landscape of the generalized QAOA ansatz under the condition that the continuous variables are solved to high quality by the improved NPG solver. Here we include the landscape for another physical model, the 2D Ising model. We consider the case where , and the total number of gate sequences is thus . Similar to the plots displayed in Section 5.2, the landscape with a longer total duration () features a dominant cluster at the rightmost part of the histogram. When the total duration is smaller, the number of clusters increases and the distribution shifts to the left.
Figure 8 shows the influence of the parameter on the discrete landscape for the LMG model with gate duration and . When and , the rightmost peak in the energy-ratio histogram gets close to , which means that reaching the ground state would be an easy task in these two cases. The more difficult cases lie in between, for example when . For the parameter chosen in the main text, there is a bigger gap (cf. Fig. 4(c)) between the rightmost peak of the energy ratio and , which means the problem we choose to solve is relatively challenging.
Appendix G Physical models with the identity action
The generalized QAOA ansatz provides us with the freedom to add different Hamiltonians to the Hamiltonian pool. One meaningful addition is the identity operator, which corresponds to the identity gate that leaves the quantum state unchanged. If we take , then the corresponding unitary gate will be the identity, i.e.