gym-forest
Reinforcement learning environment for the classical synthesis of quantum programs.
We develop a general method for incentive-based programming of hybrid quantum-classical computing systems using reinforcement learning, and apply this to solve combinatorial optimization problems on both simulated and real gate-based quantum computers. Relative to a set of randomly generated problem instances, agents trained through reinforcement learning techniques are capable of producing short quantum programs which generate high quality solutions on both types of quantum resources. We observe generalization to problems outside of the training set, as well as generalization from the simulated quantum resource to the physical quantum resource.
One of the earliest ambitions of artificial intelligence research was to consider ways of mechanizing the task of computer programming itself, via the automated synthesis of programs from high level specifications. There is rich literature on such techniques, including a range of meta-heuristic [olmo2014swarm, boussaid2013survey] and machine learning approaches [kant2018recent] (cf. [gulwani2017program] for a recent, broad survey on program synthesis). Such ideas are potentially compelling for the programming of quantum computers, due to both the unintuitive nature of these devices as well as the unique challenges presented by near-term hardware.
To this end, we explore reinforcement learning for automated program synthesis in the context of a hybrid classical-quantum computing architecture, considering both simulated and physical gate-based quantum computers. We explore the application of this framework to solve a range of combinatorial optimization problems (COPs). This class of problems is of particular interest in the quantum computing domain due to the emergence of new quantum heuristics for both adiabatic [das2005quantum] and gate-model [hadfield2019quantum, farhi2014quantum] quantum computing. From an application perspective, these optimization problems are ubiquitous and of very high value to many processes in industry.
Reinforcement learning techniques applied to quantum computation have found success in a range of contexts, including quantum error correction [fosel2018reinforcement], as well as noisy control for gate design [an2019deep, niu2019universal] and state preparation [bukov2018reinforcement1, bukov2018reinforcement2, august2018taking, albarran2018measurement, zhang2019reinforcement]. Here we utilize reinforcement learning directly at the application level to solve COP programming tasks.
Combinatorial optimization has proven to be a popular target domain for machine learning methods. This work dates back to at least the last machine learning cycle of the 1980s and 1990s, where Hopfield networks were used to model a variety of problem types [smith1999neural]. More recently, state of the art techniques such as recurrent encoder/decoder networks [vinyals2015pointer, bello2016neural], graph embeddings [khalil2017learning], and attention mechanisms [nazari2018reinforcement] have been used to solve a range of COPs. Note that, in these applications, machine learning has been used either end-to-end to map directly from a problem instance to a solution, or alternatively as a subroutine of an already existing heuristic. For a comprehensive discussion of this intersection of domains, see Ref. [bengio2018machine].
In what follows, we detail the design considerations involved in defining an effective reinforcement learning environment, with particular emphasis on the definition of the state space, action space, and reward function. We subsequently apply this framework to train reinforcement learning agents to generate quantum programs solving combinatorial optimization problems on both simulated and real quantum resources. Relative to a test set of randomly generated problem instances, we observe that the mean performance of the trained agents exceeds the performance of both untrained agents as well as that of the leading near-term hybrid quantum algorithm typically used to solve combinatorial optimization problems, the quantum approximate optimization algorithm (QAOA) [farhi2014quantum]. Following this, we briefly analyze and discuss agent strategy, consider limitations of the current work, and note promising avenues forward.
In this section we broadly identify the aspects of the hybrid classical/quantum environment relevant to reinforcement learning. In particular, we indicate the state space and observations of it, the available actions, the reward, and the learning agent. These specifications have been implemented using the standard environment API proposed by OpenAI Gym [brockman2016openai]. This implementation was used to conduct all experiments contained in this study.
At a high level, the framework we propose is as follows. A reinforcement learning agent, executing on a classical computing resource, incrementally produces a quantum program for execution on a quantum resource, with the goal of preparing a quantum state which serves to solve some posed problem. In the examples we consider, the problem may be described by the specification of a ‘problem instance’ (e.g. a weighted graph) and a ‘reward function’ (detailed below).
This process is illustrated in Fig. 1. At the outset, the reward function for the problem and the graph to be evaluated under this reward function are specified. Given this information, the agent produces a quantum program by iterating the following steps: i) the agent chooses from some available quantum gates to append to the current program, ii) the updated program is evaluated on a quantum resource, perhaps several times, and iii) the results of these evaluations are used to compute a reward and an ‘observation’. The results of step (iii) are subsequently used in the decision criteria of step (i) of the next iteration, and so on. This interaction between the agent and quantum resource is repeated until the agent ‘wins’, wherein the reward exceeds some threshold, or ‘loses’, wherein the program length exceeds some threshold.
In the case of combinatorial optimization problems, one may naturally identify the reward of a quantum state as the cost Hamiltonian’s expectation value. More generally, any monotonic function of the expectation serves as an adequate surrogate. For the experiments which follow, we have found it convenient to rescale the cost Hamiltonian’s expectation to take values in the range 0 to 1. We identify the action space as a finite set of quantum gates, such as a discretized set of RZ and RY rotation gates. For the agent, we focus exclusively on the PPO (Proximal Policy Optimization, cf. [schulman2017proximal]) algorithm, applied to a shared actor-critic architecture.
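The interaction loop described above can be sketched as a small environment that follows the classic Gym `reset`/`step` convention without importing the library. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ToyProgramEnv`, the gate set (RY at multiples of π/2 plus CNOT), the thresholds, and the use of exact outcome probabilities as the observation (instead of sampled bitstrings) are all choices made here for brevity.

```python
import math
import random


def ry(theta):
    """2x2 matrix of a rotation by `theta` about the y-axis."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q (qubit 0 is the least significant bit)."""
    new, bit = list(state), 1 << q
    for i in range(len(state)):
        if not i & bit:
            a, b = state[i], state[i | bit]
            new[i] = gate[0][0] * a + gate[0][1] * b
            new[i | bit] = gate[1][0] * a + gate[1][1] * b
    return new


def apply_cnot(state, control, target):
    """Swap amplitudes that differ in the target bit when the control bit is 1."""
    new = list(state)
    for i in range(len(state)):
        if i & (1 << control):
            new[i ^ (1 << target)] = state[i]
    return new


class ToyProgramEnv:
    """Gym-style toy environment: append gates, receive a normalized MaxCut reward."""

    def __init__(self, w, max_steps=25, threshold=0.99):
        self.w, self.n = w, len(w)
        self.max_steps, self.threshold = max_steps, threshold
        qubits = range(self.n)
        # Hypothetical discrete action set: RY(k*pi/2) on each qubit, plus all CNOTs.
        self.actions = [("RY", q, k * math.pi / 2) for q in qubits for k in (1, 2, 3)]
        self.actions += [("CNOT", c, t) for c in qubits for t in qubits if c != t]
        self.costs = [self._cut(i) for i in range(2 ** self.n)]
        self.c_min, self.c_max = min(self.costs), max(self.costs)
        if self.c_max == self.c_min:
            raise ValueError("graph has no edges; reward cannot be normalized")

    def _cut(self, index):
        """Cut weight of the bitstring encoded by the basis-state index."""
        bits = [(index >> q) & 1 for q in range(self.n)]
        return sum(self.w[i][j] for i in range(self.n)
                   for j in range(i + 1, self.n) if bits[i] != bits[j])

    def _probabilities(self):
        return [abs(a) ** 2 for a in self.state]

    def reset(self):
        self.state = [0.0] * (2 ** self.n)
        self.state[0] = 1.0  # all qubits start in |0>
        self.program = []
        return self._probabilities()

    def step(self, action_index):
        name, a, b = self.actions[action_index]
        if name == "RY":
            self.state = apply_1q(self.state, ry(b), a)
        else:
            self.state = apply_cnot(self.state, a, b)
        self.program.append((name, a, b))
        probs = self._probabilities()
        expectation = sum(p * c for p, c in zip(probs, self.costs))
        reward = (expectation - self.c_min) / (self.c_max - self.c_min)
        # 'Win' when the reward passes the threshold, 'lose' at the length limit.
        done = reward >= self.threshold or len(self.program) >= self.max_steps
        return probs, reward, done, {}


if __name__ == "__main__":
    env = ToyProgramEnv([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
    rng = random.Random(0)
    obs, done = env.reset(), False
    while not done:
        # Placeholder policy: uniform random actions instead of a trained agent.
        obs, reward, done, _ = env.step(rng.randrange(len(env.actions)))
    print(len(env.program), round(reward, 3))
```

A learned policy would replace the uniform random choice in the loop with a sample from its action distribution conditioned on the observation.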
A reinforcement learning problem is formally specified as a Markov Decision Process (MDP), for which the goal of the learning agent is to find the optimal policy, i.e. the conditional probability $\pi^*(a \mid s)$ of applying a particular quantum gate (action $a$) given a particular representation of the qubit register (state $s$) that would maximize the expected (discounted) return $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$, without necessarily having a model of the environment $p(s', r \mid s, a)$. Defining the value of a state $s$ under a policy $\pi$ as
$$V_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right], \tag{1}$$
we identify the optimal policy a bit more concretely as $\pi^*$ such that $V_{\pi^*}(s) \ge V_\pi(s)$ for all $s$ and all policies $\pi$. In practice PPO will find some approximation $\pi(a \mid s; \theta)$ to the theoretical optimum as a function of some parameters $\theta$, which it will tune towards the optimum during the learning process.
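As a small illustration of the discounted return and state value discussed above, the sketch below computes the return of a recorded episode and a Monte Carlo estimate of a start-state value by averaging over episodes. The function names and the example rewards and discount factor are illustrative assumptions, not values from the study.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes: List[Sequence[float]], gamma: float) -> float:
    """Estimate the value of a common start state as the mean discounted return."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Hypothetical per-step rewards from two episodes that share a start state.
    episodes = [[0.2, 0.5, 0.9], [0.1, 0.4]]
    print(monte_carlo_value(episodes, gamma=0.5))
```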
The problem we consider here is more naturally modeled as a partially observed Markov Decision Process (POMDP), since quantum states are not themselves directly observable, and only their measurement outcomes are. While the action (quantum gate) that the agent chooses to carry out deterministically evolves the quantum state (in the absence of noise), the observations it receives from the measurement samples are in general not deterministic. For a single COP instance, the observation representation that the agent receives from the environment is given by a specification of the sampled bitstrings we receive from measurements, given a particular sequence of actions $a_1, \dots, a_t$ and a fixed starting state $|0\rangle^{\otimes n}$,
$$o_t = \left(b^{(1)}, b^{(2)}, \dots, b^{(M)}\right), \qquad b^{(m)} \in \{0,1\}^n, \tag{2}$$
such that $\Pr\left[b^{(m)} = b\right] = |\alpha_b|^2$, with $M$ the number of measurement shots. Note that $|\psi_t\rangle = \sum_b \alpha_b |b\rangle$, where $\alpha_b$ describes the complex amplitude of the computational basis state $|b\rangle$.
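The measurement-based observation described above can be illustrated with a short sampler that draws bitstrings with probability equal to the squared magnitude of each amplitude (the Born rule). The dictionary representation of the state and the function name are assumptions made for this sketch.

```python
import random
from typing import Dict, List


def sample_bitstrings(amplitudes: Dict[str, complex], shots: int,
                      rng: random.Random) -> List[str]:
    """Draw `shots` bitstrings with probability |amplitude|**2 (Born rule)."""
    bitstrings = list(amplitudes)
    probs = [abs(amplitudes[b]) ** 2 for b in bitstrings]
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("state is not normalized")
    return rng.choices(bitstrings, weights=probs, k=shots)


if __name__ == "__main__":
    # Two-qubit example: (|00> + i|11>) / sqrt(2). The relative phase i does
    # not change the statistics of computational-basis measurements.
    amp = 2 ** -0.5
    state = {"00": amp, "11": 1j * amp}
    observation = sample_bitstrings(state, shots=10, rng=random.Random(0))
    print(observation)
```

Because only the squared magnitudes enter the sampler, states that differ by phases alone yield identically distributed observations.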
In order to extract quantum circuits from the trained agent on unseen problem instances of a COP, we augment the state space with a representation of the COP problem instance itself. For example, in the case of MaxCut (see below), part of the state description is the graph whose maximum cut we seek. We train the agent over a collection of several such COP instances, forming the training set, and test its predictions against a collection of similar but different COP instances that the agent has not seen before.
In this section, we first outline the classes of combinatorial optimization problems which we consider. Following this, we describe several experiments on simulated and real quantum hardware.
We consider three problems of increasing generality.
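Let $G = (V, E)$ be a weighted graph with vertices $V$ and edges $E$. For convenience we assume $V = \{1, \dots, n\}$, with weights specified by an $n \times n$ real symmetric matrix $w$, where the weight $w_{ij}$ is nonzero if there is an edge between vertices $i$ and $j$. The maximum cut problem asks for a partition of $V$ into two subsets such that the total edge weight between them is maximized. Formally,
$$\text{MaxCut:} \qquad \max_{z \in \{-1, 1\}^n} \; \frac{1}{4} \sum_{i,j} w_{ij} \left(1 - z_i z_j\right).$$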
This problem is known to be NP-hard [karp1972reducibility], although a number of polynomial-time approximation algorithms exist. In this regard, it is known to be NP-hard to approximate MaxCut with ratio better than 16/17 ([arora1998proof]). The best known approximation ratio of 0.87856 is given by the semidefinite programming algorithm of Goemans and Williamson [goemans1995improved]. If the Unique Games Conjecture holds, this ratio is optimal [khot2007optimal].
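For small graphs the MaxCut objective can be checked directly by enumeration. The sketch below evaluates the cut weight of a labeling with entries in {-1, +1} and finds the best labeling by brute force. It is meant only to make the objective concrete (the enumeration is exponential in the number of vertices), and the function names are illustrative.

```python
from itertools import product
from typing import Sequence, Tuple


def cut_value(w: Sequence[Sequence[float]], z: Sequence[int]) -> float:
    """Total weight of edges whose endpoints receive different labels in {-1, +1}."""
    n = len(w)
    return sum(w[i][j] * (1 - z[i] * z[j]) / 2
               for i in range(n) for j in range(i + 1, n))


def brute_force_maxcut(w: Sequence[Sequence[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Enumerate all 2**n labelings and return the best cut value and labeling."""
    n = len(w)
    best = max(product((-1, 1), repeat=n), key=lambda z: cut_value(w, z))
    return cut_value(w, best), best


if __name__ == "__main__":
    # Unit-weight triangle: any 2-versus-1 split cuts two of the three edges.
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    print(brute_force_maxcut(triangle))
```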
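Note that solving MaxCut is equivalent to maximizing the slightly simplified expression $-\sum_{i,j} w_{ij} z_i z_j$, where the coefficients $-w_{ij}$ of the products $z_i z_j$ are never positive for nonnegative edge weights. A natural generalization is to allow mixed signs. The resulting problem, also NP-hard, is
$$\text{MaxQP:} \qquad \max_{z \in \{-1, 1\}^n} \; \sum_{i,j} w_{ij} z_i z_j,$$
where $w$ is a real symmetric matrix with null diagonal entries.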
It can be shown that the optimal value in MaxQP is non-negative, and a randomized $\Omega(1/\log n)$-approximation algorithm is given in [charikar2004maximizing] (see also [nesterov1998global], [nemirovski1999maximization], [megretski2001relaxations]). It is NP-hard to approximate with a ratio better than 11/13, and quasi-NP-hard to approximate with a ratio better than $1/\log^{\gamma} n$ for a constant $\gamma > 0$ [arora2005non].
Although of great theoretical interest, MaxCut and MaxQP do not necessarily offer the most convenient form in which to embed practical problems. In this regard, one may consider a slight generalization of MaxQP where the quadratic is augmented with an affine term. In keeping with the conventions of the literature, we pose this as the QUBO ("quadratic unconstrained binary optimization") problem, given by
$$\text{QUBO:} \qquad \max_{x \in \{0, 1\}^n} \; \sum_{i,j} w_{ij} x_i x_j,$$
where $w$ is a real symmetric matrix. Since $x_i^2 = x_i$ for binary variables, the diagonal entries of $w$ supply the linear terms. The above formulation is sometimes abbreviated as UQBP ("unconstrained quadratic binary program"), see [kochenberger2014unconstrained] for a broad survey, and [dunning2018works] for a recent study of the empirical performance of various heuristic algorithms.
We remark that under the transformation $z_i = 2x_i - 1$ we may embed instances of MaxCut and MaxQP as instances of QUBO.
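The embedding of MaxQP into QUBO can be verified numerically. The sketch below assumes the MaxQP objective is the double sum of w[i][j] * z[i] * z[j] over all i and j with z in {-1, +1}, and substitutes z = 2x - 1 with x in {0, 1}. Using x_i**2 == x_i, the linear terms are folded into the diagonal of the QUBO matrix, leaving a constant offset. The function names and the example matrix are illustrative.

```python
from itertools import product


def maxqp_value(w, z):
    """Sum over all i, j of w[i][j] * z[i] * z[j], with z_i in {-1, +1}."""
    n = len(w)
    return sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))


def qubo_value(q, x):
    """Sum over all i, j of q[i][j] * x[i] * x[j], with x_i in {0, 1}."""
    n = len(q)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def maxqp_to_qubo(w):
    """Return (q, offset) such that maxqp_value(w, 2x - 1) == qubo_value(q, x) + offset."""
    n = len(w)
    q = [[4.0 * w[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        # Linear term -4 * (row sum) * x_i, stored on the diagonal since x_i**2 == x_i.
        q[i][i] -= 4.0 * sum(w[i])
    offset = float(sum(sum(row) for row in w))
    return q, offset


if __name__ == "__main__":
    w = [[0.0, 1.0, -2.0], [1.0, 0.0, 0.5], [-2.0, 0.5, 0.0]]
    q, offset = maxqp_to_qubo(w)
    for x in product((0, 1), repeat=3):
        z = [2 * xi - 1 for xi in x]
        assert abs(maxqp_value(w, z) - (qubo_value(q, x) + offset)) < 1e-12
    print("embedding verified on all 8 assignments")
```

The constant offset does not affect which assignment is optimal, so it can be dropped when only the maximizer is of interest.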
In the context of quantum computing it is standard to formulate the above optimization problems in the language of operators and expectation values, so that the task is to maximize the expectation of a certain ‘problem Hamiltonian’ with respect to an $n$-qubit quantum state. When the cost function is expressed as a polynomial in the variables $z_i$ (e.g. as in MaxQP), the corresponding Hamiltonian follows immediately (e.g. $H = \sum_{i,j} w_{ij} Z_i Z_j$, where $Z_i Z_j$ denotes the tensor product of Pauli $Z$-operators acting on the $i$th and $j$th qubits respectively).
Having fixed the identification of the state, observation, action space, reward, and learning agent, we apply the aforementioned framework in order to solve random instances of the above three problems. In all cases we fix the number of variables at $n = 10$. For MaxCut, we consider Erdős-Rényi graphs with a fixed edge probability. If there is an edge from vertex $i$ to vertex $j$, we choose an independent random weight $w_{ij}$. Otherwise, $w_{ij} = 0$. For MaxQP, we consider random matrices $w$ formed by drawing $w_{ij}$ independently at random if $i < j$, setting $w_{ii} = 0$, and imposing that $w$ be symmetric. For QUBO, we consider random matrices $w$ formed by drawing $w_{ij}$ independently at random for $i \le j$, and imposing that $w$ be symmetric. For each of the three problems, we generated 50,000 random instances, which were then shuffled into a single dataset for training. Additional independent validation and test sets were created for each problem type. In total, the validation set contained 12,000 unique instances, while the test set contained 3,000 unique instances.
Following a modest amount of hyperparameter optimization (further detailed in the supplementary information), we trained a PPO agent for approximately 1,700,000 episodes (for a total of approximately 20 million individual steps) on a quantum simulator (Rigetti’s "Quantum Virtual Machine", henceforth abbreviated as QVM [smith2016practical]). At the end of the training, we selected the model with the best performance on the validation set, as indicated by the highest average episode score (see below). We refer to this model as the ‘QVM-trained’ agent. The QVM-trained agent was then used to initialize training of the ‘QPU-trained’ agent on the Rigetti Aspen QPU [didier2018ac, caldwell2018parametrically, nersisyan2019manufacturing]. This model was trained for 150,000 episodes.
We also ran (single-step) QAOA [farhi2014quantum] on each of the test sets as a benchmark (see the supplementary information for details). We chose QAOA since it is a widely known and fairly generic algorithm for solving combinatorial optimization problems on quantum computers. Although QAOA generally performs better the larger the value of $p$, the number of alternating steps in the algorithm, we focus on $p = 1$ since for the 10-variable problems under consideration this limits the program length to approximately 100. A larger number of steps would cause the output to have significant contributions from noise, which we would like to avoid in a comparative benchmark. We additionally benchmark performance with respect to an ‘untrained’ agent. This agent is subject to the same environment as the trained agent, but has not been optimized through a training process. This means that, for all possible observations, the untrained agent samples from the action set with uniform probability.
With respect to a fixed problem instance and a fixed policy, one may produce an episode, described by a sequence of state-action-reward triplets. The reward for a single action is formally the expectation value of the problem Hamiltonian in the state produced by the program sequence thus far, normalized to be in the range 0 to 1, i.e.
$$r_t = \frac{\langle \psi_t | H | \psi_t \rangle - H_{\min}}{H_{\max} - H_{\min}},$$
where $|\psi_t\rangle$ is the state prepared by the first $t$ actions, and $H_{\min}$ and $H_{\max}$ denote the minimum and maximum attainable expectation values. We remark that this normalization is convenient for our subsequent data analysis, and enables the possibility of early stopping in training, but is not strictly speaking necessary for the use of our proposed method (and rightfully so, as the maximum value used in the normalization is precisely what we set out to compute). On a physical device, the reward $r_t$ is estimated by repeated preparation and measurement for some number of measurement shots (cf. the supplementary information).
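The shot-based reward estimate described above can be sketched as follows for a cost function that is diagonal in the computational basis. The extreme values are found here by brute-force enumeration, which is feasible only for small instances. The function name, the tuple representation of bitstrings, and the two-vertex example are assumptions for illustration.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Bits = Tuple[int, ...]


def estimate_reward(samples: Sequence[Bits], cost: Callable[[Bits], float],
                    n_bits: int) -> float:
    """Mean sampled cost, rescaled to [0, 1] by the brute-force min and max cost."""
    values = [cost(bits) for bits in product((0, 1), repeat=n_bits)]
    c_min, c_max = min(values), max(values)
    if c_max == c_min:
        raise ValueError("cost function is constant; cannot normalize")
    mean_cost = sum(cost(s) for s in samples) / len(samples)
    return (mean_cost - c_min) / (c_max - c_min)


if __name__ == "__main__":
    # Hypothetical two-vertex MaxCut with a single unit-weight edge.
    def cut(bits: Bits) -> float:
        return 1.0 if bits[0] != bits[1] else 0.0

    shots = [(0, 1), (0, 0), (1, 0), (1, 1)]
    print(estimate_reward(shots, cut, n_bits=2))
```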
With respect to a complete episode of $T$ actions, the episode score metric is given by ES $= \max_{1 \le t \le T} r_t$. For the QAOA ansatz, which is intended to run to completion, the QAOA episode score is given by ES $= r_T$, where $T$ represents the final ansatz instruction.
Comparison of the performance on the test problems, for each COP, for each surveyed method, on each quantum resource is illustrated in Fig. 2. In each circumstance, the QVM-trained agent produces a score distribution with the highest expected episode score. In moving from the QVM to the QPU, episode score distributions were found to increase in variance, and in the case of the QAOA, decrease in expected value. For the QVM-trained and untrained agents, expected test performance remains comparable between the QVM and QPU. Increased performance of the trained agent compared to the untrained agent, in all circumstances, suggests that the training process is successful in biasing the action distribution to yield more promising, observation-dependent, actions. Increased performance of the trained agent compared to the QAOA, in all circumstances, suggests that the QAOA ansatz is a less optimal circuit ansatz compared to those generated by the trained agent. Additionally, we note that the trained agent is less sensitive to COP-type compared to the QAOA, resulting in test distributions of similar character for each surveyed problem. Interestingly, the untrained agent yields better performance on the QPU compared to the QAOA. We speculate that this is in part due to the structure, length, and noise sensitivity of the QAOA ansatz.
Given the episode score test distributions shown in Fig. 2, one may compute the number of instructions contained within the programs generating each episode score. We label this instruction-metric ‘# instructions to reach episode score’, and depict this data in Fig. 3. Note that when executing on the QVM, there is no noise, and the program is not compiled. The QVM instruction-metric therefore describes, for each episode score, the number of corresponding (perfect) gates specified by the agent (in other words, $\arg\max_t r_t$). In the case of the QPU, the agent program is compiled to the chip topology of the Rigetti Aspen QPU, and the native gate set given by CZ, RZ($\theta$), and RX($k\pi/2$), using the Rigetti Quilc compiler. The QPU instruction-metric therefore describes, for each episode score, the number of compiled gates resulting from a sequence of actions specified by the agent.
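The two metrics can be computed from a list of per-step rewards. The sketch below assumes the episode score is the maximum per-step reward, that the instruction-metric on the simulator is the (1-based) position of the first step attaining it, and that the QAOA score is the reward after the final instruction. These helper names are illustrative.

```python
from typing import Sequence


def episode_score(rewards: Sequence[float]) -> float:
    """Best reward seen at any point in the episode."""
    return max(rewards)


def instructions_to_reach_score(rewards: Sequence[float]) -> int:
    """Number of instructions applied when the episode score is first attained."""
    return list(rewards).index(max(rewards)) + 1


def qaoa_episode_score(rewards: Sequence[float]) -> float:
    """Reward after the final instruction, for an ansatz that runs to completion."""
    return rewards[-1]


if __name__ == "__main__":
    # Hypothetical rewards after each of four appended gates.
    rewards = [0.2, 0.7, 0.7, 0.5]
    print(episode_score(rewards), instructions_to_reach_score(rewards),
          qaoa_episode_score(rewards))
```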
By design, the agents were constrained to 25 uncompiled instructions. For the QAOA, the uncompiled program length is determined by the size and connectivity of the graph of interest to the QAOA ansatz. Through training, for all problems, the expected length of the trained agent programs is decreased compared to those of the untrained agent. Although compiled, the number of instructions required to reach each episode score is reduced for the trained agent on the QPU compared to the QVM. Note that on the QVM, there is no noise. Additionally, there is no penalty for generating longer programs until the program length constraint of 25 uncompiled gates. On the QPU, the agent experiences noise that likely increases as the program increases in length and complexity. This could create an incentive for the agent to generate shorter programs on the QPU, relative to the QVM. In all cases, the untrained and trained agents yield shorter programs compared to the QAOA. On the QPU, the QAOA programs compiled into more than instructions. Additionally, the later gates of the QAOA are given by the more expensive entangling gates. If too much decoherence is experienced by the later stages of the program, one could expect a strong degradation in performance, as evidenced by the decreased episode score between the QVM and QPU for the QAOA.
Recall that the QPU-trained model was initialized using the parameters of the QVM-trained model, and subsequently trained for an additional 150,000 episodes on the QPU. In order to investigate the impact of training on a real quantum resource, both the QVM-trained model and QPU-trained model were tested on the QPU with respect to both the reward-metric and the instruction-metric. This is shown in Fig. 4.
Inference on both models, on the QPU, yields comparable reward-metrics. For each COP, the episode score test distributions have similar expected values and variances. Although generating comparable reward-metric statistics, the instruction-metrics look much different. It is observed that through QPU-training, the resultant instruction-metric values become much shorter. In this context, device noise acts as an indirect incentive to the agent. This indirect incentive promotes the generation of programs that remain short following compilation. This is particularly evident for the MaxCut problems. In this case, testing the QVM-trained model on the QPU yields compiled programs of lengths, in some cases, exceeding 100 instructions. However, testing the QPU-trained model on the QPU yields compiled programs of lengths shorter than 50 instructions.
Broad statistical analysis of agent-generated programs was accomplished by computing the frequency of each action in the programs generated by the experiments detailed in Sec. 3.2. These histograms are shown in Fig. 5. Note that the data in Fig. 2, Fig. 3, Fig. 4, and Fig. 5 are all consistent, resulting from different analyses of the same set of experiments over a precomputed test dataset. Additionally, examples of randomly sampled programs are shown in the supplementary information.
The agent demonstrates a strong preference for the CNOT, RX(), and RY() gates, occurring approximately equally for all problems, for all models, on both types of quantum resources. Interestingly, the RZ() gate is not as consistently prevalent across the different experiments, occurring less frequently on the QVM. On the QVM, the remaining gates frequencies are distributed somewhat evenly. On the QPU, the RZ() gate becomes more common, even in the case of the QVM-trained model. Through training on the QPU, the QPU-trained agent appears to less frequently sample non-native RY and RX gates. In contrast, the native RZ() gates show comparable frequencies to the QVM-trained agent. This parallels the instruction-metric results shown in Fig. 4. It appears that QPU-training aims to reduce compiled program length, which amounts to diminished sampling of non-native gates (which are compiled into longer sequences of native gates). To further develop these observations, more extensive and robust analysis is required.
In the case of the COPs we have considered above with our simple (discounted) reward structure, we hypothesize that the theoretical limit of the optimal program would be a series of X gates. This is so because the Hamiltonians (reward functions) we consider are all diagonal in the computational basis, and their solutions can be specified as some bitstring, i.e. some computational basis element, and not necessarily a linear combination of such basis elements. The gates I and X are sufficient to produce such states starting from the $|0\rangle^{\otimes n}$ state. Hence, we expect the optimal policy to disregard any phase information in the quantum states. Furthermore, recall that our defined representation of the observation space is equivalent to the space of probabilistic bits. For example, if the goal was to maximize , it is sufficient to produce any of the states , whichever one is cheapest to produce given a particular gateset. This is another reason we expect that the optimal policy should disregard any phase information. Although more extensive experimental and theoretical analysis is required, this hypothesis is consistent with the gate frequencies observed during inference on the test set (as shown in Fig. 5).
For such problems, where the Hamiltonian is diagonal in the computational basis, the shortest sequence of gates we can imagine to produce the solution bitstring is a series of X gates on the appropriate qubits. A rotation of any angle other than $\pi$ about the x-axis would produce a less than optimal value for the corresponding expectation term, and therefore the reward, so that we cannot use the policy improvement theorem (Ref. [suttonbarto]) to improve upon this policy.
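To make the role of the rotation angle concrete, consider a single qubit whose optimal bit value is 1 (a simplifying assumption made only for this illustration). Applying $R_X(\theta) = e^{-i\theta X/2}$ to $|0\rangle$ gives

$$R_X(\theta)|0\rangle = \cos\tfrac{\theta}{2}\,|0\rangle - i \sin\tfrac{\theta}{2}\,|1\rangle, \qquad \Pr[1] = \sin^2\tfrac{\theta}{2}, \qquad \langle Z \rangle = \cos\theta,$$

so the probability of measuring the optimal bit reaches 1 only at $\theta = \pi$ (mod $2\pi$), where $R_X(\pi) = -iX$ acts as an X gate up to a global phase. Any other angle leaves nonzero weight on the suboptimal outcome and therefore lowers the expected reward.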
Similarly, our choice of observation space and policy architecture reflect the correspondence between computational basis states and candidate solutions of the optimization problems. In particular, for more general problems (e.g. as may arise in quantum chemistry) it may be necessary for the observation to consist of measurements with respect to several bases; the policy in such a case should be modified accordingly.
Overall we find reinforcement learning to be an effective method for quantum programming for combinatorial optimization for the problems and datasets studied here. For the evaluated problems, the trained agents were observed to generate very competitive results compared to an untrained agent and compared to QAOA on both simulated and physical quantum resources.
Extensions of this work include, but are not limited to: refinements to the learning environment, training and comparison of different agent types, as well as an investigation of how the training time scales as the size of the problem grows, and application of this learning method to more programming tasks.
Modifications to the learning environment could include deeper investigation of state and observation representations. The quantum and problem observations are of a decidedly distinct nature, and thus it is natural for a policy to treat them differently. There has been much recent attention to vectorizations or feature representations of graphs and similar discrete structures, relevant to processing of a given combinatorial optimization problem. As our focus is primarily on the general method of reinforcement learning for quantum program synthesis, we have opted to not consider any specialized handling of the state observation representation and leave that as an avenue for future work. Modifications to the learning environment could also include larger or continuous action spaces, as well as modifications of the reward function.
Additionally, we chose to focus exclusively on PPO learning agents. This is in part due to the breadth and multicomponent nature involved in iterating simultaneously on the learning environment as well as on the learning agent. The robustness of the PPO algorithm allowed for fast and favorable training without extensive hyperparameter optimization. Performance of alternative agents such as deep Q-learning is of interest.
With respect to scalability, if we limit the architecture of the shared actor-critic network to never grow more than polynomially in the size of the input problem, then the computations this network carries out in order to produce an output action should also grow at most polynomially. We may ask how many training steps we require to reach a certain validation accuracy on some held out set of problem instances. For a sufficiently high validation accuracy, the trained network may serve as a useful heuristic if the training time grows at most polynomially in the size of the problem. It is not unreasonable to expect this to happen, since neither the gate set we have considered above nor the trainable parameters of the network grow super-polynomially. However, a systematic investigation of how well this framework scales with the size of the problem input is currently lacking, and would serve as an important barometer of how well this approach would work in practice.
Here we have chosen to focus on COPs, for which the specified Hamiltonian is diagonal in the computational basis. We remain curious about the extension of this work to different domains. We speculate that extending our analysis to other problems, such as those found in quantum simulation settings where we expect to see Hamiltonians that are non-diagonal in the computational basis, would yield theoretically optimal policies that use non-Clifford operations. We also imagine that by changing the reward structure, we could retool this procedure to optimize not just for the shortest sequence of gates from some given gateset, but also to preferentially utilize quantum resources over classical ones.
The authors wish to thank Amy Brown and the entire Rigetti hardware team for QPU support. We also would like to give mention to Nima Alidoust and David Rivas for their technical and logistical guidance, and Robert Smith, Joshua Combes, Eric Peterson, Marcus da Silva, Kirby Linvill, and Kung-Chuan Hsu for many insightful conversations.