## 1 Introduction

Successful reinforcement learning (RL) algorithms for challenging tasks come at the cost of computationally expensive training silver2016. Since quantum computers can perform certain complex tasks much faster than classical machines grover1996; shor1999, a hope is that quantum computers could also enable more efficient RL algorithms. Today, there is theoretical and practical evidence that quantum interaction between agent and environment can indeed reduce the time it takes for an agent to learn (Section 2).

Herein, we describe an optimization-based quantum algorithm for RL that makes use of a direct quantum mechanical realization of a finite MDP, in which agent and environment are modelled by unitary operators and exchange states, actions and rewards in superposition (Section 3.1).
Using this model, we show in detail how to use amplitude estimation to estimate the value function of a policy in a finite MDP with finite horizon (Section 3.2).
This quantum policy evaluation (QPE) algorithm can be compared to a classical Monte Carlo (MC) method, i.e., sampling from agent-environment interactions to estimate the value function.
QPE uses *quantum samples* (qsamples) instead of classical ones and provides a quantum advantage over any possible classical MC approach; it needs quadratically fewer qsamples to estimate the value function of a policy up to a given precision.
We embed QPE into an optimization scheme that uses repeated Grover-like searches of the set of policies to find a decision rule with -optimal performance (Section 3.3).
The resulting routine shares similarities with the policy iteration method from dynamic programming (DP) sutton2018 and, thus, we call it *quantum (approximate) policy iteration*. A major contribution of our work is a detailed description of how to implement quantum policy iteration for a two-armed bandit MDP at the level of standard single- and multi-qubit gates on a digital quantum computer. Simulations of this implementation numerically confirm the mathematically proven quantum advantage of QPE (Appendix B) and illustrate the behavior of quantum policy iteration (Section 4).

## 2 Related Work

Existing quantum RL approaches can be roughly divided into two categories: quantum enhanced agents that learn in classical environments, and scenarios where the agent and environment can interact quantum mechanically.

An example of the first category is *quantum projective simulation* briegel2012, where an agent makes decisions using a quantum random walk on a learned graph that represents its internal memory. In this approach, the special properties of quantum walks allow for fast decision making and thus less active learning time than for classical agents. Furthermore, various quantum-enhanced agents have been described that learn in classical environments by replacing components of classical learning algorithms with quantum counterparts. There are approaches for *deep Q-learning* and *policy gradient* methods, where classical artificial neural networks are replaced by analogous *variational quantum circuits* (VQCs) with trainable parameters chen2020; lockwood2020; moll2021; skolik2021; jerbi2021.

Theoretical work by dunjko2015 indicates that the possibility of quantum interaction between agent and environment can be leveraged to design quantum-enhanced agents that provably outperform the best possible classical learner in certain environments dunjko2015; dunjko2016; dunjko2017. One such approach is related to our ideas and uses Grover searches in deterministic environments to find sequences of actions that lead to high rewards dunjko2016. The possibility of applying amplitude estimation, similar to QPE, to stochastic environments is also discussed there. Further extensions were proposed in dunjko2017advances and hamann2020. Recently, Wang et al. used tools similar to those in this work (i.e., amplitude estimation and quantum optimization) to obtain quantum speedups for subroutines of otherwise classical algorithms based on modern DP methods wang2021. Their methods require access to a generative quantum model of the environment, which is similar to the environment operator E from Section 3.1.

In contrast to the methods of wang2021, our novel quantum (approximate) policy iteration is not a quantum-enhanced version of an existing classical method. It also differs from the ideas of dunjko2016, as it directly searches for the optimal policy instead of looking for rewarded actions that are then used to improve the decision rule.

## 3 Quantum (Approximate) Policy Iteration

### 3.1 Quantum Mechanical Realization of a Finite Markov Decision Process

Consider an MDP with finitely many states , actions and rewards . We can identify these sets with finite-dimensional complex Hilbert spaces , and , which we call *state, action* and *reward spaces.* Formally, we use maps from the sets to the corresponding spaces, i.e.,

(1) |

In order to be able to distinguish quantum states, actions and rewards we require that

(2) |

holds for all pairs , and . On a digital quantum computer, we can, for example, arbitrarily enumerate all states and actions and use strings of qubits as quantum representatives, i.e., and . As the rewards are real numbers, we can, for example, approximate them with fixed-point binaries, which can again be represented as strings of qubits.
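As a concrete illustration of the reward encoding, the following sketch converts a real number in [0, 1) to an n-bit fixed-point binary string and back. The function names and the particular encoding convention are our own; the paper leaves these details open.

```python
def encode_fixed_point(x, n_bits):
    """Encode x in [0, 1) as an n-bit fixed-point binary string
    (a sketch; one of many possible conventions)."""
    assert 0.0 <= x < 1.0
    bits = ""
    for _ in range(n_bits):
        x *= 2
        bit = int(x)       # next binary digit after the radix point
        bits += str(bit)
        x -= bit
    return bits

def decode_fixed_point(bits):
    """Recover the (truncated) real value from the bit string."""
    return sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits))
```

Each bit string then corresponds to one computational basis state of the reward register.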

Agent and environment are modelled by unitary operators that act on , and . For any policy , we define the *policy operator* as a unitary operator which satisfies

(3) |

for all states . The state is an arbitrary reference state in and each amplitude has to satisfy . The policy operator can be chosen as any unitary that satisfies these conditions. To see why such a unitary operator indeed exists, note that (3) describes a bijection between two orthonormal bases (ONBs) of two subspaces of . We can extend both ONBs to ONBs of the full space and define an arbitrary bijection between the newly added basis states while maintaining (3) on the original ones. The linear extension of this assignment is then unitary by construction. Using an analogous argument, we define the *environment operator* as any unitary operator that satisfies

(4) |

with amplitudes that satisfy for all state-action pairs , . A single interaction between agent and environment is modelled by the *step operator* .
For any state , it holds that

(5) |

where .
We use S to construct a unitary operator that prepares a quantum state which represents the distribution of all trajectories with a fixed finite horizon . We call it *MDP operator* and denote it as M. While the step operator acts on , the MDP operator is a unitary on the *trajectory space* which is large enough to store quantum agent-environment interactions. The states in this space represent trajectories of length and are of type

(6) |

We define the MDP operator as

(7) |

where denotes a local version of S that acts on the -th subsystem of . It holds that

(8) |

where , which is the classical probability of trajectory . This can be seen by inductive application of (5). If we measure the state from (8), we observe a trajectory (or, more precisely, its quantum analogue ) with probability . Therefore, this state is a quantum version of the distribution of all trajectories and we refer to it as a qsample of the trajectories.

For QPE, we need the distribution of the returns in order to approximate the value function. With this in mind, we use the qsample of the trajectories and calculate for each trajectory state the associated return, defined as , where is a fixed discount factor. As the reward states of are qubit binary encodings of (real) numbers, we can use quantum arithmetic to calculate their discounted sum. Formally, we use a unitary *return operator* G that maps

(9) |

where is a reference state in the *return space* , which represents a system of sufficiently many qubits to encode all returns. Ruiz-Perez and Garcia-Escartin show an explicit construction of such an operator which performs the weighted addition in the Fourier domain ruiz2017. By letting G act on all reward subsystems and the return component of , we can define which satisfies

(10) |

As all states are (by construction) orthonormal, it follows that, if we measure the second subsystem of , we receive an individual return with the classical probability determined by the MDP and the agent’s policy. can be thought of as a qsample of the return.
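Classically, the quantity that G computes coherently for every trajectory in superposition is just the discounted reward sum. A minimal sketch (the helper is our own, for illustration):

```python
def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * r_t of a single trajectory;
    the return operator G evaluates this arithmetic coherently for all
    trajectories in the qsample at once."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

Measuring the return register of the qsample then yields exactly such a value, with the probability induced by the MDP dynamics and the policy.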

### 3.2 Quantum (Approximate) Policy Evaluation

Consider the problem of (approximately) evaluating the value function of a policy for some finite horizon . The value function is defined pointwise via

(11) |

where is the initial state, and the expectation is taken with respect to the distribution of all trajectories of length .
Even if we assume perfect knowledge of the MDP dynamics , evaluating exactly is usually infeasible: the complexity of directly calculating the expectation grows exponentially in the horizon . Techniques from DP such as *iterative policy evaluation* reduce the computational complexity but come at the cost of high memory consumption sutton2018.
An alternative approach is to use an MC method to estimate . The arguably most straightforward MC algorithm for policy evaluation is to collect a dataset of trajectories and to average the returns. This requires the possibility to sample from the distribution of the trajectories.
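The classical MC baseline described above can be sketched as follows; `sample_rewards` is a hypothetical callable that runs one agent-environment episode and returns the list of collected rewards:

```python
def mc_policy_value(sample_rewards, n_samples, gamma=1.0):
    """Plain Monte Carlo policy evaluation: sample trajectories and
    average their discounted returns. This is the baseline against
    which QPE's qsample complexity is compared."""
    total = 0.0
    for _ in range(n_samples):
        rewards = sample_rewards()  # one episode of the MDP
        total += sum(gamma ** t * r for t, r in enumerate(rewards))
    return total / n_samples
```

The estimate converges at the usual MC rate, i.e., the error shrinks with the inverse square root of the number of samples.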

montanaro2015 developed a quantum algorithm to estimate the expectation of a random variable using qsamples. The method is quadratically more sample efficient than the best possible classical MC routine, and it forms the foundation of QPE. To approximate the expected return, we use qsamples of the returns, to which we have access via the operator from (10). The first step towards QPE is to note that we can encode as the amplitude of a basis vector in a quantum superposition. To this end, assume that we know a lower bound and an upper bound on all returns. Using an affine function defined by , we construct a unitary operator that maps

(12) |

where is a binary (qubit) representation of a real number and the second subsystem is a single qubit. If we now apply to the return component of from (10) and an additional qubit, we get

(13) |

where

(14) |

The state contains all the states in which the last qubit is in state and is therefore orthogonal to . The amplitude , which is the norm of the (non-normalized) state on the right-hand side of (14), satisfies

(15) |

By affinity of , it holds that

(16) |

Therefore, approximately evaluating the value function reduces to estimating , for which we can use an *amplitude estimation* algorithm.
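On our reading of (15) and (16), the affine rescaling and its inversion look as follows (helper names are ours):

```python
def phi(g, g_min, g_max):
    """Affine map sending a return in [g_min, g_max] to [0, 1]."""
    return (g - g_min) / (g_max - g_min)

def value_from_amplitude(a, g_min, g_max):
    """Invert the encoding: the squared amplitude a^2 estimates the
    expectation of phi(G), so the value is recovered by affinity as
    V = g_min + a^2 * (g_max - g_min)."""
    return g_min + a ** 2 * (g_max - g_min)
```

Amplitude estimation supplies an approximation of a, from which the value estimate follows by this classical post-processing step.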

One way to estimate (or any amplitude in general) is to encode it via the phases of two eigenvalues of a unitary operator and to estimate them using *phase estimation* brassard2002. Following the original construction of brassard2002, we define this unitary as

(17) |

where . In this expression, acts on the return subsystem and the ancillary qubit. The operator is a *phase oracle* that, considering the basis states, flips the phase of a state precisely when all components (in this case the trajectory and the return components as well as the additional qubit) are in the corresponding ground states. The Z operator corresponds to a Pauli-Z gate, and therefore the last operator is also a phase oracle, which flips the phase of a state precisely when the ancillary qubit is in state . brassard2002 showed that can be decomposed as
where and are orthonormal, and , which means that are eigenstates of . The phase is the unique value in that satisfies
Therefore, we can use a phase estimation algorithm to estimate and . The classical phase estimation routine uses a system of

(18) |

qubits where and are arbitrary. The basis states of these qubits are interpreted as binary numbers. Phase estimation exploits the fact that are eigenstates of to bring the qubits into a joint state which, when measured, satisfies

(19) |

i.e., the distribution of is concentrated at and . In case , we get the same result with total probability at least . For more details, please refer to Appendix A.
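In the standard phase estimation construction (as analyzed in cleve1998), the register size in (18) combines the desired precision with a bound on the failure probability. A sketch of that textbook formula, assuming the usual form since the exact expression is not reproduced here:

```python
import math

def phase_estimation_qubits(n, delta):
    """Textbook phase-estimation register size: n bits of precision with
    failure probability at most delta (Cleve et al.-style bound).
    Assumes the standard formula n + ceil(log2(2 + 1/(2*delta)))."""
    return n + math.ceil(math.log2(2 + 1 / (2 * delta)))
```

Shrinking the failure probability is cheap: each extra qubit roughly halves delta, while the precision bits n dominate the register size.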

As and , we define the QPE approximation of for any realization of as

(20) |

The approximation error of given in (19) can be translated to an approximation error of the value function

(21) |

where the error is (as shown in Appendix B) given by

(22) |

Phase estimation uses applications of and (via ) to achieve this error bound (cf. Figure A.1). Each such application corresponds to the collection of one qsample. According to the optimality results on MC methods due to dagum2000, the best possible classical MC-type algorithm to estimate via averaging has sample complexity in . Therefore, QPE yields a quadratic reduction of the sample complexity w.r.t. the approximation error .
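Dropping constants, the comparison reads: classical MC averaging needs on the order of 1/eps^2 samples to reach error eps, while QPE needs on the order of 1/eps qsamples. A toy helper (ours) that makes the scaling concrete:

```python
def mc_vs_qpe_samples(eps):
    """Asymptotic (q)sample counts to reach approximation error eps,
    constants dropped: Theta(1/eps^2) for classical MC averaging
    versus O(1/eps) for QPE."""
    return {"classical": 1 / eps ** 2, "qpe": 1 / eps}
```

Halving the target error thus quadruples the classical sample count but only doubles the qsample count.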

### 3.3 Quantum Policy Improvement

The task of improving a given policy is to find another policy such that holds for all and that strict inequality holds for at least one . Once we know the value function of , such a can be explicitly constructed via *greedification*. In this section, we derive *quantum policy improvement* (QPI) which instead uses a Grover search over a set of policies to find an improved decision rule. For simplicity, we assume from now on that the MDP always starts in some fixed initial state . In this case, instead of using the whole value function, we can restrict ourselves to the *value* of policy which, under abuse of notation, we define as .

QPI uses Grover search on QPE results to find an improved decision rule. For a fixed policy , QPE can be summarized as one unitary QPE (consisting of the phase estimation circuit from Figure A.1 without the measurements) that maps . Now consider the finite set that contains all deterministic policies. By mapping each onto a member of an ONB of some suitable Hilbert space , we can construct a unitary, -controlled version of QPE that satisfies

(23) |

for all policies . Using these operators, we define another unitary that prepares the search state for QPI Grover search. Consider

(24) |

where is a Hadamard transform on . By construction, this operator maps

(25) |

As the states encode the value of policy , we can define a suitable oracle and use it to amplify the amplitudes of those policies that are likely to yield higher returns. For a policy and an approximation of its value, consider a phase oracle that satisfies

(26) |

Using this oracle, we define the QPI Grover operator as

(27) |

According to the principle of *amplitude amplification*, which underlies Grover search and was described in detail by brassard2002, applying this operator a certain number of times to the search state prepared by amplifies the amplitudes of all states whose component satisfies the oracle condition . As a result, the probability of measuring a policy that performs better than (up to an error of ) is increased. Due to the QPE failure probability of , the procedure may also amplify amplitudes of policies that do not yield an improvement (within ), as QPE may overestimate their performance. However, the probability of such errors can be limited by choosing small values for .

For the amplitude amplification to work, we need to specify the number of *Grover rotations*, i.e., the number of times we apply , which determines how the amplitudes of the desired states are scaled. Although there is a theoretically optimal number of Grover rotations, this value is inaccessible in QPI as it would require knowledge of the probability of obtaining a desired state when measuring the search state brassard2002. To circumvent this problem, we use the *exponential Grover search* strategy introduced by boyer1998: We initialize a parameter . In each iteration, we uniformly sample the Grover rotations from and measure the resulting state. If the result corresponds to an improvement, we reset . Otherwise, we overwrite for some which stays the same over all iterations.
For a theoretical justification of this technique, please refer to the original work of boyer1998.
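The exponential search schedule can be sketched as follows; `measure_after` and `improves` are hypothetical callbacks standing in for the Grover circuit measurement and the comparison against the current best value:

```python
import math
import random

def exponential_grover_search(measure_after, improves, c=1.2, patience=20):
    """Sketch of the boyer1998 exponential search schedule used in QPI:
    sample the rotation count uniformly from {0, ..., ceil(m) - 1},
    reset m on improvement, otherwise grow it by the factor c.
    Stops after `patience` consecutive non-improvements."""
    m, best, stagnant, total_rotations = 1.0, None, 0, 0
    while stagnant < patience:
        k = random.randrange(int(math.ceil(m)))  # number of Grover rotations
        total_rotations += k
        x = measure_after(k)                     # apply k rotations, measure
        if improves(x):
            best, m, stagnant = x, 1.0, 0        # improvement: reset schedule
        else:
            m *= c                               # failure: widen the range
            stagnant += 1
    return best, total_rotations
```

The `total_rotations` counter corresponds to the complexity measure discussed in Section 3.3.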

Quantum (approximate) policy iteration, as described in Algorithm 1, starts with an initial policy and repeatedly applies QPI to generate a sequence of policies with increasing values. Note that the procedure is similar to the quantum minimization algorithm of durr1996. In each iteration , quantum policy iteration runs QPI, which is a Grover search and can therefore also fail to produce a policy whose estimated value exceeds the current best . In this case, the policy and its value are not updated. As we typically do not know the value of the optimal policy, the algorithm uses a patience criterion that interrupts the iteration if there was no improvement in the last steps.

As quantum policy iteration relies on QPE, it can only find a policy with -optimal behavior and may also fail to do so. This is because QPE yields an -approximation of the value only with probability . We conjecture that the failure probability of quantum policy iteration can be made arbitrarily small by choosing a sufficiently small . Moreover, note that if in some iteration the QPE output overestimates but still , QPI may nevertheless find a policy with . This mitigates QPE errors.

We propose that the complexity of quantum policy iteration is best measured in terms of the total number of Grover rotations performed across all QPI steps. This number is proportional to the runtime of the procedure and, at the same time, counts the number of times we ran QPE, which in turn relates, via the MDP model, to the total number of quantum agent-environment interactions. Therefore, the number of Grover rotations also measures the qsample complexity. We conjecture that the complexity grows linearly in . Our intuition is that quantum policy iteration is essentially Grover search, except that it uses changing, inaccurate oracles. This differentiates it from the more general quantum optimization algorithm of durr1996, for which a complexity proportional to the square root of the number of elements in the search space is known.

## 4 Experiments

Existing quantum hardware is not yet ready to run QPE, let alone quantum policy iteration. Both methods require far more gates than existing, non-error-corrected digital quantum computers can execute reliably. To nevertheless illustrate the operation of our methods, we conducted simulations in which we applied them to a two-armed bandit toy problem. Imagine a slot machine with two arms (levers). Upon pulling an arm, one receives a reward of either or . This translates to an MDP with one state and two actions, i.e., where "" means "pull the left arm" and "" means "pull the right arm". The reward set is . The MDP dynamics are determined by the two probabilities and of losing, i.e., "winning" , when pulling the left or right arm. Each policy is determined by , which denotes the probability of choosing the left arm. The learning problem is to identify the arm that yields the higher value , given by .
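For reference, the bandit value as a function of the policy can be computed classically in one line (the helper is ours; `pi_left` is the probability of pulling the left arm, rewards are 0 for losing and 1 for winning):

```python
def bandit_value(pi_left, p_lose_left, p_lose_right):
    """Horizon-1 value of a two-armed bandit policy with {0, 1} rewards:
    expected reward under the action distribution (pi_left, 1 - pi_left)."""
    win_left = 1.0 - p_lose_left
    win_right = 1.0 - p_lose_right
    return pi_left * win_left + (1.0 - pi_left) * win_right
```

This is the quantity that QPE estimates from qsamples and that quantum policy iteration maximizes over the policy set.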

Before we discuss the experiments with QPE and quantum policy iteration, we show how the policy, environment and MDP operators for the toy problem can be implemented on a gate-based quantum computer. We encode the actions via and , which requires a single qubit. We use another qubit to encode the rewards as and . In this encoding, we can represent the step operator S as the quantum circuit shown in Figure 2. The angles of the -axis rotations are given by , and . A direct calculation shows that this S indeed prepares a qsample of one agent-environment interaction.
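Independently of the gate-level circuit, the classical content of the step operator S can be checked in a few lines of Python: the qsample it prepares carries the square roots of the classical probabilities as amplitudes. The encoding below is our own convention, not necessarily that of the paper's figure.

```python
def step_statevector(pi_left, p_lose_left, p_lose_right):
    """Amplitudes of the two-qubit qsample prepared by S for the bandit:
    amp(a, r) = sqrt(P(action = a) * P(reward = r | a)),
    with a = 0 meaning 'left arm' and r = 0 meaning 'lose'."""
    p_action = {0: pi_left, 1: 1.0 - pi_left}
    p_reward = {0: {0: p_lose_left, 1: 1.0 - p_lose_left},
                1: {0: p_lose_right, 1: 1.0 - p_lose_right}}
    return {(a, r): (p_action[a] * p_reward[a][r]) ** 0.5
            for a in (0, 1) for r in (0, 1)}
```

Squaring the amplitudes recovers the joint action-reward distribution of one classical agent-environment interaction, which is what "qsample" means here.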

Now consider the two-armed bandit for horizon , i.e., the agent gets to play two rounds. The MDP operator of this decision process is given by two step operators according to Figure 2 that act on two separate subsystems represented by four qubits. For simplicity, we set the discount factor . The return operator G adds the two reward qubits and stores the result using another two-qubit subsystem that represents the return space . It can be implemented using two CNOT gates and one Toffoli gate to realize the logical expressions that determine which qubit of the return register must be set to or depending on the rewards. can be realized using three doubly-controlled gates that rotate another ancillary qubit depending on the two reward qubits.

### 4.1 Simulations of Quantum Policy Evaluation

Putting all of the above together, we obtain a gate decomposition of . To simulate QPE, we implemented the operator in IBM’s Qiskit framework for Python (https://github.com/Qiskit).
The software comes with a built-in amplitude estimation algorithm that uses the phase estimation circuit we discussed above and in Appendix A. Using this routine with our implementation of as input, we simulated QPE for a concrete instance of the two-armed bandit MDP. We used a bandit with dynamics and and considered the policy given by which has the value . All of these values were chosen arbitrarily. We used simulated QPE to obtain an estimate of this value with absolute error of at most . As maximum error probability, we chose . According to and , we have to use qubits in the first register of the phase estimation circuit to achieve the -approximation with probability at least .
The distribution of the QPE outputs for these parameters is shown in Figure 2. Qiskit offers the possibility to calculate measurement probabilities analytically, which is what we used to generate the plot. The probability distribution of the QPE output is concentrated at the true value , and the mass that lies outside the region is less than , which is even better than our guaranteed error probability bound of .

Next, we wanted to empirically confirm the quantum advantage of QPE over classical MC. To this end, we used both methods to evaluate a fixed policy for a fixed bandit using an increasing number of (q)samples. To reduce computational complexity, we chose horizon . According to (22), the approximation error of QPE is in , where is from the definition of in (18). For our experiment, we set . From Figure A.1, we see that QPE then uses qsamples via applications of and . We chose a range of integer values for and, for each , executed QPE times and calculated the median approximation error. Recall that the failure probability for QPE with is , so taking the median with high probability delivers a run in which the algorithm indeed returned an -approximation of the true value. The results of this experiment are shown in Figure 4. We also included the error bound according to (22) and see that the approximation errors satisfy the theoretical guarantee. For each , we also collected classical samples, averaged them, repeated this times and calculated the median. The approximation error of this classical MC approach is always higher than that of QPE. This empirically confirms the quantum advantage of QPE over MC in terms of (q)sample complexity.

### 4.2 Simulations of Quantum Policy Iteration

We now turn to policy iteration. As a toy learning problem, we used a two-armed bandit where the agent always loses when it chooses the left arm and always wins when it chooses the right arm, i.e., and . We want to use quantum policy iteration to find an (-optimal) decision rule for this scenario. In DP, one typically only considers deterministic policies as, under mild assumptions, every finite MDP has an optimal deterministic policy sutton2018. For the two-armed bandit, there are only two such policies, which results in an uninteresting problem. To make the search for an optimal decision rule more challenging, we instead used a set of stochastic policies given by

(28) |

We let the agent start with the worst policy which chooses the left arm with unit probability and want to find the optimal decision rule .

Figure 4 documents one successful run of quantum policy iteration for the toy learning problem with policies and parameters , , and . We set , so the agent plays one round and the value of the optimal policy is . In the beginning, the policy value increases rapidly even when no or only a few Grover rotations are applied. This is because we start with the worst possible policy, and better policies are easily found by chance. The steep increase stops after a few iterations, as the policy is then close to optimal. It takes increasingly more rotations and some minor improvements until the procedure makes the final jump (dotted vertical line), after which the values stagnate at the maximum of , as from then on no further improvement is possible. In the end, the algorithm tries more and more Grover rotations, as (which determines their maximum number) grows exponentially.

Finally, we investigated our conjecture that the runtime of quantum policy iteration is proportional to . To empirically test this, we ran quantum policy iteration for the same bandit and with the same parameters as in the previous experiment. We quadratically increased the size of the policy set from to . For each , we ran quantum policy iteration times and calculated statistics of the resulting distribution of the number of Grover rotations. As the number of rotations depends on the patience , we omitted those that happened in the last iterations. We only considered "successful" runs, in which quantum policy iteration returned an -optimal policy. Due to the low QPE failure probability, at least of all runs were successful for every . The results of the complexity experiment are shown in Figure 5. We see that the average number of rotations indeed seems to grow linearly in . The line through the means was found by linear regression and fits the data well; the mean squared error is . Note that, for each , the empirical distribution of the runtimes (we hid the outliers for the sake of clarity) appears to be quite symmetric. This is expected because sometimes the algorithm is lucky and samples the optimal policy in an early iteration, while in other runs the routine suffers from many incremental improvements.

## 5 Conclusion

In this work, we showed in detail how to use amplitude estimation and Grover search to obtain a quantum version of classical policy iteration that consists of QPE and QPI. The foundation of all our algorithms is a quantum mechanical realization of a finite MDP in which states, actions and rewards are exchanged in superposition. QPE exploits this via amplitude estimation and uses the distribution of the returns represented as a superposition state. The quantum advantage of QPE is a quadratic reduction of the sample complexity compared to sampling from this distribution classically. QPI and quantum policy iteration use the superposition principle and consider all policies "simultaneously". Both methods are examples of how to use Grover search to find an optimal decision rule in a way that is analogous to database search. While existing quantum hardware is not yet ready for our methods, the implementation details we presented and the simulations we conducted shed light on how gate-based quantum computers can be used to solve classical reinforcement learning problems.

## References

## Appendix A Phase Estimation

Figure A.1 shows the classical phase estimation circuit. Note that the input state of the second (working) register is not a single eigenstate of but an equal superposition of the two orthonormal eigenstates . After all gates have been applied, the state of the first register is a superposition of close approximations of as formally stated in (19). This is because the circuit from Figure A.1 applies phase estimation to and in superposition and thus estimates the corresponding phases and "at the same time". The error bounds (19) follow from a detailed analysis of the phase estimation circuit as done in cleve1998.

## Appendix B Proof of (21) and (22)

The proof makes use of the following technical result:

###### Lemma 1 (brassard2002, Lemma 7).

Let and with . Then it holds that

(29) |

Recall from (19) that the random variable that corresponds to the final measurements of phase estimation in QPE satisfies

(30) |

We set and and obtain