I Introduction
Alongside supervised and unsupervised learning, reinforcement learning (RL) constitutes one of the main pillars of modern attempts to create artificial intelligence
Sutton and Barto (2018). RL describes a set of techniques, anchored in the mathematical formalism of the Markov decision process (MDP), whereby an autonomous agent learns how to behave according to an optimal policy within an environment in order to maximize some notion of expected future rewards derived from said environment by repeatedly interacting with it. When the agent finds itself in some state of the environment, it chooses an action that causes it to transition to a subsequent state of the environment and simultaneously be given a reward (or penalty). The agent’s behavioral policy is therefore typically regarded as a probability distribution over stateaction pairs, which tells the agent which action(s) is (are) best to take in each state. RL has begun to see use in a wide variety of applications including recommendation systems and optimal control, and perhaps the most stark demonstration of the computational utility of RL has been its role in enabling a computer to beat expert human players at the game of Go
Golovin and Rahm (2004); Sutton et al. (1992); Silver et al. (2018).While RL is a novel and powerful algorithmic paradigm, its implementation currently remains largely tied to computational resources that are dictated by classical physics at the hardware level. Given that the space of possible policies for an agent to enumerate is often combinatorially large, such as in the instance of Go, there is good reason to suspect that the particular type of parallelism afforded by quantum hardware might help an agent in finding an optimal policy more quickly. In fact, an existing body of literature already explores quantum formulations of RL in situations where either the agent or environment, or both, are regarded as quantum mechanical, and in some instances a quantum speedup has been established (see for example Dong et al. (2008); Dunjko et al. (2017); Lamata (2017); Briegel and De las Cuevas (2012); Paparo et al. (2014); Dunjko et al. (2016, 2015); Neukart et al. (2018)). To our knowledge however, there does not yet exist a formulation of a quantum MDP with input from a classical environment that is amenable to solution directly on nearterm, noisy intermediatescale quantum hardware via existing quantum optimization algorithms.
In this work, we therefore establish what we regard to be the simplest and most straightforward connection between the mathematical formalism of RL and that of quantum computing. Namely, we consider a “quantumaccelerated” agent interacting with a classical environment. Moreover, we suppose that the agent possesses a model of its environment such that the quantum MDP consists only of the agent solving for an optimal policy using quantum hardware. Strictly speaking, this formulation in and of itself does not constitute “quantum RL” because the agent already has a model of its environment. However, paired with a classical routine where the agent learns the transition and reward rules for its environment, our quantumresolved MDP may be thought of as a (potentially hybrid) quantum subroutine embedded in a larger hybrid quantumclassical RL algorithm.
Note that all three variants of the general MDP (finite horizon, infinite horizon discounted, and infinite horizon averaged cost) have been shown to be Pcomplete Papadimitriou and Tsitsiklis (1987). This means roughly that while MDPs are in principle efficiently solvable in polynomial time, they are not categorically efficiently parallelizable Greenlaw et al. (1995)
. Moreover, a general MDP suffers from the “curse of dimensionality”, meaning that its attendant memory requirements grow exponentially in the dimension of the state space. In practice, a conventional computer can solve MDPs involving millions of states
Sutton and Barto (2018).Our approach is motivated by the observation that a majority of nearterm quantum algorithms such as adiabatic quantum annealing and the quantum approximate optimization algorithm are concerned with the solution of discrete, combinatorial optimization problems recast as Kspin Hamiltonians of the form
Farhi et al. (2014); Denchev et al. (2016)(1) 
As such, we identify the policy with discrete variables to be mapped to spins (and thus qubits) and a sum over Qfunctions with a Kspin Hamiltonian. Since it is wellknown that finding the ground state of a Kspin Hamiltonian is in the complexity class NPcomplete through its polynomialtime order reduction to the Ising spin Hamiltonian, it may at first glance look odd to map a problem from Pcomplete into a formulation, which ostensibly resides in NPcomplete
Barahona (1982); Rosenberg (1975). Our primary justification is that certain MDP generalizations, such as the partially observable MDP (POMDP), are harder than the MDP for example POMDP is PSPACEcomplete Papadimitriou and Tsitsiklis (1987). The present work provides a platform from which to explore Kspin Hamiltonians as an approximation scheme for problems such as POMDP. In addition, while arbitrary instances of the Kspin problem are NPcomplete there can be specific instances of the Kspin problem, which are soluble in polynomial time Zintchenko et al. (2015); Cipra (2000). It is possible that the Kspin Hamiltonian derived in the present work inherits its polynomialtime solubility from the dynamic programming formulation from which it is derived. Suggesting otherwise however, is the fact that when reformulated into Ising spin form, the regular XORSAT problem has been shown to take exponential time to solve by adiabatic quantum annealing when while Gaussian elimination allows polynomialtime solubility classically Farhi et al. (2012); Patil et al. (2019); Bapst et al. (2013). We therefore do not consider our quantum MDP a panacea for all largeorder polynomialtime MDPs. Rather, it furnishes an interesting connection between two disparate fields in computational science while allowing for quantum parallelization of a problem, which is not efficiently parallelizable on a classical computer (unless ).Ii Markov Decision Process
The basic mathematical formalism that describes an autonomous agent interacting with its environment is the MDP. Our treatment follows from the standard text by Sutton and Barto Sutton and Barto (2018). We take the discrete, finite, discounted MDP with infinite horizon to be a 5tuple where is a finite set of discrete states in which the agent may find itself, is a finite set of discrete actions that the agent may carry out, is a set of transition probabilities of the form , which describe the probability that an agent will transition to state from state given that it has taken action , is a set of rewards of the form , which describe the reward that an agent receives upon moving from state to state under the action , and is the discount factor, which gauges the degree to which rewards farther into the future are prioritized less than more immediate ones.
The goal of an agent moving around in an environment is to learn an optimal policy such that when it finds itself in the state , is the probability that the agent should take the action in order to maximize its expected future discounted reward. Formally, there are two quantities that represent an expected future discounted reward under the policy : the value function
(2) 
which describes the future discounted reward expected by virtue of being in state at time , and the actionvalue function (also known as the Qfunction)
(3) 
which describes the future discounted reward expected by choosing action when starting in state at time . In either case, the quantity
(4) 
is the discounted return expected by the agent from time onward.
Intuitively, the agent wants to collect as many of the largest, positivevalued as possible and as few negativevalued ones over some effective time horizon dictated by the value of . If the sets and are known a priori the agent is said to have a model of its environment. If not, the agent must learn features of its environment implicitly or through a training simulation, which are situations encompassed by the MDP. To make these concepts concrete consider the particular example of an autonomous vacuuming robot (AVR) in a hallway trying to find the larger of two dirt piles to clean as displayed in Fig. 1a. The states accessible to the AVR, , are the six floor tiles, and the action space is with corresponding to moving left and , right. The terminal states, and , correspond to dirt piles that the AVR wants to find and clean. As such, if it encounters a dirt pile, the AVR receives the rewards (the smaller pile) and (the larger pile). If we suppose that each AVR is responsible for finding and cleaning only one dirt pile, then we can entice the AVR to stay at the dirt pile it has found by introducing a steep penalty for moving away . Note that one could also immobilize the AVR with . Each step in the interior of the hallway, say between and , incurs a penalty so that the AVR does not dawdle. We specify that this process is slightly stochastic in that and . In this context, stochasticity could reflect a tendency of the AVR to occasionally misfire and move in the wrong direction. We make this designation because the deterministic MDP is even easier (NC complexity class) than the general, stochastic MDP Papadimitriou and Tsitsiklis (1987). At the edges of the hallway the AVR sees relatively hard boundary conditions via . We also regard the policy, , as a binary indicator variable such that if the agent takes action in state and if it does not. Such a policy is called a deterministic policy. However, the Kspin Hamiltonian derived in Sec. III
is valid for nondeterministic policies as well since any probabilistic policy expressed as a fixedprecision or floatingpoint number has a linear expansion in binary variables and thus can be mapped linearly to qubit degrees of freedom
Yates (2009). The use of deterministic policy variables simply allows a more economical use of qubit resources. We remark that this environment is equivalent to a 1dimensional version of the classic “Gridworld” environment encountered in the RL literature Sutton and Barto (2018).As established, one way for an agent to understand the quality of its current policy is by calculating the actionvalue function in Eq. 3. For a given policy, the actionvalue function returns the expected future discounted reward for taking action in state and then subsequently following . An optimal policy is one for which is maximized . One of the central results of the dynamic programming treatment of MDPs is that the actionvalue function obeys a Bellman equation
(5) 
We use Eq. 5 as the starting point from which to derive a Kspin Hamiltonian for a general discrete, finite, discounted MDP over an infinite horizon.
Iii MDP Kspin Hamiltonian
Object  Classical Field Theory  MDP 

Coordinate  
Field  
Density  
Functional  
Stationarity  
EulerLagrange 
One can express the actionvalue function purely as a functional of the policy by repeatedly substituting the lefthand side of Eq. 5 into the righthand side of Eq. 5. The result is
(6) 
is now a functional of the policy. A stationary point of a functional occurs whenever its variation with respect to its argument vanishes Dreyfus (1965). Therefore finding the optimal policy that maximizes the Qfunction everywhere is equivalent to the variational condition
(7) 
for an arbitrary variation of with respect to the policy. From here it is straightforward to define a Hamiltonian functional as
(8) 
Recall here that and therefore has a simple relationship to singlequbit spin operators via . So, Eq. 8 is equivalent to a Kspin Hamiltonian where multiplies the largest order in the infinite expansion that needs to be taken into account in order to find the optimal policy.
The Hamiltonian in Eq. 8 is amenable to simulation on a quantum computer either via adiabatic or variational preparation. What one needs to check is whether minimizing with respect to is equivalent to maximizing for each and . For this purpose, consider an arbitrary variation of the Hamiltonian with respect to the policy. Minimization of the Hamiltonian corresponds to
(9) 
But since the variation is arbitrary, each term in the sum needs to vanish identically leading to Eq. 7 for all .
There is an interesting analogy to classical field theory in the present discussion shown in Table 1. Each stateaction pair plays the role of a spacetime coordinate over which a field is defined. The actionvalue function can loosely be thought of similar to a Lagrangian density. The Hamiltonian functional plays a role analogous to the action functional of field theory and its variation leads to an “EulerLagrange” equation for
. We note that there exist path integral approaches to both optimal control and reinforcement learning, and Markov chains have been explored as tools in field theory
Kappen (2005); Theodorou et al. (2010); Dynkin (1983). It would be an interesting direction for future work to explore the possible connections between our Kspin theory and these previous approaches. In addition, we emphasize that by expressing the MDP in the form of Eq. 8 (along with the conditional probability normalization below in the second line of Eq. 10), a quantum optimization heuristic will attempt to produce the optimal policy for the entire stateaction space at once by virtue of the fact that the Hamiltonian contains in its parameters a model for the entire environment. In the absence of a known model, such as in the instance where the stateaction space is too large to allow tabulation of all transition and reward functions or where such functions simply are not known a priori, we expect that one should still be able to write down a Hamiltonian whose ground state infers a good policy by sampling transition and reward functions from across the stateaction space. We leave a demonstration of this concept for future work as well, since it is beyond the scope of the present paper.Explicitly then, the Hamiltonian we may use to solve an MPD for an optimal policy is
(10)  
where we have made the identifications , , etc. and
(11) 
Eq. 10 includes a penalty function with strength in the second line due to the fact that the policy is generally a probability distribution conditioned on the state variable and so must equal unity when summed over all actions (). Also, note that the term involves a slight abuse of notation and is actually just a constant offset to the Hamiltonian, which otherwise displays the same form as Eq. 1 with . The quantities defined in Eq. III are qubit coupling coefficients, which gauge how many terms in the Hamiltonian need to be retained in order to solve for an optimal policy. Even though monotonically decreases with since , it is not clear a priori that Eq. 1 is convergent for a general MDP problem instance. However, even if Eq. 1 is an asymptotic series in some instances, it may still provide useful solutions as is well known within the context of perturbative quantum field theory Peskin (2018). We take as the order at which one is practically able to truncate the series.
In terms of hardware requirements, a larger will require higher gate complexity and depth for gatemodel machines and larger numbers of ancillary qubits for annealermodel machines in order to perform orderreduction Pedersen et al. (2019); Boros and Hammer (2002). Gatemodel machines may also use ancillary variables in order to reduce gate complexity. In addition, if is chosen to be too small with respect to the decay of then the ground state of the Hamiltonian will fail to exhibit some longrange correlations that might be essential to finding an optimal policy at every stateaction pair. We will discuss hardware resource requirements for accurate policy determination in more detail in Sec. IV.
Fig. 1b shows the optimal policy found by solving for the ground state of for the AVRdirt scenario described in Sec. II on the DWave 2000Q processor and validated by simulated annealing and classical QLearning run in an OpenAI Gym environment Brockman et al. (2016)
. For the QLearning hyperparameters, we use
(learning rate), (convergence threshold), and (number of training episodes). A more detailed discussion of quantum and simulated annealing and their hyperparameters will follow in Sec. V. The strength of the penalty function in Eq. 10 is set to , the discount factor is set to , and the Hamiltonian is truncated at and then reduced to quadratic order, which corresponds to a quadratic unconstrained binary optimization (QUBO) or Ising model. The procedure for Hamiltonian order reduction will be discussed in Sec. IV, but we remark here that we set the penalty strength required for order reduction to . Fig. 2 shows how premature truncation adversely affects solution quality for the same environment. In Fig. 2a, truncation at erroneously leads to a policy that is symmetric in the state space due to the fact that terms, which couple policy variables in the fourth tile from left to the leftmost tile are neglected. By truncating instead at , the AVR becomes aware of longerterm rewards and therefore identifies the correct policy. Due to this observation, we turn to an analysis of the scaling of spatial resources (i.e. numbers of qubits and Hamiltonian parameters) in Sec. IV before assessing the scaling in time complexity of the Kspin MDP in Sec. V.Iv Hardware Requirements for the KSpin MDP
Given that the order () at which Eq. 10 is truncated affects whether or not the groundstate solution equates to the optimal policy, a natural question to ask is: how many terms in the Hamiltonian need to be retained? This question is also important since either the number of ancilla qubits or gate complexity or both will grow as a function of . Fig. 3 shows the lowest truncation order () at which the optimal policy was found as the unique ground state of Eq. 10 as a function of system size () and discount factor () by simulated annealing. For our hallway environment, we find that the truncation order scales roughly as for typical values of the discount factor such as , but for a general environment we expect that this scaling will depend upon the form of the transition and reward probabilities and as well as the dimensions of the state and action spaces and . The aberrant result observed at is a consequence of the small size of the state space combined with a small discrepancy in how the boundary conditions are implemented classically. Boundary conditions matter less as the state space becomes larger. Otherwise, Fig. 3 is encouraging for two reasons. First, while Eq. 10 formally involves an infinite number of terms, Fig. 3 confirms that in practice, only a finite number of terms need to be retained in order to find an optimal policy. This situation is similar to the dynamic programming treatment for MDPs wherein only a finite number of policy iterations are needed in order to find an optimal policy Sutton and Barto (2018). And second, we see that as the discount factor () is varied, our Hamiltonian’s ground state is able to track the optimal policy by simply including higherorder interactions. For example, the phase transition observed between and for corresponds to the ground state policy switching from the one in Fig. 2a to the policy in Fig. 2b.
We now turn to an accounting of physical resources required to simulate our Hamiltonian on quantum hardware. In its current form, i.e. before the substitution to qubit operators , Eq. 10 is in the multilinear polynomial form of a pseudoBoolean classical cost function. There is a useful result from combinatorial optimization that an arbitrary pseudoBoolean function with order (degree) can be reduced to a quadraticorder polynomial in polynomial time Rosenberg (1975). There are a variety of approaches to polynomial order reduction, which typically require the introduction of ancillary variables Dattani (2019). Here we discuss the most general but worstscaling technique for order reduction in order to place upper bounds on the quantum hardware required to solve a general MDP. This technique is referred to as substitution.
Substitution for order reduction was first introduced by Rosenberg, and it is predicated on the observation that for three binary variables , relationships such as
(12) 
hold Rosenberg (1975); Boros and Hammer (2002). Therefore, one may rewrite a product of two variables as a single variable while at the same time introducing the penalty function
(13) 
into the problem Hamiltonian to enforce , where is the strength of the penalty function. If each of variables occurs in Hamiltonian terms with at most other binary variables, then the number of additional qubits needed to quadratize the problem scales as , where is the magnitude of the MDP stateaction space and is the order at which Eq. 10 is truncated Fix et al. (2011). Fig. 4a shows the number of logical qubits () required to simulate the truncated, orderreduced Hamiltonian as a function of system size for two different discount factors, (blue squares) and (orange triangles). Dashed lines connect points calculated by using the DWave make_quadratic utility while dotted lines connect points calculated by the function , which in the asymptotic limit corroborates the scaling of order reduction by substitution. This scaling is general for a fixed and independent of the hardware platform used. The dependence of on is a result of the fact that, on average, for larger discount factors more terms in Eq. 10 need to be taken into account in order to find the correct policy. Once orderreduced, these terms manifest as additional logical variables.
Fig. 4b shows the scaling of computational resources required to simulate the truncated, orderreduced Hamiltonian as a function of system size on the DWave 2000Q processor in particular for (green circles) and (red inverted triangles). Dotted lines connect points that denote the number of parameters () of the form in Eq. III needed to specify the Hamiltonian after order reduction while dashed lines connect points that denote the number of physical qubits () required to simulate the orderreduced Hamiltonian after being minorembedded on the DWave (16,4)Chimera hardware graph using the find_embedding utility. The find_embedding utility was run ten times with the best resultant embedding taken. Note that the point is missing since and so the orderreduced Hamiltonian cannot be embedded on the qubit processor.
As observed in previous work, and track closely to one another over multiple orders of magnitude Jones et al. (2020). In addition, at and as shown, and both increase by roughly two orders of magnitude between and . This scaling similarity sits well within the worstcase asymptotic scaling for embedding a general complete graph () into a Chimera graph using logical chains Lucas (2019). Fig. 4b therefore shows that in addition to being amenable to the type of quantum parallelism exhibited by conventional quantum optimization hardware and heuristics, Eq. 10 may suffer from a less severe curse of dimensionality when compared to the classical MDP.
The process of order reduction is necessary in order to solve for the ground state of Eq. 10 on conventional quantum annealers since execution of an annealing schedule is beholden to representing qubitqubit interactions on the native twoqubit coupling in hardware Boothby et al. (2019). By contrast, gatemodel quantum computers can decompose higherorder interactions via gate compilation into products of single and twoqubit gates Nielsen and Chuang (2002). Specifically, after truncating and elevating Eq. 10 to an operator and exponentiating it, we find that for a quantum approximate optimization algorithm (QAOA)like unitary with variational parameter , we have
(14) 
Eq. IV is equivalent to a product of qubit controlledPHASE gates. Without introducing any additional ancillary qubits, a general Nqubit controlledunitary may be decomposed with twoqubit gates Barenco et al. (1995)
. Therefore, without exploiting any hardwarespecific implementations, we estimate that for a QAOA of depth
, the contribution of Eq. IV to the gate volume scales at worst as . However, either by introducing ancillary qubits or by hardware specific implementations, scaling of the gate volume can be reduced to or better Kumar (2013). Keeping in mind that the hardware resources required to solve Eq. 10 are very much architecture dependent, we now turn to a particular experimental assessment of the time complexity of solving the Kspin Hamiltonian by classical simulated annealing and the variant of quantum annealing (transverse field Ising model) offered by the DWave 2000Q processor Kadowaki and Nishimori (1998).V Time Complexity of Kspin MDP
v.1 Quantum Annealing
A common and welldefined timetosolution () metric for the DWave 2000Q quantum annealer is
(15) 
where here is the time chosen over which each annealing schedule is carried out, is a desired success probability often taken to be , and for a given annealing time, is the probability of success for a single run of the annealer Denchev et al. (2016); Albash and Lidar (2018). For both quantum annealing calculations and simulated annealing calculations, we set and . These values are chosen such that the penalty functions for conditional probability conservation of the policy (second line in Eq. 10) and consistency terms from orderreduction (Eq. 13), respectively, are costly to violate. We note that the particular values of and are somewhat arbitrary in that any will make the attendant penalty functions sufficiently costly to violate. In each problem instance, the chain strength to enforce the logical consistency of embedded variables was set to , where is the largest magnitude coupling term in the orderreduced QUBO. This value was chosen so that the fraction of broken logical chains was less than 0.001.
Fig. 5a shows a representative set of curves for as a function of annealing time at . At each system size, it is clear that the optimal timetosolution occurs at (or below) , which is the shortest annealing time available on the processor used. This is also the case for all other values of . As such, Fig. 5b shows as a function of for all values of . While actually appears to decrease between and for , likely due to the boundary condition effect for small mentioned in Sec. IV, between the rest of the values of appears to follow a sort piecewise polynomial scaling where large jumps in are concurrent with large jumps in the number of physical qubits () in Fig. 4b. Indeed, the dependence of on is a direct result of the dependence of and therefore on . However, that ranges over five orders of magnitude while stepping through just a twofold increase in speaks to asymptotic exponential scaling of the solution of Eq. 10 by quantum annealing on the DWave 2000Q, in clear analogy with solving regular XORSAT– another polynomialtime solvable problem– by quantum annealing Farhi et al. (2012); Patil et al. (2019); Bapst et al. (2013).
v.2 Simulated Annealing
An analog to Eq. 15 for simulated annealing is
(16) 
where for large , , but in general Isakov et al. (2015). Here, is the total number of Monte Carlo sweeps (that is, the number of times all spins see an attempted update at a given temperature times the number of temperature steps) involved in a particular annealing run, is the number of logical spins needed to represent the system under consideration, and is the attempted spin update frequency. Note that since is a fixed property of the implementation and classical hardware being used across all problem instances and since is fixed by the problem instance, we consider in Fig. 6a. In this plot, for and , while for all other system sizes at this discount factor . In general for the simulated annealing algorithm presented here will depend upon both and . The simulated annealing implementation used here was the DWave SimulatedAnnealingSampler
with default, linearly interpolated
values run on a single 2.5 GHz Intel Core i5 processor with 16 GB of 1600 MHz DDR3 memory (source code at SA_ ). Taken together with the results in Fig. 5b, the results in Fig. 6b support the likely asymptotically exponential time complexity for solving the MDP as a spin Hamiltonian.Vi Conclusion
Here we have established a direct link between the Markov decision process formalism of reinforcement learning and a quantum Hamiltonian amenable to solution on currentgeneration and nearterm quantum hardware. Our derivation demonstrates that some problems that resist parallelization classically may become manifestly parallel when recast into the quantum model of computation. Despite this parallelization, the Kspin Hamiltonian likely scales exponentially in time when solved using conventional simulated and quantum annealing resources. Moreover, spatial fullyconnected quantum computational resources scale polynomially in the size of the state space, while limited hardware connectivity induces a polynomial growth in spatial resources that is at most quadratically worse than the case for fullyconnected resources. Despite these limitations, our Kspin Hamiltonian provides another previously unexplored inroad to quantum formulations of full reinforcement learning as well as more difficult variants of the MDP such as POMDP.
Acknowledgements.
This work was authored in part by the National Renewable Energy Laboratory (NREL), operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DEAC3608GO28308. This work was supported by the Laboratory Directed Research and Development (LDRD) Program at NREL. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paidup, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes. This research used Ising, Los Alamos National Laboratory’s DWave quantum annealer. Ising is supported by NNSA’s Advanced Simulation and Computing program. The authors would like to thank Scott Pakin and Denny Dahl. This material is based inpart upon work supported by the National Science Foundation under Grant No. PHY1653820.References
 Sutton and Barto (2018) R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (MIT press, 2018).
 Golovin and Rahm (2004) N. Golovin and E. Rahm, Reinforcement learning architecture for web recommendations, in International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004., Vol. 1 (IEEE, 2004) pp. 398–402.
 Sutton et al. (1992) R. S. Sutton, A. G. Barto, and R. J. Williams, Reinforcement learning is direct adaptive optimal control, IEEE Control Systems Magazine 12, 19 (1992).
 Silver et al. (2018) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay, Science 362, 1140 (2018).
 Dong et al. (2008) D. Dong, C. Chen, H. Li, and T.J. Tarn, Quantum reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38, 1207 (2008).
 Dunjko et al. (2017) V. Dunjko, J. M. Taylor, and H. J. Briegel, Advances in quantum reinforcement learning, in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (IEEE, 2017) pp. 282–287.
 Lamata (2017) L. Lamata, Basic protocols in quantum reinforcement learning with superconducting circuits, Scientific reports 7, 1609 (2017).
 Briegel and De las Cuevas (2012) H. J. Briegel and G. De las Cuevas, Projective simulation for artificial intelligence, Scientific reports 2, 400 (2012).

Paparo et al. (2014)
G. D. Paparo, V. Dunjko, A. Makmal, M. A. MartinDelgado, and H. J. Briegel, Quantum speedup for active learning agents, Physical Review X
4, 031002 (2014). 
Dunjko et al. (2016)
V. Dunjko, J. M. Taylor, and H. J. Briegel, Quantumenhanced machine learning, Physical review letters
117, 130501 (2016).  Dunjko et al. (2015) V. Dunjko, N. Friis, and H. J. Briegel, Quantumenhanced deliberation of learning agents using trapped ions, New Journal of Physics 17, 023006 (2015).
 Neukart et al. (2018) F. Neukart, D. Von Dollen, C. Seidel, and G. Compostella, Quantumenhanced reinforcement learning for finiteepisode games with discrete state spaces, Frontiers in Physics 5, 71 (2018).
 Papadimitriou and Tsitsiklis (1987) C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of markov decision processes, Mathematics of operations research 12, 441 (1987).
 Greenlaw et al. (1995) R. Greenlaw, H. J. Hoover, W. L. Ruzzo, et al., Limits to parallel computation: Pcompleteness theory (Oxford University Press on Demand, 1995).
 Farhi et al. (2014) E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm, arXiv preprint arXiv:1411.4028 (2014).
 Denchev et al. (2016) V. S. Denchev, S. Boixo, S. V. Isakov, N. Ding, R. Babbush, V. Smelyanskiy, J. Martinis, and H. Neven, What is the computational value of finiterange tunneling?, Physical Review X 6, 031015 (2016).
 Barahona (1982) F. Barahona, On the computational complexity of ising spin glass models, Journal of Physics A: Mathematical and General 15, 3241 (1982).
 Rosenberg (1975) I. G. Rosenberg, Reduction of bivalent maximization to the quadratic case, Cahiers du Centre d’etudes de Recherche Operationnelle 17, 71 (1975).
 Zintchenko et al. (2015) I. Zintchenko, M. B. Hastings, and M. Troyer, From local to global ground states in ising spin glasses, Physical Review B 91, 024201 (2015).
 Cipra (2000) B. A. Cipra, The ising model is npcomplete, SIAM News 33, 1 (2000).
 Farhi et al. (2012) E. Farhi, D. Gosset, I. Hen, A. Sandvik, P. Shor, A. Young, and F. Zamponi, Performance of the quantum adiabatic algorithm on random instances of two optimization problems on regular hypergraphs, Physical Review A 86, 052334 (2012).
 Patil et al. (2019) P. Patil, S. Kourtis, C. Chamon, E. R. Mucciolo, and A. E. Ruckenstein, Obstacles to quantum annealing in a planar embedding of xorsat, Physical Review B 100, 054435 (2019).
 Bapst et al. (2013) V. Bapst, L. Foini, F. Krzakala, G. Semerjian, and F. Zamponi, The quantum adiabatic algorithm applied to random optimization problems: The quantum spin glass perspective, Physics Reports 523, 127 (2013).
 Yates (2009) R. Yates, Fixedpoint arithmetic: An introduction, Digital Signal Labs 81, 198 (2009).
 Dreyfus (1965) S. E. Dreyfus, Dynamic programming and the calculus of variations, Tech. Rep. (RAND CORP SANTA MONICA CA, 1965).
 Kappen (2005) H. J. Kappen, Path integrals and symmetry breaking for optimal control theory, Journal of statistical mechanics: theory and experiment 2005, P11011 (2005).
 Theodorou et al. (2010) E. Theodorou, J. Buchli, and S. Schaal, A generalized path integral control approach to reinforcement learning, journal of machine learning research 11, 3137 (2010).
 Dynkin (1983) E. Dynkin, Markov processes as a tool in field theory, Journal of Functional Analysis 50, 167 (1983).
 Peskin (2018) M. E. Peskin, An introduction to quantum field theory (CRC press, 2018).
 Pedersen et al. (2019) S. P. Pedersen, K. S. Christensen, and N. T. Zinner, Native threebody interaction in superconducting circuits, Physical Review Research 1, 033123 (2019).
 Boros and Hammer (2002) E. Boros and P. L. Hammer, Pseudoboolean optimization, Discrete applied mathematics 123, 155 (2002).
 Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, Openai gym, arXiv preprint arXiv:1606.01540 (2016).
 Dattani (2019) N. Dattani, Quadratization in discrete optimization and quantum mechanics, arXiv preprint arXiv:1901.04405 (2019).

Fix et al. (2011)
A. Fix, A. Gruber,
E. Boros, and R. Zabih, A graph cut algorithm for higherorder markov random
fields, in
2011 International Conference on Computer Vision
(IEEE, 2011) pp. 1020–1027.  Jones et al. (2020) E. B. Jones, E. Kapit, C.Y. Chang, D. Biagioni, D. Vaidhynathan, P. Graf, and W. Jones, On the computational viability of quantum optimization for pmu placement, arXiv preprint arXiv:2001.04489 (2020).
 Lucas (2019) A. Lucas, Hard combinatorial problems and minor embeddings on lattice graphs, Quantum Information Processing 18, 203 (2019).
 Boothby et al. (2019) K. Boothby, P. Bunyk, J. Raymond, and A. Roy, Nextgeneration topology of Dwave quantum processors, Tech. Rep. (Technical report, 2019).
 Nielsen and Chuang (2002) M. A. Nielsen and I. Chuang, Quantum computation and quantum information (2002).
 Barenco et al. (1995) A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus, P. Shor, T. Sleator, J. A. Smolin, and H. Weinfurter, Elementary gates for quantum computation, Physical review A 52, 3457 (1995).
 Kumar (2013) P. Kumar, Direct implementation of an nqubit controlledunitary gate in a single step, Quantum information processing 12, 1201 (2013).
 Kadowaki and Nishimori (1998) T. Kadowaki and H. Nishimori, Quantum annealing in the transverse ising model, Physical Review E 58, 5355 (1998).
 Albash and Lidar (2018) T. Albash and D. A. Lidar, Demonstration of a scaling advantage for a quantum annealer over simulated annealing, Physical Review X 8, 031016 (2018).
 Isakov et al. (2015) S. V. Isakov, I. N. Zintchenko, T. F. Rønnow, and M. Troyer, Optimised simulated annealing for ising spin glasses, Computer Physics Communications 192, 265 (2015).
 (44) Source code for neal.sampler, https://docs.ocean.dwavesys.com/projects/neal/en/latest/_modules/neal/sampler.html#SimulatedAnnealingSampler.sample, accessed: 20200321.
Comments
There are no comments yet.