1 Introduction
Many decision-making problems involve learning by interacting with an environment and observing the rewards that result from these interactions. In machine learning, this line of research is referred to as reinforcement learning (RL), and algorithms for training artificial agents that interact with an environment have been studied extensively (Sutton and Barto, 2018; Kaelbling et al., 1996; Bertsekas and Tsitsiklis, 1996). We are interested here in the best arm identification (BAI) problem from the family of bandit problems, which encompasses the RL problems where interactions with the environment give rise to immediate rewards and where long-term planning is unnecessary (see the survey of Lattimore and Szepesvári, 2020). More precisely, we are interested in a quantum version of the BAI problem, for which we design a quantum algorithm capable of solving it.

Quantum machine learning is a research field at the interface of quantum computing and machine learning where the goal is to use quantum computing paradigms and technologies to improve the speed and performance of learning algorithms (Wittek, 2014; Biamonte et al., 2017; Ciliberto et al., 2018; Schuld and Petruccione, 2018). A fundamental concept in quantum computing is quantum superposition, which is the means by which quantum algorithms such as that of Grover (1996b), one of the most popular quantum algorithms, succeed in finding one item in an unstructured database of $N$ items in time $O(\sqrt{N})$, thereby beating the classical $O(N)$ time requirement. Recent works have investigated the use of Grover’s quantum search algorithm to enhance machine learning and have shown that it can provide nontrivial improvements not only in the computational complexity but also in the statistical performance of these models (Aïmeur et al., 2013; Wittek, 2014; Kapoor et al., 2016). Beyond Grover’s algorithm, quantum algorithms for linear algebra, such as quantum matrix inversion and quantum singular value decomposition, were recently proposed and used in the context of machine learning (Rebentrost et al., 2014; Kerenidis and Prakash, 2017). Works on quantum reinforcement learning are emerging (Dong et al., 2008; Naruse et al., 2015a; Dunjko et al., 2016; Lamata, 2017), and our paper aims to contribute to that area in two ways: (i) a formalization of the best arm identification problem in a quantum setting, and (ii) a quantum algorithm to solve this problem that is quadratically faster than classical ones.

The paper is organized as follows. In Section 2, we formulate the best arm identification (BAI) problem, briefly review the upper confidence bound, and illustrate how it can be used to solve the BAI problem. In Section 3, we describe quantum amplitude amplification, at the core of Grover’s algorithm, which forms the basis of our approach. Our main results are in Section 4: we provide our quantum modeling of the BAI problem, which assumes that both the learning agent and the environment are quantum, and we then propose an algorithm based on quantum amplitude amplification to solve BAI, which is able to get the optimal solution quadratically faster than what is known to hold in the classical case. Section 5 concludes the paper.
2 Best Arm Identification
2.1 Stochastic Multi-Armed Bandits and the BAI Problem
Bandit problems are RL problems in which an agent evolves in an environment with which it interacts by choosing, at each time step, an action (or arm); each action provides the agent with a reward that measures the quality of the chosen action (see the function $f$ below and, more generally, Lattimore and Szepesvári, 2020).

The bandit problem we want to study from a quantum point of view is that of best arm identification in stochastic multi-armed bandits (Audibert and Bubeck, 2010). It comes with the following assumptions: the set of actions $\mathcal{X}$ is finite and discrete, with $|\mathcal{X}| = N$, and when action $x_t \in \mathcal{X}$ is chosen at time $t$, the reward $r_t$ depends upon the independent realisation (called $y_t$ afterwards) of a random variable distributed according to some unknown (but fixed) law $\nu_{x_t}$. The BAI problem is to devise an action selection strategy for the agent such that, after a predefined number of interactions, the agent is able to identify the best action with the best possible guarantees.

We may go one step further in the formal statement of the problem and, along the way, use a modelling that is both in line with the classical BAI problem and suitable for its quantum extension. In particular, in order to take the unknown distributions $\nu_x$ into account, we explicitly introduce $\mathcal{Y}$, the set of all possible internal states of the environment; this notion of internal state of the environment is uncommon in the classical bandit literature. The agent’s action $x_t$ sets the internal state of the environment to $y_t \in \mathcal{Y}$, which is a random draw from the distribution $\nu_{x_t}$, unknown to the agent. The agent then receives a reward $r_t = f(x_t, y_t)$, indicating the fit of action $x_t$ with the state $y_t$ of the environment; we assume here that $f$ can only take values in $\{0, 1\}$, which corresponds to the classical case where the reward is drawn according to a Bernoulli distribution of unknown parameter $\mu_x$. With these assumptions, the average reward associated with action $x$ is
$$\mu_x = \mathbb{E}_{y \sim \nu_x}\left[f(x, y)\right] \tag{1}$$
and we may define the optimal action $x^*$ as
$$x^* = \arg\max_{x \in \mathcal{X}} \mu_x \tag{2}$$
with $\mu^* = \mu_{x^*}$ the mean reward of the optimal action. After $T$ interactions with the environment, the agent will choose an action $\hat{x}_T$ as its recommendation (see Algorithm 1). The quality of the agent’s decision is then evaluated as the regret $\mu^* - \mu_{\hat{x}_T}$, i.e. the difference between the mean reward of the optimal action and the mean reward of the recommended action.
Let us elaborate further on the regret. Let
$$\Delta_x = \mu^* - \mu_x \tag{3}$$
be the difference between the value $\mu^*$ of the optimal action and the value $\mu_x$ of action $x$. If the agent recommends action $x$ with probability $P_T(x)$ after $T$ rounds, then the average difference between the value of its recommendation and the value of the optimal action is
$$R_T = \sum_{x \in \mathcal{X}} P_T(x)\, \Delta_x \tag{4}$$
which is the average regret after $T$ iterations of the agent’s strategy. Our goal is to find an action selection strategy for which the value of $R_T$ decreases quickly as the value of $T$ increases.

If $e_T$ is the probability that the agent does not recommend the best action after $T$ iterations, then, as $\Delta_x \le 1$ for all $x$, the (average) regret satisfies $R_T \le e_T$. In the following, we recall how a tight upper bound for $e_T$ can be derived.
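The interaction protocol and the regret defined above can be illustrated with a short simulation. This is only a minimal sketch under assumptions of our own: the arms are Bernoulli, the internal state of the environment is a uniform draw `y` in [0, 1), the reward function is `f(x, y) = 1` if `y < means[x]` and 0 otherwise, and the agent uses a naive round-robin allocation rather than any strategy discussed in this paper. The names `run_bai`, `simple_regret` and `average_regret` are hypothetical.

```python
import random


def run_bai(means, budget, seed=0):
    """Round-robin BAI on Bernoulli arms; returns the recommended arm index."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        x = t % n_arms                    # action chosen at time t (round-robin)
        y = rng.random()                  # internal state of the environment (assumed uniform)
        r = 1 if y < means[x] else 0      # reward f(x, y) in {0, 1}
        counts[x] += 1
        sums[x] += r
    # Empirical mean of each arm (0.0 if the arm was never pulled).
    estimates = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return max(range(n_arms), key=lambda i: estimates[i])


def simple_regret(means, recommended):
    """Difference between the best mean and the mean of the recommended arm."""
    return max(means) - means[recommended]


def average_regret(means, budget, n_runs=200):
    """Monte Carlo estimate of the average regret over independent runs."""
    total = sum(simple_regret(means, run_bai(means, budget, seed=s))
                for s in range(n_runs))
    return total / n_runs
```

Calling `average_regret` with increasing values of `budget` gives an empirical view of how the average regret shrinks as the number of rounds grows.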
2.2 Upper Confidence Bound Exploration-based strategy
Part of the difficulty in the BAI problem comes from the fact that the value of each action is the mean of a random variable that depends on an unknown probability distribution. The only way for an agent to estimate the value $\mu_x$ of action $x$ is to interact repeatedly with the environment to obtain a sample of rewards associated with $x$. Thus, a good strategy needs to strike a balance between sampling the most promising actions and sampling the actions for which we lack information. The Upper Confidence Bound Exploration (UCBE) strategy depicted in Algorithm 2, first described in Audibert and Bubeck (2010), is an efficient strategy for solving the best arm identification problem. It is based on a well-known and widely used family of UCB strategies (Lai and Robbins, 1985; Auer et al., 2002), which were proven to be optimal for solving the multi-armed bandit problem (Thompson, 1933).

Let $\mathcal{T}_x(t)$ be the set of rounds for which the agent picked action $x$ until time $t$, and
$$\hat{\mu}_x(t) = \frac{1}{|\mathcal{T}_x(t)|} \sum_{s \in \mathcal{T}_x(t)} r_s \tag{5}$$
be the empirical average of the reward for action $x$. We know from Hoeffding (1963) that $\hat{\mu}_x(t)$ and $\mu_x$ are tied by the relation
$$\mathbb{P}\left(|\hat{\mu}_x(t) - \mu_x| \ge \epsilon\right) \le 2 \exp\left(-2\, |\mathcal{T}_x(t)|\, \epsilon^2\right).$$
This means that, for all $\epsilon > 0$, there is a range of values centered around $\hat{\mu}_x(t)$ in which $\mu_x$ lies with probability at least $1 - 2 \exp(-2\, |\mathcal{T}_x(t)|\, \epsilon^2)$. The more the agent interacts with the environment with action $x$, the smaller this range of values is. The principle behind UCB is to choose, at each iteration, the action for which the upper bound of this range is the highest.
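The UCB principle described above can be made concrete with a few lines of code. The sketch below assumes the standard two-sided form of Hoeffding's inequality for rewards in [0, 1], in which the deviation probability after n pulls is at most 2 exp(-2 n eps^2); inverting it at a confidence level `delta` gives the half-width of the range. The function names and the parameter `delta` are our own illustrative choices.

```python
import math


def hoeffding_radius(n_pulls, delta):
    """Half-width eps such that 2 * exp(-2 * n_pulls * eps**2) == delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_pulls))


def ucb_choice(empirical_means, pull_counts, delta):
    """Index of the action whose upper confidence bound is the highest."""
    upper = [m + hoeffding_radius(n, delta)
             for m, n in zip(empirical_means, pull_counts)]
    return max(range(len(upper)), key=lambda i: upper[i])
```

With this rule, an action that has been sampled only a few times keeps a wide range and can be selected even when its empirical mean is slightly lower, which is how exploration arises.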
Audibert and Bubeck (2010) showed that UCBE admits the following upper bound on , when the exploration parameter is well tuned :
From this inequality, we can deduce a lower bound of the number of iterations to recommend the optimal arm with probability at least , for any :
The quantum modelling and accompanying algorithm proposed in this paper come with a theoretical result that quadratically improves this bound.
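For readers who prefer code, the following is a minimal sketch of a UCB-E style strategy in the spirit of Audibert and Bubeck (2010): each arm is pulled once, then the arm maximising the empirical mean plus an exploration bonus of the form sqrt(a / number of pulls) is pulled, and the arm with the highest empirical mean is recommended at the end. The Bernoulli reward simulation, the function name `ucb_e` and the value of the exploration parameter `a` used by a caller are assumptions made for illustration, not a tuned setting.

```python
import math
import random


def ucb_e(means, budget, a, seed=0):
    """UCB-E style exploration on Bernoulli arms.

    Returns (recommended arm index, number of pulls per arm).
    """
    n_arms = len(means)
    if budget < n_arms:
        raise ValueError("budget must allow at least one pull per arm")
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def pull(x):
        # Simulated Bernoulli reward with (hidden) mean means[x].
        counts[x] += 1
        sums[x] += 1 if rng.random() < means[x] else 0

    for x in range(n_arms):               # initialisation: one pull per arm
        pull(x)
    for _ in range(budget - n_arms):
        # Index = empirical mean + exploration bonus sqrt(a / pulls).
        index = [sums[i] / counts[i] + math.sqrt(a / counts[i])
                 for i in range(n_arms)]
        pull(max(range(n_arms), key=lambda i: index[i]))
    recommended = max(range(n_arms), key=lambda i: sums[i] / counts[i])
    return recommended, counts
```

A larger `a` spreads the budget more evenly across arms, while a smaller `a` concentrates pulls on the arms that currently look best.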
3 Quantum Amplitude Amplification
Suppose we have an unstructured, discrete set of $N$ elements and are interested in finding one marked element $x_0$; a simple probability argument shows that it takes an average of $N/2$ (exhaustive) queries to find the marked element. While it is well known that $O(N)$ is optimal with classical means, Lov Grover proved in 1996 that a simple quantum search algorithm turns any $O(N)$ brute-force problem into an $O(\sqrt{N})$ problem. This algorithm comes in many variants and has been rephrased in many ways, including in terms of resonance effects (Grover, 1996b) and quantum walks (Childs and Goldstone, 2004; Guillet et al., 2019). The principle behind the original Grover search algorithm is amplitude amplification (Brassard et al., 2000; Grover, 1998), in contrast with the probability amplification techniques used in classical randomized algorithms. In the classical case it is known that, if we have a procedure that verifies the output, then we can amplify the success probability by repeating the algorithm $k$ times, and the probability of recovering the good result is approximately $kp$, where $p$ is the probability of returning the searched value. Thus, in order to amplify the probability to $1$, we need to multiply the runtime by a factor $1/p$. In the quantum case, the basic principle is the same, but we amplify amplitudes instead of probabilities. Grover’s algorithm and all its generalisations have shown that, in order to achieve a maximum probability close to 1, we amplify for a number of rounds of order $O(1/\sqrt{p})$, which is quadratically faster than in the classical case. Before we show how to apply this result to the best arm identification problem, let us briefly recall how the amplitude-amplification algorithm works.
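The gap between classical probability amplification and amplitude amplification can be checked on a small numerical example. The sketch below assumes that one run of the algorithm succeeds with probability `p`, that classical runs are independent (so that k runs succeed with probability 1 - (1 - p)^k), and that the number of quantum rounds is the usual floor of pi / (4 arcsin(sqrt(p))). Both function names are hypothetical.

```python
import math


def classical_repetitions(p, target):
    """Smallest k such that 1 - (1 - p)**k >= target, for 0 < p, target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


def quantum_rounds(p):
    """Number of amplitude-amplification rounds, floor(pi / (4 * theta)),
    where sin(theta)**2 = p."""
    theta = math.asin(math.sqrt(p))
    return math.floor(math.pi / (4.0 * theta))
```

As `p` decreases, the first quantity grows like 1/p whereas the second grows like 1/sqrt(p), which is the quadratic gap discussed above.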
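First, we need to introduce an $N = 2^n$ dimensional state space $\mathcal{H}$, which can be supplied by $n$ qubits, spanned by the orthonormal set of states $\{|x\rangle\}$, with $x \in \{0, \dots, N-1\}$. In general, we say that, after the application of an arbitrary quantum operator $A$, the probability of finding the marked element $x_0$ is $p$, where this element is a point in the domain of a generic Boolean function $\chi : \{0, \dots, N-1\} \to \{0, 1\}$ such that $\chi(x) = 1$ if and only if $x = x_0$. This function induces a partition of $\mathcal{H}$ into two subspaces, $\mathcal{H}_1$ and $\mathcal{H}_0$: the good subspace $\mathcal{H}_1$ is spanned by the set of basis states $|x\rangle$ for which $\chi(x) = 1$, and the bad subspace $\mathcal{H}_0$ is its orthogonal complement. Any arbitrary state $|\psi\rangle = A|0\rangle$ belonging to $\mathcal{H}$ can be decomposed on the basis $\{|\psi_1\rangle, |\psi_0\rangle\}$ as follows
$$|\psi\rangle = \sin\theta\, |\psi_1\rangle + \cos\theta\, |\psi_0\rangle$$
where $|\psi_1\rangle$ and $|\psi_0\rangle$ are the normalised projections of $|\psi\rangle$ onto the two subspaces $\mathcal{H}_1$ and $\mathcal{H}_0$:
$$|\psi_1\rangle = \frac{1}{\sin\theta} \sum_{x : \chi(x) = 1} \langle x|\psi\rangle\, |x\rangle, \qquad |\psi_0\rangle = \frac{1}{\cos\theta} \sum_{x : \chi(x) = 0} \langle x|\psi\rangle\, |x\rangle$$
and $p = \sin^2\theta$ denotes the probability that measuring $|\psi\rangle$ produces a marked state (for which $\chi(x) = 1$). In general terms, one step of the algorithm is composed of two operators: (i) the oracle, as in the original Grover results; and (ii) the generalised Grover diffusion operator. The oracle is built using $\chi$ and reads:
$$O\,|x\rangle = (-1)^{\chi(x)}\, |x\rangle$$
which essentially marks the searched state with a minus sign. The diffusion operator is defined as:
$$D = A\, R_0\, A^{\dagger}$$
where $R_0 = 2|0\rangle\langle 0| - I$ is the usual reflection operator around $|0\rangle$ and $|\psi\rangle = A|0\rangle$. The composition of both operators leads to one evolution step of the amplitude-amplification algorithm:
$$G = D\, O = A\, R_0\, A^{\dagger}\, O$$
Notice that when $A = H^{\otimes n}$, the Walsh-Hadamard transform, the above algorithm reduces to the original Grover algorithm, where the initial state is a uniform superposition of states. The repeated application of $G$ for $k$ iterations leads to:
$$G^k |\psi\rangle = \sin\left((2k+1)\theta\right) |\psi_1\rangle + \cos\left((2k+1)\theta\right) |\psi_0\rangle \tag{6}$$
As in the Grover algorithm, for $k \approx \pi / (4\theta)$ and $\sin\theta = \sqrt{p} = 1/\sqrt{N}$, we get $k = O(\sqrt{N})$, leading to a quadratic speedup over classical algorithms.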
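The evolution described in this section is easy to verify numerically on a small instance. The sketch below is a classical state-vector simulation, not a quantum implementation; it assumes real amplitudes, an oracle that flips the sign of the marked amplitudes, and a diffusion operator written as the reflection 2|psi><psi| - I about the initial state. The helper names, the size `N = 64` and the marked index 5 are arbitrary choices for illustration.

```python
import math


def amplify(psi, marked, k):
    """Apply k rounds of (diffusion after oracle) to the real state vector psi."""
    state = list(psi)
    for _ in range(k):
        # Oracle: multiply the amplitude of every marked basis state by -1.
        state = [-a if m else a for a, m in zip(state, marked)]
        # Diffusion: reflection about the initial state, 2|psi><psi| - I.
        overlap = sum(p * a for p, a in zip(psi, state))
        state = [2.0 * overlap * p - a for p, a in zip(psi, state)]
    return state


def success_probability(state, marked):
    """Probability that a measurement returns a marked basis state."""
    return sum(a * a for a, m in zip(state, marked) if m)


N = 64
psi = [1.0 / math.sqrt(N)] * N            # uniform superposition
marked = [x == 5 for x in range(N)]       # a single marked element
theta = math.asin(math.sqrt(1.0 / N))     # sin(theta)^2 = initial success probability
k_opt = math.floor(math.pi / (4.0 * theta))
p_success = success_probability(amplify(psi, marked, k_opt), marked)
```

The simulated value `p_success` can be compared with the closed form sin((2 k + 1) theta)^2 for any number of rounds k.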
4 Quantum Best Arm Identification
Solving the best arm identification problem efficiently is generally limited by the amount of information the agent can recover from a single interaction with the environment. This is also the case in the unstructured classical search problem, as a single call to the indicator function $\chi$, the oracle, gives us information on a single element of the set. In general terms, the idea is to apply the same basic principle of the amplitude-amplification quantum algorithm to the best arm identification problem, where the reward function $f$ introduced in Section 2 now plays the role of the oracle. Indeed, in the same way that the Boolean function $\chi$ in a search problem recognises whether $x$ is the marked element we are looking for, the reward $f(x, y)$ indicates whether the pair $(x, y)$ corresponds to a desirable outcome (in that case, $f(x, y) = 1$) or not (then $f(x, y) = 0$), where $x$ is the action of the agent and $y$ the state of the environment. Thus, our strategy in the following is to apply the amplitude-amplification quantum algorithm to recover the desirable outcome, i.e., the optimal action of the agent.
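In order to properly apply the above quantum strategy, we define a composite Hilbert space $\mathcal{H} = \mathcal{H}_X \otimes \mathcal{H}_Y$, where $\mathcal{H}_X$ is the space of the quantum actions of the agent, spanned by the orthonormal basis $\{|x\rangle\}_{x \in \mathcal{X}}$, and $\mathcal{H}_Y$ is the space of the quantum environment states, spanned by the orthonormal basis $\{|y\rangle\}_{y \in \mathcal{Y}}$. Every vector $|\psi\rangle \in \mathcal{H}$, representing the whole composite system, decomposes on the basis $\{|x\rangle \otimes |y\rangle\}$. Notice that in the classical context, the agent’s action $x$ sets the internal state of the environment to $y$, according to a random distribution $\nu_x$, which is unknown to the agent. A straightforward way to recover the same condition is to prepare the state of the environment in a superposition $|\psi_x\rangle = \sum_{y \in \mathcal{Y}} \sqrt{\nu_x(y)}\, |y\rangle$, where $|\psi_x\rangle$ depends on the action $x$ chosen by the agent. This is achieved by preparing the initial state of the environment as follows:
$$E \left(|x\rangle \otimes |0\rangle\right) = |x\rangle \otimes |\psi_x\rangle$$
where $E$ is a unitary operator acting on the composite Hilbert space $\mathcal{H}$. Moreover, the initial state of the agent is prepared in an arbitrary superposition state, applying a unitary operator $A$ on the state space $\mathcal{H}_X$ of the agent:
$$A\,|0\rangle = \sum_{x \in \mathcal{X}} a_x\, |x\rangle$$
Once the initial state is prepared, we build the oracle $O_f$ on the composite Hilbert space of the agent and the environment, the action of which is:
$$O_f \left(|x\rangle \otimes |y\rangle\right) = (-1)^{f(x, y)}\, |x\rangle \otimes |y\rangle$$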
As for a search problem, we propose a quantum procedure that allows us to find the optimal action (for which ) using application of , with probability approaching 1. The quantum amplitude amplification algorithm and its analysis are then reminiscent of what was presented in Section 3. One round of the algorithm is defined by the composition of the above three operators, and the resulting algorithm, QBAI (Quantum Best Arm Identification), is depicted in Algorithm 3.
Iterating times the above algorithm we recover
which has the same form as Equation 6, where now $|\psi_1\rangle$ and $|\psi_0\rangle$ are the normalised projections of $|\psi\rangle$ onto the good subspace, spanned by the set of basis states $|x\rangle \otimes |y\rangle$ for which $f(x, y) = 1$, and onto the bad subspace, which is its orthogonal complement. We know from Section 3 that, to recover the optimal action, we need to maximise the sine. Let us choose an alternative, but equivalent, path. Let us compute the recommendation probability. After a straightforward computation and a few simplifications, we obtain:
where , and . The recommendation probability for the optimal action is then recovered when , i.e. when .
Summarizing the results so far:
The probability that QBAI will recommend the optimal action is maximized when . It follows that .
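To give a feel for the behaviour of such a procedure, the following classical state-vector simulation runs amplitude amplification on a toy composite agent-environment system. It is only a sketch under assumptions that are ours and are not taken from the text above: the environment has two internal states `y` in {0, 1}, the reward is `f(x, y) = y`, the law of `y` given action `x` is Bernoulli with mean `means[x]`, and the agent starts in a uniform superposition over actions. Under these assumptions, a direct computation from Equation 6 gives the closed form implemented in `predicted_distribution`, with sin(theta)^2 equal to the average of the means; this closed form is our own derivation for the toy model and should not be read as the formula of the paper. All function names are hypothetical.

```python
import math


def qbai_distribution(means, k):
    """Distribution of the measured action after k amplification rounds."""
    n = len(means)
    psi, good = [], []
    for mu in means:
        # Amplitudes of |x, y=0> and |x, y=1>: uniform over x, sqrt(nu_x(y)) over y.
        psi += [math.sqrt((1.0 - mu) / n), math.sqrt(mu / n)]
        good += [False, True]             # f(x, y) = y marks the rewarded states
    state = list(psi)
    for _ in range(k):
        # Oracle: flip the sign of the rewarded basis states.
        state = [-a if g else a for a, g in zip(state, good)]
        # Diffusion: reflection about the initial state.
        overlap = sum(p * a for p, a in zip(psi, state))
        state = [2.0 * overlap * p - a for p, a in zip(psi, state)]
    # Measuring the action register amounts to marginalising over y.
    return [state[2 * x] ** 2 + state[2 * x + 1] ** 2 for x in range(n)]


def predicted_distribution(means, k):
    """Closed form for the toy model; requires 0 < average of means < 1."""
    n = len(means)
    avg = sum(means) / n
    theta = math.asin(math.sqrt(avg))
    s = math.sin((2 * k + 1) * theta) ** 2
    c = math.cos((2 * k + 1) * theta) ** 2
    return [(mu * s / avg + (1.0 - mu) * c / (1.0 - avg)) / n for mu in means]


means = [0.5] + [0.01] * 15               # one good arm among 16
theta = math.asin(math.sqrt(sum(means) / len(means)))
k_best = round(math.pi / (4.0 * theta) - 0.5)
dist = qbai_distribution(means, k_best)
```

In this toy model the probability of measuring the best action at `k_best` approaches the ratio of its mean to the sum of all means, so it is close to 1 only when the other arms have small means.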
In order to compare with the classical bounds, we need to define . For the sake of simplicity, let us consider so that , which translates into . From Theorem 1, we need
rounds to recommend the optimal action with probability . Let us recall that UCBE needs at least rounds to recommend the optimal action with the same probability. The ratio between both probabilities is of order . In the case , then and the complexity gain for the quantum algorithm is quadratic with respect to the number of actions. Otherwise, since , we get that , and the speedup is once again quadratic with respect to the number of actions. This result is sufficient to prove that QBAI is quadratically faster than a classical algorithm to recommend the optimal arm with probability at least .
5 Conclusion
We studied the problem of Best Arm Identification (BAI) in a quantum setting. We proposed a quantum modeling of this problem when both the learning agent and the environment are quantum. We introduced a quantum bandit algorithm based on quantum amplitude amplification to solve the quantum BAI problem and showed that it is able to get the optimal solution quadratically faster than what is known to hold in the classical case. Our results confirm that quantum algorithms can have a significant impact on reinforcement learning and open up new opportunities for more efficient bandit algorithms.
Our aim with this paper has been to provide a direct application of amplitude amplification to the best arm identification problem, and to show that it exhibits the same behavior, in terms of efficiency, as it does in other problems of the same nature. We proposed a direct quantum analogue of the multi-armed bandit problem, and an analytical proof that amplitude amplification can find the best action quadratically faster than the best known classical algorithm with respect to the number of actions. Future extensions of this work might address the following questions: (i) could this algorithm be adapted to recommend the optimal action with an arbitrarily small margin of error? (ii) is it possible to treat the case where the reward function takes values in ? (iii) can this algorithm be adapted to solve more complex decision-making problems? (iv) can it be proven or disproven that amplitude amplification is optimal for this problem, as it is for other unstructured search problems?
References

 Aïmeur et al. (2013) Esma Aïmeur, Gilles Brassard, and Sébastien Gambs. Quantum speedup for unsupervised learning. Machine Learning, 90(2):261–287, 2013.
 Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. Best Arm Identification in Multi-Armed Bandits. In COLT - 23rd Conference on Learning Theory, 13 p., Haifa, Israel, June 2010.
 Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002.
 Bertsekas and Tsitsiklis (1996) D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.
 Biamonte et al. (2017) Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.
 Brassard et al. (2000) Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum Amplitude Amplification and Estimation. arXiv e-prints, art. quant-ph/0005055, May 2000.
 Childs and Goldstone (2004) Andrew M Childs and Jeffrey Goldstone. Spatial search by quantum walk. Physical Review A, 70(2):022314, 2004.
 Ciliberto et al. (2018) Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018.
 Dong et al. (2008) Daoyi Dong, Chunlin Chen, Hanxiong Li, and Tzyh-Jong Tarn. Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(5):1207–1220, 2008.
 Dunjko et al. (2015) Vedran Dunjko, Jacob M. Taylor, and Hans J. Briegel. Framework for learning agents in quantum environments. arXiv e-prints, art. arXiv:1507.08482, Jul 2015.
 Dunjko et al. (2016) Vedran Dunjko, Jacob M Taylor, and Hans J Briegel. Quantum-enhanced machine learning. Physical Review Letters, 117(13):130501, 2016.
 Dunjko et al. (2018) Vedran Dunjko, Jacob M. Taylor, and Hans J. Briegel. Advances in Quantum Reinforcement Learning. arXiv e-prints, art. arXiv:1811.08676, Nov 2018.
 Durr and Hoyer (1996) Christoph Durr and Peter Hoyer. A Quantum Algorithm for Finding the Minimum. arXiv e-prints, art. quant-ph/9607014, Jul 1996.

 Girgin et al. (2009) Sertan Girgin, Manuel Loth, Rémi Munos, Philippe Preux, and Daniil Ryabko. Recent Advances in Reinforcement Learning. Springer, Lecture Notes in Artificial Intelligence (LNAI), vol. 5323, February 2009.
 Gosavi (2009) Abhijit Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21:178–192, 05 2009. doi: 10.1287/ijoc.1080.0305.
 Grover (1996a) Lov K. Grover. A fast quantum mechanical algorithm for database search, 1996a.

 Grover (1996b) Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 212–219, 1996b.
 Grover (1998) Lov K Grover. Quantum computers can search rapidly by using almost any transformation. Physical Review Letters, 80(19):4329, 1998.
 Guillet et al. (2019) Stéphane Guillet, Mathieu Roget, Pablo Arrighi, and Giuseppe Di Molfetta. The Grover search as a naturally occurring phenomenon. arXiv preprint arXiv:1908.11213, 2019.
 Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 Ivanov and D’yakonov (2019) Sergey Ivanov and Alexander D’yakonov. Modern Deep Reinforcement Learning Algorithms. arXiv e-prints, art. arXiv:1906.10025, Jun 2019.
 Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

 Kapoor et al. (2016) Ashish Kapoor, Nathan Wiebe, and Krysta Svore. Quantum perceptron models. In Advances in Neural Information Processing Systems, pages 3999–4007, 2016.
 Kerenidis and Prakash (2017) Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. 2017.
 Lai and Robbins (1985) T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
 Lamata (2017) Lucas Lamata. Basic protocols in quantum reinforcement learning with superconducting circuits. Scientific Reports, 7(1):1–10, 2017.
 Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
 Naruse et al. (2015a) Makoto Naruse, Martin Berthel, Aurélien Drezet, Serge Huant, Masashi Aono, Hirokazu Hori, and Song-Ju Kim. Single-photon decision maker. Scientific Reports, 5(1):1–9, 2015a.
 Naruse et al. (2015b) Makoto Naruse, Martin Berthel, Aurélien Drezet, Serge Huant, Masashi Aono, Hirokazu Hori, and Song-Ju Kim. Single-photon decision maker. Scientific Reports, 5(1):13253, 2015b. ISSN 2045-2322. doi: 10.1038/srep13253. URL https://doi.org/10.1038/srep13253.

 Rebentrost et al. (2014) Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014.
 Schuld and Petruccione (2018) Maria Schuld and Francesco Petruccione. Supervised learning with quantum computers, volume 17. Springer, 2018.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
 Slivkins (2019) Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. arXiv e-prints, art. arXiv:1904.07272, Apr 2019.
 Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
 Thompson (1933) William R. Thompson. On the likelihood that one unknown probability distribution exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 12 1933.
 Wittek (2014) Peter Wittek. Quantum machine learning: what quantum computing means to data mining. Academic Press, 2014.
 Yen-Chi Chen et al. (2019) Samuel Yen-Chi Chen, Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Xiaoli Ma, and Hsi-Sheng Goan. Variational Quantum Circuits for Deep Reinforcement Learning. arXiv e-prints, art. arXiv:1907.00397, Jun 2019.