
Quantum Bandits

by   Balthazar Casalé, et al.

We consider the quantum version of the bandit problem known as best arm identification (BAI). We first propose a quantum modeling of the BAI problem, which assumes that both the learning agent and the environment are quantum; we then propose an algorithm based on quantum amplitude amplification to solve BAI. We formally analyze the behavior of the algorithm on all instances of the problem and we show, in particular, that it is able to get the optimal solution quadratically faster than what is known to hold in the classical case.



1 Introduction

Many decision-making problems involve learning by interacting with the environment and observing what rewards result from these interactions. In the field of machine learning, this line of research falls into what is referred to as reinforcement learning (RL), and algorithms to train artificial agents that interact with an environment have been studied extensively (Sutton and Barto, 2018; Kaelbling et al., 1996; Bertsekas and Tsitsiklis, 1996). We are here interested in the best arm identification (BAI) problem from the family of bandit problems, which encompasses the set of RL problems where the interactions with the environment give rise to immediate rewards and where long-term planning is unnecessary (see the survey of Lattimore and Szepesvári, 2020). More precisely, we are interested in a quantum version of the BAI problem, for which we design a quantum algorithm capable of solving it.

Quantum machine learning is a research field at the interface of quantum computing and machine learning where the goal is to use quantum computing paradigms and technologies to improve the speed and performance of learning algorithms (Wittek, 2014; Biamonte et al., 2017; Ciliberto et al., 2018; Schuld and Petruccione, 2018). A fundamental concept in quantum computing is quantum superposition, which is the means by which quantum algorithms like that of Grover (1996b), one of the most popular quantum algorithms, succeed in finding one item in an unstructured database of $N$ items in time $O(\sqrt{N})$, thereby beating the classical $O(N)$ time requirement. Recent works have investigated the use of Grover's quantum search algorithm to enhance machine learning and have proved that it can provide non-trivial improvements not only in the computational complexity but also in the statistical performance of these models (Aïmeur et al., 2013; Wittek, 2014; Kapoor et al., 2016). Beyond Grover's algorithm, quantum algorithms for linear algebra, such as quantum matrix inversion and quantum singular value decomposition, were recently proposed and used in the context of machine learning (Rebentrost et al., 2014; Kerenidis and Prakash, 2017). Works on quantum reinforcement learning are emerging (Dong et al., 2008; Naruse et al., 2015a; Dunjko et al., 2016; Lamata, 2017), and our paper aims at providing a new piece of knowledge in that area, by bringing two contributions: i) a formalization of the best arm identification problem in a quantum setting, and ii) a quantum algorithm to solve this problem that is quadratically faster than classical ones.

The paper is organized as follows. In Section 2, we formulate the best arm identification (BAI) problem, briefly review the upper confidence bound, and illustrate how it can be used to solve the BAI problem. In Section 3, we describe quantum amplitude amplification, which is at the core of Grover's algorithm and forms the basis of our approach. Our main results are in Section 4: we provide our quantum modeling of the BAI problem, which assumes that both the learning agent and the environment are quantum, and we then propose an algorithm based on quantum amplitude amplification to solve BAI, which is able to get the optimal solution quadratically faster than what is known to hold in the classical case. Section 5 concludes the paper.

2 Best Arm Identification

2.1 Stochastic Multi-Armed Bandits and the BAI Problem

Bandit problems are RL problems in which an agent evolves in an environment with which it interacts by choosing, at each time step, an action (or arm); each action taken provides the agent with a reward, which measures the quality of the chosen action (see the reward function below and, more generally, Lattimore and Szepesvári, 2020).

The bandit problem we want to study from a quantum point of view is that of best arm identification from stochastic multi-armed bandits (Audibert and Bubeck, 2010). It comes with the following assumptions: the set of actions $X$ is finite and discrete, with $|X| = N$, and when action $x_t$ is chosen at time $t$, the reward $r_t$ depends upon the independent realisation (called $y_t$ afterwards) of a random variable distributed according to some unknown (but fixed) law $\nu_{x_t}$.

The BAI problem is to devise a strategy of action selection for the agent such that, after a predefined number of interactions, the agent is able to identify the best action with the best possible guarantees.

We may go one step further in the formal statement of the problem and, along the way, use a modelling that is both in line with the classical BAI problem and suitable for its quantum extension. In particular, in order to take into account the unknown distributions $\nu_x$ (one per action $x$), we explicitly introduce $Y$, the set of all possible internal states of the environment (this notion of internal state of the environment is uncommon in the classical bandit literature). The agent's action $x_t$ sets the internal state of the environment to $y_t$, which is a random draw from distribution $\nu_{x_t}$, unknown to the agent. The agent then receives a reward $r_t = f(x_t, y_t)$, indicating the fit of action $x_t$ with the state $y_t$ of the environment; we here assume that $f$ can only take values in $\{0, 1\}$, which corresponds to the classical case where the reward is drawn according to a Bernoulli distribution of unknown parameter $\mu_x$. With these assumptions, the average reward associated with action $x$ is

$$\mu_x = \mathbb{E}_{y \sim \nu_x}\left[f(x, y)\right],$$

and we may define the optimal action as

$$x^* = \arg\max_{x} \mu_x,$$

and $\mu^* = \mu_{x^*}$ as the mean reward of the optimal action. After $n$ interactions with the environment, the agent will choose an action $\hat{x}_n$ as its recommendation (see Algorithm 1). The quality of the agent's decision is then evaluated as the regret $r_n = \mu^* - \mu_{\hat{x}_n}$, i.e. the difference between the mean reward of the optimal action and the mean reward of the recommended action.

Data: A number of rounds $n$;
Result: a recommended action;
for $t = 1$ to $n$ do
       the agent chooses the action $x_t$;
       the environment picks an internal state $y_t$ following $\nu_{x_t}$;
       the agent perceives the reward $f(x_t, y_t)$;
end for
the agent returns the recommended action;
Algorithm 1 The best arm identification problem
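The interaction protocol of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rewards are Bernoulli with hypothetical means `means`, and the names `run_bai`, `round_robin` and `best_empirical_mean` are illustrative and do not come from the paper. The round-robin sampling rule is a naive baseline chosen for simplicity, not a strategy advocated here.

```python
import random

def run_bai(means, n_rounds, choose, recommend, seed=0):
    """Play n_rounds against Bernoulli arms and return the recommended arm."""
    rng = random.Random(seed)
    history = []  # (action, reward) pairs observed so far
    for t in range(n_rounds):
        x = choose(t, history, len(means))        # the agent chooses an action
        r = 1 if rng.random() < means[x] else 0   # the environment yields a 0/1 reward
        history.append((x, r))
    return recommend(history, len(means))

def round_robin(t, history, n_arms):
    """Naive sampling rule: pull the arms one after the other."""
    return t % n_arms

def best_empirical_mean(history, n_arms):
    """Recommend the arm with the highest empirical average reward."""
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for x, r in history:
        pulls[x] += 1
        wins[x] += r
    return max(range(n_arms), key=lambda x: wins[x] / pulls[x] if pulls[x] else 0.0)

means = [0.2, 0.5, 0.8]          # hypothetical arm means, unknown to the agent
rec = run_bai(means, 600, round_robin, best_empirical_mean)
regret = max(means) - means[rec]  # regret of this single run
```

Averaging `regret` over many seeds gives a Monte Carlo estimate of the average regret of the chosen strategy.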

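Let us elaborate further on the regret. Writing $\mu_x$ for the mean reward of action $x$ and $\mu^*$ for the mean reward of the optimal action $x^*$, let

$$\Delta_x = \mu^* - \mu_x$$

be the difference between the value of the optimal action and the value of action $x$. If the agent recommends the action $x$ with probability $P_n(x)$ after $n$ rounds, then the average difference between the value of its recommendation and the value of the optimal action is

$$\mathbb{E}[r_n] = \sum_{x} \Delta_x P_n(x),$$

which is the average regret after $n$ iterations of the agent's strategy. Our goal is to find an action selection strategy for which the value of $\mathbb{E}[r_n]$ decreases quickly as the value of $n$ increases.

If $e_n$ is the probability that the agent does not recommend the best action after $n$ iterations, then, as $\Delta_x \le 1$ for every action $x$, the (average) regret is $\mathbb{E}[r_n] = \sum_{x \neq x^*} \Delta_x P_n(x) \le \sum_{x \neq x^*} P_n(x) = e_n$, so that $\mathbb{E}[r_n] \le e_n$. In the following, we recall how a tight upper bound for $e_n$ can be derived.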

2.2 Upper Confidence Bound Exploration-based strategy

Part of the difficulty in the BAI problem comes from the fact that the value of each action is the mean of a random variable that depends on an unknown probability distribution. The only way for an agent to estimate the value $\mu_x$ of action $x$ is to repeatedly interact with the environment to obtain a sample of rewards associated with $x$. Thus, a good strategy needs to find a balance between sampling the most promising actions and sampling the actions for which we lack information. The Upper Confidence Bound Exploration (UCB-E) strategy depicted in Algorithm 2, first described in Audibert and Bubeck (2010), is an efficient strategy to solve the best arm identification problem. It is based on a very well known and widely used family of UCB strategies (Lai and Robbins, 1985; Auer et al., 2002), which were proven to be optimal for solving the multi-armed bandit problem (Thompson, 1933).

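Let $T_x(t)$ be the set of rounds for which the agent picked action $x$ until time $t$, and

$$\hat{\mu}_{x,t} = \frac{1}{|T_x(t)|} \sum_{s \in T_x(t)} r_s$$

be the empirical average of the reward for action $x$, where $r_s$ is the reward received at round $s$. We know from Hoeffding (1963) that $\hat{\mu}_{x,t}$ and the true mean reward $\mu_x$ are tied by the relation

$$P\left(|\hat{\mu}_{x,t} - \mu_x| > \varepsilon\right) \le 2 \exp\left(-2\,|T_x(t)|\,\varepsilon^2\right).$$

This means that, for all $\varepsilon > 0$, there is a range of values centered around $\hat{\mu}_{x,t}$ in which $\mu_x$ lies with probability at least $1 - 2\exp(-2\,|T_x(t)|\,\varepsilon^2)$. The more the agent interacts with the environment with action $x$, the smaller this range of values is. The principle behind UCB is to choose, at each iteration, the action for which the upper bound of this range is the highest.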
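To make the shrinking of this range concrete, the short sketch below inverts the standard two-sided Hoeffding bound for rewards in $[0, 1]$, $P(|\hat{\mu} - \mu| > \varepsilon) \le 2\exp(-2 m \varepsilon^2)$ with $m$ the number of pulls, to obtain the half-width $\varepsilon$ of the confidence range at a given confidence level $1 - \delta$. The function name `hoeffding_radius` and the parameter `delta` are illustrative assumptions, not notation from the paper.

```python
import math

def hoeffding_radius(n_pulls, delta):
    """Half-width eps such that |empirical mean - true mean| <= eps
    with probability at least 1 - delta, for n_pulls i.i.d. rewards in [0, 1].
    Obtained by solving 2 * exp(-2 * n_pulls * eps**2) = delta for eps."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_pulls))

# The range shrinks like 1 / sqrt(n_pulls) as an action is sampled more often.
radii = {m: hoeffding_radius(m, delta=0.05) for m in (10, 100, 1000)}
```

Multiplying the number of pulls by 100 divides the half-width by 10, which is why rarely sampled actions keep a large upper confidence bound and are still explored.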

Data: A number of trials ; , an exploration parameter;
Result: a recommended action;
for and ;
for  to  do
       the agent chooses the action
       the environment picks an internal state according to
       the agent perceives the reward
       the agent updates the values of to take into account
end for
the agent returns the action with the highest empirical average reward;
Algorithm 2 UCB-E algorithm
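A minimal classical sketch of UCB-E is given below, following the description in Audibert and Bubeck (2010): each arm receives the index "empirical mean plus $\sqrt{a/s}$", where $s$ is the number of times the arm has been pulled and $a$ is the exploration parameter, an arm that has never been pulled has an infinite index, and the final recommendation is the arm with the highest empirical mean. The Bernoulli environment, the function name `ucb_e` and the argument names are assumptions made for illustration.

```python
import math
import random

def ucb_e(means, n_rounds, a, seed=0):
    """Run UCB-E on Bernoulli arms with (hypothetical) means and return the recommended arm."""
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms     # number of times each arm was chosen
    sums = [0.0] * n_arms    # cumulated reward of each arm

    def index(x):
        # Upper confidence index; unpulled arms are tried first.
        if pulls[x] == 0:
            return math.inf
        return sums[x] / pulls[x] + math.sqrt(a / pulls[x])

    for _ in range(n_rounds):
        x = max(range(n_arms), key=index)                 # most optimistic arm
        reward = 1.0 if rng.random() < means[x] else 0.0  # Bernoulli reward
        pulls[x] += 1
        sums[x] += reward

    # Recommend the arm with the best empirical mean among the pulled arms.
    return max(range(n_arms),
               key=lambda x: sums[x] / pulls[x] if pulls[x] else -math.inf)

recommended = ucb_e([0.2, 0.5, 0.8], n_rounds=500, a=2.0)
```

The value `a=2.0` is arbitrary here; the guarantee recalled in the text only holds when the exploration parameter is tuned to the difficulty of the problem.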

Audibert and Bubeck (2010) showed that UCB-E admits the following upper bound on $e_n$, the probability of not recommending the best action after $n$ rounds, when the exploration parameter is well tuned:

From this inequality, we can deduce a lower bound of the number of iterations to recommend the optimal arm with probability at least , for any :

The quantum modelling and accompanying algorithm proposed in this paper come with a theoretical result that quadratically improves these bounds.

3 Quantum Amplitude Amplification

If we have an unstructured, discrete set of $N$ elements and we are interested in finding one marked element $x_0$, a simple probability argument shows that it takes an average of $N/2$ (exhaustive) queries to find the marked element. While it is well known that $O(N)$ is optimal with classical means, Lov Grover proved in 1996 that a simple quantum search algorithm turns any $O(N)$ brute-force problem into an $O(\sqrt{N})$ problem. This algorithm comes in many variants and has been rephrased in many ways, including in terms of resonance effects (Grover, 1996b) and quantum walks (Childs and Goldstone, 2004; Guillet et al., 2019). The principle behind the original Grover search algorithm is amplitude amplification (Brassard et al., 2000; Grover, 1998), in contrast with the probability amplification techniques used in classical randomized algorithms. In the classical case it is known that, if we have a procedure that verifies the output, then we can amplify the success probability by repeating the algorithm $k$ times, and the probability of recovering the correct result is approximately $kp$, where $p$ is the probability of returning the searched value in a single run. Thus, in order to amplify the probability to 1, we need to multiply the runtime by a factor $1/p$. In the quantum case, the basic principle is the same, but we amplify amplitudes instead of probabilities. Grover's algorithm and all its generalisations have shown that, in order to achieve a maximum probability close to 1, we amplify for a number of rounds which is $O(1/\sqrt{p})$, hence quadratically faster than in the classical case. Before we show how to apply this result to the best arm identification problem, let us briefly recall how the amplitude-amplification algorithm works.
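First, we need to introduce an $N$-dimensional state space $\mathcal{H}$, which can be supplied by $\lceil \log_2 N \rceil$ qubits, spanned by the orthonormal set of states $|x\rangle$, with $x \in \{1, \dots, N\}$. In general, we say that, after the application of an arbitrary quantum operator $A$, the probability to find the marked element is $p$, where this element is a point $x_0$ in the domain of a generic Boolean function $\chi : \{1, \dots, N\} \to \{0, 1\}$ such that $\chi(x_0) = 1$. This function induces a partition of $\mathcal{H}$ into two subspaces, $\mathcal{H}_1$ and $\mathcal{H}_0$: the good subspace, spanned by the set of basis states for which $\chi(x) = 1$, and the bad subspace, which is its orthogonal complement. Any arbitrary state $|\psi\rangle$ belonging to $\mathcal{H}$ can be decomposed as follows:

$$|\psi\rangle = \sin\theta\,|\psi_1\rangle + \cos\theta\,|\psi_0\rangle,$$

where $|\psi_1\rangle$ and $|\psi_0\rangle$ are the normalised projections of $|\psi\rangle$ in the two subspaces $\mathcal{H}_1$ and $\mathcal{H}_0$:

$$|\psi_1\rangle = \frac{1}{\sqrt{p}} \sum_{x : \chi(x) = 1} \langle x | \psi \rangle\, |x\rangle, \qquad |\psi_0\rangle = \frac{1}{\sqrt{1 - p}} \sum_{x : \chi(x) = 0} \langle x | \psi \rangle\, |x\rangle,$$

and $p = \sin^2\theta$ denotes the probability that measuring $|\psi\rangle$ produces a marked state (for which $\chi(x) = 1$). In general terms, one step of the algorithm is composed of two operators: (i) the oracle, as in the original Grover results, and (ii) the generalised Grover diffusion operator. The oracle is built using $\chi$ and reads:

$$O_\chi |x\rangle = (-1)^{\chi(x)} |x\rangle,$$

which essentially marks the searched state with a minus sign. The diffusion operator is defined as:

$$D = -A S_0 A^{-1},$$

where $S_0 = I - 2|0\rangle\langle 0|$ is the usual reflection operator around $|0\rangle$ and $|\psi\rangle = A|0\rangle$. The composition of both operators leads to one evolution step of the amplitude-amplification algorithm:

$$Q = D\, O_\chi = -A S_0 A^{-1} O_\chi.$$

Notice that when $A = H^{\otimes \log_2 N}$, the Walsh-Hadamard transform, the above algorithm reduces to the original Grover algorithm, where the initial state is a uniform superposition of states. The repeated application of $Q$ leads, after $k$ iterations, to:

$$Q^k |\psi\rangle = \sin\big((2k + 1)\theta\big)\,|\psi_1\rangle + \cos\big((2k + 1)\theta\big)\,|\psi_0\rangle.$$

As in the Grover algorithm, for $k \approx \pi / (4\theta)$ and small $p$ (so that $\theta \approx \sqrt{p}$), $\sin^2\big((2k + 1)\theta\big) \approx 1$ after $k = O(1/\sqrt{p})$ iterations, leading to a quadratic speedup over classical algorithms.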
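The behaviour described in this section can be checked numerically on a small state vector. The sketch below is a classical NumPy simulation, not a quantum implementation, and it assumes the standard amplitude-amplification step of Brassard et al. (2000): a sign flip on the good basis states followed by the reflection $2|\psi\rangle\langle\psi| - I$ about the initial state $|\psi\rangle$. The function names and the example with 64 elements and one marked item are illustrative assumptions.

```python
import numpy as np

def amplitude_amplification(psi, good, k):
    """Apply k amplification steps to the initial state psi.
    psi  : normalised amplitude vector (the state A|0>)
    good : boolean mask of the basis states for which the oracle answers 1
    """
    psi = np.asarray(psi, dtype=complex)
    sign = np.where(good, -1.0, 1.0)   # oracle: phase -1 on good states
    state = psi.copy()
    for _ in range(k):
        state = sign * state                                # oracle call
        state = 2.0 * psi * np.vdot(psi, state) - state     # reflection about psi
    return state

def success_probability(state, good):
    """Probability that a measurement returns a good basis state."""
    return float(np.sum(np.abs(state[good]) ** 2))

# Grover search as a special case: uniform superposition, one marked element.
n_items = 64
psi = np.full(n_items, 1.0 / np.sqrt(n_items))
good = np.zeros(n_items, dtype=bool)
good[5] = True
theta = np.arcsin(np.sqrt(1.0 / n_items))
k_opt = int(np.floor(np.pi / (4.0 * theta)))   # about (pi / 4) * sqrt(n_items)
p_opt = success_probability(amplitude_amplification(psi, good, k_opt), good)
```

For every number of steps `k`, the simulated success probability coincides with $\sin^2((2k + 1)\theta)$, and `k_opt` grows like $\sqrt{N}$ rather than $N$.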

4 Quantum Best Arm Identification

Solving the best arm identification problem efficiently is generally limited by the amount of information the agent needs to recover from a single interaction with the environment. This is also the case in the unstructured classical search problem, as a single call to the indicator function $\chi$, the oracle, gives us information on a single element of the set. In general terms, the idea is to apply the same basic principle of the amplitude-amplification quantum algorithm to the best arm identification problem, where the reward function $f$ introduced in Section 2 now plays the role of the oracle. Indeed, in the same way that the Boolean function $\chi$ in a search problem recognises whether $x$ is the marked element we are looking for, the reward $f(x, y)$ indicates whether $(x, y)$ corresponds to a desirable outcome (in that case, $f(x, y) = 1$) or not (then $f(x, y) = 0$), where $x$ is the action of the agent and $y$ the state of the environment. Thus, our strategy in the following is to apply the amplitude-amplification quantum algorithm to recover the desirable outcome, i.e., the optimal action of the agent.
In order to properly apply the above quantum strategy, we define a composite Hilbert space $\mathcal{H} = \mathcal{H}_X \otimes \mathcal{H}_Y$, where $\mathcal{H}_X$ is the space of the quantum actions of the agent, spanned by the orthonormal basis $\{|x\rangle\}_{x \in X}$ indexed by the set of actions $X$, and $\mathcal{H}_Y$ is the space of the quantum environment states, spanned by the orthonormal basis $\{|y\rangle\}_{y \in Y}$ indexed by the set of environment states $Y$. Every vector $|\psi\rangle \in \mathcal{H}$, representing the whole composite system, decomposes on the basis $\{|x\rangle \otimes |y\rangle\}_{x \in X,\, y \in Y}$. Notice that in the classical context, the agent's action $x$ sets the internal state of the environment to $y$, according to a random distribution $\nu_x$, which is unknown to the agent. A straightforward way to recover the same condition is to prepare the state of the environment in a superposition , where depends on the action chosen by the agent. This is achieved by preparing the initial state of the environment as follows:

where is a unitary operator acting on the composite Hilbert space . Moreover, the initial state of the agent is prepared in an arbitrary superposition state, by applying a unitary operator on the state space of the agent :

Once the initial state is prepared, we build the oracle on the composite Hilbert space of the agent and the environment, the action of which is:

As for a search problem, we propose a quantum procedure that allows us to find the optimal action (for which ) using application of , with probability approaching 1. The quantum amplitude amplification algorithm and its analysis are then reminiscent of what was presented in Section 3. One round of the algorithm is defined by the composition of the above three operators, and the resulting algorithm QBAI (Quantum Best Arm Identification) is depicted in Algorithm 3.

Data: A unitary operator acting on ;
A unitary operator acting on the composite system agent-environment;
number of rounds;
Result: The recommended action
prepare a quantum register to the state ;
apply to the state of the register ;
for  to  do
       apply to the state of the register ;
end for
Algorithm 3 Quantum Best Arm Identification (QBAI)
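The effect of Algorithm 3 on the recommendation probabilities can be explored with a small classical simulation. The sketch below is not the authors' implementation and rests on several simplifying assumptions that are ours: the agent starts in the uniform superposition over actions, the environment has two internal states $y \in \{0, 1\}$, the reward is $f(x, y) = y$, and action $x$ leads to $y = 1$ with probability `mu[x]` (hypothetical Bernoulli means). One round is the oracle sign flip followed by the reflection about the initial composite state, as in Section 3.

```python
import numpy as np

def qbai_recommendation_probs(mu, k):
    """Probability of measuring each action after k amplification rounds.
    mu : hypothetical mean rewards, mu[x] = P(y = 1 | action x)
    """
    mu = np.asarray(mu, dtype=float)
    n_actions = mu.size
    # Initial composite state psi[x, y]: uniform over actions,
    # environment amplitude sqrt(1 - mu[x]) on y = 0 and sqrt(mu[x]) on y = 1.
    psi = np.stack([np.sqrt(1.0 - mu), np.sqrt(mu)], axis=1) / np.sqrt(n_actions)
    sign = np.array([1.0, -1.0])   # oracle: phase -1 where f(x, y) = y = 1
    state = psi.copy()
    for _ in range(k):
        state = state * sign                                  # oracle call
        state = 2.0 * psi * np.sum(psi * state) - state       # reflection about psi
    # Measuring the action register: sum probabilities over environment states.
    return np.sum(state ** 2, axis=1)

mu = [0.1, 0.1, 0.1, 0.9]                    # the last action is the best one
before = qbai_recommendation_probs(mu, k=0)  # uniform: 1/4 for each action
after = qbai_recommendation_probs(mu, k=1)   # best action is now favoured
```

In this toy instance a single round already raises the probability of recommending the best action well above the uniform value, but not to 1, since suboptimal actions also produce rewards with non-zero probability.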

Iterating times the above algorithm we recover

which is of the same form as Equation 6, where now are the normalised projections of in the two subspaces and , respectively the good subspace spanned by the set of basis states for which and the bad subspace, which is its orthogonal complement. We know from Section 3 that to recover the optimal action we need to maximise the sine. Let us choose an alternative, but equivalent, path. Let us compute the recommendation probability . After a straightforward computation and a few simplifications, we obtain:

where , and . The recommendation probability for the optimal action is then recovered when , i.e. when .

Summarizing the results so far:

Theorem 1. The probability that QBAI will recommend the optimal action is maximized when . It follows that .

In order to compare with the classical bounds, we need to define . For the sake of simplicity, let us consider so that , which translates into . From Theorem 1, we need

rounds to recommend the optimal action with probability . Let us recall that UCB-E needs at least rounds to recommend the optimal action with the same probability. The ratio between both probabilities is of order . In the case , then and the complexity gain for the quantum algorithm is quadratic with respect to the number of actions. Otherwise, since , we get that , and the speedup is once again quadratic with respect to the number of actions. This result is sufficient to prove that QBAI is quadratically faster than a classical algorithm to recommend the optimal arm with probability at least .

5 Conclusion

We studied the problem of Best Arm Identification (BAI) in a quantum setting. We proposed a quantum modeling of this problem when both the learning agent and the environment are quantum. We introduced a quantum bandit algorithm based on quantum amplitude amplification to solve the quantum BAI problem and showed that it is able to get the optimal solution quadratically faster than what is known to hold in the classical case. Our results confirm that quantum algorithms can have a significant impact on reinforcement learning and open up new opportunities for more efficient bandit algorithms.

Our aim with this paper has been to provide a direct application of amplitude amplification to the best arm identification problem, and to show that it exhibits the same behavior in terms of efficiency as it does in other problems of the same nature. We proposed a direct quantum analogue of the multi-armed bandit problem, together with an analytical proof that amplitude amplification can find the best action quadratically faster than the best known classical algorithm with respect to the number of actions. Future extensions of this work might address the following questions: (i) could this algorithm be adapted to recommend the optimal action with an arbitrarily small margin of error? (ii) is it possible to treat the case where the reward function has values in ? (iii) can this algorithm be adapted to solve more complex decision-making problems? (iv) can it be proven or disproven that amplitude amplification is optimal for this problem, as it is for other unstructured search problems?