The so-called Narendra-Shapiro bandit algorithm (referred to as NSa) was introduced in  and developed in  as a linear learning automaton. This algorithm has been primarily considered by the probabilistic community as an interesting benchmark for stochastic algorithms. More precisely, NSa is an example of a recursive (non-homogeneous) Markovian algorithm, a topic whose almost complete historical overview may be found in the seminal contributions of  and .
NSa belongs to the large class of bandit-type policies whose principle may be sketched as follows: a -armed bandit algorithm is a procedure designed to determine which one, among sources, is the most profitable without spending too much time on the wrong ones. In the simplest case, the sources (or arms) randomly provide some rewards whose values belong to with Bernoulli laws. The associated probabilities of success are unknown to the player, whose goal is to determine the most efficient source, i.e. the one with the highest probability of success.
Let us now recall a rigorous definition of admissible sequential policies. We consider independent sequences of i.i.d. Bernoulli random variables. Each represents the reward associated with the arm at time . We then consider some sequential predictions where at each stage a forecaster chooses an arm , receives a reward and then uses this information to choose the next arm at step . As introduced in the pioneering work , the rewards are sampled independently from a fixed product distribution at each step . The innovations here at time are provided by and we are naturally led to introduce the filtration . In the following, the sequential admissible policies will be an (inhomogeneous) Markov chain. We also define another filtration by adding all the events before step and observe that . To sum up, contains all the results of each arm between time and although only provides partial information about the tested arms.
In this paper, we focus on the stochastic NSa whose principle is very simple: it consists in sampling one arm according to a probability distribution on , and in modifying this probability distribution according to the reward obtained with the chosen arm. From this point of view, this algorithm bears similarities with the EXP3 algorithm (and many of its variants) introduced in . Among other close bandit algorithms, one can also cite the Thompson Sampling strategy where the random selection of the arm is based on a Bayesian posterior which is updated after each result. We refer to for a recent theoretical contribution on this algorithm.
Instead of sampling one arm sequentially according to a randomized decision, other algorithms define their policy through a deterministic maximization procedure at each iteration. Among them, we can mention the UCB algorithm  and its derivatives (including MOSS  and KL-UCB ), whose dynamics are dictated by an appropriate empirical upper confidence bound of the estimated best performance.
Let us now present the NSa algorithm. In fact, we will distinguish two types of NSa: crude NSa and penalized NSa. Before going further, let us recall their mechanism in the case of (the general case will be introduced in Section 2). Designating as the probability of drawing arm 1 at step and as a decreasing sequence of positive numbers that tends to 0 when goes to infinity, the crude NSa is recursively defined by:
Note that the construction is symmetric, i.e., (which corresponds to the probability of drawing arm 2) has a symmetric dynamics. The long-time behavior of some NSa has been extensively investigated in the last decade. To name a few, in  and , some convergence and rate of convergence results are proved. However, these results strongly depend on both and the probabilities of success of the arms. In order to get rid of these constraints, the authors then introduced in  a penalized NSa and proved that this method is an efficient distribution-free procedure, meaning that it converges to the best arm whatever the unknown probabilities and . The idea of the penalized NSa is to also take the failures of the player into account and to reduce the probability of drawing the tested arm when it loses. Designating as a second positive sequence, the dynamics of the penalized NSa is given by:
Performances of bandit algorithms. In view of potential applications, it is important to have some information about the performance of the policies used. To this end, one first needs to define what a “good” sequential algorithm is. The primary efficiency requirement is the ability of the algorithm to asymptotically recover the best arm. In , this property is referred to as the infallibility of the algorithm. If, without loss of generality, the first arm is assumed to be the best (i.e., that ) and if denotes the probability of drawing arm , the algorithm is said to be infallible if
An alternative way of describing the efficiency of a method is to consider the behaviour of the cumulative reward obtained between time and :
In particular, in the early paper , Robbins looks for algorithms such that
This last property is weaker than the infallibility of an algorithm since Lebesgue's dominated convergence theorem applied to (3) implies the convergence above.
A much stronger requirement involves the regret of the algorithm. The regret measures the gap between the cumulative reward of the best player and the one induced by the policy. The regret is the -measurable random variable defined as:
A good strategy corresponds to a selection procedure that minimizes the expected regret , optimal ones being referred to as minimax strategies.
The expected regret above is not easy to handle and is generally replaced in statistical analysis by the pseudo-regret, defined as
Since , can also be written as
A low pseudo-regret property then means that the quantity
has to be small, in particular sub-linear with . The quantities and are closely related and it is reasonable to study the pseudo-regret instead of the true regret, owing to the next proposition:
For any -measurable strategy, we obtain after plays:
Furthermore, for every integer and and for any (admissible) strategy,
We refer to Proposition 34 of  for a detailed proof of and to Theorem 5.1 of  for . As mentioned in , the bounds are distribution-free (uniform in ). (The rate orders are strongly different if a dependence in is allowed.) Since the MOSS method of  satisfies , and show that a non-asymptotic distribution-free minimax rate is on the order of .
In particular, a fallible algorithm (meaning that ) necessarily generates a linear regret and is not optimal. For example, in the case , the dependence of in terms of is as follows:
Objectives. In this paper, we therefore propose to focus on the regret and to answer the question “Are NSa competitive from a regret viewpoint? If so, what are the associated upper bounds?”
Because its infallibility conditions are too restrictive, it will be seen that the crude NSa cannot be competitive from a regret point of view. As mentioned before, the penalized NSa is more robust and is a priori more appropriate for this problem. More precisely, the penalty induces more balance between exploration and exploitation, that is, between playing the best arm (the best one according to the past actions) and exploring new options (playing the suboptimal arms). In this paper, we are going to prove that, up to a slight reinforcement, it is possible to obtain some competitive bounds for the regret of this procedure. The slightly modified penalized algorithm will be referred to as the over-penalized algorithm below.
Outline. The paper is organized as follows: Section 2.1 provides some basic information about the crude NSa. Then, in Section 2.2, after some background on the penalized NSa, we introduce a new algorithm called over-penalized NSa.
Section 3 is devoted to the main results: in Theorem 3.2, we establish an upper-bound of the pseudo-regret for the over-penalized algorithm in the two-armed case and also show a weaker result for the penalized NSa.
In this section, we also extend to the multi-armed case some existing convergence and rate of convergence results of the two-armed algorithm. In the “critical” case (see below for details), the normalized algorithm converges in distribution toward a PDMP (Piecewise Deterministic Markov Process). We develop a careful study of its ergodicity and establish bounds on the rate of convergence to equilibrium. The analysis uses a non-trivial coupling strategy to derive explicit rates of convergence in Wasserstein and total variation distance. The dependence of these rates on the parameters of the initial bandit problem is made explicit.
2 Definitions of the NS algorithms
2.1 Crude NSa and regret
The crude NSa (1) is rather simple: it defines a Markov chain and is a random variable satisfying:
The arm is selected at step with the current distribution and is evaluated. In the event of success, the weight of the arm is increased and the weight of the other arm is decreased by the same quantity. The algorithm can be rewritten in a more concise form as:
The arm at step succeeds with the probability and we suppose that so that the arm 1 is the optimal one.
As pointed out in (1), we obtain that
This formula is important regarding the fallibility of an algorithm. In particular, it is shown in  that for any choice with and or with , the NSa (7) may be fallible: some parameters exist such that a.s. converges to a binary random variable with . In this situation, for large enough , we have:
It can easily be concluded that this method cannot induce a competitive policy since some “bad” values of the probabilities generate a linear regret.
2.2 Penalized and over-penalized two-armed NSa
A major difference between the crude NSa and its penalized counterpart introduced in  lies in the exploitation of the failure of the selected arms. The crude NSa (1) only uses the sequence of successes to update the probability distribution since the value of is modified iff . In contrast, the penalized NSa (2) also uses the information generated by a potential failure of the arm . More precisely, in the event of success of the selected arm , this penalized NSa mimics the crude NSa, whereas in the case of failure, the weight of the selected arm is now multiplied (and thus decreased) by a factor (whereas the probability of drawing the other arm is increased by the corresponding quantity). For the penalized NSa, the update formula of can be written in the following way:
In view of the minimization of the regret, we will show that it may be useful to reinforce the penalization. For this purpose, we introduce a slightly “over-penalized” NSa where a player is also (slightly) penalized if it wins:
If player 1 wins, then with probability it is penalized by a factor .
If player 2 wins, then with probability arm 1 is increased by a factor of .
The over-penalized-NSa can be written as follows
is a sequence of i.i.d. r.v. with a Bernoulli distribution, meaning that . Moreover, these r.v. are independent of and in such a way that for all , and are also independent. It should be noted that
In fact, this slight over-penalization of the successful arm (with probability ) can be viewed as an additional statistical excitation which helps the stochastic algorithm to escape from local traps. The case corresponds to the penalized NSa (8), whereas when , the arm is always penalized when it plays. In particular, this modification implies that the increment of is slightly weaker than in the previous case when the selected arm wins.
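As an illustration, the two-armed dynamics described in words above can be sketched numerically. The Python sketch below is only one possible reading of that verbal description, not a reference implementation: the function name, the callables `gamma` and `rho` (step index to step size), the default starting point, and the exact form of each update (in particular, that the extra penalty is computed from the current weight) are assumptions made for the example.

```python
import random


def over_penalized_nsa(p1, p2, n_steps, gamma, rho, sigma, x0=0.5, seed=0):
    """Simulate a two-armed (over-)penalized NSa; return [X_0, ..., X_n].

    X_n is the probability of drawing arm 1. The update rules are an
    ASSUMED reading of the verbal description (hypothetical interface):
      - selected arm wins : crude NSa reward, plus, with probability
        sigma, a penalty of size rho*gamma times its current weight;
      - selected arm loses: its weight is multiplied by (1 - rho*gamma).
    Requires gamma(n) <= 1 and rho(n) * gamma(n) <= 1.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for n in range(1, n_steps + 1):
        g = gamma(n)
        rg = rho(n) * g
        play_arm1 = rng.random() < x            # arm 1 drawn with prob. X_n
        success = rng.random() < (p1 if play_arm1 else p2)
        extra = rng.random() < sigma            # Bernoulli(sigma) excitation
        if play_arm1 and success:
            x = x + g * (1.0 - x) - (rg * x if extra else 0.0)
        elif play_arm1:
            x = x - rg * x                      # arm 1 failed: penalize it
        elif success:
            x = x - g * x + (rg * (1.0 - x) if extra else 0.0)
        else:
            x = x + rg * (1.0 - x)              # arm 2 failed: penalize it
        path.append(x)
    return path


if __name__ == "__main__":
    # Illustrative step sizes only; they are not taken from the text.
    steps = lambda n: 0.5 / n ** 0.5
    trajectory = over_penalized_nsa(0.7, 0.3, 10_000, steps, steps, sigma=0.5)
    print(trajectory[-1])
```

In this sketch, `sigma=0` gives the penalized NSa and a `rho` that is identically zero gives the crude NSa, so the three variants can be compared on the same random seed.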
Asymptotic convergence of the penalized NSa.
Before stating the main results, we need to understand which regret could be reached by penalized and over-penalized NSa. We recall (in a slightly less general form) the convergence results of Proposition 3, Theorems 3 and 4 of .
Theorem 2.1 (Lamberton & Pages, ).
Let and and with and . Let be the algorithm given by (8).
If and , the penalized two-armed bandit is infallible.
Furthermore, if and , then
If and : where stands for the convergence in distribution and is the stationary distribution of the PDMP whose generator acts on as
We then obtain the key observation
where is a constant that may depend on and . According to Theorem 2.1, it seems that the potential optimal choice corresponds to the one of . Indeed, the infallibility occurs only when and and Equation (10) suggests that should be chosen as large as possible to minimize the r.h.s. of (11), leading to . This is why in the following, we will focus on the case:
2.3 Over-penalized multi-armed NSa
We generalize the definition of the penalized and over-penalized NSa to the -armed case, with . Let and assume that ( the probability of success of arm ). The over-penalized NSa recursively defines a sequence of probability measures on denoted by where . At step , the arm is sampled according to the discrete distribution and then tested through the computation of . Setting , the multi-armed NSa is defined by:
In contrast with the two-armed case, we have to choose how to distribute the penalty to the other arms when . The (natural) choice in (13) is to divide it fairly, i.e., to spread it uniformly over the other arms. Note that alternative algorithms (not studied here) could be considered.
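The uniform redistribution of the penalty can be made concrete with a single update step. The sketch below is a hedged illustration, not the formula (13) itself: the function name `multi_armed_step`, its arguments, and the precise form of the reward and penalty terms are assumptions that mirror the two-armed description.

```python
def multi_armed_step(x, arm, success, gamma, rho, over_penalize):
    """One ASSUMED update of the multi-armed (over-)penalized NSa.

    x             : list of current arm probabilities (sums to 1)
    arm           : index of the arm that was played
    success       : True if the played arm won
    gamma, rho    : current step sizes (gamma <= 1, rho * gamma <= 1)
    over_penalize : outcome of the Bernoulli(sigma) draw (used on a win)
    """
    d = len(x)
    new = list(x)
    if success:
        # Crude NSa reward: move the distribution toward the played arm.
        new = [(1.0 - gamma) * xj for xj in x]
        new[arm] += gamma
    if (not success) or over_penalize:
        # Penalty taken from the played arm, spread uniformly over the others.
        penalty = rho * gamma * x[arm]
        new[arm] -= penalty
        for j in range(d):
            if j != arm:
                new[j] += penalty / (d - 1)
    return new


if __name__ == "__main__":
    print(multi_armed_step([0.5, 0.3, 0.2], arm=0, success=False,
                           gamma=0.1, rho=0.5, over_penalize=False))
```

Both branches preserve the total mass, so the output is again a probability vector as long as `gamma <= 1` and `rho * gamma <= 1`.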
3 Main Results
3.1 Regret of the over-penalized two-armed bandit
First, we provide some uniform upper-bounds for the two-armed -over-penalized NSa . Our main result is Theorem 3.2. Before stating it, we choose to state a new result when , for the “original” penalized NSa introduced in .
The upper bound of the original penalized NSa is not completely uniform. From a theoretical point of view, there is not enough penalty when is too large, which in turn weakens the mean-reverting effect for the sequence when is close to . In other words, the trap of the stochastic algorithm near is not repulsive enough; Figure 1 below shows that this problem also appears numerically and suggests a logarithmic explosion of .
This explains the interest of the over-penalization, illustrated by the next result, which is the main theorem of the paper.
(a) A exists such that:
(b) Furthermore, the choice , yields
At the price of technicalities, could be made explicit in terms of and for every . The second bound is obtained by an optimization of (see (38) and below).
Figure 1 presents on the left side a numerical approximation of for the penalized and over-penalized algorithms. The continuous curves indicate that the upper bound in Theorem 3.2 is not sharp since the over-penalized NSa satisfies a uniform upper-bound on the order of . This bound is obtained with a small (as pointed out in Theorem 3.2), and (red line in Figure 1 (left)), suggesting that the rewards should always be over-penalized with .
The right-hand side of Figure 1 focuses on the behavior of the regret with . The map confirms the influence of the over-penalization and indicates that to obtain optimal performances for the cumulative regret, we should use a low value of between and . The importance of this choice of seems relative since the behaviour of the over-penalized bandit is stable on this interval. The best numerical choice is attained for and and makes it possible to achieve a long-time behavior of of the order (see Figure 2, red line).
Finally, the statistical performances of the over-penalized NSa are compared with some classical bandit algorithms: the KL-UCB algorithm (see e.g.  and the references therein) and EXP3 (see ). These two algorithms are anytime policies that are known to be minimax optimal with a cumulative minimax regret of the order . Figure 2 shows that the performance of the over-penalized NSa lies between those of the KL-UCB algorithm and of the EXP3 algorithm (our simulations suggest that the uniform bounds of KL-UCB and EXP3 are respectively and ). It is also worth noting that the simulation cost of the over-penalized NSa is much lower than that of the initial UCB algorithm (the gap is even larger when compared to KL-UCB, which involves an additional difficulty in the computation of the upper confidence bound at each step): the same number of Monte Carlo simulations runs almost a hundred times faster for the over-penalized NSa than for KL-UCB under equivalent numerical conditions.
3.2 Convergence of the multi-armed over-penalized bandit
Proposition 3.1 (Convergence of the multi-armed over-penalized bandit).
Consider and with and . Algorithm (8) with satisfies
If and , then .
Furthermore, if and , then:
Proposition 3.2 provides a description of the behavior of the normalized NSa while considering . It states that converges to the dynamics of a Piecewise Deterministic Markov Process (referred to as PDMP below).
Proposition 3.2 (Weak convergence of the over-penalized NSa).
Under the assumptions of Proposition 3.1, if and , then:
where is the (unique) stationary distribution of the Markov process whose generator acts on compactly supported functions of as follows:
3.3 Ergodicity of the limiting process
In this section, we focus on the long time behavior of the limiting Markov process that appears (after normalization) in Proposition 3.2. As mentioned before, this process is a PDMP and its long time behavior can be carefully studied with some arguments in the spirit of . We also learned about the existence of a close study in the PhD thesis of Florian Bouguet (some details may be found in ). Such properties are stated for both the one-dimensional and the multidimensional cases.
3.3.1 One-dimensional case
The generator given by Proposition 3.2 may be written as:
In what follows, we will assume that , and are positive numbers. We can see in two parts. On the one hand, the deterministic flow that guides the PDMP between the jumps is given by:
Hence, if (resp. ), decreases (resp. increases) and converges exponentially fast to .
On the other hand, the PDMP possesses some positive jumps that occur with a Poisson intensity “”, whose size is deterministic and equal to .
From the finiteness and positivity of , it is easy to show that for every positive starting point, the process is well-defined on , positive and does not explode in finite time. The fact that the size of the jumps is deterministic is less important and what follows could easily be generalized to a random size (under adapted integrability assumptions). In Figure 3 below, some paths of the process are represented with different values of the parameters.
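To get a feeling for such paths, the process can be simulated by a standard thinning argument. The sketch below assumes a one-dimensional PDMP whose flow solves x' = a - b x between jumps and which jumps by +g at rate c x; the symbols `a`, `b`, `c`, `g`, the function name and this parametrization are assumptions made for the illustration and should be matched against the generator (16) before any use.

```python
import math
import random


def simulate_pdmp(x0, a, b, c, g, horizon, seed=0):
    """Return the state at time `horizon` of the ASSUMED PDMP.

    Between jumps: x' = a - b*x, i.e. exponential relaxation to a/b.
    Jumps: size +g, state-dependent intensity c*x (simulated by thinning).
    Requires x0 > 0, a >= 0, b > 0, c >= 0, g >= 0.
    """
    rng = random.Random(seed)
    x_star = a / b
    x, t = x0, 0.0

    def flow(y, dt):
        # Explicit solution of y' = a - b*y started from y.
        return x_star + (y - x_star) * math.exp(-b * dt)

    while True:
        # The flow is monotone toward x_star, so until the next jump the
        # intensity c*x(s) stays below c*max(x, x_star).
        lam_max = c * max(x, x_star)
        if lam_max <= 0.0:
            return flow(x, horizon - t)         # no jump can occur
        dt = rng.expovariate(lam_max)           # candidate jump time
        if t + dt >= horizon:
            return flow(x, horizon - t)
        x = flow(x, dt)
        t += dt
        if rng.random() < c * x / lam_max:      # accept the candidate jump
            x += g


if __name__ == "__main__":
    # Illustrative parameters with b - c*g > 0.
    print(simulate_pdmp(x0=1.0, a=1.0, b=2.0, c=1.0, g=0.5, horizon=10.0))
```

Thinning is used here because the jump intensity depends on the current state; after a rejected candidate the bound is simply recomputed from the current position, which the Markov property allows.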
3.3.2 Convergence results
As pointed out in Figure 3, the long-time behavior of the process certainly depends on the relationship between the mean-reverting effect generated by “” and the frequency and size of the jumps.
The process (16) possesses a unique invariant distribution if . Actually, the existence is ensured by the fact that is a Lyapunov function for the process since
Among other arguments, the uniqueness is ensured by Theorem 3.3 (the convergence in Wasserstein distance of the process toward the invariant distribution implies in particular its uniqueness). We denote it by below. It could also be shown that , that the process is strongly ergodic on (see  for some background) and that if , the process explodes when (this case corresponds to the bottom left-hand side of Figure 3). Finally, it should be noted that for the limiting PDMP of the bandit algorithm,
and thus, the ergodicity condition coincides with the positivity of .
We aim to derive rates of convergence for the PDMP toward for two distances, namely the Wasserstein distance and the total variation distance. Such results can be obtained in rather different ways, using coupling arguments or PDEs. We use coupling techniques here that are consistent with the work of  and . Before stating our results, let us recall that the -Wasserstein distance is defined for any probability measures and on by:
Designating as the initial distribution of the PDMP and as its law at time , we now state the main result on the PDMP in dimension one driven by (16).
Theorem 3.3 (One dimensional PDMP).
Let and denote for every where is a Markov process driven by (16) with initial distribution (with support included in ). If , we have
and if , a constant exists such that
where satisfies the recursion .
If , the lower and upper bounds imply the optimality of the rate obtained in the exponential. For , the optimality of the exponent is still an open question.
We now give a corollary for the limiting process that appears in Proposition 3.2.
Corollary 3.1 (Multi-dimensional PDMP).
The proof is almost obvious due to the “tensorized” form of the generator. Actually, for every starting point , all the coordinates are independent one-dimensional PDMPs with generator defined by (16) with
The result then easily follows from Theorem 3.3 with a global rate given by . The details are left to the reader.
3.4 Total variation results
When some bounds are available for the Wasserstein distance, a classical way to deduce an upper bound of the total variation is to build a two-step coupling. In the first step, a Wasserstein coupling is used to bring the paths sufficiently close (with a probability controlled by the Wasserstein bound). In the second step, we use a total variation coupling to try to make the paths stick together with a high probability. In our case, the jump size is deterministic and sticking the paths implies a non-trivial coupling of the jump times. Some of the ideas to obtain the results below are in the spirit of , who follows this strategy for the TCP process.
Let be a starting distribution with moments of any order. Then, for every , a exists such that:
Once again, this result can be extended to the multi-armed case.
The proof of this result is based on the remark that follows Corollary 3.1. Owing to the “tensorization” property, the probability for coupling all the coordinates before time is essentially the product of the probabilities of the coupling of each coordinate. Once again, the details of this corollary are left to the reader.
This section is devoted to the study of the regret of the penalized two-armed bandit procedure described in Section 2. We will mainly focus on the proof of the explicit bound given in Theorem 3.2 and we will give the main ideas for the proofs of Theorems 3.1 and 3.1.
In order to lighten the notation, will be summarized by , so that .
The proofs are then strongly based on a detailed study of the behavior of the (positive) sequence defined by
As we said before, we will consider the following sequences and below:
where and are constants in that will be specified later. In the meantime, we also define:
With this setting, the pseudo-regret is
It should be noted here that we have substituted the division by in (11) by a normalization with . This will be easier to handle in the sequel. The main issue now is to obtain a convenient upper bound for . More precisely, note that:
and conversely for every ,
Thus it is enough to derive an upper bound of after an iteration that can be on the order of . In particular, the “suitable” choice of will strongly depend on the value of .
4.2 Evolution of
Recursive dynamics of .
In order to understand the mechanism and difficulties of the penalized procedure, let us first roughly describe the behavior of the sequences and . According to (9),
It can be observed that the drift term may be split into two parts, where the main part is the usual drift of NSa described by defined by:
The second term comes from the penalization procedure and depends on . We set
As a consequence, we can write the evolution of as follows:
where is a martingale increment. On the basis of the equation above, we easily derive that