1 Introduction
Several recent and less recent analyses of bandit problems share a remarkable feature: an instance-dependent lower-bound analysis shows the existence of an optimal proportion of draws, which every efficient strategy needs to match, and which is used as a basis for the design of optimal algorithms. This is the case for Active Exploration bandit problems, see Chernoff [1959], Soare et al. [2014], Russo [2016] and Garivier and Kaufmann [2016], but also for Regret Minimization bandit problems, from the simplest multi-armed bandit setting, Garivier et al. [2018], to more complex settings, Lattimore and Szepesvari [2017], Combes et al. [2017].
To reach the asymptotic lower bounds, one needs to sample asymptotically according to this optimal proportion of draws. A natural strategy is to sample according to the optimal proportion of draws associated with the current estimate of the true parameter, with some extra exploration. See for example
Antos et al. [2008], Garivier and Kaufmann [2016], Lattimore and Szepesvari [2017] and Combes et al. [2017]. This strategy has a major drawback: computing the optimal proportion of draws requires solving an often involved concave optimization problem. This can lead to a rather computationally inefficient strategy, since one must solve exactly, at each step, a new concave optimization problem. In this paper we propose to use instead a gradient ascent to solve the optimization problem in an online fashion, thus merging the Active Exploration problem and the computation of the optimal proportion of draws. Precisely, we perform an online lazy mirror ascent, see Shalev-Shwartz et al. [2012], Bubeck [2011], adding a new link between stochastic bandits and online convex optimization. Hence, it is sufficient to compute only a (sub)gradient at each step, which greatly improves the computational complexity. As a byproduct, the obtained algorithm is quite generic and can be applied to various Active Exploration bandit problems, see Appendix A.
The paper is organized as follows. In Section 1.1 we define the framework. A general asymptotic lower bound is presented in Section 1.2. In Section 1.3 we motivate the introduction of the gradient ascent. The main result, namely the asymptotic optimality of Algorithm 1, and its proof compose Section 2. Appendix A gathers various examples that are covered by the general setting introduced in Section 1.1. Section 3 reports results of some numerical experiments comparing Algorithm 1 to its competitors.
Notation.
For , let be the set of integers lower than or equal to . We denote by the simplex of dimension and by the canonical basis of . A distribution on is assimilated to an element of
The Kullback-Leibler divergence between two probability distributions
on is (with the usual conventions)
1.1 Problem description
For , we consider a Gaussian bandit problem
, which we unambiguously refer to by the vector of means
. Without loss of generality, we set in the following . We denote by the set of Gaussian bandit problems. Let and be respectively the probability and the expectation under the bandit problem . We fix a finite number of subsets of bandit problems for with , and we assume that the subsets are pairwise disjoint, open and convex. We will explain later why we need these assumptions on the sets . For a certain bandit problem in , our objective is to identify to which set it belongs, i.e. to find such that . Namely, we consider algorithms that output a subset index after pulls. This setting is quite general and encompasses several Active Exploration bandit problems, see Appendix A.
Two approaches to this problem have been proposed. First, one may consider a given budget and try to minimize the probability of predicting a wrong subset index; this is the Fixed Budget setting, see Bubeck et al. [2012], Audibert and Bubeck [2010] and Locatelli et al. [2016]. The second approach is the Fixed Confidence setting, where we fix a confidence level and try to minimize the expected number of samples under the constraint that the predicted subset index is the right one with probability at least , see Chernoff [1959], Even-Dar et al. [2002], Mannor and Tsitsiklis [2004] and Kaufmann et al. [2016]. In this paper we will consider the second approach.
The game goes as follows: at each round the agent chooses an arm and observes a sample, conditionally independent from the past. Let be the information available to the agent at time . In order to respect the confidence constraint, the agent must follow a correct algorithm comprising:


a sampling rule , where is measurable,

a stopping rule , a stopping time for the filtration ,

a decision rule measurable,
such that for all the fixed confidence condition is satisfied and the algorithm stops almost surely. In this paper we will focus our attention on the sampling rule, since stopping rules are now well understood and decision rules are straightforward to find.
1.2 Lower Bound
The Kullback-Leibler divergence between two Gaussian distributions
and is defined by . The set of alternatives of the problem is denoted by . One can prove the following generic asymptotic lower bound on the expected number of samples when the confidence level tends to zero, see Garivier and Kaufmann [2016] and Garivier et al. [2017].
Theorem 1.
For all $\delta \in (0,1)$ and all $\delta$-correct algorithms,
\[
\mathbb{E}_{\mu}[\tau_{\delta}] \;\geq\; T^{*}(\mu)\,\mathrm{kl}(\delta,\,1-\delta)\,, \tag{1}
\]
where the characteristic time $T^{*}(\mu)$ is defined by
\[
T^{*}(\mu)^{-1} \;=\; \sup_{w \in \Sigma_{K}}\ \inf_{\lambda \in \mathrm{Alt}(\mu)}\ \sum_{a=1}^{K} w_{a}\,\frac{(\mu_{a}-\lambda_{a})^{2}}{2}\,. \tag{2}
\]
In particular, (1) implies that
\[
\liminf_{\delta \to 0}\ \frac{\mathbb{E}_{\mu}[\tau_{\delta}]}{\log(1/\delta)} \;\geq\; T^{*}(\mu)\,. \tag{3}
\]
As already explained by Chernoff [1959], it is interesting to note that asymptotically we end up with a zero-sum game, where the agent first plays a proportion of draws trying to minimize the sum in (2), and then "nature" plays an alternative trying to do the opposite. The value of this game is exactly the inverse of the characteristic time. In the sequel we denote by
\[
g_{\mu}(w) \;=\; \inf_{\lambda \in \mathrm{Alt}(\mu)}\ \sum_{a=1}^{K} w_{a}\,\frac{(\mu_{a}-\lambda_{a})^{2}}{2} \tag{4}
\]
the function that the agent needs to maximize against a "nature" that plays optimally. An algorithm is thus asymptotically optimal if the reverse inequality of (3) holds, with a lim sup instead of a lim inf.
1.3 Intuition: what is the idea behind the algorithm?
To get an asymptotically optimal algorithm, the agent wants to play according to an optimal proportion of draws , defined by
\[
w^{*}(\mu) \;\in\; \operatorname*{argmax}_{w \in \Sigma_{K}}\ g_{\mu}(w) \tag{5}
\]
in order to minimize the characteristic time in (2). But, of course, the agent does not have access to the true vector of means. One way to settle this problem is to track the optimal proportion associated with the current empirical means. Let be the vector of empirical means at time :
where denotes the number of draws of arm up to and including time . We will denote by the empirical proportion of draws at time . Following this idea, the sampling rule could be
This rule is equivalent to the direct tracking rule (without forced exploration, see below) by Garivier and Kaufmann [2016]. But this approach has a major drawback: at each time step we need to solve exactly the concave optimization problem in (5). And it appears that in some cases we cannot solve it analytically, see for example Garivier et al. [2017]. Even when there exists an efficient way to solve the optimization problem numerically, as for example in the Best Arm Identification problem, some simpler and efficient algorithms give experimentally comparable results. We can cite for example Best Challenger type algorithms, see Garivier and Kaufmann [2016] and Russo [2016].
The idea of our algorithm is best explained on the simple example of the Thresholding Bandit problem (see Section A.1), where the set of all arms whose mean is larger than the threshold is to be identified. There exists a natural and efficient sampling rule (see Locatelli et al. [2016]):
(6) 
It turns out that this sampling rule leads to an asymptotically optimal algorithm; we are not aware of a reference for this fact. In order to give an interpretation of this sampling rule, let us take one step back. In this problem we want to maximize, with respect to the first variable, the following concave function (see Section A.1)
(7) 
The subgradient of at , denoted by , is a convex combination of the vectors
for the active coordinates that attain the minimum in (7). With this notation, the sampling rule (6) can be rewritten in the following form
where is some element in the subgradient . Then the update of the empirical proportion of draws follows the simple rule
(8) 
Here we recognize, surprisingly, one step of the Frank-Wolfe algorithm [Frank and Wolfe, 1956] for maximizing the concave function on the simplex. The exact same analysis can be done with a variant of the Best Challenger sampling rule for the Best Arm Identification problem; this is described in Section A.2. It is not the first time that the Frank-Wolfe algorithm appears in the stochastic bandits field, see for example Berthet and Perchet [2017]. Precisely, in the aforementioned reference the classical UCB algorithm is interpreted as an instance of this algorithm with an "optimistic" gradient. The main difficulty here, which does not appear in the Regret Minimization problem, is that the function is not smooth in general (as an infimum of linear functions). Thus we cannot directly leverage the analysis of the Frank-Wolfe algorithm in our setting as Berthet and Perchet [2017] do. In particular it is not obvious that the sampling rule driven by the Frank-Wolfe algorithm will converge to the maximum of , for the general problem presented in Section 1, even in the absence of noise (i.e. ).
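To make the connection concrete, here is a minimal sketch of one such Frank-Wolfe step for the thresholding example, assuming unit-variance Gaussian arms; the function name `frank_wolfe_step` and the illustrative values are ours, not the paper's.

```python
import numpy as np

def frank_wolfe_step(counts, mu_hat, tau):
    """One Frank-Wolfe-style step for the thresholding objective
    g(w) = min_a w_a * (mu_hat_a - tau)^2 / 2 (unit-variance Gaussian arms).

    Pulling the returned arm realizes the update (8):
    p_{t+1} = (1 - 1/(t+1)) p_t + (1/(t+1)) e_a."""
    t = counts.sum()
    w = counts / t  # empirical proportions of draws
    # The subgradient's active coordinate is an arm attaining the minimum.
    scores = w * (mu_hat - tau) ** 2 / 2.0
    return int(np.argmin(scores))  # same argmin as sampling rule (6)

counts = np.array([3.0, 3.0, 3.0])
mu_hat = np.array([0.1, 0.5, 1.2])
print(frank_wolfe_step(counts, mu_hat, tau=0.0))  # pulls arm 0, closest to tau
```

Since the common factor 1/t does not change the argmin, this coincides with the rule of Locatelli et al. [2016] based on the counts themselves.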
But we can keep the idea of using a concave optimizer in an online fashion instead of computing the optimal proportion of draws at each step. Indeed, there is a candidate of choice for optimizing non-smooth concave functions, namely the subgradient ascent. Now the strategy is clear: at each step we will perform one step of subgradient ascent for the function on the simplex. Nevertheless, the update of the proportion of draws will be more intricate than in (8): we will need to track the average of the weights proposed by the subgradient ascent and to force some exploration, see the next section for details. Note that this greatly improves the computational complexity of the algorithm, since one just needs to compute an element of the subgradient of at each time step. In various settings this computation is straightforward, see Appendix A; in general it boils down to computing the projection of the vector of empirical means on the closure of the alternative sets, thanks to the particular form of the function , see (4). Since the sets are convex, if the weights are strictly positive (which will be the case in Algorithm 1) the projection always exists.
2 Gradient Ascent
Before presenting the algorithm we need to fix some notation. Since does not necessarily lie in the set , we first extend to the entire set , by setting
Then, will denote some element of the subgradient of at .
As motivated in Section 1.3, we will perform a gradient ascent on the concave function to drive the sampling rule. More precisely we use an online lazy mirror ascent (see Bubeck et al. [2015]
) on the simplex, using the KullbackLeibler divergence to the uniform distribution
as mirror map: where, for an arbitrary constant , we clip the gradient . This is just a technical trick to handle the fact that the gradient may not be uniformly bounded in the very first steps. In practice, however, this technical trick seems unnecessary and we recommend ignoring it (that is, taking ). There is a closed formula for the weights , see Appendix F
. Note that it is crucial here to use an anytime optimizer since we do not know in advance when the algorithm will stop. Then we skew the weights
toward the uniform distribution to force exploration. This trick is quite usual, appearing for example in the EXP3.P algorithm, see Bubeck et al. [2012]. In some particular settings this extra exploration is not necessary, for example in the Thresholding Bandits problem. We believe that there is a more intrinsic way to perform exploration, but this is out of the scope of this paper. Since we use step sizes of order , we cannot use the same simple update rule for the empirical proportion of draws as in (8), where the step size is of order . But we can track the cumulative sum of weights as follows
It is important to track the cumulative sum of weights here because the analysis of the online mirror ascent provides only guarantees on the cumulative regret.
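For concreteness, one round of the weight update and the tracking step can be sketched as follows. The learning-rate and exploration-rate schedules of order $1/\sqrt{t}$ below are illustrative assumptions, not the paper's exact tuning, and the tracking line uses the current weights as a proxy (a faithful version would track the sum of all past skewed weights, as in (11)).

```python
import numpy as np

def lazy_mirror_ascent_sampler(grads, counts, eta=1.0, clip=10.0):
    """Sketch of one sampling round in the spirit of (9)-(11).

    grads: list of past subgradients of g at the empirical means,
    counts: number of draws of each arm so far."""
    t = len(grads)
    K = counts.size
    # Closed-form lazy mirror step with the KL-to-uniform mirror map:
    # weights proportional to exp(eta_t * cumulative clipped gradient).
    G = np.clip(np.sum(grads, axis=0), -clip * t, clip * t)
    z = (eta / np.sqrt(t)) * G
    w = np.exp(z - z.max())  # stable softmax
    w /= w.sum()
    # Forced exploration: skew the weights toward the uniform distribution.
    gamma = 1.0 / (2.0 * np.sqrt(t))
    w = (1.0 - gamma) * w + gamma / K
    # Tracking: pull the arm lagging most behind its cumulative target.
    arm = int(np.argmax(t * w - counts))
    return arm, w
```

The lazy (dual-averaging) form only needs the running sum of gradients, which is what makes the per-round cost a single subgradient evaluation.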
For the stopping rule we use the classical Chernoff stopping rule (12), see Chernoff [1959], Garivier and Kaufmann [2016], Garivier et al. [2017],
That is, we stop when the vector of empirical means is far enough from any alternative with respect to the empirical Kullback-Leibler divergence. Note that, here, the threshold does not depend directly on , but via the vector of counts . This allows us to use the maximal inequality of Proposition 1, which yields a very short and direct proof of correctness: see Section 2.1.
The decision rule (13) just chooses the closest set to the vector of empirical means with respect to the empirical Kullback-Leibler divergence. Putting everything together, we end up with Algorithm 1.
Initialization: pull each arm once and set for all
Sampling rule, for
Update the weights (subgradient ascent)
(9) 
(10) 
Pull the arm (track the cumulative sum of weights)
(11) 
Stopping rule
(12) 
Decision rule
(13) 
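As an illustration, the stopping rule (12) and decision rule (13) can be specialized to the thresholding problem of Section A.1. The exact form of the threshold `beta` below is an assumption of ours for the sketch (the paper's threshold depends on the vector of counts); the function name is also ours.

```python
import numpy as np

def chernoff_stop_threshold_bandit(counts, mu_hat, tau, delta):
    """Chernoff-type stopping and decision rules, specialized to the
    thresholding problem with unit-variance Gaussian arms."""
    # GLR statistic: distance from mu_hat to the closest alternative,
    # which flips the single cheapest arm across the threshold tau.
    glr = np.min(counts * (mu_hat - tau) ** 2 / 2.0)
    # Illustrative threshold, of order log(1/delta) plus a low-order term.
    beta = np.log((1.0 + np.log(1.0 + counts.sum())) / delta)
    stop = glr > beta
    decision = set(np.flatnonzero(mu_hat > tau).tolist())  # rule (13)
    return stop, decision
```

With well-separated empirical means and enough samples the statistic exceeds the threshold and the algorithm stops, returning the empirical set of arms above the threshold.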
In order to perform a gradient ascent we need the subgradient of to be bounded in a neighborhood of . For the examples presented in Appendix A, or if the are bounded, this assertion holds, but for some pathological examples it can fail (see Appendix G.3). That is why we make the following assumption, where we denote by the ball of radius for the infinity norm centered at .
Assumption 1.
We assume that for all there exists that may depend on such that:
We can now state the main result of the paper.
In the rest of this section we will present the main lines of the proof of Theorem 2. A detailed proof can be found in Appendix C.
2.1 Correctness of Algorithm 1
The correctness of Algorithm 1 is a simple consequence of the following maximal inequality, see Appendix D for a proof.
Proposition 1 (Maximal inequality).
2.2 Asymptotic Optimality of Algorithm 1
First, we need some regularity properties of the function around , in order to prove a regret bound for the online lazy mirror ascent. In Appendix G we derive the following proposition.
Proposition 2 (Regularity).
For all and , there exist constants , which may depend on , such that and it holds
Fix some real number and consider the typical event
where , for some horizon . We want to prove that, for large enough, on the event , the difference between the maximum of for the true parameter, namely , and its empirical counterpart at time is small, precisely of order . To this aim we will use the following regret bound for the online lazy mirror ascent, proved in Appendix F.
Proposition 3 (Regret bound for the online lazy mirror ascent).
We then need a consequence of the tracking and the forced exploration, proved in Appendix E, to relate to .
Proposition 4 (Tracking).
Using Propositions 2, 3 and 4, one can prove that for , on the event
Hence if we rewrite the stopping rule (12)
since the algorithm will stop as soon as . Thus for such we have the inclusion . But thanks to the forced exploration, see Lemma 2, we know that . Therefore we obtain
Thus dividing the above inequality by and letting go to zero then go to zero allows us to conclude.
3 Numerical Experiments
For the experiments we consider the Best Arm Identification problem described in Section A.2. Precisely, we restrict our attention to the simple, arbitrary, 4-armed bandit problem . The optimal proportion of draws is . The experiments compare several algorithms: the Lazy Mirror Ascent (LMA) described in Algorithm 1, the same algorithm with a constant learning rate (LMAc), the Best Challenger (BC) algorithm given in Section A.2, the Direct Tracking (DT) algorithm by Garivier and Kaufmann [2016], Top Two Thompson Sampling (TTTS) by Russo [2016], and finally uniform sampling (Unif) as a baseline. See Appendix B for details. Note in particular that all of them use the same Chernoff stopping rule (12), with the same threshold, and the same decision rule (13). This allows a fair comparison between the sampling rules. Indeed, it is known (see Garivier et al. [2017]) that the choice of the stopping rule is decisive to minimize the expected number of samples. We only investigate the effects of the sampling rule here, because it is where the trade-off between uniform exploration and selective exploration takes place.

Table 1: Average execution time (in seconds) of one step of each algorithm: BC, TTTS, LMAc, LMA, DT, Unif.
Figure 1 displays the average number of draws of each of the aforementioned algorithms for two different confidence levels, and . The associated theoretical expected numbers of draws are respectively for and for . Table 1 displays the average execution time of one step of these algorithms. Unsurprisingly, all the algorithms perform better than uniform sampling. LMA compares to the other algorithms but with slightly worse results. This may be due to the fact that lazy mirror ascent (with a learning rate of order ) is less aggressive than, for example, the Frank-Wolfe algorithm. Indeed, using a constant learning rate (LMAc) we recover the same results as BC. But doing so we lose the guarantee of asymptotic optimality. The four mentioned algorithms share roughly the same (one-step) execution time, which is expected since they have the same complexity, see Appendix B. The Direct Tracking of the optimal proportion of draws performs slightly better than the other algorithms, but its execution time is much longer (approximately 100 times longer) due to the extra cost of computing the optimal weights. Note that TTTS also tends to be slow when the posteriors are well concentrated, since it is then hard to sample the challenger. But it is the only algorithm that does not explicitly force the exploration.
4 Conclusion
In this paper we developed a unified approach to Bandit Active Exploration problems. In particular, we provided a general, computationally efficient, asymptotically optimal algorithm. To avoid obfuscating technicalities, we treated only the case of Gaussian arms with known variance and unknown mean, but the results can easily be extended to other one-parameter exponential families. For this, we just need to replace the maximal inequality of Proposition 1 by the one of Theorem 14 of Kaufmann and Koolen [2018] and to adapt the threshold accordingly. Several questions remain open. It would be interesting to provide an analysis for the moderate-confidence regime, as argued by Simchowitz et al. [2017]. Another direction of improvement could be to explore further the connection with the Frank-Wolfe algorithm. Nevertheless, the main open question, from the author's point of view, is to find a natural way to explore instead of forcing the exploration. One possibility could be to use the principle of optimism in this setting. Indeed, even for Active Exploration problems there is a trade-off between uniformly exploring the distributions of the arms and selectively exploring the distributions of specific arms, in order to find in which set the bandit problem lies.
References
 Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 Antos et al. [2008] András Antos, Varun Grover, and Csaba Szepesvári. Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 287–302. Springer, 2008.
 Audibert and Bubeck [2010] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
 Balsubramani [2014] Akshay Balsubramani. Sharp finite-time iterated-logarithm martingale concentration. arXiv preprint arXiv:1405.2639, 2014.
 Berthet and Perchet [2017] Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems, pages 2225–2234, 2017.
 Bubeck [2011] Sébastien Bubeck. Introduction to online optimization. Lecture Notes, 2011.

 Bubeck et al. [2012] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Bubeck et al. [2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
 Chernoff [1959] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
 Combes et al. [2017] Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.
 Degenne and Koolen [2019] Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. arXiv preprint arXiv:1902.03475, 2019.

 Even-Dar et al. [2002] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
 Finkelstein et al. [1971] Helen Finkelstein et al. The law of the iterated logarithm for empirical distribution. The Annals of Mathematical Statistics, 42(2):607–615, 1971.
 Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
 Garivier and Kaufmann [2016] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
 Garivier et al. [2017] Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for doseranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.
 Garivier et al. [2018] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 2018.
 Kaufmann and Koolen [2018] Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
 Kaufmann et al. [2016] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of bestarm identification in multiarmed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.

 Lattimore and Szepesvari [2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.
 Lattimore and Szepesvári [2019] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2019.
 Locatelli et al. [2016] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1690–1698, 2016.
 Mannor and Tsitsiklis [2004] Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multiarmed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
 Peña et al. [2008] Victor H. Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes. Springer Science & Business Media, 2008.
 Russo [2016] Daniel Russo. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418, 2016.
 Shalev-Shwartz et al. [2012] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
 Simchowitz et al. [2017] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderateconfidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
 Soare et al. [2014] Marta Soare, Alessandro Lazaric, and Rémi Munos. Bestarm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
Appendix A Examples
In this appendix we present some classical and less classical active exploration bandit problems that can be described by the general framework presented in Section 1.1. Note that Assumption 1 holds for all the examples presented below. For the first three examples it is a direct consequence of the expression of the subgradient. For the last one, one just needs to remark that the projection of a certain on an alternative set (for ) is such that belongs to the interval for all .
a.1 Thresholding Bandits
We fix a threshold . The objective here is to identify the set of arms whose mean lies above this threshold, . Therefore, to see this problem as a particular case of the one presented in Section 1.1, we choose the power set of and
For , it turns out that there is an explicit expression for and the characteristic time in this particular case,
(18) 
In the function we recognize the minimum of the costs (with respect to the weights ) for moving the mean of one arm to the threshold. Thanks to this rewriting the computation of the subgradient is direct
for that realize the minimum in (18) (the nonzero coordinate is at position ).
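In code, this subgradient computation might look as follows; the function name is ours and unit-variance Gaussian arms are assumed.

```python
import numpy as np

def subgradient_thresholding(w, mu, tau):
    """An element of the subgradient of g(w) = min_a w_a * (mu_a - tau)^2 / 2
    at w: the gradient of the active linear piece, i.e. the vector with
    (mu_a - tau)^2 / 2 at a minimizing coordinate and zeros elsewhere."""
    costs = w * (mu - tau) ** 2 / 2.0
    a = int(np.argmin(costs))  # an active coordinate attaining the minimum
    g = np.zeros_like(w)
    g[a] = (mu[a] - tau) ** 2 / 2.0
    return g
```

Any convex combination over the minimizing coordinates would also be a valid subgradient; picking a single minimizer is the cheapest choice.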
a.2 Best Arm Identification
Here the objective is to identify the arm with the greatest mean. We set and
For , we can simplify a bit the expression of the characteristic time. Indeed, using well chosen alternatives, see Garivier and Kaufmann [2016], we have
(19) 
where is the mean between the optimal mean and the mean with respect to the weights :
We can see the weighted divergence that appears in (4) as the cost of moving the mean of arm above the optimal one, and thus making arm optimal. Precisely, we move and at the same time to the weighted mean . The computation of the subgradient is also straightforward in this case
for active coordinates that realize the minimum in (19) (the nonzero coordinates are at positions and ). A variant of the Best Challenger sampling rule introduced by Garivier and Kaufmann [2016], see also Russo [2016], is given by
(20) 
where we denote by the current optimal arm (the one with the greatest mean) at time . At a high level, we select the best challenger of the current best arm with respect to the cost that appears in (19). Then we greedily choose, between and , the one that increases this cost the most. Again, as in the previous example, this sampling rule rewrites as one step of the Frank-Wolfe algorithm for the function
(21) 
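A sketch of this Best-Challenger step, assuming unit-variance Gaussian arms; the helper name and the tie-breaking in favor of the challenger are ours.

```python
import numpy as np

def best_challenger_step(w, mu):
    """Find the challenger minimizing the transportation cost in (19),
    then greedily pull, among the current best arm and the challenger,
    the one whose subgradient coordinate (marginal increase of the
    cost) is larger."""
    star = int(np.argmax(mu))
    best_cost, challenger, m_best = np.inf, -1, 0.0
    for a in range(mu.size):
        if a == star:
            continue
        # Weighted mean to which both arms' means are moved.
        m = (w[star] * mu[star] + w[a] * mu[a]) / (w[star] + w[a])
        cost = w[star] * (mu[star] - m) ** 2 / 2 + w[a] * (mu[a] - m) ** 2 / 2
        if cost < best_cost:
            best_cost, challenger, m_best = cost, a, m
    # Active subgradient coordinates at the two arms involved.
    g_star = (mu[star] - m_best) ** 2 / 2
    g_chal = (mu[challenger] - m_best) ** 2 / 2
    return star if g_star > g_chal else challenger
```

Note that the arm with the smaller proportion of draws among the pair ends up farther from the weighted mean, so the greedy choice naturally rebalances the pair.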
a.3 Signed Bandits
This is a variant of the Thresholding Bandits problem where we add the assumption that all the means lie above or below a certain threshold . Thus we choose and
It is easy to see, for , that the function and the characteristic time reduce to
(22) 
In the function we recognize the cost (with respect to the weights ) for moving all the means to the threshold . The subgradient of at is
This example is interesting because, if we followed a sampling rule based on the Frank-Wolfe algorithm, see (21) (which is equivalent to tracking the optimal proportion of draws in this case), it would boil down to a kind of Follow-the-Leader sampling rule. And it is well known that such a rule can fail to sample asymptotically according to the optimal proportion of draws, which is in this case:
where is the number of arms that attain the maximum appearing in the definition of the characteristic time, see (22). This highlights the necessity of forcing the exploration in some way.
a.4 Monotonous thresholding bandit
It is again a variant of the Thresholding Bandit problem, with some additional structure. We fix a threshold and assume that the sequence of means is increasing. The objective is to identify the arm with the closest mean to the threshold. Hence, we choose and
Unfortunately there is no explicit expression for , nor for the characteristic time, in this problem. But it is possible to compute an element of the subgradient of efficiently using isotonic regression, see Garivier et al. [2017].
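As a sketch, the pool-adjacent-violators algorithm (PAVA) computes such a weighted isotonic projection in linear time; this is a generic implementation for illustration, not the paper's exact routine.

```python
def pava_increasing(y, w):
    """Weighted isotonic regression by pool-adjacent-violators: returns
    the nondecreasing sequence x minimizing sum_a w_a * (y_a - x_a)^2."""
    # Each block stores [weighted mean, total weight, block length].
    blocks = []
    for val, weight in zip(y, w):
        blocks.append([val, weight, 1])
        # Merge blocks while the last two violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tot, tot, n1 + n2])
    out = []
    for mean, _, n in blocks:
        out.extend([mean] * n)
    return out

print(pava_increasing([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
# [1.0, 2.5, 2.5, 4.0]
```

Here the violating pair (3, 2) is pooled to its weighted mean 2.5, yielding the closest nondecreasing sequence.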
Appendix B Details on Numerical Experiments
As stated in Section 3, we consider the Best Arm Identification problem (see Appendix A) for . For all the algorithms we used the same stopping rule (12), with the threshold , and the same decision rule (13). We consider the following sampling rules:


TTTS: this is basically the sampling rule of Top Two Thompson Sampling by Russo [2016]. We use a Gaussian prior for each arm, and we slightly alter the rule to choose between the best sampled arm and its resampled challenger . Inspired by (20), if we denote by the sample from the posterior where is optimal and by the resample where is optimal, we choose arm if , and otherwise. Here the complexity of one step is dominated by the sampling phase, in particular the sampling of the challenger, which can be costly if the posteriors are concentrated.

LMAc: exactly the same as LMA but with a constant learning rate.

DT: this is the Direct Tracking (DT) algorithm by Garivier and Kaufmann [2016]; it basically tracks the optimal weights associated with the vector of empirical means, plus some forced exploration (same as BC). For the Best Arm Identification problem, to compute the optimal weights one needs to find the root of an increasing function, e.g. by the bisection method, whose evaluation requires the resolution of K scalar equations.

Unif: an arm is selected uniformly at random.
Appendix C Proof of Theorem 2
Fix some real number and consider the typical event
(23) 
where , for some horizon such that and ( is sufficient). We also impose that be greater than the smallest integer such that . This condition allows us to get rid of the effects of clipping the gradient on .
Using Proposition 2 we can replace the vector of empirical means by the true vector of means in the first sum of (16) at cost
similarly, we can replace by in the second sum
Hence, we deduce from (16), with , on the event
(24) 
Now we need to compare the sum in (24) with the quantity . To this end we will use Proposition 4, which is a consequence of the tracking and the forced exploration, see (11) and (10). Thus, using the concavity of and then Proposition 2, we have
Before applying Proposition 4, we need to handle the fact that the sum in the last inequality above begins at . But this is not harmful because is small enough; one can prove:
(25) 
Indeed, using the triangle inequality, we have
It remains to notice that
where in the last line we used , by definition. Now, using (25) then (17) we obtain
Thus, using the above inequality in (24) and dividing by we get