Multi-armed bandits (Thompson, 1933; Cesa-Bianchi and Lugosi, 2006; Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2019) formalize the core aspects of the exploration-exploitation dilemma in online learning, where an agent has to trade off the exploration of the environment to gather information and the exploitation of the current knowledge to maximize the reward. In the stochastic setting (Thompson, 1933; Auer et al., 2002a), each arm is characterized by a stationary reward distribution, and whenever an agent pulls an arm, it observes an i.i.d. sample from the corresponding distribution. Despite the extensive algorithmic and theoretical study of this setting (Cesa-Bianchi and Lugosi, 2006; Bubeck and Cesa-Bianchi, 2012; Kaufmann et al., 2012; Garivier and Cappé, 2011), the stationarity assumption is often too restrictive in practice, since the value of the arms may change over time (e.g., a change in the preferences of users). The adversarial setting (Auer et al., 2002b) addresses this limitation by removing any assumption on how the rewards are generated: learning agents should be able to perform well for any arbitrary sequence of rewards. While algorithms such as Exp3 (Auer et al., 2002b) are guaranteed to achieve small regret in this setting, their behavior is conservative, as all arms are repeatedly explored to avoid incurring too much regret from unexpected changes in the arms' values. This leads to unsatisfactory performance in practice, where arms' values, while non-stationary, are far from adversarial. Garivier and Moulines (2011) proposed a variation of the stochastic setting, where the distribution of each arm is piecewise stationary. Similarly, Besbes et al. (2014) introduced an adversarial setting where the total amount of change in the arms' values is bounded.
While these settings effectively capture the characteristics of a wide set of applications, they consider the case where the arms' values evolve independently of the decisions of the agent. This setting is often called restless bandits. On the other hand, in many problems the value of an arm changes only when it is pulled, and we then talk about rested bandits. For instance, the value of a service may deteriorate only when it is actually used. Similarly, if a recommender system always shows the same item to the users, they get bored and enjoy their experience on the platform less. Finally, a student can master a frequently taught topic in an intelligent tutoring system, so that extra learning on that topic becomes less effective. A particularly interesting case is represented by the rotting bandits, where the value of an arm decreases every time it is pulled. More precisely, each expected reward is non-increasing, since it could also remain constant at each pull. Heidari et al. (2016) studied this problem in the case where the rewards observed by the agent are deterministic (i.e., no noise) and showed that a greedy policy (i.e., selecting the arm that returned the largest reward the last time it was pulled) is optimal up to a small constant factor depending on the number of arms and the largest per-round decay in the arms' value. Bouneffouf and Féraud (2016) considered the stochastic setting when the dynamics of the rewards are known up to a constant factor. Finally, Levine et al. (2017) defined both non-parametric and parametric noisy rotting bandits, for which they derive new algorithms with regret guarantees. In particular, in the non-parametric case, where the decrease in reward is neither constrained nor known, they introduce the sliding-window average (wSWA) algorithm, which is shown to achieve a regret to the optimal policy of order $\widetilde{O}(\mu_{\max}^{1/3} K^{1/3} T^{2/3})$, where $T$ is the number of rounds, $K$ the number of arms, and $\mu_{\max}$ a bound on the largest expected reward.
In this paper, we study the non-parametric rotting setting of Levine et al. (2017) and introduce the Filtering on Expanding Window Average (FEWA) algorithm, a novel method that at each round constructs moving-average estimates with different windows to identify the arms that are most likely to perform well if pulled once more. Under the assumption that the reward decays are bounded by $L$, we show that FEWA achieves a regret of $\widetilde{O}(\sqrt{KT})$ without any prior knowledge of $L$, thus significantly improving over wSWA and matching the minimax rate of stochastic bandits up to a logarithmic factor. This shows that learning with non-increasing rewards is not more difficult than in the constant case (the stochastic setting). Furthermore, when rewards are constant, we recover standard problem-dependent UCB regret guarantees (up to constants), while in the rotting bandit scenario with no noise, the regret reduces to the one derived by Heidari et al. (2016). Finally, numerical simulations confirm our theoretical results and show the superiority of FEWA over wSWA.
We consider a rotting-bandit setting similar to the one introduced by Levine et al. (2017). At each round $t$, an agent chooses an arm $i(t)$ among $K$ arms and receives a noisy reward $r_t$. Unlike in standard bandits, the reward associated with each arm $i$ is a $\sigma^2$-sub-Gaussian random variable with an expected value $\mu_i(n)$ which depends on the number of times $n$ the arm was pulled before, e.g., $\mu_i(0)$ is the expectation at the beginning.¹ More formally, let $\{(i(s), r_s)\}_{s \le t}$ be the sequence of arms pulled and rewards observed over time until round $t$; then $$r_t = \mu_{i(t)}\big(N_{i(t),t}\big) + \varepsilon_t,$$ (¹Our definition of $\mu_i(n)$ slightly differs from Levine et al. (2017), where it denotes the expected value of arm $i$ when it is pulled for the $n$-th time instead of after $n$ pulls. As a result, in Levine et al. (2017), $n$ is defined from $1$, while with our notation it actually starts from $0$.)
where $N_{i,t} = \sum_{s=1}^{t-1} \mathbb{1}\{i(s) = i\}$ is the number of times arm $i$ is pulled before round $t$ and $\varepsilon_t$ is an independent $\sigma^2$-sub-Gaussian noise. In the following, we also denote by $r_{i,n}$ the random reward obtained from arm $i$ when it is pulled for the $n$-th time. We finally introduce a non-parametric rotting assumption with bounded decay.
Assumption 1. The reward functions $\mu_i$ are non-increasing, with bounded decays: for all arms $i$ and all $n$, $\mu_i(n) - \mu_i(n+1) \in [0, L]$. For the sake of the analysis, we also assume that the first pull is bounded: $\mu_i(0) \in [0, L]$. We refer to this set of functions as $\mathcal{L}_L$.
Similarly to Levine et al. (2017), we consider non-increasing functions, where the value of an arm can only decrease when it is pulled. However, we do not restrict the functions to stay positive; instead, we bound the per-round decay by $L$. On one hand, any function in $\mathcal{L}_L$ has its range bounded in $[-LT, L]$. Therefore, up to a shift, our setting is included in the setting of Levine et al. (2017). However, the regret of wSWA, defined below in Equation 2, is bounded by $\widetilde{O}(\mu_{\max}^{1/3} K^{1/3} T^{2/3})$, which becomes $\widetilde{O}(L^{1/3} K^{1/3} T)$ in our setting since $\mu_{\max}$ scales with $LT$. Therefore, wSWA is not proved to learn in our setting. On the other hand, any decreasing function with range in $[0, \mu_{\max}]$ is included in $\mathcal{L}_L$ for $L = \mu_{\max}$. Therefore, our analysis applies directly to the setting of Levine et al. (2017) by simply setting $L = \mu_{\max}$, where we get a regret bound of $\widetilde{O}(\sqrt{KT})$, thereby significantly improving the rate of their result.
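To make the rested rotting dynamics concrete, the following minimal simulator (an illustrative sketch, not the paper's code; the decay functions, noise level, and seed are arbitrary choices) implements the protocol above: an arm's mean depends only on its own pull count, and pulls return that mean plus sub-Gaussian (here Gaussian) noise.

```python
import numpy as np

class RottingBandit:
    """Minimal rested rotting-bandit environment (illustrative sketch).

    Arm i has a non-increasing expected reward mu_i(n) that depends only
    on the number of times n the arm has been pulled so far.
    """

    def __init__(self, mu_funcs, sigma=1.0, rng=None):
        self.mu_funcs = mu_funcs          # list of callables n -> mu_i(n)
        self.sigma = sigma                # noise scale
        self.pulls = [0] * len(mu_funcs)  # N_{i,t}: pulls of arm i so far
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i):
        mean = self.mu_funcs[i](self.pulls[i])  # mean after N_{i,t} pulls
        self.pulls[i] += 1                      # the arm "rots" only when pulled
        return mean + self.sigma * self.rng.normal()

# Example: arm 0 is constant, arm 1 decays by 0.1 per pull (so L = 0.1 here).
env = RottingBandit([lambda n: 0.5, lambda n: 1.0 - 0.1 * n], sigma=0.1)
r = env.pull(1)   # first pull of arm 1 has mean mu_1(0) = 1.0
```

With `sigma=0` the environment reduces to the deterministic case of Heidari et al. (2016).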
The learning problem
In general, an agent's policy $\pi$ returns the arm to pull at round $t$ on the basis of the whole history of observations, i.e., $i(t) = \pi(H_{t-1})$. In the following, we use $\pi(t)$ as a shorthand notation for $\pi(H_{t-1})$. The performance of a policy is measured by the (expected) rewards accumulated over time, $$J_T(\pi) = \sum_{t=1}^{T} \mu_{\pi(t)}\big(N_{\pi(t),t}\big).$$
Since $\pi$ depends on the (random) history observed over time, $J_T(\pi)$ is also random. We therefore define the expected cumulative reward as $\bar{J}_T(\pi) = \mathbb{E}[J_T(\pi)]$. We restate a useful characterization of the optimal policy given by Heidari et al. (2016).
Proposition 1 (Heidari et al., 2016). If the (exact) mean of each arm is known in advance for any number of pulls, then the optimal policy $\pi^\star$ maximizing the expected cumulative reward is greedy at each round, i.e., $$\pi^\star(t) \in \arg\max_{i}\ \mu_i\big(N_{i,t}\big).$$
We denote by $J_T^\star = J_T(\pi^\star)$ the cumulative reward of the optimal policy $\pi^\star$.
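Since the optimal policy characterized above is greedy on the known means, the optimal cumulative reward can be computed exactly with a priority queue over the arms' upcoming values. The heap-based implementation below is our illustration (assuming known, noiseless means), not the paper's code.

```python
import heapq

def greedy_oracle_reward(mu_funcs, T):
    """Cumulative reward of the greedy oracle: at each round, pull the arm
    with the largest upcoming mean mu_i(N_i). Illustrative sketch assuming
    the mean functions are known."""
    # heapq is a min-heap, so store negated values: (-mu_i(n), arm, n).
    heap = [(-f(0), i, 0) for i, f in enumerate(mu_funcs)]
    heapq.heapify(heap)
    total = 0.0
    for _ in range(T):
        neg, i, n = heapq.heappop(heap)
        total += -neg                                   # collect mu_i(n)
        heapq.heappush(heap, (-mu_funcs[i](n + 1), i, n + 1))
    return total
```

For example, with a decaying arm `1.0 - 0.5*n` and a constant arm `0.6`, the oracle pulls the decaying arm once and then switches to the constant one.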
The objective of a learning algorithm is to implement a policy $\pi$ whose performance is as close to that of $\pi^\star$ as possible. We define the (random) regret as $$R_T(\pi) = J_T^\star - J_T(\pi). \qquad (2)$$
Notice that the regret is measured against an optimal allocation over arms rather than a fixed-arm policy, as is the case in adversarial and stochastic bandits. Therefore, even the adversarial algorithms that one could think of applying in our setting (e.g., Exp3 of Auer et al., 2002b) are not known to provide any guarantee for our definition of regret. On the other hand, for constant $\mu_i$, our problem reduces to standard stochastic bandits, and our regret definition reduces to the standard stochastic regret. Thus, for constant functions, any algorithm with a guarantee on the rotting regret immediately inherits the same guarantee for the standard regret.
Let $N^\star_{i,T}$ be the (deterministic) number of times that arm $i$ is pulled by the optimal policy up to round $T$ (excluded). Similarly, for a given policy $\pi$, let $N_{i,T}$ be the (random) number of pulls of arm $i$. Using this notation, notice that the cumulative reward can be rewritten as $$J_T(\pi) = \sum_{i=1}^{K} \sum_{n=0}^{N_{i,T}-1} \mu_i(n).$$
Then, we can conveniently rewrite the regret as $$R_T(\pi) = \sum_{i \in \mathrm{up}} \sum_{n=N_{i,T}}^{N^\star_{i,T}-1} \mu_i(n) \;-\; \sum_{i \in \mathrm{op}} \sum_{n=N^\star_{i,T}}^{N_{i,T}-1} \mu_i(n), \qquad (3)$$
where $\mathrm{up} = \{i : N_{i,T} < N^\star_{i,T}\}$ and $\mathrm{op} = \{i : N_{i,T} > N^\star_{i,T}\}$ are the sets of arms that are respectively under-pulled and over-pulled by $\pi$ w.r.t. the optimal policy.
Prior regret bounds
In order to ease the discussion of the theoretical results we derive in Sect. 4, we restate prior results for two special cases. We start with the minimax regret lower bound for stochastic bandits, which corresponds to the case when the expected rewards are constant.
Proposition 2 (Auer et al., 2002b, Thm. 5.1). For any learning policy $\pi$ and any horizon $T$, there exists a stochastic stationary problem with $K$ sub-Gaussian arms with parameter $\sigma$ such that $\pi$ suffers an expected regret $$\mathbb{E}[R_T(\pi)] = \Omega\big(\sigma\sqrt{KT}\big),$$
where the expectation is taken with respect to both the randomization over rewards and the algorithm's internal randomization.
Next, Heidari et al. (2016) derived lower and upper bounds for the regret in the case of deterministic rotting bandits (i.e., $\sigma = 0$).
Proposition 3 (Heidari et al., 2016). Let $\pi^G$ be a greedy (not necessarily oracle) policy that selects at each round the arm with the largest upcoming reward. For any deterministic rotting bandit problem (i.e., $\sigma = 0$) satisfying Assumption 1 with bounded decay $L$, $\pi^G$ suffers a regret of at most $O(KL)$, independently of the horizon $T$.
Propositions 2 and 3 bound the performance of any algorithm on the constant and deterministic classes of problems, with respective parameters $(\sigma, L = 0)$ and $(\sigma = 0, L)$. Note that any problem in one of these two classes is also a rotting problem with parameters $(\sigma, L)$. Therefore, the performance of any algorithm on the rotting problem described above is also subject to both lower bounds.
3 FEWA: Filtering on Expanding Window Average
Since the expected rewards change over time, the main difficulty in the non-parametric rotting bandit setting introduced in the previous section is that we cannot entirely rely on all the samples observed until time $t$ to accurately predict which arm is likely to return the highest reward in the future. In particular, the older a sample, the less representative it is of the reward that the agent may observe by pulling the same arm once again. This suggests constructing estimates from the most recent samples. On the other hand, by discarding older rewards, we also reduce the number of samples used in the estimates, thus increasing their variance. In Algorithm 1, we introduce a novel algorithm (FEWA) that, at each round $t$, relies on estimates using windows of increasing length to filter out arms that are suboptimal with high probability, and then pulls the least pulled arm among the remaining ones.
Before describing FEWA in detail, we first describe the subroutine Filter in Algorithm 2, which receives as input a set of active arms $\mathcal{K}_h$, a window $h$, and a confidence parameter $\delta$, and returns an updated set of arms $\mathcal{K}_{h+1}$. For each arm $i$ that has been pulled at least $h$ times, the algorithm constructs an estimate $\widehat{\mu}_i^h$ that averages the $h$ most recent rewards observed from $i$. The estimator is well defined only for $h \le N_{i,t}$. Nonetheless, the construction of the active set and the stopping condition at Line 10 in Algorithm 1 guarantee that the estimates are always well defined for the arms in $\mathcal{K}_h$. The subroutine Filter then discards from $\mathcal{K}_h$ all the arms whose mean estimate (built with window $h$) is lower than that of the empirically best arm by more than twice a threshold $c(h, \delta)$ constructed from a standard Hoeffding concentration inequality (Proposition 4).
The Filter subroutine is used in FEWA to incrementally refine the set of active arms, starting with a window of size $h = 1$, until the condition at Line 10 is met. As a result, $\mathcal{K}_h$ only contains arms that passed the filter for all windows from $1$ up to $h$. Notice that it is crucial to start filtering arms from a small window and to keep refining the previous set of active arms, instead of completely recomputing it for every new window $h$. In fact, the estimates constructed using a small window use recent rewards, which are closer to the future value of an arm. As a result, if there is enough evidence that an arm is suboptimal already at a small window $h$, then there is no reason to consider it again for larger windows. On the other hand, a suboptimal arm may pass the filter for small windows, as the threshold $c(h, \delta)$ is large for small $h$, i.e., when only a few samples are used in constructing $\widehat{\mu}_i^h$. Thus, FEWA keeps refining $\mathcal{K}_h$ for larger and larger windows in the attempt to construct more and more accurate estimates and discard more suboptimal arms. This process stops when we reach a window as large as the number of samples of at least one arm in the active set (i.e., Line 10). At this point, increasing $h$ would not bring any additional evidence that could refine the active set further,² and FEWA finally selects the active arm whose number of samples matches the current window, i.e., the least pulled arm in $\mathcal{K}_h$. The set of available rewards and the numbers of pulls are then updated accordingly. (²$\widehat{\mu}_i^h$ is not defined for $h > N_{i,t}$.)
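The expanding-window filtering loop just described can be sketched as follows. This is our reading of the description, not the paper's exact pseudocode: the constants inside the confidence band and the choice $\delta_t = t^{-\alpha}$ are assumptions of the sketch.

```python
import math
import numpy as np

def c_band(h, t, sigma=1.0, alpha=4.0):
    """Hoeffding-style confidence band for a window of h samples at round t.
    The exact constants are an assumption of this sketch."""
    return math.sqrt(2 * sigma**2 * alpha * math.log(max(t, 2)) / h)

def fewa_choose(rewards, t, sigma=1.0, alpha=4.0):
    """One round of FEWA (sketch). `rewards[i]` lists the rewards observed
    so far from arm i; returns the arm to pull at round t."""
    K = len(rewards)
    for i in range(K):                 # pull each arm once before filtering
        if not rewards[i]:
            return i
    active = set(range(K))
    h = 1
    while True:
        # Filter: drop arms whose h-window average trails the best by > 2c.
        means = {i: np.mean(rewards[i][-h:]) for i in active}
        best = max(means.values())
        thr = 2 * c_band(h, t, sigma, alpha)
        active = {i for i in active if means[i] >= best - thr}
        # Stop when the window reaches the sample count of an active arm,
        # then play the least pulled arm in the active set.
        if any(len(rewards[i]) == h for i in active):
            return min(active, key=lambda i: len(rewards[i]))
        h += 1
```

Note that an active arm always has at least $h$ samples: the loop stops no later than when $h$ reaches the sample count of the least pulled active arm, so the window averages are always well defined.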
Theorem 1 shows that FEWA achieves a $\widetilde{O}(\sqrt{KT})$ regret without any knowledge of the size of the decay $L$. This significantly improves over the regret of wSWA (Levine et al., 2017), which is of order $\widetilde{O}(\mu_{\max}^{1/3} K^{1/3} T^{2/3})$ and needs to know $\mu_{\max}$. The improvement is also due to the fact that FEWA exploits filters using moving averages with increasing windows to discard arms that are suboptimal with high probability. Since this process is repeated at each round, FEWA smoothly tracks changes in the value of each arm, so that if an arm becomes worse later on, other arms can be recovered and pulled again. On the other hand, wSWA relies on a fixed exploratory phase where all arms are pulled in a round-robin fashion, and the tracking is performed using averages constructed with a fixed window. Furthermore, while the performance of wSWA can be optimized with prior knowledge of the range of the expected rewards (see the tuning of $\alpha$ in the work of Levine et al. 2017, Theorem 3.1), FEWA does not require any such knowledge to achieve its regret bound. Moreover, FEWA is naturally anytime ($T$ does not need to be known), while the fixed exploratory phase of wSWA requires $T$ to be properly tuned, and wSWA needs to resort to a doubling trick to be anytime. Algorithms (such as FEWA) with direct anytime guarantees have a practical advantage over doubling-trick ones, which often exhibit suboptimal empirical performance.
For $\sigma = 0$, our upper bound reduces to $O(KL)$, thus matching the prior (upper and lower) bound of Heidari et al. (2016) for deterministic rotting bandits. Moreover, the additive decomposition of the regret shows that there is no coupling between the stochastic problem and the rotting problem, as the $\sigma$-dependent terms are summed with the $L$-dependent term, while wSWA exhibits a multiplicative $\mu_{\max}^{1/3}$ factor⁴ in front of the leading term. Finally, the $\widetilde{O}(\sqrt{KT})$ rate matches the worst-case optimal regret bound of standard stochastic bandits (i.e., when the $\mu_i$s are constant) up to a logarithmic factor. Whether an algorithm can achieve a $O(\sqrt{KT})$ regret bound without the logarithmic factor is an open question. On one hand, FEWA uses many more confidence bounds than UCB1 to track the change of each arm. Thus, FEWA needs larger confidence bands in order to make all the bounds hold with high probability. Therefore, we pay an extra exploration cost, which may be necessary for handling the possible rotting behavior of the arms. On the other hand, our worst-case analysis shows that some of the difficult problems that attain the worst-case bound of Theorem 1 are realized with constant functions, i.e., standard stochastic bandits. For standard stochastic bandits, it is known that MOSS-like strategies (Audibert and Bubeck, 2009) achieve regret guarantees without the extra logarithmic factor. To sum up, the necessity of the extra logarithmic factor in the worst-case regret of rotting bandits remains an open problem. (⁴Specifically, it is $\widetilde{O}(\mu_{\max}^{1/3} K^{1/3} T^{2/3})$, where $\mu_{\max}$ is equivalent to $LT$ in our setting, though our setting is more general, as explained in the remark following Assumption 1.)
4.1 Sketch of the proof
In this section, we give a sketch of the proof of the regret bound. We first introduce the expected values of the estimators used in FEWA. For any arm $i$, number of pulls $n$, and window $h \le n$, we define $$\bar{\mu}_i^h(n) = \frac{1}{h} \sum_{j=n-h}^{n-1} \mu_i(j).$$
Notice that if at round $t$ the number of pulls of arm $i$ is $N_{i,t}$, then $\bar{\mu}_i^1(N_{i,t}) = \mu_i(N_{i,t}-1)$, which is the expected value of arm $i$ the last time it was pulled. We now state Hoeffding's concentration inequality and the favorable events that we consider throughout the analysis.
Proposition 4. For any fixed arm $i$, number of pulls $n$, and window $h \le n$, we have, with probability at least $1 - 2\delta$, $$\big|\widehat{\mu}_i^h - \bar{\mu}_i^h(n)\big| \le c(h, \delta) := \sqrt{\frac{2\sigma^2}{h}\log\frac{1}{\delta}}.$$
Furthermore, for any round $t$ and a confidence $\delta_t$, let $$\xi_t = \Big\{ \forall i,\ \forall h \le N_{i,t}:\ \big|\widehat{\mu}_i^h - \bar{\mu}_i^h(N_{i,t})\big| \le c(h, \delta_t) \Big\}$$
be the event under which all the possible estimates constructed by FEWA at round $t$ are well concentrated around their expected values. Then, taking a union bound over the at most $Kt$ estimates, $\mathbb{P}(\xi_t) \ge 1 - 2Kt\,\delta_t$.
Quality of arms in the active set
We are now ready to derive a crucial lemma that provides support to the arm selection process implemented by FEWA through the series of refinements obtained by the Filter subroutine. Recall that at any round $t$, after the pulls $(N_{1,t}, \dots, N_{K,t})$, the greedy (oracle) policy would select an arm characterized by $$i^+_t \in \arg\max_i\ \mu_i\big(N_{i,t}\big).$$
We denote by $\mu^+_t = \max_i \mu_i(N_{i,t})$ the expected reward that such an oracle policy would obtain by pulling $i^+_t$. Notice that the dependence on the numbers of pulls $N_{i,t}$ in the definition of $\mu^+_t$ is due to the fact that we consider what the deterministic oracle policy would do at the state reached by FEWA. While FEWA cannot directly target the performance of the greedy arm, the following lemma shows that the average of the last pulls of any arm in the active set returned by the filter is close to the performance of the current best arm, up to four times the confidence band $c(h, \delta_t)$.
Lemma 1. On the favorable event $\xi_t$, if an arm $i$ passes through a filter of window $h$ at round $t$, the average of its last $h$ pulls cannot deviate significantly from the value of the best available arm at that round, i.e., $$\bar{\mu}_i^h(N_{i,t}) \ge \mu^+_t - 4c(h, \delta_t).$$
Relating FEWA to the optimal policy. While Lemma 1 (with proof in the appendix) provides a first link between the value of the arms returned by the filter and the greedy arm, $\mu^+_t$ is still defined according to the numbers of pulls obtained by FEWA up to round $t$. On the other hand, the optimal policy could actually pull a different sequence of arms and reach different numbers of pulls at $t$. In order to bound the regret, we need to relate the actual performance of the optimal policy to the value of the arms pulled by FEWA. We let $d_i = |N^\star_{i,T} - N_{i,T}|$ be the absolute difference in the numbers of pulls between FEWA and the optimal policy. Since $\sum_i N_{i,T} = \sum_i N^\star_{i,T} = T$, we have $\sum_{i \in \mathrm{up}} d_i = \sum_{i \in \mathrm{op}} d_i$, which means that there are as many overpulls as underpulls over all arms. Let $j \in \arg\max_{i \in \mathrm{up}} \mu_i(N_{i,T})$ be an underpulled arm of largest upcoming value.⁵ (⁵If no such arm exists, then $\pi$ suffers no regret.) Then, since the $\mu_i$s are non-increasing, we have the inequalities $$\sum_{i \in \mathrm{up}} \sum_{n=N_{i,T}}^{N^\star_{i,T}-1} \mu_i(n) \;\le\; \Big(\sum_{i \in \mathrm{up}} d_i\Big)\, \mu_j\big(N_{j,T}\big) \;=\; \Big(\sum_{i \in \mathrm{op}} d_i\Big)\, \mu_j\big(N_{j,T}\big).$$
As a consequence, we derive from Equation 3 a first upper bound on the regret, $$R_T(\pi) \;\le\; \sum_{i \in \mathrm{op}} \sum_{n=N^\star_{i,T}}^{N_{i,T}-1} \Big( \mu_j\big(N_{j,T}\big) - \mu_i(n) \Big),$$
where the inequality is obtained by bounding each term of the first summation by $\mu_j(N_{j,T})$,⁶ and then using $\sum_{i \in \mathrm{up}} d_i = \sum_{i \in \mathrm{op}} d_i$ to redistribute these terms over the overpulled arms. (⁶Notice that since the $\mu_i$s are non-increasing, the inequality directly follows from the definition of $j$.) While the previous expression shows that we can now focus only on the over-pulled arms in $\mathrm{op}$, it is still difficult to directly control the expected reward $\mu_j(N_{j,T})$, as it may change at each round (by at most $L$). Nonetheless, we notice that cumulative sums of expected rewards can be directly linked to the average of the expected rewards over a suitable window. In fact, for any arm $i$ and any window $h \le N_{i,T}$, we have $$\sum_{n=N_{i,T}-h}^{N_{i,T}-1} \mu_i(n) \;=\; h\, \bar{\mu}_i^h\big(N_{i,T}\big).$$
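The identity above is simply "the sum of the means over a window equals the window length times the windowed average"; a quick numeric check under an arbitrary non-increasing reward function:

```python
def window_avg(mu, n, h):
    """bar{mu}^h(n): average of mu over the h pulls preceding the n-th."""
    return sum(mu(j) for j in range(n - h, n)) / h

mu = lambda n: 1.0 - 0.1 * n          # an arbitrary non-increasing mu_i

# Window of size 1 recovers the mean of the last pull: bar{mu}^1(n) = mu(n-1).
assert window_avg(mu, 7, 1) == mu(6)

# Sum over the last h pulls equals h times the windowed average.
N, h = 7, 3
assert abs(sum(mu(n) for n in range(N - h, N)) - h * window_avg(mu, N, h)) < 1e-12
```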
Lemma 2. Let $i$ be an arm overpulled by FEWA at round $t$ and let $h_{i,t} = N_{i,t} - N^\star_{i,T} \ge 1$ be the difference in the number of pulls w.r.t. the optimal policy at round $t$. On the favorable event $\xi_t$, we have $$\bar{\mu}_i^{h_{i,t}}\big(N_{i,t}\big) \;\ge\; \mu^+_t - 4c\big(h_{i,t}, \delta_t\big).$$
4.2 Discussion on problem-dependent result and the price of decaying rewards
Since our setting generalizes the standard stochastic bandit setting, where the $\mu_i$s are constant over pulls, a natural question is whether we pay any price for this generalization. While the result of Levine et al. (2017) suggested that learning in rotting bandits could be more difficult, in Theorem 1 we proved that FEWA matches the minimax regret of multi-armed bandits.
However, we may now wonder whether FEWA also matches the result of, e.g., UCB in terms of problem-dependent regret. As illustrated in the next remark, we show that, up to constants, FEWA performs as well as UCB on any stochastic problem.
Remark 1. If we apply the result of Corollary 1 to stochastic bandits, i.e., when the $\mu_i$s are constant and $\Delta_i$ denotes the gap of arm $i$ to the best arm, we get that $$R_T(\pi_{\mathrm{FEWA}}) = O\bigg( \sum_{i:\,\Delta_i > 0} \frac{\sigma^2 \log T}{\Delta_i} \bigg).$$
Therefore, our algorithm matches the lower bound of Lai and Robbins (1985) up to a constant. Moreover, in the case of constant functions, our upper bound for FEWA is at most a constant factor larger than the one for UCB1 (Auer et al., 2002a).⁷ (⁷To make the results comparable, we need to replace the confidence bound in the proof of Auer et al. (2002a) by one adapted to $\sigma^2$-sub-Gaussian noise.) The main source of suboptimality is the use of confidence-bound filtering instead of an upper-confidence index policy. Selecting the least pulled arm in the active set is conservative, as it requires uniform exploration until elimination, resulting in a factor 4 in the confidence-bound guarantee on the selected arm (versus 2 for UCB), which implies 4 times more overpulls than UCB (see Equation 8). We conjecture this may not be necessary, and it is an open question whether it is possible to derive either an index policy or a selection rule that is better than pulling the least pulled arm in the active set. The other source of suboptimality w.r.t. UCB is the use of larger confidence bands, because (1) a higher number of estimators is computed at each round (up to $N_{i,t}$ per arm instead of one for UCB) and because (2) the regret at each round in the worst case grows with $t$, which requires reducing the probability of the unfavorable event.
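For reference, the UCB1-style baseline in this comparison can be sketched as an index policy with a width adapted to $\sigma^2$-sub-Gaussian noise (the constant inside the square root is an assumption of this sketch, not the exact one from Auer et al., 2002a):

```python
import math

def ucb1_choose(counts, sums, t, sigma=1.0):
    """UCB1-style index policy for sigma^2-sub-Gaussian rewards (sketch).
    counts[i] and sums[i] are the pulls of and total reward from arm i."""
    # Pull every arm once first.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Index: empirical mean plus a sigma-scaled exploration width.
    def index(i):
        width = sigma * math.sqrt(8 * math.log(t) / counts[i])
        return sums[i] / counts[i] + width
    return max(range(len(counts)), key=index)
```

Unlike FEWA, the index directly selects the arm with the highest optimistic value, which is the source of the factor-of-2 versus factor-of-4 discussion above.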
As a result of Remark 1, we claim that, surprisingly and contrary to what the prior work (Levine et al., 2017) suggests, rotting bandits are not significantly more difficult than multi-armed bandits with constant mean rewards. This observation is not only theoretical: in Section 5, we show that in our experiments, the empirical regret of FEWA was at most twice as large as that of UCB1.
Remark 1 also reveals that Corollary 1 is in fact a problem-dependent result. Just as we derived a problem-dependent bound on FEWA's regret for constant functions (standard stochastic bandits), we now show a way to obtain a similar problem-dependent bound for the general case. In particular, with Corollary 1, we upper-bound the maximum number of overpulls by a problem-dependent quantity.
Corollary 2 (problem-dependent guarantee).
For any rotting problem satisfying Assumption 1, the regret is bounded in terms of this problem-dependent quantity.
4.3 Runtime and memory usage
At each round $t$, FEWA has a worst-case time and memory complexity of $O(t)$. In fact, it needs to store and update up to $N_{i,t}$ averages per arm. Since moving from an average computed on window $h$ to one computed on window $h+1$ can be done at a cost $O(1)$, the per-round complexity is $O(t)$. Such complexity may be undesirable.⁸ (⁸This observation is worst-case. In fact, in some cases, the number of samples for the suboptimal arms may be much smaller than $t$; for example, in standard bandits it could be of order $\log t$. This would dramatically reduce the number of means to compute at each round.)
The first idea to improve the time and memory complexity is to reduce the number of filters used in the selection. We first notice that the selectivity of the filters scales with $c(h, \delta) \propto 1/\sqrt{h}$. As a result, when $h$ increases, the usefulness of consecutive filters decreases. This remark suggests that we could replace the window increment $h \leftarrow h + 1$ (Line 9 of Algorithm 1) by a geometric update $h \leftarrow 2h$, in order to keep a constant ratio between two consecutive selectivity values. However, this is not enough to reduce the amount of computation: we still have to compute a logarithmic number of averages of up to $t$ samples, and therefore we still pay $O(t)$ in time and memory. We therefore provide a more efficient version of FEWA, called EFF-FEWA (Appendix E), which also uses a logarithmic number of filters (handling the expanding dynamics), but now with precomputed statistics (handling the sliding dynamics) that are only updated when the number of samples for a particular arm doubles. Specifically, the precomputed statistics are updated with a delay, so that the statistic of the $j$-th filter is representative of exactly $2^j$ samples. For instance, the (two) statistics of length $4$ are replaced every $4$ pulls, while the statistics of length $8$ are replaced every $8$ pulls. Therefore, each filter $j$ needs to store only two statistics for each arm $i$: the currently used one and the pending one. Therefore, at any time, the $j$-th filter is fed, for all arms, with averages of $2^j$ consecutive samples among the last $2^{j+1}$ ones. In the worst case, the most recent samples are not yet covered by filter $j$, but these samples are necessarily covered by all the filters before it. This way, EFF-FEWA recovers the same bound as FEWA up to a constant factor (proof in Appendix E). In contrast, the small number of filters can now be updated sporadically, thus reducing the per-round time and space complexity to only $O(\log t)$ per arm. A similar yet different idea from the one we propose here appeared independently in the context of stream mining (Bifet and Gavaldà, 2007).
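The delayed, doubling update schedule of EFF-FEWA's precomputed statistics can be sketched per arm as follows (an illustrative sketch: the data structure and names are ours, not the paper's pseudocode). Each scale $j$ keeps a completed average of $2^j$ consecutive samples and fills a pending block; the pending block replaces the completed one every $2^j$ pulls.

```python
class GeometricWindowStats:
    """Per-arm statistics for the EFF-FEWA idea (illustrative sketch).

    current[j] holds the latest completed average of 2**j consecutive
    samples; pending[j] accumulates the next block. Each completed average
    thus covers 2**j consecutive samples among the last 2**(j+1) observed.
    """

    def __init__(self):
        self.n = 0           # total samples seen for this arm
        self.current = []    # current[j]: completed 2**j-average (or None)
        self.pending = []    # pending[j]: [running_sum, count]

    def add(self, reward):
        self.n += 1
        # Open a new scale j the first time the sample count reaches 2**j.
        if self.n == 2 ** len(self.pending):
            self.pending.append([0.0, 0])
            self.current.append(None)
        for j, block in enumerate(self.pending):
            block[0] += reward
            block[1] += 1
            if block[1] == 2 ** j:          # pending block full: promote it
                self.current[j] = block[0] / 2 ** j
                self.pending[j] = [0.0, 0]
```

After four samples 1, 2, 3, 4, scale 0 holds the last sample (4.0), scale 1 holds the average of samples 2 and 3 (2.5), and scale 2 is still filling.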
5 Numerical simulations
In this section, we report numerical simulations designed to provide insight into the differences between wSWA and FEWA. We consider rotting bandits with two arms: the first arm has a constant mean reward, while the second undergoes a single abrupt decrease after a fixed number of its own pulls.
The rewards are then generated by applying i.i.d. Gaussian noise of standard deviation $\sigma$. The single point of non-stationarity in the second arm is designed to satisfy Assumption 1 with a bounded decay $L$. The size of the drop has been chosen so as not to advantage FEWA, which pulls each arm equally often when no arm is filtered. In the two-arm setting defined above, the optimal allocation $(N^\star_{1,T}, N^\star_{2,T})$ follows from the greedy oracle characterization of Heidari et al. (2016).
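The exact gap and change point used in the experiment are not reproduced above; the sketch below, with hypothetical placeholder values, shows how such a two-arm single-drop instance and its noisy rewards can be generated.

```python
import numpy as np

def two_arm_drop_instance(gap, change_point):
    """Mean functions for a two-arm single-drop rotting instance.
    Both arguments are hypothetical placeholders, not the paper's values:
    arm 1 is constant at 0, arm 2 drops by `gap` after `change_point`
    of its own pulls (so the per-pull decay L equals `gap`)."""
    mu1 = lambda n: 0.0
    mu2 = lambda n: gap / 2 if n < change_point else -gap / 2
    return [mu1, mu2]

def sample_rewards(mu, pulls, sigma, rng):
    """Noisy rewards for a fixed pull schedule (list of arm indices)."""
    counts = [0] * len(mu)
    out = []
    for i in pulls:
        out.append(mu[i](counts[i]) + sigma * rng.normal())
        counts[i] += 1
    return out
```

Setting `sigma=0` makes the rewards deterministic, which is convenient for checking the drop location.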
Both algorithms have a parameter $\alpha$ to tune. In wSWA, $\alpha$ is a multiplicative constant for the theoretically optimal window. We try four different values of $\alpha$, including the one recommended by Levine et al. (2017). In FEWA, $\alpha$ tunes the confidence $\delta_t = t^{-\alpha}$ of the threshold $c(h, \delta_t)$. While our analysis suggests a large value of $\alpha$ (a smaller one suffices for bounded variables), Hoeffding confidence intervals, union bounds, and filtering algorithms are too conservative for a typical case. Therefore, we use a more aggressive value of $\alpha$. While Theorem 1 suggests that the performance of FEWA should only mildly depend on the bounded decay $L$, Theorem 3.1 of Levine et al. (2017) displays a linear dependence on the largest expected reward $\mu_{\max}$. Their Theorem 3.1 also states that this linear dependence appears for larger horizons when $\mu_{\max}$ is small.
In Figure 1, we validate the difference between the two algorithms and their dependence on $L$. The first plot shows the regret at the final round for various values of $L$ and the different algorithms. The second and the third plots show the regret as a function of the number of rounds for two values of $L$, which correspond respectively to the worst-case performance for FEWA and to the large-decay regime. All our experiments are run for the same horizon $T$, and we average the results over multiple independent runs.
Before discussing the results, we point out that in the rotting setting, the regret can both increase and decrease over time. Consider two simple policies: $\pi_1$, which first pulls arm 1 for half of the rounds and then pulls arm 2 for the remaining rounds, and $\pi_2$, which reverses the order (first arm 2 and then arm 1). If we take $\pi_2$ as reference, $\pi_1$ would have an increasing regret during the first phase, which would reverse back to 0 at the switching time, since $\pi_1$ would then start collecting the rewards of the still-fresh arm 2, while $\pi_2$ (which had already exhausted arm 2) transitioned to pulling arm 1 with a lower reward.
As illustrated in Theorem 3.1 of Levine et al. (2017), the regret of wSWA scales linearly with $\mu_{\max}$ when the latter is large. In Figure 1 (left), we show that this regime effectively depends on $\alpha$: the smaller the $\alpha$, the smaller the averaging window, and the more reactive the algorithm is to large drops (see Figure 1, right). On the other hand, FEWA ends up making a single mistake for large $L$. Therefore, it recovers a regret with no dependence on the horizon, as Heidari et al. (2016) do in the deterministic case. Indeed, when $L$ is large, Corollary 2 shows that the $L$-dependent term is the leading one for any reasonable horizon.
For small $L$ (Figure 1, middle), wSWA is competitive only when $\alpha$ is sufficiently large. We see that the value of $\alpha$ recommended by Levine et al. (2017) is indeed a good choice for a long initial phase, even though it quickly becomes suboptimal afterwards. For FEWA, small values of $L$ correspond to the hardest problems, as suggested by Theorem 1. We conclude that FEWA is more robust than wSWA, as it almost always achieves the best performance across different problems while being agnostic to the value of $L$. On the other hand, wSWA's performance is very sensitive to the choice of $\alpha$, and the same value of the parameter may lead to significantly different performance depending on $L$. Finally, we notice that EFF-FEWA has a regret comparable to FEWA when $L$ is large, while for small values of $L$, EFF-FEWA suffers the cost of the delay in its statistics updates, which is largest for the last filter.
We also tested our algorithm in a rotting setting with 10 arms: the mean of one arm is constant with value 0, while each of the other 9 arms abruptly drops from a positive to a negative value after 1000 of its own pulls, with drop sizes ranging from 0.001 to 10 in a geometric sequence. Figure 2 shows the regret of the different algorithms. Besides FEWA and the four instances of wSWA, we add SW-UCB and D-UCB (Garivier and Moulines, 2011), with window and discount parameters tuned to achieve their best performance. While these two algorithms are known benchmarks for non-stationary bandits, they are designed for the restless case. Therefore, they keep exploring arms that have not been pulled for many rounds. This behavior is suboptimal for the rested bandits that we consider here, as the arms stay constant when they are not pulled.
We see that after each abrupt drop, FEWA is among the best algorithms at quickly recovering and adapting to the new situation. EFF-FEWA shows a similar performance after big drops, as its statistics are not too delayed on new samples. However, the effect of delayed updates has a larger impact in situations where many samples are needed to filter an arm. Therefore, we observe a larger regret at the end of the game compared to FEWA. wSWA with large $\alpha$ uses windows that are too large and therefore, for very big changes in the mean reward, suffers a high empirical regret at the beginning of this game. On the other hand, wSWA with small $\alpha$ suffers a larger empirical regret at the end of this game, where it is blind to small differences between arms, as its window size is too small. We conclude that the fixed-size windows used by wSWA make it difficult for the algorithm to adapt to different situations. Moreover, when $\alpha$ is too large, wSWA is very sensitive to its doubling trick.
We remark that SW-UCB and D-UCB show similar behavior. They are both heavily penalized by their restless forgetting, even though their forgetting parameters (window and discount) are optimally tuned for this experimental setup. Indeed, there is no good choice of parameters: a fast forgetting rate makes the policies repeatedly pull bad arms (whose mean rewards do not change when they are not pulled in our rested setup), while a slow forgetting rate prevents the policies from adapting to abrupt shifts.
Finally, in Figure 3, we compare the performance of FEWA against UCB1 (Auer et al., 2002a) on two-arm stochastic bandits with different gaps. These experiments confirm the theoretical findings of Theorem 1 and Corollary 2: FEWA has a performance comparable to UCB1. In particular, both algorithms exhibit a logarithmic asymptotic behavior, and the empirical ratio between the regrets of the two algorithms is well below the constant factor separating the two upper bounds. This shows the ability of FEWA to be competitive for stochastic bandits.
6 Conclusion and discussion
We introduced FEWA, a novel algorithm for the non-parametric rotting bandits. We proved that FEWA achieves a $\widetilde{O}(\sqrt{KT})$ regret without any knowledge of the decays, by using moving averages with a window that effectively adapts to the changes in the expected rewards. This result greatly improves over the wSWA algorithm proposed by Levine et al. (2017), which suffers a regret of order $\widetilde{O}(\mu_{\max}^{1/3} K^{1/3} T^{2/3})$. Our analysis of FEWA is quite non-standard, as FEWA hinges on the adaptive nature of the window size. The most interesting aspect of the proof technique (which may be of independent interest) is that confidence bounds are used not only for the arm selection, but also for the data selection, i.e., to identify the best window to trade off the bias and the variance in estimating the current value of each arm. Furthermore, we showed that in the case of constant arms, FEWA recovers the performance of UCB1, while in the deterministic case we match the performance of Heidari et al. (2016).
Acknowledgments
The research presented was supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-European project Allocate, and French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003). The work of A. Carpentier is also partially supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether grant MuSyAD (CA 1488/1-1), by the DFG - 314838170, GRK 2297 MathCoRe, by the DFG GRK 2433 DAEDALUS, by the DFG CRC 1294 Data Assimilation, Project A03, and by the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18. This research has also benefited from the support of the FMJH Program PGMO and from the support to this program from CRITEO. Part of the computational experiments was conducted using the Grid’5000 experimental testbed (https://www.grid5000.fr).
References
- Audibert and Bubeck (2009) Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, 2009.
- Auer et al. (2002a) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.
- Auer et al. (2002b) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
- Besbes et al. (2014) Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed bandit problem with non-stationary rewards. In Neural Information Processing Systems, 2014.
- Bifet and Gavaldà (2007) Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In International Conference on Data Mining, 2007.
- Bouneffouf and Féraud (2016) Djallel Bouneffouf and Raphael Féraud. Multi-armed bandit problem with known trend. Neurocomputing, 205(C):16–21, 2016.
- Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.
- Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- Garivier and Cappé (2011) Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, 2011.
- Garivier and Moulines (2011) Aurélien Garivier and Eric Moulines. On upper-confidence-bound policies for switching bandit problems. In Algorithmic Learning Theory, 2011.
- Heidari et al. (2016) Hoda Heidari, Michael Kearns, and Aaron Roth. Tight policy regret bounds for improving and decaying bandits. In International Conference on Artificial Intelligence and Statistics, 2016.
- Kaufmann et al. (2012) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, 2012.
- Lai and Robbins (1985) Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Lattimore and Szepesvári (2019) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. 2019.
- Levine et al. (2017) Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In Neural Information Processing Systems, 2017.
- Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
Appendix A Proof of core FEWA guarantees
Let be an arm that passed a filter of window at round . First, we use the confidence bound for the estimates and we pay the cost of keeping all the arms up to a distance of ,
where in the last inequality we used that for all
Second, since the means of arms are decaying, we know that
Third, we show that the largest average of the last means of arms in is increasing with ,
To show the above property, we remark that, thanks to our selection rule, the arm that has the largest average of means always passes the filter. Formally, let . Then, for such , we have
where the first and the third inequalities are due to the confidence bounds on the estimates, while the second one is due to the definition of .
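With assumed notation, since the original display is not preserved in this version ($\hat{\mu}_i^h$ the average of the last $h$ observed rewards of arm $i$, $\overline{\mu}_i^h$ the average of its last $h$ mean rewards, $i^{+}$ the arm maximizing $\overline{\mu}_i^h$, and $c(h,\delta)$ the confidence width), the chain of inequalities has the form:

```latex
\hat{\mu}_{i^{+}}^{h}
\;\geq\; \overline{\mu}_{i^{+}}^{h} - c(h,\delta)
\;\geq\; \overline{\mu}_{i}^{h} - c(h,\delta)
\;\geq\; \hat{\mu}_{i}^{h} - 2c(h,\delta)
\qquad \text{for every arm } i,
```

so that $i^{+}$ satisfies the filter condition $\hat{\mu}_{i^{+}}^{h} \geq \max_{i} \hat{\mu}_{i}^{h} - 2c(h,\delta)$ and is never discarded.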
Appendix B Proofs of auxiliary results
Let . For any policy , the regret at round T is no bigger than
We refer to the first sum above as and to the second one as .
We consider the regret at round . From Equation 3, the decomposition of regret in terms of overpulls and underpulls gives
In order to separate the analysis for each arm, we upper-bound all the rewards in the first sum by their maximum . This upper bound is tight for a problem-independent bound, because in the worst case one cannot hope that the unexplored reward decays so as to reduce the regret. We also notice that there are as many terms in the first double sum (the number of underpulls) as in the second one (the number of overpulls). This number is equal to . Notice that this does not mean that for each arm , the number of overpulls equals the number of underpulls, which cannot happen anyway, since an arm cannot be simultaneously underpulled and overpulled. Therefore, we keep only the second double sum,
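Writing $N^{\pi}_{i,T}$ and $N^{\star}_{i,T}$ for the number of pulls of arm $i$ by policy $\pi$ and by the optimal policy (symbols assumed here, since the displays are not preserved), the counting argument reads:

```latex
\sum_{i=1}^{K} \big(N^{\star}_{i,T} - N^{\pi}_{i,T}\big)^{+}
\;=\;
\sum_{i=1}^{K} \big(N^{\pi}_{i,T} - N^{\star}_{i,T}\big)^{+},
\qquad \text{since} \quad
\sum_{i=1}^{K} N^{\pi}_{i,T} \;=\; \sum_{i=1}^{K} N^{\star}_{i,T} \;=\; T.
```

Both policies make exactly $T$ pulls in total, so the deficits and the excesses must balance across arms, even though no single arm can contribute to both sums.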
Then, we need to separate the overpulls that occur under from those under . We introduce , the round at which pulls arm for the -th time. We now make explicit the round at which each overpull occurs,
For the analysis of the pulls done under , we do not need to know at which rounds they were done. Therefore,
For FEWA, it is not easy to directly guarantee a low probability of overpulls (the second sum). Thus, we upper-bound the regret of each overpull at round under by its maximum value . While this is done to ease the analysis of FEWA, it is valid for any policy . Then, noticing that there can be at most one overpull per round , i.e., , we get
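The step above amounts to a bound of the following form, with $\mu^{\max}$ an upper bound on the per-round regret and $\overline{\xi}_t$ the failure event at round $t$ (notation assumed here, as the displays are not preserved):

```latex
\mathbb{E}\Bigg[\sum_{t=1}^{T} \mu^{\max}\,\mathbb{1}\{\overline{\xi}_{t}\}\Bigg]
\;\leq\; \mu^{\max} \sum_{t=1}^{T} \mathbb{P}\big(\overline{\xi}_{t}\big)
\;\leq\; \mu^{\max}\, T\, \max_{t\leq T}\mathbb{P}\big(\overline{\xi}_{t}\big),
```

so that choosing the confidence parameter small enough makes the contribution of the failure event negligible compared to the main term.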
Therefore, we conclude that
Let . For policy with parameters (, ), defined in Lemma 2 is upper-bounded by
First, we define , the last overpull of arm pulled at round under . Now, we upper-bound by including all the overpulls of arm until the -th overpull, even the ones under ,
where . We can therefore split the second sum of the term above into two parts: the first part corresponds to the first (possibly zero) terms (overpulling differences), and the second part to the last, -th one. Recalling that at round , arm was selected under , we apply Corollary 1 to bound the regret caused by the previous overpulls of (possibly none),
with . The second inequality holds because is decreasing and is decreasing as well. The last inequality follows from the definition of the confidence interval in Proposition 4, with for . If and , then
since and and by the assumptions of our setting. Otherwise, we can decompose
For term , since arm was overpulled at least once by FEWA, it passed at least the first filter. Since this -th overpull is done under , by Lemma 1 we have that
The second difference cannot exceed , since by the assumptions of our setting the maximum decay in one round is bounded. Therefore, we further upper-bound Equation 17 as
Let . Thus, with and , we can use Proposition 4 and get
Appendix C Minimax regret analysis of FEWA
To get the problem-independent upper bound for FEWA, we need to upper-bound the regret by quantities that do not depend on . The proof is based on Lemma 2, where we bound the expected values of the terms and from the statement of the lemma. We start by noting that on the high-probability event , we have by Lemma 3, and that
Since and there are at most overpulled arms, we can upper-bound the number of terms in the above sum by . Next, the total number of overpulls cannot exceed . Since the square-root function is concave, we can use Jensen's inequality. Moreover, we can deduce that the worst allocation of overpulls is the uniform one, i.e.,
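Denoting by $h_{i}$ the number of overpulls of arm $i$ (a symbol assumed here, as the displays are not preserved), with $\sum_{i} h_{i} \leq T$, Jensen's inequality applied to the concave square root gives:

```latex
\sum_{i=1}^{K} \sqrt{h_{i}}
\;=\; K \cdot \frac{1}{K}\sum_{i=1}^{K} \sqrt{h_{i}}
\;\leq\; K \sqrt{\frac{1}{K}\sum_{i=1}^{K} h_{i}}
\;\leq\; \sqrt{K\,T},
```

and the first inequality holds with equality for the uniform allocation $h_{i} = T/K$, which is therefore the worst case.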