A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free

We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret $\mathcal{O}(\min\{\sqrt{ST}, \Delta^{1/3}T^{2/3}\})$ for a contextual bandit problem with $T$ rounds, $S$ switches and $\Delta$ total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know $S$ or $\Delta$ ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the $\mathcal{O}(\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\})$ bound of (Luo et al., 2018), and greatly generalize and improve the $\mathcal{O}(\sqrt{ST})$ result of (Auer et al., 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.

1 Introduction

For online learning problems, a standard performance measure is static regret, which compares the difference between the total reward of the best fixed policy (or action/arm/expert under different contexts) and the total reward of the algorithm. While minimizing static regret makes sense when there exists a fixed policy with large total reward, it becomes much less meaningful in a non-stationary environment where data distribution is changing over time and no single policy can perform well all the time.

Instead, in this case a more natural benchmark is to compare the algorithm with the best sequence of policies. This is formally defined as dynamic regret, which is the difference between the total reward of the best sequence of policies and the total reward of the algorithm. Due to the ubiquity of non-stationary data, there is an increasing trend of designing online algorithms with strong dynamic regret guarantees. We provide a more detailed review of related work in Section 2. In short, while obtaining dynamic regret bounds is relatively well-studied in the full-information setting, for the more challenging bandit feedback, most existing works focus only on the simplest multi-armed bandit problem. More importantly, a sharp contrast between these two regimes is that, except for the recent work of AuerGO018 for a two-armed bandit problem, none of the others achieves optimal dynamic regret in the bandit setting without knowledge of the non-stationarity of the data, indicating the extra challenge of being adaptive to non-stationary data with partial information.

In this work, we make a significant step in this direction. Specifically, we consider the general contextual bandit setting AuerCeFrSc02, LangfordZh08, which subsumes many other bandit problems. For an environment with $T$ rounds where at each time $t$ the data is generated from some distribution $\mathcal{D}_t$, denote by $S$ the number of switches (plus one) and by $\Delta$ the total variation of these distributions (see Section 3 for a more formal definition of the setting). Our main contribution is an algorithm with the following guarantee:

Main Result

Our algorithm achieves the optimal dynamic regret bound $\tilde{\mathcal{O}}(\min\{\sqrt{ST}, \Delta^{1/3}T^{2/3}\})$ without knowing $S$ or $\Delta$. Moreover, it is oracle-efficient.

Here the dependence on all other parameters is omitted (see Theorem 5 for the complete version), and the optimality of the dependence on $S$ and $\Delta$ is well known garivier2011upper, BesbesGuZe14. Oracle-efficiency refers to efficiency assuming access to an ERM oracle, a common assumption made in most prior works on efficient contextual bandits (formally defined in Section 3).

Our result is by far the best and most general dynamic regret bound for bandit problems. Recent work by LuoWA018 studies the exact same setting and achieves the same optimal bound only if $S$ and $\Delta$ are known; otherwise their algorithms only achieve suboptimal bounds such as $\mathcal{O}(\min\{S^{1/4}T^{3/4}, \Delta^{1/5}T^{4/5}\})$. On the other hand, AuerGO018 propose the first bandit algorithm with expected regret $\mathcal{O}(\sqrt{ST})$ without knowing $S$, but only for the simplest setting: the two-armed bandit problem without contexts. In contrast, our algorithm works for the general multi-armed bandit problem with contextual information, enjoys a meaningful bound as long as $\Delta$ is small (even when $S$ is of order $T$), works with high probability, and importantly is oracle-efficient as well.

Our key technique is inspired by AuerGO018. The high level idea of their algorithm is to occasionally enter some pure exploration phase in order to detect non-stationarity, and crucially the durations of these exploration phases are multi-scale and determined in some randomized way. The reason behind this is that smaller non-stationarity requires more time to discover and vice versa. We extend this multi-scale idea to the contextual bandit setting. However, the extension is highly non-trivial and requires the following two new elements:

1. First, we find that pure exploration over arms (used by AuerGO018, LuoWA018) is not the optimal way to detect non-stationarity in contextual bandits. Instead, we propose to let the algorithm occasionally enter replay phases, meaning that the algorithm acts according to some policy distribution used earlier by the algorithm itself. The duration of a replay phase and which previous policy distribution to replay are both determined in some randomized way similar to AuerGO018. This can be seen as an interpolation between using the current policy distribution and using pure exploration, and as shown by our analysis achieves a better trade-off between exploitation and exploration in non-stationary environments.

2. Second, the algorithm of AuerGO018 is an “arm-elimination” approach, which eliminates arms as long as their sub-optimality is identified. Direct extension to contextual bandit leads to an inefficient approach similar to PolicyElimination by DudikHsKaKaLaReZh11. Instead, our algorithm is based on the soft elimination scheme of AgarwalHsKaLaLiSc14 and can be efficiently implemented with an ERM oracle. Combining this soft elimination scheme and the replay idea in a proper way is another key novelty of our work.

We review related work in Section 2 and introduce all necessary preliminaries in Section 3. Our algorithm is presented in Section 4. The rest of the paper is dedicated to the relatively involved analysis of our algorithm.

2 Related Work

Different forms of dynamic regret bound.

Bounding dynamic regret in terms of the number of switches $S$ is traditionally referred to as switching regret or tracking regret, and has been studied under various settings. Note, however, that in some works $S$ refers to the number of switches of data distributions just as in our definition (e.g. garivier2011upper, WeiHoLu16, liu2018change, LuoWA018), while in others $S$ refers to the more general notion of the number of switches in the competitor sequence (e.g. herbster1998tracking, bousquet2002tracking, AuerCeFrSc02, hazan2009efficient).

Bounding dynamic regret in terms of the variation of loss functions or data distributions is also widely studied (e.g. BesbesGuZe14, BesbesGuZe15, LuoWA018), and there are in fact several other forms of dynamic regret bounds studied in the literature (e.g. Zinkevich03, slivkins2008adapting, jadbabaie2015online, WeiHoLu16, yang2016tracking, zhang2017improved).

Achieving optimal dynamic regret bounds without any prior knowledge of the non-stationarity is the main focus of this work. This has been achieved for most full-information problems LuoSc15, jun2017online, ZhangYaJiZh18, but is much more challenging in the bandit setting. Several recent attempts only achieve suboptimal bounds KarninAn16, LuoWA018, cheung2019learning. It was not clear whether optimal bounds were achievable in this case, until the recent work of AuerGO018 answers this in the affirmative for the two-armed bandit problem. As mentioned our results significantly generalize their work.

Contextual bandits.

Contextual bandit is a generalization of the multi-armed bandit problem. While direct generalization of the classic multi-armed bandit algorithm already achieves the optimal static regret AuerCeFrSc02, recent research has been focusing on developing practically efficient algorithms with strong regret guarantee due to their applicability to real-world applications. To avoid running time that is linear in the size of the policy set, most existing works make the practical assumption that an ERM oracle is given to solve the corresponding offline problem. Based on this assumption, a series of progress has been made on developing oracle-efficient algorithms with small static regret LangfordZh08, DudikHsKaKaLaReZh11, AgarwalHsKaLaLiSc14, syrgkanis2016efficient, rakhlin2016bistro, SyrgkanisLuKrSc16. All these results rely on some stationary assumption of the environment, since it is known that minimizing static regret oracle-efficiently is impossible in an adversarial environment Hazan2016.

Despite the negative result for static regret with oracle-efficient algorithms, LuoWA018 find that this is no longer true for dynamic regret, and develop oracle-efficient algorithms with optimal dynamic regret when the non-stationarity is known. Their work is most closely related to ours and our algorithm is in essence similar to their Ada-ILTCB algorithm. The key novelty compared to theirs is the replay phases mentioned earlier, which eventually allows the algorithm to adapt to the non-stationarity of the data.

Replay phases.

Introducing replay phases is one of our key contributions. The closest idea in the literature is the method of “mixing past posteriors” of bousquet2002tracking, adamskiy2012putting, which at each time acts according to some weighted combination of all previous distributions. One key difference of our method is that once it enters into a replay phase, it has to continue for a certain amount of time to gather enough information for non-stationarity detection. Another difference is that in bousquet2002tracking, adamskiy2012putting the main point of “mixing past posteriors” is to obtain some form of “long-term memory”; otherwise for typical dynamic regret bounds it is enough to just mix with some amount of pure exploration. It is not clear to us whether our replay idea actually equips the algorithm with some kind of “long-term memory” as well, and we leave this as a future direction.

3 Preliminaries

The contextual bandit problem is defined as follows. Let $\mathcal{X}$ be some arbitrary context space and $K$ be the number of actions. A policy $\pi: \mathcal{X} \to [K]$ is a mapping from the context space to the actions. (Throughout the paper we use the notation $[n]$ to denote the set $\{1, \ldots, n\}$ for some integer $n$.) The learner is given a set of policies $\Pi$, assumed to be finite for simplicity but with a huge cardinality $N = |\Pi|$. Before the learning procedure starts, the environment decides $T$ distributions $\mathcal{D}_1, \ldots, \mathcal{D}_T$ on $\mathcal{X} \times [0,1]^K$, and draws $T$ independent samples from them: $(x_t, r_t) \sim \mathcal{D}_t$ for all $t \in [T]$. The learning procedure then proceeds as follows: for each time $t = 1, \ldots, T$, the learner first receives the context $x_t$, and then based on this context picks an action $a_t \in [K]$. Afterwards the learner receives the reward feedback $r_t(a_t)$ for the selected action but not others. The instantaneous regret against a policy $\pi$ at time $t$ is $r_t(\pi(x_t)) - r_t(a_t)$. The classic goal of contextual bandit algorithms is to minimize $\max_{\pi \in \Pi} \sum_{t=1}^{T} \big(r_t(\pi(x_t)) - r_t(a_t)\big)$, that is, the cumulative regret against the best fixed policy, and the optimal bound is known to be $\mathcal{O}(\sqrt{KT \ln N})$ AuerCeFrSc02.

The classic regret is not a good performance measure for non-stationary environments where no single policy can perform well all the time. Instead, we consider dynamic regret, which compares the reward of the algorithm to the reward of the best policy at each time. Specifically, denote the expected reward of policy $\pi$ at time $t$ as $\mathcal{R}_t(\pi) = \mathbb{E}_{(x,r) \sim \mathcal{D}_t}[r(\pi(x))]$, and the optimal policy at time $t$ as $\pi^*_t = \arg\max_{\pi \in \Pi} \mathcal{R}_t(\pi)$. The dynamic regret is then defined as $\sum_{t=1}^{T} \big(r_t(\pi^*_t(x_t)) - r_t(a_t)\big)$.

It is well known that in general it is impossible to achieve sublinear dynamic regret. Instead, typical dynamic regret bounds are expressed in terms of some quantities that characterize the non-stationarity of the data distributions, and are meaningful as long as these quantities are sublinear in $T$. Two such quantities considered in this work are: the number of distribution hard switches (plus one) $S = 1 + \sum_{t=2}^{T} \mathbb{1}\{\mathcal{D}_t \neq \mathcal{D}_{t-1}\}$ and the total variation of distributions $\Delta = \sum_{t=2}^{T} \|\mathcal{D}_t - \mathcal{D}_{t-1}\|_{\mathrm{TV}}$.
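As a small illustration of these two non-stationarity measures, the following sketch computes them for a sequence of finite distributions. It assumes each distribution is given as a probability vector over the same finite outcome set and measures the change between consecutive distributions by their L1 distance; both the representation and the helper names `num_switches` and `total_variation` are assumptions made for this example, not part of the paper.

```python
from typing import Sequence


def num_switches(dists: Sequence[Sequence[float]]) -> int:
    """S: one plus the number of rounds t >= 2 whose distribution differs from round t-1."""
    return 1 + sum(
        1 for prev, cur in zip(dists, dists[1:]) if list(prev) != list(cur)
    )


def total_variation(dists: Sequence[Sequence[float]]) -> float:
    """Delta: sum over t >= 2 of the L1 distance between consecutive distributions
    (the L1 normalisation is an assumption of this sketch)."""
    return sum(
        sum(abs(p - q) for p, q in zip(prev, cur))
        for prev, cur in zip(dists, dists[1:])
    )


# Three rounds from one distribution, then two rounds from another.
example = [[0.5, 0.5]] * 3 + [[0.25, 0.75]] * 2
print(num_switches(example), total_variation(example))
```

A single abrupt change contributes one to the switch count regardless of its size, whereas its contribution to the variation scales with how far the distribution moved, which is why the two measures lead to different regret bounds.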

More notation.

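For any integers $s \le s'$, we denote by $[s, s']$ the time interval $\{s, s+1, \ldots, s'\}$. For an interval $I = [s, s']$, we define the number of switches and the variation on this interval respectively as $S_I = 1 + \sum_{t=s+1}^{s'} \mathbb{1}\{\mathcal{D}_t \neq \mathcal{D}_{t-1}\}$ and $\Delta_I = \sum_{t=s+1}^{s'} \|\mathcal{D}_t - \mathcal{D}_{t-1}\|_{\mathrm{TV}}$.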

As in most algorithms, at each time $t$ we sample an action $a_t$ according to some distribution $p_t$, calculated based on the history before time $t$. After receiving the reward feedback $r_t(a_t)$, we construct the usual importance-weighted estimator $\hat{r}_t$, which is defined as $\hat{r}_t(a) = \frac{r_t(a_t)}{p_t(a_t)} \mathbb{1}\{a = a_t\}$ for all $a \in [K]$ and is clearly unbiased with mean $r_t$.
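The unbiasedness claim can be checked directly on a toy example. The sketch below builds the importance-weighted estimate for one round and computes its exact expectation by enumerating the sampled action; actions are 0-indexed here, and the function names are hypothetical choices for this illustration.

```python
def ips_estimate(num_actions, chosen_action, observed_reward, probs):
    """Importance-weighted estimate: observed reward divided by the sampling
    probability at the chosen action, and zero at every other action."""
    r_hat = [0.0] * num_actions
    r_hat[chosen_action] = observed_reward / probs[chosen_action]
    return r_hat


def exact_mean_of_estimate(rewards, probs):
    """Expectation of the estimate over the random action, by enumeration."""
    k = len(rewards)
    mean = [0.0] * k
    for a in range(k):
        estimate = ips_estimate(k, a, rewards[a], probs)
        for b in range(k):
            mean[b] += probs[a] * estimate[b]
    return mean


rewards = [0.2, 0.9, 0.5]     # full reward vector, hidden from the learner
probs = [0.5, 0.25, 0.25]     # sampling distribution over actions
print(exact_mean_of_estimate(rewards, probs))
```

The estimate is unbiased only when every action has positive sampling probability, and its magnitude grows as that probability shrinks, which is why the algorithm enforces a minimum probability on every action.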

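For any interval $I$, we define the average reward of a policy $\pi$ over this interval as $\mathcal{R}_I(\pi) = \frac{1}{|I|} \sum_{t \in I} \mathcal{R}_t(\pi)$ and similarly its empirical average reward as $\widehat{\mathcal{R}}_I(\pi) = \frac{1}{|I|} \sum_{t \in I} \hat{r}_t(\pi(x_t))$. The optimal policy in interval $I$ is defined as $\pi^*_I = \arg\max_{\pi \in \Pi} \mathcal{R}_I(\pi)$ while the empirically best policy is $\hat{\pi}_I = \arg\max_{\pi \in \Pi} \widehat{\mathcal{R}}_I(\pi)$. Furthermore, the expected and empirical interval (static) regret of a policy $\pi$ for an interval $I$ are respectively defined as $\mathrm{Reg}_I(\pi) = \mathcal{R}_I(\pi^*_I) - \mathcal{R}_I(\pi)$ and $\widehat{\mathrm{Reg}}_I(\pi) = \widehat{\mathcal{R}}_I(\hat{\pi}_I) - \widehat{\mathcal{R}}_I(\pi)$. When $I = [t, t]$, we simply use $t$ to replace $I$ as the subscript. For example, $\mathrm{Reg}_t(\pi)$ represents $\mathrm{Reg}_{[t,t]}(\pi)$.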

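For a context $x$ and a distribution $Q$ over the policies $\Pi$, the projected distribution over the actions is denoted by $Q(\cdot|x)$ such that $Q(a|x) = \sum_{\pi: \pi(x) = a} Q(\pi)$ for all $a \in [K]$. The smoothed projected distribution with a minimum probability $\nu$ is defined as $Q^{\nu}(\cdot|x) = \nu \mathbf{1} + (1 - K\nu) Q(\cdot|x)$, where $\mathbf{1}$ is the all-one vector. Similarly to AgarwalHsKaLaLiSc14, our algorithm keeps track of a bound on the variance of the reward estimates. To this end, define for a policy $\pi$, an interval $I$, a distribution $Q$, and a minimum probability $\nu$, the empirical and expected variance as

$$\widehat{V}_I(Q, \nu, \pi) \triangleq \frac{1}{|I|} \sum_{t \in I} \left[\frac{1}{Q^{\nu}(\pi(x_t)|x_t)}\right], \qquad V_I(Q, \nu, \pi) \triangleq \frac{1}{|I|} \sum_{t \in I} \mathbb{E}_{x \sim \mathcal{D}^{\mathcal{X}}_t}\left[\frac{1}{Q^{\nu}(\pi(x)|x)}\right],$$

where $\mathcal{D}^{\mathcal{X}}_t$ is the marginal distribution of $\mathcal{D}_t$ over the context space $\mathcal{X}$. Again, $\widehat{V}_t(Q, \nu, \pi)$ and $V_t(Q, \nu, \pi)$ are shorthands for $\widehat{V}_{[t,t]}(Q, \nu, \pi)$ and $V_{[t,t]}(Q, \nu, \pi)$ respectively.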
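To make the projection, the smoothing, and the empirical variance concrete, here is a minimal sketch. It assumes policies are Python callables returning 0-indexed actions and that a policy distribution is a list of (policy, weight) pairs; these representations and the names `project`, `smooth`, and `empirical_variance` are assumptions for illustration only.

```python
def project(Q, x, num_actions):
    """Projected distribution: total weight of the policies choosing each action at x."""
    probs = [0.0] * num_actions
    for policy, weight in Q:
        probs[policy(x)] += weight
    return probs


def smooth(probs, nu):
    """Smoothed projection: mix with the uniform floor so each action has probability >= nu."""
    k = len(probs)
    return [nu + (1 - k * nu) * p for p in probs]


def empirical_variance(Q, nu, pi, contexts, num_actions):
    """Average over the contexts of 1 / (smoothed probability of pi's action)."""
    total = 0.0
    for x in contexts:
        total += 1.0 / smooth(project(Q, x, num_actions), nu)[pi(x)]
    return total / len(contexts)


def always_zero(x):
    return 0


def parity(x):
    return x % 2


Q = [(always_zero, 0.75), (parity, 0.25)]
print(smooth(project(Q, 1, 2), 0.1))
print(empirical_variance(Q, 0.1, parity, [1, 3, 5], 2))
```

Because every smoothed probability is at least the minimum probability, the variance term of any policy is bounded by the reciprocal of that minimum, which is what ties the exploration probability to the quality of the reward estimates.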

We are interested in efficient algorithms assuming access to an ERM oracle AgarwalHsKaLaLiSc14, defined as:

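An ERM oracle is an algorithm which takes any set $\mathcal{S}$ of context-reward pairs $(x, r) \in \mathcal{X} \times \mathbb{R}^K$ as inputs and outputs any policy in $\arg\max_{\pi \in \Pi} \sum_{(x,r) \in \mathcal{S}} r(\pi(x))$.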

An algorithm is oracle-efficient if its total running time and the number of oracle calls are both polynomial in and , excluding the running time of the oracle itself.
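For intuition, an ERM oracle can be mimicked by brute force when the policy set is tiny. The sketch below enumerates the whole policy class, which is exactly the cost that oracle-efficient algorithms are designed to avoid, so it is only a reference implementation for toy experiments; the function name and the list-based reward vectors are assumptions.

```python
def erm_oracle(policies, data):
    """Return a policy with the largest total reward on the given
    (context, reward_vector) pairs, found by exhaustive search."""
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in data))


def pick_first(x):
    return 0


def pick_second(x):
    return 1


# Two contexts, each with a reward vector over two actions.
dataset = [(0, [0.1, 0.9]), (1, [0.2, 0.8])]
best = erm_oracle([pick_first, pick_second], dataset)
print(best.__name__)
```

In practice the oracle would be a supervised learning routine (for example a cost-sensitive classifier) whose running time does not scale with the number of policies.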

Finally, we use notation to suppress logarithmic dependence on , and for some confidence level . For notational convenience we also define .

4 Algorithm

Our algorithm is built upon ILOVETOCONBANDITS of AgarwalHsKaLaLiSc14. The main idea of their algorithm is to find a sparse distribution over the policies with both low empirical regret and low empirical variance on the collected data, and then sample actions according to this distribution. Finding such distributions is formalized in Figure 1, Optimization Problem (OP), and AgarwalHsKaLaLiSc14 show that this can be efficiently implemented using an ERM oracle and importantly the distribution is sparse. Under a stationary environment, it can be shown that the empirical regret concentrates around the expected regret reasonably well and thus the algorithm has low regret.

The Ada-ILTCB algorithm of LuoWA018 works by equipping ILOVETOCONBANDITS with some non-stationarity tests and restarting once non-stationarity is detected. Our algorithm works under a similar framework with similar tests, but importantly enters into replay phases occasionally. The complete pseudocode is included in Algorithm 1 and we describe in detail how it works below.

The algorithm starts a new epoch every time it restarts. We index an epoch by $i$ and denote the first round of epoch $i$ by $\tau_i$. Within an epoch, the algorithm works on a block schedule. Specifically, in epoch $i$, we call the interval $[\tau_i, \tau_i + L - 1]$ block $0$ and the interval $[\tau_i + 2^{j-1}L, \tau_i + 2^{j}L - 1]$ block $j$ for any $j \ge 1$ (in the case of restart, the block ends earlier), where $L$ is some fixed base length. (The lengths of these blocks are doubling except that block 0 and block 1 have the same length $L$. This is merely for notational convenience and is not crucial.) Each block $j$ is associated with an exploration probability $\nu_j$. At the beginning of each block $j$ (for $j \ge 1$), the algorithm first solves the Optimization Problem (OP) (Figure 1) using exploration probability $\nu_j$ and all data collected since the beginning of the current epoch, that is, data from $[\tau_i, \tau_i + 2^{j-1}L - 1]$. The solution is denoted by $Q_j$, which is a sparse distribution over policies.
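The block schedule can be sketched in a few lines. This example assumes the block boundaries given by Eq. (11), with block 0 covering the first base-length rounds of the epoch, and ignores early termination due to a restart; the function name `block_interval` is hypothetical.

```python
def block_interval(epoch_start, base_length, j):
    """Inclusive (first_round, last_round) of block j in an epoch that is never restarted."""
    if j == 0:
        return (epoch_start, epoch_start + base_length - 1)
    first = epoch_start + 2 ** (j - 1) * base_length
    last = epoch_start + 2 ** j * base_length - 1
    return (first, last)


# Epoch starting at round 1 with base length 4.
schedule = [block_interval(1, 4, j) for j in range(4)]
print(schedule)
```

Under these assumptions the data available at the start of block j always spans exactly as many rounds as all earlier blocks combined, so each new distribution is fit on a data set that doubles in size.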

Afterwards, for most of the time of the current block, the algorithm simply plays according to $Q_j$, just like ILOVETOCONBANDITS. The difference is that at each time, with a prescribed probability the algorithm enters into a replay phase of some index $m$, which lasts for a number of rounds that grows with $m$. In Algorithm 1, this is implemented by first sampling a Bernoulli variable rep to decide whether or not to enter into a replay phase, and if so then randomly selecting a replay index to ensure the aforementioned probability. A set is maintained to record all pairs of replay index and replay interval. Similar to AuerGO018, the reason for using different lengths is to allow the algorithm to detect different levels of non-stationarity: a longer replay interval with a larger index is used to detect smaller non-stationarity.

Note that at each time $t$, the algorithm could potentially be in multiple replay phases simultaneously. Let $M_t$ be the set of indices of all the ongoing replay intervals. If $M_t$ is empty, the algorithm is not in any replay phase and simply samples an action according to $Q_j^{\nu_j}(\cdot|x_t)$ as mentioned. On the other hand, if $M_t$ is not empty, the algorithm uniformly at random picks an index $m$ from $M_t$, and then replays the distribution learned at the beginning of block $m$, that is, samples an action according to $Q_m^{\nu_m}(\cdot|x_t)$. Recall that our reward estimators $\hat{r}_t$ are defined in terms of a distribution $p_t$ over actions, and it is clear that for our algorithm $p_t(a) = Q_j^{\nu_j}(a|x_t)$ if $M_t = \emptyset$ and $p_t(a) = \frac{1}{|M_t|} \sum_{m \in M_t} Q_m^{\nu_m}(a|x_t)$ otherwise.
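The resulting sampling distribution over actions can be sketched as follows: with no ongoing replay it is the current block's smoothed distribution, and otherwise it is the uniform mixture of the replayed smoothed distributions. The inputs are assumed to be already projected and smoothed action distributions at the current context; the function name is hypothetical.

```python
def action_distribution(current, replayed):
    """Distribution used to sample the action in one round.

    current:  smoothed action distribution of the current block.
    replayed: smoothed action distributions of all ongoing replay indices.
    """
    if not replayed:
        return list(current)
    num_actions = len(current)
    return [
        sum(dist[a] for dist in replayed) / len(replayed)
        for a in range(num_actions)
    ]


print(action_distribution([0.7, 0.3], []))
print(action_distribution([0.7, 0.3], [[0.5, 0.5], [0.9, 0.1]]))
```

Picking a replay index uniformly at random and then sampling from its distribution is the same as sampling from this mixture, so the importance weights can be computed from the mixture directly.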

Finally, at the end of every replay interval, the algorithm calls the subroutine EndofReplayTest to check whether the data collected in the replay interval and that collected prior to the current block (that is, in the earlier blocks of the current epoch) are consistent. Also, at the end of every block $j$, the algorithm calls another subroutine EndofBlockTest to check the consistency between data up to block $j$ and data up to block $k$ for all $k < j$. Both tests are similar in spirit to those of LuoWA018, and check the difference of empirical regret or empirical variance of each policy over different sets of data (see Figure 1). If either of the tests indicates that there is a significant distribution change, the algorithm restarts from scratch and enters into the next epoch. Also note that if EndofBlockTest passes and the algorithm enters into a new block, all unfinished replay intervals are discontinued (the set of recorded replay intervals is reset to be empty).

We provide an illustration of our algorithm in Figure 2.

Oracle-efficiency.

Our algorithm can be implemented efficiently with an ERM oracle. AgarwalHsKaLaLiSc14 show that the Optimization Problem (OP) with input can be solved using oracle calls with a solution that is -sparse. In our case, is at most . The two tests can also be implemented efficiently by the exact same arguments of LuoWA018. For example, in EndofReplayTest, to check if there exists a satisfying Eq. (3), we can first use two oracle calls to precompute and , and collect . Then we again use an oracle call to find and add this value to , which is equal to taking the max over of the left hand side of Eq. (3). It remains to compare this value with the right hand side of Eq. (3).

5 Main Theorem and Proof Outline

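The dynamic regret guarantee of our algorithm is summarized below.

[Main Theorem] Our algorithm guarantees that, with high probability,

$$\sum_{t=1}^{T} \big(r_t(\pi^*_t(x_t)) - r_t(a_t)\big) = \tilde{O}\left(\min\left\{\sqrt{K(\ln N)ST},\; \sqrt{K(\ln N)T} + (K \ln N)^{\frac{1}{3}} \Delta^{\frac{1}{3}} T^{\frac{2}{3}}\right\}\right).$$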

The rest of the paper proves our main theorem, following these steps: in Section 5.1, we provide a key lemma that bounds the dynamic regret for any interval within a block (in terms of some algorithm-dependent quantities). In Section 5.2, with the help of the key lemma we bound the dynamic regret for a block. In Section 5.3 we bound the number of epochs/restarts, and sum up the regret over all blocks in all epochs to get the final bound. Since the analysis in Sections 5.1 and 5.2 is all about a fixed epoch $i$, for notational simplicity we drop the epoch index from the subscripts in these two sections.

5.1 A main lemma and regret decomposition

To bound the dynamic regret over any interval, we define the concept of excess regret:

For an interval $I$ that lies in block $j$ for some $j$, we define its excess regret as

$$\varepsilon_I \triangleq \max_{\pi \in \Pi} \mathrm{Reg}_I(\pi) - 8\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j-1)}(\pi),$$

and its excess regret threshold as .

In words, the excess regret of $I$ is the maximum discrepancy between a policy's expected static regret on $I$ and (8 times) its empirical static regret on the first $j$ blocks. Large excess regret thus indicates non-stationarity. We now use the following main lemma to decompose the dynamic regret on $I$ based on whether the excess regret reaches the excess regret threshold.
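A toy computation may help to see how excess regret reacts to a distribution change. The sketch below assumes the two regret quantities are already available as dictionaries keyed by policy name; in the analysis the expected regret is not observable, so this is purely an illustration of the definition, and the function name is hypothetical.

```python
def excess_regret(expected_regret_on_interval, empirical_regret_on_past):
    """Maximum over policies of (expected regret on the interval)
    minus 8 times (empirical regret on the earlier blocks)."""
    return max(
        expected_regret_on_interval[pi] - 8 * empirical_regret_on_past[pi]
        for pi in expected_regret_on_interval
    )


# Stationary case: the policy that looked good in the past is still good.
print(excess_regret({"pi_a": 0.0, "pi_b": 0.4}, {"pi_a": 0.0, "pi_b": 0.05}))

# After a change: the previously best policy now suffers large regret.
print(excess_regret({"pi_a": 0.5, "pi_b": 0.0}, {"pi_a": 0.0, "pi_b": 0.1}))
```

In the first case every policy's current regret is explained by its past empirical regret and the excess is not positive, while in the second case the formerly best policy has become poor, so the excess is large.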

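[Main Lemma] With high probability, our algorithm guarantees that for any block $j$ and any interval $I$ that lies in block $j$,

$$\sum_{t \in I} \big(r_t(\pi^*_t(x_t)) - r_t(a_t)\big) \le O\left(\left(\sum_{t \in I} \sum_{m \in M_t \cup \{j\}} \bar{K}\nu_m\right) + |I|\alpha_I + |I|\Delta_I + |I|\varepsilon_I \mathbb{1}\{\varepsilon_I > D_3\alpha_I\}\right)$$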

where.

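By Azuma's inequality and a union bound over all possible intervals, we have that with high probability, for any interval $I$,

$$\sum_{t \in I} \big(r_t(\pi^*_t(x_t)) - r_t(a_t)\big) \le \sum_{t \in I} \mathbb{E}_t\big[r_t(\pi^*_t(x_t)) - r_t(a_t)\big] + O\left(\sqrt{|I| \log(T^2/\delta)}\right), \tag{9}$$

where $\mathbb{E}_t$ is the conditional expectation given everything up to Step 1 of the algorithm in round $t$. It remains to bound each $\mathbb{E}_t[r_t(\pi^*_t(x_t)) - r_t(a_t)]$. Depending on the case of replay or non-replay, this term can be written as

$$\mathbb{E}_t\big[r_t(\pi^*_t(x_t)) - r_t(a_t)\big] = \begin{cases} \sum_{a \in [K]} \sum_{m \in M_t} \dfrac{Q_m^{\nu_m}(a|x_t)}{|M_t|} \mathbb{E}_t\big[r_t(\pi^*_t(x_t)) - r_t(a)\big], & \text{if } M_t \neq \emptyset, \\[2ex] \sum_{a \in [K]} Q_j^{\nu_j}(a|x_t)\, \mathbb{E}_t\big[r_t(\pi^*_t(x_t)) - r_t(a)\big], & \text{if } M_t = \emptyset. \end{cases}$$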

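Now observe that for any $Q$ and $\nu$, by definition of $Q^{\nu}$ we have

$$\sum_{a} Q^{\nu}(a|x_t)\, \mathbb{E}_t\big[r_t(\pi^*_t(x_t)) - r_t(a)\big] \le K\nu + \sum_{\pi \in \Pi} Q(\pi) \mathrm{Reg}_t(\pi).$$

So we continue to bound $\mathbb{E}_t[r_t(\pi^*_t(x_t)) - r_t(a_t)]$ by

$$\sum_{m \in M_t \cup \{j\}} K\nu_m + \begin{cases} \dfrac{1}{|M_t|} \sum_{m \in M_t} \sum_{\pi \in \Pi} Q_m(\pi) \mathrm{Reg}_t(\pi), & \text{if } M_t \neq \emptyset, \\[2ex] \sum_{\pi \in \Pi} Q_j(\pi) \mathrm{Reg}_t(\pi), & \text{if } M_t = \emptyset. \end{cases} \tag{10}$$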

Next note that for any and , we have

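$$\begin{aligned} \sum_{\pi \in \Pi} Q_m(\pi) \mathrm{Reg}_t(\pi) &\le \sum_{\pi \in \Pi} Q_m(\pi) \mathrm{Reg}_I(\pi) + O(\Delta_I) \\ &\le \sum_{\pi \in \Pi} 8 Q_m(\pi) \widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(\pi) + O(\Delta_I) + \varepsilon_I \\ &\le \sum_{\pi \in \Pi} 8 Q_m(\pi) \left(4 \widehat{\mathrm{Reg}}_{\mathcal{B}_{m-1}}(\pi) + D_4 \bar{K}\nu_m\right) + O(\Delta_I) + \varepsilon_I \\ &\le O(\bar{K}\nu_m + \Delta_I) + \varepsilon_I \\ &\le O(\bar{K}\nu_m + \alpha_I + \Delta_I) + \varepsilon_I \mathbb{1}\{\varepsilon_I > D_3\alpha_I\}. \end{aligned}$$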

In fact, the above holds for too since the left hand side is at most . Combining this inequality with Eq. (10) and (9), and noting that the term is of order finish the proof.

5.2 Dynamic regret for a block

In this section, we bound the dynamic regret of some block $j$ within epoch $i$. This block can be formally written as

$$\mathcal{J} \triangleq [\tau_i, \tau_{i+1} - 1] \cap [\tau_i + 2^{j-1}L, \tau_i + 2^{j}L - 1]. \tag{11}$$

The idea is to divide $\mathcal{J}$ into several intervals, apply Lemma 5.1 to each of them, and finally sum up the regret. Importantly, we need to divide $\mathcal{J}$ in a careful way according to the following lemma, so that the variation on each interval is bounded by its excess regret threshold, while at the same time the number of intervals is not too large. Note that this division only happens in the analysis.

There is a way to partition any interval into , such that , and .

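For the first $\Gamma - 1$ intervals $I_1, \ldots, I_{\Gamma-1}$ of this partition, we apply Lemma 5.1 to each of them. Note that the term $|I_k|\Delta_{I_k}$ in Lemma 5.1 can be absorbed by the term $|I_k|\alpha_{I_k}$ by our partition property. Summing up the bounds from Lemma 5.1, we get the following dynamic regret bound for these intervals:

$$\begin{aligned} &\sum_{k=1}^{\Gamma-1} O\left(\left(\sum_{t \in I_k} \sum_{m \in M_t \cup \{j\}} \bar{K}\nu_m\right) + |I_k|\alpha_{I_k} + |I_k|\Delta_{I_k} + |I_k|\varepsilon_{I_k} \mathbb{1}\{\varepsilon_{I_k} > D_3\alpha_{I_k}\}\right) \\ &\quad \le \underbrace{\sum_{k=1}^{\Gamma-1} \sum_{t \in I_k} \sum_{m \in M_t \cup \{j\}} O(\bar{K}\nu_m)}_{\textsc{Term 1}} + \underbrace{\sum_{k=1}^{\Gamma-1} O(|I_k|\alpha_{I_k})}_{\textsc{Term 2}} + \underbrace{\sum_{k=1}^{\Gamma-1} O\left(|I_k|\varepsilon_{I_k} \mathbb{1}\{\varepsilon_{I_k} > D_3\alpha_{I_k}\}\right)}_{\textsc{Term 3}}. \end{aligned} \tag{12}$$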

For the last interval in the block, it is possible that it was interrupted by a restart, which makes the analysis trickier, and we defer the details to Appendix C. Further bounding Term 1 and Term 2 is relatively straightforward from the definitions of the quantities involved and the construction of the partition (see Appendix C). For Term 3, the idea is that this term is nonzero only when $\varepsilon_{I_k}$ is large, which implies that the distribution in $I_k$ is quite different from that in the earlier blocks of the epoch. In this case we will show that as long as the algorithm starts a replay phase with some "correct" index within $I_k$, it will detect the non-stationarity with high probability and restart the algorithm. Thus we only need to bound the regret accumulated before this "correct" replay phase appears. We provide the complete proof in Appendix C.1, which is the most important part of the analysis. Combining the bounds for these three terms, we eventually arrive at the following lemma: with high probability, the following holds for any block $\mathcal{J}$ with block index $j$:

$$\sum_{t \in \mathcal{J}} \big(r_t(\pi^*_t(x_t)) - r_t(a_t)\big) = \tilde{O}\left(\min\left\{\sqrt{KC_0 S_{\mathcal{J}} 2^j L},\; \sqrt{KC_0 2^j L} + (KC_0)^{\frac{1}{3}} \Delta_{\mathcal{J}}^{\frac{1}{3}} (2^j L)^{\frac{2}{3}}\right\}\right).$$

Note that $2^{j-1}L$ is the length of block $j$ (for $j \ge 1$) unless there is a restart triggered within this block, in which case the length is smaller.

5.3 Combining regret over blocks and epochs

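We finally sum up the dynamic regret over blocks and epochs. To this end, we reintroduce the subscripts in our notation, and write epoch $i$ as $\mathcal{E}_i = [\tau_i, \tau_{i+1} - 1]$ and block $j$ in epoch $i$ as $\mathcal{J}_{ij}$, defined as in Eq. (11) for $j \ge 1$.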

Dynamic regret for an epoch.

We denote the last block index in epoch $i$ by $j^*$. Using Lemma 5.2, we combine the regret over all blocks in epoch $i$ and upper bound the regret of epoch $i$ simultaneously by (using the bound in terms of the number of switches)

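$$\begin{aligned} \tilde{O}\left(L + \sum_{j=1}^{j^*} \sqrt{KC_0 S_{\mathcal{J}_{ij}} 2^j L}\right) &= \tilde{O}\left(KC_0 + \sqrt{KC_0 \sum_{j=1}^{j^*} S_{\mathcal{J}_{ij}} \sum_{j=1}^{j^*} 2^j L}\right) && \text{(Cauchy-Schwarz)} \\ &= \tilde{O}\left(KC_0 + \sqrt{KC_0 (S_{\mathcal{E}_i} + j^*) |\mathcal{E}_i|}\right) = \tilde{O}\left(\sqrt{KC_0 S_{\mathcal{E}_i} |\mathcal{E}_i|}\right) \end{aligned}$$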

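and similarly by (using the bound in terms of variation and Hölder's inequality)

$$\tilde{O}\left(L + \sum_{j=1}^{j^*} \sqrt{KC_0 2^j L} + \sum_{j=1}^{j^*} (KC_0)^{\frac{1}{3}} \Delta_{\mathcal{J}_{ij}}^{\frac{1}{3}} (2^j L)^{\frac{2}{3}}\right) = \tilde{O}\left(\sqrt{KC_0 |\mathcal{E}_i|} + (KC_0)^{\frac{1}{3}} \Delta_{\mathcal{E}_i}^{\frac{1}{3}} |\mathcal{E}_i|^{\frac{2}{3}}\right).$$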

Combining regret over epochs.

For the last step of combining all epochs, we make use of the following lemma, which bounds the number of epochs (see Appendix D for the proof). Denote the total number of epochs by $E$. With high probability, we have $E \le \min\{S,\; (KC_0)^{-\frac{1}{3}} \Delta^{\frac{2}{3}} T^{\frac{1}{3}} + 1\}$.

Therefore, summing up the previous bounds over all epochs, we arrive at the final dynamic regret bound, which is the minimum of the following two:

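$$\tilde{O}\left(\sum_{i=1}^{E} \sqrt{KC_0 S_{\mathcal{E}_i} |\mathcal{E}_i|}\right) \le \tilde{O}\left(\sqrt{KC_0 \left(\sum_{i=1}^{E} S_{\mathcal{E}_i}\right) \left(\sum_{i=1}^{E} |\mathcal{E}_i|\right)}\right) = \tilde{O}\left(\sqrt{KC_0 (S + E) T}\right) = \tilde{O}\left(\sqrt{KC_0 ST}\right),$$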

and by

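$$\begin{aligned} \tilde{O}\left(\sum_{i=1}^{E} \left(\sqrt{KC_0 |\mathcal{E}_i|} + (KC_0)^{\frac{1}{3}} \Delta_{\mathcal{E}_i}^{\frac{1}{3}} |\mathcal{E}_i|^{\frac{2}{3}}\right)\right) &\le \tilde{O}\left(\sqrt{KC_0 ET} + (KC_0)^{\frac{1}{3}} \left(\sum_{i=1}^{E} \Delta_{\mathcal{E}_i}\right)^{\frac{1}{3}} T^{\frac{2}{3}}\right) \\ &= \tilde{O}\left(\sqrt{KC_0 T} + (KC_0)^{\frac{1}{3}} \Delta^{\frac{1}{3}} T^{\frac{2}{3}}\right). \end{aligned}$$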

This proves the bound stated in the main theorem.
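For readers who want the two summation steps spelled out, the following is a sketch of the Cauchy-Schwarz and Hölder arguments used above. It assumes only that the epochs are disjoint intervals covering the $T$ rounds, so that the epoch lengths sum to $T$ and the per-epoch variations sum to at most $\Delta$; the notation $\mathcal{E}_i$ for epoch $i$ follows the displayed bounds.

```latex
% Cauchy-Schwarz on the square-root terms:
\sum_{i=1}^{E} \sqrt{|\mathcal{E}_i|}
  \le \sqrt{E \sum_{i=1}^{E} |\mathcal{E}_i|}
  = \sqrt{E\,T}.

% Hoelder's inequality with conjugate exponents 3 and 3/2 on the variation terms:
\sum_{i=1}^{E} \Delta_{\mathcal{E}_i}^{1/3}\, |\mathcal{E}_i|^{2/3}
  \le \Big(\sum_{i=1}^{E} \Delta_{\mathcal{E}_i}\Big)^{1/3}
      \Big(\sum_{i=1}^{E} |\mathcal{E}_i|\Big)^{2/3}
  \le \Delta^{1/3}\, T^{2/3}.
```

The first inequality is where the number of epochs $E$ enters the variation-based bound, which is why a bound on $E$ is needed before the final simplification.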

The authors would like to thank Peter Auer for the discussion about the possibility of getting optimal bounds for our problem, and thank Peter Auer, Pratik Gajane, Ronald Ortner for kindly sharing their manuscript of AuerGO018 before it was public. HL and CYW are supported by NSF Grant #1755781.

References

• [Adamskiy et al.(2012)Adamskiy, Warmuth, and Koolen] Dmitry Adamskiy, Manfred K Warmuth, and Wouter M Koolen. Putting bayes to sleep. In Advances in neural information processing systems 25, 2012.
• [Agarwal et al.(2014)Agarwal, Hsu, Kale, Langford, Li, and Schapire] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.
• [Auer et al.(2002)Auer, Cesa-Bianchi, Freund, and Schapire] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
• [Auer et al.(2018)Auer, Gajane, and Ortner] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In 14th European Workshop on Reinforcement Learning, 2018.
• [Besbes et al.(2014)Besbes, Gur, and Zeevi] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27, 2014.
• [Besbes et al.(2015)Besbes, Gur, and Zeevi] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
• [Beygelzimer et al.(2011)Beygelzimer, Langford, Li, Reyzin, and Schapire] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
• [Bousquet and Warmuth(2002)] Olivier Bousquet and Manfred K Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
• [Cheung et al.(2019)Cheung, Simchi-Levi, and Zhu] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
• [Dudík et al.(2011)Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin, and Zhang] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.
• [Garivier and Moulines(2011)] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, 2011.
• [Hazan and Koren(2016)] Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In Proceedings of the 48th Annual ACM Symposium on the Theory of Computing, 2016.
• [Hazan and Seshadhri(2009)] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, pages 393–400, 2009.
• [Herbster and Warmuth(1998)] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
• [Jadbabaie et al.(2015)Jadbabaie, Rakhlin, Shahrampour, and Sridharan] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
• [Jun et al.(2017)Jun, Orabona, Wright, Willett, et al.] Kwang-Sung Jun, Francesco Orabona, Stephen Wright, Rebecca Willett, et al. Online learning for changing environments using coin betting. Electronic Journal of Statistics, 11(2):5282–5310, 2017.
• [Karnin and Anava(2016)] Zohar S Karnin and Oren Anava. Multi-armed bandits: Competing with optimal sequences. In Advances in Neural Information Processing Systems 29, 2016.
• [Langford and Zhang(2008)] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 21, 2008.
• [Liu et al.(2018)Liu, Lee, and Shroff] Fang Liu, Joohyun Lee, and Ness Shroff. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
• [Luo and Schapire(2015)] Haipeng Luo and Robert E. Schapire. Achieving All with No Parameters: AdaNormalHedge. In 28th Annual Conference on Learning Theory (COLT), 2015.
• [Luo et al.(2018)Luo, Wei, Agarwal, and Langford] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In 31st Annual Conference on Learning Theory (COLT), 2018.
• [Rakhlin and Sridharan(2016)] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
• [Slivkins and Upfal(2008)] Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: the brownian restless bandits. In 21st Annual Conference on Learning Theory (COLT), pages 343–354, 2008.
• [Syrgkanis et al.(2016a)Syrgkanis, Krishnamurthy, and Schapire] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E Schapire. Efficient algorithms for adversarial contextual learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.
• [Syrgkanis et al.(2016b)Syrgkanis, Luo, Krishnamurthy, and Schapire] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems 29, 2016b.
• [Wei et al.(2016)Wei, Hong, and Lu] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems 29, 2016.
• [Yang et al.(2016)Yang, Zhang, Jin, and Yi] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning, pages 449–457, 2016.
• [Zhang et al.(2017)Zhang, Yang, Yi, Rong, and Zhou] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Jing Rong, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, 2017.
• [Zhang et al.(2018)Zhang, Yang, Jin, and Zhou] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, 2018.
• [Zinkevich(2003)] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.

Appendix A Useful Lemmas

In this section we prove two small lemmas that are useful for our analysis.

A.1 Discrepancy between intervals

The following results allow us to relate regret and variance measured on one interval to those measured on another, at a price expressed in terms of the distribution variation. For any interval $I$, its sub-intervals $I_1, I_2 \subseteq I$, and any $\pi \in \Pi$, we have

$$\left|\mathrm{Reg}_{I_1}(\pi) - \mathrm{Reg}_{I_2}(\pi)\right| \le 2\Delta_I.$$

Let