1 Introduction
For online learning problems, a standard performance measure is static regret, which is the difference between the total reward of the best fixed policy (or action/arm/expert under different contexts) and the total reward of the algorithm. While minimizing static regret makes sense when there exists a fixed policy with large total reward, it becomes much less meaningful in a nonstationary environment where the data distribution changes over time and no single policy can perform well all the time.
Instead, in this case a more natural benchmark is the best sequence of policies. This is formalized as dynamic regret, which is the difference between the total reward of the best sequence of policies and the total reward of the algorithm. Due to the ubiquity of nonstationary data, there is an increasing trend of designing online algorithms with strong dynamic regret guarantees. We provide a more detailed review of related work in Section 2. In short, while obtaining dynamic regret is relatively well-studied in the full-information setting, for the more challenging bandit feedback, most existing works only focus on the simplest multi-armed bandit problem. More importantly, a sharp contrast between these two regimes is that, except for the recent work of AuerGO018 for a two-armed bandit problem, none of the existing works achieves optimal dynamic regret in the bandit setting without knowledge of the nonstationarity of the data, indicating the extra challenge of adapting to nonstationary data with partial information.
In this work, we make a significant step in this direction. Specifically, we consider the general contextual bandit setting AuerCeFrSc02, LangfordZh08, which subsumes many other bandit problems. For an environment with $T$ rounds where at each time $t$ the data is generated from some distribution $\mathcal{D}_t$, denote by $S$ the number of switches (plus one) and by $\Delta$ the total variation of these distributions (see Section 3 for a more formal definition of the setting). Our main contribution is an algorithm with the following guarantee:
Main Result
Our algorithm achieves the optimal dynamic regret bound $\widetilde{O}(\min\{\sqrt{ST}, \Delta^{1/3}T^{2/3}\})$ without knowing $S$ or $\Delta$. Moreover, it is oracle-efficient.
Here the dependence on all other parameters is omitted (see Theorem 5 for the complete version), and the optimality of the dependence on $S$ and $\Delta$ is well-known garivier2011upper, BesbesGuZe14. Oracle-efficiency refers to efficiency assuming access to an ERM oracle (formally defined in Section 3), a common assumption made in most prior works on efficient contextual bandits.
Our result is by far the best and most general dynamic regret bound for bandit problems. Recent work by LuoWA018 studies the exact same setting and achieves the same optimal bound only if $S$ and $\Delta$ are known; otherwise their algorithms only achieve suboptimal bounds. On the other hand, AuerGO018 propose the first bandit algorithm with expected regret $\widetilde{O}(\sqrt{ST})$ without knowing $S$, but only for the simplest setting: the two-armed bandit problem without contexts. In contrast, our algorithm works for the general multi-armed bandit problem with contextual information, enjoys a meaningful bound as long as $\Delta$ is small (even when $S$ is large), works with high probability, and importantly is oracle-efficient as well.
Our key technique is inspired by AuerGO018. The high-level idea of their algorithm is to occasionally enter a pure exploration phase in order to detect nonstationarity, and crucially the durations of these exploration phases are multi-scale and determined in a randomized way. The reason is that smaller nonstationarity requires more time to discover, and vice versa. We extend this multi-scale idea to the contextual bandit setting. However, the extension is highly nontrivial and requires the following two new elements:

First, we find that pure exploration over arms (used by AuerGO018, LuoWA018) is not the optimal way to detect nonstationarity in contextual bandits. Instead, we propose to let the algorithm occasionally enter replay phases, meaning that the algorithm acts according to some policy distribution used earlier by the algorithm itself. The duration of a replay phase and which previous policy distribution to replay are both determined in a randomized way similar to AuerGO018. This can be seen as an interpolation between using the current policy distribution and using pure exploration, and as shown by our analysis it achieves a better trade-off between exploitation and exploration in nonstationary environments.

Second, the algorithm of AuerGO018 is an “arm-elimination” approach, which eliminates arms as soon as their suboptimality is identified. A direct extension to contextual bandits leads to an inefficient approach similar to PolicyElimination by DudikHsKaKaLaReZh11. Instead, our algorithm is based on the soft elimination scheme of AgarwalHsKaLaLiSc14 and can be efficiently implemented with an ERM oracle. Combining this soft elimination scheme and the replay idea in a proper way is another key novelty of our work.
2 Related Work
Different forms of dynamic regret bound.
Bounding dynamic regret in terms of the number of switches is traditionally referred to as switching regret or tracking regret, and has been studied under various settings. Note, however, that in some works the number of switches refers to that of the data distributions, just as in our definition (e.g. garivier2011upper, WeiHoLu16, liu2018change, LuoWA018), while in others it refers to the more general notion of the number of switches in the competitor sequence (e.g. herbster1998tracking, bousquet2002tracking, AuerCeFrSc02, hazan2009efficient).
Bounding dynamic regret in terms of the variation of loss functions or data distributions is also widely studied (e.g. BesbesGuZe14, BesbesGuZe15, LuoWA018), and there are in fact several other forms of dynamic regret bounds studied in the literature (e.g. Zinkevich03, slivkins2008adapting, jadbabaie2015online, WeiHoLu16, yang2016tracking, zhang2017improved).
Adaptivity to nonstationarity.
Achieving optimal dynamic regret bounds without any prior knowledge of the nonstationarity is the main focus of this work. This has been achieved for most full-information problems LuoSc15, jun2017online, ZhangYaJiZh18, but is much more challenging in the bandit setting. Several recent attempts only achieve suboptimal bounds KarninAn16, LuoWA018, cheung2019learning. It was not clear whether optimal bounds were achievable in this case, until the recent work of AuerGO018 answered this in the affirmative for the two-armed bandit problem. As mentioned, our results significantly generalize their work.
Contextual bandits.
Contextual bandit is a generalization of the multi-armed bandit problem. While a direct generalization of the classic multi-armed bandit algorithm already achieves the optimal static regret AuerCeFrSc02, recent research has focused on developing practically efficient algorithms with strong regret guarantees, motivated by their applicability to real-world problems. To avoid running time that is linear in the size of the policy set, most existing works make the practical assumption that an ERM oracle is given to solve the corresponding offline problem. Based on this assumption, a series of advances has been made on developing oracle-efficient algorithms with small static regret LangfordZh08, DudikHsKaKaLaReZh11, AgarwalHsKaLaLiSc14, syrgkanis2016efficient, rakhlin2016bistro, SyrgkanisLuKrSc16. All these results rely on some stationarity assumption on the environment, since it is known that minimizing static regret oracle-efficiently is impossible in an adversarial environment Hazan2016.
Despite the negative result for static regret with oracle-efficient algorithms, LuoWA018 find that this is no longer true for dynamic regret, and develop oracle-efficient algorithms with optimal dynamic regret when the nonstationarity is known. Their work is most closely related to ours, and our algorithm is in essence similar to their AdaILTCB algorithm. The key novelty compared to theirs is the replay phases mentioned earlier, which eventually allow the algorithm to adapt to the nonstationarity of the data.
Replay phases.
Introducing replay phases is one of our key contributions. The closest idea in the literature is the method of “mixing past posteriors” of bousquet2002tracking, adamskiy2012putting, which at each time acts according to some weighted combination of all previous distributions. One key difference of our method is that once it enters a replay phase, it has to continue for a certain amount of time to gather enough information for nonstationarity detection. Another difference is that in bousquet2002tracking, adamskiy2012putting the main point of “mixing past posteriors” is to obtain some form of “long-term memory”; otherwise, for typical dynamic regret bounds, it is enough to just mix with some amount of pure exploration. It is not clear to us whether our replay idea actually equips the algorithm with some kind of “long-term memory” as well, and we leave this as a future direction.
3 Preliminaries
The contextual bandit problem is defined as follows. Let $\mathcal{X}$ be some arbitrary context space and $K$ be the number of actions. A policy $\pi: \mathcal{X} \to [K]$ is a mapping from the context space to the actions. (Footnote 1: Throughout the paper we use the notation $[n]$ to denote the set $\{1, \ldots, n\}$ for some integer $n$.) The learner is given a set of policies $\Pi$, assumed to be finite for simplicity but with a huge cardinality $N = |\Pi|$. Before the learning procedure starts, the environment decides $T$ distributions $\mathcal{D}_1, \ldots, \mathcal{D}_T$ on $\mathcal{X} \times [0,1]^K$, and draws $T$ independent samples from them: $(x_t, r_t) \sim \mathcal{D}_t$ for all $t \in [T]$. The learning procedure then proceeds as follows: for each time $t = 1, \ldots, T$, the learner first receives the context $x_t$, and then based on this context picks an action $a_t \in [K]$. Afterwards the learner receives the reward feedback $r_t(a_t)$ for the selected action but not for the others. The instantaneous regret against a policy $\pi$ at time $t$ is $r_t(\pi(x_t)) - r_t(a_t)$. The classic goal of contextual bandit algorithms is to minimize $\max_{\pi \in \Pi} \sum_{t=1}^{T} \big( r_t(\pi(x_t)) - r_t(a_t) \big)$, that is, the cumulative regret against the best fixed policy, and the optimal bound is known to be $\mathcal{O}(\sqrt{KT \ln N})$ AuerCeFrSc02.
The classic regret is not a good performance measure for nonstationary environments where no single policy can perform well all the time. Instead, we consider dynamic regret, which compares the reward of the algorithm to the reward of the best policy at each time. Specifically, denote the expected reward of policy $\pi$ at time $t$ as $\mathcal{R}_t(\pi) = \mathbb{E}_{(x,r) \sim \mathcal{D}_t}[r(\pi(x))]$, and the optimal policy at time $t$ as $\pi^\star_t = \arg\max_{\pi \in \Pi} \mathcal{R}_t(\pi)$. The dynamic regret is then defined as $\sum_{t=1}^{T} \big( r_t(\pi^\star_t(x_t)) - r_t(a_t) \big)$.
It is well-known that in general it is impossible to achieve sublinear dynamic regret. Instead, typical dynamic regret bounds are expressed in terms of some quantities that characterize the nonstationarity of the data distributions, and are meaningful as long as these quantities are sublinear in $T$. Two such quantities considered in this work are the number of hard distribution switches (plus one), $S = 1 + \sum_{t=2}^{T} \mathbb{1}\{\mathcal{D}_t \neq \mathcal{D}_{t-1}\}$, and the total variation of the distributions, $\Delta = \sum_{t=2}^{T} \|\mathcal{D}_t - \mathcal{D}_{t-1}\|_{\mathrm{TV}}$.
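To make the gap between the static and the dynamic benchmark concrete, the following minimal sketch evaluates both on a toy instance. It assumes a hypothetical table `R[t][i]` holding the expected reward of policy `i` at time `t` (contexts are abstracted away), and it counts a switch whenever this expected-reward vector changes, which is only a proxy for a change of the underlying distribution. The function names are illustrative and not from the paper.

```python
def static_benchmark(R):
    """Total expected reward of the best fixed policy."""
    num_policies = len(R[0])
    return max(sum(row[i] for row in R) for i in range(num_policies))


def dynamic_benchmark(R):
    """Total expected reward of the best policy at each time."""
    return sum(max(row) for row in R)


def switches_plus_one(R):
    """Number of changes between consecutive rounds, plus one (proxy for S)."""
    return 1 + sum(1 for prev, cur in zip(R, R[1:]) if prev != cur)


def variation_proxy(R):
    """Sum over time of the largest change in any policy's expected reward.

    This is an assumed stand-in for the total variation, not the exact quantity.
    """
    return sum(
        max(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(R, R[1:])
    )


# Two policies whose roles swap halfway through: no fixed policy does well.
R = [[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5
gap = dynamic_benchmark(R) - static_benchmark(R)
```

On this instance the best fixed policy collects about half of what the best sequence collects, even though the environment changes only once, which is the regime where a bound in terms of the number of switches is informative.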
More notation.
For any integer , we denote by the time interval . For an interval , we define the number of switches and the variation on this interval respectively as and .
As in most algorithms, at each time we sample an action according to some distribution , calculated based on the history before time . After receiving the reward feedback , we construct the usual importance-weighted estimator , which is defined as and is clearly unbiased with mean . For any interval , we define the average reward of a policy over this interval as and similarly its empirical average reward as . The optimal policy in interval is defined as while the empirically best policy is . Furthermore, the expected and empirical interval (static) regret of a policy for an interval are respectively defined as and . When , we simply use to replace as the subscript. For example, represents .
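As an illustration of the importance-weighted estimator, here is a minimal sketch under the assumption that it takes the standard inverse-propensity form: the observed reward divided by the probability of the chosen action, and zero for every other action. The helper names are hypothetical. The second function checks unbiasedness exactly by enumerating all possible actions rather than by sampling.

```python
def ips_estimate(num_actions, chosen, reward, probs):
    """Importance-weighted reward vector after playing `chosen` and seeing `reward`.

    probs[a] is the probability with which action a was sampled.
    """
    if probs[chosen] <= 0.0:
        raise ValueError("the chosen action must have positive probability")
    return [
        reward / probs[chosen] if a == chosen else 0.0
        for a in range(num_actions)
    ]


def expected_estimate(rewards, probs):
    """Expectation of the estimator over the random choice of action."""
    num_actions = len(rewards)
    mean = [0.0] * num_actions
    for chosen in range(num_actions):
        estimate = ips_estimate(num_actions, chosen, rewards[chosen], probs)
        for a in range(num_actions):
            mean[a] += probs[chosen] * estimate[a]
    return mean


rewards = [0.2, 0.7, 1.0]   # true (unobserved) reward vector
probs = [0.5, 0.3, 0.2]     # sampling distribution over actions
mean = expected_estimate(rewards, probs)
```

The expectation matches the true reward vector entry by entry, but each individual estimate can be as large as the inverse of the smallest sampling probability, which is why the algorithm also tracks a variance bound.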
For a context and a distribution over the policies , the projected distribution over the actions is denoted by such that for all . The smoothed projected distribution with a minimum probability is defined as where is the all-one vector. Similarly to AgarwalHsKaLaLiSc14, our algorithm keeps track of a bound on the variance of the reward estimates. To this end, define for a policy , an interval , a distribution , and a minimum probability , the empirical and expected variance as , where is the marginal distribution of over the context space . Again, and are shorthands for and respectively.
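The projection and smoothing steps can be sketched as follows. This assumes the smoothing used by AgarwalHsKaLaLiSc14, namely mixing the projected distribution with the uniform distribution so that every action has probability at least `mu`, and it assumes the empirical variance is the average inverse smoothed probability of the policy's action over the observed contexts. Policies are modeled as plain Python callables and the sparse distribution as a dictionary from policy index to weight; all names are illustrative.

```python
def project(Q, policies, x, num_actions):
    """Distribution over actions induced by a distribution Q over policies at context x."""
    probs = [0.0] * num_actions
    for idx, weight in Q.items():
        probs[policies[idx](x)] += weight
    return probs


def smooth(probs, mu):
    """Mix with the uniform distribution so that every action has probability >= mu."""
    num_actions = len(probs)
    if not 0.0 <= mu <= 1.0 / num_actions:
        raise ValueError("mu must lie in [0, 1/K]")
    return [(1.0 - num_actions * mu) * p + mu for p in probs]


def empirical_variance(Q, policies, pi, contexts, num_actions, mu):
    """Average of 1 / (smoothed probability of pi's action) over the given contexts."""
    total = 0.0
    for x in contexts:
        smoothed = smooth(project(Q, policies, x, num_actions), mu)
        total += 1.0 / smoothed[pi(x)]
    return total / len(contexts)


policies = [lambda x: 0, lambda x: x % 3]
Q = {0: 0.75, 1: 0.25}            # sparse distribution over the two policies
projected = project(Q, policies, 2, 3)
smoothed = smooth(projected, 0.1)
```

A policy whose actions are rarely played under `Q` gets a large variance term, which is what the variance constraint of the optimization problem is meant to control.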
We are interested in efficient algorithms assuming access to an ERM oracle AgarwalHsKaLaLiSc14, defined as:
An ERM oracle is an algorithm which takes any set of context-reward pairs as input and outputs any policy in .
An algorithm is oracle-efficient if its total running time and the number of oracle calls are both polynomial in and , excluding the running time of the oracle itself.
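For intuition only, the interface of an ERM oracle can be sketched by brute-force enumeration over a small policy class, assuming the oracle returns a policy with the largest total reward on the given context-reward pairs. Enumeration takes time linear in the number of policies, which is exactly what a practical oracle is assumed to avoid, so this is a stand-in for the interface and not for an efficient implementation.

```python
def erm_oracle(policies, data):
    """Return a policy maximizing the total reward on `data`.

    `data` is a list of (context, reward_vector) pairs; ties are broken in
    favor of the earliest policy in `policies`.
    """
    def total_reward(pi):
        return sum(reward[pi(x)] for x, reward in data)

    return max(policies, key=total_reward)


policies = [lambda x: 0, lambda x: 1, lambda x: x % 2]
data = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [1.0, 0.0])]
best = erm_oracle(policies, data)
```

In the algorithm the reward vectors passed to the oracle would be importance-weighted estimates rather than true rewards, which is why the oracle must accept arbitrary real-valued vectors.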
Finally, we use notation to suppress logarithmic dependence on , and for some confidence level . For notational convenience we also define .
4 Algorithm
Optimization Problem (OP)
Input: time interval , minimum exploration probability
Return such that for constant ,
(1)  
(2) 
EndofReplayTest
Return Fail if there exists such that any of the following inequalities holds:
(3)  
(4)  
(5) 
where and ; otherwise return Pass.
EndofBlockTest
Return Fail if there exists and such that any of the following inequalities holds:
(6)  
(7)  
(8) 
where and ; otherwise return Pass.
Our algorithm is built upon ILOVETOCONBANDITS of AgarwalHsKaLaLiSc14. The main idea of their algorithm is to find a sparse distribution over the policies with both low empirical regret and low empirical variance on the collected data, and then sample actions according to this distribution. Finding such distributions is formalized in Figure 1, Optimization Problem (OP), and AgarwalHsKaLaLiSc14 show that this can be efficiently implemented using an ERM oracle and importantly the distribution is sparse. Under a stationary environment, it can be shown that the empirical regret concentrates around the expected regret reasonably well and thus the algorithm has low regret.
The AdaILTCB algorithm of LuoWA018 works by equipping ILOVETOCONBANDITS with some nonstationarity tests and restarting once nonstationarity is detected. Our algorithm works under a similar framework with similar tests, but importantly enters into replay phases occasionally. The complete pseudocode is included in Algorithm 1 and we describe in detail how it works below.
The algorithm starts a new epoch every time it restarts (that is, on execution of Line 1 or 1). We index an epoch by and denote the first round of epoch by . Within an epoch, the algorithm works on a block schedule. Specifically, in epoch , we call the interval block and interval block for any (in the case of restart, the block ends earlier), where is some fixed base length. (Footnote 2: The lengths of these blocks are doubling except that block 0 and block 1 have the same length . This is merely for notational convenience and is not crucial.) Each block is associated with an exploration probability of order . At the beginning of each block (for ), the algorithm first solves the Optimization Problem (OP) (Figure 1) using exploration probability and all data collected since the beginning of the current epoch, that is, data from . The solution is denoted by , which is a sparse distribution over policies.
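The block schedule can be sketched as follows. The explicit endpoints are an assumption that is consistent with the description above: block 0 and block 1 both have the base length, and every later block is twice as long as the previous one. The names `tau` (first round of the epoch) and `L` (base length) are illustrative.

```python
def block_interval(tau, j, L):
    """Inclusive (start, end) rounds of block j in an epoch starting at round tau."""
    if j == 0:
        return tau, tau + L - 1
    return tau + 2 ** (j - 1) * L, tau + 2 ** j * L - 1


def block_length(j, L):
    """Number of rounds in block j when no restart interrupts it."""
    start, end = block_interval(0, j, L)
    return end - start + 1


def data_before_block(tau, j):
    """Inclusive rounds whose data is available when block j >= 1 begins."""
    start, _ = block_interval(tau, j, 1)  # only the start matters; L cancels below
    raise NotImplementedError if False else None  # placeholder branch never taken


schedule = [block_interval(1, j, 4) for j in range(4)]
lengths = [block_length(j, 4) for j in range(4)]
```

With this layout the data available at the start of block `j` always spans exactly as many rounds as block `j` itself, so each new distribution is computed from a sample as long as the period over which it will be used.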
Afterwards, for most of the time of the current block, the algorithm simply plays according to , just like ILOVETOCONBANDITS. The difference is that at each time, with probability the algorithm enters into a replay phase of index which lasts for rounds. This is implemented in Line 11, where we first sample a Bernoulli variable rep to decide whether or not to enter into a replay phase, and if so then randomly select a replay index to ensure the aforementioned probability. The set is used to record all pairs of replay index and replay interval. Similar to AuerGO018, the reason for using different lengths is to allow the algorithm to detect different levels of nonstationarity: a longer replay interval with a larger index is used to detect smaller nonstationarity.
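The two-stage draw described above (first a Bernoulli variable, then a replay index) can be sketched generically. Given target probabilities `q[m]` of starting a replay phase of index `m`, drawing `rep` with success probability `sum(q)` and then choosing `m` with probability `q[m] / sum(q)` yields exactly the marginal probability `q[m]`. The particular values produced by `replay_probs` (geometrically decreasing in `m`) and the assumed duration of `2**m * L` rounds are placeholders for illustration, not the paper's constants.

```python
import random


def replay_probs(j, scale):
    """Hypothetical start probabilities q[m] for m = 0, ..., j-1, decreasing in m."""
    return [scale * 2.0 ** (-m / 2.0) for m in range(j)]


def maybe_start_replay(q, rng):
    """Return a replay index m with probability q[m], or None otherwise.

    Requires sum(q) <= 1. Stage one draws rep ~ Bernoulli(sum(q)); stage two
    picks m with probability q[m] / sum(q).
    """
    total = sum(q)
    if total <= 0.0 or rng.random() >= total:
        return None
    return rng.choices(range(len(q)), weights=q, k=1)[0]


def replay_interval(t, m, L):
    """Inclusive rounds of a replay phase of index m starting at round t (assumed length 2**m * L)."""
    return t, t + 2 ** m * L - 1


rng = random.Random(0)
q = replay_probs(3, 0.01)
draws = [maybe_start_replay(q, rng) for _ in range(20000)]
```

Making long replay phases rarer than short ones keeps the total time spent replaying small while still giving every scale of nonstationarity a chance of being tested.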
Note that at each time , the algorithm could potentially be in multiple replay phases simultaneously. Let be the set of indices of all the ongoing replay intervals (defined in Line 1). If is empty, the algorithm is not in any replay phase and simply samples an action according to as mentioned. On the other hand, if is not empty, the algorithm uniformly at random picks an index from , and then replays the distribution learned at the beginning of block , that is, samples an action according to . Recall that our reward estimators ’s are defined in terms of a distribution over actions, and it is clear that for our algorithm .
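The selection rule in this paragraph can be sketched as follows, assuming the recorded replay phases are stored as pairs of a replay index and an inclusive interval. The function returns the index of the block whose distribution should be played at round `t`: the current block if no replay phase is ongoing, and otherwise an index chosen uniformly at random among the ongoing ones. The data layout and names are illustrative.

```python
import random


def ongoing_indices(replays, t):
    """Sorted distinct indices m of recorded replay phases (m, (start, end)) containing round t."""
    return sorted({m for m, (start, end) in replays if start <= t <= end})


def distribution_index(replays, t, current_block, rng):
    """Index of the block whose learned distribution is played at round t."""
    active = ongoing_indices(replays, t)
    if not active:
        return current_block
    return rng.choice(active)


replays = [(0, (5, 8)), (2, (7, 22)), (0, (6, 9))]
rng = random.Random(1)
choice_in_replay = distribution_index(replays, 7, 3, rng)
choice_outside = distribution_index(replays, 30, 3, rng)
```

Because the played distribution is always one that the algorithm computed itself, the action probabilities needed by the importance-weighted estimator remain known in every round.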
Finally, at the end of every replay interval, the algorithm calls the subroutine EndofReplayTest to check whether the data collected in the replay interval and that collected prior to the current block (that is, ) are consistent (Line 1). Also, at the end of every block , the algorithm calls another subroutine EndofBlockTest to check the consistency between data up to block and data up to block for all (Line 1). Both tests are in similar spirit to those of LuoWA018, and check the difference of empirical regret or empirical variance of each policy over different sets of data (see Figure 1). If either of the tests indicates that there is a significant distribution change, the algorithm restarts from scratch and enters into the next epoch. Also note that if EndofBlockTest passes and the algorithm enters into a new block, all unfinished replay intervals will discontinue ( is reset to be empty in Line 1).
We provide an illustration of our algorithm in Figure 2.
Oracle-efficiency.
Our algorithm can be implemented efficiently with an ERM oracle. AgarwalHsKaLaLiSc14 show that the Optimization Problem (OP) with input can be solved using oracle calls with a solution that is sparse. In our case, is at most . The two tests can also be implemented efficiently by the exact same arguments of LuoWA018. For example, in EndofReplayTest, to check if there exists a satisfying Eq. (3), we can first use two oracle calls to precompute and , and collect . Then we again use an oracle call to find and add this value to , which is equal to taking the max over of the left-hand side of Eq. (3). It remains to compare this value with the right-hand side of Eq. (3).
5 Main Theorem and Proof Outline
The dynamic regret guarantee of is summarized below: [Main Theorem] guarantees with high probability,
Proof roadmap.
The rest of the paper proves our main theorem in the following steps. In Section 5.1, we provide a key lemma that bounds the dynamic regret for any interval within a block (in terms of some algorithm-dependent quantities). In Section 5.2, with the help of the key lemma, we bound the dynamic regret for a block. In Section 5.3, we bound the number of epochs/restarts, and sum up the regret over all blocks in all epochs to get the final bound. Since the analysis in Sections 5.1 and 5.2 is all about a fixed epoch , for notational simplicity we simply write and as and in these two sections.
5.1 A main lemma and regret decomposition
To bound the dynamic regret over any interval, we define the concept of excess regret:
For an interval that lies in for some , we define its excess regret as
and its excess regret threshold as .
In words, excess regret of is the maximum discrepancy between a policy’s expected static regret on and (8 times) its empirical static regret on the first blocks. Large excess regret thus indicates nonstationarity. We now use the following main lemma to decompose the dynamic regret on based on whether the excess regret reaches the excess regret threshold.
[Main Lemma] With probability , guarantees for all and any interval that lies in block ,
where.
By Azuma’s inequality and a union bound over all possible intervals, we have that with probability , for any interval ,
(9) 
where is the conditional expectation given everything up to Step 1 of the algorithm of round . It remains to bound each . Depending on the case of replay or non-replay, this term can be written as
Now observe that for any and , by definition of we have
So we continue to bound by
(10) 
Next note that for any and , we have
In fact, the above holds for too since the left-hand side is at most . Combining this inequality with Eq. (9) and (10), and noting that the term is of order finishes the proof.
5.2 Dynamic regret for a block
In this section, we bound the dynamic regret of some block within epoch . This block can be formally written as
(11) 
The idea is to divide into several intervals, apply Lemma 5.1 to each of them, and finally sum up the regret. Importantly, we need to divide in a careful way according to the following lemma, so that the variation on each interval is bounded by its excess regret threshold, while at the same time the number of intervals is not too large. Note that this division only happens in the analysis.
There is a way to partition any interval into , such that , and .
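One natural way to obtain a partition of this kind is a greedy scan that extends the current interval for as long as the variation inside it stays below a length-dependent threshold. The sketch below illustrates that idea under stated assumptions: `step_variation[t]` is a hypothetical per-round variation between rounds `t-1` and `t`, and `threshold(length)` is a stand-in for the excess regret threshold of an interval of that length, assumed to be non-increasing in the length. It is not claimed to be the construction used in the proof.

```python
def greedy_partition(step_variation, threshold):
    """Split rounds 0..n-1 into consecutive inclusive intervals.

    Each interval's internal variation (sum of step_variation over its
    interior boundaries) is at most threshold(length of the interval).
    """
    n = len(step_variation)
    intervals = []
    start = 0
    while start < n:
        end = start
        variation = 0.0
        # Try to absorb round end + 1; the interval would then have
        # length end + 2 - start.
        while end + 1 < n:
            new_variation = variation + step_variation[end + 1]
            if new_variation > threshold(end + 2 - start):
                break
            variation = new_variation
            end += 1
        intervals.append((start, end))
        start = end + 1
    return intervals


def threshold(length):
    """Illustrative threshold decaying as one over the square root of the length."""
    return 1.0 / length ** 0.5


stationary = greedy_partition([0.0] * 10, threshold)
one_jump = greedy_partition([0.0] * 5 + [0.9] + [0.0] * 4, threshold)
```

In a stationary stretch the scan never cuts, and a single large jump produces exactly one cut at the jump, which matches the intuition that the number of intervals is governed by the amount of nonstationarity. Since this partition is only used in the analysis, the algorithm itself never needs to compute it.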
For the first intervals of this partition, we apply Lemma 5.1 to each of them. Note that the term in Lemma 5.1 can be absorbed by the term by our partition property. Summing up the bounds from Lemma 5.1, we get the following dynamic regret bound for these intervals:
(12) 
For the last interval in the block, it is possible that it was interrupted by a restart, which makes the analysis trickier and we defer the details to Appendix C. Further bounding Term and Term is relatively straightforward by the definition of and and also the construction of (see Appendix C). For Term, the idea is that this term is nonzero only when is large, which implies that the distribution in is quite different from that in . In this case we will show that as long as the algorithm starts a replay phase with some “correct” index within , it will detect the nonstationarity with high probability and restart the algorithm. Thus we only need to bound the regret accumulated before this “correct” replay phase appears. We provide the complete proof in Appendix C.1, which is the most important part of the analysis. Combining the bounds for these three terms, we eventually arrive at the following lemma: With probability , the following holds for any block with block index :
Note that is the length of block unless there is a restart triggered within this block, in which case the length is smaller.
5.3 Combining regret over blocks and epochs
We finally sum up the dynamic regret over blocks and epochs. To this end, we reintroduce the subscripts in our notations, and write epoch as and block (for ) in epoch as
Dynamic regret for an epoch.
The last block index in epoch is , which we denote by . Using Lemma 5.2, we combine the regret over all blocks in epoch and upper bound the regret of epoch simultaneously by (using the bound in terms of number of switches)
(Cauchy-Schwarz)  
and similarly by (using the bound in terms of variation and Hölder inequality)
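For reference, the two inequalities invoked in this step have the following generic forms for nonnegative sequences. The symbols $a_j$ and $b_j$ are placeholders, assumed here to stand for a per-block nonstationarity measure and a per-block length respectively; they are not the paper's exact terms.

```latex
% Cauchy-Schwarz, as used for the bound in terms of the number of switches:
\sum_{j} \sqrt{a_j b_j}
  \;\le\; \sqrt{\Big(\sum_{j} a_j\Big)\Big(\sum_{j} b_j\Big)}

% Hölder with exponents 3 and 3/2, as used for the bound in terms of the variation:
\sum_{j} a_j^{1/3}\, b_j^{2/3}
  \;\le\; \Big(\sum_{j} a_j\Big)^{1/3} \Big(\sum_{j} b_j\Big)^{2/3}
```

In both cases the per-block quantities are summed first and the root is taken afterwards, which is what turns a sum of per-block bounds into a bound of the same shape for the whole epoch.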
Combining regret over epochs.
For the last step of combining all epochs, we make use of the following lemma, which bounds the number of epochs (see Appendix D for the proof). Denote the total number of epochs by . With probability at least , we have .
Therefore, summing up the previous bounds over all epochs, we arrive at the final dynamic regret bound, which is the minimum of the following two:
and by
This proves the bound stated in the main theorem.
The authors would like to thank Peter Auer for the discussion about the possibility of getting optimal bounds for our problem, and thank Peter Auer, Pratik Gajane, Ronald Ortner for kindly sharing their manuscript of AuerGO018 before it was public. HL and CYW are supported by NSF Grant #1755781.
References
 [Adamskiy et al.(2012)Adamskiy, Warmuth, and Koolen] Dmitry Adamskiy, Manfred K Warmuth, and Wouter M Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems 25, 2012.
 [Agarwal et al.(2014)Agarwal, Hsu, Kale, Langford, Li, and Schapire] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 [Auer et al.(2002)Auer, Cesa-Bianchi, Freund, and Schapire] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

 [Auer et al.(2018)Auer, Gajane, and Ortner] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In 14th European Workshop on Reinforcement Learning, 2018.
 [Besbes et al.(2014)Besbes, Gur, and Zeevi] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27, 2014.
 [Besbes et al.(2015)Besbes, Gur, and Zeevi] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

 [Beygelzimer et al.(2011)Beygelzimer, Langford, Li, Reyzin, and Schapire] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
 [Bousquet and Warmuth(2002)] Olivier Bousquet and Manfred K Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
 [Cheung et al.(2019)Cheung, Simchi-Levi, and Zhu] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
 [Dudík et al.(2011)Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin, and Zhang] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.
 [Garivier and Moulines(2011)] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, 2011.

 [Hazan and Koren(2016)] Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In Proceedings of the 48th Annual ACM Symposium on the Theory of Computing, 2016.
 [Hazan and Seshadhri(2009)] Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, pages 393–400, 2009.
 [Herbster and Warmuth(1998)] Mark Herbster and Manfred K Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
 [Jadbabaie et al.(2015)Jadbabaie, Rakhlin, Shahrampour, and Sridharan] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
 [Jun et al.(2017)Jun, Orabona, Wright, Willett, et al.] Kwang-Sung Jun, Francesco Orabona, Stephen Wright, Rebecca Willett, et al. Online learning for changing environments using coin betting. Electronic Journal of Statistics, 11(2):5282–5310, 2017.
 [Karnin and Anava(2016)] Zohar S Karnin and Oren Anava. Multi-armed bandits: Competing with optimal sequences. In Advances in Neural Information Processing Systems 29, 2016.
 [Langford and Zhang(2008)] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 21, 2008.
 [Liu et al.(2018)Liu, Lee, and Shroff] Fang Liu, Joohyun Lee, and Ness Shroff. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [Luo and Schapire(2015)] Haipeng Luo and Robert E. Schapire. Achieving All with No Parameters: AdaNormalHedge. In 28th Annual Conference on Learning Theory (COLT), 2015.
 [Luo et al.(2018)Luo, Wei, Agarwal, and Langford] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In 31st Annual Conference on Learning Theory (COLT), 2018.
 [Rakhlin and Sridharan(2016)] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 [Slivkins and Upfal(2008)] Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: the Brownian restless bandits. In 21st Annual Conference on Learning Theory (COLT), pages 343–354, 2008.
 [Syrgkanis et al.(2016a)Syrgkanis, Krishnamurthy, and Schapire] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E Schapire. Efficient algorithms for adversarial contextual learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.
 [Syrgkanis et al.(2016b)Syrgkanis, Luo, Krishnamurthy, and Schapire] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E Schapire. Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems 29, 2016b.
 [Wei et al.(2016)Wei, Hong, and Lu] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems 29, 2016.
 [Yang et al.(2016)Yang, Zhang, Jin, and Yi] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning, pages 449–457, 2016.
 [Zhang et al.(2017)Zhang, Yang, Yi, Jin, and Zhou] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, 2017.
 [Zhang et al.(2018)Zhang, Yang, Jin, and Zhou] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 [Zinkevich(2003)] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.
Appendix A Useful Lemmas
In this section we prove two small lemmas that are useful for our analysis.
A.1 Discrepancy between intervals
The following results allow us to relate regret and variance measured on one interval to those measured on another, with the price in terms of the distribution variation. For any interval , its subintervals , and any , we have
Let