The multi-armed bandit (MAB) problem is a sequential learning setting where, in each round, the player decides which arm to pull from a $K$-arm bandit. The player only observes partial reward feedback for the pulled arm and may use past rewards to adapt its strategy. The goal is to balance the trade-off between exploration and exploitation over time and to minimize the cumulative regret over $T$ rounds. The MAB setting, first introduced by Thompson (1933), has received extensive attention during the past few decades due to its significant applications to online advertisements and recommender systems [17, 18]. Recently, the contextual bandit setting [1, 6] has received increasing interest due to its efficiency for large recommender systems (where the number of arms $K$ is large) and interrelated reward distributions. The Linear Upper Confidence Bound algorithm (LinUCB) was proposed for the contextual bandit setting under a linear assumption, where the reward of each arm is predicted by a linear model of the feature vectors and a regression parameter. This is known as the stochastic linear bandit. Chu et al. [linucb] proved a lower bound of $\Omega(\sqrt{dT})$ for the linear bandit setting, where $d$ is the dimension of the feature vectors. It was shown that LinUCB can achieve this lower bound up to logarithmic factors.
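As a point of reference for the discussion above, here is a minimal sketch of a disjoint-model LinUCB step in Python. The names `A`, `b`, and the exploration width `alpha` are our own conventions; the paper's exact index and constants are specified later.

```python
import numpy as np

def linucb_choose(features, A, b, alpha=1.0):
    """Pick the arm with the highest UCB index under a disjoint linear model.

    features: dict arm -> feature vector x_{t,a} of shape (d,)
    A, b: per-arm sufficient statistics, A[a] = I + sum x x^T, b[a] = sum r x
    alpha: exploration width (a tuning parameter in this sketch)
    """
    best_arm, best_ucb = None, -np.inf
    for a, x in features.items():
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]  # ridge estimate of this arm's parameter
        ucb = x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x)
        if ucb > best_ucb:
            best_arm, best_ucb = a, ucb
    return best_arm

def linucb_update(A, b, arm, x, reward):
    """Rank-one update of the pulled arm's statistics."""
    A[arm] += np.outer(x, x)
    b[arm] += reward * x
```

The second term of the index is the confidence width: it shrinks as an arm accumulates observations in informative directions, which is what drives exploration toward under-sampled arms.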
Most existing stochastic linear bandit algorithms, such as LinUCB and Linear Thompson Sampling (LinTS), assume that the regression parameters for the rewards stay stationary over time. However, in reality, the assumption of stationarity rarely holds. For example, in news recommendation, a user might be more interested in political news during a presidential debate and more interested in sports news during the NBA playoffs. Popular algorithms like LinUCB or LinTS, which achieve optimal regret bounds in stationary environments, can suffer linear regret in non-stationary environments in the worst case. Many efforts have been made to address this problem [11, 10, 21, 25], including methods that passively or actively adapt to the changing environment.
We explore solutions for piecewise-stationarity in stochastic bandit settings with linear assumptions, where the regression parameter stays stationary for a while and then changes abruptly at certain times. The main idea is to design a changepoint detection method and run the classic LinUCB algorithm within the intervals of homogeneity. When we detect a changepoint for an arm, we reset the LinUCB index for that arm. While the changepoint-based method sounds reasonable, it has not been successful so far due to the extreme difficulty of detecting faint changes in bandit problems. Piecewise-stationary environments in previous works mostly assume the change in mean reward (at least for some portion of the arms) is bounded below by a positive constant. However, faint changes can hardly be ignored: neglecting to pull an optimal arm whose reward changed only faintly still incurs a large regret when accumulated over a long stationary window.
In this paper, we first propose a piecewise-stationary environment with weaker assumptions, where we do not need the change in mean reward to be bounded below. We only require that for small changes, the adjacent stationary periods are long enough for our algorithm to detect the change, and vice versa. We then propose a multiscale changepoint detection based algorithm, Multiscale-LinUCB, for the piecewise-stationary linear bandit setting (formally defined in Section 2.2) and prove a nearly optimal regret bound for this algorithm. We show that the multiscale nature of the changepoint detector is essential for preventing poor regret when there are faint changes in the reward distribution. We then extend this setting to the piecewise-stationary MAB setting, where the reward distributions of some arms may change at certain changepoints. Extensive experiments show that our algorithm performs significantly better than other state-of-the-art algorithms in non-stationary environments.
There is an important line of work on non-stationary MAB problems [14, 19, 5, 7, 8]. Recently, there has also been novel research on non-stationary contextual (possibly non-linear) bandit algorithms with probabilistic assumptions on the context vectors [9, 20]. Chen et al. [chen2019new] attain a parameter-free and efficient algorithm assuming access to an ERM oracle. Here, we only discuss previous work on stochastic linear bandit algorithms for non-stationary environments, as it is most closely related to ours.
The recently developed D-LinUCB employs a weighted linear bandit model, where the weight of a data point depends on how recently it was observed. By putting a discount rate on past observations when computing the LinUCB index, it passively adapts to the changing environment. This approach is similar in spirit to Discounted UCB, which was proposed for non-stationary MAB. In the same work, Garivier and Moulines [dis-sw] proposed Sliding Window UCB for non-stationary MAB; Cheung et al. [cheung2018learning] generalized it to Sliding Window LinUCB (SW-LinUCB) for non-stationary stochastic linear bandits. SW-LinUCB computes the LinUCB index from only the $w$ most recent observations, where $w$ is the sliding window size. Both D-LinUCB and SW-LinUCB assume knowledge of a total variation bound $B_T \geq \sum_{t=1}^{T-1} \lVert \theta^*_{t+1} - \theta^*_t \rVert_2$, which is rarely available in practice. Here $\theta^*_t$ is the true model parameter of the regression model at time $t$. When the discount rate of D-LinUCB or the window size of SW-LinUCB is chosen based on $B_T$, both algorithms attain a regret upper bound of $\tilde{O}(d^{2/3} B_T^{1/3} T^{2/3})$.
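To make the discounting idea concrete, here is a hedged sketch of the discounted least-squares statistics that D-LinUCB-style algorithms maintain. The discount rate `gamma` and ridge parameter `lam` are illustrative tuning choices, not the values from any of the cited papers.

```python
import numpy as np

def discounted_update(V, s, x, r, gamma=0.99):
    """Down-weight all past observations by gamma, then add the new one:
    V_t = sum_u gamma^(t-u) x_u x_u^T,  s_t = sum_u gamma^(t-u) r_u x_u."""
    return gamma * V + np.outer(x, x), gamma * s + r * x

def discounted_estimate(V, s, lam=1.0):
    """Ridge estimate of the current parameter from the discounted statistics."""
    return np.linalg.solve(V + lam * np.eye(V.shape[0]), s)
```

Because old rounds are geometrically down-weighted, the estimate tracks a drifting parameter without any explicit changepoint detection; this is the passive adaptation described above.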
In addition to passively adapting to the changing environment, there has also been substantial work on actively adapting via changepoint detection methods. These works are mostly proposed for piecewise-stationary environments, and most of them assume the change in reward is bounded from below. The idea can be traced back to many algorithms for the piecewise-stationary MAB environment [19, 8, 15]. For piecewise-stationary linear bandits, Wu et al. [wu2018learning] proposed the Dynamic Linear UCB (dLinUCB) algorithm. The key idea of dLinUCB is to maintain a master bandit model which keeps track of the “badness” of several slave bandit models. The best slave model is chosen to determine which arm to pull each time, and the feedback is shared with all models in the system. When there is no “good” slave model in the pool, a change is detected and a new slave model is created. Wu et al. [wu2018learning] showed that when the “badness” of the model is set based on the proportion of arms changing and the lower bound on changes in rewards, the algorithm attains a regret upper bound of $\tilde{O}(\Gamma_T \sqrt{S_{\max}})$, where $S_{\max}$ is the length of the longest stationary period and $\Gamma_T$ is the total number of changepoints.
2.1 Problem Formulation
We consider the contextual bandit problem with disjoint linear models proposed by Li et al. [news_contextual], in a non-stationary environment. Over a time horizon $T$, let $[K] = \{1, \dots, K\}$ be the set of arms. At time $t$, the player has access to the feature vector $x_{t,a}$ of every arm $a \in [K]$. After observing these vectors, the player chooses an action $a_t \in [K]$ and observes a sample reward $r_{t,a_t}$. The observed rewards are independent of each other.
In the stationary setting, the expected reward of arm $a$ at time $t$ is modeled as a function of an unknown vector $\theta^*_a$ and the feature vector $x_{t,a}$. Under the linear assumption, the expected reward becomes
$$\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta^*_a. \qquad (1)$$
In the non-stationary contextual setting, the parameters $\theta^*_a$ can change over time. We assume that for arm $a$ there are $\gamma_a$ changepoints in total, denoted $\nu^{(a)}_1 < \dots < \nu^{(a)}_{\gamma_a}$, where we set $\nu^{(a)}_0 = 0$ and $\nu^{(a)}_{\gamma_a + 1} = T$. We say that $\nu^{(a)}_j$ is a changepoint for arm $a$ if the model parameter is different before and after time $\nu^{(a)}_j$. Specifically, we define the length of the $j$-th stationary period to be $L^{(a)}_j = \nu^{(a)}_j - \nu^{(a)}_{j-1}$, and rounds in this period are associated with an unknown parameter $\theta^{(a)}_j$, where $\theta^{(a)}_j \neq \theta^{(a)}_{j+1}$. We have
$$\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta^{(a)}_j \quad \text{for } \nu^{(a)}_{j-1} < t \leq \nu^{(a)}_j.$$
Define $\mathcal{C} = \bigcup_{a \in [K]} \{\nu^{(a)}_1, \dots, \nu^{(a)}_{\gamma_a}\}$ and $\Gamma_T = |\mathcal{C}|$. Then $\mathcal{C}$ is the set of all changepoints and $\Gamma_T$ is the total number of changepoints. Note that changepoints of different arms may coincide, which means that multiple arms can change at the same time. See Figure 1 for an illustration of the notation.
Define the optimal arm at time $t$ to be $a^*_t = \arg\max_{a \in [K]} \mathbb{E}[r_{t,a}]$, where the expected reward is defined in Equation 1. Similar to the stationary setting, the goal of the decision maker is to find a policy $\pi$ that chooses an arm at every time so as to minimize the total regret over time, where the total regret is defined to be
$$R_T = \sum_{t=1}^{T} \mathbb{E}\big[ r_{t, a^*_t} - r_{t, a_t} \big].$$
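Under the regret definition above, the dynamic regret of a sequence of pulls can be computed directly from the true per-round parameters. A small sketch; the array-shape conventions are our own:

```python
import numpy as np

def cumulative_regret(thetas, features, pulls):
    """Total regret: sum over t of (best expected reward - pulled expected reward).

    thetas:   array (T, K, d) of per-round, per-arm true parameters
    features: array (T, K, d) of context vectors
    pulls:    length-T sequence of arms the policy chose
    """
    # Expected reward of every arm at every round: x_{t,a}^T theta_{t,a}.
    exp_rewards = np.einsum('tkd,tkd->tk', features, thetas)
    best = exp_rewards.max(axis=1)
    pulled = exp_rewards[np.arange(len(pulls)), pulls]
    return float(np.sum(best - pulled))
```

Note that the comparator is the per-round optimal arm, so this is dynamic regret: a policy locked onto a formerly optimal arm keeps paying after a changepoint.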
2.2 Piecewise-stationary Environment
We study the piecewise-stationary environment, where the reward distribution remains the same for a while and changes abruptly at a changepoint. In addition, we propose two mild assumptions for our piecewise-stationary contextual environment.
(Sub-Gaussian Reward) The reward distribution is sub-Gaussian with parameter $\sigma$; without loss of generality, we assume $\sigma = 1$ in the analysis below.
Assumption 1 is widely used in the literature. In particular, it covers the Bernoulli rewards common in online recommender systems.
(Detectability) There exists a constant $c > 0$ such that the following holds. For an arm $a$ with adjacent stationary periods of lengths $L_1$ and $L_2$ respectively, suppose the true parameter changes from $\theta_1$ to $\theta_2$; for any $t$ in these two stationary periods, define $\Delta_{t,a} = x_{t,a}^\top (\theta_1 - \theta_2)$. We assume the following inequalities hold.
Assumption 2 is weaker than most assumptions made in the literature [19, 25]. Most changepoint-based methods for piecewise-stationary bandits assume the change in mean reward is bounded below to ensure detectability; our method does not need this. Assumption 2 only requires that when the change is small, the stationary periods $L_1$ and $L_2$ must be longer for us to detect the changepoint. In particular, this condition allows long stationary periods with correspondingly faint changes.
2.3 Proposed Algorithm: Multiscale-LinUCB
In this section, we introduce our proposed changepoint detection based LinUCB algorithm, Multiscale-LinUCB. Generally speaking, the algorithm runs LinUCB when there is no changepoint; when we detect a changepoint for an arm, we reset the LinUCB index for that arm.
One of the biggest challenges for changepoint detection in the stochastic bandit setting is that LinUCB does not pull every arm frequently enough to detect a change in its reward distribution. By design, LinUCB eventually stops pulling suboptimal arms, which can cause a missed changepoint in such an arm. If this arm then becomes optimal, the new optimal arm will continue to be neglected, resulting in regret that is linear in $T$. To remedy this problem, for each arm $a$ we randomly preselect some “changepoint detection” rounds at which to pull arm $a$. These are rounds at which we pull an arm purely for the purpose of detecting changepoints. For each arm $a$ there is thus a preselected set of rounds $S_a$, and the sets $S_a$ are disjoint across arms. The preselection probability is carefully chosen to balance minimizing total regret against having enough samples to detect changes in every arm. Moreover, in non-stationary bandit settings there can be a changepoint at any time, so it is important to always maintain some level of exploration; this ensures we still have a chance to choose the currently optimal arm, even if that arm was the worst arm in the past.
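The preselection step can be sketched as follows. The rate `p` here is a placeholder for the carefully chosen probability in the paper; assigning each detection round to exactly one arm keeps the per-arm sets disjoint by construction.

```python
import random

def preselect_rounds(T, K, p, seed=0):
    """Randomly preselect disjoint per-arm changepoint-detection rounds.

    Each round t becomes a detection round with probability K*p and is then
    assigned uniformly to one of the K arms, so each arm receives roughly
    p*T detection rounds, and the per-arm sets are disjoint.
    """
    rng = random.Random(seed)
    S = {a: set() for a in range(K)}
    for t in range(T):
        if rng.random() < K * p:
            S[rng.randrange(K)].add(t)
    return S
```

In practice these rounds can be drawn once up front, which matters for the analysis: the detector's guarantees below treat the preselected rounds as fixed in advance.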
Let us now focus on a single arm $a$. Assume we have detected the most recent changepoint at time $\tau$ and are now at time $t$; for any cut point $s$ with $\tau < s < t$, we cut the interval $I = (\tau, t]$ into two parts, $I_1 = (\tau, s]$ and $I_2 = (s, t]$. Over $I$, we collect the observed rewards $y$ with elements $r_{u,a}$ for $u \in S_a \cap I$. Similarly, $y_1$ and $y_2$ collect the rewards observed at the preselected rounds in $I_1$ and $I_2$, respectively. Define the design matrix $X_a(I)$ for arm $a$ over interval $I$ to be the matrix whose rows are $x_{u,a}^\top$ for $u \in S_a \cap I$; the design matrices $X_a(I_1)$ and $X_a(I_2)$ are defined analogously.
We claim there is a changepoint at time $t$ for arm $a$ if there exists a cut point $s$ such that the test statistic exceeds a threshold $b$, where $b$ is a constant to be specified and the statistic is defined as
Otherwise, we assert that there is no changepoint in the interval $(\tau, t]$. See Algorithm 1 for details. In our algorithm, we need to check the following condition to verify the trustworthiness of a detection, and we also require this condition to hold at the true changepoints.
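A hedged sketch of the split-versus-pooled residual comparison behind this kind of detector: fit one least-squares model over the whole window and separate models over the two pieces; a large drop in residual sum of squares at some cut suggests the parameter changed. The threshold below is illustrative only, not the threshold constant from the analysis.

```python
import numpy as np

def rss(X, y, lam=1e-6):
    """Residual sum of squares of a (lightly regularized) least-squares fit."""
    theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return float(np.sum((y - X @ theta) ** 2))

def split_gain(X, y, s):
    """Improvement in fit from allowing different parameters before/after cut s."""
    return rss(X, y) - rss(X[:s], y[:s]) - rss(X[s:], y[s:])

def detect_change(X, y, threshold):
    """Scan all cuts with enough points on both sides; flag a large gain."""
    d = X.shape[1]
    return any(split_gain(X, y, s) > threshold for s in range(d + 1, len(y) - d))
```

Scanning every cut is what gives the detector its multiscale character: a large abrupt change is caught by a short window, while a faint change only produces a significant gain once both sides of some cut are long.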
(Minimum stationary length and well-conditionedness) There exists a universal constant $c_0 > 0$ such that for every arm $a$ and every two adjacent stationary regions, if we compute the associated Gram matrices, then the following hold:
and such that
Here the Gram matrix is defined as
Our proposed Multiscale-LinUCB algorithm is formally presented in Algorithm 2. Our analysis is only valid when the preselected rounds can be considered fixed (i.e., predetermined independently of the data). In practice, we combine these rounds with those sampled in the LinUCB steps for the changepoint detection steps as well. We would like to clarify that Assumptions 2 and 3 are only needed for the theoretical analysis; in practice, Multiscale-LinUCB achieves significantly better experimental results even in settings where these two assumptions do not hold, as shown in Section 5.
If Algorithm 1 could detect every changepoint perfectly, then we could simply restart LinUCB at the beginning of every stationary period and achieve a regret upper bound of $\Gamma_T \cdot R(S_{\max})$, where $S_{\max}$ is the length of the longest stationary period, $\Gamma_T$ is the total number of changepoints, and $R(S_{\max})$ is the regret of LinUCB over a stationary period of length $S_{\max}$. However, every changepoint detection method suffers false alarms and detection delays. Suppose we are at time $t$ with most recent detected changepoint $\tau$: a false alarm means that although there is no changepoint in the interval $(\tau, t]$, the algorithm claims there is a changepoint at time $t$. For a true changepoint $\nu$, if the algorithm raises an alarm at time $t > \nu$, the detection delay is defined to be $t - \nu$.
Let $\delta \in (0, 1)$ be a confidence parameter. Consider all adjacent sampled regions that satisfy:
Assumption 3 holds;
For an arm $a$ and two adjacent stationary regions where the parameter changes from $\theta_1$ to $\theta_2$, if we compute the test statistic over these regions, we have
where $c$ is a constant depending only on $c_0$ from Assumption 3, and it is the same constant as in Assumption 2. Then there exists a threshold $b$, dependent on the input, such that if we run Algorithm 2, we detect all such changepoints with probability at least $1 - \delta$.
For ease of notation, let $y$ denote the rewards observed at the preselected rounds in $(\tau, t]$. Note that the fitted vector over the split intervals is the projection of $y$ onto the column space of the block-diagonal matrix whose two blocks are the design matrices of $I_1$ and $I_2$, respectively. We denote this projection by $P$ and let $P_0$ be the projection onto the column space of the full design matrix. Writing $y = \mu + \epsilon$, where $\mu$ is the mean vector and $\epsilon$ is the zero-mean sub-Gaussian noise vector, we decompose the statistic accordingly. For the first term, by the properties of projection matrices and Assumption 3, one can verify that it is bounded below by the signal strength induced by the parameter change. For the noise term, by idempotency of projections, it suffices to control $\lVert (P - P_0)\epsilon \rVert^2$, and by the Hanson-Wright inequality this concentrates at the rank of the projection. So we detect the changepoint with high probability as long as the threshold $b$ is set as a suitable constant multiple of the noise level, for a constant depending only on $c_0$. We then set the per-test failure probability appropriately and apply the union bound to obtain the desired result. ∎
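For concreteness, the sub-Gaussian quadratic-form tail invoked here can be stated as follows; this is the standard specialization of the Hanson-Wright-type bound of Hsu, Kakade, and Zhang to a rank-$r$ projection, with generic symbols rather than the paper's exact constants:

```latex
% For a projection matrix Q with rank(Q) = r and a zero-mean
% sub-Gaussian(1) vector \epsilon, for all u > 0:
\mathbb{P}\left( \lVert Q\epsilon \rVert_2^2 > r + 2\sqrt{r u} + 2u \right) \le e^{-u}.
```

Applying this with $Q = P - P_0$ gives the stated concentration of the noise term around the rank of the projection.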
The following lemma bounds the false alarm probability.
We can apply much of the same reasoning as in the previous proof. Notice that within a stationary region there is a single parameter vector generating the means, so the signal term vanishes. As before, by the Hanson-Wright inequality, the noise term concentrates, so with high probability we do not detect a changepoint for a single selection of cut point within a stationary region. Furthermore, we can apply this result uniformly over the selections of arm, cut point, and stationary region with the union bound, and arrive at our conclusion by setting the per-test failure probability appropriately. ∎
4.1 Non-stationary Joint Linear Models
In addition to disjoint linear models, Chu et al. [linucb] also proposed a contextual framework with a joint linear model. We consider the extension of Multiscale-LinUCB to joint linear models below. This model is also consistent with the one considered by Russac et al. [russac2019weighted] and Cheung et al. [cheung2018learning].
There are still $\Gamma_T$ changepoints in total. However, the changepoints and model parameters are now shared across arms: in the $j$-th stationary period, each arm is associated with the same model parameter $\theta_j$.
The analog of the Multiscale-LinUCB algorithm for the joint linear model is essentially the same. However, we now only need to randomly preselect one shared set of rounds in total, rather than one set per arm. For a cut point $s$, we similarly split the interval since the last detected changepoint into two parts and define the corresponding reward vectors and design matrices. We assert there is a changepoint if there exists a cut point $s$ such that
Consider adjacent stationary periods of lengths $L_1$ and $L_2$ respectively, where the true parameter changes from $\theta_1$ to $\theta_2$. In these two stationary periods, there are two preselected sets of rounds used for changepoint detection only. If there exists a constant depending only on the constant in Assumption 3 such that
4.2 Non-stationary Multi-armed Bandit (MAB)
There is a large literature on non-stationary multi-armed bandit problems [14, 19, 3, 4, 5, 8, 7]. Most of the notation remains the same as in Section 2.1; however, there is no regression parameter now. In the MAB setting, the algorithm can be simplified considerably: we no longer need to randomly preselect rounds for changepoint detection. Instead, at time $t$, we randomly select each arm with a small probability, and with the remaining probability we pull the arm with the maximum UCB index. Define the sample means of rewards over the relevant sub-intervals.
We calculate the test statistic as follows.
If there exists a cut point $s$ such that the statistic exceeds the threshold, then we reset the most recent changepoint to the current time $t$ and reset the UCB index for arm $a$. Otherwise, we assert there is no changepoint in the interval and keep running UCB.
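In the MAB case the linear fit reduces to sample means, so the test can be sketched as a standardized gap between the mean rewards before and after a cut. The threshold here is again illustrative, not the constant from the analysis.

```python
import math

def mean_shift_statistic(rewards, s):
    """Standardized difference of sample means before and after cut s."""
    left, right = rewards[:s], rewards[s:]
    m_left = sum(left) / len(left)
    m_right = sum(right) / len(right)
    # sqrt(n1*n2/n) scaling makes the statistic O(1) under no change.
    return abs(m_left - m_right) * math.sqrt(len(left) * len(right) / len(rewards))

def detect_change_mab(rewards, threshold):
    """Scan all cuts; flag a change if any standardized gap exceeds threshold."""
    return any(mean_shift_statistic(rewards, s) > threshold
               for s in range(1, len(rewards)))
```

As in the linear case, scanning every cut is what makes the test multiscale: faint mean shifts only become significant once both sides of some cut contain many samples.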
Define $\Delta_a$ to be the change in mean reward of arm $a$ at a changepoint, and define the corresponding maximum and minimum changes over all changepoints. Without loss of generality, we can assume the changes are bounded by a constant for all arms. We provide an analog of Assumption 2 and a regret analysis in the MAB setting.
(Detectability) Let $L_1$ and $L_2$ be the lengths of two adjacent stationary periods for arm $a$. For all such pairs, we assume there exists a constant $c > 0$ such that
5 Experimental Results
In Algorithm 1, although the algorithm breaks at a cut point $s$ of the scanned interval, the returned changepoint is the current time $t$. However, we found that reusing the information in $(s, t]$ helps reduce cumulative regret, so in the experiments below we use $s$ rather than $t$ as the detected changepoint. We compare our algorithm with state-of-the-art algorithms including Sliding Window LinUCB (SW-LinUCB), D-LinUCB, and LinUCB. We omit the comparison with Dynamic Linear UCB (dLinUCB), since Russac et al. [russac2019weighted] showed in their experiments that dLinUCB performs much worse than D-LinUCB, and even worse than LinUCB in many simulations, which was also the case in our experiments.
For Multiscale-LinUCB, although the input threshold needs to be chosen according to the analysis to achieve our regret bound, we found that in most experiments a simple fixed choice is enough. Both SW-LinUCB and D-LinUCB need to know an upper bound $B_T$ on $\sum_t \lVert \theta^*_t - \theta^*_{t+1} \rVert_2$, where $\theta^*_t$ is the true model parameter at time $t$. In practice, however, $B_T$ is often unknown. The authors of SW-LinUCB suggest a default choice when $B_T$ is unknown, and we use that default in the comparisons.
All the experiments shown here are for non-stationary contextual bandits with the joint linear model, since SW-LinUCB and D-LinUCB are proposed for joint linear models. For all experiments, we draw the sample reward from a Gaussian distribution centered at the mean reward of the pulled arm at time $t$. Feature vectors are drawn randomly. In each stationary period, the true model parameter is drawn randomly as well, except in the first scenario. We repeat the experiments and plot the average regret across runs. We demonstrate the success of Multiscale-LinUCB under the scenarios below. Note that for both Scenarios 1 and 2, if you zoom in on the plots, you will find that the regret of Multiscale-LinUCB accumulates at a faster rate at the changepoints; immediately after the changepoints, the regret accumulates much more slowly, which shows that our algorithm captures the change quickly and adapts well to the changing environment. Details can be found in Figure 2.
Scenario 1 (Detectable environments): This setting is similar to that of prior work: the true parameter takes a fixed value in each of four consecutive periods, switching abruptly at three changepoints. From the plot in Figure 2, we can see that LinUCB cannot adapt to the abruptly changing environment. SW-LinUCB and D-LinUCB exhibit similar behavior at an abrupt changepoint: both incur fairly large regret for some rounds right after it. Multiscale-LinUCB adapts to the change faster and therefore achieves smaller regret.
Scenario 2 (High dimensions): The changepoints are evenly spread over the whole time horizon. In the experiments of the D-LinUCB paper, it was shown that D-LinUCB can perform well in high dimensions. We can see from Figure 2 that Multiscale-LinUCB adapts to changes much faster and performs much better than all other algorithms in high dimensions.
Scenario 3 (Random changepoints): At each time $t$, the parameter changes with a small fixed probability. Although we require each stationary period to be long enough in Assumptions 2 and 3, we show here that even when the changepoints are randomly distributed over the whole time horizon, so that Assumptions 2 and 3 may be violated, Multiscale-LinUCB still performs quite well.
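A sketch of a Scenario-3-style environment generator. We assume unit-norm Gaussian parameter draws and a per-round change probability `p`; the paper's exact sampling details may differ.

```python
import numpy as np

def random_changepoint_thetas(T, d, p, rng):
    """Parameter path that is resampled with probability p at each round.

    Returns an array of shape (T, d) of unit-norm parameters; consecutive
    rows are identical except at the (random) changepoints.
    """
    def draw():
        theta = rng.standard_normal(d)
        return theta / np.linalg.norm(theta)

    thetas = np.empty((T, d))
    theta = draw()
    for t in range(T):
        if t > 0 and rng.random() < p:
            theta = draw()  # a changepoint occurs at round t
        thetas[t] = theta
    return thetas
```

With change probability `p`, stationary periods have geometric lengths with mean `1/p`, so short periods (violating the minimum-length assumption) occur regularly; this is exactly what makes the scenario a stress test.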
Scenario 4 (Multiple arms): At each time $t$, the parameter changes with a small fixed probability. This scenario shows that Multiscale-LinUCB works well with multiple arms. We found that the regret of every algorithm scales roughly linearly with the number of arms, even though the regret analyses of D-LinUCB and SW-LinUCB give upper bounds that are invariant to the number of arms.
We proposed Multiscale-LinUCB, a multiscale changepoint detection based LinUCB algorithm for the non-stationary stochastic disjoint linear bandit setting. We also extended it to the non-stationary joint linear bandit setting and to the MAB setting. The regret of our proposed algorithm matches the lower bound up to a logarithmic factor. In particular, our algorithm can also handle faint changes in mean reward. Experimental results show that the proposed algorithm significantly outperforms other state-of-the-art algorithms in non-stationary environments.
- (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37 (4), pp. 263–293.
- (2013) Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135.
- (2015) Exp3 with drift detection for the switching bandit problem. In Data Science and Advanced Analytics (DSAA), 2015 IEEE International Conference on, pp. 1–7.
- (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1), pp. 48–77.
- (2018) Adaptively tracking the best arm with an unknown number of distribution changes. In 14th European Workshop on Reinforcement Learning.
- (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422.
- (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pp. 199–207.
- (2018) Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. arXiv preprint arXiv:1802.03692.
- (2019) A new algorithm for non-stationary contextual bandits: efficient, optimal, and parameter-free. arXiv preprint arXiv:1902.00980.
- (2018) Hedging the drift: learning to optimize under non-stationarity. Available at SSRN 3261050.
- (2018) Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024.
- (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214.
- (2008) On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
- (2011) On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pp. 174–188.
- (1971) Inference about the change-point from cumulative sum tests. Biometrika 58 (3), pp. 509–523.
- (2012) A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability 17.
- (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
- (2018) Information directed sampling for stochastic bandits with graph feedback. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2017) A change-detection based framework for piecewise-stationary multi-armed bandit problem. arXiv preprint arXiv:1711.03539.
- (2017) Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799.
- (2019) Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pp. 12017–12026.
- (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36 (4), pp. 500–522.
- (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
- (2010) Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
- (2018) Learning contextual bandits in a non-stationary environment. arXiv preprint arXiv:1805.09365.
- (2009) Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1177–1184.
Appendix A Proofs for Non-stationary Contextual Bandit Setting
A.1 Proof of Lemma 1
For ease of notation, consider the test statistic defined in (5).
By standard OLS theory, the vector is the projection of onto the column space of the following matrix,
Let us call this projection $P$, and let $P_0$ be the projection onto the column space of the full design matrix. Then
Notice that these column spaces are nested, so that $P - P_0$ is the projection onto a subspace orthogonal to the column space of the full design matrix. Write $y = \mu + \epsilon$ for a zero-mean sub-Gaussian(1) vector $\epsilon$. By the triangle inequality, we have that
Let us begin by lower bounding the first term on the RHS. Notice that for any vector we have that
since . Let
where and denote . Hence,
Because the projections are onto nested subspaces and the vector in question lies in the outer subspace, we have that
The first term can be written as,
The second term can be written as,
Notice that and