1 Introduction
The multi-armed bandit (MAB) problem is a sequential learning setting, where in each round the player decides which arm to pull from a $K$-armed bandit. The player only observes partial reward feedback according to the pulled arm and may use the past rewards to adapt its strategy. The goal is to balance the tradeoff between exploration and exploitation over time and minimize the cumulative regret up to $T$ rounds. The MAB setting, first introduced by [23], has received extensive research attention during the past few decades due to its significant applications to online advertisement [22] and recommender systems [17, 18]. Recently, the contextual bandit setting [1, 6] has received increasing interest due to its efficiency in the case of large recommender systems (where the number of arms $K$ is large) and interrelated reward distributions. The Linear Upper Confidence Bound algorithm (LinUCB) [17]
was proposed for the contextual bandit setting under a linear assumption, where the reward of each arm is predicted by a linear model of its feature vectors $x_{a,t}$ and a linear regression parameter $\theta_a$. This is known as the stochastic linear bandit. Chu et al. [12] proved a lower bound of $\Omega(\sqrt{dT})$ for the linear bandit setting, where $d$ is the dimension of the feature vectors. It was shown that LinUCB can achieve this lower bound up to a logarithmic factor [12]. Most existing stochastic linear bandit algorithms like LinUCB and Linear Thompson Sampling (LinTS)
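To make the LinUCB idea above concrete, here is a minimal sketch of the index computation for the disjoint-model case. The function names, the ridge regularization, and the fixed `alpha` are illustrative assumptions, not the exact construction of the cited papers.

```python
import numpy as np

def linucb_index(x, A, b, alpha=1.0):
    """UCB index of one arm under the linear model E[r] = x^T theta.

    A: d x d regularized Gram matrix (X^T X + lam*I), b: X^T y,
    x: feature vector of the arm. alpha scales the exploration bonus;
    its theoretical value depends on d, t, and the confidence level.
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                   # ridge estimate of theta
    bonus = alpha * np.sqrt(x @ A_inv @ x)  # confidence width
    return x @ theta_hat + bonus

def choose_arm(features, As, bs, alpha=1.0):
    """Pick the arm with the largest LinUCB index (disjoint models)."""
    idx = [linucb_index(x, A, b, alpha) for x, A, b in zip(features, As, bs)]
    return int(np.argmax(idx))
```

After pulling the chosen arm and observing reward `r`, one would update that arm's statistics via `A += np.outer(x, x)` and `b += r * x`.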
[2] assume the regression parameters for the rewards stay stationary over time. However, in reality, the assumption of stationarity rarely holds. As an example, in news recommendation, a user might be more interested in political news during a presidential debate, and more interested in sports news during the NBA playoff season. Popular algorithms like LinUCB or LinTS, which achieve optimal regret bounds in stationary environments, could end up with linear regret in nonstationary environments in the worst case. Many efforts have been made to address this problem [11, 10, 21, 25], including methods that passively and actively adapt to the changing environment. We explore solutions for piecewise-stationarity in stochastic bandit settings with linear assumptions, where the regression model parameter stays stationary for a while and changes abruptly at certain times. The main idea is to design a changepoint detection method and perform the classic LinUCB algorithm within the intervals of homogeneity. When we detect a changepoint for an arm, we reset the LinUCB index for this arm. While the changepoint-based method sounds reasonable, it has not been successful so far due to the extreme difficulty of detecting faint changes in bandit problems. The piecewise-stationary environment in previous works mostly assumes the change in mean reward (at least for some portion of the arms) is bounded below by a constant [25]. However, faint changes are hardly ignorable. For example, neglecting to pull an optimal arm whose mean reward has changed only faintly over a long stationary window is going to incur a large regret.
In this paper, we first propose a piecewise-stationary environment with weaker assumptions, where we do not need the change in mean reward to be bounded below. We only require that, for small changes, the consecutive stationary periods be relatively long enough for our algorithm to detect a change, and vice versa. We then propose a multiscale changepoint detection based algorithm, MultiscaleLinUCB, for the piecewise-stationary linear bandit setting (formally defined in Section 2.2) and prove a nearly optimal regret bound for this algorithm. We show that the multiscale nature of the changepoint detector is essential for preventing poor regret when there are faint changes in the reward distribution. We then extend this setting to the piecewise-stationary MAB setting, where the reward distributions of some arms may change at certain changepoints. Extensive experiments show that our algorithm performs significantly better than other state-of-the-art algorithms in nonstationary environments.
Related Work:
There is an important line of work on nonstationary MAB problems [14, 19, 5, 7, 8]. Recently, there has also been some novel research that considers nonstationary contextual (possibly nonlinear) bandit algorithms with probabilistic assumptions on the context vectors [9, 20]. Chen et al. [9] attain a parameter-free and efficient algorithm assuming access to an ERM oracle. Here, we will only discuss previous works on stochastic linear bandit algorithms for nonstationary environments, as those works are closely related to ours.
The recently developed DLinUCB [21] employs a weighted linear bandit model, where the weight is adjusted according to how recently the data point was observed. By putting a discount rate on past observations when computing the LinUCB index, it passively adapts to the changing environment. This work has its roots in Discounted UCB, which was proposed for nonstationary MAB [14]. In the same work, Garivier and Moulines [14] proposed Sliding Window UCB for nonstationary MAB. Cheung et al. [11] generalized it to Sliding Window LinUCB (SWLinUCB) for nonstationary stochastic linear bandits. SWLinUCB computes the LinUCB index according to only the most recent $w$ observations, where $w$ is the sliding window size. Both DLinUCB and SWLinUCB assume knowledge of a total variation bound $B_T$, where $B_T \geq \sum_{t=1}^{T-1} \|\theta_t - \theta_{t+1}\|$, which is rarely practical in reality. Here $\theta_t$ is the true model parameter for the regression model at time $t$. When the discount rate of DLinUCB or the window size of SWLinUCB is chosen based on $B_T$, both algorithms can attain a regret upper bound of $\tilde{O}(d^{2/3} B_T^{1/3} T^{2/3})$.
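The sliding-window idea can be sketched as follows: the LinUCB statistics are rebuilt from only the last `w` rounds, so older observations are forgotten. This is an illustrative simplification (the class name, fixed `alpha`, and ridge parameter `lam` are assumptions), not the exact SWLinUCB of the cited work.

```python
import numpy as np
from collections import deque

class SlidingWindowLinUCB:
    """Minimal sketch: LinUCB statistics rebuilt from the last w rounds."""

    def __init__(self, d, w, lam=1.0, alpha=1.0):
        self.d, self.lam, self.alpha = d, lam, alpha
        self.history = deque(maxlen=w)   # (x, r) pairs inside the window

    def update(self, x, r):
        """Record one observation; deque evicts samples older than w."""
        self.history.append((x, r))

    def index(self, x):
        """LinUCB index computed from the windowed data only."""
        A = self.lam * np.eye(self.d)
        b = np.zeros(self.d)
        for xi, ri in self.history:      # only the most recent w samples
            A += np.outer(xi, xi)
            b += ri * xi
        A_inv = np.linalg.inv(A)
        return x @ (A_inv @ b) + self.alpha * np.sqrt(x @ A_inv @ x)
```

Rebuilding `A` on every call costs O(w d^2); an incremental implementation would instead add the new outer product and subtract the evicted one.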
In addition to passively adapting to the changing environment, there has also been substantial work on actively adapting to changing environments via changepoint detection methods. These works are mostly proposed for piecewise-stationary environments, and most of them assume the change in reward is bounded from below. The idea can be traced back to many algorithms for the piecewise-stationary MAB environment [19, 8, 15]. For piecewise-stationary linear bandits, Wu et al. [25] proposed the Dynamic Linear UCB (dLinUCB) algorithm. The key idea of dLinUCB is to maintain a master bandit model which keeps track of the “badness” of several slave bandit models. The best slave model is chosen to determine which arm to pull each time, and the feedback is shared with all models in the system. When there is no “good” slave model in the pool, a change is detected and a new slave model is created. Wu et al. [25] showed that when the “badness” of the model is set based on the proportion of arms changing and the lower bound of the changes in rewards, the algorithm attains an optimal regret upper bound in terms of the length of the longest stationary period and the total number of changepoints.
2 Methodology
2.1 Problem Formulation
We consider the contextual bandit problem with disjoint linear models proposed by Li et al. [17], in a nonstationary environment. Over a time horizon $T$, let $[K] = \{1, \dots, K\}$ be the set of arms. At time $t$, the player has access to the feature vectors $x_{a,t} \in \mathbb{R}^d$ of every arm $a \in [K]$. After observing these feature vectors, the player chooses an action $a_t$ and observes a sample reward $r_{a_t,t}$. The observed rewards are independent of each other.
In the stationary setting, the expected reward of arm $a$ at time $t$ is modeled as a function of an unknown vector $\theta_a$ and the feature vectors. Under the linear assumption, the expected reward becomes
$\mathbb{E}[r_{a,t}] = x_{a,t}^\top \theta_a.$  (stationary)
In the nonstationary contextual setting, $\theta_a$ could change over time. We assume that for arm $a$ there are in total $\Gamma_a$ changepoints, denoted as $c^{(a)}_1 < c^{(a)}_2 < \cdots < c^{(a)}_{\Gamma_a}$, where $c^{(a)}_0 = 0$ and $c^{(a)}_{\Gamma_a + 1} = T$. We say that $c$ is a changepoint for arm $a$ if the model parameter is different before and after time $c$. Specifically, for the stationary periods, we define the length of the $j$th stationary period to be $\ell^{(a)}_j = c^{(a)}_j - c^{(a)}_{j-1}$; the rounds $t \in [c^{(a)}_{j-1}, c^{(a)}_j)$ are associated with an unknown parameter $\theta^{(a)}_j$, where $\theta^{(a)}_j \neq \theta^{(a)}_{j+1}$. We have
$\mathbb{E}[r_{a,t}] = x_{a,t}^\top \theta^{(a)}_j, \quad t \in [c^{(a)}_{j-1}, c^{(a)}_j).$  (1)
Define $\mathcal{C} = \bigcup_{a \in [K]} \{c^{(a)}_1, \dots, c^{(a)}_{\Gamma_a}\}$ and $\Gamma_T = |\mathcal{C}|$. Then $\mathcal{C}$ is the set of all changepoints and $\Gamma_T$ is the total number of changepoints. Note that it is possible that $\Gamma_T < \sum_{a} \Gamma_a$, which means that multiple arms may change at the same time. See Figure 1 for an illustration of the notation.
Define the optimal arm at time $t$ to be $a^*_t$, i.e., $a^*_t = \arg\max_{a \in [K]} x_{a,t}^\top \theta_{a,t}$, where $\theta_{a,t}$ is the parameter defined in Equation 1. Also write $\mu^*_t = x_{a^*_t,t}^\top \theta_{a^*_t,t}$ for the optimal mean reward. Similar to stationary settings, the goal of the decision maker is to find a policy $\pi$ such that, following $\pi$, it chooses an arm $a_t$ every round to minimize the total regret over time, where the total regret is defined to be
$R(T) = \sum_{t=1}^{T} \mathbb{E}\big[\mu^*_t - x_{a_t,t}^\top \theta_{a_t,t}\big].$
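Given a full specification of the environment, the dynamic regret defined above can be computed by comparing each pull against the per-round optimal arm. The sketch below assumes the environment's features and (possibly time-varying) parameters are available as plain arrays; all names are hypothetical.

```python
import numpy as np

def dynamic_regret(features, thetas, pulls):
    """Cumulative dynamic regret for a piecewise-stationary linear bandit.

    features[t][a]: feature vector of arm a at round t
    thetas[t][a]:   true parameter of arm a at round t (may change over t)
    pulls[t]:       arm pulled by the policy at round t
    """
    total = 0.0
    for t, (X, Th) in enumerate(zip(features, thetas)):
        means = [x @ th for x, th in zip(X, Th)]  # E[r_{a,t}] = x^T theta
        total += max(means) - means[pulls[t]]     # gap to the oracle arm
    return total
```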
2.2 Piecewisestationary Environment
We study the piecewise-stationary environment of [26], where the reward distribution remains the same for a while and abruptly changes at a changepoint. In addition, we propose two mild assumptions for our piecewise-stationary contextual environment.
Assumption 1.
(Sub-Gaussian Reward) The reward distribution is sub-Gaussian with parameter $\sigma$; without loss of generality, we assume $\sigma = 1$ for the analysis below.
Assumption 1 is standard in the literature. It covers the widely used Bernoulli rewards in online recommender systems.
Assumption 2.
(Detectability) There exists a constant $C > 0$ such that the following holds. For arm $a$ and adjacent stationary periods of length $\ell_1$ and $\ell_2$ respectively, the true parameter changes from $\theta$ to $\theta'$; for the feature vectors $x_{a,t}$ in these two stationary periods, define $\delta_a = \min_t |x_{a,t}^\top (\theta - \theta')|$. We assume the following inequalities hold.
Assumption 2 is weaker than most of the assumptions made in the literature [19, 25]. Most changepoint-based methods for piecewise-stationary bandits assume the change in mean reward is bounded below by a constant to ensure detectability. However, our method does not need this. Assumption 2 means that when the change $\delta_a$ is small, we need longer stationary periods $\ell_1$ and $\ell_2$ to detect a changepoint. For example, this condition allows long stationary periods with correspondingly faint changes.
2.3 Proposed Algorithm: MultiscaleLinUCB
In this section, we introduce our proposed changepoint-detection-based LinUCB algorithm, MultiscaleLinUCB. Generally speaking, the algorithm performs the LinUCB algorithm when there is no changepoint, and when we detect a changepoint for an arm, we reset the LinUCB index for that arm.
One of the biggest challenges for changepoint detection in the stochastic bandit setting is that LinUCB will not pull every arm frequently enough to detect a change in its reward distribution. Due to the nature of LinUCB, it will eventually stop pulling suboptimal arms, but this can cause a missed changepoint in such an arm. If that arm then becomes optimal, the new optimal arm will continue to be neglected, resulting in a regret that is linear in $T$. To remedy this problem, we randomly preselect some “changepoint detection” rounds with probability $p$ to pull arm $a$. These are rounds at which we pull an arm purely for the purpose of detecting changepoints. Therefore, for each arm $a$, there will be approximately $pT$ preselected rounds $\mathcal{T}_a$, such that the sets $\{\mathcal{T}_a\}_{a \in [K]}$ are disjoint. This probability is carefully selected so that we can balance minimizing the total regret against having enough samples to detect changes in every arm. Moreover, in nonstationary bandit settings, there could be a changepoint at any time, so it is important to maintain some level of exploration at all times, to make sure that we still have a chance to choose the arm that is optimal at the current time, even though this arm could have been the worst arm previously. Let us now focus on a single arm $a$. Assume we have detected the most recent changepoint $c$ and are now at time $s$; for any cut point $b \in (c, s)$, we cut the interval $[c, s]$ into two parts. Define $I_1 = [c, b]$ and $I_2 = (b, s]$. Over $I_1$ and $I_2$, we collect the observed rewards $y_1$ and $y_2$, whose elements are $r_{a,t}$ for $t \in I_1 \cap \mathcal{T}_a$ and $t \in I_2 \cap \mathcal{T}_a$, respectively. Define the design matrix for arm $a$ on an interval to be the matrix whose rows are $x_{a,t}^\top$ for $t$ in that interval intersected with $\mathcal{T}_a$; write $X_1$, $X_2$, and $X$ for the design matrices on $I_1$, $I_2$, and $[c, s]$.
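One simple way to realize such disjoint preselected detection rounds is to mark each round as a detection round independently and hand it to a uniformly chosen arm. This is an illustrative sketch; the exact sampling scheme and rate used by the algorithm may differ.

```python
import numpy as np

def preselect_rounds(T, K, p, rng=None):
    """Preselect disjoint changepoint-detection rounds for K arms.

    Each round t is declared a detection round independently with
    probability K*p, then assigned to exactly one uniformly chosen arm,
    so each arm receives detection rounds at rate ~p and the per-arm
    sets are disjoint by construction.
    """
    rng = np.random.default_rng(rng)
    rounds = {a: [] for a in range(K)}
    for t in range(T):
        if rng.random() < K * p:               # t is a detection round
            rounds[int(rng.integers(K))].append(t)  # give it to one arm
    return rounds
```

Treating these rounds as fixed before the data arrive is what makes the detection analysis clean, as discussed in Section 2.3.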
Since the OLS estimator is unbiased for the true parameter, to detect changepoints for arm $a$ we calculate the OLS estimators for the intervals $I_1$, $I_2$, and $[c, s]$ as follows:
$\hat{\theta}_1 = (X_1^\top X_1)^{-1} X_1^\top y_1,$  (2)
$\hat{\theta}_2 = (X_2^\top X_2)^{-1} X_2^\top y_2,$  (3)
$\hat{\theta} = (X^\top X)^{-1} X^\top y.$  (4)
We claim there is a changepoint at time $s$ for arm $a$ if there exists a cut point $b$ such that $S_b > \tau$, where $\tau$ is a constant to be specified and $S_b$ is defined as
$S_b = \left\| \begin{pmatrix} X_1 \hat{\theta}_1 \\ X_2 \hat{\theta}_2 \end{pmatrix} - X \hat{\theta} \right\|_2.$  (5)
Otherwise, we assert that there is no changepoint in the interval $[c, s]$. See Algorithm 1 for details. In our algorithm, we will need to check the following condition in order to verify the trustworthiness of our detection, and we also require this condition at the true changepoints.
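The detection rule can be sketched as follows: for each candidate cut, compare the fitted values of the split OLS fit against the pooled OLS fit, and flag a change when they disagree by more than a threshold. This is a hedged reconstruction of the test described above; the threshold `tau` and the exact normalization are assumptions, and `lstsq` is used for numerical stability.

```python
import numpy as np

def split_statistic(X, y, b):
    """Discrepancy between the split OLS fit and the pooled OLS fit.

    X: n x d design matrix over the interval, y: rewards, b: cut index.
    Returns ||(X1 th1; X2 th2) - X th||_2, where th1, th2, th are OLS
    estimates on [0, b), [b, n), and [0, n) respectively.
    """
    X1, y1, X2, y2 = X[:b], y[:b], X[b:], y[b:]
    th1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
    th2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
    th = np.linalg.lstsq(X, y, rcond=None)[0]
    split_fit = np.concatenate([X1 @ th1, X2 @ th2])
    return np.linalg.norm(split_fit - X @ th)

def detect_change(X, y, tau, d):
    """Flag a changepoint if any cut b makes the statistic exceed tau."""
    n = len(y)
    for b in range(d, n - d + 1):   # keep at least d samples per side
        if split_statistic(X, y, b) > tau:
            return b
    return None
```

On detection, the algorithm would reset the arm's LinUCB statistics and restart the interval at the detected time.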
Assumption 3.
(Minimum stationary length and well-conditionedness) There exists some universal constant $c_0 > 0$ such that, for every arm $a$ and two adjacent stationary regions, if we compute the Gram matrices over the two regions, then the following hold:
and such that
Here the Gram matrix over an interval is defined as $G = \sum_{t} x_{a,t} x_{a,t}^\top$, summing over the sampled rounds in that interval.
Notice that Assumption 3 is checkable for the proposed algorithm. Moreover, Proposition 1 below shows that Assumption 3 is valid under many circumstances.
Proposition 1.
Our proposed MultiscaleLinUCB algorithm is formally presented in Algorithm 2. Our analysis is only valid when the preselected rounds can be considered fixed (or predetermined independently of the data). In practice, we combine these rounds with those sampled from the LinUCB steps for the changepoint detection steps. We would like to clarify that Assumptions 2 and 3 are only needed in the theoretical analysis. In practice, our proposed MultiscaleLinUCB can achieve significantly better experimental results even if these two assumptions do not hold in some settings, as shown in Section 5.
3 Analysis
If Algorithm 1 could detect every changepoint exactly, then we could simply restart the LinUCB algorithm at the beginning of every stationary period to achieve a regret upper bound of order $\Gamma_T \cdot R_{S_{\max}}$, where $S_{\max}$ is defined to be the length of the longest stationary period and $R_{S_{\max}}$ is the regret of LinUCB in a stationary period of length $S_{\max}$. However, every changepoint detection method incurs false alarms and detection delays. Assume we are at time $s$: a false alarm means that even though there is no changepoint in the interval $[c, s]$, the algorithm alarms us that there is a changepoint at time $s$. For a changepoint $c_j$, if the algorithm alarms us at time $s > c_j$, then the detection delay is defined to be $s - c_j$.
The following lemma controls the probability of missed changepoints when the sampled stationary regions satisfy our Assumption 3 and Equation 6.
Lemma 1.
Consider all adjacent sampled regions that satisfy:
Assumption 3 holds;
For an arm $a$ and two adjacent stationary regions, where the parameter changes from $\theta$ to $\theta'$, if we compute the test statistic over these regions, we have
(6) 
where $C$ is a constant depending only on the problem parameters, the same constant as in Assumption 2. Then there exists a constant dependent on the input such that, if we run Algorithm 2, we can detect all such changepoints with high probability.
Proof Sketch.
For ease of notation, write $X_1$ and $X_2$ for the design matrices of the two sampled regions and $X$ for their stacked matrix. Note that the split fit is the projection of $y$ onto the column space of the block-diagonal matrix whose two blocks are $X_1$ and $X_2$ respectively. We denote this projection as $P_1$ and let $P_0$ be the projection onto the column space of $X$. Denoting by $\epsilon$ the zero-mean sub-Gaussian noise vector, we have $y = \mu + \epsilon$, where we have decomposed $y$ into its mean $\mu$ and the noise term $\epsilon$. For the first term, by the properties of projection matrices and Assumption 3, one can verify that it is bounded below by the signal level in condition (6). For the noise term, by idempotency, we know $\|(P_1 - P_0)\epsilon\|^2 = \epsilon^\top (P_1 - P_0)\epsilon$. By the Hanson-Wright inequality [16], the noise term concentrates around its expectation. So we detect the changepoint with high probability as long as the signal term dominates the noise term, for some constant depending only on the problem parameters. We can set the threshold accordingly and apply the union bound to obtain the desired result. ∎
Condition (2) in Lemma 1 holds when Assumption 2 holds with , so we can bound detection delay in Lemma 2.
Lemma 2.
The following lemma bounds the false alarm probability.
Lemma 3.
Proof Sketch.
We can apply much of the same reasoning as in the previous proof. Notice that within a stationary region, there is a single vector $\theta$ such that $\mu = X\theta$, so $(P_1 - P_0)\mu = 0$. As before, by the Hanson-Wright inequality, the noise term concentrates, so with high probability we do not detect a changepoint for a single selection of cut point within a stationary region. Furthermore, we can apply this result uniformly over the selections of arm and cut point within stationary regions with the union bound, and arrive at our conclusion by setting the threshold accordingly. ∎
4 Extensions
4.1 Nonstationary Joint Linear Models
In addition to disjoint linear models, Chu et al. [12] also proposed a contextual framework with a joint linear model. We consider the extension of MultiscaleLinUCB to joint linear models below. This model is also consistent with the one considered by Russac et al. [21] and Cheung et al. [11].
There are still $\Gamma_T$ changepoints in total, denoted $c_1 < \cdots < c_{\Gamma_T}$, where $c_0 = 0$ and $c_{\Gamma_T + 1} = T$. However, the changepoints and the model parameter are now invariant to the arms. In the $j$th stationary period $[c_{j-1}, c_j)$, each arm is associated with the same model parameter $\theta_j$.
The analog of the MultiscaleLinUCB algorithm for the joint linear model is essentially the same. However, now we only need to randomly preselect a single set of rounds in total, denoted $\mathcal{T}$. For a cut point $b$, we similarly define the intervals $I_1$ and $I_2$, the rewards $y_1$ and $y_2$, and the design matrices $X_1$, $X_2$, and $X$. We assert there is a changepoint if there exists a cut point $b$ such that the test statistic exceeds the threshold.
Here the OLS estimators are the same as those defined in Equations 2, 3, and 4. For the joint linear model, we still have similar regret bounds.
Theorem 2.
Consider adjacent stationary periods of length $\ell_1$ and $\ell_2$ respectively, where the true parameter changes from $\theta$ to $\theta'$. In these two stationary periods, there are two preselected sets of rounds used for changepoint detection only. If there exists a constant depending only on the constant in Assumption 3 such that
Under Assumptions 1 and 3, the regret of MultiscaleLinUCB for joint linear models satisfies the following bound:
4.2 Nonstationary Multiarmed Bandit (MAB)
There is a large literature on nonstationary multi-armed bandit problems [14, 19, 3, 4, 5, 8, 7]. Most of the notation remains the same as in Section 2.1. However, there is no regression model parameter now. In the MAB setting, the algorithm can be simplified considerably. We do not need to randomly preselect rounds for changepoint detection. Instead, at time $t$, we randomly select each arm with probability $p$, and pull the arm with the maximum UCB index with probability $1 - Kp$. Define the most recent detected changepoint $c$ and the current time $s$ as in Section 2.3.
We calculate the test statistic $S_b$ as follows:
$S_b = \sqrt{\dfrac{(b - c)(s - b)}{s - c}} \, \big| \hat{\mu}_{[c,b]} - \hat{\mu}_{(b,s]} \big|,$  (9)
where $\hat{\mu}_I$ denotes the sample mean of the rewards of arm $a$ observed on the interval $I$. If there exists a cut point $b$ such that $S_b > \tau$, then we reset the most recent changepoint to the current time $s$, and we also reset the UCB index for arm $a$. Otherwise, we assert there is no changepoint in the interval $[c, s]$ and keep running UCB.
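In the MAB extension, the test therefore reduces to comparing sample means on the two sides of a cut point. The sketch below uses a standard two-sample scaling by the harmonic-mean sample size; the exact statistic and threshold used in the paper may differ.

```python
import numpy as np

def mean_shift_statistic(r, b):
    """Scaled |mean difference| between r[:b] and r[b:].

    The sqrt(b*(n-b)/n) factor is the square root of the harmonic-mean
    sample size, the usual scaling in two-sample mean-shift tests.
    """
    n = len(r)
    m1, m2 = np.mean(r[:b]), np.mean(r[b:])
    scale = np.sqrt(b * (n - b) / n)
    return scale * abs(m1 - m2)

def detect_mab_change(r, tau):
    """Return the first cut b whose statistic exceeds tau, else None."""
    n = len(r)
    for b in range(1, n):
        if mean_shift_statistic(r, b) > tau:
            return b
    return None
```

On detection, one would reset the arm's empirical mean and pull count, then resume UCB from the detected time.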
Define $\delta^{(a)}_j$ to be the change in mean reward of arm $a$ at its $j$th changepoint. Without loss of generality, we can assume the mean rewards are bounded by some constant for all arms and rounds. We provide an analog of Assumption 2 and a regret analysis in the MAB setting.
Assumption 4.
(Detectability) Let $\ell_1$ and $\ell_2$ be the lengths of two adjacent stationary periods for arm $a$. For all changepoints, we assume there exists a constant $C$ such that
Theorem 3.
Remark 1.
5 Experimental Results
In Algorithm 1, although the algorithm breaks at a cut point $b$ of the interval $[c, s]$, the returned changepoint is the current time $s$. However, we found that reusing the information gathered during $(b, s]$ is helpful for reducing cumulative regret. Therefore, in the experiments below, we use the cut points $b$ as the detected changepoints instead of $s$. We compare our algorithm with state-of-the-art algorithms including Sliding Window LinUCB (SWLinUCB) [11], DLinUCB [21], and LinUCB [12]. We omit the comparison with Dynamic Linear UCB (dLinUCB) here, since Russac et al. [21] showed in their experiments that dLinUCB performs much worse than DLinUCB, and even worse than LinUCB in many simulations, which is also the case in our experiments.
For MultiscaleLinUCB, although the threshold input needs to be chosen based on the horizon in order to achieve the regret bound in our analysis, we found that in most experiments a fixed constant threshold is enough. Both SWLinUCB [11] and DLinUCB [21] need to know $B_T$, an upper bound on $\sum_{t=1}^{T-1} \|\theta_t - \theta_{t+1}\|$. Here $\theta_t$ is the true model parameter at time $t$. However, in practice, it is often the case that $B_T$ is unknown. The authors of SWLinUCB [11] suggest a default choice when $B_T$ is unknown, so we use that default in the comparisons.
All the experiments shown here are for the nonstationary contextual bandit with the joint linear model, since SWLinUCB and DLinUCB are proposed for joint linear models. For all experiments, we fix the problem size and draw the sample reward from a Gaussian distribution centered at the mean reward of arm $a$ at time $t$. Feature vectors are drawn randomly from a Gaussian distribution. In each stationary period, the true model parameter is drawn from a Gaussian distribution, except in the first scenario. We repeat the experiments and plot the average regret. We demonstrate the success of MultiscaleLinUCB under the scenarios below. Note that for both Scenarios 1 and 2, if you zoom in on the plots, you will find that the regret of MultiscaleLinUCB accumulates at a faster rate at changepoints. Immediately after the changepoints, the regret accumulates much more slowly, which shows that our algorithm captures the change very quickly and adapts well to the changing environment. Details can be found in Figure 2.
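The experimental setup described above can be sketched as a small simulator for the joint linear model: the shared parameter is redrawn at each changepoint and features are Gaussian, normalized to unit norm. The normalization, seeding, and constants here are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def simulate_env(T, K, d, changepoints, rng=None):
    """Generate a piecewise-stationary joint linear bandit instance.

    The shared parameter theta is redrawn at every changepoint; feature
    vectors are i.i.d. Gaussian, normalized to unit norm. Returns the
    per-round features and the mean reward of every arm.
    """
    rng = np.random.default_rng(rng)
    theta = rng.normal(size=d)
    cps = set(changepoints)
    features, means = [], []
    for t in range(T):
        if t in cps:                         # abrupt change: redraw theta
            theta = rng.normal(size=d)
        X = rng.normal(size=(K, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        features.append(X)
        means.append(X @ theta)              # E[r_{a,t}] = x_{a,t}^T theta
    return features, means
```

A sample reward for pulling arm `a` at round `t` would then be drawn as `rng.normal(means[t][a], 1.0)`.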

Scenario 1 (Detectable environments): This is a setting similar to [21]: the parameter switches among four fixed values over four consecutive stationary periods. From the plot in Figure 2, we can see that LinUCB cannot adapt to an abruptly changing environment. SWLinUCB and DLinUCB present similar behavior when there is an abrupt changepoint: both algorithms incur fairly large regret for some rounds right after the changepoint. However, MultiscaleLinUCB adapts to the change faster and therefore achieves smaller regret.

Scenario 2 (High dimensions): There are several changepoints in total, evenly spread over the whole time horizon, and the feature dimension is large. In the experiments of [21], it was shown that DLinUCB can perform well under high dimensions. We can see from Figure 2 that MultiscaleLinUCB adapts to changes much faster and performs much better than all other algorithms under high dimensions.

Scenario 3 (Random changepoints): At each time $t$, the parameter changes with a small probability. Although we require each stationary period to be long enough in Assumptions 2 and 3, we show here that even when the changepoints are randomly distributed over the whole time horizon, so that Assumptions 2 and 3 could be violated, MultiscaleLinUCB still performs quite well.

Scenario 4 (Multiple arms): At each time $t$, the parameter changes with a small probability, and the number of arms is larger. We show with this scenario that MultiscaleLinUCB works well with multiple arms. We found that the regret of every algorithm roughly scales linearly with the number of arms, although the regret analysis of DLinUCB and SWLinUCB shows that their regret upper bounds are invariant to the number of arms [21].
6 Conclusion
We proposed a multiscale changepoint detection based LinUCB algorithm, MultiscaleLinUCB, for the nonstationary stochastic disjoint linear bandit setting. We also extended it to the nonstationary joint linear bandit setting and the MAB setting. The regret of our proposed algorithm matches the lower bound up to a logarithmic factor. In particular, our algorithm can also deal with faint changes in mean reward. Experimental results show that our proposed algorithm significantly outperforms other state-of-the-art algorithms in nonstationary environments.
References
 [1] (2003) Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37 (4), pp. 263–293. Cited by: §1.

 [2] (2013) Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. Cited by: §1.
 [3] (2015) Exp3 with drift detection for the switching bandit problem. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–7. Cited by: §4.2.
 [4] (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §C.4, §4.2.
 [5] (2018) Adaptively tracking the best arm with an unknown number of distribution changes. In 14th European Workshop on Reinforcement Learning, Cited by: §1, §4.2.
 [6] (2002) Using confidence bounds for exploitationexploration tradeoffs. Journal of Machine Learning Research 3 (Nov), pp. 397–422. Cited by: §1.
 [7] (2014) Stochastic multiarmedbandit problem with nonstationary rewards. In Advances in neural information processing systems, pp. 199–207. Cited by: §C.4, §1, §4.2, Remark 1.
 [8] (2018) Nearly optimal adaptive procedure with change detection for piecewisestationary bandit. arXiv preprint arXiv:1802.03692. Cited by: §C.4, §1, §1, §4.2.
 [9] (2019) A new algorithm for nonstationary contextual bandits: efficient, optimal, and parameterfree. arXiv preprint arXiv:1902.00980. Cited by: §1.
 [10] (2018) Hedging the drift: learning to optimize under nonstationarity. Available at SSRN 3261050. Cited by: §1.
 [11] (2018) Learning to optimize under nonstationarity. arXiv preprint arXiv:1810.03024. Cited by: §1, §5, §5.

[12]
(2011)
Contextual bandits with linear payoff functions.
In
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics
, pp. 208–214. Cited by: §1, §5.  [13] (2008) On upperconfidence bound policies for nonstationary bandit problems. arXiv preprint arXiv:0805.3415. Cited by: §3.
 [14] (2011) On upperconfidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pp. 174–188. Cited by: §C.4, §1, §1, §4.2, Remark 1.
 [15] (1971) Inference about the changepoint from cumulative sum tests. Biometrika 58 (3), pp. 509–523. Cited by: §1.
 [16] (2012) A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability 17. Cited by: §A.1, §3.
 [17] (2010) A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §A.3, §1.
 [18] (2018) Information directed sampling for stochastic bandits with graph feedback. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §1.
 [19] (2017) A changedetection based framework for piecewisestationary multiarmed bandit problem. arXiv preprint arXiv:1711.03539. Cited by: §C.4, §C.4, §C.4, §1, §1, §2.2, §4.2, Remark 1.
 [20] (2017) Efficient contextual bandits in nonstationary worlds. arXiv preprint arXiv:1708.01799. Cited by: §1.
 [21] (2019) Weighted linear bandits for nonstationary environments. In Advances in Neural Information Processing Systems, pp. 12017–12026. Cited by: §1, §1, item 1, item 2, item 4, §5, §5.
 [22] (2017) Customer acquisition via display advertising using multiarmed bandit experiments. Marketing Science 36 (4), pp. 500–522. Cited by: §1.
 [23] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §1.

[24]
(2010)
Introduction to the nonasymptotic analysis of random matrices
. arXiv preprint arXiv:1011.3027. Cited by: Appendix B.  [25] (2018) Learning contextual bandits in a nonstationary environment. arXiv preprint arXiv:1805.09365. Cited by: §1, §1, §2.2, §3.
 [26] (2009) Piecewisestationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1177–1184. Cited by: §2.2.
Appendix A Proofs for Nonstationary Contextual Bandit Setting
a.1 Proof of Lemma 1
Proof.
For ease of notation, consider the test statistic defined in (5).
By standard OLS theory, the vector is the projection of onto the column space of the following matrix,
Let us call this projection $P_1$, and let $P_0$ be the projection onto the column space of $X$. Then
Notice that these column spaces are nested, so that $P_1 - P_0$ is a projection onto a subspace orthogonal to the column space of $X$. Let $y = \mu + \epsilon$ for a zero-mean sub-Gaussian(1) noise vector $\epsilon$. By the triangle inequality, we have that
Let us begin by lower bounding the first term on the RHS. Notice that for any vector we have that
since . Let
where and denote . Hence,
Because, the projections are into nested subspaces, and the vector in question is within the outer subspace, we have that
The first term can be written as,
The second term can be written as,
Notice that
Thus,
Notice that and
Moreover, and