Near-optimal Oracle-efficient Algorithms for Stationary and Non-Stationary Stochastic Linear Bandits

by   Baekjin Kim, et al.
University of Michigan

We investigate the design of two algorithms that enjoy not only the computational efficiency induced by Hannan's perturbation approach, but also minimax-optimal regret bounds in linear bandit problems where the learner has access to an offline optimization oracle. We present an algorithm called Follow-The-Gaussian-Perturbed Leader (FTGPL) for stationary linear bandits, where each action is associated with a d-dimensional feature vector, and prove that FTGPL (1) achieves the minimax-optimal Õ(d√(T)) regret, (2) matches the empirical performance of Linear Thompson Sampling, and (3) can be efficiently implemented even in the case of infinitely many actions, thus achieving the best of three worlds. Furthermore, it resolves an open problem raised in Abeille et al. [2017]: which perturbation achieves minimax optimality in Linear Thompson Sampling? A weighted variant with exponential discounting, Discounted Follow-The-Gaussian-Perturbed Leader (D-FTGPL), is proposed to gracefully adjust to non-stationary environments where the unknown parameter is time-varying within total variation B_T. It asymptotically achieves the optimal dynamic regret Õ(d^{2/3} B_T^{1/3} T^{2/3}) and is oracle-efficient thanks to the access to an offline optimization oracle enabled by Gaussian perturbation.





1 Introduction

A multi-armed bandit is the simplest model of decision making that involves the exploration versus exploitation trade-off [Lai and Robbins, 1985]. Linear bandits are an extension of multi-armed bandits in which the reward has a linear structure over a finite-dimensional feature vector associated with each arm [Abe et al., 2003, Dani et al., 2008]. Two standard exploration strategies in stochastic linear bandits are the Upper Confidence Bound algorithm (LinUCB) [Abbasi-Yadkori et al., 2011] and Linear Thompson Sampling [Agrawal and Goyal, 2013]. The former relies on optimism in the face of uncertainty and is a deterministic algorithm built upon the construction of a high-probability confidence ellipsoid for the unknown parameter vector. The latter is a Bayesian solution that maximizes the expected reward according to a parameter sampled from the posterior distribution.

Chapelle and Li [2011] showed that Linear Thompson Sampling empirically performs better and is more robust to corrupted or delayed feedback than LinUCB. From a theoretical perspective, however, it enjoys a regret bound that is a factor of √d worse than the minimax-optimal regret bound that LinUCB enjoys. The minimax optimality of optimism comes at a cost, though: implementing UCB-type algorithms can lead to NP-hard optimization problems even for convex action sets [Agrawal, 2019].

Random perturbation methods were originally proposed in the 1950s by Hannan [1957] in the full information setting where losses of all actions are observed. Kalai and Vempala [2005] showed that Hannan’s perturbation approach leads to efficient algorithms by making repeated calls to an offline optimization oracle. They also gave a new name to this family of randomized algorithms: Follow the Perturbed Leader (FTPL). Recent work [Abernethy et al., 2014, 2015, Kim and Tewari, 2019] has studied the relationship between FTPL algorithms and Follow the Regularized Leader (FTRL) algorithms and also investigated whether FTPL algorithms achieve minimax optimal regret in both full and partial information settings.

Abeille et al. [2017] viewed Linear Thompson Sampling as a perturbation-based algorithm, characterized a family of perturbations whose regrets can be analyzed, and raised an open problem: find a minimax-optimal perturbation. In addition to its significant role in smartly balancing exploration with exploitation, a perturbation-based approach to linear bandits also reduces the problem to one call to the offline optimization oracle at each round. Recent works [Kveton et al., 2019a, b] have proposed randomized algorithms that use perturbation as a means to achieve oracle-efficient computation as well as better theoretical guarantees than Linear Thompson Sampling, but there is still a gap between their regret bounds and the lower bound of Ω(d√T). This gap is logarithmic in the number of actions, which can introduce extra dependence on the dimension for large or infinite action spaces.

Linear bandit problems were originally motivated by applications such as online ad placement, with features extracted from the ads and website users. However, users' preferences often evolve with time, which leads to interest in the non-stationary variant of linear bandits. Accordingly, adaptive algorithms that accommodate time-variation of environments have been studied in a rich line of works on both multi-armed bandits [Besbes et al., 2014] and linear bandits. SW-LinUCB [Cheung et al., 2019] and D-LinUCB [Russac et al., 2019] were constructed on the basis of the optimism principle using a sliding window and exponential discounting weights, respectively. Luo et al. [2017] and Chen et al. [2019] studied fully adaptive and oracle-efficient algorithms assuming access to an optimization oracle. It is still an open problem to design a practically simple, oracle-efficient and statistically optimal algorithm for non-stationary linear bandits.


We design and analyze two algorithms that enjoy not only computational efficiency (assuming access to an offline optimization oracle), but also statistical optimality (in terms of regret) in linear bandit problems.

In Section 2, we consider a stationary environment and present an algorithm called Follow-The-Gaussian-Perturbed Leader (FTGPL) that (1) achieves the minimax-optimal regret, (2) matches the empirical performance of Linear Thompson Sampling, and (3) can be efficiently implemented given oracle access to the offline optimization problem. Furthermore, it solves an open problem raised in Abeille et al. [2017], namely finding a randomized algorithm that achieves minimax-optimal frequentist regret in stochastic linear bandits.

In Section 3, we study the non-stationary setting and propose a weighted variant of FTGPL with exponential discounting, called Discounted Follow-The-Gaussian-Perturbed Leader (D-FTGPL). It gracefully adjusts to the total time-variation of the true parameter so that it enjoys not only a regret bound comparable to those of SW-LinUCB and D-LinUCB, but also computational efficiency due to sole reliance on an offline optimization oracle for the action set.

2 Stationary Stochastic Linear Bandit

2.1 Preliminary

In stationary stochastic linear bandit, a learner chooses an action x_t from a given action set X_t ⊂ ℝ^d in every round t, and subsequently observes the reward y_t = ⟨x_t, θ*⟩ + η_t, where θ* ∈ ℝ^d is an unknown parameter and η_t is a conditionally 1-sub-Gaussian random variable. For simplicity, assume that ‖x‖₂ ≤ 1 for all x ∈ X_t, ‖θ*‖₂ ≤ 1, and thus ⟨x, θ*⟩ ∈ [−1, 1].

As a measure of evaluating a learner, the regret is defined as the difference between the rewards the learner would have received had he played the best action in hindsight and the rewards he actually received. Therefore, minimizing the regret is equivalent to maximizing the expected cumulative reward. Denote the best action in round t as x_t* = argmax_{x ∈ X_t} ⟨x, θ*⟩ and the expected regret as E[R_T] = E[∑_{t=1}^T ⟨x_t* − x_t, θ*⟩].

To learn the unknown parameter θ* from the history up to time t, H_t = {(x_s, y_s)}_{s=1}^{t−1}, algorithms heavily rely on the λ-regularized least-squares estimate of θ*, θ̂_t = V_t^{−1} ∑_{s=1}^{t−1} x_s y_s, and the confidence ellipsoid C_t = {θ ∈ ℝ^d : ‖θ − θ̂_t‖_{V_t} ≤ β_t} centered at θ̂_t. Here V_t = λI_d + ∑_{s=1}^{t−1} x_s x_sᵀ, and ‖θ‖_V = √(θᵀVθ).
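As a concrete illustration, the regularized least-squares quantities above can be computed in a few lines. This is a minimal numpy sketch under the stated assumptions (variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam = 5, 200, 1.0

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)        # enforce ||theta*||_2 <= 1

X = rng.normal(size=(t, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_s||_2 <= 1
y = X @ theta_star + rng.normal(size=t)         # 1-sub-Gaussian (here Gaussian) noise

V = lam * np.eye(d) + X.T @ X                   # V_t = lam*I_d + sum_s x_s x_s^T
theta_hat = np.linalg.solve(V, X.T @ y)         # lambda-regularized least squares
# Confidence widths ||x||_{V^{-1}} for every action, as used by the ellipsoid C_t.
width = np.sqrt(np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X))
```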

2.2 Open Problem : The Best of Three Worlds

The standard solutions in stationary stochastic linear bandit are the optimism-based algorithm (LinUCB, Abbasi-Yadkori et al. [2011]) and Linear Thompson Sampling (LinTS, Abeille et al. [2017]). While the former obtains the theoretically optimal regret bound Õ(d√T), matching the lower bound Ω(d√T), the latter empirically performs better in spite of a regret bound a factor √d worse than LinUCB's [Chapelle and Li, 2011]. Some recent works also proposed a series of randomized algorithms for (generalized) linear bandits: PHE [Kveton et al., 2019a], FPL-GLM [Kveton et al., 2019b], and RandUCB [Vaswani et al., 2019]. They are categorized in terms of regret bounds, randomness, and oracle access in Table 1.

Algorithm | Regret            | Random | Oracle
LinUCB    | Õ(d√T)            | No     | No
LinTS     | Õ(d√(dT))         | Yes    | Yes
PHE       | Õ(d√(T log K))    | Yes    | Yes
RandUCB   | Õ(d√T)            | Yes    | No
Table 1: Stationary Stochastic Linear Bandit

The PHE and FPL-GLM algorithms allow efficient implementation in that they choose an action by maximizing the expected reward after perturbing historical rewards via Binomial and Gaussian distributions, respectively. However, they are limited in that their regret bounds, Õ(d√(T log K)), depend on the number of arms K, and no theoretical guarantee is available when the action set is infinite.
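For intuition, a perturbed-history estimate of the FPL-GLM flavor can be sketched as follows, assuming additive Gaussian pseudo-noise on each historical reward (a simplified reading for illustration, not the authors' exact implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, lam, a = 4, 100, 1.0, 1.0

X = rng.normal(size=(t, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # feature vectors, ||x_s|| <= 1
y = rng.normal(size=t)                          # stand-in historical rewards

V = lam * np.eye(d) + X.T @ X
# Perturbed-history estimate: least squares on rewards plus Gaussian pseudo-noise.
z = a * rng.normal(size=t)
theta_tilde = np.linalg.solve(V, X.T @ (y + z))
```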

RandUCB [Vaswani et al., 2019] is a randomized version of LinUCB that samples the confidence level from a certain distribution: it chooses the action maximizing ⟨x, θ̂_t⟩ + Z_t‖x‖_{V_t^{−1}}, where Z_t is a random variable. Like LinUCB, it requires the computation of ‖x‖_{V_t^{−1}} for all actions, so it cannot be efficiently implemented in an infinite-arm setting, although it achieves the theoretically optimal regret bound of LinUCB and matches the empirical performance of LinTS.

It is an open problem to construct algorithms that are random, theoretically optimal in regret bound, and efficient in implementation, thus achieving the best of three worlds.

2.3 Follow The Gaussian Perturbed Leader (FTGPL)

  Initialize V_1 = λI_d and θ̂_1 = 0.
  for t = 1 to T do
     if t is in the initial exploration phase then
        Randomly play x_t ∈ X_t and receive reward y_t
     else
        Sample Z_t ~ N(0, I_d) and set θ̃_t = θ̂_t + a V_t^{−1/2} Z_t
        Oracle: x_t = O(X_t, θ̃_t)
        Play action x_t and receive reward y_t
     end if
     Update V_{t+1} = V_t + x_t x_tᵀ and θ̂_{t+1} = V_{t+1}^{−1} ∑_{s=1}^{t} x_s y_s
  end for
Algorithm 1 Follow The Gaussian Perturbed Leader (FTGPL)
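A minimal simulation sketch of the FTGPL loop, with the oracle instantiated as an arg-max over a finite action matrix; hyperparameters, helper names, and the scheme of sampling N(0, a²V_t^{−1}) via a Cholesky factor are our own illustration:

```python
import numpy as np

def ftgpl(action_sets, reward_fn, a=1.0, lam=1.0, T=500, seed=0):
    """Follow-The-Gaussian-Perturbed-Leader (sketch).

    action_sets(t) -> (K, d) array of feature vectors for round t;
    reward_fn(x)   -> noisy reward for playing action x.
    """
    rng = np.random.default_rng(seed)
    d = action_sets(0).shape[1]
    V = lam * np.eye(d)          # regularized Gram matrix V_t
    b = np.zeros(d)              # sum_s x_s y_s
    for t in range(T):
        X = action_sets(t)
        theta_hat = np.linalg.solve(V, b)
        # Perturbed estimate theta_tilde ~ N(theta_hat, a^2 V^{-1}):
        # L L^T = V^{-1}, so L @ z with z ~ N(0, I) has covariance V^{-1}.
        L = np.linalg.cholesky(np.linalg.inv(V))
        theta_tilde = theta_hat + a * L @ rng.normal(size=d)
        x = X[np.argmax(X @ theta_tilde)]    # offline oracle call O(X_t, theta_tilde)
        y = reward_fn(x)
        V += np.outer(x, x)
        b += y * x
    return np.linalg.solve(V, b)
```

With a well-separated best arm, the returned estimate typically concentrates near the true parameter.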

While the regret analysis for Linear Thompson Sampling [Abeille et al., 2017] is based on an anti-concentration event requiring the perturbed estimate θ̃_t to be spread widely enough around the least-squares estimate θ̂_t in d-dimensional space, the general framework for regret analysis that the PHE, FPL-GLM, and RandUCB algorithms share depends on a one-dimensional anti-concentration event, namely that the perturbed expected reward of the best action is large enough relative to its expected reward,

P(⟨x_t*, θ̃_t⟩ > ⟨x_t*, θ*⟩ | H_t) ≥ p,

for a certain p > 0. This dimension reduction in the perturbation space, from ℝ^d to ℝ, is the underlying intuition for how the extra √d gap in regret is eliminated.

Define the concentration event, requiring the perturbed expected reward to be close enough to the expected reward for all x ∈ X_t,

⟨x, θ̃_t⟩ − ⟨x, θ̂_t⟩ ≤ c‖x‖_{V_t^{−1}} for all x ∈ X_t,

for a certain c > 0. In PHE and FPL-GLM, the perturbation is injected through the history, so K different inequalities must hold simultaneously for this event to hold, which adds an extra √(log K) term to the regret bound. RandUCB does not incur this extra term, since the event reduces to a single inequality independent of the actions by its construction of the perturbed reward ⟨x, θ̂_t⟩ + Z_t‖x‖_{V_t^{−1}}.


A natural question arises: which form of perturbation enables us to have both an efficient implementation and an optimal regret bound? Our answer is the perturbed reward ⟨x, θ̃_t⟩ with θ̃_t = θ̂_t + a V_t^{−1/2} Z_t, where Z_t ~ N(0, I_d). The motivation is that RandUCB and the FTGPL algorithm are approximately equivalent once the corresponding perturbations are Gaussian, due to the linear invariance property of the Gaussian distribution,

⟨x, θ̃_t⟩ = ⟨x, θ̂_t⟩ + a‖x‖_{V_t^{−1}} Z, where Z ~ N(0, 1).
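This linear-invariance identity is easy to verify numerically: projecting a V^{−1/2} Z onto any x yields a one-dimensional Gaussian with standard deviation a‖x‖_{V^{−1}}. A quick Monte Carlo check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, a = 3, 2.0
A = rng.normal(size=(d, d))
V = A @ A.T + np.eye(d)            # a generic positive-definite Gram matrix
x = rng.normal(size=d)

Vinv = np.linalg.inv(V)
L = np.linalg.cholesky(Vinv)       # L @ L.T = V^{-1}; L @ z ~ N(0, V^{-1}) for z ~ N(0, I)
Z = rng.normal(size=(d, 100_000))
samples = a * (x @ L) @ Z          # draws of <x, a * V^{-1/2} Z> (in distribution)
target_std = a * np.sqrt(x @ Vinv @ x)   # a * ||x||_{V^{-1}}
```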

Oracle point of view

We assume that our learner has access to an algorithm that returns a near-optimal solution to the offline problem, called an offline optimization oracle. It returns the optimal action that maximizes the expected reward from a given action space when a parameter is given as input.

Definition 1 (Offline Optimization Oracle).

There exists an algorithm, O, which, when given a pair of an action space X ⊂ ℝ^d and a parameter θ ∈ ℝ^d, computes O(X, θ) = argmax_{x ∈ X} ⟨x, θ⟩.

In comparison with the LinUCB and RandUCB algorithms, which must compute the norms ‖x‖_{V_t^{−1}} of all actions in every round, the main advantage of the FTGPL algorithm is that it relies on a single call to an offline optimization oracle in every round, so the optimal action can be obtained efficiently, in polynomial time, even from an infinite action set X, and its regret bound is also independent of the size of the action set, K.
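For example, for a finite action set the oracle is a single arg-max, and for the Euclidean unit ball it even admits a closed form. A small sketch (our own illustration, not the paper's code):

```python
import numpy as np

def oracle_finite(X, theta):
    """O(X, theta) = argmax_{x in X} <x, theta> for a finite action matrix X (K, d)."""
    return X[np.argmax(X @ theta)]

def oracle_unit_ball(theta):
    """For the Euclidean unit ball, the maximizer of <x, theta> is theta / ||theta||."""
    n = np.linalg.norm(theta)
    return theta / n if n > 0 else theta

theta = np.array([3.0, 4.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
```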

Perspective of Linear Thompson Sampling

Another interpretation comes from the perspective of Linear Thompson Sampling [Abeille et al., 2017], where the perturbed estimate is defined as

θ̃_t = θ̂_t + a V_t^{−1/2} Z_t, where Z_t ~ N(0, I_d).

This is equivalent to Linear Thompson Sampling where the randomly sampled vector is θ̃_t ~ N(θ̂_t, a² V_t^{−1}). Linear Thompson Sampling generally performs well empirically, but the existence of a minimax-optimal perturbation for it has been open [Abeille et al., 2017], and we provide the FTGPL algorithm as an answer to this open problem.

2.4 Analysis

2.4.1 General Framework for Regret Analysis

We construct a general regret bound for linear bandit algorithms on top of the prior work of Kveton et al. [2019a]. The difference from their work is that the action sets can vary over time and can contain infinitely many arms. First, we define three events: the confidence-ellipsoid event E1 (the true parameter lies in C_t), the concentration event E2, and the anti-concentration event E3 introduced above.

The choice of the perturbation is made by algorithmic design, which determines both E2 and E3 simultaneously. At time t, we consider the general algorithm that maximizes the perturbed expected reward ⟨x, θ̃_t⟩ over the action space X_t. The following theorem and lemma are simple extensions of Theorem 1 and Lemma 2 from the work of Kveton et al. [2019a].

Theorem 2 (Expected Regret).

Assume we have β, γ, p > 0 satisfying P(E1) ≥ 1 − δ, P(E2 | H_t) ≥ 1 − δ′, and P(E3 | H_t) ≥ p for any history H_t. Let A be an algorithm that chooses the arm x_t = argmax_{x ∈ X_t} ⟨x, θ̃_t⟩ at time t. Then the expected regret of A is bounded as

E[R_T] ≤ (β + γ)(1 + 2/p) E[∑_{t=1}^T ‖x_t‖_{V_t^{−1}}] + 2T(δ + δ′).

We defer the proof of Theorem 2 to Appendix B.1, since a similar regret bound in the non-stationary setting will be carefully studied in Theorem 12. The latter is a more general regret analysis, in that the non-stationary setting contains the stationary setting as the special case where the total variation of the parameters, B_T, is zero.

2.4.2 Expected Regret of FTGPL

In the FTGPL algorithm we set θ̃_t = θ̂_t + a V_t^{−1/2} Z_t, where Z_t ~ N(0, I_d). At each time t, it chooses the action x_t = O(X_t, θ̃_t) via the oracle property.

Corollary 3 (Expected Regret of FTGPL).

Assume the conditions ‖θ*‖₂ ≤ 1 and ‖x‖₂ ≤ 1 for all x ∈ X_t. Let A be a Follow-The-Gaussian-Perturbed-Leader algorithm with an appropriately chosen scale a. Then the expected regret of A is bounded by Õ(d√T), where logarithmic terms in T and constants are ignored in Õ.

The optimal choice of a heavily depends on the following three lemmas, which control the probabilities of the three events E1, E2, and E3. The main contribution over previous works [Kveton et al., 2019a, b] is that the parameter a is chosen independently of the number of actions, thanks to the invariance of Gaussian distributions under linear combinations. Thus, our analysis is novel in that it extends to the infinite-arm setting as well as the contextual bandit setting.

Compared to standard solutions in stochastic linear bandits, the FTGPL algorithm (Algorithm 1) is not only theoretically guaranteed a regret bound comparable to that of LinUCB [Abbasi-Yadkori et al., 2011], but also shows empirical performance and computational efficiency on par with the Linear Thompson Sampling algorithm [Agrawal and Goyal, 2013, Abeille et al., 2017], thus achieving the best of both worlds.

Lemma 4 (Least-Squares Confidence Ellipsoid, Theorem 2 [Abbasi-Yadkori et al., 2011]).

For any δ ∈ (0, 1), let β_t(δ) = √λ + √(2 log(1/δ) + d log(1 + t/(λd))). Then the event E1 = {‖θ̂_t − θ*‖_{V_t} ≤ β_t(δ) for all t ≥ 1} holds with probability at least 1 − δ.

Lemma 5 (Concentration).

Given the history H_t, suppose θ̃_t = θ̂_t + a V_t^{−1/2} Z_t, where Z_t ~ N(0, I_d) and a > 0. Then, for any fixed x ∈ X_t and any c > 0, P(⟨x, θ̃_t − θ̂_t⟩ ≤ ac‖x‖_{V_t^{−1}} | H_t) ≥ 1 − e^{−c²/2}.


Given the history H_t, the quantity ⟨x, θ̃_t − θ̂_t⟩ is distributed as a‖x‖_{V_t^{−1}} Z, where Z ~ N(0, 1), by the linearity of Gaussian distributions. Recall θ̃_t = θ̂_t + a V_t^{−1/2} Z_t by definition.

The event {⟨x, θ̃_t − θ̂_t⟩ ≤ ac‖x‖_{V_t^{−1}}} therefore reduces to {Z ≤ c}, which becomes independent of the action x after cancelling out a‖x‖_{V_t^{−1}} on both sides of the event condition; the Gaussian tail bound completes the proof. ∎

Lemma 6 (Anti-concentration).

Given H_t, suppose θ̃_t = θ̂_t + a V_t^{−1/2} Z_t, where Z_t ~ N(0, I_d). Then, for any c > 0, P(⟨x_t*, θ̃_t − θ̂_t⟩ ≥ ac‖x_t*‖_{V_t^{−1}} | H_t) ≥ (c/(c² + 1)) · (1/√(2π)) e^{−c²/2}.
Furthermore, once we set a appropriately and c = 1, this probability is an absolute constant p > 0, independent of the number of actions.


By Gaussian linearity, ⟨x_t*, θ̃_t − θ̂_t⟩ is distributed as a‖x_t*‖_{V_t^{−1}} Z, where Z ~ N(0, 1), given the history H_t, so the event reduces to {Z ≥ c}.

The first inequality holds because of the anti-concentration of the Gaussian distribution (Lemma A). ∎
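The Gaussian tail quantities used here are easy to evaluate numerically. The sketch below checks the classical Mills-ratio lower bound φ(z)·z/(z² + 1) ≤ P(Z ≥ z), a standard form of Gaussian anti-concentration (the exact constant used in Lemma A may differ):

```python
import math

def gauss_upper_tail(z):
    """P(Z >= z) for standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def mills_lower(z):
    """Classical lower bound: P(Z >= z) >= phi(z) * z / (z^2 + 1) for z > 0."""
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
    return phi * z / (z * z + 1.0)
```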

3 Non-Stationary Stochastic Linear Bandit

3.1 Preliminary

In each round t, an action set X_t ⊂ ℝ^d is given to the learner and he has to choose an action x_t ∈ X_t. Then, the reward y_t = ⟨x_t, θ_t*⟩ + η_t is revealed to the learner, where θ_t* is an unknown time-varying parameter and η_t is a conditionally 1-sub-Gaussian random variable. Unlike the stationary setting, the non-stationary variant allows the unknown parameter to vary over time, subject to a total-variation budget B_T = ∑_{t=1}^{T−1} ‖θ_{t+1}* − θ_t*‖₂. This is a natural way of quantifying the time-variation of θ_t*, in that it covers both slowly-changing and abruptly-changing environments. As before, assume that ‖x‖₂ ≤ 1 for all x ∈ X_t, ‖θ_t*‖₂ ≤ 1, and thus ⟨x, θ_t*⟩ ∈ [−1, 1].

In a similar way to the stationary setting, denote the best action in round t as x_t* = argmax_{x ∈ X_t} ⟨x, θ_t*⟩ and the expected dynamic regret as E[R_T] = E[∑_{t=1}^T ⟨x_t* − x_t, θ_t*⟩], where x_t is the action chosen at time t. The goal of the learner is to minimize the expected dynamic regret.

3.2 Open Problem

In a stationary stochastic environment where the reward has a linear structure, the Linear Upper Confidence Bound algorithm (LinUCB) follows the principle of optimism in the face of uncertainty. Under this principle, two recent works, Cheung et al. [2019] and Russac et al. [2019], proposed SW-LinUCB and D-LinUCB, non-stationary variants of LinUCB that adapt to the time-variation of θ_t*. They rely on weighted least-squares estimators: the former gives equal weights only to the w most recent observations, where w is the length of a sliding window, and the latter uses exponentially discounted weights.

As described in Table 2, SW-LinUCB and D-LinUCB are deterministic and asymptotically achieve the minimax-optimal dynamic regret Õ(d^{2/3} B_T^{1/3} T^{2/3}), but they share the inefficiency of implementation with the LinUCB and RandUCB algorithms in that the norms of all actions must be computed in every round.

It is still open to find algorithms that are theoretically optimal in regret bound and efficient in implementation even in infinite action space.

Algorithm | Regret                       | Random | Oracle
SW-LinUCB | Õ(d^{2/3} B_T^{1/3} T^{2/3}) | No     | No
D-LinUCB  | Õ(d^{2/3} B_T^{1/3} T^{2/3}) | No     | No
Table 2: Non-Stationary Stochastic Linear Bandit

3.3 Weighted Least-Squares Estimator

We first study the weighted least-squares estimator with discounting factor γ ∈ (0, 1). In round t, the weighted least-squares estimator is obtained in both implicit and closed forms,

θ̂_t = argmin_θ ∑_{s=1}^{t−1} γ^{t−1−s} (y_s − ⟨x_s, θ⟩)² + λ‖θ‖₂² = V_t^{−1} ∑_{s=1}^{t−1} γ^{t−1−s} x_s y_s,

where V_t = λI_d + ∑_{s=1}^{t−1} γ^{t−1−s} x_s x_sᵀ. Additionally, we define Ṽ_t = λI_d + ∑_{s=1}^{t−1} γ^{2(t−1−s)} x_s x_sᵀ; this matrix is closely connected with the covariance matrix of θ̂_t. For simplicity of notation, we define ‖θ‖²_{V_t Ṽ_t^{−1} V_t} = θᵀ V_t Ṽ_t^{−1} V_t θ.
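The discounted quantities θ̂_t, V_t, and Ṽ_t can be computed directly from their definitions; a minimal numpy sketch in our own notation (with γ = 1 recovering the ordinary ridge estimate):

```python
import numpy as np

def discounted_ls(X, y, gamma=0.95, lam=1.0):
    """Weighted least squares with exponential discounting (sketch).

    Returns (theta_hat, V, V_tilde) where
      V       = lam*I + sum_s gamma^(t-1-s)    x_s x_s^T
      V_tilde = lam*I + sum_s gamma^(2(t-1-s)) x_s x_s^T
    """
    t, d = X.shape
    w = gamma ** np.arange(t - 1, -1, -1)           # gamma^(t-1-s); newest weight is 1
    V = lam * np.eye(d) + (X * w[:, None]).T @ X
    V_tilde = lam * np.eye(d) + (X * (w ** 2)[:, None]).T @ X
    theta_hat = np.linalg.solve(V, X.T @ (w * y))   # discounted ridge estimate
    return theta_hat, V, V_tilde
```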

Lemma 7 (Weighted Least-Squares Confidence Ellipsoid, Theorem 1 [Russac et al., 2019]).

Assume the stationary setting where θ_s* = θ* for all s, and thus B_T = 0. For any δ ∈ (0, 1), with probability at least 1 − δ, for all t ≥ 1, ‖θ̂_t − θ*‖_{V_t Ṽ_t^{−1} V_t} ≤ β_t(δ), where β_t(δ) = √λ + √(2 log(1/δ) + d log(1 + (1 − γ^{2(t−1)})/(λd(1 − γ²)))).

While Lemma 7 states that the confidence ellipsoid contains the true parameter with high probability in the stationary setting, the true parameter is not necessarily inside the confidence ellipsoid in the non-stationary setting because of the variation of the parameters. We alternatively define a surrogate parameter θ̄_t, which belongs to the confidence ellipsoid centered at θ̂_t with probability at least 1 − δ, as formally stated in Lemma 9.

3.4 Discounted Follow The Gaussian Perturbed Leader (D-FTGPL)

The idea of perturbing the history using the Gaussian distribution in the FTGPL algorithm (Algorithm 1) can be directly applied to the non-stationary setting by replacing the Gram matrix V_t with its discounted counterpart. Accordingly, we set θ̃_t = θ̂_t + a V_t^{−1} Ṽ_t^{1/2} Z_t, where Z_t ~ N(0, I_d), in the D-FTGPL algorithm (Algorithm 2).


Under optimism in the face of uncertainty, D-LinUCB [Russac et al., 2019] chooses an action by maximizing the upper confidence bound of the expected reward based on θ̂_t, and the confidence level is replaced by a random variable in the non-stationary variant of the RandUCB algorithm [Vaswani et al., 2019], with a comparable theoretical guarantee.

The D-FTGPL algorithm is motivated by two observations. First, it is approximately equivalent to the non-stationary variant of the RandUCB algorithm, since

⟨x, θ̃_t⟩ = ⟨x, θ̂_t⟩ + a‖x‖_{V_t^{−1} Ṽ_t V_t^{−1}} Z, where Z ~ N(0, 1).

Second, its innate perturbation allows the algorithm to rely on an arg-max oracle (O), unlike D-LinUCB and SW-LinUCB. Therefore, the D-FTGPL algorithm can be computationally efficient even when an infinite action set is considered.

  Input: λ > 0, a > 0, and γ ∈ (0, 1)
  Initialize V_1 = Ṽ_1 = λI_d and θ̂_1 = 0.
  for t = 1 to T do
     if t is in the initial exploration phase then
        Randomly play x_t ∈ X_t and receive reward y_t
     else
        Sample Z_t ~ N(0, I_d) and set θ̃_t = θ̂_t + a V_t^{−1} Ṽ_t^{1/2} Z_t
        Oracle: x_t = O(X_t, θ̃_t)
        Play action x_t and receive reward y_t
     end if
     Update V_{t+1} = γV_t + x_t x_tᵀ + (1 − γ)λI_d, Ṽ_{t+1} = γ²Ṽ_t + x_t x_tᵀ + (1 − γ²)λI_d, and θ̂_{t+1} = V_{t+1}^{−1} ∑_{s=1}^{t} γ^{t−s} x_s y_s
  end for
Algorithm 2 Discounted Follow The Gaussian Perturbed Leader (D-FTGPL)
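A minimal simulation sketch of the D-FTGPL loop using incremental discounted updates; the perturbation direction V_t^{−1} Ṽ_t^{1/2} Z_t and the recursions below are our reading of the algorithm and should be treated as assumptions:

```python
import numpy as np

def d_ftgpl(action_sets, reward_fn, gamma=0.99, a=1.0, lam=1.0, T=300, seed=0):
    """Discounted FTGPL (sketch): discounted statistics + Gaussian perturbation.

    Recursions: V_{t+1} = gamma*V_t + x x^T + (1-gamma)*lam*I keeps the lam*I
    regularization exact; similarly for V_tilde with gamma^2.
    """
    rng = np.random.default_rng(seed)
    d = action_sets(0).shape[1]
    V = lam * np.eye(d)            # discounted Gram matrix V_t
    Vt = lam * np.eye(d)           # squared-discount Gram matrix V_tilde_t
    b = np.zeros(d)                # discounted sum of x_s y_s
    for t in range(T):
        X = action_sets(t)
        theta_hat = np.linalg.solve(V, b)
        # Perturb through V^{-1} Vtilde^{1/2} (assumption): L L^T = Vtilde.
        L = np.linalg.cholesky(Vt)
        theta_tilde = theta_hat + a * np.linalg.solve(V, L @ rng.normal(size=d))
        x = X[np.argmax(X @ theta_tilde)]    # oracle call O(X_t, theta_tilde)
        y = reward_fn(x, t)
        V = gamma * V + np.outer(x, x) + (1 - gamma) * lam * np.eye(d)
        Vt = gamma**2 * Vt + np.outer(x, x) + (1 - gamma**2) * lam * np.eye(d)
        b = gamma * b + y * x
    return np.linalg.solve(V, b)
```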

3.5 Analysis

The framework of regret analysis in Section 2.4 is extended to the non-stationary setting, where the true parameter changes within total variation B_T. The dynamic regret is decomposed into the surrogate regret and a bias arising from the total variation.

3.5.1 Surrogate Instantaneous Regret

To bound the surrogate instantaneous regret ⟨x_t* − x_t, θ̄_t⟩, we define three non-stationary counterparts Ẽ1, Ẽ2, and Ẽ3 of the events in Section 2.4:

Theorem 8.

Assume we have β, γ, p > 0 satisfying P(Ẽ1) ≥ 1 − δ, P(Ẽ2 | H_t) ≥ 1 − δ′, and P(Ẽ3 | H_t) ≥ p for any history H_t. Let A be an algorithm that chooses the arm x_t = argmax_{x ∈ X_t} ⟨x, θ̃_t⟩ at time t. Then the expected surrogate instantaneous regret of A, E[⟨x_t* − x_t, θ̄_t⟩], is bounded by

(β + γ)(1 + 2/p) E[‖x_t‖_{V_t^{−1}}] + 2(δ + δ′).


Given a history H_t for which Ẽ1 holds, let S_t be the set of arms that are under-sampled and worse than x_t* in round t. Among them, let x̃_t be the least uncertain under-sampled arm in round t. By definition, x̃_t ∈ S_t. The set of sufficiently sampled arms is defined as the complement of S_t, and its regret contribution is bounded separately.

The proof is different from that of Theorem 15 in that actions with non-positive gaps can be neglected, since the regret induced by these actions is upper bounded by zero. The first part of the decomposition is a deterministic term, while the second is random because of the innate perturbation in θ̃_t. Thus the surrogate instantaneous regret can be bounded as,

Thus, the expected surrogate instantaneous regret can be bounded as,