An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling

06/07/2020 · by Qin Ding, et al. · University of California, Davis

We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandits, most of them rely on finding the maximum likelihood estimator at each iteration, which requires O(t) time at the t-th iteration and is memory-inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD), so that the per-step time and memory complexity can be reduced to constant with respect to t, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves Õ(√(dT)) regret with a total time complexity that scales linearly in T and d, where T is the total number of rounds and d is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.


1 Introduction

A contextual bandit is a sequential learning problem in which, at each round, the player has to decide which action to take by pulling one of K arms. Before making the decision at each round, the player is given the information of the K arms, represented by d-dimensional feature vectors. Only the rewards of the pulled arms are revealed to the player, and the player may use past observations to estimate the relationship between feature vectors and rewards. However, the reward estimates are biased towards the pulled arms, as the player cannot observe the rewards of unselected arms. The goal of the player is to maximize the cumulative reward, or equivalently to minimize the cumulative regret, across T rounds. Due to this partial-feedback setting, the player faces a dilemma between exploiting, i.e., pulling the best arm according to the current estimates, and exploring uncertain arms to improve the reward estimates. This is the so-called exploration-exploitation trade-off. The contextual bandit problem has substantial applications in recommender systems li2010contextual , clinical trials woodroofe1979one , online advertising schwartz2017customer , etc. It is also a fundamental problem in reinforcement learning sutton1998introduction .

The most classic problem in contextual bandits is the stochastic linear bandit abbasi2011improved ; chu2011contextual , where the expected rewards follow a linear model of the feature vectors and an unknown model parameter θ*. Upper Confidence Bound (UCB) abbasi2011improved ; auer2002finite ; chu2011contextual and Thompson Sampling (TS) thompson1933likelihood ; agrawal2012analysis ; agrawal2013thompson ; chapelle2011empirical are the two most popular families of algorithms for bandit problems. UCB uses an upper confidence bound to estimate the reward optimistically and therefore mixes exploration into exploitation. TS assumes the model parameter follows a prior and uses a random sample from the posterior to estimate the reward model. Despite the popularity of the stochastic linear bandit, the linear model is restrictive in representation power and the assumption of linearity rarely holds in practice. This has led to extensive studies of more complex contextual bandit problems such as the generalized linear bandit (GLB) filippi2010parametric ; jun2017scalable ; li2017provably , where the rewards follow a generalized linear model (GLM). In li2012unbiased , it is shown through extensive experiments that GLB achieves lower regret than the linear bandit.

For most applications of contextual bandits, efficiency is crucial, as decisions need to be made in real time. While GLB can still be solved by UCB or TS, estimating the upper confidence bound or the posterior becomes much more challenging than in the linear case: neither has a closed form in general and both have to be approximated, which usually requires costly operations in online learning. As pointed out by li2017provably , most GLB algorithms suffer from two expensive operations. First, they need to invert a matrix every round, which is time-consuming when d is large. Second, they need to find the maximum likelihood estimator (MLE) by solving an optimization problem over all previous observations at each round. This results in total time that grows quadratically, and memory that grows linearly, in the number of rounds T.

From an optimization perspective, stochastic gradient descent (SGD) hazan2016introduction is a popular algorithm for both convex and non-convex problems, even for complex models like neural networks. Online SGD hazan2016introduction is an efficient optimization algorithm that incrementally updates the estimator with the new observations at each round. Although it is natural to apply online SGD to contextual bandit problems so that the time complexity at the t-th round can be reduced to a constant with respect to t, it has not been used successfully, for the following reasons: 1) the hardness of constructing an unbiased stochastic gradient with controllable variance under the partial-feedback setting of bandit problems, 2) the difficulty of achieving a balance between sufficient exploration and fast convergence to the optimal decision using online SGD alone, and 3) the lack of theoretical guarantees. Previous attempts at online SGD in contextual bandit problems are limited to empirical studies.

bietti2018contextual uses importance weighting and doubly-robust techniques to construct unbiased stochastic gradients with reduced variance. In riquelme2018deep , it is shown that the inherent randomness of SGD does not always offer enough exploration for bandit problems. To the best of our knowledge, there is no existing work that successfully applies online SGD to update the model parameter of a contextual bandit while maintaining a low theoretical regret.

In this work, we study how online SGD can be appropriately applied to GLB problems. To overcome the dilemma between exploration and exploitation, we propose an algorithm that carefully combines online SGD and TS techniques for GLB. The exploration factor in TS is re-calibrated to make up for the gap between the SGD estimator and the MLE. Interestingly, we find that by doing so we can skip the step of inverting matrices. This leads to a total time complexity that scales linearly in T and d, making our proposed algorithm the most efficient GLB algorithm so far. We provide a theoretical guarantee for our algorithm and show that, under mild assumptions, it obtains Õ(√(dT)) regret, where Õ ignores poly-logarithmic factors. This regret upper bound is optimal for finite-arm contextual bandit problems up to logarithmic factors. Moreover, it improves some existing results in filippi2010parametric ; jun2017scalable ; li2017provably by a factor of √d when the number of arms is finite.

Notations: We use θ* to denote the true model parameter. For a vector x, we use ‖x‖ to denote its l2 norm and ‖x‖_A = √(xᵀAx) to denote its weighted norm associated with a positive-definite matrix A. We use λ_min(A) to denote the minimum eigenvalue of a matrix A. We denote by μ̇ the first derivative of a function μ. Finally, we use ⌊a⌋ to denote the largest integer that is at most a and ⌈a⌉ to denote the smallest integer that is at least a.

2 Related Work

In this section, we briefly discuss some previous algorithms for GLB. filippi2010parametric first proposes a UCB-type algorithm, called GLM-UCB, which achieves a regret upper bound of order d√T up to logarithmic factors. According to dani2008stochastic , this regret bound is optimal up to logarithmic factors for contextual bandit problems with infinitely many arms. li2017provably proposes a similar algorithm called UCB-GLM, which improves the regret bound of GLM-UCB by logarithmic factors. The main idea is to calculate the MLE of θ* at each round and then compute the upper confidence bound of the reward estimates. The time complexity of these two algorithms depends quadratically on both T and d, as they need to calculate the MLE and a matrix inverse every round.

Another rich line of algorithms for GLB follows the TS scheme, where the key is to estimate the posterior of θ* after observing new data at each round. Laplace-TS chapelle2011empirical estimates the posterior of a regularized logistic regression by a Laplace approximation, whose per-round time complexity is constant with respect to t. However, Laplace-TS works only for the logistic bandit and does not apply to general GLB problems. Moreover, it performs poorly when the feature vectors are non-Gaussian. dumitrascu2018pg proposes Pólya-Gamma augmented Thompson Sampling (PG-TS) with a Gibbs sampler to estimate the posterior for the logistic bandit. However, Gibbs-sampler inference is very expensive in online algorithms, and the time complexity of PG-TS grows with the number of burn-in steps of the sampler. In general, previous TS-based algorithms for the logistic bandit have regret bounds of order d^{3/2}√T or worse dong2019on ; abeille2017linear ; russo2014learning .

To make GLB algorithms scalable, jun2017scalable proposes the Generalized Linear Online-to-confidence-set Conversion (GLOC) algorithm. GLOC utilizes the exp-concavity of the GLM loss function and applies online Newton steps to construct a confidence set for θ*. It obtains a regret upper bound of order d√T up to logarithmic factors. Its TS version, GLOC-TS, has regret of order d^{3/2}√T, which has a worse dependency on d. The total time complexity of GLOC scales linearly in T thanks to its online second-order update. However, GLOC remains expensive when d is large. We give a detailed comparison of the time complexities and regret upper bounds of GLB algorithms in Table 1 of Section 5.

3 Problem setting

We consider the K-armed stochastic generalized linear bandit (GLB) setting. Denote T as the total number of rounds. At each round t, the player observes a set of contexts X_t consisting of K feature vectors x_{t,a} ∈ R^d, a = 1, …, K, where x_{t,a} represents the information of arm a at round t. X_t is drawn IID from an unknown distribution with ‖x_{t,a}‖ ≤ 1 for all t and a. We make the same regularity assumption as in li2017provably , i.e., there exists a constant σ_0 > 0 such that the averaged second moment matrix E[(1/K) Σ_{a=1}^K x_{t,a} x_{t,a}ᵀ] has minimum eigenvalue at least σ_0². Denote r_{t,a} as the random reward associated with arm a at round t. After X_t is revealed to the player, the player pulls an arm a_t and only observes the reward associated with the pulled arm, r_{t,a_t}. In the following, we write x_t := x_{t,a_t} and r_t := r_{t,a_t}.

In GLB, the expected rewards follow a generalized linear model (GLM) of the feature vectors and an unknown vector θ* ∈ R^d, i.e., there is a fixed, strictly increasing link function μ such that E[r_{t,a} | x_{t,a}] = μ(x_{t,a}ᵀ θ*) for all t and a. For example, the linear bandit and the logistic bandit are special cases of GLB with μ(z) = z and μ(z) = 1/(1 + e^{−z}), respectively. Without loss of generality, we assume that the norms of θ* and of the feature vectors are bounded. We also assume that the noise follows a sub-Gaussian distribution. Formally, the GLM can be written as r_t = μ(x_tᵀ θ*) + ε_t, where the ε_t are independent zero-mean sub-Gaussian noises. We use F_t to denote the σ-algebra generated by all the information up to round t, so that E[r_{t,a} | F_{t−1}, x_{t,a}] = μ(x_{t,a}ᵀ θ*) for all t and a. Denote a_t* = argmax_a μ(x_{t,a}ᵀ θ*) and x_t* = x_{t,a_t*}. The cumulative regret over T rounds is defined as

R(T) = Σ_{t=1}^T [ μ(x_t*ᵀ θ*) − μ(x_tᵀ θ*) ].   (1)

The player’s goal is to find an optimal policy that, at each round t, decides which arm a_t to pull so that the total regret R(T), or the expected regret E[R(T)], is minimized. Note that R(T) is random due to the randomness in the contexts X_t and in the pulled arms. We make the following mild assumptions, similar to li2017provably .
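To make the setting concrete, the following minimal Python sketch simulates a K-armed logistic GLB instance and tracks the cumulative regret of Equation 1 for a placeholder policy. All sizes and the random-arm policy are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 20, 5, 1000                         # illustrative sizes: arms, dimension, rounds

theta_star = rng.uniform(-1, 1, d)
theta_star /= np.linalg.norm(theta_star)      # keep the true parameter bounded

def mu(z):                                    # logistic link function
    return 1.0 / (1.0 + np.exp(-z))

cum_regret = 0.0
for t in range(T):
    X = rng.normal(size=(K, d))               # contexts X_t, rescaled so that ||x_{t,a}|| <= 1
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)

    a = int(rng.integers(K))                  # placeholder policy: pull an arm uniformly at random
    r = rng.binomial(1, mu(X[a] @ theta_star))  # Bernoulli reward with mean mu(x_t^T theta*)

    # per-round regret: expected reward of the best arm minus that of the pulled arm
    cum_regret += mu(X @ theta_star).max() - mu(X[a] @ theta_star)

print("cumulative regret of the random policy:", cum_regret)
```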

Assumption 1.

The link function μ is differentiable and there exists a constant κ > 0 such that μ̇ ≥ κ over the feasible range of xᵀθ.

For the logistic link function, Assumption 1 holds whenever xᵀθ is restricted to a bounded range, since μ̇(z) = μ(z)(1 − μ(z)) is bounded away from zero on any bounded interval. For the linear link function, we have κ = 1.
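As a quick numerical illustration of Assumption 1 (a sketch under the assumption that xᵀθ is restricted to a bounded interval, which is an illustrative condition rather than the paper's exact one): for the logistic link, μ̇ decreases in |z|, so κ is attained at the largest admissible |xᵀθ|.

```python
import numpy as np

def mu_dot(z):                 # derivative of the logistic link
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

S = 1.0                        # assumed bound on |x^T theta|, an illustrative value
# mu_dot is symmetric and decreasing in |z|, so its infimum over [-S, S] is attained at |z| = S.
kappa = mu_dot(S)
print(kappa)                   # approximately 0.1966 for S = 1

# For the linear link mu(z) = z we have mu_dot(z) = 1, hence kappa = 1.
```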

Assumption 2.

Denote . We assume .

This assumption is not stronger than the assumption made in li2017provably for linear bandit and logistic bandit, as li2017provably assumes and in both cases.

To make sure we can successfully apply online SGD updates in bandit problems, we also need the following regularity assumption, which requires that the optimal arm under any model parameter θ has a non-singular second moment matrix. This assumption is similar to the regularity assumption made in li2017provably , which assumes that the averaged second moment matrix of the feature vectors, E[(1/K) Σ_{a=1}^K x_{t,a} x_{t,a}ᵀ], is non-singular. Assumption 3 below merely says that the same holds for the optimal arm based on any θ.

Assumption 3.

For a fixed θ, let x_{t,θ}* denote the feature vector of the optimal arm at round t under θ, i.e., x_{t,θ}* = argmax_{x ∈ X_t} μ(xᵀθ). Denote Σ_θ = E[x_{t,θ}* (x_{t,θ}*)ᵀ] and σ_1² = inf_θ λ_min(Σ_θ). We assume σ_1 is a positive constant.

Intuitively, Assumption 3 means that, under any model parameter θ, the projection of the optimal arm’s feature vector onto any direction has positive probability of being non-zero. In practice, the optimal arms at different rounds are diverse, and it is mild to assume that the projections of these random vectors onto any direction are not always identically zero.

4 Proposed algorithm

In this section, we formally describe our proposed algorithm. The main idea is to use an online stochastic gradient descent (SGD) procedure to track the MLE and to use Thompson Sampling (TS) for exploration.

For a GLM, the MLE based on n observations {(x_i, r_i)}_{i=1}^n solves the score equation Σ_{i=1}^n (r_i − μ(x_iᵀθ)) x_i = 0. Therefore, it is natural to define the loss function at round t as the corresponding negative log-likelihood term ℓ_t(θ), whose gradient is (μ(x_tᵀθ) − r_t) x_t. Effective algorithms for GLB abeille2017linear ; filippi2010parametric ; li2017provably ; russo2014learning have been shown to converge to the optimal action at a fast rate, and we need to ensure that the online SGD steps achieve the same fast convergence rate. Such a rate is only attainable when the loss function is strongly convex; however, the loss function at a single round is convex but not necessarily strongly convex. To tackle this problem, we aggregate the loss function every τ rounds, where τ is a parameter to be specified. We define the j-th aggregated loss function as

ℓ̄_j(θ) = Σ_{t=(j−1)τ+1}^{jτ} ℓ_t(θ).   (2)

We will show in Section 5 that when τ is appropriately chosen, the aggregated loss function over τ rounds is strongly convex with high probability, and therefore fast convergence can be obtained. The gradient and Hessian of the aggregated loss are

∇ℓ̄_j(θ) = Σ_{t=(j−1)τ+1}^{jτ} (μ(x_tᵀθ) − r_t) x_t,   ∇²ℓ̄_j(θ) = Σ_{t=(j−1)τ+1}^{jτ} μ̇(x_tᵀθ) x_t x_tᵀ.   (3)
Figure 1: Illustration of notations.
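For concreteness, the block gradient in Equation 3 can be accumulated one round at a time; a minimal Python sketch is given below, assuming the GLM negative log-likelihood loss described above with a logistic link (the link choice is illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregated_gradient(theta, X_block, r_block, mu=sigmoid):
    """Gradient of the aggregated GLM loss over one block of tau rounds:
    sum_t (mu(x_t^T theta) - r_t) x_t, where X_block is (tau, d) and r_block is (tau,).
    In the algorithm this sum is accumulated one round at a time, so the
    per-round cost is O(d)."""
    return X_block.T @ (mu(X_block @ theta) - r_block)
```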

In the first τ rounds of the algorithm, we pull arms uniformly at random. Denote θ₁ as the MLE computed at round τ from these τ observations; this is the only time the MLE is calculated. We maintain a convex set Θ around θ₁. We will show in Section 5 that when Θ is properly chosen, it contains the minimizers of the aggregated losses with high probability. Denote θ̂_j as the j-th SGD iterate and initialize it at θ₁. Starting from round τ + 1, we update the SGD estimator every τ rounds. Since the minimizer of the aggregated loss lies in Θ, we project each SGD iterate back onto the convex set Θ. Define θ̄_j as the average of the SGD iterates obtained so far; θ̄_j is treated as the posterior mean of θ*, and we use TS to ensure sufficient exploration. Specifically, we draw θ̃ from a multivariate Gaussian distribution with mean θ̄_j and covariance matrix

(4)

where the scale factors in the covariance are defined as

(5)

Previous works filippi2010parametric ; li2017provably ; jun2017scalable in GLB use the inverse of the empirical covariance matrix of the pulled feature vectors as the covariance matrix. In contrast, we approximate this matrix by a diagonal matrix. Meanwhile, the covariance matrix in Equation 4 has an extra second term, which comes from the gap between the averaged SGD estimator θ̄_j and the MLE. Note that, similar to the SGD estimator, the TS estimator θ̃ is updated only every τ rounds. At round t, we pull the arm whose estimated reward under θ̃ is largest, i.e., a_t = argmax_a x_{t,a}ᵀθ̃. See Figure 1 for a brief illustration of the notations. Since our proposed algorithm employs techniques from both online SGD and TS, we call it SGD-TS. Details can be found in Algorithm 1.

Input: T, τ.

1:  Initialize the constants and the step-size schedule (Section 5).
2:  For t = 1, …, τ, pull an arm a_t uniformly at random and record x_t, r_t.
3:  Calculate the maximum-likelihood estimator θ₁ by solving Σ_{t=1}^{τ} (r_t − μ(x_tᵀθ)) x_t = 0.
4:  Maintain the convex set Θ around θ₁.
5:  Initialize the SGD iterate and the TS estimate at θ₁.
6:  for t = τ + 1 to T do
7:     if a new block of τ rounds starts then
8:        Increment the block index j and set the step size η_j.
9:        Calculate the gradient of the previous block's aggregated loss, defined in Equation 3, and update the SGD iterate θ̂_j by a projected gradient step onto Θ.
10:       Compute the average θ̄_j of the SGD iterates obtained so far.
11:       Compute the covariance matrix defined in Equation 4.
12:       Draw θ̃ from the multivariate Gaussian with mean θ̄_j and the covariance in Equation 4.
13:    end if
14:    Pull arm a_t = argmax_a x_{t,a}ᵀθ̃ and observe reward r_t.
15:  end for
Algorithm 1 Online stochastic gradient descent with Thompson Sampling (SGD-TS)
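The following is a minimal Python sketch of Algorithm 1, not the authors' implementation: the covariance used for the Thompson Sampling draw is a simplified diagonal stand-in for Equation 4, the projection set is a ball of a user-chosen radius around the one-shot MLE, and the step-size schedule and all hyperparameters are illustrative. The environment is passed in as two assumed callables, `contexts(t)` returning the K x d context matrix and `pull(t, a)` returning the observed reward.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts(contexts, pull, T, d, tau, eta0=1.0, alpha=1.0, radius=1.0, mu=sigmoid, seed=0):
    """Sketch of SGD-TS: random pulls for tau rounds, a single MLE fit, then one
    projected online SGD step and one Thompson Sampling draw every tau rounds."""
    rng = np.random.default_rng(seed)

    # Phase 1: pull arms uniformly at random for the first tau rounds.
    X_hist, r_hist = [], []
    for t in range(tau):
        X = contexts(t)
        a = int(rng.integers(len(X)))
        X_hist.append(X[a])
        r_hist.append(pull(t, a))
    X0, r0 = np.array(X_hist), np.array(r_hist)

    # One-shot MLE theta_1 on the first tau observations (logistic negative log-likelihood).
    nll = lambda th: np.sum(np.logaddexp(0.0, X0 @ th) - r0 * (X0 @ th))
    theta1 = minimize(nll, np.zeros(d)).x

    def project(th):                        # projection onto a ball around theta_1,
        diff = th - theta1                  # a stand-in for the convex set used in the paper
        nrm = np.linalg.norm(diff)
        return th if nrm <= radius else theta1 + diff * (radius / nrm)

    theta_hat = theta1.copy()               # current SGD iterate
    theta_sum = np.zeros(d)                 # running sum of iterates, for averaging
    theta_tilde = theta1.copy()             # TS sample used to choose arms
    grad, j = np.zeros(d), 0                # aggregated block gradient and block index

    rewards = []
    for t in range(tau, T):
        if t % tau == 0:                    # start of a new block: SGD step + TS draw
            j += 1
            theta_hat = project(theta_hat - (eta0 / j) * grad)
            theta_sum += theta_hat
            theta_bar = theta_sum / j       # averaged SGD estimator (posterior mean)
            cov = (alpha ** 2 / t) * np.eye(d)   # simplified diagonal stand-in for Eq. (4)
            theta_tilde = rng.multivariate_normal(theta_bar, cov)
            grad = np.zeros(d)              # reset the gradient for the new block

        X = contexts(t)
        a = int(np.argmax(X @ theta_tilde))            # pull the arm with the highest TS score
        r = pull(t, a)
        rewards.append(r)
        grad += (mu(X[a] @ theta_hat) - r) * X[a]      # accumulate the block gradient, O(d) per round
    return rewards
```

In this sketch the per-round work is the arm scoring plus the O(d) gradient accumulation, and no matrix is ever inverted, which mirrors the efficiency argument below.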

Since some GLB algorithms such as UCB-GLM li2017provably and GLM-UCB filippi2010parametric need to compute the MLE every round, to compare time complexities we assume that the MLE based on n datapoints with d-dimensional features can be solved in O(nd) time. SGD-TS is an extremely efficient algorithm for GLB. We only calculate the MLE once, at round τ, which costs O(τd) time. After that, we update the SGD estimator every τ rounds, and the gradient can be computed incrementally with O(d) per-round time. Note that we do not need to compute a matrix inverse every round either, since we approximate the relevant matrix by a diagonal matrix. In conclusion, the total time complexity of SGD-TS over T rounds scales linearly in both T and d. In practice, T is usually much greater than τ, and in such cases SGD-TS costs O(Td) time. Our algorithm improves efficiency significantly when either T or d is large. See Table 1 in Section 5 for comparisons with other algorithms.

5 Mathematical analysis

In this section, we formally analyze Algorithm 1. Proofs are deferred to supplementary materials.

5.1 Convergence of SGD update

Lemma 1.

Denote . If where is a small probability, then holds with probability at least .

From Lemma 1, the stated bound holds with high probability as long as τ is properly chosen. This is essential because the SGD estimator is projected onto the convex set Θ. In Lemma 2, we show that when τ is chosen as in Equation 6, the averaged SGD estimator converges to the MLE at a fast rate.

Lemma 2.

For a constant , let

(6)

where and are two universal constants, then with probability at least , the following holds when ,

5.2 Concentration events

By the property of MLE and Lemma 2, we have the concentration property of SGD estimator.

Lemma 3.

Suppose is chosen as in Equation 6, and , define , we have holds with probability at least , where is defined in the following, and are defined in Equation 5.

The following lemma shows the concentration property of TS estimator.

Lemma 4.

Define , we have , where

Lemma 5 shows the anti-concentration property of TS estimator, which ensures enough exploration.

Lemma 5.

Denote . For any filtration such that is true, we have

5.3 Regret analysis

We bound a single-round regret in Lemma 6. Denote , and

(7)
Lemma 6.

At round , where is defined in Equation 6, denote , we have

(8)

We are now ready to put together the above information and prove the regret bound of Algorithm 1.

Theorem 1.

When Algorithm 1 runs with , and defined in Equation 6, the expected total regret satisfies the following inequality

where , and .

Remark 1.

Note that, plugging the parameter choices into the above theorem, we have R(T) = Õ(√(dT)). Our regret upper bound is optimal up to logarithmic factors chu2011contextual for finite-arm contextual bandit problems. SGD-TS improves the regret bounds of UCB-GLM, GLOC and Laplace-TS by a factor of √d when the number of arms is finite. Moreover, it significantly improves efficiency when either T or d is large for GLB. See Table 1 for details. (The Sherman–Morrison formula improves the per-round time complexity of the matrix inverse in UCB-GLM and GLOC to O(d²).)

algorithms Time Complexity Theoretical Regret Comment
UCB-GLM li2017provably
Laplace-TS chapelle2011empirical only for logistic bandit
GLOC jun2017scalable
GLOC-TS jun2017scalable
SGD-TS (This work)
Table 1: Comparison with other algorithms. The time complexities listed here are stated under the assumptions discussed in the text. GLOC, GLOC-TS and Laplace-TS need to solve an optimization problem on one datapoint every round; we assume this optimization problem can be solved in a fixed number of iterations per round.

6 Experimental results

In this section, we show through experiments on both synthetic and real datasets that our proposed SGD-TS algorithm outperforms existing approaches. We compare SGD-TS with UCB-GLM li2017provably , Laplace-TS chapelle2011empirical and GLOC jun2017scalable . (We choose UCB-GLM and GLOC since they have lower theoretical regrets than GLM-UCB and GLOC-TS.) In order to have a fair comparison, we perform a grid search over the parameters of the different algorithms and report the results under the best parameters. In the experiments, the covariance matrix in Equation 4 is scaled by exploration rates, and we grid-search the exploration rates of SGD-TS, GLOC and UCB-GLM, the remaining tuning parameters of UCB-GLM and SGD-TS, and the initial step sizes of SGD-TS, GLOC and Laplace-TS. The experiments are repeated 10 times and the averaged results are presented.
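As a sketch of this tuning protocol (the grids and the evaluation wrapper below are placeholders, not the values or code used in the paper):

```python
from itertools import product
import random

def run_once(alpha, eta0, seed):
    """Hypothetical wrapper: run one bandit simulation with exploration rate `alpha`
    and initial step size `eta0`, and return its cumulative regret. The body here is
    a stand-in so the sketch runs; in practice it would call the SGD-TS sketch above."""
    random.seed(hash((seed, alpha, eta0)))
    return random.random()

grids = {"alpha": [0.01, 0.1, 1.0], "eta0": [0.01, 0.1, 1.0]}   # illustrative grids only
results = []
for alpha, eta0 in product(grids["alpha"], grids["eta0"]):
    avg = sum(run_once(alpha, eta0, s) for s in range(10)) / 10  # average over 10 repetitions
    results.append((avg, alpha, eta0))
print("best (avg regret, alpha, eta0):", min(results))
```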

Figure 2: Result for simulation.

6.1 Simulation

We simulate a K-armed logistic bandit dataset. The feature vectors and the true model parameter θ* are drawn IID from a uniform distribution on a bounded interval. We build a logistic model on the dataset and draw random rewards from a Bernoulli distribution with mean μ(x_{t,a}ᵀθ*). As suggested by dumitrascu2018pg , Laplace approximation of the global optimum does not always converge in non-asymptotic settings. From Figure 2, we can see that our proposed SGD-TS performs the best, while Laplace-TS performs the worst, as expected.
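A minimal sketch of this data-generating process, with illustrative sizes in place of the paper's, written so it can be fed to the SGD-TS sketch in Section 4:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, T = 10, 6, 2000                          # illustrative sizes only

theta_star = rng.uniform(-1, 1, d)             # true parameter, drawn from a uniform distribution
X_all = rng.uniform(-1, 1, size=(T, K, d))     # i.i.d. uniform feature vectors for every round

contexts = lambda t: X_all[t]                  # matches the interface assumed by sgd_ts above

def pull(t, a):
    """Bernoulli reward with mean given by the logistic model."""
    p = 1.0 / (1.0 + np.exp(-X_all[t, a] @ theta_star))
    return rng.binomial(1, p)
```

For example, `sgd_ts(contexts, pull, T=T, d=d, tau=50)` would run the sketch of Algorithm 1 on this simulated instance (the value of tau here is illustrative).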

Figure 3: Result for news article recommendation data.

6.2 News article recommendation data

We compare the algorithms on the benchmark Yahoo! Today Module dataset. This dataset contains user visits to the Yahoo! Today Module news article site from May 1, 2009 to May 10, 2009. For each user visit, the module selects one article from a changing pool of articles to present to the user, and the user decides to click (reward 1) or not to click (reward 0). Both the users and the articles are associated with a feature vector (including a constant feature), constructed by conjoint analysis with a bilinear model chu2009case . We treat the articles as arms and discard the users’ features. The click-through rate (CTR) of each article at every round is calculated as the average of the recorded rewards at that round. We again build a logistic bandit on this data. Each time the algorithm pulls an article, the observed reward is simulated from a Bernoulli distribution with mean equal to that article's CTR. For better visualization, we plot a CTR-based performance measure over the days; since we want higher CTR, larger values are better. From the plot in Figure 3, we can see that SGD-TS performs better than UCB-GLM during May 1 - May 2 and May 5 - May 9, while UCB-GLM and SGD-TS behave similarly on the other days. However, GLOC and Laplace-TS perform poorly in this real application.
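A sketch of how the logged data is turned into a simulator as described above; the array names and shapes are assumptions about a preprocessed log, not part of the released dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_ctr(clicks, impressions):
    """Per-round, per-article click-through rate: the average recorded reward of each
    article at each round, computed from (T, K) count arrays built from the log."""
    return clicks / np.maximum(impressions, 1)

def simulated_reward(ctr, t, a):
    """Observed reward when the algorithm shows article a at round t: a Bernoulli
    draw with mean equal to that article's CTR at round t."""
    return rng.binomial(1, ctr[t, a])
```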

Figure 4: Scenario 1 for forest cover type data.

6.3 Forest cover type data

We compare the algorithms on the Forest Cover Type data from the UCI repository. The dataset contains datapoints from a forest area, and the labels represent the main species of the cover type. For each datapoint, if it belongs to the first class (the Spruce/Fir species), we set its reward to 1; otherwise, we set it to 0. We extract the features (quantitative features are centered and standardized) from the dataset and then partition the data into clusters. The reward of each cluster is set to the proportion of datapoints with reward 1 in that cluster. Since the observed reward is either 0 or 1, we again build logistic bandits for this dataset. Assume arm 1 has the highest reward and arm 6 has the 6-th highest reward. We plot the averaged cumulative regret and the median frequency with which each algorithm pulls the best arms for the following two scenarios in Figure 4 and Figure 5.
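A sketch of this preprocessing using scikit-learn; the number of clusters is a placeholder, not the value used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def build_arms(X_quant, labels, n_clusters=32, seed=0):
    """Standardize the quantitative features, partition the points into clusters,
    and give each cluster a reward equal to the fraction of its points whose label
    is the first class (Spruce/Fir). The centroids serve as the arm feature vectors
    in Scenario 1 below; Scenario 2 instead samples a random point from each cluster
    at every round."""
    X = StandardScaler().fit_transform(X_quant)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
    y = (labels == 1).astype(float)              # reward 1 for the first cover type, else 0
    cluster_rewards = np.array([y[km.labels_ == c].mean() for c in range(n_clusters)])
    return km.cluster_centers_, cluster_rewards, km.labels_
```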

Figure 5: Scenario 2 for forest cover type data.

Scenario 1: Similar to dumitrascu2018pg ; filippi2010parametric , we use only the quantitative features and treat the cluster centroid as the feature vector of the cluster.

Scenario 2: To make the classification task more challenging, we utilize both categorical and quantitative features. Meanwhile, the feature vector of each cluster at each round is a random sample from that cluster. This makes the features more dynamic, and the algorithm needs to do more exploration before it can identify the optimal arm.

From the plots, we can see that in both scenarios our proposed algorithm performs the best and pulls the best arm most frequently. For scenario 1, GLOC and UCB-GLM perform relatively well, while Laplace-TS gets stuck in sub-optimal arms. This is consistent with the results in dumitrascu2018pg . For the more difficult scenario 2, both UCB-GLM and Laplace-TS perform poorly and frequently pull sub-optimal arms. GLOC performs better than UCB-GLM and Laplace-TS, but it does not pull the best arm as frequently as SGD-TS. Note that in scenario 2, UCB-GLM and GLOC take much longer than in scenario 1 to update their decisions, as they need to invert a matrix every round.

7 Conclusion and future work

In this paper, we derive and analyze SGD-TS, a novel and efficient algorithm for generalized linear bandits. The time complexity of SGD-TS scales linearly in both the total number of rounds and the feature dimension in general. Under mild assumptions, we prove a regret upper bound for SGD-TS that is optimal up to logarithmic factors for generalized linear bandit problems with finitely many arms. Experimental results on both synthetic and real datasets show that SGD-TS consistently outperforms other state-of-the-art algorithms. To the best of our knowledge, this is the first work that successfully applies online stochastic gradient descent updates to contextual bandit problems with a theoretical guarantee.

Future work

Although the generalized linear bandit is successful in many cases, many other models offer more representation power for contextual bandits. This has motivated a number of works on contextual bandits with complex reward models chowdhury2017kernelized ; riquelme2018deep ; zhou2019neural . For most of these works, computing the posterior or the upper confidence bound remains an expensive task in online learning. While we have seen in this work that online SGD can be successfully applied to GLB under mild assumptions, it is interesting to investigate whether online SGD can also be used to design efficient and theoretically sound methods for contextual bandits with more complex reward models, such as neural networks.

Broader Impact

Contextual bandit problems have substantial applications in recommender systems, online advertising, clinical trials, etc. The proposed work yields a novel method that significantly improves the efficiency of algorithms for generalized linear bandits with a theoretical guarantee. The combination of online stochastic gradient descent and Thompson Sampling proves effective here, and we believe it will motivate further developments in efficient algorithms for contextual bandit problems with complex models. Moreover, we will release the code of the four algorithms in our experiments, providing the research community with a reliable platform for evaluating the performance of algorithms in generalized linear bandit problems.

References

  • [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2] Marc Abeille, Alessandro Lazaric, et al. Linear thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.
  • [3] Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1948.
  • [4] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39–1, 2012.
  • [5] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
  • [6] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [7] Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064, 2018.
  • [8] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • [9] Kani Chen, Inchi Hu, Zhiliang Ying, et al. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. The Annals of Statistics, 27(4):1155–1163, 1999.
  • [10] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 844–853. JMLR. org, 2017.
  • [11] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [12] Wei Chu, Seung-Taek Park, Todd Beaupre, Nitin Motgi, Amit Phadke, Seinjuti Chakraborty, and Joe Zachariah. A case study of behavior-driven conjoint analysis on yahoo! front page today module. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1104, 2009.
  • [13] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
  • [14] Shi Dong, Tengyu Ma, and Benjamin Van Roy. On the performance of thompson sampling on logistic bandits. In COLT, pages 1158–1160, 2019.
  • [15] Bianca Dumitrascu, Karen Feng, and Barbara Engelhardt. Pg-ts: Improved thompson sampling for logistic contextual bandits. In Advances in neural information processing systems, pages 4624–4633, 2018.
  • [16] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
  • [17] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [18] Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109, 2017.
  • [19] Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In UAI, page 176, 2019.
  • [20] Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, pages 19–36, 2012.
  • [21] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • [22] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2071–2080. JMLR. org, 2017.
  • [23] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In ICLR 2018 : International Conference on Learning Representations 2018, 2018.
  • [24] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.
  • [25] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • [26] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  • [27] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [28] Michael Woodroofe. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.
  • [29] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with upper confidence bound-based exploration. arXiv preprint arXiv:1911.04462, 2019.

8 Supplementary Material

8.1 Proof of Lemma 1

The proof of Lemma 1 is adapted from the proof of Theorem 1 in [22].

Proof.

Define . We have and , where is the sub-Gaussian noise at round . For convenience, define . From mean value theorem, for any , there exists and such that

(9)

where . Therefore, for any , we have

since and . So is an injection from to . Consider an -neighborhood of , , where is a constant that will be specified later such that we have . When , from the property of convex set, we have . From Equation 9, we have when ,

The last inequality is due to

From Lemma A in [9], we have that

Now from Lemma 7 in [22], we have with probability at least ,

Therefore, when

we have . Since when , we have

when

8.2 Proof of Lemma 2

Note that the condition of Lemma 1 holds with high probability when τ is chosen as in Equation 6. This is a consequence of Proposition 1 in [22], which is presented below for the reader's convenience.

Proposition 1 (Proposition 1 in [22]).

Define , where is drawn IID from some distribution in unit ball . Furthermore, let be the second moment matrix, let be two positive constants. Then there exists positive, universal constants and such that with probability at least , as long as

Now we formally prove Lemma 2.

Proof.

Note that from the definition of in the algorithm, when , the conclusion holds trivially. When is chosen as in Equation 6, we have from Lemma 1 and Proposition 1 that for all with probability at least . Therefore, for all with probability at least . For the analysis below, we assume for all .

Since , we have . Denote , we have . For any , define , since is convex, we have . Therefore, we have from Assumption 2

Since we update every rounds and only depends on . For the next rounds, the pulled arms are only dependent on . Therefore, the feature vectors of pulled arms among the next rounds are IID. According to Proposition 1 and Equation 6, and by applying a union bound, we have holds for all with probability at least . This tells us that for all , is a -strongly convex function when . Therefore, we can apply (Theorem 3.3 of Section 3.3.1 in [17]) to get for all

where satisfies . Note that since and . From Jensen’s Inequality, we have

Since , we have for any , if , then for all . Since , we have

By applying a union bound, we get the conclusion. ∎

8.3 Proof of Lemma 3

We utilize the concentration property of MLE. Here, we present the analysis of MLE in [22].

Lemma 7 (Lemma 3 in [22]).

Suppose . For any , the following event

holds for all with probability at least .

Proof.

Note that from Proposition 1, when , holds with probability at least