1 Introduction
The multi-armed bandit framework Lai and Robbins (1985); Bubeck et al. (2012); Auer (2002); Auer et al. (2002) is a classic approach for sequential decision-making under uncertainty. The basic framework consists of independent arms that correspond to different choices or actions. These may be different treatments in a clinical trial or different products that can be recommended to the users of an online service. Each arm has an associated expected reward or utility. Typically, we do not have prior information about the utility of the available choices, and the agent learns to make "good" decisions via repeated interaction, in a trial-and-error fashion. Under the bandit setting, in each interaction or round, the agent selects an arm and observes a reward only for the selected arm. The objective of the agent is to maximize the reward accumulated across multiple rounds. This results in an exploration-exploitation trade-off: exploration means choosing an arm to gain more information about it, while exploitation corresponds to choosing the arm with the highest estimated reward so far. The
contextual bandit setting Wang et al. (2005); Pandey et al. (2007); Kakade et al. (2008); Dani et al. (2008); Li et al. (2010); Agrawal and Goyal (2013b) is a generalization of the bandit framework and assumes that we have additional information in the form of a feature vector or "context" at each round. A context might encode the medical data of a patient in a clinical trial or the demographics of an online user of a recommender system. In this case, the expected reward for an arm is an unknown function of the context at that particular round; we typically assume a parametric form for this function and infer the corresponding parameters from observations. For example, for linear bandits Rusmevichientong and Tsitsiklis (2010); Dani et al. (2008); Abbasi-Yadkori et al. (2011), this function is assumed to be linear, implying that the expected reward can be expressed as an inner product between the context vector and an (unknown) parameter to be learned from observations.

In both the bandit and contextual bandit settings, there are three main strategies for addressing the exploration-exploitation trade-off: (i) ε-greedy Langford and Zhang (2008), (ii) optimism-in-the-face-of-uncertainty (OFU) Auer (2002); Abbasi-Yadkori et al. (2011), and (iii) Thompson sampling Agrawal and Goyal (2013b). Though ε-greedy (EG) is simple to implement and widely used in practice, it results in suboptimal performance from a theoretical standpoint. In practice, its performance relies heavily on choosing the right exploration parameter and the strategy for annealing it. Strategies based on optimism under uncertainty rely on constructing confidence sets and are statistically optimal and computationally efficient in the bandit Auer et al. (2002) and linear bandit Abbasi-Yadkori et al. (2011) settings. However, for nonlinear feature-reward mappings, we can construct only approximate confidence sets Filippi et al. (2010); Li et al. (2017); Zhang et al. (2016); Jun et al. (2017) that result in over-conservative uncertainty estimates Filippi et al. (2010) and consequently in worse empirical performance. Given a prior distribution over the rewards or parameters being inferred, Thompson sampling (TS) uses the observed rewards to compute a posterior distribution. It then uses samples from the posterior to make decisions. TS is computationally efficient when we have a closed-form posterior, as in the case of Bernoulli or Gaussian rewards. For reward distributions beyond those admitting conjugate priors, or for complex nonlinear feature-reward mappings, it is not possible to have a closed-form posterior or obtain exact samples from it. In these cases, we have to rely on computationally expensive approximate sampling techniques
Riquelme et al. (2018).

To address the above difficulties, bootstrapping Efron (1992) has been used in the bandit Baransi et al. (2014); Eckles and Kaptein (2014), contextual bandit Tang et al. (2015); McNellis et al. (2017), and deep reinforcement learning Osband and Van Roy (2015); Osband et al. (2016) settings. All previous work uses nonparametric bootstrapping (explained in Section 3.1) as an approximation to TS. As opposed to maintaining the entire posterior distribution for TS, bootstrapping requires computing only point estimates (such as the maximum likelihood estimator). Bootstrapping thus has two major advantages over other existing strategies: (i) unlike OFU and TS, it is simple to implement and does not require designing problem-specific confidence sets or efficient sampling algorithms; (ii) unlike EG, it is not sensitive to hyperparameter tuning. In spite of its advantages and good empirical performance, bootstrapping for bandits is not well understood theoretically, even under special settings of the bandit problem. Indeed, to the best of our knowledge, McNellis et al. (2017) is the only work that attempts to theoretically analyze the nonparametric bootstrapping (referred to as NPB) procedure. For the bandit setting with Bernoulli rewards and a Beta prior (henceforth referred to as the Bernoulli bandit setting), they prove that both TS and NPB take similar actions as the number of rounds increases. However, this does not have any implication on the regret of NPB.

In this work, we first show that the NPB procedure used in previous work is provably inefficient in the Bernoulli bandit setting (Section 3.2). In particular, we establish a near-linear lower bound on the incurred regret. In Section 3.3, we show that NPB with an appropriate amount of forced exploration (done in practice in McNellis et al. (2017); Tang et al. (2015)) can result in a sublinear, though suboptimal, upper bound on the regret. As an alternative to NPB, we propose the weighted bootstrapping (abbreviated as WB) procedure. For Bernoulli (or more generally categorical) rewards, we show that WB with multiplicative exponential weights is mathematically equivalent to TS and thus results in near-optimal regret.
Similarly, for Gaussian rewards, WB with additive Gaussian weights is equivalent to TS with an uninformative prior and also attains near-optimal regret.
In Section 5, we empirically show that for several reward distributions bounded on [0, 1], WB outperforms TS with the randomized rounding procedure proposed in Agrawal and Goyal (2013b). In the contextual bandit setting, we give two implementation guidelines. To improve the computational efficiency of bootstrapping, prior work Eckles and Kaptein (2014); McNellis et al. (2017); Tang et al. (2015) approximated it by an ensemble of models, which requires additional hyperparameter tuning, such as choosing the size of the ensemble, or problem-specific heuristics; for example, McNellis et al. (2017) use a lazy update procedure specific to decision trees. We find that with appropriate stochastic optimization, bootstrapping (without any approximation) is computationally efficient and simple to implement. Our second guideline concerns the initialization of the bootstrapping procedure. Prior work McNellis et al. (2017); Tang et al. (2015) used forced exploration at the beginning of bootstrapping, by pulling each arm some number of times or by adding pseudo-examples. This involves tuning additional hyperparameters; for example, McNellis et al. (2017) pull each arm a fixed number of times before bootstrapping. Similarly, the number of pseudo-examples or the procedure for generating them is rather arbitrary. We propose a simple method for generating such examples and experimentally validate that using 4d pseudo-examples, where d is the dimension of the context vector, leads to consistently good performance. These contributions result in a simple and efficient implementation of the bootstrapping procedure. We experimentally evaluate bootstrapping with several parametric models and real-world datasets.
2 Background
We describe the framework for the contextual bandit problem in Section 2.1. In Section 2.2, we give the necessary background on bootstrapping and then explain its adaptation to bandits in Section 2.3.
2.1 Bandit Framework
The bandit setting consists of a finite set of arms, where each arm has an underlying (unknown) reward distribution. The protocol for a bandit problem is as follows: in each round, the bandit algorithm selects an arm and then receives a reward sampled from the underlying reward distribution of the selected arm. The best or optimal arm is defined as the one with the highest expected reward. The aim of the bandit algorithm is to maximize the expected cumulative reward or, alternatively, to minimize the expected cumulative regret. The cumulative regret is the cumulative loss in reward across rounds because of the lack of knowledge of the optimal arm.
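As a concrete illustration of this protocol, the minimal sketch below (with Bernoulli arms assumed for simplicity; the function and argument names are ours, not the paper's) plays one arm per round, reveals only that arm's reward, and accumulates the expected regret:

```python
import numpy as np

def run_bandit(select, update, means, horizon, seed=0):
    """Minimal sketch of the bandit protocol with Bernoulli arms: in each
    round the algorithm selects an arm via `select`, observes a reward for
    that arm only, and `update` records it. Returns the expected cumulative
    regret, i.e. the expected reward lost relative to always playing the
    optimal arm."""
    rng = np.random.default_rng(seed)
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        arm = select(t)
        # Bandit feedback: only the selected arm's reward is observed.
        reward = float(rng.random() < means[arm])
        update(arm, reward)
        regret += best - means[arm]
    return regret
```

Any bandit algorithm can be plugged in through the `select` / `update` pair; a round-robin policy, for instance, incurs regret linear in the horizon.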
In the contextual bandit setting Langford and Zhang (2008); Li et al. (2017); Chu et al. (2011), the expected reward at round t depends on the context or feature vector x_t. Specifically, each arm i is parametrized by an (unknown) vector θ*_i, and its expected reward at round t is given by g(x_t, θ*_i), i.e. E[r_t | arm i is pulled] = g(x_t, θ*_i). Here the function g is referred to as the model class. Given these definitions, the expected cumulative regret across n rounds is defined as follows:

R(n) = E[ Σ_{t=1}^{n} ( max_i g(x_t, θ*_i) − g(x_t, θ*_{i_t}) ) ],    (1)

where i_t denotes the arm pulled at round t.
The standard bandit setting is a special case of the above framework. To see this, if μ_i denotes the expected reward of arm i, then it can be obtained by setting x_t = 1 for all rounds t and g(x_t, θ*_i) = θ*_i = μ_i for all arms i. Assuming that arm 1 is the optimal arm, i.e. μ_1 = max_i μ_i, the expected cumulative regret simplifies to R(n) = E[ Σ_{t=1}^{n} (μ_1 − μ_{i_t}) ]. Throughout this paper, we describe our algorithm under the general contextual bandit framework, but develop our theoretical results under the simpler bandit setting.
2.2 Bootstrapping
In this section, we set up notation and describe the bootstrapping procedure in the offline setting. Assume we have a set of n datapoints denoted by D = {(x_i, y_i)}_{i=1}^{n}. Here, x_i and y_i refer to the feature vector and the observation (alternatively, label) for the i-th point. We assume a parametric generative model, parametrized by θ, from the features to the observations. Given θ, the log-likelihood of observing the data D is given by L(θ) = Σ_{i=1}^{n} log p(y_i | x_i, θ), where p(y_i | x_i, θ) is the probability of observing label y_i given the feature vector x_i, under the model parameters θ. In the absence of features, the probability of observing y_i (for all i) is given by p(y_i | θ). The maximum likelihood estimator (MLE) for the observed data is defined as θ̂ = argmax_θ L(θ). In this paper, we mostly focus on Bernoulli observations without features, in which case p(y_i | θ) = θ^{y_i} (1 − θ)^{1 − y_i}.

Bootstrapping is typically used to obtain uncertainty estimates for a model fit to data. The general bootstrapping procedure consists of two steps: (i) formulate a bootstrapped log-likelihood function L̃(θ) by injecting stochasticity into L(θ) via an auxiliary random variable; (ii) given L̃, generate a bootstrap sample as θ̃ = argmax_θ L̃(θ). In the offline setting Friedman et al. (2001), these steps are repeated a large number of times to obtain a set of bootstrap samples. The variance of these samples is then used to estimate the uncertainty in the model parameters θ. Unlike a Bayesian approach that requires characterizing the entire posterior distribution in order to compute uncertainty estimates, bootstrapping only requires computing point estimates (maximizers of the bootstrapped log-likelihood functions). In Sections 3 and 4, we discuss two specific bootstrapping procedures.

2.3 Bootstrapping for Bandits
In the bandit setting, the work in Eckles and Kaptein (2014); Tang et al. (2015); McNellis et al. (2017) uses bootstrapping as an approximation to Thompson sampling (TS). The basic idea is to compute one bootstrap sample per arm and treat it as a sample from an underlying posterior distribution in order to emulate TS. In Algorithm 1, we describe the procedure for the contextual bandit setting. At every round, the algorithm maintains, for each arm, the set of features and observations obtained on pulling that arm in the previous rounds. It uses this set to compute a bootstrap sample for each arm. Given the bootstrap sample for each arm, the algorithm (similar to TS) selects the arm maximizing the reward conditioned on this bootstrap sample. After obtaining the observation, the algorithm updates the set of observations for the selected arm. In the subsequent sections, we instantiate the procedure for generating the bootstrap sample and analyze the performance of the algorithm in these settings.
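The loop of Algorithm 1 can be sketched as follows in the featureless Bernoulli setting (a simplified illustration with hypothetical names; `bootstrap_sample` is a pluggable routine, later instantiated by NPB or WB):

```python
import numpy as np

def bootstrap_bandit(bootstrap_sample, means, horizon, seed=0):
    """Sketch of Algorithm 1 in the featureless Bernoulli setting: at every
    round, compute one bootstrap sample per arm from that arm's observation
    history, play the arm whose sample is largest, and append the observed
    reward to the played arm's history. `bootstrap_sample(history, rng)`
    returns one bootstrap estimate of an arm's mean reward."""
    rng = np.random.default_rng(seed)
    histories = [[] for _ in means]            # per-arm observed rewards
    pulls = np.zeros(len(means), dtype=int)
    for _ in range(horizon):
        # One bootstrap sample per arm emulates one posterior sample in TS.
        estimates = [bootstrap_sample(h, rng) for h in histories]
        arm = int(np.argmax(estimates))
        reward = float(rng.random() < means[arm])
        histories[arm].append(reward)
        pulls[arm] += 1
    return pulls
```

Plugging in a sampler that concentrates around each arm's empirical mean while retaining enough randomness yields TS-like behavior.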
3 Nonparametric Bootstrapping
We first describe the nonparametric bootstrapping (NPB) procedure in Section 3.1. We show that NPB used in conjunction with Algorithm 1 can be provably inefficient, and establish a near-linear lower bound on the regret it incurs in the Bernoulli bandit setting (Section 3.2). In Section 3.3, we show that NPB with an appropriate amount of forced exploration can result in a sublinear, though suboptimal, regret in this setting.
3.1 Procedure
In order to construct the bootstrap sample in Algorithm 1, we first create a new dataset D̃ by sampling, with replacement, |D| points from the original dataset D. The bootstrapped log-likelihood is equal to the log-likelihood of observing D̃. Formally,

L̃(θ) = Σ_{(x, y) ∈ D̃} log p(y | x, θ).    (2)

The bootstrap sample is then computed as θ̃ = argmax_θ L̃(θ). Observe that the sampling-with-replacement procedure is the source of randomness for bootstrapping.
For the special case of Bernoulli rewards without features, a common practice is to use Laplace smoothing, where we generate positive (y = 1) or negative (y = 0) pseudo-examples to be used in addition to the observed labels. Laplace smoothing is associated with two non-negative integers, α0 and β0, where α0 (respectively β0) is the pseudo-count, equal to the number of positive (respectively negative) pseudo-examples. These pseudo-counts are used to "simulate" the Beta(α0, β0) prior distribution. For the NPB procedure with Bernoulli rewards, generating the bootstrap sample is equivalent to sampling from a scaled Binomial distribution with n = |D| + α0 + β0 trials, whose success probability is equal to the fraction of positive observations in the augmented dataset. Formally, if the number of positive observations in D is equal to s, then

θ̃ = Binomial(n, (s + α0) / n) / n.    (3)
3.2 Inefficiency of Nonparametric Bootstrapping
In this subsection, we formally show that Algorithm 1 used with NPB can incur regret that is nearly linear in the number of rounds. Specifically, we consider a simple two-armed bandit setting where, at each round, the reward of arm 1 is independently drawn from a Bernoulli distribution with an unknown mean, and the reward of arm 2 is deterministic. Furthermore, we assume that the agent knows the deterministic reward of arm 2, but not the mean reward of arm 1. Notice that this case is simpler than the standard two-armed Bernoulli bandit setting, in the sense that the agent also knows the reward of arm 2. Observe that if θ̃_1 is a bootstrap sample for arm 1 (obtained according to equation 3), then arm 1 is selected whenever θ̃_1 exceeds the known reward of arm 2. Under this setting, we prove the following lower bound:

Theorem 1.
If the NPB procedure is used in the above-described case with constant pseudo-counts for arm 1, then for any sufficiently large number of rounds, the expected cumulative regret grows nearly linearly with the horizon.
Proof.
Please refer to Appendix A for the detailed proof of Theorem 1. It is proved based on a Binomial tail bound (Proposition 2) and uses the following observation: under a "bad history", in which NPB has pulled arm 1 a number of times but all of these pulls have resulted in a reward of 0, NPB will pull arm 1 again only with a small probability (Lemma 1). Hence, the number of times NPB pulls the suboptimal arm before it pulls arm 1 again, or reaches the end of the horizon, follows a "truncated geometric distribution", whose expected value is bounded in Lemma 2. Based on Lemma 2, and a lower bound on the probability of such a bad history, we lower-bound the expected cumulative regret in Lemma 3. Theorem 1 is proved by setting the parameters appropriately. ∎

Theorem 1 shows that, when the horizon is large enough, the NPB procedure used in previous work Eckles and Kaptein (2014); Tang et al. (2015); McNellis et al. (2017) incurs an expected cumulative regret arbitrarily close to linear in the horizon. It is straightforward to prove a variant of this lower bound with any constant (in terms of the horizon) number of pseudo-examples. Next, we show that NPB with appropriate forced exploration can result in sublinear regret.
3.3 Forced Exploration
In this subsection, we show that NPB, when coupled with an appropriate amount of forced exploration, can result in sublinear regret in the Bernoulli bandit setting. In order to force exploration, we pull each arm a prescribed number of times before starting Algorithm 1. The following theorem shows that, for an appropriate number of such initial pulls, this strategy results in a sublinear upper bound on the regret.
Theorem 2.
In any multi-armed Bernoulli bandit setting, if each arm is initially pulled a suitably chosen number of times before starting Algorithm 1, then the expected cumulative regret of NPB is sublinear in the horizon.
Proof.
The claim is proved in Appendix B based on the following observation: if the gap of a suboptimal arm is large, the prescribed initial pulls are sufficient to guarantee that the bootstrap sample of the optimal arm is higher than that of the suboptimal arm with high probability at any subsequent round. On the other hand, if the gap of a suboptimal arm is small, no algorithm can incur high regret. ∎
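The forced-exploration initialization analyzed above can be sketched as follows (the names and the per-arm reward interface are ours, for illustration):

```python
import numpy as np

def forced_exploration_init(n_arms, m, pull_arm, rng):
    """Forced exploration: pull each of the n_arms arms m times before the
    bandit loop starts, and return the per-arm reward histories that then
    seed the bootstrap in Algorithm 1. `pull_arm(i, rng)` draws one reward
    from arm i."""
    return [[pull_arm(i, rng) for _ in range(m)] for i in range(n_arms)]
```

The returned histories replace the empty ones at the start of the bandit loop, so every arm's first bootstrap sample is computed from m real observations rather than from pseudo-examples alone.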
Although we can remedy the NPB procedure using this strategy, it results in a suboptimal regret bound. In the next section, we consider a weighted bootstrapping approach as an alternative to NPB.
4 Weighted Bootstrapping
In this section, we propose weighted bootstrapping (WB) as an alternative to the nonparametric bootstrap. We first describe the weighted bootstrapping procedure in Section 4.1. For the bandit setting with Bernoulli rewards, we show the mathematical equivalence between WB and TS, hence proving that WB attains near-optimal regret (Section 4.2).
4.1 Procedure
In order to formulate the bootstrapped log-likelihood, we use a random transformation of the labels in the corresponding log-likelihood function. First, consider the case of Bernoulli observations, where the labels y_i ∈ {0, 1}. In this case, the log-likelihood function is given by:

L(θ) = Σ_{i=1}^{n} [ y_i log g(x_i, θ) + (1 − y_i) log(1 − g(x_i, θ)) ],

where the function g is the inverse-link function. For each observation i, we sample a random weight w_i from an exponential distribution; specifically, w_i ~ Exp(1) for all i. We use the following transformation of the labels: y_i → w_i y_i and (1 − y_i) → w_i (1 − y_i). Since we transform the labels by multiplying them with exponential weights, we refer to this case as WB with multiplicative exponential weights. Observe that this transformation procedure extends the domain of the labels from values in {0, 1} to values in [0, ∞) and does not result in a valid probability mass function. However, below, we describe several advantages of using this transformation. Given this transformation, the bootstrapped log-likelihood function is defined as:

L̃(θ) = Σ_{i=1}^{n} w_i [ y_i log g(x_i, θ) + (1 − y_i) log(1 − g(x_i, θ)) ] = Σ_{i=1}^{n} w_i ℓ_i(θ).    (4)

Here, ℓ_i(θ) is the log-likelihood of observing point i. As before, the bootstrap sample is computed as θ̃ = argmax_θ L̃(θ). Note that in WB, the randomness for bootstrapping is induced by the weights w_i. As a special case, in the absence of features, when the expected reward of every observation is simply θ, assuming α0 positive and β0 negative pseudo-counts, we obtain the following closed-form expression for computing the bootstrap sample:

θ̃ = ( Σ_{i : y_i = 1} w_i ) / ( Σ_{i=1}^{n} w_i ),    (5)

where the sums run over the n = |D| + α0 + β0 observations, including the pseudo-examples.
Using the above transformation has the following advantages: (i) Using equation 4, we can interpret the bootstrapped log-likelihood as a random re-weighting (by the exponential weights) of the observations. This formulation is equivalent to the weighted likelihood bootstrapping procedure proposed, and proven to be asymptotically consistent in the offline case, in Newton and Raftery (1994). (ii) From an implementation perspective, computing the bootstrap sample involves solving a weighted maximum likelihood estimation problem. It thus has the same computational complexity as NPB and can be solved by using black-box optimization routines. (iii) In the next section, we show that using WB with multiplicative exponential weights has good theoretical properties in the bandit setting. Furthermore, such a procedure of randomly transforming the labels lends itself naturally to the Gaussian case, and in Appendix C.2.1, we show that WB with an additive transformation using Gaussian weights is equivalent to TS.
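The closed-form estimator of equation 5 can be sketched as follows (featureless Bernoulli case; the names, and `a0`, `b0` for the pseudo-counts, are ours):

```python
import numpy as np

def wb_sample(history, rng, a0=1, b0=1):
    """One weighted bootstrap sample with multiplicative exponential weights
    for featureless Bernoulli rewards (the closed form of equation 5): draw
    one Exp(1) weight per observation (pseudo-examples included) and return
    the weighted fraction of positive labels."""
    labels = np.array(list(history) + [1.0] * a0 + [0.0] * b0)
    w = rng.exponential(size=labels.size)  # w_i ~ Exp(1), the only randomness
    return float(w @ labels / w.sum())
```

Unlike the NPB sample, which lives on a discrete grid, this ratio of exponential weights is a continuous random variable on (0, 1).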
4.2 Equivalence to Thompson sampling
We now analyze the theoretical performance of WB in the Bernoulli bandit setting. In the following proposition, proved in Appendix C.1.1, we show that WB with multiplicative exponential weights is equivalent to TS.
Proposition 1.
If the rewards are Bernoulli, then weighted bootstrapping using the estimator in equation 5 results in θ̃ ~ Beta(s + α0, f + β0), where s and f are the numbers of positive and negative observations respectively, and α0 and β0 are the positive and negative pseudo-counts. In this case, WB is equivalent to Thompson sampling under the Beta(α0, β0) prior.
Since WB is mathematically equivalent to TS, the bounds in Agrawal and Goyal (2013a) imply nearoptimal regret for WB in the Bernoulli bandit setting.
In Appendix C.1.2, we show that this equivalence extends to the more general categorical reward distribution, where each reward takes one of finitely many values. In Appendix C.2.1, we prove that for Gaussian rewards, WB with additive Gaussian weights, i.e. using the additive transformation y_i → y_i + w_i with Gaussian weights w_i, is equivalent to TS under an uninformative prior. Furthermore, this equivalence holds even in the presence of features, i.e. in the linear bandit case. Using the results in Agrawal and Goyal (2013b), this implies that for Gaussian rewards, WB with additive Gaussian weights achieves near-optimal regret.
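The equivalence in Proposition 1 can also be checked by simulation: the exponentially weighted fraction of positive labels should match a Beta(s + α0, f + β0) draw in distribution. The sketch below (with hypothetical names) compares the empirical means of the two constructions:

```python
import numpy as np

def compare_wb_to_beta(s, f, a0=1, b0=1, n=20000, seed=0):
    """Monte Carlo check of Proposition 1: with s positive and f negative
    observations plus pseudo-counts (a0, b0), the exponentially weighted
    fraction of positive labels should be distributed as Beta(s + a0, f + b0).
    Returns the empirical means of the two sampling schemes."""
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.ones(s + a0), np.zeros(f + b0)])
    w = rng.exponential(size=(n, labels.size))      # one Exp(1) weight per label
    wb = (w * labels).sum(axis=1) / w.sum(axis=1)   # n WB samples (equation 5)
    beta = rng.beta(s + a0, f + b0, size=n)         # n TS posterior samples
    return wb.mean(), beta.mean()
```

The underlying reason is the standard Gamma-Beta relation: sums of independent Exp(1) weights are Gamma distributed, and a Gamma ratio of this form is Beta distributed.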
5 Experiments
In Section 5.1, we first compare the empirical performance of bootstrapping and Thompson sampling in the bandit setting. In Section 5.2, we describe the experimental setup for the contextual bandit setting and compare the performance of the different algorithms under different feature-reward mappings.
5.1 Bandit setting
We consider a small number of arms (refer to Appendix D for results with other numbers of arms), a fixed horizon, and average our results across multiple independent runs. We perform experiments for four different reward distributions, all bounded on the [0, 1] interval: Bernoulli, truncated Normal, Beta, and the triangular distribution. In each run and for each arm, we choose the expected reward (the mean of the corresponding distribution) to be a uniformly distributed random number in [0, 1]. For the truncated Normal distribution, we fix the standard deviation, whereas for the Beta distribution, the shape parameters of each arm are set as a function of its sampled mean. We use a uniform prior for TS. In order to use TS on distributions other than Bernoulli, we follow the procedure proposed in Agrawal and Goyal (2013a): for a reward in [0, 1], we flip a coin with the probability of obtaining 1 equal to the reward, resulting in a binary "pseudo-reward". This pseudo-reward is then used to update the Beta posterior as in the Bernoulli case. For NPB and WB, we use the estimators in equations 3 and 5 respectively, with the same pseudo-counts for both.

In the Bernoulli case, NPB obtains a higher regret than both TS and WB, which are equivalent. For the other distributions, we observe that both WB and NPB obtain a lower cumulative regret than the modified TS procedure, with WB performing consistently better. This shows that for distributions that do not admit a conjugate prior, WB (and NPB) can be used directly and results in good empirical performance compared to making modifications to the TS procedure.
5.2 Contextual bandit setting
We adopt the one-versus-all multiclass classification setting for evaluating contextual bandits Agarwal et al. (2014); McNellis et al. (2017). Each arm corresponds to a class. In each round, the algorithm receives a reward of one if the context vector belongs to the class corresponding to the selected arm, and zero otherwise. Each arm maintains an independent set of sufficient statistics that map the context vector to the observed binary reward. We use two multiclass datasets: CoverType and MNIST. We run each experiment for a fixed number of rounds and average the results over independent runs. We experiment with LinUCB Abbasi-Yadkori et al. (2011), which we call UCB; linear Thompson sampling (TS) Agrawal and Goyal (2013b); ε-greedy (EG) Langford and Zhang (2008); nonparametric bootstrapping (NPB); and weighted bootstrapping (WB). For EG, NPB, and WB, we consider three model classes: linear regression (suffix "lin" in plots), logistic regression (suffix "log" in plots), and a single-hidden-layer fully-connected neural network (suffix "nn" in plots). Since we compare various bandit algorithms and model classes, we use the expected per-step reward
as our performance metric.

For EG, we experimented extensively with many different exploration schedules and found one annealing schedule that leads to the best performance on both of our datasets. In practice, it is not possible to do such tuning on a new problem. Therefore, the EG results in this paper should be viewed as a proxy for the "best" attainable performance. As alluded to in Section 1, we implement bootstrapping using stochastic optimization with warm-starting, in contrast to approximating it as in McNellis et al. (2017); Tang et al. (2015). Specifically, we use stochastic gradient descent to compute the MLE for the bootstrapped log-likelihood and warm-start the optimization at each round with the solution from the previous round. For linear and logistic regression, we optimize until we reach a fixed error threshold. For the neural network, we take one pass over the dataset in each round. To ensure that the results do not depend on our specific choice of optimization, we use scikit-learn Buitinck et al. (2013) with stochastic optimization and default optimization options for both linear and logistic regression. For the neural network, we use the Keras library Chollet (2015) with the ReLU nonlinearity for the hidden layer and a sigmoid in the output layer, along with SGD in its default configuration. Preliminary experiments suggested that our procedure leads to a better runtime than McNellis et al. (2017) and better performance than the approximation proposed in Tang et al. (2015), while also alleviating the need to tune any hyperparameters.

In the prior work on bootstrapping for contextual bandits McNellis et al. (2017); Tang et al. (2015), the algorithm was initialized through forced exploration, where each arm is explored a number of times at the beginning, or equivalently assigned pseudo-examples that are randomly sampled context vectors. Such a procedure introduces yet another tunable parameter. Therefore, we propose the following parameter-free procedure. Let
v_1, …, v_d be the eigenvectors of the covariance matrix of the context vectors, and λ_1, …, λ_d be the corresponding eigenvalues. For each arm, we add 4d pseudo-examples: for all j, we include the vectors +√λ_j v_j and −√λ_j v_j, each with both a positive and a negative label. Since √λ_j is the standard deviation of the features in the direction of v_j, this procedure ensures that we maintain enough variance in the directions where the contexts lie. In the absence of any prior information about the contexts, we recommend using samples from an isotropic multivariate Gaussian, and we validate that this leads to comparable performance on the two datasets.

We plot the expected per-step reward of all compared methods on the CoverType and MNIST datasets in figures 2(a) and 2(b), respectively. In figure 2(a), we observe that EG, NPB, and WB with logistic regression have the best performance in all rounds. The linear methods (EG, UCB, and bootstrapping) perform similarly, and slightly worse than logistic regression, whereas TS has the worst performance. Neural networks perform similarly to logistic regression and we do not plot them here. This experiment shows that even for a relatively simple dataset, like CovType, a more expressive nonlinear model can lead to better performance. This effect is more pronounced in figure 2(b). For this dataset, we only show the best-performing linear method, UCB. The performance of the other linear methods, including those with bootstrapping, is comparable to or worse than UCB. We observe that nonlinear models yield a much higher per-step reward, with the neural network performing the best. For both logistic regression and neural networks, the performance of the two bootstrapping methods is similar and only slightly worse than that of a tuned EG method. Both NPB and WB are also computationally efficient, with comparable per-round runtimes for logistic regression on both datasets and for the neural network on MNIST.
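The eigenvector-based initialization described above can be sketched as follows (an illustrative implementation with our own names; binary labels are assumed, as in the one-versus-all experiments):

```python
import numpy as np

def eigen_pseudo_examples(contexts):
    """Sketch of the proposed parameter-free initialization: from the
    eigendecomposition of the context covariance matrix, add the vectors
    +/- sqrt(lambda_j) v_j for every eigenpair (lambda_j, v_j), each paired
    with both a positive and a negative label, so each arm starts with 4d
    pseudo-examples carrying variance in every direction of the contexts."""
    cov = np.cov(np.asarray(contexts), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # columns of eigvecs are v_j
    xs, ys = [], []
    for lam, v in zip(eigvals, eigvecs.T):
        scale = np.sqrt(max(lam, 0.0))          # std. deviation along v
        for x in (scale * v, -scale * v):
            for label in (1.0, 0.0):
                xs.append(x)
                ys.append(label)
    return np.array(xs), np.array(ys)
```

Because each direction receives both labels with equal weight, the pseudo-examples inject variance into the initial fit without biasing any arm toward positive or negative predictions.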
6 Discussion
We showed that the commonly used nonparametric bootstrapping procedure can be provably inefficient. As an alternative, we proposed the weighted bootstrapping procedure, special cases of which become equivalent to TS for common reward distributions such as Bernoulli and Gaussian. On the empirical side, we showed that the WB procedure has better performance than a modified TS scheme for several bounded distributions in the bandit setting. In the contextual bandit setting, we provided guidelines to make bootstrapping simple and efficient to implement and showed that nonlinear versions of bootstrapping have good empirical performance. Our work raises several open questions: does bootstrapping result in near-optimal regret for generalized linear models? Under what assumptions or modifications can NPB be shown to have good performance? On the empirical side, evaluating bootstrapping across multiple datasets and comparing it against TS with approximate sampling is an important future direction.
References
 Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 Agarwal et al. (2014) A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
 Agrawal and Goyal (2013a) S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013a.
 Agrawal and Goyal (2013b) S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013b.
 Arratia and Gordon (1989) R. Arratia and L. Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125–131, Jan 1989. ISSN 1522-9602.
 Auer (2002) P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 Baransi et al. (2014) A. Baransi, O.-A. Maillard, and S. Mannor. Sub-sampling for multi-armed bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 115–131. Springer, 2014.
 Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
 Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 Buitinck et al. (2013) L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikitlearn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
 Chollet (2015) F. Chollet. keras. https://github.com/fchollet/keras, 2015.
 Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.
 Dani et al. (2008) V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
 Eckles and Kaptein (2014) D. Eckles and M. Kaptein. Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009, 2014.
 Efron (1992) B. Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992.
 Filippi et al. (2010) S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
 Friedman et al. (2001) J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
 Jun et al. (2017) K.-S. Jun, A. Bhargava, R. Nowak, and R. Willett. Scalable generalized linear bandits: Online computation and hashing. arXiv preprint arXiv:1706.00136, 2017.
 Kakade et al. (2008) S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th international conference on Machine learning, pages 440–447. ACM, 2008.
 Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.

 Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 Li et al. (2017) L. Li, Y. Lu, and D. Zhou. Provable optimal algorithms for generalized linear contextual bandits. arXiv preprint arXiv:1703.00048, 2017.
 McNellis et al. (2017) R. McNellis, A. N. Elmachtoub, S. Oh, and M. Petrik. A practical method for solving contextual bandit problems using decision trees. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.

 Newton and Raftery (1994) M. A. Newton and A. E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B (Methodological), pages 3–48, 1994.
 Osband and Van Roy (2015) I. Osband and B. Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
 Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
 Pandey et al. (2007) S. Pandey, D. Chakrabarti, and D. Agarwal. MultiArmed Bandit Problems with Dependent Arms. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 721–728, New York, NY, USA, 2007. ACM.
 Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
 Rusmevichientong and Tsitsiklis (2010) P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
 Tang et al. (2015) L. Tang, Y. Jiang, L. Li, C. Zeng, and T. Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332. ACM, 2015.
 Wang et al. (2005) C.-C. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50:338–355, 2005.
 Zhang et al. (2016) L. Zhang, T. Yang, R. Jin, Y. Xiao, and Z. Zhou. Online stochastic linear optimization under one-bit feedback. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 392–401, 2016.
Appendix A Proof for Theorem 1
We prove Theorem 1 in this section. First, we have the following tail bound for Binomial random variables:
Proposition 2 (Binomial Tail Bound).
Assume that the random variable $X \sim \mathrm{Bin}(n, p)$; then for any $\tau$ s.t. $p < \tau < 1$, we have
$$\mathbb{P}\left(X \geq \tau n\right) \leq \exp\left(-n \, \mathrm{kl}(\tau, p)\right),$$
where $\mathrm{kl}(\tau, p)$ is the KL-divergence between $\mathrm{Bern}(\tau)$ and $\mathrm{Bern}(p)$.
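As a quick numerical sanity check of this proposition, the following sketch compares the exact Binomial tail with the large-deviation bound $\exp(-n \, \mathrm{kl}(\tau, p))$ (the specific values of $n$, $p$, and $\tau$ are illustrative, not from the paper):

```python
import math

def kl_bernoulli(a, b):
    """KL-divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def binom_tail(n, p, k):
    """Exact P(X >= k) for X ~ Bin(n, p), summed from the pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n, p, tau = 100, 0.3, 0.5
exact = binom_tail(n, p, math.ceil(tau * n))
bound = math.exp(-n * kl_bernoulli(tau, p))
# The exact tail probability never exceeds the large-deviation bound.
assert exact <= bound
```

The exact tail is in fact substantially smaller than the bound here, which is consistent with the bound being loose only by a polynomial factor.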
Notice that for our considered case, the “observation history” of the agent at the beginning of time is completely characterized by a triple , where is the number of times arm has been pulled from time to and the realized reward is , plus the pseudo count ; similarly, is the number of times arm has been pulled from time to and the realized reward is , plus the pseudo count . Moreover, conditioning on this history , the probability that the agent will pull arm under NPB only depends on . To simplify the exposition, we use to denote this conditional probability. The following lemma bounds this probability under a “bad” history:
Lemma 1.
Consider a “bad” history with and for some integer ; then we have
Proof.
Recall that by definition, we have
(6) 
where (a) follows from the NPB procedure in this case, and (b) follows from Proposition 2. Specifically, recall that , and for . Thus, the conditions of Proposition 2 hold in this case. Furthermore, we have
(7) 
where (c) follows from the fact that for . Thus we have
(8) 
∎
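The NPB selection step analyzed above — resample each arm's observed rewards with replacement and pull the arm whose bootstrap sample has the highest mean — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the two-armed Bernoulli histories, and the pseudo-observations (one reward of each kind per arm) are assumptions for the example.

```python
import random

def npb_choose_arm(histories, rng):
    """One NPB round: resample each arm's reward history with replacement
    and return the arm with the highest bootstrap-sample mean
    (ties broken uniformly at random)."""
    means = []
    for h in histories:
        sample = [rng.choice(h) for _ in h]  # bootstrap resample, same size as h
        means.append(sum(sample) / len(sample))
    best = max(means)
    return rng.choice([i for i, m in enumerate(means) if m == best])

rng = random.Random(0)
# Two arms; each history starts with pseudo-observations 0 and 1.
histories = [[0, 1, 1, 1], [0, 1, 0, 0]]
arm = npb_choose_arm(histories, rng)
assert arm in (0, 1)
```

With these histories the empirically better arm 0 is pulled most of the time, but the bootstrap still occasionally selects arm 1, which is exactly the (possibly insufficient) exploration the lemma quantifies.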
The following technical lemma derives the expected value of a truncated geometric random variable, as well as a lower bound on it, which will be used in the subsequent analysis:
Lemma 2 (Expected Value of Truncated Geometric R.V.).
Assume that $Y$ is a truncated geometric r.v. with parameter $p \in (0, 1)$ and integer $m \geq 1$. Specifically, the domain of $Y$ is $\{1, 2, \ldots, m\}$, $\mathbb{P}(Y = k) = p(1-p)^{k-1}$ for $k = 1, \ldots, m-1$ and $\mathbb{P}(Y = m) = (1-p)^{m-1}$. Then we have
$$\mathbb{E}[Y] = \frac{1 - (1-p)^m}{p} \geq \frac{1}{2} \min\left\{\frac{1}{p}, \, m\right\}.$$
Proof.
Notice that by definition, we have
Define the shorthand notation , we have
(9) 
Recall that , we have proved that .
Now we prove the lower bound. First, we prove that
(10) 
always holds by induction on . Notice that when , the LHS of equation (10) is , and the RHS of equation (10) is . Hence, this inequality trivially holds in the base case. Now, assuming that equation (10) holds for , we prove that it also holds for . Notice that
(11) 
where (a) follows from the induction hypothesis. Thus equation (10) holds for all and . Notice that equation (10) implies that
We now prove the lower bound. Notice that for any , is an increasing function of , thus for , we have
On the other hand, if , we have
Combining the above results, we have proved the lower bound on . ∎
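Both the closed form and the lower bound in Lemma 2 can be checked directly against the pmf. The sketch below assumes the statement in the form $\mathbb{E}[Y] = (1-(1-p)^m)/p \geq \tfrac{1}{2}\min\{1/p, m\}$; the function name and the grid of parameters are illustrative:

```python
def truncated_geometric_mean(p, m):
    """E[Y] computed directly from the truncated-geometric pmf:
    P(Y = k) = p (1-p)^(k-1) for k < m, and P(Y = m) = (1-p)^(m-1)."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, m)) + m * (1 - p) ** (m - 1)

for p in (0.05, 0.2, 0.7):
    for m in (1, 5, 50):
        mean = truncated_geometric_mean(p, m)
        closed_form = (1 - (1 - p) ** m) / p
        assert abs(mean - closed_form) < 1e-9   # closed form matches the pmf
        assert mean >= 0.5 * min(1 / p, m)      # lower bound of Lemma 2
```

The two regimes of the lower bound are visible in the grid: for $p \geq 1/m$ the mean is governed by $1/p$, and for $p < 1/m$ it is governed by the truncation level $m$.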
We then prove the following lemma:
Lemma 3 (Regret Bound Based on ).
When NPB is applied in the considered case, for any integer and time horizon satisfying , we have
Proof.
We start by defining the bad event as
Thus, we have . Since for all , with probability , the agent will pull arm infinitely often. Moreover, the event only depends on the outcomes of the first pulls of arm . Thus we have . Furthermore, conditioning on , we define the stopping time as
Then we have
(12) 
Notice that conditioning on event , in the first steps, the agent either pulls arm or pulls arm but receives a reward ; thus, by definition of , we have
On the other hand, if , notice that for any time with history s.t. , the agent will pull arm conditionally independently with probability . Thus, conditioning on , the number of times the agent will pull arm before it pulls arm again follows the truncated geometric distribution with parameter and . From Lemma 2, for any , we have
(13) 
The factor of in inequality (a) is due to the reward gap. Inequality (b) follows from the fact that ; inequality (c) follows from Lemma 1, which states that for , we have ; inequality (d) follows from the fact that for , we have
Finally, notice that
Thus, combining everything together, we have
(14) 
where the last equality follows from the fact that for . This concludes the proof. ∎
Finally, we prove Theorem 1.
Proof.
For any given , we choose . Since
we have
thus, Lemma 3 is applicable. Notice that
Furthermore, we have
where the first inequality follows from . On the other hand, we have
where the last inequality follows from the fact that , since . Notice that we have
where the first inequality follows from the fact that , and the second inequality follows from . Putting it together, we have
This concludes the proof for Theorem 1. ∎
Appendix B Proof for Theorem 2
For simplicity of exposition, we consider arms with means . Let . Let be the mean of the history of arm at time and be the mean of the bootstrap sample of arm at time . Note that both are random variables. Each arm is initially explored times. Since and are estimated from random samples of size at least , we get from Hoeffding’s inequality (Theorem 2.8 in Boucheron et al. (2013)) that
for any and time . The first two inequalities hold for any and . The last two hold for any and , and therefore also in expectation over their random realizations. Let be the event that the above inequalities hold jointly at all times and be the complement of event . Then by the union bound,
By the design of the algorithm, the expected step regret is bounded from above as
where the last inequality follows from the definition of event and observation that the maximum step regret is . Let
where is a tunable parameter that determines the number of exploration steps per arm. From the definition of and , and the fact that when , we have that
Finally, note that and we choose that optimizes the upper bound.
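The concentration step in the proof above rests on Hoeffding's inequality for bounded sample means. As an empirical illustration (the parameters and the function name are assumed for the example, and the two-sided Bernoulli form $\mathbb{P}(|\hat{\mu} - \mu| \geq \epsilon) \leq 2\exp(-2s\epsilon^2)$ is the standard one, not a quantity from the paper):

```python
import math
import random

def deviation_freq(mu, s, eps, trials, rng):
    """Empirical frequency of |sample mean - mu| >= eps over many runs
    of s Bernoulli(mu) draws."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(s)) / s
        count += abs(mean - mu) >= eps
    return count / trials

rng = random.Random(0)
mu, s, eps = 0.5, 100, 0.15
freq = deviation_freq(mu, s, eps, trials=2000, rng=rng)
bound = 2 * math.exp(-2 * s * eps**2)  # Hoeffding bound, about 0.022 here
assert freq <= bound
```

The empirical deviation frequency is far below the bound, which is why the union bound over all times in the proof still yields a small failure probability.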
Appendix C Weighted bootstrapping and equivalence to TS
In this section, we prove that for the common reward distributions, WB becomes equivalent to TS for specific choices of the weight distribution and the transformation function.
C.1 Using multiplicative exponential weights
In this subsection, we consider multiplicative exponential weights, implying that and . We show that in this setting WB is mathematically equivalent to TS for Bernoulli and, more generally, categorical rewards.
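Before the formal proof, the equivalence can be illustrated numerically for Bernoulli rewards: with i.i.d. Exp(1) multiplicative weights, the weight sums over the positive and negative examples are Gamma-distributed, so the weighted fraction of positive examples is Beta-distributed, matching a Thompson-sampling posterior draw. The variable names and counts below are illustrative assumptions:

```python
import random

def wb_bernoulli_sample(n_pos, n_neg, rng):
    """Weighted-bootstrap estimate for Bernoulli rewards with Exp(1)
    multiplicative weights: W1 / (W1 + W0), where W1 (resp. W0) is the
    total weight of the positive (resp. negative) examples."""
    w1 = sum(rng.expovariate(1.0) for _ in range(n_pos))  # W1 ~ Gamma(n_pos, 1)
    w0 = sum(rng.expovariate(1.0) for _ in range(n_neg))  # W0 ~ Gamma(n_neg, 1)
    return w1 / (w1 + w0)

rng = random.Random(0)
alpha, beta = 3, 7
draws = [wb_bernoulli_sample(alpha, beta, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
# W1/(W1+W0) ~ Beta(alpha, beta), whose mean is alpha/(alpha+beta) = 0.3,
# exactly the distribution a Beta-Bernoulli TS draw would use.
assert abs(mean - alpha / (alpha + beta)) < 0.01
```

This is the standard Gamma-to-Beta construction; the subsection below makes the same argument exactly rather than by simulation.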
C.1.1 Proof for Proposition 1
Proof.
Recall that the bootstrap sample is given as:
To characterize the distribution of , let us define and as the sum of weights for the positive and negative examples respectively. Formally,
The sample can then be rewritten as:
Observe that (and ) is the sum of (and respectively) exponentially distributed random variables. Hence,