Residual Bootstrap Exploration for Bandit Algorithms

In this paper, we propose a novel perturbation-based exploration method for bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (ReBoot). ReBoot enforces exploration by injecting data-driven randomness through a residual-based perturbation mechanism. This mechanism captures the underlying distributional properties of fitting errors and, more importantly, boosts exploration to escape from suboptimal solutions (for small sample sizes) by inflating the variance level in an unconventional way. In theory, with an appropriate variance inflation level, ReBoot provably secures instance-dependent logarithmic regret in Gaussian multi-armed bandits. We evaluate ReBoot in different synthetic multi-armed bandit problems and observe that ReBoot performs better for unbounded rewards and more robustly than Giro (Kveton et al., 2018) and PHE (Kveton et al., 2019), with computational efficiency comparable to the Thompson sampling method.

1 Introduction

A sequential decision problem is characterized by an agent who interacts with an uncertain environment and maximizes cumulative rewards (Sutton and Barto, 2018). To learn to make optimal decisions as soon as possible, the agent must balance exploiting the current best decision to accumulate instant rewards against executing an exploratory decision to optimize future rewards. In the literature (Lattimore and Szepesvári, 2018), a gold standard is that we should explore more where we are not sufficiently confident. Thus, a principled way to understand and utilize uncertainty quantification lies at the core of sequential decision making.

The most popular approaches with theoretical guarantees are based on the optimism principle. Upper confidence bound (UCB) type algorithms (Auer et al., 2002; Abbasi-Yadkori et al., 2011) maintain a confidence set such that the agent acts in an optimistic environment. Thompson sampling (TS) type algorithms (Russo et al., 2018) maintain a posterior distribution over model parameters and then act optimistically with respect to samples from it. However, these types of algorithms are known to be hard to generalize to structured problems (Kveton et al., 2018). Thus, we may ask the following question:

Can we design a better principle that can provably quantify uncertainties and is easy to generalize?

We follow the line of bootstrap exploration (Osband and Van Roy, 2015; Elmachtoub et al., 2017; Kveton et al., 2018), which is known to generalize easily to structured problems. In this work, we carefully design a type of “perturbed” residual bootstrap as a data-dependent exploration in bandits, which may also be viewed as a “follow the bootstrap leader” algorithm in a general sense. The main principle is that the residual bootstrap from the statistics literature (Mammen, 1993) can be adapted to capture the underlying distributional properties of fitting errors. It turns out that the resulting level of exploration leads to the optimal regret. In this case, we call the employed perturbation a “regret-optimal perturbation” scheme. The regret-optimal perturbation is obtained through appropriate uncertainty boosting based on the residual sum of squares (RSS).

Our contributions:

• We propose a novel residual bootstrap exploration algorithm that maintains the generalization property and works for both bounded and unbounded rewards;

• We prove an optimal instance-dependent regret for an instance of ReBoot for unbounded rewards (Gaussian bandit). This is a non-trivial extension beyond Bernoulli rewards (Kveton et al., 2018). We utilize sharp lower bounds for the normal distribution function and carefully design the variance inflation level;

• We empirically compare ReBoot with several provably competitive methods and demonstrate its superior performance on a variety of unbounded reward distributions while preserving computational efficiency.

Related Works. Giro (Kveton et al., 2018) directly perturbs the historical rewards by nonparametric bootstrapping and adding deterministic pseudo rewards. One limitation is that Giro is only suitable for bounded rewards. The range of pseudo rewards usually depends on the extreme values of the rewards, which could be infinite for an unbounded distribution. In this case, there is no principled guidance for choosing appropriate values for the pseudo rewards. Technically, the analysis in Kveton et al. (2018) relies heavily on a beta-binomial transformation, and is thus hard to extend to unbounded rewards, e.g., the Gaussian bandit. Another related work, Elmachtoub et al. (2017), utilized the bootstrap to randomize a decision-tree estimator in contextual bandits but did not provide any optimal regret guarantee.

PHE (Kveton et al., 2019) randomizes the history by directly injecting Bernoulli noise and provides a regret guarantee limited to the bounded reward case. Lu and Van Roy (2017) proposed ensemble sampling, which injects Gaussian noise to approximate the posterior distribution in Thompson sampling. However, their regret guarantee has an irreducible term that depends linearly on the time horizon even when the posterior can be calculated exactly. In practice, it may also be hard to decide what kind of noise and how much noise should be injected, since this is not a purely data-dependent approach.

Another line of work uses the bootstrap to construct sharper confidence intervals in UCB-type algorithms. Hao et al. (2019) proposed a nonparametric and data-dependent UCB algorithm based on the multiplier bootstrap and derived a second-order correction term to boost the agent away from sub-optimal solutions. However, their approach is computationally expensive since, at each round, the history must be resampled a large number of times. By contrast, ReBoot only needs to resample once at each round.

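Notations. Throughout the paper, we denote by $[n]$ the set $\{1, 2, \ldots, n\}$. For a set $E$, we denote its complement by $E^c$. We denote by $N(\mu, \sigma^2)$ the Gaussian distribution with mean parameter $\mu$ and variance parameter $\sigma^2$. We write $X \sim (\mu, \sigma^2)$ for a random variable $X$ if its distribution has mean $\mu$ and variance $\sigma^2$. We write $a \lesssim b$ if $a \le C b$ for some constant $C$.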

2 Residual Bootstrap Exploration (ReBoot)

In this section, we first briefly discuss in Section 2.1 why vanilla residual bootstrap exploration may not work for multi-armed bandit problems. This motivates the ReBoot algorithm as a remedy to be presented in Section 2.2 and the full description of each step will be given in Section 2.3. Discussion and interpretation on the tuning parameter is given in Section 2.4.

Problem setup. We present our approach in the stochastic multi-armed bandit (MAB) problem. There are $K$ arms, and each arm $k \in [K]$ has a reward distribution with an unknown mean parameter $\mu_k$. Without loss of generality, we assume arm 1 is the optimal arm, that is, $\mu_1 = \max_{k \in [K]} \mu_k$. Specifically, the agent interacts with a bandit environment for $T$ rounds. In round $t \in [T]$, the agent pulls an arm $I_t \in [K]$ and observes a reward $r_t$. The objective is to minimize the expected cumulative regret, defined as

$$R(T) = T\mu_1 - E\Big[\sum_{t=1}^{T} r_t\Big] = \sum_{k=2}^{K} \Delta_k\, E\Big[\sum_{t=1}^{T} I\{I_t = k\}\Big], \tag{1}$$

where $\Delta_k = \mu_1 - \mu_k$ is the sub-optimality gap for arm $k$, and $I\{\cdot\}$ is an indicator function. Here, the second equality follows from the regret decomposition lemma (Lemma 4.5 in Lattimore and Szepesvári (2018)).
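As a small illustration of the decomposition in (1), the sketch below evaluates the regret of a fixed sequence of pulls in two ways: from the mean rewards of the pulled arms, and from per-arm pull counts weighted by the gaps. The function names and the use of known arm means are assumptions made for illustration only; the expectation in (1) is replaced here by a single realized sequence of pulls.

```python
import numpy as np

def regret_from_rewards(means, arms_pulled):
    """First form in (1): T * mu_1 minus the summed mean reward of the pulled arms."""
    means = np.asarray(means, dtype=float)
    arms_pulled = np.asarray(arms_pulled, dtype=int)
    return float(len(arms_pulled) * means.max() - means[arms_pulled].sum())

def regret_from_gaps(means, arms_pulled):
    """Second form in (1): sum over arms of gap Delta_k times the pull count of arm k."""
    means = np.asarray(means, dtype=float)
    counts = np.bincount(np.asarray(arms_pulled, dtype=int), minlength=means.size)
    gaps = means.max() - means  # Delta_k = mu_1 - mu_k (zero for the best arm)
    return float(np.dot(gaps, counts))
```

Both functions return the same value for any pull sequence, which is the content of the regret decomposition lemma in this simplified setting.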

2.1 The failure of vanilla residual bootstrap

Bootstrapping a size-$s$ reward sample set of arm $k$, $\{Y_{k,i}\}_{i=1}^{s}$, via the residual bootstrap method (Mammen, 1993) consists of the following four steps:

1. Compute the average reward $\bar{Y}_{k,s} = s^{-1}\sum_{i=1}^{s} Y_{k,i}$.

2. Compute the residuals $e_{k,i} = Y_{k,i} - \bar{Y}_{k,s}$ for each reward sample with $i \in [s]$.

3. Generate bootstrap weights (random variables with zero mean and unit variance) $\{w_i\}_{i=1}^{s}$.

4. Add the average of the perturbed residuals to the reward average to get the perturbed reward average $\bar{Y}^*_{k,s}$.

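A distinctive feature of $\bar{Y}^*_{k,s}$ is that it preserves the empirical variation within the current data set. That is, the conditional variance of the perturbed reward average given the reward sample set is the uncertainty quantified by the reward average $\bar{Y}_{k,s}$. To see this, notice that, from the above residual bootstrap procedure, the perturbed reward average admits the representation

$$\bar{Y}^*_{k,s} = \bar{Y}_{k,s} + \frac{1}{s}\sum_{i=1}^{s} w_i \cdot e_{k,i}. \tag{2}$$

Since the bootstrap weights are required to have zero mean and unit variance, the distribution of the perturbed average conditional on the current data set has mean and variance

$$E\big[\bar{Y}^*_{k,s} \mid \{Y_{k,i}\}_{i=1}^{s}\big] = \bar{Y}_{k,s}, \qquad \mathrm{Var}\big(\bar{Y}^*_{k,s} \mid \{Y_{k,i}\}_{i=1}^{s}\big) = \frac{1}{s^2}\sum_{i=1}^{s} e_{k,i}^2, \tag{3}$$

which means that its expectation equals the sample average and its variance can be represented as $s^{-2}\,\mathrm{RSS}_{k,s}$, where

$$\mathrm{RSS}_{k,s} = \sum_{i=1}^{s} e_{k,i}^2 \tag{4}$$

is the residual sum of squares.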
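The four steps above can be sketched in a few lines. This is a minimal illustration, assuming Gaussian bootstrap weights (any zero-mean, unit-variance weights would fit the description) and hypothetical function names; it returns one draw of the perturbed reward average and, separately, the conditional variance $s^{-2}$ times the residual sum of squares.

```python
import numpy as np

def vanilla_perturbed_mean(rewards, rng):
    """One draw of the perturbed reward average from the vanilla residual bootstrap."""
    rewards = np.asarray(rewards, dtype=float)
    s = rewards.size
    mean = rewards.mean()                    # step 1: average reward
    residuals = rewards - mean               # step 2: residuals
    weights = rng.standard_normal(s)         # step 3: weights (Gaussian is an assumption)
    return float(mean + np.dot(weights, residuals) / s)  # step 4: perturbed average

def vanilla_conditional_variance(rewards):
    """Conditional variance of the perturbed average: RSS / s^2."""
    rewards = np.asarray(rewards, dtype=float)
    rss = np.sum((rewards - rewards.mean()) ** 2)
    return float(rss / rewards.size ** 2)
```

Note that when all observed rewards of an arm happen to be equal, the residuals vanish and the perturbed average carries no randomness at all, which is one way to see why this level of exploration can be too small.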

Remark 1.

Note that the RSS in (4) is a standard measure of goodness of fit in statistics, that is, of how well a statistical model fits the current data set. At first glance, it seems that the variance of the bootstrap-based mean estimator should hint at the right amount of randomness for exploration. This intuition guides a different type of exploration in the MAB problem, as elaborated in the following paragraph.

Vanilla residual bootstrap exploration.

As is well recognized in the bandit literature (Lattimore and Szepesvári, 2018), a policy using the reward average as the arm index (the Follow-the-Leader algorithm) can incur linear regret in multi-armed bandit problems (Figure 1; blue line). Alternatively, a policy using the perturbed reward average via residual bootstrap as in (3) induces a data-driven exploration at the level of the statistical uncertainty of the current reward sample set ($s^{-2}\,\mathrm{RSS}_{k,s}$).

Exploration at the level of the current reward sample set’s statistical uncertainty is a mixture of hopes and concerns. On the one hand, we hope the data-driven exploration level ($s^{-2}\,\mathrm{RSS}_{k,s}$) provides the right amount of randomness for escaping from suboptimal solutions; on the other hand, we are concerned that a strongly deviated average of reward samples or a poor fit of the adopted statistical model harms the performance of bandit algorithms.

Unfortunately, we observe in empirical experiments that a policy using the perturbed reward average as the arm index, despite improving on the Follow-the-Leader algorithm, can still incur linear regret (Figure 1; orange line). The problem is that the exploration level of the perturbed reward average, inherited from the statistical uncertainty of the current reward samples, is not sufficient, leading to under-exploration.

Surprisingly, after we carefully perturb the reward average in an unconventional way, we observe that the bandit algorithm successfully secures sublinear regret in the experiment (Figure 1; green line). Summarizing this experience, we devise a “regret-optimal” perturbation scheme and propose it as the ReBoot algorithm.

2.2 Algorithm ReBoot

We propose the ReBoot algorithm, which explores via residual bootstrap with a regret-optimal perturbation scheme. Specifically, at each round, for each arm $k \in [K]$ (denoting by $H_{k,s}$ the history of arm $k$: a vector of all $s$ rewards received so far by pulling arm $k$), ReBoot computes an index for arm $k$ via four steps:

1. Compute the average reward $\bar{Y}_{k,s}$.

2. Compute the residuals $e_{k,i} = Y_{k,i} - \bar{Y}_{k,s}$ for all $i \in [s]$. Append $e_{k,s+1} = \sqrt{s+2}\,\sigma_a$ and $e_{k,s+2} = -\sqrt{s+2}\,\sigma_a$ as two pseudo residuals ($\sigma_a$ is a tuning parameter).

3. Generate bootstrap weights (random variables with zero mean and unit variance) $\{w_i\}_{i=1}^{s+2}$.

4. Add the average of the perturbed residuals to the reward average to get the arm index $\hat{\mu}^*_k$.

Then, ReBoot follows the bootstrap leader. That is, ReBoot pulls the arm with the highest arm index; formally,

$$I_t = \arg\max_{k \in [K]} \hat{\mu}^*_k. \tag{5}$$

A summarized ReBoot algorithm for the MAB problem is presented in Algorithm 1. We explain the residual bootstrap (Step 4) and discuss the choice of bootstrap weights (Step 3) in Section 2.3, and we explain the reason for adding pseudo residuals in Section 2.4.

2.3 Residual bootstrap exploration with regret optimal perturbation scheme

In this subsection, we illustrate how to implement residual bootstrap exploration with the proposed regret-optimal perturbation scheme in the MAB problem. We give full descriptions of the four steps of ReBoot, with discussion of how the proposed regret-optimal perturbation scheme boosts the uncertainty to escape from suboptimal solutions. Before proceeding to the exact description of our proposed policy, it is helpful to introduce further notation for MAB. At round $t$, the number of pulls of arm $k$ is denoted by $s_k$.

Step 1. Compute the average reward of history

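We first describe the historical reward vector $H_{k,s}$ for arm $k$ when arm $k$ has been pulled $s$ times by round $t$. Denote the $i$-th entry of $H_{k,s}$ by $Y_{k,i}$, the reward of arm $k$ received after the $i$-th pull.

Associated with a historical reward vector $H_{k,s}$ is the average reward $\bar{Y}_{k,s} = s^{-1}\sum_{i=1}^{s} Y_{k,i}$.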

Step 2. Compute residuals and pseudo residuals

Given the average reward $\bar{Y}_{k,s}$ of the historical reward vector $H_{k,s}$, compute the residual set $\{e_{k,i}\}_{i=1}^{s}$ with $e_{k,i} = Y_{k,i} - \bar{Y}_{k,s}$. Note that the residual set carries the statistical uncertainty within the history $H_{k,s}$, contributing part of the exploration level used in ReBoot. As we saw in Section 2.1, this exploration level alone is not enough to secure sublinear regret.

Encouraging exploration to escape from suboptimal solutions, especially when the number of rewards is small, requires a careful design of the perturbation scheme. Our proposal for a variance-inflated average of perturbed rewards applies the following three steps. First, specify an exploration aid unit $\sigma_a$, a tuning parameter of ReBoot. Second, generate pseudo residuals by the scheme

$$e_{k,s+1} = \sqrt{s+2}\cdot\sigma_a, \tag{6}$$
$$e_{k,s+2} = \sqrt{s+2}\cdot(-\sigma_a). \tag{7}$$

Last, append the pseudo residuals to $\{e_{k,i}\}_{i=1}^{s}$ to form an augmented residual set $\{e_{k,i}\}_{i=1}^{s+2}$.

Step 3. Generate bootstrap weights

We generate bootstrap weights $\{w_i\}_{i=1}^{s+2}$ by drawing i.i.d. random variables from a distribution with zero mean and unit variance. As recommended in the residual bootstrap literature (Mammen, 1993), choices of bootstrap weights include Gaussian weights, Rademacher weights and skew-correcting weights.
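To make the three weight families concrete, the sketch below draws weights of each kind. The function name and the `kind` strings are hypothetical. For the skew-correcting case, the code assumes the two-point distribution commonly attributed to Mammen (1993), which has zero mean, unit variance and unit third moment; the text does not say which skew-correcting weights are intended, so this choice is an assumption.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)
# Assumed skew-correcting choice: Mammen's two-point distribution.
MAMMEN_VALUES = np.array([-(SQRT5 - 1.0) / 2.0, (SQRT5 + 1.0) / 2.0])
MAMMEN_PROBS = np.array([(SQRT5 + 1.0) / (2.0 * SQRT5), (SQRT5 - 1.0) / (2.0 * SQRT5)])

def draw_weights(n, kind, rng):
    """Draw n i.i.d. bootstrap weights with zero mean and unit variance."""
    if kind == "gaussian":
        return rng.standard_normal(n)
    if kind == "rademacher":
        return rng.choice(np.array([-1.0, 1.0]), size=n)  # +1 or -1 with equal probability
    if kind == "mammen":
        return rng.choice(MAMMEN_VALUES, size=n, p=MAMMEN_PROBS)
    raise ValueError(f"unknown weight kind: {kind}")
```

All three families satisfy the zero-mean, unit-variance requirement; they differ in higher moments, which matters only for how closely the bootstrap distribution tracks the skewness of the residuals.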

Step 4. Perturb average reward with residuals

The arm index used in ReBoot is then computed by summing the average reward $\bar{Y}_{k,s}$ and the average of the perturbed augmented residual set $\{w_i \cdot e_{k,i}\}_{i=1}^{s+2}$; formally,

$$\hat{\mu}^*_{k,t} = \bar{Y}_{k,s} + \frac{1}{s+2}\sum_{i=1}^{s+2} w_i \cdot e_{k,i}. \tag{8}$$

Remark 2.

Compared to the perturbed average reward (2) in vanilla residual bootstrap exploration, the arm index (8) used in ReBoot possesses an additional exploration level controlled by the tuning parameter $\sigma_a$ in equations (6) and (7). Intuitively, a larger $\sigma_a$ delivers stronger exploration assistance for the arm index (8), increasing the chance of escaping from suboptimal solutions.
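Steps 1 to 4 can be put together as follows. This is a minimal sketch rather than the authors' implementation: it assumes Gaussian weights, assumes every arm has been pulled at least once, and uses hypothetical function names. `reboot_index` computes one draw of the arm index (8), and `select_arm` follows the bootstrap leader.

```python
import numpy as np

def reboot_index(rewards, sigma_a, rng):
    """One draw of the ReBoot arm index (8) from an arm's reward history."""
    rewards = np.asarray(rewards, dtype=float)
    s = rewards.size                         # assumes s >= 1
    mean = rewards.mean()                    # step 1
    residuals = rewards - mean               # step 2: data residuals
    pseudo = np.sqrt(s + 2) * sigma_a * np.array([1.0, -1.0])  # pseudo residuals (6), (7)
    augmented = np.concatenate([residuals, pseudo])
    weights = rng.standard_normal(s + 2)     # step 3 (Gaussian weights assumed)
    return float(mean + np.dot(weights, augmented) / (s + 2))  # step 4

def select_arm(histories, sigma_a, rng):
    """Pull the arm whose index is largest (follow the bootstrap leader)."""
    indices = [reboot_index(h, sigma_a, rng) for h in histories]
    return int(np.argmax(indices))
```

Setting `sigma_a` to zero removes the pseudo residuals' contribution, although the average is still taken over $s+2$ terms, so the result is a slightly damped version of the vanilla scheme.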

How does ReBoot explore?

Here we explain, conditioning on the historical reward vector $H_{k,s}$, how the arm index used in ReBoot explores. By our perturbation scheme, at round $t$, the arm index admits the representation

$$\hat{\mu}^*_{k,t} = \bar{Y}_{k,s} + \frac{1}{s+2}\Big[\sum_{i=1}^{s} w_i \cdot e_{k,i} + \sum_{i=s+1}^{s+2} w_i \cdot e_{k,i}\Big], \tag{9}$$

where $e_{k,s+1}$ and $e_{k,s+2}$ are the pseudo residuals specified by the proposed perturbation scheme (equations (6) and (7)). The data-driven exploration, contributed by the perturbed residuals $\{w_i \cdot e_{k,i}\}_{i=1}^{s}$, reflects the statistical uncertainty of the current reward samples; the additional exploration aid, contributed by the perturbed pseudo residuals $\{w_i \cdot e_{k,i}\}_{i=s+1}^{s+2}$, echoes the expected statistical uncertainty on the scale of the specified exploration aid unit $\sigma_a$. The art is to tune the parameter $\sigma_a$ to strike a balance between these two sources of exploration, so as to avoid underexploration and secure sublinear regret. (See more discussion on the intuition and interpretation of $\sigma_a$ in Section 2.4.)

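To sum up, given a historical reward vector $H_{k,s}$, the conditional variance of the arm index (8) has the formula

$$\mathrm{Var}\big(\hat{\mu}^*_{k,t} \mid H_{k,s}\big) = \frac{\mathrm{RSS}_{k,s} + \mathrm{PRSS}_{s,\sigma_a}}{(s+2)^2}, \tag{10}$$

which consists of the residual sum of squares $\mathrm{RSS}_{k,s}$ (see (4)) from the perturbed residuals $\{w_i \cdot e_{k,i}\}_{i=1}^{s}$, and the pseudo residual sum of squares,

$$\mathrm{PRSS}_{s,\sigma_a} = e_{k,s+1}^2 + e_{k,s+2}^2 = 2(s+2)\,\sigma_a^2, \tag{11}$$

from the perturbed pseudo residuals at the level of the tuning parameter $\sigma_a$. Later, in the formal regret analysis in Section 3, we will see how $\mathrm{PRSS}_{s,\sigma_a}$ can assist the arm index to prevent underexploration, leading to a pathway to a successful escape from suboptimal solutions.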
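The variance decomposition can be checked numerically. The sketch below computes the conditional variance as the residual sum of squares plus the pseudo residual sum of squares, divided by $(s+2)^2$, where the pseudo term equals $2(s+2)\sigma_a^2$ by (6) and (7). It then compares this with an exact enumeration over Rademacher weights; the enumeration and all function names are illustrative assumptions, feasible only for small $s$.

```python
import itertools
import numpy as np

def rss(rewards):
    """Residual sum of squares of the reward history."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum((rewards - rewards.mean()) ** 2))

def prss(s, sigma_a):
    """Pseudo residual sum of squares: two pseudo residuals of size sqrt(s+2)*sigma_a."""
    return 2.0 * (s + 2) * sigma_a ** 2

def index_variance(rewards, sigma_a):
    """Closed-form conditional variance of the arm index."""
    s = len(rewards)
    return (rss(rewards) + prss(s, sigma_a)) / (s + 2) ** 2

def exact_variance_rademacher(rewards, sigma_a):
    """Exact variance of the index under Rademacher weights, by enumerating all sign vectors."""
    rewards = np.asarray(rewards, dtype=float)
    s = rewards.size
    pseudo = np.sqrt(s + 2) * sigma_a * np.array([1.0, -1.0])
    augmented = np.concatenate([rewards - rewards.mean(), pseudo])
    values = [rewards.mean() + np.dot(signs, augmented) / (s + 2)
              for signs in itertools.product([-1.0, 1.0], repeat=s + 2)]
    return float(np.var(values))  # all sign vectors are equally likely
```

The closed form depends on the weights only through their zero mean and unit variance, so the same value is obtained for Gaussian weights.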

2.4 Use exploration aid unit σa to manage residual bootstrap exploration

In this subsection, we discuss the intuition and interpretation of the tuning parameter $\sigma_a$ in ReBoot.

Choice of exploration aid unit $\sigma_a$.

Now, we answer the question of what level of exploration aid is appropriate for MAB exploration. The art is to choose a level that can prevent the index $\hat{\mu}^*_{k,t}$ from underexploration. We first trace the heuristics, and then provide an exact implementation scheme.

Intuitively, we want to choose an exploration aid unit $\sigma_a$ such that the pseudo residual sum of squares $\mathrm{PRSS}_{s,\sigma_a}$ in (10) plays a major role when the number of reward samples is small and then gradually loses its importance. This consideration is an effort to preserve the statistical efficiency of the averaging procedure in the proposed regret-optimal perturbation scheme.

Manage exploration level via exploration aid unit $\sigma_a$. Set the exploration aid unit $\sigma_a$ at a scale that inflates the reward distribution standard deviation $\sigma$ by an inflation ratio $r$ such that

$$\sigma_a = r \cdot \sigma. \tag{12}$$

An immediate consequence of scheme (12) with a positive inflation ratio $r$ is a considerable uncertainty boost for the arm index (8), especially at the beginning of the bandit algorithm (the regime with few reward samples).

Note that under scheme (12), the pseudo residual sum of squares (11) admits the formula $\mathrm{PRSS}_{s,\sigma_a} = 2(s+2)\,r^2\sigma^2$. That is, for a size-$s$ reward sample set, the pseudo residuals $e_{k,s+1}$ and $e_{k,s+2}$ collectively contribute $2r^2/(s+2)$ times the reward distribution variance $\sigma^2$ to the arm index variance.

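Now we illustrate how the exploration aid unit $\sigma_a$ manages the exploration of the arm index $\hat{\mu}^*_{k,t}$ given a historical reward vector $H_{k,s}$. First, we note that, with a sufficiently large inflation ratio $r$, the data-driven variation $\mathrm{RSS}_{k,s}$ is dominated by the pseudo residual sum of squares $\mathrm{PRSS}_{s,\sigma_a}$; formally, for any number of reward samples $s$, the good event

$$G_{k,s} = \big\{\mathrm{RSS}_{k,s} \le \mathrm{PRSS}_{s,\sigma_a}\big\} \tag{13}$$

occurs with high probability. Second, on the good event $G_{k,s}$, formula (10) implies that the variance of the arm index given a historical reward vector $H_{k,s}$ stays within a pre-specified range proportional to $\mathrm{PRSS}_{s,\sigma_a}$; that is,

$$\frac{\mathrm{PRSS}_{s,\sigma_a}}{(s+2)^2} \le \mathrm{Var}\big(\hat{\mu}^*_{k,t} \mid H_{k,s}\big) \le \frac{2\,\mathrm{PRSS}_{s,\sigma_a}}{(s+2)^2}. \tag{14}$$

Last, under the variance inflation scheme (12) with inflation ratio $r$, $\mathrm{Var}(\hat{\mu}^*_{k,t} \mid H_{k,s})$ is enclosed in a range proportional to the reward distribution variance $\sigma^2$; i.e.,

$$\frac{2r^2}{s+2}\cdot\sigma^2 \le \mathrm{Var}\big(\hat{\mu}^*_{k,t} \mid H_{k,s}\big) \le \frac{4r^2}{s+2}\cdot\sigma^2. \tag{15}$$

The key consequence of the two-sided variance bound (15) on $\mathrm{Var}(\hat{\mu}^*_{k,t} \mid H_{k,s})$ is that, on the good event $G_{k,s}$, the index of arm $k$ given $s$ reward samples suffers neither severe overexploration nor severe underexploration. We are then able to avoid severe underestimation of the optimal arm and severe overestimation of the suboptimal arms.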
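The step from the good event to the two-sided bound (15) can be written out explicitly. The derivation below is a sketch under two assumptions that are inferred from (6), (7) and (15) rather than stated verbatim in the text: the good event is taken to be $\mathrm{RSS} \le \mathrm{PRSS}$, and the pseudo residual sum of squares is $\mathrm{PRSS} = 2(s+2)\sigma_a^2$ with $\sigma_a = r\sigma$.

```latex
\begin{align*}
\text{On the good event:}\quad
  & 0 \le \mathrm{RSS} \le \mathrm{PRSS}, \\
\text{hence}\quad
  & \frac{\mathrm{PRSS}}{(s+2)^2}
    \;\le\; \frac{\mathrm{RSS} + \mathrm{PRSS}}{(s+2)^2}
    \;\le\; \frac{2\,\mathrm{PRSS}}{(s+2)^2}, \\
\text{and with}\quad
  & \mathrm{PRSS} = 2(s+2)\,\sigma_a^2 = 2(s+2)\,r^2\sigma^2, \\
\text{we obtain}\quad
  & \frac{2r^2}{s+2}\,\sigma^2
    \;\le\; \mathrm{Var}\big(\hat{\mu}^*_{k,t} \mid H_{k,s}\big)
    \;\le\; \frac{4r^2}{s+2}\,\sigma^2 .
\end{align*}
```

Both ends of the range shrink at rate $1/(s+2)$, so the extra exploration fades as samples accumulate while never falling below a fixed multiple of $\sigma^2/(s+2)$.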

Practical choice of inflation ratio .

For practice, we recommend to choose the inflation ratio in MAB with unbounded reward. This choice is supported theoretically by formal analysis of Gaussian bandit presented in Regret Analysis section (Section 3) and empirically by experiments including Gaussian, exponential and logistic bandits in Figure 3 in Section 4.

Remark 3.

All treatments above did not impose distributional assumptions (i.e., the shape of distribution) on the arm reward (can be bounded or unbounded) and bootstrap weights (only assume zero mean and unit variance). In Section 2.5, we discuss the benefits of Gaussian bootstrap weight in ReBoot, leading to an efficient implementation with storage and computation cost as low as Thompson sampling.

2.5 Efficient implementation using Gaussian weight

A significant advantage of choosing Gaussian bootstrap weights is low storage and computational cost. This is due to the resulting conditional normality of the arm index (8), regardless of the underlying reward distribution. Under Gaussian bootstrap weights, the arm index of ReBoot conditional on the historical reward vector $H_{k,s}$ is Gaussian distributed with the sample average $\bar{Y}_{k,s}$ as its mean parameter and $(s+2)^{-2}(\mathrm{RSS}_{k,s} + \mathrm{PRSS}_{s,\sigma_a})$ as its variance parameter. That is,

$$\hat{\mu}^*_{k,t} \mid H_{k,s} \sim N\Big(\bar{Y}_{k,s},\ \frac{\mathrm{RSS}_{k,s} + \mathrm{PRSS}_{s,\sigma_a}}{(s+2)^2}\Big). \tag{16}$$

Then, it can be implemented efficiently by the following incremental updates. At round $t$, after pulling arm $k = I_t$, we update $\bar{Y}_{k,s}$ and $S_{k,s}$ by

$$\bar{Y}_{k,s} = \big[(s-1)\,\bar{Y}_{k,s-1} + Y_{k,s}\big]/s, \qquad S_{k,s} = S_{k,s-1} + Y_{k,s}^2, \tag{17}$$

and thus $\mathrm{RSS}_{k,s} = S_{k,s} - s\,\bar{Y}_{k,s}^2$. The pseudo-residual term in (9) can be computed by a similar efficient approach. This implementation, made possible by Gaussian bootstrap weights, saves both storage and computational cost and makes ReBoot as efficient as TS (Agrawal and Goyal, 2013). We compare ReBoot with TS, Giro (Kveton et al., 2018), and PHE (Kveton et al., 2019) on storage and computational cost in Table 1. An empirical comparison of computational cost is given in Section 4.3.
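The incremental updates in (17) suggest a constant-memory implementation per arm. The following is a sketch under stated assumptions, not the authors' code: the class and function names are hypothetical, the pseudo residual sum of squares is taken to be $2(s+2)\sigma_a^2$ as implied by (6) and (7), and each arm is pulled once at the start so that every index is based on at least one sample.

```python
import numpy as np

class GaussianReBootArm:
    """Running statistics for one arm; the index is drawn directly from a Gaussian."""

    def __init__(self, sigma_a):
        self.sigma_a = sigma_a
        self.s = 0          # number of pulls
        self.mean = 0.0     # running average of rewards
        self.sum_sq = 0.0   # running sum of squared rewards

    def update(self, reward):
        self.s += 1
        self.mean = ((self.s - 1) * self.mean + reward) / self.s
        self.sum_sq += reward ** 2

    def rss(self):
        # RSS = S - s * mean^2; clipped at zero to guard against rounding error
        return max(self.sum_sq - self.s * self.mean ** 2, 0.0)

    def variance(self):
        prss = 2.0 * (self.s + 2) * self.sigma_a ** 2
        return (self.rss() + prss) / (self.s + 2) ** 2

    def sample_index(self, rng):
        return rng.normal(self.mean, np.sqrt(self.variance()))

def run_gaussian_bandit(means, sigma, sigma_a, horizon, seed=0):
    """Simulate ReBoot on a Gaussian bandit and return the pull count of each arm."""
    rng = np.random.default_rng(seed)
    arms = [GaussianReBootArm(sigma_a) for _ in means]
    counts = np.zeros(len(means), dtype=int)
    for t in range(horizon):
        if t < len(means):
            k = t  # assumed initialization: pull every arm once
        else:
            k = int(np.argmax([arm.sample_index(rng) for arm in arms]))
        arms[k].update(rng.normal(means[k], sigma))
        counts[k] += 1
    return counts
```

Each round costs one Gaussian draw per arm and a constant-time update, and only three numbers are stored per arm, which is the sense in which the cost is comparable to Thompson sampling.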

3 Regret Analysis

3.1 Gaussian ReBoot

We analyze ReBoot in a $K$-armed Gaussian bandit. The setting and regret are defined in Section 2. We further assume the reward distribution of arm $k$ is Gaussian with mean $\mu_k$ and variance $\sigma^2$.

Theorem 1.

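Consider a $K$-armed Gaussian bandit where the reward of arm $k$ is drawn from the Gaussian distribution $N(\mu_k, 1)$. Let $\sigma_a > 1.5$ be the exploration aid unit. Then, the $T$-round regret of ReBoot satisfies

$$R(T) \le \sum_{k=2}^{K} \Delta_k \Big[6 + \big\{C_1(\sigma_a) + C_2(\sigma_a)\,\Delta_k^{-2}\big\} \cdot \log T\Big], \tag{18}$$

where the constants $C_1(\sigma_a)$ and $C_2(\sigma_a)$ are defined as

$$C_1(\sigma_a) = 8\,(2\sigma_a^2 - 1)^{-1}, \tag{19}$$
$$C_2(\sigma_a) = 128\,\sigma_a^2\Big(3.1 + 2\,\big(1 - 2.25\,\sigma_a^{-2}\big)^{-1/2}\Big). \tag{20}$$

Proof.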

We defer the proof and the rigorous non-asymptotic analysis to Appendix A. The key steps and asymptotic reasoning are presented in Section 3.3. ∎
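To get a feel for the size of the bound, the constants (19) and (20) and the right-hand side of (18) can be evaluated numerically. The sketch below uses hypothetical function names and reads the garbled exponent in (20) as $-1/2$; it rejects values of the exploration aid unit at or below 1.5, where the square root in (20) is undefined or infinite.

```python
import math

def c1(sigma_a):
    """Constant C1 in (19)."""
    return 8.0 / (2.0 * sigma_a ** 2 - 1.0)

def c2(sigma_a):
    """Constant C2 in (20); requires sigma_a > 1.5 so that 1 - 2.25 / sigma_a^2 > 0."""
    if sigma_a <= 1.5:
        raise ValueError("sigma_a must exceed 1.5 for C2 to be finite")
    return 128.0 * sigma_a ** 2 * (3.1 + 2.0 / math.sqrt(1.0 - 2.25 / sigma_a ** 2))

def regret_bound(gaps, sigma_a, horizon):
    """Right-hand side of (18) for the gaps of the suboptimal arms."""
    log_t = math.log(horizon)
    return sum(g * (6.0 + (c1(sigma_a) + c2(sigma_a) / g ** 2) * log_t) for g in gaps)
```

For example, `regret_bound([0.5, 1.0], 2.0, 10_000)` evaluates the bound for two suboptimal arms; the large leading factor in (20) means the bound is mainly informative about the growth rate in $T$ rather than the absolute level of regret.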

After further optimizing the constants and assuming, without loss of generality, the maximum suboptimality gap , we have the following corollary.

Corollary 1.

Choose in Theorem 1. Then, the T round regret of ReBoot satisfies:

$$R(T) \lesssim \sum_{k=2}^{K} \Delta_k + \sum_{k=2}^{K} \frac{\log T}{\Delta_k}. \tag{21}$$

Remark 4.

Corollary 1 demonstrates that the regret bound of the proposed ReBoot algorithm matches the state-of-the-art theoretical result for MAB based on the UCB algorithm (Theorem 7.1 in Lattimore and Szepesvári (2018)).

Comparison to Giro. The way of adding pseudo observations to analyze the regret of the Bernoulli bandit in Kveton et al. (2018) relies heavily on the bounded support assumption on the reward distribution. Our theoretical contribution is to carry the regret analysis beyond bounded-support reward distributions to the unbounded reward regime, by introducing residual perturbation-based exploration in the MAB problem.

Technical novelty compared to Giro. The argument in Giro for proving the regret upper bound does not directly apply to the Gaussian bandit because it relies on the fact that the sample variance of Bernoulli rewards is bounded. In the Gaussian bandit, by contrast, the sample variance follows a (scaled) chi-square distribution, which is not bounded. We overcome this difficulty by identifying a consequential good event of the proposed perturbation scheme. Specifically, our regret-optimal perturbation scheme confines the exploration level of the arm index to a two-sided bound with high probability in the Gaussian bandit. This bound is controllable by the tuning parameter $\sigma_a$ of ReBoot and is capable of preventing the underexploration phenomenon of vanilla residual bootstrap exploration.

3.2 Discussion on choosing exploration aid unit σa

The condition is to ensure the constant in (20) is finite. Constant comes from analysis of , i.e., the expected number of sub-optimal pulls due to underestimation on the optimal arm. Large helps with jumping off the bad instance where the reward samples of the optimal arm is far below its expectation. Constant in (20) is decreasing in for . Constant in (19) is decreasing in for . Therefore, in Corollary 1, we pick to optimize the constant. Since ReBoot performs well empirically, as we show in Section 4, the theoretically suggested value of exploration aid unit is likely to be loose.

3.3 Proof Scheme

We outline the proof scheme of Theorem 1. The key is to analyze the situations that lead to pulling a sub-optimal arm. Such situations consist of two types of events: underestimating the optimal arm and overestimating a suboptimal arm.

As shown in Theorem 1 of Kveton et al. (2018), the $T$-round regret of a perturbed-history type algorithm has the upper bound

$$R(T) \le \sum_{k=2}^{K} \Delta_k\,(a_k + b_k). \tag{22}$$

The first term $a_k$ is the expected number of rounds in which the optimal arm 1 is underestimated; formally,

$$a_k = \sum_{s=0}^{T-1} E\big[\min\{N_{1,s}(\tau_k),\, T\}\big], \tag{23}$$

where $N_{1,s}(\tau_k)$ is the expected number of rounds in which the optimal arm 1 is underestimated given $s$ sample rewards. The second term $b_k$ accounts for the probability that the suboptimal arm $k$ is overestimated; formally,

$$b_k = 1 + \sum_{s=0}^{T-1} P\big(Q_{k,s}(\tau_k) > T^{-1}\big), \tag{24}$$

where $Q_{k,s}(\tau_k)$ is the probability that the suboptimal arm $k$ is overestimated given $s$ sample rewards.
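Under Gaussian bootstrap weights (Section 2.5), the arm index given the history is Gaussian, so the exceedance probability of a level $\tau$ has a closed form. The sketch below is an illustration under that assumption, with hypothetical function names. It also uses the relation between the expected waiting count and the exceedance probability, one over the probability minus one, which is how the two quantities are linked in the restated theorem in Appendix A.

```python
import math

def exceed_prob(tau, mean, variance):
    """Q: probability that a Gaussian index N(mean, variance) exceeds the level tau."""
    z = (tau - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def expected_wait(tau, mean, variance):
    """N = 1 / Q - 1: expected number of index draws at or below tau before one exceeds it."""
    return 1.0 / exceed_prob(tau, mean, variance) - 1.0
```

When the level sits well below the index mean, the exceedance probability is close to one and the waiting count is close to zero, which matches the asymptotic picture for the optimal arm; when the level sits well above the mean, the exceedance probability is small, which is the desired behavior for a suboptimal arm.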

Here we explain the situation in which the bandit algorithm will not pull the suboptimal arm $k$ at round $t$. Consider the optimal arm 1 and a suboptimal arm $k$. At round $t$, suppose arm 1 and arm $k$ have been pulled $s_1$ and $s_k$ times, respectively; then the indexes of arm 1 and arm $k$ are $\hat{\mu}^*_{1,s_1}$ and $\hat{\mu}^*_{k,s_k}$. Given a constant level $\tau_k$, we define the event of underestimating the optimal arm 1 as

$$F_{s_1} = \big\{\hat{\mu}^*_{1,s_1} \le \tau_k\big\} \tag{25}$$

and the event of overestimating a suboptimal arm $k$ as

$$E^c_{s_k} = \big\{\hat{\mu}^*_{k,s_k} > \tau_k\big\}. \tag{26}$$

If we pick $\tau_k \in (\mu_k, \mu_1)$ and the distributions of both indexes have exponentially decaying tails, large deviation theory indicates that the events $F_{s_1}$ and $E^c_{s_k}$ are both rare asymptotically. When both events $F^c_{s_1}$ and $E_{s_k}$ happen, the agent will not pull the suboptimal arm $k$.

We provide asymptotic reasoning for bounding $a_k$ and defer the non-asymptotic analysis to Lemma 1. Recall that for a given constant level $\tau_k$, the probability of the optimal arm 1 being underestimated given $s$ reward samples is $1 - Q_{1,s}(\tau_k)$. If we pick the level $\tau_k$ to satisfy $\tau_k < \mu_1$, large deviation theory gives

$$Q_{1,s}(\tau_k) \to 1 \quad \text{as } s \to \infty. \tag{27}$$

Recall that $N_{1,s}(\tau_k)$ is the expected number of rounds to observe a not-underestimated instance from the resample mean distribution given $s$ reward samples. The asymptotics in (27) imply $N_{1,s}(\tau_k) \to 0$ as the number of pulls $s$ grows to infinity. Thus, given the time horizon $T$, there exists a constant $s_0(T)$ such that $N_{1,s}(\tau_k) \le T^{-1}$ for all $s$ over $s_0(T)$. Consequently, the quantity $a_k$ in the regret bound (22) is bounded by

$$a_k \le 1 + \sum_{s=0}^{s_0(T)} E\big[\min\{N_{1,s}(\tau_k),\, T\}\big]. \tag{28}$$

The fact that the constant $s_0(T)$ is of order $\log T$ will be shown in Lemmas 2 and 3. For a small number of pulls $s \le s_0(T)$, we show in Lemma 1 that $E[\min\{N_{1,s}(\tau_k), T\}]$ is bounded by a constant for any $s$. Thus, it is enough to conclude that $a_k$ can be bounded by a term of order $\log T$.

We provide asymptotic reasoning for bounding $b_k$ and defer the non-asymptotic analysis to Lemma 5. Recall that for a given constant level $\tau_k$, the probability of the suboptimal arm $k$ being overestimated given $s$ reward samples is $Q_{k,s}(\tau_k)$. If we pick the level $\tau_k$ to satisfy $\tau_k > \mu_k$, large deviation theory gives

$$Q_{k,s}(\tau_k) \to 0 \quad \text{as } s \to \infty. \tag{29}$$

Thus, given the time horizon $T$, there exists a constant $s_0(T)$ such that $Q_{k,s}(\tau_k) \le T^{-1}$ for all $s$ over $s_0(T)$. As a result, the event $\{Q_{k,s}(\tau_k) > T^{-1}\}$ is empty if the number of pulls $s$ is beyond $s_0(T)$. Consequently, the quantity $b_k$ in the regret bound (22), defined in (24), is bounded by

$$b_k \le 1 + \sum_{s=0}^{s_0(T)} P\big(Q_{k,s}(\tau_k) > T^{-1}\big). \tag{30}$$

The fact that the constant $s_0(T)$ is of order $\log T$ will be shown in Lemmas 6 and 7. For a small number of pulls $s \le s_0(T)$, we apply the trivial bound $P(Q_{k,s}(\tau_k) > T^{-1}) \le 1$, which holds for any $s$. Therefore, it is enough to conclude that $b_k$ can be bounded by a term of order $\log T$.

4 Experiments

We compare ReBoot to three baselines: TS (Agrawal and Goyal, 2013) (with prior), Giro (Kveton et al., 2018), and PHE (Kveton et al., 2019). For all the experiments unless otherwise specified, we choose and for Giro and PHE respectively, as justified by the associated theory. All the results are averaged over runs.

4.1 Robustness to Reward Mean/Variance

We compare ReBoot with the two bounded bandit algorithms, Giro and PHE, on two classes of -armed Gaussian bandit problems where , . The first class has and , where varies from to . The second class has and varying from to , where we choose as guided by our theory. Figure 2 shows the effect of the shifted mean and the varying variance in three algorithms on the -round regret.

From the left panel of Figure 2, we see that ReBoot is robust to the increase in the mean rewards, while both Giro and PHE are sensitive. The reason is that when the mean rewards increase, the added pseudo rewards ( for Giro and for PHE) cannot represent the upper extreme value which is supposed to help with escaping from sub-optimal arms. From the right panel of Figure 2, we observe a slow growth in the regret of ReBoot and Giro as the variance increases due to the raised problem difficulty level, while PHE is much more sensitive than ReBoot and Giro since the exploration in PHE completely relies on pseudo rewards which are inappropriate in varying variance but ReBoot and Giro mostly depends on bootstrap which yields more stable performance.

4.2 Robustness to Reward Shape

We compare ReBoot with TS, Giro, and PHE on three classes of -armed bandit problems:

• with ;

• with mean ;

• is a logistic distribution with mean and variance .

In each class, the mean reward takes values in , and the variance is either for all arms (Gaussian and logistic) or varies in among arms (exponential). The regret of the first rounds is displayed in Figure 3.

ReBoot has sub-linear and small regret when

in all cases, which validates our theory for Gaussian bandits and potential applicability in bandits of other distributions with even heteroscedasticity. We also see that

is a near-optimal choice for Gaussian bandits and exponential bandits (for logistic bandits, is slightly better). The linear regret of PHE and the sub-linear but large regret of Giro in all cases are because the mean rewards are shifted away from . TS cannot achieve its optimal performance without setting the prior with accurate knowledge of the reward distribution.

4.3 Computational Cost

We compare the run times of ReBoot (), TS, Giro, and PHE in a Gaussian bandit. The settings consist of all combinations of and . Our results are reported in Table 2. In all settings, the run times of ReBoot, TS, and PHE are all comparable, while the run time of Giro is significantly higher due to computationally expensive sampling with replacement over history with pseudo rewards. This comparison validates our analysis in Section 2.5.

5 Conclusion

In this work, we propose a new class of algorithms, ReBoot: a residual-bootstrap-based exploration mechanism. We highlight the limitation of directly using the statistical bootstrap in the bandit setting and develop a remedy procedure called variance inflation. We analyze ReBoot in an unbounded reward showcase (Gaussian bandit) and prove an optimal instance-dependent regret.

References

• Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
• S. Agrawal and N. Goyal (2013) Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pp. 99–107.
• P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
• A. N. Elmachtoub, R. McNellis, S. Oh, and M. Petrik (2017) A practical method for solving contextual bandit problems using decision trees. arXiv preprint arXiv:1706.04687.
• B. Hao, Y. Abbasi Yadkori, Z. Wen, and G. Cheng (2019) Bootstrapping upper confidence bound. In Advances in Neural Information Processing Systems 32, pp. 12123–12133.
• B. Kveton, C. Szepesvari, M. Ghavamzadeh, and C. Boutilier (2019) Perturbed-history exploration in stochastic multi-armed bandits. arXiv preprint arXiv:1902.10089.
• B. Kveton, C. Szepesvari, Z. Wen, M. Ghavamzadeh, and T. Lattimore (2018) Garbage in, reward out: bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154.
• T. Lattimore and C. Szepesvári (2018) Bandit algorithms. Preprint.
• X. Lu and B. Van Roy (2017) Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3258–3266.
• E. Mammen (1993) Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, pp. 255–285.
• I. Osband and B. Van Roy (2015) Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300.
• D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. (2018) A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning 11 (1), pp. 1–96.
• R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
• M. J. Wainwright (2019) High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48, Cambridge University Press.

Appendix A Proof of Theorem 1.

Step 0: Notation and Preparation

We restate Theorem 1 of Kveton et al. (2018):

Theorem 2.

Let for each arm . For any , the expected round regret of General randomized exploration algorithm is bounded from above as

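$$R(T) \le \sum_{k=2}^{K} \Delta_k\,(a_k + b_k), \tag{31}$$

where

$$a_k = \sum_{s=0}^{T-1} E\big[\min\{Q_{1,s}(\tau_k)^{-1} - 1,\, T\}\big]; \qquad b_k = \sum_{s=0}^{T-1} P\big(Q_{k,s}(\tau_k) > T^{-1}\big) + 1. \tag{32}$$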

We then use Theorem 2 to analyze the regret of Gaussian ReBoot.

Step 1: Bounding ak.

Recall and . We define event and . Let and hence . Define and set . Define . Set . Note that by Lemma 1, for any . Write , where

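$$a_{k,s,1} = E\big[\min\{N_{1,s}(\tau_k),\, T\}\, I(A_{1,s})\, I(G_{1,s})\big], \tag{33}$$
$$a_{k,s,2} = E\big[\min\{N_{1,s}(\tau_k),\, T\}\, I(A^c_{1,s})\, I(G_{1,s})\big], \tag{34}$$
$$a_{k,s,3} = E\big[\min\{N_{1,s}(\tau_k),\, T\}\, I(G^c_{1,s})\big]. \tag{35}$$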

From Lemmas 2, 3 and 4, one has that, for any , . Now, set , then .

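$$a_k \le M(r) \cdot \max\{s_{a,1}(T),\, s_{a,2}(T),\, s_{a,3}(T)\} + \big(T - \max\{s_{a,1}(T),\, s_{a,2}(T),\, s_{a,3}(T)\}\big) \cdot 3T^{-1} \tag{36}$$
$$\phantom{a_k} \le 3 + 16\,M(r)\,\max\big\{16\,(\sigma/\Delta_k)^2 r^2,\ (2r^2 - 1)^{-1}\big\} \cdot \log T. \tag{37}$$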

Step 2: Bounding bk.

Recall that and . We define events and . Let and hence .

Set . Note the trivial bound for any , which follows from the fact that is a probability. Write , where

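$$b_{k,s,1} = E\big[I\big(Q_{k,s}(\tau_k) > T^{-1}\big)\, I(A_{k,s})\, I(G_{k,s})\big], \tag{38}$$
$$b_{k,s,2} = E\big[I\big(Q_{k,s}(\tau_k) > T^{-1}\big)\, I(A^c_{k,s})\, I(G_{k,s})\big], \tag{39}$$
$$b_{k,s,3} = E\big[I\big(Q_{k,s}(\tau_k) > T^{-1}\big)\, I(G^c_{k,s})\big]. \tag{40}$$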

From Lemmas 5, 6 and 7, one has that, for any , . Now, set , then .

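$$b_k \le 1 + \max\{s_{b,1}(T),\, s_{b,2}(T),\, s_{b,3}(T)\} + \big(T - \max\{s_{b,1}(T),\, s_{b,2}(T),\, s_{b,3}(T)\}\big) \cdot 2T^{-1} \tag{41}$$
$$\phantom{b_k} \le 3 + 8\,\max\big\{16\,(\sigma/\Delta_k)^2 r^2,\ (2r^2 - 1)^{-1}\big\} \cdot \log T. \tag{42}$$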

Step 3: Bounding the T-round regret R(T).

Combining the results obtained in Steps 1 and 2, one has

$$a_k + b_k \le 6 + \big(8 + 16\,M(r)\big)\,\max\big\{16\,(\sigma/\Delta_k)^2 r^2,\ (2r^2 - 1)^{-1}\big\} \cdot \log T. \tag{43}$$

Plugging this into Theorem 2 gives the claimed bound.

Appendix B Technical Lemmas

B.1 Lemmas on bounding ak.

Lemma 1 (Bounding ak,s for any s>0).

Set . For any ,

 ak,s≤M(r)≡1.1+(1−(3/