A sequential decision problem is characterized by an agent who interacts with an uncertain environment and maximizes cumulative rewards (Sutton and Barto, 2018). To learn to make optimal decisions as quickly as possible, the agent must balance exploiting the current best decision to accumulate immediate rewards against executing exploratory decisions to improve future rewards. A gold standard in the literature (Lattimore and Szepesvári, 2018) is to explore more where we are less confident. Thus, a principled way to understand and utilize uncertainty quantification stands at the core of sequential decision making.
The most popular approaches with theoretical guarantees are based on the optimism principle. Upper confidence bound (UCB) type algorithms (Auer et al., 2002; Abbasi-Yadkori et al., 2011) maintain a confidence set so that the agent acts in an optimistic environment. Thompson sampling (TS) type algorithms (Russo et al., 2018) maintain a posterior distribution over model parameters and act optimistically with respect to samples from it. However, both types of algorithms are known to be hard to generalize to structured problems (Kveton et al., 2018). Thus, we may ask the following question:
Can we design a better principle that can provably quantify uncertainties and is easy to generalize?
We follow the line of bootstrap exploration (Osband and Van Roy, 2015; Elmachtoub et al., 2017; Kveton et al., 2018), which is known to generalize easily to structured problems. In this work, we carefully design a type of "perturbed" residual bootstrap as a data-dependent exploration mechanism in bandits, which may also be viewed as a "follow the bootstrap leader" algorithm in a general sense. The main principle is that residual bootstrap from the statistics literature (Mammen, 1993) can be adapted to capture the underlying distributional properties of fitting errors. It turns out that the resulting level of exploration leads to optimal regret; we therefore call the employed perturbation the "regret-optimal perturbation" scheme. The regret-optimal perturbation is obtained through appropriate uncertainty boosting based on the residual sum of squares (RSS).
We propose a novel residual bootstrap exploration algorithm that maintains the generalization property and works for both bounded and unbounded rewards;
We prove an optimal instance-dependent regret bound for an instance of ReBoot with unbounded rewards (Gaussian bandit). This is a non-trivial extension beyond Bernoulli rewards (Kveton et al., 2018): we utilize sharp lower bounds on the normal distribution function and carefully design the variance inflation level;
We empirically compare ReBoot with several provably competitive methods and demonstrate its superior performance across a variety of unbounded reward distributions while preserving computational efficiency.
Related Works. Giro (Kveton et al., 2018) directly perturbed the historical rewards by nonparametric bootstrapping and adding deterministic pseudo rewards. One limitation is that Giro is only suitable for bounded rewards: the range of pseudo rewards usually depends on the extreme values of the rewards, which could be unbounded for an unbounded distribution. In this case, there is no principled guidance for choosing appropriate values of the pseudo rewards. Technically, the analysis in Kveton et al. (2018) relies heavily on the beta-binomial transformation, and is thus hard to extend to unbounded rewards, e.g., the Gaussian bandit. Another related work, Elmachtoub et al. (2017), applied bootstrapping with decision trees to contextual bandit problems.
PHE (Kveton et al., 2019) randomized the history by directly injecting Bernoulli noise and provided a regret guarantee limited to the bounded reward case. Lu and Van Roy (2017) proposed ensemble sampling, which injects Gaussian noise to approximate the posterior distribution in Thompson sampling. However, their regret guarantee has an irreducible term depending linearly on the time horizon even when the posterior can be computed exactly. In practice, it may also be hard to decide what kind and amount of noise should be injected, since this is not a purely data-dependent approach.
Another line of work uses the bootstrap to construct sharper confidence intervals in UCB-type algorithms. Hao et al. (2019) proposed a nonparametric and data-dependent UCB algorithm based on the multiplier bootstrap and derived a second-order correction term to boost the agent away from sub-optimal solutions. However, their approach is computationally expensive: at each round, they need to resample the history a large number of times. By contrast, ReBoot only needs to resample once at each round.
2 Residual Bootstrap Exploration (ReBoot)
In this section, we first briefly discuss in Section 2.1 why vanilla residual bootstrap exploration may not work for multi-armed bandit problems. This motivates the ReBoot algorithm as a remedy, presented in Section 2.2; a full description of each step is given in Section 2.3. The intuition and interpretation of the tuning parameter are discussed in Section 2.4.
Problem setup. We present our approach in the stochastic multi-armed bandit (MAB) problem. There are $K$ arms, and each arm $k$ has a reward distribution with an unknown mean parameter $\mu_k$. Without loss of generality, we assume arm 1 is the optimal arm, that is, $\mu_1 = \max_k \mu_k$. The agent interacts with the bandit environment for $n$ rounds. In round $t$, the agent pulls an arm $I_t$ and observes a reward. The objective is to minimize the expected cumulative regret, defined as
$$R(n) = \mathbb{E}\Big[\sum_{t=1}^{n} (\mu_1 - \mu_{I_t})\Big] = \sum_{k=2}^{K} \Delta_k\, \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}\{I_t = k\}\Big],$$
where $\Delta_k = \mu_1 - \mu_k$ is the sub-optimality gap of arm $k$, and $\mathbb{1}\{\cdot\}$ is the indicator function. Here, the second equality follows from the regret decomposition lemma (Lemma 4.5 in Lattimore and Szepesvári (2018)).
2.1 The failure of vanilla residual bootstrap
Bootstrapping a reward sample set $\{y_1, \dots, y_s\}$ of size $s$ for arm $k$ via the residual bootstrap method (Mammen, 1993) consists of the following four steps:
Compute the average reward $\bar{y} = \frac{1}{s}\sum_{i=1}^{s} y_i$.
Compute the residuals $e_i = y_i - \bar{y}$ for each reward sample $i = 1, \dots, s$.
Generate bootstrap weights $w_1, \dots, w_s$ (i.i.d. random variables with zero mean and unit variance).
Add to the reward average the average of the perturbed residuals to get the perturbed reward average $\bar{y}^{*} = \bar{y} + \frac{1}{s}\sum_{i=1}^{s} w_i e_i$.
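As a concrete sketch, the four steps above can be written in a few lines of Python. The Gaussian choice of weights and all variable names are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

def vanilla_residual_bootstrap_mean(rewards, rng):
    """One vanilla residual-bootstrap draw of the perturbed reward average.

    Follows the four steps in the text: average -> residuals ->
    zero-mean, unit-variance weights -> perturbed average.
    """
    rewards = np.asarray(rewards, dtype=float)
    s = len(rewards)
    y_bar = rewards.mean()                        # Step 1: average reward
    residuals = rewards - y_bar                   # Step 2: residuals
    weights = rng.standard_normal(s)              # Step 3: Gaussian bootstrap weights
    return y_bar + (weights * residuals).mean()   # Step 4: perturbed average

rng = np.random.default_rng(0)
draws = [vanilla_residual_bootstrap_mean([1.0, 2.0, 4.0], rng) for _ in range(20000)]
# Conditionally on the data, the draws have mean equal to the sample average
# and variance RSS / s^2, matching the discussion that follows.
```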
A distinctive feature of the perturbed reward average $\bar{y}^{*}$ is that it preserves the empirical variation in the current data set. That is, the conditional variance of $\bar{y}^{*}$ given the reward sample set is exactly the uncertainty quantified around the reward average. To see this, notice that, from the above residual bootstrap procedure, the perturbed reward average admits the representation
Since the bootstrap weights are required to have zero mean and unit variance, the distribution of the perturbed average conditioned on the current data set has mean and variance
which means that its expectation equals the sample average $\bar{y}$ and its variance equals $\mathrm{RSS}/s^{2}$, where
is the residual sum of squares.
Note that the RSS in (4) is a standard measure of goodness of fit in statistics, that is, how well a statistical model fits the current data set. At first glance, it seems that the variance of the bootstrap-based mean estimator should hint at the right amount of randomness for exploration. This intuition guides a different type of exploration in the MAB problem, as elaborated in the following paragraph.
Vanilla residual bootstrap exploration.
As is well recognized in the bandit literature (Lattimore and Szepesvári, 2018), a policy using the reward average as the arm index (the Follow-the-Leader algorithm) can incur linear regret in the multi-armed bandit problem (Figure 1, blue line). Alternatively, a policy using the perturbed reward average from residual bootstrap as in (3) induces data-driven exploration at the level of the statistical uncertainty of the current reward sample set.
Exploration at the level of the current reward sample set's statistical uncertainty is a mixture of hopes and concerns. On one hand, we hope the data-driven exploration level hints at the right amount of randomness for escaping from suboptimal solutions; on the other hand, we worry that a large deviation in the average of the reward samples, or a poor fit of the adopted statistical model, haunts the performance of the bandit algorithm.
Unfortunately, we observe in empirical experiments that a policy using the perturbed reward average as the arm index, despite improving on the Follow-the-Leader algorithm, can still incur linear regret (Figure 1, orange line). The problem is that the exploration level of the perturbed reward average, inherited from the statistical uncertainty of the current reward samples, is not sufficient, leading to under-exploration.
Surprisingly, after carefully perturbing the reward average in an unconventional way, we observe that the bandit algorithm successfully secures sublinear regret in the experiment (Figure 1, green line). Summarizing these observations, we devise a "regret-optimal" perturbation scheme and propose it as the ReBoot algorithm.
2.2 Algorithm ReBoot
We propose the ReBoot algorithm, which explores via residual bootstrap with a regret-optimal perturbation scheme. Specifically, at each round, for each arm $k$ (denoting the history of arm $k$ by a vector of all rewards received so far by pulling arm $k$), ReBoot computes an index for arm $k$ via four steps:
Compute the average reward $\bar{y}$.
Compute the residuals with respect to the average reward for all samples. Append two pseudo residuals, of opposite signs and with magnitude set by the exploration aid unit (a tuning parameter).
Generate bootstrap weights (i.i.d. random variables with zero mean and unit variance).
Add to the reward average the average of the perturbed augmented residuals to get the arm index.
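A minimal sketch of the four steps, plus the "follow the bootstrap leader" rule, is below. The exact form of the pseudo residuals is specified later in equations (6) and (7), so here we assume a symmetric pair $\pm\sqrt{s}\,\sigma_a$ with $\sigma_a$ the exploration aid unit, and normalization by the sample size $s$; both are assumptions of this sketch, not quotes from the paper:

```python
import numpy as np

def reboot_index(rewards, sigma_a, rng):
    """One ReBoot arm index draw (a sketch; the pseudo-residual form
    +/- sqrt(s) * sigma_a and the 1/s normalization are assumptions)."""
    rewards = np.asarray(rewards, dtype=float)
    s = len(rewards)
    y_bar = rewards.mean()                            # Step 1: average reward
    residuals = rewards - y_bar                       # Step 2: residuals ...
    pseudo = np.array([np.sqrt(s) * sigma_a,          # ... plus two pseudo
                       -np.sqrt(s) * sigma_a])        #     residuals
    augmented = np.concatenate([residuals, pseudo])
    weights = rng.standard_normal(s + 2)              # Step 3: bootstrap weights
    return y_bar + weights @ augmented / s            # Step 4: perturbed index

def pull(indices):
    """Follow the bootstrap leader: pull the arm with the highest index."""
    return int(np.argmax(indices))
```

Under these assumptions, the index's conditional variance is (RSS + 2 s sigma_a^2) / s^2, i.e., the data-driven variance plus an inflation term that dominates when s is small.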
Then, ReBoot follows the bootstrap leader. That is, ReBoot pulls the arm with the highest arm index; formally,
A summary of the ReBoot algorithm for the MAB problem is presented in Algorithm 1. We explain the residual bootstrap (Step 4) and discuss the choice of bootstrap weights (Step 3) in Section 2.3, and we explain the reason for adding pseudo residuals in Section 2.4.
2.3 Residual bootstrap exploration with regret optimal perturbation scheme
In this subsection, we illustrate how to implement residual bootstrap exploration with the proposed regret-optimal perturbation scheme in the MAB problem. We give full descriptions of the four steps of ReBoot, with discussion of how the proposed scheme boosts the uncertainty to escape from suboptimal solutions. Before proceeding to the exact description of our proposed policy, it is helpful to introduce further notation for the MAB. At round $t$, the number of pulls of arm $k$ is denoted by $s$.
Step 1. Compute the average reward of history
We first describe the historical reward vector of arm $k$ at round $t$. The $j$-th entry of this vector is the reward received after the $j$-th pull of arm $k$.
Associated with the historical reward vector is its average reward.
Step 2. Compute residuals and pseudo residuals
Given the average reward of the historical reward vector, compute the residual set. Note that the residual set carries the statistical uncertainty of the history, contributing part of the exploration level used in ReBoot. As we saw in Section 2.1, this exploration level alone is not enough to secure sublinear regret.
Encouraging exploration to escape from suboptimal solutions, especially when the number of rewards is small, requires a sophisticated design of the perturbation scheme. Our proposal for devising a variance-inflated average of perturbed rewards proceeds in three steps. First, specify an exploration aid unit, a tuning parameter of ReBoot. Second, generate pseudo residuals by the scheme
Last, append the pseudo residuals to the residual set to form an augmented residual set.
Step 3. Generate bootstrap weights
We generate bootstrap weights by drawing i.i.d. random variables from a zero-mean, unit-variance distribution. As recommended in the residual bootstrap literature (Mammen, 1993), choices of bootstrap weights include Gaussian weights, Rademacher weights, and skew-correcting weights.
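The three weight families can be generated as follows. Mammen's two-point distribution is one standard skew-correcting choice from the wild-bootstrap literature; taking it to be the skew-correcting weights meant here is an assumption of this sketch:

```python
import numpy as np

def bootstrap_weights(s, kind, rng):
    """Draw s i.i.d. bootstrap weights with zero mean and unit variance."""
    if kind == "gaussian":
        return rng.standard_normal(s)
    if kind == "rademacher":
        return rng.choice([-1.0, 1.0], size=s)
    if kind == "mammen":
        # Mammen's two-point weights: mean 0, variance 1, third moment 1,
        # which corrects for skewness in the residual distribution.
        a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
        p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
        return rng.choice([a, b], size=s, p=[p, 1 - p])
    raise ValueError(kind)
```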
Step 4. Perturb average reward with residuals
The arm index used in ReBoot is then computed by summing the average reward and the average of the perturbed augmented residual set; formally,
Compared to the perturbed average reward (2) in vanilla residual bootstrap exploration, the arm index (8) used in ReBoot possesses an additional exploration level controlled by the tuning parameter in equations (6) and (7). Intuitively, a larger exploration aid unit delivers stronger exploration assistance for the arm index (8), increasing the chance of escaping from suboptimal solutions.
How does ReBoot explore?
Here we explain, conditioning on the historical reward vector, how the arm index used in ReBoot explores. By our perturbation scheme, at round $t$, the arm index admits the representation
where the pseudo residuals are specified by the proposed perturbation scheme (equations (6) and (7)). The data-driven exploration, contributed by the perturbed residuals, reflects the statistical uncertainty of the current reward samples; the additional exploration aid, contributed by the perturbed pseudo residuals, echoes the expected statistical uncertainty at the scale of the specified exploration aid unit. The art is to tune the parameter to strike a balance between these two sources of exploration, so as to avoid underexploration and secure sublinear regret. (See more discussion of the intuition and interpretation in Section 2.4.)
To sum up, given a historical reward vector, the conditional variance of the arm index (8) has the formula
which consists of the residual sum of squares (see (4)) from the perturbed residuals, and the pseudo residual sum of squares,
from the perturbed pseudo residuals at the level of the tuning parameter. Later, in the formal regret analysis in Section 3, we will see how this term helps the arm index prevent underexploration, providing a pathway to a successful escape from suboptimal solutions.
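For reference, under the illustrative assumptions used earlier (two pseudo residuals of magnitude $\sqrt{s}\,\sigma_a$, with $\sigma_a$ the exploration aid unit, and averaging over the $s$ observed rewards), the conditional variance formula would read as follows; this is a reconstruction, not the paper's exact display:

```latex
% Conditional variance of the ReBoot arm index given a history of s rewards
% (a sketch under the stated assumptions):
\mathrm{Var}\left(\bar{y}^{*} \,\middle|\, y_{1},\dots,y_{s}\right)
  = \frac{1}{s^{2}}\Bigg(\underbrace{\sum_{i=1}^{s} e_{i}^{2}}_{\text{RSS}}
    + \underbrace{2\, s\, \sigma_{a}^{2}}_{\text{pseudo RSS}}\Bigg)
  = \frac{\mathrm{RSS}}{s^{2}} + \frac{2\,\sigma_{a}^{2}}{s}.
```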
2.4 Use exploration aid unit to manage residual bootstrap exploration
In this subsection, we discuss the intuition and interpretation of the tuning parameter in ReBoot.
Choice of exploration aid unit .
Now we answer the question of what level of exploration aid is appropriate for MAB exploration. The art is to choose a level that prevents the index from underexploration. We first trace the heuristics mentally, and then provide an exact implementation scheme.
Intuitively, we want to choose an exploration aid unit such that the pseudo residual term in (10) plays a major role when the number of reward samples is small and then gradually loses its importance. This consideration is an effort to preserve the statistical efficiency of the averaging procedure in the proposed regret-optimal perturbation scheme.
Managing the exploration level via the exploration aid unit. Now we showcase our craftsmanship. Set the exploration aid unit at a scale that inflates the reward distribution standard deviation by an inflation ratio such that
An immediate consequence of scheme (12) with a positive inflation ratio is a considerable uncertainty boost in the arm index (8), especially at the beginning of the bandit algorithm (the regime with few reward samples).
Note that under scheme (12), the pseudo residual sum of squares (11) admits a closed-form expression. That is, for a reward sample set of size $s$, the two pseudo residuals collectively inflate the arm index variance by a multiple of the reward distribution variance.
Now we illustrate how the exploration aid unit manages the exploration of the arm index given a historical reward vector. First, we note that, with a sufficiently large inflation ratio, the data-driven variation is dominated by the pseudo residual sum of squares; formally, for any number of reward samples, the good event
holds with high probability. Second, on the good event, formula (10) implies that the variance of the arm index given a historical reward vector stays within a pre-specified range; that is,
Last, invoking the variance inflation scheme (12) with the inflation ratio, the variance is enclosed in a range proportional to the reward distribution variance; i.e.,
The key consequence of the two-sided variance bound (15) is that, on the good event, the index of an arm given its reward samples suffers neither severe overexploration nor severe underexploration. We are then able to avoid severe underestimation of the optimal arm and severe overestimation of the suboptimal arms.
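To make the two-sided bound concrete, one hedged reconstruction is as follows: on the good event where the data-driven RSS is dominated by the pseudo RSS, and writing $\sigma_a = c\,\sigma$ with inflation ratio $c$, the assumed variance formula is squeezed between two multiples of $\sigma^{2}/s$ (the constants here are illustrative, not quoted from the paper):

```latex
% Two-sided variance bound on the good event E = { RSS <= 2 s \sigma_a^2 }
% (a sketch; constants follow from Var = RSS/s^2 + 2\sigma_a^2/s):
\frac{2\, c^{2} \sigma^{2}}{s}
  \;\le\; \mathrm{Var}\left(\bar{y}^{*} \,\middle|\, y_{1},\dots,y_{s}\right)
  \;\le\; \frac{4\, c^{2} \sigma^{2}}{s}
  \qquad \text{on } E.
```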
Practical choice of inflation ratio .
For practical use in MAB with unbounded rewards, we recommend a specific choice of the inflation ratio. This choice is supported theoretically by the formal analysis of the Gaussian bandit presented in the regret analysis (Section 3) and empirically by experiments including Gaussian, exponential, and logistic bandits in Figure 3 in Section 4.
None of the treatments above imposes distributional assumptions (i.e., on the shape of the distribution) on the arm rewards (which can be bounded or unbounded) or on the bootstrap weights (only zero mean and unit variance are assumed). In Section 2.5, we discuss the benefits of Gaussian bootstrap weights in ReBoot, leading to an efficient implementation with storage and computational cost as low as Thompson sampling.
2.5 Efficient implementation using Gaussian weight
A significant advantage of choosing Gaussian bootstrap weights is low storage and computational cost. This is due to the resulting conditional normality of the arm index (8), regardless of the underlying reward distribution. Under Gaussian bootstrap weights, the arm index of ReBoot conditioned on the historical rewards is Gaussian distributed with the sample average as its mean parameter and (10) as its variance parameter. That is,
Then, the index can be drawn efficiently via the following incremental updates. At round $t$, after pulling arm $k$, we update the pull count, the running sum, and the running sum of squares by
and thus the quantity in (9) can be computed by a similarly efficient approach. This implementation, enabled by Gaussian bootstrap weights, saves both storage and computational cost and makes ReBoot as efficient as TS (Agrawal and Goyal, 2013). We compare ReBoot with TS, Giro (Kveton et al., 2018), and PHE (Kveton et al., 2019) in terms of storage and computational cost in Table 1. An empirical comparison of computational cost is given in Section 4.2.
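The incremental bookkeeping can be sketched as follows. The class keeps only three scalars per arm; the index is a single Gaussian draw. The variance formula reuses the earlier assumptions (pseudo residuals $\pm\sqrt{s}\,\sigma_a$, averaging over $s$), and all names are illustrative:

```python
import numpy as np

class GaussianReBootArm:
    """O(1)-storage arm statistics for ReBoot with Gaussian bootstrap weights.

    Tracks the pull count, running sum, and running sum of squares, from
    which the mean and RSS follow; the arm index is then one Gaussian draw.
    The variance formula below is a sketch under assumed pseudo residuals.
    """
    def __init__(self, sigma_a):
        self.sigma_a = sigma_a
        self.s = 0          # number of pulls
        self.sum_y = 0.0    # running sum of rewards
        self.sum_y2 = 0.0   # running sum of squared rewards

    def update(self, y):
        self.s += 1
        self.sum_y += y
        self.sum_y2 += y * y

    def index(self, rng):
        s = self.s
        mean = self.sum_y / s
        rss = self.sum_y2 - s * mean ** 2            # RSS = sum(y^2) - s*ybar^2
        var = (rss + 2 * s * self.sigma_a ** 2) / s ** 2
        return rng.normal(mean, np.sqrt(var))
```

Because only three scalars are stored and updated per pull, the per-round cost matches that of Thompson sampling with conjugate updates.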
3 Regret Analysis
3.1 Gaussian ReBoot
We analyze ReBoot in a $K$-armed Gaussian bandit. The setting and regret are defined in Section 2. We further assume the reward distribution of arm $k$ is Gaussian with mean $\mu_k$ and variance $\sigma^2$.
Theorem 1. Consider a $K$-armed Gaussian bandit where the reward distribution of arm $k$ is Gaussian with mean $\mu_k$ and variance $\sigma^2$. Let the exploration aid unit be given. Then, the $T$-round regret of ReBoot satisfies:
where the constants and are defined as
After further optimizing the constants and assuming, without loss of generality, a normalized maximum suboptimality gap, we have the following corollary.
Corollary 1. Choose the optimizing value of the exploration aid unit in Theorem 1. Then, the $T$-round regret of ReBoot satisfies:
Comparison to Giro. The way of adding pseudo observations to analyze the regret of the Bernoulli bandit in Kveton et al. (2018) relies heavily on the bounded support assumption on the reward distribution. Our theoretical contribution is to carry the regret analysis beyond bounded-support reward distributions to the unbounded regime by introducing residual perturbation-based exploration in the MAB problem.
Technical novelty compared to Giro. The argument in Giro for proving the regret upper bound does not directly apply to the Gaussian bandit because it relies on the fact that the sample variance of Bernoulli rewards is bounded. Indeed, in the Gaussian bandit, the sample variance has a chi-square distribution, which is unbounded. We overcome this predicament by recognizing a consequential good event of the proposed perturbation scheme. Specifically, our regret-optimal perturbation scheme cages the exploration level of the arm index into a two-sided bound with high probability in the Gaussian bandit. This bound is controllable by the tuning parameter of ReBoot and is capable of preventing the underexploration phenomenon of vanilla residual bootstrap exploration.
3.2 Discussion on choosing exploration aid unit
The condition on the exploration aid unit ensures that the constant in (20) is finite. This constant comes from the analysis of the expected number of suboptimal pulls due to underestimation of the optimal arm. A large exploration aid unit helps with jumping off the bad instances where the reward samples of the optimal arm fall far below their expectation. The constants in (19) and (20) are each decreasing in the exploration aid unit over a range. Therefore, in Corollary 1, we pick the value that optimizes the constants. Since ReBoot performs well empirically, as we show in Section 4, the theoretically suggested value of the exploration aid unit is likely to be loose.
3.3 Proof Scheme
We roadmap the proof of Theorem 1. The key is to analyze the situations that lead to pulling a suboptimal arm. Such situations consist of two types of events: underestimating the optimal arm and overestimating a suboptimal arm.
As shown in Theorem 1 of Kveton et al. (2018), the $T$-round regret of a perturbed-history type algorithm has an upper bound
The first term involves the expected number of rounds in which the optimal arm 1 is underestimated; formally,
where the inner quantity is the expected number of rounds in which the optimal arm 1 is underestimated given its sample rewards. The second term involves the probability that a suboptimal arm is overestimated; formally,
where the inner quantity is the probability that the suboptimal arm $k$ is overestimated given its sample rewards.
Here we explain the situations in which the bandit algorithm will not pull a suboptimal arm at round $t$. Consider the optimal arm 1 and a suboptimal arm $k$, with their respective sample sizes and indexes at round $t$. Given a constant level, we define the event that the optimal arm 1 is underestimated as
and the event that a suboptimal arm $k$ is overestimated as
If we pick the level appropriately and the distributions of both indexes have exponentially decaying tails, the theory of large deviations indicates that both events are asymptotically rare. Given that neither event happens, the agent will not pull the suboptimal arm $k$.
Roadmap of bounding the underestimation term
We provide asymptotic reasoning for bounding the underestimation term and defer the non-asymptotic analysis to Lemma 1. Recall that, for a given constant level, the probability of the optimal arm 1 being underestimated given its reward samples is as defined above. If we pick the level appropriately, the theory of large deviations gives
Recall that the quantity of interest is the expected number of rounds needed to observe a non-underestimated instance from the resample mean distribution given the reward samples. The asymptotics in (27) imply that this quantity vanishes as the number of pulls grows to infinity. Thus, given the time horizon $T$, there exists a constant beyond which the quantity is uniformly controlled. Consequently, the corresponding term in the regret bound (22) is bounded by
Roadmap of bounding the overestimation term
We provide asymptotic reasoning for bounding the overestimation term and defer the non-asymptotic analysis to Lemma 5. Recall that, for a given constant level, the probability of a suboptimal arm being overestimated given its reward samples is as defined above. If we pick the level appropriately, the theory of large deviations gives
Thus, given the time horizon $T$, there exists a constant number of pulls beyond which the probability is negligible. As a result, the overestimation event is empty once the number of pulls exceeds this constant. Consequently, the corresponding term in the regret bound (22) is bounded by
The fact that this constant is of the required order will be shown in Lemmas 6 and 7. For a small number of pulls, we apply the trivial bound that holds for any probability. Therefore, we conclude that the overestimation term can be bounded by a term of the claimed order.
4 Experiments
We compare ReBoot to three baselines: TS (Agrawal and Goyal, 2013) (with a fixed prior), Giro (Kveton et al., 2018), and PHE (Kveton et al., 2019). For all the experiments, unless otherwise specified, we choose the tuning parameters of Giro and PHE as justified by their associated theory. All results are averaged over repeated runs.
4.1 Robustness to Reward Mean/Variance
We compare ReBoot with the two bounded-reward bandit algorithms, Giro and PHE, on two classes of $K$-armed Gaussian bandit problems. The first class fixes the variance and varies the reward means over a range. The second class fixes the means and varies the variance, where we choose the exploration aid unit as guided by our theory. Figure 2 shows the effect of the shifted mean and the varying variance on the $T$-round regret of the three algorithms.
From the left panel of Figure 2, we see that ReBoot is robust to the increase in the mean rewards, while both Giro and PHE are sensitive. The reason is that when the mean rewards increase, the added pseudo rewards cannot represent the upper extreme values, which are supposed to help with escaping from suboptimal arms. From the right panel of Figure 2, we observe slow growth in the regret of ReBoot and Giro as the variance increases, due to the raised difficulty of the problem, while PHE is much more sensitive: the exploration in PHE relies entirely on pseudo rewards, which are inappropriate under varying variance, whereas ReBoot and Giro mostly depend on the bootstrap, which yields more stable performance.
4.2 Robustness to Reward Shape
We compare ReBoot with TS, Giro, and PHE on three classes of -armed bandit problems:
with mean ;
is a logistic distribution with mean and variance .
In each class, the mean reward takes values over a grid, and the variance is either constant across all arms (Gaussian and logistic) or varies among arms (exponential). The regret over the horizon is displayed in Figure 3.
ReBoot has sub-linear and small regret in all cases, which validates our theory for Gaussian bandits and suggests applicability to bandits with other reward distributions, even with heteroscedasticity. We also see that the theoretically guided exploration aid is a near-optimal choice for Gaussian bandits and exponential bandits (for logistic bandits, a slightly different value performs slightly better). The linear regret of PHE and the sub-linear but large regret of Giro in all cases arise because the mean rewards are shifted away from the range assumed by their pseudo rewards. TS cannot achieve its optimal performance without setting the prior with accurate knowledge of the reward distribution.
Table 2. Run times (seconds) of each model.
4.3 Computational Cost
We compare the run times of ReBoot, TS, Giro, and PHE in a Gaussian bandit. The settings consist of all combinations of the numbers of arms and horizons considered. Our results are reported in Table 2. In all settings, the run times of ReBoot, TS, and PHE are comparable, while the run time of Giro is significantly higher, due to the computationally expensive sampling with replacement over the history with pseudo rewards. This comparison validates our analysis in Section 2.5.
5 Conclusion
In this work, we propose a new class of algorithms, ReBoot: a residual bootstrap based exploration mechanism. We highlight the limitation of directly using the statistical bootstrap in the bandit setting and develop a remedy procedure called variance inflation. We analyze ReBoot in an unbounded-reward showcase (the Gaussian bandit) and prove an optimal instance-dependent regret bound.
- Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
- S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pp. 99–107, 2013.
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), pp. 235–256, 2002.
- A. N. Elmachtoub, R. McNellis, S. Oh, and M. Petrik. A practical method for solving contextual bandit problems using decision trees. arXiv preprint arXiv:1706.04687, 2017.
- B. Hao, Y. Abbasi-Yadkori, Z. Wen, and G. Cheng. Bootstrapping upper confidence bound. In Advances in Neural Information Processing Systems 32, pp. 12123–12133, 2019.
- B. Kveton, C. Szepesvári, M. Ghavamzadeh, and C. Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. arXiv preprint arXiv:1902.10089, 2019.
- B. Kveton, C. Szepesvári, S. Vaswani, Z. Wen, M. Ghavamzadeh, and T. Lattimore. Garbage in, reward out: bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154, 2018.
- T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, 2018.
- X. Lu and B. Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pp. 3258–3266, 2017.
- E. Mammen. Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, pp. 255–285, 1993.
- I. Osband and B. Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
- D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11(1), pp. 1–96, 2018.
- R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48, Cambridge University Press, 2019.
Appendix A Proof of Theorem 1.
Step 0: Notation and Preparation
We restate Theorem 1 of Kveton et al. (2018):
Theorem 2. Let the relevant quantities be defined for each arm $k$. For any choice of parameters, the expected $T$-round regret of the general randomized exploration algorithm is bounded from above as
We then use Theorem 2 to analyze the regret of Gaussian ReBoot.
Step 1: Bounding .
Recall and . We define event and . Let and hence . Define and set . Define . Set . Note that by Lemma 1, for any . Write , where
Step 2: Bounding .
Recall that and . We define events and . Let and hence .
Step 3: Bounding -round regret .
Combining the results obtained in Steps 1 and 2, one has
Then plug this into Theorem 2 to obtain the claimed bound.
Appendix B Technical Lemmas
B.1 Lemmas on bounding the underestimation term.
Lemma 1 (Bounding for any ).
Set . For any ,