Code for "Best arm identification in multi-armed bandits with delayed feedback", AISTATS 2018.
We propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedback. The delay in feedback increases the effective sample complexity of standard algorithms, but can be offset if we have access to partial feedback received before a pull is completed. We propose a general framework to model the relationship between partial and delayed feedback, and as a special case we introduce efficient algorithms for settings where the partial feedback are biased or unbiased estimators of the delayed feedback. Additionally, we propose a novel extension of the algorithms to the parallel MAB setting where an agent can control a batch of arms. Our experiments in real-world settings, involving policy search and hyperparameter optimization in computational sustainability domains for fast charging of batteries and wildlife corridor construction, demonstrate that exploiting the structure of partial feedback can lead to significant improvements over baselines in both sequential and parallel MAB.
Intelligent agents often need to interact with the environment and make rational decisions that optimize for a suitable objective. One such setting that commonly arises is the best arm identification problem in stochastic multi-armed bandits (Bubeck et al., 2009; Audibert and Bubeck, 2010). In a multi-armed bandit (MAB) problem, an agent is given a finite set of actions (or arms), each associated with a reward drawn from an arm-specific probability distribution. In a pure exploration setting, the goal is to reliably identify the top-$k$ arms while minimizing the exploration cost. This problem has numerous applications, including optimal experimental design.
We consider a new variant of this problem where the feedback rewards are received after a delay. Delayed feedback is common in the real world. For instance, hypothesis testing in science and engineering often suffers from delayed feedback since it involves expensive, time-consuming experiments. In one of the motivating applications of this work, we want to search over fast-charging policies for electrochemical batteries to maximize lifetime, overcoming the difficulties posed by lengthy experiments. Even within the field of machine learning, finding the best hyperparameter settings for a given learning algorithm and dataset can be modeled as a best arm identification problem involving a non-trivial delay (Jamieson and Talwalkar, 2016).
However, many scenarios of interest are not complete black-boxes during the intermediate time steps before receiving a delayed feedback reward. Depending on the application, we often have access to side-information in the form of partial feedback that can aid decision making. These could be extra measurements such as temperature and remaining capacity while charging batteries in the aforementioned scenario, or learning curves for hyperparameter optimization.
In this work, we propose a general-purpose framework for modeling delayed feedback in MAB, and take a deeper dive into several practically relevant instantiations. In particular, we design and analyze algorithms for best arm identification in the fixed confidence setting where the partial feedback are biased or unbiased estimators of the delayed feedback. Our proposed algorithms adaptively tune the mean and confidence estimates wherever the partial feedback reduces the overall uncertainty. We also extend these algorithms to the parallel MAB setting where we are allowed to pull a batch of arms at every time step (Jun et al., 2016).
Finally, we empirically validate the proposed algorithms on simulated data and real world datasets drawn from two domains. The first corresponds to experimental design for finding the optimal charging policy for a battery that maximizes overall lifetime (Moura et al., 2017). In the second domain, we perform hyperparameter optimization for finding the best cut strategy for a standard mixed integer programming solver with performance tested on a benchmark set of problem instances drawn from computational sustainability (Gomes et al., 2008). Our experiments demonstrate that accounting for partial feedback can reduce the delayed sample complexity on average by 15.6% and 80.8% for sequential MAB over baselines for the two application scenarios respectively. The corresponding average savings over baselines for parallel MAB are 20.7% and 87.6% respectively.
The chief workhorse of our analysis will be the law of the iterated logarithm (LIL), which characterizes the limiting behavior of random walks (the sequence of pulls for a given arm in our case) defined over sub-Gaussian random variables (Darling and Robbins, 1967). Several finite LIL bounds have been proposed in the literature; we consider the one proposed by Zhao et al. (2016), which has been shown to outperform others empirically while retaining the same asymptotic behavior. Alternate bounds, such as the one by Jamieson et al. (2014), could also be used with no effect on the theoretical analysis of this work.
Let be i.i.d. sub-Gaussian random variables with scale parameter and mean . Let be any random variable with domain . For any , the following holds with probability at least :
where denotes the Riemann zeta function. The constants in Lemma 1 are chosen such that the lemma holds for a target confidence. To simplify the notation, we denote the error probability by and the right hand side of Lemma 1 by such that the following holds with probability for any :
We consider a stochastic multi-armed bandit (MAB) problem characterized by a set of arms, indexed by . Each arm is associated with a fixed, unknown probability distribution with means . We assume that the means are unique. Without loss of generality, assume that the arm indices are sorted as per the means, such that .
We are interested in the pure exploration setting, also known as the best arm identification problem, where the goal of an agent is to identify the top-$k$ arms (with the highest means) with a target confidence while minimizing the total time spent on exploration. Exploration in our setting, however, is not the same across the pulls of a given arm. In particular, we assume that each pull of an arm is associated with an unknown (stochastic) delay that contributes to the total exploration time. The presentation in this section assumes a sequential MAB setting where the agent can pull/run only one arm at a given time step; the alternate parallel MAB setting where an agent can control a “batch” of arms at once is discussed in Section 4 (Perchet et al., 2015; Wu et al., 2015; Jun et al., 2016).
Formally, the stochastic data generating process with delayed feedback can be described as follows. At any given start time :
1. Agent chooses an arm .
2. Nature samples a delay from an (unknown) arm-specific delay distribution.
3. Nature samples a sequence of partial feedback, jointly. The joint distribution of the partial feedback depends on.
4. At time where ,
   (a) Nature reveals to the agent.
   (b) If , the agent goes to step 1. Otherwise, the agent decides whether to continue the current pull (step 4) or start another pull (step 1), in which case any remaining partial feedback for the current pull will not be observed.
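The protocol above can be sketched as a small simulation of a single pull. This is only an illustrative sketch under stated assumptions: `simulate_pull` and its three callback arguments are hypothetical names rather than part of the paper's released code, and Gaussian samples stand in for generic sub-Gaussian feedback.

```python
import random


def simulate_pull(sample_delay, sample_partial, keep_going):
    """Play one pull of the delayed-feedback game.

    sample_delay():        hypothetical sampler for the pull's delay (int >= 1)
    sample_partial(delay): hypothetical sampler returning `delay` feedback values
    keep_going(observed):  the agent's rule; returning False abandons the pull
    Returns (observed feedback, time steps spent, whether the pull finished).
    """
    delay = sample_delay()               # nature draws the delay
    feedback = sample_partial(delay)     # nature draws the whole sequence jointly
    observed = []
    for step in range(delay):
        observed.append(feedback[step])  # one value is revealed per time step
        if step == delay - 1:
            return observed, delay, True       # the pull ran to completion
        if not keep_going(observed):
            return observed, step + 1, False   # the rest is never observed


rng = random.Random(0)
obs, elapsed, done = simulate_pull(
    sample_delay=lambda: 5,
    sample_partial=lambda d: [rng.gauss(1.0, 0.5) for _ in range(d)],
    keep_going=lambda seen: len(seen) < 3,  # toy rule: give up after 3 values
)
```

In this toy run the agent abandons the pull after three time steps, so the last two partial feedback values are never revealed and only three time steps count toward the exploration time.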
The agent and nature continue to play the above game until the agent has selected a set of candidate top-$k$ arms. The delay can contribute significantly to the total time spent on exploration. Under appropriate assumptions, however, we can exploit the structure in the partial feedback to significantly reduce the overall exploration cost of delayed feedback. The data generating process described above is very general and one can make many natural assumptions on the distribution of the partial feedback .
For instance, we can model the following scenarios:
Full delayed feedback: The partial feedback at the last delay, is sub-Gaussian with mean and scale parameter . For the intermediate time steps, , we have , and hence, we receive no information about at these time steps.
Incremental partial feedback: The set of partial feedback for every time step consists of mutually independent, sub-Gaussian random variables with mean and scale parameter . Hence, the cumulative partial feedback is also sub-Gaussian with mean and scale parameter .
Unbiased noisy partial feedback: The partial feedback at the last delay, is sub-Gaussian with mean and scale parameter . For the intermediate time steps, , the set of partial feedback consists of mutually independent, sub-Gaussian random variables with zero mean and scale parameter .
Biased noisy partial feedback: The partial feedback at the last delay, is sub-Gaussian with mean and scale parameter . For the intermediate time steps, , the set of partial feedback consists of mutually independent, sub-Gaussian random variables with mean and scale parameter . Here, is a fixed, but unknown bias associated with the partial feedback for the arm.
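To make the noisy scenarios concrete, the following sketch generates the feedback sequence of one pull. It rests on assumptions that the text above does not fully pin down: each intermediate feedback is read as the final feedback plus an arm-specific bias plus independent zero-mean noise, Gaussian distributions stand in for sub-Gaussian ones, and `sample_pull` with all of its parameter names is hypothetical.

```python
import random


def sample_pull(mu, sigma, delay, noise_sigma, bias=0.0, rng=random):
    """Return the feedback sequence of one pull; the last entry is the full feedback.

    Assumption: intermediate feedback = final feedback + bias + zero-mean noise.
    Setting bias=0.0 gives the unbiased noisy scenario.
    """
    final = rng.gauss(mu, sigma)  # full delayed feedback with mean mu
    partial = [final + bias + rng.gauss(0.0, noise_sigma) for _ in range(delay - 1)]
    return partial + [final]


# Unbiased and biased sequences for an arm with mean 1.0 and a delay of 4 steps.
unbiased = sample_pull(mu=1.0, sigma=0.2, delay=4, noise_sigma=0.5, rng=random.Random(1))
biased = sample_pull(mu=1.0, sigma=0.2, delay=4, noise_sigma=0.0, bias=0.5, rng=random.Random(1))
```

With `noise_sigma=0.0`, every intermediate value of `biased` sits exactly `bias` above the final feedback, which is the structure the bias-correction step relies on.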
Note that the standard MAB setting where we observe the feedback at the immediate next time step is a special case of the full delayed feedback with a constant delay for every pull. In fact, the algorithms for best arm identification in the full delayed and incremental partial feedback settings can be derived naturally from the standard MAB algorithms with no delays. Specifically, the agent can simply choose to ignore the time instants at which delayed feedback is unavailable for the full delayed feedback setting. The sample complexity of any such algorithm is hence the number of arm pulls required in the standard MAB setting weighted by the delay of every pull. These settings are still interesting for parallel MAB where information can be shared across arms; we discuss this case in Section 4.
The partial feedback settings, however, present an interesting scenario where the agent can extract information from noisy feedback. For such settings, we propose modified algorithms based on racing-style procedures typically used for the standard MAB setting (Maron and Moore, 1994). Typically, racing algorithms maintain three disjoint arm sets: accepted arms , rejected arms , and surviving arms . Initially, all arms are assigned to the surviving set . Racing procedures uniformly sample arms while removing them from the surviving set based on confidence bounds. For convenience, define the lower confidence bounds (LCB) and upper confidence bounds (UCB) for every arm as:
where is the empirical mean of the feedback for arm and the confidence bound will depend on the particular racing algorithm under consideration. Let be the effective number of top arms remaining to be identified at a time step . Each time we receive a feedback reward (full or partial), the racing procedures update these sets based on the rule that any arm in whose LCB is greater than the UCB of arms is accepted. Similarly, any arm in whose UCB is less than the LCB of arms is rejected. The racing procedure is repeated until is empty. The pseudocode for the subroutine that updates the arm sets is given in Algorithm 1.
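The update rule described above can be sketched in a few lines. Algorithm 1 itself is not reproduced here, so the thresholds are an assumption based on the standard racing rule: with `k_t` top arms still to be identified, an arm is accepted once its LCB exceeds the UCB of at least `|S| - k_t` other surviving arms, and rejected once its UCB falls below the LCB of at least `k_t` other surviving arms. The function and argument names are hypothetical.

```python
def update_arm_sets(surviving, accepted, rejected, lcb, ucb, k):
    """One pass of the racing update; mutates the three sets in place.

    lcb, ucb: dicts mapping each arm to its current confidence bounds.
    k:        total number of top arms to identify.
    """
    k_t = k - len(accepted)  # top arms that remain to be identified
    n = len(surviving)
    to_accept, to_reject = set(), set()
    for a in surviving:
        others = surviving - {a}
        beats = sum(lcb[a] > ucb[b] for b in others)      # arms a is surely above
        beaten_by = sum(ucb[a] < lcb[b] for b in others)  # arms a is surely below
        if beats >= n - k_t:
            to_accept.add(a)
        elif beaten_by >= k_t:
            to_reject.add(a)
    surviving -= to_accept | to_reject
    accepted |= to_accept
    rejected |= to_reject


surviving, accepted, rejected = {0, 1, 2}, set(), set()
lcb = {0: 0.9, 1: 0.4, 2: 0.1}
ucb = {0: 1.1, 1: 0.6, 2: 0.3}
update_arm_sets(surviving, accepted, rejected, lcb, ucb, k=1)
```

In this toy instance the confidence interval of arm 0 lies entirely above those of the other two arms, so a single pass accepts arm 0, rejects arms 1 and 2, and empties the surviving set.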
In sequential MAB, we assume that the agent can receive (partial) feedback from only a single arm pull at any given time step, e.g., we can only perform one experiment at a time. We skip a separate discussion on the trivial full feedback (and the related incremental feedback) setting and discuss it only in the context of the noisy feedback settings. For convenience, we denote the partial feedback at the last delay as . Here, is a sub-Gaussian random variable with mean and scale parameter . The proofs of all results in this section are given in the Appendix.
In this setting, an agent has access to unbiased partial feedback at the intermediate time steps before receiving the full delayed feedback. In the following result, we derive a variation of the finite LIL bound for the unbiased partial feedback setting.
Let , denote the partial feedback sequences for the pulls of an arm started at time steps and delays . Then, under the distributional assumptions on the unbiased partial feedback (see Section 2) for any , , , we have with probability :
where by definition. At any intermediate time step between the start and end of the -th arm pull, Proposition 1 adaptively “splits” the confidence bounds pertaining to the full delayed feedback for steps (first term in the RHS) and the partial delayed feedback for the -th arm pull (second term in the RHS). Contrast this with the full delayed feedback setting where the following confidence bound holds with probability :
To obtain the same target confidence in the two cases above, we constrain . Solving for the optimal that minimize the RHS of Eq. (1) under the constraint due to corresponds to a convex optimization problem that can be solved in closed form. Comparing the mean estimators in Eq. (1) and Eq. (5), we note that in the latter case the agent can only use the full delayed feedback up to the -th arm pull while waiting for the outcome of the -th arm pull, whereas the former dynamically incorporates the partial feedback observed for the -th arm pull.
Based on the above analysis, we propose a racing algorithm for the unbiased partial feedback setting with the pseudocode given in Algorithm 2. At any intermediate time step, the agent chooses a mean estimator and a confidence bound for the current arm (Lines 10-13). The choice corresponds to the tighter confidence bound obtained either by optimizing Eq. (1) over or the one obtained by Eq. (5) where only the full delayed feedback are considered. Thereafter, the agent invokes the racing subroutine that checks whether a surviving arm can be rejected or accepted (Line 14). If the pull has finished running or the current arm is itself eliminated (Line 15), then in the next time step the agent pulls the arm with the fewest full delayed feedback observations (Line 19).
We can make some observations about Algorithm 2. First, we see that an agent adopting the proposed algorithm can never do worse than the alternate racing strategy that considers estimates only based on the full delayed feedback. This is because even at the intermediate time steps, the agent considers the mean estimator corresponding to the smaller of the two confidence bounds, which can only reduce the delayed sample complexity of the algorithm. Whenever an arm pull has finished, the agent also updates the mean and confidence interval by an arithmetic averaging over the full delayed feedback. Using partial feedback is not useful at such time steps since the partial feedback only introduce noise and do not provide any additional information about the true mean.
If the maximum possible delay associated with any arm pull is given by , then we can trivially extend bounds for the sample complexity of racing style procedures (Jamieson and Nowak, 2014) to derive similar bounds on the delayed sample complexity with an extra multiplicative factor of . (The delayed sample complexity of an algorithm refers to the total number of time steps, including delays, before termination.) This is similar to what one would expect from the full delayed feedback setting and is not surprising for Algorithm 2 since, in the absence of any additional assumptions, the partial feedback could be completely uninformative and the algorithm will choose to ignore them. We believe domain-specific assumptions about the delay distribution and the noise associated with the partial feedback as a function of time could lead to a tighter analysis; this is an interesting direction for future work. The correctness of Algorithm 2 is summarized below.
If the delay associated with any arm pull is bounded, then Algorithm 2 outputs the top-$k$ arms with probability at least .
To get further intuition about the working of Algorithm 2, consider the situation where all arms have been pulled once except one. When the last remaining arm is pulled for the first time, the full delayed feedback setting will necessarily have to wait for the pull to finish running before eliminating the arms whereas Algorithm 2 can potentially start eliminating arms right after the first partial delayed feedback is received.
The partial feedback at the intermediate time steps before a full delayed feedback can also correspond to biased estimates of the full delayed feedback. Although the bias for the arms is unknown, it can be estimated empirically based on differences in the full delayed feedback and the partial feedback at the corresponding intermediate time steps. Formally, we assume the bias for a particular arm is an unknown constant and derive the following LIL bounds.
Let , denote the partial feedback sequences for the pulls of an arm started at time steps and delays with bias . Then, under the distributional assumptions on the partial feedback (see Section 2) for any , , , we have with probability :
Comparing Eq. (2) with Eq. (5) by constraining , we see that the mean estimator takes into account the partial feedback as before but also has a bias correction term. The bias correction term is an empirical average of the biases observed from the past full delayed feedback. This correction has the effect of introducing additional uncertainty (third term in the RHS), and we need at least one full feedback to estimate the bias before we can use the above bound. The corresponding racing algorithm runs similarly to Algorithm 2, with the key difference being that the mean estimator corresponds to the minimum of the confidence bounds in Eq. (5) and Eq. (2), where the RHS of Eq. (2) is specified for the optimal minimizing the expression under the constraint due to . We defer the pseudocode for this setting to the Appendix (see Algorithm 4).
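The idea of estimating the bias from completed pulls can be illustrated with a short sketch. It is not the estimator of Eq. (2), whose exact form and weighting are not reproduced here; it only shows one simple way to average the observed gap between partial and full feedback and subtract it from the running partial-feedback mean. The function name and arguments are hypothetical.

```python
def bias_corrected_estimate(current_partial, past_partial, past_full):
    """Correct the mean of the ongoing pull's partial feedback for an unknown bias.

    current_partial: partial feedback seen so far for the ongoing pull
    past_partial:    list of partial-feedback lists, one per completed pull
    past_full:       full delayed feedback of those completed pulls
    """
    if not past_full:
        # The text requires at least one full feedback before the bias is usable.
        raise ValueError("need at least one completed pull to estimate the bias")
    gaps = [sum(p) / len(p) - f for p, f in zip(past_partial, past_full)]
    bias_hat = sum(gaps) / len(gaps)  # empirical average of observed biases
    return sum(current_partial) / len(current_partial) - bias_hat


estimate = bias_corrected_estimate(
    current_partial=[3.5, 3.5],
    past_partial=[[1.5, 1.5], [2.5]],
    past_full=[1.0, 2.0],
)
```

Here both completed pulls show partial feedback 0.5 above the full feedback, so the running mean of 3.5 is corrected to 3.0.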
In parallel MAB, an agent has the additional ability to “accumulate” bulk information by controlling a batch of arm pulls. We extend the setting proposed in Jun et al. (2016) where the agent is allowed to run at most arm pulls in parallel at any given time step with an upper limit on the number of pulls of each arm.
Even the full delayed feedback setting becomes interesting, as the agent can exploit information from arm pulls which have finished running in parallel to accept/reject delayed arm pulls that are still running thereby avoiding the pitfalls of long delays. The pseudocode for the proposed batch racing algorithm with full delayed feedback is given in Algorithm 3. At every time step, an agent pulls a batch of arms with the least pull count that obeys the constraints (Lines 18-19). Whenever we obtain at least one full delayed feedback, we can update our arm sets as per the racing criteria (Lines 13-15).
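The batch selection step can be sketched as a greedy rule that fills the free parallel slots with the least-pulled surviving arms. This is a sketch under assumptions rather than a transcription of Algorithm 3: `batch_size` stands for the limit on parallel pulls, `per_arm_limit` for the limit on simultaneous pulls of one arm, and pulls still in flight are counted toward an arm's pull count. All names are hypothetical.

```python
def select_batch(surviving, pull_counts, running, batch_size, per_arm_limit):
    """Choose which arms to start pulling in this time step.

    pull_counts: dict arm -> number of completed pulls
    running:     dict arm -> number of pulls currently in flight
    Returns a list of arms, one entry per newly started pull.
    """
    free_slots = batch_size - sum(running.values())
    in_flight = dict(running)
    batch = []
    for _ in range(max(free_slots, 0)):
        eligible = [a for a in surviving if in_flight.get(a, 0) < per_arm_limit]
        if not eligible:
            break
        # Least pull count first; ties are broken by arm index for determinism.
        arm = min(eligible, key=lambda a: (pull_counts.get(a, 0) + in_flight.get(a, 0), a))
        batch.append(arm)
        in_flight[arm] = in_flight.get(arm, 0) + 1
    return batch


counts = {0: 3, 1: 1, 2: 1}
single = select_batch({0, 1, 2}, counts, running={}, batch_size=2, per_arm_limit=1)
repeated = select_batch({0, 1, 2}, counts, running={}, batch_size=3, per_arm_limit=2)
```

With a per-arm limit of 1 the two free slots go to arms 1 and 2; raising the limit to 2 lets arm 1 be pulled twice before the more heavily sampled arm 0 is considered.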
The algorithms for the noisy partial feedback settings discussed in Section 3 can be extended for parallel MAB in a similar manner and are skipped here to keep the presentation clean. The theoretical analysis of the batch MAB setting in Jun et al. (2016) builds on the analysis of standard MAB in ways independent of the choice of LIL bounds and hence, a merged analysis for delayed batch MAB using the LIL bounds for delayed feedback (as in Propositions 1 and 2) suggests a reduction factor of in the corresponding upper bounds.
We empirically validated the proposed algorithms on a simulated setting and two real world datasets. All experiments use an error probability of and we observed that in each case, the algorithm obtains the desired confidence level empirically. For the parallel MAB setting, we set .
We performed an ablation study of the proposed algorithms for sequential and parallel MAB under different settings of delayed feedback. All experiments were repeated for random runs such that the standard errors are vanishingly small and the number of top arms to be identified, is set to . We quantify improvement as the ratio (=) of the time taken by Algorithm 2 or its parallel MAB extension (i.e., ) and the time taken by a full delayed feedback racing procedure (i.e., ). We evaluate performance as a function of the following problem parameters.
To analyze the difference in performance as a function of the number of arms (), we further consider two distributions of means.
In the bounded means case, we set the means of the arms as for any choice of constants and . Hence, the range of the means does not vary with . In Figure 1(a), we observe that accounting for unbiased partial feedback can give gains of up to 25% and 40% for sequential and parallel MAB, respectively, when the number of arms is low. The gains are reduced when the number of arms is large, which suggests that partial feedback is less advantageous in scenarios where a large number of full pulls are required for disambiguating very closely spaced means.
In the free means case, we set the means of the arms as for any choice of constants and . Here, the range of the means increases with . From the results in Figure 1(b), we observe that the gains due to partial feedback improve as the number of arms increases. This suggests that when the relative separation in means between the arms is fixed, Algorithm 2 and its parallel MAB extension quickly eliminate arms with extreme means (very high or very low), unlike the racing algorithms that wait for full delayed feedback.
Here, we fix and vary the delay of the arms. For all settings of the delay in Figure 1(c), Algorithm 2 and its parallel MAB extension require a significantly lower fraction of the time, with the lowest ratios observed to be and for sequential and parallel MAB respectively. While we did not see much variation in improvements for sequential MAB, the improvements are better for longer delays in the case of parallel MAB.
For any given battery chemistry, the charging (and discharging) policy has a significant impact on the lifetime of the cells. However, a single run of a particular policy takes months to complete since every cell needs to be repeatedly charged and discharged until the end of its lifetime. Hence, delayed feedback can significantly slow down the search procedure. The true, unknown reward for any arm (charging policy) is stochastic and corresponds to the lifetime of the battery (Harris et al., 2017; Baumhöfer et al., 2014; Schuster et al., 2015). (Formally, the lifetime of the cell is defined to be the number of cycles until a battery reaches of its original capacity, at which point the battery is considered dead.)
We model the search for the best charging policy for the Li-ion battery chemistry as a best arm identification problem in a stochastic MAB with arms, . The true mean cycle life, cell-to-cell variances, and delays are obtained from a battery charging simulator (Moura et al., 2017; Perez et al., 2016). While a battery cell undergoes charging and discharging, we can additionally monitor key indicators such as voltage, temperature, and internal resistance. Building predictive models of lifetime based on these factors is an active area of research, and such models can serve as partial feedback estimators (Burns et al., 2013; Dubarry et al., 2017). We assume the existence of such an estimator and test the robustness of our algorithm by evaluating the relative improvements obtained from Algorithm 2 on varying the noise associated with the partial feedback. The results are shown in Figure 2. When the estimator is “trustworthy” (low ), we can achieve improvements of up to 35% in the number of experiments required. As expected, the gains diminish for poorer models of partial feedback, in which case the algorithm can choose to ignore the noisy feedback.
The CPLEX solver (https://www.ibm.com/software/commerce/optimization/cplex-optimizer/index.html) for mixed integer programming has a host of hyperparameters, including options to switch on or off different cut strategies employed by the solver during the search process. We model the task of finding the best cut strategy as a stochastic MAB problem with arms (i.e., cut strategies), . The performance is measured on CORLAT, a benchmark set of (maximization) mixed integer linear programming instances derived from real world data used for the construction of a wildlife corridor for grizzly bears in the Northern Rockies region (Gomes et al., 2008; Hutter et al., 2010). The true mean for each arm is the average of lower bounds attained by the cut strategy on the feasible instances in the dataset under specified time and resource constraints per instance ( seconds on core). Every pull of an arm corresponds to running a cut strategy on a sampled problem instance.
Instead of waiting for the solver to completely solve (or time out on) a sampled problem instance, we can save computation by using partial feedback about the search process. In particular, the solver outputs the best integral lower bound (LB) and real valued upper bound (UB) found after executing each cut during search. The final output of the solver is the best lower bound. To obtain an unbiased partial feedback estimator, we use a training subset of instances to learn a linear model that predicts the final lower bound for a given input instance based on the intermediate lower and upper bounds. The best arm identification algorithms are tested on the remaining instances in the dataset. Conditioned on a problem instance, the uncertainty associated with the partial feedback, is given by and shrinks with an increase in the time steps elapsed. Note that the delays are not fixed and depend on both the cut strategy and the problem instance under consideration. We directly report the final results: the percentage reduction in time taken by the unbiased partial feedback scenarios over full delayed feedback is 80.8% and 87.6% for sequential and parallel MAB respectively, stressing the importance of partial feedback for this particular application scenario.
Early work in pure exploration is attributed to Bechhofer (1958) and Paulson (1964), who studied this problem in the context of optimal experimental design. Modern day literature can be categorized into either the fixed budget or the fixed confidence settings. Algorithms for the fixed budget setting strive to maximize the probability of identifying the top-$k$ arms (Audibert and Bubeck, 2010; Bubeck et al., 2013; Kaufmann et al., 2015). In the fixed confidence setting, which is the one we consider in this paper, the goal is to minimize the number of pulls to attain a target confidence (Maron and Moore, 1994; Bubeck et al., 2009). See Gabillon et al. (2012) for a unified treatment of the two settings.
Algorithms for the fixed confidence setting can be broadly classified into racing style procedures, which sample arms uniformly and eliminate sub-optimal arms (Maron and Moore, 1994; Even-Dar et al., 2002), and the UCB/LUCB style procedures, which adaptively sample arms without explicit elimination. We direct the reader to the excellent survey by Jamieson and Nowak (2014) that summarizes the major advancements in the analysis of the sample complexity of these algorithms. Algorithmic generalizations of the best arm identification include top-$k$ identification (Heidrich-Meisner and Igel, 2009) and the parallel MAB settings for batch arm pulls (Perchet et al., 2015; Jun et al., 2016; Wu et al., 2015) among others.
While the delayed feedback framework we propose is novel to the pure exploration problem, online learning with delays has been studied previously in the regret minimization setting (Weinberger and Ordentlich, 2002; Joulani et al., 2013; Desautels et al., 2014). In particular, algorithms designed specifically for hyperparameter optimization have enjoyed great success. Krueger et al. (2015) propose a modified cross-validation procedure performed on increasing subsets of data, coupled with a sequential testing strategy to eliminate poor parameter configurations early on. Jamieson and Talwalkar (2016) and Li et al. (2017) recently proposed algorithms for hyperparameter optimization based on non-stochastic MAB. Here, the arms correspond to hyperparameter configurations, and a pull is equivalent to observing a fixed sequence of losses.
For many real-world problems, we have access to a shared structure across arms that makes the overall problem amenable to Bayesian optimization techniques (Snoek et al., 2012; Eggensperger et al., 2013; Snoek et al., 2015; Feurer et al., 2015; McIntire et al., 2016a, b). Combining the LIL bounds we proposed for noisy partial feedback with Bayesian multi-armed bandits (Srinivas et al., 2010; Krause and Ong, 2011; Hoffman et al., 2014) is a promising extension we are pursuing for our ongoing real-world application relating to efficient search of fast charging policies for Li-ion battery cells (Ermon et al., 2012).
We introduced a new general framework for pure exploration in stochastic multi-armed bandit problems with partial and delayed feedback. We provided efficient algorithms for solving specific instantiations of our framework that can naturally model real world scenarios, especially in the context of optimal experimental design. We leave as future work the problem of identifying information-theoretic lower bounds on the sample complexity of the new pure exploration problems we formulated. Extension of our framework to the fixed budget setting is another interesting direction for future work.
We are thankful to Neal Jean and Daniel Levy for helpful comments on early drafts. This research has been supported by a Microsoft Research PhD fellowship in machine learning for the first author, NSF grants #1651565, #1522054, #1733686, Toyota Research Institute, Future of Life Institute, Precourt Institute for Energy, and Intel.
PAC bounds for multi-armed bandit and Markov decision processes. In Conference on Learning Theory, 2002.
AAAI Conference on Artificial Intelligence, 2015.
Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 303–307, 2008.
Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.
By Lemma 1 applied to for an arm for full delayed feedback, we have w.p. :
For any , , and . Conditioned on , is sub-Gaussian by assumption.
Therefore, conditioned on , by Lemma 1 applied to for an arm computed using partial feedback for the -th pull, we have w.p. :
Given that the result does not depend on the value , we have:
Union bounding Eq. (10) over all arms, we have w.p. :
finishing the proof. ∎
At any given time , , we observe full feedback, for an arbitrary arm . Accordingly, we have the following two cases to consider as per Algorithm 2.
Case (b): otherwise
Let be the event that the lower and upper confidence bounds of arm trap the true mean for all where and are chosen as described above at time . Let denote the set of surviving, accepted, and rejected arms at time . We can then state and prove the following lemma.
Assume holds for an arbitrary arm and . Then, the following statements hold:
By definition, . Recursing over , we note that . Since the lemma assumes that arm , either or .
We will prove the first statement of the lemma by contradiction. For an arbitrary , let us assume . This implies that . Since, by the assumptions of the lemma, the lower and upper confidence bounds of any arm trap its true mean, we have and . Hence, we obtain , which is a contradiction since . The second statement holds true by symmetry. ∎
By Lemma 1 applied to for an arm for full delayed feedback, we have w.p. :
For any , , and . Conditioned on , is sub-Gaussian by assumption. Therefore, conditioned on , by Lemma 1 applied to for the (incomplete) -th pull of an arm with partial feedback, we have w.p. :
Now, consider the random variables for all :
The random variables in (14) are all sub-Gaussian with mean and scale parameter . Hence, applying LIL on these random variables conditioning on , we have w.p. :
Finally, union bounding Eq. (B.1) over all arms, we have w.p. :
finishing the proof. ∎