Inference for Batched Bandits
As bandit algorithms are increasingly utilized in scientific studies, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference regarding the treatment effect on data collected in batches using a bandit algorithm. We focus on the setting in which the total number of batches is fixed and develop approximate inference methods based on the asymptotic distribution as the size of the batches goes to infinity. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when the treatment effect is zero. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is asymptotically normal—even in the zero treatment effect case—on data collected from both multi-arm and contextual bandits. Moreover, BOLS is robust to changes in the baseline reward and can be used for obtaining simultaneous confidence intervals for the treatment effect from all batches in non-stationary bandits. We demonstrate in simulations that BOLS can be used reliably for hypothesis testing and obtaining a confidence interval for the treatment effect, even in small sample settings.
Most statistical inference methods for data from randomized experiments assume that all treatments are assigned independently (imbens2015causal). In many settings though, in order to minimize regret, we would like to adjust the treatment assignment probabilities to assign treatments that appear to be performing better with higher probability. As a result, bandits have been increasingly utilized in scientific studies—for example, in mobile health (yom2017encouraging) and online education (rafferty2019statistical). Such adaptively collected data, in which prior treatment outcomes are used to inform future treatment decisions, makes inference challenging due to the induced dependence. For example, estimators like the sample mean are often biased on adaptively-collected data (nie2017adaptively; shin2019sample).
Additionally, many settings in which bandit algorithms are used have significant non-stationarity. For example, in online advertising, the effectiveness of an ad may change over time due to exposure to competing ads and general societal changes that could affect perceptions of an ad. Thus, approximate inference methods based on asymptotics that rely on the number of (approximately stationary) time periods of an experiment going to infinity are often less applicable. For this reason, in this work we focus on the setting in which data is collected with bandit algorithms in batches
and develop inference techniques that allow for non-stationarity between batches. For our asymptotic analysis we fix the total number of batches, T, and allow the number of samples in each batch, n, to go to infinity. Many real-world settings naturally collect data in batches, since data is often received from multiple users simultaneously and it can be costly to update the algorithm's parameters too frequently (jun2016top).
The first contribution of this work is showing that on adaptively collected data, rather surprisingly, whether standard estimators are asymptotically normal can depend on whether the treatment effect is zero. We prove that when data is collected using common bandit algorithms, the sampling probabilities can only concentrate if there is a unique optimal arm. Thus, for two-arm bandits, the sampling probabilities do not concentrate when the difference in the expected rewards between the arms (the treatment effect) is zero. We show that this leads the ordinary least squares (OLS) estimator to be asymptotically normal when the treatment effect is non-zero, and asymptotically not
normal when the treatment effect is zero. Due to the asymptotic non-normality of the OLS estimator under zero treatment effect, we demonstrate that using a normal distribution to approximate the OLS estimator's finite sample distribution for hypothesis testing can lead to inflated Type-1 error. More crucially, the discontinuity in the OLS estimator's asymptotic distribution means that standard inference methods (normal approximations, the bootstrap) may lead to unreliable confidence intervals. Note that since the validity of bootstrap methods relies on uniform convergence (romano2012uniform), the non-uniformity in the asymptotic distribution of standard estimators also means that bootstrap methods can lead to confidence intervals with below-nominal coverage.
In particular, even when the treatment effect is non-zero, if the ratio of the magnitude of the treatment effect to the standard deviation of the noise is small, assuming the OLS estimator has an approximately normal distribution can lead to confidence intervals that have below-nominal coverage probability (see Figure 2).
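To make this inflation concrete, the following is a hypothetical simulation sketch (our own illustration, not the paper's experiment): a two-arm bandit run with clipped Thompson Sampling under zero treatment effect, where the naive OLS z-test is applied to the pooled data. The batch count, batch size, priors, and clipping level are illustrative assumptions.

```python
import math
import numpy as np

def ols_z_stat(rng, T=10, n=50, clip=0.05):
    """One batched trial of clipped Thompson Sampling with zero treatment
    effect; returns the naive OLS z-statistic (true noise sigma = 1)."""
    prec = np.ones(2)    # posterior precisions under independent N(0, 1) priors
    mean = np.zeros(2)   # posterior means for each arm
    n_arm, sum_r = np.zeros(2), np.zeros(2)
    for _ in range(T):
        # Posterior P(arm 1 has the higher mean), clipped to [clip, 1 - clip].
        sd = math.sqrt(1.0 / prec[0] + 1.0 / prec[1])
        pi = 0.5 * (1.0 + math.erf((mean[1] - mean[0]) / (sd * math.sqrt(2.0))))
        pi = min(max(pi, clip), 1.0 - clip)
        a = rng.binomial(1, pi, size=n)
        r = rng.normal(0.0, 1.0, size=n)  # zero effect: both arms have mean 0
        for k in (0, 1):
            c, s = (a == k).sum(), r[a == k].sum()
            mean[k] = (mean[k] * prec[k] + s) / (prec[k] + c)
            prec[k] += c
            n_arm[k] += c
            sum_r[k] += s
    delta_hat = sum_r[1] / n_arm[1] - sum_r[0] / n_arm[0]
    return delta_hat / math.sqrt(1.0 / n_arm[0] + 1.0 / n_arm[1])

rng = np.random.default_rng(0)
rate = np.mean([abs(ols_z_stat(rng)) > 1.96 for _ in range(2000)])
```

Despite the nominal 5% level, the empirical rejection rate `rate` exceeds 0.05: the adaptivity of the sampling probabilities fattens the tails of the OLS statistic even though the z-statistic uses the true noise variance.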
The second contribution of this work is introducing the Batched OLS (BOLS) estimator, which can be used for reliable inference—even in non-stationary and small-sample settings—on data collected with batched bandits. Regardless of whether the true treatment effect is zero or not, the BOLS estimator for the treatment effect for both multi-arm and contextual bandits is asymptotically normal. BOLS can be used for both hypothesis testing and obtaining confidence intervals for the treatment effect. Moreover, BOLS is also automatically robust to non-stationarity in the rewards and can be used for constructing valid confidence intervals even if there is non-stationarity in the baseline reward, i.e., if the rewards of the arms change from batch to batch, but the treatment effect remains constant. If the treatment effect itself is also non-stationary, BOLS can also be used for constructing simultaneous confidence intervals for the treatment effects for each batch. Additionally, we find in simulations that BOLS has very reliable Type-1 error control, even in small-sample settings.
Batched Bandits Most work on batched bandits either focuses on minimizing regret (perchet2016batched; gao2019batched) or identifying the best arm with high probability (agarwal2017learning; jun2016top). Note that best arm identification is distinct from obtaining a confidence interval for the treatment effect, as the former identifies the best arm with high probability (assuming there is a best arm), while the latter can be used to test whether one arm is better than the other and provides guarantees regarding the magnitude of the difference in expected rewards between two arms. Note that in contrast to other batched bandit literature that fixes the total number of batches and allows batch sizes to be adjusted adaptively (perchet2016batched), we assume that the batch sizes are not chosen adaptively.
Adaptive Inference Much of the recent work on inference for adaptively-collected data focuses on characterizing and reducing the bias of the OLS estimator in finite samples (deshpande; nie2017adaptively; shin2019sample). villar2015multi thoroughly examine a variety of adaptive sampling algorithms for multi-arm bandits and empirically find that the OLS estimator has inflated finite-sample Type-1 error rates when the data is collected using these algorithms.
deshpande also consider inference for adaptively sampled arms. They develop the W-decorrelated estimator, which is an adjusted version of OLS. Their estimator requires choosing a tuning parameter λ, which allows practitioners to trade off bias for variance. They prove a CLT for their estimator when λ is chosen in a particular way that depends on the number of times each arm is sampled. To gain further insight into the W-decorrelated estimator, we examine it in the two-armed bandit setting; see Appendix LABEL:appendix:wdecorrelated. Most notably, we find that the W-decorrelated estimator down-weights samples that occur later in the study and up-weights samples from earlier in the study. Note that the W-decorrelated estimator does not have guarantees in non-stationary settings.
The Adaptively-Weighted Augmented-Inverse-Probability-Weighted Estimator (AW-AIPW) for multi-arm bandits reweights the samples of a regular AIPW estimator with adaptive weights that are non-anticipating (athey). The AW-AIPW estimator for the treatment effect can easily be adapted to the batched (triangular array) setting. In the stationary multi-arm bandit case, we make similar assumptions to those that athey use to prove asymptotic normality of the AW-AIPW estimator; however, the AW-AIPW estimator does not have guarantees in non-stationary settings.
lai1982least prove that the OLS estimator is asymptotically normal on adaptively collected data under certain conditions. In Section 4.1, we examine these conditions and determine settings in which the necessary conditions are satisfied. Later in Section 4.2, we characterize natural settings in which the necessary conditions are violated and explain how this leads to the asymptotic non-normality of the OLS estimator on adaptively collected data.
Though our results generalize to K-arm contextual bandits (see Section 5.2), we first focus on the two-arm bandit for expositional simplicity. Suppose there are T timesteps, or batches, in a study. In each timestep t ∈ [1 : T], we select n treatment arms A_{t,1}, ..., A_{t,n} ∈ {0, 1}. We then observe n independent rewards R_{t,1}, ..., R_{t,n}, one for each treatment arm sampled. Note that the distribution of these random variables changes with the batch size, n. For example, the distribution of the actions one chooses for the second batch may change if one has observed n = 10 vs. n = 100 samples in the first batch. We include an (n) superscript on these random variables as a reminder that their distribution changes with n; however, we often drop the superscript for readability.
We define multi-arm bandit algorithms as functions π such that π_t = π(H_{t−1}) ∈ (0, 1), where H_{t−1} is the history prior to batch t. The bandit selects treatment arms such that P(A_{t,i} = 1 | H_{t−1}) = π_t for each i ∈ [1 : n], conditionally on H_{t−1}. We assume the following conditional mean for the rewards:
Let β_t := (β_{t,0}, β_{t,1}) (β_t will be higher dimensional when we add more arms and/or contextual variables). Also define N_{t,0} := Σ_{i=1}^n (1 − A_{t,i}) and N_{t,1} := Σ_{i=1}^n A_{t,i}, the number of times each arm is sampled in the t-th batch. We define the errors as ε_{t,i} := R_{t,i} − E[R_{t,i} | H_{t−1}, A_{t,i}]. Equation (1) implies that the ε_{t,i} form a martingale difference array with respect to the filtration {H_t}. So E[ε_{t,i} | H_{t−1}, A_{t,i}] = 0 for all t, i.
The parameters β_t can change across batches t, which allows for non-stationarity between batches. Assuming that β_t = β_1 for all t ∈ [1 : T] simplifies to the stationary bandit case. Our goal is to estimate and obtain confidence intervals for the treatment effect Δ_t := β_{t,1} − β_{t,0} for t ∈ [1 : T] when the data is collected using common bandit algorithms like Thompson Sampling and ε-greedy. Specifically, we want an estimator that has an asymptotic distribution that approximates its finite-sample distribution well, so we can both perform hypothesis testing for the zero treatment effect and construct a confidence interval for the treatment effect.
When running an experiment with adaptive sampling, it is desirable to minimize regret as much as possible. However, in order to perform inference on the resulting data it is also necessary to guarantee that the bandit algorithm explores sufficiently. Greater exploration in the multi-arm bandit case means sampling the treatments with closer to equal probability, rather than sampling one treatment arm almost exclusively. For example, the central limit theorems (CLTs) for both the W-decorrelated (deshpande) and the AW-AIPW estimators (athey) have conditions that implicitly require that the bandit algorithms cannot sample any given treatment arm with probabilities that go to zero or one arbitrarily fast. Greater exploration also increases the power of statistical tests regarding the treatment effect, i.e., it makes it more probable that a true discovery will be made from the collected data. Moreover, if there is non-stationarity in the treatment effect between batches, it is desirable for the bandit algorithm to continue exploring, so it can adjust to these changes and not almost exclusively sample one arm that is no longer the best.
We explicitly guarantee exploration by constraining the probability that any given treatment arm can be sampled, as per Definition 3.3 below. Note that we allow the sampling probabilities to converge to 0 and/or 1 at some rate. A clipping constraint with rate f(n) means that π_t satisfies the following:
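The clipping constraint can be sketched in a few lines: whatever sampling probability the bandit computes is projected onto the interval [f(n), 1 − f(n)] before any arm is drawn. The function name below is our own illustration.

```python
def clip_probability(pi: float, f_n: float) -> float:
    """Project a sampling probability onto the clipped interval [f_n, 1 - f_n]."""
    assert 0.0 < f_n <= 0.5, "clipping rate must leave a non-degenerate interval"
    return min(max(pi, f_n), 1.0 - f_n)
```

For example, with f(n) = 0.05, a posterior probability of 0.999 is projected down to 0.95, so each arm retains at least a 5% chance of being sampled in every batch.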
Suppose we are in the stationary case, and we would like to estimate the treatment effect Δ := β_1 − β_0. Consider the OLS estimator:
where N_0 := Σ_{t=1}^T N_{t,0} and N_1 := Σ_{t=1}^T N_{t,1}. Note that N_0 + N_1 = nT.
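For the two-arm case, the pooled OLS estimator of the treatment effect reduces to the difference in sample means across all batches. The following minimal sketch (our own; variable names are illustrative) also computes the normalized error √(N_0 N_1 / (N_0 + N_1)) · (Δ̂ − Δ), which is the quantity whose asymptotic distribution is at issue.

```python
import numpy as np

def ols_treatment_effect(actions, rewards):
    """Pooled OLS estimate of the treatment effect Delta = beta_1 - beta_0.

    Returns (delta_hat, normalized_error_at_Delta_0): on two-arm data the OLS
    estimate is the difference of per-arm sample means, and the normalization
    sqrt(N0 * N1 / (N0 + N1)) is what yields a N(0, sigma^2) limit under
    i.i.d. (non-adaptive) sampling.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    n1 = actions.sum()
    n0 = len(actions) - n1
    delta_hat = rewards[actions == 1].mean() - rewards[actions == 0].mean()
    scale = np.sqrt(n0 * n1 / (n0 + n1))
    return delta_hat, scale * delta_hat  # normalized error when true Delta = 0
```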
If the actions A_{t,i} are i.i.d. (i.e., no adaptive sampling) and the first two moments of the rewards exist, a classical result from statistics (amemiya1985advanced) is that the OLS estimator is asymptotically normal, i.e., as n → ∞,
lai1982least generalize this result by proving that the OLS estimator is still asymptotically normal in the adaptive sampling case when the sampling procedure satisfies a certain stability condition. To show that a similar result holds for the batched setting, we generalize the asymptotic normality result of lai1982least to triangular arrays, as stated in Theorem 4.1. Note, we must consider triangular array asymptotics since the distribution of our random variables varies as the batch size, n, changes.
[Moments] For all t ∈ [1 : T] and i ∈ [1 : n], the conditional variance of the errors is constant, E[ε_{t,i}² | H_{t−1}, A_{t,i}] = σ², and the conditional fourth moments E[ε_{t,i}⁴ | H_{t−1}, A_{t,i}] are bounded.
[Stability] For some non-random sequence of scalars {c_n}, as n → ∞,
A strong assumption of Theorem 4.1 is Condition 4.1 above. Intuitively, if Condition 4.1 holds, then the rate at which each arm is sampled eventually will not depend on the values of the previous rewards, so the algorithm will be essentially sampling non-adaptively and the samples can be treated as if they are i.i.d. We now state simple sufficient conditions for Condition 4.1, and thus Theorem 4.1 in the bandit setting.
[Conditionally i.i.d. actions] For each t ∈ [1 : T], the actions {A_{t,i}}_{i=1}^n are i.i.d. over i conditionally on H_{t−1}.
[Sufficient conditions for Theorem 4.1] If Conditions 4.1 and 4.1 hold and the treatment effect is non-zero, then data collected in batches using ε-greedy or Thompson Sampling with a clipping constraint with constant rate f(n) = c for some c ∈ (0, 0.5] (see Definition 3.3) satisfy the conditions of Theorem 4.1. In the next section, we show that when the treatment effect is zero, the OLS estimator is asymptotically non-normal; we also discuss how this is due to a violation of stability Condition 4.1. Note that when the treatment effect is zero, the OLS estimator of the treatment effect is an unbiased estimator. This is because in the zero treatment effect setting for a two-armed bandit, the two arms have the same expected reward and are exchangeable. Thus, the expected values of the OLS estimates of the expected reward for each arm are equal. So, when the treatment effect is zero, the OLS estimator of the treatment effect—which is exactly the difference in the OLS estimates for each arm—has expectation zero and is unbiased. From nie2017adaptively and shin2019sample we know that estimators based on adaptively collected data are often biased. As will be seen below, the inferential difficulties arising from the use of adaptively sampled data are not limited to just bias. Another challenge arises because whether standard estimators for the treatment effect converge uniformly (meaning the normalized errors of the estimator have the same asymptotic distribution no matter the size of the true treatment effect) can depend on whether the data is sampled independently or adaptively.
We prove that when the treatment effect is zero, the OLS estimator is asymptotically non-normal under Thompson Sampling (Theorem 4) and ε-greedy (Appendix LABEL:appendix:nonnormality). As seen in Figure 1, the normalized errors of the OLS estimator of the treatment effect can have fat tails, which can lead to poor control of Type-1 error in hypothesis testing.
It is sufficient to prove asymptotic non-normality in the setting in which the treatment effect is the same for each batch, under Thompson Sampling with fixed clipping. Here the OLS estimator of Δ is simply the difference in the sample means for each arm, i.e., the average reward among samples assigned arm 1 minus the average reward among samples assigned arm 0. The normalized errors of this estimator, which are asymptotically normal under non-adaptive sampling, are as follows:
[Asymptotic non-normality of OLS estimator for Thompson Sampling under zero treatment effect] Let Δ_t = 0 for all t ∈ [1 : T]. We put independent standard normal priors on the means of each arm, β_0 and β_1. The algorithm assumes noise variance σ² = 1. If the sampling probabilities are clipped to [π_min, π_max] for constants π_min, π_max with 0 < π_min ≤ π_max < 1, then the normalized errors of the OLS estimator are asymptotically not normal when the treatment effect is zero, i.e.,
where Φ is the standard Normal distribution CDF and Φ_n is the CDF of the normalized errors of the OLS estimator, (4).
By Corollary 4.1, we know the OLS estimator is asymptotically normal when Δ ≠ 0. And since Theorem 4 shows that the OLS estimator is asymptotically not normal when Δ = 0, we know that the OLS estimator does not converge uniformly on adaptively-collected data (see kasy2019uniformity for a review of the problems with asymptotic approximations under non-uniform convergence). In real-world applications, there is rarely exactly zero treatment effect. However, the discontinuity in the asymptotic distribution of the OLS estimator at zero treatment effect is still practically important. For hypothesis testing and constructing confidence intervals, we use the asymptotic distribution to approximate the finite-sample distribution. The asymptotic distribution of the OLS estimator when the treatment effect is zero is indicative of the finite-sample distribution when the treatment effect is statistically difficult to differentiate from zero, i.e., when the signal-to-noise ratio, Δ/σ, is low. Figure 2 shows that even when the treatment effect is non-zero, when the signal-to-noise ratio is low, the confidence intervals constructed using cutoffs based on the normal distribution have coverage probabilities below the nominal level. Moreover, for any number of samples n and noise variance σ², there exists a non-zero treatment effect size such that the finite-sample distribution will be poorly approximated by a normal distribution.
The asymptotic non-normality of OLS occurs specifically when the treatment effect is zero because when there is no unique optimal arm, the sampling probabilities, π_t, do not concentrate, i.e., π_t does not converge to a constant and can fluctuate indefinitely. Under Thompson Sampling, the posterior probability that one arm is better than another is approximately the p-value for the test of the null H_0: Δ = 0 using the z-statistic for the OLS estimator of the treatment effect. Thus, under the null of zero treatment effect, the posterior probability that arm 1 is better than arm 0 converges to a uniform distribution, as seen in Proposition LABEL:prop:nonconcentrationTS; see Appendix LABEL:appendix:nonnormality for details. Next, we introduce an inference method that is asymptotically normal even when the sampling probabilities do not concentrate.
We now introduce the Batched OLS (BOLS) estimator that is asymptotically normal, even when the treatment effect is zero.
Instead of computing the OLS estimator on the data from all batches together, we compute the OLS estimator from each batch and normalize it by the variance estimated from that batch.
For each t ∈ [1 : T], the BOLS estimator of the treatment effect Δ_t is:
[Asymptotic normality of Batched OLS estimator for multi-arm bandits] Assuming Conditions 4.1 (moments) and 4.1 (conditionally i.i.d. actions), and a clipping rate f(n) (see Definition 3.3). (It is straightforward to show that these results hold in the case that batches are of different, non-adaptively chosen sizes, as the size of the smallest batch goes to infinity.)
By Theorem 5.1, for the stationary treatment effect case, we can test H_0: Δ = 0 vs. H_1: Δ ≠ 0 with the following statistic, which is asymptotically normal under the null:
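The construction can be sketched as follows (our own illustration, with illustrative names and per-batch variance estimation as described in the simulations): studentize the OLS treatment-effect estimate within each batch, then average the T batch statistics so that the combined statistic is asymptotically standard normal under the null.

```python
import numpy as np

def bols_statistics(batches):
    """batches: list of (actions, rewards) pairs, one per batch.

    Returns the per-batch studentized BOLS statistics and the pooled
    statistic sum_t z_t / sqrt(T), which is approximately N(0, 1) under
    the null of zero treatment effect in every batch.
    """
    zs = []
    for actions, rewards in batches:
        actions = np.asarray(actions, dtype=float)
        rewards = np.asarray(rewards, dtype=float)
        n1 = actions.sum()
        n0 = len(actions) - n1
        m1 = rewards[actions == 1].mean()
        m0 = rewards[actions == 0].mean()
        # Per-batch noise variance with a degrees-of-freedom correction (n - 2).
        rss = ((rewards - np.where(actions == 1, m1, m0)) ** 2).sum()
        sigma2_hat = rss / (len(actions) - 2)
        zs.append(np.sqrt(n0 * n1 / (n0 + n1)) * (m1 - m0) / np.sqrt(sigma2_hat))
    zs = np.array(zs)
    return zs, zs.sum() / np.sqrt(len(zs))
```

Because each batch is normalized by counts and a variance estimate from that batch alone, non-stationarity in the baseline reward or noise variance across batches does not invalidate the per-batch statistics.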
The key to proving asymptotic normality for BOLS is that the following ratio converges in probability to one:
Since π_t = π(H_{t−1}), π_t is a constant given H_{t−1}. Thus, even if π_t does not concentrate, we are still able to apply the martingale CLT (dvoretzky1972asymptotic) to prove asymptotic normality. See Appendix LABEL:appendix:triangularCLT for more details.
For the contextual, K-arm bandit case, the parameters of interest are the per-batch parameters β_t. For any two arms k, k' ∈ [1 : K], we can estimate the treatment effect between them for all t ∈ [1 : T]. In each batch, we observe context vectors C_{t,i}, where i ∈ [1 : n]. We can deterministically set the first element of C_{t,i} to 1 to allow for a non-zero intercept. For our new definition of the history, H_{t−1}, we define a new filtration. We define contextual bandit algorithms to be functions π such that π_{t,i} = π(H_{t−1}, C_{t,i}). Note, π_{t,i} is now indexed by i because it depends on the context C_{t,i}. Our policy is now a vector representing the probability that any action is chosen, i.e., the k-th entry of π_{t,i} is P(A_{t,i} = k | H_{t−1}, C_{t,i}) for each action k ∈ [1 : K], and the entries sum to one. We assume the following conditional mean model of the reward:
and let ε_{t,i} := R_{t,i} − E[R_{t,i} | H_{t−1}, C_{t,i}, A_{t,i}].
[Conditionally i.i.d. contexts] For each t ∈ [1 : T], the contexts {C_{t,i}}_{i=1}^n are i.i.d. over i, and their first two conditional moments, E[C_{t,i} | H_{t−1}] and E[C_{t,i} C_{t,i}ᵀ | H_{t−1}], are non-random given H_{t−1}.
[Bounded context] ‖C_{t,i}‖ ≤ u for all t, i for some constant u < ∞. Also, the minimum eigenvalue of E[C_{t,i} C_{t,i}ᵀ] is lower bounded, i.e., λ_min(E[C_{t,i} C_{t,i}ᵀ]) ≥ l > 0.
A conditional clipping constraint with rate f(n) means that the sampling probabilities π_{t,i} satisfy the following:
For each t ∈ [1 : T], we have the OLS estimator for β_t:
[Asymptotic normality of Batched OLS estimator for contextual bandits] Assuming Conditions 4.1 (moments; assume an analogous moment condition for the contextual bandit case, with the conditioning variables additionally including the context), 4.1 (conditionally i.i.d. actions), 5.2, and 5.2, and a conditional clipping rate f(n) (see Definition 5.2),
Many real-world problems for which we would like to use bandit algorithms exhibit non-stationarity over time. For example, in mobile health, bandit algorithms are used to decide when to send interventions to encourage users to engage in healthy behaviors; however, due to peoples' changing routines and because users can become desensitized to the interventions, there is significant non-stationarity in the effectiveness of these interventions over time. We now describe how BOLS can be used in the non-stationary multi-arm bandit setting.
We may believe that the expected reward for a given treatment arm varies over time, but that the treatment effect is constant from batch to batch. In this case, we can simply use the BOLS test statistic described earlier in equation (5) to test H_0: Δ = 0 vs. H_1: Δ ≠ 0. Note that the BOLS test statistic for the treatment effect is robust to non-stationarity in the baseline reward without any adjustment. Moreover, in our simulation settings we estimate the variance separately for each batch, which allows for non-stationarity in the variance between batches as well; see Appendix A for variance estimation details and see Section 6 for simulation results.
Alternatively, we may believe that the treatment effect itself varies from batch to batch. In this case, we are able to construct a confidence region that contains the true treatment effect for each batch simultaneously with probability 1 − α. [Confidence band for treatment effect for non-stationary bandits] Assume the same conditions as Theorem 5.1. We let z_{1−α/(2T)} be the 1 − α/(2T) quantile of the standard Normal distribution. For each t ∈ [1 : T], we define the interval
. We can also test the null hypothesis of no treatment effect against the alternative that at least one batch has non-zero treatment effect, i.e., H_0: Δ_t = 0 for all t ∈ [1 : T] vs. H_1: Δ_t ≠ 0 for some t ∈ [1 : T]. Note that the global null stated above is of great interest in the mobile health literature (klasnja2015microrandomized; liao2016sample). Specifically, we use the following test statistic:
which by Theorem 5.1 converges in distribution to a chi-squared distribution with T degrees of freedom under the null as n → ∞.
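The global-null test then amounts to summing the squared per-batch BOLS statistics and comparing against a chi-squared critical value. A minimal sketch (our own; the function name, the choice T = 3, and the critical value are illustrative, with 7.815 being the standard 0.95 quantile of a chi-squared distribution with 3 degrees of freedom):

```python
import numpy as np

def global_null_test(batch_z_stats, critical_value):
    """Reject the global null of zero treatment effect in every batch when the
    sum of squared per-batch BOLS statistics exceeds the chi-squared cutoff."""
    stat = float(np.sum(np.square(batch_z_stats)))
    return stat, stat > critical_value

# Three batch statistics tested against the chi-squared(3) 0.95 quantile.
stat, reject = global_null_test([0.4, -1.1, 0.7], 7.815)
```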
We focus on the two-arm bandit setting and test whether the treatment effect is zero, specifically H_0: Δ = 0 vs. H_1: Δ ≠ 0. We perform experiments in which the variance of the errors is estimated. We assume homoscedastic errors throughout. See Appendix A.3 for more details about how we estimate the noise variance. In Figures 6 and 7, we display results for stationary bandits, and in Figure 5 we show results for bandits with non-stationary baseline rewards. See Appendix A.4 for results for bandits with non-stationary treatment effects.
For the AW-AIPW estimator, we use the variance-stabilizing weights, which provably satisfy the CLT conditions of athey and performed well in their simulation results, in terms of MSE and low standard error; for the model of the expected rewards for each arm we use the respective arm's sample mean. For the W-decorrelated estimator, we choose λ based on the procedure used in deshpande; see Appendix A.1 for details.
We found that several of the estimators, primarily OLS and AW-AIPW, have inflated finite-sample Type-1 error. Since Type-1 error control is a hard constraint, solutions with inflated Type-1 error are infeasible solutions. For the sake of comparison, in the power plots, we adjust the cutoffs of the estimators to ensure proper Type-1 error control under the null. Note that it is infeasible to make this cutoff adjustment for real experiments (unless one found the worst case setting), as there are many nuisance parameters—like the expected rewards for each arm and the variance of the noise—which can affect these cutoff values. For BOLS, we use cutoffs based on the Student-t distribution rather than the normal distribution, as it is relatively straightforward to determine the number of degrees of freedom needed in the correction; see Appendix A.3 for details. We do not make a similar correction for the other estimators we compare to because it is unclear how to determine the number of degrees of freedom that should be used.
Figure 6 shows that for small sample sizes (), BOLS has more reliable Type-1 error control than AW-AIPW with variance stabilizing weights. After samples, AW-AIPW has proper Type-1 error, and by Figure 7 it always has slightly greater power than the BOLS estimator in the stationary setting. The W-decorrelated approach consistently has reliable Type-1 error control, but very low power compared to AW-AIPW and BOLS.
In Figure 5, we display simulation results for the non-stationary baseline reward setting. Whereas other estimators have no Type-1 error guarantees, BOLS still has proper Type-1 error control in the non-stationary baseline reward setting. Moreover, BOLS can have much greater power than other estimators when there is non-stationarity in the baseline reward. Overall, it makes sense to choose BOLS over other estimators (e.g. AW-AIPW) in small-sample settings or whenever the experimenter wants to be robust to non-stationarity in the baseline reward—at the cost of losing a little power if the environment is stationary.
We found that the OLS estimator is asymptotically non-normal when the treatment effect is zero due to the non-concentration of the sampling probabilities. Since the OLS estimator is a canonical example of a method-of-moments estimator (Hazelton2011), our results suggest that the inferential guarantees of standard method-of-moments estimators may fail to hold on adaptively collected data when there is no unique optimal, regret-minimizing policy. We develop the Batched OLS estimator, which is asymptotically normal even when the sampling probabilities do not concentrate. An open question is whether batched versions of general method-of-moments estimators could similarly be used for adaptive inference.
KZ is supported by the NSF Graduate Research Fellowship under Grant No. DGE 1745303. Further funding was provided by the National Institutes of Health, National Institute on Alcohol Abuse and Alcoholism (R01AA23187).
For the W-decorrelated estimator (deshpande), for a batch size of n and for T batches, we set λ to be the quantile of , where denotes the minimum eigenvalue of . This procedure of choosing λ is derived from Theorem 4 in deshpande and is based on what deshpande do in their simulation experiments. We had to adjust the original procedure for choosing λ used by deshpande (who set λ to the quantile of ), because they only evaluated the W-decorrelated method for a single fixed total number of samples, and valid values of λ change with the sample size.
Since the AW-AIPW test statistic for the treatment effect is not explicitly written in the original paper (athey), we now write the formulas for the AW-AIPW estimator of the treatment effect: . We use the variance stabilizing weights, equal to the square root of the sampling probabilities, and .
The variance estimator for is where
Given the OLS estimators for the means of each arm, , we estimate the noise variance as follows:
We use a degrees-of-freedom bias correction by normalizing by the total number of samples minus two, rather than by the total number of samples. Since the W-decorrelated estimator is a modified version of the OLS estimator, we also use this same noise variance estimator for the W-decorrelated estimator; we found that this worked well in practice in terms of Type-1 error control.
Given the Batched OLS estimators for the means of each arm for each batch, , we estimate the noise variance for each batch as follows:
Again, we use a degrees-of-freedom bias correction by normalizing by n − 2 rather than n. Using BOLS to test H_0: Δ = 0 vs. H_1: Δ ≠ 0, we use the following test statistic:
For this test statistic, we use cutoffs based on the Student-t distribution, i.e., for significance level α we use a cutoff c_α such that
We found c_α by simulating draws from the Student-t distribution.
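A Monte Carlo sketch of this cutoff computation (our own illustration): treat the pooled BOLS statistic as an average of T independent Student-t variables with n − 2 degrees of freedom each, simulate that average, and take the empirical 1 − α quantile of its absolute value. The values T = 5, n = 25, and α = 0.05 are illustrative assumptions.

```python
import numpy as np

def student_t_cutoff(T=5, n=25, alpha=0.05, reps=200_000, seed=0):
    """Monte Carlo cutoff for |sum of T t_{n-2} draws| / sqrt(T) at level alpha."""
    rng = np.random.default_rng(seed)
    draws = rng.standard_t(df=n - 2, size=(reps, T))
    stat = np.abs(draws.sum(axis=1)) / np.sqrt(T)
    return float(np.quantile(stat, 1 - alpha))

cutoff = student_t_cutoff()
```

Because the Student-t tails are heavier than normal tails, the resulting cutoff sits slightly above the normal value 1.96, which is exactly the small-sample correction motivating this procedure.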
In the non-stationary treatment effect simulations, we test the null H_0: Δ_t = 0 for all t ∈ [1 : T] vs. H_1: Δ_t ≠ 0 for some t ∈ [1 : T]. To test this we use the following test statistic:
For this test statistic, we use cutoffs based on the Student-t distribution, i.e., for significance level α we use a cutoff c_α such that
We found c_α by simulating draws from the Student-t distribution.
In the plots below we call the test statistic in (6) “BOLS Non-Stationary Treatment Effect” (BOLS NSTE). BOLS NSTE performs poorly in terms of power compared to other test statistics in the stationary setting; however, in the non-stationary setting, BOLS NSTE significantly outperforms all other test statistics, which tend to have low power when the average treatment effect is close to zero. Note that the W-decorrelated estimator performs well in the left plot of Figure 8; this is because as we show in Appendix LABEL:appendix:wdecorrelated, the W-decorrelated estimator upweights samples from the earlier batches in the study. So when the treatment effect is large in the beginning of the study, the W-decorrelated estimator has high power and when the treatment effect is small or zero in the beginning of the study, the W-decorrelated estimator has low power.