
Inference for Batched Bandits

by Kelly W. Zhang, et al.

As bandit algorithms are increasingly used in scientific studies, there is a growing need for reliable inference methods for the resulting adaptively collected data. In this work, we develop methods for inference on the treatment effect from data collected in batches using a bandit algorithm. We focus on the setting in which the total number of batches is fixed and develop approximate inference methods based on the asymptotic distribution as the batch size goes to infinity. We first prove that the ordinary least squares (OLS) estimator, which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when the treatment effect is zero. This asymptotic non-normality implies that naively assuming the OLS estimator is approximately normal can lead to Type I error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS (BOLS) estimator, which we prove is asymptotically normal, even in the zero treatment effect case, on data collected from both multi-arm and contextual bandits. Moreover, BOLS is robust to changes in the baseline reward and can be used to obtain simultaneous confidence intervals for the treatment effect from all batches in non-stationary bandits. We demonstrate in simulations that BOLS can be used reliably for hypothesis testing and for obtaining a confidence interval for the treatment effect, even in small-sample settings.
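The abstract does not give the form of the batched estimator, so the following Python sketch is only an illustration of the general idea under stated assumptions, not the authors' implementation. It assumes a two-arm bandit with Gaussian rewards of known noise standard deviation `sigma`, an epsilon-greedy rule whose action probabilities change only between batches, and a test statistic formed by standardizing the OLS treatment-effect estimate within each batch and averaging the resulting z-scores with equal weights. The function names (`simulate_batched_bandit`, `batch_z`, `bols_statistic`) and all parameter values are hypothetical.

```python
import numpy as np


def arm_mean(r, mask):
    """Mean reward over the masked entries; 0.0 if the arm was never pulled."""
    return r[mask].mean() if mask.any() else 0.0


def simulate_batched_bandit(n_batches, batch_size, effect, rng, eps=0.2):
    """Two-arm epsilon-greedy bandit (assumed setup).

    The probability of pulling arm 1 is fixed within a batch and is
    updated only between batches, using all data collected so far.
    """
    actions, rewards = [], []
    p1 = 0.5  # first batch: arms chosen uniformly at random
    for _ in range(n_batches):
        a = rng.binomial(1, p1, size=batch_size)
        r = effect * a + rng.normal(size=batch_size)  # unit-variance noise
        actions.append(a)
        rewards.append(r)
        all_a = np.concatenate(actions)
        all_r = np.concatenate(rewards)
        m1 = arm_mean(all_r, all_a == 1)
        m0 = arm_mean(all_r, all_a == 0)
        # The greedy arm gets probability 1 - eps/2, the other arm eps/2.
        p1 = 1 - eps / 2 if m1 > m0 else eps / 2
    return actions, rewards


def batch_z(a, r, sigma=1.0, null_effect=0.0):
    """Standardized within-batch OLS estimate of the treatment effect."""
    n1 = int(a.sum())
    n0 = a.size - n1
    if n0 == 0 or n1 == 0:
        # Assumption: a batch with an empty arm contributes nothing.
        return 0.0
    diff = r[a == 1].mean() - r[a == 0].mean()
    return np.sqrt(n0 * n1 / (n0 + n1)) * (diff - null_effect) / sigma


def bols_statistic(actions, rewards, sigma=1.0, null_effect=0.0):
    """Equal-weight combination of the per-batch z-scores (assumed form)."""
    zs = [batch_z(a, r, sigma, null_effect) for a, r in zip(actions, rewards)]
    return sum(zs) / np.sqrt(len(zs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stats = [
        bols_statistic(*simulate_batched_bandit(5, 100, 0.0, rng))
        for _ in range(2000)
    ]
    # Empirical Type I error of a nominal 5% two-sided test under zero effect.
    print(np.mean(np.abs(stats) > 1.96))
```

The design choice worth noting is that each batch is standardized using only that batch's own arm counts. Because the action probabilities for a batch are fixed before any of its rewards are observed, the within-batch z-score does not inherit the dependence between sample sizes and past rewards that distorts a pooled OLS estimate.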




Statistical Inference with M-Estimators on Bandit Data

Bandit algorithms are increasingly used in real world sequential decisio...

Post-Contextual-Bandit Inference

Contextual bandit algorithms are increasingly replacing non-adaptive A/B...

Semi-parametric inference based on adaptively collected data

Many standard estimators, when applied to adaptively collected data, fai...

Asymptotic expansion for batched bandits

In bandit algorithms, the randomly time-varying adaptive experimental de...

Design-Based Inference for Multi-arm Bandits

Multi-arm bandits are gaining popularity as they enable real-world seque...

Confidence Intervals for Policy Evaluation in Adaptive Experiments

Adaptive experiments can result in considerable cost savings in multi-ar...

Post-Episodic Reinforcement Learning Inference

We consider estimation and inference with data collected from episodic r...
