# Power-Constrained Bandits

Contextual bandits often provide simple and effective personalization in decision-making problems, making them popular in many domains, including digital health. However, when bandits are deployed in the context of a scientific study, the aim is not only to personalize for an individual but also to determine, with sufficient statistical power, whether or not the system's intervention is effective. In this work, we develop a set of constraints and a general meta-algorithm that can be used to both guarantee power constraints and minimize regret. Our results demonstrate that a number of existing algorithms can be easily modified to satisfy the constraint without a significant decrease in average return. We also show that our modification is robust to a variety of model mis-specifications.


## 1 Introduction

Contextual bandits provide an attractive middle ground between multi-arm bandits and full-blown Markov Decision Processes. Their simplicity, robustness, and effectiveness have made them popular in various domains, ranging from online education to web content and ad recommendation, where personalization is important. We are specifically motivated by the promise of contextual bandits in digital health: imagine a mobile app that helps patients manage a mental illness by delivering personalized interventions, such as reminders to self-monitor their mental state or suggestions for how to manage depressed states.

In digital health applications, much of the initial research and development is done via clinical studies. Especially when a person's health is involved, it is critical to understand if, when, and how the treatments are effective. A currently popular study design is the micro-randomized trial (liao2015micro; klasnja2015microrandomized), in which an automated agent interacts in parallel with a number of individuals over a number of time points. This type of design allows the designer to observe the pattern of initial excitement and novelty effect, followed by some disengagement, that one would observe in a real deployment, while the fact that each intervention is randomized allows for rigorous statistical analysis to quantify the treatment effect.

Micro-randomized trials also offer the promise of being able to personalize or adapt during the study period itself: the randomization probability for each intervention need not be a fixed constant, but can be adjusted based on the prior responses of the individual. For example, some individuals may respond better to self-monitoring reminders, others to concrete activity suggestions. The responsiveness of the individual may also depend on the context (e.g., at work or at home). However, now we are left with a complex set of challenges: we desire our algorithms to develop a personalized policy for each individual while still being sufficiently powered to determine the effectiveness of the treatments and enable further off-policy or causal inference analyses, all in a highly stochastic, non-stationary environment.

While there are many approaches that quantify treatment effects or minimize regret, to our knowledge, no approaches do both in a principled way. Much of classical experimental design focuses on ensuring that the scientific study is scoped to allow for sufficient power to detect a significant treatment effect. Part of the multi-armed bandit literature focuses on estimating the means of all arms (e.g., carpentier2011upper), and approaches focused on best-arm identification aim to find the best treatment with confidence (audibert2010best). These typically personalize little if at all, and thus can result in high regret.

In contrast, most of the multi-armed and contextual bandit literature employs adaptivity toward the goal of minimizing regret, but we are aware of none that also provides a guarantee on power for a pre-specified after-study primary analysis, much less preserves the ability to perform a variety of non-pre-specified secondary analyses. In the context of an expensive clinical study, it is essential that such analyses can be performed to inform further development and downstream adoption. One might hope that standard regret-minimizing algorithms would automatically provide strong power guarantees, but recent work suggests that popular regret-minimizing bandit algorithms can result in biased estimates (nie2017adaptively), and indeed there is very recent work providing new techniques on how to best analyze data generated by regret-minimizing algorithms (hadad2019confidence). Such work highlights the need for data-gathering approaches that can preserve power and provide favorably low regret.

Our core contribution is to provide an avenue toward ensuring that the study will be sufficiently powered and have optimal regret with respect to that constraint. We provide analyses both for specific algorithms as well as a general approach for taking existing contextual bandit algorithms and achieving optimal regret rates with respect to an oracle which satisfies the power constraints. Finally, while our work is motivated by applications to digital health, we emphasize that the same needs occur in many settings. For example, in education, we may want to personalize a flash card app's prompts to the student and also quantify its overall effectiveness.

## 2 Related Work

There are a variety of contextual bandit algorithms that can find the best arm with confidence in both stochastic as well as adversarial settings (abbasiyadkori2018; LattimoreSzepesvari2019). However, these algorithms typically are not concerned with minimizing cumulative regret alongside their best-arm identification.

Furthermore, several recent contextual bandit algorithms achieve optimal first-order regret rates in highly stochastic, even adversarial, settings (LattimoreSzepesvari2019; Krishnamurthy2018; greenewald2017action). We consider similarly general settings, but unlike these algorithms, we also guarantee sufficient power to test hypotheses related to treatment effectiveness.

Finally, other work considers other simultaneous objectives. degenne2019 and erraqabi2017 consider arm-value estimation jointly with regret minimization. nie2017adaptively and deshpande consider how to accurately estimate the means or confidence intervals with data collected via adaptive sampling algorithms. To our knowledge, ours is the first to consider how to gather data in a way that guarantees power in a non-stationary setting and minimizes regret.

## 3 Model and Problem Setting

We consider planning a study with $N$ subjects, each for $T$ time units. At each time $t$, for each subject $n$, a context vector $S_{nt}$ is observed. We take a binary action $A_{nt} \in \{0, 1\}$ with a reward $R_{nt}$. We use $H_{nt}$ to denote the history for subject $n$ up to time $t$: $H_{nt} = \{S_{n1}, A_{n1}, R_{n1}, \ldots, A_{n,t-1}, R_{n,t-1}, S_{nt}\}$. Denote the reward under action $a$ as $R_{nt}(a)$; thus the observed reward satisfies $R_{nt} = R_{nt}(A_{nt})$.

To plan the study, we must specify (1) assumptions on the true environment, (2) the behavior policy for selecting the actions, and (3) the null hypothesis, the alternate hypothesis, the associated test statistic, the desired Type 1 error rate, and the desired power to detect a particular standardized effect size.

**True Environment.** We assume that the reward satisfies

$$E[R_{nt}(1) \mid H_{nt}] - E[R_{nt}(0) \mid H_{nt}] = Z_t(H_{nt})^\top \delta_0 \qquad (1)$$

where $Z_t(H_{nt})$ is a set of features that are a known function of the history, and $\delta_0$ is the treatment effect of the intervention. Importantly, $Z_t(H_{nt})$ is independent of the present action $A_{nt}$ but may depend on prior actions. We assume that an expert defines what features of a history may be important for the reward but make no assumptions about how the history itself evolves. In the following, we write $Z_t(H_{nt})$ as $Z_{nt}$ for short. We assume the histories are independent and identically distributed across subjects. However, there may be dependencies across time within a specific subject. Finally, we assume standard moment conditions on $R_{nt}(a)$ for $a \in \{0, 1\}$ and all $t \le T$.

**Behavior Policy.** To select actions for the $n$th subject at step $t$, the study uses a behavior policy that may depend on the history and time,

$$P(A_{nt} = 1 \mid H_{nt}) = \pi_t(H_{nt})$$

where $\pi_t$ is a deterministic function of subject $n$'s history and is indexed by $t$ to denote that it can be non-stationary.

We impose two constraints on the policy. First, the policy for subject $n$ does not depend on data from other subjects, i.e., $\pi_t$ is only a function of $H_{nt}$. Second, all action-selection probabilities must lie between some $0 < \pi_{\min} \le \pi_{\max} < 1$. The rationale for these constraints on the behavior policy is to facilitate both the primary hypothesis-testing procedure as well as perhaps not pre-specified secondary off-policy analyses (e.g., as described by philip2016; su2019) and causal inference analyses (e.g., as described by boruvka2018assessing). In the following we write $\pi_t(H_{nt})$ as $\pi_{nt}$ for short. We emphasize that, at least in digital health, it is both a requirement and the norm that actions are always sampled with non-zero probabilities to ensure that primary and secondary analyses can be performed even in the face of model mis-specification.
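The second constraint is simple to enforce mechanically. As a minimal illustration (the names `pi_min` and `pi_max` are ours, standing in for the bounds derived in Section 4), clipping projects any candidate randomization probability onto the allowed interval:

```python
import numpy as np

def clip_policy(pi_raw, pi_min, pi_max):
    """Project a candidate action-1 probability onto [pi_min, pi_max]."""
    return float(np.clip(pi_raw, pi_min, pi_max))

# Even a deterministic policy becomes stochastic after clipping,
# so off-policy and causal analyses remain possible:
assert clip_policy(1.0, 0.2, 0.8) == 0.8
assert clip_policy(0.0, 0.2, 0.8) == 0.2
assert clip_policy(0.5, 0.2, 0.8) == 0.5
```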

**Hypothesis and Test Statistic.** When designing a study, there are many unknowns, and thus it is standard practice to use test statistics that require minimal assumptions to guarantee the desired Type 1 error rate. Thus the assumptions used to form the test statistic may be weaker than the assumptions the study designers use to construct the behavior policy $\pi_t$. Indeed, in our experience it is often the case that one might wish to be more conservative when performing hypothesis tests but be willing to make more assumptions when it comes to constructing the personalization (i.e., behavior) policy to minimize regret.

A natural primary hypothesis concerns the treatment effect, here encoded by the value of $\delta_0$ in Equation 1. Our goal is to test the null hypothesis $H_0\colon \delta_0 = 0$ against the alternate hypothesis $H_1\colon \delta_0 = \delta$ for a postulated effect $\delta \neq 0$.

To test these hypotheses, we will construct a test statistic based on one used in multiple micro-randomized trials (liao2015micro; boruvka2018assessing; klasnja2019; Bidargaddi2018). We first assume the model in Equation 1. For the marginal reward averaged over the action, denoted $E[R_{nt} \mid H_{nt}]$, we assume a working model:

$$E[R_{nt} \mid H_{nt}] = B_{nt}^\top \gamma_0, \qquad (2)$$

where $B_{nt}$ is a vector of features constructed from $H_{nt}$.

Our estimated effect $\hat\delta$ is the minimizer of the loss

$$L(\gamma, \delta) = \sum_{n=1}^N \sum_{t=1}^T \frac{\big(R_{nt} - B_{nt}^\top\gamma - (A_{nt} - \pi_{nt})\, Z_{nt}^\top\delta\big)^2}{\pi_{nt}(1-\pi_{nt})}.$$

In the above loss function, the action is centered by the probability that the action is $1$ (i.e., $A_{nt} - \pi_{nt}$); this is a classical orthogonalization trick used in both statistics and economics (Robinson1988; boruvka2018assessing). This orthogonalization allows one to prove that the asymptotic (large $N$, fixed $T$) distribution of $\hat\delta$ is Gaussian even if the working model in Equation 2 is false (boruvka2018assessing). A similar orthogonalization trick has been used in the bandit literature by Krishnamurthy2018 and greenewald2017action so as to allow a degree of non-stationarity.

Next, let $X_{nt} = \big(B_{nt}^\top, (A_{nt} - \pi_{nt}) Z_{nt}^\top\big)^\top$ and $\theta = (\gamma^\top, \delta^\top)^\top$. The solution $\hat\theta = (\hat\gamma^\top, \hat\delta^\top)^\top$ is given by

$$\hat\theta = \left(\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^N\sum_{t=1}^T \frac{R_{nt}X_{nt}}{\pi_{nt}(1-\pi_{nt})}\right) \qquad (3)$$

and $\hat\theta$ is asymptotically normal, with sandwich covariance

$$\Sigma_\theta = E\left[\sum_{t=1}^T \frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right]^{-1} W_\theta\, E\left[\sum_{t=1}^T \frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right]^{-1} \qquad (4)$$

where $W_\theta$ is the covariance matrix of the weighted residual score $\sum_{t=1}^T \frac{(R_{nt} - X_{nt}^\top\theta^*)X_{nt}}{\pi_{nt}(1-\pi_{nt})}$, $\theta^*$ is the population minimizer of the loss, and $p$ and $q$ are the dimensions of $Z_{nt}$ and $B_{nt}$, respectively.
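The estimator in Equation 3 is an ordinary weighted least-squares solve. A small self-contained sketch on synthetic data (all shapes and parameter values here are illustrative, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, p_B, p_Z = 100, 30, 2, 1            # subjects, times, dim(B), dim(Z)
B = rng.normal(size=(N, T, p_B))          # working-model features B_nt
Z = rng.normal(size=(N, T, p_Z))          # treatment-effect features Z_nt
pi = np.full((N, T), 0.5)                 # randomization probabilities pi_nt
A = rng.binomial(1, pi)                   # actions A_nt
gamma0, delta0 = np.array([1.0, -0.3]), np.array([0.5])
R = B @ gamma0 + A * (Z @ delta0) + 0.1 * rng.normal(size=(N, T))

# X_nt = (B_nt, (A_nt - pi_nt) Z_nt); solve the weighted normal equations of Eq. 3
X = np.concatenate([B, (A - pi)[..., None] * Z], axis=-1)
w = 1.0 / (pi * (1.0 - pi))
G = np.einsum('nt,nti,ntj->ij', w, X, X) / N
b = np.einsum('nt,nt,nti->i', w, R, X) / N
theta_hat = np.linalg.solve(G, b)         # first entries estimate gamma, last estimates delta
```

With the working model correctly specified, as here, `theta_hat[-1]` recovers the treatment effect `delta0` up to sampling noise.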

###### Theorem 1.

Under the assumptions in this section, and the assumption that the matrices $E\big[\sum_{t=1}^T Z_{nt}Z_{nt}^\top\big]$ and $E\big[\sum_{t=1}^T \frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\big]$ are invertible, the distribution of $\sqrt{N}(\hat\delta - \delta_0)$ converges, as $N$ increases, to a normal distribution with mean $0$ and covariance $\Sigma_\delta = Q^{-1} W Q^{-1}$, where $Q = E\big[\sum_{t=1}^T Z_{nt}Z_{nt}^\top\big]$ and

$$W = E\left[\sum_{t=1}^T \frac{(R_{nt} - X_{nt}^\top\theta^*)(A_{nt}-\pi_{nt})Z_{nt}}{\pi_{nt}(1-\pi_{nt})}\; \sum_{t=1}^T \frac{(R_{nt} - X_{nt}^\top\theta^*)(A_{nt}-\pi_{nt})Z_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right]$$

for $\theta^* = (\gamma^{*\top}, \delta_0^\top)^\top$, the population minimizer of the loss.

###### Proof.

The proof is a minor adaptation of (boruvka2018assessing). See Appendix Section  A.1. ∎

Remark: Define $\epsilon_{nt} = R_{nt} - X_{nt}^\top\theta^*$, the residual under the limiting parameters. Suppose we make the further assumption that $A_{nt}$ is conditionally independent of the potential rewards given $H_{nt}$, which holds by construction of the behavior policy. Then $W$ can be written as the sum of two terms: one driven by the conditional variance of the reward, and a second term that is non-zero only when the working model for the marginal reward is mis-specified. Further suppose the working model is correct, that is, Equation 2 holds. Then the second term in $W$ is zero. When the mean structure is not correct, $\hat\delta$ is still unbiased but with inflated covariance.

Finally, recall that $\hat\theta = (\hat\gamma^\top, \hat\delta^\top)^\top$. One can obtain $\hat\delta$ from $\hat\theta$, and $\Sigma_\delta$ from the corresponding block of $\Sigma_\theta$. To test the null hypothesis $H_0\colon \delta_0 = 0$, one can use the statistic $N\,\hat\delta^\top \Sigma_\delta^{-1} \hat\delta$, which asymptotically follows a $\chi^2_p$ distribution, where $p$ is the number of parameters in $\delta$. Under the alternate hypothesis $\delta_0 = \delta$, the statistic $N\,\hat\delta^\top \Sigma_\delta^{-1} \hat\delta$ has an asymptotic non-central $\chi^2$ distribution with $p$ degrees of freedom and non-centrality parameter $c_N = N\,\delta^\top \Sigma_\delta^{-1} \delta$.
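Concretely, the test in this section is a Wald-type chi-squared test; a minimal sketch (the function name and argument conventions are ours):

```python
import numpy as np
from scipy import stats

def wald_test(delta_hat, Sigma_delta, N, alpha0=0.05):
    """Test H0: delta_0 = 0 with the statistic N * delta' Sigma^{-1} delta,
    asymptotically chi-squared with p = len(delta_hat) d.f. under H0."""
    p = len(delta_hat)
    stat = N * delta_hat @ np.linalg.solve(Sigma_delta, delta_hat)
    reject = stat > stats.chi2.ppf(1.0 - alpha0, df=p)
    return stat, bool(reject)
```

For example, a zero estimate retains the null, while a large standardized effect (statistic far beyond the $\chi^2_p$ critical value) rejects it.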

## 4 Power Constrained Bandits

The asymptotic distribution for the estimator in Equation 3 depends on the policy $\pi_t$. Intuitively, given $N$ subjects and $T$ times, we can imagine there should be some minimum and maximum randomization probabilities $\pi_{\min}$ and $\pi_{\max}$ such that the experiment is sufficiently powered for the test above: if we do not sufficiently explore, we will not be able to determine the effect of the treatment.

We first prove that this intuition is true: for a fixed randomization probability $\pi_t = \pi$, there exist a $\pi_{\min}$ and a $\pi_{\max}$ ($\pi_{\min} \le \pi_{\max}$) such that when $\pi = \pi_{\min}$ or $\pi = \pi_{\max}$, the experiment is sufficiently powered.

###### Theorem 2.

Let $\pi_{\min}$ and $\pi_{\max}$ denote the smaller and larger solutions, in $\pi$, of the power equation below under a fixed randomization probability $\pi$. We choose $c_N$ such that $1 - \beta_0 = 1 - \Phi_{p; c_N}\big(\Phi_p^{-1}(1-\alpha_0)\big)$, where $\Phi_{p; c_N}$ denotes the cdf of a non-central $\chi^2$ distribution with $p$ d.f. and non-centrality parameter $c_N$, and $\Phi_p^{-1}$ denotes the inverse cdf of a $\chi^2$ distribution with $p$ d.f. For a given trial with $N$ subjects each over $T$ time units, if the randomization probability is fixed as $\pi_{\min}$ or $\pi_{\max}$, the resulting power converges to $1 - \beta_0$ as $N \to \infty$.

###### Proof.

(Sketch) The rejection region for $H_0\colon \delta_0 = 0$ is $\{N\,\hat\delta^\top \Sigma_\delta^{-1} \hat\delta > \Phi_p^{-1}(1-\alpha_0)\}$, which results in the power

$$1 - \beta_0 = 1 - \Phi_{p; c_N}\big(\Phi_p^{-1}(1-\alpha_0)\big) \qquad (5)$$

where $c_N = N\,\delta^\top \Sigma_\delta^{-1} \delta$. Note that we have derived the formula for $\Sigma_\delta$ in Theorem 1; thus we only need to solve for $\pi$ when we substitute the expression for $\Sigma_\delta$ into $c_N$. Full analysis in Appendix Section A.2. ∎
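In practice, Equation 5 is inverted numerically: first solve for the non-centrality parameter $c_N$ that delivers the target power, then substitute the expression for $\Sigma_\delta$ and solve for $\pi$. A sketch of the first step using SciPy (the study-specific second step is omitted):

```python
from scipy import stats, optimize

def solve_noncentrality(alpha0, beta0, p):
    """Solve Eq. 5 for c_N: the non-centrality giving power 1 - beta0
    for a level-alpha0 chi-squared test with p degrees of freedom."""
    crit = stats.chi2.ppf(1.0 - alpha0, df=p)
    gap = lambda c: stats.ncx2.sf(crit, p, c) - (1.0 - beta0)
    return optimize.brentq(gap, 1e-6, 1e3)

c_N = solve_noncentrality(alpha0=0.05, beta0=0.2, p=1)  # approx. 7.85
```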

In some cases, such as in the work of liao2015micro, a model of the environment may be available in advance of the study. In other cases, the study designer will need to specify a space of plausible models, and determining the power for some fixed $\pi$ will require finding the worst-case model.

Second, we prove that as long as the randomization probability lies between $\pi_{\min}$ and $\pi_{\max}$, the power constraint will be met. Our proof holds for any selection strategy for $\pi_t$, including ones where the policy is adversarially chosen to minimize power based on the subject's history $H_{nt}$. Having the condition hold across myriad ways of choosing $\pi_t$ is essential to guaranteeing power for any contextual bandit algorithm that can be made to produce clipped probabilities.

###### Theorem 3.

Given the values of $\pi_{\min}$ and $\pi_{\max}$ we solved for above, if for all $t \le T$ and all $H_{nt}$ we have that $\pi_{\min} \le \pi_t(H_{nt}) \le \pi_{\max}$, then the resulting power will converge to a value no smaller than $1 - \beta_0$ as $N \to \infty$.

###### Proof.

(Sketch) The power in Equation 5 is monotonically increasing with respect to the non-centrality parameter $c_N$. The resulting power will be no smaller than $1 - \beta_0$ as long as $c_N$ is no smaller than its value under the fixed probabilities of Theorem 2. We show that this holds when $\pi_{\min} \le \pi_t(H_{nt}) \le \pi_{\max}$ for all $t$ and $H_{nt}$. Full proof in Appendix Section A.3. ∎
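The monotonicity used in this proof sketch is easy to check numerically: under Equation 5, the power is an increasing function of the non-centrality parameter (degrees of freedom fixed at 1 here purely for illustration):

```python
from scipy import stats

crit = stats.chi2.ppf(0.95, df=1)  # rejection threshold at alpha0 = 0.05
powers = [stats.ncx2.sf(crit, 1, c) for c in (1.0, 4.0, 8.0, 12.0)]
assert all(a < b for a, b in zip(powers, powers[1:]))  # power increases with c_N
```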

## 5 Regret with Power-Constrained Bandits

In Section 4, we provided a way to guarantee that a study's power constraints are met in an algorithm-agnostic fashion. Given that probabilities have to be bounded away from zero and one (as noted in Section 3, this is a standard design requirement in many scientific studies), the best we can do with respect to regret is to be optimal with respect to an oracle whose action probabilities lie within $\pi_{\min}$ and $\pi_{\max}$.

As we consider algorithms to achieve optimal regret, we note that one common use case is when someone wants to run an algorithm for binary decision making with efficient exploration and good performance guarantees under some specific assumptions, but wishes to preserve sufficient power for later analysis even if those assumptions are violated.

For example, in social science settings or digital health, study designers may be willing to make more assumptions that, if true, will enable faster personalization and lower regret, but the same designers may be much more conservative about the assumptions that they are willing to make regarding their later scientific analyses. Thus, in the following, we focus on regret bounds under the assumptions of the contextual bandit algorithm selected by the study designers, rather than with respect to our specific model assumptions in Section 3. However, our power guarantees will always hold with respect to the very general setting in Section 3.

### 5.1 A General Power-Preserving Wrapper Algorithm

We first provide a very general wrapper algorithm in Algorithm 1 that can be used for this purpose. The wrapper takes as input a contextual bandit algorithm $\mathcal{A}$ and the pre-computed $\pi_{\min}, \pi_{\max}$ from Section 4. The input algorithm can be stochastic or deterministic. Conceptually, our wrapper operates as follows: for a given context, if the input algorithm $\mathcal{A}$ returns a probability distribution over actions that already satisfies $\pi_{\min} \le \pi \le \pi_{\max}$ (where $\pi$ is the probability of taking action 1), then we sample the action according to $\pi$. However, if the probability of an action exceeds the $\pi_{\max}$ required for our power constraint, then we sample that action according to $\pi_{\max}$.

The key to guaranteeing good regret with this wrapper for a broad range of input algorithms is in deciding what information we share with the input algorithm. Specifically, the sampling approach in lines 11-22 determines whether the action that was ultimately taken would have been taken by $\mathcal{A}$ absent the wrapper; the context-action-reward tuple from that action is only shared with the input algorithm if $\mathcal{A}$ would have also made that same decision. This process ensures that the input algorithm only sees samples that match the data it would observe if it were making all decisions.

Now, suppose that the input algorithm $\mathcal{A}$ was able to achieve some regret bound with respect to some setting (which, as noted before, may be more specific than that in Sec. 3). The wrapped version of $\mathcal{A}$ produced by Alg. 1 will achieve the desired power bound by design; but what will be the impact on the regret? We prove that, as long as the setting allows for data to be dropped, an algorithm that incurs a given regret in its original setting suffers at most an additional linear regret in the clipped setting. Specifically, if an algorithm achieves an optimal rate with respect to a standard oracle, its clipped version will achieve that optimal rate with respect to the clipped oracle.
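A single decision step of the wrapper might look as follows. This is our simplified sketch, not the paper's Algorithm 1 verbatim: here the input algorithm's hypothetical action is drawn independently, whereas lines 11-22 of Algorithm 1 specify the exact sampling scheme used to decide whether the tuple is fed back:

```python
import numpy as np

def power_clipped_step(alg_prob, pi_min, pi_max, rng):
    """Clip the input algorithm's action-1 probability, sample an action,
    and report whether the tuple may be shared with the input algorithm.

    `share` is True iff the (unclipped) input algorithm would have taken
    the same action, so feeding the tuple back keeps its view consistent."""
    pi = float(np.clip(alg_prob, pi_min, pi_max))
    action = int(rng.random() < pi)
    alg_action = int(rng.random() < alg_prob)  # what the algorithm would do
    return action, alg_action == action
```

For a deterministic input policy (`alg_prob = 1.0`) with `pi_max = 0.9`, roughly 10% of tuples are withheld, matching the linear-regret term in the proof of Theorem 4.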

###### Theorem 4.

Assume as input clipping probabilities $\pi_{\min}, \pi_{\max}$ and a contextual bandit algorithm $\mathcal{A}$. Assume algorithm $\mathcal{A}$ has a regret bound under one of the following assumptions on the setting: (1) the data-generating process for each context is independent of history, or (2) the context depends on the history, and the bound for algorithm $\mathcal{A}$ is robust to an adversarial choice of contexts.

Then our wrapper Algorithm 1 will (1) return a dataset that satisfies the desired power constraints under the data generation process of Section 3 and (2) incur regret no larger than the bound of $\mathcal{A}$ plus a term linear in the number of clipped decisions, if the assumptions of $\mathcal{A}$ are satisfied in the true environment.

###### Proof.

Regarding power (1): By construction, our wrapper algorithm ensures that the action-selection probabilities always satisfy the required power constraints.

Regarding regret (2): Note that in the worst case, the input algorithm deterministically selects actions, which are discarded with probability $1 - \pi_{\max}$. Therefore, if running in an environment satisfying the assumptions of the input algorithm $\mathcal{A}$, our wrapper could suffer at most linear regret on the dropped points, and will incur the same regret as algorithm $\mathcal{A}$ on the other points (which will appear to algorithm $\mathcal{A}$ as if these are the only points it has experienced).

Note that since the wrapper algorithm does not provide all observed tuples to algorithm $\mathcal{A}$, this proof only works for assumptions on the data-generating process under which the contexts are independent of history, or in settings in which $\mathcal{A}$ is robust to adversarially chosen contexts. ∎

Essentially this result shows that one can get robust power guarantees while incurring a small linear loss in regret (recall that $\pi_{\max}$ will tend toward 1, and $\pi_{\min}$ toward 0, as $N$ gets large) if the setting affords additional structure commonly assumed in stochastic contextual bandit settings. Because our wrapper is agnostic to the choice of input algorithm $\mathcal{A}$, up to these commonly assumed structures, we enable a designer to continue to use their favorite algorithm (perhaps one that has seemed to work well empirically in the domain of interest) and still get guarantees on the power.

###### Corollary 1.

For algorithms that satisfy the assumptions of Theorem 4, our wrapper algorithm will incur regret no worse than that of the input algorithm $\mathcal{A}$ with respect to a clipped oracle.

###### Proof.

Recall that a clipped oracle policy takes the optimal action with probability $\pi_{\max}$ and the other action with probability $1 - \pi_{\max}$. By definition, any clipped oracle will itself suffer linear regret relative to the unclipped oracle. Therefore, relative to a clipped oracle, our wrapper algorithm will have a regret rate that matches the regret rate of algorithm $\mathcal{A}$ in its assumed setting when the true environment satisfies those assumptions. This holds for algorithms satisfying the assumptions of Theorem 4. ∎

Therefore, relative to a clipped oracle, our wrapped approach achieves the original regret rate. This shows that, in the dominant terms, the primary loss in regret is due to clipping. The above derivation assumes a static $\pi_{\min}$ and $\pi_{\max}$; an interesting question for future work is how to best dynamically adjust these within a trial to preserve power and further reduce regret.

### 5.2 Regret Rates without Dropping Data

The main drawback of the general wrapper in Algorithm 1 is that some context-action-reward tuples are not provided to the input algorithm $\mathcal{A}$. This process was introduced to stay as general as possible (all guarantees related to the regret of the input algorithm must continue to hold because the wrapper is invisible to it) but, as noted above, it will slow down the learning of the input algorithm by up to a constant factor. We now describe specific cases in which the data do not have to be dropped, and thus that constant factor can be avoided.

**A Simple Case: Action-Centered Thompson Sampling (ACTS).** greenewald2017action provide the ACTS algorithm, with optimal first-order regret with respect to a clipped oracle in a non-stationary, adversarial setting; however, they do not provide guidance on how the clipping probabilities should be chosen. Substituting our clipping probabilities from Section 4 results in an ACTS scheme that gets optimal regret with respect to a clipped oracle while satisfying the required power guarantees.

**A Simple Case: Semi-Parametric Contextual Bandits (BOSE).** The BOSE algorithm of Krishnamurthy2018 provides guarantees of optimal first-order regret with respect to a standard oracle in a non-stationary, adversarial setting. Their paper notes, in the two-action case, that uniformly selecting between the actions until one action is selected with probability 1 is sufficient to achieve their regret bounds. Since our clipping is only activated in settings in which BOSE is sure of what action to take, a clipped version of BOSE will continue to get that optimal regret with respect to the clipped oracle.

**A More Subtle Case: Linear Stochastic Bandits (OFUL).** Finally, consider the OFUL algorithm of abbasiyadkori2011, which is designed for the case in which the study designer is willing to make stronger assumptions than in Section 3. In particular, OFUL is developed under a linear model for the entire mean reward as a function of known features. To adapt OFUL to accommodate the clipped constraint, we make a slight modification to ensure optimism under the constraint: the optimistic action-selection criterion is replaced by a randomized analogue in which the optimistic action is drawn as a Bernoulli with probability clipped to $[\pi_{\min}, \pi_{\max}]$. The construction of the confidence set remains the same. See Algorithm 2 in the Appendix. In Appendix Section A.5, we prove that the regret compared to the clipped oracle is the same as the regret (compared to the unclipped oracle) in abbasiyadkori2011.

**General Outlook.** While each of the cases above relied on specific properties or modifications, we conjecture that the class of algorithms that, if appropriately modified to meet the clipped constraint, will achieve optimal regret with respect to a clipped oracle is fairly large. Our rationale is clearest when the algorithm is based on optimism, as is the case with OFUL. In this case, the common critical step in a regret proof is to upper bound the mean reward under the optimal policy by substituting optimistic values for the unknown parameters and action. In the clipped setting, the optimal policy selects actions under the clipped constraint, and the optimistic values of the unknown parameters and action should be consistent with this constraint as well, so as to similarly provide an upper bound. A general result in this direction is an interesting open problem.

## 6 Experiments

We perform experiments to demonstrate various properties of power-constrained bandits with respect to hypothesis testing, regret, and robustness.

### 6.1 Settings, Baselines, and Metrics

**Settings.** We simulate three different environments: a two-arm bandit model (where $Z_{nt}$ is a constant scalar and the noise terms are i.i.d.), a contextual bandit model from Krishnamurthy2018 with an adversary that corrupts the information the learner receives (where $Z_{nt}$ is a function of the context and the noise is i.i.d.), and a mobile health simulator from liao2015micro (where $Z_{nt}$ is a function of the context and the noise follows an AR(1) process). Details of the environment settings are described in Appendix Section C.

**Baselines.** To our knowledge, the idea of designing a bandit algorithm to obtain power guarantees is novel. Thus, we compare our probability-clipping strategy against various algorithms focused on minimizing regret. Specifically, we consider: a Fixed Policy, which chooses actions with a fixed probability for all $n$ and $t$; Action-Centered Thompson Sampling (ACTS) (greenewald2017action); Bandit Orthogonalized Semiparametric Estimation (BOSE) (Krishnamurthy2018); and the linear Upper Confidence Bound (linUCB) (chu2011contextual), where linUCB is very similar to OFUL, which we analyzed in Section 5.2, but simpler to implement and more commonly used in practice. Details of the algorithms and pseudocode are included in Appendix Section B. For each algorithm, we demonstrate that with our clipping strategy, the power constraint can be met without significantly increasing the regret.

**Metrics.** For each of the algorithms, we compute the Type 1 error, the power (under correct and various incorrect specifications of the effect size and the reward mean structure), and the average return. We also compute the regret with respect to an oracle with no clipping ($\mathrm{reg}$), as well as the regret with respect to a clipped oracle ($\mathrm{reg}_c$). We compute $\mathrm{reg}$ as

$$\mathrm{reg} = E\left[\sum_{t=1}^T \gamma_{nt} + \max\big(0,\, Z_{nt}^\top\delta_0\big) + \epsilon_{nt}\right] - E\left[\sum_{t=1}^T R_{nt}\right]. \qquad (6)$$

We compute $\mathrm{reg}_c$ as

$$\mathrm{reg}_c = E\left[\sum_{t=1}^T \gamma_{nt} + \pi^*_{nt}\, Z_{nt}^\top\delta_0 + \epsilon_{nt}\right] - E\left[\sum_{t=1}^T R_{nt}\right] \qquad (7)$$

where $\pi^*_{nt} = \pi_{\max}$ if $Z_{nt}^\top\delta_0 > 0$ and $\pi^*_{nt} = \pi_{\min}$ otherwise.

The regret computations allow us to see how the returns of our clipped algorithms compare against the best possible rate we could achieve (against the clipped oracle); they also highlight the cost of clipping.
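Both regret metrics are straightforward to estimate from simulated trajectories; a sketch (the array names and the shape convention, one $(N, T)$ array per quantity, are ours):

```python
import numpy as np

def regrets(gamma, Z_delta, eps, R_obs, pi_min, pi_max):
    """Monte-Carlo estimates of reg (Eq. 6) and reg_c (Eq. 7).

    gamma, Z_delta, eps, R_obs: (N, T) arrays holding the baseline reward,
    the treatment effect Z_nt' delta_0, the noise, and the observed rewards."""
    oracle = gamma + np.maximum(0.0, Z_delta) + eps    # unclipped oracle reward
    pi_star = np.where(Z_delta > 0, pi_max, pi_min)    # clipped oracle probabilities
    clipped = gamma + pi_star * Z_delta + eps          # clipped oracle reward
    observed = R_obs.sum(axis=1).mean()
    return oracle.sum(axis=1).mean() - observed, clipped.sum(axis=1).mean() - observed
```

Because the clipped oracle must explore, $\mathrm{reg}_c \le \mathrm{reg}$ for any observed returns; the gap is the cost of clipping.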

**Hyperparameters.** All of the algorithms require priors or other hyperparameters. The prior of the ACTS algorithm is set based on the average return over individuals. We use a similar procedure for tuning the parameters in BOSE and linUCB. In all cases, we note that while one could set some of these parameters based on bounds, the bounds are loose and result in parameter choices that cause the algorithms to over-explore. Thus, we follow the common practice (personal communication with the BOSE authors) of setting these parameters to minimize regret rather than deriving them from the bounds themselves. This gives the algorithms the best possible chance to perform, even as we then change the models and hypotheses to be incorrectly specified. Finally, the same parameter values are used in the clipped and non-clipped versions of the algorithms. The hyperparameter settings we find are listed in Appendix Table 1.

### 6.2 Results

We generate simulation datasets for each experiment and set the desired Type 1 error $\alpha_0$ and desired power $1 - \beta_0$. For each simulation dataset, we estimate $\hat\theta$ using Equation 3. With all simulation datasets, we empirically estimate the covariance using Equation 4. Then we compute $\pi_{\min}$ and $\pi_{\max}$ from $\alpha_0$, $\beta_0$, and the guessed effect size, as in Section 4. The test statistics follow the distribution described in Section 3. We find the following main effects.

When there is no treatment effect, we recover the correct Type 1 error. Before considering power, a very basic but critical consideration is whether our manipulations still leave us with the correct Type 1 error. As stated in Section 3, the test statistic follows a $\chi^2$ distribution under the null, so the Type 1 error is given by $\alpha_0$. In this set of experiments, the treatment effect of the environments is set to zero. The values of $\pi_{\min}$ and $\pi_{\max}$ are solved for using a non-zero guess of the effect size, given in Appendix Section C.

From Figure 1, we see that we get Type 1 errors that are slightly higher than the desired level. The reason is that the estimated covariance is biased downwards at small sample sizes (mancl2001covariance); this especially affects BOSE, which drops a large portion of the data on which it is certain about the treatment effect. To obtain a better small-sample approximation of the covariance, one could use adjusted covariance estimators or critical values based on Hotelling's $T^2$ distribution instead of the $\chi^2$ distribution. Consistent with the literature, we expect our Type 1 errors to decrease under these covariance estimation adjustments.

When there is a treatment effect, we recover the correct power if we guessed the effect size correctly. From Figure  2, we see that, as expected, a policy with fixed randomization achieves the highest power because the exploration is maximal. For ACTS, in a simple environment such as the two arm bandit, sufficient power can be recovered as ACTS is exploratory by nature. In more complex environments, such as the contextual bandit and the mobile health simulator, our clipping scheme is required to achieve the desired power as it forces more exploration. A similar effect occurs for BOSE.

Since clipped linUCB selects between actions with probability $\pi_{\min}$ or $\pi_{\max}$, the power is approximately the desired level. We cannot conduct statistical analyses on linUCB without clipping, as our test statistic requires a stochastic policy.

The power is reasonably robust to a mis-estimated effect size. Next, we consider the effect on the power when our guess of the effect size is overestimated or underestimated. Specifically, for the two-arm bandit, we tested two different mis-estimated treatment effects, fixing the true effect. For the other environments, the estimated treatment effects are set smaller and larger than the true effect, respectively. The solved values of $\pi_{\min}$ and $\pi_{\max}$ corresponding to the different effect sizes are in Appendix Table 2.

In Figure 3, we see that for linUCB, underestimates of the effect size result in more exploration, and correspondingly higher power but lower return. Overestimates result in less exploration, lower power, and higher returns. For ACTS and BOSE, the powers under over- and underestimation are more robust to this mis-specification; the powers are similar. This occurs because clipping the action probabilities also affects trials that do not originally require clipping. For example, in the contextual bandit environment, when the effect size is underestimated, BOSE explores more and its estimate converges faster. Thus there are fewer clipped trials, which generally decreases power; overall, the effect is that the power does not change by much.

Different algorithms have different regrets, but all still converge as expected with respect to the clipped oracle. The fixed-randomization policy has the lowest average return, as we see in Figure 4. LinUCB, which makes the strongest assumptions w.r.t. regret, has the highest average return among all algorithms as well as the greatest sensitivity to being clipped. The average returns of ACTS decrease the least, as it has the smallest proportion of trials that require clipping. The average returns of linUCB decrease the most since, without clipping, linUCB uses a deterministic policy. Overall, the regret of clipped algorithms with respect to a clipped oracle is roughly on the same scale as the regret of non-clipped algorithms with respect to a non-clipped oracle. The complete results for average return, regret, and power are listed in Appendix Table 6.

There can be some trade-off between regret and the resulting power. From Figure 5, we see that the average return often increases as the overall power decreases. Although fixed randomization gives the highest power, as shown in Figure 2, it has the lowest average return, as we see in Figure 4. Without probability clipping, ACTS and BOSE achieve higher average returns but yield less power.

It is interesting to note that clipped linUCB preserves the desired power guarantee while offering stronger performance than the other approaches.

The resulting power is robust to mis-specified reward mean models. As stated in the Remark of Theorem 1, when the reward mean is not modeled correctly, the second term of $W$ is nonzero, and in Appendix A.4 we prove that when that happens, the resulting power will always decrease. The amount of decrease depends on the algorithm and the environment. In this experiment, we deliberately use a poor approximation of the reward mean structure. Note that such an approximation will not affect the two arm bandit, as its true reward mean is a constant. As we see in Figure 6, the power of the fixed-probability policy decreases the most, while that of the other three decreases only slightly. Except for linUCB, the studies are still sufficiently powered for the algorithms under our specific environments. This is because linUCB starts with a power only approximately at the target level when the reward mean structure is correctly specified, and decreases below it under model mis-specification.

## 7 Discussion & Conclusion

Our work provides a general approach to satisfy an important need for ensuring that studies are sufficiently powered while also personalizing for an individual. Our wrapper algorithm guarantees that power constraints are met with minimal regret increase for a general class of algorithms; we also provide stronger regret bounds for specific algorithms.

Our results show that our algorithms meet their claims and are also robust to mis-specified models and effect size estimates. In practice, important considerations for using our approach would include applying more accurate ways to estimate the estimator covariance (as discussed in Section 6.2) to control Type 1 error more tightly, as well as our ability to estimate the expected feature settings. While we focus on derivations for a single power constraint, in settings where potential secondary analyses are known, one can seamlessly apply our methods to guarantee power for multiple analyses by considering the minimum and maximum across those analyses.

Our study also opens several interesting directions for future work. While our results are optimal with respect to the design constraint of fixed clipping probabilities, it may be possible to obtain better regret if the clipping is allowed to change over time (while still remaining sufficiently bounded away from 0 and 1 to preserve the ability to perform post-hoc analyses). Finally, in real mHealth and other settings, actions have effects that persist over time in consistent, modelable ways (e.g., the effect of user burden). It would be interesting to see how power constraints and regret could be guaranteed in MDP-like settings in addition to adversarial bandit settings.

## Appendix A Proofs

### a.1 Proof of Theorem 1

#### Theorem 1

Under the assumptions in Section 3 of the main paper, and the assumption that the matrices defined below are invertible, the distribution of $\hat{\delta}_N$ converges, as $N$ increases, to a normal distribution with mean $\delta_0$ and covariance $QWQ^\top/N$, where

$$
Q = \mathbb{E}\left[\sum_{t=1}^{T} Z_{nt} Z_{nt}^\top\right]^{-1},
\qquad
W = \mathbb{E}\left[\left(\sum_{t=1}^{T}\frac{\left(R_{nt}-X_{nt}^\top\begin{bmatrix}\gamma^*\\ \delta_0\end{bmatrix}\right)(A_{nt}-\pi_{nt})Z_{nt}}{\pi_{nt}(1-\pi_{nt})}\right)\left(\sum_{t=1}^{T}\frac{\left(R_{nt}-X_{nt}^\top\begin{bmatrix}\gamma^*\\ \delta_0\end{bmatrix}\right)(A_{nt}-\pi_{nt})Z_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right)\right],
$$

where

$$
\gamma^* = \mathbb{E}\left[\sum_{t=1}^{T}\frac{B_{nt}B_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right]^{-1}\mathbb{E}\left[\sum_{t=1}^{T}\frac{B_{nt}R_{nt}}{\pi_{nt}(1-\pi_{nt})}\right]
\quad\text{and}\quad
X_{nt} = \begin{bmatrix}B_{nt}\\ (A_{nt}-\pi_{nt})Z_{nt}\end{bmatrix}.
$$
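The asymptotic covariance $QWQ^\top/N$ from Theorem 1 can be estimated from data by plug-in: replace the expectations with empirical means over users and $\theta^* = (\gamma^*, \delta_0)$ with the fitted $\hat{\theta}$. Below is a minimal numpy sketch of such a sandwich estimator; the array names and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sandwich_covariance(Z, X, R, A, pi, theta_hat):
    """Plug-in estimate of the covariance Q W Q^T / N from Theorem 1.

    Z: (N, T, dZ) treatment-effect features, X: (N, T, dX) stacked
    features [B, (A - pi) Z], R: (N, T) rewards, A: (N, T) binary
    actions, pi: (N, T) action probabilities, theta_hat: (dX,)
    stacked (gamma, delta) estimate. Shapes are illustrative.
    """
    N = Z.shape[0]
    resid = R - np.einsum('ntd,d->nt', X, theta_hat)        # R - X^T theta
    w = resid * (A - pi) / (pi * (1.0 - pi))                # per-trial weight
    U = np.einsum('nt,ntd->nd', w, Z)                       # per-user score sum
    Q = np.linalg.inv(np.einsum('nti,ntj->ij', Z, Z) / N)   # E[sum_t Z Z^T]^{-1}
    W = U.T @ U / N                                         # outer product of scores
    return Q @ W @ Q.T / N
```

By construction the estimate is symmetric, since $Q$ and $W$ are; more refined small-sample corrections are what Section 6.2 alludes to for tighter Type 1 error control.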

#### Remark

Define $\gamma_{nt} = \mathbb{E}[R_{nt}(0)\mid H_{nt}]$ and $\epsilon_{nt} = R_{nt} - \mathbb{E}[R_{nt}\mid H_{nt}, A_{nt}]$. Then the reward can be written as $R_{nt} = \gamma_{nt} + A_{nt}\,\mathbb{E}[R_{nt}(1)-R_{nt}(0)\mid H_{nt}] + \epsilon_{nt}$. If we make the further assumptions that $\mathrm{Var}(\epsilon_{nt}\mid H_{nt}, A_{nt}) = \sigma^2$ and that $\mathbb{E}[R_{nt}(1)-R_{nt}(0)\mid H_{nt}] = Z_{nt}^\top\delta_0$, the matrix $W$ can be further simplified to

$$
W = \mathbb{E}\left[\sum_{t=1}^{T}\frac{\sigma^2}{\pi_{nt}(1-\pi_{nt})}Z_{nt}Z_{nt}^\top\right] + \mathbb{E}\left[\sum_{t=1}^{T}\frac{\left(\gamma_{nt}+\pi_{nt}Z_{nt}^\top\delta_0-B_{nt}^\top\gamma^*\right)^2 Z_{nt}Z_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right].
$$

#### Proof

Note that since the time series are independent and identically distributed across users, the expectations below do not depend on $n$. The estimate $(\hat{\gamma}, \hat{\delta})$ is the minimizer of the loss

$$
L(\gamma,\delta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\frac{\left(R_{nt}-B_{nt}^\top\gamma-(A_{nt}-\pi_{nt})Z_{nt}^\top\delta\right)^2}{\pi_{nt}(1-\pi_{nt})} \tag{8}
$$

Let $\theta = \begin{bmatrix}\gamma\\ \delta\end{bmatrix}$ and $X_{nt} = \begin{bmatrix}B_{nt}\\ (A_{nt}-\pi_{nt})Z_{nt}\end{bmatrix} \in \mathbb{R}^{p+q}$, where $p$ and $q$ are the dimensions of $\gamma$ and $\delta$, respectively. (Note $X_{nt}$ is random because $A_{nt}$ and $\pi_{nt}$ depend on the random history.) The loss can be rewritten as

$$
L(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\frac{\left(R_{nt}-X_{nt}^\top\theta\right)^2}{\pi_{nt}(1-\pi_{nt})}
$$

By solving $\partial L/\partial\theta = 0$, we have

$$
\hat{\theta}_N = \left(\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right)^{-1}\left(\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\frac{R_{nt}X_{nt}}{\pi_{nt}(1-\pi_{nt})}\right)
$$

($\hat{\theta}_N$ denotes the estimate of $\theta$ with $N$ samples; we drop the subscript $N$ in the following text for short notation.)
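The closed-form estimator above is an inverse-propensity-weighted least squares solve, which is straightforward to compute numerically. A minimal numpy sketch (array names and shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def wls_theta_hat(X, R, pi):
    """Closed-form weighted least squares estimate of theta.

    X: (N, T, d) stacked features, R: (N, T) rewards, pi: (N, T)
    action probabilities. Solves the normal equations of the loss
    L(theta), with per-trial weights 1 / (pi * (1 - pi)).
    """
    w = 1.0 / (pi * (1.0 - pi))                             # (N, T) weights
    N = X.shape[0]
    G = np.einsum('nt,nti,ntj->ij', w, X, X) / N            # weighted Gram matrix
    b = np.einsum('nt,nt,nti->i', w, R, X) / N              # weighted cross term
    return np.linalg.solve(G, b)                            # G^{-1} b
```

Using `np.linalg.solve` rather than explicitly inverting the Gram matrix is the standard numerically stabler choice.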

Using the weak law of large numbers and the continuous mapping theorem, we have that $\hat{\theta}_N$ converges in probability, as $N \to \infty$, to $\theta^*$, where

$$
\theta^* = \left(\mathbb{E}\left[\sum_{t=1}^{T}\frac{X_{nt}X_{nt}^\top}{\pi_{nt}(1-\pi_{nt})}\right]\right)^{-1}\left(\mathbb{E}\left[\sum_{t=1}^{T}\frac{R_{nt}X_{nt}}{\pi_{nt}(1-\pi_{nt})}\right]\right).
$$

We show that $\delta^* = \delta_0$ and that $\gamma^*$ is given by the statement in the theorem. One can do this directly using the above definition of $\theta^*$, or by noting that $\mathbb{E}\left[\partial L/\partial\theta\right]\big|_{\theta=\theta^*} = 0$. We use the latter approach here. Recall that all the time series are independent and identically distributed, so

$$
\mathbb{E}\left[\frac{\partial L}{\partial\theta}\right]\Bigg|_{\theta=\theta^*} = \mathbb{E}\left[\sum_{t=1}^{T}\frac{R_{nt}-B_{nt}^\top\gamma^*-(A_{nt}-\pi_{nt})Z_{nt}^\top\delta^*}{\pi_{nt}(1-\pi_{nt})}\begin{bmatrix}B_{nt}\\ (A_{nt}-\pi_{nt})Z_{nt}\end{bmatrix}\right] = 0 \tag{9}
$$

We first focus on the component along $(A_{nt}-\pi_{nt})Z_{nt}$, which is the part related to $\delta^*$:

$$
\mathbb{E}\left[\sum_{t=1}^{T}\frac{R_{nt}-B_{nt}^\top\gamma^*-(A_{nt}-\pi_{nt})Z_{nt}^\top\delta^*}{\pi_{nt}(1-\pi_{nt})}\cdot(A_{nt}-\pi_{nt})Z_{nt}\right] = 0
$$

Note that, given history $H_{nt}$, $\mathbb{E}[A_{nt}\mid H_{nt}] = \pi_{nt}$. Thus, for all $t$,

$$
\mathbb{E}\left[\frac{-B_{nt}^\top\gamma^*(A_{nt}-\pi_{nt})Z_{nt}}{\pi_{nt}(1-\pi_{nt})}\right] = \mathbb{E}\left[-B_{nt}^\top\gamma^*\,\mathbb{E}\left[\frac{A_{nt}-\pi_{nt}}{\pi_{nt}(1-\pi_{nt})}\,\Bigg|\,H_{nt}\right]Z_{nt}\right] = \mathbb{E}\left[-B_{nt}^\top\gamma^*\cdot 0\cdot Z_{nt}\right] = 0,
$$

which leaves us with

$$
\mathbb{E}\left[\sum_{t=1}^{T}\frac{R_{nt}-(A_{nt}-\pi_{nt})Z_{nt}^\top\delta^*}{\pi_{nt}(1-\pi_{nt})}\cdot(A_{nt}-\pi_{nt})Z_{nt}\right] = 0.
$$

We rewrite $R_{nt} = R_{nt}(0) + A_{nt}\left[R_{nt}(1)-R_{nt}(0)\right]$. Note that for all $t$,

$$
\mathbb{E}\left[\frac{R_{nt}(0)(A_{nt}-\pi_{nt})Z_{nt}}{\pi_{nt}(1-\pi_{nt})}\right] = \mathbb{E}\left[R_{nt}(0)\,\mathbb{E}\left[\frac{A_{nt}-\pi_{nt}}{\pi_{nt}(1-\pi_{nt})}\,\Bigg|\,H_{nt}\right]Z_{nt}\right] = 0.
$$

Thus, we only need to consider,

$$
\mathbb{E}\left[\sum_{t=1}^{T}\frac{\left[R_{nt}(1)-R_{nt}(0)\right]A_{nt}-(A_{nt}-\pi_{nt})Z_{nt}^\top\delta^*}{\pi_{nt}(1-\pi_{nt})}\cdot(A_{nt}-\pi_{nt})Z_{nt}\right] = 0 \tag{10}
$$

We observe that for all $t$,

$$
\mathbb{E}\left[\frac{\left[R_{nt}(1)-R_{nt}(0)\right]\pi_{nt}}{\pi_{nt}(1-\pi_{nt})}\cdot(A_{nt}-\pi_{nt})Z_{nt}\right] = 0. \tag{11}
$$

Subtracting Equation 11 from Equation 10, we obtain

 E[T∑t=1[Rnt(1)−Rnt(0)](Ant−πnt)−