1 Introduction and related work
The problem of performing repeated randomized trials to compare statistical hypotheses dates back to the fifties [14]. With the advent of internet companies, decision-making algorithms that adhere to this paradigm have witnessed a new wave of interest, and several variants of this setting have been introduced in recent years [5, 6, 7, 8, 9]. As a concrete application, consider an online advertising company that keeps testing new technologies in order to maximize its profit. Before deploying each new technology, the company wants to determine whether its benefits outweigh its costs. As long as a reasonable performance metric is available (e.g., time spent on a page, click-through rates, conversion of curiosity to sales), the company can perform a randomized test and make a statistically sound decision. In real-life applications, however, companies are typically not interested in being absolutely sure to pick the best technology; because of budget constraints, they want to spend as little time as possible making these decisions. Indeed, testing technologies can be expensive and might slow down the regular workflow. Therefore, discarding technologies with a small positive margin could be significantly better than investing a large amount of resources to prove that they are marginally better than the current one. The use of such data-driven sequential decision processes is known as A/B testing.
These types of settings are reminiscent of multi-armed bandits [1, 3, 13], where the set of arms is the set of all decision rules (or policies) used by the decision maker to determine the outcome of an A/B test. However, the two problems are significantly different. In each round of a stochastic bandit problem, the learner picks an arm and gets to see an unbiased estimate of the expected reward of that arm. The total reward accumulated by the learner is an additive function of time, the regret is simply the difference between the total reward of the best fixed arm and the total reward of the algorithm (whose partial sums the learner gets to see after each time step), and regret bounds typically scale with the number of arms [2].

In repeated A/B testing, the learner faces a sequence of A/B tests indexed by a positive integer $n$. The outcome of the $n$-th A/B test is measured by a real number $\mu_n$ that can be either positive or negative, depending on whether the tested population B generates higher rewards than the control population A. Hence, $\mu_n$ represents the difference in performance between the two populations. Neither the absolute value nor the sign of $\mu_n$ can be directly observed by the learner. However, the learner can draw samples to estimate $\mu_n$, and use this estimate to make a decision about whether to accept B or not. Given $\mu_n$, samples are assumed to be i.i.d. with expectation $\mu_n$. Thus drawing more samples improves the estimate of $\mu_n$, but also makes the test run longer. This is not an issue if only one A/B test is performed. However, when a sequence of such tests is performed, one wants to maximize the number of successfully implemented new technologies in a given time window.
A typical way an A/B test is performed is the following. An increasing number of i.i.d. samples is gathered, and the learner uses their running average to update a confidence interval around their common expectation $\mu_n$. As soon as this confidence interval does not contain zero, the sign of $\mu_n$ becomes known with high probability (if zero falls below the confidence interval, then $\mu_n > 0$, and vice versa). When this happens, B is accepted if and only if $\mu_n$ is positive. The major drawback of this approach is that $\mu_n$ might be arbitrarily close to zero. In this case, the number of samples needed to determine the sign of $\mu_n$, which is of order $1/\mu_n^2$, becomes arbitrarily large. The company then spends a large amount of time on an A/B test whose return is negligible. In hindsight, it would have been better to discard this A/B test and hope that the next one was better, i.e., that its value was positive and bounded away from zero. Because of this, A/B tests are usually given a termination horizon: if no decision can be taken after a prescribed number of samples, the technology is discarded and the learner moves on to the next one. Denote this "early stopping" rule to compute the number of samples by $\tau$ and the subsequent decision by $\sigma$. A policy is then a pair $(\tau, \sigma)$. Given a set of such policies, the goal is to learn efficiently, at the smallest cost, the optimal termination horizon. This quantity is learnable if the sequence $\mu_1, \mu_2, \ldots$ has some stationarity property; for this reason, we assume that the $\mu_n$ are i.i.d. random variables.

A crucial point in sequential A/B testing is picking a metric to evaluate each learning algorithm. As mentioned before, we aim at keeping a steady flow of improvements. For this reason, the performance of the learner is measured by the ratio between the expected total reward of all accepted A/B tests and the total expected number of samples drawn throughout the process. This choice ensures that the optimal policy maximizes the ratio of expectations $\mathbb{E}[\mu\,\sigma]/\mathbb{E}[\tau]$, where $\tau$ is the number of samples drawn by the policy and $\sigma \in \{0, 1\}$ is the decision of either accepting ($\sigma = 1$) or rejecting ($\sigma = 0$) after drawing at most $\tau$ samples (as described above). The main objective of the learner is to minimize the regret, defined as the difference between the maximum of this ratio (over a possibly infinite set of policies) and the learner's performance defined above.
As we discuss in depth in Section 3, this choice is also well-suited to model sequential A/B testing when the learners can perform as many A/B tests as they want, as long as the total number of samples does not exceed a budget $B$; that is, when the constraint is on the number of samples rather than on the number of A/B tests.
Another possible metric used to optimize sequential A/B tests is the false discovery rate (FDR); see [11, 16] and references therein. Roughly speaking, the FDR in our setting would be the ratio of the number of implemented A/B tests with negative value over the total number of implemented A/B tests. This quantity is usually controlled by looking at the $p$-values of tests and accepting (resp., rejecting) a test if its $p$-value is below (resp., above) some dynamically adapted threshold. This is an interesting and fruitful theory, but it unfortunately disregards the relative "performance" of tests, i.e., the values of $\mu_n$. For instance, assume that the samples are bounded and that $\mu$ takes a moderately large positive value with high probability and a small negative value otherwise. Then a company could implement all A/B tests immediately after the first sample, obtaining a large ratio of the values of accepted tests over the number of samples. To control the FDR, one would instead have to sample each test many times, yielding a much smaller ratio. Another simple strategy that would outperform the FDR approach is accepting a test after the first sample if and only if that sample is positive: a direct computation shows that, in this case too, the ratio of the values of accepted tests over the number of samples is much larger. Our approach departs from online FDR [4, 12] by taking the relative value of A/B tests into account: it is acceptable to accept a few slightly negative A/B tests as long as they are compensated by accepting many positive ones. The optimal FDR level should actually be data-dependent and, as a consequence, unknown to the learner beforehand.
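The contrast with FDR control can be illustrated numerically. Below is a minimal sketch (the prior over mutations, the sample sizes, and the acceptance threshold are our own illustrative choices, not the ones from the example above): a greedy strategy that accepts every mutation whose first sample is positive attains a much higher value-per-sample ratio than a conservative strategy that draws many samples per mutation to keep false discoveries low.

```python
import random

def value_per_sample(strategy, n_tests=20_000, seed=1):
    """Total value of accepted mutations divided by total samples drawn."""
    rng = random.Random(seed)
    reward, samples = 0.0, 0
    for _ in range(n_tests):
        # Illustrative prior: mostly good mutations, a few bad ones.
        mu = 0.2 if rng.random() < 0.9 else -0.2
        draw = lambda: mu + rng.uniform(-1, 1)
        accept, cost = strategy(draw)
        samples += cost
        if accept:
            reward += mu
    return reward / samples

# Greedy: one sample per test, accept iff that sample is positive.
greedy = lambda draw: ((draw() > 0), 1)

# Conservative: 200 samples per test, accept iff the mean is clearly positive.
def conservative(draw, n=200, threshold=0.1):
    mean = sum(draw() for _ in range(n)) / n
    return (mean > threshold), n

print(value_per_sample(greedy), value_per_sample(conservative))
```

The conservative strategy makes almost no false discoveries, but its value-per-sample ratio is orders of magnitude smaller because every mutation, good or bad, pays the full sampling cost.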
It turns out that running a decision-making policy during an A/B test generates a collection of samples that, in general, cannot be combined into an unbiased estimate of the policy reward. Correcting this bias is a nontrivial challenge arising from our setting, and a major difference with multi-armed bandits. Moreover, the performance measure that we use is not additive in the number of A/B tests (nor in the number of samples). Therefore, algorithms have to be analyzed differently, and bandit-like techniques (where the regret is controlled during each time step and then summed over rounds) cannot be directly applied. On the positive side, the structure of our setting can be leveraged to derive regret bounds that are independent of the number of policies, another significant difference with standard multi-armed bandits. This problem can be thought of as a multi-armed bandit setting with non-additive regret in which the (biased) per-round feedback is computed by solving a best-arm identification problem over two arms. We believe these technical difficulties are a necessary price to pay in order to define a realistic setting, applicable to real-life scenarios. Indeed, an important contribution of this paper is the definition of a novel setting with a suitable performance measure.
In Section 4, we present an algorithm whose analysis can be applied to a broad class of policies. The algorithm is divided into three consecutive phases, each performing a specific task. During the first phase, policies requesting a progressively increasing number of samples are tested, until one of them is found to have a strictly positive reward with high probability. Leveraging the definition of our performance measure, an estimate of the reward of this policy can be used to upper bound the expected number of samples drawn by the optimal policy (or policies, if there is more than one). With this we can determine a finite set of potentially optimal policies whose cardinality is independent of the total number of policies (indeed, the initial set of policies can even be infinite). This is a peculiar feature of our problem that, to the best of our knowledge, has not appeared elsewhere. During phases 2 and 3, the algorithm proceeds in an explore-then-exploit fashion. In phase 2 (exploration), it performs enough A/B tests, drawing enough samples for each one of them, to accurately estimate the expected reward per expected sample size of all potentially optimal policies. During phase 3 (exploitation), the algorithm consistently runs the policy with the highest such estimate. For this algorithm we prove a high-probability, data-dependent regret bound that is independent of the number of policies and vanishes as the total number $N$ of A/B tests grows.
2 Problem definition and notation
We say that a function $\tau$ mapping sequences of samples to positive integers is a horizon if the event $\{\tau = t\}$ only depends on the first $t$ variables in the sequence. In the introduction, we mentioned a few possible horizons: collect at most a prescribed number of samples, and stop earlier if $0$ is outside some confidence interval. A function $\sigma$ taking values in $\{0, 1\}$ is a decision (rule) if it only depends on $\tau$ and the first $\tau$ variables in the sequence. In the introduction, the decision rule was to implement an A/B test ($\sigma = 1$) if $0$ was below the aforementioned confidence interval. We call the pair $(\tau, \sigma)$ a policy.
Fix a (finite or countable) set $\Pi$ of policies (known to the learner), and a distribution $\mathcal{D}$ of mutations (unknown to the learner). We study the following online protocol: for each A/B test $n = 1, 2, \ldots$


an unobserved sample $\mu_n$, called a mutation, is drawn i.i.d. according to $\mathcal{D}$,

the learner runs a policy $(\tau, \sigma) \in \Pi$, i.e., the learner makes a decision $\sigma_n \in \{0, 1\}$ after drawing $\tau_n$ real-valued samples $X_{n,1}, X_{n,2}, \ldots$ such that $\mathbb{E}[X_{n,i} \mid \mu_n] = \mu_n$, where the samples are drawn i.i.d. given $\mu_n$, and independently of past A/B tests.
The goal of the learner is to minimize the regret after $N$ A/B tests, defined as
$$R_N \;=\; \sup_{(\tau,\sigma)\in\Pi} \frac{\mathbb{E}[\mu\,\sigma]}{\mathbb{E}[\tau]} \;-\; \frac{\mathbb{E}\big[\sum_{n=1}^{N} \mu_n\,\sigma_n\big]}{\mathbb{E}\big[\sum_{n=1}^{N} \tau_n\big]} \qquad (1)$$
where $\tau_n$ and $\sigma_n$ are the horizon and decision of the policy run during the $n$-th A/B test, expectations are taken with respect to the random draw of the mutations and of the samples, and the supremum is with respect to all policies $(\tau, \sigma) \in \Pi$.
For each policy belonging to $\Pi$, we allow the learner to reject any mutation regardless of the outcome of the sampling, and to draw arbitrarily more samples of mutations (provided these additional samples are not taken into account in the decision). Formally, for all policies $(\tau, \sigma) \in \Pi$ and all $\ell \in \mathbb{N}$, the learner has access to the policies $(\tau, 0)$ and $(\tau + \ell, \sigma)$. Note that invoking the power to reject a mutation after $\tau$ samples increases the cost of sampling in (1) by $\mathbb{E}[\tau]$ while not increasing the reward in the numerator, since the reward $\mu\sigma$ of a rejection ($\sigma = 0$) is zero. Similarly, using a policy $(\tau + \ell, \sigma)$ rather than $(\tau, \sigma)$ has no effect on the numerator but increases the cost in the denominator by $\ell$.
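The two modifications granted to the learner can be expressed as simple policy wrappers. In the sketch below, a policy is represented (our own simplification) as a function mapping a stream of samples to a pair (number of samples drawn, accept decision):

```python
from typing import Callable, Iterator, Tuple

# A policy maps a stream of samples to (number of samples drawn, accept?).
Policy = Callable[[Iterator[float]], Tuple[int, bool]]

def always_reject(policy: Policy) -> Policy:
    """Same horizon as `policy`, but the mutation is always rejected."""
    def wrapped(stream: Iterator[float]) -> Tuple[int, bool]:
        horizon, _ = policy(stream)
        return horizon, False          # the cost E[tau] is paid, the reward is 0
    return wrapped

def oversample(policy: Policy, extra: int) -> Policy:
    """Draw `extra` additional samples that are ignored by the decision."""
    def wrapped(stream: Iterator[float]) -> Tuple[int, bool]:
        horizon, accept = policy(stream)
        for _ in range(extra):         # extra draws only increase the cost
            next(stream)
        return horizon + extra, accept
    return wrapped

# Example: a base policy that draws one sample and accepts iff it is positive.
base = lambda stream: (1, next(stream) > 0)
```

The wrappers make explicit that rejection zeroes the numerator of (1) while keeping the sampling cost, and that oversampling only inflates the denominator.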
3 Choice of performance measure
In this section we discuss our choice of performance measure. More precisely, we compare several different benchmarks and discuss how things differ if the learner has a budget of samples rather than a budget of A/B tests. We see that all choices but one are essentially equivalent, and the last one (perhaps the most natural) is surprisingly poorly suited to model our problem.
At a high level, a learner constrained by a budget would like to maximize its reward per "time step". This can be done in several different ways. If the constraint is on the number $N$ of A/B tests, then the learner might want to maximize the objective given by the ratio of the total reward of accepted mutations to the total number of samples drawn, $\sum_{n=1}^{N} \mu_n \sigma_n \big/ \sum_{n=1}^{N} \tau_n$.
This is essentially equivalent to our choice of maximizing the ratio of expectations $\mathbb{E}[\mu\sigma]/\mathbb{E}[\tau]$: multiplying both the numerator and the denominator by $1/N$ and applying Hoeffding's inequality shows that the two ratios are close with high probability. Furthermore, by the law of large numbers and Lebesgue's dominated convergence theorem, the expectation of the ratio converges to the ratio of expectations when $N \to \infty$.

Assume now that the constraint is on the number of samples. We say that the learner has a budget $B$ of samples if, as soon as the total number of samples reaches $B$ during an A/B test (whose index is now a random variable), they have to interrupt the run of the current policy, reject the current mutation, and end the process. Formally, the random variable $N_B$ counting the total number of A/B tests performed by repeatedly running a policy $(\tau, \sigma)$ is the smallest $n$ such that $\tau_1 + \cdots + \tau_n \ge B$.
In this case, the learner might want to maximize the objective $\mathbb{E}\big[\sum_{n=1}^{N_B - 1} \mu_n \sigma_n\big] / B$, where the sum stops at $N_B - 1$ because the last A/B test is interrupted and no reward is gained from it. Note first that for all $n$, the event $\{N_B - 1 \ge n\}$ is independent of $\mu_n \sigma_n$; indeed, it depends on $\tau_1, \ldots, \tau_{n-1}$ only. Moreover, assume without loss of generality that $B \ge 1$ and note that during each A/B test at least one sample is drawn; hence $N_B \le B$ and $N_B$ has finite expectation. We can therefore apply Wald's identity [15] to deduce that the objective equals $\mathbb{E}[N_B - 1]\,\mathbb{E}[\mu\sigma]/B$. Since the total number of samples drawn during the first $N_B - 1$ tests is within one horizon of $B$, another application of Wald's identity yields $\mathbb{E}[N_B - 1]\,\mathbb{E}[\tau] \approx B$, so that the objective converges to the ratio of expectations $\mathbb{E}[\mu\sigma]/\mathbb{E}[\tau]$ when $B \to \infty$.
This proves that having a budget of A/B tests, having a budget of samples, or using any of the three natural objectives introduced so far is essentially the same. For this reason, and to make the presentation clearer, we chose to put a constraint on the number of A/B tests and maximize the ratio of expectations.
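The equivalence between the two kinds of budget can be checked empirically. In the sketch below (the fixed-horizon policy and the distribution of $\mu$ are illustrative assumptions of ours), a policy is run until a budget of $B$ samples is exhausted, and the resulting reward per sample is compared with a Monte Carlo estimate of the ratio of expectations $\mathbb{E}[\mu\sigma]/\mathbb{E}[\tau]$:

```python
import random

def simulate_sample_budget(B, seed=2):
    """Run a fixed policy under a budget of B samples.

    Illustrative policy: draw 10 samples, accept iff their mean is positive.
    Returns (reward_per_sample, ratio_of_expectations_estimate).
    """
    rng = random.Random(seed)
    tau = 10

    # Monte Carlo estimate of E[mu * sigma] / E[tau] for this policy.
    num = den = 0.0
    for _ in range(200_000):
        mu = rng.uniform(-0.5, 0.5)
        mean = sum(mu + rng.uniform(-1, 1) for _ in range(tau)) / tau
        num += mu * (mean > 0)
        den += tau
    ratio_of_expectations = num / den

    # Run the same policy until the sample budget B is exhausted.
    rng = random.Random(seed + 1)
    used, reward = 0, 0.0
    while used + tau <= B:
        mu = rng.uniform(-0.5, 0.5)
        mean = sum(mu + rng.uniform(-1, 1) for _ in range(tau)) / tau
        used += tau
        if mean > 0:
            reward += mu
    return reward / B, ratio_of_expectations
```

For a large budget the two quantities nearly coincide, as predicted by the Wald-identity argument above.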
Before proceeding to design and analyze an algorithm to minimize the regret (1), we discuss a very natural definition of objective that should be avoided because, albeit easier to maximize, it is not well-suited to model our problem. Consider as objective the average payoff of accepted mutations per amount of time used to make the decision, i.e., the expectation of the ratio of the total reward of accepted mutations to the total number of samples drawn.
We now give some intuition on the differences between the ratio of expectations and the expectation of the ratio, and discuss why the former might be better suited than the latter.
Fix a small $\varepsilon > 0$ and assume that the mutation $\mu$ is uniformly distributed over a small set of values, one of which is positive and bounded away from zero while the others are at most zero. Consider a policy for which the learner understands quickly (after a constant number of samples) that $\mu$ is the positive value, accepting or rejecting it accordingly, but takes a long time (order $1/\varepsilon^2$ samples) to figure out that the mutation is non-positive otherwise. In this instance, our definition of ratio of expectations and the expectation of the ratio give very different values.
This is due to the fact that the expectation of the ratio "ignores" outcomes with null (or very small) rewards, even if a large number of samples is needed to learn them. On the other hand, the ratio of expectations weighs the number of samples and is highly influenced by it, a property we are interested in capturing with our model.
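A toy instance (of our own choosing) makes the gap concrete: half of the mutations are clearly positive and decided after one sample, while the other half are null and cost many samples before being rejected. The expectation of the per-test ratio barely notices the expensive null tests, whereas the ratio of expectations is dominated by their cost:

```python
import random

def compare_objectives(n_tests=100_000, cheap_cost=1, expensive_cost=1000, seed=3):
    """Compare E[reward / tau] with E[reward] / E[tau] on a toy instance."""
    rng = random.Random(seed)
    sum_ratio = 0.0          # accumulates reward_i / tau_i
    sum_reward = 0.0
    sum_tau = 0
    for _ in range(n_tests):
        if rng.random() < 0.5:
            reward, tau = 1.0, cheap_cost        # mu = 1: accepted after 1 sample
        else:
            reward, tau = 0.0, expensive_cost    # mu = 0: rejected after many samples
        sum_ratio += reward / tau
        sum_reward += reward
        sum_tau += tau
    expectation_of_ratio = sum_ratio / n_tests
    ratio_of_expectations = sum_reward / sum_tau
    return expectation_of_ratio, ratio_of_expectations
```

Here the expectation of the ratio stays close to $1/2$ no matter how expensive the null tests are, while the ratio of expectations collapses toward zero as their cost grows.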
4 Explore then exploit
As described in the introduction, horizons in A/B testing are usually defined by a capped early-stopping rule (e.g., drawing samples until zero leaves a confidence interval around the empirical average, but quitting with a rejection after a certain number of samples has been drawn). In this section we design an algorithm that achieves vanishing regret (with high probability) for an infinite family of policies depending on an arbitrary decision and any monotone sequence of capped horizons, not necessarily defined as truncated versions of a base early-stopping rule. Our algorithm falls in the category of explore-then-exploit algorithms [10]. Two separate phases of the algorithm are dedicated to the estimation of the performance of all promising policies (exploration) and to the consistent run of the policy with the best estimated performance (exploitation).
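The capped early-stopping policies just described can be sketched as follows (a minimal illustration under the assumption that samples take values in an interval of width 2; the Hoeffding-style radius and the cap are our own choices, not the paper's exact procedure):

```python
import math
import random

def run_ab_test(draw_sample, max_samples, delta=0.05):
    """Capped early-stopping A/B test.

    Draws samples until a Hoeffding confidence interval around the
    running mean excludes zero, or until `max_samples` is reached.
    Samples are assumed to take values in an interval of width 2.
    Returns (accept, number_of_samples_drawn).
    """
    total = 0.0
    for t in range(1, max_samples + 1):
        total += draw_sample()
        mean = total / t
        # Hoeffding radius at confidence 1 - delta for a width-2 range.
        radius = math.sqrt(2 * math.log(2 / delta) / t)
        if mean - radius > 0:
            return True, t      # sign is positive w.h.p.: accept B
        if mean + radius < 0:
            return False, t     # sign is negative w.h.p.: reject B
    return False, max_samples   # horizon reached: discard and move on

# A clearly positive mutation is decided quickly; a near-zero one
# runs all the way to its cap before being discarded.
random.seed(0)
fast = run_ab_test(lambda: 0.4 + random.uniform(-1, 1), max_samples=10_000)
slow = run_ab_test(lambda: 0.001 + random.uniform(-1, 1), max_samples=200)
```

This is exactly the trade-off driving the setting: without the cap, the near-zero mutation would consume an unbounded number of samples for a negligible return.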
Fix any decision $\sigma$ and any sequence of horizons $(\tau_M)_{M \ge 1}$ such that $\tau_M \le M$ for all $M$, and $\tau_M \le \tau_{M'}$ if $M \le M'$. Consider the set of policies (parameterized by $M$)
$$\Pi = \{(\tau_M, \sigma) : M = 1, 2, \ldots\} \qquad (2)$$
We say that $M$ is the maximum horizon of policy $(\tau_M, \sigma)$. We denote the (expected) reward of policy $(\tau_M, \sigma)$ by $r_M = \mathbb{E}[\mu\,\sigma]$ and its (expected) horizon by $d_M = \mathbb{E}[\tau_M]$.
We say that $M^\star$ is an optimal (maximum) horizon if it satisfies $r_{M^\star}/d_{M^\star} = \sup_{M \ge 1} r_M/d_M$, namely, if it is the maximum horizon of an optimal policy $(\tau_{M^\star}, \sigma)$. Note that $\tau_M \le M$ for all $M$, and the optimality of $M^\star$ implies, for all $M$ with $r_M > 0$,
$$d_{M^\star} \le \frac{r_{M^\star}}{r_M}\, d_M \le \frac{M}{r_M} \qquad (3)$$
where the last inequality uses the boundedness of rewards and $d_M \le M$.
If policy $(\tau_M, \sigma)$ is run during A/B test $n$, we denote the empirical averages of the first and of the last samples drawn during the test by $\bar{X}_n$ and $\tilde{X}_n$ respectively, where the empirical average of zero samples is defined as $0$. If policy $(\tau_M, \sigma)$ is run for $K$ consecutive A/B tests, we define Hoeffding-style upper and lower confidence bounds around the empirical averages of its reward, and we denote by $\hat{d}_M$ the empirical average of its horizon over those $K$ tests. These are the estimates used by Algorithm 1 below.
Input: decision $\sigma$, horizons $(\tau_M)_{M \ge 1}$, budget of mutations $N$, confidence $\delta$, accuracy $\varepsilon$, exploration length, number of extra samples
Initialization: let for all
Algorithm 1 is divided into three phases. During phase 1 (lines 1–7), mutations (all rejected by the algorithm) are used to determine a horizon $\widehat{M}$ that upper bounds all optimal horizons (there could be multiple optimal horizons) with high probability. This is possible because of the structure of our problem. Indeed, the optimality of each optimal policy implies that, for all policies with strictly positive reward, the expected number of samples drawn by an optimal policy is bounded as in (3). With this in mind, we first upper bound with high probability the expected number of samples drawn by optimal policies, using an estimate of the reward of some policy with strictly positive reward (lines 1–2). Then we proceed to finding the smallest (up to a constant factor) horizon $\widehat{M}$ that, with high probability, upper bounds all optimal horizons (lines 3–7). During phase 2 (lines 8–9), the algorithm draws samples from each one of the next mutations (again, all rejected). These are used to compute accurate estimates of the rewards of all potentially optimal policies, i.e., all policies with maximum horizon at most $\widehat{M}$. During phase 3 (lines 10–11), the algorithm runs, on the remaining mutations, the policy that it estimated to be optimal by the end of phase 2.
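A simplified skeleton of the explore-then-exploit scheme might look as follows (the candidate horizons, the phase lengths, and the merging of phases 1 and 2 are our own simplifications of Algorithm 1, not its exact pseudocode):

```python
import random

def run_policy(mu, max_horizon, rng, threshold=0.0):
    """Illustrative capped policy: draw `max_horizon` samples,
    accept iff their mean exceeds `threshold`."""
    xs = [mu + rng.uniform(-1, 1) for _ in range(max_horizon)]
    return (sum(xs) / max_horizon > threshold), max_horizon

def explore_then_exploit(sample_mu, n_tests, horizons=(1, 4, 16, 64),
                         explore_each=200, seed=4):
    rng = random.Random(seed)
    reward, samples = 0.0, 0

    # Exploration (phases 1-2, merged here for simplicity): estimate the
    # reward-per-sample ratio of each candidate maximum horizon, while
    # rejecting every mutation (so no reward is collected).
    ratios = {}
    for M in horizons:
        num = den = 0.0
        for _ in range(explore_each):
            mu = sample_mu(rng)
            accept, cost = run_policy(mu, M, rng)
            num += mu if accept else 0.0
            den += cost
            samples += cost
        ratios[M] = num / den

    # Exploitation (phase 3): run the empirically best policy on the
    # remaining mutations, now actually implementing accepted ones.
    best_M = max(ratios, key=ratios.get)
    remaining = n_tests - len(horizons) * explore_each
    for _ in range(remaining):
        mu = sample_mu(rng)
        accept, cost = run_policy(mu, best_M, rng)
        samples += cost
        if accept:
            reward += mu
    return reward / samples, best_M
```

Even this crude version collects a strictly positive reward per sample on an easy instance, because the exploitation phase dominates the (reward-free) exploration cost.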
Theorem 1.
In particular, a suitable choice of the parameters gives a vanishing regret with probability at least $1 - \delta$. Note that in general one cannot pick a constant value for the exploration length. Indeed, the theorem holds only if phase 1 terminates successfully, initializing $\widehat{M}$ at line 6. This is stated (with a slight abuse of notation) in the condition that the number of A/B tests performed during phase 1 must be strictly smaller than the total budget $N$ of A/B tests. Picking a large exploration length might result in the tests at lines 3 or 6 never being true, so that the algorithm neither leaves phase 1 nor initializes $\widehat{M}$. However, a direct verification shows that, for a suitable choice of the parameters, this can only happen if all optimal policies have a nearly null reward, which in turn ensures that the regret remains small. The same argument applies to the other constants that appear inside the big-O notation (see the proof below for the explicit form of the bound). Note that this bound is independent of the number of policies (which may be infinite). It scales instead with the data-dependent upper bound $\widehat{M}$ on all optimal horizons, which has a clear interpretation as a measure of the hardness of the instance: the more the distribution of $\mu$ is concentrated around zero, the harder it becomes to determine the sign of $\mu$, and the more samples are needed in order to prove it.
Proof.
Note that the numbers of A/B tests performed during phases 1, 2, and 3 sum to the budget $N$. Since all mutations are rejected during phases 1 and 2, the rewards accumulated there are zero; hence all rewards are collected during phase 3, whose total expected number of samples is upper bounded by the expected horizon of the chosen policy times the number of remaining A/B tests. We now proceed to lower bound the total reward accumulated during phase 3. We begin by showing that $\widehat{M}$ upper bounds all optimal horizons with high probability. Note that the estimate computed at line 2 is an empirical average of i.i.d. unbiased estimators of the reward: by the independence of the mutations and the conditional independence of the samples, for all A/B tests performed at line 2, the average of the extra samples is an unbiased estimate of $\mu$ conditionally on the run of the policy.
Taking expectations on both sides proves the claim. Then, denoting by $\mathcal{F}$ the $\sigma$-algebra generated by the mutations, Hoeffding's inequality yields a confidence bound on each estimate conditionally on $\mathcal{F}$, and by the law of total expectation the same bound holds without the conditioning. Similarly for the empirical averages of the horizons. Hence, the event
(4) 
occurs with probability at least $1 - \delta$. Note now that, for all horizons $M$ and all optimal horizons $M^\star$, the optimality of $M^\star$ yields an upper bound on $d_{M^\star}$ in terms of the reward of the policy with maximum horizon $M$, as in (3). Hence, all optimal horizons satisfy this bound for all horizons $M$ whose estimated reward is strictly positive. By (3), (4), and line 6, with probability at least $1 - \delta$, all optimal horizons $M^\star$ satisfy $M^\star \le \widehat{M}$.
Now we show that the policy determined by the end of phase 2 is a close approximation of all optimal policies with high probability. Proceeding as above yields, with probability at least $1 - \delta$, accurate estimates of the rewards and of the expected horizons of all potentially optimal policies. Hence, for all optimal horizons $M^\star$, the policy played during the exploitation phase has a reward-per-sample ratio within the prescribed accuracy of the optimal one, with probability at least $1 - \delta$. Putting everything together, and bounding the contribution of the reward-free exploration phases, gives the claimed regret bound with probability at least $1 - \delta$.
∎
5 Conclusions and future work
In this paper we introduced a sequential A/B testing problem in which the goal of the learner is to simultaneously maximize the reward accumulated by accepting mutations and minimize the number of samples necessary to do so. While we managed to design a general algorithm with vanishing regret, some interesting questions remain open.
The explore-then-exploit approach could be replaced with a more adaptive online protocol. This might lead to a faster convergence rate. The issue that emerges from directly applying UCB [2] strategies to this problem is that those algorithms typically rely on building upper confidence bounds around empirical averages of unbiased estimates of the reward. As we discussed earlier, the estimates we get by running a policy without oversampling are biased. It is not clear whether such biased estimates could be used, as their non-vanishing bias depends in a nontrivial way on the form of the policy. One could, however, get unbiased estimates by drawing as little as one extra sample for each A/B test. This would ensure fast convergence, but to a slightly suboptimal policy: such an algorithm would satisfy a competitive-ratio guarantee rather than a proper regret bound. Other approaches, such as explore-then-exploit with a more adaptive exploration, might yield better regret guarantees. These lines of research will be explored in future work.
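The effect of the bias, and the one-extra-sample fix, can be reproduced in a small sketch (ours, not the paper's): the average of the samples a stopped policy happened to draw is a biased estimate of $\mu$, while a single sample drawn after the stopping decision is unbiased, because it is independent of the stopping time given $\mu$.

```python
import random

def stopped_mean_vs_extra_sample(mu=0.0, n_runs=200_000, seed=5):
    """Stop as soon as a sample is positive (or after 5 samples).

    Returns (mean of within-run averages, mean of one post-stop sample).
    The first estimate is biased upward; the second is unbiased for mu.
    """
    rng = random.Random(seed)
    sum_stopped, sum_extra = 0.0, 0.0
    for _ in range(n_runs):
        xs = []
        for _ in range(5):
            x = mu + rng.uniform(-1, 1)
            xs.append(x)
            if x > 0:                      # early stopping looks at the samples...
                break
        sum_stopped += sum(xs) / len(xs)   # ...so their average is biased
        sum_extra += mu + rng.uniform(-1, 1)  # one extra, ignored sample
    return sum_stopped / n_runs, sum_extra / n_runs
```

With $\mu = 0$, the within-run average concentrates around a strictly positive value, while the extra-sample estimate concentrates around zero, at the price of one additional sample per A/B test.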
Acknowledgments
Nicolò Cesa-Bianchi and Tommaso Cesari gratefully acknowledge the support of Criteo AI Lab through a Faculty Research Award.
References
 Alon et al. [2015] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.
 Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 Bubeck et al. [2012] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 Chen and Kasiviswanathan [2019] Shiyun Chen and Shiva Kasiviswanathan. Contextual online false discovery rate control. arXiv preprint arXiv:1902.02885, 2019.
 Foster and Stine [2008] Dean P. Foster and Robert A. Stine. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429–444, 2008.
 Genovese et al. [2006] Christopher R. Genovese, Kathryn Roeder, and Larry Wasserman. False discovery control with p-value weighting. Biometrika, 93(3):509–524, 2006.
 Heesen and Janssen [2016] Philipp Heesen and Arnold Janssen. Dynamic adaptive multiple tests with finite sample FDR control. Journal of Statistical Planning and Inference, 168:38–51, 2016.
 Javanmard et al. [2018] Adel Javanmard, Andrea Montanari, et al. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.
 Li and Barber [2019] Ang Li and Rina Foygel Barber. Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45–74, 2019.
 Perchet et al. [2016] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, Erik Snowberg, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
 Ramdas et al. [2017] Aaditya Ramdas, Fanny Yang, Martin J. Wainwright, and Michael I. Jordan. Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems, pages 5650–5659, 2017.
 Robertson and Wason [2018] David S. Robertson and James Wason. Online control of the false discovery rate in biomedical research. arXiv preprint arXiv:1809.07292, 2018.
 Rosenberg et al. [2007] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille. Social learning in one-arm bandit problems. Econometrica, 75(6):1591–1611, 2007.
 Tukey [1953] John Wilder Tukey. The problem of multiple comparisons. Unpublished manuscript, 1953.
 Wald [1944] Abraham Wald. On cumulative sums of random variables. The Annals of Mathematical Statistics, 15(3):283–296, 1944.
 Yang et al. [2017] Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright. A framework for Multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems, pages 5957–5966, 2017.