Thresholding Bandit for Dose-ranging: The Impact of Monotonicity

11/13/2017 · by Aurélien Garivier, et al. · Université de Toulouse

We analyze the sample complexity of the thresholding bandit problem, with and without the assumption that the mean values of the arms are increasing. In each case, we provide a lower bound valid for any risk δ and any δ-correct algorithm; in addition, we propose an algorithm whose sample complexity is of the same order of magnitude for small risks. This work is motivated by phase 1 clinical trials, a practically important setting where the arm means are increasing by nature, and where no satisfactory solution is available so far.




1 Introduction

Phase 1 of clinical trials is devoted to the testing of a drug on healthy volunteers for dose-ranging. The first goal is to determine the maximum tolerable dose (MTD), that is, the maximum amount of the drug that can be given to a person before adverse effects become intolerable or dangerous. A target tolerance level is chosen, and the trials aim at identifying quickly the dose entailing the toxicity coming closest to this level. Classical approaches are based on dose escalation, the most well-known being the "traditional 3+3 design": see Le Tourneau et al. (2009) and Genovese et al. (2013) and references therein for an introduction.

We propose in this chapter a complexity analysis for a simple model of phase 1 trials, which captures the essence of this problem. We assume that the possible doses are indexed by a ∈ {1, …, K}, for some positive integer K. The patients are treated in sequential order and identified by their rank: when a patient is assigned a dose, we observe a measure of toxicity, which is assumed to be an independent random variable whose distribution characterizes the toxicity level of that dose. To avoid obfuscating technicalities, we treat here the case of Gaussian laws with known variance σ² and unknown mean, but some results can easily be extended to other one-parameter exponential families such as Bernoulli distributions. The goal of the experiment is to identify, as soon as possible, the dose whose toxicity level is closest to the target admissibility level S, with a controlled risk of making an error.


This setting is an instance of the thresholding bandit problem: we refer to Locatelli et al. (2016) for an important contribution and a nice introduction in the fixed budget setting. Contrary to previous work, we focus here on identifying the exact sample complexity of the problem: we want to understand precisely (with the correct multiplicative constant) how many samples are necessary to take a decision at risk δ. We prove a lower bound which holds for all possible algorithms, and we propose an algorithm which matches this bound asymptotically as the risk δ tends to 0.

But the classical thresholding bandit problem does not capture a key feature of phase 1 clinical trials: the fact that the toxicity is known beforehand to be increasing with the assigned dose. In other words, we investigate how many samples can be spared by algorithms using the fact that the means are increasing. Under this assumption, we prove another lower bound on the sample complexity, and provide an algorithm matching it. The sample complexity does not take a simple form (like a sum of inverse squares), but identifying it exactly is essential even in practice, since it is so far the only known way to construct an algorithm which reaches the lower bound.

We are thus able to quantify, for each problem, how many samples can be spared when the means are sorted, at the cost of a slight increase in the computational cost of the algorithm.

Connections to the State of the Art.

Phase 1 clinical trials have been an intense field of research in the statistical community (see Le Tourneau et al. (2009) and references therein), but had not been considered as a sequential decision problem using the tools of the bandit literature. The important progress made in recent years in the understanding of bandit models has made it possible to shed a new light on this issue, and to suggest very innovative solutions. The closest contributions are the works of Locatelli et al. (2016) and of Chen et al. (2014), the latter providing a general framework for combinatorial pure exploration bandit problems. The present work tackles the more specific issue of phase 1 trials. It aims at providing strong foundations for such solutions: it does not yet tackle all the ethical and practical constraints. Observe that it might also be relevant to look for the highest dose with toxicity below the target level: we discuss this variant in Section 4; however, it seems that practitioners do not consider this alternative goal in priority.

From a technical point of view, the approach followed here extends the theory of Best Arm Identification initiated by Kaufmann et al. (2016) to a different setting. Building on the mathematical tools of that paper, we analyze the characteristic time of a thresholding bandit problem with and without the assumption that the means are increasing. Computing the complexity under such a structural constraint on the means is a challenging task that had never been done before. It induces significant difficulties in the theory, but (by using isotonic regression) we are still able to provide a simple algorithm for computing the complexity term, which is of fundamental importance in the implementation of the algorithm. The computational complexity of the resulting algorithm is discussed in Section 3.1.


These lower bounds are presented in Section 2, where we compare the complexities of the non-monotonic case and of the increasing case. This comparison is particularly simple and enlightening when K = 2, a setting often referred to as A/B testing. We discuss this case in Section 2.1, which furnishes a gentle introduction to the general case. We present in Section 3 an algorithm and show that it is asymptotically optimal as the risk δ goes to 0. The implementation of this algorithm requires, in the increasing case, an involved optimization which relies on constrained sub-gradient ascent and unimodal regression: this is detailed in Section 3.1. Section 3.2 shows the results of some numerical experiments, for different strategies and a high level of risk, that complement the theoretical results. Section 4 discusses the interesting but simpler variant of the problem where the goal is to identify the arm with mean closest to, but below, the threshold. Section 5 summarizes further possible developments, and precedes most of the technical proofs, which are given in the appendix.

1.1 Notation and Setting

For K ≥ 2, we consider a Gaussian bandit model with known variance σ², which we unambiguously refer to by its vector of means μ = (μ_1, …, μ_K). Let P_μ and E_μ be respectively the probability and the expectation under the Gaussian bandit model μ. A threshold S is given, and we denote by a⋆(μ) any optimal arm, i.e. any arm whose mean is closest to S.

Let M be the set of Gaussian bandit models with a unique optimal arm, and let M↑ be the subset of models with increasing means.

Definition of a δ-correct algorithm.

A risk level δ ∈ (0, 1) is fixed. At each step t, an agent chooses an arm A_t and receives a conditionally independent reward X_t. Let F_t be the information available to the player at step t. Her goal is to identify the optimal arm a⋆(μ) while minimizing the number of draws τ_δ. To this aim, the agent needs:

  • a sampling rule (A_t)_t, where A_t is F_{t−1}-measurable,

  • a stopping rule τ_δ, which is a stopping time with respect to the filtration (F_t)_t,

  • an F_{τ_δ}-measurable decision rule â.

For any setting M′ ∈ {M, M↑} (the non-monotonic or the increasing case), an algorithm is said to be δ-correct on M′ if for all μ ∈ M′ it holds that P_μ(â = a⋆(μ)) ≥ 1 − δ and E_μ[τ_δ] < +∞.
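These three components can be organized in a generic fixed-confidence identification loop. The sketch below is only illustrative (the function names and interfaces are ours, not the paper's), with the three rules passed in as callables:

```python
import numpy as np

def run_fixed_confidence(sampling_rule, stopping_rule, decision_rule,
                         draw, K, max_steps=100_000):
    """Generic fixed-confidence identification loop.

    draw(a) returns one independent sample from arm a; the three rules
    mirror the components listed above:
      sampling_rule(counts, sums, t) -> arm to pull at step t,
      stopping_rule(counts, sums, t) -> True once enough evidence is gathered,
      decision_rule(counts, sums)    -> recommended arm.
    """
    counts = np.zeros(K, dtype=int)   # N_a(t): number of draws of arm a
    sums = np.zeros(K)                # cumulated observations per arm
    for t in range(1, max_steps + 1):
        a = sampling_rule(counts, sums, t)
        counts[a] += 1
        sums[a] += draw(a)
        if stopping_rule(counts, sums, t):
            break
    return decision_rule(counts, sums), int(counts.sum())
```

With a round-robin sampling rule and a fixed-horizon stopping rule this reduces to uniform sampling; the point of Sections 2 and 3 is to choose the three rules so that the expected number of draws matches the lower bound.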

2 Lower Bounds

For μ ∈ M (resp. M↑), we define the set of alternative bandit problems of the bandit problem μ by

Alt(μ) = { λ ∈ M (resp. M↑) : a⋆(λ) ≠ a⋆(μ) },

and the probability simplex of dimension K − 1 by Σ_K = { w ∈ [0, 1]^K : w_1 + ⋯ + w_K = 1 }. The first result of this chapter is a lower bound on the sample complexity of the thresholding bandit problem, which we show in the sequel to be tight when δ is small enough. Let δ ∈ (0, 1). For all δ-correct algorithms on M (resp. M↑) and for all bandit models μ ∈ M (resp. M↑),

E_μ[τ_δ] ≥ T⋆(μ) kl(δ, 1 − δ),

where kl(x, y) denotes the Kullback–Leibler divergence between Bernoulli distributions of parameters x and y, and where the characteristic time T⋆(μ) is given by

T⋆(μ)^{−1} = sup_{w ∈ Σ_K} inf_{λ ∈ Alt(μ)} Σ_{a=1}^K w_a (μ_a − λ_a)² / (2σ²) .   (3)

We write T↑⋆(μ) for the analogous characteristic time obtained when the infimum is restricted to alternatives in M↑. In particular, this implies that lim inf_{δ→0} E_μ[τ_δ] / log(1/δ) ≥ T⋆(μ).

This result is a generalization of Theorem 1 of Garivier and Kaufmann (2016): the classical Best Arm Identification problem is a particular case of our non-monotonic setting with an infinite threshold S. It is proved along the same lines. As in Garivier and Kaufmann (2016), one proves that the supremum and the infimum are reached at a unique value, and in the sequel we denote by w⋆(μ) the optimal weights

w⋆(μ) = argmax_{w ∈ Σ_K} inf_{λ ∈ Alt(μ)} Σ_{a=1}^K w_a (μ_a − λ_a)² / (2σ²) .   (4)
2.1 The Two-armed Bandit Case

As a warm-up, we treat in this section the case K = 2. Here (only), one can find an explicit formula for the characteristic times. Writing m = (μ_1 + μ_2)/2 for the mean of the two arms and Δ = |μ_2 − μ_1| for the gap, when K = 2,

T↑⋆(μ)^{−1} = (S − m)² / (2σ²),   (5)

T⋆(μ)^{−1} = min( (S − m)², Δ²/4 ) / (2σ²).   (6)

The Equality (6) is a simple consequence of Lemma 2.2 proved in Section A.2. It remains to treat the first Equality (5). Let μ ∈ M↑ and suppose, without loss of generality, that arm 1 is optimal, so that S < m. Noting that arm 2 is optimal in an alternative λ if and only if (λ_1 + λ_2)/2 ≤ S, we obtain

T↑⋆(μ)^{−1} = sup_{w ∈ [0,1]} inf { [ w (μ_1 − λ_1)² + (1 − w) (μ_2 − λ_2)² ] / (2σ²) : (λ_1 + λ_2)/2 ≤ S, λ_1 ≤ λ_2 } = sup_{w ∈ [0,1]} f(w),

where d = S − m, and we denote by f the function f(w) = 2 w (1 − w) d² / σ². Thus, since the maximum of w ↦ w(1 − w) is attained at w = 1/2, we just proved that T↑⋆(μ)^{−1} = d² / (2σ²).

Note that for both alternative sets the optimal weights defined in Equation (4) are uniform: w⋆ = (1/2, 1/2). In the increasing case, the optimal alternative, i.e. the element of the closure of the alternative set which reaches the infimum in (3) for the optimal weights w⋆, translates both arms by d = S − m: the arms are moved in such a way that the mean of the two mean values is brought to the threshold S. In the non-monotonic case, the optimal alternative can be of two different forms. If the threshold is between the two mean values, then the optimal alternative is the same as in the increasing case. Otherwise, the optimal alternative is identical to the one of Best Arm Identification (see Garivier and Kaufmann (2016)): both means are moved to their midpoint m. Thus, if the threshold lies between the two mean values, the two characteristic times coincide, as can be seen in Figure 2.
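For K = 2, the closed-form expressions above can be checked numerically. The snippet below is a sketch based on our reconstruction of Equalities (5) and (6), with m the midpoint of the two means, Δ the gap and S the threshold; treat the formulas as illustrative rather than authoritative.

```python
def char_times_two_arms(mu1, mu2, S, sigma=1.0):
    """Inverse characteristic times for the two-armed thresholding problem.

    Increasing case: the cheapest alternative translates both means so
    that their midpoint reaches the threshold S.
    Non-monotonic case: the cheaper of that translation and the
    Best-Arm-Identification collapse of both means onto their midpoint.
    """
    m = (mu1 + mu2) / 2.0            # midpoint of the two means
    delta = abs(mu2 - mu1)           # gap between the means
    inv_increasing = (S - m) ** 2 / (2 * sigma ** 2)
    inv_general = min((S - m) ** 2, delta ** 2 / 4.0) / (2 * sigma ** 2)
    return inv_increasing, inv_general
```

When S lies between the two means, |S − m| ≤ Δ/2 and the two values coincide, as in Figure 2.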

Figure 1: The complexity terms in a bandit model with increasing means. Top: inverse of the characteristic time as a function of the threshold S; red solid line: non-monotonic case; blue dotted line: increasing case. Middle: how to move the means to get from the initial bandit model to the optimal alternative. Bottom: the optimal weights as a function of the threshold S.
Figure 2: Inverse of the characteristic times as a function of the threshold S, for K = 2. Solid red: general thresholding case. Dotted blue: increasing case.

2.2 On the Characteristic Time and the Optimal Proportions

We now illustrate, compare and comment on the different complexities for a general bandit model (see Figure 1). Since the set of increasing alternatives is included in the set of all alternatives, it is obvious that T↑⋆(μ) ≤ T⋆(μ). The difference is almost everywhere positive, and can be very large. Both T⋆(μ)^{−1} and T↑⋆(μ)^{−1} tend to 0 as the threshold S tends to the middle of two consecutive arm means.

On the structure of the optimal weights in the non-monotonic case.

In the non-monotonic case, there are two types of optimal alternatives (as in Section 2.1). Indeed, the proof of Lemma 2.2 in Appendix A.2 shows that the best alternative takes one of the two following forms. Either the optimal arm and its challenger are moved to a mean pondered by the optimal weights w⋆ (just like in the Best Arm Identification problem), or, as in the increasing case (see the proof of Proposition 2.1), both arms are translated in the same direction. Figure 1 summarizes the different possibilities on a simple example, for different values of the threshold S: according to the value of S, the best alternative is shown in the second plot from the top.

On the structure of the optimal weights in the increasing case.

In the increasing case, one can show the remarkable property that the optimal weights put mass only on the optimal arm a⋆ and its two closest arms. This strongly contrasts with the non-monotonic case, as illustrated at the bottom of Figure 1. For simplicity we assume that 1 < a⋆ < K. Let w be some weights in Σ_K. Consider the cost, with weights w, of moving from the initial bandit problem μ to a bandit problem λ− in which arm a⋆ − 1 is optimal, the other means being modified as little as possible; an explicit formula for this cost can be derived. Similarly, one can move from μ to a bandit problem λ+ in which arm a⋆ + 1 is optimal, and compute the corresponding cost. It appears (see the proof of Proposition 2.2 in Appendix A.1) that these two types of alternatives, λ− and λ+, are the optimal ones. Note that they also belong to the closure of the set of alternatives of μ.

The intuition behind this proposition is that if we try to transform μ into an alternative where some other arm is optimal, we have to pass by an alternative where arm a⋆ − 1 or a⋆ + 1 is optimal, since we require the means to be increasing; it remains to see that this intermediate alternative always has a smaller cost. The cases a⋆ = 1 and a⋆ = K are similar, considering only the alternative λ+ if a⋆ = 1 and λ− if a⋆ = K. One can also derive bounds on the characteristic time showing that, asymptotically, the dependence on the arms far from a⋆ disappears. It is important to note that this property is really asymptotic as δ goes to 0: it is not clear at all that this dependence would also disappear for moderate values of δ, and we think it does not.
3 An Asymptotically Optimal Algorithm

We present in this section an asymptotically optimal algorithm inspired by the Direct-tracking procedure of Garivier and Kaufmann (2016) (which borrows the idea of tracking from the GAFS-MAX algorithm of Antos et al. (2008)). At any time t, let U_t = { a : N_a(t) < (√t − K/2)⁺ } (where x⁺ stands for the positive part of x) be the set of "abnormally rarely sampled" arms, where N_a(t) denotes the number of draws of arm a up to and including time t. After t rounds, the empirical mean of arm a, denoted μ̂_a(t), is the average of the observations gathered from arm a.

Algorithm 1 (Direct-tracking).
  Sampling rule: if U_t ≠ ∅, draw A_{t+1} ∈ argmin_{a ∈ U_t} N_a(t) (forced exploration); otherwise, draw A_{t+1} ∈ argmax_a [ t ŵ_a(t) − N_a(t) ], where ŵ(t) = w⋆(μ̂(t)) (direct tracking of the plug-in optimal weights).
  Stopping rule: stop at τ_δ = inf { t : inf_{λ ∈ Alt(μ̂(t))} Σ_a N_a(t) (μ̂_a(t) − λ_a)² / (2σ²) > β(t, δ) },   (9)
  where β(t, δ) is an exploration function.
  Decision rule: â = a⋆(μ̂(τ_δ)).

When μ̂(t) has no unique optimal arm, we adopt the obvious conventions for a⋆(μ̂(t)) and ŵ(t).
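One round of such a tracking-based sampling rule can be sketched as follows, with the forced-exploration threshold (√t − K/2)⁺ borrowed from Garivier and Kaufmann (2016) and the weight oracle `w_hat` (an estimate of the optimal weights of Equation (4)) left abstract; the function name is ours.

```python
import numpy as np

def tracking_step(counts, t, w_hat):
    """One sampling decision of a direct-tracking rule.

    counts : array of N_a(t), the number of draws of each arm,
    w_hat  : current plug-in estimate of the optimal weights.
    Returns the index of the arm to draw next.
    """
    counts = np.asarray(counts)
    K = len(counts)
    # Forced exploration: arms sampled fewer than (sqrt(t) - K/2)+ times.
    under = np.flatnonzero(counts < max(np.sqrt(t) - K / 2.0, 0.0))
    if under.size > 0:
        # Draw the least sampled among the under-sampled arms.
        return int(under[np.argmin(counts[under])])
    # Otherwise, track the target proportions: largest deficit t*w - N.
    return int(np.argmax(t * np.asarray(w_hat) - counts))
```

The forced-exploration step guarantees that the empirical means converge, so that the tracked weights eventually approach the optimal proportions.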

Theorem 3 (Asymptotic optimality). For a suitable choice of the exploration function β(t, δ), Algorithm 1 is δ-correct and asymptotically optimal, in the sense that

lim sup_{δ→0} E_μ[τ_δ] / log(1/δ) ≤ T⋆(μ).

The analysis of Algorithm 1 is the same in both the increasing case and the non-monotonic case; it is deferred to Appendix B. However, the practical implementations are quite specific to each case, and we detail them in the next section.

3.1 On the Implementation of Algorithm 1

The implementation of Algorithm 1 requires to compute efficiently the optimal weights given by Equation (4). For the non-monotonic case, one can follow the lines of Garivier and Kaufmann (2016), Section 2.2, and replace their Lemma 3 by Lemma 2.2 above.

In the increasing case, however, implementing the algorithm is more involved. It is not sufficient to simply use Proposition 2.2, since the corresponding candidate alternatives are not necessarily optimal for arbitrary weights w. Let Alt_a(μ) be the set of alternatives with a as optimal arm. Noting that the function

g(w) = min_{a ≠ a⋆(μ)} inf_{λ ∈ cl(Alt_a(μ))} Σ_b w_b (μ_b − λ_b)² / (2σ²)   (11)

is concave (since it is an infimum of linear functions of w), one may access its maximum by a sub-gradient ascent on the probability simplex (see e.g. Boyd et al. (2003)). Here cl(Alt_a(μ)) denotes the closure of Alt_a(μ), and we let

λ^a(w) = argmin_{λ ∈ cl(Alt_a(μ))} Σ_b w_b (μ_b − λ_b)² / (2σ²)   (12)

be the argument of the second infimum in Equation (11). A sub-gradient of g at w is any element of

conv { ( (μ_b − λ^a(w)_b)² / (2σ²) )_{1 ≤ b ≤ K} : a ∈ A(w) },

where conv denotes the convex hull operator and where A(w) is the set of arms that reach the minimum in (11). Thus, performing the sub-gradient ascent simply requires to solve efficiently the minimization program (12). It appears that this problem boils down to unimodal regression (a problem closely related to isotonic regression, see for example Barlow et al. (1973) and Robertson et al. (1988)). Indeed, one can rewrite the set cl(Alt_a(μ)) using only the monotonicity constraints and the constraint that arm a is at least as close to the threshold as any other arm.

Assume that a > a⋆(μ) (the other case is similar). Then half of the closeness constraints can be dropped, since arms on either side of a play a symmetric role; in this case one may only consider the set of increasing mean vectors for which arm a is at least as close to the threshold S as arm a⋆(μ). After an affine change of variables (13), the minimization program (12) becomes a weighted least-squares problem under unimodality constraints. Thanks to Lemma 36 in Appendix C, its solution is the unimodal regression of the transformed means, with weights w and with a mode located at a. It is efficiently computed via isotonic regressions (e.g. Frisén (1986), Geng and Shi (1990), Mureika et al. (1992)) with a computational complexity proportional to the number of arms K. From this solution, one can go back to λ^a(w) by reversing the change of variables (13). Since we need to compute λ^a(w) for each a ≠ a⋆(μ), the overall cost of an evaluation of the sub-gradient is proportional to K².
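To make the reduction concrete, here is a small self-contained sketch of weighted isotonic regression by the pool-adjacent-violators algorithm (PAVA) and of unimodal regression obtained by scanning the possible split points (an increasing fit on the prefix followed by a decreasing fit on the suffix). It is a generic O(K²) illustration of the technique, not the paper's exact program (12).

```python
import numpy as np

def isotonic(y, w):
    """Weighted least-squares non-decreasing regression via PAVA.
    Maintains blocks (value, weight, length) and merges adjacent
    blocks while they violate the monotonicity constraint."""
    vals, wts, lens = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); lens.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wtot
            wts[-2] = wtot
            lens[-2] += lens[-1]
            vals.pop(); wts.pop(); lens.pop()
    return np.repeat(vals, lens) if vals else np.zeros(0)

def unimodal(y, w):
    """Weighted unimodal (increasing-then-decreasing) regression.
    Every unimodal vector is a non-decreasing prefix followed by a
    non-increasing suffix, so it suffices to try every split point."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    best, best_err = None, np.inf
    for k in range(len(y) + 1):
        left = isotonic(y[:k], w[:k])
        # A non-increasing fit is a non-decreasing fit on reversed data.
        right = isotonic(y[k:][::-1], w[k:][::-1])[::-1]
        fit = np.concatenate([left, right])
        err = float(np.sum(w * (y - fit) ** 2))
        if err < best_err:
            best, best_err = fit, err
    return best
```

Fixing the split point at the candidate optimal arm, instead of scanning all of them, corresponds to the fixed-mode regression used above and brings the cost of one projection down to O(K).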

3.2 Numerical Experiments

Table 1 presents the results of a numerical experiment on an increasing thresholding bandit. In addition to Algorithm 1 (DT), we tried the Best Challenger (BC) algorithm with the finely tuned stopping rule given by (9). We also tried the Racing algorithm (R), with the elimination criterion of (9). For a description of all those algorithms, see Garivier and Kaufmann (2016) and references therein. Finally, in order to allow comparison with the state of the art, we added the sampling rule of the APT algorithm (Anytime Parameter-free Thresholding) from Locatelli et al. (2016), in combination with the stopping rule (9). We chose to set the precision parameter of APT to be roughly equal to a tenth of the gap. It appears that the exploration function β(t, δ) prescribed in Theorem 3 is overly pessimistic: on the basis of our experiments, we recommend the use of a less conservative exploration function instead, which does, experimentally, satisfy the δ-correctness property. For each algorithm, the final letter in Table 1 indicates whether the algorithm is aware or not that the means are increasing.

1 | 3913 | 3609 | 4119 | 5960 | 2033
2 | 3064 | 3164 | 3098 | 3672 | 1861
1 |  483 |  494 |  611 | 1127 |  247
2 | 2959 | 2906 | 3072 | 3531 | 1842

Table 1: Monte-Carlo estimation of the expected number of draws for Algorithm 1 and the Best Challenger algorithm in the increasing and non-monotonic cases. Two thresholding bandit problems are considered: bandit problem 1 and bandit problem 2. The target risk is approximately reached in the first scenario, while in the second the observed frequency of errors is smaller.

We consider two frameworks. In the first one, knowing that the means are increasing provides much information and gives a substantial edge: it permits to spare a large portion of the trials for the same level of risk. In the second, the complexity of the non-monotonic setting is very close to that of the increasing setting. We chose a relatively high value of the risk in order to illustrate that, in this regime, the most important feature for efficiency is a finely tuned stopping rule. This shows that, even without an optimal sampling strategy, the stopping rule (9) is a key feature of an efficient procedure. When the risk goes down to 0, however, optimality really requires a sampling rule which respects the proportions of Equation (4), as shown by Theorem 3. The poor performance of APT can be explained by the crude adaptation of this algorithm to the fixed confidence setting: it was originally designed for the fixed budget setting, and these two frameworks appear to be fundamentally different, as argued by Carpentier and Locatelli (2016).

4 Closest Mean Below the Threshold

In this section we briefly discuss a variant of the previous problem: finding the arm with the closest mean below the threshold, still under the assumption that the means are increasing. Surprisingly, this new problem is simpler than the previous one, and it is possible in this case to compute exactly the optimal weights and the characteristic time.

Let μ be a bandit problem such that the optimal arm a⋆ for this new setting is unique (with 1 < a⋆ < K for the sake of clarity). As in Section 2.2, we only need to consider alternative bandit problems in which arm a⋆ − 1 or a⋆ + 1 is optimal. But only one arm needs to be moved: the optimal alternatives λ− (resp. λ+), in which arm a⋆ − 1 (resp. a⋆ + 1) is optimal, are defined by


With the same arguments as in the proof of Proposition 2.2, one obtains the characteristic time and the associated optimal weights in closed form. In some way, this problem is closer to the classical thresholding bandit problem (Locatelli et al., 2016) with two arms, since we are only testing whether the means of arms a⋆ and a⋆ + 1 are below or above the threshold.

For an optimal strategy, Algorithm 1 can be adapted to this new setting, as well as Theorem 3 and its asymptotic optimality proof. The only point that needs to be discussed is the practical implementation. One may follow the procedure described in Section 3.1: perform a sub-gradient ascent on the simplex to compute the maximum of the function g defined in (11). The main difficulty is to compute the sub-gradient of g and, in particular, the projection given by (12), which rewrites in this setting as


This projection can also be easily computed using two isotonic regressions under bound restrictions; see for example Hu (1997).
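For the special case of constant bounds, an isotonic regression under bound restrictions can be obtained by clipping the unconstrained isotonic fit. The sketch below relies on that assumption (general bound restrictions require the techniques of Hu (1997)), and the function names are ours.

```python
import numpy as np

def isotonic(y, w):
    """Weighted least-squares non-decreasing regression via PAVA."""
    vals, wts, lens = [], [], []
    for yi, wi in zip(y, w):
        vals.append(float(yi)); wts.append(float(wi)); lens.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            wtot = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / wtot
            wts[-2] = wtot
            lens[-2] += lens[-1]
            vals.pop(); wts.pop(); lens.pop()
    return np.repeat(vals, lens)

def isotonic_bounded(y, w, lo, hi):
    """Isotonic regression under the constant bounds lo <= x_i <= hi.
    Clipping the unconstrained solution at constant levels preserves
    monotonicity and, for constant bounds, optimality (assumption)."""
    return np.clip(isotonic(y, w), lo, hi)
```

The clipped fit remains non-decreasing, since clipping at constant levels is a monotone transformation.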

5 Conclusion

We provided a tight complexity analysis of the dose-ranging problem, considered as a thresholding bandit problem with, and without, the assumption that the means of the arms are increasing. We proved that, surprisingly, the complexity terms can be computed almost as easily as in the best-arm identification case, despite the additional constraints of our setting. We proposed a lower bound on the expected number of draws for any δ-correct algorithm, and adapted the Direct-Tracking algorithm so as to asymptotically reach this lower bound. We also compared the complexities of the non-monotonic and the increasing cases, both in theory and on an illustrative example. We showed in Section 3.1 how to compute the optimal weights thanks to a sub-gradient ascent in the increasing case, a new and non-trivial task relying on unimodal isotonic regression. In order to complement the theoretical results, we presented some numerical experiments involving different strategies in a regime of high risk. Despite the asymptotic nature of the results presented here, the proposed procedure appears to be the most efficient in practice even when the number of trials involved is rather low (which is often the case in clinical trials).

In the case where several arms are simultaneously closest to the threshold, the complexity of the problem is infinite. This suggests extending the results presented here to the PAC setting, where the goal is to find any ε-closest arm with probability at least 1 − δ. This extension, and extensions to the non-Gaussian case, are left for future investigation since they induce significant technical difficulties.

As a possibility of improvement, we can also mention the possible use of the unimodal regression algorithm of Stout (2000) in order to compute (11) directly, with a reduced computational complexity. We treated here mostly the case of Gaussian distributions with known variance. While the general form of the lower bound may easily be extended to other settings (including Bernoulli observations), the computation of the complexity terms is more involved and requires further investigation (in particular due to heteroscedasticity effects); the asymptotic optimality of Algorithm 1, however, can be extended directly. It remains an important but very challenging task to make a tight analysis for moderate values of δ, to measure precisely the sub-optimality of the Racing and Best Challenger strategies, and to develop a simpler and yet asymptotically optimal algorithm.

The authors thank Wouter Koolen for suggesting a key ingredient in the proof of Proposition 2.2. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

Appendix A Proofs for the Lower Bounds

A.1 Expression of the Complexity in the Increasing Case

Fix μ ∈ M↑ and let a⋆ be its optimal arm. We recall the definitions of the two alternative bandit problems λ− and λ+ and of the associated cost functions defined over Σ_K: both are defined if 1 < a⋆ < K; if a⋆ = 1, only λ+ is defined, and if a⋆ = K, only λ−.

Proof [of Proposition 2.2]. We treat here only the case 1 < a⋆ < K; the two other limit cases are very similar. We begin by proving that, for all w ∈ Σ_K,


Indeed, let λ ∈ Alt↑(μ) be such that some arm other than a⋆ − 1 and a⋆ + 1 is optimal; suppose for example that it lies above a⋆. Consider the family of bandit problems interpolating between μ and λ, defined for t ∈ [0, 1] by

For all t ∈ [0, 1], the means of the interpolated problem remain increasing. For two consecutive arms, consider the average of their means, with the obvious conventions at the boundaries. As in the case of two arms, the optimality of a given arm is equivalent to the threshold lying between the two neighboring averages. Therefore we have the following inequalities; thus, by continuity of these applications, there exists