1 Introduction
Clinical trials typically randomize patients using a fixed randomization scheme, in which the probabilities of assigning patients to the experimental treatments and the control are prespecified and constant. A common method is simply to use equal randomization to the different arms of the trial. However, such schemes can mean that a substantial proportion of trial participants continue to be allocated to treatments that are not the best available, even if interim data indicate that one treatment is likely to be superior. Response-adaptive trials address this concern by adaptively changing the randomization probabilities, so that a greater proportion of patients are allocated to the treatment arm that is performing better according to the accumulated response data. Hence, as the trial continues and accumulates more data, patients in the trial benefit from a higher probability of being assigned to a better treatment.
Many classes of response-adaptive randomization schemes have been proposed in the literature for binary outcomes. Randomization schemes based on urn models (such as the randomized play-the-winner rule (Wei and Durham, 1978)), and adaptive biased coin designs (such as the doubly-adaptive biased coin design (Eisele, 1994)), have been extensively studied, with a comprehensive presentation given by Hu and Rosenberger (2006). Many Bayesian adaptive randomization (BAR) schemes have also been proposed (Thall and Wathen, 2007; Trippa et al., 2012; Yin et al., 2012; Wason and Trippa, 2014), where the randomization probabilities are recursively updated using a Bayesian model for the patient outcomes.
There is also growing interest in response-adaptive randomization for continuous responses. For example, there are schemes based on doubly-adaptive biased coin designs (Hu and Rosenberger, 2006; Zhang and Rosenberger, 2006; Biswas et al., 2007), urn-based drop-the-loser designs (Ivanova et al., 2006) and bandit-based designs (Smith and Villar, 2017). A comprehensive recent overview is given by Biswas and Bhattacharya (2016). In this paper, our focus is on normally distributed outcomes, which are encountered in many clinical trials. Indeed, 23 out of the 59 multi-arm clinical trials identified in a review by Wason et al. (2014) had a continuous outcome.
A comprehensive discussion of the relative advantages and disadvantages of adaptive versus fixed randomization is beyond the scope of this paper. Indeed, the use of adaptive randomization is a widely discussed and somewhat controversial topic in clinical trials. For binary responses, a number of comparisons (Korn and Freidlin, 2011; Berry, 2011; Thall et al., 2015; Wathen and Thall, 2017) have focused on the BAR scheme proposed by Thall and Wathen (2007). Particularly in the two-arm setting, fixed randomization appears to be preferable to this scheme in terms of power and the number of treatment failures, except when the number of patients to be treated beyond the trial is small (as in rare diseases) or where there are large treatment differences (Lee et al., 2012; Du et al., 2015).
However, even in the two-arm setting, optimal response-adaptive schemes (i.e. those that target some formal optimality criteria) have been shown to have benefits over fixed randomization by increasing both power and patient benefit simultaneously (Rosenberger et al., 2001; Rosenberger and Hu, 2004; Tymofyeyev et al., 2007; Bello and Sabo, 2016). In the multi-arm setting, which is the focus of this paper, adaptive randomization can have further advantages over fixed randomization (Berry, 2011; Wason and Trippa, 2014; Hey and Kimmelman, 2015; Berry, 2015), particularly for more complex trial designs.
Response-adaptive designs also have applications outside the context of clinical trials. For example, multi-arm bandit models are used for market learning in economics (Bergemann and Välimäki, 2006) and to improve modern production systems that emphasize ‘continuous improvement’ (Scott, 2010). Some of the ethical concerns surrounding adaptive randomization (Hey and Kimmelman, 2015) would not apply in these contexts.
Despite the extensive literature on response-adaptive randomization, relatively few clinical trials have actually used such schemes in practice. One of the first examples, which used a randomized play-the-winner rule, was a trial of extracorporeal membrane oxygenation to treat newborns with respiratory failure (Bartlett et al., 1985). More recent examples include a three-armed trial in untreated patients with adverse karyotype acute myeloid leukemia (Giles et al., 2003), which used BAR. The ongoing I-SPY 2 trial (Park et al., 2016; Rugo et al., 2016), which screens drugs in neoadjuvant breast cancer, also uses BAR as part of its design.
A key concern over using response-adaptive randomization, particularly from a regulatory perspective, is ensuring that the type I error rate is controlled. Indeed, draft regulatory guidance from the U.S. Food and Drug Administration (2010) includes adaptive randomization under a section entitled “Adaptive Study Designs Whose Properties Are Less Well Understood”. It then goes on to state that “particular attention should be paid to avoiding bias and controlling the Type I error rate” (Food and Drug Administration, 2010, p. 27) when using adaptive randomization in trials.
In a multi-arm trial, multiple hypotheses are tested simultaneously by design, which leads to a multiple testing problem. To account for this, testing procedures are used that guarantee strong control of the familywise error rate (FWER); that is, the maximum probability of making at least one type I error is controlled, whichever of the null hypotheses are true. For confirmatory trials in particular, demonstrating strong control of the FWER is often required by regulators (Food and Drug Administration, 2010; European Medicines Agency, 2002).
For response-adaptive trials, a rigorous proof of FWER control for a particular design is difficult given the complexities of the treatment allocation process. Hence error control has typically been demonstrated either through simulation studies or by exploiting the asymptotic structure of the adaptive randomization procedure (Hu and Rosenberger, 2006; Zhu and Hu, 2010). However, neither method provides a guarantee of FWER control, particularly with small sample sizes. Gutjahr et al. (2011) showed how to achieve strong control of the FWER for normally distributed outcomes in a two-stage design incorporating response-adaptive randomization. However, our focus is on general response-adaptive trials, without the necessity of restricting to two stages or having a final stage of equal randomization.
In this paper, we show how to guarantee strong control of the FWER for both fully sequential and block randomized response-adaptive trials, for a large class of adaptive randomization rules. Our proposed procedure works by reweighting the usual z-statistic through an iterative application of the conditional invariance principle. The resulting adaptive test statistic can then be used to test the elementary null hypothesis that a treatment is not superior to the control.
The rest of the paper is organized as follows. In Section 2, we describe the proposed method for fully sequential response-adaptive trials with a fixed allocation to the control. This method is then modified for block randomized response-adaptive trials in Section 3, for either a fixed or an adaptive control allocation. Simulation studies for the proposed methods are presented in Section 4, and Section 5 gives a case study based on a trial in primary hypercholesterolemia. We conclude with a discussion in Section 6. All proof details can be found in the Appendices.
2 Fully sequential response-adaptive trials
2.1 Trial setting
Suppose a trial is conducted to test experimental treatments against a common control, using the following design. A total of patients are allocated to the experimental treatments, and patients are allocated to the control, where and are fixed in advance. Patients are allocated to the different experimental treatments using response-adaptive randomization, where we assume that the randomization rule does not depend on the control information. We also assume the allocation to the control is fixed; that is, the probability of assigning a patient to the control is prespecified and constant. Maintaining allocation to the control is recommended by the Food and Drug Administration (2010), since it best maintains the power of the trial, and helps address the concern about changing patient characteristics over the course of the trial.
The response-adaptive randomization for the experimental treatments starts with a burn-in period , which uses equal randomization to allocate patients to the th treatment , with the again fixed in advance. Hence a total of patients are allocated to the experimental treatments during the burn-in period. Let denote the treatment allocation for the th experimental patient (), where if the th patient is allocated to the th treatment. Also, let denote the efficacy outcome for the th patient. Similarly, let denote the efficacy outcomes for the th patient on the control (). We assume that
The variance is assumed known and, without loss of generality, we set . Here represents the incremental benefit of treatment compared to the control, and is the parameter of interest. Finally, let denote the total number of allocations to the th experimental treatment, including the burn-in period.
2.2 Hypothesis testing
The elementary null hypotheses of interest are against the one-sided alternatives . We discuss the case when at the end of Section 2.5. One general method to control for multiple testing is to use the closure principle (Marcus et al., 1976) and consider all intersection hypotheses , where . To strongly control the FWER, we reject an elementary null hypothesis if we also reject every with using a local level-α test. Hence we need to define a valid level-α test for all the intersection hypotheses . The naïve test for , which does not take into account the response-adaptive randomization used in the trial, rejects if the test statistic
is greater than , where and is the standard normal quantile.
As an alternative to using the closure principle with the test statistic above, we could control the FWER by simply using a Bonferroni correction, or a step-up/step-down procedure such as the Holm procedure. These would only involve calculating test statistics for the elementary null hypotheses, i.e. calculating for (). Hence we present the methodology assuming the closure principle will be used, with the Bonferroni and Holm procedures considered as special cases. We return to this issue in Section 4.
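The closure principle described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `pooled_z` and `closed_test` are hypothetical, the variance is taken to be 1, the allocation is treated as fixed, and the local test for each intersection hypothesis pools the observations from all arms in the subset (an assumption of this sketch, since the statistic itself is not reproduced here).

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def pooled_z(treat_obs, control_obs, subset):
    """Pooled z-statistic for the intersection hypothesis indexed by `subset`.

    Assumes unit variance and a fixed (non-adaptive) allocation; pooling the
    arms in `subset` is an assumption of this sketch.
    """
    pooled = [y for k in subset for y in treat_obs[k]]
    n_i, n_0 = len(pooled), len(control_obs)
    diff = sum(pooled) / n_i - sum(control_obs) / n_0
    return diff / sqrt(1 / n_i + 1 / n_0)


def closed_test(treat_obs, control_obs, alpha=0.05, stat=pooled_z):
    """Return the set of arms whose elementary null hypothesis is rejected."""
    arms = sorted(treat_obs)
    crit = NormalDist().inv_cdf(1 - alpha)
    # Local level-alpha decision for every non-empty intersection hypothesis.
    local = {
        subset: stat(treat_obs, control_obs, subset) > crit
        for r in range(1, len(arms) + 1)
        for subset in combinations(arms, r)
    }
    # Reject H_k only if every intersection hypothesis containing k is rejected.
    return {k for k in arms if all(rej for s, rej in local.items() if k in s)}
```

Passing a different `stat` function swaps in another local test without changing the closure logic, which is the sense in which the closure principle only needs a valid local test for each intersection hypothesis.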
2.3 Inflation of the familywise error rate
Since the test ignores the adaptive randomization used, it is possible to inflate the FWER. As an example, consider the following adaptive randomization scheme for treatments:
where . This can be viewed as implementing early stopping for efficacy for treatment 1, which is not taken into account using the naïve test.
We ran a simulation study to calculate the type I error rate using the above randomization scheme. We set , , and the true treatment means , . The type I error rate as averaged over simulations is , more than double the nominal level. We subsequently refer to allocation rules of this type as ‘type I error inflator’ rules (which clearly would never be used in practice).
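The mechanism behind this inflation can be explored with a small simulation. The sketch below does not reproduce the rule or the parameter values used above; it assumes a hypothetical rule of the same flavour (keep allocating to treatment 1 until its standardized mean crosses a threshold, then switch to treatment 2), and all sample sizes, the threshold and the function names are illustrative choices. Whether and how strongly the naive error rate exceeds the nominal level depends on those choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_trial(rng, n_seq=40, n_control=30, burn_in=5, threshold=2.0):
    """One trial under the global null (all means 0, unit variance).

    Hypothetical rule: allocate to arm 1 until sqrt(n1) * mean1 exceeds
    `threshold`, then allocate every remaining patient to arm 2.
    """
    arm1 = [rng.gauss(0, 1) for _ in range(burn_in)]
    arm2 = [rng.gauss(0, 1) for _ in range(burn_in)]
    for _ in range(n_seq):
        stopped = sum(arm1) / sqrt(len(arm1)) > threshold
        (arm2 if stopped else arm1).append(rng.gauss(0, 1))
    control = [rng.gauss(0, 1) for _ in range(n_control)]
    # Naive z-statistic for arm 1 versus control, ignoring the adaptation.
    diff = sum(arm1) / len(arm1) - sum(control) / n_control
    z1 = diff / sqrt(1 / len(arm1) + 1 / n_control)
    return z1, len(arm1), len(arm2)


def naive_error_rate(n_sims=20000, alpha=0.05, seed=1):
    """Monte Carlo estimate of the naive one-sided type I error rate for arm 1."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)
    rejections = sum(simulate_trial(rng)[0] > crit for _ in range(n_sims))
    return rejections / n_sims
```

Note that the rule only looks at the outcomes on treatment 1, in line with the assumption that the randomization rule does not depend on the control information.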
2.4 Auxiliary design
Working with the actual design of the trial is difficult because the response-adaptive randomization affects the distribution of the usual test statistics. Hence for each we introduce a simpler design, called the auxiliary design, for which we do know the distribution. The actual trial design can then be viewed as a series of data-dependent modifications of the auxiliary design, where we account for the modifications using the conditional invariance principle. The auxiliary designs are purely hypothetical, and are only used to construct the modified tests for the actual design. In addition, the allocations in the auxiliary designs are fixed before the start of the actual trial.
The auxiliary design for hypothesis is as follows. As in the actual design, a total of patients are allocated to the experimental treatments, and patients are allocated to the control. The allocations and responses to the control treatment are the same as the actual design. For the patients allocated to the experimental treatments, the auxiliary design starts with a burn-in period with patients that is identical to the actual design. The subsequent allocations are given by a fixed sequence , which can be chosen arbitrarily. These allocations can be considered as a ‘guess’ of a likely allocation sequence of the actual trial design. One possibility would be to randomize equal numbers of patients for each treatment. The final allocation must be to one of the treatments in .
We now introduce some notation for the auxiliary design. Let denote the total number of allocations to the th experimental treatment. Also let denote the total number of allocations to the th treatment for patients . We define and . Under the auxiliary design, is fixed for all , and hence under , the usual statistic
is normally distributed with mean zero and variance . Hence we reject if is greater than .
2.5 Adaptive test statistic
Adaptive designs, such as the trial being considered, follow a common conditional invariance principle in order to control the type I error rate (Brannath et al., 2007). For our response-adaptive trial in question, we apply the conditional invariance principle sequentially, where each step considers the next patient recruited into the trial. Below we give the test statistic for testing hypothesis under the actual design, given that the allocation is fully sequential. The proof of Theorem 2.1 can be found in Appendix A.
Theorem 2.1.
Under , the following test statistic is normally distributed with mean 0 and variance :
where
Hence we reject if is greater than . In Appendix B, we give some simple numerical examples of how the weights change over the course of a trial. In practice, to keep the weights as close to the natural weight for as many of the control observations as possible, we recommend setting and , as used for the simulation studies in Section 4.1.
In all of the scenarios that we have investigated, the weights for the experimental treatments have been positive. Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses . To see this, suppose the elementary null hypotheses are . Under , we can rewrite the distribution of the responses as , where . Hence under
where and are the adaptive test statistics for and respectively.
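The idea of reweighting a z-statistic so that its null distribution is unaffected by data-dependent design changes can be illustrated in a much simpler setting than the one treated in this section. The sketch below is not the adaptive test statistic of Theorem 2.1: it is a one-sample, two-stage weighted combination with weights fixed by a planned design, and the function name and arguments are hypothetical. Under the null, the second-stage z-statistic is standard normal conditionally on the first stage whatever second-stage size is chosen, so the weighted sum remains standard normal.

```python
from math import sqrt


def weighted_z(stage1, stage2, planned_n1, planned_n2):
    """Combine stage-wise z-statistics with weights fixed by the PLANNED design.

    stage1, stage2: observations with mean 0 under the null and unit variance.
    The actual size of stage 2 may be chosen after looking at stage 1;
    the weights may not.
    """
    w1 = sqrt(planned_n1 / (planned_n1 + planned_n2))
    w2 = sqrt(planned_n2 / (planned_n1 + planned_n2))
    z1 = sum(stage1) / sqrt(len(stage1))  # N(0, 1) under the null
    z2 = sum(stage2) / sqrt(len(stage2))  # N(0, 1) given stage 1, under the null
    return w1 * z1 + w2 * z2
```

When the realized stage sizes equal the planned ones, the statistic reduces to the usual z-statistic based on all observations; the weights only depart from the natural ones when the design is modified.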
3 Block randomized response-adaptive trials
3.1 Trial setting
It may not be feasible or desirable to randomize patients one-by-one in a fully sequential manner. Instead one can use block randomization, where after the burn-in period , patients are adaptively randomized to the experimental treatments in blocks of size over stages, with . The randomization of the th block depends on the data up to block , as well as any external information available at the time. Defining , let for , which represents the total number of allocations by the end of the th block, with the zeroth block corresponding to the burn-in period. For notational convenience, we let . The allocation to the control is again assumed to be fixed throughout the trial.
Due to the block structure of the trial, we can relax the assumption that the randomization rule used for the experimental treatments does not depend on the control information. This is achieved by splitting up the patients allocated to the control into blocks. More explicitly, suppose that during the burn-in period, patients are allocated to the control, where is fixed in advance. Subsequently, in the th block, patients are allocated to the control, where . We assume that for the final block .
The response-adaptive randomization at block may now depend on the control information available at the end of block ; that is, the outcome data available from the first patients allocated to the control. For notational convenience, define and let (), which represents the total number of allocations to the control by the end of the th block. We also let .
To control the FWER, we can modify the approach described in Section 2 to account for the block structure. As before, we have an auxiliary design for the patients on the experimental treatments, but now in step of the process () the actual design is a data-dependent modification of all the allocations for the patients in block . Hence the weights for the observations in each block will be the same, and are updated block-by-block.
3.2 Auxiliary design and adaptive test statistic
The auxiliary design for an intersection hypothesis is the same as described in Section 2.4, except that we now impose a block structure on the auxiliary assignments to the experimental treatments. As before, the auxiliary and actual designs are identical during the burn-in period , and we require . For the auxiliary design, let denote the total number of allocations to the th treatment (), including the burn-in period. Also let
denote the total number of allocations to the control and th treatment respectively for patients in blocks . We define and .
We apply the conditional invariance principle block-by-block, where each step considers an additional block of patients recruited into the trial. This gives the following test statistic for testing , with a proof and the formulae for the weights given in Appendix C.
Theorem 3.1.
If then under , the following test statistic is normally distributed with mean 0 and variance :
Corollary 3.2.
If , then let , where . Under , the following test statistic is normally distributed with mean 0 and variance :
We reject if is greater than . In order to keep the weights as close to the natural weight for as many of the control observations as possible, we recommend setting and , as used for the simulation studies in Section 4.2. In all of the scenarios that we have investigated, the weights for the experimental treatments have all been positive. Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses .
3.3 Extension for adaptive control allocations
Thus far, we have assumed that the allocations to the control follow some fixed scheme. We now relax this assumption in the block-randomized setting. Since the form of the adaptive test statistic is similar to the one presented above, the formula for can be found in Appendix D. Note that it is possible the procedure will fail to give a valid test statistic in this setting, as shown in Appendix E.1.
4 Simulation studies
As we have already seen in Section 2.3, using the closure principle with the usual z-test does not strongly control the FWER. An alternative method of control is to use the Bonferroni correction on the elementary null hypotheses . We also consider the Holm procedure, which is a step-down procedure that is uniformly more powerful than Bonferroni (Holm, 1979). An advantage of both these procedures is that only test statistics are calculated, rather than test statistics when using the closure principle. This motivates also applying the Holm procedure to the p-values derived from the adaptive test statistics for . More precisely, we use the adjusted p-values , instead of the usual p-values derived from the z-test.
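For reference, the Bonferroni and Holm procedures operate on a list of p-values as sketched below. The function names are illustrative, and the p-values passed in could come from either the usual or the adaptive test statistics; the sketch makes no assumption about how they were obtained.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_k whenever p_k <= alpha / m, where m is the number of hypotheses."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare ordered p-values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # step-down: stop at the first non-rejection
        reject[i] = True
    return reject
```

With p-values `[0.02, 0.01, 0.2]` and a 5% level, Bonferroni rejects only the second hypothesis, whereas Holm rejects the first two, which illustrates why Holm is never less powerful than Bonferroni.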
To distinguish between the different methods, we call our proposed procedure that uses the closure principle the ‘adaptive closed test’. Similarly, applying the closure principle to the usual z-test gives the ‘closed test’. Applying the Holm procedure to our adjusted p-values gives the ‘Holm adaptive test’, while applying the Holm procedure to the usual p-values gives the ‘Holm test’. In our simulation studies, we compare the different methods primarily by looking at the FWER. However, another key consideration is clearly the power of the different tests. To keep the comparisons simple, and as a similar measure to the FWER, we present results for the disjunctive power, which is the probability of rejecting at least one false null hypothesis.
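Both performance measures can be estimated from simulated trials in the same way. The following sketch assumes a hypothetical data layout in which each simulated trial yields one rejection indicator per elementary hypothesis; the function name is illustrative.

```python
def fwer_and_disjunctive_power(rejections, null_is_true):
    """Estimate the FWER and the disjunctive power from simulated rejections.

    rejections:   one list of booleans per simulated trial (True = H_k rejected).
    null_is_true: one boolean per hypothesis (True = H_k is a true null).
    A measure is returned as None when it is undefined for the scenario.
    """
    n = len(rejections)
    # Trials with at least one rejected TRUE null (a type I error).
    any_error = sum(
        any(r and t for r, t in zip(trial, null_is_true)) for trial in rejections
    )
    # Trials with at least one rejected FALSE null.
    any_correct = sum(
        any(r and not t for r, t in zip(trial, null_is_true)) for trial in rejections
    )
    fwer = any_error / n if any(null_is_true) else None
    power = any_correct / n if not all(null_is_true) else None
    return fwer, power
```

Returning `None` for an undefined measure mirrors the fact that a scenario with only true nulls has no disjunctive power to report, and a scenario with only false nulls has no familywise error.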
4.1 Fully sequential randomization
We first consider a fully sequential response-adaptive trial, as presented in Section 2, with patients allocated to the experimental treatments after the burn-in and patients allocated to the control. In the burn-in period, five patients are allocated to each of the experimental treatments. We set and the true control mean for simplicity. We compare the methods under two randomization schemes described below.
Type I error inflator: For treatments, this is the same randomization scheme as presented in Section 2.3. For treatments, if , then we randomize patient to treatments 2 and 3 with equal probability.
BAR: The efficacy outcome for the th experimental treatment follows a distribution. For simplicity, we assign independent normal priors to the , so that , and let . After observing the efficacy outcomes for the first patients, the posterior for is as follows:
We use a suggested BAR scheme of Yin et al. (2012). For experimental treatments, the randomization probabilities after observing the th patient are:
For experimental treatments, we first obtain the average of the posterior means . The randomization probabilities after observing the th patient are:
In our simulations, for simplicity we set the priors and , while .
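As a rough illustration of how a BAR scheme turns accumulating data into randomization probabilities, the sketch below combines the standard conjugate update for a normal mean with known variance with a simple two-arm allocation rule. The allocation rule shown (weights proportional to the posterior probability of being the better arm, raised to a power `gamma`) is an assumption chosen for illustration and is not the Yin et al. (2012) scheme used in the simulations; the function names and default prior values are likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(outcomes, prior_mean=0.0, prior_var=1.0, sigma2=1.0):
    """Conjugate update for a normal mean with known outcome variance sigma2.

    Returns the posterior mean and posterior variance.
    """
    precision = 1 / prior_var + len(outcomes) / sigma2
    mean = (prior_mean / prior_var + sum(outcomes) / sigma2) / precision
    return mean, 1 / precision


def two_arm_probabilities(post1, post2, gamma=0.5):
    """Illustrative rule: weight each arm by P(it has the larger mean) ** gamma."""
    (m1, v1), (m2, v2) = post1, post2
    # Independent normal posteriors, so the difference in means is also normal.
    p1_best = NormalDist().cdf((m1 - m2) / sqrt(v1 + v2))
    w1, w2 = p1_best ** gamma, (1 - p1_best) ** gamma
    return w1 / (w1 + w2), w2 / (w1 + w2)
```

Smaller values of `gamma` pull the probabilities towards equal randomization, which is one common way of damping the adaptation early in a trial.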
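Simulation results: Table 1 gives the results for the type I error inflator randomization scheme, while Table 2 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on .

Table 1. Type I error inflator, fully sequential randomization. Entries are Error / Power in %; "-" marks a measure that is not defined for the scenario.
Scenario | Parameter values | Adaptive closed test | Adaptive test (Holm) | Closed z-test | z-test (Holm) | z-test (Bonferroni)
1. | | 3.3 / - | 4.7 / - | 4.7 / - | 7.0 / - | 7.0 / -
2. | , | 4.8 / 21.7 | 3.7 / 27.5 | 10.3 / 26.5 | 9.9 / 63.6 | 5.0 / 63.5
3. | | - / 62.4 | - / 52.4 | - / 69.9 | - / 61.6 | - / 61.6
4. | | 2.8 / - | 3.8 / - | 4.1 / - | 5.9 / - | 5.9 / -
5. | , | 3.2 / 13.1 | 4.2 / 24.2 | 5.1 / 17.2 | 6.4 / 54.2 | 4.5 / 54.1
6. | , | 4.6 / 22.2 | 3.2 / 28.0 | 9.7 / 27.0 | 9.0 / 75.4 | 3.2 / 75.4
7. | , , | 4.0 / 19.1 | 2.6 / 24.5 | 9.1 / 23.9 | 7.4 / 58.5 | 3.2 / 58.4
8. | | - / 51.3 | - / 41.7 | - / 57.8 | - / 49.7 | - / 49.7

Table 2. BAR, fully sequential randomization. Entries are Error / Power in %; "-" marks a measure that is not defined for the scenario.
Scenario | Parameter values | Adaptive closed test | Adaptive test (Holm) | Closed z-test | z-test (Holm) | z-test (Bonferroni)
1. | | 4.7 / - | 4.5 / - | 4.8 / - | 4.1 / - | 4.1 / -
2. | , | 4.6 / 46.4 | 4.4 / 52.4 | 3.9 / 46.7 | 3.6 / 53.6 | 1.9 / 53.5
3. | | - / 70.8 | - / 66.4 | - / 71.2 | - / 65.9 | - / 65.9
4. | | 3.8 / - | 4.1 / - | 4.0 / - | 3.8 / - | 3.8 / -
5. | , | 4.4 / 59.9 | 4.2 / 88.7 | 4.3 / 60.1 | 3.8 / 90.6 | 2.6 / 90.6
6. | , | 4.8 / 89.8 | 4.7 / 95.1 | 4.0 / 90.1 | 3.9 / 96.0 | 1.3 / 96.0
7. | , , | 4.3 / 74.8 | 3.9 / 88.2 | 3.9 / 75.7 | 3.4 / 90.0 | 1.4 / 90.0
8. | | - / 56.5 | - / 51.8 | - / 57.9 | - / 52.7 | - / 52.7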
Looking first at the results for the type I error inflator in Table 1, the closed test does not control the FWER in any of the scenarios where at least one null hypothesis is false, with an error rate as high as 10.3% in scenario 2. Applying the Holm procedure to the z-test does not control the FWER either, and actually increases the error rate in some scenarios (such as 1 and 4). Applying the Bonferroni correction to the z-test also does not control the FWER, as can be seen in the scenarios where all null hypotheses are true. This may appear surprising at first, but the inflation occurs because the naïve test is not a valid level-α test for each elementary hypothesis. In contrast, both the adaptive closed test and the Holm adaptive test strongly control the FWER.
As for the power of the different methods, when at least one of the null hypotheses is true (as in scenarios 2, 5, 6 and 7), the Holm test has substantially higher power than the closed test. Indeed, the power more than doubles in all four scenarios, and even more than triples in scenario 5. This dramatic increase in power demonstrates that in these scenarios, the closed test is not very sensitive. This is because the test statistic for will be ‘diluted’ by the contribution from responses belonging to the null hypotheses that are true. It is only when all of the null hypotheses are false, as in scenarios 3 and 8, that the power of the closed test is reasonable, with a slightly higher power than the Holm test.
As for the adaptive tests, the adaptive closed test has a slightly lower power than the closed test for all scenarios, with an absolute decrease of between 4.1% in scenario 5 and 7.5% in scenario 3. However, the Holm adaptive test has a substantially lower power than the Holm test, with the latter having more than double the power whenever at least one null hypothesis is true. This demonstrates the high cost in terms of power that controlling the FWER can incur for this randomization scheme. We return to this issue in Section 4.3.
Turning to the BAR scheme in Table 2, this time all of the methods strongly control the FWER. All methods are slightly conservative, with the adaptive closed test being generally the closest to the nominal level. The Bonferroni-corrected test is noticeably more conservative than all the other methods, particularly when there are three treatments. In terms of disjunctive power, if at least one of the null hypotheses is true, we again see that the closed tests suffer from reduced power compared to the Holm versions. However, with BAR the loss of power is less dramatic, with the largest relative decrease in power occurring in scenario 5, and much smaller decreases in scenarios 2 and 7, for example. This time, the adaptive closed test has almost the same power as the closed test, losing a maximum of only 1.4% in scenario 8. In addition, the Holm adaptive test and Holm test now have comparable power, with a maximum loss of only in scenarios 6 and 7. This indicates that for BAR schemes, the adaptive tests do not lose out very much in terms of power.
4.2 Block randomization with a fixed control allocation
We now consider block randomized trials with a fixed control allocation, as presented in Section 3.1. We use the setup of a trial with three blocks, with sizes (40, 40, 40) for the experimental treatments and (20, 20, 20) for the control. In the burn-in period, five patients are allocated to each of the treatments including the control. We set the true control mean , and . We compare the methods under the randomization schemes below.
Type I error inflator: The allocation probabilities for block , patient and treatment are:
where .
BAR: The efficacy outcome for the th treatment follows a distribution. For notational convenience, let ; that is, the mean of the control. We assign independent normal priors to the (), such that . At stage , when the efficacy outcomes have been observed, the posterior for is as follows:
where .
We use a similar BAR scheme to the one in Wason and Trippa (2014). If there are experimental treatments, the randomization probabilities for the experimental treatments at the th stage are:
In our simulations, for simplicity we set the priors and , while .
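Simulation results: Table 3 gives the results for the type I error inflator randomization scheme, while Table 4 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on .

Table 3. Type I error inflator, block randomization with a fixed control allocation. Entries are Error / Power in %; "-" marks a measure that is not defined for the scenario.
Scenario | Parameter values | Adaptive closed test | Adaptive test (Holm) | Closed z-test | z-test (Holm) | z-test (Bonferroni)
1. | | 3.8 / - | 4.8 / - | 4.6 / - | 6.5 / - | 6.5 / -
2. | , | 4.8 / 22.0 | 3.6 / 26.9 | 8.3 / 25.6 | 7.8 / 61.1 | 4.3 / 61.0
3. | | - / 92.7 | - / 87.9 | - / 94.6 | - / 91.7 | - / 91.7
4. | | 3.2 / - | 4.1 / - | 4.1 / - | 6.1 / - | 6.1 / -
5. | , | 3.7 / 14.2 | 4.4 / 23.4 | 4.7 / 18.1 | 6.2 / 61.2 | 4.5 / 61.1
6. | , | 4.9 / 20.1 | 3.2 / 26.1 | 8.1 / 23.0 | 7.3 / 78.5 | 3.2 / 78.4
7. | , , | 4.7 / 17.7 | 3.0 / 23.8 | 8.0 / 21.1 | 6.7 / 66.2 | 2.8 / 66.2
8. | | - / 91.3 | - / 83.4 | - / 94.0 | - / 89.7 | - / 89.7

Table 4. BAR, block randomization with a fixed control allocation. Entries are Error / Power in %; "-" marks a measure that is not defined for the scenario.
Scenario | Parameter values | Adaptive closed test | Adaptive test (Holm) | Closed z-test | z-test (Holm) | z-test (Bonferroni)
1. | | 4.8 / - | 4.6 / - | 4.8 / - | 4.5 / - | 4.5 / -
2. | , | 5.0 / 61.2 | 4.9 / 82.7 | 4.9 / 61.2 | 4.8 / 82.9 | 2.5 / 82.8
3. | | - / 94.5 | - / 92.3 | - / 94.5 | - / 92.2 | - / 92.2
4. | | 3.7 / - | 4.5 / - | 3.7 / - | 4.2 / - | 4.2 / -
5. | , | 4.4 / 36.1 | 4.6 / 71.8 | 4.3 / 36.0 | 4.4 / 71.8 | 3.0 / 71.7
6. | , | 5.0 / 67.3 | 4.6 / 85.6 | 4.8 / 66.8 | 4.4 / 85.4 | 1.6 / 85.4
7. | , , | 4.6 / 51.1 | 3.7 / 73.0 | 4.4 / 50.9 | 3.5 / 72.6 | 1.6 / 72.6
8. | | - / 93.5 | - / 90.7 | - / 93.4 | - / 90.4 | - / 90.4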
The results are broadly similar to those for the fully sequential setting presented in Section 4.1. For the type I error inflator, the closed test does not control the FWER in general (as seen in scenarios 2, 6 and 7), and neither does applying the Holm procedure to the z-test. The Bonferroni-corrected test has an inflated FWER when all null hypotheses are true, as in scenarios 1 and 4. In contrast, the adaptive tests strongly control the FWER in all scenarios. However, this again comes at the cost of reduced power. There is a slight reduction in power between the closed test and the adaptive closed test, of between 1.9 and 3.9 percentage points in absolute terms. In scenarios where at least one null hypothesis is true, the Holm test has a much higher power than the Holm adaptive test, with the power more than doubling in these scenarios, and actually tripling in scenario 6.
As for the BAR scheme, all of the methods strongly control the FWER. This time, for some scenarios the adaptive closed test basically achieves the nominal level, as in scenarios 2 and 6. When there are three treatments, the Bonferroni-corrected test can again be overly conservative, as in scenarios 6 and 7. In contrast to the fully sequential setting, with block randomization we see that the adaptive tests actually have the highest power out of all the methods in all scenarios except scenario 2. When at least one null hypothesis is true, the Holm adaptive test has the highest power, while when all null hypotheses are false the adaptive closed test has the highest power. The power gains are small, but demonstrate that we do not always lose out in terms of power when using the proposed adaptive tests.
Block randomization with an adaptive control allocation: In Appendix E.1, we present a simulation study considering block randomization with an adaptive control allocation, as presented in Section 3.3. The results are broadly similar to those presented above.
4.3 Summary
In summary, the simulation results show that in the randomization settings considered, our proposed adaptive tests strongly control the FWER, as would be expected from theory. In contrast, the various z-tests can all fail to control the error rate, as seen in the results for the type I error inflator. However, given a more realistic randomization scheme, such as the BAR schemes we considered, the z-tests achieve strong familywise error control. As for disjunctive power, we see that when at least one null hypothesis is true, the closed tests suffer a very large drop in power compared to the Holm versions. This is because of the ‘dilution’ of the test statistic as mentioned in Section 4.1. However, when all the null hypotheses are false, the closed test has the higher power, although the gains are at most modest.
The adaptive tests can pay a large price in terms of power when compared with the z-tests, as seen in the results for the type I error inflator. In Appendix E.2, we give an additional simulation study with two treatments, where the randomization scheme used is simply a fixed allocation to the experimental treatments but with unequal randomization probabilities. We show that when the probability of assignment to treatment 2 is low (i.e. less than 0.2), there is a large drop in the power of the adaptive tests for testing . This explains what is happening with the type I error inflator when , where in the majority of trial scenarios, apart from the unlikely event that treatment 1 stops early for ‘efficacy’, the probability of assignment to treatment 2 is zero by design. Hence, the type I error inflator is in fact close to a worst-case scenario for the adaptive tests.
However, most adaptive randomization schemes are unlikely to have such extreme imbalances. Indeed, authors such as Korn and Freidlin (2011) recommend restricting the probability of arm assignment to between 0.2 and 0.8 in order to prevent extreme patient allocation. Hence, for ‘sensible’ adaptive randomization schemes with such a restriction, we would not expect there to be a substantial loss of power when using the Holm adaptive test compared with the Holm test, particularly in the block randomized setting.
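The restriction recommended by Korn and Freidlin (2011) can be sketched for the simplest case of one experimental arm and one control arm. The function name and the clip-then-complement rule below are assumptions made for illustration only and are not part of the schemes studied in this paper; with more than two arms, clipping alone would not guarantee that the probabilities still sum to one.

```python
def restrict_allocation(p_experimental, lower=0.2, upper=0.8):
    """Clip a two-arm allocation probability to [lower, upper].

    Hypothetical helper: returns (p_experimental, p_control) after clipping.
    Assumes lower + upper == 1, so that the control probability also
    stays inside [lower, upper].
    """
    if not 0.0 <= p_experimental <= 1.0:
        raise ValueError("p_experimental must be a probability")
    clipped = min(max(p_experimental, lower), upper)
    return clipped, 1.0 - clipped
```

Under this sketch, an adaptive rule proposing 0.95 for the experimental arm would be truncated to 0.8, leaving 0.2 for the control.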
5 Case study
Finally, we illustrate our proposed methodology using an example based on a phase II placebo-controlled trial in primary hypercholesterolemia (Roth et al., 2012). The purpose of the study was to compare the effects of using the SAR236553 antibody with high-dose or low-dose atorvastatin, as compared with high-dose atorvastatin alone. The primary outcome was the least-squares mean percent reduction from baseline of low-density lipoprotein cholesterol (LDL-C). Patients were randomly assigned, in a 1:1:1 ratio, to receive 80 mg of atorvastatin plus placebo, 10 mg of atorvastatin plus SAR236553, or 80 mg of atorvastatin plus SAR236553. For convenience, we label these interventions as the ‘control’, ‘low dose’ and ‘high dose’ respectively.
In the trial, the observed least-squares mean (SE) percent reduction from baseline in LDL-C was for the control, for the low dose and for the high dose. There were patients on the control, patients on the low dose and patients on the high dose, giving a total of patients on the two experimental doses. For our illustrative case study, we use the observed values from the trial and assume that the distribution of the least-squares standardized mean percent reduction from baseline of LDL-C is for the control, for the low dose, and for the high dose.
Now suppose that the trial was carried out as an adaptive block randomized trial with a fixed control allocation, as described in Section 3.1. Let the trial have three blocks, with block sizes (15, 15, 15) for the experimental treatments and (8, 8, 8) for the placebo. In the burn-in period, 7 patients are allocated to the control and 8 patients are allocated to each of the experimental doses. Hence, a total of 31 patients are on the control and 61 on the experimental treatments, as in the original trial. We use the BAR scheme of Section 4.2, with priors and (), while .
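As a quick arithmetic check of the design just described, the totals can be recomputed from the burn-in and block sizes. The variable names are illustrative; the numbers are those stated in the text.

```python
# Design constants taken from the case study description.
burn_in = {"control": 7, "low dose": 8, "high dose": 8}
experimental_block_sizes = (15, 15, 15)  # shared between the two doses by the BAR scheme
control_block_sizes = (8, 8, 8)          # fixed control allocation per block

# Total patients on the control: burn-in plus the fixed control blocks.
total_control = burn_in["control"] + sum(control_block_sizes)

# Total patients on the two experimental doses combined.
total_experimental = (
    burn_in["low dose"] + burn_in["high dose"] + sum(experimental_block_sizes)
)

print(total_control, total_experimental)  # 31 61
```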
Table 5 shows the results for a simulated trial with the above parameters, where the BAR scheme allocated 13 patients to the low dose and 32 patients to the high dose after the burn-in period. This yields the natural weights used in the naïve test of for the low dose and for the high dose. The natural weight for the control is by design. The auxiliary design randomly assigned 44 patients to the low or high dose in a 1:1 ratio, and allocated 21 patients to the low dose and 23 patients to the high dose.
                         Low dose    High dose
 test statistic          13.76 ()    15.50 ()
Adaptive test statistic  12.21 ()    16.22 ()
Natural weights          ,           ,
Adaptive weights
The adaptive test statistic is slightly smaller than the test statistic for the low dose, while the converse is true for the high dose. Looking at the adaptive weights for the burn-in period and the three blocks, we see that in the low-dose comparison, the weights on the low-dose observations decrease with each block while the control weights increase. This pattern is reversed for the high dose. Given that all the values are less than , using either the test or the adaptive test we would conclude that adding the SAR236553 antibody to high-dose or low-dose atorvastatin leads to a statistically significant reduction in LDL-C levels.
6 Discussion
A major regulatory concern over the use of response-adaptive trials in clinical practice has been ensuring control of the type I error rate. We have proposed procedures that guarantee strong familywise error control in the following multi-armed trial settings:

(1) Fully sequential response-adaptive trials with a fixed control allocation (where the randomization rule does not depend on the control information)
(2) Block-randomized response-adaptive trials with a fixed control allocation
(3) Block-randomized response-adaptive trials including an adaptive control allocation
These procedures are applicable to a large class of response-adaptive randomization rules, particularly in settings (2) and (3) where there are no restrictions on the rule used. Hence both Bayesian and ‘optimal’ response-adaptive randomization schemes proposed in the literature can be used without adjustment, with only the final test statistic having to be modified.
In practice, to control the FWER we would recommend using the Holm adaptive test. Importantly, it has a much higher power than the adaptive closed test when at least one of the null hypotheses is true. Moreover, it requires only one hypothesis test per experimental treatment, whereas the adaptive closed test requires a test of every intersection hypothesis.
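To make the comparison concrete, the following is a minimal sketch of the standard Holm (1979) step-down rule applied to a list of p-values, together with a count of the non-empty intersection hypotheses that a closed testing procedure has to consider. The function names are hypothetical, and the p-values are plain inputs here; in our setting they would come from the adaptive tests of the elementary hypotheses.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects.

    The p-values are ordered from smallest to largest; the i-th smallest
    (i = 0, 1, ...) is compared with alpha / (m - i), and testing stops
    at the first p-value that exceeds its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values are not rejected
    return reject


def n_intersection_hypotheses(m):
    """Number of non-empty intersections of m elementary hypotheses."""
    return 2 ** m - 1
```

With three experimental treatments, Holm's procedure uses three tests, while a closed test considers seven intersection hypotheses.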
Our adaptive tests lead to unequal weightings of patients, which may be controversial (Burman and Sonesson, 2006). One solution is to use the so-called ‘dual test’, and reject a hypothesis only if both the adaptive test and the naïve test reject (Denne, 2001; Posch et al., 2003; Chen et al., 2004), although this comes at the cost of reduced power.
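The dual test decision rule amounts to an elementwise logical AND of the two sets of rejection decisions. A minimal sketch, assuming the decisions of the adaptive and naïve procedures have already been computed for each hypothesis (the function name and input format are illustrative):

```python
def dual_test(rejects_adaptive, rejects_naive):
    """Reject a hypothesis only if both procedures reject it.

    Both arguments are lists of booleans, one entry per hypothesis,
    given in the same order.
    """
    if len(rejects_adaptive) != len(rejects_naive):
        raise ValueError("decision lists must have the same length")
    return [a and n for a, n in zip(rejects_adaptive, rejects_naive)]
```

Since the dual test can never reject more hypotheses than the adaptive test alone, its power cannot exceed that of the adaptive test.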
We have assumed that the variances of the control and experimental treatments are known. Fully accounting for unknown variances would add considerable complexity to our approach. In Appendix E.3, we show that estimating the common variance from the data does not inflate the FWER when using the Holm adaptive test, for any of the simulation scenarios considered in this paper.
Our proposed procedures are designed for normally distributed outcomes, and it would be useful to apply our approach to binary outcomes as well. As a starting point, it may be possible to use the asymptotically normal test statistic for contrasting each treatment arm with the control (Jennison and Turnbull, 2000; Wason and Trippa, 2014), particularly in the block randomised setting.
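As an indication of what such a statistic could look like, the sketch below computes an unpooled Wald-type statistic for the difference in response proportions between a treatment arm and the control. The unpooled variance form and the function name are assumptions made for illustration; the cited references may use a different variance estimate.

```python
import math


def two_proportion_z(successes_trt, n_trt, successes_ctrl, n_ctrl):
    """Unpooled Wald z-statistic for treatment versus control proportions."""
    p_trt = successes_trt / n_trt
    p_ctrl = successes_ctrl / n_ctrl
    # Estimated standard error of the difference in proportions.
    se = math.sqrt(
        p_trt * (1 - p_trt) / n_trt + p_ctrl * (1 - p_ctrl) / n_ctrl
    )
    if se == 0:
        raise ValueError("standard error is zero; statistic is undefined")
    return (p_trt - p_ctrl) / se
```

For example, 30 responders out of 50 on treatment against 20 out of 50 on control gives a statistic of roughly 2.04.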
Finally, although we did not explicitly consider it in this paper, the adaptive randomization procedures used could also incorporate covariate information, so that the allocation probabilities vary across patients with different covariates. These covariate-adjusted response-adaptive randomization schemes are particularly useful when certain characteristics of the patients may be correlated with the primary outcome (Hu and Rosenberger, 2006). A related setting would be biomarker-guided response-adaptive trials, such as I-SPY 2.
References
 Bartlett et al. (1985) Bartlett, R. H., Roloff, D. W., Cornell, R. G., Andrews, A. F., Dillon, P. W., and Zwischenberger, J. B. (1985). Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics 76, 479–487.
 Bello and Sabo (2016) Bello, G. A. and Sabo, R. T. (2016). Outcome-adaptive allocation with natural lead-in for three-group trials with binary outcomes. Journal of Statistical Computation and Simulation 86, 2441–2449.
 Bergemann and Välimäki (2006) Bergemann, D. and Välimäki, J. (2006). Bandit problems. Technical report, Cowles Foundation, http://ssrn.com/abstract=877173 [accessed 1 Mar 2018].
 Berry (2011) Berry, D. A. (2011). Adaptive clinical trials: the promise and the caution. Journal of Clinical Oncology 29, 606–609.
 Berry (2015) Berry, D. A. (2015). Commentary on Hey and Kimmelman. Clinical Trials 12, 107–109.
 Biswas et al. (2007) Biswas, A., Bhattacharya, R., and Zhang, L. (2007). Optimal response-adaptive designs for continuous responses in phase III trials. Biometrical Journal 49, 928–940.
 Biswas and Bhattacharya (2016) Biswas, A. and Bhattacharya, R. (2016). Response-adaptive designs for continuous treatment responses in phase III clinical trials: A review. Statistical Methods in Medical Research 25, 81–100.
 Brannath et al. (2007) Brannath, W., Koenig, F., and Bauer, P. (2007). Multiplicity and flexibility in clinical trials. Pharmaceutical statistics 6, 205–216.
 Burman and Sonesson (2006) Burman, C.-F. and Sonesson, C. (2006). Are flexible designs sound? Biometrics 62, 664–9; discussion 670–83.
 Chen et al. (2004) Chen, Y. H. J., DeMets, D. L., and Lan, K. K. G. (2004). Increasing the sample size when the unblinded interim result is promising. Statistics in Medicine 23, 1023–1038.
 Denne (2001) Denne, J. S. (2001). Sample size recalculation using conditional power. Statistics in Medicine 20, 2645–2660.
 Du et al. (2015) Du, Y., Wang, X., and Lee, J. J. (2015). Simulation study for evaluating the performance of response-adaptive randomization. Contemporary Clinical Trials 40, 15–25.
 Eisele (1994) Eisele, J. R. (1994). The doubly adaptive biased coin design for sequential clinical trials. Journal of Statistical Planning and Inference 38, 249–261.
 European Medicines Agency (2002) European Medicines Agency (2002). Points to Consider on Multiplicity Issues in Clinical Trials. London: CPMP.
 Food and Drug Administration (2010) Food and Drug Administration (2010). Guidance for Industry: Adaptive Design Clinical Trials for Drugs and Biologics; 2010. Available from: https://www.fda.gov/downloads/drugs/guidances/ucm201790.pdf [accessed 1 Mar 2018].
 Giles et al. (2003) Giles, F. J., Kantarjian, H. M., Cortes, J. E., Garcia-Manero, G., Verstovsek, S., Faderl, S., et al. (2003). Adaptive randomized study of idarubicin and cytarabine versus troxacitabine and cytarabine versus troxacitabine and idarubicin in untreated patients 50 years or older with adverse karyotype acute myeloid leukemia. Journal of Clinical Oncology 21, 1722–1727.
 Gutjahr et al. (2011) Gutjahr, G., Posch, M., and Brannath, W. (2011). Familywise error control in multi-armed response-adaptive two-stage designs. Journal of Biopharmaceutical Statistics 21, 818–830.
 Hey and Kimmelman (2015) Hey, S. P. and Kimmelman, J. (2015). Are outcome-adaptive allocation trials ethical? Clinical Trials 12, 102–106.
 Holm (1979) Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics pages 65–70.
 Hu and Rosenberger (2006) Hu, F. and Rosenberger, W. F. (2006). The theory of response-adaptive randomization in clinical trials, volume 525. John Wiley & Sons.
 Ivanova et al. (2006) Ivanova, A., Biswas, A., and Lurie, A. (2006). Response-adaptive designs for continuous outcomes. Journal of Statistical Planning and Inference 136, 1845–1852.
 Jennison and Turnbull (2000) Jennison, C. and Turnbull, B. (2000). Group sequential methods with applications to clinical trials. Chapman & Hall/CRC, Boca Raton, FL.
 Korn and Freidlin (2011) Korn, E. L. and Freidlin, B. (2011). Outcome–adaptive randomization: is it useful? Journal of Clinical Oncology 29, 771–776.
 Lee et al. (2012) Lee, J. J., Chen, N., and Yin, G. (2012). Worth adapting? Revisiting the usefulness of outcome-adaptive randomization. Clinical Cancer Research 18, 4498–4507.
 Marcus et al. (1976) Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika pages 655–660.
 Park et al. (2016) Park, J. W., Liu, M. C., Yee, D., Yau, C., van ’t Veer, L. J., Symmans, W. F., et al. (2016). Adaptive Randomization of Neratinib in Early Breast Cancer. The New England Journal of Medicine 375, 11–22.
 Posch et al. (2003) Posch, M., Bauer, P., and Brannath, W. (2003). Issues in designing flexible trials. Statistics in Medicine 22, 953–969.
 Rosenberger and Hu (2004) Rosenberger, W. F. and Hu, F. (2004). Maximizing power and minimizing treatment failures in clinical trials. Clinical Trials 1, 141–147.
 Rosenberger et al. (2001) Rosenberger, W. F., Stallard, N., Ivanova, A., Harper, C. N., and Ricks, M. L. (2001). Optimal adaptive designs for binary response trials. Biometrics 57, 909–913.
 Roth et al. (2012) Roth, E. M., McKenney, J. M., Hanotin, C., Asset, G., and Stein, E. A. (2012). Atorvastatin with or without an antibody to PCSK9 in primary hypercholesterolemia. The New England Journal of Medicine 367, 1891–1900.
 Rugo et al. (2016) Rugo, H. S., Olopade, O. I., DeMichele, A., Yau, C., van ’t Veer, L. J., Buxton, M. B., et al. (2016). Adaptive Randomization of Veliparib-Carboplatin Treatment in Breast Cancer. The New England Journal of Medicine 375, 23–34.
 Scott (2010) Scott, S. L. (2010). A modern Bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry 26, 639–658.
 Smith and Villar (2017) Smith, A. and Villar, S. S. (2017). Bayesian adaptive bandit-based designs using the Gittins index for multi-armed trials with normally distributed endpoints. Journal of Applied Statistics doi: 10.1080/02664763.2017.1342780.
 Thall et al. (2015) Thall, P., Fox, P., and Wathen, J. (2015). Statistical controversies in clinical research: scientific and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology 26, 1621–1628.
 Thall and Wathen (2007) Thall, P. F. and Wathen, J. K. (2007). Practical Bayesian adaptive randomisation in clinical trials. European Journal of Cancer 43, 859–866.
 Trippa et al. (2012) Trippa, L., Lee, E. Q., Wen, P. Y., Batchelor, T. T., Cloughesy, T., Parmigiani, G., and Alexander, B. M. (2012). Bayesian adaptive randomized trial design for patients with recurrent glioblastoma. Journal of Clinical Oncology 30, 3258–3263.
 Tymofyeyev et al. (2007) Tymofyeyev, Y., Rosenberger, W. F., and Hu, F. (2007). Implementing optimal allocation in sequential binary response experiments. Journal of the American Statistical Association 102, 224–234.
 Wason et al. (2014) Wason, J. M., Stecher, L., and Mander, A. P. (2014). Correcting for multiple-testing in multi-arm trials: is it necessary and is it done? Trials 15, 364.
 Wason and Trippa (2014) Wason, J. M. S. and Trippa, L. (2014). A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials. Statistics in Medicine 33, 2206–2221.
 Wathen and Thall (2017) Wathen, J. K. and Thall, P. F. (2017). A simulation study of outcome adaptive randomization in multi-arm clinical trials. Clinical Trials 14, 432–440.
 Wei and Durham (1978) Wei, L. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association 73, 840–843.
 Yin et al. (2012) Yin, G., Chen, N., and Lee, J. J. (2012). Phase II trial design with Bayesian adaptive randomization and predictive probability. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61, 219–235.
 Zhang and Rosenberger (2006) Zhang, L. and Rosenberger, W. F. (2006). Response-adaptive randomization for clinical trials with continuous outcomes. Biometrics 62, 562–569.
 Zhu and Hu (2010) Zhu, H. and Hu, F. (2010). Sequential monitoring of response-adaptive randomized clinical trials. The Annals of Statistics 38, 2218–2241.
Appendix A: Derivation of the weights for familywise error control in fully sequential response-adaptive trials
Below is a diagrammatic representation of the assignments and observations for the auxiliary design compared to the actual design for the patients on the experimental treatments:
Actual design