Familywise error control in multi-armed response-adaptive trials

03/14/2018
by   David S. Robertson, et al.
University of Cambridge

Response-adaptive designs allow the randomization probabilities to change during the course of a trial based on cumulated response data, so that a greater proportion of patients can be allocated to the better performing treatments. A major concern over the use of response-adaptive designs in practice, particularly from a regulatory viewpoint, is controlling the type I error rate. In particular, we show that the naive z-test can have an inflated type I error rate even after applying a Bonferroni correction. Simulation studies have often been used to demonstrate error control, but do not provide a guarantee. In this paper, we present adaptive testing procedures for normally distributed outcomes that ensure strong familywise error control, by iteratively applying the conditional invariance principle. Our approach can be used for fully sequential and block randomized trials, and for a large class of adaptive randomization rules found in the literature. We show there is a high price to pay in terms of power to guarantee familywise error control for randomization schemes with extreme allocation probabilities. However, for proposed Bayesian adaptive randomization schemes in the literature, our adaptive tests maintain or increase the power of the trial compared to the z-test. We illustrate our method using a three-armed trial in primary hypercholesterolemia.


1 Introduction

Clinical trials typically randomize patients using a fixed randomization scheme, where the probabilities of assigning patients to the experimental treatments and control are pre-specified and constant. A common method is to simply use equal randomization to the different arms of the trial. However, such randomization schemes can mean that a substantial proportion of the trial participants will continue to be allocated to treatments that are not the best available, even if interim data indicates that one treatment is likely to be superior. Response-adaptive trials address this concern by adaptively changing the randomization probabilities, so that a greater proportion of patients are allocated to the treatment arm which has a better performance based on the cumulated response data. Hence, as the trial continues and accumulates more data, patients in the trial can benefit from having a higher probability of being assigned to a better treatment.

Many classes of response-adaptive randomization schemes have been proposed in the literature for binary outcomes. Randomization schemes based on urn models (such as the randomized play-the-winner rule (Wei and Durham, 1978)), and adaptive biased coin designs (such as the doubly-adaptive biased coin design (Eisele, 1994)), have been extensively studied, with a comprehensive presentation given by Hu and Rosenberger (2006). Many Bayesian adaptive randomization (BAR) schemes have also been proposed (Thall and Wathen, 2007; Trippa et al., 2012; Yin et al., 2012; Wason and Trippa, 2014), where the randomization probabilities are recursively updated using a Bayesian model for the patient outcomes.

There is also a growing interest in response-adaptive randomization for continuous responses. For example, there are schemes based on doubly-adaptive biased coin designs (Hu and Rosenberger, 2006; Zhang and Rosenberger, 2006; Biswas et al., 2007), urn-based drop-the-loser designs (Ivanova et al., 2006) and bandit-based designs (Smith and Villar, 2017). A comprehensive recent overview is given by Biswas and Bhattacharya (2016). In this paper, our focus is on normally distributed outcomes, which are encountered in many clinical trials. Indeed, 23 out of the 59 multi-arm clinical trials identified in a review by Wason et al. (2014) had a continuous outcome.

A comprehensive discussion of the relative advantages and disadvantages of adaptive versus fixed randomization is beyond the scope of this paper. Indeed, the use of adaptive randomization is a widely discussed and somewhat controversial topic in clinical trials. For binary responses, a number of comparisons (Korn and Freidlin, 2011; Berry, 2011; Thall et al., 2015; Wathen and Thall, 2017) have focused on the BAR scheme proposed by Thall and Wathen (2007). Particularly in the two-arm setting, fixed randomisation appears to be preferable to this scheme in terms of power and the number of treatment failures, except when the number of patients to be treated beyond the trial is small (as in rare diseases) or where there are large treatment differences (Lee et al., 2012; Du et al., 2015).

However, even in the two-arm setting, optimal response-adaptive schemes (i.e. those that target some formal optimality criteria) have been shown to have benefits over fixed randomisation by increasing both power and patient benefit simultaneously (Rosenberger et al., 2001; Rosenberger and Hu, 2004; Tymofyeyev et al., 2007; Bello and Sabo, 2016). In the multi-arm setting, which is the focus of this paper, adaptive randomisation can have further advantages over fixed randomization (Berry, 2011; Wason and Trippa, 2014; Hey and Kimmelman, 2015; Berry, 2015), particularly for more complex trial designs.

Response-adaptive designs also have applications outside the context of clinical trials. For example, multi-arm bandit models are used for market learning in economics (Bergemann and Välimäki, 2006) and to improve modern production systems that emphasize 'continuous improvement' (Scott, 2010). Some of the ethical concerns surrounding adaptive randomization (Hey and Kimmelman, 2015) would not apply in these contexts.

Despite the extensive literature on response-adaptive randomization, relatively few clinical trials have actually used such schemes in practice. One of the first examples, which used a randomized play-the-winner rule, was a trial of extracorporeal membrane oxygenation to treat newborns with respiratory failure (Bartlett et al., 1985). More recent examples include a three-armed trial in untreated patients with adverse karyotype acute myeloid leukemia (Giles et al., 2003), which used BAR. The ongoing I-SPY 2 trial (Park et al., 2016; Rugo et al., 2016), which screens drugs in neoadjuvant breast cancer, also uses BAR as part of its design.

A key concern over using response-adaptive randomization, particularly from a regulatory perspective, is ensuring that the type I error rate is controlled. Indeed, draft regulatory guidance from the U.S. Food and Drug Administration (2010) includes adaptive randomization under a section entitled “Adaptive Study Designs Whose Properties Are Less Well Understood”. It then goes on to state that “particular attention should be paid to avoiding bias and controlling the Type I error rate” (Food and Drug Administration, 2010, pg. 27) when using adaptive randomization in trials.

In a multi-arm trial, multiple hypotheses are tested simultaneously by design, which leads to a multiple testing problem. To account for this, testing procedures are used that guarantee strong control of the familywise error rate (FWER), which ensures the maximum probability of making at least one type I error is controlled. For confirmatory trials in particular, demonstrating strong control of the FWER is often required by regulators (Food and Drug Administration, 2010; European Medicines Agency, 2002).

For response-adaptive trials, a rigorous proof of FWER control for a particular design is difficult given the complexities of the treatment allocation process. Hence error control has typically either been demonstrated through simulation studies, or by exploiting the asymptotic structure of the adaptive randomization procedure (Hu and Rosenberger, 2006; Zhu and Hu, 2010). However, neither method provides a guarantee of FWER control, particularly with small sample sizes. Gutjahr et al. (2011) showed how to achieve strong control of the FWER for normally distributed outcomes in a two-stage design incorporating response-adaptive randomization. However, our focus is on general response-adaptive trials, without the necessity of restricting to two stages or having a final stage of equal randomization.

In this paper, we show how to guarantee strong control of the FWER for both fully sequential and block randomized response-adaptive trials, for a large class of adaptive randomization rules. Our proposed procedure works by reweighting the usual $z$-statistic through an iterative application of the conditional invariance principle. The resulting adaptive test statistic can then be used to test the elementary null hypothesis that a treatment is not superior to the control.

The rest of the paper is organised as follows. In Section 2, we describe the proposed method for fully sequential response-adaptive trials with a fixed allocation to the control. This method is then modified for block randomized response-adaptive trials in Section 3, for both a fixed or adaptive control allocation. Simulation studies for the proposed methods are presented in Section 4, and Section 5 gives a case study based on a trial in primary hypercholesterolemia. We conclude with a discussion in Section 6. All proof details can be found in the Appendices.

2 Fully sequential response-adaptive trials

2.1 Trial setting

Suppose a trial is conducted to test $K$ experimental treatments against a common control, using the following design. A fixed total number of patients is allocated to the experimental treatments and a further fixed number to the control, with both totals specified in advance. Patients are allocated to the different experimental treatments using response-adaptive randomization, where we assume that the randomization rule does not depend on the control information. We also assume the allocation to the control is fixed; that is, the probability of assigning a patient to the control is pre-specified and constant. Maintaining allocation to the control is recommended by the Food and Drug Administration (2010), since it best maintains the power of the trial and helps address concerns about changing patient characteristics over the course of the trial.

The response-adaptive randomization for the experimental treatments starts with a burn-in period in which equal randomization is used to allocate a fixed number of patients to each of the $K$ treatments, with these numbers again fixed in advance. For each patient, both on the experimental treatments and on the control, we record the treatment allocation and the efficacy outcome. We assume the outcomes are normally distributed, with the mean response on the $k$th experimental treatment exceeding the control mean by $\theta_k$. Here $\theta_k$ represents the incremental benefit of treatment $k$ compared to the control, and is the parameter of interest. The common variance is assumed known and, without loss of generality, we set it equal to 1. Finally, we let $N_k$ denote the total number of allocations to the $k$th experimental treatment, including the burn-in period.

2.2 Hypothesis testing

The elementary null hypotheses of interest are $H_k: \theta_k \le 0$, tested against the one-sided alternatives $\theta_k > 0$. We discuss composite null hypotheses at the end of Section 2.5. One general method to control for multiple testing is to use the closure principle (Marcus et al., 1976) and consider all intersection hypotheses $H_J = \bigcap_{k \in J} H_k$, where $J \subseteq \{1, \ldots, K\}$. To strongly control the FWER, we reject an elementary null hypothesis $H_k$ if we also reject every $H_J$ with $k \in J$ using a local level-$\alpha$ test. Hence we need to define a valid level-$\alpha$ test for all the intersection hypotheses $H_J$. The naïve $z$-test for $H_J$, which does not take into account the response-adaptive randomization used in the trial, rejects if the corresponding $z$-statistic, formed by contrasting the responses on the treatments in $J$ with the control responses using the realized ('natural') sample-size weights, is greater than $z_{1-\alpha}$, the upper-$\alpha$ quantile of the standard normal distribution.
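For concreteness, the lost display for the elementary case $J = \{k\}$ presumably takes the familiar form (the symbols $\bar{Y}_k$, $\bar{X}$, $N_k$ and $n_0$ for the treatment mean, the control mean and the two sample sizes are our own labels, not necessarily the paper's):

$$Z_k \;=\; \frac{\bar{Y}_k - \bar{X}}{\sqrt{1/N_k + 1/n_0}},$$

which is standard normal under $\theta_k = 0$ when $N_k$ is fixed in advance, but not when $N_k$ depends on the accumulating responses.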

As an alternative to using the closure principle with the test statistic above, we could control the FWER by simply using a Bonferroni correction, or a step-up/step-down procedure such as the Holm procedure (a short sketch of the latter is given below). These only involve calculating test statistics for the $K$ elementary null hypotheses, rather than for all $2^K - 1$ intersection hypotheses. Hence we present the methodology assuming the closure principle will be used, with the Bonferroni and Holm procedures considered as special cases. We return to this issue in Section 4.
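For reference, a minimal sketch of the Holm step-down procedure applied to $K$ one-sided $p$-values (generic code, not tied to the paper's adaptive statistics):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure: returns a boolean rejection vector.
    It strongly controls the FWER at level alpha whenever each p-value
    is individually valid for its elementary null hypothesis."""
    p = np.asarray(pvals)
    order = np.argsort(p)                      # test the smallest p-value first
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - step):  # thresholds alpha/K, alpha/(K-1), ...
            reject[idx] = True
        else:
            break                              # stop at the first non-rejection
    return reject

print(holm([0.010, 0.030, 0.040]))  # [ True False False] at alpha = 0.05
```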

2.3 Inflation of the familywise error rate

Since the $z$-test ignores the adaptive randomization used, it is possible for the FWER to be inflated. As an example, consider the following adaptive randomization scheme for $K = 2$ treatments: all patients after the burn-in are allocated to treatment 1, unless the interim test statistic for treatment 1 crosses a pre-specified threshold, after which all subsequent patients are allocated to treatment 2. This can be viewed as implementing early stopping for efficacy for treatment 1, which is not taken into account by the naïve $z$-test.

We ran a simulation study to calculate the type I error rate under the above randomization scheme, with all true treatment means equal to the control mean, so that every null hypothesis is true. The resulting type I error rate, averaged over the simulated trials, was more than double the nominal level. We subsequently refer to allocation rules of this type as 'type I error inflator' rules (which clearly would never be used in practice).
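To illustrate the mechanism, here is a minimal Monte Carlo sketch of this kind of rule under the global null. All constants (sample sizes, the threshold c, the stylized interleaving of control accrual) are our own illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def one_trial(n_exp=75, n_ctrl=30, burn=5, c=2.0):
    """One trial under the global null (all means 0, variance 1).
    After a burn-in on both arms, every experimental patient goes to
    arm 1 until its running z-statistic versus control exceeds c
    ('early efficacy'), after which all patients go to arm 2."""
    x = rng.normal(0, 1, n_ctrl)               # control responses
    y = {1: list(rng.normal(0, 1, burn)),      # burn-in on arm 1
         2: list(rng.normal(0, 1, burn))}      # burn-in on arm 2
    switched = False
    for i in range(n_exp - 2 * burn):
        n0 = min(n_ctrl, burn + i)             # controls observed so far (stylized)
        z1 = (np.mean(y[1]) - np.mean(x[:n0])) / np.sqrt(1 / len(y[1]) + 1 / n0)
        switched = switched or (z1 > c)
        y[2 if switched else 1].append(rng.normal(0, 1))
    crit = 1.96  # z_{1-0.025}: Bonferroni for two one-sided tests at overall 5%
    rejected = [(np.mean(y[k]) - np.mean(x)) / np.sqrt(1 / len(y[k]) + 1 / n_ctrl) > crit
                for k in (1, 2)]
    return any(rejected)

fwer = np.mean([one_trial() for _ in range(10_000)])
print(f"Estimated FWER of the Bonferroni-corrected naive z-test: {fwer:.3f}")
```

The inflation comes entirely from arm 1: its sample size is frozen precisely when its interim statistic looks good, which is the repeated-testing effect that the naive final test ignores.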

2.4 Auxiliary design

Working with the actual design of the trial is difficult because the response-adaptive randomization affects the distribution of the usual $z$-test statistics. Hence for each intersection hypothesis $H_J$ we introduce a simpler design, called the auxiliary design, for which we do know the distribution. The actual trial design can then be viewed as a series of data-dependent modifications of the auxiliary design, where we account for the modifications using the conditional invariance principle. The auxiliary designs are purely hypothetical, and are only used to construct the modified tests for the actual design. Moreover, the allocations in the auxiliary designs are fixed before the start of the actual trial.

The auxiliary design for hypothesis $H_J$ is as follows. As in the actual design, a fixed total of patients is allocated to the experimental treatments and to the control. The allocations and responses for the control are the same as in the actual design. For the patients allocated to the experimental treatments, the auxiliary design starts with a burn-in period identical to that of the actual design. The subsequent allocations are given by a fixed sequence, which can be chosen arbitrarily; these allocations can be considered as a 'guess' of a likely allocation sequence of the actual trial design. One possibility would be to randomize equal numbers of patients to each treatment. The final allocation must be to one of the treatments in $J$.

We now introduce some notation for the auxiliary design, recording the total number of auxiliary allocations to each experimental treatment, both overall and among later subsets of patients. Under the auxiliary design these allocation counts are fixed, and hence under $H_J$ the usual $z$-statistic is normally distributed with mean zero and variance 1. Hence we reject $H_J$ if this statistic is greater than $z_{1-\alpha}$.

2.5 Adaptive test statistic

Adaptive designs typically rely on a common conditional invariance principle in order to control the type I error rate (Brannath et al., 2007). For the response-adaptive trial in question, we apply the conditional invariance principle sequentially, where each step considers the next patient recruited into the trial. Below we give the test statistic for testing hypothesis $H_J$ under the actual design, given that the allocation is fully sequential. The proof of Theorem 2.1 can be found in Appendix A.

Theorem 2.1.

Under $H_J$, the following test statistic is normally distributed with mean 0 and variance 1:

where the weights are given by the recursion derived in Appendix A.

Hence we reject $H_J$ if this adaptive test statistic is greater than $z_{1-\alpha}$. In Appendix B, we give some simple numerical examples of how the weights change over the course of a trial. In practice, to keep the weights as close as possible to the natural weight for as many of the control observations as possible, we recommend the weight settings used for the simulation studies in Section 4.1.

In all of the scenarios that we have investigated, the weights for the experimental treatments have been positive. Hence in these cases, the test procedure also controls the FWER for composite null hypotheses of the form $\theta_k \le \delta_k$ with $\delta_k \le 0$. To see this, note that under such a composite null the responses can be rewritten as responses from the boundary model ($\theta_k = 0$) shifted downwards by non-positive constants. Since the treatment weights are positive, the adaptive test statistic under the composite null is stochastically no larger than the corresponding statistic under the boundary case, so the level-$\alpha$ test remains valid.
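In symbols, a sketch of this monotonicity step (our own notation: $w_k \ge 0$ are the treatment weights, $\mathbf{Y}$ the observed responses and $\mathbf{Y}^{*}$ the same responses shifted to the boundary $\theta_k = 0$, assuming the statistic is linear in the responses):

$$Z_J(\mathbf{Y}) \;=\; Z_J(\mathbf{Y}^{*}) + \sum_{k \in J} w_k\,\delta_k \;\le\; Z_J(\mathbf{Y}^{*}), \qquad \text{since } w_k \ge 0 \text{ and } \delta_k \le 0,$$

so the rejection probability under the composite null is bounded by that under the boundary case.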

3 Block randomized response-adaptive trials

3.1 Trial setting

It may not be feasible or desirable to randomize patients one-by-one in a fully sequential manner. Instead one can use block randomization, where after the burn-in period, patients are adaptively randomized to the experimental treatments in blocks over a fixed number of stages. The randomization of each block depends on the data up to the previous block, as well as any external information available at the time; the zeroth block corresponds to the burn-in period, and for each block we keep track of the total number of allocations made by the end of that block. The allocation to the control is again assumed to be fixed throughout the trial.

Due to the block structure of the trial, we can relax the assumption that the randomization rule used for the experimental treatments does not depend on the control information. This is achieved by splitting up the patients allocated to the control into blocks. More explicitly, suppose that a fixed number of patients is allocated to the control during the burn-in period, and that in each subsequent block a further pre-specified number of patients is allocated to the control, with a positive number in the final block.

The response-adaptive randomization for a given block may now depend on the control information available at the end of the previous block; that is, the outcome data available from the patients allocated to the control so far. For notational convenience, we also keep track of the total number of allocations to the control by the end of each block.

To control the FWER, we can modify the approach described in Section 2 to account for the block structure. As before, we have an auxiliary design for the patients on the experimental treatments, but now each step of the process treats the actual design as a data-dependent modification of all the allocations for the patients in the corresponding block. Hence the weights for the observations within a block will be the same, and are updated block-by-block.

3.2 Auxiliary design and adaptive test statistic

The auxiliary design for an intersection hypothesis $H_J$ is the same as described in Section 2.4, except that we now impose a block structure on the auxiliary assignments to the experimental treatments. As before, the auxiliary and actual designs are identical during the burn-in period. For the auxiliary design, we again record the total number of allocations to each treatment, including the burn-in period, together with the numbers of allocations to the control and to each treatment within each block.

We apply the conditional invariance principle block-by-block, where each step considers an additional block of patients recruited into the trial. This gives the following test statistic for testing $H_J$, with a proof and the formulae for the weights given in Appendix C.

Theorem 3.1.

If the required condition on the auxiliary allocations (stated in Appendix C) holds, then under $H_J$ the following test statistic is normally distributed with mean 0 and variance 1:

Corollary 3.2.

Otherwise, a modified statistic can be defined which, under $H_J$, is normally distributed with mean 0 and variance 1:

We reject $H_J$ if the adaptive test statistic is greater than $z_{1-\alpha}$. In order to keep the weights as close as possible to the natural weight for as many of the control observations as possible, we recommend the weight settings used for the simulation studies in Section 4.2. In all of the scenarios that we have investigated, the weights for the experimental treatments have all been positive. Hence in these cases, the test procedure also controls the FWER for the composite null hypotheses, exactly as in Section 2.5.

3.3 Extension for adaptive control allocations

Thus far, we have assumed that the allocations to the control follow some fixed scheme. We now relax this assumption in the block-randomized setting. Since the form of the adaptive test statistic is similar to the one presented above, the details and the formulae for the weights are deferred to Appendix D. Note that it is possible for the procedure to fail to give a valid test statistic in this setting, as shown in Appendix E.1.

4 Simulation studies

As we have already seen in Section 2.3, using the closure principle with the usual $z$-test does not strongly control the FWER. An alternative method of control is to use the Bonferroni correction on the elementary null hypotheses. We also consider the Holm procedure, which is a step-down procedure that is uniformly more powerful than Bonferroni (Holm, 1979). An advantage of both these procedures is that only $K$ test statistics are calculated, rather than the $2^K - 1$ required when using the closure principle. This motivates also applying the Holm procedure to the $p$-values derived from the adaptive test statistics for the elementary hypotheses. More precisely, we use the adjusted $p$-values from the adaptive tests instead of the usual $p$-values derived from the $z$-test.

To distinguish between the different methods, we call our proposed procedure that uses the closure principle the 'adaptive closed test'. Similarly, applying the closure principle to the usual $z$-test gives the 'closed $z$-test'. Applying the Holm procedure to our adjusted $p$-values gives the 'Holm adaptive test', while applying the Holm procedure to the usual $p$-values gives the 'Holm $z$-test'. In our simulation studies, we compare the different methods primarily by looking at the FWER. However, another key consideration is clearly the power of the different tests. To keep the comparisons simple, and as a similar summary measure to the FWER, we present results for the disjunctive power, which is the probability of rejecting at least one false null hypothesis.
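Given simulation output, the disjunctive power is straightforward to estimate; a small sketch with placeholder inputs (the rejection indicators and effect pattern here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
# reject[i, k] = True if simulated trial i rejected hypothesis k (placeholder values)
reject = rng.random((10_000, 3)) < 0.3
false_null = np.array([True, True, False])  # suppose treatments 1 and 2 are truly effective

# disjunctive power: probability of rejecting at least one false null hypothesis
disj_power = np.mean(np.any(reject[:, false_null], axis=1))
print(round(disj_power, 3))
```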

4.1 Fully sequential randomization

We first consider a fully sequential response-adaptive trial, as presented in Section 2, with a fixed total number of patients allocated to the experimental treatments after the burn-in and a fixed number allocated to the control. In the burn-in period, five patients are allocated to each of the experimental treatments. For simplicity, we set the nominal level to $\alpha = 0.05$ and the true control mean to zero. We compare the methods under the two randomization schemes described below.

Type I error inflator: For $K = 2$ treatments, this is the same randomization scheme as presented in Section 2.3. For $K = 3$ treatments, once treatment 1 stops early for 'efficacy', we randomize subsequent patients to treatments 2 and 3 with equal probability.

BAR: The efficacy outcome for the $k$th experimental treatment follows a normal distribution, as in Section 2.1. For simplicity, we assign independent normal priors to the treatment means. After observing the efficacy outcomes for the patients recruited so far, the posterior for each $\theta_k$ follows from the standard conjugate normal update, reconstructed below.
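The original display is lost in extraction; with a prior $\theta_k \sim N(m_0, s_0^2)$ and $n_k$ unit-variance observations with sum $S_k$ (our own symbols, not necessarily the paper's), the standard conjugate result is

$$\theta_k \mid \text{data} \;\sim\; N\!\left(\frac{m_0/s_0^2 + S_k}{1/s_0^2 + n_k},\; \frac{1}{1/s_0^2 + n_k}\right).$$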

We use a BAR scheme suggested by Yin et al. (2012). For two experimental treatments, the randomization probabilities after observing each patient are defined in terms of the posterior distributions of the two treatment means. For three experimental treatments, we first obtain the average of the posterior means, and the randomization probabilities after each patient are then defined analogously relative to this average. In our simulations, for simplicity we set vague independent normal priors for the treatment means.
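As a concrete (if generic) illustration, the following sketch computes allocation probabilities proportional to the posterior probability that each arm is best, one common form of BAR; we do not claim this matches the exact tuning of Yin et al. (2012), and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def bar_probs(data_by_arm, prior_mean=0.0, prior_var=100.0, n_draws=10_000):
    """Posterior probability that each arm has the largest mean, used as
    BAR allocation probabilities. Assumes unit outcome variance and
    independent normal priors; a generic 'probability each arm is best'
    rule, not a specific published scheme."""
    draws = []
    for y in data_by_arm:
        n = len(y)
        post_prec = 1.0 / prior_var + n          # precision: prior + n observations
        post_mean = (prior_mean / prior_var + np.sum(y)) / post_prec
        draws.append(rng.normal(post_mean, np.sqrt(1.0 / post_prec), n_draws))
    best = np.argmax(np.vstack(draws), axis=0)   # which arm is best in each draw
    p = np.bincount(best, minlength=len(data_by_arm)) / n_draws
    return p / p.sum()

# usage: randomize the next patient given interim data on two arms
probs = bar_probs([rng.normal(0.3, 1, 20), rng.normal(0.0, 1, 20)])
next_arm = rng.choice(len(probs), p=probs)
```

In practice such probabilities are often raised to a tuning power and renormalized to temper the adaptation; we omit this here.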

Simulation results: Table 1 gives the results for the type I error inflator randomization scheme, while Table 2 gives the results for BAR. The auxiliary designs in all scenarios were simply random draws from a discrete uniform distribution on the set of experimental treatments.

Adaptive closed test Adaptive test (Holm) Closed z-test z-test (Holm) z-test (Bonferroni)
Parameter values Error Power Error Power Error Power Error Power Error Power
1. 3.3 - 4.7 - 4.7 - 7.0 - 7.0 -
2. , 4.8 21.7 3.7 27.5 10.3 26.5 9.9 63.6 5.0 63.5
3. - 62.4 - 52.4 - 69.9 - 61.6 - 61.6
4. 2.8 - 3.8 - 4.1 - 5.9 - 5.9 -
5. , 3.2 13.1 4.2 24.2 5.1 17.2 6.4 54.2 4.5 54.1
6. , 4.6 22.2 3.2 28.0 9.7 27.0 9.0 75.4 3.2 75.4
7. , , 4.0 19.1 2.6 24.5 9.1 23.9 7.4 58.5 3.2 58.4
8. - 51.3 - 41.7 - 57.8 - 49.7 - 49.7
Table 1: Familywise error rate and disjunctive power (%) for the type I error inflator in the fully sequential setting, based on simulated trials for each set of parameter values.
Adaptive closed test Adaptive test (Holm) Closed z-test z-test (Holm) z-test (Bonferroni)
Parameter values Error Power Error Power Error Power Error Power Error Power
1. 4.7 - 4.5 - 4.8 - 4.1 - 4.1 -
2. , 4.6 46.4 4.4 52.4 3.9 46.7 3.6 53.6 1.9 53.5
3. - 70.8 - 66.4 - 71.2 - 65.9 - 65.9
4. 3.8 - 4.1 - 4.0 - 3.8 - 3.8 -
5. , 4.4 59.9 4.2 88.7 4.3 60.1 3.8 90.6 2.6 90.6
6. , 4.8 89.8 4.7 95.1 4.0 90.1 3.9 96.0 1.3 96.0
7. , , 4.3 74.8 3.9 88.2 3.9 75.7 3.4 90.0 1.4 90.0
8. - 56.5 - 51.8 - 57.9 - 52.7 - 52.7
Table 2: Familywise error rate and disjunctive power (%) for BAR in the fully sequential setting, based on simulated trials for each set of parameter values.

Looking first at the results for the type I error inflator in Table 1, the closed $z$-test does not control the FWER in any of the scenarios where at least one null hypothesis is false, with the error rate highest in scenario 2. Applying the Holm procedure to the $z$-test does not control the FWER either, and actually increases the error rate in some scenarios (such as 1 and 4). Applying the Bonferroni correction to the $z$-test also fails to control the FWER, as can be seen in the scenarios where all null hypotheses are true. This may appear surprising at first, but the inflation occurs because the naïve $z$-test is not a valid level-$\alpha$ test for each elementary hypothesis. In contrast, both the adaptive closed test and the Holm adaptive test strongly control the FWER.

As for the power of the different methods, when at least one of the null hypotheses is true (as in scenarios 2, 5, 6 and 7), the Holm $z$-test has substantially higher power than the closed $z$-test. Indeed, the power more than doubles in all four scenarios, and even more than triples in scenario 5. This dramatic increase in power demonstrates that in these scenarios, the closed $z$-test is not very sensitive. This is because the test statistic for an intersection hypothesis will be 'diluted' by the contribution from responses belonging to the null hypotheses that are true. It is only when all of the null hypotheses are false, as in scenarios 3 and 8, that the power of the closed $z$-test is reasonable, with a slightly higher power than the Holm $z$-test.

As for the adaptive tests, the adaptive closed test has a slightly lower power than the closed $z$-test in all scenarios, with the absolute decrease ranging between the values observed in scenarios 5 and 3. However, the Holm adaptive test has a substantially lower power than the Holm $z$-test, with the latter having more than double the power. This demonstrates the high cost in terms of power that controlling the FWER can incur for this randomization scheme. We return to this issue in Section 4.3.

Turning to the BAR scheme in Table 2, this time all of the methods strongly control the FWER. All methods are slightly conservative, with the adaptive closed test generally being the closest to the nominal level. The Bonferroni-corrected $z$-test is noticeably more conservative than all the other methods, particularly when there are three treatments. In terms of disjunctive power, if at least one of the null hypotheses is true, we again see that the closed tests suffer from reduced power compared to the Holm versions. However, with BAR the loss of power is less dramatic, with the largest relative decrease in power occurring in scenario 5, and much smaller decreases in scenarios 2 and 7, for example. This time, the adaptive closed test has almost the same power as the closed $z$-test, with the largest loss occurring in scenario 8. In addition, the Holm adaptive test and Holm $z$-test now have comparable power, with the largest losses occurring in scenarios 6 and 7. This indicates that for BAR schemes, the adaptive tests do not lose out very much in terms of power.

4.2 Block randomization with a fixed control allocation

We now consider block randomized trials with a fixed control allocation, as presented in Section 3.1. We use the setup of a trial with three blocks, of sizes (40, 40, 40) for the experimental treatments and (20, 20, 20) for the control. In the burn-in period, five patients are allocated to each of the treatments, including the control. We set the true control mean to zero and again use $\alpha = 0.05$. We compare the methods under the randomization schemes below.

Type I error inflator: The allocation probabilities for each block, patient and treatment follow a block-wise analogue of the rule in Section 2.3, with the threshold constant again fixed in advance.

BAR: The efficacy outcome for the $k$th treatment follows a normal distribution, where for notational convenience the control mean is treated as one of the unknown parameters. We assign independent normal priors to the treatment means. At each stage, given the efficacy outcomes observed so far, the posterior for each mean is obtained from the standard conjugate normal update, as in Section 4.1.

We use a BAR scheme similar to the one in Wason and Trippa (2014), where the randomization probabilities for the experimental treatments at each stage are based on the posterior probabilities that each treatment has the largest mean (a block-wise version of this idea is sketched below). In our simulations, for simplicity we set vague independent normal priors for the treatment means.
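A minimal sketch of the block-wise adaptation loop, reusing the bar_probs function from the sketch in Section 4.1; the block sizes, burn-in and treatment effects here are illustrative, not the trial's:

```python
import numpy as np

rng = np.random.default_rng(2)

# assumes bar_probs from the Section 4.1 sketch is in scope
true_means = [0.5, 0.0, 0.0]                              # 3 experimental arms
data = [list(rng.normal(m, 1, 5)) for m in true_means]    # burn-in: 5 per arm
for block_size in (40, 40, 40):
    probs = bar_probs(data)                               # posterior P(arm is best)
    arms = rng.choice(len(true_means), size=block_size, p=probs)
    for a in arms:                                        # observe block outcomes
        data[a].append(rng.normal(true_means[a], 1))
print([len(d) for d in data])                             # realized sample sizes
```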

Simulation results: Table 3 gives the results for the type I error inflator randomization scheme, while Table 4 gives the results for BAR. The auxiliary designs in all scenarios were again random draws from a discrete uniform distribution on the set of experimental treatments.

Adaptive closed test Adaptive test (Holm) Closed z-test z-test (Holm) z-test (Bonferroni)
Parameter values Error Power Error Power Error Power Error Power Error Power
1. 3.8 - 4.8 - 4.6 - 6.5 - 6.5 -
2. , 4.8 22.0 3.6 26.9 8.3 25.6 7.8 61.1 4.3 61.0
3. - 92.7 - 87.9 - 94.6 - 91.7 - 91.7
4. 3.2 - 4.1 - 4.1 - 6.1 - 6.1 -
5. , 3.7 14.2 4.4 23.4 4.7 18.1 6.2 61.2 4.5 61.1
6. , 4.9 20.1 3.2 26.1 8.1 23.0 7.3 78.5 3.2 78.4
7. , , 4.7 17.7 3.0 23.8 8.0 21.1 6.7 66.2 2.8 66.2
8. - 91.3 - 83.4 - 94.0 - 89.7 - 89.7
Table 3: Familywise error rate and disjunctive power (%) for the type I error inflator, for block randomization with a fixed control allocation, based on simulated trials for each set of parameter values.
Adaptive closed test Adaptive test (Holm) Closed z-test z-test (Holm) z-test (Bonferroni)
Parameter values Error Power Error Power Error Power Error Power Error Power
1. 4.8 - 4.6 - 4.8 - 4.5 - 4.5 -
2. , 5.0 61.2 4.9 82.7 4.9 61.2 4.8 82.9 2.5 82.8
3. - 94.5 - 92.3 - 94.5 - 92.2 - 92.2
4. 3.7 - 4.5 - 3.7 - 4.2 - 4.2 -
5. , 4.4 36.1 4.6 71.8 4.3 36.0 4.4 71.8 3.0 71.7
6. , 5.0 67.3 4.6 85.6 4.8 66.8 4.4 85.4 1.6 85.4
7. , , 4.6 51.1 3.7 73.0 4.4 50.9 3.5 72.6 1.6 72.6
8. - 93.5 - 90.7 - 93.4 - 90.4 - 90.4
Table 4: Familywise error rate and disjunctive power (%) for BAR, for block randomization with a fixed control allocation, based on simulated trials for each set of parameter values.

The results are broadly similar to those for the fully sequential setting presented in Section 4.1. For the type I error inflator, we see that the closed $z$-test does not control the FWER in general (as seen in scenarios 2, 6 and 7), and neither does applying the Holm procedure to the $z$-test. The Bonferroni-corrected $z$-test has an inflated FWER when all null hypotheses are true, as in scenarios 1 and 4. In contrast, the adaptive tests strongly control the FWER in all scenarios. However, again this comes at the cost of reduced power. There is a slight reduction in power in absolute terms between the closed $z$-test and the adaptive closed test. In scenarios where at least one null hypothesis is true, the Holm $z$-test has a much higher power than the Holm adaptive test, with the power more than doubling in these scenarios, and actually tripling in scenario 6.

As for the BAR scheme, all of the methods strongly control the FWER. This time, in some scenarios the adaptive closed test essentially achieves the nominal level, as in scenarios 2 and 6. When there are three treatments, the Bonferroni-corrected $z$-test can again be overly conservative, as in scenarios 6 and 7. In contrast to the fully sequential setting, with block randomization the adaptive tests actually have the highest power out of all the methods in all scenarios except scenario 2. When at least one null hypothesis is true, the Holm adaptive test has the highest power, while when all null hypotheses are false the adaptive closed test has the highest power. The power gains are small, but demonstrate that we do not always lose out in terms of power when using the proposed adaptive tests.

Block randomization with an adaptive control allocation: In Appendix E.1, we present a simulation study considering block randomization with an adaptive control allocation, as presented in Section 3.3. The results are broadly similar to those presented above.

4.3 Summary

In summary, the simulation results show that in the randomization settings considered, our proposed adaptive tests strongly control the FWER, as would be expected from the theory. In contrast, the various $z$-tests can all fail to control the error rate, as seen in the results for the type I error inflator. However, given a more realistic randomization scheme, such as the BAR schemes we considered, the $z$-tests achieve strong familywise error control. As for disjunctive power, we see that when at least one null hypothesis is true, the closed tests suffer a very large drop in power compared to the Holm versions. This is because of the 'dilution' of the test statistic mentioned in Section 4.1. However, when all the null hypotheses are false, the closed tests have the higher power, although the gains are at most modest.

The adaptive tests can pay a large price in terms of power when compared with the $z$-tests, as seen in the results for the type I error inflator. In Appendix E.2, we give an additional simulation study with two treatments, where the randomization scheme used is simply a fixed allocation to the experimental treatments but with unequal randomization probabilities. We show that when the probability of assignment to treatment 2 is low (i.e. less than 0.2), there is a large drop in the power of the adaptive tests for testing the corresponding hypothesis. This explains what happens with the type I error inflator, where in the majority of trial realizations, apart from the unlikely event that treatment 1 stops early for 'efficacy', the probability of assignment to treatment 2 is zero by design. Hence, the type I error inflator is in fact close to a worst-case scenario for the adaptive tests.

However, most adaptive randomization schemes are unlikely to produce such extreme imbalances. Indeed, authors such as Korn and Freidlin (2011) recommend restricting the probability of arm assignment to between 0.2 and 0.8 in order to prevent extreme patient allocation (a simple way to impose this is sketched below). Hence, for 'sensible' adaptive randomization schemes with such a restriction, we would not expect a substantial loss of power when using the Holm adaptive test compared with the Holm $z$-test, particularly in the block randomized setting.
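One generic way to impose such a restriction is to clip the allocation probabilities and renormalize; a minimal sketch (the bounds follow Korn and Freidlin's suggestion, but the clip-and-renormalize recipe itself is our own illustration, not a method from the paper):

```python
import numpy as np

def clip_probs(p, lo=0.2, hi=0.8):
    """Restrict allocation probabilities to [lo, hi] and renormalize.
    After renormalization the bounds hold only approximately for more
    than two arms; iterating this step usually converges quickly."""
    p = np.clip(np.asarray(p, dtype=float), lo, hi)
    return p / p.sum()

print(clip_probs([0.05, 0.95]))  # -> [0.2 0.8]
```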

5 Case study

Finally, we illustrate our proposed methodology using an example based on a phase II placebo-controlled trial in primary hypercholesterolemia (Roth et al., 2012). The purpose of the study was to compare the effect of adding the SAR236553 antibody to high-dose or low-dose atorvastatin, as compared with high-dose atorvastatin alone. The primary outcome was the least-squares mean percent reduction from baseline of low-density lipoprotein cholesterol (LDL-C). Patients were randomly assigned, in a 1:1:1 ratio, to receive 80 mg of atorvastatin plus placebo, 10 mg of atorvastatin plus SAR236553, or 80 mg of atorvastatin plus SAR236553. For convenience, we label these interventions the 'control', 'low dose' and 'high dose' respectively.

In the trial, the observed least-squares mean (±SE) percent reduction from baseline in LDL-C was recorded for the control, the low dose and the high dose, together with the numbers of patients on the control and on the two experimental doses. For our illustrative case study, we use the observed values from the trial and assume normal distributions for the standardized mean percent reduction from baseline in LDL-C for the control, the low dose, and the high dose.

Now suppose that the trial was carried out as an adaptive block randomized trial with a fixed control allocation, as described in Section 3.1. Let the trial have three blocks, with block sizes (15, 15, 15) for the experimental treatments and (8, 8, 8) for the placebo. In the burn-in period, 7 patients are allocated to the control and 8 patients are allocated to each of the experimental doses. Hence, a total of 31 patients are on the control and 61 on the experimental treatments, as in the original trial. We use the BAR scheme of Section 4.2, with vague independent normal priors for the treatment means.

Table 5 shows the results for a simulated trial with the above parameters, where the BAR scheme allocated 13 patients to the low dose and 32 patients to the high dose after the burn-in period. This yields the natural weights used in the naïve $z$-test for the low dose and for the high dose, with the natural weight for the control fixed by design. The auxiliary design randomly assigned 44 patients to the low or high dose in a 1:1 ratio, allocating 21 patients to the low dose and 23 patients to the high dose.

Low dose High dose
z-test statistic 13.76 () 15.50 ()
Adaptive test statistic 12.21 () 16.22 ()
Natural weights , ,
Adaptive weights
Table 5: Test statistics, p-values and weights for a simulated block randomized trial using a BAR scheme.

The adaptive test statistic is slightly smaller than the $z$-test statistic for the low dose, while the converse is true for the high dose. Looking at the adaptive weights for the burn-in period and the three blocks, we see that for the low-dose test the treatment weights decrease block-by-block while the control weights increase; this pattern is reversed for the high dose. Given that all the $p$-values are very small, using either the $z$-test or the adaptive test we would conclude that adding the SAR236553 antibody to high-dose or low-dose atorvastatin leads to a statistically significant reduction in LDL-C levels.

6 Discussion

A major regulatory concern over the use of response-adaptive trials in clinical practice has been ensuring control of the type I error rate. We have proposed procedures that guarantee strong familywise error control in the following multi-armed trial settings:

  1. Fully sequential response-adaptive trials with a fixed control allocation (where the randomization rule does not depend on the control information)

  2. Block-randomized response-adaptive trials with a fixed control allocation

  3. Block-randomized response-adaptive trials including an adaptive control allocation

These procedures are applicable to a large class of response-adaptive randomization rules, particularly in settings (2) and (3) where there are no restrictions on the rule used. Hence both Bayesian and ‘optimal’ response-adaptive randomization schemes proposed in the literature can be used without adjustment, with only the final test statistic having to be modified.

In practice, to control the FWER we would recommend using the Holm adaptive test. Importantly, it has a much higher power than the adaptive closed test when at least one of the null hypotheses is true. Moreover, it only requires $K$ hypothesis tests, as compared with $2^K - 1$ for the adaptive closed test.

Our adaptive tests lead to unequal weightings of patients, which may be controversial (Burman and Sonesson, 2006). One solution is to use the so-called 'dual test', and reject a hypothesis only if both the adaptive test and the naïve $z$-test reject (Denne, 2001; Posch et al., 2003; Chen et al., 2004), although this comes at the cost of reduced power.

We have assumed that the variances of the control and experimental treatments are known. Fully accounting for unknown variances would add considerable complexity to our approach. In Appendix E.3, we show that estimating the common variance from the data does not inflate the FWER when using the Holm adaptive test, for any of the simulation scenarios considered in this paper.

Our proposed procedures are designed for normally-distributed outcomes, and it would be useful to apply our approach to binary outcomes as well. As a starting point, it may be possible to use the asymptotically normal test statistic for contrasting each treatment arm with the control (Jennison and Turnbull, 2000; Wason and Trippa, 2014), particularly in the block randomised setting.

Finally, although we did not explicitly consider it in this paper, the adaptive randomization procedures used could also incorporate covariate information, so that the allocation probabilities vary across patients with different covariates. These covariate-adjusted response-adaptive randomization schemes are particularly useful when certain characteristics of the patients may be correlated with the primary outcome (Hu and Rosenberger, 2006). A related setting would be biomarker-guided response-adaptive trials, such as I-SPY 2.

References

  • Bartlett et al. (1985) Bartlett, R. H., Roloff, D. W., Cornell, R. G., Andrews, A. F., Dillon, P. W., and Zwischenberger, J. B. (1985). Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics 76, 479–487.
  • Bello and Sabo (2016) Bello, G. A. and Sabo, R. T. (2016). Outcome-adaptive allocation with natural lead-in for three-group trials with binary outcomes. Journal of Statistical Computation and Simulation 86, 2441–2449.
  • Bergemann and Välimäki (2006) Bergemann, D. and Välimäki, J. (2006). Bandit problems. Technical report, Cowles Foundation, http://ssrn.com/abstract=877173 [accessed 1 Mar 2018].
  • Berry (2011) Berry, D. A. (2011). Adaptive clinical trials: the promise and the caution. Journal of Clinical Oncology 29, 606–609.
  • Berry (2015) Berry, D. A. (2015). Commentary on Hey and Kimmelman. Clinical Trials 12, 107–109.
  • Biswas et al. (2007) Biswas, A., Bhattacharya, R., and Zhang, L. (2007). Optimal response-adaptive designs for continuous responses in phase III trials. Biometrical Journal 49, 928–940.
  • Biswas and Bhattacharya (2016) Biswas, A. and Bhattacharya, R. (2016). Response-adaptive designs for continuous treatment responses in phase III clinical trials: A review. Statistical Methods in Medical Research 25, 81–100.
  • Brannath et al. (2007) Brannath, W., Koenig, F., and Bauer, P. (2007). Multiplicity and flexibility in clinical trials. Pharmaceutical statistics 6, 205–216.
  • Burman and Sonesson (2006) Burman, C.-F. and Sonesson, C. (2006). Are flexible designs sound? Biometrics 62, 664–9; discussion 670–83.
  • Chen et al. (2004) Chen, Y. H. J., DeMets, D. L., and Lan, K. K. G. (2004). Increasing the sample size when the unblinded interim result is promising. Statistics in Medicine 23, 1023–1038.
  • Denne (2001) Denne, J. S. (2001). Sample size recalculation using conditional power. Statistics in Medicine 20, 2645–2660.
  • Du et al. (2015) Du, Y., Wang, X., and Lee, J. J. (2015). Simulation study for evaluating the performance of response-adaptive randomization. Contemporary Clinical Trials 40, 15–25.
  • Eisele (1994) Eisele, J. R. (1994). The doubly adaptive biased coin design for sequential clinical trials. Journal of Statistical Planning and Inference 38, 249–261.
  • European Medicines Agency (2002) European Medicines Agency (2002). Points to Consider on Multiplicity Issues in Clinical Trials. London: CPMP .
  • Food and Drug Administration (2010) Food and Drug Administration (2010). Guidance for Industry: Adaptive Design Clinical Trials for Drugs and Biologics; 2010. Available from: https://www.fda.gov/downloads/drugs/guidances/ucm201790.pdf [accessed 1 Mar 2018] .
  • Giles et al. (2003) Giles, F. J., Kantarjian, H. M., Cortes, J. E., Garcia-Manero, G., Verstovsek, S., Faderl, S., et al. (2003). Adaptive randomized study of idarubicin and cytarabine versus troxacitabine and cytarabine versus troxacitabine and idarubicin in untreated patients 50 years or older with adverse karyotype acute myeloid leukemia. Journal of Clinical Oncology 21, 1722–1727.
  • Gutjahr et al. (2011) Gutjahr, G., Posch, M., and Brannath, W. (2011). Familywise error control in multi-armed response-adaptive two-stage designs. Journal of Biopharmaceutical Statistics 21, 818–830.
  • Hey and Kimmelman (2015) Hey, S. P. and Kimmelman, J. (2015). Are outcome-adaptive allocation trials ethical? Clinical Trials 12, 102–106.
  • Holm (1979) Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics pages 65–70.
  • Hu and Rosenberger (2006) Hu, F. and Rosenberger, W. F. (2006). The theory of response-adaptive randomization in clinical trials, volume 525. John Wiley & Sons.
  • Ivanova et al. (2006) Ivanova, A., Biswas, A., and Lurie, A. (2006). Response-adaptive designs for continuous outcomes. Journal of Statistical Planning and Inference 136, 1845–1852.
  • Jennison and Turnbull (2000) Jennison, C. and Turnbull, B. (2000). Group sequential methods with applications to clinical trials. Chapman-Hall/CRC, Boca Raton, FL .
  • Korn and Freidlin (2011) Korn, E. L. and Freidlin, B. (2011). Outcome–adaptive randomization: is it useful? Journal of Clinical Oncology 29, 771–776.
  • Lee et al. (2012) Lee, J. J., Chen, N., and Yin, G. (2012). Worth adapting? Revisiting the usefulness of outcome-adaptive randomization. Clinical Cancer Research 18, 4498–4507.
  • Marcus et al. (1976) Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika pages 655–660.
  • Park et al. (2016) Park, J. W., Liu, M. C., Yee, D., Yau, C., van ’t Veer, L. J., Symmans, W. F., et al. (2016). Adaptive Randomization of Neratinib in Early Breast Cancer. The New England Journal of Medicine 375, 11–22.
  • Posch et al. (2003) Posch, M., Bauer, P., and Brannath, W. (2003). Issues in designing flexible trials. Statistics in Medicine 22, 953–969.
  • Rosenberger and Hu (2004) Rosenberger, W. F. and Hu, F. (2004). Maximizing power and minimizing treatment failures in clinical trials. Clinical Trials 1, 141–147.
  • Rosenberger et al. (2001) Rosenberger, W. F., Stallard, N., Ivanova, A., Harper, C. N., and Ricks, M. L. (2001). Optimal adaptive designs for binary response trials. Biometrics 57, 909–913.
  • Roth et al. (2012) Roth, E. M., McKenney, J. M., Hanotin, C., Asset, G., and Stein, E. A. (2012). Atorvastatin with or without an antibody to PCSK9 in primary hypercholesterolemia. The New England Journal of Medicine 367, 1891–1900.
  • Rugo et al. (2016) Rugo, H. S., Olopade, O. I., DeMichele, A., Yau, C., van ’t Veer, L. J., Buxton, M. B., et al. (2016). Adaptive Randomization of Veliparib-Carboplatin Treatment in Breast Cancer. The New England Journal of Medicine 375, 23–34.
  • Scott (2010) Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26, 639–658.
  • Smith and Villar (2017) Smith, A. and Villar, S. S. (2017). Bayesian adaptive bandit-based designs using the gittins index for multi-armed trials with normally distributed endpoints. Journal of Applied Statistics doi: 10.1080/02664763.2017.1342780.
  • Thall et al. (2015) Thall, P., Fox, P., and Wathen, J. (2015). Statistical controversies in clinical research: scientific and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology 26, 1621–1628.
  • Thall and Wathen (2007) Thall, P. F. and Wathen, J. K. (2007). Practical Bayesian adaptive randomisation in clinical trials. European Journal of Cancer 43, 859–866.
  • Trippa et al. (2012) Trippa, L., Lee, E. Q., Wen, P. Y., Batchelor, T. T., Cloughesy, T., Parmigiani, G., and Alexander, B. M. (2012). Bayesian adaptive randomized trial design for patients with recurrent glioblastoma. Journal of Clinical Oncology 30, 3258–3263.
  • Tymofyeyev et al. (2007) Tymofyeyev, Y., Rosenberger, W. F., and Hu, F. (2007). Implementing optimal allocation in sequential binary response experiments. Journal of the American Statistical Association 102, 224–234.
  • Wason et al. (2014) Wason, J. M., Stecher, L., and Mander, A. P. (2014). Correcting for multiple-testing in multi-arm trials: is it necessary and is it done? Trials 15, 364.
  • Wason and Trippa (2014) Wason, J. M. S. and Trippa, L. (2014). A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials. Statistics in Medicine 33, 2206–2221.
  • Wathen and Thall (2017) Wathen, J. K. and Thall, P. F. (2017). A simulation study of outcome adaptive randomization in multi-arm clinical trials. Clinical Trials 14, 432–440.
  • Wei and Durham (1978) Wei, L. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association 73, 840–843.
  • Yin et al. (2012) Yin, G., Chen, N., and Jack Lee, J. (2012). Phase II trial design with Bayesian adaptive randomization and predictive probability. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61, 219–235.
  • Zhang and Rosenberger (2006) Zhang, L. and Rosenberger, W. F. (2006). Response-adaptive randomization for clinical trials with continuous outcomes. Biometrics 62, 562–569.
  • Zhu and Hu (2010) Zhu, H. and Hu, F. (2010). Sequential monitoring of response-adaptive randomized clinical trials. The Annals of Statistics 38, 2218–2241.

Appendix A: Derivation of the weights for familywise error control in fully sequential response-adaptive trials

Below is a diagrammatic representation of the assignments and observations for the auxiliary design compared to the actual design for the patients on the experimental treatments:

[Diagram comparing the assignments and observations under the actual design with those under the auxiliary design]