Response-adaptive randomization in clinical trials: from myths to practical considerations

05/01/2020, by David S. Robertson et al., University of Cambridge

Response-adaptive randomization (RAR) is part of a wider class of data-dependent sampling algorithms, for which clinical trials have commonly been used as a motivating application. In that context, patient allocation to treatments is defined using the accrued data on responses to alter randomization probabilities, in order to achieve different experimental goals. RAR has received abundant theoretical attention from the biostatistical literature since the 1930's and has been the subject of heated debates. Recently it has received renewed consideration from the applied community due to some successful practical examples. Most position papers on the subject present a one-sided view on its use, which is of limited value for the non-expert. This work aims to address this gap by providing a critical, balanced and updated review of methodological and practical issues to consider when debating the use of RAR in clinical trials.


1 Preface

Few topics in the biostatistical literature have been as debated as response-adaptive randomization. The controversy about its value in clinical trials has persisted over the decades, ever since it was first proposed in the early 1930s. This lack of consensus and the extreme opposing views surrounding it do not reflect well upon the biostatistical community. A clinical colleague considering it for an experiment can just as easily encounter the work of an utmost enthusiast as that of a staunch opponent, both of whom may be respected, experienced and well-known biostatisticians.

This situation is not only detrimental to the use of the technique in practice (which is perhaps unfortunate given the potential advantages it can offer), but it can also be very confusing for those to whom response-adaptive randomization is completely new. This has motivated us to write this paper, with the hope that it can become a “must read” paper for those who, like ourselves, have found some of these arguments to be partial or confusing.

We also hope this paper will open the door to research that, rather than taking the route of “discouraging” or “encouraging” the use of the technique, instead takes the more constructive path of addressing these issues with new ideas that offer most of the advantages that response-adaptive randomization brings with fewer (or none) of its downsides. We also point out open methodological questions of high importance for the use of this adaptive design in practice.

In the context of the current COVID-19 pandemic, our sincere hope is that this paper stimulates deep and critical thinking about experimental goals and how to deliver them when designing a clinical trial, rather than merely following simplistic views on a complex subject.

2 Introduction

Randomization as a method to allocate patients to treatments in a clinical trial has long been considered a defining element of a well-conducted study, ensuring comparability of treatment groups, mitigating selection bias, and providing the basis for statistical inference (Rosenberger and Lachin, 2016). In clinical trials practice, a randomization probability that is held constant throughout the trial (most often an equal probability across arms) is still the most popular randomization procedure in use. An alternative mode of patient allocation is known as response-adaptive randomization (RAR), in which randomization probabilities are altered during the trial based on the accrued data on responses, with the aim of achieving different experimental objectives while ideally (though not necessarily) preserving inferential validity. Such experimental objectives may include: selecting a promising treatment earlier among several candidates, increasing the power of a specific treatment comparison, and/or assigning more patients to a favorable arm during the trial.

Response-adaptive randomization (also known as outcome-adaptive randomization) has been a fertile area of methodological research over the past three decades, with books and many papers in top statistical journals published on the subject (as is evident from the References section of this paper). Despite this, the uptake of RAR in practice remains disproportionately slow in comparison with the theoretical attention it has received, and it continues to stand as a controversial and highly debated issue within the statistics community. These debates tend to intensify and multiply during health care crises such as the Ebola outbreak (Brittain and Proschan, 2016) or the current COVID-19 pandemic (Proschan and Evans, 2020). Unfortunately, such debates are mostly geared towards presenting arguments to justify one-sided positions around the use of RAR in clinical trials, and are usually highly technical, which makes them challenging for a non-expert to follow. None of these debates provides a fair, updated and balanced discussion of the merits and disadvantages of using this wide class of adaptive designs. Such a balanced discussion is very much needed for a non-expert to make an informed decision about its use for a specific experiment, and it is what we attempt to provide in this piece of work.

Such conflicting and one-sided views, as published in the modern literature, are illustrated by the quotations below and constitute one of the initial drivers and main motivations for writing this paper.

Outcome adaptive randomization has several undesirable properties. These include a high probability of sample size imbalance in the wrong direction … it produces inferential problems that decrease potential benefit to future patients, and may decrease benefit to patients enrolled in the trial … For randomized comparative trials to obtain confirmatory comparisons, designs with fixed randomization probabilities and group sequential decision rules appear to be preferable to RAR, scientifically, and ethically. (Thall et al., 2015b)

… optimal response-adaptive randomization designs allow implementation of complex optimal allocations in multiple-objective clinical trials and provide valid tools to inference in the end of the trial. In many instances they prove superior over traditional balanced randomization designs in terms of both statistical efficiency and ethical criteria. (Rosenberger et al., 2012)

Response adaptive randomization (RAR) is a noble attempt to increase the likelihood that patients receive better performing treatments, but it causes numerous problems that more than offset any potential benefits. We discourage the use of RAR in clinical trials. (Proschan and Evans, 2020)

The lingering lack of consensus within the specialized literature is perhaps one of the most influential reasons to explain why the use of RAR procedures remains rare in clinical trials practice. The extreme positions on the use of these methods persist, despite recent methodological developments directly addressing past criticisms and providing guidance for selecting appropriate procedures in practice. Specifically, over the last 10 years there have been additional theoretical advances which - to the best of the authors’ knowledge - are currently not included in any published review of response-adaptive methods.

In parallel to this, response-adaptive procedures have boomed in machine learning applications (Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012; Kaufmann and Garivier, 2017; Kaibel and Biemann, 2019; Lattimore and Szepesvári, 2019), where the uptake and popularity of Bayesian RAR ideas (or Thompson sampling) has been remarkably high. Their use in practice has been associated with substantial gains in system performance. In the clinical trial community, a crucial development has been the success of some well-known biomarker-led trials, such as I-SPY 2 (Rugo et al., 2016; Park et al., 2016) or BATTLE (Kim et al., 2011). The goal of these trials was to learn which subgroups (if any) benefit from a therapy and then change the randomization ratio to favour patient allocation in that direction. These trials have set new precedents and expectations which, contrary to what the ECMO trials did to RAR in the 1980s (see Section 3), are increasingly driving investigators to use RAR to enhance the efficiency and ethics of their trial designs. These trials also show that, practically and logistically, the use of RAR is clearly feasible, at least in areas such as oncology.

Both in the machine learning literature and in the recent trials cited above, the methodology used belongs to the larger family of adaptive methods (some of which are response-adaptive but not randomized). However, most of the recent general criticisms of, and praise for, response-adaptive methods in clinical trials have been driven mainly by arguments that only apply to a very specific subclass of these methods. This paper aims to contribute to the current discussion by providing an updated and broad critical review and a summary of some of our thoughts on this debate. We believe this work is timely and highly needed to support those considering RAR as a potential defining element of clinical trials to be run in the current COVID-19 pandemic.

We start by providing an updated historical overview of RAR (Section 3) and continue by summarizing the classification of the broader class of RAR procedures defined in the literature (Section 4). We then present popular beliefs published in the literature about RAR and we critically discuss each of them (Section 5). Finally, we give a brief summary of our own opinions on the future of RAR related research and some general considerations in Section 6.

3 A Historical Perspective on RAR

“Those who cannot learn from history are doomed to repeat it.” (Paraphrase of an aphorism by George Santayana)

The history of RAR is best presented by splitting it into the tale of two very distinct areas: theory and practice. While a large amount of high quality theoretical work has accumulated over the years, RAR in practice has been marked by only a few highly influential examples. We believe it is important to start our review by tracing and reporting to a non-expert the highlights of the historical development of RAR, both in theory and in practice. In particular, our view is that we should use both early controversies, such as the infamous ECMO trial (see below), and recent successes, such as I-SPY 2, to learn when and how RAR might be used appropriately, rather than dismissing or encouraging the use of any kind of RAR on the basis of a single example. Hence, in this section, we give a high level historical overview of RAR in terms of key methodology and its use in practice for clinical trials, with a timeline shown in Figure 1. We also include some more recent developments, which have not been included in previously published historical overviews.

3.1 RAR methodology literature

The origins of response-adaptive procedures can be traced back to Thompson (1933), who first suggested allocating patients to the more effective treatment arm via a posterior probability computed using interim data. This work motivated procedures (commonly known as Thompson sampling) which are used to allocate resources in many modern application areas. Another early method was the play-the-winner rule, proposed by Robbins (1952) and then Zelen (1969). In this non-randomized rule, a success on one treatment leads to the subsequent patient being assigned to that treatment, while a failure leads to the subsequent patient being assigned to the other treatment.

RAR also has roots in the methodology for sequential stopping problems (where the sample size is random, but with a stopping boundary), as well as bandit problems (where resources are allocated to maximize the expected reward). Since most of the work in these areas has been non-randomized (i.e. deterministic), we do not describe their development here. For the interested reader, Rosenberger and Lachin (2016, Section 10.2) give a brief summary of the history of both of these areas, and an overview of multi-armed bandit models is presented in the review paper of Villar et al. (2015a). For a review of non-randomized algorithms for the two-arm bandit problem, see Jacko (2019).

One of the first explicit uses of randomization in a response-adaptive treatment allocation procedure was the randomized play-the-winner (RPW) rule proposed by Wei (1978). The RPW rule can be described as an urn model: each treatment assignment is made by drawing a ball from an urn (with replacement), where the composition of the urn is updated based on the patient responses. In the following decades, many RAR designs based on urn models were proposed, with a particular focus on generalizing the RPW rule. We refer the reader to Hu and Rosenberger (2006, Chapter 4) and Rosenberger and Lachin (2016, Section 10.5) for a detailed description. One example was the urn model proposed by Ivanova (2003), called the drop-the-loser rule.

These urn-based RAR procedures are intuitive, but are not optimal in a formal sense. However, from the early 2000s another perspective on RAR emerged that was based on optimal allocation targets, which are derived as the solution to a formal optimization problem. For two-arm trials, a general optimization approach was proposed by Jennison and Turnbull (2000) for normally-distributed outcomes, and helped lead to the development of a whole class of optimal RAR designs. One early and well-known example is the work of Rosenberger et al. (2001) for trials with binary outcomes. Further examples of optimal RAR designs can be found in Section 5.1. In order to achieve the desired optimal allocation targets, a key development was the modification by Hu and Zhang (2004) of the doubly-adaptive biased coin design (DBCD), which was originally described by Eisele (1994). Subsequent theoretical work by Hu and Rosenberger (2006) focused on asymptotically best RAR procedures, which led to the development of the class of efficient response-adaptive randomization designs (ERADE) proposed by Hu et al. (2009).

All of the RAR procedures described above are myopic, in the sense that they only use past observations to determine the treatment allocation for the next patient, without considering the future patients to be treated and the information they could provide. A more recent methodological development has been the proposal of non-myopic or forward-looking RAR procedures, which are based on solutions to the multi-armed bandit problem. The first such fully randomized procedure was proposed by Villar et al. (2015b) for trials with binary responses, with subsequent work by Williamson et al. (2017) accounting for an explicit finite time-horizon. Even more recently, forward-looking RAR procedures have been proposed for trials with normally-distributed outcomes as well (Williamson and Villar, 2020).

3.2 RAR in clinical practice

One of the earliest uses of RAR in clinical practice was the extracorporeal membrane oxygenation (ECMO) trial in neonatal respiratory failure, performed in Michigan by Bartlett (1985). This trial used the RPW rule in a study of critically ill babies randomized either to ECMO or to conventional treatment. In total, 12 patients were observed: 1 in the control group, who died, and 11 in the ECMO group, who all survived. The ECMO trial has been the focus of much debate. To date, it alone has accrued 198 citations, and subsequent discussions of the ECMO trial, such as the one presented by Donald Berry that same year, have been influential in their own right, with Berry’s discussion itself being cited 30 times. Indeed, to this day the ECMO trial is regarded as a key example against the use of RAR in clinical practice, due to the trial’s extreme treatment imbalance and highly controversial interpretation (Rosenberger and Lachin, 1993; Burton et al., 1997). From the ethical point of view, Berry comments: “In the case of ECMO, there was a substantial amount of historical data that, in my view, not only carry more weight than the Ware study, but suggest that randomizing patients to a non-ECMO therapy as in the Ware study was unethical.” In large part due to the controversy around the ECMO trial, there was little use of RAR in clinical trials in the subsequent 20 years. One exception was the Fluoxetine trial (Tamura et al., 1994), which again used the RPW rule, but with a burn-in period to ensure that there would not be too few controls. However, more recently there have been several high-profile clinical trials that use Bayesian RAR, in the spirit of Thompson (1933).

Two important examples in oncology are the BATTLE trials and the I-SPY 2 trial. The BATTLE trials (Zhou, 2008; Kim et al., 2011; Papadimitrakopoulou, 2016) used RAR based on a Bayesian hierarchical model, where the randomization probabilities are proportional to the observed efficacy based on the patient’s individual biomarker profiles. Similarly, the I-SPY 2 trial (Barker, 2009; Carey, 2016; Rugo et al., 2016; Park et al., 2016) used RAR based on Bayesian posterior probabilities, which are specific to different biomarker signatures. These oncology trials have generated valuable discussions about the benefits and drawbacks in using RAR in clinical trials (Das, 2017; Korn, 2017; Marchenko, 2014; Siu, 2017). We discuss some of these further in Section 5.

[Figure 1 timeline: 1933, Thompson sampling; 1978, RPW rule; 1985, ECMO trial; 1994, Fluoxetine trial; 2000, J&T optimization approach; 2001, RSIHR paper; 2004, DBCD; 2008, BATTLE trial; 2009, ERADE; 2010, I-SPY 2 trial; 2015, Non-myopic RAR.]

Figure 1: Timeline summarizing some of the key developments around the use of RAR in clinical trials. J&T = Jennison and Turnbull (2000), RSIHR = Rosenberger et al. (2001).

4 A Taxonomy of RAR

From the historical perspective provided in Section 3, we note that the adoption of the RPW rule in the ECMO trial has largely been used as the quintessential example to cite against the use of all RAR procedures in clinical trials. In reality, the RPW rule is just one very specific RAR procedure among the many possible for a clinical trial. Many critical papers have overlooked this fact when criticizing its use and hence, perhaps unintentionally, depreciated the value of many other RAR procedures that are markedly different from the RPW rule.

In this section, we aim to correct this omission by providing meaningful classification criteria and clarity around how to assess the existing RAR procedures in the literature. We hope this section will provide a basis for readers to compare the myriad of existing approaches when considering their use for a specific application at hand.

4.1 How many different types of RAR procedures are there?

This is a ubiquitous and daunting question a non-expert may be faced with when reading through the RAR literature. Experts use differing criteria and jargon to classify and describe RAR procedures, which can quickly become confusing. Starting with adaptive randomization procedures in general, Hu and Rosenberger (2006) classify these in terms of the data that are used to determine the allocation probabilities: response-adaptive randomization (RAR), covariate-adaptive randomization (CAR), and covariate-adjusted response-adaptive randomization (CARA). The data used are reflected in the names of the procedures. Hence the allocation probabilities are determined using the accrued information on the response variable and/or covariate(s), in light of the objectives of using a particular randomization procedure.

In the rest of this review, we focus specifically on RAR procedures. Of course, many of the issues we subsequently discuss for RAR are applicable to some degree to CARA, but this is beyond the scope of this paper. For a further discussion of using covariates in randomization, we refer the reader to the comprehensive review paper by Rosenberger and Sverdlov (2008).

To achieve the objectives of a RAR procedure, two common approaches are often considered in the literature (see Hu and Zhang (2004) for a similar classification):

  1. Construct an optimal allocation target, where a specific criterion is optimized based on a population response model.

  2. Define procedures for determining the allocation probabilities that are not optimal in the formal sense of (1), but which may have an intuitive motivation.

Below are examples of RAR procedures that belong to the two classes:

E.g. The optimal allocation of Rosenberger et al. (2001) for binary responses belongs to class (1). A formal optimization problem is defined based on the population response model and the inference at the end of the trial: the power of the trial (using a Z-test for the difference in proportions) is fixed and the expected number of treatment failures is minimized.

E.g. The Randomized Play-the-Winner rule for binary responses belongs to class (2). It skews the allocation probability in favor of the effective treatment, where effectiveness is inferred from the accrued data during the trial. The rules for computing and choosing the allocation probability are formulated using an intuitive approach.


However, this distinction between the two classes is not absolute, since some RAR procedures are ‘nearly’ optimal, as we now discuss. One key example is Bayesian RAR by Thall and Wathen (2007) (see Section 5.4), which includes Thompson sampling as a special case. This procedure sequentially computes the allocation probabilities based on the posterior distribution of the parameter of interest, which can be viewed as having the intuitive aim of assigning more patients to the effective arm. Thompson sampling is also asymptotically optimal in terms of minimizing cumulative regret (Kaufmann et al., 2012). Hence, in finite samples the Thall and Wathen procedure belongs to class (2), but asymptotically it belongs to class (1). Another example is the forward-looking Gittins index (FLGI) rule, which trades off a small deviation in optimality (in terms of expected total reward) to give a fully randomized RAR procedure with good patient benefit properties (Villar et al., 2015a). Hence, strictly speaking the FLGI rule belongs to class (2) as it is not exactly optimal, but it is ‘nearly’ optimal and hence ‘nearly’ in class (1).

In general, a RAR procedure depends on unknown parameter(s), which affect either the construction of the optimization problem or the computation of the allocation probabilities. When this problem arises, for example in trials with a binary outcome where the variance of the outcome depends on the unknown parameter, a Bayesian approach could be employed at the design stage of the trial. Hence, one may also consider classifying RAR procedures based on the school of statistics used for their design: frequentist or Bayesian. We suggest the use of the following definitions:

A randomization procedure can be classified as Bayesian when a prior distribution is incorporated into the design criteria/optimization problem and/or into the calculation of the allocation probability.

E.g. the optimal RAR design proposed by Cheng and Berry (2007) is based on a Bayesian decision-analytic approach; Sabo (2014) proposes the use of decreasingly informative priors for Bayesian RAR.

Meanwhile, a randomization procedure can be classified as frequentist when no prior distribution is incorporated into the design problem/ optimization problem and/or a frequentist approach is used for estimating the unknown parameter(s) in the allocation probability.

E.g. the generalized biased coin design by Smith (1984) and Neyman allocation (see Section 5.1).

We note that in some cases a frequentist approach corresponds to a Bayesian approach with a specific prior distribution, e.g. the Randomized Play-the-Winner rule (Atkinson and Biswas, 2014, pg. 271). This is analogous to the situation where the posterior mode coincides with the maximum likelihood estimator when a uniform prior is used.

RAR procedures could alternatively be classified as frequentist or Bayesian with respect to the inference procedure used. In our opinion, this approach may not be helpful for someone who is not familiar with the literature on RAR, since the inference procedure greatly depends on the goal of the trial, and is very much influenced by regulators’ preference between these two approaches. Readers interested in understanding the pros and cons of frequentist and Bayesian inference are referred to materials such as Press (2005); Wagenmakers et al. (2008); Samaniego (2010), as this falls outside the scope of our review.

Another way in which RAR procedures differ is in the goal they are designed to achieve. Some consider competing objectives, such as both efficiency gain and patient benefit, while others prioritize one over the other. Additionally, some procedures may be non-myopic (i.e. allowing the sequential update of the allocation probability to account for future patients) while others are myopic. We also note that for some RAR procedures, such as the model-based optimal allocation procedures, the optimization problem can account for multiple objectives (see e.g. Hu et al. (2015)).

We encourage the reader to consider what these terminologies mean in terms of their specific experiment before trying to use the above classifications of RAR procedures for the design of their investigation. For instance, while efficiency can correspond to the power of a trial (see the section below), it can also correspond to the precision of the estimates for the treatment effectiveness (see for example Flournoy et al. (2013)). Similarly, patient benefit can mean assigning patients to a more effective treatment arm based on the accrued data, or it could also include the benefit for patients who are outside the trial (but who could potentially benefit from the trial results). These are important caveats that can have a large impact on the choice of a RAR procedure and on the decision of whether to use it or not. We discuss this point in more detail below.

4.2 How many different ways to assess RAR procedures are there?

Another area where there has been a lack of consistency and clarity is how RAR procedures have been assessed. This section aims to clarify the basis for evaluating RAR procedures, rather than listing all the possible metrics available for comparing the myriad variants of RAR. An important caveat when comparing two published results using RAR is that different metrics could have been used, and this needs to be factored into the conclusions of the comparison. This leads us to emphasize that it is extremely important to report not only the details of the procedures used and how they have been defined, but also to carefully describe the specification of the metrics used in their evaluation, to allow for adequate comparisons to be made. We also encourage investigators to emphasize when their findings may not apply to all the different RAR classes. This should reduce the chances of readers misunderstanding the scope of the pros and cons of a class of RAR procedures.

In Section 5, we aim to identify and clarify some common misconceptions around RAR. In particular, when addressing them we shall use the following metrics: power, type I error, and bias. In the RAR literature, we find that ‘power’ is often perceived as a frequentist property, i.e. the probability of rejecting a null hypothesis when the true parameter follows the alternative distribution. Similarly, the ‘type I error’ is defined as the probability of rejecting a null hypothesis when the true parameter follows the null distribution. Moving away from the frequentist definitions, some authors define power (respectively, type I error) as the probability of satisfying a criterion that reflects the goal of treatment comparisons. Typically, this is found using simulation studies under the alternative (respectively, null) scenarios.

For multi-arm trial settings, power can reflect the goal of the trial, such as selecting the best experimental treatment or declaring at least one experimental treatment effective, with the false positive rate or familywise error rate as generalizations of the type I error of the two-arm setting. The ‘power’ of a multi-arm trial can have multiple definitions, and the definition used needs to be clearly stated when reporting results. For example, pairwise power, marginal power, experiment-wise power and disjunctive power are all used as definitions of ‘power’. However, it is possible for a RAR procedure to have a high power according to one definition but not according to another.

On the other hand, ‘bias’ is often defined as a property of an estimator, reflecting how different the estimate is (on average) from the true underlying parameter. An estimate may be biased due to the rule used for calculating it, or due to heterogeneity in the observed data. Heterogeneity corresponds to the situation where the data may not come from the same underlying distribution. A key example is the presence of time trends, which can cause a different treatment effect for patients who were enrolled at different time points during the trial. We return to this issue in Section 5.3.

Apart from the properties described above, some authors report the following when presenting their simulation results:

  • the expected number of treatment failures/successes (for binary outcomes) or the expected total response (for continuous outcomes) in the trial

  • the probability of selecting a truly effective experimental arm

  • summary statistics of the sample size per treatment arm, with a particular focus on the allocation to a superior arm (where this exists)

  • the probability of a treatment arm stopping early for futility or for efficacy

  • the probability of sample size imbalance (see Section 5.4)

Unfortunately, there is no perfect randomization procedure, in the sense of one that is superior to all others in terms of every one of these operating characteristics. This fact makes careful consideration of which randomization approach is best suited to a specific clinical trial all the more important. A RAR procedure should be chosen carefully according to the specific context and goals of a trial, in light of the practical challenges and constraints that implementing RAR poses. We discuss some practical issues when implementing a RAR procedure in Section 5.5.

5 Popular beliefs about RAR

In this section, we critically examine a number of popular beliefs that have been published on RAR. Some of these beliefs can be rationally justified in particular scenarios, but they most certainly do not apply to all types of RAR procedures and/or all trial settings. Our aim is to reexamine some of these statements to provide a more balanced view of the use of RAR procedures, which acknowledges potential problems and disadvantages, but also emphasizes the potential solutions and advantages. By avoiding extreme views and generalizations (in either direction), we hope to provide some clarity around the use of RAR in practice and avoid the creation of myths around it.

5.1 Does the use of RAR reduce statistical power?

One of the most popular beliefs about RAR procedures is that their use can reduce statistical power, as stated in Thall et al. (2015b):

Compared with an equally randomized design, outcome AR …[has] smaller power to detect treatment differences.

Similar statements can be found in Korn and Freidlin (2011a) and Thall et al. (2015a). Through simulation studies these papers show that a fixed randomized design can have a higher power than one using a particular RAR procedure (see below) when the sample sizes are the same, or equivalently that a larger sample size is required for RAR to achieve the same target power and type I error rate as an equally randomized design.

A common feature of these papers is that they only consider the Bayesian RAR procedure proposed by Thall and Wathen (2007) (see Section 5.4 for a formal definition), which includes Thompson sampling as a special case. The Thall and Wathen procedure is well-known, and is sometimes referred to as ‘Bayesian adaptive randomization’ (BAR) without qualification. However, although the Thall and Wathen procedure attempts to assign more patients to the better treatment while preserving power, this is only established in an intuitive way. Hence, it should not be a surprise that there are trial designs where using the procedure results in a lower power compared with using equal randomization. Extending this conclusion to RAR in general is, however, an over-generalization.

In what follows, we focus solely on power considerations, but it is important to note that in practice, maximizing power might not be the only trial objective. Also, for now we assume the use of standard inferential tests to make power comparisons, which we return to in Section 5.2.

Two-arm trials

As discussed in Section 4.1, there are RAR procedures that formally target an optimality criterion reflecting the trial’s objectives, including power. As a concrete example, consider the simple trial setting of two treatments with a binary outcome as described in Rosenberger and Hu (2004). Let p_A and p_B denote the true probabilities of success for patients on treatments A and B respectively, with q_A = 1 - p_A and q_B = 1 - p_B. Let n_A and n_B denote the number of patients assigned to treatments A and B respectively, with n = n_A + n_B denoting the total sample size. We consider the usual Z-test for the difference in proportions.

One strategy is to fix the power of the trial and find the allocation ratio n_A/n_B that minimizes the total sample size n, or equivalently to fix the total sample size n and find the allocation ratio that maximizes the power. This gives the following allocation ratio:

n_A/n_B = √(p_A q_A) / √(p_B q_B),

which is known as Neyman allocation. This result illustrates that, in general, it is not true that equal allocation in a trial maximizes the power for a given sample size. This is a popular belief that appears without qualification in many papers, such as the following (published in the BMJ):

Most randomized trials allocate equal numbers of patients to experimental and control groups. This is the most statistically efficient randomization ratio as it maximizes statistical power for a given total sample size. (Torgerson and Campbell, 2000)

Such a statement is only true in specific settings, such as when comparing the difference in means of two normally-distributed outcomes with the same known variance.

An issue with Neyman allocation is that if p_A + p_B > 1, then more patients will be assigned to the treatment with the smaller success probability. This is clearly an ethical problem, and again highlights the potential trade-off between power and patient benefit. Rosenberger et al. (2001) resolve this by modifying the optimization problem. The solution gives an optimal allocation that minimizes the expected number of treatment failures (ENF) given a fixed power, or equivalently fixes the ENF and maximizes the power. If the response probabilities of the treatment arms were known, then using this optimal allocation ratio would guarantee that on average the power of the trial is preserved.
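To make these allocation targets concrete, the following minimal sketch (in Python, with hypothetical success probabilities chosen only for illustration) computes the Neyman allocation and the Rosenberger et al. (2001) optimal allocation, and shows how Neyman allocation can favor the inferior arm when p_A + p_B > 1.

```python
import math

def neyman_allocation(p_a: float, p_b: float) -> float:
    """Target proportion of patients on arm A under Neyman allocation,
    n_A/n_B = sqrt(p_A*q_A)/sqrt(p_B*q_B), which maximizes the power of
    the Z-test for a difference in proportions at a fixed sample size."""
    w_a = math.sqrt(p_a * (1 - p_a))
    w_b = math.sqrt(p_b * (1 - p_b))
    return w_a / (w_a + w_b)

def rsihr_allocation(p_a: float, p_b: float) -> float:
    """Target proportion on arm A under the Rosenberger et al. (2001)
    allocation, n_A/n_B = sqrt(p_A)/sqrt(p_B), which minimizes the
    expected number of failures for a fixed power."""
    return math.sqrt(p_a) / (math.sqrt(p_a) + math.sqrt(p_b))

# Hypothetical example with p_A + p_B > 1: arm B is superior, yet Neyman
# allocation puts the majority of patients on the inferior arm A.
p_a, p_b = 0.6, 0.9
print(f"Neyman proportion on arm A: {neyman_allocation(p_a, p_b):.3f}")  # ~0.62
print(f"RSIHR  proportion on arm A: {rsihr_allocation(p_a, p_b):.3f}")   # ~0.45
```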

For binomial outcomes (as well as survival outcomes), the model parameters in the optimization problem are unknown and need to be estimated from the accrued data. These estimates can then be used (for example) in the doubly-adaptive biased coin design (DBCD) (Hu and Zhang, 2004), or the efficient response adaptive randomization designs (ERADE) (Hu et al., 2009) to target the optimal allocation above. Using the DBCD in this manner, Rosenberger and Hu (2004) found in their simulation studies that it was

…as powerful or slightly more powerful than complete randomization in every case and expected treatment failures were always less

This is consistent with a general set of guidelines given by Hu and Rosenberger (2006) on which RAR procedures should be used in a clinical trial, one of which is that power should be preserved. When following these guidelines, Rosenberger et al. (2012) states that

Response-adaptive randomization should ensure that the expected number of treatment failures is reduced over standard randomization procedures, and that power should be slightly enhanced or maintained.

RAR procedures that achieve this aim have also been derived (in a similar spirit to the optimal allocation above) for continuous (Zhang and Rosenberger, 2006) and survival (Zhang and Rosenberger, 2007) outcomes.
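As an illustration of how such a target might be pursued sequentially, the sketch below combines the DBCD allocation function of Hu and Zhang (2004) with the Rosenberger et al. (2001) target. The tuning parameter gamma, the burn-in length and the shrinkage used when estimating the success probabilities are arbitrary illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(2020)

def dbcd_prob_a(x: float, rho: float, gamma: float = 2.0) -> float:
    """DBCD allocation function of Hu and Zhang (2004): probability that
    the next patient is assigned to arm A, given the current proportion x
    of patients on arm A and the estimated target proportion rho."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    w_a = rho * (rho / x) ** gamma
    w_b = (1.0 - rho) * ((1.0 - rho) / (1.0 - x)) ** gamma
    return w_a / (w_a + w_b)

def simulate_dbcd_trial(p_a: float, p_b: float, n: int, burn_in: int = 20):
    """One two-arm trial with binary outcomes, targeting the Rosenberger
    et al. (2001) allocation sqrt(p_A)/(sqrt(p_A) + sqrt(p_B))."""
    count = {"A": 0, "B": 0}
    succ = {"A": 0, "B": 0}
    for i in range(n):
        if i < burn_in:
            prob_a = 0.5                          # equal-allocation burn-in
        else:
            # shrunken estimates keep the target away from 0 and 1
            est_a = (succ["A"] + 0.5) / (count["A"] + 1.0)
            est_b = (succ["B"] + 0.5) / (count["B"] + 1.0)
            rho = np.sqrt(est_a) / (np.sqrt(est_a) + np.sqrt(est_b))
            prob_a = dbcd_prob_a(count["A"] / i, rho)
        arm = "A" if rng.random() < prob_a else "B"
        count[arm] += 1
        succ[arm] += rng.random() < (p_a if arm == "A" else p_b)
    return count, succ

count, succ = simulate_dbcd_trial(p_a=0.6, p_b=0.9, n=200)
print("Patients per arm:", count, "successes per arm:", succ)
```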

In summary, we have seen that targeting the Neyman allocation leads to a higher power than equal randomization for binary and survival outcomes, and can be implemented in practice using a sequential RAR approach such as the DBCD or ERADE. Targeting the optimal allocation leads to the same (or slightly greater) power than equal randomization, but is potentially more ethically attractive than targeting the Neyman allocation. In both cases, there is no power loss compared with fixed randomization, even within the two-arm trial setting.

Multi-arm trials

There are similar concerns about the reduction in power of multi-arm RAR procedures. For example, Wathen and Thall (2017) simulate a variety of five-arm trial scenarios and conclude

In multi-arm trials, compared to equal randomization, several commonly used adaptive randomization methods give much lower probability of selecting superior treatments.

Similarly, Korn and Freidlin (2011b) simulate a four-arm trial and find that a larger average sample size is required when using a RAR procedure compared with fixed 1-1-1-1 randomization in order to achieve the same pairwise power. Lee et al. (2012) reach similar conclusions in the three-arm setting when considering disjunctive power. However, again these papers only explore generalizations of the Thall and Wathen procedure (the “commonly used adaptive randomization methods” quoted above) for multi-arm trials, and these conclusions may not hold for other types of RAR procedures.

The optimal allocation described above for the two-arm setting can be generalized to multi-arm trials. The allocation is optimal in that it fixes the power of the test of homogeneity and minimizes the ENF. This was first derived by Tymofyeyev et al. (2007), who showed through simulation that for three treatment arms, using the DBCD to target the optimal allocation

…provides increases in power along the lines of 2–4% [in absolute terms]. The increase in power contradicts the conclusions of other authors who have explored other randomization procedures [for two-arm trials]

Similar conclusions (for three treatment arms) are given by Jeon and Hu (2010), Sverdlov and Rosenberger (2013a) and Bello and Sabo (2016).

These optimal allocation procedures maintain (or increase) the power of the test of homogeneity, but may have low marginal powers compared with equal randomization in some scenarios, as shown in Villar et al. (2015b). However, even considering the marginal power to reject the null hypothesis for the best treatment, Villar et al. (2015b) propose non-myopic RAR procedures that in some scenarios have both a higher marginal power and a higher expected number of treatment successes when compared with equal randomization with the same sample size.

Finally, many of the power comparisons made throughout this section have been against equal or fixed randomization. Arguably a more interesting comparison would be to consider group-sequential (GS) and multi-arm multi-stage (MAMS) designs with a fixed allocation for each treatment in each stage. It is still unclear as to how the power of different RAR procedures compare with well-chosen GS and MAMS designs. Although only focusing on RAR based on the Thall and Wathen procedure, both Wason and Trippa (2014) and Lin and Bunn (2017) show that these RAR procedures can have a higher power than MAMS designs when there is a single effective treatment.

Summary

In conclusion, if the aim is to maintain (or even increase) power compared to an equally randomized design, in many trial scenarios this can be achieved using some kinds of RAR procedures, while still reducing (or maintaining) the number of treatment failures. However, the choice of the RAR procedure is crucial, and needs to be made with the objectives of the trial in mind. Of course the power of the trial is not the only consideration, and sometimes (such as in the rare disease setting) the patient benefit properties may be much more important. Even when maintaining power is a key concern, this need not by itself be a rationale for using equal randomization instead of RAR.

5.2 Can robust statistical inference be performed after using RAR?

As noted in Proschan and Evans (2020), the Bayesian approach to statistical inference allows the seamless analysis of results of a trial that uses RAR. However, when using frequentist inference, challenges can occur:

The frequentist approach faces great difficulties in the setting of RAR …Use of response-adaptive randomization eliminates the great majority of standard analysis methods …

Rosenberger and Lachin (2016) also note that using RAR in a trial can make the subsequent (frequentist) statistical inference more challenging:

Inference for response-adaptive randomization is very complicated because both the treatment assignments and responses are correlated.

This raises a key question: how does an investigator analyze a trial using RAR when using frequentist inference? In particular, can standard statistical tests and regression techniques be used without inflating the type I error rate? And are standard estimators of the treatment effect(s) biased? Without clear answers to these questions, it is unsurprising that the challenge of statistical inference (within the frequentist framework) is still seen as a key barrier to the use of RAR in clinical practice. In this section, we aim to show that valid statistical inference, especially in terms of type I error rate control and unbiased estimation, is possible for a wide variety of RAR procedures. Please note that in what follows, we do not consider the issue of time trends and patient drift, a separate discussion of which is given in Section 5.3.

Perhaps the most straightforward approach to inference following a trial using RAR is to simply use standard statistical tests and estimators without adjustment, in contrast to the quotation above from Proschan and Evans (2020). The justification is that the asymptotic properties of standard estimators and tests are preserved for a large class of RAR procedures, including those for the multi-arm setting. Firstly, Melfi and Page (2000) proved that any estimator that is consistent when responses are independent and identically distributed will also be consistent for any RAR procedure (under the assumption that the number of observations for each treatment tends to infinity). Secondly, Hu and Rosenberger (2006) showed that when the responses follow an exponential family, simple conditions on the RAR procedure ensure the asymptotic normality of the maximum likelihood estimator (MLE). The basic condition is that the allocation proportions for each treatment arm converge in probability to constants in (0, 1), which also implies that the RAR procedure does not make a ‘choice’ or select a treatment during the trial (and hence the sample size in each arm can tend to infinity). Since many test statistics are just functions of the MLE, this result implies that the asymptotic null distribution of such test statistics is not affected by the RAR. Further asymptotic results for urn-based procedures are given in Hu and Rosenberger (2006) and Zhang et al. (2011).

These asymptotic results are the justification for the first guideline given by Hu and Rosenberger (2006) on RAR procedures, which states that

Standard inferential tests can be used at the conclusion of the trial.

Of course, relying on asymptotic results to use standard tests and estimators may not be valid for trials without a sufficiently large sample size, and the effect of a smaller sample size on inference is greater for more aggressive RAR procedures (see for example the results in Williamson and Villar (2020)). As noted by Rosenberger et al. (2012), for some RAR procedures in the two-arm trial setting, there has been an extensive literature investigating the accuracy of the large sample approximations under moderate sample sizes using simulation (Hu and Rosenberger, 2003; Rosenberger and Hu, 2004; Zhang and Rosenberger, 2006; Duan and Hu, 2009). These papers showed that the large sample approximations are already accurate at moderate sample sizes for both the DBCD and urn models. For these procedures, Gu and Lee (2010) explored which asymptotic test statistic to use for a clinical trial with a small to medium sample size and binary responses.

If the asymptotic results above cannot be used, either because of small sample sizes or because the conditions on the RAR procedures are not met, then alternative small sample methods for testing and estimation have been proposed. We summarize the main methods below, concentrating on type I error rate control and unbiased estimation.

Type I error rate

One common method for controlling the type I error rate, particularly for Bayesian RAR procedures, is a simulation-based calibration approach. Given a trial design that incorporates RAR and an analysis strategy (e.g. a test statistic and stopping boundary), a large number of trials are simulated under the null hypothesis. Applying the analysis strategy to each of these simulated trial realizations gives a Monte Carlo approximation of the relevant type I error rates. If necessary, the analysis strategy (e.g. the stopping boundaries) can then be adjusted to satisfy the type I error constraints. Variations of this approach have been used in Wason and Trippa (2014); Wathen and Thall (2017); Zhang et al. (2019) for example, all in the context of calibrating multi-arm Bayesian RAR procedures to have correct type I error control. Applying this approach can be computationally intensive however.
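As a schematic of this calibration idea, the sketch below uses Monte Carlo simulation under the null to calibrate the critical value of the usual Z-test for a two-arm trial allocated by Thompson sampling; the allocation rule, burn-in, sample size and number of simulations are placeholder choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_null_z(p_null: float = 0.3, n: int = 200, burn_in: int = 20) -> float:
    """One two-arm trial under the null (equal success probabilities),
    allocated by Thompson sampling with Beta(1, 1) priors; returns the
    usual Z-statistic for the difference in proportions."""
    succ = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for i in range(n):
        if i < burn_in:
            arm = "A" if i % 2 == 0 else "B"      # alternating burn-in
        else:
            # one posterior draw per arm; allocate to the larger draw
            draw_a = rng.beta(1 + succ["A"], 1 + count["A"] - succ["A"])
            draw_b = rng.beta(1 + succ["B"], 1 + count["B"] - succ["B"])
            arm = "A" if draw_a > draw_b else "B"
        count[arm] += 1
        succ[arm] += rng.random() < p_null
    p_a_hat, p_b_hat = succ["A"] / count["A"], succ["B"] / count["B"]
    p_pool = (succ["A"] + succ["B"]) / n
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / count["A"] + 1 / count["B"]))
    return (p_a_hat - p_b_hat) / se

# Monte Carlo calibration: take the critical value to be the empirical
# 97.5th percentile of |Z| under the null, rather than the nominal 1.96.
z_null = np.array([abs(simulate_null_z()) for _ in range(5000)])
print("Calibrated two-sided critical value:", round(float(np.quantile(z_null, 0.975)), 3))
```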

A related approach is to use a re-randomization test, also known as randomization-based inference. In such a test, the observed outcome data are treated as fixed, but the randomization sequence is regenerated many times using the RAR procedure. For each replicate, the test statistic is recalculated, and a consistent estimator of the p-value is given by the proportion of randomization sequences that give a test statistic as extreme as (or more extreme than) the one observed. Intuitively, this is valid because under the null hypothesis of no treatment differences, the treatment assignments and outcome data are independent. Simon and Simon (2011) give commonly held conditions under which the re-randomization test guarantees control of the type I error rate. Galbete and Rosenberger (2016) showed that a computationally feasible number of replicates is sufficient to accurately estimate even very small p-values. A key advantage of using re-randomization tests is that they can protect against unknown time trends, as we discuss further in Section 5.3. However, re-randomization tests can suffer from lower power compared with standard tests (Villar et al., 2018), particularly if the RAR procedure has highly variable allocation probabilities (Proschan and Dodd, 2019).
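A minimal sketch of a re-randomization test for a two-arm trial with binary outcomes follows. It regenerates allocation sequences under an RPW-type urn weighting as a stand-in for whatever RAR rule was actually used, and treats each patient's observed response as fixed, which is valid under the null hypothesis of no treatment difference.

```python
import numpy as np

rng = np.random.default_rng(7)

def rpw_prob_a(succ, count):
    """Allocation probability for arm A under an RPW-type urn weighting
    (a stand-in here; in practice, re-run the exact rule used in the trial)."""
    w_a = 1 + succ["A"] + (count["B"] - succ["B"])   # A-balls: A-successes + B-failures
    w_b = 1 + succ["B"] + (count["A"] - succ["A"])
    return w_a / (w_a + w_b)

def regenerate_assignments(responses):
    """Regenerate one allocation sequence, keeping each patient's observed
    response fixed (valid under the null of no treatment difference)."""
    succ, count, arms = {"A": 0, "B": 0}, {"A": 0, "B": 0}, []
    for y in responses:
        arm = "A" if rng.random() < rpw_prob_a(succ, count) else "B"
        count[arm] += 1
        succ[arm] += y
        arms.append(arm)
    return np.array(arms)

def difference_in_proportions(arms, responses):
    responses = np.asarray(responses, dtype=float)
    means = [responses[arms == a].mean() if (arms == a).any() else 0.0 for a in ("A", "B")]
    return means[0] - means[1]

def rerandomization_pvalue(observed_arms, responses, n_rep=10_000):
    """Proportion of regenerated sequences whose statistic is at least as
    extreme as the observed one."""
    obs = abs(difference_in_proportions(np.array(observed_arms), responses))
    stats = [abs(difference_in_proportions(regenerate_assignments(responses), responses))
             for _ in range(n_rep)]
    return float(np.mean([s >= obs for s in stats]))

# Example usage with hypothetical trial data:
# p_value = rerandomization_pvalue(arms_observed, responses_observed)
```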

Both of the methods above are simulation-based, and hence there may be concerns about Monte Carlo error as well as the computational burden of such tests. There have been a few proposals that do not rely on simulation and which can be used for type I error control. Robertson and Wason (2019) proposed a re-weighting of the usual Z-test that guarantees familywise error control for a large class of RAR procedures for multi-arm trials with normally-distributed outcomes, although with a potentially substantial loss of power. Galbete et al. (2016) derived the exact distribution of a test statistic for a family of RAR procedures in the context of a two-arm trial with binary outcomes, and hence showed how to obtain exact p-values.

Estimation bias

Although the MLEs for the parameters of interest are typically consistent for a trial using RAR, for finite samples they will be biased in general. This can be seen for a number of RAR procedures for binary outcomes in the simulation results given in Villar et al. (2015a), and for procedures based on Thompson sampling in Thall et al. (2015b). However, the latter point out that in their setting, which incorporates early stopping with continuous monitoring,

…most of the bias appears to be due to continuous treatment comparison, rather than AR per se.

Hence it is important to distinguish between bias induced by early stopping and that induced by the use of the RAR procedure itself.

A simple formula for the bias of the MLE for the response probability is given in Bowden and Trippa (2017), which is valid for general multi-arm RAR procedures without early stopping. In the common case where RAR assigns more patients to treatments that appear to work well, these results show that the bias of the MLE will be negative. In addition, the magnitude of this bias decreases with the number of patients assigned to the treatment. When estimating the treatment difference, however, the bias can be either negative or positive, which agrees with the results in Thall et al. (2015b). More general characterisations of the bias of the sample mean (even with early stopping) for multi-arm bandit procedures are given in Shin et al. (2019, 2020).

Bowden and Trippa (2017) showed that when there is no early stopping, the magnitude of the bias tends to be small for the RPW rule and the Bayesian RAR procedure proposed by Trippa et al. (2012). For more aggressive RAR procedures, the bias can be larger however, see Williamson and Villar (2020) for an example. As a solution, Bowden and Trippa (2017) proposed methods to correct for the bias of the MLE, using inverse probability weighting and Rao-Blackwellization, although these can be computationally intensive. For urn-based RAR procedures, Coad and Ivanova (2001) also proposed bias-corrected estimators for the response probability.
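The finite-sample bias of the unadjusted MLE can also be checked directly by simulation. The sketch below estimates it for the RPW rule under hypothetical success probabilities and a hypothetical sample size (with no early stopping), so the sign and size of the bias it reports are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_rpw_trial(p_a: float, p_b: float, n: int):
    """One two-arm trial of size n allocated by the randomized
    play-the-winner rule; returns the MLE of each success probability."""
    urn = {"A": 1, "B": 1}                      # initial urn composition
    succ = {"A": 0, "B": 0}
    count = {"A": 0, "B": 0}
    for _ in range(n):
        arm = "A" if rng.random() < urn["A"] / (urn["A"] + urn["B"]) else "B"
        outcome = rng.random() < (p_a if arm == "A" else p_b)
        count[arm] += 1
        succ[arm] += outcome
        # a success adds a ball of the same type, a failure adds one of the other
        urn[arm if outcome else ("B" if arm == "A" else "A")] += 1
    return tuple(succ[a] / count[a] if count[a] else np.nan for a in ("A", "B"))

p_a, p_b, n = 0.3, 0.5, 100                     # hypothetical values
estimates = np.array([simulate_rpw_trial(p_a, p_b, n) for _ in range(5000)])
bias = np.nanmean(estimates, axis=0) - np.array([p_a, p_b])
print(f"Estimated bias of the MLE: arm A {bias[0]:+.4f}, arm B {bias[1]:+.4f}")
```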

Finally, adjusted confidence intervals for RAR procedures have received less attention in the literature. Rosenberger and Hu (1999) proposed a bootstrap procedure for general multi-arm RAR procedures with binary responses, using a simple rank ordering. Meanwhile, Coad and Govindarajulu (2000) proposed corrected confidence intervals following a sequential adaptive design for a two-arm trial with binary responses. Recent work by Hadad et al. (2019) gives a strategy to construct asymptotically valid confidence intervals for a large class of adaptive experiments (including the use of RAR).

Summary

In conclusion, for trials with sufficiently large sample sizes, asymptotic results justify the use of standard statistical tests and frequentist inference procedures when using many types of RAR. When asymptotic results do not hold, inference does become more complicated, but there is a growing body of literature demonstrating how to control the type I error rate, and corrections for the bias of the MLE have been proposed. All this should give increased confidence that the results from a trial using RAR, if analyzed appropriately, can be both valid and convincing. Finally, we reiterate that from a Bayesian viewpoint, the use of RAR does not pose any additional inferential challenges.

5.3 Can RAR be used if there is potential for time trends or patient drift?

The issue of time trends caused by changes in the standard of care or by patient drift (i.e. changes in the characteristics of recruited patients over time, as noted in Section 4.2) is seen as a major barrier to the use of RAR in practice:

One of the most prominent arguments against the use of AR is that it can lead to biased estimates in the presence of parameter drift. (Thall et al., 2015b)

A more fundamental concern with adaptive randomization, which was noted when it was first proposed, is the potential for bias if there are any time trends in the prognostic mix of the patients accruing to the trial. In fact, time trends associated with the outcome due to any cause can lead to problems with straightforward implementations of adaptive randomization. (Korn and Freidlin, 2011a)

Both papers cited above show (for procedures based on Thompson sampling) that time trends can dramatically inflate the type I error rate when using standard analysis methods, and induce bias into the MLE. Further simulation results on the impact of time trends for BAR procedures (in the context of two-arm trials with binary outcomes) are given in Jiang et al. (2020). In Villar et al. (2018), a comprehensive simulation study is given for different time trend assumptions for a variety of RAR procedures in trials with binary outcomes (including the multi-arm setting).

Although all these papers show that time trends can inflate the type I error rate when using RAR procedures, there are two important caveats given in Villar et al. (2018). Firstly, they conclude that a largely ignored but highly relevant issue to consider is the size of the trend and its likelihood of occurrence in a specific trial:

…the magnitude of the temporal trend necessary to seriously inflate the type I error of the patient benefit-oriented RAR rules need to be of an important magnitude (i.e. change larger than 25% in its outcome probability) to be a source of concern.

Secondly, they also show (through simulation) that certain power-oriented RAR procedures are effectively immune to time trends. In particular, RAR procedures that protect the allocation to the control arm in some way are particularly robust.

As pointed out in Proschan and Evans (2020), temporal trends seem more likely to occur in two settings:

…1) trials of long duration, such as platform trials in which treatments may continually be added over many years and 2) trials in infectious diseases such as MERS, Ebola virus, and coronavirus.

Despite this, little work has looked at estimating these trends, especially when doing so to inform a trial design choice in the midst of an epidemic. Investigating both of these points is essential for making a sound assessment of whether RAR is valuable in a particular setting. Furthermore, as we now discuss, there are analysis methods that can prevent the type I error inflation that such trends could create when combined with a RAR design.

As mentioned in the previous section, one method to correct for type I error inflation is to use a re-randomization test (or more generally, randomization-based inference). Simon and Simon (2011) proved that using a re-randomization test (under commonly-held conditions) guarantees type I error control even under arbitrary time trends. Simulation studies illustrating this can be found in Galbete and Rosenberger (2016) and Villar et al. (2018). However, the latter shows that using randomization-based inference can come at the cost of a considerably reduced power compared with using an unadjusted testing strategy.

An alternative to randomization-based inference is to use a blocked or stratified analysis at the end of the trial, as proposed in e.g. Coad (1992); Karrison et al. (2003) and Korn and Freidlin (2011a). These papers show (through simulation) that a stratified analysis can eliminate the type I error inflation induced by time trends. However, Korn and Freidlin (2011a) also showed that using block randomization and a subsequent block-stratified analysis can reduce the trial efficiency, in terms of increasing the required sample size and the chance of patients being assigned to the inferior treatment.

Another approach is to explicitly incorporate time-trend information into the regression analysis. For example, Coad (1992) modified a class of sequential tests to incorporate a linear time trend for normally-distributed outcomes. Meanwhile, Villar et al. (2018) assessed incorporating the time trend into a logistic regression (for binary responses), and showed that this can alleviate the type I error inflation of RAR procedures, provided the trend is correctly specified and the associated covariates are measured and available. However, this can result in a loss of power and complicate estimation (due to the technical problem of separation).
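As a schematic of such an adjusted analysis, the sketch below simulates hypothetical trial data in which both the allocation proportion and the control response rate drift across enrollment periods, and then fits a logistic regression of response on treatment and period. The data-generating values and the step-wise period effect are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical analysis data set from a two-arm trial using RAR: the
# allocation proportion and the control response rate both drift over time.
n = 300
period = np.repeat([1, 2, 3], n // 3)                       # enrollment block
prob_treat = np.where(period == 1, 0.5, np.where(period == 2, 0.6, 0.7))
treatment = rng.binomial(1, prob_treat)
p_control = np.where(period == 1, 0.3, np.where(period == 2, 0.4, 0.5))
response = rng.binomial(1, np.clip(p_control + 0.15 * treatment, 0, 1))
df = pd.DataFrame({"response": response, "treatment": treatment, "period": period})

# Adjusting for enrollment period (here as a categorical covariate) targets
# the treatment effect net of a simple step-wise time trend.
fit = smf.logit("response ~ treatment + C(period)", data=df).fit(disp=0)
print(fit.params)
```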

Finally, it is possible to try to control the impact of a time trend during the randomization itself. Rosenberger et al. (2011) proposed a covariate-adjusted response-adaptive procedure for a two-arm trial that can take a specific time trend as a covariate. More recently, Jiang et al. (2020) proposed a BAR procedure that includes a time trend in a logistic regression model, and uses the resulting posterior probabilities as the basis for the randomization probabilities. This model-based BAR procedure controls the type I error rate and mitigates estimation bias, but at the cost of reduced power.

Summary

In summary, large time trends can inflate the type I error rate when using RAR procedures. However, not all RAR procedures are affected in this way, with those that protect the allocation to the control arm being particularly robust. For other types of RAR procedures, methods have been developed to mitigate the type I error inflation caused by time trends, although these tend to result in a loss in power. Finally, it is important to note that time trends can affect inference in all types of adaptive clinical trials, and not just those using RAR.

5.4 Does RAR lead to a substantial chance of allocating more patients to an inferior treatment?

Thall et al. (2015a) described a number of undesirable properties of RAR, including the following:

…there may be a surprisingly high probability of a sample size imbalance in the wrong direction, with a much larger number of patients assigned to the inferior treatment arm, so that AR has an effect that is the opposite of what was intended.

This was illustrated through simulation studies of two-arm trials, which showed that Thompson sampling can have a substantial chance (up to 14% for the parameter values considered) of producing sample size imbalances of more than 20 patients in the wrong direction out of a maximum sample size of 200. However, as we illustrate next, this result holds for a single RAR procedure and may not hold in general for other types of RAR.

To show this, we perform a simulation study using a very similar setup to that in Thall et al. (2015a). As in Section 5.1, we consider a two-arm trial with binary outcomes comparing treatments $A$ and $B$, with corresponding true success probabilities $p_A$ and $p_B$ (and total numbers of patients $N_A$ and $N_B$). We first consider Thompson sampling, and its more general formulation given by the Thall and Wathen procedure. Given the data observed so far, the Thall and Wathen procedure randomizes the next patient to treatment $A$ with probability

$$\frac{\pi_A^{\,c}}{\pi_A^{\,c} + (1 - \pi_A)^{\,c}}.$$

Here $\pi_A$ is the posterior probability that treatment $A$ is better than treatment $B$, estimated from the data observed so far and using uniform priors. The parameter $c \geq 0$ controls the variability of the resulting procedure. Setting $c = 0$ gives equal randomization, while setting $c = 1$ gives Thompson sampling. Thall and Wathen (2007) suggest setting $c$ equal to $1/2$ or $n/(2N)$, where $n$ is the current sample size and $N$ is the total (or maximum) sample size of the trial.
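
The following sketch (ours; a Monte Carlo approximation using the uniform Beta(1,1) priors stated above, with hypothetical counts in the example) computes this randomization probability for given numbers of successes and failures on each arm:

import numpy as np

rng = np.random.default_rng(0)

def prob_A_better(s_A, f_A, s_B, f_B, n_draws=100_000):
    """Monte Carlo estimate of Pr(p_A > p_B | data) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors)."""
    draws_A = rng.beta(1 + s_A, 1 + f_A, size=n_draws)
    draws_B = rng.beta(1 + s_B, 1 + f_B, size=n_draws)
    return np.mean(draws_A > draws_B)

def thall_wathen_prob(s_A, f_A, s_B, f_B, c):
    """Thall and Wathen allocation probability for arm A with tuning parameter c:
    c = 0 gives equal randomization, c = 1 gives Thompson sampling, and
    c = n/(2N) is the sample-size-dependent choice mentioned in the text."""
    pr = prob_A_better(s_A, f_A, s_B, f_B)
    return pr**c / (pr**c + (1 - pr)**c)

# Example (hypothetical counts): 10/20 successes on A, 6/20 on B, with c = 1/2
print(thall_wathen_prob(10, 10, 6, 14, c=0.5))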

As comparators to the Thall and Wathen procedure, we consider the RPW rule (described in Section 3), as well as the DBCD and ERADE designs targeting the optimal allocation of Rosenberger et al. (2001) (see Section 5.1). We set the true success probabilities $p_A$ and $p_B$ to the values used in Thall et al. (2015a), with a total sample size of $N = 200$. The only difference from the setup of Thall et al. (2015a) is therefore that we do not include early stopping, in order to isolate the effects of using the RAR procedures. Table 1 shows the mean (2.5th percentile, 97.5th percentile) of the sample size imbalance (the number of patients assigned to the superior arm minus the number assigned to the inferior arm), and the probability of an imbalance of 20 or more patients in the wrong direction.

N     RAR procedure   Mean imbalance (2.5%, 97.5% percentiles)   Pr(imbalance ≥ 20 in wrong direction)
200   Thompson        95 (-182, 190)                             0.137
200   TW(1/2)         74 (-90, 174)                              0.085
200   TW(n/2N)        49 (-20, 120)                              0.037
200   RPW             14 (-16, 44)                               0.011
200   DBCD            17 (-10, 46)                               0.003
200   ERADE           16 (-6, 42)                                0.000
654   Thompson        461 (-356, 640)                            0.045
654   TW(1/2)         384 (44, 594)                              0.015
654   TW(n/2N)        272 (54, 456)                              0.005
654   RPW             46 (-8, 100)                               0.009
654   DBCD            55 (8, 106)                                0.001
654   ERADE           54 (16, 96)                                0.000
Table 1: Measures of imbalance for various RAR procedures, using the true success probabilities of Thall et al. (2015a) and total sample sizes N = 200 and N = 654. Results are based on repeated simulated trial replicates. TW(c) = Thall and Wathen procedure with parameter c.

The results show that Thompson sampling has a substantial probability (almost 14%) of a large imbalance in the wrong direction, while using the Thall and Wathen procedure reduces this probability, all of which agrees with the results of Thall et al. (2015a). In contrast, the RPW, DBCD and ERADE designs have negligible values of this probability (0.011 or less), which is also reflected in the percentile intervals for the sample size imbalance. Of course, this comes at the cost of a smaller mean sample size imbalance.

Another important factor is the choice of the total sample size $N$. Setting $N = 200$ means that the trial has low power to declare the superior treatment better than the inferior one. Indeed, if $N$ is chosen so that fixed (equal) randomization yields a power of greater than 80% (when using a standard two-sample test of proportions), then $N$ needs to be at least 654. Rerunning the simulation with $N = 654$, Table 1 shows that the probabilities of a large imbalance in the wrong direction are substantially reduced for Thompson sampling and the Thall and Wathen procedure. Looking at the percentile intervals for the sample size imbalance, we see that Thompson sampling can still get ‘stuck’ on the wrong treatment arm. However, TW(1/2) and TW(n/2N) are now especially appealing in terms of sample size imbalance, with high values of both the mean imbalance and its 2.5% percentile.
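
To illustrate how such imbalance measures can be estimated by simulation, the following sketch (ours; the success probabilities, sample size and number of replicates below are purely illustrative, not those used for Table 1) simulates two-arm trials under Thompson sampling and records the mean imbalance and the probability of an imbalance of 20 or more patients towards the inferior arm:

import numpy as np

rng = np.random.default_rng(42)

def simulate_trial(p_A, p_B, N, rng):
    """One two-arm trial with binary outcomes, allocating by Thompson sampling."""
    s = np.zeros(2)
    f = np.zeros(2)
    n_arm = np.zeros(2, dtype=int)
    for _ in range(N):
        draws = rng.beta(1 + s, 1 + f)          # one draw from each Beta(1+s, 1+f) posterior
        a = int(np.argmax(draws))               # allocate to the arm with the larger draw
        y = rng.binomial(1, [p_A, p_B][a])
        n_arm[a] += 1
        s[a] += y
        f[a] += 1 - y
    return n_arm

# Illustrative (hypothetical) values; arm B is taken to be the superior arm
p_A, p_B, N, reps = 0.25, 0.35, 200, 2000
imbalance = np.array([np.diff(simulate_trial(p_A, p_B, N, rng))[0] for _ in range(reps)])
# imbalance = N_B - N_A; values of -20 or less are imbalances "in the wrong direction"
print("mean imbalance:", imbalance.mean())
print("Pr(imbalance <= -20):", np.mean(imbalance <= -20))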

Summary

In summary, RAR procedures do not necessarily have a high probability of sample size imbalance in the wrong direction, with designs targeting optimal allocation having a negligible probability of doing so. Even for BAR, this probability depends on the true parameter values being considered (see the further simulation results given in Thall et al. (2015a)), as well as the sample size of the trial. Indeed, as the total sample size of the trial increases (to meet a minimum power constraint, for example), the probability of sample size imbalance in the wrong direction will decrease.

5.5 Practical considerations: Is RAR more challenging to implement?

Once an investigator has decided to adopt a certain type of RAR, there is still a decision to be made as to how best to implement it in the specific context at hand. There are plenty of issues to consider, most of which are shared with non-adaptive designs. In this section, we focus on a few issues that are potentially different for RAR in particular, and hence merit additional discussion.

Measurement/classification error and missing data

The presence of measurement error (for continuous variables) or classification error (for binary variables) and missing data are common in medical research. Many analysis approaches have been proposed to reduce the impact of these issues on statistical inference (see e.g. Guolo (2008), Little and Rubin (2002), and Blackwell et al. (2017)), but there is limited literature on overcoming these issues when implementing a RAR procedure. The main concern when there are missing values and/or measurement error is that the sequentially updated allocation probabilities may become biased. This happens when the unobserved true values come from a distribution that differs from that of the observed values.

To the best of our knowledge, the only work considering classification (or measurement) error is by Li and Wang (2012) and Li and Wang (2013). They derived optimal allocation targets under a model with constant misclassification probabilities, which could differ between the treatment arms. In the latter paper, the effect of misclassification (in the two-arm setting) on the usual optimal allocation designs was also explored through simulation.

As for missing data, Biswas and Rao (2004) considered the presence of missing responses for a CARA design. For a two-arm setting with a normal outcome, a probit link function that depends on the covariate-adjusted unknown treatment effect parameters is used to construct the allocation probability. Under the assumption of missing at random (see Rubin (1976)) and with a single imputation for the missing responses, they found that the standard deviation and the mean of the proportion of patients assigned to a treatment arm are not affected by the structure of the missing data mechanism. In a thesis by Ma (2013), a new allocation function for CARA in the presence of missing covariates and missing responses is defined.

For RAR, Williamson and Villar (2020) proposed a forward-looking bandit-based allocation procedure for Phase II cancer trials with a normal outcome, together with an imputation method to facilitate the implementation of the procedure when the outcome underlying the RECIST categories (Therasse et al., 2000; Eisenhauer et al., 2009) is undefined, e.g. due to death or complete removal of a tumor. They suggested filling in the incomplete data for these extreme cases with random samples drawn from the lower and upper tails of the distributions under the null and alternative scenarios used for sample size determination.

More complex scenarios, e.g. data that are not missing at random, remain unexplored. Indeed, these issues have rarely been considered at the design stage of a trial, except in some simple settings such as the work by Lee et al. (2018).

Delayed responses

The use of RAR is clearly not appropriate in clinical trials where the patient outcomes are only observed after all patients have been recruited and randomized. This may be the case when the recruitment period is limited (e.g. due to a high recruitment rate), or when the outcome of interest takes a long time to observe (e.g. a survival endpoint). One way to address the latter point is to use a surrogate outcome that is more quickly observed. For example, Tamura et al. (1994) used a surrogate response to update the urn when using a RPW rule in a trial treating patients with depressive disorder. Another possibility when outcomes are delayed in some way is to use a randomization plan that is implemented in stages as more data become available. As an example, Zhao and Durkalski (2014) described a two-arm trial for patients suffering from acute stroke that was divided into stages: stage 1 was a burn-in period with equal randomization, stage 2 additionally sought to control covariate imbalance, and only in stage 3 did the RAR allocation begin.

In general, as long as some responses are available then RAR can be used in trials with delayed responses, as stated in Hu and Rosenberger (2006, pg. 105):

From a practical perspective, there is no logistical difficulty in incorporating delayed responses into the response-adaptive randomization procedure, provided some responses become available during the recruitment and randomization period. For urn models, the urn is simply updated when responses become available (Wei, 1988). For procedures based on sequential estimation, estimates can be updated when data become available. Obviously, updates can be incorporated when groups of patients respond also, not just individuals.

Although practically speaking there are no difficulties in incorporating delayed responses into RAR procedures, from a hypothesis testing point of view, statistical inference at the end of the trial can be affected. As noted in Rosenberger et al. (2012), this has been explored theoretically for urn models (Bai et al., 2002; Hu and Zhang, 2004; Zhang et al., 2007) as well as the DBCD (Hu et al., 2008). These papers show that the asymptotic properties of these RAR procedures are preserved under widely applicable conditions. In particular, when more than 60% of patient responses are available by the end of the recruitment period, simulations show that the power of the trial is essentially unaffected for these procedures.
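
As a toy sketch of this updating logic (ours; the fixed-lag delay mechanism and the use of Thompson sampling are illustrative assumptions, not a published procedure), randomization probabilities are computed from whichever responses have actually been observed by the time each new patient is randomized:

import numpy as np

rng = np.random.default_rng(7)

def trial_with_delayed_responses(p=(0.3, 0.5), N=200, delay=15, rng=rng):
    """Thompson-sampling allocation where each patient's outcome only becomes
    available `delay` patients after randomization (a simple fixed-lag model)."""
    arm = np.empty(N, dtype=int)
    outcome = np.empty(N, dtype=int)
    s = np.zeros(2)                              # counts based on *observed* responses only
    f = np.zeros(2)
    for i in range(N):
        j = i - delay                            # patient whose outcome has just become available
        if j >= 0:
            s[arm[j]] += outcome[j]
            f[arm[j]] += 1 - outcome[j]
        draws = rng.beta(1 + s, 1 + f)           # posterior draws from observed data so far
        a = int(np.argmax(draws))
        arm[i] = a
        outcome[i] = rng.binomial(1, p[a])       # generated now, but not yet "observed"
    return np.bincount(arm, minlength=2)         # number of patients allocated to each arm

print(trial_with_delayed_responses())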

Patient consent

Patient consent is as much an ethical element of clinical trials as equipoise is. Informed consent protects patients’ autonomy, and requires an appropriate balance between information disclosure and understanding (Beauchamp, 1997). There is considerable evidence that the basic elements of informed consent (recall and understanding) can be very difficult to ensure even for traditional non-adaptive studies (Sugarman, 1999; Dawson, 2009).

The added complexity of allocation probabilities that may (or may not) change in response to accumulated data only makes obtaining informed consent more challenging. Understanding the caveats of each randomization algorithm has proven hard even for statisticians, which puts into context the challenge of explaining these concepts to the general public and expecting them to make an informed decision when their health is at stake. Moreover, since these novel adaptive procedures are still rarely used in real trials, there is little practical experience to draw upon.

RCTs also require a level of blinding that makes it even harder, for studies using RAR, to balance the disclosure required for understanding against the need to consent patients during the trial, since RAR uses accumulated data both to alter the design and to adequately inform patients as they are recruited.

None of these challenges implies that a trial design using RAR cannot be the best design option for a particular setting; they simply need to be properly addressed and weighed alongside the other reasons for or against using RAR. For a more in-depth discussion of these issues see Sim (2019).

Implementing randomization changes in practice

Randomization of patients in clinical trials, whether adaptive or not, must be done in accordance with standards of good clinical practice. As such, randomization of patients in most clinical trials worldwide is done through a dedicated web-based system that will typically be used by several members of the trial team, including the trial manager, statistician, independent reviewers and those who use it to randomize patients. Randomization has moved on from the days when it was done with paper and envelopes, and now requires a system that can be backed up and maintained, and is easy to use, secure and available 24/7.

In the UK, for example, most clinical trials units outsource their randomization to external companies, which in turn ensures compliance with good clinical practice. This outsourcing can end up limiting the ways in which randomization can be implemented in a trial to those currently offered by the companies in use. To date, and to the best of the authors’ knowledge, no company in the UK offers a RAR portfolio, and so every change in a randomization ratio is treated as a trial change (and charged as such) rather than being considered an integral part of the trial design. Beyond the extra costs this brings, it can introduce unnecessary delays, as the randomization of patients needs to be stopped while the change is implemented. This can certainly be a very important deterrent to the use of RAR in practice, as it can imply a much larger effort to implement the randomization procedure and ensure that it is compliant with regulations.

A related issue is that of preserving blinding. Maintaining treatment blinding is key to protecting the integrity of clinical trials. This is particularly important when applying a design that incorporates RAR, since if an investigator is aware of which treatment is more likely to be allocated to the next patient, selection bias is more likely to occur. Therefore, implementing a RAR algorithm in practice poses extra challenges in preserving the blinding of clinical trial staff to the results of the trial and avoiding biases. In most cases, preserving blinding will require the appointment of an independent statistician (which requires extra resources) to handle the interim data and implement the randomization ratios; alternatively, a data manager can provide clean data to an external randomization provider who can then update the randomization probabilities independently of the clinical and statistical team. A further discussion of some of the practical issues around handling interim data and unblinding can be found in Sverdlov and Rosenberger (2013b).

5.6 Is using RAR in clinical trials (more) ethical?

Ethical reasons have been the most well-known and cited arguments in favor of using RAR over the years.

Our explicit goal is to treat patients more effectively, but a happy side effect is that we learn efficiently. (Berry, 2004)

Research in response-adaptive randomization developed as a response to a classical ethical dilemma in clinical trials. (Hu and Rosenberger, 2006)

The goal of response-adaptive randomization is to achieve some ethical or statistical objective that might not be possible by fixing an allocation probability in advance. (Rosenberger and Lachin, 2016)

Nevertheless, within the statistical community there are also positions arguing that response-adaptive randomization may not be “ethical”.

For RCTs where treatment comparison is the primary scientific goal, it appears that in most cases designs with fixed randomization probabilities and group sequential decision rules are preferable to AR scientifically, ethically and logistically (Thall et al., 2015a)

Clinical research (implemented through clinical trials designed to create generalizable knowledge) poses several ethical questions. To start with, there is an inevitable tension between clinical research and clinical practice, where the latter is concerned with best treating an individual patient. Such ethical conflicts are becoming ever more widely discussed as the treatment of patients becomes more closely linked to research activities, as is currently observed in cancer research (London, 2018). Although the idea of “treating more patients effectively” by using RAR appears to be ethically attractive, particularly from the patients’ point of view, the extent to which these adaptive designs are truly more “ethical” than traditional randomization designs is only recently starting to be formally addressed by ethicists.

It is our belief that a) this issue should receive more attention from ethicists and b) collaborations between ethicists and statisticians are needed to fully address all the caveats and complexities that this broad family of methods can distinctly have. Any argument (be it in favour of or against RAR) that involves an ethical dimension should ideally be discussed by an ethicist and a statistician jointly. We also believe that any extreme position regarding RAR based purely on statistical or ethical arguments considered in isolation is most likely to be wrong. Two reasons motivate this belief. First, as illustrated in the previous sections of this paper, RAR is a broad family of methods about which almost no generalization is true. More importantly, because the trial context to which RAR is applied matters significantly, making compromises between statistical and ethical objectives can have very different implications in different settings. Therefore, in this section, we will not aim to answer whether RAR methods are ethical or not (a question that would need to be addressed for each method and trial context specifically). However, we review key concepts that would affect this answer and point to some current discussions by ethicists that statisticians would benefit from reading.

The “equipoise” concept

An argument against the use of RAR has been that it violates the principle of equipoise upon which clinical trials (and medical research more generally) are based (Laage et al., 2017). Equipoise is a state of uncertainty of the individual investigator regarding the relative merits of two or more interventions for some population of patients. Such uncertainty justifies randomizing patients to treatments, as doing so does not knowingly disadvantage any patient. This concept was considered too narrow and was therefore extended to allow for randomizing patients when there is “honest, professional disagreement among expert clinicians” about the relative clinical merits of some interventions for a particular patient population (Freedman, 1987). This broader definition of equipoise is known as ‘clinical equipoise’, while the first one is known as ‘theoretical equipoise’.

Changing the randomization probabilities in light of patients’ responses is viewed as disturbing equipoise, because the updated allocation weights reflect the relative performance of the interventions in question. Once the randomization weights become unbalanced, the study has a preferred treatment, and allocating participants to treatments regarded as inferior is unethical, as it violates the concern for welfare. However, this argument that RAR is unethical because it breaks equipoise is based on two assumptions: 1) randomization ratios reflect a single agent’s beliefs about the relative merits of the interventions being tested in a study; and 2) equipoise is a state of belief in which the relevant probabilities are assumed to be equally balanced. Neither of these two assumptions is consistent with the definition of ‘clinical equipoise’, as the clinical community is certainly composed of more than one agent and an honest disagreement among them will not necessarily correspond to a 50%-50% split of opinions.

Patient horizon (individual and collective ethics)

The ethical value of RAR (and of any other design) also depends considerably on trial-specific considerations. A particular feature that can affect comparisons between design options relates to disease prevalence. Suppose a clinical trial is being planned, and consider the size of the “patient horizon” for that study, i.e. the number of patients within and outside of the trial who will benefit from its conclusions. The concept of the patient horizon can be traced back to Anscombe (1963) and Colton (1963). Its precise size is never known, but in extreme situations it impacts (or should impact) the choice of design and sample size of the trial. A trial in a rare paediatric cancer (for example) is likely to have a large proportion of the patient horizon included in the trial, while a trial relevant to patients with coronary artery disease will have the vast majority of the patient horizon outside of the trial. Similar considerations apply in the context of emerging life-threatening diseases (e.g. the recent Ebola crisis or the current COVID-19 pandemic), where the patient horizon can be short for reasons other than prevalence.

Certainly, the impact of the patient horizon on the comparison of designs from an ethical point of view depends on considerations around individual and collective ethics, and potential conflicts between the two. As Tamura et al. (1994, pg. 775) express it, RAR “represents a middle ground between the community benefit and the individual patient benefit” and because of this “it is subject to attack from either side”. This specific point has been very well discussed and formally studied in the statistical literature (see Berry and Eick (1995); Berry (2004); Cheng et al. (2003) for some examples). Despite this, the prevalence of a disease is almost never taken into account, either in practice when designing real trials or in the large number of articles comparing RAR methods from an ethical point of view.

Two-armed versus multi-armed trials

A final point that can affect any ethical judgment about a trial design and a RAR procedure is the number of arms considered in the trial. In a two-armed setting, trade-offs between ethical and statistical features are such that there is no scope for a design to be superior to (or dominate) any other in all respects. In the multi-armed setting, this does not have to be the case, and depending on the main objective of the trial (e.g. the relevant power definition used), designs using RAR can be superior to an equally randomized trial in the sense that they can achieve both efficiency and ethical gains over a traditional RCT.

6 Discussion

RAR methods have received as much theoretical attention as they have generated heated debate, especially during acute health crises like the one we are currently facing with COVID-19. This is not surprising, as the main reason driving the desire to change randomization ratios in clinical trials is to better respond to very difficult ethical challenges under pressing conditions. However, for such debates around its use in practice to be useful and meaningful, they have to remain level-headed and bear in mind that generalizations within such a large class of methods are very likely to be partial and/or misleading. Even within one class of RAR, conclusions about “typical” issues can be very different if the procedure is applied in a two-armed or a multi-armed trial setting, and if it is used in an early phase study (naturally to be followed by a confirmatory trial) or in a later phase study. In most cases, the relative performance of designs is highly dependent on the preferred method of inference (frequentist or Bayesian).

Our recommendation is that if a trial design calls for the use of RAR as a possible option to address a specific concern (be that ethical or not), careful consideration should be given to all the issues mentioned in the present work. Also, our advice is to think of RAR as a long list of possible design options rather than as a single technique to include or not. The number of possible ways to implement RAR in a clinical trial is far too large for any single statement to apply to all of them in a given context. We also advise the use of extensive simulations that explore the parameter space widely (and not only subsets of interest). Every RAR procedure can address a specific need at the expense of a cost in a different area, and this needs to be made explicit at the design stage. There is no perfect trial design, but some trade-offs might be acceptable in certain cases (or even necessary for a greater good) while the same trade-offs can be absolutely rejected in other contexts.

We would like to end the paper with a short recapitulation of what we feel the future of RAR methods research should bring to accompany a more prevalent use of it in practice. Indeed, once RAR methods start being more commonly used in clinical trials, this will unlock a demand for a considerable amount of applied research on the analysis of such studies. Practical issues such as accounting for missing data or measurement error when using a RAR method remain largely open questions. In terms of more theoretical work, we feel that the main issue to address is that of robust inference for the broader class of methods. However, there are also design issues that could still be addressed. For instance, the RAR class has scope for addressing delicate issues when designing studies with composite or complex endpoints, where component effects might be in the same direction or even in opposite directions.

Finally, a higher uptake in practice requires a wider availability of user-friendly software, both for the implementation of the randomization algorithms and for the analysis approaches mentioned in Section 5 of this paper. Applied statisticians could also benefit from training specifically oriented towards the use of simulation for the evaluation of RAR techniques, which is crucial for understanding their potential limitations and benefits.

Acknowledgements

The authors acknowledge funding and support from the UK Medical Research Council (grants
MC_UU_00002/3 (SSV, BCL-K), MC_UU_00002/6 (DSR), MR/N028171/1 (KML)) and the Biometrika Trust (DSR).

References

  • Anscombe (1963) Anscombe, F.J. (1963). Sequential medical trials. Journal of the American Statistical Association 58 365–383.
  • Atkinson and Biswas (2014) Atkinson, A.C. and Biswas, A. (2014). Randomized Response-Adaptive Designs in Clinical Trials. CRC Press, Boca Raton.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi N. and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2–3) 235–256.
  • Barker (2009) Barker, A.D., Sigman, C.C., Kelloff, G.J., Hylton, N.M., Berry, D.A., Esserman, L.J. (2009). I-SPY 2: An adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. American Society for Clinical Pharmacology and Therapeutics 86 97–100.
  • Bartlett (1985) Bartlett, R., Roloff, D., Cornell, R., Andrews, A., Dillon, P. & Zwischenberger, J. (1985). Extracorporeal Circulation in Neonatal Respiratory Failure: A Prospective Randomized Study. Pediatrics Journal 76(4) 479–487.
  • Beauchamp (1997) Beauchamp, T.L. (1997). Informed consent. In Medical Ethics, 2nd edition, ed. Robert M. Veatch, 185–208. Boston: Jones and Bartlett.
  • Bello and Sabo (2016) Bello, G.A. and Sabo, R.T. (2016). Outcome-adaptive allocation with natural lead-in for three-group trials with binary outcomes. Journal of Statistical Computation and Simulation 86(12) 2441–2449.
  • Berry and Eick (1995) Berry, D.A. and Eick, S.G. (1995). Adaptive assignment versus balanced randomization in clinical trials: a decision analysis. Statistics in Medicine 14(3) 231–246.
  • Berry (2004) Berry, D.A. (2004). Bayesian Statistics and the Efficiency and Ethics of Clinical Trials. Statistical Science 19(1) 175–187.
  • Berry (2011) Berry, D.A. (2011). Adaptive Clinical Trials: The Promise and the Caution. Journal of Clinical Oncology 29(6) 606–609.
  • Bai et al. (2002) Bai, Z.D., Hu, F. and Rosenberger, W.F. (2002). Asymptotic properties of adaptive designs for clinical trials with delayed responses. Annals of Statistics, 30 122–139.
  • Biswas and Rao (2004) Biswas, A. and Rao, J. N. K. (2004). Missing responses in adaptive allocation design. Statistics & Probability Letters, 70(1) 59–70.
  • Blackwell et al. (2017) Blackwell, M., Honaker,J. and King, G. (2017). A unified approach to measurement error and missing data: overview and applications. Sociological Methods & Research, 46(3) 303–341.
  • Bowden and Trippa (2017) Bowden, J. and Trippa, L. (2017). Unbiased estimation for response adaptive clinical trials. Statistical Methods in Medical Research 26(5) 2376–2388.
  • Brittain and Proschan (2016) Brittain, E.H. and Proschan, M.A. (2016). Comments on Berry et al.’s response-adaptive randomization platform trial for Ebola. Clinical Trials 13(5) 566–567.
  • Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning 5(1) 1–122.
  • Burton et al. (1997) Burton, P.R., Gurrina, L.C. and Hussey, M.H. (1997). Interpreting the clinical trials of extracorporeal membrane oxygenation in the treatment of persistent pulmonary hypertension of the newborn. Seminars in Neonatology 2 69–79.
  • Carey (2016) Carey, L.A. and Winer, E.P. (2016). I-SPY 2 - Toward More Rapid Progress in Breast Cancer Treatment New England Journal of Medicine 375 83–84.
  • Cheng et al. (2003) Cheng, Y., Su, F. and Berry, D.A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika 90(4) 923–936.
  • Cheng and Berry (2007) Cheng, Y. and Berry, D.A. (2007). Optimal adaptive randomized designs for clinical trials. Biometrika 94 673–689.
  • Coad (1991) Coad, D.S. (1991). Sequential tests for an unstable response variable. Biometrika 78(1) 113–121.
  • Coad (1992) Coad, D.S. (1992). A comparative study of some data-dependent allocation rules for Bernoulli data. Journal of Statistical Computation and Simulation 40(3–4), 219–231.
  • Coad and Govindarajulu (2000) Coad, D.S. and Govindarajulu, Z. (2000). Corrected confidence intervals following a sequential adaptive trial with binary response. Journal of Statistical Planning and Inference 91 53–64.
  • Coad and Ivanova (2001) Coad, D.S. and Ivanova, A. (2001). Bias calculations for adaptive urn designs. Sequential Analysis 20 229–239.
  • Colton (1963) Colton, T. (1963). A model for selecting one of two treatments. Journal of the American Statistical Association 58 388–400.
  • Das (2017) Das, S. and Lo, A.W. (2017). Re-inventing drug development: A case study of the I-SPY 2 breast cancer clinical trials program Contemporary Clinical Trials 62 168–174.
  • Dawson (2009) Dawson, A. (2009). The normative status of the requirement to gain an informed consent in clinical trials: Comprehension, obligations and empirical evidence. In The limits of consent: A sociolegal approach to human subject research in medicine, ed. Oonagh Corrigan, John McMillan, Kathleen Liddell, Martin Richards, and Charles Weijer, 99–-113. Oxford: Oxford University Press.
  • Duan and Hu (2009) Duan, L. and Hu, F. (2009). Doubly-adaptive biased coin designs with heterogenous responses. Journal of Statistical Planning and Inference 139 3220–3230.
  • Eisele (1994) Eisele, J.R. (1994). The double adaptive biased coin design for sequential clinical trials. Journal of Statistical Planning and Inference 38 249–261.
  • Eisenhauer et al. (2009) Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R., Dancey, J., Arbuck, S., Gwyther, S., Mooney, M., Rubinstein, L., Shankar, L., Dodd, L., Kaplan, R., Lacombe, D. and Verweij, J. (2009). New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European Journal of Cancer, 45(2) 228–247.
  • Fan et al. (2018) Fan, L., Yeatts, S.D., Wolf, B.J., McClure, L.A., Selim, M. and Palesch, Y.Y. (2018). The impact of covariate misclassification using generalized linear regression under covariate-adaptive randomization. Statistical Methods in Medical Research, 27(1) 20–34.
  • Flournoy et al. (2013) Flournoy, N., Haines, L.M. and Rosenberger, W.F. (2013). A Graphical Comparison of Response-Adaptive Randomization Procedures. Statistics in Biopharmaceutical Research, 5(2) 126–141.
  • Freedman (1987) Freedman, B. (1987). Equipoise and the Ethics of Clinical Research. New England Journal of Medicine 317 141–145.
  • Galbete et al. (2016) Galbete, A., Moler, J.A. and Plo, F. (2016). Randomization tests in recursive response-adaptive randomization procedures. Statistics 50(2) 418–434.
  • Galbete and Rosenberger (2016) Galbete, A. and Rosenberger, W.F. (2016). On the use of randomization tests following adaptive designs. Journal of Biopharmaceutical Statistics 26(3) 466–474.
  • Gu and Lee (2010) Gu, X. and Lee, J.J. (2010). A simulation study for comparing testing statistics in response-adaptive randomization. BMC Medical Research Methodology 10 48.
  • Guolo (2008) Guolo, A. (2008). Robust techniques for measurement error correction: a review. Statistical Methods in Medical Research, 17(6) 555–580.
  • Hadad et al. (2019) Hadad, V., Hirshberg, D.A., Zhan, R., Wager, S. and Athey, S. (2019). Confidence Intervals for Policy Evaluation in Adaptive Experiments. arXiv preprint arXiv:1911.02768.
  • Hu and Rosenberger (2003) Hu, F. and Rosenberger, W.F. (2003). Optimality, variability, power: Evaluating response-adaptive randomization procedures for treatment comparisons. Journal of the American Statistical Association 98 671–678.
  • Hu and Zhang (2004) Hu, F. and Zhang, L-X. (2004a). Asymptotic properties of doubly adaptive biased coin design for multi-treatment clinical trials. The Annals of Statistics 32(1) 268–301.
  • Hu and Zhang (2004) Hu, F. and Zhang, L-X. (2004b). Asymptotic normality of adaptive designs with delayed response. Bernoulli 10 447–463.
  • Hu and Rosenberger (2006) Hu, F. and Rosenberger, W.F. (2006). The Theory of Response-Adaptive Randomization in Clinical Trials. Wiley Series in Probability and Statistics.
  • Hu et al. (2008) Hu, F., Zhang, L-X., Cheung, S.H. and Chan, W.S. (2008). Double-adaptive biased coin designs with delayed responses. Canadian Journal of Statistics 36 541–559.
  • Hu et al. (2009) Hu F., Zhang L. and He X. (2009). Efficient Randomized-Adaptive Designs. The Annals of Statistics 37(5A) 2543–2560.
  • Hu et al. (2015) Hu, J., Zhu, H. and Hu, F. (2015). A Unified Family of Covariate-Adjusted Response-Adaptive Designs Based on Efficiency and Ethics. Journal of the American Statistical Association. 110:509 357–367.
  • Ivanova (2003) Ivanova, A. (2003). A play-the-winner-type urn design with reduced variability. Metrika 58(1) 1–13.
  • Jacko (2019) Jacko, P. (2019). The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths. arXiv preprint. arXiv:1906.10173.
  • Jennison and Turnbull (2000) Jennison, C. and Turnbull, B.W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC, Boca Raton, FL.
  • Jeon and Hu (2010) Jeon, Y. and Hu, F. (2010). Optimal Adaptive Designs for Binary Response Trials with Three Treatments. Statistics in Biopharmaceutical Research. 2(3) 310–318.
  • Jiang et al. (2020) Jiang, Y., Zhao W. and Durkalski-Mauldin, V. (2020). Time-trend impact on treatment estimation in two-arm clinical trials with a binary outcome and Bayesian response adaptive randomization. Journal of Biopharmaceutical Statistics. 30(1) 69–88.
  • Johnston et al. (2019) Johnston, K., Bruno, A., Pauls, Q., Hall, C.E., Barrett, K.M., Barsan, W., Fansler, A., Van de Bruinhorst, K., Janis, S. and Durkalski-Mauldin, V.L. for the Neurological Emergencies Treatment Trials Network and the SHINE Trial Investigators (2019). Intensive vs Standard Treatment of Hyperglycemia and Functional Outcome in Patients With Acute Ischemic Stroke: The SHINE Randomized Clinical Trial. Journal of the American Medical Association. 322(4) 326–335.
  • Kaibel and Biemann (2019) Kaibel, C. and Biemann, T. (2019). Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments. Organizational Research Methods https://doi.org/10.1177/1094428119854153.
  • Karrison et al. (2003) Karrison, T.G., Huo, D. and Chappell, R. (2003). A group sequential, response-adaptive design for randomized clinical trials. Controlled Clinical Trials 24 506–522.
  • Kaufmann et al. (2012) Kaufmann, E., Korda, N. and Munos, A. (2012). Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. International Conference on Algorithmic Learning Theory 2012: Proceedings 199–213.
  • Kaufmann and Garivier (2017) Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: two bandit frameworks. ESAIM: Procs 60 114–131.
  • Kim et al. (2011) Kim, E. S., Herbst, R.S., Wistuba, I.I., Lee, J.J., Blumenschein, G.R., Tsao, A., Stewart, D.J., Hicks, M.E., Erasmus, J. Jr, Gupta, S., Alden, C.M., Liu, S., Tang, X., Khuri, F.R., Tran, H.T., Johnson, B.E., Heymach, J.V., Mao, L., Fossella, F., Kies, M.S., Papadimitrakopoulou, V., Davis, S.E., Lippman, S.M. and Hong, W.K. (2011). The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery 1(1) 44–53.
  • Korn and Freidlin (2011a) Korn, E.L. and Freidlin, B. (2011a). Outcome-adaptive randomization: is it useful? Journal of Clinical Oncology, 29 771–776.
  • Korn and Freidlin (2011b) Korn, E.L. and Freidlin, B. (2011b). Reply to Y. Yuan et al. Journal of Clinical Oncology 29 e393.
  • Korn (2017) Korn, E.L. and Freidlin, B. (2017). Commentary. Adaptive Clinical Trials: Advantages and Disadvantages of Various Adaptive Design Elements. Journal of the National Cancer Institute 109(6) djx013.
  • Laage et al. (2017) Laage, T., Loewy, J.W., Menon, S., Miller, E.R., Pulkstenis, E., Kan-Dobrosky, N. and Coffey, C. (2017). Ethical Considerations in Adaptive Design Clinical Trials. Therapeutic Innovation & Regulatory Science 51(2) 190–199.
  • Lattimore and Szepesvári (2019) Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms. In press https://tor-lattimore.com/downloads/book/book.pdf.
  • Lee et al. (2012) Lee, J.J., Chen N. and Yin, G. (2012). Worth Adapting? Revisiting the Usefulness of Outcome-Adaptive Randomization. Clinical Cancer Research 18(17) 4498–4507.
  • Lee et al. (2018) Lee, K.M., Mitra, R. and Biedermann, S. (2018). Optimal design when outcome values are not missing at random. Statistica Sinica, 28(4) 1821–1838.
  • Li and Wang (2012) Li, X. and Wang, X. (2012). Variance-penalized response-adaptive randomization with mismeasurement. Journal of Statistical Planning and Inference, 142 2128–-2135.
  • Li and Wang (2013) Li, X. and Wang, X. (2013). Response adaptive designs with misclassified responses. Communication in Statistics – Theory and Methods, 42 2071–-2083.
  • Lin and Bunn (2017) Lin, J. and Bunn, V. (2017). Comparison of multi-arm multi-stage design and adaptive randomization in platform clinical trials. Contemporary Clinical Trials 54 48–59.
  • Little and Rubin (2002) Little, R. J. and Rubin, D. B. (2002). Bayes and multiple imputation. In: Statistical analysis with missing data, 2nd edition, 200–220. New York (NY): Wiley.
  • London (2018) London, A.J. (2018). Learning health systems, clinical equipoise and the ethics of response adaptive randomization. Journal of Medical Ethics 44 409–415.
  • Ma (2013) Ma, Z. (2013). Missing data and adaptive designs in clinical studies. PhD Thesis, University of Virginia.
  • Marchenko (2014) Marchenko, O., Fedorov, V., Lee, J., Nolan, C. and Pinheiro, J. (2014). Adaptive Clinical Trials: Overview of Early-Phase Designs and Challenges. Therapeutic Innovation & Regulatory Science 48(1) 20–30.
  • Melfi and Page (2000) Melfi, V.F. and Page, C. (2000). Estimation after adaptive allocation. Journal of Statistical Planning and Inference 87(2) 353–363.
  • Papadimitrakopoulou (2016) Papadimitrakopoulou, v., Lee, J.J., Wistuba, I., Tsao, A. Fossella, F., Kalhor, N., Gupta, S., Averett Byers, L., Izzo, J., Gettinger, S., Goldbert, S., Tang, X., Miller, V., Skoulidis, F., Gibbons, D., Shen, L., Wei, C., Diao, L., Peng, S. A., Wang, J., Tam, A., Coombes, K., Koo, J. Mauro, D., Rubin, E., Heymach, J., Hong, W. and Herbst, R. (2016). The BATTLE-2 Study: A Biomarker-Integrated Targeted Therapy Study in Previously Treated Patients With Advanced Non-Small-Cell Lung Cancer. Journal of Clinical Oncology 34(30) 3638–3647.
  • Park et al. (2016) Park, J. W., Liu, M.C., Yee, D., Yau, C., van’t Veer, L.J., Symmans, W.F., Paoloni, M., Perlmutter, J., Hylton, N.M., Hogarth, M., DeMichele, A., Buxton, M.B., Chien, A.J., Wallace, A.M., Boughey, J.C., Haddad, T.C., Chui, S.Y., Kemmer, K.A., Kaplan, H.G., Isaacs, C., Nanda, R., Tripathy, D., Albain, K.S., Edmiston, K.K., Elias, A.D., Northfelt, D.W, Pusztai, L., Moulder, S.L., Lang, J.E., Viscusi, R.K. Euhus, D.M., Haley, B.B., Khan, Q.J., Wood, W.C., Melisko, M., Schwab, R., Helsten, T., Lyandres, J., Davis, S.E., Hirst, G.L., Sanil, A., Esserman, L.J. and Berry, D.A. for the I-SPY 2 Investigators (2016). Adaptive randomization of neratinib in early breast cancer. New England Journal of Medicine 375(1) 11–22.
  • Press (2005) Press, S. J. (2005). Applied multivariate analysis: using Bayesian and frequentist methods of inference. Courier Corporation.
  • Proschan and Dodd (2019) Proschan, M.A. and Dodd, L.E. (2019). Re-randomization tests in clinical trials. Statistics in Medicine 38, 2292–2302.
  • Proschan and Evans (2020) Proschan, M. and Evans, S. (2020). The Temptation of Response-Adaptive Randomization. Clinical Infectious Diseases. Advance access, doi: 10.1093/cid/ciaa334.
  • Robbins (1952) Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 527–535.
  • Robertson and Wason (2019) Robertson, D.S. and Wason, J.M.S. (2019). Familywise error control in multi-armed response-adaptive trials. Biometrics 75(3) 885–894.
  • Rosenberger and Lachin (1993) Rosenberger, W.F. and Lachin, J.M. (1993). The use of response-adaptive designs in clinical trials. Controlled Clinical Trials 14(6) 471–484.
  • Rosenberger et al. (2001) Rosenberger, W.F., Stallard, N., Ivanova, A., Harper, C.N. and Ricks, M.L. (2001). Optimal adaptive designs for binary response trials. Biometrics 57 909–913.
  • Rosenberger et al. (2001b) Rosenberger, W.F., Vidyashankar, A.N. and Agarwal, D.K. (2001b). Covariate-adjusted response-adaptive designs for binary response. Journal of Biopharmaceutical Statistics 11(4) 227–236.
  • Rosenberger et al. (2012) Rosenberger, W.F., Sverdlov, O. and Hu, F. (2012). Adaptive Randomization for Clinical Trials. Journal of Biopharmaceutical Statistics 22 719–736.
  • Rosenberger and Hu (1999) Rosenberger, W.F. and Hu, F. (1999). Bootstrap methods for adaptive designs. Statistics in Medicine 18 1757–1767.
  • Rosenberger and Hu (2004) Rosenberger, W.F. and Hu, F. (2004). Maximising power and minimizing treatment failures in clinical trials. Clincal Trials 1 141–147.
  • Rosenberger and Lachin (2016) Rosenberger, W.F. and Lachin, J.M. (2016). Randomization in Clinical Trials. Wiley Series in Probability and Statistics, Hoboken, New Jersey.
  • Rosenberger and Sverdlov (2008) Rosenberger, W.F. and Sverdlov, O. (2008). Handling Covariates in the Design of Clinical Trials. Statistical Science 23(3) 404–419.
  • Rubin (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.
  • Rugo et al. (2016) Rugo, H. S. Olopade, O.I., DeMichele, A., Yau, C., van’t Veer, L.J., Buxton, M.B., Hogarth, M., Hylton, N.M., Paoloni, M., Perlmutter, J., Symmans, W.F., Yee, D., Chien, A.J., Wallace, A.M., Kaplan, H.G., Boughey, J.C., Haddad, T.C., Albain, K.S., Liu, M.C., Isaacs, C., Khan, Q.J., Lang, J.E., Viscusi, R.K., Pusztai, L., Moulder, S.L., Chui, S.Y., Kemmer, K.A., Elias, A.D., Edmiston, K.K., Euhus, D.M., Haley, B.B., Nanda, R., Northfelt, D.W., Tripathy, D., . Wood, W.C., Ewing, C., Schwab, R., Lyandres, J., Davis, S.E., Hirst, G.L., Sanil, A., Berry, D.A. and Esserman, L.J. for the I-SPY 2 Investigators (2016). Adaptive randomization of veliparib–carboplatin treatment in breast cancer. New England Journal of Medicine 375(1) 23–34.
  • Sabo (2014) Sabo, R.T. (2014). Adaptive allocation for binary outcomes using decreasingly informative priors. Journal of Biopharmaceutical Statistics 24(3) 569–578.
  • Samaniego (2010) Samaniego, F. J. (2010). A comparison of the Bayesian and frequentist approaches to estimation. Springer Science & Business Media.
  • Shin et al. (2019) Shin, J., Ramdas, A. and Rinaldo, A. (2019). Are sample means in multi-armed bandits positively or negatively biased? NeurIPS 2019 arXiv:1905.11397.
  • Shin et al. (2020) Shin, J., Ramdas, A. and Rinaldo, A. (2020). On conditional versus marginal bias in multi-armed bandits. arXiv preprint arXiv:2002.08422.
  • Sim (2019) Sim, J. (2019). Outcome-adaptive randomization in clinical trials: issues of participant welfare and autonomy. Theoretical Medicine and Bioethics 40(2) 83-–101.
  • Simon and Simon (2011) Simon, R. and Simon, N.R. (2011). Using randomization tests to preserve type I error with response adaptive and covariate adaptive randomization. Statistics and Probability Letters 81(7) 767–772.
  • Smith (1984) Smith, R.L. (1984). Properties of Biased Coin Designs in Sequential Clinical Trials. Annals of Statistics 12(3) 1018–1034.
  • Siu (2017) Siu, L. L., Ivy, S. P., Dixon, E. L., Gravell, A. E., Reeves, S. A. and Rosner, G. L. (2017). Challenges and Opportunities in Adapting Clinical Trial Design of Immunotherapies. Clin Cancer Res 23(17) 4950–4958.
  • Sugarman (1999) Sugarman, J., Douglas C. McCrory, D.C., Powell, D., Krasny, A., Adams, B., Ball, E. and Cassell, C. (1999). Empirical research on informed consent. Hastings Center Report 29(suppl) S1–S42.
  • Sverdlov and Rosenberger (2013a) Sverdlov, O. and Rosenberger, W.F. (2013). On recent advances in optimal allocation designs in clinical trials. Journal of Statistical Theory and Practice 7(4) 753–773.
  • Sverdlov and Rosenberger (2013b) Sverdlov, O. and Rosenberger, W.F. (2013). Randomization in clinical trials: can we eliminate bias? Clinical Investigation Journal 3(1) 37–47.
  • Tamura et al. (1994) Tamura, R.N., Faries, D.E., Andersen, J.S. and Heiligenstein, J.H. (1994). A Case Study of an Adaptive Clinical Trial in the Treatment of Out-Patients with Depressive Disorder. Journal of the American Statistical Association 89, 768–776.
  • Thall and Wathen (2007) Thall, P.F. and Wathen, J.K. (2007). Practical Bayesian adaptive randomization in clinical trials. European Journal of Cancer, 43, 859–866.
  • Thall et al. (2015a) Thall, P.F., Fox, P.S. and Wathen, J.K. (2015a). Some caveats for outcome adaptive randomization in clinical trials. CRC Press, Boca Raton.
  • Thall et al. (2015b) Thall, P.F., Fox P. and Wathen, J. (2015b). Statistical controversies in clinical research: scientific and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology 26 1621–1628.
  • Therasse et al. (2000) Therasse, P., Arbuck, S.G., Eisenhauer, E.A., Wanders, J., Kaplan, R.S., Rubinstein, L., Verweij, J., Van Glabbeke, M., van Oosterom, A.T., Christian, M.C. and Gwyther, S.G. (2000). New guidelines to evaluate the response to treatment in solid tumors. Journal of the National Cancer Institute, 92(3) 205–216.
  • Thompson (1933) Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 285–294.
  • Torgerson and Campbell (2000) Torgerson, D.J. and Campbell, M.K. (2000). Use of unequal randomization to aid the economic efficiency of clinical trials. BMJ 321 759.
  • Trippa et al. (2012) Trippa, L., Lee, E.Q., Wen, P.Y., Batchelor, T.T., Cloughesy, T., Parmigiani, G. and Alexander, B.M. (2012). Bayesian adaptive trial design for patients with recurrent gliobastoma. Journal of Clinical Oncology 30 3258–3263.
  • Tymofyeyev et al. (2007) Tymofyeyev, Y., Rosenberger W.F. and Hu, F. (2007). Implementing optimal allocation in sequential binary response experiments. Journal of the American Statistical Association 102(477) 224–234.
  • Villar et al. (2015a) Villar, S.S., Bowden, J. and Wason, J. (2015a). Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges. Statistical Science 30(2) 199–215.
  • Villar et al. (2015b) Villar, S.S., Wason J. and Bowden, J. (2015b). Response-adaptive randomization for multi-arm clinical trials using the forward looking Gittins index rule. Biometrics 71(4) 969–978.
  • Villar et al. (2018) Villar, S.S., Bowden, J. and Wason, J. (2018). Response-adaptive designs for binary responses: How to offer patient benefit while being robust to time trends? Pharmaceutical Statistics 17(2) 182–197.
  • Wagenmakers et al. (2008) Wagenmakers, E.J., Lee, M., Lodewyckx, T. and Iverson, G. (2008). Bayesian versus frequentist inference. In Bayesian evaluation of informative hypotheses, 181–207. Springer, New York, NY.
  • Wason and Trippa (2014) Wason, J.M.S. and Trippa, L. (2014). A comparison of Bayesian adaptive randomization and multi-stage designs for multi-arm clinical trials. Statistics in Medicine 33(13) 2206–2221.
  • Wathen and Thall (2017) Wathen, J.K. and Thall, P.F. (2017). A simulation study of outcome adaptive randomization in multi-arm clinical trials. Clinical Trials 14(5) 432–440.
  • Wei (1978) Wei, L.J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal of the American Statistical Association 73 840–843.
  • Wei (1988) Wei, L.J. (1988). Exact two-sample permutation tests based on the randomized play-the-winner rule. Biometrika 75 603–606.
  • Williamson et al. (2017) Williamson, S.F., Jacko, P., Villar, S.S. and Jaki, T. (2017). A Bayesian adaptive design for clinical trials in rare diseases. Computational Statistics and Data Analysis 113 136–153.
  • Williamson and Villar (2020) Williamson, S. F. and Villar, S. S. (2020). A Response‐Adaptive Randomization Procedure for Multi‐Armed Clinical Trials with Normally Distributed Outcomes. Biometrics 76(1) 197–209
  • Zelen (1969) Zelen, M. (1969). Play the Winner Rule and the Controlled Clinical Trial. Journal of the American Statistical Association 64 131–146.
  • Zhang and Rosenberger (2006) Zhang, L. and Rosenberger, W.F. (2006). Response-Adaptive Randomization for Clinical Trials with Continuous Outcomes. Biometrics 62 562–569.
  • Zhang and Rosenberger (2007) Zhang, L. and Rosenberger, W.F. (2007a). Response-adaptive randomization for survival trials: the parametric approach. Applied Statistics 56(2) 153–165.
  • Zhang et al. (2007) Zhang, L., Chan, W.S., Cheung, S.H. and Hu, F. (2007b). A generalized urn model for clinical trials with delayed responses. Statistica Sinica 17 387–409.
  • Zhang et al. (2011) Zhang, L., Hu, F., Cheung, S.H. and Chan, W.S. (2011). Immigrated urn models – theoretical properties and applications. Annals of Statistics 39 643–671.
  • Zhang et al. (2019) Zhang, L., Trippa, L. and Parmigiani, G. (2019). Frequentist operating characteristics of Bayesian optimal designs via simulation. Statistics in Medicine 38 4026–4039.
  • Zhao and Durkalski (2014) Zhao, W. and Durkalski, V. (2014). Managing competing demands in the implementation of response-adaptive randomization in a large multicenter phase III acute stroke trial. Statistics in Medicine 33(23) 4043–4052.
  • Zhou (2008) Zhou, X., Liu, S., Kim, E.S., Herbst, R.S. and Lee, J.J. (2008). Bayesian adaptive design for targeted therapy development in lung cancer - a step toward personalized medicine Clinical Trials 5(3) 181–193.