A Markov Decision Process for Response-Adaptive Randomization in Clinical Trials

In clinical trials, response-adaptive randomization (RAR) has the appealing ability to assign more subjects to better-performing treatments based on interim results. The traditional RAR strategy alters the randomization ratio on a patient-by-patient basis; this has been heavily criticized for bias due to time-trends. An alternate approach is blocked RAR, which groups patients together in blocks and recomputes the randomization ratio in a block-wise fashion; the final analysis is then stratified by block. However, the typical blocked RAR design divides patients into equal-sized blocks, which is not generally optimal. This paper presents TrialMDP, an algorithm that designs two-armed blocked RAR clinical trials. Our method differs from past approaches in that it optimizes the size and number of blocks as well as their treatment allocations. That is, the algorithm yields a policy that adaptively chooses the size and composition of the next block, based on results seen up to that point in the trial. TrialMDP is related to past works that compute optimal trial designs via dynamic programming. The algorithm maximizes a utility function balancing (i) statistical power, (ii) patient outcomes, and (iii) the number of blocks. We show that it attains significant improvements in utility over a suite of baseline designs, and gives useful control over the tradeoff between statistical power and patient outcomes. It is well suited for small trials that assign high cost to failures. We provide TrialMDP as an R package on GitHub: https://github.com/dpmerrell/TrialMDP









1 Introduction

Randomization is a common technique used in clinical trials to eliminate potential bias and confounders in a patient population. Most clinical trials use fixed randomization, where the probability of assigning subjects to a treatment group is held constant throughout the trial. Response-adaptive randomization (RAR) designs were developed for their appealing ability to increase the probability of assigning patients to more promising treatments, based on the responses of prior patients. A major limitation of RAR designs is that the time between treatment and outcome must be short, so that observed outcomes can inform the randomization of future patients.

Traditional RAR designs recompute the randomization ratio on a patient-by-patient basis (Thall and Wathen, 2007), usually after a burn-in period of fixed randomization. However, these designs have been widely criticized (Karrison et al., 2003) because they induce bias when temporal trends are present. Temporal trends are especially likely in long-duration trials: patients’ characteristics may drift over the course of the trial, so that patients enrolled at the beginning differ markedly from those enrolled at the end (Proschan and Evans, 2020). Standard RAR analyses nevertheless assume that the sequence of patients entering the trial represents samples drawn at random from two homogeneous populations, with no drift in the probabilities of success (Proschan and Evans, 2020; Chandereng and Chappell, 2020). This assumption is usually violated. For example, in the BATTLE lung cancer elimination trials (Liu and Lee, 2015), more smokers enrolled in the latter part of the trial than at the beginning.

Despite this serious flaw, little literature addresses the temporal trend issue in RAR designs. Villar et al. (2018) explored hypothesis testing procedures that adjust for covariates, examining how well they correct type-I error inflation and how they affect power in two-armed and multi-armed RAR designs with temporal trend effects added to the model. Karrison et al. (2003) introduced a stratified group-sequential method with a simple example of altering the randomization ratio to address this issue. Chandereng and Chappell (2019) further examined the operating characteristics of the blocked RAR approach for two treatment arms proposed by Karrison et al. (2003). They concluded that blocked RAR provides a good trade-off between ethically assigning more subjects to the better-performing treatment group and maintaining high statistical power. They also suggested using a small number of blocks, since designs with many blocks have low statistical power. However, Chandereng and Chappell (2019) designed trials with equal-sized blocks, which is not generally optimal.

Other works formulate adaptive trial design as a Multi-Armed Bandit Problem (MABP), employing ideas that are often associated with reinforcement learning, such as sequential decision-making and regret minimization. These entail sophisticated algorithms, such as Gittins index computations (Villar et al., 2015, 2015) and dynamic programming (Hardwick and Stout, 1995, 1999, 2002). These works have important limitations. The Gittins index approaches of Villar et al. assume either (i) a fully sequential trial with similar weaknesses to traditional RAR or (ii) a blocked trial with equal-sized blocks. The dynamic programming algorithms of Hardwick and Stout yield allocation rules that (i) are deterministic, (ii) are fully sequential, or (iii) assume a blocked trial with a fixed number of blocks. At the time, Hardwick and Stout’s approaches were also limited by computer speed and memory, both of which have improved dramatically over the years.

This paper presents TrialMDP, an algorithm that designs blocked RAR trials. TrialMDP is most closely related to the MABP-based approaches mentioned above. However, it models a blocked RAR trial as a Markov Decision Process (MDP), a generalization of the MABP. It relies on a dynamic programming algorithm, similar to those of Hardwick and Stout. However, our method differs in that it optimizes the size and number of blocks as well as their treatment allocations. That is, the algorithm yields a policy that adaptively chooses the size and composition of the next block, based on results seen up to that point in the trial. The current version of TrialMDP is tailored for two-armed trials with binary outcomes. Future versions may permit a more general class of trials.

Our paper has the following structure. In Section 2, we describe our problem formulation and algorithmic solution. In Section 3, we compare TrialMDP’s designs with other designs that have been widely used in clinical trials. We use our proposed method to redesign a phase II trial in Section 3.2. We discuss TrialMDP’s limitations and potential improvements in Section 4. Our Supplementary Materials include appendices that justify some of our mathematical and algorithmic choices.

2 Proposed method

2.1 Problem formulation

Class of trials.

In this paper we focus on blocked RAR trials with two arms and binary outcomes. We label the arms $A$ and $B$ (“treatment” and “control”, respectively) and outcomes 0 and 1 (“failures” and “successes”). A trial has access to some number of available patients, $N$. The trial proceeds in $K$ blocks. We require that all results from the current block are observed before the next block begins. Importantly, we allow $K$ to adapt as the trial progresses. This gives the trial useful flexibility: in general, a trial may attain better characteristics if it permits differently-sized blocks.

Let $p_A$ and $p_B$ denote the treatments’ success probabilities. We assume a frequentist test is performed at the end of the trial, with the following null and alternative hypotheses:

$$H_0: p_A = p_B \qquad \text{vs.} \qquad H_1: p_A > p_B$$

We focus specifically on the one-sided Cochran-Mantel-Haenszel (CMH) test, which is well-suited for stratified observations; in our setting, the strata are blocks of patients. It has been argued that blocked RAR trials with CMH tests are more robust to temporal trend effects than, e.g., traditional RAR trials with chi-square tests (Chandereng and Chappell, 2019).
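To make the stratified test concrete, here is a minimal sketch of a one-sided CMH statistic computed over block-wise 2x2 tables. It uses the textbook form of the statistic without a continuity correction and a normal approximation for the p-value; the function name, the tuple layout, and the arm labels "a" and "b" are illustrative assumptions, not taken from the TrialMDP package.

```python
import math

def cmh_one_sided(tables):
    """One-sided CMH z statistic and p-value for a list of 2x2 strata.

    Each stratum is (succ_a, fail_a, succ_b, fail_b). The layout and names
    are illustrative assumptions, not the paper's notation.
    """
    observed = expected = variance = 0.0
    for succ_a, fail_a, succ_b, fail_b in tables:
        n_a, n_b = succ_a + fail_a, succ_b + fail_b
        n = n_a + n_b
        successes = succ_a + succ_b
        failures = n - successes
        if n < 2:
            continue  # the variance term is undefined for fewer than 2 patients
        observed += succ_a
        expected += n_a * successes / n
        variance += n_a * n_b * successes * failures / (n * n * (n - 1))
    if variance == 0.0:
        return 0.0, 0.5  # no stratum carries information
    z = (observed - expected) / math.sqrt(variance)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail, P(Z >= z)
    return z, p_value

# Two blocks (strata), each favouring arm "a".
z, p = cmh_one_sided([(8, 2, 2, 8), (7, 3, 3, 7)])
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

Because each block contributes its own expected count and variance, a shift in the overall success rate between blocks does not by itself move the statistic, which is the intuition behind stratifying by block.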

Figure 1: (A) Contingency table notation. (B) Trial history notation. A history is a sequence of cumulative contingency tables. A subscript indicates a history’s suffix.

Notation: tables and histories.

We establish some notation for clarity. A contingency table has the following attributes: the numbers of patients assigned to each treatment; the numbers of successes for each treatment; the total numbers of failures and successes; and the total number of outcomes recorded in the table. From these counts we obtain point estimates of the treatment success probabilities. See Figure 1 for illustration.

Each block of the trial has its own contingency table with corresponding quantities. We use a subscript to indicate the block. For example, the $k$-th block of the trial has its own table, whose quantities carry the subscript $k$.

At any point we can summarize the state of the trial in a contingency table, $s$, of cumulative results. That is, $s$ contains all of the trial’s observations up to that point; or, put another way, $s$ is the sum of all preceding block-wise contingency tables. We typically refer to $s$ as a state. We use an underline to indicate a quantity computed from a state. For example, after completing $k$ blocks, the underlined quantities are the cumulative counts over those $k$ blocks.

The sequence of states occupied by a trial forms a trial history $h = (s_0, s_1, \ldots, s_K)$, where $s_0$ is always the empty contingency table and $s_K$ always has $N$ observations. We use a subscript to denote the suffix of a history; that is, the sequence of states that follow a given block of the trial. It is useful to think of a history as a random object, subject to uncertainty in the patient outcomes and the values of $p_A$, $p_B$.
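The relationship between block-wise tables, cumulative states, and histories can be sketched as a small data structure. This is only an illustration: the class name `Table`, its field names, and the helper `history` are hypothetical and do not reflect the paper's symbols or the package's internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Table:
    """2x2 contingency table; field names are illustrative assumptions."""
    n_a: int = 0      # patients assigned to treatment
    succ_a: int = 0   # successes on treatment
    n_b: int = 0      # patients assigned to control
    succ_b: int = 0   # successes on control

    @property
    def n(self):
        """Total number of outcomes recorded in the table."""
        return self.n_a + self.n_b

    @property
    def failures(self):
        return self.n - self.succ_a - self.succ_b

    def __add__(self, other):
        return Table(self.n_a + other.n_a, self.succ_a + other.succ_a,
                     self.n_b + other.n_b, self.succ_b + other.succ_b)

def history(blocks):
    """Cumulative states: the empty table, then one state per completed block."""
    states = [Table()]
    for block in blocks:
        states.append(states[-1] + block)
    return states

blocks = [Table(5, 3, 5, 1), Table(7, 5, 3, 1)]   # two block-wise tables
h = history(blocks)
print(h[-1], h[-1].n)   # final cumulative state and its number of observations
suffix = h[1:]          # states occupied after the first block
```

Making the table immutable (`frozen=True`) lets states serve as dictionary keys, which is convenient when tabulating values per state.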

Utility function.

We aim to design blocked RAR trials that balance (i) statistical power and (ii) patient outcomes. We also recognize that each additional block entails a cost in time and other overhead. As such, we wish to avoid an excessive number of blocks. We formalize these goals with the following utility function:

$$U(h) = f_{\text{stat}}(h) - \lambda_F \, f_F(h) - \lambda_K \, f_K(h) \qquad (1)$$

where $f_{\text{stat}}$ is a proxy for the trial’s statistical power; $f_F$ measures the number of failures; and $f_K$ is the number of blocks. This utility function promotes a high statistical power while penalizing failures and blocks. The coefficients $\lambda_F$ and $\lambda_K$ control the relative importance of patient outcomes and blocks, respectively.

The functions , , and have the following forms:

Function simply returns the number of blocks in the trial history. Function quantifies bad patient outcomes (i.e., failures) as a fraction of all patients. It is a function only of the final state, , and becomes small when the estimates and are close.

Function serves as a proxy for the trial’s statistical power. It is crafted such that maximizing also maximizes the power of the Cochran-Mantel-Haenszel test (Cochran, 1954) to an acceptable approximation. Each is the harmonic mean of that block’s treatment allocations. takes larger values when the allocations are balanced; and when are close to each other, and far from . The factor makes consistent across trials with differing sample sizes. See Appendix A of the Supplementary Materials for a more detailed justification of .

Markov Decision Process formulation.

In our effort to maximize the expected utility (Equation 1), we find it useful to model a blocked RAR trial as a Markov Decision Process (MDP). An MDP is a simple model of sequential decision-making. It consists of an agent navigating a state space. At each time-step, the agent chooses an action. Given the agent’s current state and chosen action, the agent transitions to a new state and collects a reward. In general the transition is stochastic, governed by a transition distribution. One solves an MDP by obtaining a policy that maximizes the expected total reward. We refer the reader to Chapter 38 of Lattimore and Szepesvari’s text for detailed information about MDPs (Lattimore and Szepesvari, 2020).

Figure 2: (A) State space. At any point, the state of the trial is summarized by a contingency table of all observations. We can order the set of all contingency tables by their numbers of observations. The trial begins with the empty table and ends when it reaches a table containing every patient’s outcome. (B) Transition distribution. The current state and action induce a distribution for the next state. The next state necessarily contains the current observations plus one outcome per patient in the new block. Its entries are governed by Beta-Binomial distributions, parameterized by the entries of the current contingency table.

We model a blocked RAR trial as an MDP with the following components:

  • State space. In our setting the state space consists of every possible contingency table with observations. We can order the states by their numbers of observations. We let denote the subset of containing tables with exactly observations. The trial always begins at the empty contingency table in and terminates at some table in . The state space grows quickly with , . Figure 2(A) illustrates for .

  • Actions. With each block of the trial we choose an action , the block’s size and allocation. Suppose we have completed blocks; then may take any integer value from 1 to . The allocation is the fraction of patients assigned to treatment in this block. We constrain to a finite set of possible values, . For example, . Importantly, exactly patients (rounded to the nearest integer) are assigned to treatment . In other words, patients are randomized to treatments “without replacement.” Contrast this with other randomized designs—traditional RAR, blocked RAR, etc.—that assign each patient to with independent probability . For example, action implies that the next block will treat 60 patients, assigning exactly of them to treatment and to treatment .

    We let denote the set of all actions, and denote actions available at state .

  • Transition distributions. Given the current contingency table $s$ and the chosen block design $a$, the next contingency table $s'$ is randomly distributed. This randomness consists of two parts: (i) the stochasticity of patient outcomes given the true success probabilities $p_A$ and $p_B$, and (ii) our uncertainty about the values of $p_A$ and $p_B$. Given the true values for $p_A$ and $p_B$, the numbers of successes in each arm of this block would have Binomial distributions, with the number of trials equal to the number of patients the block assigns to that arm.

    However, we only have imperfect knowledge of $p_A$ and $p_B$, encoded in the entries of the current table $s$. We use Beta distributions, parameterized by the entries of $s$, to describe this uncertainty about $p_A$ and $p_B$; each Beta distribution includes smoothing hyperparameters typically set to 1. Together, these two sources of randomness assign independent Beta-Binomial probabilities to the two arms’ success counts, which in turn define the distribution for $s'$. See Figure 2(B) for illustration. We sometimes use the notation $P(s' \mid s, a)$ to indicate the transition distribution for $s'$, given $s$ and $a$.

  • Rewards. Given the current state and the chosen action , the trial transitions to state and receives a reward . In an MDP the goal is to maximize expected total reward. Recall, however, that our ultimate goal is to maximize the expected utility (Equation 1). We craft a reward function consistent with , as follows:


    The total reward for a trial history is identical to the utility (Equation 1) of that trial history. With each block, the reward function produces that block’s contribution to the total utility. This includes the block’s term for ; the block’s cost ; and the final failure penalty when is terminal.

    Notice that our particular is a function only of and . We sometimes write for compactness.

  • Policy. A policy is a function mapping each state in the MDP to an action. In our setting policies are trial designs. For each state in the trial, a policy dictates the design of the trial’s next block: . Our MDP is solved by the optimal policy satisfying

    We let denote the corresponding maximal value at each state .

Casting our problem into the MDP framework helps us design algorithmic solutions. Our particular MDP lends itself to a straightforward dynamic programming approach, since there are no cycles in its directed graph of possible transitions.
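The transition distribution described in the list above can be sketched in a few lines. The code below is an illustration under stated assumptions: states are plain tuples `(n_a, succ_a, n_b, succ_b)`, the Beta parameters are taken to be "successes + alpha" and "failures + beta" with both hyperparameters defaulting to 1, and all function names are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(x, y):
    """Logarithm of the Beta function."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def beta_binomial_pmf(k, n, a, b):
    """P(k successes among n patients) when the success rate is Beta(a, b)."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def transition(state, block_size, frac_a, alpha=1.0, beta=1.0):
    """Distribution over next states after one block.

    state = (n_a, succ_a, n_b, succ_b); returns {next_state: probability}.
    Exactly round(block_size * frac_a) patients go to treatment; note that
    Python's round() sends exact halves to the nearest even integer.
    """
    n_a, s_a, n_b, s_b = state
    m_a = round(block_size * frac_a)
    m_b = block_size - m_a
    dist = {}
    for x in range(m_a + 1):          # new successes on treatment
        p_x = beta_binomial_pmf(x, m_a, s_a + alpha, n_a - s_a + beta)
        for y in range(m_b + 1):      # new successes on control
            p_y = beta_binomial_pmf(y, m_b, s_b + alpha, n_b - s_b + beta)
            dist[(n_a + m_a, s_a + x, n_b + m_b, s_b + y)] = p_x * p_y
    return dist

# From the empty table, a block of 4 patients split evenly between arms.
for nxt, prob in sorted(transition((0, 0, 0, 0), 4, 0.5).items()):
    print(nxt, round(prob, 4))
```

With both hyperparameters equal to 1 and no data yet, each arm's success count is uniformly distributed, which is why every next state in the usage example is equally likely.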

2.2 Solution via dynamic programming

The MDP described in Section 2.1 can be solved by a relatively simple dynamic programming algorithm. This makes our method a close relative of past dynamic programming approaches for trial design (Woodroofe and Hardwick, 1990; Hardwick and Stout, 1995, 1999, 2002). However, our method differs from them in an important respect: we seek to maximize an objective that is a function of the trial history, and not just a function of the final state. Concretely, our objective function (Equation 1) includes and , which are functions of block-wise attributes. Formulating the problem as an MDP gives us the flexibility to consider such an objective.

Recurrence relations.

Like any dynamic programming algorithm, ours divides the problem at hand into subproblems and solves them in an order that efficiently reuses computation. This dependence between subproblems is defined by a set of recurrence relations. In our case we have a single recurrence based on the Bellman equation (Lattimore and Szepesvari, 2020):

$$U^*(s) = \max_{a \in \mathcal{A}(s)} \; \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \big[ R(s, a, s') + U^*(s') \big], \qquad U^*(s) = 0 \text{ for terminal } s \qquad (3)$$

The algorithm computes this recurrence at every state, iterating through the state space in order of decreasing number of observations. In other words, the algorithm starts from the states nearest the end of the trial and works backward, until it finally computes the value for the empty table and terminates. At each state the algorithm also tabulates the maximizing action. This table of optimal actions is the algorithm’s most important output, as it constitutes the optimized trial design. Figure 3 illustrates the algorithm in detail with pseudocode.

The trial design (i.e., policy) yielded by this recurrence is guaranteed to maximize the expected utility (subject to the MDP formulation described in Section 2.1), since our optimization problem has the optimal substructure property. See Appendix B of the Supplementary Materials for more discussion and a proof of optimal substructure.

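Algorithm 1 TrialMDP
1: procedure MainLoop()
2:     initialize tables U, A
3:     for each terminal state s do
4:         U[s] = 0
5:     for each non-terminal state s, in order of decreasing number of observations do
6:         U[s] = -∞
7:         for each action a available at s do
8:             u = 0
9:             for each possible next state s' do
10:                 r = Reward(s, a, s')
11:                 u = u + P(s' | s, a) · (r + U[s'])
12:             if u > U[s] then
13:                 U[s] = u; A[s] = a
14:     return U, A

function Reward(s, a, s')
    r = (this block's contribution to the power proxy) - (cost of one block)
    if s' is terminal then
        r = r - (failure penalty evaluated at s')
    return r

Figure 3: TrialMDP algorithm pseudocode. The algorithm populates tables U and A with optimal utilities and actions, respectively. Tables U and A are indexed by states; i.e., U[s] yields the utility for state s. The for-loop on line 5 iterates through states in order of decreasing number of observations. The for-loop on line 7 iterates through all possible actions for the current state; and the loop on line 9 computes the expectation of the reward plus downstream utility for the current state and action. Function Reward evaluates the reward function given by Equation 2. We use “dot notation” to access the attributes of states and actions.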
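The backward-induction idea can be tried on a toy problem with the sketch below. Several things here are assumptions made only for illustration: the reward is a placeholder (a scaled harmonic mean of the block's allocations, minus a block cost, with a failure penalty at the end) and is not the paper's reward function; the constants, the allocation grid, and all names are hypothetical; and memoized recursion is swapped in for the explicit loop over states in decreasing order, which visits the same subproblems.

```python
from functools import lru_cache
from math import comb, exp, lgamma

N = 8                                   # total patients (tiny, for illustration)
FRACTIONS = (0.25, 0.5, 0.75)           # hypothetical allocation grid
MIN_BLOCK = 2                           # hypothetical minimum block size
FAILURE_COST, BLOCK_COST = 2.0, 0.05    # hypothetical trade-off coefficients

def bb_pmf(k, n, a, b):
    """Beta-Binomial pmf computed through log-gamma for stability."""
    lb = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return comb(n, k) * exp(lb(k + a, n - k + b) - lb(a, b))

def actions(state):
    """All (block size, treatment count) pairs allowed at this state."""
    remaining = N - state[0] - state[2]
    out = set()
    for size in range(MIN_BLOCK, remaining + 1):
        if 0 < remaining - size < MIN_BLOCK:
            continue                    # would strand an undersized final block
        for frac in FRACTIONS:
            m_a = round(size * frac)
            if 0 < m_a < size:
                out.add((size, m_a))
    return sorted(out)

def reward(action, nxt):
    """Placeholder reward, NOT the paper's Equation 2."""
    size, m_a = action
    m_b = size - m_a
    r = (2.0 * m_a * m_b / size) / N - BLOCK_COST   # balance bonus minus block cost
    n_a, s_a, n_b, s_b = nxt
    if n_a + n_b == N:                               # terminal: charge for failures
        r -= FAILURE_COST * (N - s_a - s_b) / N
    return r

@lru_cache(maxsize=None)
def solve(state):
    """Return (optimal expected utility-to-go, optimal action) for a state."""
    n_a, s_a, n_b, s_b = state
    if n_a + n_b == N:
        return 0.0, None
    best_u, best_a = float("-inf"), None
    for size, m_a in actions(state):
        m_b = size - m_a
        u = 0.0
        for x in range(m_a + 1):
            p_x = bb_pmf(x, m_a, s_a + 1, n_a - s_a + 1)
            for y in range(m_b + 1):
                p_y = bb_pmf(y, m_b, s_b + 1, n_b - s_b + 1)
                nxt = (n_a + m_a, s_a + x, n_b + m_b, s_b + y)
                u += p_x * p_y * (reward((size, m_a), nxt) + solve(nxt)[0])
        if u > best_u:
            best_u, best_a = u, (size, m_a)
    return best_u, best_a

value, first_block = solve((0, 0, 0, 0))
print("expected utility:", round(value, 4), "first block (size, n_treatment):", first_block)
```

Calling `solve` on any reachable state returns the design of the next block, so the cache plays the role of the tables U and A. Because every action strictly increases the number of observations, the recursion always terminates.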

Computational expense.

At a high level TrialMDP is a nested loop over every possible state, action, and transition. For each state the algorithm stores a set of values, along with the optimal action. Hence the algorithm uses space. The number of possible actions and transitions varies between states; summing across all states yields total time cost , where is the set of allocation fractions mentioned in Section 2.1.

These complexities apply if we allow the algorithm to consider every possible state and action. However, there are practical ways to prune away states and actions, attaining much lower computational cost without sacrificing much utility. Introducing a minimum block size parameter eliminates all of the states in and ; and reduces the number of possible actions at each remaining state. An additional block increment parameter further constrains the algorithm to states where is an integer multiple of , resulting in a “coarsened” state space. These parameters reduce the algorithm’s space and time cost to and , respectively. See Appendix C of the Supplementary Materials for derivations. We typically set and . Unless specified otherwise, we use . These settings yielded trials with competitive characteristics, without incurring undue computational expense during the evaluations of Section 3.
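How quickly the state space grows can be checked by direct counting. The sketch below counts 2x2 contingency tables per number of observations and sums them over a chosen set of observation counts; the function names are hypothetical, and the coarsened example (keeping only multiples of 4) is a simplified stand-in for the block increment parameter that ignores the minimum block size.

```python
def tables_at_level(n):
    """Number of 2x2 contingency tables with exactly n recorded outcomes."""
    # Pick n_a patients on treatment (the rest on control), then any success counts.
    return sum((n_a + 1) * (n - n_a + 1) for n_a in range(n + 1))

def total_states(n_patients, levels=None):
    """Total states over the given observation counts (default: every count 0..N)."""
    if levels is None:
        levels = range(n_patients + 1)
    return sum(tables_at_level(n) for n in levels)

full = total_states(100)
coarse = total_states(100, levels=range(0, 101, 4))  # assumed increment of 4
print(full, coarse, round(full / coarse, 1))
```

Each level holds (n+1)(n+2)(n+3)/6 tables, so the full count over levels 0 through N equals the binomial coefficient C(N+4, 4) and grows as the fourth power of the number of patients.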

Empirically, we observe a time cost of 5; ; and seconds for trials with 40, 100, and 140 patients respectively. These measurements used a single-threaded implementation of TrialMDP, on a laptop with Intel 1.1GHz CPUs.

3 Evaluation

3.1 Simulation study

We performed a simulation study to compare TrialMDP against established trial designs. At each point in a grid of values for $p_A$ and $p_B$, we ran 10,000 simulated trials using TrialMDP and a suite of baseline designs. The baselines included (i) a 1:1, fixed randomization design; (ii) a traditional Response-Adaptive Randomized (RAR) design; and (iii) a blocked RAR design.

For null scenarios with $p_A = p_B$, we chose an arbitrary sample size of $N = 100$. For alternative scenarios with $p_A > p_B$, we chose $N$ large enough for a 1:1 design to attain a power of 0.8. See Tables 1 and 2 for the exact values of $p_A$, $p_B$, and $N$ used in our simulated scenarios.

The traditional RAR baseline used a 1:1 randomization ratio for the first of patients, and adaptive randomization thereafter according to the procedure used by Rosenberger et al. (2001). That is, the patient was assigned to treatment with probability


The blocked RAR baseline used two blocks of equal size. The first block used a 1:1 randomization ratio; the second block used the same randomization given by Equation 4. This agrees with the blocked RAR procedure described by Chandereng and Chappell (2019).
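A two-block baseline of this kind can be simulated with the sketch below. It uses the square-root allocation rule commonly attributed to Rosenberger et al. (2001); treating that rule as the one in Equation 4 is an assumption here, as are the function names, the add-one smoothing of the point estimates, and the choice to randomize each second-block patient independently.

```python
import random
from math import sqrt

def sqrt_allocation(succ_a, n_a, succ_b, n_b):
    """Probability of assigning a patient to treatment A."""
    p_a = (succ_a + 1) / (n_a + 2)   # smoothed estimates avoid 0/0 (an assumption)
    p_b = (succ_b + 1) / (n_b + 2)
    return sqrt(p_a) / (sqrt(p_a) + sqrt(p_b))

def simulate_two_block_trial(n_patients, p_a, p_b, rng):
    """Block 1 is 1:1; block 2 randomizes with the probability adapted from block 1."""
    half = n_patients // 2
    n_a = half // 2
    n_b = half - n_a
    succ_a = sum(rng.random() < p_a for _ in range(n_a))
    succ_b = sum(rng.random() < p_b for _ in range(n_b))
    prob_a = sqrt_allocation(succ_a, n_a, succ_b, n_b)   # fixed for all of block 2
    for _ in range(n_patients - half):
        if rng.random() < prob_a:
            n_a += 1
            succ_a += rng.random() < p_a
        else:
            n_b += 1
            succ_b += rng.random() < p_b
    return n_a, succ_a, n_b, succ_b

rng = random.Random(0)
print(simulate_two_block_trial(94, 0.3, 0.1, rng))
```

A real analysis would keep the two blocks' tables separate so that the final test can be stratified by block; the sketch returns only cumulative counts to stay short.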

We used TrialMDP to generate trial designs over a grid of parameter settings: . Each parameter setting implies a different balance between statistical power, patient outcomes, and the number of blocks.

We simulated 10,000 trials for every scenario , for each baseline design, and for each TrialMDP parameter setting. As an initial sanity check we visualized some trial histories to see whether the designs behaved as expected. Figure 4 shows some examples. TrialMDP always chose 1:1 allocation for the first block, increasing the allocation to in subsequent blocks when . As increased, the designs reliably increased allocation to , in agreement with our expectations. The baseline designs also yielded trial histories that agreed with our expectations.

Recall that TrialMDP is supposed to optimize the utility function (Equation 1) in expectation. If this were true, we would expect our designs to attain higher utility than the others, averaged over the simulated histories. To verify this we computed the utility for every simulated history and for every design, and tabulated the resulting averages.

Table 1 shows some representative results from the alternative scenarios. These results employed TrialMDP with and . Under these particular parameter settings TrialMDP attained slightly lower power than the other designs, but its superior patient outcomes gave it the greatest utility across all scenarios. Indeed, we found that our algorithm does typically achieve higher average utility than the baseline designs (i) under the alternative hypothesis and (ii) as long as is sufficiently large. When is not large enough, our designs have highest utility among the adaptive designs, but the 1:1 design is mathematically guaranteed to attain highest utility. We show this in Appendix D of the Supplementary Materials.

We highlight the fact that TrialMDP assigned many more patients to the superior treatment on average, in all the scenarios of Table 1. Furthermore, it did so reliably. The 5%-ile for is higher for TrialMDP than for any other adaptive design, in every alternative scenario.

It is also important to note that TrialMDP’s design yielded slightly biased estimates of the effect size in the alternative scenarios. We hypothesize that this bias—on the order of 0.01—stems from the rapidly changing randomization ratio prescribed by TrialMDP. The user ought to weigh this against other matters, such as vastly improved patient outcomes, when considering TrialMDP.

Table 2 shows the corresponding results for null scenarios. Notice that in some cases TrialMDP’s designs showed somewhat inflated type-I error. The percentiles of the allocation difference show that TrialMDP is more prone to creating an imbalanced allocation under the null hypothesis. Another salient observation is the relative decrease in utility for all of the adaptive designs. This has a simple explanation. Under the null hypothesis, a 1:1 design always has optimal utility. A 1:1 design attains maximal and minimal ; and under the null hypothesis, for any design. Hence, every adaptive design will yield lower utility than the 1:1 design.


Figure 4: Simulated trial histories. Each plot traces the treatment allocation of 10,000 simulated trials. Histograms on the right give distributions of final allocations and report the mean. For each of these plots, . (A) Histories for RAR and blocked RAR trial designs. (B) Histories for a TrialMDP design, with parameter settings and . Under these specific settings TrialMDP allocates many more patients on average to the superior treatment.
| p_A | p_B | N | Power (CMH test): 1:1 | Power: RAR | Power: BRAR | Power: MDP | Effect Bias: RAR | Effect Bias: BRAR | Effect Bias: MDP | N_A - N_B (5%, 95%): RAR | N_A - N_B: BRAR | N_A - N_B: MDP | K: MDP | Utility z-score: RAR | Utility z-score: BRAR | Utility z-score: MDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 0.1 | 94 | 0.79 | 0.78 | 0.79 | 0.78 | 0.00 | 0.00 | 0.01 | 16.94 (-4, 36) | 11.22 (-8, 30) | 19.88 (-2, 50) | 3.26 | -60.25 | 6.23 | 8.40 |
| 0.4 | 0.1 | 46 | 0.78 | 0.75 | 0.77 | 0.74 | 0.00 | 0.00 | 0.01 | 6.80 (-6, 18) | 3.98 (-8, 16) | 15.26 (0, 26) | 3.87 | -9.18 | 2.65 | 8.17 |
| 0.4 | 0.2 | 124 | 0.80 | 0.78 | 0.79 | 0.77 | 0.00 | 0.00 | 0.01 | 17.03 (-6, 42) | 11.18 (-10, 32) | 31.77 (-4, 66) | 3.50 | -104.45 | 6.10 | 11.64 |
| 0.5 | 0.3 | 144 | 0.80 | 0.80 | 0.78 | 0.76 | 0.00 | 0.00 | 0.01 | 14.20 (-10, 38) | 9.54 (-12, 32) | 42.23 (-6, 76) | 3.41 | -157.52 | 5.11 | 14.80 |
| 0.7 | 0.4 | 62 | 0.79 | 0.77 | 0.77 | 0.73 | 0.00 | 0.00 | 0.01 | 6.96 (-8, 22) | 4.60 (-10, 18) | 23.03 (0, 34) | 3.85 | -17.25 | 2.89 | 11.22 |
| 0.7 | 0.5 | 144 | 0.81 | 0.79 | 0.79 | 0.77 | 0.00 | 0.00 | 0.01 | 9.22 (-12, 30) | 6.13 (-14, 28) | 42.33 (-6, 76) | 3.34 | -174.98 | 3.16 | 14.78 |
| 0.9 | 0.6 | 46 | 0.79 | 0.78 | 0.78 | 0.76 | 0.00 | 0.00 | 0.01 | 3.72 (-8, 16) | 2.55 (-8, 14) | 13.43 (0, 26) | 3.50 | -9.56 | 1.62 | 6.89 |
| 0.9 | 0.7 | 94 | 0.80 | 0.80 | 0.80 | 0.79 | 0.00 | 0.00 | 0.00 | 4.53 (-12, 20) | 3.08 (-14, 20) | 19.17 (-2, 50) | 3.22 | -73.27 | 1.32 | 7.80 |

Table 1: Simulation study alternative scenarios. Labels RAR, BRAR, and MDP refer to adaptive trials designed by traditional RAR, blocked RAR, and TrialMDP, respectively. The label 1:1 refers to a fixed randomization trial with one-to-one allocation. The “Effect Bias” columns report the average difference between estimated effect size and true effect size. The “N_A - N_B (5%, 95%)” columns show the difference in patient allocation between treatments; they report the mean, with the 5%-ile and 95%-ile in parentheses. K shows the average number of blocks. It only varies for TrialMDP; it is fixed by design for RAR and for BRAR in all scenarios. The “Utility z-score” columns report gain in utility relative to the 1:1 trial design. For these results, TrialMDP used parameter settings and .
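| p_A = p_B | N | Size (CMH test): 1:1 | Size: RAR | Size: BRAR | Size: MDP | Effect Bias: RAR | Effect Bias: BRAR | Effect Bias: MDP | N_A - N_B (5%, 95%): RAR | N_A - N_B: BRAR | N_A - N_B: MDP | K: MDP | Utility z-score: RAR | Utility z-score: BRAR | Utility z-score: MDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 100 | 0.05 | 0.05 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | 0.25 (-22, 24) | 0.14 (-20, 22) | 0.13 (-18, 18) | 2.77 | -542.40 | -5.26 | -9.15 |
| 0.3 | 100 | 0.05 | 0.05 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | -0.18 (-22, 20) | -0.07 (-20, 20) | -0.07 (-32, 32) | 3.77 | -505.00 | -5.14 | -13.45 |
| 0.5 | 100 | 0.05 | 0.05 | 0.05 | 0.06 | 0.00 | 0.00 | 0.00 | 0.05 (-18, 18) | 0.12 (-18, 18) | 0.06 (-46, 46) | 3.88 | -490.50 | -4.96 | -13.36 |
| 0.6 | 100 | 0.05 | 0.05 | 0.04 | 0.05 | 0.00 | 0.00 | 0.00 | 0.03 (-18, 18) | 0.07 (-18, 18) | -0.04 (-42, 42) | 3.85 | -491.29 | -4.98 | -13.31 |
| 0.7 | 100 | 0.05 | 0.05 | 0.05 | 0.06 | 0.00 | 0.00 | 0.00 | 0.11 (-18, 18) | -0.01 (-16, 16) | 0.32 (-36, 36) | 3.68 | -499.43 | -5.12 | -12.64 |
| 0.9 | 100 | 0.05 | 0.05 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | -0.10 (-16, 16) | -0.01 (-16, 16) | -0.02 (-18, 18) | 2.77 | -511.70 | -5.05 | -9.03 |

Table 2: Simulation study null scenarios. We report trial size rather than power; all other columns have the same meaning as in Table 1. For these results, TrialMDP used and .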
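| p_A | p_B | N | Power/Size (CMH test): 1:1 | Power/Size: RAR | Power/Size: BRAR | Power/Size: MDP | Effect Bias: RAR | Effect Bias: BRAR | Effect Bias: MDP | N_A - N_B (5%, 95%): RAR | N_A - N_B: BRAR | N_A - N_B: MDP | K: MDP | Utility z-score: RAR | Utility z-score: BRAR | Utility z-score: MDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4 | 0.4 | 20 | 0.06 | 0.05 | 0.05 | 0.05 | 0.00 | 0.00 | 0.00 | -0.04 (-8, 8) | -0.03 (-8, 8) | 0.01 (-10, 10) | 2.57 | -45.59 | -2.48 | -3.57 |
| 0.8 | 0.4 | 20 | 0.61 | 0.60 | 0.54 | 0.56 | 0.00 | 0.00 | 0.02 | 2.13 (-6, 10) | 1.47 (-6, 8) | 5.09 (-6, 10) | 2.28 | -9.94 | 0.24 | 2.50 |

Table 3: Results of trial redesign. We report the same quantities as in the simulation study. For these results, TrialMDP used and .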

Beyond a one-dimensional comparison of utility, it is useful to compare the designs in two dimensions: statistical power and patient outcomes. As we vary the parameter , TrialMDP designs trials that balance these quantities differently. We visualize this with frontier plots; trial designs are shown as points in two dimensions, with statistical power on the horizontal axis and allocation to on the vertical axis. Higher and to the right is better. Figure 5 gives examples. In some scenarios, TrialMDP’s designs dominate the other adaptive designs, attaining higher power and assigning more patients to treatment . Figure 5(A) is one such case. In other scenarios, TrialMDP’s designs do not dominate the others. However, Figure 5(B) shows that even in those cases, TrialMDP still provides a useful way to control the balance between power and patient outcomes. For example, the user is free to choose a trial design with much better patient outcomes at the cost of slightly lower power, by selecting larger values of .

Figure 5: Frontier plots. (A) Trial design characteristics for scenario . (B) Trial design characteristics for scenario . In each plot the curve traced by TrialMDP corresponds to . TrialMDP used in both plots. The points represent mean power and patient allocations; the error bars show symmetric 90% confidence intervals for the means. Note that there are error bars for the vertical direction, but they are too compact to be seen.

3.2 Trial redesign

We demonstrate TrialMDP’s practical usage by applying it to a historical trial. We chose the phase-II thymoglobulin trial described by Bashir et al. (2012) because it (i) had two arms, (ii) had a small sample size ($N = 20$), and (iii) used an adaptive design, which the trial designers saw fit to adopt for ethical reasons. This combination made the trial well-suited for testing our algorithm.

We redesigned the trial in two phases: “parameter tuning” and “testing.” In the parameter tuning phase we swept through the same grid of values used in our simulation study, but with the sample size fixed at $N = 20$. We ran our algorithm and simulated 10,000 trials at each grid point, and generated frontier plots similar to those in Figure 5. Visual inspection suggested that TrialMDP with and would yield reasonable power and patient outcomes for a variety of scenarios.

In the testing phase we simulated the thymoglobulin trial by computing point estimates of $p_A$ and $p_B$ from the original trial’s results. We simulated two scenarios: a null scenario where $p_A = p_B = 0.4$, and an alternative where $p_A = 0.8$ and $p_B = 0.4$. Using the design from TrialMDP with the “tuned” parameter values, we simulated 10,000 trials for each scenario. The results are aggregated in Table 3. Under the alternative scenario we found that TrialMDP’s design, on average, assigned substantially more patients to the superior treatment, with a slightly decreased power of 0.557. Note also that in the null scenario, TrialMDP’s design had a somewhat inflated type-I error of 0.055.

4 Discussion

Key takeaways.

We presented TrialMDP, an algorithm for designing blocked RAR trials. TrialMDP represents a blocked RAR trial as a Markov Decision Process, and solves for the optimal design via dynamic programming. The resulting design dictates the size and treatment allocation of the next block, given the results observed thus far.

Our algorithm allows users to choose the relative importance of (i) statistical power and (ii) patient outcomes. The trial designs generated by TrialMDP consistently attain superior utility against a suite of baselines when (i) the effect size is large and (ii) patient outcomes are given sufficient importance. The simulation study in Section 3.1 demonstrates this.

TrialMDP has some shortcomings worth keeping in mind. It is currently restricted to a narrow class of trials: two-armed trials with binary outcomes. All outcomes for past blocks must be observed before the next block can begin. The MDP formulation assumes a single statistical test (one-sided CMH) is performed at the end of the trial. While interim analyses may be used in trials governed by the current version of TrialMDP, we provide no guarantees of optimality in that case. TrialMDP’s computational cost grows quickly with the number of patients, and becomes impractical for . Setting large values for the minimum block size and block increment parameters ( and ) can ameliorate some of this expense. Simulations showed that in some scenarios, TrialMDP’s designs have modestly inflated type-I error, and may yield a slightly biased estimate of effect size. These weaknesses should be weighed against the vastly superior patient outcomes TrialMDP can deliver.

Practical recommendations.

The user of TrialMDP immediately faces a question: what values of and should be used? Consider the terms of Equation 1. Since is only a proxy for the statistical power, there isn’t a clear way to assign practical meaning to . For example, we cannot interpret as a literal “conversion rate” between units of failure and units of statistical power. This makes it difficult to set in a principled way. Instead we recommend tuning and through a process like the one demonstrated in Section 3.2: (i) use the algorithm to design trials for a grid of values; (ii) simulate trials for each design, for a set of scenarios ; (iii) examine the simulation results and choose that yield acceptable power and patient outcomes across scenarios. As a starting point, , yielded reasonable characteristics across all the scenarios in this paper.
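Step (iii) of this tuning process can be sketched as a simple filter over simulation summaries. Everything in the snippet is an assumption made for illustration: the parameter pairs, the scenario names, the thresholds and the summary numbers are invented, and the summaries are assumed to be (mean power, mean allocation to the favoured arm) under alternative scenarios.

```python
def acceptable_settings(results, min_power, min_allocation):
    """Return the parameter settings that clear both thresholds in every scenario.

    `results` maps a parameter setting (any hashable, e.g. a tuple) to a dict
    from scenario name to a (power, allocation) pair.
    """
    keep = []
    for setting, by_scenario in results.items():
        ok = all(
            power >= min_power and allocation >= min_allocation
            for power, allocation in by_scenario.values()
        )
        if ok:
            keep.append(setting)
    return sorted(keep)

# Hypothetical simulation summaries for three grid points and two scenarios.
simulated = {
    (1.0, 0.01): {"moderate effect": (0.78, 0.52), "large effect": (0.93, 0.55)},
    (4.0, 0.01): {"moderate effect": (0.72, 0.60), "large effect": (0.90, 0.66)},
    (8.0, 0.01): {"moderate effect": (0.55, 0.68), "large effect": (0.81, 0.74)},
}
chosen = acceptable_settings(simulated, min_power=0.70, min_allocation=0.55)
```

With these invented numbers only the middle grid point survives: the first allocates too few patients to the favoured arm and the last loses too much power. In practice the thresholds would be set by the trial designers, and visual inspection of frontier plots may be preferable to hard cut-offs.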

Future improvements.

Although TrialMDP’s current implementation is single-threaded, the algorithm is highly parallelizable, and we expect a speedup roughly linear in the number of threads. A multi-threaded implementation is a natural next step.

There are multiple ways that TrialMDP could be extended to a broader class of trials. For instance, it could permit more than two arms and more than two outcomes. This would incur exponentially greater computational expense, but may be useful for some very small trials.

The current MDP formulation assumes that the trial terminates after all patients have been treated. A more sophisticated MDP could incorporate interim analyses, accounting for the possibility of early termination for success or futility.

5 Software

We implemented TrialMDP in C++ and provide it as an R package on GitHub:
https://github.com/dpmerrell/TrialMDP. We also provide the code for our Section 3 evaluations in another repository: https://github.com/dpmerrell/TrialMDP-analyses. This includes a Snakemake workflow (Mölder et al., 2021) that reproduces all results in this paper.

6 Supplementary Material

Our Supplementary Material contains four appendices. Appendix A gives our justification for using the function . Appendix B shows that our optimization problem has the optimal substructure property (and hence TrialMDP yields an optimal policy with respect to our MDP assumptions). Appendix C derives the computational complexities given in Section 2.2. Appendix D shows that must be sufficiently large for an adaptive trial to attain higher utility than a single-block trial.


We thank Zhu Xiaojin and Blake Mason for conversations about bandit algorithms. DM was funded by the National Institutes of Health (award T32LM012413).

Conflict of Interest: None declared.


References

  • Q. Bashir, M. F. Munsell, S. Giralt, L. d. P. Silva, M. Sharma, D. Couriel, A. Chiattone, U. Popat, M. H. Qazilbash, M. Fernandez-Vina, R. E. Champlin, and M. J. d. Lima (2012) Randomized phase II trial comparing two dose levels of thymoglobulin in patients undergoing unrelated donor hematopoietic cell transplant. Leukemia & Lymphoma 53 (5), pp. 915–919. https://doi.org/10.3109/10428194.2011.634039
  • T. Chandereng and R. Chappell (2019) Robust Blocked Response-Adaptive Randomization Designs. arXiv:1904.07758 [stat].
  • T. Chandereng and R. Chappell (2020) How to do response-adaptive randomization (RAR) if you really must. Clinical Infectious Diseases.
  • W. G. Cochran (1954) Some Methods for Strengthening the Common chi-squared Tests. Biometrics 10 (4), pp. 417–451.
  • J. P. Hardwick and Q. F. Stout (1995) Exact Computational Analyses for Adaptive Designs. Lecture Notes-Monograph Series 25, pp. 223–237.
  • J. P. Hardwick and Q. F. Stout (1999) Using Path Induction to Evaluate Sequential Allocation Procedures. SIAM Journal on Scientific Computing 21 (1), pp. 67–87.
  • J. Hardwick and Q. F. Stout (2002) Optimal few-stage designs. Journal of Statistical Planning and Inference 104 (1), pp. 121–145.
  • T. G. Karrison, D. Huo, and R. Chappell (2003) A group sequential, response-adaptive design for randomized clinical trials. Controlled Clinical Trials 24 (5), pp. 506–522.
  • T. Lattimore and C. Szepesvari (2020) Bandit Algorithms. Cambridge University Press. ISBN 978-1-108-57140-1.
  • S. Liu and J. J. Lee (2015) An overview of the design and conduct of the BATTLE trials. Chinese Clinical Oncology 4 (3), p. 33.
  • F. Mölder, K. P. Jablonski, B. Letcher, M. B. Hall, C. H. Tomkins-Tinch, V. Sochat, J. Forster, S. Lee, S. O. Twardziok, A. Kanitz, A. Wilm, M. Holtgrewe, S. Rahmann, S. Nahnsen, and J. Köster (2021) Sustainable data analysis with Snakemake. F1000Research 10, p. 33.
  • M. Proschan and S. Evans (2020) Resist the temptation of response-adaptive randomization. Clinical Infectious Diseases 71 (11), pp. 3002–3004.
  • W. F. Rosenberger, N. Stallard, A. Ivanova, C. N. Harper, and M. L. Ricks (2001) Optimal Adaptive Designs for Binary Response Trials. Biometrics 57 (3), pp. 909–913. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.0006-341X.2001.00909.x
  • P. F. Thall and J. K. Wathen (2007) Practical Bayesian adaptive randomisation in clinical trials. European Journal of Cancer 43 (5), pp. 859–866.
  • S. S. Villar, J. Bowden, and J. Wason (2015) Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges. Statistical Science 30 (2), pp. 199–215.
  • S. S. Villar, J. Bowden, and J. Wason (2018) Response‐adaptive designs for binary responses: How to offer patient benefit while being robust to time trends?. Pharmaceutical Statistics 17 (2), pp. 182–197.
  • S. S. Villar, J. Wason, and J. Bowden (2015) Response-Adaptive Randomization for Multi-arm Clinical Trials Using the Forward Looking Gittins Index Rule. Biometrics 71 (4), pp. 969–978.
  • M. Woodroofe and J. Hardwick (1990) Sequential Allocation for an Estimation Problem with Ethical Costs. The Annals of Statistics 18 (3), pp. 1358–1377.

Appendix A Derivation of

Equation 1 uses the following function, , as a proxy for a trial’s statistical power:


This appendix provides some justification for .

We’re interested in blocked RAR trials where the final analysis uses a Cochran-Mantel-Haenszel (CMH) superiority test. Recall that the CMH statistic takes this form:
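For readers who want something concrete, the textbook CMH statistic can be computed as in the sketch below. The table layout and variable names are assumptions made for this example and need not match the notation of the main text. The statistic is the squared sum, over blocks, of observed minus expected successes on one arm, divided by the sum of the corresponding hypergeometric variances. No continuity correction is applied. A one-sided test would use the signed square root of this quantity.

```python
def cmh_statistic(tables):
    """Textbook CMH chi-squared statistic for a sequence of 2x2 tables.

    Each table is (a, b, c, d): arm-1 successes, arm-1 failures,
    arm-2 successes, arm-2 failures, for one block (stratum).
    """
    numerator = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a block this small contributes no information
        row1, row2 = a + b, c + d  # patients per arm
        col1, col2 = a + c, b + d  # total successes and failures
        numerator += a - row1 * col1 / n  # observed minus expected
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return 0.0
    return numerator ** 2 / variance

# Hypothetical single-block example: 8/10 successes versus 3/10.
single_block = cmh_statistic([(8, 2, 3, 7)])
```

Blocks with identical success rates on both arms add nothing to the numerator, which is why the allocation within each block affects the power of the final test.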


Under the null hypothesis, asymptotically. Intuitively, we maximize the power of the test by choosing such that when , the distribution of

has large mean without inflated variance. Our goal is to find an objective function

that, when maximized, yields trial designs with those characteristics.

As a first candidate we may try maximizing the expected value of :

where is the fraction of block ’s patients allocated to . The trial designer has no control over or . So if they wish to maximize this quantity then they may ignore the factor , yielding


as a proxy objective for maximizing power. It’s important to note, however, a subtle property of Expression 5. The denominator is minimized when more patients are allocated to the treatment with more extreme success probability—i.e., success probability closer to 0 or 1. As a result, the maximizer of Expression 5 exhibits a preference toward that treatment. This preference manifested itself in earlier versions of the algorithm, which would do well when , but would do worse when .

As a second candidate, we may try maximizing the related quantity


Cochran uses Expression 6 as a proxy for the power of a CMH test in his original justifications for the CMH statistic (Cochran, 1954). Like Expression 5, the new Expression 8 also exhibits a preference based on extremality of the success probabilities. However, it instead favors the treatment with less extreme success probability, i.e., probability nearer . Versions of the algorithm based on Expression 8 would manifest this preference during simulations. The algorithm would attain superior utility when , but would do worse when .

Note the similarity between Expression 8 and Expression 5. They have identical numerators, and both denominators have the form where is some “combined variance” computed from . They differ precisely in how they compute . This in turn produces their different preferences (toward the treatment with less-extreme and more-extreme success probability, respectively). Neither of these preferences is favorable. We would like a proxy for power that has simpler dependence on and , which are unknown. To that end we propose our final candidate:

or, after squaring,

This new quantity lets , which has neither of the preferences exhibited by Expressions 5 or 8. Of course in practice we don’t know or , so we substitute their MAP estimates at each block:


which is the expression for used in Section 2.1 (up to a factor of ).

Appendix B Optimal Substructure

We show that our optimization problem—maximizing expected utility—possesses the optimal substructure property. In other words, we prove that the recurrence relation (Equation 3) correctly decomposes the problem into subproblems, and reuses their solutions to solve the original problem.

Suppose the algorithm is evaluating for some state , and that it’s already evaluated for every possible successor state of . Let denote the optimal policy, i.e., the one yielding . Then optimal substructure follows from the linearity of our utility function. Assuming is not terminal:

which agrees exactly with the recurrence in Equation 3. A similar computation covers the case when is terminal.

Put another way, our dynamic program’s recurrence relation computes correctly at each state, and will yield the optimal policy .
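The backward-induction idea can be illustrated on a toy problem that is much simpler than TrialMDP's formulation. The sketch assumes one patient per step (rather than variable-size blocks), uniform Beta(1, 1) priors on each arm's success probability, and a terminal utility equal to the total number of successes. None of these choices is taken from the paper; they only make the recurrence easy to read.

```python
from functools import lru_cache

def solve(n_patients):
    """Optimal expected utility of a toy two-armed trial, by backward induction."""

    @lru_cache(maxsize=None)
    def value(s_a, f_a, s_b, f_b):
        if s_a + f_a + s_b + f_b == n_patients:
            # Terminal state: the utility is linear in the final table.
            return float(s_a + s_b)
        # Posterior-mean success probabilities under Beta(1, 1) priors.
        p_a = (s_a + 1) / (s_a + f_a + 2)
        p_b = (s_b + 1) / (s_b + f_b + 2)
        # Expected value of each action, reusing solved successor states.
        q_a = p_a * value(s_a + 1, f_a, s_b, f_b) + (1 - p_a) * value(s_a, f_a + 1, s_b, f_b)
        q_b = p_b * value(s_a, f_a, s_b + 1, f_b) + (1 - p_b) * value(s_a, f_a, s_b, f_b + 1)
        return max(q_a, q_b)

    return value(0, 0, 0, 0)

best_two_patients = solve(2)
```

Memoization via `lru_cache` ensures each state is solved once, which is the practical payoff of optimal substructure: the value at a state depends only on the values of its successors, not on the path taken to reach it.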

Appendix C Computational complexity

We derive the space and time complexities given in Section 2.2.

Let denote the set of all contingency tables containing observations. Define , the set of integers ranging from to in increments of . Then , and the size of the full state space is

The algorithm stores data proportional to , so this gives the space complexity.
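One ingredient of the state-space size can be checked directly: the number of 2x2 contingency tables with a given number of observations. A table is four nonnegative integers summing to n, so a stars-and-bars argument gives C(n + 3, 3) tables, which grows cubically in n. The sketch below verifies this by brute force and sums the counts over an arithmetic grid of table sizes. The grid arguments (`start`, `step`) are hypothetical stand-ins, and the full state in the algorithm may carry more than the table itself.

```python
from itertools import product
from math import comb

def n_tables(n):
    """Number of 2x2 tables with nonnegative integer cells summing to n."""
    return comb(n + 3, 3)

def n_tables_brute_force(n):
    """Same count by explicit enumeration; only sensible for small n."""
    return sum(
        1
        for a, b, c in product(range(n + 1), repeat=3)
        if a + b + c <= n  # the fourth cell is then n - a - b - c
    )

def n_tables_on_grid(n_max, start, step):
    """Total number of tables over sizes start, start + step, ..., up to n_max."""
    return sum(n_tables(n) for n in range(start, n_max + 1, step))

tables_with_five = n_tables(5)
```

Coarsening the grid of reachable sizes (a larger `step`) removes whole layers of tables from the sum, which is consistent with the observation that larger block-size parameters reduce the cost.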

The time complexity results from a nested sum over states, actions, and transitions: