Optimizing Interim Analysis Timing for Bayesian Adaptive Commensurate Designs

05/17/2019
by Xiao Wu, et al.
Harvard University

In developing products for rare diseases, statistical challenges arise due to the limited number of patients available for participation in drug trials and other clinical research. Bayesian adaptive clinical trial designs offer the possibility of increased statistical efficiency, reduced development cost and ethical hazard prevention via their incorporation of evidence from external sources (historical data, expert opinions, and real-world evidence), and flexibility in the specification of interim looks. In this paper, we propose a novel Bayesian adaptive commensurate design that borrows adaptively from historical information and also uses a particular payoff function to optimize the timing of the study's interim analysis. The trial payoff is a function of how many samples can be saved via early stopping and the probability of making correct early decisions for either futility or efficacy. We calibrate our Bayesian algorithm to have acceptable long-run frequentist properties (Type I error and power) via simulation at the design stage. We illustrate our approach using a pediatric trial design setting testing the effect of a new drug for a rare genetic disease.


1 Introduction

The need for more efficient clinical trial methods continues to increase. Developers of new drugs and medical devices are under increasing pressure to control development costs, especially in the clinical testing phase. In the U.S., regulators at the Food and Drug Administration (FDA) have been motivated since December 2016 by the 21st Century Cures Act and corresponding regulatory rule changes in the Prescription Drug User Fee Act (PDUFA) VI. These documents have encouraged FDA to consider Phase II and even Phase III applications that utilize novel statistical methods that borrow from previous clinical data and perhaps even real world evidence (RWE) [wechsler2016pdufa, sobel2018real, jain2019pdufa].

Bayesian clinical trial designs offer the potential advantages of reduced study sample size, increased statistical power, and reductions in cost and ethical hazard [hobbs2011hierarchical]. In this paper, we propose a Bayesian adaptive statistical approach [carlin2008bayesian], implemented using commensurate priors [hobbs2011hierarchical] and utilizing a novel "payoff function" to select an optimal time to perform an interim look at the data. Our Bayesian adaptive approach gets the most out of available data (a) by permitting borrowing from adult data in our pediatric setting, and (b) by allowing the study to terminate early (at the interim look) if the novel treatment emerges as unequivocally better than placebo (an "early win"), or fails to deliver some minimum level of efficacy ("futility"). These features allow a reduction in total trial duration, thus reducing cost and ethical hazard. Statistically, adaptive trials are most easily implemented in a Bayesian framework (see e.g. [berry2010bayesian]), since this framework avoids problems with traditional p-values and "alpha-spending functions" (reviewed by e.g. [demets1994interim]), instead directly computing the probability that each treatment is effective given the available data (a posterior probability calculation). Bayesian procedures also more readily permit incorporation of external evidence (such as historical data and expert opinion) when needed and appropriate.

A specific example motivating this research was the consideration of a pediatric trial design to test the effect of an oral drug for Gaucher disease, a rare genetic disease belonging to the class of lysosomal storage disorders [grabowski2015gaucher]. In 2014, FDA granted approval for this drug as a first-line treatment for adults with Gaucher disease type 1 who have a CYP2D6 extensive, intermediate, or poor metabolizer phenotype, based on two pivotal studies [mistry2015effect, cox2015eliglustat]. In particular, efficacy in treatment-naive patients was demonstrated in the placebo-controlled ENGAGE trial [mistry2015effect], which enrolled patients with Gaucher disease type 1 who were at least 16 years of age, with the primary endpoint being reduction in spleen volume (percent change from baseline). In order to extend the label to treatment-naive children (under age 16), a pediatric study was needed. However, there were significant challenges in conducting an adequately powered placebo-controlled study in the treatment-naive pediatric population, due to very slow expected enrollment, resulting in a high likelihood that the trial would be unable to fully enroll enough patients to achieve acceptable power. We would also expect pediatric patients (especially those assigned to placebo) to face challenges in remaining compliant with the study protocol.

The questions raised by this example motivate us to consider an alternative, adaptive commensurate study design to maximize the information available at an interim analysis (IA). This design cautiously borrows from the adult data when appropriate, and potentially stops the study after enrolling fewer patients without sacrificing statistical validity. In the case of our pediatric study setting, it was reasonable from a clinical perspective to assume that the primary endpoint used to measure the treatment effect in adult populations would still be appropriate for pediatric patients, and that the magnitude of change in the primary endpoint would likely be similar between adults and children in both the placebo and treatment arms. In this situation, common to many pediatric study designs [bavdekar2013pediatric], the commensurate prior approach for incorporating information from historical data [hobbs2012hierarchical, murray2014semiparametric, viele2014use, van2018including, lewisetal2019] can be useful.

An IA is highly desirable in the above example, since a fixed sample size design may require a sample size that is not realistic, with the result that we could not finish the study in a reasonable period of time. On the other hand, although the Bayesian method allows us to assess posterior probabilities of futility and efficacy continuously as the data accumulate, multiple IAs are not desirable, due to the significant cost involved in cleaning the database and making it available for each IA. Therefore, it is important to determine an optimal time to perform the single IA that provides the maximum chance of making a correct early decision. Papers investigating the optimal placement of interim analyses do not appear plentiful in the literature. Togo and Iwasaki [TogoIwasaki2013] proposed a method that seeks to minimize the total expected sample size under a specified treatment effect, and find that, regardless of the effect size, the optimal time for a single IA falls at a fixed fraction of the planned sample size, with the fraction differing between the O'Brien-Fleming-type and the Pocock-type alpha spending functions, where the expected sample sizes were calculated under the fixed treatment effect used for the study power. They also noted that when the true effect size was better or worse than the planned treatment effect, the optimal time would shift. In practice, the timing of an interim analysis for non-Bayesian studies has typically been chosen in the range of 40-60% of the total sample size, based on the number of patients needed for safety assessment and on enrollment estimates, to allow potential savings from early stopping. Yet a goal of clinical trials is often to optimize the tradeoff between costs (e.g., the expected sample size) and benefits (e.g., correct early futility/efficacy decisions at the IA), along with many other considerations that go beyond a standard sample size calculation [anderson2014timing].

The Bayesian paradigm is especially promising for constructing our adaptive framework, since it provides a unified and interpretable language for data collection, inference, and decision making [parmigiani2002modeling]. However, on the Bayesian side, the literature investigating optimal IA timing is even sparser. A rare exception is the work of Huang and Fu [HuangFu2016], who use simulation to estimate the optimal location of a single IA using a utility-based Bayesian adaptive design in a particular dose-response setting. We hope the new Bayesian designs proposed in this paper can soon be applied in future clinical development programs in comparable situations.

The rest of our paper is organized as follows. Section 2 lays out the details of our Bayesian adaptive commensurate prior approach, along with a step-by-step algorithm for its implementation. Section 3 then gives the results of an extensive simulation study to check our method's performance in the pediatric example setting. Our approach is able to obtain sensible optimal look times that maximize a payoff function that essentially measures the weighted conditional probability of early stopping relative to the total expected sample size. Finally, Section 4 summarizes our findings and offers avenues for future research, including alternate definitions of the payoff function.

2 Statistical Methods and Algorithmic Approach

In our approach, we apply Bayesian methods with a commensurate prior to potentially stop a study at a single interim analysis. We use early futility and efficacy criteria based on Bayesian posterior probabilities of a treatment difference reaching pre-specified thresholds [berry2010bayesian], after adaptively borrowing information from historical adult study data, dependent on its similarity to data from the current study. We calibrate our Bayesian procedures to have acceptable long-run frequentist properties (Type I error and power) via computer simulation at the design stage. The optimal timing of the IA is then found by simulation, assessing a grid of plausible time points among all decision criteria that meet certain Bayesian and frequentist properties. This optimization is based on a payoff function that characterizes the benefit/cost ratio of the decision. The proposed payoff function also introduces a weight parameter that allows expert input, including the level of interest and/or confidence in the new treatment, the available budget, and the internal and external competitive environment.

To formalize ideas, let $n_k$ be the sample size for the pediatric study in group $k$, where $k=1$ is the placebo group and $k=2$ is the treatment group. Let $n_{01}$ and $n_{02}$ be the sample sizes for the two historical (adult) groups, respectively. Since we will typically have much more adult data than pediatric, in what follows we treat the adult data as fixed at their observed values. Let $N = n_1 + n_2$ be the maximum total pediatric sample size if the trial runs to completion. We recommend selecting the maximum pediatric sample size to achieve a reasonable power based on a clinically meaningful target treatment effect, in order to allow the study to still have a good chance of achieving its objective in the least favorable case where no information at all can be borrowed from the adult study data. We propose a trial with a single interim look, after $t$ pediatric spleen reductions have been observed. Let $\theta_k$ be the mean reduction in spleen volume for children in group $k$, and let $\theta_{0k}$ be the same quantities for the two historical (adult) groups, respectively. [o2001bayesian] introduced the notion of using two different prior distributions in clinical trial settings: a design prior, a more realistic choice used to evaluate the likely properties of a design, and an analysis prior, a typically more conservative choice that will actually be used when the data are observed. Consider the latter choice first; our analysis prior uses a commensurate prior framework [hobbs2011hierarchical], and assumes

$\theta_k \mid \theta_{0k}, \tau_k \sim N(\theta_{0k}, 1/\tau_k), \quad k = 1, 2, \qquad (1)$

where we assume $\theta_{0k}$ follows a vague prior for the adult percent spleen reductions, e.g. a normal prior whose (large) variance is known. The commensurability parameters (precisions) $\tau_k$ are assigned independent hyperpriors, e.g., the conjugate choices $\tau_k \sim \text{Gamma}(a_k, b_k)$ with small $a_k$ and $b_k$, as relatively vague Gamma specifications.

Turning to the observed data, suppose we assume that $Y_{ki}$ and $Y_{0ki}$, the observed percent reductions for each pediatric and adult patient, are also normally distributed, that is,

$Y_{ki} \sim N(\theta_k, \sigma^2) \quad \text{and} \quad Y_{0ki} \sim N(\theta_{0k}, \sigma_0^2), \qquad (2)$

where the patient index $i$ runs from 1 to $n_k$ or $n_{0k}$, respectively, and we again use vague conjugate priors for $\sigma^2$ and $\sigma_0^2$; e.g., vague Gamma hyperpriors on the precisions $1/\sigma^2$ and $1/\sigma_0^2$.
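To fix intuition for how (1)-(2) drive borrowing, consider the conditional posterior of a pediatric group mean given a fixed commensurability precision. The following minimal Python sketch is a deliberate simplification ($\tau$, $\sigma^2$, and the adult estimate are treated as known, whereas the actual analysis samples them via MCMC); all names and numbers are hypothetical. It shows the prior precision $\tau$ controlling shrinkage toward the adult mean:

```python
import numpy as np

def commensurate_update(y_ped, theta0_hat, tau, sigma2):
    """Conditional posterior of a pediatric mean theta_k under (1)-(2),
    treating the adult estimate theta0_hat, the commensurability
    precision tau, and the sampling variance sigma2 as fixed."""
    n = len(y_ped)
    post_prec = n / sigma2 + tau          # likelihood precision + prior precision
    post_mean = (y_ped.sum() / sigma2 + tau * theta0_hat) / post_prec
    return post_mean, post_prec

# Small tau: vague prior, pediatric data dominate;
# large tau: strong commensurability, heavy borrowing from the adult mean.
y_ped = np.array([22.0, 30.5, 18.2, 27.9])   # hypothetical % spleen reductions
for tau in (0.001, 0.1, 10.0):
    m, p = commensurate_update(y_ped, theta0_hat=25.0, tau=tau, sigma2=100.0)
    print(f"tau={tau:g}: posterior mean {m:.2f}, precision {p:.3f}")
```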

We design and calibrate the trial to have acceptable long-run frequentist properties (Type I error and power) for any given value of the interim look time $t$, and then select the $t$ value that maximizes trial payoff (as defined below). Also, our design uses one interim look to check for early stopping due to success or futility, accounting for commensurability of the adult and pediatric data, but does not consider adjusting the randomization ratio depending on how many adults we are "effectively" borrowing. Such an enhancement would be possible to add to our design [hobbs2013adaptive, normington2018]. Finally, we note that either posterior or predictive distributions can be used for these calculations. For simplicity, we use the former (implemented via MCMC computation in BUGS, R/Stan, or SAS) for our interim stopping rules, as follows:

Early winner:

If at the interim look, the probability that the novel treatment arm ($k=2$) is better exceeds some prespecified probability $\gamma_W$, i.e., if

$P(\theta_2 > \theta_1 \mid D_t) > \gamma_W,$

where $D_t$ denotes the data available at the interim look, then Arm 2 is declared the early winner and the trial is stopped early. We might take $\gamma_W$ to be a fairly high value, so early trial termination is permitted only when evidence for an early winner is overwhelming [anderson2014timing].

Final winner:

Early winner rules are typically paired with corresponding final winner rules, e.g.: If, after all $N$ patients have been randomized and reported results, the probability that the treatment arm is the best exceeds some prespecified probability $\gamma_F$, i.e., if

$P(\theta_2 > \theta_1 \mid D_N) > \gamma_F,$

then Arm 2 is declared the final winner. If, however, the treatment arm cannot meet this criterion, then we do not make a final selection as to "best treatment", and merely summarize the performance of both treatments. We might set $\gamma_F$ as a slightly less demanding threshold than the early winner level $\gamma_W$.

Early futility:

If at the interim look, the probability that the novel treatment arm ($k=2$) is better than some prespecified minimally tolerable response level $\delta$ falls below some prespecified probability $\gamma_{fut}$, i.e., if

$P(\theta_2 > \delta \mid D_t) < \gamma_{fut},$

then the trial is declared futile and is stopped early (i.e., after just $t$ patients). We might set $\delta$ as the minimum reduction in spleen volume from the novel treatment that is clinically relevant, and take $\gamma_{fut}$ fairly small. Thus, if the treatment cannot muster at least a $\gamma_{fut}$ chance of a $\delta$ reduction in spleen volume at our interim look, we will give up on the treatment and stop the trial early for futility.
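In MCMC terms, all three rules are simple tail-probability checks on the posterior draws. A minimal sketch, assuming arrays theta1 and theta2 of draws for the two pediatric group means are available from the fitted commensurate model (the threshold values below are illustrative placeholders, not the calibrated ones):

```python
import numpy as np

def early_winner(theta1, theta2, gamma_w):
    """Stop early for efficacy if P(theta_2 > theta_1 | D_t) > gamma_w."""
    return np.mean(theta2 > theta1) > gamma_w

def early_futility(theta2, delta, gamma_fut):
    """Stop early for futility if P(theta_2 > delta | D_t) < gamma_fut."""
    return np.mean(theta2 > delta) < gamma_fut

def final_winner(theta1, theta2, gamma_f=0.975):
    """Declare a final winner if P(theta_2 > theta_1 | D_N) > gamma_f.
    With gamma_f = 0.975 this matches the 95% equal-tail credible
    interval for theta_2 - theta_1 lying entirely above zero."""
    return np.mean(theta2 > theta1) > gamma_f

# Toy posterior draws, just to exercise the rules:
rng = np.random.default_rng(0)
theta1, theta2 = rng.normal(0, 5, 10_000), rng.normal(20, 5, 10_000)
print(early_winner(theta1, theta2, gamma_w=0.99))   # True for this toy posterior
```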

Algorithm: In summary, our overall algorithm for given choices of the IA time $t$ and the stopping thresholds $(\gamma_W, \gamma_F, \gamma_{fut}, \delta)$ is as follows:

  1. Fix $\theta_1 = \theta_2$, so that the null hypothesis is true (no difference in pediatric spleen volume reduction between treatment and placebo).

  2. Use equation (2) with this fixed $\theta = (\theta_1, \theta_2)$ to generate Monte Carlo pediatric observations $Y_{ki}$, and combine with the actual adult observations $Y_{0ki}$.

  3. Perform the interim look at the data, estimating the posterior precision of the pediatric response in each group using both the pediatric data alone and the full model (commensurate prior with adult historical data); namely, $\hat{\tau}_k(D_t)$ and $\hat{\tau}_k(D_t, D_0)$, where $D_t$ and $D_0$ denote the interim pediatric and full adult data, respectively. If posteriors are being computed using MCMC, these precisions would just be the reciprocals of the sample variances of the $\theta_k$ MCMC samples for the 2 groups and the 2 different models (interim pediatric only vs. full data).

  4. For $k = 1, 2$, compute the effective historical sample sizes

    $EHSS_k = t_k \left( \hat{\tau}_k(D_t, D_0) / \hat{\tau}_k(D_t) - 1 \right),$

    where $t_k$ is the interim pediatric sample size in group $k$, so that $EHSS = EHSS_1 + EHSS_2$ is the total adult effective historical sample size [hobbs2013adaptive]. Check to make sure this is not unacceptably large (say, more than twice as large as $t$, the interim pediatric sample size). A small illustrative sketch of this computation appears after this algorithm.

  5. Use the early winner and futility rules above to see if the trial can stop now; if so, record this and skip the next step.

  6. Use (2) with the same $\theta$ values as in Step 2 to generate the remaining pediatric observations, and then use the "final winner" rule above to see if the trial can now choose a definite winner. Note that this approach is equivalent to using an appropriately sized Bayesian credible interval (BCI) for the pediatric treatment effect $\theta_2 - \theta_1$. For example, with $\gamma_F = 0.975$, the equivalence would be with a 95% equal-tail BCI: if it is entirely above 0, conclude treatment is superior to placebo; if it is entirely below 0, conclude treatment is inferior to placebo; and if it contains 0, fail to conclude superiority of either treatment. The equivalences of $\gamma_W$ and $\gamma_{fut}$ to their corresponding BCIs can be established similarly.

  7. Repeat Steps 2–6 $N_{rep}$ times, and estimate the Type I error of our design as

    $\widehat{\text{Type I error}} = \#\{\text{replications declaring Arm 2 the (early or final) winner}\} / N_{rep}. \qquad (3)$

    Repeat Steps 2–6 and grid search over the choices of $\gamma_W$, $\gamma_F$, and $\gamma_{fut}$ in the stopping rules to identify the study designs with the desired test size (say, 0.05).

  8. Keep $\theta_1$ but change $\theta_2$ to the target treatment effect (or any known value meeting the target efficacy), so that now the alternative hypothesis is true (clinically significant improvement in pediatric spleen volume reduction on treatment as compared to placebo). Repeat Steps 2–7 above, estimating the power of our design using equation (3), and check whether it is above the desired level. If the power is not above the desired level, we can alter the choices of $\gamma_W$, $\gamma_F$, and $\gamma_{fut}$ in the stopping rules and try again; however, to maintain the procedure's Type I error calibration, we can only choose $\gamma_W$, $\gamma_F$, and $\gamma_{fut}$ among the study designs with the desired test size obtained in Step 7. (Otherwise, if no design obtained in Step 7 achieves the desired power, we might instead need to increase $N$, or alter the hyperpriors on the $\tau_k$ so that more strength is borrowed from the historical adult data.)

  9. Rather than fixing $\theta_1$ and $\theta_2$ as in Steps 1 and 8, repeatedly sample them from a particular design prior, for example

    $\theta_k \sim N(\mu_k, \sigma_d^2), \quad k = 1, 2, \qquad (4)$

    for the children, where $\mu_k$ and $\sigma_d^2$ are known, and set $\mu_1 = 0$ and $\mu_2 = \theta_d$, the design-prior mean treatment effect. We again use the actual adult observations, and to be realistic we might set $\theta_d$ smaller than the mean observed reduction in adults, to reflect the plausible situation that the treatment offers a greater benefit to adults than it does to children.

    In all cases, we repeat Steps 2–7 above again, estimating the marginal probabilities of early stopping under our design prior, including early futility and early winner. The numerator (benefit) of the payoff function can be defined under both the null hypothesis and the alternative hypothesis for optimizing beneficial goals (i.e., estimating the marginal probabilities of making correct decisions at the IA: early futility under the null hypothesis and early efficacy under the alternative hypothesis, separately). Define

    $P_{fut} = P(\text{early futility} \mid H_0) \quad \text{and} \quad P_{eff} = P(\text{early win} \mid H_1). \qquad (5)$

    Use these quantities to compute the trial payoff as

    $\text{Payoff}(t) = \frac{w \, P_{fut} + (1 - w) \, P_{eff}}{E[\text{sample size}]}, \qquad (6)$

    where $w \in [0, 1]$ is a preselected weight that trades off the two types of decisions in (5). The denominator (cost) can be interpreted as the expected sample size of the study design.

    An alternative fully Bayesian payoff function computes the marginal probabilities of early stopping, early futility, and early efficacy under the design prior. This redefines $P_{fut}$ and $P_{eff}$ as

    $\tilde{P}_{fut} = P(\text{early futility} \mid \text{design prior}) \quad \text{and} \quad \tilde{P}_{eff} = P(\text{early win} \mid \text{design prior}), \qquad (7)$

    now averaging over the design prior, and again uses these quantities to compute the trial payoff in (6).

  10. Repeat all the steps above (Steps 1–9) across a grid of IA times $t$. Choose the $t$ value that maximizes the Payoff as computed in equation (6). This $t$ is optimal under design prior (4), and the resulting design has correctly calibrated and acceptable Type I error and power by construction. A condensed end-to-end sketch of Steps 7–10 also follows below.
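Referring back to Step 4: once the two posterior precisions are in hand, the EHSS computation is a one-liner. A minimal sketch under the precision-ratio form given above (an assumed reading of [hobbs2013adaptive]; the draws and numbers are made up for illustration):

```python
import numpy as np

def ehss(t_k, prec_full, prec_ped_only):
    """Effective historical sample size for group k: interim pediatric size
    scaled by the relative precision gained from borrowing (Step 4)."""
    return t_k * (prec_full / prec_ped_only - 1.0)

rng = np.random.default_rng(1)
draws_ped = rng.normal(20.0, 6.0, 10_000)   # theta_k draws, pediatric data only
draws_full = rng.normal(22.0, 4.0, 10_000)  # theta_k draws, with adult borrowing
print(ehss(t_k=10, prec_full=1 / draws_full.var(), prec_ped_only=1 / draws_ped.var()))
```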
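To make Steps 7–10 concrete, the following self-contained sketch simulates a drastically simplified version of the design: the sampling SD is treated as known, flat analysis priors replace the commensurate-prior MCMC (so there is no historical borrowing), and all thresholds and constants are illustrative assumptions rather than calibrated values. It nonetheless exercises the full logic of equation (6) and the grid search of Step 10:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2019)
N, SIGMA = 40, 20.0                                # max pediatric size; assumed known SD
DELTA, G_W, G_F, G_FUT = 15.0, 0.99, 0.975, 0.25   # placeholder thresholds

def post_prob_diff(y1, y2, cut=0.0):
    """P(theta_2 - theta_1 > cut | data) under flat priors with known SIGMA."""
    sd = SIGMA * np.sqrt(1 / len(y1) + 1 / len(y2))
    return norm.sf(cut, loc=y2.mean() - y1.mean(), scale=sd)

def post_prob_trt(y2, cut):
    """P(theta_2 > cut | data), used by the futility rule."""
    return norm.sf(cut, loc=y2.mean(), scale=SIGMA / np.sqrt(len(y2)))

def one_trial(t, th1, th2):
    """One simulated trial with a single IA after t of N patients;
    returns (outcome, sample size actually used)."""
    y1, y2 = rng.normal(th1, SIGMA, N // 2), rng.normal(th2, SIGMA, N // 2)
    i1, i2 = y1[: t // 2], y2[: t // 2]
    if post_prob_diff(i1, i2) > G_W:
        return "early_win", t
    if post_prob_trt(i2, DELTA) < G_FUT:
        return "early_futility", t
    return ("final_win" if post_prob_diff(y1, y2) > G_F else "no_win"), N

def payoff(t, w, h1=25.0, n_rep=2000):
    """Equation (6): weighted correct early decisions over expected cost."""
    p_fut = np.mean([one_trial(t, 0.0, 0.0)[0] == "early_futility"
                     for _ in range(n_rep)])              # P_fut under H0
    runs = [one_trial(t, 0.0, h1) for _ in range(n_rep)]
    p_eff = np.mean([o == "early_win" for o, _ in runs])  # P_eff under H1
    e_n = np.mean([n for _, n in runs])                   # expected sample size (under H1 here)
    return (w * p_fut + (1 - w) * p_eff) / e_n

# Step 10: grid search the IA time (every 4 patients) for one choice of w
best_t = max(range(4, N, 4), key=lambda t: payoff(t, w=0.5))
print("optimal single IA after", best_t, "of", N, "patients")
```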

In the next section, we implement this algorithm and use it to determine the optimal timing for an interim analysis in the context of our pediatric study setting.

3 Simulation study

3.1 Simulation settings

In our simulation study, we implement the Bayesian algorithm proposed in Section 2. To illustrate the method, we simulated the historical study data hypothetically from a normal distribution, with the endpoint being % reduction in spleen volume. The historical study sample sizes are assumed fixed, split between the placebo and treated groups, and the simulated historical data have a mean treatment difference of 25 with a corresponding fixed standard deviation. We consider a current pediatric study with a planned sample size of 40 (20 per arm). This sample size will provide reasonable power (at a one-sided Type I error of 0.05) to detect the target treatment difference without considering an interim look or external evidence (i.e., historical data borrowing). We consider potential IA times after 0, 4, 8, ..., 36, and 40 enrollments. We vary the weight $w$ from 0 to 1, to represent the considerations mentioned in Section 2. For example, with $w = 0$, the benefit (numerator of the payoff function) will represent the probability of an early win when the true treatment effect is at the target; stopping early for futility is deemed to have no value. In this case, the highest payoff will maximize the probability of early success at the IA when the drug is working (benefit), while controlling the expected sample size (cost). The use of $w = 0.5$ places equal weight on stopping early for a win and stopping early to give up on the drug, while $w$ close to 1 places heavier emphasis on earlier abandonment of an apparently ineffective drug. This may occur if external information suggests a less favorable profile of the drug, or the emergence of a new competitor drug makes the new treatment less desirable for further development. The use of $w = 1$ is extremely unlikely in practice, as it places no benefit on an early win; we present this outcome only for completeness.

We set the null hypothesis and the target alternative hypothesis as, respectively,

$H_0: \theta_2 - \theta_1 = 0 \qquad \text{and} \qquad H_1: \theta_2 - \theta_1 = 25.$

We choose a minimal efficacy level of $\delta = 15$ for defining futility at the interim look. Under our design prior, we consider four values of the mean treatment effect: $\theta_d = 0$, for scenarios in which we expect the new treatment will show no improved efficacy; $\theta_d = 15$, for scenarios in which we expect the treatment will achieve minimal efficacy; $\theta_d = 25$, for scenarios in which we expect the treatment will achieve the same high efficacy as the adult (historical) study; and finally $\theta_d = 35$, for surprising scenarios in which we expect the new treatment will achieve even higher efficacy than that seen in the historical adult study.

3.2 Simulation results

Figure 1 shows that our algorithm controls power at the desired level when we calibrate the Type I error (one-sided, at 0.05) by finding a suitable choice of $\gamma_W$ with $\gamma_F$ and $\gamma_{fut}$ fixed. This represents a notable boost in power compared to the standard frequentist method without historical data borrowing. Table 1 shows the amount of effective historical sample size borrowed from the historical study under the different design priors. Note that in general we borrow more from the placebo group than the treated group, since under the design prior we tend to believe the placebo arms are similar between adults and children: both are untreated and should show no reduction in spleen volume. The amount of borrowing for the treated arm, by contrast, depends strongly on the specification of the design prior and the true treatment effect; note for example the extensive borrowing from treated patients even for later IA times when $\theta_d = 25$ in Table 1. If on the other hand we believe the effects of the drug on children are quite different from those in adults (other values of $\theta_d$ in the table), we tend to rely less on the historical data.

Figure 2 and Table 2 give our main results for the payoff function defined in (5)-(6). Figure 2 presents the values of the payoff function under different scenarios. It is clear that each estimated payoff curve has a maximal point, which can be interpreted as the optimal time for an interim analysis. In general, the optimal IA times under the specified payoff function fall within the range of recruiting 40-70% of patients, depending on the choice of design prior. One interesting observation is that when $\theta_d = 15$, where we expect the treatment to achieve only minimal efficacy, the study design gives the latest IA time compared to the other scenarios. This is not surprising: in this marginal case where the current study achieves only minimal efficacy, more data (a longer wait) are needed to make a clear decision regarding early wins and losses. That is, this is the case that is most difficult for our commensurate prior framework to handle, as the decision whether or not to borrow is not clear-cut. In contrast, under the surprisingly high efficacy scenario ($\theta_d = 35$), we observe the earliest optimal IA timing, reflecting the investigator's high confidence in the treatment's effectiveness in the current pediatric trial. The no efficacy ($\theta_d = 0$) and high efficacy ($\theta_d = 25$) scenarios give roughly identical optimal IA timings, yet it is worth noting that the IA stopping mechanisms differ: under no efficacy, early futility dominates the IA decision, while under high efficacy, the early stops are due to early winners.

Table 2 also illustrates the impact of the weight $w$ on our pediatric study design. We see the optimal IA time is relatively insensitive to the choice of $w$, likely because the early win and early futility probabilities both increase monotonically as the IA moves later, and at roughly comparable rates in this example. We do still see a slight trend towards earlier optimal times when $w$ is larger, which corresponds to placing greater importance on early stopping for futility. Overall, the insensitivity to $w$ is something of a relief, since its choice is somewhat subjective and thus difficult in practice. However, one should be cautious: in scenarios where the rates of change over time in the probabilities of early win and early futility are not similar, the choice of $w$ may have a larger impact on the optimal IA timing, and a careful choice of $w$ then needs to be made based on external information. The results for the alternative fully Bayesian payoff function are presented in Appendix A1.

Figure 3 plots the expected sample size for implementations of our study under different design priors. Expected sample size was defined as the expectation of the sample size as we either conduct the IA and then stop, or conduct the IA and then continue to recruit patients until the completion of the study. We find that the minimum expected sample size occurs at an intermediate IA time. Under our Bayesian adaptive design, the IA timing that minimizes expected sample size occurs earlier than those shown in [TogoIwasaki2013], likely due to the information borrowed from the historical data. However, as demonstrated in Section 2, minimizing the expected sample size need not be the only criterion for finding optimal IA timing. Rather, our Bayesian adaptive design takes the view of maximizing a payoff function that characterizes the benefit/cost ratio. Under the optimal IA timing obtained by maximizing our specified payoff function, savings in expected sample size were still observed, outperforming the study designs presented in [TogoIwasaki2013] with a single IA; again, the reason is that our Bayesian adaptive design effectively borrows from historical information. We also note that our simulations reveal much larger Monte Carlo standard errors (SEs) associated with the expected sample size estimation for early IA times (left side of Figure 3), yet the SEs shrink to 0 as the IA time increases, reflecting the fact that later IAs provide progressively more accurate estimation of cost as they approach full enrollment.

4 Discussion and Future Work

In this paper, we have used a Bayesian commensurate prior formulation to design a clinical trial with an optimally placed single interim look. Our goal was to move beyond simple optimality criteria that involve only the overall expected sample size to criteria that measure not just the savings resulting from persons not enrolled in the study, but also the gains to the sponsor from making a correct decision as soon as possible. While the goal of an optimally designed clinical trial is to be able to declare study success as soon as possible when the drug is working, it is also important to stop the trial as soon as possible when the drug is not working. The weight $w$ in the payoff function allows the project team to assess the relative importance of these two actions and reach an ethical decision based on existing information. Also, since the design controls both Type I and Type II errors overall, the proposed payoff function is intended to make the chance of a "weighted" correct decision as high as possible, as early as possible.

Our findings suggest optimal IA times tend to differ from the optimal time based on cost alone (i.e., when the expected sample size is smallest). The optimal time may be earlier when the treatment effect is unequivocal (either very small or very large), or when greater importance is placed on early stopping for futility (higher $w$ values). By contrast, equivocal treatment effects (i.e., close to those deemed minimally clinically significant) or a higher emphasis on early stopping for efficacy (the "early winner") lead to later optimal IA times.

Our payoff function (6) resulted from a hybrid of Bayesian and frequentist ideas, which we do not view as inappropriate in a field where methods that are formally Bayesian but also required to have good frequentist properties are routinely used. Unlike existing methods that calculate cost based only on a fixed alternative hypothesis, our estimate of the cost (the denominator), based on a Bayesian design prior, allows a more realistic estimate of the cost. For an actual trial design, the expected cost should be assessed under all plausible scenarios (based on existing knowledge) to assess their impact on optimal IA timing.

Still, while useful, our payoff function is fairly ad hoc: in equation (6), $P_{fut}$ is an estimate of the probability of an "early loss", $P_{eff}$ is a corresponding estimate of the probability of an "early win", the expected-sample-size denominator is driven by the probability of early stopping for any reason, and $w$ is a weight that trades off the two early stop probabilities.

Suppose we define two more Bayesian posterior probability estimates [cheng2007optimal]: $P_{LW}$, the probability of a "late win" (a final winner declared only at trial completion), and $P_{LL}$, the probability of a "late loss" (no winner at trial completion).

Let us now think of gain and cost on a purely financial (i.e., dollar) scale. Obviously we would need help from the trial sponsor to do this, but we could consider a range of possibilities. Define $g_{EW}$, $g_{LW}$, $g_{EL}$, and $g_{LL}$ as the gains associated with an early win, a late win, an early loss, and a late loss, respectively.

We might take $g_{LL} < 0$, so that the "gain" from a late loss is actually negative, corresponding to the financial loss associated with having to postpone development of other drugs while we waited for this one to fail. Similarly, we would surely take $g_{LW} > 0$, but would also take $g_{LW} < g_{EW}$, due to the missed opportunity to sell the drug while we waited for the trial to run to completion.

Next, let $c$ be the trial's per-patient cost. Then the cost of an early decision is $ct$ and that of a late decision is $cN$, since patients cost the same regardless of whether we win or lose. Thus the Bayesian expected net gain for the trial is

$G(t) = P_{EW} g_{EW} + P_{EL} g_{EL} + P_{LW} g_{LW} + P_{LL} g_{LL} - c \left[ t (P_{EW} + P_{EL}) + N (P_{LW} + P_{LL}) \right],$

where $P_{EW}$ and $P_{EL}$ denote the early win and early loss probabilities defined earlier.
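A minimal sketch of this expected net gain, with purely hypothetical dollar figures and outcome probabilities (in practice the probabilities would come from the design-prior simulations of Section 2):

```python
# Hypothetical per-outcome gains (dollars); g_LW < g_EW reflects delayed
# market entry, and g_LL < 0 reflects the postponed pipeline.
g = {"early_win": 8e7, "late_win": 6e7, "early_loss": 0.0, "late_loss": -1e7}
c = 5e4                                   # assumed per-patient cost

def expected_net_gain(t, N, p):
    """G(t): probability-weighted gains minus c times the expected sample
    size, where early decisions cost c*t and late decisions cost c*N."""
    gain = sum(p[k] * g[k] for k in g)
    e_n = t * (p["early_win"] + p["early_loss"]) + N * (p["late_win"] + p["late_loss"])
    return gain - c * e_n

# Toy outcome probabilities at one candidate IA time (they sum to 1)
p = {"early_win": 0.45, "early_loss": 0.20, "late_win": 0.25, "late_loss": 0.10}
print(expected_net_gain(t=20, N=40, p=p))
```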

Once again, we could choose the location of the IA to maximize this posterior expected net gain, instead of the Payoff function in (6). We hope such an investigation will be the subject of a future manuscript.

Acknowledgements

The work of all three authors was supported in part by Sanofi Pharmaceuticals. The authors are grateful to Dr. Xun Chen for initial discussions that influenced the direction of this work. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.

Conflict of interest

None

References

Figures and tables

Figure 1: IA timing (%) vs. Type I and Type II error. We calibrate the Type I error (one-sided, at the 0.05 level) by grid searching for a suitable $\gamma_W$ value with $\gamma_F$ and $\gamma_{fut}$ fixed. We show that by choosing suitable stopping rules at the design stage, the Type II error is also controlled at its desired level (one minus the power). The grey dashed line represents the target level.
IA time   $\theta_d = 0$   $\theta_d = 15$   $\theta_d = 25$   $\theta_d = 35$
30% 17.81 / 6.02 17.97 / 14.99 17.81/ 17.99 17.89/ 14.77
40% 17.08 / 3.76 17.16 / 13.18 17.13/ 17.28 17.12/ 12.93
50% 15.98 / 2.52 16.18 / 11.45 16.29/ 16.41 16.12/ 11.19
60% 15.07 / 1.87 15.17 / 10.03 15.37/ 15.32 15.16/ 9.61
70% 14.19 / 1.60 14.14 / 8.62 14.36 / 14.46 14.25 / 8.46
80% 13.24 / 1.36 13.26 / 7.56 13.62 / 13.50 13.29 / 7.50
90% 12.55 / 1.21 12.49 / 6.52 12.76 / 12.63 12.51 / 6.62
Table 1: Effective historical sample sizes (EHSS) borrowed from the historical study at the interim look (Placebo/Treated), by design-prior mean treatment effect $\theta_d$.
Figure 2: IA timing vs. Payoff. Each panel presents the values of the payoff function for a different weight $w$, under the four design priors with mean effect size: 1) $\theta_d = 0$, no efficacy; 2) $\theta_d = 15$, minimal efficacy; 3) $\theta_d = 25$, high efficacy; and 4) $\theta_d = 35$, surprisingly high efficacy.
effect size $\theta_d$   (columns: increasing weights $w$, from $w = 0$ to $w = 1$)
0 20 (50%) 20 (50%) 20 (50%) 20 (50%)
15 28 (70%) 28 (70%) 24 (60%) 24 (60%)
25 20 (50%) 20 (50%) 20 (50%) 20 (50%)
35 16 (40%) 16 (40%) 16 (40%) 16 (40%)
Table 2: Optimal IA timing (number of patients, with percent of maximum) chosen under different design priors and different choices of the weight $w$ in the payoff.
Figure 3: IA timing vs. expected sample size. Expected sample size was defined as the expectation of the sample size as we either conduct the IA and then stop, or conduct the IA and then continue to recruit patients until the completion of the study. Under the optimal IA timing obtained by maximizing our specified payoff, savings in expected sample size were observed.

Appendix

A1 Fully Bayesian payoff function

In this appendix, we include the simulation results for our fully Bayesian payoff function defined in (7). These results are given in Figure A1 and Table A1. Interestingly, although the shapes of the payoff curves change dramatically, the optimal IA times chosen by the two payoff functions are generally comparable. In particular, we still see the latest optimal IA times when we expect the treatment to achieve only minimal efficacy, and the earliest optimal IA times when we expect very high efficacy. Almost flat payoff curves are observed for the weight $w = 0$ under the design prior corresponding to no efficacy, and for the weight $w = 1$ under the design priors corresponding to high or very high efficacy. This is because under such scenarios, the payoff function only rewards the early stopping reason that is unlikely to happen (e.g., under the design prior with no efficacy, an early win would be very unusual).

Figure A1: IA timing vs. fully Bayesian payoff. Each panel presents the values of the payoff function for a different weight $w$, under the four design priors with mean effect size: 1) $\theta_d = 0$, no efficacy; 2) $\theta_d = 15$, minimal efficacy; 3) $\theta_d = 25$, high efficacy; and 4) $\theta_d = 35$, surprisingly high efficacy.
effect size $\theta_d$   (columns: increasing weights $w$, from $w = 0$ to $w = 1$)
0 16 (40%) 20 (50%) 20 (50%) 20 (50%)
15 32 (80%) 32 (80%) 28 (70%) 24 (60%)
25 20 (50%) 20 (50%) 20 (50%) 4 (10%)
35 16 (40%) 16 (40%) 16 (40%) 0 (0%)
Table A1: Optimal IA timing (number of patients, with percent of maximum) chosen under different design priors and different choices of the weight $w$ in the fully Bayesian payoff.

A2 Additional data tables

We also include four tables (Tables A2-A5) that give detailed results from our simulation study under the four true states of nature ($\theta_d = 35$, $25$, $15$, and $0$). Information from these tables was used in the construction of the figures in the main paper.

IA time  $P_{fut}$  $P_{eff}$  IA_stop  Type I error  Power  $\gamma_W$  $\gamma_{fut}$  $\gamma_F$  EHSS (Placebo/Treated)
0 0.13 0.10 0.21 0.052 0.923 0.990 0.250 0.975 20.96 / 20.58
4 0.23 0.15 0.35 0.050 0.923 0.990 0.250 0.975 20.19 / 19.56
8 0.42 0.29 0.63 0.049 0.927 0.992 0.250 0.975 18.87 / 17.01
12 0.58 0.47 0.84 0.051 0.929 0.990 0.250 0.975 17.89 / 14.77
16 0.69 0.62 0.94 0.051 0.932 0.986 0.250 0.975 17.12 / 12.93
20 0.79 0.72 0.98 0.049 0.933 0.984 0.250 0.975 16.12 / 11.19
24 0.84 0.79 0.99 0.048 0.940 0.982 0.250 0.975 15.16 / 9.61
28 0.88 0.85 1.00 0.050 0.947 0.978 0.250 0.975 14.25 / 8.46
32 0.91 0.90 1.00 0.049 0.950 0.974 0.250 0.975 13.29 / 7.50
36 0.93 0.94 1.00 0.052 0.956 0.968 0.250 0.975 12.51 / 6.62
40 0.94 0.96 1.00 0.050 0.957 0.966 0.250 0.975 11.61 / 5.86
Table A2: Results under the surprisingly high efficacy design prior ($\theta_d = 35$). $P_{fut}$ and $P_{eff}$ are the marginal probabilities of early futility under the null hypothesis and early efficacy under the alternative hypothesis, respectively. IA_stop is the probability of stopping the study at the IA. $\gamma_W$, $\gamma_{fut}$, and $\gamma_F$ are the early winner, early futility, and final winner thresholds, respectively.
IA time  $P_{fut}$  $P_{eff}$  IA_stop  Type I error  Power  $\gamma_W$  $\gamma_{fut}$  $\gamma_F$  EHSS (Placebo/Treated)
0 0.13 0.10 0.13 0.052 0.923 0.990 0.250 0.975 21.09 / 20.83
4 0.23 0.15 0.22 0.050 0.923 0.990 0.250 0.975 20.20 / 20.04
8 0.42 0.29 0.41 0.049 0.927 0.992 0.250 0.975 18.75 / 18.64
12 0.58 0.47 0.62 0.051 0.929 0.990 0.250 0.975 17.81 / 17.99
16 0.69 0.62 0.79 0.051 0.932 0.986 0.250 0.975 17.13 / 17.28
20 0.79 0.72 0.87 0.049 0.933 0.984 0.250 0.975 16.29 / 16.41
24 0.84 0.79 0.92 0.048 0.940 0.982 0.250 0.975 15.37 / 15.32
28 0.88 0.85 0.96 0.050 0.947 0.978 0.250 0.975 14.36 / 14.46
32 0.91 0.90 0.98 0.049 0.950 0.974 0.250 0.975 13.62 / 13.50
36 0.93 0.94 0.99 0.052 0.956 0.968 0.250 0.975 12.76 / 12.63
40 0.94 0.96 0.99 0.050 0.957 0.966 0.250 0.975 11.75 / 11.74
Table A3: Results under the high efficacy design prior ($\theta_d = 25$). $P_{fut}$ and $P_{eff}$ are the marginal probabilities of early futility under the null hypothesis and early efficacy under the alternative hypothesis, respectively. IA_stop is the probability of stopping the study at the IA. $\gamma_W$, $\gamma_{fut}$, and $\gamma_F$ are the early winner, early futility, and final winner thresholds, respectively.
IA time  $P_{fut}$  $P_{eff}$  IA_stop  Type I error  Power  $\gamma_W$  $\gamma_{fut}$  $\gamma_F$  EHSS (Placebo/Treated)
0 0.13 0.10 0.10 0.052 0.923 0.990 0.250 0.975 20.92 / 20.26
4 0.23 0.15 0.15 0.050 0.923 0.990 0.250 0.975 20.14 / 19.46
8 0.42 0.29 0.25 0.049 0.927 0.992 0.250 0.975 18.76 / 16.86
12 0.58 0.47 0.37 0.051 0.929 0.990 0.250 0.975 17.97 / 14.99
16 0.69 0.62 0.51 0.051 0.932 0.986 0.250 0.975 17.16 / 13.18
20 0.79 0.72 0.61 0.049 0.933 0.984 0.250 0.975 16.18 / 11.45
24 0.84 0.79 0.68 0.048 0.940 0.982 0.250 0.975 15.17 / 10.03
28 0.88 0.85 0.75 0.050 0.947 0.978 0.250 0.975 14.14 / 8.62
32 0.91 0.90 0.82 0.049 0.950 0.974 0.250 0.975 13.26 / 7.56
36 0.93 0.94 0.88 0.052 0.956 0.968 0.250 0.975 12.49 / 6.52
40 0.94 0.96 0.90 0.050 0.957 0.966 0.250 0.975 11.54 / 5.82
Table A4: Results under the moderate efficacy design prior ($\theta_d = 15$). $P_{fut}$ and $P_{eff}$ are the marginal probabilities of early futility under the null hypothesis and early efficacy under the alternative hypothesis, respectively. IA_stop is the probability of stopping the study at the IA. $\gamma_W$, $\gamma_{fut}$, and $\gamma_F$ are the early winner, early futility, and final winner thresholds, respectively.
IA time  $P_{fut}$  $P_{eff}$  IA_stop  Type I error  Power  $\gamma_W$  $\gamma_{fut}$  $\gamma_F$  EHSS (Placebo/Treated)
0 0.13 0.10 0.15 0.052 0.923 0.990 0.250 0.975 20.86 / 18.89
4 0.23 0.15 0.26 0.050 0.923 0.990 0.250 0.975 20.18 / 16.93
8 0.42 0.29 0.44 0.049 0.927 0.992 0.250 0.975 18.66 / 10.36
12 0.58 0.47 0.60 0.051 0.929 0.990 0.250 0.975 17.81 / 6.02
16 0.69 0.62 0.72 0.051 0.932 0.986 0.250 0.975 17.08 / 3.76
20 0.79 0.72 0.82 0.049 0.933 0.984 0.250 0.975 15.98 / 2.52
24 0.84 0.79 0.87 0.048 0.940 0.982 0.250 0.975 15.07 / 1.87
28 0.88 0.85 0.93 0.050 0.947 0.978 0.250 0.975 14.19 / 1.60
32 0.91 0.90 0.96 0.049 0.950 0.974 0.250 0.975 13.24 / 1.36
36 0.93 0.94 0.98 0.052 0.956 0.968 0.250 0.975 12.55 / 1.21
40 0.94 0.96 0.99 0.050 0.957 0.966 0.250 0.975 11.46 / 1.15
Table A5: Results under the no efficacy design prior ($\theta_d = 0$). $P_{fut}$ and $P_{eff}$ are the marginal probabilities of early futility under the null hypothesis and early efficacy under the alternative hypothesis, respectively. IA_stop is the probability of stopping the study at the IA. $\gamma_W$, $\gamma_{fut}$, and $\gamma_F$ are the early winner, early futility, and final winner thresholds, respectively.