Improving interim decisions in randomized trials by exploiting information on short-term outcomes and prognostic baseline covariates

04/09/2019, by Kelly Van Lancker, et al.

Conditional power calculations are frequently used to guide the decision whether or not to stop a trial for futility or to modify planned sample size. These ignore the information in short-term endpoints and baseline covariates, and thereby do not make fully efficient use of the information in the data. We therefore propose an interim decision procedure based on the conditional power approach which exploits the information contained in baseline covariates and short-term outcomes. We will realise this by considering the estimation of the treatment effect at the interim analysis as a missing data problem. This problem is addressed by employing specific prediction models for the long-term endpoint which enable the incorporation of baseline covariates and multiple short-term endpoints. We show that the proposed procedure leads to an efficiency gain and a reduced sample size, without compromising the Type I error rate of the procedure, even when the adopted prediction models are misspecified. In particular, implementing our proposal in the conditional power approach allows earlier decisions relative to standard approaches, whilst controlling the probability of an incorrect decision. This time gain results in a lower expected number of recruited patients in case of stopping for futility, such that fewer patients receive the futile regimen. We explain how these methods can be used in adaptive designs with unblinded sample size reassessment based on the inverse normal p-value combination method to control type I error. We support the proposal by Monte Carlo simulations based on data from a real clinical trial.


1 Introduction

Statistical rules to guide the decision of whether or not to stop a clinical trial early for futility and to adapt the sample size are often based on the conditional power (e.g. Halperin et al.; Lachin). This monitoring approach quantifies the probability of rejecting the null hypothesis at the end of the study based on a chosen statistical test, given the primary endpoint data observed thus far and an assumption about the future primary endpoint data (e.g. the effect size used when powering the study). While this methodology is easy to implement and typically well understood by the clinical team, it ignores information in short-term endpoints and baseline variables that can improve the precision of treatment effect estimates. Such increased precision implies a higher probability of stopping early for futility in the absence of a treatment effect, a reduction in average sample size after sample size reassessment and a gain in time.

In view of the above, recent research has focused on incorporating predictive baseline covariates (Qian et al.) and short-term measurements in interim analyses (e.g. Stallard; Hampson and Jennison; Niewczas et al.). This brings several challenges. First, the additional information accrued between the interim analysis and the final analysis may involve observations of subjects who already contributed information to the interim analysis. This makes it more challenging to maintain the independent increments property of the Brownian motion structure used to justify the conditional power approach (Lachin). Second, the incorporation of baseline covariates and short-term endpoints requires the postulation of statistical models. This raises concerns that their misspecification may result in bias (Austin et al.; Assmann et al.; Pocock et al.).

In this paper, we overcome the two concerns mentioned above. To do so, the precision of the treatment effect estimator at the interim analysis is optimized by incorporating short-term endpoints as well as baseline covariates. We realise this by viewing the estimation of the treatment effect at the time of the interim analysis as a missing data problem. This problem can be addressed by making use of specific prediction models for the long-term endpoint which, besides multiple short-term endpoints, can also incorporate baseline covariate information. This methodology has the surprising and appealing feature that it delivers a consistent treatment effect estimator even when the prediction models are misspecified. Of all estimators with this property, the considered one is the most efficient (provided that all models are correct). The proposed estimator is closely related to one given in the appendix of Qian et al. In contrast to Qian et al., we justify embedding the interim test statistic based on this interim estimator in the conditional power approach by relating it to the general work of Scharfstein et al. on information-based design and monitoring procedures. This extension of the conditional power approach allows earlier stopping for true futility whilst controlling the probability of incorrectly stopping. This generally leads to a reduction in the number of recruited patients in the case of stopping for futility, such that fewer patients receive the futile regimen.

Moreover, to further allow for modifying the design of the remaining study if the trial is not stopped early for futility, we next extend the method to adaptive designs with unblinded sample size reassessment based on conditional power arguments (Bauer and König). Since sample size adaptations can inflate the type I error (Bretz et al.), we use the adaptive combination test proposed by Bauer and by Bauer and Köhne, with the inverse normal method as the combination function (Lehmacher and Wassmer, 1999). We also justify the independence of the stage-wise test statistics used in the combination test by relating to the general work of Scharfstein et al. We support the proposal with simulation studies based on data from a real clinical trial.

1.1 Motivating Example

The motivating phase clinical trial was designed to evaluate the efficacy of a new experimental treatment for multidrug-resistant tuberculosis on top of the standard of care regimen (referred to as background regimen, BR) as compared to placebo plus BR, with regard to the proportion of subjects with a favorable treatment outcome (ClinicalTrials.gov, NCT00449644 [7]), defined as confirmed culture conversion weeks after randomization. Besides the clinical endpoint of interest, (predictive) baseline covariates as well as confirmed culture conversion (cure) weeks after randomization were planned to be measured. The available prior phase b data suggested that the early measurements are reasonably predictive of the long-term measurements, in the sense that for the majority of subjects achieving the primary endpoint (culture conversion at week ), culture conversion is expected to have occurred by week .

An interim analysis was planned to be performed prior to enrollment completion to evaluate whether the data collected up to that point contain any evidence for superiority and, if not, the trial would be stopped for futility. Since many enrolled patients may have no information on the primary endpoint available at the time of the interim analysis, restricting the interim analysis to those patients with long-term information available may result in a lack of information to support futility decisions. To have a reasonable chance of detecting true futility early, we will include information on baseline covariates and short-term endpoints in interim analyses of the long-term endpoint, and thereby make fully efficient use of the information in the data. Guided by this motivating example, the paper focuses on binary endpoints. However, the proposed approach is applicable to other endpoints (e.g. continuous and survival endpoints).

2 Setting and Estimation

2.1 Trial Design

Consider a study design which intends to collect i.i.d. data {(, , , ), }, with the binary primary endpoint, a correlated short-term endpoint and a vector of prognostic baseline covariates (e.g. age, gender, …) for each patient, who is randomly assigned to either experimental treatment () or control (). For pedagogic purposes, we consider a single short-term endpoint , but all results extend directly to the general case with multiple short-term endpoints. Let () correspond to the pre-planned number of patients in each treatment arm and the planned sample size. Define () as the probability of a successful outcome at the end of the trial in the experimental () and control () arm. The primary hypothesis of interest, , will be tested against the one-sided alternative, at level with power . To evaluate this hypothesis, the test statistic for the difference in marginal sample proportions of the two arms is compared to :

where and denote the estimated probability of a successful outcome at the end of the trial in the experimental and control arm, respectively.
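The final-analysis test described above is the standard one-sided z-test for a difference in proportions. As a reference point, a minimal sketch using only the standard library (the function name and the unpooled variance choice are ours, not the paper's):

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportions(y_treat, y_ctrl):
    """One-sided z-test for the difference in success proportions.

    y_treat, y_ctrl : sequences of 0/1 primary-endpoint outcomes per arm.
    Returns (z, one-sided p-value); reject the null when z >= z_{1-alpha}.
    Uses the unpooled variance estimate; a pooled version is equally common.
    """
    n1, n0 = len(y_treat), len(y_ctrl)
    p1, p0 = sum(y_treat) / n1, sum(y_ctrl) / n0
    se = sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    z = (p1 - p0) / se
    return z, 1 - NormalDist().cdf(z)
```

The null hypothesis is rejected when the returned z-statistic exceeds the critical value, e.g. 1.96 for a one-sided test at the 2.5% level.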

2.2 Effect Estimates at the Interim Analysis

Consider an interim analysis of the primary endpoint and assume that patients are continuously recruited during the course of the trial. Since not all patients have full data observed at the interim analysis, we use to denote whether is already observed () or not (), to denote whether is already observed () or not (), and to denote the missingness status of : if and only if is observed (i.e. if and only if patient has been enrolled in the study at the time of the interim analysis). The estimation of the effect at an interim analysis can then be seen as a missing data problem, where we can distinguish four cohorts of patients: a first cohort of patients for whom all data are available ((, , )=(, , )), a second cohort of patients who passed the timepoint at which is evaluated but not yet the one at which is evaluated, and thus for whom only and are observed ((, , )=(, , )), a third cohort of patients for whom only is observed ((, , )=(, , )), and a fourth cohort of patients who have not yet been recruited ((, , )=(, , )). Although the end-of-study outcome under the experimental treatment is only seen for the treated patients in cohort , under the assumption that recruitment occurs randomly (i.e. independent censoring holds, in the sense that (, , ) is independent of , and the potential outcomes of and ), the missing outcomes for the other recruited patients on treatment can be unbiasedly predicted in large samples (Tsiatis). Similar to the closely related estimator described in the appendix of Qian et al., this can be done as follows:

  1. regress on and among the patients in cohort of the treatment arm ( and ) using a canonical generalized linear working model for the conditional mean of : , where is a known function, evaluated at a parameter with unknown population value ; for example ,

  2. use this regression model to predict the outcome under the experimental treatment for the treated patients in cohort (, and ) based on their observed baseline covariates and short-term endpoint as ,

  3. then, regress the combination of these fitted values for the treated patients in cohort and the observed values for the treated patients in cohort () on the baseline covariates using a canonical generalized linear working model for the conditional mean of : , where is a known function, evaluated at a parameter with unknown population value ; for example ,

  4. and use this regression model to predict for the treated patients in cohort based on their observed baseline covariates as .

An interim estimator of is obtained by taking the average of the observed -values for the treated patients in cohort , the predicted values based on and for the treated patients in cohort , and the predicted values based on for the treated patients in cohort . A theoretical derivation of this estimator and its properties are given in Appendix A. The interim estimator for the outcome under control is obtained analogously by applying the same steps to the untreated patients. If the study design does not intend to collect an earlier measurement , the first two steps can be omitted since there are then no patients in cohort . The quantity in step then reduces to ; this then involves fitting a model for given among the (un)treated patients in cohort .
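The four steps above can be sketched in code. This is an illustrative single-arm implementation using linear (OLS) working models for simplicity; the paper's procedure uses canonical generalized linear working models (e.g. logistic regression for a binary endpoint), and all function and variable names here are our own:

```python
import numpy as np

def fit_lm(X, y):
    """OLS working-model fit with intercept; returns the coefficient vector."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def predict_lm(beta, X):
    """Predictions from an OLS working model with intercept."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def interim_arm_mean(Y, S, X, has_Y, has_S):
    """Interim estimate of the success probability in one treatment arm.

    Y, S         : primary and short-term endpoints (used only where observed)
    X            : (n, p) baseline covariates for all recruited patients in the arm
    has_Y, has_S : boolean observation masks; cohort 1 has both endpoints,
                   cohort 2 has only S, cohort 3 has neither endpoint yet.
    """
    c1 = has_Y & has_S
    c2 = has_S & ~has_Y
    c3 = ~has_S
    # Step 1: regress Y on (S, X) among cohort 1
    b1 = fit_lm(np.column_stack([S[c1], X[c1]]), Y[c1])
    # Step 2: predict Y for cohort 2 from its observed (S, X)
    yhat2 = predict_lm(b1, np.column_stack([S[c2], X[c2]]))
    # Step 3: regress observed Y (cohort 1) and fitted values (cohort 2) on X
    b2 = fit_lm(np.vstack([X[c1], X[c2]]), np.concatenate([Y[c1], yhat2]))
    # Step 4: predict for cohort 3 from baseline covariates only
    yhat3 = predict_lm(b2, X[c3])
    # Average observed and predicted outcomes over all recruited patients
    return np.concatenate([Y[c1], yhat2, yhat3]).mean()
```

Applying the same function to each arm and taking the difference of the two resulting means yields the interim treatment effect estimate.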

As a result of the random recruitment and simple randomisation, the estimators and have the appealing feature that misspecification of the outcome models in steps and does not introduce bias in large samples (see Appendix A.2). Moreover, when the outcome models are correctly specified, these estimators are asymptotically efficient in the subclass of estimators that are unbiased as soon as (, ), and are independent of respectively , (, ) and (, , ) (e.g., Yang and Tsiatis, 2001; Tsiatis, 2006; Moore and van der Laan, 2009; Stallard, 2010; Qian, 2018).

In order to calculate the variance of the treatment effect estimator, one must take into account that the predictions are estimated based on particular outcome regression models. It is therefore not sufficient to compute the variance of the predictions and observed values of the primary endpoint. In Appendix A.3, we show that this can be easily accommodated under randomization and random recruitment by calculating the asymptotic variance of , denoted , as times the sample variance of the values

(1)

with the observed randomization probability, and and denoting the total number of recruited patients at the time of the interim analysis. We refer the interested reader to Appendix A.3 for more details.

2.2.1 Precision Gain based on Baseline Covariates and Short-Term Endpoints

As mentioned before, in large samples, our proposed estimator is never outperformed by standard analyses that use only information on the short-term and/or primary endpoint. The proposed estimator also reduces to these standard analyses if no baseline covariates are available. The magnitude of the precision gain itself depends on different characteristics of the interim analysis and the available data. First, the predictivity of the short-term endpoints and baseline covariates plays a very important role: the more predictive they are, the more information is available, and thus the larger the gain in efficiency relative to, respectively, standard analyses that use only information on the primary endpoint and standard analyses that use only information on the primary and short-term endpoint. Nonetheless, adjusting for a prognostic baseline covariate alone always leads to a larger precision gain than adjusting for an equally prognostic short-term outcome alone, because of the larger number of partially observed patients (cohorts and ) compared to the number of patients in cohort only (Qian et al.). Second, the recruitment rate and the time at which the interim analysis is conducted also matter, since they influence the number of patients in each cohort. For example, the relative benefit that results from incorporating baseline covariates and short-term endpoints is generally smaller at later interim analyses since more primary endpoint data are then available. Lastly, the time at which the short-term endpoint is measured is crucial. Assuming two equally predictive short-term endpoints, adjusting for the earlier measured endpoint will be more advantageous than adjusting for the one measured later, since it will be available for more patients.

3 Interim Analysis

3.1 Futility stopping Based on Conditional Power

To calculate conditional power, we need to define how far through the trial we are at the time of the interim analysis. This can be expressed in terms of the information fraction , which is defined as the fraction of information available at the time of the interim analysis versus the expected information at the end of the study; i.e. the fraction of the expected variance of the treatment effect estimator at the end of the study versus the variance of the interim estimator. Define and as the -statistic of the treatment effect at information fraction and at the end of the trial, respectively, and let denote, for random variables and , that is independent of . For any test satisfying the independent increments property, i.e. , the conditional power can be computed using the Brownian motion structure in combination with the -value (Lan and Wittes), which is defined as , :

with the assumed drift parameter. This parameter reflects the expected -score at the final analysis based on the assumption made about the data to be observed in the remainder of the study, e.g.  when the effect size used for powering the study is used as assumption for the remaining (unobserved) primary endpoint data. To decide whether or not to stop for futility, the conditional power is compared to some threshold. If the conditional power is below this threshold, the trial is stopped for futility; otherwise the trial is continued.
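Under the Brownian-motion model, the conditional power has the familiar closed form in terms of the B-value b(t) = z_t √t and the drift θ (the expected final z-score under the assumed effect). A minimal sketch using only the standard library (the function name and the one-sided α = 0.025 default are our choices, not the paper's):

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_t, t, theta, alpha=0.025):
    """Conditional power under the Brownian-motion (independent increments) model.

    z_t   : interim z-statistic at information fraction t
    t     : information fraction, 0 < t < 1
    theta : assumed drift, i.e. the expected final z-score under the
            assumption made about the remaining data
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)   # final critical value
    b_t = z_t * sqrt(t)               # B-value at time t
    # P(final z >= z_alpha | B(t) = b_t), with drift theta over (t, 1]
    return 1 - nd.cdf((z_alpha - b_t - theta * (1 - t)) / sqrt(1 - t))
```

If the returned value falls below the pre-specified futility threshold, the trial is stopped; otherwise it continues. Note that for a very early interim look with no observed signal (z_t = 0, t near 0), the conditional power reduces to the design power.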

The traditional conditional power approach uses a test statistic based on the treatment-specific sample averages at the time of the interim analysis, and , which only use the complete cases (i.e., ). It seems appealing to employ instead the interim test statistic based on the more efficient estimator introduced in the previous section, which also incorporates information on the short-term outcome and the baseline covariates. Note that the test statistic for the primary analysis (at the end of the study), , coincides with the standard test statistic which only incorporates information on the primary endpoint. Since the conditional power is computed using standard Brownian motion arguments, the independent increments property needs to be satisfied. A potential concern is that the additional information accrued between the interim analysis and the final analysis involves additional observations of subjects who already contributed information to the interim analysis. Building on Scharfstein et al., we show in Appendix B that the independent increments assumption is nonetheless (asymptotically) maintained when the prediction models are correctly specified, since our test is then semiparametric efficient. Interestingly, we show in Appendix B that this is even true when the prediction models are misspecified, assuming that the treatment is randomly assigned and the recruitment is random.

The information fraction is now calculated as the fraction of the variance of the estimator at the end of the study and the variance of the interim estimator . This is not readily available since at the time of the interim analysis, the variance of the final estimator is unknown as is only available for the patients in cohort (). Nevertheless, under the assumption of random recruitment, , the variance of the final estimator can be estimated using only the complete cases since they then form a representative sample. This allows estimating the variance as times the sample variance of in cohort .

3.2 Sample Size Reassessment Based on Conditional Power

When the trial is not stopped early for futility at the interim analysis, one may still consider modifying the sample size based on the unblinded treatment effect. After sample size adaptations, the usual test statistic at the end of the study, which simply pools the outcome data obtained before and after the sample size adjustment, cannot be applied since its naive use might inflate the type I error rate (Bretz et al.). The adaptive -value combination test proposed by Bauer and by Bauer and Köhne makes it possible to combine the independent test statistics and based on the data observed before (stage ) and after (stage ) the interim analysis, respectively, while guaranteeing Type I error rate control. Moreover, the two-stage procedure corresponding to the combination test provides a simple way to reassess the sample size.

The combination test for the final analysis can be obtained with the weighted inverse normal combination function (Lehmacher and Wassmer, 1999), which can be written as

where denotes the standard normal cumulative distribution function and a pre-specified weight which can be chosen arbitrarily between and . It follows from Lehmacher and Wassmer that, no matter the choice of weight , rejecting the null hypothesis when leads to a test at level when and are independent. A natural choice for the weight is the originally planned information fraction at which the interim analysis is performed, since the inverse normal combination test using normally distributed test statistics then coincides with a group sequential design test statistic at the final analysis when the sample size remains as pre-planned [23]. The interim test statistic is used as the test statistic corresponding to the first stage, since it is based on all data observed before the interim analysis. Now, define as the naive standard (unweighted) test statistic for the primary endpoint at the end of the trial after sample size re-estimation, and as the fraction of information available at the interim analysis relative to the information available at the end of the study after sample size reassessment; i.e. the fraction of the variance of the treatment effect estimator at the end of the trial after sample size re-estimation versus the variance of the interim estimator. We propose , the test statistic based on the data to be observed in the remainder of the study, to equal , which is asymptotically independent of (see Appendices B and C.1) and guarantees that equals if there are no adaptations. In particular, the proposed adaptive combination test can be written as
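As a sketch of the decision rule, with the stage weights fixed at design time from the planned information fraction t1 (the function names are ours, not the paper's):

```python
from math import sqrt
from statistics import NormalDist

def combination_z(z1, z2, t1):
    """Weighted inverse-normal combination of independent stage-wise
    z-statistics (Lehmacher and Wassmer, 1999), with weights sqrt(t1)
    and sqrt(1 - t1) fixed at the planned information fraction t1."""
    return sqrt(t1) * z1 + sqrt(1 - t1) * z2

def reject(z1, z2, t1, alpha=0.025):
    """One-sided level-alpha decision of the adaptive combination test."""
    return combination_z(z1, z2, t1) >= NormalDist().inv_cdf(1 - alpha)
```

For example, with t1 = 0.5 two moderate stage-wise statistics z1 = z2 = 2.0 combine to √0.5·2 + √0.5·2 ≈ 2.83, which exceeds the 2.5% one-sided critical value of 1.96. Because the weights are fixed at design time, the level is preserved even when the second-stage sample size depends on the first-stage data.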

Using this combination test makes it possible to perform a sample size reassessment during the course of the trial. The total sample size is adapted so as to make the conditional power equal to the pre-specified design power of , assuming the effect size used for powering the study for the remaining (unobserved) primary endpoint data (Bauer and König). Based on the principle that the a priori fixed weights derived from the planned information fraction at the interim analysis are used for the calculation of the final test statistics, rearranging the formulae of Bauer and König (see Appendix C.2) leads to

(2)

where corresponds to the total sample size after adaptation. Note that indicates that the new experimental treatment is significantly better and no further recruitment is needed. Considering that the total sample size is bounded below by the number of patients already recruited at the interim analysis, we then obtain

(3)

with the total number of patients recruited at the interim analysis. If a sample size decrease is not allowed, the new sample size is obtained by taking the minimum of and the second part of formula (3). In practice, sample size reassessment may be done by carrying forward the observed treatment effect instead of the effect size used for powering the study, i.e. by replacing in the previous formulas by .
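As an illustration of conditional-power-based reassessment, the second-stage sample size can be obtained by solving the conditional power equation for the number of remaining patients. The sketch below uses a generic two-arm comparison with an assumed effect `delta` and outcome standard deviation `sigma`; it is not the paper's exact Bauer-König rearrangement in formulas (2)-(3), and all names are our own:

```python
from math import sqrt, ceil
from statistics import NormalDist

def second_stage_n_per_arm(z1, t1, delta, sigma, alpha=0.025, power=0.90):
    """Per-arm second-stage sample size giving conditional power = `power`
    under the inverse-normal combination test with fixed weight sqrt(t1).

    z1    : first-stage z-statistic
    t1    : planned information fraction at the interim analysis
    delta : assumed treatment effect for the remaining data
    sigma : assumed outcome standard deviation
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(power)
    # Second-stage critical value implied by the combination test:
    # reject iff Z2 >= (z_alpha - sqrt(t1) * z1) / sqrt(1 - t1)
    c2 = (z_alpha - sqrt(t1) * z1) / sqrt(1 - t1)
    if c2 <= 0:
        return 0  # the first stage alone already guarantees rejection
    # Solve 1 - Phi(c2 - delta * sqrt(n2 / 2) / sigma) = power for n2
    return ceil(2 * (sigma * (c2 + z_beta) / delta) ** 2)
```

A strong first-stage result shrinks the required second stage (and a sufficiently large z1 drives it to zero), whereas a weak interim signal inflates it; in practice the result is capped by a maximal feasible sample size.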

3.3 Blinded Information Fraction

Usually, the interim analysis is implemented at the time at which the pre-planned information fraction is reached. This information fraction can be calculated in either a blinded or an unblinded way. Unblinding the treatment assignment code, however, requires an independent data monitoring committee to review the accumulating data from the beginning of the trial. We therefore recommend approaching the operational planning in an alternative way that preserves blinding. As before, one can calculate the information fraction in a blinded way by estimating the fraction of the variances of the treatment effect estimator at the end of the study versus during monitoring of the trial. The estimates of the two variances should now be based on the blinded data, without making use of the observed data on treatment. Under the null hypothesis () and the assumption that , the variance of the estimator for the treatment effect at the end of the study can be blindly estimated as times the sample mean of over all patients in cohort , with (see Appendix D). Assuming that and , the blinded estimate of , the variance of the estimator for the treatment effect during monitoring of the trial, is obtained by following the same steps as in Section 2.2 but now supposing that everyone is in the same group. Formulas and their derivation are included in Appendix D.

4 Simulation Study

We conducted a simulation study to examine the finite-sample performance of the proposed interim estimator for making early decisions based on the conditional power approach as well as for reassessing the sample size. The impact of the short-term endpoint and baseline covariates on the one hand, and of model misspecification on the other, was investigated. We evaluate the expected total number of subjects, the expected number of subjects needed at the interim analysis, as well as the type I error and power.

4.1 Data Generating Distributions for Simulation Studies Based on TMC207-C208 Stage 2

To determine a realistic data generating model, the phase b data of the motivating example were employed. This dataset consists of participants, in each arm, with observed primary endpoint data. We generate a simulated trial of -depending on the simulation settings- hypothetical participants using the original dataset as follows. We resample with replacement from the original study population of patients and extract only the baseline covariate information from the subjects in the resampled population. Then, treatment and control are assigned with probability to each hypothetical participant. The short-term endpoint , measured months after randomization, is predicted based on a Bernoulli distribution with probability determined by a logistic regression model for fitted on the original phase b dataset. The primary endpoint data, measured months after randomization, are generated under different scenarios depending on the simulation study. Unless otherwise stated, we consider the same recruitment scenario as in the phase b dataset, where on average patients per month enter the study.
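The resampling scheme described above can be sketched as follows. The two probability models passed in are placeholders for the logistic regressions fitted on the original phase b data, and all names are our own:

```python
import numpy as np

rng = np.random.default_rng(2019)

def simulate_trial(baseline, n, p_short, p_primary):
    """Generate one hypothetical trial by resampling baseline covariates.

    baseline  : (m, p) array of baseline covariates from the source dataset
    n         : number of hypothetical participants
    p_short   : function (X, A) -> success probabilities of the short-term endpoint
    p_primary : function (X, A, S) -> success probabilities of the primary endpoint
    (both probability models stand in for the fitted logistic regressions)
    """
    idx = rng.integers(0, len(baseline), size=n)  # resample with replacement
    X = baseline[idx]
    A = rng.integers(0, 2, size=n)                # 1:1 randomisation
    S = rng.binomial(1, p_short(X, A))            # short-term endpoint
    Y = rng.binomial(1, p_primary(X, A, S))       # primary endpoint
    return X, A, S, Y
```

Each simulation scenario then corresponds to a different choice of `p_primary`, e.g. a null model for the futility settings and models with varying covariate predictivity for the superiority settings.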

Respectively and ( for futility scenarios) Monte Carlo simulations were performed for the different settings evaluating the proposal for the conditional power approach and the sample size reassessment. The cut-off values to decide whether or not to stop for futility at the interim analysis are based on the O’Brien-Fleming futility boundaries assuming a total power of and a one-sided (see [25]).

4.2 Simulations on Conditional Power

4.2.1 Without Short-Term Measurements

First, we conducted a simulation study to show that the conditional power based on the proposed interim estimator outperforms the standard conditional power approach when adjusting for predictive baseline covariates. For pedagogic purposes, only one continuous covariate that might be predictive for cure was selected based on the phase b data. The resulting model -allowing for different degrees of predictivity- was employed to simulate the primary endpoint data for each subject (), , where with , and the coefficients correspond to those obtained from a logistic regression of on the continuous covariate fitted on the original data. The proportion of the variance in that is predictable from is shown in Table 1 (see Appendix E).

Table 1: Empirical -squared values and probability of success (under the alternative hypothesis) for the data generated under the settings without a short-term endpoint. Simulation results are based on Monte Carlo replications.
  • Note: corresponds with the proportion of the variance in that is predictable from (estimated using the formulas in Appendix E); and , probability of success in the treatment and control arm, respectively, under superiority.

The sample size of patients in each arm corresponds to the required sample size to obtain a power of under a one-sided significance level of for the scenario with . Note that fixing the sample size implies a different power under each setting ( for ) since the different scenarios correspond to different treatment effects ( in Table 1).

First, the interim analysis is conducted at the fixed point in time when of the patients are expected to be recruited ( patients in cohort , patients in cohort , patients in cohort ). In Figure 1, the proposed conditional power approach incorporating baseline covariates is compared with the standard conditional power approach for the different scenarios under true futility and true superiority, respectively. A comparison between both methods shows that incorporating baseline covariates leads to a free upgrade: a higher information fraction as well as a higher probability to stop for true futility (Figure 1(a)) are obtained with negligible loss of power (Figure 1(b)). The former is a consequence of the higher cutoff value corresponding to a higher information fraction. Also note that, in general, a lower power (corresponding to higher and ) implies a higher probability to stop incorrectly for futility and a slightly higher power loss. Comparing the different scenarios shows that more predictive baseline covariates induce an increase in the information fraction and in the probability to correctly stop for true futility. This is due to the efficiency gain from adjusting for baseline variables [18].

Figure 1: (a) True Futility: comparison of the information fraction and the probability to stop for futility between the standard conditional power approach and the conditional power using the proposed interim estimator, for different scenarios under true futility. (b) True Superiority: comparison of the information fraction and the loss of power compared to the design power between the two approaches, for different scenarios under superiority. The interim analysis is conducted at a fixed point in time.

Second, an interim analysis is conducted when of the information in the simulated datasets is obtained (this information fraction is determined based on only for the standard conditional power, and based on , and for the proposed method), in order to evaluate the time gain compared to the standard conditional power. Table 2 shows that this information fraction is reached earlier for the proposal, resulting in fewer recruited patients at the time of the interim analysis. Moreover, when using the proposed method, more predictive covariate(s) (higher ) lead to an earlier implementation of the interim analysis. The upper part of the table shows that the probability to stop incorrectly for futility as well as the loss of power is comparable between the two methods, while the lower part shows that the probability to stop correctly for true futility is similar. Thus, by using the proposal, fewer patients need to be recruited to make a decision with the same probability to correctly stop for true futility and with the same loss of power as with the standard method (power loss ).

Superiority:
Model Method Days Recruited Prob. to Stop Power Loss
Proposal 1288 74% 0.6% 0.24%
Standard CP 1294 74% 0.5% 0.16%
Proposal 1275 73% 0.8% 0.48%
Standard CP 1294 74% 0.8% 0.44%
Proposal 1243 72% 2.2% 0.96%
Standard CP 1294 74% 2.5% 0.88%
Proposal 1207 69% 5.4% 0.92%
Standard CP 1294 74% 5.9% 1.00%
Proposal 1173 68% 9.6% 0.88%
Standard CP 1294 74% 9.7% 0.92%
Proposal 1144 66% 13.3% 0.88%
Standard CP 1294 74% 13.6% 0.92%
Futility:
Model Method Days Recruited Prob. to Stop
Proposal 1289 74% 59.9%
Standard CP 1294 74% 60.4%
Proposal 1279 74% 59.4%
Standard CP 1294 74% 59.2%
Proposal 1253 72% 59.0%
Standard CP 1294 74% 59.6%
Proposal 1222 70% 59.9%
Standard CP 1294 74% 60.1%
Proposal 1191 69% 60.1%
Standard CP 1294 74% 60.0%
Proposal 1164 67% 59.6%
Standard CP 1294 74% 60.0%
  • Note: Days, average number of days elapsed since the beginning of the study; Recruited, average percentage of patients already recruited; Prob. to Stop, probability to stop for futility using O'Brien-Fleming boundaries; Power Loss, loss of power when conducting an interim analysis compared to the design power.

Table 2: Comparison of the standard conditional power approach and the conditional power using the proposed interim estimator when the interim analysis is conducted at an information fraction of . Upper table: results under superiority; Lower table: results under futility.

As can be seen in Figure 2, conducting the interim analysis at different information fractions leads to similar absolute reductions in the number of recruited patients. The relative gain, defined as the difference in the number of recruited patients between the two methods relative to the number recruited under the standard conditional power, therefore increases with decreasing information fraction. Another determining factor is the average number of patients recruited per month. We therefore simulated datasets under the futility scenario with for a recruitment rate of patients a month instead of . The percentage of recruited patients at information fraction is and for the standard and proposed conditional power, respectively, corresponding to a relative gain of . For a recruitment rate of patients a month, these equal and , respectively, corresponding to a relative gain of . The reduction in the number of recruited patients is thus larger in faster-recruiting settings.
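As a small illustration of this definition, the relative gain can be computed directly from the recruitment percentages; the inputs below (74% recruited under the standard approach versus 66% under the proposal, as in the last rows of the upper part of Table 2) are used purely as example values.

```python
def relative_gain(recruited_standard, recruited_proposal):
    """Relative reduction in recruited patients at the interim analysis,
    expressed relative to the standard conditional power approach."""
    return (recruited_standard - recruited_proposal) / recruited_standard

# Example: 74% recruited under the standard approach vs 66% under the proposal.
print(f"{relative_gain(0.74, 0.66):.1%}")  # → 10.8%
```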

Figure 2: Comparison of percentages of patients recruited at the time of the interim analysis between the standard conditional power and the proposed conditional power for different information fractions. Setting: Futility .

4.2.2 With Short-Term Measurements

To also incorporate short-term measurements along with three baseline covariates, the following data-generating mechanism () was employed, with predictions from a logistic regression model for involving the -way interaction between and , the -way interaction between , a continuous covariate (same as in Section 4.2.1) and a dichotomous covariate , the -way interaction between , a continuous covariate (same as in Section 4.2.1) and a -level covariate , and all lower order terms. Under this data-generating model, the marginal probabilities of success in the control and experimental treatment arm are and , respectively. To attain power at a one-sided significance level of , patients in both arms are required.
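A minimal sketch of such a data-generating mechanism is given below; the covariate distributions and all coefficients are hypothetical stand-ins, since the paper's actual values (fitted to trial data) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2019)

def expit(x):
    # Inverse logit link of the logistic regression model.
    return 1.0 / (1.0 + np.exp(-x))

n = 1000
A = rng.binomial(1, 0.5, n)    # randomized treatment indicator
X1 = rng.normal(0, 1, n)       # continuous baseline covariate
X2 = rng.binomial(1, 0.4, n)   # dichotomous baseline covariate
X3 = rng.integers(0, 3, n)     # 3-level baseline covariate

# Hypothetical coefficients; interactions with treatment mimic the
# structure (treatment-by-covariate interactions plus lower-order terms).
lin = (-0.2 + 0.3 * A + 0.5 * X1 + 0.4 * X2 + 0.2 * X3
       + 0.3 * A * X1 + 0.2 * A * X2 * X3)
Y = rng.binomial(1, expit(lin))  # long-term binary endpoint

# Marginal success probabilities per arm (analogues of the paper's values).
print(Y[A == 0].mean(), Y[A == 1].mean())
```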

Simulation experiments with a correctly specified outcome model used the working models and to predict the missing outcomes in cohort and working models and to predict the missing outcomes in cohort . To evaluate the performance when the prediction models are misspecified and to investigate whether this could lead to incorrectly stopping, we also considered the following outcome working models:

  1. a misspecified model including , , and but without interactions; and to predict the missing outcomes in cohort and and to predict the missing outcomes in cohort ;

  2. a model only including and ; and to predict the missing outcomes in cohort and and to predict the missing outcomes in cohort ;

  3. a misspecified model including and the absolute value of ; and to predict the missing outcomes in cohort and and to predict the missing outcomes in cohort ;

  4. a model only including and ; and to predict the missing outcomes in cohort and and to predict the missing outcomes in cohort .
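The idea of fitting working models on the complete cases and predicting the still-missing long-term outcomes can be sketched as follows; the data, cohort split and coefficient values are hypothetical, and the logistic fits use a generic optimizer rather than the paper's estimation code.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 800
A = rng.binomial(1, 0.5, n)                       # randomized treatment arm
X = rng.normal(0, 1, n)                           # baseline covariate
S = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X)))   # short-term endpoint
p_true = 1 / (1 + np.exp(-(-0.3 + 0.4 * A + 0.6 * X + 0.8 * S)))
Y = rng.binomial(1, p_true)                       # long-term endpoint

observed = np.arange(n) < 500   # cohort with Y already observed at interim

def fit_logistic(F, y):
    """Maximum-likelihood logistic regression (F includes the intercept)."""
    def nll(beta):
        eta = F @ beta
        return np.sum(np.log1p(np.exp(eta)) - y * eta)
    return minimize(nll, np.zeros(F.shape[1]), method="BFGS").x

def impute(features):
    """Fit a per-arm working model on complete cases; predict the missing Y."""
    F = np.column_stack([np.ones(n), features])
    preds = Y.astype(float).copy()
    for a in (0, 1):
        beta = fit_logistic(F[observed & (A == a)], Y[observed & (A == a)])
        mis = ~observed & (A == a)
        preds[mis] = 1 / (1 + np.exp(-(F[mis] @ beta)))
    return preds

correct = impute(np.column_stack([X, S]))   # baseline covariate and short-term endpoint
misspec = impute(np.abs(X)[:, None])        # e.g. only the absolute value (cf. model 3)
print(correct[~observed].mean(), misspec[~observed].mean())
```

Even under misspecification the predictions remain valid probabilities, which is in line with the robustness claims of the proposal.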

(a) Outcome working model
(b) Outcome working model
Figure 3: Visualisation of the model misspecification for outcome working models and : expected values under the correctly specified model and outcome working models and are plotted against .

The expected values for outcome working models and as well as for the correctly specified model are plotted against in Figure 3. To compare our proposal to methods only using (e.g., Niewczas et al., ), outcome working models only including were used. Note that, in that case, we do not predict the outcome for the patients in cohort . The proportion of the variance in that is related to , and both under these different prediction models is shown in Table 3.

Model
Correct
Misspecified ()
Misspecified ()
Misspecified ()
Misspecified ()
Only
  • Note: , and correspond to the proportion of the variance in that is predictable from respectively , and both and ; estimated using formulas in Appendix E.

Table 3: Empirical -squared values for the data generated under the settings for the conditional power with short-term endpoint measurements. Simulation results are based on Monte Carlo replications.
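The formulas of Appendix E are not reproduced here, but a simple empirical analogue of these R-squared values is the squared correlation between the binary outcome and its fitted success probability; the data below are simulated solely for illustration.

```python
import numpy as np

def empirical_r2(y, p_hat):
    """Squared Pearson correlation between a binary outcome and its
    predicted success probabilities: the proportion of variance in y
    that the predictor explains."""
    return np.corrcoef(y, p_hat)[0, 1] ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
p = 1 / (1 + np.exp(-1.2 * x))   # true success probability given x
y = rng.binomial(1, p)
print(empirical_r2(y, p))
```

A more predictive covariate (a steeper slope in the model above) drives this quantity up, which is what produces the earlier interim analyses reported for the proposal.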

First, the interim analysis is conducted at the fixed point in time when of the patients are expected to be recruited ( patients in cohort , patients in cohort , patients in cohort ). Table 4 shows that a higher total (see Table 3) and the use of correctly specified working models lead to a higher information fraction and a higher probability to correctly stop for true futility. This is again due to the efficiency gain from adjusting for (more predictive) baseline variables and short-term endpoints [18].

When conducting an interim analysis at the time of the information is obtained, the advantage over the standard conditional power increases with increasing total , while the advantage over the method only using is mainly determined by . The latter is a consequence of the fact that the proposal also incorporates baseline covariates on top of the short-term endpoint. However, in this simulation study there are on average only patients in cohort compared to in cohort and in cohort . The relative advantage over the method only using , compared to the advantage of the latter over the standard method, is therefore rather limited. The relative gain will increase when cohort contains proportionately more patients. To see this, we conducted a simulation study where the short-term endpoint is measured months after randomization instead of months, corresponding to an average of patients in cohort , in cohort and in cohort . The percentage of recruited patients is then given by , and for the standard conditional power, the conditional power only including the short-term endpoint and our proposal, respectively. The absolute advantage () of including both baseline covariates and the short-term endpoint over the method only incorporating the short-term endpoint is similar to that for measured months after randomization. The gain in the percentage of recruited patients at the time of the interim analysis from incorporating baseline covariates on top of the short-term endpoint, relative to the gain from only incorporating the short-term endpoint, increases from to , meaning that the decrease in the number of recruited patients compared to the standard conditional power is and times larger, respectively, for the proposal than for the conditional power only including and .

Method Prob. to Stop Power Loss
Cohort , Cohort , Cohort
Proposal, correct
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, only
Standard CP, only
Method Prob. to Stop
Cohort , Cohort , Cohort
Proposal, correct
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, only
Standard CP, only
  • Note: , average information fraction; Prob. to Stop, probability to stop for futility using O'Brien-Fleming boundaries; Power Loss, loss of power when conducting an interim analysis compared to the design power.

Table 4: Comparison when the interim analysis is conducted at the fixed point in time where of the patients are recruited. Upper table: results under superiority; Lower table: results under futility.
Method Days Recruited Prob. to Stop Power Loss
Proposal, correct
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, only
Standard CP, only
Method Days Recruited Prob. to Stop
Proposal, correct
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, misspecified ()
Proposal, only
Standard CP, only
  • Note: Days, average number of days elapsed since beginning of the study; Recruited, average percentage of patients already recruited; Prob. to Stop, probability to stop for futility using O'Brien-Fleming boundaries; Power Loss, loss of power when conducting an interim analysis compared to the design power.

Table 5: Comparison when the interim analysis is conducted at an information fraction of . Upper table: results under superiority; Lower table: results under futility.

4.3 Simulations on Sample Size Re-Estimation

In these simulation studies, futility stopping based on conditional power as well as sample size adaptations were performed. The total sample size was bounded to be at most times the original sample size. We generated data under a data-generating mechanism where for each (), , with , and the coefficients correspond to those obtained from a logistic regression fitted on the original phase b data. The marginal probabilities of success in the treatment and control arm under the data-generating model with are and , respectively, corresponding to a treatment effect of . To attain a power of under this alternative at a one-sided significance level of , the required sample size in both treatment arms is patients. The other probabilities of success in the treatment arm are , and for , and , respectively, corresponding to treatment effects of , and , respectively.

Table 6 shows the results for a simulation study where the interim analysis is conducted at the fixed point in time where out of the planned patients, corresponding to , are recruited. We can see that in all cases the type I error rate is controlled, since the power after sample size reassessment is around for a treatment effect equal to . As before, the information fraction as well as the probability to stop for true futility are higher for the proposal. However, this also causes a slightly larger loss in power relative to the standard approach. When sample size reassessment is performed, there is an increase in power for both methods compared to the designs without sample size reassessment and the classical one-stage trial, except for a treatment effect equal to . On average a smaller sample size is needed for the proposal, but this comes with a lower overall power over the two stages. The lower sample size leads to a higher probability that a trial which is not stopped for futility, and would be successful without sample size reassessment, is unsuccessful after sample size reassessment, which results in a lower power.

The results for the simulation study where the interim analysis is conducted when of the total information is obtained are summarised in Table 7 in Appendix F. The probability to stop for futility and the power loss when no sample size reassessment is performed are more comparable in this simulation study, but overall the results are similar to those for the setting where the interim analysis is conducted at a fixed point in time.

Performing a sample size reassessment after an interim analysis thus seems to be equally beneficial for both methods, resulting in (slightly) lower sample sizes at the cost of (slightly) lower overall power for the proposal compared to the standard method. Given the lower total sample size and the lower number of patients recruited at the time of the interim analysis compared to the standard method, it seems advantageous to use the proposal, even at the cost of (slightly) lower power.
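The inverse normal p-value combination that underlies the type I error control after an unblinded sample size reassessment can be sketched as follows; the stage-wise p-values and the equal weights below are illustrative only, the weights being fixed at the design stage.

```python
from scipy.stats import norm

def inverse_normal_combination(p1, p2, w1, w2):
    """Combine stage-wise one-sided p-values with pre-specified weights.

    Because the weights are fixed in advance, the combined statistic is
    standard normal under the null regardless of how the second-stage
    sample size was chosen at the interim analysis.
    """
    z = (w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)) / (w1**2 + w2**2) ** 0.5
    return z, 1 - norm.cdf(z)

# Illustrative stage-wise p-values with equal pre-planned weights.
z, p_comb = inverse_normal_combination(p1=0.04, p2=0.03, w1=1.0, w2=1.0)
print(z > norm.ppf(1 - 0.025))   # reject at the one-sided 2.5% level?
```

Note that neither stage-wise p-value is below 0.025 on its own, yet the pre-weighted combination crosses the critical boundary.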

Treatment Effect
Power One Stage Trial
Proposal Average
Probability to Stop for Futility
Power Loss
Power No SSR
No SSR
No SSR
No SSR
No SSR
No SSR
ASS No SSR () () () ()
Power SSR
SSR
SSR
SSR
SSR
SSR
ASS SSR () () () ()
% Rejected with SSR, not without
% Not rejected with SSR, rejected without
Standard CP Average
Probability to Stop for Futility
Power Loss
Power No SSR
No SSR
No SSR
No SSR
No SSR
No SSR
ASS No SSR () () () ()
Power SSR
SSR
SSR
SSR
SSR
SSR
ASS SSR () () () ()
% Rejected with SSR, not without
% Not rejected with SSR, rejected without