1 Introduction
In drug development, randomized controlled trials remain the gold standard to confirm efficacy and safety of novel drug candidates. Often phase III trials embed formal interim analyses to allow studies to be stopped early for futility, if the novel drug is not efficacious, or for efficacy, if the treatment effect is overwhelmingly positive.
Immunooncology (IO) is a rapidly evolving area in the development of anticancer drugs. IO agents can have an effect on both the human immune system and the tumor microenvironment. By doing so, the tumors may be eradicated from the host or disease progression may be delayed. The effect of an IO agent is not typically directed at the tumor itself; it instead boosts or releases the brake from the patient's immune system, and this positive effect may not be observed immediately. The lag between the activation of immune cells, their proliferation and their impact on the tumor is described in the literature as a delayed treatment effect. Some patients may not derive clinical benefit before their disease progresses, while others may derive sustained response or control of their disease. The primary endpoints most often used for confirmatory phase III studies in oncology are time to event: progression free survival (PFS) and overall survival (OS). PFS is defined as the time from randomization until disease progression or death, and OS is defined as the time from randomization until death from any cause. The delayed treatment effect may translate to inferior or equal PFS or OS compared to the control treatment in the first months of therapy and superior survival thereafter, leading to nonproportionality of hazards between the experimental and control arms of the study. Therefore, the original design based on a proportional hazards assumption will lead to an underpowered study, and hence both the sample size calculation and the analysis methods to be used should be reconsidered [1].
A weighted version of the logrank test that incorporates the Fleming and Harrington class of weights [2], which allows tuning of its two parameters depending on whether we expect early, middle or late differences, has been proposed in the literature to increase the power at the end of the trial. However, tuning these parameters is not straightforward, since a misspecification may cause an even larger power drop with respect to the logrank test.
The Fleming and Harrington class of weights, along with the estimated delay, can be incorporated into the sample size calculation in order to maintain the desired power once the treatment arm differences start to appear (see [3]).
In this article we make an empirical evaluation of the impact of a delayed effect on power and type I error rate in the design of a confirmatory phase III study with an IO agent used in combination with a standard of care, assuming a range for the delay time. We assess the performance of the weighted logrank test as an alternative to the logrank test, given that it allows weighting of late differences and hence a potential power gain under nonproportional hazards. The evaluation is made for both group sequential and adaptive group sequential designs with fixed values of the Fleming and Harrington class of weights. We also give some practical recommendations regarding the methodology to be used in the presence of delayed effects depending on certain characteristics of the trial.
The manuscript is organized as follows. In section 2, we describe the weighted logrank test and derive the sample size calculation formula needed to incorporate the estimated delay and the Fleming and Harrington class of weights, and we introduce the combination test statistic that will be necessary when doing sample size reassessment. In section 3 we briefly describe group sequential and adaptive group sequential designs, emphasizing two popular methods used to do sample size reassessment. In section 4, we describe the simulated example.
2 Methods
In this section we describe the statistical methodology we review in this article. In sections 2.1 and 2.2 we present the weighted logrank test and derive an optimal sample size when using this test following [3]. This sample size derivation is presented as an alternative to the Schoenfeld’s formula [4], which is normally used when calculating the necessary sample size in confirmatory trials. In section 2.3 we introduce the combination test statistic, which will be necessary when we perform sample size reestimation in adaptive group sequential designs.
Let $t = (t_1, \dots, t_k)$ be a vector that contains the ordered event times, $t_1 < t_2 < \dots < t_k$, between the patients' enrollment date and the patients' final event date. Let the number of events at time $t_i$ be denoted as $d_i$, the total number of patients at risk at that time be denoted as $n_i$, and the effect delay (in months) be denoted as $\varepsilon$. As previously described, if $t \leq \varepsilon$ both survival curves go in parallel, and once $t > \varepsilon$ the survival curves will start diverging. Hence, we assume the following density functions $f_j(t)$, survival functions $S_j(t)$ and hazard functions $\lambda_j(t)$ for the control group ($j = 0$) and for the experimental group ($j = 1$):

$$
\begin{aligned}
f_0(t) &= \lambda e^{-\lambda t}, &
S_0(t) &= e^{-\lambda t}, &
\lambda_0(t) &= \lambda,\\
f_1(t) &= \begin{cases} \lambda e^{-\lambda t}, & t \leq \varepsilon \\ \delta\lambda\, e^{-\lambda\varepsilon} e^{-\delta\lambda(t-\varepsilon)}, & t > \varepsilon \end{cases} &
S_1(t) &= \begin{cases} e^{-\lambda t}, & t \leq \varepsilon \\ e^{-\lambda\varepsilon - \delta\lambda(t-\varepsilon)}, & t > \varepsilon \end{cases} &
\lambda_1(t) &= \begin{cases} \lambda, & t \leq \varepsilon \\ \delta\lambda, & t > \varepsilon, \end{cases}
\end{aligned} \tag{1}
$$

where $\delta = \lambda_1(t)/\lambda_0(t)$ for $t > \varepsilon$, so that $0 < \delta < 1$. This way, we assume a step function for the hazard ratio where, from time 0 to $\varepsilon$, the hazard ratio is equal to 1, and from time $\varepsilon$ the hazard ratio is equal to $\delta$.
In this article we assume that the control group receives the standard of care and the experimental group receives a combination of the standard of care plus the IO agent which causes the delayed effect. Hence, any observed difference from time 0 until time $\varepsilon$ is random. The conclusions we obtain are only applicable to studies where a similar assumption is made. Otherwise, we cannot guarantee that from time 0 to time $\varepsilon$ both groups have a common survival function.
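The piecewise model in (1) can be sampled directly by inverse-transform sampling of the survival function. The sketch below is a minimal Python translation (the paper's appendix provides the reference R code; function and variable names here are illustrative):

```python
import math
import random

def sim_surv_time(lam, delta, eps, rng):
    """Draw one survival time from the piecewise model in equation (1).

    lam:   control hazard; delta: post-delay hazard ratio (delta = 1 or
    eps = 0 reduces to a plain exponential); eps: effect delay in months.
    """
    u = 1.0 - rng.random()                 # uniform on (0, 1]
    if u >= math.exp(-lam * eps):          # event occurs before the delay ends
        return -math.log(u) / lam
    # event occurs after the delay, where the hazard is delta * lam
    return eps + (-math.log(u) - lam * eps) / (delta * lam)

rng = random.Random(2023)
lam_c = math.log(2) / 6                    # control arm: median 6 months
control = [sim_surv_time(lam_c, 1.0, 0.0, rng) for _ in range(1000)]
experimental = [sim_surv_time(lam_c, 2 / 3, 3.0, rng) for _ in range(1000)]
```

Note that before the delay both arms share $S(t) = e^{-\lambda t}$, so the probability of surviving past $\varepsilon$ is $e^{-\lambda\varepsilon}$ in both arms, which is a convenient check of the simulator.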
2.1 Weighted logrank test
The weighted logrank test is defined as

$$
Z = \frac{\sum_{i=1}^{k} w(t_i)\left(d_{0i} - d_i \frac{n_{0i}}{n_i}\right)}{\sqrt{\sum_{i=1}^{k} w(t_i)^2\, d_i \frac{n_{0i}}{n_i}\left(1 - \frac{n_{0i}}{n_i}\right)\frac{n_i - d_i}{n_i - 1}}}, \tag{2}
$$

where $d_{0i}$ and $n_{0i}$ denote the number of events and the number of patients at risk in the control group at time $t_i$. [2] proposed the use of $(\rho, \gamma)$ to weight early, middle and late differences through the $G^{\rho,\gamma}$ class of weighted logrank tests, where the weight function at a time point $t_i$ is equal to

$$
w(t_i) = \hat{S}(t_i^-)^{\rho}\left(1 - \hat{S}(t_i^-)\right)^{\gamma}, \tag{3}
$$

where $\hat{S}(t)$ corresponds to the Kaplan–Meier estimator.
Depending on the values of $\rho$ and $\gamma$, we will have different weight functions that will emphasize early differences (e.g., $\rho = 1$, $\gamma = 0$), middle differences (e.g., $\rho = 1$, $\gamma = 1$) or late differences (e.g., $\rho = 0$, $\gamma = 1$) in the hazard rates or the survival curves. The parameter combination $\rho = \gamma = 0$ attributes equal weights to all data values and hence does not emphasize any survival differences between treatment arms. Moreover, with this parameter combination, (2) corresponds to the usual logrank test.
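As an illustration, the statistic in (2) with the weights in (3) can be computed directly from two samples. The following Python sketch is ours (not from the paper); setting $\rho = \gamma = 0$ recovers the ordinary logrank test:

```python
import math

def weighted_logrank(times0, events0, times1, events1, rho=0.0, gamma=0.0):
    """G^(rho,gamma) weighted logrank Z statistic, equations (2)-(3).

    times*: observed times per arm; events*: 1 = event, 0 = censored.
    Illustrative sketch only; rho = gamma = 0 gives the ordinary logrank test.
    """
    data = [(t, e, 0) for t, e in zip(times0, events0)] + \
           [(t, e, 1) for t, e in zip(times1, events1)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    S = 1.0                         # left-continuous pooled Kaplan-Meier
    num = den = 0.0
    for ti in event_times:
        n = sum(1 for t, _, _ in data if t >= ti)             # pooled at risk
        n0 = sum(1 for t, _, g in data if t >= ti and g == 0)
        d = sum(1 for t, e, _ in data if t == ti and e == 1)  # pooled events
        d0 = sum(1 for t, e, g in data if t == ti and e == 1 and g == 0)
        w = S ** rho * (1.0 - S) ** gamma                     # FH weight (3)
        num += w * (d0 - d * n0 / n)
        den += w * w * d * (n0 / n) * (1 - n0 / n) * (n - d) / max(n - 1, 1)
        S *= 1.0 - d / n                                      # update pooled KM
    return num / math.sqrt(den)
```

On a toy data set where all control events precede all experimental events, the $(\rho, \gamma) = (0, 1)$ version gives the first (early) event zero weight, which is exactly the behavior exploited under delayed effects.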
As mentioned by [3], since the weights affect the inference over the entire survival curve rather than only the late differences, valid inference requires prespecification of $\rho$ and $\gamma$ prior to any data collection.
Prior specification of the weight function is always advisable for trial integrity, although some authors (see e.g., [5]) note that it can be modified at the interim analysis without type I error rate inflation. At the end of the trial, we are interested in estimating the hazard ratio across the entire study, which is obtained through the standard Cox model [6]. Note, however, that there will be a disconnect between the hazard ratio (i.e., the standard Cox model) and the weighted logrank test. To obtain an estimate based on the Cox model that corresponds to the weighted logrank test, see [7].
In this article we focus on the use of the weighted logrank test in confirmatory trials with delayed effects. Other areas of use may include treatment switching, which is sometimes present in confirmatory trials and also induces nonproportional hazards (see [8]). However, it is out of the scope of this article to evaluate the performance of the weighted logrank test under the presence of treatment switching, and further research on this matter would be necessary.
2.2 Sample size derivation for the weighted logrank test
We introduce the optimal sample size derivation proposed by [3]. Assume that we recruit $n$ patients during an accrual period at a certain rate in a confirmatory trial where we aim to compare survival time between two groups ($j = 0, 1$): a control group ($j = 0$), with a constant hazard over time, and an experimental group ($j = 1$), with a hazard that changes over time. The final analysis is performed at time $\tau$ after the first patient is enrolled. The study period $[0, \tau]$ is partitioned into $M$ subintervals of equal length. Let $\lambda_{ji}$ be the hazard function for group $j$ at time $t_i$ and $n_{ji}$ be the expected number of patients at risk for group $j$ at time $t_i$, where $i = 1, \dots, M$.
[9] showed that the weighted logrank statistic is normally distributed with unit variance and approximate expectation of

$$
E = \frac{\sum_{i=1}^{M} w_i D_i \left( \dfrac{\phi_i \theta_i}{1 + \phi_i \theta_i} - \dfrac{\phi_i}{1 + \phi_i} \right)}{\sqrt{\sum_{i=1}^{M} w_i^2 D_i \dfrac{\phi_i}{(1 + \phi_i)^2}}}, \tag{4}
$$

where

$$
\phi_i = \frac{n_{1i}}{n_{0i}}, \qquad \theta_i = \frac{\lambda_{1i}}{\lambda_{0i}}, \qquad D_i = d_{0i} + d_{1i}, \tag{5}
$$

$\phi_i$ represents the ratio of patients at risk between the two groups at time $t_i$, and $w_i$ corresponds to the Fleming–Harrington class of weights, $w_i = S(t_{i-1})^{\rho}(1 - S(t_{i-1}))^{\gamma}$, where $S$ represents the pooled survival function. Even though it was originally proposed by [10], [3] uses $\tilde{S}(t_i) = p_0 S_0(t_i) + p_1 S_1(t_i)$ as a substitute for the pooled survival function, where $S_j(t_i)$ represents the survival function of group $j$ at time $t_i$ and $p_j$ the allocation ratio for group $j$. However, as stated by [3], equation (4) can be equivalently expressed as

$$
E \approx \sqrt{n}\, E^*, \tag{6}
$$

where

$$
E^* = \frac{\sum_{i=1}^{M} w_i D_i^* \left( \dfrac{\phi_i \theta_i}{1 + \phi_i \theta_i} - \dfrac{\phi_i}{1 + \phi_i} \right)}{\sqrt{\sum_{i=1}^{M} w_i^2 D_i^* \dfrac{\phi_i}{(1 + \phi_i)^2}}}, \tag{7}
$$

and $D_i^* = D_i / n$ denotes the expected number of events per enrolled patient in the $i$th subinterval.
Assuming that the weighted logrank statistic is normally distributed with mean $\sqrt{n}\,E^*$ and unit variance, then for a power equal to $1 - \beta$ and one-sided significance level $\alpha$ we have

$$
\sqrt{n}\, E^* = z_{1-\alpha} + z_{1-\beta}, \tag{8}
$$

where $z_{1-\alpha}$ and $z_{1-\beta}$ correspond to the $(1-\alpha)$th and $(1-\beta)$th percentiles of the standard normal distribution, respectively. The required sample size is calculated as

$$
n = \left( \frac{z_{1-\alpha} + z_{1-\beta}}{E^*} \right)^2, \tag{9}
$$

and the total expected number of events is equal to $d = n \sum_{i=1}^{M} D_i^*$.
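For orientation, Schoenfeld's formula [4], the proportional hazards baseline mentioned at the start of this section, gives the required number of events for the unweighted test in closed form. A short sketch under 1:1 allocation (helper name is ours):

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.025, power=0.90):
    """Events required by Schoenfeld's formula [4]:
    d = 4 (z_{1-alpha} + z_{1-beta})^2 / log(hr)^2,
    for 1:1 allocation and a one-sided level alpha."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    return math.ceil(4 * z ** 2 / math.log(hr) ** 2)
```

With `hr = 2/3`, 90% power and one-sided $\alpha = 0.025$, this gives 256 events, close to the 258 events reported in Table 2 for $(\rho, \gamma) = (0, 0)$ with no delay, as expected since under proportional hazards the weighted derivation essentially reduces to the unweighted case.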
2.3 Test statistic
We aim to test the null hypothesis $H_0\colon \delta = 1$ against the alternative $H_1\colon \delta < 1$. In the context of group sequential designs, since we are only interested in early efficacy testing, we make use of the well-known classical group sequential design methodology (see [11]) and the O'Brien and Fleming rejection boundaries. In the context of adaptive group sequential designs, we make use of the independent increment property of the inverse normal method, which is an efficient way of incorporating data of patients who were censored at the interim analysis while ensuring type I error rate control (see [12]). The test statistic is defined as

$$
Z = w_1 \Phi^{-1}(1 - p_1) + w_2 \Phi^{-1}(1 - p_2), \tag{10}
$$

where $p_1$ and $p_2$ denote the separate stage p-values from stages 1 and 2, $\Phi^{-1}$ denotes the inverse of the standard normal distribution, and $w_1 = \sqrt{d_1/(d_1 + d_2)}$ and $w_2 = \sqrt{d_2/(d_1 + d_2)}$ are prespecified weights such that $w_1^2 + w_2^2 = 1$, where $d_1$ and $d_2$ represent the number of events observed in each stage. The null hypothesis will be rejected at level $\alpha$ if $Z > z_{1-\alpha}$.

However, the inverse normal method is in general not valid when doing sample size reassessment if the adaptations depend on endpoints such as OS or PFS (see [13]). We use the approach proposed by [14] where, in equation (10), the first stage p-value is defined by the cohort of patients included before the interim analysis and is calculated only at the end of the trial. This allows the inclusion of all the events, but it prohibits early stopping for efficacy. See [15] for a detailed review of the existing methods on this matter.
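The combination rule in (10) is straightforward to implement. A minimal Python sketch with event-based weights (names are ours):

```python
import math
from statistics import NormalDist

def inverse_normal_test(p1, p2, d1, d2, alpha=0.025):
    """Inverse normal combination (10) with weights from stage events d1, d2.

    Returns the combined statistic and whether H0 is rejected at level alpha.
    """
    nd = NormalDist()
    w1 = math.sqrt(d1 / (d1 + d2))
    w2 = math.sqrt(d2 / (d1 + d2))
    z = w1 * nd.inv_cdf(1 - p1) + w2 * nd.inv_cdf(1 - p2)
    return z, z > nd.inv_cdf(1 - alpha)
```

For example, two equally sized stages each with $p = 0.01$ combine into $z = \sqrt{2}\,\Phi^{-1}(0.99) \approx 3.29$, a clear rejection at $\alpha = 0.025$.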
3 Group sequential and adaptive group sequential designs
In this section we aim to briefly describe how group sequential and adaptive group sequential designs work. For a detailed definition and explanation of this methodology see [11].
3.1 Group sequential designs
The formulae presented in section 2.2 allow us to obtain a sample size that maintains an acceptable power at the end of the trial in the presence of delayed effects. However, a key condition is to have some knowledge about the delay of the drug. Assuming we have this knowledge when designing the confirmatory trial, we can implement a group sequential design with an interim analysis for efficacy. Note that an interim analysis for futility is not advised in the presence of delayed effects because of the high risk of stopping the study for futility even in scenarios that favor the alternative hypothesis.
A group sequential design with one interim analysis for efficacy is graphically described in Figure 1.
3.2 Adaptive group sequential design
Even though the sample size derivation described in section 2.2 guarantees that, after a prespecified effect delay, we will have an acceptable power at the end of the trial while controlling the type I error rate, we may have misspecified the delay value, or this value may be unknown. Either way, an adaptive group sequential design that allows interim analyses and sample size reassessment would be useful in case we expect a lack of statistical power at the end of the trial given the results at the interim analyses. Hence, with this design we aim to recover the power lost due to misspecification of the delay. As explained in section 2.3, to maintain type I error rate control when the sample size criterion is based on survival endpoints, the interim analysis is only used to do a sample size reassessment and not for early stopping. Because we need to distinguish between the effect at the interim analysis and the effect at the final analysis, let $\hat{\delta}$ be the estimated hazard ratio at the interim analysis and let $\delta$ be the hazard ratio at the end of the trial.
We now introduce two popular approaches for sample size reassessment:
3.2.1 Mehta and Pocock’s “promising zone” approach [16]
[16] propose a method that adaptively increases the sample size when interim results are considered "promising". For that, we compute the conditional power at the interim analysis using $\hat{\delta}$ rather than the true $\delta$. The formula for the conditional power is defined as

$$
CP_{\hat{\delta}}(z_1) = 1 - \Phi\!\left( \frac{z_{1-\alpha}\sqrt{d} - z_1\sqrt{d_1}}{\sqrt{d - d_1}} - z_1 \sqrt{\frac{d - d_1}{d_1}} \right), \tag{11}
$$

where $z_1$ is the value of the test statistic at the interim analysis, $d_1$ is the number of events observed at the interim analysis and $d$ is the total number of events planned for the final analysis.
If the conditional power is within a certain prespecified range that we consider promising, we may reestimate the sample size to recover the power lost due to the effect delay. The selection of this range depends not only on the estimate of the effect delay but also on the budget of the sponsor for this particular trial. For example, if we have an estimated effect delay between 3 and 7 months, but we only have budget to guarantee 80% power up to a 5-month delay, the sponsor can choose to stop the trial. Therefore, following [16], we partition the sample space of attainable conditional power values into three zones:

Favorable: We consider the interim results to be in the favorable zone if $CP_{\hat{\delta}}(z_1) \geq 1 - \beta$. In this zone, the study is sufficiently powered for the observed $\hat{\delta}$ and therefore no sample size reestimation is required.

Promising: We consider the interim results to be in the promising zone if $CP_{\min} \leq CP_{\hat{\delta}}(z_1) < 1 - \beta$. In this zone, $CP_{\hat{\delta}}(z_1)$ is close to $1 - \beta$ but the study is not sufficiently powered, and a sample size reestimation is required. Specifically, the number of events for the second stage will be increased to

$$
\tilde{d}_2(z_1) = \min\left\{ d_{\max},\, d_2^*(z_1) \right\}, \tag{12}
$$

where $d_{\max}$ corresponds to the maximum sample size the sponsor is willing to enroll and $d_2^*(z_1)$ satisfies $CP_{\hat{\delta}}\bigl(z_1, d_2^*(z_1)\bigr) = 1 - \beta$. Following [17], it is possible to show that

$$
d_2^*(z_1) = \frac{d_1}{z_1^2} \left( \frac{z_{1-\alpha}\sqrt{d} - z_1\sqrt{d_1}}{\sqrt{d - d_1}} + z_{1-\beta} \right)^2. \tag{13}
$$
Unfavorable: We consider the interim results to be in the unfavorable zone if $CP_{\hat{\delta}}(z_1) < CP_{\min}$. The value of $CP_{\min}$ is prespecified before the trial starts, and it depends on the prior knowledge about the effect delay. In this zone the interim results are not promising and the sample size will not be reestimated.
The type I error rate is controlled following [18], where it is shown that the overall type I error does not increase if the sample size is only reassessed when

$$
CP_{\hat{\delta}}(z_1) \geq 0.5. \tag{14}
$$
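Equations (11) and (13) translate directly into code. A sketch on the event (information) scale, with helper names of our own choosing:

```python
import math
from statistics import NormalDist

def conditional_power(z1, d1, d, alpha=0.025):
    """Conditional power (11): interim statistic z1, interim events d1,
    planned total events d, evaluated under the interim trend."""
    nd = NormalDist()
    za = nd.inv_cdf(1 - alpha)
    x = (za * math.sqrt(d) - z1 * math.sqrt(d1)) / math.sqrt(d - d1) \
        - z1 * math.sqrt((d - d1) / d1)
    return 1 - nd.cdf(x)

def second_stage_events(z1, d1, d, power=0.90, alpha=0.025):
    """Second-stage events (13) targeting conditional power 1 - beta."""
    nd = NormalDist()
    za, zb = nd.inv_cdf(1 - alpha), nd.inv_cdf(power)
    b = (za * math.sqrt(d) - z1 * math.sqrt(d1)) / math.sqrt(d - d1)
    return d1 / z1 ** 2 * (b + zb) ** 2
```

For instance, with $d = 258$ planned events, $d_1 = 150$ observed at the interim and $z_1 = 1.5$, the conditional power is about 0.50 (promising for a lower bound of 0.1) and the reestimated second stage is roughly 431 events.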
3.2.2 Jennison and Turnbull’s “start small then ask for more” approach [19]
[19] made a detailed analysis of Mehta and Pocock’s “promising zone” approach.
One drawback of the "promising zone" approach is the use of $\hat{\delta}$ in the construction of the promising zone and the sample size increase function. The reason is that $\hat{\delta}$ is a highly variable estimate of $\delta$, and also that it is used twice in determining the conditional power that underlies the sample size function: the first time through the value of $z_1$ and the second time when evaluating the conditional power at $\hat{\delta}$. This double use of $\hat{\delta}$ was also pointed out by [20], who recommends a careful inspection of the operating characteristics when using $CP_{\hat{\delta}}(z_1)$.
Another drawback of Mehta and Pocock's "promising zone" approach is that, despite the type I error rate being controlled because of the restriction shown in (14), the gain in power is relatively small for the increases in the expected sample size. Moreover, [19] demonstrated that other alternatives, such as a fixed sample design and a group sequential design, can achieve exactly the same power curve with a lower expected sample size around the true value of $\delta$.
To overcome the last limitation, [19] propose an optimal sample size calculation rule where we need to find the second-stage size that maximizes the objective function

$$
f(d_2) = CP_{\hat{\delta}}(z_1, d_2) - \gamma\, d_2, \tag{15}
$$

where $\gamma > 0$ can be considered as "a tuning parameter that controls the degree to which the sample size may be increased when interim data are promising but not overwhelming".

[19] pointed out that even though the objective function given by equation (15) "concerns conditional probabilities given the interim data, choosing a sample size rule to optimize this objective function also yields a design with an overall optimality property expressed in terms of unconditional power". They show that a rule maximizing (15) for each interim outcome also maximizes the unconditional objective of power minus $\gamma$ times the expected total number of events.
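A grid search over the objective in (15) is enough to illustrate the rule. The sketch below is our own formulation on the event scale: it writes the conditional power as a function of the second-stage events while keeping the prespecified inverse normal weights fixed, then maximizes conditional power minus $\gamma$ times the second-stage size:

```python
import math
from statistics import NormalDist

def cp_fixed_weights(z1, d1, d, d2, alpha=0.025):
    """Conditional power when the second stage has d2 events but the inverse
    normal weights stay fixed at w1 = sqrt(d1/d), w2 = sqrt((d-d1)/d)."""
    nd = NormalDist()
    b = (nd.inv_cdf(1 - alpha) * math.sqrt(d) - z1 * math.sqrt(d1)) \
        / math.sqrt(d - d1)
    return nd.cdf(z1 * math.sqrt(d2 / d1) - b)

def jt_second_stage(z1, d1, d, gamma, d2_max, alpha=0.025):
    """Second-stage events maximizing objective (15): CP minus gamma * d2."""
    return max(range(d - d1, d2_max + 1),
               key=lambda d2: cp_fixed_weights(z1, d1, d, d2, alpha) - gamma * d2)
```

With $\gamma = 0$ the rule asks for the maximum allowed second stage, while a very large $\gamma$ keeps the originally planned one; intermediate values trade conditional power against additional events, which is exactly the "start small then ask for more" behavior.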
4 Simulation setup
We implement the methodology described in sections 2 and 3 on a scenario that tries to imitate a realistic phase III trial with delayed effects in oncology.
Survival data for the control arm is simulated using an exponential distribution, while data for the experimental arm is simulated using a piecewise exponential distribution (see equation (1)). Under proportional hazards, we assume that the control arm has a median survival of 6 months while the experimental arm has a median survival of 9 months. Hence, the hazard ratio is equal to 0.667. However, under the presence of delayed effects we assume a step function for the hazard ratio, where it will be equal to 1 until a certain time point $\varepsilon$ and then at its full effect after $\varepsilon$. This means that while the control arm will keep its median survival of 6 months, the median survival of the experimental arm will no longer be 9 months because of the delayed effect. We establish a total study duration of 25 months, a total enrollment period of 17.5 months, a randomization ratio of 1:1, a power of 90% and a one-sided $\alpha$ level of 2.5%.
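Under these assumptions, the exponential rates and the post-delay median follow directly from the model in (1). A small check in Python, using an illustrative delay of $\varepsilon = 3$ months (the delay value itself is an assumption of this sketch):

```python
import math

lam_c = math.log(2) / 6        # control hazard: median survival of 6 months
delta = 2 / 3                  # post-delay hazard ratio (0.667)
eps = 3.0                      # illustrative effect delay, in months

# Median of the experimental arm under the delayed effect: solve S1(m) = 0.5
# for m > eps, i.e. exp(-lam_c*eps - delta*lam_c*(m - eps)) = 0.5.
median_exp = eps + (math.log(2) - lam_c * eps) / (delta * lam_c)

# Without any delay, the experimental median would be log(2)/(delta*lam_c).
median_no_delay = math.log(2) / (delta * lam_c)
```

With a 3-month delay the experimental median drops from 9 to 7.5 months, which illustrates why a design assuming proportional hazards loses power.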
Clinical trial enrollment follows a Poisson distribution with a rate of 10 patients per month. Plotting the cumulative distribution function of a Poisson distribution with these characteristics using, for instance, the R function ecdf(), it is straightforward to see that after 17.5 months almost all the patients, if not all, are enrolled in the trial. Results are obtained by running 200,000 simulated trials. R code showing how to simulate survival data under the presence of delayed effects is provided in the appendix.

In Table 1 we show the information fraction, the cumulative $\alpha$ spent, the O'Brien and Fleming efficacy boundaries, and the boundary crossing probability at each look. Recall that these boundaries are only used in the context of group sequential designs where the sample size is not reassessed, and they are calculated based on the information fraction only. If the sample size needs to be reassessed, we employ different methodology (see section 2.3).
Look # | Information fraction | Cumulative $\alpha$ spent | Efficacy boundary | Crossing probability
1 | 0.75 | 0.01 | 2.34 | 0.688
2 | 1 | 0.025 | 2.012 | 0.212
For both the group sequential and the adaptive group sequential designs, we estimate the empirical power and the empirical type I error rate at the final analysis. In the context of group sequential designs, let $Z_f$ be the Z-statistic obtained at the end of the trial and $z_{\text{ef}}$ be the efficacy boundary of the final analysis presented in Table 1. In scenarios under the alternative hypothesis, the empirical power is defined as

$$
\widehat{1 - \beta} = \frac{1}{N} \sum_{s=1}^{N} \mathbb{1}\left\{ Z_f^{(s)} > z_{\text{ef}} \right\}, \tag{17}
$$

where $N$ is the number of simulated trials, whereas in scenarios under the null hypothesis, (17) is the empirical type I error rate. In the context of adaptive group sequential designs, in equation (17), $Z_f$ needs to be substituted by the combination statistic in (10) and $z_{\text{ef}}$ by $z_{1-\alpha}$ in order to implement the inverse normal method described in section 2.3.
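As a quick illustration of estimator (17), one can approximate the final-analysis Z-statistic by a normal draw with drift $z_{1-\alpha} + z_{1-\beta} \approx 3.24$ (the design drift under a correctly specified effect) and count boundary crossings. This one-look sketch ignores the interim analysis, so the number it produces is indicative only:

```python
import random
from statistics import NormalDist

nd = NormalDist()
drift = nd.inv_cdf(0.975) + nd.inv_cdf(0.90)   # design drift, about 3.24
boundary = 2.012                               # final-look boundary, Table 1

rng = random.Random(7)
n_sim = 200_000                                # as in the simulation setup
hits = sum(1 for _ in range(n_sim) if rng.gauss(drift, 1.0) > boundary)
power_hat = hits / n_sim                       # empirical power, equation (17)
```

The estimate lands slightly below the nominal 90% because, with these boundaries, part of the overall rejection probability is achieved at the interim look (Table 1).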
5 Results
In this section we evaluate the repercussion of delayed effects on the power and the typeI error rate in group sequential and adaptive group sequential designs. The results presented in this section are based on the simulated scenario described in section 4.
Because one of the purposes of this work is to make a comparison between the logrank test and the weighted logrank test, in Table 2 we show, for different delay times, the required number of events and the sample size using the parameter values $(\rho, \gamma) = (0, 0)$ and $(\rho, \gamma) = (0, 1)$, following the formulas presented in section 2.2. As we can see, under proportional hazards the parameter combination $(\rho, \gamma) = (0, 0)$ is more efficient, since it requires 258 events whereas the parameter combination $(\rho, \gamma) = (0, 1)$ requires 369 events to maintain 90% power. However, with a 5-month delay, the parameter combination $(\rho, \gamma) = (0, 1)$ becomes more efficient, since it requires 741 events whereas the parameter combination $(\rho, \gamma) = (0, 0)$ requires 1436 events to maintain 90% power.
Delay (months) | 0 | 1 | 2 | 3 | 4 | 5
$(\rho, \gamma) = (0, 0)$: # of events | 258 | 359 | 492 | 686 | 986 | 1436
$(\rho, \gamma) = (0, 0)$: # of patients | 330 | 456 | 621 | 860 | 1228 | 1777
$(\rho, \gamma) = (0, 1)$: # of events | 369 | 376 | 406 | 468 | 578 | 741
$(\rho, \gamma) = (0, 1)$: # of patients | 472 | 478 | 512 | 587 | 719 | 917
5.1 Group sequential design
In Figure 2 we show the empirical power and type I error rate at the final analysis for a wide range of $\rho$ and $\gamma$ combinations with the design characteristics presented in section 4, assuming no delayed effect in the sample size calculation. As expected, the results show that the parameter combination $(\rho, \gamma) = (0, 0)$ achieves 90% power and 2.5% type I error at the final analysis. However, as the effect delay increases, its power drops faster than that of other combinations of $\rho$ and $\gamma$. Other combinations, like $(\rho, \gamma) = (0, 1)$, have less power under proportional hazards but maintain higher power as the effect delay increases. These results are expected, since low values of $\rho$ and high values of $\gamma$ weight late differences, which is the situation we recreate in this simulated trial. However, combinations that weight late differences produce a slight type I error rate inflation, as we can observe in Figure 2, right image.
Using the methodology described in section 2.2, if we incorporate an estimate of the effect delay in the sample size calculation, we are able to prevent the power from dropping until that specified moment. This is shown in Figure 3, where for each delay time we calculate the sample size necessary to achieve 90% power taking the delay into account. Moreover, when correctly specifying the effect delay, we observe that not only low values of $\rho$ and high values of $\gamma$ achieve high power. However, in terms of type I error rate, we observe the same slight inflation we observed in Figure 2 for low values of $\rho$ and high values of $\gamma$. To control the type I error rate, we propose to use a similar approach to the one used by [21] in which, although in a different context, instead of calculating the sample size for $\alpha = 0.025$, a lower value of $\alpha$ is fixed so that the final type I error rate is maintained at 2.5%.
5.2 Adaptive group sequential design
In this section we show how, by performing a sample size reassessment, we can recover some of the power lost due to the delayed effect. As in the previous section, the results presented here make use of the simulated example described in section 4. However, rather than using a wide range of combinations of $\rho$ and $\gamma$, we use the combination $(\rho, \gamma) = (0, 1)$, since we believe it is the most suitable combination for this kind of setting.
In Figure 4 we present the empirical type I error (top-left image), the empirical power (top-right image), the percentage of times we readjust the sample size (bottom-left image) and the ratio between the new sample size and the original sample size (bottom-right image) for different effect delays, using the weighted logrank test with the parameter combination $(\rho, \gamma) = (0, 1)$ and the promising zone approach proposed by [16].
We employ three different promising zone lower bounds (0.5, 0.1, 0.001) and compare their operating characteristics against a design that does not reassess the sample size. Without any sample size reassessment, the power is below 80% after 3 months. Using a promising zone lower bound of 0.5, the power will be below 80% after 3.5 months. However, if the promising zone lower bound is 0.1 or 0.001, the power will be below 80% after 4 and 6 months, respectively. As discussed in the literature (see [19]), we corroborate that the gains when using a lower bound of 0.5 are practically negligible, and that the greatest gains in power are likely to be found outside the region defined by [16].
In terms of type I error, we observe that it is perfectly controlled for any value of the promising zone lower bound. However, note that we implemented our previously described proposal in which, instead of calculating the sample size for $\alpha = 0.025$, a lower value of $\alpha$ is fixed so that the final type I error rate is maintained at 2.5%. Otherwise, we would see the same slight type I error rate inflation we identified in Figures 2 and 3 due to the $\rho$ and $\gamma$ parameters that we employ.
In terms of percent of times we fall in the promising zone, when the lower bound is 0.5, the probability of readjusting the sample size reaches its maximum value, which is around 15% at 4 months. If the lower bound is 0.1, the probability of readjusting the sample size reaches its maximum value, which is around 35% between 4 and 5 months. Last, if the lower bound is 0.001, the probability of readjusting the sample size reaches its maximum value, which is close to 70% at 6 months.
In terms of how much we need to increase the sample size with respect to the original sample size every time we fall in the promising zone, we observe that if the lower bound is 0.5, we need around 1.5 times the original sample size regardless of the delay time. If the lower bound is 0.1, we need around 2.5 times the original sample size, also regardless of the delay time. Last, if the lower bound is 0.001, for a delay time of 4 months we need around 4.5 times the original sample size, for a delay time of 5 months we need around 9 times the original sample size, and for a delay time of 6 months we need around 15 times the original sample size.
It is important to mention that, in practice, a promising zone lower bound of 0.001 may not be possible to implement given the excessive increase in the number of events needed and the consequent increase in the budget for the trial. However, we believe it is interesting to show that it is possible to maintain a power of 80% for another three extra months, regardless of the additional duration and expenses of the trial.
Last, in Figure 5 we make a comparison between the approaches of [16] and [19]. We selected the promising zone lower bound of 0.001 because it is the most expensive to put into practice and the one where the greatest differences are observed. As expected, the approach from [19] is able to maintain the same power as the approach from [16]. However, in terms of how much we need to increase the sample size with respect to the original sample size, [19] requires a smaller sample size, especially after 4 months of delay.
6 Practical Considerations
In the previous sections we evaluated the impact of delayed effects in clinical trials and what methodology exists in order to reduce it. However, we cannot conclude which methodology is better in general terms because it will depend on many factors. In this section, we emphasize some practical considerations regarding the use of the presented methodology.
The first question we tackled in this article is the use of the weighted logrank test versus the logrank test in group sequential and adaptive group sequential designs. In the presence of known delayed effects, we observed that the weighted logrank test with parameter values $(\rho, \gamma) = (0, 1)$ is the overall best choice, not only for the analysis but also for the sample size formula. We recall that the use of these parameter values in the weighted logrank test generates a slight type I error rate inflation, and hence the value of $\alpha$ needs to be slightly decreased in order to achieve a final type I error rate of 2.5%.
In cases where the delayed effect is unknown or underestimated in the sample size calculation, there exists methodology that readjusts the sample size in order to increase the power at the final analysis. The choice of method will depend on the characteristics of the trial. From the two methods we evaluated in the article, we observed that the proposal of [19] outperforms the proposal of [16] in the sense that, for the same power, [19] requires a smaller sample size. However, with these approaches it is possible to back-calculate the conditional power at the interim analysis if we know the sample size increase recommended for the second stage of the trial. If this situation does not compromise the integrity of the trial, we recommend the use of [19], as it is proven to be more efficient. However, if the effect at the interim analysis has to remain masked, we propose the use of a modified version of [16], which would work as follows.
We would establish a promising zone, as in the original method, in which we recalculate the sample size if the conditional power falls within a certain prespecified range. The original method would calculate a different sample size for each conditional power (or delay time). However, in order to avoid backcalculations based on the second stage sample size, we propose to fix in advance the sample size to be used in the second stage of the trial. To avoid having an underpowered trial, we can fix the sample size increase assuming the lowest possible value for the conditional power (or the highest delay time) of the promising zone. This value would represent the maximum fixed sample size increase, although with this approach, we will be overpowering the trial for almost all values of the conditional power that fall in the promising zone. On the other hand, we can also fix the sample size increase to the highest possible value for the conditional power (or the lowest delay time) of the promising zone. This fixed value would represent the minimum fixed sample size increase, although with this approach, we will be underpowering the trial for almost all values of the conditional power that fall in the promising zone. This modification of the method proposed by [16] is illustrated (using a toy example) in Figure 6.
In this case, even though the “safest” option would always be using the maximum fixed sample size increase, we cannot give a recommendation since a large number of sample sizes between the maximum and the minimum fixed sample size increases can be employed and the choice depends on how much risk of having an underpowered study the sponsor is willing to take.
7 Conclusions
In this article we evaluated the impact of delayed effects, in terms of power and typeI error rate, in phase III clinical trials. We studied the use of the weighted logrank test as an alternative to the logrank test in group sequential and adaptive group sequential designs. This includes not only the analysis but also the incorporation of the Fleming and Harrington class of weights, as well as a delay estimate, in the sample size calculations. Also, we reviewed two different sample size readjustment methods, and explored which one is more efficient.
Results show that, in the presence of delayed effects when the design assumed proportional hazards, the weighted logrank test with parameter values $(\rho, \gamma) = (0, 1)$ was the best overall choice, as it was the one that maintained a higher power as the delay increases. When incorporating the Fleming and Harrington class of weights, as well as a delay estimate, into the sample size calculation, we observed that the power is maintained up to the delay estimate we provided, and the difference in power between parameter values was not as big as under the assumption of proportional hazards, although the parameter values $(\rho, \gamma) = (0, 1)$ were overall the best combination. Sample size readjustment allows increasing the sample size at the interim analysis to lower the risk of failing to meet the study objective. We explored the operating characteristics of two popular approaches for sample size readjustment: the "promising zone" approach by [16] and the "start small then ask for more" approach by [19].
With the proposal from [16] it is possible to maintain the power high enough for the trial to be valid. However, the proposal from [19] is proven to be more efficient, as for the same power curve it requires a smaller sample size. Nevertheless, there are situations in which having a "promising zone" may be more beneficial. This is the case when the effect at the interim analysis has to remain masked for integrity reasons. The problem is that it is possible to back-calculate the effect at the interim analysis by knowing the sample size increase. Hence, in this article we propose a modified version of the proposal from [16]. It does not require any modification of the original formulation: if a trial has a conditional power that falls in a prespecified promising zone, we apply a prespecified fixed sample size increase that is used regardless of the value of the conditional power, as long as it falls in the promising zone. With this approach, even though we keep the effect masked at the interim analysis, there is the risk of having an underpowered study if the fixed sample size increase is not large enough. However, if we want to avoid that risk, we will need to recruit more patients than necessary, with the associated extra cost.
8 Acknowledgments
We would like to acknowledge Michael Branson for inspiration and initial discussions, and Ekkehard Glimm, Franz König and Thomas Jaki for the valuable advice during the development of this work. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 633567.
References
[1] Chen Tai-Tsang. Statistical issues and challenges in immuno-oncology. Journal for ImmunoTherapy of Cancer. 2013;1:18.
[2] Fleming Thomas R, Harrington David P. A class of hypothesis tests for one and two sample censored survival data. Communications in Statistics - Theory and Methods. 1981;10:763–794.
[3] Hasegawa Takahiro. Sample size determination for the weighted log-rank test with the Fleming–Harrington class of weights in cancer vaccine studies. Pharmaceutical Statistics. 2014;13:128–135.
[4] Schoenfeld David A. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983:499–503.
[5] Lawrence John. Strategies for changing the test statistic during a clinical trial. Journal of Biopharmaceutical Statistics. 2002;12:193–205.
[6] Cox David R. Regression models and life-tables. In: Breakthroughs in Statistics: 527–541. Springer 1992.
[7] Lin Ray S, León Larry F. Estimation of treatment effects in weighted log-rank tests. Contemporary Clinical Trials Communications. 2017;8:147–155.
[8] Bowden Jack, Seaman Shaun, Huang Xin, White Ian R. Gaining power and precision by using model-based weights in the analysis of late stage cancer trials with substantial treatment switching. Statistics in Medicine. 2016;35:1423–1440.
[9] Lakatos Edward. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics. 1988:229–241.
[10] Fleming Thomas R, Harrington David P. Counting Processes and Survival Analysis. Vol. 169. John Wiley & Sons 2011.
[11] Jennison Christopher, Turnbull Bruce W. Group Sequential Methods with Applications to Clinical Trials. CRC Press 1999.
[12] Wassmer Gernot. Planning and analyzing adaptive group sequential survival trials. Biometrical Journal. 2006;48:714–729.
[13] Bauer Peter, Posch Martin. Letter to the editor. Statistics in Medicine. 2004;23:1333–1334.
[14] Jenkins Martin, Stone Andrew, Jennison Christopher. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical Statistics. 2011;10:347–356.
[15] Magirr Dominic, Jaki Thomas, Koenig Franz, Posch Martin. Sample size reassessment and hypothesis testing in adaptive survival trials. PLoS ONE. 2016;11:e0146465.
[16] Mehta Cyrus R, Pocock Stuart J. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30:3267–3284.
[17] Gao Ping, Ware James H, Mehta Cyrus. Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics. 2008;18:1184–1196.
[18] Chen YH, DeMets David L, Lan KK Gordon. Increasing the sample size when the unblinded interim result is promising. Statistics in Medicine. 2004;23:1023–1038.
[19] Jennison Christopher, Turnbull Bruce W. Adaptive sample size modification in clinical trials: start small then ask for more? Statistics in Medicine. 2015;34:3793–3810.
[20] Glimm Ekkehard. Comments on ‘Adaptive increase in sample size when interim results are promising: A practical guide with examples’ by CR Mehta and SJ Pocock. Statistics in Medicine. 2012;31:98–99.
[21] Golkowski Daniel, Friede Tim, Kieser Meinhard. Blinded sample size re-estimation in crossover bioequivalence trials. Pharmaceutical Statistics. 2014;13:157–162.
Appendix
In this section of the manuscript we present the R code we used to simulate survival data for confirmatory trials and to run a group sequential design with one interim analysis for efficacy under the presence of delayed effects.
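Independently of the R code above, the core of such a simulation is drawing survival times whose hazard changes at the delay. The following is a minimal Python sketch (an illustration, not the appendix's R implementation) using inverse-transform sampling from a piecewise-exponential model: the experimental arm has the control hazard λ up to the delay τ (no treatment effect yet) and hazard λ·HR afterwards.

```python
import random
from math import log, exp

def delayed_effect_time(u, lam_control, delay, hr):
    """Invert the piecewise-exponential survival function of the
    experimental arm: hazard lam_control on [0, delay) and
    lam_control * hr afterwards.  u is a Uniform(0,1) draw; the
    returned time T satisfies S(T) = u."""
    cum = -log(u)                        # total cumulative hazard to spend
    if cum < lam_control * delay:        # event occurs before the delay ends
        return cum / lam_control
    return delay + (cum - lam_control * delay) / (lam_control * hr)

def simulate_arm(n, lam_control, delay, hr, seed=42):
    """Draw n experimental-arm survival times.
    The control arm is obtained with delay=0 or hr=1."""
    rng = random.Random(seed)
    return [delayed_effect_time(rng.random(), lam_control, delay, hr)
            for _ in range(n)]
```

For example, with λ = 0.1, τ = 3 and HR = 0.5, a draw u = e^{-0.5} has cumulative hazard 0.5, of which 0.3 is spent before the delay, so the event time is 3 + 0.2/0.05 = 7; with HR = 1 the same draw would give t = 5, illustrating how the delayed benefit stretches the tail of the survival curve.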