Properties of the weighted log-rank test under delayed effects assumption in the design of confirmatory studies with delayed effects

06/29/2018, by Jose L. Jimenez, et al.

Proportional hazards are a common assumption when designing confirmatory clinical trials in oncology. This assumption not only affects the analysis part but also the sample size calculation. The presence of delayed effects causes a change in the hazard ratio while the trial is ongoing: at the beginning we do not observe any difference between treatment arms, and only after some unknown time point do the differences between treatment arms start to appear. Hence, the proportional hazards assumption no longer holds, and both the sample size calculation and the analysis methods to be used should be reconsidered. The weighted log-rank test allows weighting of early, middle and late differences through the Fleming and Harrington class of weights, and is proven to be more efficient when the proportional hazards assumption does not hold. The Fleming and Harrington class of weights, along with the estimated delay, can be incorporated into the sample size calculation in order to maintain the desired power once the treatment arm differences start to appear. In this article, we explore the impact of delayed effects in group sequential and adaptive group sequential designs, and make an empirical evaluation, in terms of power and type-I error rate, of the weighted log-rank test in a simulated scenario. We also give some practical recommendations regarding which methodology should be used in the presence of delayed effects depending on certain characteristics of the trial.


1 Introduction

In drug development, randomized controlled trials remain the gold standard to confirm efficacy and safety of novel drug candidates. Often phase III trials embed formal interim analyses to allow studies to be stopped earlier for futility if the novel drug is not efficacious or for efficacy if the treatment effect is overwhelmingly positive.

Immuno-oncology (IO) is a rapidly evolving area in the development of anti-cancer drugs. IO agents can have an effect on both the human immune system and the tumor microenvironment. By doing so, the tumors may be eradicated from the host or disease progression may be delayed. The effect of an IO agent is not typically directed at the tumor itself; it instead boosts or releases the brake from the patient’s immune system, and this positive effect may not be observed immediately. The lag between the activation of immune cells, their proliferation and the impact on the tumor is described in the literature as a delayed treatment effect. Some patients may not derive clinical benefit before their disease progresses, while others may derive sustained response or control of their disease. The primary endpoints often used for confirmatory phase III studies in oncology are time-to-event endpoints: progression free survival (PFS) and overall survival (OS). PFS is defined as the time from randomization until disease progression or death, and OS is defined as the time from randomization until death from any cause. The delayed treatment effect may translate to inferior or equal PFS or OS compared to the control treatment in the first months of therapy and superior survival thereafter, leading to non-proportional hazards between the experimental and control arms of the study. Therefore, the original design based on a proportional hazards assumption will lead to an underpowered study, and hence both the sample size calculation and the analysis methods to be used should be reconsidered [1].

A weighted version of the log-rank test that incorporates the Fleming and Harrington class of weights [2] has been proposed in the literature to increase the power at the end of the trial; its two parameters can be tuned depending on whether early, middle or late differences are expected. However, tuning these parameters is not straightforward, since a misspecification may cause an even larger power drop with respect to the log-rank test.

The Fleming and Harrington class of weights, along with the estimated delay, can be incorporated into the sample size calculation in order to maintain the desired power once the treatment arm differences start to appear (see [3]).

In this article we make an empirical evaluation of the impact of having a delayed effect on power and type-I error rate in the design of a confirmatory phase III study with an IO agent used in combination with a standard of care, assuming a range of delay times. We assess the performance of the weighted log-rank test as an alternative to the log-rank test, given that it allows weighting of late differences and offers a potential power gain under non-proportional hazards. The evaluation is made for both group sequential and adaptive group sequential designs with fixed values of the Fleming and Harrington class of weights. We also give some practical recommendations regarding the methodology to be used in the presence of delayed effects depending on certain characteristics of the trial.

The manuscript is organized as follows. In section 2, we describe the weighted log-rank test, derive the sample size calculation formula needed to incorporate the estimated delay and the Fleming and Harrington class of weights, and introduce the combination test statistic that will be necessary when doing sample size re-assessment. In section 3 we briefly describe group sequential and adaptive group sequential designs, emphasizing two popular methods used to do sample size re-assessment. In section 4, we describe the simulated example.

2 Methods

In this section we describe the statistical methodology we review in this article. In sections 2.1 and 2.2 we present the weighted log-rank test and derive an optimal sample size when using this test following [3]. This sample size derivation is presented as an alternative to Schoenfeld’s formula [4], which is normally used when calculating the necessary sample size in confirmatory trials. In section 2.3 we introduce the combination test statistic, which will be necessary when we perform sample size re-estimation in adaptive group sequential designs.
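For orientation, the sketch below computes Schoenfeld's required number of events under proportional hazards, the benchmark that the derivation in section 2.2 generalizes; the function name and defaults are illustrative and not part of the original manuscript.

#Schoenfeld's number-of-events formula [4] under proportional hazards,
#assuming 1:1 allocation (illustrative helper, not from the original code).
schoenfeld_events = function(hr, alpha = 0.025, power = 0.9) {
  ceiling(4 * (qnorm(1 - alpha) + qnorm(power))^2 / log(hr)^2)
}
schoenfeld_events(6/9)
#Returns 256, close to the 258 events used under proportional hazards in section 5.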

Let t_1 < t_2 < … < t_k be the distinct observed event times, measured from each patient’s enrollment date to the patient’s final event date. Let the number of events at time t_i be denoted as d_i, the total number of patients at risk at that time be denoted as n_i, and the effect delay (in months) be denoted as ε. As previously described, for t < ε both survival curves go in parallel, and once t ≥ ε the survival curves start diverging. Hence, we assume the following density functions f_j(t), survival functions S_j(t) and hazard functions λ_j(t) for the control group (j = 1) and for the experimental group (j = 2); in particular, the hazards are

(1)   λ_1(t) = λ_1 for all t,   λ_2(t) = λ_1 for t < ε and λ_2(t) = λ_2 for t ≥ ε,

where λ_2 < λ_1 so that the post-delay hazard ratio θ = λ_2/λ_1 < 1. This way, we assume a step function for the hazard ratio where from time 0 to ε the hazard ratio is equal to 1, and from time ε onwards the hazard ratio is equal to θ.

In this article we assume that the control group receives the standard of care and the experimental group receives a combination of the standard of care plus the IO agent, which causes the delayed effect. Hence, any observed difference from time 0 until time ε is random. The conclusions we obtain are only applicable to studies where a similar assumption is made. Otherwise, we cannot guarantee that from time 0 to time ε both groups have a common survival function.
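As a minimal sketch of this data-generating model (mirroring the appendix code, with illustrative names), survival times under the delayed-effect assumption can be drawn by exploiting the memorylessness of the exponential distribution:

#Draw n survival times with hazard l1 before the delay eps and hazard l2
#afterwards (piecewise exponential); l1, l2 and eps are illustrative names.
r_delayed = function(n, l1, l2, eps) {
  t = rexp(n, rate = l1)                      #pre-delay hazard applies first
  late = t > eps                              #survivors of the delay period
  t[late] = eps + rexp(sum(late), rate = l2)  #restart with the post-delay hazard
  t
}
#Example: control-like hazard log(2)/6 before a 2-month delay, log(2)/9 after.
summary(r_delayed(1e4, log(2)/6, log(2)/9, eps = 2))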

2.1 Weighted log-rank test

The weighted log-rank test is defined as

(2)   Z = Σ_i w(t_i) (d_{1i} − e_{1i}) / sqrt( Σ_i w(t_i)² v_{1i} ),

where d_{1i} is the observed number of events in the control group at time t_i, e_{1i} = d_i n_{1i}/n_i is its expectation under the null hypothesis, and v_{1i} = d_i (n_{1i}/n_i)(1 − n_{1i}/n_i)(n_i − d_i)/(n_i − 1) is the corresponding hypergeometric variance, with n_{1i} the number of control patients at risk at t_i. The statistic Z is approximately standard normal

under the null hypothesis

H0: S_1(t) = S_2(t), i.e., a hazard ratio equal to one.

[2] proposed the use of two parameters, ρ ≥ 0 and γ ≥ 0, to weight early, middle and late differences through the class of weighted log-rank tests, where the weight function at a time point t_i is equal to

(3)   w(t_i) = Ŝ(t_i−)^ρ (1 − Ŝ(t_i−))^γ,

where Ŝ corresponds to the Kaplan-Meier estimator of the pooled survival function, evaluated just before t_i.

Depending on the values of ρ and γ, we will have different weight functions that will emphasize early differences (e.g., ρ = 1, γ = 0), middle differences (e.g., ρ = 1, γ = 1) or late differences (e.g., ρ = 0, γ = 1) in the hazard rates or the survival curves. The parameter combination ρ = 0, γ = 0 attributes equal weight to all data values and hence does not emphasize any survival differences between treatment arms. Moreover, with this parameter combination, (2) corresponds to the usual log-rank test.
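A minimal sketch of the weighted log-rank statistic (2) with the Fleming-Harrington weights (3) is given below, computed directly from the definitions; the function name and arguments are illustrative, and the sign of the statistic depends on which arm is coded as group 1 (the appendix code compares its absolute value with the rejection boundary).

#Fleming-Harrington weighted log-rank statistic from definitions (2) and (3);
#rho = 0 and gamma = 0 gives the usual log-rank test.
fh_wlr = function(time, status, group, rho = 0, gamma = 1) {
  t_ev = sort(unique(time[status == 1]))  #pooled distinct event times
  S_pool = 1                              #pooled Kaplan-Meier estimate S(t-)
  num = 0; den = 0
  for (t in t_ev) {
    at_risk = time >= t
    n_i  = sum(at_risk)                   #patients at risk at time t
    n_1i = sum(at_risk & group == 1)      #at risk in the arm coded 1
    d_i  = sum(time == t & status == 1)   #events at time t
    d_1i = sum(time == t & status == 1 & group == 1)
    w = S_pool^rho * (1 - S_pool)^gamma   #weight evaluated at S(t-)
    num = num + w * (d_1i - d_i * n_1i / n_i)
    if (n_i > 1)
      den = den + w^2 * d_i * (n_1i / n_i) * (1 - n_1i / n_i) * (n_i - d_i) / (n_i - 1)
    S_pool = S_pool * (1 - d_i / n_i)     #update the pooled Kaplan-Meier after t
  }
  num / sqrt(den)                         #approximately N(0,1) under the null
}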

As mentioned by [3], since we focus on the entire survival curve rather than only the late difference, valid inference requires pre-specification of ρ and γ prior to any data collection.

Prior specification of ρ and γ is always advisable for the trial’s integrity, although some authors (see e.g., [5]) note that the value of (ρ, γ) can be modified at the interim analysis without type-I error rate inflation. At the end of the trial, we are interested in estimating the hazard ratio across the entire study, which is obtained through the standard Cox model [6]. Note, however, that there will be a disconnect between the hazard ratio (i.e., the standard Cox model) and the weighted log-rank test. To obtain an estimate based on the Cox model that corresponds to the weighted log-rank test, see [7].

In this article we focus on the use of the weighted log-rank test in confirmatory trials with delayed effects. Other areas of use may include treatment switching, which is sometimes present in confirmatory trials and also induces non-proportional hazards (see [8]). However, it is out of the scope of this article to evaluate the performance of the weighted log-rank test in the presence of treatment switching, and further research on this matter would be necessary.

2.2 Sample size derivation for the weighted log-rank test

We introduce the optimal sample size derivation proposed by [3]. Assume that we recruit patients over a fixed accrual period at a certain rate in a confirmatory trial where we aim to compare survival time between two groups (j = 1, 2): a control group with a constant hazard over time, and an experimental group with a hazard that changes over time. The final analysis is performed at a pre-specified time after the first patient is enrolled, and the study period is partitioned into subintervals of equal length. For each group j we consider its hazard function and the expected number of patients at risk in each subinterval.

[9] showed that the weighted log-rank statistic is normally distributed with unit variance and approximate expectation of

(4)

where the quantity defined in

(5)

involves the allocation ratio of each group and the Fleming-Harrington class of weights evaluated at the pooled survival function. Even though the weight function was originally defined with the pooled survival function [10], [3] uses an average of the two group-specific survival functions as a substitute for it. However, as stated by [3], equation (4) can be equivalently expressed as

(6)

where

(7)

Assuming that the weighted log-rank statistic is normally distributed with the mean given in (6) and unit variance, then for a power equal to 1 − β and one-sided significance level α we have

(8)

where z_{1−α} and z_{1−β} correspond to the (1 − α)-th and (1 − β)-th percentiles of the standard normal distribution, respectively. The required sample size is calculated as

(9)

and the total expected number of events is obtained by multiplying this sample size by the overall probability of observing an event during the study.

2.3 Test statistic

We aim to test the null hypothesis that the hazard ratio is equal to one against the one-sided alternative that the experimental treatment is superior. In the context of group sequential designs, since we are only interested in early efficacy testing, we make use of the well-known classical group sequential design methodology (see [11]) together with the O’Brien and Fleming rejection boundaries. In the context of adaptive group sequential designs, we make use of the independent increment property of the inverse normal method, which is an efficient way of incorporating data of patients who were censored at the interim analysis while ensuring type-I error rate control (see [12]). The test statistic is defined as

(10)   Z = w_1 Φ⁻¹(1 − p_1) + w_2 Φ⁻¹(1 − p_2),

where p_1 and p_2 denote the separate stage p-values from stages 1 and 2, Φ⁻¹ denotes the inverse of the standard normal distribution function, and w_1 and w_2 are pre-specified weights such that w_1² + w_2² = 1, with w_1 = sqrt(d_1/(d_1 + d_2)) and w_2 = sqrt(d_2/(d_1 + d_2)), where d_1 and d_2 represent the number of events observed in each stage. The null hypothesis will be rejected at one-sided level α if Z > z_{1−α}.
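A minimal sketch of the inverse normal combination (10) follows, assuming the stage weights are derived from the stage-wise numbers of events d1 and d2 as described above; names are illustrative.

#Inverse normal combination of two stage-wise one-sided p-values.
inverse_normal = function(p1, p2, d1, d2, alpha = 0.025) {
  w1 = sqrt(d1 / (d1 + d2))               #weights satisfy w1^2 + w2^2 = 1
  w2 = sqrt(d2 / (d1 + d2))
  z  = w1 * qnorm(1 - p1) + w2 * qnorm(1 - p2)
  c(z = z, reject = z > qnorm(1 - alpha))
}
#Example: p1 = 0.04 based on 194 first-stage events, p2 = 0.02 based on 64 further events.
inverse_normal(0.04, 0.02, 194, 64)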

However, the inverse normal method is in general not valid when doing sample size re-assessment if the adaptations depend on endpoints such as OS or PFS (see [13]). We use the approach proposed by [14] where, in equation (10), the first stage p-value is defined by the cohort of patients included before the interim analysis and is calculated only at the end of the trial. This allows the inclusion of all the events, but it prohibits early stopping for efficacy. See [15] for a detailed review of the existing methods on this matter.

3 Group sequential and adaptive group sequential designs

In this section we aim to briefly describe how group sequential and adaptive group sequential designs work. For a detailed definition and explanation of this methodology see [11].

3.1 Group sequential designs

The formulae presented in section 2.2 allow us to obtain a sample size that maintains an acceptable power at the end of the trial in the presence of delayed effects. However, a key condition is to have some knowledge about the delay of the drug. Assuming we have this knowledge when designing the confirmatory trial, we can implement a group sequential design with an interim analysis for efficacy. Note that an interim analysis for futility is not advised in the presence of delayed effects because of the high risk of stopping the study for futility even in scenarios that favor the alternative hypothesis.

A group sequential design with one interim analysis for efficacy is graphically described in Figure 1.

Figure 1: Graphical representation of a group sequential design with an interim analysis for efficacy, with pre-specified efficacy boundaries at the interim and final analyses.

3.2 Adaptive group sequential design

Even though the sample size derivation described in section 2.2 guarantees that, for a pre-specified effect delay, we will have an acceptable power at the end of the trial while controlling the type-I error rate, we may have misspecified the delay value, or this value may be unknown. Either way, an adaptive group sequential design that allows interim analyses and sample size re-assessment would be useful in case we expect a lack of statistical power at the end of the trial given the results at the interim analyses. Hence, with this design we aim to recover the power lost due to misspecification of the delay. As explained in section 2.3, to maintain type-I error rate control when the sample size criterion is based on survival endpoints, the interim analysis is only used to perform a sample size re-assessment and not for early stopping. Because we need to distinguish between the effect at the interim analysis and the effect at the final analysis, in what follows we refer separately to the hazard ratio estimated at the interim analysis and to the hazard ratio at the end of the trial.

We now introduce two popular approaches for sample size re-assessment:

3.2.1 Mehta and Pocock’s “promising zone” approach [16]

[16] propose a method that adaptively increases the sample size when interim results are considered “promising”. For that, we compute the conditional power at the interim analysis using the hazard ratio estimated at the interim analysis rather than the true one. The formula for the conditional power is defined as

(11)

If the conditional power is within a certain pre-specified range that we consider promising, we may re-estimate the sample size to recover the power lost due to the effect delay. The selection of this range depends not only on the estimate of the effect delay but also on the budget of the sponsor for this particular trial. For example, if we have an estimated effect delay between 3 and 7 months, but we only have the budget to guarantee 80% power for delays of up to 5 months, the sponsor can choose to stop the trial. Therefore, following [16], we partition the sample space of attainable conditional power values into three zones:

  1. Favorable: We consider the interim results to be in the favorable zone if the conditional power is at least the target power. In this zone, the study is sufficiently powered for the observed interim effect and therefore no sample size re-estimation is required.

  2. Promising: We consider the interim results to be in the promising zone if the conditional power lies between a pre-specified lower bound and the target power. In this zone, the conditional power is close to, but below, the target, the study is not sufficiently powered, and a sample size re-estimation is required. Specifically, the sample size will be increased to

    (12)

    where the increase is capped at the maximum sample size the sponsor is willing to enroll and the re-estimated sample size is chosen so that the conditional power reaches the target. Following [17], it is possible to show that

    (13)
  3. Unfavorable: We consider the interim results to be in the unfavorable zone if the conditional power falls below the pre-specified lower bound of the promising zone. This lower bound is fixed before the trial starts and depends on the prior knowledge about the effect delay. In this zone the interim results are not promising and the sample size will not be re-estimated.

The type-I error rate is controlled following [18], where it is shown that the overall type-I error does not increase if the sample size is only re-assessed when

(14)
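A hedged sketch of this procedure is given below: the conditional power is taken to be the usual current-trend formula based on the interim z-statistic and the numbers of events, which may differ in detail from the exact expression (11) used by [16]; cp_min denotes the pre-specified lower bound of the promising zone and all names are illustrative.

#Current-trend conditional power after d1 of d2 planned events, given the
#interim z-statistic z1 (an assumption standing in for expression (11)).
conditional_power = function(z1, d1, d2, alpha = 0.025) {
  theta_hat = z1 / sqrt(d1)               #estimated drift per event at interim
  1 - pnorm((qnorm(1 - alpha) * sqrt(d2) - z1 * sqrt(d1) -
             theta_hat * (d2 - d1)) / sqrt(d2 - d1))
}
#Classify the interim result into the three zones of [16].
classify_zone = function(cp, cp_min, target_power = 0.9) {
  if (cp >= target_power) "favorable" else if (cp >= cp_min) "promising" else "unfavorable"
}
#Example: interim z = 1.5 after 194 of 258 planned events, lower bound 0.1.
cp = conditional_power(1.5, 194, 258)
classify_zone(cp, cp_min = 0.1)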

3.2.2 Jennison and Turnbull’s “start small then ask for more” approach [19]

[19] made a detailed analysis of Mehta and Pocock’s “promising zone” approach.

One drawback of the “promising zone” approach is the use of the interim estimate of the treatment effect in the construction of the promising zone and of the sample size increase function. The reason is that this interim estimate is highly variable, and it is used twice in determining the conditional power that underlies the sample size function: first through the observed interim test statistic and second when evaluating the conditional power at that estimate. This double use was also pointed out by [20], who recommends a careful inspection of the operating characteristics when relying on the interim estimate in this way.

Another drawback of Mehta and Pocock’s “promising zone” approach is that, despite the type-I error rate being controlled, the restriction shown in (14) means that the gain in power is relatively small for the corresponding increase in the expected sample size. Moreover, [19] demonstrated that other alternatives, such as a fixed sample design and a group sequential design, can have exactly the same power curve with a lower expected sample size around the true value of the treatment effect.

To overcome the last limitation, [19] propose an optimal sample size calculation rule in which we need to find the second-stage sample size that maximizes the objective function

(15)

where the penalty parameter can be considered as “a tuning parameter that controls the degree to which the sample size may be increased when interim data are promising but not overwhelming”.

[19] pointed out that even though the objective function given by equation (15) “concerns conditional probabilities given the interim data, choosing a sample size rule to optimize this objective function also yields a design with an overall optimality property expressed in terms of unconditional power”. They show that

(16)

where the expectation on the right-hand side is taken over the distribution of the interim data under the alternative hypothesis; since we maximize equation (15) for every interim outcome, we also maximize the right-hand side of equation (16). Moreover, it is possible to show that this sample size rule has the minimum expected sample size among all rules that achieve the same power under the alternative.
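A hedged sketch of this rule, reusing the conditional_power helper defined after equation (14) and assuming that objective (15) is conditional power minus a per-event penalty gamma_tune (an assumption; the exact form is given in [19]), could look as follows.

#Grid search for the second-stage total number of events that maximizes
#conditional power minus a penalty proportional to the number of events.
optimal_d2 = function(z1, d1, d2_planned, d2_max, gamma_tune, alpha = 0.025) {
  grid = seq(d2_planned, d2_max)
  obj  = sapply(grid, function(d2)
    conditional_power(z1, d1, d2, alpha) - gamma_tune * d2)
  grid[which.max(obj)]
}
#Example: interim z = 1.5 after 194 events, 258 planned, cap at 3 * 258 events.
optimal_d2(1.5, 194, 258, 3 * 258, gamma_tune = 0.0005)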

In algorithms 1 and 2 we describe how to implement the reviewed methodology in case the sample size needs to be re-assessed.

1:procedure
2:     Recruit up to the originally planned number of patients and, when the planned first-stage number of events is observed, perform the interim analysis and compute the conditional power (CP).
3:     If the conditional power falls in the promising zone, calculate the number of events and the total sample size necessary for the second stage.
4:     Recruit patients until the (possibly re-assessed) total number of events is observed.
5:     Compute the combination test statistic Z at the final stage, where the first-stage p-value uses only the patients enrolled before the interim analysis.
6:     if Z exceeds the final rejection boundary then
7:         Outcome: the null hypothesis is rejected at the final analysis
8:     else
9:         Outcome: the null hypothesis is not rejected at the final analysis
10:     end if
11:     return Outcome
12:end procedure
Algorithm 1 Group sequential adaptive design using Mehta and Pocock’s “promising zone” approach.
1:procedure
2:     Recruit up to the originally planned number of patients and, when the planned first-stage number of events is observed, perform the interim analysis.
3:     Calculate the number of events and the total sample size necessary for the second stage by maximizing the objective function (15).
4:     Recruit patients until the re-assessed total number of events is observed.
5:     Compute the combination test statistic Z at the final stage, where the first-stage p-value uses only the patients enrolled before the interim analysis.
6:     if Z exceeds the final rejection boundary then
7:         Outcome: the null hypothesis is rejected at the final analysis
8:     else
9:         Outcome: the null hypothesis is not rejected at the final analysis
10:     end if
11:     return Outcome
12:end procedure
Algorithm 2 Group sequential adaptive design with one interim analysis (used for sample size re-assessment) using Jennison and Turnbull’s “start small then ask for more” approach.

4 Simulation setup

We implement the methodology described in sections 2 and 3 on a scenario that tries to imitate a realistic phase III trial with delayed effects in oncology.

Survival data for the control arm are simulated using an exponential distribution, while data for the experimental arm are simulated using a piecewise exponential distribution (see equation (1)). Under proportional hazards, we assume that the control arm has a median survival of 6 months while the experimental arm has a median survival of 9 months; hence, the hazard ratio is equal to 0.667. However, in the presence of delayed effects we assume a step function for the hazard ratio: it is equal to 1 until a certain time point ε and reaches its full effect after ε. This means that while the control arm keeps its median survival of 6 months, the median survival of the experimental arm will no longer be 9 months because of the delayed effect.
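For illustration, under the piecewise-exponential model of section 2 the experimental arm’s median survival can be computed in closed form; the helper below (illustrative, not from the original code) shows that with a 2-month delay the experimental median drops from 9 to 8 months.

#Median survival of the experimental arm under a delay of eps months, assuming
#a pre-delay hazard of log(2)/6 (control) and a post-delay hazard of log(2)/9.
delayed_median = function(eps, med_ctrl = 6, med_exp = 9) {
  l1 = log(2) / med_ctrl                  #pre-delay hazard
  l2 = log(2) / med_exp                   #post-delay hazard
  if (exp(-l1 * eps) <= 0.5)              #median reached before the delay ends
    return(log(2) / l1)
  eps + (log(2) - l1 * eps) / l2          #solve S(t) = 0.5 beyond the delay
}
delayed_median(2)                         #8 months instead of 9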

We establish a total study duration of 25 months, a total enrollment period of 17.5 months, randomization ratio 1:1, a power of 90% and a one-sided level of 2.5%.

Clinical trial enrollment follows a Poisson distribution with a rate of 10 patients per month. Plotting the cumulative distribution function of a Poisson distribution with these characteristics using, for instance, the R function ecdf(), it is straightforward to see that after 17.5 months almost all the patients, if not all, are enrolled in the trial. Results are obtained by running 200,000 simulated trials. R code showing how to simulate survival data in the presence of delayed effects is given in the appendix.

In Table 1 we show the information fraction, the cumulative α spent, the O’Brien and Fleming efficacy boundaries, and the boundary crossing probability at each look. Recall that these boundaries are only used in the context of group sequential designs where the sample size is not re-assessed, and they are calculated based on the information fraction only. If the sample size needs to be re-assessed, we employ different methodology (see section 2.3).

Look #   Information fraction   Cumulative α spent   Efficacy boundary (Z)   Boundary crossing probability (incremental)
1        0.75                   0.01                 2.34                    0.688
2        1                      0.025                2.012                   0.212
Table 1: Information fraction, cumulative α spent, efficacy boundaries, and boundary crossing probability at each analysis in the group sequential design we use as an example.
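The manuscript does not state which software produced these boundaries; one way to obtain O’Brien and Fleming type boundaries at information fractions 0.75 and 1 is the gsDesign package with a Lan-DeMets O’Brien-Fleming spending function (an assumption), which closely reproduces the values in Table 1.

#Two-look one-sided design, alpha = 0.025, 90% power, interim at 75% information.
library(gsDesign)
d = gsDesign(k = 2, test.type = 1, alpha = 0.025, beta = 0.1,
             timing = c(0.75, 1), sfu = sfLDOF)
d$upper$bound                             #efficacy boundaries, approx. 2.34 and 2.01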

For both the group sequential and the adaptive group sequential designs, we estimate the empirical power and the empirical type-I error rate at the final analysis. In the context of group sequential designs, let Z_2 be the Z-statistic obtained at the end of the trial and z_2 be the efficacy boundary of the final analysis presented in Table 1. In scenarios under the alternative hypothesis, the empirical power is defined as

(17)   P(Z_2 > z_2), estimated by the proportion of simulated trials in which Z_2 exceeds z_2,

whereas in scenarios under the null hypothesis, (17) is the empirical type-I error rate. In the context of group sequential adaptive designs, in equation (17), Z_2 needs to be substituted by the combination test statistic and z_2 by the corresponding critical value in order to implement the inverse normal method described in section 2.3.

5 Results

In this section we evaluate the repercussion of delayed effects on the power and the type-I error rate in group sequential and adaptive group sequential designs. The results presented in this section are based on the simulated scenario described in section 4.

Because one of the purposes of this work is to make a comparison between the log-rank test and the weighted log-rank test, in Table 2 we show, for different delay times, the required number of events and the sample size using the parameter values (ρ = 0, γ = 0) and (ρ = 0, γ = 1), following the formulas presented in section 2.2. As we can see, under proportional hazards the parameter combination (ρ = 0, γ = 0) is more efficient since it requires 258 events, whereas the parameter combination (ρ = 0, γ = 1) requires 369 events to maintain 90% power. However, with a 5-month delay, the parameter combination (ρ = 0, γ = 1) becomes more efficient since it requires 741 events, whereas the parameter combination (ρ = 0, γ = 0) requires 1436 events to maintain 90% power.

Delay (months)                    0     1     2     3     4     5
# of events   (ρ = 0, γ = 0)    258   359   492   686   986  1436
# of patients (ρ = 0, γ = 0)    330   456   621   860  1228  1777
# of events   (ρ = 0, γ = 1)    369   376   406   468   578   741
# of patients (ρ = 0, γ = 1)    472   478   512   587   719   917
Table 2: Sample size calculation for different effect delay times using the parameter values (ρ = 0, γ = 0) and (ρ = 0, γ = 1) and the sample size formulae reviewed in section 2.2.

5.1 Group sequential design

In Figure 2 we show the empirical power and type-I error rate at the final analysis for a wide range of ρ and γ combinations with the design characteristics presented in section 4, assuming no delayed effect in the sample size calculation. As expected, the results show that the parameter combination (ρ = 0, γ = 0) achieves 90% power and 2.5% type-I error at the final analysis when there is no delay. However, as the delay increases, we observe that its power drops faster than that of other combinations of ρ and γ. Other combinations, such as (ρ = 0, γ = 1), have less power under proportional hazards but maintain higher power as the effect delay increases. These results are expected since low values of ρ and high values of γ weight late differences, which is the situation we recreate in this simulated trial. However, combinations that weight late differences produce a slight type-I error rate inflation, as we can observe in the right image of Figure 2.

Using the methodology described in section 2.2, if we incorporate an estimate of the effect delay in the sample size calculation, we are able to prevent the power from dropping up to that specified delay. This is shown in Figure 3, where for each delay time we calculate the sample size necessary to achieve 90% power taking the delay into account. Moreover, when correctly specifying the effect delay, we observe that high power is no longer achieved only by low values of ρ and high values of γ. However, in terms of type-I error rate, we observe the same slight inflation we observed in Figure 2 for low values of ρ and high values of γ.

To control the type-I error rate, we propose to use an approach similar to the one used by [21] in which, although in a different context, instead of calculating the sample size for α = 2.5%, a slightly lower value of α is fixed so that the final type-I error rate is maintained at 2.5%.

Figure 2: Empirical power and type-I error for a wide range of combinations of ρ and γ at the final analysis with different effect delay times and a unique sample size calculated assuming proportional hazards. In black, the five combinations with the least cumulative power loss over time; in dark grey, the power loss of the log-rank test (ρ = 0, γ = 0) over time; and in light grey, the power loss of the rest of the combinations.
Figure 3: Empirical power (left) and type-I error (right) for a wide range of combinations of ρ and γ at the final analysis with different effect delay times and a different sample size for each delay time. In the left image, in black, the five combinations with the highest mean power over time; in dark grey, the log-rank combination (ρ = 0, γ = 0); and in light grey, the rest of the combinations. In the right image, in black, the type-I error of the five combinations with the highest mean power over time; in dark grey, the log-rank combination (ρ = 0, γ = 0); and in light grey, the rest of the combinations.

5.2 Adaptive group sequential design

In this section we show how, by performing a sample size re-assessment, we can recover some of the power lost due to the delayed effect. As in the previous section, the results presented here make use of the simulated example described in section 4. However, rather than using a wide range of combinations of ρ and γ, we use the combination (ρ = 0, γ = 1), since we believe it is the most suitable combination for this kind of setting.

In Figure 4 we present the empirical type-I error (top-left image), the empirical power (top-right image), the percentage of times we re-adjust the sample size (bottom-left image) and the ratio between the new sample size and the original sample size (bottom-right image) for different effect delays, using the weighted log-rank test with the parameter combination (ρ = 0, γ = 1) and the promising zone approach proposed by [16].

We employ three different promising zone lower bounds (0.5, 0.1, 0.001) and compare their operating characteristics against a design that does not re-assess the sample size. Without any sample size re-assessment, the power falls below 80% for delays longer than 3 months. Using a promising zone lower bound of 0.5, the power falls below 80% after 3.5 months. However, if the promising zone lower bound is 0.1 or 0.001, the power falls below 80% after 4 and 6 months, respectively. As discussed in the literature (see [19]), we corroborate that the gain when using a lower bound of 0.5 is practically negligible and that the greatest gains in power are likely to be found outside the region defined by [16].

In terms of type-I error, we observe that it is perfectly controlled for any value of the promising zone lower bound. However, note that we implemented our previously described proposal in which, instead of calculating the sample size for α = 2.5%, a slightly lower value of α is fixed so that the final type-I error rate is maintained at 2.5%. Otherwise we would see the same slight type-I error rate inflation we identified in Figures 2 and 3 due to the ρ and γ parameters that we employ.

In terms of percent of times we fall in the promising zone, when the lower bound is 0.5, the probability of re-adjusting the sample size reaches its maximum value, which is around 15% at 4 months. If the lower bound is 0.1, the probability of re-adjusting the sample size reaches its maximum value, which is around 35% between 4 and 5 months. Last, if the lower bound is 0.001, the probability of re-adjusting the sample size reaches its maximum value, which is close to 70% at 6 months.

In terms of how much we need to increase the sample size with respect to the original sample size every time we fall in the promising zone, we observe that if the lower bound is 0.5, we need around 1.5 times the original sample size regardless of the delay time. If the lower bound is 0.1, we need around 2.5 times the original sample size, also regardless of the delay time. Lastly, if the lower bound is 0.001, the required increase grows with the delay time, from around 4.5 times the original sample size for the shorter delays considered up to around 9 and even 15 times the original sample size for the longest delays.

It is important to mention that, in practice, a promising zone lower bound of 0.001 may not be feasible to implement given the excessive increase in the number of events needed and the consequent increase in the budget for the trial. However, we believe it is interesting to show that it is possible to maintain a power of 80% for three extra months, regardless of the additional duration and expense of the trial.

Last, in Figure 5 we make a comparison between the approaches of [16] and [19]. We selected the promising zone lower bound of 0.001 because it is the most expensive to put into practice and the one where the greatest differences are observed. As expected, the approach from [19] is able to maintain the same power as the approach from [16]. However, in terms of how much we need to increase the sample size with respect to the original sample size, [19] requires a smaller sample size, especially after 4 months of delay.

Figure 4: Empirical type-I error rate (top left), empirical power (top right), percent of times sample size is reassessed (bottom left) and ratio between the reassessed number of events and the original number of events (bottom right) at different delay times, when the sample size is calculated assuming no delay, using the “promising zone” approach.
Figure 5: Empirical power and ratio between the reassessed number of events and the original number of events when using the approaches from [16] and [19]

6 Practical Considerations

In the previous sections we evaluated the impact of delayed effects in clinical trials and the existing methodology to reduce it. However, we cannot conclude which methodology is better in general terms because the answer will depend on many factors. In this section, we emphasize some practical considerations regarding the use of the presented methodology.

The first question we tackled in this article is the use of the weighted log-rank test versus the log-rank test in group sequential and adaptive group sequential designs. In the presence of known delayed effects, we observed that the weighted log-rank test with parameter values (ρ = 0, γ = 1) is the overall best choice, not only for the analysis but also for the sample size formula. We recall that the use of these parameter values in the weighted log-rank test generates a slight type-I error rate inflation, and hence the value of α needs to be slightly decreased in order to achieve a final type-I error rate of 2.5%.

In cases where the delayed effect is unknown or underestimated in the sample size calculation, there exists methodology that re-adjusts the sample size in order to increase the power at the final analysis. The choice of method will depend on the characteristics of the trial. Of the two methods we evaluated in the article, we observed that the proposal of [19] outperforms the proposal of [16] in the sense that, for the same power, [19] requires a smaller sample size. However, with these approaches it is possible to back-calculate the conditional power at the interim analysis if we know the sample size increase recommended for the second stage of the trial. If this situation does not compromise the integrity of the trial, we recommend the use of [19] as it is proven to be more efficient. However, if the effect at the interim analysis has to remain masked, we propose the use of a modified version of [16], which would work as follows.

We would establish a promising zone, as in the original method, in which we re-calculate the sample size if the conditional power falls within a certain pre-specified range. The original method would calculate a different sample size for each conditional power (or delay time). However, in order to avoid back-calculations based on the second stage sample size, we propose to fix in advance the sample size to be used in the second stage of the trial. To avoid having an underpowered trial, we can fix the sample size increase assuming the lowest possible value for the conditional power (or the highest delay time) of the promising zone. This value would represent the maximum fixed sample size increase, although with this approach, we will be overpowering the trial for almost all values of the conditional power that fall in the promising zone. On the other hand, we can also fix the sample size increase to the highest possible value for the conditional power (or the lowest delay time) of the promising zone. This fixed value would represent the minimum fixed sample size increase, although with this approach, we will be underpowering the trial for almost all values of the conditional power that fall in the promising zone. This modification of the method proposed by [16] is illustrated (using a toy example) in Figure 6.

In this case, even though the “safest” option would always be using the maximum fixed sample size increase, we cannot give a recommendation since a large number of sample sizes between the maximum and the minimum fixed sample size increases can be employed and the choice depends on how much risk of having an underpowered study the sponsor is willing to take.

Figure 6: Fixed sample size increase illustration following a modified version of the “promising zone” proposed by [16].

7 Conclusions

In this article we evaluated the impact of delayed effects, in terms of power and type-I error rate, in phase III clinical trials. We studied the use of the weighted log-rank test as an alternative to the log-rank test in group sequential and adaptive group sequential designs. This includes not only the analysis but also the incorporation of the Fleming and Harrington class of weights, as well as a delay estimate, in the sample size calculations. Also, we reviewed two different sample size re-adjustment methods, and explored which one is more efficient.

Results show that, in the presence of delayed effects, when the sample size is calculated assuming proportional hazards, the weighted log-rank test with parameter values (ρ = 0, γ = 1) was the best overall choice, as it was the one that maintained the highest power as the delay increased. When incorporating the Fleming and Harrington class of weights, as well as a delay estimate, into the sample size calculation, we observed that the power is maintained up to the delay estimate we provided, and the difference in power between parameter combinations was not as big as under the assumption of proportional hazards, although the parameter values (ρ = 0, γ = 1) were overall the best combination. Sample size re-adjustment allows increasing the sample size at the interim analysis to lower the risk of failing to meet the study objective. We explored the operating characteristics of two popular approaches for sample size re-adjustment: the “promising zone” approach by [16] and the “start small then ask for more” approach by [19].

With the proposal from [16] it is possible to maintain the power high enough for the trial to remain valid. However, the proposal from [19] is proven to be more efficient since, for the same power curve, it requires a smaller sample size. Nevertheless, there are situations in which having a “promising zone” may be more beneficial. This is the case when the effect at the interim analysis has to remain masked for integrity reasons. The problem is that it is possible to back-calculate the effect at the interim analysis by knowing the sample size increase. Hence, in this article we propose a modified version of the proposal from [16]. It does not require any modification of the original formulation: if a trial has a conditional power that falls in a pre-specified promising zone, we apply a pre-specified fixed sample size increase that is used regardless of the value of the conditional power, as long as it falls in the promising zone. With this approach, even though we keep the effect masked at the interim analysis, there is a risk of having an underpowered study if the fixed sample size increase is not large enough. However, if we want to avoid that risk, we will need to recruit more patients than necessary, with the associated extra cost.

8 Acknowledgments

We would like to acknowledge Michael Branson for inspiration and initial discussions, and Ekkehard Glimm, Franz König and Thomas Jaki for the valuable advice during the development of this work. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 633567.

References

  • [1] Chen Tai-Tsang. Statistical issues and challenges in immuno-oncology. Journal for Immunotherapy of Cancer. 2013;1:18.
  • [2] Fleming Thomas R, Harrington David P. A class of hypothesis tests for one and two sample censored survival data. Communications in Statistics - Theory and Methods. 1981;10:763–794.
  • [3] Hasegawa Takahiro. Sample size determination for the weighted log-rank test with the Fleming–Harrington class of weights in cancer vaccine studies. Pharmaceutical Statistics. 2014;13:128–135.
  • [4] Schoenfeld David A. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983:499–503.
  • [5] Lawrence John. Strategies for changing the test statistic during a clinical trial. Journal of Biopharmaceutical Statistics. 2002;12:193–205.
  • [6] Cox David R. Regression models and life-tables. In: Breakthroughs in Statistics: 527–541. Springer 1992.
  • [7] Lin Ray S, León Larry F. Estimation of treatment effects in weighted log-rank tests. Contemporary Clinical Trials Communications. 2017;8:147–155.
  • [8] Bowden Jack, Seaman Shaun, Huang Xin, White Ian R. Gaining power and precision by using model-based weights in the analysis of late stage cancer trials with substantial treatment switching. Statistics in Medicine. 2016;35:1423–1440.
  • [9] Lakatos Edward. Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics. 1988:229–241.
  • [10] Fleming Thomas R, Harrington David P. Counting Processes and Survival Analysis. Vol. 169. John Wiley & Sons 2011.
  • [11] Jennison Christopher, Turnbull Bruce W. Group Sequential Methods with Applications to Clinical Trials. CRC Press 1999.
  • [12] Wassmer Gernot. Planning and analyzing adaptive group sequential survival trials. Biometrical Journal. 2006;48:714–729.
  • [13] Bauer Peter, Posch Martin. Letter to the editor. Statistics in Medicine. 2004;23:1333–1334.
  • [14] Jenkins Martin, Stone Andrew, Jennison Christopher. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharmaceutical Statistics. 2011;10:347–356.
  • [15] Magirr Dominic, Jaki Thomas, Koenig Franz, Posch Martin. Sample size reassessment and hypothesis testing in adaptive survival trials. PLoS ONE. 2016;11:e0146465.
  • [16] Mehta Cyrus R, Pocock Stuart J. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30:3267–3284.
  • [17] Gao Ping, Ware James H, Mehta Cyrus. Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics. 2008;18:1184–1196.
  • [18] Chen YH, DeMets David L, Gordon Lan KK. Increasing the sample size when the unblinded interim result is promising. Statistics in Medicine. 2004;23:1023–1038.
  • [19] Jennison Christopher, Turnbull Bruce W. Adaptive sample size modification in clinical trials: start small then ask for more? Statistics in Medicine. 2015;34:3793–3810.
  • [20] Glimm Ekkehard. Comments on ‘Adaptive increase in sample size when interim results are promising: A practical guide with examples’ by CR Mehta and SJ Pocock. Statistics in Medicine. 2012;31:98–99.
  • [21] Golkowski Daniel, Friede Tim, Kieser Meinhard. Blinded sample size re-estimation in crossover bioequivalence trials. Pharmaceutical Statistics. 2014;13:157–162.

Appendix

In this section of the manuscript we present the R code we used to simulate survival data for a confirmatory trial and to run a group sequential design with one interim analysis for efficacy in the presence of delayed effects.

#The weighted log-rank test below is computed with the function wtdlogrank();
#this function comes from an add-on package that is not specified in the
#manuscript (note that survival::survdiff() covers only the G^rho family of
#weights). The survival package is loaded for Surv().
library(survival)
#Sample size
n = 330
#Overall number of events
nevents = 258
#Number of events at the interim analysis
nevents_look1 = 194
#Enrollment rate
enrollment_rate = 10
#Survival median of the control group
originalMedian1 = 6
#Survival median of the experimental group
originalMedian2 = 9
#Delay
epsilon = 2
#Parameters rho and gamma (remember that rho=0 and
#gamma=0 implies using the usual log-rank test).
prho = 0
pgamma = 0
#Interim analysis rejection boundary
rejectionBoundary1 = 2.339711
#Final analysis rejection boundary
rejectionBoundary2 = 2.011719
#Simulated trial number 'k' (if you want to compute the power
#and type-I error of this design, a for loop over k is needed since this code
#only simulates one clinical trial; the 'next' statement below also assumes
#that the code is wrapped in such a loop).
k = 1
#Result containers (one row per simulated trial; two columns for the
#interim and final analyses).
threshold_censoring = matrix(NA, nrow = 1, ncol = 2)
reject = matrix(0, nrow = 1, ncol = 2)
#We generate the survival times for all the patients. Note that because until
#the delayed effect kicks in the hazard ratio is equal to one, all the
#survival data is initially generated from the same exponential distribution.
event_time = rexp(n, rate = log(2)/originalMedian1)
#We generate the treatment arm.
group = c(rep(0, n/2),rep(1,n/2))
#We generate the enrollment times.
enrollment_times = rpois(n,enrollment_rate)
#We select the treatment arm with immuno-therapy and if the survival
#time is larger than the moment where the delay kicks in (epsilon),
#we resample that value.
for(i in 1:n){
  if(group[i] == 1 & event_time[i] > epsilon){
    event_time[i] =  epsilon + rexp(1, rate = log(2)/originalMedian2)
    }
}
#We create a variable containing what we call pseudo-survival times.
#It contains the sum of the enrollment time plus the survival time and
#we use it to establish the censoring threshold using the number of events
#required to do the interim analysis.
pseudo_surv = enrollment_times+event_time
#INTERIM ANALYSIS
#We sort our pseudo-survival variable and choose the time of
#the observation number that is equal to the number of events
#required to do the interim analysis.
threshold_censoring[k,1] = sort(pseudo_surv)[nevents_look1]
#We create a variable to identify enrolled patients.
enrolled = ifelse(enrollment_times < threshold_censoring[k,1], 1, 0)
#We create a variable that contains the survival times taking
#into account the censoring status.
survival_time = ifelse(pseudo_surv > threshold_censoring[k,1],
threshold_censoring[k,1] - enrollment_times, event_time)
#We create a variable with the censoring status
censor = ifelse(pseudo_surv > threshold_censoring[k,1], 0, 1)
#We put all the variables together and keep only the enrolled
#patients for the analysis.
data.df1 <- data.frame(survival_time,censor,group,enrolled)
data.df1 = data.df1[data.df1[,4]==1,]
#We do the weighted log-rank test to reject or not the null hypothesis.
fit.wlr00_st1 <- wtdlogrank(Surv(survival_time, censor)~group,
data = data.df1, sided = 1, WtFun="FH", param=c(prho,pgamma))
#If z1 > rho1 (according to manuscript notation) we reject the
#null hypothesis that states that the hazard ratio is equal to one.
reject[k,1] = ifelse(abs(fit.wlr00_st1$Z) > rejectionBoundary1, 1, 0)
#This is only to be sure that we account for the rejection at the interim
#of the null hypothesis in the final analysis as well, even though we
#don’t technically do the final analysis because the hypothesis is
#rejected at the interim and the trial stopped.
if(reject[k,1] == 1){
  reject[k,2] = 1
  next
}
#We now continue with the second stage of the trial if
#we didn’t stop for efficacy.
#For the second stage we repeat the same procedure we did for the first stage.
#However, this time we do the calculations with the censoring threshold
#obtained when the total number of events is reached.
threshold_censoring[k,2] = sort(pseudo_surv)[nevents]
enrolled = ifelse(enrollment_times < threshold_censoring[k,2], 1, 0)
survival_time = ifelse(pseudo_surv > threshold_censoring[k,2],
threshold_censoring[k,2] - enrollment_times, event_time)
censor = ifelse(pseudo_surv > threshold_censoring[k,2], 0, 1)
data.df2 <- data.frame(survival_time,censor,group,enrolled)
data.df2 = data.df2[data.df2[,4]==1,]
fit.wlr00_st2 <- wtdlogrank(Surv(survival_time, censor)~group,
data = data.df2, sided = 1, WtFun="FH", param=c(prho,pgamma))
reject[k,2] = ifelse(abs(fit.wlr00_st2$Z) > rejectionBoundary2, 1, 0)