The one–sample log–rank test is the method of choice for single–arm Phase II trials with a time–to–event endpoint. It compares the survival of the patients to a prefixed reference survival curve that typically represents the expected survival under standard of care. First proposed by , its practical implementation including sample size calculation has been described by . The one–sample log–rank test has been criticized on several grounds. First, it has been reported repeatedly in the literature that the classical one–sample log–rank statistic tends to be conservative (see [3, 4]). One reason for the test’s inaccuracy is the dependence between the estimators of mean and variance of the original one–sample log–rank statistic when sample size is small. Several attempts have been made in the literature to correct for this (see [3, 4, 5, 6, 7, 8]). Amongst those, the proposal made by  is presently implemented in the commercial software PASS  for sample size calculation for the one–sample log–rank test. Another, more conceptual point of criticism against the one–sample log–rank test relates to the process of selecting the reference survival curve. It is common practice to choose the reference survival curve in the light of historic data on standard treatment. This implies that the choice of the reference survival curve is itself prone to statistical error which, however, is ignored in the classical one–sample log–rank statistic. As outlined in , this is a general problem in clinical trials with historical controls. Accordingly, common one–sample log–rank tests assume that the reference survival curve is a priori known and deterministic as in [2, 3, 4, 5, 6, 7, 8]. This ignores that the reference curve resulted from an estimation process and complicates interpretation of the test results. Moreover, historic data often suffer from not reflecting recent advances in diagnostics and/or concomitant therapy for standard of care.
To overcome these interpretative limitations, we propose a new one–sample log–rank test that explicitly accounts for the statistical error made in the process of estimating and fixing the reference survival curve. In principle, the new test applies to both historic and prospective comparisons of a new treatment to a standard in the framework of Phase II survival trials. In the latter case, the new test may also be interpreted as a two–sample test for survival distributions.
The paper is organized as follows. After settling notation and the testing problem, we describe the test statistic and its distributional properties. Additionally, we provide sample size calculation methods. Calculation of rejection regions and sample size are based on the approximate distribution of the new test statistic in the large sample limit. Therefore, small sample properties of the new test regarding type I and type II error rate control are studied by simulation, and compared to the classical one–sample log–rank test and the two–sample log–rank test. These simulations and a case study shed light on the inflation of the type I error rate that results from ignoring the sampling variability of the reference curve in the planning phase of a new single–arm trial. We conclude with a discussion of future research. Mathematical proofs are deferred to Appendix A.
2 General Aspects
We consider a survival trial with survival data from two treatment groups A (control intervention, prospectively collected or historic data) and B (experimental intervention, prospectively collected data). Let denote the set of patients from group , the number of such patients, and the total number of patients. In particular, we denote by the treatment group allocation ratio. We denote by or the time from entry to event or censoring for patient from group , respectively. Let denote the minimum of both. As usual, we assume that the and are mutually independent (non–informative censoring). For each , we denote by the –algebra of information available by study time :
Based on the observed data, we calculate the number of events from treatment group up to study time as
and the number at risk by study time in treatment group . Let indicate whether there are still patients at risk in treatment group by study time . As usual, we let denote the hazard of a patient from treatment group . We denote by the corresponding cumulative hazard function for treatment group , respectively. Finally, we denote by , , (or , , ) the density, distribution function and survival function of the time to event (or time to censoring ) in treatment group . Notice that , , , , and , , are assumed to coincide for all patients from the same treatment group .
We will also need the Nelson–Aalen estimator
of the cumulative hazard function for group , and the corresponding estimator of the variance function
We consider , , , and as stochastic processes in study time , adapted to the filtration . Notice that we define whenever formal division of zero by zero occurs in a mathematical expression.
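As a concrete illustration of the two estimators just introduced, the following Python sketch computes the Nelson–Aalen estimate of the cumulative hazard and the accompanying variance estimate in their standard textbook form, i.e. with increments d/Y and d/Y² at each event time, where d is the number of events and Y the number of patients at risk. The function name `nelson_aalen` and the input layout are assumptions made for this illustration; they are not taken from the R code in the S1 File.

```python
from collections import Counter


def nelson_aalen(times, events):
    """Return a list of (t, cumulative hazard, variance estimate) per event time.

    times  -- observed times (minimum of event time and censoring time)
    events -- 1 if the event was observed, 0 if the observation is censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    cum_hazard, variance, steps = 0.0, 0.0, []
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for x in times if x >= t)
        d = deaths[t]
        cum_hazard += d / at_risk        # Nelson-Aalen increment
        variance += d / at_risk ** 2     # increment of the variance estimate
        steps.append((t, cum_hazard, variance))
    return steps
```

Because increments are only added at observed event times, at least one patient is at risk at each step, so no division of zero by zero arises in this sketch.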
2.2 The Testing Problem
We consider testing the null hypothesis that the survival function of patients from the experimental treatment group coincides with the reference curve that is given by the true survival function under standard of care, i.e.
3 The Testing Procedure
The starting point is the stochastic process . When holds true, is (known to be) a mean–zero –martingale. depends on data from the experimental treatment arm , only, and is commonly used as a basis to construct one–sample log–rank tests (see ). Notice, however, the difficulty that depends on the true unknown cumulative hazard function under standard of care. In the context of the classical one–sample log–rank test it is common practice to estimate from historic data, and to identify the obtained estimate with , while treating as a deterministic function. That is, the classical one–sample log–rank test effectively assesses the null hypothesis using the test statistic while pretending that is an a priori known deterministic reference function representing the expected survival under standard of care. This, however, may detract from the actually interesting null hypothesis when random deviation of from is large. To avoid those interpretive difficulties, we here propose to incorporate the process of reference curve estimation into the one–sample log–rank statistic: Replacing with its Nelson–Aalen estimate (see [12, 13, 14]) in the definition of while treating as random, we obtain a new stochastic process that (i) can be calculated from the data, and (ii) may be used as a test statistic for the original null hypothesis as we will see below. Notice that replacing with in increases the variance of the stochastic process since contributes additional variability. Deriving the correct rejection regions thus requires separate consideration which is not covered by the underlying one–sample test methodology. The resulting significance test may also be interpreted as a two–sample survival test, as the reference curve coincides with the true survival function under standard of care. Our procedure defines a general strategy to lift existing methodology for one–sample survival tests to a multi–sample setting for a variety of different design settings as will be further discussed.
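For orientation, the classical one–sample log–rank statistic can be sketched in a few lines of Python. The sketch assumes the commonly used form Z = (O − E)/√E, where O is the observed number of events in the experimental arm and E is the sum of the reference cumulative hazard evaluated at each patient's observed time. The reference cumulative hazard is passed in as a fixed function, which is exactly the simplification criticized in this section. The names `one_sample_logrank` and `ref_cum_hazard` are hypothetical.

```python
import math


def one_sample_logrank(times, events, ref_cum_hazard):
    """Classical one-sample log-rank statistic Z = (O - E) / sqrt(E).

    times          -- observed times in the experimental arm
    events         -- 1 if the event was observed, 0 if censored
    ref_cum_hazard -- function t -> reference cumulative hazard, treated as known
    """
    observed = sum(events)                             # O: observed events
    expected = sum(ref_cum_hazard(t) for t in times)   # E: expected events under H0
    return (observed - expected) / math.sqrt(expected)
```

With an exponential reference of rate 0.5 per year one would call, for example, `one_sample_logrank(times, events, lambda t: 0.5 * t)`. Negative values indicate fewer events than expected under the reference curve.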
3.2 Test Statistic and Significance Test
Consider the –adapted stochastic processes
with , and acc. to (2), (3) and (4). Assume that the null hypothesis holds true. Then by Theorem 1 (see Appendix A) the following applies: (i) is a mean–zero –martingale with asymptotically independent increments, i.e. for any , and are approximately independent when sample size is sufficiently large, and (ii) for each fixed we have in distribution as , where
(see Appendix A, Lemma 1). In particular, the random variable
is approximately standard normally distributed under the null hypothesis. Notice that the parameters and cancel out in the definition of , so that can be calculated from the observed data. Thus an approximate level test of is defined by rejecting whenever , where is the desired two–sided significance level and is the standard normal distribution function.
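The decision rule stated above can be written out as follows. This is a minimal Python sketch that takes an already computed value `z` of the test statistic as input; the function name `two_sided_test` is an assumption for illustration, and the sketch does not reproduce the R code of the S1 File.

```python
from statistics import NormalDist


def two_sided_test(z, alpha=0.05):
    """Return (two-sided p-value, reject?) for a standard normal test statistic."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)    # quantile of order 1 - alpha/2
    p_value = 2 * (1 - std_normal.cdf(abs(z)))      # two-sided p-value
    return p_value, abs(z) > critical
```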
To enable easy application of the proposed significance test in clinical practice, we provide R code (see ) that calculates the value of the test statistic and the corresponding two–sided –value for a given input data set. The R code as well as instructions on how to prepare the input data are given in S1 File.
4 Sample Size Calculation
Sample size is calculated under the proportional hazards planning alternative for some prefixed hazard ratio . By Theorem 2 (see Appendix A), the test statistic from (6) is approximately normally distributed under planning alternative hypothesis with unit variance and mean where
with and denoting the treatment arm allocation ratio. Large negative values of the test statistic support validity of the planning alternative . The power of the trial is thus approximately given by
In practice, the following assumptions on accrual and censoring are commonly made when calculating the required sample size of a survival trial:
Patients enter the trial uniformly between year and year with prefixed constant accrual rate , say, and are then followed up for further years until the time of final analysis in year .
No loss to follow–up, i.e. is uniformly distributed on .
These assumptions amount to
thus further simplifying above expressions for and . For prefixed two–sided significance level , hazard ratio , treatment group allocation ratio , overall accrual rate , length of the follow–up period , and control arm cumulative hazard function , it remains to choose the only remaining free parameter in (8) such that a desired power is achieved. With the parameter calculated this way, the required number of patients to achieve a power of under planning alternative is .
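The accrual and censoring assumptions listed above translate into a simple piecewise linear survival function of the censoring time. The following Python sketch spells this out under the assumption that accrual starts at study time 0, lasts `accrual` years and is followed by `follow_up` years of observation, so that the censoring time is uniform on the interval from `follow_up` to `accrual + follow_up`. The parameter names are chosen for this illustration only.

```python
def censoring_survival(s, accrual, follow_up):
    """P(C > s) for a censoring time C uniform on [follow_up, accrual + follow_up]."""
    if s <= follow_up:
        return 1.0      # nobody is censored before the minimum follow-up
    if s >= accrual + follow_up:
        return 0.0      # everybody is censored by the final analysis
    return (accrual + follow_up - s) / accrual
```

For example, with two years of accrual and three years of follow–up, half of the patients are still under observation four years after their entry.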
5 Simulation Study I: Comparison with the Classical One–Sample Log–Rank Test
In the application of the classical one–sample log–rank test from [1, 2] it is common practice to estimate the standard arm hazard function from historic data, and to choose the obtained estimate as the reference curve, while treating as deterministic. This may lead to type I error rate inflation when the underlying null hypothesis to be tested is , because the random deviation of from is neglected and the variance of the involved test statistics is thus underestimated. The objective of this simulation study I is to quantify the amount of type I error rate inflation in settings of clinical relevance: We study and compare the empirical type I error rates when (i) the classical one–sample log–rank test (without correction for sampling variability of the reference curve) and (ii) the new one–sample log–rank test (with correction for sampling variability of the reference curve) is used to test null hypotheses .
In our simulations, patients were assumed to enter the trial uniformly between year and year with overall accrual rate of per year. Accordingly the calendar times of entry were generated according to a uniform distribution on . After the end of the accrual period, patients were assumed to be followed up for further years, while assuming no loss to follow–up. Accordingly, we set . Survival times in the control intervention group were generated according to a Weibull distribution with prefixed shape parameter and 1-year survival rate . Survival times in the experimental intervention group were generated acc. to a Weibull distribution with , where is the true hazard ratio.
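The data–generating mechanism described above can be sketched in Python as follows. The sketch assumes that the Weibull survival function is parameterized through its shape and its 1–year survival rate, S(t) = S(1)^(t^shape), and that the experimental arm is obtained by raising this survival function to the power of the hazard ratio. All function and parameter names are hypothetical, and the sketch is not the simulation code used for the reported results.

```python
import math
import random


def weibull_time(u, shape, surv_1yr, hazard_ratio=1.0):
    """Inverse-transform a uniform draw u in (0, 1] into a Weibull survival time.

    Survival function: S(t) = surv_1yr ** (hazard_ratio * t ** shape).
    """
    scale = -math.log(surv_1yr) * hazard_ratio   # cumulative hazard at t = 1
    return (-math.log(u) / scale) ** (1.0 / shape)


def simulate_arm(n, accrual, follow_up, shape, surv_1yr, hazard_ratio=1.0, rng=random):
    """Simulate (observed time, event indicator) pairs for one treatment arm."""
    data = []
    for _ in range(n):
        entry = rng.uniform(0.0, accrual)        # uniform calendar time of entry
        censor = accrual + follow_up - entry     # administrative censoring only
        u = 1.0 - rng.random()                   # uniform on (0, 1], avoids log(0)
        event_time = weibull_time(u, shape, surv_1yr, hazard_ratio)
        data.append((min(event_time, censor), int(event_time <= censor)))
    return data
```

Passing a seeded generator, e.g. `rng=random.Random(1)`, makes a simulated data set reproducible.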
To perform the classical one–sample log–rank test, the standard arm data were used to calculate the Nelson–Aalen estimate of . The obtained estimate was then treated as a deterministic function and used as (prefixed, deterministic) reference cumulative hazard function in the classical one–sample log–rank statistic (see [1, 2]). In contrast, the new test was performed according to our previously shown results.
To study the impact of sample size and allocation ratio on the amount of type I error rate inflation, the total sample size of the virtual data sets was chosen as . For each of these total sample sizes we considered allocation ratios . Scenarios with are more likely to reflect common practice as the size of the experimental cohort is typically smaller than the size of the historical control cohort. To study the impact of different shapes of the survival distribution, we considered different values for the Weibull shape parameter from the interval .
For each parameter constellation, we generated 10000 samples of size to which we applied both the new test as well as the classical one–sample log–rank test. The desired two–sided significance level was . Results are shown in Table 1 and discussed below.
Table 1. Empirical type I error rates and for testing using the new test statistic and the classical one–sample log–rank statistic, respectively, for Weibull distributed survival times with shape parameter and 1–year survival rate in the control arm. Theoretical two–sided significance level: . Underlying total sample size of with allocation ratio .
The classical one–sample log–rank test does not account for sampling variability of the reference curve estimate. This leads to type I error rate inflation when the underlying null hypothesis to be tested is . As expected, our simulations support that the amount of type I error rate inflation of the classical one–sample log–rank test is most pronounced when the allocation ratio is large. For any fixed allocation ratio, the inflation slightly decreases with increasing overall sample size but remains at a similar level. For ratios , the true type I error rate is more than three times higher than the desired one ( instead of for ). For low allocation ratios such as or , the actual type I error still exceeds the nominal level, but to an extent that might be acceptable for a Phase II trial ( for and ). With a view to the classical one–sample log–rank test, this supports that the choice of the reference curve should be based on a historic control that is at least 10 times larger than the new experimental trial cohort. Reassuringly, the new test that explicitly accounts for reference curve variability realizes an empirical type I error rate close to the desired in almost all scenarios. Notice that the new test would hardly be applied in the scenario with and as this implies a trial with only. So the entries for and have to be interpreted with care, but are shown for reasons of transparency and completeness. The simulations thus support that neglecting the reference curve variability substantially compromises type I error rate control when testing null hypotheses . Notice that the classical one–sample log–rank test only realizes strict type I error rate control for testing the null hypothesis which, however, detracts from the null hypothesis when random deviation of from is large.
6 Simulation Study II: Comparison with the Two–Sample Log–Rank Test
We proposed a significance test for null hypothesis based on the approximate large sample distribution of the test statistic introduced before. Despite being derived as a one–sample log–rank test with consideration of reference curve variability, the new test may also be interpreted as a two–sample survival test. This simulation therefore aims to study the performance of the new survival test for sample sizes of practical relevance, as compared to the classical two–sample log–rank test (see [16, 17]). Asymptotically (i.e. for sufficiently large sample size) the classical two–sample log–rank test is known to be the optimal test under proportional hazards (PH) alternatives. It is thus of particular interest to compare the performance of both tests under PH alternatives.
In our simulations, patients were assumed to enter the trial uniformly between year and year with overall accrual rate of per year. Accordingly, the calendar times of entry were generated according to a uniform distribution on . Patients were allocated equally to both treatment arms and (allocation ratio ), corresponding to an annual accrual rate of patients per group. After the end of the accrual period, patients were assumed to be followed up for further years, while assuming no loss to follow–up. Accordingly, we set . Survival times in the control intervention group were generated acc. to a Weibull distribution with prefixed shape parameter and 1-year survival rate . To implement the PH condition, survival times in the experimental intervention group were generated according to a Weibull distribution with , where is the true hazard ratio. The true hazard ratio has to be distinguished from the expected hazard ratio , which defines the planning alternative underlying sample size calculation.
The classical two–sample log–rank test serves as reference. Sample size of the virtual trials was thus calculated as follows: In a first step, we used Schoenfeld’s formula from  to calculate the required number of events for the two–sample log–rank test to achieve a power of under the planning alternative for allocated two–sided significance level . The expected number of events under the planning alternative by calendar time is in the standard treatment group A, and in the experimental treatment group B. Solving the condition for the indeterminate yields the required length of the accrual period. The required total sample size is (i.e. per treatment group).
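The first step of this sample size calculation, the required number of events, can be sketched in Python using Schoenfeld's formula in its usual form, d = (z_{1−α/2} + z_{1−β})² / (p(1 − p)(log ω)²), where ω is the planning hazard ratio and p the fraction of patients allocated to one arm. The function name `schoenfeld_events` and its argument names are assumptions for this illustration; the subsequent step of solving for the accrual period is not covered.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.8, alloc_fraction=0.5):
    """Required number of events for the two-sample log-rank test (Schoenfeld)."""
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    denominator = alloc_fraction * (1 - alloc_fraction) * math.log(hazard_ratio) ** 2
    return math.ceil(numerator / denominator)   # round up to whole events
```

With equal allocation the denominator reduces to (log ω)²/4, so halving the log hazard ratio roughly quadruples the required number of events.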
To cover scenarios of larger and smaller sample sizes, we let the expected hazard ratio range in the set . To study the impact of different shapes of the survival distribution, we considered different values for the Weibull shape parameter from the interval . To study the impact of the event rate, we chose a reference arm 1-year survival rate of (Table 2), (see Appendix C) and (see Appendix D).
Table 2. Empirical type I error rates ( and ) and powers ( and ) for the new test and for the classical two–sample log–rank test, respectively, under proportional hazards alternatives for Weibull distributed survival times with shape parameter and 1–year survival rate in the control arm. Theoretical two–sided significance level: . Underlying total sample size (or ) in Scenario 1 (or Scenario 2) calculated to achieve a theoretical power of under the planning alternative for the classical log–rank test using Schoenfeld’s formula (or for the new test using formula (9)).
For each parameter constellation, we generated 10000 samples of size to which we applied both the new test as well as the classical two–sample log–rank test. We finally also used formula (8) to calculate the sample size such that our new test achieves a power of under planning alternative for allocated two–sided significance level of , and then repeated above simulations based on a total sample size of instead of . Reported in Tables 1–3 are the empirical type I and type II error rates for each parameter constellation and test based on a sample size of (Scenario 1) or (Scenario 2).
6.2 Results of the Main Setting (Table 2, Scenario 1)
Reassuringly, for large sample sizes (), both tests preserve the desired significance level and achieve similar power levels close to the desired for all shape parameter values . On closer inspection, one notices that both tests tend to be conservative for small values of , and slightly anti–conservative for larger values of to an acceptable degree (empirical type I error ranging between for and for and ). For the classical two–sample log–rank test, this effect is overlaid by a general tendency towards anti–conservativeness when sample size is small (), resulting in an empirical type I error of up to for the classical two–sample log–rank test when and .
For shape parameters , both tests perform similarly well with empirical type I error rate close to . Interestingly, the new test even surpasses the classical two–sample log–rank test regarding power when . This effect is most pronounced for exponentially distributed survival times (), where the new test achieves a power of up to as compared to for the classical two–sample log–rank test. For shape parameters close to that of the exponential distribution , the new test is observed to show even better type I error rate control than the classical two–sample log–rank test when sample size is small ().
For the extreme scenario of large shape parameters in combination with small sample size , however, the new test is observed to become quite conservative with profound loss in power ( instead of for and ). This is due to the fact that the new test requires estimation of the control arm cumulative hazard function, which seems to fail when sample size of the control arm is small and at the same time early events are rare ( for and ). In contrast, the classical two–sample log–rank test maintains power also in these extreme scenarios, with a tendency towards anti–conservativeness, though.
This behavior of both tests is consistently observed amongst scenarios with different event rates (see the tables in Appendices C and D).
7 Case Study
As seen in the preceding simulations, the type I error of the classical one-sample log-rank test always exceeds the nominal type I error level if the sampling variability of the reference curve is not taken into account. However, the magnitude of this excess depends on the data from the reference cohort as well as the sample size in the new, experimental cohort.
The only difference between the test statistic of the classical one-sample log-rank test
and the new test is the denominator. Let denote the ratio of the standardisations without and with consideration of the sampling variability. The expected level of a two-sided classical one-sample log-rank test with nominal level neglecting the sampling variability is then given by which can be approximated by . Analogously, can be approximated via a first-order Taylor expansion by
which is now a quantity we can estimate from given historical control data and design parameters of a trial.
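The mechanism behind this inflation can be made explicit with a small Python sketch. It assumes that the squared denominator of the new test is the sum of the classical variance and an additional variance component due to the sampling variability of the reference curve, and it treats both variances as known numbers rather than as random quantities; the expectation and the Taylor expansion mentioned above are therefore not reproduced. The function and argument names are hypothetical.

```python
from statistics import NormalDist


def actual_level(var_classical, var_reference, alpha=0.05):
    """Two-sided level attained by the classical test at nominal level alpha.

    var_classical -- variance used by the classical one-sample log-rank test
    var_reference -- additional variance from estimating the reference curve
    """
    std_normal = NormalDist()
    # Ratio of the standardisations without and with sampling variability.
    ratio = (var_classical / (var_classical + var_reference)) ** 0.5
    return 2 * std_normal.cdf(-std_normal.inv_cdf(1 - alpha / 2) * ratio)
```

If the additional variance is zero, the nominal level is recovered; the attained level grows as the additional variance grows relative to the classical variance.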
From the computations in  we get
After another approximation and some computations (see Appendix B), we also get
Under the null hypothesis, this can be estimated by plugging in Kaplan-Meier estimates from the control group for resp. . For a given historical control group, these formulas can now be used to compute the type I error inflation when sampling variability is not taken into account. Of course, the treatment group allocation ratio is essential for the extent of this inflation.
We will now illustrate the influence of basic design parameters on the type I error inflation with a practical example. We employ the setting of the Mayo Clinic trial in primary biliary cirrhosis of the liver (PBC), which is a rare but fatal chronic disease whose cause is still unknown (see ). In this double–blind randomized trial the drug D-penicillamine (DPCA) was compared with a placebo. The study data are publicly available via the survival package in R [20, 15].
Among the 158 patients of the cohort treated with DPCA, 65 died during the trial. The Kaplan-Meier survival curve of these patients can be found in Fig 1. The time scale is given in years. There, we also display the empirical distribution of the censoring variable in this cohort. As we will see below, this distribution also plays a substantial role in our computations. We now suppose that a new treatment becomes available and that the data from this trial are to be used to compare the survival under this treatment to the survival under treatment with DPCA. This is to be accomplished in a trial in which patients are recruited uniformly over an accrual period of length and followed up for an additional period of length . The allocation ratio will again be denoted by . If one cannot find a suitable parametric model to be fitted to the data, the Kaplan-Meier resp. Nelson-Aalen estimates (see Fig 1) are employed as reference curves for the one-sample log-rank test.
Similar to our first simulation study (see Section 5), we investigate the influence of the allocation ratio on the inflation of the type I error level in the first part of our study. We choose , and . The results in terms of the actual type I error level of the one-sample log-rank test can be found on the left hand side of Fig 2. For any fixed , the actual type I error level increases nearly linearly in the range of allocation ratios considered here. So, as a rule of thumb, each additional trial patient raises the level by a fixed number of percentage points. This number, however, seems to depend on the length of the follow-up, where a longer follow-up period leads to steeper increases.
In the second part of this case study, we take a closer look at the role of the trial duration. As already seen in the first part, longer trials lead to a larger inflation of the error levels. To analyse this dependence, we now choose , and . The results can be found on the right hand side of Fig 2. As we can see, trials with a longer total duration () tend to lead to a higher type I error inflation. This effect is most substantial if the total trial duration is close to the longest observation in the reference data set, which amounts to about 12.5 years. In this case, the testing procedure needs to utilize parts of the Nelson-Aalen estimator which are affected by a high amount of variability because of the high number of censored observations. However, the inflated type I error does not behave completely monotonically with respect to either the accrual duration or the follow-up duration . Even if the variance of the classical one-sample log-rank test and the additional variance which is due to the sampling variability (see Appendix A) increase monotonically in and , the ratio can increase if the increase of the former is steeper than the increase of the latter. Nevertheless, there is a clear tendency towards a larger inflation of the type I error if either or increases.
Traditional one–sample log–rank tests compare the survival function of an experimental treatment to a prefixed reference survival curve, which typically represents the expected survival under standard of care. Choice of the reference survival curve is typically based on historic data on standard therapy and thus prone to statistical error. Nevertheless, traditional one–sample log–rank tests do not account for this variance of the reference curve estimator. Here we study and propose a non–parametric one–sample log–rank test that explicitly accounts for sampling variability of the reference curve.
The new test may also be interpreted as a two–sample test for survival distributions, while inheriting the interpretability from the underlying one–sample log–rank test. Admittedly, our simulations suggest that it may be advisable to compare the data of a historical control cohort with the new data in a single-arm Phase II trial via the two-sample log-rank test if one wants to account for sampling variability of the reference curve or in case of allocation ratios close to 1. Nevertheless, in Phase II settings with fast events (Weibull shape parameter ), our simulations reveal the potential of the new test to outperform the classical two–sample log–rank test even under PH alternatives. Neglecting the sampling variability leads to an inflation of the type I error rate. The extent of this inflation depends in particular on the size of the control cohort. A major objective of this work was to investigate how large this control cohort must be chosen so that the type I error inflation remains within an acceptable range. In this regard, our simulations support that the classical one-sample log-rank test is adequate if the historical control cohort is at least about 10 times larger than the new cohort () and the maximum follow-up in the new trial is reasonably small in view of the follow-up duration in the historic cohort (see Section 7).
Conceptually, the proposed new test also sheds light on a general strategy for lifting existing methodology for single–arm survival trials to a randomized, multi–arm setting. This might be of interest for designing confirmatory survival trials with interim analyses. Performing interim analyses in clinical trials is of ethical and economic interest. On the one hand, interim analyses enable faster decisions regarding rejection or acceptance of the underlying null hypothesis when the treatment effect is larger or smaller than initially expected. On the other hand, interim analyses offer the possibility of data–dependent modifications of the trial (e.g. sample size recalculation) in the case of new insights, thus increasing the prospects of the trial. Trial designs with interim analyses offering this kind of flexibility at full type I error rate control are commonly referred to as confirmatory adaptive designs [21, 22]. Whereas methodology for confirmatory adaptive designs is well understood for trials with short–term endpoints as in [23, 24], subtle problems arise for adaptive survival trials. With standard methodology for group–sequential adaptive survival trials from , the degree of flexibility is highly limited. For example, in a survival trial with primary endpoint overall survival (OS), essentially only interim information on the survival status of the patients may be used for design modifications (e.g. sample size recalculation). Further interim information, e.g. on progression status of the patients, must not be used for design modifications in these classical adaptive Phase III survival trials, because otherwise type I error rate inflation may occur (see ). This situation is clinically unsatisfactory.
If a larger degree of flexibility is desired, the patient–wise separation approach as initially proposed by  has to be chosen which, however, either implies neglecting some part of the observed survival data in the test statistic or requires some worst–case adjustments resulting in a conservative design as shown in . To date, no satisfactory methodology for adaptive Phase III survival trials exists that offers larger flexibility while avoiding the problems involved with the patient–wise separation approach. Recently, however, such methodology was proposed for single–arm Phase II survival trials. In , an adaptive one–sample log–rank test was suggested that allows the simultaneous use of several time–to–event endpoints for data–dependent design modifications, while avoiding the problems involved with the patient–wise separation approach. In the same way as the common one–sample log–rank test was lifted to a two–sample setting in this paper, we expect that the multivariate adaptive one–sample log–rank test proposed by  may be lifted to a two–sample setting, thus solving an outstanding problem in the theory of adaptive design methodology. Implementation of this idea, however, is beyond the scope of this paper and will be the subject of an upcoming paper. The objective of this paper is to provide methodology for accounting for sampling variability of the reference curve in classical one–sample log–rank tests, and to show feasibility of the underlying lifting procedure regarding type I and type II error rate control.
The work of Moritz Fabian Danzer was funded by the German Science Foundation (Deutsche Forschungsgemeinschaft, DFG, grant number 413730122).
Appendix A: Proof of Distributional Properties
Theorem 1. Let be given s.t. and assume that the null hypothesis for all is true. Set
Then the following is true:
has asymptotically independent increments, i.e. for all and sufficiently large sample size , the random variables and are approximately independent.
Pointwise, for each , we have as , where is the large sample limit of (existing acc. to Lemma 1 below).
Proof. It is well known that is a mean–zero –martingale with optional covariation . By independence of the summands, it follows that is a mean–zero –martingale with optional covariation . In particular, for any left–continuous –adapted process , is a mean–zero –martingale with optional covariation . Choosing we recover the well–known result that is a mean–zero –martingale with optional covariation . Notice that , where is the Nelson–Aalen estimate of , and . By independence of the treatment groups it follows that
is a mean–zero –martingale with optional covariation . Since is an –stopping–time (see Lemma 2 below), we conclude from appeal to the optional stopping theorem and compatibility of stopping with covariation that the stopped process is a mean–zero –martingale with
To see the last assertion in (A.3), use bilinearity of the covariation operator together with and for by independence of the patients and treatment groups, where for any –adapted process and any –stopping–time we use the common notation .
We are finally interested in the large sample properties of the mean zero –martingale
First notice that the jump size of is bounded by and thus vanishes in the large sample limit , because the jump sizes of and are bounded by , as no two event indicators and jump simultaneously a.s. Making use of (see Lemma 3 below) and noticing that from (4), some algebra shows that has optional covariation . So, by Lemma 1 below, converges pointwise in
in probability to the strictly increasing, deterministic function
. All in all, we conclude by appeal to Rebolledo’s martingale central limit theorem that converges on in distribution to a mean–zero Gaussian martingale with independent increments and variance function .