On variance estimation for the one-sample log-rank test

08/18/2021 · Moritz Fabian Danzer, et al.

Time-to-event endpoints are increasingly popular in phase II cancer trials. The standard statistical tool for such endpoints in one-armed trials is the one-sample log-rank test. It is widely known that the asymptotics ensuring the correctness of this test do not take full effect for small sample sizes. There have already been some attempts to solve this problem. While some do not allow easy power and sample size calculations, others lack a clear theoretical motivation and require further considerations. The problem itself can partly be attributed to the dependence of the compensated counting process and its variance estimator. We provide a framework in which the variance estimator can be flexibly adapted to the situation at hand while maintaining its asymptotic properties. As an example, we suggest a variance estimator which is uncorrelated with the compensated counting process. Furthermore, we provide sample size and power calculations for any approach fitting into our framework. Finally, we compare several methods via simulation studies and the hypothetical setup of a Phase II trial based on real-world data.


1 Introduction

As recently shown [1], time-to-event endpoints such as overall survival (OS) and progression-free survival (PFS) have become increasingly popular in Phase II oncology trials. This may be due to changes in the clinical development of oncology treatments, with new treatment types relying on different mechanisms of action emerging. Additionally, a majority of these Phase II trials are single-armed [1]. However, the traditional testing method for single-armed survival trials, the one-sample log-rank test, is known to be quite conservative [2], especially for small sample sizes. There have been several efforts to solve this problem. Unfortunately, some of these methods are rather complicated, as they require estimation of higher moments, and are not suitable for sample size calculation because the distribution of the test statistic under alternative hypotheses is unknown [3, 4], while others lack a clear theoretical motivation and can be anti-conservative in some cases [5, 6]. Nevertheless, the latter approach sheds light on the previously neglected possibility of including the counting process in the variance estimation.
The conservativeness of the classical approach resp. the anti-conservativeness of the new approach [6] in many scenarios is due to the skewness of the underlying test statistic, which is the ratio of the compensated counting process of observed events and the square root of its variance estimator. The skewness of its distribution under the null hypothesis can in many scenarios be attributed to two different causes:

  1. The skewness of the numerator itself.

  2. The dependence between the numerator and its variance estimator.

While the first problem is difficult to handle, we will focus on the second one. Although Basu’s theorem [7] guarantees independence of the mean and variance estimators for normally distributed data, this does in general not apply in our case. However, the theoretical result on which the asymptotic correctness is based [8] leaves us with one degree of freedom concerning the choice of the variance estimator. We develop a framework in which this property is exploited. The classical one-sample log-rank test and already existing improvements [5, 6] can be embedded into it as special cases. Furthermore, we extend existing methodology [5] to enable sample size and power calculations for any approach fitting into this framework. Building on this, we can find a variance estimator which is uncorrelated with the compensated counting process under the null hypothesis of the testing problem.
Recently developed methodology concerning adaptive [9], multi-stage [10, 11, 12] and multivariate [13] extensions of the one-sample log-rank test in particular requires and benefits from proper behaviour of the underlying simple one-sample log-rank test, as these extensions rely on the concordance of the distribution of the test statistic with the normal distribution even at more extreme quantiles.


One should note that the computations necessary for our suggestions are neither computationally intensive nor do they require any information beyond what is already requested by commercial software such as PASS [14] or nQuery [15]. Hence, existing tools for the planning and execution of the one-sample log-rank test can easily be extended to incorporate any approach fitting into our framework.
The paper is organized as follows. We start in Section 2 by fixing basic notation and revisiting existing methods. We use this in Section 3 to construct our framework. Afterwards, we present power and sample size calculations for this framework. In Section 5 we derive the uncorrelated variance estimator, which also qualifies for our framework. Existing approaches and their properties for small sample sizes are compared by several simulations in Section 6, and we present a real-data example in Section 7 to illustrate an application of the procedure. We close with a discussion in which we offer some advice for the planning and execution of one-sample log-rank tests.

2 Definitions and preliminary considerations

Let (Ω, 𝒜) be the measurable space upon which all random variables are defined. In a study with n subjects, Tᵢ and Cᵢ denote the survival and censoring time of the i-th subject. Additionally, Rᵢ denotes the recruitment calendar time of the i-th patient. In what follows, it will be important to distinguish between censoring that occurs at a given date of analysis and additional random dropouts. The former is given by v − Rᵢ (in individual study time) for any analysis date v, while Cᵢ represents only the latter. We assume Tᵢ, Cᵢ and Rᵢ to be mutually independent for any i and the tuples (Tᵢ, Cᵢ, Rᵢ) to be independent and identically distributed. Of course, at calendar time v, only the censored follow-up time Xᵢ(v) = Tᵢ ∧ Cᵢ ∧ (v − Rᵢ) together with the corresponding event indicator will be observable for any i. Let F, f, S, λ and Λ denote the distribution function, density, survival function, hazard and cumulative hazard of the survival time random variable. Analogously, G and g resp. H and h denote the same entities for the censoring and recruitment random variables. In any of the three cases the well-known dependencies

S(t) = 1 − F(t) = exp(−Λ(t)),  λ(t) = f(t)/S(t),  Λ(t) = ∫₀ᵗ λ(s) ds    (1)

are given. In the statistical testing framework, one naturally deals with two different probability distributions P₀ and P₁ on the measurable space (Ω, 𝒜), which characterize the null resp. the alternative planning hypothesis. If the distributions of certain random variables differ under the two probability measures, the index of the functions referring to the time-to-event variable under investigation will also indicate whether it is the function under the null or an alternative hypothesis. Testing problems can now be defined by the two-sided hypothesis

H: Λ = Λ₀    (2)

which is equal to the intersection of the two one-sided hypotheses

H≤: Λ ≤ Λ₀  and  H≥: Λ ≥ Λ₀.    (3)

In the formulation of the hypotheses, Λ₀ denotes the cumulative hazard function of the survival time random variable under P₀, whereas the trial will be planned under the distribution of the same variable under P₁, which can be characterized by Λ₁.
The number of events observed up to calendar time v is given by

O(v) = ∑ᵢ₌₁ⁿ 𝟙{Tᵢ ≤ Cᵢ ∧ (v − Rᵢ)}.    (4)

The number of events expected under the null hypothesis up to calendar time v is given by

E(v) = ∑ᵢ₌₁ⁿ Λ₀(Xᵢ(v)).    (5)

It is uniquely characterized by Λ₀. We additionally define the compensated counting process

M(v) = O(v) − E(v).    (6)

The following considerations also refer to the null distribution P₀. Obviously, M is a martingale under the null hypothesis w.r.t. the filtration generated by the survival processes of the patients. It follows from Theorem II.5.1 of Andersen et al. [8] that this martingale converges in distribution to a continuous Gaussian martingale as the number of patients tends to infinity. The same theorem also leaves us with two possible choices to estimate the variance of this process because

E(v)/n → V(v) in probability    (7)

as well as

O(v)/n → V(v) in probability    (8)

where V is the non-decreasing covariance function of the limiting process.
If one fixes an analysis date v, both of the trivial choices

V̂(v) = O(v)/n  or  V̂(v) = E(v)/n    (9)

lead to asymptotically correct tests when choosing decision bounds according to a standard normal distribution. The latter corresponds to the choice which has been the historical cornerstone of the one-sample log-rank test [16]. As already mentioned, some problems occur, especially for small sample sizes. There have been some attempts to solve these issues [3, 4, 6, 17]. In one of them [6], the test statistic

Z(v) = (O(v) − E(v)) / √((O(v) + E(v))/2)    (10)

has been suggested.

has been suggested. The asymptotical correctness of this approach can even be generalized as

(11)

for any Of course, this value does not need to be the same for any . We will try to exploit this by choosing an appropriate weight and make suggestions how to alter this weight, depending on the censoring mechanism and the timing of the analysis.
After briefly summarizing the mathematical foundation of this concept, we will address the sample size and power calculations for this approach. Afterwards, we suggest a concrete weight function which is motivated by introducing an estimator of which is uncorrelated with

and evalaute the performance of different approaches w.r.t. empirical type I and type II errors in a simulation study.
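To make this family concrete, the following R sketch (our illustration, not the authors' code; the helper name oslr_stat and the exponential null hazard in the usage example are assumptions) computes the weighted statistic (O − E)/√(w·O + (1 − w)·E), which recovers the classical test for w = 0, Wu's test for w = 1/2 and the pure counting-process variant for w = 1:

    ## Weighted one-sample log-rank statistic: O counts observed events,
    ## E is the estimated compensator sum(Lambda0(follow-up time)).
    oslr_stat <- function(time, status, Lambda0, w = 0) {
      O <- sum(status)           # observed events
      E <- sum(Lambda0(time))    # expected events under the null
      (O - E) / sqrt(w * O + (1 - w) * E)
    }

    ## Usage on simulated data: exponential null with median 1 year,
    ## simultaneous entry and administrative censoring after 2 years.
    set.seed(1)
    n       <- 30
    surv    <- rexp(n, rate = log(2))
    time    <- pmin(surv, 2)
    status  <- as.numeric(surv <= 2)
    Lambda0 <- function(t) log(2) * t
    oslr_stat(time, status, Lambda0, w = 0.5)   # Wu's variant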

3 Theoretical foundations

For any i, let the tuples (Tᵢ, Cᵢ, Rᵢ) of random variables be defined as in the previous section, as well as the functions characterizing their distributions under the different hypotheses. Let (𝓕ᵥ)ᵥ≥₀ be the filtration comprising all the information available at calendar time v, i.e. for any v ≥ 0, the σ-algebra 𝓕ᵥ is generated by the random variables

(12)

for all i. We know that for any i the compensated counting process

Mᵢ(v) = 𝟙{Tᵢ ≤ Cᵢ ∧ (v − Rᵢ)} − Λ(Xᵢ(v))    (13)

and the summed process

M(v) = ∑ᵢ₌₁ⁿ Mᵢ(v) = O(v) − ∑ᵢ₌₁ⁿ Λ(Xᵢ(v))    (14)

are 𝓕-martingales, where Λ is the cumulative hazard function of the survival time random variables [8]. In case of a continuous compensator and a converging accrual process for n → ∞, we can apply Theorem II.5.1 of Andersen et al. [8] to show that for any analysis calendar date v we have the convergence

n^(−1/2) · M(v) → U(v) in distribution    (15)

where U is a continuous Gaussian martingale and V is the covariance function of this Gaussian process. Here, we have

V(v) = 𝔼[Λ(X(v))]    (16)

where X(v) is identically distributed to Xᵢ(v) for any i. In order to construct a statistical testing procedure, we need to standardise (14) and hence to estimate V. Conveniently, the same theorem yields the asymptotic results

O(v)/n → V(v)  and  E(v)/n → V(v) in probability.    (17)

Hence, for any w ∈ [0, 1] we have

V̂_w(v) = (w·O(v) + (1 − w)·E(v))/n → V(v) in probability.    (18)

So, by Slutsky’s theorem, we have

M(v) / √(w·O(v) + (1 − w)·E(v)) → N(0, 1) in distribution    (19)

for any w ∈ [0, 1]. Nevertheless, it is necessary to choose the weights within [0, 1]; otherwise there is a non-zero probability of the variance estimate being negative, yielding an undefined test statistic. As we will see later on, it is advisable to choose the weight depending on the time of analysis if the accrual mechanism and the mechanism of random dropouts are assumed to be known. This leads us to a weight function w: [0, ∞) → [0, 1]. Building on that, we can also define

V̂_w(v) = (w(v)·O(v) + (1 − w(v))·E(v))/n.    (20)

It should be emphasised here that the functional form of w must be determined in advance and must not be changed in the course of the study. Under this condition, distributional convergence is guaranteed pointwise in v according to formula (19), independent of the specific choice of w. By choosing a suitable function w, small sample properties can be improved (see Section 5). Of course, it is easy to embed the original choice and Wu’s suggestion into this framework, as we have w ≡ 0 resp. w ≡ 1/2 in these cases, which lead to

V̂(v) = E(v)/n    (21)
V̂(v) = (O(v) + E(v))/(2n).    (22)

Anyway, for any weight function w we now obtain an almost surely well-defined, asymptotically correct two-sided test with type I error level α of the hypothesis (2) by rejecting it if

|M(v)| / √(w(v)·O(v) + (1 − w(v))·E(v)) > z₍₁₋α/₂₎,    (23)

where M is defined as in (6) resp. (5) with the cumulative hazard function under the null hypothesis plugged in. Analogously, one obtains a one-sided test with type I error level α/2 of the null hypothesis (3) by rejecting it if

M(v) / √(w(v)·O(v) + (1 − w(v))·E(v)) < z₍α/₂₎.    (24)
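As a hedged illustration of the decision rules (23) and (24), reusing the objects from the sketch at the end of Section 2 (the sign convention that small values of the statistic indicate superiority of the new treatment is our assumption):

    alpha <- 0.05
    Z <- oslr_stat(time, status, Lambda0, w = 0.5)
    abs(Z) > qnorm(1 - alpha / 2)   # two-sided rejection as in (23)
    Z < qnorm(alpha / 2)            # left-sided rejection as in (24)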

4 Power and sample size calculation

Fortunately, power and sample size calculations for our general approach can be adapted from the power and sample size calculations for a specific choice of the variance estimator [5]. So that our notation fits that work, we fix an analysis date in advance. By the assumed independence of the censoring and recruitment times, we have

(25)

for any analysis date. To avoid confusion, we use similar naming conventions in this section as in the original work [5]. Hence, under a fixed planning alternative, which can be characterized (among others) by the density or the survival function of the survival random variable, we have:

(26)

The expectation and variance of the compensated counting process under the alternative hypothesis now amount to

(27)

while for our variance estimator with the prespecified weight function we have

(28)

under the alternative hypothesis. Now, it follows from Slutsky’s theorem that

(29)

under the alternative. Given these quantities, we obtain a sample size of

(30)

for a two-sided test with nominal type I error level α and power 1 − β, where z_q denotes the q-quantile of a standard normal distribution. We already recognize that a decrease of the variance estimate in (30) leads to a decrease of the sample size required for fixed type I and type II errors. As typically fewer events are observed than expected under the null hypothesis for all analysis dates under the planning alternative, the estimated variance decreases as more weight is put on the counting process. In terms of (28), it therefore seems advisable to choose a large weight. For the reasons explained in Section 5, however, this deteriorates the testing procedure in terms of the type I error for small sample sizes. Anyhow, our suggestion presented in the next section tries to circumvent one of these issues while attributing a positive value to the weight in order to increase the power.
Please also note that (30) is only an implicit formula if the accrual rate is the quantity given externally in the planning stage of a trial. Nevertheless, one can use this formula and standard numerical methods to solve it in terms of the accrual duration.
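A minimal sketch of this numerical step (assuming a fixed accrual rate in patients per time unit; n_required is a hypothetical stand-in for the right-hand side of (30), which depends on the accrual duration a through the event-time distribution at the analysis date):

    ## Solve rate * a = n_required(a) for the accrual duration a; the
    ## search interval must bracket the root of the implicit equation.
    solve_accrual <- function(n_required, rate, interval = c(0.1, 20)) {
      uniroot(function(a) rate * a - n_required(a), interval)$root
    }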

5 Uncorrelated variance estimator for the one-sample log-rank test

An obvious problem of the test statistics from (9) is the correlation of the numerator and the denominator, which is the square root of the variance estimator of the former. This dependence structure is one of the causes of the skewness of the distribution of the test statistic under the null hypothesis. This problem has been observed in several cases in which numerator and denominator themselves are symmetrically distributed while their correlation causes skewness of the ratio [18, 19, 20]. In particular, positive correlation causes a left skew and negative correlation causes a right skew of the emerging distribution. In the former case, this increases the weight on the left tail and decreases the weight on the right tail, while in the latter case it is just the other way round. Because of this, it is important not only to examine empirical type I error levels of two-sided tests in simulation studies, but also to consider how this empirical type I error is distributed among the two underlying one-sided tests.


Of course, this is not the only problem when evaluating the concordance of the distribution of the test statistic of the one-sample log-rank test under the null hypothesis with the normal distribution for small sample sizes. Another problem is the skewness of the numerator itself, which is not treated here. In what follows, we will try to solve the problem of the correlation between numerator and denominator.

5.1 General considerations

In any trial in which, with probability one, some patient has a positive length of stay, there is for any analysis date a weight w(v) such that

(31)

We will see later that w(v) ∈ [0, 1] for any v for this choice. Together with (11), this yields a consistent variance estimator which is uncorrelated with the martingale in the numerator of the test statistic. As we will see in the simulation scenarios, Wu’s choice w ≡ 1/2 is a good first choice w.r.t. decreasing the correlation between numerator and denominator, but it is not difficult to improve on it in this regard without major disadvantages concerning the power of the trial.
We are looking for a weight w(v) such that

(32)

Obviously, this is equivalent to

(33)

resp.

(34)

The right-hand side of these equations can be rewritten using already derived quantities [2]. Firstly, we have

(35)

Secondly, we have

(36)

where the newly introduced quantity denotes the first summand of the preceding display. The weight is thus obtained by

(37)

One should note that the analysis date also plays a role here, as we can see from (25). For this calculation, we require no further assumptions than those needed for sample size calculation anyway [2]. Also, as laid out in Section 3, misspecifications of the accrual or censoring mechanisms in this calculation do not affect the asymptotic properties of the test. With (37), we can already see that w(v) ≥ 0 for this choice. Also, after two partial integrations, we obtain

(38)

from which we can see that w(v) ≤ 1. In the following subsection, we will elaborate this choice of the weight explicitly for some situations.
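Where evaluating the closed form (37) is inconvenient, the uncorrelated weight can also be approximated by simulation. The sketch below (our illustration; the function name mc_weight, the exponential null hazard and the absence of random dropouts are assumptions) uses the identity w·O + (1 − w)·E = w·M + E with M = O − E, so that Cov(M, w·M + E) = 0 is solved by w* = −Cov(M, E)/Var(M):

    ## Monte Carlo approximation of the weight that makes the variance
    ## estimator w*O + (1 - w)*E empirically uncorrelated with M = O - E
    ## under the null: uniform accrual on [0, a], analysis at calendar
    ## time a + b, exponential null hazard lambda0, no random dropouts.
    mc_weight <- function(n, lambda0, a, b, runs = 10000) {
      O <- E <- numeric(runs)
      for (r in seq_len(runs)) {
        entry <- runif(n, 0, a)              # staggered entry times
        surv  <- rexp(n, rate = lambda0)     # survival under the null
        fup   <- pmin(surv, a + b - entry)   # administrative censoring
        O[r]  <- sum(surv <= a + b - entry)  # observed events
        E[r]  <- sum(lambda0 * fup)          # compensator at analysis
      }
      M <- O - E
      -cov(M, E) / var(M)                    # solves Cov(M, w*M + E) = 0
    }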

5.2 Trials with simultaneous entry of patients

In the hypothetical case of a trial with simultaneous entry and fixed follow-up (until calendar time τ) of the patients and without random dropouts, the weight depends only on τ, and it amounts to

(39)

With (1), we can see that in this case

(40)

and that the function, whose derivative is given by

(41)

is strictly monotonically increasing in τ if the survival distribution has full support on [0, ∞). Hence, the weight for the variance estimator is shifted from the compensator to the counting process with increasing length of the follow-up period. This shift is continuous if the distribution of the survival time is absolutely continuous w.r.t. the Lebesgue measure on [0, ∞). Because of this continuity, there must obviously be a case in which Wu’s choice w ≡ 1/2 is equal to the choice resulting from our calculations. This is exactly the case if

(42)

where W denotes the respective branch of the Lambert W function. Hence, Wu’s choice approximately corresponds to our suggestion if about three fourths of the possible events can be observed. In this light, one should note that the numerical examples given in previous publications on the one-sample log-rank test all deal with cases in which the event under consideration is observed for less than half of the study cohort [2, 3, 4, 6, 17].

5.3 Trials with staggered entry of patients

Commonly, we are given an accrual period of length a and a subsequent follow-up period of length b, i.e. the final analysis is conducted at calendar time a + b. As is common practice, we assume that the patients are recruited uniformly over the accrual period. Hence, the recruitment times are uniformly distributed on [0, a] and, as we again assume that there are no further random dropouts, censoring is purely administrative. For a given accrual duration a, the weight function amounts to

(43)

For any fixed accrual duration a, one can show that the weight is monotonically increasing in the follow-up time b and converges to a limiting value as b → ∞ if the distribution of the survival time has full support on [0, ∞) and is absolutely continuous w.r.t. the Lebesgue measure. In the limit a → 0, the weight coincides with that of the simultaneous-entry case from Section 5.2.
In order to illustrate the change of weight, we show plots of the weight function for different accrual period lengths in Figure 1. In the underlying scenario, the survival time is exponentially distributed with parameter log(2) under the null hypothesis, s.t. the median survival time is 1 year.


Figure 1: Plot of the calculated weight as a function of the follow-up time for different accrual durations if the survival endpoint follows an exponential distribution with parameter log(2).
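A hypothetical way to reproduce such curves with the Monte Carlo helper sketched in Section 5.1 (the grid, sample size and run count are our choices):

    ## Weight as a function of the follow-up time b for accrual duration
    ## a = 1 and an exponential null with rate log(2) (median 1 year).
    b_grid <- seq(0.25, 5, by = 0.25)
    w_hat  <- sapply(b_grid, function(b)
      mc_weight(n = 50, lambda0 = log(2), a = 1, b = b, runs = 5000))
    plot(b_grid, w_hat, type = "l",
         xlab = "follow-up time b", ylab = "weight")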

6 Simulation study

In this section, we want to shed light on the differences between our suggested approach from Section 5 and the other approaches fitting into the framework presented in Section 3 on three different levels. First, we study the correlation of the compensated counting process and the variance estimator and its impact on the skewness of the resulting distribution. Afterwards, we consider a fixed survival scenario in which the follow-up time is varied, to get an impression of how the performance of the different approaches changes with the follow-up time. Finally, we take a look at sample sizes and empirical errors for a wide range of scenarios.
As already explained, a skewed distribution of the test statistic implies a lack of concordance with the normal distribution on both tails. Nevertheless, it is possible that the deviations from the nominal level on both tails cancel each other out, so that the empirical error level seems close to the nominal level although the test misbehaves at both tails. An example of such behaviour can be found in Section 7. Therefore, we primarily focus on the left tail and report empirical errors of the left-sided test, whose rejection would result in the acceptance of the superiority of the new treatment considered in this analysis. As we naturally carry out two-sided tests with a nominal type I error level of 5%, we consider left-sided tests with a nominal level of 2.5% in what follows. All simulations were performed using R, version 4.0.2 [21].

6.1 Correlation of the unstandardised test statistics and its variance estimators

At first, we want to illustrate the problems outlined in the introduction of Section 5. In the scenario used here, the survival time is exponentially distributed such that the median survival time is 2 years. For the chosen accrual and follow-up durations, (43) yields the weight used for our suggested approach. As we can see in the first row of Figure 2, there is an obvious correlation between M as defined in (14) and the variance estimator following the original approach (as in (21)) or Wu’s approach (as in (22)). The empirical correlation in our simulation with 100 000 runs amounts to −0.908 resp. 0.591, while it is only −0.002 for our suggested approach. The resulting skew w.r.t. the normal distribution can be seen in the QQ-plots in the second row of Figure 2. As mentioned before, a negative correlation (as with the original approach) leads to a right skew while a positive correlation (as with Wu’s approach) leads to a left skew. And while the empirical type I errors in terms of the two-sided test look good for all three approaches (5.133% resp. 5.048% resp. 4.997%), there is a noteworthy imbalance between the empirical errors of the two one-sided tests for the first two approaches, as the empirical type I errors for the left-sided test amount to 1.823% resp. 2.856% resp. 2.562%.


Figure 2: Comparison of the original, Wu’s and our suggested choice of the variance estimator. The correlation of the (non-standardised) compensated counting process and the variance estimator for the different approaches can be seen in the first row. This leads to skewed test statistics, as seen in the corresponding QQ-plots in the second row.
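A sketch of this comparison, reusing the simulation scheme from Section 5.1 (the sample size n = 20 and the accrual and follow-up durations are our assumptions, since the original values are not legible above):

    ## Empirical correlations of M = O - E with the three variance
    ## estimators under the null: median survival 2 years, i.e. rate
    ## log(2)/2; assumed accrual over 1 year and analysis 3 years later.
    set.seed(1)
    runs <- 10000; n <- 20; lambda0 <- log(2) / 2
    O <- E <- numeric(runs)
    for (r in seq_len(runs)) {
      entry <- runif(n, 0, 1)
      surv  <- rexp(n, rate = lambda0)
      fup   <- pmin(surv, 4 - entry)
      O[r]  <- sum(surv <= 4 - entry)
      E[r]  <- sum(lambda0 * fup)
    }
    M <- O - E
    c(original = cor(M, E), Wu = cor(M, (O + E) / 2),
      counting = cor(M, O))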

6.2 Variation of follow-up length

Of course, not only our proposed weights but also the properties of the unnormalised martingale depend on the length of the follow-up. For very small sample sizes and either very long or very short follow-up times, the distribution of the compensated counting process at a fixed analysis date is already skewed, and our proposed standardisation procedure is not able to remove this skew. In Figure 3, we compare the four different approaches fitting into the framework developed in Section 3 concerning their empirical type I error for a left-sided test of level 2.5%. The underlying scenario is the same as above: the survival time is exponentially distributed with a median of 2 years and the patients are recruited uniformly over 1 year. While the original approach stays below the nominal type I error for all of the considered follow-up times, Wu’s approach exceeds it. Our suggestion performs quite well in the range in which about one to two thirds of the patients experience an event, whereas Wu’s approach appears to be superior in scenarios with larger event rates.


Figure 3: Comparison of empirical type I errors for one-sample log-rank tests with different variance estimation methods for variable follow-up lengths. In brackets under the x-axis: expected share of patients with events under the null hypothesis.

As we can see from Figure 3, the empirical type I error of our approach shows a slight deviation from the desired nominal level in some cases, too. This can be attributed to the skewness of the distribution of the compensated counting process for small sample sizes, as one can see from the curve for the empirical type I error in case of a standardisation with the true underlying standard deviation. In general, the latter is not advisable, as the true standard deviation cannot be specified without uncertainty (see Section 8). Nevertheless, in a range of practically relevant scenarios, our suggested approach works quite well. All in all, our results suggest that a combination of Wu’s and our variance estimator promises a nearly optimal performance concerning the type I error rate, independent of the event rate. Such a combination corresponds to choosing the weight according to

(44)

The share of patients with events expected under the null hypothesis until the different follow-up times is given under the x-axis of Figure 3.

6.3 Power and sample size

In order to compare the performance of the different approaches concerning sample size, empirical type I and type II error, we conducted a simulation study. To ensure comparability to already existing literature on this topic, we considered scenarios inspired by previous simulation studies [2, 6].
Hence, the survival distribution under the null hypothesis is taken as a Weibull distribution whose distribution function and cumulative hazard function are determined by a given shape parameter and median survival time. We assume that the survival time under the alternative also follows a Weibull distribution with the same shape parameter and a different median survival time, which is determined by the hazard ratio. The censoring mechanism has been implemented as in the previous simulations.
We will investigate the shape parameters 1, 2 and 4 but, different from other publications [2, 6], we will not only consider a fixed median survival time, but the values 0.1, 0.25, 0.5, 1, 2 and 5 (cf. Table 1). On the other hand, we restrict the possible values of the hazard ratio to one small (1.2), one medium (1.5) and one large (2) value.
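Under a standard Weibull parametrisation, such scenarios can be generated as sketched below (our reading: the cumulative hazard is log(2)·(t/m)^k for shape k and median m, and proportional hazards with ratio Δ rescale the median as m₁ = m₀·Δ^(1/k); this relation is our assumption, consistent with the description above):

    ## Weibull with shape k and median m has scale m / log(2)^(1/k).
    weibull_scale <- function(median, shape) median / log(2)^(1 / shape)

    ## Draw survival times under the alternative for a given hazard ratio.
    r_alt <- function(n, shape, median0, hr) {
      median1 <- median0 * hr^(1 / shape)   # median under the alternative
      rweibull(n, shape = shape, scale = weibull_scale(median1, shape))
    }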

                        median survival time under the null hypothesis
shape                 0.1      0.25     0.5      1        2        5
1    event rate     52.98%   57.58%   65.31%   78.96%   91.52%   97.18%
     weight         0.3307   0.3706   0.4481   0.6280   0.8626   0.9599
2    event rate     50.55%   51.40%   52.98%   56.04%   61.85%   67.35%
     weight         0.3114   0.3199   0.3383   0.3897   0.5324   0.8062
4    event rate     48.16%   45.50%   41.39%   34.43%   24.85%   12.89%
     weight         0.2931   0.2750   0.2504   0.2175   0.1873   0.1664

Table 1: Expected event rates under the null hypothesis and weights for our proposed variance estimation for different combinations of shape parameter and median survival time.

The reason for extending the range of median survival times under the null hypothesis can be seen in Table 1. As the first row block indicates, a shape parameter of 1 leads to trials with high event rates: in each of these scenarios, event rates of more than 50% are expected, and two of the six scenarios lead to event rates of more than 90%. By including larger median survival times, we want to broaden the range of event rates expected under the null hypothesis. With this, we can clearly distinguish which approach is most useful in which setting.
As already mentioned, we focus on the empirical errors of the one-sided tests with a nominal level of 2.5%. The sample size was planned with a power of 80%. The results can be found in Table 2. We conducted 100 000 simulation runs for each scenario.

                               shape 1              shape 2              shape 4
 m0   variance estimation    n    type I power    n    type I power    n    type I power

Hazard ratio 1.2
 0.1  compensator          494   .0240  .8047   519   .0230  .8017   545   .0228  .8031
      counting process     435   .0320  .7909   457   .0297  .7896   480   .0305  .7877
      Wu                   465   .0278  .7998   488   .0265  .7960   513   .0259  .7953
      uncorrelated         475   .0268  .8013   500   .0250  .7985   526   .0248  .7987
 0.25 compensator          454   .0226  .8071   510   .0231  .8063   578   .0226  .8053
      counting process     400   .0308  .7925   449   .0311  .7901   509   .0296  .7893
      Wu                   427   .0261  .8010   480   .0268  .7987   543   .0262  .7960
      uncorrelated         434   .0251  .8018   491   .0253  .8009   559   .0242  .7999
 0.5  compensator          398   .0239  .8055   495   .0236  .8059   636   .0226  .8018
      counting process     351   .0309  .7926   436   .0305  .7906   560   .0303  .7882
      Wu                   374   .0274  .7981   466   .0263  .7997   598   .0261  .7966
      uncorrelated         377   .0270  .7993   475   .0255  .8014   617   .0246  .7993
 1    compensator          325   .0226  .8105   466   .0230  .8076   767   .0230  .8026
      counting process     287   .0302  .7943   410   .0300  .7919   675   .0310  .7887
      Wu                   306   .0262  .8016   438   .0262  .7993   721   .0270  .7953
      uncorrelated         301   .0271  .7997   444   .0255  .8009   747   .0246  .7997
 2    compensator          276   .0218  .8114   418   .0230  .8114  1065   .0233  .8053
      counting process     244   .0289  .7959   369   .0300  .7970   937   .0306  .7888
      Wu                   260   .0245  .8030   393   .0260  .8038  1001   .0266  .7977
      uncorrelated         248   .0279  .7977   392   .0261  .8036  1041   .0242  .8019
 5    compensator          258   .0214  .8128   377   .0222  .8091  2057   .0231  .8023
      counting process     228   .0282  .7962   333   .0290  .7953  1810   .0307  .7872
      Wu                   243   .0250  .8052   355   .0253  .8031  1934   .0266  .7931
      uncorrelated         229   .0279  .7974   342   .0273  .7984  2016   .0246  .7985

Hazard ratio 1.5
 0.1  compensator          113   .0202  .8113   119   .0204  .8096   125   .0205  .8108
      counting process      86   .0389  .7801    90   .0376  .7763    95   .0392  .7811
      Wu                   100   .0277  .7988   105   .0278  .7948   110   .0281  .7952
      uncorrelated         104   .0251  .8017   110   .0252  .7998   117   .0246  .8045
 0.25 compensator          104   .0195  .8173   117   .0200  .8110   133   .0206  .8101
      counting process      78   .0364  .7823    88   .0369  .7769   100   .0384  .7761
      Wu                    91   .0271  .7998   103   .0274  .7974   117   .0285  .7956
      uncorrelated          95   .0245  .8073   108   .0244  .8011   124   .0247  .8020
 0.5  compensator           90   .0201  .8151   114   .0203  .8130   147   .0198  .8106
      counting process      68   .0371  .7794    86   .0382  .7771   111   .0372  .7785
      Wu                    80   .0278  .8035   100   .0276  .7961   129   .0269  .7944
      uncorrelated          81   .0267  .8045   104   .0247  .7996   138   .0232  .8020
 1    compensator           73   .0191  .8230   106   .0211  .8125   178   .0208  .8070
      counting process      56   .0359  .7914    81   .0384  .7829   134   .0386  .7754
      Wu                    64   .0264  .8054    94   .0282  .8021   156   .0285  .7932
      uncorrelated          62   .0286  .8020    97   .0265  .8048   168   .0239  .8001
 2    compensator           61   .0193  .8293    95   .0198  .8199   248   .0205  .8067
      counting process      47   .0335  .7979    72   .0350  .7877   186   .0381  .7734
      Wu                    54   .0254  .8154    84   .0264  .8073   217   .0280  .7914
      uncorrelated          49   .0312  .8027    83   .0270  .8049   236   .0234  .8007
 5    compensator           56   .0183  .8283    84   .0195  .8288   480   .0209  .8087
      counting process      44   .0337  .8021    65   .0353  .7979   361   .0385  .7781
      Wu                    50   .0250  .8168    74   .0260  .8128   421   .0285  .7942
      uncorrelated          44   .0330  .7997    69   .0313  .8065   461   .0231  .8048

Hazard ratio 2
 0.1  compensator           46   .0185  .8230    48   .0180  .8172    51   .0169  .8204
      counting process      28   .0502  .7624    30   .0484  .7657    31   .0502  .7597
      Wu                    37   .0302  .7939    39   .0297  .7936    41   .0294  .7922
      uncorrelated          40   .0250  .8048    43   .0244  .8089    45   .0240  .8030
 0.25 compensator           42   .0178  .9269    47   .0183  .8189    54   .0176  .8196
      counting process      26   .0496  .7685    29   .0493  .7630    33   .0506  .7640
      Wu                    34   .0295  .8000    38   .0290  .7937    44   .0294  .7978
      uncorrelated          36   .0258  .8057    42   .0242  .8106    48   .0239  .8040
 0.5  compensator           36   .0180  .8251    46   .0178  .8220    60   .0183  .8192
      counting process      23   .0463  .7768    28   .0491  .7607    37   .0500  .7628
      Wu                    29   .0289  .7964    37   .0293  .7941    48   .0299  .7906
      uncorrelated          30   .0272  .8039    40   .0249  .8032    54   .0234  .8047
 1    compensator           29   .0170  .8393    43   .0179  .8256    72   .0181  .8131
      counting process      18   .0447  .7765    27   .0470  .7728    44   .0516  .7550
      Wu                    24   .0272  .8190    35   .0287  .8013    59   .0310  .7950
      uncorrelated          22   .0310  .8012    37   .0260  .8091    66   .0226  .8029
 2    compensator           23   .0158  .8441    38   .0169  .8353   101   .0181  .8111
      counting process      15   .0410  .7907    24   .0460  .7801    62   .0513  .7555
      Wu                    19   .0257  .8189    31   .0281  .8105    82   .0302  .7894
      uncorrelated          17   .0356  .8167    31   .0289  .8143    94   .0218  .8037
 5    compensator           21   .0157  .8509    33   .0160  .8536   198   .0181  .8152
      counting process      14   .0387  .7988    22   .0401  .8035   121   .0515  .7589
      Wu                    18   .0244  .8348    28   .0255  .8361   161   .0311  .7937
      uncorrelated          15   .0373  .8164    24   .0333  .8140   186   .0219  .8097

Table 2: Sample sizes (n), empirical type I errors and empirical power for the four different approaches to variance estimation in all considered scenarios; in bold: method with the smallest absolute deviation from the nominal type I error level in the respective scenario.

As already seen in the previous subsection, our approach works quite well for small and medium event rates and is the best choice concerning the absolute deviation from the nominal type I error level in most cases, with some exceptions. The first exception concerns scenarios with high event rates, which lead to weights greater than 1/2. One can see from Table 1 to which scenarios this applies. Recall that the computed weight does not depend on the hazard ratio used to plan the trial. Here, the type I error inflation for the uncorrelated variance estimation exceeds the error inflation for Wu’s suggestion. The other exception concerns scenarios with low hazard ratios (which lead to high sample sizes) and rather small event rates. In these cases, the original version of the one-sample log-rank test performs best in terms of the absolute deviation from the nominal type I error.
As one can already see from the flexible sample size formula (30), the approach using only the counting process to estimate the variance requires the smallest sample sizes. Nevertheless, its type I error is inflated in every scenario, ranging from 2.82% to 5.16%. The original variance estimation behaves just the other way round: its required sample size is always the highest, while its type I error is always deflated, ranging between 1.57% and 2.4% depending on the scenario. The remaining approaches lie in between concerning the sample size, whereby our proposed approach requires higher sample sizes than Wu’s if and only if the weight (see Table 1) is smaller than 1/2.
Concerning compliance with the given type II error rate, the uncorrelated variance estimation works best on average, with empirical power lying between 79.7% and 81.7%. Here, too, the highest deviations occur for high event rates, i.e. in scenarios in which Wu’s approach is also superior concerning the empirical type I error. These results confirm that a combination of Wu’s and our variance estimator, as given in (44), promises the most satisfying performance.

7 Practical example

We illustrate the differences between the approaches using a practical example. We employ the setting of the Mayo Clinic trial in primary biliary cirrhosis of the liver (PBC), a rare but fatal chronic disease whose cause is still unknown [22]. In this double-blinded randomized trial, the drug D-penicillamine (DPCA) was compared with a placebo. The study data is publicly available via the survival package in R [21, 23].
Among the 158 patients of the cohort treated with DPCA, 65 died during the trial. For the sake of comparability, we adopt the previous parameter estimation of their survival curve [2], which states that a Weibull distribution with suitably estimated shape parameter and median survival time fits the data well. We now suppose that a new treatment becomes available and that the data from this trial shall be used to compare survival under this treatment to survival under treatment with DPCA. This shall be accomplished in a trial in which patients are recruited uniformly over an accrual period and followed up over an additional fixed period. As in the preceding simulations, the planning hypothesis is supposed to fulfil the proportional hazards assumption, so the hazard ratio is constant over time. A given hazard ratio shall be detected with a power of 80% via a two-sided test with significance level 5%.
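The Weibull fit can be reproduced from the publicly available data. The following sketch (our code, using the survival package; trt == 1 codes D-penicillamine and status == 2 codes death) recovers shape and median from an intercept-only parametric fit:

    ## survreg() uses the accelerated-failure-time parametrisation: the
    ## Weibull shape equals 1/fit$scale and the median survival equals
    ## exp(intercept) * log(2)^fit$scale (in days for the PBC data).
    library(survival)
    dpca <- subset(pbc, trt == 1)
    fit  <- survreg(Surv(time, status == 2) ~ 1, data = dpca,
                    dist = "weibull")
    shp  <- 1 / fit$scale
    med  <- exp(coef(fit)) * log(2)^fit$scale
    c(shape = shp, median_years = unname(med) / 365.25)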

variance estimation     n     α̂ (overall)  α̂ (left)   power
compensator            113      .0493        .0194     .8122
counting process        76      .0568        .0445     .7652
Wu                      95      .0511        .0300     .7922
uncorrelated           106      .0495        .0230     .8039

Table 3: Sample sizes, empirical type I errors and empirical power for the planning of a single-arm survival trial for PBC, where α̂ (overall) denotes the overall empirical type I error and α̂ (left) the part which is due to rejections on the left-hand side.

Equation (43) yields a weight of 0.1923 for this scenario, and hence our suggested test statistic amounts to

(45)

The results of a simulation with 100 000 runs, displayed in Table 3, show that in this real-world example our proposed approach is closest to the nominal type I error level in terms of the empirical type I error of the two-sided test as well as of the left-sided test. The original one-sample log-rank test and Wu’s suggestion look similar to it in terms of the former while performing considerably worse in terms of the latter. The phenomenon of unbalanced left- and right-sided type I errors which largely cancel each other out in their sum is noteworthy here.
Although the sample size for our suggested approach is about 10% higher than for Wu’s suggestion, there is still a saving in sample size compared to the standard approach.

8 Discussion

We introduced a simple but extensive framework enabling a continuum of consistent variance estimators for the one-sample log-rank test. Asymptotic correctness as well as asymptotically correct power and sample size calculations are provided. The classical one-sample log-rank test [16] as well as a practical alternative [6] fit naturally into it. Please note that one could still extend the options for estimating the variance and also use the covariance function itself in the estimation, as it is explicitly given in (16), if the accrual mechanism and the distribution of the censoring times are known. This would yield

(46)

for any analysis date. But it is important to note that a misspecification of the accrual mechanism or of possible additional random drop-outs would lead to a wrong value, s.t. the values on the left- and right-hand side of (46) do not coincide and the asymptotics no longer apply. Therefore, we do not pursue this approach any further but rather focus on the choice made in (18).
In addition, we elaborated only one special choice of the weight function, which guarantees that the variance estimator is uncorrelated with the compensated counting process under the null hypothesis. In several simulations and in an example based on real-world data, we saw that the emerging test is superior to other approaches concerning adherence to the nominal type I error level. This superiority is most remarkable in small trials with small to medium event rates.
Nevertheless, we saw in our simulation studies that Wu’s suggested weight works better than the uncorrelated variance estimation in scenarios with high event rates. To prevent remarkable anti-conservativeness in such cases, one could choose the weight according to (44).
One can also conduct multiple simulations to find the ideal weight for the envisaged scenario, which can cancel out a possible skewness of the test statistic under the null hypothesis. Once the weight is determined this way, the theory from Section 3 provides the asymptotic correctness, and sample size calculation can be done as laid out in Section 4. To avoid anti-conservativeness, one could also carry out an exact calculation of the third moment of the compensated counting process under the null hypothesis and use the uncorrelated estimation only if this moment is positive. This should prevent any left skew, which causes anti-conservativeness on the left-hand side, but still ensure a sample size reduction compared to the classical one-sample log-rank test. If the accrual and censoring mechanisms are again given by uniform accrual during a period of length a and a subsequent follow-up period of length b, the third moment of the compensated counting process under the null hypothesis is given by

(47)

A more cautious planning could also incorporate additional random dropouts. Under the assumption of independence of the censoring and recruitment times, the distribution function of the overall censoring variable at a given analysis date can be derived explicitly. This could in turn be plugged into (38) and would in most cases lead to a lower weight for the counting process and hence to a more conservative approach concerning the distribution of the test statistic.
In conclusion, our framework yields a solid foundation for these and possibly further considerations, including extensions to multi-stage [10, 11, 12], multivariate [13] and other variations [24] of the classical one-sample log-rank test.

Acknowledgments

The work of the corresponding author was supported by the German Science Foundation (DFG, grant number 413730122).

References

  • [1] Ivanova A et al. Nine-year change in statistical design, profile, and success rates of Phase II oncology trials. Journal of Biopharmaceutical Statistics 2016; 26(1): 141–149.
  • [2] Wu J. Sample size calculation for the one-sample log-rank test. Pharmaceutical Statistics 2015; 14(1): 26–33.
  • [3] Tu D and Gross AJ. A Bartlett-type correction for the subject-years method in comparing survival data to a standard population. Statistics & Probability Letters 1996; 29(2): 149–157.
  • [4] Sun X, Peng P and Tu D. Phase II cancer clinical trials with a one-sample log-rank test and its corrections based on the Edgeworth expansion. Contemporary Clinical Trials 2011; 32(1): 108–113.
  • [5] Wu J. Single-arm Phase II cancer survival trial designs. Journal of Biopharmaceutical Statistics 2016; 26(4): 644–656.
  • [6] Wu J. A new one-sample log-rank test. Journal of Biometrics and Biostatistics 2014; 5(4): 1–5.
  • [7] Basu D. On Statistics Independent of a Complete Sufficient Statistic. Sankhyā: The Indian Journal of Statistics (1933-1960) 1955; 15(4): 377–380.
  • [8] Andersen PK et al. Statistical Models Based on Counting Processes. Springer, 1993.
  • [9] Schmidt R, Faldum A and Kwiecien R. Adaptive designs for the one-sample log-rank test. Biometrics 2018; 74(2): 529–537.
  • [10] Shan G and Zhang H. Two-stage optimal designs with survival endpoint when the follow-up time is restricted. BMC Medical Research Methodology 2019; 19(1): 74.
  • [11] Belin L, De Rycke Y and Broet P. A two-stage design for phase II trials with time-to-event endpoint using restricted follow-up. Contemporary Clinical Trials Communications 2017; 8: 127–134.
  • [12] Kwak M and Jung S. Phase II clinical trials with time-to-event endpoints: Optimal two-stage designs with one-sample log-rank test. Statistics in Medicine 2014; 33(12): 2004–2016.
  • [13] Danzer MF, Terzer T, Berthold F, Faldum A and Schmidt R. Confirmatory adaptive group sequential designs for single-arm phase II studies with multiple time-to-event endpoints. Biometrical Journal 2021; DOI:10.1002/bimj.202000205.
  • [14] PASS 16. Power and Sample Size Software. NCSS, LLC, Kaysville, Utah, USA, 2018. ncss.com/software/pass.
  • [15] nQuery. Sample Size and Power Calculation. Statsols (Statistical Solutions Ltd.), Cork, Ireland, 2017. statsols.com/nquery.
  • [16] Breslow N. Analysis of survival data under the proportional hazards model. International Statistical Review 1975; 43: 45–48.
  • [17] Kerschke L, Faldum A and Schmidt R. An improved one-sample log-rank test. Statistical Methods in Medical Research 2020; 29(10): 2814–2829.
  • [18] Hinkley DV. On the Ratio of Two Correlated Normal Random Variables. Biometrika 1969; 56(3): 635–639.
  • [19] Nadarajah S. On the ratio X/Y for some elliptically symmetric distributions. Journal of Multivariate Analysis 2006; 97(2): 342–358.
  • [20] Ly S, Pho KH, Ly S and Wong WK. Determining distribution for the quotients of dependent and independent random variables by using copulas. Journal of Risk and Financial Management 2019; 12(1): 42.
  • [21] R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. https://www.R-project.org/.
  • [22] Fleming TR and Harrington DP. Counting Processes and Survival Analysis. Wiley, 2011.
  • [23] Therneau TM. A Package for Survival Analysis in R. R package version 3.2-7, 2020.
  • [24] Chu C, Liu S and Rong A. Study design of single-arm phase II immunotherapy trials with long-term survivors and random delayed treatment effect. Pharmaceutical Statistics 2020; 19: 358–369.