In medical research, such as oncology studies, time-to-event outcomes are often used as clinical endpoints. The primary interest is to compare the survival distributions between treatments and to quantify the treatment effects. In practice, the classic logrank test is the most popular testing method, and Cox regression is used for estimating the treatment effects. Although it is well known that both the logrank test and Cox regression are optimal under the proportional hazards assumption, the review by [jachno2019non] shows that the majority of the reviewed studies used the logrank test (88%) and Cox regression (97%) for the primary outcome, and only a few checked the proportional hazards assumption. Methodologies assuming proportional hazards were predominantly used despite the potential power improvements with alternative methods.
Common departures from proportional hazards include delayed treatment effects, gradually increasing treatment benefits, diminishing treatment effects, and crossing treatment effects (i.e., an initial adverse event followed by long-term benefits, or vice versa). Several methods have been proposed to improve the test power under non-proportional hazards.
The first type of test is based on the weighted logrank statistic. These tests give the observed risk differences different weights at different time points. The well-known Fleming-Harrington $G^{\rho,\gamma}$ test, also called the $FH(\rho,\gamma)$ test ([harrington1982class], [fleming2011counting]), is of this type. Its weight is based on the Kaplan-Meier estimate of the pooled survival function at the previous event time. A vast literature studied the properties of such tests and proposed different weights even before the Fleming-Harrington test. For example, [gehan1965generalized] used the number at risk in the combined sample as the weight, which yields the generalized two-sample Mann-Whitney-Wilcoxon test. [tarone1977distribution] suggested a class of tests in which the weight is a function of the number at risk. They suggested using the square root of the number at risk, which gives more weight to the event times with the most data. See [arboretti2018nonparametric] for a thorough summary of the weights and the corresponding test names. Researchers should be cautious in choosing a weight function. If they have prior knowledge of the direction of the alternative, a function that puts more weight where the hazards depart from each other can be chosen to improve the power. Otherwise, an improper weight may perform worse than the logrank test. Adaptive tests were proposed to avoid specifying the weights beforehand. Essentially, these tests are also based on the weighted logrank statistic. The adaptive property lies in how the weights are estimated and selected. [peckova2003adaptive]
proposed an adaptive test that selects a weight from a finite set of statistics based on efficiency criteria. Under the assumption of a time-transformed shift model, the length of the confidence interval for the shift is used as an efficiency estimator for each test. The statistic with maximum efficiency is selected in the procedure. Although the authors suggested a default transformation function, its specification has an impact on the power of the test. [yang2005semiparametric] proposed a hazard model that accommodates different scenarios. The parameters in the model can be interpreted as short-term and long-term hazard ratios. An adaptive test ([yang2010improved]) was then built on weights estimated from this model, with the weight functions given by model-based hazard ratio estimates. The test is shown to be more powerful under various alternatives, but it has an overly inflated type I error according to [chauvel2014tests]. Another issue we found is that the hazard ratio estimates are model-based and asymmetric in the group labels. If the labels of the two groups are flipped, the test statistic and the p-value change, which is an undesirable property for a test.
The maximum weighted logrank test, which takes the maximum of a finite cluster of statistics, can also be considered an adaptive test in the sense that it does not require pre-specification of a single weight function and automatically selects the largest statistic. [lee1996some] proposed a test based on the maximum of selected members of the $G^{\rho,\gamma}$ statistics, pooling together weights that address different types of alternatives. [gares2015omnibus] focused on late treatment effects in preventive randomized clinical trials. They proposed a maximum test based on the logrank statistic and several Fleming-Harrington statistics for late effects, which showed power improvements under late effects. [lin2020alternative] examined a number of tests under non-proportional hazards and proposed the so-called MaxCombo test, a maximum test based on specific $G^{\rho,\gamma}$ statistics, because of its robustness across different hazard ratio patterns. [brendel2014weighted] and [ditzhaus2020more] proposed the projection test, which combines a cluster of statistics by mapping them into one single statistic. Its power advantage over various methods was illustrated.
The second type includes the Renyi-type or supremum statistics, which generalize the Kolmogorov-Smirnov statistic. The test statistic takes the maximum difference across the time points. [fleming1981class] and [fleming1987supremum] proposed weighted versions of the Renyi statistic, where the maximum is taken over the weighted logrank statistics. These supremum versions of the logrank statistic are expected to be more sensitive to cases with crossing hazard functions.
The third type is based on the survival curves. [pepe1989weighted] proposed the weighted Kaplan-Meier (WKM) statistic, which is based on the integrated weighted difference between the Kaplan-Meier estimators. The test is sensitive against stochastically ordered alternatives. [liu2020resampling] used the scaled area between curves (ABC) statistic, which is based on the absolute difference between the Kaplan-Meier estimators.
In this paper, we propose a new type of maximum weighted logrank test. Overall, the maximum logrank tests are very robust against different types of hazards, as shown in previous studies. However, to our knowledge, the statistics combined in these tests are all of the $G^{\rho,\gamma}$ type. Depending on the selection of $(\rho, \gamma)$, these statistics are sensitive to early, middle, or late hazard differences. We are therefore motivated to introduce a type of weight that preserves this advantage and is more sensitive to crossing hazards, so that the power of the maximum test can be further improved in detecting crossing hazard alternatives. We organize the paper as follows: in Section 2, we briefly review the weighted logrank test and the methods used for later comparison; in Section 3, we introduce the new test; in Section 4, the new test is examined in simulation studies along with several competing tests; in Section 5, we illustrate the new method with a real data example. Finally, a brief discussion and summary conclude the paper in Section 6.
2.1 Two-sample data set-up
We focus on the standard setting with two-sample right-censored survival data. The classic set-up is given by the event times $T_{ij} \sim F_j$ and censoring times $C_{ij} \sim G_j$, where $T_{ij}$ and $C_{ij}$ are the event time and censoring time of subject $i$ from group $j$ ($j = 0, 1$), and $F_j$ and $G_j$ are the corresponding cumulative distribution functions. The event time $T_{ij}$ is assumed to be independent of the censoring time $C_{ij}$. In practice, the available data consist only of the pairs $(X_{ij}, \delta_{ij})$, where $X_{ij} = \min(T_{ij}, C_{ij})$ denotes the observed time of subject $i$ in group $j$ and $\delta_{ij} = I(T_{ij} \le C_{ij})$ indicates whether the observation is an event or a censoring. Let $\Lambda_j$ denote the cumulative hazard function and $\lambda_j$ the hazard rate of group $j$; the two are related by $\Lambda_j(t) = \int_0^t \lambda_j(s)\,ds$. The goal is to test the null hypothesis that the two groups share the same distribution, $H_0: F_0 = F_1$, against one of three alternatives: the two-sided alternative, under which the two distributions differ for some $t$; the stochastically ordered alternative, under which one survival function lies on or below the other for all $t$, with strict inequality for at least some $t$; or the ordered hazards alternative, under which one hazard rate lies on or below the other for all $t$, with strict inequality for at least some $t$. Clearly, ordered hazards imply stochastic ordering. The two-sided alternative includes crossing hazards, but neither of the ordered alternatives covers the case where the survival curves cross. The example in Figure 1 illustrates the stochastically ordered alternative. Although survival in Group 1 is stochastically worse than in Group 0, the hazard of Group 1 is greater than that of Group 0 at the beginning and smaller in the longer term. If a two-sided test is performed, we should be cautious in interpreting the treatment effect. [magirr2019modestly] discusses the risk of concluding that the treatment is efficacious when it is uniformly inferior, as shown in Figure 1. They proposed using a "strong" null hypothesis that survival in the treatment group is stochastically less than or equal to survival in the control group, and recommended consistently using one-sided hypothesis tests in confirmatory clinical trials.
2.2 Review of selected tests
2.2.1 Weighted logrank tests
[gill1980censoring] introduced a class of linear rank statistics and named them "tests of the class $\mathcal{K}$". Statistics of this class are the so-called weighted logrank statistics. In counting process notation ([fleming2011counting]), the classic weighted logrank statistic can be expressed as
Here the weight process $K(s)$ is an adapted and bounded predictable process that equals zero whenever one of the two groups has no subject at risk, and $\hat{\Lambda}_j$ is an estimator of the cumulative hazard $\Lambda_j$. Thus, the statistic is essentially a sum of weighted differences in the estimated hazards over time. Equation (1) can be rewritten as
The statistic is asymptotically distributed as a zero-mean Gaussian process with variance function ([gill1980censoring], [fleming2011counting])
An estimator of this variance is given by
The $G^{\rho,\gamma}$ class of statistics is given by the weight $W(t) = \hat{S}(t-)^{\rho}\{1 - \hat{S}(t-)\}^{\gamma}$, where $\hat{S}(t-)$ is the left-continuous Kaplan-Meier estimator of the survival function based on the pooled data. The choice $\rho = 0$, $\gamma = 0$ corresponds to the logrank statistic, and $\rho = 1$, $\gamma = 0$ corresponds to the Prentice-Wilcoxon statistic, which emphasizes early differences. It has been shown that the logrank test is most efficient under proportional hazards and that the $G^{\rho}$ statistic (of which the Prentice-Wilcoxon test is a special case) is efficient against monotone decreasing hazard ratio alternatives ([fleming2011counting]). In general, a test with weight proportional to the log hazard ratio is most powerful under the corresponding alternative (see [schoenfeld1981asymptotic]). In practice, if prior knowledge about the direction of the hazards is available, a proper weight can be chosen to emphasize the difference. For example, if there is evidence that the two groups differ most in the middle of the follow-up period, the weight $G^{1,1}$ gives more power. However, if the underlying hazards diverge over time, an erroneously chosen statistic, for example one that emphasizes early differences, performs worse than the logrank test.
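As a rough illustration of how such a statistic is computed in practice, the following Python sketch evaluates a standardized weighted logrank statistic with Fleming-Harrington weights $\hat{S}(t-)^{\rho}\{1 - \hat{S}(t-)\}^{\gamma}$. It uses the common textbook form with the hypergeometric variance, which may differ in scaling from the counting-process expressions of this section. The function name and arguments are hypothetical and are not taken from the authors' software.

```python
import math

def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=0.0):
    """Standardized Fleming-Harrington weighted logrank statistic (sketch).

    times  : observed times
    events : 1 = event, 0 = censored
    groups : 0 or 1
    Positive values mean more events than expected in group 1.
    Raises ZeroDivisionError if the variance is zero.
    """
    data = sorted(zip(times, events, groups))
    n = len(data)
    at_risk = n                               # pooled number at risk
    at_risk1 = sum(g for _, _, g in data)     # number at risk in group 1
    surv = 1.0                                # pooled Kaplan-Meier, value just before t
    num = 0.0
    var = 0.0
    i = 0
    while i < n:
        t = data[i][0]
        d = d1 = leaving = leaving1 = 0
        # Gather everything that happens at time t (handles ties).
        while i < n and data[i][0] == t:
            _, e, g = data[i]
            d += e
            d1 += e * g
            leaving += 1
            leaving1 += g
            i += 1
        if d > 0:
            w = surv ** rho * (1.0 - surv) ** gamma
            num += w * (d1 - at_risk1 * d / at_risk)   # observed minus expected
            if at_risk > 1:
                p1 = at_risk1 / at_risk
                var += w * w * d * p1 * (1 - p1) * (at_risk - d) / (at_risk - 1)
            surv *= 1.0 - d / at_risk
        at_risk -= leaving
        at_risk1 -= leaving1
    return num / math.sqrt(var)

# Toy data: rho = gamma = 0 gives the logrank statistic,
# rho = 0, gamma = 1 puts more weight on late differences.
z_logrank = fh_weighted_logrank([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
z_late = fh_weighted_logrank([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0], rho=0, gamma=1)
```

Swapping the group labels only flips the sign of the statistic, so a two-sided test based on its absolute value does not depend on the labeling.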
2.2.2 Renyi test
The Renyi test statistic is the censored-data version of the Kolmogorov-Smirnov statistic for uncensored samples. To construct the test, we evaluate the test statistic (2) at each event time point. The test statistic for a two-sided test is given by
where $\tau$ is the largest event time at which both groups still have subjects at risk. Under the null hypothesis, the distribution of $Q$ is approximated by the distribution of $\sup_{0 \le x \le 1} |B(x)|$, where $B$ is a standard Brownian motion process ([klein2006survival]). This supremum version of the test statistic is more powerful in detecting crossing hazards. The weight functions introduced above can also be used for the Renyi test, so different choices of weight give different versions of the test.
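The construction above can be sketched in Python as follows. The first function forms the supremum statistic from per-event-time contributions (weighted observed-minus-expected terms and their variance terms), and the second evaluates the tail probability of $\sup_{0 \le x \le 1}|B(x)|$ with the alternating series given in standard survival analysis texts. Both function names and the input layout are hypothetical, chosen only for illustration.

```python
import math

def renyi_statistic(increments, variances):
    """Supremum statistic: largest |running sum| divided by the final std. dev.

    increments : weighted observed-minus-expected terms, in time order
    variances  : matching variance contributions
    """
    running = 0.0
    largest = 0.0
    for u in increments:
        running += u
        largest = max(largest, abs(running))
    return largest / math.sqrt(sum(variances))

def sup_brownian_pvalue(y, terms=200):
    """Approximate P(sup_{0<=x<=1} |B(x)| > y) by a truncated alternating series.

    200 terms are ample for statistics of practical size; the result is
    clipped to [0, 1].
    """
    if y <= 0:
        return 1.0
    total = 0.0
    for k in range(terms):
        odd = 2 * k + 1
        total += (-1) ** k / odd * math.exp(-(math.pi * odd) ** 2 / (8 * y * y))
    return min(1.0, max(0.0, 1.0 - 4.0 / math.pi * total))

q = renyi_statistic([0.5, 0.5, -2.0], [1.0, 1.0, 2.0])
p = sup_brownian_pvalue(q)
```

In the toy call, the final sum is negative while the running sum peaks earlier at a positive value, which is exactly the kind of sign change the supremum is meant to capture.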
2.2.3 Projection-type test
[brendel2014weighted] introduced the projection-type test and derived the asymptotic properties of the statistic. [ditzhaus2020more] further clarified and simplified the method. The test statistic is constructed as follows: choose $m$ different weights $w_1, \dots, w_m$ representing different directions for the hazard and construct the vector $Z = (Z_1, \dots, Z_m)^T$, where $Z_i$ is the statistic given by (2) with weight $w_i$. [ditzhaus2020more] consider linearly independent weights. If the independence assumption is met, then under the null hypothesis the quadratic form $Z^T \hat{\Sigma}^{-1} Z$ is asymptotically chi-square distributed, where $\hat{\Sigma}$ is the empirical covariance matrix of $Z$, and the test rejects when the quadratic form exceeds the corresponding chi-square quantile. In [brendel2014weighted], the independence assumption is not needed. The test statistic is given by $Z^T \hat{\Sigma}^{-} Z$, where $\hat{\Sigma}^{-}$ stands for the Moore-Penrose inverse of the covariance matrix. In the later comparison, the projection test refers to this version.
3 Proposed Test
In this section, we construct a new test statistic that has good power against varied alternatives and is especially sensitive in detecting crossing hazards. The statistic is based on a number of statistics of the form of equation (1). In what follows, we present the asymptotic distributions and the procedure to obtain the p-value.
Let $Z_1, \dots, Z_m$ denote $m$ statistics of the form described in (1). The weight function of each statistic can be expressed as $w_i(\hat{F}(t))$, where $w_i$ is a deterministic function and $\hat{F}$ is the pooled Kaplan-Meier estimate of the CDF. For instance, the Prentice-Wilcoxon statistic corresponds to the function $w(u) = 1 - u$. These statistics differ only in the choice of the deterministic function. Under $H_0$, we have
where the limit is a mean-zero $m$-variate Gaussian random vector. The covariance between $Z_i$ and $Z_j$ is given by
A consistent estimator of this covariance is
The two-sided maximum weighted logrank test statistic is defined as the largest absolute value among the standardized statistics, $\max_i |Z_i|$. Some researchers have investigated the performance of maximum tests using $G^{\rho,\gamma}$-type weights. For example, [lee1996some] proposed a maximum of four such statistics whose weights emphasize proportional hazards and late, early, and middle differences in hazards. The test demonstrated power improvement under various hazard scenarios. [kosorok1999versatility] considered another combination of weights. Recently, [lin2020alternative] used the four statistics $FH(0,0)$, $FH(0,1)$, $FH(1,1)$, and $FH(1,0)$ and named the test the MaxCombo test. These methods focus on non-crossing hazard alternatives. Under crossing hazards, and especially crossing survival curves, the weighted logrank tests usually perform poorly. Conceptually, this is not difficult to understand, because the weighted logrank statistic is a sum of differences in hazards over time. Under ordered hazard alternatives, the summands tend to have the same sign, so putting more weight where the differences are large improves the test power. Under crossing hazards, however, as shown in Figure 1, the summands tend to have opposite signs before and after the crossing time point. Whether the emphasis is put on the beginning, middle, or end of follow-up, the early differences between the two hazard rates are canceled out by later differences with opposite signs, just as in the logrank test. The MaxCombo test may improve the power to detect crossing hazards, but there is still room for improvement.
Our proposed test aims to improve the power in detecting crossing hazards while maintaining good power under other alternatives. The four weights in the new test are functions of $\hat{F}(t)$, the pooled Kaplan-Meier estimate of the CDF, and the crossing weight additionally depends on a crossing parameter that represents the value of $\hat{F}(t)$ at the point where the crossing occurs. In practice, this value is unknown, so we recommend the default value 0.5, which simplifies the crossing weight. Figure 2 shows the crossing weight with the crossing parameter varying from 0.1 to 0.9. As the CDF increases with time, a smaller value of the crossing parameter is preferable if the researchers believe the crossing occurs at an early stage; otherwise, a larger value is recommended. As shown in the simulation, our test is robust to the choice of the crossing parameter. Even if a small value is chosen for a late crossing, the proposed test still outperforms the logrank test and performs similarly to the MaxCombo test.
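For a two-sided test, the rejection region and p-value of the proposed test are obtained from asymptotic normality. According to (4), the vector of the four standardized statistics $(Z_1, \dots, Z_4)$ approximately follows a multivariate normal distribution, with the variance-covariance estimated by (3) and (5). Let $c_\alpha$ denote the critical value at significance level $\alpha$. To control the type I error under $H_0$, we require $P(\max_i |Z_i| > c_\alpha) = \alpha$, which is equivalent to $P(|Z_1| \le c_\alpha, \dots, |Z_4| \le c_\alpha) = 1 - \alpha$. Here $c_\alpha$ is obtained by finding the corresponding critical value of the multivariate normal distribution. In the extreme case that the four statistics are independent, the equation above simplifies to $\{2\Phi(c_\alpha) - 1\}^4 = 1 - \alpha$, so $c_\alpha = \Phi^{-1}\left(\{1 + (1 - \alpha)^{1/4}\}/2\right)$, where $\Phi$ is the standard normal CDF. In the opposite case, when the four statistics are identical, we have $c_\alpha = \Phi^{-1}(1 - \alpha/2)$. In general, the critical value lies between these two values. The p-value is obtained accordingly.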
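To give a feel for the range of the critical value, the sketch below computes the two extreme cases for a two-sided maximum of $m$ standardized normal statistics: identical statistics (smallest critical value) and independent statistics (largest critical value). The function name is hypothetical. For a given estimated correlation matrix, the actual critical value would come from the multivariate normal distribution and would fall between these two numbers.

```python
from statistics import NormalDist

def critical_value_bounds(alpha=0.05, m=4):
    """Bounds on the two-sided critical value of max |Z_i| over m statistics.

    lower : all m statistics identical (perfect correlation)
    upper : all m statistics independent
    """
    std_normal = NormalDist()
    lower = std_normal.inv_cdf(1 - alpha / 2)
    # Independent case: (2 * Phi(c) - 1) ** m = 1 - alpha.
    upper = std_normal.inv_cdf((1 + (1 - alpha) ** (1 / m)) / 2)
    return lower, upper

lower, upper = critical_value_bounds(alpha=0.05, m=4)
```

With four statistics at the 0.05 level, the bounds are roughly 1.96 and 2.49, so ignoring the correlation and using the independent-case value is conservative.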
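For a one-sided test, the four-weight test statistic is taken without absolute values. For the ordered alternative in one direction, where the strict inequality holds for some time, the type I error is defined as the null probability that the largest of the four statistics exceeds the critical value; for the ordered alternative in the opposite direction, the type I error is the null probability that the smallest of the four statistics falls below the negative of the critical value.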
4 Simulation Study
4.1 Simulation Settings
Two censoring mechanisms are considered in this simulation. Under censoring Type I, the length of the study is fixed, and the number of events depends on the study duration and hazard rate. A complete study includes an 18-week recruitment period and is terminated at week 42. Every participant has at least 24 weeks of follow-up if no event occurs. Under censoring Type II, the study is terminated once the specified number of events is obtained. The overall event rate is fixed, but the total length of study depends on the enrollment and hazard rates. We assume the participants are uniformly enrolled within 24 weeks. In both settings, the end of the study is the only reason for censoring.
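The two censoring mechanisms can be sketched in Python as follows, assuming each subject has an entry (enrollment) time on the calendar scale and a latent event time measured from entry. Under Type I the study end is fixed (week 42 here); under Type II the study ends at the calendar time of a pre-specified number of events. The function names are hypothetical, and the sketch assigns zero follow-up to any subject who would enroll after the study end.

```python
def type1_censoring(entry, event_time, study_end=42.0):
    """Administrative censoring at a fixed calendar time.

    Returns (observed times, event indicators), both measured from entry.
    """
    observed, status = [], []
    for e, t in zip(entry, event_time):
        follow_up = max(0.0, study_end - e)        # time available on study
        observed.append(min(t, follow_up))
        status.append(1 if e + t <= study_end else 0)
    return observed, status

def type2_censoring(entry, event_time, n_events):
    """Stop the study at the calendar time of the n_events-th event."""
    calendar = sorted(e + t for e, t in zip(entry, event_time))
    study_end = calendar[n_events - 1]
    observed, status = type1_censoring(entry, event_time, study_end)
    return observed, status, study_end

obs1, ev1 = type1_censoring([0, 0, 0], [10, 50, 30])
obs2, ev2, end2 = type2_censoring([0, 0, 0], [10, 50, 30], n_events=2)
```

In both cases the end of the study is the only source of censoring, which matches the settings described above.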
Survival times for Group 0 are drawn from a log-logistic distribution with shape parameter $a$ and scale parameter $b$. The hazard function is $\lambda_0(t) = \frac{(a/b)(t/b)^{a-1}}{1 + (t/b)^{a}}$ and the survival function is $S_0(t) = \{1 + (t/b)^{a}\}^{-1}$. Figure 3 shows the hazard functions and survival functions for a selection of parameter values. The shape parameter is fixed at 2. As the scale parameter increases from 12 to 40, the sharp peak of the hazard curve flattens and the survival curve is pulled to the upper right. The hazard function for Group 1 is assumed to be a multiple of the hazard of Group 0, that is, $\lambda_1(t) = \theta(t)\lambda_0(t)$, where $\theta(t)$ is the hazard ratio between the two groups. In the proportional hazards case, $\theta(t)$ is constant over time. To cover a wide range of alternatives, $\theta(t)$ has the following eight options.
Crossing Hazards 1:
Crossing Hazards 2:
Delayed Diverging Hazards:
Converging Hazards 1:
Converging Hazards 2:
Equal Hazards ():
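The hazard ratios are time-dependent in cases (A) to (F). The cumulative hazard for Group 1 is $\Lambda_1(t) = \int_0^t \theta(s)\lambda_0(s)\,ds$, where $\theta(s)$ is the hazard ratio and $\lambda_0$ is the hazard of Group 0, and the survival function is $S_1(t) = \exp\{-\Lambda_1(t)\}$. It is known that $S_1(T)$ follows a uniform distribution on $(0, 1)$. The survival times for Group 1 are therefore generated via the inverse method, that is, $T = S_1^{-1}(U)$ with $U$ uniform on $(0, 1)$.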
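A minimal Python sketch of the inverse method is given below for the log-logistic baseline and, for simplicity, a constant hazard ratio, in which case the Group 1 survival function is the Group 0 survival function raised to the power of the hazard ratio and can be inverted in closed form. The time-dependent hazard ratios of cases (A) to (F) would need numerical integration and root finding, which is not shown. Function and parameter names are hypothetical.

```python
import random

def loglogistic_survival(t, shape, scale):
    """Log-logistic survival function S0(t) = 1 / (1 + (t / scale) ** shape)."""
    return 1.0 / (1.0 + (t / scale) ** shape)

def loglogistic_inverse_survival(s, shape, scale):
    """Solve S0(t) = s for t, with 0 < s <= 1."""
    return scale * ((1.0 - s) / s) ** (1.0 / shape)

def draw_time(u, shape, scale, hazard_ratio=1.0):
    """Inverse-method draw under a constant hazard ratio.

    S1(t) = S0(t) ** hazard_ratio, so S1(t) = u gives S0(t) = u ** (1 / hazard_ratio).
    """
    return loglogistic_inverse_survival(u ** (1.0 / hazard_ratio), shape, scale)

def simulate_group(n, shape, scale, hazard_ratio=1.0, seed=None):
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids dividing by zero.
    return [draw_time(1.0 - rng.random(), shape, scale, hazard_ratio) for _ in range(n)]

group0 = simulate_group(120, shape=2, scale=25, seed=1)
group1 = simulate_group(120, shape=2, scale=25, hazard_ratio=1.5, seed=2)
```

With shape 2 and scale 25, the median of Group 0 equals the scale parameter, which offers a quick sanity check on the sampler.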
The survival curves and hazard curves of the eight cases are given in Figure 4 and Figure 5. Cases (A) and (B) represent crossing hazards. In case (A), the hazard ratio is less than one at the beginning, the two hazard curves cross around week 20, and the survival curves cross around week 35. In case (B), it is the other way around, that is, the hazard ratio is greater than one at the beginning. Cases (C) and (D) show diverging hazards of the two groups. If the hazard ratio is greater than one, the ratio increases over time; otherwise, it decreases over time. In the simulation, we consider the former case. Case (C) illustrates a delayed response, where the medication takes effect after a certain amount of time. This is common in cancer vaccine trials ([copier2009improving]). Cases (E) and (F) both show hazards converging over time. In case (E), the hazard ratio decreases toward 1, so the survival curve of Group 1 is below the curve of Group 0. In case (F), the hazard ratio increases toward 1, and Group 1 survives longer. Case (G) represents proportional hazards with a ratio of 1.5. Case (H) represents the null hypothesis, where the two groups do not differ.
To investigate the operating characteristics of the proposed method, a variety of sample sizes and censoring rates are considered. Under censoring Type I, the study length is predefined, and the censoring rate is altered by changing the parameters of the survival distribution. The shape parameter is set to 2 throughout the simulation, while the scale parameter is set to 15, 25, and 40, corresponding to low, medium, and high censoring rates. Under censoring Type II, three sample sizes and three censoring rates are assumed, yielding nine different combinations. The specified censoring rate is maintained by changing the study length so that the specified number of events is obtained. The parameters for generating the Group 0 survival times remain the same across all the combinations. For both mechanisms, the sample size is allocated to the two groups at a 1:1 ratio.
We mainly consider the two-sided test in the above simulation settings and the one-sided test under the stochastically ordered alternative. The test level is 0.05 for the two-sided test and 0.025 for the one-sided test. All simulations are run in R version 3.5.3 (platform: x86_64-pc-linux-gnu) with 2000 replications. The largest simulation error is less than 0.01.
Six tests are considered in the two-sided test simulations: the logrank test, the Fleming-Harrington test with $(\rho, \gamma) = (1, 1)$, the MaxCombo test with its four weight functions, the proposed test with the default crossing parameter 0.5, the projection test, and Renyi's test. To account for crossing hazards, the weights used for the projection test include a weight aimed at crossing alternatives. In the following tables, the logrank, Fleming-Harrington, MaxCombo, projection, and Renyi tests are denoted by Logrank, FH11, maxC, ProjT, and Renyi, respectively. Considering the overly inflated Type I error of Yang's method ([yang2010improved]) and the moderate performance of the weighted Kaplan-Meier method in the study of [lin2020alternative], we did not include these two methods in the comparison.
Under the null hypothesis that the two groups do not differ, the two censoring mechanisms generate almost the same censoring rates in each group. The minor difference is simply due to sampling variation. The Type I error rates under Type I censoring and Type II censoring are displayed in Table 1 and Table 2. Except for Renyi's test, all tests have an inflated Type I error at a sample size of 60 regardless of the censoring mechanism or censoring rate, although increasing the censoring rate brings the Type I error rate down. At a sample size of 240, most of the tests control the Type I error rate under 0.05.
Table 3 shows the empirical power of each test under the Type I censoring mechanism with a total study length of 42 weeks. The scale parameter for Group 0 is fixed across the different sample sizes, giving the same censoring rates. The censoring rates for Group 0 are about 0.18, 0.38, and 0.60, corresponding to scale parameters of 15, 25, and 40. The average censoring rates for Group 1 and the total sample sizes are also reported in the table. The hazard curve and survival curve change with the scale parameter. For the crossing hazards at the lowest censoring rate, the proposed test and the projection test show a sizeable gain in power compared with the other tests. For sample size 240 and the Crossing 1 alternative, both the proposed test and the projection test have power over 80%, while the logrank test and the MaxCombo test have power of 26.6% and 53.2%, respectively. When the Group 0 censoring rate is about 0.38 or 0.60, the proposed test is not as powerful as the projection test, though it still outperforms all other tests, including the MaxCombo test. This is because the chosen crossing parameter does not reflect the true CDF value at the crossing time point. The sensitivity of the proposed test to this choice is discussed in the next section. Under diverging hazards and delayed response, the MaxCombo test has the largest power in most cases, followed by the proposed test. Under converging and constant hazards, the logrank test is the most powerful in most cases, and the projection test loses noticeable power: it is more powerful only than Renyi's test, and only in some cases.
The power of the tests under various alternatives with Type II censoring is shown in Table 4. The total event rates in the first column are pre-specified for each simulation. The average censoring rates for each group are given in the second and third columns. By design, it takes a longer follow-up time to obtain a larger number of events. In other words, the lower the censoring rate, the longer the follow-up time. The shapes of the two survival curves are the same under different censoring rates, but the study termination times differ. This is different from the Type I censoring mechanism, where the study termination time is fixed at 42 weeks but the shape of the survival curves varies with the censoring rate. Under crossing hazards, the proposed test and the projection test have the largest power, showing a power advantage over the MaxCombo test. As expected, the study termination time has a significant impact on the power of the logrank test. For example, under Crossing 1 the power of the logrank test is 19% at a lower censoring rate, while at a greater censoring rate the power becomes 26.1%, a sizable increase rather than a decrease. Under ordered hazard alternatives, an increase in the censoring rate brings the power of the test down. Under crossing hazards, however, the termination time plays a more critical role in determining the power of the logrank test. The proposed test is more robust to the study termination point. In the same example, the power of the proposed test is 30.8% at the lower censoring rate and 27.2% at the greater censoring rate, a slight decrease. Under diverging hazards, the MaxCombo test has the largest power in most cases, and the proposed test and the projection test have comparable results. Under delayed response, the projection test and the proposed test are the most powerful, followed by the MaxCombo test and the logrank test.
Under converging and proportional hazards, the logrank test is the most powerful, and the projection test and Renyi's test are the least powerful. The MaxCombo test and the proposed test fall in between.
To obtain an overall evaluation of the methods under crossing hazards and across all alternatives, we borrow the idea of multi-criteria decision analysis and calculate a total score for each method via the following procedure: first rank the tests by their power under each scenario, then sum the scores of each test over the crossing-hazards scenarios and over all alternatives. The final scores are based on the various sample sizes and censoring rates; results are displayed in Table 5 for the Type I censoring mechanism and in Table 6 for the Type II censoring mechanism. For the Type II censoring design, the proposed test has the highest score under the crossing hazards setting, followed by the projection test and the MaxCombo test. The MaxCombo test has the highest score if all alternatives are considered, followed by the proposed test. For the Type I censoring mechanism, the total scores across all scenarios have the same ranking as under Type II censoring. Under crossing hazards, the projection test has the highest score, followed by the proposed test. As discussed above, this is mainly because the crossing parameter used in the test is 0.5, which differs from the actual value in some scenarios. Even so, the score of the proposed test is better than that of the MaxCombo test.
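The scoring procedure can be sketched as follows. This is only one plausible reading of the description: within each scenario the least powerful test receives rank 1, ties share the average rank, and the ranks are summed over scenarios so that a higher total is better. The function name and the power values in the example are made up for illustration and are not taken from the tables.

```python
def total_rank_scores(power_by_scenario):
    """Sum within-scenario power ranks for each test.

    power_by_scenario : list of dicts mapping test name -> empirical power
    Rank 1 goes to the least powerful test; tied tests share the average rank.
    """
    totals = {}
    for powers in power_by_scenario:
        values = list(powers.values())
        for name, value in powers.items():
            below = sum(1 for v in values if v < value)
            tied = sum(1 for v in values if v == value)
            rank = below + (tied + 1) / 2
            totals[name] = totals.get(name, 0.0) + rank
    return totals

# Hypothetical power values for three tests in two scenarios.
scores = total_rank_scores([
    {"TestA": 0.2, "TestB": 0.5, "TestC": 0.8},
    {"TestA": 0.6, "TestB": 0.6, "TestC": 0.4},
])
```

Because only ranks enter the score, a test that wins narrowly in many scenarios can outscore a test that wins by a wide margin in a few.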
4.3 Sensitivity of the test to the crossing time point
We recommend the default value 0.5 for the crossing parameter if information about when the crossing occurs is not available. In this section, we investigate through simulation whether the performance of the test is sensitive to this choice. The six crossing scenarios for sample size 240 under the Type I censoring mechanism are considered. The empirical power of the proposed test with the crossing parameter varying from 0.1 to 0.9 is simulated and shown in Table 7. The power varies somewhat with the value of the crossing parameter. For example, in one Crossing 1 scenario, the highest power is 34.3% and the lowest power occurs at the opposite end of the range. In this case, the crossing occurs at the very beginning, and if the researcher mistakenly specifies 0.9 for the crossing parameter, the power is 17.7%, which is larger than the power of the logrank test (7.2%) and close to the power of the MaxCombo test (18.2%). In general, the value of the crossing parameter has some impact on the power of the test, but even in the worst case the test achieves power similar to that of the MaxCombo test and higher than that of the logrank test.
4.4 One-sided test
Figure 1 is created under the Type I censoring mechanism and illustrates the stochastically ordered alternative. Suppose this ordered alternative is the hypothesis of interest. The power of the one-sided test with significance level 0.025 is simulated for the logrank test, the MaxCombo test, and our proposed test. The censoring rates for Group 1 and Group 0 are 7.4% and 7%. At sample size 120, the corresponding power values are 26.6%, 42.6%, and 53.3% for the three tests, respectively. The MaxCombo test and the proposed test both have more power than the logrank test.
5 Real Data Application
The real data are from Appendix A of [kalbfleisch2011statistical] and come from a randomized trial of two treatments for lung cancer. Besides treatment-related information, several covariates were collected. In this example, we use two covariates: whether the patient had prior therapy and whether the patient was older than 65 years. A total of ninety-seven patients did not receive prior therapy, and forty patients did. Ninety-three patients are younger than 65, and forty-four patients are older than 65. The Kaplan-Meier curves by prior therapy and age are given in Figure 6. The event rates are above 90% in each group. In the left plot (prior therapy), the two curves cross when the survival probabilities are around 50%. Thus, patients receiving prior therapy have a lower survival rate initially but survive longer in the long run. The survival plot on the right (age > 65) is more likely to come from a process satisfying proportional hazards. Seven competing tests, namely the logrank, Renyi, MaxCombo, and projection tests and the proposed method with three values of the crossing parameter, are applied to these data.
The p-values of all tests are shown in Table 8. For the prior therapy comparison, the logrank test gives a p-value of 0.48. All the other tests have smaller p-values, and the smallest one is from the proposed method with a well-chosen crossing parameter. This is because the survival curves cross at around 0.5, indicating that the hazards must cross before 0.5. A correct specification of the crossing parameter increases the power. Even an incorrect selection yields a p-value close to the one given by the MaxCombo test. This shows that, in the crossing hazards scenario, the proposed method gains a lot from a correct specification of the crossing parameter but loses little from an incorrect choice. The projection test also gives a reasonably small p-value. For the age comparison, the logrank test has the smallest p-value, 0.07, showing the optimality of the test under proportional hazards. Renyi's test and the projection test have the largest p-values, and the MaxCombo test and the proposed test have similar p-values in between. These results are consistent with the simulation: the projection test loses more power than the MaxCombo test under proportional hazards.
In real-world studies, the proportional hazards assumption is predominantly assumed in analyzing time-to-event data, even though the assumption may not hold in two distinct situations ([saad2018understanding]). One is that the treatment effects interact with patient characteristics; the other is that the treatment effects vary with time. Neither case is rare in oncology trials. Therefore, researchers are motivated to find an omnibus test that performs well in most situations. In this study, we have proposed a maximum weighted logrank test that specifically incorporates a weight for detecting crossing hazards. Through simulations under various sample sizes and censoring rates, we show that, compared to the logrank test, our proposed test has a sizeable gain in power under crossing hazards regardless of the choice of the crossing parameter. At the worst choice of this parameter, the power is comparable to that of the MaxCombo test, while the power increases noticeably if the parameter approximates the true crossing point. For converging and proportional hazards, similar to the MaxCombo test, the proposed test has some power loss compared to the logrank test. The projection test is a competitive method for detecting crossing hazards, but it loses more power than the proposed test under proportional and converging hazards.
If there is prior information suggesting that crossing hazards may hold, we can extend the proposed test to one that is sensitive only to crossing hazards. For example, a maximum test whose weights are the proposed crossing weight function evaluated at several crossing values may further improve the power of the proposed test under crossing hazards. This test addresses early, middle, and late crossing scenarios, so there is no need to guess the actual crossing point. The same simulation scenarios as described in Section 4 are used. The empirical power and Type I error rates based on the Type I censoring mechanism are shown in Table 9. The three settings correspond to small, medium, and large censoring rates, the same as those shown in Table 3. If we compare the power with that of the projection test under the Crossing 1 alternative, this maximum test shows a power advantage in most cases. For example, in one setting with N=240 and the Crossing 1 alternative, the power values for the logrank, MaxCombo, proposed, and projection tests are 4.8%, 22.2%, 40.9%, and 55.4%, respectively. The power of this maximum test is 56.7%, larger than that of all the previous tests. However, its power loss under proportional hazards is greater than that of the MaxCombo test and the proposed test. For example, in one setting with N=240 and constant hazard ratio, the power values for the logrank, MaxCombo, proposed, and projection tests are 58.6%, 56.6%, 54.4%, and 47.4%, respectively. The power of this maximum test is 49.7%, only slightly larger than the power of the projection test. The Type I error rates are similar to those of the MaxCombo test and the proposed test.
In summary, we recommend using this crossing-focused maximum test or the projection test if the crossing hazards scenario is the most likely one. However, if there is not enough prior information about the alternative hazards, the proposed test and the MaxCombo test are more appropriate, as they have a robust power gain under various non-proportional hazards. We have compiled all the functions used in the simulation, including the maximum logrank test and the projection test, in an R package on GitHub (hcheng1118/maxLRT). Considering the various advantages of the maximum logrank test, we are also working on a project that considers different scenarios of non-proportional hazards in the design phase and proposes a simulation-free sample size calculation procedure based on the proposed test. The manuscript is to be finished soon. Applying the method in adaptive designs is also part of our future work.