A general class of two-sample statistics for binary and time-to-event outcomes

We propose a class of two-sample statistics for testing the equality of proportions and the equality of survival functions. We build our proposal on a weighted combination of a score test for the difference in proportions and a Weighted Kaplan-Meier statistic-based test for the difference of survival functions. The proposed statistics are fully non-parametric and do not rely on the proportional hazards assumption for the survival outcome. We present the asymptotic distribution of these statistics, propose a variance estimator and show their asymptotic properties under fixed and local alternatives. We discuss different choices of weights including those that control the relative relevance of each outcome and emphasize the type of difference to be detected in the survival outcome. We evaluate the performance of these statistics with a simulation study, and illustrate their use with a randomized phase III cancer vaccine trial. We have implemented the proposed statistics in the R package SurvBin, available on GitHub (https://github.com/MartaBofillRoig/SurvBin).




1 Introduction

In many clinical studies, two or more endpoints are investigated as co-primary, with the aim of providing a comprehensive picture of the treatment's benefits and harms. The time until a single relevant event has often been the main focus of clinical trial research. However, when there is more than one event of interest, the time to the event is not always the quantity of interest for every endpoint; for some of them, the outcome of interest is the occurrence of the event over a fixed time period.

One example, which motivates this paper, arises in cancer immunotherapy trials, where short-term binary endpoints based on tumor size, such as objective response, are common in early-phase trials, whereas overall survival remains the gold standard in late-phase trials (Wilson et al., 2015; Ananthakrishnan and Menon, 2013). Since traditional oncology endpoints may not capture the clinical benefit of cancer immunotherapies, the idea of looking at both tumor response and survival has grown from the belief that together they may achieve a better characterization of the clinical response (Thall, 2008).

The problem of how to analyze multiple outcomes has been widely discussed in the literature. Although how to construct a test of an individual hypothesis is well established, how to combine such tests when testing multiple hypotheses is often difficult. If one ignores the multiplicity when more than one null hypothesis is tested simultaneously, and tests each hypothesis at level $\alpha$, the probability of one or more false rejections generally increases with the number of hypotheses and may be much greater than $\alpha$. The classical approach is to restrict attention to multiple testing procedures that control the probability of one or more false rejections, the so-called familywise error rate, and thus guarantee the nominal significance level (Lehmann and Romano, 2012).

The most widely used multiple testing procedures are those that correct the significance level to control the prespecified nominal level (e.g., the Bonferroni procedure (Bland and Altman, 1995)). They only require testing the individual hypotheses and are thus straightforward to apply. However, many such approaches may lead to conservative designs, since they do not take the potential correlation between the statistics into account.
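The inflation described above can be made concrete with a short calculation. The following is a minimal sketch, assuming the tested null hypotheses are all true and the tests are independent (an assumption made here only for illustration; the function names are hypothetical). It shows how the familywise error rate grows with the number of hypotheses and how a Bonferroni-corrected level keeps it below the nominal level.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k independent
    true null hypotheses, each tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha: float, k: int) -> float:
    """Per-hypothesis level under the Bonferroni correction."""
    return alpha / k


for k in (1, 2, 5, 10):
    unadjusted = familywise_error_rate(0.05, k)
    adjusted = familywise_error_rate(bonferroni_level(0.05, k), k)
    print(f"k={k:2d}  unadjusted={unadjusted:.4f}  Bonferroni={adjusted:.4f}")
```

Under independence the Bonferroni-adjusted rate stays slightly below the nominal level, which illustrates the conservativeness mentioned in the text; with correlated statistics the gap can be larger.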

Alternative approaches have also been developed. Hothorn et al. (2008) extend the linear models theory to multiple comparisons within parametric and semi-parametric models. Their approach corrects the significance level by means of the simultaneous asymptotic normality of the commonly used test statistics. Pipper et al. (2012) established the asymptotic joint distribution of the test statistics for the effect of a covariate from models for different endpoints. Pocock et al. (1987) derived a global test statistic by summing asymptotically normal test statistics. All of these approaches require knowledge of the multivariate distribution of the test statistics. However, the asymptotic distribution of the statistics might be hard to obtain, especially when there are different types of endpoints.

Within the context of cancer trials, several authors have considered both objective response and overall survival as co-primary endpoints. Lai and Zee (2015) proposed a single-arm phase II trial design with tumor response rate and a time-to-event outcome, such as overall survival or progression-free survival. In their design, the dependence between the probability of response and the time-to-event outcome is modeled through a Gaussian copula. Lai et al. (2012) proposed a two-step sequential design in which the response rate and the time to the event are jointly modeled. Their approach relates the response rate and the time to the event by means of a mixture model and is built on the Cox proportional hazards model assumption.

Another characteristic of immunotherapy trials is that delayed effects are likely to be found, which brings the additional challenge of non-proportional hazards into the statistical analysis (Mick and Chen, 2015). Statistics that look at differences between integrated weighted survival curves, such as those defined by Pepe and Fleming (1989, 1991) and extended by Gu et al. (1999), are better suited to detect early or late survival differences and do not depend on the proportional hazards assumption.

In this paper, we follow the idea introduced by Pocock et al. (1987) of combining multiple test statistics into a single hypothesis test. Specifically, we propose a general class of statistics based on a weighted sum of a difference-in-proportions test and a weighted Kaplan-Meier-based test for the difference of survival functions. Our proposal adds versatility to the study design by enabling different follow-up periods for each endpoint, and flexibility by incorporating weights. We define these weights to assign unequal priorities to the different endpoints and to anticipate the type of time-to-event difference to be detected. The proposed class of statistics could be used in seamless phase II/III designs to jointly evaluate the efficacy on binary and survival endpoints, even in the presence of delayed treatment effects.

This article is organized as follows. In Section 2 we present the class of statistics for binary and time-to-event outcomes. In Section 3 we set out the assumptions and present the large sample distribution theory for the proposed statistics. In Section 4 we introduce different weights and discuss their choice. We give an overview of our R package SurvBin in Section 5 and illustrate our proposal with a recent immunotherapy trial in Section 6. In Section 7 we evaluate the performance of these statistics in terms of the significance level with a simulation study. We conclude with a discussion.

All the required functions to use these statistics have been implemented in R and have been made available at: https://github.com/MartaBofillRoig/SurvBin.

2 A general class of binary and survival test statistics

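Consider a study comparing two groups, a control group ($i=0$) and an intervention group ($i=1$), each composed of $n^{(i)}$ individuals, and denote by $n = n^{(0)} + n^{(1)}$ the total sample size. Suppose that both groups are followed over the time interval $[0, \tau]$ and are compared on the basis of the following two endpoints: the occurrence of an event $\varepsilon_b$ before $\tau_b$ ($0 < \tau_b \le \tau$), and the time to a different event $\varepsilon_s$ within the interval $[\tau_0, \tau]$ ($0 \le \tau_0 < \tau$). For the $i$-th group ($i=0,1$), let $p^{(i)}(\tau_b)$ be the probability of having the event $\varepsilon_b$ before $\tau_b$, and $S^{(i)}(\cdot)$ be the survival function of the time to the event $\varepsilon_s$.

We consider the problem of testing simultaneously $H_b$: $p^{(0)}(\tau_b) = p^{(1)}(\tau_b)$ and $H_s$: $S^{(0)}(t) = S^{(1)}(t)$ for all $t \in [\tau_0, \tau]$, aiming to demonstrate either a higher probability of the occurrence of $\varepsilon_b$ or an improved survival with respect to $\varepsilon_s$ in the intervention group. The hypothesis problem can then be formalized as:

$$H_0: H_b \cap H_s \quad \text{versus} \quad H_1: p^{(1)}(\tau_b) > p^{(0)}(\tau_b) \ \text{ or } \ S^{(1)}(t) \ge S^{(0)}(t) \ \text{for all } t \in [\tau_0, \tau], \text{ with strict inequality for some } t. \qquad (1)$$

We propose a class of statistics, hereafter called the $\mathcal{L}$-class, as a weighted linear combination of the difference of proportions statistic for the binary outcome and the integrated weighted difference of two survival functions for the time-to-event outcome, as follows,

$$U_n^{\omega}(\tau_0, \tau_b, \tau; \hat{Q}) = \omega_b \frac{U_{b,n}(\tau_b)}{\hat{\sigma}_b} + \omega_s \frac{U_{s,n}(\tau_0, \tau; \hat{Q})}{\hat{\sigma}_s}, \qquad (2)$$

for some real numbers $\omega_b, \omega_s \in (0,1)$, such that $\omega_b + \omega_s = 1$, and where:

$$U_{b,n}(\tau_b) = \sqrt{\frac{n^{(0)} n^{(1)}}{n}} \left( \hat{p}^{(1)}(\tau_b) - \hat{p}^{(0)}(\tau_b) \right), \qquad U_{s,n}(\tau_0, \tau; \hat{Q}) = \sqrt{\frac{n^{(0)} n^{(1)}}{n}} \int_{\tau_0}^{\tau} \hat{Q}(t) \left( \hat{S}^{(1)}(t) - \hat{S}^{(0)}(t) \right) dt,$$

denoting by $\hat{p}^{(i)}(\tau_b)$ the estimated proportion of events $\varepsilon_b$ before $\tau_b$, and by $\hat{S}^{(i)}(\cdot)$ the Kaplan-Meier estimator of $S^{(i)}(\cdot)$ for group $i$. The estimates $\hat{\sigma}_b^2$ and $\hat{\sigma}_s^2$ are such that they converge in probability to $\sigma_b^2$ and $\sigma_s^2$, respectively, as $n \to \infty$, where $\sigma_b^2$ and $\sigma_s^2$ represent the variances of $U_{b,n}(\tau_b)$ and $U_{s,n}(\tau_0, \tau; \hat{Q})$, respectively. Both theoretical and estimated expressions for these variances will be given in Section 3 (see equations (7) and (8) for the theoretical expressions and (12) and (13) for the estimates). The term $\hat{Q}(\cdot)$ is a possibly random function which converges pointwise in probability to a deterministic function $Q(\cdot)$. For ease of notation, and letting $\omega = (\omega_b, \omega_s)$, we will suppress the dependence on $\tau_0$, $\tau_b$, $\tau$ and use instead $U_n^{\omega}(\hat{Q})$, $U_{b,n}$, $U_{s,n}(\hat{Q})$. Note that $\hat{Q}$, $\hat{\sigma}_b$, and $\hat{\sigma}_s$ depend on the sample size $n$, but this has been omitted from the notation for brevity.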

The weights $(\omega_b, \omega_s)$ control the relative relevance of each outcome, if any, and the random weight function $\hat{Q}(\cdot)$ serves two purposes: to specify the type of survival differences that may exist between groups and to stabilize the variance of the difference of the two Kaplan-Meier functions. Some well-known special cases of $\hat{Q}$ are:

  1. , where is the pooled Kaplan-Meier estimator for the censoring distribution. This choice of down-weights the contributions on those times where the censoring is heavy.

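  2. $\hat{Q}(t) = \hat{S}(t-)^{\rho} \, (1 - \hat{S}(t-))^{\gamma}$, where $\rho, \gamma \ge 0$ and $\hat{S}(\cdot)$ is the pooled Kaplan-Meier estimator for the survival function. This corresponds to the weights of the $G^{\rho,\gamma}$ family (Fleming and Harrington, 1991). Then, for instance, if $\rho = 1$ and $\gamma = 0$, $\hat{Q}$ emphasizes early differences between survival functions; whereas late differences could be highlighted with $\rho = 0$ and $\gamma = 1$.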

  3. , where denotes the number of individuals at risk of at time . In this case accentuates the information at the beginning of the survival curve allowing early failures to receive more weight than later failures.

We state the precise conditions for the weight function in Section 3 and postpone the discussion of the choice of the weights $(\omega_b, \omega_s)$ and of the weight function to Section 4.
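To make the survival component concrete, here is a minimal sketch of the integrated weighted difference of two Kaplan-Meier curves over a window from `tau0` to `tau`. It is not the SurvBin implementation: the function names are hypothetical, the sample-size scaling and the variance are omitted, and the weight is evaluated at the left end of each interval between jump times, which is exact only for weights that are constant between jumps (such as the constant weight used by default).

```python
from typing import Callable, List, Sequence, Tuple

Curve = List[Tuple[float, float]]


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> Curve:
    """Kaplan-Meier curve as (event time, survival just after that time) pairs."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve: Curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def step_value(curve: Curve, t: float) -> float:
    """Value at t of the right-continuous step function given by curve."""
    value = 1.0
    for time, surv in curve:
        if time > t:
            break
        value = surv
    return value


def integrated_difference(curve0: Curve, curve1: Curve, tau0: float, tau: float,
                          weight: Callable[[float], float] = lambda t: 1.0) -> float:
    """Integral over [tau0, tau] of weight(t) * (S1(t) - S0(t))."""
    jumps = (t for t, _ in curve0 + curve1 if tau0 < t < tau)
    knots = sorted({tau0, tau, *jumps})
    total = 0.0
    for left, right in zip(knots, knots[1:]):
        diff = step_value(curve1, left) - step_value(curve0, left)
        total += weight(left) * diff * (right - left)
    return total


# Toy data without censoring (event indicator 1 = event observed).
control = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
treated = kaplan_meier([2, 4, 6, 8], [1, 1, 1, 1])
print(integrated_difference(control, treated, tau0=0.0, tau=4.0))
```

With a constant weight the integral is the difference in restricted mean survival time over the window; passing another `weight` function changes which part of the follow-up is emphasized.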

The statistics in the $\mathcal{L}$-class are defined for different follow-up configurations based on different choices of: the overall follow-up period, $\tau$; the time at which the binary event is evaluated, $\tau_b$; and the origin time for the survival outcome, $\tau_0$; taking into account that $0 \le \tau_0 < \tau$ and $0 < \tau_b \le \tau$. There are, however, no restrictions on whether or not these periods overlap and, if they do, how much and when. We illustrate two different situations with different configurations for $(\tau_0, \tau_b, \tau)$ in Figure 1. The first case is exemplified by an HIV therapeutic vaccination study where safety-tolerability response (binary outcome) and time-to-viral rebound (survival outcome) are outcomes of interest. Whereas the safety-tolerability is evaluated at week 6 ($\tau_b = 6$), the time-to-viral rebound is evaluated from week 6 to 18 ($\tau_0 = 6$ and $\tau = 18$) (De Jong et al., 2019). The second example, in the area of immunotherapy trials, includes a binary outcome (objective response), evaluated at month 6, and overall survival, evaluated from randomization until year 4 ($\tau_b = 0.5$, $\tau_0 = 0$ and $\tau = 4$, in years) (Hodi et al., 2019).

The -class statistics includes several statistical tests. If , and , then, corresponds to the global test statistic proposed by Pocock et al. ((1987)). If , , and , the statistic is the equivalent of the linear combination test of Logan et al. ((2008)) when there is no censorship until for testing for differences in survival curves after a pre-specified time-point.

Figure 1: Illustration of two different follow-up configurations. The red and blue arrows represent the time frames for the binary and time-to-event outcomes, respectively. The red line goes from the start of the study (at time-point $0$) until the binary outcome is evaluated ($\tau_b$). The blue (dashed) line goes from the time at which time-to-event information begins to be collected ($\tau_0$) to the end of the study ($\tau$).

3 Large sample results

In this section, we derive the asymptotic distribution of the $\mathcal{L}$-class of statistics given in (2) under the null hypothesis and under contiguous alternatives, present an estimator of their asymptotic variance, and discuss the consistency of the $\mathcal{L}$-statistics against any alternative hypothesis of the form of $H_1$ in (1). We start the section with the conditions we require for the $\mathcal{L}$-class of statistics. To keep the paper concise and readable, proofs and technical details are given in the Appendix and Supplementary material.

3.1 Further notation and Assumptions

We consider two independent random samples of () individuals and for each we denote the binary response by has occurred, the time to by and the censoring time by for and where is the usual 0/1 indicator function. Assuming that is non-informatively right-censored by , the observable data are summarized by , where and . Suppose as well that is independent of and that the occurrence of the survival and censoring times, and , does not prevent to assess the binary response, .

Denote by and the censoring survival function and the Kaplan-Meier estimator for the censoring times, respectively. As we will see in the next section, the distribution of the -statistics relies, among others, on the survival function for those patients who respond to the binary endpoint. We then introduce here the survival function for responders as P .

Furthermore we assume that: (i) , and ; (ii) the limiting fraction of the total sample size is non-negligible, i.e., ; and (iii) is a nonnegative piecewise continuous with finitely discontinuity points. For all the continuity points in , converges in probability to as . Moreover, and are functions of total variation bounded in probability.

Finally, we introduce the counting process as the number of observed events that have occurred by time t for the -th group () and as the number of subjects at risk at time for the -th group. We define and suppose that .

Remark: Throughout the paper and to refer to the group (), we will use subindexes for the individual observations and stochastic processes, as in , while we will use superindexes in parentheses for the functions and parameters, as in .

3.2 Asymptotic distribution

In order to derive the asymptotic distribution of the statistic, we first note that it can be approximated by the same statistic with the random weight function replaced by its deterministic limit.

Lemma 3.1.

Let be the statistic defined by:


where is the statistic given in (3) and is the statistic given in (2) with replaced by , that is:

for some real numbers , such that , and for a function satisfying the conditions outlined in Section 3.1. Then, the -statistic , given in (2), can be written as:


converges in probability to . Hence, the asymptotic distribution of the statistic is the same as that of .

Roughly speaking, thanks to this lemma we can ignore the randomness of the weight function and use its deterministic limit to obtain the limiting distribution of the statistic. In what follows, we state the asymptotic distributions under the null hypothesis in Theorem 3.2 and under a sequence of contiguous alternatives in Theorem 3.3.

Theorem 3.2.

Let the statistic be as defined in (2). Under the conditions outlined in Section 3.1, if the null hypothesis $H_0$ holds, the statistic converges in distribution, as $n \to \infty$, to a normal distribution as follows:

where $\sigma_b^2$ and $\sigma_s^2$ stand for the variances of the binary and survival statistics, respectively, and $\sigma_{bs}$ is the covariance between them. Their corresponding expressions are given by:


where , ( or ), , and
for .

Recall that , , and depend on , but we omit them for notational simplicity.

Theorem 3.3.

Let be the statistic defined in (2). Under the conditions outlined in 3.1., consider the following sequences of contiguous alternatives for both binary and time-to-event hypotheses satisfying, as :


for some constant and bounded function , and . Then, under contiguous alternatives of the form:

and (10)

we have that:

in distribution as $n \to \infty$, where $\sigma_b^2$, $\sigma_s^2$ and $\sigma_{bs}$ are given in (7), (8) and (9), respectively.

The covariance in (9) involves the conditional probabilities and , while represents the survival function for responders (patients that have had the binary event), stands for the probability of being a responder among patients experiencing at . Also note that, if , the survival experience starts after the binary event has been evaluated and only involves the second integral in (9).

We notice that the efficiency of the -statistics, , under contiguous alternatives is driven by the non-centrality parameter , that is, by the sum of the weighted non-centrality parameters of and .

3.3 Variance estimation and consistency

We now describe how to use the $\mathcal{L}$-statistics to test $H_0$ versus $H_1$ given in (1). Theorem 3.4 gives a consistent estimator of the asymptotic variance of the statistic, and Theorem 3.5 presents the standardized $\mathcal{L}$-statistics to test $H_0$.

Theorem 3.4.

Let the statistic be as defined in (2), and let $\sigma_b^2$, $\sigma_s^2$ and $\sigma_{bs}$ be the variances and covariance given in (7), (8) and (9), respectively. The asymptotic variance of the statistic, given in Theorem 3.2, can be consistently estimated by:


where , , and denote the estimates of , and , and are given by:


where ( or ), is the Kaplan-Meier estimator of ; and is the estimator of . Kernel-density methods are used in the estimation of .

Theorem 3.5.

Let be the statistic defined in (2), and let be the variance estimator given in (11). Consider the global null hypothesis ((1)) and let the normalized statistic of be:


Then, the statistic defined in (15) converges in distribution to a standard normal distribution. Moreover, for positive $\omega_b$ and $\omega_s$, the statistic is consistent against any alternative hypothesis of the form of $H_1$ in (1), which contemplates differences and stochastic ordering alternatives for the binary and time-to-event outcomes, respectively.

We are presenting here a pooled variance estimator of . An unpooled variance estimator is proposed in Theorem A.1 in the Appendix.

Theorem 3.5 can be used to test the global null hypothesis $H_0$ by comparing the standardized statistic in (15) to a standard normal distribution.
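The testing step can be sketched numerically. The code below is an illustration under stated assumptions, not the package's code: it assumes the combined statistic is a weighted sum of two components that have each already been divided by their estimated standard deviation, so that its variance depends only on the weights and on the estimated correlation between the components. The function names are hypothetical, and the use of an upper-tail p-value is an assumption motivated by the direction of the alternative.

```python
import math


def standardized_statistic(z_b: float, z_s: float, corr_bs: float,
                           w_b: float = 0.5, w_s: float = 0.5) -> float:
    """Weighted sum of two standardized statistics, rescaled to unit variance.

    z_b, z_s : binary and survival statistics, each divided by its
               estimated standard deviation.
    corr_bs  : estimated covariance divided by the product of the two
               estimated standard deviations.
    """
    if not (0.0 < w_b < 1.0 and 0.0 < w_s < 1.0 and abs(w_b + w_s - 1.0) < 1e-12):
        raise ValueError("weights must lie in (0, 1) and sum to one")
    combined = w_b * z_b + w_s * z_s
    variance = w_b ** 2 + w_s ** 2 + 2.0 * w_b * w_s * corr_bs
    return combined / math.sqrt(variance)


def upper_tail_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z = standardized_statistic(z_b=1.2, z_s=2.1, corr_bs=0.3, w_b=0.25, w_s=0.75)
print(f"standardized statistic = {z:.3f}, one-sided p-value = {upper_tail_p_value(z):.4f}")
```

The input values in the usage line are made up for illustration and do not come from the paper.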

4 On the choice of weights

An important consideration when applying the statistics proposed in this paper is the choice of the weight functions. The $\mathcal{L}$-class of statistics involves the already mentioned random weight function $\hat{Q}$ and deterministic weights $(\omega_b, \omega_s)$. These weights are defined for different purposes and play different roles in the statistic. In this section, we present different weights and discuss some of their strengths and shortcomings. The list provided is not exhaustive; other weights are possible and might be useful in specific circumstances.

4.1 Choice of $(\omega_b, \omega_s)$

The purpose of the weights $(\omega_b, \omega_s)$ is to prioritize the binary and the time-to-event outcomes. They have to be specified in advance according to the research questions. Whenever the two outcomes are equally relevant, we should choose equal weights, $\omega_b = \omega_s$. In this case the statistics will be optimal whenever the standardized effects on both outcomes coincide.

4.2 Choice of the weight function

The choice of might be very general as long as converges in probability to a function , and both and satisfy the conditions outlined in 3.1. In this section, we center our attention on a family of weights of the form:

where: (i) is a data-dependent function that converges, in probability to , a nonnegative piecewise continuous function with bounded variation on . The term takes care of the expected differences between survival functions and can be used as well to emphasize some parts of the follow-up according to the time-points (); (ii) the weights converge in probability to a deterministic positive bounded weight function . The main purpose of the weights is to ensure the stability of the variance of the difference of the two Kaplan-Meier functions. To do so, we make the additional assumption that:

for all , and for some constants .

Different choices of yield other known statistics. For instance, if , corresponds to the Weighted Kaplan-Meier statistics ((Pepe and Fleming, 1989, 1991)). Whenever and correspond to the weights (17) and (16), respectively, introduced below, we have the statistic proposed by Shen and Cai ((2001)). Furthermore, note that the weight functions of the form are similar to those proposed by Shen and Cai ((2001)); while they assume that is a bounded continuous function, we assume that is a nonnegative piecewise continuous function with bounded variation on , and instead of only considering the Pepe and Fleming weight function corresponding to (17), we also allow for different weight functions . Finally, if the random quantity is omitted corresponds to the difference of restricted mean survival times from to .

In what follows, we outline different choices of and , together with a brief discussion for each one:

  • We require to be small towards the end of the observation period if censoring is heavy. The usual weight functions involve Kaplan-Meier estimators of the censoring survival functions. The most common weight functions are:


    and , both proposed by Pepe and Fleming. Among other properties, has been proved to be a competitor to the logrank test for the proportional hazards alternative ((Pepe and Fleming, 1989)). Note that if the censoring survival functions are equal for both groups and the sampling design is balanced (), then, the differences in Kaplan-Meier estimators are weighted by the censoring survival function, that is, for . Also note that for uncensored data.

  • Analogously to Fleming and Harrington ((1991)) statistics, could be used to specify the type of expected differences between survival functions. That is, if we set:


    the choice , leads to a test to detect early differences, while , leads to a test to detect late differences; and leads to a test evenly distributed over time and corresponds to the weight function of the logrank.

  • In order to put more emphasis on those times after the binary follow-up period we might consider:

    for .
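The effect of the Fleming-Harrington-type exponents discussed in the second bullet can be illustrated with a few numbers. This is a minimal sketch: the function name is hypothetical, and the exponents are called `rho` and `gamma` here by analogy with the `rho` and `gam` arguments of `lstats`, which is an assumption about the notation lost from this copy of the text.

```python
def fleming_harrington_weight(surv: float, rho: float, gamma: float) -> float:
    """Weight S^rho * (1 - S)^gamma evaluated at a pooled survival value S."""
    if not 0.0 <= surv <= 1.0:
        raise ValueError("surv must be a probability")
    if rho < 0 or gamma < 0:
        raise ValueError("rho and gamma must be non-negative")
    return surv ** rho * (1.0 - surv) ** gamma


# Pooled survival is high early in follow-up and low late in follow-up.
for label, rho, gamma in (("early", 1, 0), ("late", 0, 1), ("even", 0, 0)):
    weights = [fleming_harrington_weight(s, rho, gamma) for s in (0.9, 0.5, 0.1)]
    print(label, [round(w, 2) for w in weights])
```

Reading the rows from left to right follows the survival curve from early to late times: the first choice puts most weight where survival is still high, the second where it has dropped, and the third weights all times equally.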

5 Implementation

We have developed the SurvBin package to facilitate the use of the $\mathcal{L}$-statistics; it is now available on GitHub. The SurvBin package contains three key functions: lstats to compute the standardized $\mathcal{L}$-statistic; and bintest and survtest for the univariate binary and survival statistics (3) and (2), respectively. The SurvBin package also provides the functions survbinCov, which can be used to calculate the covariance between the binary and survival statistics; and simsurvbin for simulating bivariate binary and survival data.

The main function lstats can be called by:

lstats(time, status, binary, treat, tau0, tau, taub, rho, gam, eta, wb, ws, var_est)

where time, status, binary and treat are vectors of the right-censored data, the status indicator, the binary data and the treatment group indicator, respectively; tau0, tau, taub denote the follow-up configuration; wb, ws are the weights $(\omega_b, \omega_s)$; rho, gam, eta are scalar parameters that control the random weight function; and var_est indicates the variance estimate to use (pooled or unpooled).

In this work, the kernel estimation uses the Epanechnikov kernel function, together with the local bandwidth selection and the boundary correction described by Muller and Wang (1994), as implemented in the muhaz package (Hess and Gentleman, 2019).
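For readers unfamiliar with kernel smoothing in this setting, the following sketch shows the basic idea with an Epanechnikov kernel. It makes several simplifying assumptions that differ from the approach cited above: the quantity being estimated is taken to be a hazard rate (the symbol is missing from this copy of the text, and muhaz is a hazard-estimation package), the bandwidth is fixed rather than locally selected, and no boundary correction is applied. The function names are hypothetical.

```python
from typing import Sequence


def epanechnikov(u: float) -> float:
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_hazard(t: float, times: Sequence[float], events: Sequence[int],
                  bandwidth: float) -> float:
    """Kernel-smoothed hazard at time t with a fixed bandwidth.

    Each observed event contributes a Nelson-Aalen increment 1 / (number at
    risk), spread around its event time by the kernel.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    data = sorted(zip(times, events))
    n = len(data)
    total = 0.0
    for rank, (time_i, event_i) in enumerate(data):
        if event_i:
            at_risk = n - rank  # subjects still under observation at time_i
            total += epanechnikov((t - time_i) / bandwidth) / at_risk
    return total / bandwidth


# Toy data: event indicator 1 = event observed, 0 = censored.
times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
events = [1, 0, 1, 1, 0, 1]
print(round(kernel_hazard(1.5, times, events, bandwidth=1.0), 4))
```

Near the ends of the follow-up window a fixed-bandwidth estimate of this kind is biased, which is the issue the boundary correction mentioned in the text is meant to address.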

6 Examples

Melanoma has been considered a good target for immunotherapy and its treatment has been a key goal in recent years. Here we consider a randomized, double-blind, phase III trial whose primary objective was to determine the safety and efficacy of the combination of a melanoma immunotherapy (gp100) together with an antibody vaccine (ipilimumab) in patients with previously treated metastatic melanoma (Hodi et al., 2019). Although the original primary endpoint was objective response rate at week 12, it was amended to overall survival, and objective response was then considered a secondary endpoint. A total of 676 patients were randomly assigned to receive ipilimumab plus gp100, ipilimumab alone, or gp100 alone. The study was designed to have at least power to detect a difference in overall survival between the ipilimumab-plus-gp100 and gp100-alone groups at a two-sided level of , using a log-rank test. Cox proportional-hazards models were used to estimate hazard ratios and to test their significance. The results showed that ipilimumab with gp100 improved overall survival as compared with gp100 alone in patients with metastatic melanoma. However, the treatment had a delayed effect, and an overlap between the Kaplan-Meier curves was observed during the first six months. Hence, the proportional hazards assumption did not appear to hold, and a different approach would have been advisable.

In order to illustrate our proposal, we consider the comparison between the ipilimumab-plus-gp100 and gp100-alone groups based on overall survival and objective response as co-primary endpoints of the study. For this purpose, we have reconstructed individual observed times by scanning the overall survival Kaplan-Meier curves reported in Figure 1A of Hodi et al. (2019) using the reconstructKM package (Sun, 2020) (see Figure 2), and, afterwards, we have simulated the binary response to mimic the percentage of responses obtained in the study.

Figure 2: Kaplan-Meier Curves for Overall Survival.

Using the data obtained, we employ the -statistic by means of the function lstats in the SurvBin package. To do so, we need to specify the weights () to be used, and the time-points (). In our particular case, we take according to the trial design, choose to account for censoring and delayed effects in late times, and to emphasize the importance of overall survival over objective response. The results are summarized in Figure 2.

Since we obtained , we have a basis to reject and conclude that the ipilimumab either improved overall survival or increased the percentages of tumor reduction in patients with metastatic melanoma, or both.

7 Simulation study

7.1 Design

We have conducted a simulation study to evaluate our proposal in terms of type I error. We have generated bivariate binary and time-to-event data through a copula-based framework, using conditional sampling as described in Trivedi and Zimmer (2007). The parameters used for the simulation (summarized in Table 1) have been the following: Frank's copula with association parameter ; Weibull survival functions, , with and ; probability of having the binary endpoint ; and sample size per arm .

The censoring distributions between groups were assumed equal and uniform with . Two different follow-up configurations were considered for : (i) ; and (ii) . We have considered the weights: with and such that , and equal to . For each scenario, we ran 1000 replicates and estimated the significance level ().

We note that the chosen values of the association parameter correspond to an increasing association between the binary and time-to-event outcomes. Indeed, the values are equivalent to , , in terms of Spearman’s rank correlation coefficient between the marginal distributions of the binary and time-to-event outcomes. We have not considered higher values of as they do not fulfill the condition that ().
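The data-generating mechanism can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the study: the Weibull survival function is taken as exp(-(t/scale)^shape), the binary outcome is obtained by thresholding the second copula coordinate at `p`, censoring is omitted, and all function names and parameter values are hypothetical.

```python
import math
import random
from typing import List, Tuple


def frank_conditional(u: float, w: float, theta: float) -> float:
    """Conditional sampling for Frank's copula (theta != 0).

    Returns v such that the conditional distribution of V given U = u,
    evaluated at v, equals w.
    """
    a = math.exp(-theta * u)
    d = math.expm1(-theta)  # exp(-theta) - 1
    x = w * d / (w + (1.0 - w) * a)
    return -math.log1p(x) / theta


def simulate_arm(n: int, theta: float, p: float, shape: float, scale: float,
                 rng: random.Random) -> List[Tuple[int, float]]:
    """Simulate n (binary response, event time) pairs for one arm."""
    rows = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        w = rng.random()
        v = frank_conditional(u, w, theta)
        # u plays the role of the survival probability S(T).
        time = scale * (-math.log(u)) ** (1.0 / shape)
        binary = 1 if v <= p else 0
        rows.append((binary, time))
    return rows


rng = random.Random(2020)
sample = simulate_arm(n=500, theta=2.0, p=0.2, shape=1.0, scale=1.0, rng=rng)
response_rate = sum(b for b, _ in sample) / len(sample)
print(f"observed response rate: {response_rate:.3f}")
```

With this construction a positive `theta` makes responders tend to have longer event times; the direction and strength of the association in the actual study depend on details not recoverable from the text.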

We have performed all computations using the R software (Version 3.6.2), and on a computer with an Intel(R) Core(TM) i7-6700 CPU, 3.40 GHz, RAM 8.00GB, 64bit operating system. The time required to perform the considered simulations was 52 hours.

Table 1: Scenarios of the simulation study.

7.2 Size properties

The empirical results show that the type I error is very close to the nominal level $\alpha = 0.05$ across a broad range of situations. The empirical size resulted in type I errors with a median of 0.049 and first and third quartiles of 0.043 and 0.055, respectively. Table 2 summarizes the results according to the parameters of the simulation study. The results show that the $\mathcal{L}$-statistics have the appropriate size and are not especially influenced by the censoring distribution or by the selection of weights (). Figure 3 displays how the empirical sizes behave according to the association parameter and the follow-up configuration . We observe that when the empirical size is slightly smaller than 0.05.

We compare the performance of the pooled and unpooled variance estimation and notice that the empirical sizes do not substantially differ between them.
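When reading these empirical sizes, it helps to keep the Monte Carlo error in mind. The sketch below assumes each scenario's size is estimated as a simple proportion of rejections over 1000 independent replicates, as stated in the design, and uses a normal approximation for the range; the function name is hypothetical.

```python
import math


def monte_carlo_se(level: float, replicates: int) -> float:
    """Standard error of an estimated rejection rate when the true rate is `level`."""
    return math.sqrt(level * (1.0 - level) / replicates)


se = monte_carlo_se(0.05, 1000)
lower, upper = 0.05 - 1.96 * se, 0.05 + 1.96 * se
print(f"Monte Carlo SE = {se:.4f}; approximate 95% range = [{lower:.3f}, {upper:.3f}]")
```

The resulting range is roughly 0.036 to 0.064, so the reported quartiles of 0.043 and 0.055 are within what sampling variability alone would produce for a test of exact size 0.05.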

  Parameter value   Pooled   Unpooled
  -                 0.052    0.050
  -                 0.046    0.047
  0.001             0.049    0.050
  2                 0.048    0.048
  3                 0.049    0.048
  0.2               0.049    0.049
  0.4               0.049    0.048
  0.5               0.052    0.050
  1                 0.048    0.049
  2                 0.046    0.047
  1                 0.049    0.049
  3                 0.049    0.048
  (0,1,0)           0.048    0.049
  (1,1,0)           0.049    0.049
  (0,0,1)           0.048    0.048
  (0,1,1)           0.050    0.049
  (1,1,1)           0.048    0.050

Table 2: Median empirical size from replications, by variance estimator (pooled or unpooled).
Figure 3: Empirical size according to the and parameters.

8 Discussion

We have proposed a class of statistics for a two-sample comparison based on two different outcomes: a dichotomous one that, in most cases, captures short-term effects, and a second one aimed at detecting long-term differences in a survival endpoint. Such statistics test the equality of proportions and the equality of survival functions. The approach combines a score test for the difference in proportions and a weighted Kaplan-Meier-based test for the difference of survival functions. The statistics are fully non-parametric and level $\alpha$ for testing the null hypothesis of no effect on either of these two outcomes. The statistics in the $\mathcal{L}$-class are appealing in situations where both outcomes are relevant, regardless of how the follow-up periods of each outcome are arranged, and even when the hazards are not proportional for the time-to-event outcome or in the presence of delayed treatment effects, although the survival curves are assumed not to cross. We have incorporated weight functions in order to control the relative relevance of each outcome and to specify the type of survival differences that may exist between groups.

The testing procedure using the $\mathcal{L}$-class of statistics satisfies a property called coherence, which says that the nonrejection of an intersection hypothesis implies the nonrejection of any sub-hypothesis it implies, i.e., $H_b$ and $H_s$ (Romano and Wolf, 2005). However, the testing procedure based on the $\mathcal{L}$-class of statistics does not fulfill the consonance property, which states that the rejection of the global null hypothesis implies the rejection of at least one of its sub-hypotheses. Bittman et al. (2009) addressed the problem of how to combine tests into a multiple testing procedure that satisfies both the coherence and consonance principles. An extension of this work to obtain a testing procedure that satisfies both properties could be an important line of research.

This work has been restricted to cases in which censoring does not prevent the assessment of the binary endpoint. We are currently working on a more general censoring scheme in which the binary endpoint could be censored. Finally, extensions to sequential and adaptive procedures in which the binary outcome could be tested at more than one time point remain open for future research.


Acknowledgements

We would like to thank Prof. Yu Shen and Prof. María Durbán for their helpful comments and suggestions. This work is partially supported by grants MTM2015-64465-C2-1-R (MINECO/FEDER) from the Ministerio de Economía y Competitividad (Spain), and 2017 SGR 622 (GRBIO) from the Departament d’Economia i Coneixement de la Generalitat de Catalunya (Spain). M. Bofill Roig acknowledges financial support from the Ministerio de Economía y Competitividad (Spain), through the María de Maeztu Programme for Units of Excellence in R&D (MDM-2014-0445).

Supplementary Materials

Web Appendix A, referenced in Section 3, is available with this paper at the Biometrics website on Wiley Online Library.


References

  • Ananthakrishnan, R., Menon, S. (2013). Design of oncology clinical trials: A review. Critical Reviews in Oncology/Hematology, 88(1), 144–153.
  • Bauer, P. (1991). Multiple testing in clinical trials. Statistics in Medicine, 10, 871–890.
  • Bland, J. M., Altman, D. G. (1995). Multiple significance tests: the Bonferroni method. BMJ, 310(6973), 170.
  • Bittman, R. M., Romano, J. P., Vallarino, C., Wolf, M. (2009). Optimal testing of multiple hypotheses with common effect direction. Biometrika, 96(2), 399–410.
  • de Jong, W., Aerts, J., Allard, S., et al. (2019). iHIVARNA phase IIa, a randomized, placebo-controlled, double-blinded trial to evaluate the safety and immunogenicity of iHIVARNA-01 in chronically HIV-infected patients under stable combined antiretroviral therapy. Trials, 20(1), 361.
  • Fleming, T. R., Harrington, D. P. (1991). Counting Processes and Survival Analysis, volume 8. Wiley Online Library.
  • Gu, M., Follmann, D., Geller, N. L. (1999). Monitoring a general class of two-sample survival statistics with applications. Biometrika, 86(1), 45–57.
  • Hodi, F. S., O'Day, S. J., McDermott, D. F., et al. (2010). Improved survival with ipilimumab in patients with metastatic melanoma. The New England Journal of Medicine, 363(8), 711–723.
  • Hothorn, T., Bretz, F., Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3), 346–363.
  • Lachin, J. M. (1981). Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials, 2, 92–113.
  • Lai, X., Zee, B. C. Y. (2015). Mixed response and time-to-event endpoints for multistage single-arm phase II design. Trials, 16(1), 1–10.
  • Lai, T. L., Lavori, P. W., Shih, M. C. (2012). Sequential design of phase II-III cancer trials. Statistics in Medicine, 31(18), 1944–1960.
  • Logan, B. R., Klein, J. P., Zhang, M. J. (2008). Comparing treatments in the presence of crossing survival curves: An application to bone marrow transplantation. Biometrics, 64(3), 733–740.
  • Lehmann, E. L., Romano, J. P. (2005). Testing Statistical Hypotheses, 3rd ed. New York: Springer.
  • Hess, K., Gentleman, R. (2019). Package 'muhaz': Hazard Function Estimation in Survival Analysis. Version
  • Mick, R., Chen, T. T. (2015). Statistical challenges in the design of late-stage cancer immunotherapy studies. Cancer Immunology Research, 3(12), 1292–1298.
  • Muller, H. G., Wang, J. L. (1994). Hazard rate estimation under random censoring with varying kernels and bandwidths. Biometrics, 50(1), 61–76.
  • Pepe, M. S., Fleming, T. R. (1989). Weighted Kaplan-Meier statistics: A class of distance tests for censored survival data. Biometrics, 45(2), 497–507.
  • Pepe, M. S., Fleming, T. R. (1991). Weighted Kaplan-Meier statistics: Large sample and optimality considerations. Journal of the Royal Statistical Society. Series B (Methodological), 53(2), 341–352.
  • Pipper, C. B., Ritz, C., Bisgaard, H. (2012). A versatile method for confirmatory evaluation of the effects of a covariate in multiple models. Journal of the Royal Statistical Society. Series C: Applied Statistics, 61(2), 315–326.
  • Pocock, S. J., Geller, N. L., Tsiatis, A. A. (1987). The analysis of multiple endpoints in clinical trials. Biometrics, 43(3), 487–498.
  • Romano, J. P., Wolf, M. (2005). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469), 94–108.
  • Shen, Y., Fleming, T. R. (1997). Weighted mean survival test statistics: A class of distance tests for censored survival data. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 59(1), 269–280.
  • Shen, Y., Cai, J. (2001). Maximum of the weighted Kaplan-Meier tests with application to cancer prevention and screening trials. Biometrics, 57(3), 837–843.
  • Sun, R. (2020). GitHub repository: https://github.com/ryanrsun/reconstructkm
  • Thall, P. F. (2008). A review of phase 2-3 clinical trial designs. Lifetime Data Analysis, 14(1), 37–53.
  • Trivedi, P. K., Zimmer, D. M. (2007). Copula modeling: An introduction for practitioners. Foundations and Trends in Econometrics, 1(1), 1–111.
  • Wilson, M. K., Collyar, D., Chingos, D. T., Friedlander, M., Ho, T. W., Karakasis, K., Oza, A. M. (2015). Outcomes and endpoints in cancer trials: Bridging the divide. The Lancet Oncology, 16(1), e43–e52.

Appendix A Proof of theorems

A sequence of random vectors that converges in probability to as will be denoted by . The convergence in distribution will be written as .

Proof of Lemma 3.1.

The proof is a direct consequence of the asymptotic representation of the time-to-event statistic which can be written as , where: