1 Introduction
Many trials are designed to evaluate more than one endpoint with the aim of providing a wider picture of the intervention effects FDA2017 ; Rosenblatt2017 . When the rate of occurrence of an event is expected to be low, it is common to consider the composite event defined as the occurrence of any of a set of prespecified events. This composite event is usually chosen as the primary efficacy endpoint for comparing two treatment groups, either by comparing proportions between groups at the end of the study or by using time-to-event analysis. In this paper, we focus on composite binary endpoints.
Power analysis and the subsequent sample size calculation have been widely discussed in the literature on comparing two proportions in the univariate case Lachin1981 ; Donner1984 ; Fleiss1981 ; Friedman1981 . These standard sample size formulae are based on the effect size and the frequency of occurrence of the primary endpoint, and they could be applied in a straightforward way to a composite endpoint if its effect size and frequency were known prior to the initiation of the study. However, the effect size and frequency of the composite endpoint depend on the corresponding effects and frequencies of the composite components, which are often quite dissimilar and thus make the composite parameters very difficult to anticipate.
The TACTICS-TIMI trial Cannon2001 illustrates some problems that might arise when determining the sample size for a primary composite binary endpoint. TACTICS-TIMI 18 was an international, multicenter, randomized trial that evaluated the efficacy of invasive and conservative treatment strategies in patients with unstable angina or non-Q-wave acute myocardial infarction treated with tirofiban, heparin, and aspirin. The primary hypothesis of the TACTICS-TIMI 18 trial was that an early invasive strategy would reduce the combined incidence of death, acute myocardial infarction, and rehospitalization for acute coronary syndromes at six months when compared with an early conservative strategy. The primary endpoint was the composite endpoint formed by death, nonfatal myocardial infarction, and rehospitalization for acute coronary syndrome at 6 months.
Research questions similar to those in TACTICS-TIMI 18 were previously investigated in the TIMI IIIB and VANQWISH trials Cannon1998 . The TIMI IIIB trial TIMIIIIB considered the primary composite endpoint of death, post-randomization myocardial infarction, and a positive exercise test at 6 weeks; whereas the primary endpoint in the VANQWISH trial VANQWISH was the combination of death and nonfatal myocardial infarction at 12 months of follow-up. The initial planning of TACTICS-TIMI was based on those trials expecting events of the primary composite endpoint in the conservative-strategy group, to detect a relative difference of between the two groups for a power. Those anticipated values resulted in the need to recruit at least patients. However, TACTICS-TIMI yielded a frequency of observing the combination of death, acute myocardial infarction and rehospitalization at six months, which was remarkably lower than expected and delivered a relative difference of between groups, a figure that is considerably lower than the anticipated . Note that if the anticipated frequency of observing the composite endpoint had been closer to the observed results, at least patients rather than would have been required, and the sample size needed would have been larger than the one initially planned.
In this paper, we present sample size formulations for detecting a hypothesized difference between treatments in a primary composite binary endpoint based on the event rates and effect sizes of the composite components. This choice is motivated mainly by the fact that prior information on the marginal effects and event rates is commonly available from previous or pivotal studies, as illustrated in the TACTICS-TIMI trial. Moreover, the major findings in a trial with a primary composite endpoint should be well supported by its components FDA2017 ; EMA2016 , since the trial could be considered negative if the components are not in line with the result Pocock2015 ; ICH9 . Nevertheless, as shown in this paper, the sample size calculation for composite endpoints relies not only on the anticipated effect sizes and event rates of the composite components, but also on the correlation between them. Even though the marginal parameters can often be obtained from previous studies, the correlation is usually not reported in practice and is thus frequently unknown and difficult to anticipate.
Several authors have addressed the correlation’s influence on sample size determination when more than one endpoint is used as the primary endpoint. Sozu et al. Sozu2010 discuss several methods for calculating power and sample size for multiple co-primary binary endpoints, and they study the impact on the sample size, specifically regarding the association among endpoints. Senn and Bretz Senn2007 examine sample size for trials under different power definitions for multiple testing problems. Rauch and Kieser Rauch2012 and Sander et al. Sander2016 define a multiple test procedure focused on a composite binary endpoint and a prespecified main component, and propose an internal pilot study for estimating the unknown parameters and revising the sample size. However, to the best of our knowledge, methodologies are limited in regard to handling the sample size calculation for composite binary endpoints when the correlation is unknown.
The aim of this paper is twofold. First, we focus on providing a general procedure for sizing trials with composite binary endpoints on the basis of anticipated information on the composite components, even if the correlation is unknown. We show that the sample size for composite binary endpoints is strongly dependent on the correlation, and that slight deviations in the prior information on the marginal parameters may result in trials that are underpowered to achieve the study objectives at the prespecified significance level. We propose a sample size strategy to calculate the minimum sample size that guarantees the planned power while accounting for, on the one hand, the uncertainty of the correlation value and, on the other, plausible deviations in the marginal parameter values. Second, we present CompARE, a freely available web-based tool for characterizing binary composite endpoints and computing the needed sample size under several settings. CompARE provides aids to help understand the role played by each one of the components of the composite endpoint, as well as their consequences on the required sample size. The methodologies presented in this paper have been implemented in CompARE.
This paper is structured as follows. In Section 2, we introduce the settings of the problem and the CompARE web tool. In Section 3, we review sample size planning when evaluating risk difference. In Section 4, we present sample size formulae for composite binary endpoints based on the parameters of the components plus the correlation. We further describe the performance of these formulae according to the parameters and propose a strategy for sizing trials when the correlation is unknown. In Section 5, we illustrate the proposal with the TACTICS-TIMI trial using CompARE, and in Section 6 we extend the proposal to those trials for which the treatment effect is measured by the relative risk or odds ratio. In Section 7, we investigate the performance of the power and significance level under misspecification of the correlation and evaluate the proposed sample size strategy with a simulation study. We conclude the paper with the Discussion.
2 Notation, assumptions and CompARE
We consider a randomized clinical trial comparing two treatment groups: the control group ($i=0$) and the treatment group ($i=1$), each one composed of $n^{(i)}$ patients who are followed for a prespecified time $\tau$. For simplicity, we consider only two events of potential interest, $\varepsilon_1$ and $\varepsilon_2$. Let $X_{ijk}$ denote the response of the $k$-th binary endpoint for the $j$-th patient in the $i$-th group of treatment ($i=0,1$, $j=1,\dots,n^{(i)}$, $k=1,2$). The response is defined by $X_{ijk}=1$ if the event $\varepsilon_k$ has occurred during the follow-up and $X_{ijk}=0$ otherwise.
We define the binary composite endpoint as the event that occurs whenever one of the endpoints is observed, that is, $\varepsilon_* = \varepsilon_1 \cup \varepsilon_2$. At this point we assume that the composite endpoint is well-defined, that is, both composite components are important enough to be considered; and we include those adverse clinical outcomes that are relevant to the clinical setting. We denote by $X_{ij*}$ the composite response, defined as a Bernoulli random variable with probability $p_*^{(i)} = P(X_{ij*}=1)$ of observing the event $\varepsilon_*$, where:
$$X_{ij*} = \begin{cases} 1 & \text{if } X_{ij1} + X_{ij2} \geq 1, \\ 0 & \text{if } X_{ij1} + X_{ij2} = 0. \end{cases} \qquad (1)$$
To evaluate whether there is a risk reduction in the treatment group compared with the control group, we set a hypothesis test where the null hypothesis states that there is no difference between the control and the treatment groups; whereas the alternative hypothesis assumes a risk reduction in the treatment group. The usual measures to evaluate the treatment effect when comparing two groups are the difference in proportions (also called risk difference), denoted by $\delta_*$; the relative risk (or risk ratio), $\mathrm{R}_*$; and the odds ratio, $\mathrm{OR}_*$. The relationship between these measures and the probabilities of observing the binary composite endpoint in each group is given in Table 1, together with the null and alternative hypotheses that should be set in each case, where $q_*^{(i)} = 1 - p_*^{(i)}$. The following sections will be developed in terms of the risk difference of the composite binary endpoint. Section 6 extends the results to the relative risk and odds ratio.

Measure  Effect  Null hypothesis  Alternative hypothesis
Risk difference  $\delta_* = p_*^{(1)} - p_*^{(0)}$  $H_0: \delta_* = 0$  $H_1: \delta_* < 0$
Relative risk  $\mathrm{R}_* = p_*^{(1)} / p_*^{(0)}$  $H_0: \mathrm{R}_* = 1$  $H_1: \mathrm{R}_* < 1$
Odds ratio  $\mathrm{OR}_* = \dfrac{p_*^{(1)} / q_*^{(1)}}{p_*^{(0)} / q_*^{(0)}}$  $H_0: \mathrm{OR}_* = 1$  $H_1: \mathrm{OR}_* < 1$
2.1 An insight into the parameters of the composite endpoint
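Let $p_k^{(i)}$ and $q_k^{(i)} = 1 - p_k^{(i)}$ represent the probabilities that the event $\varepsilon_k$ ($k=1,2$) occurs or not, respectively, for a patient belonging to the $i$-th group ($i=0$ for control, $i=1$ for treatment). Let $\rho^{(i)}$ denote Pearson’s correlation coefficient between the components in the $i$-th group. The probability of observing the composite event, $p_*^{(i)}$, can be written in terms of the probabilities of $\varepsilon_1$ and $\varepsilon_2$ and the correlation, as follows:
$$p_*^{(i)} = 1 - q_1^{(i)} q_2^{(i)} - \rho^{(i)} \sqrt{p_1^{(i)} p_2^{(i)} q_1^{(i)} q_2^{(i)}} \qquad (2)$$
Note here that the probability of observing the composite endpoint becomes smaller as the correlation between the components of the composite increases.
The effect size in the composite endpoint in terms of the risk difference, $\delta_* = p_*^{(1)} - p_*^{(0)}$, is given by:
$$\delta_* = \delta_1 q_2^{(0)} + \delta_2 q_1^{(0)} - \delta_1 \delta_2 - \rho^{(1)} \sqrt{p_1^{(1)} p_2^{(1)} q_1^{(1)} q_2^{(1)}} + \rho^{(0)} \sqrt{p_1^{(0)} p_2^{(0)} q_1^{(0)} q_2^{(0)}} \qquad (3)$$
where $\delta_k = p_k^{(1)} - p_k^{(0)}$ ($k=1,2$) corresponds to the risk difference for each of its components.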
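The dependence of the composite parameters on the correlation can be checked numerically. The following minimal Python sketch relies on the standard identity for two binary events, P(both) = p1 p2 + rho sqrt(p1 q1 p2 q2), combined with inclusion-exclusion. The function names and the example values are illustrative assumptions; they are not taken from CompARE.

```python
from math import sqrt


def composite_prob(p1, p2, rho):
    """P(event 1 or event 2) from marginal probabilities and Pearson correlation."""
    q1, q2 = 1 - p1, 1 - p2
    # P(both) = p1*p2 + rho*sqrt(p1*q1*p2*q2); the union follows by inclusion-exclusion.
    return 1 - q1 * q2 - rho * sqrt(p1 * p2 * q1 * q2)


def composite_risk_difference(p0, delta, rho0, rho1=None):
    """Composite risk difference (treatment minus control).

    p0    : (p1, p2) event rates in the control group
    delta : (d1, d2) marginal risk differences
    rho0  : correlation in the control group
    rho1  : correlation in the treatment group (defaults to rho0)
    """
    rho1 = rho0 if rho1 is None else rho1
    treated = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho1)
    control = composite_prob(p0[0], p0[1], rho0)
    return treated - control


if __name__ == "__main__":
    p0, delta = (0.10, 0.15), (-0.02, -0.03)  # hypothetical values
    for rho in (0.0, 0.3, 0.6):
        print(rho, composite_prob(*p0, rho), composite_risk_difference(p0, delta, rho))
```

With these hypothetical inputs, the composite probability shrinks and the (negative) risk difference moves toward zero as the correlation grows, which is the behaviour described in the text.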
From now on the correlation is assumed equal for both groups and denoted by , that is, . Let denote the vector of event rates of the composite components in the control group, that is, , and let represent the vector of marginal effect sizes, that is, . We will denote the risk difference as a function of the marginal parameters () and the correlation by ; and the probability of observing under the control group by . We remark here that when and are fixed such that and (), the risk difference increases with respect to the correlation (see Appendix A).
2.2 CompARE
We present CompARE (available at https://cinna.upc.edu/compare/; the open-source code is available at https://github.com/MartaBofillRoig/CompARE), an open-source and completely free web platform that can be used as a tool for clinicians, medical researchers and statisticians to compute the sample size according to the procedure proposed in this paper. Furthermore, CompARE can be used to:

Determine the sample size for different situations, among them, when the correlation is not known.

Specify the treatment effect for the composite endpoint based on the marginal information of the composite components, and to study the performance of the composite parameters according to them.

Calculate and interpret the measures of association among the composite components, then investigate their characteristics.

Choose the best primary endpoint to lead the trial. CompARE implements the Asymptotic Relative Efficiency method GomezLagakos ; BofillGomez , which quantifies the difference in efficiency of using a composite endpoint, rather than one of its components, as the primary endpoint.
Figure 1 summarizes all the capabilities of CompARE. To use CompARE, the least you should provide is the effect size and event rates of the composite components as well as the correlation.
3 Sample Size when the parameters of the composite endpoint can be anticipated
In this section we summarize the statistics and sample size formulae to test for a risk difference when the probability of occurrence of the composite binary endpoint in the control group can be anticipated and for a given expected risk difference. Since the composite endpoint is a univariate outcome, a single statistical test is performed and, consequently, no multiplicity problem occurs and no statistical adjustment is needed. Therefore, as we will see, the formulae follow the univariate case and are straightforward; however, to make the paper self-contained and the following sections meaningful, we present them in terms of the composite endpoint parameters.
Herein we assume a clinical trial where, first, patients are randomized to one of two treatment arms following a balanced design and, second, where the primary endpoint is a binary composite endpoint. The aim is to detect a hypothesized risk reduction in the primary composite endpoint at the significance level of and with desired power equal to . Let be the total sample size required, with patients per group (); and let us denote by and the values of standardized normal deviates corresponding to and .
The null hypothesis is stated as and is compared against the alternative hypothesis . To test against we use the statistic:
(4) 
where . Under , follows, asymptotically, the standard normal distribution. We will reject the null hypothesis at the level of significance if . The variance in equation (4) can be estimated under using the pooled variance estimate Donner1984 : or under using the unpooled variance estimate:
For a given probability under control group , the required sample size using the pooled estimate to have power in order to detect an effect size of at a significance level is given by Lachin1981 ; Fleiss1981 :
(5) 
Note that in (5) we have replaced with .
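As an illustration of the calculation summarized in this section, the sketch below implements the textbook sample size formulae for comparing two proportions under the pooled and unpooled variance estimates. It assumes a one-sided test (the alternative is a risk reduction) and returns the number of patients per group in a balanced design; both conventions, as well as the function name and default values, are assumptions made for this example and may differ from the exact form of equation (5).

```python
from math import sqrt
from statistics import NormalDist


def n_per_group(p0, delta, alpha=0.05, power=0.80, pooled=True):
    """Patients per group to detect a risk difference `delta` (one-sided test).

    p0    : event probability in the control group
    delta : anticipated risk difference (treatment minus control), non-zero
    """
    p1 = p0 + delta                          # event probability under treatment
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    var_alt = p0 * (1 - p0) + p1 * (1 - p1)  # variance term under the alternative
    if pooled:
        p_bar = (p0 + p1) / 2                # pooled proportion under the null
        numerator = z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(var_alt)
    else:
        numerator = (z_alpha + z_beta) * sqrt(var_alt)
    return (numerator / delta) ** 2


if __name__ == "__main__":
    # Hypothetical composite parameters: 20% control event rate, 5-point reduction.
    print(n_per_group(0.20, -0.05, pooled=True))
    print(n_per_group(0.20, -0.05, pooled=False))
```

In practice the returned value would be rounded up to the next integer; the pooled version is never smaller than the unpooled one because the pooled variance term is at least as large.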
4 Sample Size based on anticipated values of the composite components
The sample size formulae outlined in Section 3 are based on the parameters of the composite endpoint, that is, the event rate under the control group, , and the treatment effect, . In this section, we derive the sample size based on the anticipated information on the marginal parameter values and the correlation, even if the correlation value is not fully specified and/or the event rate values are not accurately anticipated.
4.1 Sample size based on composite components
Given the event rates in the control group , the expected effect size for each component , and the correlation between the occurrence of both components , we will denote by the needed sample size, which is computed by using the unpooled variance estimate, to detect a risk difference (see equation (3)) at significance level with power.
The expression for is obtained after direct substitution into formula (6) and is as follows:
(7) 
where is given in (2). Note that the sample size also relies on the significance level and the power , but these are omitted for ease of notation. The corresponding sample size under the pooled estimate can be analogously calculated by using , and and its expression can be found in the online support material.
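The substitution described above can be sketched in a few lines of Python: the composite event probabilities are first obtained from the marginal event rates, the marginal effects and a common correlation, and are then plugged into the standard unpooled-variance formula. The per-group convention, the one-sided significance level and all names are assumptions of this sketch, not the exact expression (7).

```python
from math import sqrt
from statistics import NormalDist


def composite_prob(p1, p2, rho):
    """P(event 1 or event 2) from marginal probabilities and Pearson correlation."""
    return 1 - (1 - p1) * (1 - p2) - rho * sqrt(p1 * p2 * (1 - p1) * (1 - p2))


def n_from_components(p0, delta, rho, alpha=0.05, power=0.80):
    """Per-group sample size (unpooled variance) for the composite risk difference.

    p0    : (p1, p2) event rates of the components in the control group
    delta : (d1, d2) risk differences of the components
    rho   : correlation between components, assumed equal in both groups
    """
    p_control = composite_prob(p0[0], p0[1], rho)
    p_treated = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho)
    delta_star = p_treated - p_control       # composite risk difference
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return z ** 2 * variance / delta_star ** 2


if __name__ == "__main__":
    p0, delta = (0.10, 0.15), (-0.02, -0.03)  # hypothetical values
    for rho in (0.0, 0.25, 0.5):
        print(rho, n_from_components(p0, delta, rho))
```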
4.2 Sample size bounds
Assuming that the correlation is the same in the two treatment groups, it follows that the correlation takes values between the lower bound, , and the upper bound, , which are functions of and , and are defined as:
(8)  
Note that when at least one of the event rates is very close to , the lower bound will also be close to and the plausible correlation values will be always positive. We also notice that, in clinical trials the probabilities of observing the events are often quite low and commonly smaller than . In this case, the expressions for and can be simplified. See the online supplementary material for more details.
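The admissible range of the correlation can be computed directly from the marginal probabilities. The sketch below uses the classical Fréchet bounds for the correlation between two binary variables and then, assuming a common correlation, intersects the ranges of the control and treatment groups. Taking this intersection is our reading of the bounds in (8); the function names and example values are hypothetical.

```python
from math import sqrt


def corr_bounds_single(p1, p2):
    """Fréchet bounds for the Pearson correlation of two binary events."""
    q1, q2 = 1 - p1, 1 - p2
    lower = max(-sqrt(p1 * p2 / (q1 * q2)), -sqrt(q1 * q2 / (p1 * p2)))
    upper = min(sqrt(p1 * q2 / (p2 * q1)), sqrt(p2 * q1 / (p1 * q2)))
    return lower, upper


def corr_bounds(p0, delta):
    """Range of a common correlation that is admissible in both groups."""
    low_c, up_c = corr_bounds_single(p0[0], p0[1])
    low_t, up_t = corr_bounds_single(p0[0] + delta[0], p0[1] + delta[1])
    return max(low_c, low_t), min(up_c, up_t)


if __name__ == "__main__":
    print(corr_bounds_single(0.2, 0.2))    # equal event rates
    print(corr_bounds_single(0.001, 0.1))  # one very rare event
    print(corr_bounds((0.10, 0.15), (-0.02, -0.03)))
```

The first two calls illustrate the remarks in the text: the upper bound reaches its maximum when both event rates coincide, and the lower bound approaches zero when one of the events is very rare.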
Considering such bounds for given marginal parameters and , the sample size is an increasing function of the correlation , and it is bounded below and above by and , respectively. As a consequence, the more correlated the single endpoints are, the larger the sample size needed to detect differences between groups in the composite endpoint. Details of this derivation are provided in Appendix B (see Theorem 1).
4.3 Sample size with uncertain correlation value
Since the correlation plays an important role in calculating the sample size, we propose a strategy for deriving the sample size when the parameters that correspond to the composite components are known and the correlation value is not specified in advance.
Prior knowledge about the effect of the treatment being investigated can allow scientists to foresee whether the two events of interest, and , are weakly, moderately or strongly correlated. We allow for prior information by splitting the range of the correlation into three equal-sized intervals, and we consider three correlation categories: weak for the interval with the lowest correlation values; moderate for the intermediate correlation values; and strong for the highest correlation values. If any such information exists, we take it into account and proceed as follows:

Correlation bounds for each category:
Considering the categories weak/moderate/strong for the correlation, the plausible correlation values for a given () are in this situation those between the lower and upper values within each category. If the events are weakly correlated, the correlation is between and ; if they are moderately correlated, its value lies between and ; and if they are strongly correlated, it is between and .
If we cannot place the correlation in any of the above categories, we use the most severe case within its plausible values, then, . (See Table 2). 
Calculate the sample size in each category:
For the sample size, we advocate using the maximum sample size across all its possible values. That is, , , and for weak, moderate or strong correlations, respectively. Note that since we are assuming the correlation value that maximizes the sample size across its plausible values, we are guaranteeing that the prespecified power is attained.
If the correlation value cannot be ascribed to any category, then we propose a conservative sample size strategy of using the overall maximum possible sample size, that is, . Table 2 outlines the range of correlation and sample size values, together with the proposed sample size for each category.
Category  Correlation Bounds  Sample Size Bounds  Sample Size 

Weak  
Moderate  
Strong  
No prior information 
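The strategy summarized in Table 2 can be sketched as follows: the admissible correlation range is split into three equal-width categories, and the largest sample size within the chosen category is reported. This is a minimal sketch that assumes Fréchet-type correlation bounds, the unpooled per-group formula and a simple grid search for the maximum; all names and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def composite_prob(p1, p2, rho):
    return 1 - (1 - p1) * (1 - p2) - rho * sqrt(p1 * p2 * (1 - p1) * (1 - p2))


def n_from_components(p0, delta, rho, alpha=0.05, power=0.80):
    """Per-group sample size (unpooled variance) for the composite risk difference."""
    pc = composite_prob(p0[0], p0[1], rho)
    pt = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho)
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z ** 2 * (pc * (1 - pc) + pt * (1 - pt)) / (pt - pc) ** 2


def corr_bounds(p0, delta):
    """Common-correlation range admissible in both groups (Fréchet bounds)."""
    lows, ups = [], []
    for p1, p2 in (p0, (p0[0] + delta[0], p0[1] + delta[1])):
        q1, q2 = 1 - p1, 1 - p2
        lows.append(max(-sqrt(p1 * p2 / (q1 * q2)), -sqrt(q1 * q2 / (p1 * p2))))
        ups.append(min(sqrt(p1 * q2 / (p2 * q1)), sqrt(p2 * q1 / (p1 * q2))))
    return max(lows), min(ups)


def category_sample_sizes(p0, delta, grid=200, **kwargs):
    """Largest sample size within each correlation category."""
    lower, upper = corr_bounds(p0, delta)
    width = (upper - lower) / 3
    categories = {
        "weak": (lower, lower + width),
        "moderate": (lower + width, lower + 2 * width),
        "strong": (lower + 2 * width, upper),
        "no prior information": (lower, upper),
    }
    sizes = {}
    for name, (a, b) in categories.items():
        rhos = [a + (b - a) * i / grid for i in range(grid + 1)]
        sizes[name] = max(n_from_components(p0, delta, r, **kwargs) for r in rhos)
    return sizes


if __name__ == "__main__":
    for name, n in category_sample_sizes((0.10, 0.15), (-0.02, -0.03)).items():
        print(name, n)
```

The grid search does not rely on the sample size being monotone in the correlation; when it is increasing, as stated in Section 4.2, the maximum in each category is simply attained at the upper end of that category.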
4.4 Sample size accounting for departures from the anticipated event rates
The marginal parameters are often estimated through previous studies or pivotal trials with a limited number of patients and whose patient populations or concomitant drugs could differ from the current ones. Because of that, there is great uncertainty in the values that need to be anticipated for computing the sample size. In this section, we consider that the event rates and have been previously estimated and their corresponding standard errors of the point estimate are provided.
Let denote a set of plausible values for the true value of . For instance, for those previous trials in which we have the standard deviations for the event rates, we can use the set of plausible values for that a confidence interval would yield. We address the issue of sizing a trial for a significance level and power based on the intervals and , and for fixed effects and when the correlation value is not known.
We state that, for given and and at fixed , the sample size (see equation (7)) that is needed for power at a significance level , falls into the interval:
(9) 
This interval is such that it contains the sample size required to attain power , which is necessary for detecting an effect size equal to at a significance level according to the marginal effects and , the correlation , and the event rates within (). Note that the interval gives us the plausible sample size values by taking into account the uncertainty of the marginal parameter values, and it provides us the maximum sample size that we would need even though the anticipated event rates are not accurate.
Consider the set of values for the marginal parameters, and denote by and the lower and upper bounds of the correlation within the set . Then, for all , and , we have that:
(10) 
Furthermore, for given , the sample size is an increasing function of the correlation .
The sample size given by delimits the values that the sample size could take in terms of the correlation, accounting for plausible deviations in the anticipated event rates. If there is no prior information on the correlation, we can use as the needed sample size. If, on the other hand, we have some prior information on the correlation value, the rationale used in Section 4.3 with correlation categories can also be applied here to the function . Table 3 provides the sample size strategy under this circumstance. The behavior of the sample size when the event rates vary within the intervals and , and its subsequent behavior according to the correlation, are laid out in Propositions 2 and 3 in the supplementary material.
Category  Correlation Bounds  Sample Size Bounds  Chosen Sample Size 

Weak  
Moderate  
Strong  
No prior information 
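The idea behind Table 3 can be approximated numerically. In the hedged sketch below, the event rates range over a grid within their plausible intervals, the correlation range is restricted to values admissible for every pair of rates, and the largest resulting sample size is reported. The grid search stands in for the closed-form bounds in (9) and (10), which are not reproduced here; the interval endpoints, function names and per-group convention are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def composite_prob(p1, p2, rho):
    return 1 - (1 - p1) * (1 - p2) - rho * sqrt(p1 * p2 * (1 - p1) * (1 - p2))


def n_from_components(p0, delta, rho, alpha=0.05, power=0.80):
    """Per-group sample size (unpooled variance) for the composite risk difference."""
    pc = composite_prob(p0[0], p0[1], rho)
    pt = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho)
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z ** 2 * (pc * (1 - pc) + pt * (1 - pt)) / (pt - pc) ** 2


def corr_bounds(p0, delta):
    """Common-correlation range admissible in both groups (Fréchet bounds)."""
    lows, ups = [], []
    for p1, p2 in (p0, (p0[0] + delta[0], p0[1] + delta[1])):
        q1, q2 = 1 - p1, 1 - p2
        lows.append(max(-sqrt(p1 * p2 / (q1 * q2)), -sqrt(q1 * q2 / (p1 * p2))))
        ups.append(min(sqrt(p1 * q2 / (p2 * q1)), sqrt(p2 * q1 / (p1 * q2))))
    return max(lows), min(ups)


def n_over_intervals(interval1, interval2, delta, rho=None, steps=20):
    """Largest sample size over a grid of event rates within the two intervals.

    Returns (sample size, (lower, upper) correlation bounds valid for all rates).
    If rho is None, the upper correlation bound is used (no prior information).
    """
    def grid(interval):
        a, b = interval
        return [a + (b - a) * i / steps for i in range(steps + 1)]

    rates = [(r1, r2) for r1 in grid(interval1) for r2 in grid(interval2)]
    bounds = [corr_bounds(p0, delta) for p0 in rates]
    rho_low = max(b[0] for b in bounds)   # admissible for every pair of rates
    rho_up = min(b[1] for b in bounds)
    rho_used = rho_up if rho is None else rho
    n_max = max(n_from_components(p0, delta, rho_used) for p0 in rates)
    return n_max, (rho_low, rho_up)


if __name__ == "__main__":
    # Hypothetical plausible intervals for the two control-group event rates.
    print(n_over_intervals((0.08, 0.12), (0.13, 0.17), (-0.02, -0.03)))
```

A finer grid (larger `steps`) tightens the approximation at the cost of more evaluations.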
4.5 Power performance of the proposed strategies
Given () and for a fixed sample size , the power function using the unpooled variance estimate is defined as:
(11) 
where denotes the cumulative distribution of the standard normal distribution. The power function for the pooled variance estimator can be found in the online support material.
In what follows, we show that the planned power is achieved with any of the previous strategies in Subsections 4.3 and 4.4.

If and are fixed and the correlation value is not known, we have and the proposed sample size becomes . The resulting power is then such that:
The power attained using the upper bound of the correlation is equal to the prespecified power value () when the correlation is the maximum value within its range, that is, . Otherwise, if the correlation is less than , the power will be always higher than the prespecified power. Table S1 in the online supplementary material details the power performance when the correlation categories are taken into account.

If the event rate value is within the interval for and the effect sizes are fixed, then . If in addition we have no prior information on the correlation value, then since the sample size increases with respect to the correlation, it follows that , and then the proposed sample size turns into . The corresponding power then satisfies:
The power attained is equal to the prespecified power value when the event rates take the upper values and the correlation is equal to . If that is not the case, the power obtained will be larger than the prespecified .
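The power guarantee discussed above can be verified numerically. The sketch below assumes the usual normal-approximation power function for the unpooled one-sided test with a per-group sample size, which may differ in convention from equation (11). It sizes the trial at a high correlation value and evaluates the power at a lower one; the names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def composite_prob(p1, p2, rho):
    return 1 - (1 - p1) * (1 - p2) - rho * sqrt(p1 * p2 * (1 - p1) * (1 - p2))


def n_from_components(p0, delta, rho, alpha=0.05, power=0.80):
    """Per-group sample size (unpooled variance) for the composite risk difference."""
    pc = composite_prob(p0[0], p0[1], rho)
    pt = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho)
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z ** 2 * (pc * (1 - pc) + pt * (1 - pt)) / (pt - pc) ** 2


def power_unpooled(n, p0, delta, rho, alpha=0.05):
    """Approximate power of the one-sided test with n patients per group."""
    pc = composite_prob(p0[0], p0[1], rho)
    pt = composite_prob(p0[0] + delta[0], p0[1] + delta[1], rho)
    delta_star = pt - pc                              # negative under a risk reduction
    std_error = sqrt((pc * (1 - pc) + pt * (1 - pt)) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Reject when the statistic falls below -z_alpha.
    return NormalDist().cdf(-delta_star / std_error - z_alpha)


if __name__ == "__main__":
    p0, delta = (0.10, 0.15), (-0.02, -0.03)          # hypothetical values
    n = n_from_components(p0, delta, rho=0.79)        # sized at a high correlation
    for rho in (0.0, 0.4, 0.79):
        print(rho, power_unpooled(n, p0, delta, rho))
```

When the true correlation equals the value used for sizing, the planned power is recovered; for lower correlations the attained power exceeds it, in line with the statements above.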
5 Motivating example: TACTICSTIMI trial
In managing the syndrome of unstable angina and non-Q-wave acute myocardial infarction, there is controversy over whether using an invasive strategy rather than a conservative strategy offers any advantage. TACTICS-TIMI was a randomized trial that evaluated the efficacy of invasive and conservative treatment strategies in patients with unstable angina and non-Q-wave AMI treated with tirofiban, heparin, and aspirin Cannon2001 .
Patients were randomly assigned to either an early invasive strategy or an early conservative strategy. The primary hypothesis of the TACTICS-TIMI trial was that an early invasive strategy would reduce the combined incidence of death, acute myocardial infarction, and rehospitalization for acute coronary syndromes at six months when compared with an early conservative strategy. The primary endpoint was the composite endpoint formed by the combination of death or nonfatal myocardial infarction (), and rehospitalization for acute coronary syndrome () at six months.
For illustrative purposes, we assume that a trial will be planned for a similar setting and that the results of TACTICS-TIMI 18 are to be used. Since studies prior to TACTICS-TIMI 18 also considered the events death and nonfatal myocardial infarction together, we presume that the event rate and effect size on the endpoint can be anticipated despite its being composed of two events. The estimated frequency of death or nonfatal myocardial infarction () in the conservative strategy group was with a standard deviation of ; whereas the frequency of rehospitalization for acute coronary syndrome () was with a standard deviation of . Based on the standard deviations of the estimated event rates, we use the confidence intervals as sets of plausible values for the true values , , that is, and . The observed effects in TACTICS-TIMI were and , and we will use these as the expected effects in the new trial.
We consider these parameters to construct the correlation bounds outlined in equation (8). The effects and and the values and imply that the eligible values for lie in the interval (, ). Using the intervals and , the correlation bounds are such that the considered values are plausible for any event rate within and . This gives us the correlation bounds (, ). Table 4 and its accompanying figure show the correlation bound according to and with varying values of the event rates. Observe that the upper bound takes the value when both event rates are equal, and the lower bound tends to when at least one of the event rates becomes smaller.
Event rate values  Correlation Bounds 

,  
,  
, 
FIGURE Lower bound (surface in blue) and upper bound (in red) for the correlation according to the effect sizes , and where the marginal event rates take values between and .
We illustrate the aspects of calculating power and sample size using the platform CompARE. CompARE calculates the sample size by anticipating the marginal information in terms of either risk difference, relative risk, or odds ratio. In this particular case, we use the statistical test for risk difference under pooled variance in order to ascertain the treatment differences in the composite endpoint at a significance level of and target power of . The results obtained from CompARE are presented in the form of summary tables and plots.
Figure 2 (left panel) depicts the performance of the sample size in terms of the correlation for given marginal parameters and ; and it illustrates the recommended sample size for each correlation category (weak, moderate, and strong). The solid line represents the sample size as a function of the correlation computed for the anticipated values , and the shaded areas represent the region of values, constructed by , , and , within which interval the sample size falls. Based on and the proposed sample size (in dotted lines) is the upper value of the shaded area within the correlation category.
Note that the sample size is highly sensitive to the anticipated parameters. For instance, for , using and , the required sample size is . This sample size, however, can differ substantially from that calculated using other reasonable values, such as the upper or lower limits for the intervals and , which would imply and , respectively.
Figure 2 (right panel) describes the statistical power achieved under the proposed method. Assuming that we have correctly anticipated the correlation category, observe that in all cases the achieved power is larger than the planned power, . Then, the method guarantees the desired power. If we could correctly anticipate the values of the event rates, then the achieved power would lie between and , in accordance with the plausible correlation values. If we base the sample size calculation on the intervals and , we will be overestimating the statistical power more than in the previous case, thus obtaining a power between and .
Table 5 describes the proposed sample size for each correlation category and reports the possible values for the statistical power, assuming that we have correctly anticipated the correlation category.
Based on point values , for the event rates:  
Correlation bounds: , .  
Association strength  Correlation  Sample size  Achieved power 
Weak  (, )  
Moderate  (, )  
Strong  (,)  
Based on intervals and for the event rates:  
Correlation bounds: , .  
Association strength  Correlation  Sample size  Achieved power 
Weak  (, )  
Moderate  (, )  
Strong  (, ) 
6 An extension for risk ratio and odds ratio
In this Section, we show that the risk ratio and odds ratio for the composite endpoint can also be expressed in terms of its margins plus the correlation, and we extend the sample size derivation given in Section 4 for evaluating the risk and odds ratio.
6.1 Composite effect expressed in terms of the risk ratio or the odds ratio
Let and denote the risk ratio and odds ratio, respectively, for the th event. The risk ratio for the composite endpoint, , is expressed in terms of the risk ratio of its components and , the event rates under control group, and , and the correlation between them, , as follows: