1 Introduction
The amount of information collected from patients involved in clinical trials is steadily growing, in particular with the advent of genomics and proteomics. All collected baseline characteristics are potential covariates linked with the patient's outcome. Some covariates may not be of primary interest in a randomized clinical trial (RCT); however, they can be used to explain the variability of the patients' response and to improve the study power when assessing treatment efficacy.
The adjustment for baseline covariates to improve the efficiency of randomized clinical trial analysis can be done in many ways. For example, patients may be stratified according to various factors before randomization; age, gender, social status, and race are commonly used stratification criteria that may interact with the outcome. The most traditional method, however, is to include covariates in a general regression equation of the form $Y = \alpha + \tau Z + \beta^\top X + \varepsilon$, where $Y$ is the outcome variable, $\alpha$ a constant, $Z$ the treatment, $X$ the vector of covariates, and $\varepsilon$ the error term. The parameter $\tau$ measures the adjusted treatment effect and $\beta$ is the vector of regression coefficients of the covariates. Including covariates associated with the study outcome can greatly improve the efficiency and power of the trial, and can correct for potential bias coming from baseline covariate imbalance between the study arms. However, adding covariates to the analysis comes at a cost in degrees of freedom. As such, regression adjustment should be seen as a trade-off between explained variance and loss of degrees of freedom. Clearly, for small trials, the number of covariates included in the model must be limited. There are many rules of thumb on the number of covariates that can be included in an analysis (Austin and Steyerberg (2015); Schmidt (1971)). A common one is to have 10 subjects per variable in the model. The problem with this heuristic and similar approaches is that the variance explained by the covariates is not taken into account.
In this paper, we investigate the cost/benefit ratio of including covariates in the analysis of RCTs. Instead of focusing on the estimation of the treatment effect, we seek to minimize its sampling variance, i.e., to increase its statistical precision, while considering the covariates as nuisance factors. Within this context, we address the long-running question "What is the optimum number of covariates to include in a clinical trial?"
To improve the cost/benefit ratio of covariates, their weights in the model could be estimated from historical data accumulated from outside sources. In cancer research, the treatment effect is often adjusted for a single baseline prognostic index. For example, the Nottingham prognostic index (NPI) defined by Galea et al. (1992) and used in breast cancer incorporates the size and grade of the tumor as well as the nodal status. Adjusting for all three parameters would explain more variance but at an additional cost in terms of degrees of freedom (Moons et al. (2009); International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993); Galea et al. (1992); Keys et al. (1972)). There are numerous examples of this kind. We investigate the benefits of replacing individual covariates by a composite covariate fitted on external data, motivated by the recent advances in placebo effect characterization (Horing et al. (2014); Pereira et al. (2016); Vachon-Presseau et al. (2018)). Indeed, the multiple facets and high dimensionality of the placebo effect make any adjustment difficult; it could therefore advantageously benefit from a composite covariate approach.
This work is structured as follows. In Section 2, we introduce the general model describing the relationship between the patient's outcome and the treatment while accounting for a vector of potential covariates. It serves as the generative data model in our theoretical developments, simulations, and illustrations. Section 3 focuses on the sampling variance of the treatment effect with and without covariate adjustment. In Section 4, we propose an approach to select the covariates minimizing the expected sampling variance of the treatment effect based on historical data. In Section 5, we discuss the relative efficiency of combining covariates a priori as a way to limit the number of parameters to be fitted in the model. In Section 6, we perform simulation studies to demonstrate the benefit of the composite covariate approach. We conclude with a brief discussion.
2 Generative model of the data
Suppose we focus on the treatment effect only. Then, the response model writes

(1) $Y = \alpha + \tau Z + \eta,$

where for simplicity the treatment variable $Z$ equals 1 for treatment and 0 for placebo, and $\eta$ is the error term. The random variable $\eta$ accounts for all factors not linked with the treatment. Since in most clinical studies patients are randomized between arms, independence between $Z$ and $\eta$ can be assumed. The random variable $\eta$ may in turn be expressed as a linear function of the covariates $X = (X_1, \dots, X_K)^\top$, namely

(2) $\eta = \beta^\top X + \varepsilon,$

where $X$ is a vector of $K$ covariates with coefficient vector $\beta$, and the error term $\varepsilon$ is assumed to be normally distributed, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, independently of $X$. Thus, by combining Equations (1) and (2) and assuming $E(X) = 0$ without loss of generality, the general regression model writes

(3) $Y = \alpha + \tau Z + \beta^\top X + \varepsilon.$
For the sake of simplicity, this paper mainly focuses on the two-group setting: placebo versus active. However, all results can easily be generalized to $J$ study groups as presented in Appendix A.
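As an illustration, the generative model of Equations (1)-(3) can be simulated in a few lines. The paper's own code is in R; the sketch below is a minimal Python analogue in which $n$, $K$, $\tau$, the weights $\beta$, and the error variance are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative values -- n, K, tau, beta and sigma_eps are assumptions
n, K = 100, 5
alpha, tau = 0.0, 0.5
beta = np.array([0.5, 0.4, 0.3, 0.2, 0.1])   # covariate weights
sigma_eps = 1.0

Z = np.repeat([0, 1], n // 2)                # 1:1 randomized treatment
X = rng.standard_normal((n, K))              # covariates, mean zero
eps = rng.normal(scale=sigma_eps, size=n)    # error term of Equation (2)
eta = X @ beta + eps                         # Equation (2)
Y = alpha + tau * Z + eta                    # Equations (1) and (3)
```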
3 Variance of the estimated treatment effect
To estimate the added value of the covariates, we must compare the variance of the estimators of $\tau$ with and without covariates. To avoid any confusion, we denote by $\hat{\tau}_0$ the ordinary least squares (OLS) estimator of $\tau$ when no covariate is used in the regression (Equation (1)) and by $\hat{\tau}_K$ the estimator when the $K$ covariates are included in the regression (Equation (3)). Now, consider a random sample of $n$ observations with responses ($i = 1, \dots, n$)

(4) $Y_i = \alpha + \tau Z_i + \beta^\top X_i + \varepsilon_i,$

where $Y_i$ is the response for patient $i$, $Z_i$ is the treatment assigned, $X_i = (X_{i1}, \dots, X_{iK})^\top$ are the observed covariates, and $\varepsilon_i$ is the error term, which is independent of $Z_i$ and $X_i$. The vector of treatment assignments is denoted $Z = (Z_1, \dots, Z_n)^\top$. The $n \times K$ design matrix of covariates is denoted $X$.
3.1 Without covariate ($K = 0$)

When no covariate is included in the model, Equation (4) simplifies to $Y_i = \alpha + \tau Z_i + \eta_i$, with $\mathrm{Var}(\eta_i) = \sigma_\eta^2$. For group sizes $n_0$ (placebo) and $n_1$ (treatment), the OLS estimated treatment effect writes

(5) $\hat{\tau}_0 = \bar{Y}_1 - \bar{Y}_0,$

the difference between the two group means, and its sampling variance, conditional on $Z$, is given by the expression

(6) $\mathrm{Var}(\hat{\tau}_0 \mid Z) = \frac{\sigma_\eta^2}{\sum_{i=1}^n (Z_i - \bar{Z})^2} = \sigma_\eta^2 \left( \frac{1}{n_0} + \frac{1}{n_1} \right).$

Observe that $\sigma_\eta^2$ can be estimated without bias by $\hat{\sigma}_\eta^2 = SS_0 / (n - 2)$, where $SS_0$ is the residual sum of squares with $n - 2$ degrees of freedom. Thus, the estimated sampling variability writes

(7) $\widehat{\mathrm{Var}}(\hat{\tau}_0 \mid Z) = \hat{\sigma}_\eta^2 \left( \frac{1}{n_0} + \frac{1}{n_1} \right)$

(8) $\phantom{\widehat{\mathrm{Var}}(\hat{\tau}_0 \mid Z)} = \frac{SS_0}{n - 2} \left( \frac{1}{n_0} + \frac{1}{n_1} \right).$
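Equations (5)-(8) can be checked numerically; the sketch below uses toy simulated data with an assumed true effect $\tau = 0.5$, and also verifies the identity $\sum_i (Z_i - \bar{Z})^2 = n_0 n_1 / n$ underlying the two forms of Equation (6).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
Z = np.repeat([0, 1], n // 2)
Y = 0.5 * Z + rng.standard_normal(n)   # toy outcome, true tau = 0.5 (assumed)

n0, n1 = (Z == 0).sum(), (Z == 1).sum()
tau_hat0 = Y[Z == 1].mean() - Y[Z == 0].mean()        # Equation (5)

# Residual sum of squares SS_0 of the model with intercept and treatment only
fitted = np.where(Z == 1, Y[Z == 1].mean(), Y[Z == 0].mean())
ss0 = ((Y - fitted) ** 2).sum()
sigma2_eta_hat = ss0 / (n - 2)                        # n - 2 degrees of freedom
var_hat0 = sigma2_eta_hat * (1 / n0 + 1 / n1)         # Equations (7)-(8)

# The two denominators in Equation (6) agree: sum (Z_i - Zbar)^2 = n0 n1 / n
assert np.isclose(((Z - Z.mean()) ** 2).sum(), n0 * n1 / n)
```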
3.2 With $K$ covariates

In a similar way, when the $K$ covariates are included in the model, Equation (4) is used with $K > 0$. The OLS estimator of $\tau$ is given by the expression

(9) $\hat{\tau}_K = \frac{\sum_{i=1}^n e_i Y_i}{\sum_{i=1}^n e_i^2},$

where the $e_i$ are the OLS residuals from the regression of $Z$ on the covariates $X$ based on the $n$ observations $(Z_i, X_i)$, $i = 1, \dots, n$. The sampling variance of the OLS estimator, conditional on $Z$ and $X$, writes

(10) $\mathrm{Var}(\hat{\tau}_K \mid Z, X) = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^n (Z_i - \bar{Z})^2 \, (1 - R^2_{Z \cdot X})},$

where $R^2_{Z \cdot X}$ is the estimated multiple coefficient of determination of the regression of $Z$ on the covariates $X$.

As before, we note that $\sigma_\varepsilon^2$ can be estimated without bias by $\hat{\sigma}_\varepsilon^2 = SS_K / (n - K - 2)$, where $SS_K$ is the residual sum of squares with $n - K - 2$ degrees of freedom. As such, the estimated sampling variability writes

(11) $\widehat{\mathrm{Var}}(\hat{\tau}_K \mid Z, X) = \frac{\hat{\sigma}_\varepsilon^2}{\sum_{i=1}^n (Z_i - \bar{Z})^2 \, (1 - R^2_{Z \cdot X})}$

(12) $\phantom{\widehat{\mathrm{Var}}(\hat{\tau}_K \mid Z, X)} = \frac{SS_K}{n - K - 2} \cdot \frac{1}{\sum_{i=1}^n (Z_i - \bar{Z})^2 \, (1 - R^2_{Z \cdot X})}.$
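Equation (9) is the partialled-out (Frisch-Waugh) representation of the adjusted OLS estimator. The sketch below, on toy data with assumed coefficients, verifies that it matches the treatment coefficient of the full regression (3) and computes the variance estimate of Equations (11)-(12).

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 60, 4
Z = np.repeat([0.0, 1.0], n // 2)
X = rng.standard_normal((n, K))
Y = 0.5 * Z + X @ np.array([0.6, 0.4, 0.2, 0.1]) + rng.standard_normal(n)

# Residuals e_i of the OLS regression of Z on the covariates (with intercept)
A = np.column_stack([np.ones(n), X])
e = Z - A @ np.linalg.lstsq(A, Z, rcond=None)[0]
tau_hatK = (e @ Y) / (e @ e)                          # Equation (9)

# It matches the treatment coefficient of the full OLS fit of model (3)
F = np.column_stack([np.ones(n), Z, X])
coef = np.linalg.lstsq(F, Y, rcond=None)[0]
assert np.isclose(tau_hatK, coef[1])

# Estimated sampling variance, Equations (11)-(12)
ssK = ((Y - F @ coef) ** 2).sum()
sigma2_eps_hat = ssK / (n - K - 2)                    # n - K - 2 df
Szz = ((Z - Z.mean()) ** 2).sum()
R2_zx = 1 - (e @ e) / Szz          # multiple R^2 of Z on the covariates
var_hatK = sigma2_eps_hat / (Szz * (1 - R2_zx))
```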
3.3 Benefits of including covariates

Conditional on $Z$ and $X$, a gain in the statistical precision of the estimated treatment effect is obtained if

(13) $\mathrm{Var}(\hat{\tau}_K \mid Z, X) < \mathrm{Var}(\hat{\tau}_0 \mid Z).$

Using Equations (6) and (10), the inequality writes

$\frac{\sigma_\varepsilon^2}{\sum_{i} (Z_i - \bar{Z})^2 \, (1 - R^2_{Z \cdot X})} < \frac{\sigma_\eta^2}{\sum_{i} (Z_i - \bar{Z})^2}$

or

(14) $\frac{\sigma_\varepsilon^2}{\sigma_\eta^2} < 1 - R^2_{Z \cdot X}.$

However, the actual values of the covariates $X$ are not known in advance and should be treated as random variables. Therefore, the inequality should hold on average, taking the expectation over the joint distribution of $Z$ and $X$. In doing so, we define the relative efficiency of including the $K$ covariates in the model, namely

(15) $RE_{K:0} = \frac{E\left[ \mathrm{Var}(\hat{\tau}_K \mid Z, X) \right]}{E\left[ \mathrm{Var}(\hat{\tau}_0 \mid Z) \right]}.$
As discussed in Shieh (2020), it is instructive to assume that the covariates have independent and identical normal distributions for each patient,

(16) $X_i \sim N_K(\mu_X, \Sigma), \quad i = 1, \dots, n,$

where $\mu_X$ is a $K \times 1$ mean vector and $\Sigma$ is a positive-definite variance-covariance matrix. With the independence between $Z$ and $X$, $1 - R^2_{Z \cdot X}$ follows a beta distribution, $1 - R^2_{Z \cdot X} \sim \mathrm{Beta}\!\left( \frac{n - K - 1}{2}, \frac{K}{2} \right)$. Therefore, direct computation gives

(17) $E\left[ \frac{1}{1 - R^2_{Z \cdot X}} \right] = \frac{n - 3}{n - 3 - K}.$
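The expectation in Equation (17) can be checked by Monte Carlo, regressing a fixed treatment vector on freshly drawn independent normal covariates; the sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, reps = 30, 4, 5000            # illustrative sizes
Z = np.repeat([0.0, 1.0], n // 2)
Szz = ((Z - Z.mean()) ** 2).sum()

vals = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, K))                # independent of Z
    A = np.column_stack([np.ones(n), X])
    e = Z - A @ np.linalg.lstsq(A, Z, rcond=None)[0]
    vals[r] = Szz / (e @ e)                        # 1 / (1 - R^2_{Z.X})

print(vals.mean(), (n - 3) / (n - 3 - K))          # the two should be close
```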
Since the relative efficiency should be less than 1, by combining Equations (15) and (17), we get the condition

(18) $\frac{\sigma_\varepsilon^2}{\sigma_\eta^2} \cdot \frac{n - 3}{n - 3 - K} < 1.$

By setting

(19) $\rho_K^2 = 1 - \frac{\sigma_\varepsilon^2}{\sigma_\eta^2},$

the proportion of the variance of $\eta$ explained by the $K$ covariates after accounting for treatment, the condition can be written

(20) $RE_{K:0} = (1 - \rho_K^2) \, \frac{n - 3}{n - 3 - K} < 1.$

As a consequence, the $K$ covariates included in the model improve the statistical precision of the estimator of $\tau$ if

(21) $\rho_K^2 > \frac{K}{n - 3}.$

Equivalently, the number of variables to be included should satisfy the inequality

(22) $K < \rho_K^2 \, (n - 3).$
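Conditions (20)-(22) are simple enough to apply directly at the planning stage; a small helper, including the $J$-group generalization of Equation (23) with $J = 2$ by default, might read:

```python
def rel_efficiency(n: int, K: int, rho2: float, J: int = 2) -> float:
    """Relative efficiency of adjusting for K covariates explaining a
    fraction rho2 of the residual variance (Equations (20) and (23));
    values below 1 mean the adjusted estimator is more precise."""
    return (1 - rho2) * (n - J - 1) / (n - J - 1 - K)

# n = 100 patients, two arms: 10 covariates explaining 20% help...
assert rel_efficiency(100, 10, 0.20) < 1      # 10 < 0.20 * 97 = 19.4
# ...but 25 covariates explaining the same 20% hurt precision
assert rel_efficiency(100, 25, 0.20) > 1      # 25 > 19.4
```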
These equations can easily be extended to a more general setting with more than two groups. The extension to $J$ groups is presented in Appendix A. The result is a slightly modified version of Equation (20),

(23) $RE_{K:0} = (1 - \rho_K^2) \, \frac{n - J - 1}{n - J - 1 - K} < 1.$

For the sake of simplicity, we focus here on the two-group setting. However, the reader should keep in mind that $n - 3$ and $n - 3 - K$ can be viewed as $n - J - 1$ and $n - J - 1 - K$ with $J = 2$.
4 Optimal number of covariates
In the previous section, we show how to estimate the maximum number of covariates. The next obvious question is: “What is , the optimal number of covariates to be included in an analysis?” To answer this question, hypotheses should be made on the gain in variance brought by each individual covariate. In a simplistic approach, let’s first assume that the covariates are independent and explain the same amount of variance, . Due to the independence assumption, the relative efficiency (Equation (20)) becomes:
(24) 
The relative efficiency is monotonically decreasing with if and increasing otherwise. As such, depending on , the optimal number of covariates that should be included in the analysis is either none or all of them. This results has an easy and interesting practical application for the a priori selection of covariates. Assuming the independence between the covariates, can only decrease while including covariates explaining more than . This threshold can easily be checked on prior data while computing the correlation between each covariate and the outcome. As such, one could include in the model all covariates with an expected correlation with the outcome above .
In a more realistic scenario, the amount of explained variance is not equally spread amongst the covariates, and the covariates might not be independent of each other. In a clinical research context, it is relatively fair to assume that a ranking from the most to the least interesting covariate is known. Then, using Equation (20), the optimal number of covariates is:

(25) $K_{opt} = \underset{K}{\arg\min} \; (1 - \rho_K^2) \, \frac{n - 3}{n - 3 - K},$

with $\rho_0^2 = 0$. Here, $\rho_K^2$ is the population coefficient of determination of the regression with respect to the $K$ most interesting covariates and can be estimated from previous data in the same indication.
Indeed, developing a new drug requires the conduct of several successive clinical trials. In most cases, it is fair to assume the existence of previous study data in the same indication. As the covariates are expected to be independent of the study treatment, previous data could come from studies investigating other compounds as well. These previous study data can be leveraged to estimate all $\rho_K^2$ using the formula proposed by Olkin and Pratt (1958):

(26) $\hat{\rho}_K^2 = 1 - \frac{n' - 3}{n' - K - 1} \, (1 - R_K^2) \; {}_2F_1\!\left( 1, 1; \frac{n' - K + 1}{2}; 1 - R_K^2 \right),$

where $R_K^2$ is the multiple R-squared of the regression of the outcome on the $K$ covariates, $n'$ the number of patients in the previous existing data, and ${}_2F_1$ the Gauss hypergeometric function. We use $n'$ to make clear that the number of patients in the previous data is not the same as the number of patients in the current (or planned) study of interest.
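A sketch of this estimator is given below; the hypergeometric function ${}_2F_1(1, 1; c; x)$ is summed directly as a series, and the exact constants follow the commonly cited form of the Olkin-Pratt estimator, which should be checked against the original paper before use.

```python
def hyp2f1_11(c: float, x: float, tol: float = 1e-12) -> float:
    """Gauss hypergeometric 2F1(1, 1; c; x) via its series (|x| < 1)."""
    term, total, k = 1.0, 1.0, 0
    while abs(term) > tol:
        term *= (k + 1) * x / (c + k)   # ratio of successive series terms
        total += term
        k += 1
    return total

def rho2_olkin_pratt(r2: float, n_prev: int, K: int) -> float:
    """Estimate of the population rho^2 from the sample multiple R^2 of a
    historical study (n_prev patients, K covariates); reconstruction of
    Equation (26) following the commonly cited Olkin-Pratt (1958) form."""
    return 1 - (n_prev - 3) / (n_prev - K - 1) * (1 - r2) * \
        hyp2f1_11((n_prev - K + 1) / 2, 1 - r2)

# The estimator shrinks an optimistic sample R^2 toward the population value
assert rho2_olkin_pratt(0.30, n_prev=50, K=5) < 0.30
```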
Of note, the assumption that a ranking of the covariates is known is not strictly required. Indeed, one could compute the relative efficiency while testing all possible subsets of covariates. However, this might be computationally intensive and prone to overfitting.
5 Composite covariate approach
Assuming that historical data exist, as in the previous section, could we do better than only estimating the optimum number of covariates? The main problem with the use of covariates is the associated loss of degrees of freedom. To avoid this issue, we could derive the vector of covariate weights, $\hat{\gamma}$, directly from historical data. More simply, we define a new covariate $U = \hat{\gamma}^\top X$ as a composition of the individual covariates. This composite covariate is then used like any single covariate in the following studies, limiting the loss of degrees of freedom. Specifically, the model given by Equation (3) simplifies as follows

(27) $Y = \alpha + \tau Z + \beta_U U + \varepsilon_U.$
We have a gain in the statistical precision of the treatment effect estimator (denoted $\hat{\tau}_U$) if the relative efficiency of the new composite covariate, $RE_{U:0}$, is less than the relative efficiency of using the $K$ covariates, namely if

(28) $RE_{U:0} < RE_{K:0},$

or, using the expression in Equation (20) for a single covariate and for general $K$, if

(29) $(1 - \rho_U^2) \, \frac{n - 3}{n - 4} < (1 - \rho_K^2) \, \frac{n - 3}{n - 3 - K}$

or

(30) $\frac{1 - \rho_U^2}{n - 4} < \frac{1 - \rho_K^2}{n - 3 - K},$

where $\rho_U^2$ is the relative proportion of the variance of $\eta$ explained by the composite covariate $U$ in the population. Thus, for a benefit of the composite covariate with respect to the $K$ covariates, we need to have

(31) $\rho_U^2 > 1 - (1 - \rho_K^2) \, \frac{n - 4}{n - 3 - K}.$
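Condition (31) quantifies how imperfect the composite can afford to be; a small helper with illustrative numbers:

```python
def composite_threshold(n: int, K: int, rho2_K: float) -> float:
    """Minimum fraction of variance rho2_U a composite covariate must explain
    to beat adjustment for the K individual covariates (Equation (31))."""
    return 1 - (1 - rho2_K) * (n - 4) / (n - 3 - K)

# n = 50 patients, 10 covariates jointly explaining 30% of the variance:
# a composite explaining only ~13% already matches them
print(composite_threshold(50, 10, 0.30))
```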
To summarize, the relative efficiency of models (a) with no covariate ($\hat{\tau}_0$), (b) with the $K$ covariates ($\hat{\tau}_K$), and (c) with the composite covariate $U$ ($\hat{\tau}_U$), can be compared. Figure 1 displays the ranges of values of $K$ and $\rho_U^2$ according to the pairwise comparisons. Due to the linearity of the generative model defined in Section 2, $\rho_U^2$ is upper-bounded by $\rho_K^2$. As such, the y-axis, representing possible values of $\rho_U^2$, ranges between $0$ and $\rho_K^2$. However, in practice, $\rho_K^2$ is not constant but monotonically increasing with $K$. On the x-axis, the number of covariates, $K$, takes values between $0$ and $n - 4$. Firstly, from Equation (21), $\rho_U^2$ should be at least $1/(n - 3)$ for $\hat{\tau}_U$ to be as efficient as $\hat{\tau}_0$. This is represented by the horizontal line. Secondly, Equation (31) implies that $\rho_U^2$ should be larger than $1 - (1 - \rho_K^2)(n - 4)/(n - 3 - K)$ for the composite covariate to be more efficient than the $K$ covariates. This bound is represented by the curve starting at the top left corner. Above the curve, $\hat{\tau}_U$ is more efficient and below, $\hat{\tau}_K$ is more efficient. Thirdly, from Equation (21), when $K$ is larger than $\rho_K^2 (n - 3)$, $\hat{\tau}_K$ is less efficient than $\hat{\tau}_0$. This threshold is represented by the vertical line in Figure 1.

The three lines cross at the same point, hence defining six sets of values for $K$ and $\rho_U^2$ according to the three pairwise comparisons. In Figure 1, each set is identified by a unique ordering of the three estimators from most to least efficient. These results show that $\hat{\tau}_U$ becomes the most efficient estimator when $K$ increases. In particular, a composite covariate does not need to be perfect: it might be the best option even if $\rho_U^2 < \rho_K^2$.
Replacing the $K$ covariates by a composite covariate $U$ estimated from previous data offers several advantages. First, it improves the precision while limiting the loss of degrees of freedom. The composite covariate approach can be seen as a way to borrow degrees of freedom from previous data. Another advantage is to free the estimation of the treatment effect from the modeling of the covariates. As such, one could use a non-linear model or machine learning to estimate the composite covariate (Rasmussen and Williams (2006); Hastie et al. (2009); Bishop (2006)). The explained variance of a non-linear composite covariate, $\rho_U^2$, is then no longer upper-bounded by $\rho_K^2$. Furthermore, the number of covariates entering the composite is limited only by the previous data and could be larger than what the current study alone could support. Of note, composite covariates are already used in practice, e.g., through prognostic indexes (Moons et al. (2009); International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993); Galea et al. (1992)). A common non-linear example is the Body Mass Index (BMI) (Keys et al. (1972)). However, in this paper, we propose to use them as a way to optimize the precision of the estimated treatment effect.
6 Simulation studies
To further illustrate the impact and relative gain of $K$ covariates or of a composite covariate on treatment precision and power, numerical simulations were performed. All simulations were performed with the R software and are available in the supplementary materials. The simulated studies are generated according to the model described in Section 2 (Equations (1)-(3)). The covariates and the random errors are generated from independent normal distributions. The vector of true covariate weights $\beta$ is defined as:
(32) 
This arbitrary choice is made to be representative of a common study setting where a few covariates explain most of the variance. The treatment effect, $\tau$, is chosen such that the studies have 80% power without any covariate. Previous (historical) data are also generated using exactly the same procedure.
For the simulations, we choose to fix the total number of patients of the current study to $n$ ($n/2$ in each group) while varying the number of covariates, $K$, included in the estimation of the treatment effect. Both $n$ and $K$ are directly linked to the degrees of freedom, so there is little interest in changing both parameters at the same time. The number of patients in the previous data is set to $n'$. The total amount of variance explained by the full set of covariates was fixed arbitrarily.
(33) 
Following the current simulation hypotheses, the variance explained by the first $K$ covariates, $\rho_K^2$, is:
(34) 
Using this result in Equation (20) gives the theoretical relative efficiency of the estimator $\hat{\tau}_K$ with respect to $\hat{\tau}_0$, i.e., of using the $K$ covariates in the current simulations as compared to not using them. This relative efficiency is presented in Figure 2 as the dashed curve (a). The associated solid curve is the estimated relative efficiency of the $K$ covariates with its 95% confidence interval based on 10,000 simulations. The estimated relative efficiency is computed as the ratio of the empirical variances of $\hat{\tau}_K$ and $\hat{\tau}_0$. As defined in Morris et al. (2019), the empirical variance is simply the estimated variance of the estimator over the simulations.

To illustrate the use of a composite covariate, a ridge regression is trained on the historical data for each simulation while varying the number of covariates (Hoerl and Kennard (1970)). These ridge models are then used to predict the composite covariate, $U$, in the simulated studies. The estimated relative efficiency of using $U$ ($\hat{\tau}_U$ vs $\hat{\tau}_0$) is presented as the solid line (b) in Figure 2 with its confidence interval. As can be seen in the figure, the use of a composite covariate can lead to an important gain in precision. Of course, the gain depends on $\rho_U^2$, the variance explained by the composite covariate. Similarly as for the $K$ covariates, we can estimate the relative efficiency of an ideal composite covariate approach assuming $\rho_U^2 = \rho_K^2$. The relative efficiency of this ideal composite covariate is depicted in the figure as the dashed line (b). The larger the amount of historical data, the closer the composite covariate gets to this upper bound.

Figure 3 presents the power associated with the three approaches in the simulated studies: (a) without any covariate, (b) with the $K$ covariates, and (c) with a composite covariate. Without any covariate, the power is around 80%, as designed in the simulation protocol. The use of the $K$ covariates first brings a boost in power, which then decreases as $K$ grows (solid line (b)). The solid line (c) shows the power gained by using the composite covariate; this power remains high even when $K$ increases.

The dashed line (b) represents the expected power of using the $K$ covariates under the simulation hypotheses. The dashed line (c) represents the expected power of an ideal composite covariate. These power estimations are performed using the approach proposed by Shieh (2020). As for the relative efficiency, the advantage of the composite covariate grows with $K$.
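The paper's simulation code is in R (supplementary materials). The sketch below re-creates the spirit of the protocol in Python under assumed settings that are not the paper's (here $n = 100$, $n' = 250$, $K = 20$, weights $0.9^j$ scaled so the covariates explain 50% of the outcome variance, ridge penalty 1) and compares the empirical variances of the three estimators $\hat{\tau}_0$, $\hat{\tau}_K$, and $\hat{\tau}_U$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_prev, K = 100, 250, 20      # trial size, historical size, covariates
tau = 0.4                        # true treatment effect (assumed)
beta = 0.9 ** np.arange(1, K + 1)        # a few covariates dominate
beta *= np.sqrt(0.5 / (beta @ beta))     # covariates explain 50% of Var(Y)
sigma = np.sqrt(0.5)                     # residual standard deviation

def simulate(m, randomized):
    Z = np.repeat([0.0, 1.0], m // 2) if randomized else np.zeros(m)
    X = rng.standard_normal((m, K))
    Y = tau * Z + X @ beta + rng.normal(scale=sigma, size=m)
    return Z, X, Y

# Composite covariate: ridge weights fitted once on historical data
_, Xh, Yh = simulate(n_prev, randomized=False)
lam = 1.0                                # ridge penalty (arbitrary)
gamma = np.linalg.solve(Xh.T @ Xh + lam * np.eye(K), Xh.T @ Yh)

def tau_hat(Z, covs, Y):
    cols = [np.ones(len(Y)), Z] + ([covs] if covs is not None else [])
    A = np.column_stack(cols)
    return np.linalg.lstsq(A, Y, rcond=None)[0][1]

reps = 2000
est = {"no covariate": [], "K covariates": [], "composite U": []}
for _ in range(reps):
    Z, X, Y = simulate(n, randomized=True)
    est["no covariate"].append(tau_hat(Z, None, Y))
    est["K covariates"].append(tau_hat(Z, X, Y))
    est["composite U"].append(tau_hat(Z, X @ gamma, Y))

for name, v in est.items():
    print(f"{name:>13}: empirical variance {np.var(v):.4f}")
```

With these settings, the composite covariate recovers most of the precision gain of full adjustment while spending a single degree of freedom, in line with Section 5.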
7 Discussion and conclusion
Assessing the treatment efficacy correctly is of critical importance in randomized clinical trials. However, since it is not ethical to expose too many patients to an unproven treatment, the sample size and power of phase I/II trials are often limited. In this context, several statistical approaches have been developed to maximize the study power and the statistical precision of the treatment effect estimate. One such approach, the analysis of covariance (ANCOVA), relies on baseline covariates to adjust for possible imbalance between study groups and to explain the variability of the patients' response, thereby improving the study power.
Including covariates associated with the study outcome can greatly improve the efficiency and power of the trial. However, adding covariates to the analysis comes at a cost in degrees of freedom. As such, regression adjustment should be seen as a trade-off between explained variance and loss of degrees of freedom. There are many rules of thumb on the number of covariates that can be included in an analysis. To the best of our knowledge, none of them balances both explained variance and degrees of freedom.
In this paper, we answered the question of the number of covariates while focusing on the precision of the estimated treatment effect in an ANCOVA. Our result for the maximum number of covariates is a simple closed-form formula, $K < \rho_K^2 \, (n - J - 1)$, combining the number of patients and groups with the variance explained by those covariates. We also proposed a simple method relying on available data to estimate the optimal number of covariates. This data-driven approach can easily be applied in practice to plan future trials.
Assuming data of previous studies to be available, we showed how to further improve the study power by fitting the covariate weights a priori. Specifically, a composite covariate is fitted on previous data and replaces the individual covariates in the treatment effect estimation. The composite covariate approach is already used in practice, e.g., through prognostic indexes (see Moons et al. (2009); International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993); Galea et al. (1992)). With this paper, we investigated the use of composite covariates specifically to optimize the precision of the treatment effect estimation. Using a composite covariate allows one to trade some explained variance to avoid the loss of degrees of freedom. The associated gain is particularly important when the sample size is small and the number of covariates is large.
Considering the recent advances in placebo effect characterization (Horing et al. (2014); Pereira et al. (2016); Vachon-Presseau et al. (2018)), the composite covariate approach could have a major impact on future RCTs by disentangling the placebo response from the actual treatment efficacy. The placebo effect is a complex, individual-dependent phenomenon with components linked to the subject's demography, psychology, sociology, and disease intensity. The highly multivariate aspect of the placebo effect makes any adjustment difficult. The composite covariate approach could be used in this context to control for this major confounding factor in RCTs.
Acknowledgments
The authors are grateful for the valuable feedback and suggestions provided by Marc Buyse which greatly improved this article.
Appendix A Generalization to $J$ groups
In this section, we generalize the previous results to $J$ groups. The treatment variable $Z$ now takes values in $\{1, \dots, J\}$. We denote by $\mu = (\mu_1, \dots, \mu_J)^\top$ the vector of all group intercepts. As previously, we can compute the variance of the estimator of $\mu$, with and without the covariates. We denote by $\hat{\mu}_0$ the estimator of $\mu$ when no covariate is used in the model and by $\hat{\mu}_K$ the estimator when the $K$ covariates are included in the regression.
When there is no covariate, the sampling variance-covariance matrix of $\hat{\mu}_0$ can be written as

(35) $\mathrm{Var}(\hat{\mu}_0 \mid Z) = \sigma_\eta^2 \, D^{-1},$

where $D = \mathrm{diag}(n_1, \dots, n_J)$ and $n_j$ is the $j$th group size. We have $\sum_{j=1}^J n_j = n$.
When the $K$ covariates are included, the sampling variance-covariance matrix of $\hat{\mu}_K$ is

(36) $\mathrm{Var}(\hat{\mu}_K \mid Z, X) = \sigma_\varepsilon^2 \left( D^{-1} + B \, E^{-1} B^\top \right),$

where $X_i$ is the covariate vector for patient $i$, $\bar{X}_j$ is the mean vector for treatment group $j$, $B$ is the $J \times K$ matrix whose $j$th row is $\bar{X}_j^\top$, and $E = \sum_{i=1}^n (X_i - \bar{X}_{Z_i})(X_i - \bar{X}_{Z_i})^\top$ is the within-group sum of squares and cross-products matrix. We assume here, without any loss of generality, the covariates to be centered, $\bar{X} = 0$.
The treatment effects are computed using a contrast matrix $C$ of size $(J - 1) \times J$ of full row rank, satisfying $C \mathbf{1}_J = 0$. The treatment effect is then a vector of size $J - 1$:

(37) $\tau = C \mu.$

Its sampling variance-covariance matrix is respectively

(38) $\mathrm{Var}(C \hat{\mu}_0 \mid Z) = \sigma_\eta^2 \, C D^{-1} C^\top$

(39) $\mathrm{Var}(C \hat{\mu}_K \mid Z, X) = \sigma_\varepsilon^2 \, C \left( D^{-1} + B \, E^{-1} B^\top \right) C^\top.$
Variances of the marginal distributions for the individual entries of the vector $C\hat{\mu}$ lie on the diagonal of the variance-covariance matrix. As such, the sampling variance of the estimator of the $l$th entry of the vector $\tau$ is

(40) $\mathrm{Var}(c_l^\top \hat{\mu}_0 \mid Z) = \sigma_\eta^2 \, c_l^\top D^{-1} c_l$

(41) $\mathrm{Var}(c_l^\top \hat{\mu}_K \mid Z, X) = \sigma_\varepsilon^2 \, c_l^\top \left( D^{-1} + B \, E^{-1} B^\top \right) c_l,$

where $c_l^\top$ is the $l$th row of $C$. The ratio of the sampling variances of the two estimators is

(42) $\frac{\mathrm{Var}(c_l^\top \hat{\mu}_K \mid Z, X)}{\mathrm{Var}(c_l^\top \hat{\mu}_0 \mid Z)} = \frac{\sigma_\varepsilon^2}{\sigma_\eta^2} \cdot \frac{c_l^\top \left( D^{-1} + B \, E^{-1} B^\top \right) c_l}{c_l^\top D^{-1} c_l}.$
As previously, we assume the covariates have independent and identical normal distributions for each patient. From Shieh (2020), we then have

(43) $E\left[ \frac{c_l^\top \left( D^{-1} + B \, E^{-1} B^\top \right) c_l}{c_l^\top D^{-1} c_l} \right] = \frac{n - J - 1}{n - J - 1 - K},$

where the expectation is taken over the distribution of the covariates. The relative efficiency becomes

(44) $RE_{K:0} = \frac{E\left[ \mathrm{Var}(c_l^\top \hat{\mu}_K \mid Z, X) \right]}{\mathrm{Var}(c_l^\top \hat{\mu}_0 \mid Z)}$

(45) $\phantom{RE_{K:0}} = \frac{\sigma_\varepsilon^2}{\sigma_\eta^2} \cdot \frac{n - J - 1}{n - J - 1 - K}$

(46) $\phantom{RE_{K:0}} = (1 - \rho_K^2) \, \frac{n - J - 1}{n - J - 1 - K}.$
As a consequence, the $K$ covariates included in the model improve the statistical precision of the estimators if $RE_{K:0} < 1$, i.e., if

(47) $\rho_K^2 > \frac{K}{n - J - 1}.$

Equivalently, the number of variables to be included should satisfy the inequality

(48) $K < \rho_K^2 \, (n - J - 1).$
References

Austin, P. C. and Steyerberg, E. W. (2015). The number of subjects per variable required in linear regression analyses. Journal of Clinical Epidemiology, 68(6):627-636.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Galea, M. H., Blamey, R. W., Elston, C. E., and Ellis, I. O. (1992). The Nottingham Prognostic Index in primary breast cancer. Breast Cancer Research and Treatment, 22(3):207-219.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67.
Horing, B., Weimer, K., Muth, E. R., and Enck, P. (2014). Prediction of placebo responses: a systematic review of the literature. Frontiers in Psychology, 5:1079.
International Non-Hodgkin's Lymphoma Prognostic Factors Project (1993). A predictive model for aggressive non-Hodgkin's lymphoma. New England Journal of Medicine, 329(14):987-994.
Keys, A., Fidanza, F., Karvonen, M. J., Kimura, N., and Taylor, H. L. (1972). Indices of relative weight and obesity. Journal of Chronic Diseases, 25(6-7):329-343.
Moons, K. G., Royston, P., Vergouwe, Y., Grobbee, D. E., and Altman, D. G. (2009). Prognosis and prognostic research: What, why, and how? BMJ, 338(7706):1317-1320.
Morris, T. P., White, I. R., and Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11):2074-2102.
Olkin, I. and Pratt, J. W. (1958). Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics, 29(1):201-211.
Pereira, A., Duale, C., Clermont, F., Gramme, P., Branders, S., Gossuin, C., and Demolle, D. (2016). (171) Characterization and prediction of placebo responders in peripheral neuropathic patients in a 4-week analgesic clinical trial. The Journal of Pain, 17(4):S18.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
Schmidt, F. L. (1971). The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31(3):699-714.
Shieh, G. (2020). Power analysis and sample size planning in ANCOVA designs. Psychometrika, 85(1):101-120.
Vachon-Presseau, E., Berger, S. E., Abdullah, T. B., Huang, L., Cecchi, G. A., Griffith, J. W., Schnitzer, T. J., and Apkarian, A. V. (2018). Brain and psychological determinants of the placebo pill response in chronic pain patients. Nature Communications, 9(1):3397.