1 Introduction
Designing new and efficient therapies is a long and ever more costly process, with less than ten percent of new treatments entering Phase I ultimately being approved by the FDA and commercialized [1, 2]. One of the major challenges for the improvement of drug development is to better understand how drugs interact with patients, particularly for treatments displaying heterogeneous responses. Therefore, conducting a detailed analysis of clinical trial data is critical for finding subgroups of patients with a higher benefit-risk ratio, or for understanding why a drug does not work on some subpopulation so as to improve existing therapeutic strategies.
Moreover, understanding the relationships among the patient descriptors which characterize the most responsive cross-section of the population is of great importance when planning a Phase III trial, salvaging a failed trial, or accelerating advances in personalized medicine. This process of biomarker identification is critical for detecting subgroups within a given indication but, as shown recently for immunotherapies, can also provide the basis for pan-indication drug approval [3].
Identifying these patient subgroups, and understanding the descriptors, or covariates, which distinguish them, is the domain of subgroup analysis. This field of research has a long history in clinical biostatistics [4, 5, 6], and requires a very careful and rigorous methodological approach [7], for both confirmatory and exploratory analyses, due to multiplicity, reproducibility, and false detection issues. Furthermore, subgroup analysis has garnered even more attention recently from the pharmaceutical community with both the advent of cheap and extensive genomic measurements and the dawning of the era of so-called big data. Now that incredibly detailed patient characterizations are available, the problem of subgroup identification has shifted from the careful statistical analysis of select covariates to data-mining and machine learning approaches. While the number of features is increasing by several orders of magnitude, the number of patients typically remains the same, which makes the statistical challenge even more difficult. Therefore, effective clinical trial data analysis requires highly sensitive tools capable of detecting covariate-response associations from noisy and weak treatment response measurements, including discontinuous and non-monotonic interactions, without inflating the Type-I error rate. Indeed, false positive detection is a major caveat in subgroup analysis [7], and controlling Type-I error is a crucial requirement for any reliable methodology. These tools can be used as a preprocessing step, pruning out insignificant covariates which might obfuscate accurate subgroup identification. They can also enable investigators to rank covariates by significance so as to focus laboratory studies on a few of the most promising biological pathways. Many methods have been proposed for the detection of covariate-treatment interaction. In particular, we note modified outcome regression [8], outcome weighted learning [9], and change-point detection [10] methods as state-of-the-art techniques which are useful in constructing baseline comparisons, as we do in this work. We refer the reader to the recent review of Lipkovich et al. [11], where the authors survey various methodologies ranging from classic statistical approaches to more sophisticated machine learning methods. On the one hand, well-founded and analytically rich statistical techniques provide rigorous estimates of Type-I error, but often miss complex, non-linear interactions. On the other hand, machine learning approaches allow for the detection of these complex, and sometimes non-monotonic, interactions. However, they generally fail to provide a proper estimate of Type-I error for the impact of individual covariates; they do not possess the characteristics of proper statistical tests. Thus, we note a lack of methods which take the best of both worlds, offering high sensitivity for interactions with complex dependencies while also providing rich statistical analysis and a controlled Type-I error rate.
To address this shortfall, we propose a new series of statistical tests, each of which is designed to detect particular structures in the treatment response signal which are often observed in actual clinical trial data. The proposed tests are constructed from a cumulative process on the effect size obtained by ranking patients according to a given covariate. These processes characterize the complex dependency structure of the covariate-treatment effect. We introduce several observables which capture various facets of these interactions, going beyond a simple process maximum estimate, as used in [10]. When possible, we derive theoretical estimates to compute the p-value thresholds characterizing Type-I error, and when not possible, we propose a Monte-Carlo sampling procedure. We also present a new combined test which leverages the power of all our proposed individual tests to provide robust detection of significant covariates, while also providing fine-grained control over the Type-I error rate. Our novel combined test compares favorably to existing state-of-the-art procedures, serving as an effective tool for the exploration of clinical trial data. We evaluate our approach on both synthetic and real clinical trial datasets. For the purpose of this work, we have also created a synthetic benchmark that captures many “corner cases”, i.e. parameter regimes where existing methods reveal their limitations. These benchmarks may prove useful for other researchers evaluating their own methods.
The paper is organized as follows. In Sec. 2, we present the treatment-response model and introduce notation. Next, in Sec. 3, we review the current state of the art in covariate-response correlation analysis and covariate selection. Subsequently, in Sec. 4.1, we demonstrate a simple correction based on treatment response correlations which addresses many variance issues observed in current techniques. Then, in Sec. 4.2, we present our novel individual cumulative response tests based on null-tests against Brownian motion, each of which is tailored to different features in the measured treatment response. Finally, in Sec. 4.3, we present an approach to merge these individual cumulative tests into a single combined test which retains the performance of the individual tests while being robust to a priori unknown treatment response curves. To validate our approach, we present two numerical analyses. In our first, synthetic analysis, reported in Sec. 5, we report objective results on significant covariate detection performance over a wide range of treatment response models. In our second analysis, reported in Sec. 6, we take three real-world clinical trial datasets and report the significant variables discovered by our method. Finally, we conclude in Sec. 7 with a discussion of the applicability of our approach and possible avenues for future work.

2 Drug response model
Let us now describe the mathematical framework of the drug response model. The drug response model characterizes the observed outcome of a patient as a function of the patient’s covariates and the treatment which was administered. We will assume that this observed outcome is a binary or real-valued measurement, e.g. cell counts or cholesterol levels.
Let us denote the treatment indicator as T. In the common setting of experimental treatment versus placebo, T is a binary variable, with T = 0 corresponding to the placebo and T = 1 to the experimental treatment. Given a vector of covariates X belonging to a particular patient, we denote the observed outcome under treatment T as Y(T). A trial dataset consists of N patient records, which are assumed to be drawn i.i.d. at random, each of which contains the patient’s covariate vector, the applied treatment, and the observed outcome, i.e. D = {(x_i, t_i, y_i)}_{i=1..N}, where we define y_i = y_i(t_i) for conciseness.
We are interested in the detection of patient covariates which correlate with the spread between the experimental treatment and the placebo, or treatment effectiveness,

Δ(x) = E[Y(1) | X = x] − E[Y(0) | X = x].    (1)
Such covariates can be helpful for understanding the treatment’s mechanism of action and for selecting patient subgroups in which the treatment is most effective.
Of course, the spread is not directly observable in practice, as it would require both treatments to be administered to the same patient, so we need special methods capable of estimating the correlation of interest from indirect measurements.
In the next section, we review existing approaches that can be used for the detection of covariate-treatment interactions.
3 Existing approaches
The problem of identifying covariate-treatment interaction is directly related to the subgroup identification problem, and methods developed for one can often be adapted for the other. Many different approaches have been proposed to address these problems, from straightforward applications of standard multivariate regression techniques (with explicit modelling of the treatment-covariate interactions), to more advanced models such as modified covariates [8], outcome weighted learning [9], or change-point statistics analysis [10]. In [12], Tian & Tibshirani describe an adaptive index model which can be used for risk stratification or subgroup selection. Other examples of decision-tree-based algorithms are model-based recursive partitioning [13], the SIDES method based on differential effect search [14], the virtual twins method [15], subgroup analysis via recursive partitioning with combined additive and tree-based regression [16], and qualitative interaction trees [17]. In the following subsections, we describe standard statistical procedures as well as existing state-of-the-art approaches that can be used to detect features correlated with the treatment effect.
3.1 Linear Regression Test
One naïve approach to the detection of significant covariates would be to simply fit a linear regression to the observed response variables and study the magnitude of the regression coefficients learned for each covariate, using, for instance, the F-test or p-values. To construct such a regression, one first defines a linear observational model for the measured response as a function of the covariate-treatment pair (X, T),

y = xᵀβ + t·xᵀγ + ε,    (2)

where ε is a centered random variable independent of (X, T). The vector of regression coefficients β describes the global impact of patient covariates on the outcome, irrespective of the applied treatment. This term is often referred to as the trend of the treatment response, and does not contain any information on treatment effectiveness. The significant coefficients of the vector γ can be used as indicators of covariates which correlate with the treatment effect. In the univariate context, when we analyse a particular variable x, (2) simplifies to a linear regression with only two terms:

y = βx + γ·tx + ε.    (3)
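As an illustration, the naive regression test can be sketched on synthetic data; the data-generating model, effect sizes, and noise level below are all hypothetical, and the t-statistic of the interaction coefficient is computed by hand from the least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)              # covariate of interest
t = rng.integers(0, 2, size=n)      # treatment indicator (0 = placebo)
# hypothetical response: a trend in x plus a covariate-treatment interaction
y = 1.0 + 0.5 * x + 0.3 * t + 0.8 * x * t + rng.normal(size=n)

# design matrix: intercept, trend, treatment, interaction
X = np.column_stack([np.ones(n), x, t, x * t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# t-statistic of the interaction coefficient (last column)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stat = beta[3] / np.sqrt(cov[3, 3])
print(beta[3], t_stat)
```

Under this synthetic model the interaction coefficient is recovered accurately, but only because the interaction here happens to be linear; the shortcomings of this assumption are discussed in Sec. 3.4.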
3.2 Modified Outcome
The problem with the statistical test for the observational model defined in Eq. (3) is that one must estimate the coefficients of the trend term jointly with the interaction coefficients. Knowledge of the trend does not aid in the detection of covariate/treatment interaction, and it harms the estimation of the interaction term, the true quantity of interest, by introducing additional variance.
In the modified outcome approach of [8], the authors propose a simple modification of the observed response prior to regression which removes the effect of the trend term, allowing for the direct estimation of the expected treatment effectiveness given a set of patient covariates, Δ(x). The modification of the outcome, in the case of two treatments, consists of multiplying the observed response by the (recoded) treatment indicator to create the modified outcome Y*. In [8], it is shown that the treatment effectiveness can then be estimated by a regression performed according to the observational model,

Y* = Δ(x) + ε̃,  with E[ε̃ | X = x] = 0.    (4)
Due to the nature of clinical trials, for each patient we have only one observation of the outcome, drawn from one of the potential arms of the trial at random (e.g. treatment or placebo). Thus, the modified outcome variable may have a very large variance. Indeed, even if we assume that the per-treatment responses have small variance, the modified outcome may still contain multiple distinct tight modes, corresponding to each realization of the treatment variable. These modes have the effect of inflating the variance of Y*, as seen from the law of total variance,

Var(Y*) = E_T[ Var(Y* | T) ] + Var_T( E[Y* | T] ).    (5)

Even if Var(Y* | T) is small for each treatment arm, thus causing the expectation of the variance to be small as well, the variance of the expectation can be arbitrarily large, especially if the treatment in question has a very strong global effect or if there is a strong shift in the outcome distribution. We propose one possible approach to counteract this variance inflation in Sec. 4.1.
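The inflation described by (5) is easy to reproduce numerically. A minimal sketch, assuming a balanced two-arm trial with the treatment recoded to ±1 and hypothetical arm means of 10 and 12:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
t = rng.integers(0, 2, size=n)                      # two balanced arms
# hypothetical arms: a strong global effect (means 10 vs 12), small noise
y = np.where(t == 1, 12.0, 10.0) + rng.normal(scale=0.5, size=n)

# modified outcome for a balanced two-arm trial (T recoded to +/-1)
y_star = 2 * (2 * t - 1) * y
print(y.var(), y_star.var())
```

Even though each arm has small noise, the modified outcome splits into two tight modes of opposite sign (near −20 and +24 by construction), driving its variance orders of magnitude above that of the raw response.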
3.3 Outcome weighted learning
The outcome weighted learning (OWL) [9]
approach was initially proposed for the identification of patient subgroups, but, similarly to modified covariates, it can easily be adapted for the detection of covariate-treatment interactions. The idea of this approach is to construct a classifier f for the following weighted classification problem,

min_f Σ_i w_i L(t_i, f(x_i)),    (6)

where L is a classification loss function (hinge loss, for example) and the w_i are positive weights obtained from the observed outcomes y_i, for example an increasing function of y_i. It would not make sense to try to predict T from X alone since, by construction, T is generated independently of X. However, once the weights are introduced, the classifier tries to separate, above all, treatment and placebo patients with high outcome values, which is possible if there is a pattern in patient response to treatment. Interestingly, when adapted to the univariate case, this approach becomes equivalent to modified outcome: we are essentially interested in the correlation between the covariate and the treatment indicator weighted by the outcome, which is nothing else than the correlation between the covariate and the modified outcome.

3.4 Discontinuous Treatment Response
Until now, we have discussed the use of linear models for the detection of covariate/treatment interaction. However, this is a very simplistic assumption to make about the treatment response. In some cases, a strong non-linear, discontinuous thresholding effect can be the dominant feature in the covariate-treatment response curve, as reported by Koziol & Wu in [10] for the case of erythropoietin treatment (rHuEPO) for the prevention of post-surgery blood transfusion. There, the authors observed sharp cutoffs in treatment effectiveness as measured against baseline hemoglobin levels.
In order to detect responsive patient profiles for rHuEPO treatment, the authors of [10] proposed a stochastic cumulative process test to detect the change-point in measured baseline hemoglobin. This univariate test is built around a cumulative-process description of the treatment effectiveness. Specifically, both the placebo and rHuEPO treatment response curves were sorted according to each patient’s measured baseline hemoglobin. Subsequently, a cumulative sum was taken for each treatment, and a test was constructed to assess the statistical significance of the difference between these cumulative sums under a specified threshold of baseline hemoglobin.
To construct the test itself, [10] made the observation that under the null-hypothesis of no covariate-treatment interaction, the treatment effectiveness should behave as a random walk when the scale of the covariate is mapped to the unit interval, as the observed response would, in this case, be independent of the covariate and its ordering. In the limit of a large number of patients, this random walk converges to a Brownian motion process. If we denote this random process under the null-hypothesis as W, then the detection of statistically significant treatment effectiveness amounts to the detection of the measured cumulative process significantly diverging from W. In [10], the authors proposed a statistical test constructed around the maximum value of the measured cumulative process as compared to the most probable maximum value under the null-hypothesis. In Sec. 4.2, we will demonstrate a version of Koziol & Wu’s cumulative test within the modified outcome framework, and we will also present a series of new tests which are likewise based on the comparison between the cumulative response and random process null-hypotheses.

4 Proposed Approach
Similar to the approach of [10] described in Sec. 3.4, we focus on univariate detection of significant covariates through statistical tests based on a null-hypothesis of random walks. Since real-world trial data can contain many pathological and idiosyncratic response features, we go further than [10] by introducing, in Sec. 4.2, a set of statistical tests designed to detect different features in the underlying response signal. Since it is impossible to predict a priori what the best hypothesis for the trial data response should be, we also propose the use of a combined statistical test. In this way, we are able to utilize our proposed statistical tests as feature detectors, whose outputs, taken as a whole, create a robust description of covariate significance.
Because of the specific structure of the test framework, namely repeated tests on individual variables under controlled randomization, we are able to do better than a simple Bonferroni correction [18] for combining p-values obtained over multiple tests. We note that the combined test we propose in Sec. 4.3 can be utilized independently of the specific tests we construct, and can be seen as a general procedure for obtaining robust predictions of covariate significance in the setting where one possesses many possible statistical tests but requires an interpretation of the aggregate results.
First, however, we turn our attention to transformations of the raw response data. As first shown in the modified outcome approach of [8], such transformations can lead to significant improvements in the detection of treatment response. We will show how one such transformation both improves the signal-to-noise ratio for significant covariate detection and leads to a new significance test for the cumulative process approach.
4.1 Centered Treatment Response
As discussed earlier in Sec. 3.2, the separation of the treatment response curves can induce high variance in the treatment effectiveness estimate, even when the per-treatment response variance is small. To counteract this effect, we propose a per-treatment centering, removing the empirical average of the response conditioned on the treatment applied. This approach is similar in spirit to efficiency augmentation [8]. In the efficiency augmentation approach, one attempts to reduce estimator variance by removing an estimate of the trend term, while in our approach we remove the per-treatment mean. In general, it is possible to combine both approaches; however, we consider only per-treatment centering, since explicit fitting of a trend model may represent a significant risk of overfitting on trials with limited enrollment, where selecting a trend model a priori may introduce unnecessary bias.
For per-treatment centering, given the set of observed trial data, we modify the measured response,

ỹ_i = y_i − μ̂(t_i),    (7)

where μ̂(t) is simply the empirical mean of all observed responses for a given treatment t. Next, for the case of two-arm trials, we apply the same modified outcome approach of [8] to remove the trend term by taking the difference between the two treatments,

ỹ*_i = 2 (2 t_i − 1) ỹ_i.    (8)
At first glance, such per-treatment centering would seem to remove the treatment effectiveness signal. However, we are not interested in the magnitude of the treatment effectiveness itself, but rather in the correlation of the treatment effectiveness with the covariate under study. This correlation is preserved by the centering, and becomes more easily detectable as the variance of the outcome terms is reduced.
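Continuing the earlier numerical example, a sketch of per-treatment centering followed by the modified outcome shows the variance collapse; the arm means and noise level are again hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
t = rng.integers(0, 2, size=n)
y = np.where(t == 1, 12.0, 10.0) + rng.normal(scale=0.5, size=n)

# per-treatment centering: subtract the empirical mean of each arm
y_c = y.copy()
for arm in (0, 1):
    y_c[t == arm] -= y[t == arm].mean()

# modified outcome with and without centering (balanced arms, T -> +/-1)
z_plain = 2 * (2 * t - 1) * y
z_cent = 2 * (2 * t - 1) * y_c
print(z_plain.var(), z_cent.var())
```

Here the centered modified outcome retains only the per-arm noise (variance near 1 for this toy model), while the uncentered version carries the full variance induced by the global treatment effect.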
To help motivate our choice of pertreatment centering, we first demonstrate that pertreatment centering minimizes the variance of the modified outcome.
Lemma 1.
Given a randomized trial with each treatment chosen i.i.d. with non-zero probability and independently of the covariates, for the set of all possible modified outcomes built from responses shifted by an arbitrary function g of the treatment, the choice g(t) = E[Y | T = t], i.e. per-treatment centering, provides the minimum achievable variance.
Proof.
See Appendix A.1. ∎
Additionally, we can observe that the original modified outcome, as proposed in [8], will always have a larger variance, except in the case that the per-treatment responses are already centered.
Lemma 2.
Given a randomized two-treatment trial in which each treatment is chosen with equal probability, the centered modified outcome never has larger variance than the plain modified outcome, with equality only when the per-arm responses are already centered.
Proof.
See Appendix A.2. ∎
From Lemma 2, we see that, in the case of two-treatment trials, the variance reduction provided by per-treatment response centering becomes more pronounced as the separation in expected response between the arms increases and the intra-treatment variance decreases. This demonstrates the corrective effect of centering on the problem of multimodal response distributions discussed in Sec. 3.2.
Now that we have established that per-treatment centering is an effective form of variance reduction for the modified outcome, the question of treatment-response detection remains. We now investigate the effect of centering on the correlation between the modified response and the treatment.
Lemma 3.
The centering alters the covariance between treatment and outcome conditioned on a specified covariate value only by a constant amount which does not depend on that value.
Proof.
See Appendix A.3. ∎
Since the covariance between outcome and treatment is modified only by a constant which does not depend on the covariate, we see that the correlation conditioned on covariates is not lost by introducing per-treatment centering, only translated.
Additionally, we can see that the centered modified outcome removes global effects from the covariate-conditioned signal.
Lemma 4.
For the centered response, centering removes the global dependence between the measured outcome and the treatment: their unconditional covariance is zero.
Proof.
See Appendix A.4. ∎
Corollary 1.
Controlling for the ordering of covariates, the partial correlation between the centered treatment response and the treatment indicator is zero.
From Lemma 4, we make the observation that since the global dependency is removed, only conditioning on the covariate introduces a dependency between the treatment and the centered response; without centering, a global dependency remains in general. This leads to Corollary 1, which demonstrates that, when using the centered treatment response, all of the dependency between outcome and treatment is entirely mediated by the covariate. In other words, for the centered treatment response, there remains no explanation of the covariance other than the effect of the covariate under investigation. This property provides greater sensitivity when constructing significance tests, as we can be assured that the variability observed in the centered response is due only to the covariate.
4.2 Cumulative Tests
As discussed in Sec. 3.4, non-linearities in the treatment response signal conditioned on patient covariates can lead to poor detection of covariate-treatment correlations when using linear regression fits. Instead, methods such as [10] propose non-linear tests based on the theory of random walks. We now discuss a set of novel tests for the detection of significant levels of covariate-treatment interaction according to p-value.
We first introduce the core transformation utilized by all of our proposed significance tests. Given some modified outcomes, centered or not, and a covariate of interest, a sorting permutation π is constructed such that the vector of patient covariate values is sorted in ascending order. Here, we assume that the covariate in question has some interpretation that allows for a meaningful sense of ordering. However, it is still possible to use this procedure in the case of purely categorical covariates, as long as the arbitrarily chosen ordering remains consistent when repeated on the same covariate.
Subsequently, a cumulative response vector S is calculated as the cumulative sum,

S_k = Σ_{i=1}^{k} y*_{π(i)}.    (9)

Rather than constructing tests on the modified outcomes themselves, we evaluate the statistics of the cumulative response process S. By using the known statistics of random walks, we may construct a strong set of priors on the behavior of S under the null-hypothesis of no covariate-treatment interaction,

H₀ : the modified outcomes y* are exchangeable with respect to the covariate ordering.    (10)
Detecting a significant interaction then amounts to detecting that S diverges significantly from the bulk behavior of such a random process.
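The core transformation is just a sort followed by a cumulative sum; a minimal sketch, with a hypothetical covariate and modified outcomes drawn under the null:

```python
import numpy as np

def cumulative_response(z, x):
    """Cumulative sum of modified outcomes z after sorting patients by x."""
    order = np.argsort(x)               # the sorting permutation
    return np.cumsum(z[order])

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
z = rng.normal(size=n)                  # under H0, z is independent of x
S = cumulative_response(z, x)
print(S.shape, S[-1])
```

By construction the final entry of the process equals the total sum of the modified outcomes, so for per-treatment centered outcomes the process is pinned at zero at both ends.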
Maximum Value Test.
We first turn our attention to a small adaptation of the original cumulative maximum value test of [10], generalizing their change-point detection test to the case of interaction detection. We propose the use of the maximum value test as a baseline comparison when used in conjunction with the modified outcome of [8]. Specifically, we define the maximum normalized absolute value of the cumulative response as

M = max_{1 ≤ k ≤ N} |S_k| / (σ̂ √N),    (11)

where σ̂² is the sample variance of the increments of the realized random walk,

σ̂² = (1/N) Σ_{i=1}^{N} (y*_i − ȳ*)².    (12)

Next, as mentioned earlier in Sec. 3.4, we observe that as N → ∞, according to Donsker’s theorem [19], the cumulative process under the null-hypothesis converges to a Wiener process. The distribution of the extreme values of a Wiener process is well known in the literature [10, 20, 21],

P( sup_{0 ≤ t ≤ 1} |W(t)| ≤ a ) = Σ_{k=−∞}^{+∞} (−1)^k [ Φ((2k+1)a) − Φ((2k−1)a) ],    (13)

for any a > 0, where Φ is the standard normal cumulative distribution function (CDF). The final significance test is constructed from (13) by measuring the probability of the observed maximum value and comparing it to the probability that the null-hypothesis random walk could have produced such a result. In this case, we assume that N is large enough that the approximation via a Wiener process is accurate.

Brownian Bridge Test.
In the case of the maximum value test, we defined the baseline using the standard modified outcome. However, we would like to make use of the per-treatment response centering proposed in Sec. 4.1 to aid in the detection of covariate-treatment correlation. With this centering, the cumulative process is no longer well described by Brownian motion: once the mean is removed, the cumulative process is pinned to 0 not just at the start, but also at the end of the process. An example of such a process is shown in Fig. 1.
Specifically, if we construct the cumulative response using the per-treatment centered response,

S̃_k = Σ_{i=1}^{k} ỹ*_{π(i)},    (14)

then, by application of Donsker’s theorem [19], S̃ converges to a Brownian bridge process B as N → ∞, instead of a Wiener process. We can observe this convergence by rescaling the discrete range of patient indices to the unit interval via t = k/N. The same transformation can be applied to the maximum absolute value of the process as well. Now, according to Slutsky’s theorem, we can replace the unknown increment variance with its empirical estimate,

M̃ = max_{1 ≤ k ≤ N} |S̃_k| / (σ̂ √N) →_d sup_{0 ≤ t ≤ 1} |B(t)|,    (15)

which shows that the scaled extremal value of the centered process converges in distribution to the extremal value of the Brownian bridge. Thus, we are able to construct the statistical significance test against the null-hypothesis according to the distribution of extreme values of the Brownian bridge process, and the statistical test rejection zone can be computed according to

P( sup_{0 ≤ t ≤ 1} |B(t)| ≥ a ) = 2 Σ_{k=1}^{∞} (−1)^{k−1} e^{−2 k² a²}.    (16)
While this extremal value test is a robust statistic for significant covariate detection, there are many examples of significant treatment-covariate interaction patterns which may be missed by a test built on the Brownian bridge maximum. In the next section, we discuss some of these shortcomings and demonstrate an extremal test on Brownian excursion processes that succeeds where (16) fails.
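As a sketch of the bridge test, the following computes the normalized maximum of the centered cumulative process and evaluates the series (16) numerically; the data are synthetic, and the truncation at 100 terms is a practical choice rather than part of the method:

```python
import numpy as np

def bridge_max_pvalue(z, x, n_terms=100):
    """p-value of the scaled max-absolute-value of the cumulative process,
    against a Brownian bridge null (truncated Kolmogorov-type series)."""
    n = len(z)
    S = np.cumsum(z[np.argsort(x)])
    m = np.abs(S).max() / (np.sqrt(n) * z.std())
    k = np.arange(1, n_terms + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * m**2))
    return float(np.clip(p, 0.0, 1.0))

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
z0 = rng.normal(size=n); z0 -= z0.mean()     # H0: no interaction
z1 = z0 + 0.2 * np.sign(x); z1 -= z1.mean()  # H1: interaction with x
p0 = bridge_max_pvalue(z0, x)
p1 = bridge_max_pvalue(z1, x)
print(p0, p1)
```

On this synthetic example, even a small sign-shaped interaction drives the p-value far below any conventional threshold, while the null draw is left to the bridge distribution.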
Brownian Excursion Test.
The maximum absolute value is a good statistic for monotonic signals, where the treatment effect monotonically increases or decreases with the value of the covariate under study. However, when the covariate does not display monotonicity, maximum value tests may produce false negatives with respect to covariate significance. An example of such a case might be a covariate for which the extreme values, low or high, are correlated with low treatment response, while mid-range values are correlated with high treatment response. The max value cumulative process approach may fail here, as a particular realization may first decrease, rise, and then decrease again without ever reaching a critical extremal value which would indicate a statistically significant deviation from the null-hypothesis. However, such a process might indeed be judged significant if the origin of the process were shifted to the beginning of the domain correlated with positive treatment effect. Thus, a thoroughly general, if cumbersome, approach might be to utilize a max value cumulative test evaluated from every possible start position.
Specifically, given a centered cumulative process S̃ for a specific covariate, consider the family of circle-shifted processes constructed from it,

S̃^{(j)}_k = S̃_{ρ_j(k)} − S̃_j,    (17)

where ρ_j is the circular shift of the patient indices,

ρ_j(k) = (k + j) mod N.    (18)

Finding the extremal value over all possible shifts is then the solution to the double maximization,

M_E = max_{0 ≤ j < N} max_{1 ≤ k ≤ N} | S̃^{(j)}_k |.    (19)

This maximization over shifts and time can be equally accomplished by generalizing the Brownian bridge process to a ring domain and defining the “start” of this process to occur at its minimum. Finding the extremal value of the resulting process is then equivalent to (19). An example of this construction is given in Fig. 1.
Figure 1: Example of a Brownian bridge process and the Brownian excursion constructed from it. Coloring indicates the rearrangement used to construct the Brownian excursion. For both processes, the 95% confidence region is given in grey. In this case, a Brownian bridge process well within its confidence interval constructs a Brownian excursion process whose maximum leaves the interval.

The procedure we have just described is nothing more than finding the extremal value of a Brownian excursion,
M_E →_d sup_{0 ≤ t ≤ 1} E(t),    (20)

where E denotes the normalized Brownian excursion. The rejection zone can be computed according to the known distribution of the excursion maximum,

P( sup_{0 ≤ t ≤ 1} E(t) > a ) = 2 Σ_{k=1}^{∞} (4 k² a² − 1) e^{−2 k² a²}.    (21)
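The circular-shift construction is only a few lines; this sketch takes an already-computed bridge-like process and returns the excursion maximum:

```python
import numpy as np

def excursion_max(S):
    """Maximum of the excursion obtained by circularly shifting a bridge-like
    cumulative process so that it restarts right after its minimum."""
    S = np.asarray(S, dtype=float)
    j = int(np.argmin(S))
    shifted = np.concatenate([S[j + 1:], S[:j + 1]]) - S[j]
    return shifted.max()

# a toy bridge that dips early and recovers: its plain maximum is 0,
# but the excursion maximum recovers the full range of the process
S = np.array([-1.0, -3.0, -2.0, -1.5, -0.5, 0.0])
print(excursion_max(S))     # equals max(S) - min(S) = 3.0
```

Note that the double maximization over shifts and time collapses to the range of the bridge, max S̃ − min S̃, since every circularly shifted value appears in the shifted process.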
Process Normalization.
The overall maximum of a Brownian motion, or of a Brownian bridge, is a natural statistic to consider; however, it treats the entire interval as homogeneous. In reality, the maximum value is much more likely to be observed at the right extreme for Brownian motion, or near the center for the Brownian bridge. Therefore, even if there is a signal in a zone of low variance, it is compared against the maximum over the entire interval and might easily be overlooked. This problem can be solved by considering a rejection zone that takes into account the changing variance of the underlying process. In the case of a Brownian motion, such normalization means that the rejection zone becomes a square-root hull proportional to √t, and in the case of a Brownian bridge, the hull shape is defined by √(t(1−t)).
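A sketch of the normalized (hull) statistic for the bridge case, with its critical value obtained by Monte-Carlo simulation; the trial size, replicate count, and unit-variance increments are arbitrary choices for this illustration:

```python
import numpy as np

def hull_statistic(S):
    """Max of a bridge-like cumulative process normalized by the pointwise
    null standard deviation sqrt(t (1 - t)), assuming unit-variance
    increments; the clip guards the endpoints where the hull vanishes."""
    n = len(S)
    tt = np.arange(1, n + 1) / n
    hull = np.sqrt(np.clip(tt * (1 - tt), 1e-12, None))
    return float(np.max(np.abs(S) / (np.sqrt(n) * hull)))

# Monte-Carlo 95% critical value under the null of i.i.d. centered increments
rng = np.random.default_rng(6)
n, n_mc = 500, 2000
stats = []
for _ in range(n_mc):
    z = rng.normal(size=n)
    z -= z.mean()
    stats.append(hull_statistic(np.cumsum(z)))
crit = float(np.quantile(stats, 0.95))
print(crit)
```

The normalized statistic weights deviations near the endpoints as heavily as those in the middle, which is exactly where the plain maximum test is least sensitive.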
Area Tests.
Finally, we propose two more tests based on the total area under the curve (AUC) statistic. The idea behind using the entire area is that even if the maximum value does not reach the critical threshold, the presence of multiple points where the process approaches the limit may be indicative of a signal.
The test statistic in this case is computed as

A = (1/N) Σ_{k=1}^{N} |S̃_k| / (σ̂ √N).    (22)

We also consider a modification of this test where we sum the squared values of the cumulative process,

A₂ = (1/N) Σ_{k=1}^{N} S̃_k² / (σ̂² N).    (23)
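Both area statistics are one-liners on the normalized cumulative process; the normalization shown here (dividing by σ̂√N before averaging) is one reasonable convention, not necessarily the exact scaling used elsewhere:

```python
import numpy as np

def area_statistics(z, x):
    """Absolute-area and squared-area statistics of the normalized
    cumulative process of modified outcomes z ordered by covariate x."""
    n = len(z)
    S = np.cumsum(z[np.argsort(x)]) / (z.std() * np.sqrt(n))
    return np.abs(S).mean(), (S ** 2).mean()

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
z = rng.normal(size=n); z -= z.mean()
a_abs, a_sq = area_statistics(z, x)
print(a_abs, a_sq)
```

By the Cauchy-Schwarz inequality, the square of the absolute-area statistic never exceeds the squared-area statistic; the two respond differently to diffuse versus concentrated deviations of the process.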
Implementation.
Depending on prior knowledge about the type of signal, other tests can be introduced as well. Since a closed form for the statistic distribution is not always available, we use a general numerical framework based on Monte-Carlo (MC) simulations to compute the critical values of the statistics [22, 23]. We describe the process for this evaluation in Alg. 1.
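The Monte-Carlo procedure of Alg. 1 can be sketched as a generic permutation p-value: under the null-hypothesis the modified outcomes are exchangeable with respect to the covariate ordering, so permuting them samples the null distribution of any statistic. The statistic and data below are illustrative, and the add-one rule is a standard convention:

```python
import numpy as np

def mc_pvalue(stat_fn, z, x, n_mc=199, seed=0):
    """Monte-Carlo p-value: rank the observed statistic within its null
    distribution, obtained by permuting outcomes against the covariate."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(z, x)
    null = [stat_fn(rng.permutation(z), x) for _ in range(n_mc)]
    # the add-one rule keeps the estimate a valid p-value
    return (1 + sum(s >= observed for s in null)) / (n_mc + 1)

def max_stat(z, x):
    S = np.cumsum(z[np.argsort(x)])
    return np.abs(S).max() / (z.std() * np.sqrt(len(z)))

rng = np.random.default_rng(8)
n = 500
x = rng.normal(size=n)
z = 0.3 * np.sign(x) + rng.normal(size=n)   # hypothetical interaction signal
z -= z.mean()
p = mc_pvalue(max_stat, z, x)
print(p)
```

The same `mc_pvalue` wrapper works unchanged for any of the statistics of this section, which is what makes the framework convenient when no closed-form distribution is available.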
4.3 Combined Test
As mentioned in the previous section, one may prefer different tests depending on the type of signal one expects to observe in the data. However, it is not always possible, or desirable, to choose a particular test a priori. Instead, one might run all tests in parallel to see whether some signal is detected by at least one test. This procedure is a classical multiple test, and thus one needs a correction procedure for the calculation of the final p-value in order to ensure that the Type-I error is not inflated. Standard correction procedures, such as the Bonferroni correction [18], may be too severe, obfuscating the detection of significant covariates in practice. This is especially true in our case due to the high correlation between our proposed significance tests, which would lead to a significant loss in combined test power. However, as in [24], it is possible to adapt our numerical procedure to run multiple tests simultaneously without any loss in Type-I or Type-II error control.
Specifically, when calculating the combined significance test, for each permutation of the centered treatment response we compute the list of statistics defined by each test included in the combined test, along with their corresponding p-values. Subsequently, we compare the vector of observed statistics with samples generated from random permutations of the centered treatment response. Then, of all the p-values for the individual tests in the combined test, we use the minimum as the final aggregated statistic. This minimum defines an order on the set of p-value vectors, from which we compute the p-value of the combined test. We present the pseudocode for this combined test in Alg. 2.
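The procedure above can be sketched as follows. This is an illustrative reconstruction in the spirit of Alg. 2, not the paper's exact pseudocode: the key point is that the same permutations are shared by all statistics, so the minimum p-value is calibrated against its own permutation distribution instead of being Bonferroni-corrected.

```python
import numpy as np

def combined_test(tests, y_centered, n_sim=999, rng=None):
    """Min-p combined permutation test.

    `tests` maps names to statistic functions of the cumulative
    process. Sharing the permutations across statistics preserves the
    correlations between tests, so no Bonferroni correction is needed."""
    rng = np.random.default_rng(rng)
    n = len(y_centered)

    def process(y):
        return np.cumsum(y) / np.sqrt(n)

    # Null distributions of every statistic from shared permutations.
    null = {name: np.empty(n_sim) for name in tests}
    for b in range(n_sim):
        s = process(rng.permutation(y_centered))
        for name, fn in tests.items():
            null[name][b] = fn(s)

    # Per-test permutation p-values for the observed data; keep the min.
    s_obs = process(y_centered)
    p_min_obs = min(
        (1 + np.sum(null[name] >= fn(s_obs))) / (n_sim + 1)
        for name, fn in tests.items()
    )
    # Calibrate the min-p statistic against its own permutation null.
    p_min_null = np.array([
        min((1 + np.sum(null[name] >= null[name][b])) / (n_sim + 1)
            for name in tests)
        for b in range(n_sim)
    ])
    return float((1 + np.sum(p_min_null <= p_min_obs)) / (n_sim + 1))
```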
In our experiments, we combine MaxB and its normalized variant, MaxBE, and SAreaB and its normalized variant for our final combined significance test, thus accounting for many possible signal shapes. Selecting this subset, rather than applying all possible tests, reduces the computational burden of running the combined test. In practice, it is also possible to build a combined test tailored to a priori knowledge about a given dataset, as long as such a construction is fixed prior to any analysis of the dataset under investigation.
4.4 Multi-dose trials
Up to this point, we have considered the case of a clinical trial with only two treatment groups (e.g. placebo vs. treatment). However, in many cases, such as Phase II trials, multiple treatment doses are tested in parallel. In this section, we demonstrate how our approach may be generalized to the case of multi-dose trials. The generalization is quite straightforward: instead of being a binary variable, the treatment variable becomes a real-valued encoding of the different doses. This encoding may be a discrete scale capturing only the dose order, or a more precise representation of the dosing amount, e.g. the log value of the amount of administered treatment.
As in the binary case, the only important condition we need to ensure is that the dose encoding is centered. As long as this is true, we can directly apply the modified outcome transformation and use the framework of cumulative processes without any additional modifications. In essence, in the binary case we focus on the conditional correlation between the response and a centered binary treatment variable conditioned on the covariate, and in the multi-dose case we do the same thing with the dose variable simply centered.
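A minimal sketch of the multi-dose transform, under the assumption that the modified outcome multiplies the globally centered response by the centered dose encoding (the exact centering details of the paper's transform may differ):

```python
import numpy as np

def centered_dose(z):
    # Real-valued dose encoding (e.g. dose order or log-dose),
    # centered so that its mean is zero -- the only condition the
    # construction requires.
    z = np.asarray(z, dtype=float)
    return z - z.mean()

def modified_outcome(y, z):
    # Assumed form of the transform: the centered dose plays the role
    # of the centered binary treatment indicator, multiplying the
    # globally centered response.
    y = np.asarray(y, dtype=float)
    return centered_dose(z) * (y - y.mean())
```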
5 Synthetic Experiments
In order to evaluate the efficiency of the proposed approach in an objective manner with known ground truth, we generate synthetic clinical trial datasets. Although these models do not reflect the full complexity of such data in practice, we consider synthetic models for two main reasons: (i) to validate the ability of the evaluated statistical tests to accurately recover covariates which are known to be significantly correlated with the treatment, and (ii) to investigate particular cases of linear and nonlinear interactions between covariates and treatment effects, demonstrating the strengths and weaknesses of each test. We list all of the evaluated statistical tests in Table 1. Additionally, we demonstrate how to construct synthetic datasets for which each of the proposed tests will fail, giving insight into the failure cases of each test. We also show that the combined test proposed in Sec. 4.3 provides near best-case accuracy for most of the tested synthetic models.
5.1 Synthetic models
For our experiments, we construct a wide array of synthetic treatment response curves whose purpose is to simulate the complex behavior observed in realworld clinical trials. Below, we describe each of the different synthetic models.
We consider three types of synthetic models, in order of increasing complexity: (i) a linear model, (ii) a set of piecewise-constant models, and (iii) a fully nonlinear model.
For all these models, we generate covariates as i.i.d. uniform random variables over $[0, 1]$. Next, given the full set of patient covariates, we write the response model in the general form,
$y = b(x) + z\,\tau(x) + \varepsilon$ (24)
where $\varepsilon$ is a noise term with variance $\sigma^2$, and the baseline $b$ and the interaction $\tau$ are functions operating across the set of covariates.
Linear Model.
The linear model (L) corresponds to the case where both the baseline $b$ and the interaction $\tau$ are linear functions of the first covariate, in this case
(25)  
(26) 
A successful test for L would report the first covariate as statistically significant in terms of treatment interaction, and reject all other covariates.
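A hypothetical generator for model L might look as follows. The coefficients, the Uniform(0, 1) covariates, and the ±1/2 treatment coding are illustrative assumptions, since the paper's exact constants are not reproduced here.

```python
import numpy as np

def generate_linear_trial(n=500, p=10, w1=1.0, sigma=0.5, rng=None):
    """Sketch of synthetic model L. Covariates are i.i.d. Uniform(0, 1);
    the baseline and the interaction are assumed linear in the first
    covariate, with hypothetical coefficients."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(size=(n, p))
    z = rng.choice([-0.5, 0.5], size=n)       # centered binary treatment
    baseline = w1 * x[:, 0]                   # trend term b(x)
    interaction = x[:, 0]                     # interaction term tau(x)
    y = baseline + z * interaction + rng.normal(scale=sigma, size=n)
    return x, z, y
```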
Piecewise-constant Models.
These proposed piecewise-constant (PC) synthetic models contain either one or two discontinuous jumps, denoted as thresholding and interval effects, respectively. Such discontinuities are not captured by linear models, yet such effects are ubiquitous throughout biology and medicine.
We introduce four versions of PC models to capture these different configurations. All choices of $\tau$ are simple indicator functions with different supports, capturing both thresholding and interval discontinuities. We present two thresholding models containing a single jump discontinuity, as well as two interval models containing two jump discontinuities:
(PCTh)
(PCTh)  
(PCInt)  
(PCInt) 
Finally, for each of the four PC models, we use a linear baseline trend.
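The four PC interaction functions can be sketched as indicator functions. The cut points below are purely hypothetical placeholders, since the actual supports used in the paper are not recoverable from this text.

```python
import numpy as np

# Hypothetical supports for the four piecewise-constant interaction
# functions tau; each takes the first covariate and returns 0/1.
def tau_th_center(x1):
    # Thresholding effect with a centrally located jump.
    return (x1 > 0.5).astype(float)

def tau_th_edge(x1):
    # Thresholding effect with a jump near the boundary.
    return (x1 > 0.9).astype(float)

def tau_int_wide(x1):
    # Interval effect: two jumps delimiting a wide support.
    return ((x1 > 0.3) & (x1 < 0.7)).astype(float)

def tau_int_narrow(x1):
    # Interval effect with a narrow support.
    return ((x1 > 0.45) & (x1 < 0.55)).astype(float)
```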
Nonlinear model.
For the nonlinear model (NL), we utilize the fourth synthetic scenario reported in Sec. 4 of [9]. Specifically, the covariate-treatment signal is generated according to
(27)  
(28) 
We note that one of the covariates is already a decoy in this definition. Additionally, two of the covariates enter the model symmetrically.
Experimental Parameters.
For our experiments, we evaluated the proposed statistical tests on the given synthetic models over a range of easy and difficult settings. Overall, the difficulty of the task can be controlled by modifying the noise variance, the ratio between the baseline and the treatment effect, and the number of significant covariates versus the total number of decoy covariates. Specifically, we construct our experiments along these three axes in the following manner.

- Noise: the variance of the noise term $\sigma^2$ is evaluated over a range of values.
- W1: the ratio between the scale of the trend term (baseline coefficient) and the interaction term; here, we fix the interaction scale while varying the baseline coefficient.
- Decoy: given the fixed number of significant covariates (1 for L and PC, 7 for NL), additional decoy covariates drawn i.i.d. at random are added to the dataset, and we vary their number.
Performance metrics.
Each of the evaluated significance tests is compared based on its statistical power at a fixed Type-I error. Specifically, after generating synthetic data several times and performing all tests with a 0.05 p-value significance threshold, we compare the significance tests by their ability to detect the known significant covariates in each synthetic model.
Since the test power is a function of many parameters describing various aspects of the signal-to-noise ratio, in our experiments we also trace test sensitivity as a function of the experimental axes described in the previous section. We then compare the resulting curves to estimate which test provides the best performance over the tested range. Fig. 3 shows an example of such curves for different significance tests performed on the PCInt model under varying noise and baseline strength. As an aggregate measure of overall test performance, we compute the area under each curve. In this particular example, MaxBE is the best-performing significance test, as it detects the signal more often than the alternative approaches.
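The two performance measures described above reduce to a few lines of code: empirical power at a fixed level, and the area under the power-versus-difficulty curve. The trapezoidal aggregation is an assumption consistent with "area under each curve".

```python
import numpy as np

def empirical_power(p_values, alpha=0.05):
    # Fraction of simulated trials in which the known significant
    # covariate is detected at the fixed Type-I level.
    return float(np.mean(np.asarray(p_values) <= alpha))

def power_auc(grid, power_curve):
    # Area under the power-vs-difficulty curve (trapezoidal rule),
    # used as an aggregate performance measure for a test.
    grid = np.asarray(grid, dtype=float)
    p = np.asarray(power_curve, dtype=float)
    return float(0.5 * np.sum((p[1:] + p[:-1]) * np.diff(grid)))
```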
Fig. 3: Left: comparison over the noise standard deviation. Right: comparison over the value of the model baseline coefficient. The legend is consistent across both charts.

5.2 Results
Centering effect.
Fig. 4 shows the improvement made by removing the average values within each group of patients, comparing the Max tests performed either on a Brownian-motion-like process (no centering) or on a Brownian-bridge-like process (centering). For visual convenience, the scores are normalized by the maximum value per dataset (column), so that in each simulation scenario there is always a test with a performance score equal to 1. As expected from our theoretical analysis, we observe that the test with the centering correction (MaxB) generally provides better significant-covariate detection than the test without it (Max).
Main result summary.
A global summary of the main results is presented in Fig. 5 for the cumulative process tests MaxB and MaxBE together with their normalized variants, the area tests AreaB and SAreaB, the baseline linear correlation test with modified outcome MoLin, and the combined test Comb. In this chart, each row corresponds to the designated test, while each column corresponds to a particular synthetic model and the varied parameter (noise, W1, or decoy), as described in Sec. 5.1.
The first set of columns corresponds to model L. We find that most of the tests have similar statistical power for the detection of the significant covariate in this model. As expected, the linear test MoLin performs very well, as do tests which are sensitive to global measurements of the cumulative process, such as AreaB and SAreaB, since averaging over the whole curve provides more robustness. The MaxBE test does not perform well on linear data, as it is designed to capture transient correlations; because of this, for L it mostly detects unrelated fluctuations. The combined test Comb also performs very well in the linear setting.
The next four sets of columns correspond to the piecewise constant models (PC) and their associated experimental axes. For this class of models, the tests under study perform very differently, revealing the strengths of each individual test, which we now describe.
i) Threshold detection. As expected, MaxB performs best for the detection of threshold effects, at least when the threshold is centrally located. This task appears relatively easy, since most tests report good power, even MoLin. However, when the threshold is closer to the boundaries, and therefore more difficult to detect, the MaxB test fails to recognize the significant covariate, as do most of the other tests. In contrast, MaxBN performs very well in this setting, thanks to the normalization hull, which brings more power to the extreme sides of the covariate range.
ii) Interval detection. This task appears to be very difficult for most tests, which fail catastrophically on both interval models. The MaxB and MaxBN tests can only detect a single jump discontinuity, while AreaB, SAreaB, and MoLin are global measures which are unable to capture such a localized phenomenon. The only two tests which give strong positive results on this task are the MaxBE test and its normalized version, since large excursions of the cumulative process correspond to a transient effect of the covariates.
We also see that Comb demonstrates very robust behavior across all the tested models and experimental parameters. While it does not always provide the best possible performance, its strength lies in its ability to operate in widely varied settings, from global linear trends to narrow threshold or interval effects.
Finally, to see how these tests perform on less artificial models, the last set of columns corresponds to experiments performed on the NL model. Here, we report the detection performance for each of the significant variables in NL. We again observe that particular experimental configurations of this model can lead to individual test failures, yet Comb demonstrates very robust performance across all tested experimental parameters.
These synthetic experiments illustrate the ability of each individual test to detect known significant covariates under specific assumptions. When those assumptions are known a priori by the investigator (e.g. a known threshold effect), it may be wiser to select the individual test accordingly. However, when this knowledge is unavailable or too uncertain, our analysis suggests that Comb is a robust solution. We present our full set of results in terms of both normalized AUC and raw statistical power in Appendix B.
6 Exploration on Real Trials
We now turn our attention to the utility of our new individual tests and their combination in a more realistic setting. In this section, we describe the results of applying our method to several real-world clinical trial datasets. For these datasets, there is no known ground truth for significant covariates. However, as we control the Type-I error, we can deduce that detecting more significant covariates is a desirable property for a successful statistical test on these real datasets.
6.1 CALGB 40603 (NCT00861705)
The primary objective of this study was to investigate whether adding bevacizumab to paclitaxel (+/− carboplatin), followed by dose-dense doxorubicin and cyclophosphamide (ddAC), significantly raises the rate of pathologic complete response (pCR) in the breast of patients with HR-poor, HER2-negative, resectable breast cancer [25]. The patient-level data of this study can be retrieved from Project DataSphere (https://www.projectdatasphere.org/projectdatasphere/html/content/162) [26]. The dataset contains information on 443 patients randomly assigned to four different arms:

1. paclitaxel, followed by ddAC;
2. paclitaxel + bevacizumab, followed by ddAC + bevacizumab;
3. paclitaxel + carboplatin, followed by ddAC;
4. paclitaxel + carboplatin + bevacizumab, followed by ddAC + bevacizumab.
The dataset contains 45 covariates that can be used to explain the treatment effect when comparing the 4 arms.
Table 2 provides the number of significant variables (0.05 p-value significance threshold after Bonferroni correction) detected by the various tests when comparing Arm 1 versus 2, 1 versus 3, and so on. For example, there are two significant interactions when we compare Arms 2 and 4: the clinical N-stage (characterization of the regional lymph node involvement) and the clinical T-stage (characterization of the size and extent of the tumor). These covariates are detected only by the combined test, meaning that the application of any individual test would not have been sufficient for the detection of this interaction. A similar phenomenon is observed when we analyze the comparison of Arms 3 and 4. Two significant descriptors are detected: the clinical N-stage and the number of sentinel nodes examined. Again, only the combined test was capable of detecting both interactions simultaneously. Fig. 6 shows examples of the cumulative curves corresponding to the significant variables for the comparison of Arms 3 and 4.
In both cases, the impact is negative, meaning that smaller covariate values correspond to larger differences between the treatment and reference arms, therefore implying that patients with smaller values of these covariates are more likely to benefit from the administration of bevacizumab. Since each of the detected covariates characterizes the complexity of the tumor, the discovered dependencies suggest that patients at early stages of the disease are more likely to benefit from the addition of bevacizumab.
6.2 BCRP
The objective of this study was to compare the impact of nutritional and educational interventions on the psychological and physical adjustment after treatment for early-stage breast cancer [27]. The dataset was retrieved from the Quint package [28]. In total, 252 patients were split into three arms: Arm 1, a nutrition intervention; Arm 2, an educational intervention; and Arm 3, standard care. Similar to the CALGB results, as shown in Table 2, the combined test was the most sensitive (0.05 p-value threshold after Bonferroni correction), indicating that the nationality covariate is a significant factor in determining the physical functioning score (SF-36).
6.3 ACOSOG Z6051 (NCT00726622)
This is a randomized Phase III trial evaluating the safety and efficacy of laparoscopic resection for rectal cancer [29]. The 462 patients were split into two groups: Arm 1 (open rectal resection) and Arm 2 (laparoscopic rectal resection). Among all tests, as shown in Table 2, only the combined test and the extremal test on the normalized Brownian bridge were capable of detecting a covariate significantly correlated with the efficacy of the treatment regime (0.05 p-value significance threshold after Bonferroni correction): the distance to the nearest radial margin. The cumulative process corresponding to this variable is shown in Fig. 7; patients with lower values of this covariate are more likely to benefit from the laparoscopic rectal resection. The distance to the nearest radial margin is not a parameter that can be assessed prior to the intervention; it is a characteristic that can only be measured once the operation is complete. Therefore, it cannot be used to guide the selection of the intervention type. However, such post-treatment dependencies can provide useful insights into the circumstances under which a treatment of interest may prove most efficient.
7 Discussion
In this paper, we presented a series of new statistical tests tailored to the detection of complex, nonlinear treatment-covariate dependencies in clinical trial datasets. We described how to merge these tests into a single combined test capable of efficiently detecting different types of interactions. We illustrated the performance of the proposed approach on various synthetic models, where we compared it to existing approaches. We also described an application of the proposed procedure to three real-world clinical trial datasets, where we observed that our proposed tests do indeed allow one to detect signals which can go undiscovered using existing approaches.
The proposed technique is a univariate procedure tailored to the detection of single biomarkers, and it is therefore relatively easy to interpret compared with multivariate approaches. However, this characteristic is also a significant limitation of the procedure since, as with any univariate approach, one needs to apply a multiple testing correction (e.g. Bonferroni) to account for the number of covariates analyzed. In some cases, this correction might be too drastic. This issue can be partially resolved by imposing a pre-order on the list of covariates, defined by an expert or automatically extracted from the literature or an external dataset. Another consequence of the univariate nature of the tests is that they can miss complex dependencies involving multiple covariates if none of the covariates separately presents a statistically significant signal.
In our future research, we plan to generalize our proposed cumulative tests to a multivariate setting, and to explore the possibility of building a statistical test capable of detecting nonlinear multivariate dependencies with proper control over the Type-I error. Another interesting direction is the incorporation of external data sources as prior knowledge. By continuing to analyze publicly available clinical trial datasets, we hope to find new and interesting covariate dependencies, as well as to construct a large benchmark of datasets for the comparison of covariate-detection methods.
In this article, we focused on the detection of covariates interacting with the treatment efficacy. Such covariates provide a natural basis for subgroup selection. The extension of the proposed statistical tests for subgroup identification is another important direction that we hope to address in our future work.
References
 [1] D. W. Thomas, J. Burns, J. Audette, A. Carroll, C. Dow-Hygelund, and M. Hay, “Clinical development success rates 2006–2015,” BIO Industry Analysis, Biomedtracker, 2016.
 [2] R. K. Harrison, “Phase II and phase III failures: 2013–2015,” Nature Reviews Drug Discovery, vol. 15, pp. 817–818, 2016.
 [3] U.S. Food and Drug Administration, “FDA grants accelerated approval to pembrolizumab for first tissue/site agnostic indication,” https://www.fda.gov/Drugs/InformationOnDrugs/ApprovedDrugs/ucm560040.htm, U.S. FDA Press Release, May 2017.
 [4] S. J. Pocock, S. F. Assmann, L. E. Enos, and L. E. Kasten, “Subgroup analysis, covariate adjustment and baseline comparisons in clinical reporting: current practice and problems,” Statistics in Medicine, vol. 21, no. 19, pp. 2917–2930, 2002.
 [5] M. Halperin, J. H. Ware, D. P. Byar, N. Mantel, C. C. Brown, J. Koziol, M. Gail, and S. B. Green, “Testing for interaction in an contingency table,” Biometrika, vol. 64, no. 2, pp. 271–275, 1977.
 [6] R. Simon, “Patient subsets and variation in therapeutic efficacy,” vol. 14, no. 4, pp. 473–482, 1982.
 [7] R. Wang, S. W. Lagakos, J. H. Ware, D. J. Hunter, and J. M. Drazen, “Statistics in medicine—reporting of subgroup analyses in clinical trials,” New England Journal of Medicine, vol. 357, no. 21, pp. 2189–2194, 2007.
 [8] L. Tian, A. A. Alizadeh, A. J. Gentles, and R. Tibshirani, “A simple method for estimating interactions between a treatment and a large number of covariates,” Journal of the American Statistical Association, vol. 109, no. 508, pp. 1517–1532, 2014.
 [9] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok, “Estimating individualized treatment rules using outcome weighted learning,” Journal of the American Statistical Association, vol. 107, no. 449, pp. 1106–1118, 2012.
 [10] J. A. Koziol and S.C. H. Wu, “Changepoint statistics for assessing a treatmentcovariate interaction,” Biometrics, vol. 52, no. 3, pp. 1147–1152, 1996.
 [11] I. Lipkovich and A. D. R. B. D. Sr., “Tutorial in biostatistics: datadriven subgroup identification and analysis in clinical trials,” Statistics in Medicine, vol. 36, no. 1, pp. 136–196, 2017.
 [12] L. Tian and R. Tibshirani, “Adaptive index models for markerbased risk stratification,” Biostatistics, vol. 12, no. 1, pp. 68–86, 01 2011. [Online]. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3006126/
 [13] A. Zeileis, T. Hothorn, and K. Hornik, “Modelbased recursive partitioning,” Journal of Computational and Graphical Statistics, vol. 17, no. 2, pp. 492–514, 2008.
 [14] I. Lipkovich, A. Dmitrienko, J. Denne, and G. Enas, “Subgroup identification based on differential effect search—a recursive partitioning method for establishing response to treatment in patient subpopulations,” Statistics in medicine, vol. 30, no. 21, pp. 2601–2621, 2011.
 [15] J. C. Foster, J. M. Taylor, and S. J. Ruberg, “Subgroup identification from randomized clinical trial data,” Statistics in Medicine, vol. 30, no. 24, pp. 2867–2880, 2011.
 [16] E. Dusseldorp, C. Conversano, and B. J. Van Os, “Combining an additive and treebased regression model simultaneously: STIMA,” Journal of Computational and Graphical Statistics, vol. 19, no. 3, pp. 514–530, 2010.
 [17] E. Dusseldorp and I. Van Mechelen, “Qualitative interaction trees: a tool to identify qualitative treatment–subgroup interactions,” Statistics in Medicine, vol. 33, no. 2, pp. 219–237, 2014. [Online]. Available: http://dx.doi.org/10.1002/sim.5933
 [18] O. J. Dunn, “Multiple comparisons among means,” Journal of the American Statistical Association, vol. 56, no. 293, pp. 52–64, 1961.
 [19] M. D. Donsker, “An invariance principle for certain probability limit theorems,” Memoirs of the American Mathematical Society, no. 6, 1951.
 [20] J.-M. Azaïs and L.-V. Lozada-Chang, “A toolbox on the distribution of the maximum of Gaussian processes,” 2013.
 [21] A. N. Borodin and P. Salminen, Handbook of Brownian Motion—Facts and Formulae. Springer Basel, 2002.
 [22] J. Shao and D. Tu, The jackknife and bootstrap. Springer Science & Business Media, 2012.
 [23] J. J. Higgins, Introduction to modern nonparametric statistics. Cengage Learning, 2003.
 [24] J. Gates, “Exact Monte Carlo tests using several statistics,” Journal of Statistical Computation and Simulation, vol. 38, no. 14, pp. 211–218, 1991.
 [25] V. G. Abramson and I. A. Mayer, “Molecular heterogeneity of triplenegative breast cancer,” Current Breast Cancer Reports, vol. 6, no. 3, pp. 154–158, 2014.
 [26] A. K. Green, K. E. ReederHayes, R. W. Corty, E. Basch, M. I. Milowsky, S. B. Dusetzina, A. V. Bennett, and W. A. Wood, “The project data sphere initiative: accelerating cancer research by sharing data,” The Oncologist, vol. 20, no. 5, pp. 464–e20, 2015.
 [27] M. F. Scheier, V. S. Helgeson, R. Schulz, S. Colvin, S. L. Berga, J. Knapp, and K. Gerszten, “Moderators of interventions designed to enhance physical and psychological functioning among younger women with earlystage breast cancer,” Journal of Clinical Oncology, vol. 25, no. 36, pp. 5710–5714, 2007.
 [28] E. Dusseldorp, L. Doove, and I. Van Mechelen, “Quint: An R package for the identification of subgroups of clients who differ in which treatment alternative is best for them,” Behavior Research Methods, vol. 48, no. 2, pp. 650–663, 2016.
 [29] J. Fleshman, M. Branda, D. J. Sargent, A. M. Boller, V. George, M. Abbas, W. R. Peters, D. Maun, G. Chang, and A. Herline, “Effect of laparoscopicassisted resection vs open resection of stage II or III rectal cancer on pathologic outcomes: the ACOSOG Z6051 randomized clinical trial,” Journal of the American Medical Association, vol. 314, no. 13, pp. 1346–1355, 2015.
Appendix A Outcome Centering Proofs
A.1 Proof of Lemma 1
As stated in the lemma, we assume a “general” modified outcome written in the stated form, where the centering term is an arbitrary function of the treatment assignment. Looking at the variance of this general modified outcome, we have
(29) 
Using the law of total variance, we next condition on the treatment,
(30) 
The conditional variance can be written as
(31) 
which follows from the application of the translational invariance property as well as the scaling identity. Subsequently, for the conditional expectation, we find
(32) 
which follows from both the linearity of expectation as well as the independence of from . Taken together, we have
(33) 
From the above, we see that only one variance term depends on the choice of the centering function. Since a variance must be non-negative, the minimum is attained when this term equals zero, and the per-treatment conditional mean is one such choice for which it vanishes. ∎
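A quick numerical check of this argument, under an illustrative generating model: among centerings that depend only on the treatment assignment, the per-arm mean should yield the smallest modified-outcome variance. All constants below are arbitrary.

```python
import numpy as np

# Illustrative check of Lemma 1: for modified outcomes of the form
# (y - c(z)) * z_centered, centering by the per-arm mean minimizes
# the variance.
rng = np.random.default_rng(0)
n = 20_000
z = rng.choice([0, 1], size=n)                 # treatment indicator
y = 2.0 + 1.5 * z + rng.standard_normal(n)     # arms differ in mean
zc = z - z.mean()                              # centered indicator

def mo_variance(center):
    # Variance of the modified outcome for a given centering term.
    return np.var((y - center) * zc)

per_arm = np.where(z == 1, y[z == 1].mean(), y[z == 0].mean())
v_centered = mo_variance(per_arm)   # per-treatment centering
v_none = mo_variance(0.0)           # no centering at all
v_shifted = mo_variance(5.0)        # an arbitrary constant shift
```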
A.2 Proof of Lemma 2
We first calculate the variance of the modified outcome as presented in [8]. Recalling its definition, we see
(34) 
which follows first from the law of total variance, and also from our earlier calculations in Appendix A.1.
We now make the calculation of this variance explicit through the empirical per-trial first and second moments. For the first term,
(35)
Subsequently,
(36) 
Finally, we have
(37) 
In contrast, we have the variance of the modified outcome with per-treatment centering, as shown in Lemma 1,
(38) 
Thus, for the ratio of the two variances,
(39) 
where the weighting term is the logistic sigmoid function. Since the sigmoid is bounded over its domain, the variance ratio is bounded as well. Additionally, we see directly that equality holds only when the measured responses for both treatments have the same expectation. ∎

A.3 Proof of Lemma 3
The conditional covariance of the centered outcome can be shown to be
(40) 
Assuming a properly randomized trial, such that treatment assignment is independent of the covariates, and centered treatment variables, the covariance becomes
(41) 
However, since the measured outcome is a fixed property of the dataset which does not change with the covariate selection, the second term is constant with respect to the covariate under test, and we absorb it into a constant. ∎
A.4 Proof of Lemma 4
The unconditioned covariance between the centered treatment response and the treatment indicators is
(42)
which follows from the distributive property of the covariance. Since we assume centered treatment indicator variables, this difference in covariances can be simplified to the difference,
(43) 
However, we can see that these two terms are equivalent. Applying the law of total expectation, and moving the constant outside of the conditional expectation, we find that the two terms coincide, which completes the proof. ∎
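In the same spirit, the cancellation in Lemma 4 can be checked numerically: with a centered treatment indicator drawn independently of the covariates, the baseline component contributes nothing to the covariance, so only the interaction term survives. The generating model below is illustrative.

```python
import numpy as np

# Illustrative check: the baseline part of the response drops out of
# cov(y, z) when z is centered and independent of the covariates.
rng = np.random.default_rng(1)
n = 50_000
x = rng.uniform(size=n)                       # a single covariate
z = rng.choice([-0.5, 0.5], size=n)           # centered treatment coding
y = x + z * (x > 0.5) + 0.1 * rng.standard_normal(n)

cov_yz = np.mean(y * z) - y.mean() * z.mean()   # empirical cov(y, z)
interaction_part = np.mean(z ** 2 * (x > 0.5))  # surviving interaction term
```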