1 Introduction
Randomised controlled trials (RCTs) are considered the most reliable means of comparing the effects of multiple treatments or health interventions, sometimes including a control group (placebo or no treatment). It is increasingly common to exploit the opportunity to conduct retrospective treatment interaction analysis (TIA) on predictor variables in RCT data (such analyses also go under the names of treatment effect heterogeneity (TEH) or treatment effect modifiers (TEM))
[1]. In such cases, the expected effect of a treatment depends on the value of one or more predictor variables, for example if the effect of a treatment varies with the age of the subject. Growth in the volume of measurements obtainable from each patient (e.g. genomic profiling) leads to a myriad of candidate factors for interaction analysis. Given the large costs associated with clinical trials, there is a natural desire to learn as much as possible from the data. However, statistical power is not free and treatment interaction analyses are certainly not immune to false positives. These are the problems that underlie many of the challenges of performing a valid interaction analysis. Our motivating example arises from a severe malaria trial, SEAQUAMAT [2], where investigating treatment effect heterogeneity can provide valuable insight into the mechanism of the disease. Moreover, if the utility of a treatment comprises factors additional to the treatment effect (e.g. costs, side effects), then information on treatment effect heterogeneity can help us make better decisions on the management of medical resources.

Clearly the power to detect an effect modifier is less than that to detect an overall treatment effect of equivalent effect size [3, 4, 5, 6]. In addition, for most interaction analyses the original trial will have been designed to detect an overall main treatment effect, without tests for interactions taken into consideration [7, 8, 9]. This can contribute to a lack of power for discovery unless the interaction is larger in magnitude than that assumed for the main effect in the study design [1], since subgroups that benefit from a change in treatment are by definition smaller than the full population.
If a large number of interaction tests are conducted, the study can also suffer from spurious findings arising from multiple comparisons [10, 11, 6]. On the other hand, false negatives arise from inadequate power to withstand correction for multiple testing [11, 6]. If the measurement covariates are dependent, further power can be lost when testing each variable interaction one at a time rather than in a joint analysis [5, 11].
For the reasons above, several recommendations have been made on such analyses to increase the reliability of their findings and to invest statistical power more wisely [5, 11, 12, 6]. There is a strong emphasis on a prespecified analysis plan, wherein the hypotheses are defined a priori. Of these hypotheses, the primary analyses must be distinguished from the secondary analyses, and several authors suggest that the primary hypotheses be restricted to a small number (one to two analyses) supported by external theoretical and empirical evidence.
These recommendations send a common message that an interaction analysis plan should be formulated with care. Careful expenditure of power also resonates with the rationale for using a single global test for interaction discovery, which provides a single p-value against the null hypothesis of "no treatment interactions". It can be envisaged that such a test would require tools that can effectively and efficiently learn about the structure of, and the patterns in, the entire dataset, which is a challenging task when the dimension of the variable space is high. Machine learning (ML) and AI methods appear to be promising tools through their ability to learn from the data
[13, 14, 15, 16], and there has been increasing interest in employing machine learning methods in the medical sciences for disease classification, prognosis and prediction [17, 18, 19, 20, 21]. However, the issue remains that a large number of potential stratifying variables leads to an even larger search space for treatment-variable interactions. Naïve application of ML approaches fed with many irrelevant variables can lead to high-dimensional models that substantially diminish the power of a hypothesis test. Careful selection of potential stratifiers before discovery can therefore improve the power to detect true associations for a given effect size.
In this article, we present a framework that facilitates the incorporation of ML algorithms for treatment interaction detection in RCT data. We show that it is possible to use the clinical response (outcome variable) in an initial screening stage to reduce the number of stratifying variables without inflating the type-I (false positive) error rate at the interaction stage. In this way the interaction analysis can be applied to a targeted subspace, effectively reducing the multiple comparisons in the search for interactions. The essence is to prioritise main effects for interaction testing over covariates showing no evidence of association with variation in the clinical response. We provide theoretical results to support our assertion, making use of the randomisation of treatments in the design. Hence ML algorithms can informatively guide interaction detection to achieve a more careful expenditure of power, while ensuring that the false positive rate is controlled in a rigorous manner.
It has been proposed to use an internally developed risk model to discover TEH in RCT data [22], and this has been applied to retrospective analyses of TEH in several large RCT datasets [23]. This approach permits the inclusion of multiple baseline covariates to construct the internal risk model (IRM). We provide theoretical support for this approach, as it can be considered a special case of our method. In addition, conventional practice of the IRM approach for TEH discovery excludes the treatment variable to avoid potential biases [23]. This is shown to be unnecessary as a consequence of the same mathematical results that support the validity of our framework. Furthermore, we illustrate that our framework can provide better resolution of TEH, and how this can make it more sensitive to the TEH signal in certain scenarios.
1.1 South East Asian Quinine Artesunate Malaria Trial (SEAQUAMAT)
We demonstrate our TEH detection framework in the context of an analysis of a large anti-malarial drug trial, SEAQUAMAT [2]. The R notebooks of the interaction analyses are in the Supporting Information (SI).
SEAQUAMAT is a large randomised clinical trial that compared intravenous quinine to intravenous artesunate for the treatment of severe Plasmodium falciparum malaria. The trial was conducted at multiple medical centres in Bangladesh, India, Indonesia, and Myanmar. The findings of this trial provided strong evidence that intravenous artesunate had superior efficacy to intravenous quinine, with a one-third reduction in mortality.
Severe malaria is a medical emergency with a potentially high case fatality rate (10 – 40%); therefore it is of paramount interest to understand how the treatment effect of artesunate may vary in the patient population, even though its superior overall efficacy is well established [24].
2 Screening for evidence of main effects before testing for treatment effect heterogeneity
We describe a two-stage approach to interaction discovery whereby in the first stage variables are ranked by the strength of their main-effect association with the outcome response. This is followed by an interaction discovery stage using only the leading covariates from Stage 1. We show that outcome-driven screening at the first stage does not inflate the p-values obtained in the second-stage testing for treatment interaction and, moreover, that the procedure has a consistency property.
2.1 Analysis scheme
For ease of exposition we restrict attention to a balanced two-arm clinical trial where individuals are randomised with equal probability onto a treatment arm versus a standard-of-care arm.

2.1.1 First stage: prognostic variable screening
We consider an additive generalised linear regression model (GLM) for the clinical outcome,

$$\mathbb{E}[\mathbf{y}] = g\left(\alpha + \mathbf{X}\boldsymbol{\beta} + \tau\mathbf{t} + \mathbf{Z}\boldsymbol{\gamma}\right) \qquad (1)$$

where $\mathbf{y}$ denotes the clinical response for the $n$ individuals recruited onto the trial, $\alpha$ represents an intercept term, and the $n \times P$ matrix $\mathbf{X}$ contains the $P$ interaction-candidate covariates, so that $\boldsymbol{\beta}$ is a $P$-dimensional regression coefficient vector. The term $\mathbf{t}$ denotes an indicator binary vector of the randomised treatment allocations with elements $t_i = 1$ if the $i$'th individual received the treatment and $t_i = 0$ otherwise, such that $\tau$ captures the causal average treatment effect. The matrix $\mathbf{Z}$, with coefficient vector $\boldsymbol{\gamma}$, contains covariates decided a priori to be irrelevant to treatment interactions. These could be variables that characterise the trial procedure (e.g. site of treatment) rather than the patients themselves. The function $g$ is the link function, applied elementwise to the linear predictors.

To screen for main effects, the GLM (1) can be fit to the data to obtain a rank-ordered set of variables, $\{x_{(1)}, \dots, x_{(P)}\}$, ordered from the most informative covariate $x_{(1)}$ to the least informative $x_{(P)}$. Such an ordering can be achieved in various ways, which we elaborate in section 2.2.1.
Following this, only the $K$ leading variables, forming an $n \times K$ matrix $\mathbf{X}_K$, are selected for inclusion in the interaction discovery stage. Note that the dimension $K$ should be decided a priori based on considerations of power, and not using the results of fitting the GLM. Setting $K$ adaptively based on the evidence in $\mathbf{y}$ from the first stage induces p-values that are not uniformly distributed under $H_0$, because in that situation the degrees of freedom (DF) of the test would themselves be subject to sampling variation.
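The first stage can be sketched in a few lines. The following is a minimal Python illustration (the paper's own analyses are implemented in R notebooks), assuming a continuous outcome and ranking candidates by absolute univariate correlation with the response; the function name `screen_main_effects` and the simulated data are hypothetical, and $K$ is fixed before looking at the fit.

```python
import numpy as np

def screen_main_effects(X, y, K):
    """Stage-1 sketch: rank the P candidate columns of X by their
    absolute correlation with the outcome y, then keep the K leading
    columns.  K must be fixed a priori, not chosen from this fit."""
    yc = y - y.mean()
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        xc = X[:, j] - X[:, j].mean()
        scores[j] = abs(xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
    order = np.argsort(-scores)          # most informative covariate first
    return order, X[:, order[:K]]        # ranking, and the n-by-K matrix X_K

rng = np.random.default_rng(1)
n, P, K = 200, 8, 3
X = rng.normal(size=(n, P))
y = 2.0 * X[:, 5] + rng.normal(scale=0.5, size=n)   # only column 5 is prognostic
order, X_K = screen_main_effects(X, y, K)
```

In this toy example the truly prognostic column is ranked first and the screened matrix has the predeclared width $K$.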
2.1.2 Second stage: testing for treatment interactions
The selected covariates can then be tested for treatment interactions via the following model,

$$\mathbb{E}[\mathbf{y}] = g\left(\alpha + \tau\mathbf{t} + \mathbf{D_t}\mathbf{X}_K\boldsymbol{\beta}_1 + (\mathbf{I}_n - \mathbf{D_t})\mathbf{X}_K\boldsymbol{\beta}_0 + \mathbf{Z}\boldsymbol{\gamma}\right) \qquad (2)$$

where $\mathbf{D_t}$ is an $n \times n$ indicator matrix with 1's on the diagonal elements where the corresponding $\mathbf{t}$ vector has a 1, and 0's elsewhere, so that we have a separate regression coefficient vector $\boldsymbol{\beta}_1$ for the treatment group to that of the standard-of-care, $\boldsymbol{\beta}_0$.
A test for interaction can then be derived by considering the standardised magnitude of the difference vector $\boldsymbol{\delta} = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_0$ using a test (e.g. the likelihood ratio test, LRT) with a corresponding p-value that is uniform under the null hypothesis, regardless of the fact that the variable selection in Stage 1 uses information in the outcome $\mathbf{y}$. Since such a test provides a single p-value indicating the evidence for interaction with treatment, our framework facilitates a global test for detecting TEH.
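The second-stage test can be sketched likewise. The Python sketch below assumes a Gaussian outcome, for which the LRT statistic reduces to $n \log(\mathrm{RSS}_{\mathrm{additive}}/\mathrm{RSS}_{\mathrm{interaction}})$ compared against a chi-squared distribution with $K$ degrees of freedom; the function name and simulated data are hypothetical.

```python
import numpy as np

def lrt_interaction(X_K, t, y):
    """Stage-2 sketch (Gaussian outcome): compare the additive model
    y ~ 1 + t + X_K with the interaction model y ~ 1 + t + X_K + t:X_K,
    which gives each arm its own coefficient vector.  Returns the LRT
    statistic, asymptotically chi-squared with K df under the null."""
    n, K = X_K.shape
    D0 = np.column_stack([np.ones(n), t, X_K])        # additive model
    D1 = np.column_stack([D0, t[:, None] * X_K])      # + treatment interactions
    rss = lambda D: ((y - D @ np.linalg.lstsq(D, y, rcond=None)[0]) ** 2).sum()
    return n * np.log(rss(D0) / rss(D1)), K

rng = np.random.default_rng(3)
n, K = 400, 2
X_K = rng.normal(size=(n, K))
t = rng.integers(0, 2, size=n).astype(float)
# true interaction on the first selected covariate only
y = X_K @ np.array([1.0, -0.5]) + 0.3 * t + 1.0 * t * X_K[:, 0] + rng.normal(size=n)
stat, df = lrt_interaction(X_K, t, y)
```

The single statistic, with its chi-squared p-value on $K$ degrees of freedom, provides the global test for TEH.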
2.2 On the flexibility of prognostic variable screening
In this section we discuss various possible options for screening prognostic variables in Stage 1 (section 2.1.1). We begin with single-stage screening procedures (section 2.2.1), followed by multi-stage screening (section 2.2.2), which is more general. We recommend using the multi-stage screening to make the most of supervised and unsupervised ML algorithms.
2.2.1 Single-stage screening for prognostic variables
We first consider the situation where variable selection is only applied once. Possible options are:

1. Fitting the full additive model (1), for example in R using the glm function, and then ordering by the associated p-values assigned to the coefficients.

2. Running a LASSO model [13] and then ranking by the order in which the variables enter the LASSO path.

3. Fitting independent GLMs of the form (1), but including only a single variable (column) of $\mathbf{X}$ at a time, then ranking covariates by the univariate p-values assigned to the columns of $\mathbf{X}$ from each model.

4. Taking principal components (PCs) of $\mathbf{X}$ to form a new basis design matrix with orthogonal columns made up of the PCs, and then using either option 1 or option 2 above to rank the leading PC regression coefficients.
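The options above can be prototyped cheaply. The Python sketch below is a hypothetical stand-in for option 2: it ranks variables by the order in which they enter a greedy forward fit on the running residual, which for near-orthogonal designs mimics the order in which variables enter the LASSO path (a real analysis would use a LASSO solver such as glmnet in R).

```python
import numpy as np

def entry_order(X, y):
    """Greedy stand-in for the LASSO entry order (option 2): repeatedly
    pick the standardised column most correlated with the current
    residual, then refit on the active set and update the residual."""
    Xc = (X - X.mean(0)) / X.std(0)
    r = y - y.mean()
    order, active = [], []
    for _ in range(X.shape[1]):
        corr = np.abs(Xc.T @ r)
        corr[active] = -np.inf            # already-entered variables are skipped
        j = int(np.argmax(corr))
        order.append(j); active.append(j)
        D = np.column_stack([np.ones(len(y)), Xc[:, active]])
        r = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    return order

rng = np.random.default_rng(5)
X = rng.normal(size=(250, 8))
y = 3.0 * X[:, 2] + 1.5 * X[:, 5] + rng.normal(scale=0.5, size=250)
order = entry_order(X, y)
```

Here the two truly prognostic columns enter first, giving the ranking that Stage 1 would pass on.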
Option 1 is only suitable for datasets with a small or moderate number of variables, as [25] shows that in a logistic regression the $\chi^2$ distribution is a poor approximation to the LRT statistic when the number of variables is large in a finite dataset. Option 2 is suitable for high-dimensional data, especially when many of the covariates are not explanatory. Option 3 is suitable for fast processing of high-dimensional data, but because the selection is based on univariate regressions, it can lose power in the absence of other independent variables that help to reduce the noise. Option 3 has the flavour of (univariate) filter methods for feature selection [26, 27]. None of options 1–3 resolves the problem of multicollinearity.

Application of principal component analysis (PCA) in option 4 transforms the covariates into a set of orthogonal bases and thereby resolves the problem of multicollinearity. Hence this option is suitable for datasets with many highly correlated variables, which can occur when multiple covariates measure the same quantity. Consequently, it enables the application of supervised methods that suffer in the presence of multicollinearity (e.g. LASSO). Although tree-based methods such as random forests can cope with multicollinearity and their performance is therefore not affected, applying the supervised analysis on the PC scores can avoid the misinterpretation of variable importance that results from multicollinearity.
It is worth noting that PCA is an unsupervised approach to explain the variation in the covariates. This means that while the most variable PC is the most informative of the shape of the covariate data cloud, it is not necessarily associated with the outcome of interest. The advantage of our supervised approach is that it permits the variable selection to be guided by the outcome. Thus, even if the most variable PCs are approximately noise variables, our approach is able to give priority to less variable PCs that demonstrate greater explanatory power.
There is a very interesting connection between our approach and the IRM approach [22]. The IRM approach first constructs a regression model to explain the outcome by a set of baseline variables. The treatment is excluded (blinded) and the model is fitted to the data from both treatment arms (instead of the control arm only),

$$\mathbb{E}[\mathbf{y}] = g\left(\alpha + \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma}\right) \qquad (3)$$

where the mathematical terms are defined as in equation (1). For each patient, one calculates a risk score, defined by the linear predictor of equation (3). Subsequently, another regression model is fitted containing the main effect of the treatment, the main effect of the risk scores, and the effect of the interaction between the two. The presence of TEH is suggested by a significant treatment-by-risk interaction effect.
Like the PC scores, a risk score is also a linear projection of the baseline variables. With $K = 1$, option 4 above amounts to using a supervised approach to select the most explanatory linear projection provided by PCA. On the other hand, the IRM approach can be interpreted as selecting the most explanatory linear projection from all possible linear projections.
There are two important differences between our framework and the IRM approach. Firstly, the main effect of treatment is included in our framework. The IRM approach excludes the treatment assignment indicator when developing the IRM due to concerns about the potential inflation of false positive risk-treatment interactions. Our mathematical argument for the validity of our framework also shows that this exclusion is unnecessary.
Secondly, our framework allows the strength of the interaction effect to vary across the selected variables, whereas in the risk score approach all variables composing the risk score share the same interaction effect. Our method has the advantage of providing greater resolution of TEH. Consider the scenario where a (selected) variable has a relatively small main effect but makes a major contribution to the effect of the interaction with treatment, while the variables with dominating main effects do not interact with the treatment. Our method can detect this signal directly because it models the individual interaction effect between the treatment and each selected variable. The IRM will suffer, however, as the dominating variables in the risk score will largely determine the shared interaction effect. Therefore, if those dominating variables do not interact with treatment, the overall signal of the treatment-risk interaction can be diminished.
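For comparison, the IRM procedure itself is short to sketch. The following hypothetical Python sketch assumes a Gaussian outcome: the risk model is fitted on the pooled arms with the treatment excluded, each patient's risk score is the fitted linear predictor, and a 1-df LRT tests the treatment-by-risk interaction.

```python
import numpy as np

def irm_teh_test(X, t, y):
    """IRM sketch: (i) fit an additive risk model on both arms pooled,
    excluding treatment; (ii) take each patient's fitted linear predictor
    as the risk score; (iii) LRT (1 df) for the treatment-risk interaction."""
    n = len(y)
    D_risk = np.column_stack([np.ones(n), X])
    risk = D_risk @ np.linalg.lstsq(D_risk, y, rcond=None)[0]   # risk scores
    D0 = np.column_stack([np.ones(n), t, risk])                 # main effects
    D1 = np.column_stack([D0, t * risk])                        # + interaction
    rss = lambda D: ((y - D @ np.linalg.lstsq(D, y, rcond=None)[0]) ** 2).sum()
    return n * np.log(rss(D0) / rss(D1))   # ~ chi-squared(1) under the null

rng = np.random.default_rng(11)
n = 500
X = rng.normal(size=(n, 4))
t = rng.integers(0, 2, size=n).astype(float)
risk_true = X @ np.array([1.0, 0.8, 0.0, 0.0])
y = risk_true + 0.5 * t * risk_true + rng.normal(size=n)   # shared interaction
stat = irm_teh_test(X, t, y)
```

Because all variables share a single interaction coefficient through the risk score, the per-variable resolution described above is lost; that is the trade-off the text highlights.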
2.2.2 Multi-stage ML screening for prognostic variables
If the ML algorithm suggests that there are $Q$ variables predictive of the outcome, with $Q > K$, then modelling with only the $K$ leading covariates may not be sufficient, potentially discarding what powerful ML methods can offer. This can be resolved by applying dimension reduction methods such as PCA to the subset of the covariates found via ML to be predictive of the response. The leading PCs will encapsulate the information from the $Q$ covariates. Stage 1 can be expanded to have multiple substages of selection:

Substage I: covariate selection with an ML algorithm.

(a) An ML algorithm is fitted to explain the outcome by the candidate covariates.

(b) The output of step (a) is used to rank the explanatory power of the variables and determines $\mathbf{X}_Q$, an $n \times Q$ matrix that contains all the variables predictive of the outcome.

Substage II: selection of PCs.

(a) PCA is applied to $\mathbf{X}_Q$.

(b) To produce $\mathbf{X}_K$, the $K$ leading PCs are selected based on their score variances or on the strength of their association with $\mathbf{y}$.

The positive integer $Q$ can vary from sample to sample as long as $K$ is predefined and $K \le Q$. This is because the subsetting by ML in Substage I(b) of Stage 1 is independent of the interaction (to be discussed in section 3 and Web Appendix A). Therefore the DF is still kept at $K$ for the LRT in Stage 2. Here the scheme is described with PCA, but it can be substituted with other (unsupervised) ordination methods.
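A hypothetical numerical sketch of the two substages follows, assuming a continuous outcome and a generic importance score in place of GBM relative influence; all names and thresholds are illustrative only.

```python
import numpy as np

def multistage_screen(X, y, K, thresh=0.1):
    """Multi-stage Stage-1 sketch.  Substage I: keep the Q covariates
    whose importance exceeds a threshold (squared correlation with y
    stands in for GBM relative influence here).  Substage II: PCA via
    SVD on the selected columns, keeping the K leading PCs ranked by
    score variance.  Q may vary between samples; K is fixed a priori."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    imp = (Xs.T @ ys / len(y)) ** 2          # importance proxy per covariate
    keep = np.flatnonzero(imp > thresh)      # Substage I: columns of X_Q
    U, s, _ = np.linalg.svd(Xs[:, keep], full_matrices=False)
    return (U * s)[:, :K]                    # Substage II: K leading PC scores

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 12))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)   # two prognostic columns
Z = multistage_screen(X, y, K=2)
```

The Stage-2 interaction test is then run on the returned score matrix with its $K$ columns, so the degrees of freedom stay fixed at $K$ even though $Q$ was data-dependent.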
3 Theory
To justify the proposed scheme, we must show that the variable selection at Stage 1 does not influence the findings of interaction detection in Stage 2.
Theorem 1
Let $\hat{\boldsymbol{\beta}}$ denote the standardised MLE of the main covariate effects in an additive GLM, while $\hat{\boldsymbol{\delta}} = \hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_0$ represents the standardised MLE of the difference in treatment-specific covariate effects in an interaction model with randomised treatments. Under the null hypothesis of "no difference in covariate effects across treatment arms", $\hat{\boldsymbol{\beta}}$ is independent of $\hat{\boldsymbol{\delta}}$. Hence, dimension reduction via informative variable selection using $\hat{\boldsymbol{\beta}}$ does not bias the subsequent findings of interaction discovery using $\hat{\boldsymbol{\delta}}$.
Proof: The proof is contained in Web Appendix A. The proof also shows that theorem 1 holds after a linear transformation on the covariates.
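The theorem can also be checked empirically with a small simulation, a hypothetical Gaussian sketch rather than the formal proof of Web Appendix A: data are generated with main effects but no interaction, the $K$ most outcome-correlated covariates are selected, and the rejection rate of the Stage-2 LRT stays close to the nominal 5% despite the outcome-driven screening.

```python
import numpy as np

def lrt_stat(X_K, t, y):
    # Gaussian LRT of the additive vs the interaction model (df = K)
    n = len(y)
    D0 = np.column_stack([np.ones(n), t, X_K])
    D1 = np.column_stack([D0, t[:, None] * X_K])
    rss = lambda D: ((y - D @ np.linalg.lstsq(D, y, rcond=None)[0]) ** 2).sum()
    return n * np.log(rss(D0) / rss(D1))

rng = np.random.default_rng(7)
n, P, K, reps = 300, 6, 2, 400
crit = 5.9915                                     # chi-squared(2) 95th percentile
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0])   # main effects, no TEH
rej = 0
for _ in range(reps):
    X = rng.normal(size=(n, P))
    t = rng.integers(0, 2, size=n).astype(float)
    y = X @ beta + 0.5 * t + rng.normal(size=n)   # null: effects equal in arms
    # outcome-driven screening: keep the K covariates most correlated with y
    score = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    X_K = X[:, np.argsort(-score)[:K]]
    rej += lrt_stat(X_K, t, y) > crit
rate = rej / reps
```

Under this null the observed rejection rate stays close to the nominal level, illustrating that screening on the main effects does not inflate the type-I error of the interaction test.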
In practice, the need for prognostic screening at Stage 1 depends on the power we have for the test of interaction in Stage 2. If there is sufficient power to perform a global test with all variables, then Stage 1 can be bypassed and $K = P$.
If the power at Stage 2 is insufficient to include all variables, then prognostic screening is performed to select $K < P$ variables. $K$ should be a deterministic function of $n$ that is monotonic increasing with no upper bound, such as a power curve based on $n$ (but independent of the sampled covariate values as they vary from sample to sample). Permitting $K$ to increase with $n$ ensures the consistency of the procedure. That is, even in the worst-case scenario where the true interaction occurs at the least prognostic variable $x_{(P)}$, as $n$ tends to infinity $K$ will eventually reach $P$ and so include $x_{(P)}$. This property relies on the test in Stage 2 being consistent, as the LRT is. Theoretically, this scheme can even withstand cryptic interactions, where a covariate shows no overall main effect yet its treatment-specific effects differ.
In section 2.2.2 we discussed a screening procedure with two substages. In Substage I, a chosen supervised approach selects $Q$ covariates, $\mathbf{X}_Q$, based on the evidence for their prognostic effects. Subsequently, in Substage II, we select the $K$ leading PCs of the $\mathbf{X}_Q$ from Substage I.
If the selection of the PCs is based on the ranking of PC score variances, then it is independent of the outcome. Hence the outcome-driven dimension reduction is only executed in Substage I to obtain $\mathbf{X}_Q$, and theorem 1 states that this selection does not influence the subsequent interaction test.
If the PC selection is based on the ranking of their strength of association with the outcome, then outcome-driven dimension reduction is performed in both substages. The covariates $\mathbf{X}_Q$ are selected in Substage I as above. Taking PCs of $\mathbf{X}_Q$ is a linear transformation. Since theorem 1 holds after a linear transformation on the covariates, the PCs of $\mathbf{X}_Q$ can be selected via a supervised approach to obtain $\mathbf{X}_K$ in Substage II without biasing the Stage-2 test that detects the interaction between treatment and $\mathbf{X}_K$.
Let us consider all possible linear transformations. Because theorem 1 is stable under linear transformation, selecting the most prognostic linear combination of the covariates will not bias the subsequent interaction test. So a consequence of our proof is that if the treatment is included in a GLM risk model as in equation (1), it will not inflate the false positive rate of the risk-treatment interaction. In that case the GLM risk model would take the form in equation (1), and the risk score would be defined as the fitted linear predictor with the treatment term removed.
In fact, given a set of variables, one can perform any transformation or basis expansion to generate a new set of bases as the interaction candidates. As long as these operations do not depend on a function of the treatment indicator, the new set of bases can be fed into our framework described in section 2.1 without inflating the false positive rate. This enables the accommodation of non-linearity, and moreover allows the employment of various powerful ML methods such as boosting [14].
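As a trivial hypothetical illustration of such a treatment-blind transformation, one can augment the candidate matrix with squared terms before entering the framework of section 2.1; the only requirement is that the map never consults the treatment indicator.

```python
import numpy as np

def expand_basis(X):
    """Treatment-blind basis expansion: augment the interaction-candidate
    matrix with squared terms.  Because the map depends on X alone (never
    on the treatment indicator t), theorem 1 continues to apply."""
    return np.hstack([X, X ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
B = expand_basis(X)   # 4 original columns plus 4 squared columns
```

The expanded matrix then plays the role of the candidate matrix in the two-stage scheme, accommodating quadratic non-linearity at the cost of doubling the candidate count.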
4 Empirical study
We demonstrate our proposed scheme for interaction detection on the RCT dataset from SEAQUAMAT [2]. Detailed description of the variables and the following analyses are in the R notebooks in SI.
We first describe the data preparation required for the analysis (section 4.1). Subsequently we present the analysis including all the biological variables in the interaction test, and illustrate its lack of power, which motivates the development of our methodology. As mentioned in section 2.2, we prefer the multi-stage screening procedure, which is therefore chosen for our primary analyses presented in section 4.3. There we demonstrate that power can be gained by adopting our framework for TEH discovery.
The full details of the analyses in this section can be found in R notebook S1: Empirical demonstration of interaction analysis on full malaria trial dataset.
4.1 Preprocessing of the data
The dataset is first examined to determine any processing required prior to the analysis. Missing values are imputed with the R package MICE [31] and 50 imputed datasets are created. The analysis is performed on each imputed dataset to investigate whether the findings are consistent across imputation variation.

4.2 Interaction analysis with a GLM including all variables
In this dataset there are 14 biological variables that serve as interactioncandidates. While it is understandable that treatment site (country) may have an overall effect, it is not so intuitive to expect it to interact with treatment. Therefore, country is included in the model for overall adjustments, but not included as an interactioncandidate.
We fit the full additive model $M_a$ and the full interaction model $M_i$, so that $M_a$ includes the main effects of all variables, while $M_i$ also includes all the interaction terms between treatment and the 14 biological variables. For each imputed dataset, an LRT is performed to test whether $M_a$ is an adequate approximation of $M_i$. The median LRT p-value is 0.115 (p-value range = [0.0775, 0.159], 3 s.f., where s.f. is short for significant figures); only 9 of the 50 imputed datasets have p-values below 0.1. There is thus little evidence of treatment interaction in the SEAQUAMAT dataset, which may be due to a lack of power when testing all 14 variables simultaneously.
4.3 Interaction discovery with PC-GLM after outcome-driven dimension reduction
After data preparation (section 4.1), we do not observe high correlations among the covariates (further details are provided in the data preparation section of R notebook S1). Therefore, it is more appropriate to perform the supervised variable selection on the covariates directly rather than on the PCs. This supports our choice of multi-stage screening.

Recall that multi-stage screening divides Stage 1 into substages. Section 4.3.1 describes Substage I, where we perform supervised variable selection to subset the variables. Then in Substage II, PCA is applied to this subset of variables. We present two variants of this analysis, where the leading PCs are selected based on (i) their variances (section 4.3.2) and (ii) supervised selection guided by the outcome (section 4.3.3).
4.3.1 Supervised screening by the gradient boosting model
To screen the covariates, an additive model is fitted by a gradient boosting model (GBM) [14, 15] with stumps, using the R package gbm [32]. The covariates are ranked according to their relative influence (RI) values. Of the 50 imputed datasets, 37 support a model with treatment, country and 10 additional biological covariates. These biological covariates are consistent across the 37 imputed datasets: coma, logged parasitaemia, age, heart rate, temperature, systolic blood pressure, haematocrit, weight, jaundice, and duration of fever (days).
4.3.2 Unsupervised PC selection
From the additive model fitted using GBM with stumps in section 4.3.1, we have identified a subset of 10 biological variables that have the greatest explanatory power in the majority (37/50) of the imputed datasets. PCA is subsequently performed on this subset and the PCs are ranked by their PC score variances. Here, we test whether there are differences in PC effects between treatment groups. Again, for the purpose of demonstrating our method, we iterate through values of $K$ from 1 to 10 for the LRT in Stage 2.
The results from testing the interactions between treatment and the $K$ leading PCs exhibit a reasonably stable pattern, as portrayed in figure 1. There is generally no evidence of interactions between treatment and the leading PCs for $K \le 5$, where all of the imputed datasets have p-values above 0.1. The number of imputed datasets with p-value below 0.1 is respectively 47, 50, 50, 50 and 34 when including the leading 6, 7, …, 10 PCs. The overall evidence for interactions is strongest when testing for interaction between treatment and the 7 leading PCs. The evidence gradually weakens as more PCs are added, suggesting the minor PCs are not so informative of TEH.
When the 7 leading PCs are included in the interaction test, the median p-value is 0.0435 (p-value range = [0.0221, 0.0638]) and the p-value is below 0.05 for 39 of the 50 imputed datasets. Moreover, figure 2 illustrates that the evidence for TEH is noticeably stronger (the LRT p-value decreases) for every imputed dataset when we reduce the variables tested from all 14 biological variables to the 7 PCs representing the 10 variables selected by GBM (in section 4.3.1).
There is a notable increase in the strength of evidence for interactions when PC6 is included to test for interaction with treatment. From inspecting the correlations between PC6 and biological variables, we find that PC6 is most strongly correlated with jaundice and then systolic blood pressure. Furthermore, jaundice and systolic blood pressure have the largest absolute PC6 coefficients across all imputed datasets. The signal for interactions attributed to jaundice and systolic blood pressure appears to be captured by PC6.
The variables identified here as seemingly responsible for TEH, jaundice and systolic blood pressure, are consistent with the results from the analysis with single-stage screening (B.1.1 of Web Appendix B), where the evidence for interactions strengthens when jaundice and systolic blood pressure are added in turn. Comparing figure 2 with Web Figure B.2 (in Web Appendix B), the power to detect interactions appears to be greater when using multi-stage screening than single-stage screening.
In summary, this analysis clearly demonstrates how power can be gained after supervised selection of variables based on the evidence for their main effects, without segregation by treatment arms.
4.3.3 Supervised PC selection
For each imputed dataset, an additive LASSO model is constructed to explain the outcome by the treatment and the PCs obtained from the PCA described in section 4.3.2. This provides a ranking of the PCs by the explanatory power of their main effects, so the selected set represents the $K$ most explanatory PCs informed by LASSO. To explore the properties of this analysis design, we iterate through values of $K$ from 1 to 10 when performing the interaction test in Stage 2.
Figure 3 presents the LRT p-values obtained from each imputed dataset for $K = 1, \dots, 10$. In general, the power appears strongest when we include the two most explanatory PCs in the interaction test. In this case, 34/50 of the imputed datasets have LRT p-values below 0.05, and 41/50 have LRT p-values below 0.1. For these 41 imputed datasets, PC6 is found among the PCs selected for the interaction test. This indicates that PC6 is most informative of TEH, consistent with the analysis presented in section 4.3.2.
The pattern of p-values across $K$ is much noisier in figure 3 than in figure 1. This is due to the variation in the LASSO ranking of PCs across imputed datasets. Once we have included the two most explanatory PCs in the interaction test, the inclusion of additional PCs generally leads to a drop in power. This is because, in this case, PC6 is most informative of TEH and it is also (nearly) the most explanatory. However, it is important to note that this phenomenon is not necessarily the case in general, as explained in section 3 and the proof in Web Appendix A.
4.3.4 Supplementary analyses
Aside from the primary analysis presented in section 4.3, we have also performed additional analyses on the same dataset to further explore our method. These supplementary analyses are presented in Web Appendix B, containing two sections: B.1 and B.2. In section B.1, we present a variant of the analysis in 4.3, where we employ singlestage screening instead of the preferred multistage screening. In section B.2, we present the analyses on subsets of the SEAQUAMAT trial data to investigate the behaviour of our method when the sample size is substantially reduced.
5 Discussion
In this paper we present a framework that increases the power of interaction discovery by outcome-driven selection of variables without inflating the type I error rate. Using the severe malaria trial data as a motivating study for our method, there appears to be TEH of artesunate. While the interaction test we perform is global, we are still able to decompose the signal, which suggests that the TEH may be attributed to jaundice and systolic blood pressure.
Interaction discovery plays a key role in stratified medicine and is a challenging task as soon as the number of interaction candidates exceeds a handful of variables. The limited power for interaction discovery forbids a carpet search for interaction effects. This conflicts with the desire to learn as much as possible from the data and stresses the need for global hypothesis testing tools for interaction discovery. Our novel screening procedure is flexible and hence can take advantage of powerful ML algorithms. While we have focused on application to clinical trial data in this paper, our framework can be applied to datasets from any equivalent randomised experiment for retrospective exploration of TEH.
ML algorithms themselves lack the statistical rigour required for hypothesis testing. In spite of this, our mathematical results support the validity of how ML algorithms are incorporated in our hypothesis testing framework.
In line with the recommendations on interaction analyses that demand hypotheses to be predefined [5, 11, 12, 6], our framework requires determining a priori the degrees of freedom ($K$) of the test for the interaction effect in Stage 2. This is an important reminder that there is no free power. The decision on the degrees of freedom and the rest of the analysis should be recorded in a scientific notebook such as an R notebook [33].
In contrast to strictly hypothesis-driven interaction analyses, our framework relaxes some of the stringencies, as it does not require explicit specification of the (few) factors to be tested. Instead we can start with a much larger pool of candidates and then narrow it down through variable selection and dimension reduction by ML algorithms. Furthermore, highly flexible ML algorithms place much weaker assumptions on the properties of the underlying random process than traditional methodologies, allowing the data to speak for themselves.
Our approach can be considered a generalisation of the internal risk model approach, because our framework facilitates TEH discovery at a finer resolution by allowing the interactions between the treatment and the variables to be modelled individually. It can therefore better detect TEH when the variables driving TEH have weaker main-effect signals than dominating prognostic variables that do not contribute to TEH.
The statistical evidence for TEH provided by this method is at the global level and not for the variables individually. If the global signal is decomposed to pin down the potential set of variables responsible for the treatment interaction, the result should be treated with caution, especially when a transformation has been applied.
In our demonstrations, PCA was employed for dimension reduction, which performs a linear transformation. Examining the PC coefficients, or the correlations between the PC scores and the variables of interest, allows us to hypothesise which variables interact with the treatment. However, if there are concerns regarding the interpretability of an analysis involving a transformation, then the user needs to design the analysis based on what is more important.
The user should always use the transformation that is most meaningful to the problem of interest. If the purpose of the analysis is purely to indicate whether or not there is evidence for TEH, then the user only needs to interpret the LRT p-value, so it might be more appropriate to choose a transformation that achieves the most thorough exploration of the data. However, if the aim of the analysis is to generate hypotheses on which specific variables contribute to TEH, then they should employ the transformation (including no transformation) that is most helpful for deciding which variables to test in future experiments. Of course, there is a trade-off: a simpler analysis is easier to interpret but can miss complex patterns, and hence signal, in the data. This is one example of "no free lunch" in statistical analysis. It is also important to note that if the user decomposes the global/aggregated interaction signal to obtain a suggested set of original variables driving the interaction with treatment, the identification of an interaction between the treatment and a specific variable should be viewed as hypothesis generation.
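As an illustration of how PC loadings can be inspected to generate such hypotheses, the following sketch performs PCA via the singular value decomposition and correlates each original variable with the leading PC score. The data and variable indices are hypothetical, not the SEAQUAMAT covariates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardised baseline covariates (n patients x 4 variables);
# variables 0 and 1 are made correlated so that PC1 loads on both.
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the PC coefficient (loading) vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T  # PC scores for each patient

# Correlating each original variable with the leading PC score suggests
# which variables an aggregated interaction signal may be attributed to.
corr_with_pc1 = [np.corrcoef(X[:, j], scores[:, 0])[0, 1] for j in range(4)]
print(np.round(corr_with_pc1, 2))
```

Here the two correlated variables dominate PC1, so a treatment-by-PC1 interaction would be hypothesised to reflect those variables.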
Our framework furnishes a more holistic exploration of the data while also providing rigorous evidence for potential interactions in retrospective interaction analyses. Although the false positive rate is controlled in our approach, findings from this framework will still need to be replicated in other datasets for validation [5]. Nevertheless, our method can assist researchers in deciding whether to invest further in exploring the heterogeneous effect of a treatment.
Acknowledgements
C.H. Wu is funded by Medical Research Council UK and Cancer Research UK (Grant no.: ANR0031). We thank Mahidol University Oxford Tropical Medicine Research Unit (MORU) for providing us with the data from the SEAQUAMAT trial. C. C. Holmes is supported by the Medical Research Council UK, the EPSRC, the Alan Turing Institute, and the Li Ka Shing Centre for Health Innovation and Discovery.
References
 [1] Douglas G Altman. Clinical trials: Subgroup analyses in randomized trials–more rigour needed. Nature Reviews Clinical Oncology, 12(9):506–507, 2015.
 [2] ANFSK Dondorp, François Nosten, Kasia Stepniewska, Nick Day, and Nick White. Artesunate versus quinine for treatment of severe falciparum malaria: a randomised trial. Lancet (London, England), 366(9487):717–725, 2005.
 [3] Sarah T Brookes, Elise Whitley, Tim J Peters, Paul A Mulheran, Matthias Egger, and G Davey Smith. Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technology Assessment, 5(33):1–56, 2001.
 [4] Sara T Brookes, Elise Whitely, Matthias Egger, George Davey Smith, Paul A Mulheran, and Tim J Peters. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of clinical epidemiology, 57(3):229–236, 2004.
 [5] Peter M Rothwell. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. The Lancet, 365(9454):176–186, 2005.
 [6] James F Burke, Jeremy B Sussman, David M Kent, and Rodney A Hayward. Three simple rules to ensure reasonably credible subgroup analyses. BMJ, 351:h5651, 2015.
 [7] A Dondorp, F Nosten, K Stepniewska, N Day, and N White. South East Asian Quinine Artesunate Malaria Trial (SEAQUAMAT) group. Artesunate versus quinine for treatment of severe falciparum malaria: a randomised trial. Lancet, 366(9487):717–725, 2005.
 [8] Graham R. Foster, Stephen Pianko, Ashley Brown, Daniel Forton, Ronald G. Nahass, Jacob George, et al. Efficacy of sofosbuvir plus ribavirin with or without peginterferon-alfa in patients with hepatitis c virus genotype 3 infection and treatment-experienced patients with cirrhosis and hepatitis c virus genotype 2 infection. Gastroenterology, 149(6):1462–1470, 2015.
 [9] Matthew T Seymour, Timothy S Maughan, Jonathan A Ledermann, Clare Topham, Roger James, Stephen J Gwyther, et al. Different strategies of sequential and combination chemotherapy for patients with poor prognosis advanced colorectal cancer (mrc focus): a randomised controlled trial. The Lancet, 370(9582):143 – 152, 2007.
 [10] John PA Ioannidis. Microarrays and molecular research: noise discovery? The Lancet, 365(9458):454–455, 2005.
 [11] David M Kent, Peter M Rothwell, John PA Ioannidis, Doug G Altman, and Rodney A Hayward. Assessing and reporting heterogeneity in treatment effects in clinical trials: a proposal. Trials, 11(1):85, 2010.
 [12] Xin Sun, John PA Ioannidis, Thomas Agoritsas, Ana C Alba, and Gordon Guyatt. How to use a subgroup analysis: users’ guide to the medical literature. Journal of the American Medical Association, 311(4):405–411, 2014.
 [13] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [14] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
 [15] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
 [16] Aristoklis D Anastasiadis, George D Magoulas, and Michael N Vrahatis. New globally convergent training scheme based on the resilient propagation algorithm. Neurocomputing, 64:253–270, 2005.
 [17] Florin M Selaru, Yan Xu, Jing Yin, Tong Zou, Thomas C Liu, Yuriko Mori, et al. Artificial neural networks distinguish among subtypes of neoplastic colorectal lesions. Gastroenterology, 122(3):606–613, 2002.
 [18] David B Seligson, Steve Horvath, Tao Shi, Hong Yu, Sheila Tze, Michael Grunstein, et al. Global histone modification patterns predict risk of prostate cancer recurrence. Nature, 435(7046):1262–1266, 2005.
 [19] Olivier Gevaert, Frank De Smet, Dirk Timmerman, Yves Moreau, and Bart De Moor. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22(14):e184–e190, 2006.
 [20] Mehmet Fatih Akay. Support vector machines combined with feature selection for breast cancer diagnosis. Expert systems with applications, 36(2):3240–3247, 2009.
 [21] Michael LeBlanc and Charles Kooperberg. Boosting predictions of treatment success. Proceedings of the National Academy of Sciences, 107(31):13559–13560, 2010.
 [22] JF Burke, RA Hayward, JP Nelson, and DM Kent. Using internally developed risk models to assess heterogeneity in treatment effects in clinical trials. Circulation. Cardiovascular quality and outcomes, 7(1):163–169, 2014.
 [23] David M Kent, Jason Nelson, Issa J Dahabreh, Peter M Rothwell, Douglas G Altman, and Rodney A Hayward. Risk and treatment effect heterogeneity: reanalysis of individual participant data from 32 large clinical trials. International journal of epidemiology, 45(6):2075–2088, 2016.
 [24] NJ White, S Pukrittayakamee, TT Hien, MA Faiz, OA Mokuolu, and AM Dondorp. Malaria. The Lancet, 383(9918):723–735, 2014.
 [25] Pragya Sur, Yuxin Chen, and Emmanuel J Candès. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. arXiv preprint arXiv:1706.01191, 2017.
 [26] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on neural networks, 5(4):537–550, 1994.
 [27] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
 [28] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
 [29] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
 [30] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.
 [31] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.
 [32] Greg Ridgeway with contributions from others. gbm: Generalized Boosted Regression Models, 2017. R package version 2.1.3.
 [33] JJ Allaire, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, et al. rmarkdown: Dynamic Documents for R, 2017. R package version 1.5.
Supplementary Materials
Web Appendix A: Mathematical arguments for theorem 1
In this section, we present the proof that, under a null hypothesis, H0, of "no difference in covariate effects across treatment arms", in a generalised linear model (GLM) framework with RCT data, the standardised MLEs of the coefficients in an additive model are (asymptotically) independent of the standardised MLE differences between the group-specific coefficients in a treatment interaction model.
Defining the models and parameters
Let be the number of individuals, where , with indicating the number of individuals in treatment group . The matrix contains the interaction-candidate covariates that potentially contribute to the interaction effect. In a study, we might also observe covariates that we do not consider as interaction candidates. The matrix holds the interaction-irrelevant covariates, whose coefficient values are "c"ommon across treatment groups. The terms and are vectors of length and respectively, where all elements are equal to 1. On the other hand, and are the equivalent 0 vectors. The outcome values of all patients are recorded in the vector , and are assumed to have come from a distribution in the exponential family.
For the purpose of the following calculations, the outcome vector and covariate matrices are arranged such that
(4) 
where , and are respectively the outcome vector and covariate matrices of treatment group .
Consider a GLM of the following form
where denotes the element of , is the link function and is the linear predictor of the observation. The vector represents the linear predictor for all observations.
Additive model
To set up the additive model (no interaction), we first construct the data matrix as
(5) 
This column arrangement has been selected because our primary interest is in the interaction-candidate covariates. The linear predictor of the additive GLM is defined as
(6) 
where the coefficient vector can be decomposed into . The is a vector of the pooled main effects of the interaction-candidates, while is the corresponding vector for the interaction-irrelevant covariates. The term is the intercept of treatment group . The matrix appears redundant here; however, having this 'wrapper' matrix enables more flexible construction of the linear model, as we demonstrate in a later section on PC-GLM.
Interaction model
To construct the interaction model, , the matrix is subsequently expanded to obtain
(7) 
where is a matrix and is a zero matrix.
The linear predictor of a full interaction model with treatment can be constructed as
(8) 
The coefficient vector is defined as . The vector is the treatment-specific coefficients of the interaction-candidates for treatment group . The terms , and are equivalent to , and respectively.
In preparation of an alternative construction, the columns of are rearranged to obtain:
(9) 
With this matrix, the linear predictor of an alternative interaction model, , is set up according to
(10) 
where the coefficient vector . Note that and have the same column space, and hence their respective coefficient vectors are composed of the same set of elements.
In this case, it is observed that the matrix can be defined in terms of linear projections of and , according to
(11) 
with
where both and are . The notation represents a identity matrix. The expression is a dimensional zero vector, while is a zero matrix.
Calculations
We first demonstrate , which can be used to argue that as . The MLEs of , and are respectively , and . Asymptotically,
follows a multivariate normal distribution,
, with the variance-covariance matrix . Similarly, , with . As is a linear transformation of , the difference vector also follows a normal distribution, albeit only having dimensions. This is due to the zeros in the last elements of . We use the notation to refer to the submatrix from rows to and from columns to of the original matrix . The covariance matrix is located in the submatrix
. By the properties of normal random variables,
implies and are independent. The MLEs of GLM coefficients are often computed using the iteratively reweighted least squares (IRLS) algorithm. The estimates of , and respectively take the form
is the MLE of , which is a diagonal matrix, and the diagonal element, , is the weight of the individual given the data and model parameters. denotes the MLE of the vector , which is the working dependent variable given the data and model parameters. In the interaction models, and , the terms , , and are defined similarly.
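The IRLS computation referred to above can be illustrated for a logistic GLM. This is a minimal self-contained sketch, not the paper's implementation; the weights and working dependent variable follow the standard GLM construction:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Logistic-regression MLE via iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                   # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))  # inverse logit link
        w = mu * (1.0 - mu)              # diagonal of the weight matrix W
        z = eta + (y - mu) / w           # working dependent variable
        # Weighted least-squares update: beta = (X'WX)^{-1} X'W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Illustrative data: intercept plus two covariates, one truly predictive.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, 0.0])))))
beta_hat = irls_logistic(X, y)
print(np.round(beta_hat, 2))
```

At convergence the score equations are satisfied, i.e. the gradient of the log-likelihood at the fitted coefficients is numerically zero.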
The covariate matrix can be expressed as
(12) 
Since and have the exact same set of column vectors, we have and . We can then rewrite equation (12) as
(13) 
where . This can be further simplified by substituting equation (11) into equation (13), leading to
(14) 
Because the first columns of are zero columns, so are those in the matrix and hence is a zero matrix.
When the underlying model is correct, as
, the MLE of a parameter approaches a normal distribution and its standard error approaches the asymptotic standard deviation of the estimator. Consequently, a standardised MLE will be asymptotically normally distributed. Therefore, the zero-covariance property is again a sufficient condition for independence. Let
and denote the standardised MLEs of and respectively. The covariance matrix of the standardised MLEs is
where diag() is a diagonal matrix with diagonal value equal to the element of some vector . The element of the vector is the standard error of the element of . The standard error vector for is similarly defined.
As the first columns of are zero vectors, so are those in . Consequently, is a zero matrix, which satisfies the independence condition required in theorem 1.
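The zero-covariance argument above can also be checked empirically. The following sketch simulates a randomised two-arm trial under H0 using a Gaussian linear model (where the OLS algebra makes the result exact) and confirms that the pooled slope estimate is essentially uncorrelated with the between-arm slope difference; all names and settings are illustrative:

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

rng = np.random.default_rng(2)
n, reps = 400, 2000
pooled, diff = [], []

for _ in range(reps):
    x = rng.normal(size=n)
    t = rng.permutation(np.repeat([0, 1], n // 2))  # randomised arms
    y = 1.0 + 0.5 * x + rng.normal(size=n)          # H0: same slope in both arms
    pooled.append(slope(x, y))                      # additive-model estimate
    diff.append(slope(x[t == 1], y[t == 1]) - slope(x[t == 0], y[t == 0]))

# Under H0 with randomised treatment, the pooled coefficient estimate is
# uncorrelated with the between-arm coefficient difference.
r = np.corrcoef(pooled, diff)[0, 1]
print(round(r, 3))
```

The Monte Carlo correlation sits near zero, consistent with the independence established analytically for standardised MLEs.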
Extension to GLM with linear projections of baseline covariates
In this section we present an extension of the proof to accommodate GLM with linear projections of baseline covariates. The columns of the matrix are the weights for the baseline covariates. To perform a regression on the projected variables, and are first projected by . As a result of the projection, the matrices , and will need to be redefined. The matrix of the additive model, , is now written as a linear projection of in equation (5):
(15) 
where the dimensions of matrix are . The covariate matrix of the interaction model, , becomes
(16) 
where is defined in equation (7). The square matrix has dimensions on each side. Similarly, the covariate matrix of the interaction model, , is a linear projection of by :
(17) 
where is defined in equation (9) and the square matrix has the same dimensions as .
Substituting the new definitions of , and described in equations (15), (16) and (17) respectively into the covariance expression in equation (12), and then following the same arguments and workings from there onwards, we will also arrive at equation (14). Therefore, the proof also holds for GLM with linear projections of baseline covariates.
In the case of PC-GLM, and the columns of are the PC coefficients of defined in equation (4). On the other hand, in the unblinded IRM approach, and is a vector containing the coefficients for the main effects of in the risk model.
Web Appendix B: Supplementary analyses of the SEAQUAMAT trial data
In this section, the first part (B.1) presents the analyses on the data containing all 1461 patients in the trial, where we apply single-stage screening of prognostic variables. Together with the primary analysis (in section 4.3), this allows a comparison between single-stage and multi-stage screening.
In the second part (section B.2), we subsample the SEAQUAMAT data so that the subsets only contain 600 patients. We then apply multi-stage screening as in section 4.3.
B.1 Analyses with all patients recruited in the trial
B.1.1 Interaction discovery with GLM after outcome-driven dimension reduction
As the purpose of this demonstration is to showcase the properties of our approach, we iterate through values from 1 to 10. The LRT p-values are presented in figure B.1.
Across the different values, the evidence is strongest when we test for the interaction between treatment and the set of 6 variables = {coma, logged parasitaemia, age, heart rate, temperature, systolic blood pressure}. The interaction with this set provides a median p-value of 0.0787 (p-value range = [0.0460, 0.105], 3 s.f.) and p-values are obtained from 48 of the 50 imputed datasets. Moreover, figure B.2 illustrates that the evidence for TEH is stronger (the LRT p-value decreases) for every imputed dataset when we reduce the variables tested from all 14 biological variables to the 6 variables selected by boosting.
As depicted in figure B.1, further inclusion of the remaining biological candidates displays some instability. When the haematocrit variable is added to this set, the strength of evidence for the interaction effect drops considerably, with only three imputed datasets having p-value . Subsequent inclusion of weight reduces this to only one imputed dataset with p-value . However, inclusion of jaundice pushes the number of datasets with p-value up to 42, but adding on duration of fever leads to a reduction to 34 sets with p-value .
In summary, even though the strength of evidence is not as strong as in the multi-stage analysis, it is observed that the power to detect interactions can be increased by single-stage screening of the main effects in the additive model.
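The additive-versus-interaction likelihood ratio test underlying these analyses can be sketched as follows. For simplicity this illustration uses a Gaussian linear model rather than the GLM of the paper, and the sample size, candidate set, and effect sizes are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, k = 600, 3                      # patients, interaction candidates
X = rng.normal(size=(n, k))
t = rng.binomial(1, 0.5, size=n)   # randomised treatment indicator
# Outcome with a genuine treatment-by-variable-0 interaction.
y = X @ np.array([0.5, 0.2, 0.0]) + 0.4 * t * X[:, 0] + rng.normal(size=n)

def rss(design):
    """Residual sum of squares of the least-squares fit of y on `design`."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

ones = np.ones((n, 1))
additive = np.hstack([ones, t[:, None], X])        # H0: no interaction
interact = np.hstack([additive, t[:, None] * X])   # adds k interaction terms

# Gaussian-model LRT: n * log(RSS0 / RSS1) is asymptotically chi2(k) under H0.
stat = n * np.log(rss(additive) / rss(interact))
pval = chi2.sf(stat, df=k)
print(round(pval, 6))
```

The degrees of freedom of the test equal the number of interaction terms added, which is why screening the candidate set down before testing preserves power.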
B.2 Analyses only with patients having staging data
The full details of the analyses in this section can be found in R notebook S2: Empirical demonstration of interaction analysis on the data of the subset of patients from the SEAQUAMAT trial who have staging information.
Of the 1461 patients in the SEAQUAMAT trial, 854 have staging data of the malaria parasite lifecycle. The data is in the form of percentages of the following stages: tiny ring, small ring, large ring, early troph, mid troph, late troph and gametocytes. There are two additional variables in this dataset: base deficit and an indicator of prior antimalarial treatment received.
The purpose of this section is to demonstrate that, when the sample size of the SEAQUAMAT trial data is substantially reduced, our framework can still achieve better power to detect TEH than naïvely including all the variables in the interaction test, even when we account for subsampling variation. Section B.2.1 presents the problem encountered when including all biological variables in the interaction analysis of the 854 patients. Section B.2.2 describes the subsampling of the SEAQUAMAT data, so that the problem in section B.2.1 has a greater impact and we can account for subsampling variation. Subsequently, we apply our framework to each subset and present the results in section B.2.3. Here, we only consider multi-stage screening, which is our primary recommendation. In addition, we select the PC in Substage II by its variance, as that seems to exhibit more stable behaviour (section 4.3.2).
B.2.1 Interaction discovery with GLM including all variables
After the inclusion of all the aforementioned additional variables, the degrees of freedom of the LRT for interaction reaches 25, with 30 and 55 variables respectively in the full additive and the full interaction model. Given the results presented by [25], it is reasonable to question the accuracy of the LRT p-value of the interaction test. To investigate the bias, we simulate the distribution of the LRT p-values under H0; this simulated distribution is presented in figure B.3, which clearly displays the underestimation of p-values.
Using the simulated distribution to correct for the observed bias, the full interaction test for TEH provides a p-value of 0.0574 (3 s.f.), a borderline result that does not provide evidence against the global H0 of no interactions at the 5% significance level.
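The simulation-based correction can be sketched as follows: the corrected p-value is the proportion of H0-simulated p-values at least as extreme (as small) as the observed one. The miscalibrated null below is a synthetic stand-in, not the SEAQUAMAT analysis:

```python
import numpy as np

def corrected_pvalue(observed_p, null_ps):
    """Empirical correction: proportion of H0-simulated p-values that are
    at least as small as the observed p-value (with a +1 continuity term)."""
    null_ps = np.asarray(null_ps)
    return (np.sum(null_ps <= observed_p) + 1) / (null_ps.size + 1)

# Synthetic anti-conservative null: nominal p-values skewed towards zero,
# mimicking the bias of high-degree-of-freedom LRTs (cf. [25]).
rng = np.random.default_rng(4)
null_ps = rng.uniform(size=5000) ** 2

# A nominal p-value of 0.01 is corrected upwards, to roughly
# P(U^2 <= 0.01) = 0.1 for this synthetic null.
print(round(corrected_pvalue(0.01, null_ps), 3))
```

When the nominal test is well calibrated (uniform null p-values), the correction leaves the p-value essentially unchanged.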
B.2.2 Subsampling
While the result with the 854 patients is not statistically significant at the 5% level, the LRT p-value is rather close to the 0.05 threshold. We therefore subsample the data to 600 patients, where there is less power to detect TEH.
We created 100 subsampled datasets. As the SEAQUAMAT trial data is roughly balanced between the two treatment arms, the subsampling is stratified by treatment arm. For each subsample, we simulated the LRT p-value distribution under H0 including all 25 biological interaction-candidates. The LRT p-value for the full treatment interaction of each dataset is corrected using the corresponding simulated H0 distribution. We focus only on the 67 subsamples that have (corrected) LRT p-values greater than 0.1.
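The treatment-stratified subsampling can be sketched as follows; the arm sizes follow the text, while the implementation details are illustrative:

```python
import numpy as np

def stratified_subsample(treatment, size_per_arm, rng):
    """Sample row indices without replacement so that each treatment arm
    contributes the same number of patients."""
    idx = [rng.choice(np.flatnonzero(treatment == arm),
                      size=size_per_arm, replace=False)
           for arm in np.unique(treatment)]
    return np.concatenate(idx)

rng = np.random.default_rng(5)
treatment = rng.binomial(1, 0.5, size=854)          # roughly balanced arms
subset = stratified_subsample(treatment, 300, rng)  # a 600-patient subsample
print(len(subset), int(treatment[subset].sum()))
```

Repeating this draw 100 times, with a fresh random state each time, yields the collection of subsampled datasets analysed below.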
B.2.3 Interaction discovery with PC-GLM after outcome-driven dimension reduction
Each subsampled dataset is fitted with a GBM with stumps. Relative influence is used to determine the priority of variable inclusion for testing the global TEH. For each subsample, there is still a fairly large number of variables with positive relative influence, but many of these have very low relative influence (below 1) and are thus likely to carry little predictive information, so we only select the variables with RI above 1. Following this, PCA is applied to the variables selected for each subsampled dataset.
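This screening-then-projection step can be sketched as follows. The sketch assumes scikit-learn; the paper itself uses the R gbm package, whose relative influence is on a percentage scale, so RI above 1 corresponds to a normalised importance above 0.01 here. The data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a 600-patient subsample with 25 candidates,
# of which only variables 0 and 1 carry prognostic signal.
rng = np.random.default_rng(6)
X = rng.normal(size=(600, 25))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.2 * X[:, 0] - 0.8 * X[:, 1]))))

# Screening: boosted stumps (depth-1 trees), keeping variables whose
# normalised importance exceeds the RI > 1 (percentage-scale) threshold.
gbm = GradientBoostingClassifier(max_depth=1, n_estimators=200, random_state=0)
gbm.fit(X, y)
keep = np.flatnonzero(gbm.feature_importances_ > 0.01)

# Dimension reduction: leading PCs of the screened variables only.
pcs = PCA(n_components=min(8, keep.size)).fit_transform(X[:, keep])
print(keep.size, pcs.shape)
```

The resulting PC scores, rather than the raw 25 candidates, are what enter the interaction LRT in the following analysis.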
Because we have already performed the analysis on the full 1461-patient dataset with a range of values (section 4.3), we do not repeat it here. Instead we chose the 8 most variable PCs to include in the LRT to test whether they interact with the treatment. We chose 8 because, of the 67 subsets, the minimum number of variables with RI greater than 1 is 8, suggesting that when we account for subsampling variation, there are at least 8 variables that are informative of prognosis. To protect against potentially poor approximation by the χ² distribution used in the LRT, the null distribution of the LRT p-values is simulated and used to correct the LRT p-values for the interactions between the 8 leading PCs and the treatment.
As is evident in figure B.4, power to detect TEH is improved when we reduce the variables tested from the 25 biological variables to 8 PCs representing the baseline covariates selected by GBM. Fifty-seven of the 67 subsamples show improved strength of evidence for TEH. The strength of evidence for TEH varies considerably because subsampling patients creates variation in the TEH signal across the subsampled datasets. Nevertheless, the general trend is that our framework can have better power to detect TEH than including all variables.