Counterfactual Phenotyping with Censored Time-to-Events

by   Chirag Nagpal, et al.
Carnegie Mellon University

Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Causal reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.



page 1

page 2

page 3

page 4


gfoRmula: An R package for estimating effects of general time-varying treatment interventions via the parametric g-formula

Researchers are often interested in using longitudinal data to estimate ...

Identifying Principal Stratum Causal Effects Conditional on a Post-treatment Intermediate Response

In neoadjuvant trials on early-stage breast cancer, patients are usually...

Defining Estimands Using a Mix of Strategies to Handle Intercurrent Events in Clinical Trials

Randomized controlled trials (RCT) are the gold standard for evaluation ...

Bayesian graphical modelling for heterogeneous causal effects

Our motivation stems from current medical research aiming at personalize...

Introducing the treatment hierarchy question in network meta-analysis

Background: Comparative effectiveness research using network meta-analys...

Effects of Purine Bases and Nucleosides on the Survival of Cortical Rat Astrocytes in vitro

Purines are responsible for the regulation of many physiological and pat...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Real world studies to estimate the effect of an intervention often involve time-to-event outcomes which are typically followed up only for a fixed period of time. Such studies are commonplace in healthcare and frequently arise when evaluating the effect of a drug or medical intervention on the time to events of interest such as death, re-hospitalization, or a composite physiological outcome. Randomized Control Trials (RCT) aim to eliminate group imbalance through randomizing treatment and control groups. Covariates are evaluated to ensure balanced control and treatment groups so the two groups can be compared without confounding the treatment effect. Hence, in an RCT the population-level event time distributions can be directly compared to obtain estimates of average treatment efficacy.

Indeed, popular population-level metrics for survival and time-to-event prediction involve comparing hazard ratios or summary metrics such as restricted mean time to event by building Proportional Hazard or Kaplan-Meier estimators on the treatment and control arms.

While population-level effect estimation is important as it informs current clinical guidelines and practices, the effect of any intervention is rarely uniform across any population under observation. The advent of precision medicine aims to address these differences in treatment effect by applying individualized treatments designed based on each patient’s individuals characteristics. This strategy assumes there are differences in treatment effects that may be explained by varying demographic factors, baseline physiology, or prior medical history. The crux of precision medicine thus entails careful phenotyping of individuals that may receive augmented benefit from a treatment versus those who would not benefit (or worse, suffer adverse effects). These individual patient factors can then be accounted for when framing clinical guidelines based on randomized trials.


Consider the following scenario from an established clinical practice. Landmark studies such as the ALLHAT clinical trial (davis1996rationale; cushman2002original) have established that thiazide diuretic treatment (Chlorthalidone) are not inferior to angiotensin-converting enzyme (ACE) inhibitors (Lisinopril) or calcium channel blockers (Amlodipine) for reducing cardiovascular risk in hypertensive patients. However, there are certain sub-populations, such as those with baseline chronic kidney disease, who may benefit from the renal-protective effects of an ACE inhibitor. Additionally, ACE inhibitors are not indicated as an initial treatment for hypertension in black patients, and either a thiazide diuretic or calcium channel blocker is recommended for this group (whelton20182017).

Such scenarios demonstrate that certain risk groups or phenotypes may not benefit uniformly from a given treatment. It is therefore of immense clinical interest to recover such groups or cohorts of patients to help guide more precise interventions, which leads to personalized medicine and improved patient safety and outcomes.

Figure 1 is an illustration of this ‘Counterfactual Phenotyping’ problem. In the untreated population latent groups mediate the base survival rates. However when an intervention, is performed, the treatment effect is mediated by the latent group , independent of the base survival rate.

In this paper, we propose a principled approach to discover such subgroups or cohorts of individuals that demonstrate heterogenous effects to an intervention in the presence of censored outcomes. The proposed method is not sensitive to strong assumptions of proportional hazards, and it can be applied in situations where such strong assumptions do not generalize uniformly across population.111

The software code implementation of the method described in this paper has been released as part of the open source

auton-survival python package and is available on github at

Our specific contributions can be summarized as follows:

  • [leftmargin=1em]

  • We propose a deep latent variable approach to recover subgroups of patients that respond differentially to an intervention, in the presence of censored outcomes.

  • We present conditions in which the counterfactuals are identifiable using observational data under the proposed model, along with an efficient approach for learning and inference.

  • We demonstrate the proposed approach applied to multiple large landmark clinical trials that were originally carried out to assess the efficacy of medical interventions to reduce risk of adverse cardiovascular outcomes among hypertensive and diabetic patients, and we discover clinically actionable counterfactual phenotypes.

Figure 1: Counterfactual Phenotyping: a) In the untreated population there are two subgroups ( MidnightBlue) and ( Mahogany) that demonstrate differential baseline survival rates represented as the latent group, . The intersecting survival curves between the latent groups in suggest that the observed untreated survival rates do not obey the Proportional Hazards assumption. b) Under intervention, we observe that treatment benefit is independent of base survival group, . Individuals in benefit while individuals in are harmed . We propose to recover such counterfactual phenotypes, that demonstrate heterogeneous effects to the intervention.

2 Related Work

Survival Regression, a modeling process of building estimators of Time-to-Event in the presence of censored outcomes is a classic problem in statistical estimation, which has recently received attention of the machine learning research community. Arguably, the semi-parametric Cox Proportional Hazards model 


and its extensions involving multi layer perceptrons 

(faraggi1995neural; katzman2018deepsurv) remain popular, even though they are constrained by strong assumptions on the event time distributions.

Recent research in survival analysis has focused on developing flexible estimators that can ease the restrictive assumptions of the classical Cox model. Non-parametric approaches have been introduced involving Random Forests

(ishwaran2008random) and Gaussian Processes (fernandez2016gaussian; alaa2017deep). (yu2011learning) proposed to treat survival as multitask classification over discrete time horizons. Work of (lee2018deephit)

extends that with deep neural networks in the presence of longitudinal data


Other relevant deep learning approaches include adversarial learning

(chapfuwa2018adversarial) and flexible parametric mixture models (nagpal2021deep; nagpal2021deepb; ranganath2016deep). Further research into survival estimation involves modelling of competing risks, easing restrictive proportional hazard assumptions, and ensuring that models are well calibrated (chapfuwa2020calibration; goldstein2020x; yadlowsky2019calibration). The focus on estimating counterfactuals with censored data has been somewhat limited in current machine learning research literature. chapfuwa2021enabling; curth2021survite

propose using integrated probability metric penalties

(shalit2017estimating) to learn overlapping representations of the treated and control populations in the presence of censored outcomes. More traditional statistical literature in counterfactual survival regression includes propensity score (linden2018estimating; hassanpour2019learning; hassanpour2019counterfactual) and doubly robust estimators (zhao2015doubly).

In this paper, we focus specifically on the problem of subgroup identification and phenotyping with censored outcomes. Recent research involving phenotyping of censored time to events are restricted to factual or observational phenotypes (nagpal2021dcm; chapfuwa2020sca; manduchi2022a). Our work, however, focuses on simultaneous discovery of latent clusters (or phenotypes) that are counterfactual, in that they demonstrate heterogeneous effects to an intervention, while learning effective predictive models capable of capturing the uncovered heterogeneity of response functions.

Machine learning techniques for recovery of subgroups with heterogeneous treatment effects has focused so far on problems where outcomes are either continuous or binary in nature. foster2011subgroup; vittinghoff2010estimating

propose using non-parametric estimators (Decision Trees or Random Forests) to directly regress the difference of the outcomes of the treated and control groups in an framework often called "Virtual Twins" (VT).

wang2022causal propose a sampling based approach to recover sparse rule-sets that identify subgroups with enhanced effects.

More recently, nagpal2020interpretable introduced a deep latent variable learning approach, called Heterogeneous Effect Mixture Model. While close in spirit to our contributions, this approach 1) does not decouple effects of baseline physiology on survival from the treatment effect and 2) is incompatible with censored time-to-event outcomes. Therefore, it cannot be applied to many real-world clinical datasets in a straightforward manner.

3 Proposed Methodology

Figure 2: Schematic description of the proposed CMHE. The set of features (confounders) are passed through an encoder to obtain deep non-linear representations. These representations then describe the latent phenogroups and that determine the base survival rate and the treatment effect respectively. Finally, the individual level hazard (survival) curve under an intervention is described by marginalizing over and as .

3.1 Notation and Setting

We consider a dataset of right censored observations in the form of four tuples, , where is either the time to event or censoring as indicated by , is the indicator of treatment assignment, and are individual covariates that confound the treatment assignment and the outcome.

3.2 Cox PH Model and Cox Mixtures

The Cox Proportional Hazards model is arguably the most popular approach to model censored survival outcomes. The Cox model involves assuming that the conditional hazard of an individual is


where is typically a linear function. Thus, under the Cox model, the full likelihood in terms of the cumulative hazard222The cumulative hazard is defined as . It can equivalently be described in terms of the base survival rate as . and parameters is as follows:


However, the assumption that the hazard rates remain proportional over time is a strong one and is often violated in practice. One example of such a violation is the presence of intersecting survival curves. nagpal2021dcm propose to relax the PH assumption by describing the data as belonging to a mixture of fixed size , such that the PH assumptions hold only conditioned on the latent component assignment, . The assignment function of an individual to a latent group can then be learned jointly along with the component-specific hazard ratios. Under this model the hazard rate of an individual belonging to latent is given as:


3.3 Cox Mixtures with Heterogeneous Effects

In this paper we show how to model counterfactual outcomes and individual level treatment effects using the proposed approach: Cox Mixtures with Heterogeneous Effects (CMHE). To accomplish that, we further extend the model in Equation 3 by introducing another latent variable that determines the treatment effect for an individual with confounders, . Thus under this new model,

Figure 3 presents CMHE in plate notation.
Assumption 1 (Independent Censoring).
The distribution of the time-to-event and the censoring times are independent conditional on the covariates, and the treatment assignment .
Assumption 2 (Conditional PH).
Conditional on the latent group , individual time-to-event distributions obey proportional hazards.
Assumption 3 (Latent Independence).
The baseline survival rate group, and the treatment effect group, are independent given the confounders , i.e., .
Assumption 4 (Ignorability).
The treatment assignment, is independent of the potential time-to-event outcomes, conditioned on the set of confounders , ie. .
Figure 3: CMHE in plate notation. confounds treatment assignment and outcome (Model parameters and censoring distribution have been abstracted out).

Note that Assumption 1 is stronger compared to standard regression as it requires conditioning on both the covariates and the treatment assignment. Assumption 4 essentially states that the treatment assignment is completely characterized by the available confounders . In other words, there are no exogenous factors (unobserved confounders) that may affect treatment assignments.

Remark 1 (Identifiability).

Under Assumptions 1 - 3, the counterfactual Time-to-Event distribution is identifiable with observables as follows, .

Remark 1 allows us to make phenogroup level counterfactual inference in terms of observables. It stems directly from the standard application of pearl2009causality’s do-calculus and Assumptions 3 and 4 (proof available in Appendix A.1). As a consequence of the Assumptions 1 - 3 under this new model, the full likelihood is:


3.4 Architecture

A non-linear representation of the input

is obtained using a multilayer perceptron with parameters

. This representation reflects the baseline survival specific log-hazard ratios, for the each of the base survival phenotypes through the function and the non-normalized probability of belonging to one of the base survival clusters, ie. as . Further, the non-normalized probability of assignment to counterfactual phenotype is defined as , i.e., . Figure 2 provides a schematic diagram of the proposed architecture for CMHE.

3.5 Learning

Parameter inference in semi-parametric latent variable models such as CMHE is hard as estimation of the baseline hazard rates (

) is carried out non-parametrically. Naive application of the Expectation Maximization (EM) algorithm requires inference over all possible

latent assignments for the entire dataset which is intractable. As described in nagpal2021deep we propose a stochastic EM algorithm involving Monte-Carlo sampling to make inference tractable. The M-step of our proposed EM algorithm involves the following function,


which we arrive at using the assumption that the Proportional Hazards hold within each baseline survival rate group . Here and are the respective soft posterior counts of and . and corresponds to the hard posterior counts sampled as , and is the partial likelihood.

Algorithm 1 describes the stochastic EM learning algorithm for CMHE. The proposed stochastic EM makes inference is tractable with a complexity of . A complete discussion of our formulation along with the functional form of the and functions are deferred to Appendix A.2

Input : Training set, ; batches, ;
while <not converged> do
        for   do
               Draw a minibatch from the full dataset. E-Step for   do
                      ; Sample soft posteriors. Categorical( Categorical( Draw hard posteriors.
               end for
               Update with gradient of . for  do
                      breslow1972contribution’s estimator.

Cubic-spline interpolation.

               end for
        end for
end while
Return : learnt parameters, ; baseline survival splines
Algorithm 1 Learning for CMHE with Stochastic EM

3.6 Inference

Following Remark 1, the estimated risk of an individual with confounders under an intervention at a time is


4 Experiments

ALLHAT ACCORD Dataset Treatment Control Hazard Ratio Event Rate ALLHAT-A Chlorothalidone Amlodipine/Lisinopril ALLHAT-B Amlodipine Lisinopril ACCORD Intensive Glycemia Standard Glycemia Figure 4: Kaplan-Meier estimates and summary statstics of the datasets used in the paper. (For ALLHAT, outcome of interest was death from cardiovascular disease. For ACCORD, the primary endpoint was the time to first instance of Non-Fatal Myocardial Infarction, Stroke, or Death.)

In our experiments, we consider data from the landmark ALLHAT and ACCORD clinical trials originally conducted to determine the optimal treatment for reducing risk from cardiovascular disease.

4.1 Datasets

Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack (allhat): The ALLHAT clinical trial (furberg2002major) was constituted to establish the appropriate intervention between chlorothalidone (a diuretic), amlodipine (calcium channel blocker) and lisinopril (ACE inhibitor) for hypertensive patients to reduce adverse cardiovascular events. Patients were enrolled over a four year period with a mean follow up time of 4.9 years. The complete study involved 33,357 participants older than 55 years of age, all with hypertension. 15,255 () of participants were assigned to the chlorthalidone arm, while 9,048 () were assigned to amlodipine and 9,054 () to lisinopril. For the purposes of this paper, we consider two separate experiments on the ALLHAT dataset. ALLHAT-A: We consider all patients in the trial assigned to chlorothalidone as ‘Treated’ and the rest as ‘Controls’. ALLHAT-B: We only consider the 18,102 assigned to lisinopril or amlodipine. Amlodipine is considered the ‘Treated’ arm. For both ALLHAT A and B we consider the time to death from cardiovascular events as the primary outcome of interest.

a) versus b) versus c) Untreated Survival d) Survival under Treatment

Figure 5: a) The distribution of the latent baseline survival groups, in the space of features . b) The distribution of the latent treatment effect phenogroups in the space of features . c) Kaplan-Meier Estimators conditioned on the latent baseline groups in the untreated population ie. . d) Kaplan-Meier estimators of the treated population conditioned on the latent treated effect groups ie. . Notice that individuals in , benefit from the intervention but are harmed .

Action to Control CVD Risk in Diabetes (accord) : The ACCORD study (ismail2010effect) involved 10,251 patients over the age of 40 with Type-2 Diabetes Mellitus with a median follow up time of 3.7 years. 5,128 () patients were randomized to intensive glycemic control. (HbA1C333Glycated Hemoglobin (A1c):) and the rest to standard glycemic control (HbA1C: ). In this paper we consider the patients assigned to the standard glycemic control arm as ‘Treated’ and compare the performance in terms of time to the primary endpoint of the study (a logical alternative of death, myocardial infarction or stroke). The ACCORD trial is of particular interest, as the results from it demonstrate that a more intensive hyperglycemia treatment strategy for patients with diabetes not only fails to reduce adverse cardiovascular events, but it in fact may increase the patient’s risk for overall mortality. This is most likely due to adverse effects of the treatment itself.

Figure 4 presents the Kaplan-Meier estimates of the overall population level survival for the two studies along with the summary statistics including Hazard Ratio and Average Treatment Effect in Restricted Mean Survival Time. For both the ALLHAT and the ACCORD trials we consider a set of confounding features measured during the patients baseline visit at the time of randomization. This includes basic demographic information including sex and race, age at entry into the study, previous history of adverse cardiovascular events, etc. (See Appendix F for the full list of confounders.)

Synthetic: We further benchmark the proposed model on a synthetic dataset presented in Figure 5. This dataset is designed such that the latent treatment effect phenotype is not linearly separable in . The time-to-event conditioned on , latent and latent effect group are generated from a Gompertz distribution. (Complete details of this design are deferred to Appendix E.)

4.2 Counterfactual Phenotyping

We evaluate the performance of CMHE in its ability to identify phenotypes of varying sizes that have a more pronounced treatment effect when compared to the baselines. For all the datasets we keep of the dataset as training and the rest as testing to evaluate each model’s performance. For CMHE as well as the baselines we identify the phenogroup with the most enhanced and the most diminished treatment effect on the training set and report the estimated Conditional Average Treatment Effect in Restricted Mean Survival Time on the held out sample.

Definition 1 (Rmst).

The Restricted Mean Survival Time at time under an intervention for individual with confounders is the expected conditional Time-to-Event,

Note that in the case of time-to-event outcomes, The Restricted Mean Survival Time (RMST) is the truncated area under the survival curve as


Following from chen2001causal; royston2013restricted we define the Conditional Average Treatment Effect in terms of the difference in Restricted Mean Survival Time under treatment and control.

Definition 2 ().

The Conditional Average Treatment Effect at time is expected difference between the treated and control Restricted Mean Survival Time conditioned on the phenogroup, .

This can now be estimated as:


Here is the set of all individuals in the phenotype (we control the size of the phenotype by varying the threshold, and is the time horizon at which RMST is computed. and are the survival distributions under treatment and control.

Note that we cannot directly compare the survival curves conditioned on the recovered phenotype as within the phenotype treatment assignment is not random. In order to mitigate this problem, we fit separate Random Survival Forests (RSFs) (ishwaran2008random) on the treated and control populations in the training set in a 5-fold Cross Validation to minimize the Integrated Brier Score. The fitted estimators are then employed to estimate the individual counterfactual survival curves, and on the test data for evaluation.

Baselines: We compare the ability of CMHE against the following baseline strategies for counterfactual phenotyping:

Dimensionality Reduction + Clustering: Involves first performing dimensionality reduction of the input confounders,

, followed by clustering. For the experiments in the paper we consider Linear-PCA and Kernel-PCA with an Radial Basis Function Kernel for dimensionality reduction, followed by K-Means and Gaussian Mixture Models (GMMs) for clustering. The number of reduced dimensions for the confounders is tuned from

and the number of clusters from . For the GMMs we enforce the learned covariance matrix of the components to be diagonal.

Virtual Twins (foster2011subgroup; vittinghoff2010estimating): Involves first building regression estimators of the outcome conditioned on the confounders separately for the treated and control populations, followed by regressing the difference between counterfactual estimates on the confounders using a simpler model such as a Decision Tree. In the experiments, we estimate the virtual twin counterfactual survival models using a Linear Cox model and Cox model parameterized with a 2 hidden layer Multi-Layer Perceptron (MLP) with (faraggi1995neural; katzman2018deepsurv) in a 5 fold cross-validation fashion on the training dataset. For both the Linear and MLP Cox Model we tune the batchsize from the learning rate in . For the MLP, the hidden layer was fixed to have Tanh activations with a dimensionality of 50. The models were optimized using Adam (kingma2014adam). Once the counterfactual models are estimated, the difference in there estimates in terms of RMST@5-Year is modelled using a Random Forest with 25 trees whose depth is tuned from {4,5}. The trained Random Forest is then employed to recover the counterfactual phenotypes.

4.3 Results

Latent Phenogroups with Enhanced Treatment Effects

Latent Phenogroups with Diminished Treatment Effects

Figure 6: Conditional Average Treatment Effect in Restricted Mean Survival Time (95% Confidence Bands) over Time for counterfactual phenotypes recovered by CMHE and Baselines in comparison to the Average Treatment Effect (ATE). For each of the datasets we identify phenogroups with enhaced (diminished) treatment effect based on RMST on the training split and report the corresponding RMST on the out-of-sample testing split. Phenogroups of different sizes are generated by varying the threshold, at which an individual is assigned to the latent phenogroup, . (Here we report phenogroups that are of the size 15% of the total study population.) Notice how CMHE consistently recovers phenogroups with larger (Tabulated results in Table 3).

CMHE consistently recovered phenogroups that demonstrated higher Conditional Average Treatment Effect as compared to the the Virtual Twins and Dimensionality Reduction Clustering baselines as in Figure 6. In the case of ALLHAT A and B, CMHE recovered a sub-population of 15% of the test data that had a diminished @5-Years of and Days respectively as compared to the population ATE of and Days (Table 3).

In the ACCORD trial, the CMHE recovered a phenogroup involving 15% of the test set that had a much more dramatic treatment effect (@5-Years of Days, as compared to the population average treatment effect of Days). These results demonstrate that CMHE can discover compelling phenogroups with substantially different treatment responses.

Figure 7: ROC curves for recovery of latent by CMHE and proposed models on the synthetic dataset.

In the case of synthetic data, we directly compare the performance of out of sample recovery of by treating it as a binary classification problem. CMHE has higher discriminative performance (Area under Receiver Operating Characteristic: ) versus the clustering () and Virtual Twins () baselines, as shown in Figure 7. Notice that Clustering is not better than random, which is expected due to the nonlinear nature of as in Figure 5.

4.4 Interpretation

Rule explanation of CMHE Phenotype Size 5-Year RMST
Height<68.51 Dia.BP<88.5 GFR<65.7 Aspirin=+ve 5.05% 42.87 24.30
Dia.BP<82.50 Prior[T-inv.]=+ve LLT=-ve Race[Black]=+ve 13.00% 20.32 24.30
Sys.BP>144.5 Dia.BP>87.5 GFR>55.7 Race[Black]=+ve 8.71% 53.20 28.97
Dia.BP<78.5 GFR<86.1 Race[Black]=+ve Aspirin=-ve 6.86% 7.56 28.97
GFR<79.9 UACR<42.9 FPG>194.6 Prior[CVD]=-ve 5.58% 28.94 4.79
potassium<4.8 GFR>90.2 Prior[MI]=-ve Prior[CVD]=+ve 7.56% -23.26 4.79
Table 1: We employ a tree ensemble based rule learning algorithm (Details in Appendix D) to explain the phenotypes extracted by CMHE. Enhanced ( Diminished) Treatment Effect Rules in Blue ( Red). We report explanations that maximize score on the heldout dataset. Extensive discussion on physiologic interpretation of the extracted explanations are in Section 4.4.

We employed a tree ensemble based rule learning (friedman2003gradient; friedman2008predictive) to interpret the phenotypes discovered with CMHE in terms of parsimonious conjunctions. The extracted rules are presented in Table 1. Additional implementation details are provided in Appendix D. The extracted rules were subjected to qualitative evaluation by an expert clinician.

ALLHAT-A: Conditions associated with increased protective effect from chlorthalidone treatment include patients who are older, shorter, have decreased renal function evedenced from lower glomerular filtration rate (GFR) and exhibit less baseline cardiac disease (absence of coronary heart disease). Additionally, several conditions favor patients with lower baseline systolic or diastolic blood pressure. In clinical practice, the decision for first-line antihypertensive therapeutic agent continues to be debated (whelton20182017). However, ACE inhibitors are commonly used for treatment of patients with hypertension and cardiac disease (such as coronary heart disease). The conditions stated above are consistent with this, indicating patients with absence of coronary heart disease are more likely to benefit from diuretic therapy. Additionally, patients treated with calcium channel blockers may be prone to edema and fluid retention. Fluid retention is a risk for patients with lower baseline GFR, indicative of kidney dysfunction, thus diuretic therapy may be preferred in these patients. On the other hand, ACE inhibitors are believed to slow the progression of mild kidney disease, making them a reasonable treatment for these patients.

ALLHAT-B: Patients treated with amlodipine who achieved the greatest benefit tended to have higher baseline blood pressure (systolic and diastolic), weigh less (lower weight and body mass index), have higher baseline kidney function (higher GFR), and were more likely to be of Black, non-hispanic ethnicity. These characteristics are useful and actionable clinical parameters to guide clinicians in choosing a first antihypertensive agent. In fact, the American College of Cardiology and American Heart Association guidelines for management of hypertension suggest that Black patients should typically be prescribed either a thiazide diuretic or calcium channel blocker for first-line agent (whelton20182017). In this case, the rules discovered with CMHE demonstrate that the calcium channel blocker amlodipine is more effective than lisinopril for non-white patients. Conditions corresponding with diminished treatment effect are the inverse of those described for the group receiving the most benefit. ACE inhibitors are indicated treatment to slow the progression of renal insufficiency for those with mild renal dysfunction and hypertension, and the CMHE conditions support this, indicating that amlodipine is better in patients with higher baseline kidney function, whereas lisinopril may be better for patients with evidence of kidney disease.

ACCORD: CMHE revealed multiple conditions describing the phenogroup with decreased treatment effect from intense hyperglycemic control. However, another set of conditions describe the phenogroup with increased treatment effect from intensive glycemic control. In the former group, criteria tended to favor patients with less severe renal disease (e.g., higher baseline GFR, lower baseline creatinine, lower baseline potassium) or decreased evidence of cardiovascular disease (e.g., lower diastolic blood pressure, less bradycardia), though this was not true in all cases. In contrast, the phenogroup with increased treatment effect has a worse baseline kidney function (GFR < 79.9 mL/min, urine creatinine < 42.9 mg/dL), higher baseline fasted glucose levels (fasting blood glucose > 195), and lacked a documented history of cerebrovascular disease. These results suggest that patients at with decreased renal function and poor baseline glucose control, which is commonly seen in advanced diabetics, stand to gain the most from an intensive treatment regimen. Patients with a history of cerebrovascular disease (e.g., stroke), on the other hand, may be at greater risk from hypoglycemic complications related to intensive glucose control. This is consistent with clinical and biological intuition that aggressive treatment is required for the most severe forms of a disease, whereas milder or early forms of disease may not derive as much benefit and are at increased risk for net harm due to unintended side effects.

4.5 Factual Regression

For completeness, we further evaluate the performance of the proposed approach in estimating factual risk over multiple time horizons using the standard survival analysis metrics, inclduing:

Metric Concordance Index IBS Time Horizon 1 Year 3 Year 5 Year ALLHAT-A Cox-Linear 0.6822 0.6716 0.6688 0.1362 Cox-MLP 0.6797 0.6693 0.6677 0.1361 CMHE-Linear 0.6830 0.6722 0.6692 0.1360 CMHE-MLP 0.6832 0.6734 0.6705 0.1357 ALLHAT-B Cox-Linear 0.6741 0.6641 0.6629 0.1402 Cox-MLP 0.6717 0.6605 0.6602 0.1404 CMHE-Linear 0.6753 0.6651 0.6638 0.1401 CMHE-MLP 0.6760 0.6655 0.6640 0.1399 Metric Concordance Index IBS Time Horizon 1 Year 3 Year 5 Year ACCORD Cox-Linear 0.6615 0.6737 0.6713 0.0591 Cox-MLP 0.6564 0.6723 0.6706 0.0590 CMHE-Linear 0.6606 0.6720 0.6697 0.0591 CMHE-MLP 0.6755 0.6881 0.6850 0.0587 SYNTHETIC Cox-Linear 0.6224 0.6205 0.6158 0.1723 Cox-MLP 0.6623 0.6727 0.6740 0.1619 CMHE-Linear 0.6356 0.6365 0.6337 0.1698 CMHE-MLP 0.6676 0.6758 0.6786 0.1604
Table 2: Time Dependent Concordance Index and Integrated Brier Score (IBS) for CMHE (Linear and MLP) against Cox baselines in factual regression.

Brier Score : Defined as the Mean Squared Error (MSE) around the probabilistic prediction at a certain time horizon.


Time Dependent Concordance Index (): A rank order statistic that computes model performance in ranking patients based on their estimated risk at a specfic time horizon.


We compute the censoring adjusted estimates of the Time Dependent Concordance Index (antolini2005time; gerds2013estimating) and the Integrated Brier Score444Brier Score integrated over 1, 3 and 5 years. (gerds2006consistent; graf1999assessment)

to assess both disciminative performance and model calibration at multiple time horizons. For each of the datasets we perform 5-fold cross-validation over the hyperparameter grid as described in

Appendix C, and report the performance of the hyperparameter setting with the lowest Brier Score averaged over all folds. Table 2 presents the discriminative performance and calibration of CMHE compared to Cox PH models in factual regression. We find that CMHE had similar or better discriminative performance than a simple Cox Model with a linear and MLP hazard functions. CMHE was also better calibrated as evidenced by overall lower Integrated Brier Score, suggesting utility for factual risk estimation.

5 Discussion and Conclusion

We proposed CMHE, a deep learning approach that discovers latent phenogroups that respond differentially to an intervention in the presence of censored time-to-event outcomes. It provides a valuable adjunct to traditional statistical techniques in healthcare survival research and determination of treatment efficacy. CMHE provides the opportunity to gain individualized patient insights by identifying subjects who would benefit from a treatment as well as those who are at highest risk for harm. This can particularly useful in clinical practice when these personalized insights differ from population-level expectations of the efficacy of the established as well as novel treatment protocols.


Appendix A Additional Details on Cmhe

a.1 Identifiability

Proof of Remark 1.


a.2 Learning

We propose to maximize the likelihood in Equation 5 using a stochastic Expectation Maximization algorithm (Algorithm 1).

E-Step: Involves first computing the posterior counts of the joint of the latent and as follows:


Note that for the censored individuals is . In practice the are obtained through spline interpolation. Now the latent variable specific soft posterior counts can be computed by marginalizing out the complementary Latent Variable. Thus, soft posterior counts of and respectively are,

M-Step: as in standard EM, involves maximizing the function on the data, defined as:


Note that


can be directly optimized with a gradient based approach. However,

is the semi-parametric Cox event rate which is hard to optimize in the presence of soft posterior weights. We instead replace the soft weights in

with the hard posterior counts sampled as follows: .

We thus arrive at given as,

Remark 2.

is an unbiased estimate of the

in Equation 14.

Proof. Follows immediately from the fact that .

We can now rewrite




is equal to maximizing over each where,


Combining with


we arrive at the in Equation 6.

Appendix B Tabulated Results

Latent Phenogroups with Enhanced Treatment Effects

Model CATE (RMST) in Days
1 Year 3 Year 5 Year
DR-C 2.68 0.08 14.61 0.52 28.03 1.19
VT 3.08 0.11 17.36 0.65 35.31 1.44
CMHE 3.58 0.10 18.52 0.56 37.43 1.23
DR-C 4.23 0.15 21.57 0.79 40.37 1.68
VT 5.34 0.17 28.68 0.93 54.16 1.96
CMHE 5.64 0.13 30.44 0.74 59.54 1.59
DR-C 0.36 0.18 4.15 0.94 13.76 2.21
VT 1.40 0.24 6.19 1.16 18.11 2.65
CMHE 1.22 0.24 9.71 1.11 27.02 2.43

Latent Phenogroups with Diminished Treatment Effects

Model CATE (RMST) in Days
1 Year 3 Year 5 Year
DR-C 1.60 0.08 7.64 0.46 13.62 1.0
VT 2.24 0.10 9.36 0.54 15.52 1.16
CMHE 1.38 0.06 5.80 0.38 10.09 0.86
DR-C 3.57 0.20 12.16 1.08 15.89 2.29
VT 2.64 0.14 8.34 0.80 7.49 1.80
CMHE 2.39 0.18 6.20 1.02 2.28 2.18
DR-C -0.06 0.32 -4.51 1.86 -8.59 4.34
VT -0.33 0.24 -10.53 1.47 -26.92 3.50
CMHE 0.12 0.29 -10.45 1.79 -24.52 4.15
Table 3: Tabulated results of the proposed method versus baselines in counterfactual phenotyping. We report the Conditional Average Treatment Effect in Restricted Mean Survival Time over multiple time horizons.

Appendix B Tabulated Results

Latent Phenogroups with Enhanced Treatment Effects

Model CATE (RMST) in Days
1 Year 3 Year 5 Year
DR-C 2.68 0.08 14.61 0.52 28.03 1.19
VT 3.08 0.11 17.36 0.65 35.31 1.44
CMHE 3.58 0.10 18.52 0.56 37.43 1.23
DR-C 4.23 0.15 21.57 0.79 40.37 1.68
VT 5.34 0.17 28.68 0.93 54.16 1.96
CMHE 5.64 0.13 30.44 0.74 59.54 1.59
DR-C 0.36 0.18 4.15 0.94 13.76 2.21
VT 1.40 0.24 6.19 1.16 18.11 2.65
CMHE 1.22 0.24 9.71 1.11 27.02 2.43

Latent Phenogroups with Diminished Treatment Effects

Model CATE (RMST) in Days
1 Year 3 Year 5 Year
DR-C 1.60 0.08 7.64 0.46 13.62 1.0
VT 2.24 0.10 9.36 0.54 15.52 1.16
CMHE 1.38 0.06 5.80 0.38 10.09 0.86
DR-C 3.57 0.20 12.16 1.08 15.89 2.29
VT 2.64 0.14 8.34 0.80 7.49 1.80
CMHE 2.39 0.18 6.20 1.02 2.28 2.18
DR-C -0.06 0.32 -4.51 1.86 -8.59 4.34
VT -0.33 0.24 -10.53 1.47 -26.92 3.50
CMHE 0.12 0.29 -10.45 1.79 -24.52 4.15
Table 3: Tabulated results of the proposed method versus baselines in counterfactual phenotyping. We report the Conditional Average Treatment Effect in Restricted Mean Survival Time over multiple time horizons.

Appendix C Factual Regression Experiments

We compare the performance of CMHE in 5 Fold CV to a Linear Cox model and a Deep Cox Model in 5 fold cross validation with a 2 Hidden Layer MLP with dimensionality of 50 and Tanh activations. Each model was trained with Adam with learning rates tuned from and minibatch size of . For CMHE we tuned the number of treatment effect phenotypes from from and the base survival rate phenogroups from {1,2,3}.

Appendix D Rule Learning

We used the python package, scope-rules555 to explain the learnt phenotypes with parsimonius rules. The rules were restricted to have a maximum length of 4 and a precision of 0.8. Explanations with the highest score on the train set are reported in Table 1.

Appendix E Synthetic Dataset

We employ the python package sklearn666Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." the Journal of machine Learning research 12 (2011): 2825-2830. to generate the confounders .


Appendix F List of Confounding features

Name Description
RACE Race of Participant
HISPANIC If Participant was Hispanic
ETHNIC Ethnicity
SEX Sex of Participant
ESTROGEN Estrogen supplementation
BLMEDS Antihypertensive treatment
MISTROKE History of Stroke
HXCABG History of coronary artery bypass
STDEPR Prior ST depression/T-wave inversion
OASCVD Other atherosclerotic cardiovascular disease
DIABETES Prior history of Diabetes
HDLLT35 HDL cholesterol < 35mg/dl; 2x in past 5 years
LVHECG LVH by ECG in past 2 years
WALL25 LVH by ECG in past 2 years
LCHD History of CHD at baseline
CURSMOKE Current smoking status.
ASPIRIN Aspirin use
LLT Lipid-lowering trial
RACE2 Race (2 groups)
BLMEDS2 Antihypertensive treatment
GEOREGN Geographic Region
AGE Age upon entry
BLWGT Weight upon entry
BLHGT Height upon entry
BLBMI Body Mass Index upon entry
BV2SBP Baseline SBP
BV2DBP Baseline DBP
EDUCAT Education
APOTAS Baseline serum potassium
BLGFR Baseline est glomerular filtration rate
Table 4: List of confounding variables used for experiments involving the ALLHAT dataset.
Name Description
female Indicator if sex is Female
bl_age Age in years
cvd_hx_bl CVD History at Baseline: 0=No, 1=Yes
raceclass Race Class: White, Black, Hispanic, Other
sbp Systolic Blood Pressure (mmHg)
dbp Diastolic Blood Pressure (mmHg)
hr Heart Rate (bpm)
x1diab Diagnosis of type 2 diabetes of >3 months duration
x2mi Myocardial infarction
x2stroke Stroke
x2angina Angina/Ischemic changes (Graded Exercise/Imaging)
cabg CABG
ptci PTCI/PTCA/Atherectomy
cvdhist Participant has history of clinical CVD events
orevasc Other revascularization procedure
x2hbac11 HbA1c between 7.5% and 11.0% inclusive
x2hbac9 HbA1c between 7.5% and 9.0% inclusive
x3malb Micro or macro albuminuria within past 2 years
x3lvh LVH by ECG or Echocardiogram within past 2 years
x3sten Low ABI (<0.9)/>= 50% stenosis of coronary, carotid or,
lower extremity artery within past 2 years
x4llmeds On lipid lowering medication currently or,
untreated LDL-C > 130 mg/dL within past 2 years
x4gender Gender for low HDL-C within past 2 years
x4hdlf HDL-c < 50 mg/dL within past 2 years, female
x4hdlm HDL-c < 40 mg/dL within past 2 years, male
x4bpmeds Participant currently on BP medications
x4notmed Participant not on BP medication and,
most recent BP within past 2 years
x4smoke Current cigarette smoker
x4bmi BMI > 32 kg/m2 within past 2 years
chol Total Cholesterol (mg/dL)
trig Triglycerides (mg/dL)
vldl Very low density lipoprotein (mg/dL)
ldl Low density lipoprotein (mg/dL)
hdl High density lipoprotein (mg/dL)
fpg Fasting plasma glucose (mg/dL)
alt ALT (mg/dL)
cpk CPK (mg/dL)