The popularity of the Cox regression model has contributed to the enormous success of the hazard ratio as a concise summary of the effect of a randomised treatment on a survival endpoint. Notwithstanding this, use of the hazard ratio has been criticised over recent years. Hernán (2010) argued that selection effects render a causal interpretation of the hazard ratio difficult when treatment affects outcome. While the treated and untreated people are comparable by design at baseline, the treated people who survive a given time may then tend to be more frail (as a result of lower mortality if treatment is beneficial) than the untreated people who survive the given time , so that the crucial comparability of both groups is lost at that time. Aalen et al. (2015) re-iterated Hernán’s concern. They viewed the problem more as one of non-collapsibility (Martinussen and Vansteelandt, 2013), which is a concern about the interpretation of the hazard ratio, though not about its justification as a causal contrast. In particular, they argued that the magnitude of the hazard ratio typically changes as one evaluates it in smaller subgroups of the population (e.g. frail people), even in the absence of interaction effects on the log hazard scale. In this paper, we aim to develop more insight into these matters. We will focus in particular on what can be learned about the treatment effect from the hazard ratio becoming 1 after a certain point in time. Throughout, we will assume that data are available from a randomised experiment on the effect of a dichotomous treatment (coded 1 for treatment and 0 for control) on a survival endpoint , so that issues of confounding can be ignored, although our findings naturally extend to observational studies.
2 The Cox model and causal reasoning
Analyses of time-to-event endpoints in randomised experiments are commonly based on the Cox model
where denotes the hazard function of given , evaluated at time , and is the unspecified baseline hazard function. This model implies that for all
which suggests that the exponential of can be interpreted as
This represents a causal contrast (i.e., it compares the same population under different interventions). Indeed, let denote the potential event time we would see if the exposure is set to . Then, since randomisation ensures that for , we have that
This shows that, under the proportional hazard assumption,
forms a relative contrast of what the log survival probability would be at an arbitrary timeif everyone were treated, versus what it would be at that time if no one were treated.
The log-transformation makes the above interpretation of , while causal, difficult. It is therefore more common to interpret as a hazard ratio
Interpretation appears simpler now, but this is somewhat deceptive for two reasons. First, the righthand expression shows that contrasts the hazard functions with and without intervention for two separate groups of individuals, those who survive time with treatment () and those who survive time without treatment (). Those groups will typically fail to be comparable if treatment affects outcome (Hernán, 2010). In particular, when treatment has a beneficial effect then, despite randomisation, the subgroup in the numerator will generally contain more frail people than the subgroup in the denominator, where the frailest people may have died already. When viewed as a hazard ratio, therefore does not represent a causal contrast. Second, the interpretation of as a hazard ratio is further complicated by it being non-collapsible, so that its magnitude typically becomes more pronounced as one evaluates smaller subgroups of the study population (Martinussen and Vansteelandt, 2013; Aalen et al., 2015).
The summary so far is that, under the assumption of proportional hazards, the parameter in the Cox model expresses a causal effect, namely the ratio of log survival probabilities with versus without treatment in the study population. Because this interpretation is not insightful, results are best communicated by visualising identity (1
) in terms of estimated survival curves with versus without treatment(Hernán, 2010). This has the advantage that it provides better insight into the possible public health impact of the intervention, but the drawback that it does not permit a compact way of reporting and that survival curves do not provide an understanding of a possible dynamic treatment effect. To enable a more in-depth understanding, it is tempting to interpret as a hazard ratio, but then interpretation becomes subtle. We will demonstrate this in more detail in the next section, where we will investigate to what extent hazard ratios may provide insight into the dynamic nature of the treatment effect.
3 Time-varying hazard ratios are not causally interpretable
Consider now a study where the hazard ratio changes with time in the following sense:
with , where denotes the change point, which we assume to be known based on subject matter knowledge. Suppose in particular that and . This is commonly interpreted as if treatment is initially beneficial, but becomes ineffective from time onwards. We present a practical example in Section 4.2 where this situation arises. In the following sections, we will reason whether such interpretation is justified, and thus whether hazard ratios permit a dynamic understanding of the treatment effect.
3.1 A closer look at the causal mechanism
To develop a greater understanding, we will first develop insight into data-generating processes (DGP) that could give rise to (2). Let represent the participants’ unmeasured baseline frailty (higher means more frail), which affects , but is independent of by randomisation. Suppose that the hazard function of given and satisfies
for some function , and let. This specific choice is not essential, however, and we later consider a situation where is binary.
We will investigate what choices of give rise to model (2). With the Laplace transform associated with the distribution of , the following relationship between the hazard function of interest , and , can be shown to hold:
where . For subjects with given , it follows that the conditional hazard ratio, which we term , is
For , this simplifies to
which is smaller than 1 at all time points when, as previously assumed, . We conclude that treatment appears beneficial at all times amongst individuals with the same value of , regardless of . This contradicts the earlier, naïve interpretation that, across all individuals combined, treatment is ineffective from time onwards.
The root cause of these contradictory conclusions is the fact that the hazard ratio at a given time does not express a causal effect (Hernán, 2010). This has nothing to do with model misspecification, as all considered models hold by construction. In particular, when and , then
While is independent of by randomisation, it is thus no longer so amongst subgroups of survivors, where we are left with more frail subjects in the active treatment group: . That selection takes place does not rely on being Gamma-distributed; see Appendix A.1, where we consider a situation with a binary frailty variable.
To illustrate further, we generated a single data set with binary with and corresponding to low and high risk groups. The treatment variable was binary with . Further, we took , , the baseline hazard function , and the change point . We took a large sample size
so that sampling variability is small, and induced censoring according to a uniform distribution on, randomly for half of the individuals, and the rest censored at , resulting in an overall censoring percent of approximately 19%. The result from a change point Cox analysis was (SE 0.018) and (SE 0.034). The corresponding hazard ratios are for and for . In contrast, the conditional hazard ratio in Figure 1 is seen to be smaller than 1 at all times, not only in the first initial period. Indeed, no matter the value of , it lies between 0.33 and 0.80 in the considered time period from 0 to 8.
That we are misled by the hazard ratios calculated from the extended Cox regression analysis is due to selection taking place. This is shown in Figure 2, where we have plottedfor . Since the treatment has a beneficial effect, we are left with more and more frail subjects in this group compared to the untreated group. In Figure 2, the blue curve () therefore lies consistently above the green curve ().
As just shown, the DGP given by with implies the marginal Cox model. It is tempting to interpret as a causal hazard ratio, but this only holds under further untestable assumptions as shown in Section 3.2 and 3.4. In Appendix A.2, we formulate a more general DGP so that both the marginal Cox model and the model are correctly specified.
3.2 Towards a causal hazard ratio
To remedy this problem, it seems intuitively of interest to evaluate conditional hazard ratios
for a large collection of baseline variables , such that those who survive time with treatment () are comparable to those who survive time without treatment (), but have the same covariate values . Such comparability would be attained if
Under this assumption, it follows via Bayes’ rule that the righthand side of the above identity equals
This estimand expresses the instantaneous risk at time on treatment versus control for the principal stratum of individuals with covariates who would have survived up to time , no matter what treatment. It represents a causal contrast, which is closely related to the so-called survivor average causal effect (Rubin, 2000). We will refer to it as the conditional causal hazard ratio.
Unfortunately, the assumption (5) is untestable and biologically implausible, as it is essentially impossible to believe that one can get hold of all predictors of the event time such that knowledge of the event time without treatment does not further predict the event time with treatment. Furthermore, even if one could get hold of all such predictors, then because would probably carry so much information about the event time, one would logically expect the numerator and denominator of (6) to be so close to 0 or 1 that it would render the conditional causal hazard ratio essentially meaningless. Below, we will therefore focus on the marginal causal hazard ratio obtained from (6) with empty:
3.3 Why causal hazard ratios are not identified without strong assumptions
The reason why the causal hazard ratio (7) is not identifiable without invoking strong assumptions is that it attempts to answer an overly ambitious question. Imagine a trial that randomises participants over an implanted medical device (e.g. a stent or a pacemaker) versus no treatment (or placebo). Suppose that the medical device gradually deteriorates and stops being operational after some time . Then we would say that treatment no longer works from time onwards. This would correspond with for . However, how could data from a randomised trial be informative about the effect of treatment after time when no information is collected on the times at which the medical device is operational or not? To learn about the treatment effect at each time , we should ideally need data on whether () or not ( the device is operational at that time. When the operation time is ignorable, then one may learn about the treatment effect at each time through contrasts of the form
for all , where is the information generated by all , . It is unsurprising that without detailed data on the operation times of each device, strong assumptions are needed to develop insight into the dynamic nature of the treatment effect. Likewise, consider a trial that randomises participants over a once-daily treatment regimen versus placebo. Then we would say that treatment no longer works from time onwards when, from that time onwards, patients with the same history of treatment experience the same outcomes (in distribution), whether or not they continue treatment. To infer when treatment becomes ineffective, a multi-stage design is ideally needed where patients on the treatment arm may randomly be switched to the control arm at designated points in time. Without such design, some progress can still be made with data on daily pill intake. However, without such design and data, inferring when treatment becomes ineffective remains an ambitious undertaking.
3.4 Sensitivity analysis
Some further insight into the magnitude of the causal hazard ratio at a given time can be obtained under the monotonicity assumption that no one is harmed by treatment, i.e.
Under this assumption, we can write
It follows that
compares the instantaneous risk at time on treatment for individuals who remained event-free at time thanks to treatment versus individuals who would have been event-free at time regardless of treatment. It can be viewed as a selection effect and is therefore termed a sensitivity ratio (SR). Expression (8) thus conveys that the dependence of the hazard ratio on time may differ from the causal hazard ratio on time as a result of selection effects. In particular, when the observed hazard ratio is constant, e.g. suggesting a beneficial treatment effect of 0.8, then this will often correspond with a more pronounced treatment effect . This results from (8), if , as . The causal hazard ratio then need not be constant. This may be the result of varying treatment effectiveness over time, but may also be the result of the patient population (the principal stratum ) changing with time. It follows from the above results that
All terms in the righthand side, apart from , are identified from the observed data. The above expression may therefore be used as the basis of a sensitivity analysis where the user tries different choices of . However, it is not clear what would be reasonable values of , and the monotonicity assumption may also not be plausible.
Assume instead that there is a so that (5) holds, and so that the DGP is governed by
for some general functions and , then
Thus, if (5) and (9) hold then . In the worked application in Section 4.1 we compare to the Cox hazard ratio while modelling the correlation between and letting be Gamma distributed with varying variance so that (5) and (9) hold. As a further illustration we now describe how to simulate data so that (5) and (9) are fulfilled, and so that the marginal Cox model, conditioning only on , is also correctly specified. Take , , and to be independent so that is Gamma distributed with mean 1 and variance , and
are exponentially distributed with mean 1, and the exposureis binary with . Then let and let It follows directly that (5) holds, and further that the marginal Cox model is correctly specified, and also that
so (9) also holds. Thus, , and
Different values of will give different values of Kendall’s corresponding to different correlation between and . In the specific setting, we have . Figure 3 displays in scenarios with different values of Kendall’s . Note that the marginal Cox model induces the hazard ratio , which is taken to be . It is seen that is equal to only under independence between and ; otherwise looks like a decreasing function in with . When and are highly correlated (Kendall’s ) then is sharply decreasing towards zero. In Appendix A.4 we describe another DGP where again (5) holds and the marginal Cox model is correctly specified, but and are different due to failure of (9).
3.5 Hazard differences
Under some conditions, hazard differences - as opposed to hazard ratios - have a more appealing interpretation, apart from them being collapsible (Martinussen and Vansteelandt, 2013). In particular, suppose that the additive hazards model
holds for general functions , and baseline covariates . It has been shown that under this model, the baseline exchangeability of treated and untreated individuals w.r.t. the covariates , as guaranteed by randomisation, implies that (Vansteelandt et al., 2014; Aalen et al., 2015). However, since this is equivalent to
it does not mean that there is a balance in the risk set, because (11) shows that this is a comparison between two different groups of people: those with and those with . Indeed, the hazard difference
may not represent a causal contrast. If (5) holds, e.g. that is a sufficiently rich collection of variables that includes and (or deterministically predicts and ) then the above hazard difference reduces to
which represents a causal contrast. This is shown in Appendix A.5. While it is implausible to have such rich collection of data that it essentially deterministically predicts , interestingly, when model (10) holds for including , then it can be fitted without data on . Indeed, by collapsibility of the hazard difference (Martinussen and Vansteelandt, 2013), the hazard difference can be consistently estimated via Aalen least squares estimation in the unadjusted model
Unfortunately, however, the additive structure of model (10) will often be unlikely satisfied w.r.t. a rich collection of variables that predict , and thus the practical implications of the above reasoning remain limited. For instance, in the simulation study of Section 3.1, the marginal Aalen additive hazards model fits the data perfectly, because is binary, but also suggests a beneficial effect of the treatment in the first 4 years, which then disappears, see Figure 4.
4.1 MRC RE01 study
As an illustration, we reconsider the kidney cancer data described in White and Royston (2009). These data are from the MRC RE01 study that was a randomised controlled trial comparing interferon- (IFN) treatment with the best supportive care and hormone treatment with medroxyprogesterone acetate (control) in patients with metastatic renal carcinoma. We use the same 347 patients as in White and Royston (2009). In this illustrative analysis we consider only the first 30 months of follow up. The median follow-up time was 242 days, and 85% of the patients died within the considered time frame. The two Kaplan-Meier estimates in Figure 5 contain all the available information about the treatment effect. IFN treatment seems superior to the standard treatment (control), although a supremum test comparing the two survival curves results in a non-significant p-value of 0.09. The score process plot of Lin et al. (1993) was calculated using the R-package timereg (Martinussen and Scheike, 2006), giving no evidence against the proportional hazards assumption (P 0.43 according to the supremum test). We therefore fitted the Cox model, giving the estimate (SE 0.12, P 0.01), showing evidence that the IFN treatment reduces the risk of dying for these patients. The hazard ratio of 0.75 expresses the magnitude of the causal effect, but interpretation is subtle as explained before. As a sensitivity analysis we now assume (5) and (9), and take to be Gamma distributed with mean 1 and variance . This model fits the observed data equally well and thus cannot be refuted based on the observed data. The parameter expresses the correlation between and and was chosen to give a Kendall’s of 0.1 and 0.2 corresponding to estimated Kendall’s concerning the correlation between lifetimes for dizygotic and monozygotic Danish twins (Scheike et al., 2015). A higher correlation between and , corresponding to a Kendall’s of 0.3, was also considered. Figure 6 displays under these three scenarios. It is seen that is smaller than 0.75 at all times and decreases with time indicating a stronger treatment effect when comparing the instantaneous risk at time on treatment versus control for the principal stratum of individuals who would have survived up to time , no matter what treatment. This is more pronounced with the larger correlation. In view of these subtleties of interpretation, in this section, we will focus on a number alternative ways of describing the treatment effect.
We may alternatively use the Cox model to estimate the relative risk function
which can be estimated consistently by
where is the Cox partial likelihood estimator and is the corresponding Breslow estimator. This estimate, along with 95% pointwise and uniform confidence bands, is displayed in Figure 7, where we see that the estimated relative risk function is below 1 at all times, in favour of the IFN treatment. For instance, it is seen that the relative risk at one year is estimated to be approximately 0.85, and, judging from the 95% confidence bands (dashed curves), this is close to being significant. A uniform test over the considered time span is also close to being significant judging from the 95% uniform confidence bands.
Another way of quantifying the treatment effect is by using the restricted mean survival time (RMST) (Uno et al., 2014; Zhao et al., 2016). The RMST up to time is defined as . This is the area under the survival curve of up to time and can easily be estimated using the corresponding Kaplan-Meier curve up to time (see Zhao et al. (2016) for details on inference). Contrasts of the corresponding to different (randomised) treatment groups therefore carry a causal interpretation. The restricted mean time lost, is defined as . Here, the ‘months of life lost up to 30 months’ is given by and estimated to 17.3 for the IFN treatment and to 19.8 for the control treatment. The ratio of these two (IFN vs control) is 0.87 (95% CI, 0.70 to 1.04). Thus, on IFN treatment there is a 13% less loss of lifetime compared to the control treatment during the first 30 months of follow up.
We also fitted Aalen’s additive hazard model
where includes days from metastasis to randomization (log-transformed), WHO performance status (0, 1 and 2; with group 0 and 1 collapsed into one group), and Haemoglobin (g/dl). As the treatment variable is independent of the other covariates, the above Aalen model is collapsible meaning that the interpretation of is the same in the conditional and marginal model. The Cox model does not have this property. Model (12) appeared to fit the data well, using the tools described in Chapter 5 in Martinussen and Scheike (2006); specifically no interaction between the treatment indicator and the baseline risk factors was found. We estimated both from the conditional model and the marginal model, and the two estimators were almost identical, which, as pointed out, should be the case if model (12) is correctly specified. Using model (12
), we next tested the null hypothesisof a constant effect (P 0.58), which was subsequently estimated to be (SE, 0.009). If (or if the addition of additional variables conditional on which and become independent, does not change the additive structure of the model), then this is also the causal hazard difference. This would mean that over the course of the follow-up, an average of approximately 2 additional deaths will occur for each month of follow-up in each 100 persons under the control treatment alive at the start of the month and who would also be alive under the IFN treatment, compared with each 100 IFN treated persons alive at the start of the month and who would also be alive under the control treatment. As suggested before, the assumption that is implausible however.
4.2 Gastrointestinal tumour study
Stablein and Koutrouvelis (1985) presented survival data from a randomized clinical trial on locally unresectable gastric cancer. Half of the total 90 patients were assigned to chemotherapy, and the other half to combined chemotherapy and radiotherapy. It was suggested that there was superior survival for patients who received chemotherapy, but only in the first year or so. The same application was considered by Collett (2015) p. 386-389. For illustrative purposes we consider here the first 720 days of follow up corresponding to the two first time periods considered by Collett (2015). Figure 8 displays the Kaplan Meier curves corresponding to the two groups, and shows that the survival curves become close at the end of the considered time interval. Applying a Cox regression model with time-by-treatment-interaction, allowing for separate HR’s before and after 1 year of follow-up gives estimated HR’s (Combined vs chemotherapy) of 2.40 (95% CI; 1.25, 4.63) in the first year and 0.78 (95% CI; 0.34,1.76) thereafter (the supremum score process test of Lin et al. (1993) gave no convincing evidence against these two models, with P 0.07 in the first interval and P 0.26 in the second interval). Thus one might be tempted to conclude that the chemotherapy is beneficial in the first year only, and that there might even be a reverse effect afterwards, see Collett (2015) for a similar analysis and conclusion. However, arguing based on the two hazard ratios is subtle as we have shown. If chemotherapy is more effective then there will be more and more frail subjects in that group making the interpretation of hazard function difficult. We illustrate this using the two estimated hazard ratios and taking the frailty variable to be Gamma distributed mean and variance equal to corresponding to a Kendall’s of 0.3. Figure 9 displays the estimated which is seen to depend on time, being larger than 2.4 and increasing towards 3.7 in the first year and, after the change-point (1 year), starting at around 1.2 and then decreasing but being larger than 1 at all times.
5 Concluding Remarks
We have argued that the treatment effect in a proportional hazards model carries a causal interpretation, but that its interpretation is subtle. The proportional hazards assumption does not express, for instance, that treatment works equally effectively at all times, as the hazard ratio at a given time mixes differences between treatment arms due to treatment effect as well as selection. The danger of over interpreting hazard ratios become most pronounced when the hazard ratio is not constant over time (e.g. when the hazard ratio is below 1 for some time and then becomes 1). We have argued that this cannot be interpreted as implying that treatment effectiveness disappears after some time. In our opinion, this is the source of much confusion, and a real concern. Non-constant hazard ratios are indeed fairly common in real life because the proportional hazards assumption is a rather unstable assumption in the following sense. Even when valid in some population, this assumption is likely to fail in subgroups of that population (e.g. if one studies men and women separately), and vice versa. This makes the assumption, at best, an approximation in practice.
We have suggested possibilities to estimate hazard ratios that are causally interpretable because they compare intensities at a given time with and without treatment for the same patient population: those who would survive that time, no matter what treatment. Inferring such hazard ratios necessitates a sensitivity analysis, however. Furthermore, they have the disadvantage of describing the effect for an unknown subgroup of the population. A better strategy in practice, when interest lies in the dynamic aspects of a treatment, is therefore to design the study such that the collected data provide immediate insight into the dynamic aspects of treatment (e.g. by modifying treatment assignments over time).
- Aalen et al. (2015) Aalen, O. O., Cook, R. J., and Røysland, K. (2015). Does cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Anal, 21(4):579–93.
- Collett (2015) Collett, D. (2015). Modelling Survival Data in Medical Research. CRC Press, Boca Raton.
- Hernán (2010) Hernán, M. (2010). The hazards of hazard ratios. Epidemiology (Cambridge, Mass.), 21(1):13–15.
- Lin et al. (1993) Lin, D. Y., Wei, L. J., and Ying, Z. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika, 80:557–572.
- Martinussen and Scheike (2006) Martinussen, T. and Scheike, T. H. (2006). Dynamic Regression Models for Survival Data. Springer-Verlag, New York.
- Martinussen and Vansteelandt (2013) Martinussen, T. and Vansteelandt, S. (2013). On collapsibility and confounding bias in cox and aalen regression models. Lifetime Data Anal, 19(3):279–96.
- Rubin (2000) Rubin, D. B. (2000). Causal inference without counterfactuals: Comment. Journal of the American Statistical Association, 95(450):435–438.
- Scheike et al. (2015) Scheike, T. H., Holst, K. K., and Hjelmborg, J. B. (2015). Measuring early or late dependence for bivariate lifetimes of twins. Lifetime Data Anal, 21(2):280–99.
- Stablein and Koutrouvelis (1985) Stablein, D. M. and Koutrouvelis, I. A. (1985). A two-sample test sensitive to crossing hazards in uncensored and singly censored data. Biometrics, 41(3):643–52.
- Uno et al. (2014) Uno, H., Claggett, B., Tian, L., Inoue, E., Gallo, P., Miyata, T., Schrag, D., Takeuchi, M., Uyama, Y., Zhao, L., Skali, H., Solomon, S., Jacobus, S., Hughes, M., Packer, M., and Wei, L.-J. (2014). Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J Clin Oncol, 32(22):2380–5.
- Vansteelandt et al. (2014) Vansteelandt, S., Martinussen, T., and Tchetgen, E. T. (2014). On adjustment for auxiliary covariates in additive hazard models for the analysis of randomized experiments. Biometrika, 101(1):237–244.
- White and Royston (2009) White, I. R. and Royston, P. (2009). Imputing missing covariate values for the cox model. Stat Med, 28(15):1982–98.
- Zhao et al. (2016) Zhao, L., Claggett, B., Tian, L., Uno, H., Pfeffer, M. A., Solomon, S. D., Trippa, L., and Wei, L. J. (2016). On the restricted mean survival time curve in survival analysis. Biometrics, 72(1):215–21.
A.1 Binary frailty variable
Let be binary, for instance and , low and high risk groups, then , and similar results as in the Gamma-distribution case (Section 3.1) are obtained. In this binary case, the Laplace transform is
which is an increasing function. Hence, if we take and , then again
for all .
A.2 Frailty model arising by marginalization
It was shown in Section 3.1 that one can always pick a DGP so that the Cox model holds marginally, only conditioning on the observed . Rename to . We show now that similarly we can also pick a DGP so that it marginalizes to that further marginalizes to , the latter being the Cox model. For ease of calculations, let
with being Gamma distributed with mean and variance equal to 1, and independent of and . Similar calculations as those in Section 3.1 gives the following expression
A.3 Selection and Cox model
Assume that the Cox model is correctly specified. Will there always be selection? The answer is yes. The Cox model induces randomness as
where is exponentially distributed with mean 1. But then
A.4 A DGP with and being different
Let be Gamma distributed with mean , , and variance , and let and be independent Gamma distributed with mean and variance . Exposure is binary with . Generate data as follows: and Condition (5) holds, and the marginal Cox model is also correctly specified. But in this case , and it also easily seen that depends on . Different values of results in different values of Kendall’s thus controlling the correlation between and .