The additive hazard estimator is consistent for continuous time marginal structural models

Marginal structural models (MSMs) allow for causal interpretations of longitudinal data. The standard MSM is based on discrete time models, but the continuous time MSM is a conceptually appealing alternative for survival analysis. In particular, the additive hazard model allows for flexible estimation of treatment weights in continuous time MSMs. In applied analyses, it is often assumed that the theoretical treatment weights are known, but usually these weights are fundamentally unknown and must be estimated from the data. Here we provide a sufficient condition for continuous time MSM to be consistent even when the weights are estimated, and we show how additive hazard models can be used to estimate such weights. Our results suggest that continuous time weights perform better than discrete weights when the underlying process is continuous. Furthermore, we may wish to transform effect estimates of hazards to other scales that are easier to interpret causally, and here we show that a general transformation strategy also can be used on weighted cumulative hazard estimates. Finally we explain how this strategy can be applied on data using our R-package ahw.

Authors

• 2 publications
• 3 publications
• 3 publications
• Continuous and Discrete-Time Survival Prediction with Neural Networks

Application of discrete-time survival methods for continuous-time surviv...
10/15/2019 ∙ by Håvard Kvamme, et al. ∙ 0

• Joint marginal structural models to estimate the causal effects of multiple longitudinal treatments in continuous time with application to COVID-19

To draw real-world evidence about the comparative effectiveness of compl...
09/27/2021 ∙ by Liangyuan Hu, et al. ∙ 0

• Causal comparative effectiveness analysis of dynamic continuous-time treatment initiation rules with sparsely measured outcomes and death

Evidence supporting the current World Health Organization recommendation...
04/02/2019 ∙ by Liangyuan Hu, et al. ∙ 0

• Semiparametric estimation of structural failure time model in continuous-time processes

Structural failure time models are causal models for estimating the effe...
08/20/2018 ∙ by Shu Yang, et al. ∙ 0

• Covert Communication in Continuous-Time Systems

Recent works have considered the ability of transmitter Alice to communi...
01/20/2020 ∙ by Ke Li, et al. ∙ 0

• Two-Timescale Stochastic Gradient Descent in Continuous Time with Applications to Joint Online Parameter Estimation and Optimal Sensor Placement

In this paper, we establish the almost sure convergence of two-timescale...
07/31/2020 ∙ by Louis Sharrock, et al. ∙ 0

• Causal Navigation by Continuous-time Neural Networks

Imitation learning enables high-fidelity, vision-based learning of polic...
06/15/2021 ∙ by Charles Vorbach, et al. ∙ 1

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Outline

MSMs can be used to obtain causal effect estimates in the presence of confounders, which e.g. may be time-dependent [1]. The procedure is particularly appealing because it allows for a sharp distinction between confounder adjustment and model selection [2]: First, we adjust for observed confounders by weighing the observed data to obtain balanced pseudopopulations. Then, we calculate marginal effect estimates from these pseudopopulations based on our structural model.

Traditional MSM techniques for survival analysis have considered time to be a discrete processes [3]

. In particular, inverse probability of treatment weights (IPTWs) are used to create the pseudopopulations, and then e.g. several subsequent logistic regressions are fitted for discrete time intervals to mimic a proportional hazards model.

However, time is naturally perceived as a continuous process, and it also seems natural to analyse time-to-event outcomes with continuous models. Inspired by the discrete time MSMs, Røysland therefore suggested a continuous time analogue to MSMs [4]. Similar to the discrete MSMs, we have shown that the continuous MSMs can be used to obtain consistent effect estimates when the theoretical treatment weights are known [4]. In particular, the additive hazard regressions can be weighted with the theoretical continuous time weights to yield consistent effect estimates [5]. Nevertheless, the weights are usually unknown in real life and must be estimated from the data. To the best of our knowledge, the performance of MSM when the IPTW are estimated remains to be elucidated.

In this article, we show that continuous time MSM also perform desirable when the treatment weights are estimated from the data: We provide a sufficient condition to ensure that weighted additive hazard regressions are consistent. Furthermore, we show how such weighted hazard estimates may be consistently transformed to other parameters that are easier to interpret causally. To do this, we use stability theory of SDEs, which allows us to find a range of parameters expressed as solutions of ordinary differential equations; many examples can be found in

[5]. This is immediately appealing for causal survival analysis: First, we can use hazard models, that are convenient for regression modeling, to obtain weights. Estimates on the hazard scale are hard to interpret causally per se [6, 7, 8, 9], but we present a generic method to consistently transform these effect estimates to several other scales that are easier to interpret.

The continuous time weights and the causal parameters can be estimated in the R package ahw. We show that this ahw weight estimator, which is based on additive hazard models, satisfies the criterion that ensures consistency in Theorem 2. The ahw package makes continuous time marginal structural modeling simple to implement for applied researchers.

2.1. Motivation

We will present a strategy for dealing with confounding and dependent censoring in continuous time. Confounding, which may be time-varying, will often be a problem when analysing observational data, e.g. from health registries. Our goal is to assess the causal effect a treatment strategy on an outcome.

In general, we can describe processes in continuous time using local (in)dependence relations, and we may use local independence graphs to visualise these relations (a detailed description of local independence graphs are given in [4, see Figure 3.1]). The local independence graph we will focus on is

Heuristically, the time-dependent confounders and the exposure can influence the censoring process and event of interest . Moreover, the time-dependent confounders can both influence and be influenced by the exposure process. We include baseline variables, some of which may be confounders, in Section 2.2.

Such a graph can e.g. describe a follow-up study of HIV-infected subjects, where the initiation and adjustment of HIV treatment depend on CD4 count measurements over time [10]. The CD4 count is a predictor of future survival, and it is also a diagnostic factor that informs initiation of zidovudine treatment; a CD4 count below a certain threshold indicates that treatment is needed. The CD4 count will, in turn, tend to increase in response to treatment and is monitored over time to inform the future treatment strategy, making it a time-dependent confounder. In most follow-up studies there is a possibility of for subjects to be censored, and we allow the censoring to depend on the covariate and treatment history, as long as subjects are alive.

In [11] we analysed a cohort of Norwegian males diagnosed with prostate cancer, using the theory from this article to compare treatment effectiveness of radiation and surgery, even though time-dependent confounding were thought to be a minor issue. The continuous-time MSMs allowed us to estimate causal cumulative incidences on the desired time-scale, starting from the time of diagnosis, which shows that the continuous-time MSMs can also be a preferable choice in the absence of time-dependent confounding.

2.2. Hypothetical scenarios and likelihood ratios

Suppose we have observational event-history data where we follow i.i.d. subjects over time , and where we are interested in assessing effects an exposure has on some outcome. Let be counting processes that jump when treatment and outcome of interest respectively occur for subject . Furthermore, let be the at-risk process for and . We let be the collection of baseline variables that are not confounders, as well as the treatment and outcome processes. are the (time-dependent) confounders. For now we assume independent censoring; we will show how our methods can be applied in some scenarios with dependent censoring in Section 6.

We let denote the filtration that is generated by all the observable events for individual . Moreover, let denote the probability measure on that governs the frequency of observations of these events, and let denote the intensity for with respect to and the filtration .

We aim to estimate the outcome in a hypothetical situation where we had made a treatment intervention according to some specified strategy. Suppose that the frequency of observations we would have seen in this hypothetical scenario would be described by another probability measure on . Furthermore, we assume that all the individuals are also i.i.d. in this hypothetical scenario, and that , i.e. that there exists a likelihood-ratio

 Rit:=d~Pi|Fi,V0∪LtdPi|Fi,V0∪Lt

for each time . We will later describe how an explicit form of can be obtained. It relies on the assumption that the underlying model is causal, a concept we define in Section3

. For the moment we will not require this; we only require that

defines the intensity with respect to for both and ; that is, the functional form of is identical under both and .

Suppose that has an additive hazard with respect to and the filtration that is generated by the the components of . We stress that we consider the intensity process marginalized over , and thereby it is defined with respect to , and not . In other words, we assume that the hazard for event with respect to the filtration is additive, and can be written as

 (1) Xi⊺t−bt,

where

is a bounded and continuous vector valued function, and the components of

are covariate-processes or baseline-variables from .

Our main goal is to estimate the cumulative coefficient function in (1), i.e.

 (2) Bt:=∫t0bsds

from the observational data distributed according to . If we had known all the true likelihood-ratios , we could try to estimate (2) by re-weighting each individual in Aalen’s additive hazard regression [12, VII.4] according to its likelihood ratio. However, the true weights are unlikely to be known, even if the model is causal. In real-life situations, we can only hope to have consistent estimators for these weights. We therefore assume that we have -adapted estimates that converge to under relatively weak assumptions, such that Aalen’s additive hazard regression for the outcome re-weighted according to gives consistent estimates of the cumulative hazard we would have seen in the hypothetical scenario. The estimator we will consider is defined as follows: Let be the vector of counting processes and the matrix containing the ’s, that is,

 (3) N(n)t:=⎛⎜ ⎜ ⎜⎝N1,Dt⋮Nn,Dt⎞⎟ ⎟ ⎟⎠ and X(n)s:=⎛⎜ ⎜⎝X1,1s…X1,ps⋮⋮Xn,1s…Xn,ps⎞⎟ ⎟⎠,

and let denote the -dimensional diagonal matrix where the ’th diagonal element is . The weighted additive hazard regression is given by:

 (4) B(n)t:=∫t0(X(n)⊺s−Y(n),DsX(n)s−)−1X(n)⊺s−Y(n),DsdN(n)s.

2.3.1. Parameters that are transformations of cumulative hazards

It has recently been emphasized that the common interpretation of hazards in survival analysis as the causal risk of death during for an individual that is alive at , is often not appropriate; see e.g. [7]. A simple example in [8] shows that this can also be a problem in RCTs. If is a counting process that jumps at the time of the event of interest, is a randomized treatment, and is an unobserved frailty, the following causal diagram could describe the situation:

However, if we consider the probability of an event before , conditioning on no event at time , we condition on a collider that opens a non-causal path from to the outcome.

This could potentially have dramatic consequences since much of survival analysis is based on the causal interpretation of hazards, e.g. hazard ratios. In [5], we have suggested a strategy to handle this situation: Even if it is difficult to interpret hazard estimates causally per se, we may use hazard models to obtain other parameters that have more straightforward interpretations. Different from estimates on the hazard scale, population-based measures such as survival at time , cumulative incidences, restrictive mean survival functions, do not suffer from the built-in selection bias. Moreover, these measures, and many others (see [5, 13] for examples), solve differential equations driven by cumulative hazards, i.e. they are functions that can be written on the form

 (5) ηt=η0+∫t0F(ηs−)dBs,

where are cumulative hazard coefficients, and is a Lipschitz continuous matrix-valued function. In [5], we showed how to estimate by replacing coefficients in (5) with an estimator that can be written as a counting process integral. Examples of such include the Nelson-Aalen, or more generally Aalen’s additive hazard estimator. We could thus write down a stochastic differential equation for ,

 (6) η(n)t=η(n)0+∫t0F(η(n)s−)dB(n)s,

that is easy to solve on a computer; it is a piecewise constant, recursive equation that jumps whenever the integrator jumps. Hence, (6) can be solved using a for loop over the jump times of , i.e. the survival times of the population. A simple example of a parameter on the form (5) is the survival function, which reads , where is the cumulative hazard for death. Then, the estimation strategy (6) will yield the Kaplan-Meier estimator. Nevertheless, some common parameters cannot be written on the form (5), such as the median survival, and the hazard ratio.

Furthermore, we showed that provides a consistent estimator of if

• for every , i.e. the cumulative hazard estimator is consistent, and

• the estimator is predictably uniformly tight, abbreviated P-UT [5].

The additive hazard estimator satisfies both these criteria, and additive hazard regression can be used as an intermediate step for flexible estimation of several parameters, such as the survival, restricted mean survival, and cumulative incidence functions [5]. In Theorem 1, we show that also the re-weighted additive hazard regression satisfies these properties, which is a major result in this article. Thus, we can calculate causal cumulative hazard coefficients, and transform them to consistently estimate MSMs that solve ordinary differential equations. In Section 4.4 we illustrate how such estimation can be done, by assessing a marginal structural relative survival model.

The exact definition of P-UT is given in [14, VI.6a]. We will not need the full generality of this definition here. Rather, we will use [5, Lemma 1] to determine if processes are P-UT. It says that whenever is a sequence of semi-martingales on with Doob-Meyer decompositions

 J(n)t=∫t0ρ(n)sds+M(n)t,

where are square integrable local martingales and are predictable processes such that

 (7) lima→∞supnP(sups|ρ(n)s|1≥a)=0 and
 (8) lima→∞supnP(Tr⟨M(n)⟩T≥a)=0,

then is P-UT. Here, Tr is the trace of a matrix, and is the predictable variation.

2.4. Consistency and P-UT property

The consistency and P-UT property of introduced in Section 2.3 is stated as a theorem below. A proof can be found in the Appendix.

Theorem 1 (Consistency of weighted additive hazard regression).

Suppose that

1. The conditional density of given does not depend on ,

2.  EP[supt≤T|λ1,Dt|2]<∞ and EP[supt≤T|X1t|2]<∞
3. Let

 Γ(n)t:=(1nX(n)⊺t−Y(n),DtX(n)t−)=(1n∑nk=1R(k,n)t−Xk,it−Yk,DtXk,jt−)i,j,

and suppose that

4. Suppose that and are uniformly bounded and

 (9)

for every , and .

Then is P-UT and

 (10) limn⟶∞P(supt≤T∣∣B(n)t−Bt∣∣≥δ)=0,

for every .

Heuristically, condition states that, knowing individual ’s realization of the variables and processes in up to time , no other information on individual is used for estimating the ’th weight at . Condition ensures that the number of outcome events will not blow up, or suddenly grow by an extreme amount. Condition essentially means that there can be no co-linearity among the covariates, more precisely that the inverse matrix of has to be uniformly bounded in . Condition states that the weight estimator converges to the theoretical weights , in a not very strong sense. The uniform boundedness of is a positivity condition similar to the positivity condition required for standard inverse probability weighting.

3. Causal validity and a consistent estimator the individual likelihood-ratios

will now see that we can model the individual likelihood ratio quite explicitly in many situations when the underlying model is causal. Suppose that each individual is represented by the outcomes of baseline variables , and counting processes on . Moreover, let denote the filtration that is generated by all their possible events before .

Suppose that are the intensities of the counting processes with respect to the filtration and the observational probability . Now, by [15], is uniquely determined by all the intensities and the conditional densities at baseline of the form , as the joint density at baseline factorises as a product of conditional densities.

Suppose that the hypothetical scenario, where the frequency of events is described by , is subject to an intervention on the component represented by process . Our model is said to be causal if the intervention would not have changed the ’local characteristics’ for the remaining nodes. More precisely this means that

• The functional form of the intensities on which we do not intervene, coincide under and , i.e. would also define the intensity for with respect to when , and

• The conditional density of , given would be the same with respect to both and , i.e.

 dP(Qk|Qk−1,…,Q1)=d~P(Qk|Qk−1,…,Q1).

If the intervention instead was targeted at a baseline-variable, say , and this intervention would replace by , then the model is said to be causal if

• The intensity process for each with respect to and coincide, and

• The remaining conditional densities at baseline coincide, i.e.

 dP(Qk|Qk−1,…,Q1)=d~P(Qk|Qk−1,…,Q1),

for .

Note that the latter is in agreement with Pearl’s definition of a causal model [16].

It was realized that this notion of causal validity leads to an explicit formula for the likelihood ratio. If the intervention is aimed at the intensity , then the likelihood ratio takes the form

 (11) Rt=(∏s≤tθΔNjss)exp(∫t0λjs−~λjsds),

where , see [4] and [15].

If the intervention was targeted at a baseline variable, the likelihood ratio would correspond to the ordinary propensity score

 (12) R0:=d~P(Qj|Qj−1,…,Q1)dP(Qj|Qj−1,…,Q1).

Interventions on several nodes yield a likelihood ratio that is a product of terms on the form (11) and (12). The terms in the product could correspond to baseline interventions, time-dependent treatment interventions, or interventions on the censoring intensity. It is natural to estimate the likelihood ratio, or weight process by a product of baseline weights, treatment weights, and censoring weights. We stress that this known form of the likelihood ratio rests on the assumption that the model is causal. We want, of course, to identify the likelihood ratio that corresponds to , as this is our strategy to assess the desired randomized trial. Following equations (11) and (12), we see that the intervened intensities and baseline variables must be modeled correctly, and specifically, a sufficient set of confounders must be included. Additionally, the MSM must be correctly specified. An important result in this paper is that a class of such MSM parameters that solve ODEs driven by cumulative hazards can be modeled.

As long as the intervention acts on a counting process or a baseline variable, the same formula would hold in much more general situations where the remaining covariates are represented by quite general stochastic processes. The assumption of ’coinciding intensities’ must then be replaced by the assumption that the so-called ’characteristic triples’, see [14, II.2], a generalization of intensities to more general processes coincides for and .

3.1. Estimation of weights using additive hazard regression

Suppose we have a causal model as described in the beginning of Section 3, allowing us to use (11) to obtain the form of . The true likelihood ratio, however, is unlikely to be known, so in order to model the hypothetical scenario, we need to rely on estimates of the weight process. In the following, we will only focus on a causal model where we replace the intensity of treatment by , the intensity of with respect to and the filtration . It is a consequence of the innovation theorem [12] that . Moreover, an exercise in asymptotics of stochastic processes shows that if we discretize time, then the associated marginal model structural weights from [1] approximate (11) gradually as the time-resolution increases.

In the following, we will not follow the route of [1] to estimate , but instead we will use that (11) is the unique solution to the stochastic differential equation

 Rit =Ri0+∫t0Ris−dKis Kit =∫t0(θis−1)dNi,As+∫t0λi,Asds−∫t0~λi,Asds.

Here, we have suppressed the superscript in to keep the notation tidy. To proceed we assume that and satisfy the additive hazard model, i.e. that there are vector-valued functions and , and covariate processes and that are adapted to and respectively, and

 (13) λi,At=Yi,AtZi⊺tht and ~λi,At=Yi,At~Zi⊺t~ht.

Now, the previous equation translates into the following:

 Rit =Ri0+∫t0Ris−dKis Kit =∫t0(θis−1)dNA,is+∫t0Yi,AsZi⊺sdHs−∫t0Yi,As~Zi⊺sd~Hs,

where and .

Our strategy now is to replace , , and by estimators. This gives the following stochastic differential equation:

 (14) R(i,n)t =R(i,n)0+∫t0R(i,n)s−dK(i,n)s K(i,n)t =∫t0(θ(i,n)s−−1)dNi,As+∫t0Yi,AsZi⊺s−dH(n)s−∫t0Yi,As~Zi⊺s−d~H(n)s,

The quantity is assumed to be a consistent estimator of . We will use the additive hazard regression estimators and for estimation and , [12]. Moreover, suppose that is a consistent estimator for , the intensity ratio evaluated at zero. Our candidate for , when , depends on the choice of an increasing sequence with such that . We see below that can be interpreted as a smoothing parameter. The estimator takes the form

 (15) θ(i,n)t =⎧⎪ ⎪⎨⎪ ⎪⎩θ(i,n)0,0≤t<1/κn∫tt−1/κnYis~Zi⊺s−d~H(n)s∫tt−1/κnYisZi⊺s−dH(n)s,1/κn≤t≤T.

We let be the diagonal matrix where the ’th diagonal element is . The following theorem says that the above strategy actually works out.

Theorem 2.

Suppose that

1. Each is uniformly bounded and right-continuous at .

2. For each ,

 (16) limδ→0P(inft≤T|~Zi⊺t~ht|≤δ)=0,
3. and for every

4. and

Then we have that

 (17) limn→∞P(supt≤T|R(i,n)t−Rit|>δ)=0

for every and .

For Theorem 1 to apply we need that our additive hazard weight estimator and the likelihood-ratio are uniformly bounded. The latter will for instance be the case if both and are uniformly bounded. We will however not make any restrictions on the outcome intensities in this regard. We will only assume that the theoretical weights are uniformly bounded. In that case we can also make our weight estimator uniformly bounded, by merely truncating trajectories that are too large.

4. Example

4.1. Software

We have developed R software for estimation of continuous time MSMs that solve ordinary differential equations, where additive hazard models are used both to model the time to treatment and the time to the outcome. Our procedure involves two steps: First, we estimate continuous-time weights using the fitted values of the treatment model. These weights can be used to re-weight the sample for estimating the outcome model. Second, we take the cumulative hazard coefficients of the fitted (causal) outcome model and transform them for estimating other ODE parameters that are easier to interpret. The two steps can be performed using the R packages ahw and transform.hazards, both of which are available on the repository github.com/palryalen. Below, we show an example of how to use the packages on simulated data.

4.2. A simulation study

We simulate an observational study where individuals are at risk of a terminating event , in a way that depends additively on treatment and covariate process . and are counting processes that jump from 0 to 1 for an individual at the instant treatment is initiated, or the covariate changes. The subjects receive treatment depending on , such that it is a time-dependent confounder. The subjects can move into treatment, while the subjects in the group may receive treatment or move to the group in any order. All subjects are at risk of experiencing the terminating event. We want to assess treatment effects when the treatment regime is randomized, i.e. when treatment initiation does not depend on . Mathematically, the following data-generating hazards for and are utilized:

 αDt =αD|0t+αD|AtAt−+αD|LtLt−+αD|A,LtAt−Lt− αAt =αA|0t+αA|LtLt− αLt =αL|0t+αL|AtAt−.

The treatment effect in the randomized scenario is not easily found, and a weighted analysis is required.

We remark that our simple simulation scenario could be made more complicated by e.g. allowing the subjects to move in and out of treatment, or have recurrent treatments. We could also have included a dependent censoring process, and re-weighted to a hypothetical scenario in which censoring was randomized using the censoring weights in Section 6.

4.3. Weight calculation using additive hazard models

We assume that the longitudinal data are organized such that each individual has multiple time-ordered rows, one row for each time either , or changes.

Our goal is to convert the data to a format suitable for weighted additive hazard regression. Heuristically, the additive hazard estimates are cumulative sums of least square estimations evaluated at the event times in the sample. The main function will therefore need to do two jobs; i) the data must be expanded such that every individual, as long as he is still at risk of , has a row for each time occurs in the population, and ii) each of those rows must have an estimate of his weight process evaluated at that event time.

Our software relies on the aalen function from the timereg package. We fit two additive hazard models for the transition from untreated to treated. The first model assesses the transitions that we actually observe, i.e. where treatment status depends on the individual covariate histories. Here, we use the true, data generating hazard model for treatment initiation, i.e. an additive hazard model with intercept, and as a covariate. The second model describes the transitions under the hypothetical randomized trial where each individual’s treatment initiation time is a random draw of the treatment initiation times in the population as a whole. The treatment regime in our hypothetical trial is given by the marginal treatment initiation intensity of the study population, and can be estimated using the Nelson-Aalen estimator for the time to treatment initiation, by calling a marginal aalen regression.

In this way we obtain a factual and a hypothetical aalen object that are used as inputs in our makeContWeights function. Other input variables include the bandwidth parameter used in (15), weight truncation options, and an option to plot the weight trajectories.

The output of the makeContWeights function is an expanded data frame where each individual has a row for every event time in the population, with an additional weight column containing time-updated weight estimates. To do a weighted additive hazard regression for the outcome, we will use the aalen function once again. Weighted regression is performed on the expanded data frame by setting the weights argument equal to the weight column.

When the weighted cumulative hazard estimates are at hand, we can transform our cumulative hazard estimates as suggested in Section 2.3.1, to obtain effect measures that are easier to interpret. This step can be performed using the transform.hazards package; see the GitHub vignette for several worked examples.

4.4. A marginal structural model

We assume that treatment acts additively on the event hazard, and the model for the event hazard takes the form

 β(t|A)=β0t+βAtAt−.

A straightforward regression analysis of the observational data would not yield causal estimates, since the association between treatment and outcome is confounded by

. Using the ideas from Section 2, however, we can estimate the cumulative coefficients and consistently by performing a weighted additive hazard regression.

Cumulative hazards, however, are not easy to interpret. Therefore, we assess effects on the survival scale, using a marginal structural relative survival model. In this example, our marginal structural relative survival solves

 (18) RSA=at=1+∫t0(−RSA=asRSA=as)d(BA=asBA=0s).

The quantity can be interpreted as the survival probability a subject would have if he were exposed at time 0, relative to the survival he would have if he were never exposed. Our suggested plugin-estimator is then obtained by inserting the estimated causal cumulative coefficients, i.e. the weighted estimates and :

 ^RSA=at =1+∫t0(−^RSA=as−^RSA=as−)d⎛⎝^BA=as^BA=0s⎞⎠.

4.5. Simulation details and results

We simulate subjects that are untreated at baseline. Initially, all the patients start with , and the hazards for transitioning from one state to another is constant. As described in Section 4.3, we fit additive hazard models for the time to treatment initiation, one for the observed treatment scenario and one for the hypothetical randomized scenario. These models are inserted into makeContWeights to obtain weight estimates. Finally, we estimate the additive hazard model by calling the aalen function where the weights option is set equal to the weight column in the expanded data set.

We make comparisons to the discrete-time, stabilized inverse probability of treatment weights, calculated using pooled logistic regressions. The study period is discretized into

equidistant subintervals, and time is included as a categorical variable in the regressions. We fit two logistic regressions; one for the weight numerator, regressing only on the intercept and categorical time variable, and a covariate-dependent model for the weight denominator, regressing on the intercept, the categorical time variable, and the time-updated covariate process. We then calculate weights from the predicted probabilities of the above two model fits, using the cumulative product formula

[1, eq. (17)].

In the upper three rows of Figure 1 we display estimates of the causal cumulative hazard coefficient, i.e. estimates of , for a range of sample sizes. Compared to discrete weight estimators, our continuous-time weight estimator (14) gives better approximations to the curves estimated with the theoretical weights. In the lowest row of Figure 1 we plot , i.e. transformed estimates of the re-weighted cumulative hazard coefficients for the different weighting strategies.

5. Performance

In Figure 2 we plot mean weight estimates based on aggregated simulations of the set-up in Section 4. The plot suggests that the discrete weights gradually approximate the continuous likelihood ratio as the time discretization is refined. However, the continuous time weights are closer to the expected value of 1 at all times , indicating less bias.

Choosing the bandwidth parameter will influence the weight estimator, and weighted additive hazard estimator in a bias-variance tradeoff; a small

will yield estimates with large bias and small variance, while a large will give rise to small bias but large variance. It is difficult to provide an exact recipe for choosing the bandwidth parameter, since a good choice depends on several factors, such as the sample size, the distribution of the treatment times, as well as the form and complexity of the true treatment model: If the true treatment hazard is constant, a small is appropriate. If the treatment hazard is highly time-varying, should be chosen to be large, depending on the sample size. Heuristically, several treatment times in the interval for each

would be desirable, but this is not possible in every situation, e.g. when the treatment time distribution is skewed. Such distributions can lead to unstable, and possibly enormous weights for some subjects, even if the chosen bandwidth parameter is a good choice for most other subjects. One option is to truncate weights that are larger than a specified threshold, at the cost of introducing bias. We can assess sensitivity with respect to the bandwidth choice by performing an analysis for several bandwidth values, truncating weights if necessary, and comparing the resulting weighted estimators. This approach was taken in

[11, See Supplementary Figure 4], where no noticeable difference was found for four values of .

We inspect the bias and variance of our weight estimator for sample sizes under four bandwidth choices , at a specified time . By aggregating estimates of samples for each we get precise estimates of the bias and variance as a function of for each choice. The bandwidth functions are scaled such that they are identical at the smallest sample , with . Otherwise they satisfy and

We simulate a simple scenario where time to treatment initiation depends on a binary baseline-variable, such that for individual with at-risk indicator

. We calculate weights that re-weights to a scenario where the baseline variable has been marginalized out, i.e. such that is marginal. Utilizing the fact that the true likelihood ratio has a constant mean equal to 1, we can find precise estimates of the bias and variance of the additive hazard weight estimator (14) at time by choosing a large . We plot the bias and variance of the weight estimator as a function of under the strategies and in Figure 3. We see that the convergence strategy yields a faster relative decline in bias, but a higher variance as the sample size increases. Meanwhile, the strategy has a slower decline in bias, but a smaller variance than the other strategies. Finally, the strategy and lies mostly between and both concerning bias and variance, as a function of the sample size. We also see empirical justification for the requirement , as the variance under the strategy is declining very slowly when is increased.

6. Censoring weights

Most standard martingale-based estimators in survival analysis are consistent when we have independent censoring, see [12, III.2.1]. We have assumed independent censoring, when conditioned on . A likely situation where this is violated is when we have independent censoring when conditioned on , but if we only condition on , then we have dependent censoring. If the model is causal with respect to an intervention that randomizes censoring, then we can model the scenario where this intervention had been applied, and censoring is independent when only conditioning with respect to . This means that many estimators that are common in survival analysis will be consistent. Suppose that is a counting process that jumps when the individual is censored. Moreover, let denote the intensity of with respect to the filtration , and let denote its intensity of with respect to the filtration .

Suppose that there is a meaningful intervention that would give a scenario with frequencies that are governed by and its intensity for censoring with respect to , is replaced by , while the remaining intensities are the same. If the model is causal, then the corresponding likelihood ratio process is given by

 (19) Ri,ct=∏s≤t(~λi,csλi,cs)ΔNi,csexp(−∫t0~λi,cs−λi,csds).

However, as we only need to apply weights to observations strictly before the time of censoring, we only need to apply the weights:

 (20) Ri,ct=exp(−∫t0~λi,cs−λi,csds).

Note that this process is a solution to the equation

 (21) Ri,ct=1+∫t0Ri,cs(λi,cs−~λi,cs)ds.

Furthermore, we assume additive hazard models, i.e. that

 (22) λct=Yi,cUi⊺s−gt and ~λi,ct=Yi,c~Ui⊺s−~gt,

for an -adapted covariate process , and an -adapted covariate process , and vector valued functions and . Following Theorem 2, we see that these weights are consistently estimated by defined by the equation:

 R(i,n,c)t =1+∫t0R(i,n,c)s−dK(i,n,c)s K(i,n,c)t =∫t0Yi,csUi⊺s−dG(n)s−∫t0Yi,cs~Ui⊺s−d~G(n)s,

where and are the usual additive hazards estimates.

7. Discussion

Marginal structural modeling is an appealing concept for causal survival analysis. Here we have developed a theory for continuous time MSMs that may motivate the approach for practical research. Indeed, we show that the continuous-time MSM yields consistent effect estimates, even if the treatment weights are unknown and therefore must be estimated from the data. Our continuous time weights seem to perform better than discrete time weights when we study processes that develop in continuous time. Furthermore, these weights can be obtained from additive hazard models, which are easy to fit in practice. Importantly, we also show that causal effect estimates on the hazard scale, e.g. weighted cumulative hazard functions, can be transformed consistently to other parameters that are easier to interpret causally. Thereby we offer a broad strategy to obtain marginal effect estimates for time to event outcomes. Previously, [17] and [18] derived results on weighted additive hazard regression, but they do not cover our needs, as our weights are estimates of likelihood-ratios with respect to filtrations that are larger than the filtration for the additive hazard that we want to estimate.

Estimators of IPTWs may be unstable and inefficient, e.g. when there are strong predictors of the treatment allocation. In practice, applied researchers will often face a bias-variance tradeoff when considering confounder control and efficient weight estimation. This bias-variance tradeoff has been discussed in the literature, and weight truncation has been suggested to reduce the variance, at the cost of introducing bias; see e.g. [19]. Similar to IPTWs, and for the same reasons, our continuous-time weight estimator may be instable; proper weight estimation requires a delicate balance between confounder control and precision in most practical situations.

We have considered the treatment process to be a time to event variable, but our strategy can be generalized to handle recurrent, or piecewise constant exposures. If is allowed to have multiple jumps the estimation procedure becomes more complex, but the same estimators (4) and (14) can be used with a few modifications. We think, however, that many important applications can be explored assuming that is the time to an event.

A different approach that accounts for time-dependent confounding is the structural nested model, which parameterises treatment effects directly in a structural model [20]. While this procedure avoids weighting, and will often be more stable and efficient, it relies on other parametric assumptions and can be harder to implement.

We conjecture that there is a similar consistency result as Theorem 1 when the outcome model is a weighted Cox regression. However, using a Cox model in the hypothetical scenario after marginalization leads to restrictions on the data generating mechanisms that are not properly understood, see e.g. [21]. This issue is related to the non-collapsibility of the Cox model, and it is a problem regardless of the weights being used are continuous or discrete.

8. Funding

The authors were all supported by the research grant NFR239956/F20 - Analyzing clinical health registries: Improved software and mathematics of identifiability.

Appendix: proofs

We need some lemmas to prove Theorem 1.

Lemma 1.

Suppose that are processes on such that , then

 (23) lima→∞supnP(sups∣∣1nn∑i=1Vis∣∣≥a)=0.
Proof.

By Markov’s inequality, we have for every that

which proves the claim.

Lemma 2 (A perturbed law of large numbers).

Suppose

1. , ,

2. , such that is i.i.d., and , are measurable with respect to a -algebra ,

3. Triangular array such that

 (24) limn⟶∞P(|S(1,n)−S1|≥ϵ)=0

for every , and there exists a such that for every ,

4. The conditional density of given does not depend on .

This implies that

 (25) limn⟶∞E[∣∣∣1nn∑i=1S(i,n)Vi−EP[S1V1]∣∣∣]=0.
Proof.

From the triangle inequality and condition we have that

 E[∣∣∣1nn∑i=1S(i,n)Vi−1nn∑i=1SiVi∣∣∣]≤1nn∑i=1E[∣∣(S(i,n)−Si)Vi∣∣] = E[∣∣(S(1,n)−S1)V1∣∣].

The dominated convergence theorem implies that the last term converges to

. Finally, the weak law of large numbers and the triangle inequality yields

 limn⟶∞E[∣∣∣1nn∑i=1S(i,n)Vi−EP[S1V1]∣∣∣] ≤

Lemma 3.

i.i.d. non-negative variables in , then

 (26) limn→∞P(1nmaxi≤nVi≥ϵ)=0

for every .

Proof.

Note that

 P(1nmaxi≤nVi>ϵ) =1−P(maxi≤nVi≤ϵn)=1−P(V1≤ϵn)n =1−(1−P(V1>ϵn))n

If , we therefore have by Chebyshev’s inequality that

 P(1nmaxi≤nVi>ϵ)≤1−(1−E[V21]n2ϵ2)n,

where the last term converges to when since for every . ∎

Lemma 4.

Define , where is the ’th row of . If the assumptions of Theorem 1 are satisfied, then

 (27) limn⟶∞P(supt∣∣∣∫t0Γ(n)−11nn∑i=1R(i,n)s−Xi⊺s−(λi,Ds−γis)ds∣∣∣≥δ)=0

for every .

Proof.

Assumption from Theorem 1 and Lemma 1 implies that

 (28) limJ→∞infnP(supt∣∣Γ(n)−1t1nn∑i=1R(i,n)t−Xi⊺t−(λi,Dt−γit)∣∣>J)=0.

Moreover, Lemma 2 implies that

 1nn∑i=1R(i,n)t−Xi⊺t−(λi,Dt−γi