A novel approach to estimate the Cox model with temporal covariates and its application to medical cost data

by   Xiaoqi Zhang, et al.
University at Buffalo

We propose a novel approach to estimate the Cox model with temporal covariates. Our new approach treats the temporal covariates as arising from a longitudinal process which is modeled jointly with the event time. Different from the literature, the longitudinal process in our model is specified as a bounded variational process and determined by a family of Initial Value Problems associated with an Ordinary Differential Equation. Our specification has the advantage that only the observation of the temporal covariates at the time to event and the time to event itself are required to fit the model, while it is fine but not necessary to have more longitudinal observations. This fact makes our approach very useful for many medical outcome datasets, like the New York State Statewide Planning and Research Cooperative System and the National Inpatient Sample, where it is important to find the hazard rate of being discharged given the accumulative cost but only the total cost at the discharge time is available due to the protection of patient information. Our estimation procedure is based on maximizing the full information likelihood function. The resulting estimators are shown to be consistent and asymptotically normally distributed. Variable selection techniques, like Adaptive LASSO, can be easily modified and incorporated into our estimation procedure. The oracle property is verified for the resulting estimator of the regression coefficients. Simulations and a real example illustrate the practical utility of the proposed model. Finally, a couple of potential extensions of our approach are discussed.



page 1

page 2

page 3

page 4


Efficient Estimation For The Joint Model of Survival and Longitudinal Data

In survival studies it is important to record the values of key longitud...

A simulation-based approach to estimate joint model of longitudinal and event-time data with many missing longitudinal observations

Joint models of longitudinal and event-time data have been extensively s...

Kernel estimation of bivariate time-varying coefficient model for longitudinal data with terminal event

We propose a nonparametric bivariate time-varying coefficient model for ...

Calibrated regression estimation using empirical likelihood under data fusion

Data analysis based on information from several sources is common in eco...

On Selection of Semiparametric Spatial Regression Models

In this paper, we focus on the variable selection techniques for a class...

Cox regression analysis for distorted covariates with an unknown distortion function

We study inference for censored survival data where some covariates are ...

Adaptive Covariate Acquisition for Minimizing Total Cost of Classification

In some applications, acquiring covariates comes at a cost which is not ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the proportional hazards model (Cox (1972); Andersen and Gill (1982)), the hazard function of the event time takes the form


where is the conditional hazard function of given the

covariate vector

, is an unspecified baseline hazard function and is a vector of unknown regression coefficients. Although in the original paper, the covariate is viewed as random vectors independent from time , the model (1.1) can be easily extended to the case where is time-dependent and is assumed to be given through an unknown stochastic process.

The main-stream procedure used to estimate the model (1.1) is the maximum partial likelihood (MPL) procedure, which applies well whether or not the covariates are time-dependent. However, when time-dependent covariates being involved, the consecutive observation of the covariates is required in the sense that every subject must have its covariates observed at all the failure time prior to its own failure. In the other words, let be the observed failure time for two different subjects and , then for the dying later subject , its covariate must have values observed at both of and . Otherwise, the MPL procedure won’t work. Although some approximation methods were proposed to relax consecutive observation requirement as discussed in Andersen (1992), the MPL procedure can’t be applied effectively to the medical cost datasets, like the New York State’s Statewide Planning and Research Cooperative System (SPARCS), or the National Inpatient Sample (NIS), where it is important to find the hazard rate of being discharged given the accumulative cost while only the total cost for each inpatient observed at the discharge time is available.

In this paper, we shall propose a novel estimation procedure for the model (1.1), which can generate consistent estimates for parameters and the baseline hazard in the model (1.1) even if only the observations are available, where all the observations with are missing. Our procedure is based on joint modelling the longitudinal process that generates the time-dependent covariates and the time to event. The topic of joint model has been widely discussed (Henderson et al. (2000); Song et al. (2002); Hsieh et al. (2006); Ye et al. (2008); Rizopoulos (2011); Kim et al. (2013); Lawrence Gould et al. (2015)). A comprehensive review is also available in Tsiatis and Davidian (2004); Sousa (2011); Ibrahim et al. (2010). However, all these works require the condition that the number of observations of the longitudinal measurement before the time to event is greater than the dimension of the longitudinal process, which restricts their usefulness for the input data with only one observation of the covariate value at the event time. In this paper, we propose an alternative specification of the joint model. Formally, we only assume that the longitudinal measurements follow a bounded variational process that can be expressed as a stochastic integral as below:


where represents the initial value and . The model (1.2) consists of two components:

(1) the longitudinal process:


which characterizes the evolution of the longitudinal measurements, we assume the conditional expectation of the increment rate has the following parametric form


(2) the event process which determines the time to event and its conditional expectation has the form (1.1), i.e.


It turns out by Zhang and Ringland (2017), combining (1.4) and (1.5) yields a complete specification of the joint model in the sense that if two joint models share a common pair of the conditional hazard function (1.5) and the conditional expectation (1.4), all the distributions in interest arising from the two joint models are identical. The equation (1.4

) is the key to derive an explicit expression of the joint probability density function (pdf) of

and which helps design our estimation procedure. To our best knowledge, there has not been any previous works attempting to specify the longitudinal process as (1.4). We hope our work could provide some hints to the future development of this field.

In model (1.4) and (1.5), there are three sets of parameters, , and . Among them, has infinite dimension. Our procedure will estimate the three kinds of parameters through maximizing the full information likelihood function, where the likelihood is constructed from the joint probability density function (pdf) of and . By Zhang and Ringland (2017), this joint pdf is expressed as below by using the function and :


where the function is the time-dependent pdf induced by the longitudinal process and by Zhang and Ringland (2017) it can be expressed as


The function is the initial pdf induced by and for every , denotes the Jacobian of the function evaluated at the point . The function is solely determined by through solving a family of initial value problems (IVPs). Namely for every fixed and , is the solution to the following ordinary differential equation (ODE) for :


subject to the initial condition .

In addition to estimating model parameters, in practice it is also important to select the significant covariates. Variable selection approaches has been extensively studied by many authors. The least absolute shrinkage and selection operator (LASSO) was presented by Tibshirani (1996). Fan and Li (2001) developed the nonconcave penalized approach (SCAD) for variable selection which applies to likelihood-based estimation procedures, including the MPL procedure for the Cox model (Fan and Li (2002)). Zou (2006) developed an Adaptive LASSO approach and showed its oracle property under a general set of conditions. The estimation procedure proposed in the current paper can be easily combined with those variable selection approaches. In particular, we will incorporate a modified version of the Adaptive LASSO into our procedure and verify its oracle property.

The rest of this paper is organized as follows. In Section 2, we will sketch the estimation procedures in detail. The large sample properties of resulting estimators are stated in Section 3. Simulation results and the application to real world data are presented in Section 4. Section 5 discusses some extensions of our model and concludes. All proofs are collected in Appendix.

2 Estimation Procedure

The estimation procedure is based on maximizing the full information likelihood function which is formed as the product of joint pdf of the failure time and the observation of the longitudinal measure at time . To deal with the non-parametric , we adopt the method that approximates through a sequence of finite dimensional step-wise functions, denoted as , with the number of steps given by the sample size .

Throughout this section, we assume the data input for the estimation procedure only has the observation at the time to event, such as , while in the remark section, we will briefly discuss the adjustment of our procedure to deal with the case where more longitudinal observations are available.

2.1 Likelihood Function

Define as the domain of all possible values of the parameter , as the true parameter. Similarly, Define as the domain of , as the true parameter. For every fixed , define as the solution trajectories to the IVPs (1.8) conditional on . When the analytic form of is not available, we can use its numerical approximation in place. There are many efficient numerical solvers to the IVPs (1.8). In this paper, we pick up the Euler’s method for the purpose of being simple and illustrative. Similarly, write as the Jacobian of and when necessary it can be replaced by its numerical version.

By Zhang and Ringland (2017), the initial pdf is uniquely determined by the function (1.2) and the joint pdf (1.6). In particularly, given the joint pdf (1.6), there is a well defined map from the parameter space to the space of the pdfs over . Therefore, given the input data and a fixed parameter , we can estimate by the Gaussian kernel density method as below:


where denote the Gaussian kernel function with kernel width . In this paper, we simply select the kernel width as as can guarantee the function (2.1) converges to in the norm for all .

For the baseline hazard , without loss of generality, we set , and let be the ordered statistics of the observed failure time. A step-wise version of the non-parametric baseline hazard is constructed as below:


where is a set of parameters to be estimated. For each profile of the parameters , we can define the log likelihood as below:


The estimator resulting from maximizing (2.3) is denoted as and and .

Remark 2.1.

The first order condition of the optimization problem (2.3) indicates the relation


at the optimal point and and . Relation (2.4) can be inserted as a set of constraints back into the optimization problem (2.3), which helps sharply reduce the dimension of the original problem.

2.2 Variable Selection

A penalty function can be naturally incorporated into the estimation and determine the non-zero component of the coefficients automatically. There exist many types of penalty functions in the literature, such as LASSO (Tibshirani (1996)), adaptive LASSO (Zou (2006)), SCAD (Fan and Li (2001)), MCP. For convenience of calculation, we choose the adaptive LASSO penalty function defined by:


where is the -th component of the the vector , is a consistent estimator of (say the estimator generated by maximizing Eq. (2.3)), is a tuning parameter and (for simplicity of calculation, we let throughout the paper).

The penalty function is added into the likelihood function (2.3) and forms the following maximization problem (2.3):


where the choice of tuning parameter depends on the sample size . Denote , and ( with being the solution to (2.6)) as the estimator corresponding to maximize Eq. (2.6).

Remark 2.2.

In the original work (Zou (2006)), a wide range of candidate values can be selected for the tuning parameter as long as the asymptotic property and hold. In practice, there are multiple ways to select the value of such as the Bayesian criterion method which is based on iterative calculation of the values of the Bayesian criterion and the likelihood function. Instead of using the iterative procedure, we simply set in order to reduce the computation load.

Remark 2.3.

The original adaptive LASSO method is designed for the OLS procedure (Zou (2006)), but it turns out this method applies very well to the likelihood-based estimation procedures as ours. The oracle property of the adaptive LASSO estimator, , will be proved in Appendix (3.3).

3 Large Sample Properties

The consistency and asymptotic normality of the estimators, , and , constructed in section 2.1 will be established in this section. We will also show the oracle property of the adaptive LASSO estimator and the consistency and asymptotic normality of and .

3.1 Large Sample Properties of , and

Let be a given profile of parameter and in particular, be the true parameter. Define


as a random variable with

and being the random variables following the joint pdf (1.6) associated with , denoted as . The following technical conditions are needed for the consistency result:

. The domain and are compact. has open interior with . The domain of , denoted as , has and is a set of uniformly bounded right-continuous functions satisfying that for all (which means and under the weak- topology, the closure of is compact).

. , , are finite for all and all ; and the matrix and are positive definite.

. There exists a positive function such that for all , almost surely with respect to the probability measure .

. For all , and the map given through is continuous with respect to the topology.

. For every , there is an matrix , such that as ( is the Euclidean Norm of a vector). And for different and ,

has at least one eigenvalue with non-zero real part.

. has the full support . The true initial satisfies that for every .

Condition - are standard for the consistency and asymptotic normality of maximum likelihood estimators. is the regularity condition that guarantees the trajectories depends on smoothly. is the key to guarantee the identification of the model (1.1) and (1.4), although it turns out that can be discarded without any impact on the consistency of and , and both of and can be discarded when the event time and the longitudinal process satisfy a kind of Markovian property and the extra longitudinal observations are available. We will go back to these extensions in the section (3.2).

Theorem 3.1.

Under Condition and , model (1.1) and (1.4) are identifiable. And has the unique maximal point, . In addition, if holds, is continuous with respect to the variable .

Theorem 3.2.

(1). Under -, the estimator , are consistent and ;

(2). the estimator converges to according to the weak- topology and converges weakly to a Gaussian Process;

Theorem 3.3.

Under , the estimator and has the same properties as and as stated in Theorem 2, and the estimator has the following oracle property:

(1). denote as the set of indices with for and as the set of indices with for , then for all and ;

(2). denote , and , .

3.2 Extension

When the longitudinal observations are available at the observation time before failure occurs at , i.e. the input data has the form with for . A two-step procedure can be applied to estimate the parameter , and the resulting estimator turns out to be consistent and have asymptotically normal distribution even without the assumption and . Instead, the following Markovian-style condition are required:


where is the instantaneous variational rate of the longitudinal process as specified in model model (1.3). Eq. (3.2) implies that the conditional mean trajectory that reaches a given realization, , at the observational time won’t be affected by whether or not the event has already occurred. Formally, the two-step algorithm is stated as following:

Step 1: estimate the parameter through minimizing the empirical mean of the distance between the empirical longitudinal trajectories observed for each individual and the theoretical mean trajectories passing through the point :


It turns out when , the estimator of solving the problem (3.3) is consistent.

Step 2: replace by and maximize the likelihood function (2.3) or (2.6) for the parameter and .

It turns out that the resulting estimators do have the same properties as stated in Theorem (3.2) or (3.3). The two-step procedure separates the estimation of from the estimation of and . Thanks to this separation, in the second step, the initial pdf and the Jacobian can be completely removed from the likelihood function (2.3) or (2.6) because they only depends the parameter and the fixed underlying true distribution. In the other words, once if the parameter is replaced by its first-step estimator , the component of becomes constants, and can be deleted from the second-step maximization problem without any impact on the final estimators. As a consequence, there is no need to calculate the Jacobian , which makes the two-step procedure running much faster than the original procedure because the computation of the Jacobian is the most time-consuming part.

Due to the fact that the conditional density function


has already had the full support , and the second-step optimization is equivalent to the optimization of a conditional log-likelihood function formed by the sum of for all ’s, the full support condition in can be relaxed.

In addition, from the proof of the theorem (3.1), it is clear that the main difficulty to achieve the injectivity of the map from the parameter space to the space of all joint pdfs lies in the exclusion of the possibility that different could lead to the same joint pdf, which is exactly the condition and designed for. In contrast, when is given, the identification of and becomes trivial and doesn’t require any further conditions like and . So when the estimation of can be separated out, and are redundant.

Proof of the validity of the two-step procedure is in Appendix (A.4). A latent assumption behind the proof is that the observational time is uninformative and the total number of observations, , is at least for all subjects. Unlike the joint model discussed in Tsiatis and Davidian (2004), we don’t have to assume greater than the dimension of the covariates. This fact makes our two-step procedure more attractive to the scenarios where there are only a few longitudinal observations but a large set of covariates.

4 Numerical Studies

4.1 Simulation Studies

In this section, simulation studies are conducted to evaluate the finite-sample performance of the estimation procedures proposed in section (2.1). Consider the following examples:

Example 4.1.

50 samples, each consisting of subjects, are generated from simulating the version of model (1.2) that has covariate dimension , coefficients with non-zero covariate effects. Given

the conditional expectation function (1.4) is specified as the constant function as below:


The baseline hazard is specified through the function


The initial with being the

-dimensional identity matrix.

The simulation results are presented in the terms of the following criteria:

(1) Figure (B.1) shows the the estimated cumulative hazard versus the true cumulative hazard (4.2) for Example (4.1

). The bias and standard deviation of the estimated non-zero

are given in Table (B.2).

(2) We also conduct variable selection for both of the example (4.1) by the adaptive LASSO method (2.6), the result are summarized in Table (B.3) for Example (4.1). The result are reported by:

i. The average number of the true zero coefficients of that are correctly set to zero, denoted by .

ii. The average number of the true non-zero coefficients of that are incorrectly set to zero, which is given by .

iii. The proportion of samples that excluding any non-zero coefficients, denoted by fit.

iv. The proportion of samples selecting the exact subset models (correct-fit) and the proportion of smaples including all the variables (over-fit), labeled by fit and fit respectively.

From Table (B.2), it is clear that for both of the two cases and , the fitting to the non-zero coefficients are very good while the fitting accuracy in the case of is even better. As for variable selection, Table (B.3) shows that in most of the samples, the set of zero variables can be exactly identified by our procedure. In particular, as the sample size increases, the identification accuracy is risen up as well. Even in the rare samples where some zero variables are misclassified, the misclassification happened sparsely as and that value is close to the true value, .

4.2 Real Example

The New York State’s Statewide Planning and Research Cooperative System (SPARCS) 2013 is a system initially created to collect information on discharges from hospitals within New York State. SPARCS currently collects patient level detail on patient characteristics, diagnoses and treatments, services, and charges for each hospital inpatient stay and outpatient visit; and each ambulatory surgery and outpatient services visit to a hospital extension clinic and diagnostic and treatment center licensed to provide ambulatory surgery services. In 2013, the SPARCS contains nearly 2.5 million inpatient discharges from 218 facilities and 58 counties in New York State. Patient demographics in the SPARCS include age group at admission, gender, race, source of payment and zip code. Patient clinical characteristics include type of admission, diagnosis codes (MDC code, DRG code, CCS diagnosis code etc.) and treatment procedures undergone (CCS Procedure Code).

An important property of the SPARCS data is that there is not any other longitudinal observation available for time-dependent variables, like the cumulative charge, than the observation at the discharge time. Therefore, neither the traditional maximum partial likelihood method nor the estimation procedures designed for the joint models as discussed in Kim et al. (2013); Zeng and Lin (2007) can be well applied to the SPARCS data. In contrast, the approach proposed in this paper can effectively address the data issue as it is designed for.

In this paper, we consider the discharge time , with the time-dependent covariate, the logarithm of the cumulative charge

, and the stationary covariates consisting of the categorical variables,

, associated with 25 Major Diagnosis Code (MDC) and the degree () of the Severity of Illness, . Our analysis is conducted on a subsample of the entire SPARCS 2013 database with sample size . The summary statistics of our subsample are presented in Table (B.1).

The penalized maximum likelihood estimators are reported in Table (B.4). The non-parametric estimator for the cumulative baseline hazard are plotted in Figure (B.2).

In Table (B.4), the significant negative coefficients for log-charge indicates the strong positive correlation between the total charge and los. In addition, it seems that there does not exist robust connection between the los and the severity/mortality of illness.

By Figure (B.2), the day 5 seems to be relatively special because the variation of the slope of the cumulative hazard turns from increasing to decreasing around this time, which implies that for patients who have already stayed in hospital for 5 days, they are more probable to have a longer stay.

5 Remarks and Conclusion

In this paper, we proposed a maximum full information likelihood procedure to estimate the Cox model with temporal covariates. The most significant advantage of our procedure is that it can generate well-performed estimation without requiring the extra longitudinal observations before the time to event. There are also three potential extensions to the current work.

5.1 Censoring

Although censoring is not discussed in the current framework, it can be added in the standard way such that censoring is (1) independent from the occurrence of the interested event, or (2) conditionally independent from the event given the covariates at the observational time. In both of the two cases, the consistency and asymptotic normality of the resulting estimators still hold and their proof is straightforward from the proof of Theorem (3.2) and (3.3).

5.2 Forecast Long Term Survival Rate

In addition to the hazard function (1.1), the estimators proposed in section (2.1) indicates a consistent estimator to the long term survival rate (LTSR):


Using the notation as in Eq. (A.8), the estimator to (5.1) can be given as below:


where , and are the estimators derived in section (2.1), which can be replaced by their penalized version as well. The consistency and asymptotic normality of the estimator (5.2) is just a direct result of the theorem (3.2) and/or (3.3). It is worthwhile to mention that (5.2) is not possible to be constructed from the maximum partial likelihood estimators of the Cox model when temporal covariates are included. Because it is clear from (5.2) that relies on the information of the temporal covariates within the forecast interval , which is not available from the maximum partial likelihood estimators.

5.3 Semi-martingale Longitudinal Processes

Although in the current discussion, the longitudinal process is assumed to have bounded variation and absolutely continuous with respect to the Lebesgue measure on , the same framework should be extensible to more general cases where the longitudinal process may not have bounded variation (for example, given by a semi-martingale process). In a series of related works, the authors construct an explicit expression of the joint pdf of the event time and a semi-martingale longitudinal process, which enables us to construct the full information likelihood function. But the challenges to extend the current work to the situation with the semi-martingale longitudinal measurements are the identification of the resulting model and the challenge in computation. For the identification issue, it is clear from the proof (A.1) that in the current framework, the identification relies on detailed analysis of the solution trajectories of ODE system induced by the function (1.4

). In the case of semi-martingale longitudinal measurements, the ODE system will be replaced with a more complicated partial differential equation (PDE) system. Although it seems that there is no barrier to make the same trick in proof (

A.1) invalid, the details to transplant the proof (A.1) to the semi-martingale case is open to future studies. In the aspect of computation, we have to apply numerical method to a PDE system in place of an ODE system, while,as known, the numerical method to solve PDE system is much more time consuming. A potential solution to the computation issue is to utilize the relation between PDE systems and the semi-martingale processes, through which simulating the underlying process could yield exactly the same solution to the PDE problem. The details of implementing that idea are left as another open problem for further study.

Appendix A

a.1 Proof for Theorem (3.1)

The identifiability of the model (1.1) and (1.4) is equivalent to that as long as ,


within a positive measure set (with respect to the standard Lebesgue Measure).

(1) Suppose (A.1) does not hold for some with . The right-continuity condition in requires that


where is the true initial pdf. The assumption that the true joint pdf (1.6) has the full support implies that has full support as well. Therefore, Eq. (A.2) leads to


both and are probability density function, which yield that


for some that contradicts to the requirement in . Consequently, every that could potentially break down the condition (A.1) must have .

(2) On the other hand, if , we have for all :


where is the survival function of the failure time that follows the joint pdf associated with , by definition it has the following form:


Because the survival function is uniquely determined by the pdf of the event time which is furthermore uniquely determined by the joint pdf. Under the assumption that and corresponds to exactly the same joint pdf, the equation (A.5) enforces that for all that breaks the condition (A.1).

(3) Suppose there exists with for which the condition (A.1) doesn’t hold. Then, the following identity holds:


for all pairs such that where is the inverse trajectories of and defined through the relation


Factor out Eq. (A.7) by and take the limit as yielding the following identity:


where for every , the map is the diffeomorphism obtained from solving the ODE system:


is just the point reached at the time by the trajectory starting at that solves Eq. (A.10). is the Jacobian associated with . By the language of ergodic theory, Eq. (A.9) implies that the probability measure is invariant under the -action on the space induced by the solutions . However, under the condition , the action associated with the pair of and does not allow any invariant probability measure fully supported on unless . This contradiction guarantees the condition (A.1).

The uniqueness of the maximal point of function is simply the consequence of the standard proof the consistency of the full information maximum likelihood estimator, and can be found in every advanced textbook of econometrics, like Amemiya (1985).

The continuity of with respect to and its differentiability with respect to the component comes from by the the dominant convergent theorem. This completes the proof for Theorem (3.1).

a.2 Proof for Theorem (3.2)

The relation (2.4) defines a map, denoted as , that assigns every a step-wise function with the step heights specified by (2.4). We can define the asymptotic version of as below:


where is the joint pdf associated with the true parameter . It turns out that is continuous with respect to the weak topology on the space and has the compact domain . In addition, for every pair , in the weak topology.

Then, the consistency of , and follows immediately from and the facts: (1) the function

(by the strong law of large number); (2) the function

is continuous and has a unique maximal point, , within a compact domain (by the theorem (3.1)).

The asymptotic normality of can be verified by the standard argument for the asymptotic normality of a maximum likelihood estimator.

To verify the asymptotic normality of , firstly notice that let be a process satisfying model (1.4) associated with the true parameter and be a random vector following the distribution associated with , denote being the counting process determined by and . The processes and determines a martingale process as below:


it turns out that ,

for . On the other hand, by the relation in Eq. (2.4), we have:


where and , with , and given as below:

By the assumption that is strictly positive, the fact that , and that for every fixed , ,

uniformly by uniform law of large number. Therefore, by central limit theorem, we have:


Applying a vector version of the central limit theorem as well as Eq. (A.14), we can prove the weak convergence of to the Gaussian Process .

Finally, the consistency and the weak convergence of the estimator is direct from the consistency and asymptotic normality of .

a.3 Proof for Theorem (3.3)

Construct a function with