Joint models of longitudinal and event-time data have been extensively studied. It has the following general form:
where is the conditional hazard function given time and the temporal covariate , which has the Cox form and consists of a non-parametric baseline hazard and a parametric component, . The temporal covariate is given through a -dimensional stochastic process which is parametrized by a set of parameter .
Joint models of type (1.1) have many variants, which are derived either from changing the functional form of hazard function or from different specification of longitudinal processes. A variety of alternative functional forms of hazard have been proposed and widely studied in literature (Rizopoulos, 2011; Kim et al., 2013; Wang and Taylor, 2001; Tsiatis et al., 1995; Taylor et al., 1994; Chen et al., 2014), although they are quite useful in different application settings, from the perspective of estimation strategy, they do not really make a difference to the ordinary form of Cox hazard function (1.1). In contrast, variations in specification of longitudinal process is more interesting, as different specifications associate with different protocol models in survival analysis, to which different estimation procedures have to be developed.
For instance, when
the longitudinal process reduces to a deterministic constant process (), model (1.1) reduces to the standard Cox proportional hazard model (Cox, 1972) which can be estimated through the classical maximum partial likelihood procedure(MPL) (Cox, 1972; Andersen and Gill, 1982; Andersen, 1992).
When longitudinal process is given through a linear mixed model as below:
model (1.1) becomes a family of joint models that are extensively applied by bio-statisticians in medical-related fields where clinical-trail data is available (Taylor et al., 1994; Tsiatis et al., 1995; Wang and Taylor, 2001; Rizopoulos, 2011; Kim et al., 2013; Barrett et al., 2015; Wu and Yu, 2014; Chen et al., 2014). In (1.3), ,
is a constant parameter vector, while
is a random vector assumed to be normally distributed with mean
is a white-noise observational error subjecting to meanand co-variance . , are the so-called configuration matrices which are essentially two deterministic and known functions in variable . Due to the existence of unobservable , estimation of model (1.3
) is based on Expectation-Maximization (EM) algorithm(Wu et al., 2007; Rizopoulos, 2010).
Longitudinal processes do not have to be continuous in all dimensions, it can take the jumped counting process as below:
where is the indicator to the occurrence of some recurrent events at time which is supposed to following some underlying distribution subjecting to parameter , operation counts the cardinality of the given set. Model (1.4) is a special case of panel count models (Riphahn et al., 2003; Sun, 2014), many procedures are proposed for its estimation and a MPL-based estimation is discussed in Andersen and Gill (1982); Andersen (1992).
Although estimation of model (1.1) with different longitudinal processes has been extensively discussed, in most cases, the discussion focus merely on that there are sufficient amount of longitudinal observations. Availability of longitudinal data may not be a big issue for clinical trails, but in many other applications, such as medical cost study and default risk forecast, it is often the case that longitudinal observations are systematically missed (Laird, 1988; Hogan and Laird, 1997; Chen et al., 2014; Sattar and Sinha, 2017). For instance of medical cost study, in order to protect privacy, many publicly available medical cost databases do not release longitudinal observations of the cost accumulation process during inpatient hospital stay, except the total charge by the discharge day. Medical databases of this type include the National Inpatient Sample (NIS) of United State and the New York State’s Statewide Planning and Research Cooperative System (SPARCS). In the study of default rate of small business loans and/or consumer loans, all key financial indicators of borrowers are missing during the entire repayment period, this is because for small business and individual borrowers, financial data is even never collected unless some crucial event occurs as a trigger, such as overdue and/or default. So, it is common in financial applications that values of key variables are only available at the event time, all intermediate observations are missed.
To handle missing longitudinal data, a novel simulation-based estimation procedure is proposed in this paper. Without loss of generality, it is designed for the following form of input data, which is the typical data type in studies of default risk and medical cost:
where subjects are in the input data, is the number of longitudinal observation times for ; and are the th observation time for subject and the value of the th variable observed at time for , respectively. For observation times, we assume the last observation time is always equal to the event time (censoring can be easily incorporated whenever it is uninformative, but it is not the major topic in this study). denotes the set of indices of longitudinal variables. There are two subsets of variables, is the set of variables with their longitudinal observations systematically missed except for the observation at event time; while the complement set consists of variables for which longitudinal observations are available at all observation time.
The simulation-based procedure turns out capable of generating consistent estimators of all parametric and non-parametric components of model (1.1) from input data of type (1.5). To our best knowledge, this is the first procedure that can handle (1.5). Apart from feasibility, our simulation-based procedure is uniformly applicable to all three different specifications of longitudinal process (1.2), (1.3) and (1.4), as well as their mixture. The uniform property makes our procedure attractive as most existing procedures are either designed for continuous longitudinal process (1.3) (Andersen and Gill, 1982; Karr, 2017; Lin et al., 2000) or the counting process (1.4) (Taylor et al., 1994; Kim et al., 2013; Zeng and Lin, 2007), their combination is rarely discussed together with survival analysis. In addition, uniformity makes it possible to integrate estimation of different classes of joint models into one single software package, so it makes our procedure friendly to practitioners.
From the perspective of computation, the simulation-based procedure outperforms many existing procedures (Rizopoulos, 2010; Guo and Carlin, 2004) in the sense of being compatible with parallel computing framework. In fact, there are two major steps involved in the approach, a simulation step and a optimization step. The simulation step is carried out path-wisely so is completely parallelizable. From the simulation result, a complete version of likelihood function is derivable without any latent variable involved, so the optimization step doesn’t rely on EM algorithm and can be parallelized as well if stochastic descending algorithm is applied. Its compatibility to parallel computing makes the simulation-based procedure useful in handling massive data and financial applications.
The paper is organized as the following. In Section 2, we will present model specification and sketch the estimation procedure in details. The large sample properties of resulting estimators are stated in Section 3. Simulation results and the application to massive consumer-loan data are presented in Section 4. Section 5 discusses some extensions of our model and concludes. All proofs are collected in Appendix.
2 Model Specification & Estimation
2.1 Model Specification
In this study, we consider a -dimensional mixture longitudinal processes, its projection to the th dimension, , is either a counting process of type (1.4) or absolutely continuous in time which, without loss of generality, can be expressed as a time integral:
is an arbitrary stochastic process with finite first and second moments at every,
is an arbitrary initial random variable. (2.1) include both of (1.2) and (1.3) as special cases, it reduces to (1.2) as long as and , and reduces to (1.3) if and in (1.3) are absolutely continuous with respect to , which is usually assumed to hold in practice.
To be general, it is allowed that some of the longitudinal dimension is not continuous in time, say representable as a counting process. For simplicity of presentation, we consider the case where is a counting process only for one covariate dimension, while our methodology is easily extendible to the multi-dimensional counting process situation without much modification. So, from now on, we will denote longitudinal process as with and representing the counting dimension and absolutely continuous dimensions respectively.
where the supscript distinguish the intensity of the counting process from the intensity of the terminal event process.
2.2 Simulatibility and simulation procedures
Since our estimation procedure is simulation-based, it requires the entire longitudinal process to be simulatible which is formally defined as below:
A longitudinal process is simulatible if for every , it is possible to generate a sequence of random vectors such that the process
converges weakly to the true process as , where denotes the integral part of .
In addition, a longitudinal process is empirically simulatible if there exists a simulation procedure such that for every positive integer , and every , identically independently distributed (i.i.d.) sample sequences of the form
can be generated from the procedure, such that for every , the cross section is i.i.d. samples of with being the th element of the simulatible sequence in (2.3) corresponding to .
Most widely-studied longitudinal processes are empirically simulatible. Process of type (1.2) is simulatible through computing the time integral (2.1) with . Process (1.3) is simulatible by the following algorithm:
In algorithm (1), is the last element of the sequence , denotes the th element in a sequence ; is the set of natural number starting from ; operation appends an element to the given sequence. Note that initial time is a -dimensional vector and considered as an input, whose entries are constantly in most cases. But algorithm (1) and the following algorithm (2) can be used as an intermediate step in the simulation of longitudinal process with counting component. In that case, initial time may not always be zero, so we leave as a free parameter.
In general, denote as the vector , with the variational rate of the th-dimension time integral in (2.1), when is Markovian in the following sense:
absolutely continuous process (2.1) is simulatible via the following Euler-type algorithm, where is the conditional density of given the full information of longitudinal process up to time , is the conditional density given only the longitudinal observation at time :
Longitudinal process with one counting dimension is also empirically simulatible, while the simulation algorithm becomes tricky. For simplicity of presentation, in the following pseudo-code (algorithm (3)), we assume that the first dimension (indexed by ) of the longitudinal process represents the counting component. In addition, we require that at initial time, the counting dimension puts all mass at , this restriction can be easily relaxed but the pseudo-code would become too redundant.
In algorithm (3), for every , is considered as matrix with its th row being the dimensional longitudinal process simulated at time and denote the corresponding value at the counting dimension. is the inner product between matrices with appropriate row- and column-dimensions, the product matrix is formed through entry-wise product between and entries in .
is a random number generator that draws a random number from uniform distribution on interval.
It is critical to notice that the counting component can be considered as time-invariant between two consecutive jump time (reflected through the line 11 of algorithm (3), where the conditional density is reconstructed and the probability that the counting component jumps out of its current stage is set to ), then the design of algorithm (3) becomes quite simple. It takes fully use of the local stationarity of the counting component in the way that for every fixed subject, simulation of the longitudinal process with one counting dimension is decomposed as a sequential simulation of longitudinal processes with all their dimensions being absolutely continuous in time. This sequential construction is crucial not only in algorithm design, but is also the key to verify the identifiability of model (1.1) under specification (2.1) and (2.2), the details will be discussed in section (3).
It turns out that algorithm (1)-(3) can generate the simulatible sequence required in definition (2.1) for all longitudinal processes (1.2), (1.3), (1.4) and their mixture. Proof for algorithm (1) is quite trivial, while proof for algorithm (2) and (3) are parts of the proof of the consistency of our estimators which, therefore, are combined with the proof of theorem (3.3) and presented in Appendix.
Notice that algorithm (3) is extendible to handle the occurrence of the terminal event. In fact, algorithm (3) can be generalized to the following algorithm (4), which returns a set of i.i.d. samples of .
is the joint of longitudinal observation at event time and the event time itself, its i.i.d. samples are the key to construct the joint probability density function (pdf) and likelihood function in our estimation procedure.
In algorithm (4), all notations follow their interpretations in previous three algorithms. Censor bound is a prescribed positive constant, it specifies the end of observation. operation returns a dimensional vector through concatenating two row vectors with dimension and , respectively. Apparently, algorithm (4) simulates the terminal event time in the same way as to simulate the jump time of counting component in algorithm (3).
2.3 Estimation procedure
The simulation algorithms stated in previous section provide a foundation to construct the estimation procedure of model (1.1). Our estimation is based on maximizing the full information likelihood function of observations .
Notice that at this moment, we assume that longitudinal observations are systematically missed for all covariate dimensions, i.e. the input data has a special form of (1.5) with the set . Estimation for more general form of input data is extended through the estimation procedure of the special case. To avoid ambiguity of notation, from now on, we will denote as the terminal event time derived from simulation and , as the longitudinal observation of simulated sample at and terminal time respectively. In contrast, for the real observed sample, the terminal event time is denoted as and longitudinal observations are denoted as or .
With the aid of simulation algorithms presented in previous section, the construction of likelihood function can be implemented according to following two steps:
Step 1: Fix a sample size , interval length , and a profile of model parameters and non-parametric components , executing appropriate simulation algorithms yields samples of which turn out to be i.i.d. samples of subjecting to the given parameters.
Step 2: Apply kernel density method to the i.i.d. samples in Step 1, yield an empirical pdf of the random vector that implicitly depends on and is expressed as below:
where is a -dimensional kernel function with bandwidth , for simplicity, we only consider the Gaussian kernel function in this study; are samples yielding from step 1.
The i.i.d. property of samples from Step 1 guarantees that yielding from Step 2 converges to the true joint pdf subjecting to .
Based on empirical pdf , an empirical version of the full information likelihood function can be constructed as below:
where is the observed covariate variables at the event time of the th subject and the event time itself, is the sample size (the number of subjects) of the input data.
In practice, the non-parametric components and in (2.7) can be replaced by their step-wise version:
where in (2.8) can take either as or , , is the indicator function, for simplicity, the step length is taken the same value as the length of time interval in simulation algorithm (1)-(4). To guarantee consistency, their values will depend on the sample size .
2.4 Estimation with general input-data type (1.5)
Although the estimation procedure of maximizing (2.7) is directly applicable to the input data of type (1.5), it does not fully utilize the information provided in input data as there is not any connection between the estimator and the partially existed longitudinal observations. To resolve this issue, we present a way to extend the estimation procedure in previous section, the extension can make better use of longitudinal information and increase estimation efficiency.
We apply the idea of censoring to construct a weighted average likelihood function, estimators fully utilizing the longitudinal data is then derived from maximizing that function. In details, for every fixed time interval and the subset of simulated data satisfying , the simulated samples admit to construct the conditional joint pdf of given and the conditional density of censored event given the censoring . More precisely, given and , we have the following uncensored pdf:
and the censored pdf
where follows the interpretation before; is the number of simulated subject s with terminal event time ; is the number of element in the set of dimensions that do not have missing longitudinal observations; denotes the projection of vector onto its sub-coordinate indexed by ; supscript , indicates the dependence on and , supscript and denote “uncensored” and “censored” respectively.
where is the real observed longitudinal vector for subject at time ; is the observed sub-coordinates of longitudinal vector of subject at time corresponding to those non-missing dimensions ; ; is defined analogous to by replacing simulated sample with real observed sample, formally, .
Estimators fully utilizing longitudinal information is derived from maximizing the mean likelihood function (2.11) and denoted as .
The choice of partition boundaries of and the number of partition cells is tricky and input-dependent.
When the observation time in (1.5) is uniform for all subject in sample, i.e. for all different , and , partition boundaries of can be simply selected as where is the index of the subject who has the greatest terminal event time. This choice can guarantee the most efficient utilizing longitudinal information. Input data with uniform observation time is widely existing in applications to finance and actuarial sciences where many economic variables are collected in a fixed frequency, such as GDP (Koopman et al., 2008; Li et al., 2017).
When the observation time is not uniform, but for every subject in sample, the observation frequency is relatively high in the sense that converges to as sample size , partition intervals can be selected with equal length while the number of partition intervals is set as
. In this case, interpolation method can be applied as discussed inAndersen and Gill (1982) to set longitudinal value at boundary point for subjects whose longitudinal observation are missing at .
Finally if longitudinal observation time is not uniform and has low frequency, the choice of and becomes quite complicated, we leave it for future discussion.
2.5 Parallel computing
Distinct from the estimators for joint models with longitudinal process specified through (1.3) (Wu and Yu, 2014; Rizopoulos, 2010; Guo and Carlin, 2004), the computation of our estimator is highly compatible with parallel computing framework, especially with the embarrassingly parallel computing framework Guo (2012). The parallelizability of our procedure comes from two sources which correspond to the simulation steps and optimization steps, respectively.
In the step of simulation and construction of empirical pdf, all simulation algorithms in section (2.1) is implemented in a path-wise manner, In fact, setting sample size for each run of algorithm (1)-(4) and repeating execution for times generates samples that are essentially identical to the samples by executing the algorithms once with sample size . So there are no interaction between two sample trajectories, an embarrassingly parallel computing framework is perfectly applicable in this setting and can significantly rise up the computation speed of the simulation step.
In the step of optimization, there is no latent variable involved in the empirical likelihood function (2.7), this is quite different from the estimation procedure of joint models with (1.3) as longitudinal process, where the involvement of random coefficient
leads to latent variables and the reliance on EM algorithm. The main advantage of not using EM algorithm is that there is no need to repeatedly solve a complex integral and a maximization problem which have sequential dependence. Consequently, evolutionary algorithm and/or stochastic descending algorithm(Liu et al., 2015; Cauwenberghs, 1993; Sudholt, 2015; Tomassini, 1999; Osmera et al., 2003) is applicable to maximize the likelihood function (2.7), which is embarrassingly parallelizable.
In sum, the estimation proposed in this paper is highly compatible with parallel computing framework. This is important because in applications to finance or risk management, massive input data is common that imposes a strict requirement on computation speed and efficient memory allocation. Parallel computing can significantly lift computation speed, meanwhile take better use of the limit memory. So, being parallelizable grants our estimation procedure with a great potential in a wide range of applications, especially the application in finance.
3 Asymptotic Property of Large Sample
The consistency and asymptotic normality of estimators, are established in this section. For convenience of expression, we need the following notations:
1. Denote , and as the domain of parameter , and , respectively.
2. Denote as a given model setup, with being the true model setup; denote function as the theoretical joint pdf depending on model setup , is the true joint pdf of observation .
as the conditional expectation of variational rate of absolutely continuous dimensions of longitudinal process given its observation at time and model setup , where is the vector of instantaneous variational rate as defined in (2.1), and denote the counting dimension and absolutely continuous dimension of longitudinal process respectively. To establish the consistency result, we need the following technical conditions:
. For every , the pdf induced by the true longitudinal process at initial time has full support and satisfies that for all non-zero dimensional vector . In addition, , for some positive constant and for all in consideration.
. For every and every , one of the following holds:
There exists such that the matrix converges to some limiting matrix as , denote the limiting matrix as
which satisfies the hyperbolic property, i.e. at least one eigenvalue ofmust have non-zero real part.
. , and are compact subsets of Euclidean space with appropriate dimension, suppose that they all have open interiors and the true values of parameter vector , and contained in their open interior.
. , and are finite for all as pairs of coordinates of vector , and for all in its domain; denote as a matrix with its entry being , denote as a matrix with its entry being , both matrices and are positive definite.
. For all combinations of and all , ; for every the map from the domain of to given through is continuous with respect to the topology.
. The true theoretical joint pdf is continuously differentiable with bounded first order partial derivatives.
. is the number of subjects in observation. The choice of parameter , and kernel width satisfies , , and .
Condition and are the key to verify model identification. Condition - are the standard assumption in maximum likelihood estimation (MLE), which guarantees the consistency and asymptotic normality of MLE estimators. Condition and guarantee that the simulation algorithms introduced in section (2.1) can generate the required simulatible sequence in (2.1) and that the empirical joint pdf, in (2.6), is consistent.
Among the seven conditions, and plays the central role to guarantee identifiability of our estimator. It turns out that and hold for a very general class of longitudinal processes. Particularly, almost all linear mixed longitudinal processes (1.3) are belonging to that class. In fact, for all absolutely continuous , such that , longitudinal processes (1.3) can be rewritten as a time-integral of the form (2.1):
where the initial vector which, by assumption, is a normal random vector, so always satisfies and (i). In fact, and (i) does not only hold for the normal class, but also hold for most popular distribution classes that we can meet in practice. This distribution-free property is crucial, as the computation of the traditional estimators to model (1.1) is expensive in time and memory, it strongly relies on the normality assumption to simplify the expression of likelihood function (Rizopoulos, 2010). However, computation load of our simulation-based estimator is not sensitive to the normality assumption, because according to algorithm (1), the variation of distribution class of and only affects the draw of initial samples, which does not take more time and/or memories for most of the widely-used distribution classes.
and distinguish our simulation-based estimator from the traditional estimators developed for model (1.1) under longitudinal specification (1.3) (Rizopoulos, 2010; Guo and Carlin, 2004). In literature, the standard trick is to consider the latent factor as a random effect, then model (1.1) becomes a frailty model, conditional independent assumption is utilized to derive explicit expression of the likelihood function and EM algorithm is applied to carry out the estimation. However, once if (1.1) is treated as a frailty model, its identifiability strongly depends on the number of longitudinal observations, which have to be greater than the dimension of longitudinal processes for the normally distributed as proved in Kim et al. (2013). Proof in Kim et al. (2013) also relies on the assumption of normality, it is not clear if the same trick applies to more general distribution classes. The dependence on normality and the availability of enough amount of longitudinal observations restrict the usefulness of joint model (1.1) in many fields, such as the credit risk management and actuarial science (Li et al., 2017; Koopman et al., 2008), where longitudinal processes are not normal in general. More critically, in most cases the observation of covariate variables is only available for a couple of years and collected on a monthly or quarterly base, so only tens of longitudinal records are present. In contrast, there are usually ultra-high dimensional (e.g. hundreds) covariates present. Thus, identifiability of model (1.1) is always a big concern. For our procedure, condition and guarantee model identifiability without any extra restriction on the number of longitudinal observations, neither on the distribution class of . In this sense our simulation-based procedure generalizes the standard estimation procedure of model (1.1).
Model (1.1) is identifiable under condition and , where its longitudinal process is a mixture of absolutely continuous processes specified through (2.1) and an one-dimensional counting process satisfying (2.2). Additionally, if - hold,
the estimator is consistent, asymptotically normally distributed for its parametric part , and has the following asymptotic property for its non-parametric part:
For the two baseline hazard functions and , the estimator and converge to and according to the weak- topology, the processes and converge weakly to two Gaussian Processes.
The estimator fully utilizing longitudinal information is also consistent and asymptotically normally distributed. In contrast to , estimator turns out to be more efficient. The details are summarized into the following theorem:
Under -, the estimator is consistent and asymptotically normally distributed, with their asymptotic variance being in scale of the asymptotic variance of , where is the number of censoring intervals.
The proof of theorem (3.3), (3.4) and lemma (A.1)-(A.3) are quite technical, but the idea behind them are straightforward. Notice that the simulation-based estimator developed in this study is essentially a maximum likelihood estimator, so as long as the model (1.1) is identifiable and the standard regularity condition - hold, consistency and asymptotic normality of our parametric estimator, , is just a consequence of the asymptotic property of the standard maximum likelihood estimator. As for the non-parametric part, , its consistency still holds by the fact that when a model is identifiable, the true model setup is the unique maximal point of the entropy function which is a function in (Amemiya, 1985). As for the asymptotic normality of non-parametric estimator, , the proof is essentially the same as the proof in Zheng et al. (2018).
In fact, lemma (A.1), combining with and , provides a foundation to lemma (A.3) and model identifiability. This is done through verifying inequality (A.17) in an inductive way where condition is applied repeatedly to remove the possibility of existence of an invariant probability measure.
Lemma (A.2) is crucial to the convergence of likelihood function. It helps confirm that samples generated from algorithm (1)-(4) are drawn correctly and satisfy the i.i.d. property, which guarantee that the empirical joint pdf (2.6) approaches to the theoretical pdf (A.1), (A.3) as the simulation sample size . In addition, with appropriate choice of tuning parameter subject to condition , lemma (A.2) also guarantees the convergence of likelihood function (2.7) to the theoretical entropy function.
4 Numerical Studies
4.1 Simulation studies
In this section, we present an example based on simulation studies to assess the finite sample performance of the proposed method.
200 random samples, each consisting of subjects, are generated from the model with dimensional longitudinal process with the th dimension being the counting dimension. The six absolutely continuous dimensions are given as a special case of (1.3) where the error and random effect are supposed to be independent normal random vectors and follow and , with the mean and co-variance matrix parametrized in the following way:
where , denotes the diagonal matrix with diagonal elements given as the vector . In this example, we take and for all and , which means both the random effect and error are standard normal random vectors in this example. The configuration matrix and .
The counting dimension has its jumping hazard specified through (2.2) such that and