Survival analysis has proven useful in many areas, including cancer research, clinical trials, epidemiological studies, and actuarial science. A large body of methods has been developed under various survival models for data subject to right-censoring. Comprehensive discussions of those methods can be found in Kalbfleisch and Prentice (2002), Lawless (2003), and the references therein. In practice, complex features may appear in a dataset and make the analysis challenging. In this paper, we mainly discuss left-truncation and measurement error in covariates.
Left-truncation usually arises from a prevalent sampling design, in which only individuals who have experienced the initiating event but not the failure event before the recruitment time can be sampled. Under this scheme, individuals are not observed if they experience the failure event before the recruitment time. Left-truncation thus causes delayed entry of subjects and tends to produce a biased sample. Several methods have been developed under different types of models. For example, Qin and Shen (2010) proposed two estimating equation methods under the Cox proportional hazards (PH) model. Huang, Follmann, and Qin (2012) developed semiparametric likelihood inference for the Cox PH model under length-biased sampling, a special case of left-truncation. Su and Wang (2012) developed a semiparametric approach for jointly modelling left-truncated and right-censored survival outcomes and longitudinal covariates. In addition, Shen, Ning, and Qin (2009) and Ning, Qin, and Shen (2014) proposed valid estimation methods for the accelerated failure time model.
Beyond the models mentioned above, other types of models have also been studied in survival analysis for specific purposes. For example, rather than investigating the hazard ratio under the Cox PH model, researchers may be more interested in the risk difference attributed to the risk factors. For this purpose, the additive hazards model is considered, whose formulation is given by
$$\lambda(t \mid X) = \lambda_0(t) + \beta^{\top} X,$$
where $X$ is a $p$-dimensional vector of covariates, $\lambda(t \mid X)$ is the conditional hazard function of the survival time given the covariates $X$, $\lambda_0(t)$ is the unspecified baseline hazard function, and $\beta$ is a $p$-dimensional vector of parameters of main interest. Some methods have been proposed to handle the additive hazards model when left-truncation occurs. For example, Huang and Qin (2013) proposed a conditional estimating equation approach, and Chen (2019a) developed a pseudo-likelihood method to derive the estimator.
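To make the additive hazards formulation concrete, the following minimal sketch (not the authors' procedure; all names are hypothetical) generates survival times under a constant baseline hazard, in which case the conditional survival time is exponential with rate $\lambda_0 + \beta^{\top} x$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_additive_hazards(X, beta, lam0=0.5):
    """Draw survival times under lambda(t | X) = lam0 + beta^T X with a
    constant baseline; the total hazard must be positive for every subject."""
    h = lam0 + X @ beta              # constant conditional hazard per subject
    return rng.exponential(1.0 / h)  # exponential(rate h) survival times

# toy covariates, clipped so every hazard lam0 + beta^T x stays positive
X = np.clip(rng.normal(size=(5, 2)), -1.0, 1.0)
beta = np.array([0.2, 0.2])
T = sample_additive_hazards(X, beta)
```

With time-varying baselines, inversion of the cumulative hazard replaces the exponential draw; the constant-baseline case is shown only because it admits a closed form.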
The second important feature is measurement error in covariates. As discussed in Carroll et al. (2006), ignoring the error in covariates may incur substantial bias in the estimator. In the absence of left-truncation, several methods have been developed to correct for measurement error. To name a few, Nakamura (1992) developed an approximate corrected partial likelihood method, which was extended by Buzas (1998) and Hu and Lin (2002). Huang and Wang (2000) proposed a nonparametric approach for settings with repeated measurements of the mismeasured covariates. Xie et al. (2001) explored a least squares method to calibrate the induced hazard function. More related methods are reviewed in Chen (2019c).
When a biased sample and measurement error occur simultaneously, several methods based on different types of models have been proposed. For example, Chen (2018) developed a three-stage procedure to deal with error-prone variables under the accelerated failure time model. Chen (2019b) studied the cure model with left-truncated data and measurement error. Chen and Yi (2019) proposed corrected pseudo-likelihood estimation for the Cox PH model with left-truncated and right-censored survival data and covariate measurement error. However, other survival models, such as the additive hazards model, have not been fully explored when these two complex features occur together. Hence, in this paper, we mainly focus on the additive hazards model.
On the other hand, high-dimensional data also attract attention. As the dimension of the covariates increases, the analysis becomes difficult and non-informative variables may appear. To retain the informative variables and make the analysis sensible, variable selection is a useful tool, and it is frequently applied in the analysis of survival data. For example, Lin and Lv (2013) developed a variable selection method for the additive hazards model. However, they mainly focused on survival data subject to right-censoring; variable selection with left-truncation and measurement error has not been fully explored.
In this paper, we consider this important problem and develop inference methods for the analysis of high-dimensional left-truncated and right-censored survival data with measurement error, focusing on the additive hazards model. Instead of an estimating equation approach, we adopt the pseudo-likelihood method proposed by Chen (2019a), which provides a more efficient and robust estimator. Based on the pseudo-likelihood, we propose a simulation-based three-stage procedure that corrects for measurement error, selects the informative variables, and derives the estimators simultaneously.
The motivating example of this paper is the Worcester Heart Attack Study (WHAS500) data collected by Hosmer, Lemeshow, and May (2008). The main goal of this study is to determine the factors associated with trends over time in the incidence and survival rates following hospital admission for acute myocardial infarction (MI). The data were collected over thirteen 1-year periods beginning in 1975 and extending through 2001 on all MI patients admitted to hospitals in Worcester, Massachusetts. There are 500 observations and 22 variables in this dataset. Specifically, as discussed in Hosmer, Lemeshow, and May (2008), the beginning of the survival time was defined as the time the subject was admitted to the hospital. The main interest is the survival time of a patient who was discharged and still alive. Hence, an inclusion criterion is that only those subjects who were discharged alive are eligible for the analysis. That is, the minimum survival time is the length of time a patient stayed in the hospital; individuals whose observation times are shorter than this minimum survival time are not included in the analysis.
Basically, the data are pertinent to three important events in calendar time: time of hospital admission, time of hospital discharge, and time of last follow-up (which is either failure or censoring). The total length of follow-up is defined as the length of time between hospital admission and the last follow-up, and the length of hospital stay is defined as the time length between hospital admission and hospital discharge. Data can only be collected for those individuals whose total length of follow-up is longer than the length of hospital stay, which is the so-called left-truncation (e.g., Kalbfleisch and Prentice 2002, Section 1.3; Lawless 2003, Section 2.4).
The rest of this article is organized as follows. We introduce the structure of left-truncated and right-censored (LTRC) survival data and the measurement error model in Section 2. We present our proposed method in Section 3: a three-stage procedure that deals with error-prone variables, selects the active variables, and estimates the parameters simultaneously. In addition, we provide valid estimation procedures for the cumulative baseline hazard function and the distribution function of the truncation time. In Section 4, we examine the numerical performance of the estimators under several model settings and apply the proposed method to the WHAS500 dataset. Finally, the discussion is summarized in Section 5.
2 Notation and Model
2.1 Data Introduction
For an individual in the target disease population, let $r$ denote the calendar time of recruitment (e.g., recruitment starts right at hospital discharge), and let $u$ and $v$ denote the calendar times of the initiating event (e.g., hospital admission) and the failure event (e.g., death), respectively, where $u \le r$ and $u \le v$. Let $T = v - u$ be the lifetime (e.g., the time between hospital admission and failure) and $A = r - u$ be the truncation time (e.g., the time between hospital admission and hospital discharge). Let $X$ be a $p$-dimensional vector of covariates. Let $f(a)$ be the unspecified probability density function of $A$, and let $F(a)$ denote the distribution function of $A$. Let $f_T(t)$ and $S_T(t)$ be the density function and the survivor function of the failure time $T$, respectively.
Consistent with the notation considered by Chen (2019a, 2019b), for an individual with $T \ge A$, this condition indicates that the individual is eligible for recruitment, so that measuring $T$ is possible. Figure 1 illustrates the relationship among these variables. However, if $T < A$, as shown in Figure 2, the individual is not included in the study, so the researcher cannot obtain any information on that individual.
We further define $C$ as the censoring time for a recruited subject. Let $Y = \min(T, C)$ be the observed time and $\Delta = I(T \le C)$ be the indicator of a failure event. Suppose we have a sample of $n$ subjects, where for $i = 1, \ldots, n$, $\{A_i, Y_i, \Delta_i, X_i\}$ has the same distribution as $\{A, Y, \Delta, X\}$, and we let $\{a_i, y_i, \delta_i, x_i\}$ denote the realized values.
For the following development, we make standard assumptions which are commonly considered for survival data analysis and related frameworks (e.g., Huang and Qin 2013; Chen 2019a):
Conditional on $X$, $T$ and $A$ are independent;
Censoring time is non-informative.
2.2 Measurement Error Model
In practice, covariates are often subject to measurement error. For $X$, we write $X = (Z^{\top}, W^{\top})^{\top}$, where $Z$ and $W$ are $p_z$-dimensional and $p_w$-dimensional vectors of covariates, respectively, with $p = p_z + p_w$. Moreover, we decompose $\beta$ in (1) as $\beta = (\beta_z^{\top}, \beta_w^{\top})^{\top}$, where $\beta_z$ and $\beta_w$ are $p_z$-dimensional and $p_w$-dimensional vectors of parameters associated with the covariates $Z$ and $W$, respectively.
Suppose that $Z$ is measured with error, with an observed value or surrogate $Z^*$, and that $W$ is precisely observed. The classical additive measurement error model (Carroll et al. 2006, Ch. 1) is assumed to describe the relationship between $Z^*$ and $Z$:
$$Z^* = Z + \epsilon,$$
where $\epsilon$ is independent of $\{T, A, C, X\}$ and follows $N(0, \Sigma_\epsilon)$ with covariance matrix $\Sigma_\epsilon$. If $\Sigma_\epsilon$ is unknown, then additional information, such as repeated measurements or validation data, is needed so that $\Sigma_\epsilon$ can be estimated. To ease the discussion and focus the presentation on the impact of measurement error, we take $\Sigma_\epsilon$ to be a known covariance matrix.
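The classical additive model is easy to emulate; the following sketch (hypothetical names, assuming a known $\Sigma_\epsilon$) generates surrogates $Z^* = Z + \epsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_classical_error(Z, Sigma_e):
    """Surrogate Z* = Z + eps with eps ~ N(0, Sigma_e), independent of Z."""
    n, pz = Z.shape
    eps = rng.multivariate_normal(np.zeros(pz), Sigma_e, size=n)
    return Z + eps

Z = rng.normal(size=(200, 2))   # true error-prone covariates
Sigma_e = 0.15 * np.eye(2)      # known error covariance
Z_star = add_classical_error(Z, Sigma_e)
```

Under this model the surrogate's covariance is inflated to $\mathrm{Var}(Z) + \Sigma_\epsilon$, which is the source of the attenuation bias the later correction targets.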
3 Methodology

In this section, we first briefly review the likelihood approach proposed by Chen (2019a), based on the unobserved covariate $X$. After that, we present the extension that incorporates the error-prone variables and variable selection.
3.1 Construction of the Likelihood Function
Let $N(t)$ denote the counting process for the observed failure events. The modified at-risk process, which adjusts for the truncation time, is denoted by $R(t) = I(A \le t \le Y)$.
Under Condition (A1), by derivations similar to those in Appendix A of Chen (2019a), one can show that the joint density function of $(A, T)$ given $X$ is proportional to
$$\frac{f_T(t \mid a, x)\, f(a \mid x)}{P(T \ge A \mid X = x)}\, I(t \ge a),$$
where $f_T(t \mid a, x)$ is the density function of $T$ given $A$ and $X$, and $f(a \mid x)$ is the density function of $A$ given $X$. Therefore, for the sample $\{(a_i, y_i, \delta_i, x_i) : i = 1, \ldots, n\}$, under Condition (A2) and model (1), the full likelihood function is given by
$$L = \prod_{i=1}^{n} \frac{f(a_i \mid x_i)\, f_T(y_i \mid x_i)^{\delta_i}\, S_T(y_i \mid x_i)^{1 - \delta_i}}{P(T \ge A \mid X = x_i)},$$
where $S_T(\cdot \mid x)$ is the survivor function under model (1). Moreover, we can decompose $L$ into $L = L_A \times L_T$, where
$$L_A = \prod_{i=1}^{n} \frac{f(a_i \mid x_i)\, S_T(a_i \mid x_i)}{P(T \ge A \mid X = x_i)}$$
is the likelihood of $A$ given $X$; and
$$L_T = \prod_{i=1}^{n} \frac{f_T(y_i \mid x_i)^{\delta_i}\, S_T(y_i \mid x_i)^{1 - \delta_i}}{S_T(a_i \mid x_i)}$$
is the likelihood of $(Y, \Delta)$ given $A$ and $X$.
Different from the conventional martingale method or the estimating equation approach, Chen (2019a) derived the estimator of $\beta$ by maximizing the pseudo-likelihood function based on (4). This approach has two advantages. First, it accommodates model misspecification: the properties of the martingale method hold only if the model is correctly specified, while the likelihood method does not require such a strong condition (Lin and Wei 1989). Second, the likelihood method yields a more robust and more efficient estimator, as shown by the numerical studies in Chen (2019a).
3.2 Inferential Procedure
In this section, we extend the setting in Section 3.1 by incorporating error-prone and high-dimensional covariates. To deal with the error-prone covariates, select the active covariates, and estimate the parameters simultaneously, we propose a simulation-based three-stage procedure.
- Step 1: Simulation
Let $B$ be a given positive integer and let $\{\zeta_0, \zeta_1, \ldots, \zeta_M\}$ be a sequence of pre-specified values with $0 = \zeta_0 < \zeta_1 < \cdots < \zeta_M$, where $M$ is a positive integer and $\zeta_M$ is a prespecified positive number.
For a given subject $i$ with surrogate $z_i^*$ and a given $\zeta_k$, we generate $\epsilon_{b,i}$ from $N(0, \Sigma_\epsilon)$ and define $z_{b,i}(\zeta_k)$ as
$$z_{b,i}(\zeta_k) = z_i^* + \sqrt{\zeta_k}\, \epsilon_{b,i}$$
for every $b = 1, \ldots, B$ and $k = 0, 1, \ldots, M$. Therefore, the conditional distribution of $z_{b,i}(\zeta_k)$ given $z_i$ is $N(z_i, (1 + \zeta_k) \Sigma_\epsilon)$.
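The simulation step above can be sketched as follows (a generic SIMEX pseudo-data generator with hypothetical names; $\Sigma_\epsilon$ is assumed known):

```python
import numpy as np

rng = np.random.default_rng(2)

def simex_step1(Z_star, Sigma_e, zetas, B):
    """For each zeta_k and replicate b, form z_b(zeta_k) = z* + sqrt(zeta_k)*eps,
    eps ~ N(0, Sigma_e), so that Var(z_b(zeta_k) | z) = (1 + zeta_k) Sigma_e."""
    n, pz = Z_star.shape
    pseudo = {}
    for zeta in zetas:
        pseudo[zeta] = [
            Z_star + np.sqrt(zeta)
            * rng.multivariate_normal(np.zeros(pz), Sigma_e, size=n)
            for _ in range(B)
        ]
    return pseudo

Z_star = rng.normal(size=(50, 2))
pseudo = simex_step1(Z_star, 0.2 * np.eye(2), zetas=[0.0, 0.5, 1.0], B=3)
```

At $\zeta_0 = 0$ the pseudo data coincide with the observed surrogates, so the grid traces how inference degrades as the error variance grows.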
- Step 2: Estimation and selection
We adopt the likelihood function in Section 3.1. Specifically, replacing $z_i$ by $z_{b,i}(\zeta_k)$ gives
By derivations similar to those in Chen (2019a), for given $b$ and $\zeta_k$, the estimators of $\beta$ and $\lambda_0(\cdot)$ are respectively determined by
where the tilde notation denotes an independent copy of the corresponding variable, $K(\cdot)$ is a second-order symmetric kernel function, and $h$ is a positive-valued bandwidth. The bandwidth can be determined by a cross-validation criterion; the detailed derivations can be found in Chen (2019a).
On the other hand, we observe that only (10) involves the distribution function of the truncation time, so to estimate it, it suffices to examine (10). Different from the iteration method in Chen (2019a), here we use the nonparametric maximum likelihood estimator (NPMLE; e.g., Wang 1991) to estimate the distribution function of $A$. For a fixed parameter $\beta$ and given $b$ and $\zeta_k$, the NPMLE of $F(\cdot)$ in (10) is given by
where $\widehat{\Lambda}_0(\cdot)$ is determined in (11).
To perform variable selection, we propose to use penalty functions on $\beta$. Let $p_\gamma(\cdot)$ denote the penalty function and let $\gamma$ be the tuning parameter. There are several choices of penalty function, including the LASSO (Tibshirani 1996), adaptive LASSO (ALASSO; Zou 2006), and SCAD (Fan and Li 2001) methods. The detailed formulations are as follows:
The penalty function based on the LASSO method is given by
$$p_\gamma(\beta) = \gamma \sum_{j=1}^{p} |\beta_j|.$$
The penalty function based on the ALASSO method is given by
$$p_\gamma(\beta) = \gamma \sum_{j=1}^{p} w_j |\beta_j|,$$
where $w = (w_1, \ldots, w_p)^{\top}$ is a vector of weights. Noting that taking $w_j = 1$ for all $j$ recovers the LASSO penalty. As suggested by Zou (2006), to find the weights one may first obtain a consistent initial estimate $\widetilde{\beta}$ of $\beta$ and then take $w_j = 1 / |\widetilde{\beta}_j|$ as the weight for $\beta_j$.
The penalty function based on the SCAD method is given componentwise by
$$p_\gamma(\beta_j) = \begin{cases} \gamma |\beta_j|, & |\beta_j| \le \gamma, \\[4pt] \dfrac{2 a \gamma |\beta_j| - \beta_j^2 - \gamma^2}{2(a - 1)}, & \gamma < |\beta_j| \le a \gamma, \\[4pt] \dfrac{(a + 1)\gamma^2}{2}, & |\beta_j| > a \gamma, \end{cases}$$
where $a > 2$ is a fixed parameter. As suggested by Fan and Li (2001), we let $a = 3.7$.
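For concreteness, the three penalties can be coded directly; this is a generic sketch (hypothetical names) in which `gam` denotes the tuning parameter and the SCAD constant is fixed at $a = 3.7$:

```python
import numpy as np

def lasso_pen(beta, gam):
    """LASSO: gam * sum_j |beta_j|."""
    return gam * np.sum(np.abs(beta))

def alasso_pen(beta, gam, w):
    """Adaptive LASSO: gam * sum_j w_j |beta_j|, with w_j = 1/|beta_tilde_j|
    from a consistent initial estimate beta_tilde."""
    return gam * np.sum(w * np.abs(beta))

def scad_pen(beta, gam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied componentwise and summed."""
    b = np.abs(beta)
    comp = np.where(
        b <= gam,
        gam * b,
        np.where(
            b <= a * gam,
            (2 * a * gam * b - b**2 - gam**2) / (2 * (a - 1)),
            gam**2 * (a + 1) / 2,
        ),
    )
    return np.sum(comp)
```

Note that SCAD is continuous at the knots $|\beta_j| = \gamma$ and $|\beta_j| = a\gamma$ and becomes flat beyond $a\gamma$, which is what reduces the bias for large coefficients relative to LASSO.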
As a result, for the given $b$ and $\zeta_k$, we calculate
In implementing the proposed method, choosing sensible tuning parameters is critical. There is no unique way to select a suitable tuning parameter; methods such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), cross-validation (CV), and generalized cross-validation (GCV) may be considered. As suggested by Wang et al. (2007), BIC tends to outperform the other procedures, especially in settings with a penalized likelihood function. Consequently, we employ the BIC approach to select the tuning parameter $\gamma$. To emphasize the dependence on the tuning parameter, we let $\widehat{\beta}(\gamma)$ denote the estimator obtained from (14). Define
where $\mathrm{df}(\gamma)$ represents the number of non-zero elements in $\widehat{\beta}(\gamma)$ for the given $\gamma$. The optimal tuning parameter, denoted by $\widehat{\gamma}$, is determined by minimizing (15) over a suitable range of $\gamma$. As a result, the estimator of $\beta$ based on (14) is determined by $\widehat{\beta}(\widehat{\gamma})$.
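The BIC tuning step can be sketched as below, where `fit_fn` stands in for a hypothetical routine returning the maximized (pseudo) log-likelihood and the penalized estimate at a candidate tuning value:

```python
import numpy as np

def bic_select(gammas, fit_fn, n):
    """Pick the gamma minimizing BIC(gamma) = -2*loglik + df * log(n),
    where df counts the non-zero coefficients of the penalized fit."""
    best_bic, best_gam, best_beta = np.inf, None, None
    for gam in gammas:
        loglik, beta = fit_fn(gam)
        df = int(np.count_nonzero(beta))
        bic = -2.0 * loglik + df * np.log(n)
        if bic < best_bic:
            best_bic, best_gam, best_beta = bic, gam, beta
    return best_gam, best_beta
```

The `df * log(n)` term penalizes model size more heavily than AIC's `2 * df`, which is why BIC tends to select sparser models in penalized likelihood settings.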
- Step 3: Extrapolation
Based on (14), we define
$$\widehat{\beta}(\zeta_k) = \frac{1}{B} \sum_{b=1}^{B} \widehat{\beta}_{b}(\zeta_k)$$
for any given $\zeta_k$, where $\widehat{\beta}_{b}(\zeta_k)$ is the estimator obtained from (14) for replicate $b$. For $j = 1, \ldots, p$, let $\widehat{\beta}_{j}(\zeta_k)$ denote the $j$th element of $\widehat{\beta}(\zeta_k)$. Then for each $j$ we fit a regression model to the sequence $\{(\zeta_k, \widehat{\beta}_{j}(\zeta_k)) : k = 0, 1, \ldots, M\}$ and extrapolate it to $\zeta = -1$. Let $\widehat{\beta}_{j}$ denote the extrapolated value and denote $\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_p)^{\top}$ as the final estimator of $\beta$.
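The extrapolation for one coefficient can be sketched with the common quadratic extrapolant evaluated at $\zeta = -1$ (Carroll et al. 2006); the toy attenuation pattern below is illustrative only:

```python
import numpy as np

def simex_extrapolate(zetas, estimates):
    """Fit a quadratic in zeta to the averaged estimates of one coefficient
    and extrapolate to zeta = -1 (the error-free limit)."""
    coef = np.polyfit(np.asarray(zetas, float), np.asarray(estimates, float), deg=2)
    return float(np.polyval(coef, -1.0))

zetas = [0.0, 0.5, 1.0, 1.5, 2.0]
est = [1.0 + 0.4 * z for z in zetas]   # toy trend of the estimate in zeta
beta_hat = simex_extrapolate(zetas, est)
```

The choice of extrapolant (linear, quadratic, rational) is a modeling decision; the quadratic form is the default used in the numerical studies below.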
The key idea of the proposed three-stage procedure is to use simulated surrogate measurements to delineate the patterns of different degrees of measurement error on inference results. The first and third stages adopt the simulation-extrapolation (SIMEX) method (Cook and Stefanski 1994; Carroll et al. 2006, Chapter 5) which is applicable to error-contaminated covariates. The second stage of the proposed method undertakes the selection of important variables for settings with different magnitudes of mismeasurement. It is imperative to address the impact of measurement error on variable selection in this step.
3.3 Estimation of the Cumulative Baseline Hazards Function
In this section, we discuss the procedure for estimating the cumulative baseline hazard function after the parameter $\beta$ is estimated in Section 3.2.
Write $\widehat{\beta} = (\widehat{\beta}_z^{\top}, \widehat{\beta}_w^{\top})^{\top}$. For $j = 1, \ldots, p_z$ and $l = 1, \ldots, p_w$, let $\widehat{\beta}_{z,j}$ denote the $j$th component of $\widehat{\beta}_z$ and let $\widehat{\beta}_{w,l}$ denote the $l$th component of $\widehat{\beta}_w$. Let
$$\mathcal{S}_z = \{ j : \widehat{\beta}_{z,j} \ne 0 \} \quad \text{and} \quad \mathcal{S}_w = \{ l : \widehat{\beta}_{w,l} \ne 0 \}$$
denote the sets containing the indices of the non-zero components of the estimators $\widehat{\beta}_z$ and $\widehat{\beta}_w$, respectively. Moreover, define $\mathcal{S} = \mathcal{S}_z \cup \mathcal{S}_w$. Let $\widehat{\beta}_{\mathcal{S}}$ denote the subvector of $\widehat{\beta}$ containing the non-zero elements indexed by $\mathcal{S}$. In addition, let $z_{i,\mathcal{S}_z}$ and $w_{i,\mathcal{S}_w}$ denote the subvectors of $z_i$ and $w_i$ containing the elements indexed by $\mathcal{S}_z$ and $\mathcal{S}_w$, respectively. For given $b$ and $\zeta_k$, replacing $\beta$ by $\widehat{\beta}_{\mathcal{S}}$ in (11) gives
for a given time $t$. Taking the average of (16) with respect to $b$ gives
$$\widehat{\Lambda}_0(t; \zeta_k) = \frac{1}{B} \sum_{b=1}^{B} \widehat{\Lambda}_{0,b}(t; \zeta_k),$$
where $t$ is a given time.
To estimate the cumulative baseline hazard function at a given time point $t$, we adopt Step 3 in Section 3.2 and fit a regression model to $\{(\zeta_k, \widehat{\Lambda}_0(t; \zeta_k)) : k = 0, 1, \ldots, M\}$ through a regression function with the associated parameter denoted by $\Gamma$, i.e.,
$$\widehat{\Lambda}_0(t; \zeta_k) = \mathcal{G}(\zeta_k; \Gamma) + \eta_k$$
with a noise term $\eta_k$; then we extrapolate it to $\zeta = -1$. The resulting value, denoted $\widehat{\Lambda}_0(t)$, is taken as an estimate of $\Lambda_0(t)$.
3.4 Estimation of the Distribution Function of Truncation Time
for a given time $t$, where $\widehat{\Lambda}_{0,b}(\cdot\,; \zeta_k)$ is determined in (16).
Next, taking the average of (19) with respect to $b$ gives
$$\widehat{F}(t; \zeta_k) = \frac{1}{B} \sum_{b=1}^{B} \widehat{F}_{b}(t; \zeta_k)$$
for $k = 0, 1, \ldots, M$, where $t$ is a given time.
Finally, similar to (18), we fit a regression model to $\{(\zeta_k, \widehat{F}(t; \zeta_k)) : k = 0, 1, \ldots, M\}$ and then extrapolate it to $\zeta = -1$. Consequently, the resulting value, denoted $\widehat{F}(t)$, is taken as an estimate of $F(t)$.
4 Numerical Studies
4.1 Model Setting
Let $n$ denote the sample size. Let $\beta_z$ and $\beta_w$ be the true parameters as described in (1), and denote $\beta = (\beta_z^{\top}, \beta_w^{\top})^{\top}$. We consider two settings of the dimensions $(p_z, p_w)$ of $\beta_z$ and $\beta_w$, which determine the overall dimension $p = p_z + p_w$. Let $\mathcal{S}$ denote the set containing the indices of the non-zero elements of $\beta$, and let $|\mathcal{S}|$ be the number of elements in $\mathcal{S}$. The entries of $\beta_z$ and $\beta_w$ are set through a rule involving the Gauss integer (floor) function $[\cdot]$.
Let $\Sigma_x = \mathrm{diag}(\Sigma_z, \Sigma_w)$ denote the covariance matrix of $x = (z^{\top}, w^{\top})^{\top}$, where $\Sigma_z$ and $\Sigma_w$ are, respectively, the $p_z \times p_z$ and $p_w \times p_w$ covariance matrices of $z$ and $w$ with specified entries. The covariates are then generated from the normal distribution $N(0_p, \Sigma_x)$, where $0_p$ is the $p$-dimensional zero vector.
Four model formulations are considered in this simulation study, as follows:
- Model 1:
- Model 2:
- Model 3:
- Model 4:
The observed data are collected by conditioning on $T \ge A$. We repeatedly generate data by these steps until we obtain a sample of the required size $n$. For the measurement error process, we consider model (2) with error $\epsilon \sim N(0, \Sigma_\epsilon)$, where $\Sigma_\epsilon$ is a diagonal matrix whose diagonal entries are taken as one of three prespecified values reflecting increasing degrees of measurement error.
Let $C$ be the censoring time generated from a uniform distribution $U(0, c)$, where $c$ is a constant chosen to yield about a 50% censoring rate. Consequently, $Y$ and $\Delta$ are determined by $Y = \min(T, C)$ and $\Delta = I(T \le C)$, and the sample of size $n$ is $\{(a_i, y_i, \delta_i, z_i^*, w_i) : i = 1, \ldots, n\}$.
In implementing the proposed method, we set the values of $\zeta_k$ by partitioning the interval $[0, \zeta_M]$ into subintervals of equal width 0.25 and taking the resulting cutpoints as the $\zeta_k$. We take the regression function in Step 3 of the proposed method to be a quadratic polynomial, as suggested in Carroll et al. (2006, p. 126).
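For instance, with $\zeta_M = 2$ (a value assumed here for illustration) and width 0.25, the grid of $\zeta$ values is:

```python
import numpy as np

# cutpoints of [0, 2] in steps of 0.25, used as the zeta grid
zetas = np.round(np.arange(0.0, 2.0 + 0.25, 0.25), 2)
```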
Finally, we perform 1000 simulations for each setting.
4.2 Simulation Results
To assess the performance of the estimator of $\beta$, we report several measures: the $L_1$-norm
$$\|\widehat{\beta} - \beta\|_1 = \sum_{j=1}^{p} |\widehat{\beta}_j - \beta_j|$$
and the $L_2$-norm
$$\|\widehat{\beta} - \beta\|_2 = \Big( \sum_{j=1}^{p} (\widehat{\beta}_j - \beta_j)^2 \Big)^{1/2},$$
where $\widehat{\beta}_j$ and $\beta_j$ denote the $j$th components of $\widehat{\beta}$ and $\beta$, respectively. In addition, we calculate the number of correctly selected variables (#S) and the number of falsely excluded variables (#FN).
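These performance measures are straightforward to compute; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def performance(beta_hat, beta_true):
    """L1/L2 estimation errors plus the number of selected variables (#S)
    and the number of falsely excluded variables (#FN)."""
    d = beta_hat - beta_true
    l1 = float(np.sum(np.abs(d)))
    l2 = float(np.sqrt(np.sum(d ** 2)))
    num_selected = int(np.count_nonzero(beta_hat))
    num_false_neg = int(np.sum((beta_hat == 0) & (beta_true != 0)))
    return l1, l2, num_selected, num_false_neg
```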
For Models 1-4, we compare the performance of the estimators obtained from applying the proposed method to the surrogate covariates with that of the estimators obtained from fitting the data with the true covariate measurements. We examine the three different penalty functions discussed in Section 3.2: the LASSO, ALASSO, and SCAD methods. For comparison, we also examine the naive estimator of $\beta$, denoted $\widehat{\beta}_{\mathrm{naive}}$, which is derived by directly using the observed surrogate covariates in (4).
In Tables 1-4, we report the numerical results of our proposed method and the naive approach, as well as those obtained from the true covariate measurements. As expected, the results obtained from the true covariate measurements are the best, with the smallest norms under all settings. Regarding the performance of the proposed method with the three different penalty functions, the ALASSO and SCAD tend to slightly outperform the LASSO in terms of specificity and finite sample biases, as indicated by the $L_1$-norm and $L_2$-norm. In terms of correctly selecting variables, the LASSO method includes more variables than the ALASSO and SCAD methods. All methods perform equally well in terms of falsely excluded variables and sensitivity, producing nearly perfect results. Furthermore, the naive method performs unsatisfactorily, producing considerable finite sample biases and unreliable variable selection and exclusion results.
4.3 Analysis of the Worcester Heart Attack Study (WHAS500) Data
In this section, we apply the proposed method to analyze the data arising from the Worcester Heart Attack Study (WHAS500), described in Section 1. Specifically, as discussed by Hosmer, Lemeshow, and May (2008), the beginning of a survival time was defined as the time the subject was admitted to a hospital. The main interest is in the survival times of patients who were discharged alive from hospitals. Hence, a selection criterion was imposed that only those subjects who were discharged alive were eligible to be included in the analysis. That is, their minimum survival time would be the length of their hospital stay; individuals whose failure times did not exceed this minimum survival time were not enrolled in the analysis, and hence left-truncation occurs. With this criterion, a sample of size 461 was selected and the truncation rate was approximately 7.8%. More specifically, the total length of follow-up (lenfol) is the observed time $y_i$, the length of hospital stay (los) is the truncation time $a_i$, and the vital status at last follow-up (fstat) is the failure indicator $\delta_i$. These 461 patients contribute measurements satisfying the constraint $y_i \ge a_i$. In this dataset, the censoring rate is 61.8%.
The following covariates are included in our analysis: initial heart rate (hr), initial systolic blood pressure (sysbp), initial diastolic blood pressure (diasbp), body mass index (bmi), history of cardiovascular disease (cvd), atrial fibrillation (afb), cardiogenic shock (sho), age at hospital admission (age), gender (gender), congestive heart complications (chf), complete heart block (av3), MI order (miord), and MI type (mitype). As indicated by Bauldry et al. (2015) and Rothman (2008), it is reasonable to assume that three of the continuous covariates are subject to mismeasurement for reasons including inaccurate measurement devices and/or procedures, biological variability, and temporal variation. In this dataset, we have $p_z = 3$ and $p_w = 10$, yielding $p = 13$.
Since this dataset contains no additional information, such as repeated measurements or validation data, to characterize the measurement error process, we conduct sensitivity analyses to investigate the measurement error effects. Specifically, let $\widehat{\Sigma}_z$ be the sample covariance matrix of the observed error-prone covariates, and for the sensitivity analyses we take $\Sigma_\epsilon = \rho\, \mathrm{diag}(\widehat{\Sigma}_z)$ as the covariance matrix in the measurement error model (2), where $\rho$ is specified as one of three increasing values reflecting minor, moderate, and severe measurement error.
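The sensitivity-analysis covariance matrices can be built directly from the sample variances of the observed error-prone covariates; the $\rho$ grid and data below are illustrative stand-ins, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(3)

def sensitivity_sigmas(Z_obs, rhos=(0.25, 0.50, 0.75)):
    """Sigma_e(rho) = rho * diag(sample variances of the observed
    error-prone covariates), one matrix per assumed error magnitude rho."""
    v = np.var(Z_obs, axis=0, ddof=1)
    return {rho: np.diag(rho * v) for rho in rhos}

Z_obs = rng.normal(size=(461, 3))  # stand-in for the 3 error-prone covariates
sigmas = sensitivity_sigmas(Z_obs)
```

Scaling by the observed variances keeps the assumed error magnitude interpretable as a fraction of each covariate's total variability.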
Table 5 summarizes the estimates and the variable selection results of both the proposed and the naive methods. We first observe that the LASSO method selects more variables than the SCAD and ALASSO methods. The proposed method with the three different penalty functions gives robust variable selection results regardless of the degree of the error effect. It is interesting to see that the ALASSO and SCAD methods select the same variables regardless of the assumed error magnitude, which indicates that ALASSO and SCAD are highly recommended for this analysis. Compared with the proposed method, the naive method selects more variables. In addition, several variables are commonly selected by both methods, including sysbp, diasbp, bmi, afb, av3, miord, and mitype.
5 Discussion

Variable selection and estimation for survival data are important topics that attract much attention. Even though several methods have been proposed to deal with these problems, there has been little work addressing these two complex features simultaneously in inferential procedures. In this paper, we develop a three-stage procedure to simultaneously correct for error-prone variables, select variables, and estimate the parameters of main interest. We demonstrate satisfactory finite sample performance of our method using simulation studies.
One advantage is that the proposed method is based on the pseudo-likelihood approach, which produces a more robust and efficient estimator (Chen 2019a). In addition, the proposed three-stage procedure can be naturally extended to other models, implying that it provides a flexible approach to dealing with different situations.
Finally, even though we have developed a valid method for this complex setting, some challenges and extensions remain. For example, mismeasurement in discrete covariates may occur, which is known as the misclassification problem. It would be interesting to explore misclassification, or even a mixture of measurement error and misclassification. In addition, although we discuss high-dimensional data analysis, we only consider the case $p < n$; the ultrahigh-dimensional setting, i.e., $p \gg n$, is also an important topic. Moreover, even though we provide no theoretical results for the proposed method in the current manuscript, the numerical results show satisfactory performance, including precise estimation and high accuracy of variable selection. Exploring theoretical results for the proposed method, including consistency and the oracle property, is important future work.
Conflict of Interest
The authors declare no conflict of interest.
|Method||Result of proposed estimator||Result of naive estimator|