1.1 Literature Review
In this paper, our main interest is survival data with cure. In such data, there exists a group of subjects who are cured and never experience the failure event (death) during the study period. In an early discussion of right-censored survival data with cure, Lu and Ying (2004) considered a semiparametric model.
In recent developments of cure models, left-truncation and measurement error are two important features that attract our attention. Left-truncation yields a biased sample in survival data, and measurement error, if ignored, can induce substantial bias in the estimator. These two features undoubtedly make the analysis challenging.
In the existing literature, Chen et al. (2017) proposed a conditional likelihood function that accounts for left-truncation but not for measurement error in covariates. In the absence of left-truncation, Ma and Yin (2008) considered the Cox model and introduced a corrected score approach to deal with measurement error in the covariates, but their method can only handle the linear term of the covariate. To give a more flexible method, Bertrand et al. (2017) implemented the simulation-extrapolation (SIMEX) method, which can be used for any function of the covariates.
In many practical situations, these two features appear in the dataset simultaneously, which makes the analysis complicated and challenging. To the best of our knowledge, there is no method for analyzing survival data with cure that incorporates both features. In this paper, we explore this important problem. We consider the transformation model, which includes the Cox model as a special case.
1.2 Notation and Models
Let be the calendar time of the recruitment and let and denote the calendar time of the initiating event (or the disease incidence) and the failure event, respectively, where , and . Then for those uncured subjects, let be the failure time, and let denote the truncation time. Let denote the residual censoring time which is measured from to censoring. With both cured and uncured subjects, the failure time is determined by , where indicates whether a subject is cured or not . To characterize
, we consider a logistic regression model
where is a
-dimensional vector of covariates associated with model (1), and is a -dimensional vector of parameters. For subjects who are not cured, we consider the transformation model, which is given by
where is an unknown increasing function,
is a random variable with a known distribution, is a -dimensional vector of covariates, and is a -dimensional vector of parameters. Model (2) covers a broad class of frequently used models in survival analysis. Specifically, when has an extreme value distribution, then follows the proportional hazards (PH) model; whereas when has a logistic distribution, then
follows the proportional odds (PO) model.
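The two special cases can be verified with a short derivation. The sketch below assumes the sign convention $H(T) = -\beta^\top X + \epsilon$ of Lu and Ying (2004) and writes $S_\epsilon(s) = P(\epsilon > s)$; the symbols $H$, $\beta$, $X$, and $\epsilon$ are our illustrative notation and may differ from that of model (2).

```latex
% Survival function implied by the transformation model (H increasing):
\begin{align*}
P(T > t \mid X) &= P\{H(T) > H(t) \mid X\}
                 = P\{\epsilon > H(t) + \beta^\top X\}
                 = S_\epsilon\{H(t) + \beta^\top X\}. \\
% Extreme value error, S_eps(s) = exp(-e^s): multiplicative cumulative hazard (PH)
\text{Extreme value: } S_\epsilon(s) = \exp(-e^{s})
  &\;\Longrightarrow\; -\log P(T > t \mid X) = e^{H(t)} \exp(\beta^\top X). \\
% Logistic error, S_eps(s) = 1/(1+e^s): multiplicative failure odds (PO)
\text{Logistic: } S_\epsilon(s) = \frac{1}{1 + e^{s}}
  &\;\Longrightarrow\; \frac{P(T \le t \mid X)}{P(T > t \mid X)} = e^{H(t)} \exp(\beta^\top X).
\end{align*}
```

In the first case $e^{H(t)}$ plays the role of the baseline cumulative hazard, and in the second case it plays the role of the baseline odds of failure.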
Let denote the observed failure time, truncation time, and two covariates which satisfy . That is, . For a recruited subject, define and .
In practice, the covariate cannot be measured precisely and instead we only have an observed covariate . To characterize the relationship between and , the classical linear measurement error model is frequently used, which is given by
follows the normal distribution with mean zero and covariance matrix, and is independent of . If is unknown, then it can be estimated from additional information, such as repeated measurements or validation data (e.g., Carroll et al. (2006)). To focus on presenting our proposed method and to ease the discussion, we assume that is known.
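As an illustration of the classical additive error structure, the following minimal sketch generates a surrogate covariate by adding independent mean-zero normal noise to the true covariate. The function name `add_measurement_error` and the symbols `x`, `w`, and `sigma_e` are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def add_measurement_error(x, sigma_e, rng):
    """Return the surrogate w = x + e, where e ~ N(0, sigma_e) is independent of x.

    x       : (n, p) array of true covariates (hypothetical notation)
    sigma_e : (p, p) covariance matrix of the measurement error, assumed known
    """
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    e = rng.multivariate_normal(np.zeros(p), sigma_e, size=n)
    return x + e

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))            # true covariates
sigma_e = 0.25 * np.eye(2)                # assumed known error covariance
w = add_measurement_error(x, sigma_e, rng)

# The surrogate is more variable than the truth by roughly sigma_e.
cov_gap = np.cov(w, rowvar=False) - np.cov(x, rowvar=False)
```

Under this structure the covariance of the surrogate exceeds that of the true covariate by approximately the error covariance, which is what the last line checks empirically.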
1.3 Organization of This Paper
The remainder of this paper is organized as follows. In Section 2, we first present the proposed method for correcting the measurement error effect and derive the estimator. We then develop the theoretical results for the proposed method. Numerical results are provided in Section 3. Finally, we conclude the paper with a discussion in Section 4.
2 Main Results
2.1 Corrected Estimating Equations
Suppose that we have an observed sample of subjects where for , has the same distribution as . Let and for .
As presented in Section 1, the covariate is usually unobservable, and instead, we only observe . To deal with the mismeasurement and reduce the bias of the estimator, we propose using the simulation-extrapolation (SIMEX) method (e.g., Cook and Stefanski (1994)). The proposed procedure consists of the following stages:
- Stage 1
Let be a given positive integer and let be a sequence of pre-specified values with , where is a positive integer and is a pre-specified positive number such as .
For a given subject with and , we generate from . Then for the observed vector of covariates , we define as
for every . Therefore, the conditional distribution of given is .
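The simulation step can be sketched as follows. This is a minimal illustration that assumes the standard SIMEX remeasurement form, in which pseudo-errors drawn from a mean-zero normal distribution with the known error covariance are scaled by the square root of each grid value and added to the observed covariate. The names `simex_remeasure`, `zetas`, and `n_rep` are hypothetical.

```python
import numpy as np

def simex_remeasure(w, sigma_e, zetas, n_rep, rng):
    """Generate remeasured covariates w + sqrt(zeta) * e_b for each zeta and replicate b.

    w       : (n, p) observed, error-prone covariates
    sigma_e : (p, p) known measurement error covariance
    zetas   : grid of non-negative inflation values (first value usually 0)
    n_rep   : number of simulated replicates per grid value
    Returns an array of shape (len(zetas), n_rep, n, p).
    """
    w = np.asarray(w, dtype=float)
    zetas = np.asarray(zetas, dtype=float)
    n, p = w.shape
    # Pseudo-errors are drawn once per replicate and reused across the grid.
    e = rng.multivariate_normal(np.zeros(p), sigma_e, size=(n_rep, n))
    scale = np.sqrt(zetas)[:, None, None, None]
    return w[None, None, :, :] + scale * e[None, :, :, :]

rng = np.random.default_rng(1)
w = rng.normal(size=(100, 2))
zetas = np.linspace(0.0, 1.0, 5)
w_sim = simex_remeasure(w, 0.25 * np.eye(2), zetas, n_rep=50, rng=rng)
```

With this construction the added noise has covariance equal to the grid value times the error covariance, so the total error variance in the remeasured covariate grows linearly along the grid.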
- Stage 2
By derivations similar to those in Lu and Ying (2004), for left-truncated survival data we have
where is the cumulative hazard function of , , and . Taking the negative logarithm of (5) gives
By counting process techniques (e.g., Andersen et al. (1993)), we define
which is a martingale process with . Then based on (7), we have two estimating equations (EE):
Let be a -dimensional vector of parameters. Solving (8) yields the estimator of when both and are fixed, which is denoted by . However, (9) only gives the estimator of . To derive the estimator of , we need to develop the third estimating equation based on . We consider the conditional probability
and by a derivation similar to that of Equation (9) in Lu and Ying (2004), we have the unbiased estimating equation for :
- Stage 3
By (13), we have a sequence . Then we fit a regression model to the sequence
where is the user-specified regression function, is the associated parameter, and is the noise term. The parameter can be estimated by the least squares method, and we let denote the resulting estimate of .
Finally, we calculate the predicted value
and take as the SIMEX estimator of .
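The extrapolation step can be sketched as follows, assuming a polynomial regression function fitted by least squares and the usual SIMEX convention of predicting at the grid value -1, which corresponds to no measurement error. The function name `simex_extrapolate` and its arguments are illustrative only.

```python
import numpy as np

def simex_extrapolate(zetas, estimates, degree=2, target=-1.0):
    """Fit a polynomial in zeta to each parameter component and predict at `target`.

    zetas     : (M,) grid values used in the simulation step
    estimates : (M, d) averaged estimates, one row per grid value
    degree    : polynomial degree (2 gives the quadratic extrapolant)
    Returns a (d,) array with the extrapolated estimate.
    """
    zetas = np.asarray(zetas, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    if estimates.ndim == 1:
        estimates = estimates[:, None]
    # Least squares fit; coefs has shape (degree + 1, d), highest power first.
    coefs = np.polyfit(zetas, estimates, deg=degree)
    powers = target ** np.arange(degree, -1, -1)
    return powers @ coefs

# Toy check with an exactly quadratic trend: 2 - 0.5 * zeta + 0.1 * zeta**2.
zetas = np.linspace(0.0, 2.0, 9)
toy = 2.0 - 0.5 * zetas + 0.1 * zetas**2
extrapolated = simex_extrapolate(zetas, toy)
```

For the toy trend the fitted quadratic reproduces the curve exactly, so the extrapolated value equals the curve evaluated at -1. In practice the trend is only approximated, which is why the choice of regression function matters.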
- Stage 4
Furthermore, we can also derive the estimator of the unknown function . To do this, we first replace by in , which gives . For every and , taking the average with respect to gives . Finally, as in Stage 3 above, fitting a regression model and taking as the predicted value yields the final estimator , also denoted as .
2.2 Theoretical Results
In this section, we present the theoretical results of the proposed method. We first define some notation. Let denote the true value of the parameter , and let denote the true function of . Let . For , define
We further define
We now present the theoretical results of and in the following theorem.
3 Numerical Study
3.1 Simulation Setup
We examine the setting where is generated from the extreme value distribution and the logistic distribution, and the truncation time
is generated from the exponential distribution with mean one. Let denote a two-dimensional vector of parameters, and let be the vector of true parameters where we set . We consider a scenario where
are generated from a bivariate normal distribution with mean zero and variance-covariance matrix, which is set as . Given , and , the failure time is generated from the model:
Based on our two settings of , the failure time follows the PH model and the PO model, respectively. On the other hand, is generated by (1), and hence, the failure time with cure is determined by . Therefore, the observed data are collected from by conditioning on . We repeat these data generation steps until we obtain a sample of the required size . For the measurement error process, we consider model with error , where is a scalar which is taken as , , and , respectively.
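The data generation scheme can be sketched as follows. This is a minimal illustration under several assumptions of ours that the text does not fix: the transformation is taken to be the logarithm, the sign convention is log T = -beta * x + eps, the logistic model has an intercept and gives the probability of being uncured, a subject is retained when the failure time is at least the truncation time, and censoring is omitted. The function name and all argument names are hypothetical.

```python
import numpy as np

def generate_truncated_cure_sample(n, beta, gamma, cov, error="extreme", rng=None):
    """Draw n left-truncated subjects from a transformation cure model.

    Columns of the returned (n, 5) array: failure time (inf if cured),
    truncation time, covariate x, covariate z, uncured indicator.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows = []
    while len(rows) < n:
        x, z = rng.multivariate_normal([0.0, 0.0], cov)
        if error == "extreme":
            eps = np.log(rng.exponential(1.0))   # P(eps > s) = exp(-e^s): PH model
        else:
            eps = rng.logistic()                 # standard logistic: PO model
        t_uncured = np.exp(-beta * x + eps)      # assumes H(t) = log t
        p_uncured = 1.0 / (1.0 + np.exp(-(gamma[0] + gamma[1] * z)))
        uncured = rng.random() < p_uncured
        t = t_uncured if uncured else np.inf     # cured subjects never fail
        a = rng.exponential(1.0)                 # truncation time, mean one
        if t >= a:                               # keep only subjects observed under truncation
            rows.append((t, a, x, z, float(uncured)))
    return np.array(rows)

sample = generate_truncated_cure_sample(
    200, beta=1.0, gamma=(0.5, 1.0), cov=[[1.0, 0.5], [0.5, 1.0]],
    rng=np.random.default_rng(2),
)
```

Because cured subjects have an infinite failure time, they always pass the truncation check, so the retained sample over-represents cured and long-surviving subjects. This is the sampling bias that the conditional approach is designed to address.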
We consider two censoring rates, say 25% and 50%, and let the censoring time
be generated from the uniform distribution, where is determined by a given censoring rate. Consequently, and are determined by and . In implementing the proposed method, we set and partition the interval into subintervals with width , and let the resulting cutpoints be the values of . We take the regression function to be the quadratic polynomial function, which is widely used in the literature (e.g., Cook and Stefanski 1994; Carroll et al. 2006). Finally, 1000 simulations are run for each parameter setting.
3.2 Simulation Results
We mainly examine the performance of the proposed method, which is denoted by Chen (). In addition, to see the impact of the measurement error in the covariate, we examine the naive estimator, which is obtained by using in the estimating equations instead of , and the naive estimator is denoted by Naive (). We report the biases of the estimates (Bias), the empirical variances (Var), the mean squared errors (MSE), and the coverage probabilities (CP) of those two estimators. The results are reported in Table 1.
First, the censoring rate and the degree of measurement error have a noticeable impact on each estimation method. As expected, biases and variance estimates increase as the censoring rate increases. When the degree of measurement error increases, the biases of both and increase, and the impact of the degree of measurement error is more pronounced for the naive estimator .
Within a setting with a given censoring rate and a given degree of measurement error, the naive method and the proposed method perform differently. When measurement error is present, the proposed method performs better than the naive method. The naive method produces considerable finite sample biases, with coverage rates of 95% confidence intervals departing substantially from the nominal level. The proposed method yields satisfactory estimates with small finite sample biases and reasonable coverage rates of 95% confidence intervals. Compared to the variance estimates produced by the naive approach, the proposed method, which accounts for measurement error effects, yields larger variance estimates, and this is the price paid to remove biases in point estimators. This phenomenon is typical in the literature on measurement error models. However, the mean squared errors produced by the proposed method tend to be considerably smaller than those obtained from the naive method.
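The reported summary measures can be computed from the replicated estimates as in the following sketch. It assumes that each replicate supplies a point estimate and a model-based standard error, and that coverage is assessed with normal-approximation 95% intervals. The function name `summarize_replicates` and the toy numbers are illustrative, not simulation output.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, truth, z=1.96):
    """Compute Bias, Var, MSE, and CP for one scalar parameter across replicates."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth                    # empirical mean minus true value
    var = estimates.var(ddof=1)                        # empirical variance
    mse = np.mean((estimates - truth) ** 2)            # mean squared error
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    cp = np.mean((lower <= truth) & (truth <= upper))  # coverage of the Wald interval
    return {"Bias": bias, "Var": var, "MSE": mse, "CP": cp}

# Toy replicates for a parameter whose true value is 1.0.
summary = summarize_replicates([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

The mean squared error combines the squared bias and the variance, which is why a method with a larger variance can still have a smaller MSE when its bias is much smaller.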
4 Discussion
In this article, we focus on the transformation model for cure survival data with left-truncation, develop a valid method to correct for covariate measurement error, and derive an efficient estimator. We also establish the large sample properties, and the numerical results indicate that the proposed method outperforms the naive approach. Although we only consider a simple structure for the measurement error model and assume that is precisely measured, our method can readily be extended to more complex measurement error models or to settings with additional information, such as repeated measurements or validation data, and can also allow in (1) to be mismeasured. In addition, many challenges remain in this area, such as the treatment of time-dependent covariates subject to mismeasurement. These topics are left for future research.
|model||cr||Method||Estimator of||Estimator of|
- usage of the true covariate ;
cr - censoring rate;
Bias - Difference between empirical mean and true value;
Var - Empirical variance;
MSE - Mean square error;
MVE - Model-based variance;
CP - Model-based coverage probability.
Appendix A Regularity Conditions
(C1) is a compact set, and the true parameter value is an interior point of .
(C2) Let be the finite maximum support of the failure time.
(C3) The are independent and identically distributed for .
(C4) The covariates and are bounded.
(C5) Conditional on the covariates and , is independent of .
(C6) Censoring time is non-informative. That is, the failure time and the censoring time are independent, given the covariates .
(C7) The regression function is true, and its first order derivative exists.
Condition (C1) is a basic condition that is used to derive the maximizer of the target function. Conditions (C2) to (C6) are standard in survival analysis; they allow us to obtain sums of i.i.d. random variables and hence to derive the asymptotic properties of the estimators. Condition (C7) is a common assumption in the SIMEX method.
Appendix B Proof of Theorem 2.1
Proof of Theorem 2.1 (1):
and let denote a solution of . Since is a solution of
. By the Uniform Law of Large Numbers (e.g., van der Vaart (1998)), we have that converges uniformly to . Then we have that as ,
for every . By (B.3), we can show that as ,
Since , by the continuous mapping theorem, we have that as ,
Proof of Theorem 2.1 (2):
By (B.5), we have for every , b, and . Taking the average with respect to gives . On the other hand, by the Uniform Law of Large Numbers and derivations similar to those in Lu and Ying (2004) with , we have that as , for all . Therefore, we conclude that as , by the fact that .
Proof of Theorem 2.1 (3):
For and , applying a Taylor series expansion to (B.1) around gives
where is the first -dimensional components of and is the remaining -dimensional components of . Thus can be derived as a sum of i.i.d. random functions, which is given by
for , where .
Let denote the vectorization of estimator with every
. Applying the Central Limit Theorem to (B.10), we have that as ,
where . By a Taylor series expansion of with respect to , we have
Proof of Theorem 2.1 (4):
We first consider the expression of . By the Taylor series expansion with respect to , we have
where the third term is due to (B.13) and is the convergent function of .