Semiparametric Estimation for Cure Survival Model with Left-Truncated and Right-Censored Data and Covariate Measurement Error

12/28/2018 ∙ by Li-Pang Chen, et al. ∙ University of Waterloo

In this paper, we mainly discuss the cure model with survival data. Different from the usual survival data with right-censoring, we incorporate the features of left-truncation and measurement error in covariates. Generally speaking, left-truncation causes a biased sample in survival analysis; measurement error in covariates may incur a tremendous bias if we do not deal with it properly. To deal with these challenges, we propose a flexible way to analyze left-truncated survival data and correct measurement error in covariates. The theoretical results are also established in this paper.




1 Introduction

1.1 Literature Review

In this paper, our main interest is survival data with cure. In such data, there exists a group of subjects who are cured and never experience the failure event (death) during the study period. In the early discussion of right-censored survival data, Lu and Ying (2004) considered the semiparametric model.

In recent developments of the cure model, left-truncation and measurement error are two important features that attract our attention. Left-truncation produces a biased sample in survival data, and measurement error incurs a tremendous bias in the estimator if it is ignored. These two features undoubtedly make the analysis challenging.

In the past literature, Chen et al. (2017) proposed the conditional likelihood function based on left-truncation but without measurement error in covariates. In the absence of left-truncation, Ma and Yin (2008) considered the Cox model and introduced a corrected score approach to deal with measurement error in the covariates, but their method can only deal with the linear term of the covariate. To give a more flexible method, Bertrand et al. (2017) implemented the simulation-extrapolation (SIMEX) method, which can be used for any function of the covariates.

In many practical situations, these two features may appear in a dataset simultaneously, which makes the analysis complicated and challenging. To the best of our knowledge, there is no method to analyze survival data with both features incorporated. In this paper, we explore this important problem. We consider the transformation model, which includes the Cox model as a special case.

1.2 Notation and Models

Let be the calendar time of the recruitment and let and denote the calendar time of the initiating event (or the disease incidence) and the failure event, respectively, where , and . Then for those uncured subjects, let be the failure time, and let denote the truncation time. Let denote the residual censoring time which is measured from to censoring. With both cured and uncured subjects, the failure time is determined by , where indicates whether a subject is cured or not. To characterize , we consider a logistic regression model


where is a -dimensional vector of covariates associated with model (1), and is a -dimensional vector of parameters. For subjects who are not cured, we consider the transformation model, which is given by


where is an unknown increasing function, is a random variable with a known distribution, is a -dimensional vector of covariates, and is a -dimensional vector of parameters. Model (2) covers a broad class of frequently used models in survival analysis. Specifically, when has an extreme value distribution, follows the proportional hazards (PH) model, whereas when has a logistic distribution, follows the proportional odds (PO) model.
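As a minimal sketch of how the transformation model nests the PH and PO models, the following Python snippet draws failure times for uncured subjects. The choice H(t) = log t, the sign convention H(T) = -x'beta + eps, and the function name are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def draw_failure_times(x, beta, error="extreme_value", rng=None):
    """Draw failure times from H(T) = -x'beta + eps with H(t) = log t (assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n = x.shape[0]
    if error == "extreme_value":
        # eps = log(E) with E ~ Exp(1), so P(eps > s) = exp(-exp(s)): PH model
        eps = np.log(rng.exponential(1.0, size=n))
    elif error == "logistic":
        # standard logistic error: PO model
        eps = rng.logistic(0.0, 1.0, size=n)
    else:
        raise ValueError("error must be 'extreme_value' or 'logistic'")
    # Invert H(t) = log t to obtain T
    return np.exp(-x @ np.asarray(beta, dtype=float) + eps)
```

Under these assumptions the extreme value error gives survival function exp(-t exp(x'beta)), so the hazard is proportional in the covariates, while the logistic error gives failure odds t exp(x'beta), so the odds are proportional.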

Let denote the observed failure time, truncation time, and two covariates which satisfy . That is, . For a recruited subject, define and .

In practice, the covariate cannot be measured precisely, and instead we only have an observed covariate . To characterize the relationship between and , the classical linear measurement error model is frequently used, which is given by



follows the normal distribution with mean zero and covariance matrix , and is independent of . If is unknown, then it can be estimated from additional information, such as repeated measurements or validation data (e.g., Carroll et al. (2006)). To focus on presenting the proposed method and to ease the discussion, we assume that is known.
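When the error covariance matrix is unknown and two replicate measurements are available for each subject, a simple moment estimator follows from the fact that the difference of two replicates has covariance equal to twice the error covariance. The sketch below assumes the classical additive model with independent replicate errors; the function name and the replicate design are hypothetical and serve only as an illustration.

```python
import numpy as np

def error_cov_from_replicates(w1, w2):
    """Moment estimate of the error covariance from two replicates per subject.

    Assumes w_k = x + e_k with independent errors e_k having covariance Sigma.
    Then w1 - w2 = e1 - e2 has covariance 2 * Sigma, so Sigma = Cov(w1 - w2) / 2.
    """
    d = np.asarray(w1, dtype=float) - np.asarray(w2, dtype=float)
    # rowvar=False: rows are subjects, columns are covariate components
    return np.atleast_2d(np.cov(d, rowvar=False)) / 2.0
```

The true covariate cancels in the difference, so no assumption on its distribution is needed for this estimate.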

1.3 Organization of This Paper

The remainder is organized as follows. In Section 2, we first present the proposed method to correct the error effect and derive the estimator. After that, we develop the theoretical result for the proposed method. Numerical results are provided in Section 3. Finally, we conclude the paper with discussions in Section 4.

2 Main Results

2.1 Corrected Estimating Equations

Suppose that we have an observed sample of subjects where for , has the same distribution as . Let and for .

As presented in Section 1, the covariate is usually unobservable, and instead, we only observe . To deal with the mismeasurement and reduce the bias of the estimator, we adopt the simulation-extrapolation (SIMEX) method (e.g., Cook and Stefanski (1994)). The proposed procedure proceeds in the following stages:

Stage 1

Let be a given positive integer and let be a sequence of pre-specified values with , where is a positive integer and is a pre-specified positive number such as .

For a given subject with and , we generate from . Then for observed vector of covariates , we define as


for every . Therefore, the conditional distribution of given is .
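The simulation step of Stage 1 can be sketched as follows, assuming the standard SIMEX construction of Cook and Stefanski (1994), in which simulated normal noise with the known error covariance is scaled by the square root of the inflation level and added to the observed covariate. The function name, argument names, and array layout are hypothetical.

```python
import numpy as np

def simex_remeasure(w, sigma, zeta, n_rep, rng=None):
    """Return an array of shape (n_rep, n, p) holding W + sqrt(zeta) * U_b.

    w     : (n, p) observed (error-prone) covariates
    sigma : (p, p) known error covariance matrix
    zeta  : non-negative inflation level
    n_rep : number of simulated replicates per subject
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    n, p = w.shape
    # U_b ~ N(0, sigma), drawn independently for each replicate and subject
    u = rng.multivariate_normal(np.zeros(p), sigma, size=(n_rep, n))
    return w[None, :, :] + np.sqrt(zeta) * u
```

Under the classical additive model, the added noise raises the error covariance of the remeasured covariate from sigma to (1 + zeta) * sigma, which is what makes the later extrapolation to zeta = -1 meaningful.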

Stage 2

By derivations similar to those in Lu and Ying (2004), under left-truncated survival data, we have


where is the cumulative hazard function of , , and . Taking the negative logarithm of (5) gives


By the counting process techniques (e.g., Andersen et al. (1993)), we define


which is a martingale process with . Then based on (7), we have two estimating equations (EE):




Let be a -dimensional vector of parameters. Solving (8) yields the estimator of when both and are fixed, which is denoted by . However, (9) only gives the estimator of . To derive the estimator of , we need to develop the third estimating equation based on . We consider the conditional probability

and by a derivation similar to that of Equation (9) in Lu and Ying (2004), we have the unbiased estimating equation for



Replacing in (9) and (10) by gives the following two estimating equations




for every and . Let denote the solution of two estimating equations and . Moreover, we define

Stage 3

By (13), we have a sequence . Then we fit a regression model to the sequence


where is the user-specified regression function, is the associated parameter, and is the noise term. The parameter can be estimated by the least squares method, and we let denote the resulting estimate of .

Finally, we calculate the predicted value


and take as the SIMEX estimator of .
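The extrapolation step of Stage 3 can be sketched as follows, assuming a polynomial regression function (quadratic by default) fitted by least squares and evaluated at the inflation level -1, as in the standard SIMEX method. The function and argument names are hypothetical.

```python
import numpy as np

def simex_extrapolate(zetas, estimates, degree=2, target=-1.0):
    """Fit a polynomial in zeta to each parameter component and evaluate at target.

    zetas     : (M,) grid of inflation levels
    estimates : (M,) or (M, K) averaged estimates at each level
    """
    zetas = np.asarray(zetas, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    if estimates.ndim == 1:
        estimates = estimates[:, None]
    # np.polyfit fits every column of a 2-D response by least squares;
    # coefficients are returned with the highest power first
    coefs = np.polyfit(zetas, estimates, deg=degree)
    powers = target ** np.arange(degree, -1, -1)
    return powers @ coefs

# Usage: estimates that are exactly quadratic in zeta are recovered at zeta = -1
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
curve = 1.0 - 0.3 * (grid + 1.0) + 0.05 * (grid + 1.0) ** 2
print(simex_extrapolate(grid, curve))
```

The quality of the final estimate depends on how well the chosen regression function describes the trend in the estimates, which is why the regression function is required to be correctly specified in the theory.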

Stage 4

Estimation of
Furthermore, we can also derive the estimator of the unknown function . To do this, we first replace by in , which gives . For every and , averaging with respect to gives . Finally, similar to Stage 3 above, fitting a regression model and taking as a predicted value yields a final estimator , also denoted as .

2.2 Theoretical Results

In this section, we present the theoretical results of the proposed method. We first define some notation. Let denote the true value of the parameter , and let denote the true function of . Let . For , define

We further define

We now present the theoretical results of and in the following theorem.

Theorem 2.1

Under the regularity conditions in Appendix A, estimators and have the following properties:

  • as ;

  • as ;

  • as ;

  • converges to the Gaussian process with mean zero and covariance function ,

where the exact formulations of and are given in Appendix B.

3 Numerical Study

3.1 Simulation Setup

We examine the setting where is generated from the extreme value distribution and the logistic distribution, and the truncation time is generated from the exponential distribution with mean one. Let denote a two-dimensional vector of parameters, and let be the vector of true parameters where we set . We consider a scenario where are generated from a bivariate normal distribution with mean zero and variance-covariance matrix , which is set as . Given , and , the failure time is generated from the model:

Based on our two settings of , the failure time follows the PH model and the PO model, respectively. On the other hand, is generated by (1), and hence, the failure time with cure is determined by . Therefore, the observed data are collected from by conditioning on . We repeat these steps until we obtain a sample of the required size . For the measurement error process, we consider model with error , where is a scalar which is taken as 0.01, 0.50, and 0.75, respectively.
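The sampling scheme described above can be sketched as an accept-reject loop: a subject is generated, and it enters the sample only if its failure time (infinite for a cured subject) is at least its truncation time. In this sketch, the PH case with H(t) = log t, the use of one standard normal covariate vector in both the cure model and the transformation model, the convention that the logistic model gives the probability of being uncured, and all names are assumptions for illustration; the paper's exact settings are not reproduced here.

```python
import numpy as np

def sample_left_truncated(n, gamma, beta, rng=None):
    """Collect n subjects whose failure time is at least the truncation time.

    Returns an array with columns (failure time, truncation time,
    uncured indicator, covariates...).
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)
    kept = []
    while len(kept) < n:
        x = rng.normal(size=beta.shape[0])            # covariates (assumed N(0, I))
        p_uncured = 1.0 / (1.0 + np.exp(-(x @ gamma)))
        uncured = rng.random() < p_uncured            # logistic cure model
        # PH case: log T* = -x'beta + eps with eps = log of an Exp(1) draw
        t_star = np.exp(-(x @ beta) + np.log(rng.exponential(1.0)))
        t = t_star if uncured else np.inf             # cured subjects never fail
        a = rng.exponential(1.0)                      # truncation time, mean one
        if t >= a:                                    # left-truncation: keep survivors
            kept.append((t, a, float(uncured), *x))
    return np.array(kept)
```

Because cured subjects always satisfy the truncation condition, they are over-represented in the retained sample relative to the underlying population, which is one source of the sampling bias that left-truncation induces.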

We consider two censoring rates, 25% and 50%, and let the censoring time be generated from the uniform distribution , where is determined by a given censoring rate. Consequently, and are determined by and . In implementing the proposed method, we set and partition the interval into subintervals with width , and let the resulting cutpoints be the values of . We take the regression function to be the quadratic polynomial function, which is widely used in many cases (e.g., Cook and Stefanski 1994; Carroll et al. 2006). Finally, 1000 simulations are run for each parameter setting.
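One way to choose the upper limit of the uniform censoring distribution so that a target censoring rate is attained is a bisection search on simulated data. The sketch below assumes that the censoring time is uniform on (0, c), that a subject is censored when the censoring time is smaller than the residual failure time, and that all residual failure times supplied are finite; the function name and tolerance are hypothetical.

```python
import numpy as np

def calibrate_uniform_censoring(residual, rate, rng=None, tol=1e-3):
    """Find c so that C ~ Uniform(0, c) censors about `rate` of the residual times."""
    rng = np.random.default_rng() if rng is None else rng
    residual = np.asarray(residual, dtype=float)
    if not np.all(np.isfinite(residual)):
        raise ValueError("residual failure times must be finite")
    u = rng.random(residual.shape[0])     # common random numbers: C = c * u

    def censored(c):
        # Proportion of subjects whose censoring time precedes failure
        return np.mean(c * u < residual)

    lo, hi = 0.0, 1.0
    while censored(hi) > rate:            # expand until the target is bracketed
        hi *= 2.0
    while hi - lo > tol:                  # censored(c) is non-increasing in c
        mid = 0.5 * (lo + hi)
        if censored(mid) > rate:
            lo = mid
        else:
            hi = mid
    return hi
```

With cured subjects in the sample, the overall censoring rate cannot fall below the proportion of cured subjects, so in that case the calibration would apply only to the uncured part of the sample.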

3.2 Simulation Results

We mainly examine the performance of the proposed method which is denoted by Chen (). In addition, to see the impact of the measurement error in covariate, we examine the naive estimator which is obtained by implementing in the estimating equations instead of , and the naive estimator is denoted by Naive (). We report the biases of estimates (Bias), the empirical variances (Var), the mean squared errors (MSE), and the coverage probabilities (CP) of those two estimators. The results are reported in Table 1.
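The reported summaries can be computed from the simulation replicates as in the following sketch. It assumes that each replicate supplies a point estimate and a model-based standard error for a scalar parameter, and that the 95% confidence interval is the usual Wald interval; the function name and the returned keys are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Bias, empirical variance, MSE, and coverage (%) over simulation replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth                 # empirical mean minus true value
    var = est.var(ddof=1)                     # empirical variance
    mse = np.mean((est - truth) ** 2)         # mean squared error
    # Coverage of the Wald interval est +/- z * se
    cp = 100.0 * np.mean(np.abs(est - truth) <= z * se)
    return {"Bias": bias, "Var": var, "MSE": mse, "CP": cp}
```

Up to the factor (n - 1)/n on the variance, the MSE computed this way equals the squared bias plus the empirical variance, so a method can have a larger variance and still a smaller MSE when its bias is much smaller.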

First, the censoring rate and the measurement error degree have a noticeable impact on both estimation methods. As expected, biases and variance estimates increase as the censoring rate increases. When the measurement error degree increases, biases of both and increase, and the impact of the measurement error degree is more pronounced on the naive estimator .

Within a setting with a given censoring rate and a measurement error degree, the naive method and the proposed method perform differently. When measurement error occurs, the proposed method performs better than the naive method. The naive method produces considerable finite sample biases, with coverage rates of 95% confidence intervals departing substantially from the nominal level. The proposed method produces satisfactory estimates with small finite sample biases and reasonable coverage rates of 95% confidence intervals. Compared to the variance estimates produced by the naive approach, the proposed method, which accounts for measurement error effects, yields larger variance estimates, and this is the price paid to remove biases in point estimators. This phenomenon is typical in the literature of measurement error models. However, mean squared errors produced by the proposed method tend to be much smaller than those obtained from the naive method.

4 Discussion

In this article, we focus on the transformation model for cured survival data with left-truncation, develop a valid method to correct for covariate measurement error, and derive an efficient estimator. We also establish the large sample properties, and the numerical results suggest that the proposed method outperforms the naive approach. Although we only focus on the simple structure of the measurement error model and assume that is precisely measured, our method can easily be extended to complex measurement error models or to settings with additional information, such as repeated measurements or validation data, and it can also accommodate the case where in (1) is mismeasured. In addition, there are still many challenges in this topic, such as the treatment of time-dependent covariates with mismeasurement. We leave these topics for future research.

Model  cr    Error   Method   Estimator of                       Estimator of
             degree             Bias     Var     MSE   CP(%)       Bias     Var     MSE   CP(%)
PH     25%   0.01    Naive    -0.230   0.007   0.059    21.3     -0.749   0.014   0.626    14.9
                     Chen      0.017   0.013   0.014    94.7      0.009   0.028   0.028    94.2
             0.50    Naive    -0.343   0.006   0.123     1.6     -0.606   0.015   0.432    30.0
                     Chen      0.025   0.023   0.026    94.5      0.011   0.027   0.028    94.5
             0.75    Naive    -0.347   0.005   0.125     0.3     -0.636   0.016   0.465    23.8
                     Chen      0.025   0.023   0.023    94.8      0.019   0.025   0.025    93.9
       50%   0.01    Naive    -0.248   0.016   0.267     9.1     -0.742   0.016   0.565     0.1
                     Chen      0.017   0.014   0.014    94.4      0.016   0.021   0.021    94.3
             0.50    Naive    -0.375   0.015   0.145     0.2     -0.600   0.016   0.376     0.4
                     Chen      0.024   0.036   0.039    95.2      0.019   0.025   0.025    95.0
             0.75    Naive    -0.360   0.014   0.134     0.1     -0.630   0.014   0.413     0.2
                     Chen      0.025   0.033   0.033    94.6      0.026   0.025   0.025    94.8
PO     25%   0.01    Naive    -0.250   0.009   0.072    23.0     -0.729   0.015   0.557     0.4
                     Chen      0.010   0.019   0.020    94.2      0.009   0.024   0.024    94.5
             0.50    Naive    -0.377   0.008   0.150     1.5     -0.588   0.017   0.369     3.6
                     Chen      0.012   0.018   0.040    94.5      0.011   0.024   0.025    93.7
             0.75    Naive    -0.362   0.007   0.138     1.1     -0.619   0.014   0.405     1.4
                     Chen      0.016   0.018   0.018    94.3      0.015   0.022   0.022    94.6
       50%   0.01    Naive    -0.268   0.016   0.273    10.1     -0.842   0.016   0.574     1.3
                     Chen      0.016   0.024   0.024    94.6      0.016   0.027   0.027    94.5
             0.50    Naive    -0.388   0.016   0.168     1.4     -0.600   0.016   0.376     1.4
                     Chen      0.027   0.036   0.037    94.2      0.021   0.026   0.026    95.1
             0.75    Naive    -0.410   0.017   0.185     1.9     -0.630   0.018   0.413     1.2
                     Chen      0.028   0.036   0.036    94.6      0.025   0.027   0.027    94.6

- usage of the true covariate ;
cr - censoring rate;
Bias - Difference between empirical mean and true value;
Var - Empirical variance;
MSE - Mean square error;
MVE - Model-based variance;
CP - Model-based coverage probability.

Table 1: Numerical results for simulation study

Appendix A Regularity Conditions

  • (C1) is a compact set, and the true parameter value is an interior point of .

  • (C2) Let be the finite maximum support of the failure time.

  • (C3) The are independent and identically distributed for .

  • (C4) The covariates and are bounded.

  • (C5) Conditional on the covariates and , is independent of .

  • (C6) Censoring time is non-informative. That is, the failure time and the censoring time are independent, given the covariates .

  • (C7) The regression function is correctly specified, and its first order derivative exists.

Condition (C1) is a basic condition that is used to derive the maximizer of the target function. Conditions (C2) to (C6) are standard in survival analysis; they allow us to express the estimating functions as sums of i.i.d. random variables and hence to derive the asymptotic properties of the estimators. Condition (C7) is a common assumption in the SIMEX method.

Appendix B Proof of Theorem 2.1

Proof of Theorem 2.1 (1):


and let denote a solution of . Since is a solution of . By the Uniform Law of Large Numbers (e.g., van der Vaart (1998)), we have that converges uniformly to . Then we have that as ,


By definition (13), averaging with respect to on both sides of (B.2) gives that as ,


for every . By (B.3), we can show that as ,


Since , by the continuous mapping theorem, we have that as ,


Proof of Theorem 2.1 (2):
By (B.5), we have for every , b, and . Averaging with respect to gives . On the other hand, by the Uniform Law of Large Numbers and derivations similar to those in Lu and Ying (2004) with , we have that as , for all . Therefore, we conclude that as , by the fact that .

Proof of Theorem 2.1 (3):
For and , applying the Taylor series expansion on (B.1) around gives

or equivalently,


By (11), (12), and the Uniform Law of Large Numbers, we have that as ,



On the other hand, by (7), the estimating equations (11) and (12) can be expressed as


where consists of the first components of and consists of the remaining components of . Thus can be expressed as a sum of i.i.d. random functions, which is given by



Combining (B.8) and (B.7) with (B.6) yields


By (13), averaging with respect to on both sides of (B.9) gives


for , where .

Let denote the vectorization of estimator with every . By the Central Limit Theorem on (B.10), we have that as ,


where . By the Taylor series expansion on with respect to , we have


Let and . Combining (B.11) and (B.12) gives that as ,


Finally, recall that the SIMEX estimator is defined by . Let . Combining (B.12) and (B.13) with and applying the delta method gives that as ,

Proof of Theorem 2.1 (4):
We first consider the expression of . By the Taylor series expansion with respect to , we have


where the third term is due to (B.13) and is the convergent function of .

By (7) and the fact that is a solution of (8), we have