An efficient Gehan-type estimation for the accelerated failure time model with clustered and censored data

03/09/2020 ∙ by Liya Fu, et al. ∙ Shenzhen University 0

In medical studies, the collected covariates usually contain underlying outliers. For clustered /longitudinal data with censored observations, the traditional Gehan-type estimator is robust to outliers existing in response but sensitive to outliers in the covariate domain, and it also ignores the within-cluster correlations. To take account of within-cluster correlations, varying cluster sizes, and outliers in covariates, we propose weighted Gehan-type estimating functions for parameter estimation in the accelerated failure time model for clustered data. We provide the asymptotic properties of the resulting estimators and carry out simulation studies to evaluate the performance of the proposed method under a variety of realistic settings. The simulation results demonstrate that the proposed method is robust to the outliers existing in the covariate domain and lead to much more efficient estimators when a strong within-cluster correlation exists. Finally, the proposed method is applied to a medical dataset and more reliable and convincing results are hence obtained.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Censored data are very common in biomedical studies. A popular method for analyzing censored data is the Cox proportional hazards model (Cox 1972). However, when the proportional hazards assumption is violated, the Cox model may derive inconsistent parameter estimators. A semiparametric accelerated failure time model is an alternative to the Cox proportional hazards model, which is a linear model for the logarithm of the failure time and covariates with error distribution unspecified (Kalbfleish and Prentice 2002). Rank-based estimation for the accelerated failure time model with clustered/longitudinal data has been studied by some researchers in recent years (Lee et al. 1993; Lin et al. 1998; Jin et al. 2006; Wang and Fu 2011; Chiou et al. 2015; Chiou et al. 2014). The analysis of the clustered failure time data is much more complicated due to the potential within-cluster correlations and the nature of censoring. Lee et al. (1993) studied the weighted log-rank statistic and the Buckley-James method for the correlated and censored data, and provided the covariance matrix estimation of the estimating functions, which does not require specifying the error distribution. Jin et al. (2006) proposed rank-based estimating functions for multivariate failure time data and developed a novel resampling method for the covariance matrix estimation of regression parameters. Chiou et al. (2015) presented weighted rank-based estimating equations for fitting the AFT model with clustered failure times from stratified random sampling, and used the induced smoothing approach proposed by Brown and Wang (2005) to reduce the computational burden. This approach has been adapted to clustered failure time data by Johnson and Strawderman (2009) and Wang and Fu (2011).

The aforementioned methods are based on the independence working model assumption and ignore the underlying within-cluster correlations. To take account of the within-cluster correlations and improve the efficiency of estimators with similar computational complexity for clustered survival data analysis, Wang and Fu (2011) proposed splitting the Gehan weight estimating function to the between- and within-cluster estimating functions and recombining the two resultant estimators. Chiou et al. (2014) extended the generalized estimating equations approach to the clustered and censored data. In longitudinal studies, some potential outliers exist in response and/or covariates, which often result in serious problems for parameter estimation in the AFT model. The rank-based method is robust against outliers in response. However, as far as we know, the literature on parameter estimators against outliers in covariates for clustered and censored data is quite limited. Luo et al. (2014) proposed robust approaches based on the smoothed Gehan rank estimation methods, but their method was based on an independence model. This leads us to seek an efficient and doubly robust method for clustered and censored data with outliers in covariates and/or response.

In this paper, we propose weighted Gehan-type estimating functions with the induced smoothing approach, which take account of the within-cluster correlations, varying cluster sizes, and outliers in covariates and/or response. Therefore, the proposed method is robust against outliers existing in covariates and/or response. Furthermore, the induced smoothing method is utilized to eliminate computational issues resulting from the unsmoothness of the estimating functions and multiple solutions. The induced estimating functions are continuous and differentiable, which make the statistical inference convenient and provide both regression parameter estimation and their covariance matrices. The asymptotic properties of the estimators from the nonsmoothed weighted rank-based estimating functions are established. The estimators from the smoothed estimating functions are shown to be consistent and have the same asymptotic distribution as those from the nonsmooth version. The covariance of the estimators is estimated by a sandwich formula.

In Section 2, we briefly review the accelerated failure time model, and present the weighted rank-based estimating equations for the AFT models with clustered data. In Section 3, we provide computational procedures for computing the parameter estimates and their covariance and carry out simulation studies to evaluate the performance of the proposed method. In Section 4, we analyze two real medical datasets for illustration. Some conclusions are summarized in Section 5.

2 Weighted estimating functions

2.1 The AFT model

Suppose that there are independent clusters, and their respective cluster sizes are . Let and denote the failure time and censoring time for the th member of the th cluster, and let be the corresponding vector of covariates. We assume that and are independent conditional on the covariates . The accelerated failure time model is


where is the unknown regression parameter vector corresponding to the covariate vector of dimension , and are independent random error vectors for . However, for each cluster, the error terms may be correlated. If , and , where is the indicator function, then the observations consist of .

2.2 Weighted estimating functions for AFT models

Let . The rank-based estimating functions of dimension using the Gehan-type weight take the following form,


which is monotonic with respect to (Fygenson and Ritov 1994). Let be the resultant estimator from (2), which can be also obtained by minimizing the following scalar objective function,

where .

Because is based on the independent working correlation assumption, the efficiency of can be enhanced by accounting for the within-cluster correlations and the impacts of varying cluster sizes. Furthermore, is robust against outliers in response and is sensitive to outliers in covariates. To seek doubly robust and efficient parameter estimates, we propose the following weighted estimating functions

where and are weights to be specified. Let be the estimator from , which can be also derived by minimizing the following objective function

For , in a general way, we can select weights including , , and , where

is the average within-cluster correlation and is obtained using the moment estimator from Wang and Carey (2003) given a consistent estimation for


where is the rank of the corresponding residual term , and is the average of the rank sum of all . In this paper, we use the third weight to incorporate the within-cluster correlations. For weight , we use the generalized rank (GR) estimation (Naranjo and Hettmansperger 1994) defined by

Here, and correspond to tuning constants and denotes the squared Mahalanobis distance of based on the robust estimates of location and dispersion for the design set (Rousseeuw and van Zomeren 1990). For the tuning parameters and , we use and which is the th percentile of a -distribution. Specifically, when and , corresponds to the classical Gehan-type estimating function .

Denote as the true value of . According to Lee et al. (1993) and Jin et al. (2006), under some regularity conditions, the limiting distribution of

follows a zero-mean multivariate normal distribution, and the asymptotic covariance matrix of



where and

. According to Lee et al. (1993), we deduce the limiting variance matrix of

given as follows:


Matrix depends on the error distributions, which are unknown and usually difficult to estimate. If is a smooth function of , can be estimated by evaluated at an estimate of . However, is a step function, Hence, its derivative does not exist, which makes parameter estimates and computation of difficult. Moreover, calculational issues often arise when minimizing or solving , and the solution in general is not unique and consists of a single interval or even multiple intervals, although the length of these intervals converges asymptotically to zero (Jin et al. 2006).

2.3 Smoothed weighted estimating function

To overcome difficulties with the lack of smoothness of the estimating functions, we now introduce the induced smoothing method given by Brown and Wang (2005). Assume that and is independent of the data, where denotes the identity matrix. Let be a positive definite matrix and satisfy . Then, the induced smoothing version of is , where the expectation is taken with respect to . By some simple calculation, we have


where . Let be the standard normal density function. Similarly, we can obtain the induced smoothing version of ,

Then we can obtain by minimizing . Alternatively, can be derived as the multivariate root of . The derivative of can be easily derived,

Before giving the asymptotic properties of the smoothed versions and the resultant estimators, the following regularity conditions are required.
C1. The parameter vector lies in a compact subset of .
C2. For , are bounded and , where is the Euclidean norm.
C3. , where is a constant.
C4. The common marginal density function of the errors , , and its derivative are bounded and satisfy
C5. The marginal distribution of is absolutely continuous and the corresponding density function is bounded on , for any and .
C6. The matrices and are non-singular.

Theorem 1.Under conditions C1-C5, uniformly in .

Theorem 2.Let be any symmetric and positive definite matrix with . Under conditions C1-C6, is a strongly consistent estimator of .

Theorem 3.Let be any symmetric and positive definite matrix with . Under conditions C1-C6, converges in distribution to , where is given by (3).

Theorem 1 indicates that the difference between the smoothed estimating functions and unsmoothed version is negligible. Theorems 1 and 2 indicate that is a consistent estimator of . Therefore, can be consistently estimated using The proofs of theorems 1 and 2 are given in the Appendices A and B. The proofs of Theorem 3 can be established by following similar lines as in established similar results for the independence estimator (Johnson and Strawderman 2009). According to Brown and Wang (2005), an iteration procedure to simultaneously obtain the smoothed estimate and its covariance matrix estimate can be described by the following steps:
Step 1. Choose an initial value (e.g. ) for the working covariance matrix and a consistent estimator for to evaluate and .
Step 2. In the -th iteration, update by minimizing or solving .
Step 3. Update based on using the current values and .
Step 4. Repeat Steps 2-3 until a convergence criterion is satisfied.

In our experience, in general, the algorithm converges after only a few iterations. The final values of and can be as estimates of and .

3 Simulation studies

In this section, we carry out simulation studies to evaluate the performance of the proposed estimator by comparing the biases and mean squared errors (MSE) with the Gehan-estimator , derived from with , and the smoothed estimator from .

In the simulation studies, we generate the data from model (1) with , and . Cluster sizes are sampled from to

with equal probability. The censoring times

are generated from a uniform distribution

, where controls the rate of the censoring. The rates of the censoring are taken as and . The error terms are generated from a multivariate normal distribution and a multivariate t-distribution

with three degrees of freedom, where

is an exchangeable matrix with parameter and . Note that the correlation coefficient between and is . The covariate is a cluster-level covariate in which does not change within each subject or cluster, and independently generated from the standard normal distribution. The covariate are with-cluster covariate in which the covariate varies within each subject or cluster, and independently generated from the standard normal distribution. The covariate is contaminated by adding an outlier equal to with a probability of or . For the case, simulations are carried out. The simulation results are given in Tables 1-4 present the biases and mean squared errors for each case. Tables 5-6 show the empirical variances (Evar) and the variances (Ivar) using the iterative method of §2 for simultaneously estimating the regression parameters and covariance matrix.

From Tables 1-2, we can see that both biases and MSEs of the estimates increase as censoring rate increases, and all the estimates are unbiased when there are no outliers. The mean square errors decrease as the number of cluster increases. When there exist outliers (Tables 3-4), the proposed estimate has much smaller biases and MSEs than and . The mean squared errors of the smoothed estimate are similar to those of the nonsmoothed estimate across all cases. When the covariate is a within-cluster covariate, the estimates corresponding to performs better than and . However, when the covariate is cluster-level covariate, and corresponding to perform better than . From Tables 5-6, we can see that the variance estimates obtained via the iterate method (for simultaneously estimating the regression parameters and covariance matrix) for and are accurate and similar to the empirical variance estimates across all simulation studies. Overall, the results presented in Tables 5-6 suggest that the smoothing parameter has a minimal impact on the bias or actual variance of the regression parameter estimates, and the proposed estimate are robust and efficient.

4 Analysis of real medical data

In this section, we illustrate the proposed method by analyzing two real longitudinal data sets.

The first one is a longitudinal and survival dataset collected in a recent clinical trial, which was described by Guo and Carlin (2004). In this trial, a total of HIV-infected patients were enrolled and randomly assigned to receive either didanosine (ddI) or zalcitabine (ddC). CD4 counts were recorded at study entry and again at , , , and -month visits, and the times to death were also recorded. Due to death or censoring of the patients, the data is unbalanced. For full details regarding the conduct of the trial the reader is referred to Abrams et al. (1994) and Goldman et al. (1996). The dataset is available in the JM package in statistical software R.

In this paper, we are interested in whether the time to death or censoring of the patients is different for the ddI and ddC groups. Let be the time to death or censoring of the th patient. We include five covariates as main effects in our analysis: CD4 counts, observation time at which the CD4 cells count was recorded (obstime), drug (, ), gender (, ), PrevOI (previous opportunistic infection (AIDS diagnosis) at study entry , no AIDS diagnosis ), and AZT (AZT failure , AZT intolerance ). Note that covariates are cluster-level covariates except CD4 and obstime. Figure 1 indicates that the CD4 may include some underlying outliers which are larger than . We analyze the data by the following AFT model,

We estimate the parameters by the same method in simulation studies. Parameter estimates and their standard errors are given in Table

7. We can see that the estimates obtained from different methods are similar. Furthermore, has smaller standard errors than and for the cluster-level covariates. However, for within-cluster covariates, CD4 and obstime, the standard errors of are larger than those of and , which are consistent with the findings in our simulation studies.

The second example is from the standard and new anti-epileptic drugs (SANAD) study [16, 17] with the aim to know whether the new drug lamotrigine (LTG) is superior to the standard drug carbamazepine (CBZ) for patients with epilepsy. There were patients in the trial treated with LTG or CBZ randomly. We consider the effects of six covariates on the time of drug withdrawal: dose, treatment (LTG=1, CBZ=0), age, gender (male=1, female=0), and two indicator variables, with.use (1=withdrawal due to unacceptable adverse effects, 0=otherwisw) and with.isc(1= withdrawal because of inadequate seizure control, 0=otherwise). It is noticeable that these covariates are cluster-level except dose, and there may be some underlying outliers in dose according to Figure 2. We use the following AFT model to analyze the data.

The parameter estimates and their standard errors using different methods are shown in Table 8.

In Table 8, it is shown that using different methods obtains similar estimates. The effect of treatment is positive. In other words, the conclusion is that LTG is superior to CBZ, which has been found in [16, 17, 23].Moreover, the standard error of , the coefficient of the within-cluster covariate dose, is comparable with those obtained by other methods. However, the standard errors of for cluster-level covariates are the smallest one among the five methods.

5 Discussion

The Gehan weight estimating function is monotonic with respect to regression parameters, is robust against outliers in response, and has a unique solution. Therefore, many researchers have utilized it to estimate parameters in the AFT model (Fygenson and Ritov 1994; Jin et al. 2006). However, they did not consider the possible outliers existing in covariates. Furthermore, the within-cluster correlation was often ignored for the clustered and censored data. The proposed method is as simple as the independence model of Jin et al. (2006), but takes account of within-cluster correlations and varying cluster sizes. Moreover, it is robust against outliers in covariates and/or response. When there exist outliers in covariates, the proposed method leads to substantial improvement over the commonly used Gehan estimator. Furthermore, the calculation burden can be greatly reduced by the induced smoothing method (Brown and Wang 2005; Johnson and Strawderman 2009).

In this paper, we only considered the linear regression model. In fact, the idea can be easily extended to the partial linear model (Cheng and Wei 2000). The simulation results indicate that the proposed method depends on the covariate design and is not fully efficient; that is because the correlations are still not well considered and quantified in this paper. Further work is therefore needed to incorporate within-cluster correlations into the optimal parameter estimation.


Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), the National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062). Liya Fu’s research was supported by the National Natural Science Foundation of China (11871390), the Fundamental Research Funds for the Central Universities (No. xjj2017180), the Natural Science Basic Research Plan in Shaanxi Province of China (2018JQ1006) and the Doctoral Programs Foundation of the Ministry of Education of China (20120201120053). You-Gan Wang wishes to thank the support from the Australian Research Council grant (DP160104292).


  • [1] Abrams, D. I., Goldman, A. L, Launer, C, Korvick, J. A., Neaton, J. D., Crane, L. R., Grodesky, M., Wakefield, S., Muth, K., Kornegay, S., Cohn, D. J., Haris, A., Luskin-Hawk, R., Markowitz, N., Sampson, J. H., Thompson, M., Deyton, L., and the Terry Beim Community Programs for Clinical Research on AIDS (1994), Comparative Trial of Didanosine and Zalcitabine in Patients with Human Immunodeficiency Virus Infection Who are Intolerant of or Have Failed Zidovudine Therapy, New England Journal of Medicine 330, 657–662.
  • [2] Brown, B. M. and Wang, Y-G. (2005). Standard errors and covariance matrices for smoothed rank estimators. Biometrika 92, 149–58.
  • [3] Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Ser. B 34, 187–220.
  • [4] Cheng, S. C. and Wei, L. J. (2000). Inference for a seiparametric model with panel data. Biometrika 87, 89–97.
  • [5] Chiou S. H., Kang, S. W. and Jun Y. (2015) Semiparametric accelerated failure time modeling for clustered failure times from stratified sampling. Journal of the American Statistical Association, 110, 621–629
  • [6] Chiou S. H., Kang S. W., Kim J. and Jun Y. (2014) Marginal semiparametric multivariate accelerated failure time model with generalized estimating equations. Lifetime Data Analysis. 20, 599–618.
  • [7] Fygenson, M. & Ritov, Y. (1994). Monotone estimating equations for censored data. The Annals of Statistics 22, 732–746.
  • [8] Guo, X. and Carlin, B. P. (2004). Separate and joint modeling of longitudinal and event time data using standard computer packages. The American Statistician 58, 16–24.
  • [9] Goldman, A. L, Carlin, B. P., Crane, L. R., Launer, C, Korvick, J. A., Deyton, L., and Abrams, D. I. (1996), Response of CD4+ and Clinical Consequences to Treatment Using ddl or ddC in Patients with Advanced HIV Infection, Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology 11, 161–169.
  • [10]

    Jin, Z., Lin, D. Y., Wei, L. J. and Ying, Z. (2006). Rank regression analysis of multivariate failure time data based on marginal linear models. Scandinavian Journal of Statistics 33, 1–23.

  • [11] Johnson, L. M. and Strawderman, R. L. (2009). Induced smoothing for the semiparametric accelerated failure time model: Asymptotics and extensions to clustered data. Biometrika 96, 577–590.
  • [12] Kalbfleisch, J. D. & Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd ed. Hoboken: Wiley.
  • [13] Lee, E. W., Wei, L. J. & Ying, Z. (1993). Linear regression analysis for highly stratified failure time data. Journal of the American Statistical Association 88, 557–565.
  • [14] Lin, D. Y., Wei, L. J., Ying, Z., (1998). Accelerated failure time models for counting processes. Biometrika 85, 605–618.
  • [15] Luo J., Li H. F. Zhang J. J. (2014). Robust smoothed rank estimation methods for accelerated failure time model allowing clusters. Communications in Statistics-Simulation and Computation. DOI:10.1080/03610918.2014.882944.
  • [16] Marson A.G., Appleton R., Baker G.A., Chadwick, D.W., Doughty, J., Eaton, B., Gamble, C., Jacoby, A., Shackley, P., Smith, D.F., Tudur-Smith, C., Vanoli, A.,and Williamson P.R. 2007. A randomised controlled trial examining longer-term outcomes of standard versus new antiepileptic drugs, Health Technology Assessment 11(37), 1-154.
  • [17] Marson, A.G., Al-Kharusi, A.M., Alwaidh, M., Appleton, R., Baker, G.A., Chadwick, D.W., Cramp, C., Cockerell, O.C., Cooper, P.N., Doughty, J., Eaton, B., Gamble, C., Goulding, P.J., Howell, S.J.L., Hughes, A., Jackson, M., Jacoby, A., Kellett, M., Lawson, G.R., Leach, J.P., Nicolaides, P., Roberts, R., Shackley, P., Shen, J., Smith, D.F., Smith, P.E.M., Smith, T.C., Vanoli, A. and Williamson, P.R. on behalf of the SANAD Study group. 2007. The SANAD study of effectiveness of carbamazepine, gabapentin, lamotrigine, oxcarbazepine, or topiramate for treatment of partial epilepsy: an unblinded randomised controlled trial, Lancet 365, 1000-1015.
  • [18] Naranjo, J. D., and Hettmansperger, T. P. (1994). Bounded-Influence rank regression. Journal of the Royal Statistical Society, Ser. B 56, 209–220.
  • [19] Neuhaus, J. M. and Kalb eish, J. D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics 54 638–645.
  • [20] Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85, 633–639.
  • [21] Wang, Y-G. and Carey, V. J. (2003). Working correlation structure misspecification, estimation and covariate design: Implications for generalized estimating equations performance. Biometrika 90, 29–41.
  • [22] Wang, Y-G. and Fu, L. Y. (2011). Rank regression for accelerated failure time model with clustered and censored data. Computational Statistics and Data Analysis 55, 2334–2343.
  • [23] Williamson, P.R., Kolamunnage-Dona, R., Philipson, P., Marson, A.G. 2008. Joint modelling of longitudinal and competing risks data, Statistics in Medicine 27(30), 6426-6438.