In real studies, the primary covariates sometimes are not directly recorded in their true values, but rather, they are observed in a distorted form, where the distortion is in the form of a multiplicative factor. This type of data does not get sufficient attention as other types of covariate measurement error problems, even though they are also quite wide prevalent in real studies. For example, when releasing household data on energy use, in order to maintain confidentiality, the U.S. Department of Energy multiplied the survey data by some randomly selected numbers before publication (Hwang, 1986). Therefore, the contaminated data available to the public is , where and respectively denote the true data and the randomly selected number. This multiplicative contamination structure is also very common in biomedical studies, in the form of normalization, as some primary covariates are often normalised by a confounder such as BMI () or by other measures of body configuration or age. For instance, in a study of the relationship between the fibrinogen level (FIB) and serum transferrin level (TRF) among hemodialysis patients, Kaysen et al. (2002) found that BMI has a great influence on FIB and TRF and may distort the true relationship between them. Therefore, they proposed a calibration method where dividing the observed FIB and TRF by the confounding variable BMI. This implies a multiplicative structure between the unobserved primary variables and the confounding variable. Such phenomenon also appears frequently in environmental studies where ambient measure is used for normalization, and in genomic studies where library size needs to be normalized for next generation sequencing data.
In some situations, however, the precise nature of multiplicative relationship between the primary variables and the confounding variable could be unknown, and in this case the naive practice of dividing by the confounding variable may result in biased estimates or losing of power for statistical inference. To overcome these difficulties, Sentrk & Mller (2005) considered a more flexible multiplicative form which is an unknown function of the confounding variable
. They proposed a covariate adjustment method for the linear regression model, where both the responseand the covariates are distorted by an observable confounder , i.e., , , where and are observable distorted covariates and response, and
are unknown smooth distorting functions. Directly applying the widely used ordinary least squares (OLS) method to the contaminated datawill result in biased and inconsistent estimates. Sentürk & Müller (2005) corrected the bias by linking it to a varying-coefficient regression model, then utilized the bin method (Fan & Zhang, 2000) to obtain consistent estimators (Sentürk & Müller, 2006). Related research includes Nguyen & Sentürk (2008) on generalizing this method to the case of multiple distorting covariates, Sentürk & Müller (2009) on extending to generalized linear model, Zhang, Zhu, & Liang (2012) and Zhang et al. (2013) on the nonlinear regression model and the partial linear model. More recently, Cui et al. (2009) developed a direct plug-in estimation procedure for nonlinear regression model with one confounding variable. They proposed to estimate the distorting functions and by nonparametrically regressing the response and predictors on the distorting variable, and obtained the estimates for the unobservable response and predictors, then conducted the nonlinear least squares method on the estimated counterparts . Zhang, Zhu, & Zhu (2012) further applied this direct plug-in method to semiparametric model by incorporating dimension reduction techniques. To relax the parametric assumptions and some restrictive conditions on distorting functions in the existing literature, Delaigle, Hall, & Zhou (2016) proposed a more flexible nonparametric estimator for the regression function.
In this paper, we focus on investigating censored survival data where the response of interest is a right-censored survival time and the primary predictor is distorted by an observable confounding variable through the multiplicative form , where is the unknown distorted function. A reasonable identifiability condition for this structure is corresponding to the assumption that the mean distorting effect vanishes (Sentürk & Müller, 2005). The existing methods mentioned earlier can not be applicable here due to censoring. Furthermore, the existing methods for censored survival data with mismeasured covariates (e.g., Prentice, 1982; Wang et al., 1997; Zhou & Pepe, 1995; Zhou & Wang, 2000; Huang & Wang, 2000; Hu & Lin, 2002) can not handle this multiplicative distortion. To make valid inference, we propose a covariate-adjusted Cox proportional hazards regression to address this multiplicative contamination structure. Inspired by Cui et al. (2009), we first employ the nonparametric regression to obtain consistent estimator of the distorting function through the kernel smoothing method, and obtain the estimates for the true covariates by . Then the regression parameters are estimated by maximizing the partial likelihood on the estimated data. Our approach has several distinctive advantages. First, the contamination structure we considered is more general which includes a large class of confounding mechanisms, e.g., means there is no contamination, represents the contamination structure . So the applicability of our proposed method can be quite broad. Second, the computation of our method is simple and fast, which will greatly facilitates its implementation in real application.
The rest of the article is organized as follows. In Section , we introduce the covariate-adjust Cox regression for the multiplicative contaminated data and present the proposed covariate-calibration method. In Section , we establish the asymptotic properties of the proposed estimates. In Section , we present simulation results to evaluate the finite sample performance of the proposed estimates. In Section , we apply the proposed method to a data set from the National Wilms’ Tumor Study (NWTS). Some concluding remarks are given in Section . All technical proofs are presented in the supplementary material.
2 Cox regression with multiplicative contamination structure
2.1 Model, data and contamination structure
To fix notation, let denote the survival time, denote the censoring time, denote the observed time, and denote the failure indicator. Let and be the associated covariates where is the one that subjects to multiplicative contamination. Assume that the censoring mechanism is random, that is, the survival time and the censoring time are conditionally independent given and . The proportional hazards regression model (Cox, 1972) assumes that the conditional hazard function of the survival time associated with covariates and takes the form of
where is the baseline hazard function, and are the unknown regression coefficients. We assume the scalar covariate is not observed precisely while the -dimensional covariate could be accurately observed. Assume the observed data consists of subjects, denoted by , which are independent samples from . Instead of exact , we observe such that
where is an observable variable and independent of , is an unknown link function. To make the model identifiability, we assume that , which implies that the distorting effect vanishes on average.
We aim to infer the regression parameters and based on the observations available. When are observed without contamination, maximizing the partial likelihood (Cox, 1975)
can offer the estimates for and . It is evident that (2) can not be used when are unobservable or contaminated.
Note that the established methods on the Cox regression with additive contamination structure always require error to be independent of (e.g., Huang & Wang, 2000; Li & Ryan, 2004). Directly apply the additive error structure methods to current setting is not feasible. To illustrate this, even though the multiplicative contamination structure (1) can also be rewritten as an additive structure,
the error is not independent of , hence the methods mentioned above can not be applicable here. If one takes the logarithmic transformation assuming the related quantities are positive, then one would arrive at the additive covariate contamination structure (4). Here the error term is independent of , but extra variation needs to be accounted for in the back-transformation procedure. Moreover, the routine approximately corrected score method for the Cox regression at the scale would result in biased estimate if the correct Cox regression model is linear in .
2.2 Covariate-calibration method
Our proposed approach is based on directly calibrating . Note that
We can employ the commonly used Nadaraya–Watson kernel smoothing estimate for , which is given by
where is the kernel smoothing function and is the bandwidth. Since converges to
almost surely by using the strong law of large numbers, we can obtain an consistent estimate foras . Following (1), we propose a calibration of by . Therefore, we can construct an estimated partial likelihood using as follows,
The proposed estimator was defined as the maximizer for , i.e.,
2.3 Bandwidth selection
In real data analysis, it is desirable to have an automatically data-driven method for selecting the bandwidth parameter . We will employ a cross-validation (CV) method to choose the optimal The kernel estimate of the density function of , , is denoted as
Following Rudemo (1982) and Bowman (1984), we define an integrated squared error (ISE) as follows,
be the leave-one-out kernel density estimator, i.e.,
The second term of (7) can be consistently estimated by Therefore, we propose a cross-validation criterion as follows,
which is considered as the optimal bandwidth parameter.
3 Asymptotic properties
We set , let and respectively represent the estimation and the true value of the regression parameter . The following theorem gives the consistency and asymptotic normality of the proposed estimator when . The regularity conditions and the proofs of this theorem are given in the Appendixes A and B, respectively.
The above theorem establishes the asymptotic normality of the proposed estimator , furthermore, from (ii) in Theorem 1, we can obtain the asymptotic distribution of and respectively, i.e., , and , where and respectively denote the th order sequential principal minor and the th diagonal element of matrix . We note a few remarks on the terms in the expression for the asymptotic covariance matrix. If there is no distortion with , we can estimate by maximizing the partial likelihood (2), the asymptotic covariance matrix of is . So the term
is caused by the distortion. Furthermore, the limiting variance for
includes some unknown components to be estimated, therefore, we can use the sandwich method and plug-in estimation to obtain the standard error and construct the confidence region for.
4 Simulation studies
We conducted extensive simulations to investigate the finite-sample performance of the proposed estimator and compared it with two completing estimators. The first one is the naive estimator that ignores the contamination and directly uses to replace ; the second one is the oracle estimate , which is obtained by assuming that was known.
The survival times were generated from the Cox proportional hazards model with the conditional hazard function given by
Set , and the baseline hazard function . The covariate
follows a multivariate normal distribution with meanand correlation matrix for . We generated from and the confounding covariates
from a uniform distribution over interval. We considered two forms of distortion function and , which satisfy . We took the censoring time , where was generated from Unif. The study duration was chosen to yield the desirable censoring rate. To estimate the distorting function, we chose Gaussian kernel function and adopted the leave-one-out cross-validation method to select the bandwidth. We took the sample size and , coupled with the censoring rates (CR) of , and . For each configuration, we repeated simulations.
Tables 1 and 2 summarize the results of , and under different distortion functions and different censoring rates for sample size and , respectively. We make the following observations: (i) As expected, in terms of the mean-square error or the coverage probability, the oracle estimator and our proposed estimator are all superior to the naive estimator , especially for the results of . Not surprisingly, the naive estimator are seriously biased. For example, under the censoring rate of and in Table 1, the bias for is , more than half of its real value , while the bias for proposed estimator is only ; moreover, the coverage probability for is , almost equals to zero. (ii) The proposed estimator are essentially unbiased and comparable with the oracle estimator under different settings, even for the cases with high censoring rate of . For example, in the case of censoring rate and in Table 1, the relative efficiency , very close to 1. (iii) Our proposed method performs stably with the choice of the distortion function, while the naive method performs worse if we chose . The coverage probabilities of for almost equal or close to zero. These simulation results demonstrate that the proposed covariate-calibration approach can effectively overcome the negative effect arising from the covariate contamination and meanwhile exhibits good performance.
5 Analysis with Wilms’ tumor study
We applied the proposed covariate-calibration method to the Wilms’ tumor data, which was collected in two randomized studies in Wilms’ tumor patients. Wilms’ tumor is a rare kidney cancer occurring in young children. The National Wilms’ Tumor Study Group (NWTSG) conducted several randomized studies to test different treatments in Wilms’ tumor patients. We use a Wilms’ tumor data including patients participating in two of the NWTSG trials NWTS-3 and NWTS-4 (D’Angio et al., 1989; Green et al., 1998) to evaluate the joint effect of tumor weight, histological type and other risk factors. The primary endpoint of the study was the survival time (in years). During the follow-up, patients died of Wilms’ tumor and the other patients were censored, which led to the censoring rate of . The mean observed time was 10.33 years (ranging from 0.01 to 22.50 years). We divided the data into two groups according to the histological type (favorable and unfavorable) and summarized the size and mean of each covariate in Table 3. It can be seen that patients have favorable tumor and the other patients have unfavorable tumor. The mean observed time for patients with favorable tumor is 10.68 years, which is larger then the corresponding value (7.55) of the unfavorable tumor group. Figure 1 shows the Kaplan-Meier curves for the two different tumor histological types, from which we can see that patients with favorable tumor experienced longer survival time.
The predictors included in this analysis are the weight of tumor bearing specimen (abbreviated as wgt, in kilograms), the histological type of the tumor (type, being 0 if favorable and 1 otherwise), tumor stage (stage, coded by 1 and 0, indicating spread of the tumor from localized to metastatic), age at diagnosis (age, measured in years), the study number (num, 1 denotes NWTS-3 and 0 denotes NWTS-4).
We examine the following Cox proportional hazards regression model,
It is known that the weight of tumor bearing specimen (wgt) is affected by tumor’s diameter (diam, in centimeters). The scatter points of wgt versus diam shown in Figure 2 clearly demonstrate that there indeed exists a strong positive correlation between them. Therefore, we directly adjust for the potential distorting covariate with the proposed method and assume the distortion model as , where is an unknown link function, is the observed wgt. The analysis results of the covariate effects were summarized in Table 4. As a comparison, we also presented the results of the naive method which ignoring the contamination of wgt. By observing the results, the -value of wgt is 0.008 for our proposed method, which means wgt has significant influence on patients’ survival time, while the corresponding value is 0.244 for the naive method without the potential distorting effect of “diam”. From the medical standpoint, wgt has great influence on patients’ survival time, whereas ignoring the contamination leads to this covariate insignificant. Furthermore, from all these two methods, we can conclude that patients with favorable tumor would possess longer survival time, compared with ones with unfavorable tumor, which coincides with Figure 1. As a result, we conclude that the proposed covariate-calibration method offers a convincing result for the Wilms’ tumor data.
Covariate-adjusted problem is a common contamination problem in biomedical studies. Similar issues arrive in other field, e.g. in environmental studies, exposures are often calibrated by the daily environment or ambient measures, like the role of BMI in medical studies, or genomic studies where library size is being normalized. Our method deals with the type of some primary covariates that are observed after being distorted by a multiplicative factor (an unknown function of an observable confounding variable). We fill in the gap in the literature on censored survival data with distorting function in primary risk factor, which is lacking in terms of statistical method. We propose a direct estimation procedure to estimate the regression parameters in the Cox proportional hazards regression model. The novel idea of our procedure is to obtain a consistent estimator of the distorted covariate by employing the kernel smoothing method and then obtain the parameter estimation by plugging in the estimated covariate.
Numerical results show that the proposed method is working very well in correcting the bias arising from covariate distortion. It performs stably to a variety choice of the distortion functions. An important improvement of our method is that we allow flexible distorting model to handle various confounding mechanisms. The proposed method is easy to compute and will provide a critical tool for researchers facing with this type data in practice.
A few remarks on using the proposed method in real studies. First, on the construction of confidence interval of the proposed estimation, we note that because of the nonlinear structure of the estimated partial likelihood and the maximum partial likelihood estimation does not have a closed form, the establishment of theoretic properties in this paper is more difficult than linear model. The asymptotic covariance matrix derived in Theorem 1 depends on several unknown components, therefore, it is difficult to construct confidence region based on normal approximation. We recommend to use the common sandwich approach to obtain the standard error estimation. This method has been tested and demonstrated to perform well in our numerical studies.
Second, for ease of exposition, we consider only one confounding variable. In many applications, however, there are multiple distorting variables that simultaneously affect the primary covariate. In principle, the proposed method can handle this case and the sandwich method can be employed as well to obtain the standard error estimation. Deriving theoretic properties of the corresponding estimators will be more difficult and need additional technicalities.
Finally, as we require division of the distorted variable by the estimated distorting function, we imposed some regularity assumptions on the curve of the distorting function. In particular, the proposed method can not be applied if vanishes. Delaigle, Hall, & Zhou (2016) proposed a more flexible nonparametric estimator for the regression function, which significantly weakens some of the strong assumptions on the distorting function. Further research is underway to extend this work to censored survival data.
7 Supplementary materials
The supplementary material presents the detailed proof of Theorem 1.
This work is funded in part by the National Natural Science Foundation of China NSFC 11771366 (Liu), NSFC 11971362 (Liu), NSFC 11671311 (Wu) and NSFC 11901581 (Zhang), U.S. National Institute of Health grant R01 ES021900 (Zhou), P01 CA142538 (Zhou), P30 ES010126 (Zhou).
Appendix A: Regularity conditions
Unless otherwise stated, all limits are taken as . Suppose and are
-vectors, then we writefor the matrix . Also we write for the matrix . For a matrix A or vector a, let and . For matrix or vector sequences and , denote if and denote if . Denote and diag as the diagonal matrix whose diagonal vector is a. We set , , , , and . Let denote the end time of the study. Here, we introduce the following notations:
for . Note that is a scalar, and are -vectors, and are matrices. Before proving the theorem, we first describe the regular conditions needed as follows:
(Finite interval). .
(Asymptotic stability). There exist a neighbourhood of , scalar, vector and matrix functions , and defined on such that for
(Lindeberg condition). There exists such that
(Asymptotic regularity conditions). Let , , and be as in condition C and define and . For all , :
, and are continuous functions of , uniformly in , , and are bounded on , is bounded away from zero on , and the matrix
is positive definite.
and are bounded away from zero and have bounded second derivatives.
, and .
The kernel function satisfies condition in Giné & Guillou (2002). Let
then for any , satisfies that
for some positive constants and , where denotes the -covering number of the metric space , is the envelope function of , the supremum is taken over and the norm is defined as .
and ; and monotonically converge to zero as .
and are bounded away from .
These conditions are mild and can be satisfied in most of circumstances. Conditions C1-C4 are essential for the asymptotic results of Cox proportional hazards regression model. Condition C5 is a mild smoothness condition on the involved functions. Condition C6 is common for a kernel function and C7 is to regularize the complexity of the kernel function so that the supremum norm for kernel functions can be bounded in probability, which are also imposed in Chen et al. (2016) and Chen, Genovese, & Wasserman (2018). Specially, the Gauss kernel function satisfies the Conditions C6 and C7. Condition C8 states that the bandwidth converges to zero at certain rate with respect to the sample size . Condition C9 is necessary in the study of covariate-adjusted problems, see Sentürk & Müller (2006).
Appendix B: Proofs of asymptotic properties
As a preparation, we state a lemma, which is extracted from Lemma B.2 of Zhang, Zhu, & Liang (2012) and frequently used in the process of the proof.
Let be a continuous function satisfying . Assume that conditions C5C9 hold. The following asymptotic representation holds:
Proof of Theorem
Proof of . Denote by and , the log partial likelihood of this covariate-adjusted Cox model can be written as
The main point of the proof lies in stating that, for any ,
This implies, by the fact that and the consistency of Cox model under conditions C1–C4, the consistency of follows from Lemma of Wu (1981). The detailed proof were given in the supplementary material.
Proof of . Let
By Taylor expansion, there exists between and such that
By the definition of , we know that . So we have
We can prove that
where is defined in condition C4, and .
- (1) Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353–360.
- (2) Chen, Y. -C., Genovese, C. R., Tibshirani, R. J., & Wasserman, L. (2016). Nonparametric modal regression. The Annals of Statistics, 44, 489–514.
- (3) Chen, Y. -C., Genovese, C. R., & Wasserman, L. (2018). Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association, 112, 1684–1696.
- (4) Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 187–220.
- (5) Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
- (6) Cui, X., Guo, W. S., Lin, L., & Zhu, L. X. (2009). Covariate-adjusted nonlinear regression. The Annals of Statistics, 37, 1839–1870.
- (7) D’Angio, G. J., Breslow, N., Beckwith, J. B., et al. (1989). Treatment of Wilms’ tumor: Results of the third national Wilms’ tumor study. Cancer, 64, 349–360.
- (8) Delaigle, A., Hall, P., & Zhou, W. X. (2016). Nonparametric covariate-adjusted regression. The Annals of Statistics, 44, 2190–2220.
- (9) Fan, J., & Zhang, J. (2000). Two-step estimation of functional linear models with applications to longitudinal data. Journal of the Royal Statistical Society, Series B, 62, 303–322.
- (10) Giné, E., & Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l’Institut Henri Poincaré, 38, 907–921.
- (11) Green, D. M., Breslow, N. E., Beckwith, J. B., et al. (1998). Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms’ tumor: A report from the National Wilms’ Tumor Study Group. Journal of Clinical Oncologys, 16, 237–245.
- (12) Huang, Y. J., & Wang, C. Y. (2000). Cox regression with accurate covariates unascertainable: A nonparametric-correction approach. Journal of the American Statistical Association, 95, 1209–1219.
- (13) Hu, C., & Lin, D. Y. (2002). Cox regression with covariate measurement error. Scandinavian Journal of Statistics, 29, 637–655.
- (14) Hwang, J. T. (1986). Multiplicative errors-in-variables models with applications to recent data released by the U.S. Department of Energy. Journal of the American Statistical Association, 81, 680–688.
- (15) Kaysen, G. A., Dubin, J. A., Müller, H. G., Mitch, W. E., Rosales, L. M., Levin, N. W., & Hemo Study Group (2002). Relationships among inflammation nutrition and physiologic mechanisms establishing albumin levels in hemodialysis patients. Kidney International, 61, 2240–2249.
- (16) Li, Y., & Ryan, L. (2004). Survival analysis with heterogeneous covariate measurement error. Journal of the American Statistical Association, 99, 724–735.
- (17) Nguyen, D. V., & Sentürk, D. (2008). Multicovariate-adjusted regression models. Journal of Statistical Computation and Simulation, 78, 813–827.
- (18) Prentice, R. L. (1982). Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika, 69, 331–342.
- (19) Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9, 65–78.
- (20) Sentürk, D., & Müller, H. G. (2005). Covariate-adjusted regression. Biometrika, 92, 75–89.
- (21) Sentürk, D., & Müller, H. G. (2006). Inference for covariate-adjusted regression via varying coefficient models. The Annals of Statistics, 34, 654–679.
- (22) Sentürk, D., & Müller, H. G. (2009). Covariate-adjusted generalized linear models. Biometrika, 96, 357–370.
- (23) Wang, C. Y., Hsu, L., Feng, Z. D., & Prentice, R. L. (1997). Regression calibration in failure time regression. Biometrics, 53, 131–145.
- (24) Wu, C. F. (1981). Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, 9, 501–513.
- (25) Zhang, J., Yu, Y., Zhu, L. X., & Liang, H. (2013). Partial linear single index models with distortion measurement errors. Annals of the Institute of Statistical Mathematics, 65, 237–267.
Zhang, J., Zhu, L. X., & Liang, H. (2012).
Nonlinear models with measurement errors subject to single-indexed distortion.
Journal of Multivariate Analysis, 112, 1–23.
- (27) Zhang, J., Zhu, L. P., & Zhu, L. X. (2012). On a dimension reduction regression with covariate adjustment. Journal of Multivariate Analysis, 104, 39–55.
- (28) Zhou, H., & Pepe, M. S. (1995). Auxiliary covariate data in failure time regression. Biometrika, 82, 139–149.
- (29) Zhou, H., & Wang, C. Y. (2000). Failure time regression with continuous covariates measured with error. Journal of the Royal Statistical Society, Series B, 62, 657–665.