Linear regression model with a randomly censored predictor: Estimation procedures

10/23/2017 ∙ by Folefac Atem, et al. ∙ 0

We consider linear regression model estimation where the covariate of interest is randomly censored. Under a non-informative censoring mechanism, one may obtain valid estimates by deleting censored observations. However, this comes at a cost of lost information and decreased efficiency, especially under heavy censoring. Other methods for dealing with censored covariates, such as ignoring censoring or replacing censored observations with a fixed number, often lead to severely biased results and are of limited practicality. Parametric methods based on maximum likelihood estimation as well as semiparametric and non-parametric methods have been successfully used in linear regression estimation with censored covariates where censoring is due to a limit of detection. In this paper, we adapt some of these methods to handle randomly censored covariates and compare them under different scenarios to recently-developed semiparametric and nonparametric methods for randomly censored covariates. Specifically, we consider both dependent and independent random censoring mechanisms as well as the impact of using a non-parametric algorithm on the distribution of the randomly censored covariate. Through extensive simulation studies, we compare the performance of these methods under different scenarios. Finally, we illustrate and compare the methods using the Framingham Heart Study data to assess the association between low-density lipoprotein (LDL) in offspring and parental age at onset of a clinically-diagnosed cardiovascular event.




1 Introduction

Modeling continuous outcome data using linear regression usually assumes, in theory, that the values of the covariates are fully observed. However, in practice and especially for any large data set, it is unlikely that complete information will be available for all study participants. The issue of censored data is ubiquitous: it affects many studies across a wide range of research areas, including medicine, economics, and the social sciences.

Multiple reasons lead to incomplete observations in a data set including nonresponse, attrition, and absence of event of interest. Usually during the design, implementation, and data collection phases of a study, efforts are made to minimize the occurrence of incomplete data whenever possible and, when unavoidable, to understand the reasons for such a discrepancy in order to handle the available data adequately and run appropriate statistical analyses. Although there is an extensive literature on missing data[33, 34, 52] and censored outcomes,[28, 26, 25] only a small number of papers have explored scenarios in which the covariate is censored.[12, 6, 43, 42]

Arguably, the inadequacy of linear regression models on censored outcome variables has sparked an interest in alternative methods, and subsequently has led to major developments of regression models for survival analysis for decades. Extensive literature has been published regarding censored outcomes, especially in studies of time-to-event outcomes where censoring is due to loss of follow-up, drop out, or study termination.[28, 26, 25]

While there is a vast literature on censored outcomes and different related methods have been discussed extensively, only a limited number of papers have focused on the issue of censored covariates. Ignoring the censored nature of a covariate, or accounting for it with an inappropriate approach, in regression model estimation can lead to analytical issues and spurious results [43, 42, 8]. It is important that censored covariates be recognized, acknowledged, and handled appropriately to produce reliable results.[18, 17, 16, 35, 4, 41, 3, 19, 44] However, the vast majority of the literature on censored covariates has focused on censoring due to a limit of detection, or type I censored covariates, where observations of the covariate below such a limit cannot be measured or detected but are recorded as at or below the limit of detection value.[18, 10, 46, 8, 9, 53, 30] Only a handful of publications have investigated the implications of randomly censored covariates, where some observations of the covariate are censored at varying censoring time points.[44, 5, 6, 7, 31]

Censored covariate measurements arise when, for some participants in a study, the ascertained information of interest has not yet occurred (or will not occur) at the time of assessment. This is due to a time lag between the time when a covariate is measured (usually at baseline) and the occurrence (or non-occurrence) of an event of particular interest that needs to happen for such a measure to be available and assessed. For instance, Clayton [11] investigated familial aggregation in chronic disease incidence and modeled the possible influence that parental age at onset of a given disease might have on an individual’s risk of succumbing to a particular disease. Using the Framingham Heart Study—an ongoing multi-generational landmark study designed to identify factors and characteristics that contribute to the development of cardiovascular disease (CVD) and other diseases through long-term, active surveillance and monitoring—Atem and Matsouaka [5] studied the impact that age at onset of clinically-diagnosed cardiovascular events in parents may have on the onset of cardiovascular events among offspring.
In both cases, even if important factors have been thoroughly measured for both parents and their offspring, it is unlikely that all parents have had or will have developed the disease of interest at the time of investigation. This means that the variable “parental age at onset of a given disease” is guaranteed to be censored, i.e., not fully observed. Therefore, it is extremely important in any statistical analysis to account for the fact that the variable of interest is censored for some participants.

In theory, there are many ways to address the issue of censored covariates in data analyses. From a practical point of view, however, the most important questions are: When and under what conditions can one safely consider the problem of censored covariates to be trivial? How can current methods be applied and under what conditions can one expect (asymptotically) unbiased and meaningful results?

In general, inappropriate handling of censored covariates may affect the type I error [8], yield biased results, hinder the power to detect any meaningful treatment differences, or lead to loss of efficiency in estimating the coefficient parameters of a regression model.[42]

Complete-case analysis, whereby observations with censored covariate values are discarded (on purpose or through a software default option), is the most commonly used method. When the sample size of the data is large, the censoring mechanism is independent of the outcome, and the proportion of censored data is relatively small, complete-case analysis of the data can be employed safely[33, 34] since the impact of censored observations on the analysis of the data may be negligible. In that case, the complete-case analysis yields valid (consistent and asymptotically unbiased) estimates for regression coefficient parameters.

However, under moderate to heavy censoring of data there might be a substantial loss of efficiency due to the reduction in the sample size and the significant loss of information on other fully-observed covariates and on the outcome measures of the deleted observations.[32] Furthermore, when the censoring mechanism is informative, using a complete-case analysis can lead to biased results which are in part exacerbated by the losses of information and efficiency since restricting the analysis solely to truly observed covariate measures may introduce some imbalance in the dataset in a way that misrepresents the population under study.

When the issue of censored covariates is not ignored altogether, simple substitution methods (or, to be more accurate, ad hoc fill-in methods), where censored observations are replaced by the overall mean or median of the observed variable or, alternatively, by a constant, are frequently used because they are simple, easy to understand, and easy to implement. Unfortunately, they may lead to substantially biased estimates and inaccurate conclusions.[13, 10, 45, 46, 38, 43, 17]

Several non-trivial statistical methods have been developed specifically to impute censored covariates and estimate regression coefficient parameters in a model with a censored covariate.[51, 27] Some of these methods, known as parametric methods, use maximum likelihood estimation (MLE) under the assumption that the covariate follows a specific distribution.[40, 4, 29, 45, 31, 13, 10, 12, 3, 48, 47, 27] For example, when such a distribution assumption is plausible, Richardson and Ciampi[41] proposed using MLE to impute censored observations in the context where measurements of the covariate are left-censored due to a limit of detection. However, this approach has some limitations, especially when the censored covariate is correlated with other covariates.[40, 27, 48, 47]

As we know, an MLE method relies on a parametric distribution assumption of the censored covariate, i.e., the postulated distribution is assumed to be true and correctly specified. It is less reliable when the distribution assumption is incorrect or when the data set is so small that it becomes questionable whether the assumed distribution fits the data well. In that case, a semiparametric model that makes weaker parametric distribution assumptions or, even better, a nonparametric method that does not assume any specific distribution model at all is preferable.[40, 27, 35]

As previously mentioned, most of the methods highlighted above were developed to account for type I censoring or a limit of detection, typically for left-censored covariates. Nevertheless, it is fairly straightforward to adapt methods for limit of detection data or a type I censored covariate to a right-censored covariate. However, to the best of our knowledge, no parametric approach employed for type I censored covariates has been extended to handle randomly censored covariates. In addition, barring a few papers on dependent (random) censoring mechanisms[44], the vast majority of published methods for type I censoring rely on the assumption that the censoring mechanism is independent of the outcome of interest.[19, 35, 38, 18]

Our primary objective in this paper is to adapt a parametric method proposed for linear regression models with a type I censored covariate to the case in which the covariate is randomly censored. We use simulation studies to compare this newly developed method to the methods proposed by Sampene and Atem [44], in which randomly censored covariate values are replaced by nonparametric and semiparametric estimates of or , where denotes the maximum observation time for the variable , and the outcome of interest . For this purpose, we will consider both dependent and independent censoring mechanisms, according to whether or not the censoring mechanism depends on the outcome of interest. Furthermore, we will also compare the aforementioned nonparametric estimation method to the commonly used deletion or complete-case analysis and give recommendations on the methods of estimation based on our simulation results.

We begin in Section 2 by presenting parametric and non-parametric methods used in the censored covariate literature. We then introduce the methods proposed by Sampene and Atem [44] to handle randomly censored covariates. In Section 3 we run simulation studies to compare each of the discussed methods as well as the complete-case analysis method. Finally, we apply these methods in Section 4 to the Framingham Offspring Study to assess the influence of parental age at onset of cardiovascular disease on the low-density lipoprotein (LDL) levels of their offspring.

2 Notation and Methods

We consider $n$ study participants independently sampled from a reference population. Let $Y_i$, $X_i$, and $C_i$ be, respectively, the continuous outcome variable, the potentially censored covariate (about which we are interested in making inferences), and the right-censoring variable, where $i = 1, \ldots, n$ indexes subjects.

Due to the right-censoring in the covariate $X_i$, for each participant $i$, we observe the vector $(Y_i, T_i, \delta_i)$, where $T_i = \min(X_i, C_i)$ and $\delta_i = I(X_i \le C_i)$. The linear regression model is given by

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$

where the parameter coefficients $\beta_0$ and $\beta_1$ are the intercept and slope, respectively. The random error $\epsilon_i$ is assumed to be independent of $X_i$ and follows a normal distribution with mean 0 and variance $\sigma^2$, i.e., $\epsilon_i \sim N(0, \sigma^2)$.

We consider two different cases of censoring mechanisms. In the first case, we assume that the censoring mechanism is non-informative, i.e., the censoring variable is independent of the outcome. For the second case, we assume that the censoring mechanism depends on the outcome in the sense that there is some known cut-off point for the outcome such that the censoring variable follows one distribution function when the outcome lies on one side of that point and a different distribution function when it lies on the other side.

For simplicity and demonstrative purposes, we limit our discussion to cases with no additional covariates. If, in practice, the data at hand contain a set of additional fully observed (i.e., non-censored) covariates, the methods discussed here could easily be extended to accommodate such covariates.
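To make the observed-data structure concrete, the following minimal Python sketch builds the recorded triple (outcome, recorded covariate value, censoring indicator) from a latent covariate value and a censoring value, and then fits the complete-case regression by ordinary least squares. The names `Observation`, `observe`, `ols`, and `complete_case_fit`, the field names `t` and `delta`, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Observation:
    y: float    # continuous outcome
    t: float    # recorded covariate value: min(x, c)
    delta: int  # 1 if the covariate is observed, 0 if right-censored


def observe(y: float, x: float, c: float) -> Observation:
    """Apply right-censoring of the covariate x at the censoring value c."""
    return Observation(y=y, t=min(x, c), delta=1 if x <= c else 0)


def ols(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def complete_case_fit(data: List[Observation]) -> Tuple[float, float]:
    """Discard censored rows, then regress the outcome on the covariate."""
    kept = [o for o in data if o.delta == 1]
    return ols([o.t for o in kept], [o.y for o in kept])


# Toy data lying exactly on y = 1 + 2x (hypothetical values).
latent_x = [1.0, 2.0, 3.0, 4.0, 5.0]
censor_c = [2.0, 1.5, 6.0, 6.0, 4.5]
data = [observe(1.0 + 2.0 * x, x, c) for x, c in zip(latent_x, censor_c)]
intercept, slope = complete_case_fit(data)
n_censored = sum(1 - o.delta for o in data)
```

In this noise-free toy example the complete-case fit recovers the line from the three uncensored rows; with noisy data the price of deletion is the larger standard error that comes with the smaller sample.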

2.1 Parametric method: Maximum likelihood estimation

A parametric method assumes an underlying distribution of the population from which the data at hand were sampled and uses the maximum likelihood estimation method to draw inference.

Suppose that the censoring is independent of ; this implies and are independent. Therefore, the distribution of is a product of the distributions of and . The likelihood of is made up of two components; one based on the uncensored (observed) and the other on the right-censored :

with $\delta_i = 1$ if $X_i$ is observed and $\delta_i = 0$ if $X_i$ is censored. The maximum likelihood estimate of the unknown regression parameter corresponding to the censored covariate is derived from the log-likelihood function


where and

Suppose that follows a normal distribution with mean and variance we have

For the censored component, consider the cumulative Gaussian distribution function and define , and . We show in the Appendix that


When censoring is dependent as described above, and are dependent. The likelihood can be expressed as


where the data is made of fully observed and right censored . The censored component is divided into and components with associated distribution and respectively, as in equation (2).
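For reference, the likelihood under independent censoring can be sketched in closed form when both the error and the covariate are normal. The notation here is an assumption made for illustration: the model is $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ and $X_i \sim N(\mu, \tau^2)$, $t_i$ is the recorded covariate value, and $\delta_i$ equals 1 for an observed and 0 for a right-censored covariate. An uncensored participant contributes $f(y_i \mid t_i)\, f(t_i)$, while a censored participant contributes $\int_{t_i}^{\infty} f(y_i \mid x) f(x)\,dx = f_Y(y_i)\, P(X_i > t_i \mid y_i)$, where $Y_i \sim N(\beta_0 + \beta_1\mu,\ \beta_1^2\tau^2 + \sigma^2)$ and $X_i \mid y_i$ is normal with mean $\mu + \beta_1\tau^2 (y_i - \beta_0 - \beta_1\mu)/(\beta_1^2\tau^2 + \sigma^2)$ and variance $\tau^2\sigma^2/(\beta_1^2\tau^2 + \sigma^2)$. This is the standard bivariate normal result and is not claimed to match the form derived in the Appendix. A Python version, with hypothetical function names, might look like this:

```python
import math
from typing import Sequence, Tuple


def norm_logpdf(value: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution with the given mean and sd."""
    z = (value - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def norm_sf(z: float) -> float:
    """Upper tail probability P(Z > z) of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def loglik(params: Tuple[float, float, float, float, float],
           data: Sequence[Tuple[float, float, int]]) -> float:
    """Log-likelihood under independent censoring and a normal covariate.

    params = (beta0, beta1, sigma, mu, tau); data rows are (y, t, delta).
    The normality of the covariate is an assumption of this sketch.
    """
    beta0, beta1, sigma, mu, tau = params
    var_y = beta1 ** 2 * tau ** 2 + sigma ** 2      # marginal variance of Y
    var_x_given_y = tau ** 2 * sigma ** 2 / var_y   # variance of X given Y
    total = 0.0
    for y, t, delta in data:
        if delta == 1:
            # Observed covariate: f(y | x = t) * f(x = t)
            total += norm_logpdf(y, beta0 + beta1 * t, sigma)
            total += norm_logpdf(t, mu, tau)
        else:
            # Censored covariate: f_Y(y) * P(X > t | Y = y)
            mean_x_given_y = mu + beta1 * tau ** 2 * (y - beta0 - beta1 * mu) / var_y
            tail = norm_sf((t - mean_x_given_y) / math.sqrt(var_x_given_y))
            total += norm_logpdf(y, beta0 + beta1 * mu, math.sqrt(var_y))
            total += math.log(max(tail, 1e-300))  # guard against log(0)
    return total
```

The function can be handed to any numerical optimizer to obtain the estimates. The dependent-censoring case would additionally need the two censoring distributions, which this sketch omits.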

2.2 Overview of nonparametric methods

As stated in the introduction, most of the methods described in the published literature on censored covariates address covariates subject to a limit of detection. Prior to the late 1990s, the most common approach to handling such censored covariates was the complete-case analysis method.

Alternatively, several naive ad hoc methods have been proposed, including substitution methods, which consist of replacing censored covariate values with either a function of the limit of detection [19] or the mean of the observed covariate measures (mean substitution)[48], as well as dichotomizing the potentially censored covariate into a binary covariate.[8, 43] Inevitably, each of these ad hoc methods leads to biased estimation of the regression parameters. For instance, Helsel investigated the use of these naive substitution methods and concluded that they are inefficient and have no mathematically plausible backing [16]. The extent of the inefficiency depends on the extent and the severity of censoring (i.e., the distance between the limit of detection or random censoring value and the natural limit for the covariate). Finally, Atem et al.[6, 7] explored additional non-parametric methods, based on a multiple imputation approach, but concluded that these methods were not efficient when applied to cases of dependent censoring.

Recently, Sampene and Atem [44] proposed two conditional multiple imputation methods for estimation and inference. The underpinnings of these methods involve replacing the randomly censored values by estimates of their conditional expectation, computed either without or with the outcome. In the absence of additional covariates, the former is determined via a Kaplan-Meier estimator and performs well when the correlation between the censored covariate and the outcome is weak. When this correlation is strong, the outcome of interest is included in the imputation model, as in the case of a missing covariate [33]. Unlike the imputation that does not involve the outcome, this conditional imputation uses estimates from the Cox proportional hazards model, hence the name Cox multiple imputation. To estimate the variance of the coefficient estimate for inference, Sampene and Atem [44] suggested using either a conditional multiple imputation or a conditional single imputation along with a bootstrap resampling procedure to correct for the underestimation of the variance inherent to single imputation. In doing so, they showed that these improvements to the complete-case analysis method result in valid inferences regardless of whether the censoring mechanism is dependent on or independent of the outcome. Furthermore, using simulation studies, they demonstrated that the multiple imputation method performs similarly to the conditional single imputation with bootstrap resampling.
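The Kaplan-Meier based imputation can be illustrated with a short Python sketch. It estimates the survival function $S$ of the covariate from the recorded values and censoring indicators, then replaces each censored value $c$ by the estimated conditional mean $c + \int_c^{t_{\max}} S(u)\,du \,/\, S(c)$, with the integral truncated at the largest recorded value $t_{\max}$. The truncation rule, the handling of ties, and all function names are assumptions of this sketch. It shows a single conditional mean imputation only and does not reproduce the multiple imputation, bootstrap, or Cox-based procedures of [44].

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return the distinct event times and the KM survival just after each."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    km_t: List[float] = []
    km_s: List[float] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0
        n_tied = 0
        while i < len(pairs) and pairs[i][0] == t:
            n_events += pairs[i][1]
            n_tied += 1
            i += 1
        if n_events > 0:
            surv *= 1.0 - n_events / at_risk
            km_t.append(t)
            km_s.append(surv)
        at_risk -= n_tied
    return km_t, km_s


def surv_at(u: float, km_t: Sequence[float], km_s: Sequence[float]) -> float:
    """Right-continuous step function S(u)."""
    s = 1.0
    for t, value in zip(km_t, km_s):
        if t <= u:
            s = value
        else:
            break
    return s


def conditional_mean(c: float, km_t: Sequence[float],
                     km_s: Sequence[float], t_max: float) -> float:
    """Estimate E[X | X > c], truncating the integral at t_max."""
    s_c = surv_at(c, km_t, km_s)
    if s_c <= 0.0 or c >= t_max:
        return c  # nothing to integrate beyond c
    knots = [c] + [t for t in km_t if c < t < t_max] + [t_max]
    area = sum(surv_at(left, km_t, km_s) * (right - left)
               for left, right in zip(knots[:-1], knots[1:]))
    return c + area / s_c


def impute_censored(times: Sequence[float], events: Sequence[int]) -> List[float]:
    """Keep observed values; replace censored ones by the conditional mean."""
    km_t, km_s = kaplan_meier(times, events)
    t_max = max(times)
    return [t if d == 1 else conditional_mean(t, km_t, km_s, t_max)
            for t, d in zip(times, events)]


# Hypothetical recorded values; the second one is right-censored at 2.
imputed = impute_censored([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```

In the toy data, the value censored at 2 is replaced by 3.5, the average of the two later event times, because the Kaplan-Meier estimate splits the remaining mass equally between them.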

In the next section, we run simulation studies to compare the complete-case analysis, parametric, mean imputation, naive ad hoc, conditional single imputation, conditional multiple imputation, and Cox multiple imputation methods. It is worth mentioning that Cole et al. (2009) and Nie et al. [40] have explored the parametric approach for type I censoring. They showed that this approach is very efficient for limit of detection data. However, as pointed out by one of the reviewers, it is worth exploring how well this parametric approach works compared to nonparametric approaches when censoring is random.

3 Monte Carlo Simulations

3.1 Data generation and simulation set up

We assumed that the true linear regression model is given by with = and . The variable as well as the censoring variable distribution were generated from a two-parameter Weibull distribution


where is the shape parameter also known as the Weibull slope, with and , , is the scale parameter.

More precisely, we generated samples of size and respectively, and chose in each case. For independent censoring mechanism, we considered the following distributions

  • and for and .

  • and with and .

  • and , and .

The selected values of allowed us to obtain, respectively, and censoring. Under dependent censoring, we defined the corresponding mechanism such that if and if . We also considered the following data generating distributions to obtain and censoring: , and with and , respectively.
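The data-generating scheme can be sketched in Python as follows. Because the parameter values of the original design did not survive in the text above, every number below (sample size, regression coefficients, Weibull shapes and scales, the outcome cut-off) is a hypothetical placeholder, and the function names are illustrative. Note that `random.weibullvariate(alpha, beta)` takes the scale as `alpha` and the shape as `beta`.

```python
import random
from typing import List, Optional, Tuple

Row = Tuple[float, float, int]  # (outcome, recorded covariate, indicator)


def simulate(n: int, beta0: float, beta1: float, sigma: float,
             x_shape: float, x_scale: float,
             c_shape: float, c_scale: float,
             rng: random.Random,
             y_cut: Optional[float] = None,
             c_scale_above: Optional[float] = None) -> List[Row]:
    """Generate one data set with a randomly right-censored covariate.

    With y_cut=None the censoring variable is independent of the outcome.
    Otherwise its scale switches to c_scale_above whenever y > y_cut,
    which makes the censoring mechanism depend on the outcome.
    """
    rows: List[Row] = []
    for _ in range(n):
        x = rng.weibullvariate(x_scale, x_shape)         # latent covariate
        y = beta0 + beta1 * x + rng.gauss(0.0, sigma)    # linear model
        scale = c_scale
        if y_cut is not None and c_scale_above is not None and y > y_cut:
            scale = c_scale_above
        c = rng.weibullvariate(scale, c_shape)           # censoring value
        rows.append((y, min(x, c), 1 if x <= c else 0))
    return rows


def censoring_rate(rows: List[Row]) -> float:
    """Proportion of rows whose covariate is censored."""
    return sum(1 - d for _, _, d in rows) / len(rows)


rng = random.Random(2017)  # fixed seed for reproducibility
independent = simulate(200, 1.0, 0.5, 1.0, 2.0, 1.0, 2.0, 1.5, rng)
dependent = simulate(200, 1.0, 0.5, 1.0, 2.0, 1.0, 2.0, 1.5, rng,
                     y_cut=1.5, c_scale_above=1.0)
rate_independent = censoring_rate(independent)
rate_dependent = censoring_rate(dependent)
```

In practice the censoring scale would be tuned until the empirical censoring rate matches the target proportion for each scenario.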

3.2 Simulation results

Tables 1–4 summarize the results of the four sets of simulations performed for light censoring (20%) and heavy censoring (40%) in terms of

  1. bias, which is the (overall) deviation of a parameter estimate from the true parameter, averaged over the generated data sets;

  2. empirical standard error of the estimate over all simulated data sets;

  3. simulation error, i.e., the average of model-based standard errors;

  4. mean squared error (MSE), which is the expectation of the squared deviation of a parameter estimate from the truth. It is equal to the squared bias plus the squared empirical standard error;

  5. coverage probability, which is the proportion of simulated samples for which the confidence interval includes the true parameter value.
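These five summaries are straightforward to compute from the per-replicate estimates and model-based standard errors. The Python sketch below is illustrative only: the function name `summarize` is hypothetical, and the normal critical value of 1.96 (a 95% Wald interval) is an assumption, since the confidence level is not legible in the text above.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize(estimates: Sequence[float], model_ses: Sequence[float],
              truth: float, z: float = 1.96) -> Dict[str, float]:
    """Summarize simulation replicates for one parameter.

    estimates: one point estimate per simulated data set
    model_ses: the matching model-based standard errors
    truth:     the true parameter value used to generate the data
    z:         critical value for the Wald interval (assumed 95% level)
    """
    bias = mean(estimates) - truth
    empirical_se = stdev(estimates)
    simulation_error = mean(model_ses)
    mse = bias ** 2 + empirical_se ** 2
    covered = [abs(est - truth) <= z * se
               for est, se in zip(estimates, model_ses)]
    return {
        "bias": bias,
        "empirical_se": empirical_se,
        "simulation_error": simulation_error,
        "mse": mse,
        "coverage": sum(covered) / len(covered),
    }


# Hypothetical replicates, for illustration only.
summary = summarize([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], truth=2.0)
```

As a consistency check on the decomposition, the first row of Table 1 reports a bias of 0.0012 and a standard error of 0.2527, and $0.0012^2 + 0.2527^2 \approx 0.0639$, which is the MSE reported in that row.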

Light Censoring
Columns: Bias, Empirical SE, Simulation Error, MSE, Coverage Probability
Actual data (No Censoring) 0.0012 0.2527 0.2536 0.0639 0.970
Complete-case 0.0044 0.4138 0.4163 0.1712 0.955
Mean Substitution 0.0044 0.4200 0.4163 0.1712 0.962
Maximum Likelihood 0.0510 0.3345 0.3377 0.1145 0.945
Conditional Single Imputation 0.0361 0.2901 0.3138 0.0855 0.969
Conditional Multiple Imputation 0.0014 0.4011 0.4111 0.0842 0.960
Cox Based Multiple Imputation 0.0013 0.4201 0.4211 0.1765 0.966
Actual data (No Censoring) 0.0009 0.1103 0.1118 0.0122 0.971
Complete-case 0.0021 0.1795 0.1799 0.0322 0.955
Mean Substitution 0.0021 0.1820 0.1799 0.0331 0.957
Maximum Likelihood 0.0361 0.1469 0.1530 0.0229 0.945
Conditional Single Imputation 0.0132 0.1297 0.1524 0.0170 0.970
Conditional Multiple Imputation 0.0011 0.1811 0.1815 0.0324 0.960
Cox Based Multiple Imputation 0.0012 0.1705 0.1743 0.0290 0.966
Heavy Censoring
Actual data (No Censoring) 0.0012 0.2527 0.2536 0.0639 0.970
Complete-case -0.0048 0.8683 0.9052 0.7540 0.943
Mean Substitution -0.0048 0.8852 0.9052 0.7836 0.946
Maximum Likelihood 0.2007 0.5229 0.5421 0.3137 0.900
Conditional Single Imputation 0.1000 0.4542 0.5657 0.2163 0.930
Conditional Multiple Imputation 0.0545 0.8101 0.7887 0.6592 0.970
Cox Based Multiple Imputation 0.0029 0.8700 0.8911 0.7569 0.965
Actual data (No Censoring) 0.0009 0.1103 0.1118 0.0122 0.971
Complete-case -0.0125 0.3744 0.3845 0.1403 0.948
Mean Substitution -0.0126 0.3820 0.3845 0.1461 0.952
Maximum Likelihood 0.0949 0.2334 0.2369 0.6348 0.911
Conditional Single Imputation 0.0613 0.1812 0.2188 0.0366 0.950
Conditional Multiple Imputation 0.0060 0.3891 0.3691 0.1475 0.961
Cox Based Multiple Imputation 0.0021 0.3786 0.3888 0.1433 0.966

Table 1: Case 1: and censoring is independent of .
Light Censoring
Columns: Bias, Empirical SE, Simulation Error, MSE, Coverage Probability
Actual data (No Censoring) 0.0086 0.3934 0.3969 0.1548 0.968
Complete-case 0.0232 0.5419 0.5717 0.2942 0.942
Mean Substitution 0.0232 0.5446 0.5717 0.2971 0.942
Maximum Likelihood 0.0510 0.4605 0.4833 0.2147 0.939
Conditional Single Imputation 0.0185 0.4399 0.4641 0.1939 0.966
Conditional Multiple Imputation 0.0170 0.5818 0.5956 0.3388 0.960
Cox Based Multiple Imputation 0.0131 0.5463 0.5554 0.2986 0.962
Actual data (No Censoring) 0.0048 0.1757 0.1731 0.0309 0.973
Complete-case 0.0073 0.2426 0.2315 0.0589 0.958
Mean Substitution 0.0073 0.2436 0.2314 0.0594 0.959
Maximum Likelihood 0.0338 0.2075 0.2047 0.0442 0.944
Conditional Single Imputation 0.0058 0.1927 0.2801 0.0372 0.972
Conditional Multiple Imputation 0.0092 0.2481 0.2566 0.0616 0.961
Cox Based Multiple Imputation 0.0060 0.2450 0.2456 0.6001 0.962
Heavy Censoring
Actual data (No Censoring) 0.0086 0.3934 0.3969 0.1548 0.968
Complete-case 0.0334 0.8828 0.9124 0.7805 0.941
Mean Substitution 0.0334 0.8890 0.9124 0.7914 0.941
Maximum Likelihood 0.0914 0.5967 0.6065 0.3644 0.892
Conditional Single Imputation 0.0663 0.5466 0.6072 0.3032 0.960
Conditional Multiple Imputation 0.0311 0.8600 0.8340 0.7406 0.961
Cox Based Multiple Imputation 0.0200 0.9001 0.9004 0.8106 0.959
Actual data (No Censoring) 0.0048 0.1757 0.1731 0.0309 0.973
Complete-case 0.0108 0.3893 0.3865 0.1517 0.956
Mean Substitution 0.0108 0.3922 0.3864 0.1539 0.956
Maximum Likelihood 0.0900 0.2704 0.2767 0.0812 0.935
Conditional Single Imputation 0.0367 0.2319 0.2564 0.0551 0.960
Conditional Multiple Imputation 0.0121 0.4000 0.3811 0.1601 0.961
Cox Based Multiple Imputation 0.0061 0.3869 0.3905 0.1497 0.965
Table 2: Case 2: and censoring is independent of .
Light Censoring
Columns: Bias, Empirical SE, Simulation Error, MSE, Coverage Probability
Actual data (No Censoring) 0.0264 0.8322 0.8258 0.6932 0.958
Complete-case 0.0465 1.0370 1.0286 1.0775 0.948
Mean Substitution 0.0465 1.0366 1.0286 1.0767 0.948
Maximum Likelihood 0.0294 0.8894 0.8827 0.7919 0.952
Conditional Single Imputation 0.0361 0.8855 1.4533 0.7854 0.943
Conditional Multiple Imputation 0.0451 1.0349 1.0381 1.0731 0.959
Cox Based Multiple Imputation 0.0430 1.0391 1.0396 1.0816 0.959
Actual data (No Censoring) 0.0008 0.3795 0.3765 0.1440 0.970
Complete-case 0.0064 0.4698 0.4703 0.2208 0.949
Mean Substitution 0.0064 0.4698 0.4703 0.2207 0.948
Maximum Likelihood 0.0085 0.4085 0.4132 0.1604 0.970
Conditional Single Imputation 0.0044 0.4005 0.6782 0.1604 0.970
Conditional Multiple Imputation 0.0019 0.5101 0.5139 0.2602 0.961
Cox Based Multiple Imputation 0.0021 0.4704 0.4777 0.2212 0.969
Heavy Censoring
Actual data (No Censoring) 0.0264 0.8322 0.8258 0.6932 0.958
Complete-case 0.0516 1.4517 1.4531 2.1074 0.943
Mean Substitution 0.0516 1.4555 1.4531 2.1142 0.945
Maximum Likelihood 0.0204 1.0124 1.0700 1.0254 0.941
Conditional Single Imputation 0.0371 1.0079 1.3468 1.0172 0.942
Conditional Multiple Imputation 0.0429 1.5321 1.5386 2.3492 0.959
Cox Based Multiple Imputation 0.0916 1.4573 1.4857 2.2157 0.959
Actual data (No Censoring) 0.0008 0.3795 0.3765 0.1440 0.970
Complete-case 0.0067 0.6537 0.6514 0.4274 0.946
Mean Substitution 0.0067 0.6538 0.6515 0.4275 0.947
Maximum Likelihood 0.0191 0.4639 0.4585 0.2156 0.951
Conditional Single Imputation 0.0153 0.4442 0.7648 0.1975 0.970
Conditional Multiple Imputation 0.0060 0.6779 0.6768 0.4631 0.959
Cox Based Multiple Imputation 0.0061 0.6537 0.6617 0.4274 0.966
Table 3: Case 3: and censoring is independent of .
Light Censoring
Columns: Bias, Empirical SE, Simulation Error, MSE, Coverage Probability
Actual data (No Censoring) -0.0022 0.8335 0.8484 0.6947 0.949
Complete-case 0.0157 1.0285 1.0365 1.0581 0.944
Mean Substitution 0.0157 1.0298 1.0365 1.0607 0.945
Maximum Likelihood 0.0109 0.8881 0.9122 0.7888 0.948
Conditional Single Imputation 0.0045 0.8847 1.1293 0.7827 0.959
Conditional Multiple Imputation 0.0030 1.0421 1.0471 1.0859 0.949
Cox Based Multiple Imputation 0.0049 1.4071 1.4771 1.9800 0.966
Actual data (No Censoring) 0.0012 0.3793 0.3843 0.1439 0.968
Complete-case -0.0067 0.4684 0.4685 0.2194 0.947
Mean Substitution -0.0067 0.4688 0.4685 0.2198 0.948
Maximum Likelihood -0.0063 0.4070 0.4105 0.1657 0.953
Conditional Single Imputation -0.0039 0.3992 0.7403 0.1594 0.966
Conditional Multiple Imputation 0.0025 0.4500 0.4796 0.2025 0.956
Cox Based Multiple Imputation 0.0032 0.5062 0.5111 0.2562 0.967
Heavy Censoring
Actual data (No Censoring) -0.0022 0.8335 0.8484 0.6947 0.949
Complete-case 0.0156 1.4648 1.4558 2.4120 0.944
Mean Substitution 0.0156 1.4671 1.4558 2.1526 0.945
Maximum Likelihood 0.0328 1.0092 1.0090 1.0196 0.944
Conditional Single Imputation 0.0349 1.0006 1.2314 1.0024 0.933
Conditional Multiple Imputation 0.0201 1.4001 1.4091 1.9607 0.966
Cox Based Multiple Imputation 0.0450 1.0955 1.0998 1.2116 0.966
Actual data (No Censoring) 0.0012 0.3793 0.3843 0.1439 0.968
Complete-case 0.0102 0.6578 0.6611 0.4328 0.947
Mean Substitution 0.0102 0.6596 0.6611 0.4352 0.948
Maximum Likelihood 0.0164 0.4672 0.4630 0.2185 0.952
Conditional Single Imputation 0.0301 0.4436 0.6156 0.1977 0.966
Conditional Multiple Imputation 0.0134 0.6867 0.6867 0.4717 0.966
Cox Based Multiple Imputation 0.0141 0.6771 0.6846 0.4587 0.966
Table 4: Case 4: and censoring depends on .

Tables 1 and 2 show that when the distribution of the censored covariate is highly skewed (see Figure 1), the parametric approach results in larger bias and MSE as compared to the conditional multiple imputation approach. Although the complete-case analysis is unbiased, deleting observations reduces the sample size, which results in an increased standard error and larger MSE as compared to both the maximum likelihood and the conditional multiple imputation methods. Despite being unbiased, both the complete-case and the mean substitution methods are inefficient, with higher MSE as compared to the maximum likelihood, conditional multiple imputation, and Cox multiple imputation approaches. The single conditional imputation is unbiased and is more efficient than the mean imputation, with smaller MSE, because it underestimates the standard error when imputed values are treated as true values with no uncertainty. All approaches resulted in acceptable coverage probabilities.

Figure 1: Distribution of the censored covariate as the function of the shape and scale parameters

Tables 3 and 4 show that, when the distribution of the censored covariate is close to normal (see Figure 1), the maximum likelihood approach results in smaller bias and standard error. The log-likelihood (2) is derived under the normal distribution assumption. Therefore, when the distribution of the covariate is close to the distribution on which the maximum likelihood method is based, as in Tables 3–4, the method provides a better and more efficient parameter estimate than the multiple imputation methods. Its standard error and MSE are smaller than those of both the conditional multiple and Cox multiple imputation approaches. The other imputation methods are less efficient and more biased. Overall, it is worth mentioning that the Cox multiple imputation is more efficient than the conditional imputation when the study is well powered. As the sample size increases, the Cox multiple imputation becomes very efficient, which might be due to the fact that this approach uses one additional parameter in the imputation model as compared to the Kaplan-Meier based conditional imputation, which does not involve the outcome in the imputation model.

4 Illustrative example: Association between parental age at onset of cardiac events and low-density lipoprotein (LDL) in offspring

According to the American Heart Association, cardiovascular disease (CVD) is a multi-faceted disease that affects the heart or blood vessels. CVD includes hypertensive, rheumatic, congenital, and valvular heart diseases as well as cardiomyopathies, heart arrhythmias, carditis, aortic aneurysms, peripheral artery disease, venous thrombosis, coronary death, myocardial infarction, coronary insufficiency, angina, ischemic stroke, hemorrhagic stroke, transient ischemic attack, and heart failure. It is the leading cause of death globally, accounting for over 30% of all deaths worldwide (approximately 17.3 million deaths per year). In the United States, someone dies from CVD every 39 seconds, with most of those deaths being attributed to coronary heart disease.[49, 2]

Though the death rate due to CVD has decreased slowly over the last 30 years, heart disease remains the leading cause of death in the United States, and caring for patients with poor cardiovascular health continues to be one of the largest burdens on the health care system today. From 1990 to 2009, CVD ranked first in the number of days for which patients received hospital care,[1] yet 72% of Americans do not consider themselves at risk for heart disease.[2]

Associations have long been established between CVD and a wide variety of risk factors, including non-modifiable variables such as family history.[21, 24, 39, 37, 15, 23, 20] Blood levels of low-density lipoproteins (LDL), one of the five major groups of lipoproteins categorized by density, are regarded as a strong predictor of CVD. To illustrate the methods proposed in this paper, we study the association between LDL in offspring and age at onset of a clinically-diagnosed cardiovascular event in parents, using data from the Framingham Heart Study database and looking at both the Original and Offspring cohorts.[22]

The Framingham Heart Study (FHS) is an ongoing prospective study of the etiology of cardiovascular disease, among other prevalent diseases. The study began in 1948 and enrolled 5,209 participants (55% women) aged 28 to 62 years residing in Framingham, Massachusetts, as the original cohort, who have been followed up to the present. In 1971, the Framingham Offspring Study was established with a sample of 5,124 men and women aged 5 to 70 years who were either (genetic or adoptive) offspring or spouses of offspring of the original cohort[14, 36]. Study participants are examined routinely to update their health status information and potential risk factors. Standard clinical examinations have included physician interview, physical examination, and laboratory tests, and continue to the present. Participants in the original cohort have been followed every two years; there were 40 participants during the exam visit held in 2012–2014. In the offspring cohort, participants have been followed approximately every four years. The Offspring Examination Cycle 9 covered the years 2011 to 2014 and had 2,430 participants.

In this example, we performed two separate analyses, one for each parent, to evaluate the relationship between age at onset of CVD in parents and log(LDL) in offspring. Data gathered from the original cohort (Exam 12, 1971–1974; 3,261 participants) and the offspring cohort (Exam 1, 1971–1975; 5,124 participants) were used. We deleted all observations with missing data and restricted LDL to physician-recorded values; this reduced the sample to 1,401 mothers and 1,221 fathers.

Of the 1,401 mothers in the final data set, 907 (i.e., 64.74%) experienced a cardiovascular event, whereas 909 (i.e., 74.45%) of the 1,221 fathers experienced a cardiovascular event. The median age at onset of CVD was 66 years for mothers and 63 years for fathers.

Results of the data analyses are provided in Tables 5 and 6. The results for the complete-case analysis, mean substitution, maximum likelihood, conditional single imputation, and conditional multiple imputation are consistent with the simulation results. With a larger sample, the assumption of normality for the censored covariate is met and the parametric method provides better estimates, along with smaller standard errors. On the other hand, the results from the ad hoc substitution methods are inconsistent with the simulation results. This is because there is no scientific basis for such substitutions.

5 Conclusion

Most of the literature on censored covariates deals with limits of detection, where observations below the limit cannot be measured or detected and are instead recorded at the limit of detection value[10, 46, 43]. In this paper, we considered the estimation of linear regression models when the covariate of interest is randomly censored. We evaluated nonparametric conditional imputation methods that use the Kaplan-Meier estimate to impute a censored covariate. We compared this Kaplan-Meier-based nonparametric approach to the regression on the full data (without censoring), the complete-case analysis, a naive ad hoc substitution (replacing censored values by the mean of the observed covariate values), and the maximum likelihood approach.
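As a rough illustration of the Kaplan-Meier conditional imputation mentioned above, the following Python sketch replaces each censored covariate value with an estimate of E[X | X > c], where c is the censoring time, computed from the Kaplan-Meier step function. It is not the authors' implementation: the function names (`kaplan_meier`, `impute_conditional_mean`), the toy data, and the choice to renormalize over the event times beyond c (so that any mass left after the largest event time is ignored) are assumptions made for the example.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: distinct event times and S(t) at each of them."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    event_times, surv_values = [], []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        tied = [e for tt, e in pairs if tt == t]   # all records at time t
        n_events = sum(tied)                       # events (not censored) at t
        if n_events > 0:
            surv *= 1.0 - n_events / at_risk
            event_times.append(t)
            surv_values.append(surv)
        at_risk -= len(tied)
        i += len(tied)
    return event_times, surv_values


def impute_conditional_mean(times, events):
    """Replace each censored time c by an estimate of E[X | X > c]."""
    event_times, surv_values = kaplan_meier(times, events)
    # Probability mass the Kaplan-Meier estimate puts on each event time.
    previous = [1.0] + surv_values[:-1]
    masses = [p - s for p, s in zip(previous, surv_values)]
    imputed = []
    for t, observed in zip(times, events):
        if observed:
            imputed.append(float(t))
            continue
        tail = [(u, m) for u, m in zip(event_times, masses) if u > t]
        total = sum(m for _, m in tail)
        if total > 0:
            # Assumption: renormalize over the event times beyond t.
            imputed.append(sum(u * m for u, m in tail) / total)
        else:
            # No event time beyond t: fall back to the censoring time itself.
            imputed.append(float(t))
    return imputed


if __name__ == "__main__":
    # Hypothetical ages at onset; events == 0 marks a censored value.
    ages = [2, 3, 4, 5, 6, 7]
    events = [1, 0, 1, 1, 0, 1]
    print(impute_conditional_mean(ages, events))
```

By construction, every imputed value lies strictly beyond its censoring time whenever a later event time exists, which is the property that the mean substitution approach cannot guarantee.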

Parametric estimators were obtained by maximum likelihood estimation under an assumed distribution for the censored covariate. Throughout our simulations, we demonstrated that the naive ad hoc substitution method provides biased estimates of the regression parameter of the censored covariate. As Helsel pointed out[17], these substitution methods are akin to fabricating data; they do not have any theoretical basis and should thus be discouraged.

The complete-case analysis is a widely used approach for handling censored predictors because it is easy to implement. The obvious pitfall of the complete-case approach is that it potentially sacrifices information by discarding observations. Although this method leads to unbiased estimates under independent censoring, it can result in a substantial loss of power, especially under moderate to high percentages of censored observations. Under dependent censoring, complete-case analysis may lead to model misspecification due to selection bias if a group of subjects with similar characteristics do not experience the event of interest or leave a study before its completion.
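The trade-off described above can be seen in a small simulation. The sketch below, written in Python with only the standard library, generates a covariate that is censored independently of the outcome, then compares the least-squares slope from the full data with the slope from the complete cases. All numerical settings (sample size, means, variances, the true slope of 0.5, the seed) are hypothetical choices for the illustration and are not taken from the simulation studies in this paper.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def simulate(n=5000, beta=0.5, seed=1):
    """Return (full-data slope, complete-case slope, number of complete cases)."""
    rng = random.Random(seed)
    full_x, full_y, cc_x, cc_y = [], [], [], []
    for _ in range(n):
        x = rng.gauss(60.0, 10.0)   # true covariate, e.g. age at onset
        c = rng.gauss(62.0, 10.0)   # censoring time, independent of x and y
        y = 1.0 + beta * x + rng.gauss(0.0, 2.0)
        full_x.append(x)
        full_y.append(y)
        if x <= c:                  # x is observed only if it precedes c
            cc_x.append(x)
            cc_y.append(y)
    return ols_slope(full_x, full_y), ols_slope(cc_x, cc_y), len(cc_x)


if __name__ == "__main__":
    full_slope, cc_slope, n_cc = simulate()
    print("full-data slope:", round(full_slope, 4))
    print("complete-case slope:", round(cc_slope, 4), "using", n_cc, "records")
```

Under these assumptions both slopes should land near the true value, while the complete-case fit uses only the uncensored records, which is where the loss of efficiency comes from.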

The mean substitution approach is easy to apply and looks reliable, but a detailed analysis shows that it has many shortcomings. We cannot always guarantee that the mean of the complete cases will be greater than the time at censoring. One basic assumption of censored data is that, if the event is to occur, it can only happen after the censoring time. Furthermore, this approach does not make use of the available information, that is, the time at censoring.
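The first shortcoming is easy to check on any data set. The following Python sketch applies mean substitution and reports which censored records receive an imputed value that falls before their own censoring time. The function name `mean_substitution` and the toy ages are hypothetical and serve only to illustrate the point.

```python
def mean_substitution(times, events):
    """Impute censored values by the mean of the observed (uncensored) values.

    Returns the imputed list and the indices of censored records whose
    imputed value is smaller than their censoring time.
    """
    observed = [t for t, e in zip(times, events) if e]
    mean_observed = sum(observed) / len(observed)
    imputed = [float(t) if e else mean_observed for t, e in zip(times, events)]
    violations = [
        i for i, (t, e) in enumerate(zip(times, events))
        if not e and mean_observed < t
    ]
    return imputed, violations


if __name__ == "__main__":
    # Hypothetical ages; events == 0 marks a censored value.
    ages = [50, 55, 60, 70, 52]
    events = [1, 1, 1, 0, 0]
    imputed, violations = mean_substitution(ages, events)
    print(imputed)      # the record censored at 70 is imputed as 55.0
    print(violations)   # index 3: imputed value precedes the censoring time
```

In this toy example the record censored at age 70 is assigned 55, an age at which the event is already known not to have occurred.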

Using parametric methods requires prior knowledge of, or a postulated model for, the distribution of the censored covariate. When the postulated parametric distribution of the censored covariate corresponds to the true distribution, the maximum likelihood estimation method and the nonparametric method via Kaplan-Meier estimation both provide consistent estimates under independent censoring. Under dependent censoring, if the distribution of and are similar, these methods are efficient; however, if the distribution of and are dissimilar the MLE approach will be highly inefficient (as shown in Section 2.1). Therefore, we propose the use of Kaplan-Meier nonparametric imputation in the absence of prior knowledge of the distribution of the censored covariate or when such a distribution cannot be accurately ascertained. On the other hand, if the sample size is large and the distribution of is a member of the exponential family, the MLE approach can be suitable.

Method                                Estimate   SE        P-value
Complete-case (64.74% of the data)    0.0034     0.00012   0.0002
Mean Substitution                     0.0044     0.0012    0.0002
Maximum Likelihood                    0.1999     0.0797    0.0123
Conditional Single Imputation         0.0022     0.0009    0.0276
Conditional Multiple Imputation       0.0023     0.0010    0.0284
Cox Based Multiple Imputation         0.0020     0.0009    0.0286
Table 5: Relationship between Maternal age of onset of CVD and LDL in offspring

Method                                Estimate   SE        P-value
Complete-case (74.45% of the data)    0.0024     0.0013    0.0675
Mean Substitution                     0.0024     0.0013    0.0675
Maximum Likelihood                    0.7660     0.1480
Conditional Single Imputation         0.0018     0.0010    0.0742
Conditional Multiple Imputation       0.0018     0.0011    0.0787
Cox Based Multiple Imputation         0.0017     0.0010    0.0768
Table 6: Relationship between Paternal age of onset of CVD and LDL in offspring


References

  • [1] Health, United States, 2010. Accessed: 2016-05-16.
  • [2] Matters of your heart. Accessed: 2016-05-13.
  • [3] P. S. Albert, O. Harel, N. Perkins, and R. Browne. Use of multiple assays subject to detection limits with regression modeling in assessing the relationship between exposure and outcome. Epidemiology (Cambridge, Mass.), 21(Suppl 4):S35, 2010.
  • [4] S. G. Arunajadai and V. A. Rauh. Handling covariates subject to limits of detection in regression. Environmental and ecological statistics, 19(3):369–391, 2012.
  • [5] F. Atem and R. A. Matsouaka. Improving the efficiency of the complete-case analysis when a covariate is randomly censored. Technical report, 2016.
  • [6] F. Atem, J. Qian, J. E. Maye, K. A. Johnson, and R. A. Betensky. Linear regression with a randomly censored covariate: application to an Alzheimer’s study. Technical report, 2015.
  • [7] F. D. Atem, J. Qian, J. E. Maye, K. A. Johnson, and R. A. Betensky. Multiple imputation of a randomly censored covariate improves logistic regression analysis. Journal of Applied Statistics, pages 1–11, 2016.
  • [8] P. C. Austin and J. S. Hoch. Estimating linear regression models in the presence of a censored independent variable. Statistics in medicine, 23(3):411–429, 2004.
  • [9] P. W. Bernhardt, H. J. Wang, and D. Zhang. Statistical methods for generalized linear models with covariates subject to detection limits. Statistics in Biosciences, pages 1–22, 2013.
  • [10] P. W. Bernhardt, H. J. Wang, and D. Zhang. Flexible modeling of survival data with covariates subject to detection limits via multiple imputation. Computational statistics & data analysis, 69:81–91, 2014.
  • [11] D. G. Clayton. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika, 65(1):141–151, 1978.
  • [12] S. R. Cole, H. Chu, L. Nie, and E. F. Schisterman. Estimating the odds ratio when exposure has a limit of detection. International journal of epidemiology, page dyp269, 2009.
  • [13] G. D’Angelo and L. Weissfeld. An index approach for the Cox model with left censored covariates. Statistics in medicine, 27(22):4502–4514, 2008.
  • [14] T. R. Dawber, F. E. Moore, and G. V. Mann. II. Coronary heart disease in the Framingham study. International journal of epidemiology, 44(6):1767–1780, 2015.
  • [15] R. Donahue, E. Bloom, R. Abbott, D. Reed, and K. Yano. Central obesity and coronary heart disease in men. The Lancet, 329(8537):821–824, 1987.
  • [16] D. R. Helsel. More than obvious: better methods for interpreting nondetect data. Environmental science & technology, 39(20):419A–423A, 2005.
  • [17] D. R. Helsel. Fabricating data: how substituting values for nondetects can ruin results, and what can be done about it. Chemosphere, 65(11):2434–2439, 2006.
  • [18] D. R. Helsel et al. Nondetects and data analysis. Statistics for censored environmental data. Wiley-Interscience, 2005.
  • [19] R. W. Hornung and L. D. Reed. Estimation of average concentration in the presence of nondetectable values. Applied occupational and environmental hygiene, 5(1):46–51, 1990.
  • [20] B. V. Howard, D. C. Robbins, M. L. Sievers, E. T. Lee, D. Rhoades, R. B. Devereux, L. D. Cowan, R. S. Gray, T. K. Welty, O. T. Go, et al. LDL cholesterol as a strong predictor of coronary heart disease in diabetic individuals with insulin resistance and low LDL: the Strong Heart Study. Arteriosclerosis, thrombosis, and vascular biology, 20(3):830–835, 2000.
  • [21] W. Kannel, T. Dawber, H. Thomas Jr, and P. McNamara. Comparison of serum lipids in the prediction of coronary heart disease. Framingham study indicates that cholesterol level and blood pressure are major factors in coronary heart disease; effect of obesity and cigarette smoking also noted. Rhode Island medical journal, 48:243, 1965.
  • [22] W. B. Kannel, M. Feinleib, P. M. McNamara, R. J. Garrison, and W. P. Castelli. An investigation of coronary heart disease in families: the Framingham Offspring Study. American journal of epidemiology, 110(3):281–290, 1979.
  • [23] U. Keil. Coronary artery disease: the role of lipids, hypertension and smoking. Basic Research in Cardiology, 95(1):I52–I58, 2000.
  • [24] A. Keys et al. Coronary heart disease in seven countries. Circulation, 41(1):186–195, 1970.
  • [25] J. P. Klein and M. L. Moeschberger. Survival analysis: techniques for censored and truncated data. Springer Science & Business Media, 2005.
  • [26] D. G. Kleinbaum and M. Klein. Survival analysis. Springer, 1996.
  • [27] S. Kong and B. Nan. Semiparametric approach to regression with a covariate subject to a detection limit. Biometrika, page asv055, 2016.
  • [28] S. Lagakos. General right censoring and its impact on the analysis of survival data. Biometrics, pages 139–156, 1979.
  • [29] K. Langohr, G. Gómez, and R. Muga. A parametric survival model with an interval-censored covariate. Statistics in medicine, 23(20):3159–3175, 2004.
  • [30] M. Lee, L. Kong, and L. Weissfeld. Multiple imputation for left-censored biomarker data based on Gibbs sampling method. Statistics in medicine, 31(17):1838–1848, 2012.
  • [31] S. Lee, S. Park, and J. Park. The proportional hazards regression with a censored covariate. Statistics & probability letters, 61(3):309–319, 2003.
  • [32] S. Lipsitz, M. Parzen, S. Natarajan, J. Ibrahim, and G. Fitzmaurice. Generalized linear models with a coarsened covariate. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53(2):279–292, 2004.
  • [33] R. J. Little. Regression with missing x’s: a review. Journal of the American Statistical Association, 87(420):1227–1237, 1992.
  • [34] R. J. Little and D. B. Rubin. Statistical analysis with missing data. John Wiley & Sons, 2014.
  • [35] H. S. Lynn. Maximum likelihood inference for left-censored HIV RNA data. Statistics in medicine, 20(1):33–45, 2001.
  • [36] S. S. Mahmood, D. Levy, R. S. Vasan, and T. J. Wang. The Framingham Heart Study and the epidemiology of cardiovascular disease: a historical perspective. The Lancet, 383(9921):999–1008, 2014.
  • [37] J. E. Manson, G. A. Colditz, M. J. Stampfer, W. C. Willett, B. Rosner, R. R. Monson, F. E. Speizer, and C. H. Hennekens. A prospective study of obesity and risk of coronary heart disease in women. New England journal of medicine, 322(13):882–889, 1990.
  • [38] R. C. May, J. G. Ibrahim, and H. Chu. Maximum likelihood estimation in generalized linear models with multiple covariates subject to detection limits. Statistics in medicine, 30(20):2551–2561, 2011.
  • [39] J. D. Neaton and D. Wentworth. Serum cholesterol, blood pressure, cigarette smoking, and death from coronary heart disease overall findings and differences by age for 316099 white men. Archives of internal medicine, 152(1):56–64, 1992.
  • [40] L. Nie, H. Chu, C. Liu, S. R. Cole, A. Vexler, and E. F. Schisterman. Linear regression with an independent variable subject to a detection limit. Epidemiology (Cambridge, Mass.), 21(Suppl 4):S17, 2010.
  • [41] D. B. Richardson and A. Ciampi. Effects of exposure measurement error when an exposure variable is constrained by a lower limit. American Journal of Epidemiology, 157(4):355–363, 2003.
  • [42] R. Rigobon and T. M. Stoker. Estimation with censored regressors: Basic issues. International Economic Review, 48(4):1441–1467, 2007.
  • [43] R. Rigobon and T. M. Stoker. Bias from censored regressors. Journal of Business & Economic Statistics, 27(3):340–353, 2009.
  • [44] E. Sampene and F. D. Atem. Imputing a randomly censored covariate in a linear regression model. Technical report, 2015.
  • [45] A. Sattar, S. K. Sinha, and N. J. Morris. A parametric survival model when a covariate is subject to left-censoring. Journal of biometrics & biostatistics, (2), 2012.
  • [46] A. Sattar, S. K. Sinha, X.-F. Wang, and Y. Li. Frailty models for pneumonia to death with a left-censored covariate. Statistics in medicine, 2015.
  • [47] E. F. Schisterman and R. J. Little. Opening the black box of biomarker measurement error. Epidemiology (Cambridge, Mass.), 21(Suppl 4):S1, 2010.
  • [48] E. F. Schisterman, A. Vexler, B. W. Whitcomb, and A. Liu. The limitations due to exposure detection limits for regression models. American journal of epidemiology, 163(4):374–383, 2006.
  • [49] J. Schwalm, M. McKee, M. D. Huffman, and S. Yusuf. Resource effective strategies to prevent and treat cardiovascular disease. Circulation, 133(8):742–755, 2016.
  • [50] J. V. Tsimikas, L. E. Bantis, and S. D. Georgiou. Inference in generalized linear regression models with a censored covariate. Computational Statistics & Data Analysis, 56(6):1854–1868, 2012.
  • [51] A. Vexler, G. Tao, and X. Chen. A toolkit for clinical statisticians to fix problems based on biomarker measurements subject to instrumental limitations: From repeated measurement techniques to a hybrid pooled–unpooled design. Advanced Protocols in Oxidative Stress III, pages 439–460, 2015.
  • [52] H. J. Wang and X. Feng. Multiple imputation for m-regression with censored covariates. Journal of the American Statistical Association, 107(497):194–204, 2012.
  • [53] H. Wu, Q. Chen, L. B. Ware, and T. Koyama. A Bayesian approach for generalized linear models with explanatory biomarker measurement variables subject to detection limit: an application to acute lung injury. Journal of applied statistics, 39(8):1733–1747, 2012.


Appendix A Likelihood function for a censored covariate

Note that , which implies

The log likelihood equation (2) from the main text becomes


where the constant term

Appendix B SAS Code: Parametric model

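proc NLMIXED data=data-set;
  /* supply a starting value for each parameter */
  parms mux= sigmax= sigma= alpha= beta= ;
  Q = sqrt(1/sigmax**2 + beta**2/sigma**2);
  e = y - alpha - beta*time;     /* residual at the observed covariate value */
  emu = y - alpha - beta*mux;    /* residual at the covariate mean */
  /* covariate observed: joint normal log density of (y, time) */
  if censored=0 then LL = (-e**2/sigma**2/2 - log(sigma**2*sigmax**2)/2
                           - (time-mux)**2/sigmax**2/2);
  /* covariate censored: log of the joint density integrated over x > time */
  if censored=1 then LL = log(sqrt(sigma**2 + beta**2*sigmax**2)**-1
      * probnorm(Q*(-time + mux + sigma**-2*Q**-2*beta*emu))
      * exp(2**-1*sigma**-4*Q**-2*beta**2*emu**2 - 2**-1*sigma**-2*emu**2));
  model y ~ general(LL);
run;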
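For readers who do not use SAS, the per-observation log-likelihood contribution in the listing above can be transcribed into Python as follows. This is a line-by-line sketch of the SAS expressions, assuming a normal covariate with mean `mux` and standard deviation `sigmax` and a normal error with standard deviation `sigma`; the function names `norm_cdf` and `loglik_contribution` are hypothetical. As in the SAS code, additive constants involving 2π are dropped, so the values are log densities only up to a constant.

```python
import math


def norm_cdf(z):
    """Standard normal CDF (the role played by probnorm in the SAS code)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loglik_contribution(y, time, censored, mux, sigmax, sigma, alpha, beta):
    """Log-likelihood contribution of one record, constants omitted."""
    q = math.sqrt(1.0 / sigmax**2 + beta**2 / sigma**2)
    e = y - alpha - beta * time     # residual at the observed covariate value
    emu = y - alpha - beta * mux    # residual at the covariate mean
    if not censored:
        # Covariate observed: log f(y | x = time) + log f(x = time).
        return (-e**2 / (2.0 * sigma**2)
                - 0.5 * math.log(sigma**2 * sigmax**2)
                - (time - mux)**2 / (2.0 * sigmax**2))
    # Covariate censored: log of the integral of f(y | x) f(x) over x > time.
    tail = norm_cdf(q * (-time + mux + beta * emu / (sigma**2 * q**2)))
    return (-0.5 * math.log(sigma**2 + beta**2 * sigmax**2)
            + math.log(tail)
            + beta**2 * emu**2 / (2.0 * sigma**4 * q**2)
            - emu**2 / (2.0 * sigma**2))
```

Summing `loglik_contribution` over all records gives the objective that `proc NLMIXED` maximizes through `general(LL)`; any general-purpose optimizer could be applied to the negative of that sum.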


The data for this study was approved by The University of Texas Health Science Center Institutional Review Board and was made available with the help of the Biologic Specimen and Data Repository Information Coordinating Center. Request for access to the Framingham Pedigree was approved by the Framingham Executive Committee. The authors acknowledge research support from the National Institutes of Health (NIH).
R.A. Matsouaka was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001117.
The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of Interest: None declared.