This report is designed to clarify a few points about the article “Semiparametric modeling of grouped current duration data with preferential reporting” by McLain, Sundaram, Thoma and Louis in Statistics in Medicine (McLain et al., 2014, hereafter MSTL) regarding the use of the methods under right censoring. In simulation studies, it has been found that bias can occur when right censoring is present. Current duration data normally do not have censored values, but censoring can be induced at a chosen value after which the data values are thought to be unreliable. As noted in MSTL, some right censored data require an assumption on the parametric form of the data beyond the censoring value. While this assumption was given in MSTL, its implications were not sufficiently explored. Here we present simulations that evaluate the methods of MSTL under type I censoring, give some settings under which the method works well even in the presence of censoring, state when the model is correctly specified, and discuss the reasons for the bias.
2 Tail Assumptions Under Right Censoring
The bias observed under censoring is a result of model misspecification under censoring. To see this, we note the following form of the current duration probability mass function for the semi-parametric model
. However, such an approach cannot be taken under right censoring. As noted in the Estimation section of MSTL, page 3966,
Let denote the ordered and distinctly observed uncensored current durations, and . When censoring is present, we cannot set for because the likelihood for those censored at would be . To allow for , we introduce an additional parameter and set for all .
That is, under type I censoring the model assumes that are equal for all .
The tail assumption is needed because a semiparametric model cannot estimate the mean under type I censoring without making a parametric assumption on the distribution beyond the value of . Recall that the relationship between and is , thus is required to specify the model. Under type I censoring at we can only estimate with a semiparametric model. This is similar to the fact that cannot be estimated from a Kaplan-Meier curve if the maximum value is censored. To estimate the above tail assumption is used, which implies that the discrete hazard probability of takes the parametric form
Notice that this implies that the discrete hazard probabilities are constant in , thus follows a geometric distribution in the tail, i.e., is constant in for . When this assumption is misspecified, biases can occur. For example, if is non-constant in for , the denominator in (1) is misspecified since it is a function of for . The misspecification in the denominator cannot be absorbed in any way, and results in model misspecification. The same phenomenon occurs with the piecewise constant model of MSTL, where is constant beyond the largest knot.
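The effect of the constant hazard assumption on the normalizing sum can be illustrated with a minimal sketch. The notation here is an assumption made for illustration, not taken from MSTL: `h_t` is the discrete hazard at time `t`, `S(t)` is the product of `1 - h_k` over `k <= t`, and `tau` is the censoring value. All hazard values are hypothetical. Under a constant tail hazard the sum of `S(t)` over `t > tau` has the closed form `S(tau) * (1 - h) / h`; when the true tail hazard is not constant, that closed form no longer matches the true sum.

```python
def survival_from_hazards(hazards):
    """Return [S(1), ..., S(n)] with S(t) = prod_{k <= t} (1 - h_k)."""
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        surv.append(s)
    return surv


def geometric_tail_sum(s_tau, h_tail):
    """Sum of S(t) over t > tau when the hazard equals h_tail for every t > tau."""
    return s_tau * (1.0 - h_tail) / h_tail


tau = 5  # hypothetical censoring value

# Hypothetical hazards: 0.30 up to tau, then constant 0.10 (tail correctly specified).
constant_tail = survival_from_hazards([0.30] * tau + [0.10] * 2000)

# Hypothetical hazards: 0.10 just after tau, dropping to 0.05 later (tail misspecified).
changing_tail = survival_from_hazards([0.30] * tau + [0.10] * 5 + [0.05] * 2000)

# Brute-force tail sums over t > tau (the 2000 extra terms make truncation negligible).
true_sum_constant = sum(constant_tail[tau:])
true_sum_changing = sum(changing_tail[tau:])

# Tail sum implied by assuming a constant hazard of 0.10 beyond tau.
assumed_sum = geometric_tail_sum(constant_tail[tau - 1], 0.10)

print(true_sum_constant, assumed_sum)   # agree when the tail is geometric
print(true_sum_changing, assumed_sum)   # differ when the tail hazard changes
```

In the second comparison the assumed sum is too small, which is the kind of denominator misspecification described above.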
If the values of were observed the tail behavior of the ’s would not impact the estimation since they would not enter the likelihood. However, since we observe the values with probability mass function given in (1), the tail values of impact the estimation. This explains why this problem is unique to current duration analysis.
Another issue with censoring is how to truncate the upper limit of the infinite sum in the denominator in (1), which we denote by . In theory this value should be set at a point where negligible probability mass occurs thereafter. For cases when there is no known upper boundary to the distribution, we have observed in simulation studies that when is too large it causes instability in the estimates, especially for the piecewise constant model, and having too small results in biased estimates. Whether a value is “too small” or “too large” will depend on the distribution of the data. A strategy we found effective in simulation studies was to set to twice the largest value before the administrative censoring was implemented. MSTL set , which we found could be too large based on some of the new simulation settings tested.
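How much of the infinite sum is lost at a given truncation point can be checked numerically. The sketch below assumes a discrete Weibull survival function of the form `S(t) = q ** (t ** beta)`, a common parameterization adopted here only for illustration; the values of `q`, `beta` and the truncation points are hypothetical and are not those used in the simulations.

```python
def discrete_weibull_survival(t, q, beta):
    """Assumed survival function S(t) = q ** (t ** beta); beta = 1 is geometric."""
    return q ** (t ** beta)


def truncated_sum(q, beta, upper):
    """Sum of S(t) for t = 0, ..., upper."""
    return sum(discrete_weibull_survival(t, q, beta) for t in range(upper + 1))


def relative_truncation_error(q, beta, upper, reference_upper=100_000):
    """Share of the (numerically complete) sum that is lost by stopping at `upper`."""
    full = truncated_sum(q, beta, reference_upper)
    return (full - truncated_sum(q, beta, upper)) / full


if __name__ == "__main__":
    # Hypothetical parameters: a heavier tail (beta < 1) needs a larger upper limit.
    for beta in (1.0, 0.7):
        for upper in (20, 40, 80):
            err = relative_truncation_error(0.9, beta, upper)
            print(f"beta={beta}, upper={upper}, relative error={err:.4f}")
```

A check of this kind only addresses the bias from truncating too early; it does not capture the instability from very large upper limits reported above.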
3 Simulation Studies
To test the properties of the models in MSTL, numerous simulation studies were performed. The current duration for the th subject was simulated by generating the unobserved total durations as for , where and is a fixed large integer, then setting . This setting replicates a renewal process in equilibrium with renewal distribution (see Feller, 1966, for details).
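The renewal construction can be sketched as follows. This is a minimal illustration and not the authors' program: it assumes geometric total durations on {1, 2, ...}, a hypothetical sampling time `c`, and the convention that the current duration is the time elapsed since the last renewal at or before `c`. The censoring helper assumes that values larger than `tau` are recorded as `tau` and flagged as censored.

```python
import math
import random


def draw_geometric(rng, p):
    """Total duration on {1, 2, ...} with P(T = t) = p * (1 - p) ** (t - 1)."""
    u = rng.random()
    return 1 + int(math.log(1.0 - u) / math.log(1.0 - p))


def simulate_current_duration(rng, p, c=200):
    """Run renewals from time 0 and return the time since the last renewal at c."""
    last_renewal = 0
    while True:
        next_renewal = last_renewal + draw_geometric(rng, p)
        if next_renewal > c:
            return c - last_renewal
        last_renewal = next_renewal


def apply_type_one_censoring(durations, tau):
    """Return (value, censored) pairs; values above tau are recorded as tau."""
    return [(min(y, tau), y > tau) for y in durations]


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed so the sketch is reproducible
    sample = [simulate_current_duration(rng, p=0.2) for _ in range(1000)]
    censored = apply_type_one_censoring(sample, tau=12)
    print(sum(flag for _, flag in censored) / len(censored))  # censoring proportion
```

With geometric total durations the current durations under this convention are again geometric (on {0, 1, ...}), which gives a simple check of the simulation.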
All of the simulation scenarios used data that was discretely distributed with a simple binary covariate with 0.5 success probability. The underlying distribution of the survival times is where . The value of was set to (a) , (b) for or (c) . Here, (a) corresponds to a geometric setting, (b) corresponds to a piecewise geometric distribution, and the survival function for (c) is equal to with we refer to as the discrete Weibull setting (note that (c) is equivalent to (a) when ). For (a) we set , for (b) and (c) and was varied to alter the proportion of censored values. For (b) or , while for (c) or . The lower values induce more censoring. For (b) we set and , which match the knots used for the piecewise constant model. For each setting, type I censoring at along with no censoring was applied. All simulations used subjects.
The above distributions were fitted with the semiparametric and piecewise constant models from MSTL, where the piecewise constant model had knots at , equal to those used for simulating the data. For the geometric setting in (a) the tail assumption is correctly specified regardless of the value of . The tail assumption is also correctly specified in (b) when since for all . The misspecified scenarios include (b) when , and setting (c). Programs to simulate and fit all models are available from the first author's website (see the ‘Programs’ Section below).
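Table 1: [table body not recovered; surviving panel titles: “Piecewise Geometric with high censoring” and “Discrete Weibull with high censoring”]. Quantities named in the surviving part of the caption: the empirical standard deviation (sd), the empirical coverage probability (ecp) and the censoring proportion (prop cen).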
In Table 1 we present bias, standard deviation and empirical coverage probabilities for various distributional assumptions corresponding to the distributions discussed above, which were varied by the fixed censoring value and the and parameters. As expected, the effect of the varying censoring value on the geometric setting is relatively small. There does appear to be a decrease in the overall parameter estimate as the censoring value decreases, but overall the estimates are relatively unbiased. For the piecewise geometric setting the parameters are relatively unbiased for . This is as hypothesized since when the tail assumption is correctly specified. When the tail assumption is misspecified and we see increasing bias as gets closer to zero. Further, when the proportion censored increases the results remain consistent. This suggests that the value of , not the overall censoring proportion, is what is driving the bias. Thus, when the tail assumption is correctly specified the results appear to be relatively unbiased regardless of the proportion censored.
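For reference, the summary measures reported in Table 1 can be computed from the replicate estimates as in the sketch below. The function name, the use of Wald-type 95% intervals (estimate ± 1.96 standard errors) and the example numbers are assumptions for illustration; the intervals used in the actual simulations may have been constructed differently.

```python
import statistics


def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Bias, empirical sd and empirical coverage over simulation replicates."""
    bias = statistics.fmean(estimates) - true_value
    sd = statistics.stdev(estimates)
    # A replicate "covers" when the true value lies inside estimate +/- z * se.
    covered = [
        abs(est - true_value) <= z * se
        for est, se in zip(estimates, std_errors)
    ]
    ecp = sum(covered) / len(covered)
    return {"bias": bias, "sd": sd, "ecp": ecp}


if __name__ == "__main__":
    # Hypothetical replicate output, not results from the paper.
    print(summarize_replicates([0.9, 1.1, 1.3], [0.1, 0.1, 0.1], true_value=1.0))
```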
The Weibull setting shows noticeable bias in the estimates when the censoring percentage is larger than 10%. It should be noted that the piecewise constant model is misspecified under the Weibull, so some bias is expected. This misspecification appears to have a larger impact on the bias for the ‘high censoring’ distribution. For the semi-parametric setting the results have small bias when the censoring proportion is less than 30%.
The purpose of this paper was to investigate the properties of the MSTL model when all data are censored at a fixed value (i.e., type I censoring at ). The impact of censoring is that a parametric assumption on the tail behavior of the data must be assumed. Specifically, under censoring the model assumes that the hazard probability is constant for all where is the censoring value. The simulation studies show that when the tail behavior is correctly specified both models have relatively unbiased results regardless of the amount of censoring. This can be seen in the relatively unbiased results for the geometric setting for both models (another setting with higher censoring showed similar results), and the results for both piecewise geometric settings when . Recall that the last knot of the piecewise scenario was so the true values are constant beyond this value. Thus, when the distribution is geometric beyond the censoring value. The discrete Weibull setting is misspecified for all values of . Further, the piecewise constant model is misspecified when there is no censoring. Our simulation results show that under misspecification the degree of bias depends on the amount of censoring.
The analysis included in MSTL censored all values at . The simulation studies suggest that will not have a large impact on the results; however, this could be sensitive to the true distribution. The analysis was repeated without censoring and the results were largely unchanged. The previous analysis with the piecewise model found significant associations for both age ( with 95% CI ) and parity ( with 95% CI ). The new analysis also found significant associations for both age ( with 95% CI ) and parity ( with 95% CI ). For the semi-parametric model the effect of age changed from in the old analysis to with no censoring. The effect of parity showed attenuation with in the old analysis and with the new analysis.
The “geometric in the tail” assumption allows calculation of the quantities needed to implement maximum likelihood estimation under censoring. Specifically, it ensures that for all which is required for likelihood calculation. When the “geometric in the tail” assumption is misspecified it will lead to biased results of varying degrees (as explored in Section 3). When the tail assumption is misspecified, one option is to impose different tail behavior. Some examples include (i) , (ii) , or (iii) for . It is important to keep in mind that only sparse data are available to determine the tail behavior. We implemented different tail assumptions in simulation studies and found unstable results when two parameters were included in the calculation of the tail behavior of . So if (i) or (ii) were used, one of the parameters should be fixed.
In summary, the simulations in the paper show that censoring should be employed with caution when using the MSTL method. Further, if censoring is required, multiple values of should be used to test the sensitivity of the results. Unlike the situation found in standard survival analysis, the model assumptions extend beyond the censoring value. The main reason for censoring current duration data is concern about measurement error in large responses. Censoring is an attractive option when measurement error is likely, but we recommend that it be used cautiously and in keeping with the specified parametric assumptions. One solution in this case is to use the piecewise model, which as shown in MSTL can correct for random digit preference in the outcome.
Programs
A zip file containing all the programs to implement the MSTL model can be found at the following link: https://sites.google.com/site/alexmclain/research. See the link under the reference for MSTL, “Zip file with R code to run the programs.” This file contains all of the programs to run the semiparametric and piecewise models, along with a nonparametric method. It also contains sample data, along with two programs that will generate current duration data for the discrete Weibull and piecewise constant distributions used in Section 3. The geometric distribution can be generated as a special case of the discrete Weibull distribution when .
We would like to thank Professor Niels Keiding and his group for alerting us to this issue.
Feller, W. (1966), An introduction to probability theory and its applications. Vol. II, New York: John Wiley & Sons Inc.
McLain, A. C., Sundaram, R., Thoma, M., and Buck Louis, G. M. (2014), “Semiparametric modeling of grouped current duration data with preferential reporting,” Statistics in Medicine, 33, 3961–3972.