A proportional hazards model for interval-censored subject to instantaneous failures

03/31/2018 ∙ by Prabhashi W. Withana Gamage, et al. ∙ Clemson University 0

The proportional hazards (PH) model is arguably one of the most popular models used to analyze time to event data arising from clinical trials and longitudinal studies, among many others. In many such studies, the event time of interest is not directly observed but is known relative to periodic examination times; i.e., practitioners observe either current status or interval-censored data. The analysis of data of this structure is often fraught with many difficulties. Further exacerbating this issue, in some such studies the observed data also consists of instantaneous failures; i.e., the event times for several study units coincide exactly with the time at which the study begins. In light of these difficulties, this work focuses on developing a mixture model, under the PH assumptions, which can be used to analyze interval-censored data subject to instantaneous failures. To allow for modeling flexibility, two methods of estimating the unknown cumulative baseline hazard function are proposed; a fully parametric and a monotone spline representation are considered. Through a novel data augmentation procedure involving latent Poisson random variables, an expectation-maximization (EM) algorithm was developed to complete model fitting. The resulting EM algorithm is easy to implement and is computationally efficient. Moreover, through extensive simulation studies the proposed approach is shown to provide both reliable estimation and inference.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Interval-censored data commonly arise in many clinical trials and longitudinal studies, and is characterized by the fact that the event time of interest is not directly observable, but rather is known relative to observation times. As a special case, current status data (or case-1 interval censoring) arise when there exists exactly one observation time per study unit; i.e., at the observation time one discovers whether or not the event of interest has occurred. Data of this structure often occurs in resource limited environments or due to destructive testing. Alternatively, general interval-censored data (or case-2 interval censoring) arise when multiple observation times are available for each study unit, and the event time can be ascertained relative to two observation times. It is well known that ignoring the structure of interval-censored data during an analysis can lead to biased estimation and inaccurate inference; see Odell et al (1992), Dorey et al (1993). Further exasperating this issue, some studies are subject to the occurrence of instantaneous failures; i.e., the event time of interest for a number of the study units occurs at time zero. This feature can occur as an artifact of the study design or may arise during an intent-to-treat analysis (Liu et al, 2016; Matsuzaki et al, 2005; Lamborn et al, 2008). For example, Chen et al (2015) describes a registry based study of end-stage renal disease patients, with the time of enrollment corresponding to the time at which the patient first received dialysis. In this study, several of the patient expire during the first dialysis treatment, leading to the occurrence of an instantaneous failure. Similarly, Liem et al (1997) describes an intent-to-treat clinical trial comparing conventional anterior surgery and laparoscopic surgery for repairing inguinal hernia. In this study, various patients not receiving the allocated intervention, were inadequately excluded from the analysis to overcome issues such as consent withdrawal, procedure misfit etc. that would rightly attribute to instantaneous failures. Survival data with instantaneous events is not uncommon in epidemiological and clinical studies, and for this reason, herein a general methodology under the proportional hazards (PH) model is developed for the analysis of interval-censored data subject to instantaneous failures.

Originally proposed by Cox et al (1972), the PH model has (arguably) become one of the most popular regression models for analyzing time-to-event data. For analyzing interval-censored data under the PH model, several notable contributions have been made in the recent years; e.g., see Finkelstein (1986), Satten (1996), Goggins et al (1998), Groeneboom and Wellner (1992), Pan (1999), Pan (2000), Goetghebeur and Ryan (2000), Betensky et al (2002), Cai and Betensky (2003), Zhang et al (2010), Sun (2007), Zhang and Sun (2010), and Li and Ma (2013). More recently, Wang et al (2016) developed a methodology under the PH model which can be used to accurately and reliably analyze interval-censored data. In particular, this approach makes use of a monotone spline representation to approximate the cumulative baseline hazard function. In doing so, an expectation-maximization (EM) algorithm is developed through a data augmentation scheme involving latent Poisson random variables which can be used to complete model fitting. It is worthwhile to note, that none of the aforementioned techniques were designed to account for the effects associated with instantaneous failures.

The phenomenon of instantaneous (or early) failures occur in many lifetime experiments; to include, but not limited to, reliability studies and clinical trials. In reliability studies, instantaneous failures may be attributable to inferior quality or faulty manufacturing, where as in clinical trials these events may manifest due to adverse reactions to treatments or clinical definitions of outcomes. When the failure times are exactly observed, as is the case in reliability studies, it is common to incorporate instantaneous failures through a mixture of parametric models, with one being degenerate at time zero; e.g., see

Muralidharan (1999), Kale and Muralidharan (2002), Murthy et al (2004), Muralidharan and Lathika (2006), Pham and Lai (2007), and Knopik (2011). In the case of interval-censored data, more common among epidemiological studies and clinical trials, accounting for instantaneous failures becomes a more tenuous task, with practically no guidance available among the existing literature. Arguably, in the context of interval-censored data, one could account for instantaneous failures by introducing an arbitrarily small constant for each as an observation time, and subsequently treat the instantaneous failures as left-censored observations. In doing so, methods for interval-censored data, such as those discussed above, could be employed. While this approach may seem enticing, in the case of a relatively large number of instantaneous failures it has several pitfalls. In particular, through numerical studies (results not shown) it has been determined that this approach when used in conjunction with modeling techniques such as those proposed in Pan (1999) and Wang et al (2016) may lead to inaccurate estimation of the survival curves and/or the covariate effects. Further, after an extensive literature review, it does not appear that any methodology has previously been developed to specifically address data of this structure. For these reasons, herein a general methodology under the PH model is developed for the analysis of interval-censored data subject to instantaneous failures.

For the analysis of interval-censored data subject to instantaneous failures a new mixture model is proposed, which is a generalization of the semi-parametric PH model studied in Wang et al (2016). The proposed PH model is developed under the standard PH assumption; i.e., the covariates provide for a multiplicative effect on the baseline risk of both experiencing a failure at time zero and thereafter. Two separate techniques are developed for the purposes of estimating the cumulative baseline hazard function. The first allows a practitioner to specify a parametric form (up to a collection of unknown coefficients) for the unknown function, while the second provides for more modeling flexibility through the use of the monotone splines of Ramsay (1988)

. Under either formulation, a two-stage data augmentation scheme involving latent Poisson random variables is used to develop an efficient EM algorithm which can be used to estimate all of the unknown parameters. Through extensive simulation studies the proposed methodology is shown to provide reliable estimation and inference with respect to the covariate effects, baseline cumulative hazard function, and baseline probability of experiencing an instantaneous failure.

The remainder of this article is organized as follows. Section 2 presents the development of the proposed model, the derivation of the EM algorithm, and outlines uncertainty quantification. The finite sample performance of the proposed approach is evaluated through extensive numerical studies, the features and results of which are provided in Section 3. Further, code which implements the proposed methodology has been added to the existing R software package ICsurv and is freely available from the CRAN (i.e., http://cran.us.rproject.org/).

2 Model and Methodology

Let denote the failure time of interest. Under the PH model, the survival function can be generally written as

(1)

where is a

-dimensional vector of covariates,

is the corresponding vector of regression coefficients, and is the baseline survival function. Under the phenomenon of interest, there is a baseline risk (probability) of experiencing an instantaneous failure; i.e., , where is the baseline risk and is a -dimensional vector of zeros. Thus, under the PH assumptions, the probability of experiencing an instantaneous failure, given the covariate information contained in , can be ascertained from (1) as

Similarly, given that an instantaneous failure does not occur, it is assumed that the failure time conditionally follows the standard PH model; i.e.,

where and is the usual cumulative baseline hazard function. Note, in order for

to be a proper cumulative distribution function,

should be a monotone increasing function with

. Thus, through an application of the Law of Total Probability, one has that

for . Based on these assumptions, the cumulative distribution function of can be expressed as the following mixture model,

where, for reasons that will shortly become apparent, is re-parametrized as , for .

2.1 Observed data likelihood

In scenarios where interval-censored data arise, one has that the failure time () is not directly observed, but rather is known relative to two observation times, say ; i.e., one has that . In general, the four different outcomes considered here can be represented through the values of and ; i.e., an instantaneous failure (), is left-censored (), is interval-censored (), and is right-censored (). For notational convenience, let be an indicator denoting the event that is not an instantaneous failure, and , , and be censoring indicators denoting left-, interval-, and right-censoring, respectively; i.e., , , , and .

In order to derive the observed data likelihood, it is assumed throughout that the individuals are independent, and that conditional on the covariates, the failure time for an individual is independent of the observational process. This assumption is common among the survival literature; see, e.g., Zhang and Sun (2010), Liu and Shen (2009) and the references therein. The observed data collected on individuals is given by , which constitutes independent realization of . Thus, under the aforementioned assumptions, the observed data likelihood is given by

(2)

where represents the set of unknown parameters which are to be estimated.

2.2 Representations of

The unknown parameters in the observed likelihood involve the regression parameters , , and the cumulative baseline hazard function . Herein, two techniques for modeling the cumulative baseline hazard function are discussed. The first approach considers the use of a fully parametric model, which is known up to a set of unknown coefficients. For example, a linear, quadratic, or logarithmic parametric model can be specified by setting , , and , respectively. Note, all of these models obey the constraints placed on , as long as the , for . In general, a parametric form for the cumulative baseline hazard model can be specified as

(3)

where is a monotone increasing function, , and , for . Under these mild conditions, it is easily verified that inherits the same properties, and therefore adheres to the aforementioned constraints.

The second approach, which is inspired by the works of Wang et al (2016), Lin and Wang (2010), Wang and Dunson (2011), Cai et al (2011), McMahan et al (2013), and Wang et al (2015), views as an unknown function, and hence an infinite dimensional parameter. To reduce the dimensionality of the problem, the monotone splines of Ramsay (1988) are used to approximate . Structurally, this representation is identical to that of (3) with the exception that is a spline basis function and is an unknown spline coefficient, for . Again, it is required that , for all , to ensure that is monotone increasing function. Briefly, the spline basis functions are piecewise polynomial functions and are fully determined once a knot sequence and the degree are specified. The shape of the basis splines are predominantly determined by the placement of the knots while the degree controls the smoothness (Cai et al, 2011). For instance, specifying the degree to take values 1, 2 or 3 correspond to the use of linear, quadratic or cubic polynomials respectively. Given the knots and degree, the ( degree - 2) basis functions are fully determined. For further discussion on specifying the knots, as well as their placement, see Wang et al (2016), Ramsay (1988), and McMahan et al (2013).

2.3 Data Augmentation

Under either of the representations of proposed in Section 2.2, the unknown parameters in the observed data likelihood consist of , where . Since the observed data likelihood exists in closed-form, the maximum likelihood estimator MLE of could be obtained by directly maximizing (2) with respect to ; i.e., one could obtain , the MLE of , as . It is worthwhile to point out that the numerical process of directly maximizing (2), with respect to , is often unstable and rarely performs well (Wang et al, 2015).

To circumvent these numerical instabilities, an EM algorithm was derived for the purposes of identifying the MLE. This algorithm was developed based on a two-stage data augmentation process, where carefully structured latent Poisson random variables are introduced as missing data. The first stage relates both the instantaneous failure indicator and the censoring indicators to latent Poisson random variables; i.e., the , , and are introduced such that

subject to the following constraints: , , , and where , and . At this stage of the data augmentation, the conditional likelihood is

(4)

where and is the probability mass function of the random variable . In the second and final stage, the and are separately decomposed into the sum of independent latent Poisson random variables; i.e., and , where

At this stage of the augmented data likelihood is

(5)

where and . It is relatively easy to show that by integrating (5) over the latent random variables one will obtain the observed data likelihood depicted in (2).

2.4 EM algorithm

In general, the EM algorithm consists of two steps: the expectation step (E-step) and the maximization step (M-step). The E-step in this algorithm involves taking the expectation of with respect to all latent variables conditional on the current parameter value and the observed data . This results in obtaining the function, where . The M-step then finds . This process is repeated in turn until convergence of the algorithm is attained. In this particular setting, the E-step yields as

where is a function of but is free of . Notice that in we suppress, for notational convenience, the dependence of the expectations on the observed data and ; i.e., from henceforth it should be understood that .

An enticing feature, which makes the proposed approach computationally efficient, is that all of the expectations in can be expressed in closed-form, and moreover can be computed via simple matrix and vector operations. In particular, from (4) it can be ascertained that if and then conditionally, given and

, follows a zero-truncated Poisson distribution, and it follows a degenerate distribution at 0 for any other values of

and . Thus, the conditional expectation of , given and , can be expressed as

where . Through a similar set of arguments one can obtain the necessary conditional expectations of and as

respectively. Further, from (5) it can be ascertained that if and then conditionally, given and

, follows a binomial distribution with

being the number of trials and being the success probability, and it is follows a degenerate distribution at 0 for any other values of and . Thus, through an application of the Law of Iterated Expectations, the conditional expectation of , given and , can be expressed as

Through a similar set of arguments one can obtain the necessary conditional expectation of as

Note, in the expressions of the expectations of and the dependence on , , and are suppressed with the properties associated with these variables being inherited from the expectations associated with and , respectively.

The M-step of the algorithm then finds . To this end, consider the partial derivatives of with respect to which are given by

(6)
(7)
(8)

By setting (6) equal to zero and solving for , one can obtain

(9)

for . Similarly, by setting (7) equal to zero and solving for , one can obtain

(10)

Notice that, and depend on . Thus, one can obtain by setting (8) equal to zero and solving the resulting system of equations for , after replacing and by and , respectively. Note, the aforementioned system of equations can easily be solved using a standard Newton Raphson approach. Finally, one obtains and as and , respectively.

The proposed EM algorithm is now succinctly stated. First, initialize and repeat the following steps until converges.

  1. Calculate by solving the following system of equations

    where and are defined above.

  2. Calculate for and .

  3. Update .

At the point of convergence, define to be the proposed estimator , which is the MLE of .

2.5 Variance estimation

For the purposes of uncertainty quantification, several variance estimators were considered and evaluated; e.g., the inverse of the observed Fisher information matrix, the Huber sandwich estimator, and the outer product of gradients (OPG) estimator. After extensive numerical studies (results not shown), it was found that the most reliable variance estimator, among those considered, was that of the OPG estimator. In general, the OPG estimator is obtained as

where and is the log-likelihood contribution of the th individual. Using this estimator, one can conduct standard Wald type inference.

3 Simulation Study

In order to investigate the finite sample performance of the proposed methodology, the following simulation study was conducted. The true distribution of the failure times was specified to be

where (i.e., ), , , , and , where and take on values of -0.5 and 0.5 resulting 4 regression parameter configurations. Additionally, these studies consider two cumulative baseline hazard functions; i.e., a logarithmic and a linear

. These choices were made so that the hazard functions behave similarly but have different shapes. In total, these specifications lead to eight separate data generating models for the failure times. Two generating processes were considered for the observation times: an exponential distribution with a mean of 10 and a discrete uniform over the interval

. In both cases, a single observation time, , was generated for each failure time which was not instantaneous (i.e., ), and the intervals were created such that () and () if was smaller (greater) than . A few comments are warranted on the selection of the observation processes. First, the former process attempts to match the baseline characteristics of the failure time distribution. Second, note that the specification of the two observation processes result in case-1 interval-censored (i.e., current status) data. This was done for a primary reason that the data of this nature possess less information when compared to general interval-censored data, and thus if the proposed approach works well in this setting it could be inferred that it should perform better in the case of interval-censored data. In total, these data generating steps lead to sixteen generating mechanisms, and each were used to create 500 independent data sets consisting of observations.

In order to examine the performance of the proposed approach across a broad spectrum of characteristics, several different models were used to analyze each data set. First, following the development presented in Section 2.2, three different parametric forms were considered for the cumulative baseline hazard function: , , and , which are henceforth referred to as models M1, M2, and M3, respectively. Note, these specifications allow one to examine the performance of the proposed approach when the cumulative baseline hazard function is correctly specified (e.g., M2 when ), over specified (e.g., M3 when ), and misspecified (e.g., M1 when ). Further, for each data set a model (M4) was fit using the monotone spline representation for the cumulative baseline hazard function developed in Section 2.2. In order to specify the basis functions, the degree was specified to be 2, in order to provide adequate smoothness, and one interior knot was placed at the median of the observed finite nonzero interval end points, with boundary knots being placed at the minimum and maximum of the same. The EM algorithm derived in Section 2.4 was used to complete model fitting for M1-M4. The starting value for all implementations was set to be , where () is a -dimensional vector of zeros (ones). Convergence was declared when the maximum absolute differences between the parameter updates were less than the specified tolerance of .

Table 1 summarizes the estimates of the regression coefficients and the baseline instantaneous failure probability for all considered simulation configurations and models, when the observation times were drawn from a exponential distribution. This summary includes the empirical bias, the sample standard deviation of the 500 point estimates, the average of the 500 standard error estimates, and the empirical coverage probabilities associated with 95% Wald confidence intervals. Table 2 provides the analogous results for the case in which the observation times are sampled from a discrete uniform distribution. From these results, one will first note that across all considered simulation settings the proposed approach performs very well for M4 and the correct parametric model (i.e., M1 when

and M2 when ); i.e., the parameter estimates exhibit very little bias, the sample standard deviation of the 500 point estimates are in agreement with the average of the standard error estimates, and the empirical coverage probabilities are at their nominal level. In summary, these findings tend to suggest that the proposed methodology can be used to reliably estimate the covariate effects, the instantaneous failure probability, and quantify the uncertainty in the same. Additionally, these findings generally continue to persist for the case in which the parametric model is over specified (e.g., M3 when ), with the resulting estimates in some cases exhibiting a slightly larger bias and a bit more variability, as one would expect. Further, from these results one will also note that when the parametric model is misspecified (e.g., M2 and M3 when ) the estimates tend to exhibit more bias and less reliable inference, which is expected under the misspecification of the cumulative baseline hazard function. Finally, the estimates obtained under M4 (i.e., the model which makes use of the monotone splines) exhibit little if any difference when compared to the estimates resulting from correct parametric model. In totality, from these findings, it is conjectured that the approach which makes use of the spline representation to approximate the unknown cumulative baseline hazard functions (i.e., M4) would likely be preferable, since it avoids the potential of model misspecification and it obtains estimators of the unknown parameters, as well as their standard errors, that are equivalent to those estimators obtained under the true parametric model, the form of which is generally not known.

True
M1(True) M2(Misspecified) M3(Misspecified) M4(Spline)
Parameter Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95
-0.03 0.15 0.15 0.94 -0.06 0.17 0.15 0.89 -0.06 0.17 0.15 0.90 -0.04 0.16 0.16 0.93
-0.01 0.29 0.28 0.94 -0.05 0.33 0.27 0.89 -0.05 0.33 0.28 0.90 -0.02 0.30 0.29 0.93
0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.94 0.00 0.06 0.06 0.94 0.00 0.06 0.06 0.95
-0.03 0.15 0.14 0.94 -0.07 0.17 0.14 0.87 -0.07 0.17 0.14 0.88 -0.04 0.15 0.15 0.94
0.01 0.26 0.26 0.94 0.06 0.30 0.25 0.89 0.06 0.30 0.26 0.89 0.03 0.27 0.27 0.94
0.00 0.06 0.06 0.94 -0.01 0.06 0.05 0.92 -0.01 0.06 0.06 0.93 -0.01 0.06 0.06 0.94
0.01 0.15 0.15 0.95 0.05 0.16 0.15 0.91 0.05 0.16 0.15 0.92 0.02 0.15 0.15 0.95
0.01 0.28 0.28 0.94 -0.03 0.32 0.27 0.90 -0.03 0.32 0.27 0.91 0.01 0.29 0.29 0.94
0.00 0.05 0.06 0.97 0.00 0.06 0.06 0.96 0.00 0.06 0.06 0.96 0.00 0.05 0.06 0.96
0.02 0.14 0.14 0.95 0.07 0.15 0.14 0.89 0.07 0.15 0.14 0.90 0.03 0.14 0.15 0.95
0.02 0.28 0.26 0.94 0.06 0.32 0.25 0.86 0.06 0.32 0.26 0.87 0.03 0.28 0.27 0.93
0.00 0.06 0.06 0.93 -0.01 0.06 0.05 0.90 -0.01 0.06 0.06 0.91 0.00 0.06 0.06 0.93
True
M1(Misspecified) M2(True) M3(Over specified) M4(Spline)
Parameter Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95
0.01 0.15 0.16 0.96 -0.03 0.16 0.16 0.94 -0.05 0.17 0.16 0.94 -0.04 0.16 0.17 0.94
0.04 0.28 0.29 0.95 0.00 0.30 0.30 0.95 -0.03 0.32 0.30 0.94 -0.02 0.32 0.31 0.95
0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95
0.02 0.15 0.15 0.94 -0.02 0.15 0.15 0.94 -0.04 0.16 0.15 0.93 -0.04 0.16 0.16 0.94
-0.02 0.26 0.27 0.96 0.02 0.28 0.27 0.93 0.03 0.29 0.28 0.93 0.03 0.29 0.29 0.95
0.01 0.06 0.06 0.94 0.00 0.06 0.06 0.94 -0.01 0.06 0.06 0.94 -0.01 0.06 0.06 0.94
-0.02 0.14 0.16 0.96 0.02 0.15 0.16 0.96 0.03 0.16 0.16 0.95 0.03 0.16 0.17 0.95
0.06 0.29 0.29 0.94 0.02 0.30 0.29 0.93 0.01 0.31 0.30 0.93 0.01 0.31 0.31 0.94
0.00 0.05 0.06 0.97 0.00 0.05 0.06 0.96 0.00 0.05 0.06 0.96 0.00 0.06 0.06 0.97
-0.02 0.14 0.15 0.95 0.02 0.15 0.15 0.95 0.04 0.15 0.15 0.95 0.04 0.15 0.16 0.95
-0.02 0.27 0.27 0.94 0.02 0.29 0.27 0.94 0.04 0.30 0.28 0.93 0.04 0.30 0.29 0.93
0.01 0.06 0.06 0.96 0.00 0.06 0.06 0.95 -0.01 0.06 0.06 0.94 -0.01 0.06 0.06 0.95
Table 1: Simulation results summarizing the estimates of the regression coefficients and the baseline instantaneous failure probability obtained from M1-M4 across all simulation settings, when observation times were sampled from an exponential distribution. This summary include the average of the 500 point estimates minus the true value (Bias), the sample standard deviation of the 500 point estimates (SD), the average of the estimated standard errors (ESE), and empirical coverage probabilities associated with 95% Wald confidence intervals (CP95). Note, when then M1 is the true parametric model and when then M2 is the true parametric model.
True
M1(True) M2(Misspecified) M3(Misspecified) M4(Spline)
Parameter Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95
-0.02 0.14 0.15 0.95 -0.03 0.15 0.15 0.92 -0.03 0.15 0.15 0.93 -0.03 0.15 0.15 0.94
0.00 0.27 0.27 0.96 -0.02 0.29 0.27 0.94 -0.02 0.29 0.27 0.94 -0.02 0.28 0.28 0.96
-0.01 0.06 0.06 0.95 -0.01 0.06 0.06 0.93 -0.01 0.06 0.06 0.94 -0.01 0.06 0.06 0.94
-0.03 0.14 0.14 0.94 -0.05 0.15 0.14 0.92 -0.05 0.15 0.14 0.92 -0.04 0.15 0.14 0.94
0.00 0.26 0.25 0.94 0.03 0.28 0.25 0.92 0.03 0.28 0.25 0.93 0.02 0.27 0.26 0.93
0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.94 0.00 0.06 0.06 0.94 0.00 0.06 0.06 0.95
0.02 0.16 0.15 0.94 0.03 0.16 0.15 0.94 0.03 0.16 0.15 0.94 0.03 0.16 0.15 0.94
0.00 0.28 0.27 0.95 -0.01 0.29 0.27 0.93 -0.01 0.29 0.27 0.93 -0.01 0.29 0.28 0.94
0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95
0.01 0.14 0.14 0.95 0.04 0.15 0.14 0.92 0.04 0.15 0.14 0.93 0.03 0.14 0.14 0.94
0.03 0.27 0.25 0.94 0.05 0.29 0.25 0.91 0.05 0.29 0.25 0.91 0.04 0.28 0.26 0.92
0.00 0.06 0.06 0.95 -0.01 0.06 0.05 0.94 -0.01 0.06 0.06 0.94 0.00 0.06 0.06 0.94
True
M1(Misspecified) M2(True) M3(Over specified) M4(Spline)
Parameter Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95 Bias SD ESE CP95
0.00 0.15 0.15 0.95 -0.02 0.15 0.15 0.93 -0.03 0.16 0.16 0.93 -0.03 0.16 0.16 0.94
0.02 0.27 0.28 0.96 0.00 0.28 0.28 0.96 -0.01 0.29 0.29 0.96 -0.01 0.29 0.29 0.96
-0.01 0.06 0.06 0.95 -0.01 0.06 0.06 0.95 -0.01 0.06 0.06 0.94 -0.01 0.06 0.06 0.95
0.00 0.14 0.14 0.95 -0.02 0.14 0.14 0.94 -0.03 0.15 0.15 0.94 -0.03 0.15 0.15 0.93
-0.03 0.25 0.26 0.95 0.00 0.27 0.26 0.94 0.01 0.27 0.27 0.94 0.01 0.27 0.27 0.94
0.01 0.06 0.06 0.96 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95
0.00 0.15 0.15 0.94 0.02 0.16 0.15 0.94 0.02 0.16 0.16 0.94 0.02 0.16 0.16 0.94
0.01 0.28 0.28 0.95 -0.01 0.29 0.28 0.95 -0.01 0.30 0.28 0.94 -0.01 0.30 0.29 0.94
0.00 0.06 0.06 0.96 0.00 0.06 0.06 0.96 0.00 0.06 0.06 0.96 0.00 0.06 0.06 0.96
-0.02 0.13 0.14 0.96 0.01 0.14 0.14 0.95 0.02 0.14 0.15 0.95 0.02 0.14 0.15 0.96
0.00 0.26 0.26 0.96 0.03 0.27 0.26 0.95 0.04 0.28 0.27 0.95 0.04 0.28 0.27 0.95
0.00 0.06 0.06 0.96 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.95 0.00 0.06 0.06 0.96
Table 2: Simulation results summarizing the estimates of the regression coefficients and the baseline instantaneous failure probability obtained from M1-M4 across all simulation settings, when observation times were sampled from a discrete uniform distribution. This summary include the average of the 500 point estimates minus the true value (Bias), the sample standard deviation of the 500 point estimates (SD), the average of the estimated standard errors (ESE), and empirical coverage probabilities associated with 95% Wald confidence intervals (CP95). Note, when then M1 is the true parametric model and when then M2 is the true parametric model.

Figure 1 summarizes the estimates of the baseline survival function (i.e., ) obtained from M1-M4 across all considered regression parameter configurations when and the observation times were sampled from the exponential distribution. Figures 2-4 summarizes the analogous results for the other simulation configurations. In particular, these figures present the true baseline survival functions, the average of the point-wise estimates, and the point-wise 2.5th and 97.5th percentiles of the estimates. These figures reinforce the main findings discussed above; i.e., M4 and the correctly specified parametric model again provide reliable estimates of the baseline survival function, and hence the cumulative baseline hazard function, across all simulation configurations. Similarly the over specified model also provides reliable estimates, but the same can not be said for the misspecified models. It is worthwhile to point out that the baseline survival curves do not extend to unity as time goes toward the origin, this is due to the fact that the baseline instantaneous failure probability is . Again, these findings support the recommendation that the spline based representation of the cumulative baseline hazard function should be adopted in lieu of a parametric model, thus obviating the possible ramifications associated with misspecification.

Figure 1: Simulation results summarizing the estimates of the baseline survival function obtained by the proposed approach under M1 (first row), M2 (second row), M3 (third row), and M4 (fourth row) when true

and observation times were drawn from exponential distribution. The solid line provides the true value, dashed line represents the average estimated value, and the dotted lines indicate the 2.5% and 97.5% quantiles, of the point-wise estimates.

Figure 2: Simulation results summarizing the estimates of the baseline survival function obtained by the proposed approach under M1 (first row), M2 (second row), M3 (third row), and M4 (fourth row) when true and observation times were drawn from exponential distribution. The solid line provides the true value, dashed line represents the average estimated value, and the dotted lines indicate the 2.5% and 97.5% quantiles, of the point-wise estimates.
Figure 3: Simulation results summarizing the estimates of the baseline survival function obtained by the proposed approach under M1 (first row), M2 (second row), M3 (third row), and M4 (fourth row) when true and observation times were drawn from discrete uniform distribution over [1, 17]. The solid line provides the true value, dashed line represents the average estimated value, and the dotted lines indicate the 2.5% and 97.5% quantiles, of the point-wise estimates.
Figure 4: Simulation results summarizing the estimates of the baseline survival function obtained by the proposed approach under M1 (first row), M2 (second row), M3 (third row), and M4 (fourth row) when true and observation times were drawn from discrete uniform distribution over [1, 17]. The solid line provides the true value, dashed line represents the average estimated value, and the dotted lines indicate the 2.5% and 97.5% quantiles, of the point-wise estimates.

In summary, this simulation study illustrates that proposed methodology can be used to analyze current status data which is subject to instantaneous failures, and moreover that the monotone spline approach discussed in Section 2.2 should be adopted for approximating the unknown cumulative baseline hazard function. A few additional details about the numerics of the approach follow. First, the average time required to complete model fitting was approximately one second, supporting the claim that the proposed approach is computationally efficient. Lastly, for the reasons of complete transparency, for a single data set, among 8000, the OPG estimator under M4 resulted in a singular matrix, which prevented standard error estimation estimation, and this issue was resolved by slightly shifting the interior knot.

4 Discussion

This work proposed a new model, under the PH assumption, which can be used to analyze interval-censored data subject to instantaneous failures. Through the model development, two techniques for approximating the unknown cumulative baseline hazard function are illustrated. To complete model fitting, a novel EM algorithm is developed through a two-stage data augmentation process. The resulting algorithm is easy to implement and is computationally efficient. These features are likely attributable to the fact that the carefully structured data augmentation steps lead to closed-form expressions for all necessary quantities in the E-step of the algorithm. Moreover, in the M-step the regression coefficients are updated through solving a low-dimensional system of equations, while all other unknown parameters are updated in closed-form. The finite sample performance of the proposed approach was exhibited through an extensive numerical study. This study suggests that the use of the monotone spline representation of the cumulative baseline hazard function would in general be preferable, in order to circumvent the possibility of model misspecification. To further disseminate this work, code, written in R, has been prepared and is available upon request from the corresponding author.

5 Acknowledgments

This research was partially supported by National Institutes of Health grant AI121351.

References

  • Betensky et al (2002) Betensky RA, Lindsey JC, Ryan LM, Wand M (2002) A local likelihood proportional hazards model for interval censored data. Statistics in Medicine 21(2):263–275, DOI 10.1002/sim.993
  • Cai et al (2011) Cai B, Lin X, Wang L (2011) Bayesian proportional hazards model for current status data with monotone splines. Computational Statistics & Data Analysis 55(9):2644–2651, DOI 10.1016/j.csda.2011.03.013
  • Cai and Betensky (2003) Cai T, Betensky RA (2003) Hazard regression for interval-censored data with penalized spline. Biometrics 59(3):570–579, DOI 10.1111/1541-0420.00067
  • Chen et al (2015) Chen CM, Lai CC, Cheng KC, Weng SF, Liu WL, Shen HN (2015) Effect of end-stage renal disease on long-term survival after a first-ever mechanical ventilation: a population-based study. Critical Care 19(1):354, DOI 10.1186/s13054-015-1071-x
  • Cox et al (1972) Cox R, et al (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society 34:187–220
  • Dorey et al (1993)

    Dorey FJ, Little RJ, Schenker N (1993) Multiple imputation for threshold-crossing data with interval censoring. Statistics in medicine 12(17):1589–1603

  • Finkelstein (1986) Finkelstein DM (1986) A proportional hazards model for interval-censored failure time data. Biometrics pp 845–854
  • Goetghebeur and Ryan (2000)

    Goetghebeur E, Ryan L (2000) Semiparametric regression analysis of interval-censored data. Biometrics 56(4):1139–1144, DOI 

    10.1111/j.0006-341X.2000.01139.x
  • Goggins et al (1998)

    Goggins WB, Finkelstein DM, Schoenfeld DA, Zaslavsky AM (1998) A markov chain monte carlo em algorithm for analyzing interval-censored data under the cox proportional hazards model. Biometrics pp 1498–1507

  • Groeneboom and Wellner (1992) Groeneboom P, Wellner JA (1992) Information bounds and nonparametric maximum likelihood estimation, vol 19. Springer Science & Business Media
  • Kale and Muralidharan (2002) Kale B, Muralidharan K (2002) Optimal estimating equations in mixture distributions accommodating instantaneous or early failures. Quality control and applied statistics 47(6):677–680
  • Knopik (2011) Knopik L (2011) Model for instantaneous failures. Scientific Problems of Machines Operation and Maintenance 46(2):37–45
  • Lamborn et al (2008) Lamborn KR, Yung WA, Chang SM, Wen PY, Cloughesy TF, DeAngelis LM, Robins HI, Lieberman FS, Fine HA, Fink KL, et al (2008) Progression-free survival: an important end point in evaluating therapy for recurrent high-grade gliomas. Neuro-oncology 10(2):162–170, DOI 10.1215/15228517-2007-062
  • Li and Ma (2013) Li J, Ma S (2013) Survival analysis in medicine and genetics. CRC Press
  • Liem et al (1997) Liem MS, van der Graaf Y, van Steensel CJ, Boelhouwer RU, Clevers GJ, Meijer WS, Stassen LP, Vente JP, Weidema WF, Schrijvers AJ, et al (1997) Comparison of conventional anterior surgery and laparoscopic surgery for inguinal-hernia repair. New England Journal of Medicine 336(22):1541–1547, DOI 10.1056/NEJM199705293362201
  • Lin and Wang (2010) Lin X, Wang L (2010) A semiparametric probit model for case 2 interval-censored failure time data. Statistics in medicine 29(9):972–981, DOI 10.1002/sim.3832
  • Liu et al (2016) Liu C, Yang W, Devidas M, Cheng C, Pei D, Smith C, Carroll WL, Raetz EA, Bowman WP, Larsen EC, et al (2016) Clinical and genetic risk factors for acute pancreatitis in patients with acute lymphoblastic leukemia. Journal of Clinical Oncology 34(18):2133–2140, DOI 10.1200/JCO.2015.64.5812
  • Liu and Shen (2009) Liu H, Shen Y (2009) A semiparametric regression cure model for interval-censored data. Journal of the American Statistical Association 104(487):1168–1178, DOI 10.1198/jasa.2009.tm07494
  • Matsuzaki et al (2005) Matsuzaki A, Nagatoshi Y, Inada H, Nakayama H, Yanai F, Ayukawa H, Kawakami K, Moritake H, Suminoe A, Okamura J (2005) Prognostic factors for relapsed childhood acute lymphoblastic leukemia: Impact of allogeneic stem cell transplantation?a report from the kyushu-yamaguchi children’s cancer study group. Pediatric blood & cancer 45(2):111–120, DOI 10.1002/pbc.20363
  • McMahan et al (2013) McMahan CS, Wang L, Tebbs JM (2013) Regression analysis for current status data using the em algorithm. Statistics in medicine 32(25):4452–4466, DOI 10.1002/sim.5863
  • Muralidharan (1999) Muralidharan K (1999) Tests for the mixing proportion in the mixture of a degene-rate and exponential distribution. Journal Indian Statistical Associations 37:105–119
  • Muralidharan and Lathika (2006) Muralidharan K, Lathika P (2006) Analysis of instantaneous and early failures in weibull distribution. Metrika 64(3):305–316, DOI 10.1007/s00184-006-0050-2
  • Murthy et al (2004) Murthy DP, Xie M, Jiang R (2004) Weibull models, vol 505. John Wiley & Sons
  • Odell et al (1992) Odell P, Anderson K, Agostino R (1992) Maximum likelihood estimation for interval-censored data using a weibull-based accelerated failure time model. Biometrics pp 951–959, DOI 10.2307/2532360
  • Pan (1999) Pan W (1999) Extending the iterative convex minorant algorithm to the cox model for interval-censored data. Journal of Computational and Graphical Statistics 8(1):109–120
  • Pan (2000) Pan W (2000) A multiple imputation approach to cox regression with interval-censored data. Biometrics 56(1):199–203, DOI 10.1111/j.0006-341X.2000.00199.x
  • Pham and Lai (2007) Pham H, Lai CD (2007) On recent generalizations of the weibull distribution. IEEE transactions on reliability 56(3):454–458, DOI 10.1109/TR.2007.903352
  • Ramsay (1988) Ramsay JO (1988) Monotone regression splines in action. Statistical science pp 425–441, DOI 10.1214/ss/1177012761
  • Satten (1996) Satten GA (1996) Rank-based inference in the proportional hazards model for interval censored data. Biometrika 83(2):355–370, DOI 10.1093/biomet/83.2.355
  • Sun (2007) Sun J (2007) The statistical analysis of interval-censored failure time data. Springer Science & Business Media
  • Wang and Dunson (2011)

    Wang L, Dunson DB (2011) Semiparametric bayes’ proportional odds models for current status data with underreporting. Biometrics 67(3):1111–1118, DOI 

    10.1111/j.1541-0420.2010.01532.x
  • Wang et al (2016) Wang L, McMahan CS, Hudgens MG, Qureshi ZP (2016) A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics 72(1):222–231, DOI 10.1111/biom.12389
  • Wang et al (2015) Wang N, Wang L, McMahan CS (2015) Regression analysis of bivariate current status data under the gamma-frailty proportional hazards model using the em algorithm. Computational Statistics & Data Analysis 83:140–150, DOI 10.1016/j.csda.2014.10.013
  • Zhang et al (2010) Zhang Y, Hua L, Huang J (2010) A spline-based semiparametric maximum likelihood estimation method for the cox model with interval-censored data. Scandinavian Journal of Statistics 37(2):338–354, DOI 10.1111/j.1467-9469.2009.00680.x
  • Zhang and Sun (2010) Zhang Z, Sun J (2010) Interval censoring. Statistical Methods in Medical Research 19(1):53–70, DOI 10.1177/0962280209105023