Regression modelling of interval censored data based on the adaptive ridge procedure

by   Olivier Bouaziz, et al.

We consider the Cox model with piecewise constant baseline hazard to deal with a mixed case of left-censored, interval-censored and right-censored data. Estimation is carried out with the EM algorithm by treating the true event times as unobserved variables. This estimation procedure is shown to produce a block diagonal Hessian matrix of the baseline parameters. Taking advantage of this interesting feature of the estimation procedure a L0 penalised likelihood method is implemented in order to automatically determine the number and locations of the cuts of the baseline hazard. The method is directly extended to the inclusion of exact observations and to a cure fraction. Statistical inference of the model parameters is derived from likelihood theory. Through simulation studies, the penalisation technique is shown to provide a good fit of the baseline hazard and precise estimations of the resulting regression parameters. The method is illustrated on a dental dataset where the effect of covariates on the risk of ankylosis for replanted teeth is assessed.


page 1

page 2

page 3

page 4


An efficient penalized estimation approach for a semi-parametric linear transformation model with interval-censored data

We consider efficient estimation of flexible transformation models with ...

A proportional hazards model for interval-censored data subject to instantaneous failures

The proportional hazards (PH) model is arguably one of the most popular ...

A proportional hazards model for interval-censored subject to instantaneous failures

The proportional hazards (PH) model is arguably one of the most popular ...

Proportional hazards model with partly interval censoring and its penalized likelihood estimation

This paper considers the problem of semi-parametric proportional hazards...

Nonparametric Bayesian estimation of a concave distribution function with mixed interval censored data

Assume we observe a finite number of inspection times together with info...

Analysis of Networks via the Sparse β-Model

Data in the form of networks are increasingly available in a variety of ...

1 Introduction

Interval censored data arise in situations where the event of interest is only known to have occurred between two observation times. These types of data are commonly encountered when the patients are intermittently followed up at medical examinations. This is the case for instance in AIDS studies, when HIV infection onset is determined by periodic testing, or in oncology where the time-to-tumour progression is assessed by measuring the tumour size at periodic testing. Dental dataset are another examples which are usually interval-censored because the teeth status of the patients are only examined at visits to the dentist. While interval-censored data are ubiquitous in medical applications it is still a common practice to replace the observation times with their midpoints or endpoints and to consider these data as exact. This allows to analyse the data using standard survival approach (with exact and right-censored data) but may results in a large bias of the estimators.

In the context of interval-censored data, [turnbull1976] introduced an iterative algorithm for the non-parametric estimation of the survival function. As a different estimation method, the iterative convex minorant was proposed by [groeneboom92] and [jongbloed1998iterative]. In [groeneboom92], the authors derived the slow rate of convergence of order for the non-parametric survival estimator. Moreover, the obtained law is not gaussian and cannot be explicitly computed. Many methods were also developed in a regression setting. In particular, the Cox model with non-parametric baseline was studied in [huang1995efficient]. The authors derived a convergence rate for the regression parameter with a gaussian limit but the problem of estimation and inference of the baseline survival function pertains in this regression context: the baseline survival function has the slow rate of convergence and even more problematic, the asymptotic distribution of this function could not be derived. The same conclusions were observed in [boruvka2015cox] where the authors use the more general Cox-Aalen model with non-parametric baseline. As a consequence, alternatives to the non-parametric baseline have been introduced. In [lindsey1998study] and [sun07] parametric baselines such as Weibull or piecewise constant are introduced. In that case, the convergence rate of the global parameters is of order and the asymptotic distribution is gaussian (see [sun07]). In [betensky2002local] a local likelihood is implemented which results in a smooth estimation of the baseline hazard using a kernel function. However, asymptotic properties of the estimators were not derived in their work and the estimators performance depend on the choice of the kernel bandwidth. In [wang2016flexible], monotone B-splines are implemented in order to estimate the cumulative baseline hazard. The authors introduce a two stage data augmentation which allows them to use the Expectation Maximisation algorithm [EM, see dempster1977maximum] in order to perform estimation. Asymptotics with rate of convergence of the estimators are derived. However, the number and location of the splines knots are pre-determined by the user and the estimators performance depend on the choice of these tuning parameters.

In this work, we study the Cox model with piecewise constant baseline hazard. Treating the unobserved true event times as missing variables we use the EM algorithm to perform estimation. As a result, the Hessian of the log-likelihood to be maximised is seen to be diagonal. This is a remarkable feature of the method that easily allows to perform estimation with the piecewise constant baseline using arbitrarily large set of cuts. In contrast, this model had been already introduced in [carstensen1996regression] and [lindsey1998study] but maximisation of the model parameters was achieved using the observed likelihood which resulted in a full rank Hessian matrix. In [carstensen1996regression] for example, the authors warn against computational issues which may force the user to reduce the number of cuts by combining adjacent intervals. Using the EM algorithm to perform estimation in the piecewise constant hazard model is new to our knowledge and easy to implement. Also, all the quantities involved in the E-step can be explicitly computed in our method, contrary to previous works (see [betensky2002local] for example) which require to approximate integrals. In comparison with [wang2016flexible] the E-step is more natural and directly applicable using the complete likelihood. Moreover, taking advantage of the sparse structure of the Hessian matrix, our method can be combined with a L penalty designed to detect the location and number of cuts. This is performed through the adaptive ridge procedure, a regularisation method that was introduced in [NuelFrommlet], [rippe2012visualization] and then applied in a survival context (without covariates) in [bouaziz17]. This penalisation technique results in a flexible method where the cuts and locations of the piecewise constant baseline are automatically chosen from the data, thus providing a good compromise between purely non-parametric and parametric baseline functions. This is in contrast with existing techniques such as in [wang2016flexible] where the location and number of knots of splines basis are fixed by the user. Finally we also emphasise the advantage of the L method in terms of interpretability: by detecting the relevant set of cuts of the baseline the method highlights the different regions of time where the risk of failure varies. This can be of great interest in medical applications in order for the clinicians to precisely detect time intervals of greater risks of failure for example.

Another advantage of using the EM algorithm is to provide direct extensions of the Cox model. In this work we also consider the inclusion of exact data in the estimation method. This mixed case of exact and interval-censored data is usually not easy to analyse as standard methods for interval-censoring do not directly extend to exact data. However, using our method, inclusion of exact data is straightforward through the E-step and the likelihood can be decomposed into the contribution of exact and interval-censored observations. Another extension that is developed in this work is the inclusion of a fraction of non-susceptible patients. This situation is modelled using the cure model of [sy00] and [peng2000nonparametric]

, with a logit link for the probability of being cured. Little attention has been paid to this model in the case of interval-censored data. In 


the authors consider a partially linear transformation model where the baseline is modelled using spline basis but the number and location of knots are chosen in an ad-hoc manner. In 

[liu2009semiparametric] a different cure model was introduced where the marginal survival function (without conditioning on the susceptible group) is modelled. However, the asymptotic distribution of the estimated parameters were not derived under this model. With our method, estimation in the cure Cox model is straightforward. The E-step results in a weighted log-likelihood with the weights corresponding to the probability of being cured such that our estimation method readily extends to the cure model.

In Section 2 the piecewise constant hazard model is introduced. The estimation method based on the EM algorithm is presented in Section 3 for interval censored data and fixed cuts of the hazard. Estimation in the non-parametric case, in the regression model and extensions for exact data and the cure model are also developed in this section. Then, the L penalised likelihood that allows to select the location and number of cuts from the data is presented in Section 4

. The construction of confidence intervals and statistical tests is discussed in Section 

5. In Section 6, an extensive simulation study is presented where our adaptive ridge estimator is compared with the midpoint estimator and the ICsurv estimator from [wang2016flexible]. Finally, a dental dataset on complications for replanted teeth is analysed in Section 7.

2 A piecewise constant hazard model for interval censored data


denote the time to occurrence of the event of interest. We consider a situation where all individuals are subject to interval censoring defined by the random variables

such that and are observed and . The situation and corresponds to left-censoring, corresponds to strictly interval censoring and to right censoring. The special case

is also allowed which corresponds to exact observations of the time of interest. We introduce a covariate vector

of dimension and for convenience we also introduce which equals if an individual is right censored and if he/she is exactly observed, left censored or interval censored. The variable is considered continuous and we assume independent censoring in the following way (see for instance [zhang2005regression]):

This supposes that the variables do not convey additional information on the law of apart from assuming to be bracketed by and . Finally, we assume non-informative censoring in the sense that the distribution of and does not depend on the model parameters involved in the distribution of .

We consider the following Cox proportional hazard model for the time variable :


where is an unknown row parameter vector of dimension . We model the baseline function through a piecewise constant hazard. Let represent cuts, with the convention that and . Let , with denoting the indicator function. We suppose that

for . Under this model, note that the survival and density functions are respectively equal to:

We set the model parameter we aim to estimate. In the following, we will also study the so-called nonparametric situation, when no covariates are available, which is encompassed in our modelling approach as the special case where . In this context the hazard function is simply equal to which is assumed to be piecewise constant and the model parameter is . The observed data consist of in the nonparametric context and of in the regression context, while is considered as incompletely observed. In the latter context, we introduce the notation .

3 Estimation procedure with fixed cuts

For the sake of simplicity, we first consider the scenario when no exact data are observed (which means there only are left, interval and right censored data). The estimation method is based on the EM algorithm and is presented in Section 3.1 in the general regression context since the nonparametric context can be easily derived by setting . The nonparametric context is discussed in Section 3.2, the implementation of the M step for the regression context is presented in Section 3.3 and the method when exact observations are also available is developped in Section 3.4. Finally, the inclusion of a fraction of non-susceptible individuals is studied in Section 3.5.

3.1 The EM algorithm for left, right and interval censored observations

The observed likelihood is defined with respect to the observed data by:

The Maximum Likelihood Estimator (MLE) can be derived from maximisation of this observed log-likelihood with respect to the model parameters, as in [carstensen1996regression] for instance. The obtained parameter estimates are not explicit but a Newton-Raphson algorithm can be easily implemented. However, in this optimisation problem, the block of the Hessian matrix corresponding of the baseline coefficients will be of full rank and can lead to intractable solutions if the number of cuts is large. An alternative method to compute the MLE is therefore to use the EM algorithm based on the complete likelihood of the unobserved true event times. This algorithm will result into a diagonal block matrix of the baseline coefficients.

The EM algorithm is based on the complete likelihood, defined by:

Denote by the current parameter value. The E-step takes the expectation of the complete log-likelihood with respect to the ’s, given the ’s, ’s, ’s, ’s and . Write

where represents the conditional density of given data and , evaluated at . Under the independent censoring assumption,

The E-step consists of computing the quantity . We have:

where is the indicator and , respectively denote , . Now, for , introduce the quantities




Finally, the M-step corresponds of maximising, with respect to , the quantity

3.2 Estimation in the absence of covariates

In the absence of covariates, the previous results hold with , and the model parameters we aim to estimate are just . The objective function in the M-step can be defined with respect to the sufficient statistics and :

The derivatives of with respect to , , equal

As a consequence, in the absence of covariates, one gets the explicit parameters estimators:

at each step of the EM algorithm. At convergence, this provides an estimator of the hazard function from which quantities of interest, such as the survival function, can be easily derived.

3.3 Estimation in the general regression framework

In the regression framework, each step of the EM algorithm is solved through a Newton-Raphson procedure. The first and second order derivatives of with respect to and are equal to


The block matrix of the Hessian corresponding to the second order derivatives with respect to the ’s is diagonal while the three other blocks are of full rank. Inversion of the Hessian matrix is then achieved using the Schurr complement which takes advantage of this sparse structure of the Hessian. When considering a large number of cuts, that is , the total complexity of the inversion of the Hessian is of order . The exact formula of the Schurr complement is given in the Appendix section.

3.4 Inclusion of exact observations

It is straightforward to deal with exact observations since they can be directly included in the EM algorithm. For an exact observation ,

with and . Note that this corresponds to the classical contribution of an exact observation to the log-likelihood in the standard Poisson regression for right censored observations (see for instance [aalen_borgan_book]). As a result, can be decomposed as

The estimation method follows as previously. In particular, in the absence of covariates, the explicit parameters estimator of are equal to:

where and .

In the regression setting, maximisation over the and parameters is performed through the Newton-Raphson algorithm as before. Full expressions of the score vector and Hessian matrix are given in the Appendix section. The Schurr complement is used again to inverse the Hessian matrix (see the Appendix section).

3.5 Inclusion of a fraction of non-susceptibles (cure fraction)

Taking into account non-susceptible individuals is possible using the cure model from [sy00]. This is achieved by modelling the latent status (susceptible/non-susceptible) of the individuals through a variable which equals for patients that will eventually experience the event and for patients that will never experience the event. Since the estimation method uses the EM algorithm, this latent variable can be easily dealt with through the E-step.

The probability of being susceptible is equal to . For a right censored individual, is not observed. The marginal survival function of is for , where is the survival function of the susceptibles. Note that as . We assume that censoring is independent of . See [sy00] for more details about the cure model. The proportional hazard Cox model for the susceptibles is defined as


The cure model specifies the hazard, conditional on and , to be equal to . The baseline function is assumed to be piecewise constant as in Section 2 and the conditional density and survival functions of the susceptibles are respectively noted and . If one wants to model the effect of covariates on the probability of being cured, a logistic link can be used:


where is a covariate vector including the intercept and is a row parameter vector, both of dimension . The observed data then consist of while and are respectively incompletely observed and non observed data. The model parameter is in the completely nonparametric context (no covariates nor ), if only the covariate is used or in the full regression context (with covariates and ). In the later case, we introduce the notation . The other situations are encompassed in our modelling approach by setting and/or . Note that our cure model is identifiable and does not require additional constraints such as in [sy00] where the authors had to impose to be null for greater than the last event time in the context of exact and right-censored data.

Under the cure model, the observed likelihood is now defined as

and the complete likelihood is defined as

The E step is performed as follows. Let , we have:

The E-step consists of computing the function . In the case of interval-censored and exact observations, we have:

where , are defined as in Equations (3.1) and (3.1) with the quantity replaced by . The terms and were defined in Section 3.4.

The function separates the terms with and the terms involving such that maximisation of these terms can be performed separately. Let , and . In the nonparametric setting, explicit estimators of the parameters can be computed at each step of the EM algorithm through the formulas:

In the general regression context, a Newton-Raphson procedure is implemented separately to maximise both parts of . The first and second order derivatives of with respect to are equal to:

Exact expressions of the first and second order derivatives of with respect to and are given in the Appendix section. They are expressed as weighted versions with respect to of the derivatives obtained in the context where all individuals are susceptibles. As previously, the block matrix corresponding to the second order derivatives with respect to the s of the Hessian is diagonal and inversion of the Hessian matrix is achieved using the Schurr complement.

4 Estimation procedure using the adaptive ridge method

In this section we present a penalised estimation method to detect the number and location of the cuts of the baseline hazard, when those are not known in advance. The proposed methodology is based on the work of [NuelFrommlet], [rippe2012visualization] and [bouaziz17] and can be applied to any of the previous scenarios (with exact observations, with a cure fraction, in a nonparametric setting, in a regression setting) where the function represents the objective function associated with the context under study.

4.1 A penalised EM algorithm

If the number of cuts is not known in advance, we choose a large grid of cuts (i.e large) and we penalise the log-likelihood in the manner of [NuelFrommlet], [rippe2012visualization] and [bouaziz17]. This penalisation is designed to force consecutive values of the s to be close to each other. It is defined in the following way:


where are non-negative weights that will be iteratively updated in order for the weighted ridge penalty term to approximate the L penalty. The pen term is a tuning parameter that describes the degree of penalisation. Note that the two extreme situations pen and pen respectively correspond to the unpenalised log-likelihod model of Section 3 and to the Cox model with exponential baseline.

Only the maximisation over is affected by the penalty. The first and second order derivatives of with respect to are equal to:

The block matrix corresponding to the second order derivatives with respect to the s is therefore tridiagonal. For a given value of pen and of the weight vector , inversion of the Hessian matrix is performed using the Schurr complement as previously (see the Appendix section) and the Newton-Raphson algorithm is implemented to derive . Once the Newton-Raphson algorithm has reached convergence, the weights are updated at the th step from the equation

for with (recommended value from [NuelFrommlet]) and where the ’s represent the estimates of the ’s obtained through the Newton-Raphson algorithm. This form of weights is motivated by the fact that is close to when and close to when . Hence the penalty term tends to approximate the L norm. The weights are initialized by , which gives the standard ridge estimate of .

Finally, for a given value of pen, once the adaptive ridge algorithm has reached convergence, a set of cuts is found for the ’s verifying . The non-penalised log-likelihood is then maximised using this set of cuts and the final maximum likelihood estimate is derived using the results of Section 3. It is important to stress that the penalised likelihood is used only to select a set of cuts. Reimplementing the non-penalised log-likelihood in the final step enables to reduce the bias classically induced by penalised maximisation techniques.

4.2 Choice of the penalty term

A Bayesian Information Criterion (BIC) is introduced in order to choose the penalty term. As explained in the previous section, for each penalty value the penalised EM likelihood (6) selects a set of cuts. For a selected set of cuts we denote by the total number of parameters to be estimated and by the corresponding non-penalised estimated model parameter obtained by maximisation of the function. The BIC is then defined as:

Note that the BIC is expressed here in terms of selected models. Since different penalty values can yield the same selection of cuts, the BIC needs only to be computed for all different selected models (and not for all different penalties). As an illustration of the model selection procedure, a full regularisation path can be produced where for each penalty value corresponds a set of cuts and parameter estimates, see for example Figure 3 of [bouaziz17]. The final set of cuts along with its estimator is chosen such that is minimal.

Other criteria could be used to perform model selection such as the Akaike Information Criterion (AIC). However we recommend to use the BIC as this criterion was shown to have similar performance as the cross-validation criterion on time to event data in [bouaziz17].

4.3 Discussion on computational aspects

The complexity for the inversion of the Hessian of is still of order , in the case (see the Appendix section on the Schurr complement for details about computational complexity). However, for a given penalty, it should be noted that the global algorithm for maximising or consists of an EM algorithm with a Newton-Raphson procedure at each step. As a consequence, a Generalised Expectation Maximisation (GEM) algorithm (see [dempster1977maximum]) is used instead of the standard EM where, as soon as the value of or increases, the Newton-Raphson procedure is stopped. This results in computing only a few steps of the Newton-Raphson algorithm (very often only one step is needed). As the EM algorithm is usually very slow to reach convergence the turboEM R package with the squareEM option is used to accelerate the procedure (see for instance [varadhan2008simple]). Finally, the algorithm must be iterated for the whole sequence of penalties. In order to evaluate the global computational cost, numerical experiments were conducted which showed that, for a maximum of initial cuts, the total complexity of the whole procedure is of order . See Section 6.3 for more details on these aspects.

5 Statistical inference

Inference on the model parameters can be achieved after selection of the cuts of the baseline function by considering these cuts as fixed parameters. Then the problem reduces to a fully parametric model. Theoretical results for parametric models with interval censored data have been discussed in 

[sun07] for instance.

Statistical tests are implemented from the likelihood ratio test which is based on the observed likelihood . Let with of dimension

. To test the null hypothesis

, with

known, one can use the test statistic

which follows a chi-squared distribution with

degrees of freedom from standard likelihood theory. Confidence intervals can also be constructed from the likelihood ratio statistic. Let us assume that with of dimension and consider the test versus . The confidence interval level of the parameter will be determined by the set of values such that the previous test is not significant at the significance level . Note that the p-value of the test is defined by (with a slight abuse of notation for the realisation of the test statistic)

and the test is non-significant if this value is greater than . Let be the quantile of the distribution. The bounds of the confidence intervals can therefore be determined by resolving the equation


with respect to . This equation has two solutions and since it is clear that is part of the confidence interval (the p-value equals one for this value), a grid search can be performed using for example the uniroot package with the two starting intervals and , where is a positive constant. This constant can be chosen arbitrarily large and should satisfy that the left-hand side of Equation (7) is of opposite sign for and . See [zhou2015empirical] for more details about the likelihood ratio test approach for constructing confidence intervals.

A more classical method for deriving confidence intervals is based on the normal approximation of the model parameter. It requires to compute the Hessian matrix of the observed log-likelihood evaluated at . This can be done by direct calculation or by using the following relationship which makes use of the complete likelihood:


In the above equation, the Hessian can be computed based on the complete likelihood by taking the derivative of the right-hand side of the equation with respect to . Note that, in the absence of exact observations and of a cure fraction,