# Penalized maximum likelihood for cure regression models

We propose a new likelihood approach for estimation, inference and variable selection for parametric cure regression models in time-to-event analysis under random right-censoring. In such a context, it often happens that some subjects under study are "cured", meaning that they do not experience the event of interest. Then, sample of the censored observations is an unlabeled mixture of cured and "susceptible" subjects. Using inverse probability censoring weighting (IPCW), we propose a binary outcome regression likelihood for the probability of being cured given the covariate vector. Meanwhile the conditional law of the susceptible subjects is allowed to be very general. The IPCW requires a preliminary fit for the conditional law of the censoring, for which general parametric, semi- or non-parametric approaches could be used. The incorporation of a penalty term in our approach is straightforward; we propose L1-type penalties for variable selection. Our theoretical results are derived under mild technical assumptions. Simulation experiments and real data analysis illustrate the effectiveness of the new approach.

## Authors

• 14 publications
• 9 publications
08/12/2020

### A presmoothing approach for estimation in mixture cure models

A challenge when dealing with survival analysis data is accounting for a...
07/17/2020

### An Easy-to-Implement Hierarchical Standardization for Variable Selection Under Strong Heredity Constraint

For many practical problems, the regression models follow the strong her...
12/11/2019

### Parametric mode regression for bounded data

We propose new parametric frameworks of regression analysis with the con...
02/01/2022

### Latent Class Analysis with Semi-parametric Proportional Hazards Submodel for Time-to-event Data

Latent class analysis (LCA) is a useful tool to investigate the heteroge...
08/23/2013

A new family of penalty functions, adaptive to likelihood, is introduced...
11/27/2019

### Modeling excess hazard with time–to–cure as a parameter

Cure models have been widely developed to estimate the cure fraction whe...
02/02/2020

### Bayesian Reliability Analysis of the Power Law Process with Respect to the Higgins-Tsokos Loss Function for Modeling Software Failure Times

The Power Law Process, also known as Non-Homogeneous Poisson Process, ha...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Standard survival models assume that all individuals experience the event of interest eventually (cf. Kalbfleisch and Prentice (2002)). However, this assumption is not always tenable since, for example, some diseases may require specific biological and/or lifestyle traits in place to activate, or it may be that immunity is a consequence of successful curative treatment (or, indeed, a combination of both treatment and pre-treatment attributes). The individuals who will not experience the event are referred to as cured or non-susceptible, and, although the former is perhaps suggestive of treatment, both are used interchangeably in the biostatistic literature. While there may be scenarios where it is possible to medically examine individuals to determine who is cured, typically the “cure status” is unknown and inference about cure mechanisms is based on the results of a survival study. Note that, as the observation period increases in such a study, the sample becomes composed of relatively more censoring times since the cured individuals never contribute event times. In practice then, the presence of cured individuals can be visualized in terms of the Kaplan-Meier curve (Kaplan and Meier, 1958) which will plateau as time increases (instead of tending towards zero).

The importance of estimating the cured proportion was recognized by Boag (1949) and Berkson and Gage (1952) who made use of a mixture model. Here, the overall survival distribution is improper with mixture components given by: (i) a proper survival distribution for uncured individuals (usually referred to as the latency distribution), and (ii) a point mass at time infinity for cured individuals. Farewell (1977, 1982)

extended this work to incorporate covariates in both the latency component (proportional hazards Weibull model) and the cure component (logistic regression model). This mixture model structure permits quite straightforward interpretation and is the most widely used cure model. However other formulations exist, notably the bounded cumulative hazard model

(Tsodikov, 1998; Tsodikov et al., 2003), which can be motivated through an underlying mechanistic model for tumor growth, or the compound Poisson multiplicative frailty model (Aalen, 1992; Duchateau and Janssen, 2007), which assigns a zero frailty (and, hence, zero hazard) to a mass of individuals (i.e., cured individuals). Studies concerning mixture cure models can also be traced in the labor economics literature, under the name split population duration models (Schmidt and Witte, 1989), and in the reliability literature under the name limited-failure population models (Meeker, 1987).

While much early work in cure modelling assumed a parametric latency distribution (e.g., Boag (1949) used a log-normal model, and Farewell (1977, 1982) used a Weibull model), non/semi-parametric approaches are typically preferred in the analysis of time-to-event data. Furthermore, Yu et al. (2004) showed that cure estimates can be sensitive to the choice of latency distribution – albeit more flexible parametric cure models are more robust (Peng et al., 1998; Yamaguchi, 1992). The semi-parametric cure model consists of a parametric logistic cure regression component and semi-parametric proportional hazards (PH) (Peng and Dear, 2000; Sy and Taylor, 2000) or accelerated failure time (AFT) latency model (Li and Taylor, 2002; Zhang and Peng, 2007); in both PH and AFT cases, the baseline latency model is an unspecified function, and estimation is based on the EM algorithm (Dempster et al., 1977) in combination with (modified) partial likelihood (Cox, 1972, 1975) or rank regression (Ritov, 1990; Tsiatis, 1990) respectively. Note that penalized splines have also been considered for the baseline latency model (Corbière et al., 2009). Of course, structural model assumptions are still made in the way that covariates enter (i.e., PH or AFT), and, estimation of the cure component can still be sensitive to such choices.

Our approach differs from the aforementioned. First note that, if the cure status was directly observable, one would simply apply standard binary logistic (or probit) regression. Because, typically, the cure status is not available (e.g., medically or by finite-time survival studies), we propose replacing it with a “proxy” or “synthetic” value, and then proceeding in the usual way. In particular, we generate these synthetic values using inverse probability censoring weighting (IPCW) arguments (Robins and Finkelstein, 2000; van der Laan et al., 2003). This approach obviates the need for a latency model – although IPCW does require estimation of the censoring distribution. At first sight, we trade one missing data framework (EM) for another (IPCW), and latency estimation with censoring estimation. However, in addition to being an interesting new cure estimation process in its own right, our proposal has its advantages. Unlike the existing EM approaches, which are not so easily extended to more flexible latency models (beyond PH and AFT), our general framework does not depend on any particular specification for the censoring model, i.e., this could be very flexible. Furthermore, once the synthetic cure status has been computed (in one initial step), one may avail of standard, fast (GLM) estimation and penalized variable selection procedures thereafter.

The remainder of this article is organized as follows. In Section 2 we introduce a new result which forms the basis of the proposed likelihood estimation procedure of Section 3. In Section 3.2 we present a penalized version of this likelihood for the purpose of variable selection. Asymptotic properties of our estimation procedure are given in Section 4, along with empirical evidence via simulation in Section 5. Real data examples are given in Section 6. We close with some remarks in Section 7. The proofs are postponed to the Appendix.

## 2 Preliminaries

Let denote the survival time of interest, and note in particular that, in contrast to standard survival models, the support includes the possibility that to allow for cases where the event never occurs, i.e., cured (or non-susceptible) individuals. We also introduce the cure indicator so that is the cure probability. Furthermore, is survival time for an uncured individual (the latency time) with survivor function whose support is contained in , i.e., is simply a standard survivor function. Since , we have that which is the so-called “mixture cure” model; it is a finite mixture of cured and uncured individuals. Note that so that, in the time limit (when ), there is a proportion of individuals who do not experience the event, and has an asymptote.

Generally we will have a vector of covariates, , and it is of particular interest to model the relationship between and . Thus, generalizing the above mixture cure model to covariate dependence, we arrive at

 ST(t∣X)={1−π(X)}ST0(t∣X)+π(X)

where is the cure regression function, and is a vector of parameters which describe the relationship between and the cure probability. Typically, although not necessarily, covariates enter through a linear predictor, i.e., is a given function of where is the covariate vector and is the vector of associated regression coefficients. (Here and in the following, for any matrix denotes its transpose.) In our applications, we will make use of a logistic regression function, , but one may, of course, use alternative parametric forms (e.g., probit or complementary log-log, or forms in which covariates enter in non-linear ways). Note that we have not suggested any particular model for as, in our proposed estimation procedure, this function is completely unspecified.

In the majority of practical applications, we do not observe the cure status, . We observe individuals over a (possibly fixed) time period during which some will experience the event and others will not. Those who do not experience the event during the observation period are censored – but this does not mean they have been cured since they may experience the event at a later stage outside of the observation period. We therefore introduce a censoring time with survivor function . In contrast to , the support of does not include infinity since, practically, observation windows are finite. Let be the observed time (where is the minimum operator), and is the event indicator. Therefore, due to censoring, neither nor are observed so that inference must be made through and .

We will assume the following:

 T0 ⊥C ∣X, (2.1) B ⊥(T0,C) ∣X, (2.2)

where (2.1) is the standard independence assumption made throughout survival literature, and (2.2) is introduced in the cure context to ensure that the cure regression model is identifiable. In particular, Assumptions (2.1) and (2.2) guarantee that . See Lemma 8.1 in the Appendix. With these assumptions in place, it can then be shown that

 E(ΔSC(Y−∣X) ∣∣∣X) = E(1−B∣X) = 1−π(X) (2.3)

where, for any possible value, , of the covariate vector,

 SC(t−∣x)=Pr(C≥t∣X=x).

This result is the core of our estimation scheme which is described in Section 3. In fact, (2.3) is a special case of the more general result

 E(Δr(Y,X)SC(Y−∣X) ∣∣∣X)=E{r(T0,X)∣X}{1−π(X)} (2.4)

but with , and, indeed, it is (2.4) which is proved in Lemma 8.2. It is worth highlighting that (2.4) is an application of the Inverse-Probability Censoring Weighting (IPCW) approach (Robins and Finkelstein, 2000) extended to the case where a cured proportion exists; (2.4) reduces to the usual IPCW approach when .

## 3 Estimation and inference

If the iid replicates of were observed, we would simply have the standard Bernoulli log-likelihood,

 ℓ(θ)=n∑i=1[Bilogπi(θ)+(1−Bi)log(1−πi(θ))], (3.1)

where . Of course, (3.1) is not operational in the cure context since is not observed but it serves as our motivation for likelihood estimation and inference on based on iid replicates , .

### 3.1 Likelihood estimation

We now define

 B∗i(SC)=B∗i(Yi,Δi,Xi,SC)=1−ΔiSC(Yi−∣Xi), (3.2)

which, from (2.3), is such that . Replacing with in (3.1), we obtain the log-likelihood function which we propose in the case where is known, that is

 ℓ∗(θ)=n∑i=1[B∗i(SC)logπi(θ)+{1−B∗i(SC)}log(1−πi(θ))]ωi. (3.3)

Unlike (3.1), this likelihood is formed using the observable quantities and rather than the unobservable . Here, are positive weights. As will be explained in the sequel, the weight function is a technical device that will be needed when deriving general asymptotic results. However, we anticipate that in practically all applications . The score function, where

 U∗i(θ)=[B∗i(SC)−πi(θ)]ωiπi(θ)(1−πi(θ))∂πi(θ)∂θ∈Rp+1,

is unbiased due to property (2.3); note that, in the case of a logistic cure regression model, we obtain the simplified expression with . Furthermore, using standard inequalities, it can be shown that, when the cure regression model is identifiable,

 E{ℓ∗(θ)}

where is the true parameter vector. See Lemma 8.3 in the Appendix. Hence, is a legitimate criterion for estimation and inference on .

Note that we have explicitly written as a function of (while its dependence on , , and is implicit) since we must estimate in practice, i.e., we will use . In applications of IPCW (recall that (2.3) is based on IPCW arguments), is typically estimated using a Kaplan-Meier or a Cox model, but one could also use more flexible non-parametric regression models such as those of Aalen (1980) or Beran (1981); in fact, the asymptotic theory of Section 4 only requires that the chosen estimator for has an iid representation. Thus, while our approach does not require specification or estimation of , we must estimate , and this can essentially be done in an arbitrarily flexible way. Therefore, for practical purposes, we propose the maximum likelihood estimator defined as

 ^θ=(^θ(0),^θ(1),…,^θ(p))⊤=argmaxθ^ℓ∗(θ) (3.5)

where

 ^ℓ∗(θ)=n∑i=1[B∗i(^SC)logπi(θ)+{1−B∗i(^SC)}log(1−πi(θ))]ωi. (3.6)

The weights will serve in theory to control the behavior of general estimates of in regions of low covariate density.

Of course, (3.6) is the familiar likelihood function used in logistic regression, (3.1), (hence, GLM estimation procedures can be used) but with replaced with . The quantity plays a similar role to a “synthetic observation” as used by Koul et al. (1981) (see also Delecroix et al. (2008)

and references therein) in least squares estimation for censored survival data. However, their response variable is a survival time, whereas we have a binary cure indicator. It is worth highlighting that current estimation procedures for semi-parametric cure models also involve replacing

in (3.1) with an expected value, but in an iterative EM fashion (Peng and Dear, 2000; Sy and Taylor, 2000; Li and Taylor, 2002; Zhang and Peng, 2007), whereas is computed in one step followed by maximization of (3.6). The reason for this difference comes from the modeling approach. In the existing literature, models are assumed for both the conditional survival time of the uncured individuals, , and the cure proportion, , that together completely determine , while no assumptions are made about . Since is directly identifiable from the observed data, the EM iterations between and represent a natural way to estimate these assumed model components. In contrast, our approach uses the fact that can also be identified from the data, and, thus, we neither need to impose assumptions on nor use an iterative procedure.

### 3.2 Penalized likelihood for variable selection

Selection of important variables has traditionally been carried out using best subset and stepwise procedures. However, these discrete approaches (variables are either in or out) are computationally intensive, their theoretical properties are not so well characterized (Fan and Li, 2001, 2002)

, coverage of confidence intervals for selected variables can be poor

(Hurvich and Tsai, 1990; Zhang, 1992), and they are unstable in the sense that small changes to the data can result in large changes to the selected model and coefficients (Breiman, 1996). More modern approaches are based on penalized likelihood methods such as the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996), and these remove variables by estimating their coefficients as zero. Furthermore, the adaptive lasso (alasso) (Zou, 2006)

enjoys the so-called oracle property (i.e., asymptotically, the estimates and standard errors are the same as if the true submodel was known and had simply been fitted without variable selection); the non-convex smoothly clipped absolute deviation (SCAD) penalty also possesses the oracle property

(Fan and Li, 2001). Interestingly, the adaptive lasso is asymptotically related to (a continuous version of) best subset selection (Zhang and Lu, 2007).

The situation we consider for variable selection is the one where is a given function of and thus with . Then, by construction, the incorporation of a penalty term in our setup is straightforward, with the alasso estimator given by

 ^θλ=(^θλ,(0),^θλ,(1),…,^θλ,(p))⊤=argmaxθ∈Θ^ℓ∗λ(θ) (3.7)

where

 ^ℓ∗λ(θ)=^ℓ∗(θ)−λp∑j=1wj|θ(j)| (3.8)

is the penalized likelihood function, is defined in (3.6), is the associated penalty parameter, and are (potentially adaptive) weights, . Here, as is usual, the intercept, , is not penalized in (3.7), and, furthermore, typically, the covariates will be standardized. Setting yields the lasso penalty which penalizes all coefficients equally, while for some yields the alasso penalty (we will set as is most common in practice). In the latter case may be any consistent estimator of , and, typically, , where is the th unpenalized estimator from (3.5). In Section 5.3 we provide details on implementation aspects of the alasso in our context.

## 4 Asymptotic results

Our asymptotic results are deduced under some minimal moment assumptions on the observed variables completed by some mild high-level assumptions on the cure regression model and on the model for the censoring variable. These conditions are quite natural in the context of right-censored data when covariates are present and are to be verified on a case by case basis according to the context of the application. In this section we use the notation

, and (hence, ) as defined in equation (3.2).

###### Assumption 4.1

(The data) The observations are independent replications of , where is some covariate space. Moreover,

Let be a generic parametric cure regression model, that is a set of functions of the covariate vector indexed by in some parameter space . For the logistic cure model,

### 4.1 Consistency

For the definition of a Glivenko-Cantelli class of function, we refer to the book by van der Vaart (2000).

###### Assumption 4.2

(The cure regression model)

1. The weight is bounded, almost surely nonnegative and has a positive expectation.

2. There exists such that . Moreover, there exists such that, for any ,

3. For any .

4. The model is a Glivenko-Cantelli class of functions of with constant envelope.

###### Assumption 4.3

(Uniform law of large numbers)

The estimator

satisfies the law of large numbers uniformly over the class of the logit transformations of the functions in

, that is

 supθ∈Θ∣∣ ∣∣1nn∑i=1[B∗i(^SC)−B∗i(SC)]ω(Xi)log(πi(θ)1−πi(θ))∣∣ ∣∣=oP(1).

For simplicity, we consider a bounded weight and assume that the cure regressions in the model stay uniformly away from 0 and 1. The condition in Assumption 4.2-3 is a slightly reinforced identification condition that will guarantee that is a well-separated maximum of the expectation of the likelihood. It will be satisfied for instance in the logistic or probit regression as soon as the covariates are not redundant, that is whenever

is an invertible matrix.

Let us provide some mild sufficient conditions implying the uniform convergence of Assumption 4.3. These sufficient conditions involve a threshold that is commonly used in the literature of cure models. It is typically justified as representing a total follow-up of the study, and usually appears to have been considered independent of the covariates. However, we allow it to depend on the covariates in an arbitrary way.

###### Lemma 4.1

Assume that there exists such that, for any , and Moreover, Assumptions 4.2-2 and 4.2-4 hold true. If

 supx∈X,ω(x)>0supy≤τ(x)∣∣^SC(y−∣x)−SC(y−∣x)∣∣=oP(1), (4.1)

then the uniform convergence in Assumption 4.3 holds true.

The common parametric, semiparametric and nonparametric estimators satisfy condition (4.1). Several examples are provided in the monographs by Borgan et al. (1995) and Kalbfleisch and Prentice (2002), and, for convenience, some examples are recalled in the Appendix. The consistency of our maximum likelihood estimator is stated in the following results.

###### Proposition 4.2

If Assumptions 4.1, 4.2 and 4.3 hold true, then .

###### Corollary 4.3

If Assumptions 4.1 and 4.2 hold true, and condition (4.1) is met, then .

### 4.2 Asymptotic normality

###### Assumption 4.4

(The cure regression model)

1. For any , the map is twice continuously differentiable.

2. The true value is an interior point of ,

 E[∥∥∥∂π(θ0)∂θ∥∥∥2]<∞

and the matrix

 A(θ0)=E[ω(X)π(θ0)[1−π(θ0)]∂π(θ0)∂θ∂π(θ0)∂θ⊤]

is positive definite.

3. For any the families of functions of indexed by

 F1,kl={∂2π∂θ(k)∂θ(l)(x;θ):x∈X,θ∈Θ},F2,kl={(∂π∂θ(k)∂π∂θ(l))(x;θ):x∈X,θ∈Θ}

are Glivenko-Cantelli classes of functions of with integrable envelopes.

###### Assumption 4.5

(I.I.D. representation) Let be a vector-valued function such that . Then there exists a zero-mean vector-valued function that depends on , such that and

 1n∑1≤i≤n[B∗i(^SC)−B∗i(SC)]φ(Xi)=1n∑1≤i≤nμφC(Yi,Δi,Xi)+oP(n−1/2);

Assumption 4.4 introduces mild standard regularity conditions on the cure regression model. In particular, Assumption 4.4-3

allows the uniform law of large numbers and guarantees that the remainder terms in the standard Taylor expansion used to prove asymptotic normality for MLE are uniformly negligible. Such an assumption on the complexity of the classes of first and second order derivatives of the functions in the model are satisfied by the standard parametric models such as logit and probit models. As an alternative to Assumption

4.4-3, we could impose condition (4.1) and slightly stronger regularity conditions on the model . The details are provided at the end of the proof of Proposition 4.4. A property as required in Assumption 4.5 is very common in survival analysis and is related to the so-called Kaplan-Meier integrals (cf. Stute (1996) and Gerds et al. (2017)). In the Appendix we provide several examples of models and estimators for which Assumption 4.5

holds true, namely Kaplan-Meier, conditional Kaplan-Meier, Cox proportional hazard, proportional odds and transformation models. In these examples, except for the case of the conditional Kaplan-Meier estimator, the weights

will always be set equal to 1.

In general, the expression of the function in Assumption 4.5 depends on the joint law of the observations

. This function contributes to the asymptotic variance of our MLE

that, hence, will be different from the asymptotic variance of the infeasible MLE defined with instead of .

###### Proposition 4.4

Assume the conditions of Assumptions 4.1, 4.2 and Lemma 4.1 are met. Moreover, Assumption 4.4 holds true, and Assumption 4.5 is satisfied with

 φ(X)=ω(X)π(θ0)[1−π(θ0)]∂π(θ0)∂θ.

Then

 ^θ−θ0=A(θ0)−11nn∑i=1{μ(Yi,Δi,Xi;θ0)+μφC(Yi,Δi,Xi;θ0)}+oP(n−1/2).

where

 μ(Y,Δ,X;θ0)=[B∗(SC)−π(θ0)]ω(X)π(θ0)[1−π(θ0)]∂π(θ0)∂θ

and is the zero-mean vector-valued function from Assumption 4.5. In addition,

 √n(^θ−θ0)⇝Np+1(0,A(θ0)−1V(θ0)A(θ0)−1)

with . ( denotes convergence in law.)

If suitable estimates of the vectors and are available, say, and , then could be simply estimated by sample covariance, where . Meanwhile, could also be estimated by standard methods, and thus one could derive an estimate of the variance of . However, the estimates and are sometimes difficult to built. Alternatively, one can make use of the nonparametric bootstrap to approximate the law of ; indeed, we use this approach in our simulation studies and real data analysis.

### 4.3 Oracle properties for the adaptive lasso

In this section we prove consistency in variable selection for the adaptive lasso proposed in Section 3.2. Moreover, we prove the asymptotic normality for the true subset of coefficients. That is, we extend the Theorem 4 of (Zou, 2006) to the cure regression context. Let be the true value of the cure regression parameter vector. Assume the true model has a sparse representation. Let . Without loss of generality, suppose with Below, the subscript is used to define the subvectors or blocks in matrices with the components corresponding to the indices in the set . That is, is the subvector of the first components of , denotes the vector of partial derivatives with respect to the first components of , and is the upper-left block of dimension of the matrix defined in Assumption 4.4-2.

###### Proposition 4.5

Assume the conditions of Proposition 4.4 are met and is a given function of . Let be the estimator defined in equation (3.7) with for some . Moreover, assume that and . Let Then

1. .

2.  ^θA,λ−θA,0=AA(θ0)−11nn∑i=1{μA(Yi,Δi,Xi;θ0)+μφA,C(Yi,Δi,Xi;θ0)}+oP(n−1/2).

where

 μA(Y,Δ,X;θ0)=[B∗(SC)−π(θ0)]ω(X)π(θ0)[1−π(θ0)]∂π(θ0)∂θA

and is the zero-mean vector-valued function from Assumption 4.5 considered with

 φA(X)=ω(X)π(θ0)[1−π(θ0)]∂π(θ0)∂θA.

 √n(^θA,λ−θA,0)⇝Np0+1(0,AA(θ0)−1VA(θ0)AA(θ0)−1)

with .

As was the case for Proposition 4.4, we can also obtain Proposition 4.5 by imposing condition (4.1) and slightly stronger regularity conditions on the model instead of Assumption 4.4-3.

## 5 Simulation studies

### 5.1 Setup

We first generate , and , from which we obtain when and otherwise, and, hence, the observed time, , and censoring indicator, , respectively. The cure status is given by where , , and and are independent variables. We set with such that the marginal cure proportion . Consider the survivor function

 ST0(t∣X)={exp(−tκ)−exp(−τκ)1−exp(−τκ)}ψ

which is that of a truncated Weibull whose support is with a rate parameter, , and a shape parameter, . The latency time, , was generated according to this distribution with and where and ; the proportional hazards property holds when . The value of was set at the 95th percentile of the marginal untruncated distribution, i.e., is the unique solution of the equation

 E[{exp(−τκ)}ψ]=0.05.

Clearly, depends on the value of . Lastly, the censoring time,

, was generated from an exponential distribution with rate parameter

where . The value was chosen such that the overall censored proportion is given by where , and this depends on the values of and ; since , there are then four values for the censoring proportion, .

It is worth highlighting that only affects cure probability (since ), whereas affects all components of the data generating process (since ). Sample sizes of were considered, and, with two values for each of , , and , there are 24 scenarios altogether. Each simulation scenario was replicated 2000 times.

### 5.2 Estimation procedure

We applied the estimation scheme described in Section 3 to the simulated data with estimated using a Cox model in which both covariates, and , appear as predictors. Table 1 displays the average bias and standard error of estimates over simulation replicates. While the bias can be somewhat large when , this vanishes as the sample size increases. Similarly, the standard errors also decrease with the sample size. As we might expect, the estimates generally disimprove when the censoring proportion increases. Furthermore, the results do not change appreciably when is varied (i.e., the approach is not sensitive to the form of ), while the standard errors decrease a little when is increased. Table 2 shows the empirical coverage for 95% confidence intervals constructed using bootstrapping with 399 replicates; we find that the empirical coverage is close to the nominal level.

By way of comparison, we also applied the EM approach of Peng and Dear (2000) and Sy and Taylor (2000) which has been implemented in the smcure (Chao et al., 2012) package in R (R Core Team, 2018). In contrast to our scheme, must be estimated rather than . Thus, was estimated using a Cox model in which both covariates, and , appear as predictors. The results, shown in Table 3, are broadly similar to those in Table 1, albeit the smcure estimates are generally more efficient. This was expected since the smcure estimates take into account the model imposed to , while our assumption is fully nonparametric in that respect. An important difference is the fact that, when does not have the proportional hazards property (i.e., when ), we see bias in the smcure estimates which does not disappear with increasing sample size. In particular, the bias manifests through and ; interestingly, is unaffected (i.e., the coefficient of , the covariate which only enters the cure component).

### 5.3 Selection procedure

A variety of algorithms have been implemented for solving non-differentiable lasso problems, e.g., quadratic programming (Tibshirani, 1996), least angle regression (LARS) (Efron et al., 2004), and co-ordinate descent Friedman et al. (2007). However, we prefer the use of a differentiable penalty since standard gradient-based optimization procedures can then be utilized. Therefore, we propose the use of

 ^ℓ∗λ,ϵ(θ)=^ℓ∗(θ)−λp∑j=1wjaϵ(θ(j)) (5.1)

where is an extension of the absolute value function such that , and which is differentiable for . Clearly, smaller values bring the penalty closer to the alasso, but also bring (5.1) closer to being non-differentiable. In our work, we have found that works well.

For the purpose of selecting the tuning parameter, , we consider cross-validation; in particular we aim to minimize the -fold cross-validation error. First, since , we may define the error term, . Then, for a partition of the set , the mean-squared error for the th fold, is given by where is the penalized estimate with the th fold removed. Thus, the -fold cross-validation error is given by

 CVE(λ)=1kk∑j=1∑i∈Fj{B∗i(^SC)−πi(^θ−jλ)}2 (5.2)

where we will use as is standard in practice. Minimizing (5.2) with respect to can be achieved by profiling over a range of values or by using a one-dimensional optimizer, e.g., golden search; we will define to be the minimizer of (5.2).

In order to test our proposed selection procedure, we simulated data as described in Section 5.1, but with four additional independent variables, namely, , , , and . These variables do not affect probability of cure, i.e., their coefficients are zero, but we set so that affects other aspects of the data generating process; the and coefficients for