1. Introduction
A common problem faced in medical studies is for some subjects to never experience the event of interest during the study period. For example, consider a follow–up study examining the harmful side–effects of a pharmaceutical product. Since side–effects are commonly rare, it is expected that many subjects involved in the study will not experience the harmful effect of the treatment by the end of the study, and, therefore, these subjects will be censored at the conclusion of the study. Hence, these subjects are called long–term survivors. The first known study involving statistical analysis of data containing long–term survivors dates back to Boag (1949), and this author coined the term cure model to indicate that the data contain a non–trivial proportion of long–term survivors. However, Farewell (1986) observes that cure models should not be used without clear empirical or biological need (see Section 2). In such cases, standard results on censored data models should be used instead, and studies involving subjects with censored responses are common. Well known methods can be employed to study these data (see, for example, Lawless, 1982; Aitkin et al., 1989; Harris and Albert, 1991; Collett, 1994).
We consider the case of observing responses that are right censored by another random variable and, hence, we observe only the minimum of the response and the censoring variable. Throughout this article, we will assume that the response and the censoring variable are only conditionally independent given a covariate, and, for simplicity, we will assume the censoring variable is a continuous random variable. For clarity, we will refer to responses corresponding to the subpopulation with finite survival times as the uncured responses, which are only indirectly observed due to the presence of censored values. The purpose of this paper is to study the heteroskedastic nonparametric regression of the uncured responses on the covariate:(1.1) 
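In generic notation (the symbols here are illustrative, since the display in (1.1) is not reproduced above), a heteroskedastic location–scale model of this kind takes the form

```latex
Y_0 = m(X) + \sigma(X)\,\varepsilon ,
```

where Y_0 denotes an uncured response, X the covariate, m the regression function, \sigma the scale function and \varepsilon the model error with distribution function F_\varepsilon, independent of X.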
Here the regression function and the scale function (bounded away from zero) are both assumed to be smooth. We assume the error is a continuous random variable that is independent of the covariate, with an unknown distribution function. Identifiability of the components in the cure model also requires additional assumptions on the joint distribution of the observed data, which include the right–censoring indicator. Finally, the regression function is assumed to be a location–type functional and the scale function is assumed to be a scale–type functional, which imposes additional requirements on the model error that are analogous to the usual zero mean and unit variance assumptions, so that we can identify both the regression function and the scale function (see Section 2). From the random sample of data, we will propose nonparametric estimators of the regression function, the scale function and the error distribution function in Section 2, and we study the large sample behavior of these estimators in Section 2.1. Note, these data also include the cured cases and, therefore, do not directly correspond with (1.1). Estimating the error distribution function is particularly important because many statistical inference procedures depend on functionals of this distribution; for example, Kolmogorov–Smirnov–type and Cramér–von Mises–type statistics. Van Keilegom and Akritas (1999) considered estimation of the error distribution function, as well as the regression and scale functions, from a model similar to (1.1) but without cured subjects. The presence of cured subjects presents a new and important challenge, and, hence, those results do not directly apply to the present situation.
We are interested in studying a completely nonparametric statistical methodology for examining location–scale modelling of data containing long–term survivors. Since the first study of this type of data in Boag (1949), the literature on this subject has been divided into two distinct categories. The original cure model proposed by Boag (1949) is now known as the two–component mixture cure model (see, for example, Taylor, 1995). We will refer to the second category as simply the non–mixture cure model (see, for example, Haybittle, 1959, 1965). Several advancements have been made in the literature when these models are assumed to have a parametric or a semiparametric form. Kuk and Chen (1992) investigate combining logistic regression methods (used to estimate the unknown proportion of cured cases) with proportional hazards techniques to obtain estimators of their model parameters. Taylor (1995) works with a model similar to that of Kuk and Chen (1992) but proposes an Expectation–Maximization algorithm for simultaneously fitting the model parameters and estimating the baseline hazard function, and this author also uses simulations to conjecture that a crucial assumption is required for identifying components of these models (see Section
2). Sy and Taylor (2000) consider a model similar to that of Kuk and Chen (1992) but investigate an estimation technique based on the Expectation–Maximization algorithm. Lu (2008) considers the proportional hazards mixture cure model and proposes a semiparametric estimator of the unknown parameters of that model by maximizing a so–called nonparametric likelihood function. The estimator is shown to have minimum asymptotic variance. Lu (2010) investigates an accelerated failure time cure model (a special case of the two–component mixture cure model) and proposes an Expectation–Maximization algorithm for fitting the unknown parameters of that model as well as obtaining an estimator of the unknown error density function using a kernel–smoothed conditional profile likelihood technique. Xu and Peng (2014) and López–Cheda et al. (2017) consider estimating the unknown cure fraction in a completely nonparametric setting. Patilea and Van Keilegom (2017) consider a general approach to modelling the conditional survival function of a subject given that the subject is not cured by proposing so–called inversion formulae that allow one to express the conditional survival function of the uncured subjects in terms of the proportion of cured subjects and the subdistributions of the response and the censoring variable. A particularly interesting model belonging to the non–mixture case is proposed by Yakovlev and Tsodikov (1996). Earlier, in a back–and–forth exchange of letters to the editors of Statistics in Medicine, Andrej Yakovlev very clearly details shortcomings of the two–component mixture cure model. Specifically, he argues that the two–component mixture cure model implicitly assumes that there is only a single risk operative in the population, i.e. a single latent variable that determines whether or not subjects are cured.
Yakovlev argues that, in general, populations have multiple risk operatives, and, in his letter to the editors, he proposes what we refer to as the non–mixture cure model. The exchange between him and the authors Alan Cantor and Jonathan Shuster can be found in Yakovlev et al. (1994). Sometimes this model is called a promotion time cure model, and it has gained popularity due to motivations from biology (in particular, cancer research). Several results on parametric and semiparametric estimation of the unknown parameters in these models are available in the literature. However, this model is not in the scope of the current article, and we only mention a few of the notable works in this area. Tsodikov (1998) compares likelihood–based fitting techniques for the non–mixture cure model. Later, Tsodikov et al. (2003) survey the literature and report that, less than a decade after its introduction, the non–mixture cure model is already in popular use, and these authors promote use of the non–mixture cure model in semiparametric and Bayesian settings. Zeng et al. (2006) propose a recursive algorithm for obtaining maximum likelihood estimates in the non–mixture cure model. Additionally, these authors show their regression parameter estimators have minimum asymptotic variance. Portier et al. (2017) consider an extension of the non–mixture cure model.
Recently, efforts have been made to unify the two seemingly distinct categories of cure models. Sinha et al. (2003) discuss the benefits and disadvantages of the mixture and non–mixture cure models. Yin and Ibrahim (2005) propose transforming the unknown population survival function in a manner that is analogous to the Box–Cox transformation for non–normally distributed random variables.
Our goal is to generally relax the model conditions and investigate an alternative estimation strategy for the two–component mixture cure model. A popular theme in both modelling categories is to use proportional hazards methods, where the proportional hazards model is made additionally flexible through a nonparametric estimator of the baseline hazard function. The result is a semiparametric estimation technique: the remaining model parameters are finite dimensional, and a likelihood function is usually required to obtain estimators.
The article is organized as follows. We further discuss model (1.1) and motivate our nonparametric function estimators in Section 2, and we give asymptotic results on these estimators in Section 2.1. In Section 3, we investigate the finite sample properties of our proposed estimator of the error distribution function, and we illustrate the proposed techniques by characterizing a set of data collected to study distant metastasis in lymph–node–negative breast cancer patients. Our numerical study of the previous results in Section 3.1 shows the finite sample behavior of the estimators proposed in Section 2 is well–described by the asymptotic statements given in Section 2.1. The proofs of these results and further supporting technical results are given in Section 4.
2. Estimation of the model parameters
We begin this section with a discussion of the identifiability of the cure model parameters. In the following, we work with the distribution function and the density function of the covariates, as well as the conditional distribution functions, given the covariate, of the responses and of the censoring variable. For both cure models and censored response models, it is important that we place conditions on these conditional distribution functions so that we may identify and estimate the regression model components and the error distribution function. Empirical or biological need for using a cure model in the present situation means the support of the censoring variable is never contained in the support of the survival times of the uncured subpopulation, i.e. we require
(2.1) 
where the left– and right–hand quantities in (2.1) denote, respectively, the right endpoints of the supports of the uncured responses and of the censoring variable, given the covariate. Taylor (1995) uses simulation evidence to conjecture the necessity of (2.1). Xu and Peng (2014) and López–Cheda et al. (2017) observe that (2.1) leads to identifiability of the cure model components (see Lemma 1 of López–Cheda et al., 2017); specifically, it is required to identify the conditional proportion of cured cases given the covariate as well as the conditional distribution function of the uncured responses given the covariate. To ensure the distribution of the censoring variable is identifiable, we will further assume that the remaining mass of the response distribution beyond the support of the uncured responses occurs at infinity, i.e. we assume that, given the covariate, the event that a subject is cured is equivalent to the event that the response exceeds the support endpoint of the uncured responses. This justifies writing
where the conditional proportion of cured cases is assumed to be bounded away from zero and one, i.e. there are finite positive real numbers bounding it from below and above for every covariate value. This means that (2.1) implies we only need an estimator of the overall right endpoint of the support of the uncured responses, which does not depend on the covariate, rather than of its covariate–localized version.
To conclude our discussion on identifiability, recall that the regression function is a location–type functional and the scale function is a scale–type functional. This means there are location and scale transformations acting on conditional distribution functions that return the regression function and the scale function when applied to the conditional distribution function of the uncured responses. Therefore, we can see that the regression model components are identifiable when these transformations standardize the error distribution to location zero and scale one.
As noted on page 186 in Dabrowska (1987), responses arising from experiments with censored values are often skewed to the right and, therefore, estimators of the mean (and scale) will be affected. Beran (1981) proposes using L–type regression functionals, which are more robust to skewing in the data. To explain the idea, we introduce a score function on the unit interval together with the conditional quantiles of the uncured responses. Here the score function must be nonnegative and integrate to one. Throughout this article, we work with definitions of the regression function and the scale function as score–weighted integrals of these conditional quantiles. Hence, for the regression function and the scale function to be identifiable, we will require that the quantile function of the error distribution satisfies the analogous standardization constraints.
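In generic notation (illustrative here, since the displayed formulas are not reproduced above), L–type definitions of this kind, as used by Van Keilegom and Akritas (1999), take the form

```latex
m(x) = \int_0^1 F^{-1}(s \mid x)\, J(s)\, ds ,
\qquad
\sigma^2(x) = \int_0^1 \bigl\{ F^{-1}(s \mid x) \bigr\}^2 J(s)\, ds - m^2(x) ,
```

with F(\cdot \mid x) the conditional distribution function of the uncured responses and J the score function. Identifiability then requires the analogous standardization of the error quantile function,

```latex
\int_0^1 F_\varepsilon^{-1}(s)\, J(s)\, ds = 0 ,
\qquad
\int_0^1 \bigl\{ F_\varepsilon^{-1}(s) \bigr\}^2 J(s)\, ds = 1 .
```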
With all of the components of the regression model (1.1) identified, we can define our estimators of the model parameters. To define the estimator of the conditional distribution function of the responses, we will introduce further notation: the conditional distribution function of the observed minimum given the covariate and the conditional subdistribution function of the uncensored minimum given the covariate. From the discussion above, it follows by (2.1) that the conditional survival function of the observed minimum remains positive throughout the support of the uncured responses for every covariate value, and, hence, we can consistently estimate the conditional distribution function of the responses (and therefore the regression model components) everywhere on that region, cf. Van Keilegom and Akritas (1999).
Using these two conditional functions, the conditional cumulative hazard function of the response given the covariate may be written as
(2.2) 
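With H(\cdot \mid x) denoting the conditional distribution function of the observed minimum and H_1(\cdot \mid x) the conditional subdistribution function of the uncensored minimum (illustrative symbols, since the display in (2.2) is not reproduced above), the standard identity for the conditional cumulative hazard function reads

```latex
\Lambda(t \mid x) = \int_{-\infty}^{t} \frac{H_1(ds \mid x)}{1 - H(s^{-} \mid x)} .
```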
To estimate the two conditional functions appearing in (2.2), we introduce the Nadaraya–Watson weights
where a given kernel function and a sequence of bandwidth parameters are employed. Later, we will specify conditions on choosing the kernel and the bandwidths. Estimates of both conditional functions then follow by the approach of Stone (1977):
(2.3) 
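A minimal sketch of these two ingredients (the function names and the Epanechnikov kernel choice are illustrative assumptions, not specified in the text):

```python
import numpy as np

def nw_weights(x, X, h, kernel=lambda u: 0.75 * np.clip(1 - u**2, 0, None)):
    """Nadaraya-Watson weights w_i(x) = K((x - X_i)/h) / sum_j K((x - X_j)/h),
    using the Epanechnikov kernel by default."""
    k = kernel((x - X) / h)
    s = k.sum()
    return k / s if s > 0 else np.full_like(k, 1.0 / len(k))

def stone_estimates(t, x, X, T, delta, h):
    """Stone (1977)-type empirical estimates: weighted proportions of observed
    minima (resp. uncensored minima) at most t, localized at covariate value x."""
    w = nw_weights(x, X, h)
    H = np.sum(w * (T <= t))                     # conditional df of the minimum
    H1 = np.sum(w * ((T <= t) & (delta == 1)))   # subdistribution, uncensored cases
    return H, H1
```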
Substituting (2.3) into (2.2) leads to an estimator of the conditional distribution function of the responses in the approach of Beran (1981):
(2.4) 
Here the observed minima are placed in ascending order, and both the censoring indicators and the Nadaraya–Watson weights are ordered accordingly. For simplicity, we will assume that the data contain no tied responses, which is reasonable because our assumptions imply the responses are continuous random variables. Otherwise, the ordering of the variables indicated above is not unique and the estimator is affected.
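A self-contained sketch of the Beran product-limit construction built from these weights (implementation details and the kernel choice are illustrative assumptions):

```python
import numpy as np

def beran_estimator(t, x, X, T, delta, h):
    """Beran (1981) estimator of a conditional distribution function F(t|x):
    a product-limit estimator with Nadaraya-Watson weights in place of 1/n."""
    u = (x - X) / h
    k = 0.75 * np.clip(1 - u**2, 0, None)       # Epanechnikov kernel
    if k.sum() == 0:
        return np.nan
    w = k / k.sum()
    order = np.argsort(T)                       # ascending order of observed minima
    w, d, Ts = w[order], delta[order], T[order]
    cum_w = np.concatenate(([0.0], np.cumsum(w)[:-1]))  # weight mass before each point
    surv = 1.0
    for wi, di, cwi, ti in zip(w, d, cum_w, Ts):
        if ti > t:
            break
        denom = 1.0 - cwi
        if di == 1 and denom > 0:               # only uncensored points contribute
            surv *= 1.0 - wi / denom
    return 1.0 - surv
```

With equal weights and no censoring this reduces to the empirical distribution function, which is a quick sanity check on the implementation.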
Xu and Peng (2014) propose estimating the right endpoint of the support of the uncured responses by the largest uncensored response and then combining this estimator with (2.4) to form an estimator of the unknown proportion of cured cases,
(2.5) 
The estimator is shown to be consistent and asymptotically normally distributed. Later, López–Cheda et al. (2017) generalize this result in two steps. First, these authors show the endpoint estimator is strongly consistent. Second, the cure proportion estimator is shown to be uniformly and strongly consistent.
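In illustrative notation (the display in (2.5) is not reproduced above), the construction of Xu and Peng (2014) can be sketched as

```latex
\hat\tau_0 = \max_{i \,:\, \delta_i = 1} T_i ,
\qquad
\hat\pi(x) = 1 - \hat F\bigl( \hat\tau_0 \mid x \bigr) ,
```

where the T_i denote the observed minima, the \delta_i the censoring indicators, \hat F(\cdot \mid x) the Beran estimator from (2.4) and \hat\pi(x) the estimated conditional proportion of cured cases.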
Turning our attention now to the regression function and the scale function, we can see that the unknown conditional quantiles of the uncured population must be estimated. These quantiles are equivalent to quantiles of the full conditional response distribution evaluated at levels rescaled by the conditional proportion of uncured cases, which we can consistently estimate using (2.4) and (2.5). The resulting estimators of the regression function and the scale function are analogous to those of Van Keilegom and Akritas (1999), with the estimated conditional quantiles substituted into the score–weighted integrals above.
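As a numerical sketch (the function names and the midpoint grid rule are our own, not from the text), the score–weighted quantile integrals defining the location and scale estimators can be approximated as follows, given any estimated conditional quantile function:

```python
import numpy as np

def l_functionals(q_hat, J, grid_size=2000):
    """Approximate the L-type functionals
        m = int J(s) Q(s) ds,   sigma^2 = int J(s) Q(s)^2 ds - m^2
    by a midpoint rule on [0, 1], where q_hat(s) evaluates an estimated
    (conditional) quantile function and J is a score function integrating to one."""
    s = (np.arange(grid_size) + 0.5) / grid_size   # midpoints of [0, 1]
    q = q_hat(s)
    j = J(s)
    m = np.mean(j * q)                  # midpoint rule: (1/N) sum J(s_k) Q(s_k)
    s2 = np.mean(j * q**2) - m**2
    return m, np.sqrt(max(s2, 0.0))
```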
Consider the largest observable standardized error, which is finite by (2.1). It follows from the standardization that this quantity depends on neither the regression function nor the scale function, for almost every covariate value. This means we can transfer the support of the conditional distribution of the uncured responses at one covariate value into the region where this conditional distribution can be estimated at other covariate values. Note, this is the same transfer of information between covariate regions studied in Van Keilegom and Akritas (1999). However, here it implies that we can form an estimator of the error distribution function using the estimators of the regression function, the scale function and the conditional response distribution, which is new. Observe the error distribution function can be written as the average, over the covariate distribution, of the conditional distribution functions of the uncured responses evaluated at the standardized argument, where we have used the transference mapping. We arrive at the proposed estimator of the error distribution function:
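In illustrative notation (the displayed formulas are not reproduced above), with F_0(\cdot \mid x) denoting the conditional distribution function of the uncured responses, this average and the proposed estimator (2.6) can be sketched as

```latex
F_\varepsilon(e) = \mathbb{E}\bigl[ F_0\bigl( m(X) + \sigma(X)\, e \mid X \bigr) \bigr] ,
\qquad
\hat F_\varepsilon(e) = \frac{1}{n} \sum_{i=1}^{n} \hat F_0\bigl( \hat m(X_i) + \hat\sigma(X_i)\, e \mid X_i \bigr) ,
```

where the estimated components are those constructed earlier in this section; this is a sketch of the structure of (2.6) rather than its exact display.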
(2.6) 
Note this estimator averages the covariate–localized model estimators, which are not consistent at the root–n rate but are consistent at slower rates. Nevertheless, we show the estimator is root–n consistent for the error distribution function and satisfies a functional central limit theorem (see Section 2.1).
2.1. Asymptotic results on the nonparametric function estimators
In order to state our asymptotic results for the estimators introduced in the previous section, we require the following assumptions:


The bandwidth satisfies and .

There are real numbers satisfying for every .


has bounded second derivative.


The distribution function of the covariates has a density function that is bounded and bounded away from zero in .

The density function has two bounded derivatives.



There is a continuous nondecreasing function satisfying and such that

The conditional (sub)distribution functions and have continuous partial derivatives, with respect to , and , respectively, that are bounded in .

There are continuous nondecreasing functions and with , and such that
and 
The second partial derivatives, with respect to , of the conditional (sub)distribution functions and exist and are bounded in .


The conditional distribution functions and admit bounded Lebesgue density functions.

The (conditional) distribution functions and are twice continuously differentiable and bounded away from zero in absolute value on any compact interval in the region , with the density function of bounded away from zero at .


The score function is bounded and nonnegative, and there are constants such that is bounded away from zero on but equal to zero on (when we only require that is equal to zero on ).

is continuously differentiable with bounded derivative .

Assumptions 3 and 4 are common assumptions taken for nonparametric regression models, which guarantee good performance of nonparametric function estimators. Note that Assumptions 5 (i) and (iii) are satisfied for many distributions. Suppose that is the logistic distribution function with a positive, bounded mean function and scale function . Write and . Then Assumption 5 (i) is satisfied by choosing . When is also differentiable with a bounded derivative, then bounding functions and (that are similar to ) can be chosen to satisfy Assumption 5 (iii) as well. Assumptions 5 (ii) and (iv) and 6 imply the conditional distribution functions and also meet these conditions and that must meet Assumptions 5 (ii) and (iv), when these assumptions are required; e.g. due to the conditional independence of and given we can write
In addition, and defined in Section 2 are functionals based on truncated means, which implies the integrals are restricted to compact subsets of . Therefore, combining the Leibniz integral rule for differentiation with Assumptions 5 (ii) and (iv) yields that both and are twice differentiable with bounded derivatives. Assumption 7 is a technical assumption required for the consistency of for
, and many probability distributions satisfy this assumption as well. Finally, Assumption
8 is a standard assumption that ensures the regression function and the scale function are well–defined L–type regression functionals (see page 186 of Dabrowska, 1987). Define
Our first result specifies the asymptotic order and expansion of the cure proportion estimator, which is given in Theorem 3 of López–Cheda et al. (2017). We offer an alternative proof of this result, which may be found in Section 4.
In the next two results, the asymptotic orders and expansions of the location estimator and the scale estimator are given.
Proposition 2.
Proposition 3.
To continue, it is common in heteroskedastic models to place a restriction on the curvature of the distribution function of either the responses or the errors (see, for example, Chown, 2016, who works with finite Fisher information for location and scale). Recall the functions , and from Assumption 5 (i) and (iii). We will require the function to satisfy the following curvature restriction that is analogous to assuming finite Fisher information for location and scale, i.e. we assume that has two derivatives such that
(2.7) 
Consequently, for suitable sequences of real numbers tending to zero at appropriate rates, (2.7) implies a uniform approximation statement (see the proof of Theorem 1 in Section 4). Note, (2.7) is also satisfied for many distributions, including the logistic distribution, as in the example given above of a distribution satisfying Assumptions 5 (i) and (iii). We are now prepared to state the two main results of this section: a strong uniform representation of the estimation error and the weak convergence of the estimated error distribution function.
Theorem 1 (strong uniform representation of ).
Let Assumptions 1 – 8 hold. Assume the function , where the functions , and are given in Assumption 5, is twice differentiable and satisfies (2.7). Finally, let satisfy (2.7), i.e. has finite Fisher information for both location and scale and the error density is bounded and satisfies . Then
where , almost surely, and
with
and
Theorem 2 (weak convergence of ).
Under the conditions of Theorem 1, if the bandwidth sequence is chosen to satisfy the two rate conditions above, then the estimator is asymptotically linear, i.e.
where and the influence function is
with . Consequently, the process weakly converges to a mean zero Gaussian process with covariance function for .
Remark 1 (consequences for the choice of bandwidth).
Theorem 2 implies that the estimator is a root–n consistent estimator of the error distribution function only when the bandwidth sequence satisfies the two rate conditions above, which undersmooth the estimators of the conditional distribution functions and of the regression and scale functions. A bandwidth sequence of the polynomial form considered here satisfies these conditions for every exponent in the appropriate range. Note that choosing the exponent too large does not lead to a root–n consistent estimator because this bandwidth undersmooths too much. Alternatively, choosing the exponent too small also does not lead to a root–n consistent estimator because this bandwidth does not undersmooth enough. Another interesting consequence highlighted by Theorem 2 is that the asymptotic behavior of the estimator does not depend on the particular bandwidth sequence used to construct the covariate–localized estimators, provided the rate conditions hold.
3. Applications of the previous results
Here we investigate the finite sample properties of the proposed estimator using a simulation study. Our results indicate good finite sample behavior even at the smaller sample size of with multiple bandwidth configurations. This is particularly encouraging as we did not need to perform a computationally costly bandwidth selection procedure. Instead, we consider a variety of bandwidths of the form , with parameters and arbitrarily chosen and
denoting the sample standard deviation of the covariates. The results given here reflect the conclusion in Remark
1: the estimator shows insensitivity to the choice of bandwidth parameters used to construct the local model estimators when this sequence is chosen to appropriately undersmooth these estimators. This section is concluded with an illustration of the previous results using a dataset collected to study the behavior of distant metastasis in lymph–node–negative breast cancer patients.
3.1. Numerical study
To study the finite sample performance of the estimator , we conducted simulations of runs using sample sizes , , and under the following data generation scheme. The covariates
are uniformly distributed on the interval
, and the location and scale functions are chosen as (3.1) 
For the error distribution, we chose the standard normal distribution that has been truncated at 2, centered at zero and scaled to satisfy the cure model identifiability requirements (see Section 2). An initial set of responses is then obtained using (1.1).
We work with a cure proportion function given by the logistic distribution function with standard scaling that has been centered at , which gives about cured cases on average. Cure indicators are randomly generated based on this probability function, and whenever a cure indicator is equal to one we replace the corresponding value of with . Finally, the censoring variables are randomly generated from a mixture distribution of two components with equal mixing probabilities, where one component distribution is a normal distribution centered at with variation and the other component distribution is a shifted version of (1.1), with and as in (3.1) but now is shifted up by and the model errors are standard normally distributed (no truncation). These choices result in about censored values for the uncured cases. When we combine the censoring from both cases (cured and uncured) we expect a typical dataset generated in our simulations to present with about of censored response values.
The resulting response values are taken as the minimum of the survival and censoring times, and a censoring indicator is set equal to one whenever the survival time is observed and zero otherwise. Finally, the score function is chosen piecewise over regions of the unit interval: on one region, it is chosen as the logistic distribution function with a small scaling, and, on the remaining region, it is set equal to zero. This is a smooth approximation of a step function that nearly integrates to one.
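A sketch of this data generation scheme, under loudly stated assumptions: the location and scale forms from (3.1), the logistic centering and the censoring distribution below are placeholders, since the exact values are not reproduced in the text, and the censoring here uses a single normal component rather than the two-component mixture for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cure_data(n, m=lambda x: 1.0 + x, sigma=lambda x: 0.5 + 0.25 * x):
    """Sketch of the Section 3.1 simulation scheme. All numeric choices below
    are illustrative placeholders, not the values used in the article."""
    X = rng.uniform(0.0, 1.0, n)                   # covariates on an interval
    # errors: standard normal truncated at 2, then centered and scaled
    eps = rng.normal(size=4 * n)
    eps = eps[eps <= 2.0][:n]
    eps = (eps - eps.mean()) / eps.std()
    Y = m(X) + sigma(X) * eps                      # uncured responses, model (1.1)
    cure_prob = 1.0 / (1.0 + np.exp(-(X - 0.5)))   # placeholder logistic cure fraction
    cured = rng.uniform(size=n) < cure_prob
    Y = np.where(cured, np.inf, Y)                 # cured cases never fail
    C = rng.normal(loc=2.0, scale=1.0, size=n)     # placeholder censoring variable
    T = np.minimum(Y, C)                           # observed follow-up times
    delta = (Y <= C).astype(int)                   # censoring indicators
    return X, T, delta
```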
The bandwidth parameter sequence used to construct the covariatelocalized model estimators is of the form , where denotes the sample standard deviation of the covariates . We investigate four situations: the constant of proportionality is either or and the exponent parameter is either or . These choices are appropriate for Theorem 2, since .
We numerically measure the performance of the estimator in two ways. First, the asymptotic mean squared errors at several values of the argument are simulated, where this performance metric is calculated by first obtaining the simulated mean squared errors and then multiplying these by the corresponding sample size. Second, the asymptotic integrated mean squared error is simulated, where this quantity is calculated similarly to the asymptotic mean squared errors but now we integrate over the argument. Theorem 2 predicts these performance metrics to be stable across sample sizes.
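The two performance metrics can be computed as follows (function names are our own):

```python
import numpy as np

def amse(estimates, truth, n):
    """Simulated asymptotic mean squared error at a point: n times the
    Monte Carlo MSE of replicated estimates of a scalar target."""
    estimates = np.asarray(estimates, dtype=float)
    return n * np.mean((estimates - truth) ** 2)

def aimse(estimate_curves, truth_curve, n, ds):
    """Simulated asymptotic integrated MSE: integrate the pointwise Monte
    Carlo MSE over an evaluation grid (spacing ds), then multiply by n."""
    curves = np.asarray(estimate_curves, dtype=float)
    mse = np.mean((curves - np.asarray(truth_curve, dtype=float)) ** 2, axis=0)
    return n * np.sum(mse) * ds
```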