Many study designs involve cross-sectional sampling, which may lead to length-biased sampling of a time to event. In a prospective cohort study where the initiation time for the event is unknown and subjects are followed prospectively, a right-censored forward recurrence time is observed. This occurs, for example, in HIV seroprevalence studies, where time to AIDS following infection with HIV is of interest but the infection time is unknown. If the initiation time is known, but there is no follow-up, the backward recurrence time is observed. This current duration design has been employed in pregnancy surveys, where current trier couples provide the length of an ongoing attempt at pregnancy, and in mover-stayer models. If the initiation time is known and there is follow-up, a biased event time may be observed. When the sampling time is known, such data are commonly analyzed using methods for left-truncated data, where one conditions on the lack of an event prior to the sampling time.
In the aforementioned data set-ups, only subjects who have experienced an initiating event prior to sampling can potentially be sampled, and the sample is biased towards larger values of . If one assumes that the rate of the initiating event is stationary over time, e.g., is a homogeneous Poisson process, then the sampling time falls uniformly in the interval between the initiation and the event times [6, 15]. Letting denote the distribution of , the length-biased version has distribution , , where . Under the uniform sampling time assumption [6, 14, 9], , where is uniform(0,1) and independent of . Thus, both and will have the same density function. Since the density is the same, we use to denote both and . The density of is
where is the survival function of . This well-known result, given in expression (2) of , can be derived from the uniformity of , which yields that the conditional density of given has the form
yielding the joint density , from which (1.1) follows.
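The displayed formulas in this passage can be written out as the following short derivation. The symbols are assumptions introduced only for illustration, since the original notation is not recoverable here: f, S and \mu denote the density, survival function and mean of the unbiased event time, \tilde T its length-biased version, U the uniform(0,1) variable, and A = U \tilde T the recurrence time.

```latex
% Assumed notation (hypothetical): f, S, \mu belong to the unbiased event time;
% \tilde T is its length-biased version; U ~ uniform(0,1) is independent of \tilde T;
% A = U \tilde T is the recurrence time.
\begin{align*}
f_{\tilde T}(t) &= \frac{t\, f(t)}{\mu}, \qquad t > 0, \\
f_{A \mid \tilde T}(a \mid t) &= \frac{1}{t}, \qquad 0 < a < t, \\
f_{A, \tilde T}(a, t) &= \frac{1}{t} \cdot \frac{t\, f(t)}{\mu} = \frac{f(t)}{\mu}, \qquad 0 < a < t, \\
f_A(a) &= \int_a^\infty \frac{f(t)}{\mu}\, dt = \frac{S(a)}{\mu}, \qquad a > 0.
\end{align*}
```

Because 1 - U is also uniform(0,1), the same calculation applies to (1 - U) \tilde T, which is why the two recurrence times share one density.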
In the presence of a covariate vector with density , one may formulate the effect of on the underlying event time via the accelerated lifetime model
where is a regression parameter and is a non-negative random variable with density, survival function and hazard function . In the semiparametric version of (1.3), with the distribution of completely unspecified, efficient estimation of without length-bias is achievable with and without right censoring.
The observed covariate is also subject to length-biased sampling. Using arguments in  and , we first obtain that a consequence of (1.3) is that the joint density of and observed covariate is proportional to due to the length bias, yielding the joint density
where the integral is over the range of and is the mean associated with density . This means that satisfies an accelerated life model, given by
where is independent of with density of the form
and has the possibly non-monotone density , .
Thus, if follows model (1.3), then the distribution of also follows an accelerated lifetime model
where has a density of the form and has monotone density . Thus the accelerated lifetime structure is maintained in models for the forward and backward recurrence times, as discussed in , as well as for the length-biased time.
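The preservation of the accelerated lifetime structure can be checked numerically. The sketch below is a simulation under stated assumptions, not the paper's procedure: it assumes the model takes the form T = exp(-beta * Z) * eps with a binary covariate and a Gamma(2, 1) baseline, and it mimics length-biased sampling by resampling a population with probability proportional to T. All function names are hypothetical.

```python
import numpy as np


def sample_length_biased_aft(beta=0.7, n=200_000, seed=1):
    """Length-biased sample from the assumed model T = exp(-beta * Z) * eps."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, size=n).astype(float)  # population covariate (assumed binary)
    eps = rng.gamma(shape=2.0, scale=1.0, size=n)   # baseline lifetime (assumed Gamma(2, 1))
    t = np.exp(-beta * z) * eps                     # unbiased event times
    # Length-biased sampling: select subjects with probability proportional to T.
    idx = rng.choice(n, size=n, replace=True, p=t / t.sum())
    u = rng.uniform(size=n)                         # uniform position of the sampling time
    return z[idx], t[idx], u * t[idx]               # covariate, biased time, recurrence time


def summarize(beta=0.7):
    """Log-scale regression slopes and the covariate shift in the biased sample."""
    z, t_lb, a = sample_length_biased_aft(beta=beta)
    slope_t = np.polyfit(z, np.log(t_lb), 1)[0]     # slope of log length-biased time on Z
    slope_a = np.polyfit(z, np.log(a), 1)[0]        # slope of log recurrence time on Z
    share = z.mean()                                # observed P(Z = 1) after biased sampling
    tilted = np.exp(-beta) / (1.0 + np.exp(-beta))  # P(Z = 1) tilted by exp(-beta * z)
    return slope_t, slope_a, share, tilted


if __name__ == "__main__":
    print(summarize())
```

In this setup both slopes should be close to -beta, while the observed share of Z = 1 moves away from 0.5 toward the tilted value, illustrating that the marginal covariate distribution depends on the regression parameter.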
Since the conditional distributions of and satisfy accelerated lifetime models, existing estimation procedures may naively be applied to obtain semiparametric estimators for . However, as the marginal distribution of the observed covariates depends on the parameter , it is unclear whether estimators derived from the conditional distributions of and will be fully efficient. Estimation using observed covariates has been considered in , and . Under restrictive assumptions on , for example, moment restrictions, improved estimation is possible. However, a comprehensive study of such issues with a completely unspecified covariate distribution and right censoring has not been undertaken for general length-biased sampling.
The main contribution of this paper is to show that a naive efficient estimator which ignores the dependence on in the marginal covariate distribution is still efficient for estimation of the regression parameter in length-biased and recurrence time data. Hence, the standard techniques that are used for estimation in accelerated failure time models can also be applied in these cases without loss of information. We provide the theoretical derivation of the efficient score in Section 2, present simulation results and a data analysis in Sections 3 and 4, and conclude with a discussion in Section 5.
2 Efficient Scores and Estimation
We start by defining the main assumptions and notation used in the paper, along with a description of the tangent spaces used in deriving the efficient score. The concepts and notation closely follow those given in Chapters 3 and 18 of .
Let be the class of all density functions on and be the class of all density functions on . The semiparametric model for the core accelerated lifetime model is given by
where the distribution has a density with respect to an absolutely continuous measure ,
where is the true parameter value. For the accelerated failure time model of the recurrence times, the semiparametric model is
where for , and
We assume that is a compact subset of . Further, we assume has density
Define . Let the true distribution be with . Define the separate submodels for each parameter, holding the other parameters fixed, as , and .
Let , and be the tangent spaces for at . By definition of tangent spaces in Chapter 18 of , these are all closed subsets of , where denotes square-integrable functions integrating to zero with respect to . For a density , let denote the space of square-integrable functions with respect to the measure . For a survival function , we similarly let denote the space of square-integrable functions with respect to the measure , even though may not integrate to 1. Tangent spaces for a given model represent the set of all likelihood score functions for one-dimensional submodels of the given model. The three tangent spaces just defined represent the likelihood based scores used to estimate the parameter given in the subscript while holding the remaining parameters fixed at their true values. Let be the ordinary score for when and are fixed. Then the efficient score function for in the full model at is , where denotes the orthogonal projection of onto the linear span of .
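In symbols, the projection definition of the efficient score reads as follows. This is a sketch in assumed notation, since the original symbols are not recoverable here: \dot\ell_\beta stands for the ordinary score of the regression parameter, \dot{\mathcal P}_1 and \dot{\mathcal P}_2 for the tangent spaces of the two nuisance parameters, and P_0 for the true distribution.

```latex
% Assumed notation (hypothetical): \dot\ell_\beta is the ordinary score,
% \dot{\mathcal P}_1, \dot{\mathcal P}_2 are the nuisance tangent spaces,
% \Pi(\cdot \mid \mathcal S) is orthogonal projection onto the closed subspace \mathcal S.
\[
\ell^{*}_{\beta}
  = \dot\ell_\beta
  - \Pi\bigl(\dot\ell_\beta \,\big|\, \overline{\operatorname{lin}}\,(\dot{\mathcal P}_1 + \dot{\mathcal P}_2)\bigr),
\qquad
I^{*}_{\beta} = E_{P_0}\bigl[\ell^{*}_{\beta}\,\ell^{*\top}_{\beta}\bigr].
\]
% If the two nuisance tangent spaces are mutually orthogonal, the projection splits:
\[
\Pi\bigl(\dot\ell_\beta \,\big|\, \dot{\mathcal P}_1 + \dot{\mathcal P}_2\bigr)
  = \Pi\bigl(\dot\ell_\beta \,\big|\, \dot{\mathcal P}_1\bigr)
  + \Pi\bigl(\dot\ell_\beta \,\big|\, \dot{\mathcal P}_2\bigr).
\]
```

The second display is the form that makes it possible to compute the two projections separately and subtract both from the ordinary score.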
2.1 Inference for Forward and Backward Recurrence Times
We now calculate the efficient score and information using the recurrence time potentially subject to right censoring, which covers both the forward and backward recurrence time settings. For the -th individual, we observe (), where is the recurrence time of the -th individual, is the time of right censoring, which is assumed to be independent of the recurrence time conditionally given the covariates, is the indicator of whether the event time is observed and is the observed covariate. Theorem 2.1 below demonstrates that in this setting, the efficient score equals that of the naive efficient estimator based on the conditional likelihood given the covariates.
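The observed data structure described above can be sketched in a few lines. The function and variable names are hypothetical, and the convention that a tie between the recurrence time and the censoring time counts as an observed event is an assumption.

```python
import numpy as np


def observe_with_censoring(recurrence, censoring):
    """Return the observed time and the event indicator under right censoring."""
    recurrence = np.asarray(recurrence, dtype=float)
    censoring = np.asarray(censoring, dtype=float)
    observed = np.minimum(recurrence, censoring)   # time actually recorded
    event = (recurrence <= censoring).astype(int)  # 1 if the recurrence time is observed (ties assumed events)
    return observed, event


if __name__ == "__main__":
    observed, event = observe_with_censoring([1.0, 3.0, 2.0], [2.0, 1.0, 2.0])
    print(observed, event)
```

For uncensored backward recurrence times, the same function can be called with infinite censoring times, in which case every event indicator equals one.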
For right-censored , we assume is a compact set in , and that belongs in the interior of . For fixed but arbitrary , we define our semiparametric model in terms of the distribution of and the corresponding censored variable . The conditional density of given is thus
while the conditional hazard is
Given the density of is monotone decreasing. We now state our assumptions:
and are independent given ;
The distribution of is independent of the parameters , and the distribution of is independent of the parameters ;
The last assumption is needed to ensure that the density of has finite Fisher information about . The next theorem gives that the efficient score equals that from the naive efficient estimator.
Theorem 2.1. Suppose that the covariate vector is almost surely bounded. Define
Then under (A1)–(A4) and with , the ordinary score for at is
the tangent space for is where the score operator for is given by
the tangent space for is , and the efficient score for at is
Proof. The likelihood for one observation is given by
Taking log and differentiating with respect to , we obtain the ordinary score for at ,
The expression in (2.3) for the ordinary score function for can be derived by noting that the quantity in brackets on the right-hand side of the above expression is a stochastic integral with respect to the counting-process martingale in (2.2), using Proposition A.3.6 in .
Next, we can conclude from Lemma 2.1 below that the tangent space for can be considered the maximal tangent space . Hence the tangent space for can be expressed through the one-dimensional submodels for any , which yield the one-dimensional baseline hazard submodels
Differentiating the likelihood with respect to and setting now yields the score given in (2.4). In order to find we find such that for all . That is . Note that . Conditioning on and and using the fact that is distributed independently of and we obtain
The second equality above is obtained by using the result that if , then
Thus for all if
Thus the projection of on is given by
Now for finding for , we consider the one-parameter path , where . The score operator for is given by
for . Note that . If is unrestricted then the tangent space can be taken to be the orthocomplement of the linear span of , i.e., in . Since is distributed independently of and , for any and , i.e., . Since , we obtain
Now replacing with and subtracting (2.6) and (2.7) from (2.3) yields the efficient score given in (2.5). The accelerated failure time model for given is equivalent to the log-linear model , where has hazard function
and is the baseline hazard for . In the current setting,
This model is the same as the linear regression model for but with a sign change on . The efficient score for the linear regression model under right censoring is given in Expression (27) on page 149 of and has the same form as (2.5), except for changes in parameter and variable notation. Specifically, the function in (2.5) equals the negative of defined in Expression (23) of , after replacing with , where the negative is due to the sign change. To see this, note that
Thus the efficient score is free of , so to estimate efficiently, one does not need to estimate the covariate distribution. Hence, one does not need to impose an additional identifiability condition for such as the mean-zero assumption. The efficient information is
where . This is somewhat complicated to estimate, but the approach described in Remark 2 of  will yield a consistent estimator which can be used for inference on .
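To illustrate how a standard accelerated failure time technique can be applied directly to right-censored forward recurrence times, the sketch below fits a Gehan-type rank-based estimator to simulated data. This is a stand-in chosen for simplicity: it is not the efficient estimator based on the efficient score above and does not attain the efficient information. The simulation design (uniform covariate, Gamma(2, 1) baseline, exponential censoring, model form T = exp(-beta * Z) * eps) and all function names are assumptions made for illustration.

```python
import numpy as np


def simulate_forward_recurrence(beta=0.5, n=500, pool=50_000, seed=2):
    """Right-censored recurrence times from a length-biased sample (assumed design)."""
    rng = np.random.default_rng(seed)
    z_pop = rng.uniform(-2.0, 2.0, size=pool)                       # bounded covariate
    t_pop = np.exp(-beta * z_pop) * rng.gamma(2.0, 1.0, size=pool)  # assumed AFT form
    # Length-biased sampling: selection probability proportional to T.
    idx = rng.choice(pool, size=n, replace=True, p=t_pop / t_pop.sum())
    a = rng.uniform(size=n) * t_pop[idx]                            # recurrence time
    c = rng.exponential(scale=4.0, size=n)                          # independent censoring
    y = np.minimum(a, c)
    delta = (a <= c).astype(float)
    return np.log(y), delta, z_pop[idx]


def gehan_loss(b, log_y, delta, z):
    """Convex Gehan loss for the log-linear model log Y = b * Z + error."""
    e = log_y - b * z                   # residuals on the log scale
    diff = e[None, :] - e[:, None]      # diff[i, j] = e_j - e_i
    return np.sum(delta[:, None] * np.maximum(diff, 0.0)) / len(e) ** 2


def fit_gehan(log_y, delta, z, lo=-5.0, hi=5.0, iters=60):
    """Minimize the convex Gehan loss over a scalar slope by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if gehan_loss(m1, log_y, delta, z) <= gehan_loss(m2, log_y, delta, z):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    log_y, delta, z = simulate_forward_recurrence()
    print("estimated slope:", fit_gehan(log_y, delta, z))
```

Under the assumed sign convention the fitted slope estimates -beta, reflecting the sign change between the accelerated lifetime parameterization and the linear regression form. The ternary search relies on the Gehan loss being convex in the slope and handles a single covariate only.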
Since the backward recurrence times are uncensored, we can assume that the censoring times are infinite with and . Thus the efficient score for the backward recurrence time simplifies to
and the efficient information becomes
Before presenting Lemma 2.1, we provide a few needed definitions. Let be the model consisting of densities on of the form , where ; and let and be the respective tangent sets for and at and at , where satisfies . Lemma 2.1 establishes that , which is needed in the proof of Theorem 2.1 to identify , a key technical step. We will be using score operators, which allow us to construct scores for a model of interest from scores for a simpler model (see, e.g., Chapter 18 of [12]).
Lemma 2.1. If is the score operator mapping tangents in to , then is dense in the maximal tangent set for , i.e., .
Proof. Let be the density on corresponding to . Consider the following parametric path through :
where is bounded, continuously differentiable with bounded derivative satisfying and . Note that is the closure within of the derivatives of curves with respect to and is the closure within of derivatives of curves with respect to , where
and is the survival function corresponding to . Thus, is the maximal non-parametric tangent set for while is the maximal tangent set for . The corresponding parametric submodel for is
Thus, the tangent set (which consists of scores with respect to the one parameter models ) is given by the operator
Let be the space of bounded functions on and be the subset of of functions which attain zero at all time points large enough. It is easy to verify that is dense in and that is dense in , and, moreover, that for all and that for all , where is the adjoint of defined as the solution to
for all and . This relation yields that .
By definition of , is the closed linear span of in , and thus . To prove the lemma, we need to verify that also holds. Suppose there is a which is not in . Then there exists a sequence such that and
This now implies that . We can now show that for any for which ,
as by previous arguments combined with some analysis. Since was an arbitrary choice for which , we obtain that -almost surely, and the desired conclusion follows. ∎
2.2 Inference for Length-Biased Data
A similar result may be obtained for length-biased data by replacing in the proof of Theorem 2.1 with , where, as before, is the density generating . This yields the following result:
Theorem 2.2. Using the same notation as Theorem 2.1 and under the same conditions, the efficient score for at for length-biased data is
where for ,
The proof of Theorem 2.2 follows the same lines as the proof of Theorem 2.1, with a few minor differences, which are outlined below. The likelihood for one observation is given by
Here, the actual form of is different from Theorem 2.1. Thus we need to replace with , where the map is as implicitly defined above just before the statement of Theorem 2.2. Specifically, this is the new model for holding and fixed at their true values. The other models and submodels are the same as for Theorem 2.1 except that and are replaced by and . Taking log of and differentiating with respect to we obtain as the ordinary score for at . The quantity in brackets on the right hand side is a stochastic integral with respect to the counting-process martingale in (2.14) and is thus also a martingale. Using this, we can obtain the ordinary score