1 Introduction
Many study designs involve cross-sectional sampling, which may lead to length-biased sampling of a time to event . In a prospective cohort study where the initiation time for the event is unknown and subjects are followed prospectively, a right-censored forward recurrence time is observed. This occurs, for example, in HIV seroprevalence studies [2], where time to AIDS following infection with HIV is of interest but the infection time is unknown. If the initiation time is known, but there is no follow-up, the backward recurrence time is observed. This current duration design [9] has been employed in pregnancy surveys, where current-trier couples provide the length of an ongoing attempt at pregnancy, and in mover-stayer models [16]. If both the initiation time and follow-up are available, a biased event time may be observed. When the sampling time is known, such data are commonly analyzed using methods for left-truncated data, where one conditions on the lack of an event prior to the sampling time, that is, .
In the aforementioned data setups, only subjects who have experienced an initiating event prior to sampling can potentially be sampled, and the sample is biased towards larger values of . If one assumes that the rate of the initiating event is stationary over time, e.g., the initiating events follow a homogeneous Poisson process, then the sampling time falls uniformly in the interval between the initiation and the event times [6, 15]. Letting denote the distribution of , the length-biased version has distribution , , where . Under the uniform sampling time assumption [6, 14, 9], , where is uniform(0,1) and independent of . Thus, both and will have the same density function. Since the density is the same, we use to denote both and . The density of is
(1.1) 
where is the survival function of . This well-known result, given in expression (2) of [8], can be derived from the uniformity of , which implies that the conditional density of given has the form
(1.2) 
yielding the joint density , from which (1.1) follows.
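The claim that the forward and backward recurrence times share one density can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the event time to be Gamma(2, 1), so that its length-biased version is Gamma(3, 1), and it assumes that (1.1) is the standard renewal-theory form, survival function divided by the mean. The function names are hypothetical.

```python
import numpy as np

def simulate_recurrence_times(n, seed=0):
    """Draw backward and forward recurrence times under uniform sampling.

    Assumption for illustration: the event time is Gamma(2, 1), so its
    length-biased version (density proportional to t * f(t)) is Gamma(3, 1).
    """
    rng = np.random.default_rng(seed)
    t_star = rng.gamma(3.0, 1.0, size=n)   # length-biased event time
    u = rng.uniform(size=n)                # uniform position of the sampling time
    backward = u * t_star                  # initiation to sampling
    forward = (1.0 - u) * t_star           # sampling to event
    return backward, forward

def recurrence_cdf(t):
    """CDF of the density S(t)/mu for Gamma(2, 1): S(t) = (1 + t) exp(-t), mu = 2."""
    return 1.0 - (1.0 + t / 2.0) * np.exp(-t)

def max_cdf_gap(sample, cdf):
    """Kolmogorov-Smirnov-type distance between the empirical CDF and `cdf`."""
    x = np.sort(sample)
    n = x.size
    upper = np.arange(1, n + 1) / n
    lower = np.arange(0, n) / n
    return float(max(np.max(np.abs(upper - cdf(x))), np.max(np.abs(lower - cdf(x)))))

backward, forward = simulate_recurrence_times(200_000)
gap_backward = max_cdf_gap(backward, recurrence_cdf)
gap_forward = max_cdf_gap(forward, recurrence_cdf)
print(gap_backward, gap_forward)
```

Both gaps should be of the order of one over the square root of the sample size, which is consistent with the two recurrence times having the same monotone density.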
In the presence of a covariate vector with density , one may formulate the effect of on the underlying event time via the accelerated lifetime model
(1.3) 
where is a regression parameter and is a nonnegative random variable with density , survival function and hazard function . In the semiparametric version of (1.3), with the distribution of completely unspecified, efficient estimation of without length bias is achievable with and without right censoring [17].
The observed covariate is also subject to length-biased sampling. Using arguments in [4] and [13], we first obtain that a consequence of (1.3) is that the joint density of and observed covariate is proportional to due to the length bias, yielding the joint density
(1.4) 
where the integral is over the range of and is the mean associated with density . This means that satisfies an accelerated life model, given by
(1.5) 
where is independent of with density of the form
and has the possibly nonmonotone density , .
The relation still holds for a uniform independent of , so we can multiply (1.4) by (1.2), integrate over , and replace with , to obtain the joint density of and observed covariate :
(1.6) 
Thus, if follows model (1.3), then the distribution of also follows an accelerated lifetime model
(1.7) 
where has a density of the form and has monotone density . Thus, the accelerated lifetime structure is maintained in models for the forward and backward recurrence times, as discussed in [10], as well as for the length-biased time.
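A small simulation can illustrate how the accelerated lifetime structure survives the sampling scheme while the covariate distribution is tilted. The sketch assumes a scalar standard normal covariate, a log-normal error, and the sign convention T = exp(beta * Z) * error for (1.3); these choices, the weighted resampling used to mimic length-biased sampling, and all names are assumptions made for illustration only.

```python
import numpy as np

def length_biased_sample(beta, n_pop=400_000, n_obs=50_000, seed=1):
    """Mimic length-biased sampling by resampling with probability proportional to T."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_pop)                         # population covariate
    eps = rng.lognormal(mean=0.0, sigma=0.5, size=n_pop)
    t = np.exp(beta * z) * eps                         # assumed form of the model
    idx = rng.choice(n_pop, size=n_obs, replace=True, p=t / t.sum())
    u = rng.uniform(size=n_obs)                        # uniform sampling position
    return z[idx], t[idx], u * t[idx]                  # covariate, biased time, backward time

def ols_slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

beta = 0.5
z_obs, t_obs, a_obs = length_biased_sample(beta)
slope_length_biased = ols_slope(z_obs, np.log(t_obs))  # log length-biased time on Z
slope_recurrence = ols_slope(z_obs, np.log(a_obs))     # log backward recurrence time on Z
mean_shift = float(z_obs.mean())                       # tilted covariate mean
print(slope_length_biased, slope_recurrence, mean_shift)
```

Under these assumptions both slopes should be close to beta, while the observed covariate mean moves from 0 towards beta, reflecting the dependence of the marginal covariate distribution on the regression parameter.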
Since the conditional distributions of and satisfy accelerated lifetime models, existing estimation procedures may naively be applied to obtain semiparametric estimators for . However, as the marginal distribution of the observed covariates depends on the parameter , it is unclear whether estimators derived from the conditional distributions of and will be fully efficient. Estimation using observed covariates has been considered in [5] and [3]. Under restrictive assumptions on , for example, moment restrictions, improved estimation is possible. However, a comprehensive study of such issues with completely unspecified covariate distribution and right censoring has not been undertaken for general length-biased sampling.
The main contribution of this paper is to show that a naive efficient estimator which ignores the dependence on in the marginal covariate distribution is still efficient for estimation of the regression parameter in length-biased and recurrence time data. Hence, the standard techniques used for estimation in accelerated failure time models can also be applied in these cases without loss of information. We provide the theoretical derivation of the efficient score in Section 2, simulation results and data analysis in Sections 3 and 4, and conclude with a discussion in Section 5.
2 Efficient Scores and Estimation
We begin by stating the main assumptions and notation used in the paper and describing the tangent spaces used in deriving the efficient score. The concepts and notation closely follow those in Chapters 3 and 18 of [12].
Let be the class of all density functions on and be the class of all density functions on . The semiparametric model for the core accelerated lifetime model is given by
where the distribution has a density with respect to an absolutely continuous measure ,
where is the true parameter value. For the accelerated failure time model of the recurrence times, the semiparametric model is
(2.1) 
where for , and
We assume that is a compact subset of . Further, we assume has density
Define . Let the true distribution be with . Define the separate submodels for each parameter, holding the other parameters fixed, as , and .
Let , and be the tangent spaces for at . By definition of tangent spaces in Chapter 18 of [12], these are all closed subsets of , where denotes the square-integrable functions that integrate to zero with respect to . For a density , let denote the space of square-integrable functions with respect to the measure . For a survival function , we similarly let denote the space of square-integrable functions with respect to the measure , even though may not integrate to 1. Tangent spaces for a given model represent the set of all likelihood score functions for one-dimensional submodels of the given model. The three tangent spaces just defined represent the likelihood-based scores used to estimate the parameter given in the subscript while holding the remaining parameters fixed at their true values. Let be the ordinary score for when and are fixed. Then the efficient score function for in the full model at is , where denotes the orthogonal projection of onto the linear span of [1].
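The projection that defines the efficient score has a simple finite-dimensional analogue, which may help fix ideas. The sketch below is a parametric illustration only, not the semiparametric calculation of this section: the nuisance tangent space is replaced by the span of a single nuisance score, the information matrix is an arbitrary example, and the function names are hypothetical.

```python
import numpy as np

def efficient_information(info, k):
    """Efficient information for the first k parameters of a Fisher information matrix.

    Computes I_bb - I_bn I_nn^{-1} I_nb, the variance of the residual of the
    score for the parameter of interest after projecting out the nuisance scores.
    """
    info = np.asarray(info, dtype=float)
    i_bb, i_bn, i_nn = info[:k, :k], info[:k, k:], info[k:, k:]
    return i_bb - i_bn @ np.linalg.solve(i_nn, i_bn.T)

def project_out(score_beta, score_nuisance):
    """Sample analogue: residual of the least-squares projection onto nuisance scores."""
    coef, *_ = np.linalg.lstsq(score_nuisance, score_beta, rcond=None)
    return score_beta - score_nuisance @ coef

info = np.array([[2.0, 1.0], [1.0, 2.0]])      # example information matrix (assumed)
eff_info = efficient_information(info, k=1)     # 2 - 1 * (1/2) * 1 = 1.5

# Monte Carlo check with mean-zero scores whose covariance equals `info`.
rng = np.random.default_rng(3)
scores = rng.multivariate_normal(np.zeros(2), info, size=200_000)
eff_score = project_out(scores[:, :1], scores[:, 1:])
mc_eff_info = float(np.var(eff_score))
mc_cross = float(np.mean(eff_score[:, 0] * scores[:, 1]))
print(eff_info[0, 0], mc_eff_info, mc_cross)
```

The residual is orthogonal to the nuisance score by construction, and its variance approximates the efficient information, mirroring the role of the projection onto the nuisance tangent spaces.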
2.1 Inference for Forward and Backward Recurrence Times
We now calculate the efficient score and information using the recurrence time potentially subject to right censoring, which covers both the forward and backward recurrence time settings. For the th individual, we observe (), where is the recurrence time of the th individual, is the time of right censoring, which is assumed to be independent of the recurrence time conditionally given the covariates, is the indicator of whether the event time is observed and is the observed covariate. Theorem 2.1 below demonstrates that in this setting, the efficient score equals that of the naive efficient estimator based on the conditional likelihood given the covariates.
For right-censored , we assume is a compact set in , and that belongs to the interior of . For fixed but arbitrary , we define our semiparametric model in terms of the distribution of and the corresponding censored variable . The conditional density of given is thus
while the conditional hazard is
Given the density of is monotone decreasing. We now state our assumptions:

(A1) and are independent given ;
(A2) The distribution of is independent of the parameters , and the distribution of is independent of the parameters ;
(A3) ;
(A4) .
The last assumption is needed to ensure that the density of has finite Fisher information about . The next theorem gives that the efficient score equals that from the naive efficient estimator.
Theorem 2.1.
Suppose that the covariate vector is almost surely bounded. Define
(2.2) 
and
Then under (A1)–(A4) and with , the ordinary score for at is
(2.3) 
the tangent space for is where the score operator for is given by
(2.4) 
the tangent space for is , and the efficient score for at is
(2.5) 
Proof.
The likelihood for one observation is given by
Taking logs and differentiating with respect to , we obtain the ordinary score for at ,
The expression in (2.3) for the ordinary score function for can be derived by noting that the quantity in brackets on the right-hand side of the above expression is a stochastic integral with respect to the counting-process martingale in (2.2), using Proposition A.3.6 in [1].
Next, we can conclude from Lemma 2.1 below that the tangent space for can be considered the maximal tangent space . Hence the tangent space for can be expressed through the one-dimensional submodels for any , which yield the one-dimensional baseline hazard submodels
Differentiating the likelihood with respect to and setting now yields the score given in (2.4). To find , we seek such that for all . That is, . Note that . Conditioning on and and using the fact that is distributed independently of and , we obtain
The second equality above is obtained by using the result that if , then
Thus for all if
Thus the projection of on is given by
(2.6) 
To find for , we consider the one-parameter path , where . The score operator for is given by
for . Note that . If is unrestricted then the tangent space can be taken to be the orthocomplement of the linear span of , i.e., in . Since is distributed independently of and , for any and , i.e., . Since , we obtain
(2.7) 
Now replacing with and subtracting (2.6) and (2.7) from (2.3) yields the efficient score given in (2.5). The accelerated failure time model for given is equivalent to the log-linear model , where has hazard function
(2.8) 
and is the baseline hazard for . In the current setting,
(2.9) 
This model is the same as the linear regression model for
but with a sign change on . The efficient score for the linear regression model under right censoring is given in Expression (27) on page 149 of [1] and has the same form as (2.5), except for changes in parameter and variable notation. Specifically, the function in (2.5) equals the negative of defined in Expression (23) of [1], after replacing with , where the negative is due to the sign change. To see this, note that
The first row follows from (2.8), the second row follows from the substitution followed by (2.9), and the last row follows from the definitions of and . ∎
Thus the efficient score is free of , so to estimate efficiently, one does not need to estimate the covariate distribution. Hence, one does not need to impose an additional identifiability condition for such as the mean-zero assumption. The efficient information is
(2.10) 
where . This is somewhat complicated to estimate, but the approach described in Remark 2 of [17] will yield a consistent estimator which can be used for inference on .
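Since the efficient score does not involve the covariate distribution, the recurrence time can be treated as an ordinary right-censored response in a log-linear model. As a rough sketch of that workflow, the code below fits such a model to simulated right-censored data with a Gehan-type rank loss. This is a standard and generally inefficient rank estimator used as a stand-in, not the efficient estimator characterized by (2.5). The scalar covariate, the sign convention log T = beta * Z + error, the error and censoring distributions, and all function names are assumptions made for illustration.

```python
import numpy as np

def gehan_loss(beta, log_y, delta, z):
    """Convex Gehan loss: average over pairs (i, j) of delta_i * max(e_j - e_i, 0)."""
    e = log_y - beta * z                       # residuals on the log scale
    diff = e[None, :] - e[:, None]             # diff[i, j] = e_j - e_i
    return float(np.mean(delta[:, None] * np.maximum(diff, 0.0)))

def golden_section_minimize(f, lo, hi, tol=1e-4):
    """Minimize a unimodal function of one variable on [lo, hi]."""
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                            # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                                  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Simulated right-censored data (all distributional choices are assumptions).
rng = np.random.default_rng(2)
n, beta_true = 500, 1.0
z = rng.normal(size=n)
log_t = beta_true * z + np.log(rng.exponential(size=n))   # log event time
log_c = np.log(rng.exponential(scale=5.0, size=n))        # log censoring time
log_y = np.minimum(log_t, log_c)                          # observed log time
delta = (log_t <= log_c).astype(float)                    # event indicator

beta_hat = golden_section_minimize(
    lambda b: gehan_loss(b, log_y, delta, z), -3.0, 3.0
)
print(beta_hat, delta.mean())
```

The same routine would be applied to the observed recurrence times, censoring indicators and observed covariates. An efficient procedure would replace the Gehan weights with an estimate of the optimal weight function.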
Since the backward recurrence times are uncensored, we can assume that the censoring times are infinite with and . Thus the efficient score for the backward recurrence time simplifies to
(2.11) 
and the efficient information becomes
(2.12) 
Before presenting Lemma 2.1, we provide a few needed definitions. Let be the model consisting of densities on of the form , where ; and let and be the respective tangent sets for and at and at , where satisfies . Lemma 2.1 establishes that , which is needed in the proof of Theorem 2.1 to identify , a key technical step. We will use score operators, which allow us to construct scores for a model of interest from scores for a simpler model (see, e.g., Chapter 18 of [12]).
Lemma 2.1.
If is the score operator mapping tangents in to , then is dense in the maximal tangent set for , i.e., .
Proof.
Let be the density on corresponding to . Consider the following parametric path through :
where is bounded, continuously differentiable with bounded derivative satisfying and . Note that is the closure within of the derivatives of curves with respect to and is the closure within of derivatives of curves with respect to , where
and is the survival function corresponding to . Thus, is the maximal nonparametric tangent set for while is the maximal tangent set for . The corresponding parametric submodel for is
Thus, the tangent set (which consists of scores with respect to the one parameter models ) is given by the operator
Let be the space of bounded functions on and be the subset of consisting of functions that vanish at all sufficiently large time points. It is easy to verify that is dense in and that is dense in , and, moreover, that for all and that for all , where is the adjoint of defined as the solution to
for all and . This relation yields that .
By definition of , is the closed linear span of in , and thus . To prove the lemma, we need to verify that also holds. Suppose there is a which is not in . Then there exists a sequence such that and
This now implies that . We can now show that for any for which ,
as by previous arguments combined with some analysis. Since was an arbitrary choice for which , we obtain that almost surely, and the desired conclusion follows. ∎
2.2 Inference for Length-Biased Data
A similar result may be obtained for length-biased data by replacing in the proof of Theorem 2.1 with , where, as before, is the density generating . This yields the following result:
Theorem 2.2.
Using the same notation as Theorem 2.1 and under the same conditions, the efficient score for at for length-biased data is
(2.13) 
where for ,
with ,
(2.14) 
and
(2.15) 
Proof.
The proof of Theorem 2.2 follows the same lines as that of Theorem 2.1, with a few minor differences outlined below. The likelihood for one observation is given by
Here, the actual form of differs from that in Theorem 2.1. Thus we need to replace with , where the map is as implicitly defined above just before the statement of Theorem 2.2. Specifically, this is the new model for holding and fixed at their true values. The other models and submodels are the same as for Theorem 2.1 except that and are replaced by and . Taking the log of and differentiating with respect to , we obtain as the ordinary score for at . The quantity in brackets on the right-hand side is a stochastic integral with respect to the counting-process martingale in (2.14) and is thus also a martingale. Using this, we can obtain the ordinary score