1 Introduction
Studies with long followup often suffer from high dropout rates. Dropout can depend on the outcome of interest even after adjusting for observed covariates, which makes the dropouts “nonignorable” and biases analyses based solely on the nondropouts (Rubin, 1976). To handle nonignorable dropouts, doublesampling designs allocate additional resources to pursue a sample of the dropouts and ascertain their outcomes (Baker et al., 1993; Glynn et al., 1993; Frangakis and Rubin, 2001; Cochran, 2007). Doublesampling can be more practical or more informative if it targets dropouts whose history at the dropout time fits specific profiles. For example, An et al. (2009) found that such “profile designs” can save more than 35% of resources compared to the standard doublesampling design. In addition, it can be more practical to doublesample relatively recent dropouts. For estimation in doublesampling designs, An et al. (2014) employed a parametric approach to estimate the survival probability. However, analyses of such designs are more reliable if they do not rely heavily on parametric assumptions.
Robins et al. (2001) suggested a possible way of deriving a semiparametric estimator for doublesampling designs, but that proposal, like other existing ones, relies on first conjecturing and then verifying results “by hand”. Such a process is prone to human error and is not deductive, i.e., not generalizable. Recently, Frangakis et al. (2015) proposed the idea of making semiparametric estimation deductive. In contrast to the classical semiparametric framework, which relies heavily on conjecturing and verifying the analytic form of the efficient influence function (EIF), their approach produces semiparametric locally efficient estimators by utilizing the Gateaux derivative representation of the EIF in nonparametric models. This deductive estimation idea can save substantial human effort on difficult mathematical derivations, effort that can be transferred, for example, to designing new studies. One limitation, though, is that their approach requires analytically evaluating the estimand at a mixture of a continuous distribution and a point mass. This derivation is feasible in certain cases (e.g., the twophase design examples in Frangakis et al. (2015)), but becomes highly errorprone when done by hand in complicated problems such as estimating survival probability in the doublesampling design.
In this paper, we develop a semiparametric estimator for the mortality rate in doublesampling designs by adapting the deductive estimation method of Frangakis et al. (2015) to incorporate a novel “discrete support” structure. The discrete support is a space constructed from the observed data, which enables (i) approximation of the observed data distribution, and (ii) numerical computation for constructing semiparametric estimators, including augmenting fitted models and perturbing a continuous distribution by a point mass at an observed data point. This discretization technique has its roots in Chamberlain (1987). The proposed method is expected to produce semiparametric locally efficient estimators in finitely many steps without requiring knowledge of the EIF. By avoiding the need to evaluate the estimand at a mixture of a continuous distribution and a point mass, the proposed method overcomes the limitation of Frangakis et al. (2015) and yields computerizable semiparametric estimation for complex designs such as the doublesampling design.
We apply the method to estimating the mortality rate using data from a doublesampling design component of the President’s Emergency Plan for AIDS Relief (PEPFAR), an HIV monitoring and treatment program. In addition, we explore and discuss how the estimated mortality rate is impacted by certain restrictions on the doublesampling as a scientifically interesting problem, because doublesampling can be more pragmatic if restricted to relatively recent dropouts (An et al., 2014).
The paper is organized as follows. In Section 2, we introduce the doublesampling design and the parameter of interest with its identifiability conditions. In Section 3, we review the deductive estimation framework in Frangakis et al. (2015), and present the proposed estimation method for survival probability in doublesampling designs. In particular, we present and discuss how the discrete support idea facilitates the deductive estimation in doublesampling designs. In Section 4, we apply the method to PEPFAR and analyze the impact of selection criteria on doublesampling. Section 5 concludes with remarks.
2 Doublesampling design and quantity of interest
For clarity, we present arguments with a doublesampling design as shown in Figure 1, which is modified from An et al. (2014, Figure 1). First, we describe characteristics that are inherent to a patient (i.e., potential outcomes (Rubin, 1976)), which would be realized if the program could enroll and follow a patient indefinitely with a standard effort. Then, we describe the actual design consisting of two phases.
Patient inherent characteristics and quantity of interest. For each patient, there is an enrollment time , covariates at enrollment, and a survival time (time from enrollment to death). The quantity of interest is for a given , which is the proportion of patients surviving beyond year for a population represented as a random sample by the enrolled patients. If the program were to follow patients indefinitely with a standard effort while they are alive, some patients would discontinue contact from the monitoring (patient types (b1), (c), (d), (e1), and (e2) in Figure 1). For these patients, labeled true dropouts and indicated by , denote by the time from enrollment to dropout and denote by any information available at dropping out in addition to . For instance, may include and some longitudinal measurements.
Phase 1 of the actual design and missingness due to administrative censoring. In the first phase, the actual design enrolls patients at times and monitors them with standard effort, not indefinitely, but until “Now”—the present time. The time from enrollment to “Now” is the time to administrative censoring, denoted by . For patient , define if and otherwise. Define .
Based on Phase 1 alone, the standard survival data are not observed for dropouts whose time to dropout satisfies . Denote such observed dropout patients by (patient types (c), (d), (e1), and (e2) in Figure 1).
Phase 2: information after doublesampling. Phase 2 of the design selects a subset of the observed dropouts, called a doublesample, by using characteristics of the patient known up to the dropout time . Additional resources are allocated for searching for and finding this doublesample at “Now”. Such doublesampling is expressed as follows:
Condition 1. Patientdependent doublesampling. For each observed dropout, the indicator of being selected as a doublesample can depend on the patient characteristics up to dropout time ; after we condition on , doublesampling is independent of survival and enrollment times:
The design must also address the information missing due to administrative censoring. This can be done in practice by limiting the design to a window period of enrollment, within which the following holds:
Condition 2. Exchangeability of enrollment times within a time period. Patients enrolled at different times (equivalently, having different time to administrative censoring ) have similar survival times after conditioning on :
Under Conditions 1 and 2, the estimand is identifiable from the following components of the distribution of the observed data:
(1)  
(2) 
In particular, by Condition 1, the distribution from the doublesampled individuals is the same as the one for all dropouts, , and so, together with the second component of (2), gives, upon averaging over , . That, together with (1), gives . Denote by the function that takes as input an arbitrary distribution of arising from independent survival and censoring times and returns the survival probability beyond (this function is the common probability limit of the Nelson-Aalen and Kaplan-Meier estimators). Then, by Condition 2, the estimand is calculable from the above distributions as
(3) 
The calculation described above involves regression distributions that need to be estimated with robust and deductive (i.e., easily computerizable) methods. We describe such a method next.
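The functional appearing in (3) is the common probability limit of the Nelson-Aalen and Kaplan-Meier estimators. A minimal sketch of the Kaplan-Meier computation it corresponds to (the function and variable names below are our own illustration, not the paper's notation):

```python
import numpy as np

def km_survival(x, delta, t):
    """Kaplan-Meier estimate of P(T > t) from follow-up times x
    and event indicators delta (1 = death observed, 0 = censored)."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = 1.0
    for u in np.unique(x[delta == 1]):   # distinct event times, sorted
        if u > t:
            break
        at_risk = np.sum(x >= u)
        deaths = np.sum((x == u) & (delta == 1))
        surv *= 1.0 - deaths / at_risk   # product-limit update
    return surv

# With times [1, 2, 3, 4] and indicators [1, 0, 1, 1]:
print(km_survival([1, 2, 3, 4], [1, 0, 1, 1], 2.5))  # prints 0.75
```

Applied to a consistent estimate of the joint distribution of time and event indicator, this functional returns the survival probability that (3) combines across strata.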
3 Deductive estimation of survival probability in doublesampling designs
3.1 Overview: computerizable and deductive estimation
An estimand can be viewed as a functional of a distribution on the data. For example, in the doublesampling design described above, expression (3) shows that is a function of the distribution of the observed data. Estimation of requires modeling assumptions because the expression (3) involves regressions that, in practice, cannot be estimated fully nonparametrically. In particular, suppose the estimand in (3) has a nonparametric efficient influence function (EIF) denoted by , where represents the remaining components of the distribution, other than . We wish to find deductive estimators that solve
(4) 
after substituting for estimates of a working model. Such estimators are consistent and locally efficient if the working estimators of are consistent with convergence rates faster than (van der Vaart, 2000). By “deductive”, we mean that the estimator should not rely on conjectures about the functional form of , and should be guaranteed to produce an estimate in the sense of Turing (1937) (i.e., use a discrete and finite set of instructions and, for every input, finish in finitely many steps).
Recently, a deductive estimation method has been proposed that does not need the analytic form of EIF (Frangakis et al., 2015). The key idea is that, for any working distribution for , we can calculate numerically the EIF as the numerical Gateaux derivative after perturbing the working distribution by a small mass at each observed data point; i.e.,
(5) 
and where is the perturbed distribution, i.e., a mixture of a continuous distribution and a point mass. Then, we can find the best working distribution parameter as one that makes the sum of the numerical EIFs, , equal to zero. The estimator that approximately solves (4) is then . The standard error of the estimator can be estimated by , with denoting the sample size. The estimator has consistency properties beyond those of the MLE for the same working model , since the former depends only on the features of that remain in the nonparametric EIF. For example, Frangakis et al. (2015) show that, in the twophase design, such an estimator shares the double robustness of estimators that take the EIF form as given. The more general conditions relating the true distribution and the working model are stated in the Appendix, and specific conditions for working models are given in Section 3.2.
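The perturb-and-differentiate idea in (5) can be sketched on a toy estimand. Here the estimand is the mean of a discrete distribution (a stand-in of ours, not the survival functional of Section 2), for which the numerical Gateaux derivative recovers the known EIF, x minus the mean:

```python
import numpy as np

def psi(points, probs):
    """Toy estimand: the mean of a discrete distribution."""
    return np.dot(points, probs)

def numerical_eif(points, probs, i, eps=1e-4):
    """Numerical Gateaux derivative of psi at (points, probs),
    perturbed by a small point mass at points[i], as in (5)."""
    pert = (1 - eps) * np.asarray(probs, dtype=float)
    pert[i] += eps
    return (psi(points, pert) - psi(points, probs)) / eps

points = np.array([0.0, 1.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])   # mean = 1.4
# For the mean, the EIF at x is x - mean, which the numerical
# derivative recovers (here essentially exactly, psi being linear):
print([round(numerical_eif(points, probs, i), 6) for i in range(3)])
# prints [-1.4, -0.4, 1.6]
```

The same recipe applies to any computerizable estimand: no analytic form of the EIF is needed, only the ability to evaluate the estimand at the perturbed distribution.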
The challenge in directly applying the above deductive estimation approach to the doublesampling design is the evaluation of . Although for certain problems the estimand at the perturbed distribution can be obtained relatively easily (such as the average outcome in the twophase design example from Frangakis et al. (2015)), we found the perturbation difficult to derive analytically for the survival probability considered in this paper. To avoid analytically evaluating , we combine the deductive estimation procedure with the discrete support idea, described next.
3.2 Incorporating discrete support for estimation in doublesampling designs
The general idea of the discrete support is to approximate continuous working models ( in the previous section) by discrete ones. If the support of the discrete working model contains all the observed data points, perturbing the working model by a point mass is as simple as changing the values in a probability table. This discretization idea was used in Chamberlain (1987) to facilitate the derivation of nonparametric efficient influence functions.
In the doublesampling design, the observed data for patient is , , and if or ( and ). Here, (which includes ) and are undefined if , and are missing if and . We define the discrete support (DS) for the doublesampling design to be
(6) 
where is the Cartesian product space of the unique values of and the unique values of for patients with , and is the Cartesian product space of the unique values of and the unique values of for patients with . Figure 2 gives an illustration of the spaces and . Note that, by the nature of the doublesampling design, among the individuals with , those who were not doublesampled will have , and we include this in the set of unique values of in .
Formally, we define
The set DS contains as its elements all observed points and the additional points needed for identifiability of the estimand and for the deductive estimation procedure. As we will see below, this construction of DS for the doublesampling design enables (i) approximation of the observed data distribution, (ii) extension of fitted working models, and (iii) perturbation of a working distribution by a point mass at an observed data point. Note that, instead of constructing an overall Cartesian product space, we keep the quantities that can be estimated reliably from the data (i.e., ), and construct product spaces conditional on and , respectively.
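A sketch of the product-space construction, using hypothetical field names ('r' for the observed-dropout indicator, 's' and 'z' for the variables whose unique values are crossed); the special handling of NA values described above is omitted:

```python
from itertools import product

def discrete_support(records):
    """Build one Cartesian product space per stratum of the
    dropout indicator, each from the unique observed values
    within that stratum. Field names are illustrative stand-ins,
    not the paper's notation."""
    strata = {}
    for r in (0, 1):
        group = [rec for rec in records if rec["r"] == r]
        s_vals = sorted({rec["s"] for rec in group})
        z_vals = sorted({rec["z"] for rec in group})
        strata[r] = list(product(s_vals, z_vals))
    return strata

records = [
    {"r": 1, "s": 2.0, "z": 10},
    {"r": 1, "s": 5.0, "z": 12},
    {"r": 1, "s": 2.0, "z": 12},
    {"r": 0, "s": 3.0, "z": 11},
]
ds = discrete_support(records)
# Every observed (s, z) pair lies in its stratum's product space,
# plus combinations never observed together, e.g. (5.0, 10):
print(len(ds[1]), len(ds[0]))  # prints 4 1
```

The product space deliberately contains unobserved combinations; a discrete working model then assigns (possibly small) probabilities to all of them, which is what makes the point-mass perturbation a simple table update.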
With a slight abuse of notation, in the following we denote by an arbitrary (discrete) probability distribution on the discrete support set DS. After constructing the discrete support set DS, the deductive estimation procedure to construct a semiparametric estimator for in the doublesampling design is as follows:
(step 1) : Construct and code the function that takes an arbitrary probability distribution on DS, and outputs (through averaging and normalization) the decomposition (1)-(2) and the value based on (3). This step depends solely on the identification result for the estimand.
(step 2) : Construct and code the function that takes an arbitrary probability distribution on DS, an index , and a small , and outputs the perturbed mixture distribution
Given , the perturbed probability at an arbitrary point can be computed by if , and if , where is the probability of under the (discrete) probability measure . This step is straightforward because the perturbation is conducted on a discrete set.
(step 3) : Fit a working distribution on DS using working models. Then extend to a model with a tuning parameter , so that the left-hand side of (7) below can explore values around 0. We give the exact form of for the doublesampling design in Section 3.3.
(step 4) : Find the parameter that solves
(7) 
where is defined in (5). For computing , is computed by step 3; is computed by step 2; and and are computed by step 1.
(step 5) : The resulting estimator is , which is computed by the function defined in step 1. The standard error of can be estimated by .
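Steps 2-5 can be sketched end-to-end on a toy problem: the estimand is the mean (a stand-in for the survival functional), the working model is a deliberately mis-specified probability table on the discrete support, and the extension is an exponential tilting in a scalar alpha (a stand-in for the extension of Section 3.3):

```python
import numpy as np

# Toy data, discrete support = unique observed values.
data = np.array([0.0, 1.0, 1.0, 3.0])
support = np.unique(data)             # discrete support
idx = np.searchsorted(support, data)  # index of each datum in DS

def extended(alpha, base):
    """Step 3: extend the working model by exponential tilting."""
    w = base * np.exp(alpha * support)
    return w / w.sum()

def psi(p):
    """Step 1: the (toy) estimand at a distribution on DS."""
    return np.dot(support, p)

def eif_sum(alpha, base, eps=1e-5):
    """Left-hand side of (7): sum of numerical EIFs over the data."""
    p = extended(alpha, base)
    total = 0.0
    for i in idx:                     # step 2: perturb at each datum
        pert = (1 - eps) * p
        pert[i] += eps
        total += (psi(pert) - psi(p)) / eps
    return total

base = np.array([0.5, 0.3, 0.2])      # mis-specified working model
lo, hi = -5.0, 5.0                    # step 4: bisection for alpha
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if eif_sum(mid, base) > 0:
        lo = mid
    else:
        hi = mid
alpha_hat = 0.5 * (lo + hi)
print(round(psi(extended(alpha_hat, base)), 4))  # step 5: prints 1.25
```

Even though the working probabilities are wrong, solving (7) drives the tilted model's mean to the sample mean 1.25, illustrating in miniature the robustness property discussed next.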
The estimator produced by the above method has the following robustness property. Suppose we decompose , where isolates the components of the distribution that are modeled through in step 3, and is the remaining part of . Generally, the expected EIF, , is zero at the truth but possibly also at other values of (e.g., double robustness). Under regularity conditions usually needed for consistency with estimating equations, the above estimator, when taking a working model , should be consistent if the model includes a distribution that satisfies condition (12) of the Appendix; that is, a distribution that zeros out the limit EIF and gives the correct value of the estimand.
3.3 Working and extended model
Here we describe the working model and its extension used in step 3 of the estimation procedure in Section 3.2.
The working model on the discrete support is a discrete probability distribution on . The value of at an arbitrary point in is calculated as follows: for the doublesampled subjects ( so that is not NA), we have
(8) 
for the other subjects with , we have
(9) 
We now describe how each term on the right-hand side of (8) and (9) is calculated. For , , and , we use the empirical distributions. The selection for doublesampling is modeled using logistic regression. The distribution is modeled as the likelihood arising from independent and given : is computed as , and is computed as . This working independence between and need not hold and is not expected to hold; only the resulting likelihood for is used. The working distributions for the censoring time and the survival time are each taken as Cox proportional hazards model fits. The fitted distribution is then normalized to sum to 1 over all pairs in , conditional on each pair of in .
The working model on is calculated analogously. For an arbitrary point in , we compute by
where a similar normalization is conducted on . The working models on and are chosen to be variationally independent; i.e., not sharing parameters.
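The normalization onto the discrete support can be sketched as follows, with a constant-hazard exponential model standing in for the Cox proportional hazards fits used in the text (the hazard value is arbitrary):

```python
import math

def discretized_probs(support_times, hazard):
    """Evaluate a fitted continuous survival model on the discrete
    support and renormalize so the probabilities sum to 1. An
    exponential density with the given hazard stands in for the
    Cox model fits described in the text."""
    dens = [hazard * math.exp(-hazard * t) for t in support_times]
    total = sum(dens)
    return [d / total for d in dens]

probs = discretized_probs([0.5, 1.0, 2.0, 4.0], hazard=0.3)
print(round(sum(probs), 10))  # prints 1.0
```

In the actual procedure the normalization is done conditionally within each cell of the support, but the principle is the same: the continuous fit supplies relative weights, and the discrete working model is the renormalized table.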
The working model is then extended to a model by adding a 1dimensional parameter to and , while leaving the other factors of unchanged. Denoting by the two preextension models for and , the extension models with are taken as:
(10) 
where , . This extension can increase the probability for larger values (with ) or smaller values (with ), and was verified numerically to be able to explore values for (7) around 0, where other attempted extensions could not. Note that when , gives the original working model.
4 Application to PEPFAR
We apply our method to estimating the mortality rate using the data set from a doublesampling design component of the President’s Emergency Plan for AIDS Relief (PEPFAR), an HIV monitoring and treatment program in East Africa evaluating antiretroviral treatment (ART) for HIVinfected people (Geng et al., 2015). The data set consists of 1,773 HIVinfected adults from Morogoro, Tanzania, who started ART after entering the study. There were 673 dropouts during the study, of whom 91 patients were doublesampled. We use baseline age and pretreatment CD4 value as , and the loss to followup time and the CD4 value measured at the last visit before dropout as .
Estimated mortality rate. The black curve in the left panel of Figure 3 is the estimated -year mortality rate for using the method in Section 3, along with its pointwise 95% confidence interval obtained by normal approximation. For example, the deductively estimated 1-year mortality rate is 11.4% with standard error 0.8%. For comparison, a stratified Kaplan-Meier approach gives an estimated 1-year mortality rate of 14% (not shown). The yellow curve in the left panel of Figure 3 is the estimated mortality rate when forcing in step 5 of Section 3, which corresponds to the estimator obtained by directly plugging the fitted working model into the estimand, , without the additional steps to find that solves (7).

Sensitivity to doublesampling selection criteria. For logistical reasons, it may be more feasible in practice to doublesample relatively recent dropouts (An et al., 2014). A natural question is how such selection criteria for the doublesamples would impact the estimated mortality rate. To answer this question, we parametrize the selection rule by a scalar , where only the dropouts with are eligible to be sampled in the second phase. For different values of , Table 1 lists the proportion of doublesamples in the PEPFAR data set that satisfy and the corresponding deductively estimated 1-year mortality rate. For a given , we obtain the estimated mortality rate by first setting and for the doublesamples in the data set with , and then applying the method in Section 3. The estimated 1-year mortality rate and its standard error appear to be only slightly impacted by . When days (i.e., when only the most recent 19 doublesamples are included), there is too little variation in the included doublesamples, so the Cox model cannot be fitted and the estimation procedure breaks down.
(days)  Included doublesamples (%)  Estimate (%)  Standard error (%)  

262  20  NA  NA  NA 
344  30  1.45  12.2  0.69 
382  40  1.58  12.5  0.71 
454  50  1.59  12.3  0.72 
494  60  1.57  10.8  0.71 
566  70  1.61  12.1  0.78 
652  80  1.56  11.5  0.77 
897  90  1.57  11.9  0.79 
1061  100  1.53  11.7  0.80 
The right panel in Figure 3 shows the estimated -year mortality rates without restriction on doublesampling (black curve) and when restricting the design to doublesample only the patients who dropped out within the past two years (blue curve) or the past year (yellow curve). These correspond to , 750, or 365 days; that is, to including all, 85%, or 36% of the doublesamples in the data set. The three curves give similar estimated mortality rates up to about 1 year. The yellow curve (restricting doublesampling to dropouts from the past year) starts to diverge from the other two as grows beyond 1 year. We conclude that different selection criteria can result in quite different estimates, and evaluation and optimization of the doublesampling selection criteria is an area for future research.
5 Remarks
We proposed a deductive method to produce semiparametric estimators for estimating survival probability in the doublesampling design. The method is easily computerizable by incorporating the discrete support structure into the approach in Frangakis et al. (2015). We applied the method to a doublesampling component at a site of the PEPFAR program.
It would be interesting to compare the proposed estimator with estimators derived from the explicit form of the EIF. While Robins and Rotnitzky (2001) briefly describe, through a set of equations, the conditions that the EIF satisfies, we found that solving those equations by hand carried a high risk of introducing errors. We are also unaware of any work that gives a closed-form expression for the EIF.
For the proposed estimation procedure in Section 3.2, the construction of the discrete support, step 1, and step 3 are problemspecific, whereas steps 2, 4 and 5 are generic to other nonparametric settings. As we discussed in Section 3, when constructing the discrete support, a necessary condition is that it should contain all the observed data points so that perturbation can be easily calculated. It should also facilitate approximation of the observed data distribution with a discrete working model. It remains an open question to elucidate the principles for constructing the discrete support and for extending the working model on the discrete support. We expect the proposed estimation procedure to be applicable to other designs after finding a satisfactory answer to this question.
In the literature, discussions of robust estimators have been partly based on characterizing robust estimating functions; see, for example, Robins et al. (2000) and Robins and Rotnitzky (2001). Although our estimator is obtained by (numerically) solving the EIF equation, it is more difficult to find analytically all the conditions under which the estimator is consistent, precisely because the focus is on problems in which the EIF is difficult to derive analytically. Perhaps, therefore, a supplemental numerical method exists that can also characterize these conditions more intuitively.
Appendix
Suppose the working model assumes the true distribution belongs to some set . Then the estimator, say , that solves the nonparametric EIF within the working model will, in the limit, be
(11) 
where denotes the expectation under . Therefore, by denoting , and assuming sufficient smoothness of the distributions, the estimator will converge to the true value under the following joint conditions:
(12) 
The analytic form of the above expressions may not be easily accessible when the EIF’s form is not. Suppose, however, that a computational method can easily determine just the “zeros” of the expressions, that is, given any , the features of that are restricted in order to satisfy conditions (12). With such a method, coupled with the method of deductive estimation, the researcher can focus efforts on clarifying and modeling especially well those restricted features, as this would provide approximate consistency of the estimator.
Acknowledgments
The authors would like to acknowledge Beverly S. Musick of Indiana University for compiling the database on which this study was based and for offering expert advice on the data. This work was supported by the U.S. National Institute on Drug Abuse (R01 AI10271001). The statements in this work are solely the responsibility of the authors and do not represent the views of this organization.
References
 An et al. (2009) An MW, Frangakis CE, Musick BS, Yiannoutsos CT (2009) The need for doublesampling designs in survival studies: An application to monitor PEPFAR. Biometrics 65(1):301–306
 An et al. (2014) An MW, Frangakis CE, Yiannoutsos CT (2014) Choosing profile doublesampling designs for survival estimation with application to President’s Emergency Plan for AIDS Relief evaluation. Statistics in Medicine 33(12):2017–2029

 Baker et al. (1993) Baker SG, Wax Y, Patterson BH (1993) Regression analysis of grouped survival data: informative censoring and double sampling. Biometrics 49(2):379–389
 Chamberlain (1987) Chamberlain G (1987) Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34(3):305–334
 Cochran (2007) Cochran WG (2007) Sampling techniques. John Wiley & Sons
 Frangakis and Rubin (2001) Frangakis CE, Rubin DB (2001) Addressing an idiosyncrasy in estimating survival curves using double sampling in the presence of selfselected right censoring. Biometrics 57(2):333–342
 Frangakis et al. (2015) Frangakis CE, Qian T, Wu Z, Diaz I (2015) Deductive derivation and Turingcomputerization of semiparametric efficient estimation. Biometrics 71(4):867–874
 Geng et al. (2015) Geng EH, Odeny TA, Lyamuya RE, NakiwoggaMuwanga A, Diero L, Bwana M, Muyindike W, Braitstein P, Somi GR, Kambugu A, et al. (2015) Estimation of mortality among HIVinfected people on antiretroviral treatment in east Africa: a sampling based approach in an observational, multisite, cohort study. The Lancet HIV 2(3):e107–e116

 Glynn et al. (1993) Glynn RJ, Laird NM, Rubin DB (1993) Multiple imputation in mixture models for nonignorable nonresponse with followups. Journal of the American Statistical Association 88(423):984–993
 Robins and Rotnitzky (2001) Robins J, Rotnitzky A (2001) Comments. Statistica Sinica 11(4):920–936
 Robins et al. (2001) Robins J, Rotnitzky A, Bonetti M (2001) Discussion of the Frangakis and Rubin article. Biometrics 57(2):343–347
 Robins et al. (2000) Robins JM, Rotnitzky A, van der Laan M (2000) Comment. Journal of the American Statistical Association 95(450):477–482, DOI 10.1080/01621459.2000.10474224
 Rubin (1976) Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
 Turing (1937) Turing AM (1937) On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 2(1):230–265
 van der Vaart (2000) van der Vaart AW (2000) Asymptotic Statistics. Cambridge University Press