Deductive semiparametric estimation in Double-Sampling Designs with application to PEPFAR

Robust estimators in missing data problems are often based on semiparametric estimation, which usually requires the analytic form of the efficient influence function (EIF), a derivation that can be ad hoc and difficult. Recent work has shown how the derivation of the EIF in such problems can be made deductive using the functional derivative representation of the EIF in nonparametric models. This approach, however, requires evaluating the estimand at a mixture of a continuous distribution and a point mass, which can itself be challenging for complicated problems. We propose to address this challenge using a "discrete support" structure, thereby making deductive semiparametric estimation feasible for more complicated designs and estimands. The discrete support is a space constructed from the observed data, which enables (i) approximation of the observed data distribution and (ii) numerical computation of the deductive semiparametric estimators. The method is expected to produce semiparametric locally efficient estimators within finitely many steps without knowledge of the EIF. We apply the new method to estimating the mortality rate in a double-sampling design of the President's Emergency Plan for AIDS Relief program. We also analyze the sensitivity of the estimated mortality rate to the inclusion criteria for double-samples.



1 Introduction

Studies with long follow-up often suffer from high dropout rates. Dropout can depend on the outcome of interest, even after adjusting for observed covariates. This makes the dropout “non-ignorable” and biases analyses based solely on the non-dropouts (Rubin, 1976). As a way to handle non-ignorable dropouts, double-sampling designs allocate additional resources to pursue a sample of the dropouts and find out their outcomes (Baker et al., 1993; Glynn et al., 1993; Frangakis and Rubin, 2001; Cochran, 2007). Double-sampling can be more practical or informative if it targets dropouts whose history at the dropout time fits specific profiles. For example, An et al. (2009) found that such “profile designs” can save more than 35% of resources compared to the standard double-sampling design. In addition, it can be more practical to double-sample relatively recent dropouts. For estimation in double-sampling designs, An et al. (2014) employed a parametric approach to estimate the survival probability. However, analyses of such designs can be more reliable if they do not rely heavily on parametric assumptions.

Robins et al. (2001) suggested a possible way of deriving a semiparametric estimator for double-sampling designs, but that proposal, like other existing ones, relies on first coming up with and then verifying conjectures “by hand”. Such a process is prone to human error and is not deductive, i.e., not generalizable.

Recently, Frangakis et al. (2015) proposed the idea of making semiparametric estimation deductive. Contrary to the classical semiparametric framework, which relies heavily on conjecturing and verifying the analytic form of the efficient influence function (EIF), their approach produces semiparametric locally efficient estimators by utilizing the Gateaux derivative representation of the EIF in nonparametric models. This deductive estimation idea can spare substantial human effort on difficult mathematical derivations, and such effort can be transferred, for example, to designing new studies. One limitation, though, is that their approach requires analytically evaluating the estimand at a mixture of a continuous distribution and a point mass. This derivation is feasible in certain cases (e.g., the two-phase design examples in Frangakis et al. (2015)), but becomes highly error-prone when done by hand in complicated problems such as estimating survival probability in the double-sampling design.

In this paper, we develop a semiparametric estimator for the mortality rate in double-sampling designs by adapting the deductive estimation method of Frangakis et al. (2015) to incorporate a novel “discrete support” structure. The discrete support is a space constructed from the observed data, which enables (i) approximation of the observed data distribution and (ii) numerical computation for constructing semiparametric estimators, including augmenting fitted models and perturbing a continuous distribution by a point mass at an observed data point. This discretization technique has its roots in Chamberlain (1987). The proposed method is expected to produce semiparametric locally efficient estimators within finitely many steps without knowledge of the EIF. By avoiding the need to evaluate the estimand at a mixture of a continuous distribution and a point mass, the proposed method overcomes the limitation of Frangakis et al. (2015) and yields computerizable semiparametric estimation for complex designs such as the double-sampling design.

We apply the method to estimating the mortality rate using data from a double-sampling design component of the President’s Emergency Plan for AIDS Relief (PEPFAR), an HIV monitoring and treatment program. In addition, we explore and discuss, as a scientifically interesting problem, how the estimated mortality rate is affected by restrictions on double-sampling, because double-sampling can be more pragmatic if restricted to relatively recent dropouts (An et al., 2014).

The paper is organized as follows. In Section 2, we introduce the double-sampling design and the parameter of interest with its identifiability conditions. In Section 3, we review the deductive estimation framework in Frangakis et al. (2015), and present the proposed estimation method for survival probability in double-sampling designs. In particular, we present and discuss how the discrete support idea facilitates the deductive estimation in double-sampling designs. In Section 4, we apply the method to PEPFAR and analyze the impact of selection criteria on double-sampling. Section 5 concludes with remarks.

2 Double-sampling design and quantity of interest

For clarity, we present arguments with a double-sampling design as shown in Figure 1, which is modified from An et al. (2014, Figure 1). First, we describe characteristics that are inherent to a patient (i.e., potential outcomes (Rubin, 1976)), which would be realized if the program could enroll and follow a patient indefinitely with a standard effort. Then, we describe the actual design consisting of two phases.

Figure 1: Characteristics for different patient types in a double-sampling design. The count column denotes the number of patients of each type in the double-sampling data set of PEPFAR. “obs” means observed; “UD” means undefined. Quantities in brackets are unobserved. The figure is a modified version of An et al. (2014, Figure 1).

Patient inherent characteristics and quantity of interest. For each patient, there is an enrollment time , covariates at enrollment, and a survival time (time from enrollment to death). The quantity of interest is for a given , which is the proportion of patients surviving beyond year for a population represented as a random sample by the enrolled patients. If the program were to follow patients indefinitely with a standard effort while they are alive, some patients would discontinue contact from the monitoring (patient types (b1), (c), (d), (e1), and (e2) in Figure 1). For these patients, labeled true dropouts and indicated by , denote by the time from enrollment to dropout and denote by any information available at dropping out in addition to . For instance, may include and some longitudinal measurements.

Phase 1 of the actual design and missingness due to administrative censoring. In the first phase, the actual design enrolls patients at times and monitors them with standard effort, not indefinitely, but until “Now”—the present time. The time from enrollment to “Now” is the time to administrative censoring, denoted by . For patient , define if and otherwise. Define .

Simply based on Phase 1, the standard survival data are not observed for dropouts whose time to dropout satisfies . Denote such observed dropout patients by (patient types (c), (d), (e1), and (e2) in Figure 1).

Phase 2: information after double-sampling. Phase 2 of the design selects a subset of the observed dropouts, called a double-sample, by using characteristics of the patient known up to the dropout time . Additional resources are allocated for searching for and finding this double-sample at “Now”. Such double-sampling is expressed as follows:

Condition 1. Patient-dependent double-sampling. For each observed dropout, the indicator of being selected as a double-sample can depend on the patient characteristics up to dropout time ; after we condition on , double-sampling is independent of survival and enrollment times:

The design must also address the information missing due to administrative censoring. This can be done in practice by limiting the design to a window period of enrollment, within which the following holds:

Condition 2. Exchangeability of enrollment times within a time period. Patients enrolled at different times (equivalently, having different time to administrative censoring ) have similar survival times after conditioning on :

Under Conditions 1 and 2, the estimand is identifiable from the following components of the distribution of the observed data:


In particular, by Condition 1 the distribution from the double-sampled individuals is the same as that for all dropouts, , and so, combined with the second component of (2), it gives, upon averaging over , . That, together with (1), gives . Denote by the function that takes an arbitrary distribution of arising from independent survival and censoring times and returns the survival probability beyond (this function is the common probability limit of the Nelson-Aalen and Kaplan-Meier estimators). Then, by Condition 2, the estimand is calculable from the above distributions as


The calculation described above involves regression distributions that need to be estimated with robust and deductive (i.e., easily computerizable) methods. We describe such a method next.
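The functional described above, mapping a distribution of (time, event-indicator) pairs from independent survival and censoring times to a survival probability, can be illustrated for a discrete distribution via the product-limit formula. A minimal sketch follows; the function name and the list-of-triples data layout are ours, not the paper's:

```python
from collections import defaultdict

def product_limit(points, t):
    """Survival probability beyond time t for a discrete distribution.

    `points` is a list of (time, event, prob) triples, where event=1
    marks a death and event=0 a censoring; the probs sum to 1. This
    evaluates the product-limit (Kaplan-Meier) formula as a functional
    of the distribution, not of a sample.
    """
    deaths = defaultdict(float)  # probability mass of deaths at each time
    total = defaultdict(float)   # probability mass of all points at each time
    for time, event, p in points:
        total[time] += p
        if event == 1:
            deaths[time] += p
    surv = 1.0
    at_risk = 1.0  # mass still "at risk" just before the current time
    for time in sorted(total):
        if time > t:
            break
        if at_risk > 0:
            surv *= 1.0 - deaths[time] / at_risk
        at_risk -= total[time]
    return surv
```

Applied to the empirical distribution of a sample, this reproduces the usual Kaplan-Meier estimate, consistent with the probability-limit remark above.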

3 Deductive estimation of survival probability in double-sampling designs

3.1 Overview: computerizable and deductive estimation

An estimand can be viewed as a functional of a distribution of the data. For example, in the double-sampling design described above, expression (3) shows that is a functional of the distribution of the observed data. Estimation of requires modeling assumptions because expression (3) involves regressions that, in practice, cannot be estimated fully nonparametrically. In particular, suppose the estimand in (3) has a nonparametric efficient influence function (EIF) denoted by , where represents the remaining components of the distribution other than . We wish to find deductive estimators that solve


after substituting estimates from a working model. Such estimators are consistent and locally efficient if the working estimators are consistent with sufficiently fast convergence rates (van der Vaart, 2000). By “deductive”, we mean that the estimator should not rely on conjectures about the functional form of the EIF, and should be guaranteed to produce an estimate in the sense of Turing (1937) (i.e., use a discrete and finite set of instructions and, for every input, finish in finitely many steps).

Recently, a deductive estimation method has been proposed that does not need the analytic form of EIF (Frangakis et al., 2015). The key idea is that, for any working distribution for , we can calculate numerically the EIF as the numerical Gateaux derivative after perturbing the working distribution by a small mass at each observed data point; i.e.,


and where is the perturbed distribution, i.e., a mixture of a continuous distribution and a point mass. Then, we can find the best working distribution parameter as one that makes the sum of the numerical EIFs, , equal to zero. The estimator that approximately solves (4) is then . The standard error of the estimator can be estimated by , with denoting the sample size.
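The numerical Gateaux derivative just described amounts to a finite difference: perturb the working distribution by a small mass at an observed point and difference the estimand. A minimal sketch under our own naming (the functional `psi` and the list-of-masses representation of the distribution are illustrative assumptions, not the paper's notation):

```python
def numerical_eif(psi, probs, i, eps=1e-4):
    """Numerical Gateaux derivative of the functional `psi` at the
    discrete distribution `probs` (a list of point masses) in the
    direction of a point mass at support point i.

    Approximates the efficient influence function value at point i
    without knowing its analytic form.
    """
    # mixture (1 - eps) * P + eps * (point mass at i)
    perturbed = [(1 - eps) * p for p in probs]
    perturbed[i] += eps
    return (psi(perturbed) - psi(probs)) / eps
```

Collecting these values at all observed points gives both the estimating equation (their sum) and, via their empirical second moment, the standard-error estimate mentioned above.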

The estimator has consistency properties beyond those of the MLE for the same working model, since the former depends only on the features of the working model that remain in the nonparametric influence function. For example, Frangakis et al. (2015) show that in the two-phase design, such an estimator shares the double robustness of estimators that take the EIF form as given. The more general conditions relating the true distribution and the working model are stated in the Appendix, and specific conditions for working models are given in Section 3.2.

The challenge in directly applying the above deductive estimation approach to the double-sampling design is the evaluation of . Although for certain problems the estimand at the perturbed distribution can be obtained relatively easily (such as the average outcome in the two-phase design example from Frangakis et al. (2015)), we found the perturbation difficult to derive analytically for the survival probability considered in this paper. To avoid analytically evaluating , we combine the deductive estimation procedure with the discrete support idea, described next.

3.2 Incorporating discrete support for estimation in double-sampling designs

The general idea of the discrete support is to approximate continuous working models (the working distributions in the previous section) by discrete ones. If the support of the discrete working model contains all the observed data points, perturbing the working model by a point mass is as simple as changing the values in a probability table. This discretization idea was used in Chamberlain (1987) to facilitate the derivation of nonparametric efficient influence functions.

In the double-sampling design, the observed data for patient is , , and if or ( and ). Here, (which includes ) and are undefined if , and are missing if and . We define the discrete support (DS) for the double-sampling design to be


where is the Cartesian product space of the unique values of and the unique values of for patients with , and is the Cartesian product space of the unique values of and the unique values of for patients with . Figure 2 illustrates the spaces and . Note that, by the nature of the double-sampling design, among those individuals with , those who were not double-sampled will have , and we include this as a set of unique values of in .

Figure 2: Illustration of the discrete support (DS) for double-sampling designs.

Formally, we define

The set DS contains as its elements all observed points and additional points needed for identifiability of the estimand and for the deductive estimation procedure. As we will see below, this construction of DS for the double-sampling design enables (i) approximation of the observed data distribution, (ii) extension of fitted working models, and (iii) perturbation of a working distribution by a point mass at an observed data point. Note that instead of constructing an overall Cartesian product space, we keep the quantities that can be estimated reliably from the data (i.e., ), and construct product spaces conditional on and , respectively.
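The arm-wise Cartesian-product construction can be sketched as follows, under the illustrative assumption that each observed record is a flat dictionary of variable values keyed by an arm indicator 'R' (the key names are ours, not the paper's):

```python
from itertools import product

def discrete_support(records):
    """Build a discrete support from observed records.

    Each record is a dict with key 'R' (the arm indicator) plus the
    variables observed for that arm; records in the same arm are
    assumed to share the same keys. Within each arm we take the
    Cartesian product of the unique observed values of the arm's
    variables, so every observed point lies in the support and
    point-mass perturbations reduce to edits of a probability table.
    """
    support = []
    for r in (0, 1):
        arm = [rec for rec in records if rec['R'] == r]
        if not arm:
            continue
        keys = sorted(k for k in arm[0] if k != 'R')
        uniques = [sorted({rec[k] for rec in arm}) for k in keys]
        for combo in product(*uniques):
            support.append({'R': r, **dict(zip(keys, combo))})
    return support
```

By construction, the support contains every observed record plus the cross-combinations needed to represent the working distribution on a common grid.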

With a slight abuse of notation, in the following we denote by an arbitrary (discrete) probability distribution on the discrete support set DS. The deductive estimation procedure for constructing a semiparametric estimator for in the double-sampling design, after constructing DS, is as follows:

(step 1) : Construct and code the function that takes an arbitrary probability distribution on DS and outputs (through averaging and normalization) the decomposition (1)-(2) and the value based on (3). This step depends solely on the identification result for the estimand.

(step 2) : Construct and code the function that takes an arbitrary probability distribution on DS, an index , and a small , and outputs the perturbed mixture distribution

Given , the perturbed probability at an arbitrary point can be computed by if , and if , where is the probability of under the (discrete) probability measure . This step is straightforward because the perturbation is conducted on a discrete set.
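On the discrete support, step 2's perturbation is just a reweighting of a probability table; a minimal sketch with the table represented as a dictionary (our representation, not the paper's):

```python
def perturb(table, point, eps):
    """Mix a discrete probability table with a point mass:
    (1 - eps) * P + eps * delta_{point}.

    `table` maps support points to probabilities summing to 1; the
    result is again a probability table on the same support.
    """
    new = {k: (1 - eps) * p for k, p in table.items()}
    new[point] = new.get(point, 0.0) + eps
    return new
```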

(step 3) : Fit a working distribution on DS using working models. Then extend to a model with a tuning parameter , so that the left-hand side of (7) below explores values around 0. We give the exact form of for the double-sampling design in Section 3.3.

(step 4) : Find the parameter that solves


where is defined in (5). For computing , is computed by step 3; is computed by step 2; and and are computed by step 1.

(step 5) : The resulting estimator is , which is computed by the function defined in step 1. The standard error of can be estimated by .
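Step 4 is a one-dimensional root-finding problem and can be solved with any generic routine. A bisection sketch follows, assuming the caller supplies `sum_eif` (the sum over observed points of the numerical EIFs under the extended working model indexed by t) and a bracketing interval on which it changes sign; the name and defaults are ours:

```python
def solve_tuning(sum_eif, lo=-5.0, hi=5.0, tol=1e-8):
    """Find t with sum_eif(t) approximately zero by bisection.

    Assumes sum_eif changes sign on [lo, hi]; returns the midpoint
    once the function value or the interval width is below `tol`.
    """
    f_lo = sum_eif(lo)
    if f_lo == 0:
        return lo
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = sum_eif(mid)
        if abs(f_mid) < tol or hi - lo < tol:
            return mid
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return 0.5 * (lo + hi)
```

Step 5 then evaluates the estimand function from step 1 at the extended working model indexed by the solution.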

The estimator produced by the above method has the following robustness property. Suppose we decompose , where isolates the components of the distribution that are modeled through in step 3, and is the remaining part of . Generally, the expected EIF, , is zero at the truth but possibly also at other values of (e.g., under double robustness). Under regularity conditions usually needed for consistency with estimating equations, the above estimator with a working model should be consistent if the model includes a distribution that satisfies condition (12) of the Appendix; that is, a distribution that zeros out the limit EIF and gives the correct value of the estimand.

3.3 Working and extended model

Here we describe the working model and its extension used in step 3 of the estimation procedure in Section 3.2.

The working model on the discrete support is a discrete probability distribution on . The value of at an arbitrary point in is calculated as follows: for the double-sampled subjects ( so that is not NA), we have


for the other subjects with , we have


We now describe how each term on the right-hand side of (8) and (9) is calculated. For , , and , we use the empirical distributions. The selection for double-sampling is modeled using logistic regression. The distribution is modeled as the likelihood arising from independent and given . This working independence between and does not need to, and is not expected to, hold; only the resulting likelihood for is used. The working distributions for the censoring time and the survival time are each taken as Cox proportional hazards model fits. The fitted distribution is then normalized to sum to 1 over all pairs in , conditional on each pair in .

The working model on is calculated analogously. For an arbitrary point in , we compute analogously, with a similar normalization conducted on . The working models on and are chosen to be variationally independent, i.e., they share no parameters.

The working model is then extended to a model by adding a 1-dimensional parameter to and , while leaving the other factors of unchanged. Denoting by the two pre-extension models for and , the extension models with are taken as:


where , . This extension can increase the probability of larger values (with ) or smaller values (with ), and was verified numerically to be able to drive the left-hand side of (7) to values around 0, where other extensions we attempted could not. Note that setting recovers the original working model.
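The exact form of the extension is not reproduced above. Exponential tilting is one standard extension with the stated property, and we sketch it here purely as an illustrative assumption, not as the paper's formula:

```python
import math

def tilt(probs, values, t):
    """Extend a discrete working model by exponential tilting:
    p_t(v) proportional to p(v) * exp(t * v), renormalized.

    With t = 0 this returns the original model; t > 0 shifts mass
    toward larger values and t < 0 toward smaller ones.
    """
    weights = [p * math.exp(t * v) for p, v in zip(probs, values)]
    total = sum(weights)
    return [w / total for w in weights]
```

Any one-parameter family with this monotone-shift property and containing the fitted model at t = 0 would serve the same role in step 3.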

4 Application to PEPFAR

We apply our method to estimating the mortality rate using the data set from a double-sampling design component of the President’s Emergency Plan for AIDS Relief (PEPFAR), an HIV monitoring and treatment program in East Africa evaluating antiretroviral treatment (ART) for HIV-infected people (Geng et al., 2015). The data set consists of 1,773 HIV-infected adults from Morogoro, Tanzania, who started ART after entering the study. There were 673 dropouts during the study, of whom 91 patients were double-sampled. We use baseline age and pre-treatment CD4 value as , and the loss-to-follow-up time and the CD4 value measured at the last visit before dropout as .

Estimated mortality rate. The black curve in the left panel of Figure 3 is the estimated -year mortality rate for using the method in Section 3, along with its pointwise 95% confidence interval obtained by normal approximation. For example, the deductively estimated 1-year mortality rate is 11.4% with standard error 0.8%. As a comparison, a stratified Kaplan-Meier approach gives an estimated 1-year mortality rate of 14% (not shown). The yellow curve in the left panel of Figure 3 is the estimated mortality rate when forcing in step 5 of Section 3, which corresponds to the estimator obtained by directly plugging the fitted working model into the estimand, without the additional steps to find the that solves (7).

Figure 3: Estimated mortality rates of the PEPFAR data set, with 95% pointwise confidence interval (shaded band). Left panel: comparison of deductive estimates (with solving (7)) and the estimates when forcing , where there is no restriction on double-sampling. Right panel: comparison of deductive estimates with different restrictions on double-sampling.

Sensitivity to double-sampling selection criteria. For logistical reasons, it may be more feasible in practice to double-sample relatively recent dropouts (An et al., 2014). A natural question is how such selection criteria for the double-samples would impact the estimated mortality rate. To answer this question, we parametrize the selection rule by a scalar , where only the dropouts with are eligible to be sampled in the second phase. For different values of , Table 1 lists the proportion of double-samples in the PEPFAR data set that satisfy and the corresponding deductively estimated 1-year mortality rate. For a given , we obtain the estimated mortality rate by first setting and for the double-samples in the data set with , and then applying the method in Section 3. The estimated 1-year mortality rate and its standard error appear to be only slightly impacted by . When days (i.e., when only the 19 most recent double-samples are included), there is too little variation in the included double-samples, so the Cox model cannot be fitted and the estimation procedure breaks down.

(days) Included double-samples (%) Estimate (%) Standard error (%)
262 20 NA NA NA
344 30 1.45 12.2 0.69
382 40 1.58 12.5 0.71
454 50 1.59 12.3 0.72
494 60 1.57 10.8 0.71
566 70 1.61 12.1 0.78
652 80 1.56 11.5 0.77
897 90 1.57 11.9 0.79
1061 100 1.53 11.7 0.80
Table 1: Deductive estimates and standard errors for the 1-year mortality rate in the PEPFAR data set when restricting double-sampling to dropouts with for different values.

The right panel of Figure 3 shows the estimated -year mortality rates without restriction on double-sampling (black curve) and when restricting the design to double-sample only the patients who dropped out within the past two years (blue curve) or the past year (yellow curve). These correspond to , 750, or 365 days; that is, to including all, 85%, or 36% of the double-samples in the data set. For all three curves, the estimated mortality rates are similar for year. The yellow curve (restricting double-sampling to dropouts from the past year) starts to diverge from the other two as grows beyond 1 year. We conclude that different selection criteria can result in quite different estimates, and evaluating and optimizing the double-sampling selection criteria is an area for future research.

5 Remarks

We proposed a deductive method to produce semiparametric estimators for estimating survival probability in the double-sampling design. The method is easily computerizable by incorporating the discrete support structure into the approach in Frangakis et al. (2015). We applied the method to a double-sampling component at a site of the PEPFAR program.

It would be interesting to compare the proposed estimator with estimators derived from the explicit form of the EIF. While Robins and Rotnitzky (2001) briefly describe, through a set of equations, the conditions that the EIF satisfies, we found that solving those equations by hand carried a high risk of introducing errors. We are also unaware of any work that gives a closed-form expression for the EIF.

For the proposed estimation procedure in Section 3.2, the construction of the discrete support, step 1, and step 3 are problem-specific, whereas steps 2, 4, and 5 are generic to other nonparametric settings. As discussed in Section 3, when constructing the discrete support, a necessary condition is that it contain all the observed data points so that the perturbation can be computed easily. It should also facilitate approximation of the observed data distribution with a discrete working model. Elucidating the principles for constructing the discrete support and for extending the working model on it remains an open question. We expect the proposed estimation procedure to be applicable to other designs once this question has a satisfactory answer.

In the literature, discussions of robust estimators have been based partly on characterizing robust estimating functions; see, for example, Robins et al. (2000) and Robins and Rotnitzky (2001). Although our estimator is obtained by (numerically) solving the EIF equation, it is more difficult to find analytically all the conditions under which the estimator is consistent, precisely because the focus is on problems in which the EIF is difficult to derive analytically. A supplemental numerical method may therefore exist that can also characterize these conditions more intuitively.


Appendix

Suppose the working model assumes that the true distribution belongs to some set . Then the estimator, say , that solves the nonparametric EIF equation within the working model will, in the limit, be


where denotes the expectation under . Therefore, by denoting , and assuming sufficient smoothness of the distributions, the estimator will converge to the true value under the following joint conditions:


The analytic form of the above expressions may not be easily accessible when the EIF’s form is not. Suppose, however, that a computational method can easily determine just the “zeros” of the expressions, that is, given any , the features of that are restricted in order to satisfy conditions (12). With such a method, coupled with deductive estimation, the researcher can focus efforts on clarifying and modeling especially well those restricted features, as this would provide approximate consistency of the estimator.


Acknowledgments

The authors thank Beverly S. Musick of Indiana University for compiling the database on which this study was based and for offering expert advice on the data. This work was supported by the U.S. National Institute of Drug Abuse (R01 AI102710-01). The statements in this work are solely the responsibility of the authors and do not represent the views of this organization.


References

  • An et al. (2009) An MW, Frangakis CE, Musick BS, Yiannoutsos CT (2009) The need for double-sampling designs in survival studies: An application to monitor PEPFAR. Biometrics 65(1):301–306
  • An et al. (2014) An MW, Frangakis CE, Yiannoutsos CT (2014) Choosing profile double-sampling designs for survival estimation with application to President’s Emergency Plan for AIDS Relief evaluation. Statistics in Medicine 33(12):2017–2029
  • Baker et al. (1993) Baker SG, Wax Y, Patterson BH (1993) Regression analysis of grouped survival data: informative censoring and double sampling. Biometrics 49(2):379–389
  • Chamberlain (1987) Chamberlain G (1987) Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics 34(3):305–334
  • Cochran (2007) Cochran WG (2007) Sampling techniques. John Wiley & Sons
  • Frangakis and Rubin (2001) Frangakis CE, Rubin DB (2001) Addressing an idiosyncrasy in estimating survival curves using double sampling in the presence of self-selected right censoring. Biometrics 57(2):333–342
  • Frangakis et al. (2015) Frangakis CE, Qian T, Wu Z, Diaz I (2015) Deductive derivation and Turing-computerization of semiparametric efficient estimation. Biometrics 71(4):867–874
  • Geng et al. (2015) Geng EH, Odeny TA, Lyamuya RE, Nakiwogga-Muwanga A, Diero L, Bwana M, Muyindike W, Braitstein P, Somi GR, Kambugu A, et al. (2015) Estimation of mortality among HIV-infected people on antiretroviral treatment in east Africa: a sampling based approach in an observational, multisite, cohort study. The Lancet HIV 2(3):e107–e116
  • Glynn et al. (1993) Glynn RJ, Laird NM, Rubin DB (1993) Multiple imputation in mixture models for nonignorable nonresponse with follow-ups. Journal of the American Statistical Association 88(423):984–993
  • Robins and Rotnitzky (2001) Robins J, Rotnitzky A (2001) Comments. Statistica Sinica 11(4):920–936
  • Robins et al. (2001) Robins J, Rotnitzky A, Bonetti M (2001) Discussion of the Frangakis and Rubin article. Biometrics 57(2):343–347
  • Robins et al. (2000) Robins JM, Rotnitzky A, van der Laan M (2000) Comment. Journal of the American Statistical Association 95(450):477–482, DOI 10.1080/01621459.2000.10474224
  • Rubin (1976) Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
  • Turing (1937) Turing AM (1937) On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 2(1):230–265
  • van der Vaart (2000) van der Vaart AW (2000) Asymptotic Statistics. Cambridge University Press