We consider estimation of a pathwise differentiable real valued target estimand based on observing independent and identically distributed observations from a data distribution known to belong to a nonparametric statistical model . A target parameter is a mapping that maps a possible data distribution into real number, while represents the statistical estimand. The canonical gradient of the pathwise derivative of the target parameter at a distribution defines an asymptotically efficient estimator among the class of regular estimators (Bickel et al., 1997): an estimator is asymptotically efficient at if and only if it is asymptotically linear at with influence curve :
The target parameter depends on the data distribution through a parameter , while the canonical gradient possibly also depends on another nuisance parameter :
. Both of these nuisance parameters are chosen so that they can be defined as a minimizer of the expectation of a specific loss function:and , where we used the notation . We consider the case that the parameter spaces and for these nuisance parameters and are contained in the set of multivariate cadlag functions with sectional variation norm (Gill et al., 1995) bounded by a constant (this norm will be defined in the next section).
We consider a targeted minimum loss-based (substitution) estimator (van der Laan and Rubin, 2006; van der Laan, 2008; van der Laan and Rose, 2011, 2017) of the target parameter that uses as initial estimator of these nuisance parameters the highly adaptive lasso minimum loss-based estimators (HAL-MLE) defined by minimizing the empirical mean of the loss over the parameter space (Benkeser and van der Laan, 2016). Since the HAL-MLEs converge at a rate faster than with respect to (w.r.t.) the loss-based quadratic dissimilarities (to be defined later, which corresponds with a rate faster than for estimation of and ), this HAL-TMLE has been shown to be asymptotically efficient under weak regularity conditions (van der Laan, 2015)
. Statistical inference could therefore be based on the normal limit distribution in which the asymptotic variance is estimated with an estimator of the variance of the canonical gradient. In that case, inference is ignoring the potentially very large contributions of the higher order remainder which could, in finite samples, easily dominate the first order empirical mean of the efficient influence curve term when the size of the nuisance parameter spaces is large (e.g., dimension of data is large and model is nonparametric).
In this article we propose the nonparametric bootstrap to obtain a better estimate of the finite sample distribution of the HAL-TMLE than the normal limit distribution. The bootstrap fixes the sectional variation norm at the values used for the HAL-MLEs on a bootstrap sample. We propose a data adaptive selector of this tuning parameter tailored to obtain improved finite sample coverage for the resulting confidence intervals.
In Section 2 we formulate the estimation problem and motivate the challenge for statistical inference. In Section 3 we present the nonparametric bootstrap estimator of the actual sampling distribution of the HAL-TMLE which thus incorporates estimation of its higher order stochastic behavior, and can thereby be expected to outperform the Wald-type confidence intervals. We prove that this nonparametric bootstrap is asymptotically consistent for the optimal normal limit distribution. Our results also prove that the nonparametric bootstrap preserves the asymptotic behavior of the HAL-MLEs of our nuisance parameters and , providing further evidence for good performance of the nonparametric bootstrap. Importantly, our results demonstrate that the approximation error of the nonparametric bootstrap estimate of the true finite sample distribution of the HAL-TMLE is mainly driven by the approximation error of the nonparametric bootstrap for estimating the finite sample distribution of a well behaved empirical process. In Section 4 we present a plateau selection method for selecting the fixed sectional variation norm in the nonparametric bootstrap and a bias-correction in order to obtain improved finite sample coverage for the resulting confidence intervals.
In Section 5 we demonstrate our methods for two examples involving a nonparametric model and a specified target parameter: average treatment effect and integral of the square of the data density. In Section 6 we carry out a simulation study to demonstrate the practical performance of our proposed nonparametric bootstrap based confidence intervals w.r.t. their finite sample coverage. We conclude with a discussion in Section 7. Proofs of our Lemma and Theorems have been deferred to the Appendix. We refer to our accompanying technical report for additional bootstrap methods and results based on applying the nonparametric bootstrap to an exact second order expansion of the HAL-TMLE, and to various upper bounds of this exact second order expansion.
2 General formulation of statistical estimation problem and motivation for finite sample inference
2.1 Statistical model and target parameter
i.i.d. copies of a random variable. Let
be the empirical probability measure of. Let be a real valued parameter that is pathwise differentiable at each with canonical gradient . That is, given a collection of one dimensional submodels through at with score , for each of these submodels the derivative can be represented as a covariance of a gradient with the score . The latter is an inner product of a gradient with the score in the Hilbert space of functions of with mean zero (under ) endowed with inner product . Let be the Hilbert space norm. Such an element is called a gradient of the pathwise derivative of at . The canonical gradient is the unique gradient that is an element of the tangent space defined as the closure of the linear span of the collection of scores generated by this family of submodels. Define the exact second-order remainder
where since has mean zero under .
[(Treatment-specific mean)] Let , where is a binary treatment, is a binary outcome, and is a nonparametric model. For a possible data distribution , let be the outcome regression, be the propensity score, and let
be the probability distribution of. The treatment-specific mean parameter is defined by . Let and note that the data distribution is determined by . The canonical gradient of at is
The second-order remainder is given by:
Let be a function valued parameter so that for some . For notational convenience, we will abuse notation by referring to the target parameter with and interchangeably. Let be a function valued parameter so that for some . Again, we will use the notation and interchangeably.
For each , let be a function of so that . Similarly, for each , let be a function of so that . We refer to and as loss functions for and . Let and be the loss-based dissimilarities for these two nuisance parameters. The negative loss based dissimilarity is often called the regret. Assume that the loss functions are uniformly bounded in the sense that and . In addition, assume
This condition holds for most common loss functions (such as mean-squared error loss and cross entropy loss), and it guarantees that the loss-based dissimilarities and behave as a square of an -norm. These two universal bounds on the loss function yield the oracle inequality for the cross-validation selector among a set of candidate estimators (van der Laan and Dudoit, 2003; van der Vaart et al., 2006; van der Laan et al., 2006, 2007; Polley et al., 2011). In particular, it establishes that the cross-validation selector is asymptotically equivalent to the oracle selector.
[(Treatment-specific mean)] For the treatment-specific mean parameter, the function is the outcome regression , and is the propensity score. The other component of will be estimated with the empirical probability measure, which is an NPMLE, so that a TMLE will not update this estimator. Let be the negative log-likelihood loss for the outcome regression. Similarly, is the negative-log-likelihood loss for propensity score. When, for some , and , then the loss functions are uniformly bounded with finite universal bounds (2).
Donsker class condition: Our formal theorems need to assume that , , and are uniform (in ) Donsker classes, or, equivalently, that the union of these classes is a uniform Donsker class. We remind the reader that a covering number is defined as the minimal number of balls of size w.r.t. -norm that are needed to cover the set of functions embedded in . Let be defined such that
Our formal results will refer to a rate of convergence of the HAL-MLEs given by implied by this index (van der Laan, 2015). In this article we will focus here on the following special Donsker class, in which case can be chosen as .
Loss functions and canonical gradient have a uniformly bounded sectional variation norm: We assume that the loss functions and canonical gradient are cadlag functions with a universal bound on the sectional variation norm. The latter class of functions is indeed a uniform Donsker class. In the sequel we will assume this, but we remark here that throughout we could have replaced this class of cadlag functions with a universal bound on the sectional variation norm by any other uniform Donsker class. Below we will present a particular class of models in which we assume that the nuisance parameters and themselves fall in such classes of functions, so that generally also and will fall in this class. All our applications have been covered by the latter type of models.
[(Treatment-specific mean)] Under the previous stated assumptions, the sectional variation norm of (for each ) can be bounded in terms of the sectional variation norm of and . Similarly, this same statement applies for and . As a consequence, the universal bounds (4) are finite.
We will formalize this condition now. Suppose that is a -variate random variable with support contained in a -dimensional cube . Let be the Banach space of -variate real valued cadlag functions endowed with a supremum norm (Neuhaus, 1971). Let and . We assume that these loss functions and the canonical gradient map into functions in with a sectional variation norm bounded by some universal finite constant (we will define sectional variation norm momentarily)
For a given function , we define the sectional variation norm as follows. For a given subset , let be the -specific section of that sets the coordinates outside the subset equal to 0, where we used the notation for the vector whose -th component equals if and otherwise. The sectional variation norm is now defined by
where the sum is over all subsets of . Note that is the standard variation norm of the measure generated by its -specific section on the -dimensional edge of the -dimensional cube . Thus, the sectional variation norm of is the sum of the variation norms of itself and of all its -specific sections , plus that of the offset . We also note that any function with finite sectional variation norm (i.e., ) can be represented as follows (Gill et al., 1995):
As utilized in (van der Laan, 2015) to define the HAL-MLE, since , this representation shows that
can be written as an infinitesimal linear combination of tensor product (over) indicator basis functions indexed by a cut-off , across all subsets , where the coefficients in front of the tensor product indicator basis functions are equal to the infinitesimal increments of at . This proves that this class of functions can be represented as a ”convex” hull of the class of indicators basis functions, which proves that it is a Donsker class (van der Vaart and Wellner, 1996).
For discrete measures this integral becomes a finite linear combination of such -way indicator basis functions (where denotes the size of the set ). One could think of this representation of as a saturated model of a function in terms of tensor products of univariate indicator basis functions, ranging from products over singletons to product over the full set . For a function , we also define the supremum norm .
General class of models for which parameter spaces for and are Cartesian products of sets of cadlag functions with bounds on sectional variation norm: Although the above bounds are the only relevant bounds for the asymptotic performance of the HAL-MLE and HAL-TMLE, for practical formulation of a model one might prefer to state the sectional variation norm restrictions on the parameters and themselves instead of on and . (In our formal results we will refer to such a model as having the extra structure (6) defined below, but, this extra structure is not needed, just as we can work with a general Donsker class as mentioned above.)
For that purpose, a model may assume that for variation independent parameters that are themselves -dimensional cadlag functions on with sectional variation norm bounded by some upper-bound and lower bound , , and similarly for with sectional variation norm bounds and , . Typically, such a model would not enforce a lower bound on the sectional variation norm so that we have . Let ; ; and , and similarly we define , and . Specifically, for such a class of models let
denote the parameter spaces for and , and assume that these parameter spaces are contained in the class of -variate cadlag functions with sectional variation norm bounded from above by and from below by , , . These bounds and will then imply bounds , for example, by verifying that for a universal , and similarly, for and . For such a model and would be defined as sums of loss function and . We also define the vector losses , , and corresponding vector dissimilarities and .
For example, the parameter space of () or () is defined as
for some set of possible values for , , , where one evaluates this restriction on in terms of the representation (5). Note that we used short-hand notation for being zero for . We will make the convention that if excludes , then it corresponds with assuming .
The subset of cadlag functions with sectional variation norm between and further restricts the support of these functions to a set . For example, might set for subsets of size larger than for all values , in which case the model assumes that the nuisance parameter can be represented as a sum over all subsets of size and of a function of the variables indicated by .
In order to allow modeling of monotonicity (e..g, nuisance parameter
is an actual cumulative distribution function), we also allow that this set restrictsfor all . We will denote the latter parameter space with
For the parameter space (7) of monotone functions we allow that the sectional variation norm is known by setting (e.g, for the class of cumulative distribution functions we would have ), while for the parameter space (6) of cadlag functions with sectional variation norm between and we assume .
For the analysis of our proposed nonparametric bootstrap sampling distributions we do not assume this extra model structure that or for some set , , . In the sequel we will refer to a model with this extra structure as a model satisfying (6), even though we include the case (7). All our formal results apply without this extras model structure (and also for any other uniform Donsker class as mentioned above), but it just happens to represent a natural model structure for establishing the sectional variation norm bounds (4) on , , and , and for computing HAL-MLEs. The key practical benefit of this extra model structure is that the implementation of the HAL-MLE for such a parameter space corresponds with fitting a linear combination of indicator basis functions of the form (indexed by a subset and knot-point ) under the sole constraint that the sum of the absolute value of the coefficients is bounded by and , and possibly that the coefficients are non-negative, where the set
implies the set of indicator basis functions that are included. Specifically, in the case that the nuisance parameter is a conditional mean or conditional probability we can compute the HAL-MLE with standard lasso linear or logistic regression software(Benkeser and van der Laan, 2016). Therefore, this restriction on our set of models also allows straightforward computation of its HAL-MLEs, corresponding HAL-TMLE, and their bootstrap analogues.
A typical statistical model assuming the extra structure (6) would be of the form indexed by the support sets and the sectional variation norm bounds , but the model might include additional restrictions on as long as the parameter spaces of these nuisance parameters equal these sets or .
In our first example we have a nuisance parameter that is not just assumed to be cadlag and have bounded sectional variation norm but is also bounded between and for some . This means that the parameter space for this is not exactly of type (6). This is easily resolved by, for example, reparameterizing where can be any cadlag function with sectional variation norm bounded by some constant . The bound implies automatically a supremum norm bound on , and thereby that for some . One now defines the nuisance parameter as . Similarly, such a parametrization can be applied to and to the density in our second example. These just represent a few examples showcasing that one can reparametrize the natural nuisance parameters in terms of nuisance parameters that have a parameter space of the form (6) or (7). These representations are actually natural steps for the implementation of the HAL-MLE since they allow us now to minimize the empirical risk over a generalized linear model with the sole constraint that the sum of absolute value of coefficients is bounded (and possibly coefficients are non-negative).
Bounding the exact second-order remainder in terms of loss-based dissimilarities: Let
for some mapping possibly indexed by . We assume the following upper bound:
for some function , , of the form , a quadratic polynomial with positive coefficients . In all our examples, one simply uses the Cauchy-Schwarz inequality to bound in terms of -norms of and , and subsequently one relates these -norms to its loss-based dissimilarities and , respectively. This bounding step will also rely on an assumption that denominators in are uniformly bounded away from zero. This type of assumption that guarantees uniform bounds on and on is often referred to as a strong positivity assumption since it requires that the data density has a certain type of support relevant for the target parameter , and that the data density is uniformly bounded away from zero on that support.
Continuity of efficient influence curve as function of at : We also assume that if and converge to zero in probability, then
2.2 HAL-MLEs of nuisance parameters
We estimate with HAL-MLEs satisfying (with probability tending to 1)
For example, might be defined as the actual minimizer . If has multiple components and the loss function is a corresponding sum loss function, then these HAL-MLEs correspond with separate HAL-MLEs for each component. We have the following previously established result from Lemma 3 in van der Laan (2015) for these HAL-MLEs. We represent estimators as mappings on the nonparametric model containing all possible realizations of the empirical measure .
Application of this general lemma proves that and .
One can add restrictions to the parameter space over which one minimizes in the definition of and as long as one guarantees that, with probability tending to 1, and . For example, in a model with extra structure (6) this allows one to use a data dependent upper bound on the sectional variation norm in the definition of if we know that will be larger than the true with probability tending to 1.
Consider a finite dimensional local least favorable model through at so that the linear span of the components of at includes . Let for . We assume that this one-step TMLE already satisfies
Since we will have that , and solves its score equation , which, in first order, equals its score equation at (with a second order remainder ). This basic argument allows one to prove that (10) holds under the assumption and regularity conditions, as formally shown in the Appendix of (van der Laan, 2015). Alternatively, one could use the one-dimensional canonical universal least favorable model satisfying at each (see our second example in Section 5). In that case, the efficient influence curve equation (10) is solved exactly with the one-step TMLE: i.e., (van der Laan and Gruber, 2015). The HAL-TMLE of is the plug-in estimator . In the context of model structure (6) (or (7)), we will also refer to this estimator as the HAL-TMLE to indicate its dependence on the specification of the bounds on the sectional variation norms of the components of and
Lemma 2 in Appendix A proves that converges at the same rate as (see (20)). This also implies this result for any -th step TMLE with fixed. The advantage of a one-step or -th step TMLE is that it is always well defined, and it easily follows that it converges at the same rate as the initial to . In addition, for these closed form TMLEs it is also guaranteed that the sectional variation norm of remains universally bounded. The latter is important for the Donsker class condition for asymptotic efficiency of the HAL-TMLE, but the Donsker class condition could be avoided by using a cross-validated HAL-TMLE that relies on sample splitting (van der Laan and Rose, 2011).
Assuming extra model structure (6), since we apply the least favorable submodel to an HAL-MLE that is likely having the maximal allowed sectional variation norm, the following remark is in order. We suggest to simply extend the statistical model by enlarging the sectional variation norm bounds to for some , even though the original bounds are still used in the definition of the HAL-MLEs. This increase in statistical model does not change the canonical gradient at (known to be an element of the interior of original model), while now a least favorable submodel through the HAL-MLE is allowed to enlarge the sectional variation norm. This makes the construction of a least favorable submodel easier by not having to worry to constrain the sectional variation norm. Since the HAL-MLE has the maximal allowed uniform sectional variation norm , and is consistent, the sectional variation norm of the TMLE will now be slightly larger, and asymptotically approximate . Either way, with the slightly enlarged definition of , we have so that the assumption (4) guarantees that is bounded by a universal constant.
[(Treatment-specific mean)] Condition (8) holds by applying the Cauchy-Schwarz inequality,and using for some . The HAL-MLEs and of and , respectively, can be computed with a lasso-logistic regression estimator with large (approximately ) number of indicator basis functions (see our example section for more details), where we can select the -norm of the coefficient vector with cross-validation. The least favorable submodel through is given by
where . Let , which is thus computed with a simple univariate logistic regression MLE, using as off-set . This defines the TMLE . Recall that is already an NPMLE so that a TMLE-update based on a log-likelihood loss and local least favorable submodel (i.e., with score , will not change this estimator. Let . The HAL-TMLE of is the plug-in estimator .
2.4 Asymptotic efficiency theorem for HAL-TMLE and CV-HAL-TMLE
We have the following identity for the HAL-TMLE: