We wish to infer the value of a statistical parameter at a law from which we sample independent observations. The parameter is a smooth function of the data distribution. We assume that we can define two variation-independent, infinite-dimensional features of the law, its so-called $Q$- and $G$-components, such that if we estimate them consistently at a fast enough joint rate, then we can build a confidence interval (CI) with a given asymptotic level based on a plain targeted minimum loss estimator (TMLE) [30, 29]. Typically, the parameter depends on the law only through its $Q$-component, whereas its canonical gradient depends on the law through both its $Q$- and $G$-components. The estimators of the $Q$- and $G$-components would typically be by-products of machine learning algorithms. We focus on the case that the machine learning algorithm for the $G$-component is fine-tuned by a real-valued parameter. Is it possible to construct an estimator that will lend itself to the construction of a CI, by fine-tuning data-adaptively and in a targeted fashion both the algorithm for the estimation of the $G$-component and the resulting estimator of the parameter of interest?
The general problem that we address is often encountered in observational studies of the effect of an exposure, for instance when one wishes to infer the average effect of a two-level exposure. It is then necessary to account for the fact that the level of exposure is not fully randomized in the observed population. A pivotal object of interest in such studies, the so-called exposure mechanism (that is, the conditional law of exposure given baseline covariates), is an example of what we generally call a $G$-component of the law of the experiment.
A wide range of estimators of the average effect of a two-level exposure require the estimation of the propensity score: Horvitz-Thompson estimators; estimators based on propensity score matching [23, 8, 7] or stratification [1, 24]; any estimator relying on the efficient influence curve, among which double-robust inverse probability of exposure weighted estimators [20, 22, 18] or estimators built based on the targeted minimum loss estimation (TMLE) methodology [30, 29].
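As a minimal sketch of the simplest estimator in this list, the Python snippet below computes a Horvitz-Thompson (inverse probability of exposure weighted) estimate of the average effect on simulated data. The simulation design, the function names and the use of the true propensity score are assumptions made for illustration only; in practice the propensity score would have to be estimated.

```python
import math
import random


def expit(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n, rng):
    """Draw (W, A, Y) from a hypothetical design whose true average effect is 1.0."""
    data = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        g = expit(0.5 * w)                 # true propensity score P(A = 1 | W = w)
        a = 1 if rng.random() < g else 0
        y = 1.0 * a + w + rng.gauss(0.0, 1.0)
        data.append((w, a, y))
    return data


def horvitz_thompson(data, propensity):
    """Inverse-probability-weighted estimate of E[Y(1)] - E[Y(0)]."""
    total = 0.0
    for w, a, y in data:
        g = propensity(w)
        total += a * y / g - (1 - a) * y / (1.0 - g)
    return total / len(data)


rng = random.Random(0)
sample = simulate(20000, rng)
ht_estimate = horvitz_thompson(sample, lambda w: expit(0.5 * w))
print(round(ht_estimate, 2))
```

Each observation is weighted by the inverse of the probability of the exposure level it actually received, which is why values of the propensity score close to zero or one are harmful: they produce very large weights.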
Common methods for the estimation of the propensity score are multivariate logistic regression, high-dimensional propensity score adjustment [25, 2], and a variety of machine learning algorithms [15, 6, 10]. Except in the so-called collaborative variant of TMLE that we will discuss shortly, the estimators of the propensity score can be derived at a preliminary step, essentially regardless of why they are needed and how they are used at the subsequent step. This is problematic because optimality at the preliminary step has little if any relation to optimality at the subsequent step. For instance, the optimal estimator of the propensity score at the preliminary step might take values very close to zero, which disqualifies it as a viable estimator at the subsequent step, let alone an optimal one. In a less dramatic scenario, using an instrumental variable (which influences exposure but not the outcome) to estimate the propensity score could concomitantly yield a better estimator thereof and only increase the variance of the resulting estimator of the effect of exposure [31, 29].
This prompted the development of the so-called collaborative version of the targeted minimum loss estimation methodology [31, 29], where the estimation of the $G$-component is no longer separated from that of the parameter of main interest. More concretely, collaborative TMLE (C-TMLE) consists in building a sequence of estimators of the $G$-component and in selecting one of them by optimizing a criterion that targets the parameter of main interest. For instance, in the above less dramatic scenario, covariates that are strongly predictive of exposure but not of the outcome would be removed, resulting in less bias for the estimator of the parameter of main interest.
The C-TMLE methodology has been adapted to a wide range of fields, including genomics [4, 34], survival analysis, and clinical studies. Because the derivation of C-TMLE estimators is often computationally demanding, scalable versions have also been developed.
A C-TMLE algorithm has been proposed that uses regression shrinkage of the exposure model for the estimation of the propensity score. It sequentially reduces the parameter that determines the amount of penalty placed on the size of the coefficient values, and selects the appropriate parameter by cross-validation. The methodology for continuously fine-tuned, collaborative targeted learning that we develop in this article encompasses this algorithm. Its statistical analysis sheds light on why, and under which assumptions, it would provide valid statistical inference.
At this point in the introduction, we formalize the problem at stake. What follows recasts the introductory paragraph in the theoretical framework that we adopt in the article.
Setting the scene.
Let be independent draws from a law on a set . We view as an element of the statistical model , a collection of plausible laws for . The more we know about , the smaller is . Our primary goal is to infer the value of parameter at , namely, . Our statistical analysis is asymptotic in the number of observations.
We consider the case that is pathwise differentiable at every with respect to (w.r.t.) a tangent set : there exists such that, for every , there exists a submodel satisfying (i) , (ii) for all , (iii)
(the submodel’s score function equals ), and (iv) the real valued mapping is differentiable at with a derivative equal to , where is a shorthand notation for (any measurable ). It is assumed moreover that every is associated with two possibly infinite-dimensional features and such that (i) and are unrelated (i.e., variation independent: knowing anything about tells nothing about and vice versa), (ii) depends on only through , (iii) depends on only through and , and (iv) is a mapping from to . At this early stage, we can introduce the pivotal
for every . The notation is justified (i) because we wish to think of the right-hand-side expression as a remainder term, and (ii) by the fact that and depend on only through and . We consider the case that parameter is such that, for some pseudo-distances and on and ,
where stands for “there exists a universal positive constant such that ”. A remainder term satisfying (1) is said to be double-robust.
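To make the double-robustness property concrete, the following worked derivation computes the remainder term in the case of the average treatment effect, which is the example studied later in the article. The notation is an assumption (standard TMLE notation, not recovered from the displays of this text): $O = (W, A, Y)$, $Q_W$ is the marginal law of $W$, $\bar{Q}(A, W)$ is the conditional expectation of $Y$ given $(A, W)$, $G(W)$ is the conditional probability that $A = 1$ given $W$, and $\ell G(A, W) = A\,G(W) + (1 - A)(1 - G(W))$. A subscript 0 refers to the true law $P_0$.

```latex
% Assumed notation, for illustration only.
\begin{align*}
\Psi(P) &= E_{Q_W}\bigl[\bar{Q}(1, W) - \bar{Q}(0, W)\bigr], \\
D^*(P)(O) &= \frac{2A - 1}{\ell G(A, W)}\bigl(Y - \bar{Q}(A, W)\bigr)
  + \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(P), \\
P_0 D^*(P) &= E_{P_0}\Bigl[\frac{G_0(W)}{G(W)}\bigl(\bar{Q}_0 - \bar{Q}\bigr)(1, W)
  - \frac{1 - G_0(W)}{1 - G(W)}\bigl(\bar{Q}_0 - \bar{Q}\bigr)(0, W)
  + \bar{Q}(1, W) - \bar{Q}(0, W)\Bigr] - \Psi(P), \\
\Psi(P) - \Psi(P_0) + P_0 D^*(P)
  &= E_{P_0}\Bigl[\bigl(G_0 - G\bigr)(W)
  \Bigl(\frac{(\bar{Q}_0 - \bar{Q})(1, W)}{G(W)}
  + \frac{(\bar{Q}_0 - \bar{Q})(0, W)}{1 - G(W)}\Bigr)\Bigr].
\end{align*}
```

The last expression vanishes as soon as either $G = G_0$ or $\bar{Q} = \bar{Q}_0$. If $G$ is bounded away from zero and one, the Cauchy-Schwarz inequality bounds it, up to a constant, by the product of the $L^2(P_0)$ distances between $G$ and $G_0$ and between $\bar{Q}$ and $\bar{Q}_0$, which is the kind of product bound that double-robustness refers to.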
Let be an algorithm for the estimation of , the -component of the true law . Likewise, let (, an open interval of of which the closure contains 0) be an -specific algorithm for the estimation of , the -component of . Formally, we view and each as mappings from
to and , respectively, that can “learn” from the empirical measure some estimators and of and . Set (the superscript 0 stands for “initial”), , and let be any element of the model of which the - and -components equal and . Derived by the mere substitution of for in , is a natural estimator of . It is not targeted toward the inference of in the sense that none of the known features of was derived specifically for the sake of ultimately estimating .
It is well documented in the TMLE literature that one way to target toward is to build from in such a way that
and to infer with . This can be achieved, in such a way that is not modified, by “fluctuating” , a procedure that we will develop in detail in the specific example studied in the article. Then, by (1), the estimator satisfies the asymptotic expansion:
By convention, we agree that small values of correspond with less bias for as an estimator of . Moreover, we assume that there exists , , such that for some , i.e., that consistently estimates at rate . If is also such that for some , and if , then (1) and (2) yield
which may in turn imply the asymptotic linear expansion
with influence function , depending in particular on how data-adaptive are algorithms and (
). By the central limit theorem, (3) guarantees that is asymptotically Gaussian.
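The practical consequence of an asymptotic linear expansion is a Wald-type CI. The snippet below is a minimal sketch of that construction, assuming that one has a point estimate and the values of an estimated influence function at the observations; the function name `wald_ci` and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def wald_ci(estimate, influence_values, level=0.95):
    """Wald-type CI: estimate +/- z * sd(influence function) / sqrt(n)."""
    n = len(influence_values)
    mean = sum(influence_values) / n
    variance = sum((v - mean) ** 2 for v in influence_values) / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard normal quantile
    half_width = z * math.sqrt(variance / n)
    return estimate - half_width, estimate + half_width


# Toy usage: the sample mean is asymptotically linear with influence
# function x - mean, so its CI is the textbook one.
xs = [0.2, 1.4, -0.7, 0.9, 2.1, 0.3, -0.1, 1.2]
x_bar = sum(xs) / len(xs)
lower, upper = wald_ci(x_bar, [x - x_bar for x in xs])
lower99, upper99 = wald_ci(x_bar, [x - x_bar for x in xs], level=0.99)
```

The interval is only as trustworthy as the expansion behind it, which is why the rest of the analysis concentrates on when such an expansion holds.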
We focus on a more challenging situation, where is not necessarily . We anticipate that our analysis is also very relevant at small and moderate sample sizes when .
In order to derive an asymptotic linear expansion similar to (3) from (2) in this situation, we would have to derive an asymptotic expansion of . Unfortunately, we have reasons to believe that this is not possible without targeting (their presentation in an example is deferred to Section 3.3).
Now, observe that the estimators do not cooperate in the sense that, although and (for any two , ) share the same initial estimator , the construction of the latter does not capitalize on that of the former. In contrast, we propose to build collaboratively a continuum of estimators of the form () and to select data-adaptively one among them that will be asymptotically Gaussian, under conditions often encountered in empirical process theory.
Organization of the article.
In Section 2, we lay out a high-level presentation of collaborative TMLE, and state a high-level result. In Sections 3, 4, 5 and 6, we consider a specific example. In Section 3, we particularize the theoretical construction and analysis. In Section 4, we describe two practical instantiations of the estimator developed in Section 3. In Sections 5 and 6, we carry out a multi-faceted simulation study of their performance and comment upon its results. In Section 7, we summarize the content of the article. All the proofs are gathered in the appendix.
2 High-level presentation and result
We now state and prove a general result about continuously fine-tuned, collaborative targeted minimum loss estimation, a version of [Theorem 10.1 in 28]. Its high-level assumptions are clarified in the particular example that we study in the next sections.
From now on, we slightly abuse notation and denote instead of , where and are the - and -components of . Let be a (one-dimensional) subset of (indexed by a real parameter ranging in an open subset of ) such that is twice differentiable over for all (-almost surely). We characterize and by setting, for every and ,
Consider the following inter-dependent assumptions. The first one is indexed by .
There exists an open neighborhood of for which the set is such that is twice differentiable over (-almost surely). Moreover, -almost surely,
For all , we know how to build , with - and -components denoted by and , in such a way that . Moreover, we know how to choose such that
and, for some deterministic , A1 is met and
It holds that , and there exists such that . In addition,
Let be given by . There exist and such that
It holds that . Moreover, there exists a deterministic such that A1 is met, and
Now that we have introduced our high-level assumptions, we can state the corresponding high-level result that they entail. The proof is relegated to the appendix.
Theorem 1 (Asymptotics of the collaborative TMLE – a high-level result).
Under assumptions A2 to A5, it holds that
Commenting on the high-level assumptions.
Assumption A1 concerns both (specifically, how depends on ) and algorithms , (specifically, how smooth is around ). In the particular example studied in the following sections, the counterpart C1 of A1 concerns only algorithms , .
In the example, we show how can be built collaboratively in such a way that A2 is met, under a series of nested assumptions about the smoothness of data-dependent, real-valued functions over , the construction of which notably involves algorithms , . To understand why achieving (6) is relevant, observe that the following oracle version of ,
can be rewritten as
Assumption A3 formalizes the convergence of to its target w.r.t. , and that of to some limit w.r.t. . It does not require that be equal to the target of , but A4 may be impossible to meet when (see below). Condition (7) in A3 is met for instance if the -norm of goes to zero in probability and if the difference falls in a -Donsker class with probability tending to one. As for (8), it typically holds whenever the product of the rates of convergence of and to their limits is . The counterpart of A3 in the example studied in the following sections is C2.
With A4, we assume the existence of an oracle that undersmoothes enough so that is an asymptotically linear estimator of , where we note that is pathwise differentiable in a similar way as . We say that is an oracle because the definition of involves and . It happens that
Under A2 and A3, if , , and if (3) is met with , then A4 holds with and .
It is difficult to assess whether or not A4 is a tall order when is not necessarily , or if .
3 Collaborative TMLE for continuous tuning when inferring the average treatment effect: presentation and analysis
In this section, we specialize the discussion to the inference of a specific statistical parameter, the so-called average treatment effect. Section 3.1 introduces the parameter and recalls the corresponding and from Section 1. Section 3.2 describes the uncooperative construction of a continuum of uncooperative TMLEs. Section 3.3 argues why the selection of one of the uncooperative TMLEs is unlikely to yield a well-behaved (i.e., asymptotically Gaussian) estimator when the product of the rates of convergence of the estimators of and to their limits is not fast enough (i.e., ). Then, Sections 3.4 and 3.5 present the collaborative construction of collaborative TMLEs and how to select one among them that is well behaved, under assumptions that are spelled out in Section 3.6, where the high-level Theorem 1 and its assumptions are specialized.
We observe independent draws , , from , the true law of . It is known that takes its values in . We consider the statistical model that leaves unspecified the law of and the conditional law of given , while we might know that the conditional expectation of given belongs to a set .
The parameter of interest is the average treatment effect,
We choose it because its study provides a wealth of information and paves the way for the analysis of a variety of other parameters often encountered in the statistical literature.
More generally, every gives rise to , , and , which are respectively the marginal law of under , the conditional expectation of given under , the conditional probability that given under , and the couple consisting of and . For each of them, the average treatment effect is , where is given by
For notational conciseness, we let be given by
for every . Note that is the conditional likelihood of given when given is drawn from the Bernoulli law with parameter , hence the “” in the notation. Parameter viewed as a real-valued mapping over is pathwise differentiable at every w.r.t. the maximal tangent set . The efficient influence curve of at is given by
3.2 Uncooperative construction of a continuum of uncooperative TMLEs
Let be an initial estimator of and be a continuum of candidate estimators of indexed by a real-valued tuning parameter , an open interval of . By convention, we agree that small values of correspond with less bias for as an estimator of . Specifically, denoting
the valid loss function for the estimation of given by
for every , where was defined in (14), we assume from now on that the empirical risk increases.
could correspond to fitting a logistic linear regression maximizing the log-likelihood under the constraint that the sum of the absolute values of the coefficients is smaller than or equal to with . We will refer to this algorithm as the LASSO logistic regression algorithm.
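The following Python sketch illustrates a LASSO logistic regression for the propensity score. It is not the constrained formulation described above but its penalized (Lagrangian) counterpart, solved by proximal gradient descent; the correspondence between the penalty level and the tuning parameter of the text, the function names, the step size and the simulated data are all assumptions made for illustration.

```python
import math
import random


def expit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_threshold(v, t):
    """Proximal operator of t * |.| (shrinks v toward zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, a, penalty, step=0.5, n_iter=300):
    """Minimize mean logistic loss + penalty * ||beta||_1 (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * d
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * d
        for x, ai in zip(X, a):
            r = expit(intercept + sum(b * xj for b, xj in zip(beta, x))) - ai
            g0 += r / n
            for j in range(d):
                grad[j] += r * x[j] / n
        intercept -= step * g0
        beta = [soft_threshold(b - step * gj, step * penalty)
                for b, gj in zip(beta, grad)]
    return intercept, beta


def empirical_risk(X, a, intercept, beta):
    """Mean negative Bernoulli log-likelihood of the fitted propensity score."""
    total = 0.0
    for x, ai in zip(X, a):
        g = expit(intercept + sum(b * xj for b, xj in zip(beta, x)))
        total -= math.log(g) if ai == 1 else math.log(1.0 - g)
    return total / len(X)


# Hypothetical data: three covariates, the third one unrelated to exposure.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(400)]
a = [1 if rng.random() < expit(x[0] - 0.5 * x[1]) else 0 for x in X]

fits = {pen: fit_l1_logistic(X, a, pen) for pen in (0.005, 0.05, 1.0)}
l1_norms = {pen: sum(abs(b) for b in beta) for pen, (_, beta) in fits.items()}
risks = {pen: empirical_risk(X, a, *fit) for pen, fit in fits.items()}
```

In this sketch a smaller penalty yields coefficients with a larger absolute sum and a smaller empirical risk, which mirrors the convention that small values of the tuning parameter correspond with less bias.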
Let be the empirical law of . Set arbitrarily and let denote any element of such that the marginal law of under equals and the conditional expectation of given under is equal to , hence on the one hand; and the conditional expectation of given under coincides with on the other hand. Evaluating at yields an estimator of ,
which is not targeted toward the inference of in the sense that none of the known features of was derived specifically for the sake of ultimately estimating .
One way to target toward is to build from in such a way that
and to infer with . This can be achieved by “fluctuating” in the following sense. For every , introduce the so-called “clever covariate” given by
Now, for every , let be characterized by
and be defined like except that the conditional expectation of given under equals (and not ). Clearly, when . Moreover, denoting the loss function given by
for every induced by a , it holds that
a property that prompts us to say that the one-dimensional submodel “fluctuates” “in the direction of” .
The optimal fluctuation of along the above submodel is indexed by the minimizer of the empirical risk
of which the existence is assumed (note that is twice differentiable and strictly convex). We call the TMLE of , and the resulting estimator
the TMLE of . It is readily seen that (22) is equivalent to
where . Since minimizes the differentiable mapping , it holds moreover that
which, combined with the previous display, yields
Finally, the TMLEs () are said to be uncooperative because, although they share the same initial estimator , for any two , , the construction of does not capitalize on that of .
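The targeting step of this section can be sketched in Python for one fixed value of the tuning parameter. The sketch assumes an outcome with values in [0, 1], a logistic fluctuation along the clever covariate, and a minimizer found by bisection on the derivative of the empirical risk (which is increasing because the risk is strictly convex). The names `tmle_ate`, `qbar0`, `g`, and the simulated data are hypothetical.

```python
import math
import random


def expit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    return math.log(p / (1.0 - p))


def clever_covariate(g, a, w):
    """(2A - 1) / lG(A, W), with lG(A, W) = A g(W) + (1 - A)(1 - g(W))."""
    return (2 * a - 1) / (a * g(w) + (1 - a) * (1.0 - g(w)))


def tmle_ate(data, qbar0, g, n_bisect=60):
    """Fluctuate qbar0 along the clever covariate, then plug in."""
    n = len(data)
    offsets = [logit(qbar0(a, w)) for w, a, y in data]
    covs = [clever_covariate(g, a, w) for w, a, y in data]
    ys = [y for w, a, y in data]

    def score(eps):
        # Derivative in eps of the empirical logistic risk.
        return sum(c * (expit(o + eps * c) - y)
                   for o, c, y in zip(offsets, covs, ys)) / n

    lo, hi = -1.0, 1.0
    while score(lo) > 0.0:       # widen the bracket until the score changes sign
        lo *= 2.0
    while score(hi) < 0.0:
        hi *= 2.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    eps = 0.5 * (lo + hi)

    def qbar_star(a, w):
        return expit(logit(qbar0(a, w)) + eps * clever_covariate(g, a, w))

    psi = sum(qbar_star(1, w) - qbar_star(0, w) for w, a, y in data) / n
    eic = [clever_covariate(g, a, w) * (y - qbar_star(a, w))
           + qbar_star(1, w) - qbar_star(0, w) - psi for w, a, y in data]
    return psi, eps, eic


# Hypothetical data; the initial estimator of the outcome regression is
# deliberately crude (constant 0.5) while the propensity score is the true one.
rng = random.Random(2)
data = []
for _ in range(2000):
    w = rng.uniform(-1.0, 1.0)
    a = 1 if rng.random() < expit(0.8 * w) else 0
    y = 1.0 if rng.random() < expit(-0.5 + a + w) else 0.0
    data.append((w, a, y))

psi_star, eps_star, eic = tmle_ate(data, lambda a, w: 0.5, lambda w: expit(0.8 * w))
eic_mean = sum(eic) / len(eic)
```

By construction, the empirical mean of the estimated efficient influence curve (`eic_mean`) is zero up to numerical error, which is the defining equation of the targeting step. Repeating the call for each candidate propensity score estimator gives the uncooperative TMLEs: each run starts afresh from the same initial estimator.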
3.3 Selecting one of the uncooperative TMLEs
At this stage of the procedure, a crucial task is to select one TMLE in the collection of uncooperative TMLEs, one that lends itself to the construction of a CI for with a given asymptotic level. Such a TMLE is necessarily of the form for some well chosen . This could possibly be a deterministic (fixed in ) or a data-driven (random and -dependent) element of .
The risk generated by (18) is given by
is the Kullback-Leibler divergence between the Bernoulli laws with parameters. By Pinsker’s inequality, it holds that
for all . Therefore, if is bounded away from zero and one, then (17) implies
If the deterministic is such that (i) there exist two rates and such that and