# Collaborative targeted minimum loss inference from continuously indexed nuisance parameter estimators

Suppose that we wish to infer the value of a statistical parameter at a law from which we sample independent observations. Suppose that this parameter is smooth and that we can define two variation-independent, infinite-dimensional features of the law, its so-called Q- and G-components (comp.), such that if we estimate them consistently at a fast enough product of rates, then we can build a confidence interval (CI) with a given asymptotic level based on a plain targeted minimum loss estimator (TMLE). The estimators of the Q- and G-comp. would typically be by-products of machine learning algorithms. We focus on the case that the machine learning algorithm for the G-comp. is fine-tuned by a real-valued parameter h. Then, a plain TMLE with an h chosen by cross-validation would typically not lend itself to the construction of a CI, because the selection of h would trade off its empirical bias with something akin to the empirical variance of the estimator of the G-comp., as opposed to that of the TMLE. A collaborative TMLE (C-TMLE) might, however, succeed in achieving the relevant trade-off. We construct a C-TMLE and show that, under high-level empirical-process conditions, and if there exists an oracle h that makes a bulky remainder term asymptotically Gaussian, then the C-TMLE is asymptotically Gaussian, hence amenable to building a CI provided that its asymptotic variance can be estimated too. We illustrate the construction and main result with the inference of the average treatment effect, where the Q-comp. consists of a marginal law and a conditional expectation, and the G-comp. is a propensity score (a conditional probability). We also conduct a multi-faceted simulation study to investigate the empirical properties of the collaborative TMLE when the G-comp. is estimated by the LASSO. Here, h is the bound on the ℓ1-norm of the candidate coefficients.

## 1 Introduction

We wish to infer the value of a statistical parameter at a law from which we sample independent observations. The parameter is a smooth function of the data distribution. We assume that we can define two variation-independent, infinite-dimensional features of the law, its so-called Q- and G-components, such that if we estimate them consistently at a fast enough joint rate, then we can build a confidence interval (CI) with a given asymptotic level based on a plain targeted minimum loss estimator (TMLE) [30, 29]. Typically, the parameter depends on the law only through its Q-component, whereas its canonical gradient depends on the law through both its Q- and G-components. The estimators of the Q- and G-components would typically be by-products of machine learning algorithms. We focus on the case that the machine learning algorithm for the G-component is fine-tuned by a real-valued parameter h. Is it possible to construct an estimator that will lend itself to the construction of a CI, by fine-tuning h data-adaptively and in a targeted fashion, both for the algorithm estimating the G-component and for the resulting estimator of the parameter of interest?

#### Literature overview.

The general problem that we address is often encountered in observational studies of the effect of an exposure, for instance when one wishes to infer the average effect of a two-level exposure. It is then necessary to account for the fact that the level of exposure is not fully randomized in the observed population. A pivotal object of interest in such studies, the so-called exposure mechanism (that is, the conditional law of exposure given baseline covariates), is an example of what we generally call a G-component of the law of the experiment.

A wide range of estimators of the average effect of a two-level exposure require the estimation of the propensity score: Horvitz-Thompson estimators [9]; estimators based on propensity score matching [23, 8, 7] or stratification [1, 24]; any estimator relying on the efficient influence curve, among which double-robust inverse probability of exposure weighted estimators [20, 22, 18] or estimators built based on the targeted minimum loss estimation (TMLE) methodology [30, 29].

Common methods for the estimation of the propensity score are multivariate logistic regression [14], high-dimensional propensity score adjustment [25, 2], and a variety of machine learning algorithms [15, 6, 10]. Except in the so-called collaborative variant of TMLE that we will discuss shortly, the estimators of the propensity score can be derived at a preliminary step, essentially regardless of why they are needed and how they are used at the subsequent step. This is problematic because optimality at the preliminary step has little if any relation to optimality at the subsequent step. For instance, the optimal estimator of the propensity score at the preliminary step might take values very close to zero, thereby disqualifying it as a viable estimator at the subsequent step, not to mention an optimal one. In a less dramatic scenario, using an instrumental variable (which only influences exposure but not the outcome) to estimate the propensity score could concomitantly yield a better estimator thereof and yet only increase the variance of the resulting estimator of the effect of exposure [31, 29].

This prompted the development of the so-called collaborative version of the targeted minimum loss estimation methodology [31, 29], where the estimation of the G-component is no longer separated from that of the parameter of main interest. More concretely, collaborative TMLE (C-TMLE) consists of building a sequence of estimators of the G-component and of selecting one of them by optimizing a criterion that targets the parameter of main interest. For instance, in the above less dramatic scenario, covariates that are strongly predictive of exposure but not of the outcome would be removed, resulting in less bias for the estimator of the parameter of main interest.

The C-TMLE methodology has been adapted to a wide range of fields, including genomics [4, 34], survival analysis [27], and clinical studies [11]. Because the derivation of C-TMLE estimators is often computationally demanding, scalable versions have also been developed [11].

In [26], the authors propose a C-TMLE algorithm that uses regression shrinkage of the exposure model for the estimation of the propensity score. It sequentially reduces the parameter that determines the amount of penalty placed on the size of the coefficient values, and selects the appropriate parameter by cross-validation. The methodology for continuously fine-tuned, collaborative targeted learning that we develop in this article encompasses the algorithm of [26]. Its statistical analysis sheds light on why, and under which assumptions, it would provide valid statistical inference.

The present study builds upon [28]. The methodology is also studied in [12, 13], the latter being an example of a real-life application.

At this point in the introduction, we wish to formalize the problem at stake. What follows recasts the introductory paragraph in the theoretical framework that we adopt in the article.

#### Setting the scene.

Let $O_1, \dots, O_n$ be independent draws from a law $P_0$ on a set $\mathcal{O}$. We view $P_0$ as an element of the statistical model $\mathcal{M}$, a collection of plausible laws for $O$. The more we know about $P_0$, the smaller is $\mathcal{M}$. Our primary goal is to infer the value of a parameter $\Psi$ at $P_0$, namely, $\Psi(P_0)$. Our statistical analysis is asymptotic in the number $n$ of observations.

We consider the case that $\Psi$ is pathwise differentiable at every $P \in \mathcal{M}$ with respect to (w.r.t.) a tangent set $\mathcal{S}(P)$: there exists $D^*(P)$ such that, for every $s \in \mathcal{S}(P)$, there exists a submodel $\{P_t : t\} \subset \mathcal{M}$ satisfying (i) $P_{t=0} = P$, (ii) $P_t \in \mathcal{M}$ for all $t$, (iii)

$$\left.\frac{d}{dt} \log \frac{dP_t}{dP}(O)\right|_{t=0} = s(O)$$

(the submodel's score function equals $s$), and (iv) the real-valued mapping $t \mapsto \Psi(P_t)$ is differentiable at $t = 0$ with a derivative equal to $P D^*(P) s$, where $Pf$ is a shorthand notation for $E_P[f(O)]$ (any measurable $f$). It is assumed moreover that every $P \in \mathcal{M}$ is associated with two possibly infinite-dimensional features $Q \equiv Q(P)$ and $G \equiv G(P)$ such that (i) $Q$ and $G$ are unrelated (i.e., variation independent: knowing anything about $Q$ tells nothing about $G$ and vice versa), (ii) $\Psi(P)$ depends on $P$ only through $Q$, (iii) $D^*(P)$ depends on $P$ only through $Q$ and $G$, and (iv) $P \mapsto (Q(P), G(P))$ is a mapping from $\mathcal{M}$ to $\mathcal{Q} \times \mathcal{G}$. At this early stage, we can introduce the pivotal

$$\mathrm{Rem}_{20}(Q, G) \equiv \Psi(P) - \Psi(P_0) + P_0 D^*(P)$$

for every $P \in \mathcal{M}$. The notation is justified (i) because we wish to think of the right-hand-side expression as a remainder term, and (ii) by the fact that $\Psi(P)$ and $D^*(P)$ depend on $P$ only through $Q$ and $G$. We consider the case that the parameter $\Psi$ is such that, for some pseudo-distances $d_Q$ and $d_G$ on $\mathcal{Q}$ and $\mathcal{G}$,

$$\left|\mathrm{Rem}_{20}(Q, G)\right| \lesssim d_Q(Q, Q_0) \times d_G(G, G_0), \tag{1}$$

where $a \lesssim b$ stands for "there exists a universal positive constant $c$ such that $a \le c b$". A remainder term satisfying (1) is said to be double-robust.

Let $\hat{Q}$ be an algorithm for the estimation of $Q_0$, the $Q$-component of the true law $P_0$. Likewise, let $\hat{G}_h$ ($h \in \mathcal{H}$, an open interval of $\mathbb{R}$ of which the closure contains 0) be an $h$-specific algorithm for the estimation of $G_0$, the $G$-component of $P_0$. Formally, we view $\hat{Q}$ and each $\hat{G}_h$ as mappings from

$$\bigcup_{N \ge 1} \left\{ N^{-1} \sum_{i=1}^{N} \mathrm{Dirac}(o_i) : o_1, \dots, o_N \in \mathcal{O} \right\}$$

to $\mathcal{Q}$ and $\mathcal{G}$, respectively, that can "learn" from the empirical measure $P_n$ some estimators $Q_n^0 \equiv \hat{Q}(P_n)$ and $G_{n,h} \equiv \hat{G}_h(P_n)$ of $Q_0$ and $G_0$ (the superscript 0 stands for "initial"). Let $P_{n,h}^0$ be any element of the model $\mathcal{M}$ of which the $Q$- and $G$-components equal $Q_n^0$ and $G_{n,h}$. Derived by the mere substitution of $P_{n,h}^0$ for $P_0$ in $\Psi$, $\Psi(P_{n,h}^0)$ is a natural estimator of $\Psi(P_0)$. It is not targeted toward the inference of $\Psi(P_0)$ in the sense that none of the known features of $P_{n,h}^0$ was derived specifically for the sake of ultimately estimating $\Psi(P_0)$.

It is well documented in the TMLE literature that one way to target $\Psi(P_{n,h}^0)$ toward $\Psi(P_0)$ is to build $P_{n,h}^*$ from $P_{n,h}^0$ in such a way that

$$P_n D^*(P_{n,h}^*) = o_P(1/\sqrt{n})$$

and to infer $\Psi(P_0)$ with $\Psi(P_{n,h}^*)$. This can be achieved, in such a way that $G_{n,h}$ is not modified, by "fluctuating" $Q_n^0$, a procedure that we will develop in detail in the specific example studied in the article. Then, by (1), the estimator $\Psi(P_{n,h}^*)$ satisfies the asymptotic expansion:

$$\Psi(P_{n,h}^*) - \Psi(P_0) = (P_n - P_0)\, D^*(P_{n,h}^*) + \mathrm{Rem}_{20}(Q_{n,h}^*, G_{n,h}) + o_P(1/\sqrt{n}). \tag{2}$$
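To make the origin of (2) explicit: by the definition of $\mathrm{Rem}_{20}$,

```latex
\Psi(P^*_{n,h}) - \Psi(P_0)
  = \mathrm{Rem}_{20}(Q^*_{n,h}, G_{n,h}) - P_0 D^*(P^*_{n,h})
  = (P_n - P_0)\, D^*(P^*_{n,h}) + \mathrm{Rem}_{20}(Q^*_{n,h}, G_{n,h})
    - P_n D^*(P^*_{n,h}),
```

and the last term is $o_P(1/\sqrt{n})$ by the targeting condition displayed above.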

By convention, we agree that small values of $h$ correspond with less bias for $G_{n,h}$ as an estimator of $G_0$. Moreover, we assume that there exists a deterministic $h_n \in \mathcal{H}$, $h_n \to 0$, such that $d_G(G_{n,h_n}, G_0) = o_P(n^{-\gamma_G})$ for some $\gamma_G > 0$, i.e., that $G_{n,h_n}$ consistently estimates $G_0$ at rate $n^{-\gamma_G}$. If $\hat{Q}$ is also such that $d_Q(Q_{n,h_n}^*, Q_0) = o_P(n^{-\gamma_Q})$ for some $\gamma_Q > 0$, and if $\gamma_Q + \gamma_G \ge 1/2$, then (1) and (2) yield

$$\Psi(P_{n,h_n}^*) - \Psi(P_0) = (P_n - P_0)\, D^*(P_{n,h_n}^*) + o_P(1/\sqrt{n})$$

which may in turn imply the asymptotic linear expansion

$$\Psi(P_{n,h_n}^*) - \Psi(P_0) = (P_n - P_0)\, \mathrm{IF} + o_P(1/\sqrt{n}), \tag{3}$$

with an influence function $\mathrm{IF}$ depending in particular on how data-adaptive the algorithms $\hat{Q}$ and $\hat{G}_h$ are. By the central limit theorem, (3) guarantees that $\Psi(P_{n,h_n}^*)$ is asymptotically Gaussian.

We focus on a more challenging situation, where the product of the rates of convergence of the estimators of $Q_0$ and $G_0$ to their limits is not necessarily $o_P(1/\sqrt{n})$. We anticipate that our analysis is also very relevant at small and moderate sample sizes even when the product is $o_P(1/\sqrt{n})$.

In order to derive an asymptotic linear expansion similar to (3) from (2) in this situation, we would have to derive an asymptotic expansion of $\mathrm{Rem}_{20}(Q_{n,h_n}^*, G_{n,h_n})$. Unfortunately, we have reasons to believe that this is not possible without targeting the selection of $h_n$ (their presentation in an example is deferred to Section 3.3).

Now, observe that the estimators $\Psi(P_{n,h}^*)$ do not cooperate in the sense that, although $P_{n,h}^*$ and $P_{n,h'}^*$ (for any two $h \neq h'$) share the same initial estimator $Q_n^0$, the construction of the latter does not capitalize on that of the former. In contrast, we propose to build collaboratively a continuum of estimators of the form $\Psi(P_{n,h}^*)$ ($h \in \mathcal{H}$) and to select data-adaptively one among them that will be asymptotically Gaussian, under conditions often encountered in empirical process theory.

#### Organization of the article.

In Section 2, we lay out a high-level presentation of collaborative TMLE, and state a high-level result. In Sections 3, 4, 5 and 6, we consider a specific example. In Section 3, we particularize the theoretical construction and analysis. In Section 4, we describe two practical instantiations of the estimator developed in Section 3. In Sections 5 and 6, we carry out a multi-faceted simulation study of their performances and comment upon its results. In Section 7, we summarize the content of the article. All the proofs are gathered in the appendix.

## 2 High-level presentation and result

We now state and prove a general result about continuously fine-tuned, collaborative targeted minimum loss estimation, a version of [28, Theorem 10.1]. Its high-level assumptions are clarified in the particular example that we study in the next sections.

From now on, we slightly abuse notation and denote $D^*(Q, G)$ instead of $D^*(P)$, where $Q$ and $G$ are the $Q$- and $G$-components of $P$. Let $\{G_t : t \in T\}$ be a (one-dimensional) subset of $\mathcal{G}$ (indexed by a real parameter $t$ ranging in an open subset $T$ of $\mathbb{R}$) such that $t \mapsto D^*(Q, G_t)(O)$ is twice differentiable over $T$ for all $Q$ ($P_0$-almost surely). We characterize $\partial_h D^*$ and $\partial_h^2 D^*$ by setting, for every $h \in T$ and $O$,

$$\partial_h D^*(Q, G_{\cdot})(O) \equiv \left.\frac{d}{dt} D^*(Q, G_t)(O)\right|_{t=h}, \qquad \partial_h^2 D^*(Q, G_{\cdot})(O) \equiv \left.\frac{d^2}{dt^2} D^*(Q, G_t)(O)\right|_{t=h}. \tag{4}$$

Consider the following inter-dependent assumptions. The first one is indexed by a constant $c > 0$.

A1

There exists an open neighborhood $T$ of $h_n$ for which the set $\{G_{n,t} : t \in T\}$ is such that $t \mapsto D^*(Q, G_{n,t})(O)$ is twice differentiable over $T$ ($P_0$-almost surely). Moreover, $P_0$-almost surely,

$$\sup_{h \in T}\left|\partial_h^2 D^*(Q, G_{n,\cdot})(O)\right| \le c.$$
A2

For all $h \in \mathcal{H}$, we know how to build $P_{n,h}^*$, with $Q$- and $G$-components denoted by $Q_{n,h}^*$ and $G_{n,h}$. Moreover, we know how to choose $h_n \in \mathcal{H}$ such that

$$P_n D^*(Q_{n,h_n}^*, G_{n,h_n}) = o_P(1/\sqrt{n}) \tag{5}$$

and, for some deterministic $c > 0$, A1 is met and

$$P_n \partial_{h_n} D^*(Q_{n,h_n}^*, G_{n,\cdot}) = o_P(1/n^{1/4}). \tag{6}$$
A3

It holds that $d_Q(Q_{n,h_n}^*, Q_1) = o_P(1)$ for some $Q_1 \in \mathcal{Q}$, and there exists $G_1 \in \mathcal{G}$ such that $d_G(G_{n,h_n}, G_1) = o_P(1)$. In addition,

$$(P_n - P_0)\left(D^*(Q_{n,h_n}^*, G_{n,h_n}) - D^*(Q_1, G_0)\right) = o_P(1/\sqrt{n}), \tag{7}$$
$$\mathrm{Rem}_{20}(Q_{n,h_n}^*, G_{n,h_n}) - \mathrm{Rem}_{20}(Q_1, G_{n,h_n}) = o_P(1/\sqrt{n}). \tag{8}$$
A4

Let $\Phi_0 : \mathcal{G} \to \mathbb{R}$ be given by $\Phi_0(G) \equiv \mathrm{Rem}_{20}(Q_1, G)$. There exist $\tilde{h}_n \in \mathcal{H}$ and $P_1 \in \mathcal{M}$ such that

$$\Phi_0(G_{n,\tilde{h}_n}) - \Phi_0(G_0) = (P_n - P_0)\, \Delta(P_1) + o_P(1/\sqrt{n}). \tag{9}$$
A5

It holds that $h_n - \tilde{h}_n = O_P(n^{-1/4})$. Moreover, there exists a deterministic $c > 0$ such that A1 is met, and

$$(P_n - P_0)\left(D^*(Q_1, G_{n,h_n}) - D^*(Q_1, G_{n,\tilde{h}_n})\right) = o_P(1/\sqrt{n}), \tag{10}$$
$$(h_n - \tilde{h}_n) \times P_0\left(\partial_{h_n} D^*(Q_{n,h_n}^*, G_{n,\cdot}) - \partial_{h_n} D^*(Q_1, G_{n,\cdot})\right) = o_P(1/\sqrt{n}), \tag{11}$$
$$(P_n - P_0)\left(\partial_{h_n} D^*(Q_{n,h_n}^*, G_{n,\cdot}) - \partial_{h_n} D^*(Q_1, G_{n,\cdot})\right) = o_P(1/\sqrt{n}). \tag{12}$$

Now that we have introduced our high-level assumptions, we can state the corresponding high-level result that they entail. The proof is relegated to the appendix.

###### Theorem 1 (Asymptotics of the collaborative TMLE – a high-level result).

Under assumptions A2 to A5, it holds that

$$\Psi(P_{n,h_n}^*) - \Psi(P_0) = (P_n - P_0)\left(D^*(Q_1, G_0) + \Delta(P_1)\right) + o_P(1/\sqrt{n}). \tag{13}$$

#### Commenting on the high-level assumptions.

Assumption A1 concerns both $\Psi$ (specifically, how $D^*(Q, G)$ depends on $G$) and the algorithms $\hat{G}_h$, $h \in \mathcal{H}$ (specifically, how smooth $t \mapsto G_{n,t}$ is around $h_n$). In the particular example studied in the following sections, the counterpart C1 of A1 concerns only the algorithms $\hat{G}_h$, $h \in \mathcal{H}$.

In the example, we show how $P_{n,h}^*$ can be built collaboratively in such a way that A2 is met, under a series of nested assumptions about the smoothness of data-dependent, real-valued functions over $\mathcal{H}$, the construction of which notably involves the algorithms $\hat{G}_h$, $h \in \mathcal{H}$. To understand why achieving (6) is relevant, observe that the following oracle version of $P_n \partial_{h_n} D^*(Q_{n,h_n}^*, G_{n,\cdot})$,

$$\lim_{\substack{t \to 0 \\ t \neq 0}} \frac{1}{t}\, P_0\left(D^*(Q_{n,h_n}^*, G_{n,h_n+t}) - D^*(Q_{n,h_n}^*, G_{n,h_n})\right),$$

can be rewritten as

$$\lim_{\substack{t \to 0 \\ t \neq 0}} \frac{1}{t}\left(\mathrm{Rem}_{20}(Q_{n,h_n}^*, G_{n,h_n+t}) - \mathrm{Rem}_{20}(Q_{n,h_n}^*, G_{n,h_n})\right)$$

in view of (1). Thus, achieving (6) relates to finding critical points of $t \mapsto \mathrm{Rem}_{20}(Q_{n,h_n}^*, G_{n,t})$.

Assumption A3 formalizes the convergence of $Q_{n,h_n}^*$ to its target $Q_1$ w.r.t. $d_Q$, and that of $G_{n,h_n}$ to some limit $G_1$ w.r.t. $d_G$. It does not require that $G_1$ be equal to the target $G_0$ of the algorithms $\hat{G}_h$, but A4 may be impossible to meet when $G_1 \neq G_0$ (see below). Condition (7) in A3 is met for instance if the $L^2(P_0)$-norm of $D^*(Q_{n,h_n}^*, G_{n,h_n}) - D^*(Q_1, G_0)$ goes to zero in probability and if the difference falls in a $P_0$-Donsker class with probability tending to one. As for (8), it typically holds whenever the product of the rates of convergence of $Q_{n,h_n}^*$ and $G_{n,h_n}$ to their limits is $o_P(1/\sqrt{n})$. The counterpart of A3 in the example studied in the following sections is C2.

With A4, we assume the existence of an oracle $\tilde{h}_n$ that undersmoothes enough so that $\Phi_0(G_{n,\tilde{h}_n})$ is an asymptotically linear estimator of $\Phi_0(G_0)$, where we note that $\Phi_0$ is pathwise differentiable in a similar way as $\Psi$. We say that $\tilde{h}_n$ is an oracle because the definition of $\Phi_0$ involves $Q_1$ and $P_0$. It happens that

###### Lemma 2.

Under A2 and A3, if $Q_1 = Q_0$, $G_1 = G_0$, and if (3) is met with influence function $\mathrm{IF}$, then A4 holds with $\tilde{h}_n = h_n$ and $\Delta(P_1) = \mathrm{IF} - D^*(Q_0, G_0)$.

It is difficult to assess whether or not A4 is a tall order when $Q_1$ is not necessarily $Q_0$, or if $G_1 \neq G_0$.

Finally, A5 states that the distance between $h_n$ and $\tilde{h}_n$, introduced in A2 and A4 respectively, is of order $O_P(n^{-1/4})$ at most. Its conditions (10) and (12) are of similar nature as (7). As for (11), the Cauchy-Schwarz inequality reveals that it is met if the $L^2(P_0)$-norm of $\partial_{h_n} D^*(Q_{n,h_n}^*, G_{n,\cdot}) - \partial_{h_n} D^*(Q_1, G_{n,\cdot})$ is $o_P(n^{-1/4})$.

## 3 Collaborative TMLE for continuous tuning when inferring the average treatment effect: presentation and analysis

In this section, we specialize the discussion to the inference of a specific statistical parameter, the so-called average treatment effect. Section 3.1 introduces the parameter and recalls the corresponding $Q$- and $G$-components from Section 1. Section 3.2 describes the uncooperative construction of a continuum of uncooperative TMLEs. Section 3.3 argues why the selection of one of the uncooperative TMLEs is unlikely to yield a well-behaved (i.e., asymptotically Gaussian) estimator when the product of the rates of convergence of the estimators of $Q_0$ and $G_0$ to their limits is not fast enough (i.e., not $o_P(1/\sqrt{n})$). Then, Sections 3.4 and 3.5 present the collaborative construction of collaborative TMLEs and how to select one among them that is well behaved, under assumptions that are spelled out in Section 3.6, where the high-level Theorem 1 and its assumptions are specialized.

### 3.1 Preliminary

We observe independent draws $O_1, \dots, O_n$ of $O \equiv (W, A, Y)$ from $P_0$, the true law of $O$, where $W$ is a vector of baseline covariates, $A \in \{0, 1\}$ is a binary exposure, and $Y \in [0, 1]$ is a bounded outcome. We consider the statistical model $\mathcal{M}$ that leaves unspecified the marginal law of $W$ and the conditional law of $A$ given $W$, while we might know that the conditional expectation of $Y$ given $(A, W)$ belongs to a set $\overline{\mathcal{Q}}$.

Introduce

$$\bar{Q}_0(A, W) \equiv E_{P_0}(Y \mid A, W), \qquad G_0(W) \equiv P_0(A = 1 \mid W).$$

The parameter of interest is the average treatment effect,

$$\psi_0 \equiv E_{Q_{W,0}}\left(\bar{Q}_0(1, W) - \bar{Q}_0(0, W)\right).$$

We choose it because its study provides a wealth of information and paves the way for the analysis of a variety of other parameters often encountered in the statistical literature.

More generally, every $P \in \mathcal{M}$ gives rise to $Q_W$, $\bar{Q}$, $G$ and $Q \equiv (Q_W, \bar{Q})$, which are respectively the marginal law of $W$ under $P$, the conditional expectation of $Y$ given $(A, W)$ under $P$, the conditional probability that $A = 1$ given $W$ under $P$, and the couple consisting of $Q_W$ and $\bar{Q}$. For every $P \in \mathcal{M}$, the average treatment effect is $\Psi(P)$, where $\Psi$ is given by

$$\Psi(P) \equiv E_{Q_W}\left(\bar{Q}(1, W) - \bar{Q}(0, W)\right).$$

For notational conciseness, we let $\ell_G$ be given by

$$\ell_G(A, W) \equiv A\, G(W) + (1 - A)(1 - G(W)) \tag{14}$$

for every $G \in \mathcal{G}$. Note that $\ell_G(A, W)$ is the conditional likelihood of $A$ given $W$ when $A$ given $W$ is drawn from the Bernoulli law with parameter $G(W)$, hence the "$\ell$" in the notation. The parameter $\Psi$, viewed as a real-valued mapping over $\mathcal{M}$, is pathwise differentiable at every $P \in \mathcal{M}$ w.r.t. the maximal tangent set. The efficient influence curve of $\Psi$ at $P$ is given by

$$D^*(P)(O) \equiv D_2^*(\bar{Q}, G)(O) + \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(P), \quad \text{where} \tag{15}$$
$$D_2^*(\bar{Q}, G)(O) \equiv \frac{2A - 1}{\ell_G(A, W)}\left(Y - \bar{Q}(A, W)\right).$$
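In this notation, $D^*$ can be evaluated pointwise from estimates of $\bar{Q}$ and $G$. The following minimal Python sketch (the function names `ell` and `eic` are ours, for illustration only) computes (14) and (15) on arrays of observations:

```python
import numpy as np

def ell(G, A):
    """Conditional likelihood of A given W under propensity G(W), eq. (14)."""
    return A * G + (1 - A) * (1 - G)

def eic(A, Y, Qbar1, Qbar0, G, psi):
    """Efficient influence curve D*(P)(O) of the ATE, eq. (15).

    Qbar1, Qbar0: arrays of Qbar(1, W_i) and Qbar(0, W_i);
    G: array of G(W_i); psi: the value Psi(P)."""
    QbarA = np.where(A == 1, Qbar1, Qbar0)
    D2 = (2 * A - 1) / ell(G, A) * (Y - QbarA)   # D*_2(Qbar, G)(O)
    return D2 + Qbar1 - Qbar0 - psi
```

The empirical mean of `eic` over the sample is the quantity $P_n D^*(P)$ that the targeting step of Section 3.2 sets to zero.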

Recall definition (1). It is easy to check that, for every $P \in \mathcal{M}$,

$$\mathrm{Rem}_{20}(\bar{Q}, G) = E_{P_0}\left[(2A - 1)\left(1 - \frac{\ell_{G_0}(A, W)}{\ell_G(A, W)}\right)\left(\bar{Q}(A, W) - \bar{Q}_0(A, W)\right)\right]. \tag{16}$$

Writing $\mathrm{Rem}_{20}(\bar{Q}, G)$ instead of $\mathrm{Rem}_{20}(Q, G)$ slightly abuses notation, but is justified because integrating out $Y$ in the RHS of (16) reveals that it only depends on $\bar{Q}$, $G$ and $P_0$. Furthermore, by the Cauchy-Schwarz inequality, it holds that

$$\mathrm{Rem}_{20}(\bar{Q}, G)^2 \le P_0\left(\bar{Q} - \bar{Q}_0\right)^2 \times P_0\left(\frac{G - G_0}{\ell_G}\right)^2. \tag{17}$$

### 3.2 Uncooperative construction of a continuum of uncooperative TMLEs

#### Prerequisites.

Let $\bar{Q}_n^0$ be an initial estimator of $\bar{Q}_0$ and $\{G_{n,h} : h \in \mathcal{H}\}$ be a continuum of candidate estimators of $G_0$ indexed by a real-valued tuning parameter $h \in \mathcal{H}$, an open interval of $\mathbb{R}$. By convention, we agree that small values of $h$ correspond with less bias for $G_{n,h}$ as an estimator of $G_0$. Specifically, denoting $L_1$ the valid loss function for the estimation of $G_0$ given by

$$L_1(G)(A, W) \equiv -\log \ell_G(A, W) = -A \log G(W) - (1 - A)\log\left(1 - G(W)\right) \tag{18}$$

for every $G \in \mathcal{G}$, where $\ell_G$ was defined in (14), we assume from now on that the empirical risk $h \mapsto P_n L_1(G_{n,h})$ increases.

For example, $G_{n,h}$ could correspond to fitting a logistic linear regression by maximizing the log-likelihood under the constraint that the sum of the absolute values of the coefficients is smaller than or equal to a bound indexed by $h$ (the smaller $h$, the larger the bound). We will refer to this algorithm as the LASSO logistic regression algorithm.
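As an illustration, here is a sketch of such a family of candidate propensity score estimators built with scikit-learn's L1-penalized logistic regression. Note that scikit-learn implements the penalized (Lagrangian) form rather than the constrained form described above; its inverse penalty strength `C` stands in for the $\ell_1$ bound (large `C` means a weak penalty, playing the role of a small $h$), so this is an approximation of the algorithm just described, not a literal implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 500, 10
W = rng.normal(size=(n, d))
# simulated exposure: the true propensity depends on two covariates only
logits = 0.8 * W[:, 0] - 0.5 * W[:, 1]
A = rng.binomial(1, 1 / (1 + np.exp(-logits)))

def lasso_propensity(W, A, C):
    """L1-penalized logistic regression estimate of G_0(W) = P(A=1 | W)."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(W, A)
    return model.predict_proba(W)[:, 1], model.coef_.ravel()

G_hat, coefs = lasso_propensity(W, A, C=0.1)
```

Sweeping `C` over a grid produces the continuum $\{G_{n,h} : h \in \mathcal{H}\}$; stronger penalties (smaller `C`) zero out more coefficients, trading bias for variance in the estimation of $G_0$.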

#### Uncooperative TMLEs.

Let $Q_{W,n}$ be the empirical law of $W_1, \dots, W_n$. Set $h \in \mathcal{H}$ arbitrarily and let $P_{n,h}^0$ denote any element of $\mathcal{M}$ such that, on the one hand, the marginal law of $W$ under $P_{n,h}^0$ equals $Q_{W,n}$ and the conditional expectation of $Y$ given $(A, W)$ under $P_{n,h}^0$ is equal to $\bar{Q}_n^0$; and, on the other hand, the conditional probability that $A = 1$ given $W$ under $P_{n,h}^0$ coincides with $G_{n,h}$. Evaluating $\Psi$ at $P_{n,h}^0$ yields an estimator of $\psi_0$,

$$\Psi(P_{n,h}^0) = \frac{1}{n}\sum_{i=1}^{n}\left(\bar{Q}_n^0(1, W_i) - \bar{Q}_n^0(0, W_i)\right),$$

which is not targeted toward the inference of $\psi_0$ in the sense that none of the known features of $P_{n,h}^0$ was derived specifically for the sake of ultimately estimating $\psi_0$.

One way to target $\Psi(P_{n,h}^0)$ toward $\psi_0$ is to build $P_{n,h}^*$ from $P_{n,h}^0$ in such a way that

$$P_n D^*(P_{n,h}^*) = o_P(1/\sqrt{n})$$

and to infer $\psi_0$ with $\Psi(P_{n,h}^*)$. This can be achieved by "fluctuating" $\bar{Q}_n^0$ in the following sense. For every $G \in \mathcal{G}$, introduce the so-called "clever covariate" $C(G)$ given by

$$C(G)(A, W) \equiv \frac{2A - 1}{\ell_G(A, W)}. \tag{19}$$

Now, for every $\varepsilon \in \mathbb{R}$, let $\bar{Q}_{n,h,\varepsilon}^0$ be characterized by

$$\mathrm{logit}\left(\bar{Q}_{n,h,\varepsilon}^0(A, W)\right) \equiv \mathrm{logit}\left(\bar{Q}_n^0(A, W)\right) + \varepsilon\, C(G_{n,h})(A, W) \tag{20}$$

and $P_{n,h,\varepsilon}^0$ be defined like $P_{n,h}^0$ except that the conditional expectation of $Y$ given $(A, W)$ under $P_{n,h,\varepsilon}^0$ equals $\bar{Q}_{n,h,\varepsilon}^0$ (and not $\bar{Q}_n^0$). Clearly, $P_{n,h,\varepsilon}^0 = P_{n,h}^0$ when $\varepsilon = 0$. Moreover, denoting $L_2$ the loss function given by

$$L_2(\bar{Q})(O) \equiv -Y\log\bar{Q}(A, W) - (1 - Y)\log\left(1 - \bar{Q}(A, W)\right)$$

for every $\bar{Q}$ induced by a $P \in \mathcal{M}$, it holds that

$$\frac{d}{d\varepsilon}\, L_2\left(\bar{Q}_{n,h,\varepsilon}^0\right)(O) = -D_2^*\left(\bar{Q}_{n,h,\varepsilon}^0, G_{n,h}\right)(O),$$

a property that prompts us to say that the one-dimensional submodel $\{\bar{Q}_{n,h,\varepsilon}^0 : \varepsilon \in \mathbb{R}\}$ "fluctuates" $\bar{Q}_n^0$ "in the direction of" $D_2^*$.

The optimal fluctuation of $\bar{Q}_n^0$ along the above submodel is indexed by the minimizer $\varepsilon_{n,h}$ of the empirical risk

$$\varepsilon \mapsto P_n L_2\left(\bar{Q}_{n,h,\varepsilon}^0\right), \tag{21}$$

of which the existence is assumed (note that this mapping is twice differentiable and strictly convex). We call $\bar{Q}_{n,h}^* \equiv \bar{Q}_{n,h,\varepsilon_{n,h}}^0$ the TMLE of $\bar{Q}_0$, and the resulting estimator

$$\psi_{n,h}^* \equiv \Psi(P_{n,h}^*) = \frac{1}{n}\sum_{i=1}^{n}\left(\bar{Q}_{n,h,\varepsilon_{n,h}}^0(1, W_i) - \bar{Q}_{n,h,\varepsilon_{n,h}}^0(0, W_i)\right) \tag{22}$$

the TMLE of $\psi_0$. It is readily seen that (22) is equivalent to

$$P_n\left(D^*(P_{n,h}^*) - D_2^*\left(\bar{Q}_{n,h}^*, G_{n,h}\right)\right) = 0,$$

where $P_{n,h}^* \equiv P_{n,h,\varepsilon_{n,h}}^0$. Since $\varepsilon_{n,h}$ minimizes the differentiable mapping $\varepsilon \mapsto P_n L_2(\bar{Q}_{n,h,\varepsilon}^0)$, it holds moreover that

$$P_n D_2^*\left(\bar{Q}_{n,h}^*, G_{n,h}\right) = 0 \tag{23}$$

which, combined with the previous display, yields

$$P_n D^*(P_{n,h}^*) = 0; \tag{24}$$

in words, $\Psi(P_{n,h}^*)$ is targeted toward $\psi_0$ indeed. Furthermore, in view of (16) and (24), $\psi_{n,h}^*$ satisfies

$$\psi_{n,h}^* - \Psi(P_0) = (P_n - P_0)\, D^*(P_{n,h}^*) + \mathrm{Rem}_{20}\left(\bar{Q}_{n,h}^*, G_{n,h}\right). \tag{25}$$

Finally, the TMLEs $\psi_{n,h}^*$ ($h \in \mathcal{H}$) are said to be uncooperative because, although they share the same initial estimator $\bar{Q}_n^0$, for any two $h \neq h'$ in $\mathcal{H}$, the construction of $\psi_{n,h'}^*$ does not capitalize on that of $\psi_{n,h}^*$.
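The targeting step (19)–(22) for a single $h$ can be sketched in Python as follows, assuming arrays of initial estimates $\bar{Q}_n^0(1, W_i)$, $\bar{Q}_n^0(0, W_i)$ and $G_{n,h}(W_i)$ with values in $(0, 1)$. The optimal $\varepsilon_{n,h}$ is found here by Newton iterations on the strictly convex risk (21) rather than by a packaged logistic regression routine:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def tmle_ate(A, Y, Qbar1, Qbar0, G, n_iter=50):
    """One TMLE targeting step for the ATE, following (19)-(22).

    Qbar1, Qbar0: initial estimates of Qbar(1, W), Qbar(0, W) in (0, 1);
    G: propensity score estimates; Y must lie in [0, 1]."""
    ellG = A * G + (1 - A) * (1 - G)
    C = (2 * A - 1) / ellG                    # clever covariate (19)
    C1, C0 = 1 / G, -1 / (1 - G)              # its values at A=1 and A=0
    QbarA = np.where(A == 1, Qbar1, Qbar0)
    off = logit(QbarA)                        # offset: initial fit on logit scale
    eps = 0.0
    for _ in range(n_iter):                   # Newton-Raphson on the risk (21)
        p = expit(off + eps * C)
        score = np.sum(C * (Y - p))           # derivative of minus the risk
        hess = np.sum(C ** 2 * p * (1 - p))
        step = score / hess
        eps += step
        if abs(step) < 1e-10:
            break
    Q1s = expit(logit(Qbar1) + eps * C1)      # fluctuated Qbar(1, .), eq. (20)
    Q0s = expit(logit(Qbar0) + eps * C0)      # fluctuated Qbar(0, .)
    return np.mean(Q1s - Q0s)                 # TMLE of psi_0, eq. (22)
```

At the returned solution, the score condition enforces (23), hence (24): the empirical mean of the efficient influence curve at the fluctuated law is (numerically) zero.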

### 3.3 Selecting one of the uncooperative TMLEs

At this stage of the procedure, a crucial question is to select one TMLE in the collection of uncooperative TMLEs, one that lends itself to the construction of a CI for $\psi_0$ with a given asymptotic level. Such a TMLE necessarily writes as $\psi_{n,h_n}^*$ for some well-chosen $h_n$. This could possibly be a deterministic (fixed in advance) or a data-driven (random and $P_n$-dependent) element of $\mathcal{H}$.

The risk generated by (18) is given by

$$R_1(G, G_0) \equiv E_{Q_{W,0}}\left[\mathrm{KL}\left(G_0(W), G(W)\right)\right],$$

where $\mathrm{KL}(p, q)$ is the Kullback-Leibler divergence between the Bernoulli laws with parameters $p$ and $q$. By Pinsker's inequality, it holds that

$$0 \le 2\, P_0\left(G - G_0\right)^2 \le R_1(G, G_0)$$

for all $G \in \mathcal{G}$. Therefore, if $G$ is bounded away from zero and one, then (17) implies

$$\mathrm{Rem}_{20}\left(\bar{Q}, G\right)^2 \lesssim P_0\left(\bar{Q} - \bar{Q}_0\right)^2 \times R_1(G, G_0). \tag{26}$$
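The Bernoulli Kullback-Leibler divergence and the Pinsker-type lower bound above are easy to check numerically; a quick sketch:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between the Bernoulli(p) and Bernoulli(q) laws."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Pinsker's inequality for Bernoulli laws: 2 (p - q)^2 <= KL(p, q)
p = np.linspace(0.05, 0.95, 19)
q = 0.4
gap = kl_bernoulli(p, q) - 2 * (p - q) ** 2
```

Averaging `kl_bernoulli(G0(W), G(W))` over draws of $W$ approximates $R_1(G, G_0)$.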

If the deterministic is such that (i) there exist two rates and such that and