1 Background and Motivation
The stratum-specific treatment effect, or blip function gillrobins2001, is defined as the random variable given by the average treatment effect for a randomly drawn stratum of the confounders. Estimating the cumulative distribution function (CDF) of the blip function therefore estimates the proportion of the population that has an average treatment effect at or below a given level. Because clinicians treat patients based on their confounder values, this quantity is of interest for evaluating a treatment. However, we will see that the blip CDF is not pathwise differentiable, so we instead estimate a kernel-smoothed version, which can be of interest in and of itself. It might also provide a pathway to forming confidence intervals for the blip CDF itself.
Much consideration has been given to the distribution of $Y_1 - Y_0$, where $Y_a$ is the counterfactual outcome under the intervention to set treatment to $a$, as per the Neyman-Rubin potential outcomes framework neyman1923, rubin1974. Neyman, 1923, realized that even in estimating the mean of $Y_1 - Y_0$, the impossibility of identifying the correlation of $Y_1$ and $Y_0$ hampered variance estimation in small samples. The assumptions needed to estimate the joint distribution of $Y_1$ and $Y_0$ are hard to verify. Cox, 1958, assumes a constant treatment effect for predefined subgroups, while Fisher, 1951, suggests one can essentially view the counterfactual by careful design. Heckman and Smith, 1998, estimate the quantiles of $Y_1 - Y_0$ via the assumption that quantiles are preserved from $Y_0$ to $Y_1$ given a stratum of confounders. Without strong assumptions, using tail bounds frechet1951 to estimate the quantiles of $Y_1 - Y_0$ via the marginals of $Y_1$ and $Y_0$ tends to leave too big a measure of uncertainty to be useful heckman. Heckman also mentions that his analysis becomes much easier if $Y_1 - Y_0$ remains fixed for a given stratum, i.e., $Y_1 - Y_0 = E[Y_1 - Y_0 \mid W]$, the blip function, for which we aim to estimate the CDF.

2 Data
Our full data, including unobserved measures, is assumed to be generated according to the following structural equations Wright, Strotz, Pearl:2000aa. We assume a joint distribution, $P_{U,X}$, where $U = (U_W, U_A, U_Y)$ has an unknown distribution of unmeasured variables and $X = (W, A, Y)$ are the measured variables. In the time ordering of occurrence we have $W = f_W(U_W)$, where $W$ is a vector of confounders; $A = f_A(W, U_A)$, where $A$ is a binary treatment; and $Y = f_Y(W, A, U_Y)$, where $Y$ is the outcome, either binary or bounded continuous. We thus define a distribution, $P_{U,X}$, on the full data. The full-data model, $\mathcal{M}^F$, which is nonparametric, consists of all possible $P_{U,X}$. The observed data model, $\mathcal{M}$, is linked to $\mathcal{M}^F$ in that we observe $O = (W, A, Y) \sim P_0$, where $O$ is generated by $P_{U,X,0}$ according to the structural equations above. Our true observed data distribution, $P_0$, is therefore an element of a nonparametric observed data model, $\mathcal{M}$. In the case of a randomized trial, or if we have some knowledge of the treatment mechanism, $\mathcal{M}$ is considered a semiparametric model and we will incorporate such knowledge.
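To fix ideas, the structural equations above can be instantiated in a short simulation. The particular forms of $f_W$, $f_A$, $f_Y$ and their coefficients below are hypothetical, chosen only to respect the stated time ordering, the binary treatment, and a binary outcome:

```python
import random
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_observation(rng):
    # W = f_W(U_W): a single continuous confounder (illustrative choice)
    W = rng.uniform(-1.0, 1.0)
    # A = f_A(W, U_A): binary treatment depending on W; positivity holds
    # since expit is bounded away from 0 and 1 for bounded W
    A = 1 if rng.random() < expit(0.5 * W) else 0
    # Y = f_Y(W, A, U_Y): binary outcome depending on W and A
    Y = 1 if rng.random() < expit(-0.3 + 0.8 * A + 0.4 * W) else 0
    return W, A, Y

rng = random.Random(0)
sample = [draw_observation(rng) for _ in range(5)]
print(sample)
```

Each draw is one observation $O = (W, A, Y)$; the analyst sees only these triples, never $U$.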
3 Parameter of Interest and Identification
First we define the potential outcome under the intervention to set the treatment to $a$ as $Y_a = f_Y(W, a, U_Y)$ neyman1923. The blip function is then defined as $b(W) = E[Y_1 - Y_0 \mid W]$. Our parameter of interest is a mapping from $\mathcal{M}^F$ to $[0, 1]$ defined by $\Psi^F(P_{U,X})(t) = P(b(W) \le t)$, or the CDF of the blip.
We will impose the randomization assumption, $A \perp Y_a \mid W$ Robins1986, Greenland1986, as well as positivity, $g_0(a \mid W) > 0$ for all $a$ and almost every $W$. Defining $\bar{Q}(A, W) = E[Y \mid A, W]$ yields $b(W) = \bar{Q}(1, W) - \bar{Q}(0, W)$ and we can identify the parameter of interest as a mapping from the observed data model, $\mathcal{M}$, to $[0, 1]$ via the mapping $\Psi(P)(t) = E_W\, I\big(b(W) \le t\big)$.
$\Psi$ is not pathwise differentiable Vaart:2000aa, so instead we consider the smoothed version of the parameter mapping, using kernel, $k$, with bandwidth, $h$. Here we will suppress $t$ in the notation for convenience:
$$\Psi_h(P) = E_W\, \frac{1}{h}\int k\left(\frac{t - s}{h}\right) I\big(b(W) \le s\big)\,ds = E_W\, K\left(\frac{t - b(W)}{h}\right),$$
where $K$ denotes the CDF of the kernel.
NOTE: We assume throughout this paper that $P(b(W) = t) = 0$ for all values of $t$. In other words, our blip distribution function is continuous.
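The smoothed parameter is simply the kernel CDF averaged over the blip values, which can be contrasted with the unsmoothed (non-pathwise-differentiable) empirical CDF. A minimal sketch, using a Gaussian kernel for illustration (any smooth Lipschitz kernel would do) and hypothetical blip values:

```python
import math

def gauss_cdf(x):
    # CDF of a standard Gaussian kernel (illustrative kernel choice)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def smoothed_blip_cdf(blips, t, h):
    # Psi_h(t) = (1/n) sum_i K((t - b(W_i)) / h): kernel-smoothed CDF of
    # the blip, given blip values b(W_i)
    return sum(gauss_cdf((t - b) / h) for b in blips) / len(blips)

def empirical_blip_cdf(blips, t):
    # The unsmoothed target, F_b(t) = (1/n) sum_i I(b(W_i) <= t)
    return sum(1 for b in blips if b <= t) / len(blips)

blips = [0.05 * i for i in range(-10, 11)]  # hypothetical blip values
print(empirical_blip_cdf(blips, 0.0))       # 11/21
print(smoothed_blip_cdf(blips, 0.0, 0.1))   # 0.5, by symmetry of the blips
```

As $h \to 0$, the smoothed value approaches the empirical CDF at points where the blip distribution is continuous.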
4 Derivation of the Efficient Influence Curve of $\Psi_h$
4.0.1 Tangent Space for Nonparametric Model
The true data generating distribution, $P_0$, has density, $p_0$, which can be factorized as $p_0(O) = p_{W,0}(W)\, g_0(A \mid W)\, p_{Y,0}(Y \mid A, W)$. We consider the one-dimensional set of submodels, $\{P_\epsilon\}$, that pass through $P$ at $\epsilon = 0$ Vaart:2000aa. The tangent space, $T(P)$, is the closure in $L^2_0(P)$ norm of the set of scores, $S = \frac{d}{d\epsilon}\log p_\epsilon \big|_{\epsilon=0}$, or directions for the paths defined above. We write:
$$T(P) = T_W(P) \oplus T_A(P) \oplus T_Y(P).$$
For a nonparametric model, $T(P) = L^2_0(P)$ forms a Hilbert space with inner product defined as $\langle S_1, S_2 \rangle = E_P\,[S_1 S_2]$. Our notion of orthogonality now is $S_1 \perp S_2$ if and only if $E_P\,[S_1 S_2] = 0$ and, therefore, the above direct sum is valid. In other words, every score, $S$, can be written as
$$S(O) = S_W(W) + S_A(A \mid W) + S_Y(Y \mid A, W),$$
where, due to the fact $E_P[S] = 0$, it is easy to see $S_W(W) = E[S \mid W] \in T_W(P)$, $S_A(A \mid W) = E[S \mid A, W] - E[S \mid W] \in T_A(P)$, and $S_Y(Y \mid A, W) = S - E[S \mid A, W] \in T_Y(P)$. Furthermore, we know that a projection of $S$ on $T_Y(P)$ is $E_P[S \mid Y, A, W] - E_P[S \mid A, W]$.

4.0.2 Efficiency Theory in Brief
The efficient influence curve at a distribution, $P$, for the parameter mapping, $\Psi_h$, is a function of the observed data, $O$, notated as $D^*(P)(O)$. Its variance gives the generalized Cramér-Rao lower bound for the variance of any regular asymptotically linear estimator of $\Psi_h(P)$ Vaart:2000aa. For convenience we define the outcome model, $\bar{Q}(A, W) = E_P[Y \mid A, W]$, and the treatment mechanism as $g(a \mid W) = P(A = a \mid W)$. We will simplify the notation for the blip here as well, leaving off the subscript for the distribution, so that $b(W) = \bar{Q}(1, W) - \bar{Q}(0, W)$. As in van der Vaart, 2000, we define the pathwise derivative at $P$ along score, $S$, as
$$\frac{d}{d\epsilon}\Psi_h(P_\epsilon)\Big|_{\epsilon=0}. \qquad (1)$$
We note to the reader that we imply a direction, $S$, when we write $P_\epsilon$, which has density $p_\epsilon = (1 + \epsilon S)\,p$, but we generally leave it off the notation as understood.
By the Riesz representation theorem riesz for Hilbert spaces, assuming the mapping in (1) is a bounded linear functional on $L^2_0(P)$, it can be written in the form of an inner product, $\langle D^*(P), S \rangle$, where $D^*(P)$ is a unique element of $T(P)$, which we call the canonical gradient or efficient influence curve. Thus, in the case of a nonparametric model, the only gradient is the canonical gradient. It is notable that the efficient influence curve has a variance that is the lower bound for the variance of any regular asymptotically linear estimator Vaart:2000aa. Since the TMLE, under conditions as discussed in this paper, asymptotically achieves variance equal to that of the efficient influence curve, the estimator is asymptotically efficient.
As a note to the reader: our parameter mapping does not depend on the treatment mechanism, $g$, which means our efficient influence curve must therefore be in $T_W(P) \oplus T_Y(P)$ for the nonparametric model. Therefore, our efficient influence curve will have two orthogonal components, in $T_W(P)$ and $T_Y(P)$ respectively. We have no component in $T_A(P)$, which is why we need not perform a TMLE update of the initial estimate, $g_n$, of $g_0$. This also teaches us that for the semiparametric model, where the treatment mechanism is known, the efficient influence function will remain the same.
Theorem 4.1.
Assume the kernel, $k$, is Lipschitz and smooth on its support. The efficient influence curve for the parameter, $\Psi_h$, is given by
$$D^*(P)(O) = -\frac{1}{h}\,k\left(\frac{t - b(W)}{h}\right)\frac{2A - 1}{g(A \mid W)}\big(Y - \bar{Q}(A, W)\big) + K\left(\frac{t - b(W)}{h}\right) - \Psi_h(P),$$
where $K$ is the CDF of the kernel, $k$.
Proof:
Define $\Phi_h(b) = \frac{1}{h}\int k\left(\frac{t - s}{h}\right) I(b \le s)\,ds = K\left(\frac{t - b}{h}\right)$, so that $\Psi_h(P) = E_W\,\Phi_h(b(W))$. We also define $P_\epsilon$, where $P_\epsilon$ is defined via its density, $p_\epsilon = (1 + \epsilon S)\,p$, and $p$ is the density of $P$. $S$ is the so-called score function in the Hilbert space, $L^2_0(P)$, the completion (under the $L^2(P)$ norm) of the space of mean 0 functions of finite variance. We remind the reader that since our model is nonparametric, the tangent space is $L^2_0(P)$ Vaart:2000aa. We will now compute the pathwise derivative functional on $L^2_0(P)$, writing it as an inner product (covariance in the Hilbert space, $L^2_0(P)$) of the score, $S$, and the efficient influence curve, a unique element of the tangent space, $T(P)$. We notate the efficient influence curve as indexed by the distribution, $P$, and as a function of the observed data, $O$: $D^*(P)(O)$. By dominated convergence we have
$$\frac{d}{d\epsilon}\Psi_h(P_\epsilon)\Big|_{\epsilon=0} = \underbrace{\int \Phi_h\big(b(w)\big)\,\frac{d}{d\epsilon}p_{W,\epsilon}(w)\Big|_{\epsilon=0}\,dw}_{(5)} + \underbrace{\int \frac{d}{d\epsilon}\Phi_h\big(b_\epsilon(w)\big)\Big|_{\epsilon=0}\,p_W(w)\,dw}_{(6)},$$
recalling that $\Phi_h(b) = K\left(\frac{t - b}{h}\right)$, the kernel CDF evaluated at $(t - b)/h$, so that $\Psi_h(P) = E_W\,\Phi_h(b(W))$.

For (5), since $\frac{d}{d\epsilon}p_{W,\epsilon}(w)\big|_{\epsilon=0} = p_W(w)\,E_P[S(O) \mid W = w]$, we obtain
$$(5) = E_P\Big[\big(\Phi_h(b(W)) - \Psi_h(P)\big)\,S(O)\Big],$$
the inner product of $S$ with the $T_W(P)$ component, $\Phi_h(b(W)) - \Psi_h(P)$; subtracting the constant, $\Psi_h(P)$, changes nothing because $S$ has mean zero.

For (6), the chain rule gives
$$\frac{d}{d\epsilon}\Phi_h\big(b_\epsilon(w)\big)\Big|_{\epsilon=0} = -\frac{1}{h}\,k\left(\frac{t - b(w)}{h}\right)\frac{d}{d\epsilon}b_\epsilon(w)\Big|_{\epsilon=0},$$
and a standard computation for the pathwise derivative of a conditional mean yields
$$\frac{d}{d\epsilon}b_\epsilon(w)\Big|_{\epsilon=0} = E_P\left[\frac{2A - 1}{g(A \mid W)}\big(Y - \bar{Q}(A, W)\big)\,S(O)\,\Big|\,W = w\right],$$
so that
$$(6) = E_P\left[-\frac{1}{h}\,k\left(\frac{t - b(W)}{h}\right)\frac{2A - 1}{g(A \mid W)}\big(Y - \bar{Q}(A, W)\big)\,S(O)\right].$$
Thus we finally get
$$\frac{d}{d\epsilon}\Psi_h(P_\epsilon)\Big|_{\epsilon=0} = E_P\big[D^*(P)(O)\,S(O)\big] = \big\langle D^*(P), S \big\rangle,$$
where
$$D^*(P)(O) = -\frac{1}{h}\,k\left(\frac{t - b(W)}{h}\right)\frac{2A - 1}{g(A \mid W)}\big(Y - \bar{Q}(A, W)\big) + K\left(\frac{t - b(W)}{h}\right) - \Psi_h(P).$$
And this is the efficient influence curve, since the canonical gradient is the only gradient for a nonparametric model, where the closure of the set of scores is all of $L^2_0(P)$.
QED
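As a numerical sanity check on the theorem, one can verify by Monte Carlo that the claimed influence curve has mean zero at the data generating distribution, since both its $T_Y$ and $T_W$ components are mean-zero. The sketch below uses a hypothetical data generating process and a Gaussian kernel; all particular choices (coefficients, $t$, $h$) are illustrative:

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical design: W ~ Uniform(-1,1), g(1|W) = expit(0.5 W),
# Qbar(A,W) = expit(-0.3 + 0.8 A + 0.4 W), b(W) = Qbar(1,W) - Qbar(0,W)
def g1(w):
    return expit(0.5 * w)

def qbar(a, w):
    return expit(-0.3 + 0.8 * a + 0.4 * w)

def blip(w):
    return qbar(1, w) - qbar(0, w)

def k(x):  # Gaussian kernel (smooth and Lipschitz)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def K(x):  # its CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

t, h = 0.19, 0.2

# True smoothed parameter Psi_h = E_W K((t - b(W))/h), W ~ Uniform(-1,1),
# computed by a fine midpoint rule
m = 20000
psi_h = sum(K((t - blip(-1.0 + (2 * j + 1) / m)) / h) for j in range(m)) / m

def eic(w, a, y):
    # D*(P)(O): T_Y component plus T_W component
    gaw = g1(w) if a == 1 else 1.0 - g1(w)
    term_y = -(1.0 / h) * k((t - blip(w)) / h) * (2 * a - 1) / gaw * (y - qbar(a, w))
    term_w = K((t - blip(w)) / h) - psi_h
    return term_y + term_w

rng = random.Random(1)
n = 100000
total = 0.0
for _ in range(n):
    w = rng.uniform(-1.0, 1.0)
    a = 1 if rng.random() < g1(w) else 0
    y = 1 if rng.random() < qbar(a, w) else 0
    total += eic(w, a, y)
print(abs(total / n))  # close to 0, up to Monte Carlo error
```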
5 The Targeted Maximum Likelihood Estimator, TMLE
We will employ the notation, $P_n f$, to be the empirical average of function, $f$, and $P f$ to be $E_P\, f(O)$. Define a loss function, $L(P)(O)$, which is a function of the observed data, $O$, and indexed at the distribution on which it is defined, $P$, such that $E_{P_0} L(P)(O)$ is minimized at the true observed data distribution, $P_0$. The targeted maximum likelihood (TML) estimating procedure maps an initial estimate, $P_n^0$, of the true data generating distribution to $P_n^*$ such that $P_n L(P_n^*) \le P_n L(P_n^0)$ and such that $P_n D_{t_j}^*(P_n^*) = 0$ for $j = 1, \dots, d$, where $d$, in this case, is the number of points on the CDF. $P_n^*$ is called the TMLE of the initial estimate Laan:2006aa, Laan:2011aa. For convenience, we define, $\bar{Q}(A, W) = E[Y \mid A, W]$, and its initial estimate, $\bar{Q}_n^0$, and we will use $g(A \mid W)$ with corresponding initial estimate, $g_n$. The initial estimate of the distribution of $W$ is denoted $Q_{W,n}$, the empirical distribution of $W_1, \dots, W_n$, with density, $q_{W,n}$. For this paper, the TMLE procedure only adjusts the initial estimate of the outcome regression, leaving $g_n$ and $Q_{W,n}$ as is. Thus we will only update $\bar{Q}_n^0$ to its TMLE, $\bar{Q}_n^*$.

To perform the TMLE updating procedure, we may find an element of either a universal least favorable submodel (ulfm), a least favorable submodel (lfm), both defined in van der Laan and Gruber, 2016, or a canonical least favorable submodel (clfm). Both the clfm and the ulfm use a single dimensional submodel, whereas the lfm is of dimension, $d$, and identical to a clfm if $d = 1$. The ulfm has the advantage of not relying on iteration, as explained in van der Laan and Gruber, 2016, but here we did not notice an appreciable difference in performance, so we used the faster clfm procedure. To construct a clfm, ulfm or lfm, one needs to know the efficient influence curve, which is given by
$$D_t^*(P)(O) = -\frac{1}{h}\,k\left(\frac{t - b(W)}{h}\right)\frac{2A - 1}{g(A \mid W)}\big(Y - \bar{Q}(A, W)\big) + K\left(\frac{t - b(W)}{h}\right) - \Psi_h(P)(t),$$
where we estimate the CDF of the blip at a given blip value, $t$, using kernel, $k$, and bandwidth, $h$ blipCDFtech. The CV-TMLE algorithm by the author cvtmle simplifies the originally formulated CV-TMLE algorithm by Zheng and van der Laan, 2010 and, in this case, turns out to be the same estimator if we use a pooled regression to fit the fluctuation parameter. Here we will provide, for readers more familiar with TMLE, only the so-called clever covariate Laan:2006aa for $\bar{Q}$, but the reader may consult Levy, 2018c for a detailed algorithm.
If we are simultaneously estimating points, $t_1, \dots, t_d$, on the CDF curve, we will have a clever covariate for each point:
$$H_j(A, W) = -\frac{1}{h}\,k\left(\frac{t_j - b(W)}{h}\right)\frac{2A - 1}{g(A \mid W)}, \qquad j = 1, \dots, d.$$
The TMLE procedure yields $\bar{Q}_n^*$ and our estimator is then a plug-in, using the empirical distribution, $Q_{W,n}$:
$$\Psi_h(P_n^*)(t) = \frac{1}{n}\sum_{i=1}^n K\left(\frac{t - b_n^*(W_i)}{h}\right),$$
where $b_n^*(W) = \bar{Q}_n^*(1, W) - \bar{Q}_n^*(0, W)$, the blip function estimate. For simultaneously estimating many points on the CDF of the blip, the TMLE procedure yields a common outcome model, $\bar{Q}_n^*$, for all values, $t_j$, which has the advantage of preserving monotonicity.
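For concreteness, a stripped-down sketch of the fluctuation step for a single point ($d = 1$): the initial outcome model is fluctuated on the logistic scale along the clever covariate, with the fluctuation parameter chosen by a crude grid search standing in for MLE fitting. The data, the initial fits, the kernel, and all tuning choices are hypothetical, and this one-step sketch omits the iteration a full clfm procedure may require:

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def k(x):  # Gaussian kernel, for illustration
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def clever_covariate(a, w, qbar, g1, t, h):
    # H(A,W) = -(1/h) k((t - b(W))/h) (2A-1)/g(A|W), with b(W) computed
    # from the current outcome-model fit qbar
    b = qbar(1, w) - qbar(0, w)
    gaw = g1(w) if a == 1 else 1.0 - g1(w)
    return -(1.0 / h) * k((t - b) / h) * (2 * a - 1) / gaw

def tmle_update(data, qbar, g1, t, h):
    # One-dimensional logistic fluctuation:
    #   Qbar_eps(A,W) = expit(logit(Qbar(A,W)) + eps * H(A,W)),
    # with eps minimizing the empirical negative log-likelihood over a grid
    def nll(eps):
        total = 0.0
        for w, a, y in data:
            p = expit(logit(qbar(a, w)) + eps * clever_covariate(a, w, qbar, g1, t, h))
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total
    eps = min((i / 1000.0 for i in range(-200, 201)), key=nll)
    def qbar_star(a, w):
        return expit(logit(qbar(a, w)) + eps * clever_covariate(a, w, qbar, g1, t, h))
    return qbar_star

# Tiny illustration with a deliberately crude initial fit and known g
data = [(-0.5, 1, 1), (0.2, 0, 0), (0.7, 1, 1), (-0.1, 0, 1), (0.4, 1, 0)]
def qbar0(a, w):
    return 0.5           # crude initial outcome model
def g1(w):
    return 0.5           # known randomization probability
qbar_star = tmle_update(data, qbar0, g1, t=0.0, h=0.25)
print(qbar_star(1, 0.2))
```

The updated model `qbar_star` is then plugged into the estimator above via the empirical distribution of $W$.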
5.0.1 Software
The TMLE is implemented in the software packages blipCDF blipCDF and tmle3.
6 TMLE Conditions
By solving the efficient influence curve equation with our TMLE update, we can then write a second order expansion, $\Psi_h(P_n^*) - \Psi_h(P_0) = (P_n - P_0) D^*(P_n^*) + R_2(P_n^*, P_0)$. We then arrive at the following three conditions (for fixed bandwidth, $h$) that guarantee asymptotic efficiency for this estimator Laan:2006aa, Laan:2011aa, the first of which is not required for CV-TMLE Zheng:2010aa. Thus CV-TMLE is our preferred estimator, since it requires fewer conditions on our machine learning, enabling a more aggressive approach to fitting the treatment mechanism and the outcome model.
6.1 TMLE Conditions and Asymptotic Efficiency
We refer the reader to Targeted Learning Appendix Laan:2011aa as well as Laan:2015aa,Laan:2015ab, Laan:2006aa for a more detailed look at the theory of TMLE. For convenience, we will summarize some of the main results for the reader.
6.1.1 Conditions for Asymptotic Efficiency
Define the norm $\|f\|_{P_0}^2 = P_0 f^2$. Assume the following TMLE conditions:

1. $D^*(P_n^*)$ is in a $P_0$-Donsker class. This condition can be dropped in the case of using CV-TMLE Zheng:2010aa. We show the advantages of CV-TMLE in our simulations.

2. $\|D^*(P_n^*) - D^*(P_0)\|_{P_0}$ is $o_P(1)$ for all points, $t$, considered.

3. $R_2(P_n^*, P_0) = o_P(1/\sqrt{n})$.

Then $\Psi_h(P_n^*) - \Psi_h(P_0) = (P_n - P_0) D^*(P_0) + o_P(1/\sqrt{n})$. Our plug-in TMLEs and CIs are given by $\Psi_h(P_n^*)(t) \pm z_{1-\alpha/2}\,\sigma_n / \sqrt{n}$.
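A minimal sketch of computing such an influence-curve-based Wald interval; the influence-curve values and the point estimate below are hypothetical stand-ins for $D^*(P_n^*)(O_i)$ and $\Psi_h(P_n^*)(t)$:

```python
import math
import statistics

def wald_ci(psi_hat, ic_values, z=1.96):
    # psi_hat +/- z_{1-alpha/2} * sigma_n / sqrt(n), where sigma_n is the
    # sample standard deviation of the estimated efficient influence curve
    # evaluated at each observation; z = 1.96 corresponds to alpha = 0.05
    n = len(ic_values)
    se = statistics.stdev(ic_values) / math.sqrt(n)
    return psi_hat - z * se, psi_hat + z * se

# Hypothetical influence-curve values D*(P_n^*)(O_i)
ic = [0.3, -0.2, 0.1, -0.4, 0.25, -0.05, 0.15, -0.1]
lo, hi = wald_ci(0.42, ic)
print(round(lo, 3), round(hi, 3))  # 0.255 0.585
```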
Under the above conditions, these confidence bands will be as small as possible for any regular asymptotically linear estimator at significance level, $\alpha$, where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile for $Z$ standard normal and $\sigma_n$ is the sample standard deviation of $D^*(P_n^*)(O)$ Laan:2006aa. Note that if the TMLE conditions hold for the initial estimate, $\bar{Q}_n^0$, then they will also hold for the updated model, $\bar{Q}_n^*$ Laan:2015aa, thereby placing importance on our ensemble machine learning in constructing $\bar{Q}_n^0$. For simultaneous confidence intervals, we refer the reader to Levy, van der Laan et al., 2018, for the method which leverages the efficient influence curve approximation to form confidence bounds that simultaneously cover the parameter values at a given significance level. Such inference is as tight as possible and certainly tighter than a standard Bonferroni correction bonferroni.

7 The Remainder Term for a TMLE Plug-in Estimator of $\Psi_h$
In this section we will prove that the remainder term of the previous section is $o_P(1/\sqrt{n})$, assuming WLOG that the support of the kernel is $[-1, 1]$.
Lemma 7.1.
Assume $k$ is Lipschitz and assume WLOG that the support of the kernel is $[-1, 1]$; then
Proof:
Lipschitz 
QED
Theorem 7.2.
The remainder term, $R_2(P_n^*, P_0)$, is $o_P(1/\sqrt{n})$.
Proof:
(10)  
(12)  
(14)  
Clearly, (12) will disappear if $g_0$ is known. Otherwise, the term is $o_P(1/\sqrt{n})$ by Cauchy-Schwarz.
Now let’s take a look at (13):
(15) 
We can divide the space into disjoint parts and integrate:
a) :
Assuming $b_0$ is Lipschitz, we have as follows: