# Kernel Smoothing of the Treatment Effect CDF

We provide a CV-TMLE estimator for a kernel-smoothed version of the cumulative distribution function of the random variable giving the treatment effect, or so-called blip, for a randomly drawn individual. We must first assume the blip distribution is continuous. We then derive the efficient influence curve of the kernel-smoothed version of the blip CDF. Our CV-TMLE estimator is asymptotically efficient under two conditions, one of which involves a second order remainder term that, in this case, shows that knowledge of the treatment mechanism does not guarantee a consistent estimate. The remainder term also teaches us exactly how well we need to estimate the nuisance parameters to guarantee asymptotic efficiency. Through simulations we verify the theoretical properties of the estimator and show the importance of machine learning over conventional regression approaches for fitting the nuisance parameters. We also derive the bias and variance of the estimator, the orders of which are analogous to those of a kernel density estimator. This estimator opens up the possibility of developing methodology for optimal choice of the kernel and bandwidth to form confidence bounds for the CDF itself.



## 1 Background and Motivation

The stratum-specific treatment effect, or blip function gillrobins2001, is defined as the random variable given by the average treatment effect for a randomly drawn stratum of the confounders. Estimating the cumulative distribution function (CDF) of the blip therefore estimates the proportion of the population with an average treatment effect at or below a given level. Because clinicians treat patients based on confounders, this quantity is of interest in evaluating a treatment. However, we will see that the blip CDF is not pathwise differentiable, so we instead estimate a kernel-smoothed version, which can be of interest in and of itself. The smoothed version might also provide a pathway to forming confidence intervals for the blip CDF itself.

Much consideration has been given to the distribution of $Y_1 - Y_0$, where $Y_a$ is the counterfactual outcome under the intervention to set treatment to $a$, as per the Neyman-Rubin potential outcomes framework neyman1923, rubin1974. Neyman, 1923, realized that, even in estimating the mean of $Y_1 - Y_0$, the impossibility of identifying the correlation of $Y_1$ and $Y_0$ hampered variance estimation in small samples. Assumptions needed to estimate the joint distribution of $Y_1$ and $Y_0$ are hard to verify. Cox, 1958, assumes a constant treatment effect for pre-defined subgroups, while Fisher, 1951, suggests one can essentially view both counterfactuals by careful design. Heckman and Smith, 1998, estimate the quantiles of $Y_1 - Y_0$ via the assumption that quantiles are preserved from $Y_0$ to $Y_1$ given a stratum of confounders. Without strong assumptions, using tail bounds frechet1951 to estimate the quantiles of $Y_1 - Y_0$ via the marginals of $Y_1$ and $Y_0$ tends to leave too big a measure of uncertainty to be useful heckman. Heckman also mentions that his analysis becomes much easier if $Y_1 - Y_0$ remains fixed for a given stratum, i.e., $Y_1 - Y_0 = b_0(W)$, for which we aim to estimate the CDF.

## 2 Data

Our full data, including unobserved measures, is assumed to be generated according to the following structural equations Wright, Strotz, Pearl:2000aa. We assume a joint distribution, $P_{U,X}$, where $U$ has an unknown distribution of unmeasured variables and $X = (W, A, Y)$ are the measured variables. In the time ordering of occurrence we have $W = f_W(U_W)$, where $W$ is a vector of confounders, $A = f_A(W, U_A)$, where $A$ is a binary treatment, and $Y = f_Y(W, A, U_Y)$, where $Y$ is the outcome, either binary or bounded continuous. We thusly define a distribution, $P_{U,X}$, via the functions $(f_W, f_A, f_Y)$ and the distribution of $U$.

The full-data model, $\mathcal{M}^F$, which is non-parametric, consists of all possible $P_{U,X}$. The observed data model, $\mathcal{M}$, is linked to $\mathcal{M}^F$ in that we observe $O = (W, A, Y)$, where $O$ is generated by $P_{U,X}$ according to the structural equations above. Our true observed data distribution, $P_0$, is therefore an element of a non-parametric observed data model, $\mathcal{M}$. In the case of a randomized trial, or if we have some knowledge of the treatment mechanism, $\mathcal{M}$ is considered a semi-parametric model and we will incorporate such knowledge.
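The structural equations above can be made concrete with a small simulation; the particular functional forms below (including the logistic treatment mechanism) are purely illustrative assumptions, not the paper's:

```python
import numpy as np

def simulate(n, seed=0):
    """Draw O = (W, A, Y) from a simple structural equation model.

    W: confounder vector (here scalar), A: binary treatment,
    Y: bounded continuous outcome.  The rng draws play the role of the
    exogenous errors U_W, U_A, U_Y.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, n)                        # W = f_W(U_W)
    g1 = 1 / (1 + np.exp(-0.5 * W))                  # treatment mechanism g(1|W)
    A = rng.binomial(1, g1)                          # A = f_A(W, U_A)
    # outcome regression Qbar(A, W) = E[Y | A, W], kept in [0, 1]
    Qbar = 1 / (1 + np.exp(-(0.2 + 0.8 * A * W - 0.3 * W)))
    Y = np.clip(Qbar + 0.1 * rng.normal(size=n), 0, 1)   # Y = f_Y(W, A, U_Y)
    return W, A, Y

W, A, Y = simulate(1000)
```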

## 3 Parameter of Interest and Identification

First we define the potential outcome under the intervention to set the treatment to $a$ as $Y_a$ neyman1923. The blip function is then defined as $b_0(W) = E_{P_{U,X}}[Y_1 - Y_0 \mid W]$. Our parameter of interest is a mapping from $\mathcal{M}^F$ to $[0,1]$ defined by $\Psi^F(P_{U,X}) = \Pr(b_0(W) \le t)$, or the CDF of the blip.

We will impose the randomization assumption, $Y_a \perp A \mid W$ Robins1986, Greenland1986, as well as positivity, $0 < g_0(1 \mid W) < 1$ almost surely. Defining $b(W) = E_P[Y \mid A=1, W] - E_P[Y \mid A=0, W]$ yields $b(W) = b_0(W)$ under these assumptions, and we can identify the parameter of interest as a mapping from the observed data model, $\mathcal{M}$, to $[0,1]$ via

$$\Psi(P) = E_P\, I(b(W) \le t) \quad \text{for } P \in \mathcal{M}.$$

$\Psi$ is not pathwise differentiable Vaart:2000aa, so instead we consider the smoothed version of the parameter mapping, using a kernel, $k$, with bandwidth, $\delta$. Here we will at times suppress $\delta$ and $t$ in the notation for convenience:

$$\Psi_{\delta,t}(P) = E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\, dx = \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) F(x)\, dx$$

where $F$ is the CDF of $b(W)$.

NOTE: We assume throughout this paper that, for all values $t$, $\Pr(b_0(W) = t) = 0$. In other words, our blip distribution function is continuous.
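For intuition, the smoothed parameter admits a closed form once we note $\int \frac{1}{\delta}k(\frac{x-t}{\delta}) I(b \le x)\,dx = 1 - K(\frac{b-t}{\delta})$, where $K$ is the kernel's CDF. A minimal sketch follows, using an Epanechnikov kernel (an illustrative choice; the paper leaves the kernel generic):

```python
import numpy as np

def epanechnikov_cdf(u):
    """CDF of the Epanechnikov kernel k(u) = 0.75(1 - u^2) on [-1, 1]."""
    u = np.clip(u, -1.0, 1.0)
    return 0.75 * (u - u**3 / 3.0) + 0.5

def smoothed_blip_cdf(blips, t, delta):
    """Kernel-smoothed CDF at t: mean over W of 1 - K((b(W) - t)/delta),
    which equals E_W ∫ (1/delta) k((x - t)/delta) I(b(W) <= x) dx."""
    return np.mean(1.0 - epanechnikov_cdf((blips - t) / delta))
```

As $\delta \to 0$, this recovers the empirical proportion of blips at or below $t$.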

## 4 Derivation of the Efficient Influence Curve of Ψδ,t(P)

#### 4.0.1 Tangent Space for Nonparametric Model

The true data generating distribution, $P$, has density, $p$, which can be factorized as $p(o) = p_Y(y \mid a, w)\, g(a \mid w)\, p_W(w)$. We consider the one-dimensional set of submodels, $\{P_\epsilon\}$, that pass through $P$ at $\epsilon = 0$ Vaart:2000aa. The tangent space, $T$, is the closure in $L^2_0(P)$ norm of the set of scores, $S$, or directions for the paths defined above. We write:

$$T = \overline{\{S(o) : ES = 0,\ ES^2 < \infty\}} = \overline{\{S(y \mid a, w) : E_{P_Y} S = 0,\ ES^2 < \infty\}} \oplus \overline{\{S(a \mid w) : E_{P_A} S = 0,\ ES^2 < \infty\}} \oplus \overline{\{S(w) : E_{P_W} S = 0,\ ES^2 < \infty\}} = T_Y \oplus T_A \oplus T_W$$

For a non-parametric model, $T = L^2_0(P)$ forms a Hilbert space with inner product defined as $\langle S_1, S_2 \rangle = E_P S_1(O) S_2(O)$. Our notion of orthogonality now is $S_1 \perp S_2$ if and only if $\langle S_1, S_2 \rangle = 0$ and, therefore, the above direct sum is valid. In other words, every score, $S$, can be written as $S(o) = S_Y(y \mid a, w) + S_A(a \mid w) + S_W(w)$ where, due to the factorization of $p$, it is easy to see $S_Y \in T_Y$, $S_A \in T_A$ and $S_W \in T_W$. Furthermore, we know that the projection of $S$ on $T_Y$ is

$$S_Y(y \mid w, a) = S(w,a,y) - E[S(W,A,Y) \mid W=w, A=a] = S(w,a,y) - \int S(w,a,y)\, p_Y(y \mid w, a)\, d\nu(y) = \frac{d}{d\epsilon} \log\left(p_{Y,\epsilon}\right)\Big|_{\epsilon=0}$$

#### 4.0.2 Efficiency Theory in Brief

The efficient influence curve at a distribution, $P$, for the parameter mapping, $\Psi_{\delta,t}$, is a function of the observed data, $O$, notated as $D^\star_{\Psi_{\delta,t}}(P)(O)$. Its variance gives the generalized Cramér-Rao lower bound for the variance of any regular asymptotically linear estimator of $\Psi_{\delta,t}(P)$ Vaart:2000aa. For convenience we define the outcome model as $\bar{Q}(A,W) = E_P[Y \mid A, W]$ and the treatment mechanism as $g(A \mid W)$. We will simplify the notation for the blip here as well, leaving off the subscript for the distribution, so that $b(W) = \bar{Q}(1,W) - \bar{Q}(0,W)$. As in van der Vaart, 2000, we define the pathwise derivative at $P$ along score, $S$, as

$$\lim_{\epsilon \to 0} \left(\frac{\Psi_{\delta,t}(P_\epsilon) - \Psi_{\delta,t}(P)}{\epsilon}\right) \longrightarrow \dot{\Psi}_{\delta,t}(S) \tag{1}$$

We note to the reader that we imply a direction, $S$, when we write $P_\epsilon$, which has density $p_\epsilon = (1 + \epsilon S)p$, but generally leave it off the notation as understood.

By the Riesz representation theorem riesz for Hilbert spaces, assuming the mapping in (1) is a bounded linear functional on $L^2_0(P)$, it can be written in the form of an inner product, $\dot{\Psi}_{\delta,t}(S) = \langle D^\star_{\Psi_{\delta,t}}(P), S \rangle$, where $D^\star_{\Psi_{\delta,t}}(P)$ is a unique element of $T$, which we call the canonical gradient or efficient influence curve. Thus, in the case of a nonparametric model, the only gradient is the canonical gradient. It is notable that the efficient influence curve has a variance that is the lower bound for the asymptotic variance of any regular asymptotically linear estimator Vaart:2000aa. Since the TMLE, under conditions as discussed in this paper, asymptotically achieves variance equal to that of the efficient influence curve, the estimator is asymptotically efficient.

As a note to the reader: our parameter mapping does not depend on the treatment mechanism, $g$, which means our efficient influence curve must be in $T_Y \oplus T_W$ for the nonparametric model. Therefore, our efficient influence curve will have two orthogonal components, in $T_Y$ and $T_W$ respectively. We have no component in $T_A$, which is why we need not perform a TMLE update of the initial prediction, $g_n$, of $g_0$. This also teaches us that, for the semi-parametric model where the treatment mechanism is known, the efficient influence curve will remain the same.

###### Theorem 4.1.

Assume $k$ is Lipschitz and smooth on its support. The efficient influence curve for the parameter, $\Psi_{\delta,t}$, is given by

$$D^\star_{\Psi_{\delta,t}}(P)(O) = -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right) + \int \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\, dx - \Psi_{\delta,t}(P)$$

PROOF:

Define $b_\epsilon(W) = E_{P_\epsilon}[Y \mid A=1, W] - E_{P_\epsilon}[Y \mid A=0, W]$. We also define $P_\epsilon$, where $P_\epsilon$ is defined via its density, $p_\epsilon = (1 + \epsilon S)p$, and $p$ is the density of $P$. $S$ is the so-called score function in the Hilbert space, $L^2_0(P)$, the completion (under the $L^2(P)$ norm) of the space of mean 0 functions of finite variance. We remind the reader that since our model is nonparametric, the tangent space is all of $L^2_0(P)$ Vaart:2000aa. We will now compute the pathwise derivative functional on $L^2_0(P)$, writing it as an inner product (covariance in the Hilbert space $L^2_0(P)$) of the score, $S$, and the efficient influence curve, a unique element of the tangent space, $T$. We notate the efficient influence curve as indexed by the distribution, $P$, and as a function of the observed data, $O$: $D^\star_{\Psi_{\delta,t}}(P)(O)$. By dominated convergence we have

$$\Psi_{\delta,t}(P) = \lim_{h \to 0} E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) \Phi\!\left(\frac{b(W)-x}{h}\right) dx$$

$$\begin{aligned}
\lim_{\epsilon \to 0} \frac{\Psi_{\delta,t}(P_\epsilon) - \Psi_{\delta,t}(P)}{\epsilon}
={}& \lim_{\epsilon \to 0} \frac{1}{\epsilon} \lim_{h \to 0} E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right)\left(\Phi\!\left(\frac{b_\epsilon(W)-x}{h}\right) - \Phi\!\left(\frac{b(W)-x}{h}\right)\right) dx \tag{5} \\
&+ E_W\!\left(\int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\,dx - \Psi_{\delta,t}(P)\right) S(O)
\end{aligned}$$

Ignoring the second term for now and expanding $\Phi$ to second order, the first term of (5) becomes

$$\begin{aligned}
&\lim_{\epsilon \to 0} \frac{1}{\epsilon} \lim_{h \to 0} E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) \frac{1}{h}\Phi'\!\left(\frac{b(W)-x}{h}\right)\left(b_\epsilon(W) - b(W)\right) dx \\
&\quad + \lim_{\epsilon \to 0} \frac{1}{\epsilon} \lim_{h \to 0} E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) \frac{1}{2h^2}\Phi^{(2)}\!\left(\zeta\!\left(\frac{x - b(W)}{h}\right)\right)\left(b_\epsilon(W) - b(W)\right)^2 dx \\
={}& \lim_{\epsilon \to 0} \frac{1}{\epsilon} \lim_{h \to 0} \left( E_W \int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) \frac{1}{h}\Phi'\!\left(\frac{b(W)-x}{h}\right)\left(b_\epsilon(W) - b(W)\right) dx + R_{2,h,x}(b_\epsilon, b) \right) \tag{6}
\end{aligned}$$

We can note that $R_{2,h,x}(b_\epsilon, b)/\epsilon \to 0$ for $h = h(\epsilon)$ chosen such that $\epsilon/h^2 \to 0$ as $\epsilon \to 0$, because $R_{2,h,x}(b_\epsilon, b)$ is of order $\epsilon^2/h^2$. To see this, consider the convenient fact that $\Phi^{(2)}$ is bounded.

Let’s now drop $R_{2,h,x}$ for now and use integration by parts to compute a part of the integrand in (5):

$$\begin{aligned}
&E_W \lim_{a \to \infty} \int_{t-a\delta}^{t+a\delta} \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) \frac{1}{h}\Phi'\!\left(\frac{b(W)-x}{h}\right) dx \,\left(b_\epsilon(W) - b(W)\right) \\
={}& E_W \lim_{a \to \infty}\left( -\frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right)\Phi\!\left(\frac{b(W)-x}{h}\right)\Bigg|_{t-a\delta}^{t+a\delta} + \int_{t-a\delta}^{t+a\delta} \frac{1}{\delta^2} k'\!\left(\frac{x-t}{\delta}\right)\Phi\!\left(\frac{b(W)-x}{h}\right) dx \right)\left(b_\epsilon(W) - b(W)\right)
\end{aligned}$$

Letting $h \to 0$ and applying dominated convergence, together with $k$ Lipschitz with support $[-1,1]$ (so the boundary term vanishes as $a \to \infty$), this becomes

$$E_W \int_x \frac{1}{\delta^2} k'\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\left(b_\epsilon(W) - b(W)\right) dx = E_W\left(-\frac{1}{\delta}k\!\left(\frac{b(W)-t}{\delta}\right)\right)\left(b_\epsilon(W) - b(W)\right)$$

We can summarize as follows:

$$\begin{aligned}
\lim_{\epsilon \to 0} \frac{\Psi_{\delta,t}(P_\epsilon) - \Psi_{\delta,t}(P)}{\epsilon}
={}& \lim_{\epsilon \to 0} \frac{1}{\epsilon} E_W\left(-\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right)\left(b_\epsilon(W) - b(W)\right)\right) + \lim_{\epsilon \to 0} \lim_{h(\epsilon) \to 0} \frac{E_W R_{2,h,x}(b_\epsilon, b)}{\epsilon} \\
&+ E\left[\left(\int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\,dx - \Psi_{\delta,t}(P)\right) S(O)\right]
\end{aligned}$$

As previously stated, the second term disappears by an easy choice of $h(\epsilon)$. We also note

$$\frac{d}{d\epsilon} p_{Y,\epsilon}(y \mid a, w)\Big|_{\epsilon=0} = p_Y(y \mid a, w)\, S_Y(y \mid a, w) = p_Y(y \mid a, w)\left(S(o) - E[S(O) \mid a, w]\right)$$

We then compute the pathwise derivative along $T_Y$ at $P$:

$$\begin{aligned}
\lim_{\epsilon \to 0} \frac{1}{\epsilon} E_W\left(-\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right)\left(b_\epsilon(W) - b(W)\right)\right)
&= \int -\frac{1}{\delta} k\!\left(\frac{b(w)-t}{\delta}\right)\left(\int y\,\frac{d}{d\epsilon}\Big(p_{Y,\epsilon}(y \mid 1, w) - p_{Y,\epsilon}(y \mid 0, w)\Big)\Big|_{\epsilon=0}\, d\nu(y)\right) p_W(w)\, d\nu(w) \\
&= \int -\frac{1}{\delta} k\!\left(\frac{b(w)-t}{\delta}\right)\left(\iint \frac{2a-1}{g(a \mid w)}\, y\, p_Y(y \mid a, w)\, S_Y(o)\, d\nu(y)\, g(a \mid w)\, d\nu(a)\right) p_W(w)\, d\nu(w) \\
&= \int -\frac{1}{\delta} k\!\left(\frac{b(w)-t}{\delta}\right) \frac{2a-1}{g(a \mid w)}\, y\left(S(o) - E[S \mid a, w]\right) p(o)\, d\nu(o) \\
&= \int -\frac{1}{\delta} k\!\left(\frac{b(w)-t}{\delta}\right) \frac{2a-1}{g(a \mid w)}\left(y - \bar{Q}(a, w)\right) S(o)\, p(o)\, d\nu(o) \\
&= \left\langle -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right),\, S(O) \right\rangle_{L^2_0(P)}
\end{aligned}$$

Thus we finally get:

$$\begin{aligned}
\lim_{\epsilon \to 0} \frac{\Psi_{\delta,t}(P_\epsilon) - \Psi_{\delta,t}(P)}{\epsilon}
={}& E\left[\left(\int_x \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\,dx - \Psi_{\delta,t}(P)\right) S(O)\right] \\
&+ \left\langle -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right),\, S(O) \right\rangle_{L^2_0(P)} \\
={}& \left\langle D^\star_{\Psi_{\delta,t}}(P),\, S \right\rangle_{L^2_0(P)}
\end{aligned}$$

where

$$D^\star_{\Psi_{\delta,t}}(P)(O) = -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right) + \int \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\, dx - \Psi_{\delta,t}(P)$$

And this is the efficient influence curve, since the canonical gradient is the only gradient for a non-parametric model, where the closure of the set of scores is all of $L^2_0(P)$.

QED
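Given fitted nuisance values, the efficient influence curve above can be evaluated numerically. The following sketch assumes an Epanechnikov kernel (so the integral term reduces in closed form to $1 - K((b(W)-t)/\delta)$, with $K$ the kernel CDF); all inputs are hypothetical fitted values, and `eic` is an illustrative helper name:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def epanechnikov_cdf(u):
    u = np.clip(u, -1.0, 1.0)
    return 0.75 * (u - u**3 / 3.0) + 0.5

def eic(Y, A, blip, g1, Qbar, t, delta):
    """Efficient influence curve D*_{Psi_{delta,t}}(P)(O) at each observation.

    blip: b(W) = Qbar(1,W) - Qbar(0,W); g1: g(1|W); Qbar: Qbar(A,W) at the
    observed A.  The integral term equals 1 - K((b(W) - t)/delta), and the
    plug-in Psi is its empirical mean.
    """
    gA = np.where(A == 1, g1, 1 - g1)              # g(A|W)
    u = (blip - t) / delta
    smooth_ind = 1.0 - epanechnikov_cdf(u)          # ∫ (1/δ)k((x−t)/δ) I(b≤x) dx
    psi = np.mean(smooth_ind)                       # plug-in Psi_{delta,t}
    score = -(1.0 / delta) * epanechnikov(u) * (2 * A - 1) / gA * (Y - Qbar)
    return score + smooth_ind - psi
```

Note that when the plug-in $\Psi$ is the empirical mean of the smoothed indicator, the $T_W$ component has empirical mean zero by construction.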

## 5 The Targeted Maximum Likelihood Estimator, TMLE

We will employ the notation $P_n f$ to denote the empirical average of a function, $f$, and $P f$ to denote $E_P f(O)$. Define a loss function, $L(P)(O)$, which is a function of the observed data, $O$, and indexed by the distribution on which it is defined, such that $P_0 L(P)$ is minimized at the true observed data distribution, $P = P_0$. The targeted maximum likelihood (TML) estimating procedure maps an initial estimate, $P_n^0$, of the true data generating distribution to $P_n^\star$ such that $P_n L(P_n^\star) \le P_n L(P_n^0)$ and such that $P_n D^\star_{\Psi_{\delta,t_j}}(P_n^\star) = 0$ for $j = 1, \ldots, d$, where $d$, in this case, is the number of points on the CDF. $\Psi_{\delta,t}(P_n^\star)$ is called the TMLE of the initial estimate Laan:2006aa, Laan:2011aa. For convenience, we define $\bar{Q}_0(A,W) = E_{P_0}[Y \mid A, W]$ and its initial estimate, $\bar{Q}_n^0$, and we will use $g_0$ with corresponding initial estimate, $g_n$. The initial estimate of the distribution of $W$ is denoted $Q_{W,n}$, the empirical distribution of $W$, with density $q_{W,n}$. For this paper, the TMLE procedure only adjusts the initial estimate of the outcome regression, leaving $g_n$ and $Q_{W,n}$ as is. Thus we will only update $\bar{Q}_n^0$ to its TMLE, $\bar{Q}_n^\star$.

To perform the TMLE updating procedure, we may find an element of either a universal least favorable submodel (ulfm), a least favorable submodel (lfm), both defined in van der Laan and Gruber, 2016, or a canonical least favorable submodel (clfm). Both the clfm and ulfm use a one-dimensional submodel, whereas the lfm is of dimension $d$ and is identical to a clfm if $d = 1$. The ulfm has the advantage of not relying on iteration, as explained in van der Laan and Gruber, 2016, but here we did not notice an appreciable difference in performance, so we used the faster clfm procedure. To construct a clfm, ulfm or lfm, one needs to know the efficient influence curve, which is given by

$$D^\star_{\Psi_{\delta,t}}(P)(O) = -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right) + \int \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right) I(b(W) \le x)\, dx - \Psi_{\delta,t}(P)$$

where we estimate the CDF of the blip at a given blip value, $t$, using kernel, $k$, and bandwidth, $\delta$ blipCDFtech. The CV-TMLE algorithm by the author cvtmle simplifies the originally formulated CV-TMLE algorithm by Zheng and van der Laan, 2010 and, in this case, turns out to be the same estimator if we use a pooled regression to fit the fluctuation parameter. The TMLE updating procedure is implemented in the software packages blipCDF blipCDF and tmle3. Here we will provide, for readers more familiar with TMLE, only the so-called clever covariate Laan:2006aa for $\bar{Q}$, but the reader may consult Levy, 2018c for a detailed algorithm.

$$H(A,W) = -\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right) \frac{2A-1}{g(A \mid W)}$$

If we are simultaneously estimating $d$ points, $t_1, \ldots, t_d$, on the CDF curve, we will have the clever covariate:

$$\left(H_1(A,W), H_2(A,W), \ldots, H_d(A,W)\right) = \frac{1-2A}{\delta\, g(A \mid W)}\left(k\!\left(\frac{b(W)-t_1}{\delta}\right), k\!\left(\frac{b(W)-t_2}{\delta}\right), \ldots, k\!\left(\frac{b(W)-t_d}{\delta}\right)\right)$$
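A vectorized sketch of this $d$-dimensional clever covariate follows; `clever_covariates` is an illustrative helper name, and the Epanechnikov kernel is an assumption (the paper leaves the kernel generic):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def clever_covariates(blip, g1, A, ts, delta):
    """Matrix with columns H_j(A,W) = (1-2A)/(delta*g(A|W)) * k((b(W)-t_j)/delta).

    blip: estimated b(W) per observation; g1: estimated g(1|W); ts: the
    grid t_1, ..., t_d of CDF points.
    """
    gA = np.where(A == 1, g1, 1 - g1)          # g(A|W)
    base = (1 - 2 * A) / (delta * gA)          # shared factor, note 1-2A = -(2A-1)
    K = np.stack([epanechnikov((blip - t) / delta) for t in ts], axis=1)
    return base[:, None] * K                   # shape (n, d)
```

Observations whose estimated blip lies outside $[t_j - \delta,\ t_j + \delta]$ get a zero covariate in column $j$, reflecting the local nature of the kernel.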

The TMLE procedure yields $\bar{Q}_n^\star$ and our estimator is then a plug-in, using the empirical distribution, $Q_{W,n}$:

$$\Psi_{\delta,t}(P_n^\star) = \frac{1}{n}\sum_{i=1}^n \frac{1}{\delta}\int k\!\left(\frac{x-t}{\delta}\right) I\left(b_n^\star(W_i) \le x\right) dx$$

where $b_n^\star(W) = \bar{Q}_n^\star(1,W) - \bar{Q}_n^\star(0,W)$, the blip function estimate. For simultaneously estimating many points on the CDF of the blip, the TMLE procedure yields a common outcome model, $\bar{Q}_n^\star$, for all $t$-values, which has the advantage of preserving monotonicity.
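For a single $t$, the fluctuation step that produces the updated outcome model can be sketched as a one-parameter logistic submodel in the direction of the clever covariate. The Newton solver below is a minimal stand-in for the clfm/ulfm machinery (`tmle_update` is a hypothetical helper name), assuming $Y$ has been scaled to $[0,1]$:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_update(Y, Qbar, H, steps=30):
    """Fit eps in Qbar_eps = expit(logit(Qbar) + eps*H) by Newton steps on
    the binomial log-likelihood, so that sum(H * (Y - Qbar_eps)) = 0,
    i.e. the relevant EIC score component is solved."""
    Qc = np.clip(Qbar, 1e-6, 1 - 1e-6)
    off = np.log(Qc / (1 - Qc))                 # logit offset
    eps = 0.0
    for _ in range(steps):
        Q = expit(off + eps * H)
        grad = np.sum(H * (Y - Q))              # d/d(eps) of log-likelihood
        hess = -np.sum(H**2 * Q * (1 - Q))
        if hess == 0.0:
            break
        eps -= grad / hess
    return expit(off + eps * H), eps
```

After the update, the empirical mean of the clever-covariate score term is (numerically) zero, which is exactly the targeting condition $P_n D^\star(P_n^\star) = 0$ for this component.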

#### 5.0.1 Software

The TMLE is implemented in the software packages blipCDF blipCDF and tmle3.

## 6 TMLE conditions

By solving the efficient influence curve equation with our TMLE update, we can write a second order expansion, $\Psi_{\delta,t}(P_n^\star) - \Psi_{\delta,t}(P_0) = (P_n - P_0) D^\star_{\Psi_{\delta,t}}(P_n^\star) + R_2(P_n^\star, P_0)$. We then arrive at the following conditions (for fixed bandwidth, $\delta$) that guarantee asymptotic efficiency for this estimator Laan:2006aa, Laan:2011aa, the first of which is not required for CV-TMLE Zheng:2010aa. Thus CV-TMLE is our preferred estimator, since it requires fewer conditions on our machine learning, enabling a more aggressive approach to fitting the treatment mechanism and the outcome model.

### 6.1 TMLE Conditions and Asymptotic Efficiency

We refer the reader to Targeted Learning Appendix Laan:2011aa as well as Laan:2015aa,Laan:2015ab, Laan:2006aa for a more detailed look at the theory of TMLE. For convenience, we will summarize some of the main results for the reader.

#### 6.1.1 Conditions for Asymptotic Efficiency

Define the norm $\|f\|_{P_0}^2 = P_0 f^2$. Assume the following TMLE conditions:

1. $D^\star_{\Psi_{\delta,t}}(P_n^\star)$ is in a $P_0$-Donsker class. This condition can be dropped in the case of using CV-TMLE Zheng:2010aa. We show the advantages of CV-TMLE in our simulations.

2. The second order remainder term, $R_2(P_n^\star, P_0)$, is $o_P(1/\sqrt{n})$.

Then $\Psi_{\delta,t}(P_n^\star) - \Psi_{\delta,t}(P_0) = (P_n - P_0) D^\star_{\Psi_{\delta,t}}(P_0) + o_P(1/\sqrt{n})$. Our plug-in TMLE's and CI's are given by

$$\Psi_{\delta,t}(P_n^\star) \pm z_\alpha\, \frac{\hat{\sigma}_n\!\left(D^\star_{\Psi_{\delta,t}}(P_n^\star)\right)}{\sqrt{n}}$$

Under the above conditions, these confidence bands will be as small as possible for any regular asymptotically linear estimator at significance level, $\alpha$, where $\Pr(|Z| \ge z_\alpha) = \alpha$ for $Z$ standard normal and $\hat{\sigma}_n\!\left(D^\star_{\Psi_{\delta,t}}(P_n^\star)\right)$ is the sample standard deviation of the estimated efficient influence curve values Laan:2006aa. Note that if the TMLE conditions hold for the initial estimate, $P_n^0$, then they will also hold for the updated model, $P_n^\star$ Laan:2015aa, thereby placing importance on our ensemble machine learning in constructing the initial estimate. For simultaneous confidence intervals, we refer the reader to Levy, van der Laan et al., 2018, for the method which leverages the efficient influence curve approximation to form confidence bounds that simultaneously cover the parameter values at a given significance level. Such inference is as tight as possible and certainly tighter than a standard Bonferroni correction bonferroni.
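The Wald-type interval above can be computed directly from the estimated influence-curve values; `wald_ci` is a hypothetical helper name:

```python
import numpy as np
from statistics import NormalDist

def wald_ci(psi_hat, eic_vals, alpha=0.05):
    """Wald CI: psi ± z_{1-alpha/2} * sigma_n / sqrt(n), where sigma_n is
    the sample sd of the estimated efficient influence curve values."""
    n = len(eic_vals)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = np.std(eic_vals, ddof=1) / np.sqrt(n)
    return psi_hat - z * se, psi_hat + z * se
```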

## 7 The Remainder Term for a TMLE Plug-in Estimator of Ψδ,t(P0)

In this section we bound the remainder term of the previous section, assuming WLOG that the support of the kernel is $[-1, 1]$.

###### Lemma 7.1.

Assume the density of $b_0(W)$ is bounded by some $M < \infty$ and assume WLOG that the support of the kernel is $[-1, 1]$. Then

$$P_0\, I\left(b_0(W) > t+\delta,\ b(W) \le t+\delta\right) \le M\, \|b - b_0\|_\infty$$

proof:

$$\begin{aligned}
P_0\, I\left(b_0(W) > t+\delta,\ b(W) \le t+\delta\right)
&= P_0\, I\left(b_0(W) > t+\delta,\ b(W) \le t+\delta,\ b_0(W) - b(W) > b_0(W) - (t+\delta)\right) \\
&\le P_0\, I\left(b_0(W) > t+\delta,\ \|b - b_0\|_\infty > b_0(W) - (t+\delta)\right) \\
&\le \Pr\left(t+\delta < b_0(W) \le t+\delta+\|b - b_0\|_\infty\right) \le M\, \|b - b_0\|_\infty
\end{aligned}$$

QED

###### Theorem 7.2.

The remainder term, $R_2(P_0, P)$, is second order in the nuisance estimation errors, controlled by a product of the errors in $g$ and $\bar{Q}$ together with a term that is second order in $b - b_0$.

Proof:

$$\begin{aligned}
R_2(P_0, P) ={}& P_0 D^\star_{\Psi_{\delta,t}}(P) + \Psi_{\delta,t}(P) - \Psi_{\delta,t}(P_0) \tag{10}\\
={}& P_0\left[-\frac{1}{\delta} k\!\left(\frac{b(W)-t}{\delta}\right)\frac{2A-1}{g(A \mid W)}\left(Y - \bar{Q}(A,W)\right)\right] + P_0 \int \frac{1}{\delta} k\!\left(\frac{x-t}{\delta}\right)\left(I(b(W) \le x) - I(b_0(W) \le x)\right) dx \\
={}& -\frac{1}{\delta} P_0\left[k\!\left(\frac{b(W)-t}{\delta}\right)\left(\frac{g_0(1 \mid W)}{g(1 \mid W)}\left(\bar{Q}_0(1,W) - \bar{Q}(1,W)\right) - \frac{g_0(0 \mid W)}{g(0 \mid W)}\left(\bar{Q}_0(0,W) - \bar{Q}(0,W)\right)\right)\right] + \frac{1}{\delta} P_0 \int_{b(W)}^{b_0(W)} k\!\left(\frac{x-t}{\delta}\right) dx \\
={}& -\frac{1}{\delta} P_0\left[k\!\left(\frac{b(W)-t}{\delta}\right)\left(\left(\frac{g_0(1 \mid W)}{g(1 \mid W)} - 1\right)\left(\bar{Q}_0(1,W) - \bar{Q}(1,W)\right) - \left(\frac{g_0(0 \mid W)}{g(0 \mid W)} - 1\right)\left(\bar{Q}_0(0,W) - \bar{Q}(0,W)\right)\right)\right] \tag{12} \\
&+ \frac{1}{\delta} P_0\left[\int_{b(W)}^{b_0(W)} k\!\left(\frac{x-t}{\delta}\right) dx + k\!\left(\frac{b(W)-t}{\delta}\right)\left(b(W) - b_0(W)\right)\right] \tag{13}
\end{aligned}$$

Clearly (12) will vanish if $g = g_0$ is known and used. Otherwise, by Cauchy-Schwarz and positivity, (12) is bounded by a constant times $\frac{1}{\delta}\left(\|g(1 \mid \cdot) - g_0(1 \mid \cdot)\|_{P_0}\, \|\bar{Q}(1,\cdot) - \bar{Q}_0(1,\cdot)\|_{P_0} + \|g(0 \mid \cdot) - g_0(0 \mid \cdot)\|_{P_0}\, \|\bar{Q}(0,\cdot) - \bar{Q}_0(0,\cdot)\|_{P_0}\right)$.

Now let’s take a look at (13):

$$\frac{1}{\delta} P_0\left[\int_{b(W)}^{b_0(W)} k\!\left(\frac{x-t}{\delta}\right) dx + k\!\left(\frac{b(W)-t}{\delta}\right)\left(b(W) - b_0(W)\right)\right] \tag{13}$$

We can divide the space into disjoint parts and integrate:

a) $\{W : t - \delta < b(W) \le t + \delta\}$:

Assuming is lipschitz, we have as follows:

 1δP0I(t−δ