# Estimation of distributional effects of treatment and control under selection on observables: consistency, weak convergence, and applications

In this paper, the estimation of the distribution functions of the potential outcomes under receiving or not receiving a treatment is studied. The approach is based on weighting observed data on the basis of the estimated propensity score. A weighted version of the empirical process is constructed and its weak convergence to a bivariate Gaussian process is established. Results for the estimation of the Average Treatment Effect (ATE) and Quantile Treatment Effect (QTE) are obtained as by-products. Applications to the construction of nonparametric tests for the treatment effect and for the stochastic dominance of the treatment over control are considered, and their finite sample properties and merits are studied via simulation.


## 1 Introduction

The evaluation of the possible effects of a treatment on an outcome plays a central role in the theoretical as well as the applied statistical and econometric literature; cf. the excellent review papers by [3] and [12]. The main quantity of interest, traditionally, is the average effect of the treatment on the outcome, or better the difference between the expected values of outcomes for treated and control (untreated) subjects, i.e. the ATE (Average Treatment Effect). Another quantity of interest is the effect of treatment on outcome quantiles, which is summarized by the QTE (Quantile Treatment Effect). The main source of difficulty is that data are usually observational, so that the estimation of the treatment effect by simply comparing outcomes for treated vs. control subjects is prone to a relevant source of bias: receiving a treatment is not a “purely random” event, and there could be relevant differences between treated and control subjects. This motivates the need to account for confounding covariates.

In the literature, several different techniques have been proposed to estimate the ATE, under various assumptions (see [3], [12] and references therein). As far as the QTE is concerned, cf. the paper by [9]. The problem of evaluating possible differences in the distribution function of potential outcomes with binary instrumental variables is studied in [1] via a Kolmogorov-Smirnov type test.

In the present paper we essentially focus on evaluating the possible effects of the treatment on the whole outcome probability distribution. The starting point is to use outcome weighting similar to that introduced in [11] and [9]. Using this approach, estimates of the distribution function (d.f.) for treated and control subjects will be obtained. Such estimators essentially play a role similar to the empirical d.f. in nonparametric statistics. It will be shown that the resulting “empirical processes” weakly converge to an appropriate Gaussian process. Although it is not a Brownian bridge, it possesses several properties similar to those of the Brownian bridge (continuity of trajectories, etc.). These theoretical results are applied to the construction of confidence bands for the outcome distribution under treatment and under control, as well as to the construction of a new statistical test to compare treated and untreated subjects. In a sense, such a test is a version of the classical Wilcoxon-Mann-Whitney test for two-group comparison. Its main merit is to capture the possible difference between treated and untreated subjects even when the ATE is equal to zero. Another application of interest will be the construction of a test for stochastic dominance of treatment w.r.t. control, which is of interest, for instance, in programme evaluation exercises ([15]), welfare outcomes, etc.

The paper is organized as follows. In Section 2 the problem is described. In Section 3.2 the main asymptotic, large sample results are provided, and in Section 4 approximations based on subsampling are considered. Particularizations to the ATE and the QTE are given in Section 5. Section 6 is devoted to the construction of confidence bands for the d.f. of outcomes, for both treated and untreated subjects. In Section 7 a Wilcoxon-type statistic to test for a treatment effect on the d.f. of outcomes is introduced, and in Section 8 an elementary test for first-order stochastic dominance of treated vs. untreated is studied. The finite sample performance of the proposed methodologies is studied via Monte Carlo simulation in Section 9.

## 2 The problem

Let $Y$ be an outcome of interest, observed on a sample of $n$ subjects. Some of the sample units are treated with an appropriate treatment (treated group); the other sample units are untreated (control group). If $T$ denotes the treatment indicator variable, then whenever $T = 1$, $Y(1)$ is observed; otherwise, if $T = 0$, $Y(0)$ is observed. Here $Y(1)$ and $Y(0)$ are the potential outcomes due to receiving and not receiving the treatment, respectively. The observed outcome is then equal to $Y = T\,Y(1) + (1 - T)\,Y(0)$. In the sequel, $F_1(y) = P(Y(1) \le y)$ will denote the distribution function (d.f.) of $Y(1)$, and $F_0(y) = P(Y(0) \le y)$ the d.f. of $Y(0)$.

As already said in the introduction, receiving a treatment is not a “purely random” event, as it would be in an experimental framework. On the contrary, there could be relevant differences between treated and untreated subjects, due to the presence of confounding covariates. In the sequel, we will denote by $x$ the (random) vector of relevant covariates, which is assumed to be observed.

In order to get consistent estimates, identification restrictions are necessary. The relevant restriction assumed in the sequel is that selection into treatment is based on observable variables: given a set of observed covariates, assignment either to the treatment group or to the control group is random. Formally speaking, let $p(x) = P(T = 1 \mid x)$ be the conditional probability of receiving the treatment given the covariates $x$; it is termed the propensity score. The marginal probability of being treated, $P(T = 1)$, is equal to $E_x[p(x)]$.

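In the sequel, our main assumption is that the strong ignorability conditions (cf. [18]) are fulfilled. In more detail, consider the joint distribution of $(Y(1), Y(0), T, x)$, and denote by $\mathcal{X}$ the support of $x$. The following assumptions are assumed to hold.

• Unconfoundedness (cf. [19]): given $x$, $(Y(1), Y(0))$ are jointly independent of $T$: $(Y(1), Y(0)) \perp T \mid x$.

• The support $\mathcal{X}$ of $x$ is a compact subset of a Euclidean space.

• Common support: there exists $\delta > 0$ for which $\delta \le p(x) \le 1 - \delta$ for all $x \in \mathcal{X}$, so that $p(x) \ge \delta$, $1 - p(x) \ge \delta$.

The unconfoundedness assumption is also known as the Conditional Independence Assumption (CIA).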

For the sake of simplicity, we will use in the sequel the notation

$$p_1(x) = p(x), \qquad p_0(x) = 1 - p(x). \tag{1}$$

From the above assumptions, the basic relationships

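$$
\begin{aligned}
E\left[\frac{1}{p_j(x)}\, I(T=j)\, I(Y \le y)\right]
&= E_x\left[E\left[\frac{1}{p_j(x)}\, I(T=j)\, I(Y(j) \le y) \,\Big|\, x\right]\right] \\
&= E_x\left[\frac{1}{p_j(x)}\, E\left[I(T=j) \mid x\right] E\left[I(Y(j) \le y) \mid x\right]\right] \\
&= E_x\left[F_j(y \mid x)\right] \\
&= F_j(y), \qquad j = 1, 0.
\end{aligned}
\tag{2}
$$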

are obtained.

The Average Treatment Effect (ATE) is defined as $\tau = E[Y(1)] - E[Y(0)]$. The estimation of the ATE is a problem of primary importance in the literature, and several different approaches have been proposed ([3] and references therein). Another parameter of interest is the Quantile Treatment Effect (QTE), which is the difference between quantiles of $F_1$ and $F_0$: $F_1^{-1}(p) - F_0^{-1}(p)$, with $0 < p < 1$; cf. [9]. In particular, when $p = 1/2$ it reduces to the Median Treatment Effect.

As already said in the introductory section, in the present paper we concentrate on the estimation of the d.f.s $F_1$, $F_0$ under treatment and control, respectively. As special cases, the results in [11] and [9] will be obtained.

## 3 Estimation of $F_1$, $F_0$

### 3.1 Basics

The basic approach to the estimation of $F_1$, $F_0$ follows, in principle, the ideas developed in [11] to estimate the ATE. First of all, the propensity score is estimated by a sieve estimator, $\hat{p}_n(x)$, say; cf. [11], [9]. Let $\boldsymbol{H}_K(x)$ be a $K$-dimensional vector of polynomials in $x$, such that

• ;

• ;

• includes all polynomials up to order whenever , with as .

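The propensity score is approximated by a linear combination of the components of $\boldsymbol{H}_K(x)$ on a logit scale, with coefficients estimated by maximizing a pseudo-likelihood. More formally, if $L(z) = e^{z} / (1 + e^{z})$ is the logistic function, then $\hat{p}_n(x) = L(\boldsymbol{H}_K(x)^T \hat{\boldsymbol{\pi}}_K)$, where the $K$-dimensional vector $\hat{\boldsymbol{\pi}}_K$ is estimated by the maximum likelihood method:

$$\hat{\boldsymbol{\pi}}_K = \arg\max_{\boldsymbol{\pi}_K} \frac{1}{n} \sum_{i=1}^{n} \left\{ T_i \log L\left(\boldsymbol{H}_K(x_i)^T \boldsymbol{\pi}_K\right) + (1 - T_i) \log\left(1 - L\left(\boldsymbol{H}_K(x_i)^T \boldsymbol{\pi}_K\right)\right) \right\}.$$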

In the sequel, the following result will be widely used.

###### Theorem 1.

Assume that S1 - S3 are fulfilled, and that is continuously differentiable of order , with . If , with , then

$$\sup_{x} \left| \hat{p}_n(x) - p(x) \right| \xrightarrow{p} 0 \quad \text{as } n \to \infty. \tag{3}$$

Proof. See [11].
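The sieve logit step can be illustrated with a minimal Python sketch that fits the series logit by Newton-Raphson for a scalar covariate. The power basis, the simulated design, the function names (`polynomial_basis`, `fit_logit_sieve`, `propensity`) and the fixed polynomial order are assumptions made for this example only; they are not taken from [11] or from the present paper, and the order is not chosen according to the rate conditions of Theorem 1.

```python
import numpy as np

def polynomial_basis(x, order):
    """Columns 1, x, ..., x**order for a scalar covariate (hypothetical choice of H_K)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=order + 1, increasing=True)

def fit_logit_sieve(H, T, n_iter=50, tol=1e-10):
    """Maximize the logistic log-likelihood in pi by Newton-Raphson."""
    pi = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ pi))          # L(H pi)
        grad = H.T @ (T - p)                        # score vector
        hess = (H * (p * (1.0 - p))[:, None]).T @ H  # minus the Hessian
        # Tiny ridge term only for numerical stability of the linear solve.
        step = np.linalg.solve(hess + 1e-10 * np.eye(H.shape[1]), grad)
        pi = pi + step
        if np.max(np.abs(step)) < tol:
            break
    return pi

def propensity(H, pi):
    """Fitted propensity score L(H pi)."""
    return 1.0 / (1.0 + np.exp(-H @ pi))

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1.0, 1.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 * x - 0.3 * x**2)))
T = rng.binomial(1, p_true)

H = polynomial_basis(x, order=3)
pi_hat = fit_logit_sieve(H, T)
p_hat = propensity(H, pi_hat)
max_err = float(np.max(np.abs(p_hat - p_true)))
print("sup-distance between fitted and true propensity:", max_err)
```

In this sketch the fitted values stay strictly inside $(0, 1)$ by construction of the logit link, which is what makes the inverse weights used in the estimators well defined.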

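Again, for notational simplicity, and similarly to (1), define:

$$\hat{p}_{1,n}(x) = \hat{p}_n(x), \qquad \hat{p}_{0,n}(x) = 1 - \hat{p}_n(x). \tag{4}$$

In order to estimate $F_1$ and $F_0$, the following “Hájek-type” estimators are considered:

$$\hat{F}_{1,n}(y) = \sum_{i=1}^{n} w^{(1)}_{i,n}\, I(Y_i \le y), \qquad \hat{F}_{0,n}(y) = \sum_{i=1}^{n} w^{(0)}_{i,n}\, I(Y_i \le y) \tag{5}$$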

where

$$w^{(j)}_{i,n} = \frac{I(T_i = j) / \hat{p}_{j,n}(x_i)}{\sum_{k=1}^{n} I(T_k = j) / \hat{p}_{j,n}(x_k)}, \qquad j = 1, 0; \; i = 1, \dots, n. \tag{6}$$

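It is immediate to see that $\hat{F}_{1,n}$, $\hat{F}_{0,n}$ are proper d.f.s, i.e. they are bona fide estimators.

As alternative estimators of $F_1$, $F_0$, the following “Horvitz-Thompson-type” estimators could be considered:

$$\hat{F}^{HT}_{1,n}(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{I(T_i = 1)}{\hat{p}_{1,n}(x_i)}\, I(Y_i \le y), \qquad \hat{F}^{HT}_{0,n}(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{I(T_i = 0)}{\hat{p}_{0,n}(x_i)}\, I(Y_i \le y). \tag{7}$$

We will mainly concentrate on $\hat{F}_{j,n}$, $j = 1, 0$, for two reasons. First of all, $\hat{F}^{HT}_{j,n}$ are not proper d.f.s, because $\hat{F}^{HT}_{j,n}(+\infty) \neq 1$ with positive probability. In the second place, as will be seen in the sequel, $\hat{F}^{HT}_{j,n}$ are asymptotically equivalent to $\hat{F}_{j,n}$.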
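The two families of estimators can be compared numerically with a short sketch. The code below is a minimal illustration, assuming a simulated design with a scalar covariate; for brevity the true propensity score is used as a stand-in for the sieve estimate, and the function names (`group_propensity`, `hajek_cdf`, `ht_cdf`) are hypothetical.

```python
import numpy as np

def group_propensity(p_hat, j):
    """p_{1,n}(x) = p_n(x) and p_{0,n}(x) = 1 - p_n(x)."""
    return p_hat if j == 1 else 1.0 - p_hat

def hajek_cdf(y_grid, Y, T, p_hat, j):
    """Hajek-type estimator: inverse-propensity weights normalized to sum to one."""
    raw = (T == j) / group_propensity(p_hat, j)
    w = raw / raw.sum()
    return np.array([np.sum(w * (Y <= y)) for y in y_grid])

def ht_cdf(y_grid, Y, T, p_hat, j):
    """Horvitz-Thompson-type estimator: inverse-propensity weights divided by n."""
    raw = (T == j) / group_propensity(p_hat, j)
    return np.array([np.mean(raw * (Y <= y)) for y in y_grid])

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(-1.0, 1.0, size=n)
p = 1.0 / (1.0 + np.exp(-x))          # true propensity, used in place of an estimate
T = rng.binomial(1, p)
Y1 = 1.0 + x + rng.normal(size=n)     # potential outcome under treatment
Y0 = x + rng.normal(size=n)           # potential outcome under control
Y = np.where(T == 1, Y1, Y0)          # observed outcome

y_grid = np.linspace(Y.min(), Y.max(), 201)
F1_hajek = hajek_cdf(y_grid, Y, T, p, j=1)
F1_ht = ht_cdf(y_grid, Y, T, p, j=1)
# Infeasible benchmark: empirical d.f. of Y(1) over all units (known only in simulation).
F1_oracle = np.array([np.mean(Y1 <= y) for y in y_grid])

print("Hajek value at the largest observation:", F1_hajek[-1])
print("Horvitz-Thompson value at the largest observation:", F1_ht[-1])
print("sup-distance Hajek vs oracle:", np.max(np.abs(F1_hajek - F1_oracle)))
```

The Hájek-type curve reaches one at the largest observation because its weights are normalized, whereas the Horvitz-Thompson-type curve ends at the average inverse-propensity weight, which generally differs from one.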

### 3.2 Basic asymptotic results

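The goal of the present section is to study the asymptotic, large sample properties of the estimators $\hat{F}_{1,n}$, $\hat{F}_{0,n}$. Our first result is a Glivenko-Cantelli type result, showing the uniform consistency (in probability) of $\hat{F}_{1,n}$, $\hat{F}_{0,n}$.

###### Proposition 1.

Assume that the conditions of Th. 1 are fulfilled. Then:

$$\sup_{y} \left| \hat{F}_{1,n}(y) - F_1(y) \right| \xrightarrow{p} 0, \qquad \sup_{y} \left| \hat{F}_{0,n}(y) - F_0(y) \right| \xrightarrow{p} 0 \quad \text{as } n \to \infty. \tag{8}$$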

Proof. See Appendix.

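The next step consists in studying the limiting, large sample distribution of the above estimators. Define first the stochastic process

$$W_n(y) = \begin{bmatrix} W_{1,n}(y) \\ W_{0,n}(y) \end{bmatrix} = \begin{bmatrix} \sqrt{n}\,(\hat{F}_{1,n}(y) - F_1(y)) \\ \sqrt{n}\,(\hat{F}_{0,n}(y) - F_0(y)) \end{bmatrix}, \qquad y \in \mathbb{R}. \tag{9}$$

The bivariate stochastic process $W_n$ essentially plays the same role as the empirical process in classical nonparametric statistics, with a complication due to the presence of $\hat{F}_{1,n}$, $\hat{F}_{0,n}$ instead of the usual empirical distribution function.

The weak convergence of $W_n$ can be proved similarly to that of the classical empirical process, with modifications. In the first place, from

$$\sqrt{n}\,(\hat{F}_{j,n}(y) - F_j(y)) = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{I(T_i = j)}{\hat{p}_{j,n}(x_i)} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{I(T_i = j)}{\hat{p}_{j,n}(x_i)} \left( I(Y_i \le y) - F_j(y) \right), \qquad j = 1, 0$$

and from Lemma 2, it is seen that the limiting distribution of $W_n$, if it exists, coincides with the limiting distribution of

$$\begin{bmatrix} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} \dfrac{I(T_i = 1)}{\hat{p}_{1,n}(x_i)} \left( I(Y_i \le y) - F_1(y) \right) \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} \dfrac{I(T_i = 0)}{\hat{p}_{0,n}(x_i)} \left( I(Y_i \le y) - F_0(y) \right) \end{bmatrix}, \qquad y \in \mathbb{R}. \tag{10}$$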

In the second place, by repeating verbatim the arguments in Th. 1 in [11], and [10], with instead of and instead of , it is seen that, if , with , then the relationship

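$$\begin{bmatrix} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} \dfrac{I(T_i = 1)}{\hat{p}_{1,n}(x_i)} \left( I(Y_i \le y) - F_1(y) \right) \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} \dfrac{I(T_i = 0)}{\hat{p}_{0,n}(x_i)} \left( I(Y_i \le y) - F_0(y) \right) \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} Z_{1,i}(y) \\[2ex] \dfrac{1}{\sqrt{n}} \displaystyle\sum_{i=1}^{n} Z_{0,i}(y) \end{bmatrix} + o_p(1), \qquad y \in \mathbb{R} \tag{11}$$

holds, where

$$Z_{j,i}(y) = \left( \frac{I(T_i = j)}{p_j(x_i)}\, I(Y_i \le y) - F_j(y) \right) - \frac{F_j(y \mid x_i)}{p_j(x_i)} \left( I(T_i = j) - p_j(x_i) \right), \qquad j = 1, 0; \; i = 1, \dots, n. \tag{12}$$

The term $o_p(1)$ appearing in (11) depends on $y$, and, as it appears by using the bounds in [10], convergence in probability to zero (or better, to the vector $[0, 0]^T$) holds uniformly over compact sets of $y$s. Hence, in order to prove that the sequence of stochastic processes $W_n$ converges weakly to a limit process, it is enough to prove that the first term on the right-hand side of (11) converges weakly to a limiting process.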

###### Proposition 2.

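Assume that the conditions of Th. 1 are fulfilled, and that $F_1(y)$, $F_0(y)$, $F_1(y \mid x)$, $F_0(y \mid x)$ are continuous. Then, the sequence of stochastic processes $W_n$ converges weakly, as $n$ goes to infinity, to a Gaussian process $W = [W_1, W_0]^T$ with null mean function ($E[W_j(y)] = 0$, $j = 1, 0$) and covariance kernel:

$$C(y, t) = E\left[ W(y) \otimes W(t) \right] = \begin{bmatrix} C_{11}(y, t) & C_{10}(y, t) \\ C_{01}(y, t) & C_{00}(y, t) \end{bmatrix} \tag{13}$$

where:

$$
\begin{aligned}
C_{jj}(y, t) &= E\left[ \frac{1}{p_j(x)} \left( F_j(y \wedge t \mid x) - F_j(y \mid x) F_j(t \mid x) \right) \right] \\
&\quad + E_x\left[ \left( F_j(y \mid x) - F_j(y) \right) \left( F_j(t \mid x) - F_j(t) \right) \right], \qquad j = 1, 0;
\end{aligned}
\tag{14}
$$

$$
\begin{aligned}
C_{10}(y, t) &= E\left[ \left( F_1(y \mid x) - F_1(y) \right) \left( F_0(t \mid x) - F_0(t) \right) \right] \\
&= E\left[ F_1(y \mid x) F_0(t \mid x) \right] - F_1(y) F_0(t);
\end{aligned}
\tag{15}
$$

$$C_{01}(y, t) = C_{10}(t, y) = E\left[ \left( F_1(t \mid x) - F_1(t) \right) \left( F_0(y \mid x) - F_0(y) \right) \right]. \tag{16}$$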

Weak convergence takes place in the set of bounded functions equipped with the sup-norm (if ) .

Proof. See Appendix.
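As a quick check on (14) and (15), consider the special case in which the covariates are irrelevant, that is, $p_j(x) \equiv p_j$ is constant and $F_j(y \mid x) = F_j(y)$ for all $x$. This case is an assumption made here only for illustration; it corresponds to a completely randomized design. Then the second term in (14) and the whole of (15) vanish, and

$$C_{jj}(y, t) = \frac{F_j(y \wedge t) - F_j(y) F_j(t)}{p_j}, \qquad C_{10}(y, t) = C_{01}(y, t) = 0, \qquad j = 1, 0,$$

so that each component of the limiting process is a Brownian bridge on the scale of $F_j$ multiplied by $1/\sqrt{p_j}$, and the two components are uncorrelated, hence independent by joint Gaussianity. In the general case, setting $t = y$ in (14) gives the pointwise asymptotic variance

$$C_{jj}(y, y) = E\left[ \frac{F_j(y \mid x) \left( 1 - F_j(y \mid x) \right)}{p_j(x)} \right] + E_x\left[ \left( F_j(y \mid x) - F_j(y) \right)^2 \right].$$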

Due to the continuity of $F_1$, $F_0$, the weak convergence of Proposition 2 also holds in the space of $\mathbb{R}^2$-valued càdlàg functions equipped with the Skorokhod topology.

Consider now the Horvitz-Thompson estimators $\hat{F}^{HT}_{j,n}$, and define:

$$W^{HT}_{j,n}(y) = \sqrt{n}\,(\hat{F}^{HT}_{j,n}(y) - F_j(y)), \qquad j = 1, 0.$$

From the proof of Proposition 2, it appears that the sequence of stochastic processes $[W^{HT}_{1,n}, W^{HT}_{0,n}]^T$ converges weakly to the same Gaussian limiting process that appears in Proposition 2. Hence, the Horvitz-Thompson estimators $\hat{F}^{HT}_{j,n}$ are asymptotically equivalent to the Hájek estimators $\hat{F}_{j,n}$.

As is well known, in classical nonparametric statistics the empirical process converges weakly to a Brownian bridge, on the scale of the population distribution function. The limiting process in Proposition 2 is not a Brownian bridge, of course, although it is a Gaussian process. However, it shares with the Brownian bridge an important property: it possesses trajectories that are a.s. continuous.

###### Proposition 3.

If $F_1$ and $F_0$ are continuous, the limiting process $W$ possesses trajectories that are continuous with probability 1.

Proof. See Appendix.

### 3.3 Differentiable functionals

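The result of Proposition 2 can be immediately extended to general Hadamard differentiable functionals of $(F_1, F_0)$, again assuming the continuity of $F_1$, $F_0$. Consider a general functional:

$$\theta = \theta(F_1, F_0) : l^{\infty}(\mathbb{R})^2 \to E$$

where $l^{\infty}(\mathbb{R})^2$ is equipped with the sup-norm metric and $E$ is a normed space equipped with a norm $\| \cdot \|_E$. As seen in Proposition 3, the limiting process $W$ concentrates on $C(\overline{\mathbb{R}})^2$, where $C(\overline{\mathbb{R}})$ is the set of continuous functions on the extended real line $\overline{\mathbb{R}}$. Note that functions in $C(\overline{\mathbb{R}})$ are bounded.

The functional $\theta$ is Hadamard differentiable at $(F_1, F_0)$ tangentially to $C(\overline{\mathbb{R}})^2$ if there exists a linear map

$$\theta'_{(F_1, F_0)} : C(\overline{\mathbb{R}}) \times C(\overline{\mathbb{R}}) \to E$$

such that:

$$\left\| \frac{\theta\left( (F_1, F_0) + t h_t \right) - \theta(F_1, F_0)}{t} - \theta'_{(F_1, F_0)}(h) \right\|_E \to 0 \quad \text{as } t \downarrow 0, \; \forall\, h_t \to h.$$

Using Theorem 20.8 in [20], we then have:

$$\sqrt{n} \left( \theta(\hat{F}_{1,n}, \hat{F}_{0,n}) - \theta(F_1, F_0) \right) \xrightarrow{d} \theta'_{(F_1, F_0)}(W). \tag{17}$$

In general, since $\theta'_{(F_1, F_0)}(W)$ is a linear functional of a Gaussian process, it is a Gaussian process as well. In particular, if $\theta$ is a real-valued functional, then $\theta'_{(F_1, F_0)}(W)$ has a Gaussian distribution with zero expectation and variance

$$\sigma^2_{\theta} = E\left[ \theta'_{(F_1, F_0)}(W)^2 \right]. \tag{18}$$

For the sake of simplicity, let $\hat{\theta}_n$ be equal to $\theta(\hat{F}_{1,n}, \hat{F}_{0,n})$. The above result can be rewritten as

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2_{\theta}) \quad \text{as } n \to \infty \tag{19}$$

where the asymptotic variance $\sigma^2_{\theta}$ is given by (18).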

## 4 Subsampling approximation

Consider a functional $\theta = \theta(F_1, F_0)$. In order to construct a confidence interval on the basis of (19), a consistent estimate of the asymptotic variance is necessary. Unfortunately, apart from a few cases, this is not simple, because $\sigma^2_{\theta}$ could depend on $F_1$, $F_0$ in a complicated way, and direct estimation might not be possible. This is the case, for instance, of quantiles, which will be dealt with in the next section. Here we briefly present a simple approach based on subsampling.

Consider the sample units $(Y_i, T_i, x_i)$, $i = 1, \dots, n$, and all the $\binom{n}{m}$ subsamples of size $m$ that can be drawn from them. Let further $\hat{\theta}_{m,l}$ be the statistic computed for the $l$-th subsample of size $m$. Next, consider the empirical distribution function of the $\binom{n}{m}$ quantities $\sqrt{m}\,(\hat{\theta}_{m,l} - \hat{\theta}_n)$. In symbols:

$$R_{n,m}(u) = \binom{n}{m}^{-1} \sum_{l=1}^{\binom{n}{m}} I\left( \sqrt{m}\,(\hat{\theta}_{m,l} - \hat{\theta}_n) \le u \right). \tag{20}$$

If:

• ;

• $m$ depends on $n$ in such a way that $m \to \infty$, $m/n \to 0$;

then, using Th. 2.1 in [17], we have

$$R_{n,m}(u) \xrightarrow{p} \Phi\left( \frac{u}{\sigma_{\theta}} \right) \quad \text{as } n, m \to \infty \tag{21}$$

where $\Phi$ is the distribution function of the standard Gaussian distribution. The convergence in (21) is uniform in $u$.

Relationship (21) tells us that $\Phi(u / \sigma_{\theta})$ can be (uniformly) approximated by $R_{n,m}(u)$, as $n$ and $m$ get large. From the continuity and strict monotonicity of $\Phi$, it follows that the empirical quantile $R^{-1}_{n,m}(p) = \inf\{u : R_{n,m}(u) \ge p\}$ converges in probability to the quantile of order $p$ of the distribution $N(0, \sigma^2_{\theta})$.

The number of subsamples of size $m$, $\binom{n}{m}$, can be very large, and then $R_{n,m}$ in (20) could be difficult to compute. In this case a “stochastic” version of $R_{n,m}$ can be considered according to the following steps.

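• Select $M$ independent subsamples of size $m$ from the sample.

• Compute the corresponding values $\hat{\theta}_{m,l}$, $l = 1, \dots, M$, of the statistic.

• Compute the corresponding empirical distribution function:

$$\hat{R}_{n,m}(u) = \frac{1}{M} \sum_{l=1}^{M} I\left( \sqrt{m}\,(\hat{\theta}_{m,l} - \hat{\theta}_n) \le u \right). \tag{22}$$

It can be easily verified that if $m \to \infty$, $m/n \to 0$ and $M \to \infty$, then $\hat{R}_{n,m}$ has the same limiting behaviour as $R_{n,m}$. These results can be used to obtain confidence intervals for $\theta$ and for testing statistical hypotheses on $\theta$ via inversion of confidence intervals. In more detail, let

$$\hat{R}^{-1}_{n,m}(p) = \inf\{ u : \hat{R}_{n,m}(u) \ge p \}$$

be the $p$th quantile of $\hat{R}_{n,m}$. It is easy to show that the interval:

$$\left[ \hat{\theta}_n - \frac{1}{\sqrt{n}}\, \hat{R}^{-1}_{n,m}\left( 1 - \frac{\alpha}{2} \right), \; \hat{\theta}_n - \frac{1}{\sqrt{n}}\, \hat{R}^{-1}_{n,m}\left( \frac{\alpha}{2} \right) \right] \tag{23}$$

is a confidence interval for $\theta$ of asymptotic level $1 - \alpha$.

The confidence interval (23) can also be used for testing the hypothesis:

$$\begin{cases} H_0 : \theta = \theta_0 \\ H_1 : \theta \neq \theta_0 \end{cases}$$

If $\theta_0$ is in the confidence interval, then $H_0$ is accepted; otherwise it is rejected. Clearly, this is a test of asymptotic significance level $\alpha$.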
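The stochastic version of the subsampling procedure and the interval (23) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the statistic is a Hájek-type ATE, the data are simulated, the true propensity score is used in place of an estimate that would normally be recomputed within each subsample, and the names `empirical_quantile`, `subsampling_ci` and `hajek_ate` are hypothetical.

```python
import math
import numpy as np

def empirical_quantile(values, p):
    """inf{u : R(u) >= p}, where R is the empirical d.f. of `values`."""
    v = np.sort(np.asarray(values, dtype=float))
    k = max(math.ceil(p * len(v) - 1e-12), 1)   # smallest k with k/M >= p
    return v[min(k, len(v)) - 1]

def subsampling_ci(statistic, data, m, M, alpha, rng):
    """Interval (23) built from M random subsamples of size m (drawn without replacement)."""
    n = len(data[0])
    theta_n = statistic(*data)
    roots = np.empty(M)
    for l in range(M):
        idx = rng.choice(n, size=m, replace=False)
        theta_ml = statistic(*(a[idx] for a in data))
        roots[l] = math.sqrt(m) * (theta_ml - theta_n)
    lower = theta_n - empirical_quantile(roots, 1.0 - alpha / 2.0) / math.sqrt(n)
    upper = theta_n - empirical_quantile(roots, alpha / 2.0) / math.sqrt(n)
    return theta_n, lower, upper

def hajek_ate(Y, T, p_hat):
    """Difference of normalized inverse-propensity weighted means."""
    w1 = T / p_hat
    w0 = (1 - T) / (1.0 - p_hat)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

# Simulated example (assumed design; true propensity used as a stand-in for an estimate).
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
p = 1.0 / (1.0 + np.exp(-x))
T = rng.binomial(1, p)
Y = np.where(T == 1, 1.0 + x, x) + rng.normal(size=n)

theta_n, lower, upper = subsampling_ci(hajek_ate, (Y, T, p), m=200, M=500,
                                       alpha=0.05, rng=rng)
print("estimate:", theta_n, "interval:", (lower, upper))
```

The choice $m = 200$, $M = 500$ is arbitrary; in practice $m$ should be small relative to $n$, in line with the conditions $m \to \infty$, $m/n \to 0$.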

## 5 Average and Quantile Treatment Effect

The results obtained so far allow one to re-obtain, as special cases, results previously obtained by [11] and [9]. They are presented below.

### 5.1 Average Treatment Effect

The Average Treatment Effect (ATE, for short) is defined as:

 τ=E[Y(1)]−E[Y(0)]=∫Ryd[F1(y)−F0(y)]. (24)

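In the sequel, we will assume that and are both finite. As an estimator of $\tau$, consider

$$
\begin{aligned}
\hat{\tau} &= \int_{-\infty}^{+\infty} y \, d\left[ \hat{F}_{1,n}(y) - \hat{F}_{0,n}(y) \right] \\
&= \sum_{i=1}^{n} Y_i\, w^{(1)}_{i,n} - \sum_{i=1}^{n} Y_i\, w^{(0)}_{i,n},
\end{aligned}
\tag{25}
$$

where the weights $w^{(1)}_{i,n}$, $w^{(0)}_{i,n}$ are given by (6).

As it appears from (24), $\tau$ is a linear functional of $(F_1, F_0)$ and hence Hadamard differentiable. An integration by parts shows that the asymptotic distribution of $\sqrt{n}\,(\hat{\tau} - \tau)$ coincides with that of

$$-\int_{-\infty}^{+\infty} \left( W_1(y) - W_0(y) \right) dy$$

which turns out to be normal with zero mean and variance

$$\sigma^2_{\tau} = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \left\{ C_{11}(y, t) - C_{10}(y, t) - C_{01}(y, t) + C_{00}(y, t) \right\} dy \, dt.$$

It is not difficult to see that the estimator $\hat{\tau}$ is asymptotically equivalent to that introduced in [11].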
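The second line of (25) can be checked on a tiny hand-computable example. The sketch below is only an illustration: the four observations and the propensity values are made up, and the function names (`hajek_weights`, `hajek_ate`) are hypothetical.

```python
import numpy as np

def hajek_weights(T, p_hat, j):
    """Normalized inverse-propensity weights for group j, as in (6)."""
    p_j = p_hat if j == 1 else 1.0 - p_hat
    raw = (T == j) / p_j
    return raw / raw.sum()

def hajek_ate(Y, T, p_hat):
    """Weighted mean of treated outcomes minus weighted mean of control outcomes."""
    w1 = hajek_weights(T, p_hat, 1)
    w0 = hajek_weights(T, p_hat, 0)
    return float(np.sum(w1 * Y) - np.sum(w0 * Y))

# Made-up data: two treated units, two controls.
Y = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([1, 1, 0, 0])
p_hat = np.array([0.25, 0.5, 0.5, 0.5])

# Treated weights are (2/3, 1/3), giving 4/3; control weights are (1/2, 1/2), giving 7/2.
tau_hat = hajek_ate(Y, T, p_hat)
print(tau_hat)  # 4/3 - 7/2 = -13/6
```

The treated unit with the smaller propensity score receives the larger weight, which is how the estimator compensates for units that were unlikely to be treated.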

### 5.2 Quantiles and Quantile Treatment Effect

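Let $Q_j(p) = F_j^{-1}(p)$ be the quantile of order $p$ of $F_j$, $j = 1, 0$. In the sequel, we will assume that $Q_1(p)$, $Q_0(p)$ are in the common support of $F_1$, $F_0$. Furthermore, we will denote by $\mathrm{supp}(F_j)$ the support of $F_j$, $j = 1, 0$.

Suppose that $F_1$, $F_0$ are continuous with positive density functions $f_1$, $f_0$, respectively:

$$f_j(y) = \frac{d F_j(y)}{d y} > 0 \quad \forall\, y \in \mathrm{supp}(F_j), \qquad j = 1, 0.$$

As a consequence of the above assumption, $F_j$ is strictly monotonic (in its support).

Consider now $p_1$, $p_2$ ($0 < p_1 < p_2 < 1$) such that $Q_j(p_1)$, $Q_j(p_2)$, $j = 1, 0$, lie in the common support of $F_1$, $F_0$. It is intuitive to estimate the quantile $Q_j(p)$ by its “empirical counterpart”

$$\hat{Q}_{j,n}(p) = \hat{F}^{-1}_{j,n}(p) = \inf\{ y : \hat{F}_{j,n}(y) \ge p \}, \qquad j = 1, 0. \tag{26}$$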

Let now be the set of the restrictions of the distribution functions in to , and let be the set of càdlàg functions in . From [20], it is seen that the map (from onto is Hadamard differentiable at tangentially to with derivative:

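$$h \longmapsto -\left( \frac{h}{f} \right) \circ F^{-1}.$$

Using then Th. 20.8 in [20] (cf. [7] for an equivalent approach), the process

$$\begin{bmatrix} \sqrt{n}\,(\hat{Q}_{1,n}(p) - Q_1(p)) \\ \sqrt{n}\,(\hat{Q}_{0,n}(p) - Q_0(p)) \end{bmatrix}, \qquad p \in [p_1, p_2] \tag{27}$$

converges weakly as $n \to \infty$ (on $l^{\infty}[p_1, p_2]^2$ equipped with the sup-norm) to a Gaussian process $Z$ defined as:

$$Z(p) = \begin{bmatrix} -\dfrac{W_1(Q_1(p))}{f_1(Q_1(p))} \\[2ex] -\dfrac{W_0(Q_0(p))}{f_0(Q_0(p))} \end{bmatrix}, \qquad p \in [p_1, p_2]. \tag{28}$$

The stochastic process $Z$ is a Gaussian process with zero mean function and covariance kernel (writing $C_1 = C_{11}$, $C_0 = C_{00}$ for the diagonal entries of (13)):

$$C_z(p, u) = \begin{bmatrix} \dfrac{C_1(Q_1(p), Q_1(u))}{f_1(Q_1(p))\, f_1(Q_1(u))} & \dfrac{C_{10}(Q_1(p), Q_0(u))}{f_1(Q_1(p))\, f_0(Q_0(u))} \\[2ex] \dfrac{C_{01}(Q_0(p), Q_1(u))}{f_0(Q_0(p))\, f_1(Q_1(u))} & \dfrac{C_0(Q_0(p), Q_0(u))}{f_0(Q_0(p))\, f_0(Q_0(u))} \end{bmatrix}.$$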

Note that due to the symmetry of the Gaussian distribution.

In [9] the difference between corresponding quantiles:

$$\varphi(p) = Q_1(p) - Q_0(p) \tag{29}$$

is considered. It is known as the Quantile Treatment Effect (QTE, for short). From (26) it is intuitive to estimate $\varphi(p)$ by

$$\hat{\varphi}(p) = \hat{Q}_{1,n}(p) - \hat{Q}_{0,n}(p). \tag{30}$$

The estimator $\hat{\varphi}(p)$ is asymptotically equivalent to the estimator of the QTE defined in [9]. In fact, from (27) and (28) it appears that

$$\sqrt{n}\,(\hat{\varphi}(p) - \varphi(p)) = \sqrt{n} \left( \hat{Q}_{1,n}(p) - \hat{Q}_{0,n}(p) - (Q_1(p) - Q_0(p)) \right) \tag{31}$$

tends in distribution, as $n$ goes to infinity, to a Gaussian distribution with zero mean and variance:
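The generalized inverse in (26) and the estimator (30) can be sketched in a few lines of Python. The example is hand-checkable: the eight observations and the constant propensity are made up, and the function names (`hajek_weights`, `weighted_quantile`, `qte`) are hypothetical.

```python
import numpy as np

def hajek_weights(T, p_hat, j):
    """Normalized inverse-propensity weights for group j, as in (6)."""
    p_j = p_hat if j == 1 else 1.0 - p_hat
    raw = (T == j) / p_j
    return raw / raw.sum()

def weighted_quantile(Y, w, p):
    """inf{y : F_hat(y) >= p} for the weighted d.f. with weights w."""
    order = np.argsort(Y, kind="stable")
    cum = np.cumsum(w[order])                       # weighted d.f. at sorted outcomes
    # Small tolerance guards against rounding in the cumulative sum.
    idx = np.searchsorted(cum, p - 1e-12, side="left")
    return float(Y[order][min(idx, len(Y) - 1)])

def qte(Y, T, p_hat, p):
    """Difference between the weighted quantiles of treated and control units."""
    q1 = weighted_quantile(Y, hajek_weights(T, p_hat, 1), p)
    q0 = weighted_quantile(Y, hajek_weights(T, p_hat, 0), p)
    return q1 - q0

# Made-up data: four treated units, four controls, constant propensity 1/2.
Y = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0])
T = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p_hat = np.full(8, 0.5)

median_effect = qte(Y, T, p_hat, 0.5)    # 2 - 20
quartile_effect = qte(Y, T, p_hat, 0.25)  # 1 - 10
print(median_effect, quartile_effect)
```

With equal weights inside each group, the weighted quantile reduces to the usual lower empirical quantile of that group, which is why the median effect here equals the difference between the second smallest treated and control outcomes.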

 V = C1(Q1(p),Q1(p))f1(Q1(p))2+C0(Q0(p),Q0(p))f0(Q0(p))2−C10(Q1(p),Q0(p))f1(Q1(p))f0(Q0(p))−C01(Q0(p),Q1(p))f0(Q0(p))f1(Q1(p)) (32) = C1(Q1(p),Q1(p))f1(Q1(p))2+C0(Q0(p),Q0(p))f0(Q0(p))2−2C10(Q1(p),Q0(p))f1(Q1(p))f0(Q0(p)) = 1f1(Q1(p))2{Ex[1p(x)F1(Q1(p)|x)(1−F1(Q1(p)|x))] + Ex[(F1(Q1(p)|x)−F1(Q1(p)))2]} + 1f0(Q0(