Semiparametric Estimation of Treatment Effects in Randomized Experiments

We develop new semiparametric methods for estimating treatment effects. We focus on a setting where the outcome distributions may be thick-tailed, where treatment effects are small, where sample sizes are large, and where assignment is completely random. This setting is of particular interest in recent experimentation in tech companies. We propose using parametric models for the treatment effects, as opposed to parametric models for the full outcome distributions. This leads to semiparametric models for the outcome distributions. We derive the semiparametric efficiency bound for this setting, and propose efficient estimators. In the case with a constant treatment effect, one of the proposed estimators has an interesting interpretation as a weighted average of quantile treatment effects, with the weights proportional to (minus) the second derivative of the log of the density of the potential outcomes. Our analysis also results in an extension of Huber's model and trimmed mean to include asymmetry, and a simplified condition on linear combinations of order statistics, which may be of independent interest.




1 Introduction

Historically, randomized experiments were often carried out in medical and agricultural settings. In these settings, sample sizes were often modest, typically on the order of hundreds or (occasionally) thousands of units. Outcomes commonly studied included average mortality or crop yield, and were characterized by relatively well-behaved data with thin tails. Standard analyses in those settings typically involved estimating the average effect of the treatment by the difference in average outcomes by treatment group, followed by constructing confidence intervals for the average causal effect using Normal-distribution-based approximations. These methods originated in the 1920s, e.g., Neyman (1990) and Fisher (1937), but they continue to be the standard in modern applications; see Wu and Hamada (2011) for a recent discussion.

More recently, many experiments are conducted using digital technology (see, e.g., Kohavi et al. (2020) for an overview). Gupta et al. (2019, p. 20) claim that “Together these organizations [Airbnb, Amazon, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yandex] tested more than one hundred thousand experiment treatments last year.” The settings for these online experiments are substantially different from those in biomedical and agricultural settings. First, the experiments are often on a vastly different scale, with the number of units on the order of hundreds of thousands to tens of millions. Second, the outcomes of interest, variables such as time spent by a consumer, sales per consumer, or payments per service provider, are characterized by distributions with extremely thick tails. Third, the average treatment effects are often small relative to the standard deviation of the outcomes, even if their magnitude is substantively important.

For example, Lewis and Rao (2015) analyze challenges with statistical power in experiments designed to measure digital advertising effectiveness. They discuss a hypothetical experiment where the average expenditure per potential customer is $7, with a standard deviation of $75, and where an average treatment effect of $0.35 (0.005 of a standard deviation) would be substantial in the sense of being highly profitable for the company. In that example, an experiment with power for a treatment effect of $0.35 equal to 80%, and a significance level for the test of means of 0.05, would require a sample size of 1.4 million customers. As a result, confidence intervals for the average effect of an advertisement are likely to include zero even if the average effect were substantively important, even with large sample sizes, e.g., over a hundred thousand units. However, the fact that a confidence interval for the average treatment effect includes zero does not mean that the data are not informative about the presence of causal effects of the treatment. Using Fisher exact p-value calculations (Fisher, 1937) with well-chosen statistics (e.g., the Hodges-Lehmann difference in average ranks, Rosenbaum (1993)), one may well be able to establish conclusively that treatment effects are present. However, the magnitude of the treatment effect is important for decisions about advertising. In this type of example, point estimates of average effects are of more importance for decision makers than rejections of the sharp Fisher null hypothesis of no effects whatsoever.
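The sample-size arithmetic in this example can be reproduced with the standard normal-approximation power formula for a two-sample test of means (a sketch; the helper name is ours, and this is the textbook calculation rather than Lewis and Rao's exact computation):

```python
from math import ceil
from statistics import NormalDist

def required_n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided two-sample z-test of means,
    detecting a mean difference delta when the outcome s.d. is sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Lewis and Rao's hypothetical: s.d. $75, substantively important effect $0.35
n_arm = required_n_per_arm(delta=0.35, sigma=75.0)
print(n_arm, 2 * n_arm)  # roughly 0.72 million per arm, about 1.4 million total
```

With these inputs the formula gives a total sample size of roughly 1.4 million, matching the figure quoted above.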

This sets the stage for the problem we address in this paper. In the absence of additional information, there is no estimator for the average treatment effect that is more efficient than the difference in means. To obtain more precise estimates, we either need to change focus away from the average treatment effect, or we need to make additional assumptions. One approach to changing the question slightly is to transform the outcome (e.g., taking logarithms or winsorizing the outcome), followed by a standard analysis estimating the average effect of the treatment on the transformed outcome. In this paper we choose a different approach, making additional assumptions on the distribution of the data. The key conceptual contribution is that we postulate low-dimensional parametric models for the unit-level treatment effects. Note that we do not use parametric models for the full outcome distributions, which might involve interactions between potential treatment and control values, because specifying a parametric model that approximates the full outcome distribution well is more challenging than doing so for the treatment effects. Unlike outcomes, treatment effects tend to be small and often have little variation. Instead, we propose a semiparametric model of the outcome distribution that incorporates a parametric model only for the treatment effects but not for the joint outcome distributions (Bickel et al. (1993), Bickel and Doksum (2015)). For this semiparametric set-up, we derive the influence function, the semiparametric efficiency bound, and propose semiparametrically efficient estimators. It turns out that the parametrization of the treatment effect can be very informative, potentially making the asymptotic variance for these semiparametric estimators much smaller than the corresponding asymptotic variance for the difference-in-means estimator. In addition, even if the model for the treatment effect is misspecified, the estimand corresponding to the estimator may have an interesting, causal, interpretation.

The remainder of the paper is organized as follows. First, in Section 2 we consider a special case where we assume that the treatment effect is additive and constant. This implies that the two potential outcome distributions differ only by a shift. This case is closely related to the classical two-sample problem, as discussed in Hodges Jr and Lehmann (1963), and to problems considered in the literature on descriptive statistics for nonparametric models, as in Bickel and Lehmann (1975a, b), Doksum (1974), and Doksum and Sievers (1976). Particularly relevant is the work of Jaeckel (1971a, b). One way to interpret Section 2 is that it presents, in a causal inference framework, extensions of Jaeckel's work to the two-sample context in the setting of semiparametric theory. In the process of doing so, we generalize Huber's model (Huber, 1964) and the estimator based on trimmed means to include asymmetry, and present a simplified version of the results of Chernoff et al. (1967) (see also Bickel (1967), Govindarajulu et al. (1967), and Stigler (1974) on linear combinations of order statistics). In particular, we exhibit efficient M and L estimates for outcome distributions that are known up to a shift. We then analyze partially adaptive estimates, in particular flexible trimmed means (as in Jaeckel (1971a)), and fully adaptive estimates of both types, as discussed, for example, in Bickel et al. (1993). In this setting the problem is closely related to the literature on robust estimation of location (e.g., Bickel and Lehmann (1975a, b); Jaeckel (1971a); Hampel et al. (2011); Huber (2011); Bickel and Lehmann (1976, 2012)). In Section 3 we consider the case where we have more flexible parametric models for the treatment effect. In Section 4 we provide some simulation evidence regarding the finite sample properties of the proposed methods in controlled settings. In that section we also provide two applications of the methods, one to data on housing prices and one to data on medical expenses. In both cases the outcome distributions are thick-tailed. Section 5 concludes. A software implementation for R is available at

2 Constant Additive Treatment Effects

In this section we focus on a special case of the general problem where the treatment effect is additive and constant across units. In Section 3 we discuss the general case. We start by setting up the problem more formally and introducing some notation. Then we discuss robust estimation in the one-sample case to motivate a class of weighted quantile treatment effect estimators. We then discuss the formal semiparametric problem and show the adaptivity of the proposed estimators. Finally we consider partial adaptivity and robustness.

2.1 Set Up

We consider a set up with a randomized experiment with $n$ observations drawn randomly from a large population. With probability $p$ a unit is assigned to the treatment group. Let $n_t$ and $n_c$ denote the number of units assigned to the treatment and control group. Using the potential outcome set up (Neyman, 1990; Rubin, 1974; Imbens and Rubin, 2015), let $Y_i(0)$ and $Y_i(1)$ denote the two potential outcomes for unit $i$, and let the binary treatment be denoted by $W_i \in \{0, 1\}$. The realized outcome is $Y_i = Y_i(W_i)$. We assume that there is no interference between units, so that the treatment assignment for one unit does not affect the outcomes for any other unit. For all units in the sample we observe the pair $(W_i, Y_i)$.

The cumulative distribution functions for the two potential outcomes are $F_0(y)$ and $F_1(y)$, with inverses $F_0^{-1}(q)$ and $F_1^{-1}(q)$, and means and variances $\mu_0$, $\mu_1$, $\sigma_0^2$, and $\sigma_1^2$. Note that by the random assignment assumption the distribution of the potential outcome $Y_i(w)$ is identical to the conditional distribution of the realized outcome $Y_i$ conditional on $W_i = w$.

We are interested in the average treatment effect in the population,
$$\tau = \mathbb{E}[Y_i(1) - Y_i(0)] = \mu_1 - \mu_0.$$
The natural estimator for this average treatment effect is the difference in sample averages,
$$\hat{\tau} = \bar{Y}_t - \bar{Y}_c,$$
where $\bar{Y}_t$ and $\bar{Y}_c$ are the averages of the observed outcomes by treatment group. Under standard conditions $\hat{\tau}$ is consistent and $\sqrt{n}(\hat{\tau} - \tau) \xrightarrow{d} \mathcal{N}(0, V)$, with $V = \sigma_1^2 / p + \sigma_0^2 / (1 - p)$.

The concern with the conventional estimator $\hat{\tau}$ is that it may be imprecise. In particular, in settings where the outcome distribution is thick-tailed, sometimes extremely so, confidence intervals may be wide. At the same time, it is possible that there is substantial evidence in the data that the treatment does have some effect. Randomization-based tests on ranks or other transformations of the outcome (e.g., Rosenbaum (2002)) may offer such evidence. However, the magnitude of the causal effect is often of primary interest, and it is not sufficient to simply reject the null of no effect. We address this by incorporating ideas from the robust estimation literature (Hampel et al., 2011; Huber, 2011) and by imposing some restrictions on the joint distribution of the potential outcomes motivated by the semiparametric literature (Bickel and Lehmann, 1975a, b, 1976, 2012) to develop new estimators.

The quantile treatment effect (Lehmann and D'Abrera, 1975) plays an important role in our setup. For quantile $q \in (0, 1)$, define
$$\tau(q) = F_1^{-1}(q) - F_0^{-1}(q),$$
which is closely related to what Doksum (1974) and Doksum and Sievers (1976) label the response function. The natural estimate for the quantile treatment effect is the empirical plug-in,
$$\hat{\tau}(q) = \hat{F}_1^{-1}(q) - \hat{F}_0^{-1}(q),$$
where $\hat{F}_1^{-1}(q)$ equals $Y_{t,(\lceil q n_t \rceil)}$, defined as the $\lceil q n_t \rceil$-th order statistic of the treated outcomes $Y_{t,(1)} \le \dots \le Y_{t,(n_t)}$, and similarly for $\hat{F}_0^{-1}(q)$.

A natural class of parameters summarizing the difference between the $F_0$ and $F_1$ distributions consists of weighted averages of the quantile treatment effects:
$$\tau_w = \int_0^1 w(q)\, \tau(q)\, dq,$$
where the weights integrate to one, $\int_0^1 w(q)\, dq = 1$. Different choices for the weight function $w(\cdot)$ correspond to different estimands. The constant weight case, $w(q) = 1$, corresponds to the population average treatment effect $\tau$. The median corresponds to the case where $w(\cdot)$ puts all its mass at $q = 1/2$. We thus allow $w(\cdot)$ to permit point masses.

For a given weight function $w(\cdot)$ we can estimate the parameter $\tau_w$ using a weighted average quantile estimator:
$$\hat{\tau}_w = \int_0^1 w(q)\, \hat{\tau}(q)\, dq = \int_0^1 w(q) \left( \hat{F}_1^{-1}(q) - \hat{F}_0^{-1}(q) \right) dq,$$
where again $Y_{t,(1)} \le \dots \le Y_{t,(n_t)}$ and $Y_{c,(1)} \le \dots \le Y_{c,(n_c)}$ are the order statistics in the treatment and control groups.
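On a finite grid of quantiles, this weighted average quantile estimator can be sketched in a few lines (the function name and the grid choice are ours):

```python
import numpy as np

def weighted_average_qte(y_treat, y_control, weights, grid):
    """Weighted average of quantile treatment effects,
    tau_hat(q) = F1_hat^{-1}(q) - F0_hat^{-1}(q), over a quantile grid;
    the weights are normalized so that they sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    q1 = np.quantile(y_treat, grid)    # empirical quantiles, treated
    q0 = np.quantile(y_control, grid)  # empirical quantiles, control
    return float(np.sum(w * (q1 - q0)))

# Sanity check under a constant additive shift of 0.5
rng = np.random.default_rng(0)
y0 = rng.standard_normal(100_000)
y1 = rng.standard_normal(100_000) + 0.5
grid = np.linspace(0.01, 0.99, 99)
print(weighted_average_qte(y1, y0, np.ones_like(grid), grid))  # close to 0.5
```

With constant weights this approximates the difference in means; point-mass weights at $q = 1/2$ recover the difference in medians.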

2.2 Efficient estimation

We begin by considering the nonparametric model for a single sample, with cumulative distribution function $F$ and density $f$, where the interest is in the weighted quantile $\mu_w = \int_0^1 w(q) F^{-1}(q)\, dq$ for a given weight function $w(\cdot)$. For this one-sample case, Jaeckel (1971a, b) builds on Chernoff et al. (1967) to show that under simple conditions on $F$ and $w$, for a sample size of $n$,
$$\hat{\mu}_w = \mu_w + \frac{1}{n} \sum_{i=1}^n \psi(Y_i) + o_p(n^{-1/2}), \tag{2.5}$$
where the influence function $\psi(\cdot)$ is related to the weight function $w(\cdot)$ by
$$\psi(y) = -\int \left( 1\{x \ge y\} - (1 - F(x)) \right) w(F(x))\, dx; \tag{2.6}$$
the last term, $1 - F(x)$, ensures that $\int \psi(y)\, dF(y) = 0$.

Note that if the derivatives exist, by (2.6), $\psi'(y) = w(F(y))$, so that $w(q) = \psi'(F^{-1}(q))$. Note that for the median, (2.6) yields $\psi(y) = \left( 1\{y > F^{-1}(1/2)\} - 1/2 \right) / f(F^{-1}(1/2))$. Our formula (2.6) is slightly more general than Jaeckel's, and in the appendix we establish sufficient conditions on $F$ and $w$ for our version of his result to hold.

Expression (2.5) in turn implies that
$$\sqrt{n} \left( \hat{\mu}_w - \mu_w \right) \xrightarrow{d} \mathcal{N}(0, V_w),$$
where the variance equals
$$V_w = \int \psi(y)^2\, dF(y).$$
Jaeckel (1971a, b)'s results extend in the following way to the two-sample setting we are interested in. If $\tau_w$ is estimated by
$$\hat{\tau}_w = \int_0^1 w(q) \left( \hat{F}_1^{-1}(q) - \hat{F}_0^{-1}(q) \right) dq, \tag{2.7}$$
then, under regularity conditions given in the appendix (Theorem A.1),
$$\sqrt{n} \left( \hat{\tau}_w - \tau_w \right) \xrightarrow{d} \mathcal{N}\left( 0,\ \frac{V_{w,1}}{p} + \frac{V_{w,0}}{1 - p} \right), \qquad V_{w,j} = \int \psi_j(y)^2\, dF_j(y),$$
where $\psi_j$ is given by (2.6) for given $F_j$.

Now let us return to the primary focus of this section, the estimation of the average treatment effect under the constant treatment effect assumption. In that case the quantile treatment effects are all equal: $\tau(q) = \tau$ for all $q \in (0, 1)$. Then, for all weight functions $w(\cdot)$,
$$\tau_w = \int_0^1 w(q)\, \tau\, dq = \tau.$$


Suppose that in addition to knowing that the treatment effect is constant, we know $F_0$ up to a shift. That is, $F_0(y) = F(y - \mu)$ where $F$ is known with density $f$ and derivative $f'$, $\mu$ unknown, and, because of the constant treatment effect, $F_1(y) = F(y - \mu - \tau)$. For this fully parametric model (with unknown parameters $\mu$ and $\tau$), if the Fisher information satisfies $I_f = \int \left( f'(y)/f(y) \right)^2 f(y)\, dy < \infty$, the maximum likelihood estimator of $\tau$, suitably regularized (e.g., Le Cam and Yang (1988)), has influence function
$$\psi(y) = -\frac{f'(y)}{f(y)} \Big/ I_f. \tag{2.10}$$
If $f'/f$ is absolutely continuous, then
$$w(q) = -\frac{(\log f)''\left( F^{-1}(q) \right)}{I_f} \tag{2.11}$$
provides an efficient estimate when substituted appropriately in (2.7), leading to
$$\hat{\tau}_w = \int_0^1 w(q) \left( \hat{F}_1^{-1}(q) - \hat{F}_0^{-1}(q) \right) dq. \tag{2.12}$$
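As a numerical check of the efficient weights (minus the second derivative of the log density, normalized by the Fisher information), the standard logistic distribution is convenient because its quantile function has the closed form $F^{-1}(q) = \log(q/(1-q))$; there $-(\log f)''(F^{-1}(q)) = 2q(1-q)$ and $I_f = 1/3$, so the weights are $w(q) = 6q(1-q)$. A minimal numerical sketch (helper names are ours):

```python
import numpy as np

def logistic_logpdf(y):
    # log f(y) for the standard logistic distribution
    return -y - 2.0 * np.log1p(np.exp(-y))

def logistic_quantile(q):
    return np.log(q / (1.0 - q))

def efficient_weights(qs, eps=1e-4):
    """w(q) = -(log f)''(F^{-1}(q)) / I_f, with the second derivative taken
    by central differences, and I_f recovered by integrating -(log f)''
    over the quantile scale (a change of variables yields I_f)."""
    def d2(y):
        return (logistic_logpdf(y + eps) - 2.0 * logistic_logpdf(y)
                + logistic_logpdf(y - eps)) / eps ** 2
    grid = np.linspace(1e-5, 1.0 - 1e-5, 200_001)
    info = (-d2(logistic_quantile(grid))).mean() * (grid[-1] - grid[0])
    return -d2(logistic_quantile(np.asarray(qs))) / info

qs = np.array([0.1, 0.25, 0.5, 0.9])
print(efficient_weights(qs))  # matches the closed form 6*q*(1-q)
```

The same routine, with the log density and quantile function swapped out, recovers constant weights for the Gaussian and weights concentrated near the median for thick-tailed distributions.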


2.3 Fully adaptive estimation

It follows from semiparametric theory (Bickel et al., 1993) that, even if the density $f$ is unknown, substituting a suitable estimate of $f$ (and of $I_f$) in (2.10) or (2.12) will yield estimators with influence functions given by (2.10), or equivalently by
$$\tilde{\psi}(w, y) = \frac{w}{p}\, \psi(y - \mu - \tau) - \frac{1 - w}{1 - p}\, \psi(y - \mu), \tag{2.14}$$
where $\psi$ is the influence function in (2.10).

This is a consequence of the orthogonality of the tangent space with respect to the score for the parameters of interest.

Formally, we have for the case where $f$ is unknown:

Theorem 1.

For all $F$ such that the Fisher information $I_f$ exists and is finite:

  1. There exist consistent estimators of the density $f$ and its derivative $f'$.

  2. Under mild conditions (see (A.9) and (A.10) in the appendix), we can construct an estimate $\hat{\tau}_{\rm eif}$ whose influence function satisfies (2.14).

  3. Under mild conditions (see Lemma A.2 in the appendix), we can construct an estimate $\hat{\tau}_{\rm waq}$ (weighted average quantile) such that it satisfies (2.14).

Our conditions are not optimal (see Stone (1975) for minimal ones in the one sample case).

The estimate $\hat{\tau}_{\rm eif}$ is of the form
$$\hat{\tau}_{\rm eif} = \hat{\tau}_{\rm init} + \frac{1}{n} \sum_{i=1}^n \hat{\tilde{\psi}}(W_i, Y_i), \tag{2.15}$$
where $\hat{\tilde{\psi}}$ is an estimate of the influence function in (2.14), computed for each half of the sample using density estimates from the other half; this is a one-step estimate using the sample splitting technique (Klaassen, 1987).

The estimate $\hat{\tau}_{\rm waq}$ is obtained by direct plug-in of an estimated weight function, using sample splitting as above. Details are in the appendix.

Thus, both the estimators $\hat{\tau}_{\rm eif}$ and $\hat{\tau}_{\rm waq}$ are adaptive to models in which the distribution of control and treated potential outcomes is known up to location. Theorem 1 implies that the influence function based estimator is as efficient as the maximum likelihood estimator based on the true distribution function. For instance, if potential outcomes are normally distributed, the maximum likelihood estimator is the difference in means, and the influence function estimator has the same limiting distribution. If, however, the potential outcomes follow a double exponential distribution, the difference in means estimator is not efficient; the difference in medians is the efficient estimator. Under this true distribution, the influence function based estimator adapts and has the same limiting distribution as the difference in medians. For the Cauchy distribution the optimal weights are more complicated, $w(q) = 4\left( 1 - y_q^2 \right) / \left( 1 + y_q^2 \right)^2$ with $y_q = \tan(\pi(q - 1/2))$, but the influence function based estimator has the same limiting distribution, without requiring a priori knowledge about the distribution.
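A small Monte Carlo (a simplified illustration, not the paper's simulation design) shows why adaptivity matters in the Cauchy case: with a true effect of zero, the difference in means is wildly unstable, while the difference in medians, a simple robust benchmark, is well behaved:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2_000, 500
diff_means, diff_medians = [], []
for _ in range(reps):
    y0 = rng.standard_cauchy(n)  # control outcomes
    y1 = rng.standard_cauchy(n)  # treated outcomes; true effect is zero
    diff_means.append(y1.mean() - y0.mean())
    diff_medians.append(np.median(y1) - np.median(y0))

# The mean-based spread dwarfs the median-based spread
print(np.std(diff_means), np.std(diff_medians))
```

The instability is not a small-sample artifact: the sample mean of Cauchy draws is itself Cauchy distributed at any sample size, whereas median-type and efficiently weighted estimators converge at the usual rate.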

This influence function based estimator is a special case of the estimator developed by Cuzick (1992a, b) for the partial linear regression model setting.

Although these estimators are efficient under the constant additive treatment effect model, as we shall see in Section 3, if the model is violated, estimators of the type $\hat{\tau}_w$ derived from a known weight function $w(\cdot)$ continue to estimate, at rate $n^{-1/2}$, meaningful measures of the treatment effect as discussed in Section 2.1. This is unfortunately not the case for $\hat{\tau}_{\rm eif}$ and $\hat{\tau}_{\rm waq}$, because estimation of $f$ and $w(\cdot)$ introduces components of variance of order larger than $n^{-1}$. However, there is a partial remedy.

2.4 Partial Adaptation

An interesting alternative to full adaptation in this problem was studied by Jaeckel (1971b) in the one-sample symmetric case. See also Yu and Yao (2017). The estimator proposed by Jaeckel (1971b) is the $\alpha$-trimmed mean for a sample from $F(y - \mu)$ with $F$ symmetric. We can interpret this as restricting the class of weight functions to one indexed by a scalar parameter $\alpha \in [0, 1/2)$:
$$w_\alpha(q) = \frac{1\{\alpha \le q \le 1 - \alpha\}}{1 - 2\alpha}.$$
This weight function yields the $\alpha$-trimmed mean. We can then choose the value of $\alpha$ that minimizes the asymptotic variance of the estimator. This asymptotic variance is equal to
$$V(\alpha) = \frac{1}{(1 - 2\alpha)^2} \left[ \int_{F^{-1}(\alpha)}^{F^{-1}(1 - \alpha)} (y - \mu_W)^2\, dF(y) + \alpha \left( F^{-1}(\alpha) - \mu_W \right)^2 + \alpha \left( F^{-1}(1 - \alpha) - \mu_W \right)^2 \right],$$
the winsorized variance divided by $(1 - 2\alpha)^2$, where $\mu_W$ denotes the winsorized mean. This can be estimated by replacing $F$ with the empirical distribution, denoted $\hat{F}$. Let $\hat{\alpha}$ minimize the resulting variance estimate $\hat{V}(\alpha)$, and let $\hat{\tau}_{\hat{\alpha}}$ be the corresponding estimator for $\tau$. Jaeckel (1971b) shows that $\hat{\tau}_{\hat{\alpha}}$ is adaptive for estimating $\tau$ over a Huber family of densities. In the Huber family, $Y$ has density, for varying $k$, $0 \le k \le \infty$:
$$f_k(y) = C_k\, e^{-\rho_k(y)}, \qquad \rho_k(y) = \begin{cases} y^2/2 & \text{if } |y| \le k, \\ k|y| - k^2/2 & \text{if } |y| > k, \end{cases}$$
where $C_k$ makes $f_k$ integrate to one. Adaptivity here means that using the trimming proportion that optimizes the variance estimate in fact yields an estimate which is efficient for the member of the Huber family generating the data.

Because this family includes, among others, the Gaussian ($k = \infty$) and the double exponential ($k = 0$), it is very flexible. For more properties, see Huber (2011).
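Jaeckel's variance-minimizing trimming rule can be sketched as follows, using the winsorized-variance form of the asymptotic variance; the helper names and the coarse grid of trimming proportions are ours:

```python
import numpy as np

def trimmed_mean(y, alpha):
    y = np.sort(y)
    k = int(np.floor(alpha * len(y)))
    return y[k:len(y) - k].mean() if k > 0 else y.mean()

def trimmed_var_estimate(y, alpha):
    """Estimated asymptotic variance of the alpha-trimmed mean:
    sample winsorized variance divided by (1 - 2*alpha)^2."""
    y = np.sort(y)
    n = len(y)
    k = int(np.floor(alpha * n))
    yw = y.copy()
    yw[:k] = y[k]              # winsorize the left tail
    yw[n - k:] = y[n - k - 1]  # winsorize the right tail
    return yw.var() / (1.0 - 2.0 * alpha) ** 2

def adaptive_trimmed_mean(y, grid=np.arange(0.0, 0.5, 0.05)):
    alpha = min(grid, key=lambda a: trimmed_var_estimate(y, a))
    return trimmed_mean(y, alpha), alpha

# For double exponential data the rule trims heavily, approaching the median
rng = np.random.default_rng(0)
y = rng.laplace(size=4_001)
est, alpha = adaptive_trimmed_mean(y)
print(est, alpha)
```

For Gaussian data the same rule selects little or no trimming, which is the sense in which the procedure is partially adaptive.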

In the two-sample case, it is reasonable to consider asymmetric weight functions, leading to the natural generalization
$$w_{\alpha_1, \alpha_2}(q) = \frac{1\{\alpha_1 \le q \le 1 - \alpha_2\}}{1 - \alpha_1 - \alpha_2}.$$
We show in the online appendix that Jaeckel's result on partial adaptation can, for our two-sample problem, be extended to a generalization of the Huber family whose members are symmetric iff $k_1 = k_2$, defined by
$$f_{k_1, k_2}(y) = C_{k_1, k_2}\, e^{-\rho_{k_1, k_2}(y)}, \qquad \rho_{k_1, k_2}(y) = \begin{cases} -k_1 y - k_1^2/2 & \text{if } y < -k_1, \\ y^2/2 & \text{if } -k_1 \le y \le k_2, \\ k_2 y - k_2^2/2 & \text{if } y > k_2, \end{cases}$$
where $C_{k_1, k_2}$ makes $f_{k_1, k_2}$ integrate to one.

This family can be equivalently parametrized by the trimming proportions $\alpha_1 = F_{k_1, k_2}(-k_1)$ and $\alpha_2 = 1 - F_{k_1, k_2}(k_2)$, where $F_{k_1, k_2}$ is the CDF of $f_{k_1, k_2}$.

3 The General Parametric Treatment Effect Case

In some settings, the assumption of an additive model may be too restrictive. In this section, we develop estimators given a general parametric model for this difference.

The starting point is a model governing the relation between the two potential outcomes:

Assumption 1 (Parametric Model for Unit-level Treatment Effects).

The unit-level potential outcomes satisfy
$$Y_i(1) = g(Y_i(0); \theta)$$
for some known function $g(\cdot; \cdot)$ and an unknown parameter $\theta \in \Theta \subset \mathbb{R}^{d_\theta}$. Equivalently,
$$F_1(g(y; \theta)) = F_0(y).$$
As before, we initially assume $F_0$ known. In terms of the quantile treatment effects, Assumption 1 implies the restriction
$$\tau(q) = g\left( F_0^{-1}(q); \theta \right) - F_0^{-1}(q).$$

Given Assumption 1, the population average treatment effect can be characterized as
$$\tau = \mathbb{E}\left[ g(Y_i(0); \theta) - Y_i(0) \right].$$
In practice, however, estimating $\tau$ may still be subject to substantial sampling variance. As an alternative, we suggest estimating the in-sample, as opposed to population, average treatment effect. An insight is that estimators of this object can have a much lower variance. We define the in-sample average treatment effect as
$$\tau_{\rm sample} = \frac{1}{n} \sum_{i=1}^n \left( g(Y_i(0); \theta) - Y_i(0) \right).$$
When $g(\cdot; \theta)$ is not just an additive function, $\tau_{\rm sample}$ is sample-dependent, and thus stochastic. In particular, when the variance of $g(Y_i(0); \theta) - Y_i(0)$ is large because of thick tails of the potential outcome distributions, the variance of $\tau_{\rm sample}$ will be large, too. We therefore focus on the variance of estimators relative to $\tau_{\rm sample}$ for the particular sample at hand, rather than relative to the population average $\tau$.

Because $F_0$ is known, we can estimate $\theta$ efficiently by some version of maximum likelihood to get an estimate $\hat{\theta}$ and
$$\hat{\tau} = \frac{1}{n} \sum_{i=1}^n \left( g(\hat{Y}_i(0); \hat{\theta}) - \hat{Y}_i(0) \right), \qquad \hat{Y}_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ g^{-1}(Y_i; \hat{\theta}) & \text{if } W_i = 1, \end{cases}$$
as an estimate of $\tau_{\rm sample}$. The density of the realized outcome $Y_i$ given $W_i = 1$ is, by Assumption 1,
$$f_1(y) = f_0\left( g^{-1}(y; \theta) \right) \frac{\partial}{\partial y} g^{-1}(y; \theta),$$
and the score function is
$$S_\theta(y) = \frac{\partial}{\partial \theta} \log f_1(y),$$
where $f_0$ denotes the density of $F_0$.

If $F_0$ is assumed unknown, to obtain an efficient influence function we must:

  • Compute the tangent plane as $F_0$ varies with $\theta$ fixed. (Note that both factors of the likelihood must be varied treating $\theta$ as fixed.)

  • Project the score for $\theta$ on the orthocomplement of the tangent plane to obtain the efficient score, the score minus its projection on the tangent plane.

  • The efficient influence function is then given by the following lemma.

Lemma 1.

The efficient influence function for $\theta$ is the efficient score (the score for $\theta$ minus its projection on the tangent plane), normalized by its variance matrix.



To use the influence function approach we need a $\sqrt{n}$-consistent initial estimator $\tilde{\theta}$. We can obtain one by fixing a number of quantiles $q_1, \dots, q_K$, where $K$ is the dimension of $\theta$, and finding the $\theta$ that solves
$$\hat{F}_1^{-1}(q_k) = g\left( \hat{F}_0^{-1}(q_k); \theta \right)$$
for $k = 1, \dots, K$. Let $\tilde{\theta}$ denote the solution to this system of equations. Then:

Theorem 2.

Under mild conditions on the estimation of the density and its derivative, the one-step estimator $\hat{\theta}$ below is efficient for $\theta$, i.e., asymptotically linear with the efficient influence function of Lemma 1:
$$\hat{\theta} = \tilde{\theta} + \frac{1}{n} \sum_{i=1}^n \hat{\psi}_{\rm eff}(W_i, Y_i; \tilde{\theta}),$$
where $\hat{\psi}_{\rm eff}$ is the estimate of the efficient influence function, computed for each half of the sample using nuisance estimates from the other half; this is again a one-step estimate using the sample splitting technique (Klaassen, 1987), similar to (2.15).
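As an illustration of the quantile-matching initial estimator, consider a hypothetical multiplicative specification $g(y; \theta) = \theta y$ with scalar $\theta$; the sketch below matches several quantiles and combines them by least squares rather than solving an exactly identified system (helper names are ours):

```python
import numpy as np

def initial_theta(y_treat, y_control, qs=(0.25, 0.5, 0.75)):
    """Match F1_hat^{-1}(q) = theta * F0_hat^{-1}(q) at a few quantiles,
    combining the equations by least squares."""
    a = np.quantile(y_control, qs)
    b = np.quantile(y_treat, qs)
    return float(a @ b / (a @ a))

# Log-normal outcomes with a true multiplicative effect theta = 1.2
rng = np.random.default_rng(2)
y0 = np.exp(rng.standard_normal(50_000))
y1 = 1.2 * np.exp(rng.standard_normal(50_000))
print(initial_theta(y1, y0))  # close to 1.2
```

Because quantiles are insensitive to tail behavior, this initial estimator remains root-n consistent even when the outcome distribution is thick-tailed.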

Note that, just as in the constant treatment effect setting, the efficient influence function is not the one corresponding to $F_0$ known.

4 Simulations

We evaluate the performance of the proposed estimators and conventional estimators in a Monte Carlo study. Throughout most of these simulations, the true unit-level treatment effects are all zero. We estimate the treatment effect using the proposed efficient estimators based on an additive model. We consider seven estimators: the (standard) difference in means, the difference in medians, the Hodges-Lehmann estimator (Hodges Jr and Lehmann (1963)), which is equal to the median of all pairwise differences between treated and control observations, the adaptively trimmed mean, the adaptively winsorized mean, the estimator based on the efficient influence function (eif), and the weighted average quantiles (waq) estimator. For both trimming and winsorizing, we choose the trimming percentage that minimizes an estimate of the asymptotic variance, allowing anywhere between no trimming (the difference in means) and the extreme of trimming all but the medians (the difference in medians); while including the extremes is not covered by the theory, this approach appears to work well in our simulations. For the eif and waq estimators we report only results without sample splitting. Results for the case where we do use sample splitting are very similar and are available from the authors.
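For reference, the Hodges-Lehmann estimator used in these comparisons can be computed directly (a minimal sketch):

```python
import numpy as np

def hodges_lehmann(y_treat, y_control):
    """Median of all pairwise differences between treated and control
    observations (the two-sample Hodges-Lehmann estimator)."""
    return float(np.median(np.subtract.outer(y_treat, y_control)))

rng = np.random.default_rng(3)
y0 = rng.standard_normal(500)
y1 = rng.standard_normal(500) + 1.0  # true shift of 1.0
print(hodges_lehmann(y1, y0))        # close to 1.0
```

Forming all pairwise differences costs O(n_t * n_c) memory, so for the sample sizes in these simulations a subsampled or sorted-merge implementation would be used in practice.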

We present results for three sets of simulations. First, we report results for settings with a wide range of known distributions for the potential outcomes, so we can directly assess the ability of the proposed methods to adapt to different distributions. Second, we report results for simulations based on real data, one set based on housing prices and one based on medical expenditures. Both data sets are examples of relatively heavy tailed distributions as commonly encountered in empirical work.

4.1 Simulations with Known Distributions

We simulate samples of 20,000 observations, exactly half of which are treated, and report summary statistics based on 10,001 simulated samples (using an odd number so that the median is unique). We repeat the simulation study for standardized Normal, Double Exponential (Laplace), and Cauchy distributions for the potential outcomes. The difference in means is the maximum likelihood estimator for Normally distributed data, and so will do well there, but may perform poorly for thicker-tailed distributions such as the Double Exponential distribution and in particular the Cauchy distribution. The difference in medians is the maximum likelihood estimator for the Double Exponential distribution, and relatively robust to thick tails and outliers, and so is expected to perform reasonably well across all specifications, but not as well as the efficient estimators for the Normal.

For the simulations with known distributions we can infer the functional form for the optimal weights for the quantile-based estimator. The optimal weights for the waq estimator are proportional to minus the (estimated) second derivative of the log density. For the Normal distribution, $-(\log f)''(y) = 1$, implying the optimal weights are constant, so that the semiparametrically efficient estimator approximates the simple difference in means (which is the parametric maximum likelihood estimator in this case). The density of the Double Exponential distribution is such that the optimal weights asymptotically place all weight close to the median. The maximum likelihood estimator for the location of a Double Exponential distribution is the sample median. For the standard Cauchy distribution the efficient weights on the difference in quantiles of treated and control distributions are $w(q) = 4\left( 1 - y_q^2 \right) / \left( 1 + y_q^2 \right)^2$ with $y_q = \tan(\pi(q - 1/2))$, shown in Figure 1. Most of the weight is concentrated around the median, with strictly negative weights outside the 0.25 to 0.75 quantile range.

Figure 1: The efficient weight function for the Cauchy distribution. The weights, normalized to have mean 1, are plotted against the quantile (left) and against the value of the observations (right). The figure on the right plots only a central range of values; weights for more extreme quantiles are closer to 0.

The efficient estimators perform well across distributions, and confidence intervals based either on estimates of the analytic variance formulas or on the bootstrap achieve their nominal coverage levels, as shown in Table 1. Their standard deviations are close to the theoretical efficiency bound, as shown in column 4, labelled relative efficiency, where values larger than one imply standard deviations of the estimator in excess of the efficiency bound. The efficient influence function and weighted average quantiles estimators are close to the most efficient estimator for the Normal, Double Exponential, and Cauchy distributions. The last columns show that the confidence intervals are close to their nominal coverage for each distribution. For computational convenience in the simulations, the confidence intervals based on the bootstrap variance use the $m$-out-of-$n$ bootstrap (Bickel et al. (2012)), with $m$ = 2,000 (half treated, half control), to estimate the variance of the estimators. Even with these smaller sample sizes, the density estimates calculated in the bootstrap samples appear to be sufficiently good to yield reasonable confidence intervals for the estimators.

estimator | bias | s.d. | rel. eff. | RMSE | MAD | 95% C.I. coverage | length | boot. var. C.I. coverage | length
Normal Distribution:
diff. in means 0.000 0.014 1.01 0.014 0.010 0.95 0.055 0.95 0.055
diff. in medians -0.000 0.018 1.26 0.018 0.012 0.95 0.069 0.95 0.069
Hodges-Lehmann 0.000 0.015 1.03 0.015 0.010 0.95 0.057 0.95 0.057
adaptive trim 0.000 0.015 1.03 0.015 0.010 0.94 0.055 0.95 0.057
adaptive wins. 0.000 0.014 1.01 0.014 0.010 0.95 0.055 0.95 0.055
eif 0.000 0.014 1.02 0.014 0.010 0.95 0.056 0.95 0.057
waq 0.000 0.014 1.02 0.014 0.010 0.95 0.056 0.95 0.056
Double Exponential Distribution:
diff. in means -0.000 0.020 1.43 0.020 0.014 0.95 0.078 0.95 0.078
diff. in medians 0.000 0.014 1.01 0.014 0.010 0.97 0.061 0.95 0.057
Hodges-Lehmann -0.000 0.016 1.17 0.016 0.011 0.95 0.064 0.95 0.064
adaptive trim 0.000 0.014 1.02 0.014 0.010 0.95 0.059 0.95 0.057
adaptive wins. -0.000 0.018 1.29 0.018 0.012 0.96 0.076 0.95 0.069
eif -0.000 0.015 1.06 0.015 0.010 0.95 0.060 0.96 0.060
waq -0.000 0.015 1.07 0.015 0.010 0.95 0.060 0.96 0.061
Cauchy Distribution:
diff. in means 0.462 127.149 6357.47 127.144 2.047 0.98 9.324 0.98 9.292
diff. in medians 0.000 0.022 1.11 0.022 0.015 0.97 0.093 0.95 0.087
Hodges-Lehmann 0.000 0.026 1.28 0.026 0.018 0.95 0.101 0.95 0.101
adaptive trim 0.000 0.022 1.08 0.022 0.015 0.95 0.092 0.95 0.086
adaptive wins. 0.000 0.024 1.19 0.024 0.016 0.98 0.111 0.97 0.100
eif 0.000 0.020 1.01 0.020 0.014 0.97 0.085 0.97 0.088
waq 0.000 0.021 1.05 0.021 0.014 0.96 0.085 0.98 0.099
Table 1: Summary statistics for simulations with different distributions: Normal, Double Exponential, Cauchy. Statistics shown are based on 10,001 simulated samples. Each sample has 10,000 treated observations and 10,000 control observations. Relative efficiency is the ratio of standard deviation to the square root of the efficiency bound. Columns labeled length show the median length of the confidence intervals.

4.2 Simulations with House Price Data

In the second set of simulations we use house price data from the replication files of Linden and Rockoff (2008, 2019). They obtained property sales data for Mecklenburg County, North Carolina, between January 1994 and December 2004. They dropped sales below $5,000 and above $1,000,000, such that 170,239 observations remain, which we take as our population of interest. Despite the trimming, the distribution is noticeably skewed and heavy-tailed, and even after taking logs the distribution remains heavy-tailed. Figure 2 plots a histogram for house prices, both in levels and in logs, along with the estimated optimal weights (minus the second derivative of the log density) based on all 170,239 observations.

Figure 2: Histogram of house prices, in levels and logs, as well as estimated optimal weights (minus the second derivative of the log density) based on all 170,239 observations. The weights are normalized to be mean 1 (black horizontal line). Some weights are below 0 (blue horizontal line). Vertical lines indicate selected quantiles.

We base simulations on this data by drawing samples of size 20,000, and randomly assigning exactly half of each sample to the treatment group and the remaining half to the control group, with a zero treatment effect. Each sample is drawn without replacement, but observations may appear in multiple samples. Because the true distribution, however, is not available analytically, we cannot calculate the efficiency bound directly. Instead, we estimate the efficiency bound using density estimates based on all observations. We compute the same estimators as in the simulations of the previous section, with a small adjustment to the adaptively trimmed and winsorized means. In settings with positively-valued outcomes, researchers are typically more concerned about the effect of heavy right, rather than left, tails on the variance of estimators. We therefore fix the trimming and winsorizing percentiles on the left to 0% (no trimming), and only adaptively choose the threshold on the right. Without such an adjustment, the estimators tend to trim (too) little in order to avoid losing observations from the left tail, which are unproblematic for the precision of the estimator.

Table 2 summarizes the simulation results based on 10,001 simulated samples. When the house prices are in levels, the standard deviation of the difference in means estimator is twice as large as that of the efficient estimators. For the difference in medians and the Hodges-Lehmann estimators, which are less affected by outliers in the data, the standard deviation is larger by approximately 30% and 20%, respectively. Confidence intervals, based on estimated variances and asymptotic normal approximations, have close to nominal coverage throughout, and are meaningfully shorter for the efficient estimators we propose.
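The two-sample Hodges-Lehmann estimator compared here is commonly computed as the median of all pairwise treated-minus-control differences. A naive sketch (quadratic in memory, so for the 20,000-unit samples above a streaming or subsampled variant would be needed in practice):

```python
import numpy as np

def hodges_lehmann(treated, control):
    """Two-sample Hodges-Lehmann estimate: the median of all pairwise
    differences Y_treated - Y_control. O(n*m) memory; a sketch only."""
    diffs = np.subtract.outer(np.asarray(treated, dtype=float),
                              np.asarray(control, dtype=float))
    return np.median(diffs)
```

Because it is a median of differences rather than a difference of means, a single extreme outcome moves the estimate far less than it moves the difference in means, which is why this estimator fares comparatively well under heavy tails.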

We also estimate a proportional (multiplicative) treatment effect model and translate the estimated coefficients into level effects. Under the multiplicative model, $Y_i(1) = e^{\delta} Y_i(0)$. When outcomes are strictly positive, this is identical to an additive model for log outcomes: $\log Y_i(1) = \log Y_i(0) + \delta$. Given an estimate $\hat{\delta}$ of the additive model with the outcome in logs, the estimated level effect for a unit with control outcome $Y_i(0)$ is then $(e^{\hat{\delta}} - 1) Y_i(0)$.

For estimates of the in-sample treatment effect, we therefore apply estimates to a population of interest with known means $\mu_1$ and $\mu_0$ and fixed treatment probability $p$ as
$$\hat{\tau} = p\,\mu_1 (1 - e^{-\hat{\delta}}) + (1 - p)\,\mu_0 (e^{\hat{\delta}} - 1).$$

Using the Delta method, if $V$ is the asymptotic variance of $\hat{\delta}$, then the asymptotic variance of $\hat{\tau}$, holding $\mu_1$, $\mu_0$, and $p$ fixed, is
$$\big(p\,\mu_1 e^{-\delta} + (1 - p)\,\mu_0 e^{\delta}\big)^2 V,$$

which we estimate by replacing $\delta$ with $\hat{\delta}$ and $V$ by the estimate of the variance, $\hat{V}$. For the purpose of these simulations, we set $\mu_1$ and $\mu_0$ equal to the population mean of house prices, and $p = 1/2$.
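The Delta-method step can be sketched generically: for a mapping g from the log-scale coefficient to a level effect, the variance of g of the estimate is approximately the squared derivative of g times the variance of the estimate. The particular g below, with a single fixed mean mu, is a simplified placeholder assumption, not the paper's exact mapping:

```python
import math

def delta_method_var(g_prime, delta_hat, v_hat):
    """Delta-method asymptotic variance of g(delta_hat), given the
    asymptotic variance v_hat of delta_hat: (g'(delta_hat))**2 * v_hat."""
    return g_prime(delta_hat) ** 2 * v_hat

# Hypothetical mapping g(d) = mu * (exp(d) - 1), so g'(d) = mu * exp(d):
mu = 100.0
var_level = delta_method_var(lambda d: mu * math.exp(d), 0.05, 1e-4)
```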

When treatment effects are assumed to be proportional to potential outcomes, the proposed estimators for this multiplicative model are still more efficient than alternative estimators, but the gains are smaller. The middle panel of Table 2 shows the quality of estimates of the multiplicative parameter obtained by transforming outcomes into logs. As can be seen in panel (b) of Figure 2, the distribution of log house prices appears closer to the normal distribution, with fewer “outliers.” Consequently, the difference in means estimator, which is efficient for the normal distribution, comes noticeably closer to the efficiency bound than when outcomes are in levels. Nevertheless, the efficient estimators further reduce the variance. The bottom panel of Table 2 shows the same summary statistics when the multiplicative parameter is translated back into a level effect. The efficient estimators of the multiplicative parameter lead to treatment effect estimates with smaller variance and shorter confidence intervals than the difference in means, irrespective of whether the latter is estimated in levels (top panel) or in logs and then translated into levels (bottom panel).

                                                             95% C.I.         boot. var. C.I.
estimator          bias     s.d.  rel. eff.   RMSE    MAD   cover.  length   cover.  length

effect in levels based on additive model in levels
diff. in means      -28     1871      1.92    1871   1251    0.95    7343     0.95    7339
diff. in medians      8     1259      1.30    1259    855    0.95    4989     0.95    4893
Hodges-Lehmann       -5     1138      1.17    1138    757    0.95    4487     0.95    4487
adaptive trim         5      989      1.02     989    651    0.98    4560     0.95    3914
adaptive wins.        4     1114      1.15    1114    738    0.99    6378     0.97    4760
eif                  13      941      0.97     941    628    0.95    3819     0.95    3768
waq                  13      946      0.97     946    626    0.95    3819     0.95    3859

multiplicative parameter: additive model in logs
diff. in means   -0.0000   0.0083     1.35  0.0083  0.0055   0.95   0.033     0.95   0.032
diff. in medians  0.0000   0.0075     1.23  0.0075  0.0051   0.95   0.030     0.95   0.029
Hodges-Lehmann   -0.0000   0.0072     1.17  0.0072  0.0048   0.95   0.028     0.95   0.028
adaptive trim    -0.0000   0.0083     1.35  0.0083  0.0055   0.95   0.033     0.95   0.032
adaptive wins.   -0.0000   0.0083     1.35  0.0083  0.0055   0.95   0.033     0.95   0.032
eif              -0.0000   0.0067     1.08  0.0067  0.0045   0.95   0.026     0.95   0.026
waq              -0.0000   0.0067     1.08  0.0067  0.0044   0.95   0.026     0.95   0.027

effect in levels based on additive model in logs
diff. in means       -7     1695      1.35    1695   1125    0.95    6658     0.95    6657
diff. in medians      9     1545      1.23    1545   1048    0.95    6128     0.95    6001
Hodges-Lehmann       -6     1468      1.17    1468    976    0.95    5786     0.95    5783
adaptive trim        -7     1695      1.35    1695   1125    0.95    6658     0.95    6657
adaptive wins.       -7     1695      1.35    1695   1125    0.95    6658     0.95    6657
eif                  -5     1363      1.08    1363    914    0.95    5374     0.95    5415
waq                  -3     1364      1.08    1364    910    0.95    5374     0.95    5462
Table 2: Summary statistics for simulations based on house price data from Linden and Rockoff (2008). Statistics shown are based on 10,001 simulated samples. Each sample has 10,000 treated observations and 10,000 control observations. Relative efficiency is the ratio of standard deviation to the square root of the efficiency bound, which we calculate from density estimates based on the full sample. Columns labeled length show the median length of the confidence intervals. The adaptively trimmed mean and adaptively winsorized mean only apply trimming at the top, but not at the bottom.

4.3 Medical Expenditures Data

Next, we present simulation results based on medical expenditure data from the IBM MarketScan Research Database, following the sample construction of Koenecke et al. (2021, Figure 2). We restrict the sample to males, age 45–64, with a pneumonia inpatient diagnosis and at least one year of continuous medical enrollment. For each patient, we consider the first inpatient admission only, to abstract away from any dynamics. We focus on medical expenditure as the outcome variable. For each patient, we sum the payments recorded by MarketScan for this admission. In total, we use data on 103,662 admissions. (The sample size differs slightly from that reported by Koenecke et al. (2021) due to missing expenditure data for a small number of admissions.) Figure 3 plots a histogram for medical expenditure in levels and in logs along with the estimated optimal weights (second derivative of the log density). We also observe a treatment variable in this data set, the (prior) use of alpha blockers, which Koenecke et al. (2021) find may improve health outcomes during respiratory distress by preventing hyperinflammation.

Figure 3: Histogram of medical expenditures per admission, in levels and logs, as well as estimated optimal weights (second derivative of the log density), based on the 98,155 observations in the control group. For the figure in levels, vertical lines indicate selected quantiles; the figure is limited to expenditures below $200,000, so that the highest quantiles do not appear. For the figure in logs, vertical lines indicate selected quantiles. The weights are normalized to be mean 1 (black horizontal line). Some weights are below 0 (blue horizontal line).

We design a simulation study similar to those of the previous sections, treating the receipt of alpha blockers as randomly assigned in our population. This allows us to study (coverage) properties of the estimators and inference procedures in settings where the parametric treatment effect model is not (necessarily) correctly specified. For these simulations, each sample is a draw, without replacement, of 200 of the 5,507 observations in the treatment group and 3,565 of the 98,155 observations in the control group. While the control group is smaller than in our other simulations, it remains sufficiently large to estimate the density and its derivatives required for our estimators. In this simulation design, the control outcome distribution is the empirical distribution of the full-sample control group, and the treated outcome distribution is the empirical distribution of the full-sample treatment group. Although it is not necessarily correct, the additive treatment effect model may offer a reasonable approximation, and we are interested in the performance of inference methods when the conditions for our theoretical results are not (quite) met in this particular application.

The simulation results reported in Table 3 suggest that the proposed estimators perform reasonably well in this setting despite mis-specification. The bias, RMSE, MAD, and coverage are computed relative to the population difference in means. The third panel of the table translates the multiplicative parameter into level effects as in the previous section for the simulations based on house prices. Here, we take the two group means, 38,745 and 34,872, as well as the treatment probability, as in the distribution these simulations draw from. As for the house price data, the adaptively trimmed mean and adaptively winsorized mean are asymmetric, only trimming the right tail of the data.
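The asymmetric winsorizing referenced here can be sketched as follows. The paper chooses the upper threshold adaptively; this minimal version takes a fixed upper percentile as input and leaves the left tail untouched:

```python
import numpy as np

def winsorized_mean_right(y, upper_pct):
    """Mean after winsorizing only the right tail at the upper_pct
    percentile; the left tail is not modified. A sketch: the adaptive
    threshold choice used in the paper is not reproduced here."""
    cap = np.percentile(y, upper_pct)
    return np.minimum(y, cap).mean()
```

With `upper_pct=100` no observation is capped and the ordinary mean is recovered, so lowering the percentile interpolates between the mean and a heavily right-winsorized mean.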

                                                   95% C.I.         boot. var. C.I.
estimator          bias     s.d.   RMSE    MAD    cover.  length   cover.  length

effect in levels based on additive model in levels
(error and coverage relative to population difference in means)
diff. in means     -108     4567   4566   3016     0.93   16935     0.93   16888
diff. in medians   3345     1271   3578   3222     0.46    6161     0.28    5259
Hodges-Lehmann     3490      891   3602   3464     0.03    3664     0.01    3534
adaptive trim      3877      641   3929   3860     0.00    3052     0.00    2778
adaptive wins.     2816     1384   3138   2786     0.52    5668     0.57    6023
eif                3928      682   3987   3922     0.01    4878     0.00    3240
waq                3846      792   3927   3848     0.03    4878     0.01    4018

multiplicative parameter: additive model in logs
(error and coverage relative to population difference in means)
diff. in means   -0.001    0.077  0.077  0.052     0.95    0.30     0.95    0.30
diff. in medians  0.005    0.080  0.080  0.051     0.97    0.33     0.95    0.32
Hodges-Lehmann    0.008    0.071  0.071  0.048     0.95    0.29     0.95    0.28
adaptive trim     0.016    0.074  0.076  0.053     0.95    0.29     0.95    0.29
adaptive wins.   -0.002    0.078  0.078  0.052     0.94    0.30     0.95    0.31
eif               0.029    0.066  0.072  0.049     0.95    0.27     0.94    0.26
waq               0.028    0.066  0.071  0.047     0.95    0.27     0.95    0.27

effect in levels based on additive model in logs
(error and coverage relative to population difference in means)
diff. in means     2423     2871   3756   2522     0.90   11167     0.91   11138
diff. in medians   2647     3009   4006   2499     0.92   12318     0.92   11951
Hodges-Lehmann     2747     2673   3832   2672     0.88   10730     0.87   10390
adaptive trim      3062     2809   4154   2983     0.86   11055     0.85   10850
adaptive wins.     2365     2915   3752   2544     0.90   11084     0.92   11389
eif                3532     2525   4341   3481     0.78   10452     0.76   10014
waq                3464     2523   4285   3413     0.79   10441     0.77   10147
Table 3: Summary statistics for simulations based on the medical expenditures data, across 1,000 simulated samples. Each sample has 200 treated observations and 3,565 control observations. Coverage and median length refer to 95% confidence intervals.

5 Conclusion

In many modern settings where randomized experiments are used to estimate treatment effects, the presence of heavy-tailed outcome distributions can lead to large standard errors. Researchers often address this by winsorizing with ad hoc thresholds. Here we develop systematic methods for obtaining more precise inferences using parametric models for the treatment effects, while avoiding the specification of models for the full outcome distributions. We present results for semiparametric efficiency bounds, propose efficient estimators, and show in simulations that these methods can be effective in realistic settings.


  • Batir (2017) Batir, N. (2017). Bounds for the gamma function. Results in Mathematics 72, 865–874.
  • Bickel (1967) Bickel, P. J. (1967). Some contributions to the theory of order statistics. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 575–591. University of California Press.
  • Bickel (1982) Bickel, P. J. (1982). On adaptive estimation. The Annals of Statistics 10(3), 647–671.
  • Bickel and Doksum (2015) Bickel, P. J. and K. A. Doksum (2015). Mathematical statistics: basic ideas and selected topics, Volume 2. CRC Press.
  • Bickel et al. (2012) Bickel, P. J., F. Götze, and W. R. van Zwet (2012). Resampling fewer than n observations: gains, losses, and remedies for losses. In Selected works of Willem van Zwet, pp. 267–297. Springer.
  • Bickel et al. (1993) Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and adaptive estimation for semiparametric models, Volume 4. Johns Hopkins University Press Baltimore.
  • Bickel and Lehmann (1975a) Bickel, P. J. and E. L. Lehmann (1975a). Descriptive statistics for nonparametric models I. Introduction. The Annals of Statistics 3(5), 1038–1044.
  • Bickel and Lehmann (1975b) Bickel, P. J. and E. L. Lehmann (1975b). Descriptive statistics for nonparametric models II. Location. The Annals of Statistics 3(5), 1045–1069.
  • Bickel and Lehmann (1976) Bickel, P. J. and E. L. Lehmann (1976). Descriptive statistics for nonparametric models III. Dispersion. The Annals of Statistics 4(6), 1139–1158.
  • Bickel and Lehmann (2012) Bickel, P. J. and E. L. Lehmann (2012). Descriptive statistics for nonparametric models IV. Spread. In Selected Works of E. L. Lehmann, pp. 519–526. Springer.
  • Chernoff et al. (1967) Chernoff, H., J. L. Gastwirth, and M. V. Johns (1967). Asymptotic distribution of linear combinations of functions of order statistics with applications to estimation. The Annals of Mathematical Statistics 38(1), 52–72.
  • Csorgo and Revesz (1978) Csorgo, M. and P. Revesz (1978). Strong approximations of the quantile process. The Annals of Statistics 6(4), 882–894.
  • Cuzick (1992a) Cuzick, J. (1992a). Efficient estimates in semiparametric additive regression models with unknown error distribution. The Annals of Statistics 20(2), 1129–1136.
  • Cuzick (1992b) Cuzick, J. (1992b). Semiparametric additive regression. Journal of the Royal Statistical Society. Series B (Methodological) 54(4), 831–843.
  • Doksum (1974) Doksum, K. (1974). Empirical probability plots and statistical inference for nonlinear models in the two-sample case. The Annals of Statistics 2(2), 267–277.
  • Doksum and Sievers (1976) Doksum, K. A. and G. L. Sievers (1976). Plotting with confidence: Graphical comparisons of two populations. Biometrika 63(3), 421–434.
  • Fisher (1937) Fisher, R. A. (1937). The design of experiments. Oliver And Boyd; Edinburgh; London.
  • Govindarajulu et al. (1967) Govindarajulu, Z., L. Le Cam, and M. Raghavachari (1967). Generalizations of theorems of Chernoff and Savage on the asymptotic normality of test statistics. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 609–638. University of California Press.
  • Gupta et al. (2019) Gupta, S., R. Kohavi, D. Tang, Y. Xu, R. Andersen, E. Bakshy, N. Cardin, S. Chandran, N. Chen, D. Coey, et al. (2019). Top challenges from the first practical online controlled experiments summit. ACM SIGKDD Explorations Newsletter 21(1), 20–35.
  • Hampel et al. (2011) Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (2011). Robust statistics: the approach based on influence functions, Volume 196. John Wiley & Sons.
  • Hodges Jr and Lehmann (1963) Hodges Jr, J. L. and E. L. Lehmann (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics 34(2), 598–611.
  • Huber (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
  • Huber (2011) Huber, P. J. (2011). Robust statistics. Springer.
  • Imbens and Rubin (2015) Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
  • Jaeckel (1971a) Jaeckel, L. A. (1971a). Robust estimates of location: Symmetry and asymmetric contamination. The Annals of Mathematical Statistics 42(3), 1020–1034.
  • Jaeckel (1971b) Jaeckel, L. A. (1971b). Some flexible estimates of location. The Annals of Mathematical Statistics 42(5), 1540–1552.
  • Klaassen (1987) Klaassen, C. A. (1987). Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics 15(4), 1548–1562.
  • Koenecke et al. (2021) Koenecke, A., M. Powell, R. Xiong, Z. Shen, N. Fischer, S. Huq, A. M. Khalafallah, M. Trevisan, P. Sparen, J. J. Carrero, A. Nishimura, B. Caffo, E. A. Stuart, R. Bai, V. Staedtke, D. L. Thomas, N. Papadopoulos, K. W. Kinzler, B. Vogelstein, S. Zhou, C. Bettegowda, M. F. Konig, B. D. Mensh, J. T. Vogelstein, and S. Athey (2021). Alpha-1 adrenergic receptor antagonists to prevent hyperinflammation and death from lower respiratory tract infection. eLife 10, e61700.
  • Kohavi et al. (2020) Kohavi, R., D. Tang, and Y. Xu (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  • Le Cam and Yang (1988) Le Cam, L. and G. L. Yang (1988). On the preservation of local asymptotic normality under information loss. The Annals of Statistics 16(2), 483–520.
  • Lehmann and D’Abrera (1975) Lehmann, E. L. and H. J. D’Abrera (1975). Nonparametrics: statistical methods based on ranks. Holden-day.
  • Lewis and Rao (2015) Lewis, R. A. and J. M. Rao (2015). The unfavorable economics of measuring the returns to advertising. The Quarterly Journal of Economics 130(4), 1941–1973.
  • Linden and Rockoff (2008) Linden, L. and J. E. Rockoff (2008). Estimates of the impact of crime risk on property values from Megan's Laws. American Economic Review 98(3), 1103–27.
  • Linden and Rockoff (2019) Linden, L. and J. E. Rockoff (2019). Replication data for: Estimates of the impact of crime risk on property values from Megan's Laws. Inter-university Consortium for Political and Social Research (ICPSR).
  • Neyman (1990) Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
  • Rosenbaum (1993) Rosenbaum, P. R. (1993). Hodges-Lehmann point estimates of treatment effect in observational studies. Journal of the American Statistical Association 88(424), 1250–1253.
  • Rosenbaum (2002) Rosenbaum, P. R. (2002). Observational Studies. Springer.
  • Rubin (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688.
  • Stigler (1974) Stigler, S. M. (1974). Linear functions of order statistics with smooth weight functions. The Annals of Statistics 2(4), 676–693.
  • Stone (1975) Stone, C. J. (1975). Adaptive maximum likelihood estimators of a location parameter. The Annals of Statistics 3(2), 267–284.
  • Wu and Hamada (2011) Wu, C. J. and M. S. Hamada (2011). Experiments: planning, analysis, and optimization, Volume 552. John Wiley & Sons.
  • Yu and Yao (2017) Yu, C. and W. Yao (2017). Robust linear regression: A review and comparison. Communications in Statistics-Simulation and Computation 46(8), 6261–6282.

6 A general theorem on linear combinations of order statistics

Let F be a distribution function, twice differentiable, with density f = F′. Let J be a weight function satisfying the conditions below. The C-R condition was first proposed by Csorgo and Revesz (1978).


There exists and such that

  1. if or , and can be replaced by another constant if .

  2. for and for .


There exists such that

Theorem A.1.

Let X_1, …, X_n be i.i.d. from F, and let F_n and F_n^{-1} denote the empirical distribution function and empirical quantile function, respectively. Assume that the C-R condition and the above W condition hold, and let



which converges in distribution to if further , where