## 1 Introduction

Randomized^{†}^{†}
Significance Statement.
As datasets get larger and more complex, there is a growing interest in using machine learning methods to enhance scientific analysis. In many settings, considerable work is required to make standard machine learning methods useful for specific scientific applications. We find, however, that in the case of treatment effect estimation with randomized experiments, regression adjustments via machine learning methods designed to minimize test set error directly induce efficient estimates of the average treatment effect. Thus, machine learning methods can be used out-of-the-box for this task, without any special-case adjustments.

controlled trials are often considered the gold standard for estimating the effect of an intervention, as they allow for simple model-free inference about the average treatment effect on the sampled population. Under mild conditions, the mean observed outcome in the treated sample minus the mean observed outcome in the control sample is a consistent and unbiased estimator for the population average treatment effect.

However, the fact that model-free inference is possible in randomized controlled trials does not mean that it is always optimal: as argued by Fisher [1]

, if we have access to auxiliary features that are related to our outcome of interest via a linear model, then controlling for these features using ordinary least squares will reduce the variance of the estimated average treatment effect without inducing any bias. This line of research has been thoroughly explored: under low-dimensional asymptotics where the problem specification remains fixed while the number of samples grows to infinity, it is now well-established that regression adjustments are always asymptotically helpful—even in misspecified models—provided we add full treatment-by-covariate interactions to the regression design and use robust standard errors

[2, 3, 4, 5, 6, 7, 8, 9, 10].The characteristics of high-dimensional regression adjustments are less well understood. In a recent advance, Bloniarz et al. [11] show that regression adjustments are at least sometimes helpful in high dimensions: given an “ultra-sparsity” assumption from the high-dimensional inference literature, they establish that regression adjustments using the lasso [12, 13] are more efficient than model-free inference. This result, however, leaves a substantial gap between the low-dimensional regime—where regression adjustments are always asymptotically helpful—and the high-dimensional regime where we only have special-case results.

In this paper, we show that high-dimensional regression adjustments to randomized controlled trials work under much greater generality than previously known. We find that any regression adjustment with a free intercept yields unbiased estimates of the treatment effect. This result is agnostic as to whether the regression model was obtained using the lasso, the elastic net [14]

, subset selection, or any other method that satisfies this criterion. We also propose a simple procedure for building practical confidence intervals for the average treatment effect.

Furthermore, we show that the precision of the treatment effect estimates obtained by such regression adjustments depends only on the prediction risk of the fitted regression adjustment. In particular, any risk-consistent regression adjustment can be made to yield efficient estimates of the average treatment effect in the sense of [15, 16, 17, 18]. Thus, when choosing which regression adjustment to use, practitioners are justified in using standard model selection tools that aim to control prediction error, e.g., Mallow’s Cp or cross-validation.

This finding presents a striking contrast to the theory of high-dimensional regression adjustments in observational studies. In a setting where treatment propensity may depend on covariates, simply fitting low-risk regression models to the treatment and control samples via cross-validation is not advised, as there exist regression adjustments that have low predictive error but yield severely biased estimates of the average treatment effect [19, 20, 21, 22]. Instead, special-case procedures are needed: For example, Belloni et al. [21] advocate a form of augmented model selection that protects against bias at the cost of worsening the predictive performance of the regression model. The tasks of fitting good high-dimensional regression adjustments to randomized versus observational data thus present qualitatively different challenges.

The first half of this paper develops a theory of regularized regression adjustments with high-dimensional Gaussian designs. This analysis enables us to highlight the connection between the predictive accuracy of the regression adjustment and the precision of the resulting treatment effect estimate, and also to considerably improve on theoretical guarantees available in prior work. In the second half of the paper, we build on these insights to develop cross-estimation, a practical method for inference about average treatment effects that can be paired with either high-dimensional regularized regression or non-parametric machine learning methods.

## 2 Setting and notation

We frame our analysis in terms of the Neyman–Rubin potential outcomes model [23, 24]. Given i.i.d. observations , , we posit potential outcomes and ; then, the outcome that we the actually observe is . We focus on randomized controlled trials, where is independent of all pre-treatment characteristics,

(1) |

We take the predictors to be generated as and assume a homoskedastic linear model in each arm,

(2) |

for , where is mean-zero noise with variance ; more general models will be considered later. We use the notation and . We study inference about the average treatment effect . In our analysis, it is sometimes also convenient to study estimation of the conditional average treatment effect.

(3) |

As discussed by [17], good estimators for are generally good estimators for , and vice-versa. In the homogeneous treatment effects model , and coincide.

## 3 Regression adjustments with Gaussian designs

Suppose that we have obtained parameter estimates , , for the linear model (2) via the lasso, the elastic net, or any other method. We then get a natural estimator for the average treatment effect:

(4) |

In the case where is the ordinary least squares estimator for , the behavior of this estimator has been carefully studied by [10, 8]. Our goal is to characterize its behavior for generic regression adjustments , all while allowing the number of predictors to be much larger than the sample size .

The only assumption that we make on the estimation scheme is that it be centered: for ,

(5) |

i.e., that the mean of the predicted outcomes matches that of the observed outcomes; and is translation invariant and only depends on

(6) |

Here, and denote the mean of the outcomes and features over all observations with . Algorithmically, a simple way to enforce this constraint is to first center the training samples , , run any regression method on this centered data, and then set the intercept using (5); this is done by default in standard software for regularized regression, such as glmnet [25]. We also note that ordinary least squares regression is always centered in this sense, even after common forms of model selection.

Now, if our regression adjustment has a well-calibrated intercept as in (5), then we can write (4) as

(7) |

To move forward, we focus on the case where the data-generating model for is Gaussian, i.e., for some and positive-semidefinite matrix , and . For our purpose, the key fact about Gaussian data is that the mean of independent samples is independent of the within-sample spread, i.e.,

(8) |

conditionally on the treatment assignments . Thus, because only depends on the centered data and , we can derive a simple expression for the distribution of . The following is an exact finite sample result, and holds no matter how large is relative to ; a key observation is that is mean-zero by randomization of the treatment assignment, for .

###### Proposition 1.

Suppose that our regression scheme for and is centered, and that our data-generating model is Gaussian as above. Then, writing for ,

(9) |

If the errors in and are roughly orthogonal, then

(10) |

and, in any case, twice the right-hand side is always an upper bound for the left-hand side. Thus, the distribution of effectively depends on the regression adjustments only through the excess predictive error

where the above expectation is taken over a test set example . This implies that, in the setting of Proposition 1, the main practical concern in choosing which regression adjustment to use is to ensure that has low predictive error.

The above result is conceptually related to recent work by Berk et al. [3] (see also [26]), who showed that the accuracy of low-dimensional covariate adjustments using ordinary least-squares regression depends on the mean-squared error of the regression fit; they also advocate using this connection to provide simple asymptotic inference about . Here, we showed that a similar result holds for any regression adjustment on Gaussian designs, even in high dimensions; and in the second half of the paper we will discuss how to move beyond the Gaussian case.

### 3.1 Risk consistency and the lasso

As stated, Proposition 1 provides the distribution of conditionally on , and so is not directly comparable to related results in the literature. However, whenever is risk consistent in the sense that

(11) |

for , we can asymptotically omit the conditioning.

###### Theorem 2.

In the case of the lasso, Theorem 2 lets us substantially improve over the best existing guarantees in the literature [11]. The lasso estimates as the minimizer over of

(13) |

for some penalty parameter . Typically, the lasso is used when we believe a sparse regression adjustment to be appropriate. In our setting, it is well known that the lasso satisfies , provided the penalty parameter is well chosen and does not allow for too much correlation between features [27, 28].

Thus, whenever we have a sequence of problems as in Theorem 2 where is -sparse, i.e., has at most non-zero entries, and , we find that is efficient in the sense of (12). Note that this result is much stronger than the related result of [11]

, which shows that lasso regression adjustments yield efficient estimators

in an ultra-sparse regime with .To illustrate the difference between these two results, it is well known that if

, then it is possible to do efficient inference about the coefficients of the underlying parameter vector

[29, 30, 31], and so the result of [11] is roughly in line with the rest of the literature on high-dimensional inference. Conversely, if we only have , accurate inference about the coefficients of is in general impossible without further conditions on the covariance of [32, 33]. Yet we have shown that we can still carry out efficient inference about . In other words, the special structure present in randomized trials means that much more is possible than in the generic high-dimensional regression setting.### 3.2 Inconsistent regression adjustments

Even if our regression adjustment is not risk consistent, we can still use Proposition 1 to derive unconditional results about whenever

(14) |

We illustrate this phenomenon in the case of ridge regression, where regression adjustments generally reduce—but do not eliminate—excess test-set risk. Recall that ridge regression estimates

as the minimizer over of(15) |

The following result relies on random-matrix theoretic tools for analyzing the predictive risk of ridge regression

[34].###### Theorem 3.

Suppose we have a sequence of problems in the setting of Proposition 1 with and , such that the spectrum of the covariance has a weak limit. Following [34], suppose moreover that the true parameters and are independently and randomly drawn from a random effects model with

(16) |

Then, selecting in (4) via ridge regression tuned to minimize prediction error, and with , we get ,

(17) |

where the are the companion Stieltjes transforms of the limiting empirical spectral distributions for the treated and control samples, as defined in the proof.

To interpret the above result, we note that the quantity can also be induced via the limit [35, 36]

Finally, we note that the limiting variance of obtained via ridge regression above is strictly smaller than the corresponding variance of the unadjusted estimator, which converges to ; this is because optimally-tuned ridge regression strictly improves over the “null” model in terms of its predictive accuracy.

## 4 Practical inference with cross-estimation

In the previous section, we found that—given Gaussianity assumptions—generic regression adjustments yield unbiased estimates of the average treatment effect, and also that low-risk regression adjustments lead to high-precision estimators. Here, we seek to build on this insight, and to develop simple inferential procedures about and that attain the above efficiency guarantees, all while remaining robust to deviations from Gaussianity or homoskedasticity.

Our approach is built around *cross-estimation*, a procedure
inspired by data splitting and the work of [37, 38].
We first split our data into equally-sized
folds (e.g., ) and then, for each fold , we compute

(18) |

Here, ,

, etc. are moments taken over the

-th fold, while and are centered regression estimators computed over the other folds. We then obtain an aggregate estimate , where is the number of observations in the -th fold. An advantage of this construction is that an analogue to the relation (8) now automatically holds, and thus our treatment effect estimator is unbiased without assumptions. Note that the result below references both the average treatment effect and the conditional average treatment effect .###### Theorem 4.

Suppose that we have independent and identically distributed samples satisfying (1), drawn from a linear model (2) where has finite first moments and the conditional variance of given may vary. Then, . If, moreover, the are all risk-consistent in the sense of (11) for , and both the signals residuals are asymptotically Gaussian when averaged, then writing , we have

(19) |

In the homoskedatic case, i.e., when the variance of conditionally on does not depend on , then the above is efficient. With heteroskedasticity, the above is no longer efficient because we are in a linear setting and so inverse-variance weighting could improve precision; however, (19) can still be used as the basis for valid inference about .

### 4.1 Confidence intervals via cross-estimation

Another advantage of cross-estimation is that it allows for moment-based variance estimates for . Here, we discuss practical methods for building confidence intervals that cover the average treatment effect . We can verify that the variance of is after conditioning on the and , with

(20) |

Now, the above moments correspond to observable quantities on the -th data fold, so we immediately obtain a moment-based plug-in estimator for . Finally, we build -level confidence intervals for as

(21) |

where

is the appropriate standard Gaussian quantile. In the setting of Theorem

4, i.e., with risk consistency and bounded second moments, we can verify that the are asymptotically uncorrelated and so the above confidence intervals are asymptotically exact.### 4.2 Cross-validated cross-estimation

High-dimensional regression adjustments usually rely on a tuning parameter that controls the amount of regularization, e.g., the parameter for the lasso and ridge regression. Although theory provides some guidance on how to select , practitioners often prefer to use computationally-intensive methods such as cross-validation.

Now, our procedure in principle already allows for cross-validation: if we estimate in (18) via any cross-validated regression adjustment that only relies on all but the -th data folds, then will be unbiased for . However, this requires running the full cross-validated algorithm times, which can be very expensive computationally.

Here, we show how to obtain good estimates using only a single round of cross-validation. First, we specify regression folds, and for each and we compute and as the mean of all observations in the -th fold with . Next, we center the data such that and for all observations in the -th fold. Finally, we estimate by running a standard out-of-the-box cross-validated algorithm (e.g., cv.glmnet for R) on the -triples with the same folds as specified before, and then use (18) to compute .

The actual estimator that we use to estimate and in our experiments is inspired by the procedure of Imai and Ratkovic [39]. Our goal is to let the lasso learn shared “main effects” for the treatment and control groups. To accomplish this, we first run a -dimensional lasso problem,

(22) |

and then set and . We simultaneously tune and estimate by cross-validated cross-estimation as discussed above. When all our data is Gaussian, this procedure is exactly unbiased by the same argument as used in Proposition 1; and even when is not Gaussian, it appears to work well in our experiments.

## 5 Non-parametric machine learning methods

In our discussion so far, we have focused on treatment effect estimation using high-dimensional, linear regression adjustments, and showed how to provide unbiased inference about

under general conditions. Here, we show how to extend our results about cross-estimation to general non-parametric regression adjustments obtained using, e.g., neural networks or random forests [40]. We assume a setting wherefor some unknown regression functions , and
our goal is to leverage estimates obtained
using any machine learning method to improve the precision
of , as follows:^{1}^{1}1We
note that (23) only depends on
and
implicitly through
.
It may thus also be interesting to estimate directly using, e.g., the “tyranny of the
minority” scheme of Lin [8].^{,}^{2}^{2}2A related
estimator is studied by Rothe [41] in the context of classical non-parametric regression
adjustments, e.g., local regression, for observational studies with known treatment propensities.

(23) | ||||

where is any estimator that does not depend on the -th training example; for random forests, we set to be the “out-of-bag” prediction at . To motivate (23), we start from (7), and expand out terms using the relation

where , etc. The remaining differences between (23) and (7) are due to the use of out-of-bag estimation to preserve randomization of the treatment assignment conditionally on the corresponding regression adjustment. We estimate the variance of using the formula

The following result characterizes the behavior of this estimator, under the assumption that the estimator is “jackknife-compatible,” meaning that the expected jackknife estimate of variance for converges to 0. We define this condition in the proof, and verify that it holds for random forests.

###### Theorem 5.

Suppose that is jackknife-compatible.
Then, the estimator (23) is asymptotically unbiased,
. Moreover, if the regression
adjustments are risk-consistent in the sense that^{3}^{3}3With
random forests, [42] provide such a risk-consistency result.
,
and the potential outcomes have finite second moments,
then is efficient and
is asymptotically standard Gaussian.

We note that there has been considerable recent interest in using machine learning methods to estimate heterogeneous treatment effects [43, 44, 45, 46]. In relation to this literature, our present goal is more modest: we simply seek to use machine learning to reduce the variance of treatment effect estimates in randomized experiments. This is why we obtain more general results than the papers on treatment heterogeneity.

## 6 Experiments

In our experiments, we focus on two specific variants of treatment effect estimation via cross-estimation. For high-dimensional linear estimation, we use the lasso-based method (22) tuned by cross-validated cross-estimation. For non-parametric estimation, we use (23) with random forest adjustments. We implement our method as an open-source R-package, crossEstimation, built on top of glmnet [25] and randomForest [47] for R. The supporting information has additional simulation results.

### 6.1 Simulations

We begin by validating our method in a simple simulation setting with , where . In all simulations, we set the the features to be Gaussian with auto-regressive AR- covariance. We compare our lasso-based cross-estimation with both the simple difference-in-means estimate , and the proposal of Bloniarz et al. [11] that uses lasso regression adjustments tuned by cross-validation. Our method differs from that of Bloniarz et al. in that we use a different algorithm for confidence intervals, and also that we use the joint lasso algorithm (22) instead of computing separate lassos in both treatment arms.

Figures 1 and 2 display results for different choices of , , etc., while varying . In both cases, we see that the confidence intervals produced by our cross-estimation algorithm and the method of Bloniarz et al. are substantially shorter than those produced by the difference in means estimator. Moreover, our confidence intervals accurately represent the variance of our estimator (compare solid and dashed-dotted lines in the left panels), and achieve nominal coverage at both the 95% and 99% levels. Conversely, especially in small samples, the method of Bloniarz et al. underestimates the variance of the method, and does not achieve target coverage.

### 6.2 Understanding attitudes towards welfare

We also consider an experimental dataset collected as
a part of the General Social Survey.^{4}^{4}4Subjects were either asked
whether we, as a society, spend too much money on “welfare” or
on “assistance to the poor.” The questions were randomly assigned
and the treatment effect corresponds to the change in the proportion of
people who answer “yes” to the question. This dataset is discussed
in detail in [44];
we pre-process the data as in [48].
The dataset is large ( after pre-processing), so we know the true treatment effect
essentially without error: The fraction of respondents who say we spend too
much on assistance to the poor is smaller than the fraction of respondents
who say we spend too much on welfare by 0.35. To test our method, we repeatedly
drew subsamples of size from the full dataset, and
examined the ability of both lasso- and random-forest-based
cross-estimation to recover the correct answer.
We had regressors.

First of all, we note that both variants of cross-estimation achieved excellent coverage. Given a nominal coverage rate of 95%, the simple difference-in-means estimator, lasso-based cross-estimation and random forest cross-estimation had realized coverage rates of 96.3%, 96.5% and 95.3% respectively over 1,000 replications. Meanwhile, given a nominal target of 99%, the realized numbers became 99.0%, 99.0%, and 99.3%. We note that this dataset has non-Gaussian features and exhibits considerable treatment effect heterogeneity.

Second, Figure 3 depicts the reduction in squared confidence interval length for individual realizations of each method. More formally, we show boxplots of , where is the variance estimate used to build confidence intervals. Here, we see that although cross-estimation may not improve the precision of the simple method by a large amount, it consistently improves performance by a small amount. Moreover, in this example, random forests result in a larger improvement in precision than lasso-based cross-estimation.

## 7 Discussion

In many applications of machine learning methods to causal inference, there is a concern that the risk of specification search, i.e., trying out many candidate methods and choosing the one that gives us a significant result, may reduce the credibility of empirical findings. This has led to considerable interests in methodologies that allow for complex model fitting strategies that do not compromise statistical inference.

One prominent example is the design-based paradigm to causal inference in observational studies, whereby we first seek to build an “observational design” by only looking at the features and the treatment assignments , and only reveal the outcomes once the observational design has been set [49, 50]. The observational design may rely on matching, inverse-propensity weighting, or other techniques. As the observational design is fixed before the outcomes are revealed, practitioners can devote considerable time and creativity to fine-tuning the design without compromising their analysis.

From this perspective, we have shown that regression adjustments to high-dimensional randomized controlled trials exhibit a similar opportunity for safe specification search. Concretely, imagine that once we have collected data from a randomized experiment, we only provide our analyst with class-wise centered data: , , and . The analyst can then use this data to obtain any regression adjustment they want, which we will then plug into (4). Our results guarantee that—at least with a random Gaussian design—the resulting treatment effect estimates will be unbiased regardless of the specification search the analyst may have done using only the class-wise centered data. Cross-estimation enables us to mimic this phenomenon with non-Gaussian data.

## Appendix A Proofs

### a.1 Proof of Proposition 1

### a.2 Proof of Theorem 2

Given our hypotheses, we immediately see that , , and . The conclusion follows from Proposition 1 via Slutsky’s theorem.

### a.3 Proof of Theorem 3

For a covariance matrix , we define its spectral distribution

as the empirical distribution of its eigenvalues. We assume that, in our sequence of problems,

converges weakly to some limiting population spectral distribution . Given this assumption, it is well known that the spectra of the sample covariance matrices also converge weakly to a limiting empirical spectral distribution, with probability 1

[35, 36]. In this notation, the companion Stieltjes transform is defined asGiven these preliminaries and under the listed hypotheses, [34] show that the risk of optimally tuned ridge regression converges in probability:

where is the asymptotically optimal choice for .

Now, in our setting, we need to apply this result to the treatment and control samples separately. The asymptotically optimal regularization parameters for and are and respectively. Moreover, by spherical symmetry,

Thus, together we the above risk bounds, we find that

where and are the companion Stieltjes transforms for the control and treatment samples respectively. The desired conclusion then follows from Proposition 1.

### a.4 Proof of Theorem 4

By randomization of the treatment assignment we have, within the -th fold and for ,

Thus, we see that

Meanwhile, writing for the set of observations in the -th fold and for the number of those observations with , we can write

Now, by consistency of the regression adjustment, the third summand decays faster that and so can asymptotically be ignored. Re-arranging the first two summands, we get

which has the desired asymptotic variance, and is asymptotically Gaussian under the stated regularity conditions.

### a.5 Proof of Theorem 5

We begin with a definition. The estimator is *jackknife-compatible* if,
for any and a new independently-drawn test point ,

(24) |

for some sequence . The quantity inside the left-hand expectation is the popular jackknife estimate of variance for ; the condition requires that this variance estimate converge to 0 in expectation. We note that this condition is very weak: most classical statistical estimators will satisfy this condition with , while subsampled random forests of the type studied in [46] satisfy it with , where is the subsample size.

Now, as in the proof of Theorem 4, we write as

where is a residual term

The main component of is an unbiased, efficient estimator for . It remains to show that is asymptotically unbiased given jackknife-compatibility, and is moreover asymptotically negligible if is risk-consistent.

For the remainder of the proof, we focus on the setting where the regression adjustments and are computed separately on samples with and respectively. The reason may not be exactly unbiased is that is a function of observations if , while it is a function of observations if , and this effect can create biases. Our goal is to show, however, that these biases are small for any jackknife-compatible estimator. To do so, we first define a “leave-two-out” approximation to :