1 Methodology and Motivation
Machine learning (ML) has had widespread success in solving prediction problems in applications ranging from image and speech recognition (lecun2015deep) to personalized medicine (kononenko2001machine). This makes it an attractive tool also for studying heterogeneity in causal effects. In fact, ML excels at overcoming well-known limitations of traditional methods used to solve this task. For example, matching methods struggle to perform well when confounders and covariates are high-dimensional (rubin1996matching); generalized linear models are not flexible enough to discover variable interactions and nonlinear trends; and propensity-based methods suffer from variance issues in estimation (lee2011weight). In contrast, supervised machine learning has proven highly useful in discovering patterns in high-dimensional data (lecun2015deep), approximating complex functions, and trading off bias and variance (swaminathan2015counterfactual).

In this observational study based on data from the National Study of Learning Mindsets (NSLM), we apply both off-the-shelf and purpose-built ML estimators to characterize heterogeneity in the effect of a student mindset intervention on future academic performance. In particular, we compare estimates of conditional average treatment effects based on linear models, random forests, gradient boosting and deep neural networks. Below, we introduce the problem of discovering treatment effect heterogeneity and describe our methodology.
1.1 Problem setup
We study the effect of a student mindset intervention based on observations of 10,391 students in 76 schools from the NSLM study. The intervention, assigned at the student level, is represented by a binary variable $t \in \{0, 1\}$ and the performance outcome by a real-valued variable $y \in \mathbb{R}$. Students are observed through student-level covariates and schools through school-level covariates; the meanings of the different covariates are described in later sections of the manuscript. For convenience, we let $x$ represent the full set of covariates of a student-school pair, and let $(x_{ij}, t_{ij}, y_{ij})$ denote the observation corresponding to a student $i$ in a school $j$. As each student is enrolled in at most one school, we omit the index $j$ in the sequel. Observed treatment groups $D_0$ (control) and $D_1$ (treated) are defined by $D_t = \{i : t_i = t\}$. The full dataset of $n$ observations is denoted $D = \{(x_i, t_i, y_i)\}_{i=1}^{n}$, and the density of all variables $p(x, t, y)$.

We adopt the Neyman-Rubin causal model (rubin2005causal), and denote by $Y(0)$ and $Y(1)$ the potential outcomes corresponding to interventions $t = 0$ and $t = 1$ respectively. The goal of this study is to characterize heterogeneity in the treatment effect across students and schools. As is well known, this effect is not identifiable without additional assumptions, as each student is observed in only one treatment group. Instead, we estimate the conditional average treatment effect (CATE) with respect to observed covariates $x$,
$$\tau(x) = \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right] \qquad (1)$$
CATE is identifiable from observational data under the standard assumptions of ignorability, $Y(0), Y(1) \perp T \mid X$; consistency, $Y = Y(T)$; and overlap (positivity), $0 < p(T = 1 \mid X = x) < 1$ for all $x$.
The CATE conditioned on the full set of covariates $x$ is the closest we get to estimating the treatment effect for an individual student. However, to design policies it is rarely necessary to have this level of resolution. In later sections, we also estimate conditional effects with respect to subsets or functions of $x$, such as the average effect stratified by school achievement level. By first identifying $\tau(x)$ and then marginalizing it with respect to such functions, we adjust for confounding to the best of our ability.
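A minimal sketch of this marginalization, assuming student-level effect estimates are already available (both the effect values and the achievement strata below are made up for illustration):

```python
import numpy as np

def stratified_cate(tau_hat, stratum):
    """Average student-level effect estimates within each stratum
    (e.g. school achievement level), marginalizing over all other
    covariates. Returns {stratum value: mean effect}."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    stratum = np.asarray(stratum)
    return {str(s): float(tau_hat[stratum == s].mean())
            for s in np.unique(stratum)}

# Hypothetical effect estimates for six students in two achievement strata.
tau = [0.30, 0.10, 0.20, 0.40, 0.25, 0.15]
level = ["low", "low", "high", "high", "low", "high"]
by_level = stratified_cate(tau, level)  # {'high': 0.25, 'low': ~0.217}
```

Any discrete stratification of the covariates can be plugged in for `stratum`, including learned groupings such as the leaves of a decision tree.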
1.2 Methodology overview
The flexibility of ML estimators creates both opportunities and challenges. For example, it is typical for the number of parameters of the best-performing model on a task to exceed the number of available samples. This is made possible by mitigating overfitting (high variance) through appropriate regularization. Indeed, many models achieve the best fit only after several tuning parameters that control the bias-variance trade-off have been set carefully. It is standard practice in ML to use sample splitting for this purpose. Here, we apply such a pipeline to CATE estimation, proceeding through the following steps.

Split the observed data into two partitions, a training set $D_{\text{train}}$ and a validation set $D_{\text{val}}$, for parameter fitting and model selection respectively.

Fit estimators of the potential outcomes, $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$, to $D_{\text{train}}$ and select tuning parameters based on held-out error on $D_{\text{val}}$.

Impute CATE, $\hat{\tau}(x_i) = \hat{\mu}_1(x_i) - \hat{\mu}_0(x_i)$, for every student and fit an interpretable model to characterize treatment effect heterogeneity.
This pipeline allows us to find the best-fitting (black-box) estimators possible in Step 2 without regard for the interpretability of their predictions. By fitting a simpler, more interpretable model to the imputed effects in Step 3, we may explain the predictions of the more complex model in terms of known quantities. This procedure is particularly well suited when the effect is a simpler function than the response, and it also allows us to control the granularity at which we study heterogeneity.
The data from NSLM has a multilevel nature: students (level 1) are grouped into schools (level 2), and each level is associated with its own set of covariates. The literature is rich with studies of causal effects in multilevel settings; see for example gelman2006data. However, that work primarily targets the effects of high-level (e.g. school-level) interventions on lower-level subjects (e.g. students), and the increased uncertainty that comes with such an analysis. While interventions here are assigned at the student level, it is important to note that only 76 values of the school-level variables are observed, which introduces the risk of overfitting to these covariates specifically. We adjust for the multilevel nature of the data in sample splitting, bootstrapping and the analysis of imputed effects.
In the following sections we describe each step of our methodology in detail.
Step 1. Sample splitting
To enable unbiased estimation of prediction error and to select tuning parameters, we divide the dataset into two parts, with 80% of the data used for the training set $D_{\text{train}}$ and 20% for the validation set $D_{\text{val}}$. We partition the set of schools, rather than students, making sure that the entire student body of any one school appears in only one of $D_{\text{train}}$ and $D_{\text{val}}$. This is to mitigate overfitting to school-level covariates. As there are only 76 schools, random sampling may create sets that have very different characteristics. To mitigate this, we balance $D_{\text{train}}$ and $D_{\text{val}}$ by constructing a large number of splits uniformly at random and selecting the one that minimizes the Euclidean distance between summary statistics of the two sets. In particular, we compare the first- and second-order moments of all covariates. We increase the influence of the treatment variable by a factor of 10 in this comparison to ensure that treatment groups are split evenly across $D_{\text{train}}$ and $D_{\text{val}}$.

Step 2. Estimation of potential outcomes
The conditional average treatment effect is the difference between expected potential outcomes, here denoted $\mu_0(x) = \mathbb{E}[Y(0) \mid X = x]$ and $\mu_1(x) = \mathbb{E}[Y(1) \mid X = x]$. Under ignorability w.r.t. $x$ (see above), we have that $\mu_t(x) = \mathbb{E}[Y \mid X = x, T = t]$ and thus $\tau(x) = \mu_1(x) - \mu_0(x)$. A straightforward route to estimating $\tau$ is to independently fit the conditionals $\mathbb{E}[Y \mid X = x, T = t]$ for each value of $t$ and compute their difference. This has recently been dubbed the T-learner approach to distinguish it from other learning paradigms (kunzel2017meta). Below, we briefly cover theory that motivates this method and point out some of its shortcomings. To study heterogeneity, we consider several T-learners as well as two alternative approaches described below.
We approximate $\mu_t$ using hypotheses $f_t$ and measure their quality by the mean squared error. The group-conditional expected and empirical risks are defined as follows:
$$R_t(f) = \mathbb{E}\left[\left(f(X) - \mu_t(X)\right)^2 \mid T = t\right], \qquad \hat{R}_t(f) = \frac{1}{n_t} \sum_{i \,:\, t_i = t} \left(f(x_i) - y_i\right)^2 \qquad (2)$$
We never observe $\mu_t$ directly, but learn from noisy observations $y_i$. Statistical learning theory helps resolve this issue by bounding the expected risk in terms of its empirical counterpart and a measure of function complexity (vapnik1999overview). For hypotheses $f$ in a class $\mathcal{F}$ with a complexity measure $C_{\mathcal{F}}$ with logarithmic dependence on the sample size $n_t$ (e.g. a function of the covering number), it holds with probability greater than $1 - \delta$ that

$$R_t(f) \leq \hat{R}_t(f) + \mathcal{O}\left(\sqrt{\frac{C_{\mathcal{F}} + \log(1/\delta)}{n_t}}\right) + \sigma^2,$$

where $\sigma^2$ is a bound on the expected variance in $y$ (see johansson2018learning for a full derivation). This class of bounds illustrates the bias-variance trade-off that is typical for machine learning and motivates the use of regularization to control model complexity. In our experiments, we consider several T-learner models that estimate each potential outcome independently using regularized empirical risk minimization, solving the following problem:
$$\hat{\theta}_t = \operatorname*{arg\,min}_{\theta} \; \hat{R}_t(f_{\theta}) + \lambda \, \mathcal{R}(\theta) \qquad (3)$$
Here, $f_{\theta}$ is a function parameterized by $\theta$ and $\mathcal{R}(\theta)$ a regularizer of model parameters, such as the $\ell_1$-norm (LASSO) or $\ell_2$-norm (Ridge) penalties. In our analysis, we compare four commonly used off-the-shelf machine learning estimators: ridge regression, random forests, gradient boosting and deep neural networks.
Sharing power between treatment groups
A drawback of T-learners is that no information is shared between estimators of different potential outcomes. In problems where the baseline response is a more complex function than the effect itself, the T-learner is wasteful in terms of statistical power (kunzel2017meta; nie2017learning). As an alternative, we apply the TARNet neural network architecture of shalit2017estimating, which learns a representation of both treatment groups jointly, but predicts potential outcomes separately. This has the advantage of sharing information across treatment groups in learning the average response, but allows for flexibility in effect approximation. For an illustration comparing T-learners and TARNet, see Figure 1.
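A minimal numpy sketch of the TARNet idea (untrained forward pass only; the single shared layer and layer sizes are illustrative, not the architecture of shalit2017estimating): a shared representation feeds two treatment-specific heads, and each sample's squared error flows only through the head matching its observed treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 5, 8                                   # input and representation widths

# Shared representation layer Phi and one linear head per treatment arm.
W_phi = rng.normal(scale=0.1, size=(d, h))
w_head = {0: rng.normal(scale=0.1, size=h),   # head for t = 0
          1: rng.normal(scale=0.1, size=h)}   # head for t = 1

def predict(X, t):
    """Shared ReLU representation Phi(x), then the head selected by t."""
    phi = np.maximum(X @ W_phi, 0.0)
    return np.array([phi[i] @ w_head[int(ti)] for i, ti in enumerate(t)])

def factual_loss(X, t, y):
    """Each sample's squared error flows only through its observed head."""
    return float(np.mean((predict(X, t) - y) ** 2))

X = rng.normal(size=(10, d))
t = rng.integers(0, 2, size=10)
y = rng.normal(size=10)
# CATE estimate: difference between the two heads on the same representation.
tau_hat = predict(X, np.ones(10, int)) - predict(X, np.zeros(10, int))
```

Because both heads read the same representation, gradients from either group shape the shared weights, which is how statistical power is pooled across groups.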
Generalizing across treatment groups
The careful reader may have noticed that the population and empirical risks in equations (2) and (3) are defined with respect to the observed treatment assignments. To estimate CATE, we want our estimates of potential outcomes to be accurate for the counterfactual assignment as well. In other words, we want the risk on the full cohort,

$$R(f_t) = \mathbb{E}\left[\left(f_t(X) - \mu_t(X)\right)^2\right],$$

to be small. When treatment groups $D_0$ and $D_1$ are highly imbalanced, the expected risk within one treatment group may not be representative of the risk on the full population. This is another drawback of T-learner estimators, which do not adjust for this discrepancy.
In recent work, shalit2017estimating characterize the difference between $R(f_t)$ and $R_t(f_t)$ and bound the error in CATE estimates using a distance metric between treatment groups. In particular, they consider the integral probability metric (IPM) family of distances, resulting in the following relation between full-cohort and treatment-group risk:
$$R(f_t) \leq R_t(f_t) + B \cdot \mathrm{IPM}_{\mathcal{G}}\left(p(x \mid T = 0),\, p(x \mid T = 1)\right), \qquad (4)$$

where the constant $B$ depends on the loss and on the function family $\mathcal{G}$ defining the IPM.
In shalit2017estimating, the authors used the kernel-based Maximum Mean Discrepancy (gretton2012kernel) to regularize and balance the representations learned by the TARNet architecture, minimizing an upper bound on the CATE risk. We apply this approach, called Counterfactual Regression (CFR), in our analysis.
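To illustrate the balance penalty, the sketch below computes the simplest member of the MMD family, the squared linear-kernel MMD, which reduces to the squared distance between group means in representation space (the kernel used by shalit2017estimating is more general, and the representations here are synthetic):

```python
import numpy as np

def linear_mmd(phi0, phi1):
    """Squared linear-kernel MMD: squared Euclidean distance between the
    mean representations of the control and treated samples."""
    diff = phi0.mean(axis=0) - phi1.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(0)
phi0 = rng.normal(loc=0.0, size=(500, 4))   # control-group representations
phi1 = rng.normal(loc=0.5, size=(500, 4))   # treated, shifted by 0.5 per dim

# A mean shift of 0.5 in 4 dimensions gives a squared MMD near 1.0;
# the CFR penalty drives this quantity toward zero during training.
mmd = linear_mmd(phi0, phi1)
```

When the penalty is weighted into the training objective alongside the factual loss, minimizing it pushes the two groups' representations toward a common distribution, shrinking the IPM term in (4).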
Step 3. Characterization of heterogeneity in CATE
After fitting models for each potential outcome, the conditional average treatment effect is imputed for each student by $\hat{\tau}(x_i) = \hat{\mu}_1(x_i) - \hat{\mu}_0(x_i)$. Unlike with linear regressors, the predictions of most ML estimators are difficult to interpret directly through model parameters. For this reason, ML models are often considered black-box methods (lipton2016mythos). However, in the study of heterogeneity, it is crucial to characterize for which subjects the effect of an intervention is low and for which it is high. To accomplish this, we adopt the common practice of post-hoc interpretation: fitting a simpler, more interpretable model to the imputed effects $\hat{\tau}(x_i)$.
In its very simplest form, the interpretable model may be a function of a single attribute, such as the school size, effectively averaging over all other attributes. This is usually a good way of discovering global trends in the data, but it will neglect meaningful interactions between variables, much like a linear model. As a more flexible alternative, we also fit decision tree models and inspect the learned trees. Trees of only two variables may be visualized directly in the covariate space, and larger trees in terms of their decision rules.
2 Workshop results
We report the first results of our analysis, as presented at the workshop Empirical Investigation of Methods for Heterogeneity at the Atlantic Causal Inference Conference, 2018.
2.1 Covariate balance
First, we investigate the extent to which the overlap assumption holds by comparing the covariate statistics of the treatment and control groups. In Figure 2, we visualize the marginal distributions of each covariate, as well as a 2D t-SNE projection of the entire covariate set (maaten2008visualizing). The observed difference between the marginal covariate distributions of the two treatment groups is very small, and the nonlinear t-SNE projection also reveals little difference between treatment groups. The less imbalance there is between treatment groups, the closer our problem is to standard supervised learning. Said differently, the density ratio $p(x \mid T = 1) / p(x \mid T = 0)$ is close to $1$ and the IPM distance between the conditional distributions, see (4), is small. Hence, we expect neither propensity reweighting nor balanced representations (e.g. CFR) to have a large effect on the results.

2.2 Estimation of potential outcomes
[Table 3: Held-out score of each estimator with 95% school-level cluster bootstrap confidence intervals.]
In our analysis, we compare four T-learners based on ridge regression (RR), random forests (RF), gradient boosting (GB), and neural networks (NN). In addition, we compare the representation-learning algorithms TARNet and Counterfactual Regression (CFR) (shalit2017estimating). For each estimator family, we fit models of both potential outcomes on the training set $D_{\text{train}}$ and select tuning parameters based on the held-out score on the validation set $D_{\text{val}}$. To estimate uncertainty in model predictions, we perform school-level bootstrapping of the training set (cameron2008bootstrap), fitting each model to each bootstrap sample. (The bootstrap analysis was added after the workshop results, but is presented here for completeness.) In Table 3, we give the estimate of the average treatment effect (ATE) from each model, the held-out score of the fit of factual outcomes, and 95% confidence intervals based on the empirical bootstrap. In addition, we give the naïve estimate of the ATE: the difference between observed average outcomes in the two treatment groups.
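The school-level bootstrap can be sketched as follows: resample schools with replacement, keep each sampled school's full student body, and recompute the statistic on the resampled data. This sketch uses percentile intervals for brevity (the analysis uses the empirical bootstrap) and made-up per-student values:

```python
import numpy as np

def cluster_bootstrap_ci(values, school_ids, n_boot=1000, seed=0):
    """95% bootstrap interval for the mean, resampling whole schools with
    replacement so that within-school dependence is preserved."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    school_ids = np.asarray(school_ids)
    schools = np.unique(school_ids)
    by_school = {s: values[school_ids == s] for s in schools}
    stats = [np.concatenate(
                 [by_school[s] for s in rng.choice(schools, size=len(schools))]
             ).mean() for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])

# Hypothetical per-student effect estimates in 5 schools of 40 students each.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(5), 40)
tau = 0.25 + rng.normal(scale=0.1, size=ids.size)
lo_ci, hi_ci = cluster_bootstrap_ci(tau, ids)
```

Resampling at the school level, rather than the student level, is what makes the intervals honest about the small number (76) of observed school-covariate values.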
We see that all methods produce very similar estimates of the ATE and perform comparably in terms of held-out score. As expected given the small covariate imbalance shown in the previous section, the regression-adjusted estimates are close to the naïve estimate of the ATE. This likely also explains the small difference between TARNet and CFR, as even for moderate to large imbalance regularization, the empirical risk dominates the objective function. The performance of the neural network T-learner would likely improve with a different choice of architecture or tuning parameters. This is consistent with shalit2017estimating, in which the TARNet architecture achieved half of the error of the T-learner on the IHDP benchmark.
2.3 Heterogeneity in causal effect
We examine further the CATE for each student as imputed by the best-fitting model. As CFR had a slight edge in held-out score over the T-learning estimators (although confidence intervals overlap) and has stronger theoretical justification, we analyze the effects imputed by CFR below. In Figure 2(a), we visualize the distribution of imputed CATEs. We see that for almost all students the effect is estimated to be positive, indicating an improvement in performance as an effect of the mindset intervention; the corresponding estimates of the average treatment effect are given in Table 3.

To discover drivers of heterogeneity, we fit a random forest estimator to the imputed effects and inspect the feature importance of each variable: the frequency with which it is used to split the nodes of a tree. In Figure 4, we stratify imputed CATE with respect to the five most important variables of the random forest, as well as the school achievement level, which is of interest to the study organizers. We see a strong trend that the effect of the intervention decreases with prior school-level student mindset. The urbanicity of the school is a categorical variable for which Category D appears to be associated with a substantially lower effect. In contrast, the effect of the intervention appears to increase with students' prior expectations. One of the questions of the original study was whether there exists a "Goldilocks effect" for school achievement level, meaning that the intervention only has an effect in schools that are achieving neither too poorly nor too well. These results can neither confirm nor reject this hypothesis.

3 Post-workshop analysis
Heterogeneity in treatment effect may be a nonlinear or non-additive function of observed covariates. Such patterns remain hidden when analyzing CATE as a function of a single variable at a time, or when using linear regression. To reveal richer patterns of heterogeneity, we fit highly regularized regression tree models and inspect their decision rules. First, we consider combinations of only pairs of variables at a time. We note that for school-level variables, only 76 unique values exist, one for each school. To prevent overfitting to these variables, we require that each leaf in the regression tree contain samples from at least 10 schools. When student-level covariates are included, we require leaves to have samples from at least 1000 students.
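Off-the-shelf tree implementations such as scikit-learn's cannot enforce a schools-per-leaf constraint directly, so one workable sketch (with synthetic data and a hypothetical helper function) fits a tree with a minimum leaf size and verifies the school-count requirement after the fact:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_school_counts(tree, X, school_ids):
    """Distinct schools contributing samples to each leaf, used to check
    the at-least-10-schools-per-leaf requirement post hoc."""
    leaves = tree.apply(X)
    return {int(leaf): int(len(np.unique(school_ids[leaves == leaf])))
            for leaf in np.unique(leaves)}

rng = np.random.default_rng(0)
n_schools, per_school = 76, 100
school_ids = np.repeat(np.arange(n_schools), per_school)
X = rng.normal(size=(n_schools * per_school, 2))       # student-level covariates
tau_hat = (0.2 + 0.1 * (X[:, 0] > 0)
           + rng.normal(scale=0.05, size=X.shape[0]))  # synthetic effects

# Large leaves guard against overfitting to the 76 school-level values.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1000)
tree.fit(X, tau_hat)
counts = leaf_school_counts(tree, X, school_ids)
```

If any leaf falls below the school-count threshold, the tree can be refit with a larger `min_samples_leaf` or a smaller `max_depth` until the constraint holds.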
In Figure 5, we visualize trees fit to two distinct variable pairs. We note a very slight nonlinear pattern in heterogeneity as a function of school-level student mindset and school poverty concentration, the latter explaining much of the variance observed at moderate values of school-level mindset in Figure 4. We emphasize, however, that the sample size at the school level is small and that the observed patterns have high variance. In the right-hand figure, a student's prior expectations appear associated with a larger effect only if the average mindset of the school is sufficiently high. This pattern disappears when using a linear model. In the Appendix, we show a regression tree fit to the entire covariate set.
4 Discussion
Machine learning offers a broad range of tools for flexible function approximation and provides theoretical guarantees for statistical estimation under model misspecification. This makes it a suitable framework for estimating causal effects from nonlinear, imbalanced or high-dimensional observational data. The flexibility of machine learning comes at a price, however: many methods have tuning parameters that are challenging to set for causal estimation; models are often difficult to optimize globally; and interpretability suffers. While progress has been made independently on each of these problems, a standardized set of tools has yet to emerge.
In the analysis of the NSLM data, machine learning appears well suited to studying overlap, potential outcomes and heterogeneity in imputed effects. However, the analysis also raises some methodological questions. The multilevel nature of the covariates is not accounted for in most off-the-shelf ML models, and regularization of models applied to multilevel data has been studied comparatively less than for single-level data. In addition, as pointed out by several authors (kunzel2017meta; nie2017learning), the T-learner approach to causal effect estimation may suffer from compounding bias and from wasted statistical power. This may be one reason we observe a slight advantage for representation learning methods such as TARNet and CFR.