 # Model-Robust Counterfactual Prediction Method

We develop a method for assessing counterfactual predictions with multiple groups. It is tuning-free and operational in high-dimensional covariate scenarios, with a runtime that scales linearly in the number of datapoints. The computational efficiency is leveraged to produce valid confidence intervals using the conformal prediction approach. The method is model-robust in that it enables inferences from observational data even when the data model is misspecified. The approach is illustrated using both real and synthetic datasets.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

In many casual inference problems, the unit of analysis is subject to an exposure, indexed by , and is associated with a continuous outcome (or response) . For instance, an exposure may correspond to ‘not receiving’ or ‘receiving’ medication. The inferential question is then typically posed in counterfactual terms:

“What would the outcome have been, had the unit been assigned to a different exposure ?”

The ability to address this question using observational data is relevant in a wide variety of fields, including clinical trials, epidemiology, econometrics, policy evaluation, etc. 

Each unit is typically associated with a range of covariates (or features), collected in a vector

, which may affect its outcome and/or exposure assignment. When contains all variables that simultaneously affect both and , it is possible to provide causal interpretations from observed data. The onus is on the researcher to include such potentially confounding variables . Under this standard condition, the dependencies between exposure, outcome and covariates can be encoded by a graph as in Figure 1

along with an associated joint distribution

. Fig. 1: Dependency graphs, where ∙ and ■ represent random and deterministic variables, respectively. Left: Observed outcome from data-generating process. Right: Counterfactual outcome when assigning exposure ~z.

The counterfactual question can now be directly translated into predicting the counterfactual (or potential) outcome, , had the unit been set to , thus overriding the covariate-dependency of the exposure assignment [3, 4, 5, 6, 7, 8, 9]. The resulting dependencies for can also be encoded in a graph shown in Figure 1 with an associated joint distribution [10, 1].

The counterfactual outcome

is a random variable and the targeted quantity in most prior works has been the difference between average outcomes for exposures

and , i.e.,

 τ=E[y(1)]−E[y(0)],

which averages out [2, 11]. Using observational data from units,

 D={(x1,y1,z1),…,(xn,yn,zn)},

the target is identifiable assuming that the units are drawn independently from the data generating process and that there is an overlap of covariates for all exposure types [12, 13]

. Many methods that estimate

, model either the outcome of each exposure type or the exposure selection mechanism as functions of . A central inferential task is to provide confidence intervals (CI) for the estimate . Much effort has been made to formulate model-robust methods for this task as well as extending them to the case of high-dimensional so as to include a large number of potential confounders, cf. [14, 15, 16, 17].

For the counterfactual question , it is however more relevant to compare the covariate-specific outcomes directly, rather than averaging them over , cf. [18, 19, 20]. Consequently, the focus of recent methods has been the covariate-specific effect,

also referred to as the ‘conditional average treatment effect’. Since follows from the dependency structure, it is possible to learn flexible regression models of using a subset of the observational data,

 D~z={(xi,yi)},% where(xi,yi)∼p(x,y|z=~z).

The average effect is then estimated as . Using tree-based models it is possible to derive CIs for that are asymptotically valid, cf. [21, 22, 23, 24].

A fundamental limitation of targeting is that the irreducible dispersions of the counterfactual outcomes and are omitted. While correctly inferring that, say, , it may still be the case that frequently exceeds the value of . Such considerations are important in applications where different expoures involve differential risks. Then merely reporting and a CI omits valuable information about and .

In this paper, we aim to address this limitation by considering as a predictor of for a unit with covariates . Then using prediction intervals (PIs) for the predictors , the irreducible dispersions of the outcomes can be taken into account in the counterfactual analysis, cf. Fig. 2. Fig. 2: Counterfactual analysis of units with covariates x assigned to different exposures ~z. Outcome predictions ˆμ0(x) and ˆμ1(x) are denoted by circles. The prediction intervals (PIs) C0,β(x) and C1,β(x) (lines) incorporate the outcomes with a coverage of 100%β. Together the PIs enable an intuitive assessment of the counterfactual outcomes. Below we propose using the PIs to quantify how separable the outcomes are under different exposures.

A major challenge is obtaining PIs that are valid when the data-generating process is unknown. Here we consider the general conformal approach of [26, 27] to construct PIs with valid coverage properties in distribution-free manner. Each point in the interval is then constructed by re-fitting the predictor for the corresponding exposure group. This becomes computationally prohibitive using complex tree-based models or fitting methods that require parameter tuning, especially when is high-dimensional. This obstacle is circumvented using the method proposed below.

Our contributions in this paper are the following. We propose the use of prediction intervals for covariate-specific counterfactual analysis of each exposure type and define a measure of their relative impact that provides additional information to merely comparing and . We then learn sparse predictor models , with corresponding PIs , that automatically adapt to nonlinearities by leveraging the computational properties of the Spice predictor approach . This obviates the need for cross-validation or other tuning techniques. Since the conformal PIs also exhibit marginal coverage properties, even when lacking a correctly specified model of the data-generating process, the resulting method provides a model-robust means of counterfactual analysis.

Notation: and denote the -norm and Hadamard product. The cardinality of a set is . The operator stacks all elements into a single column vector. is the empirical mean of over all pairs in .

Remark: Code for the proposed method is available at https://github.com/dzachariah/counterfactual.

## Ii Counterfactual prediction method

A predictive approach to counterfactual analysis is readily generalized to multiple exposure types . We illustrate this as we proceed to define a measure of the separability of counterfactual outcomes under different exposures.

### Ii-a Counterfactual confidence

Consider an observational study with exposure types and a scalar covariate , as illustrated in Fig. 3. For a given , let and prediction intervals be the counterfactual prediction and PI for each exposure . We propose the following measure of the impact of exposures on the outcomes.

###### Definition 1.

A covariate-specific comparison between predicted outcomes for exposures and is said to have % counterfactual confidence, where is the largest value for which their PIs do not overlap. That is, the PIs are mutually exclusive

 C~z,β(x)∩Cz,β(x)=∅. (1)

A high confidence asserts that the counterfactual outcomes are highly separable and thus the impact of the exposures are more distinctive. This provides additional information to the size of the exposure effect measured as .

In Fig. 3 we see that for a unit with covariate , the counterfactual confidence when comparing exposures and is greater than , indicating highly separable outcomes. The confidences for the pairwise are tabulated pairwise below for covariate (left) and (right):

1 81%
2 64% 96%
1 24%
2 45% 58%

It is seen that for , the separability of counterfactual outcomes can be asserted with greater confidence than for . This is corroborated by comparing the datasets shown in Fig. 3. Fig. 3: Counterfactual analysis with K=3 exposure types and scalar covariate x, using method proposed on subsequent sections. Left: Datasets D~z for exposures ~z=0 (∘), 1 (+) and 2 (×). The resulting predictions ˆμ~z(x) along with 90%-intervals C~z,0.90(x). Center: x=−1. Right: x=2.

### Ii-B Conformal prediction intervals

Consider a regression model class

parameterized by . In subsection II-C, we specify a flexible regressor vector that adapts to nonlinearities in the data. When the model class is well-specified it includes the unknown mean function .

To learn a model in from , we build upon the tuning-free Spice-method  where the learned weights are defined as

 ˆw=argminw√ˆE[|y−ϕ⊤(x)w|2∣∣~z]+∥φ⊙w∥1, (2)

and the elements of are given by

 φj=⎧⎪ ⎪⎨⎪ ⎪⎩√ˆE[ϕ2j(x)|~z]|D~z|,j=1,…,p0,j=0.

The solution (2) can be computed sequentially for each sample in , using a coordinate descent algorithm with a runtime that scales as . We now leverage this property for the construction of conformal PIs.

The principle behind general conformal prediction can be described as follows [26, 27]: For the covariate of interest , consider a new sample where is a variable. Then quantify how well this sample conforms to the observed data via the learned model (2). All points that conform well, form a prediction interval. The conformity is quantified by including in the learned model (2), which is achieved by a sequential update in . Then, following , we define a measure

 (3)

where is the th fitted residual and is the indicator function. The measure is bounded between 0 and 1, where lower values correspond to higher conformity. We construct by varying over a set of grid points , as summarized in Algorithm 1. By leveraging the computational properties of the learning method, the prediction interval is computed with a total runtime of . The range of is set to exceed that of the outcomes in the observed dataset. A point prediction is obtained as the minimizer of .

Despite the fact that no dispersion model of the data generating process is required, the resulting prediction intervals exhibit valid coverage properties. When the model class is well-specified, the interval exhibits asymptotic conditional coverage, that is,

 Pr{y∈C~z,β(x)∣∣x,~z}=β+oP(1),

under certain regularity conditions [29, thm. 6.2]. More generally, is calibrated to ensure marginal coverage [29, thm. 2.1]

Note that this does not require a well-specified model class . In other words, the more accurate the learned prediction model, the tighter the prediction interval but its marginal coverage property remains not matter if the model is correct or not. This confers a robustness property to the proposed inference method in cases when .

### Ii-C Sparse additive predictor models

To learn accurate prediction models, we now turn to the specification of the regressor vector in . Let denote the dimension of and consider the additive model

 ϕ(x)=col{1,ψ1(x1),…,ψd(xd)},

which enables interpretable component-wise predictors . When a covariate is categorical, we use a standard basis vector for category and when . In the case of binary categories, we simply have .

For noncategorical covariates

, such as continuous or count variables, however, we propose a data-adaptive piecewise linear model. Let the empirical quantile function be

 ˆF−1(q)=inf{x:ˆF(x)≤q},

where is the empirical cumulative density function of obtained from the full observational data . Then we specify the model for by the -dimensional vector

 [ψ(x)]k=⎧⎨⎩(x−ck)+,k=1,…,m−1(x−ck)+I{x≤ck+1}+ck+1I{x>ck+1},k=m

where is a specified number of knots or breakpoints defined as . The resulting model yields finer resolution line segments where data density is high and is capable of capturing continuous nonlinear responses with respect to .

The regularization term in (2) leads to sparse solutions and thus the learning method yields a set of sparse additive predictor models with nonlinear responses in a data-adaptive manner, cf. .111The predictive performance can be related to the best subset predictor, see  for more details. Moreover, the method takes into account the amount of data observed within each line segment for each exposure type, and controls the model complexity accordingly. This mitigates overfitting when there is unequal training data or covariate imbalances across exposure types, cf. .

The integer determines the maximum resolution of the nonlinear model. A high enables higher predictive accuracy provided enough data is available to learn the nonlinear response. For instance, suppose denotes the size of the smallest dataset and and denote the number of binary and continuous covariates, respectively. Then and a natural upper limit is .

## Iii Numerical experiments

In this section we demonstrate the proposed counterfactual prediction approach by means of three examples. In the following examples, we consider exposure types.

### Iii-a Nonlinear effects

We consider the example in , which allows for a comparison with both tree-based and linear models. For each unit, the exposure

is assigned with equal probability. Then the covariate

(with ) is drawn as

 x|(z=0)∼N(40,102)andx|(z=1)∼N(20,102)

and the counterfactual outcomes as

 y(0)|x∼N(72+3√|x|,1)andy(1)|x∼N(90+exp(0.06x),1). (4)

A simulated observational dataset with is illustrated in Fig. 4. To obtain the predictions in the figure, we use . For a unit with covariate , as an example, we note that is larger than and that both confidence intervals are tight, as is expected by inspecting the data generating process (4) at the given covariate. In addition, the counterfactual confidence is found to be greater than 90%, indicating a highly separable counterfactual outcomes as expected.

To illustrate the robustness property of the prediction intervals, we repeat the experiment using 1000 Monte Carlo simulations. For each simulation, we generate new data and also draw a new unit from both exposure groups. For a unit with exposure , the outcome is found to belong to interval with probability 0.918 when . Similarly, for exposure , the outcome is contained in the interval with probability 0.907. This coverage property holds eventhough the mean function does not belong to . Fig. 4: Left: Dataset D with n=120 samples. Exposures ~z=0 (∘) and 1 (+). Right: Predictions with 90%-prediction intervals at x=30.

### Iii-B High-dimensional covariates

The desire to include all potential confounders in the covariate vector , may lead in many applications to dimensions that can be larger than . To address this issue, we simulate an experimental setting with covariates but only samples. Naturally, we have . The exposures and are assigned with probabilities 0.6 and 0.4, respectively. The covariates are drawn as

 x|(z=0)∼N(0,Σ0)% andx|(z=1)∼N(0,Σ1),

where and are randomly generated covariance matrices with unit trace. The matrices have numerical rank and are constructed using outer products of Gaussian vectors. This generates highly correlated covariates, as is typical in real applications. The counterfactual outcomes are generated as

 y(0)|x∼N(x1+5x10+5x20+0.5,0.52)andy(1)|x∼N(x1+x10−x30,0.52). (5)

However, this is not a problem for the learning method which automatically prunes away irrelevant covariates due to the adaptive regularization in (2).

A simulated observational dataset is shown in Fig. 5. We also illustrate the predicted outcomes for a unit with all covariates equal to one, . We observe that is considerably larger than , also when taking into account the prediction intervals. This is consistent with the data generating process (5) evaluated at the fixed . The interval for exposure is also seen to be significantly wider than that for exposure , reflecting the larger uncertainty of the predicted outcome. In this case it is possible to assert counterfactual confidence greater than 90%.

We repeat this experiment as well to validate the coverage properties of the intervals, using 1000 Monte Carlo simulations. For each simulation, we generate new data and also draw a new unit from both exposure groups. For a unit with exposure , the outcome is found to be contained in interval with probability 0.921 when . Similarly, for exposure , the outcome is contained in the interval with probability 0.915. Fig. 5: Left: Dataset D with n=100 samples. Only one covariate, x1, is shown for the sake of two-dimensional visualization. Exposures ~z=0 (∘) and 1 (+). Right: Predictions with 90%-prediction intervals at x=1.

### Iii-C Schooling data

Following the example in , we assess the effect of schooling on income for adults in the US born in the 1930s, using data from . The observed outcome is the weekly earnings (on a logarithmic scale) of a subject in 1970. Each individual is subject to one of two exposures: corresponds to receiving 12 years of schooling or more and corresponds to receiving less than 12 years. We consider 26 binary covariates in . Ten covariates indicate the year of birth 1930-1939, and eight indicate the census region. In addition, eight indicators represent ethnic identification, marital status and whether or not the subject lives in the central city of a metropolitan area. The observational study consists of samples. (See  for details.)

Discrete covariates can be partitioned into separate subgroups, and a direct inference approach would be to estimate the average outcomes of exposures 0 and 1 for each group. However, the number of subgroups grows quickly and there are not sufficient samples in the dataset for each subgroup and exposure. Therefore we apply the proposed method. The predicted outcomes are illustrated for subjects in different covariate groups in Fig 6. All subjects in these subgroups were born in the same year and came from the same region. The prediction interval widths are likely to be affected by the very coarse division of schooling used here, since includes 0 to 11 years of schooling, which is a substantial variation, while includes 12 years and more.

The three subgroups are : Caucasian, unmarried and not in a major city, : Caucasian, married and in a major city, and : African-American, married, and in a major city. Given that the units are logarithmic, the differences of predicted earnings, , correspond to +52%, +26% and +39% of weekly earnings, for , and , respectively. This means that the inferred effect of schooling is greatest for while considerably less for . The prediction intervals in Fig. 6 suggest, however, that there is a considerable dispersion of the outcomes. The predicted outcome of schooling has a counterfactual confidence of 33%, 20% and 25% for the three subgroups. Thus for subgroup , schooling not only exhibits the lowest predicted gains but also the least separable counterfactual outcomes can be asserted. The opposite is true for subgroup .

The findings appear to be consistent with features of US society in the 1970s: a Caucasian person in a major city with a family was expected to have greater access to economic opportunities, such that schooling experience mattered less to earnings. For the unmarried counterpart who lived outside of the major city, such alternative opportunities were fewer so that schooling could have a more significant impact. An African-American person in a major city with a family represents an intermediate case. Fig. 6: Counterfactual analysis of schooling (~z) and weekly earnings (y, in logarithmic units) using 90%-prediction intervals. Left: Caucasian, unmarried and not in major city. Center: Caucasian, married and in city. Right: African-american, married and in city.

## Iv Conclusions

We have developed a new method for counterfactual analysis using observational data based on prediction intervals. The intervals were used to define a measure of relative separability of counterfactual outcomes under different exposures. This takes into account the dispersions of the outcomes and provides additional information to the difference between predictions.

The intervals were constructed in a distribution-free and model-robust manner based on the general conformal prediction approach. The computational obstacles of this approach were circumvented by leveraging properties of a tuning-free method that learns sparse additive predictor models for counterfactual outcomes. We demonstrated the method using both real and synthetic data.

## References

•  M. A. Hernán and J. M. Robins, Causal Inference. Chapman & Hall/CRC, 2018. forthcoming.
•  S. L. Morgan and C. Winship, Counterfactuals and causal inference. Cambridge University Press, 2014.
•  J. Neyman, “Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes,” Roczniki Nauk Rolniczych, vol. 10, pp. 1–51, 1923.
•  D. B. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies.,” Journal of Educational Psychology, vol. 66, no. 5, p. 688, 1974.
•  J. Pearl, Causality. Cambridge University Press, 2009.
•  J. Pearl et al., “Causal inference in statistics: An overview,” Statistics surveys, vol. 3, pp. 96–146, 2009.
•  A. Sjölander, “The language of potential outcomes,” in Causality: Statistical Perspectives and Applications, John Wiley & Sons, 2012.
•  S. Greenland, “Causal inference as a prediction problem: Assumptions, identification and evidence synthesis,” in Causality: Statistical Perspectives and Applications, John Wiley & Sons, 2012.
•  J. Peters, D. Janzing, and B. Schölkopf, Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.
•  T. S. Richardson and J. M. Robins, “Single world intervention graphs: a primer,” in Second UAI workshop on causal structure learning, Bellevue, Washington, 2013.
•  G. W. Imbens, “Nonparametric estimation of average treatment effects under exogeneity: A review,” Review of Economics and statistics, vol. 86, no. 1, pp. 4–29, 2004.
•  L. Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics, Springer New York, 2004.
•  G. W. Imbens and D. B. Rubin, Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
• 

J. M. Robins and Y. Ritov, “Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models,”

Statistics in medicine, vol. 16, no. 3, pp. 285–319, 1997.
•  A. Belloni, V. Chernozhukov, and C. Hansen, “Inference on treatment effects after selection among high-dimensional controls,” The Review of Economic Studies, vol. 81, no. 2, pp. 608–650, 2014.
•  M. H. Farrell, “Robust inference on average treatment effects with possibly more covariates than observations,” Journal of Econometrics, vol. 189, no. 1, pp. 1–23, 2015.
•  V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey, “Double/debiased/neyman machine learning of treatment effects,” arXiv preprint arXiv:1701.08687, 2017.
•  P. M. Rothwell, “Can overall results of clinical trials be applied to all patients?,” The Lancet, vol. 345, no. 8965, pp. 1616–1619, 1995.
•  D. M. Kent and R. A. Hayward, “Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification,” Jama, vol. 298, no. 10, pp. 1209–1212, 2007.
• 

J. Weiss, F. Kuusisto, K. Boyd, J. Liu, and D. Page, “Machine learning for treatment assignment: Improving individualized risk attribution,” in

AMIA Annual Symposium Proceedings, vol. 2015, p. 1306, American Medical Informatics Association, 2015.
•  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, Springer New York, 2013.
•  J. L. Hill, “Bayesian nonparametric modeling for causal inference,” Journal of Computational and Graphical Statistics, vol. 20, no. 1, pp. 217–240, 2011.
• 

S. Wager and S. Athey, “Estimation and inference of heterogeneous treatment effects using random forests,”

Journal of the American Statistical Association, no. just-accepted, 2017.
•  S. Künzel, J. Sekhon, P. Bickel, and B. Yu, “Meta-learners for estimating heterogeneous treatment effects using machine learning,” arXiv preprint arXiv:1706.03461, 2017.
•  S. Geisser, Predictive inference. CRC Press, 1993.
•  V. Vovk, A. Gammerman, and G. Shafer, Algorithmic learning in a random world. Springer Science & Business Media, 2005.
•  V. Balasubramanian, S.-S. Ho, and V. Vovk, Conformal prediction for reliable machine learning: Theory, Adaptations and applications. Newnes, 2014.
•  D. Zachariah, P. Stoica, and T. B. Schön, “Online learning for distribution-free prediction,” arXiv preprint arXiv:1703.05060, 2017.
•  J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman, “Distribution-free predictive inference for regression,” Journal of the American Statistical Association, no. just-accepted, 2017.
•  P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman, “Spam: Sparse additive models,” in Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 1201–1208, Curran Associates Inc., 2007.
•  J. Esarey, “Causal inference with observational data,” in Analytics, Policy, and Governance, Yale University Press, 2017.
•  J. D. Angrist and A. B. Krueger, “Does compulsory school attendance affect schooling and earnings?,” The Quarterly Journal of Economics, vol. 106, no. 4, pp. 979–1014, 1991.