# Bayesian Nonparametric Causal Inference: Information Rates and Learning Algorithms

We investigate the problem of estimating the causal effect of a treatment on individual subjects from observational data, this is a central problem in various application domains, including healthcare, social sciences, and online advertising. Within the Neyman Rubin potential outcomes model, we use the Kullback Leibler (KL) divergence between the estimated and true distributions as a measure of accuracy of the estimate, and we define the information rate of the Bayesian causal inference procedure as the (asymptotic equivalence class of the) expected value of the KL divergence between the estimated and true distributions as a function of the number of samples. Using Fano method, we establish a fundamental limit on the information rate that can be achieved by any Bayesian estimator, and show that this fundamental limit is independent of the selection bias in the observational data. We characterize the Bayesian priors on the potential (factual and counterfactual) outcomes that achieve the optimal information rate. As a consequence, we show that a particular class of priors that have been widely used in the causal inference literature cannot achieve the optimal information rate. On the other hand, a broader class of priors can achieve the optimal information rate. We go on to propose a prior adaptation procedure (which we call the information based empirical Bayes procedure) that optimizes the Bayesian prior by maximizing an information theoretic criterion on the recovered causal effects rather than maximizing the marginal likelihood of the observed (factual) data. Building on our analysis, we construct an information optimal Bayesian causal inference algorithm.

## Authors

• 21 publications
• 84 publications
• ### Bayesian causal inference for count potential outcomes

The literature for count modeling provides useful tools to conduct causa...
08/07/2020 ∙ by Young Lee, et al. ∙ 0

• ### The Promises of Parallel Outcomes

Unobserved confounding presents a major threat to the validity of causal...
12/10/2020 ∙ by Ying Zhou, et al. ∙ 0

• ### Using Experimental Data to Evaluate Methods for Observational Causal Inference

Methods that infer causal dependence from observational data are central...
10/06/2020 ∙ by Amanda Gentzel, et al. ∙ 0

• ### Bayesian method for causal inference in spatially-correlated multivariate time series

Measuring the causal impact of an advertising campaign on sales is an im...
01/19/2018 ∙ by Bo Ning, et al. ∙ 0

• ### Counterfactual Mean Embedding: A Kernel Method for Nonparametric Causal Inference

This paper introduces a novel Hilbert space representation of a counterf...
05/22/2018 ∙ by Krikamol Muandet, et al. ∙ 0

• ### Bayesian Doubly Robust Causal Inference via Loss Functions

Frequentist inference has a well-established supporting theory for doubl...
03/06/2021 ∙ by Yu Luo, et al. ∙ 0

• ### A Bayesian Method for Causal Modeling and Discovery Under Selection

This paper describes a Bayesian method for learning causal networks usin...
01/16/2013 ∙ by Gregory F. Cooper, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

The problem of estimating the individualized causal effect of a particular intervention from observational data is central in many application domains and research fields, including public health and healthcare [1], computational advertising [2], and social sciences [3]

. With the increasing availability of data in all these domains, machine learning algorithms can be used to obtain estimates of the effect of an intervention, an action, or a treatment on individuals given their features and traits. For instance, using observational electronic health record data

, machine learning-based recommender system can learn the individual-level causal effects of treatments currently deployed in clinical practice and help clinicians refine their current treatment policies [4]. There is a growing interest in using machine learning methods to infer the individualized causal effects of medical treatments; this interest manifests in recent initiatives such as STRATOS [4], which focuses on guiding observational medical research, in addition to various recent works on causal effect inference by the machine learning community [5, 6, 7, 8, 9, 10].

The problem of estimating individual-level causal effects is usually formulated within the classical potential outcomes framework, developed by Neyman and Rubin [11, 12]. In this framework, every subject (individual) in the observational dataset possesses two “potential outcomes”: the subject’s outcome under the application of the treatment, and the subject’s outcome when no treatment is applied. The treatment effect is the difference between the two potential outcomes, but since we only observe the “factual” outcome for a specific treatment assignment, and never observe the corresponding “counterfactual” outcome, we never observe any samples of the true treatment effect in an observational dataset.

This is what makes the problem of causal inference fundamentally different from standard supervised learning (regression)

. Moreover, the policy by which treatments are assigned to subjects induces a selection bias in the observational data, creating a discrepancy in the feature distributions for the treated and control patient groups, which makes the problem even harder. Many of the classical works on causal inference have focused on the simpler problem of estimating average

treatment effects, where unbiased estimators based on propensity score weighting were developed to alleviate the impact of selection bias on the causal estimands (see

[13] and the references therein).

While more recent works have developed machine learning algorithms for estimating individualized treatment effects from observational data in the past few years [5, 14, 15, 16, 2, 17, 8, 18, 19], the inference machinery built in most of these works seem to be rather ad-hoc. The causal inference problem entails a richer set of modeling choices and decisions compared to that of the standard supervised learning (regression) problem, which includes deciding what model to use, how to model the treatment assignment variables in the observational data, and how to handle selection bias, etc. In order to properly address all these modeling choices, one needs to understand the fundamental limits of performance in causal effect estimation problems, and how different modeling choices impact the achievable performance.

In this paper, we establish the fundamental limits on the amount of information that a learning algorithm can gather about the causal effect of an intervention given an observational data sample. We also provide guidelines for building proper causal inference models that “do not leave any information on the table” because of poor modeling choices. A summary of our results is provided in the following Section.

## Ii Summary of the Results

We address the individualized causal effect estimation problem on the basis of the Neyman-Rubin potential outcomes model [11, 12]. We focus on Bayesian nonparametric learning algorithms, as they are immune to model mis-specification, and can learn highly heterogeneous response functions that one would expect to encounter in datasets with medical or social outcomes [20, 3]. In Section IV, we introduce the notion of information rate

as a measure for the quality of Bayesian nonparametric learning of the individualized causal effects. The information rate is defined in terms of a measure of the Kullback-Leibler divergence between the true and posterior distributions for the causal effect. In Theorem 1, we establish the equivalence between Bayesian information rates and frequentist estimation rate. In the rest of the paper, we characterize: (1) the optimal information rates that can be achieved by any Bayesian nonparametric learning algorithm, and (2) the nature of the priors that would give rise to “informationally optimal” Bayesian nonparametric causal inference procedure.

In Section V, we establish the fundamental limit on the information rate that can be achieved by any Bayesian causal inference procedure using an information-theoretic lower bound based on Fano’s method. The optimal information rate is a property of the function classes to which the potential outcomes belong, and is independent of the inference algorithm. We show that the optimal information rate for causal inference is governed by the “rougher” of the two potential outcomes functions. We also show that the optimal information rates for causal inference are insensitive to selection bias (Theorem 2).

In Section VI

, we characterize the Bayesian priors that achieve the optimal rate. We show that the most common modeling choice adopted in the literature, which is to augment the treatment assignment variable to the feature space, leads to priors that are suboptimal in terms of the achievable rate (Theorem 3). We show that informationally optimal priors are ones that place a probability distribution over a vector-valued function space, where the function space has its smoothness matching the rougher of the two potential outcomes functions. Since the true smoothness parameter of the potential outcomes functions is generally unknown a priori, we propose a prior adaptation procedure, called the

information-based empirical Bayes procedure, which optimizes the Bayesian prior by maximizing an information-theoretic criterion on the recovered causal effects rather than maximizing the marginal likelihood of the observed (factual) data.

We conclude the paper by building an information-optimal Bayesian causal inference algorithm that is based on our analysis. The inference procedure embeds the potential outcomes in a vector-valued reproducing kernel Hilbert space (vvRKHS), and uses a multi-task Gaussian process prior (with a Matérn kernel) over that space to infer the individualized causal effects. We show that for such a prior, the proposed information-based empirical Bayes method exhibits an insightful factual bias and counterfactual variance decomposition. Experiments conducted on a standard dataset that is used for benchmarking causal inference models show that our model significantly outperforms the state-of-the-art.

## Iii Related Work

We conduct our analysis within the potential outcomes framework developed by Neyman and Rubin [11, 12]. The earliest works on estimating causal effects have focused on the problem of obtaining unbiased estimates for the average treatment effects using observational samples. The most common well-known estimator for the average causal effect of a treatment is the propensity score weighting estimator, which simply removes the bias introduced by selection bias by giving weights to different samples that are inversely proportional to their propensity scores [13]. More recently, the machine learning community has also developed estimators for the average treatment effects that borrows ideas from representation learning, i.e. see for instance the work in [9]. In this paper, we focus on the individual, rather than the average causal effect estimation problem.

To the best of our knowledge, non of the previous works have attempted to characterize the limits of learning causal effects in either the frequentist or Bayesian setups. Instead, most previous works on causal effect inference have focused on model development, and various algorithms have been recently developed for estimating individualized treatment effects from observational data, mostly based on either tree-based methods [16, 8, 7]

, or deep learning methods

[14, 15]. Most of the models that were previously developed for estimating causal effects relied on regression models that treat the treatment assignment variables (i.e. whether or not the intervention was applied to the subject) as an extended dimension in the feature space. Examples of such models include Bayesian additive regression trees (BART) [8], causal forests [7], balanced counterfactual regression [18], causal multivariate additive regression splines (MARS) [19], propensity-dropout networks [15]

, or random forests

[21]. In all these methods, augmenting the treatment assignment variable to the feature space introduces a mismatch between the training and testing distribution (i.e. covariate shift induced by the selection bias [18]

). The different methods followed different approaches for handling the selection bias: causal forests use estimates of the propensity score for deriving a tree splitting rule that attempts to balance the treated and control populations, propensity-dropout networks use larger dropout regularization for training points with very high or very low propensity scores, whereas balanced counterfactual regression uses deep neural networks to learn a balanced representation (i.e. a feature transformation) that tries to alleviate the effect of the selection bias. Bayesian methods, like BART, do not address selection bias since the Bayesian posterior naturally incorporates uncertainty in regions of poor overlap in the feature space. As we show later in Sections

VI and VIII, our analysis and experimental results indicated that, by augmenting the treatment assignment variable to the feature space, all these methods achieve a suboptimal information rate.

Our analysis is related to a long strand of literature that studied frequentist (minimax) estimation rates, or posterior contraction rates in standard regression problems [22, 23, 24, 25, 26]. In Theorem 2, we show that the optimal information rate for causal inference has the same form as the optimal minimax estimation rate obtained by Stone in [25] for standard nonparametric regression problems, when the true regression function is set to be the rougher of the two potential outcomes functions. Our analysis for the achievable information rates for Gaussian process priors uses the results by van Zanten and van der Vaart in [27].

## Iv Bayesian Nonparametric Causal Inference from Observational Data

In this section, we provide a general description for the Neyman-Rubin causal model considered in this paper (Subsection IV-A), and present the Bayesian nonparametric inference framework under study (Subsection IV-B).

### Iv-a The Neyman-Rubin Causal Model

Consider a population of subjects with each subject possessing a -dimensional feature . An intervention is applied to some subjects in the population: subject

’s response to the intervention is a random variable denoted by

, whereas the subject’s natural response when no intervention is applied is denoted by . The two random variables, , are known as the potential outcomes. The causal effect of the intervention (treatment) on subject is characterized through the difference between the two (random) potential outcomes , and is generally assumed to be dependent on the subject’s features . Hence, we define the individualized treatment effect (ITE) for a subject with a feature as

 T(x)=E[Y(1)i−Y(0)i∣∣Xi=x]. (1)

Our goal is to estimate the function from an observational dataset , which comprises independent samples of the random tuple where is an intervention assignment indicator that indicates whether or not subject has received the intervention (treatment) under consideration. The outcomes and are known in the literature as the factual and the counterfactual outcomes, respectively [18], [28]. Intervention assignments generally depend on the subjects’ features, i.e. . This dependence is quantified via the conditional distribution , also known as the propensity score of subject [13], [11]. In the rest of this paper, we denote the propensity score of a feature point as .

The observational dataset is drawn from a joint density with a probability space that supports the following standard conditions [11, 12]:

• Condition 1 (unconfoundedness): Treatment assignment decisions are independent of the outcomes given the subject’s features, i.e. .

• Condition 2 (overlap): Every subject has a non-zero chance of receiving the treatment, and treatment assignment decisions are non-deterministic, i.e. .

### Iv-B Bayesian Nonparametric Causal Inference

Throughout this paper, we consider the following

signal-in-white-noise

random design regression model for the potential outcomes:

 Y(ω)i=fω(Xi)+ϵi,ω,ω∈{0,1}, (2)

where is a Gaussian noise variable. It follows from (2) that and hence the ITE is given by . The functions and correspond to the response surfaces over the subjects’ feature space with and without the intervention; the difference between these two surfaces correspond to the individualized effect of the intervention. We assume that is a compact metric space (e.g. bounded, closed sets in ), and that the true response surfaces , are totally bounded functions in a space of “smooth” or “regular” functions , where is a smoothness (or regularity) parameter. This roughly means that is -differentiable; precise definitions for -regular function classes will be provided in subsequent Sections.

A Bayesian procedure for estimating the ITE function entails specifying a prior distribution over the response surfaces and , which in turn induces a prior over . The nonparametric nature of inference follows from the fact that is a prior over functions, and hence the estimation problem involves an infinite-dimensional parameter space. For a given prior

, the Bayesian inference procedure views the observational dataset

as being sampled according to the following generative model:

 f0,f1 ∼Π,Xi∼dP(Xi=x) ωi|Xi=x ∼Bernoulli(γ(x)) Y(ωi)i|f0,f1,ωi ∼N(fωi(x),σ2ωi),i=1,...,n. (3)

Since we are interested in estimating an underlying true ITE function , we will analyze the Bayesian causal inference procedure within the frequentist setup, which assumes that the subjects’ outcomes are generated according to the model in (3) for a given true (and fixed) regression functions and . That is, in the next Subsection, we will assess the quality of a Bayesian inference procedure by quantifying the amount of information the posterior distribution has about the true ITE function . This type of analysis is sometimes referred to as the “Frequentist-Bayes” analysis [29].

### Iv-C Information Rates

How much information about the true causal effect function is conveyed in the posterior ? A natural measure of the “informational quality” of a posterior is the information-theoretic criterion due to Barron [30], which quantifies the quality of a posterior via the Kullback-Leibler (KL) divergence between the posterior and true distributions. In that sense, the quality (or informativeness) of the posterior at a feature point is given by the KL divergence between the posterior distribution at , , and the true distribution of . The overall quality of a posterior is thus quantified by marginalizing the pointwise KL divergence over the feature space . For a prior , true responses and , propensity function , and observational datasets of size , the expected KL risk is:

 Dn(Π;f0,f1,γ)=Ex[EDn[DKL(P(x)∥QDn(x))]], (4)

where is the KL divergence222The KL divergence between probability measures and is given by [31]. The existence of the Radon-Nikodym derivative in (4) is guaranteed since and are mutually absolutely continuous. is the true distribution of , i.e. , and is the posterior distribution of , given by:

 QDn(x) =dΠn(Y(1)−Y(0)|X=x,Dn) (⋆)=dΠn(T(x)+N(0,σ20+σ21)|Dn), (∗)=∫N(T(x),σ20+σ21)dΠn(T(x)|Dn), (5)

where steps and in (5) follow from the sampling model in (3). The expected KL risk in (4) marginalizes the pointwise KL divergence over the distribution of the observational dataset (generated according to (3)), and the feature distribution . Variants of the expected KL risk in (4) have been widely used in the analysis of nonparametric regression models, usually in the form of (cumulative) Cesàro averages of the pointwise KL divergence at certain points in the feature space [32, 30, 33, 27]. Assuming posterior consistency, the (asymptotic) dependence of on reflects the rate by which the posterior “sandwiches” the true ITE function everywhere in . An efficient causal inference procedure would exhibit a rapidly decaying : this motivates the definition of an “information rate”.

Definition 1. (Information Rate) We say that the information rate of a Bayesian causal inference procedure is , for a sequence , if is .   ∎

Note that  is the equivalence class of all sequences that have the same asymptotic rate of convergence. In the rest of this paper, we use the notation to denote the information rate achieved by a prior in a causal inference problem instance described by the tuple . The notion of an “information rate” for a Bayesian causal effect inference procedure is closely connected to the frequentist estimation rate (with respect to the loss) with as the estimand [30, 34, 35]. The following Theorem establishes such a connection.

Theorem 1. Let  be the KL risk of a given Bayesian causal inference procedure, then we have that

 Dn(Π;f0,f1,γ)≤¯σ⋅EDn[∥∥EΠ[T|Dn]−T∥∥22],

for some , where is the -norm with respect to the feature distribution, i.e. .

Proof. Recall from (4) that is given by

 Dn(Π;f0,f1,γ)=Ex[EDn[DKL(P(x)∥QDn(x))]].

Based on (5), can be written as

 DKL(P(x)∥∥∥∫N(T(x),σ20+σ21)dΠn(T(x)|Dn)),

which by the convexity of the KL divergence in its second argument, and using Jensen’s inequality, is bounded above by

 DKL(P(x)∥∥N(EΠ[T(x)|Dn],σ20+σ21)).

From the regression model in (2), we know that , and hence KL divergence above can be written as

 DKL(N(T(x),σ20+σ21)∥∥N(EΠ[T(x)|Dn],σ20+σ21)),

which is given by since [31]. Hence, the expected KL risk is bounded above as follows

 Dn(Π;f0,f1,γ) ≤EDn[Ex[|EΠ[T(x)|Dn]−T(x)|22(σ20+σ21)]],

for all .   ∎

Theorem 1 says that the information rate of causal inference lower bounds the rate of convergence of the risk of the sequence of estimates induced by the posterior mean . The risk was dubbed the precision in estimating heterogeneous effects (PEHE) by Hill in [8], and is the most commonly used metric for evaluating causal inference models [36, 8, 18, 21, 28]. Theorem 1 tells us that the PEHE is , and hence the Bayesian information rate presents a limit on the achievable performance of frequentist estimation. In that sense, the asymptotic behavior of

is revealing of both the informational quality of the Bayesian posterior, as well as the convergence rates of frequentist loss functions.

## V Optimal Information Rates for Bayesian Causal Inference

In this Section, we establish a fundamental limit on the information rate that can be achieved by any sequence of posteriors for a given causal inference problem. Let the achievable information rate for a given prior and function classes and , denoted by , be the rate obtained by taking the supremum of the information rate over functions in and . This is a quantity that depends only on the prior but not on the specific realizations of and . The optimal information rate is defined to be the maximum worst case achievable information rate for all functions in and , and is denote by . While the information rate characterizes a particular instance of a causal inference problem with and a given Bayesian prior , the optimal information rate is an abstract (prior-independent) measure of the “information capacity” or the “hardness” of a class of causal inference problems (corresponding to response surfaces in and ). Intuitively, one expects that the limit on the achievable information rate will be higher for smooth (regular) response surfaces and for propensity functions that are close to 0.5 everywhere in . Theorem 2 provides a detailed characterization for the optimal information rates in general function spaces. Whether or not the Bayesian inference procedure achieves the optimal information rate will depend on the prior . In the next Section, we will investigate different design choices for the prior , and characterize the “capacity-achieving” priors that achieve the optimal information rate.

In Theorem 2, we will use the notion of metric entropy to characterize the “size” of general (nonparametric or parametric) function classes. The metric entropy of a function space is given by the logarithm of the covering number of that space with respect to a metric , i.e. . A formal definition for covering numbers is provided below.

Definition 2. (Covering number) A -cover of a given function space with respect to a metric is a set of functions such that for any function , there exists some such that . The -covering number of is [29]

That is, the covering number of a function class is the number of balls (in a given metric ) of a fixed radius required to cover it. Throughout this paper, the metric entropy will always be evaluated with respect to the norm. In the light of the definition above, the metric entropy can be thought of as a measure of the complexity of a function class; smoother function classes would generally display a smaller metric entropy. All function classes considered in this paper have finite metric entropy. Figure 1 shows a pictorial depiction for two exemplary function classes and for the treated and control responses, respectively. In this depiction, is smaller than , hence the -cover of contains more balls than the -cover of , and it follows that has a larger metric entropy. This manifests in the control response surface being less smooth than the treated response surface . This is usually the case for real-world data on responses to medical treatments, where the untreated population typically display more heterogeneity than the treated population [20].

We now present the main result of this Section. In the following Theorem, we provide a general characterization for the optimal information rates of Bayesian causal inference when the treated and control surfaces are known to belong to function classes and .

Theorem 2. (Optimal Information Rates) Suppose that  is a compact subset of , and that Conditions 1-2 hold. Then the optimal information rate is where is the solution for .

Proof. See Appendix A.     ∎

Theorem 2 characterizes  in terms of the metric entropies  and  for general function classes and . We used the local Fano method to derive an information-theoretic lower bound on the information rate that can be achieved by any estimator [30]. The characterization in Theorem 2 implies that selection bias has no effect on the achievable information rate. (Thus, in the rest of the paper we drop the dependency on when referring to .) That is, as long as the overlap condition holds, selection bias does not hinder the information rate that can be achieved by a Bayesian causal inference procedure, and we can hope to find a good prior that achieves the optimal rate of posterior contraction around the true ITE function irrespective of the amount of bias in the data. Theorem 2 also says that the achievable information rate is bottle-necked by the more “complex” of the two response surfaces and . Hence, we cannot hope to learn the causal effect at a fast rate if either of the treated or the control response surfaces are rough, even when the other surface is smooth.

The general characterization of the optimal information rates in Theorem 2 is cast into specific forms by specifying the regularity classes and . Table I demonstrates the optimal information rates for standard function classes, including analytic, smooth, Hölder [30, Section 6.4], Sobolev [37], Besov [30, Section 6.3], and Lipschitz functions [38, 39]. A rough description for the optimal information rates of all nonparametric function spaces (-smooth, Hölder, Sobolev, Besov, and Lipschitz) can be given as follows. If is -regular (e.g. -differentiable) and is -regular, then the optimal information rate for causal inference is

 I∗n(Fα0,Fα1)≍n−2(α0∧α1)2(α0∧α1)+d, (6)

where  denotes asymptotic equivalence, i.e. in Bachmann-Landau notation,  if . That is, the regularity parameter of the rougher response surface, i.e. , dominates the rate by which any inference procedure can acquire information about the causal effect. This is because, if one of the two response surfaces is much more complex (rough) than the other (as it is the case in the depiction in Figure 1), then the ITE function would naturally lie in a function space that is at least as complex as the one that contains the rough surface. Moreover, the best achievable information rate depends only on the smoothness of the response surfaces and the dimensionality of the feature space, and is independent of the selection bias. Due to the nonparametric nature of the estimation problem, the optimal information rate for causal inference gets exponentially slower as we add more dimensions to the feature space [23, 35].

Note that in Theorem 2, we assumed that for the surfaces and , all of the dimensions of are relevant to the two response surfaces. Now assume that surfaces and have relevant feature dimensions in the sets and , respectively, where [40], then

 I∗n(Fα0P0,Fα1P1)≍n−2α02α0+p0∨n−2α12α1+p1, (7)

where denotes the space of functions in for which the relevant dimensions are in . In (7), the rate is dominated by the more complex response surface, where “complexity” here is manifesting as a combination of the number of relevant dimensions and the smoothness of the response over the those dimensions. One implication of (7) is that the information rate can be bottle-necked by the smoother of the response surfaces and , if such a response has more relevant dimensions in the feature space333A more general characterization of the information rate would consider the case when the responses have different smoothness levels on each of the -dimensions. Unfortunately, obtaining such a characterization is technically daunting.. More precisely, if , then the information rate can still be bottle-necked by the smoother surface as long as .

Since the optimal (Bayesian) information rate is a lower bound on the (frequentist) minimax estimation rate (Theorem 1), we can directly compare the limits of estimation in the causal inference setting (established in Theorem 2) with that of the standard nonparametric regression setting. It is well known that the optimal minimax rate for estimating an -regular function is ; a classical result due to Stone [25, 26]. The result of Theorem 2 (and the tabulated results in Table I) asserts that the causal effect estimation problem is as hard as the problem of estimating the “rougher” of the two surfaces and in a standard regression setup.

The fact that selection bias does not impair the optimal information rate for causal inference is consistent with previous results on minimax-optimal kernel density estimation under selection bias or length bias

[41, 42, 43, 44]. In these settings, selection bias did not affect the optimal minimax rate for density estimation, but the kernel bandwidth optimization strategies that achieve the optimal rate needed to account for selection bias [43, 45]. In Section VI, we show that the same holds for causal inference: in order to achieve the optimal information rate, the strategy for selecting the prior needs to account for selection bias. This means that even though the optimal information rates in the causal inference and standard regression settings are similar, the optimal estimation strategies in both setups are different.

## Vi Rate-adaptive Bayesian Causal Inference

In Section V, we have established the optimal rates by which any Bayesian inference procedure can gather information about the causal effect of a treatment from observational data. In this Section, we investigate different strategies for selecting the prior , and study their corresponding achievable information rates. (An optimal prior is one that achieves the optimal information rate .) A strategy for selecting comprises the following three modeling choices:

1. How to incorporate the treatment assignment variable in the prior ?

2. What function (regularity) class should the prior place a probability distribution over?

3. What should be the smoothness (regularity) parameter of the selected function class?

The first modeling decision involves two possible choices. The first choice is to give no special role to the treatment assignment indicator , and build a model that treats it in a manner similar to all other features by augmenting it to the feature space . This leads to models of the form

 f(x,ω):X×{0,1}→R.

We refer to priors over models of the form above as Type-I priors. The second modeling choice is to let index two different models for the two response surfaces. This leads to models of the form where and for some . We refer to priors over models of the form as Type-II priors.

Type I and II priors induce different estimators for . The posterior mean estimator for a Type-I prior is given by

 ^Tn(x)=EΠ[f(x,1)|Dn]−EΠ[f(x,0)|Dn],

whereas for a Type-II prior, the posterior mean ITE estimator is given by where . Figure 2 is a pictorial depiction for the posterior mean ITE estimates obtained via Type-I and Type-II priors.

The main difference between Type-I and Type-II priors is that the former restricts the smoothness of on any feature dimension to be the same for and . This also entails that the relevant dimensions for the two response surfaces ( and ) need to be the same under a Type-I prior. (This is a direct consequence of the fact that Type-I priors give no special role to the variable .) As a result, a priori knowledge (or even data-driven knowledge) on the differences between responses and (e.g. in terms of smoothness levels or relevant dimensions) cannot be incorporated in a Type-I prior. Type-II priors can incorporate such information as they provide separate models for and . However, while Type-I priors give a posterior of and using a joint model that is fitted using all the observational data, Type-II priors use only the data for one population to compute posteriors of one response surface, which can be problematic if the two populations posses highly unbalanced relative sizes (e.g. treated populations are usually much smaller than control populations [1]).

In order to better illustrate the difference between Type-I and Type-II priors, we look at their simpler parametric counterparts. A Type-I linear regression model defines

as a linear function , where and is a Gaussian noise variable. (Here the Type-I prior is a prior on the model coefficients and ) As we can see, this model restricts the two responses and to have the exact same interactions with the features through the coefficients in . If we know a priori that and have different “slopes” or different relevant dimensions, we cannot incorporate this knowledge into the model. What would such a model learn? Assuming consistency, the estimated ITE function would be everywhere in . Thus, the restricted nature of a Type-I parametric model led to a constant (non-individualized) estimate of . On the contrary, a Type-II model of the form would allow for learning a linear estimate of the true function , with potentially different relevant dimensions for both surfaces. However, Type-II model will only use data with to fit the model for .

Unlike their parametric counterparts, the nonparametric Type-I and II priors can (in general) learn the ITE function consistently, but how do their information rates compare? Subsection VI-A studies the achievable information rates for “oracle” Type-I and Type-II priors that are informed with the true smoothness parameters ( and ) and relevant dimensions of the function classes and . In Subsection VI-B, we study the (more realistic) setting when and are unknown, and investigate different strategies for adapting the prior to the smoothness of the treated and control response surface in a data-driven fashion.

### Vi-a Oracle Priors

In this Subsection, we assume that the true smoothness and relevant dimensions for and are known a priori. In the following Theorem, we show that Type-II priors are generally a better modeling choice than Type-I priors.

Theorem 3. (Sub-optimality of Type-I priors) Let  be the space of all Type-I priors that give probability one to draws from , and let  be the space of all Type-II priors that give probability one to draws from . If , , and , then

 infβ0,β1infΠ∈Π∘∘β0,1In(Π;Hα0P0,Hα1P1)≍I∗n(Hα0P0,Hα1P1), infβinfΠ∈Π∘βIn(Π;Hα0P0,Hα1P1)≳I∗n(Hα0P0,Hα1P1),

where denotes asymptotic inequality.

Proof. See Appendix B.     ∎

Theorem 3 says that if , then the information rate that any Type-I prior can achieve is always suboptimal, even if we know the relevant dimensions and the true smoothness of the response surfaces and . The Theorem also says that an oracle Type-II prior can achieve the optimal information rate. When the the surfaces and have the same relevant dimensions and the same smoothness, the two priors achieve the same rate. More precisely, the best achievable information rate for a Type-I prior is given by

 infβinfΠ∈Π∘βIn(Π;Hα0P0,Hα1P1)=Θ(n−2(α0∧α1)2(α0∧α1)+|P0∪P1|),

whereas for Type-II priors, the best achievable rate is

 infβ0,β1infΠ∈Π∘∘β0,1In(Π;Hα0P0,Hα1P1)=Θ(n−2α02α0+|P0|∨n−2α12α1+|P1|).

We note that most state-of-the-art causal inference algorithms, such as causal forests [7], Bayesian additive regression trees [8], and counterfactual regression [18, 28], use Type-I regression structures for their estimates. The sub-optimality of Type-I priors, highlighted in Theorem 3, suggests that improved estimates can be achieved over state-of-the-art algorithms via a Type-II regression structure.

We now focus on the second and third modeling questions: on what function space should the prior be placed, and how should we set the regularity of the sample paths drawn from the prior? In the rest of this Section, we assume that the true response surfaces reside in Hölder spaces. One possible prior over Hölder balls is the Gaussian process , with a Matérn covariance kernel and a smoothness parameter . (Draws from such a prior are almost surely in a -Hölder function space [46, 47].) In the following Theorem, we characterize the information rates achieved by such a prior.

Theorem 4. (The Matching Condition) Suppose that  and  are in Hölder spaces  and , respectively, and let

 Π(β0,β1)=(GP(Mat\'{e}rn(β0)),GP(Mat\'{e}rn(β1))),

be a Type-II prior over . If , then we have that

 In(Π(β0,β1);Hα0,Hα1)≲n−2β02β0+d∨n−2β12β1+d,

where posterior consistency holds only if , and .

Proof. See Appendix C.     ∎

For a Type-I prior , the upper bound on is , with consistency holding for . Using the results of the paper by Castillo in [48], the upper bound in Theorem 4 can be shown to be tight. Recall that the optimal information rate for causal inference in Hölder spaces is (Table I). Theorem 4 quantifies the information rates achieved by a Type-II prior with smoothness levels and . The Theorem says that a prior can achieve the optimal information rate if and only if it captures the smoothness of the rougher of the two response surfaces. This gives rise to the following matching condition that a prior requires in order to provide an optimal rate:

 βω=α0∧α1,αω≤β1−ω≤α1−ω,ω=argminw∈{0,1}αw.

That is, the regularity of the prior needs to match the rougher of the two surfaces, and the prior over the smoother surface needs to be at least as smooth as the rougher surface. Consistency holds only if the prior is at least as smooth as the true response, since otherwise the response surfaces would not be contained in the support of the prior. Note that Theorem 4 assumes that the true response surfaces exhibit a Hölder-type regularity, and that the prior is placed on a reproducing kernel Hilbert space with a particular kernel structure. While proving that the matching condition holds for general priors and function spaces is technically daunting, we believe that (given the results in Table I) the matching condition in Theorem 4 would hold for other notions of regularity (e.g. Sobolev, Lipschitz, etc), and for a wide range of practical priors. For instance, Theorem 4 holds for Gaussian processes with re-scaled squared exponential kernels [49].

To sum up this Subsection, we summarize the conclusions distilled from our analyses of the achievable rates for oracle priors. Priors of Type II are generally a better design choice compared to priors of Type I, especially when the two response surfaces exhibit different forms of heterogeneity. In order to achieve the optimal information rate, a typical condition is that the regularity of the prior needs to match that of the rougher of the two response surfaces. Since in practice we (generally) do not know the true smoothness of the response surfaces, we cannot build a prior that satisfies the matching condition. Practical causal inference thus requires adapting the prior to the smoothness of the true function in a data-driven fashion; we discuss this in the next Subsection.

Note that, unlike in standard nonparametric regression, adapting the regularity of the prior for the causal inference inference task entails a mixed problem of testing and estimation, i.e. we need to test whether is less than , and then estimate (or ). Hence, one would expect that the prior adaptation methods used in standard regression problems would not necessarily suffice in the causal inference setup. Prior adaptation can be implemented via hierarchical Bayes or empirical Bayes methods. Hierarchical Bayes methods specify a prior over (also known as the hyper-prior [24]), and then obtain a posterior over the regularity parameters in a fully Bayesian fashion. Empirical Bayes simply obtains a point estimate of , and then conducts inference via the prior specified by . We focus on empirical Bayes methods since the hierarchical methods are often impractically expensive in terms of memory and computational requirements. A prior induced by (obtained via empirical Bayes) is called rate-adaptive if it achieves the optimal information rate, i.e. .

In the rest of this Subsection, we show that marginal likelihood maximization, which is the dominant strategy for empirical Bayes adaptation in standard nonparametric regression [50, 24], can fail to adapt to in the general case when . (This is crucial since in most practical problems of interest, the treated and control response surfaces have different levels of heterogeneity [20].) We then propose a novel information-based empirical Bayes strategy, and prove that it asymptotically satisfies the matching condition in Theorem 4. Finally, we conclude the Subsection by identifying candidate function spaces over which we can define the prior such that we are able to both adapt to functions in Hölder spaces, and also conduct practical Bayesian inference in an algorithmically efficient manner.

#### Vi-B1 Information-based Empirical Bayes

To see why the marginal likelihood-based empirical Bayes method may fail in adapting priors for causal inference, consider the following example. Suppose that and where . Let be a Type-I data-driven prior, where is an empirical Bayes estimate of . For the likelihood-based empirical Bayes, is obtained by maximizing the marginal likelihood with respect to . Note that since and possess different smoothness parameters, then the “true” model for generating is characterized by a likelihood function . Assume that the true model is identifiable, i.e. the mapping is one-to-one. Type-I priors re-parametrize the observation model so that the likelihood function is parametrized with a single smoothness parameter . Hence, as long as , the new parametrization renders an unidentifiable model, since the mapping is not one-to-one (i.e. different combinations of and can map to the same ). This means that in this case likelihood-based empirical Bayes would never satisfy the matching condition in Theorem 4, even in the limit of infinite samples (). In most practical Bayesian models (e.g. Gaussian processes), the empirical Bayes estimate will be in the interval with high probability as depicted in Figure 2(a). This means that with high probability, the likelihood-based empirical Bayes method will prompt an oversmoothed prior, from which all draws are smoother than the true ITE function, leading to a suboptimal information rate.

The failure of likelihood-based empirical Bayes in the causal inference setup is not surprising as maximum likelihood adaptation is only optimal in the sense of minimizing the Kullback-Leibler loss for the individual potential outcomes. Optimal prior adaptation in our setup should be tailored to the causal inference task. Hence, we propose an information-based empirical Bayes scheme in which, instead of maximizing the marginal likelihood, we pick the smoothness level that minimizes the posterior Bayesian KL divergence, i.e.

 ^βn =argminβEΠ(.|Dn,β)[Dn(Π(β);f0,f1)|Dn] =argminβEΠ(.|Dn,β)[Ex[DKL(P(x)∥QDn(x))]]. (8)

The information-based empirical Bayes estimator is simply a Bayesian estimator of with the loss function being the posterior KL risk in (4). Unlike the likelihood-based method, the objective in (8) is an direct measure for the quality of causal inference conducted with a prior . In the following Theorem, we show that asymptotically satisfies the matching condition in Theorem 4.

Theorem 5. (Asymptotic Matching) Suppose that  and  belong to the Hölder spaces  and , respectively, and let be a prior over Hölder space with order . If is obtained as in (8) using cross-validation, then under certain regularity conditions we have that .

Proof. See Appendix D.     ∎

Theorem 5 says that the information-based empirical Bayes estimator is consistent. That is, the estimate will eventually converge to as . Note that this is a weaker result than adaptivity: consistency of does not imply that the corresponding prior will necessarily achieve the optimal information rate. However, the consistency result in Theorem 5 is both strongly suggestive of adaptivity, and also indicative of the superiority of the information-based empirical Bayes method to the likelihood-based approach.

Note that, while the information-based empirical Bayes approach guarantees the asymptotic recovery of

, it can still undersmooth the prior for the smoother response surface. This can problematic if we wish the posterior credible interval on

to be “honest”, i.e. possess frequentist coverage [51, 7, 36]. A more flexible Type-II prior that assigns different smoothness parameters and