# Semiparametric Bayesian causal inference using Gaussian process priors

We develop a semiparametric Bayesian approach for estimating the mean response in a missing data model with binary outcomes and a nonparametrically modelled propensity score. Equivalently we estimate the causal effect of a treatment, correcting nonparametrically for confounding. We show that standard Gaussian process priors satisfy a semiparametric Bernstein-von Mises theorem under smoothness conditions. We further propose a novel propensity score-dependent prior that provides efficient inference under strictly weaker conditions. We also show that it is theoretically preferable to model the covariate distribution with a Dirichlet process or Bayesian bootstrap, rather than modelling the covariate density using a Gaussian process prior.


## 1 Introduction

In many applications, one wishes to make inference concerning the causal effect of a treatment or condition. Examples include healthcare, online advertising and the assessment of public policies, among many others. The available data are often observational rather than the result of a carefully planned experiment or trial. The notion of “causal” then needs to be carefully defined and the statistical analysis must take into account other possible explanations for the observed outcomes.

A common framework for causal inference is the potential outcome setup [23, 31]. In this framework, every individual possesses two “potential outcomes”, corresponding to the individual’s outcomes with and without treatment. The treatment effect, which we wish to estimate, is thus the difference between these two potential outcomes. Since we only observe one out of each pair of outcomes, and not the corresponding “counterfactual” outcome, we do not directly observe samples of the treatment effect. Because in practice, particularly in observational studies, individuals are assigned treatments in a biased manner, a simple comparison of actual cases (i.e. treated individuals) and controls may be misleading due to selection bias. A typical way to overcome this is to gather the values of covariate variables that influence both outcome and treatment assignment (“confounders”) and apply a correction based on the “propensity score”, which is the conditional probability that a subject is treated as a function of the covariate values. Under the assumption that outcome and treatment assignment are independent given the covariates, the causal effect of treatment can be identified from the data. Popular estimation methods include “propensity score matching” [40, 38] and “double robust methods” [31, 37, 39]. In this paper we follow the approach of modelling the propensity score function nonparametrically and posing the estimation of the treatment effect as a problem of estimating a functional on a semiparametric model [5, 42, 48]. Our methodological novelty is to follow a semiparametric Bayesian approach, putting nonparametric priors on the propensity score and/or on the unknown response function and the covariate distribution, possibly incorporating an initial estimator of the first function.

For notational simplicity we in fact consider the missing data model, which is mathematically equivalent to observing one arm of the causal setup. The model is also standard and widely studied on its own in biostatistical applications, where response variables are frequently missing, and is a template for a number of other models [32, 43]. For a recent review on estimating an average treatment effect over a (sub)population, a problem that has received considerable recent attention in the econometrics, statistics and epidemiology literatures, see Athey et al. [3].

We suppose that we observe i.i.d. copies $X_1,\dots,X_n$ of a random variable $X = (Z, R, RY)$, where $R$ and $Y$ take values in the two-point set $\{0,1\}$ and are conditionally independent given $Z$. We think of $Y$ as the outcome of a treatment and are interested in estimating its expected value $\chi = EY$. The problem is that the outcome $Y$ is observed only if the indicator variable $R$ takes the value 1, as otherwise the third component of $X$ is equal to 0. Whether the outcome is observed or not may well depend on its value, which precludes taking a simple average of the observed outcomes as an estimator for $\chi$. The covariate $Z$ is collected to correct for this problem; it is assumed to contain exactly the information that explains why the response is not observed, except for purely random causes, so that the outcome $Y$ and missingness indicator $R$ are conditionally independent given $Z$, i.e. the outcomes are missing at random (relative to $Z$). The connection to causal inference is that we may think of $Y$ as a “counterfactual” outcome if a treatment were assigned ($R = 1$) and its mean $\chi$ as “half” the treatment effect under the assumption of unconfoundedness.
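As a toy illustration of the bias problem just described (our sketch, not from the paper): we simulate the model with a hypothetical one-dimensional covariate $Z$, an illustrative propensity score $\Psi(2Z-1)$ and response probability $\Psi(3Z-1.5)$, and compare the naive complete-case average with an inverse-propensity-weighted correction that uses the true propensity score.

```python
import numpy as np

# Toy simulation of the missing data model (illustrative functional forms).
rng = np.random.default_rng(0)
n = 200_000

def psi(x):
    """Logistic link Psi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

Z = rng.uniform(0.0, 1.0, n)                # covariate
prop = psi(2 * Z - 1)                       # P(R = 1 | Z): missingness depends on Z
R = rng.binomial(1, prop)                   # missingness indicator
Y = rng.binomial(1, psi(3 * Z - 1.5))       # outcome, missing at random given Z

chi_true = psi(3 * Z - 1.5).mean()          # Monte Carlo value of chi = E[Y]
naive = Y[R == 1].mean()                    # complete-case average: biased upwards here
ipw = np.mean(R * Y / prop)                 # inverse-propensity-weighted estimate
print(chi_true, naive, ipw)
```

Because observed individuals here have systematically larger $Z$, and hence a larger response probability, the complete-case average overestimates $\chi$, while the weighted estimate recovers it.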

The model for a single observation can be described by the distribution of $Z$ and the two conditional distributions of $R$ and $Y$ given $Z$. In this paper we model these three components nonparametrically. We investigate a Bayesian approach, putting a nonparametric prior on the three components, in particular Gaussian process and Dirichlet process priors. We then consider the mean response $\chi = EY$ as a functional of the three components and study the induced marginal posterior distribution of $\chi$ from a frequentist perspective. The aim is to derive conditions under which this marginal posterior distribution satisfies a Bernstein–von Mises theorem in the semiparametric sense, thus yielding recovery of the mean response at a $\sqrt n$-rate and asymptotic efficiency in the semiparametric sense.

In recent years Bayesian approaches have become increasingly popular due to their excellent empirical performance for such problems [22, 21, 41, 54, 19, 20, 1]. However, despite their increasing use in practice, there have been few corresponding theoretical results. Indeed, early work on semiparametric Bayesian approaches to this specific missing data problem produced negative results, proving that many common classes of priors, or more generally likelihood-based procedures, produce inconsistent estimates assuming no smoothness on the underlying parameters, see the results and discussion in [36, 29]. We attempt to shed light on this apparent gap between the excellent empirical performance observed in practice and the potentially disastrous theoretical performance.

We consider various prior schemes. A straightforward method of prior modelling is to choose the three component parameters a priori independent. We show that under sufficient smoothness assumptions, efficient estimation of $\chi$ using common product priors is indeed possible. These product priors fall within the framework of [36], thereby showing that their negative result is tied to the (almost) fully nonparametric model they consider. However, it has also been argued that product priors may not be an optimal choice [29]. As the likelihood factorizes over the three component parameters, a product prior leads to a product posterior, and does not allow the components to share information, which may lead to unnecessarily harsh smoothness requirements on the true parameters. For this reason we also consider dependent priors. In particular, we propose a novel Gaussian process prior that incorporates an estimate of the propensity score function, and show that it performs efficiently under strictly weaker conditions than standard product priors. Unlike for the latter priors, extra regularity of the binary regression function can compensate for low regularity of the propensity score, that is, one direction of so-called “double robustness” [39, 33]. A related construction using Bayesian additive regression trees (BART) has been shown to work well empirically [20]. It can thus be both practically and theoretically advantageous to employ propensity score-dependent priors.

For the estimation of $\chi$ at $\sqrt n$-rate, smoothness of the distribution of the covariate $Z$ is not needed. Therefore, we also consider modelling this distribution by a Dirichlet process prior. We show that this is theoretically (and computationally) preferable to modelling a density, even when the smoothness of the latter is modelled correctly. Modelling the density, whether using a Gaussian process prior or otherwise, can induce a non-vanishing bias, unless the underlying parameters possess extra smoothness. Heuristic arguments suggest that this effect becomes significantly more pronounced for moderate- or high-dimensional covariates.

The papers [32, 35] consider estimation of the parameter $\chi$ under minimal smoothness conditions on the parameters. By using estimating equations, the authors construct estimators that attain an optimal rate of convergence slower than $n^{-1/2}$ in cases where the component parameters have low smoothness. Furthermore, they construct estimators that attain a $\sqrt n$-rate under minimal smoothness conditions, less stringent than in the earlier literature, using higher order estimating equations. It is unclear whether similar results can be obtained using a Bayesian approach. The constructions in the present paper can be compared to the estimators obtainable from linear (or first order) estimating equations. It remains to be seen whether Bayesian modelling is capable of performing the bias corrections necessary to handle true parameters of low smoothness in a manner similar to higher order estimating equations.

For smooth parametric models, the theoretical justification for posterior-based inference is provided by the Bernstein–von Mises theorem or property (hereafter BvM). This property says that as the number of observations increases, the posterior distribution is approximately a Gaussian distribution centered at an efficient estimator of the true parameter and with covariance equal to the inverse Fisher information, see Chapter 10 of [48]. While such a result does not hold in full generality in infinite dimensions [13], semiparametric analogues can establish the BvM property for the marginal posterior of a finite-dimensional parameter in the presence of an infinite-dimensional nuisance parameter [9, 30, 6, 10]. In such cases, care is typically required in the choice of prior assigned to the nonparametric part, a feature that manifests itself in particular in our setting through the choice of prior for the distribution of the covariate $Z$.

While our first main theorem is in the spirit of earlier work, extending it to a structured semiparametric model, our other results are more innovative, in both a modelling and a technical sense. The second theorem uses a combination of a Gaussian and a Dirichlet process prior, and the third theorem concerns a novel prior that incorporates a random perturbation in the least favourable direction, thereby removing a potential bias.

An important consequence of the semiparametric BvM is that credible sets for the functional are asymptotically confidence regions with the same coverage level. The Bayesian approach thus automatically provides access to uncertainty quantification once one can sample from the posterior distribution. Obtaining confidence statements for average treatment effects is a current area of research, and there has been recent progress in this direction, for example using random forests and regression tree methods [2, 53]. Our results show that Bayesian approaches can also yield valid frequentist uncertainty quantification in this setting.

The paper is structured as follows. In Section 2, we provide a review of the model, including the relevant semiparametric theory. Section 3 contains all the results, with discussion in Section 4 and the main proofs in Section 5. Auxiliary results and posterior contraction results are deferred to Sections 6 and 7, respectively.

### 1.1 Notation

The notation $\lesssim$ denotes inequality up to a constant that is fixed throughout, and $\lfloor \alpha \rfloor$ is the largest integer strictly smaller than $\alpha$. The symbol $\Psi$ is used for the logistic function, given by $\Psi(t) = 1/(1+e^{-t})$. For probability densities $p$ and $q$ with respect to some dominating measure $\mu$, $h(p,q) = (\int (\sqrt p - \sqrt q)^2\, d\mu)^{1/2}$ is the Hellinger distance, $K(p,q) = \int p \log(p/q)\, d\mu$ is the Kullback–Leibler divergence and $V(p,q) = \int p\, |\log(p/q)|^2\, d\mu$. We denote by $H^\alpha$ and $C^\alpha$ the $\alpha$-Sobolev and Hölder spaces, respectively. For i.i.d. random variables $X_1,\dots,X_n$ with common law $P$, the notations $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$ and $\mathbb{G}_n = \sqrt n(\mathbb{P}_n - P)$ are the empirical measure and process, respectively. The notation $\mathcal{L}(X)$ denotes the law of a random element $X$. We often drop the index $n$ in the product measure $P_\eta^n$, writing $P_\eta$, and write $P_0$ instead of $P_{\eta_0}$, where $\eta_0$ is the true parameter for the data generating distribution. The $\varepsilon$-covering number of a set $A$ for a semimetric $d$, denoted $N(\varepsilon, A, d)$, is the minimal number of $d$-balls of radius $\varepsilon$ needed to cover $A$, and $N_{[\,]}(\varepsilon, \mathcal{F}, d)$ is the minimal number of brackets of size $\varepsilon$ needed to cover a set of functions $\mathcal{F}$.

## 2 Model details

Recall that we observe i.i.d. copies $X_1,\dots,X_n$ of a random variable $X = (Z, R, RY)$, where $R$ and $Y$ take values in the two-point set $\{0,1\}$ and are conditionally independent given $Z$, which itself takes values in a given measurable space. Denote the full sample by $X^{(n)} = (X_1,\dots,X_n)$. This model can be parameterized via the marginal density $f$ (with respect to some given measure denoted $dz$) or distribution $F$ of $Z$ and the conditional probabilities $1/a(z) = \Pr(R = 1 \mid Z = z)$, called the propensity score, and $b(z) = \Pr(Y = 1 \mid Z = z)$, the regression of $Y$ on $Z$. The distribution of an observation is thus fully described by the triple $(a, b, f)$ or $(a, b, F)$.

For prior construction it will be useful to transform the parameters by their natural link functions. Let $\Psi(t) = 1/(1+e^{-t})$ be the logistic function and consider the reparametrization

$$\eta^a = \Psi^{-1}(1/a), \qquad \eta^b = \Psi^{-1}(b), \qquad \eta^f = \log f, \tag{2.1}$$

where $\eta^f$ is defined when the density $f$ of $Z$ exists, and set $\eta = (\eta^a, \eta^b, \eta^f)$. In a slight abuse of notation, we also write $\eta = (\eta^a, \eta^b, F)$ when working only with the distribution function $F$ of $Z$. The parametrization (2.1) is used for our prior construction; we use the two parametrizations interchangeably.

The density of $X$ can now be given as

$$p_\eta(x) = \Big(\frac{1}{a(z)}\Big)^{r}\Big(1 - \frac{1}{a(z)}\Big)^{1-r}\, b(z)^{ry}\big(1 - b(z)\big)^{r(1-y)}\, f(z). \tag{2.2}$$

Note that this factorizes over the parameters. If the covariate $Z$ is not assumed to have a density and $\eta = (\eta^a, \eta^b, F)$, we use the same notation $p_\eta$, but then the factor $f(z)$ is understood to be 1. Since $p_\eta$ factorizes over the three (or two) parameters, the log-likelihood based on $X^{(n)}$ separates as

$$\ell_n(\eta) = \sum_{i=1}^n \log p_{(a,b,f)}(X_i) = \ell_n^a(\eta^a) + \ell_n^b(\eta^b) + \ell_n^f(\eta^f), \tag{2.3}$$

where each term is the logarithm of the factors involving only $a$ or $b$ or $f$, and the last term is understood to be absent when existence of a density is not assumed. The functional of interest is the mean response $\chi = EY$, which can be expressed in the parameters as

$$\chi(\eta) = \int b\, dF = \int \Psi(\eta^b)(z)\, e^{\eta^f(z)}\, dz,$$

where the second representation is available if $Z$ has a density.

Estimators that are $\sqrt n$-consistent and asymptotically efficient for $\chi$ have been constructed using various methods, but only if $a$ or $b$ (or both) are sufficiently smooth. In the present context, under the assumption that $a \in C^\alpha$ and $b \in C^\beta$ for a $d$-dimensional covariate, Robins et al. [35] have constructed estimators that are $\sqrt n$-consistent if $(\alpha+\beta)/2 \ge d/4$. They have also shown that the latter condition is sharp: the minimax rate becomes slower than $n^{-1/2}$ when $(\alpha+\beta)/2 < d/4$ (see [34]). The estimators in [35] employ higher order estimating equations to obtain better control of the bias. First-order estimators, based on linear estimators or semiparametric maximum likelihood, have been shown to be $\sqrt n$-consistent only under the stronger condition

$$\frac{\alpha}{2\alpha+1} + \frac{\beta}{2\beta+1} \ge \frac{1}{2}, \tag{2.4}$$

see e.g. [37, 39]. In both cases the conditions show a trade-off between the smoothness levels $\alpha$ and $\beta$ of $a$ and $b$: higher $\alpha$ permits lower $\beta$ and vice versa. This trade-off results from the multiplicative form of the bias of linear or higher-order estimators. So-called double robust estimators are able to exploit this structure, and work well if either $a$ or $b$ is sufficiently smooth. (More generally, it suffices that the parameters $a$ and $b$ can be estimated well enough, where the combined rates are relevant. The inequalities even remain valid with $\alpha = \infty$ or $\beta = \infty$ interpreted as the existence of $\sqrt n$-consistent estimators of $a$ or $b$, as would be the case given a correctly specified finite-dimensional model.) We shall henceforth also assume that the parameters $a$ and $b$ are contained in Hölder spaces $C^\alpha$ and $C^\beta$, respectively.
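For concreteness, condition (2.4) is easy to explore numerically; the helper below (ours, just arithmetic on the displayed inequality) exhibits the trade-off:

```python
def first_order_ok(alpha: float, beta: float) -> bool:
    """Check the first-order smoothness condition (2.4)."""
    return alpha / (2 * alpha + 1) + beta / (2 * beta + 1) >= 0.5

# Equal smoothness: the boundary lies at alpha = beta = 1/2.
print(first_order_ok(0.5, 0.5))   # True (boundary case)
print(first_order_ok(0.4, 0.4))   # False
# Trade-off: a rough propensity score offset by a very smooth regression.
print(first_order_ok(0.1, 5.0))   # True
```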

For estimation of $\chi$ at $\sqrt n$-rate the covariate density $f$ need not be smooth, which makes sense intuitively, as the functional can be written as an integral relative to the corresponding distribution $F$. (Counter to this intuition, [34, 35] show this to be false for optimal estimation at slower than $\sqrt n$-rate.) This may motivate modelling $F$ nonparametrically, in the Bayesian setting for instance with a Dirichlet process prior.

All these observations are valid only if the estimation problem is not affected by the parameters $a$, $b$ or $f$ taking values on the boundary of their natural ranges. For simplicity we make the following assumption throughout.

###### Assumption.

The true functions $1/a_0$ and $b_0$ are bounded away from 0 and 1, and $f_0$ is bounded away from 0 and $\infty$.

We finish by reviewing the tangent space and information distance of the model, which are well known to play an important role in semiparametric estimation theory [4, 5, 42]. (See [9] or Chapter 12 of [16] for general reviews in the context of Bayesian estimation.)

With regard to the parametrization (2.1), consider the one-dimensional submodels induced by the paths

$$\frac{1}{a_t} = \Psi(\eta^a + t\alpha), \qquad b_t = \Psi(\eta^b + t\beta), \qquad f_t = \exp\Big(\eta^f + t\phi - \log \int e^{\eta^f + t\phi}\, dz\Big)$$

for given bounded measurable directions $\alpha$, $\beta$ and $\phi$ with $\int \phi\, dF = 0$, and a given “starting” point $\eta$. Inserting these paths in the likelihood (2.2), and computing the derivative at $t = 0$ of the log-likelihood, we obtain the “score functions”

$$B^a_\eta \alpha(X) = \Big(R - \frac{1}{a(Z)}\Big)\alpha(Z), \qquad B^b_\eta \beta(X) = R\big(Y - b(Z)\big)\beta(Z), \qquad B^f_\eta \phi(X) = \phi(Z).$$

The operators $B^a_\eta$, $B^b_\eta$, $B^f_\eta$ are the score operators for the three parameters. The overall score when perturbing the three parameters simultaneously is the sum of the three terms in the previous display. The efficient influence function of the functional $\chi$ at the point $\eta$ is known to take the form

$$\tilde\chi_\eta(X) = R\, a(Z)\big(Y - b(Z)\big) + b(Z) - \chi(\eta).$$

By definition this function has two properties. First, the derivative of the functional along a path at $t = 0$ is the inner product of this function with the score function of the path, for every path of the above form. Second, the function is contained in the closed linear span of the set of all score functions. Indeed, in the present case we have, for all $x$,

$$\tilde\chi_\eta(x) = B_\eta \xi_\eta(x) = B^b_\eta a(x) + B^f_\eta\Big(b - \int b\, dF\Big)(x), \tag{2.5}$$

where $\xi_\eta$ is the least favourable direction, given by

$$\xi_\eta = (0, \xi^b_\eta, \xi^f_\eta) = \Big(0,\; a,\; b - \int b\, dF\Big).$$

The function $B_\eta \xi_\eta$ is the score function for the submodel corresponding to the perturbations in the direction of $\xi_\eta$ on $\eta$. The latter submodel is called least favourable, since it has the smallest information about the functional of interest at $\eta$. According to semiparametric theory (e.g. Chapter 25 of [48]) a sequence of estimators $\hat\chi_n$ is asymptotically efficient for estimating $\chi(\eta)$ at the true parameter $\eta_0$ if and only if

$$\hat\chi_n = \chi(\eta_0) + \frac{1}{n}\sum_{i=1}^n \tilde\chi_{\eta_0}(X_i) + o_{P_{\eta_0}}(n^{-1/2}). \tag{2.6}$$

The sequence $\sqrt n\,(\hat\chi_n - \chi(\eta_0))$ is then asymptotically normal with mean zero and variance $P_{\eta_0}\tilde\chi_{\eta_0}^2$, which is the least possible in a local minimax sense.
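As a quick sanity check (ours, not stated in the text), the efficient influence function is indeed centered: since $E[R \mid Z] = 1/a(Z)$, $E[Y \mid Z] = b(Z)$ and $R$ and $Y$ are conditionally independent given $Z$,

$$P_\eta \tilde\chi_\eta = E\Big[a(Z)\, E[R \mid Z]\, E\big[Y - b(Z) \mid Z\big]\Big] + \int b\, dF - \chi(\eta) = 0,$$

so the average in (2.6) has mean zero, in agreement with the asymptotic normality just described.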

For a direction $v = (\alpha, \beta, \phi)$, the information norm corresponding to the score operator $B_\eta$ (or LAN norm in the language of [9, 30, 10]) equals

$$\|v\|_\eta^2 := P_\eta\big[(B_\eta v)^2\big] = \int \Big[\frac{1}{a}\Big(1 - \frac{1}{a}\Big)\alpha^2 + \frac{b(1-b)}{a}\beta^2 + (\phi - F\phi)^2\Big]\, dF =: \|\alpha\|_a^2 + \|\beta\|_b^2 + \|\phi\|_f^2.$$

It may be noted that the three components of the score operator are orthogonal, which is a consequence of the factorization of the likelihood. The minimal asymptotic variance for estimating $\chi(\eta)$ can be written in terms of the information norm as

$$\|\xi_{\eta_0}\|_{\eta_0}^2 = P_{\eta_0}\big(B_{\eta_0}\xi_{\eta_0}\big)^2 = P_{\eta_0}\tilde\chi_{\eta_0}^2 = \int a_0\, b_0(1-b_0)\, dF_0 + \int b_0^2\, dF_0 - \chi(\eta_0)^2.$$
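The characterization (2.6) suggests the familiar one-step (AIPW-type) estimator $\hat\chi_n = \mathbb{P}_n\big[R\,a(Z)(Y - b(Z)) + b(Z)\big]$. The sketch below is our illustration, with the oracle $a_0, b_0$ standing in for nonparametric estimates and all functional forms chosen for the example; it computes this estimator together with a plug-in standard error from the empirical variance of the influence-function values:

```python
import numpy as np

# One-step estimator built from the efficient influence function
# chi_tilde(X) = R a(Z)(Y - b(Z)) + b(Z) - chi(eta), using oracle a0, b0.
rng = np.random.default_rng(1)
n = 100_000

def psi(x):
    return 1.0 / (1.0 + np.exp(-x))

Z = rng.uniform(0.0, 1.0, n)
prop = psi(2 * Z - 1)                        # 1/a0(Z): the propensity score
b0 = psi(3 * Z - 1.5)                        # true regression function
R = rng.binomial(1, prop)
Y = rng.binomial(1, b0)

a0 = 1.0 / prop                              # inverse propensity score
chi_hat = np.mean(R * a0 * (Y - b0) + b0)    # efficient (one-step) estimator
infl = R * a0 * (Y - b0) + b0 - chi_hat      # estimated influence-function values
se = infl.std() / np.sqrt(n)                 # plug-in asymptotic standard error
print(chi_hat, se)
```

For this design $\chi(\eta_0) = 1/2$ by symmetry, and `se` estimates $\|\xi_{\eta_0}\|_{\eta_0}/\sqrt n$.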

## 3 Results

We put a prior probability distribution $\Pi$ on the parameter $\eta = (\eta^a, \eta^b, \eta^f)$ or $(\eta^a, \eta^b, F)$, and consider the posterior distribution based on the observation $X^{(n)}$. This induces posterior distributions on all measurable functions of $\eta$, including the functional of interest $\chi(\eta)$.

We write $\mathcal{L}_\Pi(\sqrt n(\chi(\eta) - \hat\chi_n) \mid X^{(n)})$ for the marginal posterior distribution of $\sqrt n(\chi(\eta) - \hat\chi_n)$, where $\hat\chi_n$ is any random sequence satisfying (2.6). We shall be interested in proving that this distribution asymptotically looks like a centered normal distribution with variance $\|\xi_{\eta_0}\|^2_{\eta_0}$. For a precise statement of this approximation, let $d_{BL}$ be the bounded Lipschitz distance on probability distributions on $\mathbb{R}$ (see Chapter 11 of [11]).

###### Definition 1.

Let $X_1,\dots,X_n$ be i.i.d. observations arising from the density $p_{\eta_0}$ in (2.2), whose distribution we denote by $P_{\eta_0}$. We say that the posterior satisfies the semiparametric Bernstein–von Mises (BvM) theorem if, for $\hat\chi_n$ satisfying (2.6), as $n \to \infty$,

$$d_{BL}\Big(\mathcal{L}_\Pi\big(\sqrt n(\chi(\eta) - \hat\chi_n) \,\big|\, X^{(n)}\big),\; N\big(0, \|\xi_{\eta_0}\|^2_{\eta_0}\big)\Big) \xrightarrow{P^n_{\eta_0}} 0.$$

In the following subsections we present three results for general priors on the parameters, where the first puts a prior on a density of $Z$, the second and third use the Dirichlet process prior, and the third also uses an estimator of the propensity score. While the priors become more specific in going from the first to the third result, the conditions for the BvM theorem become increasingly mild. In Section 3.4 we specialize the three results to Gaussian process priors and obtain more concrete results.

### 3.1 General prior on a, b and f

In the first result $\Pi$ is a prior on the triple $(a, b, f)$, or equivalently on the triple $\eta = (\eta^a, \eta^b, \eta^f)$ constructed through the parametrization (2.1). Define $\eta_t(\eta)$ to be a perturbation of $\eta$ in the least favourable direction as follows:

$$\eta_t(\eta) = \Big(\eta^a,\; \eta^b - \frac{t}{\sqrt n}\xi^b_{\eta_0},\; \eta^f - \frac{t}{\sqrt n}\xi^f_{\eta_0} - \log \int e^{\eta^f - t\xi^f_{\eta_0}/\sqrt n}\, dz\Big). \tag{3.1}$$
###### Theorem 1.

Consider an arbitrary prior on $(\eta^a, \eta^b, \eta^f)$. Assume that there exist measurable sets of functions $H_n$ satisfying

$$\Pi(\eta \in H_n \mid X^{(n)}) \xrightarrow{P_0} 1, \tag{3.2}$$

$$\sup_{b = \Psi(\eta^b):\, \eta \in H_n} \|b - b_0\|_{L^2(F_0)} \to 0, \tag{3.3}$$

$$\sup_{f = e^{\eta^f}/\int e^{\eta^f} dz:\, \eta \in H_n} \|f - f_0\|_1 \to 0, \tag{3.4}$$

$$\sup_{b = \Psi(\eta^b):\, \eta \in H_n} \big|\mathbb{G}_n[b - b_0]\big| \xrightarrow{P_0} 0, \tag{3.5}$$

and also

$$\sup_{b = \Psi(\eta^b),\, f = e^{\eta^f}/\int e^{\eta^f} dz:\, \eta \in H_n} \Big|\sqrt n \int (b - b_0)(f - f_0)\, dz\Big| \to 0. \tag{3.6}$$

If, for the path $\eta_t(\eta)$ given in (3.1),

$$\frac{\int_{H_n} \prod_{i=1}^n p_{\eta_t(\eta)}(X_i)\, d\Pi(\eta)}{\int_{H_n} \prod_{i=1}^n p_{\eta}(X_i)\, d\Pi(\eta)} \xrightarrow{P_0} 1, \tag{3.7}$$

then the posterior distribution of $\chi(\eta)$ satisfies the BvM theorem.

Conditions (3.2)–(3.6) make it possible to control the remainder terms in an expansion of the likelihood. The first four conditions (3.2)–(3.5) require that the posterior concentrates on shrinking neighbourhoods of the true parameters $b_0$ and $f_0$, though not $a_0$, and hence mostly require consistency, whereas the remaining condition (3.6) also requires a $\sqrt n$-rate on a certain bias term.

In Theorems 2–3, which put a Dirichlet process prior on the distribution $F$ rather than a prior on the density $f$, condition (3.6) disappears, and hence this condition might be interpreted as involving a bias incurred by possibly putting the wrong prior on $f$. The condition, which seems tied to any prior that directly models $f$, may be satisfied for reasonable priors if both $b$ and $f$ are sufficiently smooth, but in the situation where $f$ has low regularity, even correctly calibrating the smoothness of the prior on $f$ can perform worse than naively using a Dirichlet process. The condition provides another example where an infinite-dimensional prior can induce an undesired bias [13, 29, 24, 10]. This effect becomes more pronounced as the covariate dimension increases and can be problematic in even moderate dimensions.

The uniformity in $b$ required in (3.5) is unpleasant, as it will typically require that the class of functions $b$ supported by the posterior distribution is not unduly large. The condition is linked to using the likelihood, and similar conditions arise in maximum likelihood based estimation procedures, although (3.5) seems significantly weaker, as the uniformity is required only on the essential support of the posterior distribution, which might be much smaller than the full parameter space. The use of estimating equations can avoid uniformity conditions by sample splitting [35]. In the Bayesian framework one might similarly base posterior distributions of different parameters on separate subsamples, but this is unnatural, so we do not pursue this route here, although Theorem 3 below is a step in this direction. A sufficient condition for the uniformity in (3.5) is that the class of functions in the condition is contained in a fixed $P_0$-Donsker class (see Lemma 3.3.5 of [51]). In particular, it suffices that the posterior concentrates on a bounded set in $C^\beta$ for $\beta > d/2$. While this condition is easy to establish for certain priors, such as uniform wavelet priors [17], for the Gaussian process priors considered here we employ relatively complicated arguments using metric entropy bounds.

Condition (3.7) measures the invariance of the prior for the full nuisance parameter under a shift in the least favourable direction $\xi_{\eta_0}$. It is a structural condition on the combination of prior and model, and if it is not satisfied the $\sqrt n$-rate in the BvM theorem may be destroyed (see [9] or [16] for further discussion). Although we shall verify the condition for some priors of interest below, this condition may impose smoothness conditions on the parameters, and prevent so-called “double robustness”. We shall remove this condition for special priors in Theorem 3 below.

Since $\xi^b_{\eta_0} = a_0$, Theorem 1 implicitly requires conditions on $a_0$ through (3.7), even though $a$ does not appear in the functional $\chi(\eta) = \int b\, dF$. Such conditions become explicit for concrete priors below.

###### Remark 1.

If the quotient on the left side of (3.7) is asymptotic to $e^{t\mu_n}$ for some possibly random sequence of real numbers $\mu_n$, then the assertion of the BvM theorem is still true, but the normal approximation must be replaced by $N(\mu_n, \|\xi_{\eta_0}\|^2_{\eta_0})$. See [10, 30] for further discussion. The same is true for all other results in the following.

###### Remark 2.

If the supremum in (3.5), or similar variables below, is not measurable, then we interpret this statement in terms of outer probability.

### 3.2 General prior on a and b and Dirichlet process prior on F

Intuitively the estimation problem should not depend too much on the properties of the distribution of the covariates, as the latter are fully observed and the functional $\chi(\eta) = \int b\, dF$ is an integral relative to the covariate distribution $F$. For $\sqrt n$-estimation this intuition appears to be correct, which suggests not assuming the existence of the covariate density $f$, and putting a prior directly on the distribution $F$. In the following theorem we shall see that this removes condition (3.6).

The standard nonparametric prior on the set of probability distributions on a (Polish) sample space is the Dirichlet process prior [12]. This distribution is characterized by a base measure $\nu$, which can be any finite measure on the sample space. It is well known that in the model consisting of sampling a distribution $F$ from the Dirichlet process prior and next sampling observations $Z_1,\dots,Z_n$ i.i.d. from $F$, the posterior distribution of $F$ given $Z^{(n)} = (Z_1,\dots,Z_n)$ is again a Dirichlet process, with updated base measure $\nu + n\mathbb{F}_n$, where $\mathbb{F}_n$ is the empirical distribution of $Z_1,\dots,Z_n$. (For full definitions and properties, see the review in Chapter 4 of [16].)

We utilize the Dirichlet process prior on $F$, together with an independent prior on the remaining parameters $(\eta^a, \eta^b)$, constructed using the logistic link function as previously. The Dirichlet process prior does not give probability one to a dominated set of measures $F$, which means that the posterior distribution cannot be derived using Bayes’s formula. However, we can obtain a representation as follows. The parameters and the data are generated through the hierarchical scheme:

• $F \sim DP(\nu)$, independent from $(\eta^a, \eta^b) \sim \Pi$.

• Given $(F, \eta^a, \eta^b)$ the covariates $Z_1,\dots,Z_n$ are i.i.d. $F$.

• Given $(F, \eta^a, \eta^b, Z_1,\dots,Z_n)$ the pairs $(R_1, Y_1),\dots,(R_n, Y_n)$ are independent, drawn from products of binomial distributions with success probabilities $1/a(Z_i)$ and $b(Z_i)$.

• The observations are $X_i = (Z_i, R_i, R_iY_i)$, with $i = 1,\dots,n$.

From this scheme it follows that $F$ and $(\eta^a, \eta^b, R^{(n)}, Y^{(n)})$ are conditionally independent given $Z^{(n)}$, and also that $(\eta^a, \eta^b)$ and $F$ are conditionally independent given $X^{(n)}$. We can then conclude that the posterior distribution of $F$ given $X^{(n)}$ is the same as the posterior distribution of $F$ given $Z^{(n)}$, which is the $DP(\nu + n\mathbb{F}_n)$ distribution. Furthermore, the posterior distribution of $(a, b)$ given $X^{(n)}$ can be derived by Bayes’s rule from the binomial likelihood of $(R^{(n)}, Y^{(n)})$ given $(Z^{(n)}, a, b)$, which is dominated. Thus the posterior distribution is given by

$$\Pi\big((a,b) \in A,\; F \in B \,\big|\, X^{(n)}\big) = \int_B \frac{\int_A \prod_{i=1}^n p_{(a,b)}(X_i)\, d\Pi(a,b)}{\int \prod_{i=1}^n p_{(a,b)}(X_i)\, d\Pi(a,b)}\, d\Pi(F \mid Z^{(n)}), \tag{3.8}$$

where $p_{(a,b)}$ is the conditional density of $(R_i, R_iY_i)$ given $Z_i$, given by (2.2) with the factor $f$ deleted or taken equal to 1, and $\Pi(F \mid Z^{(n)})$ is the $DP(\nu + n\mathbb{F}_n)$-distribution. This formula remains valid if $\nu = 0$, which yields the Bayesian bootstrap, see Chapter 4.7 of [16]. This choice is also covered in the following theorem. We suspect that the theorem extends to other exchangeable bootstrap processes, as considered in [27] (see [47], Section 3.7.2).
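To illustrate (our sketch, not from the paper): with $\nu = 0$ the posterior of $F$ is the Bayesian bootstrap, i.e. Dirichlet$(1,\dots,1)$ weights on the observed covariates. Combining such weight draws with stand-in posterior draws of $b$ (here the true regression function perturbed by noise, an arbitrary mock-up) produces posterior draws of $\chi(\eta) = \int b\, dF$:

```python
import numpy as np

# Bayesian bootstrap draws of F combined with mock posterior draws of b.
rng = np.random.default_rng(2)
n = 5_000

def psi(x):
    return 1.0 / (1.0 + np.exp(-x))

Z = rng.uniform(0.0, 1.0, n)                     # observed covariates

draws = []
for _ in range(200):
    W = rng.dirichlet(np.ones(n))                # Bayesian bootstrap weights for F
    shift = rng.normal(0.0, 0.05)                # mock posterior uncertainty in b
    b_draw = psi(3 * Z - 1.5 + shift)            # stand-in posterior draw of b
    draws.append(np.sum(W * b_draw))             # one draw of chi(eta) = int b dF
draws = np.array(draws)
print(draws.mean(), draws.std())
```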

Define $\eta_t(\eta)$ to be a perturbation of $\eta = (\eta^a, \eta^b)$ in the least favourable direction, restricted to the components corresponding to $a$ and $b$:

$$\eta_t(\eta) = \Big(\eta^a,\; \eta^b - \frac{t}{\sqrt n}\xi^b_{\eta_0}\Big). \tag{3.9}$$
###### Theorem 2.

Consider a prior consisting of an arbitrary prior on $(\eta^a, \eta^b)$ and an independent Dirichlet process prior on $F$. Assume that there exist measurable sets of functions $H^{a,b}_n$ satisfying

$$\Pi(\eta \in H^{a,b}_n \mid X^{(n)}) \xrightarrow{P_0} 1, \tag{3.10}$$

$$\sup_{b = \Psi(\eta^b):\, \eta \in H^{a,b}_n} \|b - b_0\|_{L^2(F_0)} \to 0, \tag{3.11}$$

$$\sup_{b = \Psi(\eta^b):\, \eta \in H^{a,b}_n} \big|\mathbb{G}_n[b - b_0]\big| \xrightarrow{P_0} 0. \tag{3.12}$$

If, for the path $\eta_t(\eta)$ in (3.9),

$$\frac{\int_{H^{a,b}_n} \prod_{i=1}^n p_{\eta_t(\eta)}(X_i)\, d\Pi(\eta)}{\int_{H^{a,b}_n} \prod_{i=1}^n p_{\eta}(X_i)\, d\Pi(\eta)} \xrightarrow{P_0} 1, \tag{3.13}$$

then the posterior distribution (3.8) satisfies the BvM theorem.

Formula (3.8) shows that a draw from the posterior distribution of the functional of interest $\chi(\eta) = \int b\, dF$ is obtained by independently drawing $b$ from its posterior distribution and $F$ from the $DP(\nu + n\mathbb{F}_n)$-distribution, and next forming the integral $\int b\, dF$. The posterior distribution of $b$ is constructed from the conditional likelihood of $(R^{(n)}, Y^{(n)})$ given $Z^{(n)}$, without the intervention of $F$ or its prior distribution. Instead of a Bayesian-motivated or bootstrap-type choice for $F$, which requires randomization given $Z^{(n)}$, one could also directly plug in an estimator of $F$ based on $Z^{(n)}$ and randomize only $b$ from its posterior distribution. The empirical distribution $\mathbb{F}_n$ is an obvious choice. The proof of Theorem 2 suggests that, under the conditions of the theorem,

$$d_{BL}\Big(\mathcal{L}_\Pi\big(\sqrt n(\chi(\eta) - \hat\chi_n) \,\big|\, X^{(n)}\big),\; N\big(0, \|\xi^b_{\eta_0}\|^2_{b_0}\big)\Big) \to 0$$

in $P_0$-probability. Compared to the BvM theorem this suggests a normal approximation with the same centering, but a smaller variance, since the variance in the BvM theorem is the sum $\|\xi^b_{\eta_0}\|^2_{b_0} + \|\xi^f_{\eta_0}\|^2_{f_0}$. The lack of posterior randomization of $F$ thus results in an underestimation of the asymptotic variance. Using credible sets resulting from this ‘posterior’ would give overconfident (wrong) uncertainty quantification. Since our focus is on the Bayesian approach, we do not pursue such plug-in generalizations further.

### 3.3 Propensity score-dependent priors

To reduce unnecessary regularity conditions, it can be useful to use a preliminary estimate of the inverse propensity score [35, 37, 39]. Such an approach has recently also been advocated in the Bayesian literature in a related setting, where [20] plug an estimator of the propensity score into a Bayesian additive regression tree (BART) prior used in [22] and achieve improved empirical performance. In this section we employ preliminary estimators to augment the prior on $b$, with the aim of weakening the conditions required for a semiparametric BvM.

Suppose we have a sequence of estimators $\hat a_n$ of the inverse propensity score $a_0$ satisfying

$$\|\hat a_n - a_0\|_{L^2(F_0)} = O_{P_0}(\rho_n) \tag{3.14}$$

for some sequence $\rho_n \to 0$. Since the propensity score is just a (binary) regression function of $R$ onto $Z$, standard (adaptive) smoothing estimators satisfy this condition with rate $\rho_n = n^{-\alpha/(2\alpha + d)}$ if the propensity score is assumed to be contained in $C^\alpha$, which is the minimax rate over this space (note that $\hat a_n$ will attain at least the rate of an estimator of the propensity score itself, since the propensity score is bounded away from 0). Consider the following prior on $b$:

$$b(z)=\Psi\bigl(W^b_z+\lambda\,\hat a_n(z)\bigr), \tag{3.15}$$

where $W^b$ is a continuous stochastic process independent of the random variable $\lambda$, which follows a mean-zero normal prior distribution with given standard deviation $\sigma_n$ (potentially varying with $n$, but fixed $\sigma_n=\sigma$ is allowed). We assume that $\hat a_n$ is based on observations that are independent of $X^{(n)}$, the observations used in the likelihood to obtain the posterior. Otherwise, the prior (3.15) becomes data-dependent, which significantly complicates the technical analysis.
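A draw from the prior (3.15) can be sketched as follows. The concrete ingredients are illustrative assumptions: $\Psi$ is taken to be the logistic link, $W^b$ a Brownian motion (any continuous Gaussian process would do), and `a_hat` a hypothetical plug-in estimate of the inverse propensity score, pretending the estimated propensity score is $0.3+0.4z$.

```python
import numpy as np

rng = np.random.default_rng(1)

def Psi(x):
    """Logistic link, mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def a_hat(z):
    """Hypothetical inverse-propensity-score estimate, fitted on an
    independent sample (here just a fixed smooth curve)."""
    return 1.0 / (0.3 + 0.4 * z)

def draw_prior_b(grid, sigma_n):
    """One draw of b(z) = Psi(W_z + lambda * a_hat(z)) on a grid.

    W is a Brownian motion path; lambda ~ N(0, sigma_n^2) independently.
    """
    dt = np.diff(grid, prepend=0.0)
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt)))  # Brownian path on the grid
    lam = rng.normal(0.0, sigma_n)
    return Psi(W + lam * a_hat(grid))

grid = np.linspace(0.0, 1.0, 101)
b_draw = draw_prior_b(grid, sigma_n=0.5)
print(b_draw.min(), b_draw.max())
```

The single scalar $\lambda$ shifts the whole path in the direction of $\hat a_n$, which is exactly the perturbation in the (estimated) least favourable direction discussed below.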

To improve clarity we only consider the combination of (3.15) with assigning $F$ a Dirichlet process prior, but all results below can be extended to the case where $f$ is assigned an independent prior of the form (3.23), upon making the same extra assumptions as was done previously to ensure that the prior does not induce any additional bias through the prior shift condition (3.7).

We may think of $\hat a_n$ as a degenerate prior on the inverse propensity score, and then by the factorization of the likelihood the part of the likelihood involving the propensity score cancels from the posterior distribution (3.8) if marginalized to $(b,F)$ (and hence to $\chi(\eta)$). Of course the same will happen if we assign an independent prior to the propensity score. In both cases it is unnecessary to further discuss a prior on this parameter.

###### Theorem 3.

Let $\hat a_n$ be estimators satisfying (3.14), based on independent observations, consider the prior (3.15) for $b$ with the stochastic process $W^b$ and random variable $\lambda$ independent, and assign $F$ an independent Dirichlet process prior. Assume that there exist measurable sets $\mathcal H^b_n$ of functions satisfying, for every $t$ in a neighbourhood of zero and some numbers $u_n$ and $\varepsilon^b_n$,

$$\Pi\bigl(\lambda: |\lambda|\le u_n\sigma_n^2\sqrt n \,\big|\, X^{(n)}\bigr) \stackrel{P_0}{\to} 1, \tag{3.16}$$
$$\Pi\bigl((w,\lambda): w+(\lambda+tn^{-1/2})\,\hat a_n\in\mathcal H^b_n \,\big|\, X^{(n)}\bigr) \stackrel{P_0}{\to} 1, \tag{3.17}$$
$$\sup_{b=\Psi(\eta_b):\,\eta_b\in\mathcal H^b_n}\|b-b_0\|_{L^2(F_0)} \le \varepsilon^b_n, \tag{3.18}$$
$$\sup_{b=\Psi(\eta_b):\,\eta_b\in\mathcal H^b_n}\bigl|\mathbb G_n[b-b_0]\bigr| \stackrel{P_0}{\to} 0. \tag{3.19}$$

Under two further rate conditions linking $u_n$, $\sigma_n$, $\varepsilon^b_n$ and $\rho_n$, the posterior distribution then satisfies the semiparametric BvM theorem.

The advantage of this theorem over the preceding theorems is that (3.13) does not appear in its conditions. (The theorem adds (3.16) and (3.17), but these are relatively mild.) As noted above, condition (3.13) requires a certain invariance of the prior of $b$ in the least favourable direction, and typically leads to smoothness requirements on the propensity score. In contrast, we show below that Theorem 3 can yield the BvM theorem for propensity scores of arbitrarily low regularity. Thus the theorem achieves what could be named single robustness. Whether "double robustness", the ability to also handle response functions of arbitrarily low smoothness, is achieved as well remains unclear. Specifically, we have not been able to verify condition (3.19) for reasonable priors without assuming that the smoothness of the response function is above the usual threshold ($d/2$ in $d$ dimensions).

The single robustness is achieved by perturbing the prior process for $b$ in the least favourable direction using the auxiliary variable $\lambda$. Since the least favourable direction is unknown, it is replaced by the estimate $\hat a_n$.

Condition (3.16) puts a lower bound on the variability of the perturbation, i.e. on the standard deviation $\sigma_n$ of $\lambda$. An easy method to ascertain this condition is to show that the prior mass of the complement of the set on the left side is exponentially small and next invoke Lemma 3. Specifically, by the univariate Gaussian tail bound the prior mass of $\{\lambda:|\lambda|>u_n\sigma_n^2\sqrt n\}$ is bounded above by $e^{-u_n^2\sigma_n^2 n/2}$. If the Kullback-Leibler neighbourhood in Lemma 3 has prior probability at least $e^{-n\varepsilon_n^2}$, then the lemma gives a sufficient condition for (3.16) of the form $u_n\sigma_n\gtrsim\varepsilon_n$.
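The Gaussian tail bound used in this argument is easy to verify numerically. The snippet below compares the exact tail probability of $\lambda\sim N(0,\sigma_n^2)$ beyond $u_n\sigma_n^2\sqrt n$ with the bound $e^{-u_n^2\sigma_n^2 n/2}$; the values of $u_n$, $\sigma_n$ and $n$ are arbitrary illustrations, not values from the paper.

```python
import math

def gauss_tail_exact(x, sigma):
    """P(|lambda| > x) for lambda ~ N(0, sigma^2), via the error function."""
    return math.erfc(x / (sigma * math.sqrt(2.0)))

def gauss_tail_bound(u_n, sigma_n, n):
    """The bound exp(-u_n^2 sigma_n^2 n / 2) on the prior mass of the set
    {lambda : |lambda| > u_n sigma_n^2 sqrt(n)} when lambda ~ N(0, sigma_n^2)."""
    return math.exp(-u_n ** 2 * sigma_n ** 2 * n / 2.0)

# Illustrative values: the exact tail should sit below the bound.
u_n, sigma_n, n = 0.3, 0.8, 400
x = u_n * sigma_n ** 2 * math.sqrt(n)   # threshold u_n sigma_n^2 sqrt(n)
print(gauss_tail_exact(x, sigma_n), gauss_tail_bound(u_n, sigma_n, n))
```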

### 3.4 Gaussian process priors

In this section we specialize Theorems 1-3 to Gaussian process and/or Dirichlet process priors. In all cases the priors on the three parameters (the propensity score, $b$, and $f$ or $F$) are independent. Since the propensity score does not appear in $\chi(\eta)$ and the likelihood (2.2) factorizes over the three parameters, the terms involving the propensity score cancel from the marginal posterior distribution of $\chi(\eta)$. Thus the prior on the propensity score is irrelevant, and it is not necessary to consider it.

For simplicity we take the covariate space to be the unit interval $[0,1]$. We denote a Gaussian process linked to a prior for $b$ or $f$ by $W^i$, where $i\in\{b,f\}$.

There is a great variety of Gaussian process priors, and their success in nonparametric estimation is known to depend on their sample smoothness, as measured through their small ball probability (see [49, 44, 45, 46]). We both derive propositions on general Gaussian processes and consider the special example of the Riemann-Liouville process released at zero. For given $\bar\beta>0$, this is defined by

$$R^{\bar\beta}_t=\sum_{k=0}^{\lfloor\bar\beta\rfloor+1}g_k t^k+\int_0^t (t-s)^{\bar\beta-1/2}\,dB_s,\qquad t\in[0,1], \tag{3.20}$$

where the $g_k$ are i.i.d. standard normal random variables and $B$ is an independent Brownian motion. This process is appropriate for nonparametric modelling of $\bar\beta$-regular functions. We shall investigate the effect of the smoothness parameter $\bar\beta$ on the BvM theorem.
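A sample path of (3.20) can be simulated by adding the random polynomial to a left-point Riemann-sum approximation of the stochastic integral. The grid size and the value $\bar\beta=0.7$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def riemann_liouville(t, beta_bar, rng):
    """Draw a Riemann-Liouville process released at zero, as in (3.20):

        R_t = sum_{k=0}^{floor(beta_bar)+1} g_k t^k
              + int_0^t (t - s)^(beta_bar - 1/2) dB_s,

    approximating the stochastic integral by a left-point Riemann sum on
    an equally spaced grid t on [0, 1] starting at 0.
    """
    K = int(np.floor(beta_bar)) + 1
    g = rng.normal(size=K + 1)                      # i.i.d. N(0, 1) coefficients
    poly = sum(g[k] * t ** k for k in range(K + 1))  # "release at zero" part

    dt = t[1] - t[0]
    dB = rng.normal(0.0, np.sqrt(dt), size=len(t))  # Brownian increments
    frac = np.zeros_like(t)
    for i in range(1, len(t)):
        # kernel (t_i - s_j)^(beta_bar - 1/2) over grid points s_j < t_i
        frac[i] = np.sum((t[i] - t[:i]) ** (beta_bar - 0.5) * dB[:i])
    return poly + frac

t = np.linspace(0.0, 1.0, 201)
path = riemann_liouville(t, beta_bar=0.7, rng=rng)
print(path[0])
```

Smaller $\bar\beta$ yields visibly rougher paths, matching the role of $\bar\beta$ as the prior's smoothness parameter.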

A Gaussian process $W^i$ can typically be viewed as a Borel measurable map into some separable Banach space $(\mathcal W,\|\cdot\|_{\mathcal W})$, often equipped with the uniform norm. Its covariance function determines a so-called reproducing kernel Hilbert space (RKHS) $(\mathbb H^i,\|\cdot\|_{\mathbb H^i})$. (For details, see [50].) The "concentration function" of the $\mathcal W$-valued process $W^i$ at a function $\eta_{i0}$ is defined as, for $\varepsilon>0$,

$$\phi_{i,\eta_{i0}}(\varepsilon)=\inf_{h\in\mathbb H^i:\,\|h-\eta_{i0}\|_{\mathcal W}<\varepsilon}\|h\|^2_{\mathbb H^i}-\log P\bigl(\|W^i\|_{\mathcal W}<\varepsilon\bigr).$$

For standard statistical models, posterior contraction rates for Gaussian process priors are linked to the solution of the equation (see [49])

$$\phi_{i,\eta_{i0}}(\varepsilon_{in})\sim n\,(\varepsilon_{in})^2. \tag{3.21}$$

We in general write $\varepsilon_{in}$, $i\in\{b,f\}$, for the respective contraction rates for the two parameters $b$ and $f$. Contraction is relative to a distance that depends on the model and on the Banach space norm (see Section 7).

First consider equipping both $b$ and $f$ with Gaussian process priors. Given independent mean-zero Gaussian processes $W^b$ and $W^f$, consider the prior

$$b(z)=\Psi(W^b_z), \tag{3.22}$$
$$f(z)=\frac{e^{W^f_z}}{\int_0^1 e^{W^f_u}\,du}. \tag{3.23}$$
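A draw from the density prior (3.23) can be sketched by exponentiating a Gaussian path and normalizing numerically. Using a Brownian motion for $W^f$ and a trapezoidal rule for the normalizing integral are illustrative choices, not requirements of the prior.

```python
import numpy as np

rng = np.random.default_rng(3)

# One draw from the density prior (3.23): f = exp(W^f) normalized over [0, 1].
grid = np.linspace(0.0, 1.0, 501)
dt = grid[1] - grid[0]
W = np.concatenate(
    [[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), size=len(grid) - 1))]
)  # Brownian path standing in for W^f
e = np.exp(W)
Z = np.sum(0.5 * (e[:-1] + e[1:]) * dt)  # trapezoidal rule for int_0^1 e^{W_u} du
f = e / Z                                # a random probability density on [0, 1]
print(f.min(), f.max())
```

By construction every draw is a positive function integrating to one, so (3.23) is a genuine prior on densities.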
###### Proposition 1.

Consider the product Gaussian process prior (3.22)-(3.23) on $(b,f)$. Let $\varepsilon_{fn}$ satisfy (3.21) with respect to the uniform norm for $i=f$, and let $\varepsilon_{bn}$ be a rate of contraction in $L^2(F_0)$ of the posterior distribution of $b$ to $b_0$. Suppose there exist sequences $\xi_{in}\in\mathbb H^i$ and $\zeta_{in}$ such that

$$\|\xi_{in}-\xi^i_{\eta_0}\|_\infty\le\zeta_{in},\qquad \|\xi_{in}\|_{\mathbb H^i}\le\sqrt n\,\zeta_{in},\qquad \sqrt n\,\varepsilon_{in}\zeta_{in}\to0,\qquad i\in\{b,f\}. \tag{3.24}$$

Suppose further that there exist measurable sets of functions $\mathcal H^b_n$ receiving posterior probability tending to one for which (3.19) holds. Then the posterior distribution satisfies the semiparametric BvM theorem.

The solution to (3.21) for $i=b$ yields a contraction rate for $b$ in the Hellinger distance [49], but the proposition requires a rate in $L^2(F_0)$. If the prior is supported on a fixed sup-norm ball, for instance for suitably conditioned Gaussian process priors [17], then the Hellinger rate automatically implies the same rate in $L^2(F_0)$-distance. For unbounded priors, such as Riemann-Liouville processes, one may often use regularity properties of the Gaussian process to bootstrap a Hellinger rate to one in $L^2(F_0)$, and thus take $\varepsilon_{bn}$ to be a solution to (3.21) with $i=b$ (see Proposition 5 of [10]).

For the concrete case of the Riemann-Liouville process the preceding proposition implies the following.

###### Corollary 1.

Suppose $\eta_{b0}\in C^\beta[0,1]$ and $f_0\in C^\gamma[0,1]$, and consider the prior (3.22)-(3.23) with Riemann-Liouville processes (3.20) with parameters $\bar\beta$ and $\bar\gamma$ on $b$ and $f$ respectively. If $\beta,\gamma,\bar\beta,\bar\gamma>0$ and

$$\frac{\beta\wedge\bar\beta}{2\bar\beta+1}+\frac{\gamma\wedge\bar\gamma}{2\bar\gamma+1}>\frac12, \tag{3.25}$$

then the posterior distribution satisfies the semiparametric BvM theorem.

The corollary suggests that modelling the density $f$ using a Gaussian process prior works well only under smoothness conditions on $f_0$. If $\bar\beta=\beta$ and $\bar\gamma=\gamma$, so that the prior processes select the correct smoothness, the conditions in the corollary reduce to $\beta/(2\beta+1)+\gamma/(2\gamma+1)>1/2$, which is equivalent to $\beta\gamma>1/4$. This requires smoothness of $f_0$, although it is known that no smoothness of $f_0$ is required to estimate the functional $\chi(\eta)$. For $d$-dimensional covariates, the equivalent condition is $\beta\gamma>d^2/4$, which is problematic for even moderate dimensions. The prior for the parameter $f$ can therefore have a significant impact and must be carefully chosen.
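Condition (3.25) is simple to evaluate for given smoothness parameters. The checker below is a minimal sketch; the matched-smoothness case ($\bar\beta=\beta$, $\bar\gamma=\gamma$), where the condition reduces to $\beta\gamma>1/4$, is noted in the comments.

```python
def bvm_condition(beta, gamma, beta_bar, gamma_bar):
    """Check the smoothness condition (3.25):

        (beta ^ beta_bar)/(2 beta_bar + 1)
          + (gamma ^ gamma_bar)/(2 gamma_bar + 1) > 1/2,

    where ^ denotes minimum: the true smoothness is capped by the prior's
    smoothness parameter.
    """
    lhs = (min(beta, beta_bar) / (2.0 * beta_bar + 1.0)
           + min(gamma, gamma_bar) / (2.0 * gamma_bar + 1.0))
    return lhs > 0.5

# With matched smoothness (beta_bar = beta, gamma_bar = gamma) the condition
# becomes beta/(2 beta + 1) + gamma/(2 gamma + 1) > 1/2, i.e. beta * gamma > 1/4.
print(bvm_condition(1.0, 1.0, 1.0, 1.0))       # 1/3 + 1/3 > 1/2  -> True
print(bvm_condition(0.25, 0.25, 0.25, 0.25))   # 1/6 + 1/6 <= 1/2 -> False
```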

This can be avoided by modelling not the density $f$, but the corresponding distribution $F$, with a Dirichlet process prior.

###### Proposition 2.

Consider the Gaussian process prior (3.22) on $b$ and an independent Dirichlet process prior on $F$. Let $\varepsilon_{bn}$ satisfy (3.21) with respect to the uniform norm. Suppose there exist sequences $\xi_{bn}\in\mathbb H^b$ and $\zeta_{bn}$ such that

$$\|\xi_{bn}-\xi^b_{\eta_0}\|_\infty\le\zeta_{bn},\qquad \|\xi_{bn}\|_{\mathbb H^b}\le\sqrt n\,\zeta_{bn},\qquad \sqrt n\,\varepsilon_{bn}\zeta_{bn}\to0.$$

Suppose further that there exist measurable sets of functions $\mathcal H^b_n$ receiving posterior probability tending to one for which (3.19) holds. Then the posterior distribution satisfies the semiparametric BvM theorem.

Suppose