## 1 Introduction

We tackle the problem of prior-posterior inference when the only available information about the unknown parameter is supplied by a set of conditional moment (CM) restrictions

(1.1) |

where is a -vector of known functions of a -valued random vector and the unknown , and is the unknown conditional distribution of given a -valued random vector . Such models are important because many standard models in statistics can be recast in terms of CM restrictions. These models also arise naturally in causal inference, missing data problems, and in models derived from theory in economics and finance. Because the CM conditions constrain the set of possible distributions , we say that the model is correctly specified if the true data generating process is in the set of distributions constrained to satisfy these moment conditions for some , while the model is misspecified if is not in the set of implied distributions for any .

A different starting point is when one is given the *unconditional* moments, say . Prior-posterior analysis can then be based on the empirical likelihood (for example, Lazar (2003) and many others) or on the exponentially tilted empirical likelihood (ETEL), as in Schennach (2005) and Chib, Shin and Simoni (2018). Developing a Bayesian framework for CM models is nonetheless important. While it is true that the conditional moments imply that is uncorrelated with , i.e., , where is the Kronecker product operator, the conditional moments assert even more: that is uncorrelated with any measurable, bounded function of . Thus, there is an efficiency loss if this information is ignored.

We approach this problem by first constructing unconditional moments

(1.2) |

based on an increasing (in sample size) vector of approximating functions, , obtained, for instance, from splines of each variable in (Donald, Imbens and Newey, 2003). Efficiency loss is avoided as the number of moments increases with sample size. Next, for each sample size and for each , the nonparametric exponentially tilted empirical likelihood (ETEL) function is constructed to satisfy these unconditional moments. Unlike the empirical likelihood, the ETEL function has a fully Bayesian interpretation. It is the likelihood that emerges from integrating out with respect to a nonparametric prior that satisfies the CMs. The posterior of interest is then this nonparametric likelihood multiplied by a prior distribution of the parameters. Because the nonparametric likelihood is limited to a set of values for which the empirical counterpart of the moment conditions (1.2) is equal to zero, the posterior (equivalently, the prior) is truncated to the set .

We study the prior-posterior mapping on many fronts, taking up the question of misspecified models, model comparisons, and computations, combining careful theoretical work with the needs of applications. The posterior distribution is shown to satisfy Bernstein-von Mises (BvM) theorems in both the correct and misspecified cases. In the former case the growth of (for approximating functions given by splines) is at most , where is the sample size. The asymptotic posterior variance is then equal to the semiparametric efficiency bound derived in Chamberlain (1987). In the latter case, in parallel with Kleijn and van der Vaart (2012), the posterior distribution of the centered and scaled parameter , where is the pseudo-true value, converges to a Normal distribution with a variance that now differs from the variance of the frequentist estimator. Interestingly, this convergence holds only if increases more slowly than in the correctly specified case. This can be interpreted as limiting the number of implied unconditional moments to limit the magnification of the misspecification.

We informally use these rate conditions from the theoretical analysis to guide the range of choice of for any given . Because, for a fixed , the volume (prior probability content) of the region of truncation decreases with (a result of more restrictions), values of beyond the range recommended by the theory amplify the Bayesian bias and, hence, should be avoided. Large values of can also produce rank-deficiency of the approximating functions basis matrix and, in the event of a misspecified model, increase misspecification. Around the values of we recommend, the posterior distribution is generally robust to , and little fine-tuning is necessary.

Finite sample summaries of the posterior distribution are obtained by Markov chain Monte Carlo (MCMC) methods. Since the posterior is underpinned by a non-parametric likelihood, and the effective prior is truncated, efficient sampling is not automatic. However, after extensive study, we have produced a near-black-box MCMC approach (available as an R package) that is based on the tailored Metropolis-Hastings (M-H) algorithm of Chib and Greenberg (1995) and its randomized version in Chib and Ramamurthy (2010).

The entire paper is interspersed with examples of pedagogical importance and practical relevance. Real data applications to risk-factor determination in finance, and causal inference under conditional ignorability, are included.

It is worth noting that previous Bayesian work on conditional moments, for example, Liao and Jiang (2011), Florens and Simoni (2012, 2016), Kato (2013), Chen, Christensen and Tamer (2018) and Liao and Simoni (2019), has little overlap with the discussion here. A major difference is that none of these papers adopts the fully Bayesian ETEL framework. Another is that these papers examine a different class of CM models. Finally, none of these papers takes up the question of model comparisons. Nonetheless, these papers and the current work, taken together, represent an important broadening of the Bayesian enterprise to new classes of models.

The rest of the paper is organized as follows. Section 2 sketches the conditional moment setting. Section 3 discusses the prior-posterior analysis and the large-sample properties of the posterior distribution. Section 4 is concerned with the problem of comparing CM models via marginal likelihoods. Section 5 considers two extensions, and Section 6 presents real data applications to finance and causal inference. Section 7 concludes. Proofs are in the online supplementary appendix.

## 2 Setting and Motivation

Let be an -valued random vector and be an -valued random vector. The vectors and have elements in common if the dimension of the subvector is non-zero. Moreover, we denote and its (unknown) joint distribution by . By abuse of notation, let also denote the associated conditional distribution. Suppose that we are given a random sample of . Hereafter, is the expectation with respect to and is the conditional expectation with respect to the conditional distribution associated with .

The parameter of interest is , which is related to the conditional distribution through the conditional moment restrictions

(2.1) |

where is a -vector of known functions. Many interesting and important models in statistics fall into this framework.

###### Example 1.

(Linear model with heteroscedasticity of unknown form) Suppose that

(2.2) |

where , and . This CM model is consistent with the data generating process (DGP) , where , and (independent) follow some unknown distribution , with , and the heteroscedasticity function is unknown. The restrictions

(2.3) |

where now is a vector of functions, additionally impose that is conditionally symmetric.

Note that in the foregoing example, the two unconditional moment conditions

(2.4) |

which assert that (i) has mean zero and (ii) is uncorrelated with , impose weaker restrictions but, if the CM model is correct, are less informative about .
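As a rough numerical illustration of this point, the following Python sketch simulates a heteroscedastic linear model and evaluates several sample moments at the true parameter values. The parameter values, the uniform covariate, the variance function `np.exp(0.5 * x)` and the normal errors are assumptions made only for this illustration; they are not taken from the example.

```python
import numpy as np

def simulate_heteroscedastic_linear(n, alpha=1.0, beta=1.0, seed=0):
    """Draw (y, x) from y = alpha + beta*x + sigma(x)*e with E[e | x] = 0."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    sigma = np.exp(0.5 * x)        # hypothetical heteroscedasticity function
    e = rng.standard_normal(n)     # hypothetical mean-zero error law
    return alpha + beta * x + sigma * e, x

y, x = simulate_heteroscedastic_linear(200_000)
resid = y - 1.0 - 1.0 * x          # residual at the true parameter values

# The two unconditional moments: zero mean, zero correlation with x.
m_mean = resid.mean()
m_cov = (resid * x).mean()

# Further implications of the conditional restriction: the residual is
# also uncorrelated with nonlinear functions of x.
m_sq = (resid * x**2).mean()
m_cos = (resid * np.cos(x)).mean()
print(m_mean, m_cov, m_sq, m_cos)
```

All four sample moments should be close to zero here. Only the first two are used by the unconditional formulation, so the information in the remaining ones is left on the table.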

## 3 Prior-Posterior Analysis

### 3.1 Expanded Moment Conditions

The starting point, as in the frequentist approaches of Donald and Newey (2001), Ai and Chen (2003) and Carrasco and Florens (2000), is a transformation of the CM restrictions into unconditional moment restrictions. Following Donald, Imbens and Newey (2003), let , , denote a -vector of real-valued functions of , for instance, splines. Suppose that these functions satisfy the following condition for the distribution .

###### Assumption 3.1.

For all , is finite, and for any function with there are vectors such that as ,

Now, let be the value of that satisfies (2.1) for the true . If , then Donald, Imbens and Newey (2003, Lemma 2.1) established that: (1) if equation (2.1) is satisfied with , then for all ; (2) if equation (2.1) is not satisfied by , then , for all large enough .

Henceforth, we let denote the expanded functions and refer to

(3.1) |

as the expanded moments. Under the stated assumptions, the expanded moments are equivalent to the CM restrictions (2.1), as .
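The construction of expanded moments can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a power-series basis stands in for the spline basis, the data generating process is invented for the example, and the helper names `power_basis` and `expanded_moments` are hypothetical.

```python
import numpy as np

def power_basis(z, K):
    """n x K matrix with columns 1, z, ..., z**(K-1) (stand-in for splines)."""
    return np.vander(z, N=K, increasing=True)

def expanded_moments(rho, B):
    """Row-wise Kronecker product of residuals rho (n x d) and basis B (n x K)."""
    n = rho.shape[0]
    return (rho[:, :, None] * B[:, None, :]).reshape(n, -1)

rng = np.random.default_rng(0)
n, K = 5000, 4
z = rng.uniform(-1.0, 1.0, size=n)
# Hypothetical heteroscedastic linear model with intercept 1 and slope 1.
y = 1.0 + 1.0 * z + np.exp(0.5 * z) * rng.standard_normal(n)

rho = (y - 1.0 - 1.0 * z)[:, None]   # d = 1 residual at the true parameter
B = power_basis(z, K)
G = expanded_moments(rho, B)         # n x (d*K) matrix of expanded moments
print(G.shape, G.mean(axis=0))
```

At the true parameter value, each column mean of `G` should be near zero, reflecting that the conditional restriction implies every one of the expanded unconditional moments.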

In our numerical examples, we construct using the natural cubic spline basis of Chib and Greenberg (2010), with fixed at a given value, as in sieve estimation. If consists of more than one element, say where and are continuous variables and is binary, then the basis matrix is constructed as follows. Let denote the sample data on . Let denote the matrix of the continuous data and interactions of the continuous data and the binary data. Now suppose , for are knots based on each column of and let denote the corresponding matrix of cubic spline basis functions. Then, is given by

where is the matrix in which each column of is subtracted from its first and then the first column is dropped, see Chib and Greenberg (2010). Thus, the dimension of this matrix is , where . If is large in relation to , data-compression methods can be employed. Specifically, let denote the orthogonal matrix of eigenvectors from the singular value decomposition of , and let denote the corresponding vector of eigenvalues. Then, after employing the rotation , the columns of corresponding to small values of are dropped, and the resulting column-reduced matrix is taken as the basis matrix. We refer to this as the rotated column reduced basis matrix. To define the expanded functions, let denote a vector of the th element of evaluated at the sample data matrix . Then, the expanded functions for the sample observations are obtained by multiplying by the matrix (or by the rotated column reduced ) and concatenating. We use versions of this approach in our examples.

### 3.2 Posterior distribution

We base the prior-posterior mapping, for each sample size and , on the nonparametric exponentially tilted empirical likelihood (ETEL) function. The ETEL has a fully Bayesian interpretation (Schennach, 2005) as an integrated likelihood, integrated over the prior on the data distribution that satisfies the given moments. Other such priors exist, for example, Kitamura and Otsu (2011), Shin (2014) and Florens and Simoni (2021), that lead to different integrated likelihoods.

The ETEL function takes the form

(3.2) |

where are the probabilities that minimize the Kullback-Leibler divergence between the probabilities assigned to each sample observation and the empirical probabilities , subject to the conditions that the probabilities sum to one and that the expectation under these probabilities satisfies the given unconditional moment conditions (3.1). Specifically, are the solution of the following problem:

(3.3) |

(see Schennach (2005) for a proof). In practice, the solution of this problem emerges from the dual (saddlepoint) representation (see e.g. Csiszar (1984)) as

(3.4) |

where is the estimated tilting parameter.
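The dual computation can be sketched as follows, using the standard form of the ETEL dual from Schennach (2005): the tilting parameter minimizes the sample average of `exp(lambda' g_i)`, and the probabilities are the normalized exponential tilts. The function names `etel_weights` and `log_etel`, the simulated data and the two moment functions are assumptions made for this illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def etel_weights(G):
    """Exponentially tilted probabilities for an n x m matrix of moments G."""
    m = G.shape[1]
    # Minimizing log(sum_i exp(lambda' g_i)) has the same minimizer as the
    # dual objective (1/n) sum_i exp(lambda' g_i) and is numerically safer.
    def obj(lam):
        return logsumexp(G @ lam)
    def grad(lam):
        v = G @ lam
        w = np.exp(v - logsumexp(v))   # tilted weights at the current lambda
        return G.T @ w
    res = minimize(obj, np.zeros(m), jac=grad, method="BFGS")
    lam = res.x
    v = G @ lam
    return np.exp(v - logsumexp(v)), lam

def log_etel(G):
    """Log ETEL value: sum of log tilted probabilities."""
    p, _ = etel_weights(G)
    return np.sum(np.log(p))

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=200)
theta = 0.2                            # hypothetical trial parameter value
G = np.column_stack([x - theta, (x - theta) ** 2 - 1.0])

p, lam_hat = etel_weights(G)
print(p.sum(), p @ G, lam_hat, log_etel(G))
```

The weights sum to one and the weighted moments `p @ G` vanish up to optimizer tolerance, which is the first-order condition of the dual problem. The sketch presumes that zero lies in the interior of the convex hull of the rows of `G`; otherwise the minimization has no finite solution.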

Let be the convex hull of for a given and denote its interior. Let denote the set of values for which the empirical moment conditions hold. Then, the posterior distribution is the truncated distribution given by

(3.5) |

where is the indicator function.

Combining the indicator function with the prior, we see that the (effective) prior is truncated to . This fact can be used to argue that, for fixed , it is not desirable to have a large . This is because as increases for a given , the support of the prior shrinks (equivalently, the prior probability content of the region of truncation decreases), due to the fact that more restrictions are imposed. We refer to this prior probability content by the shorthand, volume. Reduction in the volume tends to increase the Bayesian bias and reduce the posterior spread, without any change in the data, with deleterious impact on the posterior. In practice, we use the rule to fix . Larger values than this can, of course, be tried, but one should make sure that the volume of does not become much smaller than one. Around the values of we recommend, the posterior distribution is generally robust to , and little fine-tuning is necessary.
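The volume can be approximated by Monte Carlo: draw parameter values from the untruncated prior and record the fraction of draws for which zero lies in the interior of the convex hull of the expanded moments. The sketch below checks that condition with a linear program that looks for strictly positive convex weights. The linear model, the power-series basis, the Student-t prior with scale 5 and 2.5 degrees of freedom, and the names `zero_in_hull_interior` and `prior_volume` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def zero_in_hull_interior(G, tol=1e-9):
    """True if 0 is a strictly positive convex combination of the rows of G."""
    n, d = G.shape
    scale = np.abs(G).max(axis=0)
    scale[scale == 0] = 1.0
    G = G / scale                        # column scaling leaves the answer unchanged
    # Variables (w_1, ..., w_n, t): maximize t subject to w_i >= t,
    # sum_i w_i = 1 and sum_i w_i g_i = 0.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    A_eq = np.zeros((d + 1, n + 1))
    A_eq[:d, :n] = G.T
    A_eq[d, :n] = 1.0
    b_eq = np.zeros(d + 1)
    b_eq[d] = 1.0
    A_ub = np.hstack([-np.eye(n), np.ones((n, 1))])   # t - w_i <= 0
    b_ub = np.zeros(n)
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return bool(res.status == 0 and -res.fun > tol)

def prior_volume(y, x, K, n_draws=200, seed=0):
    """Monte Carlo estimate of the prior probability of the truncation region."""
    rng = np.random.default_rng(seed)
    B = np.vander(x, N=K, increasing=True)            # stand-in basis
    hits = 0
    for _ in range(n_draws):
        a, b = 5.0 * rng.standard_t(2.5, size=2)      # illustrative prior draw
        G = (y - a - b * x)[:, None] * B
        hits += zero_in_hull_interior(G)
    return hits / n_draws

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + x + np.exp(0.5 * x) * rng.standard_normal(50)
vol_small, vol_large = prior_volume(y, x, K=2), prior_volume(y, x, K=6)
print(vol_small, vol_large)
```

Because the basis for the larger value contains the basis for the smaller one, any parameter draw that passes the check with six basis functions also passes it with two, so the estimated volume cannot increase with the number of basis functions (up to solver tolerance).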

###### Example 1 (continued).

To illustrate the role of in the prior-posterior analysis, and its impact on the volume (prior probability content) of , we create a set of simulated data , , with covariates , intercept , slope , and is distributed according to , where is the skew normal distribution with location, scale, and shape parameters given by , each depending on . When is zero, is normal with mean . We set , so that .

Suppose that and . Model parameters are estimated solely from the condition , . The prior is the default independent Student-t distribution with location 0, dispersion 5, and degrees of freedom 2.5, truncated to . The posterior is computed for given by 2, , 9 and 20 (the value is based on the theory below for spline approximating functions). The results are shown in Table 1. Importantly, when is close to the value suggested by theory, the posterior distribution is robust to . However, when , quite different from the recommended value, the Bayesian bias is larger and the posterior standard deviation is smaller, without any change in the data. This is due to the effect of the prior, in the following way. As increases for a fixed , the volume of decreases, equivalently, the support of the prior distribution shrinks, as illustrated in Figure 1. This explains why values of close to the recommended value are preferred.

As an aside, if the true model were unconditional (i.e., without conditional heteroscedasticity), then there would be little loss in using more expanded moments: the extra moments are superfluous and hence do not change the effective support of the prior. In that case, no tangible cost is imposed, apart from the computational burden of carrying those moments along.

| Volume | Mean | SD | Median | Lower | Upper | Ineff |
|---|---|---|---|---|---|---|
| 0.76 | 1.07 | 0.10 | 1.07 | 0.88 | 1.27 | 1.10 |
| | 1.01 | 0.14 | 1.01 | 0.74 | 1.29 | 1.14 |
| 0.73 | 1.07 | 0.10 | 1.07 | 0.88 | 1.26 | 1.15 |
| | 1.03 | 0.12 | 1.03 | 0.80 | 1.25 | 1.08 |
| 0.68 | 1.07 | 0.09 | 1.07 | 0.89 | 1.25 | 1.14 |
| | 1.02 | 0.11 | 1.02 | 0.79 | 1.25 | 1.13 |
| 0.60 | 0.98 | 0.07 | 0.98 | 0.85 | 1.12 | 1.14 |
| | 1.10 | 0.10 | 1.10 | 0.91 | 1.29 | 1.13 |
| 0.54 | 0.99 | 0.07 | 0.99 | 0.86 | 1.13 | 1.11 |
| | 1.12 | 0.09 | 1.12 | 0.94 | 1.31 | 1.11 |

and 10, 15, 20. Results based on 20,000 MCMC draws beyond a burn-in of 1000. “Lower” and “Upper” refer to the 0.05 and 0.95 quantiles of the simulated draws, respectively, and “Ineff” to the inefficiency factor.

### 3.3 Asymptotic properties

Consider now the large sample behavior of the posterior distribution of . We let and , respectively, denote the true value of and of the data distribution . As notation, when the true distribution is involved, expectations (resp. ) are taken with respect to (resp. the conditional distribution associated with ). In addition, we denote ,

For a vector , denotes the Euclidean norm. For a matrix , denotes the operator norm (the largest singular value of the matrix). Finally, let denote the support of .

The first assumption is a normalization for the second moment matrix of the approximating functions which is standard in the literature, see e.g. Newey (1997) and Donald et al. (2003).

###### Assumption 3.2.

For each there is a constant scalar such that , has smallest eigenvalue bounded away from zero uniformly in , and .

The bound is known explicitly in a number of cases depending on the approximating functions we use. Donald et al. (2003) provide a discussion and explicit formulas for in the case of splines, power series and Fourier series. We also refer to Newey (1997) for primitive conditions for regression splines and power series.

###### Assumption 3.3.

(a) There exists a unique that satisfies for the true ; (b) the data , are i.i.d. according to ; (c) is bounded.

This assumption is the same as Donald et al. (2003, Assumption 3). The following three assumptions are also the same as the ones required by Donald et al. (2003) to establish asymptotic normality of the Generalized Empirical Likelihood (GEL) estimator.

###### Assumption 3.4.

(a) ; (b) is twice continuously differentiable in a neighborhood of , and , , are bounded on ; (c) is nonsingular.

###### Assumption 3.5.

(a) has smallest eigenvalue bounded away from zero; (b) for a neighborhood of , is bounded, and for all , and is bounded.

###### Assumption 3.6.

There is such that and .

Part (b) of Assumption 3.5 imposes a Lipschitz condition which allows application of uniform convergence results. The last assumption is about the prior distribution of and is standard in the Bayesian literature on frequentist asymptotic properties of Bayes procedures.

###### Assumption 3.7.

(a) is a continuous probability measure that admits a density with respect to the Lebesgue measure; (b) is positive on a neighborhood of .

We are now able to state our first major result in which we establish the asymptotic normality and efficiency of the posterior distribution of the local parameter .

###### Theorem 3.1 (Bernstein-von Mises).

We note that the centering of the limiting normal distribution satisfies . We also note that the condition in the theorem implies , which is a classical condition in the sieve literature. This condition is required to establish a stochastic Local Asymptotic Normality (LAN) expansion, which is an intermediate step to prove the BvM result, as we explain below. The LAN expansion is not required to establish asymptotic normality of the GEL estimators, which explains why our condition is slightly stronger than the condition required by Donald, Imbens and Newey (2003). On the other hand, our condition is weaker than the condition required by Donald, Imbens and Newey (2009) to establish the mean square error of the GEL estimators. The asymptotic covariance of the posterior distribution coincides with the semiparametric efficiency bound given in Chamberlain (1987) for conditional moment condition models. This means that, for every , -credible regions constructed from the posterior of are -confidence sets asymptotically.

The proof of this theorem is given in the supplementary appendix and consists of three steps. In the first step we show consistency of the posterior distribution of , namely:

(3.8) |

for any , as . To show this, the identification assumption (3.6) is used. In the second step we show that the ETEL function satisfies a stochastic LAN expansion:

(3.9) |

where denotes a compact subset of and . As the ETEL function is an integrated likelihood, expansion (3.9) is better known as integral LAN in the semiparametric Bayesian literature, see e.g. Bickel and Kleijn (2012, Section 4). In the third step of the proof we use arguments as in the proof of Van der Vaart (1998, Theorem 10.1) to show that (3.8) and (3.9) imply asymptotic normality of . While these three steps are classical in proving the Bernstein-von Mises phenomenon, establishing (3.9) raises challenges that are otherwise absent. This is because the ETEL function is a nonstandard likelihood that involves estimated parameters whose dimension is , which increases with . While and are expected to converge to zero in the correctly specified case, the rate of convergence is slower than . In the supplementary appendix we show that this rate is under the previous assumptions.

### 3.4 Misspecified model

We now generalize the preceding BvM result for the important class of misspecified conditional moment models.

###### Definition 3.1 (Misspecified model).

We say that the conditional moment conditions model is misspecified if the set of probability measures implied by the moment restrictions does not contain the true data generating process for any , that is, where and with the set of all conditional probability measures of .

In essence, if (2.1) is misspecified then there is no such that almost surely for every large enough. Now, for every define as the minimizer of the Kullback-Leibler divergence of to the model , where denotes the set of all the probability measures on . That is, , where . If we suppose that the dual representation of the Kullback-Leibler minimization problem holds, then the -density of has the closed form: , where denotes the tilting parameter and is defined in the same way as in the correctly specified case:

(3.10) |

We also impose a condition to ensure that the probability measures , which are implied by the model, are dominated by the true probability measure . This is required for the validity of the dual theorem. Therefore, following Sueishi (2013, Theorem 3.1), we replace Assumption 3.3 (a) by the following.

###### Assumption 3.8.

For every , there exists such that is mutually absolutely continuous with respect to , where and denotes the set of all the probability measures on .

This assumption implies that is non-empty. A similar assumption is also made by Kleijn and van der Vaart (2012) and Chib, Shin and Simoni (2018) to establish the BvM under misspecification. The pseudo-true value of the parameter is denoted by and is defined as the minimizer of the Kullback-Leibler divergence between the true and :

(3.11) |

where . Under the preceding absolute continuity assumption, the pseudo-true value is available as

(3.12) |

Note that , the value of the tilting parameter at the pseudo-true value , is nonzero because the moment conditions do not hold.
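A small numerical sketch may help fix ideas. It assumes, as our reading of the description around (3.11)-(3.12), that the pseudo-true value maximizes the expected log tilting density, and it replaces expectations by sample averages over a large simulated sample. The data generating process (normal with mean 0 and standard deviation 2), the misspecified moment functions (which wrongly fix the variance at one) and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def moments(x, theta):
    """Two moments: mean equals theta, and variance (wrongly) equals one."""
    u = x - theta
    return np.column_stack([u, u**2 - 1.0])

def tilt(G):
    """Tilting parameter: minimizer of log sum_i exp(lambda' g_i)."""
    res = minimize(lambda lam: logsumexp(G @ lam),
                   np.zeros(G.shape[1]), method="BFGS")
    return res.x

def neg_kl(x, theta):
    """Sample analog of E[log(exp(lam'g) / E exp(lam'g))] and the tilt used."""
    G = moments(x, theta)
    lam = tilt(G)
    v = G @ lam
    return v.mean() - (logsumexp(v) - np.log(len(x))), lam

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=2000)      # true variance is 4, not 1

grid = np.linspace(-1.0, 1.0, 41)
values = [neg_kl(x, t)[0] for t in grid]
theta_star = grid[int(np.argmax(values))]
best_value, lam_star = neg_kl(x, theta_star)
print(theta_star, lam_star, best_value)
```

In this design the population calculation can be done by hand: the tilted distribution is normal with mean equal to the trial parameter and unit variance, the divergence is minimized at zero, and the tilting parameter there is approximately (0, -3/8). The second component is nonzero precisely because the variance restriction fails under the true distribution.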

Assumption 3.8 implies that . We supplement this with the assumption that , (so that ). Because consistency in misspecified models is defined with respect to the pseudo-true value , we need to replace Assumption 3.7 (b) by the following Assumption 3.9 (b) which, together with Assumption 3.9 (a), requires the prior to put enough mass on balls around .

###### Assumption 3.9.

(a) is a continuous probability measure that admits a density with respect to the Lebesgue measure; (b) The prior distribution is positive on a neighborhood of , where is as defined in (3.12).

Hereafter, we use the sub/super index to denote an expectation, a variance or covariance taken with respect to the probability . The following assumption is analogous to the second part of Assumption 3.2 for the -density of replacing .

###### Assumption 3.10.

For each the matrix has smallest eigenvalue bounded away from zero uniformly in .

In the next assumption we denote by the interior of and by a ball centered at with radius for some and a compact subset of .

###### Assumption 3.11.

(a) The data , are i.i.d. according to and

(b) The pseudo-true value is the unique maximizer of

where ;

(c) is continuous at each with probability one;

(d) is twice continuously differentiable in a neighborhood of with probability one and for , , ;

(e) for a neighborhood of and for , it holds that

where is as defined in Assumption 3.2;

(f) for a neighborhood of and for , it holds that

where is as defined in Assumption 3.2;

(g) the matrix has smallest (resp. largest) eigenvalue bounded away from zero (resp. infinity);

(h) for a neighborhood of it holds that is bounded.

Assumption 3.11 (b) guarantees uniqueness of the pseudo-true value and is a standard assumption in the literature on misspecified models (see e.g. White (1982)). Assumptions 3.11 (d)-(f) are the counterparts of Assumptions 3.4 (b) and 3.5 (b), respectively, for the misspecified case. It is important to notice that they implicitly contain the first part of Assumption 3.2. The reason why we cannot separate the part involving the moment function (or its derivative) and the one involving in the assumption, as we do for the correctly specified model, is that the -density of cannot be factorized into a conditional density of given and a marginal density of independent of . In particular, in the misspecified case the pseudo-true value of the tilting parameter is not equal to zero, whereas the tilting parameter is zero in the correctly specified case. Assumption 3.11 (g) is the counterpart of Assumption 3.5 (a) for the misspecified case.

The BvM theorem for misspecified models now follows. Let , and . Moreover, let denote a compact subset of and , with .

###### Theorem 3.2 (Bernstein-von Mises (misspecified)).

Let Assumptions 3.1, 3.2, 3.8 - 3.11 hold. Assume that there exists a constant such that for any sequence ,

(3.13) |

as . If and , then the posteriors converge in total variation towards a Normal distribution, that is,

(3.14) |

where is any Borel set, is a random vector bounded in probability and is a positive definite matrix equal to the inverse of: