# An MCMC Approach to Empirical Bayes Inference and Bayesian Sensitivity Analysis via Empirical Processes

Consider a Bayesian situation in which we observe Y ∼ p_θ, where θ∈Θ, and we have a family {ν_h, h ∈H} of potential prior distributions on Θ. Let g be a real-valued function of θ, and let I_g(h) be the posterior expectation of g(θ) when the prior is ν_h. We are interested in two problems: (i) selecting a particular value of h, and (ii) estimating the family of posterior expectations { I_g(h), h ∈H}. Let m_y(h) be the marginal likelihood of the hyperparameter h: m_y(h) = ∫ p_θ(y) ν_h(dθ). The empirical Bayes estimate of h is, by definition, the value of h that maximizes m_y(h). It turns out that it is typically possible to use Markov chain Monte Carlo to form point estimates for m_y(h) and I_g(h) for each individual h in a continuum, and also confidence intervals for m_y(h) and I_g(h) that are valid pointwise. However, we are interested in forming estimates, with confidence statements, of the entire families of integrals { m_y(h), h ∈H} and { I_g(h), h ∈H}: we need estimates of the first family in order to carry out empirical Bayes inference, and we need estimates of the second family in order to do Bayesian sensitivity analysis. We establish strong consistency and functional central limit theorems for estimates of these families by using tools from empirical process theory. We give two applications, one to Latent Dirichlet Allocation, which is used in topic modelling, and the other is to a model for Bayesian variable selection in linear regression.

## Authors

• 2 publications
• 2 publications
02/07/2019

### Bias-Aware Confidence Intervals for Empirical Bayes Analysis

In an empirical Bayes analysis, we use data from repeated sampling to im...
07/22/2008

### Inference with Discriminative Posterior

We study Bayesian discriminative inference given a model family p(c,, θ)...
04/02/2020

### Bayesian model selection approach for colored graphical Gaussian models

We consider a class of colored graphical Gaussian models obtained by pla...
06/18/2020

### Bayesian Elastic Net based on Empirical Likelihood

Empirical likelihood is a popular nonparametric method for inference and...
03/31/2020

### Exact marginal inference in Latent Dirichlet Allocation

Assume we have potential "causes" z∈ Z, which produce "events" w with kn...
02/07/2020

### Empirical Bayes for Large-scale Randomized Experiments: a Spectral Approach

Large-scale randomized experiments, sometimes called A/B tests, are incr...
03/11/2019

### Bayesian Allocation Model: Inference by Sequential Monte Carlo for Nonnegative Tensor Factorizations and Topic Models using Polya Urns

We introduce a dynamic generative model, Bayesian allocation model (BAM)...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper is concerned with two related problems. In the first, there is a function , where is a subset of some Euclidean space, and we wish to obtain confidence sets for . For each , the expression for is analytically intractable; however, we have at our disposal a family of functions

and a sequence of random variables

(these are iid or the initial segment of an ergodic Markov chain) such that the random function satisfies for each . We are interested in how we can use to form both a point estimate and a confidence set for .

This problem appears in empirical Bayes analysis and under many forms in likelihood inference. In empirical Bayes analysis, the application that is the focus of this paper, it arises as follows. Suppose we are in a standard Bayesian situation in which we observe a data vector

whose distribution is (with density with respect to some dominating measure) for some . We have a family of potential prior densities , and because the hyperparameter can have a great impact on subsequent inference, we wish to choose it carefully. Selection of is often guided by the marginal likelihood of the data under the prior , given by

 my(h)=∫pθ(y)νh(θ)dθ,h∈H. (1.1)

By definition, the empirical Bayes choice of is . Unfortunately, analytic calculation of is not feasible except for a few textbook examples, and estimation of

via Monte Carlo is notoriously difficult—for example, the “harmonic mean estimator” introduced by

Newton and Raftery (1994) typically converges at a rate which is much slower than (Wolpert and Schmidler, 2012).

It is very interesting to note that if is a constant, then the information regarding given by the two functions and is the same: the same value of maximizes both functions, and the second derivative matrices of the logarithm of these two functions are identical. In particular, the Hessians of the logarithm of these two functions at the maximum (i.e. the observed Fisher information) are the same and, therefore, the standard point estimates and confidence regions based on and are identical. This is a very useful observation because it turns out that it is usually easy to estimate the entire family for a suitable choice of . Indeed, for any , let denote the posterior corresponding to , let be fixed but arbitrary, and suppose that are either independent and identically distributed according to the posterior , or are the initial segment an ergodic Markov chain with invariant distribution . Let be the likelihood function. Note that given by (1.1) is the normalizing constant in the statement “the posterior is proportional to likelihood times the prior,” i.e.

 νh,y(θ)=ℓy(θ)νh(θ)/my(h). (1.2)

We have

 1nn∑i=1νh(θi)νh1(θi)a.s.⟶∫νh(θ)νh1(θ)νh1,y(θ)dθ=my(h)my(h1)∫νh,y(θ)νh1,y(θ)νh1,y(θ)dθ=my(h)my(h1), (1.3)

in which the first equality follows from (1.2) and cancellation of the likelihood. Let . Since is a fixed constant, as noted above, the two functions and give exactly the same information about . If we let , then —this quantity is computable, since it involves only the priors and not the posteriors—so we have precisely the situation discussed in the first paragraph of this paper. Other examples of this situation arising in frequentist inference, and in particular in missing data models, are given in Sung and Geyer (2007) and Doss and Tan (2014).

In Bayesian applications it is rare that Monte Carlo estimates of posterior quantities can be based on iid samples; in the vast majority of cases they are based on Markov chain samples, and that is the case that is the focus of this paper. We show that under suitable regularity conditions,

 argmaxhBn(h)a.s.⟶argmaxhB(h) (1.4)

and

 n1/2(argmaxhBn(h)−argmaxhB(h))d→N(0,Σ), (1.5)

where can be estimated consistently. Now, in general, almost sure convergence of to pointwise is not enough to imply that converges to under any mode of convergence, and in fact it is trivial to construct a counterexample in which and are deterministic functions defined on , for every , but does not converge to . To obtain results (1.4) and (1.5) above, some uniformity in the convergence is needed. We establish the necessary uniform convergence and show that (1.4) and (1.5) are true under certain regularity conditions on the sequence , the functions , and the function . Result (1.5) enables us to obtain confidence sets for .

The second problem we are interested in pertains to the Bayesian framework discussed earlier and is described as follows. Suppose that is a real-valued function of , and consider , the posterior expectation of given , when the prior is . Suppose that is fixed but arbitrary, and that is an ergodic Markov chain with invariant distribution . A very interesting and well-known fact, which we review in Section 2.3, is that for any , if we define

 w(h)i=[νh(θi)/νh1(θi)]∑nl=1[νh(θl)/νh1(θl)],

then

 ^Ig(h)=n∑i=1g(θi)w(h)i (1.6)

is a consistent estimate of . Clearly is a weighted average of the ’s. Under additional regularity conditions on the Markov chain and the function , we even have a central limit theorem (CLT):

, and we can consistently estimate the limiting variance. Thus, with a single Markov chain run, using knowledge of only the priors and not the posteriors, we can estimate and form confidence intervals for

for any particular value of . Now in Bayesian sensitivity analysis applications, we will be interested in viewing for many values of . For example, in prior elicitation settings, we may wish to find those aspects of the prior that have the biggest impact on the posterior, so that the focus of the effort is spent on those important aspects. We may also want to determine whether differences in the prior opinions of many experts have a significant impact on the conclusions. (For a discussion of Bayesian sensitivity analysis see Berger (1994) and Kadane and Wolfson (1998).) In these cases we will be interested in forming confidence bands for that are valid globally, as opposed to pointwise.

A common feature of the two problems we study in this paper is the need for uniformity in the convergence: to obtain confidence intervals for we need some uniformity in the convergence of to , and to obtain confidence bands for we need functional CLT’s for the stochastic process . Empirical process theory is a body of results that can be used to establish uniform almost sure convergence and functional CLT’s in very general settings. However, the results hold only under strong regularity conditions; and these conditions are often hard to check in practical settings—indeed the results can easily be false if the conditions are not met. Empirical process theory is fundamentally based on an iid assumption, whereas in our setting, the sequence is a Markov chain. In this paper we show how empirical process methods can be applied to our two problems when the sequence is a Markov chain, and we also show how the needed regularity conditions can be established.

The rest of the paper is organized as follows. In Section 2 we state our theoretical results, the main ones—those that pertain to the Markov chain case—being as follows. Theorem 3 asserts uniform convergence of to when the sequence is a Harris ergodic Markov chain, under certain regularity conditions on the family (the precise details are spelled out in the statement of the theorem), and we show how these regularity conditions can be checked with relative ease in standard settings. We then give a simple result which says that under a mild regularity assumption on , the condition entails . Theorem 4 establishes that under certain regularity conditions, we have asymptotic normality of . Theorem 6 establishes almost sure uniform convergence of to , and also functional weak convergence: the process converges weakly to a mean Gaussian process indexed by . We also show how this result can be used to construct confidence bands for that are valid globally. A by-product is functional weak convergence of to a mean Gaussian process indexed by , and construction of corresponding globally valid confidence bands for . In Section 3 we give two illustrations on Bayesian models in which serious consideration needs to be given to the effect of the hyperparameter and its choice. The first is to the Latent Dirichlet Allocation topic model, where we show how our methodology can be used to do sensitivity analysis, and the second is to a model for Bayesian variable selection in linear regression, where we show how our methodology can be used to select the hyperparameter. In the Appendix we provide the proofs of all the theorems except for Theorem 3; additionally, we show how the regularity conditions in Theorem 1 and Theorem 3 would typically be checked, and we verify these conditions in a simple setting.

## 2 Convergence of Bn(⋅) as a Process and Convergence of the Empirical Argmax

This section consists of three parts. Section 2.1 deals with uniform convergence of for the iid case, and introduces the framework that will enable us to obtain results for the Markov chain case; this framework will be used in Section 2.1 and in the rest of the paper. Section 2.2 deals with point estimates and confidence sets for , and Section 2.3 deals with uniform convergence and functional CLT’s for estimates of posterior expectations. Throughout, uniformity refers to a class of functions indexed by .

### 2.1 Uniform Convergence of Bn(⋅)

Let be a measurable subset of for some , and let

be a probability measure on

, where is the Borel sigma-field on . We assume that are independent and identically distributed according to , and we let be the empirical measure that they induce. We assume that is a convex compact subset of for some , and that for each ,

is measurable. The strong law of large numbers (SLLN) states that

 1nn∑i=1fh(θi)a.s.⟶∫fhdPif ∫|fh|dP<∞. (2.1)

Since we will be interested in versions of (2.1) that are uniform in , there will exist measurability difficulties, so we have to be careful in dealing with measurability issues. Before proceeding, we review some terminology and standard facts from the theory of empirical processes. We will use the following standard empirical process notation: for a signed measure on and a -integrable function , denotes . Let be an arbitrary probability measure on , suppose that are independent and identically distributed according to , and let be the empirical measure induced by . If is a class of functions mapping to , and is a signed measure on , we use the notation . We say that is Glivenko-Cantelli if converges to almost surely; sometimes we will say is -Glivenko-Cantelli, to emphasize the dependence on . Let . Our goal is to establish that is -Glivenko-Cantelli, which is exactly equivalent to the statement that the convergence in (2.1) holds uniformly in .

#### The IID Case

###### Theorem 1 (Theorem 6.1 and Lemma 6.1 in Wellner (2005))

Suppose that are independent and identically distributed according to . Suppose that is continuous in for -almost all . If is measurable and satisfies , then the class is -Glivenko-Cantelli.

Let and (the subscript to the expectation indicates that ). Then the conclusion of the theorem is the statement .

The integrability condition seems strong, and an even stronger integrability condition is imposed in Theorem 3. We discuss this issue in Remark 1 following the statement of Theorem 3, where we explain that in fact the two conditions are fairly easy to check in practice.

The next theorem also establishes that the class is Glivenko-Cantelli. In the theorem, the integrability condition on is replaced by an integrability condition on (here, is the gradient vector of with respect to , and is Euclidean norm). The condition on the gradient is sometimes easier to check. We include the theorem in part because a component of its proof is a key element in the proofs of Theorems 5 and 6 of this paper.

###### Theorem 2

Suppose that are independent and identically distributed according to , and that for each , . Assume also that for -almost all , exists and is continuous on . If is measurable and satisfies , then the class is -Glivenko-Cantelli.

#### The Markov Chain Case

Suppose now that the sequence is a Markov chain with invariant distribution , and that it is Harris ergodic (that is, it is irreducible, aperiodic, and Harris recurrent; see Meyn and Tweedie (1993, chapter 17) for definitions). Suppose also that for all . The best way to deal with the family of averages , is through the use of “regenerative simulation.” A regeneration is a random time at which a stochastic process probabilistically restarts itself; therefore, the “tours” made by the process in between such random times are iid. For example, if the stochastic process is a Markov chain on a discrete state space , and if is any point to which the chain returns infinitely often with probability one, then the times of return to form a sequence of regenerations. This iid structure will enable us to establish uniform convergence of the family . Before we explain this, we first note that for most of the Markov chains used in MCMC algorithms, the state space is continuous, and there is no point to which the chain returns infinitely often with probability one. Fortunately, Mykland et al. (1995) provided a general technique for identifying a sequence of regeneration times that is based on the construction of a minorization condition. This construction is reviewed at the end of this subsection, and gives rise to regeneration times with the property that

 E(τr−τr−1)<∞. (2.2)

Suppose now that there exists a regeneration sequence which satisfies (2.2). Such a Markov chain will be called regenerative. For any , consider . Let

 S(h)r=τr−1∑i=τr−1fh(θi),r=1,2,… (2.3)

be the sum of over the tour. Also, let , denote the length of the tour. The ’s do not involve . Note that the pairs are iid. If we run the chain for regenerations, then the total number of cycles is given by

 n=R∑r=1Nr=τR.

Also, . We have

 EP(fh(θ))a.s.⟵∑ni=1fh(θi)n=∑Rr=1S(h)r∑Rr=1Nr=(∑Rr=1S(h)r)/R(∑Rr=1Nr)/Ra.s.⟶E(S(h)1)E(N1). (2.4)

In (2.4), the convergence statement on the left follows from Harris ergodicity of the chain. The convergence statement on the right follows from two applications of the SLLN: By (2.2), and this, together with the convergence statement on the left, entails convergence of . The SLLN then implies that (if then the SLLN implies that with probability one). We conclude that . Note that continuity in of for almost all sequences follows from continuity in of for almost all , since with probability one, is a finite sum. Suppose in addition that is measurable and satisfies . Then by Theorem 1 we have . Since , we obtain

 suph∣∣∣(∑Rr=1S(h)r)/R(∑Rr=1Nr)/R−E(S(h)1)E(N1)∣∣∣a.s.⟶0,

i.e.

 suph∣∣∣∑ni=1fh(θi)n−EP(fh(θ))∣∣∣a.s.⟶0. (2.5)

We summarize this in the following theorem.

###### Theorem 3

Suppose that is a Harris ergodic Markov chain with invariant distribution for which there exists a regeneration sequence satisfying . Suppose also that is continuous in for -almost all . For each , let be defined by (2.3). If is measurable and satisfies , then (2.5) holds.

###### Remark 1

We now discuss the integrability condition , and our discussion encompasses the weaker condition assumed in Theorem 1. Suppose that for all . In the Appendix we show that, because is assumed to be compact, it is often possible to prove that for some ,

 there exist h1,…,hd∈H and constants c1,…,cd such thatsuph|fh(θ)|≤d∑j=1cj|fhj(θ)| for all θ∈Θ. (2.6)

In this case, since , we obtain

 suph|S(h)1|≤τ1−1∑i=τ0suph|fh(θi)|≤τ1−1∑i=τ0d∑j=1cj|fhj(θi)|.

Hence,

 E(suph|S(h)1|)≤d∑j=1E(τ1−1∑i=τ0cj|fhj(θi)|)=d∑j=1cjEP(|fhj(θ)|)E(N1),

which is finite. Thus, checking that reduces to establishing (2.6). In the Appendix we consider the Bayesian framework discussed in Section 1, in which , where is a family of priors, and , the posterior distribution corresponding to the prior , where is fixed. We show that if is an exponential family, then condition (2.6) holds. Therefore, the integrability condition is satisfied in a large class of examples. Moreover, the method we use for establishing (2.6) can be applied to other examples as well.

###### Remark 2

The idea to transform results for the iid case to the Markov chain case via regeneration has been around for many decades. Levental (1988) also obtained a Glivenko-Cantelli theorem for the Markov chain setting. In essence, the difference between his approach and ours is that his starting point is a Glivenko-Cantelli theorem for the iid case which requires a condition involving the minimum number of balls of radius in that are needed to cover —he is using metric entropy. This condition is very hard to check. By contrast, our starting point is a Glivenko-Cantelli theorem for the iid case which is based on bracketing entropy—in brief, the main regularity condition is implied by the continuity condition in Theorem 3. This continuity condition is trivial to verify: the parametric families that we are working with in our Bayesian setting satisfy it automatically.

##### The Minorization Construction

We now describe a minorization condition that can sometimes be used to construct regeneration sequences. Let be the transition function for the Markov chain . The construction described in Mykland et al. (1995) requires the existence of a function , whose expectation with respect to is strictly positive, and a probability measure on , such that satisfies

 Kθ(A)≥s(θ)Q(A)for all θ∈Θ and % A∈B. (2.7)

This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by . Define the Markov transition function by

 Gθ(A)=Kθ(A)−s(θ)Q(A)1−s(θ).

Note that for fixed , is a probability measure. We may therefore write

 Kθ=s(θ)Q+(1−s(θ))Gθ,

which gives a representation of as a mixture of two probability measures, and . This provides an alternative method of simulating from . Suppose that the current state of the chain is . We generate . If , we draw ; otherwise, we draw . Note that if , the next state of the chain is drawn from , which does not depend on the current state. Hence the chain “forgets” the current state and we have a regeneration. To be more specific, suppose we start the Markov chain with and then use the method described above to simulate the chain. Each time , we have and the process stochastically restarts itself; that is, the process regenerates. Mykland et al. (1995) provided a very widely applicable method, the so-called “distinguished point technique”, for constructing a pair that can be used to form a minorization scheme which satisfies (2.2).

For any fixed , consider now the expression

 (∑Rr=1S(h)r)/R(∑Rr=1Nr)/R

in (2.4). The bivariate CLT gives

 R1/2((∑Rr=1S(h)r)/R−EP(fh(θ))E(N1)(∑Rr=1Nr)/R−E(N1))d→N(0,Σh), (2.8)

where

. (We have ignored the moment conditions on

and that are needed, but we will return to these conditions in Section 2.3, where we give a rigorous development of a functional version of the CLT (2.8), in which the left side of (2.8) is viewed as a process in .) The delta method applied to the function gives the CLT

 R1/2((∑Rr=1S(h)r)/R(∑Rr=1Nr)/R−EP(fh(θ)))d→N(0,σ2h),

where (and is evaluated at the vector of means in (2.8)). Moreover, can be estimated in a simple manner using a plug-in estimate. Whether or not this method gives estimates of variance that are useful in the practical sense depends on whether or not the minorization condition we construct yields regenerations which are sufficiently frequent. Successful constructions of minorization conditions have been developed for widely used chains in many papers (we mention in particular Mykland et al. (1995), Roy and Hobert (2007), Tan and Hobert (2009), and Doss et al. (2014)); nevertheless, successful construction of a minorization condition is the exception rather than the norm. In this context, we point out that here regenerative simulation is notable primarily as a device that enables us to prove the theoretical results in the present paper and to arrive at informative expressions for asymptotic variances, but it may be possible to estimate these variances by other methods; this point is discussed further in Section 2.2.

###### Remark 3

The main regularity assumption in Theorem 3 is the condition . Without giving the details, we mention that in analogy with Theorem 2, it is possible to give a version of Theorem 3 in which this condition is replaced with the condition .

### 2.2 A Consistent Estimator and Confidence Sets for argmaxhB(h)

This section pertains to as an estimator of . After establishing that (2.5) entails that is consistent, we show that under additional regularity conditions, (i) is asymptotically normal, and (ii) we can consistently estimate the asymptotic variance. Results (i) and (ii) enable us to form asymptotically valid confidence sets for .

###### Lemma 1

Suppose that is a compact subset of Euclidean space, and let and be deterministic real-valued functions defined on . Suppose further that is continuous and has a unique maximizer, and that for each the maximizer of exists and is unique. If converges to uniformly on , then the maximizer of converges to the maximizer of .

The proof of Lemma 1 is routine and is given in the Appendix. Consider now and . By Lemma 1, if is continuous and its maximizer is unique, then implies . Thus, under continuity of and uniqueness of its maximizer, any conditions that imply (2.5)—in particular the conditions of Theorems 12, or 3—are also conditions that imply strong consistency of as an estimator of .

Before stating the next theorem, we need to set some notation and assumptions. We assume that each of and has a unique maximizer, and we denote and . For a function , denotes the gradient vector and denotes the Hessian matrix. We will assume that for every , and exist and are continuous for all . Recall that is defined by (2.3). The Markov chain will be run for regenerations, and in the asymptotic results below, . We will use the notation , , , etc. For almost any realization , the random variable is a finite sum, and therefore . Similarly, . We will assume that the family is such that the interchange of the order of integration and either first or second order differentiation is permissible, i.e.

 ∇h∫fhdP=∫∇hfhdPand∇2h∫fhdP=∫∇2hfhdP. (2.9)

For , let

 J(h)=∇2hB(h),Jn(h)=∇2hBn(h),
 τ2(h)=[E(N1)]−2E([∇hS(h)1−N1EP(∇hfh(θ))][∇hS(h)1−N1EP(∇hfh(θ))]⊤),

and

 τ2n(h)=1R¯N2R∑r=1(∇hS(h)r−Nr∇h¯S(h)/¯N)(∇hS(h)r−Nr∇h¯S(h)/¯N)⊤.

Suppose that is a Markov chain on the measurable space and has as an invariant probability measure. Let be the -step Markov transition function. Recall that the chain is called geometrically ergodic if there exist a constant and a function such that for ,

 supA∈B|Kn(x,A)−π(A)|≤M(x)cnfor all x∈X.

If is a matrix, then a statement of the sort will mean for . We will refer to the following conditions.

• The chain is geometrically ergodic.

• For every , there exists such that .

• The function is twice continuously differentiable and the matrix is nonsingular.

• is measurable and .

• is measurable and .

• is measurable and .

• is measurable and has finite expectation.

###### Theorem 4

Suppose that is a regenerative Markov chain with invariant distribution . Let

 v2=J(h0)−1τ2(h0)J(h0)−1. (2.10)
1. [itemsep=0mm,topsep=2mm,leftmargin=4.5mm]

2. Under A1–A5

 R1/2(hn−h0)d→N(0,v2)as R→∞, (2.11)

and consequently

 n1/2(hn−h0)d→N(0,E(N1)v2)as R→∞. (2.12)
3. Under A1–A7, for large the matrix is invertible, and the variance estimate

 v2n=[Jn(hn)]−1τ2n(hn)[Jn(hn)]−1

is a strongly consistent estimate of .

###### Remark 4

In the expression for the asymptotic variance given by (2.10), the term is the variance of a certain function of the Markov chain, and the term measures the inverse of the curvature of at its maximum ( is a deterministic function and does not involve the Markov chain): the flatter the surface at its maximum, the higher is the asymptotic variance.

###### Remark 5

The integrability condition in Assumption A4 was discussed in Remark 1, where we showed that it is satisfied whenever there exist such that for all (cf. (2.6), in which without loss of generality we take the constants to be equal to .) The integrability conditions in A5–A7 are satisfied under (2.13) and (2.14) below, which are very similar to (2.6). To make our explanation notationally less cumbersome and easier to follow, we will assume that , so that , , , and are all scalars. Assume that there exist and constants such that

 suph|∇hfh(θ)|≤d∑j=1cj|∇h