# A Bayesian 'sandwich' for variance estimation and hypothesis testing

Many frequentist methods have large-sample Bayesian analogs, but widely-used "sandwich" or "robust" covariance estimates are an exception. We propose such an analog, as the Bayes rule under a form of balanced loss function, that combines elements of standard parametric inference with fidelity of the data to the model. Our development is general, for essentially any regression setting with independent outcomes. Besides being the large-sample equivalent of its frequentist counterpart, we show by simulation that the Bayesian robust standard error can faithfully quantify the variability of parameter estimates even under model misspecification – thus retaining the major attraction of the original frequentist version. We demonstrate some advantages of our Bayesian analog's standard error estimate when studying the association between age and systolic blood pressure in NHANES.

• 3 publications
• 2 publications
03/15/2018

### A two-stage method for estimating the association between blood pressure variability and cardiovascular disease: An application using the ARIC Study

The association between visit-to-visit blood pressure variability and ca...
08/22/2018

### Robust Spatial Extent Inference with a Semiparametric Bootstrap Joint Testing Procedure

Spatial extent inference (SEI) is widely used across neuroimaging modali...
04/04/2022

### Logical coherence in Bayesian simultaneous three-way hypothesis tests

This paper studies whether Bayesian simultaneous three-way hypothesis te...
09/22/2021

### On Reparameterization Invariant Bayesian Point Estimates and Credible Regions

This paper considers reparameterization invariant Bayesian point estimat...
11/19/2021

### History and Nature of the Jeffreys-Lindley Paradox

The Jeffreys-Lindley paradox exposes a rift between Bayesian and frequen...
04/05/2021

### Objective Bayesian meta-analysis based on generalized multivariate random effects model

Objective Bayesian inference procedures are derived for the parameters o...
03/08/2022

### On Robust Inference in Time Series Regression

Least squares regression with heteroskedasticity and autocorrelation con...

## 1 Introduction

Probabilistic models, with a finite number of parameters that characterizes the data generating distribution, are a standard statistical tool. Frequentist methods view the parameters as fixed quantities and use statements about replicate datasets to make inference, while the Bayesian paradigm uses the “language” of probability to directly describe uncertainty about parameters. This philosophical distinction can lead to differences in interpretation of analyses, depending on which framework is being used.

Despite this, in analysis of large samples standard methods under one approach tend to have close analogs under the other. For example, frequentist maximum likelihood estimates (MLEs) and Bayesian posterior means behave similarly in large samples, as do typical standard error estimates and posterior standard deviations — fundamentally due to Bernstein-von Mises’ theorem [§10.2]van2000asymptotic.

An exception occurs when we consider model violations. Frequentist inference in large samples is straightforward; under only mild regularity conditions “robust” covariance estimates royall1986model provide inference for the parameters consistently estimated by the MLE or other

-estimators, even under model misspecification. These methods are widely-used, extremely simple to calculate, and are available in standard software. Bayesian analogs are much less studied. One approach relies on the large sample property that the posterior mean approaches the point in the model family closest (in Kullback-Leibler divergence) to the true data-generating distribution kleijn2012bernstein. The Bayesian point estimates can therefore be interpreted as a form of ‘best guess’, but the corresponding posterior variance is extremely hard to interpret, even in large samples. For linear regression, szpiro2010model give a Bayesian analog of the classic “sandwich” covariance estimate, as the posterior variance of a specific linear summary of the parameter space, in the limiting situation of an extremely flexible model and prior – essentially a form of ‘non-parametric Bayes’ dunson2010nonparametric. The result is not, however, available for other regression estimates, nor is it straightforward to use together with subjective prior information about parameters. A general Bayesian analog of sandwich estimates is yet to be developed.

In this article, we propose such a Bayesian analog, as the Bayes rule with respect to a form of balanced loss function zellner1994bayesian. We introduce the balanced loss function and its theoretic properties in Section 3. The balanced loss function comprises two terms, penalizing estimation error and lack of model fit respectively. In Section 4, we demonstrate the proposed Bayesian robust variance with several examples and use simulation studies to show its behaviors in various settings. In Section 5, we use Bayesian robust variance to quantify the uncertainty in an analysis to study the association between systolic blood pressure and age using a dataset from 2017-2018 National Health and Nutrition Examination Survey (NHANES). We conclude with a discussion in Section 6. Proofs of most theorems are deferred to the Appendix.

## 2 Inference Loss functions

For general observations we denote as an independent and identically distributed random sample from distribution with density with respect to a measure , indexed by a real parameter In our major focus on regression models, the ’s are the combined outcome and explanatory variables, i.e. where is the outcome variable and

is a real vector of explanatory variables. For Bayesian analysis, we use

to denote the vector of parameters, and write as the density function of the corresponding prior, supported on . We write as the posterior density of given the observed data. We denote and as the posterior expectation and posterior variance (or posterior variance-covariance matrix for vector-valued ), respectively, with respect to the posterior measure .

Suppose is the parameter of interest and a corresponding estimate. Using decision theory, optimal Bayesian estimates, known as Bayes rules, are obtained by minimizing the posterior risk , where is a loss function describing the discrepancy between the decision rule and the parameter. Common loss functions include the -loss and -loss , for which the Bayes rules are the posterior median and posterior mean, respectively. For a more comprehensive review of decision theory and optimality of Bayes rules, see parmigiani2009decision.

To give rules that estimate but also indicate the precision of that estimate, we need more general loss functions. To achieve both those goals we consider what we shall call the inference loss function

 LI(θ,d,Σ)=log|Σ|+(θ−d)TΣ−1(θ−d), (1)

where is a positive definite matrix. The right-hand term is a sum of adaptively-weighted losses, each of the form , where the weights are determined by the elements of the matrix . To discourage the weight for any combination of the from reaching zero, we penalize by the left term, the log of the determinant of . Theorem 1 gives characteristics of the inference loss function.

###### Theorem 1.

The Bayes rule with respect to the inference loss in (1) is and , which gives the minimized posterior risk .

By Theorem 1, the Bayes rules for and with respect to are the posterior mean and variance. Bernstein-von Mises’ Theorem implies that, in the frequentist sense, the posterior variance and the variance of the posterior mean are asymptotically equivalent, i.e. . As a result, the inference loss function provides estimates and approximations of those estimates’ error in both the Bayesian and frequentist senses.

We note that the inference loss function is not new: it was previously discussed by dawid1999coherent, who noted that it corresponds to using the log-determinant of the covariance matrix as a criterion to compare study designs. holland2015minimum also considered the inference loss as an example of a ‘proper’ loss function. However, to our knowledge, the simple and general form of its Bayes rule is not known, nor are the extensions we provide below.

## 3 Balanced Inference Loss functions for model-robust variance estimation

We extend the inference loss function and present a Bayesian analog of robust variance estimator, as the Bayes rules for a loss which penalizes both the lack of model fit and estimation error. Our loss function is motivated by the balanced loss functions originally proposed by zellner1994bayesian. Balanced loss functions have since been developed considerably, but all share a common form in which the loss is a weighted average of a term indicating lack-of-fit in some way, and one indicating estimation error. The general form of the balanced loss function we will consider is

 LBI(θ,d,Σ,Ω)=log|Σ|+(θ−d)TΩΣ−1(θ−d)Estimation error+1nn∑i=1˙li(θ)T(ΩIn(θ))−1˙li(θ)Lack of fit, (2)

where is the score function based on a single observation, is the Fisher information, and we make decisions , and and which are positive-definite weighting matrices. We call in (2) the balanced inference loss function.

The first term in is essentially the inference loss from (1), described earlier. However the balanced inference loss describes an additional decision, matrix , that sets rates of trade-off between elements of the estimation component of the inference loss versus the data-dependent penalty

, a form of ‘signal to noise’ ratio for deviations from the model; the score evaluated at any

describes discrepancies between the data and corresponding model, and the Fisher information matrix describes the uncertainty in those deviations. Furthermore, the penalty term is invariant to one-to-one transformations of .

We proceed to consider the Bayes rules for the balanced inference function.

###### Theorem 2.

The Bayes rule with respect to the inference loss in (2) is

 ^dn=EΠn(ϑ|Zn),^Ω=E(1nn∑i=1˙li(ϑ)˙li(ϑ)TIn(ϑ)−1|Zn),^Σn=VarΠn(ϑ|Zn)^Ω.
###### Proof.

See Appendix B. ∎

From Theorem 2 we see that the posterior mean remains the Bayes rule for estimate , but the decision for is now a scaled version of the posterior variance, where the rescaling term is – essentially – the posterior expectation of the ‘lack of fit’ terms in (2), omitting , and can be considered as a ‘signal-to-noise’ measure based around score .

In Theorem 3, we show that estimates the large-sample variance of in a model-robust manner.

###### Theorem 3.

Let be the minimal Kullback-Leibler point. Under the regularity conditions in Appendix C, we have

 ^Σn/nP0→I(θ∗)−1E0[˙li(θ∗)˙li(θ∗)T]I(θ∗)−1,

where the right-hand side is also the asymptotic variance of derived in [kleijn2012bernstein]:

 √n(^dn−θ∗)d→N(0,I(θ∗)−1E0[˙li(θ∗)˙li(θ∗)T]I(θ∗)−1).

Theorem 3 indicates that is asymptotically equivalent to the asymptotic variance of , under only mild regularity conditions that do not require fully-correct model specification, and is thus a Bayesian analog of the frequentist “robust” variance estimator.

As well as providing a Bayesian approach to these empirically-useful methods, our work also allows us to construct Bayesian analogs of Wald-type confidence intervals. We define the two-sided Bayesian robust confidence interval at level

for the th entry of as

 ^dn,j±z1−α2^Σ12n,jj,

where is the

-quantile of the standard normal distribution. In Section

4, we show by simulation studies that intervals constructed in this way can have appealing operating characteristics.

Besides leading to model-robust inference, we note that a slightly simpler version of the influence loss function – with a univariate correction term instead of a matrix-valued – provides a Bayesian analog of quasi-likelihood models mccullagh1983quasi. Details are provided in Appendix D.

As well as providing a Bayesian approach to these empirically-useful methods, our work also allows us to construct Bayesian analogs of Wald-type confidence intervals. We define the two-sided Bayesian robust confidence interval at level for the th entry of as

 ^dn,j±z1−α2^Σ12n,jj,

where is the -quantile of the standard normal distribution. As we describe below, intervals constructed in this way can have appealing operating characteristics.

## 4 Examples and simulation studies

In this section we show, through analytic examples and simulation studies, that the Bayesian robust variance quantifies the variability of the Bayes point estimates in large samples, even in the presence of model misspecification.

### 4.1 Estimating a Normal mean, but misspecifying the variance

As a deliberately straightforward first example, suppose a random sample is drawn from where mean is unknown mean and variance is known. We are interested in estimating but wrongly assume the observations have variance . In other words, we use the model , in which the variance is misspecified. The analysis also uses prior , for known .

The balanced loss function in this setting is

 LBI(θ,d,σ2,ω)=log|σ2|+ω(θ−d)2/σ2+1nωn∑i=1(Yi−θ)2

and the Bayes rule sets

 ^dn=μ+η2∑ni=1Yinη2+1,^ωn=1nn∑i=1(Yi−¯Y)2+η2nη2+1+(μ−¯Ynη2+1)2, ^σ2n=η2n(nη2+1)n∑i=1(Yi−¯Y)2+(η2nη2+1)2+η2(μ−¯Y)2(nη2+1)3.

Letting go to infinity, the point estimate is equivalent to the sample average and the variance estimate is equivalent to , i.e. the asymptotic inference that would be obtained under a correct model specification. The underlying approach remains fully parametric, and retains the (valuable) coherence of Bayesian analysis, together with the appealing normative element of decision theory, where a clear statement about the goal of the analysis (here, balancing estimation of with fidelity to the data) makes the our choice of what to report (i.e. ) both automatic and optimal.

### 4.2 Estimating the variance of Generalized Linear Model regression coefficients

The balanced inference loss function takes an appealingly straightforward form when the data are independent observations , from a generalized linear model (GLM), i.e. an exponential family where the distribution has the density

 p(Yi|θ,α)=exp(Yiθi−b(θi)α+c(Yi,α))

for functions , and scalars and  mccullagh1989generalized. We assume the presence of -dimensional explanatory variables augmenting each observation , with a link function connecting the mean function and the linear predictors via , where is a -vector of regression coefficients. We write and .

With a GLM, the score function with respect to for a single observation is and the empirical Fisher information is , where is the matrix with elements , , and is the diagonal matrix with the -th diagonal element . The balanced inference loss is therefore

 LBI(β,d,Σ,Ω) =log|Σ|+(β−d)TΩΣ−1(β−d)+ n∑i=1(Yi−μi)2αVi⋅(∂μi∂β)T⎡⎣Ωn∑j=1(∂μj∂β)(∂μj∂β)T⎤⎦−1(∂μi∂β), (3)

in which we observe that the data only enters the balancing term via the terms , which can be thought of as a Bayesian analog of (squared) Pearson residuals. The other components in the balancing term determine how much weight these ‘squared’ terms receive. The balanced inference loss’s Bayes rule sets to be the usual posterior mean, and

 ^Σn=VarΠn(β|Zn)⋅EΠn{DTV−1diag{[Y−μ]2}V−1D[DTV−1D]−1/α∣∣Zn},

which can be thought of as the model-based posterior variance ‘corrected’ by a term that assesses fidelity of the data to the model. As with the general case, asymptotic equivalence with the classical “robust” approach holds.

When the canonical link function is used, i.e. where , the Bayesian rule further simplifies to

 ^Σn=VarΠn(β|Zn)⋅EΠn{XTdiag{[Y−μ]2}X[XTVX]−1/α|Zn},

where the interpretation above can also be seen to hold.

#### Example: Linear regression

For observations and the explanatory variables , a Bayesian linear regression model assumes . The corresponding balanced inference loss is

 LBI(β,d,Σ,Ω)=log|Σ|+(β−d)TΩΣ−1(β−d)+n∑i=1(Yi−XTiβ)2σ2⋅XTi[Ωn∑j=1XjXTj]−1Xi,

in which the novel penalty term is a sum of squared Pearson errors, weighted by a modified form of leverage ([mccullagh1989generalized] Section 12.7.1, page 405) which includes a term inside the familiar ‘hat’ matrix . The Bayes rule for is the Bayesian robust variance

 ^Σn=VarΠn(β|Zn)⋅EΠn(XTdiag((Y−Xβ)2)X(XTX)−1/σ2|Zn).

We perform a simulation study to investigate the performance of Bayesian robust variance for linear regression. We generate univariate explanatory variables as where independently, and generate outcome variables as for , for various constants . When , the mean model, is quadratic in , and hence the model is misspecified.

We suppose interest lies in linear trend parameter and is the posterior mean, i.e. the Bayes rule for its estimation. We use weakly informative priors for and . For 1000 simulation under each setting, we report the average posterior mean (), the standard error of (), the average posterior standard deviation of (), and the average Bayesian robust standard error estimate ().

Table 1 compares the posterior standard deviation and the Bayesian robust standard error estimate, as estimates of the standard error of the Bayes rule. In all the scenarios except when (noted in grey, indicating that the linear regression model is correctly specified), while there is a recognizable difference between the true standard error and the posterior standard deviation, the Bayesian robust standard error is notably closer to the true standard error.

#### Example: Poisson regression

Poisson regression is a default method for studying the association between count outcomes and the explanatory variables agresti1990. With count observations and corresponding explanatory variables (which may contain an intercept) a Poisson regression model assumes . The balanced inference loss is

 LBI(β,d,Σ,Ω) =log|Σ|+(β−d)TΩΣ−1(β−d)+ n∑i=1(Yi−exp(XTiβ))2XTi[n∑j=1XjXTjexp(XTjβ)]−1Xi,

which is again interpretable as a weighted sum of squared Pearson errors, and the weights are again modified versions of leverage. The Bayes rule for is another Bayesian robust variance,

 ^Σn=VarΠn(β|Zn)⋅EΠn{XTdiag([Y−exp(Xβ)]2)X[XTdiag(exp(Xβ))X]−1|Zn},

where and .

To evaluate the method, as before we generate explanatory variables , where , and generate the outcomes as , for various constants and . The model is misspecified when . We assume the parameter of interest is the log risk ratio and is the posterior mean, the corresponding Bayes rule. We use weakly informative priors for . For each setting we perform 1,000 replicates, and report the same measures as for linear regression.

Table 2 shows the results. Again, the Bayesian robust standard error is a notably better estimate of the true standard error of than the posterior standard deviation of ; it performs better when the model is misspecified and no worse when correctly specified.

### 4.3 Estimating the variance of log hazards ratio with an exponential proportional hazards model

We demonstrate how our framework can be extended beyond GLMs with an association analysis for time-to-event outcomes and explanatory variables.

Suppose the observed data are for , where is the observed survival time, is the event indicator, is the true survival time, is the censoring time and is a -vector of covariates.

Suppose we intend to model the association between the survival time and the covariates using the exponential proportional hazards model

 Si|Xi,β∼Exponential(exp(XTiβ)).

The balanced inference loss is

 LBI(β,d,Σ,Ω) =log|Σ|+(β−d)TΩΣ−1(β−d)+ 1nn∑i=1(Δi−Tiexp(XTiβ))2XTi⎧⎨⎩Ω⎡⎣n∑j=1Tjexp(XTjβ)XjXTj⎤⎦⎫⎬⎭−1Xi

In this case, the interpretation of the balancing term is less clear than before. We note, however, for those subjects who experienced events (i.e. ), their contribution to the balancing term can be written as

 (Ti−Δiexp(−XTiβ))2exp(−2XTiβ)XTi⎧⎨⎩Ω⎡⎣n∑j=1Tjexp(XTjβ)XjXTj⎤⎦⎫⎬⎭−1Xi,

which, again, is their Pearson residual weighted by some version of their leverage.

The Bayes rule for is a Bayesian robust covariance matrix

 ^Σn= VarΠn(β|Zn)EΠn{XTdiag((Δi−Tiexp(XTiβ))2)X[XTdiag(Tiexp(XTiβ))X]−1|Zn}.

In a simulation study to investigate the robustness of different variance estimates in exponential proportional hazards models, we generated the covariates as where , for , and the survival times by a Weibull proportional hazards model

 Si|Xi,β∼Weibull(κ,exp(XTiβ)),

where the pdf of a random variable is given by

 f(t)=λκtκ−1exp(−λtκ),t>0.

We further performed administrative censoring at , . Our use of Exponential proportional model deviates from the true data generating distribution if .

We suppose the parameter of interest is the log hazards ratio in the model and use the prior distribution , . Again, we conduct 1000 simulation under each setting considered, and report the same measures as before.

Results are given in Table 3. Although slightly underestimating the standard error of the Bayes estimator when the model is correctly specified, in general the Bayesian robust standard error is a better estimate of the true standard error of than the posterior standard deviation of .

### 4.4 Bayesian robust confidence intervals

While appealingly robust in large samples, standard frequentist model-agnostic variance estimates often have unstable behavior in small samples kauermann2001note. We investigate whether their Bayesian analog might produce better small sample behavior, by the shrinkage/stability provided through the prior distribution. Using the posterior mean as estimate and Bayesian robust variance estimate , we construct the Bayesian analog of Wald-type “robust” confidence intervals. The proposed Bayesian robust confidence interval at confidence level for , , is

 ^dn,j±z1−α2^Σ12n,jj.

Figure 1

shows the coverage probabilities of these intervals, together with standard model-based 95% credible intervals and standard robust confidence intervals, using the examples in Section

4.2 and 4.3. We assume the same models and data distributions as in the previous sections. In all the model-misspecification scenarios, the posterior credible interval does not – as expected – have the nominal level coverage even in large samples. However, compared with the standard robust confidence intervals, the Bayesian robust confidence intervals have coverage probabilities closer to the nominal level. This suggests that the stability provided by even the relatively weak priors we have used is sufficient to improve the performance of the standard purely-empirical frequentist methods.

## 5 Application: Association of Systolic Blood Pressure and Age in 2017-2018 NHANES Data

We demonstrate the Bayesian robust standard error estimator using data from 2017-2018 National Health and Nutrition Examination Survey (NHANES) data. We are interested in the association between subjects’ systolic blood pressure systolic and their age among the subpopulation with age between 18 and 60 years old. To make the difference between the variance estimates more distinguishable, we randomly select a subset of 200 subjects for our analysis.

We used a simple linear regression model as a working model, with the average of two systolic blood pressure readings as the outcome and age and gender as covariates (Figure 2

left panel). We computed both the model-based standard error estimates (assuming homoscedasticity) and robust standard error estimates (resistent to heteroscedasticity and/or model misspecification) of the regression coefficients. For the corresponding Bayesian model, we postulate a normal linear regression model with the same outcome and covariates. We use weakly informative prior distributions for the unknown regression coefficients and the residual variance (Figure

2 right panel). We report the posterior mean as point estimates, and compute the posterior standard deviations and Bayesian standard errors for the regression coefficients. For the Bayesian analysis, we used Gibbs sampling in JAGS plummer2003jags with 3 chains and 30,000 iterations for each chain, where the first 18,000 iterations in each chain are burn-in and discarded from the analysis.

In Table 4, we report the point estimates and the variance estimates of the regression coefficients for the frequentist and the Bayesian linear regression models. We focus on the comparison of the variance estimates. The model-based standard errors/Bayesian posterior standard deviations given higher estimates of the standard error for all three regression coefficients than the frequentist/Bayesian robust standard errors. For the intercept and the regression coefficient for age, the Bayesian posterior standard deviations are nearly equal to the corresponding frequentist model-based standard errors, whereas the Bayesian robust standard error are approximately equal to the frequentist robust standard error. The four versions of standard error estimates for the regression coefficient of gender are not as distinguishable.

Code for the analysis and the dataset are available on GitHub (https://github.com/KenLi93/BRSE).

## 6 Discussion

In this article, we have proposed a balanced inference loss function, whose Bayes rules provide a Bayesian analog of the robust variance estimator. We have proven how the Bayesian robust variance converges to the true variance of the estimator of interest in large samples, under only mild regularity conditions. In our examples of GLMs and other forms of models, the corresponding loss function is seen to be a straightforward balance between losses for standard model-based inference, and a term that assesses how well the corresponding model actually fits the data that have been observed; the penalty employs terms and components that are familiar from diagnostic use of Pearson residuals, and from study of leverage. Using these methods we also proposed the Wald confidence intervals using Bayesian robust standard error, and found more stable behavior than their frequentist counterparts.

Our development of the robust variance estimates is fully parametric, pragmatically adjusting the standard model-based inference with a measure of how the corresponding model fits the data. This is not the only motivation available for robust variance estimates however; focusing only on linear regression, szpiro2010model proposed a Bayesian model-robust variance estimate by using highly-flexible models for the data distribution and clearly defining the parameter of interest in a model-agnostic manner. The extension of this essentially non-parametric approach beyond linear regression is unclear, and moreover its motivation can be challenging – much like the challenge of a frequentist development built purely on estimating equations, without parametric assumptions. In comparison, our proposed method does not require redefining the parameter of interest during the modeling, and the interpretation of the point estimates stays the same as in regular parametric statistics. The proposed Bayesian variance estimates do not require nonparametric Bayes components in the model specification as in szpiro2010model, thus greatly reducing the computation time and increasing the algorithmic stability. Our framework is general and can be applied to obtain the Bayesian robust variance for any regular parametric models.

Our work demonstrates that robust standard error estimates can be ubiquitous both in frequentist and in Bayesian statistics. Our approach enables the proper quantification of the variability of parameter estimates in Bayesian parametric models. The proposed balance inference loss function, through which the Bayesian robust standard error was derived, also provides insights on the source of discrepancy between the model-based and model-agnostic variance estimates.

## Appendix A Proof of Theorem 1

###### Proof.

The posterior risk is

 EΠn[L|Zn]=log|Σ|+EΠn[(ϑ−d)TΣ−1(ϑ−d)|Zn]=log|Σ|+tr(Σ−1EΠn[(ϑ−d)(ϑ−d)T|Zn]).

Writing , the posterior risk is

 log∣∣EΠn[(ϑ−d)(ϑ−d)T|Zn]∣∣−log|A|+tr(A)

in which the last two terms are respectively the sum of the eigenvalues and the eigenvalues of . This is minimized by setting so that all eigenvalues equal to 1, which occurs if and only if

is the identity matrix. Hence

.

To minimize the posterior risk with respect to , it remains to consider

By matrix determinant lemma, the above expression equals

 log|CovΠn(ϑ|Zn)|+log(1+(EΠn[ϑ|Zn]−d)TCovΠn[ϑ|Zn]−1(EΠn[ϑ|Zn]−d))+p

which is minimized by setting , which in turn means setting . This rule achieves minimized posterior risk .

## Appendix B Proof of Theorem 2

The expected posterior loss is

 EΠn[LBI|Zn] =log|Σ|+EΠn[(ϑ−d)TΩΣ−1(ϑ−d)∣∣∣Zn]+EΠn[1nn∑i=1˙li(ϑ)T(ΩIn(ϑ))−1˙li(ϑ)∣∣∣Zn] (4)

Similar to the proof of Theorem 1, the minimum of (4) with respect to with fixed is achieved by setting and . Substituting and in (4), we obtain that the minimal expected posterior loss with fixed is

 log|VarΠn[ϑ|Zn]|+p+log|Ω|+EΠn[1nn∑i=1˙li(ϑ)T(ΩIn(ϑ))−1˙li(ϑ)∣∣∣Zn] (5)

Again using the same method as in the proof of Theorem 1, the minimum of (5) is achieved by setting , which in turn gives the Bayes rule of :

 ^Σ=VarΠn[ϑ|Zn]E[1nn∑i=1˙li(ϑ)˙li(ϑ)TI−1n(ϑ)∣∣∣Zn].

## Appendix C Proof of Theorem 3

Let be the Fisher information. By the Bernstein von Mises Theorem under model misspecification kleijn2012bernstein, , which is the minimal Kullback-Leibler point. We see that

 n∑i=1˙li(ϑ)˙li(ϑ)TIn(ϑ)−1→P0E0[˙li(θ)˙li(θ)T]I(θ)−1

for any . The theorem below implies that under suitable regularity conditions,

 EΠn{n∑i=1˙li(ϑ)˙li(ϑ)TIn(ϑ)−1∣∣∣Zn}→P0E0[˙li(θ∗)˙li(θ∗)T]I(θ∗)−1.
###### Theorem 4.

Suppose is a random function and measurable on . Suppose the conditions in kleijn2012bernstein Theorem 2.3 and Corollary 4.2 hold. Further assume the following conditions hold:

1. is continuous in a neighborhood of and .

2. Uniform convergence of in a neighborhood of : for any , there exists and positive integer such that for any , we have

 P0(supθ1:∥θ1−θ∗∥<η0|An(θ1)−A(θ1)|>ϵ)<γ.
3. Asymptotic posterior uniform integrability: for any , we have

 limM→∞limsupn→∞P0(EΠn{∥An(θ)∥I(∥An(θ)∥>M)∣∣∣Z1,…,Zn}>ϵ)=0 (6)

Then .

###### Proof.

Step 1:

We show that as , for any . In other words, for any , , and , there exists a positive integer , such that for any , we have

 P0(Πn(∥An(ϑ)−An(θ∗)∥>ϵ∣∣∣Z1,…,Zn)>ξ)<2γ. (7)

Note that the additional conditions 1, 2 imply that for any , there exists and positive integer such that for any , we have

 P0(supθ1:∥θ1−θ∗∥<η0|An(θ1)−An(θ∗)|>ϵ)<γ. (8)

By Corollary 4.2 in kleijn2012bernstein, there exists a positive integer such that

for any .

Now that the left-hand-side of (7) is

 P0(Πn(∥An(ϑ)−An(θ∗)∥>ϵ,∥ϑ−θ∗∥1<η0∣∣∣Z1,…,Zn)>ξ)+ P0(Πn(∥An(ϑ)−An(θ∗)∥>ϵ,∥ϑ−θ∗∥1≥η0∣∣∣Z1,…,Zn)>ξ) ≤