1 Introduction
Probabilistic models, in which a finite number of parameters characterizes the data-generating distribution, are a standard statistical tool. Frequentist methods view the parameters as fixed quantities and make inference using statements about replicate datasets, while the Bayesian paradigm uses the “language” of probability to directly describe uncertainty about parameters. This philosophical distinction can lead to differences in the interpretation of analyses, depending on which framework is used.
Despite this, in the analysis of large samples, standard methods under one approach tend to have close analogs under the other. For example, frequentist maximum likelihood estimates (MLEs) and Bayesian posterior means behave similarly in large samples, as do typical standard error estimates and posterior standard deviations, fundamentally due to the Bernstein–von Mises theorem (van der Vaart 2000, §10.2).
An exception occurs when we consider model violations. Frequentist inference in large samples is straightforward: under only mild regularity conditions, “robust” covariance estimates (Royall 1986) provide inference for the parameters consistently estimated by the MLE or other estimators, even under model misspecification. These methods are widely used, extremely simple to calculate, and available in standard software. Bayesian analogs are much less studied. One approach relies on the large-sample property that the posterior mean approaches the point in the model family closest (in Kullback–Leibler divergence) to the true data-generating distribution (Kleijn and van der Vaart 2012). The Bayesian point estimates can therefore be interpreted as a form of ‘best guess’, but the corresponding posterior variance is extremely hard to interpret, even in large samples. For linear regression, Szpiro et al. (2010) give a Bayesian analog of the classic “sandwich” covariance estimate, as the posterior variance of a specific linear summary of the parameter space, in the limiting situation of an extremely flexible model and prior – essentially a form of ‘nonparametric Bayes’ (Dunson 2010). The result is not, however, available for other regression estimates, nor is it straightforward to use together with subjective prior information about parameters. A general Bayesian analog of sandwich estimates is yet to be developed.
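The frequentist “robust” covariance referred to above is simple to compute from a fitted model. As a minimal sketch (for ordinary least squares, with illustrative simulated data; none of the values are from this article), the Huber–White sandwich estimate combines the inverse model-based information (“bread”) with the empirical variance of the scores (“meat”):

```python
import numpy as np

def sandwich_cov(X, y, beta_hat):
    """Huber-White 'sandwich' covariance for OLS coefficients:
    bread @ meat @ bread, with bread = (X'X)^{-1} and
    meat = sum_i e_i^2 x_i x_i' (the empirical score variance)."""
    resid = y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X)
    meat = (X * (resid ** 2)[:, None]).T @ X
    return bread @ meat @ bread

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Heteroscedastic errors: the homoscedastic working model is misspecified
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
robust_se = np.sqrt(np.diag(sandwich_cov(X, y, beta_hat)))
# Model-based standard errors assume a constant error variance
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)
model_se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
```

Under heteroscedasticity of this kind, the robust standard error for the slope is noticeably larger than the model-based one, while both point estimates remain consistent.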
In this article, we propose such a Bayesian analog, as the Bayes rule with respect to a form of balanced loss function (Zellner 1994). We introduce the balanced loss function and its theoretical properties in Section 3. The balanced loss function comprises two terms, penalizing estimation error and lack of model fit respectively. In Section 4, we demonstrate the proposed Bayesian robust variance with several examples and use simulation studies to illustrate its behavior in various settings. In Section 5, we use the Bayesian robust variance to quantify uncertainty in an analysis of the association between systolic blood pressure and age, using a dataset from the 2017–2018 National Health and Nutrition Examination Survey (NHANES). We conclude with a discussion in Section 6. Proofs of most theorems are deferred to the Appendix.
2 Inference Loss functions
For general observations, we consider an independent and identically distributed random sample from a distribution with a density (with respect to a dominating measure) indexed by a real parameter vector. In our major focus on regression models, each observation combines the outcome and explanatory variables, i.e. an outcome variable together with a real vector of explanatory variables. For Bayesian analysis, we write the prior as a density function supported on the parameter space, and the posterior density as that of the parameters given the observed data. We denote the posterior expectation and posterior variance (or posterior variance-covariance matrix, for vector-valued parameters) with respect to the posterior measure.

Suppose the parameter of interest has a corresponding estimate. Using decision theory, optimal Bayesian estimates, known as Bayes rules, are obtained by minimizing the posterior risk, i.e. the posterior expectation of a loss function describing the discrepancy between the decision rule and the parameter. Common loss functions include the absolute-error and squared-error losses, for which the Bayes rules are the posterior median and posterior mean, respectively. For a more comprehensive review of decision theory and the optimality of Bayes rules, see Parmigiani and Inoue (2009).
To give rules that estimate the parameter but also indicate the precision of that estimate, we need more general loss functions. To achieve both goals we consider what we shall call the inference loss function
L{(d, V); θ} = log det(V) + (θ - d)^T V^{-1} (θ - d),    (1)
where the decision d is a point estimate and the decision V is a positive definite matrix. The right-hand term is a sum of adaptively weighted losses, each of the form (θ_i - d_i)(θ_j - d_j), where the weights are determined by the elements of V^{-1}. To discourage the weight for any combination of the parameters from reaching zero, we penalize with the left-hand term, the log of the determinant of V. Theorem 1 gives characteristics of the inference loss function.
Theorem 1.
The Bayes rule with respect to the inference loss in (1) sets the point estimate to the posterior mean and the matrix decision to the posterior variance, which gives a minimized posterior risk equal to the log-determinant of the posterior variance plus the dimension of the parameter.
By Theorem 1, the Bayes rules for the point estimate and the matrix decision with respect to the inference loss are the posterior mean and posterior variance. The Bernstein–von Mises theorem implies that, in the frequentist sense, the posterior variance and the variance of the posterior mean are asymptotically equivalent. As a result, the inference loss function provides estimates and approximations of those estimates’ error in both the Bayesian and frequentist senses.
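A quick Monte Carlo check of Theorem 1 is possible: treating draws from a known bivariate normal as a stand-in ‘posterior’, the posterior risk of the inference loss (taken here in the standard log-determinant-plus-quadratic form) is smallest at the posterior mean and covariance. All names and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in 'posterior': a bivariate normal with known mean and covariance
m = np.array([0.5, -1.0])
S = np.array([[1.0, 0.3], [0.3, 0.5]])
draws = rng.multivariate_normal(m, S, size=200_000)

def posterior_risk(d, V):
    """Monte Carlo posterior risk of the inference loss
    L((d, V); theta) = log det V + (theta - d)' V^{-1} (theta - d)."""
    diff = draws - d
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(V), diff)
    return np.log(np.linalg.det(V)) + quad.mean()

risk_opt = posterior_risk(m, S)      # Bayes rule: posterior mean and covariance
risk_d = posterior_risk(m + 0.2, S)  # perturb the point estimate
risk_V = posterior_risk(m, 2.0 * S)  # perturb the reported covariance
```

The risk at the optimum approaches log det(S) plus the dimension (here 2), matching the minimized posterior risk in Theorem 1; perturbing either decision increases the risk.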
We note that the inference loss function is not new: it was previously discussed by Dawid and Sebastiani (1999), who noted that it corresponds to using the log-determinant of the covariance matrix as a criterion to compare study designs. Holland (2015) also considered the inference loss as an example of a ‘proper’ loss function. However, to our knowledge, the simple and general form of its Bayes rule is not known, nor are the extensions we provide below.
3 Balanced Inference Loss functions for model-robust variance estimation
We extend the inference loss function and present a Bayesian analog of the robust variance estimator, as the Bayes rule for a loss which penalizes both lack of model fit and estimation error. Our loss function is motivated by the balanced loss functions originally proposed by Zellner (1994). Balanced loss functions have since been developed considerably, but all share a common form, in which the loss is a weighted average of a term indicating lack-of-fit in some way and one indicating estimation error. The general form of the balanced loss function we will consider is
(2) 
where the score function is computed from a single observation, the Fisher information is the variance of that score under the model, and we make decisions for the point estimate, a positive-definite matrix quantifying its precision, and a positive-definite weighting matrix. We call the loss in (2) the balanced inference loss function.
The first term in the balanced inference loss is essentially the inference loss from (1), described earlier. However, the balanced inference loss involves an additional decision, a weighting matrix that sets the rates of trade-off between elements of the estimation component of the inference loss and a data-dependent penalty, a form of ‘signal-to-noise’ ratio for deviations from the model; the score evaluated at any parameter value describes discrepancies between the data and the corresponding model, and the Fisher information matrix describes the uncertainty in those deviations. Furthermore, the penalty term is invariant to one-to-one transformations of the parameter.

We proceed to consider the Bayes rules for the balanced inference loss function.
Theorem 2.
The Bayes rule with respect to the balanced inference loss in (2) sets the point estimate to the posterior mean and the matrix decision to a rescaled version of the posterior variance.
Proof.
See Appendix B. ∎
From Theorem 2 we see that the posterior mean remains the Bayes rule for the point estimate, but the decision for the precision matrix is now a scaled version of the posterior variance, where the rescaling term is – essentially – the posterior expectation of the ‘lack of fit’ terms in (2), and can be considered a ‘signal-to-noise’ measure based on the score.
In Theorem 3, we show that this rescaled posterior variance estimates the large-sample variance of the posterior mean in a model-robust manner.
Theorem 3.
Let the minimal Kullback–Leibler point be the point in the model family closest, in Kullback–Leibler divergence, to the true data-generating distribution. Under the regularity conditions in Appendix C, we have
where the right-hand side is also the asymptotic variance of the posterior mean derived by Kleijn and van der Vaart (2012).
Theorem 3 indicates that the Bayesian robust variance is asymptotically equivalent to the asymptotic variance of the posterior mean, under only mild regularity conditions that do not require fully correct model specification; it is thus a Bayesian analog of the frequentist “robust” variance estimator.
As well as providing a Bayesian approach to these empirically useful methods, our work also allows us to construct Bayesian analogs of Wald-type confidence intervals. We define the two-sided Bayesian robust confidence interval at a given level for each entry of the parameter by centering at the posterior mean and using the corresponding quantile of the standard normal distribution together with the Bayesian robust standard error. In Section 4, we show by simulation studies that intervals constructed in this way can have appealing operating characteristics.

Besides leading to model-robust inference, we note that a slightly simpler version of the inference loss function – with a univariate correction term instead of a matrix-valued one – provides a Bayesian analog of quasi-likelihood models (McCullagh 1983). Details are provided in Appendix D.
4 Examples and simulation studies
In this section we show, through analytic examples and simulation studies, that the Bayesian robust variance quantifies the variability of the Bayes point estimates in large samples, even in the presence of model misspecification.
4.1 Estimating a Normal mean, but misspecifying the variance
As a deliberately straightforward first example, suppose a random sample is drawn from a Normal distribution with unknown mean and known variance. We are interested in estimating the mean, but wrongly assume the observations have a different, known variance; in other words, we use a Normal model in which the variance is misspecified. The analysis also uses a Normal prior for the mean, with known hyperparameters.
The balanced loss function in this setting is
and the Bayes rule sets
In the limit, the point estimate is equivalent to the sample average and the variance estimate is equivalent to the true sampling variance of that average, i.e. the asymptotic inference that would be obtained under a correct model specification. The underlying approach remains fully parametric, and retains the (valuable) coherence of Bayesian analysis, together with the appealing normative element of decision theory, where a clear statement about the goal of the analysis (here, balancing estimation of the mean with fidelity to the data) makes our choice of what to report both automatic and optimal.
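A numerical sketch of this example (with illustrative values: true variance 4, working-model variance 1, and the flat-prior limit in which the posterior mean is the sample average) shows the model-based posterior variance understating the sampling variance of the estimate, while the robust correction recovers it:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true = 4.0   # actual variance of the data (assumed value)
tau2_model = 1.0    # (wrong) variance assumed by the working model
n = 10_000
y = rng.normal(loc=3.0, scale=np.sqrt(sigma2_true), size=n)

# Flat-prior limit: the posterior mean is the sample average
mu_hat = y.mean()

model_var = tau2_model / n      # what the misspecified posterior reports
robust_var = y.var(ddof=1) / n  # sandwich-style correction: sigma^2 / n
```

Here `model_var` is far too small (by the factor by which the variance was misspecified), while `robust_var` is close to the true sampling variance of the mean.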
4.2 Estimating the variance of Generalized Linear Model regression coefficients
The balanced inference loss function takes an appealingly straightforward form when the data are independent observations from a generalized linear model (GLM), i.e. an exponential family where the distribution has a density determined by known functions and scalar parameters (McCullagh and Nelder 1989). We assume the presence of a multi-dimensional explanatory variable augmenting each observation, with a link function connecting the mean function and the linear predictor, formed from a vector of regression coefficients.
With a GLM, the score function for a single observation with respect to the regression coefficients, and the empirical Fisher information, take standard forms involving the design matrix and a diagonal weight matrix. The balanced inference loss is therefore
(3) 
in which we observe that the data enter the balancing term only via terms that can be thought of as a Bayesian analog of (squared) Pearson residuals. The other components in the balancing term determine how much weight these ‘squared’ terms receive. The balanced inference loss’s Bayes rule sets the point estimate to the usual posterior mean, and the matrix decision to
which can be thought of as the modelbased posterior variance ‘corrected’ by a term that assesses fidelity of the data to the model. As with the general case, asymptotic equivalence with the classical “robust” approach holds.
When the canonical link function is used, the Bayes rule further simplifies to
where the interpretation above can also be seen to hold.
Example: Linear regression
For continuous outcomes and explanatory variables, a Bayesian linear regression model assumes a Normal distribution for each outcome around its linear predictor. The corresponding balanced inference loss is
in which the novel penalty term is a sum of squared Pearson errors, weighted by a modified form of leverage (McCullagh and Nelder 1989, Section 12.7.1, page 405), which includes an extra term inside the familiar ‘hat’ matrix. The Bayes rule for the matrix decision is the Bayesian robust variance
We perform a simulation study to investigate the performance of the Bayesian robust variance for linear regression. We generate univariate explanatory variables independently from a Normal distribution, and generate outcome variables from a mean model that includes a quadratic term with coefficient c, for various constants c. When c ≠ 0 the mean model is quadratic in the explanatory variable, and hence the linear model is misspecified.
We suppose interest lies in the linear trend parameter, with the posterior mean as its estimate, i.e. the Bayes rule. We use weakly informative priors for the regression parameters. For 1,000 simulations under each setting, we report the average posterior mean, the standard error of the estimates across simulations, the average posterior standard deviation, and the average Bayesian robust standard error estimate.
Table 1 compares the posterior standard deviation and the Bayesian robust standard error estimate, as estimates of the standard error of the Bayes rule. In all scenarios except c = 0 (where the linear regression model is correctly specified), there is a recognizable difference between the true standard error and the posterior standard deviation, while the Bayesian robust standard error is notably closer to the true standard error.
n  c  Ave(Est.)  SE  Ave(Post. SD)  Ave(BRSE)
50  2  4.996  0.332  0.282  0.331 
50  1  2.000  0.216  0.203  0.220 
50  0  1.009  0.171  0.167  0.167 
50  1  3.999  0.227  0.203  0.220 
50  2  6.997  0.349  0.282  0.334 
100  2  4.980  0.233  0.198  0.233 
100  1  2.001  0.150  0.141  0.153 
100  0  1.001  0.115  0.117  0.117 
100  1  4.008  0.152  0.141  0.153 
100  2  6.994  0.223  0.197  0.231 
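The ingredients of the linear-regression penalty above (squared Pearson residuals and leverages from the ‘hat’ matrix) are easy to compute directly; the sketch below uses illustrative simulated data and plain least squares in place of the full Bayesian fit:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])   # usual unbiased variance estimate

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the familiar 'hat' matrix
leverage = np.diag(H)                  # per-observation leverages
pearson_sq = resid ** 2 / sigma2_hat   # squared Pearson residuals
```

The leverages sum to the number of regression parameters (the trace of the hat matrix), and the squared Pearson residuals average to roughly one under a correctly specified model; the balanced-loss penalty combines quantities of exactly this kind.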
Example: Poisson regression
Poisson regression is a default method for studying the association between count outcomes and explanatory variables (Agresti 1990). With count observations and corresponding explanatory variables (which may contain an intercept), a Poisson regression model assumes a log-linear mean. The balanced inference loss is
which is again interpretable as a weighted sum of squared Pearson errors, where the weights are again modified versions of leverage. The Bayes rule for the matrix decision is another Bayesian robust variance,
with the weight terms defined analogously to the linear regression case.
To evaluate the method, as before we generate univariate explanatory variables from a Normal distribution, and generate the outcomes from a model that includes an extra term, so that the Poisson regression model is misspecified when the corresponding constant is nonzero. We assume the parameter of interest is the log risk ratio, estimated by the posterior mean, the corresponding Bayes rule. We use weakly informative priors for the regression coefficients. For each setting we perform 1,000 replicates, and report the same measures as for linear regression.
Table 2 shows the results. Again, the Bayesian robust standard error is a notably better estimate of the true standard error of the Bayes estimate than the posterior standard deviation: it performs better when the model is misspecified and no worse when it is correctly specified.
n  c  Ave(Est.)  SE  Ave(Post. SD)  Ave(BRSE)
50  0.50  0.348  0.101  0.115  0.100 
50  0.25  0.576  0.091  0.098  0.090 
50  0.00  1.014  0.086  0.085  0.084 
50  0.25  1.788  0.134  0.070  0.118 
50  0.50  2.870  0.296  0.049  0.181 
100  0.50  0.344  0.066  0.080  0.068 
100  0.25  0.567  0.060  0.068  0.062 
100  0.00  1.005  0.059  0.058  0.058 
100  0.25  1.808  0.089  0.049  0.084 
100  0.50  2.954  0.163  0.034  0.136 
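The frequentist target that the Bayesian robust variance tracks in this setting can be sketched directly. Under an assumed overdispersed data-generating process (all values below are illustrative), a Poisson maximum likelihood fit paired with a sandwich covariance corrects the too-small model-based standard errors:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Overdispersed counts via a multiplicative gamma frailty (mean 1):
# the Poisson working model is misspecified
lam = np.exp(0.5 + 0.3 * x) * rng.gamma(shape=2.0, scale=0.5, size=n)
y = rng.poisson(lam)

# Poisson MLE by Newton-Raphson (canonical log link)
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)
    info = X.T @ (mu[:, None] * X)   # Fisher information
    beta += np.linalg.solve(info, score)

mu = np.exp(X @ beta)
bread = np.linalg.inv(X.T @ (mu[:, None] * X))
meat = X.T @ (((y - mu) ** 2)[:, None] * X)   # empirical score variance
sandwich = bread @ meat @ bread
model_se = np.sqrt(np.diag(bread))
robust_se = np.sqrt(np.diag(sandwich))
```

Because the conditional variance of the counts exceeds their mean here, the robust standard errors are larger than the model-based ones, in line with the pattern in Table 2.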
4.3 Estimating the variance of log hazards ratio with an exponential proportional hazards model
We demonstrate how our framework can be extended beyond GLMs with an association analysis for timetoevent outcomes and explanatory variables.
Suppose the observed data consist, for each subject, of the observed survival time, the event indicator, and a vector of covariates, where the observed survival time is the minimum of the true survival time and the censoring time.
Suppose we intend to model the association between the survival time and the covariates using the exponential proportional hazards model
The balanced inference loss is
In this case, the interpretation of the balancing term is less clear than before. We note, however, that for those subjects who experienced events, their contribution to the balancing term can be written as
which, again, is their Pearson residual weighted by some version of their leverage.
The Bayes rule for is a Bayesian robust covariance matrix
In a simulation study to investigate the robustness of different variance estimates in exponential proportional hazards models, we generated the covariates independently, and the survival times from a Weibull proportional hazards model,
where the probability density function of a Weibull random variable is given by
We further performed administrative censoring at a fixed time. Our use of the exponential proportional hazards model deviates from the true data-generating distribution when the Weibull shape parameter differs from one.
We suppose the parameter of interest is the log hazards ratio in the model, and use a weakly informative prior distribution. Again, we conduct 1,000 simulations under each setting considered, and report the same measures as before.
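The data-generating mechanism of this simulation can be sketched as follows (the shape parameter, log hazards ratio, binary covariate, and censoring time below are illustrative assumptions, not the exact values of the study):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
shape = 1.5        # Weibull shape; shape != 1 breaks the exponential model
beta = 0.5         # log hazards ratio in the generating model (assumed)
x = rng.binomial(1, 0.5, size=n)

# Weibull PH model with S(t|x) = exp(-t^shape * exp(beta * x));
# inverse-CDF sampling from a uniform draw
u = rng.uniform(size=n)
t_true = (-np.log(u) / np.exp(beta * x)) ** (1.0 / shape)

c = 1.5                              # administrative censoring time (assumed)
t_obs = np.minimum(t_true, c)        # observed time
delta = (t_true <= c).astype(int)    # event indicator
```

Fitting an exponential proportional hazards model to data of this kind reproduces the misspecification studied in Table 3.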
Results are given in Table 3. Although it slightly underestimates the standard error of the Bayes estimate when the model is correctly specified, in general the Bayesian robust standard error is a better estimate of the true standard error than the posterior standard deviation.
n  Shape  Log HR  #events  Ave(Est.)  SE  Ave(Post. SD)  Ave(BRSE)
50  0.8  0.00  49.9  0.004  0.187  0.151  0.164 
50  0.8  0.25  49.8  0.317  0.179  0.150  0.165 
50  0.8  0.50  49.4  0.608  0.185  0.152  0.165 
50  1.0  0.00  50.0  0.000  0.146  0.148  0.132 
50  1.0  0.25  49.8  0.250  0.151  0.149  0.134 
50  1.0  0.50  49.9  0.503  0.146  0.149  0.134 
50  1.5  0.00  50.0  0.006  0.101  0.148  0.094 
50  1.5  0.25  50.0  0.165  0.101  0.148  0.093 
50  1.5  0.50  50.0  0.337  0.102  0.146  0.093 
100  0.8  0.00  99.8  0.003  0.130  0.103  0.119 
100  0.8  0.25  99.6  0.311  0.131  0.103  0.119 
100  0.8  0.50  98.8  0.614  0.128  0.105  0.119 
100  1.0  0.00  100.0  0.006  0.102  0.102  0.095 
100  1.0  0.25  100.0  0.248  0.102  0.103  0.096 
100  1.0  0.50  99.8  0.497  0.104  0.102  0.097 
100  1.5  0.00  100.0  0.001  0.071  0.102  0.066 
100  1.5  0.25  100.0  0.170  0.070  0.102  0.066 
100  1.5  0.50  100.0  0.332  0.072  0.102  0.067 
4.4 Bayesian robust confidence intervals
While appealingly robust in large samples, standard frequentist model-agnostic variance estimates often have unstable behavior in small samples (Kauermann and Carroll 2001). We investigate whether their Bayesian analog might produce better small-sample behavior, through the shrinkage and stability provided by the prior distribution. Using the posterior mean as the point estimate together with the Bayesian robust variance estimate, we construct the Bayesian analog of Wald-type “robust” confidence intervals, centered at the posterior mean with half-width given by the Bayesian robust standard error multiplied by the appropriate standard normal quantile.
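As a sketch, such an interval can be computed from posterior draws; the rescaling factor below is an illustrative stand-in for the posterior ‘signal-to-noise’ correction of Theorem 2, and all numerical values are assumed:

```python
import numpy as np
from statistics import NormalDist

def robust_wald_ci(post_draws, correction, alpha=0.05):
    """Wald-type interval centred at the posterior mean, with the
    posterior variance rescaled by a model-fit 'correction' factor
    (a stand-in for the balanced-loss Bayes rule for the variance)."""
    d = post_draws.mean()
    v = post_draws.var(ddof=1) * correction
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    half = z * np.sqrt(v)
    return d - half, d + half

# Toy posterior draws for a scalar parameter
rng = np.random.default_rng(6)
draws = rng.normal(loc=2.0, scale=0.1, size=50_000)
lo, hi = robust_wald_ci(draws, correction=1.5)
```

With a correction factor above one (model-based uncertainty understated), the robust interval is wider than the ordinary Wald-type credible interval by the square root of that factor.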
Figure 1 shows the coverage probabilities of these intervals, together with standard model-based 95% credible intervals and standard robust confidence intervals, using the examples in Sections 4.2 and 4.3. We assume the same models and data distributions as in the previous sections. In all the model-misspecification scenarios, the posterior credible interval does not – as expected – have nominal-level coverage, even in large samples. However, compared with the standard robust confidence intervals, the Bayesian robust confidence intervals have coverage probabilities closer to the nominal level. This suggests that the stability provided by even the relatively weak priors we have used is sufficient to improve on the performance of the standard, purely empirical frequentist methods.

5 Application: Association of Systolic Blood Pressure and Age in 2017–2018 NHANES Data
We demonstrate the Bayesian robust standard error estimator using data from the 2017–2018 National Health and Nutrition Examination Survey (NHANES). We are interested in the association between subjects’ systolic blood pressure and their age in the subpopulation aged between 18 and 60 years. To make the difference between the variance estimates more apparent, we randomly select a subset of 200 subjects for our analysis.
We used a simple linear regression model as a working model, with the average of two systolic blood pressure readings as the outcome and age and gender as covariates (Figure 2, left panel). We computed both the model-based standard error estimates (assuming homoscedasticity) and robust standard error estimates (resistant to heteroscedasticity and/or model misspecification) of the regression coefficients. For the corresponding Bayesian model, we postulate a normal linear regression model with the same outcome and covariates, and use weakly informative prior distributions for the unknown regression coefficients and the residual variance (Figure 2, right panel). We report the posterior means as point estimates, and compute the posterior standard deviations and Bayesian robust standard errors for the regression coefficients. For the Bayesian analysis, we used Gibbs sampling in JAGS (Plummer 2003) with 3 chains and 30,000 iterations per chain, where the first 18,000 iterations in each chain are burn-in and discarded from the analysis.

Frequentist  Bayesian

E[SBP] = β_0 + β_1 MALE + β_2 AGE  |  SBP ∼ N(β_0 + β_1 MALE + β_2 AGE, σ^2); β_j ∼ N(0, 1000), j = 0, 1, 2; σ^2 ∼ Inv-Gamma(0.01, 0.01)
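For readers without JAGS, the conjugate Gibbs sampler for this working model is short enough to sketch directly. The synthetic data, variable names, and the smaller iteration counts below are illustrative; this is not the NHANES analysis itself:

```python
import numpy as np

def gibbs_linreg(X, y, n_iter=5000, burn=2500, seed=0):
    """Gibbs sampler for y ~ N(X beta, sigma^2 I) with
    beta_j ~ N(0, 1000) and sigma^2 ~ Inv-Gamma(0.01, 0.01),
    mirroring the working model above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    prior_prec = np.eye(p) / 1000.0   # prior precision of beta
    a0 = b0 = 0.01                    # Inv-Gamma hyperparameters
    beta, sigma2 = np.zeros(p), 1.0
    draws = []
    for it in range(n_iter):
        # beta | sigma^2, y : multivariate normal (conjugate update)
        prec = X.T @ X / sigma2 + prior_prec
        cov = np.linalg.inv(prec)
        cov = (cov + cov.T) / 2.0     # enforce symmetry numerically
        mean = cov @ (X.T @ y / sigma2)
        beta = rng.multivariate_normal(mean, cov)
        # sigma^2 | beta, y : inverse gamma
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / (b0 + resid @ resid / 2.0))
        if it >= burn:
            draws.append(np.concatenate([beta, [sigma2]]))
    return np.array(draws)

# Synthetic stand-in for the blood-pressure data
rng = np.random.default_rng(7)
n = 300
age = rng.uniform(18, 60, size=n)
male = rng.binomial(1, 0.5, size=n)
X = np.column_stack([np.ones(n), male, age])
y = 94.0 + 5.0 * male + 0.6 * age + rng.normal(scale=12.0, size=n)

draws = gibbs_linreg(X, y)
post_mean = draws[:, :3].mean(axis=0)
post_sd = draws[:, :3].std(axis=0)
```

The retained draws give the posterior means and standard deviations reported in the analysis; the Bayesian robust standard error would then rescale the posterior variance using the balancing term.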
In Table 4, we report the point estimates and the standard error estimates of the regression coefficients for the frequentist and the Bayesian linear regression models. We focus on comparing the variance estimates. The model-based standard errors and Bayesian posterior standard deviations give higher estimates of the standard error for all three regression coefficients than the frequentist and Bayesian robust standard errors. For the intercept and the regression coefficient for age, the Bayesian posterior standard deviations are nearly equal to the corresponding frequentist model-based standard errors, whereas the Bayesian robust standard errors are approximately equal to the frequentist robust standard errors. The four versions of the standard error estimate for the regression coefficient of gender are not as distinguishable.
            Frequentist                    Bayesian
            Est.    SE    Robust SE       Est.    Post. SD    BRSE
(Intercept)  94.067  2.350  1.781  93.481  2.307  1.767 
Male  4.817  2.063  2.032  5.005  2.062  2.050 
Age (yrs)  0.570  0.046  0.042  0.580  0.046  0.042 
Code for the analysis and the dataset are available on GitHub (https://github.com/KenLi93/BRSE).
6 Discussion
In this article, we have proposed a balanced inference loss function whose Bayes rules provide a Bayesian analog of the robust variance estimator. We have proven that the Bayesian robust variance converges to the true variance of the estimator of interest in large samples, under only mild regularity conditions. In our examples of GLMs and other forms of models, the corresponding loss function is seen to be a straightforward balance between losses for standard model-based inference and a term that assesses how well the corresponding model actually fits the observed data; the penalty employs terms and components familiar from the diagnostic use of Pearson residuals and from the study of leverage. Using these methods, we also proposed Wald-type confidence intervals based on the Bayesian robust standard error, and found more stable behavior than for their frequentist counterparts.
Our development of the robust variance estimates is fully parametric, pragmatically adjusting the standard model-based inference with a measure of how well the corresponding model fits the data. This is not the only motivation available for robust variance estimates, however; focusing only on linear regression, Szpiro et al. (2010) proposed a Bayesian model-robust variance estimate by using highly flexible models for the data distribution and clearly defining the parameter of interest in a model-agnostic manner. The extension of this essentially nonparametric approach beyond linear regression is unclear, and moreover its motivation can be challenging – much like the challenge of a frequentist development built purely on estimating equations, without parametric assumptions. In comparison, our proposed method does not require redefining the parameter of interest during the modeling, and the interpretation of the point estimates stays the same as in regular parametric statistics. The proposed Bayesian variance estimates do not require nonparametric Bayes components in the model specification as in Szpiro et al. (2010), thus greatly reducing computation time and increasing algorithmic stability. Our framework is general and can be applied to obtain the Bayesian robust variance for any regular parametric model.
Our work demonstrates that robust standard error estimates can be as routine in Bayesian statistics as they are in frequentist statistics. Our approach enables proper quantification of the variability of parameter estimates in Bayesian parametric models. The proposed balanced inference loss function, through which the Bayesian robust standard error was derived, also provides insight into the source of the discrepancy between model-based and model-agnostic variance estimates.
Appendix A Proof of Theorem 1
Proof.
The posterior risk is
Writing the product of the inverse matrix decision and the posterior expected outer product of the estimation error as a single matrix, the posterior risk is
in which the last two terms are, respectively, the sum of the eigenvalues and minus the sum of the log-eigenvalues of that matrix. This is minimized by setting all eigenvalues equal to 1, which occurs if and only if the matrix is the identity; this determines the optimal matrix decision for any fixed point estimate.

To minimize the posterior risk with respect to the point estimate, it remains to consider
By matrix determinant lemma, the above expression equals
which is minimized by setting the quadratic term to zero, which in turn means setting the point estimate to the posterior mean. This rule achieves the minimized posterior risk stated in Theorem 1.
∎
Appendix B Proof of Theorem 2
The expected posterior loss is
(4) 
Appendix C Proof of Theorem 3
Let the Fisher information be evaluated at the minimal Kullback–Leibler point. By the Bernstein–von Mises theorem under model misspecification (Kleijn and van der Vaart 2012), the posterior distribution concentrates around the minimal Kullback–Leibler point. We see that
for any fixed argument. The theorem below implies that, under suitable regularity conditions,
Theorem 4.
Suppose the integrand is a random function, measurable on the parameter space. Suppose the conditions of Theorem 2.3 and Corollary 4.2 in Kleijn and van der Vaart (2012) hold. Further assume the following conditions hold:

is continuous in a neighborhood of and .

Uniform convergence of in a neighborhood of : for any , there exists and positive integer such that for any , we have

Asymptotic posterior uniform integrability: for any , we have
(6)
Then .
Proof.
Step 1:
We show that as , for any . In other words, for any , , and , there exists a positive integer , such that for any , we have
(7) 
Note that the additional conditions 1, 2 imply that for any , there exists and positive integer such that for any , we have
(8) 
By Corollary 4.2 in Kleijn and van der Vaart (2012), there exists a positive integer such that
for any .
Note that the left-hand side of (7) is