 # Accurate Computation of Marginal Data Densities Using Variational Bayes

Bayesian model selection and model averaging rely on estimates of marginal data densities (MDDs) also known as marginal likelihoods. Estimation of MDDs is often nontrivial and requires elaborate numerical integration methods. We propose using the variational Bayes posterior density as a weighting density within the class of reciprocal importance sampling MDD estimators. This proposal is computationally convenient, is based on variational Bayes posterior densities that are available for many models, only requires simulated draws from the posterior distribution, and provides accurate estimates with a moderate number of posterior draws. We show that this estimator is theoretically well-justified, has finite variance, provides a minimum variance candidate for the class of reciprocal importance sampling MDD estimators, and that its reciprocal is consistent, asymptotically normally distributed and unbiased. We also investigate the performance of the variational Bayes approximate density as a weighting density within the class of bridge sampling estimators. Using several examples, we show that our proposed estimators are at least as good as the best existing estimators and outperform many MDD estimators in terms of bias and numerical standard errors.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The marginal data density (MDD) is of central importance in Bayesian inference, providing a summary of the evidence contained in the data about a model. Since it is required for the computation of Bayes factors and posterior odds ratios

(see e.g. Kass & Raftery, 1995), it is essential for model comparisons, predictive analyses, and Bayesian hypotheses assessment.

To define the MDD, let vector collect all of the parameters of a model and denote observed data. MDD is the constant normalizing the posterior distribution of parameters given the data; it can be obtained by the application of Bayes’ rule:

 p(θ|y)=p(y|θ)p(θ)p(y), (1)

where the numerator on the right hand side of equation (1

) is the product of the likelihood function and the prior distribution of the parameters of the model, and the denominator contains the MDD. The latter is often obtained through integrating the joint distribution of data and parameters over the

-dimensional parameter vector defined on the parameter space :

 p(y)=∫θ∈Θp(y|θ)p(θ)dθ. (2)

The integral in equation (2) can be computed analytically only for simple models, as for example, in Zellner (1971), Sims & Zha (1998), Ardia et al. (2012), Kociȩcki et al. (2012), and Chan & Grant (2015). For more complicated models, elaborate numerical integration methods are required (see e.g., Gelfand & Smith, 1990). Estimation of MDD using such methods often involves significant costs due to substantial programming efforts and computational requirements. Nevertheless, pursuing numerically precise and computationally feasible estimation techniques is worthwhile because of the importance of MDD applications. Good summaries of various proposals can be found in Ardia et al. (2012) and Chen et al. (2012) among others.111A variety of methods have been proposed for computation of MDDs, we mention and use a dozen of them in this paper. Some examples that are not covered in this paper are the power posterior proposed by Friel & Pettitt (2008), nested sampling proposed by Skilling et al. (2006) and path sampling studied by Gelman & Meng (1998).

We propose two new MDD estimators that outperform existing alternatives in terms of the numerical efficiency while maintaining computational feasibility. The first of these estimators belongs to the class of reciprocal importance sampling (RIS) MDDEs and the second to the class of Bridge Sampling (BS) MDDEs. The former class was proposed by Gelfand & Dey (1994) defined by:

 ^pRIS(y)=[1SS∑s=1h(θ(s))p(y|θ(s))p(θ(s))]−1, (3)

where is a sample drawn from the posterior distribution and is a weighting density. The properties of the weighting function determine the properties of the estimator. For this estimator to have desirable properties, the support of the weighting density must be a subset of the posterior density’s support or equivalently the ratio must be bounded from above (see e.g. Gelfand & Dey, 1994; Chen et al., 2012) . Under this condition, it is well-known that converges to (see e.g. Geweke, 2005).

An early proposal (harmonic mean MDDE) by

Newton & Raftery (1994) to set did not satisfy this condition whenever the prior density had fatter tails than the posterior. This led later researchers to propose weighting densities such that their support is a subset of the posterior density’s support. For example, Geweke (1999) proposed using a truncated multivariate normal density and Sims et al. (2008) used a truncated elliptical density as the weighting function. These methods require some arbitrary choices regarding the truncation and can require many draws from the posterior distribution to achieve a reasonable numerical precision. Another solution has been to change the space of the weighting function through analytical integration (see Raftery et al., 2007; Fuentes-Albero & Melosi, 2013). However, this suggestion is applicable only if appropriate analytical integration is possible for a given model. Finally, Lenk (2009)

proposed correcting the harmonic mean MDDE by using the ratio of the prior and posterior probabilities on an appropriately defined subset of the support of posterior density. This solution leads to desirable properties of the estimator such as having a finite variance.

222See C for the computational details about this estimator. See also Pajor (2016) who applies the same strategy to the arithmetic mean MDDE. Another paper attempting to correct harmonic mean estimator is Raftery et al. (2006).

Our proposal for RIS MDDE is to use the output from variational Bayes (VB) estimation as the weighting function. The objective of VB estimation is to approximate the posterior density with another density that is tractable. The VB approximate posterior density is optimal in the sense that it minimizes the KL distance from this density to the posterior density. We show that a density constructed in this way, under some mild regularity conditions, is dominated by the posterior density, i.e., has a support that is a subset of the support of the posterior density and, thus, is a good candidate for a weighting function for the RIS MDDE.

We also show that the aforementioned optimality of the VB approximate density translates into optimality of the resulting MDDE. It leads to an RIS MDDE that is minimum variance within the class of RIS MDDEs. This finding is further confirmed by our simulations where we show that our estimator has the smallest numerical standard errors within the class of RIS MDDEs and also it performs well when compared to some popular alternative MDDEs from other classes of estimators. Finally, we show that our new MDDE has the desirable properties that were established by Geweke (2005) and Frühwirth-Schnatter (2004) for other RIS MDDEs. In particular, we show that our RIS MDDE is consistent, asymptotically normally distributed with a finite variance, and unbiased. In addition, we argue that these properties of the estimator make it appealing in empirical applications in which a common practice is to report a natural logarithm of the MDDE.

The focus of this paper is on the RIS MDDE because it can be easily computed given that a Markov Chain Monte Carlo (MCMC) sample from the posterior distribution is available. In contrast to many other MDDEs, the RIS MDDE does not require any other simulations. This is a desirable feature especially in those cases where the cost of additional computations is particularly high, for instance, due to a large dimension of the parameter space. The last point is well illustrated in recent developments in macroeconometrics where hypothesis testing for large nonlinear dynamic models is essential.

Fuentes-Albero & Melosi (2013) and Droumaguet et al. (2017) emphasize the feasibility of MDD estimation using an adjusted version of the estimator by Geweke (1999, 2005) whereas Sims et al. (2008) propose a new weighting function that assures the numerical stability of the RIS MDDE in models with nonstandard posterior distributions.

Although our main focus is the properties and performance of the RIS MDDE, We also investigate the usefulness of using the VB approximate density as a weighting function for the BS MDDE proposed by Meng & Wong (1996). This estimator is obtained through an iterative procedure initialized at some starting value for and the following recursion:

 ^p(t)BS(y)=^p(t−1)BS(y)1O∑Oo=1^p(θ(o)|y)Og(θ(s))+S^p(θ(o)|y)1S∑Ss=1g(θ(s))Og(θ(s))+S^p(θ(s)|y), (4)

where is an i.i.d. sample drawn from the weighting density , is an MCMC sample drawn from the posterior distribution, and . In practice, the iterative procedure presented in equation (4) requires few steps, usually around ten, until convergence is achieved. One advantage of the MDD BS estimator is that, irrespective of whether the weighting density is dominated by the posterior density or not, it has finite variance (see Frühwirth-Schnatter, 2004). Consequently, it is easy to see that the desired properties hold if the weighting function is set to the VB approximate posterior density.

Another appealing feature of our estimators is that VB posterior densities are available for many models through a vast number of papers that provide VB estimation for different type of models. Additionally, two software packages, Infer.Net (Minka et al., 2012) and STAN (Kucukelbir et al., 2015), provide automatic VB posterior density estimation. Thus, the range of models to which our method can be easily applied is enormous.

The paper is organized as follows: In Section 2, we present preliminaries on VB methods to the extent required for our application to the MDDE and establish the dominance property. Section 3 introduces the new RIS MDDE, states its properties and briefly reviews some of the other MDD estimators that are evaluated in our simulation experiments. In Section 4

, simulation experiments are used to study the numerical accuracy of the MDDE in three applications: vector autoregressions, stochastic frontier models, and longitudinal Poisson models. We have chosen vector autoregressions since it is possible to calculate the exact MDD for this model under a conjugate prior assumption making it useful for comparisons. The other two specifications are examples of nonlinear models with latent variables.

A contains the proofs of the propositions that establish the properties of our RIS MDDE and B a summary of relevant weighting functions for some of the MDD estimators used in 4.

## 2 Variational Bayes Methods

In this section we introduce a general setting of VB estimation and establish a property of the resulting approximate posterior density that is essential for MDD estimation.

### 2.1 Variational Bayes Estimation

VB is a deterministic alternative to MCMC; it was developed as a Bayesian method of inference in machine learning

(see e.g. Attias, 2000; Jordan et al., 1999). Recent surveys of VB include Wainwright & Jordan (2008), Bishop (2006, Chapter 10), and Murphy (2012), amongst others. Lately, this method has been adopted in statistics and econometrics, see e.g. McGrory & Titterington (2007), Ormerod & Wand (2010), Faes et al. (2011), Wand et al. (2012), Braun & McAuliffe (2010), Tan & Nott (2014), Nott et al. (2012), Wand (2017), Tran et al. (2017), Ong et al. (2017) and Hajargasht & Griffiths (2017). Some properties of the VB estimator can be found in Wang & Blei (2017) and the references therein. In this section, we follow the notation and the basic setting of VB method in Ormerod & Wand (2010) and Blei et al. (2017).

The objective of Bayesian inference is the characterization of the posterior distribution of the parameters of a model given data. In VB the idea is to approximate the posterior density with a density which is of a tractable form. The optimal approximate posterior density, denoted by

, is obtained by minimizing the Kullback-Leibler divergence of the density

from the true posterior density, :

 q∗(θ)=argminq∈QKL[q(θ)||p(θ|y)]=minq∈Q∫θ∈Θq(θ)ln[q(θ)p(θ|y)]dθ, (5)

where is a family of approximate densities.

In order to make the VB method appealing, the approximate density should be of an analytically tractable form. This can often be achieved by imposing simplifying assumptions on . One such assumption commonly used is that the approximate density of can be expressed as a product of factorized densities for some partition of vector :

 q(θ)=M∏m=1qm(θm). (6)

This factorized form corresponds to an approximation framework developed in physics known as mean field theory. Thus, this form of variational approximation is often referred to as mean-field variational Bayes (MFVB).

The assumption given in equation (6) introduces conditional stochastic independence between sub-vectors of parameters , for , given data . Using results from calculus of variations and some simple algebraic manipulations it is shown that the optimal factorized densities can be obtained from the following iterative procedure (see Ormerod & Wand, 2010): Initialize , and iterate:

 q1(θ1) ←exp[E−θ1ln(p(y,θ))]∫exp[E−θ1ln(p(y,θ))]dθ1 ⋮ qM(θM) ←exp[E−θMln(p(y,θ))]∫exp[E−θMln(p(y,θ))]dθM,

until the decrease in the value of the KL distance, , between two subsequent iterations is negligible. denotes expectation with respect to the density .

for many econometric models, imposing assumption (6) leads to factorized distributions belonging to known families of parametric distributions. For such cases, let the approximate density be parameterized by a vector of hyper-parameters , defined on a set of their admissible values , and denoted by . The optimal is defined by the optimal values of its hyper-parameters obtained by minimizing the Kullback-Leibler divergence of the approximate density, , from the true posterior density, over the hyper-parameters

. Then the iterative optimization procedure for the distributions described above transforms into a deterministic iterative procedure for the hyper-parameters or moments of these distributions which is known as the

coordinate ascent algorithm. The convergence of this deterministic algorithm is often obtained in a small number of steps, which makes VB estimation much faster than estimation based on MCMC.

### 2.2 The Convergence Criterion

Note that the constant normalizing the kernel of the posterior distribution is unknown and, in consequence, the value of the KL distance is unknown as well. To facilitate the computations, the natural logarithm of is expressed as a sum of a value defined by and KL divergence measure, (see e.g. Ormerod & Wand, 2010, p. 142):

 lnp(y)=lnMDDVBLB(λ)+KL[q(θ|λ)||p(θ|y)]. (7)

A closed-form solution for can be derived for many models with little effort. Moreover, since the KL divergence measure is nonnegative, and is equal to zero only when , provides a lower bound to . The problem of minimizing stated in equation (5) is equivalent to the problem of maximizing with respect to :

 λ∗=argmaxλ∈ΛlnMDDVBLB(λ)=argmaxλ∈Λ[Eqlnp(y|θ)p(θ)−Eqlnq(θ|λ)]. (8)

Consequently, to monitor the convergence of the coordinate ascent algorithm is used instead of KL distance as the objective function .

### 2.3 The Dominance Property

In this work we adopt the following assumptions in relation to the dominance property:

(i)

For any given data , the likelihood function is bounded from above for all values of the parameter vector in the parameter space.

(ii)

The prior density is proper.

(iii)

The VB approximate posterior density is bounded from above.

(iv)

The VB approximate posterior density is continuous and differentiable.

The first two assumptions are uncontroversial; they guarantee the existence of the posterior distribution and that the MDD is finite. The other two conditions for the VB approximate posterior density are quite general.

Under Assumption 2.3 we show that the posterior density dominates the VB approximate posterior density which we state as a corollary.

###### Corollary 1.

Let the support of the posterior density be . If Assumption 2.3 holds and the posterior density is continuous and differentiable on its support , then the posterior density dominates the VB approximate posterior density, that is, .

The proof of this claim is a by-product of the proof of Proposition 4 given in the next section. It states that the solution to the optimization problem such as the one given in equation (5) has to meet the property given in Corollary 1. Otherwise, cannot be a solution because then the value of the KL distance goes to infinity. This property has desirable consequences that have been discussed, for instance, in Murphy (2012, Chapter 21).

## 3 The New MDD Estimator

Our new RIS MDDE is defined by assigning the VB approximate posterior density as the weighting density i.e. in equation (3). This MDDE is given by:

 ^pRIS.VB(y)=[1SS∑s=1q∗(θ(s))p(y|θ(s))p(θ(s))]−1. (9)

The fact that the VB approximate posterior density is dominated by the posterior density is essential for the properties of the new RIS MDDE. We present these properties below.

### 3.1 Properties of the New MDD Estimator

In this section we consider the properties of the RIS MDD estimator given in equation (9) with the VB approximate posterior density used as the weighting function. These properties are stated in four propositions for which the proofs are given in A.

For computing this estimator a sample from the posterior distribution is assumed to be available. We require that this sample has suitable properties that we state in the following assumption. The draws are from an ergodic process with unique invariant density having support .

Now we show that the RIS estimator with the VB approximate posterior density indeed leads to an MDD estimator that is theoretically well-justified.

###### Proposition 0.

If Assumption 2.3 holds, then:

 ∫Ωpq∗(θ)p(y|θ)p(θ)p(θ|y)dθ=1p(y).

Proposition 1 suggests that the estimator given in equation (9), the inverse of the average ratio of the approximate posterior density to the kernel of the posterior density evaluated at posterior draws, should result in the constant normalizing the posterior density.

Subsequently, we show that the inverse of our estimator is consistent, asymptotically normally distributed, and unbiased.

###### Proposition 0.

Suppose that Assumptions 2.3 and 3.1 hold. Then:

(i)

(ii)

,

where denotes convergence in distribution at the limit where , denotes the ratio , and is the normalized spectral density of the process at frequency zero.

The first result is known for the class of estimators proposed by Gelfand & Dey (1994) following the proofs provided by Geweke (1992), Chib (1995), and Frühwirth-Schnatter (2006). It applies to our estimator under assumptions similar to those made by the authors of these papers. Note that Frühwirth-Schnatter (2006) provides an estimator for .

The second result implies that if we follow the common practice of reporting the logarithm of our MDD estimator we report the negative of logarithm of the inverse of an unbiased estimator i.e.

which is not different from reporting the logarithm of an unbiased MDDE such as the importance sampling MDDE.

In Proposition 3 we show that the variance of our estimator is finite and given by .

###### Proposition 0.

If Assumption 2.3 holds, then .

This result is going to be used to prove Proposition 4 stating that our estimator is optimal within the class of the Reciprocal Importance Sampling MDD estimators.

###### Proposition 0.

The RIS MDDE with the VB posterior as the weighting density is minimum variance among the class of RIS MDDEs with weighting densities from .

Proposition 4 is new and general since it is true for all VB approximate posterior densities including those that do not require the factorization from equation (6).

Finally, note that the property from Corollary 1 is not essential for the other estimator that we propose, i.e., the bridge sampling MDDE where the weighting function is equal to the approximate posterior distribution, . In this case, the properties established by Meng & Wong (1996) and Frühwirth-Schnatter (2004) hold as well for our weighting function.

### 3.2 Comparison with Existing MDD Estimators

Conceptually, the idea of using the VB approximate posterior density as the weighting function under an RIS framework is close to the idea of employing entropy methods to obtain the optimal importance density for the importance sampling (IS) MDDE proposed by Chan & Eisenstat (2015). The IS MDDE itself was introduced by van Dijk & Kloek (1978) and Geweke (1989) and is given by:

 ^pIS(y)=1RR∑r=1p(y|θ(r))p(θ(r))f(θ(r)), (10)

where is an i.i.d. sample drawn from the importance density . In order to have a good estimator, the importance density should be a close approximation to the posterior distribution. Its choice is a nontrivial task that is discussed below. From the point of view of the MDD estimation, it is essential that the importance density dominates the posterior distribution. This property, in many references, is described as having thicker tails than the posterior distribution. Chan & Eisenstat (2015) show that if this is the case then the IS MDDE is consistent, unbiased, and asymptotically normally distributed, as well as having a finite variance. Note that the construction of an appropriate importance density is the most challenging task associated with this estimator.

Chan & Eisenstat (2015) propose using an importance density that is derived from cross-entropy methods. Their method requires defining a parametric family of distributions parametrized by hyper-parameters . A distribution belonging to such a family, denoted by , is used to approximate the posterior distribution well. The optimal hyper-parameters, , are obtained by minimizing the Kullback-Leibler divergence of the the true posterior density, , from the approximate density, , over the admissible set of hyper-parameters’ values :

 κ∗=argminκ∈KKL[p(θ|y)||~f(θ|κ)]=minκ∈K∫θ∈Θp(θ|y)ln[p(θ|y)~f(θ|κ)]dθ. (11)

In practice, finding the optimal hyper-parameters is equivalent to solving the following problem:

 κ∗=argmaxκ∈K1SS∑s=1ln~f(θ(s)∣∣κ), (12)

where a sample of draws from the posterior distribution is used. Subsequently, the importance density is set to and used in the estimation of the MDD according to equation (10). Chan & Eisenstat (2015) show that under the assumption that dominates the posterior distribution and therefore this MDDE is consistent and unbiased.

Note that while our approximate posterior density minimizes the reverse KL distance between this density and the true posterior density (see equation 5), the optimal importance density is obtained by minimizing the forward KL distance between the true posterior and the importance density (see equation 11).

By the same argument that we used to prove Corollary 1, it can be shown that the optimal importance density dominates the posterior density333Therefore, according to Murphy (2012), the entropy method may lead to some other undesirable properties of the approximate density, a property assumed by Chan & Eisenstat (2015). Consequently, the entropy method delivers a weighting function that is a good importance density, while the VB method delivers one that is appropriate for the RIS MDDE.444There are significant differences between the two approaches to MDD estimation. The method by Chan & Eisenstat does not require sampling from the posterior distribution using MCMC methods but rather uses independent sampling from . However, it requires some additional simulations for the optimization process and, more importantly, making some arbitrary choices regarding the functional form of the importance density. Our estimator requires only a sample from the posterior density but it does not require any other samplings and optimization through the coordinate ascent algorithm is fast, automatic, and free of arbitrary choices.

Another related method for choosing the weighting function within the IS MDDE framework, studied in detail by Perrakis et al. (2014), is by using the product of marginal posterior densities. This approach can be thought as the dual of our RIS MDDE proposal in the sense that, in MFVB, the approximate posterior density is obtained by taking expectation of the posterior kernel with respect to while under the product of marginal densities framework, the weighting density is obtained by taking expectation with respect to . In comparison, (i) The VB approximate posterior density is dominated by the true posterior density which makes it a good candidate for RIS MDDE while the opposite is true for the product of marginals which makes it a good candidate for IS MDDE; (ii) It has also been shown by Botev et al. (2013), that the product of marginals is the best importance density, in the sense of minimizing forward Kullback-Leibler divergence from the true posterior density and, therefore, minimizing the MDDE variance, among all product form importance densities. As we showed, the same is true for VB in the context of reverse Kullback-Leibler divergence and reciprocal importance sampling; (iii) an advantage of our proposal is that in many cases the VB approximate posterior density can be obtained analytically or by simple optimizations while the product of marginals requires both simulation and estimation of the marginal posterior densities. We estimate MDDs using the product of marginals whenever it is feasible in our simulation comparisons.

## 4 Numerical Accuracy of the MDD Estimators for Some Models

We study the performance of the new MDD estimators by applying them to three classes of models: vector autoregressions with two alternative prior specifications, stochastic frontier models with two alternative specifications of the inefficiency factor, and the longitudinal Poisson model. Hereafter, the notation is consistent within each subsection and with the rest of the paper. However, some of the characters may be used multiple times with different meanings across the subsections.

For each model we compute benchmark values of the logarithm of the MDD, i.e. . If possible, an exact benchmark is computed by analytical integration. Otherwise, we use the estimators by Chib (1995) or Chib & Jeliazkov (2001) as a benchmark. We also report the VB lower-bound of the logarithm of the MDDE, denoted below by and its upper bound counterpart using the method proposed by Ji et al. (2010).555see Dieng et al. (2017) for another method of calculating an upper-bound for MDD. All of the MDDEs included in our comparison of numerical accuracy and references to them are described in Table 1.

For each model and data set, we compute values of the log-MDDs a hundred times using separate MCMC simulations. Subsequently, we use these values to compute two criteria for the assessment of the numerical accuracy of alternative MDDEs, namely, the numerical standard error (NSE)666

NSE is defined as the sample standard deviation for the MCMC sample of simulated values of

., and the fraction of simulated MDD values that fall in the interval between the VB lower and upper bounds for the mentioned above, denoted in tables by ”% in bounds”.

### 4.1 Vector Autoregressive Model

Consider the VAR() model:

 yt=a0+A1yt−1+⋯+Apyt−p+ϵt,ϵt∼i.i.d.NN(0N×1,Σ), (13)

for , and denoting the sample size, is the lag order of the model, is an -vector of observations at time , is a vector of constant terms and for are matrices of autoregressive coefficients. is an error term that is conditionally normally distributed given past observations with the mean being a vector of zeros and the covariance matrix .

Define a matrix collecting all the observations of vector . Define a vector , a matrix , and the matrix of coefficients . Also, let matrix collect all the error terms. Then the VAR process from equation (13) can be written as:

 Y=XA+E,E∼MNT×N(0T×N,Σ,IT), (14)

where follows a matric-variate normal distribution (see e.g. Woźniak, 2016) with the mean set to a matrix of zeros and the covariance matrix for a vectorized vector given by , where denotes the Kronecker product.

For convenience, we also introduce another matrix representation. Define vectors and . Then, the model from equation (13) can also be written as:

 y=(IN⊗X)α+e,e∼NTN(0,Σ⊗IT). (15)

#### 4.1.1 Prior Densities and Approximate Posterior Computations

For illustrative purposes, we consider two alternative prior distributions for VAR models: a natural- conjugate distribution where the posterior distribution, VB approximate distribution, and the MDD can be derived analytically, and an independent prior distribution that results in an iterative procedure for the estimation of the same quantities. Here, we present only the details of VB estimation; while for details of Gibbs sampling for posterior analysis, the reader is referred to Karlsson (2013).

##### Normal-Wishart Conjugate Prior

For VAR models with normally distributed error terms the natural-conjugate prior distribution is given in the following normal-Wishart form:

 p(A,Σ)=p(A|Σ)p(Σ−1),A|Σ∼MNK×N(A––,Σ,V––),Σ−1∼W(S––−1,ν––). (16)

where is a matrix of prior means, is a positive-definite matrix of order , is an positive-definite scale matrix of the Wishart prior distribution, and

denotes its degrees of freedom parameter.

Consider firstly the assumption that VB does not factorize the parameters into sub-groups, but instead a joint distribution is derived for and en bloc. In such a case, VB inference is equivalent to exact posterior analysis in which the joint posterior distribution of is of Normal-Wishart form (see e.g. Woźniak, 2016):

 q(A,Σ−1)≡p(A,Σ−1|y)=NW(¯¯¯¯A,Σ,¯¯¯¯V,¯¯¯¯S−1,¯ν). (17)

where:

 ¯¯¯¯A =¯¯¯¯V(V––−1A––+X′Y), ¯¯¯¯V =(V––−1+X′X)−1, ¯¯¯ν =T+ν––, ¯¯¯¯S =(Y−X¯¯¯¯A)′(Y−X¯¯¯¯A)+S––+(¯¯¯¯A−A––)′V––(¯¯¯¯A−A––).

Additionally, the MDD can be computed as an ordinate of the matric-variate -distribution at the data vector (see Karlsson, 2013):

 Y∼MtN,K(XA––,(IT+XV––X′)−1,S––,ν––). (18)

Consider now VB inference that is derived under the assumption that the VB approximate posterior distribution is factorized into marginal distributions for and :

 q(A,Σ−1)=qA(A)qΣ−1(Σ−1). (19)

The optimal VB approximate posterior distributions are given as closed-form formulae with the matric-variate normal distribution for and the Wishart distribution for :

 q∗A(A)=MNN.K(¯¯¯¯A∗,(¯¯¯ν∗−N−1)−1¯¯¯¯S∗,¯¯¯¯V∗),q∗Σ(Σ−1)=W(¯¯¯¯S∗−1,¯¯¯ν∗), (20)

where the parameters determining these distribution are:

 ¯¯¯¯A∗ =¯¯¯¯V(V––−1A––+X′Y), ¯¯¯¯V∗ =(V––−1+X′X)−1, ¯¯¯ν∗ =T+1+pN+ν––, ¯¯¯¯S∗ =¯¯¯ν¯¯¯ν−1−pN[(Y−X¯¯¯¯A)′(Y−X¯¯¯¯A)+S––+(¯¯¯¯A−A––)′V––−1(¯¯¯¯A−A––)].

Finally, is given by:

 lnMDDVBLB.VAR.C= −NT2lnπ+NK2ln(2e)+N2((T+ν––)ln(T+ν––)−¯¯¯ν∗ln¯¯¯ν∗)+lnΓN(¯¯¯ν∗2)−lnΓN(ν––2) +N2(ln∣∣¯¯¯¯V∗∣∣−ln|V––|)+12((T+ν––)ln∣∣∣¯¯¯¯S∗−1∣∣∣−ν––ln∣∣S––−1∣∣).
##### Normal-Wishart Independent Prior

This prior distribution presumes that the autoregressive parameters and the precision matrix are a priori independent:

 p(α,Σ−1)=p(α)p(Σ−1),α∼NNK(α––,V–––––),Σ−1∼W(S––−1,ν––), (21)

where is an -vector of prior means and is a positive-definite covariance matrix of order .

The optimal VB approximate posterior distributions are the multivariate normal distribution for and the Wishart distribution for :

 (22)

where the optimal values of the parameters determining these distributions are obtained using an iterative procedure. Initialize by setting starting values for and , and iterate:

 ¯¯¯¯α ←¯¯¯¯V(V––−1α––+¯¯¯νT∑t=1(x′t⊗IN)¯¯¯¯Syt), ¯¯¯¯V ←(V––−1+¯¯¯νT∑t=1(x′t⊗IN)¯¯¯¯S(x′t⊗IN)′)−1, ¯¯¯¯S ←S––+T∑t=1[(yt−(x′t⊗IN)¯¯¯¯α)(yt−(x′t⊗IN)¯¯¯¯α)′+(x′t⊗IN)¯¯¯¯V(x′t⊗IN)′], ¯¯¯ν ←T+ν––,

until the increase in is negligible. This value is given by:

 lnMDDVBLB.VAR.I= k2−NT2lnπ+lnΓN(¯ν∗2)−lnΓN(ν––2)+12(ln∣∣¯¯¯¯V∗∣∣−ln∣∣V–––––∣∣) +12(¯¯¯ν∗ln∣∣∣¯¯¯¯S∗−1∣∣∣−ν––ln∣∣S––−1∣∣)−12tr{V–––––−1[(¯¯¯¯α∗−α––)(¯¯¯¯α∗−α––)′+¯¯¯¯V∗]}.

#### 4.1.2 Comparison of the MDD estimators

To study the performance of the new MDDEs for the VAR models we use the same data set as Giannone et al. (2015). It is based on the benchmark US data set of Stock & Watson (2008). All the variable transformations are as in Giannone et al. (2015) and the reader is referred to that paper for more details.777We are grateful to Domenico Giannone, Michele Lenza, and Giorgio Primiceri for sharing their data with us. The time series consist of seven quarterly variables that span the period starting in the first quarter of 1959 and finishing in the last quarter of 2008, which gives 200 observations. The seven variables that we consider are real GDP, GDP deflator, federal funds rate, real consumption, real investment, hours worked and real compensation per hour. We estimated a model with four autoregressive lags (). We base our estimations on draws from the posterior distribution.

In the top panel of Table 2 we report the results for assessment of the numerical accuracy of the MDDEs for the VAR model with the normal-Wishart conjugate prior distribution. For this model the benchmark value of the MDD can be computed exactly. Note that it lies in the middle of the interval between the VB lower and upper bounds for the MDD. Figure 1: Comparison of simulation outcomes for the best MDDEs for Vector Autoregressions

Our estimator is the best estimator within the family of the RIS MDDEs. It has the lowest NSE and 100 percent of simulated MDDEs fall within the VB bounds. The value of the NSE for our estimator is 30 percent lower than the second best specification, which is, the RIS MDDE with the product of marginal posterior densities used as the weighting function.888See B for the definitions of the weighting functions used in the comparison. The Geweke (1999, 2005) and Sims et al. (2008) estimators perform quite badly in terms of both the NSE and the fraction of simulated MDDE values within the VB bounds.999These two estimators are known to perform better with an increasing number of MCMC draws () from the posterior distribution. Our simulations confirmed this finding. Nevertheless, had to be increased over fivefold to make these estimators competitive. Moreover, despite the fact that the VB approximate density has thinner tails than the exact posterior density, it still performs reasonably. Finally, in the case of the BS MDDE, VB and the product of marginal candidates perform very well although PMD is slightly better than our estimator.

These findings are confirmed by Figure 1 that illustrates the values of the logarithm of the simulated MDDEs for 100 repetitions of the estimation for the best four MDDEs. According to Figure 1, the best estimators in terms of precision are BS PMD and BS VB followed by RIS VB and IS PMD. We do not spot any significant bias in these estimators.

The results for the VARs with independent prior distributions for parameters matrices and are reported in the bottom panel of Table 2. Here, the MDDEs based on the VB approximate posterior density are the best in terms of the NSE relative to both RIS and BS MDDEs. Note that our BS MDDE performs better than two benchmark specifications by Chib (1995) and Frühwirth-Schnatter (2006). These findings are again confirmed by Figure 1 that illustrates the values of the logarithm of the simulated MDDEs.

### 4.2 Stochastic Frontier Model

Consider a stochastic frontier (SF) model with panel data that can be written as:

 yit=xitβ±ui+vit, (23)

where indexes firms and indexes time, is the dependent variable (either firms’ output or cost), is -dimensional row vector of regressors, is the log of the frontier production or cost function, is a -vector of unknown parameters, is a non-negative random error reflecting the inefficiency of firm , and is an i.i.d. normally distributed error term with mean zero and constant variance , denoted by . The negative sign before is for a production frontier model and the plus sign is for the cost frontier case. Finally, let an vector collect all of the inefficiency terms.

The Bayesian approach to estimation of stochastic frontier models using Monte Carlo methods has been described in van den Broeck et al. (1994) and Koop & Steel (2001) among others. In this section, we show how the MFVB for stochastic frontier models with two alternative distributions for inefficiency term, namely, exponential and gamma can be estimated. Further details of MFVB estimation for these and other SF models can be found in Hajargasht & Griffiths (2017) (for MCMC estimation of theses models see Koop & Steel, 2001, and Tsionas (2000) respectively)

. Using an exponential distribution leads to optimal approximate posterior densities of standard forms, while this is not the case for gamma where some numerical integration is required to obtain VB posterior distributions.

##### Exponential inefficiency

In this case we consider the stochastic frontier model in equation (23) with following an exponential distribution with parameter , i.e., .

##### Gamma inefficiency

The alternative inefficiency error distribution that we consider is gamma, . This case is interesting from the point of view of VB estimation because it results in some approximate posterior densities that are not of a known form. In these cases, it is necessary to use numerical integration to obtain some of the required moments.

#### 4.2.1 Prior Densities and Approximate Posterior Computations

##### Exponential inefficiency

We assume the following prior distributions:

 p(β)∼N(β––,V––β),σ−2∼G(A––σ,B––σ),λ∼G(A––λ,B––λ) (24)

where is the gamma density function as defined in Hajargasht & Griffiths (2017). The prior distributions for and are standard choices in the Bayesian stochastic frontier literature.

To facilitate MFVB estimation we need an appropriately factorized approximation to the posterior distribution. We consider the following:

 q∗(β,σ−2,u,λ)=q∗β(β)q∗σ−2(σ−2)q∗λ(λ)q∗u(u) (25)

The optimal VB approximate posterior distributions turn out to be:

 q∗β(β)=N(¯¯¯β∗,¯¯¯¯V∗β),q∗σ−2(σ−2)=G(¯¯¯¯A∗σ,¯¯¯¯B∗σ),q∗λ(λ)=G(¯¯¯¯A∗λ,¯¯¯¯B∗λ),q∗u(ui)=TN(¯¯¯μ∗i,¯¯¯υ2∗), (26)

where denotes the normal density function truncated to positive values of . The optimal hyper-parameters characterizing these distributions are computed by the coordinate ascent algorithm over the following iterations:

 ¯¯¯β ←¯¯¯¯Vβ(V––−1ββ––+¯¯¯¯Aσ¯¯¯¯B−1σx′(y±¯¯¯u⊗ıT)), ¯¯¯¯Vβ ←(V––−1β+¯¯¯¯Aσ¯¯¯¯B−1σx′x)−1, ¯¯¯¯Aσ ←A––σ+0.5NT, ¯¯¯¯Bσ ←B––σ+12(N∑i=1T∑t=1[(yit−xit¯¯¯β±¯¯¯ui)2+¯¯¯¯¯¯¯¯¯¯Var[ui]]+tr(x′x¯¯¯¯Vβ)), Aλ ←A––λ+N, Bλ ←Bλ–––+N∑i=1¯¯¯ui, ¯¯¯μi ←1T(−¯¯¯¯Aλ¯¯¯¯B−1λ¯¯¯¯A−1σ¯¯¯¯Bσ±T∑t=1(xit¯¯¯β−yit)), ¯¯¯υ2 ←T−1¯¯¯¯A−1σ¯¯¯¯Bσ,

where is an matrix with its th row set to , is a -vector of ones, and . The quantities , and are the corresponding moments of the truncated normal distribution for . They can be computed using , where and are the pdf and the cdf of the standard normal distribution, respectively.

We iterate the hyper-parameters of the approximate posterior distributions in equation (26) until the increment in for the model is negligible. This value is given by:

 lnMDDVBLB.SF.E=12((−NT+N)ln2π+N+k)+ln⎛⎜ ⎜⎝Γ(¯¯¯