# Distortion estimates for approximate Bayesian inference

Current literature on posterior approximation for Bayesian inference offers many alternative methods. Does our chosen approximation scheme work well on the observed data? The best existing generic diagnostic tools treating this kind of question by looking at performance averaged over data space, or otherwise lack diagnostic detail. However, if the approximation is bad for most data, but good at the observed data, then we may discard a useful approximation. We give graphical diagnostics for posterior approximation at the observed data. We estimate a "distortion map" that acts on univariate marginals of the approximate posterior to move them closer to the exact posterior, without recourse to the exact posterior.

## Authors

• 2 publications
• 7 publications
• 7 publications
05/23/2018

### Likelihood-free inference with emulator networks

Approximate Bayesian Computation (ABC) provides methods for Bayesian inf...
03/07/2022

### Discovering Inductive Bias with Gibbs Priors: A Diagnostic Tool for Approximate Bayesian Inference

Full Bayesian posteriors are rarely analytically tractable, which is why...
11/19/2010

### Sparse Choice Models

Choice models, which capture popular preferences over objects of interes...
11/28/2021

### Approximate Inference via Clustering

In recent years, large-scale Bayesian learning draws a great deal of att...
05/31/2021

### Online Bayesian inference for multiple changepoints and risk assessment

The aim of the present study is to detect abrupt trend changes in the me...
11/06/2017

### Flexible statistical inference for mechanistic models of neural dynamics

Mechanistic models of single-neuron dynamics have been extensively studi...
03/12/2018

### Bayesian inference for a partially observed birth-death process using data on proportions

Stochastic kinetic models are often used to describe complex biological ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

When we implement Bayesian inference for even moderately large datasets or complicated models some approximation is usually inescapable. Approximation schemes suitable for different Bayesian applications include Approximate Bayesian Computation (Pritchard et al., 1999; Beaumont, 2010), Variational Inference (Jordan et al., 1999; Hoffman et al., 2013), loss-calibrated inference (Lacoste-Julien et al., 2011; Kuśmierczyk et al., 2019) and synthetic likelihood (Wood, 2010; Price et al., 2018). New applications suggest new approximation schemes. In this setting diagnostic tools are useful for assessing approximation quality.

Menendez et al. (2014) give procedures for correcting approximation error in Bayesian credible sets. Rodrigues et al. (2018) give a post-processing algorithm specifically for recalibrating ABC posteriors. The new generic diagnostic tools given in Yao et al. (2018) and Talts et al. (2018) focus on checking the average performance of an approximation scheme over data space and are related to Prangle et al. (2014), which focuses on ABC posterior diagnostics. Their methods can be seen as an extension of Cook et al. (2006), Geweke (2004) and Monahan and Boos (1992), which were setup for checking MCMC software implementation. In contrast, we are interested in the the quality of approximation at the observed data . If one posterior approximation scheme works poorly in some region of but works well at , we may reject a useful approximation using any diagnostic based on average performance. We may conversely accept a poor approximation.

We give a generic diagnostic tool which checks the quality of a posterior approximation specifically at . We assume that 1) we can efficiently sample parameters from both the prior distribution and the observation model and 2) the approximation scheme we are testing is itself reasonably computationally efficient. We need this second assumption as we may need to call the approximation algorithm repeatedly.

The posterior has a multivariate parameter. However, we run diagnostics on one or two parameters or scalar functions of the parameters at a time, so our notation in Sections 2 and 3.2 takes and respectively. Parameters are continuous but this is not essential.

We introduce and estimate a family of “distortion maps”

 Dy:[0,1]⟶[0,1], y∈Y

which act on univariate marginals of the multivariate approximate posterior. The exact distortion map transports the approximate marginal posterior CDF onto the corresponding exact posterior CDF . The distortion map is a function of the parameter defined at each by the relation given in Eqn. 1 below. The distortion map at the data contains easily-interpreted diagnostic information about the approximation error in the approximate marginal CDF . If the distortion map differs substantially from the identity map, then the magnitude and location of any distortion is of interest. Our distortion map is an optimal transport (El Moselhy and Marzouk, 2012) constructed from a normalising flow.

A reliable estimate of must be hard to achieve, as it maps to the exact posterior CDF . We estimate a map to a distribution which is only asymptotically closer in KL-divergence to , not equal to it. If is far from in KL divergence then it is easy to find a distribution which is closer to than was. It follows that if differs significantly from the identity map then the approximation defining was poor. In this approach we get diagnostically useful estimates of the distortion map without sampling or otherwise constructing the exact posterior.

The map

may be represented in several ways, with varying convenience depending on the setting. We can parameterise a transport map from the approximate density to the exact density, or a mapping between the CDF’s, or a function of the approximate random variable itself. Since we are not interested in approximating the true posterior, but in checking an existing approximation for quality, we map CDF’s, estimating the distortion of the CDF for each marginal of the joint posterior distribution. This has some benefits and some disadvantages.

On the plus side, the mapping from the CDF of the approximate posterior to the CDF of the exact posterior is an invertible mapping between functions of domain and range . This resembles a copula-like construction (see in particular Eqn. 8

) and doesn’t change from one problem to another, making it easier to write generic code. There is also a simple simulation based fitting scheme, Algorithm 1, to estimate the map. On the downside, we restrict ourselves to diagnostics for low-dimensional marginal distributions. However, multivariate posterior distributions are in practise almost always summarised by point estimates, credible intervals and univariate marginal densities, and the best tools we have seen,

Prangle et al. (2014) and Talts et al. (2018), also focus on univariate marginals. We extend our diagnostics to bivariate marginal distributions in Sec. 3.2 and give examples of estimated distortion surfaces in examples below. This works for higher dimensional distortion maps, but it is not clear how this would be useful for diagnostics and it is harder to do this well.

## 2 Distortion Map

Let be the prior distribution of a scalar parameter and let be the likelihood function of generic data . Let be the observed data value. Given generic data , let be the CDF of the exact posterior . In practice these densities will the marginals of some multivariate parameter of interest. For and , we have . We assume is continuous, so that is continuously differentiable and strictly increasing with at every . The case of discrete is a straightforward extension. Let be a generic approximate posterior on with CDF . We define a distortion map such that for each and each

 Dy(Gy(x))=Fy(x). (1)

The distortion map is a strictly increasing function mapping the unit interval to itself and, as Prangle et al. (2014) point out, is itself the CDF of when . To see this observe that since we have from Eqn. 1, and this is necessary and sufficient for .

Denote by

 dy(q)=ddqDy(q)

the density associated with the CDF so that

is random variable with probability density

for . Since , we have from Eqn. 1,

 π(x|y)=dy(Gy(x))~π(x|y), (2)

connecting the two posterior densities.

We seek an estimate, , of the true distortion map at the data, or equivalently an estimate, , of its density. Other authors, focusing on constructing new posterior approximations, have considered related problems, either without the distortion-map representation, or in an ABC setting. However, since we seek a diagnostic map, not a new approximate posterior, it is not necessary to estimate exactly, but simply to find an approximate that moves towards as measured by KL-divergence. The recalibrated CDF

 ^Fyobs(x)=^Dyobs(Gyobs(x)) (3)

should be a better approximation (in KL-divergence) to than was even if both are bad. The same argument applies at the level of densities. From Equation 1, the recalibrated density

 ^π(x|y)=^dy(Gy(x))~π(x|y)

must improve on . If our original approximation is bad, then we should be able to improve it.

Working with the distortion map is very convenient for building generic code: our diagnostic wrapper, Algorithm 1 below, is always based on a model for a density in . In practice users will have a multivariate approximation and get diagnostics by simulating or otherwise computing marginals . This distribution is computationally tractable, in contrast to .

## 3 Estimating a Distortion Map

We now explain how we approximate the distortion map without simulating the exact posterior. The distortion map we would like to approximate is a continuous distribution on so one approach is to sample it and use the samples to estimate . The difficulty is that is a function of which varies from one -value to another. We can proceed as in Algorithm 1 below which we now outline.

We start by explaining how to simulate . If we simulate the generative model, , then by Bayes rule with the marginal likelihood, so a simulation from the generative model gives us a draw from the exact posterior at the random data . This observation is just the starting point for ABC. Now, from our discussion below Equation 1, if then the pair

have a joint distribution with density

and conditional distribution . This is a recipe to simulate pairs which are realisations of : Simulate with and and then set (the subscript runs over samples, not multivariate components). If admits a closed form CDF then can be evaluated directly. If is not tractable (as in our examples below) then we estimate it using MCMC samples from the approximate posterior. We form the empirical CDF and set . The samples are our “data” for learning about .

We next define a semi-parametric model for

and a log-likelihood for our new “data”. For and let be a family of continuously differentiable strictly increasing CDFs parameterised by an -component parameter , and including the identity map, , for some and all

. Because we are parameterising the distortion, we are working with a probability distribution on

, so we simply model , the corresponding density of , using

 dy(q;w)=Beta(q;a(y;w),b(y;w)), (4)

a Beta density with parameters and which vary over . The functions are parameterised by a feed-forward neural net with two hidden layers and positive outputs and . We tried a Mixture Density Network (MDN) (Bishop, 1994)

of Beta-distributions but found no real gain from taking more than one mixture component.

We now fit our model and estimate at the data. The log likelihood for our parameters given our model and simulations is

 ℓ(w;{qi,yi}Ni=1)=1NN∑i=1logdyi(qi;w). (5)

Let maximise this log-likelihood and consider the estimate . Let

 W={w∗∈Rm:Dy(q)=Dy(q;w∗)} (6)

be the set of parameter values giving the true distortion map. This set is empty unless , so the true map can be represented by the neural net.

We show below that, if the neural net is sufficiently expressive, so that is non-empty, then for any fixed . This is not straightforward as in Eqn. 6 is in general not identifiable so standard regularity conditions for MLE-consistency are not satisfied. Our result compliments that of Papamakarios and Murray (2016) and Greenberg et al. (2019). Working in a similar setting, those authors show that the maximiser of the limit of the scaled log-likelihood gives the true distortion map (if the neural net is sufficiently expressive). Our consistency proof shows that the limit of the maximiser converges to the set of parameter values that express the true distortion map.

Proposition 1 translates the result of Papamakarios and Murray (2016) to our setting. At and fixed , the exact and approximate distortion maps, and have associated densities and . Their KL-divergence is

 KL(Dy(⋅),Dy(⋅;w))≡∫10dy(q)log(dy(q)dy(q;w))dq.

Here, as in Papamakarios and Murray (2016), the KL-divergence of interest is the complement of that used in variational inference. We choose the approximating distribution to fit samples drawn from the true distribution . This is possible using ABC-style joint sampling of and . By contrast in variational inference is varied so that its samples match .

###### Proposition 1.

Suppose the set in Equation 6 is non-empty. Let independently for . Then converges in probability to

 −EY(KL(DY(⋅),DY(⋅;w)))+EQ,Y(log(dY(Q)).

This limit function is maximized at .

We can remove the condition that is non-empty in Proposition 1. This leads to modified versions of the lemma and theorem below which may be more relevant in practice. This is discussed in Appendix.

Proposition 1 tells us that we are maximising the right function, since the limiting KL divergence is minimised at the true distortion map , but it does not show consistency for . In Lemma 1 we prove that is a consistent estimate of .

###### Lemma 1.

Under the conditions of Proposition 1, the estimate is consistent, that is

 limN→∞Pr(|Dy(q;^wN)−Dy(q)|>ϵ)=0.

for every fixed .

Our main result, Theorem 1, follows from Lemma 1. It states that, asymptotically, and in KL divergence, the “improved” CDF is closer to the true posterior CDF than the original approximation . All proofs are given in Appendix.

###### Theorem 1.

Under the conditions of Proposition 1 and assuming ,

 Pr(KL(Fy,^Fy)

as for every fixed .

The fitted distortion map at the data, is of interest as a diagnostic tool.The improved posterior CDF, in Equation 3, or the corresponding PDF , is of only indirect interest to us. The point here is that may be a useful diagnostic for the approximate posterior even if is a poor approximation to as is at least asymptotically closer in KL-divergence to than is. If we can improve on the approximation substantially in KL-divergence to the true posterior, then it was not a good approximation.

Plots of give an easily interpreted visual check on the approximate posterior . A check of this sort is not a formal test, but such a test would not help as we know is an approximation and want to know where it deviates and how badly. Since

is a quantile map, if

is a cup shaped function of then is under-dispersed, cap-shaped is over-dispersed, and if say then the median of lies above the median of and so this is evidence that

is skewed to the right.

When we apply Algorithm 1 we need good neural-net regression estimates for in the neighborhood of only. Fitting the neural net may be quite costly, and since the distortion-map estimate at is in any case dominated by information from pairs at -values close to , we regress on pairs such that , where is a neighbourhood of . This is not “an additional approximation” and quite different to the windowing used in ABC. In our case our estimator is consistent for any fixed neighborhood of , whilst in ABC this is not the case. Extending the regression to the whole of space would be straightforward but pointless.

Note that in Algorithm 1 we have introduced summary statistics on the data. This may be useful if the data are high dimensional, or where there is a sufficient statistic. In the examples which follow we found we were either able to train the network with , or had sufficient statistics in an exponential family model for a random network.

### 3.1 Validation checks on ^Dy

In this section we discuss the choice of and the sample variation of . Since is consistent, converges in probability on any increasing subsequence . In order to check we have taken large enough so that taking it larger will not lead to significant change, we estimate at a sampling of equally spaced -values with and increasing up to . We check that converges numerically at each with increasing and is stable. In order to check the sample dependence, we break up our sample into blocks and, for , form separate estimates and check the variation between function estimates is small.

### 3.2 Extending to higher dimensions

In this section we show how to estimate distortion maps and the corresponding densities for the approximate posterior density of a continuous bivariate parameter . The extension to higher dimensions is straightforward but not obviously useful for diagnostics.

Let and be the CDF’s of the approximate and exact conditional posteriors, respectively and , and let and be the CDF’s of the approximate and exact marginal posteriors, respectively and . Let be the distortion map defined by

 Dx1,y(Gx1,y(x2))=Fx1,y(x2), (7)

with as before. The transformation of the joint density is

 π(x1,x2|y)=dx1,y(Gx1,y(x2))dy(Gy(x1))~π(x1,x2|y) (8)

If the approximation is good at , then the densities and are near equal, which holds if the “distortion surface”, defined by

 dy(q1,q2)≡dG−1y(q1),y(q2)dy(q1), (9)

is close to one for all arguments .

We estimate as before. We estimate by treating as data alongside . We apply Algorithm 1, but now we simulate from the generative model in the for-loop, and create two datasets. The first dataset, with , is the same as before. The second, with , is used to estimate the conditional

. We fit two neural network models for the Beta-density parameters, one fitting the Beta-CDF

using inputs and choosing weights to maximise the likelihood

 ℓ(w;{q1,i,yi}Ni=1)=N∑i=1logdyi(q1,i;w) (10)

and the other fitting the Beta-CDF using inputs and choosing weights to maximise the likelihood

 ℓ(v;{q2,i,(x1,i,yi)}Ni=1)=N∑i=1logdx1,i,yi(q2,i;v). (11)

The run-time is approximately doubled. If and are the MLE’s then the estimates are and .

Finally, we plot the estimated distortion surface

 ^dyobs(q1,q2)=^dG−1yobs(q1),yobs(q2)^dyobs(q1) (12)

as a diagnostic plot. Both components are simply Beta-densities and straightforward to evaluate.

## 4 Further Related Works

Prangle et al. (2014) show that for all iff for . The authors give a diagnostic tool based on this idea for an ABC posterior using the simulated

’s as test statistics. They sample

from the truncated generative distribution , where is a subset containing , and compute for . Then they check that the simulated

are uniformly distributed over

. This corresponds to studying the distribution of the marginalized random variable rather than the conditional random variable which we study. The diagnostic histogram plotted by Prangle et al. (2014) estimates the marginal density of ,

 Q∼dΔ(⋅),dΔ(Q)∝∫y∈Δdy(Q)p(y)dy. (13)

Since is typically rather large, may vary over . In this case the marginal distribution of may be flat when the conditional distribution of is far from flat (or the converse). We give an example in which this is the case. Similar ideas are explored in Talts et al. (2018) and Yao et al. (2018). Notice that when we window our data for neural net regression estimation of there is no integration over data . We regress the distribution of at each (i.e. close ), so we explicitly model variation in with within .

Rodrigues et al. (2018) give a post-processing recalibration scheme for the ABC posterior developing Prangle et al. (2014). The setup is a multivariate version of Equation 1. Ignoring the intrinsic ABC approximation, the “approximation” they correct is due to the fact that they have posterior samples at one -value and they want to transport or recalibrate them so that they are samples from the posterior at a different -value. The main difference is that these authors are approximating the true posterior, whilst we are trying to avoid doing that.

Greenberg et al. (2019) propose Automatic Posterior Transformation (APT) to construct an approximate posterior. Our Algorithm 1 can be seen as the first loop of their Algorithm 4. In their notation, let be an approximation to where are parameters of the fitted approximation. Let be a proposal distribution for . Define

 ~qF(y,w)(x)=qF(y,w)(x)pr(x)π(x)Z(y,w), (14)

with a normalisation over , and

 ~L(w)=N∑i=1log~qF(yi,w)(xi), (15)

where iid for . Appealing to Papamakarios and Murray (2016), the authors show that the -values maximising the scaled limit of , say, satisfy (if the representation is sufficiently expressive) and this leads to a novel algorithm for approximating the posterior. Our approach is a special case obtained by taking , (so ) and the special parameterization

 qF(y,w)(x)=dy(G(x);w)~π(x|y). (16)

In further contrast, we are concerned with diagnosing an approximation, not targeting a posterior.

Some previous work on diagnostics has also avoided forming a good approximation to by focusing on estimating the error for expectations of special functions only. Work on calibration of credible sets by Xing et al. (2019) and Lee et al. (2018) falls in this category. Instead of estimating distortion over the whole CDF , these authors estimate the distortion in the value of one quantile. This lacks diagnostic detail compared to our distortion map. They consider a level approximate credible set computed from the approximate posterior. Xing et al. (2019) estimate how well this approximate credible set covers the true posterior, that is they estimate

 cyobs(q)=EX|Y=yobs(1(X∈~Cyobs(q))), (17)

using regression and methods related to importance sampling. In contrast to Xing et al. (2019), we estimate as a function of , so we estimate the distortion in the CDF, not just the distortion in the mass it puts on one set.

## 5 Toy Example

We apply Algorithm 1 to Bayesian logistic regression. Let

be a design matrix, let be regression coefficients and be binary response data. For each , where and is the th row of . The likelihood is

 p(y|β)=n∏j=1pjy(j)(1−pj)1−y(j),  pj=exp(xT(j)β)exp(xT(j)β)+1

We take a prior distribution with the identity matrix. We are interested in the posterior distribution .

The exact posterior can be sampled via standard MCMC. It is also possible to approximate the exact using computationally cheaper Variational Inference (VI) with posterior (Jaakkola and Jordan, 1997). In this example, we set , and we would like to diagnose the performance of the variational posterior using Algorithm 1. In our example, each entry in the design matrix is sampled independently from . We simulate synthetic -pairs from the generative model , randomly pick one synthetic data point as our observed , and keep the of pairs closest in Euclidean distance to as our training data (this corresponds to a particular choice of in Algorithm 1). Since there is no low dimensional sufficient statistic for this model, we simply use , the dimensional binary response vector, as the summary statistic. We then apply Algorithm 1 using a feed froward neural net with two hidden layers of 80 nodes to estimate the distortion map for each dimension of (recall ), and compare the estimated map to the exact as a function of (the exact map is available for this problem using standard methods).

We plot the marginal posteriors and the corresponding exact and fitted distortion maps for the first two dimensions , of the regression parameter in Fig. 3. The fitted distortion map and in the right column accurately recover the exact map. Both and slightly deviate from the identity map, correctly showing that the marginal VI posteriors for and are slightly under-dispersed compared to the exact posterior. This simple example shows our method is able to handle moderately high dimensional () summary statistics .

## 6 Karate Club Network

In this section we estimate distortion maps measuring the quality of three distinct network model approximations. The data we choose are relatively simple, but happen to illustrate several points neatly. We repeat the analysis on a larger data set in Appendix. The small size of the network data in this example is not an essential point. Conclusions from the larger data set are similar though in some respects less interesting.

The Zachary’s Karate Club network (Zachary, 1977) is a social network with 34 vertices (representing club members) and 78 undirected edges (representing friendship). The data is available at UCINET IV Datasets. See Fig. 4.

We fit an Exponential Random Graph Model (ERGM) (Robins et al., 2007) to these data. Let be the set of all graphs with nodes. Given , let be a -dimensional graphical summary statistic computed on and let be the corresponding ERGM parameter. In our example . In an ERGM, the likelihood of the graph is

 p(y|x)=exp{xTs(y)}/z(x) (18)

where is intractable even for relatively small networks.

Our example approximations come from Caimo and Friel (2012) and Bouranis et al. (2018). Let be the number of edges in . Following Hunter and Handcock (2006), let be the number of connected dyads in that have common neighbors, and let equal the number of nodes in that have neighbors. Let

 v(y,ϕv)=eϕvn−2∑l=1{1−(1−e−ϕv)l}EPl(y)

be the geometrically weighted edgewise shared partners (gwesp) statistic and

 u(y,ϕu)=eϕun−1∑l=1{1−(1−e−ϕu)l}Dl(y)

be the geometrically weighted degree (gwd) statistic. Following Caimo and Friel (2012) let and , and . Our observation model is given by Eqn 18. The prior distribution for is multivariate normal with and .

The exact is doubly intractable. We consider three approximation schemes yielding different approximations :

• Approximate Bayesian Computation with ABC acceptance fraction

and local linear regression adjustment (“ABC-reg”,

Pritchard et al. (1999); Beaumont (2010)).