Assessment and adjustment of approximate inference algorithms using the law of total variance

11/20/2019 ∙ by Xuejun Yu, et al. ∙ National University of Singapore 0

A common method for assessing validity of Bayesian sampling or approximate inference methods makes use of simulated data replicates for parameters drawn from the prior. Under continuity assumptions, quantiles of functions of the simulated parameter values within corresponding posterior distributions are uniformly distributed. Checking for uniformity when a posterior density is approximated numerically provides a diagnostic for algorithm validity. Furthermore, adjustments to achieve uniformity can improve the quality of approximate inference methods. A weakness of this general approach is that it seems difficult to extend beyond scalar functions of interest. The present article develops an alternative to quantile-based checking and adjustment methods which is inherently multivariate. The new approach is based on use of the tower property of conditional expectation and the law of total variance for relating prior and posterior expectations and covariances. For adjustment, approximate inferences are modified so that the correct prior to posterior relationships hold. We illustrate the method in three examples. The first uses an auxiliary model in a likelihood-free inference problem. The second considers corrections for variational Bayes approximations in a deep neural network generalized linear mixed model. Our final application considers a deep neural network surrogate for approximating Gaussian process regression predictive inference.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Researchers are increasingly called upon to analyse large datasets using complex models, and this has resulted in approximate inference methods being more widely used, such as likelihood-free inference (Marin et al., 2012; Blum et al., 2013) and variational inference (Ormerod and Wand, 2010; Blei et al., 2017). However, often the accuracy of these methods cannot be guaranteed, so there is interest in diagnosing algorithm failures, and in developing adjustments which improve their performance. The current paper is a contribution to this area, and considers an approach that can easily handle multivariate quantities of interest, unlike the existing approaches which are hard to extend beyond checking and adjustment for univariate quantities.

A common method for checking Bayesian computational algorithms is based on posterior quantiles (Cook et al., 2006; Gelman, 2017). To save notation, and following the discussion in Cook et al. (2006), consider the case of a scalar parameter of interest. The data will be denoted as . Suppose we have a joint Bayesian model for specified by a density , where is the prior density and is the sampling density. Consider a simulation from , obtained by drawing , and then . Noting that , we can equivalently think of as obtained by sampling , then , so that we can regard in the joint sample as being an exact draw from

. Under continuity assumptions, transforming a random variable by its distribution function gives a uniform random variable, so writing

for the posterior distribution function of given , we have that , where is the uniform density on . When Markov chain Monte Carlo (MCMC) simulation or some approximate inference algorithm is used to approximate by say, testing for uniformity of based on replicate samples of can be used to diagnose problems with the simulation algorithm, or to make inference adjustments.

Gelman (2017) points out that problems can occur in implementation of the method of Cook et al. (2006)

if proper account is not taken of correlation between samples when checking MCMC sampling algorithms. Difficulties can also arise from discretization when using sample based estimates of quantiles.

Talts et al. (2018) suggest an alternative approach based on ranks and ways of adjusting for dependence, if that is an issue. They also show how to interpret the nature of any problems with the algorithm based on the kind of departure from uniformity observed. Closely related methods to those of Cook et al. (2006) have been applied for checking likelihood-free and variational inference algorithms (Wegmann et al., 2009; Prangle et al., 2014; Yao et al., 2017; Lee et al., 2019). In the context of likelihood-free inference using approximate Bayesian computation (ABC), these quantile-based checks have motivated recalibration algorithms to improve the quality of posterior approximations (Rodrigues et al., 2018). In these approaches, if simple importance sampling ABC computational methods are used, then repeated posterior approximations for different data can be done using the same prior samples, and the recalibration adjustments can be performed efficiently. Rodrigues et al. (2018) also note that their adjustments are feasible when fast approximations are available, and they discuss in detail calibration of inferences based on auxiliary models. Variational approximations can also benefit from the suggested adjustments.

One difficulty with the quantile-based adjustment methods mentioned above is that it seems difficult to apply them in the case of a multivariate when corrections in the dependence structure of the posterior are desired. In the next section we outline a novel alternative to the standard quantile-based inference adjustments. In particular, we consider applying the tower property of conditional expectation and law of total variance to relate prior and posterior means and covariances. When the posterior means and covariances are estimated by an approximate inference algorithm, we suggest adjustments so that the correct prior to posterior relationships hold after the adjustment.

After outlining the main idea in Section 2, Section 3 explains the detailed implementation in the case where particle approximations of the posterior distributions are available. Section 4 considers several applications involving likelihood-free inference, and checking and adjusting for approximate inferences in some deep neural network and Gaussian process regression models. Section 5 gives some concluding discussion.

2 The basic idea

Suppose that is a possibly multivariate unknown of interest. We define a Bayesian model through a joint density

. We assume that the first two moments of the prior distribution exist. The tower property of conditional expectation allows us to write the prior mean as

(1)

for . Now suppose some approximate inference algorithm is available, approximating the true posterior density by , with approximate posterior mean and posterior covariance matrix and respectively. If our approximate inference algorithm is accurate, then

(2)

Similarly, applying the law of total variance to the prior covariance matrix of ,

(3)

for . Once again, if our approximate inference algorithm is accurate,

(4)

for . The main idea of our checking and adjustment method is that given independent samples , , we can estimate the left-hand sides of (2) and (4) based on the prior samples (if the prior mean and covariance are not available analytically), and the right-hand sides of (2) and (4) based on the samples and the approximations and . For checking, we can assess whether departures from equality in (2) and (4) are of practical concern. For inference adjustments, we can alter so that exact equality holds in (2) and (4) for and

after adjustment. Although this is the basic idea, some further refinements are outlined below. Checking Bayesian inferences according to equation (

2) is somewhat related to the idea of calibration of posterior means discussed in Gelman et al. (2013), p. 128, although the use of (2) and (4) jointly to check and adjust approximate inference methods is novel as far as we know. It may also be possible to extend to adjustments beyond second order moments using the law of total cumulance (Brillinger, 1969), but we do not consider this further.

2.1 Use of a conditioning set

A first observation (related to a similar remark in Prangle et al. (2014) for their quantile based diagnostic for likelihood-free inference algorithms) is that while achieving approximate equality in (2) and (4) is necessary for an approximate inference algorithm to be good, it is not sufficient. Consider, for example, the approximate inference method which always returns the prior as the estimated posterior. Then , and so (2) and (4) hold exactly if we always estimate the posterior regardless of by the prior. On the other hand, it is not hard to see that diagnosing the quality of the approximation using (2) and(4) could be useful. For example, suppose that our approximate inference algorithm always produces exact posterior mean values while underestimating variability where for covariance matrices and means that is non-negative definite. Then (2) would hold, but

and the situation we have considered is not unrealistic for some variational inference algorithms where point estimation may be excellent but posterior variability is substantially underestimated.

In the case of checking and adjustment of a posterior density for observed value of , it seems sensible to consider (again following Prangle et al. (2014)) a conditioning on in the joint model for some set with in order to make the adjustment more relevant to the data observed. This will also avoid problems such as the one mentioned above where an approximate inference algorithm returning the prior would be considered satisfactory or in no need of adjustment. This is because after modifying equations (2) and (4) by conditioning on , these modified relations will not hold, in general, for an approximate inference algorithm estimating the posterior density by the prior. The choice of will be discussed later in the examples. More explicitly, consider the joint model

where denotes the indicator function, and samples from this joint model can be obtained by drawing until . Write for the marginal of . In the model we can consider the following analogues of (1) and (3):

(5)

and

(6)

and then checking and adjustment can be based on assessment of departures from equality in (5) and (6) when and are replaced by and . If our approximation is accurate, then

(7)

and

(8)

for samples from . In the next section we describe in detail how checking and adjustment are done when approximate posterior distributions are given as particle approximations in the form of Monte Carlo samples.

3 Checking and adjustment strategy

For an approximate inference algorithm leading to approximate posterior distributions for we have suggested checking and adjustment based on assessing departure from equality in (7) and (8). For data we assume that the approximation comes in the form of an approximate posterior sample, , and if this is not the case, we assume that it is easy to generate from to obtain such a sample. Let , denote independent draws from .

Let us write

with

Here and are sample based estimates of the left- and right-hand sides of (7). Also write

and

where

and

Here and are sample estimates of the left- and right-hand sides of (8), with and sample estimates of the first and second terms on the right-hand side of (8). In the case where and are known exactly, these can be substituted for and above. This can happen when the approximation is not in the form of a Monte Carlo sample but takes some analytical form such as for a variational approximation. With the above ideas and notation established, we consider next inference algorithm assessment, and then inference adjustments.

3.1 Assessment of the inference algorithm

The equalities and will not hold exactly. We suggest assessing whether components of and and of and differ by a large amount compared to the variability in their sample based estimation. The practical importance of the size of any difference for inferences and decisions should also be considered, but this question tends to be application specific. To assess variability in sample based estimation of , , and we suggest using the bootstrap. Consider resampling of the triples , , with replacement, times. For each resample we compute , , and to obtain values , , , , . We can plot components of and

against each other, as well as standard deviations and correlations derived from

and . Estimates and are direct estimates of the mean and covariance of given based on prior samples. Estimates and are indirect estimates based on posterior approximations. For assessment of an approximate inference algorithm, checks of location, scale and correlation are often directly meaningful to users. In contrast, marginal quantile based assessments require the user to back out information about the nature of any miscalibration in location and scale from histograms of -values, which of course can be done but requires more sophistication. For our suggested plots of functions of and against each other, points lying above (below) the diagonal line indicate an average overestimation (underestimation) based on the posterior approximation. The use and meaning of these plots is discussed further in the examples of Section 4.

3.2 Adjustment of the inference algorithm

To adjust approximate inferences so that the equalities (7) and (8) hold, we can transform the particles . We assume that the components of are unrestricted, which can be achieved by a preliminary transformation if necessary. Write for the lower triangular Cholesky factor of . Also, assuming that is positive definite, write for it’s lower triangular Cholesky factor. Although is not guaranteed to be positive definite in general, it usually is, since the prior variation (captured by ) tends to be larger than the variability in the posterior means (estimated by ). We describe later how to handle the situation where is not positive definite, which can which can happen when there is a very poor approximation or because of the sampling variability in the estimation of and .

Now suppose we transform the samples , , to

(9)

The first two terms on the right-hand side of (9) perform a mean adjustment for the th replicate samples sufficient to ensure that (7) holds for the adjusted samples. The final term on the right-hand side of (9) is a scaling of the mean-centred particles for the th replicate samples, and ensures that an empirical estimate of based on the transformed particles leads to equality in (8). More precisely, denote the quantities etc. evaluated for the adjusted samples by , etc. We have

with

so that and (7) holds exactly for the adjusted samples. Next, note that

and hence . Also, and hence

so that and

so (8) also holds exactly for the adjusted samples. Given samples , for the observed data, we will transform them similarly to (9) to get an adjusted Monte Carlo sample from the posterior given to be used for inferential purposes:

(10)

As mentioned earlier it may happen that is not positive definite when the approximate inference algorithm gives particularly poor estimation of the posterior mean values, or due to sampling variability in the estimates and . To handle this we can do a preliminary preprocessing in which is changed to

in which is a shrinkage parameter, . It is easy to see that this results in shrinkage of the for the new particles towards , with changing to . The preprocessing leaves , and unchanged. The parameter is chosen so that

is positive definite. We suggest choosing it so that the smallest eigenvalue of

is equal to the smallest eigenvalue of . Once this preprocessing is done if necessary, our adjustment can proceed as above.

The adjustment suggested (and the similar quantile based adjustment and checking methods in the literature) are computationally intensive, since they require repeated approximations to the posterior for different datasets. However, these approximations are sometimes easy to obtain. For example, in the case of approximate Bayesian computation methods using samples from the prior, the same prior samples can be reused for the approximation for different data. In the case where fast enough posterior approximations are available Rodrigues et al. (2018) point out that recalibration adjustments can still be attractive. The computations involved in recalibration are trivially parallelizable, consisting of independent computations for different datasets.

4 Applications

We consider three examples. The first concerns a likelihood-free inference application discussed in Rodrigues et al. (2018), and we compare our own adjustment method with the quantile-based method considered in their paper. Our second example involves inference in a deep neural network generalized linear mixed model (Tran et al., 2019). In this example our method corrects for empirical Bayes and variational computational approximations in approximate Bayesian inference. We also explain the features of this example which make the adjustment of Rodrigues et al. (2018) performs poorly compared to our new method. The third example considers predictive inference in a Gaussian process model, when a deep neural network regression is used as a surrogate for the Gaussian process. The inference adjustment is intended to make the deep neural network regression uncertainty quantification closer to that provided by the Gaussian process. This can be of interest because the Gaussian process computation can be intractable for large datasets.

4.1 Likelihood-free inference using an auxiliary model

Our first example was considered in Rodrigues et al. (2018), where they considered a simple model that is useful in some finance and insurance applications. Consider independently and identically distributed observations of a random variable that is defined as where the are independent log-normal, . The density of is difficult to compute, and Rodrigues et al. (2018) consider the so-called Fenton-Wilkinson approximation (Fenton, 1960; Asmussen and Rojas-Nandayapa, 2008) which uses a log-normal density matching the mean and variance of to approximate the intractable density. is approximated by , where

As before, the observed is denoted . As a fast auxiliary model approximation to the posterior density for the parameter , Rodrigues et al. (2018) consider a Laplace approximation when the likelihood is approximated by the Fenton-Wilkinson method. For data , write for the approximate likelihood, for the mode of where , and for the Hessian of evaluated at . The posterior approximation based on the auxiliary Fenton-Wilkinson model and Laplace approximation is . Computation of this approximation is very fast, whereas an ABC approximation using as the summary statistic is computationally demanding. After obtaining a posterior approximation for , we transform back to an approximation for . Although this example is concerned with likelihood-free inference, the use of an auxiliary model is not the traditional ABC approach commonly used in such settings. For ABC, there are a number of other adjustment methods in the literature. These include methods based on regression (Beaumont et al., 2002; Blum, 2010; Blum and Tran, 2010; Blum and François, 2010) and on making adjustments to low-dimensional marginals (Nott et al., 2014; Li et al., 2017).

Rodrigues et al. (2018) consider simulating observations with and . The prior densities on and are independent and Gamma(1,1) respectively. The recalibration procedure in Algorithm 2 of Rodrigues et al. (2018) incorporates kernel weights. With a uniform kernel, this corresponds to choosing a conditioning set in their approach, similar to that considered in Section 2, based on the support of the kernel centred on . In this example they use kernel bandwidth , so that there is no reweighting or conditioning. As starting point for their adjustment, they generate a training sample of parameter and data pairs from the prior. For each of these they compute their auxiliary model posterior approximation. Our own multivariate adjustment approach will be implemented based on these same posterior approximations. For our new method, we use a conditioning set of the form

where and are the prior predictive mean absolute deviations of and respectively estimated from the training sample, and is chosen so that of the prior training samples are covered by .

First, we consider a check on the quality of the approximate inferences provided by the auxiliary model. Figure 1 shows plots of the kind suggested in Section 3.1 of the components of bootstrap replicates of and against each other, and of posterior standard deviations and correlations derived from bootstrap replicates of and . The values for and are direct estimates of moments for based on prior samples - the values for and are indirect estimates of moments based on the posterior approximation. The plots suggest some overall overestimation of posterior means for , underestimation of posterior means for , underestimation of standard deviations for and , and overestimation of correlation between and .

Figure 1: Plots of bootstrap replicates of components of and and of standard deviations and correlations derived from and against each other, with . Points above the red diagonal line indicate an overall overestimation by the posterior approximation, points below the red diagonal line indicate an overall underestimation.

The method of Rodrigues et al. (2018) works well in this example. The top panel of Figure 2 is similar to Figure 1 (b) in Rodrigues et al. (2018) and shows the estimated joint posterior based on the adjusted samples for their method, compared to a “gold standard” ABC method with small tolerance (see Rodrigues et al. (2018) for further details). The filled contours show the unadjusted auxiliary approximation. Shown in the bottom panel of Figure 2 is the adjusted posterior density estimate for our new approach, again comapred with the “gold standard” ABC analysis and the unadjusted approximation. Since the new method performs only mean and scale adjustments, the estimated joint posterior for will be normal after adjustment when the auxiliary model posterior approximation is normal. Figure 2 shows that both methods manage to correct for the miscalibration of mean and scale in the initial auxiliary model approximation. The method of Rodrigues et al. (2018) works best here, but we explain in the next example why it sometimes performs badly.

Figure 2:

Contour plots of uncalibrated approximation, “gold standard” ABC approximation and adjusted approximations based on marginal calibration (top) and the new total variance calibration (bottom). All posterior approximations shown are two-dimensional kernel density estimates obtained from particle approximations to the posterior density for the different methods.

4.2 Variational and empirical Bayes approximations within a deep longitudinal mixed model

Tran et al. (2019) describe generalized linear mixed models (GLMMs) for longitudinal data using deep neural network (DNN) basis functions. We consider inference for random effects variances in such models. Write for a binary response for an individual at time , , , and write

for a corresponding vector of covariates. A logistic deep GLMM assumes

where

is the scalar output of a feed-forward neural network with inner weights

and output weights , for input . If is the number of nodes in the last hidden layer of the neural network, the dimension of and each is , after inclusion of an intercept term. The vector contains fixed effects parameters and are random effects, modelled as where is diagonal with diagonal entries , , and is the variance of the random intercepts. In our later example , and only a single hidden layer will be used in the neural network. The model parameters are . Semiparametric versions of the above model which are more interpretable and in which some of the covariates are not transformed can also be considered. See Tran et al. (2019) for further discussion of the model.

For the neural network weights we use a prior which is normal, , , where in our later data analysis . The value was chosen based on prior predictive simulations. The parameters are assumed to follow a prior in which they are independent with , but constrained to the region . This means that the basis functions are ordered according to the size of the corresponding random effect variance. We set and

in this example, where the shape-scale parametrization of the gamma distribution is used here. We focus on inference for

. It is challenging to make inferences about a large set of random effects variance parameters here unless the number of subjects is large. In generalized linear mixed models, maximum likelihood estimates or posterior mode estimates with weak priors often lead to degenerate estimates (Chung et al., 2015). Here we use strong shrinkage priors and a random effect for each basis function to define a predictive model which flexibly describes within subject dependence.

We use our adjustment method to approximate full Bayes inference using the normal prior on the weights above, based on the approximate normal posterior produced by the algorithm of Tran et al. (2019). Tran et al. (2019)

approximate Bayesian inference under a different prior specification for the weights and using variational approximation methods and empirical Bayes methods for hyperparameter estimation. See Section 6.1.4 of their paper.

Tran et al. (2019) do not impose the identification constraint for inference, and in their normal posterior approximation it is assumed that the covariance matrix has the low-rank plus diagonal form , where is diagonal and it will be assumed here that is a column vector. The approximate posterior samples needed for our adjustment method are obtained by simulating from the normal variational posterior, and then reordering components so that . We take the effect of the relabelling as another aspect of the approximation that our method aims to correct.

Similar to the previous example, we use a conditioning set . If is a simulated dataset from the prior with , , then we use

where is the observed data, , , is a tolerance parameter, and

with the Jaccard distance (Jaccard, 1901) between two binary vectors with the same length. If is the number of components where and are both and is the number of components where and take different values, then , where here by convention is defined to be . We use our adjustment method with prior replicates, and choose in the conditioning set so that of these prior replicates are kept for the adjustment.

We consider an analysis of a German health care dataset (Geil et al., 1997) consisting of yearly records of hospitalized status of 1887 German workers. The response if worker was hospitalized in year , and

otherwise. The covariates include gender, age, income, education, marriage status, job type and insurance type. There are 12 predictors in total after using dummy variables to represent the categorical variables. We use only the first

panels in the dataset, since fitting the deep GLMM model is computationally expensive.

4.2.1 Assessment of approximate inference

First, consider our bootstrap assessment plots for , where . We focus on the four largest random effects variance parameters since they are the most important, and summarizing the results graphically is easier with a smaller number of parameters. Figure 3 shows the plots for the mean (top row) and standard deviation (bottom row). We see that the approximate inference method underestimates the largest values, and overestimates their variability.

Figure 3: Plots of bootstrap replicates of components of and (top row) and of standard deviations derived from and (bottom row) against each other for . Points above the diagonal line indicate an overall overestimation by the posterior approximation, points below the red diagonal line indicate an overall underestimation.

4.2.2 Approximate inference adjustments

When the value of is estimated by , the squared estimation error is where is the value of used in simulating the data for replicate . Figure 4 shows boxplots of these squared estimation errors for estimated posterior means obtained from unadjusted and adjusted posterior samples. The total variance adjustment improves substantially on the unadjusted method. We also tried the adjustment method of Rodrigues et al. (2018), which does not work well in this example (results not shown). The reason is that the unadjusted posterior approximation is very poor here, and the method of Rodrigues et al. (2018) obtains adjusted sample components by evaluating marginal posterior quantile functions conditional on the observed data at certain points. If the quantile functions are estimated based on the empirical distribution of a particle posterior estimate, then the adjusted samples cannot extend beyond the observed range of the unadjusted particles. Hence, if a good approximation requires putting posterior mass beyond the range of the unadjusted particles, the method of Rodrigues et al. (2018) can perform badly. The total variance method can still give useful results here because of the way that adjustments are based on mean and scale adjustments rather than quantiles.

Figure 4: Boxplots of squared estimation errors for , , with and without adjustment, treating each simulation replicate in turn as the observed data

Figure 5 shows, for the observed data, kernel estimates of the univariate and bivariate marginal posterior densities obtained with and without the total variance adjustment. Here we do not know the true parameter values. However, given the improvements demonstrated in Figure 4 for the simulated data, inference based on the adjusted approach seems preferable, and furthermore this adjustment leads to substantially different estimated posterior densities.

Figure 5: Univariate (diagonal panels) and bivariate (off-diagonal panels) estimated marginal posterior densities with (red) and without (black) total variance adjustment for , .

4.3 Approximate Gaussian process regression predictive inference using a deep neural network surrogate

Our final example considers application of a deep neural network regression model as a surrogate for Gaussian process regression predictive inference. We consider a dimensional input space. Suppose data are observed, where is a vector of observed values for a response and is a corresponding matrix of inputs with rows. The th row of is denoted , and gives the value of the input for the response . We wish to make predictive inference for a future response for input , and we will treat as random for the purpose of deriving appropriate adjustments. The main purpose of this example is to show how our adjustment method applies in a setting which involves predictive inference.

For predictive inference adjustments following the framework of Section 3.2, we will start with simulated datasets from the prior, with each dataset containing training and test samples respectively,

(11)

with indexing the different replicates. The training and test inputs are generated independently as bivariate normal, where the covariance matrix has upper triangular Cholesky factor . has non-zero elements , which we generated independently from an density here. is the same for all the replicates. The responses are generated conditionally on inputs from a regression model,

where , , , are independent and is a mean function. The values are simulated independently from a prior .

The function is simulated from a Gaussian process prior, and we discuss this next. Write