 # Efficient Bayesian Sampling Using Normalizing Flows to Assist Markov Chain Monte Carlo Methods

Normalizing flows can generate complex target distributions and thus show promise in many applications in Bayesian statistics as an alternative or complement to MCMC for sampling posteriors. Since no data set from the target posterior distribution is available beforehand, the flow is typically trained using the reverse Kullback-Leibler (KL) divergence that only requires samples from a base distribution. This strategy may perform poorly when the posterior is complicated and hard to sample with an untrained normalizing flow. Here we explore a distinct training strategy, using the direct KL divergence as loss, in which samples from the posterior are generated by (i) assisting a local MCMC algorithm on the posterior with a normalizing flow to accelerate its mixing rate and (ii) using the data generated this way to train the flow. The method only requires a limited amount of a priori input about the posterior, and can be used to estimate the evidence required for model validation, as we illustrate on examples.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given a model with continuous parameters

, a prior on these parameters in the form of a probability density function

, and a set of observational data giving the likelihood for the model parameter

, Bayes formula asserts that the posterior distribution of the parameters has probability density

 ρ∗(θ)=ρ(θ|D)ρo(θ)=Z−1∗L(θ)ρo(θ) (1)

where the normalization factor

is the unknown evidence. A primary aim of Bayesian inference is to sample this posterior to identify which parameters best explain the data given the model. In addition one is typically interested in estimating

since it allows for model validation, comparison, and selection.

Markov Chain Monte Carlo (MCMC) algorithms Liu (2008) are nowadays the methods of choice to sample complex posterior distributions. MCMC methods generate a sequence of configurations over which the time average of any suitable observable converges towards its ensemble average over some target distribution, here the posterior. This is achieved by proposing new samples from a proposal density that is easy to sample, then accepting or rejecting them using a criterion that guarantees that the transition kernel of the chain is in detailed balance with respect to the posterior density: a popular choice is Metropolis-Hastings criterion.

MCMC methods, however, suffer from two problems. First, mixing may be slow when the posterior density is multimodal, which can occur when the likelihood is non-log-concave Fong et al. (2019). This is because proposal distributions using local dynamics like the popular Metropolis adjusted Langevin algorithm (MALA) Roberts and Tweedie (1996) are inefficient at making the chain transition from one mode to another, whereas uninformed non-local proposal distributions lead to high rejection rates. The second issue with MCMC algorithms is that they provide no efficient way to estimate the evidence : to this end, they need to be combined with other techniques such as thermodynamic integration or replica exchange, or traded for other techniques such as annealed importance sampling Neal (2001), nested sampling Skilling (2006), or the descent/ascent nonequilibrium estimator proposed in Rotskoff and Vanden-Eijnden (2019) and recently explored in Thin et al. (2021).

Here, we employ a data-driven approach to aid designing a fast-mixing transition kernel along the lines of mode-hoping algorithms proposed by Tjelmeland and Hegstad (2001); Andricioaei et al. (2001); Sminchisescu and Welling (2017) and learning based algorithms of  Levy et al. (2018); Titsias (2017); Song et al. (2017). Normalizing flows Tabak and Vanden-Eijnden (2010); Tabak and Turner (2013); Papamakarios et al. (2021) are especially promising in this context: these maps can approximate the posterior density  as the pushforward of a simple base density (e.g. the prior density ) by an invertible map . Their use for Bayesian inference, as opposed to density estimation Song et al. (2017, 2021), was first advocated in Rezende and Mohamed (2015). Since a representative training set of samples from the posterior density is typically unavailable beforehand, these authors proposed to use the reverse Kullback-Leibler (KL) divergence of the posterior from the push forward of the base , since this divergence can be expressed as an expectation over samples generated from , consistent with the variational inference framework Jordan et al. (1998); Blei et al. (2017). This procedure has the potential drawback that training the map is hard if the posterior differs significantly from the initial pushforward Hartnett and Mohseni (2020), as it may lead to “mode collapse.” Annealing of during training was shown to reduce this issue Wu et al. (2019); Nicoli et al. (2020).

Building on works using normalizing flows with MCMC Albergo et al. (2019); Noé et al. (2019); Gabrié et al. (2021) here we explore an alternative strategy that blends sampling and learning where we (i) assist a MCMC algorithm with a normalizing flow to accelerate mixing and (ii) use the generated data to train the flow on the direct KL divergence.

## 2 Posterior Sampling and Model Validation with Normalizing Flows

A normalizing flow (NF) is an invertible map that pushes forward a simple base density

(typically a Gaussian with unit variance, though we could also take

) towards a target distribution, here the posterior density . An ideal map (with inverse ) is such that if is drawn from then is a sample from . Of course, in practice, we have no access to this exact , but if we have an approximation of , it still assists sampling . Denote by  the push-forward of under the map ,

 ^ρ(θ)=ρB(¯T(θ))det∣∣∇θ¯T∣∣. (2)

As long as and are either both positive or both zero at any point , we can use a Metropolis-Hasting MCMC algorithm to sample from using as a transition kernel: a proposed configuration from a given configuration is accepted with probability

 acc(θ,θ′)=min[1,^ρ(θ)ρ∗(θ′)ρ∗(θ)^ρ(θ′)]. (3)

This procedure is equivalent to using the transition kernel

 πT(θ,θ′)=acc(θ,θ′)^ρ(θ′)+(1−r(θ))δ(θ−θ′) (4)

where . Since is irreducible and aperiodic under the aforementioned conditions on and , its associated chain is ergodic with respect to  Meyn and Tweedie (2012). In addition the evidence is given by

 Z∗=EρB[L(T(θB))ρo(T(θB))^ρ(T(θB))]. (5)

For the scheme to be efficient, two conditions are required. First the parametrization of the map must allow for easy evaluation of the density which requires easily estimable Jacobian determinants and inverses. This issue has been one of the main foci in the normalizing flow literature Papamakarios et al. (2021) and is for instance solved using coupling layers Dinh et al. (2015, 2017). Second, as shown by formula (3) the proposal density must produce samples with statistical weights comparable to the posterior density to ensure appreciable acceptance rates. This requires training the map to resemble the optimal . Figure 1: Sampling a mixture of 2 Gaussians in 10d. Top row: Training loss and acceptance rate of the NF non-local moves as a function of iterations. Middle row: Target density and estimation of the relative weight of modes A and B using sampling with ^ρ. Bottom row: ^ρ and example chains along training/sampling. (See Appendix A.2 for setup details.) Figure 2: Radial velocities. From the signal plotted in blue, we draw the noisy observations in red and obtain the initial samples in gray with the Joker algorithm Price-Whelan et al. (2017).

## 3 Compounding Local MCMCs and Generative Sampling

In Algorithm 1, we present a concurrent sampling/training strategy that synergistically uses to improve sampling from and samples obtained from to train . Let us describe the different components of the scheme.

Sampling. The sampling component of Algorithm 1 alternates between steps of the Metropolis-Hasting procedure using a NF as discussed in Section 2 and steps of a local MCMC sampler (here MALA) (line 11), using as potential

 U∗(θ)=−logL(θ)−logρo(θ). (6)

Strictly speaking, the second transition kernel does not need to be local, it should, however, have satisfactory acceptance rates early in the training procedure to provide initial data to start up the optimization of . From a random initialization , the parametrized density has initially little overlap with the posterior and the moves proposed by the NF have a high probability to be rejected. However, thanks to the data generated by the local sampler, the training of can be initiated. As training goes on, more and more moves proposed by the NF can be accepted. It is crucial to notice that these moves, generated by pushing forward independent draws from the base distribution , are non-local and easily mix between modes. Figure 3: Comparing samples from Algorithm 1 (top row in blue, 2d projections) with the samples from Joker (in green) and the initialization samples (in pink). The posterior densities marginalized over two parameters are shown in the bottom row.

Training. A standard way of training

is to minimize the Kullback-Leibler divergence from

 DKL(ρk∥^ρ)=Ck−∫Θlog^ρ(θ)ρk(θ)dθ, (7)

where denotes the density of the MCMC after steps and is a constant irrelevant for the optimization of . In practice, we run walkers in parallel in the chain: denoting their positions at iteration by , we use the following estimator of (7) (minus the constant) as objective for training:

 Lnk[T] =−1nn∑i=1log^ρ(θi(k)). (8)

The training component of Algorithm 1

uses stochastic gradient descent on this loss function to update the parameters of the normalizing flow (line

15). Note that are not perfect samples from to start with, but their quality increases with the number of iterations of the MCMC. Note also that this training strategy is different from the one proposed in Rezende and Mohamed (2015), which uses instead the reverse KL divergence of from : this objective could be combined with the one in (8). Here we will stick to using  (8) as loss, using approximate input from through the MCMC samples initialized as explained next. Müller et al. (2019) used the same forward KL divergence for training yet using a reweighing of samples from for its estimation, which may be imprecise if the initial pushforward density bears little overlap with the target .

Initialization. To initialize MCMC chains, we assume that we have initial data lying in each of the important modes of the posterior , but require no additional information. That is, we take where the are initially located in these modes but not necessarily drawn from . We stress that the method therefore applies in situations where these modes have been located beforehand, for example by doing gradient descent on  from random points in .

Architecture details of Algorithm 1 are deferred to Appendix A. Questions of convergence of this algorithm are treated in Gabrié et al. (2021) (Appendix B).

## 4 Numerical Experiments

### 4.1 Sampling Mixture of Gaussians in High-dimension

As a first test case of Algorithm 1, we sample a Gaussian mixture with 2 components in dimensions and estimate the relative statistical weights of its two modes.111In this experiment and the one that follows we used unadjusted Langevin dynamics (ULA) rather than MALA because the time steps were sufficiently small to ensure a high acceptance rate. The bottom row of Fig. 1 shows d projections of the trajectories of representative chains (in black) from initializations in each of the modes (red stars) as the NF learns to model the target density (blue contours). Running first locally under the Langevin sampler, the chains progressively mix between modes and grasp the difference of statistical weights also captured by the final map . Quantitatively, the acceptance rate of moves proposed by the NF reaches at the end of training (Fig. 1 top row). The estimator of the relative statistical weights of each modes (the right mode A is twice as likely as the left mode B) using Eq. (5) also converges to the exact value within a small statistical error (Fig. 1 middle row).

### 4.2 Radial Velocity of an Exoplanet

Next we apply our method to the Bayesian sampling of radial velocity parameters in a model close to the one studied by Price-Whelan et al. (2017) for a star-exoplanet system. The model has 4 parameters: an offset velocity , an amplitude , a period and a phase . Introducing , the radial velocity is

 v(t;θ)=v0+Kcos(Ωt+ϕ0) (9)

with . From a set of observations , the goal is to sample the posterior distribution over . Following Price-Whelan et al. (2017), we assume a Gaussian likelihood , with known variance , and the prior distributions

 lnP ∼U(lnPmin,lnPmax), ϕ0∼U(0,2π), (10) K ∼N(μK,σ2K), v0∼N(0,σ2v0).

We sample noisy observations at different times (red diamonds in Fig. 2) from a ground-truth radial velocity with parameters (dashed blue line). Using one iteration of the accept-reject Joker algorithm Price-Whelan et al. (2017) with samples from the prior distributions we obtain sets of likely parameters (corresponding to the gray lines in Fig. 2), which we will use as starting points for the MCMC chains in Algorithm 1. Note that to ensure a minimum acceptance rate of the Joker samples priors of and only, and computes the maximum likelihood value for the “linear parameters” and .

We assess the quality of sampling after iterations of Algorithm 1 on Fig. 3, looking at all the possible 2d projections of space of parameters.

The samples from Algorithm 1 (top row, in blue, final acceptance rate of , see Fig. 4) are generally covering well the modes of the marginal posterior distribution (second row), far beyond the initial samples (top row, in pink), at the exception of a light mode in the neighborhood of and in which no chain was initialized. This is an illustration of the need for prior knowledge of the rough location of a basin of interest to successful sample it with the proposed method. For comparison, we also report samples accepted by the Joker algorithm from an initial draw of samples (top row, in green). Because of the maximum likelihood step along and mentioned above, the posterior is not appropriately sampled along these two dimensions by this strategy.

## 5 Conclusion

Our results show that normalizing flows can be exploited to augment MCMC schemes used in Bayesian inference with a nonlocal transport component that significantly enhances mixing. By design, the method blends MCMC with an optimization component similar to that used in variational inference Blei et al. (2017), and it would be interesting to investigate further how both approaches compare on challenging applications with complex posteriors on high dimensional parameter spaces.

### Acknowledgments

GMR acknowledges generous support from the Terman Faculty Fellowship. EVE acknowledges partial support from the National Science Foundation (NSF) Materials Research Science and Engineering Center Program grant DMR-1420073, NSF DMS-1522767, and a Vannevar Bush Faculty Fellowship.

## References

• M. S. Albergo, G. Kanwar, and P. E. Shanahan (2019) Flow-based generative models for Markov chain Monte Carlo in lattice field theory. Physical Review D 100 (3), pp. 034515. External Links: ISSN 2470-0010, 2470-0029, Document Cited by: §1.
• I. Andricioaei, J. E. Straub, and A. F. Voter (2001) Smart darting Monte Carlo. Journal of Chemical Physics 114 (16), pp. 6994–7000. External Links: Document, ISSN 00219606 Cited by: §1.
• D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational Inference: A Review for Statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. External Links: ISSN 0162-1459, 1537-274X, Document, 1601.00670 Cited by: §1, §5.
• L. Dinh, D. Krueger, and Y. Bengio (2015) NICE: Non-linear independent components estimation. 3rd International Conference on Learning Representations, ICLR 2015 - Workshop Track Proceedings 1 (2), pp. 1–13. External Links: 1410.8516 Cited by: §2.
• L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density Estimation Using Real NVP. In International Conference on Learning Representations, pp. 32. Cited by: §A.1, §2.
• E. Fong, S. Lyddon, and C. Holmes (2019) Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap. In International Conference on Machine Learning, pp. 1952–1962. External Links: ISSN 2640-3498 Cited by: §1.
• M. Gabrié, G. M. Rotskoff, and E. Vanden-Eijnden (2021) Adaptive Monte Carlo augmented with normalizing flows. arXiv:2105.12603 [cond-mat, physics:physics]. External Links: 2105.12603 Cited by: §1, §3.
• G. S. Hartnett and M. Mohseni (2020)

Self-Supervised Learning of Generative Spin-Glasses with Normalizing Flows

.
arXiv:2001.00585 [cond-mat, physics:quant-ph, stat]. External Links: 2001.00585 Cited by: §1.
• R. Jordan, D. Kinderlehrer, and F. Otto (1998) The Variational Formulation of the Fokker–Planck Equation. SIAM Journal on Mathematical Analysis 29 (1), pp. 1–17. External Links: ISSN 0036-1410, 1095-7154, Document Cited by: §1.
• D. Levy, M. D. Hoffman, and J. Sohl-Dickstein (2018)

Generalizing Hamiltonian Monte Carlo with Neural Networks

.
In International Conference on Learning Representations, Cited by: §1.
• J.S. Liu (2008) Monte carlo strategies in scientific computing. Springer Verlag, New York, Berlin, Heidelberg. Cited by: §1.
• S. P. Meyn and R. L. Tweedie (2012) Markov chains and stochastic stability. Springer Science & Business Media. Cited by: §2.
• T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák (2019) Neural Importance Sampling. arXiv:1808.03856 [cs, stat]. External Links: 1808.03856 Cited by: §3.
• R. M. Neal (2001) Annealed importance sampling. Statistics and Computing 11, pp. 125–139. External Links: Document Cited by: §1.
• K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K. R. Müller, and P. Kessel (2020)

Asymptotically unbiased estimation of physical observables with neural samplers

.
Physical Review E 101 (2). External Links: Document, ISSN 24700053 Cited by: §1.
• F. Noé, S. Olsson, J. Köhler, and H. Wu (2019)

Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning

.
Science 365 (6457), pp. eaaw1147. External Links: ISSN 0036-8075, 1095-9203, Document Cited by: §1.
• G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan (2021) Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22 (57), pp. 1–64. Cited by: §1, §2.
• A. M. Price-Whelan, D. W. Hogg, D. Foreman-Mackey, and H. Rix (2017) The Joker: A Custom Monte Carlo Sampler for Binary-star and Exoplanet Radial Velocity Data . The Astrophysical Journal 837 (1), pp. 20. Cited by: Figure 2, §4.2.
• D. Rezende and S. Mohamed (2015) Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pp. 1530–1538. External Links: ISSN 1938-7228 Cited by: §1, §3.
• G. O. Roberts and R. L. Tweedie (1996) Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 (4), pp. 341–363. External Links: ISSN 1350-7265 Cited by: §1.
• G. M. Rotskoff and E. Vanden-Eijnden (2019)

Dynamical Computation of the Density of States and Bayes Factors Using Nonequilibrium Importance Sampling

.
Physical Review Letters 122 (15), pp. 150602. External Links: ISSN 0031-9007, 1079-7114, Document Cited by: §1.
• J. Skilling (2006) Nested sampling for general Bayesian computation. Bayesian Analysis 1 (4), pp. 833–859. External Links: ISSN 1936-0975, Document Cited by: §1.
• C. Sminchisescu and M. Welling (2017) Generalized darting Monte Carlo. In

Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR

,
Vol. 2, pp. 516–523. External Links: Document, ISSN 00313203 Cited by: §1.
• J. Song, S. Zhao, and S. Ermon (2017) A-nice-mc: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §1.
• Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, Cited by: §1.
• E. G. Tabak and C. V. Turner (2013) A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164. External Links: Document Cited by: §1.
• E. G. Tabak and E. Vanden-Eijnden (2010) Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8 (1), pp. 217–233. External Links: ISSN 15396746, 19450796, Document Cited by: §1.
• A. Thin, Y. Janati, S. L. Corff, C. Ollion, A. Doucet, A. Durmus, E. Moulines, and C. Robert (2021) Invertible Flow Non Equilibrium sampling. arXiv:2103.10943 [stat]. External Links: 2103.10943 Cited by: §1.
• M. K. Titsias (2017) Learning Model Reparametrizations: Implicit Variational Inference by Fitting MCMC distributions. arXiv:1708.01529 [stat]. External Links: 1708.01529 Cited by: §1.
• H. Tjelmeland and B. K. Hegstad (2001) Mode Jumping Proposals in MCMC. Scandinavian Journal of Statistics 28 (1), pp. 205–223. External Links: Document, ISSN 0303-6898, Link Cited by: §1.
• D. Wu, L. Wang, and P. Zhang (2019) Solving Statistical Mechanics Using Variational Autoregressive Networks. Physical Review Letters 122 (8), pp. 080602. External Links: Document Cited by: §1.

## Appendix A Computational details

All codes are made available through the Github repository anonymous (to be disclosed after double-blind review).

### a.1 Architectures

We parametrize the map as a RealNVP Dinh et al. (2017), for which inverse and Jacobian determinant can be computed efficiently. Its building block is an invertible affine coupling layer updating half of the variables,

 θ(k+1)1:d/2=es(θ(k)d/2:d)⊙θ(k)1:d/2+t(θ(k)d/2:d) (11)

where and are learnable neural networks from , denotes component-wise multiplication (Hadamard product), and

indexes the depth of the network. In our experiments we use fully connected networks with depth 3, hidden layers of width 100 with ReLU activations. The parameters of these networks are initialized with random Gaussian values of small variance so that the RealNVP implements a map

close to the identity at initialization.

### a.2 Mixture of Gaussians experiment

#### Target and hyperparameters.

The target distribution is a mixture of Gaussian in dimension with two components:

 ρ∗(θ)=wAe−12|θ−θA|2(2π)d/2+wBe−12|θ−θB|2(2π)d/2, (12)

with weights , and centroids with non-zero coordinates only in the first two dimensions and . Fig. 1 displays density slices in the -plane.

We use a RealNVP network with 6 pairs of coupling layers and a standard normal distribution as a base

. A set of 100 independent walkers were initialized in equal shares in modes A and B of the target. For the optimization, we compute gradients using batches of 1000 samples corresponding to 10 consecutive repeated sampling steps of Algorithm 1 (with and ). We run 4000 parameters updates using Adam with a learning rate of 0.005.

#### Computing log-evidence differences.

Interpreting as a posterior distribution over a set of parameters , we would like to estimate the relative evidence for modes and . Denoting by and two sets of configurations in corresponding to each mode, the difference between their log-evidence is given by

 logZA−logZB =log∫Θ1A(θ)ρ∗(θ)dθ∫Θ1B(θ)ρ∗(θ)dθ=logE∗(1A)−logE∗(1B). (13)

Once the normalizing flow has been learned to assist sampling, it can be used to approximate Eq. 5. Drawing from we have the Monte Carlo estimate of :

 ^Z∗,A=1Z∗n∑i=11A(θi)^w(θi), (14)

with the unnormalized weights , taking the form of an importance sampling estimator. The quality of the estimator can be monitored using an estimate of the effective sample size to adjust the variance estimate from the empirical variance

 neff=(∑ni=1^wi)2∑ni=1^w2i. (15)

Finally the unknown cancels out in the log-evidence difference:

 logZA−logZB ≈log(n∑i=11A(θi)^wi)−log(n∑i=11B(θi)^wi). (16)

In the middle right panel of Fig. 1 we report the performance of this estimator using the sets and , and . The estimator convergence to the exact value as the quality of increases over training.

#### Setup.

The parameters of the prior are set to the values , , , , and . The RealNVP used to learn the posterior distribution has 6 pairs of coupling layers. The base distribution is a standard normal distribution as in the previous experiment, but the NF is learned on a whitened representation of the parameters . Concretely, using the initial samples from the Joker, we compute the empirical mean and empirical covariance

. The latter admits an eigenvalue decomposition

, with the diagonal matrix of eigenvalues and

the orthogonal eigenvectors basis. Using

, we train the normalize flow to model the distribution of the whitened parameters .

#### Training.

A set of 110 independent walkers were initialized using 10 copies of each of set of parameters retained from an initial iteration of the Joker algorithm from draws of and . Here again we use 5 consecutive repeated sampling steps of Algorithm 1 (with and ) to compute gradients using batches of 550 samples. We run Adam for iterations with a learning rate of 0.001.

Fig. 4 reports the evolution of the approximate KL divergence along training (left) and the acceptance rates of the non-local moves proposed by the NF (middle). We also plot the projected trajectories of chains started with the initial Joker samples updated with combined sampling component of Algorithm 1 using the final learned . During the short chains that included 10 non-local resampling steps, the walkers can be seen to jump back and forth between two of the modes of the marginalized density (underlying in blue in the plot). However mixing is not well assisted with the mode around (see discussion in the main text as well). Figure 4: Training of a NF to model the posterior distribution for the radial velocity experiment. Left: Evolution of the approximate KL divergence used as an objective for training. Middle: The acceptance rate of the non-local moves proposed by the NF along training are above 60% by the end of training. Right: The contour blue plot reports the marginalized true posterior along lnP and ϕ0 computed with numerical integration. The colored lines mark the trajectories of MCMC chains using the assisted sampling startegy considered here with the final learned map T.