1 Introduction
Given a model with continuous parameters
, a prior on these parameters in the form of a probability density function
, and a set of observational data giving the likelihood for the model parameter, Bayes formula asserts that the posterior distribution of the parameters has probability density
(1) 
where the normalization factor
is the unknown evidence. A primary aim of Bayesian inference is to sample this posterior to identify which parameters best explain the data given the model. In addition one is typically interested in estimating
since it allows for model validation, comparison, and selection.Markov Chain Monte Carlo (MCMC) algorithms Liu (2008) are nowadays the methods of choice to sample complex posterior distributions. MCMC methods generate a sequence of configurations over which the time average of any suitable observable converges towards its ensemble average over some target distribution, here the posterior. This is achieved by proposing new samples from a proposal density that is easy to sample, then accepting or rejecting them using a criterion that guarantees that the transition kernel of the chain is in detailed balance with respect to the posterior density: a popular choice is MetropolisHastings criterion.
MCMC methods, however, suffer from two problems. First, mixing may be slow when the posterior density is multimodal, which can occur when the likelihood is nonlogconcave Fong et al. (2019). This is because proposal distributions using local dynamics like the popular Metropolis adjusted Langevin algorithm (MALA) Roberts and Tweedie (1996) are inefficient at making the chain transition from one mode to another, whereas uninformed nonlocal proposal distributions lead to high rejection rates. The second issue with MCMC algorithms is that they provide no efficient way to estimate the evidence : to this end, they need to be combined with other techniques such as thermodynamic integration or replica exchange, or traded for other techniques such as annealed importance sampling Neal (2001), nested sampling Skilling (2006), or the descent/ascent nonequilibrium estimator proposed in Rotskoff and VandenEijnden (2019) and recently explored in Thin et al. (2021).
Here, we employ a datadriven approach to aid designing a fastmixing transition kernel along the lines of modehoping algorithms proposed by Tjelmeland and Hegstad (2001); Andricioaei et al. (2001); Sminchisescu and Welling (2017) and learning based algorithms of Levy et al. (2018); Titsias (2017); Song et al. (2017). Normalizing flows Tabak and VandenEijnden (2010); Tabak and Turner (2013); Papamakarios et al. (2021) are especially promising in this context: these maps can approximate the posterior density as the pushforward of a simple base density (e.g. the prior density ) by an invertible map . Their use for Bayesian inference, as opposed to density estimation Song et al. (2017, 2021), was first advocated in Rezende and Mohamed (2015). Since a representative training set of samples from the posterior density is typically unavailable beforehand, these authors proposed to use the reverse KullbackLeibler (KL) divergence of the posterior from the push forward of the base , since this divergence can be expressed as an expectation over samples generated from , consistent with the variational inference framework Jordan et al. (1998); Blei et al. (2017). This procedure has the potential drawback that training the map is hard if the posterior differs significantly from the initial pushforward Hartnett and Mohseni (2020), as it may lead to “mode collapse.” Annealing of during training was shown to reduce this issue Wu et al. (2019); Nicoli et al. (2020).
Building on works using normalizing flows with MCMC Albergo et al. (2019); Noé et al. (2019); Gabrié et al. (2021) here we explore an alternative strategy that blends sampling and learning where we (i) assist a MCMC algorithm with a normalizing flow to accelerate mixing and (ii) use the generated data to train the flow on the direct KL divergence.
2 Posterior Sampling and Model Validation with Normalizing Flows
A normalizing flow (NF) is an invertible map that pushes forward a simple base density
(typically a Gaussian with unit variance, though we could also take
) towards a target distribution, here the posterior density . An ideal map (with inverse ) is such that if is drawn from then is a sample from . Of course, in practice, we have no access to this exact , but if we have an approximation of , it still assists sampling . Denote by the pushforward of under the map ,(2) 
As long as and are either both positive or both zero at any point , we can use a MetropolisHasting MCMC algorithm to sample from using as a transition kernel: a proposed configuration from a given configuration is accepted with probability
(3) 
This procedure is equivalent to using the transition kernel
(4) 
where . Since is irreducible and aperiodic under the aforementioned conditions on and , its associated chain is ergodic with respect to Meyn and Tweedie (2012). In addition the evidence is given by
(5) 
For the scheme to be efficient, two conditions are required. First the parametrization of the map must allow for easy evaluation of the density which requires easily estimable Jacobian determinants and inverses. This issue has been one of the main foci in the normalizing flow literature Papamakarios et al. (2021) and is for instance solved using coupling layers Dinh et al. (2015, 2017). Second, as shown by formula (3) the proposal density must produce samples with statistical weights comparable to the posterior density to ensure appreciable acceptance rates. This requires training the map to resemble the optimal .
3 Compounding Local MCMCs and Generative Sampling
In Algorithm 1, we present a concurrent sampling/training strategy that synergistically uses to improve sampling from and samples obtained from to train . Let us describe the different components of the scheme.
Sampling. The sampling component of Algorithm 1 alternates between steps of the MetropolisHasting procedure using a NF as discussed in Section 2 and steps of a local MCMC sampler (here MALA) (line 11), using as potential
(6) 
Strictly speaking, the second transition kernel does not need to be local, it should, however, have satisfactory acceptance rates early in the training procedure to provide initial data to start up the optimization of . From a random initialization , the parametrized density has initially little overlap with the posterior and the moves proposed by the NF have a high probability to be rejected. However, thanks to the data generated by the local sampler, the training of can be initiated. As training goes on, more and more moves proposed by the NF can be accepted. It is crucial to notice that these moves, generated by pushing forward independent draws from the base distribution , are nonlocal and easily mix between modes.
Training. A standard way of training
is to minimize the KullbackLeibler divergence from
to . Since we do not have access to samples from , here we use instead(7) 
where denotes the density of the MCMC after steps and is a constant irrelevant for the optimization of . In practice, we run walkers in parallel in the chain: denoting their positions at iteration by , we use the following estimator of (7) (minus the constant) as objective for training:
(8) 
The training component of Algorithm 1
uses stochastic gradient descent on this loss function to update the parameters of the normalizing flow (line
15). Note that are not perfect samples from to start with, but their quality increases with the number of iterations of the MCMC. Note also that this training strategy is different from the one proposed in Rezende and Mohamed (2015), which uses instead the reverse KL divergence of from : this objective could be combined with the one in (8). Here we will stick to using (8) as loss, using approximate input from through the MCMC samples initialized as explained next. Müller et al. (2019) used the same forward KL divergence for training yet using a reweighing of samples from for its estimation, which may be imprecise if the initial pushforward density bears little overlap with the target .Initialization. To initialize MCMC chains, we assume that we have initial data lying in each of the important modes of the posterior , but require no additional information. That is, we take where the are initially located in these modes but not necessarily drawn from . We stress that the method therefore applies in situations where these modes have been located beforehand, for example by doing gradient descent on from random points in .
4 Numerical Experiments
4.1 Sampling Mixture of Gaussians in Highdimension
As a first test case of Algorithm 1, we sample a Gaussian mixture with 2 components in dimensions and estimate the relative statistical weights of its two modes.^{1}^{1}1In this experiment and the one that follows we used unadjusted Langevin dynamics (ULA) rather than MALA because the time steps were sufficiently small to ensure a high acceptance rate. The bottom row of Fig. 1 shows d projections of the trajectories of representative chains (in black) from initializations in each of the modes (red stars) as the NF learns to model the target density (blue contours). Running first locally under the Langevin sampler, the chains progressively mix between modes and grasp the difference of statistical weights also captured by the final map . Quantitatively, the acceptance rate of moves proposed by the NF reaches at the end of training (Fig. 1 top row). The estimator of the relative statistical weights of each modes (the right mode A is twice as likely as the left mode B) using Eq. (5) also converges to the exact value within a small statistical error (Fig. 1 middle row).
4.2 Radial Velocity of an Exoplanet
Next we apply our method to the Bayesian sampling of radial velocity parameters in a model close to the one studied by PriceWhelan et al. (2017) for a starexoplanet system. The model has 4 parameters: an offset velocity , an amplitude , a period and a phase . Introducing , the radial velocity is
(9) 
with . From a set of observations , the goal is to sample the posterior distribution over . Following PriceWhelan et al. (2017), we assume a Gaussian likelihood , with known variance , and the prior distributions
(10)  
We sample noisy observations at different times (red diamonds in Fig. 2) from a groundtruth radial velocity with parameters (dashed blue line). Using one iteration of the acceptreject Joker algorithm PriceWhelan et al. (2017) with samples from the prior distributions we obtain sets of likely parameters (corresponding to the gray lines in Fig. 2), which we will use as starting points for the MCMC chains in Algorithm 1. Note that to ensure a minimum acceptance rate of the Joker samples priors of and only, and computes the maximum likelihood value for the “linear parameters” and .
We assess the quality of sampling after iterations of Algorithm 1 on Fig. 3, looking at all the possible 2d projections of space of parameters.
The samples from Algorithm 1 (top row, in blue, final acceptance rate of , see Fig. 4) are generally covering well the modes of the marginal posterior distribution (second row), far beyond the initial samples (top row, in pink), at the exception of a light mode in the neighborhood of and in which no chain was initialized. This is an illustration of the need for prior knowledge of the rough location of a basin of interest to successful sample it with the proposed method. For comparison, we also report samples accepted by the Joker algorithm from an initial draw of samples (top row, in green). Because of the maximum likelihood step along and mentioned above, the posterior is not appropriately sampled along these two dimensions by this strategy.
5 Conclusion
Our results show that normalizing flows can be exploited to augment MCMC schemes used in Bayesian inference with a nonlocal transport component that significantly enhances mixing. By design, the method blends MCMC with an optimization component similar to that used in variational inference Blei et al. (2017), and it would be interesting to investigate further how both approaches compare on challenging applications with complex posteriors on high dimensional parameter spaces.
Acknowledgments
GMR acknowledges generous support from the Terman Faculty Fellowship. EVE acknowledges partial support from the National Science Foundation (NSF) Materials Research Science and Engineering Center Program grant DMR1420073, NSF DMS1522767, and a Vannevar Bush Faculty Fellowship.
References
 Flowbased generative models for Markov chain Monte Carlo in lattice field theory. Physical Review D 100 (3), pp. 034515. External Links: ISSN 24700010, 24700029, Document Cited by: §1.
 Smart darting Monte Carlo. Journal of Chemical Physics 114 (16), pp. 6994–7000. External Links: Document, ISSN 00219606 Cited by: §1.
 Variational Inference: A Review for Statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. External Links: ISSN 01621459, 1537274X, Document, 1601.00670 Cited by: §1, §5.
 NICE: Nonlinear independent components estimation. 3rd International Conference on Learning Representations, ICLR 2015  Workshop Track Proceedings 1 (2), pp. 1–13. External Links: 1410.8516 Cited by: §2.
 Density Estimation Using Real NVP. In International Conference on Learning Representations, pp. 32. Cited by: §A.1, §2.
 Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap. In International Conference on Machine Learning, pp. 1952–1962. External Links: ISSN 26403498 Cited by: §1.
 Adaptive Monte Carlo augmented with normalizing flows. arXiv:2105.12603 [condmat, physics:physics]. External Links: 2105.12603 Cited by: §1, §3.

SelfSupervised Learning of Generative SpinGlasses with Normalizing Flows
. arXiv:2001.00585 [condmat, physics:quantph, stat]. External Links: 2001.00585 Cited by: §1.  The Variational Formulation of the Fokker–Planck Equation. SIAM Journal on Mathematical Analysis 29 (1), pp. 1–17. External Links: ISSN 00361410, 10957154, Document Cited by: §1.

Generalizing Hamiltonian Monte Carlo with Neural Networks
. In International Conference on Learning Representations, Cited by: §1.  Monte carlo strategies in scientific computing. Springer Verlag, New York, Berlin, Heidelberg. Cited by: §1.
 Markov chains and stochastic stability. Springer Science & Business Media. Cited by: §2.
 Neural Importance Sampling. arXiv:1808.03856 [cs, stat]. External Links: 1808.03856 Cited by: §3.
 Annealed importance sampling. Statistics and Computing 11, pp. 125–139. External Links: Document Cited by: §1.

Asymptotically unbiased estimation of physical observables with neural samplers
. Physical Review E 101 (2). External Links: Document, ISSN 24700053 Cited by: §1. 
Boltzmann generators: Sampling equilibrium states of manybody systems with deep learning
. Science 365 (6457), pp. eaaw1147. External Links: ISSN 00368075, 10959203, Document Cited by: §1.  Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22 (57), pp. 1–64. Cited by: §1, §2.
 The Joker: A Custom Monte Carlo Sampler for Binarystar and Exoplanet Radial Velocity Data . The Astrophysical Journal 837 (1), pp. 20. Cited by: Figure 2, §4.2.
 Variational Inference with Normalizing Flows. In International Conference on Machine Learning, pp. 1530–1538. External Links: ISSN 19387228 Cited by: §1, §3.
 Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 (4), pp. 341–363. External Links: ISSN 13507265 Cited by: §1.

Dynamical Computation of the Density of States and Bayes Factors Using Nonequilibrium Importance Sampling
. Physical Review Letters 122 (15), pp. 150602. External Links: ISSN 00319007, 10797114, Document Cited by: §1.  Nested sampling for general Bayesian computation. Bayesian Analysis 1 (4), pp. 833–859. External Links: ISSN 19360975, Document Cited by: §1.

Generalized darting Monte Carlo.
In
Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR
, Vol. 2, pp. 516–523. External Links: Document, ISSN 00313203 Cited by: §1.  Anicemc: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §1.
 Scorebased generative modeling through stochastic differential equations. In International Conference on Learning Representations, Cited by: §1.
 A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164. External Links: Document Cited by: §1.
 Density estimation by dual ascent of the loglikelihood. Communications in Mathematical Sciences 8 (1), pp. 217–233. External Links: ISSN 15396746, 19450796, Document Cited by: §1.
 Invertible Flow Non Equilibrium sampling. arXiv:2103.10943 [stat]. External Links: 2103.10943 Cited by: §1.
 Learning Model Reparametrizations: Implicit Variational Inference by Fitting MCMC distributions. arXiv:1708.01529 [stat]. External Links: 1708.01529 Cited by: §1.
 Mode Jumping Proposals in MCMC. Scandinavian Journal of Statistics 28 (1), pp. 205–223. External Links: Document, ISSN 03036898, Link Cited by: §1.
 Solving Statistical Mechanics Using Variational Autoregressive Networks. Physical Review Letters 122 (8), pp. 080602. External Links: Document Cited by: §1.
Appendix A Computational details
All codes are made available through the Github repository anonymous (to be disclosed after doubleblind review).
a.1 Architectures
We parametrize the map as a RealNVP Dinh et al. (2017), for which inverse and Jacobian determinant can be computed efficiently. Its building block is an invertible affine coupling layer updating half of the variables,
(11) 
where and are learnable neural networks from , denotes componentwise multiplication (Hadamard product), and
indexes the depth of the network. In our experiments we use fully connected networks with depth 3, hidden layers of width 100 with ReLU activations. The parameters of these networks are initialized with random Gaussian values of small variance so that the RealNVP implements a map
close to the identity at initialization.a.2 Mixture of Gaussians experiment
Target and hyperparameters.
The target distribution is a mixture of Gaussian in dimension with two components:
(12) 
with weights , and centroids with nonzero coordinates only in the first two dimensions and . Fig. 1 displays density slices in the plane.
We use a RealNVP network with 6 pairs of coupling layers and a standard normal distribution as a base
. A set of 100 independent walkers were initialized in equal shares in modes A and B of the target. For the optimization, we compute gradients using batches of 1000 samples corresponding to 10 consecutive repeated sampling steps of Algorithm 1 (with and ). We run 4000 parameters updates using Adam with a learning rate of 0.005.Computing logevidence differences.
Interpreting as a posterior distribution over a set of parameters , we would like to estimate the relative evidence for modes and . Denoting by and two sets of configurations in corresponding to each mode, the difference between their logevidence is given by
(13) 
Once the normalizing flow has been learned to assist sampling, it can be used to approximate Eq. 5. Drawing from we have the Monte Carlo estimate of :
(14) 
with the unnormalized weights , taking the form of an importance sampling estimator. The quality of the estimator can be monitored using an estimate of the effective sample size to adjust the variance estimate from the empirical variance
(15) 
Finally the unknown cancels out in the logevidence difference:
(16) 
In the middle right panel of Fig. 1 we report the performance of this estimator using the sets and , and . The estimator convergence to the exact value as the quality of increases over training.
a.3 Radial velocity experiment
Setup.
The parameters of the prior are set to the values , , , , and . The RealNVP used to learn the posterior distribution has 6 pairs of coupling layers. The base distribution is a standard normal distribution as in the previous experiment, but the NF is learned on a whitened representation of the parameters . Concretely, using the initial samples from the Joker, we compute the empirical mean and empirical covariance
. The latter admits an eigenvalue decomposition
, with the diagonal matrix of eigenvalues andthe orthogonal eigenvectors basis. Using
, we train the normalize flow to model the distribution of the whitened parameters .Training.
A set of 110 independent walkers were initialized using 10 copies of each of set of parameters retained from an initial iteration of the Joker algorithm from draws of and . Here again we use 5 consecutive repeated sampling steps of Algorithm 1 (with and ) to compute gradients using batches of 550 samples. We run Adam for iterations with a learning rate of 0.001.
Fig. 4 reports the evolution of the approximate KL divergence along training (left) and the acceptance rates of the nonlocal moves proposed by the NF (middle). We also plot the projected trajectories of chains started with the initial Joker samples updated with combined sampling component of Algorithm 1 using the final learned . During the short chains that included 10 nonlocal resampling steps, the walkers can be seen to jump back and forth between two of the modes of the marginalized density (underlying in blue in the plot). However mixing is not well assisted with the mode around (see discussion in the main text as well).