## 1 Introduction

Estimating expectations with respect to target distributions that cannot be sampled from directly is a challenging task with many real-world applications, such as estimating equilibrium properties of physical systems governed by the Boltzmann distribution (Lelievre2010). Boltzmann generators (noe2019boltzmann), which use normalizing flows to approximate the Boltzmann distribution, are a recent approach attracting growing interest (dibak2020; kohler2021). For challenging problems, current approaches to training Boltzmann generators rely partly on samples from the target, which are used for training by maximum likelihood (wu2020stochasticNF). Such samples are obtained through expensive MD simulations (leimkuhler2015). Although flows can be trained without samples from the target, current methods for this are either mode seeking or suffer from high variance in the loss, which leads to inferior performance on challenging problems (stimper2021).

To address these challenges, we propose using the $\alpha$-divergence with $\alpha = 2$ as the training objective, which is mass covering, and employ annealed importance sampling (AIS) to bring the samples from the flow model closer to the target, reducing variance in the objective. In our experiments, we apply our method, Flow AIS Bootstrap (FAB), to a 2D Gaussian mixture distribution as well as to the “Many Well” problem and show that it outperforms competing learning algorithms.

## 2 Background

**Normalizing flows.** Given a random variable $z$ with distribution $q_z(z)$, a normalizing flow (rezende2015variational; earlyFlowPaperTabak) uses an invertible map $f$ to transform $z$, yielding the random variable $x = f(z)$ with the distribution

$$q_x(x) = q_z\!\left(f^{-1}(x)\right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right| \qquad (1)$$
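As a small illustration (not from the paper), the change-of-variables formula in (1) can be checked numerically for a simple affine map; the Gaussian base and the `scale`/`shift` values are illustrative choices:

```python
import numpy as np

# Base distribution q_z: standard normal. Invertible map f(z) = scale * z + shift.
scale, shift = 2.0, 1.0

def log_q_z(z):
    # Log density of the standard normal base distribution.
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_q_x(x):
    # Change of variables: q_x(x) = q_z(f^{-1}(x)) |d f^{-1}/dx|,
    # with f^{-1}(x) = (x - shift) / scale, so |d f^{-1}/dx| = 1/|scale|.
    z = (x - shift) / scale
    return log_q_z(z) - np.log(abs(scale))

# Compare against the known log density of N(shift, scale^2).
x = 0.7
expected = -0.5 * ((x - shift) / scale) ** 2 - 0.5 * np.log(2 * np.pi * scale**2)
assert np.isclose(log_q_x(x), expected)
```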

If we parameterize $f$ with parameters $\theta$, we can use the resulting density $q_\theta(x)$ as a model to approximate a target distribution $p(x)$. If the target density is available, the flow is usually trained by minimizing the KL divergence $D_{\mathrm{KL}}(q_\theta \,\|\, p)$, which is estimated via Monte Carlo with samples from the flow model. Alternatively, we could use the $\alpha$-divergence (zhu1995information), which is defined by

$$D_\alpha(p \,\|\, q_\theta) = \frac{1}{\alpha(\alpha - 1)} \left( \int p(x)^\alpha \, q_\theta(x)^{1-\alpha} \, dx - 1 \right) \qquad (2)$$

as an objective (campbell2021gradient; muller2019). If $\alpha = 2$, minimizing it corresponds to minimizing the variance of the importance weights $w(x) = p(x)/q_\theta(x)$. In contrast to the KL divergence, which is mode seeking, $D_{\alpha=2}$ is mass covering, which is more desirable when approximating multimodal targets. In this case, the $\alpha$-divergence can be rewritten as

$$D_{\alpha=2}(p \,\|\, q_\theta) \propto \int \frac{p(x)^2}{q_\theta(x)} \, dx \;=\; \mathbb{E}_{p}\!\left[\frac{p(x)}{q_\theta(x)}\right] \;=\; \mathbb{E}_{q_\theta}\!\left[\left(\frac{p(x)}{q_\theta(x)}\right)^{\!2}\right] \qquad (3)$$

This objective can be estimated either with samples from $p$ or from $q_\theta$. Since the integral is dominated by regions with high $p(x)$ and low $q_\theta(x)$, the estimate will exhibit higher variance if we sample from $q_\theta$ than if we sample from $p$.
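The variance gap can be seen in a small numerical sketch (illustrative, not one of the paper's experiments): estimate $\int p^2/q \, dx$ both as $\mathbb{E}_p[p/q]$ and as $\mathbb{E}_q[(p/q)^2]$ for two 1D Gaussians, where the broader Gaussian stands in for the flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

sig_p, sig_q = 1.0, 2.0  # target p = N(0, 1), proposal q = N(0, 4)

def est_from_p(n):
    # Monte Carlo estimate of E_p[p/q] with samples from the target p.
    x = rng.normal(0.0, sig_p, n)
    return np.mean(normal_pdf(x, sig_p) / normal_pdf(x, sig_q))

def est_from_q(n):
    # Monte Carlo estimate of E_q[(p/q)^2] with samples from the proposal q.
    x = rng.normal(0.0, sig_q, n)
    return np.mean((normal_pdf(x, sig_p) / normal_pdf(x, sig_q)) ** 2)

runs_p = [est_from_p(1000) for _ in range(200)]
runs_q = [est_from_q(1000) for _ in range(200)]
# Both estimate the same integral ∫ p(x)^2 / q(x) dx,
# but the estimator based on samples from q is noisier.
assert np.std(runs_q) > np.std(runs_p)
```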

**Annealed importance sampling.** AIS begins by sampling from an initial proposal distribution $p_0 = q_\theta$, being the flow in our case, and then transitions via MCMC through a sequence of intermediate distributions $p_1, \dots, p_{N-1}$ to the target $p_N = p$, to produce samples closer to the target distribution (neal2001annealed). Each transition $T_j$ is a Markov chain that leaves the intermediate distribution $p_j$ invariant. AIS conventionally returns the importance weights for the samples, which are of the form

$$w_{\mathrm{AIS}} = \prod_{j=1}^{N} \frac{\tilde{p}_j(x_j)}{\tilde{p}_{j-1}(x_j)} \qquad (4)$$

where $\tilde{p}_0 = q_\theta$ and $\tilde{p}_N = \tilde{p}$, and we indicate that the probability density functions may be unnormalized with a tilde, $\tilde{p}$. These weights exhibit variance reduction compared to their counterparts $w(x) = \tilde{p}(x)/q_\theta(x)$ computed directly with samples from the flow. Hamiltonian Monte Carlo (HMC) provides a suitable choice of transition operator for challenging problems (neal1995bayesian).

## 3 Normalizing Flow Annealed Importance Sampling Bootstrap

FAB, defined in Algorithm 1, uses $D_{\alpha=2}$ as a training objective. Furthermore, we introduce AIS into the training loop, improving the gradient estimator for minimising $D_{\alpha=2}$ by writing the loss function to train the flow as an expectation over $p$, and estimating it with the samples and importance weights generated by AIS with the flow as the initial distribution. If we plug (4) into (3), compute the expectation over $p$ through AIS, and use Jensen’s inequality, we obtain our loss

$$\mathcal{L}(\theta) = \log \sum_{i=1}^{N} \exp\!\left( \log \bar{w}_{\mathrm{AIS}}^{(i)} + \log \tilde{p}(\bar{x}^{(i)}) - \log \bar{q}(\bar{x}^{(i)}) - \log q_\theta(\bar{x}^{(i)}) \right) \qquad (5)$$

where $\theta$ are the flow’s parameters and $\bar{w}_{\mathrm{AIS}}^{(i)}$, $\bar{x}^{(i)}$ and $\bar{q}$ have blocked gradients (we only want to calculate the gradient with respect to $\log q_\theta$, and not with respect to the importance weights and samples generated by AIS, which are also a function of the flow’s parameters).
This method obtains the benefit of bootstrapping: AIS improves the flow via an improved estimate of the loss’s gradient, which is used to update the flow; this in turn improves AIS by improving its initial distribution.
The effective sample size of the trained flow model, which may be limited, e.g., by expressiveness constraints of the specific flow, can also be improved by using AIS after training.

**Algorithm 1: FAB**

1. Set target $\tilde{p}$.
2. Initialise proposal $q_\theta$, parameterized by $\theta$.
3. **for** iteration $= 1, \dots, M$ **do**
4. &nbsp;&nbsp;&nbsp;&nbsp;Generate samples $\bar{x}^{(i)}$ and importance weights $\bar{w}_{\mathrm{AIS}}^{(i)}$ via AIS with proposal $q_\theta$ and target $p$.
5. &nbsp;&nbsp;&nbsp;&nbsp;Update $\theta$ by gradient descent on the loss (5).
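The numerically stable evaluation of the loss (5) can be sketched as follows. This is an illustrative NumPy sketch: `log_w_ais`, `log_p`, and `log_q_blocked` are hypothetical names for AIS outputs whose gradients would be stopped in a real autodiff implementation, and only `log_q_theta` would carry gradient.

```python
import numpy as np

def logsumexp(a):
    # Stable log-sum-exp: subtract the max before exponentiating.
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def fab_loss(log_w_ais, log_p, log_q_blocked, log_q_theta):
    # Eq. (5): log Σ_i exp(log w̄_i + log p̃(x̄_i) - log q̄(x̄_i) - log q_θ(x̄_i)).
    # The first three arguments are treated as constants (blocked gradients);
    # only log_q_theta would be differentiated in a real implementation.
    return logsumexp(log_w_ais + log_p - log_q_blocked - log_q_theta)

# Toy check: with uniform weights and q matching p, the loss reduces to
# log N - log q, evaluated stably even for extreme log densities.
n = 4
loss = fab_loss(np.zeros(n), np.full(n, -1000.0),
                np.full(n, -1000.0), np.full(n, -1000.0))
assert np.isclose(loss, np.log(n) + 1000.0)
```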

## 4 Experiments

### 4.1 Mixture of Gaussians Problem

We begin with a simple two-dimensional mixture of Gaussians target distribution.
To estimate expectations with our proposal distribution, we set the integrand to be a toy quadratic function whose coefficients are sampled from a unit Gaussian and then fixed for the problem.
This allows us to inspect the bias and variance of estimates of the expectation of this toy function, as a further measure of performance.
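One hedged instantiation of such a toy quadratic (the exact form is not specified above; `A`, `b`, and `c` are illustrative names for the fixed Gaussian-sampled coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
# Coefficients drawn once from a unit Gaussian, then fixed for the problem.
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
c = rng.normal()

def f(x):
    """Toy quadratic test function f(x) = x^T A x + b^T x + c."""
    return x @ A @ x + b @ x + c

# Expectations E_p[f(x)] can then be estimated with (weighted) model samples.
x = rng.normal(size=d)
assert np.isfinite(f(x))
```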
We compare FAB (for all experiments, we use FAB with only 2 intermediate AIS distributions; for the transition operator between AIS distributions we use HMC with 1 outer step and 5 inner steps) to flows trained through minimisation of KLD (for brevity, we refer to a flow/SNF trained through minimisation of the KL divergence, estimated with samples from the flow/SNF, as simply being “trained with KLD”),
as well as to Stochastic Normalizing Flows (SNFs) (wu2020stochasticNF).
Like FAB, SNFs combine flows with stochastic sampling methods such as MCMC, but focus instead on improving the flow’s expressive power. The SNF is likewise trained with the KL divergence. For all models we choose real NVP (dinh2017RealNVP) as the flow architecture.

In Figure 1 we see that FAB allows us to train a flow that captures the shape of the target distribution well, while the flow and SNF trained with KLD both fail, capturing only a subset of the modes.
Table 1 shows that with FAB the trained flow may be used for accurate computation of expectations with respect to the target, while the alternative approaches yield highly biased estimates. (Effective sample size (ESS) for both experiments is calculated with samples from the flow model. Bias and standard deviation are calculated using 100 runs of 1000 samples. We calculate the mean target log likelihood using 10000 samples.)

| Model | Mean $\log p(x)$ | ESS (%) | Bias (%) | Std (%) |
|---|---|---|---|---|
| FAB | -5.2 | 70.1 (77.5) | 1.2 (0.5) | 5.8 (5.5) |
| Flow trained with KLD | -14.4 | 0.05 | 99.6 | 19.8 |
| SNF trained with KLD | N/A | 0.02 | 104.2 | 9.7 |

### 4.2 The Many Well Problem

For a more challenging problem, we test FAB against a flow trained by KLD on a 16-dimensional “Many Well” problem, which we create by repeating the Double Well Boltzmann distribution from (noe2019boltzmann) 8 times. We create a hand-crafted test set for this problem by placing a point on each of the 256 modes of the target. In Table 2 we see that FAB allows us to train a model with a far superior test-set log-likelihood and ESS compared to training a flow with KLD. In Figure 2, where we visualise a subset of the marginal distributions for pairs of dimensions, we see that the flow trained with FAB has captured the shape of the target well, while the flow trained with KLD fits only a subset of the modes.
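A sketch of how such a target can be assembled (the double-well energy below is a common illustrative form, not necessarily the exact coefficients of noe2019boltzmann):

```python
import numpy as np

def log_p_double_well(x1, x2):
    # Illustrative unnormalized 2D double well: bimodal in x1, Gaussian in x2.
    return -(x1 ** 4) + 6.0 * x1 ** 2 - 0.5 * x2 ** 2

def log_p_many_well(x):
    # 16D "Many Well": 8 independent copies of the 2D double well,
    # so the joint log density is a sum over the 8 pairs of dimensions.
    x = np.asarray(x).reshape(8, 2)
    return sum(log_p_double_well(x1, x2) for x1, x2 in x)

# Each double well has two modes (x1 ≈ ±sqrt(3)), giving 2^8 = 256 modes in 16D.
x_mode = np.tile([np.sqrt(3.0), 0.0], 8)
assert log_p_many_well(x_mode) > log_p_many_well(np.zeros(16))
```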

| Model | Test set mean $\log p(x)$ | ESS (%) |
|---|---|---|
| FAB | -14.5 | 79.6 (85.2) |
| Flow trained by KLD | -86.2 | 0.01 |

## 5 Discussion

In SNFs, MCMC is combined with normalizing flows by introducing sampling layers between the standard flow layers (nielsen2020; wu2020stochasticNF). They are usually trained with samples from the target and perform poorly when trained with samples from the flow; see Section 4.1. Instead, we use AIS to improve the training procedure. However, the two contributions are orthogonal, so SNFs could be trained with FAB as well. Similarly, FAB works with flow architectures (chen2019; grcic2021) and base distributions (papamakarios2017; stimper2021) other than those we used. In (ding2019VAE-AIS), AIS is applied in the context of variational inference to improve estimation of the marginal likelihood gradient for variational autoencoders’ decoders. This approach may be extended with FAB to also improve the encoder training, by minimising the FAB loss with respect to the parameters of the encoder, with the latent posterior as the target distribution. This would be advantageous relative to (ding2019VAE-AIS), as well as to MCMC-based approaches like (hoffman2017MCMC-VAE) for training the decoder, because if the flow-based encoder can learn a good approximation of the latent posterior, this alleviates the requirement to run MCMC for long in order to obtain good samples for the decoder.

We proposed FAB, a novel method of combining flows with AIS in a training procedure that allows them to improve each other in a bootstrapping manner. For future work we hope to scale up FAB and apply it to more challenging real-world problems, for example Boltzmann distributions of complex molecules.

GNCS and JMHL acknowledge support from a Turing AI Fellowship under grant EP/V023756/1. This work has been performed using resources operated by the University of Cambridge Research Computing Service, which is funded by the EPSRC (capital grant EP/P020259/1) and DiRAC funding from the STFC (http://www.dirac.ac.uk/). LIM acknowledges support from the Cambridge Trust, the Skype Foundation, and the Oppenheimer Memorial Trust.

## References

## Appendix A FAB derivation

FAB aims to minimise

$$D_{\alpha=2}(p \,\|\, q_\theta) \propto \int \frac{\tilde{p}(x)^2}{q_\theta(x)} \, dx \qquad (6)$$

We obtain a gradient estimator by differentiating Equation 6 with respect to the parameters $\theta$ of the flow model:

$$\nabla_\theta D_{\alpha=2} \;\propto\; \nabla_\theta \int \frac{\tilde{p}(x)^2}{q_\theta(x)} \, dx \;=\; -\int \frac{\tilde{p}(x)^2}{q_\theta(x)} \, \nabla_\theta \log q_\theta(x) \, dx \;\propto\; -\mathbb{E}_{p}\!\left[\frac{\tilde{p}(x)}{q_\theta(x)} \, \nabla_\theta \log q_\theta(x)\right] \qquad (7)$$

where we have used the fact that, since $\theta$ is independent of samples from $p$, we can move $\nabla_\theta$ inside the expectation. If we set $g(x) = \frac{\tilde{p}(x)}{q_\theta(x)} \nabla_\theta \log q_\theta(x)$, we see that Equation 7 is of the form $-\mathbb{E}_{p}[g(x)]$, so we can estimate it with AIS.
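A minimal runnable sketch of estimating such expectations over $p$ with AIS (illustrative: a 1D Gaussian stands in for the flow, a bimodal density stands in for $\tilde{p}$, and random-walk Metropolis replaces the HMC transitions used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_q(x):
    # Proposal: N(0, 3^2), standing in for the flow q_θ.
    return -0.5 * (x / 3.0) ** 2 - np.log(3.0 * np.sqrt(2 * np.pi))

def log_p_tilde(x):
    # Unnormalized bimodal target with modes near ±3.
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def ais(n_samples=2000, n_dists=8, n_mcmc=5, step=0.8):
    # Geometric intermediate distributions p_j ∝ q^(1-β_j) p̃^(β_j).
    betas = np.linspace(0.0, 1.0, n_dists + 1)
    log_pj = lambda x, b: (1 - b) * log_q(x) + b * log_p_tilde(x)
    x = rng.normal(0.0, 3.0, n_samples)       # sample from p_0 = q
    log_w = np.zeros(n_samples)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        log_w += log_pj(x, b) - log_pj(x, b_prev)  # eq. (4) weight increments
        for _ in range(n_mcmc):                    # Metropolis leaves p_j invariant
            prop = x + step * rng.normal(size=n_samples)
            accept = np.log(rng.uniform(size=n_samples)) < log_pj(prop, b) - log_pj(x, b)
            x = np.where(accept, prop, x)
    return x, log_w

x, log_w = ais()
w = np.exp(log_w - log_w.max())
ess = w.sum() ** 2 / (w ** 2).sum() / len(w)   # normalized effective sample size
assert 0.0 < ess <= 1.0
assert abs(np.average(x, weights=w)) < 0.5     # symmetric target ⇒ weighted mean ≈ 0
```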

With this goal in mind, during the training loop we generate a batch of importance weights $w_{\mathrm{AIS}}^{(i)}$ and samples $x^{(i)}$, $i = 1, \dots, N$, using AIS, with $p$ as the target distribution and $q_\theta$ as the proposal distribution. We can then obtain an importance-weighted estimate of the above gradient (up to a positive constant):

$$\nabla_\theta D_{\alpha=2} \;\approx\; -\frac{1}{N} \sum_{i=1}^{N} w_{\mathrm{AIS}}^{(i)} \, \frac{\tilde{p}(x^{(i)})}{q_\theta(x^{(i)})} \, \nabla_\theta \log q_\theta(x^{(i)}) \qquad (8)$$

We note that in Equation 7, $q_\theta$ is only differentiated through $\theta$’s contribution to the probability density function, and not via $x$ (in Equation 7, $x$ is inside an expectation over $p$, which is independent of $\theta$).
Therefore, in Equation 8 we take care to block the gradient of $x^{(i)}$ with respect to $\theta$.
We denote quantities with blocked gradients with a bar, e.g. $\bar{x}^{(i)}$.
Thus, we can train the proposal by minimising the surrogate “loss function”

$$\mathcal{L}_{\mathrm{surr}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \bar{w}_{\mathrm{AIS}}^{(i)} \, \frac{\tilde{p}(\bar{x}^{(i)})}{\bar{q}(\bar{x}^{(i)})} \, \log q_\theta(\bar{x}^{(i)}) \qquad (9)$$

taking care to block the gradients of $w_{\mathrm{AIS}}^{(i)}$ and $x^{(i)}$ with respect to $\theta$. In Equation 8, $w_{\mathrm{AIS}}^{(i)}$ is not differentiated with respect to $\theta$, so we must block its gradient, as otherwise automatic differentiation will result in an incorrect estimate of the gradient; this is because the flow model parameters participate in the calculation of both $w_{\mathrm{AIS}}^{(i)}$ and $x^{(i)}$.

To obtain a good loss function for training, it is beneficial to write the surrogate loss (Equation 9) in terms of log probabilities and log importance weights, because the importance weights and ratios of probabilities inside the expectation will have high variance. To do this, we re-write the surrogate loss as

$$\mathcal{L}_{\mathrm{surr}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \exp\!\left( \log \bar{w}_{\mathrm{AIS}}^{(i)} + \log \tilde{p}(\bar{x}^{(i)}) - \log \bar{q}(\bar{x}^{(i)}) \right) \log q_\theta(\bar{x}^{(i)}) \qquad (10)$$

Finally, we minimise the loss below, which by Jensen’s inequality is (up to gradient-blocked constants) an upper bound of a self-normalised version of Equation 10:

$$\mathcal{L}(\theta) = \log \sum_{i=1}^{N} \exp\!\left( \log \bar{w}_{\mathrm{AIS}}^{(i)} + \log \tilde{p}(\bar{x}^{(i)}) - \log \bar{q}(\bar{x}^{(i)}) - \log q_\theta(\bar{x}^{(i)}) \right) \qquad (11)$$

We can now work with log probabilities and log importance weights, and use the “logsumexp” trick to obtain a numerically stable estimate. Equation 11 is the exact surrogate loss implemented in practice for training.
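The Jensen relation between the surrogate loss and the final loss can be checked numerically (illustrative sketch: random log weights and log densities stand in for the AIS outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
log_a = rng.normal(size=n)        # stands in for log w̄_i + log p̃(x̄_i) - log q̄(x̄_i)
log_q_theta = rng.normal(size=n)  # stands in for log q_θ(x̄_i)

a = np.exp(log_a)
# Self-normalised surrogate (eq. 10, normalised): weighted mean of -log q_θ.
surrogate = np.sum(a * (-log_q_theta)) / np.sum(a)

# Final loss (eq. 11) via stable logsumexp, shifted by the θ-independent
# (gradient-blocked) constant log Σ_i a_i.
m = np.max(log_a - log_q_theta)
loss = m + np.log(np.sum(np.exp(log_a - log_q_theta - m)))
shifted = loss - np.log(np.sum(a))

# Jensen (log is concave): log of the weighted mean of 1/q_θ
# is at least the weighted mean of -log q_θ.
assert shifted >= surrogate - 1e-12
```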
