PyTorch code for Sampling in Combinatorial Spaces with SurVAE Flow Augmented MCMC
Hybrid Monte Carlo (HMC) is a powerful Markov Chain Monte Carlo method for sampling from complex continuous distributions. However, a major limitation of HMC is that it cannot be applied to discrete domains due to the lack of gradient signal. In this work, we introduce a new approach based on augmenting Monte Carlo methods with SurVAE Flows to sample from discrete distributions using a combination of neural transport methods like normalizing flows and variational dequantization, and the Metropolis-Hastings rule. Our method first learns a continuous embedding of the discrete space using a surjective map and subsequently learns a bijective transformation from the continuous space to an approximately Gaussian distributed latent variable. Sampling proceeds by simulating MCMC chains in the latent space and mapping these samples to the target discrete space via the learned transformations. We demonstrate the efficacy of our algorithm on a range of examples from statistics, computational physics and machine learning, and observe improvements compared to alternative algorithms.
The ability to draw samples from a known distribution is a fundamental computational challenge. It has applications in diverse fields like statistics, probability, and stochastic modeling, where sampling methods are useful for both estimation and inference. They are also useful within the frequentist inference framework to form confidence intervals for a point estimate. Sampling procedures are also standard in the Bayesian setting for exploring posterior distributions, obtaining credible intervals, and solving inverse problems. The workhorse algorithms in these settings are simulation-based, amongst which the Markov Chain Monte Carlo (MCMC) method (brooks2011handbook) is the most broadly used. Impressive advances have been made over the past century, both in increasing efficiency and in reducing computation costs, for sampling using Monte Carlo methods. However, problems in discrete domains still lack an efficient general-purpose sampler.
In this paper, we consider the problem of sampling from a known discrete distribution. Inspired by the recent success of deep generative models in unsupervised learning, particularly neural transport methods like normalizing flows (TabakVE10; TabakTurner13; RezendeMohamed15), we propose a new approach to designing Monte Carlo methods for sampling from a discrete distribution, based on augmenting MCMC with a transport map. Informally, let $\pi$ be a discrete target distribution and $p_Z$ be a simple source density from which it is easy to generate independent samples, e.g. a Gaussian distribution. Then a transport map $T$ from $p_Z$ to $\pi$ is such that if $z \sim p_Z$ then $T(z) \sim \pi$.
Such a transport map is useful for several reasons. First, given such a map $T$, we can generate samples from the target $\pi$. Second, these samples can be generated cheaply irrespective of the cost of evaluating $\pi$. Importantly, $T$ also lets us sample from the marginal and conditional distributions of $\pi$, given an appropriate structure. Indeed, this idea of using neural transport maps based on normalizing flows has been explored by ParnoMarzouk18 for continuous densities, by learning a diffeomorphic transformation $T$ from the source density to the target density.
In this paper, we extend this to discrete domains using the recently proposed SurVAE Flow framework (nielsen2020survae). We first learn a transport map from the discrete space $\mathcal{X}$ to a continuous space $\mathcal{Y}$ using a surjective transformation $f_\theta$. However, such a continuous space is often highly multi-modal, with unfavorable geometry for fast mixing of MCMC algorithms. Thus, we learn an additional normalizing flow $T_\phi$ that transforms this complex continuous space to a simple latent space $\mathcal{Z}$ with a density from which sampling is easy. We finally sample from the desired target distribution by generating samples from the latent space and using the learned transformations $T_\phi$ and $f_\theta$ to push forward these samples to the target space. Our complete implementation allows parallelization over multiple GPUs, improving efficiency and reducing computation time.
The main contributions of this paper are:
We present a SurVAE augmented general purpose MCMC solver for combinatorial spaces.
We propose a new learning objective compared to previous methods in normalizing flows to train a transport map that “Gaussianizes” the discrete target distribution.
The rest of the manuscript is organized as follows. We begin in §2 by presenting a brief background on normalizing flows and setting up our main problem. In §3, we provide details of our method, which consists of two parts: learning the transport map, and generating samples from the target distribution. Subsequently, in §4, we place our work in the context of other approaches to sampling in discrete spaces. Finally, in §5, we perform an empirical evaluation on a diverse suite of problems to demonstrate the efficacy of our method.
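In this section, we set up our main problem, provide key definitions and notation, and formulate the idea for an MCMC sampler augmented with normalizing flows for sampling in discrete spaces.
Let $p_Z$ and $p_X$ be two probability density functions (w.r.t. the Lebesgue measure) over the source domain $\mathcal{Z} \subseteq \mathbb{R}^d$ and the target domain $\mathcal{X} \subseteq \mathbb{R}^d$, respectively. Normalizing flows learn a diffeomorphic transformation $T: \mathcal{Z} \to \mathcal{X}$ that allows us to represent $p_X$ using $p_Z$ via the change of variables formula (Rudin87):
$$p_X(x) = p_Z(z)\,\big|\det \nabla T(z)\big|^{-1}, \qquad z = T^{-1}(x),$$
where $|\det \nabla T(z)|$ is the absolute value of the Jacobian (determinant of the derivative) of $T$. In other words, we can obtain a new random variable $x \sim p_X$ by pushing the source random variable $z \sim p_Z$ through the map $T$. When we only have access to an i.i.d. sample $x_1, \dots, x_n \sim p_X$, we can learn $T$, and thus $p_X$, through maximum likelihood estimation:
$$\hat{T} = \arg\max_{T \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ \log p_Z\big(T^{-1}(x_i)\big) - \log \big|\det \nabla T\big(T^{-1}(x_i)\big)\big| \Big],$$
where $\mathcal{F}$ is a class of diffeomorphic mappings. Conveniently, we can choose any source density to facilitate estimation, e.g. the standard normal density on $\mathbb{R}^d$ (with zero mean and identity covariance) or the uniform density over the cube $[0,1]^d$.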
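To make the change of variables formula and the maximum likelihood fit concrete, here is a minimal one-dimensional sketch. It assumes a hypothetical affine family $T(z) = \mu + \sigma z$ with a standard normal source, for which the maximum likelihood estimate is available in closed form; the function names and the synthetic data are illustrative only and are not taken from the paper's code.

```python
import math
import random


def std_normal_logpdf(z):
    """Log-density of the standard normal source p_Z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_logpdf(x, mu, log_sigma):
    """log p_X(x) for x = T(z) = mu + exp(log_sigma) * z with z ~ N(0, 1).

    Change of variables: log p_X(x) = log p_Z(T^{-1}(x)) - log|T'(z)|,
    and here log|T'(z)| = log_sigma.
    """
    z = (x - mu) * math.exp(-log_sigma)  # T^{-1}(x)
    return std_normal_logpdf(z) - log_sigma


def fit_affine_flow(data):
    """Closed-form maximum likelihood estimate within the affine family."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, 0.5 * math.log(var)


def average_log_likelihood(data, mu, log_sigma):
    return sum(flow_logpdf(x, mu, log_sigma) for x in data) / len(data)


# Synthetic i.i.d. sample from an (assumed) target N(3, 2^2).
rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(5000)]
mu_hat, log_sigma_hat = fit_affine_flow(data)
ll_fitted = average_log_likelihood(data, mu_hat, log_sigma_hat)
ll_identity = average_log_likelihood(data, 0.0, 0.0)  # T = identity map
```

In practice the class $\mathcal{F}$ is a neural flow trained by stochastic gradient ascent on the same objective; the affine family is used here only because its optimum can be written down directly.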
This “push-forward” idea has played an important role in optimal transport theory (Villani08) and has been used successfully for Monte Carlo simulations. For example, MarzoukMPS16; ParnoMarzouk18; PeherstorferMarzouk18; hoffman2019neutra; albergo2019flow have used normalizing flows for continuous random variables to address a limitation of HMC, whose chains mix slowly between distant states when the geometry of the target density is unfavourable.
Specifically, ParnoMarzouk18 addressed this problem of sampling in a space with difficult geometry by learning a diffeomorphic transport map $T$ that transforms the original random variable to another random variable with a simple distribution. Concretely, suppose we want to sample from $p_X(x)$ where $x \in \mathbb{R}^d$. We can proceed by learning a diffeomorphic map $T$ such that $x = T(z)$, where the induced density $p_Z(z) = p_X(T(z))\,|\det \nabla T(z)|$ has a simple geometry amenable to efficient MCMC sampling. Thus, samples can be generated from $p_X$ by running an MCMC chain in the z-space and pushing these samples onto the x-space using $T$. The transformation $T$ can be learned by minimizing the KL-divergence between a fixed distribution with simple geometry in the z-space, e.g. a standard Gaussian, and $p_Z$ above. The learning phase attempts to ensure that $p_Z$ is close to the fixed distribution with easy geometry so that MCMC sampling is efficient. A limitation of these samplers augmented with diffeomorphic transformations is that they are restricted to continuous random variables. This is primarily because the flow models used in these works can only learn density functions over continuous variables.
uria2013rnade introduced the concept of dequantization to extend normalizing flows to discrete random variables. They consider the problem of estimating the discrete distribution $P(x)$ given samples $x_1, \dots, x_n$ by “lifting” the discrete space $\mathcal{X}$ to a continuous one $\mathcal{Y}$, filling the gaps in the discrete space with a uniform distribution, i.e. $y \in \mathcal{Y}$ is such that $y = x + u$ where $x \in \mathcal{X}$ and $u \sim \mathrm{Uniform}[0,1)^d$. Subsequently, they learn the continuous distribution $p(y)$ over $\mathcal{Y}$ using a normalizing flow by maximizing the log-likelihood of the continuous model. theis2015note showed that maximizing the likelihood of the continuous model is equivalent to maximizing a lower bound on the log-likelihood of a certain discrete model on the original discrete data. Thus, this learning procedure cannot lead to the continuous model degenerately collapsing onto the discrete data, because its objective is bounded above by the log-likelihood of a discrete model. ho2019flow++ extended the idea of uniform dequantization to propose variational dequantization, in which $u$ is drawn from a distribution $q(u \mid x)$ on $[0,1)^d$ instead of the uniform distribution. They learn the dequantizing distribution $q(u \mid x)$ from data by treating it as a variational distribution. They subsequently estimate $P(x)$ by optimizing the following lower bound:
$$\log P(x) \;\ge\; \mathbb{E}_{u \sim q(u \mid x)}\big[\log p(x + u) - \log q(u \mid x)\big]. \tag{1}$$
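The lower-bound property of dequantization can be checked numerically in one dimension. The sketch below assumes a hypothetical continuous model $p(y) = \mathcal{N}(y; 2.3, 0.5^2)$ and uniform dequantization noise, so the implied discrete model is $P(x) = \int_0^1 p(x+u)\,du$, and Jensen's inequality gives $\mathbb{E}_u[\log p(x+u)] \le \log P(x)$. The model, its parameters and the function names are assumptions chosen for illustration.

```python
import math
import random

MU, SIGMA = 2.3, 0.5  # hypothetical continuous model p(y) = N(y; MU, SIGMA^2)


def normal_logpdf(y):
    z = (y - MU) / SIGMA
    return -0.5 * z * z - math.log(SIGMA) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(y):
    return 0.5 * (1.0 + math.erf((y - MU) / (SIGMA * math.sqrt(2.0))))


def discrete_log_prob(x):
    """Exact log P(x), where P(x) is the mass p assigns to the cell [x, x + 1)."""
    return math.log(normal_cdf(x + 1) - normal_cdf(x))


def uniform_dequant_bound(x, n_samples, rng):
    """Monte Carlo estimate of E_u[log p(x + u)] with u ~ Uniform[0, 1).

    For uniform noise log q(u | x) = 0, so this is the dequantization bound.
    """
    total = 0.0
    for _ in range(n_samples):
        total += normal_logpdf(x + rng.random())
    return total / n_samples


rng = random.Random(0)
exact = discrete_log_prob(2)
bound = uniform_dequant_bound(2, 20000, rng)
```

The gap between `exact` and `bound` is what a learned dequantizer $q(u \mid x)$ can shrink: the bound is tight when $q(u \mid x)$ matches the model's conditional density of $u$ within the cell.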
Recently, nielsen2020survae introduced SurVAE Flows, which extend the framework of normalizing flows to include surjective and stochastic transformations for probabilistic modelling. In the SurVAE Flow framework, dequantization can be seen as a rounding surjective map $f_\theta: \mathcal{Y} \to \mathcal{X}$ with parameters $\theta$ such that $P(x) = \int P(x \mid y)\, p(y)\, dy$, where the forward transformation is a discrete surjection $P(x \mid y) = \mathbb{I}(y \in \mathcal{B}(x))$, for $\mathcal{B}(x) = \{x + u \mid u \in [0,1)^d\}$. The inverse transformation $q_\theta(y \mid x)$ is stochastic with support in $\mathcal{B}(x)$. Thus, these ideas of using dequantization to learn a discrete probability distribution can be viewed as learning a transformation $f_\theta$ that connects a discrete space $\mathcal{X}$ to a continuous space $\mathcal{Y}$.
In this paper, we study the problem of sampling in discrete spaces, i.e. let $x$ be a discrete random variable in $d$ dimensions with (potentially unnormalized) probability mass function $\pi(x)$. Given access to a function that can compute $\pi(x)$, we aim to generate samples $x \sim \pi$. The aforementioned works on normalizing flows for discrete data are not directly applicable in this regime. This is because uria2013rnade; ho2019flow++ and nielsen2020survae estimate $P(x)$ given samples from the original distribution by maximizing the log-likelihood of a discrete model. However, we only have access to the function $\pi$, so the learning method given in Equation 1 cannot be used in our setting. In the next section, we address this by extending the ideas of neural transport MCMC (ParnoMarzouk18; hoffman2019neutra; albergo2019flow) and leveraging the framework of SurVAE Flows (nielsen2020survae).
We discussed in Section 2 the utility of normalizing flows augmented Monte Carlo samplers for overcoming the difficulties posed by unfavourable geometry in the target continuous distribution, as evidenced by the works of ParnoMarzouk18; hoffman2019neutra; albergo2019flow. We will now introduce our method of SurVAE Flow augmented MCMC for sampling in combinatorial spaces.
Informally, our method proceeds as follows. We first define a rounding surjective transformation $f_\theta: \mathcal{Y} \to \mathcal{X}$ with parameters $\theta$ such that $\mathcal{Y}$ is a continuous embedding of the space $\mathcal{X}$ with density $p_\theta(y)$. Since the continuous embedding may be highly multi-modal, with potentially unfavourable geometry for efficient MCMC sampling, we define an additional diffeomorphic transformation $T_\phi$, with parameters $\phi$, from a simple latent space $\mathcal{Z}$ to $\mathcal{Y}$. Subsequently, we learn these transformations via maximum likelihood estimation. This concludes the learning phase of our algorithm. Finally, we generate samples from $\pi$ by running MCMC chains to generate samples from the learned distribution over $\mathcal{Z}$ and pushing forward these samples to the space $\mathcal{X}$ using $T_\phi$ and $f_\theta$. We elaborate on each of these steps below.
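$f_\theta$ is a surjective transformation that takes as input a real-valued vector $y \in \mathbb{R}^d$ and returns the rounded value of $y$, i.e. $x = f_\theta(y) = \lfloor y \rfloor$. Thus, the forward transformation from $\mathcal{Y}$ to $\mathcal{X}$ is deterministic. The inverse transformation from $\mathcal{X}$ to $\mathcal{Y}$ is, however, stochastic, since the random variable $y \in \{x + u \mid u \in [0,1)^d\}$ is given by $y = x + u$ where $u \sim q_\theta(u \mid x)$. Given access to $q_\theta$, we can evaluate the density at a point $y$ exactly as:
$$p_\theta(y) = \pi(x)\, q_\theta(u \mid x), \tag{2}$$
since $x = \lfloor y \rfloor$ and $u = y - x$ are uniquely determined by $y$. Thus, we need to learn $q_\theta$ in order to fully specify this surjective transformation. Since $q_\theta$ can be an arbitrary continuous density, we learn it using a normalizing flow, i.e. we learn a diffeomorphic transformation $g_\theta$ with $u = g_\theta(\epsilon; x)$ where $\epsilon$ is standard Gaussian distributed. Under this setup, we get
$$q_\theta(u \mid x) = \mathcal{N}(\epsilon; 0, I)\,\big|\det \nabla_\epsilon\, g_\theta(\epsilon; x)\big|^{-1}. \tag{3}$$
Thus, learning the surjective transformation $f_\theta$ is equivalent to learning $q_\theta$, which reduces to learning a flow $g_\theta$. For brevity, we use the informal notation that the transformation $f_\theta$ quantizes the continuous space $\mathcal{Y}$ to $\mathcal{X}$. Finally, using Equation 3, the density in Equation 2 for a fixed $y$ can be written as:
$$p_\theta(y) = \pi(x)\, \mathcal{N}(\epsilon; 0, I)\,\big|\det \nabla_\epsilon\, g_\theta(\epsilon; x)\big|^{-1}, \qquad \epsilon = g_\theta^{-1}(u; x). \tag{4}$$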
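The rounding surjection and its stochastic inverse can be illustrated with a small one-dimensional sketch. It assumes a toy target on $\{0, 1, 2, 3\}$ and, in place of a learned flow, a fixed hypothetical dequantizer with density $q(u \mid x) = 2u$ on $[0,1)$; all names and numbers are illustrative. The density of the lifted variable is the product of the target mass of the rounded point and the dequantizer density of the fractional part.

```python
import math
import random

# Toy target pmf pi(x) on {0, 1, 2, 3} (normalized here only to ease checking).
PMF = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}


def q_density(u, x):
    """Hypothetical dequantizer density q(u | x) = 2u on [0, 1)."""
    return 2.0 * u if 0.0 <= u < 1.0 else 0.0


def q_sample(x, rng):
    """Inverse-CDF sampling for the density 2u: u = sqrt(v), v ~ Uniform[0, 1)."""
    return math.sqrt(rng.random())


def round_down(y):
    """Forward (deterministic) direction of the surjection: x = floor(y)."""
    return math.floor(y)


def lift(x, rng):
    """Inverse (stochastic) direction: y = x + u with u ~ q(u | x)."""
    return x + q_sample(x, rng)


def density_y(y):
    """Density of the continuous embedding: pi(floor(y)) * q(y - floor(y) | floor(y))."""
    x = round_down(y)
    return PMF.get(x, 0.0) * q_density(y - x, x)


def integrate_density(cells_per_unit=1000):
    """Midpoint rule over [0, 4); the embedding density should integrate to 1."""
    h = 1.0 / cells_per_unit
    return sum(density_y((k + 0.5) * h) for k in range(4 * cells_per_unit)) * h


rng = random.Random(0)
round_trips = [(x, round_down(lift(x, rng))) for x in PMF for _ in range(1000)]
total_mass = integrate_density()
```

Rounding any lifted point returns the original discrete value, which is why samples of $y$ can be mapped back to samples of $x$ without any correction step.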
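The density $p_\theta(y)$ over the continuous embedding of $\mathcal{X}$ learned above can have arbitrary geometry and may not be efficient for MCMC sampling. Thus, we next learn a diffeomorphic transformation $T_\phi$ from a latent space $\mathcal{Z}$ to $\mathcal{Y}$ such that, using the change of variables formula, we get:
$$p_{\theta,\phi}(z) = p_\theta\big(T_\phi(z)\big)\,\big|\det \nabla_z T_\phi(z)\big|.$$
Our complete model, therefore, consists of the transformations $T_\phi$ and $f_\theta$ that together map $\mathcal{Z}$ to $\mathcal{X}$. The next challenge is to learn $\theta$ and $\phi$ such that the induced density $p_{\theta,\phi}(z)$ has a simple geometry for efficient MCMC sampling. We achieve this by forcing $p_{\theta,\phi}(z)$ to be close to a standard Gaussian by minimizing the KL-divergence between $p_0$ and $p_{\theta,\phi}$, where $p_0 = \mathcal{N}(0, I)$.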
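Concretely, let $\mathcal{L}(\theta, \phi) = \mathrm{KL}\big(p_0 \,\|\, p_{\theta,\phi}\big) = \mathbb{E}_{z \sim p_0}\big[\log p_0(z) - \log p_{\theta,\phi}(z)\big]$. $\mathcal{L}$ can be approximated with the empirical average by generating i.i.d. samples $z_1, \dots, z_n \sim p_0$, giving:
$$\mathcal{L}(\theta, \phi) \approx \frac{1}{n} \sum_{i=1}^{n} \big[\log p_0(z_i) - \log p_{\theta,\phi}(z_i)\big].$$
Dropping the term $\log p_0(z_i)$, which does not depend on $\theta$ or $\phi$, we thus arrive at the following optimization problem:
$$\min_{\theta, \phi} \; -\frac{1}{n} \sum_{i=1}^{n} \Big[\log \pi(x_i) + \log q_\theta(u_i \mid x_i) + \log \big|\det \nabla_z T_\phi(z_i)\big|\Big],$$
where $y_i = T_\phi(z_i)$, $x_i = \lfloor y_i \rfloor$, and $u_i = y_i - x_i$.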
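As a rough illustration of this kind of Monte Carlo objective, the sketch below evaluates it in one dimension under strong simplifying assumptions: the latent-to-continuous map is a hypothetical affine map $T(z) = a + bz$, the dequantizer is fixed to be uniform (so its log-density is zero), and the target is a toy unnormalized discretized Gaussian. None of these choices come from the paper; the sketch only shows how the terms of the objective are assembled from latent Gaussian samples, not how the flows are trained.

```python
import math
import random


def log_target(x):
    """Toy unnormalized log pi(x): a discretized Gaussian centred at 10."""
    return -((x - 10) ** 2) / 18.0


def log_q_uniform(u, x):
    """Log-density of a uniform dequantizer on [0, 1); an assumption of this sketch."""
    return 0.0


def objective(a, b, zs):
    """Monte Carlo estimate of the KL objective, up to an additive constant.

    For each latent sample z: y = T(z) = a + b*z, x = floor(y), u = y - x,
    and the summand is log pi(x) + log q(u | x) + log|dT/dz|.
    """
    total = 0.0
    for z in zs:
        y = a + b * z
        x = math.floor(y)
        u = y - x
        total += log_target(x) + log_q_uniform(u, x) + math.log(abs(b))
    return -total / len(zs)


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # z_i ~ N(0, 1)
loss_good = objective(10.5, 3.0, zs)  # map roughly matched to the target
loss_bad = objective(0.0, 1.0, zs)    # identity map, far from the target's mass
```

A map that places the latent Gaussian's mass on the high-probability region of the target attains a lower objective value than one that does not, which is the signal an optimizer would follow when fitting expressive flows.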
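Learning $\theta$ and $\phi$ results in a density $p_{\theta,\phi}(z)$ that is approximately Gaussian, with a landscape amenable to efficient MCMC sampling. Thus, our sampling phase consists of running an MCMC sampler of choice with target density $p_{\theta,\phi}(z)$, resulting in samples $z_1, \dots, z_m$. We can finally obtain samples $x_j \sim \pi$ as $x_j = f_\theta(T_\phi(z_j)) = \lfloor T_\phi(z_j) \rfloor$.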
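The sampling phase can be sketched with a random-walk Metropolis-Hastings chain in a one-dimensional latent space. The sketch assumes a fixed affine map $T(z) = 10.5 + 3z$ standing in for the learned flow, a uniform dequantizer, and a toy unnormalized discretized Gaussian target; these are illustrative stand-ins rather than the paper's trained model. The chain targets the induced latent density and each state is pushed forward by rounding.

```python
import math
import random

A, B = 10.5, 3.0  # hypothetical stand-ins for a learned latent-to-continuous map


def log_target(x):
    """Toy unnormalized log pi(x): discretized Gaussian with mean 10, variance ~9."""
    return -((x - 10) ** 2) / 18.0


def to_x(z):
    """Push a latent point forward: y = A + B*z, then round down to x."""
    return math.floor(A + B * z)


def log_latent_density(z):
    """log p(z) = log pi(floor(T(z))) + log q(u | x) + log|dT/dz|, with log q = 0."""
    return log_target(to_x(z)) + math.log(abs(B))


def metropolis_hastings(n_steps, step_size, rng, z0=0.0):
    """Random-walk Metropolis-Hastings in z-space, returning the pushed-forward x's."""
    z, logp = z0, log_latent_density(z0)
    xs = []
    for _ in range(n_steps):
        proposal = z + step_size * rng.gauss(0.0, 1.0)
        logp_prop = log_latent_density(proposal)
        # Symmetric proposal, so the acceptance probability is min(1, p'/p).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            z, logp = proposal, logp_prop
        xs.append(to_x(z))
    return xs


rng = random.Random(0)
samples = metropolis_hastings(20000, 1.0, rng)
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
```

Because the latent density here is close to a standard Gaussian, a plain random-walk proposal with unit step size already mixes well; a more sophisticated sampler could be substituted for the proposal step without changing the push-forward.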
We end this section with two important remarks.
In our method, it is possible to bypass the last step of sampling in the z-space using MCMC entirely, since the trained flow is a generative model that can be used directly to generate samples from $\pi$. This can indeed be done if the learned transformations $T_\phi$ and $f_\theta$ result in the density $p_{\theta,\phi}(z)$ being exactly Gaussian. Thus, there is a natural trade-off. We can spend only enough computation to train a flow whose latent density $p_{\theta,\phi}(z)$ is suitable for fast mixing of the MCMC chain, and then generate samples from this non-Gaussian density using any sampler of choice. Alternatively, we can spend a larger amount of compute to learn a flow that perfectly Gaussianizes the target density $\pi$; in that case, we can sample directly by drawing from a Gaussian in the z-space and using the learned transformations to obtain samples from $\pi$.
As mentioned in Section 2, we can use any density $p_0$ instead of a Gaussian for training the transformations $T_\phi$ and $f_\theta$. The main motivation of our method is to “push forward” $\pi$ onto a space that is amenable to efficient sampling. An interesting direction for future work is to devise learning objectives that explicitly drive the learned density $p_{\theta,\phi}(z)$ to have a simple geometry that favours “off-the-shelf” samplers.
In Section 3 we introduced a normalizing flow augmented MCMC sampler for combinatorial spaces. Our method combines a surjective transformation that learns a continuous relaxation of the discrete space with a normalizing flow that maps this continuous space to a simpler latent space for easy MCMC sampling. In this section, we put both of these ideas, continuous relaxations of discrete spaces and neural transport methods for efficient MCMC sampling, into perspective with existing work.
As we discussed briefly in Section 2, the neural transport augmented sampler for continuous variables has been successfully used by ParnoMarzouk18, PeherstorferMarzouk18, hoffman2019neutra and albergo2019flow. A subtle difference between these works and ours, apart from the major difference that the aforementioned works are applicable only to continuous domains, is the method of training the transport map itself. These methods train the transport map (or normalizing flow) by minimizing the Kullback-Leibler divergence between the target density and the density obtained by pushing forward a standard normal distribution through the map. albergo2019flow additionally use the reverse KL-divergence between the target density and the approximate density for training. We, on the other hand, learn the transformations by minimizing the KL-divergence in the latent space, since we only have access to a lower bound of the discrete density (cf. Equation 1).
The idea of relaxing the constraint that random variables of interest only take discrete values has been used extensively in combinatorial optimization (pardalos1996continuous). Such an approach is attractive since the continuous space endows the function with gradient information, contours, and curvature that can better inform optimization algorithms. Surprisingly, though, this idea has not received much attention in the MCMC setting. Perhaps the closest work to ours is that of zhang2012continuous, who use the Gaussian Integral Trick (hubbard1959calculation) to transform discrete-variable undirected models into fully continuous systems, where they perform HMC for inference and for evaluating the normalization constant. The Gaussian Integral Trick used by zhang2012continuous can be viewed as specifying a fixed map from the discrete space to an augmented continuous space. However, this can result in a continuous space that is highly multi-modal and not amenable to efficient sampling.
nishimura2020discontinuous, on the other hand, map to a continuous space using uniform dequantization (uria2013rnade), i.e. filling the space between points in the discrete space with uniform noise, inducing parameters with piecewise constant densities. (In our method, this corresponds to fixing the dequantizing distribution to be uniform.) They further propose a Laplace distribution for the momentum variables when dealing with discontinuous targets and argue that this is more effective than the Gaussian distribution. This work relies on the key theoretical insight of pakman2013auxiliary that Hamiltonian dynamics with a discontinuous potential energy function can be integrated explicitly near the discontinuity such that the total energy is preserved. pakman2013auxiliary used this to propose a sampler for binary distributions, which was later extended to handle more general discontinuities by afshar2015reflection. dinh2017probabilistic also used this idea in settings where the parameter space involves phylogenetic trees. A major limitation of pakman2013auxiliary, afshar2015reflection and dinh2017probabilistic is that these methods do not represent a general-purpose solver, since they encounter computational issues when dealing with complex discontinuities. Similarly, the work of nishimura2020discontinuous requires an integrator that works component-wise and is prohibitively slow for high-dimensional problems. Furthermore, it is not clear whether the Laplace momentum based sampler leads to efficient exploration of the continuous space, which is highly multi-modal due to uniform dequantization.
A main limitation of embedding discrete spaces into continuous ones is that the embedding can destroy natural topological properties of the space under consideration, e.g. the space of trees, partitions, or permutations. titsias2017hamming proposed an alternative approach called the Hamming ball sampler, based on informed proposals that are obtained by augmenting the discrete space with auxiliary variables and performing Gibbs sampling in this augmented space. However, potentially strong correlations between the auxiliary variables and the chain state can severely slow down convergence. zanella2020informed tried to address this problem by introducing locally balanced proposals that incorporate local information about the target distribution. This framework was later called a Zanella process by power2019accelerated. They used the insights in (zanella2020informed) to build efficient, continuous-time, non-reversible algorithms by exploiting the structure of the underlying space through symmetries and group-theoretic notions. This helps them to build locally informed proposals for improved exploration of the target space. However, their method requires explicit knowledge of the underlying properties of the target space to be encoded in the sampler, which can be problematic.
We now present experimental results for our SurVAE Flow augmented MCMC on problems covering a range of discrete models applied in statistics, physics, and machine learning. These include a synthetic example of a discretized Gaussian Mixture Model, Ising model for denoising corrupted MNIST, quantized logistic regression on four real world datasets, and Bayesian variable selection for high-dimensional data.
We compare our model to two baselines: a random walk Metropolis-Hastings algorithm and Gibbs sampling. We further compare to discrete HMC (dHMC) (nishimura2020discontinuous), for which we use the original implementation released by the authors, which is not parallelizable and is implemented using the NumPy package (harris2020array) and R. In contrast, we implemented our method and the other baselines in PyTorch (NEURIPS2019_9015), all parallelized to use multiple GPUs for fast and efficient computation.
For each experiment, we train our model to learn and for a fixed number of iterations with a batch-size of 128, learning rate of and optimize using the Adam optimizer. We run 128 parallel chains of Metropolis-Hastings algorithm for steps each with no burn-in since our learned flow already provides a good initialization for sampling. For fair comparison, we follow a similar setup for the baselines i.e. we run 128 parallel chains for each method for steps. However, we use different burn-in periods to compensate for the training time used by the flow in our method. For dHMC (nishimura2020discontinuous), we follow the exact setup as used by authors in their publicly released code to generate samples.
We compare the efficiency of all these models by reporting the effective sample size (ESS), the ESS per minute, and the accuracy or (unnormalized) log-probability of the samples generated by each sampler for the corresponding downstream task. For fair comparison, the ESS per minute reported for our method includes the training time of the flow. The aims of our experimental evaluation are two-fold. First, we aim to demonstrate the efficacy of our flow augmented MCMC over a broad range of applications. In particular, we want to demonstrate that flow-based generative models present an attractive methodology for sampling from a discrete distribution by modelling the distribution in a continuous space and sampling in a simpler latent space. Thus, in our experiments we have restricted ourselves to the basic Metropolis-Hastings (MH) algorithm for sampling, to isolate the advantages of learning a transport map from the discrete space to a simple latent space. A more sophisticated sampler like HMC (duane1987hybrid; neal2011mcmc) could potentially give even better results. Second, most samplers (and especially discrete space samplers) are inefficient and/or prohibitively slow for high-dimensional problems. Thus, through our experiments we want to demonstrate that, by leveraging the ability of flows to learn high-dimensional distributions, our sampler is both efficient and fast, with significantly better results than the alternative methods for high-dimensional problems.
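For readers unfamiliar with the metrics, the following sketch shows one common way to estimate the effective sample size of a scalar chain, $\mathrm{ESS} = n / (1 + 2\sum_k \rho_k)$, with the autocorrelation sum truncated at the first non-positive lag. The text does not state which ESS estimator was used, so this particular estimator, the truncation rule, and the helper names are assumptions made for illustration.

```python
import random


def autocorrelation(chain, lag, mean, var):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    acc = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return acc / (n * var)


def effective_sample_size(chain, max_lag=1000):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first non-positive autocorrelation,
    a simple heuristic; other truncation rules are also in common use.
    """
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain) / n
    if var == 0.0:
        return 0.0  # a chain that never moves gives no usable estimate
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = autocorrelation(chain, lag, mean, var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau


def ess_per_minute(ess, sampling_seconds, training_seconds=0.0):
    """ESS divided by total wall-clock minutes, optionally including training time."""
    return ess / ((sampling_seconds + training_seconds) / 60.0)


# Synthetic AR(1) chain with correlation 0.9; in theory ESS is about n * 0.1 / 1.9.
rng = random.Random(0)
ar_chain = [0.0]
for _ in range(9999):
    ar_chain.append(0.9 * ar_chain[-1] + rng.gauss(0.0, 1.0))
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(10000)]

ess_ar = effective_sample_size(ar_chain)
ess_iid = effective_sample_size(iid_chain)
```

Passing a non-zero `training_seconds` mirrors the convention of charging the flow's training time to the sampler when reporting ESS per minute.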
We first illustrate our method on a 2D toy problem. We define a set of target distributions by discretizing Gaussian mixture models with 5 components using a discretization level of 6 bits. In Fig. 1 we compare samples from the target distributions, samples from the approximation learned by the flow and samples from the MCMC chain. We observe that the flow gives an initial rough approximation to the target density – thus greatly simplifying the geometry for the MCMC method. Next, the samples from the MCMC chain are indistinguishable from true samples from the target distribution. This also highlights the trade-off we described in Remark 1 in Section 3.
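A target of this kind can be written down in a few lines. The sketch below builds an unnormalized log-probability over a $64 \times 64$ grid (6 bits per dimension) from a 5-component Gaussian mixture; the component means, the shared standard deviation, and the choice of evaluating the mixture at bin centres are all hypothetical, since the text does not specify the mixtures used.

```python
import math

BITS = 6
LEVELS = 2 ** BITS  # 64 grid values per dimension

# Hypothetical 5-component isotropic mixture on the unit square.
MEANS = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.8, 0.8)]
SIGMA = 0.08


def log_target(x):
    """Unnormalized log pi(x) for a grid point x = (i, j) with 0 <= i, j < 64."""
    i, j = x
    if not (0 <= i < LEVELS and 0 <= j < LEVELS):
        return -math.inf  # outside the discretization grid
    # Evaluate the mixture at the centre of the bin in [0, 1)^2.
    cx, cy = (i + 0.5) / LEVELS, (j + 0.5) / LEVELS
    total = 0.0
    for mx, my in MEANS:
        sq_dist = (cx - mx) ** 2 + (cy - my) ** 2
        total += math.exp(-0.5 * sq_dist / SIGMA ** 2)
    return math.log(total)


# Exhaustive normalization is feasible here because the grid has only 4096 cells.
grid = [(i, j) for i in range(LEVELS) for j in range(LEVELS)]
weights = [math.exp(log_target(x)) for x in grid]
normalizer = sum(weights)
probs = {x: w / normalizer for x, w in zip(grid, weights)}
mode = max(probs, key=probs.get)
```

In two dimensions the exact probabilities are available by enumeration, which is what makes a toy problem like this convenient for visually comparing true samples, flow samples and MCMC samples.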
In this section, we illustrate the application of discrete samplers to undirected graphs by considering the removal of noise from a binary MNIST image. We take an image from the MNIST dataset and binarize it, so that the true uncorrupted image is described by a binary vector. We obtain a corrupted image by taking this unknown noise-free image and randomly flipping the value of each pixel with probability 0.1. Given this corrupted image, our goal is to recover the original noise-free image.
| Sampler | ESS | ESS/min | Log-probability (unnormalized) |
|---|---|---|---|
| Flow + MH (iters = 1k) | 200.08 ± 8.09 | 7.94 | 4017 ± 11.83 |
| Flow + MH (iters = 2k) | 334.4 ± 13.5 | 13.20 | 4033 ± 11.25 |
| Flow + MH (iters = 5k) | 800.1 ± 47.3 | 31.62 | 4050.57 ± 10.11 |
| Flow + MH (iters = 10k) | 878.4 ± 36.2 | 33.44 | 4050.67 ± 11.36 |
| Gibbs (burn-in = 10k) | 2.47 ± 0.12 | 0.03 | 4013 ± 38.45 |
| Gibbs (burn-in = 20k) | 3.04 ± 0.16 | 0.02 | 4031.69 ± 24.75 |
| Gibbs (burn-in = 50k) | 13.58 ± 0.05 | 0.05 | 4049.01 ± 13.05 |
| Gibbs (burn-in = 100k) | 17.51 ± 0.11 | 0.03 | 4057.80 ± 8.44 |
| discrete MH (burn-in = 20k) | 38.5 ± 2.19 | 0.32 | 3925.98 ± 57.23 |
| discrete MH (burn-in = 50k) | 4.08 ± 0.15 | 0.02 | 4027.57 ± 20.26 |
| discrete MH (burn-in = 100k) | 3.53 ± 0.26 | 0.01 | 4040.65 ± 16.89 |
| dHMC | 52.34 ± 3.77 | 0.02 | 4036.19 ± 17.42 |
Following (bishop2006pattern, Section 8.3), we solve this by formulating it as an Ising model and using Monte Carlo samplers to sample with an energy function of the form used there:
$$E(x, y) = h \sum_i x_i \;-\; \beta \sum_{\{i,j\}} x_i x_j \;-\; \eta \sum_i x_i y_i,$$
where $x_i$ and $y_i$ are the $i$-th pixel in the sample under consideration and in the corrupted image, respectively, and the second sum runs over pairs of neighbouring pixels. We train the flow for between 1k and 10k iterations and run the baselines with burn-in periods between 10k and 100k steps, as listed in the table. We report the results in Figure 2. The results show that Flow augmented MCMC significantly outperforms the other samplers on ESS while remaining competitive on the downstream task. This suggests that it can disentangle the complex correlations present in high-dimensional problems by mapping them to a latent space, whereas coordinate-wise samplers are not able to handle these correlations efficiently.
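The Ising formulation can be made concrete with a short sketch of the energy from Bishop's Section 8.3, $E(x, y) = h\sum_i x_i - \beta\sum_{\{i,j\}} x_i x_j - \eta\sum_i x_i y_i$, with pixels in $\{-1, +1\}$ and 4-connected neighbours. The coefficient values below are hypothetical, and it is an assumption of this sketch that the unnormalized log-probability handed to a sampler is simply the negative energy.

```python
def ising_energy(x, y, h=0.0, beta=1.0, eta=2.0):
    """Energy of a candidate image x given the noisy image y.

    x, y: 2D lists with entries in {-1, +1}. The coefficients h, beta, eta
    are hypothetical values chosen for illustration.
    """
    rows, cols = len(x), len(x[0])
    bias = sum(sum(row) for row in x)
    data = sum(x[r][c] * y[r][c] for r in range(rows) for c in range(cols))
    pair = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right neighbour
                pair += x[r][c] * x[r][c + 1]
            if r + 1 < rows:  # bottom neighbour
                pair += x[r][c] * x[r + 1][c]
    return h * bias - beta * pair - eta * data


def log_target(x, y, **coeffs):
    """Unnormalized log-probability of x: the negative energy."""
    return -ising_energy(x, y, **coeffs)


# A 3x3 all-white noisy image, and a candidate with the centre pixel flipped.
noisy = [[1, 1, 1] for _ in range(3)]
clean_guess = [[1, 1, 1] for _ in range(3)]
flipped_guess = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]

energy_clean = ising_energy(clean_guess, noisy)
energy_flipped = ising_energy(flipped_guess, noisy)
```

The pairwise term rewards smooth images and the data term rewards agreement with the observed pixels, so an isolated flipped pixel raises the energy and is unlikely under the target.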
Next, we consider the task of logistic regression where the learnable weights and biases take discrete values. This problem is particularly relevant for training quantized neural networks, where the weights take discrete values. However, here we restrict ourselves to a simpler version, quantized logistic regression, on four real-world datasets from the UCI repository: Iris, Breast Cancer, Wine, and Digits. For each dataset, we run 5-fold cross-validation, with results averaged across the 5 folds. In each fold, we train the flow for a fixed number of iterations and run discrete MCMC with a fixed burn-in. Amongst these datasets, Digits is comparatively high-dimensional, and there our method again outperforms the alternatives. We consider parameters quantized to 4 bits and report the results in Table 2.
| Dataset | Parameters | Sampler | ESS | ESS/min | Accuracy (%) | Log-probability |
|---|---|---|---|---|---|---|
| Iris | 15 | Flow + MH | 11.48 ± 0.92 | 3.44 | 97.60 | -8.38 ± 1.02 |
| | | discrete MCMC | 2.32 ± 0.33 | 6.81 | 97.45 | -8.49 ± 0.54 |
| | | dHMC | 15.94 ± 3.06 | 4.58 | 97.35 | -8.40 ± 1.89 |
| Wine | 42 | Flow + MH | 11.76 ± 1.94 | 2.76 | 97.52 | -8.40 ± 0.72 |
| | | discrete MCMC | 1.91 ± 0.13 | 5.20 | 97.75 | -2.08 ± 0.61 |
| | | dHMC | 9.79 ± 1.22 | 5.40 | 97.68 | -3.54 ± 0.48 |
| Breast Cancer | 62 | Flow + MH | 16.66 ± 0.23 | 5.01 | 97.47 | -8.46 ± 0.23 |
| | | discrete MCMC | 0.61 ± 0.05 | 1.2 | 96.58 | -20.30 ± 1.41 |
| | | dHMC | 13.45 ± 2.12 | 3.96 | 97.12 | -11.43 ± 1.89 |
| Digits | 650 | Flow + MH | 11.90 ± 1.72 | 2.70 | 97.71 | -7.9 ± 0.51 |
| | | discrete MCMC | 0.58 ± 0.03 | 0.29 | 88.90 | -651.23 ± 27.16 |
| | | dHMC | 7.81 ± 1.36 | 0.98 | 93.12 | -32.87 ± 4.59 |
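| Setting | Sampler | ESS | ESS/min | Log-probability |
|---|---|---|---|---|
| , | Flow + MH | 500.25 ± 49.91 | 150 | -1030.93 ± 4.57 |
| | Gibbs | 1.21 ± 0.01 | 0.01 | -1151.06 ± 0.3 |
| , | Flow + MH | 41.01 ± 2.64 | 6.15 | -2284.54 ± 6.36 |
| | Gibbs | 1.08 ± 0.01 | 0.009 | -2532.31 ± 0.3 |
| , | Flow + MH | 14.55 ± 1.33 | 0.97 | -4697.61 ± 7.30 |
| | Gibbs | 0.79 ± 0.01 | 0.001 | -5325.90 ± 0.2 |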
Here, we consider the problem of probabilistic selection of features in regression, where the hierarchical framework allows for complex interactions, making posterior exploration extremely difficult. Following schafer2013sequential, we consider a hierarchical Bayesian model in which the target vector $y$ in linear regression is observed through a linear model with features $X$, parameters $\beta$, noise variance $\sigma^2$, and a binary vector $\gamma$ that indicates the features that are included in the model. To achieve a closed form for the marginal posterior of $\gamma$ that is independent of $\beta$ and $\sigma^2$, we follow the conjugate prior choices of power2019accelerated, which include an inverse-gamma prior, with hyperparameters set as in george1997approaches. For this setting, we create high-dimensional synthetic datasets, rather than the low-dimensional datasets usually used, to demonstrate the efficacy of our method. We present the results in Table 3, where we report the maximum ESS per feature due to the large percentage of uninformative features in the experiment.
In this paper, we presented a flow-based Monte Carlo sampler for sampling in combinatorial spaces. Our method learns a deterministic transport map between a discrete space and a simple continuous latent space in which it is efficient to sample. We then sample from the discrete space by generating samples in the latent space and using the transport map to obtain samples in the discrete space. By learning a map to a simple latent space (such as a standard Gaussian), our method is particularly suited to high-dimensional domains, where alternative samplers are not efficient due to the presence of more complex correlations. This is also reflected in our implementation, which is fast and efficient, as demonstrated by our results on a suite of experiments. In the future, it will be interesting to devise learning strategies for the transport map that explicitly push the latent space to have certain desirable properties for efficient sampling. Another direction could be to extend the framework of SurVAE Flow layers to incorporate underlying symmetries and invariances in the target domain.