Learning modern Bayesian neural networks requires inference in spaces with up to several million dimensions, conditioning the weights of a deep neural network (DNN) on hundreds of thousands of objects. For such applications, one has to perform approximate inference, predominantly by either sampling from the posterior with Markov Chain Monte Carlo (MCMC) methods or approximating the posterior with variational inference (VI) methods.
MCMC methods are non-parametric and provide an estimate that is unbiased in the limit, but they require careful hyperparameter tuning, especially for big datasets and high-dimensional problems. The large dataset problem has been addressed for different MCMC algorithms: stochastic gradient Langevin dynamics (Welling & Teh, 2011), stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), and minibatch Metropolis-Hastings algorithms (Korattikara et al., 2014; Chen et al., 2016). One way to address the problem of high dimension is the design of a proposal distribution. For example, for the Metropolis-Hastings (MH) algorithm there exists a theoretical guideline for scaling the variance of a Gaussian proposal (Roberts et al., 1997, 2001). More complex proposal designs include adaptive updates of the proposal distribution during iterations of the MH algorithm (Holden et al., 2009; Giordani & Kohn, 2010).
Variational inference is extremely scalable but provides a biased estimate of the target distribution. Using the doubly stochastic procedure (Titsias & Lázaro-Gredilla, 2014; Hoffman et al., 2013) VI can be applied to extremely large datasets and high dimensional spaces, such as a space of neural network weights (Kingma et al., 2015; Gal & Ghahramani, 2015, 2016). The bias introduced by variational approximation can be mitigated by using flexible approximations (Rezende & Mohamed, 2015) and resampling (Grover et al., 2018).
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a different approach to learning samplers. Under the framework of adversarial training, different optimization problems can be solved efficiently (Arjovsky et al., 2017; Nowozin et al., 2016). The shared goal of "learning to sample" inspired the connection of GANs with VI (Mescheder et al., 2017) and MCMC (Song et al., 2017).
In this paper, we propose a novel perspective on learning to sample from a target distribution by optimizing the parameters of either an explicit or an implicit probabilistic model. Our objective is inspired by the view of the acceptance rate of the Metropolis-Hastings algorithm as a quality measure of the sampler. We derive a lower bound on the acceptance rate and maximize it with respect to the parameters of the sampler, treating the sampler as a proposal distribution in the Metropolis-Hastings scheme.
We consider two possible forms of the target distribution: an unnormalized density (density-based setting) and a set of samples (sample-based setting). Each of these settings reveals a unifying property of the proposed perspective and the derived lower bound. In the density-based setting, the lower bound is the sum of forward and reverse KL-divergences between the true posterior and its approximation, connecting our approach to VI. In the sample-based setting, the lower bound admits the form of an adversarial game between the sampler and a discriminator, connecting our approach to GANs.
The closest work to ours is that of Song et al. (2017). In contrast to their paper, our approach (1) is free from hyperparameters; (2) is able to optimize the acceptance rate directly; (3) avoids a minimax problem in the density-based setting.
Our main contributions are as follows:
We introduce a novel perspective on learning to sample from the target distribution by treating the acceptance rate in the Metropolis-Hastings algorithm as a measure of sampler quality.
We derive the lower bound on the acceptance rate allowing for doubly stochastic optimization of the proposal distribution in case when the target distribution factorizes (i.e. over data points).
For sample-based and density-based forms of target distribution we show the connection of the proposed algorithm to variational inference and GANs.
The rest of the paper is organized as follows. In Section 2 we introduce the lower bound on the AR. Special forms of target distribution are addressed in Section 3. We validate our approach on the problems of approximate Bayesian inference in the space of high dimensional neural network weights and generative modeling in the space of images in Section 4. We discuss results and directions of the future work in Section 5.
2 Acceptance Rate for Metropolis-Hastings Algorithm
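In the MH algorithm we need to sample from a target distribution $p(x)$ while we are only able to sample from a proposal distribution $q(x' \mid x)$. One step of the MH algorithm can be described as follows.
1. Sample a proposal point $x' \sim q(x' \mid x)$, given the previously accepted point $x$.
2. Accept $x'$ with probability $\min\left\{1, \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right\}$; otherwise keep $x$.
If the proposal distribution does not depend on $x$, i.e. $q(x' \mid x) = q(x')$, the algorithm is called the independent MH algorithm.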
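The two steps above can be sketched for the independent case as follows. This is a minimal illustration rather than the authors' implementation: the target $N(0, 1)$, the proposal $N(0, 2^2)$, and the helper name `independent_mh` are assumptions chosen for the example. Both log-densities are given only up to additive constants, which cancel in the acceptance ratio.

```python
import numpy as np

def independent_mh(log_p, log_q, sample_q, n_steps, rng):
    """Run independent MH; log_p and log_q may be unnormalized."""
    x = sample_q(rng)
    chain = np.empty(n_steps)
    n_accepted = 0
    for t in range(n_steps):
        x_new = sample_q(rng)  # step 1: propose, ignoring the current point
        # log of p(x') q(x) / (p(x) q(x'))
        log_ratio = (log_p(x_new) + log_q(x)) - (log_p(x) + log_q(x_new))
        # step 2: accept with probability min{1, ratio}, otherwise keep x
        if np.log(rng.uniform()) < log_ratio:
            x, n_accepted = x_new, n_accepted + 1
        chain[t] = x
    return chain, n_accepted / n_steps

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x ** 2            # hypothetical target N(0, 1)
log_q = lambda x: -0.5 * (x / 2.0) ** 2    # hypothetical proposal N(0, 2^2)
sample_q = lambda r: 2.0 * r.normal()
chain, accept_frac = independent_mh(log_p, log_q, sample_q, 20000, rng)
print(f"accepted: {accept_frac:.3f}, mean: {chain.mean():.3f}, std: {chain.std():.3f}")
```

The fraction of accepted proposals along the chain is an empirical estimate of the acceptance rate once the chain is close to stationarity.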
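The quality of the proposal distribution is measured by the acceptance rate and the mixing time. The mixing time characterizes the speed of convergence of the Markov chain to the stationary distribution. For a target $p(x)$ and a proposal $q(x' \mid x)$, the acceptance rate of the MH algorithm is defined as
$$\mathrm{AR} = \mathbb{E}_{x \sim p(x),\, x' \sim q(x' \mid x)} \min\left\{1, \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right\}. \tag{1}$$
In the case of an independent proposal distribution, we show that the acceptance rate defines a semimetric in distribution space between $p$ and $q$ (see Appendix B).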
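A Monte Carlo estimate of the acceptance rate for an independent proposal can be sketched as below. The two Gaussians and the helper names are assumptions made for illustration. The sketch also checks two properties one would expect from a semimetric-like quantity: the estimate is (approximately) unchanged when target and proposal are swapped, and it equals one when they coincide.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def acceptance_rate(log_a, sample_a, log_b, sample_b, n, rng):
    """Monte Carlo acceptance rate with target a and independent proposal b."""
    x = sample_a(n, rng)       # x ~ target
    x_new = sample_b(n, rng)   # x' ~ proposal
    log_ratio = (log_a(x_new) + log_b(x)) - (log_a(x) + log_b(x_new))
    # min{1, exp(r)} = exp(min{0, r}); this form avoids overflow
    return np.mean(np.exp(np.minimum(0.0, log_ratio)))

rng = np.random.default_rng(0)
log_p = lambda x: log_normal_pdf(x, 0.0, 1.0)   # hypothetical target
log_q = lambda x: log_normal_pdf(x, 1.0, 1.5)   # hypothetical proposal
sample_p = lambda n, r: r.normal(0.0, 1.0, size=n)
sample_q = lambda n, r: r.normal(1.0, 1.5, size=n)

ar_pq = acceptance_rate(log_p, sample_p, log_q, sample_q, 200_000, rng)
ar_qp = acceptance_rate(log_q, sample_q, log_p, sample_p, 200_000, rng)
ar_pp = acceptance_rate(log_p, sample_p, log_p, sample_p, 200_000, rng)
print(ar_pq, ar_qp, ar_pp)
```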
2.2 Optimizing the lower bound on acceptance rate
Although we can maximize the acceptance rate of the MH algorithm (Eq. 1) directly w.r.t. the parameters $\phi$ of the proposal distribution $q_\phi(x' \mid x)$, we propose to maximize a lower bound on the acceptance rate instead. As our experiments show (see Section 4), optimization of the lower bound compares favorably to direct optimization of the acceptance rate. To introduce this lower bound, we first express the acceptance rate in terms of the total variation distance.
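Theorem 1. For the random variable $\xi = \frac{p(x')\,q_\phi(x \mid x')}{p(x)\,q_\phi(x' \mid x)}$ with $x \sim p(x)$ and $x' \sim q_\phi(x' \mid x)$,
$$\mathbb{E}_\xi \min\{1, \xi\} = 1 - \frac{1}{2}\,\mathbb{E}_\xi |\xi - 1| = 1 - \mathrm{TV}\big(p(x)\,q_\phi(x' \mid x) \,\big\|\, p(x')\,q_\phi(x \mid x')\big),$$
where $\mathrm{TV}$ is the total variation distance.
Bounding the total variation distance with Pinsker's inequality gives a lower bound on the acceptance rate,
$$\mathrm{AR} \ge 1 - \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p(x)\,q_\phi(x' \mid x) \,\big\|\, p(x')\,q_\phi(x \mid x')\big)}.$$
The maximization of this lower bound can be equivalently formulated as
$$\min_\phi \; \mathrm{KL}\big(p(x)\,q_\phi(x' \mid x) \,\big\|\, p(x')\,q_\phi(x \mid x')\big). \tag{5}$$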
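The relation between the acceptance rate, the total variation distance, and a KL-based lower bound can be checked exactly on a small discrete example. The sketch below assumes an independent proposal and assumes that the bound is obtained via Pinsker's inequality; the two distributions are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # hypothetical target on 5 states
q = rng.dirichlet(np.ones(5))   # hypothetical independent proposal

forward = np.outer(p, q)   # entry [x, x'] is p(x) q(x')
reverse = np.outer(q, p)   # entry [x, x'] is p(x') q(x)

# Acceptance rate: sum over (x, x') of p(x) q(x') min{1, p(x') q(x) / (p(x) q(x'))}
ar = np.sum(np.minimum(forward, reverse))
tv = 0.5 * np.sum(np.abs(forward - reverse))
kl = np.sum(forward * np.log(forward / reverse))
lower_bound = 1.0 - np.sqrt(0.5 * kl)

# For an independent proposal the KL is the sum of forward and reverse KL.
kl_sum = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
print(ar, 1.0 - tv, lower_bound)
```

The bound can be loose, and even negative, when the two distributions are far apart; what matters for optimization is that it shares its maximizer structure with the KL objective.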
In the following sections, we show the benefits of this optimization problem in two different settings: when the target distribution is given in the form of an unnormalized density and when it is given as a set of samples.
3 Optimization of Proposal Distribution
To estimate the loss function (Eq. 5) we need to evaluate the density ratio. In the density-based setting, the unnormalized density of the target distribution is given, so we suggest using an explicit proposal distribution to compute the density ratio explicitly. In the sample-based setting, however, we cannot compute the density ratio, so we propose to approximate it via adversarial training (Goodfellow et al., 2014). A brief summary of the constraints for both settings is shown in Table 1.
The following subsections describe the algorithms in detail.
| Setting | Target distribution | Proposal distribution | Density ratio |
|---|---|---|---|
| Density-based | unnormalized density | explicit model | explicit |
| Sample-based | set of samples | implicit model | learned discriminator |
3.1 Density-based Setting
In the density-based setting, we assume the proposal to be an explicit probabilistic model, i.e. the model that we can sample from and evaluate its density at any point up to the normalization constant. We also assume that the proposal is reparametrisable (Kingma & Welling, 2014; Rezende et al., 2014; Gal, 2016).
If the proposal belongs to a parametric family, e.g. a Gaussian random walk $q_\phi(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2)$, we might face the problem of collapsing to the delta-function. To tackle this problem, one can properly choose the parametric family of the proposal or make the proposal independent, $q_\phi(x' \mid x) = q_\phi(x')$. In Appendix C we provide an intuition for why the Markov chain proposal can collapse to the delta-function while the independent proposal cannot. In this section, we consider only the independent proposal. We also provide empirical evidence in Section 4 that collapsing to the delta-function does not happen for the independent proposal distribution.
Considering $q_\phi(x')$ as the proposal, optimization problem 5 takes the form
$$\min_\phi \; \mathbb{E}_{x \sim p(x),\, x' \sim q_\phi(x')} \log \frac{p(x)\,q_\phi(x')}{p(x')\,q_\phi(x)}. \tag{6}$$
The explicit form of the proposal and the target distributions allows us to obtain the density ratios $p(x)/p(x')$ and $q_\phi(x')/q_\phi(x)$ for any points $x, x'$. But to estimate the loss in Eq. 6 we also need samples from the target distribution during training. For this purpose, we use the current proposal $q_\phi$ and run the independent MH algorithm. After obtaining samples from the target distribution, we can perform an optimization step by taking stochastic gradients w.r.t. $\phi$. Pseudo-code for the resulting procedure is shown in Algorithm 1.
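The procedure described above can be sketched in one dimension with hand-written gradients. This is a toy under stated assumptions, not the authors' Algorithm 1: the target is a hypothetical unnormalized Gaussian $N(3, 2^2)$ (so the optimum is known), the proposal is $N(\mu, e^{2s})$, target samples come from parallel independent MH chains driven by the current proposal, and the step size and chain count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M_T, S_T = 3.0, 2.0   # hypothetical target N(3, 2^2), known only up to a constant

def log_p_tilde(x):   # unnormalized target log-density
    return -0.5 * ((x - M_T) / S_T) ** 2

def score_p(x):       # d/dx log p(x), used by the reparameterized gradient
    return -(x - M_T) / S_T ** 2

def log_q(x, mu, s):  # proposal N(mu, exp(s)^2), up to an additive constant
    return -s - 0.5 * ((x - mu) / np.exp(s)) ** 2

mu, s = 0.0, 0.0      # proposal parameters phi = (mu, log sigma)
n_chains, lr = 64, 0.02
x = mu + np.exp(s) * rng.normal(size=n_chains)   # states of parallel MH chains

for step in range(3000):
    sigma = np.exp(s)
    # One independent MH step with the current proposal: approximate x ~ p(x).
    x_prop = mu + sigma * rng.normal(size=n_chains)
    log_ratio = (log_p_tilde(x_prop) + log_q(x, mu, s)) \
        - (log_p_tilde(x) + log_q(x_prop, mu, s))
    accept = np.log(rng.uniform(size=n_chains)) < log_ratio
    x = np.where(accept, x_prop, x)
    # Reparameterized proposal samples x' = mu + sigma * eps.
    eps = rng.normal(size=n_chains)
    x_new = mu + sigma * eps
    # Gradients of E_p[-log q(x)] + E_q[log q(x') - log p(x')] w.r.t. (mu, s).
    g_mu = np.mean(-(x - mu) / sigma ** 2) + np.mean(-score_p(x_new))
    g_s = np.mean(1.0 - ((x - mu) / sigma) ** 2) \
        + np.mean(-1.0 - score_p(x_new) * sigma * eps)
    mu, s = mu - lr * g_mu, s - lr * g_s

print(f"learned mu = {mu:.2f}, sigma = {np.exp(s):.2f}")
```

In this setup the minimizer of the objective is the target itself ($\mu = 3$, $\sigma = 2$), so the learned parameters can be compared against it. In higher dimensions one would rely on automatic differentiation instead of manual gradients.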
Algorithm 1 could also be employed for the direct optimization of the acceptance rate (Eq. 1). We now apply this algorithm to a Bayesian inference problem and show that during optimization of the lower bound we can use minibatches of data, which is not the case for direct optimization of the acceptance rate. We consider the Bayesian inference problem for a discriminative model on a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is the feature vector of the $i$-th object and $y_i$ is its label. For the discriminative model we know the likelihood $p(y_i \mid x_i, \theta)$ and the prior distribution $p(\theta)$. In order to obtain predictions for some object $x$, we need to evaluate the predictive distribution
$$p(y \mid x, \mathcal{D}) = \mathbb{E}_{p(\theta \mid \mathcal{D})}\, p(y \mid x, \theta). \tag{7}$$
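Given samples from the posterior (or from an approximation of it), the predictive distribution is typically estimated by averaging the per-sample predictions. The sketch below assumes a linear softmax classifier and uses random arrays as stand-ins for posterior samples and test data; all names and shapes are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_features, n_classes, n_test = 50, 4, 3, 8
# Hypothetical posterior samples theta_s of a linear softmax classifier.
thetas = rng.normal(size=(n_samples, n_features, n_classes))
x_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, n_classes, size=n_test)

# p(y | x, theta_s) for every sample: shape (n_samples, n_test, n_classes)
per_sample = softmax(np.einsum("nf,sfc->snc", x_test, thetas))
# Monte Carlo estimate of the predictive distribution: average over samples.
predictive = per_sample.mean(axis=0)

# Test negative log-likelihood and accuracy of the averaged prediction.
nll = -np.mean(np.log(predictive[np.arange(n_test), y_test]))
accuracy = np.mean(predictive.argmax(axis=1) == y_test)
print(f"NLL = {nll:.3f}, accuracy = {accuracy:.3f}")
```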
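To obtain samples from the posterior distribution $p(\theta \mid \mathcal{D})$, we suggest learning a proposal distribution $q_\phi(\theta)$ and performing the independent MH algorithm. Thus optimization problem 6 can be rewritten as follows.
$$\min_\phi \; \mathrm{KL}\big(p(\theta \mid \mathcal{D})\,q_\phi(\theta') \,\big\|\, p(\theta' \mid \mathcal{D})\,q_\phi(\theta)\big) \tag{8}$$
Note that due to the usage of the independent proposal, the minimized KL-divergence splits up into the sum of two KL-divergences.
$$\mathrm{KL}\big(p(\theta \mid \mathcal{D})\,q_\phi(\theta') \,\big\|\, p(\theta' \mid \mathcal{D})\,q_\phi(\theta)\big) = \mathrm{KL}\big(q_\phi(\theta) \,\big\|\, p(\theta \mid \mathcal{D})\big) + \mathrm{KL}\big(p(\theta \mid \mathcal{D}) \,\big\|\, q_\phi(\theta)\big) \tag{9}$$
Minimization of the first KL-divergence corresponds to the variational inference procedure.
$$\mathrm{KL}\big(q_\phi(\theta) \,\big\|\, p(\theta \mid \mathcal{D})\big) = -\mathbb{E}_{q_\phi(\theta)} \sum_{i=1}^N \log p(y_i \mid x_i, \theta) + \mathrm{KL}\big(q_\phi(\theta) \,\big\|\, p(\theta)\big) + \log p(\mathcal{D}) \tag{10}$$
The second KL-divergence has only one term that depends on $\phi$, namely $-\mathbb{E}_{p(\theta \mid \mathcal{D})} \log q_\phi(\theta)$. Thus we obtain the following optimization problem
$$\min_\phi \left[ -\mathbb{E}_{q_\phi(\theta)} \sum_{i=1}^N \log p(y_i \mid x_i, \theta) + \mathrm{KL}\big(q_\phi(\theta) \,\big\|\, p(\theta)\big) - \mathbb{E}_{p(\theta \mid \mathcal{D})} \log q_\phi(\theta) \right]. \tag{11}$$
The first summand here contains the sum over all objects in the dataset $\mathcal{D}$. We follow doubly stochastic variational inference and suggest performing unbiased estimation of the gradient in Eq. 11 using only minibatches of data. Moreover, we can use recently proposed techniques (Korattikara et al., 2014; Chen et al., 2016) that perform the independent MH algorithm using only minibatches of data. The combination of these two techniques allows us to use only minibatches of data during the iterations of Algorithm 1. In the case of direct optimization of the acceptance rate, straightforward usage of minibatches results in biased gradients. Indeed, for direct optimization of the acceptance rate (Eq. 1) we have the product over all training data inside the $\min$ function.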
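The difference between the two objectives with respect to minibatching can be illustrated by exhaustive enumeration on a tiny example. The per-object terms below are made-up numbers standing in for log-likelihood ratios; the sketch shows that a rescaled minibatch sum is an unbiased estimate of the full sum, whereas plugging the same estimate inside $\min\{1, \exp(\cdot)\}$ is biased.

```python
import itertools
import numpy as np

# Hypothetical per-object log-ratio terms l_i; the full-data log-ratio
# that appears inside min{1, .} is sum_i l_i.
terms = np.array([0.5, -0.3, 0.2, -0.6, 0.4, -0.1])
n, m = len(terms), 2
full_sum = terms.sum()

# Enumerate every minibatch of size m so that expectations are exact.
batches = list(itertools.combinations(range(n), m))
estimates = np.array([n / m * terms[list(b)].sum() for b in batches])

# Linear objective: the average of the rescaled minibatch sums equals the full sum.
mean_log_estimate = estimates.mean()
# Nonlinear objective: the average of min{1, exp(estimate)} differs from the exact value.
mean_min_estimate = np.minimum(1.0, np.exp(estimates)).mean()
exact_min = min(1.0, float(np.exp(full_sum)))

print(mean_log_estimate, full_sum)
print(mean_min_estimate, exact_min)
```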
3.2 Sample-based Setting
In the sample-based setting, we assume the proposal to be an implicit probabilistic model, i.e. the model that we can only sample from. As in the density-based setting, we assume that we are able to perform the reparameterization trick for the proposal.
In this subsection we consider only the Markov chain proposal $q_\phi(x' \mid x)$, but everything can be applied to the independent proposal by simply substituting $q_\phi(x' \mid x)$ with $q_\phi(x')$. From now on we assume our proposal distribution to be a neural network that takes $x$ as its input and outputs $x'$. Considering a proposal distribution parameterized by a neural network allows us to easily exclude the delta-function from the space of solutions. We avoid learning the identity mapping by using neural networks with a bottleneck and noisy layers. For a detailed description of the architectures see Appendix E.
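The set of samples $\{x_i\}_{i=1}^N$ from the true distribution $p(x)$ allows for the Monte Carlo estimation of the loss
$$\mathbb{E}_{x \sim p(x),\, x' \sim q_\phi(x' \mid x)} \log \frac{p(x)\,q_\phi(x' \mid x)}{p(x')\,q_\phi(x \mid x')} \simeq \frac{1}{N} \sum_{i=1}^N \log \frac{p(x_i)\,q_\phi(x'_i \mid x_i)}{p(x'_i)\,q_\phi(x_i \mid x'_i)}, \quad x'_i \sim q_\phi(x' \mid x_i). \tag{12}$$
To compute the density ratio we suggest using the well-known technique of density ratio estimation via training a discriminator network. Denoting the discriminator output as $D(x, x')$, we suggest the following optimization problem for the discriminator.
$$\min_D \left[ -\mathbb{E}_{x \sim p(x),\, x' \sim q_\phi(x' \mid x)} \log D(x, x') - \mathbb{E}_{x \sim p(x),\, x' \sim q_\phi(x' \mid x)} \log\big(1 - D(x', x)\big) \right] \tag{13}$$
Speaking informally, such a discriminator takes two images as input and tries to figure out which image is sampled from the true distribution and which one is generated by one step of the proposal distribution. It is easy to show that the optimal discriminator in problem 13 is
$$D(x, x') = \frac{p(x)\,q_\phi(x' \mid x)}{p(x)\,q_\phi(x' \mid x) + p(x')\,q_\phi(x \mid x')}. \tag{14}$$
Note that for the optimal discriminator we have $D(x, x') = 1 - D(x', x)$. In practice, the discriminator is not optimal and these values can differ significantly. Thus, we have four ways to estimate the density ratio, and they may differ significantly:
$$\frac{p(x)\,q_\phi(x' \mid x)}{p(x')\,q_\phi(x \mid x')} \approx \frac{D(x, x')}{D(x', x)} \approx \frac{1 - D(x', x)}{D(x', x)} \approx \frac{D(x, x')}{1 - D(x, x')} \approx \frac{1 - D(x', x)}{1 - D(x, x')}.$$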
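To avoid the ambiguity, we suggest using a discriminator of a special structure. Let $\widetilde{D}(x, x')$ be a convolutional neural network with scalar output. Then the output of the discriminator $D(x, x')$ is defined as follows.
$$D(x, x') = \frac{\exp\big(\widetilde{D}(x, x')\big)}{\exp\big(\widetilde{D}(x, x')\big) + \exp\big(\widetilde{D}(x', x)\big)} \tag{15}$$
In other words, such a discriminator can be described by the following procedure. For a single neural network $\widetilde{D}$ we evaluate two outputs, $\widetilde{D}(x, x')$ and $\widetilde{D}(x', x)$, and then apply the softmax operation to these values. Summing up all the steps, we obtain Algorithm 2.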
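The effect of this structure can be illustrated with a small stand-in for the scalar network. In the sketch below, `pair_score` is a hypothetical fixed function playing the role of the convolutional network, and the four ratio expressions are our reading of the "four ways" mentioned in the text. With the softmax structure, $D(x, x') + D(x', x) = 1$ holds by construction, so all four estimates coincide.

```python
import numpy as np

def pair_score(x, y, w):
    """Hypothetical scalar network output for the ordered pair (x, y)."""
    return np.tanh(w[0] * x + w[1] * y + w[2] * x * y)

def discriminator(x, y, w):
    """Softmax over both orderings: e^{d(x,y)} / (e^{d(x,y)} + e^{d(y,x)})."""
    a, b = pair_score(x, y, w), pair_score(y, x, w)
    return 1.0 / (1.0 + np.exp(b - a))

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x, x_new = rng.normal(size=100), rng.normal(size=100)

d_fwd = discriminator(x, x_new, w)   # D(x, x')
d_rev = discriminator(x_new, x, w)   # D(x', x)

# Four candidate density ratio estimates; they agree for this structure.
ratios = np.stack([
    d_fwd / d_rev,
    (1.0 - d_rev) / d_rev,
    d_fwd / (1.0 - d_fwd),
    (1.0 - d_rev) / (1.0 - d_fwd),
])
print(np.abs(ratios - ratios[0]).max())
```

For a discriminator without this structure, the two outputs are unrelated and the four estimates generally disagree.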
4 Experiments
In this section, we provide experiments for both density-based and sample-based settings, showing the proposed procedure is applicable to high dimensional target distributions. Code for reproducing all of the experiments will be published with the camera-ready version of the paper.
4.1 Toy Problem
This experiment shows that it is possible to optimize the acceptance rate by optimizing its lower bound. For the target distribution we consider a bimodal Gaussian mixture $p(x)$; for the independent proposal we consider a unimodal Gaussian $q_\phi(x)$. We perform stochastic gradient optimization from the same initialization for both objectives (Fig. 1) and obtain approximately the same local maxima.
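A toy problem of this kind can be explored numerically on a grid. The mixture parameters, the grid, and the scan over the proposal scale below are assumptions made for illustration and do not reproduce the paper's setup or Fig. 1. The sketch evaluates both the acceptance rate and a KL-based lower bound (assumed here to follow from Pinsker's inequality) for a zero-mean Gaussian proposal with varying standard deviation.

```python
import numpy as np

grid = np.linspace(-10.0, 10.0, 801)

def normalize(w):
    return w / w.sum()

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma

# Hypothetical bimodal target: equal mixture of N(-2, 0.5^2) and N(2, 0.7^2),
# discretized on the grid.
p = normalize(0.5 * gauss(grid, -2.0, 0.5) + 0.5 * gauss(grid, 2.0, 0.7))

def objectives(mu, sigma):
    """Acceptance rate and its KL-based lower bound for proposal N(mu, sigma^2)."""
    q = normalize(gauss(grid, mu, sigma))
    ar = np.minimum(np.outer(p, q), np.outer(q, p)).sum()
    kl_sum = np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))
    return ar, 1.0 - np.sqrt(0.5 * kl_sum)

sigmas = np.linspace(0.5, 5.0, 19)
values = np.array([objectives(0.0, s) for s in sigmas])
best_ar_sigma = sigmas[values[:, 0].argmax()]
best_bound_sigma = sigmas[values[:, 1].argmax()]
print(f"argmax AR: sigma = {best_ar_sigma:.2f}, argmax bound: sigma = {best_bound_sigma:.2f}")
```

Printing the two maximizers gives a quick way to see how closely the objectives agree for a particular target; the bound itself may be negative for poorly matched proposals.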
4.2 Density-based Setting
In the density-based setting, we consider the Bayesian inference problem for the weights of a neural network. In our experiments we consider approximation of the predictive distribution (Eq. 7) as our main goal. To estimate the quality of the approximation we measure negative log-likelihood and accuracy on the test set.
In subsection 3.1 we show that lower bound on acceptance rate can be optimized more efficiently than acceptance rate due to the usage of minibatches. But other questions arise.
Does the proposed objective in Eq. 11 allow for better estimation of predictive distribution compared to the variational inference?
Does the application of the MH correction to the learned proposal distribution allow for better estimation of the predictive distribution (Eq. 7) than estimation via raw samples from the proposal?
To answer these questions we consider a reduced LeNet-5 architecture (see Appendix D) for a classification task on 20k images from the MNIST dataset (for test data we use the whole MNIST test set). Even after the architecture reduction we still face the challenging task of learning a complex distribution in a space of several thousand dimensions. For the proposal distribution we use a fully-factorized Gaussian $q_\phi(\theta) = \prod_j \mathcal{N}(\theta_j \mid \mu_j, \sigma_j^2)$ and a standard normal distribution for the prior.
For variational inference, we train the model using different initializations and pick the model with the best ELBO. For our procedure, we do the same and choose the model with the maximum value of the acceptance rate lower bound. In Algorithm 1 we propose to sample from the posterior distribution using the independent MH algorithm and the current proposal. In practice, it turns out to be better to use the currently learned proposal to set up a random-walk MH algorithm. That is, we start with the mean $\mu$ as the initial point and then use a random-walk proposal with the variances $\sigma^2$ of the current independent proposal. This should be considered a heuristic that improves the approximation of the loss function.
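The heuristic can be sketched as a random-walk MH chain that starts at the proposal mean and uses the proposal's per-coordinate standard deviations as step sizes. The two-dimensional Gaussian target and the "learned" proposal parameters below are placeholders, and `random_walk_mh` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def random_walk_mh(log_p, mean, std, n_steps, rng):
    """Random-walk MH started at `mean` with per-coordinate step sizes `std`."""
    x = np.array(mean, dtype=float)
    log_px = log_p(x)
    samples = np.empty((n_steps, x.size))
    n_accepted = 0
    for t in range(n_steps):
        x_new = x + std * rng.normal(size=x.size)   # symmetric proposal
        log_px_new = log_p(x_new)
        # Symmetric proposal: the acceptance ratio reduces to p(x') / p(x).
        if np.log(rng.uniform()) < log_px_new - log_px:
            x, log_px = x_new, log_px_new
            n_accepted += 1
        samples[t] = x
    return samples, n_accepted / n_steps

rng = np.random.default_rng(0)
target_std = np.array([1.0, 2.0])   # hypothetical target N(0, diag(1, 4))
log_p = lambda th: -0.5 * np.sum((th / target_std) ** 2)
# Hypothetical learned independent proposal: mean and per-coordinate std.
q_mean, q_std = np.zeros(2), np.array([1.0, 2.0])
samples, accept_frac = random_walk_mh(log_p, q_mean, q_std, 20000, rng)
print(f"accepted: {accept_frac:.3f}, sample mean: {samples.mean(axis=0)}")
```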
The optimization of the acceptance rate lower bound results in a better estimate of the predictive distribution than variational inference (see Fig. 2). Direct optimization of the acceptance rate for the same number of epochs does not reach comparable accuracy on the test set, which is why we do not report results for this procedure in Fig. 2.
To answer the second question we estimate predictive distribution in two ways. The first way is to perform accept/reject steps of the independent MH algorithm with the learned proposal after each epoch, i.e. perform MH correction of the samples from the proposal. The second way is to take the same number of samples from without MH correction. For both estimations of predictive distribution, we evaluate negative log-likelihood on the test set and compare them.
The MH correction of the learned proposal improves the estimation of the predictive distribution for variational inference (right plot of Fig. 3) but does not do so for the optimization of the acceptance rate lower bound (left plot of Fig. 3). This may be considered implicit evidence that our procedure learns a proposal distribution with a higher acceptance rate.
4.3 Sample-based Setting
In the sample-based setting, we estimate density ratio using a discriminator. Hence we do not use the minibatching property (see subsection 3.1) of the obtained lower bound, and optimization problems for the acceptance rate and for the lower bound have the same efficiency in terms of using data. That is why our main goal in this setting is to compare the optimization of the acceptance rate and the optimization of the lower bound. Also, in this setting, we have Markov chain proposal that is interesting to compare with the independent proposal. Summing up, we formulate the following questions:
Does the optimization of the lower bound have any benefits compared to the direct optimization of the acceptance rate?
Do we have a mixing issue while learning a Markov chain proposal in practice?
Could we improve the visual quality of samples by applying the MH correction to the learned proposal?
We use DCGAN architecture for the proposal and discriminator (see Appendix E) and apply our algorithm to MNIST dataset. We consider two optimization problems: direct optimization of the acceptance rate and its lower bound. We also consider two ways to obtain samples from the approximation of the target distribution — use raw samples from the learned proposal, or perform the MH algorithm, where we use the learned discriminator for density ratio estimation.
In the case of the independent proposal, we show that the MH correction at the evaluation step improves the visual quality of samples: figures 3(a) and 3(b) for the direct optimization of the acceptance rate, figures 3(c) and 3(d) for the optimization of its lower bound. Note that in Algorithm 2 we do not apply the independent MH algorithm during training. Potentially, one can use the MH algorithm with any generative model as a proposal distribution by learning a discriminator for density ratio estimation. Also, for this proposal, we observe a negligible difference in the visual quality of samples obtained by the direct optimization of the acceptance rate (see Fig. 3(a)) and by the optimization of the lower bound (see Fig. 3(c)).
In the case of the Markov chain proposal, we show that the direct optimization of acceptance rate results in slow mixing (see Fig. 4(a)) — most of the time the proposal generates samples from one of the modes (digits) and rarely switches to another mode. When we perform the optimization of the lower bound the proposal switches between modes frequently (see Fig. 4(b)).
To show that the learned proposal distribution has the Markov property rather than being totally independent, we show samples from the proposal conditioned on two different points in the dataset (see Fig. 6). The difference in samples from these two distributions (Fig. 5(a), 5(b)) reflects the dependence on the conditioning.
Additionally, in Appendix G we present samples from the chain after accepted images and also samples from the chain that was initialized with noise.
5 Discussion and future work
This paper proposes to use the acceptance rate of the MH algorithm as a universal objective for learning to sample from a target distribution. We also propose a lower bound on the acceptance rate that should be preferred over direct maximization of the acceptance rate in many cases. The proposed approach can be improved in many ways by combining it with recent developments in the fields of MCMC, GANs, and variational inference. For example:
The quality of a sampler in the density-based setting could be improved with normalizing flows (Rezende & Mohamed, 2015).
In the sample-based setting one can use more advanced techniques of density ratio estimation.
Another interesting direction of further research is the design of the family of explicit Markov chain proposals resistant to the collapsing to the delta-function problem. Application of the MH algorithm to improve the quality of generative models also requires exhaustive further exploration and rigorous treatment.
References
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Chen et al. (2016) Haoyu Chen, Daniel Seita, Xinlei Pan, and John Canny. An efficient minibatch acceptance test for metropolis-hastings. arXiv preprint arXiv:1610.06848, 2016.
- Chen et al. (2014) Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.
- Gal (2016) Yarin Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Gal & Ghahramani (2015) Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
- Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Giordani & Kohn (2010) Paolo Giordani and Robert Kohn. Adaptive independent metropolis–hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Grover et al. (2018) Aditya Grover, Ramki Gummadi, Miguel Lazaro-Gredilla, Dale Schuurmans, and Stefano Ermon. Variational rejection sampling. In Amos Storkey and Fernando Perez-Cruz (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 823–832, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/grover18a.html.
- Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Holden et al. (2009) Lars Holden, Ragnar Hauge, Marit Holden, et al. Adaptive independent metropolis–hastings. The Annals of Applied Probability, 19(1):395–413, 2009.
- Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
- Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.
- Korattikara et al. (2014) Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in mcmc land: Cutting the metropolis-hastings budget. In International Conference on Machine Learning, pp. 181–189, 2014.
- Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
- Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
- Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
- Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
- Roberts et al. (1997) Gareth O Roberts, Andrew Gelman, Walter R Gilks, et al. Weak convergence and optimal scaling of random walk metropolis algorithms. The annals of applied probability, 7(1):110–120, 1997.
- Roberts et al. (2001) Gareth O Roberts, Jeffrey S Rosenthal, et al. Optimal scaling for various metropolis-hastings algorithms. Statistical science, 16(4):351–367, 2001.
- Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pp. 5140–5150, 2017.
- Titsias & Lázaro-Gredilla (2014) Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational bayes for non-conjugate inference. In International Conference on Machine Learning, pp. 1971–1979, 2014.
- Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
Appendix A Proof of Theorem 1
Remind that we have random variables and , and want to prove the following equalities.
Equality is obvious.
Equality can be proofed as follows.
where is CDF of random variable . Note that since . Eq. 21 can be rewritten in two ways.
To rewrite Eq. 21 in the second way we note that .
Using the form of we can rewrite the acceptance rate as
Appendix B Acceptance rate of independent MH defines semimetric in distribution space
In independent case we have and we want to prove that is semimetric (or pseudo-metric) in space of distributions. For this appendix, we denote . The first two axioms for metric obviously holds
There is an example when triangle inequality does not hold. For distributions
But weaker inequality can be proved.
Appendix C On collapsing to the delta-function
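First, let us consider the case of a Gaussian random-walk proposal $q(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2)$. Since this proposal is symmetric, the optimization problem for the acceptance rate takes the form
$$\max_\sigma \; \mathbb{E}_{x \sim p(x),\, x' \sim \mathcal{N}(x' \mid x, \sigma^2)} \min\left\{1, \frac{p(x')}{p(x)}\right\}.$$
It is easy to see that we can obtain an acceptance rate arbitrarily close to $1$ by taking $\sigma$ small enough.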
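The collapse can be observed numerically. The sketch below assumes a standard normal target (chosen only for illustration) and estimates the acceptance rate of a Gaussian random-walk proposal for a decreasing sequence of step sizes; the estimate grows towards one as the step size shrinks, even though such a chain barely moves.

```python
import numpy as np

def random_walk_acceptance_rate(sigma, n, rng):
    """Monte Carlo acceptance rate for target N(0, 1) and proposal N(x' | x, sigma^2)."""
    x = rng.normal(size=n)                    # x ~ target
    x_new = x + sigma * rng.normal(size=n)    # x' ~ random-walk proposal
    log_ratio = 0.5 * (x ** 2 - x_new ** 2)   # proposal terms cancel by symmetry
    # min{1, exp(r)} = exp(min{0, r})
    return np.mean(np.exp(np.minimum(0.0, log_ratio)))

rng = np.random.default_rng(0)
sigmas = [5.0, 1.0, 0.1, 0.01]
rates = [random_walk_acceptance_rate(s, 200_000, rng) for s in sigmas]
for s, r in zip(sigmas, rates):
    print(f"sigma = {s:5.2f}  acceptance rate ~ {r:.4f}")
```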
In the case of the independent proposal, we don’t have the collapsing to the delta-function problem. In our work, it is important to show non-collapsing during optimization of the lower bound, but the same hold for the direct optimization of the acceptance rate. To provide such intuition we consider one-dimensional case where we have some target distribution and independent proposal . Choosing small enough, we approximate sampling with the independent MH as sampling on some finite support
. For this support, we approximate the target distribution with the uniform distribution (see Fig.7).
For such approximation, optimization of lower bound takes the form
Here is truncated normal distribution. The first KL-divergence can be written as follows.
is normalization constant of truncated log normal distribution and, where is CDF of standard normal distribution. The second KL-divergence is
Summing up two KL-divergencies and taking derivative w.r.t. we obtain
To show that the derivative of the lower bound w.r.t. is negative, we need to prove that the following inequality holds for positive .
Defining and noting that we can rewrite inequality 47 as
By the fundamental theorem of calculus, we have
Using this inequality twice, we obtain
Thus, the target inequality can be verified by the verification of
Thus, we show that partial derivative of our lower bound w.r.t. is negative. Using that knowledge we can improve our loss by taking a bigger value of . Hence, such proposal does not collapse to delta-function.
Appendix D Architecture of the reduced LeNet-5
# `layers` and `BayesNet` are the authors' own modules (not shown here).
import torch.nn as nn

class LeNet5(BayesNet):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.num_classes = 10
        self.conv1 = layers.ConvFFG(1, 10, 5, padding=0)
        self.relu1 = nn.ReLU(True)
        self.pool1 = nn.MaxPool2d(2, padding=0)
        self.conv2 = layers.ConvFFG(10, 20, 5, padding=0)
        self.relu2 = nn.ReLU(True)
        self.pool2 = nn.MaxPool2d(2, padding=0)
        self.flatten = layers.ViewLayer([20*4*4])
        self.dense1 = layers.LinearFFG(20*4*4, 10)
        self.relu3 = nn.ReLU()
        self.dense2 = layers.LinearFFG(10, 10)
Appendix E Architectures of neural networks in sample-based setting
In the sample-based setting we use the usual DCGAN architecture for the independent proposal distribution:
# `layers` is the authors' own module (not shown here).
import torch.nn as nn

class Generator(layers.ModuleWrapper):
    def __init__(self):
        super(Generator, self).__init__()
        self.fc = nn.Linear(100, 128*8*8)
        self.unflatten = layers.ViewLayer([128, 8, 8])
        self.in1 = nn.InstanceNorm2d(128)
        self.us1 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.conv1 = nn.Conv2d(128, 128, 3, stride=1, padding=1)
        self.in2 = nn.InstanceNorm2d(128, 0.8)
        self.lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.us2 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.conv2 = nn.Conv2d(128, 64, 3, stride=1, padding=1)
        self.in3 = nn.InstanceNorm2d(64, 0.8)
        self.lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.conv3 = nn.Conv2d(64, 1, 3, stride=1, padding=1)
        self.tanh = nn.Tanh()
For the Markov chain proposal distribution we use a slightly modified architecture:
# `layers` is the authors' own module (not shown here).
import torch.nn as nn

class Generator(layers.ModuleWrapper):
    def __init__(self):
        super(Generator, self).__init__()
        # Encoder ("d_" layers)
        self.d_conv1 = nn.Conv2d(1, 16, 5, stride=2, padding=2)
        self.d_lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.d_do1 = nn.Dropout2d(0.5)
        self.d_conv2 = nn.Conv2d(16, 4, 5, stride=2, padding=2)
        self.d_in2 = nn.InstanceNorm2d(4, 0.8)
        self.d_lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.d_do2 = nn.Dropout2d(0.5)
        # Bottleneck with additive noise ("b_" layers).
        # The original listing assigned `b_fc` twice, so the second layer
        # overwrote the first; the two layers are named separately here.
        self.b_view = layers.ViewLayer([4*8*8])
        self.b_fc1 = nn.Linear(4*8*8, 256)
        self.b_lrelu = nn.LeakyReLU(0.2, inplace=True)
        self.b_fc2 = nn.Linear(256, 128 * 8 * 8)
        self.b_do = layers.AdditiveNoise(0.5)
        # Decoder ("e_" layers)
        self.e_unflatten = layers.ViewLayer([128, 8, 8])
        self.e_in1 = nn.InstanceNorm2d(128, 0.8)
        self.e_us1 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.e_conv1 = nn.Conv2d(128, 128, 3, stride=1, padding=1)
        self.e_in2 = nn.InstanceNorm2d(128, 0.8)
        self.e_lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.e_us2 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.e_conv2 = nn.Conv2d(128, 64, 3, stride=1, padding=1)
        self.e_in3 = nn.InstanceNorm2d(64, 0.8)
        self.e_lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.e_conv3 = nn.Conv2d(64, 1, 3, stride=1, padding=1)
        self.e_tanh = nn.Tanh()
For both proposals we use the proposed discriminator with the following architecture.
# `layers` is the authors' own module (not shown here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.conv1 = nn.Conv2d(2, 16, 3, 2, 1)
        self.lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.conv2 = nn.Conv2d(16, 32, 3, 2, 1)
        self.lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.in2 = nn.InstanceNorm2d(32, 0.8)
        self.conv3 = nn.Conv2d(32, 64, 3, 2, 1)
        self.lrelu3 = nn.LeakyReLU(0.2, inplace=True)
        self.in3 = nn.InstanceNorm2d(64, 0.8)
        self.conv4 = nn.Conv2d(64, 128, 3, 2, 1)
        self.lrelu4 = nn.LeakyReLU(0.2, inplace=True)
        self.in4 = nn.InstanceNorm2d(128, 0.8)
        self.flatten = layers.ViewLayer([128*2*2])
        self.fc = nn.Linear(128*2*2, 1)

    def forward(self, x, y):
        # Score the ordered pair (x, y) with the shared network.
        xy = torch.cat([x, y], dim=1)
        for module in self.children():
            xy = module(xy)
        # Score the swapped pair (y, x) with the same network.
        yx = torch.cat([y, x], dim=1)
        for module in self.children():
            yx = module(yx)
        # Softmax over the two scores, so the two outputs sum to one.
        return F.softmax(torch.cat([xy, yx], dim=1), dim=1)
Appendix F Intuition for better gradients in sample-based setting
In this section, we provide an intuition for sample-based setting that the loss function for lower bound has better gradients than the loss function for acceptance rate. Firstly, we remind that in the sample-based setting we use a discriminator for density ratio estimation.
For this purpose we use the discriminator of special structure
We denote and consider the case when the discriminator can easily distinguish fake pairs from valid pairs. So is close to and for and . To evaluate gradients we consider Monte Carlo estimations of each loss and take gradients w.r.t. in order to obtain gradients for parameters of proposal distribution. We do not introduce the reparameterization trick to simplify the notation but assume it to be performed. For the optimization of the acceptance rate we have
While for the optimization of the lower bound we have
Appendix G Additional figures for Markov chain proposals in sample-based setting
In this section, we show additional figures for Markov chain proposals. In Fig. 8 we show samples from the chain that was initialized by the noise. In Fig. 9 we show samples from the chain after accepted samples.