Metropolis-Hastings view on variational inference and adversarial training

10/16/2018 · by Kirill Neklyudov et al.

In this paper we propose to view the acceptance rate of the Metropolis-Hastings algorithm as a universal objective for learning to sample from a target distribution given either as a set of samples or in the form of an unnormalized density. This point of view unifies the goals of such approaches as Markov Chain Monte Carlo (MCMC), Generative Adversarial Networks (GANs), and variational inference. To reveal the connection we derive a lower bound on the acceptance rate and treat it as the objective for learning explicit and implicit samplers. The form of the lower bound allows for doubly stochastic gradient optimization in the case where the target distribution factorizes (e.g. over data points). We empirically validate our approach on Bayesian inference for neural networks and generative models for images.


1 Introduction

The Bayesian framework and deep learning have become more and more interrelated in recent years. Bayesian deep neural networks have been used for estimating uncertainty (Gal & Ghahramani, 2016), ensembling (Gal & Ghahramani, 2016) and model compression (Molchanov et al., 2017). On the other hand, deep neural networks may be used to improve approximate inference in Bayesian models (Kingma & Welling, 2014).

Learning modern Bayesian neural networks requires inference in spaces with dimension up to several million, conditioning the weights of a DNN on hundreds of thousands of objects. For such applications, one has to perform approximate inference – predominantly by either sampling from the posterior with Markov Chain Monte Carlo (MCMC) methods or approximating the posterior with variational inference (VI) methods.

MCMC methods are non-parametric and provide an unbiased (in the limit) estimate, but require careful hyperparameter tuning, especially for big datasets and high-dimensional problems. The large-dataset problem has been addressed for different MCMC algorithms: stochastic gradient Langevin dynamics (Welling & Teh, 2011), stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), and minibatch Metropolis-Hastings algorithms (Korattikara et al., 2014; Chen et al., 2016). One way to address the problem of high dimensionality is the design of the proposal distribution. For example, for the Metropolis-Hastings (MH) algorithm there exist theoretical guidelines for scaling the variance of a Gaussian proposal (Roberts et al., 1997, 2001). More complex proposal designs include adaptive updates of the proposal distribution during iterations of the MH algorithm (Holden et al., 2009; Giordani & Kohn, 2010).

Variational inference is extremely scalable but provides a biased estimate of the target distribution. Using the doubly stochastic procedure (Titsias & Lázaro-Gredilla, 2014; Hoffman et al., 2013), VI can be applied to extremely large datasets and high-dimensional spaces, such as the space of neural network weights (Kingma et al., 2015; Gal & Ghahramani, 2015, 2016). The bias introduced by the variational approximation can be mitigated by using flexible approximations (Rezende & Mohamed, 2015) and resampling (Grover et al., 2018).

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a different approach to learning samplers. Under the framework of adversarial training, different optimization problems can be solved efficiently (Arjovsky et al., 2017; Nowozin et al., 2016). The shared goal of "learning to sample" has inspired connections of GANs with VI (Mescheder et al., 2017) and MCMC (Song et al., 2017).

In this paper, we propose a novel perspective on learning to sample from a target distribution by optimizing the parameters of either an explicit or an implicit probabilistic model. Our objective is inspired by the view of the acceptance rate of the Metropolis-Hastings algorithm as a quality measure of the sampler. We derive a lower bound on the acceptance rate and maximize it with respect to the parameters of the sampler, treating the sampler as a proposal distribution in the Metropolis-Hastings scheme.

We consider two possible forms of the target distribution: an unnormalized density (density-based setting) and a set of samples (sample-based setting). Each of these settings reveals a unifying property of the proposed perspective and the derived lower bound. In the density-based setting, the lower bound is the sum of the forward and reverse KL-divergences between the true posterior and its approximation, connecting our approach to VI. In the sample-based setting, the lower bound admits the form of an adversarial game between the sampler and a discriminator, connecting our approach to GANs.

The closest work to ours is that of Song et al. (2017). In contrast to their paper, our approach (1) is free from hyperparameters; (2) is able to optimize the acceptance rate directly; (3) avoids a minimax problem in the density-based setting.

Our main contributions are as follows:

  1. We introduce a novel perspective on learning to sample from a target distribution by treating the acceptance rate of the Metropolis-Hastings algorithm as a measure of sampler quality.

  2. We derive a lower bound on the acceptance rate that allows for doubly stochastic optimization of the proposal distribution in the case when the target distribution factorizes (e.g. over data points).

  3. For the sample-based and density-based forms of the target distribution, we show the connection of the proposed algorithm to variational inference and GANs.

The rest of the paper is organized as follows. In Section 2 we introduce the lower bound on the acceptance rate (AR). Special forms of the target distribution are addressed in Section 3. In Section 4 we validate our approach on problems of approximate Bayesian inference in the space of high-dimensional neural network weights and generative modeling in the space of images. We discuss the results and directions for future work in Section 5.

2 Acceptance Rate for Metropolis-Hastings Algorithm

2.1 Preliminaries

In the MH algorithm we need to sample from a target distribution p(x) while we are only able to sample from a proposal distribution q(x'|x). One step of the MH algorithm can be described as follows.

  1. sample a proposed point x' ~ q(x'|x), given the previously accepted point x

  2. accept x' with probability min{1, p(x')q(x|x') / (p(x)q(x'|x))}; if rejected, keep x

If the proposal distribution does not depend on x, i.e. q(x'|x) = q(x'), the algorithm is called the independent MH algorithm.
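For concreteness, here is a minimal sketch of one independent MH step in Python; the callables log_p, log_q and sample_q are illustrative assumptions, not part of the paper.

import numpy as np

def independent_mh_step(x, log_p, log_q, sample_q, rng=np.random.default_rng()):
    # x        -- previously accepted point
    # log_p    -- unnormalized log-density of the target p
    # log_q    -- log-density of the independent proposal q
    # sample_q -- draws one sample from q
    x_new = sample_q()
    # log of min{1, p(x')q(x) / (p(x)q(x'))} for an independent proposal
    log_alpha = min(0.0, log_p(x_new) - log_p(x) + log_q(x) - log_q(x_new))
    return x_new if np.log(rng.uniform()) < log_alpha else x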

The quality of the proposal distribution is measured by the acceptance rate and the mixing time. The mixing time defines the speed of convergence of the Markov chain to its stationary distribution. The acceptance rate of the MH algorithm is defined as

$$\text{AR} = \mathbb{E}_{p(x)\,q(x'|x)} \min\{1,\; \xi(x'|x)\}, \qquad (1)$$

where

$$\xi(x'|x) = \frac{p(x')\,q(x|x')}{p(x)\,q(x'|x)}. \qquad (2)$$

In the case of an independent proposal distribution, we show that the acceptance rate defines a semimetric in distribution space between p(x) and q(x) (see Appendix B).

2.2 Optimizing the lower bound on acceptance rate

Although we can maximize the acceptance rate of the MH algorithm (Eq. 1) directly w.r.t. the parameters φ of the proposal distribution q_φ, we propose to maximize a lower bound on the acceptance rate instead. As our experiments show (see Section 4), optimization of the lower bound compares favorably to direct optimization of the acceptance rate. To introduce this lower bound, we first express the acceptance rate in terms of the total variation distance.

Theorem 1

For the random variable ξ(x'|x) defined in Eq. 2,

$$\mathbb{E}_{p(x)\,q_\phi(x'|x)} \min\{1,\; \xi(x'|x)\} \;=\; 1 - \mathrm{TV}\big(p(x)\,q_\phi(x'|x),\; p(x')\,q_\phi(x|x')\big), \qquad (3)$$

where TV(·, ·) is the total variation distance.

The proof of Theorem 1 can be found in Appendix A. This reinterpretation in terms of total variation allows us to lower-bound the acceptance rate through Pinsker's inequality:

$$\text{AR} \;\geq\; 1 - \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p(x)\,q_\phi(x'|x)\;\|\;p(x')\,q_\phi(x|x')\big)}. \qquad (4)$$

The maximization of this lower bound can be equivalently formulated as

$$\min_\phi\; \mathrm{KL}\big(p(x)\,q_\phi(x'|x)\;\|\;p(x')\,q_\phi(x|x')\big). \qquad (5)$$
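As a quick illustration (ours, not from the paper), both sides of Eq. 4 can be estimated numerically for one-dimensional Gaussians with an independent proposal, where the KL term has a closed form:

import numpy as np
from scipy.stats import norm

# target p = N(0, 1), independent proposal q = N(1, 2); illustrative choice
p, q = norm(0.0, 1.0), norm(1.0, 2.0)
x = p.rvs(10**6, random_state=0)
x_new = q.rvs(10**6, random_state=1)

# Monte Carlo estimate of the acceptance rate (Eq. 1)
log_xi = p.logpdf(x_new) + q.logpdf(x) - p.logpdf(x) - q.logpdf(x_new)
ar = np.minimum(1.0, np.exp(log_xi)).mean()

# lower bound (Eq. 4); for an independent proposal the KL term splits into
# KL(p || q) + KL(q || p), each available in closed form for Gaussians
def kl_normal(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

kl = kl_normal(0.0, 1.0, 1.0, 2.0) + kl_normal(1.0, 2.0, 0.0, 1.0)
print(ar, 1.0 - np.sqrt(0.5 * kl))  # the estimated AR dominates the bound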

In the following sections, we show the benefits of this optimization problem in two different settings — when the target distribution is given in the form of an unnormalized density, and when it is given as a set of samples.

3 Optimization of Proposal Distribution

From now on we consider only the optimization problem in Eq. 5, but the proposed algorithms can also be used for direct optimization of the acceptance rate (Eq. 1).

To estimate the loss function (Eq. 5) we need to evaluate the density ratio. In the density-based setting the unnormalized density of the target distribution is given, so we suggest using an explicit proposal distribution and computing the density ratio explicitly. In the sample-based setting, however, we cannot compute the density ratio, so we propose to approximate it via adversarial training (Goodfellow et al., 2014). A brief summary of the constraints for both settings is shown in Table 1.

The following subsections describe the algorithms in detail.

Setting       | Target distribution  | Proposal distribution | Density ratio
Density-based | unnormalized density | explicit model        | computed explicitly
Sample-based  | set of samples       | implicit model        | learned discriminator

Table 1: Constraints for the two settings of learning sampling algorithms

3.1 Density-based Setting

In the density-based setting, we assume the proposal to be an explicit probabilistic model, i.e. a model that we can sample from and whose density we can evaluate at any point up to the normalization constant. We also assume that the proposal is reparameterizable (Kingma & Welling, 2014; Rezende et al., 2014; Gal, 2016).

If the proposal belongs to a parametric family of Markov chain proposals, e.g. a Gaussian random walk, we might face the problem of collapsing to a delta-function. To tackle this problem one can properly choose the parametric family of the proposal, or make the proposal independent of the current point. In Appendix C we provide an intuition for why a Markov chain proposal can collapse to a delta-function while an independent proposal cannot. In this section, we consider only the independent proposal. We also provide empirical evidence in Section 4 that collapsing to a delta-function does not happen for an independent proposal distribution.

Considering an independent proposal q_φ(x'), optimization problem 5 takes the form

$$\min_\phi\; \mathrm{KL}\big(p(x)\,q_\phi(x')\;\|\;p(x')\,q_\phi(x)\big). \qquad (6)$$

The explicit form of the proposal and target distributions allows us to compute the density ratio for any pair of points x, x'. But to estimate the loss in Eq. 6 we also need samples from the target distribution during training. For this purpose, we use the current proposal and run the independent MH algorithm. After obtaining samples from the target distribution, it is possible to perform an optimization step by taking stochastic gradients w.r.t. the parameters φ. Pseudo-code for the resulting procedure is shown in Algorithm 1.

explicit probabilistic model q_φ(x)
unnormalized density p̂(x) of the target distribution
while not converged do
     sample x' ~ q_φ(x')
     sample x approximately from p(x) using the independent MH algorithm with the current proposal
     L̂ ← approximate the loss (Eq. 6) with a finite number of samples
     φ ← perform a gradient descent step w.r.t. L̂
end while
return optimal parameters φ
Algorithm 1 Optimization of the proposal distribution in the density-based case
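A minimal PyTorch sketch of this loop, under simplifying assumptions (a fully-factorized Gaussian proposal, a single MH chain, and a hypothetical log_target callable that returns the unnormalized log-density for inputs whose last dimension is dim):

import torch

def train_proposal(log_target, dim, steps=1000, n_samples=64, lr=1e-3):
    # q_phi: factorized Gaussian with learnable mean and log-std
    mu = torch.zeros(dim, requires_grad=True)
    log_sigma = torch.zeros(dim, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    x = torch.zeros(dim)  # current state of the independent MH chain

    def log_q(z):
        return torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum(-1)

    for _ in range(steps):
        # refresh the approximate target sample x with one independent MH step
        with torch.no_grad():
            cand = mu + log_sigma.exp() * torch.randn(dim)
            log_a = log_target(cand) - log_target(x) + log_q(x) - log_q(cand)
            if torch.rand(()).log() < log_a:
                x = cand

        # reparameterized samples x' ~ q_phi
        x_new = mu + log_sigma.exp() * torch.randn(n_samples, dim)

        # Monte Carlo estimate of KL(p(x)q(x') || p(x')q(x)) (Eq. 6),
        # dropping terms constant in phi
        loss = (log_q(x_new) - log_target(x_new)).mean() - log_q(x)

        opt.zero_grad()
        loss.backward()
        opt.step()

    return mu.detach(), log_sigma.exp().detach()

In practice one would run several chains and multiple MH steps per update; the sketch keeps a single chain for brevity.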

Algorithm 1 could also be employed for direct optimization of the acceptance rate (Eq. 1). We now apply this algorithm to the Bayesian inference problem and show that during optimization of the lower bound we can use minibatches of data, while this is not the case for direct optimization of the acceptance rate. We consider the Bayesian inference problem for a discriminative model on a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i is the feature vector of the i-th object and y_i is its label. For the discriminative model we know the likelihood p(y|x, θ) and the prior distribution p(θ). In order to obtain predictions for some object x, we need to evaluate the predictive distribution

$$p(y\,|\,x, D) = \mathbb{E}_{p(\theta|D)}\; p(y\,|\,x, \theta). \qquad (7)$$

To obtain samples from the posterior distribution p(θ|D) we suggest learning a proposal distribution q_φ(θ) and performing the independent MH algorithm. Thus the optimization problem 6 can be rewritten as follows:

$$\min_\phi\; \mathrm{KL}\big(p(\theta\,|\,D)\,q_\phi(\theta')\;\|\;p(\theta'\,|\,D)\,q_\phi(\theta)\big). \qquad (8)$$

Note that due to the usage of an independent proposal, the minimized KL-divergence splits into the sum of two KL-divergences:

$$\mathrm{KL}\big(p(\theta|D)\,q_\phi(\theta')\;\|\;p(\theta'|D)\,q_\phi(\theta)\big) = \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta|D)\big) + \mathrm{KL}\big(p(\theta|D)\,\|\,q_\phi(\theta)\big). \qquad (9)$$
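To see this, expand the logarithm under the joint expectation; the cross terms separate because the two factors share no arguments:

$$\mathbb{E}_{p(\theta|D)\,q_\phi(\theta')} \log \frac{p(\theta|D)\,q_\phi(\theta')}{p(\theta'|D)\,q_\phi(\theta)} = \mathbb{E}_{p(\theta|D)} \log \frac{p(\theta|D)}{q_\phi(\theta)} + \mathbb{E}_{q_\phi(\theta')} \log \frac{q_\phi(\theta')}{p(\theta'|D)}.$$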

Minimization of the first KL-divergence corresponds to the variational inference procedure:

$$\mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta|D)\big) = \mathbb{E}_{q_\phi(\theta)} \log \frac{q_\phi(\theta)}{p(\theta)\,\prod_{i=1}^{N} p(y_i\,|\,x_i, \theta)} + \log p(D). \qquad (10)$$

The second KL-divergence has only one term that depends on φ. Thus we obtain the following optimization problem:

$$\min_\phi\; \mathbb{E}_{q_\phi(\theta)}\Big[\log q_\phi(\theta) - \log p(\theta) - \sum_{i=1}^{N} \log p(y_i\,|\,x_i, \theta)\Big] - \mathbb{E}_{p(\theta|D)} \log q_\phi(\theta). \qquad (11)$$

The first summand here contains a sum over all objects in the dataset D. We follow doubly stochastic variational inference and suggest performing unbiased estimation of the gradient in Eq. 11 using only minibatches of data. Moreover, we can use recently proposed techniques (Korattikara et al., 2014; Chen et al., 2016) that perform the independent MH algorithm using only minibatches of data. The combination of these two techniques allows us to use only minibatches of data during iterations of Algorithm 1. In the case of direct optimization of the acceptance rate, straightforward usage of minibatches results in biased gradients: in Eq. 1 the product over all training data appears inside the min function.
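Concretely, the sum over objects in Eq. 11 admits the standard unbiased minibatch estimate

$$\sum_{i=1}^{N} \log p(y_i\,|\,x_i, \theta) \;\approx\; \frac{N}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log p(y_i\,|\,x_i, \theta), \qquad \mathcal{B} \subset \{1, \dots, N\},$$

whereas in Eq. 1 the product of likelihoods sits inside the min function, and the expectation of a nonlinear function of a minibatch estimate is not the function of the expectation.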

3.2 Sample-based Setting

In the sample-based setting, we assume the proposal to be an implicit probabilistic model, i.e. a model that we can only sample from. As in the density-based setting, we assume that we are able to perform the reparameterization trick for the proposal.

In this subsection we consider only the Markov chain proposal q_φ(x'|x), but everything can be applied to an independent proposal by simply substituting q_φ(x'|x) with q_φ(x'). From now on we assume our proposal distribution to be a neural network that takes x as input and outputs x'. Parameterizing the proposal distribution by a neural network allows us to easily exclude the delta-function from the space of solutions: we avoid learning the identity mapping by using neural networks with a bottleneck and noisy layers. For a detailed description of the architectures see Appendix E.

The set of samples from the true distribution allows for a Monte Carlo estimate of the loss:

$$\min_\phi\; \mathbb{E}_{p(x)\,q_\phi(x'|x)} \log \frac{p(x)\,q_\phi(x'|x)}{p(x')\,q_\phi(x|x')}. \qquad (12)$$

To compute the density ratio we suggest the well-known technique of density ratio estimation via training a discriminator network. Denoting the discriminator output as D(x, y), we suggest the following optimization problem for the discriminator:

$$\max_D\; \mathbb{E}_{p(x)\,q_\phi(x'|x)} \big[\log D(x, x') + \log(1 - D(x', x))\big]. \qquad (13)$$

Speaking informally, such a discriminator takes two images as input and tries to figure out which image is sampled from the true distribution and which one is generated by one step of the proposal distribution. It is easy to show that the optimal discriminator in problem 13 is

$$D^*(x, y) = \frac{p(x)\,q_\phi(y\,|\,x)}{p(x)\,q_\phi(y\,|\,x) + p(y)\,q_\phi(x\,|\,y)}. \qquad (14)$$

Note that for the optimal discriminator we have D*(x, y) = 1 − D*(y, x). In practice the discriminator is not optimal, and these values can differ significantly. Thus we have four ways to estimate the density ratio that may differ significantly:

$$\frac{p(x')\,q_\phi(x|x')}{p(x)\,q_\phi(x'|x)} \;\approx\; \frac{1 - D(x, x')}{D(x, x')} \;\approx\; \frac{D(x', x)}{D(x, x')} \;\approx\; \frac{D(x', x)}{1 - D(x', x)} \;\approx\; \frac{1 - D(x, x')}{1 - D(x', x)}. \qquad (15)$$

To avoid this ambiguity we suggest using a discriminator of a special structure. Let d(x, y) be a convolutional neural network with a scalar output. Then the output of the discriminator D(x, y) is defined as follows:

$$D(x, y) = \frac{\exp\{d(x, y)\}}{\exp\{d(x, y)\} + \exp\{d(y, x)\}}. \qquad (16)$$

In other words, such a discriminator can be described by the following procedure: for a single neural network d we evaluate two outputs, d(x, y) and d(y, x), and then apply a softmax to these two values, which enforces D(x, y) = 1 − D(y, x) by construction. Summing up all the steps, we obtain Algorithm 2.
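A minimal PyTorch sketch of this symmetric structure (the class name and the backbone network are illustrative; the backbone maps an ordered pair to a scalar):

import torch
import torch.nn as nn

class SymmetricDiscriminator(nn.Module):
    # D(x, y) = softmax over [d(x, y), d(y, x)], so that
    # D(x, y) = 1 - D(y, x) holds by construction (Eq. 16)
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.d = backbone  # any network mapping an ordered pair to a scalar

    def forward(self, x, y):
        logits = torch.stack([self.d(x, y), self.d(y, x)], dim=-1)
        return torch.softmax(logits, dim=-1)[..., 0]  # D(x, y)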

implicit probabilistic model q_φ(x'|x)
large set of samples X from the target p(x)
for a fixed number of iterations do
     sample x ~ p(x) from the dataset X
     sample x' ~ q_φ(x'|x)
     train the discriminator D by optimizing Eq. 13
     L̂ ← approximate the loss (Eq. 12) with a finite number of samples, using D for the density ratio
     φ ← perform a gradient descent step
end for
return parameters φ
Algorithm 2 Optimization of the proposal distribution in the sample-based case
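A minimal PyTorch sketch of this loop; proposal and disc are assumptions standing in for the DCGAN-style networks of Appendix E, and the density ratio is taken as D(x', x) / D(x, x') using the symmetric discriminator above:

import torch
from itertools import cycle

def train_sample_based(proposal, disc, data_loader, steps, lr=2e-4):
    # data_loader is assumed to yield image batches x ~ p(x)
    opt_g = torch.optim.Adam(proposal.parameters(), lr=lr)
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)
    for x, _ in zip(cycle(data_loader), range(steps)):
        x_new = proposal(x)  # x' ~ q_phi(x' | x), noise injected inside

        # discriminator step (Eq. 13)
        d_loss = -(torch.log(disc(x, x_new.detach())) +
                   torch.log(1 - disc(x_new.detach(), x))).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # proposal step: Monte Carlo estimate of Eq. 12, with the density
        # ratio p(x')q(x|x') / (p(x)q(x'|x)) replaced by D(x', x) / D(x, x')
        g_loss = (torch.log(disc(x, x_new)) - torch.log(disc(x_new, x))).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return proposal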

Algorithm 2 could also be employed for direct optimization of the acceptance rate (Eq. 1). However, in Appendix F we provide an intuition that, in this setting, direct optimization of the acceptance rate may suffer from vanishing gradients.

4 Experiments

In this section, we provide experiments for both the density-based and sample-based settings, showing that the proposed procedure is applicable to high-dimensional target distributions. Code for reproducing all of the experiments will be published with the camera-ready version of the paper.

4.1 Toy Problem

Figure 1: Level-plots in parameter space for the toy problem. Left: level-plot for the acceptance rate of the MH algorithm. Right: level-plot for the lower bound of the acceptance rate.

This experiment shows that it is possible to optimize the acceptance rate by optimizing its lower bound. For the target distribution we consider a bimodal Gaussian mixture; for the independent proposal we consider a unimodal Gaussian. We perform stochastic gradient optimization from the same initialization for both objectives (Fig. 1) and obtain approximately the same local maxima.

4.2 Density-based Setting

In the density-based setting, we consider the Bayesian inference problem for the weights of a neural network. In our experiments we consider approximation of the predictive distribution (Eq. 7) as our main goal. To estimate the goodness of the approximation we measure negative log-likelihood and accuracy on the test set.

In subsection 3.1 we show that the lower bound on the acceptance rate can be optimized more efficiently than the acceptance rate itself due to the usage of minibatches. But other questions arise.

  1. Does the proposed objective in Eq. 11 allow for a better estimate of the predictive distribution compared to variational inference?

  2. Does applying the MH correction to the learned proposal distribution allow for a better estimate of the predictive distribution (Eq. 7) than estimation via raw samples from the proposal?

To answer these questions we consider a reduced LeNet-5 architecture (see Appendix D) for the classification task on 20k images from the MNIST dataset (for test data we use the whole MNIST test set). Even after the architecture reduction we still face the challenging task of learning a complex distribution in a high-dimensional space. For the proposal distribution q_φ(θ) we use a fully-factorized Gaussian, and for the prior p(θ) a standard normal distribution.

For variational inference, we train the model from different initializations and pick the model according to the best ELBO. For our procedure, we do the same and choose the model by the maximum value of the acceptance rate lower bound. In Algorithm 1 we propose to sample from the posterior distribution using the independent MH algorithm and the current proposal. In practice it turns out to be better to use the currently learned proposal as the initial state for a random-walk MH algorithm: we start with the proposal mean as the initial point and then use a random-walk proposal with the variances of the current independent proposal. This should be considered a heuristic that improves the approximation of the loss function.

Figure 2: Negative log-likelihood (left) and accuracy (right) on the MNIST test set for variational inference (blue lines) and optimization of the acceptance rate lower bound (orange lines). In both procedures we apply the independent MH algorithm to estimate the predictive distribution.

Optimization of the acceptance rate lower bound results in a better estimate of the predictive distribution than variational inference (see Fig. 2). Direct optimization of the acceptance rate for the same number of epochs results in substantially lower accuracy on the test set; that is why we do not report results for this procedure in Fig. 2.

To answer the second question we estimate the predictive distribution in two ways. The first way is to perform accept/reject steps of the independent MH algorithm with the learned proposal after each epoch, i.e. to perform MH correction of the samples from the proposal. The second way is to take the same number of samples from the proposal q_φ(θ) without MH correction. For both estimates of the predictive distribution, we evaluate the negative log-likelihood on the test set and compare them.

The MH correction of the learned proposal improves the estimate of the predictive distribution for variational inference (right plot of Fig. 3) but does not do so for optimization of the acceptance rate lower bound (left plot of Fig. 3). This fact may be considered implicit evidence that our procedure learns a proposal distribution with a higher acceptance rate.

Figure 3: Test negative log-likelihood for two approximations of the predictive distribution: one based on samples from the proposal distribution and one based on samples after MH correction. The left figure corresponds to optimization of the acceptance rate lower bound, the right figure to variational inference.

4.3 Sample-based Setting

In the sample-based setting, we estimate the density ratio using a discriminator. Hence we do not use the minibatching property of the obtained lower bound (see subsection 3.1), and the optimization problems for the acceptance rate and for the lower bound have the same efficiency in terms of data usage. That is why our main goal in this setting is to compare optimization of the acceptance rate with optimization of the lower bound. Also, in this setting we have a Markov chain proposal, which is interesting to compare with the independent proposal. Summing up, we formulate the following questions:

  1. Does optimization of the lower bound have any benefits compared to direct optimization of the acceptance rate?

  2. Do we face mixing issues while learning a Markov chain proposal in practice?

  3. Can we improve the visual quality of samples by applying the MH correction to the learned proposal?

We use the DCGAN architecture for the proposal and the discriminator (see Appendix E) and apply our algorithm to the MNIST dataset. We consider two optimization problems: direct optimization of the acceptance rate and optimization of its lower bound. We also consider two ways to obtain samples from the approximation of the target distribution: using raw samples from the learned proposal, or performing the MH algorithm with the learned discriminator for density ratio estimation.

In the case of the independent proposal, we show that the MH correction at the evaluation step improves the visual quality of samples: compare Fig. 4(a) with 4(b) for direct optimization of the acceptance rate, and Fig. 4(c) with 4(d) for optimization of its lower bound. Note that in Algorithm 2 we do not apply the independent MH algorithm during training. Potentially, one can use the MH algorithm with any generative model as the proposal distribution, learning a discriminator for density ratio estimation. Also, for this proposal, we observe a negligible difference in the visual quality of samples obtained by direct optimization of the acceptance rate (see Fig. 4(a)) and by optimization of the lower bound (see Fig. 4(c)).

(a)
(b)
(c)
(d)
Figure 4: Samples from the learned independent proposal obtained via optimization of the acceptance rate (panels (a), (b)) and of its lower bound (panels (c), (d)). In panels (b) and (d) we show raw samples from the learned proposal. In panels (a) and (c) we show samples after applying the independent MH correction, using the learned discriminator for density ratio estimation.

In the case of the Markov chain proposal, we show that direct optimization of the acceptance rate results in slow mixing (see Fig. 5(a)): most of the time the proposal generates samples from one of the modes (digits) and rarely switches to another mode. When we optimize the lower bound, the proposal switches between modes frequently (see Fig. 5(b)).

(a)
(b)
Figure 5: Samples from the chain obtained via the MH algorithm with the learned proposal and the learned discriminator for density ratio estimation. Panel (a) corresponds to direct optimization of the acceptance rate, panel (b) to optimization of the lower bound on the acceptance rate. Samples in the chain are obtained one by one, from left to right and top to bottom.

To show that the learned proposal distribution has the Markov property rather than being totally independent, we show samples from the proposal conditioned on two different points from the dataset (see Fig. 6). The difference between samples from these two conditional distributions (Fig. 6(a), 6(b)) reflects the dependence on the conditioning.

Additionally, in Appendix G we present samples from the chain following accepted images, and samples from a chain initialized with noise.

(a)
(b)
Figure 6: Samples from the proposal distribution conditioned on the digit in the red box (a different conditioning point in each panel). The proposal was optimized according to the lower bound on the acceptance rate.

5 Discussion and future work

This paper proposes using the acceptance rate of the MH algorithm as a universal objective for learning to sample from a target distribution. We also propose a lower bound on the acceptance rate that should be preferred over direct maximization of the acceptance rate in many cases. The proposed approach offers many avenues of improvement through combination with techniques from recent developments in MCMC, GANs, and variational inference. For example:

  • The quality of a sampler in the density-based setting could be improved with normalizing flows (Rezende & Mohamed, 2015).

  • We can use stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014) for the loss estimation in Algorithm 1.

  • In the sample-based setting one can use more advanced techniques of density ratio estimation.

Another interesting direction for further research is the design of a family of explicit Markov chain proposals that is resistant to the problem of collapsing to a delta-function. Application of the MH algorithm to improve the quality of generative models also requires exhaustive further exploration and rigorous treatment.

References

Appendix A Proof of Theorem 1

Recall that we have the random variable ξ(x'|x) = p(x')q(x|x') / (p(x)q(x'|x)) with (x, x') ~ p(x)q(x'|x), and that we want to prove the following equalities:

$$\text{AR} \;=\; \mathbb{E}\min\{1,\; \xi\} \;=\; 1 - \mathrm{TV}\big(p(x)\,q(x'|x),\; p(x')\,q(x|x')\big). \qquad (17)$$

The first equality is obvious:

(18)
(19)

The second equality can be proved as follows:

(20)
(21)

where F_ξ is the CDF of the random variable ξ. Note that E ξ = 1, since ∫∫ p(x')q(x|x') dx dx' = 1. Eq. 21 can be rewritten in two ways:

(22)

To rewrite Eq. 21 in the second way we note the following:

(23)

Summing equations 22 and 23 results in the following formula:

(24)

Using the form of ξ we can rewrite the acceptance rate as

(25)

Appendix B Acceptance rate of independent MH defines semimetric in distribution space

In the independent case we have ξ(x'|x) = p(x')q(x) / (p(x)q(x')), and we want to prove that 1 − AR defines a semimetric (or pseudo-metric) in the space of distributions. For this appendix, we denote this quantity by D(p, q) = 1 − AR(p, q). The first two axioms of a metric obviously hold:

There is an example for which the triangle inequality does not hold. For the distributions

(26)

But a weaker inequality can be proved:

(27)
(28)
(29)
(30)
(31)
(32)

Summing up equations 28, 30 and 32, we obtain

(33)
(34)
(35)

Appendix C On collapsing to the delta-function

First, let us consider the case of a Gaussian random-walk proposal q(x'|x) = N(x' | x, σ²). The optimization problem for the acceptance rate takes the form

$$\max_\sigma\; \mathbb{E}_{p(x)\,\mathcal{N}(x'|x,\,\sigma^2)} \min\Big\{1,\; \frac{p(x')}{p(x)}\Big\}, \qquad (36)$$

where the proposal terms cancel because the random-walk proposal is symmetric. It is easy to see that we can obtain an acceptance rate arbitrarily close to 1 by taking σ small enough.

In the case of the independent proposal, we do not have the problem of collapsing to a delta-function. For our work it is important to show non-collapsing during optimization of the lower bound, but the same holds for direct optimization of the acceptance rate. To provide such an intuition we consider the one-dimensional case with some target distribution p(x) and an independent Gaussian proposal. Choosing the proposal variance small enough, we approximate sampling with the independent MH algorithm as sampling on some finite support. On this support, we approximate the target distribution with a uniform distribution (see Fig. 7).

Figure 7: Schematic view of the approximation of the target distribution with a uniform distribution. The red bounding box is enlarged for better comprehension.

For such an approximation, optimization of the lower bound takes the form

(37)
(38)

Here the proposal restricted to this support is a truncated normal distribution. The first KL-divergence can be written as follows:

(39)
(40)
(41)

Here Z is the normalization constant of the truncated normal distribution, and Φ is the CDF of the standard normal distribution. The second KL-divergence is

(42)
(43)

Summing the two KL-divergences and taking the derivative w.r.t. σ, we obtain

(44)
(45)
(46)

To show that the derivative of the lower bound w.r.t. σ is negative, we need to prove that the following inequality holds for positive arguments:

(47)

With a suitable change of variables, we can rewrite inequality 47 as

(48)

By the fundamental theorem of calculus, we have

(49)

Hence,

(50)

Or equivalently,

(51)

Using this inequality twice, we obtain

(52)

and

(53)

Thus, the target inequality reduces to verifying

(54)

Thus, we have shown that the partial derivative of our lower bound w.r.t. σ is negative. Using this fact, we can improve the loss by taking a bigger value of σ. Hence, such a proposal does not collapse to a delta-function.

Appendix D Architecture of the reduced LeNet-5

class LeNet5(BayesNet):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.num_classes = 10
        # ConvFFG / LinearFFG: fully-factorized Gaussian (Bayesian) layers
        self.conv1 = layers.ConvFFG(1, 10, 5, padding=0)
        self.relu1 = nn.ReLU(True)
        self.pool1 = nn.MaxPool2d(2, padding=0)
        self.conv2 = layers.ConvFFG(10, 20, 5, padding=0)
        self.relu2 = nn.ReLU(True)
        self.pool2 = nn.MaxPool2d(2, padding=0)
        self.flatten = layers.ViewLayer([20*4*4])
        self.dense1 = layers.LinearFFG(20*4*4, 10)
        self.relu3 = nn.ReLU()
        self.dense2 = layers.LinearFFG(10, 10)

Appendix E Architectures of neural networks in sample-based setting

In the sample-based setting we use the usual DCGAN architecture for the independent proposal distribution:

class Generator(layers.ModuleWrapper):
    def __init__(self):
        super(Generator, self).__init__()
        self.fc = nn.Linear(100, 128*8*8)
        self.unflatten = layers.ViewLayer([128, 8, 8])
        self.in1 = nn.InstanceNorm2d(128)
        self.us1 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.conv1 = nn.Conv2d(128, 128, 3, stride=1, padding=1)
        self.in2 = nn.InstanceNorm2d(128, 0.8)
        self.lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.us2 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.conv2 = nn.Conv2d(128, 64, 3, stride=1, padding=1)
        self.in3 = nn.InstanceNorm2d(64, 0.8)
        self.lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.conv3 = nn.Conv2d(64, 1, 3, stride=1, padding=1)
        self.tanh = nn.Tanh()

And a slightly modified architecture for the Markov chain proposal distribution:

class Generator(layers.ModuleWrapper):
    def __init__(self):
        super(Generator, self).__init__()

        self.d_conv1 = nn.Conv2d(1, 16, 5, stride=2, padding=2)
        self.d_lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.d_do1 = nn.Dropout2d(0.5)
        self.d_conv2 = nn.Conv2d(16, 4, 5, stride=2, padding=2)
        self.d_in2 = nn.InstanceNorm2d(4, 0.8)
        self.d_lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.d_do2 = nn.Dropout2d(0.5)

        self.b_view = layers.ViewLayer([4*8*8])
        self.b_fc1 = nn.Linear(4*8*8, 256)  # bottleneck layer
        self.b_lrelu = nn.LeakyReLU(0.2, inplace=True)
        self.b_fc2 = nn.Linear(256, 128 * 8 * 8)  # distinct names so both linear layers are registered
        self.b_do = layers.AdditiveNoise(0.5)

        self.e_unflatten = layers.ViewLayer([128, 8, 8])
        self.e_in1 = nn.InstanceNorm2d(128, 0.8)
        self.e_us1 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.e_conv1 = nn.Conv2d(128, 128, 3, stride=1, padding=1)
        self.e_in2 = nn.InstanceNorm2d(128, 0.8)
        self.e_lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.e_us2 = nn.ConvTranspose2d(128, 128, 2, 2)
        self.e_conv2 = nn.Conv2d(128, 64, 3, stride=1, padding=1)
        self.e_in3 = nn.InstanceNorm2d(64, 0.8)
        self.e_lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.e_conv3 = nn.Conv2d(64, 1, 3, stride=1, padding=1)
        self.e_tanh = nn.Tanh()

For both proposals we use the proposed discriminator with the following architecture.

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.conv1 = nn.Conv2d(2, 16, 3, 2, 1)
        self.lrelu1 = nn.LeakyReLU(0.2, inplace=True)
        self.conv2 = nn.Conv2d(16, 32, 3, 2, 1)
        self.lrelu2 = nn.LeakyReLU(0.2, inplace=True)
        self.in2 = nn.InstanceNorm2d(32, 0.8)
        self.conv3 = nn.Conv2d(32, 64, 3, 2, 1)
        self.lrelu3 = nn.LeakyReLU(0.2, inplace=True)
        self.in3 = nn.InstanceNorm2d(64, 0.8)
        self.conv4 = nn.Conv2d(64, 128, 3, 2, 1)
        self.lrelu4 = nn.LeakyReLU(0.2, inplace=True)
        self.in4 = nn.InstanceNorm2d(128, 0.8)
        self.flatten = layers.ViewLayer([128*2*2])
        self.fc = nn.Linear(128*2*2, 1)

    def forward(self, x, y):
        # run the scalar network on both orderings of the pair
        xy = torch.cat([x, y], dim=1)
        for module in self.children():
            xy = module(xy)
        yx = torch.cat([y, x], dim=1)
        for module in self.children():
            yx = module(yx)
        # softmax over the two orderings enforces D(x, y) = 1 - D(y, x)
        return F.softmax(torch.cat([xy, yx], dim=1), dim=1)

Appendix F Intuition for better gradients in sample-based setting

In this section, we provide an intuition, for the sample-based setting, as to why the loss function for the lower bound has better gradients than the loss function for the acceptance rate. First, recall that in the sample-based setting we use a discriminator for density ratio estimation:

$$\xi(x'\,|\,x) = \frac{p(x')\,q_\phi(x\,|\,x')}{p(x)\,q_\phi(x'\,|\,x)} \;\approx\; \frac{D(x', x)}{D(x, x')} \;=\; \hat{\xi}(x'\,|\,x). \qquad (55)$$

For this purpose we use the discriminator of the special structure

$$D(x, y) = \frac{\exp\{d(x, y)\}}{\exp\{d(x, y)\} + \exp\{d(y, x)\}}, \qquad (56)$$

so that the density ratio estimate is ξ̂(x'|x) = exp{d(x', x) − d(x, x')}. We consider the case when the discriminator can easily distinguish fake pairs from valid pairs, so that ξ̂(x'|x) is close to zero for x ~ p(x) and x' ~ q_φ(x'|x). To evaluate gradients we consider Monte Carlo estimates of each loss and take gradients w.r.t. x' in order to obtain gradients for the parameters of the proposal distribution. We do not introduce the reparameterization trick, to simplify the notation, but assume it to be performed. For the optimization of the acceptance rate we have

$$\frac{\partial}{\partial x'} \min\{1,\; \hat{\xi}(x'|x)\} \qquad (57)$$
$$= \mathbb{1}[\hat{\xi}(x'|x) < 1]\; \frac{\partial \hat{\xi}(x'|x)}{\partial x'} \qquad (58)$$
$$= \mathbb{1}[\hat{\xi}(x'|x) < 1]\; \hat{\xi}(x'|x)\; \frac{\partial}{\partial x'}\big[d(x', x) - d(x, x')\big], \qquad (59)$$

while for the optimization of the lower bound we have

$$\frac{\partial}{\partial x'} \log \frac{1}{\hat{\xi}(x'|x)} \qquad (60)$$
$$= -\frac{\partial}{\partial x'} \log \hat{\xi}(x'|x) \qquad (61)$$
$$= -\frac{\partial}{\partial x'}\big[d(x', x) - d(x, x')\big]. \qquad (62)$$

Now we compare Eq. 59 and Eq. 62. We see that in the case of a strong discriminator we have vanishing gradients in Eq. 59 due to the multiplicative factor ξ̂(x'|x) ≈ 0, while this is not the case for Eq. 62.

Appendix G Additional figures for Markov chain proposals in sample-based setting

In this section, we show additional figures for Markov chain proposals. In Fig. 8 we show samples from a chain initialized with noise. In Fig. 9 we show samples from the chain following accepted samples.

(a)
(b)
Figure 8: Samples from the chain initialized with noise. To obtain samples we use the MH algorithm with the learned proposal and the learned discriminator for density ratio estimation. In panel (a) the proposal and discriminator are learned by direct optimization of the acceptance rate; in panel (b), by optimization of the acceptance rate lower bound. Samples in the chain are obtained one by one, from left to right and top to bottom, starting with noise (the first image in the figure).
(a)
(b)
Figure 9: Samples from the chain following accepted samples. To obtain samples we use the MH algorithm with the learned proposal and the learned discriminator for density ratio estimation. In panel (a) the proposal and discriminator are learned by direct optimization of the acceptance rate; in panel (b), by optimization of the acceptance rate lower bound. Samples in the chain are obtained one by one, from left to right and top to bottom.