Implicit Variational Inference with Kernel Density Ratio Fitting

05/29/2017 ∙ by Jiaxin Shi, et al. ∙ Tsinghua University ∙ University of Toronto

Recent progress in variational inference has paid much attention to the flexibility of variational posteriors. Work has been done to use implicit distributions, i.e., distributions without tractable densities, as the variational posterior. However, existing methods with implicit posteriors still face challenges of noisy estimation and can hardly scale to high-dimensional latent variable models. In this paper, we present an implicit variational inference approach with kernel density ratio fitting that addresses these challenges. To the best of our knowledge, this is the first time implicit variational inference has been successfully applied to Bayesian neural networks, where it shows promising results on both regression and classification tasks.


1 Introduction

Bayesian methods have been playing vital roles in machine learning by providing a principled approach for generative modeling, posterior inference and preventing over-fitting 

(Ghahramani, 2015). As it becomes a common practice to build deep models that have many parameters (LeCun et al., 2015), it is even more important to have a Bayesian formulation to capture the uncertainty in these models. For example, Bayesian Neural Networks (BNNs) (Neal, 2012; Blundell et al., 2015)

have shown promise in reasoning about model confidence and learning with few labeled data. Another recent trend is to incorporate deep neural networks as a powerful function mapping between random variables in a Bayesian network, such as deep generative models like variational autoencoders (VAE) 

(Kingma & Welling, 2013).

Except for a few simple examples, Bayesian inference is typically challenging, and variational inference (VI) has been a standard workhorse for approximating the true posterior (Zhu et al., 2017). Traditional VI focuses on factorized variational posteriors to obtain analytical updates (known as mean-field VI). Although recent progress has made VI stochastic, differentiable, and amortized (Hoffman et al., 2013; Paisley et al., 2012; Mnih & Gregor, 2014; Kingma & Welling, 2013), so that analytical updates are no longer required, factorized posteriors are still commonly used as the variational family. This greatly restricts the flexibility of the variational posterior, especially in high-dimensional spaces, and often leads to biased solutions because the true posterior is usually not factorized and thus not in the family. Some works improve the flexibility of variational posteriors by borrowing ideas from invertible transformations of probability distributions (Rezende & Mohamed, 2015; Kingma et al., 2016). In these works, the transformation must be invertible to ensure that the transformed distribution has a tractable density.

Although utilizing invertible transformations is a promising direction for increasing the expressiveness of the variational posterior, we argue that a more flexible variational family can be constructed using general deterministic or stochastic transformations, which are not necessarily invertible. As a consequence, the variational posterior obtained in this way does not have a tractable density, even though there is a way to sample from it. Such distributions are called implicit distributions, and we refer to variational methods that use an implicit variational posterior (also known as variational programs (Ranganath et al., 2016) or wild variational approximations (Liu & Feng, 2016)) as Implicit Variational Inference (implicit VI). Most existing implicit VI methods (Mescheder et al., 2017; Huszár, 2017; Tran et al., 2017) rely on a discriminator to produce estimates of the variational objective and its gradients. As many of them point out, these estimates are often noisy and can lead to unstable training. Moreover, discriminator-based approaches are computationally infeasible when applied to nontrivial BNNs.

In this paper we present an approach named Kernel Implicit Variational Inference (KIVI), which addresses the noisy estimation problem of previous works by providing a principled way of tuning the bias-variance tradeoff. Furthermore, KIVI does not rely on a discriminator and is thus computationally feasible for models with high-dimensional latent variables (e.g., BNNs). KIVI is applicable to both global and local latent variable models, which we demonstrate through experiments on BNNs and VAEs. To the best of our knowledge, this is the first time that implicit VI has been successfully applied to BNNs, where it shows promising results on both regression and classification tasks.

2 Background

Consider a generative model $p(x, z) = p(z)\,p(x \mid z)$, where $x$ and $z$ denote observed and latent variables, respectively. In VI, a variational distribution $q_\phi(z)$ in some parametric family is chosen to approximate the true posterior $p(z \mid x)$ by optimizing the evidence lower bound (ELBO):

$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big), \qquad (1)$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence. This objective is a lower bound of the log-likelihood since it can be written as $\mathcal{L}(\phi) = \log p(x) - \mathrm{KL}(q_\phi(z)\,\|\,p(z \mid x))$. The maximum of this objective is achieved when $q_\phi(z) = p(z \mid x)$. From Eq. (1), we can see that the challenge of using an implicit $q_\phi$ is that calculating $\mathrm{KL}(q_\phi(z)\,\|\,p(z))$ requires evaluating the density of $q_\phi$, which is intractable for an implicit distribution.

Recently, inspired by the probabilistic interpretation of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2016), there have been some works that extend the GAN approach to the posterior inference of latent variable models (LVMs) (Mescheder et al., 2017; Huszár, 2017; Tran et al., 2017). These methods all use an implicit variational family and thus can be categorized as implicit VI methods. One of their key observations is that the density ratio $q_\phi(z)/p(z)$ can be estimated from samples of the two distributions by a probabilistic classifier called the discriminator. They first assign class labels ($y \in \{0, 1\}$) to $q_\phi(z)$ and $p(z)$: let samples from $q_\phi(z)$ be of class $y = 1$, and samples from $p(z)$ be of class $y = 0$. Given an equal class prior, the density ratio at a given point can be calculated as $q_\phi(z)/p(z) = p(y = 1 \mid z)/p(y = 0 \mid z)$, which is the ratio between the class probabilities given the data point. To estimate it, a discriminator $D$ is trained to classify between the two classes with a logistic loss:

$\max_{D}\ \mathbb{E}_{q_\phi(z)}\big[\log D(z)\big] + \mathbb{E}_{p(z)}\big[\log\big(1 - D(z)\big)\big], \qquad (2)$

where $D(z)$ outputs the probability of $z$ being from class 1. Given that $D$ is sufficiently flexible, the optimal solution of Eq. (2) is $D^*(z) = q_\phi(z)/(q_\phi(z) + p(z))$. Therefore, the KL divergence term in the ELBO of Eq. (1) can be approximated as $\mathrm{KL}(q_\phi(z)\,\|\,p(z)) \approx \mathbb{E}_{q_\phi(z)}\big[\log\frac{D^*(z)}{1 - D^*(z)}\big]$. This is called the prior-contrastive form of VI in Huszár (2017); a minimal sketch of this discriminator-based estimator is given after the list below. Note that the ratio approximation does not change the gradients once the approximation is accurate, as we shall see later in Eq. (7). Though incorporating discriminative power in a probabilistic model has shown great success in GANs, this method still suffers from challenging problems when applied to VI:


  • Noisy density ratio estimation (DRE)  In VI, the variational posterior is updated in each iteration. As shown in Eq. (2), the discriminator should be trained to optimality after each update. In practice, however, the inner loop for training the discriminator is often truncated to one or a few iterations. At the beginning of the inference procedure, it is hard for the discriminator to catch up with the variational posterior, and the noisy signal it produces leads to noisy gradients and unstable training. Besides, even if the discriminator quickly reaches the optimum in a small number of iterations, another issue remains. The training loss in Eq. (2) is written in terms of expectations, but in practice we approximate it with samples from the two distributions. When the support of the distributions is high-dimensional, given the limited number of samples used, the variance of this estimate is considerable, i.e., the discriminator tends to overfit the samples. The symptom is that the discriminator reaches a state where the samples are easily distinguished and the probabilities it outputs are near 0 or 1, which is commonly observed in experiments (Mescheder et al., 2017).

  • Computationally infeasible for high dimensional latent variables  As the density ratio is estimated by a discriminator, the samples from the two distributions of latent variables should be fed into it. However, the typically used neural network discriminator cannot afford very high-dimensional inputs (e.g., parameters in a moderate-size Bayesian neural network).
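For concreteness, here is a generic sketch of the discriminator-based (prior-contrastive) estimator discussed above, not the implementation used by any of the cited works: a logistic-regression classifier from scikit-learn plays the role of the discriminator, and the toy densities are chosen so that the true log-ratio is exactly linear.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def discriminator_log_ratio(z_q, z_p):
        # Label q samples as class 1 and prior samples as class 0; with balanced
        # classes, the classifier's logit approximates log q(z) - log p(z).
        X = np.vstack([z_q, z_p])
        y = np.concatenate([np.ones(len(z_q)), np.zeros(len(z_p))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        return lambda Z: clf.decision_function(Z)

    rng = np.random.default_rng(0)
    z_q = rng.normal(1.0, 1.0, size=(500, 1))      # samples from q_phi
    z_p = rng.normal(0.0, 1.0, size=(500, 1))      # samples from the prior p
    log_ratio = discriminator_log_ratio(z_q, z_p)
    kl_estimate = np.mean(log_ratio(z_q))          # KL(q||p) = E_q[log q/p], about 0.5 here
    print(kl_estimate)

As the list above explains, this estimator degrades when the discriminator lags behind the posterior or when the latent space is high-dimensional.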

3 Kernel Implicit Variational Inference

To address the above challenges of implicit VI, we propose to replace the discriminator with a kernel method for DRE. The advantages of this method are that it has a closed-form solution and that it allows us to explicitly trade off bias and variance by tuning a regularization coefficient.

3.1 Estimating the KL Term

Specifically, let $z$ be the latent variable, and let the true density ratio be $r(z) = q_\phi(z)/p(z)$. Consider modeling it with a function $\hat r \in \mathcal{H}$, where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) induced by a positive definite kernel $k(\cdot, \cdot)$. Similar to kernel ridge regression, we use an objective composed of a squared loss for regression plus a penalty on the complexity of the function. For the squared loss we choose the form used by unconstrained Least-Squares Importance Fitting (uLSIF) (Kanamori et al., 2009):

$J(\hat r) = \frac{1}{2}\,\mathbb{E}_{p(z)}\big[(\hat r(z) - r(z))^2\big] = \frac{1}{2}\,\mathbb{E}_{p(z)}\big[\hat r(z)^2\big] - \mathbb{E}_{q_\phi(z)}\big[\hat r(z)\big] + C, \qquad (3)$

where $C$ is a constant. We approximate the expectations in $J(\hat r)$ by Monte Carlo estimates:

$\hat J(\hat r) = \frac{1}{2 n_p}\sum_{j=1}^{n_p} \hat r(z_j^p)^2 - \frac{1}{n_q}\sum_{i=1}^{n_q} \hat r(z_i^q),$

where $n_p$ and $n_q$ are the number of samples from $p$ and $q_\phi$, respectively. Note that in Eq. (3) the expectation of the squared loss is taken w.r.t. $p$, so that the resulting form can be estimated without evaluating the density of either distribution. For the penalty term, the complexity of $\hat r$ is measured by its RKHS norm ($\|\hat r\|_{\mathcal H}$). Putting them together, we get the final objective:

$\min_{\hat r \in \mathcal H}\ \ \frac{1}{2 n_p}\sum_{j=1}^{n_p} \hat r(z_j^p)^2 - \frac{1}{n_q}\sum_{i=1}^{n_q} \hat r(z_i^q) + \frac{\lambda}{2}\,\|\hat r\|_{\mathcal H}^2. \qquad (4)$

Here $\lambda$ is the regularization coefficient.

Proposition 1.

The optimal solution of Eq. (4) lies in the linear subspace spanned by the kernel functions centered at the samples ($\{z_j^p\}_{j=1}^{n_p} \cup \{z_i^q\}_{i=1}^{n_q}$), i.e., it has the form

$\hat r(z) = \sum_{l=1}^{n_p + n_q} \alpha_l\, k(z, z_l), \qquad (5)$

where $\{z_l\}_{l=1}^{n_p + n_q}$ denotes the union of the samples from $p$ and $q_\phi$.
Proof.

This can be seen as a generalization of the representer theorem (Schölkopf et al., 2001) to the density ratio problem, so the proof follows the same procedure. See Appendix A. ∎

Substituting Eq. (5) into Eq. (4) and setting the derivatives w.r.t. the coefficients to zero, we get the optimal solution in closed form (a detailed derivation is given in Appendix B):

(6)

where the matrices involved are built from kernel evaluations at the samples from $p$ and $q_\phi$. We use the common RBF kernel $k(z, z') = \exp\big(-\|z - z'\|^2/(2\sigma^2)\big)$, where $\sigma$ is the kernel bandwidth, determined by the commonly used median heuristic (the median of pairwise distances between the sample points).

Once we have the approximate density ratio function $\hat r$, a Monte Carlo estimate of $\mathrm{KL}(q_\phi(z)\,\|\,p(z))$ can be constructed as $\frac{1}{n_q}\sum_{i=1}^{n_q}\log\hat r(z_i^q)$ with $z_i^q \sim q_\phi(z)$. Note that there is a constraint that the estimated density ratio should be non-negative. However, we do not include it in the optimization objective, in order to keep the closed-form solution, which means some post-processing is needed to ensure this property. We solve the issue by clipping the estimated density ratio from below at a small positive value chosen by grid search. In experiments we found that the algorithm is not sensitive to the clipping value. This is due to the accurate estimation guaranteed by the global optimum in the RKHS, which is a universal family when RBF kernels are used (Carmeli et al., 2010).
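To make the closed-form estimator concrete, the following is a minimal NumPy sketch of a kernel density ratio fit of this kind: an RBF kernel with median-heuristic bandwidth, a ridge-style solve of the regularized uLSIF objective over the span of kernels at the samples, and clipping of the estimated ratio. The function and variable names are ours, the jitter term is added only for numerical stability, and the exact closed form of Eq. (6) in the paper may be written differently.

    import numpy as np

    def rbf_kernel(X, Y, bandwidth):
        # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))

    def median_heuristic(Z):
        # Median of pairwise distances, used as the kernel bandwidth.
        d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * Z @ Z.T
        d = np.sqrt(np.maximum(d2, 0.0))
        return np.median(d[np.triu_indices_from(d, k=1)]) + 1e-12

    def fit_kernel_ratio(num_samples, den_samples, lam=1e-3, clip_min=1e-8):
        # Closed-form kernel estimate of r(z) = p_num(z) / p_den(z) from samples:
        # minimize the empirical uLSIF loss 1/2 E_den[r^2] - E_num[r] plus an
        # RKHS-norm penalty, with r restricted to the span of kernels at all samples.
        centers = np.vstack([num_samples, den_samples])
        h = median_heuristic(centers)
        K = rbf_kernel(centers, centers, h)          # Gram matrix at all centers
        Kd = rbf_kernel(den_samples, centers, h)     # kernel rows at denominator samples
        Kn = rbf_kernel(num_samples, centers, h)     # kernel rows at numerator samples
        A = Kd.T @ Kd / len(den_samples) + lam * K + 1e-8 * np.eye(len(centers))
        b = Kn.T @ np.ones(len(num_samples)) / len(num_samples)
        alpha = np.linalg.solve(A, b)
        # Return a clipped ratio estimate, as described above.
        return lambda Z: np.maximum(rbf_kernel(Z, centers, h) @ alpha, clip_min)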

The reverse ratio trick  Another technique is essential to improve the estimation of the KL term, which we call the reverse ratio trick. The key observation is that the expectation in the squared loss of Eq. (3) is taken w.r.t. $p$, whereas the expectation in the KL term ($\mathrm{KL}(q_\phi\,\|\,p) = \mathbb{E}_{q_\phi}[\log q_\phi(z)/p(z)]$) is taken w.r.t. $q_\phi$. Unless $p$ and $q_\phi$ match very well where they put most of their probability mass, a small squared loss does not always mean a good KL estimate. The solution is a simple trick: instead of estimating $r(z) = q_\phi(z)/p(z)$, we choose to estimate the reverse ratio $\tilde r(z) = p(z)/q_\phi(z)$ and compute the KL term as $-\mathbb{E}_{q_\phi}[\log\tilde r(z)]$. We denote the estimated reverse density ratio by $\hat{\tilde r}$; the corresponding KL estimate is $-\frac{1}{n_q}\sum_{i=1}^{n_q}\log\hat{\tilde r}(z_i^q)$. Note that in this way the squared loss changes to $\frac{1}{2}\mathbb{E}_{q_\phi}\big[(\hat{\tilde r}(z) - \tilde r(z))^2\big]$, whose expectation is taken w.r.t. the same probability measure ($q_\phi$) as the KL term's. As we shall see in experiments (Appendix F.1), the trick is essential to make the estimation sufficiently accurate for VI.
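Continuing the sketch above (with our hypothetical fit_kernel_ratio), the reverse ratio trick amounts to fitting $p/q_\phi$ and averaging its negative log over samples from $q_\phi$; the toy densities below are chosen so the true KL is known.

    # Toy check: q = N(1, 1), p = N(0, 1), so KL(q || p) = 0.5 analytically.
    rng = np.random.default_rng(0)
    z_q = rng.normal(1.0, 1.0, size=(200, 1))         # samples from q_phi
    z_p = rng.normal(0.0, 1.0, size=(200, 1))         # samples from the prior p
    reverse_ratio = fit_kernel_ratio(num_samples=z_p, den_samples=z_q)   # estimates p / q_phi
    kl_estimate = -np.mean(np.log(reverse_ratio(z_q)))                   # KL(q||p) = -E_q[log p/q]
    print(kl_estimate)   # roughly 0.5, up to Monte Carlo and kernel-approximation error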

Gradient computation  We now consider how to estimate the gradients of the KL term w.r.t. the variational parameters $\phi$. First, it is easy to prove, as in Huszár (2017), that

$\nabla_\phi\,\mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big) = \nabla_\phi\,\mathbb{E}_{q_\phi(z)}\Big[\log\frac{q_{\phi'}(z)}{p(z)}\Big]\Big|_{\phi' = \phi}. \qquad (7)$

A detailed proof is in Appendix C. Eq. (7) indicates that the true gradients of the KL term w.r.t. $\phi$ do not flow through the density ratio function. We can therefore replace the ratio on the right-hand side with its estimate: $\nabla_\phi\,\mathrm{KL}(q_\phi\,\|\,p) \approx -\nabla_\phi\,\mathbb{E}_{q_\phi(z)}[\log\hat{\tilde r}(z)]$. Note that without Eq. (7) we could not make this approximation, since $\hat{\tilde r}$ has zero gradients w.r.t. $\phi$. Then the reparameterization trick (Kingma & Welling, 2013) can be used: writing $z = f_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$, the gradient is estimated as $-\frac{1}{n_q}\sum_{i=1}^{n_q}\nabla_\phi\log\hat{\tilde r}\big(f_\phi(\epsilon_i)\big)$.

3.2 The Algorithm

We have constructed a closed-form estimate of the KL term and shown that its gradients can be estimated by the reparameterization trick. Note that the reparameterization trick can also be used to compute the gradients of the reconstruction term in Eq. (1) and thus can be applied to the whole ELBO. See Alg. 1 for the complete algorithm. Note that the number of samples used in the reconstruction term can be different from the number required for the KL estimation, and can be reduced when the model is expensive (as in our VAE experiments). Thus, compared to ordinary reparameterized VI, the extra computational cost mainly lies in computing the inverse of the matrix in Eq. (6). As we shall see in the experiments, tens to a hundred samples are sufficient to obtain a stable KL estimate, so the added cost is not high.

0:  Observed data $x$, model $p(x, z)$.
0:  Implicit variational posterior $q_\phi(z)$, sample sizes $n_p$, $n_q$, regularization coefficient $\lambda$.
1:  repeat
2:     Sample from the prior: $z_j^p \sim p(z)$, $j = 1, \ldots, n_p$.
3:     Sample from the variational posterior: $z_i^q = f_\phi(\epsilon_i)$, $\epsilon_i \sim p(\epsilon)$, $i = 1, \ldots, n_q$.
4:     Compute the density ratio estimate by Eqs. (5) and (6) and clip it to be positive at the $z_i^q$s.
5:     Compute the KL estimate and the ELBO estimate $\hat{\mathcal L}$.
6:     Estimate $\nabla_\phi \hat{\mathcal L}$ with the reparameterization trick.
7:     Do gradient ascent with $\nabla_\phi \hat{\mathcal L}$.
8:     (Optional) For parameter learning, do gradient ascent with the gradients of $\hat{\mathcal L}$ w.r.t. the model parameters.
9:  until Convergence
Algorithm 1 Kernel Implicit Variational Inference (KIVI)

KIVI addresses the two challenges stated in Sec. 2. First, the ratio estimates are given in closed form, so there is no problem of the estimator failing to catch up with the variational posterior. Second, the bias-variance tradeoff of the estimation can be controlled by the regularization coefficient $\lambda$: when $\lambda$ is smaller, the estimator fits the samples more aggressively; when $\lambda$ is larger, the estimated ratio function is smoother. By choosing an appropriate $\lambda$, the variance of the gradients can be controlled, in contrast to the extreme ratio estimates given by discriminators whose output probabilities saturate near 0 or 1. Moreover, KIVI is directly applicable to both global and local latent variable models (LVMs), which is an advantage over nonparametric VI methods like particle mirror descent (Dai et al., 2015) and Stein variational gradient descent (SVGD) (Liu & Wang, 2016). For the task of training local LVMs like VAEs, we additionally use the adaptive contrast (AC) technique (Mescheder et al., 2017), whose details are summarized in Appendix D.

4 Example: Implicit Variational Bayesian Neural Networks

Now we present an example of using KIVI for Bayesian neural networks (BNNs), which have received increasing attention due to their ability to model uncertainty, an important factor in many tasks such as adversarial defense and reinforcement learning. However, although we have removed the need for a discriminator, it is still nontrivial to apply KIVI to BNNs, because we need to design an implicit posterior that outputs very high-dimensional samples of the latent variables (the weights). Existing implicit posteriors (Mescheder et al., 2017; Song et al., 2017) based on traditional fully-connected neural networks cannot handle such a high-dimensional output space. We present the Matrix Multiplication Neural Network (MMNN), an efficient architecture for sampling large matrices. Deploying MMNNs, KIVI easily scales up to large BNNs.

In BNNs, a prior $p(W)$ is specified over the neural network parameters $W = \{W_l\}_{l=1}^{L}$, where $W_l$ denotes the weights in the $l$-th layer. Given an input $x$, the output $y$ is modeled with $p(y \mid x, W)$, a distribution parameterized by the output of the feed-forward network $f_{\mathrm{NN}}(x; W)$ (and, for regression, an observation noise variance). For regression, $p(y \mid x, W)$ is usually a Gaussian with $f_{\mathrm{NN}}(x; W)$ as the mean. For classification, it is usually a discrete distribution with $f_{\mathrm{NN}}(x; W)$ as the unnormalized log probabilities.
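As an illustration of how such a model is used at test time, the sketch below forms a Monte Carlo predictive distribution for the regression case by averaging the Gaussian likelihood over weight samples; the sampler, names, and shapes are ours and purely illustrative.

    import numpy as np

    def bnn_predictive(x, weight_samples, noise_std):
        # Average N(y | f_NN(x; W), noise_std^2) over posterior samples of W
        # for a one-hidden-layer ReLU regression network.
        outputs = []
        for W1, b1, W2, b2 in weight_samples:          # one sampled set of network weights
            h = np.maximum(x @ W1 + b1, 0.0)
            outputs.append(h @ W2 + b2)
        outputs = np.stack(outputs)                    # (n_samples, n_points, 1)
        mean = outputs.mean(axis=0)
        var = outputs.var(axis=0) + noise_std ** 2     # weight uncertainty + observation noise
        return mean, var

    # Illustrative shapes only: 20 weight samples, 13 input features, 50 hidden units.
    rng = np.random.default_rng(0)
    samples = [(rng.standard_normal((13, 50)), np.zeros(50),
                rng.standard_normal((50, 1)), np.zeros(1)) for _ in range(20)]
    mean, var = bnn_predictive(rng.standard_normal((5, 13)), samples, noise_std=0.1)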

The true posterior of $W$ in BNNs is intractable, so we turn to VI and use a variational posterior $q_\phi(W)$ to approximate it. Denoting the dataset by $\mathcal D = \{(x_n, y_n)\}_{n=1}^{N}$, we have the ELBO:

$\mathcal L(\phi) = \mathbb{E}_{q_\phi(W)}\Big[\sum_{n=1}^{N}\log p(y_n \mid x_n, W)\Big] - \mathrm{KL}\big(q_\phi(W)\,\|\,p(W)\big).$

The variational posterior is usually set to be factorized by layer: $q_\phi(W) = \prod_{l=1}^{L} q_\phi(W_l)$. However, previous methods used variational posteriors with a limited capacity for each $q_\phi(W_l)$, including factorized Gaussians (Hernandez-Lobato & Adams, 2015), matrix variate Gaussians (Louizos & Welling, 2016; Sun et al., 2017) and normalizing flows (Louizos & Welling, 2017). Enabled to learn implicit variational posteriors, we propose to adopt a general distribution without an explicit density function, whose samples are generated by propagating random noise through the network described in Alg. 2.

0:  Input matrix $M_0$ (random noise)
0:  Network parameters $\{W_l^{\mathrm{L}}, B_l^{1}, W_l^{\mathrm{R}}, B_l^{2}\}_{l=1}^{L}$
1:  for $l = 1, \ldots, L$ do
2:     Left multiplication: $H_l = W_l^{\mathrm{L}} M_{l-1} + B_l^{1}$
3:     Right multiplication: $M_l = H_l W_l^{\mathrm{R}} + B_l^{2}$
4:     if $l < L$ then
5:         $M_l = \mathrm{ReLU}(M_l)$
6:     end if
7:  end for
8:  Output $M_L$
Algorithm 2 MMNN

Here the network is a transformation parameterized by $\phi$, and its outputs are treated as samples from $q_\phi(W_l)$. The key challenge is that the $W_l$ are very high-dimensional even for moderate-size neural networks. Thus, we often cannot use a fully-connected neural network (MLP) as the transformation. Inspired by low-rank matrix factorization (Koren et al., 2009), we propose a new kind of network, the Matrix Multiplication Neural Network (MMNN), to serve this role, as shown in Alg. 2. In each layer of an MMNN, an input matrix is left-multiplied by a parameter matrix and added to a bias matrix; the result is then right-multiplied by another parameter matrix and added to a second bias matrix. Finally, it is passed through a nonlinear activation (e.g., ReLU). We call such a layer a Matrix Multiplication layer.

To model the implicit posterior of $W_l$, we only need to randomly sample a matrix of much smaller size and propagate it through the MMNN to get the output variational sample. When modeling a matrix, an MMNN has significant computational advantages over an MLP due to its low-rank structure. For example, to model an $n \times n$ weight matrix with a single matrix multiplication layer whose input matrix is $m \times m$ ($m \ll n$), the parameters needed are the two multiplication matrices and the two bias matrices, whose total size is on the order of $n^2 + nm$; if a single fully-connected layer were used to map the flattened input to the flattened output, the parameter size would be $m^2 n^2$, which is much larger. Thus, we can use an MMNN in the variational posterior for normal-size neural networks. In tasks with very small networks, we still use an MLP.
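The following NumPy sketch gives one possible reading of Alg. 2 (the exact ordering of the bias additions and the bias shapes are interpreted from the description above, and all sizes are illustrative rather than the paper's):

    import numpy as np

    def mmnn_sample(M, params, activation=lambda X: np.maximum(X, 0.0)):
        # Propagate a small random matrix M through matrix-multiplication layers
        # to produce a sample of a large weight matrix. params is a list of
        # (W_left, B1, W_right, B2) tuples; the last layer is linear.
        for i, (W_left, B1, W_right, B2) in enumerate(params):
            M = (W_left @ M + B1) @ W_right + B2     # left multiply + bias, right multiply + bias
            if i < len(params) - 1:                   # nonlinearity on all but the output layer
                M = activation(M)
        return M

    # Illustrative sizes only: map 50x50 noise to a 400x400 weight sample through
    # one hidden matrix-multiplication layer of size 100x100.
    rng = np.random.default_rng(0)
    shapes = [(50, 50), (100, 100), (400, 400)]
    params = []
    for (r_in, c_in), (r_out, c_out) in zip(shapes[:-1], shapes[1:]):
        params.append((0.1 * rng.standard_normal((r_out, r_in)),    # W_left
                       np.zeros((r_out, c_in)),                     # B1
                       0.1 * rng.standard_normal((c_in, c_out)),    # W_right
                       np.zeros((r_out, c_out))))                   # B2
    weight_sample = mmnn_sample(rng.standard_normal((50, 50)), params)
    print(weight_sample.shape)   # (400, 400)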

5 Related Work

Our work closely relates to the works on implicit generative models (IGMs, generative models that define implicit distributions) and density ratio estimation (DRE). IGMs have drawn much attention due to the popularity of GANs. General learning algorithms of implicit models have been surveyed in Mohamed & Lakshminarayanan (2016), where DRE plays a central role. The connection between DRE and GANs is also discussed in Uehara et al. (2016). For a comprehensive review of DRE, we refer the readers to the survey (Hido et al., 2011). We also refer the readers to many other works by Sugiyama and his collaborators on DRE, such as KLIEP (Sugiyama et al., 2008), LSIF (Kanamori et al., 2009), and DRE based on Bregman divergence (Sugiyama et al., 2012).

Our work also builds upon the recent VI methods, including stochastic approximation by mini-batches (Hoffman et al., 2013), direct gradient optimization of variational lower bounds (Paisley et al., 2012; Mnih & Gregor, 2014), and the reparameterization trick for training continuous LVMs (Kingma & Welling, 2013). Following the success of learning with IGMs, implicit distributions are applied to VI. Many of them are based on discriminators, which can be divided into two categories: prior-contrastive (PC) and joint-contrastive (JC) (Huszár, 2017)

. In PC methods discriminators distinguish between samples from the prior and those from the variational posterior, while in JC methods they distinguish between the model joint distribution and the joint distribution composed of the data distribution and the variational posterior. Concurrent with

Huszár (2017), Mescheder et al. (2017) proposed Adversarial Variational Bayes (AVB), which is an amortized version of PC methods for training local LVMs like VAEs. Prior to Huszár (2017), similar ideas with JC methods have been proposed in ALI (Dumoulin et al., 2016) and Bi-GAN (Donahue et al., 2016). Nonparametric VI methods such as PMD (Dai et al., 2015) and SVGD (Liu & Wang, 2016) that adapt a set of particles towards the true posterior are also closely related to implicit VI. They share the similar advantage of flexible approximations. More recently, the amortized version of SVGD has been developed (Liu & Feng, 2016) and the same idea has been applied to MCMC (Li et al., 2017). It was further shown in Li & Turner (2017) that the core identity in SVGD (Stein’s identity) could also be employed to approximate the gradients of implicit distributions.

6 Experiments

Figure 1: Fitting a Gaussian mixture distribution: (a) VI with a normal posterior; (b) KIVI.

We present empirical results on both synthetic and real datasets to demonstrate the benefits of KIVI. All implementations are based on ZhuSuan (Shi et al., 2017).

6.1 Toy 1-D Gaussian Mixtures

We first conduct a toy experiment to approximate a 1-D Gaussian mixture distribution with VI. The Gaussian mixture has two equally weighted unit-variance components with means -3 and 3. We compare KIVI with VI using a single Gaussian posterior (Fig. 1). The variational distribution used by KIVI generates samples by propagating a standard normal distribution through a two-layer MLP with 10 hidden units in each layer and one output unit. As shown, the Gaussian posterior converges to a single mode, whereas KIVI accurately approximates both modes with an expressive variational posterior. We defer another toy experiment on 2-D Bayesian logistic regression to Appendix F.1, where the importance of the reverse ratio trick is illustrated.

6.2 Bayesian Neural Networks (BNNs)

As stated in Sec. 4, the latent variables in BNNs are global to all data points and are usually very high-dimensional, for which a flexible variational family is essential. We compare KIVI with state-of-the-art VI methods by doing regression and classification on standard benchmarks.

6.2.1 Regression

Dataset Avg. Test RMSE Avg. Test LL
SVGD Dropout KIVI SVGD Dropout KIVI
Boston 2.957±0.099 2.97±0.19 2.798±0.173 -2.504±0.029 -2.46±0.06 -2.527±0.102
Concrete 5.324±0.104 5.23±0.12 4.702±0.116 -3.082±0.018 -3.04±0.02 -3.054±0.043
Energy 1.374±0.045 1.66±0.04 0.467±0.015 -1.767±0.024 -1.99±0.02 -1.298±0.005
Kin8nm 0.090±0.001 0.10±0.00 0.075±0.001 0.984±0.008 0.95±0.01 1.162±0.008
Naval 0.004±0.000 0.01±0.00 0.001±0.000 4.089±0.012 3.80±0.01 5.501±0.121
Combined 4.033±0.033 4.02±0.04 3.976±0.037 -2.815±0.008 -2.80±0.01 -2.794±0.009
Protein 4.606±0.013 4.36±0.01 4.255±0.019 -2.947±0.003 -2.89±0.00 -2.868±0.005
Wine 0.609±0.010 0.62±0.01 0.629±0.008 -0.925±0.014 -0.93±0.01 -0.958±0.015
Yacht 0.864±0.052 1.11±0.09 0.737±0.068 -1.225±0.042 -1.55±0.03 -2.123±0.010
Year 8.684±NA 8.849±NA 8.950±NA -3.580±NA -3.588±NA -3.615±NA
Table 1: Average test set RMSE, predictive log-likelihood for the regression datasets.

To quantitatively measure the predictive ability of BNNs with KIVI as the inference method, we use standard multivariate regression benchmarks from recent works (Table 1), following the setup of probabilistic backpropagation (PBP) (Hernandez-Lobato & Adams, 2015). We compare with state-of-the-art methods: the Bayesian interpretation of dropout (Gal & Ghahramani, 2016) and Stein variational gradient descent (SVGD) (Liu & Wang, 2016). (Note that VMG (Louizos & Welling, 2016) and PBP-MV (Sun et al., 2017) used adaptive weight priors, which differ from the common setting of standard normal priors; the former also additionally used variational dropout, so their results are not directly comparable to the works discussed here or to ours.) Following the setup in PBP, we use BNNs with one 50-unit hidden layer, except on the two large datasets, Protein Structure and Year Prediction MSD, where 100 units are used. We randomly split each dataset into a training part and a held-out test part, and we put a Gamma prior on the precision of the observation noise to learn it adaptively (see Appendix E). For all datasets, we set the batch size to 100 and the learning rate to 0.001. The model is trained for 3000 epochs on the small datasets with fewer than 1000 data points, and 500 epochs on the others. We report the mean errors and standard deviations averaged over 20 runs, except 5 runs for Protein Structure and 1 run for Year Prediction MSD. As the networks used in these tasks are of a small scale, we use MLPs with one hidden layer in the implicit variational posterior (see Appendix G.2.1 for details).

Table 1 shows the results, with the best ones marked in bold. The results of SVGD and dropout are cited from their papers, which use the same setting as ours. KIVI outperforms SVGD and dropout on both RMSE and test LL for most datasets; on RMSE in particular, it shows significant improvements except on Wine and Year Prediction MSD. This suggests that KIVI enables the implicit variational posterior to capture the predictive uncertainty in the network parameters, which is hard to fully describe with a mixture of two delta distributions (dropout) or a fixed set of particles (SVGD). We emphasize that although the nonparametric nature of SVGD also makes the approximation more flexible, it uses the same set of particles throughout the inference procedure, while each iteration of KIVI generates a new set of particles; thus the implicit posterior learned by KIVI is smoothed by the parametric model. Recently, normalizing flows have shown good performance on BNNs (Louizos & Welling, 2017), so we also experiment with directly applying normalizing flows to this task. The results are reported in Appendix F.2.

6.2.2 Classification

Method # Hidden # Weights Test err.
SGD (Simard et al., 2003) 800 1.3m
Dropout (Srivastava et al., 2014)
Dropconnect (Wan et al., 2013) 800 1.3m
Bayes B. (Blundell et al., 2015), 400 500k
with Gaussian posterior 800 1.3m
1200 2.4m
Bayes B. (Blundell et al., 2015), 400 500k
with scale mixture prior 800 1.3m
1200 2.4m
KIVI 400 500k
800 1.3m
1200 2.4m
Figure 2: Results for MNIST classification. The left table shows the test error rates; marked entries are not directly comparable to ours: Wan et al. (2013) used an ensemble of networks, and the second part of Blundell et al. (2015) changed the prior to a scale mixture. The plot on the right shows the training lower bound for MNIST classification with prior-contrastive and KIVI.

For classification, we present results on MNIST, which consists of 60,000 training images and 10,000 test images of handwritten digits. Compared to the above regression datasets, MNIST has a much higher feature dimension, which introduces millions of parameters even in moderate-size networks and poses a significant challenge for BNNs. As a standard benchmark, performance on MNIST can be improved by many other techniques, such as convolutions, generative pre-training, and data augmentation. To ensure a fair comparison, we follow the settings of Bayes-By-Backprop (Blundell et al., 2015) and focus on improving the performance of ordinary MLPs without any of these techniques. The network structures used are three MLPs, all with two ReLU hidden layers, with layer sizes 400, 800, and 1200, respectively. For KIVI, we use MMNNs with two hidden matrix multiplication layers in the implicit posterior (see Appendix G.2.2 for details, including the regularization coefficient). We train for 300 epochs with a batch size of 100. The initial learning rate is 0.001 and is annealed by a factor of 0.5 every 100 epochs. We use the last 10,000 samples of the training set as the validation set for model selection.

Fig. 2 summarizes the results. We can see that KIVI achieves better accuracy compared to plain SGD (Simard et al., 2003), dropout (Srivastava et al., 2014) and Bayes-By-Backprop (Blundell et al., 2015) on all three types of MLPs. KIVI even performs better than Bayes-By-Backprop with a changed prior (scale mixture), which makes the model itself more flexible than ours. When the layer size is 800, our result is comparable to that of an ensemble of 5 networks with dropconnect (Wan et al., 2013), which demonstrates that the implicit posterior has been learned to account for most model uncertainty.

We also conduct experiments with the prior-contrastive (PC) method (the counterpart of AVB for global latent variables; see Sec. 2). The key challenge in applying PC here is that the posterior samples are extremely high-dimensional, and feeding them into a neural network discriminator would incur an unaffordable computational cost. To get around this, we use a logistic regression as the discriminator in PC. The experiment settings of PC are reported in Appendix G.2.2. The training lower bounds of the two methods are plotted in Fig. 2. We can see that at the beginning they increase at the same pace; then PC fails to converge, with the lower bound exploding, while KIVI improves consistently. The explosion is mainly because the input to the discriminator has hundreds of thousands of dimensions, and plain logistic regression cannot produce reliable density ratio estimates there. We also experiment with PC for layer sizes 800 and 1200; both fail to converge in the end.

6.3 Variational Autoencoders

Figure 3: Variational autoencoders: (a) Gaussian posterior vs. implicit posterior, where fc denotes a fully-connected layer; these are used by the plain VAE and by KIVI, respectively. (b) Training and evaluation curves of the lower bounds on statically binarized MNIST.

As stated in Sec. 3.2, KIVI is applicable to both global and local LVMs, and the latter can be learned with an amortized scheme (i.e., using $q_\phi(z \mid x)$ instead of $q_\phi(z)$). Here we present an application to training variational autoencoders (VAE) with implicit posteriors to demonstrate this. The generative process of VAEs proceeds as $z_n \sim p(z)$, $x_n \sim p(x_n \mid z_n)$, where the $z_n$ are latent features, the $x_n$ are observations, and $p(x \mid z)$ is the output distribution for modeling the observations, which takes the outputs of a neural network applied to $z$ as its parameters. We conduct experiments on two widely used datasets for generative modeling: binarized MNIST and CelebA (Liu et al., 2015). The output distribution is a Bernoulli for binarized MNIST and a normal distribution for CelebA. For VAEs, the variational lower bound has the same form as Eq. (1), except that $q_\phi(z)$ is replaced by $q_\phi(z \mid x)$. The original VAE parameterizes $q_\phi(z \mid x)$ as a Gaussian whose mean and variance are output by a neural network. In the MNIST case, this network is an MLP with two hidden layers, illustrated in Fig. 3(a) (left). To form an implicit posterior, a direct choice is to move the stochastic noise from the output layer to the penultimate hidden layer, as illustrated in Fig. 3(a) (right).
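A minimal NumPy sketch of the two parameterizations contrasted above (all names and sizes are ours, and the additive form of the injected noise is an assumption consistent with the description in Appendix G.3, not the exact architecture):

    import numpy as np

    rng = np.random.default_rng(0)
    d_x, d_h, d_z = 784, 500, 8              # sizes mirroring the MNIST setup in the text

    def relu(a):
        return np.maximum(a, 0.0)

    def gaussian_encoder(x, W1, W2, W_mu, W_lv):
        # Plain VAE encoder: the MLP outputs mean and log-variance; noise enters at the output.
        h = relu(relu(x @ W1) @ W2)
        mu, logvar = h @ W_mu, h @ W_lv
        return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

    def implicit_encoder(x, W1, W2, W_out, log_noise_std):
        # Implicit encoder: additive noise with trainable variances is injected at a
        # hidden layer, so q(z|x) is easy to sample from but has no tractable density.
        h = relu(x @ W1)
        h = h + np.exp(log_noise_std) * rng.standard_normal(h.shape)
        return relu(h @ W2) @ W_out

    # Biases omitted for brevity; weights below are random placeholders.
    x = rng.standard_normal((16, d_x))
    W1 = 0.01 * rng.standard_normal((d_x, d_h))
    W2 = 0.01 * rng.standard_normal((d_h, d_h))
    z = implicit_encoder(x, W1, W2, 0.01 * rng.standard_normal((d_h, d_z)), np.zeros(d_h))
    print(z.shape)   # (16, 8)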

Before applying KIVI, a crucial question is what we expect to gain by using implicit posteriors for training VAEs. One target could be tighter lower bounds on the data log-likelihood (LL), because the algorithm searches a larger variational family for the optimal lower bound. This suggests, though only weakly, that optimizing over the larger space leads to better test LL, given that the optimization always arrives at local optima. Previously, Adversarial Variational Bayes (AVB) reported results on MNIST comparing the test LL of the plain VAE and the VAE trained by AVB, using ground truths estimated by annealed importance sampling (AIS) (Wu et al., 2016). However, in the results reported for AVB, the model architectures used by the plain VAE and by AVB are very different, which raises concerns about which part of the change contributes to the improved likelihoods.

Here we adopt another setting to better demonstrate the gain from implicit posteriors. We observe that the key improvement of implicit posteriors is that objectives with them average over a much broader range of posterior configurations. This effect not only yields a larger search space containing tighter lower-bound values, but also helps the VAE model prevent overfitting. To verify that KIVI keeps this property, we conduct experiments on a statically binarized MNIST dataset and use models with no prior knowledge of the problem (MLPs instead of convnets), a typical setting that leads to overfitting. The latent dimension of the VAE model is 8, and the decoder is an MLP with two hidden layers of size 500. The hyper-parameter settings for KIVI and more details can be found in Appendix G.3. Fig. 3(b) shows the training and testing curves of VAEs with and without KIVI. The lower-bound gap between the training and testing curves of the plain VAE is much larger than that of KIVI, which indicates that the VAE trained by KIVI is less prone to overfitting. After 3k epochs we evaluate the test LL on 2048 test images using AIS and obtain -97.3 for the plain VAE and -94.8 for the VAE trained by KIVI.

Figure 4: Interpolation experiments for CelebA: AVB (left); KIVI (right).

To demonstrate that KIVI scales to larger models, we train a VAE with the same network structure used by DCGAN (Radford et al., 2015) on CelebA using KIVI. The latent dimension in this case is 32. The implicit posterior is constructed in a way similar to the one shown in Fig. 3(a), with the bottom hidden layers symmetric to the decoder network (see Appendix G.3 for details). To visually check the latent space learned by KIVI, we show the reconstruction results of linearly interpolating between the latent vectors of two real images after 25 epochs (Fig. 4), when the model has converged. Compared to the latent space learned by AVB after the same number of epochs (we use the public code of AVB and set the same decoder structure), we find ours is smoother and the interpolated images are much sharper. More interpolation results through the training process are presented and compared in Appendix F.5. We also use the learned model to generate 10,000 images and evaluate the sample quality using the Fréchet Inception Distance (FID) (Heusel et al., 2017). The FID scores achieved at epoch 25 by AVB and KIVI are 160 and 41, respectively (smaller is better). In fact, considerable effort is required to make AVB successfully train a model for CelebA that produces the results shown in the figure; as reported in Mescheder et al. (2017), the log prior is explicitly added to the discriminator output. In contrast, KIVI does not need much tuning: there is no need to carefully design a discriminator, and the only two hyper-parameters (the regularization coefficient and the clipping value) both have clear meanings.

7 Conclusions

We presented an implicit VI method named Kernel Implicit Variational Inference (KIVI), which provides a principled way of tuning the bias-variance tradeoff and makes implicit VI computationally feasible for models with high-dimensional latent variables. We successfully applied this approach to Bayesian neural networks and achieved superior performance on both regression and classification tasks. We also demonstrated that KIVI can be applied to learning local latent variable models like VAEs. Future work includes applying the method to larger-scale networks and further improving the kernel estimator.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (NSFC) Projects (Nos. 61620106010, 61621136008), Beijing Natural Science Foundation No. L172037, Tsinghua Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program and a Project from Siemens.

References

Appendix A Proof of Proposition 1

Proof.

Let $\mathcal H_\parallel$ denote the span of the representers of the sample points:

$\mathcal H_\parallel = \mathrm{span}\big\{k(\cdot, z_l)\big\}_{l=1}^{n_p + n_q},$

where we label the samples from $p$ as $z_1, \ldots, z_{n_p}$ and the samples from $q_\phi$ as $z_{n_p+1}, \ldots, z_{n_p+n_q}$. Define the orthogonal complement to be

$\mathcal H_\perp = \big\{f \in \mathcal H : \langle f, g\rangle_{\mathcal H} = 0,\ \forall g \in \mathcal H_\parallel\big\}.$

Because $\hat r \in \mathcal H$, it can be decomposed into two parts: $\hat r = \hat r_\parallel + \hat r_\perp$, where $\hat r_\parallel \in \mathcal H_\parallel$ and $\hat r_\perp \in \mathcal H_\perp$. Note that the squared loss is not changed by the orthogonal component $\hat r_\perp$: by the reproducing property, $\hat r(z_l) = \langle \hat r, k(\cdot, z_l)\rangle_{\mathcal H} = \hat r_\parallel(z_l)$ at every sample point.

Meanwhile, the regularization term can be decomposed into $\|\hat r\|_{\mathcal H}^2 = \|\hat r_\parallel\|_{\mathcal H}^2 + \|\hat r_\perp\|_{\mathcal H}^2$. So the optimal solution must have $\hat r_\perp = 0$, which indicates that $\hat r \in \mathcal H_\parallel$. ∎

Appendix B Derivation of the Optimal Density Ratio Function

Substituting Eq. (5) into Eq. (4) and removing the constant, we get

where , , and . Taking derivatives w.r.t. and and setting them to zeros:

Rearranging the terms, we have

(8)
(9)

Left multiplying Eq. (8) with and then subtracting Eq. (9) yields

Substituting the optimal into Eq. (8), we get

Note that although a matrix inverse is used in the derivation, the form of the solution does not require that matrix to be invertible. In fact, if it has zero eigenvalues (is not invertible), the objective is not bounded below, since one can always choose an element in its null space to decrease the objective. In other words, the form of the solution is well defined in all cases and is optimal whenever the objective is well defined.

Appendix C Gradients of the KL Term
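Writing $z = f_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$ for the implicit posterior, a sketch of the standard argument behind Eq. (7), in our notation, is:

    \nabla_\phi\,\mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big)
      = \nabla_\phi\,\mathbb{E}_{p(\epsilon)}\Big[\log\frac{q_\phi(f_\phi(\epsilon))}{p(f_\phi(\epsilon))}\Big]
      = \mathbb{E}_{p(\epsilon)}\Big[\nabla_\phi f_\phi(\epsilon)^{\top}\,
          \nabla_z \log\frac{q_\phi(z)}{p(z)}\Big|_{z = f_\phi(\epsilon)}\Big]
        + \underbrace{\mathbb{E}_{q_\phi(z)}\big[\nabla_\phi \log q_\phi(z)\big]}_{=\,0},

where the last term vanishes because the expected score of a distribution under itself is zero; hence the gradient flows only through the samples and not through the explicit $\phi$-dependence of the density ratio.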

The formula above shows that a good density ratio estimator gives accurate gradient estimates of the KL term.

Appendix D KIVI with Adaptive Contrast

Although KIVI gives rather good estimates of the density ratio, the estimation accuracy still degrades as the discrepancy between the two distributions grows. The problem is critical for local latent variable models like VAEs, because the same variational model is required to infer the posteriors of all local latent variables. To mitigate this, AVB (Mescheder et al., 2017) adopted a technique called Adaptive Contrast (AC), which can easily be integrated with KIVI. AC introduces an auxiliary tractable distribution $r_\alpha(z)$ that resembles $q_\phi(z)$. With $r_\alpha$, the ELBO can be rewritten as

$\mathcal L = \mathbb{E}_{q_\phi(z)}\big[\log p(x, z) - \log r_\alpha(z)\big] - \mathrm{KL}\big(q_\phi(z)\,\|\,r_\alpha(z)\big).$

Gradients of the first term w.r.t. $\phi$ can be easily computed using Monte Carlo, since $r_\alpha$ has a tractable density, and gradients of the second term can be estimated using KIVI. Adaptive contrast gives better estimates if $r_\alpha$ approximates $q_\phi$ well. Because $r_\alpha$ is required to have a tractable density, a commonly used adaptive distribution is a Gaussian whose mean $\mu$ and standard deviation $\sigma$ are matched to those of $q_\phi$. In practice, $\mu$ and $\sigma$ are estimated from the samples of $q_\phi$. By the invariance of the KL divergence under reparameterization, we have

$\mathrm{KL}\big(q_\phi(z)\,\|\,r_\alpha(z)\big) = \mathrm{KL}\big(\tilde q_\phi\,\|\,\mathcal N(0, I)\big),$

where $\tilde q_\phi$ denotes the distribution of $(z - \mu)/\sigma$ with $z \sim q_\phi$, and $\mathcal N(0, I)$ denotes the standard normal distribution. Under this reparameterization, we only need to estimate the density ratio between two distributions with zero means and unit variances.
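A small NumPy/SciPy sketch of this decomposition (it reuses the hypothetical fit_kernel_ratio from the sketch in Section 3.1, assumes a standard normal prior, and matches a diagonal Gaussian to the q samples):

    import numpy as np
    from scipy.stats import norm

    def kl_q_p_adaptive_contrast(z_q, rng, n_aux=200):
        # KL(q||p) = KL(q||r) + E_q[log r(z) - log p(z)], with r a Gaussian matched to q.
        mu, std = z_q.mean(0), z_q.std(0) + 1e-8
        z_std = (z_q - mu) / std                              # standardized q samples
        eps = rng.standard_normal((n_aux, z_q.shape[1]))      # samples from N(0, I)
        ratio = fit_kernel_ratio(num_samples=eps, den_samples=z_std)  # estimates N(0,I)/q_std
        kl_q_r = -np.mean(np.log(ratio(z_std)))               # KL between standardized q and N(0, I)
        log_r = norm.logpdf(z_q, loc=mu, scale=std).sum(1)    # log density of the matched Gaussian
        log_p = norm.logpdf(z_q, loc=0.0, scale=1.0).sum(1)   # log density of the prior N(0, I)
        return kl_q_r + np.mean(log_r - log_p)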

Appendix E Lower Bound with Gamma-Prior Precision

In the multivariate regression task, the output $y$ is sampled from a normal distribution with $f_{\mathrm{NN}}(x; W)$ as the mean and the inverse of a precision parameter $\tau$ as the variance. The variance controls the likelihood of the model, so choosing it appropriately is essential. We place a Gamma prior on the precision $\tau$ (i.e., the reciprocal of the variance) and learn it adaptively. The variational posterior over $\tau$ is taken to be a Gamma distribution, independent of $q_\phi(W)$. The ELBO can then be calculated as in the standard case, with the expected log-likelihood additionally averaged over the Gamma posterior; the resulting expression involves the digamma function $\psi(\cdot)$, and the KL divergence between the two Gamma distributions can be calculated in closed form.
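A sketch of the resulting bound, assuming a shape-rate parameterization $p(\tau) = \mathrm{Gamma}(a_0, b_0)$ and $q(\tau) = \mathrm{Gamma}(\alpha, \beta)$ independent of $q_\phi(W)$ (our notation; the paper's exact parameterization may differ):

    \mathcal{L} = \mathbb{E}_{q_\phi(W)\,q(\tau)}\Big[\sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid f_{\mathrm{NN}}(x_n; W),\ \tau^{-1}\big)\Big]
      - \mathrm{KL}\big(q_\phi(W)\,\|\,p(W)\big) - \mathrm{KL}\big(q(\tau)\,\|\,p(\tau)\big),

    \text{where}\quad
    \mathbb{E}_{q(\tau)}\big[\log \mathcal{N}(y \mid m, \tau^{-1})\big]
      = \tfrac{1}{2}\big(\psi(\alpha) - \log\beta\big) - \tfrac{1}{2}\log 2\pi - \tfrac{\alpha}{2\beta}\,(y - m)^2,

and the Gamma-Gamma KL term is available in closed form.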

Appendix F Additional Experiment Results

F.1 2-D Bayesian Logistic Regression

Figure 5: 2-D Bayesian logistic regression: (a) training data; (b) true posterior; (c) VI (factorized); (d) HMC; (e) KIVI.

We also conduct experiments on a 2-D Bayesian logistic regression example, which has an intractable posterior. The model is $w \sim \mathcal N(0, I)$, $y_n \mid x_n, w \sim \mathrm{Bernoulli}\big(\sigma(w^\top x_n)\big)$, where $\sigma(\cdot)$ is the sigmoid function. 200 data points $(x_n, y_n)$ are generated from the true model as the training data (Fig. 5(a)). The unnormalized true posterior is plotted in Fig. 5(b). As a baseline, we first run VI with a factorized normal distribution; the result is shown in Fig. 5(c). It can be clearly seen that the factorized normal captures the position and the scale of the true posterior, but cannot fit its shape well due to its independence across dimensions.

We then apply KIVI. The implicit posterior we use is a simple stochastic neural network (see Appendix G.1). To assess the quality of the result, we also run Hamiltonian Monte Carlo (HMC) to obtain posterior samples. The results are plotted in Fig. 5(d) and 5(e). We can see that the implicit posterior learns to capture the strong correlation between the two dimensions and produces posterior samples with a shape similar to those drawn by HMC.

Figure 6: Variational distributions produced by only optimizing the KL term to the prior: (a) with the reverse ratio trick; (b) without the reverse ratio trick.

We also use this experiment to illustrate the importance of the reverse ratio trick. Figure 6 shows the implicit variational distributions learned by only optimizing $\mathrm{KL}(q_\phi(w)\,\|\,p(w))$; in this case we expect the learned $q_\phi$ to be close to the prior $p(w)$. The result produced with the trick is compared to the result without it, and the latter clearly fails to work well.

RMSE Factorized NF KSD VI KIVI HMC
boston 3.42±0.19 3.43±0.19 4.10±0.25 2.80±0.17 2.20±0.05
concrete 6.00±0.10 6.04±0.10 12.49±0.19 4.70±0.12 4.27±0.13
energy 2.42±0.06 2.48±0.09 5.54±0.14 0.47±0.02 0.47±0.01
kin8nm 0.09±0.00 0.09±0.00 0.26±0.00 0.08±0.00 0.07±0.00
naval 0.01±0.00 0.01±0.00 0.01±0.01 0.00±0.00 0.00±0.00
LL Factorized NF KSD VI KIVI HMC
boston -2.66±0.04 -2.66±0.04 -3.32±0.01 -2.53±0.10 -2.29±0.01
concrete -3.22±0.06 -3.24±0.06 -4.13±0.01 -3.05±0.04 -2.81±0.02
energy -2.34±0.02 -2.36±0.03 -3.61±0.00 -1.30±0.01 -1.43±0.01
kin8nm 0.96±0.01 1.01±0.01 -0.18±0.01 1.16±0.01 1.22±0.01
naval 4.00±0.11 4.04±0.12 3.28±0.13 5.50±0.12 7.31±0.00
Table 2: Test RMSE and log-likelihood for the regression datasets. Factorized and NF denote VI with factorized normal posteriors and normalizing flows, respectively.

F.2 Comparison with More Methods

In this section we present comparisons with two more methods towards flexible posteriors, namely, normalizing flow (Rezende & Mohamed, 2015) and KSD variational inference (KSD VI) (Liu & Feng, 2016). We also include the results by HMC as the ground truth.

Normalizing flow  The basic idea of normalizing flows has been introduced in Section 1. Specifically, given a random variable $z_0$ following a simple distribution $q_0(z_0)$ and an invertible mapping $f$ such that $z_1 = f(z_0)$, the probability density of the transformed variable can be calculated as

$q_1(z_1) = q_0(z_0)\,\Big|\det\frac{\partial f}{\partial z_0}\Big|^{-1}. \qquad (10)$

Thus we can construct complex distributions by composing invertible mappings ($z_K = f_K \circ \cdots \circ f_1(z_0)$), while keeping the probability density of the resulting distribution tractable by successively applying Eq. (10). For example, Rezende & Mohamed (2015) proposed the planar normalizing flow $f(z) = z + u\,h(w^\top z + b)$, where $u$, $w$, $b$ are free parameters and $h$ is a smooth element-wise nonlinearity, chosen as tanh in the following experiments. A simple variational posterior can be made more expressive when this parametric transformation is employed.
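A short NumPy sketch of a planar flow step and the density bookkeeping of Eq. (10) (names are ours; the invertibility constraint $w^\top u \ge -1$ is not enforced here):

    import numpy as np

    def planar_flow(z, u, w, b):
        # One planar flow step f(z) = z + u * tanh(w^T z + b) and its log |det Jacobian|.
        a = z @ w + b                                   # (n,)
        f = z + np.outer(np.tanh(a), u)                 # transformed samples, (n, d)
        psi = np.outer(1.0 - np.tanh(a) ** 2, w)        # h'(a) * w, shape (n, d)
        log_det = np.log(np.abs(1.0 + psi @ u))         # |det df/dz| = |1 + u^T psi(z)|
        return f, log_det

    # Track log q_K(z_K) = log q_0(z_0) - sum_k log |det df_k / dz_{k-1}|, as in Eq. (10).
    rng = np.random.default_rng(0)
    d, n = 2, 5
    z = rng.standard_normal((n, d))
    log_q = -0.5 * np.sum(z ** 2, 1) - 0.5 * d * np.log(2.0 * np.pi)   # log N(z; 0, I)
    for _ in range(3):                                   # compose a few flow steps
        u, w, b = 0.1 * rng.standard_normal(d), rng.standard_normal(d), 0.0
        z, log_det = planar_flow(z, u, w, b)
        log_q = log_q - log_det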

KSD VI  KSD variational inference (Liu & Feng, 2016) is a method that directly minimizes the kernelized Stein discrepancy (KSD) between the true posterior $p(z \mid x)$ and the variational posterior $q_\phi(z)$:

$\mathbb S(q_\phi, p) = \max_{f \in \mathcal F}\ \Big\{\mathbb{E}_{z \sim q_\phi}\big[\nabla_z \log p(z \mid x)^\top f(z) + \nabla_z^\top f(z)\big]\Big\}^2, \qquad (11)$

where $\mathcal F$ is the unit ball in the vector-valued RKHS induced by the RBF kernel, and the posterior enters only through its score $\nabla_z \log p(z \mid x)$, which does not require the normalizing constant. Note that the objective in Eq. (11) only requires samples from $q_\phi$, so KSD VI also belongs to the implicit VI methods.

We conduct the regression experiments of Section 6.2.1. The results are shown in Table 2. For normalizing flow, we apply 10 planar flows to the weights to match the running time of our implicit posteriors. To see whether the flows help the inference, we also present results of VI using factorized normal distributions on the weights (with their means and standard deviations optimized by the reparameterization trick). For KSD VI, we use the same implicit posteriors as those used for KIVI in Section 6.2.1. For VI with factorized normal distributions and normalizing flow, we use 100 samples and batch size 10, except for kin8nm and naval, where we use batch size 100; we set the learning rate to 0.01 and run 500 epochs, repeated 20 times. For KSD VI, we set all the training parameters (number of samples, number of epochs, batch size, and learning rate) the same as KIVI's. We also run HMC to produce the ground truths, using 20 chains, 150,000 iterations, and a target acceptance rate of 97% to adapt the step size. As the HMC experiment is time-consuming, we only perform 10 runs.

From Table 2 we can see that normalizing flow does not show improvements over VI with factorized normal distributions. This is probably due to optimization challenges caused by the limited form of planar flows, so the inference procedure benefits little from the flexibility introduced by the flow. For KSD VI, we found it very sensitive to the initialization of the implicit posteriors: we had to make the variance of the input Gaussian noise larger so that KSD VI would not diverge at the very beginning of training, and it soon converges to unsatisfying local optima after the optimization starts. These two findings are explained by the observation that KSD is the magnitude of a functional gradient of the KL divergence (Liu & Wang, 2016): all saddle points of the original problem of optimizing the KL divergence become local optima when optimizing KSD.

F.3 Visualization of Inferred Posteriors

In order to visually check the quality of the uncertainty estimated by KIVI, we use a visualization technique named parallel coordinates (Inselberg & Dimsdale, 1987) to plot the posterior over weights. We compare the posteriors inferred in the regression experiment on Boston housing (Table 2) by VI with factorized normal approximation, HMC, and KIVI.

The feature dimension of the Boston housing data is 13, and the BNN used is an MLP with a hidden layer of size 50. So the network has two weight matrices, of sizes 14 x 50 and 51 x 1 (bias parameters included). Since the first matrix has 700 dimensions, we only plot the posterior samples of the second (output-layer) weights to avoid visual clutter. Specifically, we draw 100 samples from the posterior, and the goal is to show these 51-dimensional samples. In parallel coordinates, each sample is plotted as a polyline (see Figure 7). The vertices on the polyline represent the individual dimensions: their positions on the horizontal axis are the indices of the dimensions (from 0 to 50), and their vertical coordinates are the weight values. Note that an important fact about hidden neurons in neural networks is that they are non-identifiable (any two hidden nodes can be exchanged without affecting the output distribution). This implies that all dimensions of the plotted weights are non-identifiable, and different inference algorithms may converge to different local modes caused by this symmetry. To make the visualizations comparable, we sort the dimensions for each algorithm by the mean value of the posterior samples in each dimension.

From the visualization we can see that BNNs with the factorized normal approximation have significant over-pruning problems. A large proportion of hidden nodes are turned off during inference, and the information is mainly carried by a few others. These pruned weights can be identified by their low signal-to-noise ratio $|\mu|/\sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation. This problem has been pointed out previously, both in VAE works (Sønderby et al., 2016; Hoffman, 2017) and BNN works (Blundell et al., 2015), and can be explained by the looseness of the variational bound when a factorized approximation is used (Trippe & Turner, 2017). In contrast, the HMC ground truth does not have the pruning problem, and very strong correlations between neighboring dimensions are visible. (Note that in HMC the samples plotted are drawn from a single chain. This is also due to the non-identifiability of the weights: different chains with different initializations tend to converge to different local modes that cannot be identified with each other, so plotting all chains would create too much visual clutter, and a fair comparison would also require running the two VI methods with different initializations.) KIVI also avoids the pruning problem and captures the strong correlations across dimensions, and along with the gain in accuracy, a good amount of uncertainty remains in the implicit posterior. Although the weight dimensions are non-identifiable, we can still see that the two VI methods arrive at solutions whose scales are biased compared to the HMC ground truth. Note that this is not necessarily problematic, since VI is typically known to produce biased solutions.

Figure 7: Visualization of learned posteriors for regression on Boston housing: (a) factorized normal; (b) HMC; (c) KIVI.

F.4 Accuracy of KL Estimation

From Section 3.1 we know that the approximate gradients of the ELBO are directly related to the KL estimates, so we would like to assess the accuracy of the KL estimator. We adopt the settings from the VAE experiment on MNIST (Section 6.3). For comparison, we must also be able to compute a ground truth for the KL term. To achieve this, we use a normalizing flow as the variational posterior, which has a complicated but tractable density, so that a good Monte Carlo estimate of the true KL term can be obtained. In Figure 8(a) we compare the KL term estimated by KIVI with the ground truth. Since we use adaptive contrast (AC; see Appendix D) in the VAE experiments, where the KL term is broken into two parts, it is more informative to look at the part that actually uses the density ratio estimator (the KL divergence between the standardized $q_\phi$ and a standard normal distribution), which is plotted in Figure 8(b). We can see that the KL estimates closely track the ground truth, and become more accurate as the variational approximation improves over time.

Figure 8: True KL term vs. estimated KL term for posteriors with normalizing flows: (a) the full KL term; (b) the part estimated with the density ratio estimator under adaptive contrast.

F.5 Change of Interpolation Results on CelebA Through Training

In Figures 9-14 we present the generated images for the interpolation experiments on CelebA through the training process. The images are generated after 1, 5, 10, 15, 20, and 25 epochs. Results on the left are produced by AVB, and those on the right by KIVI. It can be clearly seen that AVB's training process has very high variance, as mentioned in Section 2.

Figure 9: Epoch 1. (a) AVB; (b) KIVI.
Figure 10: Epoch 5. (a) AVB; (b) KIVI.
Figure 11: Epoch 10. (a) AVB; (b) KIVI.
Figure 12: Epoch 15. (a) AVB; (b) KIVI.
Figure 13: Epoch 20. (a) AVB; (b) KIVI.
Figure 14: Epoch 25. (a) AVB; (b) KIVI.

Appendix G Details of Experiments

G.1 Toy Experiments

1-D Gaussian Mixture The implicit posterior generates samples by propagating samples from a standard normal distribution through a two-layer MLP with 10 hidden units and one output unit. We set the regularization coefficient to and the density ratio clipping threshold to .

2-D Bayesian Logistic Regression  The inputs are 200 points randomly sampled in 2-D. The outputs are the predictions of the model with weights randomly sampled from the prior. For HMC, we run 100 chains, 200 iterations each, and use 10 leapfrog steps; the step size is automatically adapted using dual averaging, starting from 0.001, and the first 100 samples generated are discarded. For factorized VI, training is run for 100 epochs and 100 samples are used, with the learning rate annealed linearly. For KIVI we follow the same setting, except that we use 1k samples; the regularization coefficient is set to 0.1 and the estimated density ratio is clipped from below. The sample generation process of the implicit variational posterior used in KIVI is as follows. First, 2-D random normal samples are propagated through two fully-connected layers of size 20 (no activation function is used for the second layer); the result is then added to another random normal noise with trainable variances. Finally, we propagate it through a fully-connected layer of size 20 and then a linear output layer to get the variational samples.

G.2 Bayesian Neural Networks

G.2.1 Regression

As the regression datasets have small feature dimensions (all less than 15, except 90 for Year), using BNNs with one hidden layer (50 units) does not produce very high-dimensional weights. Therefore, we still use MLPs in the implicit variational posterior, whose samples are generated by propagating samples from a standard normal distribution through an MLP. For all datasets, we use ReLU as the activation function. The MLP has one hidden layer, except for Yacht, where it has two hidden layers.

We list the details in Table 3, which covers the 10 datasets with 2 weight matrices each. Taking Layer 1 for Boston as an example, (20, 30, N1) means that 20-dimensional random normal samples are generated and propagated through an MLP with a hidden layer of size 30 and an output layer of size N1. Note that N1 corresponds to the number of weights in the first layer of the BNN.

Boston / Concrete / Energy / Kin8nm / Naval
Layer 1: (20, 30, N1) / (30, 50, N1) / (100, 500, N1) / (100, 500, N1) / (100, 500, N1)
Layer 2: (20, 30, N2) / (30, 50, N2) / (50, 100, N2) / (50, 100, N2) / (50, 100, N2)
Combined / Protein / Wine / Yacht / Year
Layer 1: (100, 500, N1) / (100, 500, N1) / (20, 10, N1) / (100, 800, N1, N1) / (100, 500, N1)
Layer 2: (100, 500, N2) / (100, 500, N2) / (5, 20, N2) / (50, 200, 51, N2) / (100, 500, N2)

Table 3: Implicit variational posteriors for regression experiments

G.2.2 Classification

MNIST classification needs a larger-scale network than the one used in multivariate regression. Therefore, we adopt an MMNN in the implicit variational posterior. We denote the hidden-layer size of the BNN by $n_h$ (400, 800, or 1200 in our experiments); the sizes of the MMNN layers are chosen according to $n_h$. In the MMNN, we use two matrix multiplication layers.

For the weights of each BNN layer, the two hidden matrix multiplication layers use ReLU activations, the output layer of the MMNN has the same size as the corresponding weight matrix and uses linear activations, and the input matrices are random samples drawn from a standard normal distribution.

We use the above variational posterior settings in both KIVI and PC. For PC, a logistic regression serves as the discriminator. For KIVI, we set the regularization coefficient to 0.001 and clip the estimated density ratio from below at a small constant. Both methods use 10 samples, and we set the batch size to 100 and the learning rate to 0.001 during training.

G.3 Variational Autoencoders

The decoders used in the MNIST experiment are MLPs with two hidden ReLU layers; the latent dimension is 8 and each hidden layer is of size 500. The implicit posterior is also an MLP with two hidden ReLU layers, with 500-dimensional Gaussian noise added to the first hidden layer; the noise has zero mean and 500-dimensional trainable variances. Training uses a batch size of 128. The learning rate is 0.001, annealed by a factor of 0.5 every 200 epochs. The only hyper-parameters for KIVI in this case are the regularization coefficient and the clipping value, both fixed throughout the experiment.

The decoders used in the CelebA experiment have exactly the same structure as the one used for images in the DCGAN paper (Radford et al., 2015). The latent dimension is 32. The implicit variational posterior is a deep convolutional neural network with a structure symmetric to the decoder, except that the output of the last convolutional layer is flattened and Gaussian noise of the same shape is added to it; the noise has zero mean and trainable variances. The last hidden layer is fully-connected with 500 ReLU units. For AVB we use the same decoder. For both KIVI and AVB, we use a batch size of 64. The other training parameters of AVB follow its original code for CelebA. The only hyper-parameters for KIVI in this case are, again, the regularization coefficient and the clipping value.