1 Introduction
Bayesian methods have been playing vital roles in machine learning by providing a principled approach for generative modeling, posterior inference and preventing overfitting
(Ghahramani, 2015). As it becomes a common practice to build deep models that have many parameters (LeCun et al., 2015), it is even more important to have a Bayesian formulation to capture the uncertainty in these models. For example, Bayesian Neural Networks (BNNs) (Neal, 2012; Blundell et al., 2015)have shown promise in reasoning about model confidence and learning with few labeled data. Another recent trend is to incorporate deep neural networks as a powerful function mapping between random variables in a Bayesian network, such as deep generative models like variational autoencoders (VAE)
(Kingma & Welling, 2013).Except a few simple examples, Bayesian inference is typically challenging, for which variational inference (VI) has been a standard workhorse to approximate the true posterior
(Zhu et al., 2017). Traditional VI focuses on factorized variational posteriors to get analytical updates (known as Meanfield VI). While recent progress in this field drives VI into stochastic, differentiable and amortized (Hoffman et al., 2013; Paisley et al., 2012; Mnih & Gregor, 2014; Kingma & Welling, 2013), which does not rely on analytical updates anymore, factorized posteriors are still commonly used as the variational family. This greatly restricts the flexibility of the variational posterior, especially in highdimensional spaces, which often leads to biased solutions as the true posterior is usually not factorized, thus not in the family. There have been some works that try to improve the flexibility of variational posteriors, borrowing ideas from invertible transformation of probability distributions
(Rezende & Mohamed, 2015; Kingma et al., 2016). In their works, it is important for the transformation to be invertible to ensure that the transformed distribution has a tractable density.Although utilizing invertible transformation is a promising direction to increase the expressiveness of the variational posterior, we argue that a more flexible variational family can be constructed by using general deterministic or stochastic transformations, which are not necessarily invertible. As a common result, the variational posterior we get in this way does not have a tractable density, despite that there is a way to sample from it. This kind of distribution is called implicit distributions, and for variational methods that use an implicit variational posterior (also known as variational programs (Ranganath et al., 2016) or wild variational approximations (Liu & Feng, 2016)), we refer to them as Implicit Variational Inference (implicit VI). Most of the existing implicit VI methods (Mescheder et al., 2017; Huszár, 2017; Tran et al., 2017) rely on a discriminator to produce estimates of the variational objective and its gradients. As pointed out by many of them, the estimates are often noisy and can lead to unstable training. Besides, discriminatorbased approaches are computationally infeasible when applied to nontrivial BNNs.
In this paper we present an approach named Kernel Implicit Variational Inference
(KIVI), which addresses the noisy estimation problem in previous works by providing a principled way of tuning the biasvariance tradeoff. Furthermore, KIVI does not rely on a discriminator and thus is computationally feasible for models with highdimensional latent variables (e.g., BNNs). KIVI is applicable to both global and local latent variable models, which is demonstrated by experiments on BNNs and VAEs. As far as we know, this is the first time that implicit VI is successfully applied to BNNs, which shows promising results on both regression and classification tasks.
2 Background
Consider a generative model , where and denote observed and latent variables, respectively. In VI, a variational distribution in some parametric family is chosen to approximate the true posterior by optimizing the evidence lower bound (ELBO):
(1) 
where
denotes the KullbackLeibler divergence
. This objective is a lower bound of the loglikelihood since it can be written as The maximum of this objective is achieved when . From Eq. (1), we can see that the challenge of using an implicit is that calculating requires evaluating the density of , which is intractable for an implicit distribution.Recently, inspired by the probabilistic interpretation of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2016), there have been some works that extend the GAN approach to the posterior inference of latent variable models (LVMs) (Mescheder et al., 2017; Huszár, 2017; Tran et al., 2017). These methods all use an implicit variational family and thus can be categorized into implicit VI methods. One of their key observations is that the density ratio
can be estimated from samples of the two distributions by a probabilistic classifier called the discriminator. They first assign class labels (
) to and : Let samples from be of class , and samples from be of class . Given an equal class prior, the density ratio at a given point can be calculated as which is the ratio between the class probabilities given the data point. To estimate this, a discriminator is trained to classify between the two classes, with a logistic loss:(2) 
where outputs the probability of ’s being from class . Given that is sufficiently flexible, the optimal solution of Eq. (2) is Therefore, the KL divergence term in the ELBO of Eq. (1) can be approximated as This is called priorcontrastive forms of VI in Huszár (2017). Note that the ratio approximation does not change the gradients once the approximation is accurate, as we shall see later in Eq. (7). Though incorporating the discriminative power in a probabilistic model has shown great success in GANs, this method still suffers from challenging problems when applied to VI:

[leftmargin=*]

Noisy density ratio estimation (DRE) In VI, the variational posterior gets updated in each iteration. As shown in Eq. (2), the discriminator should be trained to optimum after each update. However, in practice the inner loop for training the discriminator is often truncated to one or several iterations. At the beginning of the inference procedure, it is hard for the discriminator to catch up with the variational posterior. The noisy signal produced by the discriminator leads to noisy gradients and thus unstable training. Besides, even if the discriminator quickly achieves the optimum in a small number of iterations, there is still another issue. Notice that the training loss in Eq. (2) is with expectations. But in practice we are using samples from the two distributions to approximate it. When the support of the distributions is highdimensional, given the limited number of samples we use, the variance of this estimate is considerable, i.e., the discriminator tends to overfit the samples. The phenomenon is that the discriminator arrives at a state where samples are easily distinguished and the probabilities given by the discriminator are near 0 or 1, which is commonly observed in experiments (Mescheder et al., 2017).

Computationally infeasible for high dimensional latent variables As the density ratio is estimated by a discriminator, the samples from the two distributions of latent variables should be fed into it. However, the typically used neural network discriminator cannot afford very highdimensional inputs (e.g., parameters in a moderatesize Bayesian neural network).
3 Kernel Implicit Variational Inference
To address the above challenges for implicit VI, we propose to replace the discriminator with a kernel method for DRE. The advantages of this method are that it has a closedform solution and that it allows us to explicitly tradeoff between bias and variance by tuning a regularization coefficient.
3.1 Estimating the KL Term
Specifically, let be the latent variable, and the true density ratio is . Consider modeling it with a function , where is a Reproducing Kernel Hilbert Space (RKHS) induced by a positive definite kernel
. Similar to kernel ridge regression, we use an objective composed of a squared loss for regression plus a penalty for the complexity of the function. For the squared loss we choose the form used by the unconstrained Least Square Importance Fitting (uLSIF)
(Kanamori et al., 2009):(3) 
where is a constant, and approximate the expectation in by Monte Carlo estimates:
where and are the number of samples from and , respectively. Note that in Eq. (3), the expectation of the squared loss is taken w.r.t. so that the resulting form can be estimated without evaluating the density of both distributions. For the penalty term, the complexity of is measured by its RKHS norm (). Putting them together, we get the final objective:
(4) 
Here is the regularization coefficient.
Proposition 1.
The optimal solution of Eq. (4) lies in the linear subspace spanned by the kernel functions centered at the samples (), i.e., has the form:
(5) 
Proof.
Substituting Eq. (5) into Eq. (4) and setting the derivatives w.r.t. and to zeros, we get the optimal solution (a detailed derivation is given in Appendix B):
(6) 
where and . We use the common RBF kernels .
is the kernel bandwidth, which is determined by the commonly used median heuristic (the median of pairwise distances of the sample points).
Once we have the approximate density ratio function , a Monte Carlo estimate of can be constructed by . Note that there is a constraint that the estimated density ratio should be nonnegative. However, we do not involve it in the optimization objective in order to get a closedform solution, which indicates that some postprocessing is needed to ensure this property. We solve the issue by clipping the estimated density ratio. The clipping values are searched from . In experiments we found that the algorithm is not sensitive to the clipping value. This is due to the accurate estimation guaranteed by the global optimum in the RKHS, which is a universal family when RBF kernels are used (Carmeli et al., 2010).
The reverse ratio trick Another technique is essential to improve the estimation of the KL term, which we call the reverse ratio trick. The key observation is that the expectation in the squared loss in Eq. (3) is taken w.r.t. , whereas the expectation in the KL term () is taken w.r.t. . Unless and match very well in where they put most probabilities, a small squared loss does not always mean a good KL estimate. The solution is by a simple trick. Instead of estimating , we choose to estimate and compute the KL term as . We denote the estimated reverse density ratio as , then the corresponding KL estimate is . Note that in this way the squared loss changes to , whose expectation is taken w.r.t. the same probability measure () as the KL term’s. As we shall see in experiments (Appendix F.1), the trick is essential to make the estimation sufficiently accurate for VI.
Gradient computation We now consider how to estimate the gradients of the KL term w.r.t. variational parameters . First it is easy to prove as in Huszár (2017) that
(7) 
A detailed proof is in Appendix C. Eq. (7) indicates that the true gradients of the KL term w.r.t. do not flow through the density ratio function. We now replace the ratio on the right side with : . Note that without Eq. (7) we cannot do the approximation since has zero gradients w.r.t. . Then, the reparameterization trick (Kingma & Welling, 2013) can be used:
3.2 The Algorithm
We have constructed a closedform estimate for the KL term and show its gradients can be estimated by the reparameterization trick. Note that the reparameterization trick can also be used to compute the gradients of the reconstruction term in Eq. (1) and thus can be applied to the ELBO. See Algo. 1 for the complete algorithm. Note that the number of samples () used in the reconstruction term can be different from that required for the KL estimation, which can be reduced when the model is expensive (e.g., we set in the experiments of VAEs). Thus compared to normal reparameterized VI, the extra computational cost is mainly in calculating the inverse of the matrix in Eq. (6). As we shall see in experiments, tens or a hundred samples are sufficient to obtain a stable KL estimate, so the added cost is not high.
KIVI addresses the two challenges stated in Sec. 2. First, the ratio estimates are given in closedforms, thus not having the problem of not catching up. Second, the biasvariance tradeoff of the estimation can be controlled by the regularization coefficient . When is set smaller, the estimation is more aggressive to match the samples. When is set larger, the estimated ratio function is smoother. Choosing an appropriate , the variance of the gradients can be controlled, compared to the extreme ratio estimates given by discriminators when their output probabilities are near 0 or 1. Moreover, KIVI is directly applicable to both global and local latent variable models (LVMs), which is an advantage over nonparametric VI methods like particle mirror descent (Dai et al., 2015) and Stein variational gradient descent (SVGD) (Liu & Wang, 2016). For the task of training local LVMs like VAEs, we additionally use the adaptive contrast (AC) technique (Mescheder et al., 2017), whose details are summarized in Appendix D.
4 Example: Implicit Variational Bayesian Neural Networks
Now we present an example for using KIVI in Bayesian neural networks (BNNs), which have received increasing attention due to their ability to model uncertainty, an important factor in many tasks such as adversarial defense and reinforcement learning. However, despite that we have removed the need for a discriminator, it is still nontrivial to apply KIVI to BNNs because we need to design an implicit posterior that outputs very highdimensional samples of latent variables (weights). Existing implicit posteriors
(Mescheder et al., 2017; Song et al., 2017) based on traditional fullyconnected neural networks cannot handle such a highdimensional output space. We present Matrix Multiplication Neural Network (MMNN), an efficient architecture for sampling large matrices. Deploying MMNN, KIVI can easily scale up to large BNNs.In BNNs, a prior is specified over the neural network parameters , where indicates weights in the th layer. Given an input , the output is modeled with
where is the output of the feedforward network , and is of a distribution parameterized by and . For regression, is usually a Gaussian with as the mean. For classification, is usually a discrete distribution with as the unnormalized log probabilities.
The true posterior of in BNNs is intractable. Thus we turn to VI and use a variational posterior to approximate it. Denoting the dataset with , we have the ELBO:
The variational posterior is usually set to be factorized by layer: However, previous methods used variational posteriors with a limited capacity for each , including factorized Gaussian (HernandezLobato & Adams, 2015), matrix variate Gaussian (Louizos & Welling, 2016; Sun et al., 2017) and normalizing flows (Louizos & Welling, 2017). Enabled to learn implicit variational posteriors, we propose to adopt a general distribution without an explicit density function, which has a form of
Here is a transformation parameterized by , and are treated as samples from . The key challenge is that are very high dimensional for moderate size neural networks. Thus, we often cannot use a fully connected neural network (MLP) as . Inspired by lowrank matrix factorization (Koren et al., 2009), we propose a new kind of network called Matrix Multiplication Neural Network (MMNN) to serve as , as shown in Alg. 2. In each layer of an MMNN, an input matrix () is left multiplied with a parameter matrix () and is added a bias matrix (), then it is right multiplied with a parameter matrix () and is added a bias matrix (
). Finally it is passed through a nonlinear activation (e.g., ReLU). We call such a layer as an
Matrix Multiplication layer.To model the implicit posterior of , we only need to randomly sample a matrix of smaller size, and propagate it through the MMNN to get the output variational samples ():
When modeling a matrix, MMNN has significant computational advantages over MLPs, due to its lowrank property. For example, to model an weight matrix, consider a singlelayer MMNN with the input matrix of size , the parameters needed are and they are in total of size , while if a single fullyconnected layer is used, the parameter size is , which is much larger. Thus, we can use an MMNN as in the variational posterior for normalsize neural networks. In tasks with very small networks, we still use an MLP as .
5 Related Work
Our work closely relates to the works on implicit generative models (IGMs, generative models that define implicit distributions) and density ratio estimation (DRE). IGMs have drawn much attention due to the popularity of GANs. General learning algorithms of implicit models have been surveyed in Mohamed & Lakshminarayanan (2016), where DRE plays a central role. The connection between DRE and GANs is also discussed in Uehara et al. (2016). For a comprehensive review of DRE, we refer the readers to the survey (Hido et al., 2011). We also refer the readers to many other works by Sugiyama and his collaborators on DRE, such as KLIEP (Sugiyama et al., 2008), LSIF (Kanamori et al., 2009), and DRE based on Bregman divergence (Sugiyama et al., 2012).
Our work also builds upon the recent VI methods, including stochastic approximation by minibatches (Hoffman et al., 2013), direct gradient optimization of variational lower bounds (Paisley et al., 2012; Mnih & Gregor, 2014), and the reparameterization trick for training continuous LVMs (Kingma & Welling, 2013). Following the success of learning with IGMs, implicit distributions are applied to VI. Many of them are based on discriminators, which can be divided into two categories: priorcontrastive (PC) and jointcontrastive (JC) (Huszár, 2017)
. In PC methods discriminators distinguish between samples from the prior and those from the variational posterior, while in JC methods they distinguish between the model joint distribution and the joint distribution composed of the data distribution and the variational posterior. Concurrent with
Huszár (2017), Mescheder et al. (2017) proposed Adversarial Variational Bayes (AVB), which is an amortized version of PC methods for training local LVMs like VAEs. Prior to Huszár (2017), similar ideas with JC methods have been proposed in ALI (Dumoulin et al., 2016) and BiGAN (Donahue et al., 2016). Nonparametric VI methods such as PMD (Dai et al., 2015) and SVGD (Liu & Wang, 2016) that adapt a set of particles towards the true posterior are also closely related to implicit VI. They share the similar advantage of flexible approximations. More recently, the amortized version of SVGD has been developed (Liu & Feng, 2016) and the same idea has been applied to MCMC (Li et al., 2017). It was further shown in Li & Turner (2017) that the core identity in SVGD (Stein’s identity) could also be employed to approximate the gradients of implicit distributions.6 Experiments


We present empirical results on both synthetic and real datasets to demonstrate the benefits of KIVI. All implementations are based on ZhuSuan (Shi et al., 2017).
6.1 Toy 1D Gaussian Mixtures
We firstly conduct a toy experiment to approximate a 1D Gaussian mixture distribution with VI. The Gaussian mixture distribution has two equally distributed unitvariance components whose means are 3 and 3. We compare KIVI with VI using a single Gaussian posterior (Fig. 1
). The variational distribution used by KIVI generates samples by propagating a standard normal distribution through a twolayer MLP with 10 hidden units in each layer and one output unit. As shown, the Gaussian posterior converges to a single mode. In contrast, KIVI can accurately approximate the two modes with an expressive variational posterior. We defer another toy experiment on 2D Bayesian logistic regression to Appendix
F.1, where the importance of the reverse ratio trick is illustrated.6.2 Bayesian Neural Networks (BNNs)
As stated in Sec. 4, the latent variables in BNNs are global to all data points and are usually very highdimensional, for which a flexible variational family is essential. We compare KIVI with stateoftheart VI methods by doing regression and classification on standard benchmarks.
6.2.1 Regression
Dataset  Avg. Test RMSE  Avg. Test LL  

SVGD  Dropout  KIVI  SVGD  Dropout  KIVI  
Boston  2.9570.099  2.970.19  2.7980.173  2.5040.029  2.460.06  2.5270.102 
Concrete  5.3240.104  5.230.12  4.7020.116  3.0820.018  3.040.02  3.0540.043 
Energy  1.3740.045  1.660.04  0.4670.015  1.7670.024  1.990.02  1.2980.005 
Kin8nm  0.0900.001  0.100.00  0.0750.001  0.9840.008  0.950.01  1.1620.008 
Naval  0.0040.000  0.010.00  0.0010.000  4.0890.012  3.800.01  5.5010.121 
Combined  4.0330.033  4.020.04  3.9760.037  2.8150.008  2.800.01  2.7940.009 
Protein  4.6060.013  4.360.01  4.2550.019  2.9470.003  2.890.00  2.8680.005 
Wine  0.6090.010  0.620.01  0.6290.008  0.9250.014  0.930.01  0.9580.015 
Yacht  0.8640.052  1.110.09  0.7370.068  1.2250.042  1.550.03  2.1230.010 
Year  8.684NA  8.849NA  8.950NA  3.580NA  3.588NA  3.615NA 
To quantitatively measure the predictive ability of BNNs with KIVI as the inference method, we use standard multivariate regression benchmarks from recent works (Table 1
), such as probabilistic backpropagation (PBP)
(HernandezLobato & Adams, 2015). We compare with stateoftheart methods: the Bayesian interpretation of dropout (Gal & Ghahramani, 2016) and stein variational gradient descent (SVGD) (Liu & Wang, 2016) ^{1}^{1}1Note that VMG (Louizos & Welling, 2016) and PBPMV (Sun et al., 2017) used adaptive weight priors, which are different from the common setting of standard normal priors; the former also additionally used variational dropout, thus their results are not directly comparable to those of the works discussed here as well as ours.. Following the setup in PBP, we use BNNs with one 50unit hidden layer except in the two large datasets, i.e., Protein Structure and Year Predication MSD, where 100 units are used. We randomly select of the whole dataset for training and leave the rest for testing. We also put a Gamma prior on the precision of the observation noise to adaptively learn it (see Appendix E). For all datasets, we setand set the batch size to 100 and the learning rate to 0.001. The model is trained for 3000 epochs for the small datasets with less than 1000 data points, and 500 epochs for the others. We report the mean errors and standard deviations averaged over 20 runs, except 5 runs for
Protein Structure and 1 run for Year Predication MSD. As networks used in these tasks are of a small scale, we use MLPs with one hidden layer in the implicit variational posterior (see Appendix G.2.1 for details).Table 1 shows the results with the best ones marked in bold. Results of SVGD and dropout are cited from their papers, which have the same setting as ours. We can see that KIVI consistently outperforms SVGD and dropout on both RMSE and testLL for most datasets. Especially on RMSE, KIVI has significant improvements over them except on Wine and Year Predication MSD
. It suggests that KIVI enables the implicit variational posterior to capture the predictive uncertainty in network parameters, which is hard to be fully described by a mixture of two delta distributions (dropout) and a fixed set of particles (SVGD). We emphasize that although the nonparametric nature of SVGD has also made the approximation more flexible, it uses the same set of particles throughout the inference procedure, while each iteration of KIVI generates a new set of particles. Thus the implicit posterior learned by KIVI is smoothed by the parametric model. Recently, normalizing flows have shown good performance on BNNs
(Louizos & Welling, 2017). So we also experiment with directly applying normalizing flows to this task. The results are reported in Appendix F.2.6.2.2 Classification


For classification, we present the results on MNIST, which consists of 60,000 training images and 10,000 test images of handwriting digits. Compared to the above datasets in regression, MNIST has a much higher feature dimension, introducing millions of parameters even in moderatesize networks, which brings big challenges for BNNs. As a standard benchmark, the performance on MNIST can be improved by many other techniques, such as convolution, generative pretraining, data augmentation, etc. To ensure a fair comparison, we follow the settings from BayesByBackprop (Blundell et al., 2015) and focus on improving the performance of ordinary MLPs without using any of these techniques. The network structures used are three MLPs, all with two ReLU hidden layers, and the layer sizes are 400, 800 and 1200, respectively. For KIVI, we used MMNNs with two hidden matrix multiplication layers in the implicit posterior (see Appendix G.2.2 for details). We set , and train for 300 epochs with the batch size as 100. The initial learning rate is 0.001 and is annealed every 100 epochs by ratio 0.5. We used the last 10,000 samples of the training set as the validation set for model selection.
Fig. 2 summarizes the results. We can see that KIVI achieves better accuracy compared to plain SGD (Simard et al., 2003), dropout (Srivastava et al., 2014) and BayesByBackprop (Blundell et al., 2015) on all three types of MLPs. KIVI even performs better than BayesByBackprop with a changed prior (scale mixture), which makes the model itself more flexible than ours. When the layer size is 800, our result is comparable to that of an ensemble of 5 networks with dropconnect (Wan et al., 2013), which demonstrates that the implicit posterior has been learned to account for most model uncertainty.
We also conduct experiments with the priorcontrastive (PC) method (the counterpart of AVB for global latent variables, see Sec. 2). The key challenge for applying PC here is that the posterior samples are extremely highdimensional, and if fed into discriminators like neural networks, they will cause unaffordable computation cost. To get around this, we use a logistic regression as the discriminator in PC. The experiment settings of PC are reported in Appendix G.2.2. The training lower bounds of the two methods are plotted in Fig. 2. We can see that in the beginning they increase at the same pace, then PC fails to converge with lower bound explosion while KIVI improves consistently. The explosion is mainly because the input to the discriminator is of hundreds of thousands of dimensions, and plain logistic regression cannot produce reliable density ratio estimates. We also experiment with PC for layer size 800 and 1200. They both fail to converge in the end.
6.3 Variational Autoencoders


Variational Autoencoders: (a) Gaussian posterior vs. implicit posterior, where fc denotes a fullyconnected layer. They are used by the plain VAE and KIVI, respectively; (b) Training and evaluation curves of the lower bounds on statically binarized MNIST.
As stated in Sec. 3.2, KIVI is applicable to both global and local LVMs, and the latter can be learned using an amortized scheme (i.e., use instead of ). Here we present an application on training variational autoencoders (VAE) with implicit posteriors to demonstrate this. The generative process of VAEs proceeds as
where are latent features, are observations, and is the output distribution for modeling the observations, which takes the outputs of the neural network () as parameters. We conduct experiments on two widely used datasets for generative modeling: binarized MNIST and CelebA (Liu et al., 2015).
is a Bernoulli distribution for binarized MNIST, while a normal distribution for CelebA. For VAE the variational lower bound has the same form as Eq. (
1), except that is replaced by . The original VAE parameterizes aswhere is a neural network that outputs the parameters of the normal distribution. In the MNIST case, is an MLP with two hidden layers, which is illustrated in Fig. 2(a) (left). To form an implicit posterior, a direct choice is to move the stochastic noise from the output layer to the penultimate hidden layer, as illustrated by Fig. 2(a) (right).
Before applying KIVI, a crucial question is what we expect to get by using implicit posteriors for training VAEs. One target could be that we may get tighter lower bounds of the data loglikelihood (LL) because the algorithm searches in a larger variational family for the optimal lower bound. This suggests, however in a very weak way, that doing optimization in the larger space will lead to better test LL, given the optimization always arrives at local optima. Previously Adversarial Variational Bayes (AVB) has shown some results on MNIST, by comparing the test LL of the plain VAE and the VAE trained by AVB, using golden truths estimated by annealed importance sampling (AIS) (Wu et al., 2016). However, for the results reported in AVB, the model architectures used by the plain VAE and AVB are very different, which leads to concerns about which part of the change contributes to the improved likelihoods.
Here we adopt another setting to better demonstrate the gain from implicit posteriors. We observe that the key improvement of implicit posteriors is that objectives with them average over a much broader range of posterior configurations. This effect not only contributes to a larger search space that contains tighter lower bound values, but also makes the VAE model better prevent overfitting. To verify that KIVI keeps this property, we conduct experiments on a statically binarized MNIST dataset and use models with no prior knowledge of the problem (MLPs instead of convnets), which is a typical setting that leads to overfitting. The latent dimension of the VAE model is 8, and is an MLP with two hidden layers of size 500. The parameters for KIVI are . More details can be found in Appendix G.3. Fig 2(b) shows the training and testing curves of VAEs with or without KIVI. It can be seen that the lower bound gap between the training and testing curves of the plain VAE are much larger than that of KIVI, which indicates that the VAE trained by KIVI is less prone to overfitting. After 3k epochs we evaluate the test LL on 2048 test images using AIS and get 97.3 for the plain VAE, while 94.8 for the VAE trained by KIVI.






To demonstrate that KIVI scales to larger models, we trained a VAE with the same network structure used by DCGAN (Radford et al., 2015) on CelebA using KIVI. The latent dimension in this case is 32. The implicit posterior is constructed in a way similar to the one shown in Fig. 2(a), with the bottom hidden layers symmetric to the decoder network (See Appendix G.3 for details). To visually check the latent space learned by KIVI, we show the reconstruction results of linearly interpolating between the latent vectors of two real images after 25 epochs (Fig. 4), when the model has converged. Compared to the latent space learned by AVB after the same epochs (we use the public code from AVB and set the same decoder structure), we find ours are smoother, and the interpolated images are much sharper. More interpolation results through the training process are presented and compared in Appendix F.5. We also use the learned model to generate 10,000 images and evaluate the sample quality using Fréchet Inception Distance (FID) (Heusel et al., 2017). The FID scores achieved at epoch 25 by AVB and KIVI are 160 and 41, respectively (smaller is better). In fact many efforts are required to make AVB successfully train a model for CelebA to produce results shown in the figure. As reported in Mescheder et al. (2017), the log prior is explicitly added to the discriminator (), while KIVI does not need much tuning: there is no need to carefully design a discriminator, and the only two hyperparameters (i.e., and the clipping value) both have clear meanings.
7 Conclusions
We present an implicit VI method named Kernel Implicit Variational Inference (KIVI), which provides a principled way of tuning biasvariance tradeoff and makes implicit VI computationally feasible for models with highdimensional latent variables. We successfully applied this approach to Bayesian neural networks and achieved superior performance on both regression and classification tasks. We also demonstrate that KIVI can be applied to learn local latent variable models like VAEs. Future work may include applying this method to largerscale networks and improving the kernel estimator further.
Acknowledgments
The work was supported by the National Natural Science Foundation of China (NSFC) Projects (Nos. 61620106010, 61621136008), Beijing Natural Science Foundation No. L172037, Tsinghua Tiangong Institute for Intelligent Computing, the NVIDIA NVAIL Program and a Project from Siemens.
References
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Carmeli et al. (2010) Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanitá. Vector valued reproducing kernel hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.
 Dai et al. (2015) Bo Dai, Niao He, Hanjun Dai, and Le Song. Provable bayesian inference via particle mirror descent. arXiv preprint arXiv:1506.03101, 2015.
 Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Dumoulin et al. (2016) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

Gal & Ghahramani (2016)
Yarin Gal and Zoubin Ghahramani.
Dropout as a bayesian approximation: Representing model uncertainty in deep learning.
In International Conference on Machine Learning, pp. 1050–1059, 2016. 
Ghahramani (2015)
Zoubin Ghahramani.
Probabilistic machine learning and artificial intelligence.
Nature, 521(7553):452–459, 2015.  Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 HernandezLobato & Adams (2015) Jose Miguel HernandezLobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1861–1869, 2015.
 Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500, 2017.

Hido et al. (2011)
Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi
Kanamori.
Statistical outlier detection using direct density ratio estimation.
Knowledge and information systems, 26(2):309–336, 2011. 
Hoffman (2017)
Matthew D Hoffman.
Learning deep latent gaussian models with markov chain monte carlo.
In International Conference on Machine Learning, pp. 1510–1519, 2017.  Hoffman et al. (2013) Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Huszár (2017) Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
 Inselberg & Dimsdale (1987) Alfred Inselberg and Bernard Dimsdale. Parallel coordinates for visualizing multidimensional geometry. In Computer Graphics 1987, pp. 25–44. Springer, 1987.
 Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A leastsquares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 Li & Turner (2017) Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.
 Li et al. (2017) Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised mcmc. arXiv preprint arXiv:1702.08343, 2017.
 Liu & Feng (2016) Qiang Liu and Yihao Feng. Two methods for wild variational inference. arXiv preprint arXiv:1612.00081, 2016.
 Liu & Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2370–2378, 2016.

Liu et al. (2015)
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.
Deep learning face attributes in the wild.
In
Proceedings of International Conference on Computer Vision (ICCV)
, 2015.  Louizos & Welling (2016) Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.
 Louizos & Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
 Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
 Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 Neal (2012) Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Paisley et al. (2012) John Paisley, David Blei, and Michael Jordan. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Ranganath et al. (2016) Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pp. 496–504, 2016.
 Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Schölkopf et al. (2001) Bernhard Schölkopf, Ralf Herbrich, and Alex Smola. A generalized representer theorem. In Computational learning theory, pp. 416–426. Springer, 2001.
 Shi et al. (2017) Jiaxin Shi, Jianfei Chen, Jun Zhu, Shengyang Sun, Yucen Luo, Yihong Gu, and Yuhao Zhou. ZhuSuan: A library for Bayesian deep learning. arXiv preprint arXiv:1709.05870, 2017.

Simard et al. (2003)
P. Y. Simard, D. Steinkraus, and J. C. Platt.
Best practices for convolutional neural networks applied to visual document analysis.
In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., pp. 958–963, Aug 2003. doi: 10.1109/ICDAR.2003.1227801.  Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746, 2016.
 Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. Anicemc: Adversarial training for mcmc. arXiv preprint arXiv:1706.07561, 2017.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
 Sugiyama et al. (2008) Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pp. 1433–1440, 2008.
 Sugiyama et al. (2012) Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Densityratio matching under the bregman divergence: a unified framework of densityratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
 Sun et al. (2017) Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in bayesian neural networks. In Artificial Intelligence and Statistics, pp. 1283–1292, 2017.
 Tran et al. (2017) Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
 Trippe & Turner (2017) Brian L. Trippe and Richard E. Turner. Overpruning in variational bayesian neural networks. In Advances in Approximate Bayesian Inference workshop, NIPS, 2017.
 Uehara et al. (2016) Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning (ICML13), pp. 1058–1066, 2013.
 Wu et al. (2016) Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoderbased generative models. arXiv preprint arXiv:1611.04273, 2016.
 Zhu et al. (2017) Jun Zhu, Jianfei Chen, Wenbo Hu, and Bo Zhang. Big learning with bayesian methods. National Science Review (NSR), 2017.
Appendix A Proof of Proposition 1
Proof.
Let denote the span of the representers of the sample points:
where we label as and as . Define the orthogonal complement to be
Because , it can be decomposed into two parts:
where , and . Note that the squared loss is not changed by the orthogonal component :
Meanwhile the regularization term can be decomposed into . So the optimal solution will have , which indicates that . ∎
Appendix B Derivation of the Optimal Density Ratio Function
Substituting Eq. (5) into Eq. (4) and removing the constant, we get
where , , and . Taking derivatives w.r.t. and and setting them to zeros:
Rearranging the terms, we have
(8)  
(9) 
Left multiplying Eq. (8) with and then subtracting Eq. (9) yields
Substituting the optimal into Eq. (8), we get
Note that although is used in the derivation, the forms of the solution do not require to be invertible. In fact, if
has zero eigenvalues (not invertible), the objective
is not bounded below since one can always choose an in the null space of to decrease it. In other words, the form of the solution is well defined in any condition and is optimal when the objective is well defined.Appendix C Gradients of the KL Term
The formula above shows that a good density ratio estimator gives accurate gradient estimation of .
Appendix D KIVI with Adaptive Contrast
Although KIVI gives rather good estimation of the density ratio, the estimation accuracy still degrades with larger discrepancy between and . The problem is very critical for local latent variable models like VAEs because the same variational model is required to infer posteriors of all local latent variables. In order to mitigate that, AVB (Mescheder et al., 2017) adopted a technique called Adaptive Contrast (AC), which can easily be integrated with KIVI. AC introduces an auxiliary tractable distribution that resembles . With , the ELBO can be rewritten as
Gradients of the first term w.r.t. can be easily computed using Monte Carlo, and gradients of the second term can be estimated using KIVI. Adaptive contrast gives better estimates if approximates well. Because
is required to have a tractable density, a commonly used adaptive distribution is a Gaussian distribution whose mean
and standard derivation match with . In practice and are estimated from the samples of . According to the invariance of KL divergence under reparameterization, we havewhere denotes the distribution of and denotes the standard normal distribution. Under this reparameterization, we only need to estimate the density ratio between two distributions with zero means and unit variances.
Appendix E Lower Bound with GammaPrior Precision
In the multivariate regression task, the output is sampled from a normal distribution with as the mean and a parameter as the variance. The variance controls the likelihood of the model, therefore, choosing an appropriate variance is essential. We place a Gamma prior on its reciprocal (i.e., the precision of the normal distribution). The variational posterior we used is . Then the ELBO can be calculated as
where is the digamma function and can be calculated in closedform.
Appendix F Additional Experiment Results
f.1 2D Bayesian Logistic Regression





We also conduct experiments on a 2D Bayesian logistic regression example, which has an intractable posterior. The model is
where ;
is the sigmoid function.
data points () are generated from the true model as the training data (Fig. 4(a)). The unnormalized true posterior is plotted in Fig. 4(b). As a baseline, we first run VI with a factorized normal distribution. The result is shown in Fig. 4(c). It can be clearly seen that the factorized normal can capture the position and the scale of the true posterior but cannot fit well to the shape due to its independence across dimensions.We then apply KIVI. The implicit posterior we use is a simple stochastic neural network (see Appendix G.1). To demonstrate how good the result is, we also run Hamiltonian Monte Carlo (HMC) to get posterior samples. The results are plotted in Fig. 4(d) and 4(e). We can see that the implicit posterior is learned to capture the strong correlation between the two dimensions and can produce posterior samples that have a similar shape with the samples drawn by HMC.


We also use this experiment to illustrate the importance of the reverse ratio trick. See Figure 6 for the implicit variational distribution learned by only optimizing . In this way we expect the learned to be close to the prior . The result produced by using the trick is compared to the result without it. We can see that the latter fails to work well.
RMSE  Factorized  NF  KSD VI  KIVI  HMC 

boston  3.420.19  3.430.19  4.100.25  2.800.17  2.200.05 
concrete  6.000.10  6.040.10  12.490.19  4.700.12  4.270.13 
energy  2.420.06  2.480.09  5.540.14  0.470.02  0.470.01 
kin8nm  0.090.00  0.090.00  0.260.00  0.080.00  0.070.00 
naval  0.010.00  0.010.00  0.010.01  0.000.00  0.000.00 
LL  Factorized  NF  KSD VI  KIVI  HMC 
boston  2.660.04  2.660.04  3.320.01  2.530.10  2.290.01 
concrete  3.220.06  3.240.06  4.130.01  3.050.04  2.810.02 
energy  2.340.02  2.360.03  3.610.00  1.300.01  1.430.01 
kin8nm  0.960.01  1.010.01  0.180.01  1.160.01  1.220.01 
naval  4.000.11  4.040.12  3.280.13  5.500.12  7.310.00 
f.2 Comparison with More Methods
In this section we present comparisons with two more methods towards flexible posteriors, namely, normalizing flow (Rezende & Mohamed, 2015) and KSD variational inference (KSD VI) (Liu & Feng, 2016). We also include the results by HMC as the ground truth.
Normalizing flow The basic idea of normalizing flow has been introduced in Section 1. Specifically, given a random variable following a simple distribution , and an invertible mapping so that , the probability density of the transformed variable can be calculated as
(10) 
Thus we can construct complex distributions by composing invertible mappings (), while keeping the probability density of the result distribution tractable by successively applying Eq. (10). For example, Rezende & Mohamed (2015) proposed the planar normalizing flow:
where are free parameters, and is a smooth elementwise nonlinearty, chosen as Tanh in the following experiments. A simple variational posterior can be made more expressive when this parametric transformation is employed.
KSD VI KSD variational inference (Liu & Feng, 2016) is a method that directly minimizes the kernelized Stein discrepancy (KSD) between the true posterior () and the variational posterior ():
(11)  
where is a unit ball in the vectorvalued RKHS induced by the RBF kernel , and . Note that the objective in Eq. (11) only requires samples from , so KSD VI also belongs to implicit VI methods.
We conduct the regression experiments in Section 6.2.1. The results are shown in Table 2. For normalizing flow, we apply 10 planar flows on the weights to match the running time of our implicit posteriors. To see whether flows help the inference, we also present results of VI using factorized normal distributions on the weights (with their means and standard deviations optimized by the reparameterization trick). For KSD VI, we use the same implicit posteriors as those we used for KIVI in Section 6.2.1. For VI with factorized normal distributions and normalizing flow, we use 100 samples, batch size 10 except that for kin8nm and naval we use batch size 100. We set the learning rate to 0.01 and run 500 epochs for 20 times. For KSD VI, we set all the training parameters (number of samples, number of epochs, batch size, and learning rate) the same as KIVI’s. We also run HMC to produce the ground truths. For HMC, we use 20 chains, 150000 iterations and set the target acceptance rate to 97% to adapt the step size. As the HMC experiment is timeconsuming, we only perform 10 runs.
From Table 2 we can see that normalizing flow does not show improvements over VI with factorized normal distributions. This is probably due to optimization challenges caused by the limited form of planar flows, thus the inference procedure takes little benefit from the flexibility introduced by the flow. For KSD VI, we found it is very sensitive to the initialization of implicit posteriors. We had to set the variance of the input Gaussian noise larger so that KSD VI would not diverge at the very beginning of training. KSD VI also soon converges to unsatisfying local optima after the optimization process starts. These two findings are well explained by the conclusion that KSD is the magnitude of a functional gradient of KL divergence (Liu & Wang, 2016), then all saddle points in the original problem of optimizing the KL divergence become local optima when optimizing KSD.
f.3 Visualization of Inferred Posteriors
In order to visually check the quality of the uncertainty estimated by KIVI, we use a visualization technique named parallel coordinates (Inselberg & Dimsdale, 1987) to plot the posterior over weights. We compare the posteriors inferred in the regression experiment on Boston housing (Table 2) by VI with factorized normal approximation, HMC, and KIVI.
The feature dimension of the Boston housing data is 13. And the BNN used is an MLP with a hidden layer of size 50. So the network has two weight matrices: and (The bias parameters are included). Since has 700 dimensions, we only plot the posterior samples of to avoid visual clutter. Specifically, we draw 100 samples from the posterior: , and the goal is to show these 51dimensional samples. In parallel coordinates each sample is plotted as a polyline (see Figure 7
). The vertices on the polyline represent each single dimension. Their positions on the horizontal axis are the indices of the dimension (from 0 to 50). And the vertical coordinates represent the weight values. Note that an important fact about hidden neurons in neural networks is that they are nonidentifiable (any two hidden nodes can be exchanged without affecting the output distribution). This indicates that all dimensions of
are nonidentifiable. Different inference algorithms may converge to different local modes caused by this symmetry. To make the visualizations comparable, we sort the dimensions of for each algorithm by the mean value of posterior samples in each dimension.From the visualization we could see that BNNs with factorized normal approximation have significant overpruning problems. A large proportion of hidden nodes are turned off during the inference and the information is mainly carried by several others. These pruned weights can be identified by their low signaltonoise ratio , where is the mean, and is the standard deviation. This problem has been pointed out previously, both in VAE works (Sønderby et al., 2016; Hoffman, 2017) and BNN works (Blundell et al., 2015), and can be explained by the looseness of the variational bound when factorized approximation is used (Trippe & Turner, 2017). In contrast, the ground truth by HMC does not have the pruning problems. And there are very strong correlations captured if we observe the neighboring dimensions.^{2}^{2}2Note that in HMC the samples plotted are drawn from a single chain. This is also because the nonidentifiability of weights, since different chains with different initializations tend to converge to different local modes that cannot be identified from each other. So if we plotted all the chains, there were too much visual clutter, and to be fair, we would need to run the two VI methods with different initializations. KIVI also does not have the pruning problems and could capture the strong correlations across the dimensions. And with the gain in accuracy, there is still a good amount of uncertainty in the implicit posterior. Despite the weight dimensions are nonidentifiable, we could still see that the two VI methods arrive at biased solutions compared to the ground truth by HMC in terms of scales. Note that this is not necessarily problematic since VI is typically known to produce biased solutions.



f.4 Accuracy of KL Estimation
From Section 3.1 we know that the approximate gradients of the ELBO are directly related to the KL estimates. So we would like to assess the accuracy of the KL estimator. We adopt the settings from the VAE experiment on MNIST (Section 6.3). For comparison, we must also be able to compute a ground truth of the KL term. To achieve this, we use normalizing flow in , which has a complicated but tractable density. Thus we can get a good Monte Carlo estimate of the true KL term: In Figure 7(a) we compare the KL term estimated using KIVI with the ground truth. Note that since we use adaptivecontrast (AC) (see Appendix D) in the VAE experiments, where the KL term is broken down into two parts, it should make more sense to look at the only part that uses the density ratio estimator (the KL divergence between the standardized and a standard normal distribution), which is plotted in Figure 7(b). We can see that the KL estimates closely track the ground truth, and are more accurate as the variational approximation improves over time.


f.5 Change of Interpolation Results on CelebA Through Training
In Figures 14, 13, 12, 11, 10 and 9 we present the generated images for the interpolation experiments on CelebA through the training process. The images are generated after 1, 5, 10, 15, 20, and 25 epochs. Results on the left are produced by AVB, and on the right by KIVI. It can be clearly seen that AVB’s training process is of very high variance, as we have mentioned in Section 2.












Appendix G Details of Experiments
g.1 Toy Experiments
1D Gaussian Mixture The implicit posterior generates samples by propagating samples from a standard normal distribution through a twolayer MLP with 10 hidden units and one output unit. We set the regularization coefficient to and the density ratio clipping threshold to .
2D Bayesian Logistic Regression The inputs are 200 points randomly sampled from . The outputs are the predictions of with randomly sampled weights from the prior. For HMC, we run 100 chains, 200 iterations each, and use 10 leapfrog steps. The step size is automatically adapted using dual averaging, starting from 0.001. We discard the first 100 samples generated. For factorized VI, the training is run for 100 epochs and 100 samples are used. In the training, we anneal the learning rate linearly according to . For KIVI we follow the same setting, except that we use 1k samples. The regularization coefficient and the minimal density ratio are set to 0.1 and , respectively. Below we describe the sample generation process of the implicit variational posterior used in KIVI. First, some 2D random normal samples are propagated through two fullyconnected layers of size 20, producing
(we do not use activation functions for the second layer), and then
is added with another random normal noise with trainable variances, producing . Finally, we propagate through a fullyconnected layer of size 20 and then a linear output layer, getting the variational samples.g.2 Bayesian Neural Networks
g.2.1 Regression
As the regression datasets have small feature dimensions (all less than 15, except 90 for Year), using BNNs of one hidden layer (50 units) does not produce very highdimensional weights. Therefore, we still use MLPs in the implicit variational posterior, of which the samples are generated by propagating samples from a standard normal distribution through an MLP. For all datasets, we use ReLU as the activation function. The MLP has one hidden layer except that for Yacht it has two hidden layers.
We list the details in Table 3, which consist of 10 datasets and 2 weight matrices each. Taking Layer 1 for Boston as an example, (20, 30, N1) represents that 20 random normal samples are generated and propagated through an MLP with a hidden layer of size 30 and an output layer of size N1. Note that N1 corresponds to the number of weights in the first layer of the BNN.
.7in.7in Boston Concrete Energy Kin8nm Naval Layer 1 (20, 30, N1) (30, 50, N1) (100, 500, N1) (100, 500, N1) (100, 500, N1) Layer 2 (20, 30, N2) (30, 50, N2) (50, 100, N2) (50, 100, N2) (50, 100, N2) Combined Protein Wine Yacht Year Layer 1 (100, 500, N1) (100, 500, N1) (20, 10, N1) (100, 800, N1, N1) (100, 500, N1) Layer 2 (100, 500, N2) (100, 500, N2) (5, 20, N2) (50, 200, 51, N2) (100, 500, N2)
g.2.2 Classification
MNIST classification needs a larger scale network than the one used in multivariate regression. Therefore, we adopt an MMNN in the implicit variational posterior. We denote the hiddenlayer size of the BNN as (In our experiments or ). Below we set when , otherwise we set . In the MMNN, we use two matrix multiplication layers.
For the firstlayer weights (), the two hidden matrix multiplication layers are both of size and are with ReLU activations. The output layer of the MMNN is of size and is with linear activations. The input matrices are random samples of size drawn from a standard normal distribution.
For the secondlayer weights (), the two hidden matrix multiplication layers are both of size and are with ReLU activations. The output layer of the MMNN is of size and is with linear activations. The input matrices are random samples of size drawn from a standard normal distribution.
For the thirdlayer weights (), the two hidden matrix multiplication layers are both of size and are with ReLU activations. The output layer of the MMNN is of size and is with linear activations. The input matrices are random samples of size drawn from a standard normal distribution.
We use the above variational posterior settings in both KIVI and PC. For PC, a logistic regression serves as the discriminator. We set the regularization coefficient to 0.001 and the minimal density ratio to for KIVI. Both two methods use 10 samples, and we set the batch size to 100 and the learning rate to 0.001 in training.
g.3 Variational Autoencoders
The decoders used in the MNIST experiment are MLPs with two hidden ReLU layers. The latent dimension is 8. Each hidden layer is of size 500. The implicit posterior is also an MLP with two hidden ReLU layers, with Gaussian noises of 500 dimensions added to the first hidden layer. The noise has zero means and 500dimensional trainable variances. Training is with the batch size 128. The learning rate is 0.001, which is annealed by a factor of 0.5 every 200 epochs. The parameters for KIVI in this case are , and the clipping value is set to .
The decoders used in the CelebA experiment have exactly the same structure with the one used for images in the DCGAN paper (Radford et al., 2015). The latent dimension is 32. The implicit variational posterior is a deep convolutional neural network with a symmetric structure to the decoder, except that the output of the last convolutional layer is flattened and is added a Gaussian noise of the same shape as the last dimension. The noise has zero means and trainable variances. The last hidden layer is fullyconnected and has 500 ReLU units. For AVB we use the same decoder. For both KIVI and AVB, we use batch size 64. The other training parameters of AVB follow from its original code for CelebA. The parameters for KIVI in this case are
Comments
There are no comments yet.