# Stochastic Variational Inference via Upper Bound

Stochastic variational inference (SVI) plays a key role in Bayesian deep learning. Recently, various divergences have been proposed to design the surrogate loss for variational inference. We present a simple upper bound of the evidence as the surrogate loss. This evidence upper bound (EUBO) equals the log marginal likelihood plus the KL-divergence between the posterior and the proposal. We show that the proposed EUBO is tighter than previous upper bounds introduced by the χ-divergence or the α-divergence. To facilitate scalable inference, we present a numerical approximation of the gradient of the EUBO and apply the SGD algorithm to optimize the variational parameters iteratively. A simulation study with Bayesian logistic regression shows that the upper and lower bounds sandwich the evidence well and that the proposed upper bound is favorably tight. For Bayesian neural networks, the proposed EUBO-VI algorithm outperforms state-of-the-art results on various examples.

## 1 Introduction

Stochastic variational inference (SVI) plays a key role in Bayesian deep learning. SVI solves the Bayesian inference problem by introducing a variational distribution $q(\theta;\lambda)$ over the latent variables Jordan_99 ; Hoffman_etal_2013 , and then minimizes the Kullback-Leibler (KL) divergence between the approximating distribution and the exact posterior $p(\theta|D)$. This minimization is equivalent to maximizing the evidence lower bound (ELBO) Jordan_99 , which is a lower bound of the model evidence $\log p(D)$. The KL-divergence $\mathrm{KL}(q\|p)$ is the discrepancy between the surrogate loss (the ELBO) and the evidence. SVI thus turns Bayesian inference into an optimization problem. It has been recognized that the properties of this discrepancy strongly influence the optimization of the ELBO Li_Turner_2017 ; Dieng_etal_2017 ; Zhang_etal_2018 . The classical objective leads to approximate posteriors with zero-forcing behavior. This behavior is undesirable because it may lead to underestimation of the posterior's support, especially when dealing with light-tailed or multi-modal posteriors Murphy_2012 ; Hensman_etal_2014 ; Dieng_etal_2017 . Various SVI methods use different divergences to achieve a tighter bound and/or the mass-covering property, which may lead to better performance in SVI Li_Turner_2017 ; Dieng_etal_2017 ; Tao_etal_2018_VI . Recently, upper bounds of $\log p(D)$ have attracted attention. For instance, the variational Rényi bound ($\mathcal{L}_\alpha$, for $\alpha < 0$) and the χ upper bound ($\mathrm{CUBO}_n$, for $n \ge 1$) are introduced to provide the mass-covering property: optimizing such a divergence leads to a variational distribution with mass-covering (or zero-avoiding) behavior. The resulting gradient of the α-divergence based upper bound is,

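$$\nabla_\lambda \mathcal{L}_\alpha = \frac{1}{1-\alpha}\,\mathbb{E}_q\Big[ w_\alpha \nabla_\lambda\big(\log p(D,\theta) - \log q(\theta;\lambda)\big)\Big] \tag{1}$$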

where the weight $w_\alpha$ is defined as in Li_Turner_2017 . Interestingly, when $\alpha \to 1$, the Rényi-α divergence becomes the KL-divergence and the Rényi-VI algorithm reduces to standard SVI Hoffman_etal_2013 ; when $\alpha = 0$, Rényi-VI becomes the importance weighted auto-encoder (IWAE), also known as importance weighted VI Burda_etal_2015IWVAE ; Domke_2018_IWVI ; when $\alpha = 1-n$, it becomes the χ-divergence based CHIVI algorithm Dieng_etal_2017 . The integration in the gradient $\nabla_\lambda \mathcal{L}_\alpha$ can be estimated by a Monte Carlo approach with samples drawn from $q(\theta;\lambda)$. Then the stochastic gradient descent (SGD) algorithm is applied to optimize the variational parameter $\lambda$. Moreover, to deal with large datasets, we may use a minibatch of the data in the evaluation of the joint distribution $p(D,\theta)$ Ranganath_etal_2014 . Inspired by these previous works, we explore a simple but effective upper bound for variational inference. Our main contributions are summarized as follows:

1. We propose a new upper bound of the evidence by introducing the KL-divergence between the posterior and the variational distribution. The proposed EUBO possesses the mass-covering property and is tighter than the upper bounds introduced by the χ-divergence or the α-divergence.

2. We present a numerical approximation of the gradient of the EUBO, apply the SGD algorithm to optimize the variational parameters, and formulate a black-box style variational inference algorithm that is suitable for scalable Bayesian inference.

3. With the upper and lower bounds sandwiching the evidence, we can evaluate the model fit with more confidence. We also find that the EUBO converges faster and is tighter than the ELBO. In Bayesian neural network regression, the EUBO-VI algorithm improves both the test error and the model fit.

## 2 Variational Inference with the EUBO

### 2.1 Upper bound of the model evidence

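It is well known that the KL-divergence is not symmetric: $\mathrm{KL}(q\|p)$ possesses the zero-forcing property, whereas the reverse KL-divergence $\mathrm{KL}(p\|q)$ possesses the mass-covering property. This motivates us to use $\mathrm{KL}(p\|q)$ in the design of the surrogate loss. Using Gibbs' inequality, $-\int p(\theta|D)\log p(\theta|D)\,d\theta \le -\int p(\theta|D)\log q(\theta;\lambda)\,d\theta$, which holds for any distribution $q(\theta;\lambda)$, we introduce an upper bound on the model evidence,

$$\begin{aligned}
\log p(D) &= \int p(\theta|D)\log p(D,\theta)\,d\theta - \int p(\theta|D)\log p(\theta|D)\,d\theta \\
&\le \int p(\theta|D)\log p(D,\theta)\,d\theta - \int p(\theta|D)\log q(\theta;\lambda)\,d\theta \\
&= \int p(\theta|D)\log\frac{p(D,\theta)}{q(\theta;\lambda)}\,d\theta =: U(\lambda).
\end{aligned} \tag{2}$$

We define $U(\lambda)$ as the Evidence Upper BOund (EUBO). It is easy to see that $U(\lambda) - \log p(D) = \mathrm{KL}(p(\theta|D)\,\|\,q(\theta;\lambda))$; that is, $\mathrm{KL}(p\|q)$ is exactly the discrepancy between the EUBO and the true $\log p(D)$.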
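The sandwich relation can be checked numerically on a small example. The sketch below assumes a toy conjugate model that is not part of the paper's experiments (prior $\theta \sim N(0,1)$, likelihood $x_n|\theta \sim N(\theta,1)$), for which the posterior is available in closed form, and evaluates the ELBO and the EUBO by simple quadrature on a grid. The variational distribution `q = N(0.5, 0.6^2)` and all variable names are illustrative choices.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

# Toy model (assumption): theta ~ N(0, 1), x_n | theta ~ N(theta, 1).
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)
N = x.size

# Conjugacy gives the exact posterior N(post_mean, post_std^2).
post_mean = x.sum() / (N + 1)
post_std = (N + 1) ** -0.5

# Dense grid for a plain Riemann-sum quadrature.
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]

log_joint = (log_normal(x[:, None], grid[None, :], 1.0).sum(axis=0)
             + log_normal(grid, 0.0, 1.0))
log_post = log_normal(grid, post_mean, post_std)
# log p(D) = log p(D, theta) - log p(theta | D) holds for every theta.
log_evidence = (log_joint - log_post)[grid.size // 2]

# A deliberately mismatched variational distribution.
q_mean, q_std = 0.5, 0.6
log_q = log_normal(grid, q_mean, q_std)
log_ratio = log_joint - log_q

elbo = np.sum(np.exp(log_q) * log_ratio) * dx     # E_q[log p(D,theta) - log q]
eubo = np.sum(np.exp(log_post) * log_ratio) * dx  # E_{p(theta|D)}[log p(D,theta) - log q]

# Closed-form KL(p(theta|D) || q) between two Gaussians.
kl_pq = (np.log(q_std / post_std)
         + (post_std ** 2 + (post_mean - q_mean) ** 2) / (2.0 * q_std ** 2)
         - 0.5)

print(f"ELBO = {elbo:.4f}, log p(D) = {log_evidence:.4f}, EUBO = {eubo:.4f}")
print(f"EUBO - log p(D) = {eubo - log_evidence:.6f}, KL(p||q) = {kl_pq:.6f}")
```

In this setting the gap between the EUBO and the log evidence should agree with the closed-form $\mathrm{KL}(p\|q)$ up to quadrature error.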

### 2.2 Some properties of the EUBO

We present three favorable properties of the proposed EUBO $U(\lambda)$. First, the discrepancy $\mathrm{KL}(p\|q)$ possesses the mass-covering property. According to previous studies Li_Turner_2017 ; Dieng_etal_2017 , the mass-covering property is advantageous for approximating the posterior. The mass-covering property of the EUBO can be easily verified by simulation examples Minka_05_Divergence ; Ji_06_adaptiveMC . Second, $\mathrm{KL}(p\|q)$ is a tight discrepancy. We find that the $\mathrm{KL}(p\|q)$ used in this work is tighter than the χ-divergence and the α-divergence. With this tighter discrepancy, we can design tighter upper bounds. As discussed in Burda_etal_2015IWVAE ; Tao_etal_2018_VI , a tighter bound tends to lead to better performance in SVI. We present a theorem to show the advantage of the proposed EUBO (see the appendix for a brief proof).

Theorem 1. Define the EUBO as equation (2). Then the following holds:

• ELBO ≤ $\log p(D)$ ≤ EUBO.

• , for ; , for .

• , for ; , for .

Third, the numerical estimate of the EUBO is close to the ground-truth evidence even when the variational distribution has not been well optimized. The EUBO involves an integration with respect to the unknown posterior. As described in a later section, we use importance sampling to estimate this upper bound. Importance sampling reduces the bias of the Monte Carlo estimate of the evidence when the variational proposal differs from the posterior. This property leads to two advantages: 1) model selection based solely on the ELBO is inappropriate because the tightness of that bound can vary, whereas with an accompanying upper bound that sandwiches the evidence, one can perform model selection with more confidence Grosse_etal_2015 ; Ji_2010_Bounded ; 2) when the evidence or its surrogate must be evaluated after only a few optimization steps of the variational parameters, for example in Bayesian meta-learning Kim_2018_BayesianMAML ; Ravi_2019_AmortizedBMAML , the proposed EUBO may be preferable since its estimate is insensitive to the bias of the variational distribution.

### 2.3 SGD for the EUBO

By definition, the EUBO involves an integration with respect to the unknown posterior. In previous studies Ji_2009_Thesis ; Ji_2010_Bounded , the posterior is represented by samples simulated from Markov chain Monte Carlo (MCMC), and the integration is approximated via Monte Carlo. However, Monte Carlo, and MCMC in particular, does not scale to large Bayesian learning problems. Following the idea of black-box VI Ranganath_etal_2014 , we use only a small number of Monte Carlo samples to obtain a 'noisy' gradient of the EUBO, and then apply SGD to optimize $\lambda$ iteratively. First, we derive the gradient of $U(\lambda)$ (refer to the Appendix for details), denoted by $\nabla_\lambda U$,

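$$\nabla_\lambda U = \mathbb{E}_q\Big\{ w(\theta)\big[(\log\omega(\theta)+1)\,\nabla_\lambda\log p(D,\theta) - \nabla_\lambda\log q(\theta;\lambda)\big]\Big\} \tag{3}$$

where $w(\theta) = p(\theta|D)/q(\theta;\lambda)$ and $\omega(\theta) = p(D,\theta)/q(\theta;\lambda)$. Note that the posterior in $w(\theta)$ is generally unknown, so we use the joint distribution instead and normalize the weights to cancel the unknown constant $p(D)$. Given the samples $\theta^{(i)}$, $i = 1,\dots,M$, drawn from $q(\theta;\lambda)$ and the normalized weights $\hat{w}^{(i)} = \omega(\theta^{(i)}) / \sum_{j=1}^{M}\omega(\theta^{(j)})$, we estimate $\nabla_\lambda U$ as follows,

$$\hat{\nabla}_\lambda U = \sum_{i=1}^{M} \hat{w}^{(i)}\Big[(\log\omega(\theta^{(i)})+1)\,\nabla_\lambda\log p(D,\theta^{(i)}) - \nabla_\lambda\log q(\theta^{(i)};\lambda)\Big]. \tag{4}$$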
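As a minimal sketch of how the weighted sum in equation (4) could be assembled, the snippet below assumes that the log joint, the log variational density, and their per-sample gradients have already been evaluated at the $M$ samples. The function names `normalized_weights` and `eubo_gradient` are hypothetical and do not come from the paper.

```python
import numpy as np

def normalized_weights(log_joint, log_q):
    """Self-normalized importance weights from log p(D, theta_i) and log q(theta_i)."""
    log_omega = log_joint - log_q
    # Subtracting the maximum avoids overflow and does not change the ratio.
    w = np.exp(log_omega - np.max(log_omega))
    return w / np.sum(w)

def eubo_gradient(log_joint, log_q, grad_log_joint, grad_log_q):
    """Weighted sum of equation (4).

    log_joint, log_q:           arrays of shape (M,)
    grad_log_joint, grad_log_q: arrays of shape (M, P), gradients w.r.t. lambda
    """
    w_hat = normalized_weights(log_joint, log_q)
    log_omega = log_joint - log_q  # unnormalized log ratio log(p(D,theta)/q)
    per_sample = (log_omega + 1.0)[:, None] * grad_log_joint - grad_log_q
    return np.sum(w_hat[:, None] * per_sample, axis=0)

# Small illustrative call with made-up numbers.
lj = np.array([-3.0, -1.0, -2.0])
lq = np.array([-1.5, -0.5, -1.0])
print(normalized_weights(lj, lq))
```

Because the weights are normalized, adding any constant to `log_joint` (such as $-\log p(D)$) leaves them unchanged, which is why the unknown evidence cancels.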

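To deal with large datasets, we divide the entire dataset of $N$ observations into mini-batches. Only one mini-batch of $S$ observations is used in the evaluation of $p(D,\theta^{(i)})$,

$$p(D,\theta^{(i)}) \approx \Big[\prod_{n=1}^{S} p(x_n|\theta^{(i)})\Big]^{N/S} p(\theta^{(i)}).$$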
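On the log scale the approximation amounts to rescaling the mini-batch log-likelihood by $N/S$. The following sketch illustrates this under an assumed toy Gaussian model (prior $N(0,1)$, unit-variance likelihood); the helper names are hypothetical.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def log_lik(x, theta):
    """Per-observation log-likelihood of the toy model x_n | theta ~ N(theta, 1)."""
    return log_normal(x, theta, 1.0)

def log_prior(theta):
    """Log prior of the toy model theta ~ N(0, 1)."""
    return log_normal(theta, 0.0, 1.0)

def minibatch_log_joint(theta, x_batch, n_total):
    """Estimate log p(D, theta) from a mini-batch of size S by scaling with N / S."""
    scale = n_total / len(x_batch)
    return scale * np.sum(log_lik(x_batch, theta)) + log_prior(theta)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=12)
theta = 0.3

full = np.sum(log_lik(x, theta)) + log_prior(theta)
# Split the data into three disjoint mini-batches of four observations each.
estimates = [minibatch_log_joint(theta, batch, x.size) for batch in x.reshape(3, 4)]
print(full, np.mean(estimates))
```

Averaging the estimate over a partition of the data into equal-sized mini-batches recovers the full-data log joint, which is the sense in which the rescaling is unbiased on the log scale.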

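To take advantage of automatic differentiation packages, we apply the reparametrization trick Kingma_2014_VAE , writing $\theta = g_\lambda(\epsilon)$, where $\epsilon$ is drawn from a base distribution that does not depend on $\lambda$. The resulting reparameterization gradient becomes,

$$\hat{\nabla}_\lambda U = \sum_{i=1}^{M} \hat{w}^{(i)}\Big[(\log\omega(\theta^{(i)})+1)\,\nabla_\lambda\log p(D,g_\lambda(\epsilon^{(i)})) - \nabla_\lambda\log q(g_\lambda(\epsilon^{(i)});\lambda)\Big]. \tag{5}$$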

This reparametrization trick significantly reduces the effort needed to implement variational inference in a wide variety of models. Finally, with the noisy estimate of the gradient $\hat{\nabla}_\lambda U$, the SGD algorithm is applied to minimize the EUBO iteratively. These steps constitute the EUBO-VI algorithm.

After we obtain the optimal $\lambda^*$, the resulting minimum upper bound is estimated by

$$\hat{U}^* = \sum_{i=1}^{M} \hat{w}^{(i)}\Big[\sum_{n=1}^{N}\log p(x_n|\theta^{(i)}) + \log p(\theta^{(i)}) - \log q(\theta^{(i)};\lambda^*)\Big] \tag{6}$$
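A minimal sketch of the estimator in equation (6) is given below for an assumed toy conjugate model (prior $\theta \sim N(0,1)$, likelihood $x_n|\theta \sim N(\theta,1)$) with a Gaussian variational distribution. The function name `estimate_eubo` and the model are illustrative assumptions, chosen because the exact log evidence is available for comparison.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def estimate_eubo(theta, x, q_mean, q_std):
    """Self-normalized importance sampling estimate of the EUBO, as in equation (6).

    theta: samples drawn from q = N(q_mean, q_std^2), shape (M,)
    x:     observations of the toy model, shape (N,)
    """
    log_joint = (log_normal(x[:, None], theta[None, :], 1.0).sum(axis=0)
                 + log_normal(theta, 0.0, 1.0))
    log_omega = log_joint - log_normal(theta, q_mean, q_std)
    w_hat = np.exp(log_omega - np.max(log_omega))
    w_hat /= np.sum(w_hat)
    return np.sum(w_hat * log_omega)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)
N = x.size

# Exact posterior and log evidence of the conjugate toy model.
post_mean, post_std = x.sum() / (N + 1), (N + 1) ** -0.5
log_evidence = (-0.5 * N * np.log(2.0 * np.pi) - 0.5 * np.log(N + 1)
                - 0.5 * (np.sum(x ** 2) - x.sum() ** 2 / (N + 1)))

# Estimate with a mismatched q and with q equal to the exact posterior.
theta_q = 0.5 + 0.6 * rng.standard_normal(10)
theta_p = post_mean + post_std * rng.standard_normal(10)
eubo_mismatched = estimate_eubo(theta_q, x, 0.5, 0.6)
eubo_exact_q = estimate_eubo(theta_p, x, post_mean, post_std)
print(log_evidence, eubo_mismatched, eubo_exact_q)
```

When $q$ equals the exact posterior, every sample has the same log ratio $\log p(D)$, so the estimate coincides with the log evidence regardless of the number of samples.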

Furthermore, if we assume that the posterior and the joint distribution do not depend on the variational parameter $\lambda$, which means $\nabla_\lambda p(\theta|D) = 0$ and $\nabla_\lambda\log p(D,\theta) = 0$, then we obtain the score gradient, $\nabla_\lambda U = -\mathbb{E}_q[w(\theta)\nabla_\lambda\log q(\theta;\lambda)]$. Comparing with the score gradient of CHIVI Dieng_etal_2017 , we find that the only difference is the definition of $w$: for our EUBO-VI, $w(\theta) = p(\theta|D)/q(\theta;\lambda)$, while CHIVI uses the weight defined in Dieng_etal_2017 . However, although there is some similarity between EUBO-VI and CHIVI/Rényi-VI, EUBO-VI is distinct and is not a special case of these previous upper bound based SVI algorithms.
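The score-gradient variant can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the authors' implementation: it uses a toy conjugate model (prior $\theta \sim N(0,1)$, likelihood $x_n|\theta \sim N(\theta,1)$), a Gaussian $q$ parameterized by its mean and log standard deviation, plain SGD with a fixed step size, and self-normalized weights in place of the unknown $w(\theta)$. The function name and all hyperparameters are hypothetical choices.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def fit_eubo_score_gradient(x, n_steps=2000, n_samples=100, lr=0.01, seed=0):
    """Minimize the EUBO with the score gradient -E_q[w * grad log q]."""
    rng = np.random.default_rng(seed)
    mu, log_sigma = 0.0, 0.0  # lambda = (mu, log_sigma), starting from q = N(0, 1)
    for _ in range(n_steps):
        sigma = np.exp(log_sigma)
        theta = mu + sigma * rng.standard_normal(n_samples)

        # Self-normalized importance weights; the unknown p(D) cancels.
        log_joint = (log_normal(x[:, None], theta[None, :], 1.0).sum(axis=0)
                     + log_normal(theta, 0.0, 1.0))
        log_omega = log_joint - log_normal(theta, mu, sigma)
        w_hat = np.exp(log_omega - np.max(log_omega))
        w_hat /= np.sum(w_hat)

        # Score of the Gaussian q with respect to mu and log_sigma.
        z = (theta - mu) / sigma
        grad_mu = -np.sum(w_hat * z / sigma)
        grad_log_sigma = -np.sum(w_hat * (z ** 2 - 1.0))

        # Plain SGD step on the noisy gradient.
        mu -= lr * grad_mu
        log_sigma -= lr * grad_log_sigma
    return mu, np.exp(log_sigma)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)
N = x.size
post_mean, post_std = x.sum() / (N + 1), (N + 1) ** -0.5

mu_hat, sigma_hat = fit_eubo_score_gradient(x)
print(f"fitted q: mean = {mu_hat:.3f}, std = {sigma_hat:.3f}")
print(f"posterior: mean = {post_mean:.3f}, std = {post_std:.3f}")
```

For a Gaussian family, minimizing $\mathrm{KL}(p\|q)$ matches the mean and variance of the posterior, so in this toy setting the fitted parameters should end up close to the exact posterior moments.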

## 3 Simulation studies

### 3.1 Bayesian Logistic Regression

For Bayesian logistic regression, we use data sets from the UCI repository: Iris, Pima, Spectf, Wdbc and Ionos. The covariate dimension in these data sets ranges from 5 to 45, including a dimension of all ones to account for the offset. We set the prior distribution of the coefficients as , choose the number of importance samples as 10 and the mini-batch size as 100. In our experiments we use the Adam algorithm Kingma_2014_Adam , an adaptive version of SGD that automatically tunes the learning rates according to the history of the gradients and their variances. We compare the proposed EUBO-SVI with vanilla SVI Paisley_etal_2012 , CHIVI Dieng_etal_2017 and Rényi-VI Li_Turner_2017 . We also compare the optimized bounds obtained by the various algorithms: the ELBO from vanilla SVI, the EUBO from the proposed algorithm, the upper bound from CHIVI, and the upper and lower bounds from Rényi-VI with different α-divergences.

Taking the logistic regression on the Iris dataset as an example, we show the estimated ELBO and EUBO at each epoch in Figure 1. We observe that the EUBO converges faster and is tighter than the ELBO, even in the early stage when the variational parameters are not well optimized. We run 20 trials for each bound; the results are shown in Table 1. The upper and lower bounds bracket the log evidence well. With the lower and upper bounds close to each other, we are more confident that the VI algorithms find the true evidence. Moreover, the ordering of the bounds observed in the simulation is consistent with Theorem 1 discussed in Section 2.2. To test predictive performance, all the datasets are randomly partitioned into 90% for training and 10% for testing, and the results are averaged over 20 random trials. The averaged test error is shown in Table 2, which shows that the EUBO-SVI algorithm performs well in model prediction.

### 3.2 Bayesian Neural Network

In the Bayesian neural network regression, we follow the same settings as previous work Li_Turner_2017 . We use neural networks with one hidden layer and 50 hidden units for most datasets, except for Protein, which is relatively large and for which we use 100 units. We set as the prior distribution for the weights and biases of the neural network. We choose as the activation function. The number of importance samples is 10 and the mini-batch size is 100. All the datasets are randomly partitioned into 90% for training and 10% for testing, and the results are averaged over 20 random trials. We compare our EUBO-VI with state-of-the-art results cited directly from some representative algorithms: probabilistic backpropagation (PBP) HernndezLobato_2015_ProbabilisticBP , Rényi-VI Li_Turner_2017 , and CLBO-VI Tao_etal_2018_VI . The averaged prediction accuracy and averaged test log likelihood (LL) are given in Table 3 and Table 4 respectively. The proposed EUBO-VI achieves substantial improvement on both the test error and negative log likelihood for various examples.

## 4 Conclusion

We investigate variational inference using an effective evidence upper bound (EUBO). The proposed upper bound is based on the KL-divergence between the posterior and the variational distribution. We reveal that this upper bound is tighter than the previous Rényi bound and χ-bound. We propose an SGD algorithm to optimize the EUBO and use the reparametrization trick for easy implementation with prevalent Python packages on large-scale problems. We compare the proposed algorithm with vanilla SVI, CHIVI and Rényi-VI. The simulation study shows that the proposed EUBO-VI algorithm improves both the test error and the log likelihood. We also observe that the EUBO converges faster and is tighter than the ELBO, even in the early stage of the optimization procedure. Moreover, with both the upper and lower bounds, we are more confident in the model fit. This upper bound VI not only provides variational inference for complex models, but also provides an easily accessible upper bound for model criticism.

## References

• [1] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
• [2] Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. Variational inference via χ-upper bound minimization. In Proceedings of the Neural Information Processing Systems, 2017.
• [3] Justin Domke and Daniel Sheldon. Importance weighting and variational inference. In Proceedings of the Neural Information Processing Systems, 2018.
• [4] R. B. Grosse, Z. Ghahramani, and R. P Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. ArXiv preprint, arXiv:1511.02543, 2015.
• [5] J. Hensman, M. Zwiessele, and N. D. Lawrence. Tilted variational Bayes. Proceedings of Machine Learning Research, 33(9):356–364, 2014.
• [6] José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the International Conference on Machine Learning, 2015.
• [7] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
• [8] Chunlin Ji. Adaptive monte carlo methods for Bayesian inference. Master’s thesis, University of Cambridge, UK, 2006.
• [9] Chunlin Ji. Advances in Bayesian modelling and computation: Spatio-temporal processes, model assessment and adaptive MCMC, 2009. Ph.D. thesis, Department of Statistical Science, Duke University.
• [10] Chunlin Ji, Haige Shen, and Mike West. Bounded approximations for marginal likelihoods, 2010. Technical Report, Department of Statistical Science, Duke University.
• [11] M. Jordan, Z. Ghahramani, T. Jaakkola, and K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
• [12] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Proceedings of the Neural Information Processing Systems, 2018.
• [13] D. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
• [14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, arXiv:1412.6980, 2014.
• [15] Y. Li and R. E. Turner. Variational inference with Rényi divergence. In Proceedings of the Neural Information Processing Systems, 2016.
• [16] Thomas Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
• [17] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012.
• [18] J. Paisley, D. Blei, and M. I. Jordan. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, pages 1363–1370, 2012.
• [19] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
• [20] Sachin Ravi and Alex Beatson. Amortized Bayesian meta-learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
• [21] Chenyang Tao, Liqun Chen, Ruiyi Zhang, Ricardo Henao, and Lawrence Carin. Variational inference and model selection with generalized evidence bounds. In Proceedings of the International Conference on Machine Learning, 2018.
• [22] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. ArXiv preprint, arXiv:1711.05597, 2018.

## Appendix

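We present two ways to derive the gradient of the EUBO $U$. Rewrite $U$ as an integration with respect to $q(\theta;\lambda)$, that is, $U = \int q(\theta;\lambda)\,\hat{w}(\theta)\log w(\theta)\,d\theta$, and denote $\hat{w}(\theta) = p(\theta|D)/q(\theta;\lambda)$, $w(\theta) = p(D,\theta)/q(\theta;\lambda)$. Note that $\nabla_\lambda\hat{w}(\theta) = \hat{w}(\theta)\nabla_\lambda\log w(\theta)$. We can derive the gradient as follows,

$$\begin{aligned}
\nabla_\lambda U &= \nabla_\lambda\int q(\theta;\lambda)\,\hat{w}(\theta)\log w(\theta)\,d\theta \\
&= \int \nabla_\lambda\big[q(\theta;\lambda)\,\hat{w}(\theta)\log w(\theta)\big]\,d\theta \\
&= \int \nabla_\lambda q(\theta;\lambda)\,\hat{w}(\theta)\log w(\theta)\,d\theta + \int q(\theta;\lambda)\,\nabla_\lambda\hat{w}(\theta)\log w(\theta)\,d\theta + \int q(\theta;\lambda)\,\hat{w}(\theta)\,\nabla_\lambda\log w(\theta)\,d\theta \\
&= \mathbb{E}_q\big[\hat{w}(\theta)\log w(\theta)\nabla_\lambda\log q(\theta;\lambda)\big] + \mathbb{E}_q\big[\log w(\theta)\nabla_\lambda\hat{w}(\theta)\big] + \mathbb{E}_q\big[\hat{w}(\theta)\nabla_\lambda\log w(\theta)\big] \\
&= \mathbb{E}_q\big[\hat{w}(\theta)\log w(\theta)\nabla_\lambda\log q(\theta;\lambda)\big] + \mathbb{E}_q\big[\log w(\theta)\,\hat{w}(\theta)\nabla_\lambda\log w(\theta)\big] + \mathbb{E}_q\big[\hat{w}(\theta)\nabla_\lambda\log w(\theta)\big] \\
&= \mathbb{E}_q\big[\hat{w}(\theta)\log w(\theta)\nabla_\lambda\log q(\theta;\lambda)\big] + \mathbb{E}_q\big[\log w(\theta)\,\hat{w}(\theta)\big(\nabla_\lambda\log p(D,\theta) - \nabla_\lambda\log q(\theta;\lambda)\big)\big] + \mathbb{E}_q\big[\hat{w}(\theta)\nabla_\lambda\log w(\theta)\big] \\
&= \mathbb{E}_q\big[\hat{w}(\theta)\log w(\theta)\nabla_\lambda\log p(D,\theta)\big] + \mathbb{E}_q\big[\hat{w}(\theta)\big(\nabla_\lambda\log p(D,\theta) - \nabla_\lambda\log q(\theta;\lambda)\big)\big] \\
&= \mathbb{E}_q\big[\hat{w}(\theta)(\log w(\theta)+1)\nabla_\lambda\log p(D,\theta)\big] - \mathbb{E}_q\big[\hat{w}(\theta)\nabla_\lambda\log q(\theta;\lambda)\big].
\end{aligned}$$

Alternatively, we assume that the posterior and the joint distribution do not depend on the parameter of the variational distribution, that is, $\nabla_\lambda p(\theta|D) = 0$ and $\nabla_\lambda\log p(D,\theta) = 0$. Then we have,

$$\begin{aligned}
\nabla_\lambda U &= \nabla_\lambda\int p(\theta|D)\log\frac{p(D,\theta)}{q(\theta;\lambda)}\,d\theta \\
&= \int \nabla_\lambda\big[p(\theta|D)\big(\log p(D,\theta) - \log q(\theta;\lambda)\big)\big]\,d\theta \\
&= \int \nabla_\lambda p(\theta|D)\big(\log p(D,\theta) - \log q(\theta;\lambda)\big)\,d\theta + \int p(\theta|D)\,\nabla_\lambda\big(\log p(D,\theta) - \log q(\theta;\lambda)\big)\,d\theta \\
&= -\int q(\theta;\lambda)\,\frac{p(\theta|D)}{q(\theta;\lambda)}\,\nabla_\lambda\log q(\theta;\lambda)\,d\theta \\
&= -\mathbb{E}_q\big[\hat{w}(\theta)\nabla_\lambda\log q(\theta;\lambda)\big],
\end{aligned}$$

where $\hat{w}(\theta) = p(\theta|D)/q(\theta;\lambda)$. In the last step we use importance sampling from $q(\theta;\lambda)$ instead of sampling directly from $p(\theta|D)$.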

### The relation between KL-divergence and α-divergence/χn-divergence

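We show that the KL-divergence is a tighter divergence than the α-divergence (for ). By Jensen's inequality, we know that for a concave function, such as $\log x$ for $x > 0$, we have $\log\mathbb{E}[x] \ge \mathbb{E}[\log x]$. So we have,

$$\begin{aligned}
D_\alpha\big(q(\theta)\,\|\,p(\theta|D)\big) &= \frac{1}{1-\alpha}\log\int p(\theta|D)\Big(\frac{p(\theta|D)}{q(\theta)}\Big)^{-\alpha}d\theta \\
&\ge \frac{1}{1-\alpha}\int p(\theta|D)\log\Big[\Big(\frac{p(\theta|D)}{q(\theta)}\Big)^{-\alpha}\Big]d\theta \\
&= \frac{-\alpha}{1-\alpha}\,D_{KL}\big(p(\theta|D)\,\|\,q(\theta)\big).
\end{aligned}$$

Given this inequality, it is easy to find the relation between their corresponding bounds,

$$\begin{aligned}
U_\alpha(\lambda) &= \frac{1}{1-\alpha}\log\mathbb{E}_q\Big[\Big(\frac{p(D,\theta)}{q(\theta;\lambda)}\Big)^{1-\alpha}\Big] \\
&= \frac{1}{1-\alpha}\log\mathbb{E}_q\Big[\Big(\frac{p(\theta|D)}{q(\theta;\lambda)}\Big)^{1-\alpha}\Big] + \frac{1}{1-\alpha}\log p(D)^{1-\alpha} \\
&\ge \frac{-\alpha}{1-\alpha}\,D_{KL}\big(p(\theta|D)\,\|\,q(\theta)\big) + \log p(D) \\
&= \frac{-\alpha}{1-\alpha}\int p(\theta|D)\log\frac{p(D,\theta)}{q(\theta;\lambda)}\,d\theta + \frac{1}{1-\alpha}\log p(D) \\
&= \frac{-\alpha}{1-\alpha}\,U(\lambda) + \frac{1}{1-\alpha}\log p(D).
\end{aligned}$$

Let $\alpha = 1-n$; then we get similar inequalities between the $\chi^n$-divergence and the KL-divergence $D_{KL}(p\|q)$, and between their corresponding bounds.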
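The final inequality between $U_\alpha(\lambda)$ and $U(\lambda)$ can be checked numerically. The sketch below assumes a toy conjugate model (prior $\theta \sim N(0,1)$, likelihood $x_n|\theta \sim N(\theta,1)$), a fixed Gaussian $q = N(0.5, 0.6^2)$, and $\alpha = -1$; these are illustrative choices, and all integrals are evaluated by grid quadrature.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

# Toy model (assumption): theta ~ N(0, 1), x_n | theta ~ N(theta, 1).
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10)
N = x.size
post_mean, post_std = x.sum() / (N + 1), (N + 1) ** -0.5

grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]
log_joint = (log_normal(x[:, None], grid[None, :], 1.0).sum(axis=0)
             + log_normal(grid, 0.0, 1.0))
log_post = log_normal(grid, post_mean, post_std)
log_q = log_normal(grid, 0.5, 0.6)
log_evidence = (log_joint - log_post)[grid.size // 2]

alpha = -1.0
# U_alpha = 1/(1-alpha) * log E_q[(p(D,theta)/q)^(1-alpha)]
integrand = np.exp(log_q + (1.0 - alpha) * (log_joint - log_q))
u_alpha = np.log(np.sum(integrand) * dx) / (1.0 - alpha)

# U = E_{p(theta|D)}[log p(D,theta) - log q]
u = np.sum(np.exp(log_post) * (log_joint - log_q)) * dx

# Right-hand side of the last line of the derivation.
rhs = -alpha / (1.0 - alpha) * u + log_evidence / (1.0 - alpha)
print(f"U_alpha = {u_alpha:.4f}, rhs = {rhs:.4f}, log p(D) = {log_evidence:.4f}")
```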