# Fast Second-Order Stochastic Backpropagation for Variational Inference

We propose a second-order (Hessian or Hessian-free) optimization method for variational inference inspired by Gaussian backpropagation, and argue that quasi-Newton optimization can be developed as well. This is accomplished by generalizing the gradient computation in stochastic backpropagation via a reparameterization trick with lower complexity. As an illustrative example, we apply this approach to the problems of Bayesian logistic regression and the variational auto-encoder (VAE). Additionally, we compute bounds on the estimator variance of intractable expectations for the family of Lipschitz continuous functions. Our method is practical, scalable and model-free. We demonstrate our method on several real-world datasets and provide comparisons with other stochastic gradient methods to show substantial enhancement in convergence rates.


## 1 Introduction

Generative models have become ubiquitous in machine learning and statistics and are now widely used in fields such as bioinformatics, computer vision, and natural language processing. These models benefit from being highly interpretable and easily extended. Unfortunately, inference and learning with generative models is often intractable, especially for models that employ continuous latent variables, so fast approximate methods are needed. Variational Bayesian (VB) methods [1] deal with this problem by approximating the true posterior with a distribution of tractable parametric form and then identifying the set of parameters that maximizes a variational lower bound on the marginal likelihood. That is, VB methods turn an inference problem into an optimization problem that can be solved, for example, by gradient ascent.

Indeed, efficient stochastic gradient variational Bayesian (SGVB) estimators have been developed for auto-encoder models [17], and a number of papers have followed up on this approach [28, 25, 19, 16, 15, 26, 10]. Recently, [25] provided a complementary perspective by using stochastic backpropagation, which is equivalent to SGVB, and applied it to deep latent Gaussian models. Stochastic backpropagation overcomes many limitations of traditional inference methods, such as the mean-field or wake-sleep algorithms [12], because it admits an efficient computation of an unbiased estimate of the gradient of the variational lower bound. The resulting gradients can be used for parameter estimation via stochastic optimization methods such as stochastic gradient descent (SGD) or adaptive versions such as Adagrad [6].

Unfortunately, methods such as SGD or Adagrad converge slowly for some difficult-to-train models, such as untied-weights auto-encoders or recurrent neural networks. The common experience is that gradient descent often gets stuck near saddle points or local extrema, and the learning rate is difficult to tune.

[18] gave a clear explanation of why Newton’s method is preferred over gradient descent, which often runs into an under-fitting problem when the objective function exhibits pathological curvature. Newton’s method is invariant to affine transformations, so it can take advantage of curvature information, but it has a higher computational cost due to its reliance on the inverse of the Hessian matrix. This issue was partially addressed in [18], where the authors introduced Hessian-free (HF) optimization and demonstrated its suitability for problems in machine learning.

In this paper, we continue this line of research into second-order variational inference algorithms. Inspired by the properties of location-scale families [8], we show how to reduce the computational cost of the Hessian or Hessian-vector product, thus allowing for a second-order stochastic optimization scheme for variational inference under a Gaussian approximation. In conjunction with HF optimization, we propose an efficient and scalable second-order stochastic Gaussian backpropagation for variational inference, called HFSGVI. Alternatively, an L-BFGS [3] version, a quasi-Newton method using only gradient information, is a natural generalization of first-order variational inference.

The most immediate application is to obtain better optimization algorithms for variational inference. To our knowledge, the only model currently applying second-order information is LDA [2, 14], where the Hessian is easy to compute [11]. In general, for non-linear factor models such as non-linear factor analysis or the deep latent Gaussian models, this is not the case. Indeed, to our knowledge, there has been no systematic investigation into the properties of various optimization algorithms and how they might impact the solutions to the optimization problems arising from variational approximations.

The main contributions of this paper fill this gap for variational inference by introducing a novel second-order optimization scheme. First, we describe an approach that obtains curvature information at low computational cost, making Newton’s method both scalable and efficient. Second, we show that the variance of the lower-bound estimator can be bounded by a dimension-free constant, extending the work of [25], which discussed a specific bound for univariate functions. Third, we demonstrate the performance of our method for Bayesian logistic regression and the VAE model in comparison to commonly used algorithms; the convergence rate is shown to be competitive or faster.

## 2 Stochastic Backpropagation

In this section, we extend the Bonnet and Price theorem [4, 24] to develop second-order Gaussian backpropagation. Specifically, we consider how to optimize an expectation of the form $\mathbb{E}_{q_\theta}[f(z|x)]$, where $z$ and $x$ refer to latent variables and observed variables respectively, the expectation is taken w.r.t. the distribution $q_\theta$, and $f$ is some smooth loss function (e.g., it can be derived from a standard variational lower bound [1]). Sometimes we abuse notation and write $f(z)$, omitting $x$, when no ambiguity exists. To optimize such an expectation, gradient descent methods require the first derivatives, while Newton’s methods require both the gradients and the Hessian, involving second-order derivatives.

### 2.1 Second Order Gaussian Backpropagation

If the distribution is a $d_z$-dimensional Gaussian $\mathcal{N}(z|\mu,C)$, the required partial derivatives can be computed with low algorithmic cost [25]. By using the properties of the Gaussian distribution, we can compute the second-order partial derivatives of $\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)]$ as follows:

$$\nabla^2_{\mu_i,\mu_j}\,\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(z|\mu,C)}\!\left[\nabla^2_{z_i,z_j}f(z)\right], \tag{1}$$
$$\nabla^2_{C_{i,j},C_{k,l}}\,\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \tfrac{1}{4}\,\mathbb{E}_{\mathcal{N}(z|\mu,C)}\!\left[\nabla^4_{z_i,z_j,z_k,z_l}f(z)\right], \tag{2}$$
$$\nabla^2_{\mu_i,C_{k,l}}\,\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \tfrac{1}{2}\,\mathbb{E}_{\mathcal{N}(z|\mu,C)}\!\left[\nabla^3_{z_i,z_k,z_l}f(z)\right]. \tag{3}$$

Eqs. (1), (2), (3) (proof in supplementary) have the nice property that a limited number of samples from the Gaussian are sufficient to obtain unbiased gradient estimates. However, note that Eqs. (2) and (3) require the third and fourth derivatives of $f$, which is highly computationally inefficient. To avoid computing these high-order derivatives, we use a coordinate transformation.

### 2.2 Covariance Parameterization for Optimization

By constructing the linear transformation (a.k.a. reparameterization) $z = \mu + R\epsilon$, where $\epsilon \sim \mathcal{N}(0, I_d)$, we can generate samples from any Gaussian distribution $\mathcal{N}(\mu, C)$ by simulating data from a standard normal distribution, provided the decomposition $C = RR^\top$ holds. This fact allows us to derive the following theorem, indicating that the computation of second-order derivatives can be scalable and programmed to run in parallel.

###### Theorem 1 (Fast Derivative).

If $f$ is a twice differentiable function and $z$ follows the Gaussian distribution $\mathcal{N}(\mu, C)$, $C = RR^\top$, where both the mean $\mu$ and $R$ depend on a $d$-dimensional parameter $\theta = (\theta_l)_{l=1}^d$, i.e., $\mu(\theta)$ and $R(\theta)$, then

$$\nabla_{\theta_l}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[g^\top\frac{\partial(\mu+R\epsilon)}{\partial\theta_l}\right], \tag{4}$$
$$\nabla^2_{\theta_{l_1},\theta_{l_2}}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\!\left[\frac{\partial(\mu+R\epsilon)}{\partial\theta_{l_1}}^{\!\top} H\,\frac{\partial(\mu+R\epsilon)}{\partial\theta_{l_2}} + g^\top\frac{\partial^2(\mu+R\epsilon)}{\partial\theta_{l_1}\partial\theta_{l_2}}\right], \tag{5}$$

where $\otimes$ denotes the Kronecker product (arising when the derivatives $\partial(\mu+R\epsilon)/\partial\theta_l$ are written out in matrix form), and the gradient $g$ and Hessian $H$ of $f$ are evaluated at $\mu + R\epsilon$.

If we consider the mean $\mu$ and covariance matrix $C$ as the variational parameters in variational inference, the first two results w.r.t. $(\mu, C)$ make parallelization possible and reduce the computational cost of the Hessian-vector multiplication. If the model has few parameters, or a large resource budget (e.g., a GPU) is available, Theorem 1 lays the foundation for exact second-order derivative computation in parallel. In addition, note that the second-order gradient computation on the model parameter $\theta$ only involves matrix-vector or vector-vector multiplications, leading to an algorithmic complexity of the same order as the first-order gradient [25]. The derivative computation of the function $f$ is only required up to second order, avoiding third- or fourth-order derivatives. One practical parameterization assumes a diagonal covariance matrix $C$. This reduces the actual computational cost compared with the general case of Theorem 1, albeit with the same order of complexity (see supplementary material). Theorem 1 holds for a large class of distributions beyond the Gaussian, such as the Student's $t$-distribution. However, if the dimensionality $d$ of the embedded parameter $\theta$ is large, computation of the gradient and Hessian (as distinct from $g$ and $H$ above) will be linear and quadratic in $d$ respectively, which may be unacceptable. Therefore, in the next section we attempt to reduce the computational complexity w.r.t. $d$.
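Eq. (4) can be illustrated with a one-dimensional Monte Carlo sketch (ours, not the paper's). Here $f(z)=z^2$ with $z\sim\mathcal{N}(\mu,\sigma^2)$ is a hypothetical example chosen because the gradients are analytic: $\mathbb{E}[f]=\mu^2+\sigma^2$, so $\partial/\partial\mu = 2\mu$ and $\partial/\partial\sigma = 2\sigma$:

```python
import random

random.seed(1)

# Reparameterized gradient estimator of Eq. (4) for f(z) = z^2,
# z = mu + sigma * eps with eps ~ N(0, 1).
# d(mu + sigma*eps)/dmu = 1 and d(mu + sigma*eps)/dsigma = eps,
# and g = f'(z) = 2z evaluated at z = mu + sigma*eps.
mu, sigma = 0.5, 1.0
n = 200000

grad_mu = grad_sigma = 0.0
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    g = 2.0 * (mu + sigma * eps)
    grad_mu += g * 1.0
    grad_sigma += g * eps
grad_mu /= n
grad_sigma /= n
print(grad_mu, grad_sigma)  # close to 2*mu = 1.0 and 2*sigma = 2.0
```

The same pattern extends to vector-valued $\theta$: each partial derivative in Eq. (4) is a cheap inner product between $g$ and a column of $\partial(\mu+R\epsilon)/\partial\theta$.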

### 2.3 Applying Reparameterization to Second-Order Algorithms

In the standard Newton’s method, we need to compute the Hessian matrix and its inverse, which is intractable with limited computing resources. [18] applied the Hessian-free (HF) optimization method in deep learning effectively and efficiently. This work relied largely on the technique of fast Hessian matrix-vector multiplication [23]. We combine the reparameterization trick with Hessian-free or quasi-Newton methods to circumvent the matrix-inverse problem.

**Hessian-free** Unlike quasi-Newton methods, HF does not make any approximation of the Hessian. HF needs to compute $Hv$, where $v$ is any vector of dimension matching $\theta$, and then uses the conjugate gradient algorithm to solve the linear system $Hx = -\nabla F(\theta)$ for any objective function $F$. [18] gives a reasonable explanation of Hessian-free optimization: in short, unlike a pre-training method that places the parameters in a search region to regularize [7], HF addresses pathological curvature in the objective by taking advantage of the rescaling property of Newton’s method. By definition, $Hv = \lim_{\gamma\to 0}\,[\nabla F(\theta+\gamma v) - \nabla F(\theta)]/\gamma$, indicating that $Hv$ can be numerically approximated by using finite differences at a small $\gamma$. However, this numerical method is unstable for small $\gamma$.
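The finite-difference definition above can be sketched on a hypothetical quadratic $F(\theta)=\tfrac12\theta^\top A\theta$, whose exact Hessian is $A$ (so $Hv = Av$ is known and can be compared against); this is our illustration, not the paper's code:

```python
# Finite-difference Hessian-vector product: Hv ~ (grad_F(theta + gamma*v) -
# grad_F(theta)) / gamma, checked on a quadratic where the exact answer is Av.
# For a general F this subtraction is numerically unstable for small gamma,
# which motivates computing Hv analytically as in Eq. (6).
A = [[4.0, 1.0],
     [1.0, 3.0]]

def grad_F(theta):  # grad F(theta) = A theta for this quadratic
    return [sum(A[i][j] * theta[j] for j in range(2)) for i in range(2)]

def hv_finite_diff(theta, v, gamma=1e-5):
    g0 = grad_F(theta)
    g1 = grad_F([theta[i] + gamma * v[i] for i in range(2)])
    return [(g1[i] - g0[i]) / gamma for i in range(2)]

theta, v = [0.3, -0.7], [1.0, 2.0]
hv = hv_finite_diff(theta, v)
exact = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]  # Av
print(hv, exact)
```

On a quadratic the gradient is linear, so the finite difference is exact up to floating-point round-off; the instability only bites for genuinely non-linear objectives.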

In this section, we focus on the calculation of $Hv$ by leveraging the reparameterization trick. Specifically, we apply an R-operator technique [23] for computing the product $Hv$ exactly. Let $F = \mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)]$ and reparameterize again as in Sec. 2.2; we perform the variable substitution $\theta \leftarrow \theta + \gamma v$ after the gradient in Eq. (4) is obtained, and then take the derivative w.r.t. $\gamma$. This yields the following analytical expression for the Hessian-vector multiplication:

$$H_\theta v = \left.\frac{\partial}{\partial\gamma}\nabla F(\theta+\gamma v)\right|_{\gamma=0} = \left.\frac{\partial}{\partial\gamma}\,\mathbb{E}_{\mathcal{N}(0,I)}\!\left[g^\top \left.\frac{\partial(\mu(\theta)+R(\theta)\epsilon)}{\partial\theta}\right|_{\theta\leftarrow\theta+\gamma v}\right]\right|_{\gamma=0} = \left.\mathbb{E}_{\mathcal{N}(0,I)}\!\left[\frac{\partial}{\partial\gamma}\left(g^\top \left.\frac{\partial(\mu(\theta)+R(\theta)\epsilon)}{\partial\theta}\right|_{\theta\leftarrow\theta+\gamma v}\right)\right]\right|_{\gamma=0}. \tag{6}$$

Eq. (6) is appealing since it does not need to store the dense Hessian matrix and provides an unbiased estimator of $Hv$ with a small sample size. To conduct second-order optimization for variational inference, once the computation of the gradient of the variational lower bound is complete, we only need to add one extra step of gradient evaluation via Eq. (6), which has the same computational complexity as Eq. (4). This leads to the Hessian-free variational inference method described in Algorithm 1.

In the worst case, the conjugate gradient (CG) algorithm inside HF requires at most $d$ iterations to terminate, meaning that it requires $d$ evaluations of the $Hv$ product. The good news, however, is that CG achieves good convergence after a reasonable number of iterations. In practice we found that it may not be necessary to wait for CG to converge: even if we set the maximum number of CG iterations to a small fixed number (e.g., 10 in our experiments, despite thousands of parameters), performance does not deteriorate. This early-stopping strategy may have an effect similar to the Wolfe condition in avoiding excessive step sizes in Newton’s method. We thereby reduce the complexity of each iteration to the same order as one SGD iteration.
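A minimal sketch of the CG solver used here (ours, under the assumption of a hypothetical 2×2 SPD matrix standing in for $H$): it touches $H$ only through matrix-vector products, and its iteration count is capped exactly as described above.

```python
# Conjugate gradient using only Hessian-vector products (no dense H stored),
# with a small fixed iteration cap. In HFSGVI the matvec would be Eq. (6);
# here it is a hypothetical 2x2 SPD matrix.
H = [[4.0, 1.0],
     [1.0, 3.0]]

def matvec(p):
    return [sum(H[i][j] * p[j] for j in range(2)) for i in range(2)]

def conjugate_gradient(matvec, b, max_iter=10, tol=1e-12):
    x = [0.0] * len(b)
    r = list(b)                      # residual r = b - Hx (x = 0 initially)
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Hp = matvec(p)
        alpha = rs_old / sum(pi * hpi for pi, hpi in zip(p, Hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, Hp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:             # early stopping, as in the text
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

x = conjugate_gradient(matvec, [1.0, 2.0])
print(x)   # exact solution of Hx = b is (1/11, 7/11)
```

For an $n$-dimensional SPD system CG terminates in at most $n$ steps in exact arithmetic, which is why the cap of 10 iterations is harmless here and often sufficient in practice.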

**L-BFGS** Limited-memory BFGS utilizes the information gleaned from the gradient vector to approximate the Hessian matrix without explicit computation, and we can readily utilize it within our framework. The basic BFGS update approximates the Hessian by an iterative algorithm $B_{t+1} = B_t + \Delta B_t$, where the update $\Delta B_t$ is built from $s_t = \theta_{t+1} - \theta_t$ and $y_t = \nabla F(\theta_{t+1}) - \nabla F(\theta_t)$. By Eq. (4), the gradient at each iteration can be obtained without difficulty. However, even though this low-rank approximation of the Hessian is easy to invert analytically via the Sherman–Morrison formula, we would still need to store the matrix. L-BFGS instead implicitly approximates the dense $B_t$ or $B_t^{-1}$ by tracking only a few recent gradient vectors and a short history of parameters, and therefore has a linear memory requirement. In general, L-BFGS performs a sequence of inner products with the $m$ most recent pairs $(s_t, y_t)$, where $m$ is a predefined constant (10 or 15 in our experiments). Due to space limitations, we omit the details here but nonetheless present this algorithm in the experiments section.
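The inner-product sequence mentioned above is the standard two-loop recursion; a self-contained sketch on a hypothetical 2-D quadratic (our illustration, not the paper's implementation) looks like this:

```python
# L-BFGS two-loop recursion: builds the search direction -B^{-1} grad from
# the m most recent (s, y) pairs, never forming a dense matrix.
def lbfgs_direction(grad, s_hist, y_hist):
    q = list(grad)
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # newest -> oldest
        rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
        a = rho * sum(si * qi for si, qi in zip(s, q))
        alphas.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    gamma = 1.0
    if s_hist:                                             # initial scaling
        s, y = s_hist[-1], y_hist[-1]
        gamma = sum(si * yi for si, yi in zip(s, y)) / sum(yi * yi for yi in y)
    r = [gamma * qi for qi in q]
    for a, rho, s, y in reversed(alphas):                  # oldest -> newest
        b = rho * sum(yi * ri for yi, ri in zip(y, r))
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]

def F(x):       # hypothetical objective with minimizer (1, -2)
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_F(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

x, m = [0.0, 0.0], 10
s_hist, y_hist = [], []
for _ in range(100):
    g = grad_F(x)
    if sum(gi * gi for gi in g) < 1e-16:
        break
    d = lbfgs_direction(g, s_hist, y_hist)
    t = 1.0
    while F([xi + t * di for xi, di in zip(x, d)]) > F(x) and t > 1e-10:
        t *= 0.5                                           # backtracking
    x_new = [xi + t * di for xi, di in zip(x, d)]
    s_hist.append([a - b for a, b in zip(x_new, x)])
    y_hist.append([a - b for a, b in zip(grad_F(x_new), g)])
    s_hist, y_hist = s_hist[-m:], y_hist[-m:]
    x = x_new
print(x)   # converges toward the minimizer (1, -2)
```

Only the `m` vector pairs are stored, so the memory cost is linear in the parameter dimension, which is the property exploited in the text.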

### 2.4 Estimator Variance

The framework of stochastic backpropagation [16, 17, 19, 25] extensively uses the mean of very few samples (often just one) to approximate the expectation. Similarly, we approximate the left-hand sides of Eqs. (4), (5), (6) by sampling a few points from the standard normal distribution. However, the magnitude of the variance of such an estimator has not been seriously discussed. [25] explored the variance quantitatively only for separable functions, while [19] merely borrowed a variance-reduction technique from reinforcement learning, centering the learning signal in expectation and performing variance normalization. Here, we generalize the treatment of variance to a broader family: Lipschitz continuous functions.

###### Theorem 2 (Variance Bound).

If $f$ is an $L$-Lipschitz differentiable function and $\epsilon \sim \mathcal{N}(0, I_d)$, then the variance $\mathrm{Var}[f(\epsilon)]$ is bounded above by a constant that depends only on the Lipschitz constant $L$ and is independent of the dimension $d$.

The proof of Theorem 2 (see supplementary) employs the properties of sub-Gaussian distributions and the duplication trick that are commonly used in learning theory. Significantly, the result implies a variance bound independent of the dimensionality of the Gaussian variable. Note that the basic argument yields a slightly looser constant; although this already suffices to illustrate the independence of the variance from $d$, the bound can in fact be tightened by a constant scalar, leading to the result of Theorem 2. If the results above hold for smooth (twice continuously differentiable) functions with Lipschitz constant $L$, then they hold for all Lipschitz functions by a standard approximation argument, so the smoothness condition can be relaxed to Lipschitz continuity.
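The dimension-free claim can be probed empirically with a simple sketch of ours: $f(\epsilon) = \lVert\epsilon\rVert$ is 1-Lipschitz, and its sample variance under $\mathcal{N}(0, I_d)$ stays bounded by a small constant even as $d$ grows by two orders of magnitude.

```python
import math
import random

random.seed(2)

# Empirical check of the dimension-free variance bound: f(eps) = ||eps||
# is 1-Lipschitz, and Var[f(eps)] for eps ~ N(0, I_d) stays O(1) in d.
def sample_var_of_norm(d, n=5000):
    vals = [math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d)))
            for _ in range(n)]
    mean = sum(vals) / n
    return sum((v - mean) ** 2 for v in vals) / n

v_low, v_high = sample_var_of_norm(2), sample_var_of_norm(400)
print(v_low, v_high)   # both stay below 1 despite the 200x dimension gap
```

This matches the Gaussian concentration phenomenon the theorem formalizes: $\lVert\epsilon\rVert$ concentrates tightly around $\sqrt{d}$, so its spread does not grow with $d$.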

###### Corollary 3 (Bias Bound).

.

It is also worth mentioning that a significant corollary of Theorem 2 is a probabilistic tail inequality measuring the convergence rate of the Monte Carlo approximation in our setting. This tail bound, together with the variance bound, provides a theoretical guarantee for stochastic backpropagation on Gaussian variables and explains why a single realization (one Monte Carlo sample) is enough in practice. By reparameterization, Eqs. (4), (5), (6) can be formulated as expectations w.r.t. the isotropic Gaussian distribution with identity covariance, leading to Algorithm 1. Thus we can rein in the number of samples for Monte Carlo integration regardless of the dimensionality of the latent variable $z$. This may seem counter-intuitive; however, we note that a larger Lipschitz constant $L$ may require more samples, and the Lipschitz constants of different models vary greatly.

## 3 Application on Variational Auto-encoder

Note that our method is model-free: if the loss function has the form of an expectation of a function w.r.t. latent Gaussian variables, we can directly use Algorithm 1. In this section, we put the emphasis on the standard VAE model [17], which has been intensively researched; in particular, the function takes a logarithmic form, thus bridging the gap between the Hessian and the Fisher information matrix in expectation (see the survey [22] and references therein).

### 3.1 Model Description

Suppose we have $N$ i.i.d. observations $X = \{x^{(i)}\}_{i=1}^N$, where each $x^{(i)}$ is a data vector that can take either continuous or discrete values. In contrast to a standard auto-encoder constructed by a neural network with a bottleneck structure, the VAE describes the embedding process from the perspective of a Gaussian latent variable model. Specifically, each data point $x$ follows the generative model $p_\psi(x|z)$, where this process is actually a decoder, usually constructed as a non-linear transformation with unknown parameters $\psi$, together with a prior distribution $p_\psi(z)$. The encoder or recognition model $q_\phi(z|x)$ is used to approximate the true posterior $p_\psi(z|x)$, where $\phi$ plays the role of the parameter of the variational distribution. As suggested in [16, 17, 25], a multi-layered perceptron (MLP) is commonly used as both the probabilistic encoder and decoder. We will later see that this construction is equivalent to a variant of deep neural networks under the constraint of a single sampled realization of $z$. For this model, the variational lower bound on the marginal likelihood for each datapoint is

$$\log p_\psi(x^{(i)}) \;\ge\; \mathbb{E}_{q_\phi(z|x^{(i)})}\!\left[\log p_\psi(x^{(i)}|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x^{(i)})\,\|\,p_\psi(z)\right) \;=\; \mathcal{L}(x^{(i)}). \tag{7}$$

We can in fact fold the KL divergence into the expectation term and treat the whole bound as one expectation. By the previous discussion, our objective is then to solve the optimization problem of the full-dataset variational lower bound, so the L-BFGS or HF SGVI algorithm can be implemented straightforwardly to estimate the parameters of both the generative and recognition models. Since the reconstruction-error term in Eq. (7) takes the form of an expectation over the latent variable, [17, 25] used a small finite number of samples for Monte Carlo integration with the reparameterization trick to reduce the variance; this amounts to drawing samples from the standard normal distribution. The second term, the KL divergence between the variational distribution and the prior, acts as a regularizer.
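A minimal sketch of a single-sample estimate of Eq. (7), assuming a diagonal-Gaussian encoder and a Bernoulli decoder (the tiny `decoder` below is a hypothetical stand-in for the MLP, not the authors' network):

```python
import math
import random

random.seed(3)

# Per-datapoint lower bound of Eq. (7): one reparameterized sample for the
# reconstruction term plus the closed-form KL to the N(0, I) prior.
def kl_diag_gauss(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form."""
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def decoder(z):   # hypothetical decoder: sigmoid of a fixed affine map
    return [1.0 / (1.0 + math.exp(-(0.8 * z[0] - 0.3 * z[1]))),
            1.0 / (1.0 + math.exp(-(0.5 * z[0] + 0.1 * z[1])))]

def lower_bound(x, mu, sigma):
    eps = [random.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]   # reparameterization
    y = decoder(z)
    rec = sum(xi * math.log(yi) + (1 - xi) * math.log(1 - yi)
              for xi, yi in zip(x, y))                    # log p(x|z)
    return rec - kl_diag_gauss(mu, sigma)

kl_zero = kl_diag_gauss([0.0], [1.0])    # 0.0: q equals the prior
lb = lower_bound([1, 0], [0.2, -0.1], [0.9, 1.1])
print(kl_zero, lb)
```

The KL term needs no sampling at all for Gaussian encoder and prior, which is why only the reconstruction term contributes Monte Carlo variance.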

### 3.2 Deep Neural Networks with Hybrid Hidden Layers

In the experiments, using a single Monte Carlo sample not only achieves excellent performance but also speeds up the program. In this special case, we can discuss the relationship between the VAE and the traditional deep auto-encoder. For binary inputs, denoting the decoder output by $y$, the reconstruction term $\log p_\psi(x|z)$ is exactly the negative cross-entropy between $x$ and $y$; it is likewise apparent that it is equivalent to the negative squared-error loss for continuous data. This means that maximizing the lower bound is roughly equal to minimizing the loss function of a deep neural network (see Figure 1 in supplementary), except for different regularizers. In other words, the prior in the VAE only imposes a regularizer on the encoder, while in deep neural nets an $\ell_2$ penalty on all parameters is always considered. From the perspective of deep neural networks with hybrid hidden nodes, the model consists of two Bernoulli layers and one Gaussian layer, and the gradient computation can simply follow a variant of backpropagation layer by layer (derivation given in supplementary). To further see the rationale of using a single sample, we investigate the upper bound of the Lipschitz constant under various activation functions in the next lemma. As Theorem 2 implies, the variance of the finite-sample approximation of the expectation mainly depends on the Lipschitz constant rather than the dimensionality. According to Lemma 4, imposing a prior or regularization on the parameters can control both the model complexity and the function smoothness. Lemma 4 also implies that we can obtain an upper bound of the Lipschitz constant for the estimators designed in our algorithm.
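The squared-error equivalence for continuous data can be verified in a few lines (our sketch, with hypothetical values): for an isotropic Gaussian decoder $p(x|z)=\mathcal{N}(x; y, I)$, the log-likelihood and the negative squared error differ only by a constant independent of the decoder output $y$.

```python
import math

# For p(x|z) = N(x; y, I), log p(x|z) = -0.5*||x - y||^2 - (d/2) log(2*pi),
# so maximizing the log-likelihood minimizes the squared error.
def gauss_log_lik(x, y):
    d = len(x)
    return (-0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
            - 0.5 * d * math.log(2.0 * math.pi))

x = [0.2, -1.0]
consts = []
for y in ([0.0, 0.0], [0.2, -1.0], [1.0, 1.0]):   # hypothetical outputs
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    consts.append(gauss_log_lik(x, y) + 0.5 * sq)  # same constant for all y
print(consts)
```

The Bernoulli case is analogous: $\log p(x|z)$ is literally the negative binary cross-entropy, with no residual constant at all.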

###### Lemma 4.

For a sigmoid activation function in a deep neural network with one Gaussian layer, the Lipschitz constant of the relevant function of the Gaussian layer is bounded by a quantity depending on $w_i$, the $i$-th row of the weight matrix $W$, and $b_i$, the $i$-th element of the bias. Similarly, for the hyperbolic tangent or softplus functions, the Lipschitz constant admits an analogous bound differing only in a constant factor.

## 4 Experiments

We apply our second-order stochastic variational inference to two different non-conjugate models. First, we consider a simple but widely used Bayesian logistic regression model and compare with the most recent first-order algorithm, doubly stochastic variational inference (DSVI) [28], designed for sparse variable selection with logistic regression. Then we compare the performance of our algorithms on the VAE model.

### 4.1 Bayesian Logistic Regression

Given a dataset $\{(x_i, t_i)\}_{i=1}^N$, where each instance $x_i$ includes the default feature 1 and $t_i$ is the binary label, Bayesian logistic regression models the probability of the output conditional on the features and the coefficients $w$, with an imposed prior on $w$. The likelihood and the prior usually take the forms $\prod_i \sigma(t_i\, w^\top x_i)$ and $\mathcal{N}(w|0, \Lambda)$ respectively, where $\sigma(\cdot)$ is the sigmoid function and $\Lambda$ is a diagonal covariance matrix for simplicity. We can propose a variational Gaussian distribution $q(w|\mu, C)$ to approximate the posterior of the regression parameter. If we further assume a diagonal $C$, a factorized form is both efficient and practical for inference. Unlike iteratively optimizing $\Lambda$ and $(\mu, C)$ as in variational EM, [28] noticed that the gradient calculation w.r.t. the lower bound indicates that the update of $\Lambda$ can be analytically worked out in terms of the variational parameters, resulting in a new objective function for the lower bound that relies only on $(\mu, C)$ (for details refer to [28]). We apply our algorithm to this variational logistic regression on three appropriate datasets: DukeBreast and Leukemia, which are small but high-dimensional and hence suited to sparse logistic regression, and a9a, which is large. See Table 1 for additional dataset descriptions.
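A sketch of the reparameterized lower-bound estimate for this model, assuming a factorized Gaussian $q$ and an isotropic prior (toy data and all parameter values below are hypothetical, for illustration only):

```python
import math
import random

random.seed(4)

# Variational lower bound for Bayesian logistic regression:
# E_q[ sum_i log sigmoid(t_i * w.x_i) ] - KL( q(w) || N(0, lam*I) ),
# with q(w) = N(mu, diag(sigma^2)) and w drawn via reparameterization.
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def kl_to_prior(mu, sigma, lam):
    """KL( N(mu, diag(sigma^2)) || N(0, lam*I) ) for diagonal covariances."""
    return sum(0.5 * ((s * s + m * m) / lam - 1.0 + math.log(lam / (s * s)))
               for m, s in zip(mu, sigma))

def elbo(X, t, mu, sigma, lam, n_samples=200):
    total = 0.0
    for _ in range(n_samples):
        w = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        total += sum(
            math.log(sigmoid(ti * sum(wi * xi for wi, xi in zip(w, x))))
            for x, ti in zip(X, t))
    return total / n_samples - kl_to_prior(mu, sigma, lam)

X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]   # first column: default feature 1
t = [1.0, -1.0, 1.0]
kl_zero = kl_to_prior([0.0, 0.0], [1.0, 1.0], 1.0)   # 0.0: q equals the prior
lb = elbo(X, t, [0.1, 0.3], [0.5, 0.5], 1.0)
print(kl_zero, lb)
```

Gradients of this estimator w.r.t. $(\mu, \sigma)$ follow directly from Eq. (4), and the Hessian-vector products needed by HFSGVI from Eq. (6).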

Fig. 5 shows the convergence of the Gaussian variational lower bound for Bayesian logistic regression in terms of running time. It is worth mentioning that the lower bound of HFSGVI converges within 3 iterations on the small datasets DukeBreast and Leukemia. This is because all data points are fed to all algorithms and HFSGVI uses a better approximation of the Hessian matrix to perform second-order optimization. L-BFGS-SGVI also takes less time to converge and yields a slightly larger lower bound than DSVI. In addition, as an SGD-based algorithm, DSVI is clearly less stable on small datasets and fluctuates strongly even in the later stages of optimization. For the large a9a, we observe that HFSGVI needs 1000 iterations to reach a good lower bound and becomes less stable than the other two algorithms; L-BFGS-SGVI, however, performs best in terms of both convergence rate and final lower bound. The misclassification results in Table 1 reflect similar advantages of our approach, indicating competitive prediction ability on various datasets. Finally, it is worth mentioning that all three algorithms learn a very sparse set of regression coefficients on the three datasets (see supplement for additional visualizations).

### 4.2 Variational Auto-encoder

We also apply second-order stochastic variational inference to train a VAE model (using a single Monte Carlo sample to estimate the expectation), or equivalently the deep neural network with hybrid hidden layers. The datasets used are images from Frey Face, Olivetti Face and MNIST. We mainly learned three tasks by maximizing the variational lower bound: parameter estimation, image reconstruction and image generation. Meanwhile, we compared the convergence rate (running time) of three algorithms, where the compared SGD here is the Ada version [6] recommended for the VAE model in [17, 25]. The experimental setting is as follows. The initial weights are drawn randomly from a zero-mean Gaussian with small standard deviation, while all bias terms are initialized to 0. Since the variational lower bound only introduces regularization on the encoder parameters, we add an $\ell_2$ regularizer on the decoder parameters with a small shrinkage parameter. The number of hidden nodes for encoder and decoder is the same for all auto-encoder models, which is reasonable and convenient for constructing a symmetric structure; the number is always tuned from 200 to 800 in increments of 100. The mini-batch size is 100 for L-BFGS and Ada, while a larger mini-batch is recommended for HF, varying according to the training size.

The detailed results are shown in Figs. 2 and 3. Both Hessian-free and L-BFGS converge faster than Ada in terms of running time. HFSGVI also performs competitively with respect to generalization on test data. Ada takes at least four times as long to achieve a similar lower bound. Theoretically, Newton’s method has a quadratic convergence rate in terms of iterations, but with a cubic algorithmic complexity per iteration; however, we manage to lower the computation in each iteration to linear complexity. Thus, considering the number of evaluated training data points, the second-order algorithm needs far fewer steps than first-order gradient descent (see visualization in supplementary on MNIST). The Hessian matrix also replaces manually tuned learning rates, and the affine-invariance property allows for automatic learning-rate adjustment. Technically, if the program can run in parallel on a GPU, the speed advantage of the second-order algorithm should be even more obvious [21].

Fig. 2(b) and Fig. 3(b) show reconstruction results for input images. From the perspective of a deep neural network, the only difference is the Gaussian-distributed latent variable $z$. By the corollary of Theorem 2, we can roughly say that the mean is able to represent the latent quantity well, meaning this layer is effectively a linear transformation with noise, which resembles dropout training [5]. Specifically, Olivetti includes 64×64-pixel faces of various persons, which means more complicated models or preprocessing [13] (e.g., nearest-neighbor interpolation, patch sampling) is needed. However, even when simply learning a very bottlenecked auto-encoder, our approach achieves acceptable results. Note that although we tuned the hyperparameters of Ada by cross-validation, the best result is still a collection of mean faces. For manifold learning, Fig. 2(c) shows how the learned generative model can simulate images with HFSGVI. To visualize the results, we choose a 2D latent variable $z$, where the model parameter is estimated by the algorithm. The two coordinates of $z$ take values that were transformed through the inverse CDF of the Gaussian distribution from an equally spaced grid (10×10 or 20×20) on the unit square. Then we merely use the generative model to simulate the images. Besides these learning tasks, denoising, imputation [25] and even generalization to semi-supervised learning [16] are possible applications of our approach.
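The inverse-CDF grid construction above can be sketched with the standard library (our illustration; the choice of quantile offsets is an assumption, since any equally spaced points strictly inside (0, 1) work):

```python
from statistics import NormalDist

# Map an equally spaced grid on the unit square through the inverse CDF of
# the standard Gaussian, laying out the 2D latent variable for generation.
nd = NormalDist()
k = 10                                               # 20 works the same way
quantiles = [(i + 0.5) / k for i in range(k)]        # strictly inside (0, 1)
grid_1d = [nd.inv_cdf(q) for q in quantiles]
latent_grid = [(zx, zy) for zx in grid_1d for zy in grid_1d]  # k*k points
print(grid_1d)
```

Each pair in `latent_grid` would then be fed through the decoder to produce one image tile; the inverse CDF spaces the points so they cover equal probability mass under the prior rather than equal distances in latent space.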

## 5 Conclusions and Discussion

In this paper we proposed a scalable second-order stochastic variational method for generative models with continuous latent variables. By developing Gaussian backpropagation through the reparameterization trick, we introduced an efficient unbiased estimator for higher-order gradient information. Combined with an efficient technique for computing the Hessian-vector multiplication, we derived an efficient inference algorithm (HFSGVI) that allows for joint optimization of all parameters. The algorithmic complexity of each parameter update is quadratic w.r.t. the dimension of the latent variables for both first and second derivatives. Furthermore, the overall computational complexity of our second-order SGVI is linear w.r.t. the number of parameters in real applications, just like SGD or Ada. However, HFSGVI may not be as fast as Ada in some situations, e.g., when the pixel values of images are sparse, owing to fast sparse matrix multiplication implementations in most software.

Future research will focus on some difficult deep models such as RNNs [10, 27] or Dynamic SBN [9]. Because of the conditional independence structure induced by conditioning on sampled latent variables, we may construct a block Hessian matrix to optimize such dynamic models. Another possible area of future work is reinforcement learning (RL) [20]. Many RL problems reduce to computing gradients of expectations (e.g., in policy-gradient methods), and there has been a series of explorations of natural gradients in this area. We suggest it might be interesting to consider where stochastic backpropagation fits in our framework and how second-order computations can help.

#### Acknowledgements

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant No. 614513).

## References

• [1] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, 2003.
• [2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
• [3] Joseph-Frédéric Bonnans, Jean Charles Gilbert, Claude Lemaréchal, and Claudia A Sagastizábal. Numerical optimization: theoretical and practical aspects. Springer Science & Business Media, 2006.
• [4] Georges Bonnet. Transformations des signaux aléatoires a travers les systèmes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220, 1964.
• [5] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP, 2013.
• [6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
• [7] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
• [8] Thomas S Ferguson. Location and scale parameters in exponential families of distributions. Annals of Mathematical Statistics, pages 986–1001, 1962.
• [9] Zhe Gan, Chunyuan Li, Ricardo Henao, David Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS, 2015.
• [10] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
• [11] James Hensman, Magnus Rattray, and Neil D Lawrence. Fast variational inference in the conjugate exponential family. In NIPS, 2012.
• [12] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The ”wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
• [13] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
• [14] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
• [15] Mohammad E Khan. Decoupled variational gaussian inference. In NIPS, 2014.
• [16] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
• [17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
• [18] James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
• [19] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
• [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• [21] Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and Andrew Y Ng. On optimization methods for deep learning. In ICML, 2011.
• [22] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
• [23] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
• [24] Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
• [25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
• [26] Tim Salimans. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
• [27] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
• [28] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, 2014.

## 6 Proofs of the Extended Gaussian Gradient Equations

###### Lemma 5.

Let $f(z)$ be an integrable and twice differentiable function. The second-order gradient of the expectation of $f(z)$ under a Gaussian distribution $\mathcal{N}(z|\mu,C)$ with respect to the mean $\mu$ can be expressed as the expectation of the Hessian of $f(z)$:

$$\nabla^2_{\mu_i,\mu_j}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^2_{z_i,z_j}f(z)\right] = 2\,\nabla_{C_{ij}}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)]. \quad (8)$$
###### Proof.

From Bonnet's theorem [4], we have

$$\nabla_{\mu_i}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla_{z_i}f(z)\right]. \quad (9)$$

Moreover, we can compute the second-order derivative:

$$\begin{aligned}
\nabla^2_{\mu_i,\mu_j}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] &= \nabla_{\mu_i}\left(\mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla_{z_j}f(z)\right]\right) \\
&= \int \nabla_{\mu_i}\mathcal{N}(z|\mu,C)\,\nabla_{z_j}f(z)\,dz \\
&= -\int \nabla_{z_i}\mathcal{N}(z|\mu,C)\,\nabla_{z_j}f(z)\,dz \\
&= -\left[\int \mathcal{N}(z|\mu,C)\,\nabla_{z_j}f(z)\,dz_{\neg i}\right]_{z_i=-\infty}^{z_i=+\infty} + \int \mathcal{N}(z|\mu,C)\,\nabla^2_{z_i,z_j}f(z)\,dz \\
&= \mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^2_{z_i,z_j}f(z)\right] \\
&= 2\,\nabla_{C_{ij}}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)],
\end{aligned}$$

where in the last equality we use the identity

$$\nabla_{C_{ij}}\mathcal{N}(z|\mu,C) = \frac{1}{2}\nabla^2_{z_i,z_j}\mathcal{N}(z|\mu,C). \quad (10)$$

∎
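Lemma 5 admits a quick one-dimensional numerical check. The sketch below is our illustration (not part of the proof): for the toy choice $f(z)=z^3$, the expectation $\mathbb{E}_{\mathcal{N}(\mu,C)}[z^3]=\mu^3+3\mu C$ has a closed form, so all three quantities in Eq.(8) can be compared directly, with the middle term estimated by Monte Carlo.

```python
import numpy as np

# Numerical check of Lemma 5 in 1-D for the toy choice f(z) = z**3:
#   E_{N(mu,C)}[z^3] = mu^3 + 3*mu*C, hence
#   d^2/dmu^2 E[f] = 6*mu  and  2 * d/dC E[f] = 2*(3*mu) = 6*mu,
# while E[f''(z)] = E[6 z] = 6*mu (estimated by Monte Carlo below).
rng = np.random.default_rng(0)
mu, C = 0.5, 1.0

lhs = 6 * mu                       # second derivative of closed form w.r.t. mu
rhs_cov = 2 * (3 * mu)             # 2 * derivative of closed form w.r.t. C

z = mu + np.sqrt(C) * rng.standard_normal(200_000)
rhs_mc = np.mean(6 * z)            # Monte Carlo estimate of E[f''(z)]

assert abs(lhs - rhs_cov) < 1e-12
assert abs(lhs - rhs_mc) < 0.05
```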

###### Lemma 6.

Let $f(z)$ be an integrable and four-times differentiable function. The second-order gradient of the expectation of $f(z)$ under a Gaussian distribution with respect to the covariance $C$ can be expressed as the expectation of the fourth-order gradient of $f(z)$:

$$\nabla^2_{C_{i,j},C_{k,l}}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \frac{1}{4}\mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^4_{z_i,z_j,z_k,z_l}f(z)\right]. \quad (11)$$
###### Proof.

From Price's theorem [24], we have

$$\nabla_{C_{i,j}}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] = \frac{1}{2}\mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^2_{z_i,z_j}f(z)\right]. \quad (12)$$

$$\begin{aligned}
\nabla^2_{C_{i,j},C_{k,l}}\mathbb{E}_{\mathcal{N}(z|\mu,C)}[f(z)] &= \frac{1}{2}\nabla_{C_{k,l}}\left(\mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^2_{z_i,z_j}f(z)\right]\right) \\
&= \frac{1}{2}\int \nabla_{C_{k,l}}\mathcal{N}(z|\mu,C)\,\nabla^2_{z_i,z_j}f(z)\,dz \\
&= \frac{1}{4}\int \nabla^2_{z_k,z_l}\mathcal{N}(z|\mu,C)\,\nabla^2_{z_i,z_j}f(z)\,dz \\
&= \frac{1}{4}\int \mathcal{N}(z|\mu,C)\,\nabla^4_{z_i,z_j,z_k,z_l}f(z)\,dz \\
&= \frac{1}{4}\mathbb{E}_{\mathcal{N}(z|\mu,C)}\left[\nabla^4_{z_i,z_j,z_k,z_l}f(z)\right].
\end{aligned}$$

In the third equality we use Eq.(10) again, and in the fourth equality we apply integration by parts twice. ∎
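Price's theorem (Eq.(12)) and Lemma 6 admit a similar one-dimensional check. The sketch below is our illustration with the toy choice $f(z)=z^4$, whose expectation $\mathbb{E}_{\mathcal{N}(\mu,C)}[z^4]=\mu^4+6\mu^2 C+3C^2$ has a closed form.

```python
import numpy as np

# Check Eq.(12) and Lemma 6 for the toy choice f(z) = z**4, using
# E_{N(mu,C)}[z^4] = mu^4 + 6*mu^2*C + 3*C^2:
#   d/dC E[f]     = 6*mu^2 + 6*C   vs  (1/2) E[f''(z)]   = (1/2) E[12 z^2]
#   d^2/dC^2 E[f] = 6              vs  (1/4) E[f''''(z)] = (1/4) * 24
rng = np.random.default_rng(1)
mu, C = 0.3, 0.8
z = mu + np.sqrt(C) * rng.standard_normal(500_000)

price_lhs = 6 * mu**2 + 6 * C
price_rhs = 0.5 * np.mean(12 * z**2)   # Monte Carlo estimate of (1/2) E[f'']
assert abs(price_lhs - price_rhs) < 0.05

lemma6_lhs = 6.0
lemma6_rhs = 0.25 * 24.0               # f'''' is constant, so expectation is exact
assert abs(lemma6_lhs - lemma6_rhs) < 1e-12
```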

From Eq.(9) and Eq.(12) we can straightforwardly write the second-order gradient of the interaction term as well:

$$\nabla^2_{\mu_i,C_{k,l}}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \frac{1}{2}\mathbb{E}_{\mathcal{N}(\mu,C)}\left[\nabla^3_{z_i,z_k,z_l}f(z)\right]. \quad (13)$$

## 7 Proof of Theorem 1

By using the linear transformation $z = \mu + R\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, we can generate samples from any Gaussian distribution $\mathcal{N}(\mu, C)$ with $C = RR^\top$, where $\mu$ and $R$ both depend on the parameter $\theta$.

Then the gradients of the expectation with respect to $\mu$ and/or $R$ are

$$\begin{aligned}
\nabla_R \mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \nabla_R \mathbb{E}_{\mathcal{N}(0,I)}[f(\mu+R\epsilon)] = \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon g^\top\right] \\
\nabla^2_{R_{i,j},R_{k,l}} \mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \nabla_{R_{i,j}} \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon_l g_k\right] = \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon_j \epsilon_l H_{ik}\right] \\
\nabla^2_{\mu_i,R_{k,l}} \mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon_l g_k\right] = \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon_l H_{ik}\right] \\
\nabla^2_{\mu} \mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \mathbb{E}_{\mathcal{N}(0,I)}[H]
\end{aligned}$$

where $g$ is the gradient of $f$ evaluated at $\mu+R\epsilon$, and $H$ is the Hessian of $f$ evaluated at $\mu+R\epsilon$.

Furthermore, we write the second-order derivatives in matrix form:

$$\nabla^2_{\mu,R}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(0,I)}\left[\epsilon^\top \otimes H\right], \qquad \nabla^2_{R}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(0,I)}\left[(\epsilon\epsilon^\top) \otimes H\right].$$

For a particular model, such as a deep generative model, $\mu$ and $C$ depend on the model parameters, which we denote by $\theta$, i.e. $z \sim \mathcal{N}(\mu(\theta), C(\theta))$. Combining Eq.(9) and Eq.(12) and using the chain rule, we have

$$\nabla_{\theta_l}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(\mu,C)}\left[g^\top \frac{\partial\mu}{\partial\theta_l} + \frac{1}{2}\mathrm{Tr}\left(H\frac{\partial C}{\partial\theta_l}\right)\right],$$

where, with a slight abuse of notation, $g$ and $H$ denote the first- and second-order gradients of $f$. This formulation involves matrix-matrix products: computed naively, the trace term costs $O(d_z^3)$ for any single element $\theta_l$, and correspondingly $O(d_\theta d_z^3)$ and $O(d_\theta^2 d_z^3)$ for the overall gradient and Hessian respectively, where $d_\theta$ is the number of parameters.
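As a sanity check of the chain-rule formula, the sketch below (our toy construction, not the authors' code) takes $\mu(\theta)=\theta m_0$, $C(\theta)=\theta^2 C_0$ and $f(z)=z^\top z$, for which $g=2z$, $H=2I$, and the exact gradient is $2\theta(\|m_0\|^2+\mathrm{tr}(C_0))$; the names $m_0$, $C_0$ are ours.

```python
import numpy as np

# Check  d/dtheta E[f] = E[ g^T dmu/dtheta + (1/2) Tr(H dC/dtheta) ]
# on a toy model: mu = theta*m0, C = theta^2*C0, f(z) = z^T z,
# so g = 2z, H = 2I, and E[f] = theta^2 (||m0||^2 + tr(C0)) in closed form.
rng = np.random.default_rng(6)
d, theta = 3, 0.7
m0 = np.array([1.0, -0.5, 2.0])
L = np.linalg.cholesky(np.eye(d) * 0.5 + np.ones((d, d)) * 0.1)
C0 = L @ L.T

exact = 2 * theta * (m0 @ m0 + np.trace(C0))

z = theta * m0 + rng.standard_normal((200_000, d)) @ (theta * L).T
g = 2 * z                                            # gradient of f at each sample
trace_term = 0.5 * np.trace(2 * np.eye(d) @ (2 * theta * C0))  # H = 2I, dC/dtheta = 2*theta*C0
est = np.mean(g @ m0) + trace_term

assert abs(est - exact) / abs(exact) < 0.02
```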

Considering the transformation $z = \mu + R\epsilon$ with $\epsilon \sim \mathcal{N}(0,I)$ and $C = RR^\top$, we obtain

$$\nabla_{\theta_l}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] = \mathbb{E}_{\mathcal{N}(0,I)}\left[g^\top \frac{\partial\mu}{\partial\theta_l} + \mathrm{Tr}\left(\epsilon g^\top \frac{\partial R}{\partial\theta_l}\right)\right] = \mathbb{E}_{\mathcal{N}(0,I)}\left[g^\top \frac{\partial\mu}{\partial\theta_l} + g^\top \frac{\partial R}{\partial\theta_l}\epsilon\right].$$
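This reparametrized gradient can be estimated by straightforward Monte Carlo. The sketch below is our toy construction (the names $m_0$, $R_0$ are ours): $\mu(\theta)=\theta m_0$, $R(\theta)=\theta R_0$, $f(z)=z^\top z$, for which the exact gradient is $2\theta(\|m_0\|^2+\mathrm{tr}(R_0 R_0^\top))$.

```python
import numpy as np

# Monte Carlo sketch of the reparametrized gradient
#   d/dtheta E[f(z)] = E_{N(0,I)}[ g^T dmu/dtheta + g^T (dR/dtheta) eps ]
# for a toy model: mu(theta) = theta*m0, R(theta) = theta*R0, f(z) = z^T z.
rng = np.random.default_rng(2)
d, theta = 3, 0.7
m0 = np.array([1.0, -0.5, 2.0])
R0 = np.eye(d) + 0.1 * rng.standard_normal((d, d))

exact = 2 * theta * (m0 @ m0 + np.trace(R0 @ R0.T))  # closed-form gradient

eps = rng.standard_normal((200_000, d))
z = theta * m0 + eps @ (theta * R0).T                # z = mu + R eps per sample
g = 2 * z                                            # gradient of f(z) = z^T z
est = np.mean(g @ m0 + np.einsum('ni,ij,nj->n', g, R0, eps))

assert abs(est - exact) / abs(exact) < 0.02
```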

For the second order, we have the following separated formulation:

$$\begin{aligned}
\nabla^2_{\theta_{l_1}\theta_{l_2}}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \nabla_{\theta_{l_1}}\mathbb{E}_{\mathcal{N}(0,I)}\left[\sum_i g_i\frac{\partial\mu_i}{\partial\theta_{l_2}} + \sum_{i,j}\epsilon_j g_i\frac{\partial R_{ij}}{\partial\theta_{l_2}}\right] \\
&= \mathbb{E}_{\mathcal{N}(0,I)}\Bigg[\sum_i\Big(\sum_k H_{ik}\Big(\frac{\partial\mu_k}{\partial\theta_{l_1}}+\sum_l \epsilon_l\frac{\partial R_{kl}}{\partial\theta_{l_1}}\Big)\Big)\frac{\partial\mu_i}{\partial\theta_{l_2}} + \sum_i g_i\frac{\partial^2\mu_i}{\partial\theta_{l_1}\partial\theta_{l_2}} \\
&\qquad + \sum_{i,j}\epsilon_j\Big(\sum_k H_{ik}\Big(\frac{\partial\mu_k}{\partial\theta_{l_1}}+\sum_l \epsilon_l\frac{\partial R_{kl}}{\partial\theta_{l_1}}\Big)\Big)\frac{\partial R_{ij}}{\partial\theta_{l_2}} + \sum_{i,j}\epsilon_j g_i\frac{\partial^2 R_{ij}}{\partial\theta_{l_1}\partial\theta_{l_2}}\Bigg] \\
&= \mathbb{E}_{\mathcal{N}(0,I)}\Bigg[\frac{\partial\mu}{\partial\theta_{l_1}}^{\!\top} H \frac{\partial\mu}{\partial\theta_{l_2}} + \Big(\frac{\partial R}{\partial\theta_{l_1}}\epsilon\Big)^{\!\top} H \frac{\partial\mu}{\partial\theta_{l_2}} + g^\top\frac{\partial^2\mu}{\partial\theta_{l_1}\partial\theta_{l_2}} \\
&\qquad + \Big(\frac{\partial R}{\partial\theta_{l_2}}\epsilon\Big)^{\!\top} H \frac{\partial\mu}{\partial\theta_{l_1}} + \Big(\frac{\partial R}{\partial\theta_{l_1}}\epsilon\Big)^{\!\top} H \frac{\partial R}{\partial\theta_{l_2}}\epsilon + g^\top\frac{\partial^2 R}{\partial\theta_{l_1}\partial\theta_{l_2}}\epsilon\Bigg] \\
&= \mathbb{E}_{\mathcal{N}(0,I)}\left[\frac{\partial(\mu+R\epsilon)}{\partial\theta_{l_1}}^{\!\top} H \frac{\partial(\mu+R\epsilon)}{\partial\theta_{l_2}} + g^\top\frac{\partial^2(\mu+R\epsilon)}{\partial\theta_{l_1}\partial\theta_{l_2}}\right].
\end{aligned}$$

Notice that the second-order gradient computation involves only matrix-vector or vector-vector multiplications, leading to an algorithmic complexity of $O(d_z^2)$ for each pair $(\theta_{l_1}, \theta_{l_2})$.

One practical parametrization is $C = \mathrm{diag}\{\sigma^2\}$ or, equivalently, $R = \mathrm{diag}\{\sigma\}$, which reduces the actual second-order gradient computation cost, albeit with the same asymptotic order. Then we have

$$\begin{aligned}
\nabla_{\theta_l}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \mathbb{E}_{\mathcal{N}(0,I)}\left[g^\top\frac{\partial\mu}{\partial\theta_l} + \sum_i \epsilon_i g_i \frac{\partial\sigma_i}{\partial\theta_l}\right] \quad (14) \\
&= \mathbb{E}_{\mathcal{N}(0,I)}\left[g^\top\frac{\partial\mu}{\partial\theta_l} + (\epsilon\odot g)^\top\frac{\partial\sigma}{\partial\theta_l}\right], \\
\nabla^2_{\theta_{l_1}\theta_{l_2}}\mathbb{E}_{\mathcal{N}(\mu,C)}[f(z)] &= \mathbb{E}_{\mathcal{N}(0,I)}\Bigg[\frac{\partial\mu}{\partial\theta_{l_1}}^{\!\top} H\frac{\partial\mu}{\partial\theta_{l_2}} + \Big(\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_1}}\Big)^{\!\top} H\frac{\partial\mu}{\partial\theta_{l_2}} + g^\top\frac{\partial^2\mu}{\partial\theta_{l_1}\partial\theta_{l_2}} \quad (15) \\
&\qquad + \Big(\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_2}}\Big)^{\!\top} H\frac{\partial\mu}{\partial\theta_{l_1}} + \Big(\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_1}}\Big)^{\!\top} H\Big(\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_2}}\Big) \\
&\qquad + (\epsilon\odot g)^\top\frac{\partial^2\sigma}{\partial\theta_{l_1}\partial\theta_{l_2}}\Bigg] \\
&= \mathbb{E}_{\mathcal{N}(0,I)}\Bigg[\Big(\frac{\partial\mu}{\partial\theta_{l_1}}+\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_1}}\Big)^{\!\top} H\Big(\frac{\partial\mu}{\partial\theta_{l_2}}+\epsilon\odot\frac{\partial\sigma}{\partial\theta_{l_2}}\Big) \\
&\qquad + g^\top\Big(\frac{\partial^2\mu}{\partial\theta_{l_1}\partial\theta_{l_2}}+\frac{\partial^2(\epsilon\odot\sigma)}{\partial\theta_{l_1}\partial\theta_{l_2}}\Big)\Bigg],
\end{aligned}$$

where $\odot$ denotes the Hadamard (element-wise) product.
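Eq.(14) can be checked for the diagonal parametrization with a small Monte Carlo experiment. The model below ($\mu=\theta m_0$, $\sigma=\exp(\theta s_0)$, $f(z)=z^\top z$) is our toy choice, with names $m_0$, $s_0$ ours.

```python
import numpy as np

# Check Eq.(14) with R = diag(sigma):
#   d/dtheta E[f] = E[ g^T dmu/dtheta + (eps ⊙ g)^T dsigma/dtheta ].
# Toy model: mu = theta*m0, sigma = exp(theta*s0), f(z) = z^T z, so
#   E[f] = theta^2 ||m0||^2 + sum_i exp(2*theta*s0_i)  in closed form.
rng = np.random.default_rng(3)
theta = 0.4
m0 = np.array([1.0, -2.0, 0.5])
s0 = np.array([0.2, -0.3, 0.1])

mu, sigma = theta * m0, np.exp(theta * s0)
exact = 2 * theta * (m0 @ m0) + np.sum(2 * s0 * np.exp(2 * theta * s0))

eps = rng.standard_normal((200_000, 3))
z = mu + sigma * eps                                 # z = mu + sigma ⊙ eps
g = 2 * z                                            # gradient of f(z) = z^T z
est = np.mean(g @ m0 + (eps * g) @ (s0 * sigma))     # dsigma/dtheta = s0 ⊙ sigma

assert abs(est - exact) / abs(exact) < 0.02
```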

**Derivation for Hessian-free SGVI without plugging in the parametrization.** In this case $(\mu, R)$ is the parameter of the variational distribution. According to the derivation in this section, the Hessian matrix with respect to $(\mu, R)$ can be represented as $H_{\mu,R} = \mathbb{E}_{\mathcal{N}(0,I)}\big[\big[\begin{smallmatrix}1\\\epsilon\end{smallmatrix}\big]\big[1,\epsilon^\top\big] \otimes H\big]$. For any matrix $V$ of the same dimensionality as $[\mu, R]$, we also have the Hessian-vector multiplication identity

$$H_{\mu,R}\,\mathrm{vec}(V) = \mathbb{E}_{\mathcal{N}(0,I)}\left[\mathrm{vec}\left(H V \begin{bmatrix}1\\\epsilon\end{bmatrix}\begin{bmatrix}1,\ \epsilon^\top\end{bmatrix}\right)\right],$$

where $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix formed by stacking its columns into a single column vector. This allows efficient computation in both speed and storage.
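The Hessian-vector identity can be sanity-checked numerically. The sketch below is our construction: for the toy quadratic $f(z)=z^\top A z$, the Hessian $H=A+A^\top$ is constant, so the exact product reduces to $\mathrm{vec}(HV)$, since $\mathbb{E}\big[[1;\epsilon][1,\epsilon^\top]\big]=I$; the Monte Carlo average of the bracketed matrix should recover it.

```python
import numpy as np

# Monte Carlo check of  H_{mu,R} vec(V) = E[ vec(H V [1;eps][1,eps^T]) ]
# in the constant-Hessian case f(z) = z^T A z, where H = A + A^T and
# the exact product is vec(H V) because E[[1;eps][1,eps^T]] = I.
rng = np.random.default_rng(4)
d = 3
A = rng.standard_normal((d, d))
H = A + A.T                              # Hessian of z^T A z (constant in z)
V = rng.standard_normal((d, d + 1))      # same shape as the block [mu, R]

exact = (H @ V).ravel(order='F')         # vec() stacks columns

n = 200_000
eps = rng.standard_normal((n, d))
one_eps = np.concatenate([np.ones((n, 1)), eps], axis=1)   # rows [1, eps^T]
M = one_eps.T @ one_eps / n              # approx E[[1;eps][1,eps^T]], close to I
mc = (H @ V @ M).ravel(order='F')        # Monte Carlo estimate of the identity

assert np.max(np.abs(mc - exact)) < 0.1
```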

## 8 Forward-Backward Algorithm for the Special Variational Auto-encoder Model

We illustrate the equivalent deep neural network model (Figure 4) of the VAE and derive the gradient computation by layer-wise backpropagation. Without loss of generality, we discuss binary inputs and a diagonal covariance matrix; it is straightforward to write out the continuous case. For binary inputs, the parameters are the weights and biases of the layers below.

The feedforward process is as follows:

$$h_e = \tanh(W_1 x + b_1), \qquad \mu_e = W_2 h_e + b_2, \qquad \sigma_e = \exp\{(W_3 h_e + b_3)/2\}, \qquad \epsilon \sim \mathcal{N}(0, I).$$
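The encoder feedforward above can be sketched directly in code. The layer sizes, weight initialization, and variable names below are our assumptions for illustration; the reparametrized sample $z = \mu_e + \sigma_e \odot \epsilon$ follows the transformation used throughout this section.

```python
import numpy as np

# Sketch of the VAE encoder feedforward (binary input, diagonal covariance).
# Layer sizes and the random initialization are our illustrative assumptions.
rng = np.random.default_rng(5)
d_x, d_h, d_z = 8, 16, 4
x = rng.integers(0, 2, size=d_x).astype(float)       # binary input vector

W1, b1 = 0.1 * rng.standard_normal((d_h, d_x)), np.zeros(d_h)
W2, b2 = 0.1 * rng.standard_normal((d_z, d_h)), np.zeros(d_z)
W3, b3 = 0.1 * rng.standard_normal((d_z, d_h)), np.zeros(d_z)

h_e = np.tanh(W1 @ x + b1)                 # encoder hidden layer
mu_e = W2 @ h_e + b2                       # variational mean
sigma_e = np.exp((W3 @ h_e + b3) / 2)      # variational std (diagonal R)
eps = rng.standard_normal(d_z)
z = mu_e + sigma_e * eps                   # reparametrized latent sample

assert z.shape == (d_z,) and np.all(sigma_e > 0)
```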