1 Introduction
Building flexible and scalable uncertainty models [MacKay, 1992, Neal, 2012, Hinton and Van Camp, 1993] has long been a goal in Bayesian deep learning. Variational Bayesian neural networks [Graves, 2011, Blundell et al., 2015] are especially appealing because they combine the flexibility of deep learning with Bayesian uncertainty estimation. However, such models tend to impose overly restrictive assumptions (e.g., a fully factorized posterior) when approximating posterior distributions. There have been attempts to fit more expressive distributions [Louizos and Welling, 2016, Sun et al., 2017], but they are difficult to train due to strong and complicated posterior dependencies.
Noisy natural gradient is a simple and efficient method to fit multivariate Gaussian posteriors [Zhang et al., 2017]. It adds adaptive weight noise to regular natural gradient updates. Noisy KFAC is a practical algorithm in the family of noisy natural gradient [Zhang et al., 2017], which fits a matrix-variate Gaussian posterior (flexible posterior) with only minimal changes to the ordinary KFAC update [Martens and Grosse, 2015] (cheap inference). The update for noisy KFAC closely resembles the standard KFAC update with correlated weight noise.
Nevertheless, we note that a matrix-variate Gaussian cannot capture an accurate diagonal variance. In this work, we build upon noisy KFAC and Eigenvalue corrected Kronecker-factored Approximate Curvature (EKFAC) [George et al., 2018] to improve the flexibility of the posterior distribution. We compute the diagonal variance not in parameter coordinates but in the KFAC eigenbasis. This leads to a more expressive posterior distribution. The relationship is described in Figure 1. Using this insight, we introduce a modified training method for variational Bayesian neural networks called noisy EKFAC.
2 Natural Gradient
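Natural gradient descent is a second-order optimization technique first proposed by Amari [1997]. It is classically motivated as a way of implementing steepest descent in the space of distributions instead of the space of parameters. The distance function for distribution space is the KL divergence on the model's predictive distribution: $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\Delta\theta}) \approx \frac{1}{2}\Delta\theta^\top F \Delta\theta$, where $F$ is the Fisher matrix.
$$F = \mathbb{E}\left[\nabla_\theta \log p(y|x,\theta)\,\nabla_\theta \log p(y|x,\theta)^\top\right] \tag{1}$$
This results in the preconditioned gradients $\tilde{\nabla}_\theta h = F^{-1}\nabla_\theta h$. Natural gradient descent is invariant to smooth and invertible reparameterizations of the model [Martens, 2014].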
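As a small illustration of the preconditioning step, the following sketch builds a Monte Carlo estimate of a Fisher matrix from gradient samples and solves for the preconditioned gradient. The gradient samples are random stand-ins (an assumption for illustration, not data from the paper), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-likelihood gradient samples, one row per sample.
per_example_grads = rng.normal(size=(256, 4))

# Fisher estimate: average outer product of the gradient samples.
fisher = per_example_grads.T @ per_example_grads / per_example_grads.shape[0]

# Stand-in for the gradient of the objective h.
grad = per_example_grads.mean(axis=0)

# Preconditioned gradient F^{-1} grad, computed with a linear solve
# rather than by forming the inverse explicitly.
nat_grad = np.linalg.solve(fisher, grad)
```

For a real network the Fisher matrix is far too large for a dense solve, which is what motivates the structured approximations discussed next.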
2.1 Kronecker-Factored Approximate Curvature
Modern neural networks contain millions of parameters, which makes storing and computing the inverse of the Fisher matrix impractical. Kronecker-Factored Approximate Curvature (KFAC) [Martens and Grosse, 2015] uses Kronecker products to efficiently approximate the inverse Fisher matrix. (Extending this work, KFAC was shown to be amenable to distributed computation [Ba et al., 2016] and to generalize as well as SGD [Zhang et al., 2018].)
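For a layer of a neural network whose input activations are $a \in \mathbb{R}^{n}$, weight matrix $W \in \mathbb{R}^{n \times m}$, and pre-activation output $s \in \mathbb{R}^{m}$, we can write $s = W^\top a$. The gradient with respect to the weight matrix is $\nabla_W h = a(\nabla_s h)^\top$. Assuming $a$ and $\nabla_s h$ are independent under the model's predictive distribution, KFAC decouples the Fisher matrix $F$:
$$F = \mathbb{E}\left[\mathrm{vec}\{\nabla_W h\}\,\mathrm{vec}\{\nabla_W h\}^\top\right] = \mathbb{E}\left[\{\nabla_s h\}\{\nabla_s h\}^\top \otimes aa^\top\right] \approx \mathbb{E}\left[\{\nabla_s h\}\{\nabla_s h\}^\top\right] \otimes \mathbb{E}\left[aa^\top\right] = S \otimes A \tag{2}$$
where $A = \mathbb{E}[aa^\top]$ and $S = \mathbb{E}[\{\nabla_s h\}\{\nabla_s h\}^\top]$. Bernacchia et al. [2018] showed that the Kronecker approximation is exact for deep linear networks, justifying the validity of the above assumption. Further assuming between-layer independence, the Fisher matrix is approximated as block diagonal, consisting of layer-wise Fisher matrices. Decoupling $F$ into $A$ and $S$ avoids the memory issue of storing the full matrix $F$ while also allowing efficient inverse Fisher vector products:
$$F^{-1}\,\mathrm{vec}\{\nabla_W h\} = \left(S^{-1} \otimes A^{-1}\right)\mathrm{vec}\{\nabla_W h\} = \mathrm{vec}\left[A^{-1}\,\nabla_W h\,S^{-1}\right] \tag{3}$$
As shown in equation (3), natural gradient descent with KFAC only consists of a series of matrix multiplications comparable to the size of $W$. This enables an efficient computation of natural gradient descent.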
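The inverse Fisher vector product can be checked numerically. The sketch below, with random symmetric positive definite stand-ins for the two Kronecker factors (hypothetical values, not statistics from a trained network), compares a dense solve against the factored form, assuming column-stacking vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3  # input and output size of the layer


def random_spd(d):
    """Random symmetric positive definite matrix (stand-in for a Kronecker factor)."""
    x = rng.normal(size=(d, 2 * d))
    return x @ x.T / (2 * d) + 0.1 * np.eye(d)


def vec(x):
    """Column-stacking vectorization."""
    return x.reshape(-1, order="F")


A = random_spd(n)             # input-activation factor, n x n
S = random_spd(m)             # pre-activation-gradient factor, m x m
V = rng.normal(size=(n, m))   # stand-in for the gradient with respect to W

# Dense route: build the (nm x nm) Kronecker product and solve.
naive = np.linalg.solve(np.kron(S, A), vec(V))

# Factored route: two small solves give A^{-1} V S^{-1} (S is symmetric).
fast = np.linalg.solve(S, np.linalg.solve(A, V).T).T
```

The factored route never materializes the $nm \times nm$ matrix, so its cost is governed by the $n \times n$ and $m \times m$ factors.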
2.2 An Alternative Interpretation of Natural Gradient
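George et al. [2018] suggest an alternative way of interpreting the natural gradient update. Writing the eigendecomposition of the Fisher matrix as $F = Q \Lambda Q^\top$, the update can be broken down into three stages:
$$F^{-1}\nabla_\theta h = \underbrace{Q}_{\text{(iii)}}\;\underbrace{\Lambda^{-1}}_{\text{(ii)}}\;\underbrace{Q^\top}_{\text{(i)}}\,\nabla_\theta h \tag{4}$$
The first stage (i) projects the gradient vector onto the full Fisher eigenbasis $Q$. The next stage (ii) rescales the coordinates in the full Fisher eigenbasis with the diagonal rescaling factor $\Lambda$. The last stage (iii) projects back to the parameter coordinates.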
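The three stages can be written out directly. This is a minimal sketch on a small random positive definite matrix standing in for the Fisher matrix (an assumption for illustration); the names `projected` and `rescaled` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
x = rng.normal(size=(d, 3 * d))
F = x @ x.T / (3 * d)            # stand-in for a positive definite Fisher matrix
g = rng.normal(size=d)           # stand-in for the gradient vector

eigvals, Q = np.linalg.eigh(F)   # F = Q diag(eigvals) Q^T

projected = Q.T @ g              # stage 1: project onto the eigenbasis
rescaled = projected / eigvals   # stage 2: rescale each coordinate
nat_grad = Q @ rescaled          # stage 3: project back to parameter coordinates
```

The result agrees with solving $F x = g$ directly; the point of the decomposition is that the basis and the rescaling can be approximated separately.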
For a diagonal approximation of the Fisher matrix, the basis is chosen to be the identity matrix $I$ and the rescaling factor $R_{ii} = \mathbb{E}[(\nabla_\theta h)_i^2]$ is the second moment of the gradient vector. While estimating the diagonal factor is simple and efficient, obtaining an accurate eigenbasis is difficult. The crude basis in the diagonal Fisher introduces a significant approximation error.
KFAC decouples the Fisher matrix into $S$ and $A$. Since $S$ and $A$ are symmetric positive semi-definite matrices, by eigendecomposition, they can be represented as $S = Q_S \Lambda_S Q_S^\top$ and $A = Q_A \Lambda_A Q_A^\top$, where $Q$ is an orthogonal matrix whose columns are eigenvectors and $\Lambda$ is a diagonal matrix with eigenvalues. We use properties of the Kronecker product to further decompose the factorization:
$$S \otimes A = (Q_S \otimes Q_A)(\Lambda_S \otimes \Lambda_A)(Q_S \otimes Q_A)^\top \tag{5}$$
Based on the new interpretation, we have the KFAC eigenbasis $Q_S \otimes Q_A$ and the diagonal rescaling factor $\Lambda_S \otimes \Lambda_A$. The KFAC eigenbasis is provably a more accurate approximation of the full Fisher eigenbasis. However, it does not use the estimated variance along the basis. The rescaling factor in KFAC is constrained to the Kronecker structure.
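The Kronecker eigendecomposition can be verified on small factors. The sketch below uses random positive semi-definite stand-ins for the two factors (hypothetical values) and checks that the Kronecker product of the eigenvector matrices diagonalizes the Kronecker product of the factors.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_psd(d):
    """Random symmetric positive semi-definite matrix (stand-in for a factor)."""
    x = rng.normal(size=(d, 2 * d))
    return x @ x.T / (2 * d)


A, S = random_psd(4), random_psd(3)

lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)

Q = np.kron(Q_S, Q_A)              # KFAC eigenbasis
rescale = np.kron(lam_S, lam_A)    # diagonal of Lambda_S kron Lambda_A
reconstructed = Q @ np.diag(rescale) @ Q.T
```

Note that `rescale` has 12 entries but is generated from only 3 + 4 eigenvalues, which is the Kronecker constraint on the rescaling factor.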
2.3 Eigenvalue Corrected Kronecker-Factored Approximate Curvature
Eigenvalue corrected KFAC (EKFAC) [George et al., 2018] extends KFAC to compute a more accurate diagonal rescaling factor in the KFAC eigenbasis $Q_S \otimes Q_A$, where $Q_S$ and $Q_A$ contain the eigenvectors of the Kronecker factors $S$ and $A$. The rescaling factor for KFAC is expressed in $n + m$ degrees of freedom, where $n$ and $m$ are the input and output sizes of a layer. The KFAC factorization in equation (5) does not capture an accurate diagonal rescaling factor in the KFAC eigenbasis because of the Kronecker structure. Instead, EKFAC computes the second moment of the gradient vector in the KFAC eigenbasis. We define the rescaling matrix $R \in \mathbb{R}^{nm \times nm}$ as follows:
$$R_{ii} = \mathbb{E}\left[\left((Q_S \otimes Q_A)^\top \nabla_\theta h\right)_i^2\right] \tag{6}$$
$R$ is a diagonal matrix whose entries are the second moments. The Fisher matrix can be approximated with the KFAC eigenbasis and the rescaling matrix:
$$F \approx (Q_S \otimes Q_A)\,R\,(Q_S \otimes Q_A)^\top \tag{7}$$
The EKFAC rescaling matrix minimizes the approximation error of the above equation in Frobenius norm. In comparison to the KFAC approximation, the EKFAC approximation is more flexible in representing the diagonal rescaling factor, with $nm$ degrees of freedom. Figure 2 illustrates the difference between KFAC and EKFAC.
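The Frobenius-norm claim can be illustrated on synthetic data. The sketch below draws hypothetical activations and pre-activation gradients that are deliberately correlated (so the Kronecker approximation is inexact), then compares the Kronecker-structured rescaling with the second moments measured in the same eigenbasis. All data and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num = 4, 3, 2000  # layer input size, output size, number of samples

# Hypothetical activations and pre-activation gradients that are NOT independent.
a = rng.normal(size=(num, n))
g = np.tanh(a[:, :m]) + 0.5 * rng.normal(size=(num, m))

# Under column stacking, vec(a g^T) equals kron(g, a).
grads = np.einsum("bi,bj->bij", g, a).reshape(num, n * m)
F = grads.T @ grads / num                        # full layer Fisher estimate

A = a.T @ a / num
S = g.T @ g / num
lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)
Q = np.kron(Q_S, Q_A)                            # KFAC eigenbasis

kfac_scale = np.kron(lam_S, lam_A)               # Kronecker-structured rescaling
ekfac_scale = np.mean((grads @ Q) ** 2, axis=0)  # second moments in the eigenbasis

# Frobenius-norm approximation errors of the two rescalings.
err_kfac = np.linalg.norm(F - Q @ np.diag(kfac_scale) @ Q.T)
err_ekfac = np.linalg.norm(F - Q @ np.diag(ekfac_scale) @ Q.T)
```

Because both approximations share the basis `Q`, the error reduces to the distance between a diagonal matrix and `Q.T @ F @ Q`, which is smallest when the diagonal equals that matrix's own diagonal. Those diagonal entries are exactly the second moments in `ekfac_scale`.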
3 Variational Bayesian Neural Networks
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, a Bayesian Neural Network (BNN) is composed of a log-likelihood $\log p(\mathcal{D}|w)$ and a prior $p(w)$ on the weights. Performing inference on a BNN requires integrating over the intractable posterior distribution $p(w|\mathcal{D})$. Variational Bayesian methods [Hinton and Van Camp, 1993, Graves, 2011, Blundell et al., 2015] attempt to fit an approximate posterior $q(w)$ to maximize the evidence lower bound (ELBO):
$$\mathcal{L}(q) = \mathbb{E}_{q}\left[\log p(\mathcal{D}|w)\right] - \lambda\, D_{\mathrm{KL}}\left(q(w)\,\|\,p(w)\right) \tag{8}$$
where $\lambda$ is a regularization parameter and $\phi$ are the parameters of the variational posterior. Exact Bayesian inference uses $\lambda = 1$, but it can be tuned in practical settings.
Bayes By Backprop (BBB) [Blundell et al., 2015] is the most common variational BNN training method. It uses a fully factorized Gaussian approximation to the posterior, i.e. $q(w) = \mathcal{N}(w; \mu, \mathrm{diag}(\sigma^2))$. The variational parameters $\phi = (\mu, \sigma^2)$ are updated according to stochastic gradients of $\mathcal{L}$ obtained by the reparameterization trick [Kingma and Welling, 2013].
There have been attempts to fit a matrix-variate Gaussian posterior for BNNs [Louizos and Welling, 2016, Sun et al., 2017]. Compared to overly restricted variational families, a matrix-variate Gaussian effectively captures correlations between weights. However, computing the gradients and enforcing the positive semi-definite constraint for the row and column covariances $\Sigma_1$ and $\Sigma_2$ make the inference challenging. Existing methods typically impose additional structures such as diagonal covariance [Louizos and Welling, 2016] or products of Householder transformations [Sun et al., 2017] to ensure efficient updates.
3.1 Noisy Natural Gradient
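Noisy natural gradient (NNG) is an efficient method to fit multivariate Gaussian posteriors [Zhang et al., 2017] by adding adaptive weight noise to ordinary natural gradient updates. (Khan et al. [2018] also found the relationship between natural gradient and variational inference and derived VAdam by adding weight noise to Adam, which is similar to noisy Adam in Zhang et al. [2017].) Assuming $q(w)$ is a multivariate Gaussian posterior parameterized by $\phi = (\mu, \Sigma)$ and $p(w)$ is a spherical Gaussian, the update rules are:
$$\begin{aligned} F &\leftarrow (1 - \tilde{\beta})\,F + \tilde{\beta}\,\nabla_w \log p(y|x;w)\,\nabla_w \log p(y|x;w)^\top \\ \mu &\leftarrow \mu + \tilde{\alpha}\left(F + \frac{\lambda}{N\eta} I\right)^{-1}\left[\nabla_w \log p(y|x;w) - \frac{\lambda}{N\eta}\,w\right] \end{aligned} \tag{9}$$
where $\lambda$ is the KL weight, $\eta$ is the prior variance, $N$ is the number of training examples, and $\tilde{\alpha}$ and $\tilde{\beta}$ are learning rates. In each iteration, NNG samples weights from the variational posterior $q(w)$, which is a multivariate Gaussian with the covariance matrix:
$$\Sigma = \left(\frac{N}{\lambda} F + \eta^{-1} I\right)^{-1} \tag{10}$$
However, due to computational intractability, it is necessary to impose a structured restriction on the covariance matrix. This is equivalent to imposing the same structure on the Fisher matrix.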
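To make the link between the Fisher matrix and the posterior covariance concrete, here is a minimal sketch. It assumes the covariance takes the form $\Sigma = (\frac{N}{\lambda} F + \eta^{-1} I)^{-1}$ described for noisy natural gradient by Zhang et al. [2017]; the Fisher matrix, posterior mean, and hyperparameter values are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 1000          # number of weights, number of training examples (hypothetical)
lam, eta = 1.0, 0.5     # KL weight and prior variance (hypothetical values)

x = rng.normal(size=(d, 4 * d))
F = x @ x.T / (4 * d)   # stand-in for the Fisher matrix
mu = np.zeros(d)        # stand-in for the posterior mean

# Posterior precision and covariance implied by the Fisher matrix.
precision = (N / lam) * F + np.eye(d) / eta
Sigma = np.linalg.inv(precision)

# Draw one weight sample w ~ N(mu, Sigma) through a Cholesky factor.
L = np.linalg.cholesky(Sigma)
w = mu + L @ rng.normal(size=d)
```

The dense inverse and Cholesky factor are only feasible for tiny models, which is why any structure imposed on `F` carries over directly to `Sigma`.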
3.2 Fitting Matrix-Variate Gaussian Posteriors with Noisy KFAC
Noisy KFAC is a tractable instance of NNG with a Kronecker-factored approximation to the Fisher. Because imposing a structured approximation on the covariance is equivalent to imposing the same structure on the Fisher matrix, noisy KFAC enforces a Kronecker product structure on the covariance matrix. It efficiently fits the matrix-variate Gaussian (MVG) posterior. The posterior covariance is given by
$$\Sigma = \frac{\lambda}{N}\,[S^{\gamma}]^{-1} \otimes [A^{\gamma}]^{-1}, \qquad S^{\gamma} = S + \frac{1}{\pi}\sqrt{\frac{\lambda}{N\eta}}\,I, \qquad A^{\gamma} = A + \pi\sqrt{\frac{\lambda}{N\eta}}\,I \tag{11}$$
where $\pi$ is a scalar constant introduced by Martens and Grosse [2015] in the context of damping to keep a compact representation of the Kronecker product. The pseudocode for noisy KFAC is given in Appendix D. In comparison to existing methods that fit MVG posteriors [Sun et al., 2017, Louizos and Welling, 2016], noisy KFAC does not assume additional approximations.
4 Methods
While a matrix-variate Gaussian posterior efficiently captures correlations between different weights, the diagonal variance in the KFAC eigenbasis is not optimal. The KFAC diagonal rescaling factor $\Lambda_S \otimes \Lambda_A$ does not match the second moment along the associated eigenvector, $\mathbb{E}[((Q_S \otimes Q_A)^\top \nabla_\theta h)_i^2]$.
We develop a new tractable instance of noisy natural gradient. It keeps track of the diagonal variance in KFAC eigenbasis, resulting in a more flexible posterior distribution. In the context of NNG, imposing a structural restriction to the Fisher matrix is equivalent to imposing the same restriction to the variational posterior. For example, noisy KFAC imposes a Kronecker product structure to the covariance matrix as shown in equation (11).
Given these insights, building a flexible variational posterior boils down to finding an improved approximation of the Fisher matrix. We adopt the EKFAC method, which is provably a better approximation of the Fisher matrix than KFAC. We term the new BNN training method noisy EKFAC.
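EKFAC uses an eigenvalue corrected Kronecker-factored approximation to the Fisher matrix as described in equation (7). For each layer, it estimates $A$, $S$ and $R$ online using exponential moving averages. Conveniently, this resembles the exponential moving average updates for the noisy natural gradient in equation (9).
$$\begin{aligned} A &\leftarrow (1 - \tilde{\beta})\,A + \tilde{\beta}\,aa^\top \\ S &\leftarrow (1 - \tilde{\beta})\,S + \tilde{\beta}\,\nabla_s \log p(y|x;w)\,\nabla_s \log p(y|x;w)^\top \\ R_{ii} &\leftarrow (1 - \tilde{\omega})\,R_{ii} + \tilde{\omega}\left[(Q_S \otimes Q_A)^\top \nabla_w \log p(y|x;w)\right]_i^2 \end{aligned} \tag{12}$$
where $\tilde{\beta}$ is the learning rate for the Kronecker factors and $\tilde{\omega}$ is the learning rate for the diagonal rescaling factor.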
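The online estimates can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the function names, the choice to average per-example second moments over a mini-batch, and storing the rescaling entries as an $n \times m$ array (the unvectorized diagonal) are all assumptions.

```python
import numpy as np


def ema(old, new, rate):
    """Exponential moving average with step size `rate`."""
    return (1.0 - rate) * old + rate * new


def ekfac_statistics_step(A, S, R, a, g, Q_A, Q_S, beta, omega):
    """One online update from a mini-batch.

    a: (batch, n) input activations; g: (batch, m) pre-activation gradients;
    R: (n, m) second moments in the Kronecker-factored eigenbasis.
    """
    batch = a.shape[0]
    A = ema(A, a.T @ a / batch, beta)
    S = ema(S, g.T @ g / batch, beta)
    # The per-example gradient a_i g_i^T in the eigenbasis is
    # (Q_A^T a_i)(Q_S^T g_i)^T, so its squared entries factorize.
    a_rot = a @ Q_A
    g_rot = g @ Q_S
    second_moment = (a_rot ** 2).T @ (g_rot ** 2) / batch
    R = ema(R, second_moment, omega)
    return A, S, R


rng = np.random.default_rng(0)
n, m, batch = 4, 3, 8
a = rng.normal(size=(batch, n))
g = rng.normal(size=(batch, m))
Q_A, _ = np.linalg.qr(rng.normal(size=(n, n)))  # stand-in eigenbases
Q_S, _ = np.linalg.qr(rng.normal(size=(m, m)))

A1, S1, R1 = ekfac_statistics_step(
    np.eye(n), np.eye(m), np.ones((n, m)), a, g, Q_A, Q_S, beta=0.1, omega=0.1
)
```

Keeping the rescaling entries in matrix form avoids ever building the $nm \times nm$ eigenbasis when accumulating the second moments.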
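We introduce an eigenvalue corrected matrix-variate Gaussian (EMVG) posterior, shown in Figure 1. An EMVG is a generalization of a matrix-variate Gaussian distribution with the following form:
$$\mathcal{EMN}(W; M, \Sigma_1, \Sigma_2, R) = \mathcal{N}\left(\mathrm{vec}(W);\ \mathrm{vec}(M),\ (Q_{\Sigma_2} \otimes Q_{\Sigma_1})\,R\,(Q_{\Sigma_2} \otimes Q_{\Sigma_1})^\top\right) \tag{13}$$
where $Q_{\Sigma_1}$ and $Q_{\Sigma_2}$ contain the eigenvectors of the row and column covariances $\Sigma_1$ and $\Sigma_2$, and $R$ is a diagonal rescaling matrix.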
An EMVG posterior is potentially powerful because it not only compactly represents covariances between weights but also computes a full diagonal variance in the KFAC eigenbasis. Applying the EKFAC approximation to equation (10) yields an EMVG posterior. Therefore, we factorize the covariance matrix in the same sense that EKFAC approximates the Fisher matrix:
$$\Sigma = \frac{\lambda}{N}\,(Q_S \otimes Q_A)\,(R^{\gamma})^{-1}\,(Q_S \otimes Q_A)^\top, \qquad R^{\gamma} = R + \frac{\lambda}{N\eta}\,I \tag{14}$$
where $\frac{\lambda}{N\eta}$ is an intrinsic damping term. Since the damping does not affect the KFAC eigenbasis, we explicitly represent the damping term in the rescaling matrix. In practice, it may be advantageous to add extrinsic damping to the rescaling matrix to stabilize training.
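The remark that damping leaves the eigenbasis untouched follows from the basis being orthogonal, and can be checked numerically. The sketch below uses a random orthogonal matrix as a stand-in eigenbasis and a hypothetical damping strength.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # stand-in for an orthogonal eigenbasis
R = rng.uniform(0.5, 2.0, size=d)             # diagonal rescaling entries
damping = 0.1                                 # hypothetical damping strength

# Adding a multiple of the identity in parameter space ...
damped_in_parameter_space = Q @ np.diag(R) @ Q.T + damping * np.eye(d)
# ... is the same as shifting every rescaling entry in the eigenbasis.
damped_in_eigenbasis = Q @ np.diag(R + damping) @ Q.T
```

This is why the damping can be folded into the rescaling entries without recomputing any eigenvectors.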
The only difference from standard EKFAC is that the weights are sampled from the variational posterior $q(w)$. We can interpret noisy EKFAC in the sense that $\mu$ is a point estimate of the weights and $\Sigma$ is the covariance of correlated Gaussian noise for each training example. The full algorithm is described in Algorithm 1.
The inference is efficient because the covariance matrix is factorized into three small matrices $Q_A$, $Q_S$ and $R$. We can use the following identity, valid for generic matrices $U$, $V$ and $X$ of compatible sizes, to compute Kronecker products efficiently: $(U \otimes V)\,\mathrm{vec}(X) = \mathrm{vec}(V X U^\top)$.
5 Related Work
Variational inference was first applied to neural networks by Peterson [1987] and Hinton and Van Camp [1993]. Then, Graves [2011] proposed a practical method for variational inference with fully factorized Gaussian posteriors which uses a simple (but biased) gradient estimator. Improving on this work, Blundell et al. [2015] proposed an unbiased gradient estimator using the reparameterization trick of Kingma and Welling [2013]. Several non-Gaussian variational posteriors have also been proposed, such as Multiplicative Normalizing Flows [Louizos and Welling, 2017] and implicit distributions [Shi et al., 2018]. Neural networks with dropout were also interpreted as BNNs [Gal and Ghahramani, 2016, Gal et al., 2017].
A few recent works explored structured covariance approximations by exploiting the relationship between natural gradient and variational inference. Both Zhang et al. [2017] and Khan et al. [2018] used a diagonal Fisher approximation in natural gradient VI, obtaining a fully factorized Gaussian posterior. Zhang et al. [2017] also proposed an interesting extension by using KFAC, which leads to a matrix-variate Gaussian posterior. Concurrently, Mishkin et al. [2018] adopted a "diagonal plus low-rank" approximation. This method shares the same spirit as this work. However, their low-rank approximation is computationally expensive and thus only applied to two-layer (shallow) neural networks.
6 Experiments
To empirically evaluate the proposed method, we test under two scenarios, regression and classification, to investigate the following questions. (1) Does noisy EKFAC have improved prediction performance compared to existing methods? (2) Is it able to scale to large datasets and modern convolutional neural network architectures?
6.1 Regression
We evaluate our method on standard BNN benchmarks from the UCI collection [Dheeru and Karra Taniskidou, 2017], adopting evaluation protocols from Hernández-Lobato and Adams [2015]. In particular, we introduce a Gamma prior for the precision of the Gaussian likelihood and include a Gamma posterior in the variational objective.
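Table 1: Average test RMSE and log-likelihood for the regression benchmarks.

Test RMSE

| Dataset | BBB | Noisy Adam | Noisy KFAC | Noisy EKFAC |
|---|---|---|---|---|
| Boston | 3.171 ± 0.149 | 3.031 ± 0.155 | 2.742 ± 0.015 | 2.527 ± 0.158 |
| Concrete | 5.678 ± 0.087 | 5.613 ± 0.113 | 5.019 ± 0.127 | 4.880 ± 0.120 |
| Energy | 0.565 ± 0.018 | 0.839 ± 0.046 | 0.485 ± 0.019 | 0.497 ± 0.023 |
| Kin8nm | 0.080 ± 0.001 | 0.079 ± 0.001 | 0.076 ± 0.001 | 0.076 ± 0.000 |
| Naval | 0.000 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |
| Pow. Plant | 4.023 ± 0.036 | 4.002 ± 0.039 | 3.886 ± 0.041 | 3.895 ± 0.053 |
| Protein | 4.321 ± 0.017 | 4.380 ± 0.016 | 4.097 ± 0.009 | 4.042 ± 0.027 |
| Wine | 0.643 ± 0.012 | 0.644 ± 0.011 | 0.637 ± 0.011 | 0.635 ± 0.013 |
| Yacht | 1.174 ± 0.086 | 1.289 ± 0.069 | 0.979 ± 0.077 | 0.974 ± 0.116 |
| Year | 9.076 ± NA | 9.071 ± NA | 8.885 ± NA | 8.642 ± NA |

Test log-likelihood

| Dataset | BBB | Noisy Adam | Noisy KFAC | Noisy EKFAC |
|---|---|---|---|---|
| Boston | -2.602 ± 0.031 | -2.558 ± 0.032 | -2.409 ± 0.047 | -2.378 ± 0.044 |
| Concrete | -3.149 ± 0.018 | -3.145 ± 0.023 | -3.039 ± 0.025 | -3.002 ± 0.025 |
| Energy | -1.500 ± 0.006 | -1.629 ± 0.020 | -1.421 ± 0.004 | -1.448 ± 0.004 |
| Kin8nm | 1.111 ± 0.007 | 1.112 ± 0.008 | 1.148 ± 0.007 | 1.149 ± 0.012 |
| Naval | 6.143 ± 0.032 | 6.231 ± 0.041 | 7.079 ± 0.034 | 7.287 ± 0.002 |
| Pow. Plant | -2.807 ± 0.010 | -2.803 ± 0.010 | -2.776 ± 0.012 | -2.774 ± 0.012 |
| Protein | -2.882 ± 0.004 | -2.896 ± 0.004 | -2.836 ± 0.002 | -2.819 ± 0.007 |
| Wine | -0.977 ± 0.017 | -0.976 ± 0.016 | -0.969 ± 0.014 | -0.964 ± 0.002 |
| Yacht | -2.408 ± 0.007 | -2.412 ± 0.006 | -2.316 ± 0.006 | -2.224 ± 0.007 |
| Year | -3.614 ± NA | -3.620 ± NA | -3.595 ± NA | -3.573 ± NA |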
We randomly split the data into training (90%) and test (10%) sets. To reduce randomness, we repeat the splitting 10 times, except for the two largest datasets: "Year" and "Protein" are repeated 1 and 5 times, respectively. During training, we normalize input features and training targets to zero mean and unit variance. We do not adopt this normalization at test time. All experiments train a network with a single hidden layer of 50 units, except for the "Protein" and "Year" datasets, which use 100 hidden units. We use a batch size of 10 for smaller datasets, 100 for larger datasets, and 500 for the "Year" dataset. To stabilize training, we reinitialize the rescaling matrix every 50 iterations with the KFAC eigenvalues. This is equivalent to executing a KFAC iteration. We found that the second moment of the gradient vector is unstable for a small batch. We amortize the basis update to ensure the rescaling matrix matches the eigenbasis, setting . For learning rates, we use , , and for all datasets. They are decayed by a factor of 0.1 for the second half of training.
We compare the results with Bayes by Backprop [Blundell et al., 2015], noisy Adam, and noisy KFAC [Zhang et al., 2017]. We report root mean square error (RMSE) and log-likelihood on the test dataset. The results are summarized in Table 1. Noisy EKFAC yields the highest test log-likelihood on nine of the ten datasets, the exception being Energy, where noisy KFAC is better. The training plot is shown in Figure 3. Noisy EKFAC also achieves a higher ELBO compared to noisy KFAC.
6.2 Classification
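Table 2: Test accuracy (%) on CIFAR-10 with the modified VGG16. D denotes data augmentation and B denotes Batch Normalization.

| Method | None | D | B | D + B |
|---|---|---|---|---|
| SGD | 81.79 | 88.35 | 85.75 | 91.39 |
| KFAC | 82.39 | 88.89 | 86.86 | 92.13 |
| Noisy KFAC | 85.52 | 89.35 | 88.22 | 92.01 |
| Noisy EKFAC | 87.07 | 89.86 | 88.45 | 92.22 |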
To evaluate the scalability of the proposed method, we train a modified VGG16 [Simonyan and Zisserman, 2014] and test on the CIFAR-10 benchmark [Krizhevsky, 2009]. The modified VGG16 has half the number of hidden units in all layers. Similar to applying KFAC on convolutional layers with Kronecker factors [Grosse and Martens, 2016], EKFAC can be extended to convolutional layers. We compare the results with SGD (with momentum), KFAC, and noisy KFAC.
We use a batch size of 128 for all experiments. To reduce the computational overhead, we amortize covariance, inverse, and rescaling updates. Specifically, we set , , and . We noticed that the amortization does not significantly impact per-iteration optimization performance. and are set to 0.01 for both noisy KFAC and noisy EKFAC. We adopt batch normalization [Ioffe and Szegedy, 2015] and data augmentation. We tune the regularization parameter $\lambda$ and the prior variance $\eta$. With data augmentation, we use a smaller regularization parameter. is set to 0.1 without batch normalization and 1.0 with batch normalization.
The results are summarized in Table 2. Noisy EKFAC achieves the highest test accuracy in all settings without introducing computational overhead. Without extra regularization tricks, noisy EKFAC improves on noisy KFAC by 1.55 percentage points (87.07 vs. 85.52).
7 Conclusion
In this paper, we introduced a modified training method for variational Bayesian neural networks. An eigenvalue corrected matrix-variate Gaussian extends a matrix-variate Gaussian to represent the posterior distribution with more flexibility. It not only efficiently captures correlations between weights but also computes an accurate diagonal variance in the KFAC eigenbasis. For both regression and classification evaluations, noisy EKFAC achieves higher ELBO and test accuracy, demonstrating its effectiveness.
References
 Amari [1997] Shunichi Amari. Neural learning in structured parameter spaces - natural riemannian gradient. In Advances in neural information processing systems, pages 127–133, 1997.
 Ba et al. [2016] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. 2016.
 Bernacchia et al. [2018] Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and application to the nonlinear case. 2018.
 Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Dheeru and Karra Taniskidou [2017] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 Gal et al. [2017] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
 George et al. [2018] Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884, 2018.
 Graves [2011] Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356, 2011.
 Grosse and Martens [2016] Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
 Hernández-Lobato and Adams [2015] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
 Hinton and Van Camp [1993] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Khan et al. [2018] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Louizos and Welling [2016] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
 Louizos and Welling [2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227, 2017.
 MacKay [1992] David JC MacKay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
 Martens [2014] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 Martens and Grosse [2015] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
 Mishkin et al. [2018] Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. Slang: Fast structured covariance approximations for bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6246–6256, 2018.
 Neal [2012] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
 Peterson [1987] Carsten Peterson. A mean field theory learning algorithm for neural networks. Complex systems, 1:995–1019, 1987.
 Shi et al. [2018] Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. In International Conference on Learning Representations, 2018.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sun et al. [2017] Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292, 2017.
 Zhang et al. [2017] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
 Zhang et al. [2018] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.
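Appendix A Matrix-Variate Gaussian
A matrix-variate Gaussian distribution models a multivariate Gaussian distribution for a matrix $W \in \mathbb{R}^{n \times m}$.
$$\mathcal{MN}(W; M, \Sigma_1, \Sigma_2) = \frac{\exp\left(-\frac{1}{2}\mathrm{tr}\left[\Sigma_2^{-1}(W - M)^\top \Sigma_1^{-1}(W - M)\right]\right)}{(2\pi)^{nm/2}\,|\Sigma_2|^{n/2}\,|\Sigma_1|^{m/2}} \tag{15}$$
where $M \in \mathbb{R}^{n \times m}$ is the mean, $\Sigma_1 \in \mathbb{R}^{n \times n}$ is the covariance among rows and $\Sigma_2 \in \mathbb{R}^{m \times m}$ is the covariance among columns. Since $\Sigma_1$ and $\Sigma_2$ are covariance matrices, they are positive definite. Vectorization of $W$ forms a multivariate Gaussian distribution whose covariance matrix is a Kronecker product of $\Sigma_2$ and $\Sigma_1$.
$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\ \Sigma_2 \otimes \Sigma_1\right) \tag{16}$$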
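The vectorization property can be checked through the usual sampling construction $W = M + L_1 \mathcal{E} L_2^\top$, where $L_1$ and $L_2$ are Cholesky factors of the row and column covariances. The sketch below assumes column-stacking vectorization; the covariances and mean are random stand-ins and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2


def random_spd(d):
    """Random symmetric positive definite matrix (stand-in for a covariance)."""
    x = rng.normal(size=(d, 2 * d))
    return x @ x.T / (2 * d) + 0.1 * np.eye(d)


def vec(x):
    """Column-stacking vectorization."""
    return x.reshape(-1, order="F")


Sigma1 = random_spd(n)        # covariance among rows, n x n
Sigma2 = random_spd(m)        # covariance among columns, m x m
M = rng.normal(size=(n, m))   # mean

L1 = np.linalg.cholesky(Sigma1)
L2 = np.linalg.cholesky(Sigma2)

# One sample W = M + L1 E L2^T with E a matrix of standard normal entries.
E = rng.normal(size=(n, m))
W = M + L1 @ E @ L2.T

# vec(W - M) = (L2 kron L1) vec(E), so the covariance of vec(W) is
# (L2 kron L1)(L2 kron L1)^T = Sigma2 kron Sigma1.
mixing = np.kron(L2, L1)
cov_vec = mixing @ mixing.T
```

Only the two small Cholesky factors are needed to draw a sample; the Kronecker product is formed here purely to verify the covariance.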
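Appendix B Eigenvalue Corrected Matrix-Variate Gaussian
An eigenvalue corrected matrix-variate Gaussian is an extension of a matrix-variate Gaussian that considers the full diagonal variance in the Kronecker-factored eigenbasis.
$$\mathcal{EMN}(W; M, \Sigma_1, \Sigma_2, R) = \frac{\exp\left(-\frac{1}{2}\,\mathrm{vec}(W - M)^\top P^{-1}\,\mathrm{vec}(W - M)\right)}{(2\pi)^{nm/2}\,|R|^{1/2}}, \qquad P = (Q_{\Sigma_2} \otimes Q_{\Sigma_1})\,R\,(Q_{\Sigma_2} \otimes Q_{\Sigma_1})^\top \tag{17}$$
Here $Q_{\Sigma_1}$ and $Q_{\Sigma_2}$ contain the eigenvectors of $\Sigma_1$ and $\Sigma_2$, and $R$ is the rescaling matrix. Because covariance matrices are positive definite, the diagonal entries in $R$ are all positive. Similar to a matrix-variate Gaussian distribution, vectorization of $W$ yields a multivariate Gaussian distribution whose covariance matrix has a Kronecker-factored eigenbasis.
$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\ (Q_{\Sigma_2} \otimes Q_{\Sigma_1})\,R\,(Q_{\Sigma_2} \otimes Q_{\Sigma_1})^\top\right) \tag{18}$$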
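Sampling from an eigenvalue corrected matrix-variate Gaussian distribution is also a special case of sampling from a multivariate Gaussian distribution. Let $\mathcal{E} \in \mathbb{R}^{n \times m}$ be a matrix of independent samples from a standard Gaussian.
$$\mathrm{vec}(\mathcal{E}) \sim \mathcal{N}(0, I) \tag{19}$$
Then let
$$W = M + Q_{\Sigma_1}\left[\mathcal{E} \odot \sqrt{\mathrm{unvec}(\mathrm{diag}(R))}\right] Q_{\Sigma_2}^\top \tag{20}$$
where $W \sim \mathcal{EMN}(M, \Sigma_1, \Sigma_2, R)$, $Q_{\Sigma_1}$ and $Q_{\Sigma_2}$ contain the eigenvectors of the row and column covariances, the square root is taken elementwise, $\odot$ is an elementwise multiplication, and unvec is the inverse of the vec operation.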
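A minimal sketch of this sampling procedure follows, compared against the dense multivariate Gaussian construction. The eigenbases are random orthogonal stand-ins, the rescaling entries are stored as an $n \times m$ array (the unvectorized diagonal), and column-stacking vectorization is assumed; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

Q1, _ = np.linalg.qr(rng.normal(size=(n, n)))  # stand-in eigenvectors of the row factor
Q2, _ = np.linalg.qr(rng.normal(size=(m, m)))  # stand-in eigenvectors of the column factor
R = rng.uniform(0.5, 2.0, size=(n, m))         # unvec of the diagonal of the rescaling matrix
M = rng.normal(size=(n, m))                    # mean


def vec(x):
    """Column-stacking vectorization."""
    return x.reshape(-1, order="F")


def sample_emvg(E):
    """Map standard normal noise E (n x m) to a sample with mean M."""
    return M + Q1 @ (np.sqrt(R) * E) @ Q2.T


E = rng.normal(size=(n, m))
W = sample_emvg(E)

# Dense equivalent: vec(W) = vec(M) + Q diag(R)^{1/2} vec(E), with Q = Q2 kron Q1.
Q = np.kron(Q2, Q1)
dense = vec(M) + Q @ (np.sqrt(vec(R)) * vec(E))
```

The factored version costs two small matrix products and one elementwise product per sample, whereas the dense version needs the full $nm \times nm$ basis.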
Appendix C Derivation of EKFAC Update
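Let $V \in \mathbb{R}^{n \times m}$ be the weight gradient, $A$ the covariance matrix of input activations, and $S$ the covariance matrix of output pre-activations, with eigenvector matrices $Q_A$ and $Q_S$. $R$ is the diagonal rescaling matrix in the KFAC eigenbasis: $R_{ii} = \mathbb{E}[((Q_S \otimes Q_A)^\top \nabla_\theta h)_i^2]$. The following is the derivation of the EKFAC update shown in Algorithm 1.
$$\begin{aligned} (Q_S \otimes Q_A)\,R^{-1}\,(Q_S \otimes Q_A)^\top \mathrm{vec}(V) &= (Q_S \otimes Q_A)\,R^{-1}\,\mathrm{vec}(Q_A^\top V Q_S) \\ &= (Q_S \otimes Q_A)\,\mathrm{vec}\left((Q_A^\top V Q_S) \oslash \mathrm{unvec}(\mathrm{diag}(R))\right) \\ &= \mathrm{vec}\left(Q_A\left[(Q_A^\top V Q_S) \oslash \mathrm{unvec}(\mathrm{diag}(R))\right] Q_S^\top\right) \end{aligned} \tag{21}$$
where $\mathrm{diag}(R) \in \mathbb{R}^{nm}$ is the vector of diagonal entries of $R$, $\oslash$ is an elementwise division, and unvec is the inverse of the vec operation.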
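The derivation can be confirmed numerically. The sketch below applies the factored form to a random gradient and compares it with the dense product, assuming column-stacking vectorization. The eigenbases, rescaling entries, and the function name `ekfac_precondition` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3

Q_A, _ = np.linalg.qr(rng.normal(size=(n, n)))  # stand-in eigenvectors of A
Q_S, _ = np.linalg.qr(rng.normal(size=(m, m)))  # stand-in eigenvectors of S
R = rng.uniform(0.5, 2.0, size=(n, m))          # unvec of the diagonal rescaling matrix
V = rng.normal(size=(n, m))                     # stand-in weight gradient


def vec(x):
    """Column-stacking vectorization."""
    return x.reshape(-1, order="F")


def ekfac_precondition(V, Q_A, Q_S, R):
    """Project to the eigenbasis, divide elementwise, project back."""
    projected = Q_A.T @ V @ Q_S
    return Q_A @ (projected / R) @ Q_S.T


update = ekfac_precondition(V, Q_A, Q_S, R)

# Dense equivalent with the full (nm x nm) eigenbasis.
Q = np.kron(Q_S, Q_A)
dense = Q @ np.diag(1.0 / vec(R)) @ Q.T @ vec(V)
```

As with KFAC, the update reduces to a handful of matrix products whose sizes match the weight matrix.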