Building flexible and scalable uncertainty models [MacKay, 1992, Neal, 2012, Hinton and Van Camp, 1993] has long been a goal in Bayesian deep learning. Variational Bayesian neural networks [Graves, 2011, Blundell et al., 2015] are especially appealing because they combine the flexibility of deep learning with Bayesian uncertainty estimation. However, such models tend to impose overly restrictive assumptions (e.g., fully factorized posteriors) when approximating the posterior distribution. There have been attempts to fit more expressive distributions [Louizos and Welling, 2016, Sun et al., 2017], but they are difficult to train due to strong and complicated posterior dependencies.
Noisy natural gradient is a simple and efficient method for fitting multivariate Gaussian posteriors [Zhang et al., 2017]: it adds adaptive weight noise to regular natural gradient updates. Noisy K-FAC is a practical algorithm in the noisy natural gradient family [Zhang et al., 2017], which fits a matrix-variate Gaussian posterior (a flexible posterior) with only minimal changes to the ordinary K-FAC update [Martens and Grosse, 2015] (cheap inference). The noisy K-FAC update closely resembles the standard K-FAC update with correlated weight noise.
Nevertheless, we note that a matrix-variate Gaussian cannot capture an accurate diagonal variance. In this work, we build upon noisy K-FAC and Eigenvalue corrected Kronecker-factored Approximate Curvature (EK-FAC) [George et al., 2018] to improve the flexibility of the posterior distribution. We compute the diagonal variance not in parameter coordinates but in the K-FAC eigenbasis, which leads to a more expressive posterior distribution. The relationship is described in Figure 1. Using this insight, we introduce a modified training method for variational Bayesian neural networks called noisy EK-FAC.
2 Natural Gradient
Natural gradient descent is a second-order optimization technique first proposed by Amari [1997]. It is classically motivated as a way of implementing steepest descent in the space of distributions instead of the space of parameters. The distance function for distribution space is the KL divergence on the model's predictive distribution: $\mathrm{KL}\left(p_\theta \,\|\, p_{\theta + \Delta\theta}\right) \approx \frac{1}{2}\,\Delta\theta^\top F\, \Delta\theta$, where $F = \mathbb{E}\left[\nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^\top\right]$ is the Fisher matrix.
This results in the preconditioned gradient $\tilde{\nabla} h = F^{-1} \nabla h$. Natural gradient descent is invariant to smooth and invertible reparameterizations of the model [Martens, 2014].
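As a toy illustration (not from the paper), the sketch below applies a natural gradient step to the mean of a univariate Gaussian likelihood with known variance. The Fisher information for the mean is $1/\sigma^2$, so the natural gradient undoes the $1/\sigma^2$ scaling in the ordinary gradient, and the step is invariant to the choice of $\sigma$:

```python
import numpy as np

# Toy model p(y | mu) = N(y; mu, sigma^2) with fixed sigma.
sigma = 2.0
y, mu = 1.0, 0.0

grad = (mu - y) / sigma**2   # gradient of -log p(y | mu) wrt mu
fisher = 1.0 / sigma**2      # Fisher information for the mean of a Gaussian
nat_grad = grad / fisher     # natural gradient: F^{-1} * grad

# The natural gradient step jumps straight to the maximum likelihood
# solution y here, regardless of sigma.
assert np.isclose(mu - nat_grad, y)
```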
2.1 Kronecker-Factored Approximate Curvature
Modern neural networks contain millions of parameters, which makes storing and computing the inverse of the Fisher matrix impractical. Kronecker-Factored Approximate Curvature (K-FAC) [Martens and Grosse, 2015] uses Kronecker products to efficiently approximate the inverse Fisher matrix. (Extending this work, K-FAC was shown to be amenable to distributed computation [Ba et al., 2016] and to generalize as well as SGD [Zhang et al., 2018].)
For a layer of a neural network whose input activations are $a \in \mathbb{R}^{d_{\mathrm{in}}}$, weight matrix is $W \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, and pre-activation output is $s \in \mathbb{R}^{d_{\mathrm{out}}}$, we can write $s = W a$. The gradient with respect to the weight matrix is $\nabla_W = (\nabla_s)\, a^\top$. Assuming $a$ and $\nabla_s$ are independent under the model's predictive distribution, K-FAC decouples the Fisher matrix $F$:
$$F = \mathbb{E}\left[\mathrm{vec}(\nabla_W)\,\mathrm{vec}(\nabla_W)^\top\right] = \mathbb{E}\left[a a^\top \otimes \nabla_s \nabla_s^\top\right] \approx \mathbb{E}\left[a a^\top\right] \otimes \mathbb{E}\left[\nabla_s \nabla_s^\top\right] = A \otimes S,$$
where $A = \mathbb{E}[a a^\top]$ and $S = \mathbb{E}[\nabla_s \nabla_s^\top]$. Bernacchia et al. [2018] showed that the Kronecker approximation is exact for deep linear networks, justifying the validity of the above assumption. Further assuming between-layer independence, the Fisher matrix is approximated as block diagonal, with one layer-wise Fisher block per layer. Decoupling $F$ into $A$ and $S$ avoids the memory cost of storing the full matrix while also enabling efficient inverse-Fisher-vector products:
$$F^{-1}\,\mathrm{vec}(V) = \left(A^{-1} \otimes S^{-1}\right)\mathrm{vec}(V) = \mathrm{vec}\left(S^{-1} V A^{-1}\right).$$
As shown above, natural gradient descent with K-FAC consists only of a series of matrix multiplications involving matrices comparable in size to $W$. This enables efficient computation of the natural gradient.
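The inverse-Fisher-vector product can be checked numerically. The following sketch uses hypothetical random Kronecker factors and column-major vectorization, and verifies that the Kronecker identity matches a naive inversion of the full Fisher:

```python
import numpy as np

rng = np.random.default_rng(0)
din, dout = 3, 2

# Hypothetical Kronecker factors for one layer (A: input activations,
# S: pre-activation gradients), shifted to be positive definite.
A = rng.standard_normal((din, din)); A = A @ A.T + din * np.eye(din)
S = rng.standard_normal((dout, dout)); S = S @ S.T + dout * np.eye(dout)
G = rng.standard_normal((dout, din))   # weight gradient, same shape as W

vec = lambda X: X.ravel(order="F")     # column-major vectorization

# Naive: invert the full (din*dout) x (din*dout) Fisher approximation.
naive = np.linalg.solve(np.kron(A, S), vec(G))

# K-FAC: (A (x) S)^{-1} vec(G) = vec(S^{-1} G A^{-1}), only small inverses.
kfac = vec(np.linalg.solve(S, G) @ np.linalg.inv(A))

assert np.allclose(naive, kfac)
```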
2.2 An Alternative Interpretation of Natural Gradient
George et al. [2018] suggest an alternative way of interpreting the natural gradient update. Writing the eigendecomposition of the Fisher matrix as $F = Q \Lambda Q^\top$, the update $F^{-1} \nabla h = Q \Lambda^{-1} Q^\top \nabla h$ can be broken down into three stages:
The first stage ($Q^\top \nabla h$) projects the gradient vector onto the full Fisher eigenbasis $Q$. The next stage re-scales the coordinates in the full Fisher eigenbasis with the diagonal re-scaling factor $\Lambda^{-1}$. The last stage (multiplication by $Q$) projects back to the parameter coordinates.
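A minimal numerical sketch (with a random positive definite stand-in for the Fisher matrix) confirms that the three stages reproduce $F^{-1} \nabla h$:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4
F = rng.standard_normal((d, d)); F = F @ F.T + np.eye(d)  # stand-in Fisher
g = rng.standard_normal(d)                                 # gradient vector

lam, Q = np.linalg.eigh(F)   # F = Q diag(lam) Q^T

coords = Q.T @ g             # stage 1: project onto the Fisher eigenbasis
coords = coords / lam        # stage 2: re-scale by the inverse eigenvalues
nat_grad = Q @ coords        # stage 3: project back to parameter coordinates

assert np.allclose(nat_grad, np.linalg.solve(F, g))
```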
For a diagonal approximation of the Fisher matrix, the basis is chosen to be the identity matrix and the re-scaling factor is the second moment of the gradient vector. While estimating the diagonal factor is simple and efficient, obtaining an accurate eigenbasis is difficult. The crude basis of the diagonal approximation introduces a significant approximation error.
K-FAC decouples the Fisher matrix into $A$ and $S$. Since $A$ and $S$ are symmetric positive semi-definite matrices, they admit eigendecompositions $A = Q_A \Lambda_A Q_A^\top$ and $S = Q_S \Lambda_S Q_S^\top$, where $\Lambda_A$ and $\Lambda_S$ are diagonal matrices of eigenvalues. We use properties of the Kronecker product to further decompose the factorization:
$$A \otimes S = \left(Q_A \Lambda_A Q_A^\top\right) \otimes \left(Q_S \Lambda_S Q_S^\top\right) = \left(Q_A \otimes Q_S\right)\left(\Lambda_A \otimes \Lambda_S\right)\left(Q_A \otimes Q_S\right)^\top.$$
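This decomposition can be verified directly. The sketch below uses small random symmetric PSD stand-ins for $A$ and $S$ and checks that $Q_A \otimes Q_S$ and $\Lambda_A \otimes \Lambda_S$ eigendecompose $A \otimes S$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Small symmetric PSD stand-ins for the Kronecker factors A and S.
A = rng.standard_normal((3, 3)); A = A @ A.T
S = rng.standard_normal((2, 2)); S = S @ S.T

eA, QA = np.linalg.eigh(A)   # A = QA diag(eA) QA^T
eS, QS = np.linalg.eigh(S)   # S = QS diag(eS) QS^T

Q = np.kron(QA, QS)          # eigenbasis of A (x) S
lam = np.kron(eA, eS)        # eigenvalues of A (x) S

assert np.allclose(Q @ np.diag(lam) @ Q.T, np.kron(A, S))
```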
Based on the new interpretation, K-FAC has eigenbasis $Q_A \otimes Q_S$ and diagonal re-scaling factor $\Lambda_A \otimes \Lambda_S$. The K-FAC eigenbasis is provably a more accurate approximation of the full Fisher eigenbasis than the identity basis. However, K-FAC does not use the estimated variance along this basis: its re-scaling factor is constrained to the Kronecker structure.
2.3 Eigenvalue Corrected Kronecker-Factored Approximate Curvature
Eigenvalue corrected K-FAC (EK-FAC) [George et al., 2018] extends K-FAC to compute a more accurate diagonal re-scaling factor in the K-FAC eigenbasis. The K-FAC re-scaling factor $\Lambda_A \otimes \Lambda_S$ is expressed in $d_{\mathrm{in}} + d_{\mathrm{out}}$ degrees of freedom, where $d_{\mathrm{in}}$ and $d_{\mathrm{out}}$ are the input and output sizes of a layer. Because of this Kronecker structure, the K-FAC factorization above does not capture an accurate diagonal re-scaling factor in the K-FAC eigenbasis. Instead, EK-FAC computes the second moment of the gradient vector in the K-FAC eigenbasis. We define the re-scaling matrix $S^*$ as follows:
$$s^*_{ii} = \mathbb{E}\left[\left(\left(Q_A \otimes Q_S\right)^\top \mathrm{vec}(\nabla_W)\right)_i^2\right].$$
$S^*$ is a diagonal matrix whose entries are these second moments. The Fisher matrix can then be approximated with the K-FAC eigenbasis and the re-scaling matrix:
$$F \approx \left(Q_A \otimes Q_S\right) S^* \left(Q_A \otimes Q_S\right)^\top.$$
The EK-FAC re-scaling matrix minimizes the approximation error of the above equation in Frobenius norm. In comparison to the K-FAC approximation, the EK-FAC approximation is more flexible, representing the diagonal re-scaling factor with $d_{\mathrm{in}} \cdot d_{\mathrm{out}}$ degrees of freedom. Figure 2 illustrates the difference between K-FAC and EK-FAC.
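The Frobenius-optimality claim can be checked on simulated gradients. In this sketch (hypothetical random activations and pre-activation gradients; row-major vectorization, so the eigenbasis is $Q_S \otimes Q_A$), the second moments in the K-FAC eigenbasis never approximate the empirical Fisher worse than the Kronecker-structured eigenvalues do:

```python
import numpy as np

rng = np.random.default_rng(2)
din, dout, n = 3, 2, 500

# Hypothetical per-example inputs a and pre-activation gradients ds;
# the per-example weight gradient is the outer product ds a^T.
a = rng.standard_normal((n, din))
ds = rng.standard_normal((n, dout))
G = np.einsum("ni,nj->nij", ds, a).reshape(n, -1)  # row-major vec(ds a^T)

F = G.T @ G / n          # empirical Fisher
A = a.T @ a / n          # input activation covariance
S = ds.T @ ds / n        # pre-activation gradient covariance

eA, QA = np.linalg.eigh(A)
eS, QS = np.linalg.eigh(S)
Q = np.kron(QS, QA)      # K-FAC eigenbasis under row-major vec

kfac_scale = np.kron(eS, eA)                 # Kronecker-structured eigenvalues
ekfac_scale = np.mean((G @ Q) ** 2, axis=0)  # second moments in the eigenbasis

err_kfac = np.linalg.norm(F - Q @ np.diag(kfac_scale) @ Q.T)
err_ekfac = np.linalg.norm(F - Q @ np.diag(ekfac_scale) @ Q.T)

# EK-FAC's re-scaling is the Frobenius-optimal diagonal in this basis.
assert err_ekfac <= err_kfac + 1e-12
```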
3 Variational Bayesian Neural Networks
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, a Bayesian neural network (BNN) is composed of a log-likelihood $\log p(y \mid x, w)$ and a prior $p(w)$ on the weights. Performing inference in a BNN requires integrating over the intractable posterior distribution $p(w \mid \mathcal{D})$. Variational Bayesian methods [Hinton and Van Camp, 1993, Graves, 2011, Blundell et al., 2015] attempt to fit an approximate posterior $q(w)$ to maximize the evidence lower bound (ELBO):
$$\mathcal{L} = \mathbb{E}_{q(w)}\left[\log p(\mathcal{D} \mid w)\right] - \lambda\, \mathrm{KL}\left(q(w) \,\|\, p(w)\right),$$
where $\lambda$ is a regularization parameter and $\phi$ are the parameters of the variational posterior $q_\phi(w)$. Exact Bayesian inference corresponds to $\lambda = 1$, but $\lambda$ can be tuned in practical settings.
Bayes By Backprop (BBB) [Blundell et al., 2015] is the most common variational BNN training method. It uses a fully factorized Gaussian approximation to the posterior, i.e. $q(w) = \mathcal{N}\left(w; \mu, \mathrm{diag}(\sigma^2)\right)$. The variational parameters are updated according to stochastic gradients of the ELBO obtained by the reparameterization trick [Kingma and Welling, 2013].
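A minimal sketch of one BBB-style reparameterized sample (illustrative variable names and a softplus parameterization of $\sigma$, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Fully factorized Gaussian posterior q(w) = N(mu, diag(sigma^2)) over
# a single weight vector of dimension d.
d = 4
mu = np.zeros(d)
rho = np.full(d, -1.0)              # unconstrained; sigma = softplus(rho) > 0

sigma = np.log1p(np.exp(rho))       # softplus keeps sigma positive
eps = rng.standard_normal(d)        # noise independent of (mu, rho)
w = mu + sigma * eps                # reparameterized sample

# Gradients of a loss wrt w flow back to mu directly and to rho through
# sigma, giving unbiased stochastic gradients of the ELBO.
```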
There have been attempts to fit a matrix-variate Gaussian posterior for BNNs [Louizos and Welling, 2016, Sun et al., 2017]. Compared to overly restrictive variational families, a matrix-variate Gaussian effectively captures correlations between weights. However, computing the gradients and enforcing the positive semi-definite constraints for the row and column covariance matrices make inference challenging. Existing methods typically impose additional structure, such as diagonal covariance [Louizos and Welling, 2016] or products of Householder transformations [Sun et al., 2017], to ensure efficient updates.
3.1 Noisy Natural Gradient
Noisy natural gradient (NNG) is an efficient method for fitting multivariate Gaussian posteriors [Zhang et al., 2017] by adding adaptive weight noise to ordinary natural gradient updates. (Khan et al. [2018] also found the relationship between natural gradient and variational inference and derived VAdam by adding weight noise to Adam, which is similar to noisy Adam in Zhang et al. [2017].) Assuming $q(w) = \mathcal{N}(w; \mu, \Sigma)$ is a multivariate Gaussian posterior and $p(w) = \mathcal{N}(w; 0, \eta I)$ is a spherical Gaussian prior, the update rules are:
$$\mu \leftarrow \mu + \alpha\, \tilde{F}^{-1}\left(\nabla_w \log p(y \mid x, w) - \frac{\lambda}{N \eta}\, w\right), \qquad F \leftarrow (1 - \beta)\, F + \beta\, \nabla_w \log p(y \mid x, w)\, \nabla_w \log p(y \mid x, w)^\top,$$
where $\lambda$ is the KL weight and $\eta$ is the prior variance. In each iteration, NNG samples weights from the variational posterior $q(w)$, which is a multivariate Gaussian with the covariance matrix:
$$\Sigma = \frac{\lambda}{N}\left(F + \frac{\lambda}{N \eta} I\right)^{-1}.$$
However, due to computational intractability, it is necessary to impose a structured restriction on the covariance matrix. This is equivalent to imposing the same structure on the Fisher matrix.
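For intuition, here is a heavily simplified sketch of one fully factorized (diagonal-Fisher) NNG step. The hyperparameter values and the stand-in gradient are assumptions for illustration; this is not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(8)

# Assumed toy hyperparameters: dimension, dataset size, KL weight,
# prior variance, learning rate, and moving-average decay.
d, N, lam, eta, alpha, beta = 4, 100, 1.0, 1.0, 0.1, 0.9
mu = np.zeros(d)
f = np.ones(d)                        # running diagonal Fisher estimate

damping = lam / (N * eta)             # prior term acts as intrinsic damping
var = (lam / N) / (f + damping)       # posterior variance from the Fisher
w = mu + np.sqrt(var) * rng.standard_normal(d)   # sample the weights

grad = -w                             # stand-in for grad of log-likelihood
f = beta * f + (1 - beta) * grad**2   # moving average of squared gradients
mu = mu + alpha * (grad - damping * w) / (f + damping)
```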
3.2 Fitting Matrix-Variate Gaussian Posteriors with Noisy K-FAC
Noisy K-FAC is a tractable instance of NNG with a Kronecker-factored approximation to the Fisher. Because imposing a structured approximation on the covariance is equivalent to imposing the same structure on the Fisher matrix, noisy K-FAC enforces a Kronecker product structure on the covariance matrix, efficiently fitting a matrix-variate Gaussian posterior. The posterior covariance is given by
$$\Sigma = \frac{\lambda}{N}\left(A + \pi \sqrt{\tfrac{\lambda}{N\eta}}\, I\right)^{-1} \otimes \left(S + \tfrac{1}{\pi} \sqrt{\tfrac{\lambda}{N\eta}}\, I\right)^{-1},$$
where $\pi$ is a scalar constant introduced by Martens and Grosse [2015] in the context of factored damping to keep a compact representation of the Kronecker product. The pseudo-code for noisy K-FAC is given in Appendix D. In comparison to existing methods that fit MVG posteriors [Sun et al., 2017, Louizos and Welling, 2016], noisy K-FAC does not require additional approximations.
While a matrix-variate Gaussian posterior efficiently captures correlations between different weights, its diagonal variance in the K-FAC eigenbasis is not optimal: the K-FAC diagonal re-scaling factor $\Lambda_A \otimes \Lambda_S$ does not match the second moment along the associated eigenvectors.
4 Noisy EK-FAC

We develop a new tractable instance of noisy natural gradient that keeps track of the diagonal variance in the K-FAC eigenbasis, resulting in a more flexible posterior distribution. In the context of NNG, imposing a structural restriction on the Fisher matrix is equivalent to imposing the same restriction on the variational posterior. For example, noisy K-FAC imposes a Kronecker product structure on the covariance matrix, as shown above.
Given these insights, building a flexible variational posterior boils down to finding an improved approximation of the Fisher matrix. We adopt the EK-FAC method, which is provably a better approximation of the Fisher matrix than K-FAC. We term the new BNN training method noisy EK-FAC.
EK-FAC uses the eigenvalue corrected Kronecker-factored approximation to the Fisher matrix described above. For each layer, it estimates $A$, $S$, and $S^*$ online using exponential moving averages. Conveniently, this resembles the exponential moving average updates of the noisy natural gradient:
$$A \leftarrow (1 - \tilde{\beta})\, A + \tilde{\beta}\, a a^\top, \qquad S \leftarrow (1 - \tilde{\beta})\, S + \tilde{\beta}\, \nabla_s \nabla_s^\top, \qquad s^*_{ii} \leftarrow (1 - \beta)\, s^*_{ii} + \beta \left(\left(Q_A \otimes Q_S\right)^\top \mathrm{vec}(\nabla_W)\right)_i^2,$$
where $\tilde{\beta}$ is the learning rate for the Kronecker factors and $\beta$ is the learning rate for the diagonal re-scaling factor.
We introduce an eigenvalue corrected matrix-variate Gaussian (EMVG) posterior, shown in Figure 1. An EMVG is a generalization of a multivariate Gaussian distribution with the following form:
$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\; \left(Q_A \otimes Q_S\right) S^* \left(Q_A \otimes Q_S\right)^\top\right).$$
An EMVG posterior is potentially powerful because it not only compactly represents covariances between weights but also computes a full diagonal variance in the K-FAC eigenbasis. Applying the EK-FAC approximation to the NNG posterior covariance yields an EMVG posterior. We therefore factorize the covariance matrix in the same way EK-FAC approximates the Fisher matrix:
$$\Sigma = \frac{\lambda}{N} \left(Q_A \otimes Q_S\right) \left(S^* + \frac{\lambda}{N \eta} I\right)^{-1} \left(Q_A \otimes Q_S\right)^\top,$$
where $\frac{\lambda}{N \eta} I$ is an intrinsic damping term. Since damping does not affect the K-FAC eigenbasis, we represent the damping term explicitly in the re-scaling matrix. In practice, it may be advantageous to add extrinsic damping to the re-scaling matrix for a stable training process.
The only difference from standard EK-FAC is that the weights are sampled from the variational posterior $q(w)$. We can interpret noisy EK-FAC as follows: $M$ is a point estimate of the weights, and $\Sigma$ is the covariance of the correlated Gaussian noise added for each training example. The full algorithm is described in Algorithm 1.
Inference is efficient because the covariance matrix is factorized into three small matrices $Q_A$, $Q_S$, and $S^*$. We can use the following identity to compute Kronecker-vector products efficiently: $\left(Q_A \otimes Q_S\right) \mathrm{vec}(X) = \mathrm{vec}\left(Q_S X Q_A^\top\right)$.
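The identity can be checked directly; the sketch below uses random orthogonal stand-ins for the eigenbasis factors and column-major vectorization, and confirms the Kronecker product never needs to be materialized:

```python
import numpy as np

rng = np.random.default_rng(4)
din, dout = 3, 2

# Random orthogonal stand-ins for the eigenbasis factors.
QA = np.linalg.qr(rng.standard_normal((din, din)))[0]
QS = np.linalg.qr(rng.standard_normal((dout, dout)))[0]
X = rng.standard_normal((dout, din))

vec = lambda M: M.ravel(order="F")   # column-major vectorization

# (QA (x) QS) vec(X) = vec(QS X QA^T): two small matmuls instead of one
# (din*dout) x (din*dout) matrix-vector product.
lhs = np.kron(QA, QS) @ vec(X)
rhs = vec(QS @ X @ QA.T)
assert np.allclose(lhs, rhs)
```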
5 Related Work
Variational inference was first applied to neural networks by Peterson [1987] and Hinton and Van Camp [1993]. Graves [2011] then proposed a practical method for variational inference with fully factorized Gaussian posteriors, using a simple (but biased) gradient estimator. Improving on this work, Blundell et al. [2015] proposed an unbiased gradient estimator using the reparameterization trick of Kingma and Welling [2013]. Several non-Gaussian variational posteriors have also been proposed, such as multiplicative normalizing flows [Louizos and Welling, 2017] and implicit distributions [Shi et al., 2018]. Neural networks with dropout have also been interpreted as BNNs [Gal and Ghahramani, 2016, Gal et al., 2017].
A few recent works explored structured covariance approximations by exploiting the relationship between natural gradient and variational inference. Both Zhang et al. [2017] and Khan et al. [2018] used a diagonal Fisher approximation in natural gradient variational inference, obtaining a fully factorized Gaussian posterior. Zhang et al. [2017] also proposed an interesting extension using K-FAC, which leads to a matrix-variate Gaussian posterior. Concurrently, Mishkin et al. [2018] adopted a "diagonal plus low-rank" approximation. This method shares the same spirit as this work; however, their low-rank approximation is computationally expensive and thus only applied to two-layer (shallow) neural networks.
6 Experiments

To empirically evaluate the proposed method, we test on two tasks, regression and classification, to investigate the following questions: (1) Does noisy EK-FAC improve prediction performance compared to existing methods? (2) Does it scale to large datasets and modern convolutional neural network architectures?
We evaluate our method on standard BNN benchmarks from the UCI collection [Dheeru and Karra Taniskidou, 2017], adopting the evaluation protocol of Hernández-Lobato and Adams [2015]. In particular, we place a Gamma prior on the precision of the Gaussian likelihood and include a Gamma posterior in the variational objective.
Table 1: Test RMSE and test log-likelihood on the UCI regression datasets for BBB, Noisy Adam, Noisy K-FAC, and Noisy EK-FAC.
We randomly split the data into training (90%) and test (10%) sets. To reduce randomness, we repeat the split 10 times, except for the two largest datasets: "Year" is split once and "Protein" 5 times. During training, we normalize the input features and training targets to zero mean and unit variance; we do not apply this normalization at test time. All experiments train a network with a single hidden layer of 50 units, except for the "Protein" and "Year" datasets, which use 100 hidden units. We use a batch size of 10 for smaller datasets, 100 for larger datasets, and 500 for the "Year" dataset. To stabilize training, we re-initialize the re-scaling matrix every 50 iterations with the K-FAC eigenvalues, which is equivalent to executing a K-FAC iteration; we found that the second moment of the gradient vector is unstable for small batches. We amortize the basis update to ensure the re-scaling matrix matches the eigenbasis. The learning rates are shared across all datasets and decayed by a factor of 0.1 for the second half of training.
We compare against Bayes By Backprop [Blundell et al., 2015], noisy Adam, and noisy K-FAC [Zhang et al., 2017], reporting root mean square error (RMSE) and log-likelihood on the test set. The results are summarized in Table 1: noisy EK-FAC yields a higher test log-likelihood than the other methods. The training curves are shown in Figure 3; noisy EK-FAC also achieves a higher ELBO than noisy K-FAC.
Table 2: Test accuracy on CIFAR-10 under different regularization settings: D, B, and D + B, where B denotes batch normalization.
To evaluate the scalability of the proposed method, we train a modified VGG16 [Simonyan and Zisserman, 2014] and test on the CIFAR-10 benchmark [Krizhevsky, 2009]. The modified VGG16 has half the number of hidden units in all layers. Similar to applying K-FAC to convolutional layers with Kronecker factors [Grosse and Martens, 2016], EK-FAC can be extended to convolutional layers. We compare against SGD (with momentum), K-FAC, and noisy K-FAC.
We use a batch size of 128 for all experiments. To reduce computational overhead, we amortize the covariance, inverse, and re-scaling updates; we observed that this amortization does not significantly impact per-iteration optimization performance. The remaining hyperparameters are set to 0.01 for both noisy K-FAC and noisy EK-FAC. We adopt batch normalization [Ioffe and Szegedy, 2015] and data augmentation, and we tune the regularization parameter $\lambda$ and the prior variance $\eta$. With data augmentation, we use a smaller regularization parameter: $\lambda$ is set to 0.1 without batch normalization and 1.0 with batch normalization.
The results are summarized in Table 2. Noisy EK-FAC achieves the highest test accuracy in all settings without introducing computational overhead. Without extra regularization tricks, noisy EK-FAC improves on noisy K-FAC by 1.55%.
7 Conclusion

In this paper, we introduced a modified training method for variational Bayesian neural networks. An eigenvalue corrected matrix-variate Gaussian extends a matrix-variate Gaussian to represent the posterior distribution with more flexibility: it not only efficiently captures correlations between weights but also computes an accurate diagonal variance in the K-FAC eigenbasis. In both regression and classification evaluations, noisy EK-FAC achieves a higher ELBO and test accuracy, demonstrating its effectiveness.
References

- Shun-ichi Amari. Neural learning in structured parameter spaces — natural Riemannian gradient. In Advances in Neural Information Processing Systems, pages 127–133, 1997.
- Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. 2016.
- Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and application to the nonlinear case. 2018.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
- Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581–3590, 2017.
- Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884, 2018.
- Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
- Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
- José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
- Geoffrey E. Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. arXiv preprint arXiv:1806.04854, 2018.
- Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.
- Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227, 2017.
- David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
- James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
- James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
- Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6246–6256, 2018.
- Radford M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
- Carsten Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.
- Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. In International Conference on Learning Representations, 2018.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292, 2017.
- Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
- Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.
Appendix A Matrix-Variate Gaussian
A matrix-variate Gaussian distribution models a multivariate Gaussian distribution for a matrix $W \in \mathbb{R}^{n \times m}$:
$$p(W \mid M, \Sigma_1, \Sigma_2) = \frac{\exp\left(-\frac{1}{2} \mathrm{tr}\left[\Sigma_2^{-1} (W - M)^\top \Sigma_1^{-1} (W - M)\right]\right)}{(2\pi)^{nm/2}\, |\Sigma_1|^{m/2}\, |\Sigma_2|^{n/2}},$$
where $M \in \mathbb{R}^{n \times m}$ is the mean, $\Sigma_1$ is the covariance among rows, and $\Sigma_2$ is the covariance among columns. Since $\Sigma_1$ and $\Sigma_2$ are covariance matrices, they are positive definite. Vectorization of $W$ yields a multivariate Gaussian distribution whose covariance matrix is a Kronecker product of the two factors: $\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M), \Sigma_2 \otimes \Sigma_1\right)$.
Appendix B Eigenvalue Corrected Matrix-Variate Gaussian
An eigenvalue corrected matrix-variate Gaussian is an extension of a matrix-variate Gaussian that represents the full diagonal variance in a Kronecker-factored eigenbasis:
$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\; \left(Q_A \otimes Q_S\right) S^* \left(Q_A \otimes Q_S\right)^\top\right),$$
where $S^*$ is the re-scaling matrix. Because covariance matrices are positive definite, the diagonal entries of $S^*$ are all positive. Similar to a matrix-variate Gaussian distribution, vectorization of $W$ yields a multivariate Gaussian distribution whose covariance is diagonal in a Kronecker-factored eigenbasis.
Sampling from an eigenvalue corrected matrix-variate Gaussian is also a special case of sampling from a multivariate Gaussian. Let $E$ be a matrix of independent samples from a standard Gaussian. Then
$$W = M + Q_S\, \mathrm{unvec}\left(s^{*\frac{1}{2}} \odot \mathrm{vec}(E)\right) Q_A^\top,$$
where $s^* = \mathrm{diag}(S^*)$, $\odot$ is element-wise multiplication, and unvec is the inverse of the vec operation.
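This sampling rule can be checked against naive multivariate Gaussian sampling with the same noise draw. The sketch below uses hypothetical random orthogonal factors and a random positive re-scaling vector:

```python
import numpy as np

rng = np.random.default_rng(5)
din, dout = 3, 2

# Random orthogonal stand-ins for the eigenbasis factors and a random
# positive diagonal re-scaling vector s.
QA = np.linalg.qr(rng.standard_normal((din, din)))[0]
QS = np.linalg.qr(rng.standard_normal((dout, dout)))[0]
s = rng.uniform(0.5, 2.0, size=din * dout)
M = rng.standard_normal((dout, din))       # mean
E = rng.standard_normal((dout, din))       # standard Gaussian noise

vec = lambda X: X.ravel(order="F")
unvec = lambda v: v.reshape(dout, din, order="F")

# Naive: transform vec(E) with the full covariance square root.
Q = np.kron(QA, QS)
w_naive = vec(M) + Q @ (np.sqrt(s) * vec(E))

# Efficient: the same sample via the Kronecker-vec identity.
W = M + QS @ unvec(np.sqrt(s) * vec(E)) @ QA.T
assert np.allclose(vec(W), w_naive)
```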
Appendix C Derivation of EK-FAC Update
Let $\nabla_W$ be the weight gradient, $A$ the covariance matrix of the input activations, and $S$ the covariance matrix of the output pre-activation gradients. $S^*$ is the diagonal re-scaling matrix in the K-FAC eigenbasis: $F \approx (Q_A \otimes Q_S)\, S^* (Q_A \otimes Q_S)^\top$. The EK-FAC update shown in Algorithm 1 follows from:
$$F^{-1}\,\mathrm{vec}(\nabla_W) = \left(Q_A \otimes Q_S\right) S^{*-1} \left(Q_A \otimes Q_S\right)^\top \mathrm{vec}(\nabla_W) = \mathrm{vec}\left(Q_S \left[\left(Q_S^\top \nabla_W Q_A\right) \oslash \bar{S}^*\right] Q_A^\top\right),$$
where $\bar{S}^* = \mathrm{unvec}\left(\mathrm{diag}(S^*)\right)$, $\oslash$ is element-wise division, and unvec is the inverse of the vec operation.
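The derivation can be verified numerically. This sketch (random orthogonal stand-ins for the eigenbasis, a random positive re-scaling matrix, column-major vectorization) checks the efficient update against explicitly applying the inverse of the approximate Fisher:

```python
import numpy as np

rng = np.random.default_rng(6)
din, dout = 3, 2

QA = np.linalg.qr(rng.standard_normal((din, din)))[0]
QS = np.linalg.qr(rng.standard_normal((dout, dout)))[0]
s = rng.uniform(0.5, 2.0, size=(dout, din))   # re-scaling, stored as a matrix
G = rng.standard_normal((dout, din))          # weight gradient

vec = lambda X: X.ravel(order="F")

# Naive: apply Q diag(s)^{-1} Q^T with the full Kronecker basis Q
# (orthogonal, so Q^{-1} = Q^T).
Q = np.kron(QA, QS)
naive = Q @ ((Q.T @ vec(G)) / vec(s))

# Efficient: project, divide element-wise, project back with small matmuls.
efficient = vec(QS @ ((QS.T @ G @ QA) / s) @ QA.T)

assert np.allclose(naive, efficient)
```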