Eigenvalue Corrected Noisy Natural Gradient

11/30/2018 ∙ by Juhan Bae, et al.

Variational Bayesian neural networks combine the flexibility of deep learning with Bayesian uncertainty estimation. However, inference procedures for flexible variational posteriors are computationally expensive. A recently proposed method, noisy natural gradient, fits expressive posteriors in a surprisingly simple way: it adds weight noise to regular natural gradient updates. Noisy K-FAC is an instance of noisy natural gradient that fits a matrix-variate Gaussian posterior with minor changes to ordinary K-FAC. Nevertheless, a matrix-variate Gaussian posterior does not capture an accurate diagonal variance. In this work, we extend noisy K-FAC to obtain a more flexible posterior distribution called eigenvalue corrected matrix-variate Gaussian. The proposed method computes the full diagonal re-scaling factor in the Kronecker-factored eigenbasis. Empirically, our approach consistently outperforms existing algorithms (e.g., noisy K-FAC) on regression and classification tasks.

1 Introduction

Building flexible and scalable uncertainty models [MacKay, 1992, Neal, 2012, Hinton and Van Camp, 1993] has long been a goal in Bayesian deep learning. Variational Bayesian neural networks [Graves, 2011, Blundell et al., 2015] are especially appealing because they combine the flexibility of deep learning with Bayesian uncertainty estimation. However, such models tend to impose overly restricted assumptions (e.g., fully-factorized) in approximating posterior distributions. There have been attempts to fit more expressive distributions [Louizos and Welling, 2016, Sun et al., 2017], but they are difficult to train due to strong and complicated posterior dependencies.

Figure 1: A cartoon illustration to describe the relationships of FFG, MVG, and EMVG.

Noisy natural gradient is a simple and efficient method to fit multivariate Gaussian posteriors [Zhang et al., 2017]. It adds adaptive weight noise to regular natural gradient updates. Noisy K-FAC is a practical algorithm in the family of noisy natural gradient [Zhang et al., 2017], which fits a matrix-variate Gaussian posterior (flexible posterior) with only minimal changes to ordinary K-FAC update [Martens and Grosse, 2015] (cheap inference). The update for noisy K-FAC closely resembles standard K-FAC update with correlated weight noise.

Nevertheless, we note that a matrix-variate Gaussian cannot capture an accurate diagonal variance. In this work, we build upon noisy K-FAC and Eigenvalue corrected Kronecker-factored Approximate Curvature (EK-FAC) [George et al., 2018] to improve the flexibility of the posterior distribution. We compute the diagonal variance not in parameter coordinates but in the K-FAC eigenbasis, which leads to a more expressive posterior distribution. The relationship is described in Figure 1. Using this insight, we introduce a modified training method for variational Bayesian neural networks called noisy EK-FAC.

2 Natural Gradient

Figure 2: For a layer with input size $n$ and output size $m$, the diagonal re-scaling factor in K-FAC has Kronecker structure with $n + m$ degrees of freedom. The diagonal re-scaling matrix in EK-FAC is the second moment of the gradient vector with $nm$ degrees of freedom.

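Natural gradient descent is a second-order optimization technique first proposed by Amari [1997]. It is classically motivated as a way of implementing steepest descent in the space of distributions instead of the space of parameters. The distance function for distribution space is the KL divergence on the model's predictive distribution: $D_{KL}(p_{w} \,\|\, p_{w + \Delta w}) \approx \frac{1}{2} \Delta w^\top F \Delta w$, where $F$ is the Fisher matrix.

$$F = \mathbb{E}\left[\nabla_w \log p(y|x,w)\, \nabla_w \log p(y|x,w)^\top\right] \tag{1}$$

Here the expectation is taken over inputs $x$ from the data and targets $y$ from the model's predictive distribution. This results in the preconditioned gradients $\tilde{\nabla}_w h = F^{-1} \nabla_w h$, where $h$ denotes the training objective. Natural gradient descent is invariant to smooth and invertible reparameterizations of the model [Martens, 2014].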
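The quadratic relationship between the KL divergence and the Fisher matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a three-class categorical distribution parameterized by logits (not a model from the paper), for which the Fisher matrix has the closed form diag(p) - p pᵀ. The logit values and the perturbation are arbitrary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between two categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

theta = np.array([0.5, -1.0, 2.0])           # logits (illustrative values)
p = softmax(theta)

# Fisher matrix w.r.t. the logits: the score is onehot(y) - p, so
# F = E_y[(onehot(y) - p)(onehot(y) - p)^T] = diag(p) - p p^T.
F = np.diag(p) - np.outer(p, p)

delta = 1e-2 * np.array([1.0, -2.0, 0.5])    # small parameter perturbation
kl_exact = kl(p, softmax(theta + delta))
kl_quad = 0.5 * delta @ F @ delta            # second-order approximation

print(f"exact KL: {kl_exact:.3e}, quadratic approximation: {kl_quad:.3e}")
```

For small perturbations the two numbers agree to leading order, which is why steepest descent under a KL constraint amounts to preconditioning by the inverse Fisher matrix.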

2.1 Kronecker-Factored Approximate Curvature

Modern neural networks contain millions of parameters, which makes storing and computing the inverse of the Fisher matrix impractical. Kronecker-Factored Approximate Curvature (K-FAC) [Martens and Grosse, 2015] uses Kronecker products to efficiently approximate the inverse Fisher matrix. (Extending this work, K-FAC was shown to be amenable to distributed computation [Ba et al., 2016] and to generalize as well as SGD [Zhang et al., 2018].)

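For a layer of a neural network whose input activations are $a \in \mathbb{R}^{n}$, weight matrix $W \in \mathbb{R}^{n \times m}$, and pre-activation output $s \in \mathbb{R}^{m}$, we can write $s = W^\top a$. The gradient of the objective $h$ with respect to the weight matrix is $\nabla_W h = a (\nabla_s h)^\top$. Assuming $a$ and $\nabla_s h$ are independent under the model's predictive distribution, K-FAC decouples the Fisher matrix $F$:

$$F = \mathbb{E}\left[\mathrm{vec}(\nabla_W h)\,\mathrm{vec}(\nabla_W h)^\top\right] = \mathbb{E}\left[\nabla_s h (\nabla_s h)^\top \otimes a a^\top\right] \approx \mathbb{E}\left[\nabla_s h (\nabla_s h)^\top\right] \otimes \mathbb{E}\left[a a^\top\right] = S \otimes A \tag{2}$$

where $A = \mathbb{E}[a a^\top]$ and $S = \mathbb{E}[\nabla_s h (\nabla_s h)^\top]$. Bernacchia et al. [2018] showed that the Kronecker approximation is exact for deep linear networks, justifying the validity of the above assumption. Further assuming between-layer independence, the Fisher matrix is approximated as a block-diagonal matrix consisting of layer-wise Fisher matrices. Decoupling $F$ into $A$ and $S$ avoids the memory cost of storing the full matrix $F$ while still allowing efficient inverse-Fisher-vector products:

$$F^{-1}\mathrm{vec}(\nabla_W h) = (S^{-1} \otimes A^{-1})\,\mathrm{vec}(\nabla_W h) = \mathrm{vec}\left(A^{-1}\, \nabla_W h\, S^{-1}\right) \tag{3}$$

As shown in equation (3), natural gradient descent with K-FAC only consists of a series of matrix multiplications comparable to the size of $W$. This enables efficient computation of the natural gradient.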
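The saving from the Kronecker structure can be illustrated with a short NumPy check. This is a sketch under assumptions: the factors here are random symmetric positive definite matrices standing in for the activation and gradient second moments, the sizes are arbitrary, and `vec` stacks columns (column-major order), which is the convention under which the gradient outer product equals a Kronecker product with the output-side factor first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                   # input and output sizes (illustrative)

def vec(X):
    """Stack the columns of X into one vector (column-major order)."""
    return X.reshape(-1, order="F")

def random_spd(d):
    """Random symmetric positive definite matrix standing in for a Kronecker factor."""
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

A = random_spd(n)                  # stand-in for the input-activation second moment
S = random_spd(m)                  # stand-in for the pre-activation-gradient second moment
V = rng.standard_normal((n, m))    # stand-in for the weight gradient

# Naive route: build the (nm x nm) Kronecker product and solve against it.
naive = np.linalg.solve(np.kron(S, A), vec(V))

# Kronecker-factored route: two small solves, never forming the big matrix.
# solve(A, V) gives A^{-1} V; the second solve applies S^{-1} on the right (S is symmetric).
fast = np.linalg.solve(S, np.linalg.solve(A, V).T).T

print(np.allclose(naive, vec(fast)))
```

The naive route costs on the order of (nm)³ operations, while the factored route only needs solves against an n×n and an m×m matrix.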

2.2 An Alternative Interpretation of Natural Gradient

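George et al. [2018] suggest an alternative way of interpreting the natural gradient update. It can be broken down into three stages:

$$F^{-1} \nabla_w h = \underbrace{Q\, \underbrace{\Lambda^{-1}\, \underbrace{Q^\top \nabla_w h}_{\text{(i)}}}_{\text{(ii)}}}_{\text{(iii)}} \tag{4}$$

where $F = Q \Lambda Q^\top$ is the eigendecomposition of the Fisher matrix. The first stage (i) projects the gradient vector to the full Fisher eigenbasis $Q$. The next stage (ii) re-scales the coordinates in the full Fisher eigenbasis with the diagonal re-scaling factor $\Lambda$. The last stage (iii) projects back to the parameter coordinates.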

For a diagonal approximation of the Fisher matrix, the basis is chosen to be the identity matrix $I$ and the re-scaling factor $\Lambda_{ii} = \mathbb{E}[(\nabla_w h)_i^2]$ is the second moment of the gradient vector. While estimating the diagonal factor is simple and efficient, obtaining an accurate eigenbasis is difficult. The crude basis in the diagonal Fisher introduces a significant approximation error.

K-FAC decouples the Fisher matrix into $S$ and $A$. Since $S$ and $A$ are symmetric positive semi-definite matrices, by eigendecomposition, they can be represented as $S = Q_S \Lambda_S Q_S^\top$ and $A = Q_A \Lambda_A Q_A^\top$, where $Q$ is an orthogonal matrix whose columns are eigenvectors and $\Lambda$ is a diagonal matrix with eigenvalues. We use properties of the Kronecker product to further decompose the factorization:

$$S \otimes A = Q_S \Lambda_S Q_S^\top \otimes Q_A \Lambda_A Q_A^\top = (Q_S \otimes Q_A)(\Lambda_S \otimes \Lambda_A)(Q_S \otimes Q_A)^\top \tag{5}$$

Based on the new interpretation, we have the K-FAC eigenbasis $Q_S \otimes Q_A$ and the diagonal re-scaling factor $\Lambda_S \otimes \Lambda_A$. The K-FAC eigenbasis is provably a more accurate approximation of the full Fisher eigenbasis. However, it does not use the estimated variance along the basis. The re-scaling factor in K-FAC is constrained to the Kronecker structure.
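The Kronecker eigendecomposition can be verified directly with `numpy.linalg.eigh`. The following is a small sketch with random symmetric positive definite stand-ins for the two factors (the sizes and the helper `random_spd` are illustrative assumptions, not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                   # illustrative layer sizes

def random_spd(d):
    """Random symmetric positive definite matrix standing in for a Kronecker factor."""
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

A, S = random_spd(n), random_spd(m)

# Eigendecompose the two small factors separately.
lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)

Q = np.kron(Q_S, Q_A)          # Kronecker-factored eigenbasis, size (nm x nm)
lam = np.kron(lam_S, lam_A)    # diagonal of the Kronecker-structured re-scaling factor

recon = Q @ np.diag(lam) @ Q.T
print(np.allclose(recon, np.kron(S, A)))      # the factorization reproduces S ⊗ A
print(np.allclose(Q.T @ Q, np.eye(n * m)))    # the basis is orthogonal
```

Note that `lam` has `n * m` entries but is fully determined by the `n + m` eigenvalues of the two factors, which is the Kronecker constraint on the re-scaling factor discussed above.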

2.3 Eigenvalue Corrected Kronecker-Factored Approximate Curvature

Eigenvalue corrected K-FAC (EK-FAC) [George et al., 2018] extends K-FAC to compute a more accurate diagonal re-scaling factor in the K-FAC eigenbasis. The re-scaling factor for K-FAC is expressed in $n + m$ degrees of freedom, where $n$ and $m$ are the input and output sizes of a layer. The K-FAC factorization in equation (5) does not capture an accurate diagonal re-scaling factor in the K-FAC eigenbasis because of the Kronecker structure. Instead, EK-FAC computes the second moment of the gradient vector in the K-FAC eigenbasis. With $Q_S$ and $Q_A$ the eigenvector matrices of the Kronecker factors $S$ and $A$, we define the re-scaling matrix $R \in \mathbb{R}^{nm \times nm}$ as follows:

$$R_{ii} = \mathbb{E}\left[\left((Q_S \otimes Q_A)^\top \nabla_w h\right)_i^2\right] \tag{6}$$

$R$ is a diagonal matrix whose entries are the second moments. The Fisher matrix can be approximated with the K-FAC eigenbasis and the re-scaling matrix:

$$F \approx (Q_S \otimes Q_A)\, R\, (Q_S \otimes Q_A)^\top \tag{7}$$

Among all diagonal re-scaling matrices, the EK-FAC re-scaling matrix $R$ minimizes the approximation error of the above equation in Frobenius norm. In comparison to the K-FAC approximation, the EK-FAC approximation is more flexible: it represents the diagonal re-scaling factor with $nm$ degrees of freedom. Figure 2 illustrates the difference between K-FAC and EK-FAC.
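The Frobenius-norm claim can be illustrated on synthetic data. The sketch below makes several assumptions that are not from the paper: the "Fisher" is the empirical second moment of synthetic per-example gradients, the activations `a` and back-propagated gradients `g` are random, and `g` is deliberately made dependent on `a` so that the Kronecker independence assumption is violated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 3, 500                           # layer sizes and sample count (illustrative)

a = rng.standard_normal((N, n))               # synthetic input activations
# Synthetic pre-activation gradients whose scale depends on the input.
g = rng.standard_normal((N, m)) * (1.0 + np.abs(a[:, :1]))

# Per-example weight gradient a g^T, vectorized column-major: vec(a g^T) = g ⊗ a.
grads = np.stack([np.kron(g_i, a_i) for a_i, g_i in zip(a, g)])
F = grads.T @ grads / N                       # empirical second moment of the gradient

A = a.T @ a / N
S = g.T @ g / N
_, Q_A = np.linalg.eigh(A)
_, Q_S = np.linalg.eigh(S)
Q = np.kron(Q_S, Q_A)                         # K-FAC eigenbasis

# Diagonal re-scaling: second moment of the gradient projected onto each eigenvector.
r = np.mean((grads @ Q) ** 2, axis=0)

F_kfac = np.kron(S, A)
F_ekfac = (Q * r) @ Q.T                       # Q diag(r) Q^T

err_kfac = np.linalg.norm(F - F_kfac)         # Frobenius norm by default
err_ekfac = np.linalg.norm(F - F_ekfac)
print(f"K-FAC error: {err_kfac:.4f}, EK-FAC error: {err_ekfac:.4f}")
```

Because both approximations share the same eigenbasis and `r` equals the diagonal of `Q.T @ F @ Q`, the eigenvalue corrected error can never exceed the K-FAC error for the same matrix `F`.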

3 Variational Bayesian Neural Networks

Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, a Bayesian Neural Network (BNN) is composed of a log-likelihood $\log p(\mathcal{D}|w)$ and a prior $p(w)$ on the weights. Performing inference on a BNN requires integrating over the intractable posterior distribution $p(w|\mathcal{D})$. Variational Bayesian methods [Hinton and Van Camp, 1993, Graves, 2011, Blundell et al., 2015] attempt to fit an approximate posterior $q_\phi(w)$ to maximize the evidence lower bound (ELBO):

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}\left[\log p(\mathcal{D}|w)\right] - \lambda\, D_{KL}\left(q_\phi(w) \,\|\, p(w)\right) \tag{8}$$

where $\lambda$ is a regularization parameter and $\phi$ are parameters of the variational posterior. Exact Bayesian inference uses $\lambda = 1$, but it can be tuned in practical settings.

Bayes By Backprop (BBB) [Blundell et al., 2015] is the most common variational BNN training method. It uses a fully-factorized Gaussian approximation to the posterior, i.e. $q(w) = \mathcal{N}(w; \mu, \mathrm{diag}(\sigma^2))$. The variational parameters $\phi = (\mu, \sigma^2)$ are updated according to stochastic gradients of $\mathcal{L}$ obtained by the reparameterization trick [Kingma and Welling, 2013].
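To make the objective concrete, here is a minimal sketch of a Monte Carlo ELBO estimate for a fully factorized Gaussian posterior. It rests on assumptions chosen only for illustration: a synthetic linear regression problem with known noise variance, a zero-mean spherical Gaussian prior with variance `eta`, and hypothetical helper names (`log_lik`, `kl_to_prior`, `elbo_estimate`). It shows the reparameterized sampling and the closed-form KL term, not the full BBB training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3                                  # synthetic dataset size and dimension
X = rng.standard_normal((N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(N)

def log_lik(w, noise_var=0.01):
    """Gaussian log-likelihood log p(D | w) of the whole dataset."""
    resid = y - X @ w
    return -0.5 * np.sum(resid ** 2) / noise_var - 0.5 * N * np.log(2 * np.pi * noise_var)

def kl_to_prior(mu, sigma2, eta):
    """KL( N(mu, diag(sigma2)) || N(0, eta I) ) in closed form."""
    return 0.5 * np.sum(sigma2 / eta + mu ** 2 / eta - 1.0 - np.log(sigma2 / eta))

def elbo_estimate(mu, sigma2, rng, lam=1.0, eta=1.0, num_samples=64):
    """Monte Carlo ELBO using the reparameterization w = mu + sigma * eps."""
    eps = rng.standard_normal((num_samples, mu.size))
    w_samples = mu + np.sqrt(sigma2) * eps
    expected_ll = np.mean([log_lik(w) for w in w_samples])
    return expected_ll - lam * kl_to_prior(mu, sigma2, eta)

sigma2 = np.full(d, 1e-3)
print(elbo_estimate(w_true, sigma2, rng), elbo_estimate(np.zeros(d), sigma2, rng))
```

Because the samples are a deterministic function of `mu`, `sigma2`, and the noise `eps`, an automatic differentiation framework could differentiate this estimate with respect to the variational parameters; the NumPy version above only evaluates it.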

There have been attempts to fit a matrix-variate Gaussian posterior for BNNs [Louizos and Welling, 2016, Sun et al., 2017]. Compared to overly restricted variational families, a matrix-variate Gaussian effectively captures correlations between weights. However, computing the gradients and enforcing the positive semi-definite constraint for the row and column covariances $\Sigma_1$ and $\Sigma_2$ make the inference challenging. Existing methods typically impose additional structures such as diagonal covariance [Louizos and Welling, 2016] or products of Householder transformations [Sun et al., 2017] to ensure efficient updates.

3.1 Noisy Natural Gradient

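Noisy natural gradient (NNG) is an efficient method to fit multivariate Gaussian posteriors [Zhang et al., 2017] by adding adaptive weight noise to ordinary natural gradient updates. (Khan et al. [2018] also found the relationship between natural gradient and variational inference and derived VAdam by adding weight noise to Adam, which is similar to noisy Adam in Zhang et al. [2017].) Assuming $q(w)$ is a multivariate Gaussian posterior parameterized by $\phi = (\mu, \Sigma)$ and $p(w)$ is a spherical Gaussian, the update rules are:

$$\bar{F} \leftarrow (1 - \beta)\,\bar{F} + \beta\, \nabla_w \log p(y|x,w)\, \nabla_w \log p(y|x,w)^\top$$

$$\mu \leftarrow \mu + \alpha \left(\bar{F} + \frac{\lambda}{N\eta} I\right)^{-1} \left[\nabla_w \log p(y|x,w) - \frac{\lambda}{N\eta}\, w\right] \tag{9}$$

where $\lambda$ is the KL weight, $\eta$ is the prior variance, $N$ is the number of training examples, $\alpha$ is the step size, and $\beta$ is the moving-average rate. In each iteration, NNG samples weights from the variational posterior $q(w)$, which is a multivariate Gaussian with the covariance matrix:

$$\Sigma = \frac{\lambda}{N} \left(\bar{F} + \frac{\lambda}{N\eta} I\right)^{-1} \tag{10}$$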

However, due to computational intractability, it is necessary to impose a structured restriction to the covariance matrix. This is equivalent to imposing the same structure to the Fisher matrix.
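Before any structure is imposed, the full-covariance procedure can be sketched on a problem small enough to store the whole Fisher matrix. Everything specific here is an assumption made for illustration: the update form follows the standard noisy natural gradient rule of Zhang et al. [2017] (a moving average of the Fisher, a damped natural gradient step on the mean, and a covariance of $\frac{\lambda}{N}(\bar{F} + \frac{\lambda}{N\eta}I)^{-1}$), the task is a synthetic three-dimensional linear regression with known noise variance, and the hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, noise_var = 200, 3, 0.25                # synthetic regression problem
X = rng.standard_normal((N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + np.sqrt(noise_var) * rng.standard_normal(N)

alpha, beta = 0.1, 0.05      # step size and moving-average rate (illustrative)
lam, eta = 1.0, 1.0          # KL weight and prior variance
gamma = lam / (N * eta)      # damping induced by the spherical Gaussian prior

mu = np.zeros(d)             # posterior mean
F_bar = np.eye(d)            # running Fisher estimate
for _ in range(500):
    # Posterior covariance implied by the current Fisher estimate.
    Sigma = (lam / N) * np.linalg.inv(F_bar + gamma * np.eye(d))
    # Sample weights from the posterior: this is the "noise" in the method.
    w = mu + np.linalg.cholesky(Sigma) @ rng.standard_normal(d)

    idx = rng.choice(N, size=32, replace=False)
    xb, yb = X[idx], y[idx]
    # Mini-batch average of the log-likelihood gradient (Gaussian likelihood).
    grad = xb.T @ (yb - xb @ w) / (noise_var * len(idx))
    # Fisher estimate from per-example gradients with targets sampled from the model.
    y_model = xb @ w + np.sqrt(noise_var) * rng.standard_normal(len(idx))
    G = xb * ((y_model - xb @ w) / noise_var)[:, None]
    F_bar = (1 - beta) * F_bar + beta * (G.T @ G) / len(idx)

    # Damped natural gradient step on the posterior mean.
    mu = mu + alpha * np.linalg.solve(F_bar + gamma * np.eye(d), grad - gamma * w)

print("posterior mean:", mu)
```

For a real network the `d x d` matrix `F_bar` is far too large to store or invert, which is what motivates the structured approximations discussed next.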

3.2 Fitting Matrix-Variate Gaussian Posteriors with Noisy K-FAC

Noisy K-FAC is a tractable instance of NNG with a Kronecker-factored approximation to the Fisher. Because imposing a structured approximation on the covariance is equivalent to imposing the same structure on the Fisher matrix, noisy K-FAC enforces a Kronecker product structure on the covariance matrix. It efficiently fits the matrix-variate Gaussian posterior. The posterior covariance is given by

$$\Sigma = \frac{\lambda}{N}\, [S^{\gamma}]^{-1} \otimes [A^{\gamma}]^{-1} = \frac{\lambda}{N} \left(S + \frac{1}{\pi}\sqrt{\frac{\lambda}{N\eta}}\, I\right)^{-1} \otimes \left(A + \pi \sqrt{\frac{\lambda}{N\eta}}\, I\right)^{-1} \tag{11}$$

where $\pi$ is a scalar constant introduced by Martens and Grosse [2015] in the context of damping to keep a compact representation of the Kronecker product. The pseudo-code for noisy K-FAC is given in Appendix D. In comparison to existing methods that fit MVG posteriors [Sun et al., 2017, Louizos and Welling, 2016], noisy K-FAC does not assume additional approximations.

4 Methods

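Require: $\alpha$: stepsize
Require: $\beta$: exponential moving average parameter for covariance factors
Require: $\omega$: exponential moving average parameter for re-scaling factor
Require: $\lambda, \eta, \gamma_{ex}$: KL weighting, prior variance, extrinsic damping term
Require: $T_{stats}, T_{eig}, T_{scale}$: stats, eigendecomposition, and re-scaling update intervals.
  $k \leftarrow 0$ and initialize $\{\mu_l\}, \{S_l\}, \{A_l\}, \{R_l\}$
  Calculate the intrinsic damping term $\gamma_{in} = \frac{\lambda}{N\eta}$, total damping term $\gamma = \gamma_{in} + \gamma_{ex}$
  while stopping criterion not met do
     $k \leftarrow k + 1$
     Sample $w_l \sim \mathcal{N}(\mu_l, \Sigma_l)$, with $\Sigma_l$ the EMVG covariance of eq. (14)
     if $k \equiv 0$ (mod $T_{stats}$) then
        Update the covariance factors $\{S_l\}, \{A_l\}$ using eq. (12)
     end if
     if $k \equiv 0$ (mod $T_{scale}$) then
        Update the re-scaling factor $\{R_l\}$ using eq. (12)
     end if
     if $k \equiv 0$ (mod $T_{eig}$) then
        Compute the eigenbasis $Q_{S_l}$ and $Q_{A_l}$ using eq. (5).
     end if
     $V_l = \nabla_{W_l} \log p(y|x,w) - \gamma_{in} \cdot W_l$
     $M_l \leftarrow M_l + \alpha\, Q_{A_l} \left[ (Q_{A_l}^\top V_l\, Q_{S_l}) \oslash \mathrm{unvec}(r_l^{\gamma}) \right] Q_{S_l}^\top$ {Derivation is shown in Appendix C}
  end while
Algorithm 1 Noisy EK-FAC. A subscript $l$ denotes the index of a layer, $w_l = \mathrm{vec}(W_l)$, and $\mu_l = \mathrm{vec}(M_l)$. We assume zero momentum for simplicity. $\oslash$ denotes element-wise division and $r_l^{\gamma} = \mathrm{diag}(R_l + \gamma I)$. unvec is an inverse of the vec operation, i.e. if $v = \mathrm{vec}(V)$ then $V = \mathrm{unvec}(v)$. Differences from standard EK-FAC are shown in blue in the original typeset version.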
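Two steps of the algorithm, the re-scaling update and the mean update, can be sketched for a single layer in NumPy without ever forming an nm×nm matrix. This is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the re-scaling diagonal is stored as an n×m array (its unvec form), the eigenbases are random orthogonal matrices, and the activations `a` and back-propagated gradients `g` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, B = 4, 3, 16           # input size, output size, batch size (illustrative)

def vec(X):
    """Stack the columns of X into one vector (column-major order)."""
    return X.reshape(-1, order="F")

def random_orthogonal(d):
    """Random orthogonal matrix standing in for an eigenbasis."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def update_rescaling(R, a, g, Q_A, Q_S, omega):
    """Moving average of squared per-example gradients in the K-FAC eigenbasis.

    The per-example gradient a_i g_i^T projects to outer(Q_A^T a_i, Q_S^T g_i),
    so its element-wise square is an outer product of squared projections.
    """
    pa, pg = a @ Q_A, g @ Q_S
    second_moment = (pa ** 2).T @ (pg ** 2) / a.shape[0]
    return (1 - omega) * R + omega * second_moment

def ekfac_mean_update(M, V, Q_A, Q_S, R, alpha, gamma):
    """Project V to the eigenbasis, divide by the damped re-scaling, project back."""
    return M + alpha * Q_A @ ((Q_A.T @ V @ Q_S) / (R + gamma)) @ Q_S.T

Q_A, Q_S = random_orthogonal(n), random_orthogonal(m)
R = np.ones((n, m))                      # unvec of the re-scaling diagonal
M = rng.standard_normal((n, m))          # posterior mean of the layer weights
a = rng.standard_normal((B, n))          # synthetic input activations
g = rng.standard_normal((B, m))          # synthetic log-likelihood gradients w.r.t. pre-activations
gamma_in, gamma, alpha, omega = 0.01, 0.02, 0.1, 0.05

R = update_rescaling(R, a, g, Q_A, Q_S, omega)
V = a.T @ g / B - gamma_in * M           # gradient minus intrinsic damping term (evaluated at W = M here)
M_new = ekfac_mean_update(M, V, Q_A, Q_S, R, alpha, gamma)
```

In a full implementation `V` would be evaluated at weights sampled from the posterior rather than at the mean, and the eigenbases would be refreshed from the running covariance factors at their own interval.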

While the matrix-variate Gaussian posterior efficiently captures correlations between different weights, the diagonal variance in the K-FAC eigenbasis is not optimal. The K-FAC diagonal re-scaling factor $\Lambda_S \otimes \Lambda_A$ does not match the second moment of the gradient along the associated eigenvectors, $\mathbb{E}[((Q_S \otimes Q_A)^\top \nabla_w h)_i^2]$.

We develop a new tractable instance of noisy natural gradient. It keeps track of the diagonal variance in K-FAC eigenbasis, resulting in a more flexible posterior distribution. In the context of NNG, imposing a structural restriction to the Fisher matrix is equivalent to imposing the same restriction to the variational posterior. For example, noisy K-FAC imposes a Kronecker product structure to the covariance matrix as shown in equation (11).

Given these insights, building a flexible variational posterior boils down to finding an improved approximation of the Fisher matrix. We adopt EK-FAC method, which is provably a better approximation of the Fisher matrix than K-FAC. We term the new BNN training method noisy EK-FAC.

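EK-FAC uses the eigenvalue corrected Kronecker-factored approximation to the Fisher matrix described in equation (7). For each layer, it estimates $A$, $S$, and $R$ online using exponential moving averages. Conveniently, this resembles the exponential moving average updates for the noisy natural gradient in equation (9).

$$A \leftarrow (1 - \beta)\, A + \beta\, a a^\top$$

$$S \leftarrow (1 - \beta)\, S + \beta\, \nabla_s \log p(y|x,w)\, \nabla_s \log p(y|x,w)^\top$$

$$R_{ii} \leftarrow (1 - \omega)\, R_{ii} + \omega \left[(Q_S \otimes Q_A)^\top \nabla_w \log p(y|x,w)\right]_i^2 \tag{12}$$

where $\beta$ is the learning rate for the Kronecker factors and $\omega$ is the learning rate for the diagonal re-scaling factor.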

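We introduce an eigenvalue corrected matrix-variate Gaussian (EMVG) posterior, shown in Figure 1. An EMVG generalizes the matrix-variate Gaussian and is itself a structured multivariate Gaussian distribution with the following form:

$$\mathcal{EMN}(W; M, \Sigma_1, \Sigma_2, R) = \mathcal{N}\left(\mathrm{vec}(W);\, \mathrm{vec}(M),\, (Q_{\Sigma_2} \otimes Q_{\Sigma_1})\, R\, (Q_{\Sigma_2} \otimes Q_{\Sigma_1})^\top\right) \tag{13}$$

where $Q_{\Sigma_1}$ and $Q_{\Sigma_2}$ are the eigenvector matrices of the row and column covariance factors $\Sigma_1$ and $\Sigma_2$, and $R$ is a diagonal re-scaling matrix.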

An EMVG posterior is potentially powerful because it not only compactly represents covariances between weights but also computes a full diagonal variance in the K-FAC eigenbasis. Applying the EK-FAC approximation to equation (10) yields an EMVG posterior. Therefore, we factorize the covariance matrix in the same sense that EK-FAC approximates the Fisher matrix:

$$\Sigma = \frac{\lambda}{N}\, (Q_S \otimes Q_A) \left(R + \frac{\lambda}{N\eta} I\right)^{-1} (Q_S \otimes Q_A)^\top \tag{14}$$

where $\frac{\lambda}{N\eta}$ is an intrinsic damping term. Since the damping does not affect the K-FAC eigenbasis, we explicitly represent the damping term in the re-scaling matrix. In practice, it may be advantageous to add extrinsic damping to the re-scaling matrix to stabilize training.

The only difference from standard EK-FAC is that the weights are sampled from the variational posterior $q$. We can interpret noisy EK-FAC in the sense that $\mu$ is a point estimate of the weights and $\Sigma$ is the covariance of correlated Gaussian noise for each training example. The full algorithm is described in Algorithm 1.

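The inference is efficient because the covariance matrix is factorized into three small matrices $Q_A$, $Q_S$, and $R$ (only the $nm$ diagonal entries of $R$ need to be stored). We can use the following identity to compute Kronecker products efficiently: $(B \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(C X B^\top)$ for matrices $B$, $C$, and $X$ of compatible sizes.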

5 Related Work

Variational inference was first applied to neural networks by Peterson [1987] and Hinton and Van Camp [1993]. Then, Graves [2011] proposed a practical method for variational inference with fully factorized Gaussian posteriors which uses a simple (but biased) gradient estimator. Improving on this work, Blundell et al. [2015] proposed an unbiased gradient estimator using the reparameterization trick of Kingma and Welling [2013]. Several non-Gaussian variational posteriors have also been proposed such as Multiplicative Normalizing Flows [Louizos and Welling, 2017] and implicit distributions [Shi et al., 2018]. Neural networks with dropout were also interpreted as BNNs [Gal and Ghahramani, 2016, Gal et al., 2017].

A few recent works explored structured covariance approximations by exploiting the relationship between natural gradient and variational inference. Both Zhang et al. [2017] and Khan et al. [2018] used a diagonal Fisher approximation in natural gradient VI, obtaining a fully factorized Gaussian posterior. Zhang et al. [2017] also proposed an interesting extension by using K-FAC, which leads to a matrix-variate Gaussian posterior. Concurrently, Mishkin et al. [2018] adopted a "diagonal plus low-rank" approximation. This method shares the same spirit as this work. However, their low rank approximation is computationally expensive and thus only applied to two-layer (shallow) neural networks.

6 Experiments

To empirically evaluate the proposed method, we test under two scenarios, regression and classification, to investigate the following questions. (1) Does noisy EK-FAC have improved prediction performance compared to existing methods? (2) Is it able to scale to large datasets and modern convolutional neural network architectures?

6.1 Regression

Figure 3: Training curves for all three methods. Note that BBB, noisy K-FAC, and noisy EK-FAC use FFG, MVG, and EMVG posteriors, respectively. EMVG has the most flexible variational posterior distribution.

We evaluate our method on standard BNN benchmarks from UCI collections [Dheeru and Karra Taniskidou, 2017], adopting evaluation protocols from Hernández-Lobato and Adams [2015]. In particular, we introduce a Gamma prior for the precision of Gaussian likelihood and include a Gamma posterior into the variational objective.

Test RMSE
Dataset | BBB | Noisy Adam | Noisy K-FAC | Noisy EK-FAC
Boston | 3.171 ± 0.149 | 3.031 ± 0.155 | 2.742 ± 0.015 | 2.527 ± 0.158
Concrete | 5.678 ± 0.087 | 5.613 ± 0.113 | 5.019 ± 0.127 | 4.880 ± 0.120
Energy | 0.565 ± 0.018 | 0.839 ± 0.046 | 0.485 ± 0.019 | 0.497 ± 0.023
Kin8nm | 0.080 ± 0.001 | 0.079 ± 0.001 | 0.076 ± 0.001 | 0.076 ± 0.000
Naval | 0.000 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000
Pow. Plant | 4.023 ± 0.036 | 4.002 ± 0.039 | 3.886 ± 0.041 | 3.895 ± 0.053
Protein | 4.321 ± 0.017 | 4.380 ± 0.016 | 4.097 ± 0.009 | 4.042 ± 0.027
Wine | 0.643 ± 0.012 | 0.644 ± 0.011 | 0.637 ± 0.011 | 0.635 ± 0.013
Yacht | 1.174 ± 0.086 | 1.289 ± 0.069 | 0.979 ± 0.077 | 0.974 ± 0.116
Year | 9.076 ± NA | 9.071 ± NA | 8.885 ± NA | 8.642 ± NA

Test log-likelihood
Dataset | BBB | Noisy Adam | Noisy K-FAC | Noisy EK-FAC
Boston | -2.602 ± 0.031 | -2.558 ± 0.032 | -2.409 ± 0.047 | -2.378 ± 0.044
Concrete | -3.149 ± 0.018 | -3.145 ± 0.023 | -3.039 ± 0.025 | -3.002 ± 0.025
Energy | -1.500 ± 0.006 | -1.629 ± 0.020 | -1.421 ± 0.004 | -1.448 ± 0.004
Kin8nm | 1.111 ± 0.007 | 1.112 ± 0.008 | 1.148 ± 0.007 | 1.149 ± 0.012
Naval | 6.143 ± 0.032 | 6.231 ± 0.041 | 7.079 ± 0.034 | 7.287 ± 0.002
Pow. Plant | -2.807 ± 0.010 | -2.803 ± 0.010 | -2.776 ± 0.012 | -2.774 ± 0.012
Protein | -2.882 ± 0.004 | -2.896 ± 0.004 | -2.836 ± 0.002 | -2.819 ± 0.007
Wine | -0.977 ± 0.017 | -0.976 ± 0.016 | -0.969 ± 0.014 | -0.964 ± 0.002
Yacht | -2.408 ± 0.007 | -2.412 ± 0.006 | -2.316 ± 0.006 | -2.224 ± 0.007
Year | -3.614 ± NA | -3.620 ± NA | -3.595 ± NA | -3.573 ± NA

Table 1: Average RMSE and log-likelihood in test data for UCI regression benchmarks.

We randomly split the data into training (90%) and test (10%) sets. To reduce the randomness, we repeat the splitting 10 times, except for the two largest datasets: "Year" and "Protein" are repeated 1 and 5 times, respectively. During training, we normalize input features and training targets to zero mean and unit variance. We do not adopt this normalization at test time. All experiments train a network with a single hidden layer with 50 units except for the "Protein" and "Year" datasets, which have 100 hidden units. We use a batch size of 10 for smaller datasets, 100 for larger datasets, and 500 for the "Year" dataset. To stabilize the training, we re-initialize the re-scaling matrix every 50 iterations with K-FAC eigenvalues. This is equivalent to executing a K-FAC iteration. We found that the second moment of the gradient vector is unstable for a small batch. We amortize the basis update to ensure the re-scaling matrix matches the eigenbasis, setting . For learning rates, we use , , and for all datasets. They are decayed by 0.1 for the second half of the training.

We compare the results with Bayes by Backprop [Blundell et al., 2015], noisy Adam, and noisy K-FAC [Zhang et al., 2017]. We report root mean square error (RMSE) and log-likelihood on the test dataset. The results are summarized in Table 1. Noisy EK-FAC yields the highest test log-likelihood on nine of the ten datasets (all except Energy, where noisy K-FAC is higher). The training plot is shown in Figure 3. Noisy EK-FAC also achieves a higher ELBO compared to noisy K-FAC.

6.2 Classification

Test Accuracy (%)
Method | none | D | B | D + B
SGD | 81.79 | 88.35 | 85.75 | 91.39
KFAC | 82.39 | 88.89 | 86.86 | 92.13
Noisy-KFAC | 85.52 | 89.35 | 88.22 | 92.01
Noisy-EKFAC | 87.07 | 89.86 | 88.45 | 92.22

Table 2: Classification accuracy on CIFAR10 with modified VGG16. [D] denotes data augmentation and [B] denotes Batch Normalization.

To evaluate the scalability of the proposed method, we train a modified VGG16 [Simonyan and Zisserman, 2014] and test on CIFAR10 benchmarks [Krizhevsky, 2009]. The modified VGG16 has a half reduced number of hidden units in all layers. Similar to applying K-FAC on convolutional layers with Kronecker factors [Grosse and Martens, 2016], EK-FAC can be extended to convolutional layers. We compare the results with SGD (with momentum), K-FAC, and noisy K-FAC.

We use a batch size of 128 for all experiments. To reduce the computational overhead, we amortize covariance, inverse, and re-scaling updates. Specifically, we set , , and . We noticed that the amortization does not significantly impact per-iteration optimization performance. and are set to 0.01 for both noisy K-FAC and noisy EK-FAC. We adopt batch normalization [Ioffe and Szegedy, 2015] and data augmentation. We tune the regularization parameter $\lambda$ and the prior variance $\eta$. With data augmentation, we use a smaller regularization parameter. is set to 0.1 without batch normalization and 1.0 with batch normalization.

The results are summarized in Table 2. Noisy EK-FAC achieves the highest test accuracy in all settings without introducing computational overhead. Without extra regularization tricks (no data augmentation or batch normalization), noisy EK-FAC improves on noisy K-FAC by 1.55 percentage points (87.07 vs. 85.52).

7 Conclusion

In this paper, we introduced a modified training method for variational Bayesian neural networks. An eigenvalue corrected matrix-variate Gaussian extends a matrix-variate Gaussian to represent the posterior distribution with more flexibility. It not only efficiently captures correlations between weights but also computes an accurate diagonal variance in the K-FAC eigenbasis. For both regression and classification evaluations, noisy EK-FAC achieves higher ELBO and test accuracy, demonstrating its effectiveness.

References

Appendix A Matrix-Variate Gaussian

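A matrix-variate Gaussian distribution models a multivariate Gaussian distribution for a matrix $W \in \mathbb{R}^{n \times m}$.

$$\mathcal{MN}(W; M, \Sigma_1, \Sigma_2) = \frac{\exp\left(-\frac{1}{2}\mathrm{tr}\left[\Sigma_2^{-1} (W - M)^\top \Sigma_1^{-1} (W - M)\right]\right)}{(2\pi)^{nm/2}\, |\Sigma_2|^{n/2}\, |\Sigma_1|^{m/2}} \tag{15}$$

where $M \in \mathbb{R}^{n \times m}$ is the mean, $\Sigma_1 \in \mathbb{R}^{n \times n}$ is the covariance among rows and $\Sigma_2 \in \mathbb{R}^{m \times m}$ is the covariance among columns. Since $\Sigma_1$ and $\Sigma_2$ are covariance matrices, they are positive definite. Vectorization of $W$ forms a multivariate Gaussian distribution whose covariance matrix is a Kronecker product of $\Sigma_2$ and $\Sigma_1$.

$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\, \Sigma_2 \otimes \Sigma_1\right) \tag{16}$$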
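A matrix-variate Gaussian can be sampled using only the two small covariance factors. The sketch below assumes Cholesky factors as the matrix square roots (any square root would do), random positive definite covariances, and column-major vectorization. It also checks that the implied covariance of the vectorized sample is the Kronecker product of the column and row covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                                   # matrix shape (illustrative)

def vec(X):
    """Stack the columns of X into one vector (column-major order)."""
    return X.reshape(-1, order="F")

def random_spd(d):
    """Random symmetric positive definite matrix."""
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

Sigma1 = random_spd(n)                        # covariance among rows
Sigma2 = random_spd(m)                        # covariance among columns
M = rng.standard_normal((n, m))               # mean

L1 = np.linalg.cholesky(Sigma1)               # Sigma1 = L1 L1^T
L2 = np.linalg.cholesky(Sigma2)               # Sigma2 = L2 L2^T

def sample_mvg(rng):
    """Draw W = M + L1 E L2^T with E having i.i.d. standard normal entries."""
    E = rng.standard_normal((n, m))
    return M + L1 @ E @ L2.T

# vec(L1 E L2^T) = (L2 ⊗ L1) vec(E), so the covariance of vec(W) is T T^T.
T = np.kron(L2, L1)
print(np.allclose(T @ T.T, np.kron(Sigma2, Sigma1)))

W = sample_mvg(rng)
```

Sampling this way costs two small matrix products instead of a Cholesky factorization of the full nm×nm covariance.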

Appendix B Eigenvalue Corrected Matrix-Variate Gaussian

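An eigenvalue corrected matrix-variate Gaussian is an extension of a matrix-variate Gaussian that accounts for the full diagonal variance in the Kronecker-factored eigenbasis. Writing $Q = Q_{\Sigma_2} \otimes Q_{\Sigma_1}$ for the Kronecker-factored eigenbasis of the column and row covariance factors, its density is

$$\mathcal{EMN}(W; M, \Sigma_1, \Sigma_2, R) = \frac{\exp\left(-\frac{1}{2}\, \mathrm{vec}(W - M)^\top\, Q R^{-1} Q^\top\, \mathrm{vec}(W - M)\right)}{(2\pi)^{nm/2}\, |R|^{1/2}} \tag{17}$$

$R$ is the re-scaling matrix. Because covariance matrices are positive definite, the diagonal entries in $R$ are all positive. Similar to a matrix-variate Gaussian distribution, vectorization of $W$ gives a multivariate Gaussian distribution whose covariance matrix has a Kronecker-structured eigenbasis.

$$\mathrm{vec}(W) \sim \mathcal{N}\left(\mathrm{vec}(M),\, (Q_{\Sigma_2} \otimes Q_{\Sigma_1})\, R\, (Q_{\Sigma_2} \otimes Q_{\Sigma_1})^\top\right) \tag{18}$$

Sampling from an eigenvalue corrected matrix-variate distribution is also a special case of sampling from a multivariate Gaussian distribution. Let $E \in \mathbb{R}^{n \times m}$ be a matrix with independent samples from a standard Gaussian.

$$E \sim \mathcal{MN}(0, I, I) \tag{19}$$

Then let

$$W = M + Q_{\Sigma_1} \left[E \odot \mathrm{unvec}(\sqrt{r})\right] Q_{\Sigma_2}^\top \tag{20}$$

where $r = \mathrm{diag}(R)$, $\odot$ is an element-wise multiplication, and unvec is an inverse of the vec operation.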
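The sampling procedure can be sketched as follows: scale standard normal noise element-wise by the square root of the re-scaling diagonal, then rotate by the two small eigenbases. This is an illustrative sketch under assumptions: the eigenbases are random orthogonal matrices, the re-scaling diagonal is stored as an n×m array (its unvec form) with arbitrary positive entries, and `sample_emvg` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                                   # matrix shape (illustrative)

def vec(X):
    """Stack the columns of X into one vector (column-major order)."""
    return X.reshape(-1, order="F")

def random_orthogonal(d):
    """Random orthogonal matrix standing in for an eigenbasis."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

Q1 = random_orthogonal(n)                     # eigenbasis of the row factor
Q2 = random_orthogonal(m)                     # eigenbasis of the column factor
R = rng.uniform(0.5, 2.0, size=(n, m))        # unvec of the re-scaling diagonal (positive)
M = rng.standard_normal((n, m))               # mean

def sample_emvg(E):
    """Map standard normal noise E (n x m) to a sample with the EMVG covariance."""
    return M + Q1 @ (E * np.sqrt(R)) @ Q2.T

E = rng.standard_normal((n, m))
W = sample_emvg(E)

# The same map written with explicit (nm x nm) matrices, for checking only.
Q = np.kron(Q2, Q1)
cov = Q @ np.diag(vec(R)) @ Q.T               # target covariance of vec(W)
T = Q @ np.diag(np.sqrt(vec(R)))              # linear map applied to vec(E)
print(np.allclose(vec(W - M), T @ vec(E)), np.allclose(T @ T.T, cov))
```

The explicit matrices `Q`, `cov`, and `T` exist here only to verify the result; the sampler itself touches nothing larger than n×n, m×m, and n×m arrays.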

Appendix C Derivation of EK-FAC Update

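Let $V \in \mathbb{R}^{n \times m}$ be the weight gradient, $A$ the covariance matrix of input activations, and $S$ the covariance matrix of the gradients with respect to the output pre-activations. $R$ is the diagonal re-scaling matrix in the K-FAC eigenbasis $Q_S \otimes Q_A$. The following is the derivation of the EK-FAC update shown in Algorithm 1.

$$\begin{aligned}
(Q_S \otimes Q_A)\, R^{-1}\, (Q_S \otimes Q_A)^\top \mathrm{vec}(V)
&= (Q_S \otimes Q_A)\, R^{-1}\, \mathrm{vec}(Q_A^\top V Q_S) \\
&= (Q_S \otimes Q_A)\, \mathrm{vec}\left((Q_A^\top V Q_S) \oslash \mathrm{unvec}(r)\right) \\
&= \mathrm{vec}\left(Q_A \left[(Q_A^\top V Q_S) \oslash \mathrm{unvec}(r)\right] Q_S^\top\right)
\end{aligned} \tag{21}$$

where $r = \mathrm{diag}(R)$, $\oslash$ is an element-wise division, and unvec is an inverse of the vec operation.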

Appendix D Pseudo-Code for Noisy K-FAC

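Require: $\alpha$: stepsize
Require: $\beta$: exponential moving average parameter for covariance factors
Require: $\lambda, \eta, \gamma_{ex}$: KL weighting, prior variance, extrinsic damping term
Require: $T_{stats}, T_{inv}$: stats and inverse update intervals.
  $k \leftarrow 0$ and initialize $\{\mu_l\}, \{S_l\}, \{A_l\}$
  Calculate the intrinsic damping term $\gamma_{in} = \frac{\lambda}{N\eta}$, total damping term $\gamma = \gamma_{in} + \gamma_{ex}$
  while stopping criterion not met do
     $k \leftarrow k + 1$
     $W_l \sim \mathcal{MN}\left(M_l,\, \frac{\lambda}{N}[A_l^{\gamma_{in}}]^{-1},\, [S_l^{\gamma_{in}}]^{-1}\right)$
     if $k \equiv 0$ (mod $T_{stats}$) then
        Update the covariance factors $\{S_l\}, \{A_l\}$ using eq. (12)
     end if
     if $k \equiv 0$ (mod $T_{inv}$) then
        Compute the inverses $\{[S_l^{\gamma}]^{-1}\}, \{[A_l^{\gamma}]^{-1}\}$ using eq. (11).
     end if
     $V_l = \nabla_{W_l} \log p(y|x,w) - \gamma_{in} \cdot W_l$
     $M_l \leftarrow M_l + \alpha\, [A_l^{\gamma}]^{-1}\, V_l\, [S_l^{\gamma}]^{-1}$
  end while
Algorithm 2 Noisy K-FAC. A subscript $l$ denotes the index of a layer, $w_l = \mathrm{vec}(W_l)$, and $\mu_l = \mathrm{vec}(M_l)$. We assume zero momentum for simplicity. Differences from standard K-FAC are shown in blue in the original typeset version.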