The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks

Jakub Świątkowski et al. · 02/07/2020

Variational Bayesian Inference is a popular methodology for approximating posterior distributions over Bayesian neural network weights. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. Furthermore, we find that such factorized parameterizations improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.


1 Introduction

Bayesian neural networks (MacKay, 1992; Neal, 1993) are a popular class of deep learning models. The most widespread approach for training these models relies on variational inference (Peterson, 1987; Hinton & Van Camp, 1993), a training paradigm that approximates a Bayesian posterior with a simpler class of distributions by solving an optimization problem. The common wisdom is that more expressive distributions lead to better posterior approximations and ultimately to better model performance. This paper calls that wisdom into question and instead finds that for Bayesian neural networks, more restrictive classes of distributions, based on low-rank factorizations, can outperform the common mean-field family.

Bayesian neural networks explicitly represent their parameter uncertainty by forming a posterior distribution over model parameters, instead of relying on a single point estimate for making predictions, as is done in traditional deep learning. For neural network weights $\mathbf{w}$, features $\mathbf{x}$ and labels $\mathbf{y}$, the posterior distribution $p(\mathbf{w} \mid \mathbf{x}, \mathbf{y})$ is computed using Bayes' rule, which multiplies the prior distribution $p(\mathbf{w})$ and the data likelihood $p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})$ and renormalizes. When predicting with Bayesian neural networks, we form an average over model predictions, where each prediction is generated using a set of parameters randomly sampled from the posterior distribution. Bayesian neural networks are thus a type of ensembling, of which various types have proven highly effective in deep learning (see e.g. Goodfellow et al., 2016, sec 7.11).
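As a concrete illustration of this prediction-by-averaging, the sketch below (our own toy example, not the paper's code) averages class probabilities over weight samples drawn from a factorized Gaussian posterior for a small softmax classifier; the posterior means, standard deviations and sample count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy "posterior": independent Gaussians over the weights of a linear softmax classifier.
num_features, num_classes, num_samples = 5, 3, 100
post_mean = rng.normal(size=(num_features, num_classes))
post_std = 0.1 * np.ones((num_features, num_classes))

x = rng.normal(size=(4, num_features))  # a small batch of inputs

# Bayesian model averaging: average the predictive distribution over posterior samples.
probs = np.zeros((x.shape[0], num_classes))
for _ in range(num_samples):
    w = post_mean + post_std * rng.normal(size=post_mean.shape)  # one weight sample
    probs += softmax(x @ w)
probs /= num_samples
print(probs.round(3))
```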

Besides offering improved predictive performance over single models, Bayesian ensembles are also more robust because ensemble members will tend to make different predictions on hard examples (Raftery et al., 2005). In addition, the diversity of the ensemble represents predictive uncertainty and can be used for out-of-domain detection or other risk-sensitive applications (Ovadia et al., 2019).

Variational inference is a popular class of methods for approximating the posterior distribution $p(\mathbf{w} \mid \mathbf{x}, \mathbf{y})$, since exact Bayes' rule is often intractable to compute for models of practical interest. This class of methods specifies a distribution $q(\mathbf{w})$ of a given parametric or functional form as the posterior approximation, and fits the approximation by solving an optimization problem. In particular, we minimize the Kullback-Leibler (KL) divergence between the variational distribution $q(\mathbf{w})$ and the true posterior distribution $p(\mathbf{w} \mid \mathbf{x}, \mathbf{y})$, which is given by

$\mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{y})\big) = \mathbb{E}_{q(\mathbf{w})}\big[\log q(\mathbf{w}) - \log p(\mathbf{w}) - \log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\big] + \log p(\mathbf{y} \mid \mathbf{x}).$  (1)

Here, we do not know the normalizing constant $p(\mathbf{y} \mid \mathbf{x})$ of the exact posterior, but since this term does not depend on $q(\mathbf{w})$, we may ignore it for the purpose of optimizing our approximation $q(\mathbf{w})$. We are then left with what is called the negative Evidence Lower Bound (negative ELBO):

$-\mathcal{L}(q) = \mathrm{KL}\big(q(\mathbf{w}) \,\|\, p(\mathbf{w})\big) - \mathbb{E}_{q(\mathbf{w})}\big[\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\big].$  (2)

In practice, the expectation of the log-likelihood with respect to $q(\mathbf{w})$ is usually not analytically tractable and is instead estimated using Monte Carlo sampling:

$\mathbb{E}_{q(\mathbf{w})}\big[\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p\big(\mathbf{y} \mid \mathbf{x}, \mathbf{w}^{(s)}\big), \qquad \mathbf{w}^{(s)} \sim q(\mathbf{w}),$  (3)

where the ELBO is optimized by differentiating this stochastic approximation with respect to the variational parameters (Salimans et al., 2013; Kingma & Welling, 2013).
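To make Equations (2)–(3) concrete, the following is a minimal numpy sketch (our own illustration, not the authors' implementation) that estimates the negative ELBO for a factorized Gaussian posterior over the weights of a toy linear regression model; the data, prior scale and sample count are hypothetical. The closed-form Gaussian KL and the Monte Carlo likelihood term correspond to the two terms of Equation (2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data and a factorized Gaussian posterior q(w) = N(mu, diag(sigma^2)).
num_data, num_features = 50, 3
X = rng.normal(size=(num_data, num_features))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=num_data)

mu = 0.1 * rng.normal(size=num_features)   # variational means
sigma = 0.05 * np.ones(num_features)       # variational standard deviations
prior_std = 1.0                            # N(0, prior_std^2) prior on each weight
noise_std = 0.1                            # observation noise

def log_likelihood(w):
    resid = y - X @ w
    return (-0.5 * np.sum(resid**2) / noise_std**2
            - num_data * np.log(noise_std * np.sqrt(2 * np.pi)))

# KL(q || p) between two factorized Gaussians, in closed form (first term of Eq. 2).
kl = np.sum(np.log(prior_std / sigma) + (sigma**2 + mu**2) / (2 * prior_std**2) - 0.5)

# Monte Carlo estimate of E_q[log p(y | X, w)] (Equation 3).
S = 32
samples = mu + sigma * rng.normal(size=(S, num_features))
expected_ll = np.mean([log_likelihood(w) for w in samples])

neg_elbo = kl - expected_ll                # Equation (2)
print(f"negative ELBO estimate: {neg_elbo:.2f}")
```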

Figure 1: Summary of different variational inference methods for Bayesian deep learning. Our approach complements existing approaches by combining the mean-field assumption with a dramatic reduction in the number of parameters through weight sharing.

In Gaussian Mean Field Variational Inference (GMFVI) (Blei et al., 2017; Blundell et al., 2015), we choose the variational approximation to be a fully factorized Gaussian distribution $q(\mathbf{w}) = \prod_{l,i,j} \mathcal{N}(w_{lij} \mid \mu_{lij}, \sigma^{2}_{lij})$, where $l$ is the layer number, and $i$ and $j$ are the row and column indices in the layer's weight matrix. While Gaussian mean-field posteriors are considered to be one of the simplest types of variational approximations, with some known limitations (Giordano et al., 2018), they scale to comparatively large models and generally provide competitive performance (Ovadia et al., 2019). Additionally, Farquhar et al. (2020) have found that the mean-field assumption becomes less restrictive as the depth of the network increases. However, when compared to deterministic neural networks, GMFVI doubles the number of parameters and is often harder to train due to the increased noise in stochastic gradient estimates. Furthermore, despite its theoretical advantages over deterministic neural networks, GMFVI suffers from over-regularization for larger networks, which leads to underfitting and often worse predictive performance in such settings (Osawa et al., 2019).

Beyond mean-field variational inference, recent work on approximate Bayesian inference has explored ever richer parameterizations of the approximate posterior in the hope of improving the performance of Bayesian neural networks (see Figure 1). In contrast, here we study a simpler, more compactly parameterized variational approximation. Our motivation for studying this setting is to better understand the behaviour of GMFVI, with the goal of addressing the issues that limit its practical applicability. Consequently, we show that compact approximations can also work well for a variety of models. In particular, we find that:

  • Converged posterior standard deviations under GMFVI consistently display strong low-rank structure. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing our model’s performance.

  • Factorized parameterizations of posterior standard deviations improve the signal-to-noise ratio of stochastic gradient estimates, and thus not only reduce the number of parameters compared to standard GMFVI, but also can lead to faster convergence.

2 Mean field posterior standard deviations naturally have low-rank structure

In this section we show that the converged posterior standard deviations of Bayesian neural networks trained using standard GMFVI consistently display strong low-rank structure. We also show that it is possible to compress the learned posterior standard deviation matrix using a low-rank approximation without decreasing the network’s performance. We first briefly introduce the mathematical notation for our GMFVI setting and the low-rank approximation that we explore. We then provide experimental results that support the two main claims of this section.

To avoid any confusion among the readers, we would like to clarify that we use the terminology “low-rank” in a particular context. While variational inference typically makes use of low-rank decompositions to compactly represent the dense covariance of a Gaussian variational distribution (see numerous references in Section 4), we investigate instead underlying low-rank structures within the already diagonal covariance of a Gaussian fully-factorized variational distribution. We will make this explanation more formal in the next section.

2.1 Methodology

To introduce the notation, we consider layers that consist of a linear transformation followed by a non-linearity $a(\cdot)$,

$\mathbf{h}_{l} = a\big(\mathbf{W}_{l}^{\top} \mathbf{h}_{l-1} + \mathbf{b}_{l}\big),$  (4)

where $\mathbf{W}_{l} \in \mathbb{R}^{m \times n}$, $\mathbf{h}_{l-1} \in \mathbb{R}^{m}$ and $\mathbf{b}_{l} \in \mathbb{R}^{n}$. To simplify the notation in the following, we drop the subscript $l$ such that $\mathbf{W} = \mathbf{W}_{l}$ and $\mathbf{b} = \mathbf{b}_{l}$, and we focus on the kernel matrix $\mathbf{W}$ for a single layer.

In GMFVI, we model the variational posterior as

$q\big(\mathrm{vec}(\mathbf{W})\big) = \mathcal{N}\big(\mathrm{vec}(\mathbf{W}) \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\big),$  (5)

where $\boldsymbol{\mu} \in \mathbb{R}^{mn}$ is the posterior mean vector and $\boldsymbol{\Sigma} \in \mathbb{R}^{mn \times mn}$ is the diagonal posterior covariance matrix. The weights are then usually sampled using the reparameterization trick (Kingma & Welling, 2013), i.e., for the $s$-th sample, we have

$\mathrm{vec}\big(\mathbf{W}^{(s)}\big) = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}^{(s)}, \qquad \boldsymbol{\epsilon}^{(s)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$  (6)

where $\boldsymbol{\sigma} \in \mathbb{R}^{mn}$ is the vector of posterior standard deviations. In practice, we often represent the posterior standard deviation parameters in the form of a matrix $\boldsymbol{\sigma} \in \mathbb{R}^{m \times n}$, matching the shape of $\mathbf{W}$. Note that we have the relationship $\boldsymbol{\Sigma} = \mathrm{diag}\big(\mathrm{vec}(\boldsymbol{\sigma}^{2})\big)$, where the elementwise-squared $\boldsymbol{\sigma}^{2}$ is vectorized by stacking its columns and then expanded as a diagonal matrix into $\boldsymbol{\Sigma}$.
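The numpy sketch below (an illustration under the notation above, not the authors' code) draws one weight sample via Equation (6) in matrix form and checks the relationship $\boldsymbol{\Sigma} = \mathrm{diag}(\mathrm{vec}(\boldsymbol{\sigma}^{2}))$ between the per-layer standard deviation matrix and the diagonal covariance; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

m, n = 4, 3                              # layer input / output sizes
mu = rng.normal(size=(m, n))             # posterior means, stored as a matrix
sigma = np.abs(rng.normal(size=(m, n)))  # posterior std devs, stored as a matrix

# Reparameterization trick (Equation 6): W = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.normal(size=(m, n))
W_sample = mu + sigma * eps

# Relationship between the matrix form and the diagonal covariance:
# Sigma = diag(vec(sigma^2)), where vec() stacks columns (Fortran order).
Sigma = np.diag((sigma**2).flatten(order="F"))
assert Sigma.shape == (m * n, m * n)
assert np.allclose(np.diag(Sigma), (sigma**2).flatten(order="F"))
print(W_sample)
```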

In the sequel, we start by empirically studying the properties of the spectrum of the matrices $\boldsymbol{\sigma}$ post training (after convergence), while using standard Gaussian mean-field variational distributions. Interestingly, we observe that those matrices naturally exhibit a low-rank structure (see Section 2.3 for the corresponding experiments), i.e.,

$\boldsymbol{\sigma} \approx \mathbf{u}\mathbf{v}^{\top}$  (7)

for some $\mathbf{u} \in \mathbb{R}^{m \times k}$, $\mathbf{v} \in \mathbb{R}^{n \times k}$ and a small value $k$ (e.g., 2 or 3). This observation motivates the introduction of the following variational family, which we name the $k$-tied Normal:

$\mathcal{N}_{k\text{-tied}}\big(\boldsymbol{\mu}, \mathbf{u}, \mathbf{v}\big) = \mathcal{N}\Big(\boldsymbol{\mu},\ \mathrm{diag}\big(\mathrm{vec}\big((\mathbf{u}\mathbf{v}^{\top})^{2}\big)\big)\Big),$  (8)

where the squaring of the matrix $\mathbf{u}\mathbf{v}^{\top}$ is applied elementwise. Due to the tied parameterization of the diagonal covariance matrix, we emphasize that this variational family is smaller than (i.e., included in) the standard Gaussian mean-field variational distribution family.

Variational family | Parameters (total)
multivariate Normal | $mn + mn(mn+1)/2$
diagonal Normal | $2mn$
$\mathcal{MN}$ (diagonal) | $mn + m + n$
$k$-tied Normal | $mn + k(m+n)$
Table 1: Number of variational parameters for each variational family for a weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$. $\mathcal{MN}$ (diagonal) is from Louizos & Welling (2016).

As formally discussed in Appendix A, the matrix variate Gaussian distribution (Gupta & Nagar, 2018), referred to as $\mathcal{MN}$ and already used for variational inference by Louizos & Welling (2016) and Sun et al. (2017), is related to our $k$-tied Normal distribution with $k=1$ when the $\mathcal{MN}$ uses diagonal row and column covariances. Interestingly, we prove that for $k \geq 2$, our $k$-tied Normal distribution cannot be represented by any $\mathcal{MN}$ distribution. This illustrates the main difference of our approach from the most closely related previous work of Louizos & Welling (2016) (see also Figure 7 in Appendix A).

Notice that our diagonal covariance repeatedly reuses the same elements of $\mathbf{u}$ and $\mathbf{v}$, which results in parameter sharing across different weights. The total number of standard deviation parameters in our method is $k(m+n)$ from $\mathbf{u}$ and $\mathbf{v}$, compared to $mn$ from $\boldsymbol{\sigma}$ in the standard GMFVI parameterization. Given that in our experiments $k$ is very low (e.g., 2 or 3), this reduces the number of parameters from quadratic to linear in the dimensions of the layer, see Table 1. More importantly, such parameter sharing across the weights leads to a higher signal-to-noise ratio during training and thus, in some cases, faster convergence. We demonstrate this phenomenon in the next section. In the rest of this section, we show that standard GMFVI methods already learn a low-rank structure in the posterior standard deviation matrix $\boldsymbol{\sigma}$. Furthermore, we provide evidence that replacing the full matrix $\boldsymbol{\sigma}$ with its low-rank approximation does not reduce the predictive performance.
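A minimal sketch of the tied parameterization described above (our own illustration, with arbitrary layer sizes): the standard deviation matrix is materialized as the product of two thin non-negative factors kept in the log domain, and the number of standard deviation parameters drops from $mn$ to $k(m+n)$.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n, k = 400, 400, 2                       # layer size and tying rank

# Standard GMFVI: one free std-dev parameter per weight.
sigma_full_params = m * n

# k-tied Normal: two thin factors u (m x k) and v (n x k), positive via exp of log-parameters.
log_u = rng.normal(scale=0.1, size=(m, k))
log_v = rng.normal(scale=0.1, size=(n, k))
u, v = np.exp(log_u), np.exp(log_v)

sigma = u @ v.T                             # materialized std-dev matrix, shape (m, n)
variances = sigma**2                        # elementwise square -> diagonal covariance entries

tied_params = k * (m + n)
print(f"standard GMFVI std-dev parameters: {sigma_full_params}")   # 160000
print(f"{k}-tied std-dev parameters:       {tied_params}")         # 1600
print(f"rank of the materialized sigma: {np.linalg.matrix_rank(sigma)} (at most k = {k})")
```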

2.2 Experimental setting

Before describing the experimental results, we briefly explain the key properties of the experimental setting. We analyze four types of GMFVI Bayesian neural network models:

  • Multilayer Perceptron (MLP): a network of 3 dense layers and ReLU activations that we train on the MNIST dataset (LeCun & Cortes, 2010). We use the last 10,000 examples of the training set as a validation set.

  • Convolutional Neural Network (CNN): a LeNet architecture (LeCun et al., 1998) with 2 convolutional layers and 2 dense layers that we train on the CIFAR-100 dataset (Krizhevsky et al., 2009b). We use the last 10,000 examples of the training set as a validation set.

  • Long Short-Term Memory (LSTM): a model that consists of an embedding and an LSTM cell (Hochreiter & Schmidhuber, 1997), followed by a single unit dense layer. We train it on the IMDB dataset (Maas et al., 2011), in which we use the last 5,000 examples of the training set as a validation set.

  • Residual Convolutional Neural Network (ResNet): a ResNet-18 architecture (He et al., 2016b) (see: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/cifar10_bnn.py) trained on all 50,000 training examples of the CIFAR-10 dataset (Krizhevsky et al., 2009a).

In each of the four models, we use the standard mean-field Normal variational posterior and a Normal prior, for which we set a single scalar standard deviation hyper-parameter shared by all layers. We optimize the variational posterior parameters using the Adam optimizer (Kingma & Ba, 2014). For a more comprehensive explanation of the experimental setup, please refer to Appendix B. Finally, we highlight that our experiments focus primarily on the comparison across a broad range of model types rather than on competing with state-of-the-art results on the specific datasets used. Nevertheless, we also show that our results extend to larger models with competitive performance, such as the ResNet-18 model. Note that scaling GMFVI to such larger model sizes is still a challenging research problem (Osawa et al., 2019).

2.3 Main experimental observation

Our main experimental observation is that standard GMFVI learns posterior standard deviation matrices that have a low-rank structure across different model types (MLP, CNN, LSTM), model sizes (LeNet, ResNet-18) and layer types (dense, convolutional). To show this, we investigate the results of the singular value decomposition (SVD) of the posterior standard deviation matrices in the four described model types. We analyze the models post training, once they have been trained until ELBO convergence using the standard GMFVI approach. While for the first three models (MLP, CNN and LSTM) we evaluate the low-rank structure only in the dense layers, for the ResNet model we consider the low-rank structure in the convolutional layers as well.

Figure 3 shows the fraction of variance explained by each singular value from the SVD of the $\boldsymbol{\sigma}$ matrices in the dense layers of the first three models. The fraction of variance explained by the $i$-th singular value is calculated as $s_{i}^{2} / \sum_{j} s_{j}^{2}$, where $s_{j}$ are the singular values. We observe that, unlike the posterior means, the posterior standard deviations have most of their variance explained by the first few singular values. In particular, a rank-1 approximation of the posterior standard deviation matrices explains most of their variance, while a rank-2 approximation captures nearly all of the remaining variance. Figure 2 further supports this claim visually by comparing the heat maps of the full posterior standard deviation matrix and its rank-1 and rank-2 approximations. In particular, we observe that the rank-2 approximation results in a heat map that looks visually very similar to the full matrix, while this is not the case for the rank-1 approximation. Importantly, Figure 4 illustrates that the same low-rank structure is also present in both the dense and the convolutional layers of the larger ResNet-18 model. In the analysis of the above experiments, we use the shorthand SEM to refer to the standard error of the mean.
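The diagnostic reported in Figures 3 and 4 can be reproduced in a few lines of numpy. The sketch below (our own illustration) applies it to a random low-rank-plus-noise matrix standing in for a trained posterior standard deviation matrix, rather than to real trained weights.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a trained posterior std-dev matrix: approximately rank-2 plus small noise.
m, n = 400, 400
sigma = np.abs(rng.normal(size=(m, 1)) @ rng.normal(size=(1, n))
               + 0.3 * rng.normal(size=(m, 1)) @ rng.normal(size=(1, n))
               + 0.01 * rng.normal(size=(m, n)))

s = np.linalg.svd(sigma, compute_uv=False)   # singular values, descending
explained = s**2 / np.sum(s**2)              # fraction of variance per singular value

print("top-5 fractions of variance explained:", explained[:5].round(4))
print("cumulative (rank-1, rank-2, rank-3):", np.cumsum(explained)[:3].round(4))
```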

Figure 2: Heat maps of the posterior standard deviation matrix in the first dense layer of a LeNet CNN trained using GMFVI on the CIFAR-100 dataset (left: full matrix, middle: rank-1 approximation, right: rank-2 approximation). Unlike the rank-1 approximation, the rank-2 approximation looks visually similar to the full matrix. This is consistent with our numerical results from Figure 3.
Figure 3: Fraction of variance explained per each singular value from SVD of matrices of posterior means and posterior standard deviations in different dense layers of three model types trained using standard GMFVI: MLP (left), CNN (center), LSTM (right). Unlike posterior means, posterior standard deviations clearly display strong low-rank structure, with most of the variance contained in the top few singular values.
       MNIST, MLP                               |  CIFAR100, CNN                        |  IMDB, LSTM
Rank   -ELBO          NLL           Accuracy    |  -ELBO        NLL          Accuracy   |  -ELBO          NLL           Accuracy
Full   0.431±0.0057   0.100±0.0034  97.6±0.15   |  3.83±0.020   2.23±0.017   42.1±0.49  |  0.536±0.0058   0.493±0.0057  80.1±0.25
1      3.41±0.019     0.677±0.0040  93.6±0.25   |  4.33±0.021   2.30±0.016   41.7±0.49  |  0.687±0.0058   0.491±0.0056  80.0±0.25
2      0.456±0.0059   0.107±0.0033  97.6±0.15   |  3.88±0.020   2.24±0.017   42.2±0.49  |  0.621±0.0058   0.494±0.0057  80.1±0.25
3      0.450±0.0059   0.106±0.0033  97.6±0.15   |  3.86±0.020   2.24±0.017   42.1±0.49  |  0.595±0.0058   0.493±0.0056  80.1±0.25
Table 2: Impact of low-rank approximation of the GMFVI-trained posterior standard deviation matrix on the model's ELBO and predictive performance, for three types of models. We report the mean and SEM of each metric across 100 weight samples. The low-rank approximations with ranks higher than one achieve predictive performance close to that obtained when not using any approximation.
Rank   -ELBO          NLL           Accuracy
Full   122.61±0.012   0.495±0.0080  83.5±0.37
1      122.57±0.012   0.658±0.0069  81.7±0.39
2      122.77±0.012   0.503±0.0080  83.2±0.37
3      122.67±0.012   0.501±0.0079  83.2±0.37
Figure 4: Unlike posterior means, the posterior standard deviations of both dense and convolutional layers in the ResNet-18 model trained using standard GMFVI display strong low-rank structure and can be approximated without loss in predictive metrics. Top: Fraction of variance explained per each singular value of the matrices of converged posterior means and standard deviations. Bottom: Impact of post training low-rank approximation of the posterior standard deviation matrices on the model’s performance. We report mean and SEM of each metric across 100 weights samples.

2.4 Low-rank approximation of variance matrices

Motivated by the above observations, we show that it is possible to replace the full posterior standard deviation matrices with their low-rank approximations without a decrease in predictive performance. Table 2 shows the performance comparison of the MLP, CNN and LSTM models with different ranks of the approximations. Figure 4 contains analogous results for the ResNet-18 model. The results show that the post-training approximations with ranks higher than one achieve predictive performance close to that of the full posterior for all the analyzed model types, model sizes and layer types. Furthermore, Table 3 shows that, for the ResNet-18 model, the approximations with ranks higher than one also do not decrease the quality of the uncertainty estimates compared to the full model without the approximations (we compute the Brier Score and the ECE using the implementations from the TensorFlow Probability (Dillon et al., 2017) Stats module: https://www.tensorflow.org/probability/api_docs/python/tfp/stats). These observations could be used as a form of post-training network compression. Moreover, they give rise to further interesting exploration directions such as formulating posteriors that exploit such a low-rank structure. In the next section we explore this particular direction while focusing on the first three model types (MLP, CNN, LSTM).
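The post-training compression just described amounts to truncating the SVD of each layer's standard deviation matrix and flooring the result at zero, since standard deviations must be non-negative (the zero floor is also mentioned in Appendix B.3). A hedged numpy sketch, applied to a random positive matrix rather than a trained posterior:

```python
import numpy as np

def low_rank_approx_of_std(sigma, rank):
    """Rank-`rank` SVD approximation of a posterior std-dev matrix, floored at zero."""
    U, s, Vt = np.linalg.svd(sigma, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return np.maximum(approx, 0.0)           # standard deviations cannot be negative

# Example with a random positive matrix standing in for a trained posterior.
rng = np.random.default_rng(5)
sigma = np.abs(rng.normal(size=(128, 64)))
for r in (1, 2, 3):
    approx = low_rank_approx_of_std(sigma, r)
    rel_err = np.linalg.norm(sigma - approx) / np.linalg.norm(sigma)
    print(f"rank {r}: relative Frobenius error {rel_err:.3f}")
```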

Rank   Brier Score     NLL           ECE
Full   -0.761±0.0039   0.495±0.0080  0.0477
1      -0.695±0.0034   0.658±0.0069  0.1642
2      -0.758±0.0038   0.503±0.0080  0.0540
3      -0.758±0.0038   0.501±0.0079  0.0541
Table 3: Quality of predictive uncertainty estimates for the ResNet-18 model on the CIFAR-10 dataset without and with post-training low-rank approximations of the GMFVI posterior standard deviation matrices in all the layers of the model. The approximations with ranks higher than one match the quality of the predictive uncertainty estimates from the full posteriors without the approximations. The quality of the predictive uncertainty estimates is measured by the negative log-likelihood (NLL), the Brier Score and the ECE (with 15 bins). For the NLL and the Brier Score metrics we report the mean and SEM across 100 weight samples.

3 The $k$-tied Normal Distribution: Exploiting Low-Rank Parameter Structure in Mean Field Posteriors

In the previous section we have shown that it is possible to replace a full matrix of posterior standard deviations, already trained using GMFVI, with its low-rank approximation without decreasing the predictive performance. In this section we show that it is also possible to exploit this observation at training time. We achieve this through our novel variational family, the $k$-tied Normal distribution (see Section 2.1).

We show that using this distribution in the context of GMFVI in Bayesian neural networks allows us to reduce the number of network parameters, increase the signal-to-noise ratio of the stochastic gradient estimates and speed up model convergence, while maintaining the predictive performance of the standard GMFVI parameterization. We start by recalling the definition of the $k$-tied Normal distribution,

$\mathcal{N}_{k\text{-tied}}\big(\boldsymbol{\mu}, \mathbf{u}, \mathbf{v}\big) = \mathcal{N}\Big(\boldsymbol{\mu},\ \mathrm{diag}\big(\mathrm{vec}\big((\mathbf{u}\mathbf{v}^{\top})^{2}\big)\big)\Big),$

where the variational parameters are comprised of $\boldsymbol{\mu} \in \mathbb{R}^{mn}$, $\mathbf{u} \in \mathbb{R}^{m \times k}$ and $\mathbf{v} \in \mathbb{R}^{n \times k}$.

3.1 Experimental setting

We now introduce the experimental setting in which we evaluate the GMFVI variational posterior parameterized by the $k$-tied Normal distribution. We assess the impact of the described posterior in terms of predictive performance and reduction in the number of parameters for the same first three model types (MLP, CNN, LSTM) and respective datasets (MNIST, CIFAR-100, IMDB) as in the previous section. Additionally, we also analyze the impact of weight tying in the posterior on the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, using the CNN model as a representative example. Overall, the experimental setup is very similar to the one introduced in the previous section; therefore, we highlight here only the key differences.

We apply the $k$-tied Normal variational posterior distribution to the layers which we analyzed in the previous section. Namely, we use the $k$-tied Normal variational posterior for all three layers of the MLP model, the two dense layers of the CNN model, and the LSTM cell's kernel and recurrent kernel. We initialize the parameters $\mathbf{u}$ and $\mathbf{v}$ of the $k$-tied Normal distribution so that after the outer-product operation the respective standard deviations have the same mean values as we obtain in the standard GMFVI posterior parameterization. In the experiments for this section, we use KL annealing (Sønderby et al., 2016), where we linearly scale up the contribution of the KL term in Equation 2 from zero to its full contribution over the course of training. More details about the experimental setup are available in Appendix B.
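For concreteness, a linear KL-annealing coefficient of the kind we describe can be implemented as a simple ramp; the sketch below is our own illustration with a hypothetical number of warm-up steps (the exact schedule we use is detailed in Appendix B.4).

```python
def kl_annealing_coefficient(step, warmup_steps=10_000):
    """Linearly scale the KL term in Equation (2) from 0 to 1 over `warmup_steps`."""
    return min(1.0, step / warmup_steps)

# The annealed objective is then: kl_coeff * KL(q || p) - E_q[log p(y | x, w)].
for step in (0, 2_500, 5_000, 10_000, 20_000):
    print(step, kl_annealing_coefficient(step))
```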

3.2 Experimental results

We first investigate the predictive performance of the GMFVI Bayesian neural network models trained using the $k$-tied Normal posterior distribution, with different levels of tying $k$. We compare these results to those obtained from the same models trained using the standard GMFVI parameterization. Figure 5 (left) shows that for $k \geq 2$ the $k$-tied Normal posterior achieves performance competitive with the standard GMFVI posterior parameterization, while reducing the total number of model parameters. The benefits of using the $k$-tied Normal posterior are most visible for models where the layers with the $k$-tied Normal posterior constitute a significant portion of the total number of model parameters (e.g., the MLP model).

We further investigate the impact of the $k$-tied Normal posterior on the signal-to-noise ratio (SNR) of stochastic gradient estimates of the variational lower bound (ELBO). The SNR for each gradient value is calculated as $|\mathbb{E}[g]| / \sqrt{\mathrm{Var}[g]}$, where $g$ is the gradient value for a single parameter; the expectation and variance of the gradient values are calculated over a window of the last 10 batches. In particular, we focus on the gradient SNR of the GMFVI posterior standard deviation parameters for which we perform the tying. These parameters are either $\mathbf{u}$ and $\mathbf{v}$ for the $k$-tied Normal posterior or $\boldsymbol{\sigma}$ for the standard GMFVI parameterization, all optimized in their log forms for numerical stability. Figure 5 (top right) shows that the $\mathbf{u}$ and $\mathbf{v}$ parameters used in the $k$-tied Normal posterior are trained with a significantly higher gradient SNR than the $\boldsymbol{\sigma}$ parameters used in the standard GMFVI parameterization. Consequently, Figure 5 (bottom right) shows that the increased SNR from the $k$-tied Normal distribution translates into faster convergence for the MLP model, which uses the $k$-tied Normal distribution in all of its layers.
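The per-parameter gradient SNR above can be computed from a sliding window of recent gradients, as in the numpy sketch below (our own illustration of the formula, using synthetic gradients rather than real training traces).

```python
import numpy as np

def gradient_snr(grad_window):
    """SNR per parameter: |mean| / std, computed over a window of recent gradients.

    grad_window: array of shape (window_size, num_params).
    """
    mean = grad_window.mean(axis=0)
    std = grad_window.std(axis=0) + 1e-12    # avoid division by zero
    return np.abs(mean) / std

# Synthetic example: 10 recent gradient batches for 1000 parameters.
rng = np.random.default_rng(6)
noisy_grads = 0.01 + 1.0 * rng.normal(size=(10, 1000))  # small signal, large noise -> low SNR
tied_grads = 0.5 + 0.2 * rng.normal(size=(10, 1000))    # larger signal, less noise -> higher SNR
print("mean SNR (untied-like):", gradient_snr(noisy_grads).mean().round(3))
print("mean SNR (tied-like):  ", gradient_snr(tied_grads).mean().round(3))
```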

Note that the $k$-tied Normal posterior does not increase the training step time compared to the standard GMFVI parameterization; see Table 4 for support of this claim (code to compare the training step times of the $k$-tied Normal and the standard GMFVI is available at: https://colab.research.google.com/drive/14pqe_VG5s49xlcXB-Jf8S9GoTFyjv4OF; the code uses the network architecture from: https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/classification.ipynb). Therefore, the $k$-tied Normal posterior also speeds up model convergence in terms of wall-clock time.

Figure 6 shows the convergence plots of the validation negative ELBO for all three model types. We observe that the impact of the $k$-tied Normal posterior on convergence depends on the model type. As shown in Figure 5 (bottom right), the impact on the MLP model is strong and consistent, with the $k$-tied Normal posterior increasing convergence speed compared to the standard GMFVI parameterization. For the LSTM model we also observe a similar speed-up. However, for the CNN model the impact of the $k$-tied Normal posterior on the ELBO convergence is much smaller. We hypothesize that this is because we use the $k$-tied Normal posterior for all the layers trained using GMFVI in the MLP and the LSTM models, while in the CNN model we use the $k$-tied Normal posterior only for some of the GMFVI-trained layers. More precisely, in the CNN model we use the $k$-tied Normal posterior only for the two dense layers, while the two convolutional layers are trained using the standard GMFVI parameterization.

Model & Dataset   Rank   -ELBO          NLL           Accuracy    #Par. [k]
MNIST, MLP        full   0.501±0.0061   0.133±0.0040  96.8±0.18   957
MNIST, MLP        1      0.539±0.0063   0.155±0.0043  96.1±0.19   482
MNIST, MLP        2      0.520±0.0063   0.129±0.0039  96.8±0.18   484
MNIST, MLP        3      0.497±0.0060   0.120±0.0038  96.9±0.18   486
CIFAR100, CNN     full   3.72±0.018     2.16±0.016    43.9±0.50   4,405
CIFAR100, CNN     1      3.65±0.017     2.12±0.015    45.5±0.50   2,262
CIFAR100, CNN     2      3.76±0.019     2.15±0.016    44.3±0.50   2,268
CIFAR100, CNN     3      3.73±0.018     2.13±0.016    44.3±0.50   2,273
IMDB, LSTM        full   0.538±0.0054   0.478±0.0052  79.5±0.26   2,823
IMDB, LSTM        1      0.592±0.0041   0.512±0.0040  77.6±0.26   2,693
IMDB, LSTM        2      0.560±0.0042   0.484±0.0041  78.2±0.26   2,694
IMDB, LSTM        3      0.550±0.0051   0.491±0.0050  78.8±0.26   2,695

Rank   MNIST, MLP Dense 2, SNR at step
       1000         5000        9000
full   4.13±0.027   4.45±0.091  3.21±0.035
1      5840±190     158±3.8     5.3±0.20
2      7500±240     140±11      4.3±0.26
3      7000±270     117±1.7     4.1±0.20

Rank   MNIST, MLP, -ELBO at step
       1000          5000         9000
full   42.16±0.070   26.52±0.016  15.39±0.016
1      43.11±0.039   14.85±0.017  2.06±0.027
2      42.74±0.090   13.97±0.023  1.82±0.017
3      42.63±0.068   13.61±0.020  1.80±0.031
Figure 5: Left: impact of the $k$-tied Normal posterior on test ELBO, test predictive performance and number of model parameters. We report the test metrics on the test splits of the respective datasets as the mean and SEM across 100 weight samples after training each of the models for 300 epochs. The $k$-tied Normal distribution with rank $k \geq 2$ allows training models with a smaller number of parameters without decreasing the predictive performance. Top right: mean gradient SNR in the log posterior standard deviation parameters of the Dense 2 layer of the MNIST MLP model at increasing training steps for different ranks of tying $k$. The $k$-tied Normal distribution significantly increases the SNR for these parameters. We observe a similar increase in the SNR from tying in all the layers that use the $k$-tied Normal posterior. Bottom right: negative ELBO on the MNIST validation dataset at increasing training steps for different ranks of tying $k$. The higher SNR from the $k$-tied Normal posterior translates into increased convergence speed for the MLP model. We report the mean and SEM across 3 training runs with different random seeds in both the top right and the bottom right table.
Figure 6: Convergence of negative ELBO (lower is better) reported for validation dataset when training with tied variational posterior standard deviations for MLP (left), CNN (center), and LSTM (right) with different low-rank factorizations of the posterior standard deviation matrix. Full-rank is the standard parameterization of the GMFVI.
Training method          Train step time [ms]
Point estimate           2.00±0.0064
Standard GMFVI           7.17±0.014
$k$-tied Normal GMFVI    6.14±0.018
Table 4: Training step evaluation times for a simple model architecture with two dense layers for different training methods. We report the mean and SEM of evaluation times across a single training run in the Google Colab environment linked in Section 3.2. The $k$-tied Normal posterior does not increase the train step evaluation times compared to the standard GMFVI posterior parameterization. We expect this to hold more generally because the biggest additional operation per step when using the $k$-tied Normal posterior is the multiplication $\mathbf{u}\mathbf{v}^{\top}$ that materializes the matrix of posterior standard deviations $\boldsymbol{\sigma} \in \mathbb{R}^{m \times n}$, where $\mathbf{u} \in \mathbb{R}^{m \times k}$, $\mathbf{v} \in \mathbb{R}^{n \times k}$ and $k$ is a small value (e.g., 2 or 3). The time complexity of this operation is $O(kmn)$, which is usually negligible compared to the time complexity $O(bmn)$ of the data-weight matrix multiplication, where $b$ is the batch size.

4 Related work

The application of variational inference to neural networks dates back at least to Peterson (1987) and Hinton & Van Camp (1993). Many developments have followed those seminal research efforts (we refer the interested reader to Zhang et al. (2018) for a recent review of variational inference), in particular regarding (1) the expressiveness of the variational posterior distribution and (2) the way the variational parameters themselves can be structured to lead to compact, easier-to-learn and scalable formulations. We organize the discussion of this section around those two aspects, with a specific focus on the Gaussian case.

Full Gaussian posterior.

Because of their substantial memory and computational cost, Gaussian variational distributions with full covariance matrices have been primarily applied to (generalized) linear models and shallow neural networks (Jaakkola & Jordan, 1997; Barber & Bishop, 1998; Marlin et al., 2011; Titsias & Lázaro-Gredilla, 2014; Miller et al., 2017; Ong et al., 2018).

To represent the dense covariance matrix efficiently in terms of variational parameters, several schemes have been proposed, including the sum of low-rank plus diagonal matrices (Barber & Bishop, 1998; Seeger, 2000; Miller et al., 2017; Zhang et al., 2017; Ong et al., 2018), the Cholesky decomposition (Challis & Barber, 2011) or by operating instead on the precision matrix (Tan & Nott, 2018; Mishkin et al., 2018).

Gaussian posterior with block-structured covariances.

In the context of Bayesian neural networks, the layers represent a natural structure to be exploited by the covariance matrix. When assuming independence across layers, the resulting covariance matrix exhibits a block-diagonal structure that has been shown to be a well-performing simplification of the dense setting (Sun et al., 2017; Zhang et al., 2017), with both memory and computational benefits.

Within each layer, the corresponding diagonal block of the covariance matrix can be represented by a Kronecker product of two smaller matrices (Louizos & Welling, 2016; Sun et al., 2017), possibly with a parameterization based on rotation matrices (Sun et al., 2017). Finally, using similar techniques, Zhang et al. (2017) proposed to use a block tridiagonal structure that better approximates the behavior of a dense covariance.

Fully factorized mean-field Gaussian posterior.

A fully factorized Gaussian variational distribution constitutes the simplest option for variational inference. The resulting covariance matrix is diagonal and all underlying parameters are assumed to be independent. While the mean-field assumption is known to have some limitations—e.g., underestimated variance of the posterior distribution (Turner & Sahani, 2011) and robustness issues (Giordano et al., 2018)—it leads to scalable formulations, with already competitive performance, as for instance illustrated by the recent uncertainty quantification benchmark of Ovadia et al. (2019).

Because of its simplicity and scalability, the fully-factorized Gaussian variational distribution has been widely used for Bayesian neural networks (Graves, 2011; Ranganath et al., 2014; Blundell et al., 2015; Hernández-Lobato & Adams, 2015; Zhang et al., 2017; Khan et al., 2018).

Our approach can be seen as an attempt to further reduce the number of parameters of the (already) diagonal covariance matrix. Closest to our approach is the work of Louizos & Welling (2016). Their matrix variate Gaussian distribution instantiated with the Kronecker product of the diagonal row- and column-covariance matrices leads to a rank-1 tying of the posterior variances. In contrast, we explore tying strategies beyond the rank-1 case, which we show to lead to better performance (both in terms of ELBO and predictive metrics). Importantly, we further prove that tying strategies with a rank greater than one cannot be represented in a matrix variate Gaussian distribution, thus clearly departing from Louizos & Welling (2016) (see Appendix A for details).

Our approach can be also interpreted as a form of hierarchical variational inference from  Ranganath et al. (2016). In this interpretation, the prior on the variational parameters corresponds to a Dirac distribution, non-zero only when a pre-specified low-rank tying relationship holds. More recently, Karaletsos et al. (2018) proposed a hierarchical structure which also couples network weights, but achieves this by introducing representations of network units as latent variables.

We close this related work section by mentioning the existence of other strategies to produce more flexible approximate posteriors, e.g., normalizing flows (Rezende & Mohamed, 2015) and extensions thereof (Louizos & Welling, 2017).

5 Conclusion

In this work we have shown that Bayesian Neural Networks trained with standard Gaussian Mean-Field Variational Inference learn posterior standard deviation matrices that can be approximated with little information loss by low-rank SVD decompositions. This suggests that richer parameterizations of the variational posterior may not always be needed, and that compact parameterizations can also work well. We used this insight to propose a simple, yet effective variational posterior parameterization, which speeds up training and reduces the number of variational parameters without degrading predictive performance on a range of model types.

In future work, we hope to scale up variational inference with compactly parameterized approximate posteriors to much larger models and more complex problems. For mean-field variational inference to work well in that setting several challenges will likely need to be addressed (Osawa et al., 2019); improving the signal-to-noise ratio of ELBO gradients using our compact variational parameterizations may provide a piece of the puzzle.

References

  • Barber & Bishop (1998) Barber, D. and Bishop, C. M. Ensemble learning for multi-layer networks. In Advances in neural information processing systems, pp. 395–401, 1998.
  • Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
  • Challis & Barber (2011) Challis, E. and Barber, D. Concave gaussian variational approximations for inference in large-scale bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 199–207, 2011.
  • Chollet et al. (2015) Chollet, F. et al. Keras, 2015.
  • Dillon et al. (2017) Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. Tensorflow distributions. arXiv preprint arXiv:1711.10604, 2017.
  • Farquhar et al. (2020) Farquhar, S., Smith, L., and Gal, Y. Try Depth instead of weight correlations: Mean-field is a less restrictive assumption for variational inference in deep networks. Bayesian Deep Learning Workshop at NeurIPS, 2020.
  • Giordano et al. (2018) Giordano, R., Broderick, T., and Jordan, M. I. Covariances, robustness and variational bayes. The Journal of Machine Learning Research, 19(1):1981–2029, 2018.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
  • Graves (2011) Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356, 2011.
  • Gupta & Nagar (2018) Gupta, A. K. and Nagar, D. K. Matrix variate distributions. Chapman and Hall/CRC, 2018.
  • He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
  • He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
  • He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016b.
  • Hernández-Lobato & Adams (2015) Hernández-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869, 2015.
  • Hinton & Van Camp (1993) Hinton, G. and Van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer, 1993.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Jaakkola & Jordan (1997) Jaakkola, T. and Jordan, M. A variational approach to bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, pp. 4, 1997.
  • Karaletsos et al. (2018) Karaletsos, T., Dayan, P., and Ghahramani, Z. Probabilistic meta-representations of neural networks. arXiv preprint arXiv:1810.00555, 2018.
  • Khan et al. (2018) Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Krizhevsky et al. (2009a) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009a.
  • Krizhevsky et al. (2009b) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009b.
  • LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Louizos & Welling (2016) Louizos, C. and Welling, M. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716, 2016.
  • Louizos & Welling (2017) Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. JMLR. org, 2017.
  • Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Association for Computational Linguistics, 2011.
  • MacKay (1992) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
  • Marlin et al. (2011) Marlin, B. M., Khan, M. E., and Murphy, K. P. Piecewise bounds for estimating bernoulli-logistic latent gaussian models. In Proceedings of the International Conference on Machine Learning, pp. 633–640, 2011.
  • Miller et al. (2017) Miller, A. C., Foti, N. J., and Adams, R. P. Variational boosting: Iteratively refining posterior approximations. In Proceedings of the 34th International Conference on Machine Learning, pp. 2420–2429. JMLR. org, 2017.
  • Mishkin et al. (2018) Mishkin, A., Kunstner, F., Nielsen, D., Schmidt, M., and Khan, M. E. Slang: Fast structured covariance approximations for bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pp. 6245–6255, 2018.
  • Neal (1993) Neal, R. M. Bayesian learning via stochastic dynamics. In Advances in neural information processing systems, pp. 475–482, 1993.
  • Ong et al. (2018) Ong, V. M.-H., Nott, D. J., and Smith, M. S. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478, 2018.
  • Osawa et al. (2019) Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with bayesian principles. arXiv preprint arXiv:1906.02506, 2019.
  • Ovadia et al. (2019) Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.
  • Peterson (1987) Peterson, C. A mean field theory learning algorithm for neural networks. Complex systems, 1:995–1019, 1987.
  • Raftery et al. (2005) Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. Using bayesian model averaging to calibrate forecast ensembles. Monthly weather review, 133(5):1155–1174, 2005.
  • Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
  • Ranganath et al. (2016) Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333, 2016.
  • Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538, 2015.
  • Salimans et al. (2013) Salimans, T., Knowles, D. A., et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
  • Seeger (2000) Seeger, M. Bayesian model selection for support vector machines, gaussian processes and other kernel classifiers. In Advances in neural information processing systems, pp. 603–609, 2000.
  • Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.
  • Sun et al. (2017) Sun, S., Chen, C., and Carin, L. Learning structured weight uncertainty in bayesian neural networks. In Artificial Intelligence and Statistics, pp. 1283–1292, 2017.
  • Tan & Nott (2018) Tan, L. S. and Nott, D. J. Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28(2):259–275, 2018.
  • Titsias & Lázaro-Gredilla (2014) Titsias, M. and Lázaro-Gredilla, M. Doubly stochastic variational bayes for non-conjugate inference. In International conference on machine learning, pp. 1971–1979, 2014.
  • Tran et al. (2019) Tran, D., Dusenberry, M. W., Hafner, D., and van der Wilk, M. Bayesian Layers: A module for neural network uncertainty. In Neural Information Processing Systems, 2019.
  • Turner & Sahani (2011) Turner, R. and Sahani, M. Two problems with variational expectation maximisation for time-series models, pp. 109–130. Cambridge University Press, 2011.
  • Wen et al. (2018) Wen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.
  • Zhang et al. (2018) Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 2018.
  • Zhang et al. (2017) Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.

Appendix A Proof of the Matrix Variate Normal Parameterization

In this section of the appendix, we formally explain the connections between the $k$-tied Normal distribution and the matrix variate Gaussian distribution (Gupta & Nagar, 2018), referred to as $\mathcal{MN}$.

Consider positive definite matrices $\mathbf{A} \in \mathbb{R}^{m \times m}$ and $\mathbf{B} \in \mathbb{R}^{n \times n}$ and some arbitrary matrix $\mathbf{M} \in \mathbb{R}^{m \times n}$. We have by definition that $\mathbf{W} \sim \mathcal{MN}(\mathbf{M}, \mathbf{A}, \mathbf{B})$ if and only if $\mathrm{vec}(\mathbf{W}) \sim \mathcal{N}\big(\mathrm{vec}(\mathbf{M}), \mathbf{B} \otimes \mathbf{A}\big)$, where $\mathrm{vec}(\cdot)$ stacks the columns of a matrix and $\otimes$ is the Kronecker product.

The $\mathcal{MN}$ distribution has already been used for variational inference by Louizos & Welling (2016) and Sun et al. (2017). In particular, Louizos & Welling (2016) consider the case where both $\mathbf{A}$ and $\mathbf{B}$ are restricted to be diagonal matrices, $\mathbf{A} = \mathrm{diag}(\mathbf{a})$ and $\mathbf{B} = \mathrm{diag}(\mathbf{b})$. In that case, the resulting distribution corresponds to our $k$-tied Normal distribution with $k=1$ since $\mathbf{B} \otimes \mathbf{A} = \mathrm{diag}\big(\mathrm{vec}(\mathbf{a}\mathbf{b}^{\top})\big)$, which is the covariance of a $1$-tied Normal with $\mathbf{a} = \mathbf{u}^{2}$ and $\mathbf{b} = \mathbf{v}^{2}$ (squares taken elementwise).

Importantly, we prove below that, in the case where $k \geq 2$, the $k$-tied Normal distribution cannot be represented as a matrix variate Gaussian distribution.

Lemma (Rank-2 matrix and Kronecker product).

Let $\mathbf{V}$ be a rank-2 matrix in $\mathbb{R}^{m \times n}$ with strictly positive entries. There do not exist matrices $\mathbf{A} \in \mathbb{R}^{m \times m}$ and $\mathbf{B} \in \mathbb{R}^{n \times n}$ such that $\mathbf{B} \otimes \mathbf{A} = \mathrm{diag}\big(\mathrm{vec}(\mathbf{V})\big)$.

Proof.

Let us introduce the shorthand $\mathbf{D} = \mathrm{diag}\big(\mathrm{vec}(\mathbf{V})\big)$. By construction, $\mathbf{D}$ is diagonal and has its diagonal terms strictly positive (it is assumed that $\mathbf{V}$ has strictly positive entries, i.e., $V_{ij} > 0$ for all $i, j$).

We proceed by contradiction. Assume there exist $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{B} \otimes \mathbf{A} = \mathbf{D}$.

This implies that all diagonal blocks of $\mathbf{B} \otimes \mathbf{A}$, which are of the form $B_{jj} \mathbf{A}$, are themselves diagonal with strictly positive diagonal terms. Thus, $B_{jj} \mathbf{A}$ is diagonal for all $j$, which implies in turn that $\mathbf{A}$ is diagonal, with non-zero diagonal terms $A_{ii} \neq 0$ and $B_{jj} \neq 0$. Moreover, since the off-diagonal blocks $B_{jl} \mathbf{A}$ for $j \neq l$ must be zero and $\mathbf{A} \neq \mathbf{0}$, we have $B_{jl} = 0$ and $\mathbf{B}$ is also diagonal.

To summarize, if there exist $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{B} \otimes \mathbf{A} = \mathbf{D}$, then it holds that $\mathrm{diag}(\mathbf{b}) \otimes \mathrm{diag}(\mathbf{a}) = \mathbf{D}$ with $\mathbf{a} = \mathrm{diag}(\mathbf{A})$ and $\mathbf{b} = \mathrm{diag}(\mathbf{B})$. This last equality can be rewritten as $a_{i} b_{j} = V_{ij}$ for all $i$ and $j$, or equivalently

$\mathbf{V} = \mathbf{a}\mathbf{b}^{\top}.$

This leads to a contradiction since $\mathbf{a}\mathbf{b}^{\top}$ has rank one while $\mathbf{V}$ is assumed to have rank two. ∎

Figure 7 provides an illustration of the difference between the $k$-tied Normal and the $\mathcal{MN}$ distribution.

Figure 7: Illustration of the difference in modeling of the posterior covariance by the $k$-tied Normal distribution (green), the $\mathcal{MN}$ distribution (red), the Gaussian mean field (blue) and the full Gaussian covariance (black) for a single layer. The $k$-tied Normal with $k=1$ is equivalent to the $\mathcal{MN}$ with diagonal row and column covariance matrices (half-red, half-green circle). Our experiments show that this $\mathcal{MN}$ fails to capture the performance of the mean field. On the other hand, while the full/non-diagonal $\mathcal{MN}$ increases the expressiveness of the posterior, it also increases the number of parameters. In contrast, the $k$-tied Normal distribution with $k \geq 2$ not only decreases the number of parameters, but also matches the predictive performance of the mean field.

Appendix B Experimental details

In this section we provide additional information on the experimental setup used in the main paper. In particular, we describe the details of the models and datasets, the utilized standard Gaussian Mean Field Variational Inference (GMFVI) training procedure, the low-rank structure analysis of the GMFVI trained posteriors and the proposed -tied Normal posterior training procedure.

B.1 Models and datasets

To confirm the validity of our results, we perform the experiments on a range of models and datasets with different data types, architecture types and sizes. Below we describe their details.

MLP MNIST

Multilayer perceptron (MLP) model with three dense layers and ReLU activations trained on the MNIST dataset (LeCun & Cortes, 2010). The three layers have sizes of 400, 400 and 10 hidden units. We preprocess the images so that their values lie in a fixed range. We use the last 10,000 examples of the training set as a validation set.

LeNet CNN CIFAR-100

LeNet convolutional neural network (CNN) model (LeCun et al., 1998) with two convolutional layers followed by two dense layers, all interleaved with ReLU activations. The two convolutional layers have 32 and 64 output filters, respectively, and the two dense layers have sizes of 512 and 100 hidden units. We train this network on the CIFAR-100 dataset (Krizhevsky et al., 2009b). We preprocess the images so that their values lie in a fixed range. We use the last 10,000 examples of the training set as a validation set.

LSTM IMDB

Long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) that consists of an embedding and an LSTM cell, followed by a dense layer with a single unit. The LSTM cell consists of two dense weight matrices, namely the kernel and the recurrent kernel. The embedding and the LSTM cell both have a 128-dimensional output space. More precisely, we adopt the publicly available LSTM Keras (Chollet et al., 2015) example (see: https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py), except that we set the dropout rate to zero. We train this model on the IMDB text sentiment classification dataset (Maas et al., 2011), in which we use the last 5,000 examples of the training set as a validation set.

ResNet-18 CIFAR-10

ResNet-18 model (He et al., 2016a) trained on the CIFAR-10 dataset (Krizhevsky et al., 2009b). We adopt the ResNet-18 implementation from the Tensorflow Probability (Dillon et al., 2017) repository (see: https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/cifar10_bnn.py). We train/evaluate this model on the train/test split of 50,000 and 10,000 images, respectively, from the CIFAR-10 dataset available in Tensorflow Datasets (see: https://www.tensorflow.org/datasets/catalog/cifar10).

B.2 GMFVI training

We train all the above models using GMFVI. We split the discussion of the details of the GMFVI training procedure into two parts. First, we describe the setup for the MLP, CNN and LSTM models, for which we prepare our own GMFVI implementations. Second, we explain the setup for the GMFVI training of the ResNet-18 model, for which we use the implementation available in the Tensorflow Probability repository as mentioned above.

MLP, CNN and LSTM

In the MLP and the CNN models, we approximate the posterior using GMFVI for all the weights (both kernel and bias weights). For the LSTM model, we approximate the posterior using GMFVI only for the kernel weights, while for the bias weights we use a point estimate. For all three models, we use the standard reparameterization trick estimator (Kingma & Welling, 2013). We initialize the GMFVI posterior means using the standard He initialization (He et al., 2015) and the GMFVI posterior standard deviations using samples from a fixed initialization distribution. Furthermore, we use a Normal prior with a single scalar standard deviation hyper-parameter for all the layers. We select this hyper-parameter for each of the models separately from a set of candidate values based on the validation set performance.

We optimize the variational parameters using an Adam optimizer (Kingma & Ba, 2014). We pick the optimal learning rate for each model from a set of candidate values, also based on the validation set performance. We use a batch size of 1024 for the MLP and CNN models, and a batch size of 128 for the LSTM model. We train all the models until ELBO convergence.

To implement the MLP and CNN models we use the tfp.layers module from Tensorflow Probability, while to implement the LSTM model we use the LSTMCellReparameterization class (see: https://github.com/google/edward2/blob/master/edward2/tensorflow/layers/recurrent.py) from the Edward2 Layers module (Tran et al., 2019).

ResNet-18

The specific details of the GMFVI training of the ResNet-18 model can be found in the previously linked implementation from the Tensorflow Probability repository. Here, we describe the most important and distinctive aspects of this implementation.

The ResNet-18 model approximates the posterior using GMFVI only for the kernel weights, while for the bias weights it uses a point estimate. The model uses the Flipout estimator (Wen et al., 2018) and a constraint on the maximum value of the GMFVI posterior standard deviations. The GMFVI posterior means and log standard deviations are initialized using samples from fixed initialization distributions. Furthermore, the model uses a Normal prior for all of its layers.

The variational parameters are trained using the Adam optimizer with a fixed learning rate and a batch size of 128. The model is trained for 700 epochs. The contribution of the KL term in the negative Evidence Lower Bound (ELBO) is annealed linearly from zero to its full contribution over the first 50 epochs (Sønderby et al., 2016).

B.3 Low-rank structure analysis

After training the above models using GMFVI, we investigate the low-rank structure in their trained variational posteriors. For the MLP, CNN and LSTM models, we investigate the low-rank structure of their dense layers only. For the ResNet-18 model, we investigate both its dense and convolutional layers.

To investigate the low-rank structure in the GMFVI posterior of a dense layer, we inspect the spectrum of the posterior mean and standard deviation matrices. In particular, for both the posterior mean and standard deviation matrices, we consider the fraction of the variance explained by the top singular values from their SVD (see Figure 3 in the main paper). Furthermore, we explore the impact on predictive performance of approximating the full matrices with their low-rank approximations using only the components corresponding to the top singular values (see Table 2 in the main paper). Note that such low-rank approximations may contain values below zero. This has to be addressed when approximating the matrices of the posterior standard deviations, which can contain only positive values. Therefore, we use a lower bound of zero for the values of the approximations to the posterior standard deviations.

To investigate the low-rank structure in a GMFVI posterior of a convolutional layer, we need to add a few more steps compared to those for a dense layer. In particular, the weights of the convolutional layers considered here are 4-dimensional, instead of 2-dimensional as in the dense layers. Therefore, before performing the SVD, as for the dense layers, we first reshape the 4-dimensional weight tensor from the convolutional layer into a 2-dimensional weight matrix. More precisely, we flatten all dimensions of the weight tensor except for the last dimension, so that the 4-dimensional tensor becomes a 2-dimensional matrix whose number of columns equals the number of output filters. Figure 8 contains example visualizations of the resulting flattened 2-dimensional matrices (after this specific reshape operation, all the weights corresponding to a single output filter are contained in a single column of the resulting weight matrix). Given the 2-dimensional form of the weight tensor, we can investigate the low-rank structure in the convolutional layers as for the dense layers. As noted already in Figure 4 in the main paper, we observe the same strong low-rank structure in the flattened convolutional layers as in the dense layers. Interestingly, the low-rank structure is the most visible in the final convolutional layers, which also contain the highest number of parameters, see Figure 9.

Importantly, note that after performing the low-rank approximation in this 2-dimensional space, we can reshape the resulting 2-dimensional low-rank matrices back into the 4-dimensional form of a convolutional layer. Table 5 shows that such a low-rank approximation of the convolutional layers of the analyzed ResNet-18 model can be performed without a loss in the model’s predictive performance, while significantly reducing the total number of model parameters.
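The flattening, low-rank truncation and un-flattening described above can be done with a single reshape in each direction. The sketch below is our own illustration with a hypothetical kernel shape; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical conv posterior std-dev tensor: (kernel_h, kernel_w, in_channels, out_filters).
sigma_conv = np.abs(rng.normal(size=(3, 3, 64, 128)))

# Flatten everything except the output-filter dimension -> a 2-D matrix.
orig_shape = sigma_conv.shape
sigma_2d = sigma_conv.reshape(-1, orig_shape[-1])        # shape (3*3*64, 128)

# Low-rank truncation in the 2-D space, floored at zero (std devs are non-negative).
rank = 2
U, s, Vt = np.linalg.svd(sigma_2d, full_matrices=False)
sigma_2d_lr = np.maximum((U[:, :rank] * s[:rank]) @ Vt[:rank, :], 0.0)

# Reshape back to the 4-D convolutional form.
sigma_conv_lr = sigma_2d_lr.reshape(orig_shape)
print(sigma_conv_lr.shape)                               # (3, 3, 64, 128)
```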

Figure 8: Heat maps of the partially flattened posterior standard deviation tensors for the selected convolutional layers of the ResNet-18 GMFVI BNN trained on CIFAR-10. The partially flattened posterior standard deviation tensors of the convolutional layers display similar low-rank patterns that we observe for the dense layers.
Figure 9: Fraction of variance explained per each singular value from SVD of partially flattened tensors of posterior means and posterior standard deviations for different convolutional layers of the ResNet-18 GMFVI BNN trained on CIFAR-10. Posterior standard deviations clearly display strong low-rank structure, with most of the variance contained in the top few singular values, while this is not the case for posterior means. Interestingly, the low-rank structure is the most visible for the final convolutional layers, which also contain the highest number of parameters.
Rank   -ELBO          NLL           Accuracy    #Params     %Params
Full   122.61±0.012   0.495±0.0080  83.5±0.37   9,814,026   100.0
1      122.57±0.012   0.658±0.0069  81.7±0.39   4,929,711   50.2
2      122.77±0.012   0.503±0.0080  83.2±0.37   4,946,964   50.4
3      122.67±0.012   0.501±0.0079  83.2±0.37   4,964,217   50.6
Table 5: Impact of the low-rank approximation of the GMFVI-trained posterior standard deviations of a ResNet-18 model on the model’s predictive performance. We report mean and SEM of each metric across 100 weights samples. The low-rank approximations with ranks higher than one achieve predictive performance close to that when not using any approximations, while significantly reducing the number of model parameters.

B.4 $k$-tied Normal posterior training

To exploit the low-rank structure observation, we propose the $k$-tied Normal posterior, as discussed in Section 3. We study the properties of the $k$-tied Normal posterior applied to the MLP, CNN and LSTM models. We use the $k$-tied Normal variational posterior for all the dense layers of the analyzed models. Namely, we use the $k$-tied Normal variational posterior for all three layers of the MLP model, for the two dense layers of the CNN model and for the LSTM cell's kernel and recurrent kernel.

We initialize the parameters $\mathbf{u}$ and $\mathbf{v}$ of the $k$-tied Normal distribution so that after the outer-product operation the respective standard deviations have the same mean values as we obtain when using the standard GMFVI posterior parameterization. More precisely, we parameterize $\mathbf{u}$ and $\mathbf{v}$ in the log domain and initialize them so that the materialized standard deviations $\mathbf{u}\mathbf{v}^{\top}$ have the desired mean value. We also add white noise to the values of $\mathbf{u}$ and $\mathbf{v}$ in the log domain to break symmetry.
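A sketch of this initialization strategy (our own illustration; the target mean standard deviation and noise scale below are hypothetical, since the exact values are configuration-dependent): each log-factor is set so that the outer product of $\exp(\mathbf{u})$ and $\exp(\mathbf{v})$ has the target mean, plus white noise to break symmetry.

```python
import numpy as np

rng = np.random.default_rng(8)

def init_tied_log_factors(m, n, k, target_std=0.02, noise=0.01):
    """Initialize log-u (m x k) and log-v (n x k) so that the mean of
    exp(log_u) @ exp(log_v).T is close to `target_std` (hypothetical value),
    with small white noise added in the log domain to break symmetry."""
    # Split the target evenly between the two factors and across the k components.
    base = 0.5 * np.log(target_std / k)
    log_u = base + noise * rng.normal(size=(m, k))
    log_v = base + noise * rng.normal(size=(n, k))
    return log_u, log_v

log_u, log_v = init_tied_log_factors(m=400, n=400, k=2)
sigma = np.exp(log_u) @ np.exp(log_v).T
print("mean initialized std dev:", sigma.mean().round(4))   # close to 0.02
```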

During training of the models with the $k$-tied Normal posterior, we linearly anneal the contribution of the KL term of the ELBO loss. We select the best linear annealing coefficient from a set of candidate values (per batch) and increase the effective contribution every 100 batches in a step-wise manner. In particular, we anneal the KL term to obtain the predictive performance results for all the models in Figure 5 in the main paper. However, we do not perform the annealing in the signal-to-noise ratio (SNR) and negative ELBO convergence speed experiments in the same Figure 5. In these two cases, KL annealing would occlude the values of interest, which show the clear impact of the $k$-tied Normal posterior.