Bayesian neural networks are a popular class of deep learning models. The most widespread approach for training these models relies on variational inference (Peterson, 1987; Hinton & Van Camp, 1993), a training paradigm that approximates a Bayesian posterior with a simpler class of distributions by solving an optimization problem. The common wisdom is that more expressive distributions lead to better posterior approximations and ultimately to better model performance. This paper calls this wisdom into question: we find that for Bayesian neural networks, more restrictive classes of distributions, based on low-rank factorizations, can outperform the common mean-field family.
Bayesian neural networks explicitly represent their parameter uncertainty by forming a posterior distribution over model parameters, instead of relying on a single point estimate for making predictions, as is done in traditional deep learning. For neural network weights $w$, features $x$ and labels $y$, the posterior distribution $p(w \mid x, y)$ is computed using Bayes’ rule, which multiplies the prior distribution $p(w)$ and the data likelihood $p(y \mid w, x)$ and renormalizes. When predicting with Bayesian neural networks, we form an average over model predictions where each prediction is generated using a set of parameters that is randomly sampled from the posterior distribution. Bayesian neural networks are thus a type of ensembling, of which various types have proven highly effective in deep learning (see e.g. Goodfellow et al., 2016, sec 7.11).
Besides offering improved predictive performance over single models, Bayesian ensembles are also more robust because ensemble members will tend to make different predictions on hard examples (Raftery et al., 2005). In addition, the diversity of the ensemble represents predictive uncertainty and can be used for out-of-domain detection or other risk-sensitive applications (Ovadia et al., 2019).
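The posterior averaging described above can be illustrated with a minimal NumPy sketch. It assumes a toy linear softmax classifier with a factorized Gaussian posterior; the function names, shapes and values are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bayesian_predict(x, mu, sigma, num_samples, rng):
    """Average class probabilities over weights sampled from N(mu, sigma^2)."""
    probs = np.zeros((x.shape[0], mu.shape[1]))
    for _ in range(num_samples):
        w = mu + sigma * rng.standard_normal(mu.shape)  # one posterior sample
        probs += softmax(x @ w)                         # one ensemble member
    return probs / num_samples

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))      # 4 examples, 3 features (toy data)
mu = rng.standard_normal((3, 2))     # assumed posterior means
sigma = 0.1 * np.ones((3, 2))        # assumed posterior standard deviations
avg_probs = bayesian_predict(x, mu, sigma, num_samples=50, rng=rng)
```

Each sampled weight matrix acts as one ensemble member, and the spread of the per-sample predictions is what carries the predictive uncertainty.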
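Variational inference is a popular class of methods for approximating the posterior distribution $p(w \mid x, y)$, since the exact Bayes’ rule is often intractable to compute for models of practical interest. This class of methods specifies a distribution $q_\theta(w)$ of given parametric or functional form as the posterior approximation, and fits its parameters $\theta$ by solving an optimization problem. In particular, we minimize the Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}$ between the variational distribution $q_\theta(w)$ and the true posterior distribution $p(w \mid x, y)$, which is given by

$$D_{\mathrm{KL}}\big(q_\theta(w) \,\|\, p(w \mid x, y)\big) = \mathbb{E}_{q_\theta(w)}\left[\log \frac{q_\theta(w)}{p(w \mid x, y)}\right] = \mathbb{E}_{q_\theta(w)}\left[\log \frac{q_\theta(w)}{p(w)\, p(y \mid w, x)}\right] + \log p(y \mid x). \tag{1}$$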
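Here, we do not know the normalizing constant of the exact posterior, $p(y \mid x)$, but since this term does not depend on $\theta$, we may ignore it for the purpose of optimizing our approximation $q_\theta(w)$. We are then left with what is called the negative Evidence Lower Bound (negative ELBO):

$$\mathcal{L}(\theta) = D_{\mathrm{KL}}\big(q_\theta(w) \,\|\, p(w)\big) - \mathbb{E}_{q_\theta(w)}\big[\log p(y \mid w, x)\big]. \tag{2}$$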
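In practice, the expectation of the log-likelihood with respect to $q_\theta(w)$ is usually not analytically tractable and instead is estimated using Monte Carlo sampling:

$$\mathbb{E}_{q_\theta(w)}\big[\log p(y \mid w, x)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p\big(y \mid w^{(s)}, x\big), \qquad w^{(s)} \sim q_\theta(w).$$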
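As a rough illustration of the Monte Carlo estimate of the negative ELBO, the sketch below assumes a toy linear regression model with a unit-variance Gaussian likelihood, a zero-mean Gaussian prior and a factorized Gaussian approximation, for which the KL term has a closed form. All names and the model itself are assumptions made for the example, not the setup used in the paper.

```python
import numpy as np

def kl_diag_gaussians(mu, sigma, prior_std):
    """KL( N(mu, sigma^2) || N(0, prior_std^2) ), summed over independent weights."""
    return np.sum(np.log(prior_std / sigma)
                  + (sigma ** 2 + mu ** 2) / (2.0 * prior_std ** 2) - 0.5)

def log_likelihood(w, x, y, noise_std=1.0):
    """Gaussian log-likelihood of the toy model y ~ N(x @ w, noise_std^2)."""
    resid = y - x @ w
    return np.sum(-0.5 * (resid / noise_std) ** 2
                  - np.log(noise_std) - 0.5 * np.log(2.0 * np.pi))

def negative_elbo(mu, sigma, x, y, prior_std, num_samples, rng):
    """KL term minus a Monte Carlo estimate of the expected log-likelihood."""
    expected_ll = 0.0
    for _ in range(num_samples):
        w = mu + sigma * rng.standard_normal(mu.shape)  # w ~ q(w)
        expected_ll += log_likelihood(w, x, y)
    return kl_diag_gaussians(mu, sigma, prior_std) - expected_ll / num_samples

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
mu, sigma = np.zeros(3), np.ones(3)
loss = negative_elbo(mu, sigma, x, y, prior_std=1.0, num_samples=8, rng=rng)
```

Increasing `num_samples` reduces the variance of the estimate at a proportional cost in likelihood evaluations.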
In Gaussian mean-field variational inference (GMFVI), we choose the variational approximation to be a fully factorized Gaussian distribution $q_\theta(w) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ with $w_{lij} \sim \mathcal{N}(\mu_{lij}, \sigma_{lij}^2)$, where $l$ is a layer number, and $i$ and $j$ are the row and column indices in the layer’s weight matrix. While Gaussian mean-field posteriors are considered to be one of the simplest types of variational approximations, with some known limitations (Giordano et al., 2018), they scale to comparatively large models and generally provide competitive performance (Ovadia et al., 2019). Additionally, Farquhar et al. (2020) have found that the mean-field assumption becomes less restrictive as the depth of the network increases. However, when compared to deterministic neural networks, GMFVI doubles the number of parameters and is often harder to train due to the increased noise in stochastic gradient estimates. Furthermore, despite the theoretical advantages of GMFVI over deterministic neural networks, GMFVI suffers from over-regularization for larger networks, which leads to underfitting and often worse predictive performance in such settings (Osawa et al., 2019).
Beyond mean-field variational inference, recent work on approximate Bayesian inference has explored ever richer parameterizations of the approximate posterior in the hope of improving the performance of Bayesian neural networks (see Figure 1). In contrast, here we study a simpler, more compactly parameterized variational approximation. Our motivation for studying this setting is to better understand the behaviour of GMFVI, with the goal of addressing the issues with its practical applicability. Consequently, we show that compact approximations can also work well for a variety of models. In particular, we find that:
- Converged posterior standard deviations under GMFVI consistently display strong low-rank structure. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing our model’s performance.
- Factorized parameterizations of posterior standard deviations improve the signal-to-noise ratio of stochastic gradient estimates, and thus not only reduce the number of parameters compared to standard GMFVI, but can also lead to faster convergence.
2 Mean field posterior standard deviations naturally have low-rank structure
In this section we show that the converged posterior standard deviations of Bayesian neural networks trained using standard GMFVI consistently display strong low-rank structure. We also show that it is possible to compress the learned posterior standard deviation matrix using a low-rank approximation without decreasing the network’s performance. We first briefly introduce the mathematical notation for our GMFVI setting and the low-rank approximation that we explore. We then provide experimental results that support the two main claims of this section.
To avoid any confusion among the readers, we would like to clarify that we use the terminology “low-rank” in a particular context. While variational inference typically makes use of low-rank decompositions to compactly represent the dense covariance of a Gaussian variational distribution (see numerous references in Section 4), we investigate instead underlying low-rank structures within the already diagonal covariance of a Gaussian fully-factorized variational distribution. We will make this explanation more formal in the next section.
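To introduce the notation, we consider layers that consist of a linear transformation followed by a non-linearity $f$,

$$a_l = h_l W_l + b_l, \qquad h_{l+1} = f(a_l),$$

where $W_l \in \mathbb{R}^{m \times n}$, $h_l \in \mathbb{R}^{1 \times m}$ and $b_l \in \mathbb{R}^{1 \times n}$. To simplify the notation in the following, we drop the subscript $l$ such that $W = W_l$, $h = h_l$, $b = b_l$, and we focus on the kernel matrix $W$ for a single layer.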
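In GMFVI, we model the variational posterior as

$$q(W) = \mathcal{N}\big(\operatorname{vec}(W);\, \mu, \Sigma\big),$$

where $\mu \in \mathbb{R}^{mn}$ is the posterior mean vector and $\Sigma \in \mathbb{R}^{mn \times mn}$ is the diagonal posterior covariance matrix. The weights are then usually sampled using a reparameterization trick (Kingma & Welling, 2013), i.e., for the $s$-th sample, we have

$$\operatorname{vec}\big(W^{(s)}\big) = \mu + \Sigma^{1/2} \epsilon^{(s)}, \qquad \epsilon^{(s)} \sim \mathcal{N}(0, I).$$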
In practice, we often represent the posterior standard deviation parameters in the form of a matrix $A \in \mathbb{R}_{+}^{m \times n}$. Note that we have the relationship $\Sigma = \operatorname{diag}(\operatorname{vec}(A^2))$, where the elementwise-squared $A$ is vectorized by stacking its columns, and then expanded as a diagonal matrix into $\mathbb{R}^{mn \times mn}$.
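The relationship between the matrix of standard deviations and the diagonal covariance can be checked numerically. The sketch below assumes the names `M` (means in matrix form), `A` (standard deviations), `mu` and `Sigma` purely for illustration, and verifies that sampling in vector form with the diagonal covariance matches elementwise sampling in matrix form.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2
M = rng.standard_normal((m, n))            # posterior means in matrix form
A = np.exp(rng.standard_normal((m, n)))    # positive posterior standard deviations

def vec(X):
    """Vectorize a matrix by stacking its columns."""
    return X.reshape(-1, order="F")

mu = vec(M)                                # mean vector of length m * n
Sigma = np.diag(vec(A ** 2))               # diagonal covariance, (m*n) x (m*n)
eps = rng.standard_normal(m * n)           # standard normal noise

# Vector form: for a diagonal matrix the elementwise square root is its matrix square root.
w_vector_form = mu + np.sqrt(Sigma) @ eps

# Matrix form: the same noise reshaped column-wise, scaled elementwise by A.
E = eps.reshape((m, n), order="F")
W_matrix_form = M + A * E
```

The matrix form avoids materializing the $mn \times mn$ covariance, which is why the standard deviations are usually stored as a matrix in practice.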
In the sequel, we start by empirically studying the properties of the spectrum of the matrices $A$ post training (after convergence), while using standard Gaussian mean-field variational distributions. Interestingly, we observe that those matrices naturally exhibit a low-rank structure (see Section 2.3 for the corresponding experiments), i.e.,

$$A \approx U V^\top$$

for some $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$ and $k$ a small value (e.g., 2 or 3). This observation motivates the introduction of the following variational family, which we name $k$-tied Normal:

$$k\text{-tied-}\mathcal{N}(W; \mu, U, V) = \mathcal{N}\big(\mu, \operatorname{diag}(\operatorname{vec}((U V^\top)^2))\big),$$

where the squaring of the matrix $U V^\top$ is applied elementwise. Due to the tied parameterization of the diagonal covariance matrix, we emphasize that this variational family is smaller than (i.e., included in) the standard Gaussian mean-field variational distribution family.
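A minimal sketch of sampling from such a tied posterior is given below. It assumes the standard deviations are the product of two positive factor matrices, here called `U` and `V`, and that the means are stored as a matrix `M`; these names and the toy shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sample_k_tied_normal(M, U, V, rng):
    """Draw one weight matrix with means M and tied standard deviations U @ V.T."""
    A = U @ V.T                                   # (m, n) standard deviations, rank at most k
    return M + A * rng.standard_normal(M.shape)   # elementwise reparameterization

rng = np.random.default_rng(0)
m, n, k = 5, 4, 2
M = rng.standard_normal((m, n))
U = np.exp(0.1 * rng.standard_normal((m, k)))     # positive factors keep A positive
V = np.exp(0.1 * rng.standard_normal((n, k)))
W = sample_k_tied_normal(M, U, V, rng)
```

Every weight still receives its own independent noise draw; only the standard deviations are shared through the factors.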
| Variational family | Parameters (total) |
As formally discussed in Appendix A, the matrix variate Gaussian distribution (Gupta & Nagar, 2018), referred to as $\mathcal{MN}$ and already used for variational inference by Louizos & Welling (2016) and Sun et al. (2017), is related to our $k$-tied Normal distribution with $k = 1$ when $\mathcal{MN}$ uses diagonal row and column covariances. Interestingly, we prove that for $k \geq 2$, our $k$-tied Normal distribution cannot be represented by any $\mathcal{MN}$ distribution. This illustrates the main difference of our approach from the most closely related previous work of Louizos & Welling (2016) (see also Figure 7 in Appendix A).
Notice that our diagonal covariance repeatedly reuses the same elements of $U$ and $V$, which results in parameter sharing across different weights. The total number of standard deviation parameters in our method is $k(m+n)$ from $U$ and $V$, compared to $mn$ from $A$ in the standard GMFVI parameterization. Given that in our experiments $k$ is very low (e.g., 2 or 3), this reduces the number of parameters from quadratic to linear in the dimensions of the layer (see Table 1). More importantly, such parameter sharing across the weights leads to a higher signal-to-noise ratio during training and thus, in some cases, faster convergence. We demonstrate this phenomenon in the next section. In the rest of this section, we show that the standard GMFVI methods already learn a low-rank structure in the posterior standard deviation matrix $A$. Furthermore, we provide evidence that replacing the full matrix $A$ with its low-rank approximation does not reduce the predictive performance.
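The parameter counts discussed above can be made concrete with a small calculation. The layer size of 512 by 512 below is a hypothetical example chosen for illustration and does not correspond to a layer from the paper's experiments.

```python
def std_params_mean_field(m, n):
    """One standard deviation per weight in an m x n layer."""
    return m * n

def std_params_k_tied(m, n, k):
    """Standard deviations shared through factors of shape (m, k) and (n, k)."""
    return k * (m + n)

def total_params(m, n, k=None):
    """Means plus standard deviations; k=None means the untied mean-field case."""
    stds = std_params_mean_field(m, n) if k is None else std_params_k_tied(m, n, k)
    return m * n + stds

m, n = 512, 512                          # hypothetical layer size
mean_field_total = total_params(m, n)    # 262144 means + 262144 standard deviations
k_tied_total = total_params(m, n, k=2)   # 262144 means + 2048 tied parameters
```

For this layer the tied parameterization with two factors needs only slightly more parameters than a deterministic layer with the same shape.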
2.2 Experimental setting
Before describing the experimental results, we briefly explain the key properties of the experimental setting. We analyze three types of GMFVI Bayesian neural network models:
Residual Convolutional Neural Network (ResNet): a ResNet-18 architecture (He et al., 2016b; see https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/cifar10_bnn.py) trained on all 50,000 training examples of the CIFAR-10 dataset (Krizhevsky et al., 2009a).
In each of the four models, we use the standard mean-field Normal variational posterior and a Normal prior, for which we set a single scalar standard deviation hyper-parameter shared by all layers. We optimize the variational posterior parameters using the Adam optimizer (Kingma & Ba, 2014). For a more comprehensive explanation of the experimental setup please refer to Appendix B. Finally, we highlight that our experiments focus primarily on the comparison across a broad range of model types rather than competing with the state-of-the-art results over the specifically used datasets. Nevertheless, we also show that our results extend to larger models with competitive performance such as the ResNet-18 model. Note that scaling GMFVI to such larger model sizes is still a challenging research problem (Osawa et al., 2019).
2.3 Main experimental observation
Our main experimental observation is that the standard GMFVI learns posterior standard deviation matrices that have a low-rank structure across different model types (MLP, CNN, LSTM), model sizes (LeNet, ResNet-18) and layer types (dense, convolutional). To show this, we investigate the results of the singular value decomposition (SVD) of the posterior standard deviation matrices in the four described model types. We analyze the models post training, where the models are already trained until ELBO convergence using the standard GMFVI approach. While for the first three models (MLP, CNN and LSTM) we evaluate the low-rank structure only in the dense layers, for the ResNet model we consider the low-rank structure in the convolutional layers as well.
Figure 3 shows the fraction of variance explained per singular value from the SVD of the posterior standard deviation matrices in the dense layers of the first three models. The fraction of variance explained per singular value $k$ is calculated as $\gamma_k^2 / \sum_{k'} \gamma_{k'}^2$, where $\gamma_{k'}$ are the singular values. We observe that, unlike posterior means, the posterior standard deviations have most of their variance explained by the first few singular values. In particular, a rank-1 approximation of the posterior standard deviation matrices explains most of their variance, while a rank-2 approximation can encompass nearly all of the remaining variance. Figure 2 further supports this claim visually by comparing the heat maps of the full posterior standard deviation matrix and its rank-1 and rank-2 approximations. In particular, we observe that the rank-2 approximation results in a heat map that looks very similar to the full matrix, while this is not the case for the rank-1 approximation. Importantly, Figure 4 illustrates that the same low-rank structure is also present in both the dense and the convolutional layers of the larger ResNet-18 model. In the analysis of the above experiments, we use the shorthand SEM to refer to the standard error of the mean.
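The explained-variance analysis can be sketched as follows, assuming the fraction for each singular value is its square divided by the sum of all squared singular values. The matrix below is synthetic (a positive rank-1 matrix plus small noise) and only stands in for a trained standard deviation matrix.

```python
import numpy as np

def explained_variance_fractions(A):
    """Fraction of variance explained by each singular value of A."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
    return s ** 2 / np.sum(s ** 2)

rng = np.random.default_rng(0)
u = np.exp(rng.standard_normal((30, 1)))
v = np.exp(rng.standard_normal((20, 1)))
A = u @ v.T + 0.01 * rng.random((30, 20))    # synthetic stand-in, close to rank 1
fractions = explained_variance_fractions(A)
```

A matrix with strong low-rank structure concentrates almost all of the mass of `fractions` in its first few entries.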
| MNIST, MLP | CIFAR100, CNN | IMDB, LSTM |
2.4 Low-rank approximation of variance matrices
Motivated by the above observations, we show that it is possible to replace the full posterior standard deviation matrices with their low-rank approximations without a decrease in predictive performance. Table 2 shows the performance comparison of the MLP, CNN and LSTM models with different ranks of the approximations. Figure 4 contains analogous results for the ResNet-18 model. The results show that the post-training approximations with ranks higher than one achieve predictive performance close to that of the full posterior for all the analyzed model types, model sizes and layer types. Furthermore, Table 3 shows that, for the ResNet-18 model, the approximations with ranks higher than one also do not decrease the quality of the uncertainty estimates compared to the full model without the approximations (see the TensorFlow Probability (Dillon et al., 2017) Stats module: https://www.tensorflow.org/probability/api_docs/python/tfp/stats). These observations could be used as a form of post-training network compression. Moreover, they give rise to further interesting exploration directions such as formulating posteriors that exploit such a low-rank structure. In the next section we explore this particular direction while focusing on the first three model types (MLP, CNN, LSTM).
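A post-training low-rank replacement of this kind can be sketched with a truncated SVD. The example assumes a synthetic matrix of exact rank 2 in place of a trained standard deviation matrix; the function name is hypothetical.

```python
import numpy as np

def low_rank_approximation(A, k):
    """Best rank-k approximation of A in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]    # keep the k largest singular values

rng = np.random.default_rng(0)
left = np.exp(0.3 * rng.standard_normal((6, 2)))
right = np.exp(0.3 * rng.standard_normal((5, 2)))
A = left @ right.T                           # synthetic matrix of exact rank 2

# Approximation error for ranks 1, 2 and 3.
errors = [np.linalg.norm(A - low_rank_approximation(A, k)) for k in (1, 2, 3)]
```

A truncated SVD does not by itself guarantee strictly positive entries, so an approximation used as standard deviations should be checked for positivity before sampling with it.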
3 The $k$-tied Normal Distribution: Exploiting Low-Rank Parameter-Structure in Mean Field Posteriors
In the previous section we have shown that it is possible to replace a full matrix of posterior standard deviations, which is already trained using GMFVI, with its low-rank approximation without decreasing the predictive performance. In this section we show that it is also possible to exploit this observation at training time. We achieve this by using our novel variational family, the $k$-tied Normal distribution (see Section 2.1).
We show that using this distribution in the context of GMFVI in Bayesian neural networks allows us to reduce the number of network parameters, increase the signal-to-noise ratio of the stochastic gradient estimates and speed up model convergence, while maintaining the predictive performance of the standard parameterization of GMFVI. We start by recalling the definition of the $k$-tied Normal distribution:

$$k\text{-tied-}\mathcal{N}(W; \mu, U, V) = \mathcal{N}\big(\mu, \operatorname{diag}(\operatorname{vec}((U V^\top)^2))\big),$$

where the variational parameters are comprised of $\{\mu \in \mathbb{R}^{mn}, U \in \mathbb{R}^{m \times k}, V \in \mathbb{R}^{n \times k}\}$.
3.1 Experimental setting
We now introduce the experimental setting in which we evaluate the GMFVI variational posterior parameterized by the $k$-tied Normal distribution. We assess the impact of the described posterior in terms of predictive performance and reduction in the number of parameters for the same first three model types (MLP, CNN, LSTM) and respective datasets (MNIST, CIFAR-100, IMDB) as we used in the previous section. Additionally, we analyze the impact of weight tying in the posterior on the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound for the CNN model as a representative example. Overall, the experimental setup is very similar to the one introduced in the previous section. Therefore, we highlight here only the key differences.
We apply the $k$-tied Normal variational posterior distribution to the layers which we analyzed in the previous section. Namely, we use the $k$-tied Normal variational posterior for all three layers of the MLP model, the two dense layers of the CNN model and the LSTM cell’s kernel and recurrent kernel. We initialize the parameters $u_{ik}$ from $U$ and $v_{jk}$ from $V$ of the $k$-tied Normal distribution so that after the outer-product operation the respective standard deviations have the same mean values as we obtain in the standard GMFVI posterior parameterization. In the experiments for this section, we use KL annealing (Sønderby et al., 2016), where we linearly scale up the contribution of the $D_{\mathrm{KL}}$ term in Equation 2 from zero to its full contribution over the course of training. More details about the experimental setup are available in Appendix B.
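The initialization and the KL annealing schedule described above could look roughly like the sketch below. The constant initialization is only one simple way to match a target mean standard deviation and is an assumption, as are the function names and the value of `target_std`; the authors' exact scheme is described in Appendix B.

```python
import numpy as np

def init_k_tied_factors(m, n, k, target_std):
    """Constant factors whose product has every entry equal to target_std."""
    c = np.sqrt(target_std / k)          # each entry of U @ V.T equals k * c**2
    return np.full((m, k), c), np.full((n, k), c)

def kl_weight(step, anneal_steps):
    """Linear KL annealing: weight grows from 0 to 1 over anneal_steps steps."""
    return min(1.0, step / anneal_steps)

U, V = init_k_tied_factors(m=6, n=4, k=2, target_std=0.05)
initial_stds = U @ V.T                   # standard deviations after the product

# The annealed objective would then be: kl_weight(step, T) * KL - expected log-likelihood.
weights = [kl_weight(step, 100) for step in (0, 50, 100, 200)]
```

In practice a small random perturbation of the factors may be preferable to an exactly constant initialization, so that the columns of the factors can differentiate during training.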
3.2 Experimental results
We first investigate the predictive performance of the GMFVI Bayesian neural network models trained using the $k$-tied Normal posterior distribution, with different levels of tying $k$. We compare these results to those obtained from the same models, but trained using the standard parameterization of GMFVI. Figure 5 (left) shows that for $k \geq 2$ the $k$-tied Normal posterior achieves performance competitive with the standard GMFVI posterior parameterization, while reducing the total number of model parameters. The benefits of using the $k$-tied Normal posterior are most visible for models where the layers with the $k$-tied Normal posterior constitute a significant portion of the total number of model parameters (e.g., the MLP model).
We further investigate the impact of the $k$-tied Normal posterior on the signal-to-noise ratio (SNR) of stochastic gradient estimates of the variational lower bound (ELBO). The SNR for each gradient value is calculated as $\mathbb{E}[g_b]^2 / \operatorname{Var}[g_b]$, where $g_b$ is the gradient value for a single parameter; the expectation and variance of the gradient values are calculated over a window of the last 10 batches. In particular, we focus on the gradient SNR of the GMFVI posterior standard deviation parameters for which we perform the tying. These parameters are either $U$ and $V$ for the $k$-tied Normal posterior or $A$ for the standard GMFVI parameterization, all optimized in their log forms for numerical stability. Figure 5 (top right) shows that the $U$ and $V$ parameters used in the $k$-tied Normal posterior are trained with significantly higher gradient SNR than the $A$ parameters used in the standard GMFVI parameterization. Consequently, Figure 5 (bottom right) shows that the increased SNR from the $k$-tied Normal distribution translates into faster convergence for the MLP model, which uses the $k$-tied Normal distribution in all of its layers.
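A windowed gradient SNR of this kind can be tracked with a few lines of NumPy. The sketch assumes the SNR of each parameter is the squared mean of its gradient divided by the gradient variance over the last 10 batches; the class name, the small stabilizing constant and the synthetic gradients are all illustrative assumptions.

```python
from collections import deque

import numpy as np

class GradientSNR:
    """Per-parameter gradient signal-to-noise ratio over a sliding window of batches."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)   # keeps only the most recent gradients

    def update(self, grad):
        self.history.append(np.asarray(grad, dtype=float))

    def snr(self, eps=1e-12):
        g = np.stack(self.history)            # shape: (batches in window, num params)
        return g.mean(axis=0) ** 2 / (g.var(axis=0) + eps)

tracker = GradientSNR(window=10)
for batch in range(15):
    noisy = 1.0 if batch % 2 == 0 else -1.0   # gradient whose sign flips every batch
    steady = 2.0 + 0.1 * noisy                # gradient with small noise around 2
    tracker.update([noisy, steady])
snr_values = tracker.snr()
```

The first synthetic parameter has a gradient that averages to zero and therefore a negligible SNR, while the second has a consistent direction and a high SNR.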
Note that the $k$-tied Normal posterior does not increase the training step time compared to the standard parameterization of GMFVI (see Table 4). Code to compare the training step times of the $k$-tied Normal and the standard GMFVI is available at https://colab.research.google.com/drive/14pqe_VG5s49xlcXB-Jf8S9GoTFyjv4OF; it uses the network architecture from https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/classification.ipynb. Therefore, the $k$-tied Normal posterior speeds up model convergence also in terms of wall-clock time.
Figure 6 shows the convergence plots of validation negative ELBO for all three model types. We observe that the impact of the $k$-tied Normal posterior on convergence depends on the model type. As shown in Figure 5 (bottom right), the impact on the MLP model is strong and consistent, with the $k$-tied Normal posterior increasing convergence speed compared to the standard GMFVI parameterization. For the LSTM model we observe a similar speed-up. However, for the CNN model the impact of the $k$-tied Normal posterior on the ELBO convergence is much smaller. We hypothesize that this is because we use the $k$-tied Normal posterior for all the layers trained using GMFVI in the MLP and the LSTM models, while in the CNN model we use it only for some of the GMFVI-trained layers. More precisely, in the CNN model we use the $k$-tied Normal posterior only for the two dense layers, while the two convolutional layers are trained using the standard parameterization of GMFVI.
| Training method | Train step time [ms] |
| $k$-tied Normal GMFVI | 6.14 ± 0.018 |
4 Related work
The application of variational inference to neural networks dates back at least to Peterson (1987) and Hinton & Van Camp (1993). Many developments have followed those seminal research efforts (we refer the interested readers to Zhang et al. (2018) for a recent review of variational inference), in particular regarding (1) the expressiveness of the variational posterior distribution and (2) the way the variational parameters themselves can be structured to lead to compact, easier-to-learn and scalable formulations. We organize the discussion of this section around those two aspects, with a specific focus on the Gaussian case.
Full Gaussian posterior.
Because of their substantial memory and computational cost, Gaussian variational distributions with full covariance matrices have been primarily applied to (generalized) linear models and shallow neural networks (Jaakkola & Jordan, 1997; Barber & Bishop, 1998; Marlin et al., 2011; Titsias & Lázaro-Gredilla, 2014; Miller et al., 2017; Ong et al., 2018).
To represent the dense covariance matrix efficiently in terms of variational parameters, several schemes have been proposed, including the sum of low-rank plus diagonal matrices (Barber & Bishop, 1998; Seeger, 2000; Miller et al., 2017; Zhang et al., 2017; Ong et al., 2018), the Cholesky decomposition (Challis & Barber, 2011) or by operating instead on the precision matrix (Tan & Nott, 2018; Mishkin et al., 2018).
Gaussian posterior with block-structured covariances.
In the context of Bayesian neural networks, the layers represent a natural structure to be exploited by the covariance matrix. When assuming independence across layers, the resulting covariance matrix exhibits a block-diagonal structure that has been shown to be a well-performing simplification of the dense setting (Sun et al., 2017; Zhang et al., 2017), with both memory and computational benefits.
Within each layer, the corresponding diagonal block of the covariance matrix can be represented by a Kronecker product of two smaller matrices (Louizos & Welling, 2016; Sun et al., 2017), possibly with a parameterization based on rotation matrices (Sun et al., 2017). Finally, using similar techniques, Zhang et al. (2017) proposed to use a block tridiagonal structure that better approximates the behavior of a dense covariance.
Fully factorized mean-field Gaussian posterior.
A fully factorized Gaussian variational distribution constitutes the simplest option for variational inference. The resulting covariance matrix is diagonal and all underlying parameters are assumed to be independent. While the mean-field assumption is known to have some limitations—e.g., underestimated variance of the posterior distribution (Turner & Sahani, 2011) and robustness issues (Giordano et al., 2018)—it leads to scalable formulations, with already competitive performance, as for instance illustrated by the recent uncertainty quantification benchmark of Ovadia et al. (2019).
Because of its simplicity and scalability, the fully-factorized Gaussian variational distribution has been widely used for Bayesian neural networks (Graves, 2011; Ranganath et al., 2014; Blundell et al., 2015; Hernández-Lobato & Adams, 2015; Zhang et al., 2017; Khan et al., 2018).
Our approach can be seen as an attempt to further reduce the number of parameters of the (already) diagonal covariance matrix. Closest to our approach is the work of Louizos & Welling (2016). Their matrix variate Gaussian distribution instantiated with the Kronecker product of the diagonal row- and column-covariance matrices leads to a rank-1 tying of the posterior variances. In contrast, we explore tying strategies beyond the rank-1 case, which we show to lead to better performance (both in terms of ELBO and predictive metrics). Importantly, we further prove that tying strategies with a rank greater than one cannot be represented in a matrix variate Gaussian distribution, thus clearly departing from Louizos & Welling (2016) (see Appendix A for details).
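The rank-1 tying induced by diagonal row and column covariances can be checked numerically. The sketch below assumes the usual convention that, under column-stacking vectorization, a matrix variate Gaussian has covariance equal to the Kronecker product of the column covariance with the row covariance. It covers only this diagonal case and is not a substitute for the proof in Appendix A; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
r = np.exp(rng.standard_normal(m))   # diagonal of the row covariance (m x m)
c = np.exp(rng.standard_normal(n))   # diagonal of the column covariance (n x n)

# Covariance of vec(W) under column stacking: column covariance (x) row covariance.
cov = np.kron(np.diag(c), np.diag(r))
variances = np.diag(cov).reshape((m, n), order="F")   # back to matrix layout
stds = np.sqrt(variances)
rank_of_stds = np.linalg.matrix_rank(stds)            # outer product, hence rank 1

# A generic tied standard deviation matrix built from two factors per side.
U = np.exp(rng.standard_normal((m, 2)))
V = np.exp(rng.standard_normal((n, 2)))
rank_of_tied_variances = np.linalg.matrix_rank((U @ V.T) ** 2)
```

The variances produced by diagonal row and column covariances always form an outer product, whereas the elementwise square of a generic rank-2 standard deviation matrix has rank greater than one and so cannot be written in that form.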
Our approach can be also interpreted as a form of hierarchical variational inference from Ranganath et al. (2016). In this interpretation, the prior on the variational parameters corresponds to a Dirac distribution, non-zero only when a pre-specified low-rank tying relationship holds. More recently, Karaletsos et al. (2018) proposed a hierarchical structure which also couples network weights, but achieves this by introducing representations of network units as latent variables.
5 Conclusion
In this work we have shown that Bayesian Neural Networks trained with standard Gaussian Mean-Field Variational Inference learn posterior standard deviation matrices that can be approximated with little information loss by low-rank SVD decompositions. This suggests that richer parameterizations of the variational posterior may not always be needed, and that compact parameterizations can also work well. We used this insight to propose a simple, yet effective variational posterior parameterization, which speeds up training and reduces the number of variational parameters without degrading predictive performance on a range of model types.
In future work, we hope to scale up variational inference with compactly parameterized approximate posteriors to much larger models and more complex problems. For mean-field variational inference to work well in that setting several challenges will likely need to be addressed (Osawa et al., 2019); improving the signal-to-noise ratio of ELBO gradients using our compact variational parameterizations may provide a piece of the puzzle.
- Barber & Bishop (1998) Barber, D. and Bishop, C. M. Ensemble learning for multi-layer networks. In Advances in neural information processing systems, pp. 395–401, 1998.
- Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- Challis & Barber (2011) Challis, E. and Barber, D. Concave gaussian variational approximations for inference in large-scale bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 199–207, 2011.
- Chollet et al. (2015) Chollet, F. et al. Keras, 2015.
- Dillon et al. (2017) Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. Tensorflow distributions. arXiv preprint arXiv:1711.10604, 2017.
- Farquhar et al. (2020) Farquhar, S., Smith, L., and Gal, Y. Try Depth instead of weight correlations: Mean-field is a less restrictive assumption for variational inference in deep networks. Bayesian Deep Learning Workshop at NeurIPS, 2020.
- Giordano et al. (2018) Giordano, R., Broderick, T., and Jordan, M. I. Covariances, robustness and variational bayes. The Journal of Machine Learning Research, 19(1):1981–2029, 2018.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016.
- Graves (2011) Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356, 2011.
- Gupta & Nagar (2018) Gupta, A. K. and Nagar, D. K. Matrix variate distributions. Chapman and Hall/CRC, 2018.
- He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
- He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016b.
- Hernández-Lobato & Adams (2015) Hernández-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869, 2015.
- Hinton & Van Camp (1993) Hinton, G. and Van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer, 1993.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Jaakkola & Jordan (1997) Jaakkola, T. and Jordan, M. A variational approach to bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, pp. 4, 1997.
- Karaletsos et al. (2018) Karaletsos, T., Dayan, P., and Ghahramani, Z. Probabilistic meta-representations of neural networks. arXiv preprint arXiv:1810.00555, 2018.
- Khan et al. (2018) Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Krizhevsky et al. (2009a) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009a.
- Krizhevsky et al. (2009b) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009b.
- LeCun & Cortes (2010) LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Louizos & Welling (2016) Louizos, C. and Welling, M. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716, 2016.
- Louizos & Welling (2017) Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. JMLR. org, 2017.
- Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Association for Computational Linguistics, 2011.
- MacKay (1992) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
- Marlin et al. (2011) Marlin, B. M., Khan, M. E., and Murphy, K. P. Piecewise bounds for estimating bernoulli-logistic latent gaussian models. In Proceedings of the International Conference on Machine Learning, pp. 633–640, 2011.
- Miller et al. (2017) Miller, A. C., Foti, N. J., and Adams, R. P. Variational boosting: Iteratively refining posterior approximations. In Proceedings of the 34th International Conference on Machine Learning, pp. 2420–2429. JMLR. org, 2017.
- Mishkin et al. (2018) Mishkin, A., Kunstner, F., Nielsen, D., Schmidt, M., and Khan, M. E. Slang: Fast structured covariance approximations for bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pp. 6245–6255, 2018.
- Neal (1993) Neal, R. M. Bayesian learning via stochastic dynamics. In Advances in neural information processing systems, pp. 475–482, 1993.
- Ong et al. (2018) Ong, V. M.-H., Nott, D. J., and Smith, M. S. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465–478, 2018.
- Osawa et al. (2019) Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with bayesian principles. arXiv preprint arXiv:1906.02506, 2019.
- Ovadia et al. (2019) Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530, 2019.
- Peterson (1987) Peterson, C. A mean field theory learning algorithm for neural networks. Complex systems, 1:995–1019, 1987.
- Raftery et al. (2005) Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. Using bayesian model averaging to calibrate forecast ensembles. Monthly weather review, 133(5):1155–1174, 2005.
- Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
- Ranganath et al. (2016) Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333, 2016.
- Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538, 2015.
- Salimans et al. (2013) Salimans, T., Knowles, D. A., et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
- Seeger (2000) Seeger, M. In Advances in neural information processing systems, pp. 603–609, 2000.
- Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.
- Sun et al. (2017) Sun, S., Chen, C., and Carin, L. Learning structured weight uncertainty in bayesian neural networks. In Artificial Intelligence and Statistics, pp. 1283–1292, 2017.
- Tan & Nott (2018) Tan, L. S. and Nott, D. J. Gaussian variational approximation with sparse precision matrices. Statistics and Computing, 28(2):259–275, 2018.
- Titsias & Lázaro-Gredilla (2014) Titsias, M. and Lázaro-Gredilla, M. Doubly stochastic variational bayes for non-conjugate inference. In International conference on machine learning, pp. 1971–1979, 2014.
- Tran et al. (2019) Tran, D., Dusenberry, M. W., Hafner, D., and van der Wilk, M. Bayesian Layers: A module for neural network uncertainty. In Neural Information Processing Systems, 2019.
- Turner & Sahani (2011) Turner, R. and Sahani, M. Two problems with variational expectation maximisation for time-series models, pp. 109–130. Cambridge University Press, 2011.
- Wen et al. (2018) Wen, Y., Vicol, P., Ba, J., Tran, D., and Grosse, R. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.
- Zhang et al. (2018) Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 2018.
- Zhang et al. (2017) Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
Appendix A Proof of the Matrix Variate Normal Parameterization
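In this section of the appendix, we formally explain the connections between the $k$-tied Normal distribution and the matrix variate Gaussian distribution (Gupta & Nagar, 2018), referred to as $\mathcal{MN}$.
Consider positive definite matrices $Q \in \mathbb{R}^{r \times r}$ and $P \in \mathbb{R}^{c \times c}$ and some arbitrary matrix $M \in \mathbb{R}^{r \times c}$. We have by definition that $W \in \mathbb{R}^{r \times c} \sim \mathcal{MN}(M, Q, P)$ if and only if $\mathrm{vec}(W) \sim \mathcal{N}(\mathrm{vec}(M), P \otimes Q)$, where $\mathrm{vec}(\cdot)$ stacks the columns of a matrix and $\otimes$ is the Kronecker product.
The $\mathcal{MN}$ has already been used for variational inference by Louizos & Welling (2016) and Sun et al. (2017). In particular, Louizos & Welling (2016) consider the case where both $P$ and $Q$ are restricted to be diagonal matrices. In that case, the resulting distribution corresponds to our $k$-tied Normal distribution with $k = 1$ since
$$P \otimes Q = \mathrm{diag}(p) \otimes \mathrm{diag}(q) = \mathrm{diag}\big(\mathrm{vec}(q p^\top)\big).$$
Importantly, we prove below that, in the case where $k \geq 2$, the $k$-tied Normal distribution cannot be represented as a matrix variate Gaussian distribution.
Lemma (Rank-2 matrix and Kronecker product).
Let $B$ be a rank-$2$ matrix in $\mathbb{R}_+^{r \times c}$. There do not exist matrices $Q \in \mathbb{R}^{r \times r}$ and $P \in \mathbb{R}^{c \times c}$ such that
$$\mathrm{diag}\big(\mathrm{vec}(B)\big) = P \otimes Q.$$
Proof. Let us introduce the shorthand $D = \mathrm{diag}(\mathrm{vec}(B))$. By construction, $D$ is diagonal and has its diagonal terms strictly positive (it is assumed that $B \in \mathbb{R}_+^{r \times c}$, i.e., $b_{ij} > 0$ for all $i, j$).
We proceed by contradiction. Assume there exist $Q$ and $P$ such that $D = P \otimes Q$.
This implies that all diagonal blocks of $P \otimes Q$ are themselves diagonal with strictly positive diagonal terms. Thus, $p_{jj} Q$ is diagonal for all $j$, which implies in turn that $Q$ is diagonal, with non-zero diagonal terms and $p_{jj} \neq 0$. Moreover, since the off-diagonal blocks $p_{ij} Q$ for $i \neq j$ must be zero and $Q \neq 0$, we have $p_{ij} = 0$ and $P$ is also diagonal.
To summarize, if there exist $Q$ and $P$ such that $D = P \otimes Q$, then it holds that $D = \mathrm{diag}(p) \otimes \mathrm{diag}(q)$ with $p \in \mathbb{R}^{c}$ and $q \in \mathbb{R}^{r}$. This last equality can be rewritten as $b_{ij} = q_i p_j$ for all $i$ and $j$, or equivalently
$$B = q p^\top.$$
This leads to a contradiction since $q p^\top$ has rank one while $B$ is assumed to have rank two. ∎
Figure 7 provides an illustration of the difference between the $k$-tied Normal and the $\mathcal{MN}$ distribution.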
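The diagonal Kronecker identity behind the rank-one case, and the rank obstruction used in the lemma, can be checked numerically. The following is a minimal NumPy sketch; the sizes, the vectors `p` and `q`, the example matrix `B` and the helper name `vec` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def vec(a):
    """Stack the columns of a matrix into one vector (column-major order)."""
    return a.flatten(order="F")

rng = np.random.default_rng(0)
r, c = 3, 4                       # illustrative sizes (assumption)
q = rng.uniform(0.5, 2.0, r)      # diagonal of Q, strictly positive
p = rng.uniform(0.5, 2.0, c)      # diagonal of P, strictly positive

# Rank-one case: the Kronecker product of two diagonal matrices equals the
# diagonal matrix built from the vectorized outer product q p^T.
lhs = np.kron(np.diag(p), np.diag(q))
rhs = np.diag(vec(np.outer(q, p)))
identity_holds = np.allclose(lhs, rhs)

# Rank-two case: a strictly positive matrix of rank two. By the lemma it
# cannot be written as an outer product, so diag(vec(B)) is not P (x) Q.
B = np.ones((r, c)) + np.outer(np.arange(1, r + 1), np.arange(1, c + 1))
rank_B = np.linalg.matrix_rank(B)
```

Here `identity_holds` confirms the rank-one identity for one random draw, while `rank_B` shows that the example `B` is not an outer product of two vectors, which is exactly the property the proof exploits.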
Appendix B Experimental details
In this section we provide additional information on the experimental setup used in the main paper. In particular, we describe the details of the models and datasets, the standard Gaussian Mean Field Variational Inference (GMFVI) training procedure, the low-rank structure analysis of the GMFVI trained posteriors and the proposed $k$-tied Normal posterior training procedure.
B.1 Models and datasets
To confirm the validity of our results, we perform the experiments on a range of models and datasets with different data types, architecture types and sizes. Below we describe their details.
MLP MNIST
Multilayer perceptron (MLP) model with three dense layers and ReLU activations trained on the MNIST dataset (LeCun & Cortes, 2010). The three layers have sizes of 400, 400 and 10 hidden units. We preprocess the images to have values in range . We use the last 10,000 examples of the training set as a validation set.
LeNet CNN CIFAR-100
LeNet convolutional neural network (CNN) model (LeCun et al., 1998) with two convolutional layers followed by two dense layers, all interleaved with ReLU activations. The two convolutional layers have 32 and 64 output filters respectively, each produced by kernels of size . The two dense layers have sizes of 512 and 100 hidden units. We train this network on the CIFAR-100 dataset (Krizhevsky et al., 2009b). We preprocess the images to have values in range . We use the last 10,000 examples of the training set as a validation set.
LSTM IMDB
Long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) that consists of an embedding and an LSTM cell, followed by a dense layer with a single unit. The LSTM cell consists of two dense weight matrices, namely the kernel and the recurrent kernel. The embedding and the LSTM cell both have a 128-dimensional output space. More precisely, we adopt the publicly available LSTM Keras (Chollet et al., 2015) example (see https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py), except that we set the dropout rate to zero. We train this model on the IMDB text sentiment classification dataset (Maas et al., 2011), in which we use the last 5,000 examples of the training set as a validation set.
ResNet-18 CIFAR-10
ResNet-18 model (He et al., 2016a) trained on the CIFAR-10 dataset (Krizhevsky et al., 2009b). We adopt the ResNet-18 implementation (see https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/cifar10_bnn.py) from the TensorFlow Probability (Dillon et al., 2017) repository. We train and evaluate this model on the train/test split of 50,000 and 10,000 images, respectively, from the CIFAR-10 dataset available in TensorFlow Datasets (see https://www.tensorflow.org/datasets/catalog/cifar10).
B.2 GMFVI training
We train all the above models using GMFVI. We split the discussion of the details of the GMFVI training procedure into two parts. First, we describe the setup for the MLP, CNN and LSTM models, for which we prepare our own GMFVI implementations. Second, we explain the setup for the GMFVI training of the ResNet-18 model, for which we use the implementation available in the TensorFlow Probability repository as mentioned above.
MLP, CNN and LSTM
In the MLP and the CNN models, we approximate the posterior using GMFVI for all the weights (both kernel and bias weights). For the LSTM model, we approximate the posterior using GMFVI only for the kernel weights, while for the bias weights we use a point estimate. For all three models, we use the standard reparametrization trick estimator (Kingma & Welling, 2013). We initialize the GMFVI posterior means using the standard He initialization (He et al., 2015) and the GMFVI posterior standard deviations using samples from . Furthermore, we use a Normal prior with a single scalar standard deviation hyper-parameter for all the layers. We select this hyper-parameter for each of the models separately from a set of candidate values based on the validation data set performance.
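The reparametrization estimator and the Normal prior described above can be illustrated for a single dense layer. This is a minimal NumPy sketch under stated assumptions: the prior is taken to be zero-mean, and the layer size, the initial standard deviation of 0.01 and `prior_std=1.0` are placeholder values, not the settings used in the paper. The function names are hypothetical.

```python
import numpy as np

def sample_weights(mean, std, rng):
    """Reparametrization trick: w = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

def kl_to_normal_prior(mean, std, prior_std):
    """KL(N(mean, std^2) || N(0, prior_std^2)), summed over all weights."""
    var_ratio = (std / prior_std) ** 2
    return 0.5 * np.sum(
        var_ratio + (mean / prior_std) ** 2 - 1.0 - np.log(var_ratio)
    )

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 400  # placeholder layer size (assumption)

# He initialization for the posterior means: N(0, 2 / fan_in).
mean = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
# Placeholder initial posterior standard deviations (assumption).
std = np.full((fan_in, fan_out), 0.01)

w = sample_weights(mean, std, rng)             # one posterior weight sample
kl = kl_to_normal_prior(mean, std, prior_std=1.0)
```

Because the sample `w` is a deterministic function of `mean`, `std` and the noise `eps`, gradients of a loss evaluated at `w` can flow back to the variational parameters, which is what makes this estimator usable with a standard optimizer.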
We optimize the variational parameters using an Adam optimizer (Kingma & Ba, 2014). We pick the optimal learning rate for each model from a set of candidate values, also based on the validation data set performance. We choose a batch size of 1024 for the MLP and CNN models, and a batch size of 128 for the LSTM model. We train all the models until the ELBO converges.
To implement the MLP and CNN models we use the tfp.layers module from TensorFlow Probability, while to implement the LSTM model we use the LSTMCellReparameterization class (see https://github.com/google/edward2/blob/master/edward2/tensorflow/layers/recurrent.py) from the Edward2 Layers module (Tran et al., 2019).
ResNet-18
The specific details of the GMFVI training of the ResNet-18 model can be found in the previously linked implementation from the TensorFlow Probability repository. Here, we describe the most important and distinctive aspects of this implementation.
The ResNet-18 model approximates the posterior using GMFVI only for the kernel weights, while for the bias weights it uses a point estimate. The model uses the Flipout estimator (Wen et al., 2018) and a constraint on the maximum value of the GMFVI posterior standard deviations of . The GMFVI posterior means are initialized using samples from , while the GMFVI posterior log standard deviations are initialized using samples from . Furthermore, the model uses a Normal prior for all of its layers.
The variational parameters are trained using the Adam optimizer with a learning rate of and a batch size of 128. The model is trained for 700 epochs. The contribution of the KL term in the negative Evidence Lower Bound (ELBO) equation is annealed linearly from zero to its full contribution over the first 50 epochs (Sønderby et al., 2016).
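The linear annealing schedule can be written as a weight on the KL term that grows from zero to one over the first 50 epochs. The sketch below assumes zero-based epoch indices and a per-epoch update; the function names are hypothetical and this is not the code from the linked implementation.

```python
def kl_weight(epoch, warmup_epochs=50):
    """Linear ramp of the KL weight from 0 to 1 over `warmup_epochs` epochs."""
    return min(1.0, epoch / warmup_epochs)

def negative_elbo(nll, kl, epoch):
    """Negative ELBO with the annealed KL contribution."""
    return nll + kl_weight(epoch) * kl
```

With this schedule the KL term is switched off at epoch 0, contributes half of its value at epoch 25, and contributes fully from epoch 50 until the end of the 700 training epochs.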
B.3 Low-rank structure analysis
After training the above models using GMFVI, we investigate the low-rank structure in their trained variational posteriors. For the MLP, CNN and LSTM models, we investigate the low-rank structure of their dense layers only. For the ResNet-18 model, we investigate both its dense and convolutional layers.
To investigate the low-rank structure in the GMFVI posterior of a dense layer, we inspect the singular value spectrum of the posterior mean and standard deviation matrices. In particular, for both the posterior mean and standard deviation matrices, we consider the fraction of the variance explained by the top singular values from their singular value decomposition (SVD) (see Figure 9 in the main paper). Furthermore, we explore the impact on predictive performance of approximating the full matrices with their low-rank approximations using only the components corresponding to the top singular values (see Table 2 in the main paper). Note that such low-rank approximations may contain values below zero. This has to be addressed when approximating the matrices of the posterior standard deviations, which can contain only positive values. Therefore, we use a lower bound of zero for the values of the approximations to the posterior standard deviations.
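The analysis above amounts to a truncated SVD followed by clipping for the standard deviations. The following NumPy sketch assumes that "fraction of variance explained" means the ratio of squared singular values, which is one common convention; the function names and the random example matrix are hypothetical.

```python
import numpy as np

def explained_variance(matrix, k):
    """Fraction of squared singular-value mass captured by the top k values."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return np.sum(s[:k] ** 2) / np.sum(s ** 2)

def low_rank_approx(matrix, k, nonnegative=False):
    """Rank-k truncated SVD; optionally clip the result at zero."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k]
    return np.maximum(approx, 0.0) if nonnegative else approx

rng = np.random.default_rng(0)
# Stand-in for a trained posterior standard deviation matrix (assumption).
std = np.abs(rng.standard_normal((400, 10))) * 0.05

frac = explained_variance(std, k=2)
std_rank2 = low_rank_approx(std, k=2, nonnegative=True)
```

The `nonnegative=True` branch applies the lower bound of zero that is needed for standard deviations; for posterior means the unclipped approximation would be used instead.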
To investigate the low-rank structure in a GMFVI posterior of a convolutional layer, we need to add a few more steps compared to those for a dense layer. In particular, weights of the convolutional layers considered here are 4-dimensional, instead of 2-dimensional as in the dense layer. Therefore, before performing the SVD, as for the dense layers, we first reshape the 4-dimensional weight tensor from the convolutional layer into a 2-dimensional weight matrix. More precisely, we flatten all dimensions of the weight tensor except for the last dimension (i.e., a weight tensor of shape $(h, w, c_{\text{in}}, c_{\text{out}})$ is reshaped to $(h \cdot w \cdot c_{\text{in}}, c_{\text{out}})$). After this specific reshape operation, all the weights corresponding to a single output filter are contained in a single column of the resulting weight matrix. Figure 8 contains example visualizations of the resulting flattened 2-dimensional matrices. Given the 2-dimensional form of the weight tensor, we can investigate the low-rank structure in the convolutional layers as for the dense layers. As noted already in Figure 4 in the main paper, we observe the same strong low-rank structure in the flattened convolutional layers as in the dense layers. Interestingly, the low-rank structure is most visible in the final convolutional layers, which also contain the highest number of parameters (see Figure 9).
Importantly, note that after performing the low-rank approximation in this 2-dimensional space, we can reshape the resulting 2-dimensional low-rank matrices back into the 4-dimensional form of a convolutional layer. Table 5 shows that such a low-rank approximation of the convolutional layers of the analyzed ResNet-18 model can be performed without a loss in the model’s predictive performance, while significantly reducing the total number of model parameters.
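The reshape, approximate and reshape-back procedure for convolutional layers can be sketched as follows. The kernel layout (height, width, input channels, output filters), the tensor sizes and the rank `k = 2` are illustrative assumptions, and the random tensor merely stands in for a trained posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical convolutional kernel: (height, width, in_channels, out_filters).
kernel = rng.standard_normal((3, 3, 16, 32))

# Flatten every dimension except the last: one column per output filter.
flat = kernel.reshape(-1, kernel.shape[-1])          # shape (144, 32)

# Rank-k truncated SVD in the flattened 2-dimensional space.
k = 2
u, s, vt = np.linalg.svd(flat, full_matrices=False)
flat_low_rank = (u[:, :k] * s[:k]) @ vt[:k]

# Restore the 4-dimensional form expected by the convolutional layer.
kernel_low_rank = flat_low_rank.reshape(kernel.shape)
```

Because the reshape only reinterprets the memory layout, reshaping the flattened matrix back recovers the original tensor exactly, so any change in predictive performance comes from the rank truncation alone.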
B.4 $k$-tied Normal posterior training
To exploit the low-rank structure observation, we propose the $k$-tied Normal posterior, as discussed in Section 3. We study the properties of the $k$-tied Normal posterior applied to the MLP, CNN and LSTM models. We use the $k$-tied Normal variational posterior for all the dense layers of the analyzed models. Namely, we use the $k$-tied Normal variational posterior for all three layers of the MLP model, for the two dense layers of the CNN model and for the LSTM cell’s kernel and recurrent kernel.
We initialize the two factor parameters of the $k$-tied Normal distribution so that after the outer-product operation the respective standard deviations have the same mean values as we obtain when using the standard GMFVI posterior parametrization. More precisely, we initialize the two factors so that after the outer-product operation the respective standard deviations have means at before transforming to log-domain. This means that in the log domain the two factors are initialized as
. We also add white noise to the values of both factors in the log domain to break symmetry.
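One way to realize such an initialization is sketched below. It assumes that the standard deviation matrix is the product of two positive factors stored in the log domain, here called `log_u` and `log_v` (hypothetical names), and that every entry of the product should start at a common value `target_std`. The values of `target_std`, `noise_std` and the layer sizes are placeholders, not the settings from the paper.

```python
import numpy as np

def init_k_tied_log_factors(m, n, k, target_std, noise_std, rng):
    """Log-domain factors whose product exp(log_u) @ exp(log_v).T is ~target_std."""
    # Each of the k summed products must contribute target_std / k, so each
    # factor entry starts at sqrt(target_std / k) before adding noise.
    base = 0.5 * (np.log(target_std) - np.log(k))
    log_u = base + noise_std * rng.standard_normal((m, k))
    log_v = base + noise_std * rng.standard_normal((n, k))
    return log_u, log_v

rng = np.random.default_rng(0)

# Without noise every entry of the product equals target_std exactly.
log_u0, log_v0 = init_k_tied_log_factors(400, 10, 2, 1e-2, 0.0, rng)
std_exact = np.exp(log_u0) @ np.exp(log_v0).T

# Small noise in the log domain breaks the symmetry between the k components.
log_u, log_v = init_k_tied_log_factors(400, 10, 2, 1e-2, 0.1, rng)
std_noisy = np.exp(log_u) @ np.exp(log_v).T
```

Storing the factors in the log domain keeps the resulting standard deviations strictly positive during optimization, and the added noise prevents the `k` components from receiving identical gradients.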
During training of the models with the $k$-tied Normal posterior, we linearly anneal the contribution of the KL term of the ELBO loss. We select the best linear coefficient for the annealing from a set of candidate values (per batch) and increase the effective contribution every 100 batches in a step-wise manner. In particular, we anneal the KL term to obtain the predictive performance results for all the models in Figure 5 in the main paper. However, we do not perform the annealing in the Signal-to-Noise ratio (SNR) and negative ELBO convergence speed experiments in the same Figure 5. In these two cases, KL annealing would obscure the values of interest, which show the clear impact of the $k$-tied Normal posterior.
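The step-wise variant of the linear schedule can be sketched as a KL weight that grows in proportion to the batch count but is only updated every 100 batches. The function name, the capping at one and the example coefficient are assumptions made for illustration.

```python
def stepwise_kl_weight(batch_index, coeff_per_batch, step=100):
    """KL weight that is linear in the batch count, updated every `step` batches."""
    completed_steps = batch_index // step
    return min(1.0, coeff_per_batch * step * completed_steps)

# Example: a hypothetical coefficient of 1e-3 per batch.
weights = [stepwise_kl_weight(b, 1e-3) for b in (0, 99, 100, 250, 5000)]
```

In this example the weight stays at zero for the first 100 batches, jumps to 0.1 at batch 100, holds 0.2 between batches 200 and 299, and is capped at 1.0 once the ramp is complete.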