Sample code for running deterministic variational inference to train Bayesian neural networks
Bayesian neural networks (BNNs) hold great promise as a flexible and principled solution to deal with uncertainty when learning from finite data. Among approaches to realize probabilistic inference in deep neural networks, variational Bayes (VB) is theoretically grounded, generally applicable, and computationally efficient. With wide recognition of potential advantages, why is it that variational Bayes has seen very limited practical use for BNNs in real applications? We argue that variational inference in neural networks is fragile: successful implementations require careful initialization and tuning of prior variances, as well as controlling the variance of Monte Carlo gradient estimates. We fix VB and turn it into a robust inference tool for Bayesian neural networks. We achieve this with two innovations: first, we introduce a novel deterministic method to approximate moments in neural networks, eliminating gradient variance; second, we introduce a hierarchical prior for parameters and a novel empirical Bayes procedure for automatically selecting prior variances. Combining these two innovations, the resulting method is highly efficient and robust. On the application of heteroscedastic regression we demonstrate strong predictive performance over alternative approaches.READ FULL TEXT VIEW PDF
Variational inference has become an increasingly attractive, computation...
Deep neural networks have achieved impressive results on a wide variety ...
The motivations for using variational inference (VI) in neural networks
Bayesian modeling is now ubiquitous in problems of large-scale inference...
Recent advances in variational inference enable the modelling of highly
In discriminative settings such as regression and classification there a...
This note is concerned with an accurate and computationally efficient
Sample code for running deterministic variational inference to train Bayesian neural networks
Bayesian approaches to neural network training marry the representational flexibility of deep neural networks with principled parameter estimation in probabilistic models. Compared to “standard” parameter estimation by maximum likelihood, the Bayesian framework promises to bring key advantages such as better uncertainty estimates on predictions and automatic model regularization (MacKay, 1992; Graves, 2011). These features are often crucial for informing downstream decision tasks and reducing overfitting, particularly on small datasets. However, despite potential advantages, such Bayesian neural networks (BNNs) are often overlooked due to two limitations: First, posterior inference in deep neural networks is analytically intractable and approximate inference with Monte Carlo (MC) techniques can suffer from crippling variance given only a reasonable computation budget (Kingma et al., 2015; Molchanov et al., 2017; Miller et al., 2017; Zhu et al., 2018). Second, performance of the Bayesian approach is sensitive to the choice of prior (Neal, 1993), and although we may have a priori knowledge concerning the function represented by a neural network, it is generally difficult to translate this into a meaningful prior on neural network weights. Sensitivity to priors and initialization makes BNNs non-robust and thus often irrelevant in practice.
In this paper, we describe a novel approach for inference in feed-forward BNNs that is simple to implement and aims to solve these two limitations. We adopt the paradigm of variational Bayes (VB) for BNNs (Hinton & van Camp, 1993; MacKay, 1995c) which is normally deployed using Monte Carlo variational inference (MCVI) (Graves, 2011; Blundell et al., 2015). Within this paradigm we address the two shortcomings of current practice outlined above: First, we address the issue of high variance in MCVI, by reducing this variance to zero through novel deterministic approximations to variational inference in neural networks. Second, we derive a general and robust Empirical Bayes (EB) approach to prior choice using hierarchical priors. By exploiting conjugacy we derive data-adaptive closed-form variance priors for neural network weights, which we experimentally demonstrate to be remarkably effective.
Combining these two novel ingredients gives us a performant and robust BNN inference scheme that we refer to as “deterministic variational inference” (DVI). We demonstrate robustness and superior predictive performance in the context of non-linear regression models, deriving novel closed-form results for expected log-likelihoods in homoscedastic and heteroscedastic regression (similar derivations for classification can be found in the appendix).
Experiments on standard regression datasets from the UCI repository, (Dheeru & Karra Taniskidou, 2017), show that for identical models DVI converges to local optima with better predictive log-likelihoods than existing methods based on MCVI. In direct comparisons, we show that our Empirical Bayes formulation automatically provides better or comparable test performance than manual tuning of the prior and that heteroscedastic models consistently outperform the homoscedastic models.
Concretely, our contributions are:
Development of an EB method for principled tuning of weight priors during BNN training.
Experimental results showing the accuracy and efficiency of our method and applicability to heteroscedastic and homoscedastic regression on real datasets.
We start by describing the inference task that our method must solve to successfully train a BNN. Given a model parameterized by weights and a dataset , the inference task is to discover the posterior distribution . A variational approach acknowledges that this posterior generally does not have an analytic form, and introduces a variational distribution parameterized by to approximate . The approximation is considered optimal within the variational family for that minimizes the Kullback-Leibler (KL) divergence between and the true posterior.
Introducing a prior and applying Bayes rule allows us to rewrite this as optimization of the quantity known as the evidence lower bound (ELBO):
Analytic results exist for the KL term in the ELBO for careful choice of prior and variational distributions (e.g. Gaussian families). However, when is a non-linear neural network, the first term in equation 1 (referred to as the reconstruction term) cannot be computed exactly: this is where MC approximations with finite sample size are typically employed:
Our goal in the next section is to develop an explicit and accurate approximation for this expectation, which provides a deterministic, closed-form expectation calculation, stabilizing BNN training by removing all stochasticity due to Monte Carlo sampling.
Figure 1 shows the architecture of the computation of
for a feed-forward neural network. The computation can be divided into two parts: first, propagation of activations though parameterized layers and second, evaluation of an unparameterized log-likelihood function (). In this section, we describe how each of these stages is handled in our deterministic framework.
We begin by considering activation propagation (figure 1(a)), with the aim of deriving the form of an approximation to the final layer activation distribution that will be passed to the likelihood computation. We compute by sequentially computing the distributions for the activations in the preceding layers. Concretely, we define the action of the layer that maps to as follows:
where is a non-linearity and
are random variables representing the weights and biases of thelayer that are assumed independent from weights in other layers. For notational clarity, in the following we will suppress the explicit layer index , and use primed symbols to denote variables from the layer, e.g.
. Note that we have made the non-conventional choice to draw the boundaries of the layers such that the linear transform is applied after the non-linearity. This is to emphasize thatis constructed by linear combination of many distinct elements of
, and in the limit of vanishing correlation between terms in this combination, we can appeal to the central limit theorem (CLT). Under the CLT, for a large enough hidden dimension, elements
will be normally distributed regardless of the potentially complicated distribution forinduced by 111We are also required to choose a Gaussian variational approximation for
to preserve the Gaussian distribution of.. We empirically observe that this claim is approximately valid even when (weak) correlations appear between the elements of during training (see section 3.1.1).
Having argued that adopts a Gaussian form, it remains to compute the first and second moments. In general, these cannot be computed exactly, so we develop an approximate expression. An overview of this derivation is presented here with more details in appendix A. First, we model , and as independent random variables, allowing us to write:
where we have employed the Einstein summation convention and used angle brackets to indicate expectation over . If we choose a variational family with analytic forms for weight means and covariances (e.g. Gaussian with variational parameters and ), then the only difficult terms are the moments of :
where we have used the Gaussian form of parameterized by mean and covariance , and for brevity we have omitted the normalizing constants. Closed form solutions for the integral in equation 4 exist for Heaviside or ReLU choices of non-linearity (see appendix A). Furthermore, for these non-linearities, the and asymptotes of the integral in equation 5 have closed form. Figure 2 shows schematically how these asymptotes can be used as a first approximation for equation 5. This approximation is improved by considering that (by definition) the residual decays to zero far from the origin in the plane, and so is well modelled by a decaying function , where is a polynomial in with a dominant positive even term. In practice we truncate at the quadratic term, and calculate the polynomial coefficients by matching the moments of the resulting Gaussian with the analytic moments of the residual. Specifically, using dimensionless variables and , this improved approximation takes the form
Using equation 6 in equation 3 gives a closed form approximation for the moments of as a function of moments of . Since is approximately normally distributed by the CLT, this is sufficient information to sequentially propagate moments all the way through the network to compute the mean and covariances of , our explicit multivariate Gaussian approximation to
. Any deep learning framework supporting special functionsand
will immediately support backpropagation through the deterministic expressions we have presented. Below we briefly empirically verify the presented approximation, and in section3.2
we will show how it is used to compute an approximate log-likelihood and posterior predictive distribution for regression and classification tasks.
The approximation derived above relies on three assumptions. First, that some form of CLT holds for the hidden units during training where the iid assumption of the classic CLT is not strictly enforced; second, that a quadratic truncation of is sufficient222Additional Taylor expansion terms can be computed if this assumption fails.; and third that there are only weak correlation between layers so that they can be represented using independent variables in the variational distribution. To provide evidence that these assumptions hold in practice, we train a small ReLU network with two hidden layers each of 128 units to perform 1D heteroscedastic regression on a toy dataset of 500 points drawn from the distribution shown in figure 3(b). The training objective is taken from section 4, and the only detail required here is that
is a 2-element vector where the elements are labelled as. We use a diagonal Gaussian variational family to represent the weights, but we preserve the full covariance of during propagation. Using an input (see arrow, Figure 3(b)) we compute the distributions for and both at the start of training (where we expect the iid assumption to hold) and at convergence (where iid does not necessarily hold). Figure 3(c) shows the comparison between distributions reported by our deterministic approximation and MC evaluation using 20k samples from . This comparison is qualitatively excellent for all cases considered.
In traditional MCVI, propagation of samples of -dimensional activations through a layer containing a -dimensional transformation requires compute and memory. Our DVI method approximates the limit, while only demanding compute and memory (the additional factor of arises from manipulation of the quadratically large covariance matrix ). Whereas MCVI can always trade compute and memory for accuracy by choosing a small value for , the inherent scaling of DVI with could potentially limit its practical use for networks with large hidden size. To avoid this limitation, we also consider the case where only the diagonal entries are computed and stored at each layer. We refer to this method as “diagonal-DVI” (dDVI), and in section 6 we show the surprising result that the strong test performance of DVI is largely retained by dDVI across a range of datasets. Figure 4 shows the time required to propagate activations through a single layer using the MCVI, DVI and dDVI methods on a Tesla V100 GPU. As a rough rule of thumb (on this hardware), for layer sizes of practical relevance, we see that absolute DVI runtimes roughly equate to MCVI with and dDVI runtime equates to .
To use the moment propagation procedure derived above for training BNNs, we need to build a function that maps final layer activations to the expected log-likelihood term in equation 1 (see figure 1(b)). In appendix B.1 we show the intuitive result that this expected log-likelihood over can be rewritten as an expectation over .
For simplicity we consider scalar and a Gaussian noise model parameterized by mean and heteroscedastic log-variance . The parameters of this Gaussian are read off as the elements of a 2-dimensional output layer so that . Recall that these parameters themselves are uncertain and the statistics and can be computed following section 3.1. Inserting the Gaussian forms for and into equation 7 and performing the integral (see appendix B.2) gives a closed form expression for the ELBO reconstruction term:
This heteroscedastic model can be made homoscedastic by setting . The expression in equation 8 completes the derivations required to implement the closed form approximation to the ELBO reconstruction term for training a network. In addition, we can also compute a closed form approximation to the predictive distribution that is used at test-time to produce predictions that incorporate all parameter uncertainties. By approximating the moments of the posterior predictive and assuming normality (see appendix B.3), we find:
So far, we have described methods for deterministic approximation of the reconstruction term in the ELBO. We now turn to the KL term. For a -dimensional Gaussian prior , the KL divergence with the Gaussian variational distribution has closed form:
However, this requires selection of for which there is usually little intuition beyond arguing by symmetry and choosing to preserve the expected magnitude of the propagated activations (Glorot & Bengio, 2010; He et al., 2015). In practice, variational Bayes for neural network parameters is sensitive to the choice of prior variance parameters, and we will demonstrate this problem empirically in section 6 (figure 5).
To make variational Bayes robust we parameterize the prior hierarchically, retaining a conditional diagonal Gaussian prior and variational distribution on the weights. The hierarchical prior takes the form
, using an inverse gamma distribution on
as the conjugate prior to the elements of the diagonal Gaussian variance. We partition the weights into setsthat typically coincide with the layer partitioning333In general, any arbitrary partitioning can be used, and assign a single element in to each set:
for shape and scale , and where is the weight in set .
Rather than taking the fully Bayesian approach, we adopt an empirical Bayes approach (Type-2 MAP), optimizing , assuming that the integral is dominated by a contribution from this optimal value . We use the data to inform the optimal setting of to produce the tightest ELBO:
where is the number of weights in the set . The influence of the data on the choice of is made explicit here through dependence on the learned variational parameters and . Using to populate the elements of the diagonal prior variance , we can evaluate the KL in equation 10 under the empirical Bayes prior. Optimization of the resulting ELBO then simultaneously tunes the variational distribution and prior.
In the experiments we will demonstrate that the proposed empirical Bayes approach works well; however, it only approximates the full Bayesian solution, and it could fail if we were to allow too many degrees of freedom. To see this, assume we were to use one prior per weight element, and we would also define a hyperprior for each prior mean. Then, adjusting both the prior variance and prior mean using empirical Bayes would always lead to a KL-divergence of zero and theobjective would degenerate into maximum likelihood.
Bayesian neural networks have a rich history. In a 1992 landmark paper David MacKay demonstrated the many potential benefits of a Bayesian approach to neural network learning (MacKay, 1992); in particular, this work contained a convincing demonstration of naturally accounting for model flexibility in the form of the Bayesian Occam’s razor
, facilitating comparison between different models, accurate calibration of predictive uncertainty, and to perform learning robust to overfitting. However, at the time Bayesian inference was achieved only for small and shallow neural networks using a comparatively crude Laplace approximation. Another early review article summarizing advantages and challenges in Bayesian neural network learning is(MacKay, 1995c).
This initial excitement around Bayesian neural networks led to two main methods being developed; First, Hinton & van Camp (1993) and MacKay (1995b) developed the variational Bayes (VB) approach for posterior inference. Whereas Hinton & van Camp (1993) were motivated from a minimum description length (MDL) compression perspective, MacKay (1995b) motivated his equivalent ensemble learning method from a statistical physics perspective of variational free energy minimization. Barber & Bishop (1998) extended the methodology for two-layer neural networks to use general multivariate Normal variational distributions. Second, Neal (1993) developed efficient gradient-based Monte Carlo methods in the form of “hybrid Monte Carlo”, now known as Hamiltonian Monte Carlo, and also raised the question of prior design and limiting behaviour of Bayesian neural networks.
Rebirth of Bayesian neural networks. After more than a decade of no further work on Bayesian neural networks Graves (2011) revived the field by using Monte Carlo variational inference (MCVI) to make VB practical and scalable, demonstrating gains in predictive performance on real world tasks.
Since 2015 the VB approach to Bayesian neural networks is mainstream (Blundell et al., 2015); key research drivers since then are the problems of high variance in MCVI and the search for useful variational families. One approach to reduce variance in feedforward networks is the local reparameterization trick (Kingma et al., 2015) (see appendix D). To enhance the variational families more complicated distributions such as Matrix Gaussian posteriors (Louizos & Welling, 2016), multiplicative posteriors (Kingma et al., 2015), and hierarchical posteriors (Louizos & Welling, 2017) are used. Both our methods, the deterministic moment approximation and the empirical Bayes estimation, can potentially be extended to these richer families.
Prior choice. Choosing priors in Bayesian neural networks remains an open issue. The hierarchical priors for feedforward neural networks that we use have been investigated before by Neal (1993) and MacKay (1995a)
, the latter proposing a “cheap and cheerful” heuristic, alternating optimization of weights and inverse variance parameters.Barber & Bishop (1998) also used a hierarchical prior and an efficient closed-form factored VB approximation; our approach can be seen as a point estimate to their approach in order to enable use of our closed-form moment approximation. Graves (2011) also used hierarchical Gaussian priors with flat hyperpriors, deriving a closed-form update for the prior mean and variance. Compared to these prior works our approach is rigorous and with sufficient data accurately approximates the Bayesian approach of integrating over the prior parameters.
Alternative inference procedures. As an alternative to variational Bayes, probabilistic backpropagation (PBP) (Hernández-Lobato & Adams, 2015) applies approximate inference in the form of assumed density filtering (ADF) to refine a Gaussian posterior approximation. Like in our work, each update to the approximate posterior requires propagating means and variances of activations through the network. (Hernández-Lobato & Adams, 2015) only consider the diagonal propagation case and regression. Since the original work, PBP has been generalized to classification (Ghosh et al., 2016) and richer posterior families such as the matrix variate Normal posteriors (Sun et al., 2017). Our moment approximation could be used to improve the inference accuracy of PBP.
Gaussianity in neural networks. Our demonstration of Gaussianity of ReLU network activations is also directly relevant to recent work on Gaussian process interpretations of deep neural networks (Matthews et al., 2018; Lee et al., 2017), validating the insight that activations in deep neural networks are closely approximated by Gaussian processes. Two recent works derived deterministic moment approximations for deep neural networks: Bibi et al. (2018), using Price’s theorem, derived exact first and second moment expressions for ReLU activations but limit themselves to the case of zero-mean Gaussian activations. Kandemir et al. (2018) also derive closed-form solutions to the ELBO for the case of diagonal Gaussian variational families. However, their approach is limited to linear layers without bias.
Markov chain Monte Carlo approaches.
Another rich class of approximate inference methods for Bayesian neural networks are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods. These methods allow for approximate posterior parameter inference using unbiased log-likelihood estimates. Stochastic gradient Langevin dynamics (SGLD) was the first method in this class(Welling & Teh, 2011). SGLD is particularly simple and efficient to implement, but recent methods increase efficiency in the case of correlated posteriors by estimating the Fisher information matrix (Ahn et al., 2012) and extend Hamiltonian Monte Carlo to the stochastic gradient case (Chen et al., 2014). A complete characterization of SG-MCMC methods is given by (Ma et al., 2015; Gong et al., 2018). However, despite this progress, important theoretical questions regarding approximation guarantees for practical computational budgets remain (Nagapetyan et al., 2017). Moreover, while SG-MCMC methods work robustly in practice, they remain computationally inefficient, especially because evaluation of the posterior predictive requires evaluating an ensemble of models.
Wild approximations. The above methods are principled but often require sophisticated implementations; recently, a few methods aim to provide “cheap” approximations to the Bayes posterior. Dropout has been interpreted by Gal & Ghahramani (2016) to approximately correspond to variational inference. Likewise, Bootstrap posteriors (Lakshminarayanan et al., 2017; Fushiki et al., 2005; Harris, 1989) have been proposed as a general, robust, and accurate method for posterior inference. However, obtaining a bootstrap posterior ensemble of size is computationally intense at times the computation of training a single model.
We implement444 Our implementation in TensorFlow is available at
Our implementation in TensorFlow is available athttps://github.com/Microsoft/deterministic-variational-inference deterministic variational inference (DVI) as described above to train small ReLU networks on UCI regression datasets (Dheeru & Karra Taniskidou, 2017). The experiments address the claims that our methods for eliminating gradient variance and automatic tuning of the prior improve the performance of the final trained model. In Appendix C we present extended results to demonstrate that our method is competitive against a variety of models and inference schemes.
We compare DVI with MCVI from equation 2 with samples. The same model is used for each inference method: a single hidden layer of 50 units for each dataset considered, extending this to 100 units in the special case of the larger protein structure dataset, prot. Additionally, both methods use the same EB prior from equation 13 with a broad inverse Gamma hyperprior (, ) and an independent for each linear transformation. Each dataset is split into random training and test sets with 90% and 10% of the data respectively. This splitting process is repeated 20 times and the average test performance of each method at convergence is reported in table 2 (see also learning curves in appendix E). We see that DVI consistently outperforms MCVI, by up to 0.35 nats per data point on some datasets. The computationally efficient diagonal-DVI (dDVI) surprisingly retains much of this performance. By default we use the heteroscedastic model, and we observe that this uniformly delivers better results than a homoscedastic model (hoDVI; rightmost column in table 2) on these datasets with no overfitting issues555Note that this result is non-trivial because heteroscedastic models are more complex and could result in poorer approximate inference leading to worse test performance.
In Figure 5
we compare the performance of networks trained with manual tuning of a fixed Gaussian prior to networks trained with the automatic EB tuning. We find that the EB method consistently finds priors that produce models with competitive or significantly improved test log-likelihood relative to the best manual setting. Since this observation holds across all datasets considered, we say that our method is “robust”. Note that the EB method can outperform manual tuning because it automatically finds different prior variances for each weight matrix, whereas in the manual tuning case we search over a single hyperparameter controlling all prior variances.
We introduced two innovations to make variational inference for neural networks more robust: 1. an effective deterministic approximation to the moments of activations of a neural networks; and 2. a simple empirical Bayes hyperparameter update. We demonstrate that together these innovations make variational Bayes a competitive method for Bayesian inference in neural heteroscedastic regression models.
Beside the challenge of efficient posterior inference, for Bayesian neural networks two major issues remain open. First, how to design suitable priors for functions represented by neural network parameters? And second, what structure do the posterior distributions in neural network models have and how can this be used to improve approximate inference (Watanabe, 2009)?
International Conference on Machine Learning, pp. 1472–1481, 2016.
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
Algebraic geometry and statistical learning theory, volume 25. Cambridge University Press, 2009.
Under assumption of independence of , and , we can write:
where is a “soft ReLU”, and represent the standard Gaussian PDF and CDF, and we have introduced the dimensionless variables . These results are is sufficient to evaluate equation 14, so in the following sections we turn to each term from equation 15.
In the general case, we can use the results from section A.2 to evaluate off-diagonal . However, in our experiments we always consider the the special case where is diagonal. In this case we can write the first term in equation 15 as (reintroducing the explicit summation):
i.e. this term is a diagonal matrix with the diagonal given by the left product of the vector with the matrix . Note that can be evaluated analytically for Heaviside and ReLU activation functions:
Evaluation of requires an expression for . From equation 5, we write:
where is the quadratic form:
Here we have introduced further dimensionless variables , and . We can then rewrite equation 16 in terms of a dimensionless integral :
where the normalization constant is evaluated by integrating over and is explicitly written as , where . Now, following equation 6, we have the task to write as an asymptote plus a decaying correction . To evaluate and , we have to insert the explicit form of the non-linearity , which we do for Heaviside and ReLU functions in the next sections.
For the Heaviside activation, we can represent the integral as the shaded area under the Gaussian in the upper-left quadrant shown below. In general, this integral does not have a closed form. However, for , vanishing weight appears under the Gaussian in the upper-right quadrant, so we can write down the asymptote of the integral in this limit:
Here we performed the integral by noticing that the outer integral over marginalizes out from the bivariate Gaussian, leaving the inner integral as the definition of the Gaussian CDF. By symmetry, we also have and . We can then write down the following symmetrized form that satisfies all the limits required to qualify as an asymptote:
To compute the correction factor we evaluate the derivatives of at the origin up to second order to match the moments of for quadratic . Description of this process is found below
At the origin , we can diagonalize the quadratic form :
where . Performing this change of variables in the integral gives:
where we integrated in polar coordinates over the region in which the Heaviside function is non-zero. The angle can be found from the coordinate transform between and as666Here we use the identity :
Since , we can evaluate:
Performing a change of variables , we can write as:
where is the Heaviside function. Now, using , we have:
In addition, using , we have:
By symmetry also has zero gradient with respect to at the origin. Therefore has no linear term in .
Taking another derivative in equation 17 gives:
where we used the identity , which holds for arbitrary . In addition, we have:
and the same result holds for the second derivative w.r.t. . To complete the Hessian, it is a simple extension of previous results to show that:
Now that we have obtained derivatives of the residual up to second order we propose a correction factor of the form where is truncated at quadratic terms:
We then find the coefficients by matching () derivatives at :
This yields the expression seen in table 1 of the main text.
As in the Heaviside case, we begin by computing the asymptote of by inspecting the limit as :
Now, we construct a full 2-dimensional asymptote by symmetrizing equation 18 (using properties and as to check that the correct limits are preserved after symmetrizing):
Next we compute the correction factor . The details of this procedure closely follow those for the Heaviside non-linearity of the previous section, so we omit them here (and in practice we use Mathematica to perform the intermediate calculations). The final result is presented in table 1 of the main text.