We propose a new Bayesian Neural Net (BNN) formulation that affords variational inference for which the evidence lower bound (ELBO) is analytically tractable subject to a tight approximation. We achieve this tractability by decomposing ReLU nonlinearities into an identity function and a Kronecker delta function. We demonstrate formally that assigning the outputs of these functions to separate latent variables allows representing the neural network likelihood as the composition of a chain of linear operations. Performing variational inference on this construction enables closed-form computation of the evidence lower bound. It can thus be maximized without requiring Monte Carlo sampling to approximate the problematic expected log-likelihood term. The resultant formulation boils down to stochastic gradient descent, where the gradients are not distorted by any factor besides minibatch selection. This amends a long-standing disadvantage of BNNs relative to deterministic nets. Experiments on four benchmark data sets show that the cleaner gradients provided by our construction yield a steeper learning curve, achieving higher prediction accuracies for a fixed epoch budget.
The advent of data-flow libraries has made fast prototyping of novel neural net architectures possible by writing short and simple high-level code. Availability of these tools triggered an explosion of research output on application-specific neural net design, which in turn allowed a period of fast improvement of prediction performance in almost all fields of application where machine learning is used. At present, we can at least speculate about whether the era of large accuracy leaps is over. The next grand challenge is to solve mainstream machine learning problems with more time-efficient, energy-efficient, and interpretable models that make predictions with an attached uncertainty estimate. For industry-scale applications, we require models less vulnerable to adversarial attacks (goodfellow2015explaining; szegedy2014intriguing).
The Bayesian modeling approach provides principled solutions to all of the aforementioned next-stage challenges of machine learning. Bayesian Neural Networks (BNNs) (mackay1992apractical) lie at the intersection of deep learning and the Bayesian approach, which learns the parameters of a machine learning model via posterior inference (mackay1995probable; neal1995bayesian). A neural net having an arbitrary architecture and loss function can be turned into a BNN simply by corrupting its synaptic connection weights with noise, thereby upgrading them from deterministic parameters to latent random variables that follow a prior distribution.
Unfortunately, the non-linear activation functions at the neuron outputs render direct methods to estimate the posterior distribution of BNN weights analytically intractable. A recently established technique for approximating this posterior is Stochastic Gradient Variational Bayes (SGVB) (kingma2014auto), which suggests reparameterizing the variational distribution and then approximating the intractable expected data fit part of the ELBO by Monte Carlo (MC) integration. Sampling white noise for a cascade of random variables distorts the gradient signal, leading to unstable training. Improving the sampling procedure to reduce the variance of the gradient estimate is an active research topic. Recent advances in this vein include the local reparameterization trick (kingma2015variational) and variance reparameterization (molchanov2017variational; neklyudov2017variational).
We here present a novel BNN construction that makes variational inference possible with a closed-form ELBO, obviating the need for Monte Carlo sampling and the associated precautions required for variance reduction. Without a substantial loss of generality, we restrict the activation functions of all neurons of a net to the Rectified Linear Unit (ReLU), which has been shown to be sufficient to improve the state of the art in numerous cases (he2017mask; springenberg2015striving). We build our formulation on the fact that the ReLU function can be expressed as the product of the identity function and a Kronecker delta: $\mathrm{relu}(u) = u \cdot \mathbb{1}(u > 0)$. Exploiting the fact that we are devising a probabilistic learner, we assign a separate latent variable $z$ to the Kronecker delta factor. We relax this factor by $\mathrm{Bern}(z \mid \sigma(\beta u))$ for some large $\beta$, where $\mathrm{Bern}(\cdot \mid \cdot)$ is a Bernoulli mass function and $\sigma(\cdot)$ is the standard sigmoid. The idea is illustrated in Figure 1. We show how the asymptotic account of this relaxation converts the likelihood calculation into a chain of linear matrix operations, giving way to closed-form computation of the data fit term of the Evidence Lower Bound (ELBO) in mean-field variational BNN inference. In our construction, the data fit term lends itself as the sum of a standard neural net loss (e.g. Mean-Squared Error (MSE) or cross-entropy) on the expected prediction output and the prediction variance. A pleasant property of our construction is that the predictor variance term has a recursive form, describing how the predictor variance back-propagates through the layers of a BNN. We hence refer to our model as Variance Back-Propagation (VBP).
Unlike the present gold standard in BNNs, the closed-form ELBO of our model does not require any Monte Carlo integration step. Keeping the ELBO gradients free from multiplicative (kingma2015variational) or additive (molchanov2017variational) white noise distortion, VBP boosts learning. Experiments on four benchmark data sets and two different network architectures show that VBP accelerates learning within widely adopted training budgets. Last but not least, VBP presents a generic formulation that is directly applicable to all weight prior selections as long as their Kullback-Leibler (KL) divergence with respect to the variational distribution is available in closed form, including the common log-uniform (kingma2015variational; molchanov2017variational), normal (gal2016dropout; lobato2015probabilistic), and horseshoe (louizos17bayesian) densities.
We assume a continuous-output feed-forward neural net that uses the ReLU activation function. Decomposing the ReLU function as $\mathrm{relu}(u) = u \cdot \mathbb{1}(u > 0)$, with $\mathbb{1}(\cdot)$ being the Kronecker delta (indicator) function, the feature map vector $\mathbf{h}_n^l$ of data point $n$ at layer $l$ can be expressed as
$$\mathbf{h}_n^l = \mathbf{z}_n^l \circ \mathbf{a}_n^l, \qquad \mathbf{a}_n^l = W^l \mathbf{h}_n^{l-1}, \qquad \mathbf{z}_n^l = \mathbb{1}(\mathbf{a}_n^l > 0),$$
where $\mathbf{h}_n^{l-1}$ is the feature map vector of the same data point at layer $l-1$, the matrix $W^l$ contains the synaptic connection weights between the neurons of layers $l-1$ and $l$, and $\circ$ denotes the element-wise Hadamard product of equal-sized matrices or vectors. The vector $\mathbf{a}_n^l$ is the linear output of layer $l$. When the argument of the $\mathbb{1}(\cdot)$ function is a vector, the function applies separately to each of its entries. We refer to this factorized description of the feature map as the Identity-Delta Decomposition. Applying this decomposed expression to a feed-forward neural net with two dense layers, we get
$$f(\mathbf{x}_n; \theta) = \mathbf{z}_n^2 \circ \big( W^2 \big( \mathbf{z}_n^1 \circ (W^1 \mathbf{x}_n) \big) \big)$$
for a data point consisting of the input-output pair $(\mathbf{x}_n, y_n)$. Here, $f(\cdot\,; \theta)$ denotes the predictor function and $\theta = \{W^1, W^2\}$ is the complete set of synaptic weights. Note that given the binary outputs $\mathbf{z}_n^1$ and $\mathbf{z}_n^2$ of the delta function, the predictor output can be computed following a chain of linear operations.
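The decomposition can be sanity-checked numerically: a standard ReLU forward pass and its identity-delta factorization produce identical outputs. The sketch below uses made-up weights and numpy as a stand-in for any tensor library.

```python
import numpy as np

# Hypothetical two-layer net; W1, W2, and the input x are illustrative values.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))   # weights between input and hidden layer
W2 = rng.normal(size=(3, 5))   # weights between hidden and output layer
x = rng.normal(size=4)

def forward_relu(x):
    """Standard forward pass with a ReLU activation."""
    h1 = np.maximum(W1 @ x, 0.0)
    return W2 @ h1

def forward_decomposed(x):
    """Identity-delta decomposition: relu(a) = a * 1(a > 0)."""
    a1 = W1 @ x                      # linear output of layer 1
    z1 = (a1 > 0).astype(float)      # Kronecker-delta (indicator) output
    h1 = z1 * a1                     # Hadamard product: identity times delta
    return W2 @ h1                   # given z1, the whole chain is linear

assert np.allclose(forward_relu(x), forward_decomposed(x))
```

Conditioning on `z1` is what makes the remaining computation a chain of linear operations.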
A Bayesian Neural Net (BNN) extends a deterministic net by placing the predictor as the parameter of a likelihood function and assigning a prior distribution on the weights. Suitably for continuous-output problems, choosing the likelihood function as the normal density, we get
$$\theta \sim p(\theta), \qquad p(\mathbf{y} \mid \theta, X) = \prod_n \mathcal{N}\big(y_n \,\big|\, f(\mathbf{x}_n; \theta), \gamma^{-1} I\big)$$
for some prior density $p(\theta)$, observation precision $\gamma$, and the identity matrix $I$ with the proper size. Applying the identity-delta decomposition to an $L$-layer BNN, we obtain
$$p(\mathbf{y} \mid \theta, Z, X) = \prod_n \mathcal{N}\Big(y_n \,\Big|\, \mathbf{z}_n^L \circ \big( W^L \big( \cdots \big( \mathbf{z}_n^1 \circ (W^1 \mathbf{x}_n) \big) \big) \big), \gamma^{-1} I\Big), \qquad w^l_{jk} \sim p(w^l_{jk}),$$
where $Z$ is the collection of all step-function outputs in the model and $p(w^l_{jk})$ is a desired prior on the weight of the connection between neuron $j$ of layer $l-1$ and neuron $k$ of layer $l$. As above, the mean of the normal likelihood consists only of linear operations when conditioning on the values of all $\mathbf{z}_n^l$.
In the Bayesian context, learning consists of inferring the posterior distribution $p(\theta \mid \mathbf{y}, X)$ over the free parameters of the model, which is intractable for neural nets due to the integral in the normalizing constant. Hence we need to resort to approximations. Among the multiple possibilities, our study focuses on variational inference due to its computational efficiency. Variational inference approximates the true posterior by a proxy distribution $q_\phi(\theta)$ with a known functional form parametrized by $\phi$ and minimizes the Kullback-Leibler (KL) divergence between $q_\phi(\theta)$ and $p(\theta \mid \mathbf{y}, X)$. After a few algebraic manipulations, minimizing this KL divergence turns out to be equivalent to maximizing the functional
$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathbf{y} \mid \theta, X)\big] - \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta)\big).$$
This functional is often referred to as the Evidence Lower BOund (ELBO). The ELBO has an intuitive interpretation: the first term is responsible for the data fit, as it maximizes the expected log-likelihood of the data, and the KL term serves as a complexity regularizer by punishing unnecessary divergence of the approximate posterior from the prior. A separate expected log-likelihood term is added for each i.i.d. data point, which makes the objective by nature amenable to Stochastic Gradient Descent (SGD).
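For mean-field Gaussian weight factors under a standard-normal prior, the KL term of the ELBO is available per weight in the well-known closed form. A minimal sketch; the function name and the test values are illustrative, not the paper's:

```python
import math

def kl_gauss_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in closed form -- the per-weight
    regularizer of the ELBO for a mean-field Gaussian factor under a
    standard-normal prior."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)

# The KL vanishes exactly when the variational factor equals the prior...
assert abs(kl_gauss_std_normal(0.0, 1.0)) < 1e-12
# ...and is strictly positive otherwise.
assert kl_gauss_std_normal(0.5, 0.7) > 0.0
```

Summing this quantity over all weights gives the full KL term of the mean-field ELBO.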
We follow prior art (kingma2015variational; molchanov2017variational) and adopt the mean-field assumption that the variational distribution factorizes across individual weights. We also assume each factor to follow $q(w^l_{jk}) = \mathcal{N}(w^l_{jk} \mid \mu^l_{jk}, (\sigma^l_{jk})^2)$. We assign an individual factor to each feature map output of each data point. Rather than handcrafting the functional form of this factor, we calculate its ideal form with the other factors fixed, as detailed in Section 2.3.1. The final variational distribution is
$$q(\theta, Z) = \prod_{l=1}^{L} \prod_{j,k} \mathcal{N}\big(w^l_{jk} \,\big|\, \mu^l_{jk}, (\sigma^l_{jk})^2\big) \prod_{n} \prod_{l=1}^{L-1} \prod_{i=1}^{V_l} q(z^l_{ni}),$$
where $L-1$ is the number of hidden layers and $V_l$ denotes the number of neurons at layer $l$.
Our contribution concerns the calculation of the data fit term $\mathbb{E}_{q(\theta, Z)}[\log p(\mathbf{y} \mid \theta, Z, X)]$ in closed form, which has thus far been approximated by MC integration in previous work. For a BNN with identity-delta decomposed feature maps and linear activations denoted as $a^L_{nk}$ for output channel $k$ of the final layer $L$, the data fit term reads
$$\mathbb{E}_{q}\big[\log p(y_n \mid \theta, Z, \mathbf{x}_n)\big] = -\frac{\gamma}{2} \sum_k \Big( \big(y_{nk} - \mathbb{E}_q[a^L_{nk}]\big)^2 + \mathrm{var}_q\big(a^L_{nk}\big) \Big) + \mathrm{const},$$
where $y_{nk}$ is the observation at the $k$th output channel. In its eventual form, the first term is the MSE evaluated at the mean of the predictor and the second term is its variance. This largely overlooked form of the data fit has some interesting implications. Firstly, the variance term infers the total amount of model variance needed to account for the epistemic uncertainty in the learning task (kendall2017what). Secondly, shrinking the variance term also shrinks the expected prediction error to a certain extent, as the posterior converges from all admissible distributions to a deterministic net. Thirdly, shrinking the MSE term does not necessarily shrink the variance term. As the variance term approaches zero, we end up with Gaussian Dropout (srivastava2014dropout).
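The split of the expected squared error into an MSE at the mean plus a variance term is an exact identity, $\mathbb{E}[(y - f)^2] = (y - \mathbb{E}[f])^2 + \mathrm{var}(f)$, and can be verified on a toy distribution. A sketch with illustrative numbers:

```python
# For a random predictor output f, E[(y - f)^2] = (y - E[f])^2 + var(f).
# Exact check with a small discrete distribution (illustrative numbers).
vals  = [0.0, 1.0, 3.0]        # possible predictor outputs
probs = [0.2, 0.5, 0.3]        # their probabilities
y = 2.0                        # observation

mean_f = sum(p * v for p, v in zip(probs, vals))
var_f  = sum(p * (v - mean_f) ** 2 for p, v in zip(probs, vals))
exp_sq_err = sum(p * (y - v) ** 2 for p, v in zip(probs, vals))

assert abs(exp_sq_err - ((y - mean_f) ** 2 + var_f)) < 1e-12
```

This is why a closed form for the predictor mean and variance suffices for a closed-form data fit term.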
As our ultimate goal is to obtain the ELBO in closed form, any prior on the weights that lends itself to a closed-form KL divergence is acceptable. We have a list of attractive and well-established possibilities to choose from, including: i) the normal prior (blundell2015weight; gal2016dropout) for mere model selection, ii) the log-uniform prior (kingma2015variational; molchanov2017variational) for atomic sparsity induction and aggressive synaptic connection pruning, and iii) the horseshoe prior (louizos2017variational) for group sparsity induction and neuron-level pruning.
When all feature maps of the predictor are identity-delta decomposed and the delta factors are approximated by a Bernoulli-sigmoid mass as described in Section 2.1, the expectation of $f(\mathbf{x}_n; \theta)$ with respect to $q$ can be calculated in closed form, as $f$ consists only of linear operations, with which the expectation can interchange operation orders. This order interchangeability results in a mere forward pass where each weight takes its mean value with respect to its related factor in the approximate distribution $q$. For instance, for a Bayesian neural net with two dense layers, we have
$$\mathbb{E}_q\big[f(\mathbf{x}_n; \theta)\big] = \mathbb{E}_q[\mathbf{z}_n^2] \circ \Big( \mathbb{E}_q[W^2] \big( \mathbb{E}_q[\mathbf{z}_n^1] \circ (\mathbb{E}_q[W^1] \, \mathbf{x}_n) \big) \Big).$$
Consequently, the MSE part of the data fit term can be calculated in closed form. This interchangeability of linear operations and expectations holds as long as we keep independence between the layers; hence the MSE can be calculated in closed form also in the non-mean-field case.
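The interchangeability claim can be checked exactly on a scalar chain: with independent weight factors and a fixed binary delta output, the brute-force expectation of the chain equals a forward pass through the mean weights. A sketch with made-up discrete weight distributions:

```python
# Tiny discrete distributions for two independent scalar weights
# (illustrative values, not a trained model).
w1_vals, w1_probs = [0.5, 1.5], [0.4, 0.6]
w2_vals, w2_probs = [-1.0, 2.0], [0.5, 0.5]
x, z = 3.0, 1.0   # input and a fixed binary delta-function output

# Brute-force expectation over the joint distribution of (w1, w2).
exp_f = sum(p1 * p2 * (w2 * (z * (w1 * x)))
            for w1, p1 in zip(w1_vals, w1_probs)
            for w2, p2 in zip(w2_vals, w2_probs))

# Forward pass evaluated at the mean weights.
m1 = sum(p * w for p, w in zip(w1_probs, w1_vals))
m2 = sum(p * w for p, w in zip(w2_probs, w2_vals))
assert abs(exp_f - m2 * (z * (m1 * x))) < 1e-12
```

The equality relies only on the independence of the factors, which is exactly the mean-field (or layer-wise independence) assumption in the text.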
While we update the variational parameters of the synaptic connection weight factors along the gradient of the ELBO, for the feature map activation distribution $q(z^l_{ni})$ we choose to perform the update at the function level. Benefiting from variational calculus, we fix all other factors in $q$ except a single $q(z^l_{ni})$ and find the optimal functional form for this remaining factor. We first devise in Proposition 1 a generic approach for calculating variational update rules of this sort. The proofs of all propositions can be found in the Supplementary Material.
Proposition 1. Consider a Bayesian model including the generative process excerpt
$$z \mid u \sim \mathrm{Bern}\big(z \mid \sigma(\beta \, g(u))\big)$$
for some arbitrary function $g(\cdot)$. If the variational inference of this model is to be performed with an approximate distribution $q$ (note that a factor $q(z)$ might or might not exist depending on whether $z$ is latent or observed), the optimal closed-form update for $q(z)$ is
$$q(z) \propto \exp\Big( \mathbb{E}_{q \setminus q(z)}\big[\log p(z, \cdot)\big] \Big).$$
Following Proposition 1, the variational update of $q(z^l_{ni})$ turns out to be $\mathrm{Bern}\big(z^l_{ni} \,\big|\, \sigma(\beta \, \mathbb{E}_q[a^l_{ni}])\big)$. Note that in the asymptotic limit, the expectation of a delta function is the binary outcome of the condition it tests. Because $\mathbb{E}_q[a^l_{ni}]$ is an intermediary step while computing $\mathbb{E}_q[f(\mathbf{x}_n; \theta)]$, the updates can be done concurrently with this computation. A side benefit of the resultant $q(z^l_{ni})$ and the general structure of $q$ is that the complicated expected log-Bernoulli term can be calculated analytically subject to a controllable degree of relaxation.
Proposition 2. For the model and the inference scheme in Proposition 1 with a normal factor on $u$, in the relaxed delta function formulation with some finite $\beta$, the expected log-Bernoulli expression is (i) analytically tractable, and (ii) its magnitude is proportional to the mass of $q(u)$ that falls on the opposite side of the line $g(u) = 0$ with respect to the expectation $\mathbb{E}_q[g(u)]$.
Note that the approximation is extremely tight even for decently small $\beta$ values. Hence, we can rescue the expected KL term from diverging to infinity by setting $\beta$ to a value such as 10, without making a tangible change to the behavior of the activation function. Although this approximation provides an analytical solution due to part (i) of Proposition 2, for practical purposes we benefit from part (ii) and remove this term from the ELBO. The magnitude of this term depends on how tightly the mass of $q(u)$ is concentrated around its expectation. This soft constraint on the variance is already enforced by the variance term in the data fit.
The final step in the closed-form calculation of the ELBO is the predictor variance term. Let us recall that the likelihood function of our BNN involves a linear chain of operations containing only sums and products of random variables. Two identities concerning the variances of products and sums of independent random variables $X$ and $Y$ are
$$\mathrm{var}(XY) = \mathrm{var}(X)\,\mathrm{var}(Y) + \mathrm{var}(X)\,\mathbb{E}[Y]^2 + \mathbb{E}[X]^2\,\mathrm{var}(Y), \qquad \mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).$$
Applying these well-known identities to the linear activation output $a^l_{nk}$ of layer $l$, we attain
$$\mathrm{var}\big(a^l_{nk}\big) = \sum_j \Big( \mathrm{var}\big(w^l_{jk}\big)\,\mathbb{E}\big[h^{l-1}_{nj}\big]^2 + \mathrm{var}\big(w^l_{jk}\big)\,\mathrm{var}\big(h^{l-1}_{nj}\big) + \mathbb{E}\big[w^l_{jk}\big]^2\,\mathrm{var}\big(h^{l-1}_{nj}\big) \Big),$$
where, since $h^{l-1}_{nj} = z^{l-1}_{nj} a^{l-1}_{nj}$ with binary $z^{l-1}_{nj}$,
$$\mathrm{var}\big(h^{l-1}_{nj}\big) = \mathbb{E}\big[z^{l-1}_{nj}\big]\Big(\mathrm{var}\big(a^{l-1}_{nj}\big) + \mathbb{E}\big[a^{l-1}_{nj}\big]^2\Big) - \mathbb{E}\big[z^{l-1}_{nj}\big]^2\,\mathbb{E}\big[a^{l-1}_{nj}\big]^2.$$
This formula contains two quantities that are yet to be evaluated. One is $\mathbb{E}_q[z^{l-1}_{nj}]$, the calculation of which is discussed in Equation 2. The other is the variance of one of the linear activation outputs of the previous layer. Hence, we arrive at a recursive description of the model variance. Following this formula, we can express the variance at layer $l$ as a function of the variance at layer $l-1$, then that as a function of the variance at layer $l-2$, and repeat this procedure down to the observed input layer, where the variance is zero. For noisy inputs, a desired homoscedastic or heteroscedastic noise model can be injected into the input layer, which would neither break the recursion nor invalidate the formula. As this formula reveals how variance back-propagates through the layers, we refer to our construction as Variance Back-Propagation (VBP).
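One step of this variance propagation can be validated exactly against a brute-force computation over a tiny joint distribution. The sketch below (illustrative numbers, not the paper's setup) applies the product and sum identities to a single linear activation $a = \sum_j w_j h_j$ with independent $w_j$ and $h_j$:

```python
import itertools

# Tiny discrete distributions: (values, probabilities) per variable.
w_dists = [([0.5, 1.0], [0.3, 0.7]), ([-1.0, 1.0], [0.5, 0.5])]
h_dists = [([0.0, 2.0], [0.4, 0.6]), ([1.0, 3.0], [0.8, 0.2])]

def mean(vals, probs):
    return sum(p * v for v, p in zip(vals, probs))

def var(vals, probs):
    m = mean(vals, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(vals, probs))

# Closed form: var(w h) = var(w)var(h) + var(w)E[h]^2 + E[w]^2 var(h),
# summed over independent terms.
var_a = sum(var(*w) * var(*h) + var(*w) * mean(*h) ** 2 + mean(*w) ** 2 * var(*h)
            for w, h in zip(w_dists, h_dists))

# Brute force over the joint distribution of all four variables.
support = [list(zip(v, p)) for v, p in w_dists + h_dists]
m1, m2 = 0.0, 0.0   # first and second moments of a
for combo in itertools.product(*support):
    prob = 1.0
    for _, p in combo:
        prob *= p
    a = combo[0][0] * combo[2][0] + combo[1][0] * combo[3][0]  # w0*h0 + w1*h1
    m1 += prob * a
    m2 += prob * a * a

assert abs(var_a - (m2 - m1 ** 2)) < 1e-9
```

Iterating this step layer by layer, with the input variance set to zero, yields the full back-propagated predictor variance.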
As a linear operation, convolution is directly applicable to the VBP formulation by replacing all sums between weights and feature maps with sliding windows. The same modification suffices for the calculation of the variance terms. In VBP, one layer affects the next only via sums and products of variables, which is not the case for max-pooling. Even though convolutions alone have been found sufficient for building state-of-the-art architectures (springenberg2015striving), we show in the Supplementary Material with Proposition 3 that max-pooling is also directly applicable to VBP by extending Proposition 1.
For binary classification, we treat $f(\mathbf{x}_n; \theta)$ as a latent decision margin and squash it with a binary-output likelihood $p(y_n \mid f(\mathbf{x}_n; \theta))$. Following hensman2013gaussian, the log-marginal likelihood of the resultant model can be bounded from below by a new ELBO for classification. Choosing the likelihood as $\mathrm{Bern}\big(y_n \mid \Phi(f(\mathbf{x}_n; \theta))\big)$, where $\Phi(\cdot)$ is the Probit function, and approximating the Probit by a suitably scaled sigmoid function, the new ELBO boils down to three terms: i) the negative binary cross-entropy loss evaluated at the mean of the predictor, ii) the variance of the predictor, and iii) the KL regularization term. Note that in both the regression and classification cases, the first term is a standard loss evaluated at the mean of the predictor and the remaining two terms are identical. The extension to multiple classes follows similar lines, which we skip due to space constraints.
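The Probit-to-sigmoid substitution is a standard tight approximation: $\Phi(x) \approx \sigma(kx)$ with $k \approx 1.702$ keeps the absolute error below 0.01 everywhere. A quick numerical check; the scaling constant is the classic logistic approximation, not a value specified in the text:

```python
import math

def probit(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Maximum absolute deviation over a dense grid in [-6, 6].
max_err = max(abs(probit(x) - sigmoid(1.702 * x))
              for x in [i / 100.0 for i in range(-600, 601)])
assert max_err < 0.01
```

This is what justifies swapping the Probit likelihood for a sigmoid and landing on a binary cross-entropy data fit term.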
Several approaches have been introduced for approximating the intractable posterior of BNNs. One line is model-based Markov Chain Monte Carlo (MCMC), such as Hybrid (Hamiltonian) Monte Carlo (neal2010mcmc) and Stochastic Gradient Langevin Dynamics (welling2011bayesian). Later work adapted HMC to stochastic gradients (chen14stochastic) by quantifying the entropy overhead stemming from the stochasticity of minibatch selection. This study has made HMC applicable to large data sets and modern neural net architectures (saatchi17bayesian). Follow-up studies on merging the strengths of HMC and SGLD (girolami2011riemann) also exist.
While variational inference has been actively used for a wide spectrum of models, its successful application to deep neural nets has taken place only recently. The earliest study to infer a Bayesian neural net with variational inference (hinton1993keeping) was applicable to only one hidden layer. This limitation was overcome only recently (graves2014practical) by approximating the intractable expectations via numerical integration. Further scalability has been achieved after SGVB (kingma2014auto), a.k.a. Stochastic Back-Propagation (rezende2014stochastic), was made applicable to BNN inference using weight reparameterization (blundell2015weight).
Dropout has strong connections to variational inference of BNNs (srivastava2014dropout). Gal et al. (gal2016dropout) developed a theoretical link between a dropout network and a deep Gaussian process (damianou2013deep) inferred by variational inference. It has later been shown that extending the Bayesian model selection interpretation of Gaussian Dropout (kingma2015variational) with a log-uniform prior on the model weights leads to a BNN inferred by SGVB.
A fundamental step in the reduction of ELBO gradient variance was taken by Kingma et al. (kingma2015variational) with local reparameterization, which suggests taking the Monte Carlo integrals by sampling the linear activations rather than the weights. Further variance reduction has been achieved by defining the variances of the variational distribution factors as free parameters and the dropout rate as a function of them (molchanov2017variational). Theoretical treatments of the same problem have also been studied recently (miller2017variational; roeder2017variational).
SGVB was introduced initially for fully factorized variational distributions, which provide limited support for the posteriors that can feasibly be inferred. Strategies for improving the approximation quality of variational BNN inference include structured versions of dropout via matrix normals (louizos2016structured) and repetitive invertible transformations of latent variables (Normalizing Flows) (rezende2015variational), including their application to variational dropout (louizos2017variational). Lastly, there is active research on enriching variational inference using its interpolative connection to expectation propagation (hernandez2016black; li2016renyi; li2017variational).
We evaluate the proposed model on a variety of data sets and settings. Unless otherwise indicated, we rely on our own implementation of each method to ensure comparability and fairness. The source code to replicate the experiments is available online at https://github.com/manuelhaussmann/vbp. Details on the hyperparameters can be found in the Supplementary Material.
We first compare against Bayes by Backprop (BBB) on MNIST classification with densely connected nets having two hidden layers of 400, 800, or 1200 units. The prior placed over the weights is either a normal prior or a scale mixture of two normal distributions whose combination allows placing most of the mass around zero while still allowing for heavy tails. We only report results for VBP with a normal prior, avoiding the expensive grid search over hyperparameters that blundell2015weight mention; it performs better than BBB with a normal prior, coming close to the mixture prior performance (see Table 1).
| Model (Test Error in %) | 400 | 800 | 1200 |
| --- | --- | --- | --- |
| BBB, Normal prior | | | |
| BBB, Mixture prior | | | |
| VBP, Normal prior | | | |
Our main experiment is an evaluation on four data sets: MNIST (lecun1998gradient), SVHN (netzer2011reading), Cifar-10, and Cifar-100 (krizhevsky2009learning). A commonly reported problem of BNNs is their inability to cope with increasing network depth without tricks such as initialization from a pretrained deterministic net or annealing of the KL term in the ELBO. In order to focus our analysis on the steepness of the learning curves, we avoid such tricks and stick to a small LeNet5-sized architecture consisting of two convolutional and two fully connected layers, where the two Cifar data sets get more filters per layer. As mentioned in Section 2.3.3, VBP can handle max-pooling layers, but they require careful tracking of indices between the data fit and variance terms, which comes at some extra runtime cost in present deep learning libraries. We provide a reference implementation of this, but in the experiments we stick to strided convolutions, following the recent trend of "all-convolutional" nets (redmon2018yolov3; springenberg2015striving; yu2017dilated).
We compare VBP with Variational Dropout (VarOut) as introduced by kingma2015variational, in the improved version of molchanov2017variational. This approach places a log-uniform prior over the weights and uses a factorized normal distribution as the variational posterior. (hron2017variational recently showed that the improper log-uniform prior usually leads to an improper posterior; this can be avoided (neklyudov2018variance) by using a Student-$t$ distribution with a tiny degrees-of-freedom parameter, which approximates the log-uniform.) The KL divergence between these two distributions is evaluated via a tight approximation. We use the same setup and KL approximation for VBP. The models are trained with Adam (kingma2014adam) with the default hyperparameters and a learning rate that is linearly reduced to zero over 100 epochs. The learning curves are shown in Figure 2; the final performance on each data set is summarized in Table 2. Avoiding the sampling step leads to a clear improvement: VBP achieves a lower error rate than VarOut consistently on all four data sets.
The log-uniform prior is motivated by its sparsification properties. molchanov2017variational prune all weights with $\log \alpha_{jk} \geq 3$, where $\alpha_{jk} = \sigma^2_{jk} / \mu^2_{jk}$. This roughly translates to dropping all weights with a binary dropout rate larger than 0.95. For VBP, this strict pruning rule tends to be suboptimal. Instead, we follow louizos17bayesian and place the threshold based on a visual evaluation of the histogram of the $\log \alpha$ values, which allows us to rank the weights and trade sparsity for accuracy. See Table 2 for the results, where the fixed pruning (fixed) strategy tends to larger sparsity than the variable (var) one, but also a greater cost in accuracy. See the Supplementary Material for details.
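The log-$\alpha$ ranking described above can be sketched as follows; the threshold, variable names, and values are illustrative rather than the exact procedure used in the experiments:

```python
import numpy as np

# Per-weight variational means and standard deviations (made-up values).
mu    = np.array([1.0, 0.01, -0.5, 0.002])
sigma = np.array([0.1, 0.5,   0.1,  0.3])

# Dropout-rate proxy: alpha = sigma^2 / mu^2; large log(alpha) means the
# weight is dominated by noise and is a pruning candidate.
log_alpha = np.log(sigma ** 2) - np.log(mu ** 2)

threshold = 3.0                 # illustrative; chosen from the histogram
mask = log_alpha < threshold    # True = keep the weight
pruned_mu = np.where(mask, mu, 0.0)
sparsity = 1.0 - mask.mean()    # fraction of weights removed

assert mask.tolist() == [True, False, True, False]
```

Sorting weights by `log_alpha` is what allows trading sparsity against accuracy by moving the threshold.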
As a final experiment, we allowed the VarOut variant to estimate the expected log-likelihood with multiple samples instead of one. However, similar to results reported in the context of VAEs (burda2015importance), it could only marginally improve its performance with an increasing number of samples (up to 30), while drastically increasing the computational requirements. We conjecture that an importance weighting scheme similar to the one burda2015importance introduce would be required to resolve this. See Table 3 for the results.
Our experiments demonstrate that VBP follows a steeper learning curve than its sampling-based counterpart. The consistency of the performance improvement illustrates the price SGVB has to pay for keeping the model probabilistic. As Table 2 shows, cleaning the gradients of sampling noise improves not only the prediction performance but also the model selection effectiveness. VBP is able to discover a less sparse but more effective architecture than VarOut, although it still prunes the vast majority of the synaptic connections in three data sets, and more than half on the fourth.
Following the No-Free-Lunch theorem, our closed-form ELBO comes at the expense of a number of restrictions, such as a fully factorized approximate posterior, sticking to ReLU activations, and inapplicability of Batch Normalization (ioffe2015batch). An immediate implication of this work is to explore ways to relax the mean-field assumption and incorporate normalizing flows without sacrificing the closed-form solution. Because Equations 3 and 4 extend easily to dependent variables after adding the covariance of each variable pair, our formulation is applicable to structured variational inference schemes without major theoretical obstacles. We speculate that the same should be true for normalizing flows after designing the transformation functions carefully.
Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.