## I Introduction

Nowadays, due to excellent results obtained in many fields of applied machine learning, including computer vision and natural language processing [1], the popularity of deep learning is increasing rapidly. One of the reasons can surely be found in the fact that Krizhevsky et al. [2] outperformed the competitors in the ImageNet Large Scale Visual Recognition Challenge 2012 by proposing a convolutional neural network (CNN) named AlexNet. While AlexNet includes eight layers, more recent architectures for image classification go even deeper [3, 4]. It is well known that a feed-forward network with merely one hidden layer can approximate a broad class of functions arbitrarily well. A mathematically more profound formulation of this statement, the so-called universal approximation theorem, was proven by Hornik et al. [5]. However, Liang and Srikant [6] could show that deep nets require exponentially fewer parameters than shallow ones in order to achieve a given degree of approximation. Possible applications of deep nets for computer vision include medical imaging, psychology, the automotive industry, finance and life sciences [7, 8, 9, 10, 11, 12].

Despite the large and ever-increasing number of real-world use cases, deep learning comes along with two restrictions which still limit its areas of application. The first restriction is that deep networks require a large amount of training data; otherwise they are prone to overfitting. The reason for this is the huge number of parameters neural nets hold. Although deep nets require exponentially fewer parameters than shallow ones, the remaining number is nevertheless very high. Thus, in many potential fields of application, where such an amount of data cannot be provided, deep learning is of limited use, or often even cannot be used. To counteract this problem, commonly diverse regularization techniques are applied. Besides classical approaches, such as the penalization of the L2 norm or the L1 norm, stochastic regularization methods gain increasing attention. For instance, dropout [13] and dropconnect [14] count among these stochastic techniques. The first one randomly sets the activation of non-output neurons to zero during network training, and the second one randomly sets network weights to zero. While dropout classically is interpreted as an efficient way of performing model averaging with neural networks, Gal and Ghahramani

[15] as well as Kingma et al. [16] recently showed that it can also be considered as an application of Bayesian statistics.

The second restriction deep networks struggle with is that prediction uncertainty cannot be measured. Especially in the medical field or for self-driving vehicles it is essential that the prediction uncertainty can be determined [17]. In these areas of application a model which predicts on average quite well is not good enough. One has to know whether the model is certain in its predictions or not, such that in the case of high uncertainty a human can decide instead of the machine. Please be aware of the fact that the probabilities obtained when running a deep net for a classification task should not be interpreted as the confidence of the model. As a matter of fact, a neural net can guess randomly while returning a high class probability [18].

A possible strategy to overcome the restrictions classical deep learning has to deal with is applying Bayesian statistics. In so-called Bayesian deep learning the network parameters are treated like random variables and are not considered to be fixed deterministic values. In particular, an a priori distribution is assigned to them, and updating the prior knowledge after observing training data results in the so-called posterior distribution. The uncertainty in the network parameters can be directly translated into uncertainty about predictions. Further, Bayesian methods are robust to overfitting because of the built-in regularization due to the prior. Buntine and Weigend

[19] were among the first who presented approximate Bayesian methods for neural nets. Two years later Hinton and van Camp [20] already proposed the first variational methods. Variational methods try to approximate the true posterior distribution with another parametric distribution, the so-called variational distribution. The approximation takes place due to an optimization of the parameters of the variational distribution. They followed the idea that there should be much less information in the weights than in the output vectors of the training cases in order to allow for a good generalization of neural networks. Denker and LeCun [21] as well as MacKay [22] used the Laplace approximation in order to investigate the posterior distributions of neural nets. Neal [23] proposed and investigated hybrid Monte Carlo training for neural networks as a less limited alternative to the Laplace approximation. However, the approaches mentioned up to now are often not scalable for modern applications which go along with highly parameterized networks. Graves [24] was the first to show how variational inference can be applied to modern deep neural networks due to Monte Carlo integration. He used a Gaussian distribution with a diagonal covariance matrix as variational distribution. Blundell et al.

[25] extended and improved the work of Graves [24] and also used a diagonal Gaussian to approximate the posterior. As already mentioned, Gal and Ghahramani [15] as well as Kingma et al. [16] showed that using the regularization technique dropout can also be considered as variational inference.

The independence assumptions going along with variational inference via diagonal Gaussians (complete independence of network parameters), or also going along with variational inference according to dropout (independence of neurons), are restrictive. Permitting an exchange of information between different parts of neural network architectures should lead to more accurate uncertainty estimates. Louizos and Welling [26] used a distribution over random matrices in order to define the variational distribution. Thus, they could reduce the number of variance-related parameters that have to be estimated and further allow for an information sharing between the network weights. Note that in the diagonal Gaussian approach assigning one variance term to each random weight and one variance term to each random bias doubles the number of parameters to optimize in comparison to frequentist deep learning. Consequently, network training becomes more complicated and computationally expensive.

It should be mentioned that variational Bayes is just a specific case of local $\alpha$-divergence minimization. According to Amari [27] the $\alpha$-divergence between two densities $p$ and $q$ is given by $D_\alpha(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha} q(x)^{1-\alpha} \, dx\right)$. Thus, the $\alpha$-divergence converges for $\alpha \to 0$ to the KL divergence [28] which is typically used in variational Bayes. It has been shown [29, 30] that an optimal choice of $\alpha$ is task specific and that non-standard settings, i.e. settings with $\alpha \neq 0$, can lead to better prediction results and uncertainty estimates.

However, in this paper we do not propose an optimal choice of $\alpha$. Rather, for the classical case of variational Bayes we will propose a good and easy to interpret variational distribution. For this task recent work from Posch et al. [31] is extended. They used a product of Gaussian distributions with specific diagonal covariance matrices in order to define the variational distribution. In particular, the a posteriori uncertainty of the network parameters is represented per network layer and depends on the estimated parameter expectation values. Therefore, only few additional parameters have to be optimized compared to classical deep learning, and the parameter uncertainty itself can easily be analyzed per network layer. We extend this distribution by allowing network parameters to be correlated with each other. In particular, the diagonal covariance matrices are replaced with tridiagonal ones. Each tridiagonal matrix is defined in such a way that the correlations between neighbouring parameters are identical. This way of treating network layers as units in terms of dependence allows for an easy analysis of the dependence between network parameters. Moreover, again only few additional parameters compared to classical deep learning need to be optimized, which guarantees that the difficulty of the network optimization does not increase significantly. Note that our extension allows for an exchange of information between different parts of the network and therefore should lead to more reliable uncertainty estimates. We have evaluated our approach on the basis of the popular benchmark datasets MNIST [32] and CIFAR-10 [33]. The promising results can be found in Section IV.

## II Background

In this section, based on previous work [20, 24, 18, 34, 31], we briefly discuss how variational inference can be applied in deep learning. Since we are particularly interested in image classification, the focus will be on networks designed for classification tasks. Note that the methodology also holds for regression models after some slight modifications.

Let $\mathbf{w}$ denote the random vector covering all parameters (weights and biases) of a given neural net. Further, let $p(\mathbf{w})$ denote the density used to define a priori knowledge regarding $\mathbf{w}$. According to the Bayesian theorem the posterior distribution of $\mathbf{w}$ is given by the density

$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \, p(\mathbf{w})}{\int p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \, p(\mathbf{w}) \, d\mathbf{w}},$$

where $\mathbf{x} = \{x_1, \dots, x_n\}$ denotes a set of training examples and $\mathbf{y} = \{y_1, \dots, y_n\}$ holds the corresponding class labels. Note that the probability $p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})$ is given by the product $\prod_{i=1}^{n} p(y_i \mid x_i, \mathbf{w})$ in accordance with the classical assumptions on stochastic independence and modeling in deep learning for classification. The integral in the above representation of the posterior is commonly intractable due to its high dimension. Variational inference aims at approximating the posterior with another parametric distribution, the so-called variational distribution $q_{\boldsymbol{\theta}}(\mathbf{w})$. To this end the so-called variational parameters $\boldsymbol{\theta}$ are optimized by minimizing the Kullback Leibler divergence (KL divergence)

$$\mathrm{KL}\bigl(q_{\boldsymbol{\theta}} \,\|\, p(\cdot \mid \mathbf{x}, \mathbf{y})\bigr) = \int q_{\boldsymbol{\theta}}(\mathbf{w}) \log \frac{q_{\boldsymbol{\theta}}(\mathbf{w})}{p(\mathbf{w} \mid \mathbf{x}, \mathbf{y})} \, d\mathbf{w}$$

between the variational distribution and the posterior. Although the KL divergence is no formal distance measure (it does not satisfy some of the requested axioms) it is a common choice to measure the similarity of probability distributions.
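As an illustration (not part of the paper's method): for univariate normal distributions the KL divergence has a well-known closed form, and evaluating it in both directions makes the asymmetry mentioned above concrete. The function name below is our own.

```python
import math

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form KL( N(mu0, s0^2) || N(mu1, s1^2) ) for univariate normals."""
    return math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

# identical distributions have zero divergence ...
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0
# ... but swapping the arguments changes the value, so KL is not symmetric
print(kl_gauss(0.0, 1.0, 1.0, 2.0))
print(kl_gauss(1.0, 2.0, 0.0, 1.0))
```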

Since the posterior distribution is unknown, the divergence cannot be minimized directly. Anyway, the minimization of $\mathrm{KL}(q_{\boldsymbol{\theta}} \,\|\, p(\cdot \mid \mathbf{x}, \mathbf{y}))$ is equivalent to the minimization of the so-called negative log evidence lower bound

$$-\mathrm{ELBO}(\boldsymbol{\theta}) = -\mathbb{E}_{q_{\boldsymbol{\theta}}}\bigl[\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\bigr] + \mathrm{KL}\bigl(q_{\boldsymbol{\theta}} \,\|\, p\bigr).$$

Thus, the optimization problem reduces to the minimization of the sum of the expected negative log likelihood and the KL divergence between the variational distribution and the prior with respect to $\boldsymbol{\theta}$. Inspired by stochastic gradient descent, it is not unusual to approximate the expected values in this objective via Monte Carlo integration with one sample during network training. Note that the re-sampling in each training iteration guarantees that a sufficient amount of samples is drawn overall. Commonly, mini-batch gradient descent is used for optimization in deep learning. To take account of the resulting reduction of the number of training examples used in each iteration of the optimization, the objective function has to be rescaled. Thus, in the $t$-th iteration the function to minimize is given by

$$\mathcal{L}_t(\boldsymbol{\theta}) = -\frac{n}{M} \sum_{i \in I_t} \log p(y_i \mid x_i, \hat{\mathbf{w}}) + \mathrm{KL}\bigl(q_{\boldsymbol{\theta}} \,\|\, p\bigr),$$

where $\hat{\mathbf{w}}$ denotes a sample from $q_{\boldsymbol{\theta}}$, $M$ denotes the mini-batch size, and $\{(x_i, y_i) \mid i \in I_t\}$ denotes the mini-batch itself.
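As a hedged sketch of this rescaled objective, the following fragment evaluates a one-sample Monte Carlo estimate for a toy linear softmax "network" with a diagonal Gaussian variational distribution (chosen purely for brevity; the variational family proposed in this paper is tridiagonal, not diagonal). All function and variable names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss(mu, sigma, prior_sigma=1.0):
    """KL( N(mu, diag(sigma^2)) || N(0, prior_sigma^2 I) ), summed over all weights."""
    return np.sum(np.log(prior_sigma / sigma)
                  + (sigma ** 2 + mu ** 2) / (2.0 * prior_sigma ** 2) - 0.5)

def minibatch_objective(mu, rho, xb, yb, n_total):
    """One-sample Monte Carlo estimate of the rescaled negative ELBO on a mini-batch."""
    sigma = np.log1p(np.exp(rho))                    # softplus keeps sigma positive
    w = mu + sigma * rng.standard_normal(mu.shape)   # reparameterized weight sample
    logits = xb @ w                                  # toy linear "network"
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(yb)), yb].mean()  # cross-entropy on the batch
    # rescaling: the batch mean times n approximates the full-data likelihood term
    return n_total * nll + kl_diag_gauss(mu, sigma)
```

Resampling the weights inside the function mirrors the per-iteration re-sampling described above.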

Summing up, optimizing a Bayesian neural net is quite similar to the optimization of a classical one. While in frequentist deep learning it is common to penalize the Euclidean norm of the network parameters in terms of regularization, in Bayesian deep learning deviations of the variational distribution from the prior are penalized. In principle, the same loss function (cross-entropy loss) is minimized, but with the crucial difference that network parameters have to be sampled since they are random.

In Bayesian deep learning predictions are based on the posterior predictive distribution, i.e. the distribution of a class label $y^*$ for a given example $x^*$ conditioned on the observed data $(\mathbf{x}, \mathbf{y})$. The distribution can be approximated via Monte Carlo integration:

$$p(y^* \mid x^*, \mathbf{x}, \mathbf{y}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{\mathbf{w}}_t),$$

where $\hat{\mathbf{w}}_1, \dots, \hat{\mathbf{w}}_T$ denote samples from $q_{\boldsymbol{\theta}}$. Therefore, the class of an object is predicted by computing multiple network outputs with parameters sampled from the variational distribution. Averaging the output vectors results in an estimate of the posterior predictive distribution, such that the a posteriori most probable class finally serves as prediction.
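The prediction procedure just described can be sketched as follows, assuming the sampled softmax outputs for one input have already been computed; the empirical quantiles of the same samples additionally yield credible intervals for the class probabilities. Function and variable names are illustrative.

```python
import numpy as np

def predict_with_uncertainty(prob_samples, alpha=0.05):
    """prob_samples: array of shape (T, K) holding T sampled softmax outputs of
    the network for one input, each computed with parameters drawn from the
    variational distribution."""
    mean_probs = prob_samples.mean(axis=0)          # MC estimate of the posterior predictive
    lower = np.quantile(prob_samples, alpha / 2, axis=0)      # per-class lower bound
    upper = np.quantile(prob_samples, 1 - alpha / 2, axis=0)  # per-class upper bound
    predicted_class = int(np.argmax(mean_probs))    # a posteriori most probable class
    return predicted_class, mean_probs, (lower, upper)

# simulate T = 1000 sampled output vectors concentrated on class 0
rng = np.random.default_rng(0)
samples = rng.dirichlet([8.0, 1.0, 1.0], size=1000)
cls, mean_probs, (lo, hi) = predict_with_uncertainty(samples)
print(cls)   # 0
```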

The a posteriori uncertainty in the network parameters can directly be translated into uncertainty about the random network output and thus the posterior probability of an object $x^*$ belonging to class $k$. Therefore, at first multiple samples are drawn from the variational distribution $q_{\boldsymbol{\theta}}$. Then the corresponding network outputs are determined. Finally, the empirical $\alpha/2$ and $1-\alpha/2$ quantiles of these outputs provide an estimate of the $100(1-\alpha)\%$ credible interval for the probability $p(y^* = k \mid x^*, \mathbf{x}, \mathbf{y})$.

## III Methodology

In this section we describe our novel approach. At first we give a formal definition of the variational distribution we use to approximate the posterior and, additionally, we propose a normal prior. Moreover, we report the derivatives of the approximation of the negative log evidence lower bound (see Section II) with respect to the variational parameters, i.e. the learnable parameters. Finally, we present the pseudocode which is the basis of our implementation of the proposed method.

### III-A Variational Distribution and Prior

Let $\mathbf{w}^{(i)} = (w^{(i)}_1, \dots, w^{(i)}_{n_i})^T$ denote the random weights of layer $i$. Further, let $\mathbf{b}^{(i)} = (b^{(i)}_1, \dots, b^{(i)}_{m_i})^T$ denote the corresponding random bias terms. The integers $n_i$ and $m_i$ denote the number of weights and the number of biases in layer $i$, respectively.

As already mentioned in Section I, we define the variational distribution as a product of multivariate normal distributions with tridiagonal covariance matrices. Applying variational inference to Bayesian deep learning presupposes that samples from the variational distribution can be drawn during network training as well as at the stage in which new samples are used for prediction. Especially at the training phase, it is essential that the random sampling can be reduced to sampling from a (multivariate) standard normal distribution and an appropriate affine-linear transformation of the drawn samples based on the learnable parameters. A direct sampling from the non-trivial normal distributions would mask the variational parameters and thus make it impossible to optimize them by gradient descent. Provided that a covariance matrix is positive definite (in general covariance matrices are only positive semidefinite) there exists a unique Cholesky decomposition of the matrix which can be used for this task. Note that for each real-valued symmetric positive definite square matrix $\Sigma$ a unique decomposition (Cholesky decomposition) of the form

$$\Sigma = L L^T$$

exists, where $L$ is a lower triangular matrix with real and positive diagonal entries. Thus, we are interested in symmetric tridiagonal matrices which always stay positive definite no matter how the corresponding learnable parameters are adjusted during network training. The first thing required is a criterion for the positive definiteness of tridiagonal matrices appropriate for our purposes. Andelić and Fonseca [35] gave the following sufficient condition for positive definiteness of tridiagonal matrices: Let $A$ be a symmetric tridiagonal matrix with positive diagonal entries $a_1, \dots, a_n$ and off-diagonal entries $b_1, \dots, b_{n-1}$. If

$$b_i^2 \le \frac{a_i a_{i+1}}{4 \cos^2\left(\frac{\pi}{n+1}\right)}, \qquad i = 1, \dots, n-1, \tag{1}$$

then $A$ is positive definite.

Consider now a lower bidiagonal matrix $L$ with diagonal entries $l_1, \dots, l_n$ and subdiagonal entries $k_1, \dots, k_{n-1}$, and define $\Sigma := L L^T$. If $\Sigma$ satisfies condition (1) and has positive diagonal entries, the matrix $L$ defines its Cholesky decomposition, and further $\Sigma$ is a valid covariance matrix since every real, symmetric, and positive semidefinite square matrix defines a valid covariance matrix. As in the work of Posch et al. [31] we define the variances as multiples of the corresponding expectation values, denoted by

(2)

(3)

where the proportionality factor is positive. Defining the variances proportional to the expectation values allows for a useful specification of them. This specification requires, besides the expectation values, only one additional variational parameter. Moreover, we want the correlations to be identical, which leads to the following covariances

(4)

(5)

for . By rearranging Equations and one obtains a recursive formula for the elements of the matrix :

(6)

(7)

(8)

Note that the preceding equation is, for instance, satisfied by

(9)

By defining the $l_j$ this way one does not necessarily end up with the Cholesky decomposition, which assumes the diagonal elements of $L$ to be positive. Taking the absolute values of the $l_j$ according to Equation (9) would result in the Cholesky decomposition, but this is not necessary for our purposes and therefore not done. Thus, the elements of $L$ are recursively defined as

(10)

(11)

(12)

(13)

Note that the matrix $\Sigma = L L^T$ defined by Equations (10)-(13) satisfies condition (1) iff

(14)

(15)

(16)

Thus, provided that condition (1) holds, there exists a unique Cholesky decomposition of $\Sigma$, which again guarantees that the elements of $L$ according to Equations (10)-(13) are well defined and, further, also that $\Sigma$ itself is well defined.
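The sufficient condition can be checked numerically. The sketch below (function and variable names are our own) tests the criterion $b_i^2 \le a_i a_{i+1} / (4\cos^2(\pi/(n+1)))$ for a random tridiagonal matrix with positive diagonal, and cross-checks positive definiteness via the eigenvalues.

```python
import numpy as np

def tridiag(a, b):
    """Build a symmetric tridiagonal matrix from diagonal a and off-diagonal b."""
    return np.diag(a) + np.diag(b, 1) + np.diag(b, -1)

def sufficient_pd(a, b):
    """Sufficient condition for positive definiteness of a symmetric tridiagonal
    matrix with positive diagonal a and off-diagonal b (cf. condition (1)):
    b_i^2 <= a_i * a_{i+1} / (4 * cos(pi/(n+1))^2) for all i."""
    n = len(a)
    bound = 1.0 / (4.0 * np.cos(np.pi / (n + 1)) ** 2)
    return bool(np.all(b ** 2 <= bound * a[:-1] * a[1:]))

rng = np.random.default_rng(0)
a = rng.uniform(1.0, 2.0, size=8)          # positive diagonal entries
b = 0.4 * np.sqrt(a[:-1] * a[1:])          # off-diagonals strictly inside the bound
A = tridiag(a, b)
print(sufficient_pd(a, b))                  # True
print(np.linalg.eigvalsh(A).min() > 0.0)    # True: A is indeed positive definite
```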

Using the considerations above, the variational distribution can finally be defined as follows. In a first step lower bidiagonal matrices $L^{(i)}$ are specified for each layer $i$:

(17)

(18)

(19)

(20)

(21)

The variational distribution of the weights of the $i$-th layer is then defined as a multivariate normal distribution with expected value $\boldsymbol{\mu}^{(i)}$ and tridiagonal covariance matrix $\Sigma^{(i)} = L^{(i)} (L^{(i)})^T$. According to the considerations above, the variances of the normal distribution are given by Equations (2) and (3) and the covariances are given by Equations (4) and (5). This again implies that the correlations are all the same and given by a single parameter. Since the variance-related parameter regulates the variances of the distribution, it should not take negative values during optimization. To guarantee this it is reparameterized with the help of the softplus function

$$\mathrm{softplus}(x) = \log\bigl(1 + \exp(x)\bigr). \tag{22}$$

Moreover, the correlation parameter should lie in the interval $(-1/2, 1/2)$ to ensure that the matrix $\Sigma^{(i)}$ is positive definite. In deep learning dimensions are commonly high, such that the approximation $\cos^2(\pi/(n_i+1)) \approx 1$ can be considered as valid. The following reparameterization ensures that the desired property holds:

(23)

In addition, the diagonal entries of $L^{(i)}$ have to be non-zero to ensure positive definiteness, which again implies that each component of $\boldsymbol{\mu}^{(i)}$ has to be non-zero. We decided to set values which are not significantly different from zero to small random numbers in our implementation instead of introducing another reparameterization. Finally, the expectation values, the variance-related parameter, and the correlation parameter can be summarized as the variational parameters corresponding to the weights of the $i$-th network layer.
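The two reparameterizations can be sketched as follows. The softplus mapping keeps the variance-related parameter positive; for the correlation parameter the paper's exact mapping in (23) is not reproduced here, so a tanh-based squashing into $(-1/2, 1/2)$ serves as an assumed stand-in.

```python
import numpy as np

def softplus(x):
    """softplus(x) = log(1 + exp(x)); strictly positive for every real x."""
    return np.log1p(np.exp(x))

def bounded_corr(g, half_width=0.5):
    """Map an unconstrained parameter into (-half_width, half_width).
    (Assumed stand-in for the reparameterization in Equation (23).)"""
    return half_width * np.tanh(g)

g = np.linspace(-5.0, 5.0, 101)
print(softplus(g).min() > 0.0)               # True: variances stay positive
print(np.abs(bounded_corr(g)).max() < 0.5)   # True: correlation stays in (-1/2, 1/2)
```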

One can easily sample a random vector belonging to this distribution using samples from a standard normal distribution, since it can be written as

$$\mathbf{w}^{(i)} = \boldsymbol{\mu}^{(i)} + L^{(i)} \mathbf{z}, \qquad \mathbf{z} \sim N(\mathbf{0}, I). \tag{24}$$

Note that Equation (24) can also be written componentwise as:

(25)

(26)

(27)
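A minimal numerical sketch of this sampling scheme (names and dimensions are our own): with a lower bidiagonal factor $L$, a draw $\mathbf{w} = \boldsymbol{\mu} + L\mathbf{z}$ with standard normal $\mathbf{z}$ has covariance $LL^T$, which is tridiagonal by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -0.5, 0.3, 2.0])      # mean vector (illustrative values)
d = np.array([0.5, 0.6, 0.4, 0.7])        # main diagonal of the factor L
s = np.array([0.2, -0.1, 0.15])           # first subdiagonal of L
L = np.diag(d) + np.diag(s, -1)           # lower bidiagonal factor

cov = L @ L.T                             # tridiagonal covariance by construction
z = rng.standard_normal((200_000, len(mu)))
w = mu + z @ L.T                          # draws w = mu + L z, vectorized over rows

emp_cov = np.cov(w, rowvar=False)
print(cov[0, 2])                          # 0.0: entries beyond the first off-diagonal vanish
print(np.abs(emp_cov - cov).max())        # small Monte Carlo error
```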

The layer-wise variational distributions of the bias terms are defined completely analogously to those of the weights. Assuming independence of the layers as well as independence between weights and biases, the overall variational distribution is given by

$$q_{\boldsymbol{\theta}}(\mathbf{w}) = \prod_{i} q^{(i)}_{w}\bigl(\mathbf{w}^{(i)}\bigr) \, q^{(i)}_{b}\bigl(\mathbf{b}^{(i)}\bigr),$$

where $q^{(i)}_{w}$ denotes the density of $\mathbf{w}^{(i)}$, $q^{(i)}_{b}$ denotes the density of $\mathbf{b}^{(i)}$, and $\mathbf{w}$ is a vector including all weights and all biases.

We define the a priori distribution completely analogously to Posch et al. [31]. In particular, its density is given by

$$p(\mathbf{w}) = \prod_{i} p^{(i)}_{w}\bigl(\mathbf{w}^{(i)}\bigr) \, p^{(i)}_{b}\bigl(\mathbf{b}^{(i)}\bigr),$$

where $p^{(i)}_{w}$ denotes the prior density of $\mathbf{w}^{(i)}$ and $p^{(i)}_{b}$ denotes the prior density of $\mathbf{b}^{(i)}$.

### III-B Kullback Leibler Divergence

The fact that the variational distribution as well as the prior factorize simplifies the computation of the Kullback Leibler divergence. Indeed, the overall divergence is given by the sum of the layer-wise divergences (for further details refer to Posch et al. [31]):

Thus, computing the overall divergence can be reduced to computing the divergence of a single layer for fixed $i$, since the remaining divergences are computed completely analogously (only the indices differ). According to Hershey and Olsen [36] the KL divergence between two $d$-dimensional normal distributions, given by $N(\boldsymbol{\mu}_0, \Sigma_0)$ and $N(\boldsymbol{\mu}_1, \Sigma_1)$, is computed as

$$\mathrm{KL} = \frac{1}{2}\left( \log\frac{\det \Sigma_1}{\det \Sigma_0} - d + \operatorname{tr}\bigl(\Sigma_1^{-1} \Sigma_0\bigr) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^T \Sigma_1^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) \right). \tag{28}$$

Thus, the determinant of the covariance matrix is required for the computation of the Kullback Leibler divergence. Using basic properties of determinants, $\det \Sigma^{(i)}$ is computed as follows for fixed $i$:

$$\det \Sigma^{(i)} = \det\bigl(L^{(i)}\bigr)^2 = \left(\prod_{j=1}^{n_i} l_j\right)^2. \tag{29}$$

Using Equations (28) and (29), the layer-wise divergence then reads

(30)

where $c$ always denotes an additive constant.
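The closed form for the KL divergence between multivariate normals can be evaluated numerically, using Cholesky factors for the log-determinants. The sketch below (function name is our own) is generic and not specific to the tridiagonal case.

```python
import numpy as np

def kl_mvn(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) ) for d-dimensional normals."""
    d = len(mu0)
    # log-determinant via Cholesky: log det(S) = 2 * sum(log(diag(chol(S))))
    logdet0 = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(cov0))))
    logdet1 = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(cov1))))
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (logdet1 - logdet0 - d
                  + np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff)

mu = np.zeros(3)
cov = np.diag([1.0, 2.0, 0.5])
print(kl_mvn(mu, cov, mu, cov))   # 0.0 for identical distributions
```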

### III-C Derivatives

Commonly neural networks are optimized via mini-batch gradient descent. Thus, in order to train a neural net according to our novel approach, the partial derivatives of the approximation of the negative log evidence lower bound described in Section II with respect to the variational parameters are required. In particular, the partial derivatives of the loss function typically used in deep learning and the partial derivatives of the Kullback Leibler divergence between prior and variational distribution have to be computed. Note that the loss function equals the negative log likelihood of the data and is given by the cross-entropy loss in the case of classification and by the Euclidean loss in the case of regression. Thus, it depends on the network itself, with parameters sampled from the variational distribution. With the help of the multivariate chain rule the required partial derivatives of the loss can be computed based on the classical derivatives used in non-Bayesian deep learning:

(31)

(32)

(33)

Equations (31)-(33) only deal with the derivatives of the loss with respect to the variational parameters belonging to the network weights. Completely analogous equations hold for the bias terms. In the sequel we focus on the derivatives of the weights, since the derivatives for the biases are obviously of the same form. Note that for a given sample from the variational distribution the layer-wise derivatives are computed as in non-Bayesian deep learning. Thus, the problem of finding closed-form expressions for the required derivatives reduces to the problem of finding these expressions for the diagonal and subdiagonal elements of the factors $L^{(i)}$. Taking account of Equations (25)-(27), the needed derivatives of the weights can be expressed in terms of the corresponding derivatives of these elements:

(34)

(35)

(36)

where the layer index runs over all layers, while for a given layer the component indices run over the components of the corresponding weight vector. The derivatives of the diagonal elements of $L^{(i)}$ with respect to the variational parameters are given by

(37)

(38)

(39)

where the initial elements of the recursion are treated separately, and obviously each derivative of the first diagonal element with respect to parameters it does not depend on equals zero. Moreover, the derivatives of the subdiagonal elements of $L^{(i)}$ with respect to the variational parameters are given by:

(40)

(41)

(42)

In addition, the remaining required derivatives with respect to the variational parameters are given by

(43)

(44)

(45)

where the indices range as above.

Finally, the partial derivatives of the KL divergence with respect to the variational parameters are given by:

(46)
