## 1 Introduction

Deep learning works very well in practice for many tasks, ranging from image processing [Krizhevsky et al., 2012] to language modelling [Bengio et al., 2006]. However, the framework has some major limitations as well. Our inability to reason about uncertainty over the features is one example: the features extracted from a dataset are often given as point estimates, which do not capture how confident the model is in its estimation. On the other hand, probabilistic Bayesian models such as the Gaussian process [Rasmussen and Williams, 2006] offer us the ability to reason about our confidence, but often at the price of lessened performance.

Another major obstacle with deep learning techniques is over-fitting. This problem has been largely answered with the introduction of dropout [Hinton et al., 2012; Srivastava et al., 2014], and indeed many modern models use dropout to avoid over-fitting in practice. Over the last several years many have tried to explain why dropout helps avoid over-fitting, a property not often observed in Bayesian models. Papers such as [Wager et al., 2013; Baldi and Sadowski, 2013] have suggested that dropout performs stochastic gradient descent on a regularised error function, or is equivalent to an $L_2$ regulariser applied after scaling the features by some estimate.

Here we show that a deep neural network (NN) with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process model [Damianou and Lawrence, 2013] (marginalised over its covariance function parameters). We would like to stress that no simplifying assumptions are made on the use of dropout in the literature, and that the results derived are applicable to any network architecture that makes use of dropout exactly as it appears in practical applications. We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalised over its finite rank covariance function parameters).

We survey possible applications of this new interpretation, and discuss insights shedding light on dropout’s properties. This interpretation of dropout as a Bayesian model offers an explanation to some of its properties, such as its ability to avoid over-fitting. Further, our insights allow us to treat NNs with dropout as fully Bayesian models, and obtain uncertainty estimates over their features. In practice, this allows the introduction of the Bayesian machinery into existing deep learning frameworks in a principled way. Lastly, our analysis suggests straightforward generalisations of dropout for future research which should improve on current techniques.

The work presented here is an extensive theoretical treatment of the above, with applications studied separately. We next review the required background, namely dropout, Gaussian processes, and variational inference. We then derive the main results of the paper. We finish with insights and applications, and discuss how various dropout variants fit within our framework.

## 2 Background

We briefly review dropout, the Gaussian process model (for a full treatment of Gaussian processes, see Rasmussen and Williams [2006]), and approximate variational inference. These tools will be used in the following section to derive the main results of this work. We use the following notation throughout the paper. Bold lower case letters ($\mathbf{x}$) denote vectors, bold upper case letters ($\mathbf{X}$) denote matrices, and standard weight letters ($x$) denote scalar quantities. We use subscripts to denote either entire rows / columns (with bold letters, $\mathbf{x}_i$), or specific elements ($x_{ij}$). We use subscripts to denote variables as well (such as $\mathbf{W}_1, \mathbf{W}_2$), with corresponding lower case indices to refer to specific rows / columns ($\mathbf{w}_q$ for the first variable and $\mathbf{w}_k$ for the second). We use a second subscript to denote the element index of a specific variable: $w_{1,qk}$ denotes the element at row $q$ column $k$ of the variable $\mathbf{W}_1$.

### 2.1 Dropout

We briefly review the dropout NN model [Hinton et al., 2012; Srivastava et al., 2014] for the case of a single hidden layer NN. This is done for ease of notation; the generalisation to multiple layers is straightforward. Denote by $\mathbf{W}_1, \mathbf{W}_2$ the weight matrices connecting the first layer to the hidden layer and connecting the hidden layer to the output layer respectively. These linearly transform the layers' inputs before applying some element-wise non-linearity $\sigma(\cdot)$. Denote by $\mathbf{b}$ the biases by which we shift the input of the non-linearity. We assume the model to output $D$ dimensional vectors while its input is $Q$ dimensional, with $K$ hidden units. Thus $\mathbf{W}_1$ is a $Q \times K$ matrix, $\mathbf{W}_2$ is a $K \times D$ matrix, and $\mathbf{b}$ is a $K$ dimensional vector. A standard NN model would output $\widehat{\mathbf{y}} = \sigma(\mathbf{x}\mathbf{W}_1 + \mathbf{b})\mathbf{W}_2$ given some input $\mathbf{x}$ (note that we omit the outer-most bias term, as this is equivalent to centring the output).

Dropout is applied by sampling two binary vectors $\mathbf{z}_1, \mathbf{z}_2$ of dimensions $Q$ and $K$ respectively. The elements of the vectors are distributed according to a Bernoulli distribution with some parameter $p_i \in [0, 1]$: $z_{1,q} \sim \text{Bernoulli}(p_1)$ for $q = 1, \dots, Q$, and $z_{2,k} \sim \text{Bernoulli}(p_2)$ for $k = 1, \dots, K$. Given an input $\mathbf{x}$, a proportion $1 - p_1$ of the elements of the input are set to zero: $\mathbf{x} \circ \mathbf{z}_1$, where $\circ$ signifies the Hadamard product. The output of the first layer is given by $\sigma((\mathbf{x} \circ \mathbf{z}_1)\mathbf{W}_1 + \mathbf{b}) \circ \mathbf{z}_2$, which is linearly transformed to give the dropout model's output $\widehat{\mathbf{y}} = \left(\sigma((\mathbf{x} \circ \mathbf{z}_1)\mathbf{W}_1 + \mathbf{b}) \circ \mathbf{z}_2\right)\mathbf{W}_2$. This is equivalent to multiplying the weight matrices by the binary vectors to zero out entire rows:

$$\widehat{\mathbf{y}} = \sigma(\mathbf{x}(\mathbf{z}_1\mathbf{W}_1) + \mathbf{b})(\mathbf{z}_2\mathbf{W}_2).$$

The process is repeated for multiple layers. Note that to keep notation clean we will write $\mathbf{z}_1, \mathbf{z}_2$ when we mean $\text{diag}(\mathbf{z}_1), \text{diag}(\mathbf{z}_2)$, with the $\text{diag}(\cdot)$ operator mapping a vector to a diagonal matrix whose diagonal is the elements of the vector.
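As a concrete illustration, the stochastic forward pass described above can be sketched in NumPy as follows. This is only a minimal sketch: the shapes $Q, K, D$, the keep probabilities $p_1, p_2$, and the weight values are hypothetical placeholders following the notation of this section.

```python
import numpy as np

rng = np.random.default_rng(0)

Q, K, D = 4, 8, 3          # input, hidden, and output dimensions
p1, p2 = 0.8, 0.5          # Bernoulli "keep" probabilities

W1 = rng.standard_normal((Q, K))
W2 = rng.standard_normal((K, D))
b = rng.standard_normal(K)

def dropout_forward(x, rng):
    """One stochastic pass: y_hat = (sigma((x o z1) W1 + b) o z2) W2."""
    z1 = rng.binomial(1, p1, size=Q)   # zeroes out input dimensions
    z2 = rng.binomial(1, p2, size=K)   # zeroes out hidden units
    h = np.tanh((x * z1) @ W1 + b)     # element-wise non-linearity sigma
    return (h * z2) @ W2

x = rng.standard_normal(Q)
y_hat = dropout_forward(x, rng)
```

Repeated calls with fresh masks give different outputs for the same input, which is exactly the stochasticity exploited later in the paper.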

To use the NN model for regression we might use the Euclidean loss (also known as "square loss"),

$$E = \frac{1}{2N} \sum_{n=1}^{N} \lVert \mathbf{y}_n - \widehat{\mathbf{y}}_n \rVert_2^2 \qquad (1)$$

where $\{\mathbf{y}_1, \dots, \mathbf{y}_N\}$ are observed outputs and $\{\widehat{\mathbf{y}}_1, \dots, \widehat{\mathbf{y}}_N\}$ are the outputs of the model with corresponding observed inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$.

To use the model for classification, predicting the probability of $\mathbf{x}$ being classified with label $1, \dots, D$, we pass the output of the model $\widehat{\mathbf{y}}$ through an element-wise softmax function to obtain normalised scores: $\widehat{p}_{nd} = \exp(\widehat{y}_{nd}) / \sum_{d'} \exp(\widehat{y}_{nd'})$. Taking the log of this function results in a softmax loss,

$$E = -\frac{1}{N} \sum_{n=1}^{N} \log(\widehat{p}_{n, c_n}) \qquad (2)$$

where $c_n \in \{1, \dots, D\}$ is the observed class for input $n$.
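For illustration, the softmax scores and the resulting loss can be computed as follows; the model outputs `y_hat` and observed classes `c` are hypothetical values standing in for $\widehat{\mathbf{y}}$ and $c_n$.

```python
import numpy as np

y_hat = np.array([[2.0, 0.5, -1.0],      # model outputs for N = 2 inputs,
                  [0.1, 0.2, 3.0]])      # D = 3 classes
c = np.array([0, 2])                     # observed class per input

# Element-wise softmax to obtain normalised scores p_hat
# (shifting by the row max for numerical stability).
e = np.exp(y_hat - y_hat.max(axis=1, keepdims=True))
p_hat = e / e.sum(axis=1, keepdims=True)

# Softmax loss: E = -(1/N) sum_n log p_hat[n, c_n]
N = len(c)
E = -np.log(p_hat[np.arange(N), c]).mean()
```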

During optimisation a regularisation term is often added. We often use $L_2$ regularisation weighted by some weight decay $\lambda$ (alternatively, the derivatives might be scaled), resulting in a minimisation objective (often referred to as cost),

$$\mathcal{L}_{\text{dropout}} := E + \lambda_1 \lVert \mathbf{W}_1 \rVert_2^2 + \lambda_2 \lVert \mathbf{W}_2 \rVert_2^2 + \lambda_3 \lVert \mathbf{b} \rVert_2^2. \qquad (3)$$

We sample new realisations for the binary vectors for every input point and every forward pass through the model (evaluating the model's output), and use the same values in the backward pass (propagating the derivatives to the parameters).

The dropped weights $\mathbf{W}_1$ and $\mathbf{W}_2$ are often scaled by $\frac{1}{p_i}$ to maintain constant output magnitude. At test time no sampling takes place. This is equivalent to initialising the weights with scale $\frac{1}{p_i}$ with no further scaling at training time, and at test time scaling the weights by $p_i$.
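The equivalence between the two scaling conventions can be checked numerically. The sketch below uses hypothetical weight values; both conventions give the same deterministic test-time output.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5                               # keep probability
W = rng.standard_normal((4, 3))
x = rng.standard_normal(4)

# Convention A: store weights W, scale the *dropped* weights by 1/p during
# training; at test time use W as-is with no sampling.
test_out_a = x @ W

# Convention B: store weights at scale 1/p with no further scaling at
# training time; at test time scale the weights by p.
W_b = W / p
test_out_b = x @ (p * W_b)
```

Both conventions produce identical test-time outputs, which is why modern implementations are free to pick either.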

### 2.2 Gaussian Processes

The Gaussian process (GP) is a powerful tool in statistics that allows us to model distributions over functions. It has been applied in both the supervised and unsupervised domains, for both regression and classification tasks [Rasmussen and Williams, 2006; Titsias and Lawrence, 2010; Gal et al., 2015]. The Gaussian process offers desirable properties such as uncertainty estimates over the function values, robustness to over-fitting, and principled ways for hyper-parameter tuning. The use of approximate variational inference for the model allows us to scale it to large data via stochastic and distributed inference [Hensman et al., 2013; Gal et al., 2014].

Given a training dataset consisting of $N$ inputs $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and their corresponding outputs $\{\mathbf{y}_1, \dots, \mathbf{y}_N\}$, we would like to estimate a function $\mathbf{y} = \mathbf{f}(\mathbf{x})$ that is likely to have generated our observations. We denote the inputs $\mathbf{X} \in \mathbb{R}^{N \times Q}$ and the outputs $\mathbf{Y} \in \mathbb{R}^{N \times D}$.

What is a function that is likely to have generated our data? Following the Bayesian approach we would put some prior distribution over the space of functions, $p(\mathbf{f})$. This distribution represents our prior belief as to which functions are more likely and which are less likely to have generated our data. We then look for the posterior distribution over the space of functions given our dataset $(\mathbf{X}, \mathbf{Y})$:

$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{Y}) \propto p(\mathbf{Y} \mid \mathbf{X}, \mathbf{f}) \, p(\mathbf{f}).$$

This distribution captures the most likely functions given our observed data.

By modelling our distribution over the space of functions with a Gaussian process we can analytically evaluate its corresponding posterior in regression tasks, and estimate the posterior in classification tasks. In practice what this means is that for regression we place a joint Gaussian distribution over all function values,

$$\begin{aligned} \mathbf{F} \mid \mathbf{X} &\sim \mathcal{N}(\mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X})) \\ \mathbf{Y} \mid \mathbf{F} &\sim \mathcal{N}(\mathbf{F}, \tau^{-1}\mathbf{I}_N) \end{aligned} \qquad (4)$$

with some precision hyper-parameter $\tau$ and where $\mathbf{I}_N$ is the identity matrix with dimensions $N \times N$. For classification we sample from a categorical distribution with probabilities given by passing $\mathbf{F}$ through an element-wise softmax,

$$\begin{aligned} \mathbf{F} \mid \mathbf{X} &\sim \mathcal{N}(\mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X})) \\ y_n \mid \mathbf{F} &\sim \text{Categorical}\left(\exp(f_{nd}) \big/ \textstyle\sum_{d'} \exp(f_{nd'})\right) \end{aligned} \qquad (5)$$

for $n = 1, \dots, N$, with observed class label $c_n$. Note that we did not simply write $\mathbf{Y} = \mathbf{F}$ because of notational convenience that will allow us to treat regression and classification together.

To model the data we have to choose a covariance function $\mathbf{K}(\mathbf{X}_1, \mathbf{X}_2)$ for the Gaussian distribution. This function defines the (scalar) similarity between every pair of input points $\mathbf{K}(\mathbf{x}_i, \mathbf{x}_j)$. Given a finite dataset of size $N$ this function induces an $N \times N$ covariance matrix which we will denote $\mathbf{K} := \mathbf{K}(\mathbf{X}, \mathbf{X})$. For example we may choose a stationary squared exponential covariance function. We will see below that certain non-stationary covariance functions correspond to TanH (hyperbolic tangent) or ReLU (rectified linear) NNs.

Evaluating the Gaussian distribution above involves an inversion of an $N \times N$ matrix, an operation that requires $\mathcal{O}(N^3)$ time complexity. Many approximations to the Gaussian process result in a manageable time complexity. Variational inference can be used for such approximations, and will be explained next.
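To make the computational point concrete, a minimal exact GP regression sketch (squared exponential covariance, hypothetical one-dimensional data) is given below; the `solve` call on the $N \times N$ matrix is the $\mathcal{O}(N^3)$ bottleneck that the approximations discussed next avoid.

```python
import numpy as np

def sq_exp(X1, X2, length=1.0):
    """Stationary squared exponential covariance K(x1, x2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(2)
N, tau = 50, 100.0                      # data size and noise precision
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, N)

K = sq_exp(X, X) + np.eye(N) / tau      # N x N covariance matrix
alpha = np.linalg.solve(K, y)           # the O(N^3) inversion step

X_star = np.array([[0.0]])
mean = sq_exp(X_star, X) @ alpha        # posterior predictive mean at x* = 0
```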

### 2.3 Variational Inference

To approximate the model above we could condition the model on a finite set of random variables $\boldsymbol{\omega}$. We make a modelling assumption and assume that the model depends on these variables alone, making them into sufficient statistics in our approximate model.

The predictive distribution for a new input point $\mathbf{x}^*$ is then given by

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) = \int p(\mathbf{y}^* \mid \mathbf{x}^*, \boldsymbol{\omega}) \, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) \, \text{d}\boldsymbol{\omega}.$$

The distribution $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ cannot usually be evaluated analytically. Instead we define an approximating variational distribution $q(\boldsymbol{\omega})$, whose structure is easy to evaluate.

We would like our approximating distribution to be as close as possible to the posterior distribution obtained from the full Gaussian process. We thus minimise the Kullback–Leibler (KL) divergence, intuitively a measure of similarity between two distributions:

$$\text{KL}(q(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})),$$

resulting in the approximate predictive distribution

$$q(\mathbf{y}^* \mid \mathbf{x}^*) = \int p(\mathbf{y}^* \mid \mathbf{x}^*, \boldsymbol{\omega}) \, q(\boldsymbol{\omega}) \, \text{d}\boldsymbol{\omega}. \qquad (6)$$

Minimising the Kullback–Leibler divergence is equivalent to maximising the log evidence lower bound [Bishop, 2006],

$$\mathcal{L}_{\text{VI}} := \int q(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}) \, \text{d}\boldsymbol{\omega} - \text{KL}(q(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})) \qquad (7)$$

with respect to the variational parameters defining $q(\boldsymbol{\omega})$. Note that the KL divergence in the last equation is between the approximate posterior and the prior over $\boldsymbol{\omega}$. Maximising this objective will result in a variational distribution that explains the data well (as obtained from the first term, the log likelihood) while still being close to the prior, preventing the model from over-fitting.

We next present a variational approximation to the Gaussian process, extending [Gal and Turner, 2015], which results in a model mathematically identical to the use of dropout in arbitrarily structured NNs with arbitrary non-linearities.

## 3 Dropout as a Bayesian Approximation

We show that deep NNs with dropout applied before every weight layer are mathematically equivalent to approximate variational inference in the deep Gaussian process (marginalised over its covariance function parameters). For this we build on previous work [Gal and Turner, 2015] that applied variational inference in the sparse spectrum Gaussian process approximation [Lázaro-Gredilla et al., 2010]. Starting with the full Gaussian process we will develop an approximation that will be shown to be equivalent to the NN optimisation objective with dropout (eq. (3)) with either the Euclidean loss (eq. (1)) in the case of regression or softmax loss (eq. (2)) in the case of classification. This view of dropout will allow us to derive new probabilistic results in deep learning.

### 3.1 A Gaussian Process Approximation

We begin by defining our covariance function. Let $\sigma(\cdot)$ be some non-linear function such as the rectified linear (ReLU) or the hyperbolic tangent function (TanH). We define $\mathbf{K}(\mathbf{x}, \mathbf{y})$ to be

$$\mathbf{K}(\mathbf{x}, \mathbf{y}) = \int p(\mathbf{w}) \, p(b) \, \sigma(\mathbf{w}^T \mathbf{x} + b) \, \sigma(\mathbf{w}^T \mathbf{y} + b) \, \text{d}\mathbf{w} \, \text{d}b$$

with $p(\mathbf{w})$ a standard multivariate normal distribution of dimensionality $Q$ and some distribution $p(b)$. It is trivial to show that this defines a valid covariance function following [Tsuda et al., 2002].

We use Monte Carlo integration with $K$ terms to approximate the integral above. This results in the finite rank covariance function

$$\widehat{\mathbf{K}}(\mathbf{x}, \mathbf{y}) = \frac{1}{K} \sum_{k=1}^{K} \sigma(\mathbf{w}_k^T \mathbf{x} + b_k) \, \sigma(\mathbf{w}_k^T \mathbf{y} + b_k)$$

with $\mathbf{w}_k \sim p(\mathbf{w})$ and $b_k \sim p(b)$. $K$ will be the number of hidden units in our single hidden layer NN approximation.
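The Monte Carlo construction above is easy to check numerically. The sketch below uses ReLU as the non-linearity and assumes, for illustration only, a hypothetical bias distribution $p(b) = \mathcal{N}(0, 1)$; it draws $K$ feature samples and evaluates the empirical average defining $\widehat{\mathbf{K}}$.

```python
import numpy as np

rng = np.random.default_rng(3)
Q, K = 5, 10_000                      # input dimension and number of MC terms

w = rng.standard_normal((K, Q))       # w_k ~ N(0, I_Q)
b = rng.standard_normal(K)            # b_k ~ p(b), here assumed N(0, 1)
relu = lambda a: np.maximum(a, 0.0)

def k_hat(x, y):
    """Finite rank covariance:
    (1/K) sum_k sigma(w_k.x + b_k) sigma(w_k.y + b_k)."""
    return np.mean(relu(w @ x + b) * relu(w @ y + b))

x = rng.standard_normal(Q)
k_xx = k_hat(x, x)                    # variance estimate, always non-negative
k_xy = k_hat(x, rng.standard_normal(Q))
```

Increasing $K$ reduces the Monte Carlo variance of the estimate, mirroring the role of the number of hidden units in the NN approximation.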

Using $\widehat{\mathbf{K}}$ instead of $\mathbf{K}$ as the covariance function of the Gaussian process yields the following generative model:

$$\begin{aligned} \mathbf{w}_k &\sim p(\mathbf{w}), \quad b_k \sim p(b), \\ \widehat{\mathbf{K}}(\mathbf{x}, \mathbf{y}) &= \frac{1}{K} \sum_{k=1}^{K} \sigma(\mathbf{w}_k^T \mathbf{x} + b_k) \, \sigma(\mathbf{w}_k^T \mathbf{y} + b_k), \\ \mathbf{F} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{b} &\sim \mathcal{N}(\mathbf{0}, \widehat{\mathbf{K}}(\mathbf{X}, \mathbf{X})), \\ \mathbf{Y} \mid \mathbf{F} &\sim \mathcal{N}(\mathbf{F}, \tau^{-1}\mathbf{I}_N), \end{aligned} \qquad (8)$$

with $\mathbf{W}_1 = [\mathbf{w}_k]_{k=1}^{K}$ a $Q \times K$ matrix parametrising our covariance function.

Integrating over the covariance function parameters results in the following predictive distribution:

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) = \int p(\mathbf{y}^* \mid \mathbf{f}^*) \, p(\mathbf{f}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}, \mathbf{W}_1, \mathbf{b}) \, p(\mathbf{W}_1, \mathbf{b} \mid \mathbf{X}, \mathbf{Y})$$

where the integration is with respect to $\mathbf{f}^*$, $\mathbf{W}_1$, and $\mathbf{b}$.

Denoting the $1 \times K$ row vector

$$\boldsymbol{\phi}(\mathbf{x}, \mathbf{W}_1, \mathbf{b}) = \sqrt{\frac{1}{K}} \, \sigma(\mathbf{x}\mathbf{W}_1 + \mathbf{b})$$

and the $N \times K$ feature matrix $\boldsymbol{\Phi} = [\boldsymbol{\phi}(\mathbf{x}_n, \mathbf{W}_1, \mathbf{b})]_{n=1}^{N}$, we have $\widehat{\mathbf{K}}(\mathbf{X}, \mathbf{X}) = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$. We rewrite $p(\mathbf{Y} \mid \mathbf{X})$ as

$$p(\mathbf{Y} \mid \mathbf{X}) = \int \mathcal{N}(\mathbf{Y}; \mathbf{0}, \boldsymbol{\Phi}\boldsymbol{\Phi}^T + \tau^{-1}\mathbf{I}_N) \, p(\mathbf{W}_1) \, p(\mathbf{b}) \, \text{d}\mathbf{W}_1 \, \text{d}\mathbf{b},$$

analytically integrating with respect to $\mathbf{F}$.

The normal distribution of $\mathbf{Y}$ inside the integral above can be written as a joint normal distribution over $\mathbf{y}_d$, the $d$'th columns of the $N \times D$ matrix $\mathbf{Y}$, for $d = 1, \dots, D$. For each term in the joint distribution, following the identity $p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{w}) p(\mathbf{w}) \, \text{d}\mathbf{w}$ for linear-Gaussian models [Bishop, 2006, page 93], we introduce a $K \times 1$ auxiliary random variable $\mathbf{w}_d \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$,

$$\mathcal{N}(\mathbf{y}_d; \mathbf{0}, \boldsymbol{\Phi}\boldsymbol{\Phi}^T + \tau^{-1}\mathbf{I}_N) = \int \mathcal{N}(\mathbf{y}_d; \boldsymbol{\Phi}\mathbf{w}_d, \tau^{-1}\mathbf{I}_N) \, \mathcal{N}(\mathbf{w}_d; \mathbf{0}, \mathbf{I}_K) \, \text{d}\mathbf{w}_d. \qquad (9)$$

Writing $\mathbf{W}_2 = [\mathbf{w}_d]_{d=1}^{D}$ a $K \times D$ matrix, the above is equivalent to (this is equivalent to the weighted basis function interpretation of the Gaussian process [Rasmussen and Williams, 2006], where the various quantities are analytically integrated over)

$$p(\mathbf{Y} \mid \mathbf{X}) = \int \mathcal{N}(\mathbf{Y}; \boldsymbol{\Phi}\mathbf{W}_2, \tau^{-1}\mathbf{I}_N) \, p(\mathbf{W}_1) \, p(\mathbf{W}_2) \, p(\mathbf{b})$$

where the integration is with respect to $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{b}$.

We have re-parametrised the GP model and marginalised over the additional auxiliary random variables. We next approximate the posterior over these variables with appropriate approximating variational distributions.

### 3.2 Variational Inference in the Approximate Model

Our sufficient statistics are $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{b}$. To perform variational inference in our approximate model we need to define a variational distribution $q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) := q(\mathbf{W}_1) \, q(\mathbf{W}_2) \, q(\mathbf{b})$. We define $q(\mathbf{W}_1)$ to be a Gaussian mixture distribution with two components, factorised over $Q$ (note that this is a bi-modal distribution defined over each output dimensionality; as a result the joint distribution over $\mathbf{W}_1$ is highly multi-modal):

$$\begin{aligned} q(\mathbf{W}_1) &= \prod_{q=1}^{Q} q(\mathbf{w}_q), \\ q(\mathbf{w}_q) &= p_1 \mathcal{N}(\mathbf{m}_q, \sigma^2 \mathbf{I}_K) + (1 - p_1) \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_K) \end{aligned} \qquad (10)$$

with some probability $p_1 \in [0, 1]$, scalar $\sigma > 0$ and $\mathbf{m}_q \in \mathbb{R}^K$. We put a similar approximating distribution over $\mathbf{W}_2$:

$$\begin{aligned} q(\mathbf{W}_2) &= \prod_{k=1}^{K} q(\mathbf{w}_k), \\ q(\mathbf{w}_k) &= p_2 \mathcal{N}(\mathbf{m}_k, \sigma^2 \mathbf{I}_D) + (1 - p_2) \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_D) \end{aligned} \qquad (11)$$

with some probability $p_2 \in [0, 1]$.

We put a simple Gaussian approximating distribution over $\mathbf{b}$:

$$q(\mathbf{b}) = \mathcal{N}(\mathbf{m}, \sigma^2 \mathbf{I}_K). \qquad (12)$$

Next we evaluate the log evidence lower bound for the task of regression, for which we optimise over the variational parameters $\mathbf{M}_1 = [\mathbf{m}_q]_{q=1}^{Q}$, $\mathbf{M}_2 = [\mathbf{m}_k]_{k=1}^{K}$, and $\mathbf{m}$, to maximise eq. (7). The task of classification is discussed later.

### 3.3 Evaluating the Log Evidence Lower Bound for Regression

We need to evaluate the log evidence lower bound:

$$\mathcal{L}_{\text{GP-VI}} := \int q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) - \text{KL}(q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \,\|\, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b})) \qquad (13)$$

where the integration is with respect to $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{b}$.

For the task of regression we can rewrite the integrand as a sum:

$$\log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) = \sum_{d=1}^{D} \log \mathcal{N}(\mathbf{y}_d; \boldsymbol{\Phi}\mathbf{w}_d, \tau^{-1}\mathbf{I}_N)$$

as the output dimensions of a multi-output Gaussian process are assumed to be independent. Denote $\widehat{\mathbf{Y}} = \boldsymbol{\Phi}\mathbf{W}_2$. We can then sum over the rows instead of the columns of $\widehat{\mathbf{Y}}$ and write

$$\sum_{d=1}^{D} \log \mathcal{N}(\mathbf{y}_d; \boldsymbol{\Phi}\mathbf{w}_d, \tau^{-1}\mathbf{I}_N) = \sum_{n=1}^{N} \log \mathcal{N}(\mathbf{y}_n; \widehat{\mathbf{y}}_n, \tau^{-1}\mathbf{I}_D).$$

Here $\widehat{\mathbf{y}}_n = \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{W}_1, \mathbf{b}) \mathbf{W}_2$, resulting in the integrand

$$\sum_{n=1}^{N} \log \mathcal{N}(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{W}_1, \mathbf{b})\mathbf{W}_2, \tau^{-1}\mathbf{I}_D).$$

This allows us to write the log evidence lower bound as

$$\sum_{n=1}^{N} \int q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \log \mathcal{N}(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{W}_1, \mathbf{b})\mathbf{W}_2, \tau^{-1}\mathbf{I}_D) - \text{KL}(q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \,\|\, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b})).$$

We re-parametrise the integrands in the sum to not depend on $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{b}$ directly, but instead on the standard normal distribution and the Bernoulli distribution. Let $q(\boldsymbol{\epsilon}_1) = \mathcal{N}(\mathbf{0}, \mathbf{I}_{Q \times K})$ and $q(z_{1,q}) = \text{Bernoulli}(p_1)$ for $q = 1, \dots, Q$, and $q(\boldsymbol{\epsilon}_2) = \mathcal{N}(\mathbf{0}, \mathbf{I}_{K \times D})$ and $q(z_{2,k}) = \text{Bernoulli}(p_2)$ for $k = 1, \dots, K$. Finally let $q(\boldsymbol{\epsilon}) = \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$. We write

$$\mathbf{W}_1 = \mathbf{z}_1 \mathbf{M}_1 + \sigma\boldsymbol{\epsilon}_1, \quad \mathbf{W}_2 = \mathbf{z}_2 \mathbf{M}_2 + \sigma\boldsymbol{\epsilon}_2, \quad \mathbf{b} = \mathbf{m} + \sigma\boldsymbol{\epsilon}, \qquad (14)$$

allowing us to re-write the sum over the integrals in the above equation as

$$\sum_{n=1}^{N} \int q(\mathbf{z}_1, \boldsymbol{\epsilon}_1, \mathbf{z}_2, \boldsymbol{\epsilon}_2, \boldsymbol{\epsilon}) \log \mathcal{N}(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \mathbf{W}_1, \mathbf{b})\mathbf{W}_2, \tau^{-1}\mathbf{I}_D)$$

where each integration is over $\mathbf{z}_1, \boldsymbol{\epsilon}_1, \mathbf{z}_2, \boldsymbol{\epsilon}_2$, and $\boldsymbol{\epsilon}$.

We estimate each integral using Monte Carlo integration with a distinct single sample to obtain:

$$\mathcal{L}_{\text{GP-MC}} := \sum_{n=1}^{N} \log \mathcal{N}(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \widehat{\mathbf{W}}_1^n, \widehat{\mathbf{b}}^n)\widehat{\mathbf{W}}_2^n, \tau^{-1}\mathbf{I}_D) - \text{KL}(q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \,\|\, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}))$$

with realisations $\widehat{\mathbf{W}}_1^n, \widehat{\mathbf{W}}_2^n, \widehat{\mathbf{b}}^n$ defined following eq. (14) with $\widehat{\mathbf{z}}_1^n \sim q(\mathbf{z}_1)$, $\widehat{\boldsymbol{\epsilon}}_1^n \sim q(\boldsymbol{\epsilon}_1)$, $\widehat{\mathbf{z}}_2^n \sim q(\mathbf{z}_2)$, and $\widehat{\boldsymbol{\epsilon}}_2^n \sim q(\boldsymbol{\epsilon}_2)$. Following [Blei et al., 2012; Hoffman et al., 2013; Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014], optimising the stochastic objective $\mathcal{L}_{\text{GP-MC}}$ we would converge to the same limit as $\mathcal{L}_{\text{GP-VI}}$.

We can't evaluate the KL divergence term between a mixture of Gaussians and a single Gaussian analytically. However we can perform Monte Carlo integration like in the above. A further approximation for large $K$ (number of hidden units) and small $\sigma^2$ yields a weighted sum of KL divergences between the mixture components and the single Gaussian (proposition 1 in the appendix). Intuitively, this is because the entropy of a mixture of Gaussians with a large enough dimensionality and randomly distributed means tends towards the sum of the Gaussians' volumes. Following the proposition, for large enough $K$ we can approximate the KL divergence term as

$$\text{KL}(q(\mathbf{W}_1) \,\|\, p(\mathbf{W}_1)) \approx \frac{p_1}{2} \sum_{q=1}^{Q} \mathbf{m}_q^T\mathbf{m}_q + C$$

with $C$ constant w.r.t. our parameters, and similarly for $\text{KL}(q(\mathbf{W}_2) \,\|\, p(\mathbf{W}_2))$. The term $\text{KL}(q(\mathbf{b}) \,\|\, p(\mathbf{b}))$ can be evaluated analytically as

$$\text{KL}(q(\mathbf{b}) \,\|\, p(\mathbf{b})) = \frac{1}{2}\left(\mathbf{m}^T\mathbf{m} + K(\sigma^2 - \log \sigma^2 - 1)\right) + C$$

with $C$ constant w.r.t. our parameters. We drop the constants for brevity.

Next we explain the relation between the above equations and the equations presented in section 2.1.

### 3.4 Log Evidence Lower Bound Optimisation

Ignoring the constant terms we obtain the maximisation objective

$$\mathcal{L}_{\text{GP-MC}} \propto \sum_{n=1}^{N} \log \mathcal{N}(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \widehat{\mathbf{W}}_1^n, \widehat{\mathbf{b}}^n)\widehat{\mathbf{W}}_2^n, \tau^{-1}\mathbf{I}_D) - \frac{p_1}{2}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2}{2}\lVert\mathbf{M}_2\rVert_2^2 - \frac{1}{2}\lVert\mathbf{m}\rVert_2^2. \qquad (15)$$

Note that in the Gaussian processes literature the terms $\sigma$ and $\tau$ will often be optimised as well.

Letting $\sigma$ tend to zero, we get that the KL divergence of the prior blows up and tends to infinity. However, in real-world scenarios setting $\sigma$ to be machine epsilon results in a constant value $\log \sigma^2$. With high probability, samples from a standard Gaussian distribution with such a small standard deviation will be represented on a computer, in effect, as zero. Thus the random variable realisations $\widehat{\mathbf{W}}_1^n, \widehat{\mathbf{W}}_2^n, \widehat{\mathbf{b}}^n$ can be approximated as

$$\widehat{\mathbf{W}}_1^n \approx \widehat{\mathbf{z}}_1^n \mathbf{M}_1, \quad \widehat{\mathbf{W}}_2^n \approx \widehat{\mathbf{z}}_2^n \mathbf{M}_2, \quad \widehat{\mathbf{b}}^n \approx \mathbf{m}.$$

Note that $\widehat{\mathbf{W}}_1^n, \widehat{\mathbf{W}}_2^n, \widehat{\mathbf{b}}^n$ are not maximum a posteriori (MAP) estimates, but random variable realisations. This gives us

$$\mathcal{L}_{\text{GP-MC}} \propto \sum_{n=1}^{N} \log \mathcal{N}\big(\mathbf{y}_n; \boldsymbol{\phi}(\mathbf{x}_n, \widehat{\mathbf{z}}_1^n\mathbf{M}_1, \mathbf{m})(\widehat{\mathbf{z}}_2^n\mathbf{M}_2), \tau^{-1}\mathbf{I}_D\big) - \frac{p_1}{2}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2}{2}\lVert\mathbf{M}_2\rVert_2^2 - \frac{1}{2}\lVert\mathbf{m}\rVert_2^2.$$

Scaling the optimisation objective by a positive constant $\frac{1}{\tau N}$ doesn't change the parameter values at its optimum (as long as we don't optimise with respect to $\tau$). We thus scale the objective to get

$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{1}{2N}\sum_{n=1}^{N} \lVert\mathbf{y}_n - \widehat{\mathbf{y}}_n\rVert_2^2 - \frac{p_1}{2\tau N}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2}{2\tau N}\lVert\mathbf{M}_2\rVert_2^2 - \frac{1}{2\tau N}\lVert\mathbf{m}\rVert_2^2 \qquad (16)$$

and we recover equation (1) for an appropriate setting of $\tau$. Maximising eq. (16) results in the same optimal parameters as the minimisation of eq. (3). Note that eq. (16) is a scaled unbiased estimator of eq. (13). With correct stochastic optimisation scheduling both will converge to the same limit.

The optimisation of $\mathcal{L}_{\text{GP-MC}}$ proceeds as follows. We sample realisations $\widehat{\mathbf{z}}_1^n, \widehat{\mathbf{z}}_2^n$ to evaluate the lower bound and its derivatives. We perform a single optimisation step (for example a single gradient descent step), and repeat, sampling new realisations.

We can make several interesting observations at this point. First, we can find the model precision from the identity $\lambda_1 = \frac{p_1}{2N\tau}$, which gives $\tau = \frac{p_1}{2N\lambda_1}$. Second, it seems that the weight-decay for the dropped-out weights should be scaled by the probability of the weights to not be dropped. Lastly, it is known that setting the dropout probability to zero ($p_1 = p_2 = 1$) results in a standard NN. Following the derivation above, this would result in delta function approximating distributions on the weights (replacing eqs. (10)-(12)). As was discussed in [Lázaro-Gredilla et al., 2010] this leads to model over-fitting. Empirically it seems that the Bernoulli approximating distribution is sufficient to considerably prevent over-fitting.

Note that even though our approximating distribution is, in effect, made of a sum of two point masses, each point mass with zero variance, the mixture does not have zero variance. It has the variance of a Bernoulli random variable, which is transformed through the network. This choice of approximating distribution results in the dropout model.

## 4 Extensions

We have presented the derivation for a single hidden layer NN in the task of regression. The derivation above extends to tasks of classification, flexible priors, mini-batch optimisation, deep models, and much more. This will be studied next.

### 4.1 Evaluating the Log Evidence Lower Bound for Classification

For classification we have an additional step in the generative model in eq. (5) compared to eq. (4), sampling class assignment $c_n$ given $\mathbf{f}_n$. We can write this generative model using the auxiliary random variables introduced in section 3.1 for the regression case by

$$p(\mathbf{c} \mid \mathbf{X}) = \int p(\mathbf{c} \mid \mathbf{F}) \, p(\mathbf{F} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b})$$

where $\mathbf{c}$ is an $N$ dimensional vector of categorical values. We can write the log evidence lower bound in this case as (proposition 2 in the appendix)

$$\mathcal{L}_{\text{GP-VI}} := \int q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \log p(\mathbf{c} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) - \text{KL}(q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \,\|\, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b})).$$

The integrand of the first term can be re-written like before as a sum

$$\log p(\mathbf{c} \mid \mathbf{X}, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) = \sum_{n=1}^{N} \log p(c_n \mid \mathbf{x}_n, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b})$$

resulting in a log evidence lower bound given by

$$\mathcal{L}_{\text{GP-VI}} = \sum_{n=1}^{N} \int q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \log p(c_n \mid \mathbf{x}_n, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) - \text{KL}(q(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) \,\|\, p(\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}))$$

where the integration of each term in the first expression is over $\mathbf{W}_1$, $\mathbf{W}_2$, and $\mathbf{b}$.

We can re-parametrise each integrand in the sum following eq. (14) to obtain

$$\mathcal{L}_{\text{GP-VI}} = \sum_{n=1}^{N} \int q(\mathbf{z}_1, \boldsymbol{\epsilon}_1, \mathbf{z}_2, \boldsymbol{\epsilon}_2, \boldsymbol{\epsilon}) \log p(c_n \mid \mathbf{x}_n, \mathbf{W}_1, \mathbf{W}_2, \mathbf{b}) - \text{KL}(q \,\|\, p). \qquad (17)$$

Like before, we estimate each integral using Monte Carlo integration with a distinct single sample to obtain:

$$\mathcal{L}_{\text{GP-MC}} = \sum_{n=1}^{N} \log p(c_n \mid \mathbf{x}_n, \widehat{\mathbf{W}}_1^n, \widehat{\mathbf{W}}_2^n, \widehat{\mathbf{b}}^n) - \text{KL}(q \,\|\, p)$$

with realisations $\widehat{\mathbf{W}}_1^n$, $\widehat{\mathbf{W}}_2^n$, and $\widehat{\mathbf{b}}^n$.

Each term in the sum in the first expression can be re-written as

$$\log p(c_n \mid \mathbf{x}_n, \widehat{\mathbf{W}}_1^n, \widehat{\mathbf{W}}_2^n, \widehat{\mathbf{b}}^n) = \log \widehat{p}_{n, c_n}$$

with $\widehat{p}_{nd} = \exp(\widehat{f}_{nd}) / \sum_{d'} \exp(\widehat{f}_{nd'})$ and $\widehat{\mathbf{f}}_n = \boldsymbol{\phi}(\mathbf{x}_n, \widehat{\mathbf{W}}_1^n, \widehat{\mathbf{b}}^n)\widehat{\mathbf{W}}_2^n$. We evaluate the second expression as before. Scaling the objective by a positive constant $\frac{1}{N}$ this results in the following maximisation objective,

$$\mathcal{L}_{\text{GP-MC}} \propto \frac{1}{N}\sum_{n=1}^{N} \log \widehat{p}_{n, c_n} - \lambda_1\lVert\mathbf{M}_1\rVert_2^2 - \lambda_2\lVert\mathbf{M}_2\rVert_2^2 - \lambda_3\lVert\mathbf{m}\rVert_2^2$$

with $\lambda_1 = \frac{p_1}{2N}$, $\lambda_2 = \frac{p_2}{2N}$, and $\lambda_3 = \frac{1}{2N}$, identical (up to a sign flip) to that of eqs. (2), (3) for appropriate selection of weight decay $\lambda$.

### 4.2 Prior Length-scale

We can define a more expressive prior than $\mathcal{N}(\mathbf{0}, \mathbf{I}_K)$ over the weights of the first layer $\mathbf{W}_1$. This will allow us to incorporate our prior belief over the frequency of the observed data. Instead of $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{0}, \mathbf{I}_K)$ we may use $p_l(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{0}, l^{-2}\mathbf{I}_K)$ with length-scale $l$.

To use this more expressive prior we need to adapt proposition 1 approximating the prior term's KL divergence. It is easy to see that the KL divergence between $q(\mathbf{W}_1)$ and $p_l(\mathbf{W}_1)$ can be approximated as:

$$\text{KL}(q(\mathbf{W}_1) \,\|\, p_l(\mathbf{W}_1)) \approx \frac{p_1 l^2}{2} \sum_{q=1}^{Q} \mathbf{m}_q^T\mathbf{m}_q$$

plus a constant for large enough $K$. This follows from the KL divergence for multivariate normal distributions, and can be repeated for $\mathbf{b}$ as well (defining $p_l(\mathbf{b}) = \mathcal{N}(\mathbf{b}; \mathbf{0}, l^{-2}\mathbf{I}_K)$). Following the previous derivations we obtain the regression objective

$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{1}{2N}\sum_{n=1}^{N} \lVert\mathbf{y}_n - \widehat{\mathbf{y}}_n\rVert_2^2 - \frac{p_1 l^2}{2\tau N}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2}{2\tau N}\lVert\mathbf{M}_2\rVert_2^2 - \frac{l^2}{2\tau N}\lVert\mathbf{m}\rVert_2^2$$

and similarly for classification.

For high frequency data, setting $l$ and $\tau$ to small values would result in a weaker regularisation over the weights $\mathbf{M}_1$ and $\mathbf{m}$. This leads to larger magnitude weights which can capture high frequency data [Gal and Turner, 2015].

The length-scale decouples the precision parameter $\tau$ from the weight-decays $\lambda$:

$$\lambda_1 = \frac{p_1 l^2}{2N\tau} \qquad (18)$$

which results in

$$\tau = \frac{p_1 l^2}{2N\lambda_1}. \qquad (19)$$

The length-scale is a user specified value that captures our belief over the function frequency. A short length-scale $l$ (corresponding to high frequency data) with high precision $\tau$ (equivalently, small observation noise) results in a small weight-decay, encouraging the model to fit the data well. A long length-scale with low precision results in a large weight-decay, and stronger regularisation over the weights. This trade-off between the length-scale and model precision results in different weight-decay values.
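Eqs. (18)-(19) translate directly into a small piece of bookkeeping code: given a chosen length-scale, keep probability, dataset size, and precision, we can compute the implied weight decay, and recover the precision back from it. The numerical values below are hypothetical.

```python
N = 10_000        # dataset size
p1 = 0.9          # probability of keeping a weight
l = 0.5           # user-specified prior length-scale
tau = 100.0       # model precision (inverse observation noise)

# Eq. (18): weight decay implied by the length-scale and precision.
lambda1 = p1 * l**2 / (2 * N * tau)

# Eq. (19): recovering the precision back from the weight decay.
tau_recovered = p1 * l**2 / (2 * N * lambda1)
```

In practice this lets a practitioner read off a model precision from the weight decay used to train an existing dropout network.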

A similar term to the length-scale can be obtained for $\mathbf{W}_2$. Eq. (9) can be rewritten by substituting $\mathbf{W}_2' = \sqrt{K}\,\mathbf{W}_2$: we replace $\sqrt{\frac{1}{K}}\,\sigma(\cdot)$ with $\sigma(\cdot)$ and replace $p(\mathbf{w}_d) = \mathcal{N}(\mathbf{0}, \mathbf{I}_K)$ with $p(\mathbf{w}_d') = \mathcal{N}(\mathbf{0}, K^{-1}\mathbf{I}_K)$. This results in the standard network structure often used in many implementations (without the additional scaling by $\sqrt{\frac{1}{K}}$). Following the above derivation, we can rewrite the KL divergence between $q(\mathbf{W}_2')$ and $p(\mathbf{W}_2')$ to obtain the objective

$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{1}{2N}\sum_{n=1}^{N} \lVert\mathbf{y}_n - \widehat{\mathbf{y}}_n\rVert_2^2 - \frac{p_1 l^2}{2\tau N}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2 K}{2\tau N}\lVert\mathbf{M}_2\rVert_2^2 - \frac{l^2}{2\tau N}\lVert\mathbf{m}\rVert_2^2; \qquad (20)$$

here $K$ acts in a similar way to $l^2$. A large number of hidden units results in a stronger regularisation, pushing the elements of $\mathbf{M}_2$ to become smaller and smaller. A smaller $K$ will result in a weaker regularisation term over $\mathbf{M}_2$, allowing its elements to take larger values.

### 4.3 Mini-batch Optimisation

We often use mini-batches when optimising eq. (3). This is done by setting eq. (1) to

$$E = \frac{1}{2M}\sum_{n \in S} \lVert\mathbf{y}_n - \widehat{\mathbf{y}}_n\rVert_2^2$$

with a random subset $S$ of data points of size $M$, and similarly for classification.

Using recent results in stochastic variational inference [Hoffman et al., 2013] we can derive the equivalent to this in the GP approximation above. We change the likelihood term in eq. (15) to sum over the data points in $S$ alone, and multiply the term by $\frac{N}{M}$ to get an unbiased estimator of the original sum. Multiplying equation (15) by $\frac{1}{\tau N}$ as before we obtain

$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{1}{2M}\sum_{n \in S} \lVert\mathbf{y}_n - \widehat{\mathbf{y}}_n\rVert_2^2 - \frac{p_1}{2\tau N}\lVert\mathbf{M}_1\rVert_2^2 - \frac{p_2}{2\tau N}\lVert\mathbf{M}_2\rVert_2^2 - \frac{1}{2\tau N}\lVert\mathbf{m}\rVert_2^2 \qquad (21)$$

recovering eq. (3) for the mini-batch optimisation case as well.
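The $\frac{N}{M}$ rescaling that makes the mini-batch sum an unbiased estimator of the full-data sum is simple to check empirically: averaging many rescaled mini-batch sums recovers the full sum. The per-point values below are hypothetical stand-ins for the per-point likelihood terms.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 1000, 50
per_point = rng.uniform(0, 1, N)        # stand-in for per-point likelihood terms
full_sum = per_point.sum()

estimates = []
for _ in range(2000):
    S = rng.choice(N, size=M, replace=False)   # random mini-batch of size M
    estimates.append((N / M) * per_point[S].sum())

# The rescaled mini-batch sum is an unbiased estimator of the full sum,
# so the Monte Carlo mean should be close to full_sum.
mc_mean = np.mean(estimates)
```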

### 4.4 Predictive Log-likelihood

Given a dataset $\mathbf{X}, \mathbf{Y}$ and a new data point $\mathbf{x}^*$ we can calculate the probability of possible output values $\mathbf{y}^*$ using the predictive probability $p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y})$. The log of the predictive likelihood captures how well the model fits the data, with larger values indicating better model fit.

Our predictive log-likelihood can be approximated by Monte Carlo integration of eq. (6) with $T$ terms:

$$\log p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) \approx \log \frac{1}{T}\sum_{t=1}^{T} p(\mathbf{y}^* \mid \mathbf{x}^*, \widehat{\boldsymbol{\omega}}_t)$$

with $\widehat{\boldsymbol{\omega}}_t \sim q(\boldsymbol{\omega})$.

For regression we have

$$\log p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) \approx \operatorname{logsumexp}\left(-\frac{\tau}{2}\lVert\mathbf{y}^* - \widehat{\mathbf{y}}_t^*\rVert^2\right) - \log T - \frac{1}{2}\log 2\pi + \frac{1}{2}\log \tau.$$
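This Monte Carlo estimate is what is commonly called MC dropout at test time: keep dropout switched on, run $T$ stochastic forward passes, and aggregate. A minimal sketch of the regression case follows, with a hypothetical single-hidden-layer network, test point, and precision $\tau$; the log-sum-exp is written out manually for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(6)
Q, K, T = 4, 32, 100              # input dim, hidden units, MC samples
p, tau = 0.8, 10.0                # keep probability and model precision

W1 = rng.standard_normal((Q, K)) * 0.3
W2 = rng.standard_normal((K, 1)) * 0.3
b = rng.standard_normal(K) * 0.1

def stochastic_forward(x):
    """One forward pass with dropout kept ON at test time."""
    z = rng.binomial(1, p, size=K)
    return (np.tanh(x @ W1 + b) * z @ W2).item()

x_star, y_star = rng.standard_normal(Q), 0.1
samples = np.array([stochastic_forward(x_star) for _ in range(T)])

# Predictive mean and (epistemic) uncertainty over the output.
pred_mean, pred_std = samples.mean(), samples.std()

# Predictive log-likelihood:
# logsumexp_t(-tau/2 (y* - y_t)^2) - log T - 0.5 log(2 pi) + 0.5 log(tau)
a = -0.5 * tau * (y_star - samples) ** 2
log_lik = (a.max() + np.log(np.exp(a - a.max()).sum())
           - np.log(T) - 0.5 * np.log(2 * np.pi) + 0.5 * np.log(tau))
```

The spread `pred_std` is the uncertainty estimate that the point-estimate view of dropout cannot provide.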
