A Bayesian encourages dropout

12/22/2014 ∙ by Shin-ichi Maeda, et al. ∙ Kyoto University

Dropout is one of the key techniques for preventing overfitting during learning. It has been explained as a kind of modified L2 regularization. Here, we shed light on dropout from a Bayesian standpoint. The Bayesian interpretation enables us to optimize the dropout rate, which benefits both the learning of the weight parameters and prediction after learning. The experimental results also encourage optimizing the dropout rate.




1 Introduction

Recently, large-scale neural networks have shown great learning ability on various kinds of tasks such as image recognition, speech recognition, and natural language processing. Although large-scale neural networks are likely to fail to learn appropriate parameters because of their huge model complexity, the development of several learning techniques, advances in optimization, and the use of large datasets have overcome this difficulty.

One such successful learning technique is ‘dropout’, which is considered important for suppressing overfitting (Hinton et al., 2012). The success of dropout has attracted many researchers’ attention, and several theoretical studies have aimed at a better understanding of dropout (Wager et al., 2013; Baldi & Sadowski, 2013; Ba & Frey, 2013). Most of them explain dropout as a new kind of input-invariant regularization technique and demonstrate its applicability to other models such as generalized linear models (Wager et al., 2013; Wang & Manning, 2013).

In contrast, we formally provide a Bayesian interpretation of dropout. From this standpoint, dropout can be regarded as a way to solve a model selection problem, such as input feature selection or the selection of the number of hidden internal units, by Bayesian averaging of the models. That is, inference after dropout training can be considered approximate inference by Bayesian model averaging, where each model is weighted in accordance with the posterior distribution, and dropout training can be considered approximate learning of the parameter that optimizes a weighted sum of the likelihoods of all possible models, i.e., the marginal likelihood. This interpretation enables the optimization of the ‘dropout rate’ as the learning of the optimal weights of the models, i.e., the posterior over the models. Optimizing the dropout rate benefits parameter learning through a better approximation of the marginal likelihood, and benefits prediction through a distribution closer to the predictive distribution. Note that the Bayesian view of dropout already appears in the original dropout paper (Hinton et al., 2012), which regards dropout training as a kind of approximate Bayesian learning. However, that paper does not mention how large the gap between dropout training and standard Bayesian learning is, or how to set or optimize the dropout rate.

2 Standard dropout algorithm

2.1 Dropout training

We explain dropout for a three-layered neural network, not only because the neural network is the first model to which dropout training was applied, but also because it is one of the simplest models that involves both the input feature selection problem and the hidden internal unit selection problem. Note that the concept of dropout itself can easily be extended to other learning machines.

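Let $x \in \mathbb{R}^{n_x}$, $h \in \mathbb{R}^{n_h}$, and $y \in \mathbb{R}^{n_y}$ be the $n_x$-dimensional input, $n_h$-dimensional hidden unit, and $n_y$-dimensional output, respectively. Then, the hidden unit and the output are computed as follows:

$$h = \sigma\left(W^{(1)} x + b^{(1)}\right), \qquad y = W^{(2)} h + b^{(2)},$$

where $W^{(1)}$ and $W^{(2)}$ are $n_h \times n_x$ and $n_y \times n_h$ matrices, and $b^{(1)}$ and $b^{(2)}$ are $n_h$- and $n_y$-dimensional vectors, respectively. $\sigma$ is a nonlinear activation function such as a sigmoid function. The set $\theta = \{W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}\}$ is the parameter set to be optimized.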

Now we consider the model selection problem where the full model is the one described above, and the other possible models (submodels) use subsets of the input features or of the hidden units of the full model. Let us introduce the diagonal mask matrices $M^{(1)}$ and $M^{(2)}$, whose $i$-th diagonal elements $m^{(1)}_i$ and $m^{(2)}_i$ are both binary, to represent such subsets. We will sometimes use $m$ to represent all the mask variables. Then a possible model is represented as follows:

$$h = \sigma\left(W^{(1)} M^{(1)} x + b^{(1)}\right), \qquad y = W^{(2)} M^{(2)} h + b^{(2)}.$$

The determination of $M^{(1)}$ corresponds to the input feature selection problem, while the determination of $M^{(2)}$ corresponds to the hidden internal unit selection problem. We may write $y$ as $y(x; m, \theta)$ to make explicit its dependence on the masks $M^{(1)}$ and $M^{(2)}$ as well as on the parameter $\theta$. There are $2^{n_x + n_h}$ possible architectures when ignoring the redundancy brought by the hierarchy of the architecture, and this number becomes huge in practice. This poses a great challenge in finding the best mask $m$, i.e., in selecting the best model.

Instead of choosing a specific binary mask $m$, standard dropout stochastically mixes all the possible models by stochastic optimization. The algorithm is summarized as follows.

Standard dropout for learning $\theta$:
1. Set $t = 0$ and set an initial estimate $\theta^{0}$ for the parameter.
2. Pick a sample pair $(x_t, y_t)$ at random.
3. Randomly set the mask $m_t$ by determining every element independently according to the dropout rate, typically $m_i \sim \mathrm{Bernoulli}(0.5)$, where $\mathrm{Bernoulli}(p)$ denotes a Bernoulli distribution with probability $p$.
4. Update the parameter as
$$\theta^{t+1} = \theta^{t} + \eta_t \frac{\partial}{\partial \theta} \log p(y_t \mid x_t, m_t, \theta^{t}),$$
where $\eta_t$ determines the step size and should be decreased properly to assure convergence. When training the three-layered neural network for regression, $\log p(y_t \mid x_t, m_t, \theta)$ equals $-\|y_t - y(x_t; m_t, \theta)\|^2$ up to a positive scale factor and an additive constant, where $\|\cdot\|$ denotes the Euclidean norm.
5. Increment $t$ and go to step 2 unless a termination condition is satisfied.
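The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a sigmoid hidden layer, a linear output with a squared-error objective, masks whose elements equal 1 with probability `keep_prob`, and a hypothetical step-size schedule `eta0 / (1 + t / 500)`. All function names are made up for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, masks, params):
    """Masked three-layer network: h = sigmoid(W1 (m1*x) + b1), y = W2 (m2*h) + b2."""
    m1, m2 = masks
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ (m1 * x) + b1)
    return h, W2 @ (m2 * h) + b2

def dropout_sgd(X, Y, n_hidden=8, keep_prob=0.5, n_steps=2000, eta0=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    # Step 1: initial parameter estimate.
    W1 = 0.1 * rng.standard_normal((n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_out, n_hidden)); b2 = np.zeros(n_out)
    for t in range(n_steps):
        # Step 2: pick one training pair at random.
        k = rng.integers(len(X))
        x, y = X[k], Y[k]
        # Step 3: sample every mask element independently.
        m1 = (rng.random(n_in) < keep_prob).astype(float)
        m2 = (rng.random(n_hidden) < keep_prob).astype(float)
        h, y_hat = forward(x, (m1, m2), (W1, b1, W2, b2))
        # Step 4: gradient ascent on -0.5 * ||y - y_hat||^2 (backpropagation).
        err = y - y_hat
        gW2, gb2 = np.outer(err, m2 * h), err
        da = m2 * (W2.T @ err) * h * (1.0 - h)
        gW1, gb1 = np.outer(da, m1 * x), da
        eta = eta0 / (1.0 + t / 500.0)  # hypothetical decreasing schedule
        W1 += eta * gW1; b1 += eta * gb1
        W2 += eta * gW2; b2 += eta * gb2
    return W1, b1, W2, b2

def predict(x, params, keep_prob=0.5):
    """Test-time output with each mask replaced by its expectation."""
    W1, b1, W2, b2 = params
    h = sigmoid(keep_prob * (W1 @ x) + b1)
    return keep_prob * (W2 @ h) + b2
```

With `keep_prob=0.5`, `predict` is the halved-weights rule; with `keep_prob=1.0` it reduces to the unmasked network.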

2.2 Prediction after dropout training

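After training of $\theta$, the output $y$ for the test input $x$ is given as

$$y = W^{(2)} \langle M^{(2)} \rangle_{p(M^{(2)})}\, \sigma\left(W^{(1)} \langle M^{(1)} \rangle_{p(M^{(1)})}\, x + b^{(1)}\right) + b^{(2)},$$

where $\langle f(z) \rangle_{p(z)}$ denotes the expected value of a function $f(z)$ with respect to the distribution $p(z)$. The distributions $p(M^{(1)})$ and $p(M^{(2)})$ must correspond to the ones used in choosing the mask matrices $M^{(1)}$ and $M^{(2)}$ during learning. If an independent Bernoulli distribution with probability 0.5 is used for choosing both $M^{(1)}$ and $M^{(2)}$, then the three-layered neural network output is equivalent to the output using the halved weights $W^{(1)}/2$ and $W^{(2)}/2$.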

There is another way to approximate the expected output, known as ‘fast dropout’ (Wang & Manning, 2013). When we apply dropout, not only to neural networks but also to other models, we often need to compute a weighted sum of many independent random variables,

$$w^{\top} M x = \sum_i w_i m_i x_i,$$

where the superscript $\top$ denotes the transpose of a vector (or matrix), $w$ is a weight vector, and $M$ is a diagonal mask matrix whose diagonal elements $m_i$ are binary random variables. $x$ is either a sample of the input or the hidden unit, and is treated as a fixed value. Fast dropout approximates this weighted sum of random variables by a single Gaussian random variable, utilizing Lyapunov’s central limit theorem. This Gaussianization improves the accuracy of the calculation of the expectation while saving computation time. It is also shown that the Gaussian approximation is beneficial to learning.
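The Gaussian used in this approximation is determined by the first two moments of the masked sum. The sketch below computes them in closed form and checks them against brute-force enumeration of every mask. It assumes independent mask elements that equal 1 with a common probability `keep_prob`; the function names and the toy vectors are hypothetical, and the code illustrates only the moment matching, not fast dropout training itself.

```python
import itertools
import numpy as np

def gaussian_moments(w, x, keep_prob):
    """Mean and variance of sum_i w_i m_i x_i for independent m_i ~ Bernoulli(keep_prob)."""
    c = w * x
    return keep_prob * c.sum(), keep_prob * (1.0 - keep_prob) * np.sum(c ** 2)

def exact_moments(w, x, keep_prob):
    """Same moments by enumerating all 2^d masks (feasible for small d only)."""
    mean = second = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=len(w)):
        m = np.array(bits)
        prob = np.prod(np.where(m == 1.0, keep_prob, 1.0 - keep_prob))
        s = np.sum(w * m * x)
        mean += prob * s
        second += prob * s ** 2
    return mean, second - mean ** 2

w = np.array([0.5, -1.0, 2.0, 0.3])
x = np.array([1.0, 0.2, -0.7, 1.5])
print(gaussian_moments(w, x, 0.5))
print(exact_moments(w, x, 0.5))
```

The closed form costs a single pass over the features, whereas the enumeration grows as 2^d, which is the computational saving the Gaussian approximation exploits.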

3 Existing interpretations of dropout algorithm

Although the idea of dropout originates from Bayesian model averaging, it has been proposed that the artificial corruption of the model by dropout be interpreted as a kind of adaptive regularization (Wager et al., 2013; Wang & Manning, 2013), analogous to artificial feature corruption, which is also interpreted as a kind of regularization. Under this viewpoint, the dropout regularizer used for generalized linear models can be seen as first-order equivalent to $L_2$-regularization after transforming the input by an estimate of the inverse diagonal Fisher information matrix (Wager et al., 2013). This input transformation makes the regularizer scale invariant, a desirable property that the conventional $L_2$-regularizer does not have.

In (Baldi & Sadowski, 2013), the cost function used in dropout training is analyzed. The cost function used in dropout training is considered as an average of the cost function of submodels. They analyze the difference of the average of the cost function of submodels and a cost function of the average of the submodels, and show that the dropout brings an input-dependent regularization term to the cost function of the average of the submodels. They also pointed out that the strongest regularization effect is obtained when the dropout rate is set to be 0.5, which is a typical value used in practice.

There is also a study that treats dropout from a Bayesian viewpoint. Ba & Frey (2013) extend dropout to ‘standout’, where the dropout rate is adaptively trained. They propose that the random masks used in dropout training should be considered random variables of the Bayesian posterior distribution over submodels, which agrees with our view of dropout. However, their adaptive dropout rate depends on each input variable, which implies a different mask for each input, whereas here we assume a consistent mask for the whole dataset and try to estimate the best mix of models based on the entire training dataset. Also, it is not clear why their update of the dropout rate works as the learning of the posterior. In the following section, we give a clear interpretation of dropout from a Bayesian standpoint and provide a theoretically solid way to optimize the dropout rate. The difference between our study and the existing studies is discussed further in Section 7.

4 Bayesian interpretation of dropout algorithm

4.1 Bayesian interpretation of dropout training

In the standard Bayesian framework, all the parameters are treated as random variables to be inferred via the posterior, except for the hyperparameters. In dropout training, $\theta$ in Section 2 corresponds to the hyperparameter while the mask $m$ in Section 2 corresponds to the parameter. Let $m$ denote an $(n_x + n_h)$-dimensional binary mask vector whose elements take either 0 or 1, and let $D = \{(x_k, y_k)\}_{k=1}^{N}$ denote the training data set.

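Then the marginal log-likelihood for the hyperparameter $\theta$ is defined as

$$\log p(D \mid \theta) = \log \sum_{m} p(D \mid m, \theta)\, p(m). \qquad (1)$$

Although we omit the dependence of $p(m)$ on $\theta$ here for simplicity, it is possible to include such dependence in general. By introducing any distribution $q(m)$ of the parameter $m$, which we call the trial distribution following convention, the marginal log-likelihood can be written as

$$\log p(D \mid \theta) = \langle \log p(D \mid m, \theta) \rangle_{q(m)} + \langle \log p(m) \rangle_{q(m)} - \langle \log q(m) \rangle_{q(m)} + \mathrm{KL}\left(q(m) \,\|\, p(m \mid D, \theta)\right) \ge \langle \log p(D \mid m, \theta) \rangle_{q(m)} + \langle \log p(m) \rangle_{q(m)} - \langle \log q(m) \rangle_{q(m)} \equiv L(q, \theta). \qquad (2)$$

Here $\mathrm{KL}(q(m) \,\|\, p(m \mid D, \theta))$ denotes the Kullback-Leibler divergence between the distributions $q(m)$ and $p(m \mid D, \theta)$, and $p(m \mid D, \theta)$ is the posterior distribution. The lower bound $L(q, \theta)$ in Eq. (2) is valid for any trial distribution $q(m)$ because of the non-negativity of the Kullback-Leibler divergence, and becomes tight only when the trial distribution coincides with the posterior $p(m \mid D, \theta)$. This means $q(m)$ should be close to the posterior to minimize the gap.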
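The decomposition of the marginal log-likelihood into a lower bound plus a Kullback-Leibler term can be checked numerically on a model small enough to enumerate every mask. The sketch below is only an illustration under assumptions that are not from the paper: a linear model with a unit-variance Gaussian likelihood, four maskable inputs, an independent Bernoulli prior over masks, and an independent Bernoulli trial distribution. All names are hypothetical.

```python
import itertools
import numpy as np

def log_bernoulli(masks, prob_one):
    """Log-probability of each binary row under independent Bernoulli(prob_one)."""
    return (masks * np.log(prob_one) + (1.0 - masks) * np.log(1.0 - prob_one)).sum(axis=1)

def bound_decomposition(X, y, w, prior_prob, q_prob):
    """Return (log p(D|theta), lower bound, KL(q || posterior)) by enumerating all masks."""
    d = X.shape[1]
    masks = np.array(list(itertools.product([0.0, 1.0], repeat=d)))  # shape (2^d, d)
    preds = (masks[:, None, :] * X[None, :, :]) @ w                   # shape (2^d, N)
    # Toy likelihood (assumption): y_k ~ N(w^T (m * x_k), 1).
    log_lik = -0.5 * ((y[None, :] - preds) ** 2).sum(axis=1) - 0.5 * len(y) * np.log(2.0 * np.pi)
    log_prior = log_bernoulli(masks, prior_prob)
    log_q = log_bernoulli(masks, q_prob)
    log_joint = log_lik + log_prior
    log_evidence = np.logaddexp.reduce(log_joint)   # log sum_m p(D|m) p(m)
    log_post = log_joint - log_evidence             # log p(m|D)
    q = np.exp(log_q)
    lower_bound = np.sum(q * (log_lik + log_prior - log_q))
    kl = np.sum(q * (log_q - log_post))
    return log_evidence, lower_bound, kl

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
w = np.array([1.0, -0.5, 0.0, 2.0])
y = X @ w + 0.1 * rng.standard_normal(5)
evidence, bound, kl = bound_decomposition(
    X, y, w, prior_prob=0.5, q_prob=np.array([0.9, 0.6, 0.2, 0.9]))
print(evidence, bound + kl)  # the two values agree up to floating-point rounding
```

Changing `q_prob` moves value between the bound and the KL term while their sum stays fixed, which is why a trial distribution closer to the posterior gives a tighter bound.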

Because only the first term of Eq. (2) depends on $\theta$, maximizing the lower bound on the marginal likelihood with respect to $\theta$ for some fixed $q(m)$ is equivalent to maximizing the first term of Eq. (2), i.e.,

$$\theta^{*} = \arg\max_{\theta} \langle \log p(D \mid m, \theta) \rangle_{q(m)} = \arg\max_{\theta} \sum_{m} q(m) \sum_{k=1}^{N} \log p(y_k \mid x_k, m, \theta). \qquad (3)$$

Optimizing this lower bound by stochastic gradient descent leads to the dropout training algorithm explained in Section 2. Note that we need to randomly sample both the input-output pair $(x, y)$ and the mask $m$. The samples of the mask variable should obey the trial distribution $q(m)$. To be identical to the dropout training explained in Section 2.1, we need to assume $q(m) = \prod_i q(m_i)$ and $q(m_i) = \mathrm{Bernoulli}(\lambda_i)$, with $\lambda_i = 0.5$ in the typical setting. In this case, the parameter $\lambda_i$ corresponds to the dropout rate. In this paper, we refer to $q(m)$ as the dropout rate because $q(m)$ determines the dropout rate.

4.2 Bayesian interpretation of the prediction after dropout training

We can also interpret the output $y$ for the test input $x$ after learning $\theta$ from a Bayesian standpoint. In a Bayesian framework, the output for the test input should be inferred as the expected value of the predictive distribution, given as

$$y = \sum_{m} p(m \mid D, \theta)\, y(x; m, \theta) = \langle y(x; m, \theta) \rangle_{p(m \mid D, \theta)}. \qquad (4)$$

That is, the output is predicted by the weighted average of the submodel predictions $y(x; m, \theta)$. However, this calculation often becomes intractable, since the exact calculation of both the posterior itself and the expectation with respect to the posterior requires a summation over $m$, which has $2^{n_x + n_h}$ variations. Therefore, consider replacing the posterior $p(m \mid D, \theta)$ with some tractable trial distribution $q(m)$:

$$y \approx \langle y(x; m, \theta) \rangle_{q(m)} \approx y\left(x; \langle m \rangle_{q(m)}, \theta\right). \qquad (5)$$

When we use independent $\mathrm{Bernoulli}(0.5)$ distributions for $q(m)$, we obtain the same output as explained in Section 2.2. The last approximation in Eq. (5) is the approximation to the expectation over $m$. This may be replaced by the Gaussian approximation explained in Section 2.2.

5 Bayesian dropout algorithm with variable dropout rate

As is apparent from the discussion in the preceding section, $q(m)$ should be close to the posterior $p(m \mid D, \theta)$ because this makes the lower bound tighter, which means the cost function with respect to $\theta$ is presumably closer to the desired cost function, the marginal log-likelihood $\log p(D \mid \theta)$. Moreover, a $q(m)$ that better approximates the posterior $p(m \mid D, \theta)$ enables a more accurate approximation of the predictive distribution.

The best $q(m)$ is the posterior, which, however, cannot be obtained in analytical form in most cases. So consider optimizing $q(m)$ within a certain parametric distribution family $q_\lambda(m)$ that is parameterized by $\lambda$. Then the best parameter $\lambda^{*}$ is obtained as the one that maximizes the lower bound $L$ of Eq. (2),

$$\lambda^{*} = \arg\max_{\lambda} L(q_\lambda, \theta). \qquad (6)$$

The above optimization problem could be solved by a numerical optimization method such as stochastic gradient descent.

Overall, the entire Bayesian dropout algorithm would be as follows;

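Bayesian dropout for learning $\theta$ and $\lambda$:
1. Set $t = 0$ and set initial estimates $\theta^{0}$ and $\lambda^{0}$ for the parameters.
2. Pick a sample pair $(x_t, y_t)$ at random.
3. Randomly set a mask $m_t$ by determining every element independently according to the dropout rate $q_{\lambda^{t}}(m)$.
4. Update the parameter $\theta$ as
$$\theta^{t+1} = \theta^{t} + \eta_\theta \frac{\partial}{\partial \theta} \log p(y_t \mid x_t, m_t, \theta^{t}).$$
5. Update the parameter $\lambda$ of the dropout rate as
$$\lambda^{t+1} = \lambda^{t} + \eta_\lambda \left[ \log p(y_t \mid x_t, m_t, \theta^{t}) \frac{\partial}{\partial \lambda} \log q_\lambda(m_t) + \epsilon \frac{\partial}{\partial \lambda} \left( \langle \log p(m) \rangle_{q_\lambda(m)} - \langle \log q_\lambda(m) \rangle_{q_\lambda(m)} \right) \right], \qquad (7)$$
where $\eta_\theta$ and $\eta_\lambda$ determine the step sizes and should be decreased properly to assure convergence, while $\epsilon$ is the inverse of the effective number of samples, which will be explained later.
6. Increment $t$ and go to step 2 unless a termination condition is satisfied.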

The first term on the right-hand side of Eq. (7) comes from the derivative of the first term of $L/N$, $\langle \langle \log p(y \mid x, m, \theta) \rangle_{q_\lambda(m)} \rangle_{p(x, y)}$, where $\langle \cdot \rangle_{q_\lambda(m)}$ and $\langle \cdot \rangle_{p(x, y)}$ denote the expectations with respect to the trial distribution and the true distribution of the training set $D$. Because these two expectations cannot be evaluated analytically in most cases due to the complex dependence of $y(x; m, \theta)$ on both $m$ and $x$, the expectations are replaced by random samples according to the recipe of stochastic gradient descent. In contrast, the remaining two terms, which correspond to the derivatives of the last two terms of $L$, can be evaluated directly if we assume independent distributions for both $q_\lambda(m)$ and $p(m)$. The summation over $m$ is usually intractable because $m$ has $2^{n_x + n_h}$ variations. However, the independence assumption decomposes the intractable summation over $m$ into tractable summations over each binary $m_i$. This allows us to evaluate these terms directly in practice without using a Monte Carlo method. That is the reason why the last two terms do not depend on the sample $(x_t, y_t)$ or on the mask $m_t$.

The scalar $\epsilon$ denotes the inverse of the effective number of samples, i.e., $\epsilon = 1/N$, so that the last two terms correspond to the derivative of the last two terms of $L/N$. This may be scheduled as $\epsilon = 1/t$ when we regard the amount of training data as increasing as the algorithm proceeds. It is noted that many other variants of this algorithm can be considered. For example, we can introduce a momentum term, we can update the parameters for a mini-batch, or we can consider an EM-like algorithm, i.e., the update of $\theta$ (steps 2-4) is repeated until a convergence criterion holds, and then $\lambda$ is updated until its own convergence criterion holds.
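The direct evaluation of the last two terms under the independence assumption can be sketched as follows. The sketch assumes, beyond what the text states, that both the trial distribution and the prior are products of Bernoulli distributions, that each rate parameter is read as the probability that the mask element equals 1, and that the sampled part of the gradient uses the score-function form (log-likelihood times the derivative of the log trial probability). The clipping in `lambda_step` is an added safeguard for this example, and all function names are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log(1.0 - p)

def entropy_prior_terms(lam, prior_prob):
    """<log p(m)>_q - <log q(m)>_q for independent Bernoulli q (rates lam) and prior."""
    return np.sum(lam * np.log(prior_prob) + (1.0 - lam) * np.log(1.0 - prior_prob)
                  - lam * np.log(lam) - (1.0 - lam) * np.log(1.0 - lam))

def entropy_prior_grad(lam, prior_prob):
    """Closed-form derivative of entropy_prior_terms with respect to each lam_i."""
    return logit(prior_prob) - logit(lam)

def score(mask, lam):
    """d log q(mask) / d lam_i for a sampled binary mask."""
    return (mask - lam) / (lam * (1.0 - lam))

def lambda_step(lam, mask, log_lik, prior_prob, eta, eps, clip=1e-3):
    """One stochastic update of lam; the sampled term uses one mask and one data pair."""
    grad = log_lik * score(mask, lam) + eps * entropy_prior_grad(lam, prior_prob)
    # Keep the rates strictly inside (0, 1); this clipping is not from the paper.
    return np.clip(lam + eta * grad, clip, 1.0 - clip)

lam = np.array([0.3, 0.5, 0.8])
mask = np.array([1.0, 0.0, 1.0])
print(lambda_step(lam, mask, log_lik=-1.2, prior_prob=0.5, eta=0.01, eps=1.0 / 2000))
```

Because `entropy_prior_grad` is a sum of per-element terms, it needs no enumeration over the exponentially many masks and does not depend on the sampled mask or data pair.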

Following standard dropout training, one natural choice for the parameterization of $q_\lambda(m)$ would be $q_\lambda(m) = \prod_i \mathrm{Bernoulli}(m_i; \lambda)$, where $\lambda$ represents a dropout rate common to all the input features. Hereafter, we refer to dropout with the optimized uniform scalar $\lambda$ as uniformly optimized rate dropout (UOR dropout). We could consider other parameterizations of $q_\lambda(m)$. For example, we can parameterize the rate for each $m_i$ differently by letting $q_\lambda(m) = \prod_i \mathrm{Bernoulli}(m_i; \lambda_i)$. Hereafter, we refer to dropout with the optimized element-wise $\lambda_i$ as feature-wise optimized rate dropout (FOR dropout). The parameter of the mask distribution can also specify the distribution of each subset of $m$ independently, such as $q_\lambda(m) = \prod_k \prod_{i \in S_k} \mathrm{Bernoulli}(m_i; \lambda_k)$, where $S_k$ denotes the $k$-th subset of $m$. A subset could be the units in the same layer. This layer-wise setting of the dropout rate is sometimes used in practice, e.g., (Graham, 2014).

6 Experiment

We test the validity of the optimization of the dropout rate with a binary classification problem.

6.1 Data

Let $y$ be a target binary label and $x$ be a 1000-dimensional input vector consisting of 100 informative features and 900 non-informative features. The label $y$ is sampled at random. Each informative feature $x_i$ ($i = 1, \ldots, 100$) is generated independently according to a label-dependent Gaussian distribution so that each informative feature has a weak correlation with the label (cross-correlation is 0.1), while each non-informative feature $x_i$ ($i = 101, \ldots, 1000$) is generated independently according to a Gaussian distribution irrelevant to the label $y$. Here, $\mathcal{N}(\mu, s^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $s^2$. 2000 samples are generated for training, and another 1000 samples are generated for validation. The performance is evaluated with another 20,000 test samples.

6.2 Model and Algorithm

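Linear logistic regression is used for this task:

$$p(y = 1 \mid x, m, \theta) = \sigma\left(w^{\top} M x\right),$$

where $M$ is a diagonal matrix whose diagonal elements are the binary masks $m_i$. In this case, the Gaussian approximation to $w^{\top} M x$ works effectively, as proposed in (Wang & Manning, 2013). Approximating $w^{\top} M x$ by a Gaussian random variable whose mean is $\mu = \sum_i \langle m_i \rangle w_i x_i$ and whose variance is $s^2 = \sum_i \langle m_i \rangle (1 - \langle m_i \rangle) w_i^2 x_i^2$, and applying a well-known formula that approximates the Gaussian integral of the sigmoid function, $\int \sigma(a)\, \mathcal{N}(a; \mu, s^2)\, da \approx \sigma\left(\mu / \sqrt{1 + \pi s^2 / 8}\right)$, the predictive distribution is obtained as

$$p(y = 1 \mid x, \theta) \approx \sigma\left(\frac{\mu}{\sqrt{1 + \pi s^2 / 8}}\right).$$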
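The predictive computation described here can be sketched as below. It is an illustration under assumptions that are not taken from the paper: there is no bias term, `mean_mask` holds the expected value of each independent binary mask element, and the weights and input are random toy values. The Monte Carlo routine is included only as a rough point of comparison, and all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dropout_predictive(w, x, mean_mask):
    """Approximate <sigmoid(w^T M x)> over independent binary masks with means mean_mask."""
    c = w * x
    mu = np.sum(mean_mask * c)                           # mean of w^T M x
    s2 = np.sum(mean_mask * (1.0 - mean_mask) * c ** 2)  # variance of w^T M x
    # Standard approximation of the Gaussian integral of the sigmoid.
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

def monte_carlo_predictive(w, x, mean_mask, n_samples=20000, seed=0):
    """Average the sigmoid over sampled masks (slow reference estimate)."""
    rng = np.random.default_rng(seed)
    masks = (rng.random((n_samples, len(w))) < mean_mask).astype(float)
    return sigmoid(masks @ (w * x)).mean()

rng = np.random.default_rng(0)
w = 0.3 * rng.standard_normal(50)
x = rng.standard_normal(50)
keep = np.full(50, 0.5)
print(dropout_predictive(w, x, keep), monte_carlo_predictive(w, x, keep))
```

When every element of `mean_mask` is 1 the variance vanishes and the formula reduces to the ordinary logistic regression output; a larger variance pulls the prediction toward 0.5.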


We compared four algorithms: 1) maximum likelihood estimation (MLE), 2) the standard dropout algorithm (fixed dropout), where the dropout rate is fixed at 0.5, 3) the Bayesian dropout algorithm that optimizes a uniform dropout rate (UOR dropout), and 4) the Bayesian dropout algorithm that optimizes feature-wise dropout rates (FOR dropout). For the optimization of $\theta$ and $\lambda$, we need to determine the step sizes $\eta_\theta$ and $\eta_\lambda$ of the algorithm described in Section 5 properly. These values are scheduled as and . The step size used for updating should be decreased slower than used for updating . Considering this, we chose the best parameter set , and that shows the best accuracy for the validation data among , , , , respectively. For the Bayesian dropout algorithm, the inverse of the effective number of samples $\epsilon$ is simply set to the inverse of the number of training samples $N$, i.e., $\epsilon = 1/N$, and the initial dropout rates are all set to 0.5 for both UOR dropout and FOR dropout.

6.3 Result

Test accuracy after learning is shown in Fig.1(a), and the trained dropout rate is shown in Fig.1(b). In Fig.1(b), the height of the bar denotes the dropout rate of FOR dropout while ’+’ denotes the dropout rate of UOR dropout.

Figure 1: Experimental results (a) Test accuracy  (b) Dropout rate after learning

Because only 10% of the features are relevant to the label, it is difficult to determine the best uniform dropout rate. Eventually, the optimized uniform dropout rate becomes zero, so the test accuracies of MLE and UOR dropout take almost the same value. The fixed dropout rate shows slightly better accuracy, although the difference is not visible in Fig. 1(a). On the other hand, if we optimize the feature-wise dropout rates, the test accuracy increases, getting closer to the Bayes optimal (the theoretical limit). As can be seen from Fig. 1(b), only the dropout rates of the informative features (the first 100 elements) are selectively low. Note that FOR dropout shows a significant regularization effect although it has no hyperparameter to be tuned.

7 Discussion

7.1 Why Bayesian?

The input feature selection problem and the hidden internal unit selection problem are, in general, difficult because they require comparing a vast number of models, as many as $2^{n_x}$ and $2^{n_h}$ in the case of the three-layered neural network explained in Section 2.

Moreover, the maximum likelihood criterion is not useful in choosing the best mask $m$ because it always prefers the largest network, that is, it always chooses $m = \mathbf{1}$, where $\mathbf{1}$ is a vector whose elements are all 1.


It is a well-known fact that, for non-singular statistical models trained by maximum likelihood estimation or by maximum a posteriori (MAP) estimation with a fixed prior, the discrepancy between the generalization error and the training error is proportional to the number of parameters (see Section 6.4 of Watanabe (2009), for example).

In contrast, Bayesian inference does not rely on the prediction of a single model, but on the weighted sum of submodel predictions, where the weights are determined according to the posterior of the submodels. The marginal likelihood integrated over all possible submodels can take the complexity of the model into account. This fact helps to prevent overfitting of the type explained in Section 3.4 of (Bishop, 2006). As for the asymptotic behavior of the Bayes generalization error, there are studies based on algebraic geometry (Watanabe, 2009), which reveal that the asymptotic behavior of a singular statistical model differs from that of a non-singular statistical model: the generalization error does not necessarily increase in proportion to the number of parameters. This partly explains why recent big neural networks, which are categorized as singular statistical models, can resist overfitting, as opposed to regular statistical models optimized by maximum likelihood estimation or MAP estimation.

7.2 What’s different from the other studies?

Besides the original study (Hinton et al., 2012), there are several studies that view dropout training as a kind of approximate Bayesian learning. However, their interpretations differ from ours or lack a solid theoretical foundation.

In (Wang & Manning, 2013), it is stated that the cost function for $\theta$ can be seen as a lower bound of the marginal log-likelihood. However, their marginal log-likelihood is defined as $\sum_k \log \sum_m p(y_k \mid x_k, m, \theta)\, p(m)$, which has the lower bound $\sum_k \sum_m p(m) \log p(y_k \mid x_k, m, \theta)$ with an independent Bernoulli distribution $p(m)$ as the dropout rate. A similar cost function is proposed by (Ba & Frey, 2013). They propose to optimize the dropout rate so as to maximize a cost function of this per-sample form. Basically, their view is that $m$ is a hidden variable rather than a parameter, because $m$ is not inferred from the entire training dataset but sample by sample, as $p(m \mid x_k, y_k)$. In contrast, we define the marginal likelihood as $p(D \mid \theta) = \sum_m p(m) \prod_k p(y_k \mid x_k, m, \theta)$. With this definition, we may interpret the computation of the expectation of the submodel prediction with respect to the submodel posterior (i.e., Bayesian model averaging) as a Bayesian way to solve the model selection problem.

We shall note that, in the context of neural networks, there is nothing new about the philosophy of assigning a prior to the parameters and seeking the posterior that achieves the best Bayesian prediction, Eq. (4). See, for example, (Neal, 1996; Xiong et al., 2011), which assigned Gaussian distributions to the weight matrices $W$. The search for the optimal prior over the vast space of probability measures is indeed a daunting task, however, and the aforementioned algorithms suffer from long runtimes. From this computational point of view, the standard dropout algorithm emerges as an efficient Bayesian compromise for this search. In particular, it restricts the search over the distribution of each element $\tilde{W}_{ij}$ of the effective weight matrix to the ones that can be written as

$$\tilde{W}_{ij} = m_j W_{ij}, \qquad m_j \sim \mathrm{Bernoulli}(0.5),$$

for all $i$-th units adjacent to the connection from the $j$-th unit. (To correspond precisely to the original dropout algorithm, we also need to set the prior appropriately so that the regularization term becomes identical to the regularization term used in the original dropout algorithm.) In other words, the algorithm only looks into a specific family of discrete-valued distributions parametrized by the values of $W_{ij}$. In this way, the standard dropout algorithm bridges stochastic feature selection and Bayesian averaging.

In this light, our algorithm can be seen as a one-step improvement of the standard dropout algorithm; it restricts the search over the posterior distribution of $\tilde{W}$ to the ones of the form

$$\tilde{W}_{ij} = m_j W_{ij}, \qquad m_j \sim \mathrm{Bernoulli}(\lambda_j),$$

for all $i$-th units adjacent to the connection from the $j$-th unit. This is a set of discrete-valued distributions parametrized by $W_{ij}$ and $\lambda_j$. Because the search space becomes larger, our method will consume more runtime than the standard dropout algorithm. By allowing more freedom in the distribution, however, one can expect the trained machine to be much more data specific. As is apparent from the above discussion, we can consider dropconnect (Wan et al., 2013) as another parameterization of the distribution of $\tilde{W}$, i.e., $\tilde{W}_{ij} = m_{ij} W_{ij}$, where $m_{ij}$ obeys a Bernoulli distribution.

If need be, one can also group the mask variables to match the specific task required for the model. For example, let us consider a family of time-series prediction models such as the vector autoregression (VAR) model:

$$x_t = \sum_{k=1}^{K} A_k x_{t-k} + u_t,$$

where $u_t$ is a noise term. For this family, the number of parameters grows quickly with the state space dimension and the lag size $K$, making the search for the best posterior distribution difficult. Again, we may put $A_k = \tilde{A}_k M_k$, where the $M_k$ are diagonal matrices with binary entries $m_{k,i}$. We may reduce the number of hyperparameters by assuming some structure on the set of $M_k$s and thereby introduce intended space-time correlations. This type of parametrization may allow us to resolve very sophisticated feature selection problems, which are considered intractable by conventional methods.