For large-scale tasks like image classification, the general practice in recent times krizhevsky2012imagenet has been to train large Convolutional Neural Network (CNN) models. Even with large datasets, the risk of over-fitting runs high because of the large model size. As a result, strong regularizers are required to restrict the complexity of these models. Dropout srivastava2014dropout is a stochastic regularizer that has been widely used in recent times. However, the rule itself was proposed as a heuristic, with the objective of reducing co-adaptation among neurons. As a result, its behaviour was (and still is) not well understood. Gal and Ghahramani gal2015bayesian showed that dropout implicitly performs approximate Bayesian inference, which makes the resulting network a Bayesian Neural Net.
Bayesian Neural Nets (BNNs) view parameters of a Neural Network as random variables rather than fixed unknown quantities. As a result, there exists a distribution of possible values that each parameter can take. By placing an appropriate prior over these random variables, it is possible to restrict the model’s capacity and implicitly perform regularization. The theoretical attractiveness of these methods is that one can now use tools from probability theory to work with these models. What advantages do BNNs offer over plain Neural Nets? First, they inherently capture uncertainty - both in the model parameters as well as predictions. Second, they are ideal for learning from small amounts of data. Third, a Bayesian approach has the advantage of distilling complex assumptions about the model in the form of prior distributions.
Inference over BNNs is typically intractable. As a result, one often uses approximations to the posterior distribution. MCMC and Variational Inference (VI) bishop2006pattern are two popular methods for performing these approximations. In recent times, VI has emerged as the preferred method because it scales to large models. When using VI, it is common to assume independence of model parameters. For Neural Networks, this assumption may seem unnecessarily stringent. After all, weights in a particular filter are highly correlated to produce specific patterns (an oriented edge, for instance). However, different filters in a CNN are more-or-less independent as they compute different features. In fact, it might even be advantageous to enforce independence of different filters through VI, as this reduces co-adaptation among features. In this work, we strive to enforce independence among features rather than weights.
The overall contributions of the paper are as follows:
- We derive a Bayesian approach to performing inference with neural networks. In doing so, we introduce a rich family of regularizers: Generalized Dropout (GD).
- We perform experimental analysis with Dropout++, a set of methods under GD, to understand its behaviour.
- We perform experiments with Stochastic Architecture Learning, another set of methods under GD, and show that they can be used to select the width of neural networks.
- We test Dropout++ on standard networks and show that it can be used to boost performance.
2 Bayesian Neural Networks
In this section, we shall formally introduce the notion of BNNs and also discuss our proposed method. Let $\hat{y}(x; \theta)$ denote a neural network function with parameters $\theta$. For a given input $x$, the neural network produces a probability distribution over possible labels (through softmax) for a classification problem. Given training data $\mathcal{D}$, the parameter vector $\theta$ is updated using Bayes' Rule.

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} \quad (1)$$

After computing the posterior distribution, we perform inference on a new data point $x$ as follows. Since the neural network produces a probability distribution over labels,

$$P(y \mid x, \mathcal{D}) = \int P(y \mid x, \theta)\, P(\theta \mid \mathcal{D})\, d\theta = \mathbb{E}_{\theta \sim P(\theta \mid \mathcal{D})}\left[\hat{y}(x; \theta)\right] \quad (2)$$

Computing the posterior from equation 1 is intractable due to the complicated neural network structure. In Variational Inference, we define another distribution $q(\theta)$, called the variational distribution, which approximates the posterior $P(\theta \mid \mathcal{D})$. This distribution is used instead of the posterior in equation 2 for inference.
One common assumption when working with the variational distribution is the mean-field assumption, which requires that

$$q(\theta) = \prod_i q(\theta_i) \quad (3)$$

This states that all parameters in $\theta$ are independent. Such an assumption is certainly not true, especially at the feature-level, where parameters are highly correlated. Works like those of Denil et al. denil2013predicting explicitly show that parameters of a NN can be predicted given other parameters, hinting at the large amount of correlation present. Trying to enforce independence among such weights may end up having adverse effects.
It is very difficult to overcome the independence assumption within the framework of VI. In the next section, we introduce an approach to overcome this difficulty.
2.1 Bayesian Neural Networks with Gates
We add multiplicative gates after each feature / neuron in the neural network, as in Figure 0(a). These gates modulate the output of each neuron. Let these new parameters be denoted by . Let us also assume that they lie in . Intuitively, these gates now control the relative importance of each particular feature as viewed by the next layer. Such relative importance may be fundamentally uncertain given a particular feature. Hence, it may be useful to think of gate parameters as random variables. We shall now crystallize all these assumptions in the form of choices for the variational distribution.
We first place the following prior-hyperprior pair over the gate parameters, and a prior over the regular parameters .
Note that the products are over all possible variables defined in the network. Here,
denotes the bernoulli parameters and also needs to be estimated along with. Now, given that we use variational inference, let us now define the forms of the variational distributions we use. Let .
Note that even though we make an independence assumption on the weights (equation 5), we overcome the disadvantages described in the previous section by effectively not being Bayesian with respect to , using a delta distribution. Also note that we use the same parameter for both distributions and . While it is true that using different parameters for both distributions could make the formulation more powerful, we use the same parameter for simplicity. Now we write equations describing the variational approximation by using the definitions above.
Our objective is now to solve equation 6. We observe that the exhaustive summation in equation 6 is intractable for large models. A popular method to deal with this is to use a Monte-Carlo approximation of the summation. However, even this may be infeasible for large models. As a result, we further approximate this with a single Monte-Carlo sample. In other words, we perform the following approximation:
While this approximation seems to be drastic, we soon shall see that Classic Dropout also implicitly performs the same approximation.
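To make the cost of this approximation concrete, the following minimal sketch compares an exhaustive expectation over binary gate configurations with Monte-Carlo estimates of it. The toy loss, its weights, and the gate probabilities are hypothetical and chosen only for illustration; they are not taken from the experiments in this paper.

```python
import itertools
import random


def exact_expectation(f, probs):
    """Sum f(z) * P(z) over all 2**n binary gate configurations z."""
    total = 0.0
    for z in itertools.product((0, 1), repeat=len(probs)):
        p_z = 1.0
        for z_i, p_i in zip(z, probs):
            p_z *= p_i if z_i == 1 else 1.0 - p_i
        total += p_z * f(z)
    return total


def sample_gates(probs, rng):
    """Draw one configuration with z_i ~ Bernoulli(probs[i])."""
    return tuple(1 if rng.random() < p else 0 for p in probs)


def mc_expectation(f, probs, n_samples, rng):
    """Average f over n_samples sampled gate configurations."""
    return sum(f(sample_gates(probs, rng)) for _ in range(n_samples)) / n_samples


def toy_loss(z, weights=(0.5, -1.0, 2.0), target=1.0):
    """Hypothetical squared error of a gated linear unit."""
    return (sum(w * z_i for w, z_i in zip(weights, z)) - target) ** 2


probs = [0.9, 0.5, 0.2]                             # hypothetical gate probabilities
rng = random.Random(0)
exact = exact_expectation(toy_loss, probs)          # enumerates 2**3 terms
single = mc_expectation(toy_loss, probs, 1, rng)    # one sample per evaluation
many = mc_expectation(toy_loss, probs, 20000, rng)  # many samples, for reference
```

The exhaustive sum grows as 2^n in the number of gates, whereas a single sample costs one forward pass. The single-sample estimate is unbiased but noisy; in training, that noise is averaged out over many stochastic gradient steps rather than within one step.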
2.2 Generalized Dropout
Given all the assumptions and approximations discussed above, we now write the complete objective function we aim to solve. Since the variational distributions for and are delta distributions, we shall now use instead of in our notations, for simplicity.
In the expression above, we have used the fact that is a beta distribution. This form of the objective function 2.2, with gates constitutes the Generalized Dropout regularizer.
Let us now briefly look at the behaviour of the beta distribution at various values of , as shown in Figure 0(b). We shall refer to each of these specific cases as different versions of Dropout++. For reasons to be discussed later, we shall refer to the last case as Stochastic Architecture Learning (SAL).
Dropout++ (0.5), where : is the most probable value of .
Dropout++ (flat), where : All values of are equally probable.
Dropout++ (1), where : is the most probable value of .
Dropout++ (0), where : is the most probable value of .
SAL, where : and are the most probable values of .
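The shapes listed above can be checked numerically from the unnormalised log-density of a beta distribution, (alpha - 1) log p + (beta - 1) log(1 - p). The sketch below is only an illustration: the symbol `p` for a gate probability and the specific (alpha, beta) pairs are assumptions picked to reproduce each qualitative shape, not settings reported in this paper.

```python
import math


def beta_log_prior(p, alpha, beta, eps=1e-8):
    """Unnormalised log-density of Beta(alpha, beta) at gate probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite at 0 and 1
    return (alpha - 1.0) * math.log(p) + (beta - 1.0) * math.log(1.0 - p)


# Hypothetical (alpha, beta) choices, one per qualitative shape.
shapes = {
    "peaked at 0.5": (2.0, 2.0),
    "flat": (1.0, 1.0),
    "favours 1": (2.0, 1.0),
    "favours 0": (1.0, 2.0),
    "favours 0 and 1": (0.5, 0.5),
}

# Log-prior evaluated near 0, at 0.5, and near 1 for each shape.
profile = {
    name: [beta_log_prior(p, a, b) for p in (0.01, 0.5, 0.99)]
    for name, (a, b) in shapes.items()
}
```

Adding the negative of this log-prior to the training loss penalises gate probabilities that the chosen shape considers unlikely; with the flat choice the term is constant and contributes no gradient.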
Note that Dropout++ (0.5) becomes indistinguishable from Classic Dropout (with 0.5 Dropout rate) at . To obtain other Dropout rates, we simply ensure . In the next section, we shall discuss another algorithm called Architecture Learning, and how it relates to the SAL method above.
2.3 Architecture Learning
Srinivas and Babu srinivas2015learning recently introduced a method to learn the width and depth of neural network architectures. They also add additional learnable parameters similar to gates. Also, their objective function has the following form in our notation.
Note that our objective function 2.2 looks very similar to this when and , except that we use instead of . Another difference is that they use a Heaviside threshold to select rather than sampling from a Bernoulli. We observe that this is equivalent to taking a maximum likelihood sample from the Bernoulli distribution. Given these similarities, we found it apt to name the corresponding method with as Stochastic Architecture Learning, as it is a stochastic version of the algorithm described above.
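The equivalence between thresholding and a maximum likelihood sample can be illustrated with a small sketch. It assumes the threshold sits at 0.5 and uses `p` for a gate probability; both are assumptions made for this example rather than details stated in the text.

```python
import random


def bernoulli_sample(p, rng):
    """Stochastic gate: returns 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def heaviside_select(p, threshold=0.5):
    """Deterministic gate: 1 if p exceeds the (assumed) threshold, else 0."""
    return 1 if p > threshold else 0


def most_likely_outcome(p):
    """Outcome in {0, 1} that a Bernoulli(p) variable takes most often."""
    return 1 if p > 1.0 - p else 0


rng = random.Random(0)
draws = [bernoulli_sample(0.9, rng) for _ in range(1000)]
frequency_of_one = sum(draws) / len(draws)
```

For any `p`, `heaviside_select(p)` and `most_likely_outcome(p)` agree, while `bernoulli_sample` only agrees with them on average, more often the further `p` is from 0.5.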
Most surprisingly, we find that the motivation to arrive at this algorithm was completely different - they intended to minimize the number of neurons in the network. We arrive at a very similar formulation from a purely Bayesian perspective.
2.4 A Practitioner’s Perspective
In this section, we shall attempt to provide an intuitive explanation for Generalized Dropout. Going back to Fig. 0(a), each neuron is augmented with a gate which learns values between 0 and 1. This is enforced by our regularizers as well as by parameter clipping. During the forward pass, we treat each of these gate values as probabilities and toss a coin with that probability. The output of the coin toss is used to block / allow neuron outputs. As a result of the learning, important features tend to have higher probability values than unimportant features.
At test time, we do not perform any sampling. Rather, we simply use the real-valued probability values in the gate variables. This approximation - called re-scaling - is used in classical Dropout as well.
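The training-time and test-time behaviour described above can be sketched as a single function. This is a minimal illustration on plain Python lists; the function name `gated_forward` and the example activations and gate values are hypothetical.

```python
import random


def gated_forward(activations, gate_probs, training, rng=None):
    """Apply one multiplicative gate per neuron output.

    Training: toss a coin per gate and block / allow the neuron output.
    Test: no sampling; scale each output by its gate probability (re-scaling).
    """
    if training:
        rng = rng or random.Random()
        mask = [1.0 if rng.random() < p else 0.0 for p in gate_probs]
        return [a * m for a, m in zip(activations, mask)]
    return [a * p for a, p in zip(activations, gate_probs)]


activations = [2.0, -1.0, 0.5]   # hypothetical neuron outputs
gate_probs = [0.9, 0.5, 0.1]     # hypothetical learnt gate values

rng = random.Random(0)
train_out = gated_forward(activations, gate_probs, training=True, rng=rng)
test_out = gated_forward(activations, gate_probs, training=False)
```

Re-scaling returns the expected value of the training-time output of this layer, which is why it serves as a cheap stand-in for averaging over many sampled masks.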
What do the different Generalized Dropout methods do? Intuitively, they place restrictions on the gate values (probabilities) that can be learnt. As an example, Dropout++ (0) encourages most gate values to be close to 0, with only a few important ones being high. On the other hand, Dropout++ (1) encourages gate values to be close to 1. Intuitively, this means that Dropout++ (0) restricts the capacity of a layer by a large amount, whereas Dropout++ (1) hardly changes anything. SAL, on the other hand, encourages gate values to be close to either 0 or 1. In contrast to other methods, SAL produces neural network layers that are very close to being deterministic: neurons with gates close to 0 are almost never 'on' and those with gates close to 1 are almost always 'on'. Dropout++ (flat) is also unique in the sense that it doesn't place any restriction on the gate values. As a result, we do not need to set any hyper-parameters for this method. From a Bayesian perspective, when we have no prior beliefs on what the gate values should be, we use the most non-informative prior, which is Dropout++ (flat) in this case.
Dropout++ (0.5) encourages values to be close to 0.5. If the regularization constants are increased, then gate values other than 0.5 are penalized more and more heavily. In the limiting case we get Dropout, where any deviation from a probability value of 0.5 is "infinitely" penalized.
2.5 Estimating gradients for binary stochastic gates
Given our formalism of stochastic gate variables, it is unclear how one might compute error gradients through them. Bengio et al. bengio2013estimating investigated this problem for binary stochastic neurons and empirically verified the efficacy of different solutions. They conclude that the simplest way of computing gradients, the straight-through estimator, works best overall. This involves simply back-propagating through a stochastic neuron as if it were an identity function. If the sampling step is given by , then the gradient is used.
Another issue of consideration is that of ensuring that always lies in so that it is a valid bernoulli parameter. Bengio et al. bengio2013estimating use a sigmoid activation over . Our experiments showed clipping functions worked better. This can be thought of as a ‘linearized’ sigmoid. The clipping function is given by the following expression.
The overall sampling function is hence given by , and the straight-through estimator is used to estimate gradients overall.
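A minimal sketch of one gate update under these choices is given below. It assumes the clipping function is min(1, max(0, x)), treats the whole sample-after-clip step as an identity in the backward pass, and uses a plain gradient step; the names `clip01`, `sample_gate` and `gate_step`, and the learning rate, are hypothetical.

```python
import random


def clip01(x):
    """Assumed clipping function: a 'linearized' sigmoid onto [0, 1]."""
    return min(1.0, max(0.0, x))


def sample_gate(g, rng):
    """Forward pass: z ~ Bernoulli(clip01(g))."""
    return 1.0 if rng.random() < clip01(g) else 0.0


def straight_through(upstream_grad):
    """Backward pass: treat sampling as identity, i.e. dz/dg is taken as 1."""
    return upstream_grad


def gate_step(g, activation, grad_output, lr, rng):
    """One forward/backward pass for a single gated neuron."""
    z = sample_gate(g, rng)
    out = activation * z
    grad_z = grad_output * activation      # d(out)/dz = activation
    grad_g = straight_through(grad_z)      # gradient passed straight to g
    g_new = clip01(g - lr * grad_g)        # gradient step, then parameter clipping
    return out, g_new


rng = random.Random(0)
out_open, g_after = gate_step(1.0, activation=2.0, grad_output=0.5, lr=0.1, rng=rng)
out_closed, _ = gate_step(0.0, activation=2.0, grad_output=0.5, lr=0.1, rng=rng)
```

Note that the gradient reaching `g` does not depend on the sampled value `z`; that is the defining (and biased) simplification of the straight-through estimator.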
2.6 Applying to Convolution Layers
Here we shall discuss how to apply this to convolutional layers. Let us assume that the output feature map from a convolutional layer is , i.e., feature maps of size . Classical dropout samples Bernoulli random variables and performs pointwise multiplication with the output feature map. We follow the same for Generalized Dropout as well.
However, if we wish to perform architecture selection like Architecture Learning srinivas2015learning , we need to select a subset of the feature maps. In this case, we only have gate variables, each multiplying the output of one feature map. When a gate is close to zero, the entire feature map's output becomes close to zero at test time. By selecting a few feature maps out of , we determine which of the filters in the previous layer are essential.
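The two gating granularities can be contrasted in a short sketch. Feature maps are represented as nested lists of shape (maps, height, width); the function names, the example values, and the 0.5 selection threshold are assumptions made for illustration.

```python
import random


def gate_per_element(fmaps, probs, rng):
    """Training-time gating with one Bernoulli gate per output element."""
    return [[[v if rng.random() < p else 0.0 for v, p in zip(row, p_row)]
             for row, p_row in zip(fmap, p_map)]
            for fmap, p_map in zip(fmaps, probs)]


def gate_per_map(fmaps, map_probs, rng=None):
    """One gate per feature map.

    With rng: sample one coin per map (training).
    Without rng: scale each map by its gate value (test time).
    """
    out = []
    for fmap, p in zip(fmaps, map_probs):
        scale = (1.0 if rng.random() < p else 0.0) if rng is not None else p
        out.append([[v * scale for v in row] for row in fmap])
    return out


def selected_maps(map_probs, threshold=0.5):
    """Indices of feature maps whose gate is at least the assumed threshold."""
    return [i for i, p in enumerate(map_probs) if p >= threshold]


fmaps = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]],
         [[-1.0, 0.5], [0.0, 2.0]]]   # three hypothetical 2x2 feature maps
map_gates = [1.0, 0.0, 0.5]

test_time = gate_per_map(fmaps, map_gates)
kept = selected_maps(map_gates)
```

Per-map gating ties all spatial positions of a filter's output to one decision, so a gate near zero marks the corresponding filter as removable, which per-element gating cannot express.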
3 Related Work
There are plenty of works which aim to extend Dropout. DropConnect wan2013regularization stochastically drops weights instead of neurons to obtain better accuracy on ensembles of networks. As stated earlier, using the independence assumption for weights may not be correct. Indeed, DropConnect is shown to work on only fully connected layers. Standout ba2013adaptive is a version of Dropout where the dropout rate depends on the output activations of a layer. Variational Dropout kingma2015variational proposes a Bayesian interpretation for Gaussian Dropout rather than the canonical multiplicative Dropout. By considering multiplicative Dropout, we make important connections to Architecture Learning / neuron pruning. Gal and Ghahramani gal2015bayesian gave a Bayesian interpretation for binary dropout and showed that test performance improves by performing Monte-Carlo averaging rather than re-scaling. For simplicity, we use the re-scaling method at test time for Generalized Dropout. Our work can be seen as an extension of this work by considering a hyper-prior along with a Bernoulli prior.
further investigated this notion by using different priors and relevant approximations for large networks. Probabilistic Backpropagation hernandez2015probabilistic is an algorithm for inferring marginal posterior probabilities for special classes of Bayesian Neural Networks. Our method is different from any of these methods as they are all Bayesian over the weights, whereas we are only Bayesian with respect to the gates.
In this section, we perform experiments with the Generalized Dropout family to test their usefulness. First, we perform a wide variety of analyses with the Generalized Dropout family. Later, we study some specific applications of this method. We perform experiments primarily using Theano bergstra2010theano and Lasagne.
4.1 Analysis of Generalized Dropout
We shall now analyze the behaviours of different members of Generalized Dropout family to find out which ones are useful. For the experiments on the MNIST dataset, we use the standard LeNet-like architecture lecun1998gradient , which consists of two convolutional layers with 20 and 50 filters, and two fully connected layers with 500 and 10 (output layer) neurons. While there is nothing particularly special about this architecture, we simply use this as a standard net to analyze our method.
4.1.1 Effect of data-size
We investigate whether Generalized Dropout indeed has any advantage over Dropout in terms of accuracy. Here, we apply Dropout and Generalized Dropout only to the last fully connected layer. Our experiments reveal that for the network considered, the accuracies achieved by any Generalized Dropout method are not always strictly better than Dropout, as shown in Figure 1(a). This indicates that most of the regularization power of Dropout comes from the independence assumption of Variational Inference, rather than particular values of the dropout parameter. This is a surprising result which we shall use to our advantage in the paper.
However, we note that for small data-sizes, Dropout++ (0) seems to be advantageous over Dropout (Figure 1(b)). This is possibly because Dropout++ (0) forces most neurons (but not all) to have very low capacity due to low value of the parameters.
Footnote 1: Note that in our notation, a large value of Dropout++ indicates a large probability of retaining the neuron, contrary to popularly used notation for Dropout.
Table 1:
| Method | Architecture | Error (%) | No. of Params |
|---|---|---|---|
| Architecture Learning srinivas2015learning | 20-50-20-10 | 0.93 | 41.8k |
| SAL [ = 1] | 18-50-296-10 | 0.69 | 263k |
| SAL [ = 10] | 11-33-38-10 | 0.84 | 29.8k |
| SAL [ = 100] | 7-13-16-10 | 1.14 | 5.9k |

Table 2:
| Model | Original | D++ () | Drop () | D++ () | Drop () |
4.1.2 Effect of Layer-width
Inspired by the above results about Dropout++ (0), we look at the relationship between using different layer-widths for the fully connected layer and the learnt gate parameters. Intuitively, it is natural to assume that larger layers should learn lower gate values, whereas smaller layers should learn much higher values, if we wish for the overall capacity of the layer to remain roughly the same. Our experiments confirm this intuition as shown in Figure 1(c).
We also test if this flexibility translates to higher accuracy numbers over a fixed dropout value, and we find this to be indeed the case. We find that for small layer-widths, Dropout (at for example), tends to remove too many neurons, while Dropout++ adjusts its parameter values to account for small layer-widths, as shown in Figure 1(d).
4.1.3 Effect of Initialization
Good initialization of parameters is known to play a key role in the generalization of deep learning systems. To test whether this holds for the newly introduced Generalized Dropout parameters as well, we try different initializations of the Generalized Dropout parameters. In this example, we simply initialize all gates to a single constant value. As expected, we find that the choice of this initialization is much less crucial when compared to setting the Dropout value, as shown in Figure 1(e).
The choice of initialization, however, affects training time. As an example, it is empirically observed that Dropout with is much slower than . Therefore, it is helpful to have higher Dropout rates to facilitate faster training. To help faster training in Dropout++, we simply initialize with , i.e., start with a network with no Dropout and gradually learn how much Dropout to add. We observe that this indeed helps training time and at the same time provides the flexibility of Dropout, as shown in Figure 1(f).
4.1.4 Visualization of Learnt Parameters
Until this point, we have focussed on using Generalized Dropout on the fully connected layers. Similar effects hold when we apply these to convolutional layers as well. Here, we visualize the learnt parameters in convolutional layers. First, we add Dropout++ only to the input layer. The resulting gate parameters are shown in Figure 1(g). We observe a similar effect when we add Dropout++ only to the first convolutional layer, as shown in Figure 1(h), which shows the average gate map of all the convolutional filters in that layer. In both cases, we observe that Dropout++ learns to selectively attend to the centre of the image rather than towards the corners.
This has multiple advantages. First, by not looking at the corners of each feature, we can potentially decrease model evaluation time. Second, this breaks translation equivariance implicit in convolutions, as in our case certain spatial locations are more important for a filter than others. This could be helpful when using CNNs for face images (for example), where a filter need not look for an "eye" everywhere in the image. Such locally connected layers have been previously used in works such as DeepFace taigman2014deepface . Dropout++ could offer a more natural way to incorporate such an assumption.
4.1.5 Architecture Selection
We shall now attempt to use Stochastic Architecture Learning (SAL) to automatically learn the required layer width of the network. The inherent assumption here is that the initial architecture is over-complete, and that a sub-set of neurons is sufficient to get similar performance. We first learn the parameters of the network using the SAL regularizer, and later prune neurons with low gate parameters. Figure 1(i) shows that SAL learns gate parameters that are often close to either 0 or 1, resulting in a much sharper rise compared to the other methods. We use this sharp rise as a criterion to select the width of a layer. We observe that varying the parameter encourages the method to get smaller architectures, sometimes at the cost of accuracy, as shown in Table 1.
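One way to turn the "sharp rise" into a pruning rule is sketched below: sort the learnt gate values, locate the largest jump between consecutive values, and keep every neuron above it. This is an assumed operationalisation for illustration, not necessarily the exact criterion used in the experiments; the function name and the example gate values are hypothetical.

```python
def width_from_sharp_rise(gates):
    """Select neurons to keep by thresholding at the largest jump in sorted gates.

    Returns (indices_to_keep, threshold). Requires at least two gates.
    """
    sorted_vals = sorted(gates)
    gaps = [b - a for a, b in zip(sorted_vals, sorted_vals[1:])]
    k = max(range(len(gaps)), key=lambda j: gaps[j])        # position of sharpest rise
    threshold = 0.5 * (sorted_vals[k] + sorted_vals[k + 1])  # midpoint of the jump
    keep = [i for i, g in enumerate(gates) if g > threshold]
    return keep, threshold


# Hypothetical gate values for a six-neuron layer after SAL training.
learnt_gates = [0.02, 0.97, 0.01, 0.99, 0.05, 0.95]
keep, threshold = width_from_sharp_rise(learnt_gates)
new_width = len(keep)
```

When gate values cluster near 0 and 1, as SAL encourages, the result is insensitive to the exact threshold; for the other members of the family the sorted curve rises gradually and any such cut-off becomes a judgement call.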
4.2 Dropout++ on standard models
So far we have studied the various properties of Generalized Dropout by performing various experiments on LeNet. We shall now shift to larger networks to test the effectiveness of Dropout++. Modern networks mainly use dropout only in the fully connected layers, or simply not at all, owing to much more powerful regularizers such as Batch Normalization. Here we shall take such networks, simply add Dropout++ (flat) after each layer, and see if we get an increase in accuracy. We perform experiments with ResNet32, ResNet56 and a generic VGG-like network, all trained on the CIFAR-10 dataset. As shown in Table 2, we see that for all three models, adding Dropout++ is largely helpful.
We have proposed Generalized Dropout, a family of methods that generalize Dropout-like behaviour. One set of methods in this family, Dropout++, is an adaptive version of Dropout. Stochastic Architecture Learning is another set of methods that performs architecture selection. An uninformed choice of the Dropout parameter usually hurts performance. Dropout++ helps in setting a useful parameter value regardless of factors such as layer width and initialization. Experiments show that it is generally beneficial to simply add Dropout++ (flat) after every layer of a Deep Network.
- (1) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- (2) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- (3) Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
- (4) Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- (5) Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
- (6) Suraj Srinivas and R Venkatesh Babu. Learning the architecture of deep neural networks. arXiv preprint arXiv:1511.05497, 2015.
- (7) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- (8) Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
- (9) Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pages 3084–3092, 2013.
- (10) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557, 2015.
- (11) Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
- (12) Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
- (13) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- (14) José Miguel Hernández-Lobato and Ryan P Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. arXiv preprint arXiv:1502.05336, 2015.
- (15) James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 2010.
- (16) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- (17) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- (18) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md Patwary, Mostofa Ali, Ryan P Adams, et al. Scalable bayesian optimization using deep neural networks. arXiv preprint arXiv:1502.05700, 2015.
- (19) Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.