Dropout (Hinton et al., 2012) has recently garnered much attention as a novel regularization strategy for neural networks involving the use of structured masking noise during stochastic gradient-based optimization. Dropout training can be viewed as a form of ensemble learning similar to bagging (Breiman, 1994) on an ensemble of size exponential in the number of hidden units and input features, where all members of the ensemble share subsets of their parameters. Combining the predictions of this enormous ensemble would ordinarily be prohibitively expensive, but a scaling of the weights admits an approximate computation of the geometric mean of the ensemble predictions.
Dropout has been a crucial ingredient in the winning solution to several high-profile competitions, notably in visual object recognition (Krizhevsky et al., 2012a) as well as the Merck Molecular Activity Challenge and the Adzuna Job Salary Prediction competition. It has also inspired work on activation function design (Goodfellow et al., 2013a) as well as extensions to the basic dropout technique (Wan et al., 2013; Wang and Manning, 2013) and similar fast approximate model averaging methods (Zeiler and Fergus, 2013).
Several authors have recently investigated the mechanism by which dropout achieves its regularization effect in linear models (Baldi and Sadowski, 2013; Wang and Manning, 2013; Wager et al., 2013), as well as linear and sigmoidal hidden units (Baldi and Sadowski, 2013). However, many of the recent empirical successes of dropout, and feed forward neural networks more generally, have utilised piecewise linear activation functions (Jarrett et al., 2009; Glorot et al., 2011; Goodfellow et al., 2013a; Zeiler et al., 2013). In this work, we empirically study dropout in rectified linear networks, employing the recently popular hidden unit activation function $f(x) = \max(0, x)$.
We begin by expanding upon previous work which investigated the quality of dropout’s approximate ensemble prediction by comparing against Monte Carlo estimates of the correct geometric average (Srivastava, 2013; Goodfellow et al., 2013a). Here, we compare against the true average, in networks of size small enough that the exact computation is tractable. We find, by exhaustive enumeration of all sub-networks in these small cases, that the weight scaling approximation is a remarkably and somewhat surprisingly accurate surrogate for the true geometric mean.
Next, we consider the importance of the geometric mean itself. Traditionally, bagged ensembles produce an averaged prediction via the arithmetic mean, but the weight scaling trick employed with dropout provides an efficient approximation only for the geometric mean. While, as noted by Baldi and Sadowski (2013), the difference between the two can be bounded (Cartwright and Field, 1978), it is not immediately obvious what effect this source of error will have on classification performance in practice. We therefore investigate this question empirically and conclude that the geometric mean is indeed a suitable replacement for the arithmetic mean in the context of a dropout-trained ensemble.
The questions raised thus far pertain primarily to the approximate model averaging performed at test time, but dropout training also raises some important questions. At each update, the dropout learning rule follows the same gradient that true bagging training would follow. However, in the case of traditional bagging, all members of the ensemble would have independent parameters. In the case of dropout training, all of the models share subsets of their parameters. It is unclear how much this coordination serves to regularize the eventual ensemble. It is also not clear whether the most important effect is that dropout performs model averaging, or that dropout encourages each individual unit to work well in a variety of contexts.
To investigate this question, we train a set of independent models on resamplings (with replacement) of the training data, as in traditional bagging. Each ensemble member is trained with a single randomly sampled dropout mask fixed throughout all steps of training. We combine these independently trained networks into ensembles of varying size, and compare the ensembles’ performance with that of a single network of identical size, trained instead with dropout. We find evidence to support the claim that the weight sharing taking place in the context of dropout (between members of the implicit ensemble) plays an important role in further regularizing the ensemble.
Finally, we investigate an alternative criterion for training the exponentially large shared-parameter ensemble invoked by dropout. Rather than performing stochastic gradient descent on a randomly selected sub-network in a manner similar to bagging, we consider a biased estimator of the gradient of the geometrically averaged ensemble log likelihood (i.e. the gradient of the model being approximately evaluated at test-time), with the particular estimator bearing a resemblance to boosting (Schapire, 1990). We find that this new criterion, employing masking noise with the exact same distribution as is employed by dropout, yields no discernible robustness gains over networks trained with ordinary stochastic gradient descent.
2 Review of dropout
Dropout is an ensemble learning and prediction technique that can be applied to deterministic feedforward architectures that predict a target $y$ given an input vector $v$. These architectures contain a series of hidden layers $h = \{h^{(1)}, \dots, h^{(L)}\}$. Dropout trains an ensemble of models consisting of the set of all models that contain a subset of the variables in both $v$ and $h$. The same set of parameters $\theta$ is used to parameterize a family of distributions $p(y \mid v; \theta, \mu)$, where $\mu \in \mathcal{M}$ is a binary mask vector determining which variables to include in the model; e.g., for a given $\mu$, each input unit and each hidden unit is set to zero if the corresponding element of $\mu$ is 0. On each presentation of a training example, we train a different sub-network by following the gradient of $\log p(y \mid v; \theta, \mu)$ for a different randomly sampled $\mu$. For many parameterizations of $p$ (such as most multilayer perceptrons), the instantiation of different sub-networks $p(y \mid v; \theta, \mu)$ can be obtained by element-wise multiplication of $v$ and $h$ with the mask $\mu$.
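As a minimal sketch of how a mask instantiates a sub-network, the Python below multiplies the input and each hidden layer element-wise by a binary mask before passing activations on. The rectifier hidden units, the single linear output, the 0.5 inclusion probability, and all names (`sample_mask`, `subnetwork_output`) are illustrative assumptions rather than the authors' implementation.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def sample_mask(n_units, p_keep=0.5, rng=random):
    """Draw a binary mask; each unit is kept independently with probability p_keep."""
    return [1 if rng.random() < p_keep else 0 for _ in range(n_units)]

def subnetwork_output(v, hidden_layers, output_layer, mask_v, masks_h):
    """Forward pass of the sub-network selected by (mask_v, masks_h).

    hidden_layers is a list of (W, b) pairs, with W[j] the incoming weights of unit j;
    output_layer is (w_out, b_out) for a single linear output.
    """
    a = [x * m for x, m in zip(v, mask_v)]            # mask the input units
    for (W, b), mask in zip(hidden_layers, masks_h):
        h = [relu(sum(w * x for w, x in zip(row, a)) + bj) for row, bj in zip(W, b)]
        a = [x * m for x, m in zip(h, mask)]          # mask this hidden layer
    w_out, b_out = output_layer
    return sum(w * x for w, x in zip(w_out, a)) + b_out

# Tiny made-up example: 2 inputs, one hidden layer of 3 rectifier units.
hidden_layers = [([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.2])]
output_layer = ([1.0, -2.0, 0.5], 0.3)
v = [0.4, 0.9]
rng = random.Random(0)
mask_v = sample_mask(len(v), rng=rng)
masks_h = [sample_mask(3, rng=rng)]
print(subnetwork_output(v, hidden_layers, output_layer, mask_v, masks_h))
```

During training, a fresh pair of masks would be drawn for every example and the gradient taken through this masked forward pass.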
2.1 Dropout as bagging
Dropout training is similar to bagging (Breiman, 1994) and related ensemble methods (Opitz and Maclin, 1999). Bagging is an ensemble learning technique in which a set of models are trained on different subsets of the same dataset. At test time, the predictions of each of the models are averaged together. The ensemble predictions formed by voting in this manner tend to generalize better than the predictions of the individual models.
Dropout training differs from bagging in three ways:
- All of the models share parameters. This means that they are no longer really trained on separate subsets of the dataset, and much of what we know about bagging may not apply.
- Training stops when the ensemble starts to overfit. There is no guarantee that the individual models will be trained to convergence. In fact, typically, the vast majority of sub-networks are never trained for even one gradient step.
- Because there are too many models to average together explicitly, dropout averages them together with a fast approximation. This approximation is to the geometric mean, rather than the arithmetic mean.
2.2 Approximate model averaging
The functional form of the model becomes important when it comes time for the ensemble to make a prediction by averaging together all the sub-networks’ predictions. When $p(y \mid v; \theta) = \mathrm{softmax}(v^\top W + b)$, the predictive distribution defined by renormalizing the geometric mean of $p(y \mid v; \theta, \mu)$ over all masks $\mu$ is simply given by $\mathrm{softmax}(v^\top W / 2 + b)$. This is also true for sigmoid output units, which are special cases of the softmax. This result holds exactly in the case of a single layer softmax model (Hinton et al., 2012) or an MLP with no non-linearity applied to each unit (Goodfellow et al., 2013a). Previous work on dropout applies the same scheme in deep architectures with hidden units that have nonlinearities, such as rectified linear units, where the method is only an approximation to the geometric mean. The approximation has been characterized mathematically for linear and sigmoid networks (Baldi and Sadowski, 2013; Wager et al., 2013), but seems to perform especially well in practice for nonlinear networks with piecewise linear activation functions (Srivastava, 2013; Goodfellow et al., 2013a).
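The exactness of weight scaling for a single-layer softmax model can be checked numerically. The sketch below assumes dropout on the inputs with inclusion probability 0.5, so that all masks are equally likely. It enumerates every mask, renormalizes the geometric mean of the sub-model predictions, and compares the result with the softmax computed from halved weights. The weights, input, and function names are made up for illustration.

```python
import itertools
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def logits(v, W, b, scale=1.0):
    """Class logits; W[i][k] is the weight from input i to class k. Only the weights are scaled."""
    return [scale * sum(v[i] * W[i][k] for i in range(len(v))) + b[k]
            for k in range(len(b))]

def geometric_mean_prediction(v, W, b):
    """Renormalized geometric mean of the predictions of all 2^n input-masked sub-models."""
    n_classes = len(b)
    log_sum = [0.0] * n_classes
    count = 0
    for mask in itertools.product((0, 1), repeat=len(v)):
        masked = [x * m for x, m in zip(v, mask)]
        p = softmax(logits(masked, W, b))
        for k in range(n_classes):
            log_sum[k] += math.log(p[k])
        count += 1
    g = [math.exp(s / count) for s in log_sum]
    z = sum(g)
    return [x / z for x in g]

v = [0.5, -1.2, 2.0]
W = [[0.3, -0.7], [1.1, 0.4], [-0.5, 0.9]]
b = [0.1, -0.2]
exact = geometric_mean_prediction(v, W, b)
scaled = softmax(logits(v, W, b, scale=0.5))
print(exact, scaled)
```

The two distributions agree up to floating-point error because the log-normalizer of each sub-model does not depend on the class and cancels on renormalization, leaving the mean of the masked logits.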
3 Experimental setup
Our initial investigations employed rectifier networks with 2 hidden layers and 10 hidden units per layer, and a single logistic sigmoid output unit. We applied this class of networks to six binary classification problems derived from popular multi-class benchmarks, simplified in this fashion in order to allow for much simpler architectures to effectively solve the task, as well as a synthetic task of our own design.
Specifically, we chose four binary sub-tasks from the MNIST handwritten digit database (LeCun et al., 1998). Our training sets consisted of all occurrences of two digit classes (1 vs. 7, 1 vs. 8, 0 vs. 8, and 2 vs. 3) within the first 50,000 examples of the MNIST training set, with the occurrences from the last 10,000 examples held back as a validation set. We used the corresponding occurrences from the official MNIST test set for evaluating test error.
We also chose two binary sub-tasks from the CoverType dataset of the UCI Machine Learning Repository, specifically discriminating classes 1 and 2 (Spruce-Fir vs. Lodgepole Pine) and classes 3 and 4 (Ponderosa Pine vs. Cottonwood/Willow). This task represents a very different domain than the first two datasets, but one where neural network approaches have nonetheless seen success (see e.g. Rifai et al. (2011)). (Footnote 1: Unlike Rifai et al. (2011), we train and evaluate on the records of each class from the data split advertised in the original dataset description. This makes the task much more challenging and many methods prone to overfitting.)
The final task is a synthetic task in two dimensions: inputs lie in $(-1, 1) \times (-1, 1)$, and the domain is divided into two regions of equal area: the diamond with corners $(1, 0)$, $(0, 1)$, $(-1, 0)$, $(0, -1)$, and the union of the outlying triangles. In order to keep the synthetic task moderately challenging, the training set size was restricted to 100 points sampled uniformly at random. An additional 500 points were sampled for a validation set and another 1000 as a test set.
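A dataset of this kind could be generated as follows. This is a sketch under assumptions: it takes the domain to be the square $(-1, 1)^2$ and the diamond to be the region $|x_1| + |x_2| < 1$, which is the configuration that splits the square into two regions of equal area. The label convention (1 inside the diamond), the seeds, and the function name are arbitrary choices.

```python
import random

def sample_diamond_task(n, seed=0):
    """Sample n points uniformly from the square and label them by diamond membership."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        label = 1 if abs(x1) + abs(x2) < 1.0 else 0   # 1 = inside the diamond (assumed convention)
        data.append(((x1, x2), label))
    return data

# Split sizes as described in the text; the seeds are arbitrary.
train = sample_diamond_task(100, seed=1)
valid = sample_diamond_task(500, seed=2)
test = sample_diamond_task(1000, seed=3)
print(sum(label for _, label in train) / len(train))
```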
In order to keep the mask enumeration tractable in the case of the larger input dimension tasks, we chose to apply dropout in the hidden layers only. This has the added benefit of simplifying the ensemble computation: though dropout is typically applied in the input layer as well, inclusion probabilities higher than 0.5 are employed there (e.g. in Hinton et al. (2012); Krizhevsky et al. (2012b)), making it necessary to unevenly weight the terms in the average. We chose hyperparameters by random search (Bergstra and Bengio, 2012) over learning rate and momentum (initial values and decrease/increase schedules, respectively), as well as mini-batch size. We performed early stopping on the validation set, terminating when a lower validation error had not been observed for 100 epochs; when training with dropout, the figure of merit for early stopping was the validation error using the weight-scaled predictions.
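The early-stopping rule described above can be sketched as a small helper that scans a sequence of per-epoch validation errors. The function name and the `patience` parameter are illustrative. In an actual training loop the errors would be computed one epoch at a time, using the weight-scaled predictions when training with dropout.

```python
def early_stopping_epoch(validation_errors, patience=100):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation errors.

    Training stops at the first epoch at which no lower validation error
    has been observed for `patience` epochs.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The sequence ended before the patience ran out.
    return len(validation_errors) - 1, best_epoch

# Made-up validation curve, with a short patience for readability.
errors = [0.5, 0.4, 0.45, 0.41, 0.42, 0.39]
print(early_stopping_epoch(errors, patience=3))
```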
4 Weight scaling versus Monte Carlo or exact model averaging
Srivastava (2013) and Goodfellow et al. (2013a) previously investigated the fidelity of the weight scaling approximation in the context of rectifier networks and maxout networks, respectively, through the use of a Monte Carlo approximation to the true model average. By concerning ourselves with small networks where exhaustive enumeration is possible, we were able to avoid the effect of additional variance due to the Monte Carlo average and compute the exact geometric mean over all possible dropout sub-networks.
On each of the 7 tasks, we randomly sampled 50 sets of hyperparameters and trained 50 networks with dropout. We then computed, for each point in the test set for each task, the activities of the network corresponding to each of the $2^{20}$ possible dropout masks (one for every subset of the 20 hidden units). We geometrically averaged the resulting predictions for each test point, which for a sigmoid output unit amounts to arithmetically averaging the input to the output unit over all masks. Finally, we compared the misclassification rate using these predictions to that obtained using the approximate, weight-scaled predictions.
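The comparison can be sketched on a much smaller network. The code below assumes hidden-layer dropout with inclusion probability 0.5, rectifier hidden units, and a sigmoid output. It uses two hidden layers of 3 units (64 masks) instead of the 20 hidden units described above, and random untrained weights. All names and values are illustrative.

```python
import itertools
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_logit(v, layers, w_out, b_out, masks):
    """Input to the sigmoid output unit, with each hidden unit multiplied by its mask entry."""
    a = v
    for (W, b), mask in zip(layers, masks):
        a = [m * relu(sum(w * x for w, x in zip(row, a)) + bj)
             for row, bj, m in zip(W, b, mask)]
    return sum(w * x for w, x in zip(w_out, a)) + b_out

def exact_geometric_mean(v, layers, w_out, b_out):
    """Renormalized geometric mean over all hidden-unit masks (the sigmoid of the mean logit)."""
    sizes = [len(b) for _, b in layers]
    total, count = 0.0, 0
    for bits in itertools.product((0, 1), repeat=sum(sizes)):
        masks, start = [], 0
        for n in sizes:
            masks.append(bits[start:start + n])
            start += n
        total += output_logit(v, layers, w_out, b_out, masks)
        count += 1
    return sigmoid(total / count)

def weight_scaled(v, layers, w_out, b_out):
    """Weight-scaling approximation: every hidden activation is multiplied by 0.5."""
    masks = [[0.5] * len(b) for _, b in layers]
    return sigmoid(output_logit(v, layers, w_out, b_out, masks))

rng = random.Random(0)

def rand_layer(n_out, n_in):
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return W, b

layers = [rand_layer(3, 2), rand_layer(3, 3)]
w_out = [rng.gauss(0.0, 1.0) for _ in range(3)]
b_out = 0.0
v = [0.3, -0.8]
p_exact = exact_geometric_mean(v, layers, w_out, b_out)
p_scaled = weight_scaled(v, layers, w_out, b_out)
print(p_exact, p_scaled)
```

With a single hidden layer the two quantities coincide, because the output logit is linear in the masked activations. With two or more rectifier layers they generally differ, and that gap is what the experiment measures.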
The results are shown in Figure 1, where each point represents a different hyperparameter configuration. The overall result is that the approximation yields a network that performs very similarly. In order to make differences visible, we plot on the $y$-axis the relative difference in test error between the true geometric average network and the weight-scaled approximation for different networks achieving different values of the test error.
Additionally, we statistically tested the fidelity of the approximation via the Wilcoxon signed-rank test, a nonparametric paired sample test similar to the paired $t$-test, applying a Bonferroni correction for multiple hypotheses. No significant differences were observed for any of the seven tasks at the significance level tested.
5 Geometric mean versus arithmetic mean
Though the inexpensive computation of an approximate geometric mean was noted by Hinton et al. (2012), little has been said of the choice of the geometric mean. Ensemble methods in the literature often employ an arithmetic mean for model averaging. It is thus natural to ask whether the choice of the geometric mean has an impact on the generalization capabilities of the ensemble.
Using the same networks trained in Section 4, we combined the forward-propagated predictions of all models using the arithmetic mean.
In Figure 2, we plot the relative difference in test error between the arithmetic mean predictions and the geometric mean predictions. We find that across all seven tasks, the geometric mean is a reasonable proxy for the arithmetic mean, with relative error rarely exceeding 20% except for the synthetic task. In absolute terms, the discrepancy between the test error achieved by the geometric mean and the arithmetic mean never exceeded 0.75% for any of the tasks.
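To make the two averaging rules concrete, the sketch below combines a handful of made-up Bernoulli predictions $p(y = 1)$ in both ways. For binary outputs, the renormalized geometric mean is the sigmoid of the mean logit. The function names and the example values are illustrative only.

```python
import math

def arithmetic_mean(probs):
    """Plain average of the members' predicted probabilities of class 1."""
    return sum(probs) / len(probs)

def geometric_mean(probs):
    """Renormalized geometric mean of Bernoulli predictions: the sigmoid of the mean logit."""
    mean_logit = sum(math.log(p / (1.0 - p)) for p in probs) / len(probs)
    return 1.0 / (1.0 + math.exp(-mean_logit))

probs = [0.05, 0.7, 0.8]   # made-up predictions from three ensemble members
am = arithmetic_mean(probs)
gm = geometric_mean(probs)
print(am, gm)
```

In this constructed example the arithmetic mean lies above 0.5 and the geometric mean below it, because a single confident member pulls the geometric mean more strongly. The experiments above suggest that such disagreements are rare enough to have little effect on test error.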
6 Dropout ensembles versus untied weights
We now turn from our investigation of the characteristics of inference in dropout-trained networks to an investigation of the training procedure. For the remainder of the experiments, we trained networks of a more realistic size and capacity on the full multiclass MNIST problem. Once again, we employed two layers of rectified linear units. In addition to dropout, we utilised norm constraint regularization on the incoming weights to each hidden unit. We again performed random search over hyperparameter values, now including in our search the initial ranges of weights, the number of hidden units in each of two layers, and the maximum weight vector norms of each layer.
Dropout training can be viewed as performing bagging on an ensemble that is of size exponential in the number of hidden units, where each member of the ensemble shares parameters with other members of the ensemble. Because each gradient step is taken on a different mini-batch of training data, each sub-network can be seen to be trained on a different resampling of the training set, as in traditional bagging. Furthermore, while each step is taken with respect to the log likelihood of a single ensemble member, the effect of the weight update is applied to all members of the ensemble simultaneously. (Footnote 2: At least, all members of the ensemble that share any parameters with the sub-network just updated. There certainly exist pairs of ensemble members whose parameter sets are disjoint.) We investigate the role of this complex weight-sharing scheme by training an ensemble of independent networks on resamplings of the training data, each with a single dropout mask fixed in place throughout training.
We first performed a hyperparameter search by sampling 50 hyperparameter configurations and choosing the network with the lowest validation error. The best of these networks obtains a test error of 1.06%, matching results reported by Srivastava (2013). Using the same hyperparameters, we trained 360 models initialized with different random seeds, on different resamplings (with replacement) of the training set, as in traditional bagging. Instead of applying dropout during training (and thus applying a different mask at each gradient step), we sampled one dropout mask per model and held it fixed throughout training and at test time. The resulting networks thus have architectures sampled from the same distribution as the sub-networks trained during dropout training, but each network’s parameters are independent of all other networks.
We then evaluate test error for ensembles of these networks, combining their predictions (with the dropout mask used during training still fixed in place at test time) via the geometric mean, as is approximately done in the context of dropout. Our results for various sizes of ensemble are shown in Figure 3.
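The bookkeeping for such an untied ensemble can be sketched as follows. Each member receives its own bootstrap resample and a single fixed mask, and the members' class distributions are combined through a renormalized geometric mean. Training itself is omitted, the member predictions are made-up numbers, and all function names are illustrative assumptions.

```python
import math
import random

def bootstrap_indices(n, rng):
    """Indices of a resampling (with replacement) of a training set of size n."""
    return [rng.randrange(n) for _ in range(n)]

def fixed_mask(n_units, rng, p_keep=0.5):
    """One dropout mask, sampled once per model and then held fixed."""
    return [1 if rng.random() < p_keep else 0 for _ in range(n_units)]

def geometric_mean_ensemble(member_probs):
    """Renormalized geometric mean of class distributions (all entries must be positive)."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    log_mean = [sum(math.log(p[k]) for p in member_probs) / n_members
                for k in range(n_classes)]
    m = max(log_mean)
    unnorm = [math.exp(value - m) for value in log_mean]
    z = sum(unnorm)
    return [u / z for u in unnorm]

rng = random.Random(0)
members = [{"train_idx": bootstrap_indices(10, rng), "mask": fixed_mask(6, rng)}
           for _ in range(3)]
# Made-up class distributions predicted by the three members for one test point.
probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
combined = geometric_mean_ensemble(probs)
print(combined)
```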
Our results suggest that there is indeed an effect; combining all 360 independently trained models yields a test error of 1.66%, far above that of even the suboptimally tuned networks trained with dropout. Aside from the size of the independent ensemble being considerably smaller, one potential confounding factor is that the non-architectural hyperparameters were selected in the context of their performance when using dropout and used as-is to train the networks with untied weights; although each of these was early-stopped independently, it remains unclear how to efficiently optimize hyperparameters for the individual members of a large ensemble so as to facilitate a fairer comparison (indeed, this highlights a general issue with the high cost of training ensembles of neural networks, which dropout conveniently sidesteps).
7 Dropout bagging versus dropout boosting
Other algorithms such as denoising autoencoders (Vincent et al., 2010) are motivated by the idea that models trained with noise are robust to slight transformations of their inputs. Previous work has drawn connections between noise and regularization penalties (Bishop, 1995); similar connections in the case of dropout have recently been noted (Baldi and Sadowski, 2013; Wager et al., 2013). It is natural to question whether dropout can be wholly characterized in terms of learned noise robustness, and whether the model-averaging perspective is necessary or fruitful.
In order to investigate this question we propose an algorithm that injects exactly the same noise as dropout. For this test to be effective, we require an algorithm that can successfully minimize training error, and obtain acceptable generalization performance. It needs to perform at least as well as standard maximum likelihood; otherwise all we have done is designed a pathological algorithm that fails to train.
We therefore introduce dropout boosting. The objective function for each (sub-network, example) pair in dropout boosting is the likelihood of the data according to the ensemble; however, only the parameters of the current sub-network may be updated for each example. Ordinary dropout performs bagging by maximizing the likelihood of the correct target for the current example under the current sub-network, whereas dropout boosting takes into account the contributions of other sub-networks, in a manner reminiscent of boosting.
The objective function for dropout is $\frac{1}{|\mathcal{M}|} \sum_{\mu \in \mathcal{M}} \log p(y \mid v; \theta, \mu)$, where $\mathcal{M}$ is the set of all dropout masks. For dropout boosting, assume each mask $\mu$ has a separate set of parameters $\theta_\mu$ (though in reality these parameters are tied, as in conventional dropout). The dropout boosting objective function is then given by $\log p_{\text{ensemble}}(y \mid v; \theta)$, where

$$p_{\text{ensemble}}(y \mid v; \theta) = \frac{1}{Z}\, \tilde{p}(y \mid v; \theta), \qquad \tilde{p}(y \mid v; \theta) = \Big( \prod_{\mu \in \mathcal{M}} p(y \mid v; \theta_\mu) \Big)^{1/|\mathcal{M}|}, \qquad Z = \sum_{y'} \tilde{p}(y' \mid v; \theta).$$

The boosting learning rule is to select one model and update its parameters given all of the other models. In conventional boosting, these other models have already been trained to convergence. In dropout boosting, the other models actually share parameters with the network being trained at any given step, and initially the other models have not been trained at all. The learning rule is to select a sub-network indexed by $\mu$ and follow the ensemble gradient $\nabla_{\theta_\mu} \log p_{\text{ensemble}}(y \mid v; \theta)$, i.e.

$$\Delta \theta_\mu \propto \frac{1}{|\mathcal{M}|} \Big[ \nabla_{\theta_\mu} \log p(y \mid v; \theta_\mu) - \sum_{y'} p_{\text{ensemble}}(y' \mid v; \theta)\, \nabla_{\theta_\mu} \log p(y' \mid v; \theta_\mu) \Big].$$

Rather than using the boosting-like algorithm, one could obtain a generic Monte Carlo procedure for maximizing the log likelihood of the ensemble by averaging together the gradient for multiple values of $\mu$, and optionally using a different $\mu$ for the term on the left and the term on the right. Empirically, we obtained the best results in the special case of boosting, where the term on the left uses the same $\mu$ as the term on the right; that is, both terms of the gradient apply updates only to one member of the ensemble, even though the criterion being optimized is global.
Note that the intractable $p_{\text{ensemble}}$ still appears in the learning rule. To implement the training algorithm efficiently, we can approximate the ensemble predictions using the weight scaling approximation. This introduces further bias into the estimator, but our findings in Section 4 suggest that the approximation error is small.
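One way to read the boosting rule is through the error signal it sends into the sampled sub-network. The sketch below assumes a softmax output and takes gradients with respect to the sub-network's output logits, dropping the constant factor in front of the update (it can be absorbed into the learning rate). The ensemble prediction is replaced by a stand-in distribution, as the weight scaling approximation would provide. The logit values and function names are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def dropout_error_signal(z_mu, y):
    """Gradient of log p(y | mu) with respect to the sub-network's logits: onehot(y) - p_mu."""
    p_mu = softmax(z_mu)
    return [(1.0 if k == y else 0.0) - p_mu[k] for k in range(len(z_mu))]

def boosting_error_signal(z_mu, p_ens, y):
    """Likelihood term minus the p_ens-weighted sum of per-class gradients, at the logits."""
    n_classes = len(z_mu)
    first = dropout_error_signal(z_mu, y)
    second = [0.0] * n_classes
    for y_prime in range(n_classes):
        g = dropout_error_signal(z_mu, y_prime)
        for k in range(n_classes):
            second[k] += p_ens[y_prime] * g[k]
    return [f - s for f, s in zip(first, second)]

z_mu = [1.2, -0.3, 0.5]    # logits of the sampled sub-network (made-up values)
z_ws = [0.8, 0.1, 0.4]     # logits of the weight-scaled network (made-up values)
p_ens = softmax(z_ws)      # stand-in for the ensemble prediction
y = 2
print(dropout_error_signal(z_mu, y))
print(boosting_error_signal(z_mu, p_ens, y))
```

Under these assumptions the boosting signal reduces to the one-hot target minus the ensemble prediction. The sub-network is therefore pushed to correct the error of the ensemble rather than its own error, while the signal is still back-propagated through the masked sub-network only.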
Note that dropout boosting employs exactly the same noise as regular dropout uses to perform bagging, and thus should perform similarly to conventional dropout if learned noise robustness is the important ingredient. If we instead take the view that this is a large ensemble of complex learners whose likelihood is being jointly optimized, we would expect that employing a criterion more similar to boosting than bagging would perform more poorly. As boosting maximizes the likelihood of the ensemble, it would perhaps be prone to overfitting in this setting, as the ensemble is very large and the learners are not particularly weak.
Starting with the 50 models trained in Section 6, we employed the same hyperparameters to train a matched set of 50 networks with dropout boosting, and another with plain stochastic gradient descent. In Figure 4, we plot the relative performance of dropout and dropout boosting compared to a model with the same hyperparameters trained with SGD. While dropout unsurprisingly shows a very consistent edge, dropout boosting performs, on average, little better than stochastic gradient descent. The Wilcoxon signed-rank test similarly failed to find a significant difference between dropout boosting and SGD. While several outliers approach very good performance (perhaps owing to the added stochasticity), dropout boosting is, on average, no better and often slightly worse than maximum likelihood training, in stark contrast with dropout’s systematic advantage in generalization performance.
8 Conclusion
We investigated several questions related to the efficacy of dropout, focusing on the specific case of the popular rectified linear nonlinearity for hidden units. We showed that the weight-scaling approximation is a remarkably accurate proxy for the usually intractable geometric mean over all possible sub-networks, and that the geometric mean (and thus its weight-scaled surrogate) compares favourably to the traditionally popular arithmetic mean in terms of classification performance. We demonstrated that weight-sharing between members of the implicit dropout ensemble appears to have a significant regularization effect, by comparing to analogously trained ensembles of the same form that did not share parameters. Finally, we demonstrated that simply adding noise, even noise with identical characteristics to the noise applied during dropout training, is not sufficient to obtain the benefits of dropout, by introducing dropout boosting, a training procedure utilising the same masking noise as conventional dropout, which successfully trains networks but loses dropout’s benefits, instead performing roughly as well as ordinary stochastic gradient descent.
Our results suggest that dropout is an extremely effective ensemble learning method, paired with a clever approximate inference scheme that is remarkably accurate in the case of rectified linear networks. Further research is necessary to shed more light on the model averaging interpretation of dropout. Hinton et al. (2012) noted that dropout forces each hidden unit to perform computation that is useful in a wide variety of contexts. Our results with a sizeable ensemble of independent bagged models seem to lend support to this view, though our experiments were limited to ensembles of several hundred networks at most, tiny in comparison with the weight-sharing ensemble invoked by dropout. The relative importance of the astronomically large ensemble versus the learned “mixability” of hidden units remains an open question. Another interesting direction involves methods that are able to efficiently, approximately average over different classes of model that share parameters in some manner, rather than merely averaging over members of the same model class.
The authors would like to acknowledge the efforts of the many developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) and pylearn2 (Goodfellow et al., 2013b), which were utilised in experiments. We would also like to thank NSERC, Compute Canada, and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning.
- Baldi and Sadowski (2013) Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information Processing Systems 26, pages 2814–2822.
- Bastien et al. (2012) Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
- Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305.
- Bergstra et al. (2010) Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
- Bishop (1995) Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
- Breiman (1994) Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
- Cartwright and Field (1978) Cartwright, D. I. and Field, M. J. (1978). A refinement of the arithmetic mean-geometric mean inequality. Proceedings of the American Mathematical Society, 71(1), pp. 36–38.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
- Goodfellow et al. (2013a) Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML’2013.
- Goodfellow et al. (2013b) Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013b). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
- Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
- Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153. IEEE.
- Krizhevsky et al. (2012a) Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012).
- Krizhevsky et al. (2012b) Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep convolutional neural networks. In NIPS’2012.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
- Opitz and Maclin (1999) Opitz, D. and Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198.
- Rifai et al. (2011) Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011). The manifold tangent classifier. In NIPS’2011. Student paper award.
- Schapire (1990) Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
- Srivastava (2013) Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U. Toronto.
- Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.
- Wager et al. (2013) Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359.
- Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In ICML’2013.
- Wang and Manning (2013) Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013.
- Zeiler and Fergus (2013) Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. Technical Report Arxiv 1301.3557.
- Zeiler et al. (2013) Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. E. (2013). On rectified linear units for speech processing. In ICASSP 2013.