Continuous Dropout

11/28/2019 ∙ by Xu Shen, et al. ∙ University of Technology Sydney ∙ USTC

Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. Current explanations of dropout include bagging, naive Bayes, regularization, and sex in evolution. According to the activation patterns of neurons in the human brain, when faced with different situations, the firing rates of neurons are random and continuous, not binary as in current dropout. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we demonstrate that continuous dropout has the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging in the test stage. We introduce the proposed continuous dropout to a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance. The code is available at: https://github.com/jasonustc/caffe-multigpu/tree/dropout.


I Introduction

Dropout is an efficient algorithm introduced by Hinton et al. for training robust neural networks [11] and has been applied to many vision tasks [33, 12, 16]. During the training stage, hidden units of the neural network are randomly omitted at a rate of 50% [11][23]. Thus, the presentation of each training sample can be viewed as providing parameter updates for a randomly chosen subnetwork. The weights of this subnetwork are trained by backpropagation [25], and the weights of hidden units that remain present are shared across the different subnetworks sampled at each iteration. During the test stage, predictions are made by the entire network, which contains all the hidden units with their weights halved.
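As a concrete illustration of this train/test asymmetry, the following minimal numpy sketch applies a Bernoulli mask to the inputs of one fully connected layer during training and scales the weights by the keep probability at test time (the layer shape, function names, and keep rate are illustrative, not the networks used later in this paper):

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, W, p_keep=0.5):
    # Training: sample a Bernoulli mask per input unit and drop the rest.
    mask = rng.binomial(1, p_keep, size=x.shape)
    return (x * mask) @ W

def dropout_test(x, W, p_keep=0.5):
    # Testing: use the full network with the weights scaled by p_keep
    # (i.e., halved when p_keep = 0.5).
    return x @ (W * p_keep)

x = rng.normal(size=5)           # toy input vector
W = rng.normal(size=(5, 3))      # toy weight matrix
print(dropout_train(x, W))
print(dropout_test(x, W))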

The motivation and intuition behind dropout is to prevent overfitting by avoiding co-adaptation of the feature detectors [11]. Deep networks can achieve better representations than shallow networks, but overfitting is a serious problem when training a large feedforward neural network on a small training set [11][28]. Randomly dropping units from the neural network can greatly reduce this overfitting problem. Encouraged by the success of dropout, several related works have been presented, including fast dropout [32], adaptive dropout [1], and DropConnect [31]. To accelerate dropout training, Wang and Manning suggested sampling the output from an approximated distribution rather than sampling binary mask variables for the inputs [32]. In [1], Ba and Frey proposed adaptively learning the dropout probability from the inputs and weights of the network. Wan et al. generalized dropout by randomly dropping the weights rather than the units [31].

To interpret the success of dropout, several explanations from both theoretical and biological perspectives have been proposed. From the theoretical side, dropout is viewed as an extreme form of bagging [11], as a generalization of naive Bayes [11], or as adaptive regularization [30][2], which has proven to be a very useful approach for neural network training [6]. From the biological perspective, Hinton et al. point out an intriguing similarity between dropout and the theory of the role of sex in evolution [11]. However, no explanation has been proposed from the perspective of the brain's neural network, the origin of deep neural networks. In fact, by analyzing the firing patterns of neural networks in the human brain [5][8][3], we find that there is a strong analogy between dropout and the firing pattern of brain neurons. That is, a small minority of strong synapses and neurons provide a substantial portion of the activity in all brain states and situations [5]. This phenomenon explains why we need to randomly delete hidden units from the network and train different subnetworks for different samples (situations). However, the remainder of the brain is not silent. The remaining neuronal activity in any given time window is supplied by very large numbers of weak synapses and cells, and the amplitudes of the oscillations of these neurons obey a random continuous pattern [8][3]. In other words, the division between "strong" and "weak" neurons is not absolute; they obey a continuous rather than a bimodal distribution [8]. Consequently, we should assign a continuous random mask to each neuron in the dropout network for the division into "strong" and "weak", rather than use a binary mask to choose "activated" and "silent" neurons.

Inspired by this phenomenon, we propose a continuous dropout algorithm in this paper, i.e., the dropout variables are subject to a continuous distribution rather than the discrete (Bernoulli) distribution in [11]. Specifically, in our continuous dropout, the units in the network are randomly multiplied by continuous dropout masks sampled from a uniform distribution $U(0,1)$ or a Gaussian distribution $N(0.5,\sigma^2)$, termed uniform dropout and Gaussian dropout, respectively. Although multiplicative Gaussian noise has been mentioned in [27], no theoretical analysis or generalized continuous dropout form is presented there. We investigate two specific continuous distributions, i.e., uniform and Gaussian, which are commonly used and are also similar to the process of neuron activation in the brain. We conduct extensive theoretical analyses, including both static and dynamic property analyses of our continuous dropout, and demonstrate that continuous dropout prevents the co-adaptation of feature detectors in deep neural networks. In the static analysis, we find that continuous dropout achieves a good balance between the diversity and the independence of subnetworks. In the dynamic analysis, we find that continuous dropout training is equivalent to a regularization of the covariance between weights, inputs, and hidden units, which successfully prevents the co-adaptation of feature detectors in deep neural networks.
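For concreteness, a minimal sketch of the two mask distributions, assuming uniform masks drawn from $U(0,1)$ and Gaussian masks drawn from $N(0.5,\sigma^2)$ as above; the function name and the value sigma = 0.25 are illustrative choices, not prescribed by the method:

import numpy as np

rng = np.random.default_rng(1)

def continuous_dropout(activations, kind="gaussian", sigma=0.25):
    # Multiply each unit by a continuous random mask instead of a 0/1 mask.
    if kind == "uniform":
        mask = rng.uniform(0.0, 1.0, size=activations.shape)   # mean 0.5, variance 1/12
    elif kind == "gaussian":
        mask = rng.normal(0.5, sigma, size=activations.shape)  # mean 0.5, variance sigma^2
    else:
        raise ValueError("kind must be 'uniform' or 'gaussian'")
    return activations * mask

h = rng.normal(size=(4, 8))      # a toy batch of hidden activations
print(continuous_dropout(h, "uniform").shape)
print(continuous_dropout(h, "gaussian", sigma=0.25).mean())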

We evaluate our continuous dropout through extensive experiments on several datasets, including MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-2012. We compare it with Bernoulli dropout, adaptive dropout, and DropConnect. The experimental results demonstrate that our continuous dropout performs better in preventing the co-adaptation of feature detectors and improves test performance.

II Continuous Dropout

In [11], Hinton et al. interpret dropout from the biological perspective, i.e., it has an intriguing similarity to the theory of the role of sex in evolution [19]. Sexual reproduction involves taking half the genes of each parent and combining them to produce offspring. This corresponds to the result that dropout training works best when $p = 0.5$; more extreme probabilities produce worse results [11]. The criterion for natural selection may not be individual fitness but rather the mixability of genes [11]. The ability of genes to work well with another random set of genes makes them more robust. The mixability theory described in [21] is that sex breaks up sets of co-adapted genes, which means that achieving a function by using a large set of co-adapted genes is not nearly as robust as achieving the same function, perhaps less than optimally, in multiple alternative ways, each of which only uses a small number of co-adapted genes.

Following this train of thought, we can infer that randomly dropping units tends to produce more alternative networks, which are able to achieve better performance. For example, when we use one hidden layer with $n$ units for dropout training, i.e., the value of each dropout variable is randomly set to 0 or 1, $2^n$ alternative networks can be produced during training, and they make up the entire network for testing. From this perspective, it is reasonable to take continuous dropout distributions into account because, with continuous dropout variables, a hidden layer with $n$ units can produce an infinite number of alternative networks, which are expected to work better than the Bernoulli dropout proposed in [11]. The experimental results in Section IV demonstrate the superiority of continuous dropout over Bernoulli dropout.

III Co-adaptation Regularization in Continuous Dropout

In this section, we derive the static and dynamic properties of our continuous dropout. Static properties refer to the properties of the network with a fixed set of weights, that is, given an input, how dropout affects the output of the network. Dynamic properties refer to the properties of the weight updates of the network, i.e., how continuous dropout changes the learning process of the network [2]. Because Bernoulli dropout with $p = 0.5$ achieves the best performance in most situations [11][26], we set $p = 0.5$ for Bernoulli dropout. For our continuous dropout, we apply $U(0,1)$ for uniform dropout and $N(0.5,\sigma^2)$ for Gaussian dropout to ensure that all three dropout algorithms have the same expected output (the dropout masks all have mean 0.5).

III-A Static Properties of Continuous Dropout

In this section, we focus on the static properties of continuous dropout, i.e., properties of dropout for a fixed set of weights. We start from a single layer of linear units and then extend to multiple layers of linear and nonlinear units.

III-A1 Continuous dropout for a single layer of linear units

We consider a single fully connected linear layer with input $x$, weighting matrix $W$, and output $y$, so that the $i$-th output is $y_i = \sum_j w_{ij} x_j$. In Bernoulli dropout, each input unit is kept with probability $p$, i.e., it is multiplied by a mask $\delta_j \sim \mathrm{Bernoulli}(p)$. The $i$-th output and its expectation are

$y_i = \sum_j w_{ij} \delta_j x_j, \qquad E[y_i] = p \sum_j w_{ij} x_j.$

In our uniform dropout, each input unit is multiplied by a mask $a_j \sim U(0,1)$, so it is kept with an expected proportion of $0.5$. The output becomes

$y_i = \sum_j w_{ij} a_j x_j, \qquad E[y_i] = \frac{1}{2} \sum_j w_{ij} x_j.$

When Gaussian dropout is applied, each input unit is multiplied by a mask $a_j \sim N(0.5, \sigma^2)$, so that

$y_i = \sum_j w_{ij} a_j x_j, \qquad E[y_i] = \frac{1}{2} \sum_j w_{ij} x_j.$

Therefore, with $p = 0.5$, the three dropout methods achieve the same expected output.

Because dropout is applied to the input units independently, the variance and covariance of the output units are

$\mathrm{Var}(y_i) = v \sum_j w_{ij}^2 x_j^2, \qquad \mathrm{Cov}(y_i, y_k) = v \sum_j w_{ij} w_{kj} x_j^2,$

where $v$ is the variance of the dropout mask: $v = p(1-p) = 0.25$ for Bernoulli dropout, $v = 1/12$ for uniform dropout, and $v = \sigma^2$ for Gaussian dropout.

The aim of dropout is to avoid the co-adaptation of feature detectors, which is reflected by the covariance between output units. Generally, networks with lower covariance between feature detectors tend to generate more independent subnetworks and therefore tend to work better during the test stage. Comparing the covariance of the output units under the three dropout algorithms, we can see that uniform dropout has a lower covariance than Bernoulli dropout, while the covariance of Gaussian dropout is controlled by the parameter $\sigma$. Through extensive experiments, we find that Gaussian dropout with a suitable $\sigma$ works the best among the three dropout algorithms. This phenomenon implies that there is a balance between the diversity of subnetworks (larger variance of the output of hidden units) and their independence (lower covariance between units in the same layer). Bernoulli dropout achieves the highest variance, but its covariance is also the highest. In contrast, uniform dropout achieves the lowest covariance, but its variance is also the lowest. Gaussian dropout with a suitable $\sigma$ achieves the best balance between variance and covariance, ensuring a good generalization capability.
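This balance can be checked numerically. The sketch below draws many masks for a toy two-output linear layer and compares the empirical covariance of the two outputs with the expression above for the three mask distributions (the sizes, sigma = 0.25, and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(2)
n_in, n_samples, sigma = 32, 100000, 0.25
x = rng.normal(size=n_in)            # one fixed input
W = rng.normal(size=(2, n_in))       # two output units y_1 and y_2

def empirical_cov(mask_sampler):
    # Draw many masks, compute y = W (m * x), and estimate Cov(y_1, y_2).
    m = mask_sampler((n_samples, n_in))
    y = (m * x) @ W.T                # shape (n_samples, 2)
    return np.cov(y.T)[0, 1]

samplers = {
    "Bernoulli p=0.5": (lambda s: rng.binomial(1, 0.5, size=s).astype(float), 0.25),
    "uniform U(0,1)": (lambda s: rng.uniform(0.0, 1.0, size=s), 1.0 / 12.0),
    "Gaussian N(0.5, sigma^2)": (lambda s: rng.normal(0.5, sigma, size=s), sigma**2),
}
for name, (sampler, mask_var) in samplers.items():
    analytic = mask_var * np.sum(W[0] * W[1] * x**2)
    print(name, "empirical:", empirical_cov(sampler), "analytic:", analytic)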

III-A2 Continuous dropout approximation for non-linear unit

For the non-linear unit, we consider the case where the output of a single unit with total linear input $S = \sum_j w_j r_j x_j$ (with $r_j$ the dropout mask of input $x_j$) is given by the logistic sigmoid function

$O = \sigma(S) = \frac{1}{1 + e^{-S}}.$ (1)

For uniform dropout, $r_j \sim U(0,1)$ i.i.d. We have $E[r_j] = 1/2$ and $\mathrm{Var}(r_j) = 1/12$. Because the summands $w_j r_j x_j$ are independent, according to Corollary 2.7.1 of Lyapunov's central limit theorem [19], $S$ tends to a normal distribution as the number of input units grows. It yields that

$S \sim N\!\left(\frac{1}{2} \sum_j w_j x_j,\; \frac{1}{12} \sum_j w_j^2 x_j^2\right).$ (2)

For Gaussian dropout, $r_j \sim N(0.5, \sigma^2)$ i.i.d. We can easily infer that $S \sim N(\mu_S, \sigma_S^2)$, where $\mu_S = \frac{1}{2} \sum_j w_j x_j$ and $\sigma_S^2 = \sigma^2 \sum_j w_j^2 x_j^2$.

Thus, for both uniform dropout and Gaussian dropout, $S$ is (at least approximately) subject to a normal distribution. In the following sections, we only derive the statistical property of Gaussian dropout because it is the same for uniform dropout.
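A small Monte Carlo check of this normality claim, assuming uniform masks $U(0,1)$ and comparing the empirical moments of $S = \sum_j w_j r_j x_j$ with the mean and variance derived above (the toy sizes and names are illustrative):

import numpy as np

rng = np.random.default_rng(3)
n = 256                                     # number of input units
w = rng.normal(size=n)
x = rng.normal(size=n)

r = rng.uniform(0.0, 1.0, size=(20000, n))  # uniform dropout masks r_j ~ U(0,1)
S = (r * (w * x)).sum(axis=1)               # pre-activation S = sum_j w_j r_j x_j

mu = 0.5 * np.sum(w * x)                    # E[S]
var = (1.0 / 12.0) * np.sum((w * x) ** 2)   # Var(S)
z = (S - mu) / np.sqrt(var)
print("empirical mean/var:", S.mean(), S.var(), "analytic:", mu, var)
print("skewness and excess kurtosis (both near 0 for a normal law):",
      (z**3).mean(), (z**4).mean() - 3.0)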

The expected output is

$E[O] = E[\sigma(S)]$ with $S \sim N(\mu_S, \sigma_S^2)$, which depends on both $\mu_S$ and $\sigma_S^2$. (3)

This means that for Gaussian dropout we have a recursion in which the expected output of each layer is propagated to the next layer together with the variance term $\sigma_S^2 = \sigma^2 \sum_j w_j^2 x_j^2$. (4)

For Bernoulli dropout, the corresponding propagation involves only the mean of the input [2]:

$E[O] \approx \sigma\!\left(p \sum_j w_j x_j\right).$ (5)

In Bernoulli dropout, the expected output is only the propagation of deterministic variables through the entire network, whereas our continuous dropout has a regularization term involving $\sigma_S^2 = \sigma^2 \sum_j w_j^2 x_j^2$. Thus, continuous dropout can regularize complex weights and inputs during forward propagation.
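The sketch below illustrates this forward-propagation effect numerically. It estimates $E[\sigma(S)]$ under Gaussian dropout by Monte Carlo and compares it with a common closed-form approximation for the expectation of a sigmoid of a Gaussian, $\sigma(\mu_S / \sqrt{1 + \pi \sigma_S^2 / 8})$, used here only as a stand-in for Eq. (3), and with the mean-only propagation of Eq. (5); the sizes and sigma = 0.25 are illustrative:

import numpy as np

rng = np.random.default_rng(4)
n, sigma = 100, 0.25                       # sigma is an illustrative value
w = rng.normal(scale=0.3, size=n)
x = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu_S = 0.5 * np.sum(w * x)                 # mean of the pre-activation S
var_S = sigma**2 * np.sum((w * x) ** 2)    # variance of S under Gaussian dropout

# Monte Carlo estimate of the expected output E[sigmoid(S)].
r = rng.normal(0.5, sigma, size=(50000, n))
mc = sigmoid((r * (w * x)).sum(axis=1)).mean()

# A common closed-form approximation for E[sigmoid(N(mu, var))] (an assumption here).
approx = sigmoid(mu_S / np.sqrt(1.0 + np.pi * var_S / 8.0))

# Mean-only propagation, as in the Bernoulli-dropout recursion.
mean_only = sigmoid(mu_S)
print("Monte Carlo:", mc, "Gaussian approximation:", approx, "mean-only:", mean_only)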

III-B Dynamic Properties of Continuous Dropout

In this section, we investigate the dynamic properties of continuous dropout, i.e., those related to the training procedure and the update of the weights. We again start from the simple case of a single linear unit and then discuss the non-linear case. As proven in the last section, in uniform dropout the total input $S$ also tends to a normal distribution as the number of input units grows. Therefore, we analyze the dynamic properties of Gaussian dropout only.

III-B1 Continuous dropout gradient and adaptive regularization – single linear unit

In the case of a single linear unit trained with dropout with an input $x$, an output $y$, and a target $t$, the error is typically quadratic of the form $E_D = \frac{1}{2}(t - y)^2$, where $y = \sum_j w_j r_j x_j$. In the linear case, the ensemble network is identical to the deterministic network obtained by scaling the connections using the dropout probabilities (here, the mask mean 0.5). For a single output $y_{ENS} = \frac{1}{2}\sum_j w_j x_j$, the ensemble error over all possible subnetworks is defined by

$E_{ENS} = \frac{1}{2}\left(t - y_{ENS}\right)^2.$

The gradient of the ensemble error can be computed by

$\frac{\partial E_{ENS}}{\partial w_i} = -(t - y_{ENS}) \cdot \frac{1}{2} x_i.$ (6)

For Gaussian dropout, $y_D = \sum_j w_j r_j x_j$ with $r_j \sim N(0.5, \sigma^2)$. Here, each $r_j$ is a random variable with a Gaussian distribution. Hence, $y_D$ is a random variable, while $y_{ENS}$ is a deterministic function of the weights and inputs.

For the dropout error $E_D = \frac{1}{2}(t - y_D)^2$, the learning gradients are of the form

$\frac{\partial E_D}{\partial w_i} = -(t - y_D)\, r_i x_i;$

therefore,

$E\!\left[\frac{\partial E_D}{\partial w_i}\right] = -\frac{1}{2} x_i \left(t - \frac{1}{2}\sum_j w_j x_j\right) + \sigma^2 w_i x_i^2$ (7)

and

$E\!\left[\frac{\partial E_D}{\partial w_i}\right] = \frac{\partial E_{ENS}}{\partial w_i} + \sigma^2 w_i x_i^2.$ (8)

Remarkably, the relationship between the expectation of the dropout error and the ensemble error is

$E[E_D] = E_{ENS} + \frac{1}{2}\sigma^2 \sum_j w_j^2 x_j^2.$ (9)

In Bernoulli dropout [2], this relationship is

$E[E_D] = E_{ENS} + \frac{1}{2}p(1-p) \sum_j w_j^2 x_j^2.$ (10)

Generally, a regularization term is a weight decay based on the square of the weights, which ensures that the weights do not become too large and overfit the training data. Bernoulli dropout extends this regularization term by incorporating the square of the input terms and the variance of the dropout variables; however, both the expected output and the weight of the regularization term are determined by the dropout probability $p$, i.e., there is no freedom to adjust the model complexity to reduce overfitting. In contrast, in Gaussian dropout, we have an extra degree of freedom, $\sigma$, with which to balance the network output and the model complexity.
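The error and gradient relationships derived above for a single linear unit can be verified by simulation. The sketch below draws Gaussian masks and compares the Monte Carlo averages of the dropout error and its gradient with the ensemble quantities plus the sigma-scaled regularization terms (the toy sizes and names are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n, sigma, t = 20, 0.25, 1.0
w = rng.normal(size=n)
x = rng.normal(size=n)

r = rng.normal(0.5, sigma, size=(200000, n))   # Gaussian dropout masks
y_drop = (r * (w * x)).sum(axis=1)             # dropout-network output

# Ensemble network: connections scaled by the mask mean 0.5.
y_ens = 0.5 * np.sum(w * x)
E_ens = 0.5 * (t - y_ens) ** 2

# Error relationship: E[E_D] = E_ENS + 0.5 * sigma^2 * sum_j w_j^2 x_j^2.
E_drop = 0.5 * (t - y_drop) ** 2
print("E[E_D]      :", E_drop.mean())
print("E_ENS + reg :", E_ens + 0.5 * sigma**2 * np.sum((w * x) ** 2))

# Gradient relationship for w_0: E[dE_D/dw_0] = dE_ENS/dw_0 + sigma^2 * x_0^2 * w_0.
g_drop = (-(t - y_drop) * r[:, 0] * x[0]).mean()
g_ens = -(t - y_ens) * 0.5 * x[0]
print("E[dE_D/dw_0]:", g_drop, "vs", g_ens + sigma**2 * x[0] ** 2 * w[0])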

III-B2 Continuous dropout gradient and adaptive regularization – single sigmoidal unit

In Gaussian dropout, for a single sigmoidal unit, $O_D = \sigma(S_D)$, where $S_D = \sum_j w_j r_j x_j$ with $r_j \sim N(0.5, \sigma^2)$. Commonly, we use the relative entropy error

$E = -\big(t \log O + (1 - t)\log(1 - O)\big).$ (11)

By the chain rule, $\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial O}\frac{\partial O}{\partial S}\frac{\partial S}{\partial w_i}$, we obtain $\frac{\partial E}{\partial w_i} = (O - t)\frac{\partial S}{\partial w_i}$.

For the ensemble network, $O_{ENS} = \sigma(S_{ENS})$ with $S_{ENS} = \frac{1}{2}\sum_j w_j x_j$. We have

$\frac{\partial E_{ENS}}{\partial S_{ENS}} = O_{ENS} - t.$ (12)

Therefore,

$\frac{\partial E_{ENS}}{\partial w_i} = (O_{ENS} - t)\cdot \frac{1}{2} x_i.$ (13)

For the dropout network,

$S_D = \sum_j w_j r_j x_j, \qquad O_D = \sigma(S_D).$ (14)

Here, the $r_j$ are random variables with Gaussian distributions; thus, $S_D$ and $O_D$ are both random variables. It yields that

(15)

(16)

The gradient of the dropout error is:

Fig. 1: Samples of the benchmark datasets. MNIST and SVHN are digit classification tasks; NORB, CIFAR-10, and ImageNet are object recognition tasks. All of them are formulated as classification problems, which are commonly evaluated by classification accuracy (error).

By approximation, we have

(17)

Note that for Bernoulli dropout [2]

(18)

Bernoulli dropout only provides the magnitude of the regularization term, which is adaptively scaled by the square of the input terms, by the gain of the sigmoidal function, by the variance of the dropout variables, and by the instantaneous derivative of the sigmoidal function; however, this term only tends to produce a simpler model and avoid overfitting. It is of little help in avoiding the co-adaptation of units (feature detectors) in the same layer. In contrast, continuous dropout not only provides the regularization of the squares of the input units, the weights, and the dropout variance individually, but also regularizes the covariance between input units and weights. In other words, in Gaussian dropout, the regularization term penalizes the covariance between weights, dropout variables, and input units; that is, it prevents the co-adaptation of feature detectors in the neural network. Therefore, through this co-adaptation regularization, Gaussian dropout can indeed avoid co-adaptation and overfitting.

IV Experiments

We investigate the performance of our continuous dropout on MNIST [17], CIFAR-10 [14], SVHN [29], NORB [18], and the ImageNet ILSVRC-2012 classification task [22]. Samples and brief descriptions of these datasets are presented in Fig. 1. We compare continuous dropout with the original dropout proposed in [11] (Bernoulli dropout), adaptive dropout [1], and DropConnect [31]. Fast dropout [32] is an approximation of Bernoulli dropout that accelerates the sampling process; its performance is similar to that of Bernoulli dropout. As the evaluation metric, the classification error, defined as the ratio of misclassified samples to all samples (0/1 loss), is applied. We use the publicly available THEANO library [4] to implement the feedforward neural networks that consist of fully connected layers only, and the networks that contain Convolutional Neural Networks (CNNs) are implemented based on Caffe [13]. In all experiments, the dropout rate in Bernoulli dropout and DropConnect is set to 0.5 because this is the most commonly used configuration in dropout and performs the best. All other parameters are selected based on performance on the validation set. To ensure that all three dropout algorithms achieve the same expected output, for uniform dropout the mask variables are subject to $U(0,1)$; in Gaussian dropout, the mask mean is 0.5, and $\sigma$ is selected on the validation set. For adaptive dropout, its two parameters are likewise selected on the validation set. To avoid divergence during propagation, we clip the Gaussian dropout variable to lie in $[0,1]$, i.e., a sampled mask value is set to 0 if it is negative and to 1 if it exceeds 1.
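A minimal sketch of this clipped Gaussian mask, with sigma = 0.25 chosen only for illustration (the helper name is not from the released code):

import numpy as np

rng = np.random.default_rng(6)

def gaussian_dropout_mask(shape, sigma=0.25):
    # Sample a Gaussian dropout mask and clip it to [0, 1] to avoid divergence.
    m = rng.normal(0.5, sigma, size=shape)
    return np.clip(m, 0.0, 1.0)

mask = gaussian_dropout_mask((4, 8), sigma=0.25)
print(mask.min() >= 0.0, mask.max() <= 1.0)

raw = rng.normal(0.5, 0.25, size=100000)        # unclipped samples, for reference
print("fraction of samples that get clipped:", np.mean((raw < 0.0) | (raw > 1.0)))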

To verify whether the performance gain is statistically significant, we repeated all experiments multiple times for all methods and report the mean error and standard deviation. A larger number of repetitions is used for MNIST, CIFAR-10, SVHN, and NORB, and a smaller number for ImageNet ILSVRC-2012 because of the high computational cost on this dataset. In each independent run, we randomly initialized the weights of the network and then applied the different dropout algorithms to train this network. In other words, within one independent run, all dropout algorithms share the same weight initialization, while across different runs the network is randomly initialized again, so the initialized weights differ from run to run. In this way, we obtained groups of paired results and then conducted a paired t-test and a paired Wilcoxon signed rank test between Gaussian dropout and all other baseline methods. Their p-values are reported.

Method Error (p-value-T/p-value-W)
Sigmoid ReLU
No dropout 1.58 ± 0.045 (6.5e-26/9.1e-7) 1.15 ± 0.036 (1.7e-19/9.1e-7)
Bernoulli dropout 1.35 ± 0.049 (3.1e-17/9.1e-7) 1.06 ± 0.037 (3.6e-17/9.1e-7)
Adaptive dropout 1.30 ± 0.072 (7.4e-11/1.1e-6) 1.02 ± 0.027 (1.9e-11/1.2e-6)
DropConnect 1.37 ± 0.058 (8.7e-19/9.1e-7) 1.01 ± 0.052 (8.5e-6/6.0e-5)
Uniform dropout 1.21 ± 0.046 (9.0e-7/1.2e-5) 0.96 ± 0.039 (0.031/0.027)
Gaussian dropout 1.15 ± 0.035 0.95 ± 0.028
TABLE I: Performance comparison on MNIST (mean error and standard deviation). No data augmentation is used. Architecture: two fully connected layers. Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.
Fig. 2: (a) and (b) show the testing errors vs. epochs of Bernoulli, uniform, and Gaussian dropout on MNIST. No data augmentation is used in this experiment. By regularizing the covariance between neurons in the same layer, the capacity of the neural network is improved.

IV-A Experiments on MNIST

We first verify the effectiveness of our continuous dropout on MNIST. The MNIST handwritten digit dataset consists of 60,000 training images and 10,000 test images, each 28×28 pixels in size. We randomly separate the training images into two parts: 50,000 for training and 10,000 for validation. We replicate the results of dropout in [11] and use the same settings for uniform dropout and Gaussian dropout. These settings include a linear momentum schedule, a constant weight constraint, and an exponentially decaying learning rate. More details can be found in [11].

We train models with two fully connected layers using sigmoid or ReLU activation functions. Table I shows the performance when the raw image pixels are taken as the input and no data augmentation is utilized. From this table, we can see that both uniform dropout and Gaussian dropout outperform Bernoulli dropout, adaptive dropout, and DropConnect on this dataset, irrespective of whether sigmoid or ReLU is applied. Gaussian dropout achieves slightly better performance than uniform dropout. To further analyze the effects of continuous dropout, Fig. 2 shows the testing errors vs. epochs of Bernoulli dropout, uniform dropout, and Gaussian dropout. We can see that continuous dropout achieves a considerably lower testing error than Bernoulli dropout, which demonstrates that continuous dropout has a better generalization capability.

Fig. 3: Performance curves of Gaussian dropout w.r.t. different variances. Our Gaussian dropout consistently outperforms Bernoulli dropout for all sigma values, which shows that the performance gain of Gaussian dropout mainly comes from the distribution, not the extra freedom of $\sigma$.
Fig. 4: Log histogram of the covariance between pairs of units from the same layer. Left: first layer; right: second layer. In continuous dropout, the distribution is more concentrated around 0, which indicates that continuous dropout performs better than Bernoulli dropout in preventing the co-adaptation of feature detectors. (MNIST, ReLU)
Method Architecture Act Function Error(%) (p-value-T/p-value-W)
No dropout 2CNN+1FC ReLU 0.674 ± 0.047 (1.8e-15/9.1e-7)
Bernoulli dropout 2CNN+1FC ReLU 0.551 ± 0.017 (3.3e-5/1.5e-4)
Adaptive dropout 2CNN+1FC ReLU 0.591 ± 0.017 (8.6e-18/9.1e-7)
DropConnect 2CNN+1FC ReLU 0.581 ± 0.012 (4.2e-18/9.1e-7)
Uniform dropout 2CNN+1FC ReLU 0.549 ± 0.021 (2.9e-3/4.5e-3)
Gaussian dropout 2CNN+1FC ReLU 0.534 ± 0.006
TABLE II: Performance comparison on MNIST with Gaussian initialization (mean error and standard deviation). No data augmentation is used. Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.
Method Architecture Act Function Error(%) (p-value-T/p-value-W)
No dropout 2CNN+1FC ReLU 0.670 ± 0.051 (3.0e-16/9.1e-7)
Bernoulli dropout 2CNN+1FC ReLU 0.558 ± 0.018 (4.4e-8/6.5e-6)
Adaptive dropout 2CNN+1FC ReLU 0.586 ± 0.016 (3.2e-13/1.0e-6)
DropConnect 2CNN+1FC ReLU 0.579 ± 0.011 (4.6e-16/9.1e-7)
Uniform dropout 2CNN+1FC ReLU 0.558 ± 0.018 (1.3e-11/1.4e-6)
Gaussian dropout 2CNN+1FC ReLU 0.521 ± 0.017
TABLE III: Performance comparison on MNIST with uniform initialization (mean error and standard deviation). No data augmentation is used. Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.

Influence of the variance in Gaussian dropout. In Section III, we found that $\sigma$ gives us an extra degree of freedom with which to balance the network output and the model complexity. To investigate the influence of $\sigma$ on model performance in Gaussian dropout, we train Gaussian dropout models whose masks are sampled from a Gaussian distribution with mean 0.5 and a range of variances. Activation functions are set to sigmoid or ReLU. The performance of Gaussian dropout with different standard deviations is shown in Fig. 3, from which the best-performing variance can be read off. For normal distributions, the values within two standard deviations of the mean account for about 95% of the set, and the values within three standard deviations for about 99.7%. Thus, with mean 0.5 and a moderate $\sigma$, almost all mask values lie in $[0,1]$, a reasonable range for dropout mask variables. Most importantly, our Gaussian dropout consistently outperforms Bernoulli dropout for all sigma values, which demonstrates that the performance gain of Gaussian dropout mainly comes from the distribution, not the extra freedom of $\sigma$.

Covariance of hidden units. In Section III, we demonstrated that continuous dropout can prevent the co-adaptation of feature detectors. To verify this property, we investigate the distribution of the covariance between units in the same layer. We construct histograms of the covariance of all pairs of units in the same layer in a trained MNIST model with ReLU. Figure 4 shows the log of the number of pairs whose covariance falls into different intervals. Histograms are obtained by taking all the unit pairs in each layer and aggregating the results over random input samples; for each sample, the dropout process is repeated many times to estimate the covariance. Figure 4 shows that in continuous dropout the distribution is more concentrated around 0, which indicates that continuous dropout performs better than Bernoulli dropout in preventing the co-adaptation of feature detectors. Furthermore, comparing Fig. 4(a) and Fig. 4(b), we can see that in "No dropout" the covariance in the second layer is much more concentrated around 0 than in the first layer. With continuous dropout, the covariance curve becomes more concentrated than "No dropout" in both layers. The reason why the effect of continuous dropout becomes less significant in a higher layer is that the room for improvement (reducing covariance) is smaller there.
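The covariance estimate itself can be sketched as follows for a toy one-layer ReLU model: repeat the dropout forward pass for one input, then take the off-diagonal entries of the covariance of the hidden activations over the repeats (the layer sizes, sigma = 0.25, and number of repeats are illustrative, not the trained MNIST model):

import numpy as np

rng = np.random.default_rng(7)
n_in, n_hid, n_repeat = 50, 30, 2000
W = rng.normal(scale=0.1, size=(n_in, n_hid))
x = rng.normal(size=n_in)                       # a single input sample

def hidden_activations(mask_kind):
    # Repeat the dropout forward pass many times for the same input.
    if mask_kind == "bernoulli":
        m = rng.binomial(1, 0.5, size=(n_repeat, n_in)).astype(float)
    else:  # clipped Gaussian masks
        m = np.clip(rng.normal(0.5, 0.25, size=(n_repeat, n_in)), 0.0, 1.0)
    return np.maximum((m * x) @ W, 0.0)          # ReLU hidden units

for kind in ("bernoulli", "gaussian"):
    h = hidden_activations(kind)
    cov = np.cov(h.T)                            # covariance over dropout repeats
    off_diag = cov[~np.eye(n_hid, dtype=bool)]
    print(kind, "mean |covariance| between unit pairs:", np.abs(off_diag).mean())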

To further improve the classification results, we also apply a more powerful network, which consists of a two-layer CNN followed by a fully connected layer with ReLU units (the 2CNN+1FC architecture in Tables II and III). All the dropout algorithms are applied on the fully connected layer. We start from a fixed initial learning rate and manually decay it by a constant multiplier when the validation loss reaches a plateau. The input is again the original image pixels without cropping, rotation, or scaling. To verify whether the improvement of continuous dropout benefits from a favoured initialization, we initialize the weights using both a Gaussian distribution and the uniform initialization proposed by Glorot and Bengio [9]. The experimental results are summarized in Table II and Table III. We can see that Gaussian dropout consistently performs the best among all dropout methods, no matter which initialization distribution is applied. Paired t-tests and paired Wilcoxon signed rank tests are conducted between Gaussian dropout and the other methods. Both tables show that Gaussian dropout achieves a statistically significant improvement over all baseline methods, with p-values less than 0.05.

IV-B Experiments on CIFAR-10

The CIFAR-10 dataset consists of 10 classes of 32×32 RGB images, with 50,000 for training and 10,000 for testing. We preprocess the data by global contrast normalization and ZCA whitening as in [10]. To produce results comparable to the state of the art, we apply all the dropout algorithms to the Network In Network (NIN) model [20]. This network consists of convolutional layers, some of which are followed by pooling layers, and two dropout layers applied after pooling layers. To compare continuous dropout with adaptive dropout and DropConnect, we slightly change this model by omitting the two dropout layers between the CNNs and replacing the last pooling layer with two fully connected layers. Dropout is applied to the first fully connected layer. During training, we first initialize our model with the weights trained in [20] and then finetune the model using the different dropout methods. The learning rate is decayed by a constant factor at fixed intervals of iterations, without any data augmentation.

Method Error(%) (p-value-T/p-value-W)
No dropout 10.65 ± 0.114 (1.5e-14/9.1e-7)
Bernoulli dropout 10.55 ± 0.050 (5.5e-15/9.1e-7)
Adaptive dropout 10.46 ± 0.081 (1.6e-10/9.1e-7)
DropConnect 10.40 ± 0.178 (2.9e-7/7.8e-6)
Uniform dropout 10.47 ± 0.142 (5.2e-10/2.0e-6)
Gaussian dropout 10.18 ± 0.129
TABLE IV: Performance comparison on CIFAR-10 (mean error and standard deviation). Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.

The models are tested after finetuning, and the results are presented in Table IV. We can see that Gaussian dropout again achieves the best performance among all dropout algorithms on this task. Based on the results of the paired t-test and paired Wilcoxon signed rank test, Gaussian dropout significantly outperforms all other methods (p-values are less than 0.05). To further investigate performance on each class, confusion matrices are also reported in Fig. 5. Gaussian dropout achieves the best performance on five classes among all six methods; specifically, it achieves higher classification accuracy than No dropout, Bernoulli dropout, Adaptive dropout, DropConnect, and Uniform dropout on 10, 8, 8, 8, and 7 classes, respectively.

Method Error(%) (p-value-T/p-value-W)
No dropout 2.09 ± 0.002 (1.8e-35/9.1e-7)
Bernoulli dropout 2.00 ± 0.012 (4.2e-19/9.1e-7)
Adaptive dropout 2.13 ± 0.235 (3.5e-5/1.3e-4)
DropConnect 1.96 ± 0.007 (8.4e-16/9.1e-7)
Uniform dropout 1.92 ± 0.012 (0.982/0.981)
Gaussian dropout 1.93 ± 0.012
TABLE V: Performance comparison on SVHN (mean error and standard deviation). Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.
Fig. 5: The confusion matrices of all six methods: (a) no dropout; (b) Bernoulli dropout; (c) Adaptive dropout; (d) DropConnect; (e) Uniform dropout; (f) Gaussian dropout. Gaussian dropout achieves the best performance on five classes among all six methods; specifically, it achieves higher classification accuracy than No dropout, Bernoulli dropout, Adaptive dropout, DropConnect, and Uniform dropout on 10, 8, 8, 8, and 7 classes, respectively.

IV-C Experiments on SVHN

The Street View House Numbers (SVHN) dataset includes 604,388 training images (both the training set and the extra set) and 26,032 testing images [29]. Like MNIST, the goal is to classify the digit centered in each 32×32 image. The dataset is augmented by: 1) randomly selecting a region from the original image; 2) introducing scaling and rotation variations; and 3) randomly flipping images during training. Following [31], we preprocess the images using local contrast normalization as in [34].

Method Error(%) (p-value-T/p-value-W)
No dropout 3.55 ± 0.070 (1.4e-16/9.1e-7)
Bernoulli dropout 3.33 ± 0.128 (1.6e-5/9.9e-5)
Adaptive dropout 3.49 ± 0.204 (1.4e-8/2.0e-6)
DropConnect 3.53 ± 0.059 (1.2e-15/9.1e-7)
Uniform dropout 3.29 ± 0.197 (1.2e-3/2.4e-3)
Gaussian dropout 3.15 ± 0.114
TABLE VI: Performance comparison on NORB (mean error and standard deviation). Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.
Method ConvNet config. smallest image side top-5 error(%) (p-value-T/p-value-W) top-1 error(%) (p-value-T/p-value-W)
train(S) test(Q)
Bernoulli Dropout [24] VGG_ILSVRC_16_layers 256 256 8.86 ± 0.042 (9.5e-11/9.8e-4) 26.99 ± 0.065 (7.4e-11/9.8e-4)
Adaptive Dropout 8.41 ± 0.061 (1.1e-6/9.8e-4) 26.27 ± 0.046 (2.8e-8/9.8e-4)
DropConnect 8.56 ± 0.037 (2.3e-8/9.8e-4) 26.82 ± 0.050 (6.2e-11/9.8e-4)
Uniform Dropout 8.08 ± 0.048 (0.017/0.024) 25.91 ± 0.046 (0.005/0.014)
Gaussian Dropout 7.99 ± 0.065 25.79 ± 0.045
TABLE VII: Performance comparison on ImageNet ILSVRC-2012 (mean top-5/top-1 error and standard deviation). Paired t-test and paired Wilcoxon signed rank test are conducted between Gaussian dropout and all other baseline methods. Their p-values are reported: p-value-T for the t-test and p-value-W for the Wilcoxon signed rank test.

The model consists of 2 convolutional layers and 2 locally connected layers as described in [15] (layers-conv-local-11pct.cfg). A fully connected layer with ReLU activations is added between the final locally connected layer and the softmax layer. We manually decrease the learning rate when performance on the validation set plateaus [15]; in detail, we multiply the learning rate by a constant factor, repeatedly. The bias learning rate is set proportionally to the learning rate for the weights. Additionally, weights are initialized with small random values for both the fully connected and the convolutional layers. To further improve performance, we train several independent networks with random permutations of the training sequence and different random seeds, and we report the classification error obtained by averaging the output probabilities of these networks before making a prediction.
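A minimal sketch of this output-probability averaging, assuming each trained network provides per-class probabilities for the test set (the data here are random placeholders):

import numpy as np

def ensemble_error(prob_list, labels):
    # Average the class-probability outputs of several networks, then predict.
    avg_prob = np.mean(np.stack(prob_list, axis=0), axis=0)   # (n_samples, n_classes)
    predictions = avg_prob.argmax(axis=1)
    return np.mean(predictions != labels)

# Toy usage: three "networks" producing random class probabilities.
rng = np.random.default_rng(8)
labels = rng.integers(0, 10, size=100)
probs = [rng.dirichlet(np.ones(10), size=100) for _ in range(3)]
print("ensemble error on random outputs:", ensemble_error(probs, labels))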

The experimental results on this dataset are summarized in Table V. Comparing the mean classification errors, standard deviations, and p-values of the paired t-test and paired Wilcoxon signed rank test, we can see that our proposed continuous dropout achieves better performance than No dropout, Bernoulli dropout, Adaptive dropout, and DropConnect. The performance gain of Gaussian dropout is statistically significant (all p-values are less than 0.05). Uniform dropout and Gaussian dropout achieve similar performance on this dataset. Besides, all dropout methods achieve stable performance with small standard deviations, except Adaptive dropout, which has a large standard deviation (0.235).

IV-D Experiments on NORB

In this experiment, we evaluate our models on the 2-fold NORB (jittered-cluttered) dataset [18]. Each image is classified into one of six classes and appears on a random background. Images are downsampled as in [7]. We train on the 2 folds of 29,160 images each and test on a total of 58,320 images. We use the same architecture as for SVHN. The dataset is augmented by rotation and scaling; no random crop or flip is applied. Models are trained with a fixed initial learning rate; other training and testing settings are the same as for SVHN.

The experimental results are given in Table VI. From this table, we can see that Gaussian dropout significantly outperforms No dropout, Adaptive dropout, DropConnect, and Uniform dropout on this dataset. Compared with the results on SVHN, all methods have a larger standard deviation on NORB, even though the experiments on these two datasets adopt the same network architecture and other experimental settings. The reason for the higher standard deviation on NORB may be that we have much fewer training images, so the models trained on NORB are not as stable as those trained on SVHN.

IV-E Experiments on ILSVRC-2012

The ILSVRC-2012 dataset was used for the ILSVRC challenges. This dataset includes images of 1,000 classes and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and the top-5 error. The former is a multi-class classification error; the latter is the main evaluation criterion used in ILSVRC and is computed as the proportion of images whose ground-truth category is outside the top-5 predicted categories.

We compare all the dropout algorithms by finetuning the 16-layer model proposed by the VGG team (configuration D) in [24]. The model consists of 13 convolution layers and 3 fully connected layers. All the filters used in the convolution layers have 3×3 receptive fields, and the number of channels increases from 64 in the first layers to 512 in the deepest ones. The convolution stride is fixed to 1 pixel, and the spatial padding of the convolution layer input is 1 pixel to preserve the spatial resolution of the input. The convolutional layers are followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, and the third contains 1000 channels to perform the 1000-way ILSVRC classification. All hidden layers are equipped with the rectification non-linearity (ReLU [16]), and Bernoulli dropout is imposed on the first two FC layers. In our experiment, the two FC layers with Bernoulli dropout are replaced by Adaptive dropout FC layers, DropConnect FC layers, and FC layers with Uniform dropout and Gaussian dropout, respectively.

During training, the weights are first initialized from the VGG_ILSVRC_16_layers model (http://www.robots.ox.ac.uk/~vgg/research/very_deep/) in [24] and then finetuned. The input to the ConvNet is a fixed-size 224×224 RGB image, zero-centered by subtracting the mean BGR values. The batch size, momentum, gradient clipping coefficient, and weight decay are fixed throughout finetuning. For Adaptive dropout, the parameters alpha and beta are set to fixed values; the dropout ratio in DropConnect is 0.5. In Uniform dropout, the mask is sampled from $U(0,1)$, while the Gaussian dropout mask is sampled from $N(0.5,\sigma^2)$. The learning rate is initially fixed and then decreased by a constant factor during finetuning. Following [24], the smallest side (denoted as S) of the training images is isotropically rescaled to 256.

Method MNIST CIFAR-10 SVHN NORB ILSVRC-2012 Average rank
Bernoulli Dropout 3 5 4 3 5 4
Adaptive Dropout 5 3 5 4 3 4
DropConnect 4 2 3 5 4 3.6
Uniform Dropout 2 3 1 2 2 2
Gaussian Dropout 1 1 2 1 1 1.2
TABLE VIII: Performance rank of different dropout methods on all five datasets.

During testing, the test images are isotropically rescaled to a smallest image side denoted as Q. Then, the fully connected layers are converted to convolutional layers (the first FC layer to a 7×7 conv. layer, the last two FC layers to 1×1 conv. layers). The resulting fully convolutional net is applied to the whole (uncropped) image, which yields a class score map with the number of channels equal to the number of classes. The class score map is then spatially averaged (sum-pooled). The test set is also augmented by horizontal flipping of the images; finally, the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image, as in [24].
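A simplified numpy sketch of this dense-evaluation step: an FC weight matrix is reshaped into a convolution kernel, slid over an uncropped feature map, and the resulting class score map is spatially averaged (the channel counts, window size, and flattening order are illustrative and do not match the exact VGG configuration):

import numpy as np

def fc_to_conv(W_fc, in_channels, k):
    # Reshape an FC weight matrix (in_features x out_features) into a conv kernel.
    out_features = W_fc.shape[1]
    return W_fc.T.reshape(out_features, in_channels, k, k)    # (out, in, k, k)

def dense_scores(feature_map, kernels):
    # Slide the converted kernel over the whole feature map ("valid" convolution),
    # producing a spatial class-score map, then average it over space.
    C, H, W = feature_map.shape
    out, _, k, _ = kernels.shape
    score_map = np.zeros((out, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = feature_map[:, i:i + k, j:j + k]
            score_map[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return score_map.mean(axis=(1, 2))                        # spatially averaged scores

rng = np.random.default_rng(9)
W_fc = rng.normal(size=(8 * 7 * 7, 10))     # toy FC layer: 8 channels, 7x7 window, 10 classes
kernels = fc_to_conv(W_fc, in_channels=8, k=7)
features = rng.normal(size=(8, 9, 9))       # toy uncropped feature map, larger than 7x7
print(dense_scores(features, kernels))      # one averaged score per class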

The performance of all the dropout algorithms is shown in Table VII. This table shows that continuous dropout can improve on conventional dropout algorithms even on a very large-scale dataset. All the p-values are far less than 0.05, which indicates that Gaussian dropout achieves a statistically significant performance gain over the other methods on this dataset.

To summarize the overall performance of different dropout methods, we rank all five dropout methods according to their performance on each of the five datasets, as shown in Table VIII. We can see that Gaussian dropout is ranked first on four datasets and ranked second on one dataset.

V Conclusion

In this paper, we have introduced a new explanation for the dropout algorithm from the perspective of the properties of the neural networks in the human brain: the firing rates of neurons under different situations are random and continuous. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. Thorough theoretical analyses and extensive experiments demonstrate that our continuous dropout has the advantage of reducing co-adaptation while maintaining variance, and that continuous dropout is equivalent to involving a regularizer that prevents co-adaptation between feature detectors.

In the future, we plan to further explore continuous dropout from the following two aspects. First, although we have shown that continuous dropout penalizes the covariance between neurons, the corresponding regularization term is not explicitly defined; we will try to propose a more direct and interpretable form for this regularization term. Second, dropout is naturally viewed as a mixture of different models. From this point of view, we plan to derive an error bound for this kind of mixture, leading to a more solid theoretical analysis of continuous dropout.

References

  • [1] J. Ba and B. Frey (2013) Adaptive dropout for training deep neural networks. In NIPS, pp. 3084–3092. Cited by: §I, §IV.
  • [2] P. Baldi and P. Sadowski (2014) The dropout learning algorithm. Artificial Intelligence 210, pp. 78–122. Cited by: §I, §III-A2, §III-B1, §III-B2, §III.
  • [3] J. M. Bekkers et al. (1990) Origin of variability in quantal size in cultured hippocampal neurons and hippocampal slices. In Proceedings of the National Academy of Sciences of the United States of America, pp. 5359–5362. Cited by: §I.
  • [4] J. Bergstra, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio (2010) Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference, Vol. 4, pp. 3. Cited by: §IV.
  • [5] G. Buzsáki and K. Mizuseki (2014) The log-dynamic brain: how skewed distributions affect network operations. Nature Reviews Neuroscience 15 (4), pp. 264–278. Cited by: §I.
  • [6] J. Chorowski and J. M. Zurada (2015) Learning understandable neural networks with nonnegative weight constraints. IEEE Transactions on Neural Networks and Learning Systems 26 (1), pp. 62–69. Cited by: §I.
  • [7] D. Ciresan, U. Meier, and J. Schmidhuber (2012) Multi-column deep neural networks for image classification. CVPR. Cited by: §IV-D.
  • [8] P. Fatt and B. Katz (1952) Spontaneous subthreshold activity at motor nerve endings. The Journal of Physiology 117 (1), pp. 109–128. Cited by: §I.
  • [9] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Cited by: §IV-A.
  • [10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. arXiv:1302.4389. Cited by: §IV-B.
  • [11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Cited by: §I, §I, §I, §I, §II, §II, §III, §IV-A, §IV.
  • [12] W. Hou, X. Gao, D. Tao, and X. Li (2015) Blind image quality assessment via deep learning. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
  • [13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In ACM MM, pp. 675–678. Cited by: §IV.
  • [14] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep 1 (4), pp. 7. Cited by: §IV.
  • [15] A. Krizhevsky (2012) Cuda-convnet. In http://code.google.com/p/cuda-convnet/, Cited by: §IV-C.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. NIPS, pp. 1097–1105. Cited by: §I, §IV-E.
  • [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §IV.
  • [18] Y. LeCun, F. J. Huang, and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. CVPR. Cited by: §IV-D, §IV.
  • [19] E. L. Lehmann (1999) Elements of large-sample theory. Springer Science & Business Media. Cited by: §II, §III-A2.
  • [20] M. Lin, Q. Chen, and S. Yan (2014) Network in network. arXiv:1312.4400. Cited by: §IV-B.
  • [21] A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman (2010) Sex, mixability, and modularity. Proceedings of the National Academy of Sciences of the United States of America 107 (4), pp. 1452–1457. Cited by: §II.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §IV.
  • [23] L. Shao, D. Wu, and X. Li (2014) Learning deep and wide: a spectral method for learning deep networks. IEEE Transactions on Neural Networks and Learning Systems 25 (12), pp. 2303–2308. Cited by: §I.
  • [24] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §IV-E, §IV-E, §IV-E, TABLE VII.
  • [25] D. J. Spiegelhalter and S. L. Lauritzen (1990) Sequential updating of conditional probabilities on directed graphical structures. Networks 20 (5), pp. 579–605. Cited by: §I.
  • [26] N. Srivastava (2013) Improving neural networks with dropout. Master’s Thesis, University of Toronto. Cited by: §III.
  • [27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §I.
  • [28] L. Szymanski and B. McCane (2014) Deep networks are effective encoders of periodicity. IEEE Transactions on Neural Networks and Learning Systems 25 (10), pp. 1816–1827. Cited by: §I.
  • [29] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. ICML. Cited by: §IV-C, §IV.
  • [30] S. Wager, S. Wang, and P. S. Liang (2013) Dropout training as adaptive regularization. In NIPS, pp. 351–359. Cited by: §I.
  • [31] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning, pp. 1058–1066. Cited by: §I, §IV-C, §IV.
  • [32] S. Wang and C. Manning (2013) Fast dropout training. In ICML, pp. 118–126. Cited by: §I, §IV.
  • [33] Y. Yuan, L. Mou, and X. Lu (2015) Scene recognition by manifold regularized deep learning architecture. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
  • [34] M. D. Zeiler and R. Fergus (2013) Stochastic pooling for regularization of deep convolutional neural networks. ICLR. Cited by: §IV-C.