Dropout (Srivastava et al., 2014)
has proven to be effective for preventing overfitting in many deep learning areas, such as image classification (Shrivastava et al., 2017) and speech recognition (Amodei et al., 2016). In the years since, a wide range of variants have been proposed for wider scenarios, and most related work focuses on the improvement of Dropout structures, i.e., how to drop. For example, DropConnect (Wan et al., 2013) drops the weights instead of neurons, and evolutional dropout (Li et al., 2016) drops neurons in the max-pooling kernel so that smaller feature values have some probability of affecting the activations.
These Dropout-like methods process each neuron/channel in one layer independently and introduce randomness by dropping. These architectures are certainly simple and effective. However, randomly dropping independently is not the only method to introduce randomness. Hinton et al. (2012) argues that overfitting can be reduced by preventing co-adaptation between feature detectors. Thus it is helpful to consider other neurons’ information when adding noise to one neuron. For example, lateral inhibition noise could be more effective than independent noise.
In this paper, we propose RotationOut as a regularization method for neural networks. RotationOut regards the neurons in one layer as a vector and introduces noise by randomly rotating the vector. Specifically, consider a fully connected layer with $d$ neurons whose activations form a vector $x \in \mathbb{R}^d$. If applying RotationOut to this layer, the output is $\tilde{x} = Rx$, where $R$ is a random rotation matrix. It rotates the input with random angles and directions, bringing noise to the input. The noise added to one neuron comes not only from itself but also from other neurons. This is the major difference between RotationOut and Dropout-like methods. We further show that RotationOut uses the activations of the other neurons as the noise for one neuron, so that the co-adaptation between neurons can be reduced.
RotationOut uses random rotation matrices instead of unrestricted matrices because the directions of feature vectors are important. Random rotation perturbs the directions directly. Most neural networks use the dot product between the feature vector and a weight vector as the output. The network actually learns the direction of the weights, especially when there is a normalization layer (e.g., Batch Normalization (Ioffe & Szegedy, 2015) or Weight Normalization (Salimans & Kingma, 2016)) after the weight layer. Random rotation of feature vectors introduces noise into the angle between the feature and the weight, making the learning of weight directions more stable. Sabour et al. (2017) also uses the orientation of feature vectors to represent the instantiation parameters in capsules. Another motivation for rotating feature vectors comes from network dissection. Bau et al. (2017) finds that random rotations of a learned representation can destroy its interpretability, which is axis-aligned. Thus randomly rotating the features during training makes the network more robust. Even small rotations can be a strong regularization.
We study how RotationOut helps prevent neural networks from overfitting. Hinton et al. (2012) introduces co-adaptation to interpret Dropout, but little of the literature gives a clear definition of co-adaptation. In this paper, we provide a metric to approximate co-adaptations and derive a general formula for noise analysis. Using the formula, we prove that RotationOut can reduce co-adaptations more effectively than Dropout and show how to combine Dropout and Batch Normalization together.
In our experiments, RotationOut achieves results on par with or better than Dropout and Dropout-like methods across several deep learning tasks. Applying RotationOut after convolutional layers and fully connected layers improves the image classification accuracy of ConvNets on the CIFAR100 and ImageNet datasets. On the COCO dataset, RotationOut also improves the generalization of object detection models. For LSTM models, RotationOut achieves competitive results with existing RNN dropout methods on the speech recognition task on the Wall Street Journal (WSJ) corpus.
The main contributions of this paper are as follows: We propose RotationOut as a regularization method for neural networks, different from existing Dropout-like methods that operate on each neuron independently. RotationOut randomly rotates the feature vector and introduces noise to one neuron using other neurons' information. We present a theoretical analysis method for a general formula of noise. Using the method, we answer two questions: 1) how noise-based regularization methods reduce co-adaptations and 2) how to combine noise-based regularization methods with Batch Normalization. Experiments on vision and language tasks are conducted to show the effectiveness of the proposed RotationOut method.
Dropout is effective for fully connected layers but less effective when applied to convolutional layers. Ghiasi et al. (2018) argues that information about the input can still be sent to the next layer even with dropout, which causes the networks to overfit. SpatialDropout (Tompson et al., 2015) drops entire channels from the feature map. Shake-shake regularization (Gastaldi, 2017) drops the residual branches. Cutout (DeVries & Taylor, 2017) and DropBlock (Ghiasi et al., 2018) drop a contiguous square region from the inputs/feature maps.
Applying standard dropout to recurrent layers also results in poor performance (Zaremba et al., 2014; Labach et al., 2019), since the noise caused by dropout at each time step prevents the network from retaining long-term memory. Gal & Ghahramani (2016); Moon et al. (2015); Merity et al. (2017) generate a dropout mask for each input sequence, and keep it the same at every time step so that memory can be retained.
Batch Normalization (BN) (Ioffe & Szegedy, 2015) accelerates deep network training. It also regularizes the network and discourages the use of strong dropout to prevent overfitting (Ioffe & Szegedy, 2015). Many modern ConvNet architectures such as ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) do not apply dropout in convolutions. Li et al. (2019) is the first to argue that this is caused by a variance shift. In this paper, we use the noise analysis method to further explore this problem.
There is a lot of work studying rotations in networks. Rotations of the images (Lenc & Vedaldi, 2015; Simard et al., 2003) are important data augmentation methods. There are also studies about rotation equivariance. Worrall et al. (2017) uses an enriched feature map explicitly capturing the underlying orientations. Marcos et al. (2017) applies multiple rotated versions of each filter to the input to solve problems requiring different responses with respect to the inputs' rotation. The motivations of these works are different from ours. The most related work is network dissection (Bau et al., 2017), which discusses the impact of random rotations of learned features on interpretability, showing that rotation in training can be a strong regularization.
In this section, we first introduce the formulation of RotationOut. Next, we use linear models to demonstrate how RotationOut helps for regularization. In the last part, we discuss the implementation of RotationOut in neural networks.
2.1 Random Rotation Matrix
A rotation in dimension $d$ is represented by the product between a rotation matrix $R \in \mathbb{R}^{d \times d}$ and the feature vector $x \in \mathbb{R}^d$. The complexity of random rotation matrix generation and of the matrix multiplication are both $O(d^2)$, which would be less efficient than Dropout with its $O(d)$ complexity. We consider a special case that uses Givens rotations (Anderson, 2000) to construct random rotation matrices and reduce the complexity.
Let $d$ be an even number, and let $\sigma$ be a permutation of $\{1, \dots, d\}$. A rotation matrix can be generated by the function $R(\sigma, \theta)$:
Here $R_{ij}$ represents the $(i,j)$-th element of $R(\sigma, \theta)$, where $1 \le i, j \le d$. See Appendix A.1 for some examples of such rotation matrices. Suppose we sample the permutation $\sigma$ from $\Sigma_d$, the set of all permutations of $\{1, \dots, d\}$, with equal probability, and sample the angle $\theta$ from a distribution symmetric around zero. The RotationOut operator can be generated using the function $\tilde{R}(\sigma, \theta) = R(\sigma, \theta)/\cos\theta$:
Here $1/\cos\theta$ is a normalization term, so $\tilde{R}$ is not a rotation matrix strictly speaking. The random operator generated from Equation 2 has some good properties. 1) The noise is zero-centered: $\mathbb{E}[\tilde{R}x] = x$. 2) For any vector $x$ and any random permutation $\sigma$, the angle between $x$ and $R(\sigma, \theta)x$ is determined by the angle $\theta$. 3) For a fixed angle $\theta$, there exist exponentially many different rotations. 4) The complexity of random rotation generation and of the matrix multiplication are both $O(d)$.
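The pairwise construction above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming $\theta$ is drawn symmetrically around zero; the function and variable names are ours, not from the paper:

```python
import numpy as np

def rotation_out(x, theta, rng):
    """Apply the pairwise (Givens-style) RotationOut operator R(sigma, theta)/cos(theta).

    Pairs up the d dimensions with a random permutation sigma and rotates each
    2-D pair by angle theta, then normalizes by cos(theta) so the operator is
    zero-centered when theta is symmetric around 0. Both sampling and applying
    the operator are O(d).
    """
    d = x.shape[0]
    assert d % 2 == 0, "d is assumed even so all dimensions can be paired"
    sigma = rng.permutation(d)
    i, j = sigma[0::2], sigma[1::2]   # randomly paired dimensions
    t = np.tan(theta)
    out = x.copy()
    out[i] = x[i] - x[j] * t          # x_i - x_j * tan(theta)
    out[j] = x[j] + x[i] * t          # x_j + x_i * tan(theta)
    return out
```

Note that `out * cos(theta)` is a pure rotation of `x`, so it preserves the norm, and averaging over $\pm\theta$ recovers `x`, matching the zero-centered property.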
The permutation $\sigma$ draws the rotation direction and the angle $\theta$ draws the rotation angle. As an analogy, the permutation is similar to the dropout mask widely used in RNN dropout: there exist $2^d$ different dropout masks, and similarly the diversity of random rotations in Equation 1 is sufficient for network training. The angle $\theta$ is similar to the percentage of dropped neurons in Dropout, and the distribution of $\theta$ controls the regularization strength. Srivastava et al. (2014) used the multiplier's variance to compare Bernoulli dropout and Gaussian dropout. Following this setting, RotationOut is equivalent to Bernoulli dropout with keep rate $p$ and to Gaussian dropout with variance $\sigma^2$ if $\mathbb{E}[\tan^2\theta] = (1-p)/p = \sigma^2$.
Reviewing the formulation of the random rotation matrix: it arranges all dimensions of the input into pairs randomly and rotates the two-dimensional vector in each pair by angle $\theta$. Suppose $x_i$ and $x_j$ are two dimensions/neurons in one pair; the outputs of $x_i$ and $x_j$ after RotationOut are $\tilde{x}_i = x_i - x_j\tan\theta$ and $\tilde{x}_j = x_j + x_i\tan\theta$.
The noise of $x_i$ comes from $x_j$ and the noise of $x_j$ comes from $x_i$, since $\theta$ is random. Note that the pairs are arranged randomly, so RotationOut uses all other dimensions/neurons as the noise for one dimension/neuron of the feature vector. With RotationOut, the neurons are trained to work more independently, since one neuron has to regard the activations of other neurons as noise. Thus the co-adaptations are reduced.
Consider Gaussian dropout: the outputs are $\tilde{x}_i = x_i(1 + \delta_i)$, where $\delta_i \sim \mathcal{N}(0, \sigma^2)$. The difference between Gaussian dropout and RotationOut is the source of the noise, i.e., the Gaussian dropout noise for one neuron comes from the neuron itself, while the RotationOut noise comes from other neurons.
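The different noise sources can be checked numerically. In this sketch (the concrete values and variable names are illustrative), the RotationOut noise variance on one neuron scales with the other neuron's magnitude, while the Gaussian dropout noise variance scales with the neuron's own magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 0.5          # a fixed 2-D feature pair (x_i, x_j); arbitrary values
sigma = 0.4              # tan(theta) ~ N(0, sigma^2) for RotationOut,
                         # multiplier ~ N(1, sigma^2) for Gaussian dropout
n = 200_000

t = rng.normal(0.0, sigma, n)            # tan(theta) samples
rot_i = a - b * t                        # RotationOut output for x_i
gau_i = a * rng.normal(1.0, sigma, n)    # Gaussian dropout output for x_i

# Both are unbiased, but the noise variance has a different source:
# |x_j|^2 * sigma^2 for RotationOut vs |x_i|^2 * sigma^2 for Gaussian dropout.
print(rot_i.mean(), rot_i.var())   # ~2.0, ~(0.5 * 0.4)^2 = 0.04
print(gau_i.mean(), gau_i.var())   # ~2.0, ~(2.0 * 0.4)^2 = 0.64
```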
2.2 RotationOut in Linear Models
First we consider a simple case of applying RotationOut to the classical problem of linear regression. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be the dataset, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Linear regression tries to find the weight $w$ that minimizes $\sum_i (y_i - w^\top x_i)^2$. When applying RotationOut to each $x_i$, we generate a random operator $\tilde{R}_i$ from Equation 2 for each $x_i$. The objective function becomes:
Denote $X = [x_1, \dots, x_n]^\top$. To compare RotationOut with Dropout with keep rate $p$, we suppose $\mathbb{E}[\tan^2\theta] = (1-p)/p$. Equation 4 reduces to:
Therefore, linear regression with RotationOut or Dropout is equivalent to ridge regression with different regularization terms. Set $p = 0.5$ (a Dropout rate of $0.5$) for simplicity. LR with Dropout doubles the diagonal elements of $X^\top X$ to make the problem numerically stable. LR with RotationOut is closer to ridge regression:
The condition number of Equation 7, i.e., of the LR with RotationOut problem, is upper bounded. For the Dropout case, if some data dimensions have extremely small variances, both $X^\top X$ and its Dropout-regularized counterpart are ill-conditioned: the LR with Dropout problem has an unbounded condition number.
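The conditioning claim is easy to illustrate numerically. In this sketch, the isotropic ridge strength `lam` is our own illustrative choice standing in for the RotationOut regularizer, not the paper's exact term:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
X[:, -1] *= 1e-6                 # one dimension with extremely small variance
G = X.T @ X

lam = np.trace(G) / d                    # illustrative isotropic ridge strength
ridge_like = G + lam * np.eye(d)         # ~ LR with RotationOut (ridge-like)
dropout_like = G + np.diag(np.diag(G))   # LR with Dropout (p = 0.5): doubles the diagonal

# An isotropic term bounds the condition number; doubling a tiny diagonal does not.
print(np.linalg.cond(dropout_like))  # huge: the tiny-variance dimension stays tiny
print(np.linalg.cond(ridge_like))    # modest: every eigenvalue is lifted by lam
```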
Next we consider a $K$-way classification model of logistic regression. The input is $x$ and the weights are $w_1, \dots, w_K$. The probability that the input belongs to the $k$-th category is:
In Equation 8, $\theta_k$ denotes the angle between $x$ and $w_k$. Assuming that the lengths of the weights are very close, the input belongs to the $k$-th category if $x$ is closest to $w_k$ in angle.
Consider a hard-sample case where $\theta_i$ and $\theta_j$ are the two smallest weight-data angles, but $\theta_i$ and $\theta_j$ are very close, i.e., the data point is close to the decision boundary. The model should classify the data correctly but could make mistakes if there is some noise. Applying RotationOut changes the angles between the data and the weights randomly. To classify the data correctly despite the noise, there should be a gap between $\theta_i$ and $\theta_j$. In other words, the decision boundary changes from $\theta_i < \theta_j$ to $\theta_i + m < \theta_j$, where $m$ is a positive constant that depends on the regularization strength. Thus RotationOut can be regarded as margin-based hard-sample mining.
Here we provide an intuitive understanding of how Dropout with low keep rates leads to lower performance. By randomly zeroing units, the Dropout method also rotates the feature vector, and a lower keep rate results in a bigger rotation angle. Consider the last hidden layer in a neural network: it is similar to logistic regression on the features. If one feature is closest to $w_k$ in angle, it belongs to the $k$-th category. Dropout with a lower keep rate rotates the feature by a bigger angle, so the Dropout output can be closest to another weight with higher probability, which may hurt training.
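The implicit rotation caused by dropout can be measured directly. A small sketch (our own construction, not the paper's experiment) computing the average angle between a random feature and its dropped-out version shows the angle growing as the keep rate falls:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2000

def mean_drop_angle(keep):
    """Average angle (radians) between random features and their dropped versions."""
    x = rng.standard_normal((n, d))
    m = rng.random((n, d)) < keep          # Bernoulli dropout mask
    xd = x * m
    cos = (x * xd).sum(1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(xd, axis=1))
    return np.arccos(cos).mean()

# Lower keep rate -> larger implicit rotation angle (roughly arccos(sqrt(keep))).
for keep in (0.9, 0.7, 0.5):
    print(keep, mean_drop_angle(keep))
```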
2.3 RotationOut in Neural Networks
Consider a neural network with $L$ hidden layers. Let $x^{(l)}$, $z^{(l)}$, and $W^{(l)}$ denote the vector of inputs, the vector of outputs before activation, and the weights for layer $l$. Let $\tilde{R}$ be generated from Equation 2 and let $\mu = \mathbb{E}[x^{(l)}]$; the forward pass with RotationOut is $z^{(l)} = W^{(l)}\big(\tilde{R}(x^{(l)} - \mu) + \mu\big)$.
We rotate the zero-centered features and then add the expectation back; the reasons will be explained later. Here we give an intuitive understanding. If the features are not zero-centered, we do not know the exact regularization strength. Suppose all feature elements lie in one interval, say all non-negative as after a ReLU. Then the angle between any two feature vectors is acute, and a rotation angle approaching $\pi/2$ would be too big. It is the same for Dropout: the regularization strength is influenced by the mean value of the features, which we may not know. At test time, the RotationOut operation is removed.
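A minimal sketch of the zero-centered operation for a batch of features. Using the batch mean as the expectation $\mu$, one angle per sample, and a Gaussian angle distribution are our assumptions for this sketch:

```python
import numpy as np

def rotation_out_layer(x, theta_std, rng, training=True):
    """Zero-centered RotationOut for a batch of features x of shape (N, d).

    Rotates x - mu and adds the mean mu back, so the regularization strength
    does not depend on the mean of the features. At test time it is the identity.
    """
    if not training:
        return x
    n, d = x.shape
    mu = x.mean(axis=0, keepdims=True)     # batch mean as an estimate of E[x]
    xc = x - mu
    sigma = rng.permutation(d)
    i, j = sigma[0::2], sigma[1::2]        # shared channel pairing for the batch
    t = np.tan(rng.normal(0.0, theta_std, size=(n, 1)))  # one angle per sample
    out = np.empty_like(x)
    out[:, i] = xc[:, i] - xc[:, j] * t
    out[:, j] = xc[:, j] + xc[:, i] * t
    return out + mu
```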
Consider the 2D case for example. The input to a 2D convolutional layer is three-dimensional, with number of channels $C$, width $W$, and height $H$:
We regard each $x_{:,w,h} \in \mathbb{R}^C$ as a feature vector carrying semantic information at position $(w, h)$, and apply a rotation at each position. As Ghiasi et al. (2018) argued, convolutional feature maps are spatially correlated, so information can still flow through convolutional layers if features are dropped out randomly. Similarly, if we rotate the feature vectors at different positions in random directions, the random directions offset each other and result in no effective rotation. So we rotate all feature vectors with the same directions but different angles. The operation on convolutional feature maps can be:
The operation for general convolutional networks is very similar. Also note that RotationOut can be combined with DropBlock (Ghiasi et al., 2018) easily: only rotate features within a contiguous block. Experiments show that the combination yields an extra performance gain. As mentioned in Section 2.1, the rotation direction defined by $\sigma$ is similar to the dropout mask in RNN dropout methods. RotationOut can also be used in recurrent networks following Equation 11.
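The convolutional variant described above can be sketched as follows: one channel pairing shared by all spatial positions (same directions), but an independent angle per position. The Gaussian angle distribution is an assumption of this sketch:

```python
import numpy as np

def rotation_out_2d(x, theta_std, rng):
    """RotationOut for a conv feature map x of shape (C, H, W).

    All spatial positions share one channel pairing (same rotation directions),
    but each position gets its own rotation angle, so the per-position rotations
    do not cancel out spatially.
    """
    c, h, w = x.shape
    sigma = rng.permutation(c)
    i, j = sigma[0::2], sigma[1::2]                          # shared channel pairs
    t = np.tan(rng.normal(0.0, theta_std, size=(1, h, w)))   # angle per position
    out = x.copy()
    out[i] = x[i] - x[j] * t
    out[j] = x[j] + x[i] * t
    return out
```

Restricting the positions where `t` is nonzero to a contiguous square would give the DropBlock-style combination mentioned above.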
3 Noise Analysis
In this section, we first study the general formula of adding noise. Using the formula, we show how introducing randomness/noise helps reduce co-adaptations and why RotationOut is more efficient than vanilla Dropout. Next, we explain the variance shift between Dropout and Batch Normalization (Li et al., 2019) using the formula and propose some solutions.
3.1 Randomness and Co-adaptation
Strictly speaking, co-adaptations describe the dependence between neurons. The mutual information between two neurons may be the best metric for defining co-adaptations, but computing mutual information requires the exact distributions of the neurons, which are generally unknown. So we use the correlation coefficient to evaluate co-adaptations, which only needs the first and second moments. Moreover, if we assume the distributions of the neurons are Gaussian, the correlation coefficient and mutual information are equivalent for evaluating co-adaptations.
Suppose $x \in \mathbb{R}^d$ is the activation vector of one hidden layer, and let $P$ be its correlation matrix. The ideal situation is $P = I$, i.e., the neurons are mutually independent. We define the co-adaptations as the distance between $P$ and $I$:
Here the denominator is a normalization term that sets the scale of the co-adaptation measure. Let $\tilde{x}$ be the output of $x$ with arbitrary noise (e.g., Dropout or RotationOut). We assume the noise satisfies two assumptions: 1) zero-centered: $\mathbb{E}[\tilde{x} \mid x] = x$; 2) non-trivial: $\mathrm{Var}(\tilde{x}_i \mid x) > 0$ (avoiding that $\tilde{x}$ always equals $x$). By the law of total variance, we have:
Let $\tilde{x}^D$ be the output of $x$ after Dropout with keep rate $p$, and $\tilde{x}^R$ be the output of $x$ after RotationOut with $\mathbb{E}[\tan^2\theta] = (1-p)/p$. We have Lemma 1 (for the proof, see Appendix A.3):
Combining this with Lemma 1, we have:
We can compute the co-adaptations of $\tilde{x}^D$ and $\tilde{x}^R$ (assume $\mathbb{E}[x] = 0$):
Under the zero-center assumption, Dropout with keep rate $p$ reduces co-adaptation by a factor of $p$, and the equivalent RotationOut reduces co-adaptation by a strictly smaller factor.
We take a closer look at the correlation coefficient to see what makes the difference. Let $\rho_{ij}$ be the $(i,j)$-th element of the correlation matrix. Recalling Equation 13, we have:
Dropout and other dropout-like methods add noise to different neurons independently, so $\mathrm{Cov}(\tilde{x}_i, \tilde{x}_j \mid x) = 0$. The only term that reduces the correlation coefficients in Equation 16 is the inflated variance $\mathbb{E}[\mathrm{Var}(\tilde{x}_i \mid x)]$. Under our non-trivial noise assumption, this term is always positive, so non-trivial noise can always reduce co-adaptations. For RotationOut, there is another term that reduces the correlation coefficients: the conditional covariance $\mathbb{E}[\mathrm{Cov}(\tilde{x}_i, \tilde{x}_j \mid x)]$, which typically opposes the sign of the original covariance. In addition to increasing the uncertainty of each neuron as Dropout does, RotationOut also directly reduces the correlation between two neurons. In other words, it acts as lateral inhibition noise.
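The claim that independent per-neuron noise shrinks correlation coefficients (here by a factor of the keep rate $p$ for zero-mean features) can be checked with a quick simulation; the toy covariance below is our own choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400_000, 0.7                          # sample count and dropout keep rate

# Two correlated zero-mean neurons with corr(x0, x1) = 0.6.
z = rng.standard_normal((n, 2))
x = z @ np.array([[1.0, 0.6], [0.0, 0.8]])

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

m = (rng.random((n, 2)) < p) / p             # independent inverted-dropout masks
xd = x * m

# Independent noise inflates each variance by 1/p but leaves the covariance of
# zero-mean features unchanged, so the correlation coefficient shrinks by p.
print(corr(x[:, 0], x[:, 1]))    # ~0.6
print(corr(xd[:, 0], xd[:, 1]))  # ~0.6 * p = 0.42
```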
Here we explain why we need the zero-center assumption and why we rotate the zero-centered features in Section 2.3. Equations 14 and 16 show that a non-zero mean value can further reduce the co-adaptations, so if we do not know the exact mean value, we do not know the exact regularization strength. Suppose the neurons follow a normal distribution and we apply Dropout on the ReLU activations. With a keep rate of 0.9, Dropout reduces the co-adaptations by a factor of 0.86, while with a keep rate of 0.7 it reduces them by a factor of 0.61: a non-linear mapping that is influenced by the mean value. We rotate/drop the zero-centered features so that the regularization strength is independent of the mean value.
3.2 Dropout before Batch Normalization
Dropout changes the variance of a specific neuron when the network transfers from training to inference, while BN requires a consistent statistical variance. This variance inconsistency (variance shift) between training and inference leads to unstable numerical behavior and more erroneous predictions when Dropout is applied before BN.
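The variance shift is easy to reproduce numerically. A sketch with half-normal (ReLU-like) activations, our own toy setup rather than the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500_000, 0.5
x = np.abs(rng.standard_normal(n))   # non-negative, ReLU-like activations

m = (rng.random(n) < p) / p          # inverted dropout at training time
train_var = (x * m).var()            # what a following BN layer records as running variance
test_var = x.var()                   # actual variance at inference (dropout removed)

# The recorded running variance overestimates the inference-time variance.
print(train_var, test_var, train_var / test_var)
```

For this half-normal input the ratio is $(1/p - 2/\pi)/(1 - 2/\pi) \approx 3.75$, well above 1.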
We can easily understand this using Equation 13. Suppose a Dropout layer is applied right before a BN layer. At training time, the BN layer records the diagonal elements of $\mathrm{Cov}(\tilde{x})$ as the running variance and uses them at inference. However, the actual variance at inference is given by the diagonal elements of $\mathrm{Cov}(x)$, which are smaller than the recorded running variance (the training variance). Li et al. (2019) argues:
Proposition 1: Instead of using Dropout, a more variance-stable form, Uout, can be used to mitigate the problem: $\tilde{x}_i = x_i(1 + r_i)$, where $r_i \sim U(-\beta, \beta)$.
Proposition 2: Instead of applying Dropout-a (Figure 1), applying Dropout-b can mitigate the problem.
Proposition 3: In Dropout-b, the ratio between the training variance and the testing variance approaches 1 as the input dimension of the weight layer grows, so expanding the input dimension can mitigate the problem.
We revisit these propositions and discuss how to mitigate the problem. For Proposition 1, Uout is unlikely to mitigate the problem well. The Uout noise added to different neurons is independent, so the variance shift is the only term that reduces co-adaptations in Equation 16. Though Uout is variance-stable, it provides less regularization, which is equivalent to Dropout with a higher keep rate.
Propositions 2 and 3 discuss the positions at which to insert Dropout. Let $x$ be the output of the ReLU layer and let $z$ be the input of the BN layer. The weight layers in Dropout-a and Dropout-b are the same, with weight $W$. At test time, the inputs to the BN layers in Dropout-a and Dropout-b are the same and have the same variance. At training time, the inputs are different: in Dropout-a, the formulation is $z = \mathrm{Dropout}(Wx)$; in Dropout-b, it is $z = W\,\mathrm{Dropout}(x)$. So the training variances of the two types are different. Recalling Lemma 1, we have:
Let $w$ be a row of $W$ and assume $w$ is uniformly distributed on the unit sphere. Since the length of $w$ scales the training and testing variance by the same proportion, it does not affect the ratio between them, and we can assume the length of $w$ is fixed. Each element of the actual testing variance is $\mathrm{Var}(w^\top x)$. For Dropout-a, the corresponding element of the running (training) variance comes from dropping the single pre-BN activation $w^\top x$; for Dropout-b, it comes from dropping the $d$ input dimensions before the weight layer averages them. Dropout-a and Dropout-b have the same expected variance shift:
Though the expected variance shift is the same, the variance of the shift differs. Let $r$ be the ratio between the training variance and the testing variance. We have the following observation:
Observation. Suppose $\mathbb{E}[x] \geq 0$ elementwise, which is the case when the activation function is ReLU. Then the ratio in Dropout-b is more concentrated: $\mathrm{Var}(r_b) < \mathrm{Var}(r_a)$. Sampling weights for the weight layer, the maximum ratio in Dropout-a is bigger than the maximum ratio in Dropout-b with high probability.
According to this observation, Propositions 2 and 3 are basically right but might not be precise. Dropout-b does help mitigate the problem, but there might be other reasons: the expected variance shift is the same in Dropout-a and Dropout-b, yet Dropout-b has a more stable variance shift across dimensions. Dropout-a is more likely to have a very large training/testing variance ratio, leading to more serious unstable numerical behavior.
Consider zero-centered Dropout-a in Equation 17. The ratio is fixed to $1/p$ for any weights. This leads to less unstable numerical behavior, since there is no extreme variance shift ratio, and we can modify the BN layer's validation mode by multiplying the recorded running variance by $p$. Zero-centered Dropout-a can thus be one solution to the variance shift problem.
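A quick numerical check of the fixed $1/p$ ratio for zero-centered dropout (toy data, our own construction): centering before masking makes the train/test variance ratio exactly $1/p$ regardless of the feature mean, so a single correction factor suffices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500_000, 0.5
x = np.abs(rng.standard_normal(n)) + 3.0   # activations with a large mean
mu = x.mean()

m = (rng.random(n) < p) / p
centered = (x - mu) * m + mu               # zero-centered dropout

# The train/test variance ratio is 1/p for any mean, so BN's recorded running
# variance can simply be multiplied by p at validation time.
ratio = centered.var() / x.var()
print(ratio)                          # ~ 1/p = 2.0
print(centered.var() * p / x.var())   # ~ 1.0 after the correction
```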
We verified this claim on the CIFAR100 dataset using ResNet110. We apply Dropout between the convolutions of all residual blocks in the third residual stage (18 dropout layers are added). We test three types of Dropout with a keep rate of 0.5: 1) Dropout-a-centered, 2) Dropout-b, and 3) Dropout-b-centered. Following Li et al. (2019), the experiments proceed in three steps: 1) Calculate the running variance of all BN layers in training mode; this is the training variance. 2) Calculate the running variance of all BN layers in testing mode; this is the testing variance. Data augmentation and the dataloader are kept the same to ensure that every detail affecting the neural variances remains exactly as in training. 3) Obtain the ratio between the training and testing variances (note that both are 64-dimensional vectors). For Dropout-a-centered, we multiply the running variance by $p$ (we also tried this for the other two Dropout types, but the results were not better). The obtained ratio measures the variance shift between training and testing modes; a smaller ratio is better. The results are averaged over 3 runs and shown in Figure 2.
In this section, we evaluate the performance of RotationOut on image classification, object detection, and speech recognition. First, we conduct detailed ablation studies on the CIFAR100 dataset. Next, we compare RotationOut with other regularization techniques using more data and higher resolution, testing on two tasks: image classification on the ILSVRC dataset and object detection on the COCO dataset.
4.1 Ablation Study on CIFAR100
The CIFAR100 dataset consists of 60,000 colour images of size $32 \times 32$ pixels in 100 classes. The official version of the dataset is split into a training set with 50,000 images and a test set with 10,000 images. We conduct image classification experiments on this dataset.
Our focus is on regularization ability, so the experimental settings for the different regularization techniques are the same. We follow the settings of He et al. (2016): the training images are zero-padded, randomly cropped to $32 \times 32$ pixels, and finally mirrored horizontally with 50% probability. For all of these experiments, we use the same optimizer: SGD with batches of 128 images, momentum of 0.9, and weight decay of 1e-5. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations. For each run, we record the best validation accuracy and the average validation accuracy of the last 10 epochs. Each experiment is repeated 5 times, and we report the top-1 (best and average) validation accuracy as "mean ± standard deviation" over the 5 runs.
We compare the regularization abilities of RotationOut and Dropout on two classical architectures: ResNet110 from He et al. (2016) and WideResNet28-10 from Zagoruyko & Komodakis (2016). ResNet110 is a deep but not so wide architecture using BasicBlocks (Zagoruyko & Komodakis, 2016) in three residual stages; the feature map sizes are $32 \times 32$, $16 \times 16$, and $8 \times 8$ respectively, and the numbers of filters are 16, 32, and 64 respectively. WideResNet28-10 is a wide but not so deep architecture using BasicBlocks in three residual stages; the feature map sizes are the same, and the numbers of filters are 160, 320, and 640 respectively. For ResNet110, we apply RotationOut or Dropout (with the same keep rate) only to the convolutional layers in the third residual stage. For WideResNet28-10, we apply RotationOut or Dropout (with the same keep rate) to all convolutional layers in the second and third residual stages, since WideResNet28-10 has many more parameters.
As mentioned earlier, we can use different distributions to generate $\theta$, and the regularization strength is controlled by $\mathbb{E}[\tan^2\theta]$. We compare RotationOut with the correspondingly equivalent Dropout. We tried different distributions and found the performance differences to be very small; we report the results of the Gaussian distribution here.
Table 1 shows the results on the CIFAR100 dataset with the two architectures. Tables 1a and 1b are the results for ResNet110; Tables 1c and 1d are the results for WideResNet28-10. Results in the same row compare the regularization abilities of Dropout and the equivalent-keep-rate RotationOut. We find that dropping too many neurons is less effective and may hurt training. Since WideResNet28-10 has many more parameters, its best performance comes from a heavier regularization.
4.2 Experiments with more data and higher resolution
The ILSVRC 2012 classification dataset contains 1.2 million training images and 50,000 validation images in 1,000 categories. We follow the training and testing schema of (Szegedy et al., 2015; He et al., 2016) but train the model for 240 epochs. The learning rate is decayed by a factor of 0.1 at 120, 190, and 230 epochs. We apply RotationOut, with $\tan\theta$ drawn from a normal distribution, to the convolutional layers in Res3 and Res4 as well as to the last fully connected layer. As mentioned earlier, RotationOut is easily combined with the DropBlock idea: we rotate features only within a contiguous block in Res3 and Res4.
Table 2 shows the results of some state-of-the-art methods alongside ours. Our results are averaged over 5 runs. The results of the other methods are from Ghiasi et al. (2018) and also regularize Res3 and Res4. Our result is significantly better than Dropout and SpatialDropout. Using the DropBlock idea, RotationOut obtains competitive results compared with state-of-the-art methods and a 2.07% improvement over the baseline.
| Model |
|---|
| ResNet-50 (He et al., 2016) |
| ResNet-50 + dropout (kp=0.7) (Srivastava et al., 2014) |
| ResNet-50 + DropPath (kp=0.9) (Larsson et al., 2016) |
| ResNet-50 + SpatialDropout (kp=0.9) (Tompson et al., 2015) |
| ResNet-50 + Cutout (DeVries & Taylor, 2017) |
| ResNet-50 + DropBlock (kp=0.9) (Ghiasi et al., 2018) |
| ResNet-50 + RotationOut |
| ResNet-50 + RotationOut (Block) |
COCO Object Detection. We use RetinaNet as the detection method and apply RotationOut to the ResNet backbone, with the same hyperparameters as in ImageNet classification. We follow the implementation details of (Ghiasi et al., 2018): resize images between scales [512, 768] and then crop the image to a max dimension of 640. The model is initialized with ImageNet pretraining and trained for 35 epochs, with learning rate decay at 20 and 28 epochs. We set the focal loss hyperparameters as in (Ghiasi et al., 2018), with a weight decay of 0.0001, a momentum of 0.9, and a batch size of 64. The model is trained on COCO train2017 and evaluated on COCO val2017. We compare our result with DropBlock (Ghiasi et al., 2018), as Table 3 shows.
| Model | Init | AP | AP50 | AP75 |
|---|---|---|---|---|
| RetinaNet, no DropBlock | Random | 36.8 | 54.6 | 39.4 |
| RetinaNet, RotationOut (Block) | ImageNet | 38.7 | 56.6 | 41.4 |
Due to limited computing resources, we finetune the model from the PyTorch library's pretrained ImageNet classification models, while the DropBlock method trained the model from scratch. We think the comparison with the DropBlock method is fair, since the initialization does not by itself increase the results, as shown in the first two rows. Our RotationOut still gains an additional 0.3 AP over the DropBlock result.
4.3 Experiment in speech recognition
We show that our RotationOut can also help train LSTMs. We conduct a speech-to-text experiment on the WSJ (Wall Street Journal) dataset (Paul & Baker, 1992), a database of 80 hours of transcribed speech. The inputs are variable-length speech features, with a 40-dimensional feature for each time step; the labels are character-based words. We build a CTC (connectionist temporal classification) (Graves et al., 2006) model with a four-layer bidirectional LSTM network. The input dimension, hidden dimension, and output dimension of the network are 40, 512, and 137, respectively. We use the Adam optimizer with learning rate 1e-3, weight decay 1e-5, and batch size 32, train the model for 80 epochs, and reduce the learning rate by 5x at epoch 40. We report the edit distance between our predictions and the ground truth on the "eval92" test set. Table 4 shows the performance of different regularization methods.
| Method | Edit distance |
|---|---|
| Variational Weight Drop (kp=0.8) | 7.5 |
| Locked Drop (kp=0.8) + Variational Weight Drop (kp=0.8) | 6.7 |
| RotationOut + Variational Weight Drop (kp=0.9) | 6.4 |
In this work, we introduce RotationOut as an alternative to Dropout for neural networks. RotationOut adds continuous noise to data/features while keeping their semantics. We further establish a noise analysis showing how co-adaptations are reduced in neural networks and why RotationOut is more effective than Dropout. Our experiments show that applying RotationOut in neural networks helps training and increases accuracy. A possible direction for further work is the theoretical analysis of co-adaptations. As discussed earlier, the proposed correlation analysis is not optimal: it cannot explain the difference between standard Dropout and Gaussian dropout, nor can it explain methods such as Shake-shake regularization. Further work on co-adaptation analysis would help in better understanding noise-based regularization methods.
- Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182, 2016.
- Anderson (2000) Edward Anderson. Discontinuous plane rotations and the symmetric eigenvalue problem. Technical report, 2000.
- Bau et al. (2017) David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- DeVries & Taylor (2017) Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
- Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019–1027, 2016.
- Gastaldi (2017) Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- Ghiasi et al. (2018) Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pp. 10727–10737, 2018.
- Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. ACM, 2006.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- Hu et al. (2016) Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564, 2016.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Labach et al. (2019) Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee. Survey of dropout methods for deep neural networks. arXiv preprint arXiv:1904.13310, 2019.
- Larsson et al. (2016) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
- Lenc & Vedaldi (2015) Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 991–999, 2015.
- Li et al. (2019) Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019.
- Li et al. (2016) Zhe Li, Boqing Gong, and Tianbao Yang. Improved dropout for shallow and deep learning. In Advances in Neural Information Processing Systems, pp. 2523–2531, 2016.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
- Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
- Luo et al. (2018) Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846, 2018.
- Marcos et al. (2017) Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057, 2017.
- Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
- Moon et al. (2015) Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song. Rnndrop: A novel dropout for rnns in asr. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 65–70. IEEE, 2015.
- Paul & Baker (1992) Douglas B Paul and Janet M Baker. The design for the wall street journal-based csr corpus. In Proceedings of the workshop on Speech and Natural Language, pp. 357–362. Association for Computational Linguistics, 1992.
- Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in neural information processing systems, pp. 3856–3866, 2017.
- Salimans & Kingma (2016) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
- Shrivastava et al. (2017) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116, 2017.
- Simard et al. (2003) Patrice Y Simard, David Steinkraus, John C Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In Icdar, volume 3, 2003.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
- Tompson et al. (2015) Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656, 2015.
- Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066, 2013.
- Worrall et al. (2017) Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037, 2017.
- Wu & Gu (2015) Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks, 71:1–10, 2015.
- Wu & He (2018) Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
- Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
Appendix A Appendix
A.1 Random Rotation Matrix
One example of such a matrix, which rotates the $i$-th and $j$-th dimensions by an angle $\theta$, can be:

$$R_{ij}(\theta)=\begin{pmatrix} 1 & & & & \\ & \cos\theta & \cdots & -\sin\theta & \\ & \vdots & \ddots & \vdots & \\ & \sin\theta & \cdots & \cos\theta & \\ & & & & 1 \end{pmatrix} \quad (20)$$

where the $\cos\theta$ and $\sin\theta$ entries lie in rows and columns $i$ and $j$, and all other entries coincide with the identity matrix.
The sparse matrix in Equation 20 is similar to a combination of a permutation matrix and a diagonal matrix, so we do not need a matrix multiplication to compute the output. The output can be obtained by slicing and an elementwise multiplication.
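The slicing trick can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the rotation pairs each even dimension $2k$ with the odd dimension $2k{+}1$; the pairing and the function name are our own choices for illustration, not the paper's:

```python
import numpy as np

def givens_rotate_pairs(x, theta):
    """Rotate dimension pairs (0,1), (2,3), ... of x by angle theta.

    Because the rotation matrix only couples pairs of dimensions, the
    output needs no full matrix multiplication: slicing extracts the two
    halves of each pair, and elementwise products apply the 2x2 rotation.
    """
    x = np.asarray(x, dtype=float)
    assert x.size % 2 == 0, "requires an even number of dimensions"
    a, b = x[0::2], x[1::2]  # first and second coordinate of each pair
    out = np.empty_like(x)
    out[0::2] = np.cos(theta) * a - np.sin(theta) * b
    out[1::2] = np.sin(theta) * a + np.cos(theta) * b
    return out

# Rotating each pair by 90 degrees: (1,0) -> (0,1) and (0,1) -> (-1,0).
y = givens_rotate_pairs(np.array([1.0, 0.0, 0.0, 1.0]), np.pi / 2)
```

Note that the norm of the feature vector is preserved, which is exactly why a rotation perturbs directions without changing magnitudes.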
A.2 Marginalizing Linear Regression
A.3 Proof of Lemma 1
Appendix B Rethinking Small Batch Size Batch Normalization
Batch normalization (BN) performance decreases rapidly when the batch size becomes small, which limits BN's usage for training large models. A common explanation is inaccurate batch statistics estimation (Wu & He, 2018; Luo et al., 2018). However, we think the reason may be more complicated: noise-based methods also lead to inaccurate statistics, yet they usually make the network more robust. We argue that BN is a non-linear operation, although it is usually treated as linear, and this non-linearity becomes an important issue when the batch size is small.
Let $x_1, \dots, x_N$ be one dimension of the data, where $N$ is the dataset size. During mini-batch training, a batch $\{x_1, \dots, x_B\}$ is sampled and the BN operation can be formulated as:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad \mu_B = \frac{1}{B}\sum_{i=1}^{B} x_i, \qquad \sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu_B)^2$$
BN records a running mean $\mu_{run}$ and running variance $\sigma_{run}^2$ to be used in testing:

$$\mu_{run} \leftarrow \alpha\,\mu_{run} + (1-\alpha)\,\mu_B, \qquad \sigma_{run}^2 \leftarrow \alpha\,\sigma_{run}^2 + (1-\alpha)\,\sigma_B^2$$
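The train-mode normalization and the running statistics recorded for test mode can be sketched as follows, for a single feature dimension. The function name, momentum value, and epsilon are our own assumptions for illustration:

```python
import numpy as np

def bn_train_step(batch, running_mean, running_var, momentum=0.9, eps=1e-5):
    """One BN training step on a single feature dimension (a sketch).

    Train mode normalizes with the *batch* statistics; the running
    averages are what test mode will later use in their place.
    """
    mu = batch.mean()
    var = batch.var()
    normalized = (batch - mu) / np.sqrt(var + eps)
    # Exponential moving averages, updated toward the batch statistics.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return normalized, running_mean, running_var

rng = np.random.default_rng(0)
normalized, rm, rv = bn_train_step(rng.standard_normal(32), 0.0, 1.0)
```

The train/test gap discussed next arises precisely because test mode replaces the batch statistics in the first formula with these running averages.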
The test mode needs an assumption:

$$\mathbb{E}\!\left[\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}\right] \approx \frac{x-\mathbb{E}[\mu_B]}{\sqrt{\mathbb{E}[\sigma_B^2]+\epsilon}} = \frac{x-\mu_{run}}{\sqrt{\sigma_{run}^2+\epsilon}} \quad (27)$$
It is easy to see that Equation 27 does not strictly hold (by Jensen's inequality). We want to know the gap between the train and test modes. Suppose we have a batch of data $\{x_1, \dots, x_B\}$. Denote the leave-one-out statistics:

$$\mu_{-i} = \frac{1}{B-1}\sum_{j\neq i} x_j, \qquad \sigma_{-i}^2 = \frac{1}{B-1}\sum_{j\neq i}\left(x_j - \mu_{-i}\right)^2$$
Note that $\mu_B$ and $\sigma_B$ are not independent from $x_i$, but $\mu_{-i}$ and $\sigma_{-i}$ are. We have:

$$\mu_B = \frac{B-1}{B}\,\mu_{-i} + \frac{1}{B}\,x_i, \qquad \sigma_B^2 = \frac{B-1}{B}\left(\sigma_{-i}^2 + \frac{(x_i-\mu_{-i})^2}{B}\right) \quad (28)$$
We can then compute the expected output of one unit after normalization. Reusing Equation 28, we denote:
In the train mode, the output of BN for one unit $x_i$ in a batch is $(x_i - \mu_B)/\sqrt{\sigma_B^2+\epsilon}$. In the test mode, the unit outputs $(x_i - \mu_{run})/\sqrt{\sigma_{run}^2+\epsilon}$ instead. The expectation and variance of the train-mode output can be computed from Equation 28. BN assumes this expectation matches the linear test-mode output; we argue that the test-mode output is biased, and the true expectation is non-linear in $x_i$.
Suppose the input follows a Gaussian distribution, so the relevant expectations are easy to compute, and we can estimate them by Monte Carlo sampling. Figure 3 shows the resulting values.
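The Monte Carlo comparison can be sketched as follows. For standard Gaussian inputs the running statistics converge to roughly $(0, 1)$, so the test-mode output of a unit is approximately the unit itself; the function name, trial count, and seed are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_test_gap(batch_size, trials=20000, eps=1e-5):
    """Estimate the BN train/test discrepancy by Monte Carlo (a sketch).

    Train mode outputs (x - mu_B) / sqrt(var_B + eps) with batch
    statistics that are noisy and correlated with x; test mode, for
    N(0,1) data with running stats ~(0,1), outputs roughly x itself.
    We measure the mean absolute gap between the two for one unit.
    """
    gaps = []
    for _ in range(trials):
        batch = rng.standard_normal(batch_size)
        mu, var = batch.mean(), batch.var()
        train_out = (batch[0] - mu) / np.sqrt(var + eps)
        test_out = batch[0]  # test-mode prediction under running stats (0, 1)
        gaps.append(train_out - test_out)
    return float(np.mean(np.abs(gaps)))
```

Running `train_test_gap` for several batch sizes reproduces the qualitative trend: the gap shrinks as the batch grows, which is consistent with BN degrading at small batch sizes.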