Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks

02/23/2020 · by Yehui Tang, et al. · Huawei Technologies Co., Ltd.; The University of Sydney; Peking University

Deep neural networks often contain a great number of trainable parameters for extracting powerful features from given datasets. On one hand, these massive trainable parameters significantly enhance the performance of deep networks. On the other hand, they bring the problem of over-fitting. To this end, dropout-based methods disable some elements in the output feature maps during the training phase to reduce the co-adaptation of neurons. Although the generalization ability of the resulting models can be enhanced by these approaches, conventional binary dropout is not the optimal solution. Therefore, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks and propose a feature map distortion method (Disout) to address the aforementioned problem. In the training phase, randomly selected elements in the feature maps are replaced with specific values derived by exploiting the generalization error bound. The superiority of the proposed feature map distortion for producing deep neural networks with higher test accuracy is analyzed and demonstrated on several benchmark image datasets.


Introduction

The superiority of deep neural networks, especially convolutional neural networks (CNNs), has been well demonstrated in a large variety of tasks, including image recognition [15, 9, 26], object detection [19, 18], video analysis [5], and natural language processing [25]. The huge success of deep CNNs should largely be attributed to the large number of trainable parameters and the availability of annotated data, e.g. the ImageNet dataset [3] with over 1 million images from 1000 different categories.

Since deep networks are often over-parameterized to achieve high performance on the training set, an important problem is to avoid over-fitting, i.e. the excellent performance achieved on the training set is expected to be repeated on the test set [11, 27]. In other words, the empirical risk should be close to the expected risk. To this end, [11] first proposed the conventional binary dropout approach, which reduces the co-adaptation of neurons by stochastically dropping a portion of them in the training phase. This operation can be regarded either as a model ensemble technique or as a data augmentation method, and it significantly enhances the performance of the resulting network on the test set.

To improve the performance of dropout on deep neural networks, [1] adaptively adjusted the dropout probability of each neuron by interleaving a binary belief network into the neural network. Gaussian dropout [21], which multiplies the outputs of neurons by Gaussian random noise, is equivalent to conventional binary dropout; it was further analyzed from the perspective of Bayesian regularization, under which the dropout probability can be optimized automatically [13]. Instead of disabling activations, DropConnect [24] randomly sets a subset of network weights to zero; [24] also derived a bound on the generalization performance of Dropout and DropConnect. [28] connected this bound with the drop probability and optimized the dropout probability together with the network parameters during training. Focusing on convolutional neural networks, [6] proposed to drop contiguous regions of a feature map to obstruct the information flow more radically.

Existing variants of dropout have made tremendous efforts to minimize the gap between the expected risk and the empirical risk, but they all follow the general idea of disabling part of the output of an arbitrary layer in the neural network. The essence of their success is to randomly obscure part of the semantic information extracted by the deep neural network and thereby prevent the massive number of parameters from over-fitting the training set. Setting a certain number of elements in the feature map to zero is a straightforward way to disturb the information propagated across layers, but it is by no means the only way to accomplish this goal. Most importantly, such hand-crafted operations are unlikely to be optimal in most cases.

In this work, we propose a novel approach for enhancing the generalization ability of deep neural networks by investigating distortions of the feature maps (Disout). The generalization error bound of a given deep neural network is established in terms of the Rademacher complexity of its intermediate layers. Distortion is introduced onto the feature maps to decrease the associated Rademacher complexity, which is in turn beneficial for the generalization ability of the network. Besides minimizing the ordinary classification loss, the proposed method therefore simultaneously reduces the gap between the expected and empirical risks by adding distortions to feature maps. An extension to convolutional layers and the corresponding optimization details are also provided. Experimental results on benchmark image datasets demonstrate that deep networks trained with the proposed feature map distortion perform better than those trained with state-of-the-art methods.

Preliminary

Dropout is a prevalent regularization technique for alleviating the over-fitting of models and has achieved great success. It has been demonstrated that dropout can improve the generalization ability of models both theoretically [24] and practically [21]. In this section, we briefly introduce generalization theory and the dropout method.

Generalization Theory

Generalization theory focuses on the relation between the expected risk and the empirical risk. Consider an L-layer neural network F and a labeled dataset {(x_i, y_i)}_{i=1}^N sampled from the ground-truth distribution D, in which x_i is an input and y_i its label. Denote the weight matrix of the l-th layer as W_l, where d_l is the dimension of the feature map of the l-th layer, and denote the corresponding output features of the l-th layer before and after the activation function as a_l and h_l, respectively. Omitting the bias, we have a_l = W_l h_{l-1}. For simplicity, we further refer to h_l(x_i) as h_l.

Taking the image classification task as an example, the expected risk over the population and the empirical risk on the training set can be formulated as:

R(F) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(F(x), y)],   (1)

\hat{R}_N(F) = \frac{1}{N}\sum_{i=1}^{N} \ell(F(x_i), y_i),   (2)

where \ell denotes the 0-1 loss. Various techniques have been developed to quantify the gap between the expected risk and the empirical risk, such as PAC learning [8], the VC dimension [20], and Rademacher complexity [14]. Among them, the empirical Rademacher complexity (ERC) has been widely used, as it often leads to a much tighter generalization error bound. The formal definition of the ERC is given as follows:

Definition 1

For a given training dataset with N instances {x_i}_{i=1}^N generated by the distribution D, the empirical Rademacher complexity of the function class \mathcal{F} of the network is defined as:

\hat{\mathcal{R}}_N(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \sigma_i f(x_i)\Big],   (3)

where σ = (σ_1, ..., σ_N) are Rademacher variables, the σ_i's are independent uniform random variables in {-1, +1}, and σ_i is the i-th element of σ.
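For intuition, the expectation over σ in Eq. (3) can be approximated by Monte Carlo sampling when the function class is represented by a finite set of candidate functions evaluated on the N training points. The sketch below is a minimal Python illustration under that simplifying assumption; the function and variable names are ours, not part of the original method.

```python
import numpy as np

def empirical_rademacher(outputs, n_trials=1000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity (Eq. (3)).

    outputs: array of shape (num_functions, N) holding f(x_i) for every
             candidate function f in a finite surrogate of the function class.
    """
    rng = np.random.default_rng(rng)
    _, n = outputs.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        total += np.max(outputs @ sigma) / n      # sup over the (finite) class
    return total / n_trials
```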

Using the empirical Rademacher complexity and McDiarmid's inequality, an upper bound on the expected risk can be derived, as stated in Theorem 1 [14].

Theorem 1

Given a fixed ρ > 0, for any δ > 0, with probability at least 1 − δ, the following holds for all f in the function class \mathcal{F}:

(4)

where K denotes the output dimension of the network.

According to Theorem 1, the gap between the expected and empirical risks can be bounded with the help of the empirical Rademacher complexity of the specific neural network on the given dataset. Directly calculating the ERC is very hard [12], and thus upper bounds or approximate values of the ERC are usually used in the training phase to obtain models with better generalization [12, 28]; for example, [12] obtained models with better generalization by decreasing a regularization term related to the ERC. The effectiveness of decreasing the ERC in previous works inspires us to leverage the ERC to refine conventional dropout methods.

Dropout

Dropout is a classical and effective regularization technique for improving the generalization ability of models. There are many variants of dropout, e.g. variational dropout [13] and DropBlock [6]. Most of them follow the strategy of disabling part of the elements of the feature maps. In general, these methods can be formulated as:

\hat{h}_l = h_l \odot M_l,   (5)

where \odot denotes the element-wise product, and h_l and \hat{h}_l are the original and distorted features, respectively (without ambiguity, h_l(x_i) is denoted as h_l for simplicity). In addition, M_l is the binary mask applied to the feature map h_l, and each element of M_l is drawn from a Bernoulli distribution governed by the dropping probability p. Admittedly, implementing dropout on the features in the training phase forces the given network to pay more attention to the non-zero regions and partially alleviates over-fitting. However, disabling the original features is a heuristic approach and may not always lead to the optimal solution to the over-fitting problem in deep neural networks.
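For reference, a minimal NumPy sketch of the baseline operation in Eq. (5), using the common inverted-dropout scaling (dividing by 1 − p during training rather than rescaling at test time, as also noted later); all names are illustrative.

```python
import numpy as np

def dropout_forward(h, drop_prob, training=True, rng=None):
    """Conventional binary dropout (cf. Eq. (5)) with inverted scaling."""
    if not training or drop_prob == 0.0:
        return h
    rng = np.random.default_rng(rng)
    mask = (rng.random(h.shape) >= drop_prob).astype(h.dtype)  # keep with prob 1 - p
    return h * mask / (1.0 - drop_prob)                        # rescale during training
```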

Approach

Instead of fixing the value of the perturbation, we aim to learn the distortion of the feature map by reducing the ERC of the network. Generally, the disturbing operation employed on the output feature h_l of the l-th layer can be formulated as:

(6)

where ε_l is the distortion applied to the feature map h_l. Compared to the dropout method (Eq. (5)), which sets the distortion by hand, Eq. (6) learns the form of the distortion automatically under the guidance of the ERC. Directly using the ERC of the whole network to guide the distortion is very hard, since it is calculated on the final layer w.r.t. the output of the neural network and is difficult to trace back to the intermediate feature maps during training. Hence, we reformulate the bound by considering the output feature of an arbitrary layer, and obtain the following theorem based on [24].
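To make the idea concrete, the following NumPy sketch shows one natural instantiation of such a learnable perturbation, in which a binary mask selects the locations to be distorted and a tensor epsilon is added there. The exact parameterization used by the method is the one in Eq. (6); treat the additive form and all names below as assumptions for illustration only.

```python
import numpy as np

def disout_forward(h, epsilon, dist_prob, rng=None):
    """Apply a learned distortion to randomly selected feature-map elements.

    Under this (assumed) additive parameterization, choosing epsilon = -h at
    the masked positions would recover conventional dropout."""
    rng = np.random.default_rng(rng)
    mask = (rng.random(h.shape) < dist_prob).astype(h.dtype)  # 1 marks a distorted element
    return h + mask * epsilon
```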

Theorem 2

Let W_l^j denote the j-th row of the weight matrix W_l, and let ||·||_p denote the p-norm of a vector. Assuming that the p-norms of these rows are bounded, the ERC of the network output can be bounded by the ERC of an intermediate feature:

(7)

where a_l and h_l are the feature maps before and after the activation function, respectively.

The above theorem shows that the ERC of the network is upper bounded by the ERC of the output feature a_l or h_l of the l-th layer (the ERC of a_l and h_l in the l-th layer is defined in the same form as Definition 1). Thus, decreasing the ERC of a_l or h_l can heuristically decrease the ERC of the whole network. Note that h_l is the feature map of an arbitrary intermediate layer of the network, and the distortion is also applied on intermediate features. Thus, the ERC of a_l or h_l is used to guide the distortion in the following.

Feature Map Distortion

In this section, we illustrate how to decrease the ERC by applying the distortion to the feature map h_l of the l-th layer. Doing so affects the ERCs of all subsequent layers, any of which could in principle guide the distortion of the l-th layer. Recall from Theorem 2 that the closer a layer is to the output, the tighter its upper bound on the ERC of the whole network, so a deeper layer may reduce the ERC more effectively. However, for layers beyond the (l+1)-th, the relationship between the distortion in the l-th layer and the resulting ERC becomes complex and is difficult to use as guidance. Thus, we use the ERC of the (l+1)-th layer to guide the distortion in the l-th layer; specifically, we reduce this ERC by optimizing the distortion ε_l. Denoting

(8)

for simplicity, where the quantity defined in Eq. (8) has the same dimension as the feature map h_l, the ERC of the (l+1)-th layer is then calculated as:

(9)

where W_{l+1}^j denotes the j-th row of the weight matrix W_{l+1}. An ideal distortion ε_l will reduce the ERC of the next layer while preserving the representation power.

During the training phase, consider a mini-batch of samples, and let the mask and distortion of the l-th layer be M_l and ε_l, respectively. Taking the classification problem as an example, the weights of the network are updated by minimizing the cross-entropy loss. Based on the currently updated weights and the Rademacher variables σ, the optimized disturbance is obtained by solving the optimization problem:

(10)

where

(11)

in which ||·|| denotes the norm of a vector, measuring the intensity of the distortion, and λ is a hyper-parameter balancing the objective function and the intensity of the distortion. Intuitively, an overly violent distortion would destroy the original features and reduce the representation power.

Optimization of the Distortion

Our goal is to reduce the first term in Eq. (11), which is related to the ERC, while constraining the intensity of the distortion ε_l. Note that conventional dropout also achieves a similar goal in a special situation: when the drop probability is 1 and all the elements of the mask are set to 1, the distortion zeroes the whole feature map, and thus the first term in Eq. (11) is zero, showing that dropout also has the potential to reduce the ERC. However, the semantic information is then dropped as well and the network can only make random guesses. In the general case, conventional dropout disables part of the feature map, which may decrease the value of the first term, but there is no explicit interaction with the empirical Rademacher complexity. Starting from an initial value of ε_l, we optimize Eq. (10) with gradient descent. The partial derivative of the objective w.r.t. ε_l is calculated as:

(12)

where

(13)
(14)

Eq. (13) chooses the row of the weight matrix W_{l+1} that attains the maximum inner product, and Eq. (14) calculates the sign of that inner product. The equations above show that the optimization of the distortion is related to the feature and to the weight of the following layer. Note that precisely calculating this gradient is time-consuming and unnecessary; it can be estimated appropriately without much influence on performance. The Rademacher variables are sampled from {-1, +1} with equal probability (Definition 1), and thus their impact can be neglected. Selecting the row index of W_{l+1} is also related to these random variables, and hence we leverage random variables to approximate this process. Denote by w_max the vector whose j-th element is the maximum value of the j-th column of the weight matrix W_{l+1}. Then the gradient is approximated as:

(15)

where g is a random variable whose elements are sampled from the standard normal distribution with zero mean and unit standard deviation; g is used to approximate the process of selecting a row of the weight matrix W_{l+1}. Denoting the step length as α, we update ε_l along the negative gradient direction:

(16)

To train an optimal neural network, we want to simultaneously reduce the empirical risk on the training dataset (e.g. by minimizing the cross-entropy) and the Rademacher complexity; there is thus a balance between the ordinary loss and the reduction of the Rademacher complexity. This can be realized by alternately optimizing the ordinary loss w.r.t. the weights of the network and the Rademacher-complexity objective w.r.t. the distortion ε_l. After the weights of the network are updated, the distortion is optimized to decrease the objective in Eq. (11); the distortion can be updated several times per weight update, a practice usually adopted for training efficiency [7]. Taking the case of applying distortion to the feature maps of all layers as an example, the training procedure of the network is summarized in Algorithm 1. Following dropout [21], the feature map is rescaled by a factor related to the distortion probability at the testing stage, which is equivalently implemented as a division by this factor during the training phase in practice [21].
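To make the alternating update concrete, the Python sketch below performs one approximate gradient step on the distortion in the spirit of Eqs. (12)-(16). Since several quantities in those equations are not reproduced here, the surrogate random factor, the use of column-wise weight maxima, and the squared-norm penalty are all assumptions, and every name is illustrative rather than part of the original method.

```python
import numpy as np

def update_distortion(eps, mask, W_next, lam, step, rng=None):
    """One approximate negative-gradient step on the distortion (cf. Eqs. (15)-(16)).

    eps    : current distortion, shape (batch, d_l)
    mask   : binary mask of distorted locations, same shape as eps
    W_next : weight matrix of the following layer, shape (d_next, d_l)
    lam    : weight of the distortion-intensity penalty (lambda in Eq. (11))
    step   : step length alpha
    """
    rng = np.random.default_rng(rng)
    w_max = W_next.max(axis=0)                 # column-wise maxima stand in for row selection
    g = rng.standard_normal(eps.shape)         # random factor replacing the selection process
    grad_erc = g * w_max * mask                # only masked locations carry distortion
    grad_pen = 2.0 * lam * eps                 # assumes a squared l2 intensity penalty
    return eps - step * (grad_erc + grad_pen)  # move along the negative gradient
```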

Input: training data; the weights of the network.
1:  repeat
2:     for each layer l = 1, ..., L do
3:        Calculate the feature map h_l of the l-th layer;
4:        Generate the distortion ε_l and the corresponding mask M_l;
5:        Obtain the distorted feature \hat{h}_l (Eq. (6));
6:        Feed \hat{h}_l forward to the next layer;
7:     end for
8:     Backward and update the weights of the network;
9:  until convergence
Output: the resulting deep neural network.
Algorithm 1 Feature map distortion for training networks.
Method CIFAR-10 (%) CIFAR-100 (%)
CNN 81.99 49.72
CNN + Dropout [21] 82.95 54.19
CNN + Vardrop [13] 83.15 54.53
CNN + Sparse Vardrop [17] 82.13 54.26
CNN + RDdrop [28] 83.11 54.65
CNN + Feature Map Distortion 85.24 ± 0.08 56.23 ± 0.12
Table 1: Accuracies of conventional CNNs on the CIFAR-10 and CIFAR-100 datasets.

Extension to Convolutional Layers

A convolutional layer can be seen as a special fully-connected layer with sparse connections and shared weights. Hence, the distortion can be learned in the same way as in FC layers. In the following, we focus on distorting the feature maps to reduce the empirical Rademacher complexity in convolutional layers, taking the particularities of the convolution operation into account.

The convolutional kernel of the l-th layer is denoted as W_l, and the corresponding output feature maps before and after the activation function are denoted as a_l and h_l, respectively; k_h and k_w are the height and width of the convolutional kernels, while H and W are those of the feature map. The mask M_l and distortion ε_l of the l-th layer have the same dimensions as the feature map h_l, and the distortion is applied to h_l to obtain the disturbed feature map \hat{h}_l, i.e.

(17)

Similar to the fully-connected case, the ERC of the (l+1)-th layer is used to guide the optimization of the distortion in layer l. Given a mini-batch together with the mask M_l and distortion ε_l, two auxiliary symbols are defined for notational simplicity:

(18)
(19)

where ∗ denotes the convolution operation. The quantity defined in Eq. (18) is related to the distorted feature and the Rademacher variables in the (l+1)-th layer, and Eq. (19) applies the convolution operation to it. Given the notation above, the distortion ε_l can be derived by minimizing the following objective function:

(20)

where

(21)

comes from a simplified implementation of the ERC over a mini-batch. As Eq. (21) averages over the spatial dimensions of the feature map, elements at different spatial locations contribute equally to the objective. Thus, the partial derivative of the objective w.r.t. ε_l is:

(22)

where

(23)
(24)

in which the sign of each element is taken. Considering the impact of the Rademacher variables and in analogy to the FC case, two random variables are introduced to simplify Eq. (22): one approximates the Rademacher variables and the other approximates the channel-selection process of the convolutional kernel. Elements of the former take the values -1 and +1 with equal probability, and elements of the latter follow the standard normal distribution. Given the gradient, the distortion is updated in the same way as for FC layers, and the overall algorithm for feature map distortion on convolutional layers is similar to Algorithm 1.

Different from the method applied to FC layers, where each element of the binary mask is sampled independently, we draw lessons from DropBlock [6], in which elements in a contiguous square block of a given size in the feature map are distorted simultaneously. We denote this extension of the proposed method to convolutional layers as "block feature map distortion".
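A hedged NumPy sketch of how such a block mask could be sampled for a (C, H, W) feature map is given below; the seeding probability is a simplified DropBlock-style heuristic that ignores border effects, and the function name and signature are ours.

```python
import numpy as np

def block_mask(feat_shape, dist_prob, block_size, rng=None):
    """Sample a DropBlock-style binary mask: 1 marks an element to be distorted,
    and marked elements come in contiguous block_size x block_size squares."""
    rng = np.random.default_rng(rng)
    c, h, w = feat_shape
    # Seed probability chosen so that the expected fraction of distorted
    # elements is roughly dist_prob (simplified, ignoring border clipping).
    gamma = dist_prob / (block_size ** 2)
    seeds = rng.random((c, h, w)) < gamma
    mask = np.zeros((c, h, w), dtype=np.float32)
    for ch, y, x in zip(*np.nonzero(seeds)):
        mask[ch, y:y + block_size, x:x + block_size] = 1.0  # blocks clipped at borders
    return mask
```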

Experiments

In this section, we conduct experiments on several benchmark datasets to validate the effectiveness of the proposed feature map distortion method. The method is implemented on both FC layers and convolutional layers, which are validated with conventional CNNs and modern CNNs (e.g. ResNet), respectively. In order to use unified hyper-parameters across layers, we scale the distortion intensity by the standard deviation of the feature maps in each layer, and we alternately update the distortion and the weights one step each for efficiency. The distortion probability (dropping probability for dropout and DropBlock) increases linearly from 0 to the appointed value during training, following [6].
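For instance, the linear ramp-up of the distortion probability can be written as a one-line schedule; the name and the per-step granularity below are assumptions.

```python
def scheduled_dist_prob(step, total_steps, target_prob):
    """Linearly increase the distortion (dropping) probability from 0 to its
    target value over the course of training."""
    return target_prob * min(step, total_steps) / float(total_steps)
```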

Experiments on Fully Connected Layers

Model CIFAR-10 (%) CIFAR-100 (%)
ResNet-56 93.95 ± 0.09 71.81 ± 0.21
ResNet-56 + DropBlock [6] 94.18 ± 0.14 73.08 ± 0.23
ResNet-56 + Block Feature Map Distortion 94.50 ± 0.15 73.71 ± 0.20
Table 2: Accuracies of ResNet-56 on the CIFAR-10 and CIFAR-100 datasets.

To validate the effect of the proposed feature map distortion method implemented on FC layers, we conduct experiments with a conventional CNN on the CIFAR-10 and CIFAR-100 datasets. The proposed method is compared with multiple state-of-the-art variants of dropout.

Dataset. The CIFAR-10 and CIFAR-100 datasets both contain 60,000 natural images of size 32×32, with 50,000 images used for training and 10,000 for testing. The images are divided into 10 and 100 categories, respectively. 20% of the training data is held out as a validation set. Data augmentation is not used, for fair comparison.

Implementation details. The conventional CNN has three convolutional layers with 96, 128 and 256 filters, respectively. Each layer consists of a convolution with stride 1 followed by a max-pooling operation with stride 2. The features are then fed into two fully-connected layers with 2048 hidden units each. We implement the distortion method on each FC layer. The distortion probability is selected from {0.4, 0.5, 0.6} and the step length is set to 5. The model is trained for 500 epochs with batch size 128. The learning rate is initialized to 0.01 and decayed by a factor of 10 at epochs 200, 300 and 400. We run our method 5 times with different random seeds and report the average accuracy with standard deviation.
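For reference, a PyTorch sketch of a backbone with this shape is shown below. The text does not specify kernel sizes or pooling windows, so the 5x5 convolutions, padding of 2, and 2x2 pooling are assumptions, and drop_fn is a hypothetical hook where dropout or the proposed feature map distortion would be inserted after each FC layer.

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """Approximate re-creation of the conventional CNN used in Table 1 (sketch)."""

    def __init__(self, num_classes=10, drop_fn=None):
        super().__init__()

        def block(c_in, c_out):
            # conv (assumed 5x5, stride 1) followed by max-pooling with stride 2
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=5, stride=1, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )

        self.features = nn.Sequential(block(3, 96), block(96, 128), block(128, 256))
        drop = drop_fn if drop_fn is not None else nn.Identity()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 2048), nn.ReLU(inplace=True), drop,
            nn.Linear(2048, 2048), nn.ReLU(inplace=True), drop,
            nn.Linear(2048, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```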

Compared methods. The CNN model trained without extra regularization tricks is used as the baseline. Furthermore, we compare our method with the widely used dropout method [11] and several state-of-the-art variants, including Vardrop [13], Sparse Vardrop [17] and RDdrop [28].

Results. The test accuracies on both CIFAR-10 and CIFAR-100 are summarized in Table 1. The proposed feature map distortion method is superior to the compared methods by a large margin on both datasets. The CNN trained with the proposed method achieves an accuracy of 85.24% on CIFAR-10, improving on the state-of-the-art RDdrop method by 2.13% and 1.58% on CIFAR-10 and CIFAR-100, respectively. This shows that the proposed feature map distortion method can reduce the empirical Rademacher complexity effectively while preserving the representation power of the model, resulting in better test performance.

Experiments on Convolutional Layers

It is important to apply the proposed method to convolutional layers, since modern CNNs such as ResNet consist mostly of convolutional layers. In this section, we apply the proposed method to convolutional layers and conduct experiments on both the CIFAR-10 and CIFAR-100 datasets.

Implementation details. The widely-used ResNet-56 [10], which contains three groups of blocks, is used as the baseline model, and DropBlock [6] is used as the peer competitor. Both the proposed block feature map distortion method and DropBlock are applied after each convolutional layer in the last group with block_size=6, and the distortion probability (dropping probability for DropBlock) is selected from a set of candidate values. The step length is set to 30 empirically. Standard data augmentation, including random cropping, horizontal flipping and rotation (within 15 degrees), is used during training. The networks are trained for 200 epochs with batch size 128 and weight decay 5e-4. The initial learning rate is set to 0.1 and decayed by a factor of 5 at epochs 60, 120 and 160. All methods are repeated 5 times with different random seeds, and average accuracies with standard deviations are reported.

Results. The results on both the CIFAR-10 and CIFAR-100 datasets are shown in Table 2. The proposed method is superior to DropBlock, improving the accuracy by 0.32% and 0.63% on CIFAR-10 and CIFAR-100, respectively. This shows that the proposed feature map distortion suits convolutional layers and can improve the performance of modern network architectures.

Training curve. The training curves on the CIFAR-100 dataset are shown in Figure 1. The solid and dotted lines denote the test and training stages, respectively, while the red and blue lines denote the proposed feature map distortion method and the baseline model. When training converges, the baseline ResNet-56 is trapped in over-fitting and achieves a higher training accuracy but lower test accuracy, whereas the proposed feature map distortion method overcomes this problem and achieves a higher test accuracy, which demonstrates the improvement in generalization ability.

Figure 1: Training curves on the CIFAR-100 dataset.
Model Top-1 Accuracy (%) Top-5 Accuracy (%)
ResNet-50 76.51 ± 0.07 93.20 ± 0.05
ResNet-50 + Dropout [21] 76.80 ± 0.04 93.41 ± 0.04
ResNet-50 + DropPath [16] 77.10 ± 0.08 93.50 ± 0.05
ResNet-50 + SpatialDropout [23] 77.41 ± 0.04 93.74 ± 0.02
ResNet-50 + Cutout [4] 76.52 ± 0.07 93.21 ± 0.04
ResNet-50 + AutoAugment [2] 77.63 93.82
ResNet-50 + Label Smoothing [22] 77.17 ± 0.05 93.45 ± 0.03
ResNet-50 + DropBlock [6] 78.13 ± 0.05 94.02 ± 0.02
ResNet-50 + Feature Map Distortion 77.71 ± 0.05 93.89 ± 0.04
ResNet-50 + Block Feature Map Distortion 78.76 ± 0.05 94.33 ± 0.03
Table 3: Accuracies of ResNet-50 on the ImageNet dataset.
Figure 2: The impact of the distortion probability and the step length on the CIFAR-100 dataset. Test accuracies w.r.t. the distortion probability for feature map distortion and DropBlock are shown in (a). Test accuracies and accuracy gaps w.r.t. the distortion probability and the step length are shown in (b) and (c).

Feature map distortion vs. DropBlock. The test accuracies of our method (red) and DropBlock (green) with various distortion probabilities (dropping probabilities) on the CIFAR-100 dataset are shown in Figure 2(a). Increasing the drop probability strengthens the regularization, and the test accuracy is improved when the probability is set within an appropriate range. Note that our method outperforms DropBlock over a larger range of probabilities, which demonstrates the superiority of feature map distortion.

Test accuracy vs. accuracy gap. Figures 2(b) and (c) show how the test accuracy (red) and the gap between training and testing accuracies (blue) vary with the distortion probability and the step length. A larger distortion probability implies that more locations of the feature maps are distorted, while the step length controls the intensity of the disturbance at each location. Increasing either of them brings stronger regularization, resulting in a smaller gap between training and testing accuracies, i.e. stronger generalization. However, disturbing too many locations, or a single location too intensely, may destroy the representation power and harm the final test accuracy. Instead of using the fixed intensity of conventional dropout and DropBlock, our method applies distortion of appropriate intensity at appropriate locations and thus achieves better performance.

Experiments on ImageNet Dataset

In this section, we conduct experiments on the large-scale ImageNet dataset and combine the feature map distortion method with conventional dropout and with the recent DropBlock method, denoted "Feature Map Distortion" and "Block Feature Map Distortion", respectively.

Dataset. The ImageNet dataset contains 1.2M training images and 50,000 validation images from 1000 categories. Standard data augmentation, including random cropping and horizontal flipping, is applied to the training data.

Implementation details. We follow the experimental settings in [6] for fair comparison. The prevalent ResNet-50 is used as the baseline model. The distortions are applied to the feature maps after both the convolutional layers and the skip connections in the last two groups, and the step length is set empirically. For feature map distortion based on conventional dropout, the distortion probability (dropping probability) is set to 0.5, as suggested by [21]. For block feature map distortion, block_size and the distortion probability are set to 6 and 0.05, following [6]. We report the single-crop top-1 and top-5 accuracies on the validation set and repeat each method three times with different random seeds.

Compared methods. Multiple state-of-the-art regularization methods are compared, including dropout-based methods, data augmentation, and label smoothing. DropPath [16], SpatialDropout [23] and DropBlock [6] are state-of-the-art variants of dropout. Data augmentation methods, including Cutout [4] and AutoAugment [2], and label smoothing [22] are prevalent regularization techniques for alleviating over-fitting.

Results. As shown in Table 3, the proposed feature distortion method not only improves deep neural networks that use conventional dropout but also enhances the recent DropBlock method, since our method is well adapted to convolutional layers. Feature map distortion improves the top-1 accuracy from 76.80% to 77.71% compared to conventional dropout, and block feature map distortion achieves a top-1 accuracy of 78.76%, surpassing the other state-of-the-art methods by a large margin. These results demonstrate that our method can increase the generalization ability of the network while preserving the useful information in the original features.

Conclusion

Dropout-based methods have been successfully used for enhancing the generalization ability of deep neural networks. However, eliminating some of the units in a neural network is a heuristic approach to minimizing the gap between the expected and empirical risks of the resulting network, and it is rarely optimal in practice. Here we propose to embed distortions into the feature maps of a given deep neural network by exploiting the Rademacher complexity. We further extend the proposed method to convolutional layers and detail the corresponding feed-forward and back-propagation procedures, so that the method can be employed in any off-the-shelf deep architecture. Extensive experimental results show that the feature distortion technique can be easily embedded into mainstream deep networks and achieves better performance on benchmark datasets than conventional approaches.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants No. 61876007 and 61872012, and by the Australian Research Council under Project DE-180101438.

References

  • [1] J. Ba and B. Frey (2013) Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092.
  • [2] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [4] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • [5] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
  • [6] G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pp. 10727–10737.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [8] S. Hanneke (2016) The optimal sample complexity of PAC learning. The Journal of Machine Learning Research 17 (1), pp. 1319–1333.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
  • [11] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
  • [12] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468.
  • [13] D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.
  • [14] V. Koltchinskii and D. Panchenko (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 (1), pp. 1–50.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [16] G. Larsson, M. Maire, and G. Shakhnarovich (2016) FractalNet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.
  • [17] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 2498–2507.
  • [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [20] E. D. Sontag (1998) VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences 168, pp. 69–96.
  • [21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • [22] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  • [23] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656.
  • [24] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus (2013) Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066.
  • [25] C. Wang, M. Li, and A. J. Smola (2019) Language models with transformers. CoRR abs/1904.09408.
  • [26] Y. Wang, C. Xu, C. Xu, C. Xu, and D. Tao (2018) Learning versatile filters for efficient convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1608–1618.
  • [27] Y. Wang, C. Xu, C. Xu, and D. Tao (2018) Packing convolutional neural networks in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [28] K. Zhai and H. Wang (2018) Adaptive dropout with Rademacher complexity regularization.