Introduction
The superiority of deep neural networks, especially convolutional neural networks (CNNs), has been well demonstrated in a large variety of computer vision tasks including image recognition [15, 9, 26], object detection [19, 18], video analysis [5, 25], etc. The huge success of deep CNNs can be largely attributed to the large number of trainable parameters and the availability of annotated data, e.g. the ImageNet dataset [3] with over 1 million images from 1000 different categories. Since deep networks are often over-parameterized to achieve higher performance on the training set, an important problem is to avoid overfitting, i.e. the excellent performance achieved on the training set is expected to carry over to the test set [11, 27]. In other words, the empirical risk should be close to the expected risk. To this end, [11] first proposed the conventional binary dropout approach, which reduces the co-adaptation of neurons by stochastically dropping a subset of them in the training phase. This operation can be regarded as either a model ensemble technique or a data augmentation method, and it significantly enhances the performance of the resulting network on the test set.
To improve the performance of dropout in deep neural networks, [1] adaptively adjusted the dropout probability of each neuron by interleaving a binary belief network with the neural network. Gaussian Dropout [21], which multiplies the outputs of the neurons by Gaussian random noise, is equivalent to the conventional binary dropout. It was further analyzed from the perspective of Bayesian regularization, allowing the dropout probability to be optimized automatically [13]. Instead of disabling activations, DropConnect [24] randomly sets a subset of the network weights to zero; [24] also derived a bound on the generalization performance of Dropout and DropConnect. [28] connected this bound with the drop probability and optimized the dropout probability together with the network parameters during training. Focusing on convolutional neural networks, [6] proposed to drop contiguous regions of a feature map to obstruct the information flow more radically. Existing variants of dropout have made tremendous efforts to minimize the gap between the expected risk and the empirical risk, but they all follow the general idea of disabling part of the output of an arbitrary layer in the neural network. The essence of their success is to randomly obscure part of the semantic information extracted by the deep neural network and prevent the massive number of parameters from overfitting the training set. Setting a certain number of elements in the feature map to zero is a straightforward way to disturb the information propagation across layers, but it is by no means the only way to accomplish this goal. Most importantly, such handcrafted operations are hardly optimal in most cases.
In this work, we propose a novel approach for enhancing the generalization ability of deep neural networks by investigating distortions on the feature maps (Disout). The generalization error bound of a given deep neural network is established in terms of the Rademacher complexity of its intermediate layers. Distortion is introduced onto the feature maps to decrease the associated Rademacher complexity, which in turn improves the generalization ability of the network. Besides minimizing the usual classification loss, the proposed distortion thus simultaneously narrows the gap between the expected and empirical risks. An extension to convolutional layers and the corresponding optimization details are also provided. Experimental results on benchmark image datasets demonstrate that deep networks trained with the proposed feature distortion method outperform those obtained with state-of-the-art methods.
Preliminary
Dropout is a prevalent regularization technique that alleviates the overfitting of models and has achieved great success. It has been demonstrated that dropout can improve the generalization ability of models both theoretically [24] and practically [21]. In this section, we briefly introduce generalization theory and the dropout method.
Generalization Theory
Generalization theory focuses on the relation between the expected risk and the empirical risk. Consider an L-layer neural network F and a labeled dataset {(x_i, y_i)}_{i=1}^{n} sampled from the ground-truth distribution D. Denote the weight matrix of the l-th layer as W_l, in which d_l is the dimension of the feature map of the l-th layer, and the corresponding output features of the l-th layer before and after the activation function a(·) as Y_l and F_l, respectively. Omitting biases, we have Y_l = W_l F_{l-1} and F_l = a(Y_l). Taking the image classification task as an example, the expected risk R(F) over the population and the empirical risk R̂(F) on the training set can be formulated as:

R(F) = E_{(x,y)∼D}[ ℓ(F(x), y) ],   (1)

R̂(F) = (1/n) Σ_{i=1}^{n} ℓ(F(x_i), y_i),   (2)
where ℓ denotes the 0-1 loss. Various techniques have been developed to quantify the gap between the expected risk and the empirical risk, such as PAC learning [8], VC dimension [20] and Rademacher complexity [14]. Among them, the empirical Rademacher complexity (ERC) has been widely used as it often leads to a much tighter generalization error bound. The formal definition of ERC is given as follows:
Definition 1
For a given training dataset X = {x_1, …, x_n} with n instances generated by the distribution D, the empirical Rademacher complexity of the function class F of the network is defined as:

R̂_X(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^{n} σ_i f(x_i) ],   (3)

where σ = (σ_1, …, σ_n) are Rademacher variables, i.e. the σ_i's are independent uniform random variables in {−1, +1}, and σ_i is the i-th element in σ. Using the empirical Rademacher complexity and McDiarmid's inequality, an upper bound on the expected risk can be derived, as stated in Theorem 1 [14].
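To make Definition 1 concrete, the ERC of a small, explicitly enumerated function class can be estimated by Monte-Carlo sampling over the Rademacher variables. The sketch below is purely illustrative — the toy function class, values and sample counts are our own assumptions, not from the paper:

```python
import numpy as np

def empirical_rademacher(outputs, num_draws=1000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity (Eq. (3)).

    outputs: (num_functions, n) array, outputs[f, i] = f(x_i), i.e. each
    candidate function in the class evaluated on the n training points.
    """
    rng = np.random.default_rng(seed)
    num_f, n = outputs.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        # sup over the (finite) function class of the average correlation
        total += np.max(outputs @ sigma) / n
    return total / num_draws

# Toy class of 3 "functions" evaluated on n = 5 training points.
outputs = np.array([[0.2, -0.1, 0.4, 0.0, 0.3],
                    [1.0, 1.0, 1.0, 1.0, 1.0],
                    [-0.5, 0.2, 0.1, -0.3, 0.4]])
erc = empirical_rademacher(outputs)
```

A richer function class (one that can correlate with more random sign patterns) yields a larger estimate, matching the intuition that lower ERC means less capacity to fit noise.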
Theorem 1
Given a fixed training dataset, for any δ ∈ (0, 1), with probability at least 1 − δ, for all f ∈ F:

(4)

where c denotes the output dimension of the network.
According to Theorem 1, the gap between the expected and empirical risks can be bounded with the help of the empirical Rademacher complexity of the specific neural network and dataset. Directly calculating the ERC is very hard [12], and thus upper bounds or approximate values of the ERC are usually used in the training phase to obtain models with better generalization [12, 28]. For instance, [12] improved generalization by decreasing a regularization term related to the ERC. The effectiveness of decreasing the ERC in previous works inspires us to leverage the ERC to refine conventional dropout methods.
Dropout
Dropout is a classical and effective regularization technique for improving the generalization capability of models. There are many variants of dropout, e.g. variational dropout [13] and DropBlock [6]. Most of them follow the strategy of disabling part of the elements of the feature maps. In general, these methods can be formulated as:
Ŷ_l = Y_l ∘ M_l,   (5)

where ∘ denotes the element-wise product, and Y_l and Ŷ_l are the original and distorted features, respectively. In addition, M_l is the binary mask applied on the feature map Y_l, and each element in M_l is drawn from a Bernoulli distribution, i.e. set to 0 with the dropping probability p. Admittedly, implementing dropout on the features in the training phase forces the given network to pay more attention to the non-zero regions, which partially alleviates overfitting. However, disabling the original features is a heuristic approach and may not always lead to the optimal solution for addressing the aforementioned overfitting problem in deep neural networks.
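As a concrete illustration of Eq. (5), a minimal NumPy implementation of the masking is shown below, with the standard inverted-dropout rescaling so that no scaling is needed at test time; the shapes and dropping probability are illustrative:

```python
import numpy as np

def dropout_mask(feature, p=0.5, train=True, rng=None):
    """Eq. (5): multiply the feature map by a Bernoulli mask.

    p is the dropping probability; during training, kept activations are
    rescaled by 1 / (1 - p) (inverted dropout), so the test-time forward
    pass is just the identity.
    """
    if not train:
        return feature                        # identity at test time
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(feature.shape) >= p)   # 1 = keep, 0 = drop
    return feature * mask / (1.0 - p)

y = np.ones((4, 8))
y_hat = dropout_mask(y, p=0.5)   # entries are either 0.0 or 2.0
```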
Approach
Instead of fixing the value of the perturbation, we aim to learn the distortion of the feature map by reducing the ERC of the network. Generally, the disturbing operation employed on the output feature Y_l of the l-th layer can be formulated as:

Ŷ_l = Y_l + E_l,   (6)

where E_l is the distortion applied on the feature map Y_l. Compared to the dropout method (Eq. (5)), which manually sets the distortion to E_l = (M_l − 1) ∘ Y_l (i.e. −Y_l at the dropped locations), Eq. (6) automatically learns the form of the distortion under the guidance of the ERC. Directly using the ERC of the whole network to guide the distortion is very hard, since it is calculated w.r.t. the output of the final layer, and it is difficult to trace the intermediate feature maps of the neural network during the training phase. Hence, we reformulate it by considering the output feature of an arbitrary layer, and obtain the following theorem based on [24].
Theorem 2
Let w_l^j denote the j-th row of the weight matrix W_l and ‖·‖_p denote the p-norm of a vector. Under suitable norm assumptions on the weights and the activation function, the ERC of the output can be bounded by the ERC of the intermediate features:

(7)

where Y_l and F_l are the feature maps before and after the activation function, respectively.
The above theorem shows that the ERC of the network is upper bounded by the ERC of the output features Y_l or F_l of the l-th layer, whose ERCs are defined in the same form as Definition 1. Thus, decreasing the ERC of an intermediate layer can heuristically decrease that of the whole network. Note that Y_l is the feature map of an arbitrary intermediate layer of the network, and the distortion is also applied on intermediate features. Thus, the ERC of an intermediate layer is used to guide the distortion in the following.
Feature Map Distortion
In this section, we illustrate the way of decreasing the ERC by applying the distortion on the feature map Y_l of the l-th layer. By doing so, all the ERCs of the subsequent layers will be affected, and any of them can guide the distortion of the l-th layer. Recall that in Theorem 2, the closer a layer is to the output layer, the tighter the upper bound on the ERC of the whole network, so a later layer may reduce the ERC more effectively. However, for layers beyond the (l+1)-th, the relationship between their ERCs and the distortion becomes complex and is difficult to use for guidance. Thus, we use the ERC of the (l+1)-th layer to guide the distortion in the l-th layer. Denoting

(8)

for simplicity, which has the same dimension as the feature map Y_l, the ERC of the (l+1)-th layer is then calculated as:

(9)

where w_{l+1}^j denotes the j-th row of the weight matrix W_{l+1}. An ideal distortion will reduce the ERC of the next layer while preserving the representation power.
During the training phase, consider a minibatch with b samples, and let the masks and distortions of the l-th layer be {M_l} and {E_l}, respectively. Taking the classification problem as an example, the weights of the network are updated by minimizing the cross-entropy loss. Based on the currently updated weights and the Rademacher variables, the optimized disturbance is obtained by solving the optimization problem:

(10)

where

(11)

in which the second term is the squared norm of the distortion, weighted by a hyper-parameter balancing the objective function and the intensity of the distortion. Intuitively, a violent distortion will destroy the original features and reduce the representation power.
Optimization of the Distortion
Our goal is to reduce the first term in Eq. (11), which is related to the ERC, while constraining the intensity of the distortion. Note that the conventional dropout also achieves a similar goal in a special situation: when the drop probability is 1, the distortion zeroes out the whole feature map and thus the first term in Eq. (11) is zero, showing that dropout also has the potential to reduce the ERC. However, the semantic information is then dropped entirely and the network can only make random guesses. In the general case where the drop probability is smaller than 1, the conventional dropout disables part of the feature maps, which may decrease the objective value, but there is no explicit interaction with the empirical Rademacher complexity. We initialize the distortion and optimize Eq. (10) with gradient descent. The partial derivative of the objective w.r.t. the distortion is calculated as:
(12) 
where
(13)  
(14) 
Eq. (13) chooses the row of the weight matrix that attains the maximum inner product, and Eq. (14) calculates the sign of that inner product. The equations above show that the optimization of the distortion is related to the feature and to the weights in the following layer. Note that precisely calculating the gradient is time-consuming and unnecessary; it can be approximated without much influence on the performance. The Rademacher variables are randomly sampled from {−1, +1} with equal probability (Definition 1), and thus their impact can be neglected. Selecting the row index of the weight matrix is also related to the random variables, and hence we leverage random variables to approximate this process. Denote by ŵ the vector whose j-th element is the maximum value of the j-th column of the weight matrix. Then the gradient is approximated as:

(15)
where the additional term is a random variable whose elements are sampled from the standard normal distribution with zero mean and unit standard deviation, which approximates the process of selecting the rows of the weight matrix. Denoting the step length, the distortion is updated along the negative gradient direction:

(16)
To train an optimal neural network, we tend to simultaneously reduce the empirical risk on the training dataset (e.g. by minimizing the cross-entropy) and the Rademacher complexity. There is thus a balance between the ordinary loss and the reduction of the Rademacher complexity. This can be realized by alternately optimizing the ordinary loss w.r.t. the weights of the network and the Rademacher complexity w.r.t. the distortion. After obtaining the updated weights of the network, the distortion is optimized to decrease the objective in Eq. (10). After each update of the weights, the distortion can be updated several times, which is commonly adopted in practice for training efficiency [7]. Taking the case of applying distortion to the feature maps of all layers as an example, the training procedure of the network is summarized in Algorithm 1. Following dropout [21], the feature map is rescaled by a factor of 1 − p at the testing stage, which is equivalently implemented as dividing by 1 − p in the training phase in practice [21].
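The alternating procedure above can be sketched end-to-end on a toy problem. Everything below is a simplified stand-in for Algorithm 1: a one-layer linear "network" with a squared-error task loss, a max-norm surrogate for the complexity term in Eq. (11), and a numerical gradient in place of the analytic approximation in Eq. (15); all names and constants are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer "network": logits = W @ (y + eps).
W = rng.normal(size=(3, 8)) * 0.1
y = rng.normal(size=8)
target = np.array([0.0, 1.0, 0.0])

lam, eta, lr = 0.1, 0.05, 0.01   # intensity weight, distortion step, weight step
eps = np.zeros_like(y)           # learned distortion, initialised at zero

def distortion_objective(e):
    """Stand-in for Eq. (11): a complexity surrogate on the distorted
    feature plus lam * ||e||^2 constraining the distortion intensity."""
    return np.max(np.abs(W @ (y + e))) + lam * np.sum(e ** 2)

def num_grad(f, x, h=1e-5):
    """Central-difference gradient (replaces the analytic Eq. (15))."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2.0 * h)
    return g

for step in range(100):
    # (a) Weight step on the ordinary task loss ||W(y + eps) - target||^2.
    y_hat = y + eps
    err = W @ y_hat - target
    W -= lr * 2.0 * np.outer(err, y_hat)
    # (b) Distortion step along the negative gradient, as in Eq. (16).
    eps -= eta * num_grad(distortion_objective, eps)

final_loss = float(np.sum((W @ (y + eps) - target) ** 2))
```

The key design point is the alternation: the weights only ever see the distorted feature, while the distortion only ever descends the complexity objective, so neither update needs to differentiate through the other.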
Table 1: Classification accuracy on CIFAR-10 and CIFAR-100.

Method | CIFAR-10 (%) | CIFAR-100 (%)
CNN | 81.99 | 49.72
CNN + Dropout [21] | 82.95 | 54.19
CNN + Vardrop [13] | 83.15 | 54.53
CNN + Sparse Vardrop [17] | 82.13 | 54.26
CNN + RDdrop [28] | 83.11 | 54.65
CNN + Feature Map Distortion | 85.24 ± 0.08 | 56.23 ± 0.12
Extension to Convolutional Layers
A convolutional layer can be seen as a special fully-connected layer with sparse connections and shared weights. Hence, the distortion can be learned in the same way as in the FC layer. In the following, we focus on distorting the feature maps to reduce the empirical Rademacher complexity in convolutional layers, considering the particularity of convolution operations.
The convolutional kernel of the l-th layer is denoted as K_l, and the corresponding output feature maps before and after the activation function are denoted as Y_l and F_l, respectively, where the kernels and feature maps each have their own height and width. The mask and distortion of the l-th layer have the same dimension as the feature map Y_l, and the distortion is applied to Y_l to get the disturbed feature map Ŷ_l, i.e.
(17) 
Similar to the fully-connected layer, the ERC of the (l+1)-th layer is used to guide the optimization of the distortion in layer l. Given a minibatch together with the masks and distortions, two auxiliary quantities are defined for notational simplicity:
(18)  
(19) 
where ∗ denotes the convolution operation. Eq. (18) is related to the distorted features and the Rademacher variables in the l-th layer, and Eq. (19) applies the convolution operation on it. Given the notation above, the distortion can be derived by minimizing the following objective function:
(20) 
where
(21) 
comes from the simplified implementation of the ERC over a minibatch. As Eq. (21) averages over the spatial dimensions of the feature map, elements at different spatial locations contribute equally to the objective. Thus, the partial derivative of the objective w.r.t. the distortion is:
(22) 
where
(23)  
(24) 
in which the sign of each element is taken. Considering the impact of the Rademacher variables, and similar to the method for FC layers, two random variables are introduced to simplify Eq. (22); they approximate the Rademacher variables and the channel-selection process, respectively. Each element of the former takes −1 or +1 with equal probability, and each element of the latter follows the standard normal distribution. Given the gradient, the distortion is updated in the same way as for FC layers. The algorithm for feature distortion on convolutional layers is similar to Algorithm 1.
Different from the method applied on FC layers, where each element of the binary mask is sampled independently, we draw lessons from DropBlock [6], in which elements in a contiguous square block of a given size in the feature map are distorted simultaneously. We denote the extension of the proposed method to convolutional layers as “block feature map distortion”.
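A block-wise mask of this kind can be sampled as follows; this is a hedged sketch in the spirit of DropBlock, not the paper's exact procedure, and it assumes an odd block_size of at least 3:

```python
import numpy as np

def block_mask(height, width, p=0.1, block_size=3, rng=None):
    """Sample a binary mask where contiguous block_size x block_size
    squares (rather than independent elements) are marked for distortion.

    Assumes block_size is odd and >= 3 so each block has a valid centre.
    """
    rng = rng or np.random.default_rng(0)
    half = block_size // 2
    # Sample block centres, keeping them away from the border so that
    # every block fits entirely inside the feature map.
    centres = rng.random((height, width)) < p
    centres[:half, :] = False
    centres[-half:, :] = False
    centres[:, :half] = False
    centres[:, -half:] = False
    mask = np.zeros((height, width), dtype=bool)
    for i, j in zip(*np.nonzero(centres)):
        mask[i - half:i + half + 1, j - half:j + half + 1] = True
    return mask  # True = location to distort

m = block_mask(16, 16, p=0.05, block_size=3)
```

Because whole squares are marked together, nearby activations cannot trivially compensate for each other, which is the motivation for block-wise masking on spatially correlated convolutional features.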
Experiments
In this section, we conduct experiments on several benchmark datasets to validate the effectiveness of the proposed feature map distortion method. The method is implemented on both FC layers and convolutional layers, which are validated with conventional CNNs and modern CNNs (e.g. ResNet), respectively. In order to use unified hyper-parameters for different layers, the step length is multiplied by the standard deviation of the feature maps in each layer, and we alternately update the distortion and the weights one step each for efficiency. The distortion probability (dropping probability for dropout and DropBlock) increases linearly from 0 to the appointed value following [6].
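The linear ramp of the distortion probability mentioned above can be written as a one-line schedule; the function name and the example numbers are illustrative:

```python
def distortion_prob(step, total_steps, target_p):
    """Linearly increase the distortion probability from 0 to target_p
    over the course of training, then hold it at target_p."""
    return target_p * min(step, total_steps) / total_steps

# e.g. with target_p = 0.1 ramped over 1000 steps:
ps = [distortion_prob(s, 1000, 0.1) for s in (0, 500, 1000, 1500)]
# ps -> [0.0, 0.05, 0.1, 0.1]
```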
Experiments on Fully Connected Layers
Table 2: Classification accuracy on CIFAR-10 and CIFAR-100 with ResNet-56.

Model | CIFAR-10 (%) | CIFAR-100 (%)
ResNet-56 | 93.95 ± 0.09 | 71.81 ± 0.21
ResNet-56 + DropBlock [6] | 94.18 ± 0.14 | 73.08 ± 0.23
ResNet-56 + Block Feature Map Distortion | 94.50 ± 0.15 | 73.71 ± 0.20
To validate the effect of the proposed feature map distortion method implemented on the FC layers, we conduct experiments with a conventional CNN on the CIFAR-10 and CIFAR-100 datasets. The proposed method is compared with multiple state-of-the-art variants of dropout.
Dataset. The CIFAR-10 and CIFAR-100 datasets both contain 60000 natural images of size 32×32, with 50000 images used for training and 10000 for testing. The images are divided into 10 and 100 categories, respectively. 20% of the training data is held out as a validation set. Data augmentation is not used, for fair comparison.
Implementation details. The conventional CNN has three convolutional layers with 96, 128 and 256 filters, respectively. Each layer consists of a convolution with stride 1 followed by a max-pooling operation with stride 2. The features are then fed to two fully-connected layers with 2048 hidden units each. We implement the distortion method on each FC layer. The distortion probability is selected from {0.4, 0.5, 0.6} and the step length is set to 5. The model is trained for 500 epochs with batch size 128. The learning rate is initialized to 0.01 and decayed by a factor of 10 at epochs 200, 300 and 400. We run our method 5 times with different random seeds and report the average accuracy with standard deviation.
Compared methods. The CNN model trained without extra regularization tricks is used as the baseline. Furthermore, we compare our method with the widely used dropout method [11] and several state-of-the-art variants, including Vardrop [13], Sparse Vardrop [17] and RDdrop [28].
Results. The test accuracies on both CIFAR-10 and CIFAR-100 are summarized in Table 1. The proposed feature map distortion method is superior to the compared methods by a large margin on both datasets. The CNN trained with the proposed method achieves an accuracy of 85.24%, improving over the state-of-the-art RDdrop method by 2.13% on CIFAR-10 and 1.58% on CIFAR-100. This shows that the proposed feature map distortion method can reduce the empirical Rademacher complexity effectively while preserving the representation power of the model, resulting in better test performance.
Experiments on Convolutional Layers
It is particularly important to apply the proposed method to convolutional layers, since modern CNNs such as ResNet consist mostly of convolutional layers. In this section, we apply the proposed method to convolutional layers and conduct experiments on both the CIFAR-10 and CIFAR-100 datasets.
Implementation details. The widely used ResNet-56 [10], which contains three groups of blocks, is used as the baseline model, and DropBlock [6] as the peer competitor. Both the proposed block feature map distortion method and DropBlock are applied after each convolutional layer in the last group with block_size = 6, and the distortion probability (dropping probability for DropBlock) is selected from a candidate set. The step length is set to 30 empirically. Standard data augmentation including random cropping, horizontal flipping and rotation (within 15 degrees) is applied during training. The networks are trained for 200 epochs with batch size 128 and weight decay 5e-4. The initial learning rate is set to 0.1 and decayed by a factor of 5 at epochs 60, 120 and 160. All methods are repeated 5 times with different random seeds, and the average accuracies with standard deviations are reported.
Results. The results on both CIFAR-10 and CIFAR-100 are shown in Table 2. The proposed method is superior to DropBlock, improving the accuracy by 0.32% and 0.63%, respectively. This shows that the proposed feature map distortion method suits convolutional layers and can improve the performance of modern network structures.
Training curve. The training curves on CIFAR-100 are shown in Figure 1. The solid and dotted lines denote the test and training stages respectively, while the red and blue lines denote the proposed feature map distortion method and the baseline model. At convergence, the baseline ResNet-56 suffers from overfitting, achieving higher training accuracy but lower test accuracy, while the proposed feature map distortion method overcomes this problem and achieves higher test accuracy, demonstrating improved generalization ability.
Table 3: Top-1 and top-5 accuracy on the ImageNet validation set with ResNet-50.

Model | Top-1 Accuracy (%) | Top-5 Accuracy (%)
ResNet-50 | 76.51 ± 0.07 | 93.20 ± 0.05
ResNet-50 + Dropout [21] | 76.80 ± 0.04 | 93.41 ± 0.04
ResNet-50 + DropPath [16] | 77.10 ± 0.08 | 93.50 ± 0.05
ResNet-50 + SpatialDropout [23] | 77.41 ± 0.04 | 93.74 ± 0.02
ResNet-50 + Cutout [4] | 76.52 ± 0.07 | 93.21 ± 0.04
ResNet-50 + AutoAugment [2] | 77.63 | 93.82
ResNet-50 + Label Smoothing [22] | 77.17 ± 0.05 | 93.45 ± 0.03
ResNet-50 + DropBlock [6] | 78.13 ± 0.05 | 94.02 ± 0.02
ResNet-50 + Feature Map Distortion | 77.71 ± 0.05 | 93.89 ± 0.04
ResNet-50 + Block Feature Map Distortion | 78.76 ± 0.05 | 94.33 ± 0.03
Feature map distortion vs. DropBlock. The test accuracies of our method (red) and DropBlock (green) under various distortion probabilities (dropping probabilities) on CIFAR-100 are shown in Figure 2(a). Increasing the drop probability strengthens the regularization effect, and the test accuracy improves when the probability is set within an appropriate range. Note that our method outperforms DropBlock over a larger range of probabilities, which demonstrates the superiority of feature map distortion.
Test accuracy vs. accuracy gap. Figures 2(b) and (c) show how the test accuracy (red) and the gap between training and test accuracies (blue) vary with the distortion probability and the step length. A larger distortion probability means that more locations of the feature maps are distorted, while the step length controls the intensity of the disturbance at each location. Increasing either brings stronger regularization, resulting in a smaller gap between training and test accuracies, i.e. stronger generalization. However, disturbing too many locations, or a location with too much intensity, may destroy the representation power and harm the final test accuracy. Instead of using the fixed intensity of conventional dropout and DropBlock, our method applies distortion of proper intensity at proper locations and achieves better performance.
Experiments on ImageNet Dataset
In this section, we conduct experiments on the large-scale ImageNet dataset and implement the feature map distortion method on top of conventional dropout and the recent DropBlock method, denoted “Feature Map Distortion” and “Block Feature Map Distortion”, respectively.
Dataset. The ImageNet dataset contains 1.2M training images and 50000 validation images from 1000 categories. Standard data augmentation including random cropping and horizontal flipping is applied to the training data.
Implementation details. We follow the experimental settings in [6] for fair comparison. The prevalent ResNet-50 is used as the baseline model. The distortions are applied to the feature maps after both convolutional layers and skip connections in the last two groups. The step length is set empirically. For feature map distortion based on conventional dropout, the distortion probability (dropping probability) is set to 0.5 as suggested by [21]. For block feature map distortion, block_size and the distortion probability are set to 6 and 0.05 following [6]. We report single-crop top-1 and top-5 accuracies on the validation set and repeat each method three times with different random seeds.
Compared methods. Multiple state-of-the-art regularization methods are compared, including dropout-based methods, data augmentation and label smoothing. DropPath [16], SpatialDropout [23] and DropBlock [6] are state-of-the-art variants of dropout. Data augmentation methods including Cutout [4] and AutoAugment [2], and label smoothing [22], are prevalent regularization techniques for alleviating overfitting.
Results. As shown in Table 3, the proposed feature distortion method not only improves deep neural networks using the conventional dropout method, but also enhances the recent DropBlock method, since our method is well adapted to convolutional layers. Feature map distortion improves the accuracy from 76.80% to 77.71% compared to conventional dropout, and the block feature map distortion method achieves a top-1 accuracy of 78.76%, surpassing the other state-of-the-art methods by a large margin. The results demonstrate that our method can increase the generalization ability while preserving the useful information of the original features.
Conclusion
Dropout-based methods have been successfully used for enhancing the generalization ability of deep neural networks. However, eliminating some of the units in a neural network is a heuristic approach to minimizing the gap between the expected and empirical risks of the resulting network, and is rarely optimal in practice. We propose to embed learned distortions into the feature maps of a given deep neural network by exploiting the Rademacher complexity. We further extend the proposed method to convolutional layers and detail the corresponding feed-forward and back-propagation procedures, so that the method can be employed in any off-the-shelf deep neural architecture. Extensive experimental results show that the feature distortion technique can be easily embedded into mainstream deep networks to achieve better performance on benchmark datasets than conventional approaches.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants No. 61876007 and 61872012, and by the Australian Research Council under Project DE180101438.
References
[1] (2013) Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084–3092.
[2] (2018) AutoAugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[3] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[4] (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
[5] (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
[6] (2018) DropBlock: a regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pp. 10727–10737.
[7] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[8] (2016) The optimal sample complexity of PAC learning. The Journal of Machine Learning Research 17 (1), pp. 1319–1333.
[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
[11] (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[12] (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468.
[13] (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.
[14] (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 (1), pp. 1–50.
[15] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[16] (2016) FractalNet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.
[17] (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507.
[18] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[19] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[20] (1998) VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences 168, pp. 69–96.
[21] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[22] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[23] (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656.
[24] (2013) Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066.
[25] (2019) Language models with transformers. CoRR abs/1904.09408.
[26] (2018) Learning versatile filters for efficient convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1608–1618.
[27] (2018) Packing convolutional neural networks in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[28] (2018) Adaptive dropout with Rademacher complexity regularization.