1 Introduction
Convolutional neural networks (CNNs) are powerful tools for computer vision tasks. With the help of gradually increasing depth and width, CNNs [5, 6, 7, 22, 19] gain significant improvements on image classification problems by capturing multiscale features [24]. However, when the number of trainable parameters far exceeds the number of training samples, deep networks may suffer from overfitting. This leads to the routine usage of regularization methods such as data augmentation [2], weight decay [10], Dropout [15], and Batch Normalization [9] to prevent overfitting and improve generalization.
Although regularization has been an essential part of deep learning, deciding which regularization methods to use remains an art. Even if each regularization method works well on its own, combining them does not always improve performance. For instance, a network trained with both Dropout and Batch Normalization may not produce a better result [9]: Dropout may change the statistical variance of a layer's output when switching from training to testing, while Batch Normalization requires the variance to remain the same during both stages [12].
1.1 Our contributions
To deal with the aforementioned challenges, we propose a novel regularization method, DropActivation, inspired by the works in [15, 3, 8, 21, 17], where some structures of networks are dropped to achieve better generalization. The advantages are as follows:

DropActivation provides an easy-to-implement yet effective method for regularization via implicit parameter reduction.

DropActivation can be used in synergy with most popular architectures and regularization methods, leading to improved performance in various datasets.
The basic idea of DropActivation is that the nonlinearities in the network are randomly activated or deactivated during training. More precisely, the nonlinear activations are turned into identity mappings with a certain probability, as shown in Figure 1. At testing time, we propose using a deterministic neural network with a new activation function, a convex combination of the identity mapping and the dropped nonlinearity, in order to represent the ensemble average of the random networks generated by DropActivation.

The starting point of DropActivation is to randomly ensemble a large class of neural networks, each using either an identity or a ReLU activation function. Training with DropActivation identifies a set of parameters such that the various neural networks in this class all work well when assigned these parameters, which prevents overfitting to a single fixed network. DropActivation can also be understood as adding noise to the training process for regularization. Indeed, our theoretical analysis will show that DropActivation implicitly adds a penalty term to the loss function that favors network parameters for which the deep neural network can be approximated by a shallower one, i.e., implicit parameter reduction.

1.2 Organization
The remainder of this paper is structured as follows. In Section 2, we review some regularization methods and discuss their relations to our work. In Section 3, we formally introduce DropActivation. In Section 4, we demonstrate the regularization effect of DropActivation and its synergy with other regularization approaches on the datasets CIFAR-10, CIFAR-100, SVHN, and EMNIST. In Section 5, these advantages of DropActivation are further supported by our theoretical analyses.
2 Related work
Various regularization methods have been proposed to reduce the risk of overfitting. Data augmentation achieves regularization by directly enlarging the original training dataset via randomly transforming the input images [11, 14, 3, 2] or the output labels [25, 18]. Another class of methods regularizes the network by adding randomness into various neural-network structures such as nodes [15], connections [17], pooling layers [23], activations [20], and residual blocks [4, 8, 21]. In particular, [15, 3, 8, 21, 17] add randomness by dropping some structures of the neural network during training. We focus on reviewing this class of methods as they are most relevant to our method, in which the nonlinear activation functions are discarded randomly.
Dropout [15] drops nodes along with their connections with some fixed probability during training. DropConnect [17] has a similar idea but masks out some weights randomly. [8] improves the performance of ResNet [5] by dropping entire residual blocks at random during training and passing through the skip connections (identity mappings). The randomness of dropping entire blocks amounts to training a shallower network in expectation. This idea is also used in [21] when training two-branch residual networks of the ResNeXt [19] type. The idea of dropping also arises in data augmentation. Cutout [3] randomly cuts out a square region of the training images. In other words, it drops the input nodes in a patchwise fashion, which prevents the model from putting too much emphasis on a specific region of features.
In the next section, inspired by the above methods, we propose the DropActivation method for regularization. We want to emphasize that the improvement brought by DropActivation is universal across most neural-network architectures, and it can readily be used in conjunction with other regularizers without conflicts.
3 DropActivation
This section describes the DropActivation method. Suppose $x_0 \in \mathbb{R}^{d_0}$ is an input vector of an $L$-layer feedforward network. Let $x_\ell \in \mathbb{R}^{d_\ell}$ be the output of the $\ell$-th layer. Let $\mathcal{A}$ be the elementwise nonlinear activation operator that maps an input vector to an output vector by applying a nonlinearity to each entry of the input. Without loss of generality, we assume the same scalar nonlinearity $\sigma$ is applied to every entry, i.e.,
$$\mathcal{A}(x) = \big(\sigma(x^{(1)}), \ldots, \sigma(x^{(d)})\big)^{\mathsf T}, \qquad (1)$$
where $\sigma$ could be a rectified linear unit (ReLU), a sigmoid, or a tanh function. For a standard fully connected or convolutional network, the $d_\ell$-dimensional output of the $\ell$-th layer can be written as
$$x_\ell = \mathcal{A}(W_\ell x_{\ell-1}), \qquad (2)$$
where $W_\ell$ is the weight matrix of the $\ell$-th layer. Biases are neglected for convenience of presentation.
In what follows, we modify the way the nonlinear activation operator is applied in order to achieve regularization. In the training phase, we randomly remove the pointwise nonlinearities in $\mathcal{A}$. In the testing phase, $\mathcal{A}$ is replaced with a new deterministic operator.
Training Phase: During training, each nonlinearity in the operator $\mathcal{A}$ is kept with probability $p$ (or dropped with probability $1-p$). The output of the $\ell$-th layer is thus
$$x_\ell = \big(\mathrm{diag}(z)\,\mathcal{A} + (I - \mathrm{diag}(z))\big)(W_\ell x_{\ell-1}), \qquad (3)$$
where $z = (z_1, \ldots, z_{d_\ell})$ and the $z_i$ are independent and identically distributed random variables following a Bernoulli distribution $\mathcal{B}(p)$ that takes value $1$ with probability $p$ and $0$ with probability $1-p$. We use $I$ to denote the identity matrix. Intuitively, when $z = (1, \ldots, 1)$, then $x_\ell = \mathcal{A}(W_\ell x_{\ell-1})$, meaning all the nonlinearities in this layer are kept. When $z = (0, \ldots, 0)$, then $x_\ell = W_\ell x_{\ell-1}$, meaning all the nonlinearities are dropped. The general case lies between these two limits, where the nonlinearities are partially kept or dropped. At each training iteration, a new realization of $z$ is sampled from the Bernoulli distribution. If the nonlinear activation function in Eqn. (3) is ReLU, the $i$-th component of $x_\ell$ can be written as
$$[x_\ell]_i = z_i \max(y_i, 0) + (1 - z_i)\,y_i, \qquad y = W_\ell x_{\ell-1}. \qquad (4)$$
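As a concrete illustration, the training-time rule in Eqn. (4) can be sketched in a few lines of NumPy. This is a minimal sketch for exposition, not the authors' implementation; in practice the mask would be sampled inside a deep-learning framework so that gradients flow through it.

```python
import numpy as np

def drop_activation_train(y, p, rng):
    """Training-time DropActivation for ReLU, Eqn. (4).

    Each coordinate keeps its ReLU nonlinearity with probability p
    (z_i = 1) and becomes the identity map with probability 1 - p (z_i = 0).
    """
    z = rng.binomial(1, p, size=y.shape)          # z_i ~ Bernoulli(p)
    return z * np.maximum(y, 0.0) + (1 - z) * y   # z_i*ReLU(y_i) + (1-z_i)*y_i

rng = np.random.default_rng(0)
y = np.array([-2.0, -0.5, 0.3, 1.5])              # pre-activations W_l x_{l-1}
out = drop_activation_train(y, p=0.95, rng=rng)
```

Note that positive entries always pass through unchanged, since ReLU and the identity agree there; only negative entries are randomly zeroed, each with probability $p$.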
Testing Phase: During testing, we use a deterministic nonlinear operator obtained by averaging over the realizations of $z$. More precisely, we take the expectation of Eqn. (3) with respect to the random vector $z$:
$$x_\ell = \big(p\,\mathcal{A} + (1-p)\,I\big)(W_\ell x_{\ell-1}), \qquad (5)$$
so the new activation operator is the convex combination of the identity operator $I$ and the activation operator $\mathcal{A}$. Eqn. (5) is the deterministic nonlinearity used to generate a deterministic neural network for testing. In particular, if ReLU is used, then the new activation is the Leaky ReLU with slope $1-p$ on the negative axis [20].
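The identity between the test-time rule and the average of the training-time rule can be checked numerically. The sketch below (illustrative code, not the authors' implementation) verifies that $p\,\mathrm{ReLU}(y) + (1-p)\,y$ equals the Leaky ReLU with slope $1-p$, and that it matches a Monte Carlo average of the stochastic rule in Eqn. (4):

```python
import numpy as np

def drop_activation_test(y, p):
    # Eqn. (5) with ReLU: p*ReLU(y) + (1-p)*y.
    return p * np.maximum(y, 0.0) + (1 - p) * y

def leaky_relu(y, slope):
    # Leaky ReLU: identity on the positive axis, slope on the negative axis.
    return np.where(y >= 0, y, slope * y)

p = 0.95
y = np.linspace(-3.0, 3.0, 13)

# Monte Carlo average of the stochastic training rule, Eqn. (4):
# E_z[z*ReLU(y) + (1-z)*y] with z ~ Bernoulli(p), estimated over 200k masks.
rng = np.random.default_rng(0)
z = rng.binomial(1, p, size=(200_000, y.size))
mc_average = (z * np.maximum(y, 0.0) + (1 - z) * y).mean(axis=0)
```

The Monte Carlo average converges to the deterministic test-time activation, which is the sense in which the test network represents the ensemble average of the random training networks.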
4 Experiments
In this section, we empirically evaluate the performance of DropActivation and demonstrate its effectiveness. We apply DropActivation to modern deep neural architectures such as ResNet [5], PreResNet [6], DenseNet [7], ResNeXt [19], and WideResNet [22] on a series of datasets including CIFAR-10, CIFAR-100 [10], SVHN [13], and EMNIST [1]. This section is organized as follows. Section 4.1 describes the experiment design. In Section 4.2, we introduce the datasets and implementation details. In Section 4.3, we present the numerical results.
4.1 Experiment Design
Our experiments are to demonstrate the following points:

Comparison with RReLU: Due to the similarity between the activation function used in our proposed method (Eqn. (4)) and randomized leaky rectified linear units (RReLU), one may speculate that using RReLU gives similar performance. We show that this is not the case by comparing DropActivation with RReLU.

Improvement to modern neural network architectures: We show the improvement that DropActivation brings is rather universal by applying it to different modern network architectures on a variety of datasets.

Compatibility with other approaches: We show that DropActivation is compatible with other popular regularization methods by combining them in different network architectures.
4.1.1 Comparison with RReLU
Xu et al. proposed RReLU [20] with the following training scheme for an input vector $x$: the $i$-th output component is
$$f(x_i) = \begin{cases} x_i, & x_i \ge 0,\\ a_i\, x_i, & x_i < 0, \end{cases} \qquad (6)$$
where $a_i$ is a random variable following a uniform distribution $U(l, u)$ with $0 \le l < u < 1$. In the case of ReLU in DropActivation, a comparison of Eqn. (4) with Eqn. (6) shows that the main difference between our approach and RReLU is the random variable used on the negative axis. It can be seen from Eqn. (6) that RReLU passes negative inputs with a random shrinking rate, while DropActivation randomly lets the complete information pass. We compare DropActivation with RReLU using the architectures ResNet, PreResNet, and WideResNet on CIFAR-10 and CIFAR-100. The parameters $l$ and $u$ in RReLU are set to 1/8 and 1/3, respectively, as suggested in [20].

4.1.2 Improvement to modern neural network architectures
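The contrast between the two randomizations on the negative axis can be made explicit in code. In the sketch below (illustrative only, with `l` and `u` the RReLU parameters above), RReLU always shrinks a negative input by a random factor, whereas DropActivation either passes it completely or zeroes it:

```python
import numpy as np

def rrelu_train(y, l, u, rng):
    # RReLU, Eqn. (6): negative inputs are scaled by a random slope a ~ U(l, u).
    a = rng.uniform(l, u, size=y.shape)
    return np.where(y >= 0, y, a * y)

def drop_activation_train(y, p, rng):
    # DropActivation, Eqn. (4): a negative input passes completely (slope 1)
    # with probability 1 - p, and is zeroed (ReLU branch) with probability p.
    z = rng.binomial(1, p, size=y.shape)
    return z * np.maximum(y, 0.0) + (1 - z) * y

rng = np.random.default_rng(0)
y = -np.ones(100_000)                     # probe the negative axis
r = rrelu_train(y, 1 / 8, 1 / 3, rng)     # every output lies in [-1/3, -1/8]
d = drop_activation_train(y, 0.95, rng)   # outputs are exactly -1 or 0
```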
Residual-type neural network structures greatly facilitate the optimization of deep neural networks [5] and are employed by ResNet [5], PreResNet [6], DenseNet [7], ResNeXt [19], and WideResNet [22]. We demonstrate that DropActivation works well with these modern architectures. Moreover, since these networks use Batch Normalization to accelerate training and may contain Dropout to improve generalization (WideResNet), these experiments also show the ability of DropActivation to work in synergy with prevalent training techniques. When applying DropActivation to these models, we directly substitute the original ReLU function with Eqn. (4) during training and with the Leaky ReLU of slope $1-p$ during testing.
4.1.3 Compatibility with other regularization approaches
To further show that DropActivation cooperates well with other training techniques, we combine DropActivation with two other popular data augmentation approaches: Cutout [3] and AutoAugment [2]. Cutout randomly masks out a square region of the training images, and AutoAugment uses reinforcement learning to obtain an improved data augmentation scheme. We implement DropActivation in combination with Cutout and AutoAugment on WideResNet and ResNet for CIFAR-100.
4.2 Datasets and implementation details
4.2.1 Choosing the probability of retaining activation:
In our method, the only parameter that needs to be tuned is the probability $p$ of retaining the activation. To get a rough estimate of a good $p$, we train a simple network on CIFAR-10 and perform a grid search for $p$. When $p = 1$, DropActivation reduces to the standard ReLU. The simple network consists of the following layers: we first stack three blocks, each containing a convolution layer, Batch Normalization, ReLU, and average pooling; these are followed by two fully connected layers. Figure 2 shows the testing error on CIFAR-10 versus $p$, which is minimal at $p = 0.95$. Each data point is averaged over the outcomes of several independently trained networks. Based on this observation, we choose $p = 0.95$ for all experiments.

CIFAR: Both CIFAR-10 and CIFAR-100 contain 60k natural color images of size 32 by 32, with 50k images for training and 10k for testing. CIFAR-10 has ten classes of objects with 6k images per class. CIFAR-100 is similar to CIFAR-10, except that it includes 100 classes with 600 images per class. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data as in [5]. For CIFAR-10, we train the models ResNet-110, PreResNet-164, DenseNet-BC-100-12, DenseNet-BC-190-40, ResNeXt-29-8x64d, and WideResNet-28-10. For CIFAR-100, we train the same models except that ResNet-110 is replaced with ResNet-164 using the bottleneck layer in [6]. We use the same hyperparameters as in the original papers, except for the batch size of DenseNet-BC-190-40. The models are optimized using SGD with momentum [16].
SVHN: The Street View House Numbers (SVHN) dataset contains ten classes of color digit images of size 32 by 32. There are about 73k training images, 26k testing images, and an additional 531k images. The training and additional images are used together for training, for a total of over 600k training images. An image in SVHN may contain more than one digit, and the recognition task is to identify the digit at the center of the image. We preprocess the images following [22]: the pixel values are rescaled to $[0, 1]$, and no data augmentation is applied. For this dataset, we train the models WideResNet-16-8, DenseNet-BC-100-12, and ResNeXt-29-8x64d. We train WideResNet-16-8 and DenseNet-BC-100-12 as in [22, 7]. For ResNeXt, we train for 100 epochs, with the learning rate initially set to 0.1 and decreased by a factor of 10 after the 40th and 70th epochs; the remaining hyperparameters are set as in [19] for training ResNeXt on CIFAR-10.

EMNIST: EMNIST is a set of grayscale images of handwritten English characters and digits. There are six different splits of this dataset, and we use the Balanced split, which contains 131,600 images in total (112,800 for training and 18,800 for testing) over 47 distinct classes. For this classification task, we train the models ResNet-164, PreResNet-164, WideResNet-28-10, DenseNet-BC-100-12, and ResNeXt-29-8x64d using the hyperparameter settings for CIFAR-100 in [5, 6, 22, 7, 19], respectively.
4.3 Experiment Results
Tables 1, 2, 3, and 4 show the testing errors on CIFAR-100, CIFAR-10, SVHN, and EMNIST, respectively. The baseline results are from the original networks without DropActivation. In what follows, we discuss how our results support the points raised in Section 4.1.
4.3.1 Comparison with RReLU
As shown in Tables 1 and 2, RReLU may perform worse than the baseline, e.g., in the case of WideResNet. In contrast, DropActivation consistently outperforms RReLU and almost all baselines. Although DropActivation does not reduce the testing error of ResNeXt-29-8x64d on CIFAR-10, DenseNet-BC-190-40 with DropActivation achieves a testing error smaller than that of the original ResNeXt-29-8x64d.
Table 1: Test error (%) on CIFAR-100.
model  Baseline  RReLU  DropAct
ResNet-164  25.16  24.15  23.88
PreResNet-164  24.33  23.22  22.72
WideResNet-28-10  18.85  19.63  18.14
DenseNet-BC-100-12  22.27  –  21.71
DenseNet-BC-190-40  17.18  –  16.92
ResNeXt-29-8x64d  17.77  –  17.68
Table 2: Test error (%) on CIFAR-10.
model  Baseline  RReLU  DropAct
ResNet-110  6.43  7.66  6.17
PreResNet-164  5.46  5.33  4.87
WideResNet-28-10  3.89  4.31  3.74
DenseNet-BC-100-12  4.51  –  4.40
DenseNet-BC-190-40  3.75  –  3.45
ResNeXt-29-8x64d  3.65  –  4.16
4.3.2 Application to modern models:
As shown in Tables 1 and 3, DropActivation consistently improves the testing accuracy over the baselines on CIFAR-100 and SVHN. The conclusion remains the same in Tables 2 and 4 for CIFAR-10 and EMNIST, except for one case on each dataset, and the magnitude of the deterioration in those cases is relatively small. In particular, DropActivation improves ResNet, PreResNet, and WideResNet by reducing the relative test error on CIFAR-10, CIFAR-100, or SVHN by over 3.5%.
Therefore, DropActivation can work with most modern networks for different datasets. Besides, our results implicitly show that DropActivation is compatible with regularization techniques such as Batch Normalization or Dropout used in training these networks.
Table 3: Test error (%) on SVHN.
model  Baseline  DropAct
WideResNet-16-8  1.54  1.46
DenseNet-BC-100-12  1.76  1.71
ResNeXt-29-8x64d  1.79  1.69
Table 4: Test error (%) on EMNIST.
model  Baseline  DropAct
ResNet-164  8.85  8.82
PreResNet-164  8.88  8.72
WideResNet-28-10  8.97  8.72
DenseNet-BC-100-12  8.81  8.90
ResNeXt-29-8x64d  9.07  8.91
4.3.3 Compatibility with other regularization approaches:
We apply DropActivation to network models that use Cutout or AutoAugment. As shown in Tables 5 and 6, DropActivation can further improve ResNet and WideResNet combined with Cutout or AutoAugment, decreasing the test error by over 0.5%. To the best of our knowledge, AutoAugment achieves the state-of-the-art result on CIFAR-100 using PyramidNet+ShakeDrop [21]. Due to limited computing resources, the models with PyramidNet+ShakeDrop+DropAct and other possible combinations are still under training.
Table 5: Test error (%) on CIFAR-100 with Cutout.
model  Baseline  Cutout  Cutout+DropAct
ResNet-18
WideResNet-28-10
Table 6: Test error (%) on CIFAR-100 with AutoAugment.
model  Baseline  AutoAug  AutoAug+DropAct
ResNet-164  25.16  21.12  20.39
WideResNet-28-10  18.85  17.09  16.20
5 Theoretical Analysis
In Section 5.1, we show that in a one-hidden-layer neural network, DropActivation regularizes the network by penalizing the difference between a deep and a shallow network, which can be understood as implicit parameter reduction, i.e., the intrinsic dimension of the parameter space is smaller than that of the original parameter space. In Section 5.2, we further show that DropActivation does not interfere with other techniques such as Batch Normalization, which ensures the practicality of using DropActivation.
5.1 DropActivation as a regularizer
In this section, we show that applying DropActivation to a standard one-hidden-layer fully connected neural network with ReLU activation gives rise to an explicit regularizer.
Let $x \in \mathbb{R}^{d}$ be the input vector and $y$ be the target output. The output of the one-hidden-layer neural network with ReLU activation is $\hat{y} = W_2\,\mathrm{ReLU}(W_1 x)$, where $W_1$ and $W_2$ are the weights of the network and $\mathrm{ReLU}(\cdot)$ applies ReLU elementwise to its input vector. Let $\sigma_{1-p}$ denote the Leaky ReLU with slope $1-p$ on the negative part. As in Eqns. (3) and (5), applying DropActivation to this network gives
$$\hat{y} = W_2\big(\mathrm{diag}(z)\,\mathrm{ReLU}(W_1 x) + (I - \mathrm{diag}(z))\,W_1 x\big) \qquad (7)$$
during training, and
$$\hat{y} = W_2\big(p\,\mathrm{ReLU}(W_1 x) + (1-p)\,W_1 x\big) = W_2\,\sigma_{1-p}(W_1 x) \qquad (8)$$
during testing.
Suppose we have $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$. To reveal the effect of DropActivation, we average the training loss function over $z$:
$$\min_{W_1, W_2}\ \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_z\,\big\| y_i - W_2\big(\mathrm{diag}(z)\,\mathrm{ReLU}(W_1 x_i) + (I - \mathrm{diag}(z))\,W_1 x_i\big) \big\|_2^2, \qquad (9)$$
where the expectation is taken with respect to the feature noise $z$. The use of DropActivation can be seen as applying stochastic minimization to this averaged loss. The result of averaging the loss function over $z$ is summarized as follows.
Property 5.1
The optimization problem (9) is equivalent to
$$\min_{W_1, W_2}\ \frac{1}{n}\sum_{i=1}^{n} \big\| y_i - W_2\,\sigma_{1-p}(W_1 x_i) \big\|_2^2 + \frac{p(1-p)}{n}\sum_{i=1}^{n} \big\| W_2\,\mathrm{diag}\big(\mathrm{ReLU}(W_1 x_i) - W_1 x_i\big) \big\|_F^2. \qquad (10)$$
The proof of Property 5.1 can be found in the Supplementary Material. The first term is nothing but the prediction-time loss $\frac{1}{n}\sum_{i=1}^{n}\|y_i - \hat{y}_i\|_2^2$, where the $\hat{y}_i$'s are defined via (8). Therefore, Property 5.1 shows that DropActivation incurs the penalty
$$\frac{p(1-p)}{n}\sum_{i=1}^{n} \big\| W_2\,\mathrm{diag}\big(\mathrm{ReLU}(W_1 x_i) - W_1 x_i\big) \big\|_F^2 \qquad (11)$$
on top of the prediction loss. In Eqn. (11), the coefficient $p(1-p)$ controls the magnitude of the penalty. In our experiments, $p$ is selected to be close to $1$ (typically $0.95$), resulting in a rather small regularization.
The penalty (11) compares, entry by entry, the nonlinear hidden representation $\mathrm{ReLU}(W_1 x_i)$ with the linear one $W_1 x_i$, weighted by the columns of $W_2$; it therefore encourages the output $W_2\,\mathrm{ReLU}(W_1 x_i)$ of the nonlinear network to stay close to the output $W_2 W_1 x_i$ of a network with no nonlinearity. Since $W_2 W_1 x$ contains no nonlinearity, it can be viewed as a shallow network, whereas $W_2\,\mathrm{ReLU}(W_1 x)$, which contains the nonlinearity, can be considered a deep network; the two networks share the same parameters $W_1$ and $W_2$. Thus the penalty (11) encourages weights for which the prediction of the relatively deep network is somewhat close to that of a shallow network. In a classification or regression task, the shallow network has less representation power, but its lower parameter complexity yields mappings with better generalization properties. In this way, the penalty incurred by DropActivation may help reduce overfitting via implicit parameter reduction.
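Property 5.1 can be checked numerically: for a small hidden width $m$, the expectation over $z$ can be computed exactly by enumerating all $2^m$ Bernoulli masks, and it matches the prediction loss plus the penalty term. The sketch below assumes the one-hidden-layer setting of this section with squared loss; all dimensions and names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, m, k, p = 4, 6, 3, 0.95           # input dim, hidden width, output dim, keep prob.
W1, W2 = rng.standard_normal((m, d)), rng.standard_normal((k, m))
x, y = rng.standard_normal(d), rng.standard_normal(k)

b = W1 @ x                           # pre-activation W1 x
a = np.maximum(b, 0.0)               # ReLU(W1 x)

# Exact expectation of the training loss over all 2^m masks z.
expected_loss = 0.0
for mask in itertools.product([0, 1], repeat=m):
    z = np.array(mask)
    weight = np.prod(np.where(z == 1, p, 1 - p))   # P(z) for i.i.d. Bernoulli(p)
    v = z * a + (1 - z) * b                        # hidden output under mask z
    expected_loss += weight * np.sum((y - W2 @ v) ** 2)

# Prediction loss with the test-time activation (8) plus the penalty (11).
pred = W2 @ (p * a + (1 - p) * b)
penalty = p * (1 - p) * np.sum((a - b) ** 2 * np.sum(W2 ** 2, axis=0))
closed_form = np.sum((y - pred) ** 2) + penalty
```

The exact agreement (up to floating-point error) of `expected_loss` and `closed_form` is precisely the single-sample version of the identity between (9) and (10).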
To illustrate this point, we perform a simple regression task for two functions. In Figure 3(a), the ground truth function (blue) is a smooth function. To generate the training dataset, we sample 20 points $x_i$ on an interval and let
$$y_i = f(x_i) + \varepsilon_i, \qquad (12)$$
where $\varepsilon_i$ denotes sampling noise. Then we train a fully connected network with three hidden layers of widths 1000, 800, and 200, respectively. As shown in Figure 3(a), the network with the standard ReLU has a low prediction error on the training points but is generally erroneous in other regions. Although the network with DropActivation does not fit the training data as well (compared with the standard ReLU), overall it achieves a better prediction error. In Figure 3(b), we show the regression results for a piecewise constant function, which can be viewed as a one-dimensional classification problem. We again see that the network using the standard ReLU has a large test error near the left and right boundaries, where there are fewer training data. With the incurred penalty (11), however, the network with DropActivation yields a smooth curve. Furthermore, DropActivation reduces the influence of the data noise.
In another experiment, we train ResNet-164 on CIFAR-100 to demonstrate the regularization property of DropActivation. In Figure 4, the training error with DropActivation is slightly larger than that without DropActivation. However, in terms of generalization error, DropActivation gives improved performance. This suggests that the original network is overparametrized and that DropActivation regularizes it through implicit parameter reduction.
5.2 Compatibility of DropActivation with Batch Normalization
In this section, we show theoretically that DropActivation essentially preserves the statistical properties of each layer's output when going from the training to the testing phase, and hence it can be used together with Batch Normalization. [12] argues that Batch Normalization assumes the output of each layer has the same variance during training and testing, but Dropout shifts the output variance at testing time, leading to disharmony when the two are used in conjunction. Using an analysis similar to [12], we show that, unlike Dropout, DropActivation maintains the output variance and can therefore be combined with Batch Normalization.
To this end, we analyze the mappings in ResNet [5]. Figure 5 (left) shows a basic block of ResNet, while Figure 5 (right) shows a basic block with DropActivation. We focus on the rectangular box with the dashed line. Suppose the output of the Batch Normalization layer shown in Figure 5 is $x = (x_1, \ldots, x_d)^{\mathsf T}$, where the $x_i$ are i.i.d. random variables. When $x$ is passed to the DropActivation layer followed by a linear transformation with weights $W$, we obtain, during training,
$$u = W\big(\mathrm{diag}(z)\,\mathrm{ReLU}(x) + (I - \mathrm{diag}(z))\,x\big), \qquad (13)$$
where $z = (z_1, \ldots, z_d)$ with $z_i \sim \mathcal{B}(p)$ i.i.d. Similarly, during testing, taking the expectation of (13) over the $z_i$'s gives
$$\bar{u} = W\big(p\,\mathrm{ReLU}(x) + (1-p)\,x\big). \qquad (14)$$
The output $u$ of the rectangular box ($\bar{u}$ during testing) is then used as the input to the next Batch Normalization layer in Figure 5. Since for Batch Normalization we only need to understand the entrywise statistics of its input, without loss of generality we assume the linear transformation maps a vector from $\mathbb{R}^d$ to $\mathbb{R}$, so that $W = w^{\mathsf T}$ is a row vector and $u$, $\bar{u}$ are scalars.
We want to show that $u$ and $\bar{u}$ have similar statistics. By design, $\mathbb{E}[u] = \mathbb{E}[\bar{u}]$. Notice that the expectation here is taken with respect to both the random variables $z_i$ and the input $x$ of the box in Figure 5. Thus the main question is whether the variances of $u$ and $\bar{u}$ are the same. To this end, we introduce the shift ratio [12],
$$\Delta := \frac{\mathrm{Var}(\bar{u})}{\mathrm{Var}(u)},$$
as a metric for evaluating the variance shift. The shift ratio is expected to be close to $1$, since the Batch Normalization layer requires its input to have similar variance at training and testing time.

Property 5.2
The shift ratio satisfies
$$\Delta = \frac{\mathrm{Var}(\bar{u})}{\mathrm{Var}(\bar{u}) + p(1-p)\sum_{j=1}^{d} w_j^2\,\mathbb{E}\big[\min(x_j, 0)^2\big]}. \qquad (15)$$
The proof of Property 5.2 is provided in the Supplementary Material. By Eqn. (15), the shift ratio lies in the interval $(0, 1]$. In particular, when $p$ is close to $1$, the factor $p(1-p)$ is close to $0$, and therefore $\Delta$ is close to $1$. This shows that in DropActivation, the difference between the training-time and testing-time variance of the input to a Batch Normalization layer is rather minor.
We further demonstrate numerically that DropActivation does not generate an enormous shift in the variance of the internal covariates when going from training to testing. We train ResNet-164 on CIFAR-100 and let the probability of retaining activation be 0.95 in DropActivation. ResNet-164 consists of a stack of three modules; each module contains 54 convolution layers but a different number of channels. We observe the statistics of the output of the second module by evaluating its shift ratio: we compute the variance of the output for each channel and then average over the channels. As shown in Figure 6, the shift ratio stabilizes close to $1$ by the end of training.
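The shift ratio can also be estimated by simulation. Under the assumption that the inputs $x_j$ are i.i.d. standard normal (an assumption made here for illustration, not taken from the paper's experiment), a Monte Carlo estimate of $\mathrm{Var}(\bar{u})/\mathrm{Var}(u)$ remains fairly close to $1$ for $p = 0.95$ and moves further from $1$ as $p$ decreases:

```python
import numpy as np

def shift_ratio(p, d=32, n=100_000, seed=0):
    """Monte Carlo estimate of Var(u_bar) / Var(u) for i.i.d. N(0,1) inputs."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)             # weights of the linear map W = w^T
    x = rng.standard_normal((n, d))        # i.i.d. inputs to the box
    z = rng.binomial(1, p, size=(n, d))    # Bernoulli(p) keep masks
    relu = np.maximum(x, 0.0)
    u = (z * relu + (1 - z) * x) @ w       # training-time output, Eqn. (13)
    u_bar = (p * relu + (1 - p) * x) @ w   # testing-time output, Eqn. (14)
    return u_bar.var() / u.var()

ratio_095 = shift_ratio(0.95)
ratio_050 = shift_ratio(0.5)
```

Since the ratio depends on $p$ only through the per-coordinate variances, the estimate is insensitive to the particular draw of $w$.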
In summary, by keeping the statistical property of the internal output of hidden layers, DropActivation can be combined with Batch Normalization to improve performance.
6 Conclusion
In this paper, we propose DropActivation, a regularization method that introduces randomness on the activation function. DropActivation works by randomly dropping the nonlinear activations in the network during training and uses a deterministic network with modified nonlinearities for prediction.
The advantage of the proposed method is twofold. Firstly, DropActivation provides a simple yet effective method for regularization, as demonstrated by the numerical experiments and supported by our analysis in the one-hidden-layer case, where DropActivation gives rise to a regularizer that penalizes the difference between nonlinear and linear networks. A future direction is the analysis of DropActivation with more than one hidden layer. Secondly, experiments verify that DropActivation improves generalization in most modern neural networks and cooperates well with other popular training techniques. Moreover, we show theoretically and numerically that DropActivation maintains the output variance between training and testing times, and thus DropActivation can work well with Batch Normalization. These two properties should allow wide application of DropActivation in many network architectures.
Acknowledgments. H. Yang thanks the support of the startup grant by the Department of Mathematics at the National University of Singapore. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
7 Appendix
Proof of Property 5.1. Fix a training sample $(x, y)$ and write $b = W_1 x$ and $a = \mathrm{ReLU}(W_1 x)$. Let $M = \mathrm{diag}(m)$, where $m$ is a 0-1 vector whose $j$-th component equals $1$ if the $j$-th component of $W_1 x$ is positive and $0$ otherwise; then $a = M W_1 x$. The training-time output (7) is $W_2\, v(z)$ with
$$v(z) = \mathrm{diag}(z)\,a + (I - \mathrm{diag}(z))\,b = b + \mathrm{diag}(z)(a - b),$$
and the testing-time output (8) is $W_2\,\bar{v}$ with $\bar{v} = \mathbb{E}_z[v(z)] = b + p\,(a - b)$.
Decompose
$$\mathbb{E}_z\,\|y - W_2\, v(z)\|_2^2 = \|y - W_2\,\bar{v}\|_2^2 + \mathbb{E}_z\,\|W_2(v(z) - \bar{v})\|_2^2,$$
where the cross term vanishes because $\mathbb{E}_z[v(z) - \bar{v}] = 0$. Since $v(z) - \bar{v} = \mathrm{diag}(z - p\mathbb{1})(a - b)$ and the $z_j$ are independent Bernoulli($p$) variables with $\mathbb{E}[(z_j - p)(z_k - p)] = p(1-p)\,\delta_{jk}$, we obtain, writing $w_j$ for the $j$-th column of $W_2$,
$$\mathbb{E}_z\,\|W_2(v(z) - \bar{v})\|_2^2 = p(1-p)\sum_{j}\,(a_j - b_j)^2\,\|w_j\|_2^2 = p(1-p)\,\big\|W_2\,\mathrm{diag}(a - b)\big\|_F^2.$$
Noting that $W_2\,\bar{v} = W_2\,\sigma_{1-p}(W_1 x)$ and $a - b = \mathrm{ReLU}(W_1 x) - W_1 x = (M - I)W_1 x$, and averaging over the $n$ training samples, we obtain Eqn. (10). ∎
7.1 Proof of Property 5.2
Recall that $u = w^{\mathsf T}\big(x + \mathrm{diag}(z)\,r\big)$ and $\bar{u} = w^{\mathsf T}\big(x + p\,r\big)$, where $r := \mathrm{ReLU}(x) - x$, so that $r_j = -\min(x_j, 0)$ and $r_j^2 = \min(x_j, 0)^2$. By design, $\mathbb{E}[u] = \mathbb{E}[\bar{u}]$, where the expectation is taken with respect to both the feature noise $z$ and the input $x$. It remains to compare $\mathrm{Var}(u)$ and $\mathrm{Var}(\bar{u})$.
Since the pairs $(x_j, z_j)$ are independent across $j$, both variances decompose coordinatewise:
$$\mathrm{Var}(u) = \sum_j w_j^2\,\mathrm{Var}(x_j + z_j r_j), \qquad \mathrm{Var}(\bar{u}) = \sum_j w_j^2\,\mathrm{Var}(x_j + p\,r_j).$$
For each coordinate, using $z_j^2 = z_j$ and the independence of $z_j$ and $x_j$,
$$\mathbb{E}\big[(x_j + z_j r_j)^2\big] = \mathbb{E}[x_j^2] + 2p\,\mathbb{E}[x_j r_j] + p\,\mathbb{E}[r_j^2],$$
while
$$\mathbb{E}\big[(x_j + p\,r_j)^2\big] = \mathbb{E}[x_j^2] + 2p\,\mathbb{E}[x_j r_j] + p^2\,\mathbb{E}[r_j^2].$$
Since $\mathbb{E}[x_j + z_j r_j] = \mathbb{E}[x_j + p\,r_j]$, subtracting gives
$$\mathrm{Var}(x_j + z_j r_j) = \mathrm{Var}(x_j + p\,r_j) + p(1-p)\,\mathbb{E}[r_j^2].$$
Summing over $j$,
$$\mathrm{Var}(u) = \mathrm{Var}(\bar{u}) + p(1-p)\sum_j w_j^2\,\mathbb{E}\big[\min(x_j, 0)^2\big],$$
which yields Eqn. (15) for the shift ratio $\Delta = \mathrm{Var}(\bar{u})/\mathrm{Var}(u)$. Since the correction term is nonnegative, $\Delta \in (0, 1]$, and $\Delta \to 1$ as $p \to 1$. ∎
References
 [1] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 [2] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
 [3] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
 [4] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
 [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
 [7] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
 [8] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
 [9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [10] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [12] X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134, 2018.
 [13] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [14] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.

 [15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [16] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages III–1139–III–1147. JMLR.org, 2013.
 [17] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

 [18] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
 [19] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [20] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
 [21] Y. Yamada, M. Iwamura, and K. Kise. Shakedrop regularization. arXiv preprint arXiv:1802.02375, 2018.
 [22] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
 [23] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
 [24] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
 [25] H. Zhang, M. Cisse, Y. N. Dauphin, and D. LopezPaz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.