Convolutional neural networks (CNNs) are a powerful tool for computer vision tasks. With gradually increasing depth and width, CNNs [5, 6, 7, 22, 19] achieve significant improvements on image classification problems by capturing multiscale features. However, when the number of trainable parameters far exceeds the amount of training data, deep networks may suffer from overfitting. This leads to the routine use of regularization methods such as data augmentation, weight decay, Dropout, and Batch Normalization to prevent overfitting and improve generalization.
Although regularization has become an essential part of deep learning, deciding which regularization methods to use remains an art. Even if each regularization method works well on its own, combining them does not always improve performance. For instance, a network trained with both Dropout and Batch Normalization may not produce a better result: Dropout may change the statistical variance of a layer's output when we switch from training to testing, while Batch Normalization requires the variance to stay the same during both stages.
1.1 Our contributions
To deal with the aforementioned challenges, we propose a novel regularization method, Drop-Activation, inspired by the works [15, 3, 8, 21, 17], in which some structures of the network are dropped to achieve better generalization. The advantages are as follows:
Drop-Activation provides an easy-to-implement yet effective method for regularization via implicit parameter reduction.
Drop-Activation can be used in synergy with most popular architectures and regularization methods, leading to improved performance in various datasets.
The basic idea of Drop-Activation is that the nonlinearities in the network are randomly activated or deactivated during training. More precisely, the nonlinear activations are turned into identity mappings with a certain probability, as shown in Figure 1. At testing time, we propose using a deterministic neural network with a new activation function that is a convex combination of the identity mapping and the dropped nonlinearity, in order to represent the ensemble average of the random networks generated by Drop-Activation.
The starting point of Drop-Activation is to randomly ensemble a large class of neural networks, each using either an identity or a ReLU activation at every unit. Training with Drop-Activation identifies a set of parameters with which the various neural networks in this class all work well, which prevents overfitting to one fixed neural network. Drop-Activation can also be understood as adding noise to the training process for regularization. Indeed, our theoretical analysis will show that Drop-Activation implicitly adds a penalty term to the loss function, favoring network parameters for which the corresponding deep neural network can be approximated by a shallower one, i.e., implicit parameter reduction.
The remainder of this paper is structured as follows. In Section 2, we review some regularization methods and discuss their relations to our work. In Section 3, we formally introduce Drop-Activation. In Section 4, we demonstrate the regularization effect of Drop-Activation and its synergy with other regularization approaches on the datasets CIFAR-10, CIFAR-100, SVHN, and EMNIST. In Section 5, these advantages of Drop-Activation are further supported by our theoretical analyses.
2 Related work
Various regularization methods have been proposed to reduce the risk of overfitting. Data augmentation achieves regularization by directly enlarging the original training dataset, randomly transforming the input images [11, 14, 3, 2] or the output labels [25, 18]. Another class of methods regularizes the network by adding randomness to various neural network structures such as nodes, connections, pooling layers, activations, and residual blocks [4, 8, 21]. In particular, [15, 3, 8, 21, 17] add randomness by dropping some structures of the neural network during training. We focus on reviewing this class of methods, as they are most relevant to our method, in which the nonlinear activation functions are discarded randomly.
Dropout drops nodes along with their connections with some fixed probability during training. DropConnect has a similar idea but masks out some weights randomly. Stochastic depth improves the performance of ResNet by dropping entire residual blocks at random during training and passing the signal through the skip connections (identity mappings). The randomness of dropping entire blocks allows training a shallower network in expectation. This idea is also used when training ResNeXt-type two-residual-branch networks. The idea of dropping also arises in data augmentation: Cutout randomly cuts out a square region of the training images. In other words, it drops the input nodes in a patch-wise fashion, which prevents the model from putting too much emphasis on a specific region of features.
In the next section, inspired by the above methods, we propose the Drop-Activation method for regularization. We want to emphasize that the improvement by Drop-Activation is universal to most neural-network architectures, and it can be readily used in conjunction with other regularizers without conflicts.
3 Drop-Activation
This section describes the Drop-Activation method. Suppose $x_0$ is an input vector of an $L$-layer feedforward network. Let $x_l$ be the output of the $l$-th layer, and let $\sigma$ be the element-wise nonlinear activation operator that maps an input vector to an output vector by applying a nonlinearity to each of its entries. For example, $\sigma$ could be a rectified linear unit (ReLU), a sigmoid, or a tanh function. For a standard fully connected or convolutional network, the output $x_l$ can be written as
$$x_l = W_l\,\sigma(x_{l-1}),$$
where $W_l$ is the weight matrix of the $l$-th layer. Biases are neglected for convenience of presentation.
In what follows, we modify the way the nonlinear activation operator is applied in order to achieve regularization. In the training phase, we randomly remove the pointwise nonlinearities in $\sigma$. In the testing phase, $\sigma$ is replaced with a new deterministic function.
Training Phase: During training, the nonlinearities in the operator $\sigma$ are kept with probability $p$ (or dropped with probability $1-p$). The output of the $l$-th layer is thus
$$x_l = W_l\big((I - P)\,x_{l-1} + P\,\sigma(x_{l-1})\big), \qquad (3)$$
where $P = \mathrm{diag}(z_1, \ldots, z_{d_{l-1}})$ and the $z_i$'s are i.i.d. Bernoulli random variables taking value $1$ with probability $p$ and $0$ with probability $1-p$. We use $I$ to denote the identity matrix. Intuitively, when $P = I$, all the nonlinearities in this layer are kept; when $P = 0$, all the nonlinearities are dropped. The general case lies between these two limits, where the nonlinearities are kept or dropped partially. At each iteration, a new realization of $P$ is sampled from the Bernoulli distribution.
If the nonlinear activation function $\sigma$ in Eqn. (3) is ReLU, the $i$-th component of $\big((I - P) + P\,\sigma\big)(x_{l-1})$ can be written as
$$z_i \max\big((x_{l-1})_i,\, 0\big) + (1 - z_i)\,(x_{l-1})_i. \qquad (4)$$
Testing Phase: During testing, we use a deterministic nonlinear function resulting from averaging over the realizations of $P$. More precisely, we take the expectation of Eqn. (3) with respect to the random matrix $P$:
$$x_l = W_l\big((1-p)\,x_{l-1} + p\,\sigma(x_{l-1})\big), \qquad (5)$$
and the new activation function $\bar\sigma := (1-p)\,\mathrm{id} + p\,\sigma$ is the convex combination of the identity operator and the activation operator $\sigma$. Eqn. (5) is the deterministic nonlinearity used to generate a deterministic neural network for testing. In particular, if ReLU is used, then the new activation is the Leaky ReLU with slope $1-p$ on the negative part.
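The training- and testing-phase activations above can be sketched in a few lines of NumPy. This is a minimal illustration of the ReLU case (function names are ours, not from the original implementation):

```python
import numpy as np

def drop_activation_train(x, p=0.95, rng=None):
    """Training-mode Drop-Activation (ReLU case): each entry keeps its
    nonlinearity with probability p and passes through unchanged otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.random(x.shape) < p          # z_i ~ Bernoulli(p), resampled each call
    return np.where(z, np.maximum(x, 0.0), x)

def drop_activation_test(x, p=0.95):
    """Testing-mode Drop-Activation: the expectation over the Bernoulli
    mask, i.e., a Leaky ReLU with slope (1 - p) on the negative part."""
    return (1.0 - p) * x + p * np.maximum(x, 0.0)
```

Note that `p = 1` recovers the plain ReLU in both modes, while `p = 0` reduces the layer to the identity.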
In this section, we empirically evaluate the performance of Drop-Activation and demonstrate its effectiveness. We apply Drop-Activation to modern deep neural architectures such as ResNet , PreResNet , DenseNet , ResNeXt , and WideResNet  on a series of data sets including CIFAR-10, CIFAR-100 , SVHN  and EMNIST . This section is organized as follows. Section 4.1 contains basic experiment setting. In Section 4.2, we introduce the datasets and implementation details. In section 4.3, we present the numerical results.
4.1 Experiment Design
Our experiments are to demonstrate the following points:
Comparison with RReLU: Due to the similarity between the activation function used in our proposed method and randomized leaky rectified linear units (RReLU) in Eqn. (4), one may speculate that the use of RReLU gives similar performance. We show that this is indeed not the case by comparing Drop-Activation with the use of RReLU.
Improvement to modern neural network architectures: We show the improvement that Drop-Activation brings is rather universal by applying it to different modern network architectures on a variety of datasets.
Compatibility with other approaches: We show that Drop-Activation is compatible with other popular regularization methods by combining them in different network architectures.
4.1.1 Comparison with RReLU
Xu et al. proposed RReLU with the following training scheme for an input vector $x$:
$$f(x_i) = \begin{cases} x_i, & x_i \ge 0,\\ a_i\, x_i, & x_i < 0, \end{cases} \qquad (6)$$
where $a_i$ is a random variable drawn from a uniform distribution $U(l, u)$ with $0 \le l < u < 1$. In the case of ReLU in Drop-Activation, a comparison between Eqn. (4) and Eqn. (6) shows that the main difference between our approach and RReLU is the random variable used on the negative axis. It can be seen from Eqn. (6) that RReLU passes negative inputs with a random shrinking rate, while Drop-Activation randomly lets the complete information pass. We compare Drop-Activation with RReLU using the architectures ResNet, PreResNet, and WideResNet on CIFAR-10 and CIFAR-100. The parameters $l$ and $u$ in RReLU are set to $1/8$ and $1/3$ respectively, as suggested in the original work.
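The contrast on the negative axis can be made concrete with a small sketch (our own function names): RReLU always leaks negative inputs with a random shrinking factor, whereas Drop-Activation either zeroes them completely or passes them completely.

```python
import numpy as np

def rrelu_train(x, lo=1/8, hi=1/3, rng=None):
    """Training-mode RReLU: each negative entry is shrunk by its own random
    factor a_i ~ Uniform(lo, hi); positive entries pass unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(lo, hi, size=x.shape)
    return np.where(x >= 0, x, a * x)

def drop_activation_train(x, p=0.95, rng=None):
    """Training-mode Drop-Activation: each negative entry either passes
    completely (probability 1 - p) or is zeroed out (probability p)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.random(x.shape) < p
    return np.where(z, np.maximum(x, 0.0), x)
```

On a negative input, RReLU's output always lies strictly between the identity and zero, while Drop-Activation's output takes only the two extreme values.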
4.1.2 Improvement to modern neural network architectures
The residual-type neural network structures greatly facilitate the optimization of deep neural networks and are employed by ResNet, PreResNet, DenseNet, ResNeXt, and WideResNet. We demonstrate that Drop-Activation works well with these modern architectures. Moreover, since these networks use Batch Normalization to accelerate training and may contain Dropout to improve generalization (WideResNet), these experiments also show the ability of Drop-Activation to work in synergy with prevalent training techniques. When applying Drop-Activation to these models, we directly substitute the original ReLU function with Eqn. (4) during training and with the Leaky ReLU of slope $1-p$ during testing.
4.1.3 Compatibility with other regularization approaches
We further consider two recent data augmentation methods, Cutout and AutoAugment. Cutout randomly masks a square region of the training data, and AutoAugment uses reinforcement learning to obtain an improved data augmentation scheme. We implement Drop-Activation in combination with Cutout and AutoAugment on WideResNet and ResNet for CIFAR-100.
4.2 Datasets and implementation details
4.2.1 Choosing probability of retaining activation:
In our method, the only parameter that needs to be tuned is the probability
of retaining the activation. To get a rough estimate of what $p$ should be, we train a simple network on CIFAR-10 and perform a grid search for $p$. When $p = 1$, Drop-Activation reduces to the standard ReLU. The simple network consists of the following layers: we first stack three blocks, each containing a convolution, Batch Normalization, ReLU, and average pooling; these are followed by two fully connected layers. Figure 2 shows the testing error on CIFAR-10 versus $p$, which is minimized at $p = 0.95$. Each data point is averaged over the outcomes of several independently trained networks. Based on this observation, we choose $p = 0.95$ for all experiments.
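The grid search above can be sketched as follows. Here `train_and_eval` is a hypothetical callback (not from the paper) that trains the small CNN with a given retain probability and returns its test error; averaging over repeated runs mirrors the averaging of the data points in Figure 2.

```python
import numpy as np

def grid_search_p(train_and_eval, p_values, n_repeats=3):
    """Return the retain probability p with the lowest mean test error.
    `train_and_eval(p)` trains a model with Drop-Activation at retain
    probability p and returns its test error; results are averaged over
    n_repeats independent runs."""
    mean_err = {p: float(np.mean([train_and_eval(p) for _ in range(n_repeats)]))
                for p in p_values}
    best_p = min(mean_err, key=mean_err.get)
    return best_p, mean_err
```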
CIFAR: Both CIFAR-10 and CIFAR-100 contain 60k color natural images of size 32 by 32: 50k images for training and 10k for testing. CIFAR-10 has ten object classes with 6k images each. CIFAR-100 is similar to CIFAR-10, except that it includes 100 classes with 600 images each. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data. For CIFAR-10, we train the models ResNet-110, PreResNet-164, DenseNet-BC-100-12, DenseNet-BC-190-40, ResNeXt-8×64d, and WideResNet-28-10. For CIFAR-100, we train the same models except that ResNet-110 is replaced with ResNet-164 using the bottleneck layer. We use the same hyper-parameters as in the original papers, except for the batch size of DenseNet-BC-190-40. The models are optimized using SGD with momentum.
SVHN: The Street View House Numbers (SVHN) dataset contains ten classes of color digit images of size 32 by 32. There are about 73k training images, 26k testing images, and an additional 531k images. The training and additional images are used together for training, giving over 600k training images in total. An image in SVHN may contain more than one digit, and the recognition task is to identify the digit in the center of the image. We preprocess the images by rescaling the pixel values, and no data augmentation is applied. For this dataset, we train the models WideResNet-16-8, DenseNet-BC-100-12, and ResNeXt-8×64d. We train WideResNet-16-8 and DenseNet-BC-100-12 as in [22, 7]. For ResNeXt, we train for 100 epochs, with the learning rate initially set to 0.1 and decreased by a factor of 10 after the 40th and 70th epochs. The remaining hyper-parameters are set in the same way as when training ResNeXt on CIFAR-10.
EMNIST: EMNIST is a set of grayscale images containing handwritten English characters and digits. There are six different splits in this dataset, and we use the split Balanced, which contains 131,600 images in total (112,800 for training and 18,800 for testing) across 47 distinct classes. For this classification task, we train the models ResNet-164, PreResNet-164, WideResNet-20-10, DenseNet-BC-100-12, and ResNeXt-8×64d using the hyper-parameter settings for training CIFAR-100 in [5, 6, 22, 7, 19], respectively.
4.3 Experimental Results
Tables 1, 2, 3, and 4 show the testing errors on CIFAR-100, CIFAR-10, SVHN, and EMNIST, respectively. The baseline results are from the original networks without Drop-Activation. In what follows, we discuss how our results support the points raised in Section 4.1.
4.3.1 Comparison with RReLU
As shown in Table 1 and Table 2, RReLU may perform worse than the baseline, e.g., in the case of WideResNet. In contrast, Drop-Activation consistently outperforms RReLU and almost all baselines. Although Drop-Activation does not reduce the testing error of ResNeXt-8×64d, Drop-Activation with DenseNet-BC-190-40 achieves the best testing error, smaller than that of the original ResNeXt-8×64d.
4.3.2 Application to modern models:
As shown in Tables 1 and 3, Drop-Activation consistently improves the testing accuracy over the baseline on CIFAR-100 and SVHN. The conclusion remains the same in Tables 2 and 4 for CIFAR-10 and EMNIST, except for one case on CIFAR-10 and one case on EMNIST, where the magnitude of deterioration is relatively small. In particular, Drop-Activation improves ResNet, PreResNet, and WideResNet by reducing the relative test error on CIFAR-10, CIFAR-100, or SVHN by over 3.5%.
Therefore, Drop-Activation works with most modern networks across different datasets. Moreover, our results implicitly show that Drop-Activation is compatible with regularization techniques such as Batch Normalization and Dropout used in training these networks.
4.3.3 Compatibility with other regularization approaches:
We apply Drop-Activation to network models that use Cutout or AutoAugment. As shown in Tables 5 and 6, Drop-Activation can further improve ResNet-18 and WideResNet-20-10 with Cutout or AutoAugment, decreasing the test error by over 0.5%. To the best of our knowledge, AutoAugment achieves the state-of-the-art result on CIFAR-100 using PyramidNet+ShakeDrop. Due to the limitation of computing resources, the models with PyramidNet+ShakeDrop+DropAct and other possible combinations are still under training.
5 Theoretical Analysis
In Section 5.1, we show that in a neural network with one hidden layer, Drop-Activation regularizes the network by penalizing the difference between a deep and a shallow network, which can be understood as implicit parameter reduction, i.e., the intrinsic dimension of the parameter space is smaller than that of the original parameter space. In Section 5.2, we further show that the use of Drop-Activation does not impair other techniques such as Batch Normalization, which ensures the practicality of Drop-Activation.
5.1 Drop-Activation as a regularizer
In this section, we show that using Drop-Activation in a standard one-hidden-layer fully connected neural network with ReLU activation gives rise to an explicit regularizer.
Let $x$ be the input vector and $y$ be the target output. The output of the one-hidden-layer neural network with ReLU activation is $W_2\,\sigma(W_1 x)$, where $W_1$ and $W_2$ are the weights of the network and $\sigma$ applies ReLU elementwise to its input vector. Let $\bar\sigma$ denote the Leaky ReLU with slope $1-p$ on the negative part. With Drop-Activation, the network computes
$$W_2\big((I - P)\,W_1 x + P\,\sigma(W_1 x)\big)$$
during training, and
$$\hat f(x) = W_2\,\bar\sigma(W_1 x) \qquad (8)$$
during testing.
Suppose we have $N$ training samples $\{(x_i, y_i)\}_{i=1}^{N}$. To reveal the effect of Drop-Activation, we average the training loss function over the random matrix $P$:
$$\min_{W_1, W_2}\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_P\, \big\| y_i - W_2\big((I - P)\,W_1 x_i + P\,\sigma(W_1 x_i)\big) \big\|_2^2, \qquad (9)$$
where the expectation is taken with respect to the feature noise $P$. The use of Drop-Activation can be seen as applying stochastic minimization to this averaged loss. The result of averaging the loss function over $P$ is summarized as follows.
Property 5.1. The optimization problem (9) is equivalent to
$$\min_{W_1, W_2}\; \frac{1}{N}\sum_{i=1}^{N} \big\| y_i - W_2\,\bar\sigma(W_1 x_i) \big\|_2^2 \;+\; \frac{p(1-p)}{N}\sum_{i=1}^{N} \big\| W_2\,\mathrm{diag}\big(\sigma(W_1 x_i) - W_1 x_i\big) \big\|_F^2. \qquad (10)$$
The proof of Property 5.1 can be found in the Supplementary Material. The first term in (10) is nothing but the prediction-time loss $\frac{1}{N}\sum_{i=1}^{N}\|y_i - \hat f(x_i)\|_2^2$, where $\hat f$ is defined via (8). Therefore, Property 5.1 shows that Drop-Activation incurs a penalty
$$\frac{p(1-p)}{N}\sum_{i=1}^{N} \big\| W_2\,\mathrm{diag}\big(\sigma(W_1 x_i) - W_1 x_i\big) \big\|_F^2 \qquad (11)$$
on top of the prediction loss. In Eqn. (11), the coefficient $p(1-p)$ controls the magnitude of the penalty. In our experiments, $p$ is selected to be a large number close to $1$ (typically $0.95$), resulting in a rather small regularization.
The penalty (11) involves both $W_1 x_i$ and $\sigma(W_1 x_i)$. Since $W_2 W_1 x_i$ has no nonlinearity, it can be viewed as a shallow (linear) network. In contrast, $W_2\,\sigma(W_1 x_i)$ contains the nonlinearity $\sigma$ and can be considered a deep network. The two networks share the same parameters $W_1$ and $W_2$. The penalty (11) therefore encourages weights for which the prediction of the relatively deep network stays close to that of the shallow network. In a classification or regression task, the shallow network has less representational power; however, its lower parameter complexity results in mappings with better generalization properties. In this way, the penalty incurred by Drop-Activation may help reduce overfitting via implicit parameter reduction.
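The decomposition behind Property 5.1 can be checked numerically. The sketch below (function names are ours) enumerates all Bernoulli masks for a small hidden layer, so the averaged training loss is computed exactly, and compares it against the prediction loss plus the penalty written in the form we reconstructed above, $\frac{p(1-p)}{N}\sum_i \|W_2\,\mathrm{diag}(\sigma(W_1 x_i) - W_1 x_i)\|_F^2$ (shown here for a single sample, $N = 1$):

```python
import numpy as np
from itertools import product

def relu(v):
    return np.maximum(v, 0.0)

def expected_train_loss(y, W1, W2, x, p):
    """Exact E_P || y - W2((I-P)W1 x + P relu(W1 x)) ||^2, obtained by
    enumerating all 2^d Bernoulli masks (feasible for a small hidden layer)."""
    v = W1 @ x
    total = 0.0
    for z in product([0.0, 1.0], repeat=len(v)):
        z = np.array(z)
        prob = np.prod(np.where(z == 1.0, p, 1 - p))   # P(this mask)
        out = W2 @ ((1 - z) * v + z * relu(v))
        total += prob * np.sum((y - out) ** 2)
    return total

def closed_form(y, W1, W2, x, p):
    """Prediction loss of the averaged (leaky-ReLU) network plus the
    Drop-Activation penalty p(1-p) ||W2 diag(relu(v) - v)||_F^2."""
    v = W1 @ x
    pred = W2 @ ((1 - p) * v + p * relu(v))
    w = relu(v) - v
    penalty = p * (1 - p) * np.sum((W2 * w) ** 2)      # Frobenius norm squared
    return np.sum((y - pred) ** 2) + penalty
```

Since the hidden units are masked independently, the variance of the random output decomposes unit by unit, which is exactly what the diagonal-weighted penalty captures.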
To illustrate this point, we perform a simple regression task for two functions. In Figure 3(a), the ground-truth function (blue) is a smooth curve. To generate the training dataset, we sample 20 points $x_i$ on an interval and add noise to the function values $y_i$. Then we train a fully connected network with three hidden layers of widths 1000, 800, and 200, respectively. As shown in Figure 3(a), the network with the standard ReLU has a low prediction error on the training points but is generally erroneous in other regions. Although the network with Drop-Activation does not fit the training data as closely (compared with the standard ReLU), overall it achieves a better prediction error. In Figure 3(b), we show the regression results for a piecewise-constant function, which can be viewed as a one-dimensional classification problem. We again see that the network using the standard ReLU has large test errors near the left and right boundaries, where there is less training data. With the incurred penalty (11), however, the network with Drop-Activation yields a smooth curve. Furthermore, Drop-Activation reduces the influence of data noise.
In another experiment, we train ResNet-164 on CIFAR-100 to demonstrate the regularization property of Drop-Activation. In Figure 4, the training error with Drop-Activation is slightly larger than that without it. In terms of generalization error, however, Drop-Activation gives improved performance. This suggests that the original network is over-parameterized and that Drop-Activation regularizes it through implicit parameter reduction.
5.2 Compatibility of Drop-Activation with Batch Normalization
In this section, we show theoretically that Drop-Activation essentially preserves the statistical properties of each layer's output when going from the training phase to the testing phase, and hence it can be used together with Batch Normalization. It has been argued that Batch Normalization assumes the output of each layer has the same variance during training and testing, whereas Dropout shifts the variance of the output at testing time, leading to disharmony when used in conjunction with Batch Normalization. Using a similar analysis, we show that, unlike Dropout, Drop-Activation maintains the output variance and can therefore be used together with Batch Normalization.
To this end, we analyze the mappings in ResNet. Figure 5 (left) shows a basic block of ResNet, while Figure 5 (right) shows a basic block with Drop-Activation. We focus on the rectangular box drawn with a dashed line. Suppose the output of the preceding layers shown in Figure 5 is $x = (x_1, \ldots, x_d)$, where the $x_i$'s are i.i.d. random variables. When $x$ is passed through the Drop-Activation layer followed by a linear transformation with weights $w$, we obtain, during training,
$$y = w^{\top}\big((I - P)\,x + P\,\sigma(x)\big), \qquad (13)$$
where $P = \mathrm{diag}(z_1, \ldots, z_d)$ and $\sigma$ is the ReLU. Similarly, during testing, taking the expectation of (13) over the $z_i$'s gives
$$\hat y = w^{\top}\bar\sigma(x) = w^{\top}\big((1-p)\,x + p\,\sigma(x)\big).$$
The output of the rectangular box, $y$ during training (and $\hat y$ during testing), is then used as the input to the Batch Normalization layer in Figure 5. Since for Batch Normalization we only need to understand the entry-wise statistics of its input, without loss of generality we assume the linear transformation maps a vector from $\mathbb{R}^d$ to $\mathbb{R}$, so that $y$ and $\hat y$ are scalars.
We want to show that $y$ and $\hat y$ have similar statistics. By design, $\mathbb{E}[y] = \mathbb{E}[\hat y]$. Notice that the expectation here is taken with respect to both the random variables $z_i$ and the input of the box in Figure 5. Thus the main question is whether the variances of $y$ and $\hat y$ are the same. To this end, we introduce the shift ratio
$$\frac{\mathrm{Var}[\hat y]}{\mathrm{Var}[y]}$$
as a metric for evaluating the variance shift. The shift ratio is expected to be close to $1$, since the Batch Normalization layer requires its input to have similar variance at both training and testing time.
The proof of Property 5.2 is provided in the Supplementary Material. By Eqn. (15), the shift ratio lies in an interval whose endpoints approach $1$ as $p \to 1$. In particular, when $p$ is close to $1$ (e.g., $p = 0.95$), the shift ratio is close to $1$. This shows that with Drop-Activation, the difference in the variance of the inputs to a Batch Normalization layer between the training and testing phases is rather minor.
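The variance shift can also be estimated by direct simulation. The sketch below (our own setup, assuming i.i.d. standard-normal inputs and computing the shift ratio as the plain variance ratio $\mathrm{Var}[\hat y]/\mathrm{Var}[y]$) compares the training-mode output of Eqn. (13) with its testing-mode expectation:

```python
import numpy as np

def shift_ratio(w, p=0.95, n=200_000, rng=None):
    """Monte-Carlo estimate of Var[y_test] / Var[y_train] for a unit
    y = w^T f(x) with i.i.d. standard-normal inputs x, where f is the
    training-mode Drop-Activation (random Bernoulli masks) or its
    testing-mode expectation (Leaky ReLU with slope 1 - p)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((n, len(w)))
    z = rng.random(x.shape) < p                        # keep-masks, training only
    y_train = np.where(z, np.maximum(x, 0.0), x) @ w
    y_test = ((1 - p) * x + p * np.maximum(x, 0.0)) @ w
    return y_test.var() / y_train.var()
```

For $p = 0.95$ the estimated ratio comes out close to $1$ (roughly $0.94$ under the standard-normal assumption), and for $p = 1$ the two modes coincide, giving a ratio of exactly $1$.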
We further demonstrate numerically that Drop-Activation does not generate a large shift in the variance of the internal covariates when going from training time to testing time. We train ResNet-164 on CIFAR-100 with the probability of retaining the activation set to 0.95. ResNet-164 consists of a stack of three modules; each module contains 54 convolution layers but a different number of channels. We observe the statistics of the output of the second module by evaluating its shift ratio: we compute the variance of the output for each channel and then average over channels. As shown in Figure 6, the shift ratio stabilizes near $1$ by the end of training.
In summary, by preserving the statistical properties of the hidden layers' outputs, Drop-Activation can be combined with Batch Normalization to improve performance.
6 Conclusion
In this paper, we propose Drop-Activation, a regularization method that introduces randomness into the activation function. Drop-Activation works by randomly dropping the nonlinear activations in the network during training and uses a deterministic network with a modified nonlinearity for prediction.
The advantages of the proposed method are two-fold. First, Drop-Activation provides a simple yet effective method for regularization, as demonstrated by the numerical experiments. This is further supported by our analysis in the one-hidden-layer case, where we show that Drop-Activation gives rise to a regularizer penalizing the difference between nonlinear and linear networks. A future direction is the analysis of Drop-Activation with more than one hidden layer. Second, experiments verify that Drop-Activation improves generalization in most modern neural networks and cooperates well with other popular training techniques. Moreover, we show theoretically and numerically that Drop-Activation maintains the output variance across training and testing, and thus works well with Batch Normalization. These two properties should allow wide application of Drop-Activation in many network architectures.
Acknowledgments. H. Yang thanks the support of the start-up grant by the Department of Mathematics at the National University of Singapore. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
7 Supplementary Material
Suppose that $x$ is the input vector. Let $M = \mathrm{diag}(m)$, where $m$ is a 0-1 vector whose $j$-th component equals $1$ if the $j$-th component of $x$ is positive and $0$ otherwise. Then $\sigma(x) = M x$, i.e., the ReLU acts as a data-dependent diagonal matrix.
On one hand, we expand the averaged loss and obtain Eqn. (16), where $\mathrm{tr}$ is the trace operator, computing the sum of the diagonal entries of a matrix, and the other function denotes converting a diagonal matrix into a column vector. Transforming the first term of Eqn. (16), we get
On the other hand, we have
where the expectation is taken with respect to the feature noise $P$. Similarly to Eqn. (17), we have
By linearity, taking the expectation of Eqn. (19) with respect to $P$ gives
With this notation, we then have
Combining the above, we obtain
7.1 Proof of Property 5.2
From the definitions, it is easy to get
where the expectation is taken with respect to the random variable $P$. We know that
where we take the expectation with respect to the feature noise and the inputs. That means $\mathbb{E}[y] = \mathbb{E}[\hat y]$. In what follows, we compute $\mathrm{Var}[y]$ and $\mathrm{Var}[\hat y]$.
Taking the expectation, we obtain
This completes the first computation; we now compute the other variance.
Taking the expectation with respect to the input,
So we have
-  G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
-  E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv preprint arXiv:1801.05134, 2018.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pages III–1139–III–1147. JMLR.org, 2013.
-  L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
-  L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian. Disturblabel: Regularizing cnn on the loss layer. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
-  B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
-  Y. Yamada, M. Iwamura, and K. Kise. Shakedrop regularization. arXiv preprint arXiv:1802.02375, 2018.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
-  M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.