Convolutional neural networks (CNNs), which use a stack of convolution operations followed by non-linear activations (e.g., the Rectified Linear Unit, ReLU) to extract high-level discriminative features, have achieved considerable improvements on visual tasks [12, 6, 24]. Via layer-by-layer connectivity, the extracted features attain outstanding representational power. Recent advances in CNN architectures, such as ResNet, DenseNet, ResNeXt, and PyramidNet, ease the problem of vanishing gradients and boost performance. However, overfitting, which reduces the generalization capability of CNNs, remains a major problem.
A wide variety of regularization strategies have been exploited to alleviate overfitting and decrease the generalization error. Data augmentation is a simple yet effective way to make models adapt to the diversity of data. Batch Normalization standardizes the mean and variance of the features in each mini-batch, which makes the optimization landscape smoother. Drop-based methods [7, 18] aim to train an ensemble of sub-networks, which weakens the effect of “co-adaptations” on the training data. Recently, Shake-Shake regularization was proposed to randomly interpolate the two complementary features in the two branches of ResNeXt, achieving state-of-the-art classification performance. ShakeDrop incorporated the idea of Stochastic Depth into Shake-Shake regularization to stabilize the training process in the residual branch of ResNet or PyramidNet. Despite the impressive improvements of Shake-based regularization methods, they have two main drawbacks.
ShakeDrop regularization was designed for deep networks and is not suitable for shallow network architectures. It may not improve generalization performance, and may even make the performance worse for shallow networks (see Table I).
The regularization strength (or amplitude) is fixed over the whole training process. A fixed strong regularization is beneficial for reducing overfitting, but it makes it difficult to fit the data at the beginning of training. From the perspective of curriculum learning, the learner needs to begin with easy examples.
In view of these issues, we propose a dynamic regularization method for CNNs in which the regularization strength is adapted to the dynamics of the training loss. During training, the regularization strength is gradually increased according to the training status. Analogous to human education, the regularizer acts as an instructor who gradually increases the difficulty of the training examples by way of feature perturbation. Moreover, dynamic regularization can adapt to models of different sizes: it provides a strong regularization for large models and a weak one for light models (see Fig. 4 (b)); that is, the regularization strength grows faster and reaches a higher value for a large model than for a light one.
In the proposed dynamic regularization of the ResNet structure, the training loss is not only used to perform backpropagation but is also exploited to update the amplitude of the regularization. The features in the residual branch are multiplied by the regularizer, which acts as a perturbation that introduces an augmentation in feature space, so CNNs are trained on a diversity of augmented features. Additionally, the regularization amplitude changes with the dynamics of the training loss. We conduct experiments on the image classification task to evaluate our regularization strategy. Experimental results show that the proposed dynamic regularization outperforms state-of-the-art regularization methods: PyramidNet and ResNeXt equipped with our dynamic regularization improve the classification accuracy in various model settings when compared with the same networks with ShakeDrop and Shake-Shake regularization.
The rest of this paper is organized as follows. We first briefly introduce the related work on deep CNNs and regularization methods in Section II. Then, the proposed dynamic regularization is presented in Section III. Experimental results and discussion are given in Section IV. Finally, Section V concludes this paper.
II Related Work
II-A Deep CNNs
CNNs have become deeper and wider with a more powerful capacity [6, 8, 5, 17, 20]. As our proposed regularization is based on ResNet and its variants, we briefly review the basic structure of ResNet, i.e., residual block.
Residual block. The residual block (Res-Block, shown in Fig. 1) is formulated as

$y = x + F(x),$

where the identity branch $x$ is the input feature of the Res-Block, which is added to a residual branch $F(x)$ that is a non-linear transformation of the input $x$ by a set of parameters $W$ ($W$ will be omitted for simplicity). $F(\cdot)$ consists of two Conv-BN-ReLU stacks or a bottleneck architecture in the original ResNet structure. In recent improvements, $F(\cdot)$ can also take other forms, e.g., Wide-ResNet [19], PyramidNet, and ResNeXt. PyramidNet gradually increases the number of channels in the Res-Blocks as the layers go deep. ResNeXt has multiple aggregated residual branches, expressed as

$y = x + F_1(x) + F_2(x),$

where $F_1(x)$ and $F_2(x)$ are two residual branches. The number of branches (namely, the cardinality) is not limited to two.
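The two block forms above can be sketched in a few lines of NumPy; the branch functions here are hypothetical linear stand-ins for the actual Conv-BN-ReLU transformations:

```python
import numpy as np

def res_block(x, F):
    """Res-Block forward: identity branch plus residual branch, y = x + F(x)."""
    return x + F(x)

def resnext_block(x, F1, F2):
    """3-branch ResNeXt block: identity plus two aggregated residual branches."""
    return x + F1(x) + F2(x)

# Toy feature vector and linear stand-in branches (illustration only).
x = np.ones(4)
y = res_block(x, lambda v: 2.0 * v)                    # -> 3.0 * x
z = resnext_block(x, lambda v: v, lambda v: 0.5 * v)   # -> 2.5 * x
```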
II-B Regularization Methods

In addition to the advances in network architectures, many regularization techniques, e.g., data augmentation [12, 2], stochastic dropping [18, 9, 14], and Shake-based regularization methods [3, 22], have been successfully applied to avoid overfitting of CNNs.
Data augmentation (e.g., random cropping, flipping, and color adjustment) is a simple yet effective strategy to increase the diversity of data. DeVries and Taylor introduced an image augmentation technique in which augmented images are generated by randomly cutting out square regions from input images (called Cutout). Dropout is a widely used technique that stochastically drops hidden nodes from the network during the training process. Following this idea, Maxout, Continuous Dropout, DropPath, and Stochastic Depth were proposed. Based on ResNet, Stochastic Depth randomly drops a certain number of residual branches so that the network is shrunk during training; it performs inference using the whole network without dropping. Shake-based regularization approaches [3, 22] were recently proposed to augment features inside CNNs, achieving outstanding classification performance.
Shake-Shake regularization is illustrated in Fig. 2 (a). A random variable $\alpha$ is used to control the interpolation of the two residual branches (i.e., $F_1(x)$ and $F_2(x)$ in the 3-branch ResNeXt). It is given by

$y = x + \alpha F_1(x) + (1 - \alpha) F_2(x),$

where $\alpha$ follows the uniform distribution on $[0, 1]$ in the forward pass. In the backward pass, $\alpha$ is replaced by another uniform random variable $\beta$ to disturb the learning process. The regularization amplitude of each branch is fixed throughout training.
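As a concrete illustration, here is a minimal NumPy sketch of the Shake-Shake forward rule. The precomputed vectors f1 and f2 are stand-ins for the branch outputs $F_1(x)$ and $F_2(x)$; the independent backward coefficient is noted in a comment:

```python
import numpy as np

rng = np.random.default_rng(0)

def shake_shake_forward(x, f1, f2, alpha):
    # Forward interpolation: y = x + alpha * F1(x) + (1 - alpha) * F2(x).
    return x + alpha * f1 + (1.0 - alpha) * f2

x = np.zeros(3)
f1 = np.ones(3)          # stand-in for F1(x)
f2 = 2.0 * np.ones(3)    # stand-in for F2(x)

alpha = rng.uniform(0.0, 1.0)   # fresh U(0, 1) draw for the forward pass
beta = rng.uniform(0.0, 1.0)    # an independent U(0, 1) draw like this one
                                # rescales the branch gradients in the backward pass
y = shake_shake_forward(x, f1, f2, alpha)
# each entry of y lies between x + f1 and x + f2, i.e., in [1, 2]
```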
To extend the use of Shake-Shake regularization, Yamada et al. introduced a single Shake in 2-branch architectures (e.g., ResNet or PyramidNet), as shown in Fig. 2 (b), in which they adopted Stochastic Depth to stabilize the learning:

$y = x + (b + \alpha - b\alpha) F(x),$

where $\alpha$ is a uniform random variable and $b$ is a Bernoulli random variable that decides whether to perform the original network (i.e., $y = x + F(x)$, if $b = 1$) or the perturbed one (i.e., $y = x + \alpha F(x)$, if $b = 0$). In the backward pass, $\alpha$ is replaced by $\beta$. The regularization amplitude of the branch is also fixed. Yamada et al. also presented a Single-branch Shake structure without the original network, $y = x + \alpha F(x)$, in which the perturbation is always applied in the feature space. They showed that this structure gives bad results in some cases; for instance, the 110-layer PyramidNet with Single-branch Shake degrades the error rate to 77.99% on CIFAR-100, because the fixed large regularization is too strong. We argue that a fixed regularization amplitude cannot fit the dynamics of the training process and different model sizes well.
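The combined ShakeDrop gate $b + \alpha - b\alpha$ can be checked numerically. A short sketch (the branch output f is a hypothetical stand-in for $F(x)$):

```python
import numpy as np

rng = np.random.default_rng(1)

def shakedrop_coeff(b, alpha):
    # b = 1 recovers the original network (coefficient 1);
    # b = 0 applies the perturbation (coefficient alpha in [-1, 1]).
    return b + alpha - b * alpha

def shakedrop_forward(x, f, b, alpha):
    return x + shakedrop_coeff(b, alpha) * f

alpha = rng.uniform(-1.0, 1.0)
x = np.zeros(2)
f = np.ones(2)   # stand-in for F(x)

y_original = shakedrop_forward(x, f, b=1, alpha=alpha)   # == x + f
y_perturbed = shakedrop_forward(x, f, b=0, alpha=alpha)  # == x + alpha * f
```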
III The Proposed Method
As aforementioned, the fixed regularization strength in existing regularization methods, such as DropPath, Stochastic Depth, Shake-Shake, and ShakeDrop, departs from the human learning paradigm (e.g., curriculum learning or self-paced learning). A naive remedy is to predefine a schedule for updating the regularization strength, such as a linear increment scheme that raises the learning difficulty from low to high. We argue that a predefined schedule is not flexible enough to reflect the learning process. Inspired by the fact that the feedback of the learning itself provides useful information, we propose a dynamic regularization that is capable of adjusting the regularization strength adaptively.
Our dynamic regularization for CNNs is based on the dynamics of the training loss. Specifically, at the beginning of the training process, both the training and testing losses keep decreasing, which means the network is learning to recognize the images. However, after a certain number of iterations, the network may overfit the training data, so that the training loss decreases more rapidly than the testing loss. The design of the regularization method needs to follow these dynamics: if the training loss drops in an iteration, the regularization strength should increase in the next iteration to counter overfitting; otherwise, it should decrease to counter underfitting. In what follows, we first introduce the network architectures with dynamic regularization and then detail the update of the regularization strength in each iteration of the training process.
III-A Network Architectures with Dynamic Regularization
III-A1 The 2-branch architecture with dynamic regularization
Training phase. During training, dynamic regularization is adopted in the Res-Block, as shown in Figs. 3 (a) and (b). Specifically, a dynamic regularization unit (called random perturbation) is introduced into the residual branch of the Res-Block. The random perturbation is achieved by

$\lambda_t = \mu + \gamma_t \epsilon,$

where $\mu$ is the basic constant amplitude, $\gamma_t$ is the dynamic factor at the $t$-th iteration, and $\epsilon$ is uniform random noise with expected value $0$. The value of $\gamma_t$ is updated via the backward difference of the training loss (see Section III-B), and the regularization amplitude is proportional to $\gamma_t$. In the forward pass, the output of the Res-Block can be expressed as

$y = x + \lambda_t F(x).$

In the backward pass, $\lambda_t$ takes a different value (represented by $\lambda'_t$ in Fig. 3 (b)) due to the random noise $\epsilon$.
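A minimal sketch of the perturbation unit, assuming the form $\lambda_t = \mu + \gamma_t \epsilon$ with $\epsilon$ drawn uniformly from $[-r, r]$; the constants MU and R below are illustrative, not the paper's tuned values:

```python
import numpy as np

rng = np.random.default_rng(2)

MU = 1.0   # basic constant amplitude (illustrative value)
R = 0.5    # half-range of the uniform noise eps ~ U(-R, R)

def perturbation(gamma_t):
    """lambda_t = mu + gamma_t * eps, where E[eps] = 0."""
    eps = rng.uniform(-R, R)
    return MU + gamma_t * eps

def res_block_dynamic_forward(x, f, gamma_t):
    # Forward pass: y = x + lambda_t * F(x). The backward pass would draw an
    # independent lambda'_t in the same way.
    return x + perturbation(gamma_t) * f

x = np.zeros(4)
f = np.ones(4)   # stand-in for F(x)
y = res_block_dynamic_forward(x, f, gamma_t=0.4)
# with gamma_t = 0.4, lambda_t lies in [MU - 0.2, MU + 0.2] = [0.8, 1.2]
```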
Random noise. The range of $\epsilon$, i.e., $[-r, r]$, is a hyper-parameter in the training phase. A straightforward way is to set $r$ to be uniform across all Res-Blocks. However, the features of the earlier Res-Blocks should be preserved more than those of the later Res-Blocks. Hence, we propose a linear enhancement rule to configure this range inside the Res-Blocks. For the $l$-th Res-Block, the range $[-r_l, r_l]$ is given by

$r_l = \frac{l}{L}\, r,$

where $L$ is the total number of Res-Blocks. With the increasing trend of the range $r_l$, the regularization strength is gradually raised from the bottom layers to the top layers. We compare different settings of the range inside Res-Blocks in Section IV.
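Assuming the linear enhancement rule takes the form $r_l = (l / L)\, r$, it can be sketched directly (L = 4 and r = 0.5 are illustrative values):

```python
def noise_range(l, L, r):
    """Half-range of the uniform noise in the l-th of L Res-Blocks."""
    return (l / L) * r

# The range grows linearly from the bottom block to the top block.
ranges = [noise_range(l, L=4, r=0.5) for l in range(1, 5)]
```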
III-A2 The 3-branch architecture with dynamic regularization
We also apply dynamic regularization to the 3-branch architecture (see Fig. 2 (a)). In Shake-Shake regularization, $\alpha$ is a uniform random variable; we replace $\alpha$ with the random perturbation $\lambda_t$ defined above. The Res-Block with dynamic regularization can then be defined as

$y = x + \lambda_t F_1(x) + (1 - \lambda_t) F_2(x).$

If we set $\mu = 1/2$ and $\epsilon \in [-1/2, 1/2]$ and limit $\gamma_t$ to $1$, $\lambda_t$ ranges from $0$ to $1$, which is consistent with $\alpha$ in Shake-Shake regularization. Shake-Shake regularization can thus be regarded as a special case of our dynamic regularization with a fixed dynamic factor.
III-B Update of the Regularization Strength
The proposed update of the dynamic regularization strength is driven by the dynamics of the training loss. Specifically, the dynamic characteristic of the training loss can be modeled as the difference of the training loss between successive iterations. We define the backward difference of the training loss at two successive iterations as

$\Delta \ell_t = \ell_t - \ell_{t-1},$

where $\ell_t$ denotes the training loss at the $t$-th iteration. Although the training loss shows an overall downtrend, it fluctuates heavily when feeding sequential mini-batches. To suppress this noise and extract the trend of the loss, we apply a Gaussian filter to smooth it. Hence, the filtered backward difference can be rewritten as

$\Delta \hat{\ell}_t = \hat{\ell}_t - \hat{\ell}_{t-1},$

where $\hat{\ell}_t = \sum_{k=0}^{K-1} w_k\, \ell_{t-k}$ is the filtering operation with filter length $K$. We use the normalized Gaussian window defined by

$w_k = \frac{\exp\!\left(-\frac{1}{2}\left(\frac{k-(K-1)/2}{\sigma}\right)^2\right)}{\sum_{j=0}^{K-1} \exp\!\left(-\frac{1}{2}\left(\frac{j-(K-1)/2}{\sigma}\right)^2\right)},$

where the standard deviation $\sigma$ is determined by the filter length. We discuss the Gaussian filter in the experiments. The dynamic factor $\gamma_t$ is then updated with respect to $\Delta \hat{\ell}_t$ as

$\gamma_{t+1} = \gamma_t - \eta\, \mathrm{sign}(\Delta \hat{\ell}_t),$

where $\eta$ is a small constant step for changing the regularization amplitude. From this update rule, it can be observed that if the training loss decreases ($\Delta \hat{\ell}_t < 0$), the regularization amplitude increases to avoid overfitting; otherwise, it decreases to prevent underfitting. The dynamic factor keeps updating to follow the dynamics of the training loss in each iteration of the training procedure.
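The whole update step can be sketched as follows, assuming the dynamic factor moves by a fixed step eta against the sign of the Gaussian-filtered loss difference; the window parameters and step size are illustrative choices, not the paper's tuned values:

```python
import numpy as np

def gaussian_window(K, sigma):
    """Normalized Gaussian window of length K (weights sum to 1)."""
    n = np.arange(K) - (K - 1) / 2.0
    w = np.exp(-0.5 * (n / sigma) ** 2)
    return w / w.sum()

def update_dynamic_factor(gamma, losses, w, eta=1e-3):
    """One update of the dynamic factor from the recent loss history.

    losses: the K + 1 most recent training losses, oldest first. The loss is
    smoothed with the Gaussian window, differenced, and gamma moves one step
    against the sign of the filtered backward difference.
    """
    K = len(w)
    smoothed_now = float(np.dot(w, losses[-K:]))
    smoothed_prev = float(np.dot(w, losses[-K - 1:-1]))
    diff = smoothed_now - smoothed_prev       # filtered backward difference
    return gamma - eta * np.sign(diff), diff

w = gaussian_window(K=5, sigma=1.0)
losses = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5]       # a steadily decreasing loss
gamma, diff = update_dynamic_factor(0.1, losses, w)
# the loss is falling, so diff < 0 and gamma is increased by eta
```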
Remark. There are some existing methods that change the regularization strength. For instance, Zoph et al. introduced ScheduledDropPath to regularize NASNets, a linear increment scheme of the regularization strength in which the probability of dropping a path increases linearly throughout training. However, a constant or linear scheme is a predefined rule, which cannot adapt to the training procedure or to different model sizes. In contrast, our dynamic scheduling exploits the dynamics of the training loss and is applicable to the training of different network architectures. In Section IV, we compare against these schemes.
IV Experimental Results
In this section, we evaluate the proposed dynamic regularization on the CIFAR-100 classification benchmark, in comparison with two state-of-the-art regularization approaches: Shake-Shake and ShakeDrop. We then conduct ablation studies to compare against fixed and linear-increment schedules of the regularization strength, and discuss the effectiveness of the Gaussian filter and the random noise.
IV-A Implementation Details
The following settings are used throughout the experiments. All models were trained for 300 epochs with a fixed batch size. The learning rate was initialized following the settings of ShakeDrop for the 2-branch architecture and of Shake-Shake for the 3-branch architecture, and a cosine learning schedule gradually reduced it toward the end of training. For dynamic regularization, the initial dynamic factor, the basic constant amplitude, and the noise range were set separately for the 2-branch and 3-branch architectures, and the length of the Gaussian filter was fixed. PyramidNet and ResNeXt were used as baselines. We employed standard translation, flipping, and Cutout as the data augmentation scheme, so the Shake-based regularizer is the only variable affecting the experiments. All experimental results are averages of 3 runs at the 300-th epoch.
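The cosine learning schedule mentioned above has a standard closed form; a sketch, where the initial rate 0.1 and the 300-epoch horizon are used only for illustration:

```python
import math

def cosine_lr(epoch, total_epochs, lr0, lr_end=0.0):
    """Cosine annealing from lr0 down to lr_end over total_epochs."""
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr0 - lr_end) * (1.0 + math.cos(math.pi * progress))

start = cosine_lr(0, 300, lr0=0.1)     # full initial rate at epoch 0
end = cosine_lr(300, 300, lr0=0.1)     # annealed to ~0 at the final epoch
```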
IV-B Comparison with State-of-the-Art Regularization Methods
TABLE I: Network Architecture | Params | Regularization | Top-1 Error (%)
We first compare the proposed dynamic regularization with ShakeDrop in the 2-branch architecture on CIFAR-100. Following ShakeDrop, we used PyramidNet as our baseline (denoted Baseline) and chose several architectures: 1) PyramidNet-110-a48 (i.e., a network with a depth of 110 and a widening factor of 48), which is a deep and narrow network; 2) PyramidNet-26-a84, which is a light network; and 3) PyramidNet-26-a200, which is a shallow and wide network.
Table I shows the experimental results. It can be observed that our dynamic regularization outperforms the ShakeDrop counterparts in various architectures. The error rates of ShakeDrop are even worse than those of Baseline in the shallow architectures, i.e., PyramidNet-26-a84 and PyramidNet-26-a200, which means ShakeDrop with a fixed regularization strength fails in this case. This issue stems from the use of Stochastic Depth in ShakeDrop, which works well only for deep networks. Regardless of network depth, PyramidNet with dynamic regularization obtains a consistent improvement. Networks with dynamic regularization are comparable with baseline networks that have double the number of parameters (e.g., 23.83% for 26-a84-Dynamic vs. 23.40% for 110-a48-Baseline; and 21.32% for 110-a48-Dynamic vs. 22.53% for 26-a200-Baseline).
For the 3-branch architecture, we compare dynamic regularization with Shake-Shake in ResNeXt-26-2x32d (i.e., a network with a depth of 26, 2 residual branches, and a first residual block of width 32) and ResNeXt-26-2x64d, as shown in Table II. The error rates of dynamic regularization are lower than those of Shake-Shake. The results in Tables I and II show that our dynamic regularization can adapt to various network architectures.
Fig. 4 shows the training loss, the dynamic factor, and the Top-1 error with respect to the epoch for two networks, i.e., PyramidNet-26-a84 and PyramidNet-110-a48. For networks with dynamic regularization, the downward trend of the training loss is slowed, unlike Baseline, whose loss goes down towards zero (see Fig. 4 (a)); dynamic regularization can prevent networks from rote-learning the training data. As shown in Fig. 4 (b), the dynamic factor of both network architectures gradually increases throughout the training process. Instead of using a predefined scheduling function, our dynamic scheduling is self-adaptive according to the backward difference of the training loss. Another important property of the dynamic scheduling is that a small regularization strength is generated for the light model (i.e., 26-a84) and a large strength for the large model (i.e., 110-a48). Fig. 4 (c) illustrates that networks with dynamic regularization narrow the gap between the training and testing errors (from Gap-1 to Gap-2, and from Gap-3 to Gap-4) and achieve a lower testing error than Baseline.
TABLE II: Network Architecture | Params | Regularization | Top-1 Error (%)
IV-C Ablation Study and Discussion
IV-C1 Schedules of the regularization strength
Apart from the proposed dynamic schedule, the regularization strength can be adjusted by a linear-increment schedule, as in ScheduledDropPath, which linearly increases the probability of a dropped path (which can also be considered the regularization strength) during training. Besides, a fixed regularization schedule is commonly used in many previous methods [14, 9, 3, 22]. We compared our dynamic method with such fixed and linear-increment schedules, using PyramidNet-26-a84 as the backbone.
Table III illustrates six different configurations of the regularization strength. ‘Fix-’ means the dynamic factor is fixed to a constant, and ‘Linear-’ means the dynamic factor is linearly increased over the course of training. ‘Fix-2’ and ‘Linear-3’ achieve the best results among the fixed and linear schedules, respectively. Compared with them, the dynamic setting, with a 23.83% error rate, achieves the best performance, which shows the effectiveness of our dynamic regularization schedule.
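For reference, the two baseline schedules can be sketched as simple functions of the training step; the gamma values and the horizon T here are placeholders, not the paper's settings:

```python
def fixed_schedule(step, gamma):
    """'Fix-' style: the dynamic factor never changes."""
    return gamma

def linear_schedule(step, T, gamma_start, gamma_end):
    """'Linear-' style: ramp the dynamic factor linearly over T steps."""
    frac = min(step, T) / T
    return gamma_start + (gamma_end - gamma_start) * frac

lo = linear_schedule(0, 100, 0.0, 0.5)     # start of training
hi = linear_schedule(100, 100, 0.0, 0.5)   # end of training
```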
IV-C2 Random noise
As mentioned in Section III, the range of the random noise involved in our dynamic regularization is designed to grow linearly from the bottom Res-Blocks to the top Res-Blocks. To evaluate this setting, we ran dynamic regularization with a uniform range and with a linearly growing range in PyramidNet-26-a84. From the third and fourth rows of Table IV, we can see that the model with a uniform range is inferior to the model with a linearly growing range inside the Res-Blocks (25.83% vs. 23.83%).
IV-C3 Gaussian Filtering
In the process of updating the dynamic factor, we employ a Gaussian filter to remove instantaneous changes of the training loss across mini-batches; that is, the dynamic factor is updated with the filtered backward difference instead of the raw one. To study the effectiveness of the Gaussian filter, we conducted comparative experiments between the two variants. The last two rows of Table IV show that removing the Gaussian filter increases the error rate by 1.38%, which shows that the Gaussian filter also plays an important role in dynamic regularization.
V Conclusion

In this paper, we have presented a dynamic schedule that adjusts the regularization strength to fit various network architectures and the training process. Our dynamic regularization is self-adaptive in accordance with the change of the training loss. It produces a low regularization strength for light network architectures and a high regularization strength for large ones. Furthermore, the strength grows in a self-paced manner to avoid overfitting. Experimental results demonstrate that the proposed dynamic regularization outperforms the state-of-the-art ShakeDrop and Shake-Shake regularization in the field of feature augmentation. We believe that dynamic regularization can also be exploited in data augmentation and Dropout-based methods in the future.
References

- Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the Annual International Conference on Machine Learning, pp. 41–48.
- T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552.
- X. Gastaldi (2017) Shake-shake regularization. CoRR abs/1705.07485.
- I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. In International Conference on Machine Learning.
- D. Han, J. Kim, and J. Kim (2017) Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5927–5935.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661.
- S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
- A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197.
- G. Larsson, M. Maire, and G. Shakhnarovich (2017) FractalNet: ultra-deep neural networks without residuals. In International Conference on Learning Representations.
- S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493.
- X. Shen, X. Tian, T. Liu, F. Xu, and D. Tao (2017) Continuous dropout. IEEE Transactions on Neural Networks and Learning Systems 29 (9), pp. 3926–3937.
- K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
- Y. Yamada, M. Iwamura, and K. Kise (2018) ShakeDrop regularization for deep residual learning. CoRR abs/1802.02375.
- S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In British Machine Vision Conference.
- Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu. Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.