Deep neural networks (DNNs) have reached an unprecedented level of predictive accuracy in several real-world application domains (e.g., text processing Schwenk et al. (2017); Zhang et al. (2015), image recognition He et al. (2016a); Huang et al. (2017), and video analysis Liu et al. (2019); Xu et al. (2019)) due to their capability as universal function approximators Goodfellow et al. (2016). However, DNNs are often difficult to train, due in large part to the vanishing gradient phenomenon Glorot and Bengio (2010); Ioffe and Szegedy (2015). The residual network (ResNet) He et al. (2016a) was proposed to alleviate this issue using its key component, the skip connection, which creates a “bypass” path for information propagation He et al. (2016b).
Nevertheless, it remains challenging to achieve robustness in training a very deep DNN. As the DNN becomes deeper, it requires more careful tuning of the model hyperparameters (e.g., learning rate) and initialization of the model weights to perform well. This issue can be mitigated using normalization techniques such as batch normalization (BN) Ioffe and Szegedy (2015), layer normalization Ba et al. (2016), and group normalization Wu and He (2018), among which BN is the most widely used. BN normalizes the inputs of each layer to enable robust training of DNNs. It has been shown that BN provides a smoothing effect on the optimization landscape Santurkar et al. (2018), thus ameliorating the issue of a vanishing gradient Ioffe and Szegedy (2015).
Unfortunately, we have empirically observed that even when a very deep DNN is coupled with BN, it can potentially (and surprisingly) experience an exploding gradient in its deeper layers, thus impairing its robustness in training. Indeed, our experiments in Section 4.1 reveal that BN does not help in robustly training ResNets with depths 49 and 110 on the AG-NEWS text and the CIFAR-10 vision datasets, respectively. Furthermore, noisy data (e.g., images with mosaics and texts with spelling errors) can adversely impact the training procedure of a DNN with BN, thus degrading its robustness in generalization.
This paper presents a simple yet principled approach to boosting the robustness of ResNet in both training and generalization, which is motivated by a dynamical systems perspective Chen et al. (2018); Lu et al. (2018); Ruthotto and Haber (2018). Namely, we can view a DNN from the perspective of a partial differential equation (PDE). This naturally inspires us to characterize ResNet by an explicit Euler method. Theoretically, we show that the step factor $h$ in the Euler method can serve to control the robustness of ResNet in both training and generalization. Specifically, we prove that a small $h$ can benefit the training robustness from the view of back-propagation; from the view of forward-propagation, a small $h$ can help the generalization robustness. We conduct experiments on both vision and text datasets, i.e., CIFAR-10 Krizhevsky (2009) and AG-NEWS Zhang et al. (2015). The experiments demonstrate that a small $h$ can benefit both the training and the generalization robustness. We also confirm that a small $h$ can benefit the training robustness even without BN. To sum up, the empirical results validate the efficacy of a small step factor $h$ in achieving a robust ResNet.
2 Related literature
Before delving into robust ResNet (Section 3), we provide an overview of several prerequisites, including ResNet, batch normalization and partial differential equations.
2.1 Brief summary of ResNet
Residual network (ResNet) is a variant of the deep neural network, which exhibits compelling accuracy and convergence properties He et al. (2016a). Its key component is the skip connection. ResNet consists of many stacked residual blocks, and its general form is shown below:
$$y_n = x_n + \mathcal{F}(x_n, W_n), \qquad x_{n+1} = I(y_n),$$
where $x_n$ and $x_{n+1}$ are the input and output of the $n$-th block, $\mathcal{F}$ is a residual block with parameters $W_n$ containing a feature transformation, e.g., a convolutional operation or an affine transformation, and $I$ is an element-wise operation, e.g., the ReLU function Nair and Hinton (2010) or the identity mapping He et al. (2016b).
2.2 Batch normalization
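Batch normalization (BN), which normalizes the layer input over each training mini-batch, has been widely adopted for training deep neural networks Ioffe and Szegedy (2015). Specifically, for the input of a layer with dimension $d$, $x = (x^{(1)}, \ldots, x^{(d)})$, BN first normalizes each scalar feature independently, $\hat{x}^{(k)} = (x^{(k)} - \mu^{(k)}) / \sigma^{(k)}$, where the mean $\mu^{(k)}$ and the standard deviation $\sigma^{(k)}$ are computed over a mini-batch of size $m$. Then, it performs an affine transformation of each normalized value by $\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$, where $\gamma^{(k)}$ and $\beta^{(k)}$ are learned during training.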
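The two BN steps described above (normalize over the mini-batch, then apply a learned affine map) can be sketched for a single scalar feature as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `batch_norm`, the default values of `gamma` and `beta`, and the small constant `eps` added for numerical stability are assumptions.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one scalar feature over a mini-batch, then apply an affine map.

    gamma and beta stand in for the learned parameters; eps is an assumed
    stabilizer that keeps the division well defined for constant batches.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # population variance over the batch
    std = math.sqrt(var + eps)
    return [gamma * (x - mean) / std + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the default `gamma` and `beta`, the returned values have approximately zero mean and unit variance over the mini-batch.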
2.3 Characterizing deep neural network by a partial differential equation
Deep neural networks (DNNs) represent the underlying nonlinear feature transformations, so that the transformed features can be matched with their target values, such as categorical labels for classification and continuous quantities for regression. Namely, a DNN transforms the input data through multiple layers, and the features in the last layer should be linearly separable Haber and Ruthotto (2017).
Turning to the dynamical systems view, a partial differential equation (PDE) can characterize the motion of particles Atkinson (2008); Ascher (2008). Motivated by this viewpoint, we can characterize the feature transformations in a DNN as a system of first-order PDEs, which can be formulated as:
$$\dot{x}(t) = f(t, x(t)),$$
where $t$ is the time along the feature transformations and $x(t) \in \mathbb{R}^d$ is the feature vector of dimension $d$.
Given initial data as the input, a PDE gives rise to the initial value problem (IVP):
$$\dot{x}(t) = f(t, x(t)), \qquad x(t_0) = x_0,$$
where $x_0$ is the specified initial input feature vector and $x(t)$ is the transformed feature at time $t$.
A solution to an IVP is a function of time $t$. For a simple IVP, we can easily derive the analytic solution, as shown in Figure 1. However, for more complicated PDEs, it is not always feasible to obtain an analytic solution. Rather, their solutions are approximated by numerical methods, of which the explicit Euler method (Definition 2 in the next section) is the most classic example.
3 Towards Robust ResNet
Residual network (ResNet) is one exemplar of DNN. ResNet can be described by the explicit Euler method (Definition 2), which naturally brings us an Euler-viewed ResNet (Eq. (3.1)). Inspired by the stability of the Euler method, we emphasize the efficacy of the step factor $h$ for ResNet, which enables ResNet to have smoother feature transformations. Such smoothness means that the features change only slightly between adjacent blocks during the forward propagation. This smoothness has two benefits: preventing information explosion over the depth (Section 3.2), and filtering noise in the input features (Section 3.3). Together, these drive the significant robustness gains in ResNet.
3.1 Connections of ResNet with the Euler method
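Consider the following definition of the explicit Euler method.
Definition 2 (explicit Euler method). For a step factor $h > 0$ and time points $t_{n+1} = t_n + h$, the solution of the PDE $\dot{x}(t) = f(t, x(t))$ is approximated by
$$x_{n+1} = x_n + h\, f(t_n, x_n). \tag{3}$$
Given the solution $x_n$ at some time $t_n$, the PDE tells “in which direction to continue”. At time $t_n$, the explicit Euler method computes this direction $f(t_n, x_n)$ and follows it over a small time step $h$ (from $t_n$ to $t_n + h$).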
In order to obtain a reasonable approximation, the step factor $h$ in Euler’s method has to be chosen to be “small enough”. Its magnitude depends on the PDE and its specified initial input. For instance, in Figure 1, we use the Euler method with various step factors $h$ to approximate the solution of an IVP. The approximation of $x(t)$ clearly fluctuates more with large $h$; hence, a smaller $h$ approximates the true function better.
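The effect of the step factor on the explicit Euler method can be reproduced with a few lines of code. The sketch below uses a hypothetical test problem, $\dot{x}(t) = -10\,x(t)$ with $x(0) = 1$ and exact solution $e^{-10t}$; this is an assumption chosen for illustration and is not the IVP plotted in Figure 1. The helper names `explicit_euler` and `decay` are likewise illustrative.

```python
import math

def explicit_euler(f, x0, t0, t_end, h):
    """Integrate x'(t) = f(t, x) from t0 to t_end with a fixed step factor h."""
    steps = round((t_end - t0) / h)
    t, x = t0, x0
    for _ in range(steps):
        x = x + h * f(t, x)  # follow the direction f(t_n, x_n) for one step
        t = t + h
    return x

def decay(t, x):
    # Hypothetical test problem (not from the paper): x' = -10 x, x(0) = 1.
    return -10.0 * x

exact = math.exp(-10.0)  # exact solution at t = 1
errors = {
    h: abs(explicit_euler(decay, 1.0, 0.0, 1.0, h) - exact)
    for h in (0.25, 0.05, 0.01)
}
```

For this problem each Euler step multiplies the iterate by $1 - 10h$, so $h = 0.25$ gives a factor of $-1.5$: the iterates flip sign and grow, while the smaller step factors decay towards the true solution.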
Taking $h = 1$ and the residual block $\mathcal{F}$ as the direction function $f$, we can use the Euler method to characterize the forward propagation of the original ResNet structure Haber and Ruthotto (2017). Specifically, to generalize ResNet by the Euler method (Eq. (3)), we characterize the forward propagation of an Euler-viewed ResNet as follows:
$$y_n = x_n + h\,\mathcal{F}(x_n, W_n, \mathrm{BN}_n), \qquad x_{n+1} = I(y_n). \tag{3.1}$$
At layer $n$, $x_n$ is the input feature to the $n$-th residual block $\mathcal{F}$. $\mathrm{BN}_n$ denotes the parameters of batch normalization, which is applied directly after (or before) the transformation function $W_n$, i.e., convolutional operations or an affine transformation matrix. The function $I$ is an element-wise operation, e.g., the identity mapping or ReLU.
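To make the role of the step factor concrete, the forward recursion $x_{n+1} = I(x_n + h\,\mathcal{F}(x_n))$ can be sketched in plain Python. The residual branch below (an element-wise `tanh` of a scalar affine map) is a hypothetical stand-in for the convolutional or affine blocks, BN is omitted, and the names `residual_fn` and `euler_resnet_forward` are illustrative assumptions.

```python
import math

def residual_fn(x, w, b):
    """Toy residual branch F: element-wise tanh(w * x + b) (hypothetical, no BN)."""
    return [math.tanh(w * xi + b) for xi in x]

def euler_resnet_forward(x, params, h=0.1, activation=lambda v: v):
    """Apply x_{n+1} = I(x_n + h * F(x_n, W_n)) once per block in params."""
    for w, b in params:
        fx = residual_fn(x, w, b)
        x = [activation(xi + h * fi) for xi, fi in zip(x, fx)]
    return x

params = [(0.5, 0.1)] * 10  # ten blocks with identical, made-up parameters
x0 = [1.0, -2.0]
out_small = euler_resnet_forward(x0, params, h=0.1)
out_plain = euler_resnet_forward(x0, params, h=1.0)  # h = 1 mirrors the original ResNet update
```

Because this toy branch is bounded by 1 in magnitude, each block moves a feature by at most $h$, so with $h = 0.1$ the ten blocks change each feature by less than 1 in total, whereas $h = 1$ permits a change of up to 10.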
As analysed in the chapter “analysis of the Euler method” Butcher (2016), for any fixed time interval, the Euler method approximation becomes more refined with smaller $h$. However, it is not obvious whether a small $h$ can help the robustness of a deep ResNet in terms of training and generalization.
To answer this question, we prove that the reduced step factor $h$ in Eq. (3.1) can boost the robustness of ResNet. Specifically, the benefits of this new parameter are as follows. (a) Training robustness (Section 3.2): A small $h$ helps the information back-propagation during training, and thus prevents its explosion over depth. (b) Generalization robustness (Section 3.3): With greater smoothness in the feature forward-propagation, the noise in the features is reduced rather than amplified, thus mitigating the negative effects of noise and enhancing generalization. We now theoretically explore the efficacy of a small $h$ for training and generalization robustness.
3.2 Training robustness: Importance of small $h$ on information back-propagation
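To simplify the analysis, let $I$ be the identity mapping operation in Eq. (3.1), i.e., $x_{n+1} = y_n$. Recursively, we have
$$x_N = x_n + h \sum_{i=n}^{N-1} \mathcal{F}(x_i, W_i, \mathrm{BN}_i) \tag{5}$$
for any deeper block $N$ and any shallower block $n$. Eq. (5) is used to analyse the information back-propagation. Denoting the loss function as $L$, by the chain rule, we have:
$$\frac{\partial L}{\partial x_n} = \frac{\partial L}{\partial x_N} \frac{\partial x_N}{\partial x_n} = \frac{\partial L}{\partial x_N} \left( 1 + h \frac{\partial}{\partial x_n} \sum_{i=n}^{N-1} \mathcal{F}(x_i, W_i, \mathrm{BN}_i) \right). \tag{6}$$
Eq. (6) shows that the back-propagated information $\frac{\partial L}{\partial x_n}$ can be decomposed into two additive terms: a term $\frac{\partial L}{\partial x_N}$ (the first term) that propagates information directly without going through any weight layers, and a term $\frac{\partial L}{\partial x_N} \left( h \frac{\partial}{\partial x_n} \sum_{i=n}^{N-1} \mathcal{F}(x_i, W_i, \mathrm{BN}_i) \right)$ (the second term), which propagates through the weight layers between $n$ and $N$.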
The first term, as stated by He et al. (2016b), ensures that information does not vanish. However, the second term can blow up the back-propagated information, especially when the weights of the layers are large. The standard approach for tackling this is to apply batch normalization (BN) after the transformation functions Ioffe and Szegedy (2015).
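Let us first examine the effects of BN. For analysis purposes, we examine a single residual block in which BN, an element-wise operation, is applied to the transformed features. For a mini-batch of size $m$, there are $m$ transformed features with mean vector $\mu$ and standard deviation vector $\sigma$. Differentiating the normalized features with respect to the block input introduces an element-wise factor of $1/\sigma$. Taking $\sigma_{\min}$ as the smallest element of $\sigma$ therefore yields an upper bound on the block's contribution to the back-propagated information that is inversely proportional to $\sigma_{\min}$. Summing over the blocks gives the bound referred to as Eq. (9).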
As observed by Santurkar et al. (2018), $\sigma$ tends to be large, thus BN has the positive effect of constraining the explosive back-propagated information (Eq. (9)). However, when networks go deeper, $\sigma$ tends to be highly uncontrolled in practice, and the back-propagated information is still accumulated over the depth and can again blow up. For example, Figure 2 (Section 4.1) shows that, when we train a depth-110 ResNet on CIFAR-10 and a depth-49 ResNet on AG-NEWS, the performance is significantly worse than that of their shallower counterparts. A reduced $h$ can serve to re-constrain the explosive back-propagated information. Moreover, even without BN, reducing $h$ can stabilize the training procedure. In other words, our reduced $h$ offers enhanced robustness even without BN. We carry out experiments supporting this observation in Section 4.1.
Let us take a special case to illustrate the importance of small $h$ on information back-propagation.
Let the shallower block be the first block and the deeper block be the last residual block. Suppose that, at any transformation block, the contribution of the block to the back-propagated information is upper bounded by a common constant that also bounds the effects of BN. Then the back-propagated information is bounded by a term that grows with the depth.
Note that the growth factor of this term is bigger than 1, so the back-propagated information explodes exponentially w.r.t. the depth. This hurts the training robustness of ResNet. However, reducing $h$ gives extra control over the back-propagated information, since the increasing term is constrained by $h$ and the back-propagated information will not explode. This tells us that, when ResNet becomes deeper, we should reduce $h$. It also tells us that, even without BN, training a deeper network is still possible.
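The exponential growth over depth, and the control offered by the step factor, can be checked on a toy model. The sketch below assumes a scalar linear residual network $x_{n+1} = x_n + h\,w\,x_n$ with identity $I$ and no BN, for which the gradient of the output with respect to the input is exactly $(1 + hw)^D$ after $D$ blocks. The weight `w = 0.5` and the helper name `input_gradient` are hypothetical, and this closed form is a property of the toy model, not a restatement of the bound above.

```python
def input_gradient(depth, h, w):
    """d x_D / d x_0 for the scalar linear net x_{n+1} = x_n + h * w * x_n."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + h * w  # per-block Jacobian: identity path plus h times residual path
    return grad

# Compare the original update (h = 1) with a reduced step factor (h = 0.1).
for depth in (32, 56, 110):
    print(depth, input_gradient(depth, 1.0, 0.5), input_gradient(depth, 0.1, 0.5))
```

At depth 110 the gradient factor for $h = 1$ is many orders of magnitude larger than for $h = 0.1$, which is the qualitative behaviour the special case describes.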
3.3 Generalization robustness: Importance of small $h$ on information forward-propagation
The reduced $h$ gives extra control over the feature transformations in ResNet. Namely, it makes the feature transformations smoother, i.e., $\|x_{n+1} - x_n\|$ is smaller, as features go through ResNet block by block. In other words, it forces $x_{n+1}$ to change less relative to $x_n$.
More importantly, as features forward propagate through the deep ResNet, the negative effects of the input noise are reduced and constrained over the depth. We theoretically show that a reduced $h$ can help stabilize the output of ResNet against the input noise, where the input $x_0$ can be a vector of pixels or a vectorization of texts. Suppose we have a perturbed copy $\tilde{x}_0$ of $x_0$; we expect the output for $\tilde{x}_0$ to be close to the output for $x_0$.
Let $x_N$ be the transformed feature of ResNet starting at the input feature $x_0$, and $\tilde{x}_N$ the transformed feature starting at $\tilde{x}_0$, where $\tilde{x}_0$ is a perturbed copy of $x_0$.
Eq. (11) shows that the noise in the input features is amplified along the depth of ResNet. However, with the introduction of a reduced $h$ (e.g., from 1 to 0.1), we limit the noise amplification and provide extra capacity for filtering out the input noise.
Let us take a special case to illustrate the importance of small $h$ on information forward-propagation.
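Consider the deeper block to be the last residual block in ResNet, and suppose that the transformation at every intermediate block satisfies a common bound. Then the noise at the last block is bounded in terms of the input noise, the step factor $h$, and the depth (Eq. (12)).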
Its proof follows directly from Eq. (11). Eq. (12) tells us that a small $h$ can compensate for the negative effects of noise accumulated over the depth. In other words, when ResNet is deeper, we expect $h$ to be smaller, while when ResNet is shallower, we allow $h$ to be larger.
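A toy experiment illustrates how the step factor limits noise amplification in the forward pass. The sketch assumes a scalar residual branch $\mathcal{F}(x) = \tanh(w x)$, which is Lipschitz with constant $w$, so the gap between a clean and a perturbed input can grow by at most a factor of $1 + hw$ per block. The branch, the weight `w = 2.0`, the input `0.0`, and the noise `0.01` are hypothetical choices; the per-block factor is the standard Lipschitz bound for this toy recursion and is not claimed to be the exact form of Eq. (11) or Eq. (12).

```python
import math

def forward(x, depth, h, w=2.0):
    """Scalar Euler-viewed ResNet with a hypothetical residual branch tanh(w * x)."""
    for _ in range(depth):
        x = x + h * math.tanh(w * x)
    return x

def output_gap(x0, eps, depth, h):
    """Distance between the outputs for the clean input x0 and the noisy input x0 + eps."""
    return abs(forward(x0 + eps, depth, h) - forward(x0, depth, h))

gap_plain = output_gap(0.0, 0.01, depth=10, h=1.0)  # h = 1, as in the original ResNet
gap_small = output_gap(0.0, 0.01, depth=10, h=0.1)  # reduced step factor
```

In this toy setting the perturbation is amplified far more with $h = 1$ than with $h = 0.1$; both gaps stay below the Lipschitz bound $0.01\,(1 + hw)^{10}$.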
Note that, for a given depth of ResNet, we cannot make the step factor $h$ arbitrarily small. In the limiting case $h = 0$, there would be no transformation from the initial feature to the final state. In this case, the noise during the transformation is perfectly bounded, but all useful transformations are smoothed out as well. In addition, if $h$ is too small, we probably need more layers (increased depth) in ResNet so that the transformed features can still be matched with their target values. In this case, we increase the computational cost by introducing more layers.
4 Experiments
In this section, we conduct experiments on the vision dataset CIFAR-10 Krizhevsky (2009) and the text dataset AG-NEWS Zhang et al. (2015). We also employ a synthetic binary dataset TWO-MOON in Section 4.2 to illustrate input noise filtering along the forward propagation. In Sections 4.1 and 4.2, we fix our step factor $h$ and compare it with the original ResNet, which corresponds to $h = 1$. We discuss how to select the step factor $h$ in Section 4.3.
For the vision dataset CIFAR-10, the residual block contains two 2-D convolutional operations He et al. (2016a). For the text dataset AG-NEWS, the residual block contains two 1-D convolutional operations Schwenk et al. (2017). For the synthetic dataset TWO-MOON, the residual block contains two affine transformation matrices.
To handle the dimensional mismatch in the shortcut connections of ResNet, we adopt the same practice as in He et al. (2016a) for CIFAR-10 and Schwenk et al. (2017) for AG-NEWS, which use convolutional layers with a kernel size of one to match the dimensions.
4.1 Small $h$ helps training robustness
Small $h$ helps on deeper networks:
We train ResNet with a small $h$ (our method) and with $h = 1$ (the existing method) over various depths (other hyperparameters remain the same). Each configuration has 5 trials with different random seeds. We report the median test accuracy against the number of epochs on the datasets. The standard deviation is plotted as the shaded area.
Figure 2 shows that, on both the vision and the text datasets, our method (small $h$) outperforms the existing method (corresponding to $h = 1$), especially when the networks go deeper. In particular, for ResNet with depth 110 on CIFAR-10 and ResNet with depth 49 on AG-NEWS, the existing method ($h = 1$) fails to train the networks properly.
Note that the existing method degrades as the depth of ResNet increases. For example, in the CIFAR-10 experiments, ResNets with depth 56 and depth 110 perform worse than their counterparts with depth 32 and depth 44. In the AG-NEWS experiments, ResNets with depth 29 and depth 49 perform worse than ResNets with depth 9 and depth 17. However, our method (small $h$) reduces the variance of the test accuracy as the depth increases. For example, in the CIFAR-10 experiments, our method shows a lower variance over the depth, with test accuracy improving from depth 32 to depth 110. The blue shaded area (small $h$) is much thinner than the red one ($h = 1$). This shows that our method has a smaller variance of the test accuracy over different random seeds, and demonstrates that a small $h$ offers training robustness.
To sum up, as theoretically shown in Section 3.2, the reduced $h$ can prevent the explosion of back-propagated information over the depth, thus making the training procedure of ResNet more robust.
Small $h$ helps networks without BN:
Figure 3 shows that, without using BN, training a plain ResNet ($h = 1$) is unstable and exhibits large variance for both the vision CIFAR-10 and the text AG-NEWS datasets. In particular, without using BN, even training a shallow ResNet (depth 32) on CIFAR-10 fails in almost every training trial. However, with a reduced $h$, the training performance improves significantly and exhibits low variance. As theoretically shown in Section 3.2, a reduced $h$ has beneficial effects on top of BN. It helps the back-propagated information by preventing its explosion, and thus enhances the training robustness of deep networks.
4.2 Small $h$ helps generalization robustness
Synthetic data for noise filtering:
To give insights into why a small $h$ can help the generalization robustness of ResNet, let us first consider a synthetic example of using ResNet for a binary classification task, namely separating noisy “red and blue” points in a 2-D plane. We train a vanilla ResNet with $h = 1$ and a ResNet with small $h$ on the training set (the middle of Figure 4), and perform the feature transformations on the test set (the right of Figure 4).
In Figure 5, the features are transformed through forward-propagation in a vanilla ResNet (no BN, $h = 1$). The noise in the input features leads to the mixing of red and blue points, sabotaging the generalization of ResNet. The reason for this phenomenon is that, with large $h$, features undergo violent transformations between two adjacent blocks, and the negative effects of noise are amplified along the depth.
On the other hand, Figure 6 shows that, with small $h$, features undergo smooth transformations at every adjacent residual block, and thus the input noise is gradually filtered out, which leads to the correct classification. As stated in Section 3.3, a small $h$ can help bound the negative effects of noise in the input features. With small $h$, noise amplification along the depth is effectively constrained.
We also made animations visualizing the transformations and conducted experiments comparing the effects of BN on smoothing out the noise. Interested readers may go to the Dropbox link for reference.
Real-world data for noise filtering:
We train ResNet with small $h$ and with $h = 1$ on noisy data (i.e., input perturbations), and test it on clean data. For the training dataset CIFAR-10, we inject Gaussian noise at every normalized pixel with zero mean and a different standard deviation to represent different noise levels. For the training dataset AG-NEWS, we randomly choose different proportions (different noise levels) of characters in the texts and alter them.
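The two perturbation schemes can be sketched as follows. This is a minimal illustration under stated assumptions: the text says only that the chosen characters are altered, so replacing them with random lowercase letters is an assumption, as are the helper names `add_gaussian_noise` and `corrupt_text` and the fixed seed.

```python
import random
import string

def add_gaussian_noise(pixels, std, rng):
    """Add zero-mean Gaussian noise with standard deviation std to each normalized pixel."""
    return [p + rng.gauss(0.0, std) for p in pixels]

def corrupt_text(text, proportion, rng, alphabet=string.ascii_lowercase):
    """Replace a given proportion of character positions with random letters.

    Assumption: the replacement alphabet is lowercase ASCII; a replacement may
    coincide with the original character.
    """
    chars = list(text)
    k = round(proportion * len(chars))
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(alphabet)
    return "".join(chars)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean_text = "video games good for children"
noisy_pixels = add_gaussian_noise([0.0, 0.5, 1.0], std=0.1, rng=rng)
noisy_text = corrupt_text(clean_text, 0.3, rng)
```

Here the noise level is the standard deviation for images and the proportion of altered characters for text, matching the two notions of noise level used in the experiments.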
Figure 7 shows that, at different noise levels, our method (small $h$) consistently outperforms the original method (corresponding to $h = 1$). In particular, for CIFAR-10 at the noise levels 0.01 and 0.1, our method outperforms the original method by a large margin. We observe that ResNet with reduced $h$ has a smaller variance than its counterpart under different noise levels. In other words, our method is robust to training on noisy input because it bounds the negative effects of noise. By taking smooth transformations, it gradually filters out the noise in the input features along the forward propagation of ResNet. Thus, our method offers better generalization robustness.
4.3 How to select step size
Last but not least, we perform a grid search over $h$ to explore the optimal training and generalization robustness.
Selection of $h$ for training robustness:
To search for training robustness, we train a fixed-depth ResNet with various $h$. Figure 8 shows the proper $h$ for CIFAR-10 (depth-110 ResNet) and for AG-NEWS (depth-49 ResNet).
For a given depth, we should choose a smaller $h$, but not too small. If $h$ is very small, it smooths out useful transformations, thus leading to undesirable performance.
Selection of $h$ for generalization robustness:
To search for generalization robustness, we train ResNet with various $h$ on the noisy input data. Figure 9 shows the proper range of $h$ for noisy CIFAR-10 (depth-110 ResNet) and for noisy AG-NEWS (depth-49 ResNet).
An $h$ that is too small, e.g., $h = 0.001$ or $h = 0.01$, leads to undesirable performance. As discussed in Section 3.3, $h$ cannot be too small if the network is to match the target values at the final layer of ResNet.
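The selection protocol (train with several values of the step factor, repeat over random seeds, compare the median and the spread) can be sketched as a small grid-search helper. Everything here is illustrative: `select_step_factor` and `toy_evaluate` are hypothetical names, the candidate grid is made up, and `toy_evaluate` is a stand-in for training a fixed-depth ResNet and returning its test accuracy. Its peak at `h = 0.1` is built into the toy function and is not an experimental result.

```python
import math
import statistics

def select_step_factor(evaluate, candidates, seeds):
    """Return the h with the best median score and a per-h (median, std) summary."""
    summary = {}
    for h in candidates:
        scores = [evaluate(h, seed) for seed in seeds]
        summary[h] = (statistics.median(scores), statistics.pstdev(scores))
    best_h = max(summary, key=lambda h: summary[h][0])
    return best_h, summary

def toy_evaluate(h, seed):
    # Hypothetical stand-in for "train a fixed-depth ResNet with step factor h
    # under this seed and return its test accuracy". The score is highest at
    # h = 0.1 by construction, with a small seed-dependent jitter.
    jitter = 0.001 * seed
    return -(math.log10(h) + 1.0) ** 2 + jitter

best_h, summary = select_step_factor(
    toy_evaluate, candidates=(0.001, 0.01, 0.1, 1.0, 10.0), seeds=range(5)
)
```

Using the median over seeds mirrors the way the test accuracy is reported in Section 4.1, and keeping the per-value standard deviation makes it possible to prefer a step factor that is both accurate and stable across trials.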
5 Conclusion
This paper proposed a simple but principled approach to enhancing the robustness of ResNet. Motivated by the dynamical systems view, we characterize ResNet by an explicit Euler method. We theoretically find that the step factor $h$ in the Euler method can control the robustness of ResNet in both training and generalization. From the view of back-propagation, we prove that a small $h$ can benefit the training robustness, while from the view of forward-propagation, we prove that a small $h$ can help the generalization robustness. We conduct comprehensive experiments on the vision CIFAR-10 and text AG-NEWS datasets. The experiments confirm that a small $h$ can benefit the training and generalization robustness. Future work can explore several promising directions: (a) how to transfer the experience of small $h$ to other network structures, e.g., RNNs for natural language processing Cho et al. (2014), (b) how to handle noisy labels Han et al. (2018), and (c) other means of choosing the step factor $h$, e.g., using Bayesian optimization Hennig and Schuler (2012); Srinivas et al. (2010); Daxberger and Low (2017); Hoang et al. (2018).
References
- Ascher (2008) Numerical methods for evolutionary differential equations. Computational Science and Engineering, Vol. 5, SIAM.
- Atkinson (2008) An introduction to numerical analysis. 2nd edition, John Wiley & Sons.
- Ba et al. (2016) Layer normalization. arXiv:1607.06450.
- Butcher (2016) Numerical methods for ordinary differential equations. John Wiley & Sons.
- Chen et al. (2018) Neural ordinary differential equations. In Proc. NeurIPS, pp. 6572–6583.
- Cho et al. (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP, pp. 1724–1734.
- Daxberger and Low (2017) Distributed batch Gaussian process optimization. In Proc. ICML, pp. 951–960.
- Glorot and Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pp. 249–256.
- Goodfellow et al. (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org
- Haber and Ruthotto (2017) Stable architectures for deep neural networks. Inverse Problems 34 (1), pp. 014004.
- Han et al. (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Proc. NeurIPS, pp. 8536–8546.
- He et al. (2016a) Deep residual learning for image recognition. In Proc. IEEE CVPR, pp. 770–778.
- He et al. (2016b) Identity mappings in deep residual networks. In Proc. ECCV, pp. 630–645.
- Hennig and Schuler (2012) Entropy search for information-efficient global optimization. JMLR 13, pp. 1809–1837.
- Hoang et al. (2018) Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI.
- Huang et al. (2017) Densely connected convolutional networks. In Proc. IEEE CVPR, pp. 2261–2269.
- Ioffe and Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pp. 448–456.
- Krizhevsky (2009) Learning multiple layers of features from tiny images. Master’s Thesis, Univ. Toronto.
- Liu et al. (2019) LSTM-based multi-label video event detection. Multimedia Tools and Applications 78 (1), pp. 677–695.
- Lu et al. (2018) Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In Proc. ICML, pp. 3282–3291.
- Nair and Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pp. 807–814.
- Ruthotto and Haber (2018) Deep neural networks motivated by partial differential equations. arXiv:1804.04272.
- Santurkar et al. (2018) How does batch normalization help optimization? In Proc. NeurIPS, pp. 2488–2498.
- Schwenk et al. (2017) Very deep convolutional networks for text classification. In Proc. EACL, pp. 1107–1116.
- Srinivas et al. (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. ICML, pp. 1015–1022.
- Wu and He (2018) Group normalization. In Proc. ECCV, pp. 3–19.
- Xu et al. (2019) Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology.
- Zhang et al. (2015) Character-level convolutional networks for text classification. In Proc. NeurIPS, pp. 649–657.
6 Supplementary Materials
6.1 Details of configurations of our experiments
We will release our code soon.
6.2 Animations of Synthetic data
Animations showing the effects of the step factor $h$ and BN on smoothing out noise can be found at the Dropbox link.
6.3 Visualization on noisy input features
For CIFAR-10, we add random Gaussian noise (zero mean, with standard deviation equal to the noise level) to every normalized pixel of the images. For AG-NEWS, we randomly choose a proportion of the characters in the text equal to the noise level, and randomly alter them. Illustrations are as follows.
video games good for children computer games can promote problem-solving and team-building in children, say games industry experts. (Noise level = 0.0)
vedeo games good for dhildlenzcospxter games can iromote problem-sorvtng and teai-building in children, sby games industry experts. (Noise level = 0.1)
video nawvs zgood foryxhilqretngomvumer games cahcprocotubpnoblex-szbvina and tqlmmbuaddiagjin whipdren, saywgsmes ildustry exmrrts. (Noise level = 0.3)
tmdeo gakec jgopd brr cgildrenjcoogwdeh lxdeu vanspromote xrobkeh-svlkieo and termwwuojvinguinfcojbdses, sacosamlt cndgstoyaagpbrus. (Noise level = 0.5)
vizwszgbrwjtguihcxfoatbhivrrwvq cxmpgugflziwls clfnzrommtohprtblef-solvynx rnjnyiaf-gjwlcergwklskqibdtjn,aoty gameshinzustrm oxpertsdm (Noise level = 0.8)