Towards Robust ResNet: A Small Step but A Giant Leap

02/28/2019 ∙ by Jingfeng Zhang, et al.

This paper presents a simple yet principled approach to boosting the robustness of the residual network (ResNet) that is motivated by the dynamical system perspective. Namely, a deep neural network can be interpreted using a partial differential equation, which naturally inspires us to characterize ResNet by an explicit Euler method. Our analytical studies reveal that the step factor h in the Euler method is able to control the robustness of ResNet in both its training and generalization. Specifically, we prove that a small step factor h can benefit the training robustness for back-propagation; from the view of forward-propagation, a small h can aid in the robustness of the model generalization. A comprehensive empirical evaluation on both vision CIFAR-10 and text AG-NEWS datasets confirms that a small h aids both the training and generalization robustness.


1 Introduction

Deep neural networks (DNNs) have reached an unprecedented level of predictive accuracy in several real-world application domains (e.g., text processing Schwenk et al. (2017); Zhang et al. (2015), image recognition He et al. (2016a); Huang et al. (2017), and video analysis Liu et al. (2019); Xu et al. (2019)) due to their capability of universal function approximation Goodfellow et al. (2016). However, DNNs are often difficult to train, due in large part to the vanishing gradient phenomenon Glorot and Bengio (2010); Ioffe and Szegedy (2015). The residual network (ResNet) He et al. (2016a) was proposed to alleviate this issue using its key component known as the skip connection, which creates a "bypass" path for information propagation He et al. (2016b).

Nevertheless, it remains challenging to achieve robustness in training a very deep DNN. As the DNN becomes deeper, it requires more careful tuning of the model hyperparameters (e.g., learning rate) and initialization of the model weights to perform well. This issue can be mitigated using normalization techniques such as

batch normalization (BN) Ioffe and Szegedy (2015), layer normalization Ba et al. (2016), and group normalization Wu and He (2018), among which BN is the most widely used. BN normalizes the inputs of each layer to enable robust training of DNNs. It has been shown that BN provides a smoothing effect on the optimization landscape Santurkar et al. (2018), thus ameliorating the issue of a vanishing gradient Ioffe and Szegedy (2015).

Unfortunately, we have empirically observed that even when a very deep DNN is coupled with BN, it can potentially (and surprisingly) experience an exploding gradient in its deeper layers, thus impairing its robustness in training. As a result, our experiments in Section 4.1 reveal that BN does not help in robustly training ResNets with depths of 49 and 110 on the AG-NEWS text and CIFAR-10 vision datasets, respectively. Furthermore, noisy data (e.g., images with mosaics and texts with spelling errors) can adversely impact the training procedure of a DNN with BN, thus degrading its robustness in generalization.

This paper presents a simple yet principled approach to boosting the robustness of ResNet in both training and generalization, which is motivated by a dynamical systems perspective Chen et al. (2018); Lu et al. (2018); Ruthotto and Haber (2018). Namely, we can view a DNN from the perspective of a partial differential equation (PDE). This naturally inspires us to characterize ResNet by an explicit Euler method. Theoretically, we show that the step factor h in the Euler method can serve to control the robustness of ResNet in both training and generalization. Specifically, we prove that a small h can benefit the training robustness from the view of back-propagation; from the view of forward-propagation, a small h can help the generalization robustness. We conduct experiments on both vision and text datasets, i.e., CIFAR-10 Krizhevsky (2009) and AG-NEWS Zhang et al. (2015). The comprehensive experiments demonstrate that a small h can benefit the training and generalization robustness. We also confirm that the use of a small h can benefit the training robustness even without BN. To sum up, the empirical results validate the efficacy of the small step factor h in the goal towards a robust ResNet.

2 Related literature

Before delving into robust ResNet (Section 3), we provide an overview of several prerequisites, including ResNet, batch normalization and partial differential equations.

2.1 Brief summary of ResNet

Residual network (ResNet) is a variant of the deep neural network that exhibits compelling accuracy and convergence properties He et al. (2016a). Its key component is the skip connection. Thus, ResNet consists of many stacked residual blocks, and its general form is shown below:

x_{l+1} = g(x_l + F(x_l, W_l)),

where x_l and x_{l+1} are the input and output of the l-th block, F is a residual block containing a feature transformation, e.g., a convolutional operation or an affine transformation, and g is an element-wise operation, e.g., the ReLU function Nair and Hinton (2010) or the identity mapping He et al. (2016b).
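For concreteness, the following is a minimal Python (PyTorch) sketch of such a residual block; the two-layer fully connected transformation F and the choice g = ReLU are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal sketch of the general residual block x_{l+1} = g(x_l + F(x_l, W_l)).
# The fully connected F and g = ReLU are illustrative choices.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Residual transformation F(x_l, W_l): here a small two-layer MLP.
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g = nn.ReLU()  # element-wise operation g (identity mapping is another option)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.g(x + self.F(x))  # skip connection: the input bypasses F
```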

Based on the core idea of the skip connection, variants of ResNet have been proposed for specific tasks, e.g., DenseNet for image recognition Huang et al. (2017), and VDCNN with shortcut connections for text classification Schwenk et al. (2017).

2.2 Batch normalization

Batch normalization (BN) has been widely adopted for training deep neural networks Ioffe and Szegedy (2015); it normalizes the layer input over each training mini-batch. Specifically, for a layer input x = (x^(1), …, x^(d)) of dimension d, BN first normalizes each scalar feature independently, x̂^(k) = (x^(k) − μ_B^(k)) / σ_B^(k), where the mean μ_B and standard deviation σ_B are computed over a mini-batch of size m. Then, it performs an affine transformation of each normalized value, y^(k) = γ^(k)·x̂^(k) + β^(k), where γ and β are learned during training.
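A minimal NumPy sketch of this training-time computation (omitting the inference-time moving averages) is shown below.

```python
# Minimal NumPy sketch of batch normalization at training time: per-feature
# normalization over the mini-batch, followed by the learned affine transformation.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learned scale and shift."""
    mu = x.mean(axis=0)                  # per-feature mean over the mini-batch
    sigma = x.std(axis=0)                # per-feature standard deviation
    x_hat = (x - mu) / (sigma + eps)     # normalize each scalar feature independently
    return gamma * x_hat + beta          # affine transformation with learned gamma, beta

x = 3.0 * np.random.randn(128, 64) + 1.0
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # approximately zero mean and unit std
```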

BN has been shown to smooth the optimization landscape of deep neural networks Santurkar et al. (2018). This offers more stable gradients for robust training of deep neural networks, thus effectively ameliorating the vanishing gradient problem Ioffe and Szegedy (2015).

2.3 Characterizing deep neural network by a partial differential equation

A deep neural network (DNN) represents the underlying nonlinear feature transformations, so that the transformed features can be matched with their target values, such as categorical labels for classification and continuous quantities for regression. Namely, a DNN transforms the input data through multiple layers, and the features in the last layer should be linearly separable Haber and Ruthotto (2017).

Turning to the dynamical systems view, a partial differential equation (PDE) can characterize the motion of particles Atkinson (2008); Ascher (2008). Motivated by this viewpoint, we can characterize the feature transformations in a DNN as a first-order PDE, which can be formulated as follows.

Definition 1.

Feature transformations in a DNN can be characterized as the first-order partial differential equation (PDE):

∂x(t)/∂t = f(x(t), t),   (1)

where x(t) ∈ ℝ^d, f: ℝ^d × [0, T] → ℝ^d, and t is the time along the feature transformations, in which t ∈ [0, T]; x(t) is the feature vector of dimension d.

Given initial data x(0) = x_0 as the input, the PDE gives rise to the initial value problem (IVP):

∂x(t)/∂t = f(x(t), t),  x(0) = x_0,   (2)

where x_0 is a specified initial input feature vector and x(T) is the transformed feature at time T.

Figure 1: Given an IVP, we provide its analytic solution and its approximation by the explicit Euler method with different step factors h.

A solution to an IVP is a function of time t. For the simple example IVP shown in Figure 1, we can easily derive its analytic solution. However, for more complicated PDEs, it is not always feasible to obtain an analytic solution. Rather, their solutions are approximated by numerical methods, of which the explicit Euler method (shown in Definition 2 in the next section) is the most classic example.

3 Towards Robust ResNet

Residual network (ResNet) is one exemplar of a DNN. ResNet can be described by the explicit Euler method (Definition 2), which naturally brings us an Euler-viewed ResNet (Eq. (4)). Inspired by the stability of the Euler method, we emphasize the efficacy of the step factor h for ResNet, which enables ResNet to have smoother feature transformations. Such smoothness means that the features take only a small change at every adjacent block during the forward propagation. This smoothness has two-fold benefits: preventing information explosion over the depth (Section 3.2) and filtering noise in the input features (Section 3.3). Together, these drive the significant robustness gains in ResNet.

3.1 Connections of ResNet with the Euler method

Consider the following definition of the explicit Euler method.

Definition 2.

The explicit Euler method Atkinson (2008); Ascher (2008) can be represented as follows:

x_{n+1} = x_n + h·f(x_n, t_n),   (3)

where x_n is the approximation of x(t_n).

Given the solution x(t_n) at some time t_n, the PDE tells us "in which direction to continue". At time t_n, the explicit Euler method computes this direction f(x_n, t_n) and follows it for a small time step h, i.e., t_{n+1} = t_n + h.

In order to obtain a reasonable approximation, the step factor h in Euler's method has to be chosen to be "small enough". Its magnitude depends on the PDE and its specified initial input. For instance, in Figure 1, we apply the Euler method with various step factors h to approximate the IVP. It is clearly shown that the approximation fluctuates more with a large h; hence a smaller h approximates the true function better.
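The following short Python sketch applies the explicit Euler method of Definition 2 with several step factors h to a hypothetical IVP, dx/dt = −5x with x(0) = 1 (the specific IVP of Figure 1 is not reproduced here, so this example is only an assumption that illustrates the same effect).

```python
# Explicit Euler method (Eq. (3)) applied to a hypothetical IVP dx/dt = -5x, x(0) = 1,
# whose analytic solution is x(t) = exp(-5t). The IVP is assumed for illustration.
import math

def euler(f, x0, t0, T, h):
    """Approximate dx/dt = f(x, t), x(t0) = x0 on [t0, T] with step factor h."""
    x, t = x0, t0
    traj = [(t, x)]
    while t < T - 1e-12:
        x = x + h * f(x, t)   # x_{n+1} = x_n + h * f(x_n, t_n)
        t = t + h
        traj.append((t, x))
    return traj

f = lambda x, t: -5.0 * x
for h in (0.5, 0.1, 0.01):
    err = max(abs(x - math.exp(-5.0 * t)) for t, x in euler(f, 1.0, 0.0, 2.0, h))
    print(f"h = {h:<5} max |Euler - analytic| = {err:.4f}")
# A large step factor (h = 0.5) oscillates around the true solution,
# while smaller step factors track it closely, as in Figure 1.
```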

Euler-viewed ResNet:

Taking h = 1 and f(x_n, t_n) = F(x_n, W_n), we can use the Euler method to characterize the forward propagation of the original ResNet structure Haber and Ruthotto (2017). Specifically, to generalize ResNet by the Euler method (Eq. (3)), we characterize the forward propagation of an Euler-viewed ResNet as follows.

Following the conventional analysis He et al. (2016b), ResNet is a stacked structure He et al. (2016a), where its residual block with step factor h performs the following computation:

x_{l+1} = g(x_l + h·F(x_l, W_l)).   (4)

At layer l, x_l is the input feature to the l-th residual block F. γ, β are the parameters of batch normalization, which is applied directly after (or before) the transformation function F, i.e., convolutional operations or an affine transformation matrix. The function g is an element-wise operation, e.g., the identity mapping or ReLU.
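A minimal PyTorch sketch of the Euler-viewed residual block in Eq. (4) is given below; the two 3×3 convolutions for F and placing BN directly after F are illustrative assumptions, not the paper's exact implementation.

```python
# PyTorch sketch of the Euler-viewed residual block of Eq. (4):
# x_{l+1} = g(x_l + h * F(x_l, W_l)), with batch normalization applied after F
# as described in the text. The convolutional F is an illustrative choice.
import torch
import torch.nn as nn

class EulerResidualBlock(nn.Module):
    def __init__(self, channels: int, h: float = 0.1):
        super().__init__()
        self.h = h  # step factor: controls how much each block changes the features
        self.F = nn.Sequential(  # residual transformation F(x_l, W_l)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )
        self.bn = nn.BatchNorm2d(channels)  # BN with parameters gamma, beta applied after F
        self.g = nn.ReLU(inplace=True)      # element-wise operation g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.g(x + self.h * self.bn(self.F(x)))
```

Setting h = 1 recovers an ordinary residual block, while the experiments in Section 4 use a reduced h.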

As analysed in the chapter "Analysis of the Euler method" Butcher (2016), for any fixed time interval [0, T], the Euler method approximation becomes more refined with a smaller h. However, it is not obvious whether a small h helps the robustness of a deep ResNet in terms of training and generalization.

To answer this question, we prove that the reduced step factor h in Eq. (4) can boost its robustness. Specifically, the benefits of this new parameter are as follows. (a) Training robustness (Section 3.2): a small h helps the information back-propagation during training and thus prevents its explosion over the depth. (b) Generalization robustness (Section 3.3): with greater smoothness in the feature forward-propagation, the noise in the features is reduced rather than amplified, thus mitigating the negative effects of noise and enhancing generalization. We now theoretically explore the efficacy of a small h for training and generalization robustness.

3.2 Training robustness: Importance of small h on information back-propagation

To simplify the analysis, let g be the identity mapping in Eq. (4), i.e., x_{l+1} = x_l + h·F(x_l, W_l). Recursively, we have

x_L = x_l + h·Σ_{i=l}^{L−1} F(x_i, W_i)   (5)

for any deeper block L and any shallower block l. Eq. (5) is used to analyse the information back-propagation. Denoting the loss function as E, by the chain rule, we have:

∂E/∂x_l = (∂E/∂x_L)·(1 + h·(∂/∂x_l) Σ_{i=l}^{L−1} F(x_i, W_i)).   (6)

Eq. (6) shows that the back-propagated information ∂E/∂x_l can be decomposed into two additive terms: a term ∂E/∂x_L (the first term) that propagates information directly without going through any weight layers, and a term (the second term) that propagates through the weight layers between blocks l and L.

The first term, as stated by He et al. (2016b), ensures that the information does not vanish. However, the second term can blow up the back-propagated information, especially when the weights of the layers are large. The standard approach for tackling this is to apply batch normalization (BN) after the transformation functions Ioffe and Szegedy (2015).

Let us first examine the effects of BN. For analysis purposes, we examine one residual block, taking g as the identity mapping, y = F(x_l, W_l), z = BN_{γ,β}(y), and x_{l+1} = x_l + h·z, where BN is an element-wise operation. For a batch size of m, the outputs y are transformed with mean vector μ and standard deviation vector σ. Thus, we get

∂z/∂x_l = (∂z/∂y)·(∂y/∂x_l),  with  ∂z^(k)/∂y^(k) = γ^(k)/σ^(k) for each element k.   (7)

Taking σ_min as the smallest one among the elements in σ, we get

‖∂z/∂x_l‖ ≤ (γ/σ_min)·‖∂F(x_l, W_l)/∂x_l‖.   (8)

To sum up, we have

‖∂E/∂x_l‖ ≤ ‖∂E/∂x_{l+1}‖·(1 + h·(γ/σ_min)·‖∂F(x_l, W_l)/∂x_l‖).   (9)

As observed by Santurkar et al. (2018), σ tends to be large, thus BN gives the positive effect of constraining the explosive back-propagated information (Eq. (9)). However, when networks go deeper, the factor γ/σ_min tends to be highly uncontrolled in practice, and the back-propagated information is still accumulated over the depth and can again blow up. For example, Figure 2 (Section 4.1) shows that, when we train a depth-110 ResNet on CIFAR-10 and a depth-49 ResNet on AG-NEWS, the performance is significantly worse compared to their shallower counterparts. Thus, a reduced h can serve to re-constrain the explosive back-propagated information. Hence, even without BN, reducing h can also serve to stabilize the training procedure. In other words, our reduced h offers enhanced robustness even without BN. We carry out experiments supporting this observation in Section 4.1.

Let us take a special case to illustrate the importance of a small h on information back-propagation.

Proposition 1.

Let l = 0 and L, where L is the index of the last residual block. Suppose that at any transformation block l we have ‖∂ BN_{γ,β}(F(x_l, W_l))/∂x_l‖ ≤ c, where c upper bounds the effects of BN, i.e., c ≥ (γ/σ_min)·‖∂F(x_l, W_l)/∂x_l‖. Then, we have ‖∂E/∂x_0‖ ≤ (1 + hc)^L·‖∂E/∂x_L‖.

Proof.

‖∂E/∂x_0‖ = ‖∂E/∂x_L · Π_{l=0}^{L−1} ∂x_{l+1}/∂x_l‖ ≤ ‖∂E/∂x_L‖ · Π_{l=0}^{L−1} (1 + h·‖∂ BN_{γ,β}(F(x_l, W_l))/∂x_l‖) ≤ (1 + hc)^L · ‖∂E/∂x_L‖.   (10)

Note that (1 + hc) is bigger than 1, so the back-propagated information explodes exponentially w.r.t. the depth L. This hurts the training robustness of ResNet. However, reducing h gives extra control of the back-propagated information, since the increasing term (1 + hc)^L is constrained by the small h and the back-propagated information will not explode. This guides us that, when ResNet becomes deeper, we should reduce h. It also informs us that, even without BN, training a deeper network is still possible.
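The exponential effect of the bound in Proposition 1 can be seen with a few lines of arithmetic; the constant c = 0.1 below is an illustrative assumption, not a quantity measured in the paper.

```python
# Arithmetic illustration of Proposition 1: the back-propagated information is
# bounded by (1 + h*c)^L. The constant c = 0.1 is an illustrative assumption.
c = 0.1
for L in (32, 56, 110):
    for h in (1.0, 0.1):
        print(f"L = {L:3d}, h = {h:4.1f}: (1 + h*c)^L = {(1.0 + h * c) ** L:10.3g}")
# With h = 1 the bound explodes with depth (about 3.6e+04 at L = 110),
# while h = 0.1 keeps it moderate (about 3.0 at L = 110).
```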

3.3 Generalization robustness: Importance of small h on information forward-propagation

It is obvious that the reduced h gives extra control of the feature transformations in ResNet. Namely, it makes the feature transformations smoother, i.e., ‖x_{l+1} − x_l‖ is smaller as the features go through ResNet block by block. In other words, it forces x_{l+1} to take a reduced change compared with x_l.

More importantly, as features forward-propagate through the deep ResNet, the negative effects of the input noise should be reduced and constrained over the depth. We theoretically show that a reduced h can help stabilize the output of ResNet against noise in the input x_0, where x_0 can be a vector of pixels or a vectorization of texts. Suppose we have a perturbed copy x̃_0 of x_0; we expect the output x̃_L to be close to x_L. Let x_L be the transformed feature of ResNet starting at the input feature x_0, and x̃_L the one starting at x̃_0, where x̃_0 is a perturbed copy of x_0, i.e., x̃_0 = x_0 + ε_0. From Eq. (5),

‖x̃_L − x_L‖ = ‖x̃_0 − x_0 + h·Σ_{l=0}^{L−1} (F(x̃_l, W_l) − F(x_l, W_l))‖ ≤ ‖ε_0‖ + h·Σ_{l=0}^{L−1} ‖F(x̃_l, W_l) − F(x_l, W_l)‖.   (11)

Eq. (11) shows that the noise in the input features is amplified along the depth of ResNet. However, with the introduction of a reduced h (e.g., from 1 to 0.1), we limit the noise amplification and provide extra capacity for filtering out the input noise.

Let us take a special case to illustrate the importance of a small h on information forward-propagation.

Proposition 2.

Consider a deeper block L as the last residual block in ResNet, and suppose that at any intermediate transformation block l, ‖F(x̃_l, W_l) − F(x_l, W_l)‖ ≤ c·‖x̃_l − x_l‖. The noise at layer l is denoted by ε_l = x̃_l − x_l; thus we have

‖ε_L‖ ≤ (1 + hc)^L·‖ε_0‖.   (12)

Its proof follows directly from Eq. (11). Eq. (12) informs us that a small h can compensate for the negative effects of noise accumulated over the depth L. In other words, when ResNet is deeper (larger L), we expect h to be smaller; when ResNet is shallower, we allow h to be larger.

Note, however, that for a given depth of ResNet, we cannot reduce the step factor h to be infinitely small. In the limiting case of h = 0, there would be no transformation from the initial feature x_0 to the final state x_L. In this case, the noise during the transformation is perfectly bounded, but all useful transformations are smoothed out as well. In addition, if h is too small, we probably need more layers (increased depth) of ResNet to obtain a sufficient approximation, so that the transformed features can be matched with their target values; this increases the computational cost by introducing more layers.
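The following toy NumPy simulation (ours, not the authors') illustrates the forward-noise behaviour of Section 3.3: a clean and a perturbed input are propagated through 100 residual blocks x_{l+1} = x_l + h·F(x_l) with a random linear F (an illustrative assumption), and the divergence of the outputs is compared for h = 1 and h = 0.1.

```python
# Toy simulation of the forward-noise analysis: propagate a clean and a perturbed
# input through L residual blocks x_{l+1} = x_l + h * F(x_l) with a fixed random
# linear F, and compare the divergence of the two outputs for different h.
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 100
W = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]   # fixed residual weights

def forward(x0, h):
    x = x0.copy()
    for Wl in W:
        x = x + h * (Wl @ x)   # Euler-viewed block with g = identity and linear F
    return x

x0 = rng.standard_normal(d)
eps0 = 0.1 * rng.standard_normal(d)                          # input perturbation
for h in (1.0, 0.1):
    gap = np.linalg.norm(forward(x0 + eps0, h) - forward(x0, h))
    print(f"h = {h:4.1f}: ||x~_L - x_L|| = {gap:10.4f}  (||eps_0|| = {np.linalg.norm(eps0):.4f})")
# With h = 1 the input perturbation is amplified by orders of magnitude over the
# 100 blocks, whereas h = 0.1 leaves it nearly unchanged, consistent with Eq. (12).
```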

4 Experiments

In this section, we conduct experiments on the vision dataset CIFAR-10 Krizhevsky (2009) and the text dataset AG-NEWS Zhang et al. (2015). We also employ a synthetic binary dataset TWO-MOON in Section 4.2 to illustrate input noise filtering along the forward propagation. We fix our step factor to h = 0.1 and compare it with the original ResNet, which corresponds to h = 1, in Sections 4.1 and 4.2. We discuss how to select the step factor h in Section 4.3.

For the vision dataset CIFAR-10, the residual block contains two 2-D convolutional operations He et al. (2016a). For the text dataset AG-NEWS, the residual block contains two 1-D convolutional operations Schwenk et al. (2017). For the synthetic dataset TWO-MOON, the residual block contains two affine transformation matrices.

To tackle the dimensional mismatch in the shortcut connections of ResNet, we adopt the same practice as in He et al. (2016a) for CIFAR-10 and Schwenk et al. (2017) for AG-NEWS, which use convolutional layers with a kernel of size one to match the dimensions.
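As a concrete sketch of this dimension-matching practice (assumed, not the authors' code), a residual block whose shortcut uses a kernel-size-1 convolution can be written as follows.

```python
# Sketch of a residual block whose shortcut uses a kernel-size-1 convolution to
# match channels and spatial resolution, following the practice of He et al. (2016a).
import torch
import torch.nn as nn

class DownsampleResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2, h: float = 0.1):
        super().__init__()
        self.h = h
        self.F = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution on the shortcut so the skip path matches the new dimensions.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.shortcut(x) + self.h * self.F(x))
```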

4.1 Small h helps training robustness

Small h helps on deeper networks:

Figure 2: To verify the effectiveness of a small h on increasing the robustness of the training procedure of deep ResNet, we compare ResNet with a reduced step factor (h = 0.1) and the original ResNet (corresponding to h = 1) over various depths (other hyperparameters remain the same). Each configuration has 5 trials with different random seeds. We provide the median test accuracy with the number of epochs on the datasets. The standard deviation is plotted as the shaded color.

Figure 2 shows that, on both the vision and text datasets, our method (h = 0.1) outperforms the existing method (corresponding to h = 1), especially when the networks go deeper. In particular, for ResNet with depth-110 on CIFAR-10 and ResNet with depth-49 on AG-NEWS, the existing method (h = 1) fails to train the networks properly, resulting in substantially degraded test accuracy.

It is noted that the existing method degrades over the increasing depth of ResNet. For example, in the CIFAR-10 experiments, ResNet with depth-56 and depth-110 perform worse than their counterparts with depth-32 and depth-44. In the AG-NEWS experiments, ResNet with depth-29 and depth-49 perform worse than ResNet with depth-9 and depth-17. However, our method (h = 0.1) reduces the variance of the test accuracy over the increasing depth. For example, in the CIFAR-10 experiments, our method shows a lower variance over the depth, with test accuracy improving as the depth increases from 32 to 110. The blue shaded area (h = 0.1) is much thinner compared with the red one (h = 1). This shows that our method (h = 0.1) has a smaller variance of the test accuracy over different random seeds, demonstrating that a small h offers training robustness.

To sum up, as theoretically shown in Section 3.2, the reduced h can prevent the explosion of back-propagated information over the depth, thus making the training procedure of ResNet more robust.

Small h helps networks without BN:

Figure 3 shows that, without using BN, training the plain ResNet (h = 1) is unstable and exhibits large variance on both the vision CIFAR-10 and text AG-NEWS datasets. In particular, without BN, even training a shallow ResNet (depth-32) on CIFAR-10 fails at almost every training trial. However, with a reduced h, the training performance improves significantly and exhibits low variance. As theoretically shown in Section 3.2, a reduced h has beneficial effects on top of (and independent of) BN: it prevents the explosion of the back-propagated information and thus enhances the training robustness of deep networks.

Figure 3: To verify the effectiveness of a small h on improving the robustness of the training procedure without batch normalization (BN), we compare ResNet with a reduced step factor (h = 0.1) and the original ResNet (corresponding to h = 1), both without BN (other hyperparameters remain the same). Each configuration has 5 trials with different random seeds. We provide the median test accuracy with the number of epochs on the datasets. The standard deviation is plotted as the shaded color.

4.2 Small h helps generalization robustness

Synthetic data for noise filtering:

To give insights into why a small h can help the generalization robustness of ResNet, let us first consider a synthetic data example of using ResNet for a binary classification task, namely separating noisy "red and blue" points in a 2-D plane. We train a vanilla ResNet (h = 1) and a ResNet with a reduced h on the noisy training data (the middle of Figure 4), and perform the feature transformations on the test set (the right of Figure 4).

Figure 4: Left: Synthetic binary data TWO-MOON (clean underlying data). Middle: Noisy training data with perturbed input features. Right: Noisy test data. Both noisy training and test data come from the same underlying distribution with the same noise-adding process.
Figure 5: To illustrate the feature transformations through the vanilla ResNet (no BN, h = 1) on the synthetic 2-dim binary data of Figure 4 (right), we plot the transformed features from selected layers 0, 5, 10, 90, 100 of a deep (100-layer) ResNet. Axes ranges increase and shift to visualize the diverging features with increasing depth.

In Figure 5, the features are transformed through forward-propagation in the vanilla ResNet (no BN, h = 1). The noise in the input features leads to the mixing of red and blue points, sabotaging the generalization of ResNet. The reason for this phenomenon is that, with a large h, the features undergo violent transformations between two adjacent blocks, and the negative effects of noise are amplified along the depth.

Figure 6: Feature transformations through the ResNet without BN but with a reduced step factor h on the synthetic 2-dim data of Figure 4 (right). We plot the transformed features from selected layers 0, 50, 100, 450, 500 of a deeper (500-layer) ResNet. Axes ranges shift to better visualize the features with increasing depth.

On the other hand, Figure 6 shows that, with a small h, the features undergo smooth transformations at every adjacent residual block, and thus the input noise is gradually filtered out, which leads to the correct classification. As stated in Section 3.3, a small h can help bound the negative effects of noise in the input features. With a small h, the noise amplification along the depth is effectively constrained.

We also made animations visualizing the transformations and conducted experiments comparing the effects of BN on smoothing out the noise. Interested readers may go to the Dropbox link for reference.

Figure 7: To verify the effectiveness of a small h on the generalization robustness of ResNet, we train on noisy input data with various noise levels, and compare ResNet with a reduced step factor (h = 0.1) and the original ResNet (h = 1) (other hyperparameters remain the same). We provide the best test accuracy with the number of epochs on clean input data for both the vision dataset CIFAR-10 (depth-110 ResNet) and the text dataset AG-NEWS (depth-49 ResNet). Each configuration takes 5 trials, each with a different random seed. The standard deviation is plotted as the shaded color (for better visualization, the shaded color for CIFAR-10 shows a scaled standard deviation).

Real-world data for noise filtering:

We train ResNet with h = 0.1 and h = 1 on noisy data (i.e., with input perturbations), and test it on clean data. For the training dataset CIFAR-10, we inject Gaussian noise at every normalized pixel with zero mean and different standard deviations to represent different noise levels. For the training dataset AG-NEWS, we randomly choose different proportions (different noise levels) of characters in the texts and alter them.
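A sketch of these two noise-injection schemes is given below; the helper names and the restriction to lowercase replacement characters are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of the two noise-injection schemes described above:
# Gaussian pixel noise for CIFAR-10 images, random character alteration for AG-NEWS text.
import random
import string
import numpy as np

def noisy_image(image: np.ndarray, noise_level: float) -> np.ndarray:
    """Add zero-mean Gaussian noise with std = noise_level to every normalized pixel."""
    return image + np.random.normal(0.0, noise_level, size=image.shape)

def noisy_text(text: str, noise_level: float) -> str:
    """Randomly alter a `noise_level` proportion of the characters in the text."""
    chars = list(text)
    k = int(noise_level * len(chars))
    for i in random.sample(range(len(chars)), k):
        chars[i] = random.choice(string.ascii_lowercase)  # replacement alphabet is an assumption
    return "".join(chars)

print(noisy_image(np.random.rand(32, 32, 3), noise_level=0.1).shape)
print(noisy_text("video games good for children", noise_level=0.3))
```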

Figure 7 shows that, at different noise levels, our method (h = 0.1) consistently outperforms the original method (corresponding to h = 1). In particular, for CIFAR-10 at the noise levels 0.01 and 0.1, our method outperforms the original method by a large margin. We also observe that ResNet with a reduced h has a smaller variance compared to its counterpart under different noise levels. In other words, our method is robust to training on noisy input by bounding the negative effects of the noise. By taking smooth transformations, it gradually filters out the noise in the input features along the forward propagation of ResNet. Thus, our method offers better generalization robustness.

4.3 How to select the step size h

Last but not least, we perform a grid search over h to explore optimal training and generalization robustness.

Figure 8: To choose the appropriate h for a fixed depth of ResNet, we perform a grid search of h for CIFAR-10 (depth-110 ResNet) and for AG-NEWS (depth-49 ResNet) (other hyperparameters remain the same). We provide the median test accuracy with the number of epochs. Each configuration takes 5 trials with different random seeds. The standard deviation is plotted as the shaded color (for better visualization, the shaded color for CIFAR-10 shows a scaled standard deviation).

Selection of h for training robustness:

To search for training robustness, we train fixed-depth ResNets with various h. Figure 8 shows the proper h for CIFAR-10 (depth-110 ResNet) and the proper range of h for AG-NEWS (depth-49 ResNet); in both cases, a moderately small h performs best.

For a given depth, we should choose a smaller h, but not one that is too small. If h is very small, e.g., 0.001 or 0.01, it smooths out the useful transformations, thus leading to undesirable performance.
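The grid-search protocol can be summarized by the following sketch; training itself is abstracted into a user-supplied train_and_eval(h, seed) callable, and the candidate values of h are illustrative.

```python
# Sketch of the grid search over the step factor h described in this section.
import statistics

def grid_search_h(train_and_eval, candidate_hs=(0.001, 0.01, 0.1, 0.5, 1.0), n_trials=5):
    """Return the h with the best median test accuracy over n_trials random seeds."""
    medians = {
        h: statistics.median(train_and_eval(h, seed) for seed in range(n_trials))
        for h in candidate_hs
    }
    return max(medians, key=medians.get)

# Example with a dummy objective standing in for a full training run:
print(grid_search_h(lambda h, seed: 1.0 - abs(h - 0.1), n_trials=1))  # -> 0.1
```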

Figure 9: To choose the appropriate h for a fixed depth of ResNet on noisy input, we perform a grid search of h for CIFAR-10 (depth-110 ResNet, noise level = 0.1) and for AG-NEWS (depth-49 ResNet, noise level = 0.5) (other hyperparameters remain the same). We provide the best test accuracy with the number of epochs on clean input data. Each configuration takes 5 trials with different random seeds. The standard deviation is plotted as the shaded color (for better visualization, the shaded color for CIFAR-10 shows a scaled standard deviation).

Selection of h for generalization robustness:

To search for generalization robustness, we train ResNet with various h on the noisy input data. Figure 9 shows the proper ranges of h for noisy CIFAR-10 (depth-110 ResNet, noise level 0.1) and for noisy AG-NEWS (depth-49 ResNet, noise level 0.5); again, moderately small values of h perform best.

A too-small h, e.g., h = 0.001 or 0.01, leads to undesirable performance. As discussed in Section 3.3, h cannot be too small, so that the network performs enough of a transformation to match the target values at the final layer of ResNet.

5 Conclusion

This paper proposed a simple but principled approach to enhance the robustness of ResNet. Motivated by the dynamical systems view, we characterize ResNet by an explicit Euler method. We theoretically find that the step factor h in the Euler method can control the robustness of ResNet in both training and generalization. From the view of back-propagation, we prove that a small h can benefit the training robustness, while from the view of forward-propagation, we prove that a small h can help the generalization robustness. We conduct comprehensive experiments on the vision CIFAR-10 and text AG-NEWS datasets. The experiments confirm that a small h can benefit the training and generalization robustness. Future work can explore several promising directions: (a) how to transfer the experience of a small h to other network structures, e.g., RNNs for natural language processing Cho et al. (2014); (b) how to handle noisy labels Han et al. (2018); and (c) other means for choosing the step factor h, e.g., using Bayesian optimization Hennig and Schuler (2012); Srinivas et al. (2010); Daxberger and Low (2017); Hoang et al. (2018).

References

  • [1] U. M. Ascher (2008) Numerical methods for evolutionary differential equations. Computational Science and Engineering, Vol. 5, SIAM. Cited by: §2.3, Definition 2.
  • [2] K. E. Atkinson (2008) An introduction to numerical analysis. 2 edition, John Wiley & Sons. Cited by: §2.3, Definition 2.
  • [3] L. J. Ba, R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv:1607.06450 Cited by: §1.
  • [4] J. C. Butcher (2016) Numerical methods for ordinary differential equations. John Wiley & Sons. Cited by: §3.1.
  • [5] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Proc. NeurIPS, pp. 6572–6583. Cited by: §1.
  • [6] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP, pp. 1724–1734. Cited by: §5.
  • [7] E. A. Daxberger and B. K. H. Low (2017) Distributed batch Gaussian process optimization. In Proc. ICML, pp. 951–960. Cited by: §5.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pp. 249–256. Cited by: §1.
  • [9] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §1.
  • [10] E. Haber and L. Ruthotto (2017) Stable architectures for deep neural networks. Inverse Problems 34 (1), pp. 014004. Cited by: §2.3, §3.1.
  • [11] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Proc. NeurIPS, pp. 8536–8546. Cited by: §5.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE CVPR, pp. 770–778. Cited by: §1, §2.1, §3.1, §4, §4.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In Proc. ECCV, pp. 630–645. Cited by: §1, §2.1, §3.1, §3.2.
  • [14] P. Hennig and C. J. Schuler (2012) Entropy search for information-efficient global optimization. JMLR 13, pp. 1809–1837. Cited by: §5.
  • [15] T. N. Hoang, Q. M. Hoang, R. Ouyang, and K. H. Low (2018) Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, Cited by: §5.
  • [16] G. Huang, Z. Liu, and K. Q. Weinberger (2017) Densely connected convolutional networks. Proc. IEEE CVPR, pp. 2261–2269. Cited by: §1, §2.1.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. ICML, pp. 448–456. Cited by: §1, §1, §2.2, §2.2, §3.2.
  • [18] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Master’s Thesis, Univ. Toronto. Cited by: §1, §4.
  • [19] A.-A. Liu, Z. Shao, Y. Wong, J. Li, Y.-T. Su, and M. Kankanhalli (2019) LSTM-based multi-label video event detection. Multimedia Tools and Applications 78 (1), pp. 677–695. Cited by: §1.
  • [20] Y. Lu, A. Zhong, Q. Li, and B. Dong (2018) Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In Proc. ICML, pp. 3282–3291. Cited by: §1.
  • [21] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pp. 807–814. Cited by: §2.1.
  • [22] L. Ruthotto and E. Haber (2018) Deep neural networks motivated by partial differential equations. arXiv:1804.04272 Cited by: §1.
  • [23] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Proc. NeurIPS, pp. 2488–2498. Cited by: §1, §2.2, §3.2.
  • [24] H. Schwenk, L. Barrault, A. Conneau, and Y. LeCun (2017) Very deep convolutional networks for text classification. In Proc. EACL, pp. 1107–1116. Cited by: §1, §2.1, §4, §4.
  • [25] N. Srinivas, A. Krause, S. Kakade, and M. W. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. ICML, pp. 1015–1022. Cited by: §5.
  • [26] Y. Wu and K. He (2018) Group normalization. In Proc. ECCV, pp. 3–19. Cited by: §1.
  • [27] N. Xu, A.-A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli (2019) Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §1.
  • [28] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Proc. NeurIPS, pp. 649–657. Cited by: §1, §1, §4.

6 Supplementary Materials

6.1 Details of configurations of our experiments

We will release our code soon.

6.2 Animations of Synthetic data

Animations on the effects of step factor and BN of smoothing out noise can be found at the Dropbox link.

6.3 Visualization on noisy input features

For CIFAR-10, we add random Gaussian noise (zero mean, standard deviation equal to the noise level) to every normalized pixel of the images. For AG-NEWS, we randomly choose a proportion of characters equal to the noise level and randomly alter them. Illustrations are as follows.

video games good for children computer games can promote problem-solving and team-building in children, say games industry experts. (Noise level = 0.0)

vedeo games good for dhildlenzcospxter games can iromote problem-sorvtng and teai-building in children, sby games industry experts. (Noise level = 0.1)

video nawvs zgood foryxhilqretngomvumer games cahcprocotubpnoblex-szbvina and tqlmmbuaddiagjin whipdren, saywgsmes ildustry exmrrts. (Noise level = 0.3)

tmdeo gakec jgopd brr cgildrenjcoogwdeh lxdeu vanspromote xrobkeh-svlkieo and termwwuojvinguinfcojbdses, sacosamlt cndgstoyaagpbrus. (Noise level = 0.5)

vizwszgbrwjtguihcxfoatbhivrrwvq cxmpgugflziwls clfnzrommtohprtblef-solvynx rnjnyiaf-gjwlcergwklskqibdtjn,aoty gameshinzustrm oxpertsdm (Noise level = 0.8)