Residual neural networks (ResNets) are composed of multiple residual blocks that transform the hidden states according to
$$h_{t+1} = h_t + f(h_t; \theta_t), \qquad (1)$$
where $h_t$ is the input to the $t$-th layer and $f(h_t; \theta_t)$ is a non-linear function parameterized by $\theta_t$. Recently, a continuous approximation to the ResNet architecture has been proposed, in which the evolution of the hidden state is described as a dynamical system obeying the equation
$$dh_t = f(h_t, t; \theta)\,dt, \quad \text{i.e.,} \quad h_{t_1} = h_{t_0} + \int_{t_0}^{t_1} f(h_t, t; \theta)\,dt, \qquad (2)$$
where $f(h_t, t; \theta)$ is the continuous form of the nonlinear function, and $h_{t_0}$ and $h_{t_1}$ are hidden states at two different times $t_0 < t_1$. A standard ODE solver can be used to compute all the hidden states and the final state (the output of the neural network), starting from an initial state (the input to the neural network). The continuous neural network described in (2) exhibits several advantages over its discrete counterpart described in (1), in terms of memory efficiency, parameter efficiency, explicit control of the numerical error of the final output, etc.
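To make the correspondence between (1) and (2) concrete, here is a minimal sketch (ours, not the authors' implementation) of solving (2) with the simplest fixed-step method, forward Euler; taking a unit step size recovers exactly the residual update in (1). The linear dynamics below are a toy stand-in for a trained network:

```python
import numpy as np

def odeint_euler(f, h0, t0, t1, n_steps=1000):
    """Integrate dh/dt = f(h, t) by forward Euler.

    With a step size of 1, each update h <- h + f(h, t) is exactly
    the ResNet residual block in (1)."""
    h = np.asarray(h0, dtype=float)
    t, dt = t0, (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h, t)   # residual update scaled by the step size
        t += dt
    return h

# Toy dynamics dh/dt = -h with h(0) = 1; the exact solution is e^{-t}.
h1 = odeint_euler(lambda h, t: -h, h0=[1.0], t0=0.0, t1=1.0)
```

With 1000 steps the Euler solution at $t_1 = 1$ is close to the analytic value $e^{-1} \approx 0.368$; in practice adaptive solvers are used, but the fixed-step sketch suffices to show the ResNet connection.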
One missing component in the current Neural ODE network is the various regularization mechanisms commonly employed in discrete neural networks. These regularization techniques have been demonstrated to be crucial in reducing generalization error and in improving the robustness of neural networks to adversarial attacks. Many of them are based on stochastic noise injection. For instance, dropout is widely adopted to prevent overfitting, and injecting Gaussian random noise during forward propagation is effective in improving generalization [4, 5] as well as robustness to adversarial attacks [6, 7]. However, these regularization methods are not directly applicable to the Neural ODE network, because the Neural ODE network is a deterministic system.
Our work attempts to incorporate the above-mentioned stochastic noise injection based regularization mechanisms into the current Neural ODE network, to improve the generalization ability and the robustness of the network. In this paper, we propose a new continuous neural network framework called the Neural Stochastic Differential Equation (Neural SDE) network, which models stochastic noise injection by stochastic differential equations (SDEs). In this new framework, we can employ existing techniques from the stability theory of SDEs to study the robustness of neural networks. Our results provide theoretical insight into why introducing stochasticity during neural network training and testing leads to improved robustness against adversarial attacks. Furthermore, we demonstrate that, by incorporating the noise injection regularization mechanism into the continuous neural network, we can reduce overfitting and achieve a lower generalization error. For instance, on the CIFAR-10 dataset, we observe that the new Neural SDE can improve the test accuracy of the Neural ODE from 81.63% to 84.55%, with other factors unchanged. Our contributions can be summarized as follows:
We propose a new Stochastic Differential Equation (SDE) framework to incorporate randomness in continuous neural networks. The proposed random noise injection can be used as a drop-in component in any continuous neural network. Our Neural SDE framework can model various types of noise widely used for regularization purposes in discrete networks, such as dropout (Bernoulli type) and Gaussian noise.
Training the new SDE network requires a backpropagation approach different from that of the Neural ODE network. We develop a new, efficient backpropagation method to calculate the gradient and to train the Neural SDE network in a scalable way. The proposed method has its roots in stochastic control theory.
We carry out a theoretical analysis of the stability conditions of the Neural SDE network, to prove that the randomness introduced in the Neural SDE network can stabilize the dynamical system, which helps improve the robustness and generalization ability of the neural network.
We verify by numerical experiments that stochastic noise injection in the SDE network can successfully regularize the continuous neural network models, and the proposed Neural SDE network achieves better robustness and improves generalization performance.
Throughout this paper, we use $h_t$ to denote the hidden states in a neural network, where $h_{t_0} = x$ is the input (also called the initial condition) and $y$ is the label. The residual block with parameters $\theta$ can be written as a nonlinear transform $f(h_t, t; \theta)$. We assume the integration is always taken from $t_0$ to $t_1$. $B_t$ is an $m$-dimensional Brownian motion, and $G(h_t, t; \theta')$ is the diffusion matrix parameterized by $\theta'$. Unless stated explicitly, we use $\|\cdot\|$ to represent the $\ell_2$-norm for vectors and the Frobenius norm for matrices.
2 Related work
Our work is inspired by the success of the recent Neural ODE network, and we seek to improve the generalization and robustness of Neural ODE, by adding regularization mechanisms crucial to the success of discrete networks. Regularization mechanisms such as dropout cannot be easily incorporated in the Neural ODE due to its deterministic nature.
The basic idea of Neural ODE is discussed in the previous section; here we briefly review the relevant literature. The idea of formulating ResNet as a dynamical system has been discussed before. A framework was proposed to link existing deep architectures with discretized numerical ODE solvers, and was shown to be parameter efficient. These networks adopt a layer-wise architecture: each layer is parameterized by different, independent weights. The Neural ODE model computes hidden states in a different way: it directly models the dynamics of hidden states by an ODE solver, with the dynamics parameterized by a shared model. A memory-efficient approach to compute the gradients by adjoint methods was developed, making it possible to train large, multi-scale generative networks [10, 11]. Our work can be regarded as an extension of this framework, with the purpose of incorporating a variety of noise-injection-based regularization mechanisms. Stochastic differential equations in the context of neural networks have been studied before, focusing either on understanding how dropout shapes the loss landscape, or on using stochastic differential equations as a universal function approximation tool to learn the solutions of high-dimensional PDEs. In contrast, our work explains why adding random noise boosts the stability of deep neural networks, and demonstrates the improved generalization and robustness.
Noisy Neural Networks
Adding random noise to different layers is a technique commonly employed in training neural networks. Stochastic depth [14] randomly drops some residual blocks of a residual network during training. Another successful regularization for ResNet is Shake-Shake regularization, which sets a binary random variable to randomly switch between two residual blocks during training. More recently, DropBlock was designed specifically for convolutional layers: unlike dropout, it drops contiguous regions of the hidden states rather than isolated units. All of the above regularization techniques are proposed to improve generalization performance; one common characteristic is that the network is fixed at testing time. There is another line of research that focuses on improving robustness to perturbations/adversarial attacks by noise injection. Among these, random self-ensemble [6, 7] adds Gaussian noise to hidden states during both training and testing time. At training time, it works as a regularizer to prevent overfitting; at testing time, the random noise is also helpful, which will be explained in this paper.
3 Neural Stochastic Differential Equation
In this section, we first introduce our proposed Neural SDE to improve the robustness of Neural ODE. Informally speaking, Neural SDE can be viewed as using randomness as a drop-in augmentation for Neural ODE, and it can include some widely used randomization layers such as dropout and Gaussian noise layers. However, solving a Neural SDE is non-trivial, so we derive the gradients of the loss with respect to the model weights. Finally, we theoretically analyze the stability conditions of Neural SDE.
Before delving into the multi-dimensional SDE, let us first look at a 1-d toy example to see how an SDE can solve the instability issue of an ODE. Suppose we have a simple SDE, $dx_t = a x_t\,dt + \sigma x_t\,dB_t$, with $B_t$ the standard Brownian motion. We provide a numerical simulation in Figure 1 for a fixed $a > 0$ with different values of $\sigma$.
When we set $\sigma = 0$, the SDE becomes an ODE and $x_t = C e^{a t}$, where $C$ is an integration constant. If $a > 0$ we can see that $x_t \to \infty$ as $t \to \infty$. Furthermore, a small perturbation in the initial value will be amplified exponentially through the factor $e^{a t}$. This clearly shows the instability of the ODE. On the other hand, if we instead make $\sigma > 0$ (the system is an SDE), we have $x_t = x_0 \exp\big((a - \sigma^2/2)\,t + \sigma B_t\big)$, which converges to $0$ almost surely whenever $a < \sigma^2/2$.
The toy example in Figure 1 reveals that the behavior of solution paths can change significantly after adding a stochastic term. This example is inspiring because we can control the impact of perturbations on the output by adding a stochastic term to neural networks.
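The qualitative behavior in Figure 1 can be reproduced by simulating the toy SDE with the Euler–Maruyama scheme. The sketch below is our illustration; the constants $a = 1$, $\sigma \in \{0, 2\}$ and horizon $T = 5$ are arbitrary choices, not the paper's settings:

```python
import numpy as np

def euler_maruyama(x0, a, sigma, T=5.0, n=1000, seed=0):
    """Simulate dx = a*x dt + sigma*x dB on [0, T] by Euler-Maruyama.

    For this linear SDE each step multiplies x by (1 + a*dt + sigma*dB),
    so the whole path is a product of independent factors."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dB = rng.normal(0.0, np.sqrt(dt), size=n)   # Brownian increments
    return x0 * np.prod(1.0 + a * dt + sigma * dB)

# sigma = 0: the ODE path grows like e^{aT}, amplifying any perturbation.
ode_final = euler_maruyama(1.0, a=1.0, sigma=0.0)

# sigma = 2 gives a - sigma**2/2 < 0, so a typical path decays toward 0
# even though the drift alone is unstable; we look at the median over seeds.
sde_finals = [euler_maruyama(1.0, a=1.0, sigma=2.0, seed=s) for s in range(200)]
sde_typical = np.median(np.abs(sde_finals))
```

The deterministic path ends near $e^{5} \approx 148$, while the median stochastic path is driven close to zero, matching the almost-sure decay predicted by $a < \sigma^2/2$.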
Figure 2 shows a sample Neural SDE model architecture; it is the one used in the experiments. It consists of three parts: the first part is a single convolution block, followed by a Neural SDE network (we explain the details of Neural SDE in Section 3.1), and lastly a linear classifier. We put most of the trainable parameters into the second part (Neural SDE), whereas the first/third parts are mainly for increasing/reducing the dimension as desired. Recall that both Neural ODE and Neural SDE are dimension preserving.
3.1 Modeling randomness in neural networks
In the Neural ODE system (2), a slightly perturbed input state will be amplified in deep layers (as shown in Figure 1), which makes the system unstable to input perturbations and prone to overfitting. Randomness is an important component in discrete networks (e.g., dropout for regularization) that tackles this issue; however, to our knowledge, there is no existing work on adding randomness to continuous neural networks. And it is non-trivial to encode randomness in continuous neural networks such as Neural ODE, as we need to consider how to add the randomness so as to guarantee robustness, and how to solve the continuous system efficiently. To address these challenges, motivated by [9, 12], we propose to add a single diffusion term into the Neural ODE:
$$dh_t = f(h_t, t; \theta)\,dt + G(h_t, t; \theta')\,dB_t, \qquad (3)$$
where $B_t$ is the standard Brownian motion, a continuous-time stochastic process such that the increment $B_{t+s} - B_s$ follows a Gaussian distribution with mean $0$ and variance $t$; $G(h_t, t; \theta')$ is a transformation parameterized by $\theta'$. This formula is quite general and can include many existing randomness injection models with residual connections under different forms of $G$. As examples, we briefly list some of them below.
Gaussian noise injection:
Consider a simple example of (3) in which $G(h_t, t; \theta')$ is a diagonal matrix; then we can model both additive and multiplicative noise as
$$dh_t = f(h_t, t; \theta)\,dt + \Sigma(t)\,dB_t \ \text{(additive)} \quad \text{or} \quad dh_t = f(h_t, t; \theta)\,dt + \Sigma(t)\,\mathrm{diag}(h_t)\,dB_t \ \text{(multiplicative)},$$
where $\Sigma(t)$ is a diagonal matrix whose diagonal elements control the variance of the noise added to the hidden states. This can be viewed as a continuous approximation of noise injection techniques in discrete neural networks. For example, the discrete version of the additive noise can be written as
$$h_{t+1} = h_t + f(h_t; \theta_t) + \Sigma_t\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I),$$
which injects Gaussian noise after each residual block. It has been shown that injecting small Gaussian noise can be viewed as a regularization in neural networks [4, 5]. Furthermore, [6, 7] recently showed that adding a slightly larger noise in one or all residual blocks can improve the adversarial robustness of neural networks. We will provide the stability analysis of (3) in Section 3.3, which offers a theoretical explanation of the robustness of Neural SDE.
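One noisy residual step of the discrete counterpart can be sketched as follows (our illustration; a linear map stands in for the residual block $f$, and a scalar noise scale replaces the diagonal matrix):

```python
import numpy as np

def residual_step_gaussian(h, f, sigma, rng):
    """One discrete residual step with additive Gaussian noise injection:
    h_{t+1} = h_t + f(h_t) + sigma * xi, with xi ~ N(0, I)."""
    return h + f(h) + sigma * rng.normal(size=h.shape)

rng = np.random.default_rng(0)
h = np.ones(4)
f = lambda x: 0.5 * x                                       # toy residual block
clean = residual_step_gaussian(h, f, sigma=0.0, rng=rng)    # sigma=0: plain ResNet step
noisy = residual_step_gaussian(h, f, sigma=0.1, rng=rng)
```

Setting `sigma=0` recovers the plain residual update $h + f(h)$, so the noise injection is indeed a drop-in augmentation.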
Our framework can also model the dropout layer, which randomly disables some neurons in the residual blocks. Let us see how to unify dropout under our Neural SDE framework. First we notice that, in the discrete case,
$$h_{t+1} = h_t + f(h_t; \theta_t) \odot \frac{r_t}{1-p}, \qquad (6)$$
where $r_t$ is a vector of i.i.d. Bernoulli$(1-p)$ random variables and $\odot$ indicates the Hadamard (element-wise) product. Note that we divide by $1-p$ in (6) to maintain the same expectation. Furthermore, we can rewrite (6) as
$$h_{t+1} = h_t + f(h_t; \theta_t) + f(h_t; \theta_t) \odot \Big(\frac{r_t}{1-p} - \mathbf{1}\Big),$$
where the last term is zero-mean noise whose magnitude is proportional to $f(h_t; \theta_t)$; approximating it with Gaussian noise yields an SDE of the form (3) whose diffusion term is proportional to the drift, $G(h_t, t) \propto \mathrm{diag}\big(f(h_t, t; \theta)\big)$.
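A sketch of the discrete dropout step (our code; `p` is the drop probability), which also checks numerically that the $1/(1-p)$ rescaling preserves the expectation:

```python
import numpy as np

def residual_step_dropout(h, f, p, rng):
    """Residual step with dropout on the residual branch:
    h_{t+1} = h_t + f(h_t) * r / (1 - p), with r_i ~ Bernoulli(1 - p).
    The division by (1 - p) keeps E[h_{t+1}] = h_t + f(h_t)."""
    mask = (rng.random(h.shape) > p).astype(float)   # keep with prob. 1 - p
    return h + f(h) * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(3)
f = lambda x: x                                      # toy residual block
samples = np.stack([residual_step_dropout(h, f, 0.5, rng) for _ in range(20000)])
avg = samples.mean(axis=0)   # should be close to h + f(h) = 2
```

Each output coordinate is either $1$ (branch dropped) or $3$ (branch kept and rescaled), and averaging many draws recovers the noiseless update $h + f(h) = 2$.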
3.2 Back-propagating through SDE integral
To optimize the parameters $\theta$ and $\theta'$, we need to back-propagate through the Neural SDE system. A straightforward solution is to rely on the autograd method derived from the chain rule. However, for a Neural SDE the chain can be fairly long: if the SDE solver discretizes the range $[t_0, t_1]$ into $N$ intervals, then the chain has $N$ nodes and the memory cost is $O(N)$. The challenging part of backpropagation for Neural SDE is thus to calculate the gradient through the SDE solver without this high memory cost. To solve this issue, we first calculate the expected loss conditioned on the initial value $x$, denoted as $L(x)$. Then our goal is to calculate $\partial L(x)/\partial x$ (and the analogous gradients with respect to the parameters). In fact, we have the following theorem (also called the path-wise gradient [18, 19]).
3.3 Robustness of Neural SDE
In this section, we theoretically analyze the stability of Neural SDE, showing that the randomness term can indeed improve the robustness of the model against small input perturbations. This also explains why noise injection can improve the robustness of discrete networks, which has been observed in the literature [6, 7]. First we need to show the existence and uniqueness of the solution to (3); we pose the following assumptions on the drift $f$ and the diffusion $G$.
(1) $f$ and $G$ grow at most linearly: for all $h$ and $t \in [t_0, t_1]$, $\|f(h, t; \theta)\| + \|G(h, t; \theta')\| \le C(1 + \|h\|)$ for some constant $C$.
(2) $f$ and $G$ are $L$-Lipschitz: for all $h_1, h_2$ and $t \in [t_0, t_1]$, $\|f(h_1, t; \theta) - f(h_2, t; \theta)\| + \|G(h_1, t; \theta') - G(h_2, t; \theta')\| \le L\|h_1 - h_2\|$.
Based on the above assumptions, we can show that the SDE (3) has a unique solution $h_t$. We remark that the assumption on $f$ is quite natural and is also enforced in the original Neural ODE model; as for the diffusion matrix $G$, we have seen that for dropout, Gaussian noise injection and other random models, both assumptions are automatically satisfied as long as $f$ possesses the same regularities.
We now analyze the dynamics of the perturbation. Our analysis applies not only to the Neural SDE model but also to the Neural ODE model, by setting the diffusion term to zero. First of all, we consider initializing our Neural SDE (3) at two slightly different values $h_{t_0}$ and $h'_{t_0} = h_{t_0} + \epsilon_{t_0}$, where $\epsilon_{t_0}$ is the perturbation of $h_{t_0}$ with $\|\epsilon_{t_0}\|$ small. Under the new perturbed initialization $h'_{t_0}$, the hidden state $h'_t$ at time $t$ follows the same SDE as in (3):
$$dh'_t = f(h'_t, t; \theta)\,dt + G(h'_t, t; \theta')\,dB_t.$$
Here we make an implicit assumption that the Brownian motions have the same sample path for both initializations $h_{t_0}$ and $h'_{t_0}$, i.e., the two trajectories share the increments $dB_t$ w.p. 1. In other words, we focus on the difference of two random processes $h_t$ and $h'_t$ driven by the same underlying Brownian motion, so it is valid to subtract the two equations. Writing $\epsilon_t = h'_t - h_t$, this gives the perturbation dynamics
$$d\epsilon_t = \big[f(h_t + \epsilon_t, t; \theta) - f(h_t, t; \theta)\big]\,dt + \big[G(h_t + \epsilon_t, t; \theta') - G(h_t, t; \theta')\big]\,dB_t. \qquad (12)$$
An important property of (12) is that it admits the trivial solution $\epsilon_t \equiv 0$ when $\epsilon_{t_0} = 0$: both the drift and the diffusion of (12) vanish at $\epsilon_t = 0$, since $f(h_t + 0, t; \theta) - f(h_t, t; \theta) = 0$ and $G(h_t + 0, t; \theta') - G(h_t, t; \theta') = 0$.
The implication of the zero solution is clear: for a neural network, if we do not perturb the input data, then the output will never change. However, the zero solution can be highly unstable, in the sense that for an arbitrarily small perturbation at initialization, the change of the output can be arbitrarily large. On the other hand, as shown below, by choosing the diffusion term properly, we can always control $\epsilon_t$ within a small range.
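The suppression of $\epsilon_t$ can be checked numerically for the 1-d linear model $f(h) = a h$, $G(h) = \sigma h$ by evolving the original and the perturbed trajectory under the same Brownian path, exactly the coupling assumed above (a sketch of ours; the constants are illustrative):

```python
import numpy as np

def perturbation_norm(eps0, a, sigma, T=5.0, n=1000, seed=0):
    """Evolve h and its perturbed copy h' = h + eps0 under the SAME
    Brownian increments and return |eps_T| = |h'_T - h_T|."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dB = rng.normal(0.0, np.sqrt(dt), size=n)   # one shared sample path
    h, hp = 1.0, 1.0 + eps0
    for db in dB:
        h = h + a * h * dt + sigma * h * db
        hp = hp + a * hp * dt + sigma * hp * db
    return abs(hp - h)

# With sigma = 0 the perturbation is amplified roughly e^{aT}-fold;
# with sigma = 2 the typical (median) perturbation is driven toward zero.
grown = perturbation_norm(1e-3, a=1.0, sigma=0.0)
damped = np.median([perturbation_norm(1e-3, a=1.0, sigma=2.0, seed=s)
                    for s in range(200)])
```

The deterministic system blows the initial $10^{-3}$ perturbation up by more than two orders of magnitude, whereas the diffusion term keeps the typical perturbation below its initial size.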
In general, we cannot obtain a closed-form solution to a multidimensional SDE, but we can still analyze the asymptotic stability through the dynamics of the perturbation. This is essentially an extension of Lyapunov stability theory to a stochastic system. First we define the notion of stability in the stochastic case. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space with filtration $\{\mathcal{F}_t\}_{t \ge t_0}$, and let $B_t$ be an $m$-dimensional Brownian motion defined on this probability space. We consider the SDE in Eq. (12) with initial value $\epsilon_{t_0}$, written compactly as
$$d\epsilon_t = \bar{f}(\epsilon_t, t)\,dt + \bar{G}(\epsilon_t, t)\,dB_t, \qquad (14)$$
where $\bar{f}(\epsilon, t) = f(h_t + \epsilon, t; \theta) - f(h_t, t; \theta)$ and $\bar{G}(\epsilon, t) = G(h_t + \epsilon, t; \theta') - G(h_t, t; \theta')$; for simplicity we drop the dependency on the parameters $\theta$ and $\theta'$. We further assume $\bar{f}$ and $\bar{G}$ are both Borel measurable. We can show that if Assumptions (1) and (2) hold for $f$ and $G$, then they hold for $\bar{f}$ and $\bar{G}$ as well (see Lemma A.1 in the Appendix), and so the SDE (14) admits a unique solution $\epsilon_t$. We have the following Lyapunov stability notions, standard in the SDE literature.
Definition 3.1 (Lyapunov stability of SDE).
The solution of (14):
(A) is stochastically stable if for any $\varepsilon \in (0, 1)$ and $r > 0$, there exists a $\delta > 0$ such that $\mathbb{P}\{\|\epsilon_t\| < r \text{ for all } t \ge t_0\} \ge 1 - \varepsilon$ whenever $\|\epsilon_{t_0}\| < \delta$. Moreover, if in addition for any $\varepsilon \in (0, 1)$ there exists a $\delta > 0$ such that $\mathbb{P}\{\lim_{t \to \infty} \epsilon_t = 0\} \ge 1 - \varepsilon$ whenever $\|\epsilon_{t_0}\| < \delta$, it is said to be stochastically asymptotically stable;
(B) is almost surely exponentially stable if $\limsup_{t \to \infty} \frac{1}{t} \log \|\epsilon_t\| < 0$ a.s. ("a.s." abbreviates "almost surely") for all $\epsilon_{t_0} \in \mathbb{R}^d$.
Note that part (A) of Definition 3.1 does not quantify how strong the stability is or how fast the solution reaches equilibrium. In addition, under Assumptions (1, 2), we have a straightforward result: $\mathbb{P}\{\epsilon_t \ne 0 \text{ for all } t \ge t_0\} = 1$ whenever $\epsilon_{t_0} \ne 0$, as shown in the Appendix (see Lemma A.2). That is, almost all sample paths starting from a non-zero initialization can never reach zero, due to the Brownian motion. By contrast, almost sure exponential stability implies that almost all sample paths of the solution approach zero exponentially fast. We present the following theorem on almost sure exponential stability, adapted from the SDE stability literature.
Theorem 3.2. Suppose there exists a non-negative real-valued function $V(\epsilon, t)$ defined on $\mathbb{R}^d \times [t_0, \infty)$ that has continuous partial derivatives $V_t$, $V_\epsilon$ and $V_{\epsilon\epsilon}$, and constants $p > 0$, $c_1 > 0$, $c_2 \in \mathbb{R}$, $c_3 \ge 0$ such that the following inequalities hold:
(i) $c_1 \|\epsilon\|^p \le V(\epsilon, t)$;
(ii) $\mathcal{L}V(\epsilon, t) \le c_2\,V(\epsilon, t)$, where $\mathcal{L}V = V_t + V_\epsilon \bar{f} + \frac{1}{2}\mathrm{tr}\big[\bar{G}^\top V_{\epsilon\epsilon} \bar{G}\big]$ is the generator of (14);
(iii) $\|V_\epsilon(\epsilon, t)\,\bar{G}(\epsilon, t)\|^2 \ge c_3\,V^2(\epsilon, t)$,
for all $\epsilon \ne 0$ and $t > t_0$. Then for all $\epsilon_{t_0} \in \mathbb{R}^d$,
$$\limsup_{t \to \infty} \frac{1}{t} \log \|\epsilon_t\| \le -\frac{c_3 - 2c_2}{2p} \quad a.s. \qquad (15)$$
In particular, if $c_3 > 2c_2$, the solution is almost surely exponentially stable.
We now consider a special case, where the noise is multiplicative and $G(h_t, t) = \sigma h_t$, so that $\bar{G}(\epsilon_t, t) = \sigma \epsilon_t$. The corresponding SDE of the perturbation has the form
$$d\epsilon_t = \bar{f}(\epsilon_t, t)\,dt + \sigma \epsilon_t\,dB_t. \qquad (16)$$
Note that in the deterministic case of (16), obtained by setting $\sigma = 0$, the solution may not be stable in certain cases (see Figure 1). For the general case $\sigma \ne 0$, the following corollary claims that by setting $\sigma$ properly, we achieve an (almost surely) exponentially stable system.
4 Experimental Results
In this section we show the effectiveness of our Neural SDE framework in terms of generalization, non-adversarial robustness and adversarial robustness. We use the SDE model architecture illustrated in Figure 2 throughout the experiments. We set the drift $f(h, t; \theta)$ to be a neural network with several convolution blocks. As for the diffusion $G(h, t; \theta')$, we have the following choices:
Neural ODE: drop the diffusion term, $G \equiv 0$.
Additive noise: the diffusion term is independent of $h_t$; here we simply set it to be diagonal, $G = \sigma I$.
Multiplicative noise: the diffusion term is proportional to $h_t$, i.e. $G = \sigma\,\mathrm{diag}(h_t)$.
Dropout noise: the diffusion term is proportional to the drift term, i.e. $G = \sigma\,\mathrm{diag}\big(f(h_t, t; \theta)\big)$.
Note that the last three are our proposed Neural SDE with different types of randomness, as explained in Section 3.1.
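The four choices of the diffusion term can be summarized in code (a sketch with our own naming; `sigma` is the noise scale and `f_h` denotes the drift value $f(h, t; \theta)$ at the current state):

```python
import numpy as np

def diffusion_term(kind, h, f_h, sigma=0.1):
    """Diffusion G(h, t) for the four model variants compared here."""
    if kind == "ode":               # no diffusion: plain Neural ODE
        return np.zeros_like(h)
    if kind == "additive":          # independent of h: diagonal constant
        return sigma * np.ones_like(h)
    if kind == "multiplicative":    # proportional to the state h
        return sigma * h
    if kind == "dropout":           # proportional to the drift f(h, t)
        return sigma * f_h
    raise ValueError(f"unknown diffusion kind: {kind}")

h = np.array([1.0, 2.0])
f_h = np.array([3.0, -1.0])
```

All four variants share the same drift network; only this one function changes, which is what makes the noise a drop-in component.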
4.1 Generalization Performance
In the first experiment, we show that small noise helps generalization. Note, however, that our noise injection differs from randomness layers in the discrete case: for instance, a dropout layer adds Bernoulli noise at training time but is fixed at testing time, whereas our Neural SDE model keeps the randomness at testing time and averages the outputs of multiple forward propagations.
As for datasets, we choose CIFAR-10, STL-10 and Tiny-ImageNet (downloaded from https://tiny-imagenet.herokuapp.com/) to cover various image sizes and numbers of classes. The experimental results are shown in Table 1. We see that for all datasets, Neural SDE consistently outperforms Neural ODE; the reason is that adding moderate noise to the models at training time acts as a regularizer and thus improves testing accuracy. On top of that, if we further keep the testing-time noise and ensemble the outputs, we obtain even better results.
Table 1: Top-1 accuracy (Accuracy@1) for each dataset and model size, without and with testing-time noise (TTN).
4.2 Improved non-adversarial robustness
In this experiment, we evaluate the robustness of the models under non-adversarial corruptions. The corrupted datasets contain a variety of photographic defects, including motion blur, Gaussian noise, fog, etc. For each noise type, we run Neural ODE and Neural SDE with dropout noise, and gather the testing accuracy. The final results are reported as mean accuracy (mAcc) in Table 2, varying the level of corruption. Both models are trained on completely clean data, which means the corrupted images are not visible to them during the training stage, nor can they augment the training set with the same types of corruptions. From the table, we can see that Neural SDE performs better than Neural ODE in 8 out of 10 cases. For the remaining two, ODE and SDE perform very closely. This shows that our proposed Neural SDE can improve the robustness of Neural ODE on non-adversarially corrupted data.
Table 2: mean accuracy (mAcc) for each dataset and noise type, from mild corruption (Level 1) to severe corruption (Level 5). Corrupted data downloaded from https://github.com/hendrycks/robustness.
4.3 Improved adversarial robustness
Next, we consider the performance of Neural SDE models under adversarial perturbations. Clearly, this scenario is strictly harder than the previous cases: by design, adversarial perturbations are the worst case within a small neighborhood (ignoring the suboptimality of optimization algorithms), crafted through a constrained loss-maximization procedure, so they represent worst-case performance. In our experiments, we adopt the multi-step $\ell_\infty$-PGD attack, although other strong white-box attacks such as C&W are also suitable. The experimental results are shown in Figure 3. As we can see, Neural SDE with either multiplicative noise or dropout noise is more resistant to adversarial attacks than Neural ODE, and dropout noise outperforms multiplicative noise.
4.4 Visualizing the perturbations of hidden states
In this experiment, we take a look at the perturbation $\epsilon_t$ of the hidden state at each time $t$. Recall the 1-d toy example in Figure 1: the perturbation at time $t$ can be well suppressed by adding a strong diffusion term, which is also confirmed by Theorem 3.2. However, it is still questionable whether the same phenomenon exists in a deep neural network, since we cannot add very large noise to the network during training or testing time: if the noise is too large, it will also wash out the useful features. Thus it becomes important to make sure this does not happen to our models. To this end, we first sample an input $x$ from CIFAR-10 and gather all the hidden states $h_t$ at a sequence of time stamps. Then we perform a regular PGD attack to find a perturbation $\delta$ such that $x + \delta$ is an adversarial image, feed the new data into the network again, and obtain the perturbed hidden states $h'_t$ at the same time stamps. Finally, we plot the error $\|h'_t - h_t\|$ with respect to time $t$ (also called "network depth"), shown in Figure 4. We observe that by adding a diffusion term (dropout-style noise), the error accumulates much more slowly than in the ordinary Neural ODE model.
To conclude, we introduce the Neural SDE model which can stabilize the prediction of Neural ODE by injecting stochastic noise. Our model can achieve better generalization and improve the robustness to both adversarial and non-adversarial noises.
We acknowledge the support by NSF IIS1719097, Intel, Google Cloud and AITRICS.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6572–6583, 2018.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  Chris M Bishop. Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1):108–116, 1995.
-  Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural computation, 8(3):643–674, 1996.
-  Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385, 2018.
-  Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
-  Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
-  Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pages 3282–3291, 2018.
-  Lynton Ardizzone, Jakob Kruse, Sebastian Wirkert, Daniel Rahner, Eric W Pellegrini, Ralf S Klessen, Lena Maier-Hein, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730, 2018.
-  Will Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
-  Qi Sun, Yunzhe Tao, and Qiang Du. Stochastic training of residual networks: a differential equation viewpoint. arXiv preprint arXiv:1812.00174, 2018.
-  Maziar Raissi. Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential equations. arXiv preprint arXiv:1804.07010, 2018.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer, 2016.
-  Xavier Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
-  Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10727–10737, 2018.
-  Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pages 65–84. Springer, 2003.
-  Jichuan Yang and Harold J Kushner. A monte carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM journal on control and optimization, 29(5):1216–1249, 1991.
-  Emmanuel Gobet and Rémi Munos. Sensitivity analysis using itô–malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on control and optimization, 43(5):1676–1713, 2005.
-  Xuerong Mao. Stochastic differential equations and applications. Elsevier, 2007.
-  Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
Appendix A Proofs
We present the proofs of the theorems on the stability of SDEs. The proofs are adapted from standard references on stochastic stability. We start with two crucial lemmas.
Lemma A.2. For (14), whenever $\epsilon_{t_0} \ne 0$, $\mathbb{P}\{\epsilon_t \ne 0 \text{ for all } t \ge t_0\} = 1$.
We prove it by contradiction. Let $\tau = \inf\{t \ge t_0 : \epsilon_t = 0\}$. If the lemma is not true, there exists some $\epsilon_{t_0} \ne 0$ such that $\mathbb{P}\{\tau < \infty\} > 0$. Therefore, we can find a sufficiently large constant $T > t_0$ and $\rho > 1$ such that $\mathbb{P}(A) > 0$ for the event $A = \{\tau \le T \text{ and } \|\epsilon_t\| \le \rho - 1 \text{ for all } t_0 \le t \le \tau\}$. By Assumption (2) on $\bar{f}$ and $\bar{G}$, there exists a positive constant $K$ such that
$$\|\bar{f}(\epsilon, t)\| \le K\|\epsilon\| \quad \text{and} \quad \|\bar{G}(\epsilon, t)\| \le K\|\epsilon\| \quad \text{for all } \|\epsilon\| \le \rho,\ t_0 \le t \le T. \qquad (17)$$
Let $V(\epsilon, t) = \|\epsilon\|^{-1}$. Then, for any $0 < \|\epsilon\| \le \rho$ and $t_0 \le t \le T$, we have
$$\mathcal{L}V(\epsilon, t) \le \frac{\|\bar{f}(\epsilon, t)\|}{\|\epsilon\|^2} + \frac{\|\bar{G}(\epsilon, t)\|^2}{\|\epsilon\|^3} \le (K + K^2)\,V(\epsilon, t) =: c\,V(\epsilon, t),$$
where the first inequality comes from Cauchy–Schwarz and the last one comes from (17). For any $0 < \varepsilon < \|\epsilon_{t_0}\|$, we define the stopping time $\tau_\varepsilon = \inf\{t \ge t_0 : \|\epsilon_t\| \notin (\varepsilon, \rho)\}$. Let $\tau_1 = \min(\tau_\varepsilon, T)$. By Itô's formula,
$$\mathbb{E}\big[e^{-c(\tau_1 - t_0)}\,V(\epsilon_{\tau_1}, \tau_1)\big] = V(\epsilon_{t_0}, t_0) + \mathbb{E}\int_{t_0}^{\tau_1} e^{-c(s - t_0)}\big(\mathcal{L}V(\epsilon_s, s) - c\,V(\epsilon_s, s)\big)\,ds \le V(\epsilon_{t_0}, t_0). \qquad (19)$$
Since $\tau_\varepsilon \le \tau \le T$ on the event $A$ (so that $\tau_1 = \tau_\varepsilon$) and $\|\epsilon_{\tau_\varepsilon}\| = \varepsilon$ on $A$ for any $\varepsilon$ small enough, (19) implies
$$\mathbb{E}\big[\mathbf{1}_A\, e^{-c(T - t_0)}\,\varepsilon^{-1}\big] \le \|\epsilon_{t_0}\|^{-1}.$$
Thus, $\mathbb{P}(A) \le e^{c(T - t_0)}\,\varepsilon\,\|\epsilon_{t_0}\|^{-1}$. Letting $\varepsilon \to 0$, we obtain $\mathbb{P}(A) = 0$, which leads to a contradiction. ∎
Proof of Theorem 3.2. By Itô's formula applied to $\log V(\epsilon_t, t)$,
$$\log V(\epsilon_t, t) = \log V(\epsilon_{t_0}, t_0) + \int_{t_0}^{t} \frac{\mathcal{L}V(\epsilon_s, s)}{V(\epsilon_s, s)}\,ds - \frac{1}{2}\int_{t_0}^{t} \frac{\|V_\epsilon(\epsilon_s, s)\,\bar{G}(\epsilon_s, s)\|^2}{V^2(\epsilon_s, s)}\,ds + M(t),$$
where $M(t) = \int_{t_0}^{t} \frac{V_\epsilon(\epsilon_s, s)\,\bar{G}(\epsilon_s, s)}{V(\epsilon_s, s)}\,dB_s$ is a continuous martingale with initial value $M(t_0) = 0$. By the exponential martingale inequality, for any arbitrary $\theta \in (0, 1)$ and $n = 1, 2, \dots$, we have
$$\mathbb{P}\Big\{\sup_{t_0 \le t \le t_0 + n}\Big[M(t) - \frac{\theta}{2}\int_{t_0}^{t} \frac{\|V_\epsilon \bar{G}\|^2}{V^2}\,ds\Big] > \frac{2}{\theta}\log n\Big\} \le \frac{1}{n^2}.$$
Applying the Borel–Cantelli lemma, we get that for almost all $\omega \in \Omega$, there exists an integer $n_0(\omega)$ such that if $n \ge n_0$,
$$M(t) \le \frac{2}{\theta}\log n + \frac{\theta}{2}\int_{t_0}^{t} \frac{\|V_\epsilon \bar{G}\|^2}{V^2}\,ds$$
for all $t_0 \le t \le t_0 + n$, almost surely. Therefore, for almost all $\omega$, if $n \ge n_0$ and $t_0 \le t \le t_0 + n$, combining the above with conditions (ii) and (iii), we have
$$\log V(\epsilon_t, t) \le \log V(\epsilon_{t_0}, t_0) + \frac{2}{\theta}\log n + \Big(c_2 - \frac{(1 - \theta)\,c_3}{2}\Big)(t - t_0),$$
which consequently implies
$$\limsup_{t \to \infty} \frac{1}{t}\log V(\epsilon_t, t) \le c_2 - \frac{(1 - \theta)\,c_3}{2} \quad a.s.$$
With condition (i), $\log \|\epsilon_t\| \le \frac{1}{p}\big(\log V(\epsilon_t, t) - \log c_1\big)$; dividing by $t$, letting $t \to \infty$ and using the arbitrary choice of $\theta$, we obtain (15). ∎