1 Introduction
Neural ordinary differential equations (Chen et al., 2018) form a family of models that approximate nonlinear mappings by using continuous-time ODEs. Due to their desirable properties, such as invertibility and parameter efficiency, neural ODEs have recently attracted increasing attention (Dupont et al., 2019; Liu et al., 2019). For example, Grathwohl et al. (2018) proposed a neural ODE-based generative model, the FFJORD, to solve inverse problems; Quaglino et al. (2019) used a higher-order approximation of the states in a neural ODE and proposed the SNet to accelerate computation. Along with the wider deployment of neural ODEs, robustness issues come to the fore. However, the robustness of neural ODEs remains unclear. In particular, it is unclear how robust neural ODEs are in comparison to the widely-used CNNs, whose robustness properties have been studied extensively. In this work, we present the first systematic study of the robustness properties of neural ODEs.
To do so, we consider the task of image classification; we expect the results to be similar for other machine learning tasks such as regression. Neural ODEs are dimension-preserving mappings, but a classification model transforms a high-dimensional input, such as an image, into an output whose dimension equals the number of classes. Thus, we consider the neural ODE-based classification network (ODENet), whose architecture is shown in Figure 1. An ODENet consists of three components: a feature extractor (FE) comprising convolutional layers, which maps an input datum to a multi-channel feature map; a neural ODE that serves as the nonlinear representation mapping (RM); and a fully-connected classifier (FCC) that generates a prediction vector based on the output of the RM.
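The FE-RM-FCC pipeline can be sketched in PyTorch as follows. This is a minimal sketch: the layer sizes, the fixed-step Euler solver, and the placeholder dynamics are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ODEBlock(nn.Module):
    """Representation mapping (RM): integrates dz/dt = f(z, t) from t = 0 to t = T
    with a fixed-step Euler solver (for illustration only)."""
    def __init__(self, dynamics, T=1.0, steps=10):
        super().__init__()
        self.dynamics, self.T, self.steps = dynamics, T, steps

    def forward(self, z):
        dt = self.T / self.steps
        t = 0.0
        for _ in range(self.steps):
            z = z + dt * self.dynamics(z, t)   # one Euler step
            t += dt
        return z

class ODENet(nn.Module):
    """FE -> neural-ODE RM -> FCC, mirroring the three components of Figure 1."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.fe = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU())
        dynamics = lambda z, t: torch.tanh(z)  # placeholder; a small conv net in practice
        self.rm = ODEBlock(dynamics)
        self.fcc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, n_classes))

    def forward(self, x):
        return self.fcc(self.rm(self.fe(x)))
```

In an actual ODENet the placeholder dynamics would be a trainable convolutional network, and the Euler loop would be replaced by a proper ODE solver.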
The robustness of a classification model can be evaluated through the lens of its performance on perturbed images. To comprehensively investigate the robustness of neural ODEs, we perturb original images with commonly-used perturbations, namely random Gaussian noise (Szegedy et al., 2013) and harmful adversarial examples (Goodfellow et al., 2014; Madry et al., 2017). We conduct experiments in two common settings: training the model only on authentic non-perturbed images, and training the model on authentic images as well as their Gaussian-perturbed versions. We observe that ODENets are more robust than CNN models against all types of perturbations in both settings. We then provide an insightful understanding of this intriguing robustness of neural ODEs by exploiting a certain property of the flow (Dupont et al., 2019), namely that integral curves that start at distinct initial states are non-intersecting. The flow of a continuous-time ODE is defined as the family of solutions/paths traversed by the state starting from different initial points, and an integral curve is a specific solution for a given initial point. The non-intersecting property implies that an integral curve starting from some point is constrained by the integral curves starting from that point's neighborhood. Thus, in an ODENet, if a correctly classified datum is slightly perturbed, the integral curve associated with its perturbed version does not change too much from the original one. Consequently, the perturbed datum could still be correctly classified. Thus, there exists intrinsic robustness regularization in ODENets, which is absent from CNNs.
Motivated by this property of the neural ODE flow, we attempt to explore a more robust neural ODE architecture by introducing stronger regularization on the flow. We thus propose a Time-invariant steady neural ODE (TisODE). The TisODE removes the time dependence of the dynamics in an ODE and imposes a steady-state constraint on the integral curves. Removing the time dependence of the derivative results in the time-invariant property of the ODE. To wit, given a solution z_1(t), another solution z_2(t), with initial state z_2(0) = z_1(T') for some T' > 0, can be regarded as the T'-shift version of z_1(t). Such a time-invariant property makes it convenient to bound the difference between output states. To elaborate, let the output of a neural ODE correspond to the state at time T. By the time-invariant property, the difference between outputs, ||z_2(T) − z_1(T)||, equals ||z_1(T + T') − z_1(T)||. To control this distance, a steady-state regularization term is introduced into the overall objective to constrain the change of a state after time exceeds T. With the time-invariant property and the steady-state term, we show that the TisODE is even more robust. We do so by evaluating the robustness of TisODE-based classifiers against various types of perturbations and observing that such models are more robust than vanilla ODE-based models.
In addition, other effective architectural solutions have recently been proposed to improve the robustness of CNNs. For example, Xie et al. (2017) randomly resize test images or pad them with zeros to destroy the specific structure of adversarial perturbations. Besides, the model proposed by Xie et al. (2019) contains feature denoising filters to remove the feature-level patterns of adversarial examples. We conduct experiments to show that our proposed TisODE can work seamlessly in conjunction with these methods to further boost the robustness of deep models. Thus, the proposed TisODE can be used as a generally applicable and effective component for improving the robustness of deep models.

In summary, our contributions are as follows. Firstly, we are the first to provide a systematic empirical study on the robustness of neural ODEs, and we find that neural ODE-based models are more robust than conventional CNN models. This finding inspires new applications of neural ODEs in improving the robustness of deep models, a problem that concerns many deep learning theorists and practitioners alike. Secondly, we propose the TisODE method, which is simple yet effective in significantly boosting the robustness of neural ODEs. Moreover, the proposed TisODE can also be used in conjunction with other state-of-the-art robust architectures. Thus, the TisODE can serve as a drop-in module to improve the robustness of deep models effectively.
2 Preliminaries on neural ODE
It has been shown that a residual block (He et al., 2016) can be interpreted as the discrete approximation of an ODE with the discretization step set to one. When the discretization step approaches zero, this yields a family of neural networks called neural ODEs (Chen et al., 2018). Formally, in a neural ODE, the relation between the input and the output is characterized by the following set of equations:
(1)  dz(t)/dt = f_θ(z(t), t),  z(0) = z_in,  z_out = z(T),
where f_θ denotes the trainable layers that are parameterized by weights θ, and z(t) represents the d-dimensional state of the neural ODE. In this case, the input z_in of the neural ODE corresponds to the state at t = 0, and the output z_out is associated with the state at some T > 0. Because f_θ governs how the state changes with respect to time t, we also use f_θ to denote the dynamics of the neural ODE.
Given input z_in, the output z_out can be computed by solving the ODE in (1). If T is fixed, the output depends only on the input z_in and the dynamics f_θ, which corresponds to the weighted layers in the neural ODE. Therefore, the neural ODE can be represented as a d-dimensional function φ of the input and the dynamics, i.e.,

z_out = φ(z_in, f_θ) = z_in + ∫_0^T f_θ(z(t), t) dt.
The terminal time T of the output state is set to be T = 1 in practice. Several methods have been proposed for training neural ODEs, such as the adjoint sensitivity method (Chen et al., 2018), SNet (Quaglino et al., 2019), and the auto-differentiation technique (Paszke et al., 2017). In this work, we use the most straightforward technique, i.e., updating the weights θ with the auto-differentiation technique in the PyTorch framework.
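As a concrete illustration of equation (1) and the residual-block correspondence, the following minimal sketch integrates a toy dynamics with a fixed-step Euler solver; the dynamics and step counts are illustrative assumptions.

```python
import torch

def odeint_euler(f, z0, t0=0.0, t1=1.0, steps=100):
    """Fixed-step Euler solver for dz/dt = f(z, t) with z(t0) = z0 (equation (1))."""
    z, t = z0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        z = z + dt * f(z, t)
        t = t + dt
    return z

f = lambda z, t: -0.5 * z                 # toy linear dynamics (illustrative)

# A single Euler step of size one is exactly a residual update z <- z + f(z, t),
# which is the ResNet <-> ODE correspondence described above.
z0 = torch.ones(3)
res_step = odeint_euler(f, z0, steps=1)   # equals z0 + f(z0) = 0.5 * z0
zT = odeint_euler(f, z0, steps=1000)      # fine discretization: close to exp(-0.5) * z0
```

With autodifferentiation, gradients flow through the Euler loop, so f can contain trainable layers and be updated by ordinary backpropagation.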
3 An empirical study on the robustness of ODENets
Robustness of deep models has gained increased attention, as it is imperative that deep models employed in critical applications, such as healthcare, be robust. The robustness of a model is measured by the sensitivity of its predictions with respect to small perturbations of the inputs. In this study, we consider three commonly-used perturbation schemes, namely random Gaussian perturbations, FGSM adversarial examples (Goodfellow et al., 2014), and PGD adversarial examples (Madry et al., 2017). These perturbation schemes reflect the noise robustness and the adversarial robustness of the investigated models, respectively. We evaluate robustness via the classification accuracy on perturbed images whose original non-perturbed versions are all correctly classified.
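The two adversarial perturbation schemes can be sketched as follows. This is a minimal PyTorch sketch; the toy model, the ε values, and the step sizes are illustrative, not the paper's evaluation settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM (Goodfellow et al., 2014): x_adv = x + eps * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

def pgd(model, x, y, eps, alpha, steps):
    """PGD (Madry et al., 2017): iterated signed-gradient steps, projected back
    onto the l-infinity ball of radius eps around the clean input."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # projection step
    return x_adv.detach()

# toy classifier standing in for an ODENet/CNN under evaluation
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x, y = torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,))
x_fgsm = fgsm(model, x, y, eps=0.3)
x_pgd = pgd(model, x, y, eps=0.3, alpha=0.05, steps=10)
```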
For a fair comparison with conventional CNN models, we make sure that the number of parameters of an ODENet is close to that of its counterpart CNN model. Specifically, the ODENet shares the same network architecture as the CNN model for the FE and FCC parts. The only difference is that, for the RM part, the input of the ODE-based RM is concatenated with one more channel which represents the time t. During the training phase, all hyperparameters are kept the same, including training epochs, learning rate schedules, and weight decay coefficients. Each model is trained three times with different random seeds, and we report the average performance (classification accuracy) together with the standard deviation.
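The extra time channel can be implemented with a convolution whose input is concatenated with a constant channel holding t. This sketch is in the spirit of the ConcatConv2d layer found in the open-source neural ODE code; the name and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class ConcatConv2d(nn.Module):
    """Convolution whose input is concatenated with one extra channel filled with
    the scalar time t, so the dynamics f(z, t) can depend on time."""
    def __init__(self, in_ch, out_ch, ksize=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 1, out_ch, ksize, padding=ksize // 2)

    def forward(self, t, z):
        tt = torch.full_like(z[:, :1, :, :], float(t))   # one channel holding t
        return self.conv(torch.cat([tt, z], dim=1))
```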
3.1 Experimental settings
Datasets: We conduct experiments to compare the robustness of ODENets with that of CNN models on three datasets, i.e., MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and a subset of the ImageNet dataset (Deng et al., 2009). We call this subset ImgNet10, since it is collected from 10 synsets of ImageNet: dog, bird, car, fish, monkey, turtle, lizard, bridge, cow, and crab. We selected 3,000 training images and 300 test images from each synset and resized all images to 128×128.

Architectures: On the MNIST dataset, both the ODENet and the CNN model consist of four convolutional layers and one fully-connected layer. The total number of parameters of each model is around 140k. On the SVHN dataset, the networks are similar to those for MNIST; we only changed the number of input channels of the first convolutional layer to three. On the ImgNet10 dataset, there are nine convolutional layers and one fully-connected layer in both the ODENet and the CNN model. The number of parameters is approximately 280k. In practice, a neural ODE can be solved with different numerical solvers, such as the Euler method and the Runge-Kutta methods (Chen et al., 2018). Here, we use the easily-implemented Euler method in our experiments. To balance computation and the continuity of the flow, we solve the ODE initial value problem in equation (1) by the Euler method with a fixed step size.
Our implementation builds on the open-source neural ODE codes (https://github.com/rtqichen/torchdiffeq). Details on the network architectures are included in the Appendix.

Training: The experiments are conducted in two settings on each dataset: training models only with original non-perturbed images, and training models on original images together with their perturbed versions. In both settings, we add a weight decay term to the training objective to regularize the norm of the weights, since this helps control the model's representation capacity and improves the robustness of a neural network (Sokolić et al., 2017). In the second setting, images perturbed with random Gaussian noise are used to fine-tune the models, because augmenting the dataset with small perturbations can improve the robustness of models, and synthesizing Gaussian noise does not incur excessive computation time.
3.2 Robustness of ODENets trained only on non-perturbed images
The first question we are interested in is how robust ODENets are against perturbations when the model is trained only on original non-perturbed images. We train CNNs and ODENets to perform classification on the three datasets and set the weight decay parameter for all models to 0.0005. We make sure that both the well-trained ODENets and CNN models have satisfactory performance on original non-perturbed images, i.e., around 99.5% on MNIST, 95.0% on SVHN, and 80.0% on ImgNet10.
Since Gaussian noise is ubiquitous in modeling image degradation, we first evaluate the robustness of the models in the presence of zero-mean random Gaussian perturbations. It has also been shown that deep models are vulnerable to harmful adversarial examples, such as those generated by the FGSM (Goodfellow et al., 2014), so we are also interested in how robust ODENets are in the presence of adversarial examples. The standard deviation σ of the Gaussian noise and the ℓ∞ norm of the FGSM attack for each dataset are shown in Table 1.
  Gaussian noise  Adversarial attack
MNIST  σ=50  σ=75  σ=100  FGSM-0.15  FGSM-0.3  FGSM-0.5
CNN  98.1±0.7  85.8±4.3  56.4±5.6  63.4±2.3  24.0±8.9  8.3±3.2
ODENet  98.7±0.6  90.6±5.4  73.2±8.6  83.5±0.9  42.1±2.4  14.3±2.1

SVHN  σ=15  σ=25  σ=35  FGSM-3/255  FGSM-5/255  FGSM-8/255
CNN  90.0±1.2  76.3±2.7  60.9±3.9  29.2±2.9  13.7±1.9  5.4±1.5
ODENet  95.7±0.7  88.1±1.5  78.2±2.1  58.2±2.3  43.0±1.3  30.9±1.4

ImgNet10  σ=10  σ=15  σ=25  FGSM-5/255  FGSM-8/255  FGSM-16/255
CNN  80.1±1.8  63.3±2.0  40.8±2.7  28.5±0.5  18.1±0.7  9.4±1.2
ODENet  81.9±2.0  67.5±2.0  48.7±2.6  36.2±1.0  27.2±1.1  14.4±1.7
From the results in Table 1, we observe that ODENets demonstrate superior robustness to CNNs for all types of perturbations. On the MNIST dataset, in the presence of Gaussian perturbations with a large standard deviation of σ = 100, the ODENet produces much higher accuracy on perturbed images than the CNN model (73.2% vs. 56.4%). For the FGSM-0.3 adversarial examples, the accuracy of the ODENet is around twice as high as that of the CNN model. On the SVHN dataset, ODENets significantly outperform CNN models; e.g., for the FGSM-5/255 examples, the accuracy of the ODENet is 43.0%, which is much higher than that of the CNN model (13.7%). On ImgNet10, for both the largest Gaussian noise level and the FGSM-8/255 attack, the ODENet outperforms the CNN by a large margin of around 9%.
3.3 Robustness of ODENets trained on original images together with Gaussian perturbations
Training a model on original images together with their perturbed versions can improve the robustness of the model. As mentioned previously, Gaussian noise is commonly assumed to be present in real-world images, and synthesizing Gaussian noise is fast and easy. Thus, we add random Gaussian noise to the original images to generate their perturbed versions. ODENets and CNN models are both trained on original images together with their perturbed versions. The standard deviation of the added Gaussian noise is randomly chosen from a small set of dataset-specific values for each of MNIST, SVHN, and ImgNet10. All other hyperparameters are kept the same as above.
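The augmentation described above can be sketched as follows; the pixel scale and the candidate standard deviations are illustrative assumptions.

```python
import torch

def gaussian_augment(x, sigmas):
    """Return a noisy copy of each image in x, with the noise std drawn per-image
    from `sigmas` (pixel scale and candidate stds are illustrative assumptions)."""
    pick = torch.randint(len(sigmas), (x.size(0),))
    sigma = torch.tensor(sigmas, dtype=x.dtype)[pick]
    return x + torch.randn_like(x) * sigma.view(-1, 1, 1, 1)

# a training batch then concatenates clean images with their perturbed versions
x = torch.rand(8, 1, 28, 28) * 255.0        # images on a 0-255 scale (assumed)
batch = torch.cat([x, gaussian_augment(x, [50.0, 75.0, 100.0])], dim=0)
```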
  Gaussian noise  Adversarial attack
MNIST  σ=100  FGSM-0.3  FGSM-0.5  PGD-0.2  PGD-0.3
CNN  98.7±0.1  54.2±1.1  15.8±1.3  32.9±3.7  0.0±0.0
ODENet  99.4±0.1  71.5±1.1  19.9±1.2  64.7±1.8  13.0±0.2

SVHN  σ=35  FGSM-5/255  FGSM-8/255  PGD-3/255  PGD-5/255
CNN  90.6±0.2  25.3±0.6  12.3±0.7  32.4±0.4  14.0±0.5
ODENet  95.1±0.1  49.4±1.0  34.7±0.5  50.9±1.3  27.2±1.4

ImgNet10  σ=25  FGSM-5/255  FGSM-8/255  PGD-3/255  PGD-5/255
CNN  92.6±0.6  40.9±1.8  26.7±1.7  28.6±1.5  11.2±1.2
ODENet  92.6±0.5  42.0±0.4  29.0±1.0  29.8±0.4  12.3±0.6
The robustness of the models is evaluated under Gaussian perturbations, FGSM adversarial examples, and PGD adversarial examples (Madry et al., 2017); the latter is a stronger attack than the FGSM. The ℓ∞ norm of the PGD attack for each dataset is shown in Table 2. Based on the results, we observe that ODENets consistently outperform CNN models on all three datasets. On the MNIST dataset, the ODENet outperforms the CNN against all types of perturbations. In particular, for the PGD-0.2 adversarial examples, the accuracy of the ODENet (64.7%) is much higher than that of the CNN (32.9%). Besides, for the PGD-0.3 attack, the CNN is completely misled by the adversarial examples, but the ODENet can still classify perturbed images with an accuracy of 13.0%. On the SVHN dataset, ODENets also show superior robustness in comparison to CNN models; for all the adversarial examples, ODENets outperform CNN models by a margin of at least 10 percentage points. On the ImgNet10 dataset, the ODENet also performs better than the CNN model against all forms of adversarial examples.
3.4 Insights on the robustness of ODENets
From the results in Sections 3.2 and 3.3, we find that ODENets are more robust than CNN models. Here, we attempt to provide an intuitive understanding of the robustness of the neural ODE. In an ODENet, given some datum, the FE extracts an informative feature map from the datum. The neural ODE, serving as the RM, takes the feature map as input and performs a nonlinear mapping. In practice, we use the weight decay technique during training, which regularizes the norm of the weights in the FE part, so that the change of the feature map induced by a small perturbation of the input can be controlled. We aim to show that, in the neural ODE, a small change in the feature map will not lead to a large deviation from the original output associated with the feature map.
Theorem 1 (ODE integral curves do not intersect (Dupont et al., 2019; Younes, 2010)).
Let z_1(t) and z_2(t) be two solutions of the ODE in (1) with different initial conditions, i.e., z_1(0) ≠ z_2(0). Then, it holds that z_1(t) ≠ z_2(t) for all t ∈ (0, ∞).
To illustrate this theorem, consider a simple 1-dimensional system in which the state is a scalar. As shown in Figure 2, equation (1) has a solution z_1(t) starting from z_1(0) = z_in, where z_in is the feature of some datum. Equation (1) also has another two solutions z_2(t) and z_3(t), whose starting points z_2(0) and z_3(0) are both close to z_1(0). Suppose z_1(0) is between z_2(0) and z_3(0). By Theorem 1, we know that the integral curve z_1(t) is always sandwiched between the integral curves z_2(t) and z_3(t).
Now, let ε = min{|z_1(0) − z_2(0)|, |z_1(0) − z_3(0)|}. Consider a solution z_A(t) of equation (1) whose integral curve starts from a point z_A(0) in the neighborhood of z_1(0) with |z_A(0) − z_1(0)| < ε. By Theorem 1, we know that |z_A(T) − z_1(T)| ≤ |z_2(T) − z_3(T)|. In other words, if any perturbation smaller than ε is added to the scalar z_1(0), the deviation from the original output z_1(T) is bounded by the distance between z_2(T) and z_3(T). In contrast, in a CNN model, there is no such bound on the deviation from the original output. Thus, we opine that, due to this non-intersecting property, ODENets are intrinsically robust.
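The sandwiching argument can be checked numerically. The sketch below integrates a 1-dimensional ODE from three nearby initial points with a fine Euler discretization; the dynamics is an arbitrary smooth example chosen for illustration, not from the paper.

```python
import numpy as np

def euler(f, z0, T=1.0, steps=1000):
    """Integrate the scalar ODE dz/dt = f(z, t) with the Euler method; return z(T)."""
    z = float(z0)
    dt = T / steps
    for k in range(steps):
        z = z + dt * f(z, k * dt)
    return z

f = lambda z, t: np.sin(z) + 0.1 * t   # an arbitrary smooth (Lipschitz) dynamics
z1_T = euler(f, 1.00)                  # curve starting from the "feature" z_in
z2_T = euler(f, 0.95)                  # neighbouring curve starting just below
z3_T = euler(f, 1.05)                  # neighbouring curve starting just above
# Theorem 1 (non-intersection): the middle curve stays sandwiched between its
# neighbours, so any initial perturbation smaller than 0.05 keeps the output
# between z2_T and z3_T
ordered = z2_T < z1_T < z3_T
```

Because each Euler update z → z + dt·f(z, t) is strictly increasing in z for small dt, the ordering of the three curves is preserved at every step.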
4 TisODE: boosting the robustness of neural ODEs
In the previous section, we presented an empirical study on the robustness of ODENets and observed that ODENets are more robust than CNN models. In this section, we explore how to further boost the robustness of the vanilla neural ODE model. This motivates our proposal of the time-invariant steady neural ODE (TisODE).
4.1 Time-invariant steady neural ODEs
From the discussion in Section 3.4, the key to improving the robustness of neural ODEs is to control the difference between neighboring integral curves. By Gronwall's inequality (Howard, 1998) (see Theorem 2 in the Appendix), we know that the difference between two terminal states is bounded by the difference between the initial states multiplied by the exponential of the dynamics' Lipschitz constant. However, it is very difficult to bound the Lipschitz constant of the dynamics directly. Alternatively, we propose to achieve the goal of controlling the output deviation in two steps: (i) removing the time dependence of the dynamics and (ii) imposing a certain steady-state constraint.
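For completeness, the Gronwall-type bound referenced above (stated formally as Theorem 2 in the Appendix) can be written as follows, assuming the dynamics f_θ(·, t) is L-Lipschitz in its first argument:

```latex
\|z_2(t) - z_1(t)\| \;\le\; e^{Lt}\,\|z_2(0) - z_1(0)\|, \qquad t \in [0, T].
```

Since the factor e^{LT} grows exponentially in L, and L is hard to estimate or constrain for a deep network, steps (i) and (ii) sidestep bounding L directly.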
In the neural ODE characterized by equation (1), the dynamics f_θ(z(t), t) depends on both the state z(t) at time t and the time t itself. In contrast, if the neural ODE is modified to be time-invariant, the time dependence of the dynamics is removed. Consequently, the dynamics depends only on the state z(t). So, we can rewrite the dynamics function as f_θ(z(t)), and the neural ODE is characterized as
(2)  dz(t)/dt = f_θ(z(t)),  z(0) = z_in,  z_out = z(T).
Let z_1(t) be a solution of (2) and let ε > 0 be a small positive value. We define the set M = {z' | z' = z_1(t') for some t' ∈ [0, T], and ||z' − z_1(0)|| ≤ ε}. This set contains all points on the curve of z_1(t) during [0, T] that are also inside the ε-neighborhood of z_1(0). For some element z_1(t') ∈ M, let z_2(t) be the solution of (2) which starts from z_2(0) = z_1(t'). Then we have
(3)  z_2(t) = z_1(t + t')
for all t ≥ 0. The property shown in equation (3) is known as the time-invariant property. It indicates that the integral curve z_2(t) is a t'-shift of z_1(t) (Figure 3).
We can regard z_2(0) = z_1(t') as a slightly perturbed version of z_1(0), and we are interested in how large the difference between the outputs z_1(T) and z_2(T) is. In a robust model, this difference should be small. By equation (3), we have z_2(T) = z_1(T + t'). Since t' ∈ [0, T], the difference between z_1(T) and z_2(T) can be bounded as follows,
(4)  ||z_2(T) − z_1(T)|| = ||z_1(T + t') − z_1(T)|| = ||∫_T^{T+t'} f_θ(z_1(t)) dt|| ≤ || ∫_T^{2T} |f_θ(z_1(t))| dt ||,
where all norms are ℓ_2 norms and |f_θ| denotes the element-wise absolute value of the vector-valued function f_θ. That is to say, the difference between z_1(T) and z_2(T) can be bounded using only information about the curve z_1(t). For any t' ∈ [0, T] with z_1(t') ∈ M, consider the integral curve z_2(t) that starts from z_1(t'). The difference between the output state z_2(T) of this curve and z_1(T) satisfies inequality (4).
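Both the shift property (3) and the bound (4) can be verified numerically for an autonomous scalar ODE; on a shared Euler grid the shift holds exactly. The dynamics below is an arbitrary smooth example, not from the paper.

```python
import numpy as np

def euler_traj(f, z0, T, steps):
    """Euler trajectory of the autonomous ODE dz/dt = f(z) on [0, T] (equation (2))."""
    dt = T / steps
    traj = [float(z0)]
    for _ in range(steps):
        traj.append(traj[-1] + dt * f(traj[-1]))
    return np.array(traj)                    # steps + 1 grid points

f = lambda z: -z + np.sin(z)                 # a time-invariant (autonomous) toy dynamics
T, steps = 1.0, 100
dt = T / steps
z1 = euler_traj(f, 2.0, 2 * T, 2 * steps)    # z1(t) on [0, 2T]
k = 30                                       # grid index of the shift t' = k * dt
z2 = euler_traj(f, z1[k], T, steps)          # z2 starts from z1(t'), runs on [0, T]

# time-invariant property (3): z2(t) = z1(t + t'), exact on the shared Euler grid
shift_ok = np.allclose(z2, z1[k:k + steps + 1])

# bound (4): |z2(T) - z1(T)| <= int_T^{2T} |f(z1(t))| dt (left Riemann sum here)
lhs = abs(z2[-1] - z1[steps])
rhs = np.sum(np.abs([f(z) for z in z1[steps:2 * steps]])) * dt
```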
Therefore, we propose to add the following term to the loss function when training the time-invariant neural ODE:

(5)  L_ss = (1/N) Σ_{i=1}^{N} || ∫_T^{2T} |f_θ(z_i(t))| dt ||,

where N is the number of samples in the training set and z_i(t) is the solution whose initial state z_i(0) equals the feature of the i-th sample. The regularization term L_ss is termed the steady-state loss. The terminology "steady state" is borrowed from the dynamical systems literature: in a stable dynamical system, the state stabilizes around a fixed point, known as the steady state, as time tends to infinity. If we can ensure that L_ss is small, then for each sample, the outputs of all the points in M will stabilize around z_i(T). Consequently, the model is robust. This modification of the neural ODE is dubbed the Time-invariant steady neural ODE (TisODE).
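A minimal sketch of the TisODE forward pass and its steady-state loss, assuming a fixed-step Euler solver and a toy dynamics; the solver, the layer sizes, and the name tisode_forward are illustrative assumptions.

```python
import torch
import torch.nn as nn

def tisode_forward(f, z0, T=1.0, steps=10):
    """Euler integration of the time-invariant ODE dz/dt = f(z) (eq. (2)) from 0
    to 2T. Returns the output state z(T) and a discretized steady-state loss,
    an Euler estimate of || int_T^{2T} |f(z(t))| dt || as in eq. (5)."""
    dt = T / steps
    z = z0
    for _ in range(steps):                  # [0, T]: ordinary forward pass
        z = z + dt * f(z)
    zT = z
    acc = torch.zeros_like(z)
    for _ in range(steps):                  # [T, 2T]: accumulate |f(z(t))| dt
        v = f(z)
        acc = acc + dt * v.abs()
        z = z + dt * v
    ss_loss = acc.flatten(1).norm(dim=1).mean()   # l2 norm, averaged over the batch
    return zT, ss_loss

# sketch of the training objective: task loss + lambda_ss * steady-state loss,
# where lambda_ss is a hyperparameter (its value is not specified here)
f = nn.Sequential(nn.Linear(8, 8), nn.Tanh())     # toy autonomous dynamics
zT, ss = tisode_forward(f, torch.randn(4, 8))
```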
4.2 Evaluating robustness of TisODE-based classifiers
Here, we conduct experiments to evaluate the robustness of our proposed TisODE and compare TisODE-based models with vanilla ODENets. We train all models with original non-perturbed images together with their Gaussian-perturbed versions. The steady-state loss is weighted by a fixed regularization coefficient. All other hyperparameters are exactly the same as those in Section 3.3.
  Gaussian noise  Adversarial attack
MNIST  σ=100  FGSM-0.3  FGSM-0.5  PGD-0.2  PGD-0.3
CNN  98.7±0.1  54.2±1.1  15.8±1.3  32.9±3.7  0.0±0.0
ODENet  99.4±0.1  71.5±1.1  19.9±1.2  64.7±1.8  13.0±0.2
TisODE  99.6±0.0  75.7±1.4  26.5±3.8  67.4±1.5  13.2±1.0

SVHN  σ=35  FGSM-5/255  FGSM-8/255  PGD-3/255  PGD-5/255
CNN  90.6±0.2  25.3±0.6  12.3±0.7  32.4±0.4  14.0±0.5
ODENet  95.1±0.1  49.4±1.0  34.7±0.5  50.9±1.3  27.2±1.4
TisODE  94.9±0.1  51.6±1.2  38.2±1.9  52.0±0.9  28.2±0.3

ImgNet10  σ=25  FGSM-5/255  FGSM-8/255  PGD-3/255  PGD-5/255
CNN  92.6±0.6  40.9±1.8  26.7±1.7  28.6±1.5  11.2±1.2
ODENet  92.6±0.5  42.0±0.4  29.0±1.0  29.8±0.4  12.3±0.6
TisODE  92.8±0.4  44.3±0.7  31.4±1.1  31.1±1.2  14.5±1.1
From the results in Table 3, we can see that our proposed TisODE-based models are clearly more robust than vanilla ODENets. On the MNIST dataset, against FGSM-0.3 attacks, the TisODE-based models outperform vanilla ODENets by more than 4 percentage points. For the FGSM-0.5 adversarial examples, the accuracy of the TisODE-based model is 6.6 percentage points better. On the SVHN dataset, the TisODE-based models perform better on all forms of adversarial examples. On the ImgNet10 dataset, the TisODE-based models also outperform vanilla ODE-based models on all types of perturbations; in the presence of FGSM and PGD-5/255 examples, the accuracies are enhanced by more than 2 percentage points.
4.3 TisODE: a generally applicable drop-in technique for improving the robustness of deep networks
In view of the excellent robustness of the TisODE, we claim that the proposed TisODE can be used as a general drop-in module for improving the robustness of deep networks. We support this claim by showing that the TisODE can work in conjunction with other state-of-the-art techniques and further boost models' robustness. These techniques include the feature denoising (FDn) method (Xie et al., 2019) and the input randomization (IRd) method (Xie et al., 2017). We conduct experiments on the MNIST and SVHN datasets. All models are trained with original non-perturbed images together with their Gaussian-perturbed versions. We show that models using the FDn or IRd technique become much more robust when equipped with the TisODE. In the FDn experiments, the dot-product non-local denoising layer (Xie et al., 2019) is added to the head of the fully-connected classifier.
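A sketch of input-randomization preprocessing in the spirit of Xie et al. (2017): randomly resize the image, then randomly zero-pad it back to a fixed size. The resizing range, padding scheme, and output size here are illustrative assumptions rather than the original configuration.

```python
import torch
import torch.nn.functional as F

def input_randomization(x, out_size=36):
    """Randomly resize a batch of images, then randomly zero-pad back to a fixed
    spatial size, destroying the specific structure of adversarial perturbations."""
    b, c, h, w = x.shape
    new = int(torch.randint(h, out_size + 1, (1,)))           # random target size
    x = F.interpolate(x, size=(new, new), mode='nearest')     # random resizing
    pad = out_size - new
    left = int(torch.randint(0, pad + 1, (1,)))               # random padding offsets
    top = int(torch.randint(0, pad + 1, (1,)))
    return F.pad(x, (left, pad - left, top, pad - top))       # zero padding
```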
  Gaussian noise  Adversarial attack
MNIST  σ=100  FGSM-0.3  FGSM-0.5  PGD-0.2  PGD-0.3
CNN  98.7±0.1  54.2±1.1  15.8±1.3  32.9±3.7  0.0±0.0
CNN-FDn  99.0±0.1  74.0±4.1  32.6±5.3  58.9±4.0  8.2±2.6
TisODE-FDn  99.4±0.0  80.6±2.3  40.4±5.7  72.6±2.4  28.2±3.6
CNN-IRd  95.3±0.9  78.1±2.2  36.7±2.1  79.6±1.9  55.5±2.9
TisODE-IRd  97.6±0.1  86.8±2.3  49.1±0.2  88.8±0.9  66.0±0.9

SVHN  σ=35  FGSM-5/255  FGSM-8/255  PGD-3/255  PGD-5/255
CNN  90.6±0.2  25.3±0.6  12.3±0.7  32.4±0.4  14.0±0.5
CNN-FDn  92.4±0.1  43.8±1.4  31.5±3.0  40.0±2.6  19.6±3.4
TisODE-FDn  95.2±0.1  57.8±1.7  48.2±2.0  53.4±2.9  32.3±1.0
CNN-IRd  84.9±1.2  65.8±0.4  54.7±1.2  74.0±0.5  64.5±0.8
TisODE-IRd  91.7±0.5  74.4±1.2  61.9±1.8  81.6±0.8  71.0±0.5
From Table 4, we observe that both FDn and IRd effectively improve the adversarial robustness of vanilla CNN models (CNN-FDn, CNN-IRd). Furthermore, combining our proposed TisODE with FDn or IRd (TisODE-FDn, TisODE-IRd) significantly enhances the adversarial robustness of the resultant models. For example, on the MNIST dataset, the additional use of our TisODE increases the accuracy on PGD-0.3 examples by at least 10 percentage points for both FDn (8.2% to 28.2%) and IRd (55.5% to 66.0%). Note that, on both the MNIST and SVHN datasets, the IRd technique improves robustness against adversarial examples but degrades performance on random Gaussian noise. With the help of the TisODE, this degradation in robustness against random Gaussian noise can be effectively ameliorated.
5 Related works
In this section, we briefly review related works on neural ODEs and on improving the robustness of deep neural networks.
Neural ODE: The neural ODE (Chen et al., 2018) models the input and output as two states of a continuous-time dynamical system, approximating the dynamics of this system with trainable layers. Before the proposal of the neural ODE, the idea of modeling nonlinear mappings using continuous-time dynamical systems was proposed in Weinan (2017). Lu et al. (2017) also showed that several popular network architectures can be interpreted as discretizations of a continuous-time ODE. For example, ResNet (He et al., 2016) and PolyNet (Zhang et al., 2017) are associated with the Euler scheme, and FractalNet (Larsson et al., 2016) is related to the Runge-Kutta scheme. In contrast to these discretized models, neural ODEs are endowed with an intrinsic invertibility property, which yields a family of invertible models for solving inverse problems (Ardizzone et al., 2018), such as the FFJORD (Grathwohl et al., 2018).
Recently, many researchers have studied neural ODEs from the perspectives of optimization techniques, approximation capabilities, and generalization. Concerning the optimization of neural ODEs, auto-differentiation can effectively train ODENets, but the training procedure is computationally and memory inefficient. To address this problem, Chen et al. (2018) proposed to compute gradients using the adjoint sensitivity method (Pontryagin, 2018), which does not require storing any intermediate quantities of the forward pass. Quaglino et al. (2019) proposed the SNet, which accelerates neural ODEs by expressing their dynamics as truncated series of Legendre polynomials. Concerning approximation capability, Dupont et al. (2019) pointed out limitations in the approximation capabilities of neural ODEs that arise because they preserve the topology of the input space, and proposed an augmented neural ODE which increases the dimension of the state by concatenating zeros, so that complex mappings can be learned with simple flows. The work most relevant to ours concerns strategies to improve the generalization of neural ODEs. Liu et al. (2019) proposed the neural stochastic differential equation (SDE), which injects random noise into the dynamics function, and showed that the generalization and robustness of vanilla neural ODEs can thus be improved. Our improvement of neural ODEs is explored from a different perspective, namely by introducing constraints on the flow. We empirically found that our proposal and the neural SDE can work in tandem to further boost the robustness of neural ODEs.
Robust Improvement: A straightforward way of improving the robustness of a model is to smooth the loss surface by controlling the spectral norm of the Jacobian matrix of the loss function (Sokolić et al., 2017). In terms of adversarial examples (Carlini and Wagner, 2017; Chen et al., 2017), researchers have proposed adversarial training strategies (Madry et al., 2017; Elsayed et al., 2018; Tramèr et al., 2017) in which the model is fine-tuned with adversarial examples generated in real time. However, generating adversarial examples is not computationally efficient, and there exists a trade-off between adversarial robustness and performance on original non-perturbed images (Yan et al., 2018; Tsipras et al., 2018). Some works also propose defense mechanisms against adversarial examples by using obfuscated gradients. For example, Xie et al. (2017) utilized random resizing and random padding to destroy the specific structure of adversarial perturbations. In Xie et al. (2019), the authors designed a feature denoising filter that can remove the perturbation's pattern from feature maps. In this work, we explore novel architectures with better intrinsic robustness. We show that the proposed TisODE can improve the robustness of deep networks and can also work in tandem with these state-of-the-art methods (Xie et al., 2017; 2019) to achieve further improvements.
6 Conclusion
In this paper, we first empirically study the robustness of neural ODEs. Our studies reveal that neural ODE-based models are more robust than comparable CNN models. We then explore how to further boost the robustness of vanilla neural ODEs and propose the TisODE. Finally, we show that the proposed TisODE outperforms the vanilla neural ODE and can also work in conjunction with other state-of-the-art techniques to further improve the robustness of deep networks. Thus, the TisODE method is an effective drop-in module for building robust deep models.
References
Ardizzone, L., et al. (2018). Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730.

Carlini, N., and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.

Chen, P.-Y., et al. (2017). ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26.

Chen, R. T. Q., et al. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.

Deng, J., et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

Dupont, E., et al. (2019). Augmented neural ODEs. arXiv preprint arXiv:1904.01681.

Elsayed, G., et al. (2018). Adversarial examples that fool both human and computer vision. arXiv preprint arXiv:1802.08195.

Goodfellow, I. J., et al. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Grathwohl, W., et al. (2018). FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.

He, K., et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

Howard, R. (1998). The Gronwall inequality. Lecture notes.

Larsson, G., et al. (2016). FractalNet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.

LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.

Liu, X., et al. (2019). Neural SDE: stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355.

Lu, Y., et al. (2017). Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121.

Madry, A., et al. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Netzer, Y., et al. (2011). Reading digits in natural images with unsupervised feature learning.

Paszke, A., et al. (2017). Automatic differentiation in PyTorch.

Pontryagin, L. S. (2018). Mathematical Theory of Optimal Processes. Routledge.

Quaglino, A., et al. (2019). Accelerating neural ODEs with spectral elements. arXiv preprint arXiv:1906.07038.

Sokolić, J., et al. (2017). Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65(16), pp. 4265–4280.

Szegedy, C., et al. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Tramèr, F., et al. (2017). Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.

Tsipras, D., et al. (2018). Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152.

Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5(1), pp. 1–11.

Xie, C., et al. (2017). Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.

Xie, C., et al. (2019). Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509.

Yan, Z., et al. (2018). Deep Defense: training DNNs with improved adversarial robustness. In Advances in Neural Information Processing Systems, pp. 419–428.

Younes, L. (2010). Shapes and Diffeomorphisms. Vol. 171, Springer.

Zhang, X., et al. (2017). PolyNet: a pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 718–726.
7 Appendix
7.1 Networks used on the MNIST, the SVHN, and the ImgNet10 datasets
MNIST  Repetition  Layer 
FE  1  Conv(1, 64, 3, 1) + GroupNorm + ReLU 
1  Conv(64, 64, 4, 2) + GroupNorm + ReLU  
RM  2  Conv(64, 64, 3, 1) + GroupNorm + ReLU 
FCC  1  AdaptiveAvgPool2d + Linear(64,10) 
SVHN  Repetition  Layer 
FE  1  Conv(3, 64, 3, 1) + GroupNorm + ReLU 
1  Conv(64, 64, 4, 2) + GroupNorm + ReLU  
RM  2  Conv(64, 64, 3, 1) + GroupNorm + ReLU 
FCC  1  AdaptiveAvgPool2d + Linear(64,10) 
ImgNet10  Repetition  Layer 
FE  1  Conv(3, 32, 5, 2) + GroupNorm 
1  MaxPooling(2)  
1  BasicBlock(32, 64, 2)  
1  MaxPooling(2)  
RM  3  BasicBlock(64, 64, 1) 
FCC  1  AdaptiveAvgPool2d + Linear(64,10) 
In Table 5, the four arguments of a Conv layer represent the input channels, output channels, kernel size, and stride. The two arguments of a Linear layer represent the input and output dimensions of the fully-connected layer. In the network for ImgNet10, BasicBlock refers to the standard residual block architecture in He et al. (2016); its three arguments represent the input channels, output channels, and the stride of the Conv layers inside the block. Note that we replace the BatchNorm layers in the BasicBlocks with GroupNorm to guarantee that the dynamics of each datum is independent of the other data in the same mini-batch.

7.2 The construction of the ImgNet10 dataset
Class  Indexing  

dog  n02090721,  n02091032,  n02088094 
bird  n01532829,  n01558993,  n01534433 
car  n02814533,  n03930630,  n03100240 
fish  n01484850,  n01491361,  n01494475 
monkey  n02483708,  n02484975,  n02486261 
turtle  n01664065,  n01665541,  n01667114 
lizard  n01677366,  n01682714,  n01685808 
bridge  n03933933,  n04366367,  n04311004 
cow  n02403003,  n02408429,  n02410509 
crab  n01980166,  n01978455,  n01981276 
7.3 Gronwall’s Inequality
We formally state Gronwall's inequality here, following the version in Howard (1998).
Theorem 2.
Let U ⊂ ℝ^d be an open set. Let f: U × [0, T] → ℝ^d be a continuous function, and let z_1, z_2: [0, T] → U satisfy the initial value problems:

  dz_1(t)/dt = f(z_1(t), t),  z_1(0) = x_1,
  dz_2(t)/dt = f(z_2(t), t),  z_2(0) = x_2.

Assume there is a constant C ≥ 0 such that, for all t ∈ [0, T],

  ||f(z_2(t), t) − f(z_1(t), t)|| ≤ C ||z_2(t) − z_1(t)||.

Then, for any t ∈ [0, T],

  ||z_2(t) − z_1(t)|| ≤ e^{Ct} ||x_2 − x_1||.